
Towards a Query Optimizer for Text-Centric Tasks






Presentation Transcript


  1. Towards a Query Optimizer for Text-Centric Tasks. Panos Ipeirotis, New York University. Joint work with Luis Gravano, Eugene Agichtein, and Pranay Jain.

  2. Text-Centric Task I: Information Extraction • Information extraction applications extract structured relations from unstructured text. Example: "May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…" An information extraction system (e.g., NYU's Proteus) turns such articles into a table of disease outbreaks in The New York Times.

  3. Text-Centric Task II: Metasearching • Metasearchers create content summaries of databases (words + frequencies) to direct queries appropriately. Example: "Friday June 16, NEW YORK (Forbes) - Starbucks Corp. may be next on the target list of CSPI, a consumer-health group that this week sued the operator of the KFC restaurant chain…" A content summary extractor turns such pages into a content summary of Forbes.com.

  4. Text-Centric Task III: Focused Resource Discovery • Identify web pages about a given topic (multiple techniques proposed: simple classifiers, focused crawlers, focused querying, …). Example: a web page classifier that identifies web pages about Botany.

  5. An Abstract View of Text-Centric Tasks (used for the rest of the talk): an extraction system (1) retrieves documents from a text database, (2) processes the documents, and (3) extracts output tokens.

  6. Executing a Text-Centric Task. Two major execution paradigms (similar to the relational world): • Scan-based: retrieve and process documents sequentially. • Index-based: query the database (e.g., [case fatality rate]), then retrieve and process the documents in the results. → The underlying data distribution dictates which is best. Unlike the relational world: • Indexes are only "approximate": the index is on keywords, not on the tokens of interest. • The choice of execution plan affects output completeness (not only speed).

  7. Execution Plan Characteristics. Execution plans have two main characteristics: • Execution time • Recall (fraction of tokens retrieved). Question: How do we choose the fastest execution plan for reaching a target recall? For example: "What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?"

  8. Outline • Description and analysis of crawl- and query-based plans: Scan and Filtered Scan (crawl-based); Iterative Set Expansion and Automatic Query Generation (query-based / index-based) • Optimization strategy • Experimental results and conclusions

  9. Scan • Scan retrieves and processes documents sequentially (until reaching the target recall). Execution time = |Retrieved Docs| · (R + P), where R is the time for retrieving a document and P the time for processing a document. Question: How many documents does Scan retrieve to reach the target recall?

  10. Estimating Recall of Scan. Modeling Scan for a token t (e.g., <SARS, China>): • What is the probability of seeing t (with frequency g(t)) after retrieving S documents? • This is a "sampling without replacement" process. • After retrieving S documents, the frequency of token t follows a hypergeometric distribution. • Recall for token t is the probability that the frequency of t in the S retrieved documents is greater than 0.
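The per-token computation can be written down directly; a minimal sketch in Python, assuming SciPy's hypergeometric distribution (the function name and arguments are illustrative, not from the talk):

```python
from scipy.stats import hypergeom

def scan_token_recall(db_size: int, token_freq: int, sample_size: int) -> float:
    """Probability that a token with frequency g(t) = token_freq appears at
    least once among sample_size documents drawn without replacement from a
    database of db_size documents."""
    # hypergeom(M, n, N): M = population size, n = successes in population,
    # N = number of draws
    rv = hypergeom(M=db_size, n=token_freq, N=sample_size)
    return 1.0 - rv.pmf(0)  # 1 - P(token never appears in the sample)
```

For instance, with a 182,531-document archive, a 10% sample, and g(t) = 10, this gives roughly 1 − 0.9^10 ≈ 0.65 for that token.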

  11. Estimating Recall of Scan. Modeling Scan over all tokens (e.g., <SARS, China>, <Ebola, Zaire>): • Multiple "sampling without replacement" processes, one for each token. • Overall recall is the average recall across tokens. → We can compute the number of documents required to reach the target recall; execution time = |Retrieved Docs| · (R + P).
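Building on the same model, overall recall and the number of documents Scan needs can be sketched as follows (again illustrative Python assuming SciPy; the binary search exploits the fact that recall grows monotonically with S):

```python
from scipy.stats import hypergeom

def scan_recall(db_size, token_freqs, sample_size):
    """Average per-token recall after retrieving sample_size documents."""
    probs = [1.0 - hypergeom(M=db_size, n=g, N=sample_size).pmf(0)
             for g in token_freqs]
    return sum(probs) / len(probs)

def docs_needed_by_scan(db_size, token_freqs, target_recall):
    """Smallest S whose predicted recall reaches the target
    (recall is monotone in S, so binary search applies)."""
    lo, hi = 0, db_size
    while lo < hi:
        mid = (lo + hi) // 2
        if scan_recall(db_size, token_freqs, mid) >= target_recall:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

Execution time then follows from the slide's formula: docs_needed · (R + P).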

  12. Scan vs. Filtered Scan • Scan retrieves and processes all documents (until reaching the target recall). • Filtered Scan uses a classifier (with selectivity σ ≤ 1) to identify and process only promising documents (e.g., the Sports section of NYT is unlikely to describe disease outbreaks). Execution time = |Retrieved Docs| · (R + F + P), where R is the time for retrieving a document, F the time for filtering (classifying) a document, and P the time for processing a document. Question: How many documents does (Filtered) Scan retrieve to reach the target recall?

  13. Estimating Recall of Filtered Scan. Modeling Filtered Scan: • Analysis similar to Scan. • Main difference: the classifier rejects documents, which • decreases the effective database size from |D| to σ·|D| (σ: classifier selectivity), and • decreases the effective token frequency from g(t) to r·g(t) (r: classifier recall).
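A sketch of the Filtered Scan adjustment under the same hypergeometric model (σ and r would come from measuring the classifier; names are illustrative):

```python
from scipy.stats import hypergeom

def filtered_scan_recall(db_size, token_freqs, sample_size,
                         selectivity, classifier_recall):
    """Filtered Scan: same hypergeometric model as Scan, but over the
    'effective' database (sigma * |D|) and frequencies (r * g(t))."""
    eff_db = max(1, int(selectivity * db_size))      # |D| -> sigma * |D|
    sample = min(sample_size, eff_db)
    probs = []
    for g in token_freqs:
        eff_g = int(round(classifier_recall * g))    # g(t) -> r * g(t)
        probs.append(1.0 - hypergeom(M=eff_db, n=eff_g, N=sample).pmf(0))
    return sum(probs) / len(probs)
```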

  14. Outline • Description and analysis of crawl- and query-based plans: Scan and Filtered Scan (crawl-based); Iterative Set Expansion and Automatic Query Generation (query-based) • Optimization strategy • Experimental results and conclusions

  15. Iterative Set Expansion. The cycle: query the database with the seed tokens (e.g., [Ebola AND Zaire]), process the retrieved documents, extract tokens from them, and augment the seed set with the new tokens (e.g., <Malaria, Ethiopia>). Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time for retrieving a document, P the time for processing a document, and Q the time for answering a query. Question: How many queries and how many documents does Iterative Set Expansion need to reach the target recall?

  16. Querying Graph • The querying graph is a bipartite graph containing tokens and documents. • Each token (transformed to a keyword query) retrieves documents. • Documents contain tokens. Example tokens: t1 = <SARS, China>, t2 = <Ebola, Zaire>, t3 = <Malaria, Ethiopia>, t4 = <Cholera, Sudan>, t5 = <H5N1, Vietnam>; documents d1–d5.

  17. Using the Querying Graph for Analysis. We need to compute: • the number of documents retrieved after sending Q tokens as queries (estimates time), and • the number of tokens that appear in the retrieved documents (estimates recall). To estimate these we need: • the degree distribution of the tokens discovered by retrieving documents, and • the degree distribution of the documents retrieved by the tokens. (These are not the same as the degree distribution of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees.) An elegant analysis framework based on generating functions is given in the SIGMOD 2006 paper.
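A toy sketch of the querying graph and the two degree distributions, using networkx with made-up edges that mirror the figure:

```python
import networkx as nx

# Bipartite querying graph: an edge (token, doc) means the token, sent as a
# keyword query, retrieves the document; read the other way, it means the
# document contains the token.
G = nx.Graph()
tokens = ["<SARS, China>", "<Ebola, Zaire>", "<Malaria, Ethiopia>"]
docs = ["d1", "d2", "d3"]
G.add_nodes_from(tokens, kind="token")
G.add_nodes_from(docs, kind="doc")
G.add_edges_from([("<SARS, China>", "d1"),
                  ("<Ebola, Zaire>", "d1"), ("<Ebola, Zaire>", "d2"),
                  ("<Malaria, Ethiopia>", "d3")])

token_degrees = [G.degree(t) for t in tokens]  # docs each token query retrieves
doc_degrees = [G.degree(d) for d in docs]      # tokens each document contains
```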

  18. Recall Limit: Reachability Graph. The reachability graph is a directed graph over tokens: t1 retrieves document d1, which contains t2, so there is an edge from t1 to t2. Upper recall limit: determined by the size of the biggest connected component.
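Once the reachability graph is known, the recall ceiling is a connected-component computation; a sketch with networkx over hypothetical edges (using weakly connected components as a simplification of the bound):

```python
import networkx as nx

# Reachability graph on tokens: an edge t1 -> t2 means some document retrieved
# by querying with t1 contains t2 (edges here are hypothetical).
R = nx.DiGraph()
R.add_edges_from([("t1", "t2"), ("t2", "t3"), ("t3", "t1"), ("t4", "t5")])

# Iterative Set Expansion can only reach tokens connected to its seeds, so the
# largest connected component gives an upper bound on its recall.
largest = max(nx.weakly_connected_components(R), key=len)
recall_upper_bound = len(largest) / R.number_of_nodes()  # 3/5 in this toy graph
```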

  19. Automatic Query Generation. Iterative Set Expansion has a recall limitation due to the iterative nature of its query generation. Automatic Query Generation avoids this problem by creating queries offline (using machine learning) that are designed to return documents with tokens.

  20. Automatic Query Generation. Offline query generation produces queries that tend to retrieve documents with tokens; the extraction system then queries the database, processes the retrieved documents, and extracts tokens from them. Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time for retrieving a document, P the time for processing a document, and Q the time for answering a query.
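For reference, the execution-time formulas from slides 9, 12, 15 and 20 collapse into a small cost model; a sketch with illustrative function names (R, P, F, Q are the per-document retrieval, processing, and filtering times and the per-query time):

```python
def scan_time(num_docs, R, P):
    """Scan: retrieve and process num_docs documents."""
    return num_docs * (R + P)

def filtered_scan_time(num_docs, R, F, P):
    """Filtered Scan: every retrieved document is also classified (time F);
    a fuller model would charge P only for documents the classifier accepts."""
    return num_docs * (R + F + P)

def query_based_time(num_docs, num_queries, R, P, Q):
    """Iterative Set Expansion and Automatic Query Generation share this form."""
    return num_docs * (R + P) + num_queries * Q
```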

  21. Estimating Recall of Automatic Query Generation • Query q retrieves g(q) documents and has precision p(q): p(q)·g(q) useful documents and (1−p(q))·g(q) useless documents. • We compute the total number of useful (and useless) documents retrieved. • The analysis is similar to Filtered Scan: the effective database size is |D_useful|, and the sample size S is the number of useful documents retrieved.
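A sketch of this bookkeeping (query result sizes g(q) and precisions p(q) are hypothetical inputs; overlap between query results is ignored):

```python
def aqg_document_counts(queries):
    """queries: list of (g_q, p_q) pairs, where g_q is the number of documents
    a query retrieves and p_q its precision. Returns the totals of useful and
    useless documents retrieved, ignoring overlap between query results."""
    useful = sum(p * g for g, p in queries)
    useless = sum((1.0 - p) * g for g, p in queries)
    return useful, useless
```

The useful total then plays the role of the sample size S in a Filtered-Scan-style recall estimate over the useful portion of the database.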

  22. Outline • Description and analysis of crawl- and query-based plans • Optimization strategy • Experimental results and conclusions

  23. Summary of Cost Analysis • Our analysis so far takes as input a target recall and gives as output the time for each plan to reach that target recall (time = infinity if a plan cannot reach the target recall). • Time and recall depend on task-specific properties of the database: the token degree distribution and the document degree distribution. • Next, we show how to estimate the degree distributions on the fly.

  24. Estimating Cost Model Parameters. Token and document degree distributions belong to known distribution families, so we can characterize the distributions with only a few parameters!
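For illustration, assuming a power-law family (a common choice for such degree distributions; the slide does not commit to a specific family), the single parameter can be fitted by maximum likelihood:

```python
import math

def fit_power_law_exponent(degrees, x_min=1):
    """Maximum-likelihood estimate of alpha for a power-law degree
    distribution p(x) ~ x^(-alpha), x >= x_min (standard continuous
    approximation for discrete data)."""
    xs = [x for x in degrees if x >= x_min]
    return 1.0 + len(xs) / sum(math.log(x / (x_min - 0.5)) for x in xs)
```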

  25. Parameter Estimation • Naïve solution: a separate "parameter-estimation" phase that performs random sampling on the database and stops when cross-validation indicates high confidence. • We can do better than this: there is no need for a separate sampling phase, because sampling is equivalent to executing the task → piggyback parameter estimation onto execution.

  26. On-the-fly Parameter Estimation (figure: an initial default estimate converging toward the correct but unknown distribution through successive updates) • Pick the most promising execution plan for the target recall assuming "default" parameter values. • Start executing the task. • Update the parameter estimates during execution using MLE. • Switch plans if the updated statistics indicate so. Important: only Scan acts as "random sampling"; all other execution plans need parameter adjustment. (ACM TODS, December 2007)
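A high-level sketch of this loop (the plan objects, the time estimator, and the update routine are placeholders, not the paper's actual interfaces):

```python
def run_with_optimizer(plans, estimate_time, update_estimates,
                       target_recall, initial_params):
    """Pick the plan predicted fastest under the current parameter estimates,
    execute it for a while, refine the estimates from what was retrieved
    (e.g., by MLE), and switch plans if the prediction changes."""
    params = initial_params
    achieved = 0.0
    current = None
    while achieved < target_recall:
        current = min(plans, key=lambda p: estimate_time(p, params, target_recall))
        batch = current.execute_some()        # retrieve/process a few documents
        achieved = current.recall_so_far()
        # Non-Scan plans are biased samples, so the update must adjust for that.
        params = update_estimates(params, batch)
    return current
```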

  27. Outline • Description and analysis of crawl- and query-based plans • Optimization strategy • Experimental results and conclusions

  28. Correctness of Theoretical Analysis • Solid lines: actual time • Dotted lines: predicted time with correct parameters. Task: Disease Outbreaks; Snowball IE system; 182,531 documents from NYT; 16,921 tokens.

  29. Experimental Results (Information Extraction) • Solid lines: Actual time • Green line: Time with optimizer (results similar in other experiments – see ACM TODS paper)

  30. Conclusions • Common execution plans for multiple text-centric tasks • Analytic models for predicting execution time and recall of various crawl- and query-based plans • Techniques for on-the-fly parameter estimation • Optimization framework picks on the fly the fastest plan for the target recall

  31. Future Work • Incorporate precision and recall of extraction system in framework using ROC curves • Create non-parametric optimization (i.e., no assumption about distribution families) • Examine other text-centric tasks and analyze new execution plans

  32. Thank you! ありがとう (thank you)

  33. Overflow Slides

  34. Experimental Results (IE, Headquarters). Task: Company Headquarters; Snowball IE system; 182,531 documents from NYT; 16,921 tokens.

  35. Experimental Results (Content Summaries). Task: Content Summary Extraction; 19,997 documents from 20newsgroups; 120,024 tokens.

  36. Experimental Results (Content Summaries) • ISE is a cheap plan for low target recall, but becomes the most expensive plan for high target recall. Task: Content Summary Extraction; 19,997 documents from 20newsgroups; 120,024 tokens.

  37. Experimental Results (Content Summaries) • The optimizer underestimated the recall of AQG and switched to ISE. Task: Content Summary Extraction; 19,997 documents from 20newsgroups; 120,024 tokens.

  38. Experimental Results (Information Extraction) • OPTIMIZED is faster than the "best" single plan: it overestimated Filtered Scan's recall, but after Filtered Scan ran to completion, OPTIMIZED simply switched to Scan.

  39. Focused Resource Discovery. Task: Focused Resource Discovery; 800,000 web pages; 12,000 tokens.
