
To search or to crawl?: Towards a query optimizer for text-centric tasks


Presentation Transcript


  1. To search or to crawl?: Towards a query optimizer for text-centric tasks Presented by Avinash S Bharadwaj

  2. How can data be extracted from the web? • Execution plans for text-centric tasks follow two general paradigms for processing a text database: • The entire web can be crawled or scanned for the text automatically • Search engine indexes can be used to retrieve the documents of interest using carefully constructed queries depending on the task.

  3. Introduction
  • Text is ubiquitous, and many applications rely on the text in web pages to perform a variety of tasks.
  • Examples of text-centric tasks:
  • Reputation management systems download web pages to track the buzz around companies.
  • Comparative shopping agents locate e-commerce web sites and add the products offered on those pages to their own index.

  4. Examples of text-centric tasks
  • The paper considers three main types of text-centric tasks:
  • Task 1: Information Extraction
  • Task 2: Content Summary Construction
  • Task 3: Focused Resource Discovery

  5. Task 1: Information Extraction
  • Information extraction applications extract structured relations from unstructured text.
  • Example: an information extraction system (e.g., NYU’s Proteus) processes the New York Times snippet “May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…” and populates a “Disease Outbreaks in The New York Times” relation.
  • (An information extraction tutorial was given the day before by AnHai Doan, Raghu Ramakrishnan, and Shivakumar Vaithyanathan.)

  6. Task 2: Content Summary Construction
  • Many text databases have valuable contents “hidden” behind search interfaces.
  • Metasearchers search over multiple such databases through a unified query interface.
  • To do so, they must generate a content summary for each database.

  7. Task 3: Focused Resource Discovery
  • This task builds applications around pages on a particular topic.
  • The simplest approach is to crawl the entire web and classify each page accordingly.
  • A much more efficient approach is to use a focused crawler.
  • Focused crawlers follow only documents and hyperlinks that are on-topic, or likely to lead to on-topic documents, as determined by a number of heuristics.

  8. An Abstract View of Text-Centric Tasks
  • A task runs an extraction system over a text database in three steps:
  • Retrieve documents from the database
  • Process the retrieved documents
  • Extract the output tokens
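The abstract loop above can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation; the `retrieve` and `extract_tokens` callables are hypothetical stand-ins for the retrieval strategy and extraction system.

```python
def run_task(database, target_recall, total_tokens, retrieve, extract_tokens):
    """Retrieve documents, process each, and collect extracted tokens
    until the target recall is reached."""
    tokens = set()
    for doc in retrieve(database):          # retrieval order depends on the plan
        tokens |= set(extract_tokens(doc))  # process document, extract tokens
        if len(tokens) >= target_recall * total_tokens:
            break                           # stop once target recall is met
    return tokens
```

The four execution strategies that follow differ only in how `retrieve` produces documents: by scanning, by filtered scanning, or by querying.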

  9. Execution Strategies
  • The paper describes four execution strategies:
  • Crawl-based: Scan, Filtered Scan
  • Query- (index-) based: Iterative Set Expansion, Automatic Query Generation

  10. Execution Strategy: Scan
  • Scan processes each document in the database exhaustively until the number of extracted tokens satisfies the target recall.
  • Scan needs no training and sends no queries to the database.
  • Execution time = |Retrieved Docs| · (tR + tP), where tR is the time to retrieve a document and tP the time to process it.
  • Prioritizing the order in which documents are retrieved may improve efficiency.

  11. Execution Strategy: Filtered Scan
  • Filtered Scan improves on the basic Scan strategy.
  • Unlike Scan, Filtered Scan uses a task-specific classifier to check whether a document is likely to contribute at least one token before processing it.
  • Execution time = |Retrieved Docs| · (tR + tC + tP), where tR is the time to retrieve a document, tC the time to classify it, and tP the time to process it.
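The two crawl-based cost formulas can be sketched directly. The `sigma` parameter is an assumption not on the slides: it models the fraction of documents the classifier accepts, since only those incur the processing cost.

```python
def scan_time(n_docs, t_r, t_p):
    """Scan: execution time = |Retrieved Docs| * (tR + tP)."""
    return n_docs * (t_r + t_p)

def filtered_scan_time(n_docs, t_r, t_c, t_p, sigma=1.0):
    """Filtered Scan per the slide: |Retrieved Docs| * (tR + tC + tP).
    sigma (an assumption, not on the slide) is the fraction of documents
    the classifier accepts; only those pay the processing cost tP."""
    return n_docs * (t_r + t_c + sigma * t_p)
```

With sigma < 1, Filtered Scan trades a small classification cost per document for skipping the (usually much larger) processing cost on useless documents, which is what makes it pay off.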

  12. Execution Strategy: Iterative Set Expansion
  • Query the database with seed tokens (e.g., the query [Ebola AND Zaire] for the token <Ebola, Zaire>)
  • Process the retrieved documents with the extraction system
  • Extract tokens from the documents (e.g., <Malaria, Ethiopia>)
  • Augment the seed tokens with the new tokens and repeat

  13. Execution Strategy: Iterative Set Expansion contd…
  • Iterative Set Expansion has been successfully applied in many tasks.
  • Execution time = |Retrieved Docs| · (tR + tP) + |Queries| · tQ, where tR is the time to retrieve a document, tP the time to process it, and tQ the time to answer a query.
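As a sketch, the Iterative Set Expansion cost formula adds a querying term to the Scan-style retrieval-and-processing cost (the numeric values below are made up for illustration):

```python
def ise_time(n_docs, t_r, t_p, n_queries, t_q):
    """Iterative Set Expansion:
    execution time = |Retrieved Docs| * (tR + tP) + |Queries| * tQ."""
    return n_docs * (t_r + t_p) + n_queries * t_q

# e.g., 100 docs at 0.5s retrieval + 1.5s processing, 20 queries at 0.25s each
```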

  14. Execution Strategy: Automatic Query Generation
  • Iterative Set Expansion has a recall limitation due to the iterative nature of its query generation.
  • Automatic Query Generation avoids this problem by creating queries offline (using machine learning) that are designed to return documents with tokens.
  • Automatic Query Generation works in two stages:
  • In the first stage, it trains a classifier to categorize documents as useful or not for the task, and derives queries from that classifier.
  • In the execution stage, it searches the database using the queries that are expected to retrieve useful documents.
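A toy sketch of the two stages. The simple log-odds term score below stands in for the document classifier the strategy actually learns (an assumption for illustration), and the `search` callable is a hypothetical keyword-search interface:

```python
from collections import Counter
from math import log

def train_query_terms(useful_docs, useless_docs, k=3):
    """Stage 1 (offline): score terms by how strongly they indicate
    useful documents, and keep the top k as single-term queries.
    Log-odds scoring is a stand-in for the learned classifier."""
    pos = Counter(w for d in useful_docs for w in set(d.split()))
    neg = Counter(w for d in useless_docs for w in set(d.split()))
    score = {w: log((pos[w] + 1) / (neg[w] + 1)) for w in pos}
    return sorted(score, key=score.get, reverse=True)[:k]

def run_queries(terms, search):
    """Stage 2 (execution): send each learned query, pool the results."""
    docs = set()
    for t in terms:
        docs |= set(search(t))
    return docs
```

Because the queries are fixed offline, recall does not depend on which tokens happen to be discovered first, unlike Iterative Set Expansion.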

  15. Estimating Execution Plan Costs: Scan
  • Modeling Scan for a token t (e.g., <SARS, China>):
  • What is the probability of seeing t (which occurs in g(t) documents) after retrieving S documents?
  • Scan is a “sampling without replacement” process.
  • After retrieving S documents, the frequency of token t in the sample follows a hypergeometric distribution.
  • Recall for token t is the probability that its frequency in the S retrieved documents is greater than 0.
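The "sampling without replacement" description gives a closed form for this probability: under the hypergeometric distribution, the chance that t never appears in the S sampled documents is C(D - g(t), S) / C(D, S), so per-token recall is its complement. A sketch (D, the database size, is introduced here for the formula):

```python
from math import comb

def scan_recall(g_t, S, D):
    """Probability that a token occurring in g_t of D documents
    appears at least once in a random sample of S documents:
    1 - C(D - g_t, S) / C(D, S) (hypergeometric P(frequency > 0))."""
    return 1 - comb(D - g_t, S) / comb(D, S)
```

For example, a token occurring in one of ten documents has recall 0.1 after one retrieval, and any token is certain to be seen once S exceeds D - g(t).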

  16. Estimating Execution Plan Costs: Iterative Set Expansion
  • The querying graph is a bipartite graph containing tokens and documents.
  • Each token (transformed into a keyword query) retrieves documents, and documents contain tokens.
  • (The slide depicts example tokens t1 = <SARS, China>, t2 = <Ebola, Zaire>, t3 = <Malaria, Ethiopia>, t4 = <Cholera, Sudan>, t5 = <H5N1, Vietnam> linked to documents d1 to d5.)

  17. Estimating Execution Plan Costs: Iterative Set Expansion contd…
  • We need to compute:
  • the number of documents retrieved after sending Q tokens as queries (which estimates execution time), and
  • the number of tokens that appear in the retrieved documents (which estimates recall).
  • To estimate these, we need to compute:
  • the degree distribution of the tokens discovered by retrieving documents, and
  • the degree distribution of the documents retrieved by the tokens.
  • (These are not the same as the degree distributions of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees.)
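A toy simulation of expansion over such a querying graph; the edges below are made up for illustration, since the slide's figure is not recoverable from the transcript:

```python
doc_tokens = {            # document -> tokens it contains (hypothetical edges)
    "d1": {"t1"}, "d2": {"t1", "t2"},
    "d3": {"t3", "t4"}, "d4": {"t4"}, "d5": {"t5"},
}
token_docs = {}           # invert: token -> documents its query retrieves
for d, ts in doc_tokens.items():
    for t in ts:
        token_docs.setdefault(t, set()).add(d)

def reachable_tokens(seed):
    """Tokens discoverable from a seed by alternating query and extract."""
    seen, frontier = {seed}, [seed]
    while frontier:
        t = frontier.pop()
        for d in token_docs.get(t, ()):      # query the database with token t
            for t2 in doc_tokens[d] - seen:  # extract new tokens from doc d
                seen.add(t2)
                frontier.append(t2)
    return seen
```

This also makes the recall limitation from slide 14 concrete: tokens in graph components not connected to the seed (here t3, t4, and t5 when starting from t1) can never be discovered by Iterative Set Expansion.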

  18. Experimental Results

  19. Experimental Results contd…

  20. Conclusions
  • Common execution plans for multiple text-centric tasks
  • Analytic models for predicting the execution time and recall of various crawl- and query-based plans
  • Techniques for on-the-fly parameter estimation
  • An optimization framework that picks, on the fly, the fastest plan for a target recall

  21. Thank you!
