
Towards a Query Optimizer for Text-Centric Tasks


Presentation Transcript


  1. Towards a Query Optimizer for Text-Centric Tasks. Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, Luis Gravano. Presenter: Avinandan Sengupta

  2. Session Outline: Text-Centric Tasks; Methods Employed; A More Disciplined Approach; Experimental Setup; Proposed Algorithm; Results; Conclusion

  3. Scenario I: construction of a table of disease outbreaks from a newspaper archive (sample tuples shown on slide). Task 1: Information Extraction

  4. Scenario II: tabulating the number of times an organization's name appears on a particular web site. Task 2: Content Summary Construction

  5. Scenario III: discovering pages on Botany on the Internet. Task 3: Focused Resource Discovery

  6. Text-centric tasks. Types: Information Extraction, Focused Resource Discovery, Content Summary Construction

  7. Performing Text-Centric Tasks

  8. Recall in Text-Centric Tasks. [Diagram: an execution strategy retrieves documents from the corpus; recall is measured over the set of tokens that the document processor P extracts from the processed documents.]

  9. General flow: Start; retrieve documents from the corpus (Document Retrieval); optionally check each document with a Document Classifier (Relevant?); process relevant documents with the Document Processor (Token Extraction); repeat until Recall ≥ Target Recall, then Done.

  10. Execution Strategies: what are the available methods for retrieval? Crawl-based: Scan, Filtered Scan. Query-based: Iterative Set Expansion (ISE), Automatic Query Generation (AQG).

  11. Execution Time - Generic Model. [Diagram relating the execution strategy to the corpus.]

  12. Execution Time – Simplified

  13. Scan (SC): Time(SC, D) = |D_retr| · (t_R + t_P)

  14. Filtered Scan (FS). C_σ: selectivity of the classifier C, the fraction of database documents that C judges useful (computed once, offline). Time(FS, D) = |D_retr| · (t_R + t_F + C_σ · t_P)

  15. Iterative Set Expansion (ISE): Time(ISE, D) = |Q_sent| · t_Q + |D_retr| · (t_R + t_P)

  16. Automatic Query Generation (AQG): Time(AQG, D) = |Q_sent| · t_Q + |D_retr| · (t_R + t_P)
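
The four cost formulas on slides 13-16 share a simple structure; the sketch below transcribes them directly into Python. It is illustrative, not code from the paper. Parameter names follow the slides: t_R retrieval time per document, t_P processing time per document, t_F filtering time per document, t_Q time per query, C_sigma classifier selectivity.

```python
def time_scan(d_retr, t_R, t_P):
    """Scan: every retrieved document is processed."""
    return d_retr * (t_R + t_P)

def time_filtered_scan(d_retr, t_R, t_F, t_P, C_sigma):
    """Filtered Scan: every retrieved document is filtered; only the
    fraction C_sigma judged useful is processed."""
    return d_retr * (t_R + t_F + C_sigma * t_P)

def time_query_based(q_sent, d_retr, t_Q, t_R, t_P):
    """ISE and AQG share the same form: query cost plus
    retrieval/processing cost for the documents returned."""
    return q_sent * t_Q + d_retr * (t_R + t_P)
```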

  17. Which strategy to use? For text-centric tasks, a crawling- or querying-based strategy is typically selected based on heuristics and intuition.

  18. A More Disciplined Approach

  19. Can we do better? Define execution models for Scan, Filtered Scan, ISE, and AQG; estimate their costs; select the appropriate technique based on cost; revisit the technique selection as execution progresses.

  20. Formalizing the problem. Given a target recall value τ, the goal is to identify an execution strategy S among S1, ..., Sn such that: Recall(S, D) ≥ τ, and Time(S, D) ≤ Time(Sj, D) for every Sj with Recall(Sj, D) ≥ τ.
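
A minimal sketch of this selection rule, assuming the per-strategy (recall, time) estimates come from cost models such as the sketch above. The strategy names and numbers in the example are invented.

```python
def choose_strategy(estimates, tau):
    """estimates: {strategy: (estimated recall, estimated time)}.
    Return the cheapest strategy meeting the recall target, else None."""
    feasible = {s: t for s, (r, t) in estimates.items() if r >= tau}
    return min(feasible, key=feasible.get) if feasible else None

print(choose_strategy({"Scan": (1.0, 500.0),
                       "FilteredScan": (0.9, 120.0),
                       "ISE": (0.6, 40.0)}, tau=0.8))  # -> FilteredScan
```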

  21. Degrees. Degree of a document, g(d): # of distinct tokens extracted from d using P (documents yielding at least one token form D_useful; the rest form D_useless). Degree of a token, g(t): # of distinct documents in D from which P can extract t. Degree of a query, g(q): # of documents from D retrieved by query q.
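
A toy illustration of g(d) and g(t), assuming extraction results are available as a mapping from document IDs to the tokens P extracts; the documents and tuples are invented. g(q) would be computed analogously from query results.

```python
from collections import defaultdict

extractions = {
    "doc1": {"(typhus, Belize)", "(cholera, Peru)"},
    "doc2": {"(cholera, Peru)"},
    "doc3": set(),                      # a useless document: no tokens
}

g_d = {d: len(ts) for d, ts in extractions.items()}   # document degrees
g_t = defaultdict(int)                                # token degrees
for ts in extractions.values():
    for t in ts:
        g_t[t] += 1

print(g_d)        # doc1: 2, doc2: 1, doc3: 0
print(dict(g_t))  # (cholera, Peru): 2, (typhus, Belize): 1
```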

  22. Cost of Scan - 1. Time(SC, D) = |D_retr| · (t_R + t_P). SC retrieves documents in no particular order and does not retrieve the same document twice. SC thus performs multiple token-sampling processes in parallel over D, sampling from a finite population. The probability of observing a token t exactly k times in a sample of size S follows a hypergeometric distribution.

  23. Cost of Scan - 2. Probability that token t does not appear in a sample of S documents: the # of ways to select S documents from the |D| − g(t) documents in which t does not appear, divided by the # of ways to select S documents from all |D| documents, i.e. C(|D| − g(t), S) / C(|D|, S). Probability that t appears in at least one sampled document: 1 − C(|D| − g(t), S) / C(|D|, S). Expected number of tokens retrieved after processing S documents: E[|Tokens_retr|] = Σ_t [1 − C(|D| − g(t), S) / C(|D|, S)].

  24. Cost of Scan - 3. We do not know the exact g(t) for each token, but we know the form of the token degree distribution [power-law distribution]. Thus, using estimates for the probabilities Pr{g(t) = i}: E[|Tokens_retr|] = |Tokens| · Σ_{i ≥ 1} Pr{g(t) = i} · [1 − C(|D| − i, S) / C(|D|, S)]. Inverting this expression yields the estimated # of documents that must be retrieved to achieve a target recall.
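
A sketch of this expected-token computation using exact binomial-coefficient arithmetic. degree_probs stands for the estimated probabilities Pr{g(t) = i} (e.g., from a fitted power law) and is an input here, not something the snippet derives.

```python
from math import comb

def prob_token_seen(g, D, S):
    """Probability that a token appearing in g of D documents occurs in
    a uniform random sample of S documents. math.comb returns 0 when
    D - g < S, i.e. when the token is guaranteed to be sampled."""
    return 1.0 - comb(D - g, S) / comb(D, S)

def expected_tokens(num_tokens, degree_probs, D, S):
    """E[|Tokens_retr|] = |Tokens| * sum_i Pr{g(t)=i} * Pr{seen | g=i}."""
    return num_tokens * sum(p * prob_token_seen(i, D, S)
                            for i, p in degree_probs.items())
```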

  25. Cost of Filtered Scan. Two classifier parameters matter: the classifier selectivity C_σ and the classifier recall C_r, the fraction of useful documents in D that are also classified as useful by the classifier. A uniform recall is assumed across tokens, so each token appears C_r · g(t) times on average among the documents that reach the processor.

  26. Cost of Filtered Scan (continued). [Equation on slide: estimated # of documents retrieved to achieve a target recall.] When C_σ is high, almost all documents in D are processed by P, and the behavior tends towards that of Scan.
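
The Scan estimate adapts to Filtered Scan under the slide's uniform classifier-recall assumption: each token's effective degree becomes C_r · g(t). Rounding that product to an integer is a simplification made only for this sketch.

```python
from math import comb

def expected_tokens_fs(num_tokens, degree_probs, D, S, C_r):
    """Filtered Scan analogue of expected_tokens: the classifier recall
    C_r scales each token's degree before the hypergeometric argument."""
    total = 0.0
    for i, p in degree_probs.items():
        g_eff = round(C_r * i)          # effective degree C_r * g(t)
        total += p * (1.0 - comb(D - g_eff, S) / comb(D, S))
    return num_tokens * total
```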

  27. Cost of ISE - Random Graph Model. A random graph is a collection of points (vertices) with lines (edges) connecting pairs of them at random. The presence or absence of an edge between two vertices is independent of the presence or absence of any other edge, so each edge may be considered present with independent probability p.

  28. Cost of ISE - Querying Graph. A querying graph is a bipartite graph (V, E) with: V = {tokens t} ∪ {documents d}; E1 = {edges d → t such that token t can be extracted from d}; E2 = {edges t → d such that a query with t retrieves document d}; E = E1 ∪ E2.
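
A small sketch of the querying graph using plain dictionaries; the documents, tokens, and edges are invented. One ISE step follows E2 edges (querying with the known tokens) and then E1 edges (extracting tokens from the retrieved documents).

```python
contains = {                       # E1: d -> tokens extractable from d
    "doc1": {"(typhus, Belize)"},
    "doc2": {"(typhus, Belize)", "(cholera, Peru)"},
}
retrieves = {                      # E2: token-as-query -> documents
    "(typhus, Belize)": {"doc1", "doc2"},
    "(cholera, Peru)": {"doc2"},
}

known = {"(typhus, Belize)"}       # seed token
retrieved = set().union(*(retrieves[t] for t in known))
new_tokens = set().union(*(contains[d] for d in retrieved)) - known
print(retrieved)    # {'doc1', 'doc2'}
print(new_tokens)   # {'(cholera, Peru)'}
```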

  29. Cost of ISE - With Generating Functions. Degree distribution of a randomly chosen document: p_dk is the probability that a randomly chosen document d contains k tokens. Degree distribution of a randomly chosen token: p_tk is the probability that a randomly chosen token t retrieves k documents.

  30. Cost of ISE - With Generating Functions (continued). G_d1(x): the degree distribution for a document chosen by following a random edge. G_t1(x): the degree distribution for a token chosen by following a random edge.

  31. Cost of ISE - Properties of Generating Functions. [Slide lists the Power, Composition, and Moments properties used in the following slides.]

  32. Cost of ISE - Evaluation. Consider: ISE has sent a set Q of tokens as queries. These tokens were discovered by following random edges on the graph, so their degree distribution is G_t1(x). By the Power property, the distribution of the total number of retrieved documents (the documents pointed to by these tokens) is G_d2(x) = [G_t1(x)]^|Q|; this implies that |D_retr| is a random variable whose distribution is given by G_d2(x). The retrieved documents are themselves reached by following random edges, so their degree distribution is described by G_d1(x). Recall: Time(ISE, D) = |Q_sent| · t_Q + |D_retr| · (t_R + t_P).

  33. Cost of ISE - Evaluation (continued). By the Composition property, we obtain the distribution of the total number of tokens |Tokens_retr| retrieved by the D_retr documents. Using the Moments property, we obtain the expected values of |D_retr| and |Tokens_retr| after ISE sends Q queries, and from these the number of queries |Q_sent| that Iterative Set Expansion must send to reach the target recall τ.
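
A rough numeric sketch of what the Moments property yields. Following a random edge biases the choice toward high-degree nodes, so the mean of an edge-following degree distribution is E[k²]/E[k]. Multiplying means (instead of composing the full generating functions) and ignoring duplicate documents across queries are simplifications of this sketch, not the paper's derivation; the degree distributions below are invented.

```python
def mean_edge(p):
    """Mean degree of a node reached by following a random edge, given
    the raw degree distribution p = {degree k: probability}."""
    ek = sum(k * pk for k, pk in p.items())
    ek2 = sum(k * k * pk for k, pk in p.items())
    return ek2 / ek

def ise_expectations(pt, pd, num_queries):
    e_docs = num_queries * mean_edge(pt)   # E[|D_retr|], Power property
    e_tokens = e_docs * mean_edge(pd)      # E[|Tokens_retr|], Composition
    return e_docs, e_tokens

pt = {1: 0.6, 2: 0.3, 10: 0.1}   # token degree distribution (illustrative)
pd = {1: 0.5, 3: 0.4, 8: 0.1}    # document degree distribution (illustrative)
print(ise_expectations(pt, pd, num_queries=50))
```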

  34. Cost of AQG

  35. Algorithms

  36. Global Optimization

  37. Local Optimization
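
Slides 36-37 give only the algorithm names, so the following is a hedged sketch of the idea stated on slide 19 (select a technique based on cost, then revisit the selection): run the currently cheapest feasible strategy for a batch, re-estimate costs from what was observed, and reconsider. choose_strategy is the helper sketched after slide 20; estimate_costs and run_batch are placeholder hooks, not functions from the paper.

```python
def estimate_costs(strategies, tau):
    # Placeholder: real estimates come from the cost models, re-fitted
    # to the documents and tokens observed so far.
    return {s: (1.0, 100.0) for s in strategies}

def run_batch(strategy, batch_size):
    # Placeholder: execute batch_size retrievals with the chosen
    # strategy and return the recall achieved so far.
    return 1.0

def local_optimization(strategies, tau, batch_size=100):
    recall = 0.0
    while recall < tau:
        best = choose_strategy(estimate_costs(strategies, tau), tau)
        recall = run_batch(best, batch_size)  # revisit choice each batch
    return recall
```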

  38. Probability, Distributions, Parameter Estimation

  39. Scan - Parameter Estimation. This relies on the characteristics of the token and document degree distributions. After retrieving and processing a few documents, we can estimate the distribution parameters from the frequencies of the initially extracted tokens and documents; specifically, a maximum-likelihood fit estimates the parameters of the document degree distribution. For example, the document degrees for Task 1 tend to follow a power-law distribution with probability mass function Pr{g(d) = k} = k^(−β) / ζ(β), where ζ(β) is the Riemann zeta function (serving as a normalizing factor). Goal: estimate the most likely value of β for a given sample of document degrees g(d_1), ..., g(d_s), using MLE to identify the value of β that maximizes the likelihood function.

  40. Scan - Parameter Estimation (continued). To find the maximum, we estimate the value of β using numeric approximation.
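
A sketch of that numeric fit, assuming the discrete power law Pr{g = k} = k^(−β)/ζ(β) from slide 39. The log-likelihood of a sample g_1, ..., g_s is −β · Σ log g_i − s · log ζ(β), which is maximized numerically; scipy.special.zeta with second argument 1 evaluates the Riemann zeta function. The sample degrees below are invented.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import zeta

def fit_power_law(degrees):
    g = np.asarray(degrees, dtype=float)
    log_sum = np.log(g).sum()
    # Negative log-likelihood: beta * sum(log g_i) + s * log zeta(beta)
    neg_ll = lambda b: b * log_sum + len(g) * np.log(zeta(b, 1))
    # zeta diverges at beta = 1, so search strictly above it
    return minimize_scalar(neg_ll, bounds=(1.01, 10.0), method="bounded").x

print(fit_power_law([1, 1, 1, 2, 1, 3, 1, 5, 2, 1]))
```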

  41. Scan - Token Distribution Estimation. To maximize the above, we take the log, eliminate factorials via Stirling's approximation, and set the derivative to zero to find the maximum.

  42. Filtered Scan – Parameter Estimation

  43. ISE – Parameter Estimation

  44. AQG – Parameter Estimation

  45. Experimental Setting and Results

  46. Details of the Experiments • Tuple extraction from New York Times archives • Categorized word-frequency computation for Usenet newsgroups • Document retrieval on Botany from the Internet

  47. Task 1a, 1b - Information Extraction. 1a: extracting a Disease-Outbreaks relation, tuple(DiseaseName, Country). 1b: extracting a Headquarters relation, tuple(Organization, Location). Document Classifier: RIPPER. Document Processor: Snowball. Token: a single tuple of the target relation. Document: a news article from The New York Times archive. Corpus: newspaper articles from The New York Times, published in 1995 (NYT95) and 1996 (NYT96). g(d): power-law distribution. g(t): power-law distribution. NYT95 documents are used for training; NYT96 documents are used for evaluating the alternative execution strategies. NYT96 features: 182,531 documents; 16,921 tokens (Task 1a); 605 tokens (Task 1b).

  48. Task 1a, 1b - Information Extraction (continued). RIPPER is trained with a set of 500 useful and 1,500 not-useful documents from the NYT95 data set. FS: rule-based classifier (RIPPER). ISE: queries are constructed by combining the attributes of each tuple with the AND operator (e.g., tuple ⟨typhus, Belize⟩ → [typhus AND Belize]). ISE/AQG: maximum # of returned documents: 100. AQG: 2,000 documents from the NYT95 data set serve as the training set to create the queries required by Automatic Query Generation.

  49. Task 2 - Content Summary Construction. Extracting words and their frequencies from newsgroups. Document Processor: simple tokenizer. Token: a word together with its frequency. Document: a Usenet message. Corpus: the 20 Newsgroups data set from the UCI KDD Archive, containing 20,000 messages. FS: not applicable (all documents are useful). g(d): lognormal distribution. g(t): power-law distribution. ISE: queries are constructed using words that appear in previously retrieved documents. ISE/AQG: maximum # of returned documents: 100. AQG modus operandi: • Separate documents into topics based on the high-level name of the newsgroup (comp, sci) • Train a rule-based classifier using RIPPER, which creates rules to assign documents to categories • The final queries contain the antecedents of the rules, across all categories

  50. Task 3 - Focused Resource Discovery. Retrieving documents on Botany from the Internet. Document Processor: multinomial Naïve Bayes classifier. Token: the URL of a page on Botany. Document: a Web page. Corpus: 800,000 pages, 12,000 of which are relevant to Botany. g(d): lognormal distribution. g(t): power-law distribution. ISE/AQG: maximum # of returned documents: 100. AQG modus operandi: • Separate documents into topics • Train a rule-based classifier using RIPPER, which creates rules to assign documents to categories • The final queries contain the antecedents of the rules, across all categories
