
Effective Phrase Prediction



Presentation Transcript


1. Effective Phrase Prediction
VLDB 2007: Text Databases
By Arnab Nandi, H. V. Jagadish (University of Michigan)
2008-03-07, Summarized by Jaeseok Myung

2. Motivation
• Pervasiveness of autocompletion
  • Typical autocompletion is still at the word level
• Phrase prediction
  • Words provide much more information to exploit for prediction: context, phrase structures
  • Most text is predictable and repetitive in many applications
    • Email composition: Prob(“Thank you very much” | “Thank”) ≈ 1

3. Challenges
• The number of phrases is large
  • n(vocabulary) >> n(alphabet)
  • n(phrases) = O(n(vocabulary)^(phrase length))
  • => FussyTree structure
• The length of a phrase is unknown
  • A “word” has a well-defined boundary; a phrase does not
  • => Significance
• How do we evaluate a suggestion mechanism?
  • => Total Profit Metric (TPM)

4. Problem Definition
• R = query(p)
• Need a data structure that can
  • Store completions efficiently
  • Support fast querying
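
As a concrete reading of this slide, here is a minimal sketch of the interface such a data structure needs to expose; the class and method names are illustrative placeholders, not from the paper.

```python
# Minimal sketch of the interface implied by the problem definition:
# store completions efficiently, answer R = query(p) fast.
# All names here are illustrative placeholders.
from abc import ABC, abstractmethod

class CompletionIndex(ABC):
    @abstractmethod
    def add(self, phrase: list[str]) -> None:
        """Store one training phrase."""

    @abstractmethod
    def query(self, p: list[str]) -> list[list[str]]:
        """Return the ranked completions R for prefix p."""
```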

5. An n-gram Data Model
• R = query(p): for each r ∈ R, prob(p, r) is maximized
• mth-order Markov model
  • m: the number of previous states used to predict the next state
• An n-gram model is equivalent to an (n−1)th-order Markov model
[Figure: a length-5 prefix with candidate completions w7,1, w7,2, w7,3, ranked by frequency]
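
A minimal sketch of this model, assuming plain frequency counts stand in for probabilities; function names are illustrative.

```python
# A hedged sketch of the slide's n-gram model: an n-gram model is an
# (n-1)th-order Markov model, so prediction conditions on the previous
# n-1 words and ranks candidates by frequency (proportional to
# probability under the model).
from collections import Counter, defaultdict

def build_ngram_model(tokens, n=3):
    model = defaultdict(Counter)  # (n-1)-word context -> next-word counts
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model

def predict(model, context, k=3):
    # r in R maximizing prob(p, r): the most frequent continuations of p
    return [w for w, _ in model[tuple(context)].most_common(k)]

tokens = "please call me asap please call me today please call back".split()
model = build_ngram_model(tokens, n=3)
print(predict(model, ["please", "call"]))  # ['me', 'back']
```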

6. Fundamental Data Structures
• The basic data structures for “completion” problems: the trie and the suffix tree
• Phrase version: every node = a word
[Figures: a trie and a suffix tree]
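
A minimal word-level trie sketch of the “every node = word” idea; names are illustrative, not the paper's implementation.

```python
# A word-level trie: children are keyed by the next word of the phrase,
# and each node counts how many stored phrases end there.
class TrieNode:
    def __init__(self):
        self.children = {}  # word -> TrieNode
        self.count = 0      # how many stored phrases end at this node

def insert(root, phrase):
    node = root
    for word in phrase:
        node = node.children.setdefault(word, TrieNode())
    node.count += 1

root = TrieNode()
insert(root, ["please", "call", "me"])
insert(root, ["please", "call", "back"])
node = root.children["please"].children["call"]
print(sorted(node.children))  # ['back', 'me']
```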

7. Pruned Count Suffix Tree (PCST)
• Construct a frequency-based phrase tree
• Prune all nodes with frequency < threshold τ
• Problem
  • A PCST that includes infrequent phrases is constructed as an intermediate result => does not perform well for large data sets
[16] Estimating alphanumeric selectivity in the presence of wildcards
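
A hedged sketch of PCST-style pruning, assuming each node stores the occurrence count of the phrase it represents; all names are illustrative.

```python
# Every node counts the occurrences of its phrase; any node whose count
# falls below the threshold tau is removed together with its subtree.
class Node:
    def __init__(self):
        self.children = {}  # word -> Node
        self.count = 0      # occurrences of the phrase ending at this node

def add_phrase(root, words):
    node = root
    for w in words:
        node = node.children.setdefault(w, Node())
        node.count += 1  # every prefix of the phrase occurred once more

def prune(node, tau):
    node.children = {w: c for w, c in node.children.items()
                     if c.count >= tau}
    for c in node.children.values():
        prune(c, tau)
```

Note that the full tree, infrequent phrases included, exists before prune() runs; that is exactly the intermediate-result problem this slide points out.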

8. FussyTree Construction
• Filter out infrequent phrases even before adding them to the tree
• Example: training sentence size N = 2, threshold τ = 2, tokenizing window size = 4 (the size of the largest frequent phrase)
  • Tokenizing window: (please, call, me, asap)
  • Ignored phrases: (call, me, asap, -end-)
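
A hedged sketch of the pre-filtering step, using this slide's toy parameters (window = 4, τ = 2); the helper name is illustrative.

```python
# Count candidate phrases with a sliding tokenizing window first, and
# only keep phrases that reach the threshold tau, so infrequent phrases
# never enter the tree.
from collections import Counter

def frequent_phrases(tokens, window=4, tau=2):
    counts = Counter()
    for i in range(len(tokens)):
        # every phrase of 1..window words starting at position i
        for n in range(1, window + 1):
            if i + n <= len(tokens):
                counts[tuple(tokens[i:i + n])] += 1
    return {p: c for p, c in counts.items() if c >= tau}

tokens = "please call me asap please call me -end-".split()
for phrase, c in sorted(frequent_phrases(tokens).items()):
    print(phrase, c)  # e.g. ('please', 'call', 'me') 2; one-off phrases dropped
```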

9. Significance
• A node in the FussyTree is “significant” if it marks a phrase boundary
• Example: “please call”
  • “please call” (3) > “please” (0) × “call” (1)
  • “please call” (3) > ½ × “please” (0)
  • “please call” (3) > 3 × “please call me” (1)
  • …
• The checks: co-occurrence above chance, comparability with the prefix (z), and uniqueness against longer extensions (y)
• z and y are tuning parameters; assume z = 2, y = 3
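
A hedged sketch of the z and y checks above; the first condition (frequency above chance co-occurrence) is elided here since it needs word probabilities, and all names are illustrative.

```python
# A candidate phrase is kept if it is comparable in frequency to its
# prefix (within a factor z) and clearly more frequent than every
# one-word extension (by a factor y). freq is a phrase -> count mapping.
def is_significant(freq, phrase, prefix, extensions, z=2, y=3):
    f = freq.get(phrase, 0)
    comparable = f > (1.0 / z) * freq.get(prefix, 0)
    unique = all(f > y * freq.get(ext, 0) for ext in extensions)
    return comparable and unique

freq = {"please": 4, "please call": 3, "please call me": 1}
# with z = 2, y = 2 (the values used on the experiments slide):
print(is_significant(freq, "please call", "please",
                     ["please call me"], z=2, y=2))  # True
```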

10. Significance – cont.
• All leaves are significant
  • due to the END node (frequency = 0)
• Some internal nodes are significant too
• Intuitively, suggestions ending on significant nodes will be better
• No need to store counts

11. Online Significance Marking
• (Offline) significance marking requires an additional pass over the data
• Evaluation: compare against the tree generated by the FussyTree with offline significance marking
• On adding “ABCXY”: the branch point is considered for promotion, and the immediate descendant significant nodes are considered for demotion
[Figure: the tree before and after inserting “ABCXY”, with nodes A–E, X, Y]

12. Evaluation Metrics
• Precision & recall
  • Refer to the quality of the suggestions themselves
• For ranked results: [formulas not preserved in the transcript]
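
Since the slide's formula images were not preserved, the conventional baseline definitions are reproduced below for reference; this is not necessarily the paper's exact ranked formulation.

```latex
% Conventional precision/recall for a suggestion mechanism; the slide's
% own ranked-result formulas were lost in this transcript.
\[
  \mathrm{precision} = \frac{|\text{correct suggestions}|}{|\text{suggestions made}|},
  \qquad
  \mathrm{recall} = \frac{|\text{correct suggestions}|}{|\text{suggestion opportunities}|}
\]
```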

13. Total Profit Metric (TPM)
• TPM measures the effectiveness of a suggestion mechanism by counting the number of keystrokes saved by its suggestions
• d is the distraction parameter
  • TPM(0) corresponds to a user who does not mind distraction at all
  • TPM(1) is the extreme case where every suggestion (right or wrong) is considered a blocking factor that costs one keystroke
  • In practice, the distraction value is closer to 0 than to 1
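
A hedged sketch of TPM under this slide's reading; the paper's exact normalization may differ, and the numbers in the usage example are hypothetical.

```python
# Profit = keystrokes saved by accepted suggestions minus a distraction
# cost d per suggestion shown, normalized by the keystrokes needed to
# type the document without any suggestions.
def tpm(saved_per_acceptance, suggestions_shown, total_keystrokes, d=0.0):
    profit = sum(saved_per_acceptance) - d * suggestions_shown
    return profit / total_keystrokes

# Hypothetical numbers: 3 accepted suggestions saving 12, 8, and 5
# keystrokes; 20 suggestions shown while typing a 400-keystroke document.
print(tpm([12, 8, 5], 20, 400, d=0))  # TPM(0): distraction ignored
print(tpm([12, 8, 5], 20, 400, d=1))  # TPM(1): each suggestion costs 1
```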

14. Total Profit Metric – An Example
[Figure: worked example not preserved in the transcript]

15. Experiments
• Multiple corpora
  • Enron Small: one user's “sent” folder (366 emails, 250 KB)
  • Enron Large: multiple users (20,842 emails, 16 MB)
  • Wikipedia (40,000 documents, 53 MB)
• Data structures
  • (1) PCST, (2) FussyTree with counts, (3) FussyTree with significance
• Parameters
  • Significance: z (comparability) = 2, y (uniqueness) = 2
  • Training sentence size N = 8
  • Prefix size P = 2

16. Prediction Quality
[Charts not preserved in the transcript]

17. Tuning Parameters (1)
[Charts not preserved in the transcript]

18. Tuning Parameters (2)
[Charts not preserved in the transcript]

19. Conclusion
• Phrase-level autocompletion is challenging, but can provide much greater savings than word-level autocompletion
• A technique to accomplish this, based on “significance”
• New evaluation metrics for ranked autocompletion
• Possible extensions
  • Part-of-speech reranking
  • Semantic reranking using WordNet
  • Query completion for structured data (XML, …)
