
Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Hang Cui, Min-Yen Kan, Tat-Seng Chua ({cuihang, kanmy, chuats}@comp.nus.edu.sg), School of Computing, NUS, Singapore. Problem: to answer “Who is Bob Woodward” and “What is SARS” questions.


Presentation Transcript


  1. Unsupervised Learning of Soft Patterns for Generating Definitions from Online News • Hang Cui, Min-Yen Kan, Tat-Seng Chua • {cuihang, kanmy, chuats}@comp.nus.edu.sg • School of Computing, NUS, Singapore

  2. Problem • To answer “Who is Bob Woodward” and “What is SARS” questions. • Such questions make up a large portion of queries in search logs (Voorhees 2001). • Where to get definitions • Dictionaries, encyclopedias, online glossaries …… • Online news – “new terms” (e.g. Sasser) • In this paper, we • deal with recently popular terms and people. • identify definition sentences from online news. • distill search engine results to definitions.

  3. “In the News” from Google (Apr 23, 2004) • In the News: Bob Woodward, SARS, Vietnam War, Yasser Arafat, George W. Bush, Marine Corps, Gaza Strip, Kofi Annan, Mitsubishi Motors, Alan Greenspan, First Quarter, Maurice Clarett

  4. “In the News” from Google (Apr 23, 2004) • A list of relevant documents rather than a direct answer

  5. Our Solution – DefSearch • Bob Woodward • Woodward, an Office of Naval Intelligence (ONI) asset, interviewed over 75 Bush Cabal insiders. (CNN) • Woodward, who had previously endeared himself to the Bush Administration with his pandering portrait of the President in "Bush at War", has launched a blistering assault on White House credibility with his new book, "Plan of Attack". (NY Times) • People close to Mr. Powell said Sunday that they had no doubt he would weather any criticism from within over his apparent cooperation with Mr. Woodward, an assistant managing editor at The Washington Post. (CNN) • The book, called Plan of Attack, is written by Bob Woodward, the respected journalist who helped break open the Watergate scandal. The book is based on interviews with 75 people, including Bush, and is due for release Tuesday. (REUTERS) • Bob Woodward, the famous Watergate reporter has interviewed President Bush and other Whitehouse "insiders". As a result of the interview, Woodward might have done more damage to the Presidents re-election cause than anyone since Richard Clarkes interview on the same program and the recent events in Spain might be an indication as to how the world is beginning to view President Bush. (ABC News)

  6. Behind DefSearch

  7. Outline • How Do Current Systems Identify Definitions? • What are Soft Patterns? • Matching Soft Patterns • Unsupervised Learning of Soft Patterns • Evaluations • Conclusion and Future Work

  8. How Do Current Systems Identify Definitions? • Most current systems use hand-crafted patterns • Appositive • e.g. Gunter Blobel, a cellular and molecular biologist, … • Copulas • e.g. Battery is a kind of electronic device … • Predicates (relations) • e.g. TB is usually caused by … • Current work on definition sentence identification • Domain-specific definition generation systems • e.g. topic-specific definitions on the Web and biographies. • Definitional QA Task at TREC 2003

  9. Weaknesses of Current Pattern Matching Methods • Lack of Flexibility – Hard Matching • Pattern: <SCH_TERM> , also known as • Match: TB , also known as Tuberculosis , … • Mismatch: TB ( also known as Tuberculosis ) … • Variations make hard matching fail • Introduce Soft Patterns with greater flexibility • Manual labor • Introduce unsupervised learning by Group Pseudo-Relevance Feedback (GPRF).
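
The hard-matching weakness on this slide can be reproduced with a regular expression. The following is a minimal sketch, not the pattern set of any system named in the talk: the function name `hard_match` and the exact regular expression are assumptions made for illustration.

```python
import re


def hard_match(term: str, sentence: str) -> bool:
    """Return True if the sentence matches the hard pattern
    '<SCH_TERM> , also known as <word>' (a hypothetical hand-crafted rule)."""
    pattern = re.compile(re.escape(term) + r"\s*,\s*also known as\s+\w+")
    return pattern.search(sentence) is not None


# The comma variant matches; the parenthesis variant does not,
# even though both sentences define the same term.
print(hard_match("TB", "TB , also known as Tuberculosis , is ..."))
print(hard_match("TB", "TB ( also known as Tuberculosis ) is ..."))
```

A rule author would have to add a second pattern for the parenthesis, and then another for every further variation, which is the manual effort that soft patterns are meant to avoid.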

  10. Outline • How Do Current Systems Identify Definitions? • What are Soft Patterns? • Matching Soft Patterns • Unsupervised Learning of Soft Patterns • Evaluations • Conclusion and Future Work

  11. What are Soft Patterns? • Soft patterns allow partial matching • Example: TB ( also known as Tuberculosis ) … • P( ( | Slot1) = 0.001, P(also | Slot2) = 0.21, P(known | Slot3) = 0.33, P(as | Slot4) = 0.13 • P(Matching) = 0.23: still better than non-definition sentences. • How does it work? • Training – accumulating pattern instances in a vector. • Derive pattern instances from labeled definition sentences. • Matching with a probabilistic model, not regular expressions. • Using statistical information from all pattern instances, not generalized rules. • Instance-based learning.

  12. Preparing Pattern Instances • Example: The channel Iqra is owned by the Arab Radio and Television company and is the brainchild of the Saudi millionaire, Saleh Kamel. • Step 1: POS tagging and noun phrase chunking. • The_DT channel_NN Iqra_NNP is_VBZ owned_VBN by_IN NNP company_NN and_CC is_VBZ the_DT brainchild_NN of_IN NNP. • Step 2: Selective substitution – replace specific words with more general tags; other tokens remain unchanged. • DT$ NN <SEARCH_TERM> BE$ owned by DT$ NNP and BE$ DT$ NN of NNP.

  13. Preparing Pattern Instances – Cont’d • Step 3: Crop a text window around the tag “<SCH_TERM>” (window size = 3 on each side). • Pattern instance: DT$ NN <SCH_TERM> BE$ owned by
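
Steps 2 and 3 can be sketched as follows, starting from a sentence that is already POS-tagged and chunked (Step 1 is assumed to be done by an external tagger). The substitution rules in `substitute`, the `BE_FORMS` set, and the function names are assumptions chosen to reproduce the slide's example; the slides do not spell out the full rule set.

```python
from typing import List, Tuple

SEARCH_TAG = "<SCH_TERM>"
# Assumed list of copula forms that collapse to the BE$ tag.
BE_FORMS = {"am", "is", "are", "was", "were", "be", "been", "being"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def substitute(tagged: List[Tuple[str, str]], term: str) -> List[str]:
    """Step 2: replace specific words with general tags, keep other tokens."""
    out = []
    for token, tag in tagged:
        if token == term:
            out.append(SEARCH_TAG)   # the search term itself
        elif tag == "DT":
            out.append("DT$")        # determiners
        elif token.lower() in BE_FORMS:
            out.append("BE$")        # forms of "be"
        elif tag in NOUN_TAGS:
            out.append(tag)          # nouns and noun-phrase chunks
        else:
            out.append(token)        # everything else stays unchanged
    return out


def crop_window(tokens: List[str], w: int = 3) -> Tuple[List[str], List[str]]:
    """Step 3: keep up to w tokens on each side of the search-term tag."""
    i = tokens.index(SEARCH_TAG)
    return tokens[max(0, i - w):i], tokens[i + 1:i + 1 + w]


tagged = [("The", "DT"), ("channel", "NN"), ("Iqra", "NNP"), ("is", "VBZ"),
          ("owned", "VBN"), ("by", "IN"), ("the", "DT"),
          ("Arab Radio and Television company", "NNP")]
left, right = crop_window(substitute(tagged, "Iqra"), w=3)
print(left, SEARCH_TAG, right)
```

With a window of 3, the left side holds only two tokens because the sentence starts there, which gives the instance shown on the slide.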

  14. Illustration of Soft Pattern Generation • Training sentences: “…… The channel Iqra is owned by the ……”; “…… severance packages, known as golden parachutes, included ……”; “A battery is a cell which can provide electricity.” • Pattern instances: DT$ NN <Search_Term> BE$ owned by; known as <Search_Term> , VB; <Search_Term> BE$ DT$; …… • Slot layout: <Slot-2> <Slot-1> <Search_Term> <Slot1> <Slot2> • Slot probabilities shown on the slide: NN 0.12 NN 0.11 , 0.40 DT$ 0.2 known 0.09 as 0.20 BE$ 0.2 VB 0.1 DT$ 0.04 owned 0.09 • Soft pattern vector: <Slot-w, ……, Slot-2, Slot-1, SEARCH_TERM, Slot1, Slot2, ……, Slotw : Pa>
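
Accumulating pattern instances into a soft pattern vector can be sketched as counting tokens per slot and normalizing. This is an illustrative sketch under the assumption that each slot stores a relative-frequency distribution; `build_soft_pattern` is a hypothetical name, and the numbers it produces for three toy instances will not match the slide, whose probabilities come from the full training set.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Instance = Tuple[List[str], List[str]]  # (left tokens, right tokens)


def build_soft_pattern(instances: List[Instance]) -> Dict[int, Dict[str, float]]:
    """Map slot index (-1 = nearest left, 1 = nearest right) to P(token | slot)."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for left, right in instances:
        for k, tok in enumerate(reversed(left), start=1):
            counts[-k][tok] += 1
        for k, tok in enumerate(right, start=1):
            counts[k][tok] += 1
    pattern = {}
    for slot, counter in counts.items():
        total = sum(counter.values())
        pattern[slot] = {tok: n / total for tok, n in counter.items()}
    return pattern


# The three instances from the slide, with the search term removed.
instances = [
    (["DT$", "NN"], ["BE$", "owned", "by"]),
    (["known", "as"], [",", "VB"]),
    ([], ["BE$", "DT$"]),
]
pattern = build_soft_pattern(instances)
print(pattern[1])   # distribution over tokens seen in Slot1
print(pattern[-1])  # distribution over tokens seen in Slot-1
```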

  15. Outline • How Do Current Systems Identify Definitions? • What are Soft Patterns? • Matching Soft Patterns – Addressing Flexibility • Unsupervised Learning of Soft Patterns • Evaluations • Conclusion and Future Work

  16. Matching Soft Patterns • Test sentences are reduced to a vector S using the same strategy. • <token-w, …, token-1, SEARCH_TERM, token1, …, tokenw : S> • Matching Soft Patterns – similarity between the pattern vector Pa and the test vector S. • Independent slot content similarity. • Slot sequence fidelity.

  17. Probabilistic Matching Degree • Individual slot similarity – independence assumption • Sequence fidelity – bigram model • Combined to get the matching degree
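
The slide's formulas did not survive in the transcript, so the following is only one plausible reading of the three bullets, not the authors' exact model. It assumes slot similarity is a product of per-slot probabilities, sequence fidelity is a bigram probability of the token sequence on each side, and the two are combined by multiplication. The smoothing floor `EPS` and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

EPS = 1e-3  # assumed floor for tokens or bigrams never seen in training


def slot_similarity(pattern: Dict[int, Dict[str, float]],
                    left: Sequence[str], right: Sequence[str]) -> float:
    """Product of P(token | slot), treating slots as independent."""
    logp = 0.0
    for k, tok in enumerate(reversed(left), start=1):
        logp += math.log(pattern.get(-k, {}).get(tok, EPS))
    for k, tok in enumerate(right, start=1):
        logp += math.log(pattern.get(k, {}).get(tok, EPS))
    return math.exp(logp)


def train_bigrams(sequences: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count unigrams and adjacent-token bigrams over training sequences."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))
    return unigrams, bigrams


def sequence_fidelity(seq: Sequence[str], unigrams: Counter, bigrams: Counter) -> float:
    """P(t1) * P(t2 | t1) * ... under a bigram model with a probability floor."""
    if not seq:
        return 1.0
    total = sum(unigrams.values())
    p = max(unigrams[seq[0]] / total, EPS) if total else EPS
    for a, b in zip(seq, seq[1:]):
        p *= max(bigrams[(a, b)] / unigrams[a], EPS) if unigrams[a] else EPS
    return p


def matching_degree(pattern, unigrams, bigrams, left, right) -> float:
    """Assumed combination: slot similarity times fidelity of each side."""
    return (slot_similarity(pattern, left, right)
            * sequence_fidelity(left, unigrams, bigrams)
            * sequence_fidelity(right, unigrams, bigrams))


pattern = {1: {"BE$": 0.6, ",": 0.4}, 2: {"DT$": 0.5, "owned": 0.5}}
unigrams, bigrams = train_bigrams([["BE$", "DT$"], ["BE$", "owned"], [",", "VB"]])
seen = matching_degree(pattern, unigrams, bigrams, [], ["BE$", "DT$"])
unseen = matching_degree(pattern, unigrams, bigrams, [], ["(", "also"])
print(seen, unseen)
```

An unseen context still receives a small non-zero score instead of failing outright, which is the partial-matching behaviour the talk contrasts with hard patterns.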

  18. Outline • How Do Current Systems Identify Definitions? • What are Soft Patterns? • Matching Soft Patterns – Flexibility • Unsupervised Learning of Soft Patterns – Addressing Manual Labor • Evaluations • Conclusion and Future Work

  19. Unsupervised Labeling of Definition Sentences using GPRF • Pattern instances are obtained from labeled definition sentences. • Manual labeling is too expensive. • Pseudo-relevance feedback (PRF) in document retrieval • Take the top n ranked documents as relevant. • We employ Group pseudo-relevance feedback (GPRF) • Statistical ranking – centroid-based method. • Perform PRF over a group of questions (top 10 sentences for each question). • Generate soft patterns from all auto-labeled sentences for all questions.
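
The group step can be sketched as pooling the top-ranked sentences of every question into one auto-labeled training set. The function name `gprf_label` and the input layout (a dict from question to its sentences, already sorted by the statistical ranker) are assumptions for illustration.

```python
from typing import Dict, List


def gprf_label(ranked_by_question: Dict[str, List[str]], top_n: int = 10) -> List[str]:
    """Treat the top_n ranked sentences of each question as definition sentences
    and pool them across the whole group of questions."""
    labeled: List[str] = []
    for _question, ranked in ranked_by_question.items():
        labeled.extend(ranked[:top_n])
    return labeled


# Placeholder sentence ids stand in for real ranked sentences.
ranked_by_question = {
    "Who is Bob Woodward": [f"woodward-sentence-{i}" for i in range(12)],
    "What is SARS": [f"sars-sentence-{i}" for i in range(3)],
}
pool = gprf_label(ranked_by_question, top_n=10)
print(len(pool))
```

Soft patterns are then built from the pooled sentences, so a noisy top list for one question is diluted by the other questions in the group.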

  20. Analysis of GPRF • Assumption 1 – some definition sentences can be ranked high using a statistical method. • Word co-occurrence metrics can model descriptive sentences well. • Over 33% of top ranked sentences are definitional. • Noise introduced in each question’s top list can be mitigated by the group strategy. • Assumption 2 – definition patterns are general and can be used across questions.

  21. Outline • How Do Current Systems Identify Definitions? • What are Soft Patterns? • Matching Soft Patterns – Flexibility • Unsupervised Learning of Soft Patterns • Evaluations • Conclusion and Future Work

  22. Evaluation Setup • Two experiments • To evaluate the effectiveness of our method on a community-standard corpus. • TREC QA corpus – about 1M news articles. • 50 definitional questions with answer nuggets. • To assess the adaptability of the system to actual online news and recent questions. • 26 questions from Lycos. • Up to 200 news articles from each of eight news sites (e.g. CNN and BBC) for each question. • Comparison Systems • Baseline system – centroid-based ranking (IR). • A top ranked definitional question answering system at TREC 2003 – HCR • Hand-crafted definition patterns (a man-month of time to construct).

  23. Evaluation Metrics • Based on given answer nuggets. • The most essential information about the target. • Judged by human assessors. • Nugget Precision (NP) • Penalty to longer answers. • Nugget Recall (NR) • Proportion of returned nuggets to vital nuggets. • F5-measure (NR weighted 5 times as heavily as NP)
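
The metrics can be sketched as below. The F-measure is the standard weighted harmonic mean with β = 5. The length-based precision follows the TREC 2003 definitional QA allowance as it is commonly described (100 non-whitespace characters per returned nugget); treat that constant and the function names as assumptions, since the slide only says that longer answers are penalized.

```python
def nugget_recall(vital_returned: int, vital_total: int) -> float:
    """NR: share of vital nuggets that the answer contains."""
    return vital_returned / vital_total if vital_total else 0.0


def nugget_precision(length: int, vital_returned: int, okay_returned: int,
                     allowance_per_nugget: int = 100) -> float:
    """NP: 1.0 within the length allowance, otherwise penalized by the excess."""
    allowance = allowance_per_nugget * (vital_returned + okay_returned)
    if length <= allowance:
        return 1.0
    return 1.0 - (length - allowance) / length


def f_beta(np_: float, nr: float, beta: float = 5.0) -> float:
    """F = (beta^2 + 1) * NP * NR / (beta^2 * NP + NR)."""
    denom = beta * beta * np_ + nr
    return (beta * beta + 1) * np_ * nr / denom if denom else 0.0


np_value = nugget_precision(length=400, vital_returned=2, okay_returned=1)
nr_value = nugget_recall(vital_returned=2, vital_total=4)
print(np_value, nr_value, f_beta(np_value, nr_value))
```

With β = 5 the score stays close to recall: in this example NP is 0.75 and NR is 0.5, and the F5 value lands just above 0.5.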

  24. Evaluations on TREC Corpus • Pattern matching has a significant impact on definition sentence identification. • Soft patterns are more effective for news text.

  25. Evaluations on the Web Corpus • Using two sets of soft patterns. • More pattern instances lead to better performance (683 from TREC vs. 375 from Lycos). • Soft patterns are general enough to be applied to other corpora. • Makes offline training possible.

  26. Outline • How Do Current Systems Identify Definitions? • What are Soft Patterns? • Matching Soft Patterns – Flexibility • Unsupervised Learning of Soft Patterns • Evaluations • Conclusion and Future Work

  27. Conclusions and Future Work • Current definition pattern matching has weaknesses • Lack of flexibility • Manual labor • We address them by • Soft patterns • Unsupervised learning by Group PRF • Soft patterns prove to be effective in Web-based definition generation systems. • Future work • Soft patterns in information extraction and factoid question answering.

  28. Q & A • Try our online demo at http://www-appn.comp.nus.edu.sg/~cuihang/DefSearch/DefSearch.htm! • Thanks!

  29. Statistical Ranking – Centroid Word Weighting • Words are weighted by their co-occurrence with the search target. • Words whose centrality weights exceed a predefined threshold form a centroid vector. • Sentences are ranked by cosine similarity with the centroid vector. • The top ranked sentences are deemed definition sentence candidates.
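
The centroid ranking can be sketched as follows. The slide does not give the weighting formula, so this sketch assumes a simple ratio (sentences containing both the word and the target, divided by sentences containing the word), whitespace tokenization, and a single-token target; `centroid_vector`, `cosine`, and `rank_sentences` are hypothetical names.

```python
import math
from collections import Counter
from typing import Dict, List


def centroid_vector(sentences: List[str], target: str, threshold: float = 0.5) -> Dict[str, float]:
    """Keep words whose co-occurrence ratio with the target reaches the threshold."""
    target = target.lower()
    overall, with_target = Counter(), Counter()
    for s in sentences:
        words = set(s.lower().split())
        overall.update(words)
        if target in words:
            with_target.update(words)
    centroid = {}
    for word, n in overall.items():
        weight = with_target[word] / n
        if word != target and weight >= threshold:
            centroid[word] = weight
    return centroid


def cosine(sentence: str, centroid: Dict[str, float]) -> float:
    """Cosine similarity between a sentence's term counts and the centroid."""
    tf = Counter(sentence.lower().split())
    dot = sum(count * centroid.get(word, 0.0) for word, count in tf.items())
    norm_s = math.sqrt(sum(c * c for c in tf.values()))
    norm_c = math.sqrt(sum(w * w for w in centroid.values()))
    return dot / (norm_s * norm_c) if norm_s and norm_c else 0.0


def rank_sentences(sentences: List[str], target: str) -> List[str]:
    centroid = centroid_vector(sentences, target)
    return sorted(sentences, key=lambda s: cosine(s, centroid), reverse=True)


sentences = [
    "sars is a respiratory disease caused by a virus",
    "stocks fell on friday",
    "sars is a viral respiratory disease",
]
ranked = rank_sentences(sentences, "sars")
print(ranked)
```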

  30. Sentence Selection • We adopt a variation of Maximal Marginal Relevance (MMR) to summarize the definition sentences. • To ensure relevance and to avoid redundancy. • Examine only the top ranked sentences and stop when the target definition length is reached. • This differs from MMR, which examines all sentences. • The restriction is needed because the input sentences are noisy.
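
An MMR-style selection restricted to the top-ranked candidates can be sketched as below. The redundancy measure (Jaccard word overlap), the trade-off weight `lam`, the candidate cut-off `top_k`, and the character budget are all assumptions for illustration; the slide states only that the top ranked sentences are examined and that selection stops at the definition length.

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def select_sentences(ranked: List[Tuple[str, float]], top_k: int = 20,
                     max_chars: int = 300, lam: float = 0.7) -> List[str]:
    """Greedily pick sentences that are relevant but not redundant,
    looking only at the top_k candidates and stopping at the length budget."""
    pool = list(ranked[:top_k])
    chosen: List[str] = []
    length = 0
    while pool and length < max_chars:
        def mmr(item: Tuple[str, float]) -> float:
            sentence, relevance = item
            redundancy = max((jaccard(sentence, c) for c in chosen), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        pool.remove(best)
        chosen.append(best[0])
        length += len(best[0])
    return chosen


ranked = [
    ("a battery is a cell which can provide electricity", 0.90),
    ("a battery is a cell that can provide electricity", 0.85),
    ("batteries are used in portable devices", 0.60),
]
summary = select_sentences(ranked, max_chars=80, lam=0.5)
print(summary)
```

The near-duplicate second sentence is skipped in favour of the less relevant but novel third one, and the loop stops once the length budget is used up.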

  31. Compared to HMM • Both address individual slot content and sequence fidelity. • Soft patterns perform instance-based learning – can deal with • Small training set • Noisy data from group pseudo-relevance feedback • Online training • HMM needs • More training data and time • Explicit transition paths between states
