Noun Homograph Disambiguation Using Local Context in Large Text Corpora


Presentation Transcript


  1. Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004

  2. Outline • Introduction • Motivations of the Algorithm • Feature Selection • Crucial Problem and Detailed Algorithm • Experimental Results • Conclusions & Discussions

  3. Introduction • What is a Homograph? • One of two or more words spelled alike but different in meaning • What is Noun Homograph Disambiguation? • Determining which of a set of pre-determined senses should be assigned to an occurrence of the noun • Why is Noun Homograph Disambiguation useful?

  4. Noun Compound Interpretation

  5. Improve Information Retrieval Results [diagram not preserved: retrieval example involving the ambiguous term “stick” and ORG-tagged entities]

  6. Extend key words? [diagram not preserved: query-expansion example involving “stick” and ORG-tagged entities]

  7. How to do it? -- Motivations • Intuition 1: humans can identify word senses from local context • Intuition 2: this identification ability comes from familiarity with frequent contexts • Intuition 3: different senses can be distinguished by -- different high-frequency contexts -- different syntactic, orthographic, or lexical features • Combining Intuitions 1, 2, and 3: terms with similar senses will tend to have similar contexts!

  8. Feature Selection • Neighboring words alone are not enough → we need syntactic information! • Principles: Selective & General • Example: “bank” • “Numerous residences, banks, and libraries” → parallel buildings • “They use holes in trees, banks, or rocks for nests” → parallel natural objects • “are found on the west bank of the Nile” → direction word + “bank of” + proper name • “Headed the Chase Manhattan Bank in New York” → name + capitalization
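
The paper's exact feature set is not reproduced in this transcript, but a minimal sketch of this kind of local-context feature extraction might look like the following (the window size, feature names, and the "followed_by_of" cue are illustrative assumptions, not Hearst's actual features):

```python
# A minimal sketch of local-context feature extraction for a target noun.
# The window size, feature names, and the "followed_by_of" cue are
# illustrative assumptions, not the paper's exact feature set.

def extract_features(tokens, target_index, window=3):
    """Return a set of simple lexical/orthographic features describing the
    local context of tokens[target_index]."""
    features = set()
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    for i in range(lo, hi):
        if i == target_index:
            continue
        word = tokens[i]
        pos = i - target_index                      # signed offset from target
        features.add(f"word@{pos}={word.lower()}")  # lexical neighbor
        if word[0].isupper():
            features.add(f"capitalized@{pos}")      # orthographic cue
    # Cues like "bank of <ProperName>" hinge on the immediate right context.
    if target_index + 1 < len(tokens) and tokens[target_index + 1].lower() == "of":
        features.add("followed_by_of")
    return features
```

For the phrase “west bank of the Nile” with the target at index 1, this returns cues such as word@-1=west, followed_by_of, and capitalized@3 (for “Nile”).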

  9. Feature Set

  10. Crucial Problem: do we need a large annotated corpus? • Problem: the cost of manual tagging is high • The corpus to be tagged is usually large • Statistics vary a great deal across different domains • Automating the tagging of the training corpus results in the “circularity problem” (Dagan and Itai, 1994) • Solution: construct the training corpus incrementally • An initial model M1 is trained using a small corpus C1 • M1 is used to disambiguate the rest of the ambiguous words • All words that can be disambiguated with strong confidence are combined with C1 to form C2 • M2 is trained using C2; and so on.
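
As a rough sketch of this incremental procedure (train_model, classify, and the confidence threshold are placeholder assumptions, not the paper's actual interfaces):

```python
# A minimal sketch of the incremental training loop described above.
# train_model, classify, and CONFIDENCE_THRESHOLD are assumed stand-ins
# for the paper's model, decision rule, and confidence criterion.

CONFIDENCE_THRESHOLD = 0.8  # assumed value, not taken from the paper

def bootstrap(seed_corpus, unlabeled, train_model, classify, max_rounds=10):
    labeled = list(seed_corpus)                # C1: small hand-tagged corpus
    for _ in range(max_rounds):
        model = train_model(labeled)           # Mi trained on Ci
        newly_labeled, still_unlabeled = [], []
        for sentence in unlabeled:
            sense, confidence = classify(model, sentence)
            if confidence >= CONFIDENCE_THRESHOLD:
                newly_labeled.append((sentence, sense))
            else:
                still_unlabeled.append(sentence)
        if not newly_labeled:                  # no confident additions: stop
            break
        labeled.extend(newly_labeled)          # Ci+1 = Ci + confident samples
        unlabeled = still_unlabeled
    return train_model(labeled)
```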

  11. Algorithm • Training: manually label a small set of samples → record their context features • Test: input is segmented into phrases and POS-tagged → check the context features of the target noun → compare evidence → choose the sense with the most evidence → output • Samples classified with high comparative evidence are added back to the training samples
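
The training side ("record context features" for each hand-labeled sense) reduces to counting feature/sense co-occurrences; a minimal sketch, assuming feature sets like those produced by the extract_features sketch earlier:

```python
# Training side: count how often each context feature co-occurs with each
# hand-labeled sense. The nested counts play the role of the f_ij table
# used by the comparative-evidence metric on the next slide.

from collections import defaultdict

def record_feature_frequencies(labeled_examples):
    """labeled_examples: iterable of (feature_set, sense) pairs, where each
    feature_set comes from something like extract_features above."""
    freq = defaultdict(lambda: defaultdict(int))  # freq[sense][feature] = f_ij
    for features, sense in labeled_examples:
        for feature in features:
            freq[sense][feature] += 1
    return freq
```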

  12. Comparative Evidence • Definition: choose sense* = argmax_i CE_i, where CE_i = Σ_{j=1}^{m} f_ij • CE_i: comparative evidence for sense i; n: number of senses; m: number of evidence features found in the test sentence; f_ij: frequency with which feature j is recorded in sentences containing sense i • Procedure • Choose the sense with maximum comparative evidence • If the largest CE is not larger than the second-largest CE by a threshold → the sentence cannot be classified! (Margin)
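
Reading CE_i as the raw sum of the f_ij for the features found in the test sentence (my reconstruction of the slide's formula, not a verbatim quote of the paper), the decision rule with the margin test might be sketched as:

```python
# Scoring side: CE_i is taken here as the raw sum of training frequencies
# f_ij for the features j found in the test sentence; the margin test
# compares the best and second-best senses. Both choices are my reading
# of the slide, not a verbatim reconstruction of the paper.

def choose_sense(test_features, freq, margin=2.0):
    """freq: freq[sense][feature] -> f_ij, as built during training.
    Returns the winning sense, or None if the margin test fails."""
    scores = {
        sense: sum(counts.get(f, 0) for f in test_features)
        for sense, counts in freq.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    if ranked[0][1] - ranked[1][1] < margin:   # too close to call
        return None                            # leave sentence unclassified
    return ranked[0][0]
```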

  13. Experimental Results – “tank”

  14. Experimental Results – “bank”

  15. Experimental Results – “bass”

  16. Experimental Results – “country”

  17. Experimental Results – “record” • Record 1: “archived event” vs. “pinnacle achievement” • Record 2: “archived event” vs. “musical disk”

  18. Conclusions and Future Work • Main advantage: bootstrapping alleviates the tagging bottleneck; no sizable sense-tagged corpus is needed • Results show the method is successful • Unsupervised learning • helps to improve results on general words • has limitations on difficult words like “country” • also helps to reduce the amount of manual work • Use of partial syntactic information: richer than common statistical techniques • Proposed improvements • Bootstrapping from bilingual corpora • Improving the evidence metric (adjust weights automatically; weight evidence over the entire corpus and per sense; add more feature types) • Integrating WordNet

  19. Discussion 1: Initial Training • A good training base must already be available, i.e., initial hand tagging is required. But once training is complete, noun homograph disambiguation is fast • This initial set is still fairly large (20-30 occurrences for each sense) → the cost of tagging is still high!

  20. Discussion 2: Resources • Advantages of an unrestricted corpus • compared to dictionaries, it includes sufficient contextual variety • it can automatically integrate unfamiliar words • Assumption • the context around an instance of a sense of the homograph is meaningfully related to that sense • Do we need a semantic lexicon? • “Numerous residences, banks, and libraries” → parallel buildings • “They use holes in trees, banks, or rocks for nests” → parallel natural objects

  21. References • Marti A. Hearst (1991). Noun Homograph Disambiguation Using Local Context in Large Text Corpora • Yarowsky (1992). Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora • Chin (1999). Word Sense Disambiguation Using Statistical Techniques • Peh and Ng (1997). Domain-Specific Semantic Class Disambiguation Using WordNet • Dagan, I. and Itai, A. (1994). Word Sense Disambiguation Using a Second Language Monolingual Corpus
