Noun Homograph Disambiguation Using Local Context in Large Text Corpora


Presentation Transcript


  1. Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004

  2. Outline • Introduction • Motivations of the Algorithm • Feature Selection • Crucial Problem and Detailed Algorithm • Experimental Results • Conclusions & Discussions

  3. Introduction • What is a Homograph? • One of two or more words spelled alike but different in meaning • What is Noun Homograph Disambiguation? • Determining which of a set of pre-determined senses should be assigned to an occurrence of the noun • Why is Noun Homograph Disambiguation useful?

  4. Noun Compound Interpretation

  5. Improve Information Retrieval Results [diagram not preserved: retrieval example involving the ambiguous term “stick” and ORG-tagged entities]

  6. Extend key words? [diagram not preserved: query-expansion example involving “stick” and ORG-tagged entities]

  7. How to do it? -- Motivations • Intuition 1: humans can identify word senses from local context • Intuition 2: this identification ability comes from familiarity with frequent contexts • Intuition 3: different senses can be distinguished by -- different high-frequency contexts -- different syntactic, orthographic, or lexical features • Combining Intuitions 1, 2, and 3: terms with similar senses will tend to have similar contexts!

  8. Feature Selection • Neighboring words alone are not enough → we need syntactic information! • Principles: Selective & General • Example: “bank” • “Numerous residences, banks, and libraries” → parallel buildings • “They use holes in trees, banks, or rocks for nests” → parallel natural objects • “are found on the west bank of the Nile” → direction word + “bank of” + proper name • “Headed the Chase Manhattan Bank in New York” → name + capitalization
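
The paper's exact feature set is not reproduced in this transcript, but a minimal sketch of this kind of local-context feature extraction might look like the following (the window size, feature names, and the "followed_by_of" cue are illustrative assumptions, not Hearst's actual features):

```python
# A minimal sketch of local-context feature extraction for a target noun.
# The window size, feature names, and the "followed_by_of" cue are
# illustrative assumptions, not the paper's exact feature set.

def extract_features(tokens, target_index, window=3):
    """Return a set of simple lexical/orthographic features describing the
    local context of tokens[target_index]."""
    features = set()
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    for i in range(lo, hi):
        if i == target_index:
            continue
        word = tokens[i]
        pos = i - target_index                      # signed offset from target
        features.add(f"word@{pos}={word.lower()}")  # lexical neighbor
        if word[0].isupper():
            features.add(f"capitalized@{pos}")      # orthographic cue
    # Cues like "bank of <ProperName>" hinge on the immediate right context.
    if target_index + 1 < len(tokens) and tokens[target_index + 1].lower() == "of":
        features.add("followed_by_of")
    return features
```

For the phrase “west bank of the Nile” with the target at index 1, this returns cues such as word@-1=west, followed_by_of, and capitalized@3 (for “Nile”).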

  9. Feature Set

  10. Crucial Problem: do we need a large annotated corpus? • Problem: the cost of manual tagging is high • The corpus to be tagged is usually large • Statistics vary a great deal across different domains • Automating the tagging of the training corpus results in the “circularity problem” (Dagan and Itai, 1994) • Solution: construct the training corpus incrementally • An initial model M1 is trained using a small corpus C1 • M1 is used to disambiguate the rest of the ambiguous words • All words that can be disambiguated with strong confidence are combined with C1 to form C2 • M2 is trained using C2; and so on.
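
As a rough sketch of this incremental procedure (train_model, classify, and the confidence threshold are placeholder assumptions, not the paper's actual interfaces):

```python
# A minimal sketch of the incremental training loop described above.
# train_model, classify, and CONFIDENCE_THRESHOLD are assumed stand-ins
# for the paper's model, decision rule, and confidence criterion.

CONFIDENCE_THRESHOLD = 0.8  # assumed value, not taken from the paper

def bootstrap(seed_corpus, unlabeled, train_model, classify, max_rounds=10):
    labeled = list(seed_corpus)                # C1: small hand-tagged corpus
    for _ in range(max_rounds):
        model = train_model(labeled)           # Mi trained on Ci
        newly_labeled, still_unlabeled = [], []
        for sentence in unlabeled:
            sense, confidence = classify(model, sentence)
            if confidence >= CONFIDENCE_THRESHOLD:
                newly_labeled.append((sentence, sense))
            else:
                still_unlabeled.append(sentence)
        if not newly_labeled:                  # no confident additions: stop
            break
        labeled.extend(newly_labeled)          # Ci+1 = Ci + confident samples
        unlabeled = still_unlabeled
    return train_model(labeled)
```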

  11. Algorithm • Training: manually label a small set of samples → record their context features • Test: input is segmented into phrases and POS-tagged → check the context features of the target noun → compare evidence → choose the sense with the most evidence → output • Samples classified with high comparative evidence are added back to the training samples
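
The training side ("record context features" for each hand-labeled sense) reduces to counting feature/sense co-occurrences; a minimal sketch, assuming feature sets like those produced by the extract_features sketch earlier:

```python
# Training side: count how often each context feature co-occurs with each
# hand-labeled sense. The nested counts play the role of the f_ij table
# used by the comparative-evidence metric on the next slide.

from collections import defaultdict

def record_feature_frequencies(labeled_examples):
    """labeled_examples: iterable of (feature_set, sense) pairs, where each
    feature_set comes from something like extract_features above."""
    freq = defaultdict(lambda: defaultdict(int))  # freq[sense][feature] = f_ij
    for features, sense in labeled_examples:
        for feature in features:
            freq[sense][feature] += 1
    return freq
```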

  12. Comparative Evidence • Definition: choose sense* = argmax_i CE_i, where CE_i = Σ_{j=1}^{m} f_ij • CE_i: comparative evidence for sense i; n: number of senses; m: number of evidence features found in the test sentence; f_ij: frequency with which feature j is recorded in sentences containing sense i • Procedure • Choose the sense with maximum comparative evidence • If the largest CE is not larger than the second-largest CE by a threshold → the sentence cannot be classified! (Margin)
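
Reading CE_i as the raw sum of the f_ij for the features found in the test sentence (my reconstruction of the slide's formula, not a verbatim quote of the paper), the decision rule with the margin test might be sketched as:

```python
# Scoring side: CE_i is taken here as the raw sum of training frequencies
# f_ij for the features j found in the test sentence; the margin test
# compares the best and second-best senses. Both choices are my reading
# of the slide, not a verbatim reconstruction of the paper.

def choose_sense(test_features, freq, margin=2.0):
    """freq: freq[sense][feature] -> f_ij, as built during training.
    Returns the winning sense, or None if the margin test fails."""
    scores = {
        sense: sum(counts.get(f, 0) for f in test_features)
        for sense, counts in freq.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    if ranked[0][1] - ranked[1][1] < margin:   # too close to call
        return None                            # leave sentence unclassified
    return ranked[0][0]
```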

  13. Experimental Results – “tank”

  14. Experimental Results – “bank”

  15. Experimental Results – “bass”

  16. Experimental Results – “country”

  17. Experimental Results – “record” • Record 1: “archived event” vs. “pinnacle achievement” • Record 2: “archived event” vs. “musical disk”

  18. Conclusions and Future Work • Main advantage: bootstrapping alleviates the tagging bottleneck; no sizable sense-tagged corpus is needed • Results show the method is successful • Unsupervised learning • helps to improve results on general words • has limitations on difficult words like “country” • also helps to reduce the amount of manual work • Use of partial syntactic information: richer than common statistical techniques • Proposed improvements • Bootstrapping from bilingual corpora • Improving the evidence metric (adjust weights automatically; weight evidence over the entire corpus and per sense; add more feature types) • Integrating WordNet

  19. Discussion 1: Initial Training • A good training base must already be available, i.e., initial hand tagging is required. But once training is complete, noun homograph disambiguation is fast • This initial set is still fairly large (20-30 occurrences for each sense) → the cost of tagging is still high!

  20. Discussion 2: Resources • Advantages of an unrestricted corpus • compared to dictionaries, it includes sufficient contextual variety • it can automatically integrate unfamiliar words • Assumption • the context around an instance of a sense of the homograph is meaningfully related to that sense • Do we need a semantic lexicon? • “Numerous residences, banks, and libraries” → parallel buildings • “They use holes in trees, banks, or rocks for nests” → parallel natural objects

  21. References • Marti A. Hearst (1991). Noun Homograph Disambiguation Using Local Context in Large Text Corpora • Yarowsky (1992). Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora • Chin (1999). Word Sense Disambiguation Using Statistical Techniques • Peh and Ng (1997). Domain-Specific Semantic Class Disambiguation Using WordNet • Dagan, I. and Itai, A. (1994). Word Sense Disambiguation Using a Second Language Monolingual Corpus
