Word Sense Disambiguation

Word Sense Disambiguation. MAS.S60, Catherine Havasi and Rob Speer. Banks? The edge of a river: “I fished on the bank of the Mississippi.” A financial institution: “Bank of America failed to return my call.” The building that houses the financial institution.


Presentation Transcript


  1. Word Sense Disambiguation • MAS.S60 • Catherine Havasi, Rob Speer

  2. Banks? • The edge of a river • “I fished on the bank of the Mississippi.” • A financial institution • “Bank of America failed to return my call.” • The building that houses the financial institution • “The bank burned down last Thursday.” • A “biological repository” • “I gave blood at the blood bank.”

  3. Word Sense Disambiguation • Most NLP tasks need WSD • “Played a lot of pool last night… my bank shot is improving!” • Usually keyed to WordNet • “I hit the ball with the bat.”

  4. Types • “All words” • Guess the WN synset • Lexical Subset • A small number of pre-defined words • Coarse Word Sense • All words, but more intuitive senses

  5. Types • “All words” • Guess the WN synset • Lexical Subset • A small number of pre-defined words • Coarse Word Sense • All words, but more intuitive senses • Inter-annotator agreement (IAA) is 75-80% for the all-words task with WordNet • 90% for simple binary tasks
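The agreement figures on this slide can be made concrete with a small calculation. The sketch below assumes IAA here means raw percent agreement between two annotators, and also shows Cohen's kappa as a common chance-corrected alternative. The annotations are made up for illustration.

```python
from collections import Counter

def observed_agreement(a, b):
    """Fraction of items on which two annotators chose the same sense."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement between two annotators, corrected for chance agreement."""
    p_o = observed_agreement(a, b)
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same sense independently.
    p_e = sum(count_a[s] * count_b[s] for s in count_a) / (n * n)
    if p_e == 1:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations of four occurrences of "bank"
annotator_1 = ["river", "finance", "finance", "building"]
annotator_2 = ["river", "finance", "building", "building"]

agreement = observed_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

The single disagreement is between the institution and its building, which is the kind of fine-grained distinction that pulls agreement down on the all-words task.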

  6. What is a Coarse Word Sense? • How many word senses does the word “bag” have in WordNet?

  7. What is a Coarse Word Sense? • How many word senses does the word “bag” have in WordNet? • 9 noun senses, 5 verb senses • Coarse WSD: 6 noun senses, 2 verb senses • A Coarse WordNet: 6,000 words (Navigli and Litkowski 2006) • These distinctions are hard even for humans (Snyder and Palmer 2004) • Fine-Grained IAA: 72.5% • Coarse-Grained IAA: 86.4%

  8. “Bag”: Noun • 1. A coarse sense containing: • bag (a flexible container with a single opening) • bag, handbag, pocketbook, purse (a container used for carrying money and small personal items or accessories) • bag, bagful (the quantity that a bag will hold) • bag, traveling bag, travelling bag, grip, suitcase (a portable rectangular container for carrying clothes) • 2. bag (the quantity of game taken in a particular period) • 3. base, bag (a place that the runner must touch before scoring) • 4. bag, old bag (an ugly or ill-tempered woman) • 5. udder, bag (mammary gland of bovids (cows and sheep and goats)) • 6. cup of tea, bag, dish (an activity that you like or at which you are superior)
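One way to read the grouping above is as a many-to-one map from fine senses to coarse senses, with a prediction counted as correct when it lands in the right coarse group. The sketch below assumes that scoring rule. The sense labels are hypothetical shorthand for the entries on the slide, not real WordNet sense keys.

```python
# Hypothetical labels for the nine noun senses of "bag" listed on the slide,
# mapped to the six coarse senses. These are NOT real WordNet sense keys.
COARSE_OF = {
    "bag/container": 1,
    "bag/handbag": 1,
    "bag/bagful": 1,
    "bag/suitcase": 1,
    "bag/game": 2,
    "bag/base": 3,
    "bag/old_bag": 4,
    "bag/udder": 5,
    "bag/cup_of_tea": 6,
}

def fine_correct(predicted, gold):
    """Fine-grained scoring: the exact sense must match."""
    return predicted == gold

def coarse_correct(predicted, gold):
    """Coarse-grained scoring: only the coarse group must match."""
    return COARSE_OF[predicted] == COARSE_OF[gold]

# Confusing a handbag with a suitcase is an error only at the fine level.
print(fine_correct("bag/handbag", "bag/suitcase"))    # False
print(coarse_correct("bag/handbag", "bag/suitcase"))  # True
```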

  9. Frequent Ingredients • Open Mind Word Expert • WordNet • eXtended WordNet (XWN) • SemCor 3.0 (“brown1” and “brown2”) • ConceptNet

  10. SemCor

  11. No training set, no problem • Julia Hockenmaier’s “Pseudoword” evaluation • Pick two random words • Say, “banana” and “door” • Combine them together • “BananaDoor” • Replace all instances of either word in your corpus with the new pseudoword • Evaluate whether the system recovers the original word • A bit easier…
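The pseudoword recipe can be sketched in a few lines. This version assumes whole-word, case-insensitive matching on raw sentence strings. The function name and the sample sentences are illustrative, not taken from the lecture.

```python
import re

def make_pseudoword_corpus(sentences, word_a, word_b):
    """Replace every occurrence of either word with one pseudoword.

    Returns (masked_sentence, gold_words) pairs, where gold_words lists the
    original word behind each pseudoword occurrence, in order.
    """
    pseudo = word_a.capitalize() + word_b.capitalize()
    pattern = re.compile(
        r"\b(?:%s|%s)\b" % (re.escape(word_a), re.escape(word_b)),
        re.IGNORECASE,
    )
    corpus = []
    for sentence in sentences:
        gold = [m.group(0).lower() for m in pattern.finditer(sentence)]
        if gold:  # keep only sentences that contain one of the two words
            corpus.append((pattern.sub(lambda m: pseudo, sentence), gold))
    return corpus

sentences = [
    "She peeled a banana.",
    "Close the door behind you.",
    "Nothing to see here.",
]
corpus = make_pseudoword_corpus(sentences, "banana", "door")
for masked, gold in corpus:
    print(masked, gold)
```

A disambiguator is then scored on how often it recovers the gold word for each “BananaDoor”, so no hand-tagged senses are needed.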

  12. The “Flip-flop” Method • Brown, Della Pietra, Della Pietra, and Mercer, 1991 • Find a single feature or set of features that disambiguates the words – think of the named entity recognizer
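The idea of finding a single informative feature can be illustrated with mutual information. The sketch below is a simplified stand-in, not the flip-flop algorithm itself: it scores each candidate feature by its mutual information with the sense label and keeps the best one, without the iterative partitioning step. The feature names and data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Mutual information in bits between x and y, from (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

def best_feature(examples):
    """examples: list of (feature_dict, sense). Return the feature name
    whose values share the most information with the sense label."""
    names = sorted(examples[0][0])
    return max(
        names,
        key=lambda name: mutual_information([(f[name], s) for f, s in examples]),
    )

# Hypothetical features for four occurrences of "bank"
examples = [
    ({"mentions_water": True, "capitalized": False}, "river"),
    ({"mentions_water": True, "capitalized": True}, "river"),
    ({"mentions_water": False, "capitalized": True}, "finance"),
    ({"mentions_water": False, "capitalized": False}, "finance"),
]
chosen = best_feature(examples)
print(chosen)
```

In this toy data `mentions_water` predicts the sense perfectly (1 bit), while `capitalized` carries no information (0 bits).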

  13. An Example

  14. Standard Techniques • Naïve Bayes (notice a trend) • Bag of words • Priors are based on word frequencies • Unsupervised clustering techniques • Expectation Maximization (EM) • Yarowsky
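A bag-of-words Naïve Bayes disambiguator fits in a short class. This sketch assumes the priors are relative sense frequencies in the training data and uses add-one smoothing, which the slide does not specify. The training contexts are invented.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Bag-of-words Naive Bayes over context words, add-one smoothed."""

    def __init__(self):
        self.sense_counts = Counter()            # how often each sense occurs
        self.word_counts = defaultdict(Counter)  # context-word counts per sense
        self.vocab = set()

    def train(self, examples):
        for words, sense in examples:
            self.sense_counts[sense] += 1
            self.word_counts[sense].update(words)
            self.vocab.update(words)

    def predict(self, words):
        total = sum(self.sense_counts.values())

        def log_score(sense):
            counts = self.word_counts[sense]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.sense_counts[sense] / total)  # prior
            for w in words:
                score += math.log((counts[w] + 1) / denom)      # likelihood
            return score

        return max(self.sense_counts, key=log_score)

# Hypothetical training contexts for "bank"
model = NaiveBayesWSD()
model.train([
    (["fished", "river", "water"], "shore"),
    (["deposit", "money", "account"], "finance"),
    (["loan", "money", "interest"], "finance"),
])
print(model.predict(["money", "loan"]))
print(model.predict(["river", "water"]))
```

Working in log space avoids underflow when the context has many words.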

  15. Yarowsky (slides from Julia Hockenmaier)

  16. Training Yarowsky

  17. Using OMCS • Created a blend using a large number of resources • Created an ad hoc category for a word and its surroundings in the sentence • Find which word sense is most similar to the category • Keep the system machinery as general as possible.
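The “ad hoc category” step can be pictured as vector arithmetic. The sketch below assumes each word and each sense already has a vector in a shared space, sums the context word vectors into a category, and picks the sense with the highest cosine similarity. The three-dimensional vectors are toy values, and how the actual blend produces its vectors is not shown here.

```python
import math

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pick_sense(context_words, sense_vectors, word_vectors):
    """Sum the known context vectors into an ad hoc category, then return
    the sense whose vector is most similar to that category."""
    dim = len(next(iter(sense_vectors.values())))
    category = [0.0] * dim
    for word in context_words:
        if word in word_vectors:
            category = [c + x for c, x in zip(category, word_vectors[word])]
    return max(sense_vectors, key=lambda s: cosine(category, sense_vectors[s]))

# Toy vectors, invented for illustration only
WORD_VECTORS = {
    "money": [0.9, 0.1, 0.0],
    "put": [0.3, 0.3, 0.2],
    "river": [0.0, 0.9, 0.1],
}
SENSE_VECTORS = {
    "bank/finance": [1.0, 0.0, 0.1],
    "bank/shore": [0.0, 1.0, 0.2],
}

context = ["i", "put", "my", "money", "in", "the"]
print(pick_sense(context, SENSE_VECTORS, WORD_VECTORS))
```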

  18. Adding Associations • ConceptNet was included in two forms: • Concept vs. feature matrices • Concept-to-concept associations • Associations help to represent topic areas • If the document mentions computer-related words, expect more computer-related word senses

  19. Constructing the Blend

  20. “I put my money in the bank” Calculating the Right Sense

  21. SemEval Task 7 • 14 different systems were submitted in 2007 • Baseline: Most frequent sense • Spoiler!: Our system would have placed 4th • Top three systems: • NUS-PT: parallel corpora with SVM (Chan et al., 2007) • NUS-ML: Bayesian LDA with specialized features (Cai et al., 2007) • LCC-WSD: multiple methods approach with end-to-end system and corpora (Novischi et al., 2007)
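The most-frequent-sense baseline is simple enough to write out in full. This sketch assumes sense-tagged training data in the form of (word, sense) pairs. The tiny data set is invented and says nothing about the actual SemEval scores.

```python
from collections import Counter, defaultdict

def train_mfs(tagged):
    """Map each word to the sense it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, sense in tagged:
        counts[word][sense] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def accuracy(model, test):
    """Fraction of test items whose gold sense equals the model's guess."""
    return sum(model.get(word) == sense for word, sense in test) / len(test)

# Hypothetical sense-tagged data
train = [
    ("bank", "finance"), ("bank", "finance"), ("bank", "shore"),
    ("bat", "animal"), ("bat", "club"), ("bat", "club"),
]
test = [
    ("bank", "finance"), ("bank", "shore"),
    ("bat", "club"), ("bat", "club"),
]

mfs = train_mfs(train)
score = accuracy(mfs, test)
print(mfs, score)
```

The baseline ignores context entirely, which is why it is a useful floor for any system that claims to use context.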

  22. Results

  23. Parallel Corpora • IMVHO the “right” way to do it. • Different senses of a word are often translated as different words in other languages • Use parallel corpora to find those instances • Like European Parliament or UN proceedings
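The parallel-corpus idea amounts to using the translation as a free sense label. The sketch below assumes word alignments are already available and uses French translations of “bank” for illustration. The mapping table, the function name, and the sentences are hypothetical.

```python
# Hypothetical mapping from the aligned French word to an English sense label
TRANSLATION_TO_SENSE = {"banque": "finance", "rive": "shore", "berge": "shore"}

def label_from_alignments(aligned_sentences, target="bank"):
    """Turn word-aligned sentences into sense-labeled training examples.

    aligned_sentences: list of (english_tokens, alignment), where alignment
    maps an English token index to the foreign word it is aligned with.
    Returns (english_tokens, index, sense) triples.
    """
    examples = []
    for tokens, alignment in aligned_sentences:
        for i, token in enumerate(tokens):
            foreign = alignment.get(i)
            if token.lower() == target and foreign in TRANSLATION_TO_SENSE:
                examples.append((tokens, i, TRANSLATION_TO_SENSE[foreign]))
    return examples

aligned = [
    (["I", "opened", "an", "account", "at", "the", "bank"], {6: "banque"}),
    (["We", "walked", "along", "the", "bank"], {4: "rive"}),
    (["The", "bank", "was", "closed"], {}),  # no alignment, so skipped
]
labeled = label_from_alignments(aligned)
for tokens, index, sense in labeled:
    print(" ".join(tokens), "->", sense)
```

The labeled examples can then feed any supervised classifier, with no manual sense tagging.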

  24. English and Romanian

  25. Gold standards are overrated • Rada Mihalcea, 2007: “Using Wikipedia for Automatic Word Sense Disambiguation”

  26. Lab: making a simple supervised WSD classifier • Big thanks to some guy with a blog (Jim Plush) • Training data: Wikipedia articles surrounding “Apple” (the fruit) and “Apple Inc.” • Test data: hand-classified tweets about apples and Apple products • Use familiar features + Naïve Bayes to get > 90% accuracy • Optional: use it with tweetstream to show only tweets about apples (the fruit)

  27. Slide Thanks • James Pustejovsky, Gerard Bakx, Julia Hockenmaier • Manning and Schütze
