
CS 479, Section 1: Natural Language Processing








  1. This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License. CS 479, Section 1: Natural Language Processing Lecture #19: Word Sense Disambiguation. Thanks to Dan Klein of UC Berkeley for many of the materials used in this lecture.

  2. Announcements • Project #2, Part 1 • Early: today • Due: Wednesday • Mid-term Exam • How’d it go? • Follow-up on Friday

  3. Objectives • Review the idea behind joint, generative models. • Discuss the challenge of word sense disambiguation and approach it as a classification problem. • Explore knowledge sources (“features”) that might help us do a better job of disambiguating word senses. • Motivate the need for conditional models, trained discriminatively. • Prepare to see maximum entropy training.

  4. Joint / Generative Models • Models for text categorization (Naïve Bayes, class-conditional LMs) • These are joint models == generative models: • Break complex structure down into derivation steps • “factors” or “local models” • Each step is a categorical choice (at least for our current purposes), conditioned on specified context • How to estimate parameters from labeled data? • Collect counts and smooth (a sketch follows below) • Backbone of much of statistical NLP • [Diagram: Naïve Bayes generative model in which class c generates words w1 … wn]
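
To make “collect counts and smooth” concrete, here is a minimal sketch of estimating Naïve Bayes parameters from labeled documents with add-alpha smoothing; the function name and the add-alpha choice are illustrative, not prescribed by the lecture:

```python
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs, alpha=1.0):
    """Collect counts from (sense_label, words) pairs and apply
    add-alpha smoothing to the per-class word distributions."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for c, words in labeled_docs:
        class_counts[c] += 1
        word_counts[c].update(words)
        vocab.update(words)

    n_docs = sum(class_counts.values())
    priors = {c: n / n_docs for c, n in class_counts.items()}

    likelihoods = {}
    for c in class_counts:
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        likelihoods[c] = {w: (word_counts[c][w] + alpha) / denom
                          for w in vocab}
    return priors, likelihoods
```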

  5. Today • Conditional models • Trained discriminatively • Motivated by the problem of word sense disambiguation

  6. Word Senses • http://www.youtube.com/watch?v=algmKOzTSGE

  7. Word Senses • Words have multiple distinct meanings, or senses: • plant: living plant, manufacturing plant, … • title: name of a work, ownership document, form of address, material at the start of a film, … • Many levels of sense distinctions • Homonymy: totally unrelated meanings (river bank, money bank) • Polysemy: related meanings (star in sky, star on TV) • Systematic polysemy: productive meaning extensions (organizations to their buildings) or metaphor • Sense distinctions can be extremely subtle (or not) • Granularity of senses needed depends a lot on the task • Why is it important to model word senses? • Translation, parsing, information retrieval?

  8. Word Sense Disambiguation • Example: living plant vs. manufacturing plant • “The plant which had previously sustained the town’s economy shut down after an extended labor strike.” • How do we tell these senses apart? • “context” • Is this just text classification using words from nearby text, where each word sense represents a class? • Solution: run our Naïve Bayes classifier? (a sketch follows below) • Naïve Bayes classification works OK for noun senses • 90% on classic, shockingly easy examples (line, interest, star) • 80% on Senseval-1 nouns • 70% on Senseval-1 verbs – harder!
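
A minimal sketch of running that Naïve Bayes classifier over nearby context words, assuming the priors and likelihoods estimated in the earlier sketch; skipping words outside the training vocabulary is a simplification:

```python
import math

def classify(context_words, priors, likelihoods):
    """Pick the sense maximizing P(c) * prod_w P(w | c), in log space."""
    best_sense, best_score = None, float("-inf")
    for c, prior in priors.items():
        total = math.log(prior)
        for w in context_words:
            if w in likelihoods[c]:  # simplification: skip unseen words
                total += math.log(likelihoods[c][w])
        if total > best_score:
            best_sense, best_score = c, total
    return best_sense

# Hypothetical usage for the "plant" example above:
# classify(["town", "economy", "labor", "strike"], priors, likelihoods)
```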

  9. Others’ Approaches to WSD • Supervised learning • Most WSD systems do some kind of supervised learning • Many competing classification techniques perform about the same • It’s all about the knowledge sources – “features” – you rely on • Problem: limited labeled training data available • Unsupervised learning • Bootstrapping (Yarowsky 95); remember “one sense per document”? • Clustering • Indirect supervision • From thesauri • From WordNet • From parallel corpora

  10. Resources: WordNet • Hand-built (but large) hierarchy of word senses • Basically a hierarchical thesaurus
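
For a sense of what the hierarchy looks like in practice, WordNet can be browsed programmatically, e.g. through NLTK; this assumes NLTK is installed and the WordNet data has been downloaded with nltk.download('wordnet'), and sense numbering can vary across WordNet versions:

```python
from nltk.corpus import wordnet as wn

# List the noun senses of "plant" with their glosses.
for synset in wn.synsets('plant', pos=wn.NOUN):
    print(synset.name(), '-', synset.definition())

# Follow hypernym links upward through the hierarchy from one sense.
industrial = wn.synset('plant.n.01')
print(industrial.hypernyms())
```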

  11. Resources • SensEval & SemEval • An ongoing WSD competition • Training / test sets for a wide range of words, difficulties, and parts of speech • Bake-off where lots of labs try many competing approaches • SemCor • A big chunk of the Brown corpus annotated with WordNet senses • FrameNet • Lexical database of English that is both human- and machine-readable • Annotated examples of how words are used in actual texts • For students: a dictionary of more than 10,000 word senses, most of them with annotated examples that show the meaning and usage • For NLP researchers: more than 170,000 manually annotated sentences provide a unique training dataset for semantic role labeling • For linguists: a valence dictionary, with uniquely detailed evidence for the combinatorial properties of a core set of the English vocabulary • Semantic frame: a description of a type of event, relation, or entity and the participants in it • Other Resources • “Open Mind Common Sense”: open-source fact base • Open-domain information extraction from the web • Parallel corpora • Flat thesauri (e.g., Roget’s)

  12. Back to Verb WSD • Why are verbs harder? • Verbal senses are less topical • More sensitive to structure, argument choice • Verb Example: “Serve”


  20. Hacks! Weighted Windows with NB • Distance conditioning • Some words are important only when they are nearby • …. as …. point … court ………………… serve ……..… game … • …. ……………………………………………… serve as …………………. • Distance weighting • Nearby words should get a larger vote: boost(i) decays with relative position i (a sketch follows below)
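
A minimal sketch of the distance-weighting hack: each context word votes with a boost that decays with its distance from the ambiguous word. The 1/(1+|d|) decay is an illustrative choice, not the lecture’s; any decreasing boost(i) works, and the boosted counts can then stand in for raw counts in the Naïve Bayes score:

```python
def weighted_votes(context, target_index, boost=lambda d: 1.0 / (1 + abs(d))):
    """Accumulate a distance-weighted vote for each context word around
    the ambiguous word at target_index; nearby words count for more."""
    votes = {}
    for i, word in enumerate(context):
        if i == target_index:
            continue
        votes[word] = votes.get(word, 0.0) + boost(i - target_index)
    return votes

# e.g. weighted_votes(["point", "court", "serve", "game"], target_index=2)
# gives "court" and "game" larger votes than "point".
```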

  21. Better Features • There are smarter features: • Argument selectional preference: • serve NP[meals] vs. serve NP[papers] vs. serve NP[country] • Subcategorization: • [function] serve PP[as] • [enable] serve VP[to] • [tennis] serve <intransitive> • [food] serve NP {PP[to]} • Can capture poorly (but robustly) with local windows • … but we can also use a parser and get these features explicitly • Other constraints (Yarowsky, 95) • One-sense-per-discourse • only true for broad topical distinctions • One-sense-per-collocation • pretty reliable when it kicks in: manufacturing plant, flowering plant

  22. Knowledge Sources • [Diagram: Naïve Bayes model in which class c generates words w1 … wn] • Context: …. point … court ………………… serve ……… game … • Can we use our friend, Naïve Bayes?

  23. Complex Features with Naïve Bayes? • Example: “Washington County jail served 11,166 meals last month - a figure that translates to feeding some 120 people three times daily for 31 days.” • So we have a decision to make based on a set of cues: • context:jail, context:county, context:feeding, … • local-context:jail, local-context:meals • subcat:NP, direct-object-head:meals • Not conditionally independent, given class! • Not clear how to build a generative derivation for these: • Choose topic, then decide on having a transitive usage, then pick “meals” to be the object’s head, then generate other words? • How about the words that appear in multiple features? • Hard to make this work (though maybe possible) • No real reason to try

  24. A Discriminative Approach • View WSD as a discrimination task • Use a conditional model: P(sense | context:jail, context:county, context:feeding, …, local-context:jail, local-context:meals, subcat:NP, direct-object-head:meals, …) • Have to estimate a categorical distribution (over senses) where there are a huge number of things to condition on • Many feature-based classification techniques exist • We tend to need and prefer methods that provide distributions over classes (why? one such form is sketched below)
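
As a preview of one such method (the log-linear form treated fully next lecture), here is a sketch of turning weighted feature votes into a distribution over senses by exponentiating and normalizing; the (sense, feature) weight keys are an illustrative representation, not the lecture’s notation:

```python
import math

def sense_distribution(feats, weights, senses):
    """Exponentiate and normalize weighted feature votes to get a
    probability distribution over senses (the log-linear form)."""
    scores = {s: sum(weights.get((s, f), 0.0) for f in feats) for s in senses}
    z = sum(math.exp(v) for v in scores.values())
    return {s: math.exp(v) / z for s, v in scores.items()}
```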

  25. Feature Representations • Features are functions fi of the ambiguous word w and its context that indicate the occurrences of certain patterns in the context • Binary (indicator / predicate) or count-valued • Example: “Washington County jail served 11,166 meals last month - a figure that translates to feeding some 120 people three times daily for 31 days.” • context:jail = 1 • context:county = 1 • context:feeding = 1 • context:game = 0 • … • local-context:jail = 1 • local-context:meals = 1 • … • subcat:NP = 1 • subcat:PP = 0 • … • object-head:meals = 1 • object-head:ball = 0 • (a feature-extraction sketch follows below)
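
A sketch of extracting binary indicator features like those above; the subcat and obj_head arguments stand in for parser output and are hypothetical inputs here:

```python
def extract_features(context, target_index, window=3, subcat=None, obj_head=None):
    """Binary indicator features for the ambiguous word at target_index,
    in the style of the slide's example."""
    feats = set()
    for i, word in enumerate(context):
        if i == target_index:
            continue
        feats.add(f"context:{word}")            # anywhere in the context
        if abs(i - target_index) <= window:
            feats.add(f"local-context:{word}")  # within a small window
    if subcat is not None:
        feats.add(f"subcat:{subcat}")           # e.g. "NP" from a parser
    if obj_head is not None:
        feats.add(f"object-head:{obj_head}")    # e.g. "meals" from a parser
    return feats

# For the "served 11,166 meals" example, this would yield features like
# {"context:jail", "local-context:meals", "subcat:NP", "object-head:meals"}.
```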

  28. Linear Classifiers • For a pair (c, w), we take a weighted vote for each class: vote(s) = Σi λs,i fi(c, w), i.e., the sum of each active feature’s weight for sense s • There are many ways to set these weights • Perceptron: • find a currently misclassified example • nudge weights in the direction of a correct classification (a sketch follows below) • Other discriminative methods usually work in the same way: • try out various weights • until you maximize some objective that relies on the truth
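
A minimal sketch of the weighted vote and one perceptron step, using the per-(sense, feature) weight representation from the earlier sketches; the learning-rate parameter is an illustrative addition:

```python
def score(feats, weights, sense):
    """Weighted vote for one sense: sum the weights of its active features."""
    return sum(weights.get((sense, f), 0.0) for f in feats)

def perceptron_update(feats, weights, senses, true_sense, lr=1.0):
    """One perceptron step: if the current weights misclassify this example,
    nudge weights toward the true sense and away from the predicted one."""
    predicted = max(senses, key=lambda s: score(feats, weights, s))
    if predicted != true_sense:
        for f in feats:
            weights[(true_sense, f)] = weights.get((true_sense, f), 0.0) + lr
            weights[(predicted, f)] = weights.get((predicted, f), 0.0) - lr
    return weights
```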

  29. Next • Next class: • How to estimate those weights • Maximum Entropy Models
