

  1. Knowledge-based Methods for Word Sense Disambiguation. From a tutorial at AAAI by Ted Pedersen and Rada Mihalcea [edited by J. Wiebe]

  2. Our last topic • NLP at a more fine-grained level; so far, we’ve only worked with document-level classification • The question of polysemy (a term having more than one meaning) came up in the last topic; word-sense disambiguation addresses this problem • Includes various measures of semantic similarity, which can be used for clustering, search, paraphrase recognition, etc. • Introduces you to resources you can use if you ever work with text • Note: Ted Pedersen’s group created http://wn-similarity.sourceforge.net/ • Very useful!

  3. Definitions • Word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities. • Sense Inventory usually comes from a dictionary or thesaurus. • Knowledge intensive methods, supervised learning, and (sometimes) bootstrapping approaches • Word sense discrimination is the problem of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory. • Unsupervised techniques

  4. Computers versus Humans • Polysemy – most words have many possible meanings. • A computer program has no basis for knowing which one is appropriate, even if it is obvious to a human… • Ambiguity is rarely a problem for humans in their day to day communication, except in extreme cases…

  5. Ambiguity for Humans - Newspaper Headlines! • DRUNK GETS NINE YEARS IN VIOLIN CASE • FARMER BILL DIES IN HOUSE • PROSTITUTES APPEAL TO POPE • STOLEN PAINTING FOUND BY TREE • RED TAPE HOLDS UP NEW BRIDGE • RESIDENTS CAN DROP OFF TREES • INCLUDE CHILDREN WHEN BAKING COOKIES • MINERS REFUSE TO WORK AFTER DEATH • [mixtures of part of speech, word sense, and syntactic ambiguities]

  6. Ambiguity for a Computer • The fisherman jumped off the bank and into the water. • The bank down the street was robbed! • Back in the day, we had an entire bank of computers devoted to this problem. • The bank in that road is entirely too steep and is really dangerous. • The plane took a bank to the left, and then headed off towards the mountains.

  7. Outline • Task definition • Machine Readable Dictionaries • Algorithms based on Machine Readable Dictionaries • Selectional Restrictions • Measures of Semantic Similarity • Heuristic-based Methods

  8. Task Definition • Knowledge-based WSD = class of WSD methods relying (mainly) on knowledge drawn from dictionaries and/or raw text • Resources • Yes • Machine Readable Dictionaries • Raw corpora • No • Manually annotated corpora • Though combinations of these types of techniques and machine learning techniques are possible, of course

  9. Machine Readable Dictionaries • In recent years, most dictionaries have been made available in Machine Readable format (MRD) • Oxford English Dictionary • Collins • Longman Dictionary of Contemporary English (LDOCE) • Thesauruses – add synonymy information • Roget’s Thesaurus • Semantic networks – add more semantic relations • WordNet • EuroWordNet

  10. MRD – A Resource for Knowledge-based WSD • For each word in the language vocabulary, an MRD provides: • A list of meanings • Definitions (for all word meanings) • Typical usage examples (for most word meanings) • WordNet definitions/examples for the noun plant: 1. buildings for carrying on industrial labor; “they built a large plant to manufacture automobiles” 2. a living organism lacking the power of locomotion 3. something planted secretly for discovery by another; “the police used a plant to trick the thieves”; “he claimed that the evidence against him was a plant” 4. an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience

  11. MRD – A Resource for Knowledge-based WSD • A thesaurus adds: • An explicit synonymy relation between word meanings • A semantic network adds: • Hypernymy/hyponymy (IS-A), meronymy/holonymy (PART-OF), antonymy, entailment, etc. • WordNet synsets for the noun “plant”: 1. plant, works, industrial plant 2. plant, flora, plant life • WordNet related concepts for the meaning “plant life” {plant, flora, plant life}: hypernym: {organism, being} hyponym: {house plant}, {fungus}, … meronym: {plant tissue}, {plant part} holonym: {Plantae, kingdom Plantae, plant kingdom}

  12. Outline • Task definition • Machine Readable Dictionaries • Algorithms based on Machine Readable Dictionaries • Selectional Restrictions • Measures of Semantic Similarity • Heuristic-based Methods

  13. Lesk Algorithm • (Michael Lesk 1986): Identify senses of words in context using definition overlap • Algorithm: • Retrieve from MRD all sense definitions of the words to be disambiguated • Determine the definition overlap for all possible sense combinations • Choose senses that lead to highest overlap • Example: disambiguate PINE CONE • PINE 1. kinds of evergreen tree with needle-shaped leaves 2. waste away through sorrow or illness • CONE 1. solid body which narrows to a point 2. something of this shape whether solid or hollow 3. fruit of certain evergreen trees • Pine#1 ∩ Cone#1 = 0 Pine#2 ∩ Cone#1 = 0 Pine#1 ∩ Cone#2 = 1 Pine#2 ∩ Cone#2 = 0 Pine#1 ∩ Cone#3 = 2 Pine#2 ∩ Cone#3 = 0
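[Note: a minimal Python sketch of the pairwise overlap scoring above, using the PINE/CONE glosses from this slide as a toy sense inventory. A real system would pull glosses from WordNet or another MRD, remove stopwords, and stem, so its counts can differ slightly from the slide’s.]

```python
def overlap(gloss1, gloss2):
    """Count shared (lowercased) tokens between two sense glosses."""
    return len(set(gloss1.lower().split()) & set(gloss2.lower().split()))

# Toy sense inventory copied from the slide.
pine = {1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}
cone = {1: "solid body which narrows to a point",
        2: "something of this shape whether solid or hollow",
        3: "fruit of certain evergreen trees"}

# Score every sense combination and keep the highest-overlap pair.
score, pine_sense, cone_sense = max(
    (overlap(pine[i], cone[j]), i, j) for i in pine for j in cone)
print(pine_sense, cone_sense, score)  # highest overlap: pine#1 with cone#3
```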

  14. Lesk Algorithm for More than Two Words? • I saw a man who is 98 years old and can still walk and tell jokes • nine open-class words: see(26), man(11), year(4), old(8), can(5), still(4), walk(10), tell(8), joke(3) • 43,929,600 sense combinations! How to find the optimal sense combination? • Simulated annealing (Cowie, Guthrie, Guthrie 1992) • Define a score E over a configuration (combination) of word senses in a given text • Find the configuration of senses that leads to the highest definition overlap (redundancy) 1. Start with the configuration that assigns each word its most frequent sense 2. At each iteration, replace the sense of a random word in the set with a different sense, and measure E 3. Stop iterating when there is no change in the configuration of senses
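[Note: a rough Python sketch of the search loop; this is a plain hill-climbing simplification of Cowie et al.’s simulated annealing (no temperature schedule, so worse configurations are never accepted). `definitions` is a hypothetical dict mapping each word to its list of sense glosses, with sense 0 standing in for the most frequent sense.]

```python
import random

def redundancy(config, definitions, words):
    """E for one configuration: total pairwise gloss overlap."""
    glosses = [set(definitions[w][config[w]].lower().split()) for w in words]
    return sum(len(glosses[i] & glosses[j])
               for i in range(len(glosses))
               for j in range(i + 1, len(glosses)))

def search_senses(words, definitions, iterations=10000):
    config = {w: 0 for w in words}           # start: "most frequent" sense
    best = redundancy(config, definitions, words)
    for _ in range(iterations):
        w = random.choice(words)             # perturb one word's sense
        candidate = dict(config)
        candidate[w] = random.randrange(len(definitions[w]))
        score = redundancy(candidate, definitions, words)
        if score >= best:                    # greedy accept only
            config, best = candidate, score
    return config
```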

  15. Lesk Algorithm: A Simplified Version • Original Lesk definition: measure overlap between sense definitions for all words in context • Identify simultaneously the correct senses for all words in context • Simplified Lesk (Kilgarriff & Rosenzweig 2000): measure overlap between sense definitions of a word and current context • Identify the correct sense for one word at a time • Search space significantly reduced

  16. Lesk Algorithm: A Simplified Version • Algorithm for simplified Lesk: • Retrieve from MRD all sense definitions of the word to be disambiguated • Determine the overlap between each sense definition and the current context • Choose the sense that leads to highest overlap Example: disambiguate PINE in “Pine cones hanging in a tree” • PINE 1. kinds of evergreen tree with needle-shaped leaves 2. waste away through sorrow or illness [Actually, would a WSD system be choosing between these?] Pine#1 ∩ Sentence = 1 Pine#2 ∩ Sentence = 0

  17. Lesk Algorithm: A Simplified Version • Algorithm for simplified Lesk: • Retrieve from MRD all sense definitions of the word to be disambiguated • Determine the overlap between each sense definition and the current context • Choose the sense that leads to highest overlap Example: disambiguate PINE in “Pine cones hanging in a tree” • PINE 1. kinds of evergreen tree with needle-shaped leaves 2. waste away through sorrow or illness [Actually, would a WSD system be choosing between these?] [Typically, no – they are different parts of speech. While POS taggers do make mistakes, they make fewer than WSD systems. Combined with a ML approach, one could assign the best overall interpretation, considering POS and sense.] Pine#1 ∩ Sentence = 1 Pine#2 ∩ Sentence = 0
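[Note: a minimal Python sketch of simplified Lesk; `sense_glosses` is a hypothetical dict standing in for the MRD lookup (with NLTK one could instead use wordnet.synsets(word) and each synset’s definition()).]

```python
def simplified_lesk(word, context, sense_glosses):
    """Pick the sense of `word` whose gloss overlaps the context most."""
    context_tokens = set(context.lower().split()) - {word.lower()}
    best_sense, best_overlap = None, -1
    for sense, gloss in enumerate(sense_glosses[word], start=1):
        overlap = len(set(gloss.lower().split()) & context_tokens)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Toy data from the slide.
glosses = {"pine": ["kinds of evergreen tree with needle-shaped leaves",
                    "waste away through sorrow or illness"]}
print(simplified_lesk("pine", "Pine cones hanging in a tree", glosses))  # -> 1
```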

  18. Outline • Task definition • Machine Readable Dictionaries • Algorithms based on Machine Readable Dictionaries • Selectional Preferences • Measures of Semantic Similarity • Heuristic-based Methods

  19. Selectional Preferences • A way to constrain the possible meanings of words in a given context • E.g. “Wash a dish” vs. “Cook a dish” • WASH-OBJECT vs. COOK-FOOD • Capture information about possible relations between semantic classes • Common sense knowledge • Alternative terminology • Selectional Restrictions • Selectional Preferences • Selectional Constraints

  20. Acquiring Selectional Preferences • From annotated corpora • But sense annotated data are not plentiful • From raw corpora • Frequency counts • Information theory measures • Class-to-class relations

  21. Preliminaries: Learning Word-to-Word Relations • An indication of the semantic fit between two words 1. Frequency counts • Pairs of words connected by a syntactic relation 2. Conditional probabilities • Condition on one of the words

  22. From Resnik 1993 • The alternative view of selectional constraints I am proposing can be phrased as follows: rather than restrictions or hard constraints on applicability, a predicate preferentially associates with certain kinds of arguments, and these preferences constitute the effect that the predicate has on what appears in an argument position. For example, the predicate blue does not restrict itself to arguments having a tangible surface — the sky is blue, and so is ocean water even deep below any apparent surface — but its arguments are still far from arbitrary. The effect of the predicate is that its arguments tend to be physical entities and to have surfaces. Similarly, the verb admire, interpreted in the particular sense “to have a high opinion of,” has an effect on what appears as its subject; these tend to be physical, animate, human, capable of the higher psychological functions, and so forth. In some cases the effect a predicate has on its argument is quite strong: one is unlikely to find the (numerical) predicate even applied to anything but positive integers. In other cases — e.g. the predicate smooth — the effect is less dramatic.

  23. Bringing in Information Theory • Entropy – how uncertain the outcome is (on average) • “The cook basted the ___.” (which noun?) Entropy(which noun?) is low, since the word is likely to be one of a small set of words, such as “turkey” or “roast”. • But the entropy is much higher in the following: • “The cook enjoyed the ___.” (which noun?) since a much wider range of words is likely. (The opera, the company of the butler, a certain book, a particular food, …)
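[Note: a toy Python illustration of the entropy contrast; the probability distributions over the object noun are made up purely for illustration.]

```python
import math

def entropy(dist):
    """Shannon entropy H = -sum p*log2(p) over a probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

basted = {"turkey": 0.5, "roast": 0.3, "chicken": 0.2}      # few likely nouns
enjoyed = {f"noun{i}": 1 / 1000 for i in range(1000)}       # many likely nouns

print(round(entropy(basted), 2))   # low, ~1.49 bits
print(round(entropy(enjoyed), 2))  # high, ~9.97 bits
```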

  24. Learning Selectional Preferences • Word-to-class relations (Resnik 1993) • Quantify the contribution of a semantic class using all the concepts subsumed by that class (the formula from the slide is sketched below)
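[Note: the formula on the original slide is not reproduced in this transcript. As a reconstruction (from Resnik 1993, not copied from the slide), the selectional preference strength of a predicate p and its selectional association with a class c are usually written:]

```latex
\[
S(p) \;=\; \sum_{c} P(c \mid p)\,\log\frac{P(c \mid p)}{P(c)}
\qquad\qquad
A(p, c) \;=\; \frac{1}{S(p)}\; P(c \mid p)\,\log\frac{P(c \mid p)}{P(c)}
\]
```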

  25. Learning Selectional Preferences (2) • Determine the contribution of a word sense based on the assumption of equal sense distributions: • e.g. “plant” has two senses ⇒ 50% of its occurrences are counted as sense 1, 50% as sense 2 • Example: learning restrictions for the verb “to drink” • Find high-scoring verb-object pairs • Find “prototypical” object classes (high association score) • [These classes are synsets in WordNet, i.e., lists of words that also identify a sense; they are hypernyms of the object words above. Look up in WordNet in class.]

  26. Learning Selectional Preferences (3) • Other algorithms • Learn class-to-class relations (Agirre and Martinez, 2002) • E.g.: “ingest food” is a class-to-class relation for “eat chicken” • Bayesian networks (Ciaramita and Johnson, 2000) • Tree cut model (Li and Abe, 1998)

  27. Using Selectional Preferences for WSD Algorithm: 1. Learn a large set of selectional preferences for a given syntactic relation R 2. Given a pair of words W1–W2 connected by a relation R 3. Find all selectional preferences W1–C (word-to-class) or C1–C2 (class-to-class) that apply 4. Select the meanings of W1 and W2 based on the selected semantic class • Example: disambiguate coffee in “drink coffee” 1. (beverage) a beverage consisting of an infusion of ground coffee beans 2. (tree) any of several small trees native to the tropical Old World 3. (color) a medium to dark brown color • Given the selectional preference “DRINK BEVERAGE”: coffee#1
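[Note: a toy Python sketch of steps 3–4 above, with a hand-built preference table; the scores and class labels are hypothetical stand-ins for preferences learned from a parsed corpus.]

```python
# Hypothetical learned preferences: (verb, object semantic class) -> score.
preferences = {("drink", "beverage"): 2.1,
               ("drink", "tree"): 0.0,
               ("drink", "color"): 0.0}

# Candidate senses of "coffee" and the semantic class each belongs to.
coffee_senses = {"coffee#1": "beverage", "coffee#2": "tree", "coffee#3": "color"}

def disambiguate(verb, senses):
    """Pick the object sense whose semantic class the verb prefers most."""
    return max(senses, key=lambda s: preferences.get((verb, senses[s]), 0.0))

print(disambiguate("drink", coffee_senses))  # -> coffee#1
```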

  28. Outline • Task definition • Machine Readable Dictionaries • Algorithms based on Machine Readable Dictionaries • Selectional Restrictions • Measures of Semantic Similarity • Heuristic-based Methods

  29. Semantic Similarity • Words in a discourse must be related in meaning, for the discourse to be coherent (Halliday and Hasan, 1976) • Use this property for WSD – Identify related meanings for words that share a common context • Context span: 1. Local context: semantic similarity between pairs of words 2. Global context: lexical chains [Note: semantic similarity is useful in many settings; for example, recognizing paraphrases, summarization, finding opinionated words, clustering]

  30. Semantic Similarity in a Local Context • Similarity determined between pairs of concepts, or between a word and its surrounding context • Relies on similarity metrics on semantic networks • (Rada et al. 1989) • [Figure: a fragment of an IS-A hierarchy under carnivore, including fissiped mammal (fissiped), canine (canid), feline (felid), bear, wolf, wild dog, dog, hyena, dingo, hyena dog, hunting dog, dachshund, and terrier]

  31. Semantic Similarity Metrics • Input: two concepts (same part of speech) • Output: similarity measure • (Leacock and Chodorow 1998): Similarity(C1, C2) = -log( Path(C1, C2) / 2D ), where Path(C1, C2) is the length of the shortest path between the two concepts and D is the taxonomy depth • E.g. Similarity(wolf, dog) = 0.60, Similarity(wolf, bear) = 0.42 • (Resnik 1995) • Define information content IC(C) = -log P(C), where P(C) = probability of seeing a concept of type C in a large corpus • Probability of seeing a concept = probability of seeing instances of that concept • Determine the contribution of a word sense based on the assumption of equal sense distributions: • e.g. “plant” has two senses ⇒ 50% of its occurrences are counted as sense 1, 50% as sense 2

  32. Semantic Similarity Metrics • Similarity using information content • (Resnik 1995): define the similarity between two concepts as the information content of their Least Common Subsumer: Similarity(C1, C2) = IC( LCS(C1, C2) ) • Alternatives: (Jiang and Conrath 1997), based on the distance IC(C1) + IC(C2) - 2 · IC( LCS(C1, C2) )
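[Note: Ted Pedersen’s WordNet::Similarity package (slide 2) implements these measures in Perl; a rough Python equivalent using NLTK is sketched below, assuming the WordNet and wordnet_ic data have been downloaded. The resulting numbers use NLTK’s scaling, so they will not match the 0.60/0.42 values on the previous slide.]

```python
from nltk.corpus import wordnet as wn, wordnet_ic

# First noun sense (synset) of each word.
wolf, dog, bear = (wn.synsets(w, pos=wn.NOUN)[0] for w in ("wolf", "dog", "bear"))

# Leacock & Chodorow 1998: -log(path_length / (2 * taxonomy depth)).
print(wolf.lch_similarity(dog), wolf.lch_similarity(bear))

# Resnik 1995: information content of the Least Common Subsumer,
# with concept probabilities estimated from the Brown corpus.
brown_ic = wordnet_ic.ic("ic-brown.dat")
print(wolf.res_similarity(dog, brown_ic))

# Jiang & Conrath 1997: an alternative IC-based measure.
print(wolf.jcn_similarity(dog, brown_ic))
```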

  33. Using Hierarchical Structure • [Figure: a taxonomy fragment showing the LCS (Least Common Subsumer) of a target sense and a seed sense]

  34. Using Hierarchical Structure • [Figure: the same LCS illustration, instantiated for voice#1 (objective)]

  35. Semantic Similarity The previous method may be used to assess similarity between word senses. Another one, which often works well, is Jiang & Conrath 1997 (if you want to look it up; we won’t cover it). The next one we will look at is defined between words, not word senses. We’ll see later a method that combines these two types of similarity, for the purpose of finding the most frequent sense of a word in a corpus.

  36. Lin’s Distributional Similarity (Lin 1998) • [Figure: a dependency parse of “I have a brown dog”, with relations R1–R4; each dependency yields a (Word, R, W) triple, e.g. (I, R1, have), (have, R2, dog), (brown, R3, dog), …]

  37. Lin’s Distributional Similarity • [Figure: Word1 and Word2 each listed with their (R, W) pairs; similarity is based on the overlap between the two lists]

  38. Lin’s distributional similarity Lin used a much richer, more robust parser than existed before, applied to a large-scale corpus (by the standards of the time). Dependency parsers lend themselves to easily extracting pair-wise relationships between words. How is similarity computed? Look at the (R, W) pairs associated with each of word1 and word2. The more overlap between the two sets, the more word1 and word2 are related. The actual metric is based on information content and mutual information between word1 and word2; we won’t go further into the details.
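[Note: a toy Python sketch of the overlap idea only; Lin’s actual 1998 metric weights each shared (relation, word) feature by its mutual information with the headword rather than counting features equally. The feature sets below are made up.]

```python
# Each word is described by the (relation, word) features it occurs with
# in a parsed corpus (hypothetical toy data).
features = {
    "dog":  {("obj-of", "have"), ("mod", "brown"), ("obj-of", "walk")},
    "cat":  {("obj-of", "have"), ("mod", "brown"), ("obj-of", "feed")},
    "idea": {("obj-of", "have"), ("mod", "good")},
}

def dice_similarity(w1, w2):
    """Unweighted feature overlap (Dice coefficient)."""
    f1, f2 = features[w1], features[w2]
    return 2 * len(f1 & f2) / (len(f1) + len(f2))

print(dice_similarity("dog", "cat"))   # higher: shares two features
print(dice_similarity("dog", "idea"))  # lower: shares one feature
```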

  39. Semantic Similarity Metrics for WSD • Disambiguate target words based on similarity with one word to the left and one word to the right • (Patwardhan, Banerjee, Pedersen 2002) • Example: disambiguate PLANT in “plant with flowers” • PLANT 1. {plant, works, industrial plant} 2. {plant, flora, plant life} • Similarity(plant#1, flower) = 0.2, Similarity(plant#2, flower) = 1.5 ⇒ plant#2

  40. Semantic Similarity in a Global Context • Lexical chains (Hirst and St-Onge 1998), (Halliday and Hasan 1976) • “A lexical chain is a sequence of semantically related words, which creates a context and contributes to the continuity of meaning and the coherence of a discourse” Algorithm for finding lexical chains: • Select the candidate words from the text. These are words for which we can compute similarity measures, and therefore most of the time they have the same part of speech. • For each such candidate word, and for each sense of this word, find a chain to receive the candidate word sense, based on a semantic relatedness measure between the concepts that are already in the chain and the candidate word sense. • If such a chain is found, insert the word in this chain; otherwise, create a new chain.
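[Note: a greedy Python sketch of the chaining loop above. `relatedness(sense_a, sense_b)` is a hypothetical helper (e.g. one of the WordNet measures from slide 32), and a single threshold decides whether a sense may join a chain.]

```python
def build_chains(candidates, relatedness, threshold=0.5):
    """Greedy lexical chaining.

    `candidates`: list of (word, [senses]) in text order.
    Returns chains as lists of (word, sense) pairs.
    """
    chains = []
    for word, senses in candidates:
        best = None  # (score, chain, sense)
        for sense in senses:
            for chain in chains:
                # Relatedness to the concepts already in the chain.
                score = min(relatedness(sense, s) for _, s in chain)
                if score >= threshold and (best is None or score > best[0]):
                    best = (score, chain, sense)
        if best is not None:
            best[1].append((word, best[2]))
        else:
            # No sufficiently related chain: start a new one
            # (here, simply with the word's first listed sense).
            chains.append([(word, senses[0])])
    return chains
```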

  41. Semantic Similarity in a Global Context • “A very long train traveling along the rails with a constant velocity v in a certain direction…” • Candidate senses for the words in the chain: • train: #1 public transport; #2 ordered set of things; #3 piece of cloth • travel: #1 change location; #2 undergo transportation • rail: #1 a barrier; #2 a bar of steel for trains; #3 a small bird

  42. Lexical Chains for WSD • Identify lexical chains in a text • Usually target one part of speech at a time • Identify the meaning of words based on their membership in a lexical chain

  43. Outline • Task definition • Machine Readable Dictionaries • Algorithms based on Machine Readable Dictionaries • Selectional Restrictions • Measures of Semantic Similarity • Heuristic-based Methods

  44. Most Frequent Sense • Identify the most often used meaning and use this meaning by default • Word meanings exhibit a Zipfian distribution (long tail of rare events) • E.g. distribution of word senses in SemCor • Example: “plant/flora” is used more often than “plant/factory” ⇒ annotate any instance of PLANT as “plant/flora”

  45. Most Frequent Sense • Method 1: Find the most frequent sense in an annotated corpus • Method 2: Find the most frequent sense using a method based on distributional similarity (McCarthy et al. 2004) 1. Given a word w, find the top k distributionally similar words Nw = {n1, n2, …, nk}, with associated similarity scores {dss(w,n1), dss(w,n2), … dss(w,nk)} 2. For each sense wsi of w, identify the similarity with the words nj, using the sense of nj that maximizes this score 3. Rank senses wsi of w based on the total similarity score

  46. Most Frequent Sense • Word senses • pipe #1 = tobacco pipe • pipe #2 = tube of metal or plastic • Distributionally similar words • N = {tube, cable, wire, tank, hole, cylinder, fitting, tap, …} • For each word in N, find similarity with pipe#i (using the sense that maximizes the similarity) • pipe#1 – tube (#3) = 0.3 • pipe#2 – tube (#1) = 0.6 • Compute score for each sense pipe#i • score (pipe#1) = 0.25 • score (pipe#2) = 0.73 Note: results depend on the corpus used to find distributionally similar words => can find domain specific predominant senses
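[Note: a Python sketch of the ranking in Method 2, simplified from McCarthy et al. 2004 (the full method also normalizes the WordNet similarity over a word’s senses). The tube score follows the slide; the other numbers are made up.]

```python
# Distributionally similar neighbors of "pipe" with dss scores (made up).
neighbors = {"tube": 0.9, "cable": 0.7, "wire": 0.6}

# Best WordNet similarity between each pipe sense and each neighbor
# (maximized over the neighbor's senses); the tube values follow the slide.
wn_sim = {("pipe#1", "tube"): 0.3, ("pipe#2", "tube"): 0.6,
          ("pipe#1", "cable"): 0.2, ("pipe#2", "cable"): 0.7,
          ("pipe#1", "wire"): 0.2, ("pipe#2", "wire"): 0.8}

def prevalence(sense):
    """Distributional score of each neighbor, weighted by sense similarity."""
    return sum(dss * wn_sim[(sense, n)] for n, dss in neighbors.items())

ranked = sorted(["pipe#1", "pipe#2"], key=prevalence, reverse=True)
print(ranked[0])  # predominant sense: pipe#2 ("tube of metal or plastic")
```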

  47. One Sense Per Discourse • A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowsky 1992) • What does this mean? E.g. the ambiguous word PLANT occurs 10 times in a discourse ⇒ all instances of “plant” carry the same meaning • Evaluation: • 8 words with two-way ambiguity, e.g. plant, crane, etc. • 98% of the two-word occurrences in the same discourse carry the same meaning • The grain of salt: performance depends on granularity • (Krovetz 1998) experiments with words with more than two senses • Performance of “one sense per discourse” measured on SemCor is approx. 70%

  48. One Sense per Collocation • A word tends to preserve its meaning when used in the same collocation (Yarowsky 1993) • Strong for adjacent collocations • Weaker as the distance between words increases • An example: the ambiguous word PLANT preserves its meaning in all its occurrences within the collocation “industrial plant”, regardless of the context where this collocation occurs • Evaluation: • 97% precision on words with two-way ambiguity • Finer granularity: • (Martinez and Agirre 2000) tested the “one sense per collocation” hypothesis on text annotated with WordNet senses • 70% precision on SemCor words

  49. Part 4: Supervised Methods of Word Sense Disambiguation

  50. Outline • What is Supervised Learning? • Task Definition • Single Classifiers • Naïve Bayesian Classifiers • Decision Lists and Trees • Ensembles of Classifiers
