Natural Language Processing

Natural Language Processing Chapter 19 Computational Lexical Semantics Part 2 [Includes slides from an AAAI-2005 tutorial by Rada Mihalcea and Ted Pedersen]

Word Senses • The meaning of a word in a given context • Word sense representations • With respect to a dictionary chair= a seat for one person, with a support for the back; "he put his coat over the back of the chair and sat down" chair= the position of professor; "he was awarded an endowed chair in economics" • With respect to the translation in a second language chair = chaise chair = directeur • With respect to the context where it occurs (discrimination) “Sit on a chair” “Take a seat on this chair” “The chair of the Math Department” “The chair of the meeting”

Approaches to Word Sense Disambiguation • Knowledge-Based Disambiguation • use of external lexical resources such as dictionaries and thesauri • discourse properties • Supervised Disambiguation • based on a labeled training set • the learning system has: • a training set of feature-encoded inputs AND • their appropriate sense label (category) • Unsupervised Disambiguation • based on unlabeled corpora • The learning system has: • a training set of feature-encoded inputs BUT • NOT their appropriate sense label (category)

All Words Word Sense Disambiguation • Minimally supervised approaches • Learn to disambiguate words using small annotated corpora • E.g. SemCor – corpus where all open class words are disambiguated • 200,000 running words • Most frequent sense

Targeted Word Sense Disambiguation (we saw this earlier) • Disambiguate one target word “Take a seat on this chair” “The chair of the Math Department” • WSD is viewed as a typical classification problem • use machine learning techniques to train a system • Training: • Corpus of occurrences of the target word, each occurrence annotated with appropriate sense • Build feature vectors: • a vector of relevant linguistic features that represents the context (ex: a window of words around the target word) • Disambiguation: • Disambiguate the target word in new unseen text
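The "window of words around the target word" feature can be sketched as follows. This is a minimal illustration, not the feature set of any particular system: the function name `window_features`, the window size of 2, and the `w-1` / `w+1` key format are assumptions made for the example.

```python
def window_features(tokens, target_index, size=2):
    """Return the words within `size` positions of the target, keyed by offset."""
    features = {}
    for offset in range(-size, size + 1):
        if offset == 0:
            continue  # skip the target word itself
        position = target_index + offset
        if 0 <= position < len(tokens):
            features[f"w{offset:+d}"] = tokens[position].lower()
    return features


tokens = "The chair of the Math Department".split()
features = window_features(tokens, tokens.index("chair"))
print(features)
```

Each annotated occurrence of the target word would yield one such feature dictionary, paired with its sense label, as input to a standard classifier.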

Knowledge-based Methods for Word Sense Disambiguation

Outline • Task definition • Machine Readable Dictionaries • Algorithms based on Machine Readable Dictionaries • Selectional Restrictions • Measures of Semantic Similarity • Heuristic-based Methods

Task Definition • Knowledge-based WSD = class of WSD methods relying (mainly) on knowledge drawn from dictionaries and/or raw text • Resources used: • Machine Readable Dictionaries • Raw corpora • Resources not used: • Manually annotated corpora • Scope • All open-class words

Machine Readable Dictionaries • In recent years, most dictionaries have been made available in machine-readable form (MRD) • Oxford English Dictionary • Collins • Longman Dictionary of Contemporary English (LDOCE) • Thesauruses – add synonymy information • Roget Thesaurus • Semantic networks – add more semantic relations • WordNet • EuroWordNet

MRD – A Resource for Knowledge-based WSD • For each word in the language vocabulary, an MRD provides: • A list of meanings • Definitions (for all word meanings) • Typical usage examples (for most word meanings) WordNet definitions/examples for the noun "plant": 1. buildings for carrying on industrial labor; "they built a large plant to manufacture automobiles" 2. a living organism lacking the power of locomotion 3. something planted secretly for discovery by another; "the police used a plant to trick the thieves"; "he claimed that the evidence against him was a plant" 4. an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience

MRD – A Resource for Knowledge-based WSD • A thesaurus adds: • An explicit synonymy relation between word meanings • A semantic network adds: • Hypernymy/hyponymy (IS-A), meronymy/holonymy (PART-OF), antonymy, entailment, etc. WordNet synsets for the noun “plant”: 1. plant, works, industrial plant 2. plant, flora, plant life WordNet related concepts for the meaning “plant life”: {plant, flora, plant life} • hypernym: {organism, being} • hyponym: {house plant}, {fungus}, … • meronym: {plant tissue}, {plant part} • holonym: {Plantae, kingdom Plantae, plant kingdom}

Lesk Algorithm • (Michael Lesk 1986): Identify senses of words in context using definition overlap Algorithm: • Retrieve from MRD all sense definitions of the words to be disambiguated • Determine the definition overlap for all possible sense combinations • Choose senses that lead to highest overlap Example: disambiguate PINE CONE • PINE 1. kinds of evergreen tree with needle-shaped leaves 2. waste away through sorrow or illness • CONE 1. solid body which narrows to a point 2. something of this shape whether solid or hollow 3. fruit of certain evergreen trees • Pine#1 ∩ Cone#1 = 0 • Pine#2 ∩ Cone#1 = 0 • Pine#1 ∩ Cone#2 = 1 • Pine#2 ∩ Cone#2 = 0 • Pine#1 ∩ Cone#3 = 2 • Pine#2 ∩ Cone#3 = 0
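The overlap counts for PINE CONE can be reproduced with a short sketch. The glosses are the ones on the slide; the stopword list and the crude suffix stripping in `normalize` are assumptions made for this example (Lesk's paper does not prescribe them), chosen so that "trees" matches "tree" and "needle-shaped" matches "shape".

```python
import re
from itertools import product

# Assumed stopword list, just large enough for the glosses in this example.
STOPWORDS = {"a", "an", "the", "of", "or", "to", "with", "which",
             "this", "whether", "through", "in"}


def normalize(text):
    """Lowercase, split on non-letters, drop stopwords, strip simple suffixes."""
    words = set()
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOPWORDS:
            continue
        if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
            word = word[:-1]  # trees -> tree
        elif word.endswith("ed"):
            word = word[:-1]  # shaped -> shape
        words.add(word)
    return words


PINE = {
    1: "kinds of evergreen tree with needle-shaped leaves",
    2: "waste away through sorrow or illness",
}
CONE = {
    1: "solid body which narrows to a point",
    2: "something of this shape whether solid or hollow",
    3: "fruit of certain evergreen trees",
}

# Definition overlap for every combination of senses.
overlap = {
    (p, c): len(normalize(PINE[p]) & normalize(CONE[c]))
    for p, c in product(PINE, CONE)
}
best_pair = max(overlap, key=overlap.get)
print(overlap, best_pair)
```

With these normalization choices the highest-overlap combination is Pine#1 with Cone#3, sharing "evergreen" and "tree".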

Lesk Algorithm for More than Two Words? • I saw a man who is 98 years old and can still walk and tell jokes • nine open class words: see(26), man(11), year(4), old(8), can(5), still(4), walk(10), tell(8), joke(3) • 43,929,600 sense combinations! How to find the optimal sense combination? • Simulated annealing (Cowie, Guthrie, Guthrie 1992) • Define a function E = combination of word senses in a given text. • Find the combination of senses that leads to highest definition overlap (redundancy) 1. Start with E = the most frequent sense for each word 2. At each iteration, replace the sense of a random word in the set with a different sense, and measure E 3. Stop iterating when there is no change in the configuration of senses
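The size of the search space, and the flavor of the search, can be sketched as follows. The sense counts are taken from the slide. The `anneal` function is a generic simulated-annealing loop, not the exact procedure of Cowie, Guthrie and Guthrie: the acceptance rule, the cooling schedule, the fixed step budget (instead of stopping when the configuration no longer changes), and the use of sense index 0 as "most frequent sense" are all assumptions. The `score` argument stands in for the definition-overlap function E.

```python
import math
import random

# Sense counts per open-class word, as listed on the slide.
SENSE_COUNTS = {"see": 26, "man": 11, "year": 4, "old": 8, "can": 5,
                "still": 4, "walk": 10, "tell": 8, "joke": 3}
TOTAL_COMBINATIONS = math.prod(SENSE_COUNTS.values())  # 43,929,600


def anneal(n_senses, score, steps=2000, temperature=1.0, cooling=0.995, seed=0):
    """Search for a high-scoring sense assignment (one sense index per word)."""
    rng = random.Random(seed)
    current = [0] * len(n_senses)  # assumed: index 0 is the most frequent sense
    current_score = score(current)
    best, best_score = current[:], current_score
    for _ in range(steps):
        i = rng.randrange(len(n_senses))
        if n_senses[i] < 2:
            continue  # monosemous word: nothing to change
        candidate = current[:]
        candidate[i] = rng.choice(
            [s for s in range(n_senses[i]) if s != current[i]])
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature cools.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current[:], current_score
        temperature *= cooling
    return best, best_score


print(TOTAL_COMBINATIONS)
```

In a real system `score` would sum the definition overlaps among the selected senses, so each step costs one overlap computation rather than an enumeration of all combinations.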

Lesk Algorithm: A Simplified Version • Original Lesk definition: measure overlap between sense definitions for all words in context • Identify simultaneously the correct senses for all words in context • Simplified Lesk (Kilgarriff & Rosenzweig 2000): measure overlap between sense definitions of a word and current context • Identify the correct sense for one word at a time • Search space significantly reduced

Lesk Algorithm: A Simplified Version • Algorithm for simplified Lesk: • Retrieve from MRD all sense definitions of the word to be disambiguated • Determine the overlap between each sense definition and the current context • Choose the sense that leads to highest overlap Example: disambiguate PINE in “Pine cones hanging in a tree” • PINE 1. kinds of evergreen tree with needle-shaped leaves 2. waste away through sorrow or illness • Pine#1 ∩ Sentence = 1 • Pine#2 ∩ Sentence = 0
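A minimal sketch of simplified Lesk on the PINE example follows. The glosses and the sentence come from the slide; the stopword list and the suffix stripping in `normalize` are assumptions made for the example, and `simplified_lesk` is an illustrative name rather than a library function.

```python
import re

# Assumed stopword list, just large enough for this example.
STOPWORDS = {"a", "an", "the", "of", "or", "to", "with", "which",
             "this", "whether", "through", "in"}


def normalize(text):
    """Lowercase, split on non-letters, drop stopwords, strip simple suffixes."""
    words = set()
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOPWORDS:
            continue
        if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
            word = word[:-1]  # cones -> cone
        elif word.endswith("ed"):
            word = word[:-1]  # shaped -> shape
        words.add(word)
    return words


def simplified_lesk(senses, context):
    """Score each sense by the overlap between its gloss and the context."""
    context_words = normalize(context)
    scores = {sense_id: len(normalize(gloss) & context_words)
              for sense_id, gloss in senses.items()}
    return max(scores, key=scores.get), scores


PINE = {
    1: "kinds of evergreen tree with needle-shaped leaves",
    2: "waste away through sorrow or illness",
}
best_sense, scores = simplified_lesk(PINE, "Pine cones hanging in a tree")
print(best_sense, scores)
```

Only the senses of the target word are enumerated, so the cost grows linearly with the number of words in the sentence instead of multiplicatively.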

Evaluations of Lesk Algorithm • Initial evaluation by M. Lesk • 50-70% accuracy on short, manually annotated text samples, with respect to the Oxford Advanced Learner’s Dictionary • Simulated annealing • 47% on 50 manually annotated sentences • Evaluation on Senseval-2 all-words data, with back-off to random sense (Mihalcea & Tarau 2004) • Original Lesk: 35% • Simplified Lesk: 47% • Evaluation on Senseval-2 all-words data, with back-off to most frequent sense (Vasilescu, Langlais, Lapalme 2004) • Original Lesk: 42% • Simplified Lesk: 58%

Outline • Task definition • Machine Readable Dictionaries • Algorithms based on Machine Readable Dictionaries • Selectional Preferences • Measures of Semantic Similarity • Heuristic-based Methods

Unsupervised Disambiguation • Disambiguate word senses: • without supporting tools such as dictionaries and thesauri • without a labeled training text • Without such resources, word senses are not labeled • We cannot say “chair/furniture” or “chair/person” • We can: • Cluster/group the contexts of an ambiguous word into a number of groups • Discriminate between these groups without actually labeling them

Unsupervised Disambiguation • Hypothesis: same senses of words will have similar neighboring words • Disambiguation algorithm • Identify context vectors corresponding to all occurrences of a particular word • Partition them into regions of high density • Assign a sense to each such region “Sit on a chair” “Take a seat on this chair” “The chair of the Math Department” “The chair of the meeting”
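The idea of grouping contexts without labeling them can be sketched on the four "chair" sentences. This is only an illustration: representing a context as a bag of words, comparing contexts with Jaccard similarity, and using average-link agglomerative clustering are assumptions made for the example, not the specific method behind the slide.

```python
from itertools import combinations


def context_words(sentence, target):
    """Bag of lowercased words around the target (the target itself removed)."""
    return {w for w in sentence.lower().split() if w != target}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_contexts(contexts, k):
    """Average-link agglomerative clustering of contexts down to k groups."""
    clusters = [[i] for i in range(len(contexts))]

    def link(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(jaccard(contexts[i], contexts[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > k:
        # Merge the two most similar clusters (a < b by construction).
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda ab: link(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


sentences = [
    "Sit on a chair",
    "Take a seat on this chair",
    "The chair of the Math Department",
    "The chair of the meeting",
]
contexts = [context_words(s, "chair") for s in sentences]
groups = cluster_contexts(contexts, k=2)
print(groups)
```

The two resulting groups separate the furniture contexts from the person contexts, but the clusters carry no sense names, which is exactly the discrimination-without-labeling setting described above.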

Evaluating Word Sense Disambiguation • Metrics: • Precision = percentage of words that are tagged correctly, out of the words addressed by the system • Recall = percentage of words that are tagged correctly, out of all words in the test set • Example • Test set of 100 words • System attempts 75 words • Words correctly disambiguated: 50 • Precision = 50 / 75 = 0.67 • Recall = 50 / 100 = 0.50 • Special tags are possible: • Unknown • Proper noun • Multiple senses • Compare to a gold standard • SEMCOR corpus, SENSEVAL corpus, …
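The two metrics reduce to two divisions. The sketch below uses the numbers from the example; the function name `precision_recall` is an assumption made for illustration.

```python
def precision_recall(correct, attempted, total):
    """Precision over attempted words, recall over all words in the test set."""
    return correct / attempted, correct / total


precision, recall = precision_recall(correct=50, attempted=75, total=100)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A system that attempts every word has precision equal to recall, so the two only diverge when the system abstains on some words.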

Evaluating Word Sense Disambiguation • Difficulty in evaluation: • Nature of the senses to distinguish has a huge impact on results • Coarse versus fine-grained sense distinction • chair = a seat for one person, with a support for the back; "he put his coat over the back of the chair and sat down" • chair = the position of professor; "he was awarded an endowed chair in economics" • bank = a financial institution that accepts deposits and channels the money into lending activities; "he cashed a check at the bank"; "that bank holds the mortgage on my home" • bank = a building in which commercial banking is transacted; "the bank is on the corner of Nassau and Witherspoon" • Sense maps • Cluster similar senses • Allow for both fine-grained and coarse-grained evaluation

Bounds on Performance • Upper and Lower Bounds on Performance: • Measure of how well an algorithm performs relative to the difficulty of the task. • Upper Bound: • Human performance • Around 97%-99% with few and clearly distinct senses • Inter-judge agreement: • With words with clear & distinct senses – 95% and up • With polysemous words with related senses – 65% – 70% • Lower Bound (or baseline): • The assignment of a random sense / the most frequent sense • 90% is excellent for a word with 2 equiprobable senses • 90% is trivial for a word with 2 senses with probability ratios of 9 to 1

References • (Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. ACL 1992. • (Miller et al. 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994. • (Miller 1995) Miller, G. Wordnet: A lexical database. ACM, 38(11) 1995. • (Senseval) Senseval evaluation exercises. http://www.senseval.org

Selectional Preferences • A way to constrain the possible meanings of words in a given context • E.g. “Wash a dish” vs. “Cook a dish” • WASH-OBJECT vs. COOK-FOOD • Capture information about possible relations between semantic classes • Common sense knowledge • Alternative terminology • Selectional Restrictions • Selectional Preferences • Selectional Constraints

Acquiring Selectional Preferences • From annotated corpora • Circular relationship with the WSD problem • Need WSD to build the annotated corpus • Need selectional preferences to derive WSD • From raw corpora • Frequency counts • Information theory measures • Class-to-class relations

Preliminaries: Learning Word-to-Word Relations • An indication of the semantic fit between two words 1. Frequency counts • Pairs of words connected by a syntactic relation 2. Conditional probabilities • Condition on one of the words
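Both word-to-word measures can be sketched from a list of verb-object pairs. The pairs below are invented toy data, and the function name `p_object_given_verb` is an assumption for the example; in practice the pairs would come from a parsed corpus.

```python
from collections import Counter

# Hypothetical verb-object pairs, standing in for parser output.
pairs = [
    ("drink", "coffee"), ("drink", "tea"), ("drink", "coffee"),
    ("drink", "water"), ("wash", "dish"), ("cook", "dish"),
]

pair_counts = Counter(pairs)                     # 1. frequency counts
verb_counts = Counter(verb for verb, _ in pairs)


def p_object_given_verb(obj, verb):
    """2. conditional probability P(object | verb), conditioning on the verb."""
    return pair_counts[(verb, obj)] / verb_counts[verb]


print(pair_counts[("drink", "coffee")], p_object_given_verb("coffee", "drink"))
```

Conditioning on the other word instead, P(verb | object), uses the same pair counts divided by object counts.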

Learning Selectional Preferences (1) • Word-to-class relations (Resnik 1993) • Quantify the contribution of a semantic class using all the concepts subsumed by that class • where

Learning Selectional Preferences (2) • Determine the contribution of a word sense based on the assumption of equal sense distributions: • e.g. “plant” has two senses → 50% of occurrences are sense 1, 50% are sense 2 • Example: learning restrictions for the verb “to drink” • Find high-scoring verb-object pairs • Find “prototypical” object classes (high association score)
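A sketch of a word-to-class association score under the equal-sense-distribution assumption follows. The slides do not reproduce the formulas, so the score used here is an assumption based on the usual statement of Resnik's selectional association: the term P(c|v) log(P(c|v) / P(c)) for a class c, divided by the sum of those terms over all classes. The word-to-class mapping and the counts are invented toy data.

```python
import math
from collections import defaultdict

# Hypothetical sense inventory: each word maps to the classes of its senses.
WORD_CLASSES = {
    "coffee": ["beverage", "tree", "color"],
    "tea": ["beverage"],
    "water": ["beverage"],
    "oak": ["tree"],
}
# Hypothetical verb-object frequency counts.
OBJECT_COUNTS = {
    "drink": {"coffee": 6, "tea": 3, "water": 3},
    "grow": {"coffee": 3, "oak": 3},
}


def class_counts(word_counts):
    """Split each word's count equally across the classes of its senses."""
    counts = defaultdict(float)
    for word, n in word_counts.items():
        classes = WORD_CLASSES[word]
        for c in classes:
            counts[c] += n / len(classes)
    return counts


def distribution(counts):
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


per_verb = {verb: class_counts(wc) for verb, wc in OBJECT_COUNTS.items()}
overall = defaultdict(float)
for counts in per_verb.values():
    for c, n in counts.items():
        overall[c] += n
P_CLASS = distribution(overall)  # prior P(c) over all verbs


def selectional_association(verb):
    """Per-class share of the verb's divergence from the prior class distribution."""
    p_given_verb = distribution(per_verb[verb])
    terms = {c: p * math.log(p / P_CLASS[c])
             for c, p in p_given_verb.items() if p > 0}
    strength = sum(terms.values())  # assumed non-zero for this toy data
    return {c: t / strength for c, t in terms.items()}


drink = selectional_association("drink")
print(max(drink, key=drink.get), drink)
```

On this toy data "beverage" is the prototypical object class of "drink", while "tree" receives a negative score because it is less likely after "drink" than overall.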

Learning Selectional Preferences (3) • Other algorithms • Learn class-to-class relations (Agirre and Martinez 2001) • E.g.: “ingest food” is a class-to-class relation for “eat chicken” • Bayesian networks (Ciaramita and Johnson, 2000) • Tree cut model (Li and Abe, 1998)

Using Selectional Preferences for WSD Algorithm: 1. Learn a large set of selectional preferences for a given syntactic relation R 2. Given a pair of words W1–W2 connected by a relation R 3. Find all selectional preferences W1–C (word-to-class) or C1–C2 (class-to-class) that apply 4. Select the meanings of W1 and W2 based on the selected semantic class • Example: disambiguate coffee in “drink coffee” 1. (beverage) a beverage consisting of an infusion of ground coffee beans 2. (tree) any of several small trees native to the tropical Old World 3. (color) a medium to dark brown color • Given the selectional preference “DRINK BEVERAGE” → coffee#1
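Step 4 can be sketched as a lookup: score each sense of the object by the preference of the verb for that sense's class, and keep the best one. The sense-to-class mapping follows the "coffee" example; the preference scores and the function name `disambiguate_object` are hypothetical values chosen for illustration.

```python
# Semantic class of each sense of "coffee", as in the example.
COFFEE_SENSES = {1: "beverage", 2: "tree", 3: "color"}

# Hypothetical learned word-to-class preference scores for verb-object pairs.
PREFERENCES = {("drink", "beverage"): 2.5, ("drink", "tree"): -1.5}


def disambiguate_object(verb, senses, preferences):
    """Return the sense whose class the verb prefers most (0.0 if unseen)."""
    scored = {sense_id: preferences.get((verb, cls), 0.0)
              for sense_id, cls in senses.items()}
    return max(scored, key=scored.get)


print(disambiguate_object("drink", COFFEE_SENSES, PREFERENCES))
```

A verb with no stored preference for any class of the object leaves all senses tied, in which case a real system would need a back-off such as the most frequent sense.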

Evaluation of Selectional Preferences for WSD • Data set • mainly on verb-object, subject-verb relations extracted from SemCor • Compare against random baseline • Results (Agirre and Martinez 2001) • Average results on 8 nouns • Similar figures reported in (Resnik 1997)

Semantic Similarity • Words in a discourse must be related in meaning, for the discourse to be coherent (Halliday and Hasan 1976) • Use this property for WSD – Identify related meanings for words that share a common context • Context span: 1. Local context: semantic similarity between pairs of words 2. Global context: lexical chains

Semantic Similarity in a Local Context • Similarity determined between pairs of concepts, or between a word and its surrounding context • Relies on similarity metrics on semantic networks • (Rada et al. 1989) [Figure: fragment of a semantic network rooted at carnivore, with nodes fissiped mammal (fissiped); canine (canid); feline (felid); bear; wolf; wild dog; dog; hyena; dingo; hyena dog; hunting dog; dachshund; terrier]

Semantic Similarity Metrics (1) • Input: two concepts (same part of speech) • Output: similarity measure • (Leacock and Chodorow 1998) $\mathrm{Similarity}(C_1, C_2) = -\log \frac{\mathrm{Path}(C_1, C_2)}{2D}$, where $D$ is the taxonomy depth • E.g. Similarity(wolf,dog) = 0.60 Similarity(wolf,bear) = 0.42 • (Resnik 1995) • Define information content, $IC(C) = -\log P(C)$, where P(C) = probability of seeing a concept of type C in a large corpus • Probability of seeing a concept = probability of seeing instances of that concept • Determine the contribution of a word sense based on the assumption of equal sense distributions: • e.g. “plant” has two senses → 50% of occurrences are sense 1, 50% are sense 2
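The Leacock and Chodorow measure, the negative log of the path length between two concepts divided by twice the taxonomy depth, can be sketched on a small taxonomy. The parent links below are an assumed arrangement of the carnivore nodes used earlier, not the actual WordNet hierarchy, and path length is counted in nodes. Natural logarithms are used, so the values are not expected to match the 0.60 and 0.42 quoted above, which depend on the full taxonomy and on the log base.

```python
import math

# Assumed toy taxonomy (child -> parent); not the real WordNet structure.
PARENT = {
    "fissiped": "carnivore",
    "canine": "fissiped", "feline": "fissiped", "bear": "fissiped",
    "wolf": "canine", "wild dog": "canine", "dog": "canine", "hyena": "canine",
    "dingo": "wild dog", "hyena dog": "wild dog",
    "hunting dog": "dog",
    "dachshund": "hunting dog", "terrier": "hunting dog",
}


def ancestors(concept):
    """Chain from the concept up to the root, inclusive."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


# Taxonomy depth D, counted in nodes along the longest root-to-leaf chain.
DEPTH = max(len(ancestors(c)) for c in PARENT)


def path_length(a, b):
    """Number of nodes on the shortest path between a and b."""
    up_a, up_b = ancestors(a), ancestors(b)
    lcs = next(node for node in up_a if node in up_b)
    return up_a.index(lcs) + up_b.index(lcs) + 1


def lch_similarity(a, b):
    return -math.log(path_length(a, b) / (2 * DEPTH))


print(lch_similarity("wolf", "dog"), lch_similarity("wolf", "bear"))
```

As in the slide's example, wolf comes out closer to dog than to bear, because the path through "canine" is shorter than the path through "fissiped".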

Semantic Similarity Metrics (2) • Similarity using information content • (Resnik 1995) Define similarity between two concepts (LCS = Least Common Subsumer): $\mathrm{Similarity}(C_1, C_2) = IC(LCS(C_1, C_2))$ • Alternatives (Jiang and Conrath 1997): $\mathrm{Similarity}(C_1, C_2) = 2 \cdot IC(LCS(C_1, C_2)) - (IC(C_1) + IC(C_2))$ • Other metrics: • Similarity using information content (Lin 1998) • Similarity using gloss-based paths across different hierarchies (Mihalcea and Moldovan 1999) • Conceptual density measure between noun semantic hierarchies and current context (Agirre and Rigau 1995) • Adapted Lesk algorithm (Banerjee and Pedersen 2002)
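Information-content similarity can be sketched with a toy taxonomy and toy corpus counts. The sketch assumes the standard definitions: information content is the negative log of a concept's probability, Resnik similarity is the information content of the least common subsumer (LCS), and the Jiang and Conrath score is written here as the negated distance, 2 IC(LCS) minus the sum of the two concepts' information content. The taxonomy and the counts are invented for the example.

```python
import math
from collections import Counter

# Assumed toy taxonomy (child -> parent) and hypothetical corpus counts.
PARENT = {"canine": "carnivore", "bear": "carnivore",
          "wolf": "canine", "dog": "canine"}
COUNTS = {"wolf": 10, "dog": 60, "bear": 30}


def ancestors(concept):
    """Chain from the concept up to the root, inclusive."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


# Seeing an instance of a concept also counts as seeing each of its ancestors.
freq = Counter()
for concept, n in COUNTS.items():
    for node in ancestors(concept):
        freq[node] += n
TOTAL = sum(COUNTS.values())


def ic(concept):
    """Information content: -log P(concept)."""
    return -math.log(freq[concept] / TOTAL)


def lcs(a, b):
    """Least common subsumer: the deepest shared ancestor."""
    shared = set(ancestors(b))
    return next(node for node in ancestors(a) if node in shared)


def resnik(a, b):
    return ic(lcs(a, b))


def jiang_conrath(a, b):
    """Negated Jiang-Conrath distance, so that larger means more similar."""
    return 2 * ic(lcs(a, b)) - (ic(a) + ic(b))


print(resnik("wolf", "dog"), resnik("wolf", "bear"), jiang_conrath("wolf", "dog"))
```

The root subsumes every instance, so its information content is zero and any pair whose only shared ancestor is the root gets a Resnik similarity of zero.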

Semantic Similarity Metrics for WSD • Disambiguate target words based on similarity with one word to the left and one word to the right • (Patwardhan, Banerjee, Pedersen 2003) • Evaluation: • 1,723 ambiguous nouns from Senseval-2 • Among 5 similarity metrics, (Jiang and Conrath 1997) provide the best precision (39%) Example: disambiguate PLANT in “plant with flowers” • PLANT 1. plant, works, industrial plant 2. plant, flora, plant life • Similarity (plant#1, flower) = 0.2 • Similarity (plant#2, flower) = 1.5 → plant#2
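The selection step in the PLANT example can be sketched as choosing the sense with the highest total similarity to the neighboring words. The two similarity values are the ones in the example; the lookup table stands in for a real similarity metric, and the function name `choose_sense` is an assumption. A fuller implementation would also range over the senses of each neighbor.

```python
# Similarity values from the example; a real system would compute them.
SIMILARITY = {("plant#1", "flower"): 0.2, ("plant#2", "flower"): 1.5}


def choose_sense(senses, neighbors, similarity):
    """Pick the sense with the largest summed similarity to the neighbors."""
    totals = {sense: sum(similarity.get((sense, n), 0.0) for n in neighbors)
              for sense in senses}
    return max(totals, key=totals.get)


print(choose_sense(["plant#1", "plant#2"], ["flower"], SIMILARITY))
```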

Semantic Similarity in a Global Context • Lexical chains (Hirst and St-Onge 1998), (Halliday and Hasan 1976) • “A lexical chain is a sequence of semantically related words, which creates a context and contributes to the continuity of meaning and the coherence of a discourse” Algorithm for finding lexical chains: • Select the candidate words from the text. These are words for which we can compute similarity measures, and therefore most of the time they have the same part of speech. • For each such candidate word, and for each meaning for this word, find a chain to receive the candidate word sense, based on a semantic relatedness measure between the concepts that are already in the chain, and the candidate word meaning. • If such a chain is found, insert the word in this chain; otherwise, create a new chain.
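The chaining steps above can be sketched as a greedy loop. Several details are assumptions made for the example: relatedness to a chain is the average relatedness to its members, a chain must reach a threshold of 0.5 to receive a word, and a word that fits no chain starts a new one with its first-listed sense. The relatedness links between the train, travel, and rail senses are hypothetical.

```python
# Hypothetical pairs of related senses (symmetric).
RELATED = {
    frozenset({"train#1", "travel#2"}),
    frozenset({"train#1", "rail#2"}),
    frozenset({"travel#2", "rail#2"}),
}


def related(sense_a, sense_b):
    return 1.0 if frozenset({sense_a, sense_b}) in RELATED else 0.0


def build_chains(candidates, relatedness, threshold=0.5):
    """candidates: list of (word, [senses]); returns chains of (word, sense)."""
    chains = []
    for word, senses in candidates:
        best = None  # (score, chain index, sense)
        for sense in senses:
            for index, chain in enumerate(chains):
                score = sum(relatedness(sense, member)
                            for _, member in chain) / len(chain)
                if score >= threshold and (best is None or score > best[0]):
                    best = (score, index, sense)
        if best is not None:
            chains[best[1]].append((word, best[2]))
        else:
            # No chain accepts the word: start a new chain (assumed: first sense).
            chains.append([(word, senses[0])])
    return chains


candidates = [
    ("train", ["train#1", "train#2", "train#3"]),
    ("travel", ["travel#1", "travel#2"]),
    ("rail", ["rail#1", "rail#2", "rail#3"]),
]
chains = build_chains(candidates, related)
print(chains)
```

The sense under which each word enters a chain is the sense the method assigns to it, which is how chain membership doubles as disambiguation.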

Semantic Similarity in a Global Context • “A very long train traveling along the rails with a constant velocity v in a certain direction…” • train #1: public transport; #2: ordered set of things; #3: piece of cloth • travel #1: change location; #2: undergo transportation • rail #1: a barrier; #2: a bar of steel for trains; #3: a small bird

Lexical Chains for WSD • Identify lexical chains in a text • Usually target one part of speech at a time • Identify the meaning of words based on their membership in a lexical chain • Evaluation: • (Galley and McKeown 2003) lexical chains on 74 SemCor texts give 62.09% • (Mihalcea and Moldovan 2000) on five SemCor texts give 90% precision with 60% recall • lexical chains “anchored” on monosemous words • (Okumura and Honda 1994) lexical chains on five Japanese texts give 63.4%

Most Frequent Sense (1) • Identify the most often used meaning and use this meaning by default • Word meanings exhibit a Zipfian distribution • E.g. distribution of word senses in SemCor • Example: “plant/flora” is used more often than “plant/factory” → annotate any instance of PLANT as “plant/flora”
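The default-sense heuristic can be sketched by counting sense labels in an annotated corpus. The tagged occurrences below are invented toy data standing in for a corpus such as SemCor, and `most_frequent_sense` is an illustrative name.

```python
from collections import Counter, defaultdict

# Hypothetical sense-annotated occurrences: (word, sense label).
tagged = [
    ("plant", "flora"), ("plant", "flora"), ("plant", "factory"),
    ("chair", "furniture"),
]

sense_counts = defaultdict(Counter)
for word, sense in tagged:
    sense_counts[word][sense] += 1


def most_frequent_sense(word):
    """Return the most frequent annotated sense, or None for unseen words."""
    if word not in sense_counts:
        return None
    return sense_counts[word].most_common(1)[0][0]


print(most_frequent_sense("plant"))
```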

Most Frequent Sense (2) • Method 1: Find the most frequent sense in an annotated corpus • Method 2: Find the most frequent sense using a method based on distributional similarity (McCarthy et al. 2004) 1. Given a word w, find the top k distributionally similar words Nw = {n1, n2, …, nk}, with associated similarity scores {dss(w,n1), dss(w,n2), … dss(w,nk)} 2. For each sense wsi of w, identify the similarity with the words nj, using the sense of nj that maximizes this score 3. Rank senses wsi of w based on the total similarity score

Most Frequent Sense (3) • Word senses • pipe #1 = tobacco pipe • pipe #2 = tube of metal or plastic • Distributionally similar words • N = {tube, cable, wire, tank, hole, cylinder, fitting, tap, …} • For each word in N, find similarity with pipe#i (using the sense that maximizes the similarity) • pipe#1 – tube (#3) = 0.3 • pipe#2 – tube (#1) = 0.6 • Compute score for each sense pipe#i • score (pipe#1) = 0.25 • score (pipe#2) = 0.73 Note: results depend on the corpus used to find distributionally similar words => can find domain specific predominant senses
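The ranking step can be sketched as a weighted vote by the distributionally similar words. Two details are assumptions rather than statements from the slides: each neighbor's vote is normalized over the senses of the target word, and the distributional similarity scores are invented. Only the two pipe/tube similarities (0.3 and 0.6) come from the example, so the resulting scores are not expected to match the 0.25 and 0.73 shown above.

```python
# Hypothetical distributional similarity scores dss(pipe, n).
NEIGHBORS = {"tube": 0.5, "cable": 0.4}

# Similarity between a sense of "pipe" and the best-matching sense of n.
# The tube values are from the example; the cable values are hypothetical.
SENSE_SIM = {
    ("pipe#1", "tube"): 0.3, ("pipe#2", "tube"): 0.6,
    ("pipe#1", "cable"): 0.1, ("pipe#2", "cable"): 0.4,
}


def rank_senses(senses, neighbors, sense_sim):
    """Rank senses by the summed, normalized, dss-weighted neighbor similarity."""
    scores = {}
    for sense in senses:
        total = 0.0
        for neighbor, dss in neighbors.items():
            norm = sum(sense_sim.get((other, neighbor), 0.0) for other in senses)
            if norm > 0:
                total += dss * sense_sim.get((sense, neighbor), 0.0) / norm
        scores[sense] = total
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_senses(["pipe#1", "pipe#2"], NEIGHBORS, SENSE_SIM)
print(ranking)
```

Because the neighbors come from a corpus, changing the corpus changes the neighbor list and therefore the predominant sense, which is the domain effect noted above.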

One Sense Per Discourse • A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowsky 1992) • What does this mean? • E.g. the ambiguous word PLANT occurs 10 times in a discourse → all instances of “plant” carry the same meaning • Evaluation: • 8 words with two-way ambiguity, e.g. plant, crane, etc. • 98% of the two-word occurrences in the same discourse carry the same meaning • The grain of salt: Performance depends on granularity • (Krovetz 1998) experiments with words with more than two senses • Performance of “one sense per discourse” measured on SemCor is approx. 70%
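One way to use this heuristic is as a post-processing step: within a discourse, relabel every occurrence of a word with the sense assigned to most of its occurrences. The sketch below is an illustration under that assumption; the function name and the input format (word, predicted sense or None) are invented for the example.

```python
from collections import Counter


def one_sense_per_discourse(predictions):
    """predictions: list of (word, sense or None) for a single discourse."""
    votes = {}
    for word, sense in predictions:
        if sense is not None:
            votes.setdefault(word, Counter())[sense] += 1
    # Give every occurrence the majority sense of its word, if any was assigned.
    return [
        (word, votes[word].most_common(1)[0][0] if word in votes else None)
        for word, _ in predictions
    ]


discourse = [("plant", "flora"), ("plant", None),
             ("plant", "flora"), ("plant", "factory")]
print(one_sense_per_discourse(discourse))
```

The step both fills in untagged occurrences and overrides minority labels, so its benefit depends on how often the hypothesis holds at the chosen sense granularity.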

One Sense per Collocation • A word tends to preserve its meaning when used in the same collocation (Yarowsky 1993) • Strong for adjacent collocations • Weaker as the distance between words increases • An example: the ambiguous word PLANT preserves its meaning in all its occurrences within the collocation “industrial plant”, regardless of the context where this collocation occurs • Evaluation: • 97% precision on words with two-way ambiguity • Finer granularity: • (Martinez and Agirre 2000) tested the “one sense per collocation” hypothesis on text annotated with WordNet senses • 70% precision on SemCor words

References • (Agirre and Rigau 1995) Agirre, E. and Rigau, G. A proposal for word sense disambiguation using conceptual distance. RANLP 1995. • (Agirre and Martinez 2001) Agirre, E. and Martinez, D. Learning class-to-class selectional preferences. CONLL 2001. • (Banerjee and Pedersen 2002) Banerjee, S. and Pedersen, T. An adapted Lesk algorithm for word sense disambiguation using WordNet. CICLING 2002. • (Cowie, Guthrie and Guthrie 1992) Cowie, J., Guthrie, J. A., and Guthrie, L. Lexical disambiguation using simulated annealing. COLING 1992. • (Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. One sense per discourse. DARPA workshop 1992. • (Halliday and Hasan 1976) Halliday, M. and Hasan, R. Cohesion in English. Longman 1976. • (Galley and McKeown 2003) Galley, M. and McKeown, K. Improving word sense disambiguation in lexical chaining. IJCAI 2003. • (Hirst and St-Onge 1998) Hirst, G. and St-Onge, D. Lexical chains as representations of context in the detection and correction of malapropisms. WordNet: An electronic lexical database, MIT Press 1998. • (Jiang and Conrath 1997) Jiang, J. and Conrath, D. Semantic similarity based on corpus statistics and lexical taxonomy. COLING 1997. • (Krovetz 1998) Krovetz, R. More than one sense per discourse. ACL-SIGLEX 1998. • (Lesk 1986) Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. SIGDOC 1986. • (Lin 1998) Lin, D. An information theoretic definition of similarity. ICML 1998.

References • (Martinez and Agirre 2000) Martinez, D. and Agirre, E. One sense per collocation and genre/topic variations. EMNLP 2000. • (Miller et al. 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994. • (Miller 1995) Miller, G. Wordnet: A lexical database. ACM, 38(11) 1995. • (Mihalcea and Moldovan 1999) Mihalcea, R. and Moldovan, D. A method for word sense disambiguation of unrestricted text. ACL 1999. • (Mihalcea and Moldovan 2000) Mihalcea, R. and Moldovan, D. An iterative approach to word sense disambiguation. FLAIRS 2000. • (Mihalcea, Tarau, Figa 2004) Mihalcea, R., Tarau, P., and Figa, E. PageRank on Semantic Networks with Application to Word Sense Disambiguation. COLING 2004. • (Patwardhan, Banerjee, and Pedersen 2003) Patwardhan, S., Banerjee, S., and Pedersen, T. Using Measures of Semantic Relatedness for Word Sense Disambiguation. CICLING 2003. • (Rada et al. 1989) Rada, R., Mili, H., Bicknell, E., and Blettner, M. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1) 1989. • (Resnik 1993) Resnik, P. Selection and Information: A Class-Based Approach to Lexical Relationships. University of Pennsylvania 1993. • (Resnik 1995) Resnik, P. Using information content to evaluate semantic similarity. IJCAI 1995. • (Vasilescu, Langlais, Lapalme 2004) Vasilescu, F., Langlais, P., and Lapalme, G. Evaluating variants of the Lesk approach for disambiguating words. LREC 2004. • (Yarowsky 1993) Yarowsky, D. One sense per collocation. ARPA Workshop 1993.

Part 4: Supervised Methods of Word Sense Disambiguation
