Chapter 20 Part 3
Chapter 20, Part 3: Computational Lexical Semantics • Acknowledgements: these slides include material from Dan Jurafsky, Rada Mihalcea, Ray Mooney, Katrin Erk, and Ani Nenkova
Similarity Metrics • Similarity metrics are useful not just for word sense disambiguation, but also for: • Finding topics of documents • Representing word meanings, not with respect to a fixed sense inventory • We will start with dictionary-based methods and then look at vector space models
Thesaurus-based word similarity • We could use anything in the thesaurus • Meronymy • Glosses • Example sentences • In practice • By “thesaurus-based” we just mean • Using the is-a/subsumption/hypernym hierarchy • Can define similarity between words or between senses
Path-based similarity • Two senses are similar if they are near each other in the thesaurus hierarchy (i.e., there is a short path between them)
Path-based similarity • pathlen(c1,c2) = number of edges in the shortest path between the sense nodes c1 and c2 • $\mathrm{wordsim}(w_1,w_2) = \max_{c_1 \in \mathrm{senses}(w_1),\, c_2 \in \mathrm{senses}(w_2)} \mathrm{pathlen}(c_1,c_2)$
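As a rough illustration of pathlen, the sketch below runs a breadth-first search over a tiny made-up is-a hierarchy. The PARENT table and its node names are assumptions for illustration only; they are not actual WordNet data.

```python
from collections import deque

# Hypothetical is-a links (child -> parent); a toy fragment, not real WordNet data.
PARENT = {
    "nickel": "coin",
    "dime": "coin",
    "coin": "currency",
    "currency": "medium_of_exchange",
    "money": "medium_of_exchange",
    "medium_of_exchange": "standard",
}

def neighbors(node):
    """Nodes one edge away from `node`, following is-a links in either direction."""
    linked = [child for child, parent in PARENT.items() if parent == node]
    if node in PARENT:
        linked.append(PARENT[node])
    return linked

def pathlen(c1, c2):
    """Number of edges on the shortest path between two sense nodes."""
    queue = deque([(c1, 0)])
    seen = {c1}
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path: the nodes lie in disconnected parts of the hierarchy
```

In this toy hierarchy, nickel and dime are two edges apart (via coin), while nickel and money are four edges apart.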
Problem with basic path-based similarity • Assumes each link represents a uniform distance • But some areas of WordNet are more developed than others • This depended on the people who created it • Also, links deep in the hierarchy are intuitively narrower than links higher up [on slide 4, e.g., nickel to money vs. nickel to standard]
Information content similarity metrics • Let’s define P(c) as: • The probability that a randomly selected word in a corpus is an instance of concept c • A word is an instance of a concept if it appears below the concept in the WordNet hierarchy • We saw this idea when we covered selectional preferences
In particular • If there is a single node that is the ancestor of all nodes, then its probability is 1 • The lower a node in the hierarchy, the lower its probability • An occurrence of the word dime would count towards the frequency of coin, currency, standard, etc.
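Information content similarity • Train by counting in a corpus • 1 instance of “dime” could count toward frequency of coin, currency, standard, etc. • More formally: $P(c) = \frac{\sum_{w \in \mathrm{words}(c)} \mathrm{count}(w)}{N}$ • Here words(c) is the set of words subsumed by concept c, and N is the total number of words (tokens) in the corpus that are also in the thesaurus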
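The counting step can be sketched as follows: every occurrence of a word adds to the frequency of its own node and of each of its ancestors, and dividing by N gives P(c). The hierarchy and the token counts are invented for illustration, and each word is assumed to have a single sense node with the same name.

```python
from collections import Counter

# Hypothetical is-a links (child -> parent); a toy fragment, not real WordNet data.
PARENT = {
    "nickel": "coin",
    "dime": "coin",
    "coin": "currency",
    "currency": "medium_of_exchange",
    "money": "medium_of_exchange",
    "medium_of_exchange": "standard",
}

# Hypothetical corpus counts for words that also appear in the thesaurus.
WORD_COUNTS = {"nickel": 2, "dime": 1, "coin": 3, "money": 4}

def concept_probabilities(word_counts, parent):
    """Estimate P(c) by propagating each word's count up to all of its ancestors."""
    total = sum(word_counts.values())  # N: tokens that are also in the thesaurus
    freq = Counter()
    for word, count in word_counts.items():
        node = word
        while node is not None:
            freq[node] += count      # the word counts toward this concept
            node = parent.get(node)  # move up; None once we pass the root
    return {concept: f / total for concept, f in freq.items()}

P = concept_probabilities(WORD_COUNTS, PARENT)
```

With these counts the root (standard) gets probability 1, and probabilities never increase on the way down the hierarchy.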
Information content similarity • [Figure: WordNet hierarchy augmented with probabilities P(c)]
Information content: definitions • Information content: • IC(c) = -log P(c) • Lowest common subsumer LCS(c1,c2) • I.e., the lowest node in the hierarchy • That subsumes (is a hypernym of) both c1 and c2
Resnik method • The similarity between two senses is related to their common information • The more two senses have in common, the more similar they are • Resnik: measure the common information as: • The info content of the lowest common subsumer of the two senses • sim_resnik(c1,c2) = -log P(LCS(c1,c2))
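A minimal sketch of the Resnik measure, under stated assumptions: the hierarchy and the P(c) values are invented, the natural logarithm is used (the slides do not fix a base), and a node is treated as subsuming itself when looking for the LCS.

```python
import math

# Hypothetical is-a links (child -> parent) and made-up concept probabilities P(c).
PARENT = {
    "nickel": "coin",
    "dime": "coin",
    "coin": "currency",
    "currency": "medium_of_exchange",
    "money": "medium_of_exchange",
    "medium_of_exchange": "standard",
}
P = {
    "standard": 1.0,
    "medium_of_exchange": 0.5,
    "currency": 0.3,
    "money": 0.2,
    "coin": 0.2,
    "nickel": 0.1,
    "dime": 0.05,
}

def ancestors(c):
    """The node itself followed by its hypernyms, lowest first."""
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lcs(c1, c2):
    """Lowest common subsumer: the first ancestor of c1 that also subsumes c2."""
    above_c2 = set(ancestors(c2))
    for node in ancestors(c1):
        if node in above_c2:
            return node
    return None  # the two nodes share no ancestor

def ic(c):
    """Information content IC(c) = -log P(c)."""
    return -math.log(P[c])

def sim_resnik(c1, c2):
    """Resnik similarity: information content of the lowest common subsumer."""
    return ic(lcs(c1, c2))
```

Here nickel and dime share the fairly specific subsumer coin, so they score higher than nickel and money, whose LCS is the more general medium_of_exchange.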
Example Use: • Yaw Gyamfi, Janyce Wiebe, Rada Mihalcea, and Cem Akkaya (2009). Integrating Knowledge for Subjectivity Sense Labeling. HLT-NAACL 2009.
What is Subjectivity? • The linguistic expression of somebody’s opinions, sentiments, emotions, evaluations, beliefs, speculations (private states) • This particular use of subjectivity was adapted from literary theory (Banfield 1982; Wiebe 1990)
Examples of Subjective Expressions • References to private states • She was enthusiastic about the plan • Descriptions • That would lead to disastrous consequences • What a freak show
Subjectivity Analysis • Automatic extraction of subjectivity (opinions) from text or dialog
Subjectivity Analysis: Applications • Opinion-oriented question answering: How do the Chinese regard the human rights record of the United States? • Product review mining: What features of the ThinkPad T43 do customers like and which do they dislike? • Review classification: Is a review positive or negative toward the movie? • Tracking sentiments toward topics over time: Is anger ratcheting up or cooling down? • Etc.
Subjectivity Lexicons • Most approaches to subjectivity and sentiment analysis exploit subjectivity lexicons. • Lists of keywords that have been gathered together because they have subjective uses • Brilliant • Difference • Hate • Interest • Love • …
Automatically Identifying Subjective Words • Much work in this area: Hatzivassiloglou & McKeown ACL97; Wiebe AAAI00; Turney ACL02; Kamps & Marx 2002; Wiebe, Riloff, & Wilson CoNLL03; Yu & Hatzivassiloglou EMNLP03; Kim & Hovy IJCNLP05; Esuli & Sebastiani CIKM05; Andreevskaia & Bergler EACL06; etc. • Subjectivity Lexicon available at: http://www.cs.pitt.edu/mpqa • Entries from several sources
However… • Consider the keyword “interest” • It is in the subjectivity lexicon • But, what about “interest rate,” for example?
WordNet Senses • S: Interest, involvement -- (a sense of concern with and curiosity about someone or something; "an interest in music") • O: Interest -- (a fixed charge for borrowing money; usually a percentage of the amount borrowed; "how much interest do you pay on your mortgage?")
Senses • Even in subjectivity lexicons, many senses of the keywords are objective • Thus, many appearances of keywords in texts are false hits
Examples • “There are many differences between African and Asian elephants.” • “… dividing by the absolute value of the difference from the mean…” • “Their differences only grew as they spent more time together …” • “Her support really made a difference in my life” • “The difference after subtracting X from Y…”
Our Task: Subjectivity Sense Labeling • Automatically classifying senses as subjective or objective • Purpose: exploit labels to improve • Word sense disambiguation (Wiebe and Mihalcea ACL06) • Automatic subjectivity and sentiment analysis systems (Akkaya, Wiebe, Mihalcea 2009, 2010, 2011, 2012, 2014)
Subjectivity Tagging using Subjectivity WSD • [Figure: the five senses of “difference” are labeled O (senses 1, 2, 5) or S (senses 3, 4). For each instance, such as “There are many differences between African and Asian elephants.” or “Their differences only grew as they spent more time together …”, an SWSD system decides whether the word is used in an S sense or an O sense, and that decision is passed to the subjectivity or sentiment classifier.]
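Using Hierarchical Structure • [Figure: a target sense and a seed sense connected through their LCS in the hierarchy]
Using Hierarchical Structure • [Figure: example with LCS and the sense voice#1 (objective)]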
If you are interested in the entire approach and experiments, please see the paper (it is on my website)
Dekang Lin method (Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. ICML) • Intuition: Similarity between A and B is not just what they have in common • The more differences between A and B, the less similar they are: • Commonality: the more A and B have in common, the more similar they are • Difference: the more differences between A and B, the less similar • Commonality: IC(common(A,B)) • Difference: IC(description(A,B))-IC(common(A,B))
Dekang Lin similarity theorem • Lin (altering Resnik) defines: • The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are
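Lin's measure is usually written as sim_lin(c1,c2) = 2 log P(LCS(c1,c2)) / (log P(c1) + log P(c2)): the information in the LCS stands in for the commonality, and the information in the two senses together stands in for the full description. The sketch below computes it on an invented hierarchy with made-up P(c) values, so the numbers are illustrative only.

```python
import math

# Hypothetical is-a links (child -> parent) and made-up concept probabilities P(c).
PARENT = {
    "nickel": "coin",
    "dime": "coin",
    "coin": "currency",
    "currency": "medium_of_exchange",
    "money": "medium_of_exchange",
    "medium_of_exchange": "standard",
}
P = {
    "standard": 1.0,
    "medium_of_exchange": 0.5,
    "currency": 0.3,
    "money": 0.2,
    "coin": 0.2,
    "nickel": 0.1,
    "dime": 0.05,
}

def ancestors(c):
    """The node itself followed by its hypernyms, lowest first."""
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lcs(c1, c2):
    """Lowest common subsumer of two sense nodes (a node subsumes itself)."""
    above_c2 = set(ancestors(c2))
    return next((node for node in ancestors(c1) if node in above_c2), None)

def sim_lin(c1, c2):
    """Lin similarity: 2 * log P(LCS) / (log P(c1) + log P(c2))."""
    common = 2 * math.log(P[lcs(c1, c2)])
    description = math.log(P[c1]) + math.log(P[c2])
    # Note: the ratio is undefined (division by zero) if both senses have P = 1.
    return common / description
```

Unlike the Resnik score, this ratio is 1 when a sense is compared with itself, and it drops as the two senses carry more information that their LCS does not.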
Summary: thesaurus-based similarity between senses • There are many metrics (you don’t have to memorize these)
Using Thesaurus-Based Similarity for WSD • One specific method (Banerjee & Pedersen 2003): • For sense k of target word t: • SenseScore[k] = 0 • For each word w appearing within –N and +N of t: • For each sense s of w: • SenseScore[k] += similarity(k,s) • The sense with the highest SenseScore is assigned to the target word
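The SenseScore loop above can be sketched in Python as follows. The sense inventory, the similarity table, and the helper names (context_window, senses_of, similarity) are hypothetical stand-ins; in practice similarity(k, s) would be one of the thesaurus-based measures.

```python
def context_window(tokens, index, n):
    """Words within -n and +n positions of tokens[index], excluding the target itself."""
    return tokens[max(0, index - n):index] + tokens[index + 1:index + 1 + n]

def disambiguate(target_senses, context_words, senses_of, similarity):
    """Return (best_sense, scores), summing similarity over all context-word senses."""
    scores = {k: 0.0 for k in target_senses}
    for k in target_senses:
        for w in context_words:
            for s in senses_of(w):
                scores[k] += similarity(k, s)
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical sense inventory and similarity scores, for illustration only.
SENSES = {
    "bank": ["bank#finance", "bank#river"],
    "loan": ["loan#1"],
    "money": ["money#1"],
}
SIM = {
    ("bank#finance", "loan#1"): 0.7,
    ("bank#finance", "money#1"): 0.8,
    ("bank#river", "loan#1"): 0.1,
    ("bank#river", "money#1"): 0.1,
}

def senses_of(word):
    return SENSES.get(word, [])  # words missing from the inventory contribute nothing

def similarity(k, s):
    return SIM.get((k, s), 0.0)  # unknown pairs are treated as unrelated

tokens = "she asked the bank for a loan of money".split()
target = tokens.index("bank")
context = context_window(tokens, target, 5)
best_sense, sense_scores = disambiguate(SENSES["bank"], context, senses_of, similarity)
```

With this toy table the financial sense accumulates the higher score because both loan and money fall inside the window.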
Problems with thesaurus-based meaning • We don’t have a thesaurus for every language • Even if we do, they have problems with recall • Many words are missing • Most (if not all) phrases are missing • Some connections between senses are missing • Thesauri work less well for verbs, adjectives • Adjectives and verbs have less structured hyponymy relations
Distributional models of meaning • Also called vector-space models of meaning • Offer much higher recall than hand-built thesauri • Although they tend to have lower precision • Zellig Harris (1954): “oculist and eye-doctor … occur in almost the same environments…. If A and B have almost identical environments we say that they are synonyms.” • Firth (1957): “You shall know a word by the company it keeps!”
Intuition of distributional word similarity • Nida example: • A bottle of tesgüino is on the table • Everybody likes tesgüino • Tesgüino makes you drunk • We make tesgüino out of corn. • From context words humans can guess tesgüino means • an alcoholic beverage like beer • Intuition for algorithm: • Two words are similar if they have similar word contexts.
Reminder: Term-document matrix • Each cell: count of term t in a document d: $tf_{t,d}$ • Each document is a count vector: a column below
Reminder: Term-document matrix • Two documents are similar if their vectors are similar
The words in a term-document matrix • Each word is a count vector: a row below
The words in a term-document matrix • Two words are similar if their vectors are similar
The Term-Context matrix • Instead of using entire documents, use smaller contexts • Paragraph • Window of 10 words • A word is now defined by a vector over counts of context words
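One possible way to collect such counts is sketched below: for every token, count the words that fall within a fixed window on either side. The function name and the default window size are assumptions for illustration; the slide's 10-word window would simply be a different value of `window`.

```python
from collections import Counter, defaultdict

def term_context_counts(tokens, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        # Context = up to `window` tokens before and after, excluding the word itself.
        context = tokens[lo:i] + tokens[i + 1:i + 1 + window]
        counts[word].update(context)
    return counts

tokens = "a bottle of wine is on the table".split()
counts = term_context_counts(tokens, window=2)
```

Each row of the term-context matrix is then the Counter for one word, read over a fixed ordering of the context vocabulary.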
Sample contexts: 20 words (Brown corpus) • equal amount of sugar, a sliced lemon, a tablespoonful of apricot preserve or jam, a pinch each of clove and nutmeg, • on board for their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened to that of • of a recursive type well suited to programming on the digital computer. In finding the optimal R-stage policy from that of • substantially affect commerce, for the purpose of gathering data and information necessary for the study authorized in the first section of this
Term-context matrix for word similarity • Two words are similar in meaning if their context vectors are similar
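The slide does not say how vector similarity is measured at this point; cosine is one common choice and is assumed here. The count vectors are hypothetical, over the contexts (computer, data, pinch, result, sugar).

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical context-count vectors; the numbers are made up for illustration.
VECTORS = {
    "apricot":     [0, 0, 1, 0, 1],
    "pineapple":   [0, 0, 1, 0, 1],
    "digital":     [2, 1, 0, 1, 0],
    "information": [1, 6, 0, 4, 0],
}
```

With these vectors, apricot and pineapple point in the same direction, while apricot and information share no contexts at all and so have cosine 0.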
Should we use raw counts? • For the term-document matrix • We used tf-idf instead of raw term counts • For the term-context matrix • Positive Pointwise Mutual Information (PPMI) is common
Pointwise Mutual Information • Pointwise mutual information: • Do events x and y co-occur more than if they were independent? • PMI between two words: (Church & Hanks 1989) • Do words x and y co-occur more than if they were independent? • Positive PMI between two words (Niwa & Nitta 1994) • Replace all PMI values less than 0 with zero
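For reference, the two quantities described above are conventionally written as below. The base-2 logarithm is the usual convention and is an assumption here, since the formula itself did not survive in the transcript.

```latex
\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}
\qquad
\mathrm{PPMI}(x, y) = \max\bigl(\mathrm{PMI}(x, y),\; 0\bigr)
```

A PMI of 0 means x and y co-occur exactly as often as independence would predict; positive values mean they co-occur more often than that.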
Computing PPMI on a term-context matrix • Matrix F with W rows (words) and C columns (contexts) • f_ij is the number of times w_i occurs in context c_j
p(w=information, c=data) = 6/19 = .32 • p(w=information) = 11/19 = .58 • p(c=data) = 7/19 = .37
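A minimal sketch of the whole computation, assuming joint probabilities p_ij = f_ij / (sum of all cells), marginals taken as row and column sums, and a base-2 logarithm. The count matrix F is hypothetical: it was chosen only so that its totals match the figures quoted above (19 tokens overall, 11 for the word information, 7 for the context data, 6 for the pair).

```python
import math

WORDS = ["apricot", "pineapple", "digital", "information"]
CONTEXTS = ["computer", "data", "pinch", "result", "sugar"]

# Hypothetical counts f_ij, consistent with the totals quoted on the slide.
F = [
    [0, 0, 1, 0, 1],  # apricot
    [0, 0, 1, 0, 1],  # pineapple
    [2, 1, 0, 1, 0],  # digital
    [1, 6, 0, 4, 0],  # information
]

def ppmi_matrix(F):
    """Return the PPMI value for every cell of the count matrix F."""
    total = sum(sum(row) for row in F)
    row_p = [sum(row) / total for row in F]                                # p(w_i)
    col_p = [sum(row[j] for row in F) / total for j in range(len(F[0]))]   # p(c_j)
    result = []
    for i, row in enumerate(F):
        out_row = []
        for j, f in enumerate(row):
            if f == 0:
                out_row.append(0.0)  # PMI would be -infinity; PPMI clips it to 0
            else:
                pmi = math.log2((f / total) / (row_p[i] * col_p[j]))
                out_row.append(max(pmi, 0.0))  # replace negative PMI with zero
        result.append(out_row)
    return result

PPMI = ppmi_matrix(F)
```

For the pair (information, data), PPMI[3][1] comes out at roughly 0.57, while (information, computer) has negative PMI and is therefore clipped to 0.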