Lecture 24 Distributional Word Similarity II



  1. Lecture 24 Distributional Word Similarity II CSCE 771 Natural Language Processing • Topics • Distributional based word similarity • example PMI • context = syntactic dependencies • Readings: • NLTK book Chapter 2 (wordnet) • Text Chapter 20 • April 15, 2013

  2. Overview • Last Time • Finish up thesaurus-based similarity • … • Distributional based word similarity • Today • Last lecture's slides 21- • Distributional based word similarity II • syntax-based contexts • Readings: • Text Chapters 19, 20 • NLTK Book: Chapter 10 • Next Time: Computational Lexical Semantics II

  3. Pointwise Mutual Information (PMI) • Mutual information (Church and Hanks 1989): I(X; Y) = Σx Σy P(x,y) log2 [ P(x,y) / (P(x) P(y)) ] (eq 20.36) • Pointwise mutual information (PMI) (Fano 1961): PMI(x, y) = log2 [ P(x,y) / (P(x) P(y)) ] (eq 20.37) • assoc-PMI(w, f) = log2 [ P(w,f) / (P(w) P(f)) ] (eq 20.38)
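A minimal sketch of the PMI/PPMI arithmetic in eq. 20.37, using made-up counts; the numbers and variable names below are illustrative only and are not from the lecture.

# Pointwise mutual information from raw counts (eq. 20.37).
# All counts here are hypothetical, purely to illustrate the arithmetic.
import math

N = 1000              # total word-context observations
count_w = 30          # occurrences of word w
count_c = 50          # occurrences of context c
count_wc = 10         # joint occurrences of (w, c)

p_w = count_w / float(N)
p_c = count_c / float(N)
p_wc = count_wc / float(N)

pmi = math.log(p_wc / (p_w * p_c), 2)   # log base 2
ppmi = max(pmi, 0.0)                    # positive PMI: negative values clipped to 0
print("PMI  = %.3f" % pmi)
print("PPMI = %.3f" % ppmi)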

  4. Computing PPMI • Matrix F with W (words) rows and C (contexts) columns • fij is the frequency of word wi in context cj • pij = fij / Σi Σj fij,  pi* = Σj fij / Σi Σj fij,  p*j = Σi fij / Σi Σj fij • PMIij = log2 [ pij / (pi* p*j) ] • PPMIij = PMIij if PMIij > 0, else 0

  5. Example computing PPMI • p(w=information, c=data) = • p(w=information) = • p(c=data) = • (the values are read off the word-by-context count table in the Jurafsky & Manning “Word Similarity: Distributional Similarity I” slides; the table itself is not reproduced in this transcript)

  6. Example computing PPMI (continued) • the same count table, continuing the PPMI computation (Word Similarity: Distributional Similarity I, NLP, Jurafsky & Manning)
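A sketch of the full PPMI computation from slide 4 on a small word-by-context count matrix. The counts below are made up for illustration (they are not the table from the Jurafsky & Manning slides), and numpy is an assumption; the lecture's own code uses only NLTK.

# PPMI over a word-by-context count matrix F (slide 4).
# The counts are hypothetical; numpy is assumed for the matrix arithmetic.
import numpy as np

words    = ["apricot", "digital", "information"]
contexts = ["computer", "data", "pinch", "result"]
F = np.array([[0, 0, 1, 0],
              [2, 1, 0, 1],
              [1, 6, 0, 4]], dtype=float)   # f_ij = count of word i with context j

total = F.sum()
p_ij = F / total                    # joint probabilities p(w_i, c_j)
p_i  = F.sum(axis=1) / total        # marginals p(w_i)
p_j  = F.sum(axis=0) / total        # marginals p(c_j)

with np.errstate(divide="ignore"):          # log2(0) warnings are expected for zero counts
    pmi = np.log2(p_ij / np.outer(p_i, p_j))
ppmi = np.maximum(pmi, 0.0)                 # clip negatives (and -inf) to 0

print("p(w=information, c=data) = %.3f" % p_ij[2, 1])
print("p(w=information)         = %.3f" % p_i[2])
print("p(c=data)                = %.3f" % p_j[1])
print("PPMI(information, data)  = %.3f" % ppmi[2, 1])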

  7. Associations

  8. PMI: More data trumps smarter algorithms • “More data trumps smarter algorithms: Comparing pointwise mutual information with latent semantic analysis” • Indiana University, 2009 • http://www.indiana.edu/~clcl/Papers/BSC901.pdf • “we demonstrate that this metric benefits from training on extremely large amounts of data and correlates more closely with human semantic similarity ratings than do publicly available implementations of several more complex models.”

  9. Figure 20.10 Co-occurrence vectors based on syntactic dependencies • Dependency-based parser – a special case of shallow parsing • relations identified from “I discovered dried tangerines.” (20.32): • discover(subject, I) • I(subject-of, discover) • tangerine(obj-of, discover) • tangerine(adj-mod, dried)

  10. Defining context using syntactic info • dependency parsing • chunking • discover(subject, I) -- S → NP VP • I(subject-of, discover) • tangerine(obj-of, discover) -- VP → Verb NP • tangerine(adj-mod, dried) -- NP → Det? ADJ N
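A small sketch of how dependency relations can define context features for distributional similarity. The triples for “I discovered dried tangerines.” are written out by hand here; in practice they would come from a dependency parser, and the feature-naming scheme is an illustrative choice, not the one from the text.

# Building syntactic-context feature counts from dependency triples (Fig. 20.10).
# The triples are hand-written for the example sentence; a real system
# would obtain them from a dependency parser.
from collections import defaultdict

dependencies = [               # (word, relation, related word), as on slide 9
    ("I",         "subj-of", "discover"),
    ("tangerine", "obj-of",  "discover"),
    ("tangerine", "adj-mod", "dried"),
]

context_vectors = defaultdict(lambda: defaultdict(int))
for word, rel, other in dependencies:
    context_vectors[word][(rel, other)] += 1            # e.g. tangerine: (obj-of, discover)
    context_vectors[other][(rel + "-inv", word)] += 1   # inverse feature for the other word

for w in sorted(context_vectors):
    print("%-10s %s" % (w, sorted(context_vectors[w])))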

  11. Figure 20.11 Objects of the verb drink (Hindle 1990, ACL) • by raw frequency: it, much, and anything appear more often as objects of drink than wine does • by PMI-Assoc: wine ranks higher (it is more “drinkable”) • http://acl.ldc.upenn.edu/P/P90/P90-1034.pdf

  12. Vectors review • dot product: v · w = Σi vi wi • length: |v| = sqrt(Σi vi^2) • sim-cosine(v, w) = (v · w) / (|v| |w|)
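A minimal sketch of these three operations on two hypothetical count vectors; the numbers are illustrative only.

# Dot product, vector length, and cosine similarity (slide 12).
import math

def dot(v, w):
    return sum(vi * wi for vi, wi in zip(v, w))

def length(v):
    return math.sqrt(dot(v, v))

def sim_cosine(v, w):
    return dot(v, w) / (length(v) * length(w))

v = [1, 6, 0, 4]   # hypothetical context counts for one word
w = [2, 1, 0, 1]   # ... and for another
print("dot(v, w)        = %.3f" % dot(v, w))
print("|v|, |w|         = %.3f  %.3f" % (length(v), length(w)))
print("sim-cosine(v, w) = %.3f" % sim_cosine(v, w))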

  13. Figure 20.12 Similarity of Vectors

  14. Fig 20.13 Vector Similarity Summary

  15. Figure 20.14 Hand-built patterns for hypernyms (Hearst 1992) • Finding hypernyms (IS-A links) • (20.58) One example of red algae is Gelidium. • the pattern “one example of *** is a ***” gets roughly 500,000 hits on Google • semantic drift is a problem when such patterns are extended by bootstrapping
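A toy sketch of applying one such hand-built pattern with a regular expression. The regex and its grouping are an illustration only, not Hearst's implementation.

# Matching the "one example of X is Y" pattern => Y IS-A X (Hearst-style).
# The regular expression is only an illustrative approximation.
import re

pattern = re.compile(r"one example of (\w[\w ]*?) is (\w[\w ]*)", re.IGNORECASE)

sentence = "One example of red algae is Gelidium."
m = pattern.search(sentence)
if m:
    hypernym = m.group(1)    # "red algae"
    hyponym = m.group(2)     # "Gelidium"
    print("%s IS-A %s" % (hyponym, hypernym))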

  16. Hyponym Learning Algorithm (Snow et al. 2005) • Rely on WordNet to learn large numbers of weak hyponym patterns • Snow's algorithm: • Collect all pairs of WordNet noun concepts <ci, cj> such that ci IS-A cj • For each pair, collect all sentences containing both nouns • Parse the sentences and automatically extract every possible Hearst-style syntactic pattern from the parse trees • Use the large set of patterns as features in a logistic regression classifier • Given a new pair, extract its features and use the classifier to decide whether it is a hypernym/hyponym pair • New patterns learned: • NPH like NP • NP is a NPH • NPH called NP • NP, a NPH (appositive)
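A minimal sketch of the final classification step, with made-up pattern-count features for a handful of noun pairs. scikit-learn is an assumption here (it is not part of the lecture's toolkit), and the feature set is far smaller than the one Snow et al. extract.

# Logistic regression over Hearst-style pattern features, in the spirit of
# Snow et al. (2005).  The feature counts and labels below are invented.
from sklearn.linear_model import LogisticRegression

# feature columns: "X such as Y", "X like Y", "Y is a X", "Y called X"
X_train = [
    [3, 1, 2, 0],   # (animal, dog)      -> hypernym pair
    [0, 2, 1, 1],   # (fruit, apple)     -> hypernym pair
    [0, 0, 0, 0],   # (table, democracy) -> not a hypernym pair
    [1, 0, 0, 0],   # (car, idea)        -> not a hypernym pair
]
y_train = [1, 1, 0, 0]

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict([[2, 0, 1, 0]]))   # classify an unseen pair from its pattern counts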

  17. Vector similarities from Lin 1998 • hope (N): • optimism 0.141, chance 0.137, expectation 0.137, prospect 0.126, dream 0.119, desire 0.118, fear 0.116, effort 0.111, confidence 0.109, promise 0.108 • hope (V): • would like 0.158, wish 0.140, … • brief (N): • legal brief 0.256, affidavit 0.191, … • brief (A): • lengthy 0.256, hour-long 0.191, short 0.174, extended 0.163, … • full lists on page 667 of the text

  18. Supersenses • the 26 broad-category “lexicographer class” labels from WordNet (e.g., noun.animal, noun.food, noun.artifact, verb.motion)
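NLTK exposes these lexicographer classes on each synset, so supersense labels can be read off directly. A small sketch follows; note that in recent NLTK releases lexname() is a method, while the older API used elsewhere on these slides exposes it as an attribute.

# Reading WordNet supersense (lexicographer file) labels with NLTK.
# lexname() is a method in current NLTK; older versions use the .lexname attribute.
from nltk.corpus import wordnet as wn

for name in ['dog.n.01', 'car.n.01', 'hope.n.01', 'discover.v.01']:
    synset = wn.synset(name)
    print("%-15s %s" % (name, synset.lexname()))
# dog.n.01, for example, is tagged noun.animal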

  19. Figure 20.15 Semantic Role Labelling

  20. Figure 20.16

  21. google(WordNet NLTK)

  22. wn01.py • # Wordnet examples from nltk.googlecode.com • import nltk • from nltk.corpus import wordnet as wn • motorcar = wn.synset('car.n.01') • types_of_motorcar = motorcar.hyponyms() • types_of_motorcar[26] • print wn.synset('ambulance.n.01') • print sorted([lemma.name for synset in types_of_motorcar for lemma in synset.lemmas]) • http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html

  23. wn01.py continued • print "wn.synsets('dog', pos=wn.VERB)= ", wn.synsets('dog', pos=wn.VERB) • print wn.synset('dog.n.01') • ### Synset('dog.n.01') • print wn.synset('dog.n.01').definition • ###'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds' • print wn.synset('dog.n.01').examples • ### ['the dog barked all night']

  24. wn01.py continued • print wn.synset('dog.n.01').lemmas • ###[Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'), Lemma('dog.n.01.Canis_familiaris')] • print [lemma.name for lemma in wn.synset('dog.n.01').lemmas] • ### ['dog', 'domestic_dog', 'Canis_familiaris'] • print wn.lemma('dog.n.01.dog').synset

  25. Section 2 synsets, hypernyms, hyponyms • # Section 2 Synsets, hypernyms, hyponyms • import nltk • from nltk.corpus import wordnet as wn • dog = wn.synset('dog.n.01') • print "dog hypernyms=", dog.hypernyms() • ### dog hypernyms= [Synset('domestic_animal.n.01'), Synset('canine.n.02')] • print "dog hyponyms=", dog.hyponyms() • print "dog holonyms=", dog.member_holonyms() • print "dog root_hypernyms=", dog.root_hypernyms() • good = wn.synset('good.a.01') • ### print "good.antonyms()=", good.antonyms()  # commented out: antonyms are a lemma relation, not a synset relation • print "good.lemmas[0].antonyms()=", good.lemmas[0].antonyms()

  26. wn03-Lemmas.py • ### Section 3 Lemmas • from nltk.corpus import wordnet as wn • eat = wn.lemma('eat.v.03.eat') • print eat • print eat.key • print eat.count() • print wn.lemma_from_key(eat.key) • print wn.lemma_from_key(eat.key).synset • print wn.lemma_from_key('feebleminded%5:00:00:retarded:00') • for lemma in wn.synset('eat.v.03').lemmas: • print lemma, lemma.count() • for lemma in wn.lemmas('eat', 'v'): • print lemma, lemma.count() • vocal = wn.lemma('vocal.a.01.vocal') • print vocal.derivationally_related_forms() • # [Lemma('vocalize.v.02.vocalize')] • print vocal.pertainyms() • # [Lemma('voice.n.02.voice')] • print vocal.antonyms()

  27. wn04-VerbFrames.py • # Section 4 Verb Frames • from nltk.corpus import wordnet as wn • print wn.synset('think.v.01').frame_ids • for lemma in wn.synset('think.v.01').lemmas: • print lemma, lemma.frame_ids • print lemma.frame_strings • print wn.synset('stretch.v.02').frame_ids • for lemma in wn.synset('stretch.v.02').lemmas: • print lemma, lemma.frame_ids • print lemma.frame_strings

  28. wn05-Similarity.py • ### Section 5 Similarity • import nltk • from nltk.corpus import wordnet as wn • dog = wn.synset('dog.n.01') • cat = wn.synset('cat.n.01') • print dog.path_similarity(cat) • print dog.lch_similarity(cat) • print dog.wup_similarity(cat) • from nltk.corpus import wordnet_ic • brown_ic = wordnet_ic.ic('ic-brown.dat') • semcor_ic = wordnet_ic.ic('ic-semcor.dat')

  29. wn05-Similarity.py continued • from nltk.corpus import genesis • genesis_ic = wn.ic(genesis, False, 0.0) • print dog.res_similarity(cat, brown_ic) • print dog.res_similarity(cat, genesis_ic) • print dog.jcn_similarity(cat, brown_ic) • print dog.jcn_similarity(cat, genesis_ic) • print dog.lin_similarity(cat, semcor_ic)

  30. wn06-AccessToAllSynsets.py • ### Section 6 access to all synsets • import nltk • from nltk.corpus import wordnet as wn • for synset in list(wn.all_synsets('n'))[:10]: • print synset • wn.synsets('dog') • wn.synsets('dog', pos='v') • from itertools import islice • for synset in islice(wn.all_synsets('n'), 5): • print synset, synset.hypernyms()

  31. wn07-Morphy.py • # Wordnet in NLTK • # http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html • import nltk • from nltk.corpus import wordnet as wn • ### Section 7 Morphy • print wn.morphy('denied', wn.NOUN) • print wn.synsets('denied', wn.NOUN) • print wn.synsets('denied', wn.VERB)

  32. 8 Regression Tests • Bug 85: morphy returns the base form of a word, if its input is given as a base form for a POS for which that word is not defined: • >>> wn.synsets('book', wn.NOUN) • [Synset('book.n.01'), Synset('book.n.02'), Synset('record.n.05'), Synset('script.n.01'), Synset('ledger.n.01'), Synset('book.n.06'), Synset('book.n.07'), Synset('koran.n.01'), Synset('bible.n.01'), Synset('book.n.10'), Synset('book.n.11')] • >>> wn.synsets('book', wn.ADJ) • [] • >>> wn.morphy('book', wn.NOUN) • 'book' • >>> wn.morphy('book', wn.ADJ)

  33. nltk.corpus.reader.wordnet • ic(self, corpus, weight_senses_equally=False, smoothing=1.0): creates an information content lookup dictionary from a corpus. • http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.wordnet-pysrc.html#WordNetCorpusReader.ic • def demo(): • import nltk • print('loading wordnet') • wn = WordNetCorpusReader(nltk.data.find('corpora/wordnet')) • print('done loading') • S = wn.synset • L = wn.lemma

  34. root_hypernyms • def root_hypernyms(self): • """Get the topmost hypernyms of this synset in WordNet.""" • result = [] • seen = set() • todo = [self] • while todo: • next_synset = todo.pop() • if next_synset not in seen: • seen.add(next_synset) • next_hypernyms = next_synset.hypernyms() + … • return result
