Lecture 24 Distributional Word Similarity II



  1. Lecture 24 Distributional Word Similarity II CSCE 771 Natural Language Processing • Topics • Distributional based word similarity • example PMI • context = syntactic dependencies • Readings: • NLTK book Chapter 2 (wordnet) • Text Chapter 20 • April 15, 2013

  2. Overview • Last Time • Finish up thesaurus-based similarity • … • Distributional based word similarity • Today • Last lecture's slides 21- • Distributional based word similarity II • syntax-based contexts • Readings: • Text Chapters 19, 20 • NLTK Book: Chapter 10 • Next Time: Computational Lexical Semantics II

  3. Pointwise Mutual Information (PMI) • Mutual information (Church and Hanks 1989): I(X; Y) = Σx Σy P(x,y) log2 [ P(x,y) / (P(x) P(y)) ] (eq 20.36) • Pointwise mutual information (PMI) (Fano 1961): PMI(x, y) = log2 [ P(x,y) / (P(x) P(y)) ] (eq 20.37) • assoc-PMI(w, f) = log2 [ P(w,f) / (P(w) P(f)) ] (eq 20.38)
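A minimal sketch of the PMI/PPMI arithmetic in eq. 20.37, using made-up counts; the numbers and variable names below are illustrative only and are not from the lecture.

# Pointwise mutual information from raw counts (eq. 20.37).
# All counts here are hypothetical, purely to illustrate the arithmetic.
import math

N = 1000              # total word-context observations
count_w = 30          # occurrences of word w
count_c = 50          # occurrences of context c
count_wc = 10         # joint occurrences of (w, c)

p_w = count_w / float(N)
p_c = count_c / float(N)
p_wc = count_wc / float(N)

pmi = math.log(p_wc / (p_w * p_c), 2)   # log base 2
ppmi = max(pmi, 0.0)                    # positive PMI: negative values clipped to 0
print("PMI  = %.3f" % pmi)
print("PPMI = %.3f" % ppmi)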

  4. Computing PPMI • Matrix F with W (words) rows and C (contexts) columns • fij is the frequency of word wi in context cj • pij = fij / Σi Σj fij,  pi* = Σj fij / Σi Σj fij,  p*j = Σi fij / Σi Σj fij • PMIij = log2 [ pij / (pi* p*j) ] • PPMIij = PMIij if PMIij > 0, else 0

  5. Example computing PPMI • p(w=information, c=data) = • p(w=information) = • p(c=data) = • (the values are read off the word-by-context count table in the Jurafsky & Manning “Word Similarity: Distributional Similarity I” slides; the table itself is not reproduced in this transcript)

  6. Example computing PPMI (continued) • the same count table, continuing the PPMI computation (Word Similarity: Distributional Similarity I, NLP, Jurafsky & Manning)
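A sketch of the full PPMI computation from slide 4 on a small word-by-context count matrix. The counts below are made up for illustration (they are not the table from the Jurafsky & Manning slides), and numpy is an assumption; the lecture's own code uses only NLTK.

# PPMI over a word-by-context count matrix F (slide 4).
# The counts are hypothetical; numpy is assumed for the matrix arithmetic.
import numpy as np

words    = ["apricot", "digital", "information"]
contexts = ["computer", "data", "pinch", "result"]
F = np.array([[0, 0, 1, 0],
              [2, 1, 0, 1],
              [1, 6, 0, 4]], dtype=float)   # f_ij = count of word i with context j

total = F.sum()
p_ij = F / total                    # joint probabilities p(w_i, c_j)
p_i  = F.sum(axis=1) / total        # marginals p(w_i)
p_j  = F.sum(axis=0) / total        # marginals p(c_j)

with np.errstate(divide="ignore"):          # log2(0) warnings are expected for zero counts
    pmi = np.log2(p_ij / np.outer(p_i, p_j))
ppmi = np.maximum(pmi, 0.0)                 # clip negatives (and -inf) to 0

print("p(w=information, c=data) = %.3f" % p_ij[2, 1])
print("p(w=information)         = %.3f" % p_i[2])
print("p(c=data)                = %.3f" % p_j[1])
print("PPMI(information, data)  = %.3f" % ppmi[2, 1])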

  7. Associations

  8. PMI: More data trumps smarter algorithms • “More data trumps smarter algorithms: Comparing pointwise mutual information with latent semantic analysis” • Indiana University, 2009 • http://www.indiana.edu/~clcl/Papers/BSC901.pdf • “we demonstrate that this metric benefits from training on extremely large amounts of data and correlates more closely with human semantic similarity ratings than do publicly available implementations of several more complex models.”

  9. Figure 20.10 Co-occurrence vectors based on syntactic dependencies • Dependency-based parser – a special case of shallow parsing • relations identified from “I discovered dried tangerines.” (20.32): • discover(subject, I) • I(subject-of, discover) • tangerine(obj-of, discover) • tangerine(adj-mod, dried)

  10. Defining context using syntactic info • dependency parsing • chunking • discover(subject, I) -- S → NP VP • I(subject-of, discover) • tangerine(obj-of, discover) -- VP → Verb NP • tangerine(adj-mod, dried) -- NP → Det? ADJ N
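A small sketch of how dependency relations can define context features for distributional similarity. The triples for “I discovered dried tangerines.” are written out by hand here; in practice they would come from a dependency parser, and the feature-naming scheme is an illustrative choice, not the one from the text.

# Building syntactic-context feature counts from dependency triples (Fig. 20.10).
# The triples are hand-written for the example sentence; a real system
# would obtain them from a dependency parser.
from collections import defaultdict

dependencies = [               # (word, relation, related word), as on slide 9
    ("I",         "subj-of", "discover"),
    ("tangerine", "obj-of",  "discover"),
    ("tangerine", "adj-mod", "dried"),
]

context_vectors = defaultdict(lambda: defaultdict(int))
for word, rel, other in dependencies:
    context_vectors[word][(rel, other)] += 1            # e.g. tangerine: (obj-of, discover)
    context_vectors[other][(rel + "-inv", word)] += 1   # inverse feature for the other word

for w in sorted(context_vectors):
    print("%-10s %s" % (w, sorted(context_vectors[w])))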

  11. Figure 20.11 Objects of the verb drink (Hindle 1990, ACL) • by raw frequency: it, much, and anything appear more often as objects of drink than wine does • by PMI-Assoc: wine ranks higher (it is more “drinkable”) • http://acl.ldc.upenn.edu/P/P90/P90-1034.pdf

  12. Vectors review • dot product: v · w = Σi vi wi • length: |v| = sqrt(Σi vi^2) • sim-cosine(v, w) = (v · w) / (|v| |w|)
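A minimal sketch of these three operations on two hypothetical count vectors; the numbers are illustrative only.

# Dot product, vector length, and cosine similarity (slide 12).
import math

def dot(v, w):
    return sum(vi * wi for vi, wi in zip(v, w))

def length(v):
    return math.sqrt(dot(v, v))

def sim_cosine(v, w):
    return dot(v, w) / (length(v) * length(w))

v = [1, 6, 0, 4]   # hypothetical context counts for one word
w = [2, 1, 0, 1]   # ... and for another
print("dot(v, w)        = %.3f" % dot(v, w))
print("|v|, |w|         = %.3f  %.3f" % (length(v), length(w)))
print("sim-cosine(v, w) = %.3f" % sim_cosine(v, w))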

  13. Figure 20.12 Similarity of Vectors

  14. Fig 20.13 Vector Similarity Summary

  15. Figure 20.14 Hand-built patterns for hypernyms (Hearst 1992) • Finding hypernyms (IS-A links) • (20.58) One example of red algae is Gelidium. • the pattern “one example of *** is a ***” gets roughly 500,000 hits on Google • semantic drift is a problem when such patterns are extended by bootstrapping
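A toy sketch of applying one such hand-built pattern with a regular expression. The regex and its grouping are an illustration only, not Hearst's implementation.

# Matching the "one example of X is Y" pattern => Y IS-A X (Hearst-style).
# The regular expression is only an illustrative approximation.
import re

pattern = re.compile(r"one example of (\w[\w ]*?) is (\w[\w ]*)", re.IGNORECASE)

sentence = "One example of red algae is Gelidium."
m = pattern.search(sentence)
if m:
    hypernym = m.group(1)    # "red algae"
    hyponym = m.group(2)     # "Gelidium"
    print("%s IS-A %s" % (hyponym, hypernym))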

  16. Hyponym Learning Algorithm (Snow et al. 2005) • Rely on WordNet to learn large numbers of weak hyponym patterns • Snow's algorithm: • Collect all pairs of WordNet noun concepts <ci, cj> such that ci IS-A cj • For each pair, collect all sentences containing both nouns • Parse the sentences and automatically extract every possible Hearst-style syntactic pattern from the parse trees • Use the large set of patterns as features in a logistic regression classifier • Given a new pair, extract its features and use the classifier to decide whether it is a hypernym/hyponym pair • New patterns learned: • NPH like NP • NP is a NPH • NPH called NP • NP, a NPH (appositive)
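A minimal sketch of the final classification step, with made-up pattern-count features for a handful of noun pairs. scikit-learn is an assumption here (it is not part of the lecture's toolkit), and the feature set is far smaller than the one Snow et al. extract.

# Logistic regression over Hearst-style pattern features, in the spirit of
# Snow et al. (2005).  The feature counts and labels below are invented.
from sklearn.linear_model import LogisticRegression

# feature columns: "X such as Y", "X like Y", "Y is a X", "Y called X"
X_train = [
    [3, 1, 2, 0],   # (animal, dog)      -> hypernym pair
    [0, 2, 1, 1],   # (fruit, apple)     -> hypernym pair
    [0, 0, 0, 0],   # (table, democracy) -> not a hypernym pair
    [1, 0, 0, 0],   # (car, idea)        -> not a hypernym pair
]
y_train = [1, 1, 0, 0]

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict([[2, 0, 1, 0]]))   # classify an unseen pair from its pattern counts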

  17. Vector similarities from Lin 1998 • hope (N): • optimism 0.141, chance 0.137, expectation 0.137, prospect 0.126, dream 0.119, desire 0.118, fear 0.116, effort 0.111, confidence 0.109, promise 0.108 • hope (V): • would like 0.158, wish 0.140, … • brief (N): • legal brief 0.256, affidavit 0.191, … • brief (A): • lengthy 0.256, hour-long 0.191, short 0.174, extended 0.163, … • full lists on page 667 of the text

  18. Supersenses • the 26 broad-category “lexicographer class” labels from WordNet (e.g., noun.animal, noun.food, noun.artifact, verb.motion)
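NLTK exposes these lexicographer classes on each synset, so supersense labels can be read off directly. A small sketch follows; note that in recent NLTK releases lexname() is a method, while the older API used elsewhere on these slides exposes it as an attribute.

# Reading WordNet supersense (lexicographer file) labels with NLTK.
# lexname() is a method in current NLTK; older versions use the .lexname attribute.
from nltk.corpus import wordnet as wn

for name in ['dog.n.01', 'car.n.01', 'hope.n.01', 'discover.v.01']:
    synset = wn.synset(name)
    print("%-15s %s" % (name, synset.lexname()))
# dog.n.01, for example, is tagged noun.animal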

  19. Figure 20.15 Semantic Role Labelling

  20. Figure 20.16

  21. google(WordNet NLTK)

  22. wn01.py • # Wordnet examples from nltk.googlecode.com • import nltk • from nltk.corpus import wordnet as wn • motorcar = wn.synset('car.n.01') • types_of_motorcar = motorcar.hyponyms() • types_of_motorcar[26] • print wn.synset('ambulance.n.01') • print sorted([lemma.name for synset in types_of_motorcar for lemma in synset.lemmas]) • http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html

  23. wn01.py continued • print "wn.synsets('dog', pos=wn.VERB)= ", wn.synsets('dog', pos=wn.VERB) • print wn.synset('dog.n.01') • ### Synset('dog.n.01') • print wn.synset('dog.n.01').definition • ###'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds' • print wn.synset('dog.n.01').examples • ### ['the dog barked all night']

  24. wn01.py continued • print wn.synset('dog.n.01').lemmas • ###[Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'), Lemma('dog.n.01.Canis_familiaris')] • print [lemma.name for lemma in wn.synset('dog.n.01').lemmas] • ### ['dog', 'domestic_dog', 'Canis_familiaris'] • print wn.lemma('dog.n.01.dog').synset

  25. Section 2 synsets, hypernyms, hyponyms • # Section 2 Synsets, hypernyms, hyponyms • import nltk • from nltk.corpus import wordnet as wn • dog = wn.synset('dog.n.01') • print "dog hypernyms=", dog.hypernyms() • ### dog hypernyms= [Synset('domestic_animal.n.01'), Synset('canine.n.02')] • print "dog hyponyms=", dog.hyponyms() • print "dog holonyms=", dog.member_holonyms() • print "dog root_hypernyms=", dog.root_hypernyms() • good = wn.synset('good.a.01') • ### print "good.antonyms()=", good.antonyms()  # commented out: antonyms are a lemma relation, not a synset relation • print "good.lemmas[0].antonyms()=", good.lemmas[0].antonyms()

  26. wn03-Lemmas.py • ### Section 3 Lemmas • from nltk.corpus import wordnet as wn • eat = wn.lemma('eat.v.03.eat') • print eat • print eat.key • print eat.count() • print wn.lemma_from_key(eat.key) • print wn.lemma_from_key(eat.key).synset • print wn.lemma_from_key('feebleminded%5:00:00:retarded:00') • for lemma in wn.synset('eat.v.03').lemmas: • print lemma, lemma.count() • for lemma in wn.lemmas('eat', 'v'): • print lemma, lemma.count() • vocal = wn.lemma('vocal.a.01.vocal') • print vocal.derivationally_related_forms() • # [Lemma('vocalize.v.02.vocalize')] • print vocal.pertainyms() • # [Lemma('voice.n.02.voice')] • print vocal.antonyms()

  27. wn04-VerbFrames.py • # Section 4 Verb Frames • from nltk.corpus import wordnet as wn • print wn.synset('think.v.01').frame_ids • for lemma in wn.synset('think.v.01').lemmas: • print lemma, lemma.frame_ids • print lemma.frame_strings • print wn.synset('stretch.v.02').frame_ids • for lemma in wn.synset('stretch.v.02').lemmas: • print lemma, lemma.frame_ids • print lemma.frame_strings

  28. wn05-Similarity.py • ### Section 5 Similarity • import nltk • from nltk.corpus import wordnet as wn • dog = wn.synset('dog.n.01') • cat = wn.synset('cat.n.01') • print dog.path_similarity(cat) • print dog.lch_similarity(cat) • print dog.wup_similarity(cat) • from nltk.corpus import wordnet_ic • brown_ic = wordnet_ic.ic('ic-brown.dat') • semcor_ic = wordnet_ic.ic('ic-semcor.dat')

  29. wn05-Similarity.py continued • from nltk.corpus import genesis • genesis_ic = wn.ic(genesis, False, 0.0) • print dog.res_similarity(cat, brown_ic) • print dog.res_similarity(cat, genesis_ic) • print dog.jcn_similarity(cat, brown_ic) • print dog.jcn_similarity(cat, genesis_ic) • print dog.lin_similarity(cat, semcor_ic)

  30. wn06-AccessToAllSynsets.py • ### Section 6 access to all synsets • import nltk • from nltk.corpus import wordnet as wn • for synset in list(wn.all_synsets('n'))[:10]: • print synset • wn.synsets('dog') • wn.synsets('dog', pos='v') • from itertools import islice • for synset in islice(wn.all_synsets('n'), 5): • print synset, synset.hypernyms()

  31. wn07-Morphy.py • # Wordnet in NLTK • # http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html • import nltk • from nltk.corpus import wordnet as wn • ### Section 7 Morphy • print wn.morphy('denied', wn.NOUN) • print wn.synsets('denied', wn.NOUN) • print wn.synsets('denied', wn.VERB)

  32. 8 Regression Tests • Bug 85: morphy returns the base form of a word, if its input is given as a base form for a POS for which that word is not defined: • >>> wn.synsets('book', wn.NOUN) • [Synset('book.n.01'), Synset('book.n.02'), Synset('record.n.05'), Synset('script.n.01'), Synset('ledger.n.01'), Synset('book.n.06'), Synset('book.n.07'), Synset('koran.n.01'), Synset('bible.n.01'), Synset('book.n.10'), Synset('book.n.11')] • >>> wn.synsets('book', wn.ADJ) • [] • >>> wn.morphy('book', wn.NOUN) • 'book' • >>> wn.morphy('book', wn.ADJ)

  33. nltk.corpus.reader.wordnet • ic(self, corpus, weight_senses_equally=False, smoothing=1.0): creates an information content lookup dictionary from a corpus. • http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.wordnet-pysrc.html#WordNetCorpusReader.ic • def demo(): • import nltk • print('loading wordnet') • wn = WordNetCorpusReader(nltk.data.find('corpora/wordnet')) • print('done loading') • S = wn.synset • L = wn.lemma

  34. root_hypernyms • def root_hypernyms(self): • """Get the topmost hypernyms of this synset in WordNet.""" • result = [] • seen = set() • todo = [self] • while todo: • next_synset = todo.pop() • if next_synset not in seen: • seen.add(next_synset) • next_hypernyms = next_synset.hypernyms() + … • return result
