
WordNet



Presentation Transcript


  1. WordNet WordNet, WSD

  2. WordNet • What is WordNet? • Miller 95: “WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets.”

  3. WordNet • Go to the main WordNet site: http://wordnet.princeton.edu/ • Open the wordnet folder on pongo: ~/dropbox/570/wordnet/dict

  4. WordNet Vocabulary • See glossary at: http://wordnet.princeton.edu/gloss • synset: A synonym set; a set of words that are interchangeable in some context • lemma: lower case ASCII text of word as found in the WordNet database index files • lexical pointer: A lexical pointer indicates a relation between words in synsets

  5. Navigating WordNet files • data.* files – the actual network files (synsets) • index.* files – contain lower case instances of all words in WordNet, with pointers to the synset entries in the network

  6. WordNet data file • Two synset entries:
       00045430 04 n 01 performance 3 003 @ 00033580 n 0000 ~ 00045680 n 0000 ~ 00045874 n 0000 | any recognized accomplishment; "they admired his performance under stress"
       00045680 04 n 01 overachievement 0 003 @ 00045430 n 0000 + 02537922 v 0101 ! 00045874 n 0101 | better than expected performance (better than might have been predicted from intelligence tests)
     • Fields in the first entry: synset file offset (00045430) • file number (04) • synset type (n) • # words in synset (01) • word (performance) • # pointers to other synsets (003) • for each pointer: type of pointer (@), pointer (00033580), POS (n) • See: wndb
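
The field layout of a data file entry can be read with a short parser. The following is a minimal Python sketch, not part of the original slides: the names `Pointer`, `Synset`, and `parse_data_line` are hypothetical, and the handling of the word count (two hexadecimal digits) and the pointer count (three decimal digits) follows the wndb description of the format. Verb frames, which can follow the pointers in verb entries, are ignored here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Pointer:
    symbol: str         # type of pointer, e.g. "@" for hypernym
    offset: str         # synset file offset of the target synset
    pos: str            # part of speech of the target synset
    source_target: str  # four hex digits: source and target word numbers


@dataclass
class Synset:
    offset: str
    lex_filenum: str
    ss_type: str
    words: List[Tuple[str, str]]  # (word, lex_id) pairs
    pointers: List[Pointer]
    gloss: str


def parse_data_line(line: str) -> Synset:
    """Parse one synset entry from a data.* file (hypothetical helper)."""
    # Everything after " | " is the gloss; the rest is space-separated fields.
    fields_part, _, gloss = line.partition(" | ")
    f = fields_part.split()
    offset, lex_filenum, ss_type = f[0], f[1], f[2]
    w_cnt = int(f[3], 16)  # number of words, written as two hex digits
    i = 4
    words = []
    for _ in range(w_cnt):
        words.append((f[i], f[i + 1]))  # word followed by its lex_id
        i += 2
    p_cnt = int(f[i])  # number of pointers, written as three decimal digits
    i += 1
    pointers = []
    for _ in range(p_cnt):
        pointers.append(Pointer(*f[i:i + 4]))
        i += 4
    return Synset(offset, lex_filenum, ss_type, words, pointers, gloss.strip())


# The "overachievement" entry shown on the slide.
line = ("00045680 04 n 01 overachievement 0 003 @ 00045430 n 0000 "
        "+ 02537922 v 0101 ! 00045874 n 0101 | better than expected "
        "performance (better than might have been predicted from "
        "intelligence tests)")
syn = parse_data_line(line)
print(syn.words)
print([(p.symbol, p.offset, p.pos) for p in syn.pointers])
```

The hypernym pointer (`@ 00045430`) in this entry points back to the "performance" synset, which in turn lists `00045680` as a hyponym (`~`).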

  7. Pointer symbols • For nouns:
       !     Antonym
       @     Hypernym
       ~     Hyponym
       #m    Member holonym
       #s    Substance holonym
       #p    Part holonym
       %m    Member meronym
       %s    Substance meronym
       %p    Part meronym
       =     Attribute
       +     Derivationally related form
     • See: wninput
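
When reading pointers programmatically, the noun symbols listed above can be kept in a lookup table. This is a small illustrative sketch in Python; `NOUN_POINTERS` and `pointer_name` are hypothetical names, and the table holds only the symbols shown on this slide, so any other symbol is reported as unknown rather than guessed.

```python
# Noun pointer symbols as listed on the slide (see wninput for the full set).
NOUN_POINTERS = {
    "!": "Antonym",
    "@": "Hypernym",
    "~": "Hyponym",
    "#m": "Member holonym",
    "#s": "Substance holonym",
    "#p": "Part holonym",
    "%m": "Member meronym",
    "%s": "Substance meronym",
    "%p": "Part meronym",
    "=": "Attribute",
    "+": "Derivationally related form",
}


def pointer_name(symbol: str) -> str:
    """Return the relation name for a noun pointer symbol (hypothetical helper)."""
    return NOUN_POINTERS.get(symbol, "unknown (" + symbol + ")")


# Pointer symbols of the "performance" entry: one hypernym, two hyponyms.
names = [pointer_name(s) for s in ["@", "~", "~"]]
print(names)
```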

  8. WordNet index file • One index entry:
       abomination n 3 2 @ + 3 0 09613960 07401317 00734041
     • Fields: lemma (word): abomination • POS: n • # synsets: 3 • # pointers: 2 • pointers: @ + • synset file offsets: 09613960 07401317 00734041
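
An index entry can be split into its fields in the same way. The sketch below is illustrative only: `parse_index_line` is a hypothetical helper, and the two numbers between the pointer symbols and the offsets (`3 0` in the example) are read as `sense_cnt` and `tagsense_cnt`, following the wndb description of the format rather than anything labeled on the slide.

```python
def parse_index_line(line: str) -> dict:
    """Parse one entry from an index.* file (hypothetical helper)."""
    f = line.split()
    lemma, pos = f[0], f[1]
    synset_cnt = int(f[2])           # number of synsets the lemma appears in
    p_cnt = int(f[3])                # number of distinct pointer types
    ptr_symbols = f[4:4 + p_cnt]     # e.g. ["@", "+"]
    i = 4 + p_cnt
    sense_cnt = int(f[i])            # assumption: field names taken from wndb
    tagsense_cnt = int(f[i + 1])
    offsets = f[i + 2:i + 2 + synset_cnt]  # synset file offsets into data.*
    return {
        "lemma": lemma,
        "pos": pos,
        "synset_cnt": synset_cnt,
        "ptr_symbols": ptr_symbols,
        "sense_cnt": sense_cnt,
        "tagsense_cnt": tagsense_cnt,
        "offsets": offsets,
    }


entry = parse_index_line("abomination n 3 2 @ + 3 0 09613960 07401317 00734041")
print(entry["lemma"], entry["ptr_symbols"], entry["offsets"])
```

Each offset is the position of the corresponding synset entry in the data file for that part of speech, which is how the index files point into the network files.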

  9. WordNet tools • Many, many tools • General documentation: http://wordnet.princeton.edu/doc • Online query and lookup: http://wordnet.princeton.edu/perl/webwn • APIs and tools: http://wordnet.princeton.edu/links • WordNet::Similarity: http://wn-similarity.sourceforge.net/ • WordNet::Similarity web interface: http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi

  10. WordNet and WSD • Mihalcea 2002 describes a system to sense-tag text using WordNet (and related tools and resources)

  11. Mihalcea 2002 • Some tools and resources described: • Senseval • http://www.senseval.org/ • Evaluation exercises for Word Sense Disambiguation • Senseval-1 through Senseval-3 held over the last several years, with workshops at ACL • Senseval-4 coming up • Data and materials from Senseval-3 can be downloaded • Some useful materials for multiple languages • Materials and test data for English, Italian, Basque, Catalan, Chinese, Romanian, and Spanish

  12. Mihalcea 2002 • Some tools and resources described: • SemCor • Sense-tagged Brown corpus • Created at Princeton • Used for training WSD systems • Can be downloaded from Mihalcea’s web site: http://www.cs.unt.edu/~rada/downloads.html • We’re also planning on installing it on pongo

  13. McCarthy et al 2004 • Task: find the predominant word senses in untagged text • Unlike Mihalcea 2002, did not rely on a supervised method using SemCor • Built a thesaurus from raw text and WordNet • Intuition: a word’s predominant sense is affected by genre, domain, or text type, so it is better determined from context in an untagged corpus • Rather than relying on SemCor’s 250,000 words, where the word senses are rather limited

  14. McCarthy et al • Thesaurus development relies on dependencies between “neighbors” • Look at distributional similarities between a word and its neighbors

  15. McCarthy et al • Experimented with several similarity measures available in WordNet::Similarity • First experiment used SemCor to see how well the unsupervised system worked • 2595 polysemous nouns in SemCor

  16. McCarthy et al • Experiment #2 against SENSEVAL-2 English All Words Data • Comparison between the precision and recall for SemCor vs. their automatic data (and the SENSEVAL ceiling)
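
For readers unfamiliar with how precision and recall are usually scored in WSD evaluations, here is a small Python sketch. It assumes the common Senseval-style definitions (precision = correct answers / attempted answers, recall = correct answers / all test instances); the function name `precision_recall` and the toy data are made up for illustration and are not results from the paper.

```python
def precision_recall(gold: dict, predicted: dict) -> tuple:
    """Score sense predictions against gold labels (hypothetical helper).

    gold maps instance ids to the correct sense; predicted maps instance ids
    to the system's sense and may leave some instances unanswered.
    """
    attempted = [k for k in predicted if k in gold]
    correct = sum(1 for k in attempted if predicted[k] == gold[k])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data: four test instances, the system answers three and gets two right.
gold = {"t1": "s1", "t2": "s2", "t3": "s1", "t4": "s3"}
predicted = {"t1": "s1", "t2": "s1", "t3": "s1"}
p, r = precision_recall(gold, predicted)
print(p, r)
```

A system that answers every instance has equal precision and recall under these definitions; the two diverge only when some instances are left unanswered.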

  17. McCarthy et al • Some experiments with domain specific corpora gave these results:
