
JIGSAW: an Algorithm for Word Sense Disambiguation


Presentation Transcript


  1. JIGSAW: an Algorithm for Word Sense Disambiguation Dipartimento di Informatica University of Bari Pierpaolo Basile (basilepp@di.uniba.it) Giovanni Semeraro (semeraro@di.uniba.it)

  2. Word Sense Disambiguation • Word Sense Disambiguation (WSD) is the problem of selecting a sense for a word from a set of predefined possibilities • The sense inventory usually comes from a dictionary or thesaurus • Approaches: knowledge-intensive methods, supervised learning, and (sometimes) bootstrapping

  3. All Words WSD • Attempt to disambiguate all open-class words in a text: • “He put his suit over the back of the chair” • How? • Knowledge-based approaches • Use information from dictionaries • Position in a semantic network • Use discourse properties • Minimally supervised approaches • Most frequent sense
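For a concrete view of a sense inventory, the candidate WordNet senses of "chair" from the example sentence can be listed with NLTK's WordNet reader (NLTK is an assumption here; the slides only say the inventory comes from a dictionary such as WordNet):

```python
# List the sense inventory for "chair" in "He put his suit over the back of
# the chair" (NLTK's bundled WordNet; sense order reflects frequency).
from nltk.corpus import wordnet as wn

for s in wn.synsets('chair', pos=wn.NOUN):
    print(s.name(), '-', s.definition())
```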

  4. JIGSAW • Knowledge-based WSD algorithm • Disambiguation of words in a text by exploiting WordNet senses • Combination of three different strategies to disambiguate nouns, verbs, adjectives and adverbs • Main motivation: the effectiveness of a WSD algorithm is strongly influenced by the POS tag of the target word

  5. JIGSAW algorithm • Input: document d = {w1, w2, ..., wh} • Output: list of WordNet synsets X = {s1, s2, ..., sk} • each element si is obtained by disambiguating the target word wi • based on the information obtained from WordNet about the words in the context • context C of the target word: a window of n words to the left and another n words to the right, for a total of 2n surrounding words • For each word, JIGSAW adopts a different strategy based on its POS tag
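A minimal sketch of this top-level loop, assuming NLTK's WordNet and using a placeholder most-frequent-sense strategy in place of the real jigsaw_nouns / jigsaw_verbs / jigsaw_others routines (the helper names and the placeholder strategy are illustrative, not the slides'):

```python
from nltk.corpus import wordnet as wn

def first_sense(word, context, pos):
    """Placeholder per-POS strategy: return the most frequent (first) sense."""
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None

def jigsaw(tagged_words, n=3):
    """tagged_words: list of (lemma, pos) pairs, pos in {'n', 'v', 'a', 'r'}."""
    results = []
    for i, (word, pos) in enumerate(tagged_words):
        # context C: n words to the left and n to the right (2n surrounding words)
        left = tagged_words[max(0, i - n):i]
        right = tagged_words[i + 1:i + 1 + n]
        context = [w for w, _ in left + right]
        # dispatch on the POS tag; real JIGSAW calls jigsaw_nouns / _verbs / _others
        results.append(first_sense(word, context, pos))
    return results

print(jigsaw([('cat', 'n'), ('hunt', 'v'), ('mouse', 'n')]))
```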

  6. JIGSAW_nouns: the idea • Based on the Resnik [Resnik95] algorithm for disambiguating noun groups • Given a set of nouns W = {w1, w2, ..., wn} from document d: • each wi has an associated sense inventory Si = {si1, si2, ..., sik} of possible senses • Goal: assign each wi the most appropriate sense sih ∈ Si, according to the similarity of wi with the other nouns in W

  7. JIGSAW_nouns: semantic similarity • Example: "The white cat is hunting the mouse", target w = cat, context C = {mouse} • Sense inventories: Wcat = {cat#1, cat#2} = {02037721, 00847815}, T = {mouse#1, mouse#2} = {02244530, 03651364} • sim(cat#1 "feline mammal…", mouse#1 "any of numerous small rodents…") = 0.726 • sim(cat#2 "computerized axial tomography…", mouse#2 "a hand-operated electronic device…") = 0.107 • the remaining sense pairs score 0.0
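The same comparison can be reproduced in spirit with NLTK's WordNet (an assumption; the slides use an older WordNet/ItalWordNet, so sense numbering and scores will not match 0.726 / 0.107 exactly). The feline sense of cat should still score highest against the senses of mouse:

```python
from nltk.corpus import wordnet as wn

mouse_senses = wn.synsets('mouse', pos=wn.NOUN)   # rodent, electronic device, ...

# score every candidate sense of "cat" by its best Leacock-Chodorow
# similarity to any sense of the context word "mouse"
for c in wn.synsets('cat', pos=wn.NOUN):
    best = max(c.lch_similarity(m) for m in mouse_senses)
    print(f'{c.name():35s} {best:.3f}  {c.definition()[:45]}')
```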

  8. JIGSAW_nouns: MSS support • Most Specific Subsumer (MSS) between words: for cat and mouse the MSS is {placental_mammal} • Give more importance to senses that are hyponyms of the MSS • Combine the MSS support with the semantic similarity • Wcat = {cat#1, cat#2}, T = {mouse#1, mouse#2}
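In NLTK the MSS of two noun senses corresponds to lowest_common_hypernyms (the tooling is an assumption; the result matches the placental_mammal node on the slide):

```python
from nltk.corpus import wordnet as wn

cat = wn.synset('cat.n.01')      # feline mammal
mouse = wn.synset('mouse.n.01')  # any of numerous small rodents
# MSS(cat#1, mouse#1): expected [Synset('placental.n.01')], i.e. placental_mammal
print(cat.lowest_common_hypernyms(mouse))
```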

  9. Differences between JIGSAW_nouns and Resnik • Leacock-Chodorow measure to calculate similarity (instead of Information Content) • a Gaussian factor G, which takes into account the distance between the words in the text • a factor R, which takes into account the synset frequency score in WordNet • a parameterized search for the MSS (Most Specific Subsumer) between two concepts
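A sketch of these ingredients, assuming NLTK's WordNet: lch_similarity for the Leacock-Chodorow measure, a Gaussian of token distance for G, and WordNet sense counts for R. The exact shape of G and R, and how JIGSAW combines them, are not given on the slide, so the formulas below are assumptions:

```python
import math
from nltk.corpus import wordnet as wn

def gaussian_factor(pos_i, pos_j, sigma=2.0):
    # G: words that are closer together in the text contribute more
    return math.exp(-((pos_i - pos_j) ** 2) / (2 * sigma ** 2))

def frequency_score(synset):
    # R: how often the synset's lemmas are sense-tagged in WordNet's counts
    return sum(lemma.count() for lemma in synset.lemmas())

cat, mouse = wn.synset('cat.n.01'), wn.synset('mouse.n.01')
print(cat.lch_similarity(mouse))                    # Leacock-Chodorow similarity
print(gaussian_factor(2, 7), frequency_score(cat))  # illustrative G and R values
```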

  10. JIGSAW_verbs: the idea • Try to establish a relation between verbs and nouns (which live in distinct IS-A hierarchies in WordNet) • A verb wi is disambiguated using: • the nouns in the context C of wi • the nouns in the description (gloss + WordNet usage examples) of each candidate synset for wi

  11. JIGSAW_verbs: algorithm [1/4] • For each candidate synset sik of wi: • compute nouns(i, k), the set of nouns in the description of sik • for each wj in C and each synset sik, compute the highest similarity maxjk • maxjk is the highest similarity value for wj with respect to the nouns related to the k-th sense of wi (using the Leacock-Chodorow measure)

  12. JIGSAW_verbs: algorithm [2/4] • wi = play, C = {basketball, soccer} (sentence: "I play basketball and soccer") • Candidate senses: • (70) play -- (participate in games or sport; "We played hockey all afternoon"; "play cards"; "Pele played for the Brazilian teams in many important matches") • (29) play -- (play on an instrument; "The band played all night long") • … • nouns(play,1): game, sport, hockey, afternoon, card, team, match • nouns(play,2): instrument, band, night • … • nouns(play,35): …
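A sketch of how nouns(play, k) can be collected from each candidate synset's gloss and usage examples with NLTK (an assumption); checking WordNet noun membership stands in for a real POS tagger, so the sets come out noisier than the slide's:

```python
from nltk.corpus import wordnet as wn

def description_nouns(synset):
    """Nouns occurring in the synset's description (gloss + usage examples)."""
    text = synset.definition() + ' ' + ' '.join(synset.examples())
    words = [w.strip('.,;:"\'').lower() for w in text.split()]
    return {w for w in words if wn.synsets(w, pos=wn.NOUN)}

for k, s in enumerate(wn.synsets('play', pos=wn.VERB)[:2], start=1):
    print(f'nouns(play,{k}):', sorted(description_nouns(s)))
```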

  13. JIGSAW_verbs: algorithm [3/4] • wi = play, C = {basketball, soccer} • nouns(play,1): game, sport, hockey, afternoon, card, team, match • (figure: each sense of the context noun basketball, basketball1 … basketballh, is compared with every sense of each noun in nouns(play,1), e.g. game1 … gamek and sport1 … sportm) • MAXbasketball = maxi Sim(wi, basketball), wi ∈ nouns(play,1)

  14. JIGSAW_verbs: algorithm [4/4] • finally, an overall similarity score φ(i, k) between sik and the whole context C is computed • the synset assigned to wi is the one with the highest φ value
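Putting slides 11-14 together, a rough sketch of the JIGSAW_verbs scoring loop with NLTK (assumptions: only the first sense of each context noun is compared, and φ averages the per-noun maxima; the slides only say the candidate with the highest score wins):

```python
from nltk.corpus import wordnet as wn

def description_nouns(synset):
    """Nouns occurring in the synset's description (gloss + usage examples)."""
    text = synset.definition() + ' ' + ' '.join(synset.examples())
    words = [w.strip('.,;:"\'').lower() for w in text.split()]
    return {w for w in words if wn.synsets(w, pos=wn.NOUN)}

def phi(candidate, context_nouns):
    """Overall similarity between a candidate verb synset and the context C."""
    desc = [wn.synsets(n, pos=wn.NOUN)[0] for n in description_nouns(candidate)]
    if not desc:
        return 0.0
    maxima = []
    for w in context_nouns:
        cj = wn.synsets(w, pos=wn.NOUN)[0]                       # first sense of wj
        maxima.append(max(cj.lch_similarity(d) for d in desc))   # max_jk
    return sum(maxima) / len(maxima)

candidates = wn.synsets('play', pos=wn.VERB)
best = max(candidates, key=lambda s: phi(s, ['basketball', 'soccer']))
print(best.name(), '-', best.definition())
```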

  15. JIGSAW_others • Based on the WSD algorithm proposed by Banerjee and Pedersen [Banerjee02] (inspired by Lesk) • Idea: compute the overlap between the glosses of each candidate sense of the target word and the glosses of all the words in its context • assign the synset with the highest overlap score • if ties occur, the most common synset in WordNet is chosen
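A minimal gloss-overlap sketch in this spirit, using NLTK (an assumption): each candidate sense of the target is scored by the word tokens its gloss shares with the glosses of the context words, and Python's max breaks ties in favour of the earlier, more frequent WordNet sense. The real adapted Lesk algorithm also scores multi-word overlaps and uses glosses of related synsets.

```python
from nltk.corpus import wordnet as wn

def gloss_words(word):
    """All word tokens from the glosses of every sense of `word`."""
    return set(' '.join(s.definition() for s in wn.synsets(word)).lower().split())

def jigsaw_others(target, context):
    ctx = set().union(*(gloss_words(w) for w in context))
    # wn.synsets preserves frequency order, so ties favour the most common sense
    return max(wn.synsets(target),
               key=lambda s: len(set(s.definition().lower().split()) & ctx))

print(jigsaw_others('bright', ['sun', 'light']).definition())
```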

  16. Evaluation • EVALITA All-Words Task • disambiguate all words in a text • Dataset • 16 texts in Italian • about 5000 words (sense-tagged using ItalWordNet) • Processing • WSD requires other NLP steps: • text normalization and tokenization • Part-of-Speech tagging (based on ACOPOST) • lemmatization (based on the Morph-it! resource)

  17. Evaluation (Results) • Comments: • results are encouraging, considering that our system exploits only ItalWordNet • the pre-processing phases, lemmatization and POS tagging, introduce errors: • 77.66% lemmatization precision • 76.23% POS-tagging precision

  18. Conclusions • Conclusions: • Knowledge-based WSD algorithm • A different strategy for each POS tag • Uses only WordNet (ItalWordNet) and some heuristics • Advantage: the same strategy works for both Italian and English [Basile07] • Drawback: currently low precision

  19. Future Work • Include new knowledge sources: • the Web (e.g. Topic Signatures) • Wikipedia (e.g. Wikipedia-based similarity) • do not include resources that are available for only a few languages • Use other heuristics: • statistical distribution of senses instead of WordNet frequency • WordNet Domains

  20. References • S. Banerjee and T. Pedersen. An adapted Lesk algorithm for word sense disambiguation using WordNet. In CICLing '02: Proc. 3rd Int'l Conf. on Computational Linguistics and Intelligent Text Processing, pages 136–145, London, UK, 2002. Springer-Verlag. • P. Basile, M. de Gemmis, A. L. Gentile, P. Lops, and G. Semeraro. JIGSAW algorithm for word sense disambiguation. In SemEval-2007: 4th Int'l Workshop on Semantic Evaluations, pages 398–401. ACL Press, 2007. • C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, pages 305–332. MIT Press, 1998. • P. Resnik. Disambiguating noun groupings with respect to WordNet senses. In Proceedings of the Third Workshop on Very Large Corpora, pages 54–68. Association for Computational Linguistics, 1995.

  21. Backup slides

  22. Results (detail) • Comments: • the polysemy of verbs is high • proper nouns generally have only one sense • the lemmatizer and POS tagger perform worst on adjectives

  23. JIGSAW_nouns: the idea • Inspired by the idea proposed by Resnik [Resnik95] • (figure: the co-occurring nouns [w1 w2 … wn], each with its sense inventory [si1 si2 … sik] and a confidence score per sense, e.g. [0.4 0.3 0.5], [0.2 0.3 0.4], [0.6 0.1 0.2]) • The most plausible assignment of senses to a set of co-occurring nouns is the one that maximizes the relatedness of meanings among the senses chosen • Relatedness is measured by computing a score for each sij: the confidence with which sij is the most appropriate synset for wi • The score is a way to assign credit to word senses based on similarity with co-occurring words (support) [Resnik95]

  24. JIGSAW_nouns: the support • (figure: the words [w1 w2 … wn] with their sense inventories [si1 si2 … sik], the most specific subsumer MSS, and semantic similarity scores, e.g. 0.56 and 0.15) • The more similar two words are, the more informative is the most specific concept that subsumes both of them

  25. JIGSAW_nouns: the idea (example) • cat and mouse in the WordNet IS-A hierarchy: cat (feline mammal) → feline, felid → carnivore → placental mammal; mouse (rodent) → rodent → placental mammal • MSS = placental mammal • (figure: similarity scores assigned to the candidate senses, e.g. [0.56 0.0 0.0], [0.56 0.0 0.56], [0.0 0.0 0.0])
