
OMIOTIS: A Thesaurus-based Measure of Semantic Relatedness



Presentation Transcript


  1. OMIOTIS: A Thesaurus-based Measure of Semantic Relatedness George Tsatsaronis DB-NET Research Team, A.U.E.B. – http://www.db-net.aueb.gr SKEL N.C.S.R. “DEMOKRITOS” - http://www.iit.demokritos.gr/skel/ Web: http://www.db-net.aueb.gr/gbt/ e-mail: gbt@aueb.gr, gbt@iit.demokritos.gr Joint work with Iraklis Varlamis and Michalis Vazirgiannis

  2. Presentation Layout • Lexical Ambiguity Problem • State of the art • Dictionary-based Approaches • Corpus-based Approaches • Hybrid Approaches • OMIOTIS • Applications • Word-to-Word Relatedness • SAT Analogy • Text-to-Text-Relatedness • Paraphrasing • Text Retrieval • Limitations • Future Work N.C.S.R. "Demokritos", December 2008

  3. Presentation Layout (current section: Lexical Ambiguity Problem)

  4. The Problem of Lexical Ambiguity • Syntactic Ambiguity • A word can appear in text with different parts of speech (POS) • e.g., “Oxides and hydroxides of metals and ammonia are included in bases.” • and “He bases his claim on some observation.” • Semantic Ambiguity • A word can occur with different meanings in text • e.g., “The old car needed constant attention.” • and “The troops stood at attention.” • A word can occur as part of a phrase • e.g., “United States of America were founded by thirteen colonies of Great Britain.” • Stemming • e.g., is disturb the stem of the verb disturb, or of the noun disturbances, which can also have other meanings?

  5. Problems Propagate • Text Retrieval • Text Classification • Paraphrasing • Other problem aspects • Machine Translation, Summarization

  6. Impact of Syntactic and Semantic Ambiguity in Text Retrieval (1/2)

  7. Impact of Syntactic and Semantic Ambiguity in Text Retrieval (2/2)

  8. The Nature of Semantic Ambiguity • Polysemy: A word can have different meanings in different contexts (e.g., sentences, texts). • Thesauri, like WordNet, give us the possible meanings (senses) of any dictionary word. • WordNet uses synonym sets, called synsets, to represent the words’ senses. • e.g., the noun bank has 10 different synsets in WordNet.

  9. Lexical Resources • Machine Readable Dictionaries (MRDs), like the Collins English Dictionary (CED), the Oxford Advanced Learner’s Dictionary (OALD), and the Longman Dictionary of Contemporary English (LDOCE). • Thesauri, like WordNet, Roget’s (recently made available with a Java API), and EuroWordNet • For each word, all MRDs provide • Possible parts of speech (POS) • Possible meanings and respective definitions (glosses) • Usage examples • Thesauri add semantic relations (usually symmetrical) • Hierarchical: Hypernym/Hyponym, Meronym-Troponym/Holonym, etc. • Horizontal: Antonym/Synonym, Domain, Entailments/Causes, etc.

  10. WordNet – an often used thesaurus • Developed at Princeton University, with more than 200,000 synsets. • Versions 2.0 and 2.1 come with semantic relations crossing POS. • The most widely used thesaurus in the WSD literature since 2000. • Senseval 2, Senseval 3, SemCor and SemEval are manually annotated on WordNet 2.0 and 2.1.

  11. How to Tackle Lexical Ambiguity • Syntactic Ambiguity • POS Tagging (Brill, Viterbi algorithm, MaxEnt) • Semantic Ambiguity • Word Sense Disambiguation (knowledge-based, corpus-based, hybrid) • Phrase Detection (dictionary look-up) • Questions raised: • How to combine all these pieces of information? • Is there any other way to address lexical ambiguity? • A measure of semantic relatedness combining lexical and semantic features?

  12. Presentation Layout (current section: State of the art, Dictionary-based Approaches)

  13. Overall Idea • What is the semantic relatedness between Ckm and Cpq? [Figure: hierarchies of concept nodes (C0, C11, C12, C13, …; D0, D11, D12; E0, E11, E12) containing the senses Ckm and Cpq of Term1 and Term2; labeled WSD.]

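  14. Notation • len(ci,cj) is the length of the shortest path between ci and cj • depth(ci) = len(root, ci) is the depth of node ci • lso(ci,cj) is the lowest super-ordinate (or most specific common subsumer) of ci and cj • Given any rel(ci,cj), the relatedness of two words, rel(wi,wj), is simply the maximum over their senses: $rel(w_i,w_j) = \max_{c_i \in s(w_i),\, c_j \in s(w_j)} rel(c_i,c_j)$, where s(w) denotes the set of senses of w.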

  15. Dictionary-based Semantic Relatedness • Wu and Palmer (1994) • Hirst and St-Onge (1998) • Leacock and Chodorow (1998) • Veale (2004)
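
As an illustration of the path-based family listed above, the following is a minimal sketch of the Wu and Palmer similarity in its common formulation, 2·depth(lso) / (len(c1,lso) + len(c2,lso) + 2·depth(lso)), using the len/depth/lso notation. The toy taxonomy, the sense inventory, and all function names are hypothetical and are not taken from the presentation; the word-level score is assumed to be the maximum over sense pairs.

```python
# Toy is-a taxonomy (hypothetical): child -> parent. "entity" is the root.
PARENT = {
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "artifact",
}

# Hypothetical sense inventory: word -> list of concept nodes.
SENSES = {"jaguar": ["cat", "car"], "puppy": ["dog"]}


def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(concept):
    """depth(c) = len(root, c); the root has depth 0."""
    return len(ancestors(concept)) - 1


def lso(c1, c2):
    """Lowest super-ordinate: the deepest common ancestor of c1 and c2."""
    common = set(ancestors(c2))
    for node in ancestors(c1):  # walks upward, so the first hit is the deepest
        if node in common:
            return node
    raise ValueError("concepts do not share a root")


def wu_palmer(c1, c2):
    """Wu-Palmer similarity between two concepts of the toy taxonomy."""
    d = depth(lso(c1, c2))
    n1 = depth(c1) - d  # len(c1, lso)
    n2 = depth(c2) - d  # len(c2, lso)
    denominator = n1 + n2 + 2 * d
    return 1.0 if denominator == 0 else 2 * d / denominator


def word_relatedness(w1, w2):
    """Word-level score: maximum over all sense pairs."""
    return max(wu_palmer(a, b) for a in SENSES[w1] for b in SENSES[w2])


print(wu_palmer("dog", "cat"))
print(word_relatedness("jaguar", "puppy"))
```

With a root of depth 0, any two concepts whose only common ancestor is the root receive a score of 0; implementations that count nodes instead of edges avoid this by giving the root depth 1.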

  16. Presentation Layout (current section: State of the art, Corpus-based Approaches)

  17. Overall Idea • Core idea: “A word is characterized by the company it keeps.” • Terms t1 and t2 are represented as vectors (r11, r12, …, r1n) and (r21, r22, …, r2n) of features, such as co-occurrence frequencies, extracted from a large corpus (e.g., BNC, Wikipedia). • Frequencies are transformed by a variety of formulas and weights. • Techniques like LSA (SVD) can also be applied. • Relatedness can then be measured as the cosine of the angle between the two vectors.
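
The vector-space idea on this slide can be sketched as follows. This is a minimal illustration, assuming raw sentence-level co-occurrence counts as features and no reweighting or SVD; the four-sentence corpus and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical miniature corpus; a real setting would use BNC or Wikipedia.
corpus = [
    "the bank approved the loan",
    "the bank raised the interest rate",
    "the lender approved the loan",
    "the river flooded the valley",
]


def cooccurrence_vectors(sentences):
    """Map each term to a Counter of the terms it co-occurs with in a sentence."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, target in enumerate(tokens):
            for j, context in enumerate(tokens):
                if i != j:
                    vectors[target][context] += 1
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = cooccurrence_vectors(corpus)
print(cosine(vectors["bank"], vectors["lender"]))
print(cosine(vectors["bank"], vectors["river"]))
```

In this toy corpus the function word "the" dominates every vector, which is one reason raw frequencies are usually transformed before the cosine is taken.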

  18. Corpus-based Semantic Relatedness • PMI-IR (Turney 2001)
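
PMI-IR estimates pointwise mutual information from search-engine hit counts. The sketch below shows only the generic PMI quantity, log2( p(w1,w2) / (p(w1)·p(w2)) ), computed from counts; the hit counts are hypothetical, and the exact query forms used by PMI-IR are not reproduced here.

```python
import math


def pmi(hits_both, hits_w1, hits_w2, total_documents):
    """Pointwise mutual information (base 2) estimated from document counts."""
    if hits_both == 0:
        return float("-inf")  # the words never co-occur
    p_both = hits_both / total_documents
    p_w1 = hits_w1 / total_documents
    p_w2 = hits_w2 / total_documents
    return math.log2(p_both / (p_w1 * p_w2))


# Hypothetical counts: two words that each occur in 100 of 1000 documents.
print(pmi(50, 100, 100, 1000))  # co-occur more often than chance: positive
print(pmi(10, 100, 100, 1000))  # co-occur exactly as often as chance predicts
```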

  19. Presentation Layout (current section: State of the art, Hybrid Approaches)

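  20. Overall Idea • Frequencies of occurrence (FOC) propagate up the concept hierarchy: FOC(C1j) = F(FOC(C2i)), …, and TF(C0) = F(FOC(C11), FOC(C12), FOC(C13)). [Figure: concept hierarchy rooted at C0 with children C11, C12, C13 and deeper nodes Ckm and Cpq, annotated with FOC(Ckm) and FOC(Cpq).]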

  21. Hybrid Approaches • Resnik (1995) • Jiang and Conrath (1997) • Lin (1998)
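
The three hybrid measures listed above combine a taxonomy with corpus statistics through the information content IC(c) = -log p(c). A minimal sketch of their usual formulations follows: Resnik uses IC(lso), Jiang and Conrath use the distance IC(c1) + IC(c2) - 2·IC(lso), and Lin uses 2·IC(lso) / (IC(c1) + IC(c2)). The toy taxonomy, the counts, and the function names are hypothetical.

```python
import math

# Toy is-a taxonomy (hypothetical): child -> parent. "entity" is the root.
PARENT = {
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "artifact",
}

# Hypothetical corpus counts observed for the leaf concepts.
COUNTS = {"dog": 8, "cat": 6, "car": 6}


def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def propagated_counts():
    """Each observation of a concept also counts for all of its ancestors."""
    totals = {}
    for concept, n in COUNTS.items():
        for node in ancestors(concept):
            totals[node] = totals.get(node, 0) + n
    return totals


TOTALS = propagated_counts()
N = TOTALS["entity"]


def ic(concept):
    """Information content: -log p(c), written as log(N / count)."""
    return math.log(N / TOTALS[concept])


def lso(c1, c2):
    """Lowest super-ordinate (deepest common ancestor)."""
    common = set(ancestors(c2))
    return next(node for node in ancestors(c1) if node in common)


def resnik(c1, c2):
    return ic(lso(c1, c2))


def jiang_conrath_distance(c1, c2):
    return ic(c1) + ic(c2) - 2 * ic(lso(c1, c2))


def lin(c1, c2):
    denominator = ic(c1) + ic(c2)
    return 1.0 if denominator == 0 else 2 * ic(lso(c1, c2)) / denominator


print(resnik("dog", "cat"), jiang_conrath_distance("dog", "cat"), lin("dog", "cat"))
```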

  22. Presentation Layout (current section: OMIOTIS)

  23. OMIOTIS: A Thesaurus-based Measure of Semantic Relatedness • OMIOTIS is a dictionary-based measure of semantic relatedness. • It does not require any type of training; it relies on WordNet. • For the first time, three important factors are considered in tandem: • Semantic path length • Depth of the senses comprising the path • Importance of the semantic edge types

  24. Semantic Network Construction • Veronis and Ide [Veronis and Ide 1990] developed the first method that uses semantic networks to disambiguate open-class words. • It was one of the first formal semantic network definitions.

  25. Incorporating More Semantic Information • Tsatsaronis et al. (2007) proposed a new method for constructing semantic networks and used spreading activation to process them. • Incorporated all of the available semantic information and developed a new strategy for spreading the activation • Developed an edge-weighting scheme analogous to TF-IDF.

  26. Edge Weights and Activation Control • Edge weights are given by: • Activation is spread by: • Fan-out and distance constraints prevent the network from overflowing.

  27. OMIOTIS: Semantic Compactness • Definition 1. Given a word thesaurus O, a weighting scheme that assigns a weight e in (0, 1) to each edge, a pair of senses S = (s1, s2), and a path of length l connecting the two senses, the semantic compactness of S, SCM(S,O), is defined as $SCM(S,O) = \prod_{i=1}^{l} e_i$, where e1, e2, ..., el are the path’s edges. If s1 = s2, SCM(S,O) = 1. If there is no path between s1 and s2, SCM(S,O) = 0.

  28. OMIOTIS: Semantic Path Elaboration • Definition 2. Given a word thesaurus O, a pair of senses S = (s1, s2), where s1, s2 are in O and s1 ≠ s2, and a path of length l between the two senses, the semantic path elaboration of the path, SPE(S,O), is defined as $SPE(S,O) = \prod_{i=1}^{l} \frac{2 d_i d_{i+1}}{d_i + d_{i+1}} \cdot \frac{1}{d_{max}}$, where d_i is the depth of sense s_i according to O, and d_max is the maximum depth of O. If s1 = s2 and d = d1 = d2, SPE(S,O) = d / d_max. If there is no path from s1 to s2, SPE(S,O) = 0.

  29. OMIOTIS: Semantic Relatedness • Definition 3. Given a word thesaurus O and a pair of senses S = (s1, s2), the semantic relatedness of S, SR(S,O), is defined as max{SCM(S,O) · SPE(S,O)}, where the maximum is taken over the paths connecting s1 and s2.
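
Definitions 1 to 3 can be illustrated on a toy sense graph. This is a rough sketch, not the authors' implementation: it assumes that SCM is the product of the edge weights along a path, that SPE is the product, over the path's edges, of the harmonic mean of the two endpoint depths divided by d_max, and that SR is the maximum of SCM · SPE over all simple paths. The graph, depths, edge weights, and function names are hypothetical, and exhaustive path enumeration is only practical for tiny graphs.

```python
from math import prod

# Hypothetical sense graph: node depths and undirected weighted edges.
DEPTH = {"s1": 3, "a": 2, "b": 4, "s2": 3, "z": 1}
D_MAX = 4
EDGES = {
    ("s1", "a"): 0.9,
    ("a", "s2"): 0.8,
    ("s1", "b"): 0.5,
    ("b", "s2"): 0.5,
}


def weight(u, v):
    """Weight of the undirected edge between u and v."""
    return EDGES[(u, v)] if (u, v) in EDGES else EDGES[(v, u)]


def neighbors(node):
    return [v if u == node else u for (u, v) in EDGES if node in (u, v)]


def simple_paths(source, target, visited=()):
    """Yield every simple path from source to target as a list of nodes."""
    visited = visited + (source,)
    if source == target:
        yield list(visited)
        return
    for nxt in neighbors(source):
        if nxt not in visited:
            yield from simple_paths(nxt, target, visited)


def scm(path):
    """Semantic compactness: product of the edge weights along the path."""
    return prod(weight(u, v) for u, v in zip(path, path[1:]))


def spe(path):
    """Semantic path elaboration: normalized harmonic means of consecutive depths."""
    return prod(
        2 * DEPTH[u] * DEPTH[v] / (DEPTH[u] + DEPTH[v]) / D_MAX
        for u, v in zip(path, path[1:])
    )


def semantic_relatedness(s1, s2):
    """SR: maximum of SCM * SPE over all paths; 0 if the senses are disconnected."""
    if s1 == s2:
        return 1.0 * DEPTH[s1] / D_MAX  # SCM = 1 and SPE = d / d_max
    return max((scm(p) * spe(p) for p in simple_paths(s1, s2)), default=0.0)


print(semantic_relatedness("s1", "s2"))
print(semantic_relatedness("s1", "z"))
```

In the toy graph the path through "a" wins because its edges carry much higher weights, even though the path through "b" passes through a deeper sense.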

  30. Computation of Semantic Relatedness

  31. OMIOTIS Where:

  32. Presentation Layout (current section: Applications, Word-to-Word Relatedness)

  33. Word-to-Word Data Sets • 65 pairs of words (R&G) • 30 pairs of words (M&C) • Word-Similarity-353 Collection (Finkelstein et al. 2006) • For all pairs, we have human judgements (“gold standards”) • Evaluation measures the Spearman correlation between the ranking of the pairs produced by a measure and the ranking induced by the human judgements • Other measures have also been used, based on Kendall’s Tau
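
The evaluation protocol on this slide can be sketched as follows: rank the word pairs by human score and by system score, then compute the Pearson correlation of the two rank lists (ties receive average ranks). The scores below are hypothetical and do not come from R&G, M&C, or WordSim-353.

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # correlation is undefined for a constant list
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank lists."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five word pairs.
human_scores = [3.92, 3.84, 0.42, 1.10, 2.50]
system_scores = [0.80, 0.95, 0.05, 0.30, 0.40]
print(spearman(human_scores, system_scores))
```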

  34. Example (R&G)

  35. Results

  36. Presentation Layout (current section: Applications, SAT Analogy)

  37. Scholastic Aptitude Test (SAT) • Given a pair of words, select, among five candidate pairs, the one most analogous to it. • The key is to find the candidate pair that best preserves, across all possible aspects, the semantic relation of the initial pair.

  38. Results on the 374-question SAT Collection • OMIOTIS scores 131/374 (35.03%); when horizontal and vertical relatedness are combined, it reaches 198/374 (52.94%).

  39. Presentation Layout (current section: Applications, Text-to-Text Relatedness)

  40. The 50-document Collection • Michael D. Lee et al. (2005) created a data set in which 83 subjects assigned similarity scores to all possible pairs among 50 documents. • Documents vary from 51 to 126 words. • The data set was assessed on whether it is within the normal range of standard English text, according to four language models • Log-normal, generalized inverse Gauss-Poisson, Yule-Simon and Zipfian. • The data set was found to be within the normal range in terms of word frequency spectrum and vocabulary growth.

  41. Results on the 50-document Collection • The average inter-rater correlation was 0.605. • Cosine similarity (bag-of-words representation with TF-IDF weighting) achieves a correlation of 0.27 with the human judgements. • OMIOTIS (early results) shows a correlation above 0.45. • LSA-based techniques score 0.6, but need training.
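
The bag-of-words baseline mentioned on this slide can be sketched as below. It assumes one common TF-IDF variant (raw term frequency times log(N/df)) and whitespace tokenization; the three documents and the function names are hypothetical and unrelated to the 50-document collection.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical documents.
docs = [
    "stocks fell as markets closed",
    "markets closed lower as stocks fell sharply",
    "the team won the final match",
]
doc_vectors = tfidf_vectors(docs)
print(cosine(doc_vectors[0], doc_vectors[1]))
print(cosine(doc_vectors[0], doc_vectors[2]))
```

Documents that share no terms always score 0 under this baseline, regardless of how related their vocabulary is, which is the gap a thesaurus-based measure aims to close.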

  42. Presentation Layout (current section: Applications, Paraphrasing)

  43. Microsoft Research Paraphrase Corpus • 5801 pairs of sentences gleaned over a period of 18 months. • Each pair of sentences was labeled as a paraphrase pair (1) or not (0) by two judges, with disagreements resolved by a third judge. • After disagreement resolution, 67% of the pairs were judged semantically equivalent.

  44. Paraphrase Results • The table shows error reduction rates (%) relative to the standard vector space model in the paraphrase task.

  45. Presentation Layout (current section: Applications, Text Retrieval)

  46. TREC 1, 4 and 6

  47. Presentation Layout (current section: Limitations)

  48. Limitations • Scaling prior to the construction of the huge database was infeasible. • Corpora, like Wikipedia, offer vast amounts of information, but OMIOTIS does not take corpus information into account. • Context is not really taken into account, as WSD is not conducted.

  49. Presentation Layout (current section: Future Work)

  50. Future Work • Embed WSD information • Incorporate information arising from large corpora • Combine thesauri (e.g., use Roget’s as well) • Some interesting working ideas • Model the impact of ambiguity in text retrieval (similar attempts were made by Sanderson, using pseudowords) • Combine the indexing of OMIOTIS distances with a state-of-the-art IR platform, like Terrier. This will allow for online searching.
