Measuring the Semantic Similarity of Texts

Author: Courtney Corley and Rada Mihalcea. Source: ACL-2005. Reporter: Yong-Xiang Chen.


  1. Measuring the Semantic Similarity of Texts • Author: Courtney Corley and Rada Mihalcea • Source: ACL-2005 • Reporter: Yong-Xiang Chen

  2. Outline • Introduction • Semantic Similarity of Words • Semantic Similarity of Texts • A Walk-Through Example • Evaluation • Conclusion

  3. Introduction • Measures of text similarity have been used for IR, text classification, WSD, automatic evaluation of machine translation, and text summarization • The typical approach is to use a simple lexical matching method and produce a similarity score • But most text similarity metrics fail on pairs like these, which are related in meaning but share almost no words • I own a dog • I have an animal
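
The failure of lexical matching on this pair can be illustrated with a minimal sketch. It assumes a Jaccard overlap of lowercase word sets as a stand-in for "simple lexical matching"; the function name `lexical_overlap` is hypothetical and is not taken from the paper.

```python
def lexical_overlap(text_a: str, text_b: str) -> float:
    """Jaccard overlap of lowercase word sets (an assumed stand-in for lexical matching)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


# The two sentences share only the word "i", so the score is low
# even though their meanings are closely related.
score = lexical_overlap("I own a dog", "I have an animal")
print(f"lexical overlap: {score:.3f}")
```

Only one of the seven distinct words is shared, so a purely lexical score stays close to zero for this pair.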

  4. Introduction (cont.) • LSA measures similarity between texts by also taking into account similar terms found in large text collections • In this paper, we explore a knowledge-based method for measuring the semantic similarity of texts • There are several methods for finding the semantic similarity of words • We combine these methods into a text-to-text semantic similarity method

  5. Semantic Similarity of Words • The Leacock & Chodorow (Leacock and Chodorow, 1998) similarity • Length: the length of the shortest path between two concepts • D: the maximum depth of the taxonomy • The Wu and Palmer (Wu and Palmer, 1994) similarity
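
For reference, the two measures named on this slide are usually written as follows. These are the standard forms from the cited papers, stated here as an assumption about what the slide displayed; $c_1$ and $c_2$ denote the two concepts being compared, and $LCS$ is their least common subsumer in the taxonomy.

```latex
\mathrm{Sim}_{lch} = -\log \frac{\mathit{length}}{2 \cdot D}
\qquad\qquad
\mathrm{Sim}_{wup} = \frac{2 \cdot \mathit{depth}(LCS)}{\mathit{depth}(c_1) + \mathit{depth}(c_2)}
```

In the first formula a shorter path gives a higher score; in the second, the deeper the shared ancestor sits relative to the two concepts, the closer the score is to 1.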

  6. Semantic Similarity of Words (cont.) • The information content (IC) of the least common subsumer (LCS) • P(c): the probability of encountering an instance of concept c in a large corpus • Lin’s metric (Lin, 1998) • Jiang & Conrath (Jiang and Conrath, 1997)
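
The information-content measures can be sketched as below. The formulas are the standard ones from the literature (the IC-of-the-LCS measure is commonly attributed to Resnik, although the slide does not name it); the concept probabilities are made-up values for illustration, not corpus estimates.

```python
import math


def information_content(p: float) -> float:
    """IC(c) = -log P(c): rarer concepts carry more information."""
    return -math.log(p)


def resnik_similarity(p_lcs: float) -> float:
    """Similarity as the IC of the least common subsumer (LCS)."""
    return information_content(p_lcs)


def lin_similarity(p_c1: float, p_c2: float, p_lcs: float) -> float:
    """Lin: 2 * IC(LCS) / (IC(c1) + IC(c2))."""
    return 2 * information_content(p_lcs) / (
        information_content(p_c1) + information_content(p_c2)
    )


def jiang_conrath_similarity(p_c1: float, p_c2: float, p_lcs: float) -> float:
    """Jiang & Conrath: 1 / (IC(c1) + IC(c2) - 2 * IC(LCS)).

    Undefined when the distance in the denominator is zero (identical concepts).
    """
    distance = (
        information_content(p_c1)
        + information_content(p_c2)
        - 2 * information_content(p_lcs)
    )
    return 1.0 / distance


# Hypothetical probabilities: two specific concepts and a more general LCS.
p_dog, p_cat, p_carnivore = 0.001, 0.002, 0.01
print(resnik_similarity(p_carnivore))
print(lin_similarity(p_dog, p_cat, p_carnivore))
print(jiang_conrath_similarity(p_dog, p_cat, p_carnivore))
```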

  7. Language Models • Language models are used to account for the distribution of words in language • We take into account the specificity of words • For example, • specific words such as collie and sheepdog receive a higher weight • generic words such as go and be are given less importance • TF does not always constitute a good measure of word importance • The distribution of words across an entire collection can be a good indicator of the specificity of the words, which is captured by the inverse document frequency (IDF)
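
A minimal sketch of the IDF idea, assuming the common definition idf(w) = log(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w. The four-sentence corpus is invented for illustration only.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """Return idf(w) = log(N / df(w)) for every word seen in the documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document, regardless of how often it occurs.
        doc_freq.update(set(doc.lower().split()))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


corpus = [
    "the collie will go home",
    "the sheepdog will be fed",
    "the children go to school",
    "they will be late",
]
idf = inverse_document_frequency(corpus)

# "collie" appears in one document, "go" in two, so "collie" gets the higher weight.
print(idf["collie"], idf["go"])
```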

  8. Semantic Similarity of Texts • A directional measure of semantic similarity • Indicates the semantic similarity of a text segment Ti with respect to a text segment Tj • Sets of open-class words: N, V, Adj, Adv • Determine pairs of similar words across the sets corresponding to the same open class in the two texts • For nouns and verbs, we use a measure based on WordNet • Apply lexical matching to the other word classes

  9. Semantic Similarity of Texts (cont.) • maxSim: for each word, the highest semantic similarity to a word of the same class in the other text, according to one of the six methods • The score is between 0 and 1 with respect to Ti • If this similarity measure results in a score greater than 0, then the word is added to the set of similar words for the corresponding word class, WSpos • A bidirectional similarity combines the scores obtained in the two directions
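
The directional and bidirectional scores can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the directional score is taken to be the idf-weighted sum of maxSim values divided by the sum of idf weights of the words in Ti, and the bidirectional score is taken to be the average of the two directions. The word-similarity table `TOY_SIM` and the function `toy_word_sim` are hypothetical stand-ins for a WordNet-based measure.

```python
from typing import Callable, Dict, List

WordSets = Dict[str, List[str]]  # word class -> words of that class in a text


def directional_similarity(
    t_i: WordSets,
    t_j: WordSets,
    word_sim: Callable[[str, str], float],
    idf: Dict[str, float],
) -> float:
    """Similarity of t_i with respect to t_j (assumed idf-weighted maxSim average)."""
    weighted_sum = 0.0
    idf_total = 0.0
    for word_class, words in t_i.items():
        candidates = t_j.get(word_class, [])
        for word in words:
            weight = idf.get(word, 1.0)  # assumption: unknown words get weight 1
            idf_total += weight
            max_sim = max((word_sim(word, c) for c in candidates), default=0.0)
            if max_sim > 0:  # only words with a positive match contribute
                weighted_sum += max_sim * weight
    return weighted_sum / idf_total if idf_total else 0.0


def bidirectional_similarity(t_i, t_j, word_sim, idf) -> float:
    """Average of the two directional scores (assumed combination)."""
    return (
        directional_similarity(t_i, t_j, word_sim, idf)
        + directional_similarity(t_j, t_i, word_sim, idf)
    ) / 2


# Hypothetical word-to-word scores standing in for a WordNet-based measure.
TOY_SIM = {frozenset(("dog", "animal")): 0.8, frozenset(("own", "have")): 0.7}


def toy_word_sim(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


text_1 = {"noun": ["dog"], "verb": ["own"]}
text_2 = {"noun": ["animal"], "verb": ["have"]}
print(bidirectional_similarity(text_1, text_2, toy_word_sim, idf={}))
```

Passing the word-similarity function as a parameter keeps the sketch independent of any particular WordNet measure.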

  10. A Walk-Through Example • First, the text segments are tokenized and POS tagged • The words are inserted into word class sets
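
The grouping step can be sketched as below. The sketch assumes the input is already tokenized and tagged with Penn Treebank tags (the slides do not say which tagger or tag set was used); the mapping `TAG_PREFIX_TO_CLASS` and the function name are hypothetical.

```python
from collections import defaultdict

# Assumed mapping from Penn Treebank tag prefixes to word classes (CD = cardinal number).
TAG_PREFIX_TO_CLASS = {
    "NN": "noun",
    "VB": "verb",
    "JJ": "adjective",
    "RB": "adverb",
    "CD": "cardinal",
}


def build_word_class_sets(tagged_tokens):
    """Group (word, tag) pairs into one set of lowercase words per word class."""
    sets = defaultdict(set)
    for word, tag in tagged_tokens:
        word_class = TAG_PREFIX_TO_CLASS.get(tag[:2])
        if word_class is not None:  # closed-class words (DT, PRP, ...) are skipped
            sets[word_class].add(word.lower())
    return dict(sets)


tagged = [("I", "PRP"), ("own", "VBP"), ("a", "DT"), ("dog", "NN")]
print(build_word_class_sets(tagged))
```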

  11. A Walk-Through Example (cont.) • We seek a WordNet-based semantic similarity for N and V • Only lexical matching for Adj, Adv, and cardinals

  12. A Walk-Through Example (cont.) • We use • The semantic similarity with respect to text 1: 0.6702 • With respect to text 2: 0.7202 • A bidirectional measure of similarity: 0.6952
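
The reported bidirectional score is consistent with a simple average of the two directional scores:

```latex
\frac{0.6702 + 0.7202}{2} = \frac{1.3904}{2} = 0.6952
```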

  13. Evaluation • To test the effectiveness of the text semantic similarity metric, we automatically identify whether two text segments are paraphrases of each other • Corpora: • The Microsoft paraphrase corpus: 4,076 training pairs and 1,725 test pairs • The PASCAL corpus: 580 development pairs and 800 test pairs • Two settings • An unsupervised setting: a threshold of 0.5 • A supervised setting: the optimal threshold and the weights associated with the various similarity methods are determined through learning on training data
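
The unsupervised setting can be sketched as a fixed-threshold classifier followed by an accuracy count. Whether the comparison is strict (`>`) or inclusive is an assumption here, and apart from 0.6952 (the walk-through score) the scores and gold labels are invented for illustration.

```python
def classify_pairs(scores, threshold=0.5):
    """Label a pair as a paraphrase when its similarity exceeds the threshold.

    The strict comparison is an assumption; the slides only give the value 0.5.
    """
    return [score > threshold for score in scores]


def accuracy(predictions, gold_labels):
    """Fraction of pairs whose predicted label matches the gold label."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)


# Hypothetical similarity scores and gold paraphrase labels for four text pairs.
scores = [0.6952, 0.31, 0.58, 0.47]
gold = [True, False, False, False]

predictions = classify_pairs(scores)
print(predictions, accuracy(predictions, gold))
```

In the supervised setting the fixed 0.5 would be replaced by a threshold learned from the training pairs.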

  14. Evaluation (cont.) • Three baselines • Randomly choosing a true or false value for each text pair • Lexical matching, which counts the number of matching words • Using tf * idf • Paraphrase identification • The dog is eating the bone -> The bone is being eaten by the dog • Entailment identification • I can see a dog -> I can see an animal
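
One plausible reading of the tf * idf baseline is a cosine similarity between tf * idf-weighted word vectors; the slide does not spell out the details, so the sketch below is an assumption. The idf values are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vector(text, idf):
    """Map each word to term frequency times idf (words without an idf get 0)."""
    counts = Counter(text.lower().split())
    return {word: tf * idf.get(word, 0.0) for word, tf in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical idf weights: content words are weighted above function words.
idf = {"dog": 2.0, "animal": 1.5, "own": 1.8, "have": 0.5,
       "i": 0.1, "a": 0.1, "an": 0.1}

score = cosine(tfidf_vector("I own a dog", idf),
               tfidf_vector("I have an animal", idf))
print(f"tf*idf cosine: {score:.4f}")
```

Because the only shared word is a low-weight function word, this baseline also scores the pair near zero.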

  15. Evaluation (cont.)

  16. Evaluation (cont.)

  17. Conclusion • The accuracy of text semantic similarity for paraphrase identification (68.8%, 71.5%) • For the entailment data set, the accuracy of 58.3% is better than the results reported in the PASCAL entailment evaluation (Dagan et al., 2005) • Our method relies on a bag-of-words approach • Improves significantly over the traditional methods • But ignores many of the important relationships in sentence structure
