Measuring the Semantic Similarity of Texts

Measuring the Semantic Similarity of Texts • Authors: Courtney Corley and Rada Mihalcea • Source: ACL-2005 • Reporter: Yong-Xiang Chen

Outline • Introduction • Semantic Similarity of Words • Semantic Similarity of Texts • A Walk-Through Example • Evaluation • Conclusion

Introduction • Measures of text similarity have been used for • IR, text classification, WSD, automatic evaluation of machine translation, text summarization • The typical approach is to use a simple lexical matching method and produce a similarity score from the words the two texts share • But most such metrics fail on text pairs that are related in meaning yet share few words, for example • I own a dog • I have an animal
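The failure case above can be illustrated with a minimal sketch of a lexical matching score. The function name `lexical_overlap` and its Jaccard-style normalization are assumptions chosen for illustration; they are not necessarily the lexical matching method the paper refers to.

```python
def lexical_overlap(text_a: str, text_b: str) -> float:
    """Jaccard-style overlap: shared word types / all word types."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


# The two example sentences share only the word "I".
score = lexical_overlap("I own a dog", "I have an animal")
print(score)
```

Only one of the seven distinct words is shared, so the score stays low even though the second sentence is close in meaning to the first.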

Introduction (cont.) • LSA measures similarity between texts by also taking into account similar terms found in large text collections • In this paper, we explore a knowledge-based method for measuring the semantic similarity of texts • There are several methods for finding the semantic similarity of words • We combine these methods into a text-to-text semantic similarity method

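Semantic Similarity of Words • The Leacock & Chodorow (Leacock and Chodorow, 1998) similarity: Sim_lch = -log(length / (2 * D)) • length: the length of the shortest path between two concepts • D: the maximum depth of the taxonomy • The Wu and Palmer (Wu and Palmer, 1994) similarity: Sim_wup = 2 * depth(LCS) / (depth(concept1) + depth(concept2)), where LCS is the least common subsumer of the two concepts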
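As a rough sketch of the two path-based measures, the functions below compute the Leacock & Chodorow similarity as -log(length / (2 * D)) and the Wu and Palmer similarity as 2 * depth(LCS) / (depth(c1) + depth(c2)), where LCS is the least common subsumer. The toy taxonomy, the helper names, and the choice to count path length and depth in nodes are assumptions for illustration; a real system would query WordNet.

```python
import math

# Hypothetical toy is-a taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "collie": "dog",
    "sheepdog": "dog",
}


def ancestors(concept):
    """Return the path from a concept up to the root, concept first."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1 (an assumption)."""
    return len(ancestors(concept))


def lcs(c1, c2):
    """Least common subsumer: the deepest ancestor shared by both concepts."""
    seen = set(ancestors(c1))
    for node in ancestors(c2):  # walks upward, so the first hit is the deepest
        if node in seen:
            return node
    raise ValueError("concepts do not share a root")


def path_length(c1, c2):
    """Shortest path between two concepts, counted in nodes (an assumption)."""
    return depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2)) + 1


MAX_DEPTH = max(depth(c) for c in list(PARENT) + ["entity"])


def sim_lch(c1, c2):
    return -math.log(path_length(c1, c2) / (2.0 * MAX_DEPTH))


def sim_wup(c1, c2):
    return 2.0 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))


print(sim_wup("collie", "sheepdog"), sim_wup("collie", "cat"))
```

In this toy taxonomy both measures rank collie and sheepdog (siblings under dog) as more similar than collie and cat, whose closest shared ancestor is the more general animal.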

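Semantic Similarity of Words (cont.) • Resnik's measure (Resnik, 1995): the information content (IC) of the LCS, Sim_res = IC(LCS), with IC(c) = -log P(c) • P(c): the probability of encountering an instance of concept c in a large corpus • Lin's metric (Lin, 1998): Sim_lin = 2 * IC(LCS) / (IC(concept1) + IC(concept2)) • Jiang & Conrath (Jiang and Conrath, 1997): Sim_jcn = 1 / (IC(concept1) + IC(concept2) - 2 * IC(LCS))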
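The information-content measures can be sketched as follows, using IC(c) = -log P(c), Resnik's IC(LCS), Lin's 2 * IC(LCS) / (IC(c1) + IC(c2)), and Jiang & Conrath's 1 / (IC(c1) + IC(c2) - 2 * IC(LCS)). The probabilities in `PROB` are invented for illustration, and the LCS is passed in by hand rather than looked up in a taxonomy.

```python
import math

# Hypothetical concept probabilities P(c); a real system would estimate
# them from corpus counts propagated up the taxonomy.
PROB = {"entity": 1.0, "animal": 0.2, "dog": 0.05, "cat": 0.04}


def ic(concept):
    """Information content: IC(c) = -log P(c)."""
    return -math.log(PROB[concept])


def sim_resnik(lcs):
    """Resnik: the information content of the least common subsumer."""
    return ic(lcs)


def sim_lin(c1, c2, lcs):
    """Lin: IC of the LCS, normalized by the IC of both concepts."""
    return 2.0 * ic(lcs) / (ic(c1) + ic(c2))


def sim_jcn(c1, c2, lcs):
    """Jiang & Conrath: inverse of the IC-based distance.

    Undefined when the distance is zero (e.g. identical concepts).
    """
    return 1.0 / (ic(c1) + ic(c2) - 2.0 * ic(lcs))


print(sim_resnik("animal"), sim_lin("dog", "cat", "animal"), sim_jcn("dog", "cat", "animal"))
```

Note that the root concept has P(c) = 1 and therefore IC = 0, so two concepts that only share the root get a Resnik and Lin score of 0.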

Language Models • Language models are used to account for the distribution of words in language • We take into account the specificity of words • For example, • collie and sheepdog: specific words, given a higher weight • go and be: generic words, given less importance • TF does not always constitute a good measure of word importance • The distribution of words across an entire collection can be a good indicator of the specificity of the words: the inverse document frequency (IDF)
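A minimal sketch of word specificity via inverse document frequency, assuming the common definition idf(w) = log(N / df(w)), where N is the number of documents and df(w) the number of documents containing w. The three-document collection is hypothetical; in practice the weights would come from a large corpus.

```python
import math


def inverse_document_frequency(documents):
    """Return idf(w) = log(N / df(w)) for every word in the collection."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc.lower().split()):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


# Hypothetical three-document collection.
docs = [
    "the collie will go home",
    "dogs go to the park",
    "we go and we will be there",
]
idf = inverse_document_frequency(docs)
print(idf["collie"], idf["will"], idf["go"])
```

The word go appears in every document and gets a weight of zero, while collie appears in only one and gets the highest weight of the three.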

Semantic Similarity of Texts • A directional measure of semantic similarity • Indicates the semantic similarity of a text segment Ti with respect to a text segment Tj • Build sets of open-class words: N, V, Adj, Adv • Determine pairs of similar words across the sets corresponding to the same open class in the two texts • For nouns and verbs, we use a measure based on WordNet • Apply lexical matching to the other word classes

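Semantic Similarity of Texts (cont.) • maxSim: for each word in Ti, the highest semantic similarity to a word of the same class in Tj, according to one of the six word similarity methods • If this similarity is greater than 0, the word is added to the set of similar words for the corresponding word class, WSpos • The directional score sums maxSim * idf over the words in WSpos and divides by the sum of idf over the words of Ti • The score is between 0 and 1 with respect to Ti • A bidirectional similarity: the average of the two directional scores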
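The combination step can be sketched as below: for each word of Ti, take the best match in the same word class of Tj, weight it by idf, and normalize by the total idf of Ti. Averaging the two directions is consistent with the walk-through figures (0.6702 and 0.7202 giving 0.6952). The data layout, the fallback idf of 1.0, and the toy similarity table and weights are assumptions for illustration only.

```python
def directional_similarity(text_i, text_j, word_sim, idf):
    """Similarity of text_i with respect to text_j.

    text_i, text_j: dicts mapping a word class (e.g. "noun") to a list of words.
    word_sim: function (word_a, word_b) -> score in [0, 1].
    idf: dict mapping a word to its specificity weight.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for pos, words in text_i.items():
        candidates = text_j.get(pos, [])
        for word in words:
            weight = idf.get(word, 1.0)  # fallback weight is an assumption
            total_weight += weight
            max_sim = max((word_sim(word, other) for other in candidates), default=0.0)
            if max_sim > 0:  # only words with a positive match contribute
                weighted_sum += max_sim * weight
    return weighted_sum / total_weight if total_weight else 0.0


def bidirectional_similarity(text_i, text_j, word_sim, idf):
    """Average of the two directional scores."""
    forward = directional_similarity(text_i, text_j, word_sim, idf)
    backward = directional_similarity(text_j, text_i, word_sim, idf)
    return (forward + backward) / 2.0


# Hypothetical word similarities and idf weights, not taken from the paper.
TOY_SIM = {frozenset(("dog", "animal")): 0.8, frozenset(("own", "have")): 0.5}
TOY_IDF = {"dog": 2.0, "animal": 1.0, "own": 1.0, "have": 0.5}


def toy_word_sim(a, b):
    return 1.0 if a == b else TOY_SIM.get(frozenset((a, b)), 0.0)


t1 = {"noun": ["dog"], "verb": ["own"]}      # "I own a dog"
t2 = {"noun": ["animal"], "verb": ["have"]}  # "I have an animal"
print(bidirectional_similarity(t1, t2, toy_word_sim, TOY_IDF))
```

Passing the word similarity as a function keeps the text-level measure independent of which word-level metric is plugged in.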

A Walk-Through Example • First, the text segments are tokenized and POS-tagged • The words are inserted into their word class sets

A Walk-Through Example (cont.) • We seek a WordNet-based semantic similarity for N and V • Only lexical matching for Adj, Adv, and cardinals

A Walk-Through Example (cont.) • The semantic similarity with respect to text 1: 0.6702 • With respect to text 2: 0.7202 • A bidirectional measure of similarity: 0.6952, the average of the two directional scores

Evaluation • To test the effectiveness of the text semantic similarity metric • Automatically identify whether two text segments are paraphrases of each other • Corpora: • The Microsoft paraphrase corpus: 4,076 training pairs and 1,725 test pairs • PASCAL corpus: 580 development pairs and 800 test pairs • Two settings • An unsupervised setting: a fixed threshold of 0.5 • A supervised setting: the optimal threshold and the weights associated with the various similarity methods are determined through learning on training data
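The unsupervised setting amounts to thresholding the similarity score at 0.5 and measuring accuracy against gold labels. The sketch below assumes a strict "greater than" comparison, and the scores and labels are invented for illustration.

```python
def classify_paraphrases(scores, threshold=0.5):
    """Label a pair as a paraphrase when its similarity score exceeds the threshold."""
    return [score > threshold for score in scores]


def accuracy(predictions, gold_labels):
    """Fraction of pairs whose predicted label matches the gold label."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)


# Hypothetical similarity scores and gold labels.
scores = [0.6952, 0.41, 0.83, 0.52]
gold = [True, False, True, False]
predictions = classify_paraphrases(scores)
print(predictions, accuracy(predictions, gold))
```

In the supervised setting the threshold would instead be tuned on the training pairs, which this sketch does not attempt.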

Evaluation (cont.) • Three baselines • Randomly choosing a true or false value for each text pair • Lexical matching, which counts the number of matching words • Similarity using tf * idf weighting • Paraphrase identification • The dog is eating the bone -> The bone is being eaten by the dog • Entailment identification • I can see a dog -> I can see an animal
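One plausible reading of the tf * idf baseline is cosine similarity between tf * idf weighted word vectors. The sketch below follows that reading; the use of cosine, the function names, and the idf values are assumptions rather than details confirmed by the slides.

```python
import math
from collections import Counter


def tfidf_vector(text, idf):
    """Map each word to term frequency times idf (unknown words get weight 0)."""
    counts = Counter(text.lower().split())
    return {word: tf * idf.get(word, 0.0) for word, tf in counts.items()}


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical idf weights.
IDF = {"dog": 2.0, "bone": 2.0, "cat": 2.0, "eats": 1.5, "the": 0.1, "a": 0.1}

same_words = cosine(tfidf_vector("the dog eats a bone", IDF),
                    tfidf_vector("a bone eats the dog", IDF))
few_shared = cosine(tfidf_vector("the dog eats a bone", IDF),
                    tfidf_vector("the cat", IDF))
print(same_words, few_shared)
```

The first pair scores close to 1 because both sentences contain the same words, even though word order reverses who eats what; this is the bag-of-words limitation that applies to any order-insensitive measure.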

Evaluation (cont.)

Conclusion • The accuracy of text semantic similarity for paraphrase identification: 68.8% and 71.5% • For the entailment data set, the accuracy of 58.3% is better than that reported in the PASCAL entailment evaluation (Dagan et al., 2005) • Our method relies on a bag-of-words approach • It improves significantly over the traditional methods • But it ignores many important relationships in sentence structure
