Computing Semantic Similarities based on Machine-Readable Dictionaries
Abstract • If two words have similar definitions, they are semantically similar. • A definition is represented by a definition vector. • Each dimension represents a word in the dictionary. • The score of each dimension in the vector is calculated by a variation of tf*idf.
Introduction • Machine-Readable Dictionaries (MRDs) are human-encoded knowledge about words. • Transforming a hard-copy dictionary into a machine-readable one is far easier than building a new lexical ontology.
Basic Idea • If two words have something in common, their definitions will also share some common words. • Similarity is computed between the two definition vectors generated from the definitions.
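The idea above can be sketched in a few lines: represent each definition as a bag of words and look at what the two definitions share. The toy dictionary entries below are illustrative, not taken from LDCE or MCSD.

```python
from collections import Counter

# Toy dictionary: word -> tokenized definition (illustrative entries).
DICTIONARY = {
    "crane": ["a", "large", "bird", "with", "long", "legs"],
    "heron": ["a", "large", "bird", "with", "a", "long", "neck"],
    "stove": ["an", "apparatus", "that", "burns", "fuel", "for", "heat"],
}

def definition_vector(word):
    """Bag-of-words vector over the definition of `word`."""
    return Counter(DICTIONARY[word])

def common_words(a, b):
    """Words shared by the two definitions."""
    return set(definition_vector(a)) & set(definition_vector(b))
```

Here "crane" and "heron" share definition words such as "bird" and "long", while "crane" and "stove" share none, which is the raw signal the definition vectors are built on.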
Dictionaries and Preparation • Two machine-readable dictionaries: • Longman Dictionary of Contemporary English (LDCE) — definitions use a controlled vocabulary of only 2,000 English words • Modern Chinese Standardized Dictionary (MCSD) • Segmentation/POS preprocessing: BMM (backward maximum matching algorithm) • Definitions are formal and regular, with fewer ambiguities than free text, so the preprocessing gives good results.
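Backward maximum matching, mentioned above for segmenting the Chinese definitions, is a simple greedy procedure: scan from the end of the string and repeatedly take the longest suffix that appears in the vocabulary. A minimal sketch (the vocabulary below is a made-up example, not MCSD's):

```python
def bmm_segment(text, vocab, max_len=4):
    """Backward maximum matching: starting from the end of `text`,
    greedily take the longest window (up to `max_len` characters)
    that appears in `vocab`; fall back to a single character."""
    result = []
    end = len(text)
    while end > 0:
        start = max(0, end - max_len)
        # Shrink the window from the left until a vocabulary word matches.
        while start < end - 1 and text[start:end] not in vocab:
            start += 1
        result.append(text[start:end])
        end = start
    result.reverse()
    return result
```

For example, with the vocabulary {"研究", "研究生", "生命", "起源"}, the string "研究生命起源" segments into ["研究", "生命", "起源"], because matching from the right end picks "起源" first and thereby avoids the wrong forward split "研究生 | 命 | 起源".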
Measuring Similarities – 1/6 • Let W be the set of all words in a dictionary D: W = {w1, w2, w3, · · · , wmax}. • Through E, we obtain the definition vector of a word w.
Measuring Similarities – 2/6 • i is the iteration index. • If a word a occurs in the definition of word b but seldom occurs in the definitions of other words, then a is important for explaining b.
Measuring Similarities – 3/6 • This paper uses r(w, wl) to measure the association between w and a word wl in its definition. • tf(w, wl): the occurrence count of wl in w's definition. • ef(wl): the number of words that have wl in their definitions.
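The slides give tf and ef but not the exact combination used for r(w, wl), so the log weighting below is an assumption, chosen as the standard tf*idf form that the abstract says the score is a variation of. The DEFS entries are illustrative.

```python
import math

# Toy definitions (illustrative, not from an actual MRD).
DEFS = {
    "crane": ["large", "bird", "long", "legs"],
    "heron": ["large", "bird", "long", "neck"],
    "stove": ["apparatus", "burns", "fuel"],
}

def tf(w, wl):
    """Occurrence count of wl in w's definition."""
    return DEFS[w].count(wl)

def ef(wl):
    """Number of words whose definition contains wl."""
    return sum(1 for d in DEFS.values() if wl in d)

def r(w, wl):
    """A tf*idf-style association score; the paper's exact weighting
    is not reproduced here, so this log form is an assumption."""
    return tf(w, wl) * math.log(len(DEFS) / ef(wl))
```

Under this scoring, "fuel" (which appears in only one definition) gets a higher weight than "bird" (which appears in two), matching the intuition on the previous slide that rare definition words are the most informative.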
Measuring Similarities - 4/6 • C(wl, w) is iteratively calculated as:
Measuring Similarities – 5/6 • S: a set of stop words • δβ,N: the top βN words; β is set to 0.6 • w' ∈ e(w) ∧ (w', wl) ∈ Ei−1. α is a weighting parameter; words from the i-th iteration have a weight of α^(i−1).
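The iterative expansion can be sketched roughly as follows: at iteration i, words reached through the definitions of definition words are added with weight α^(i−1). This is a hedged simplification — the stop-word set S, the top-βN cutoff, and the exact update from the slide are omitted, and the DEF_GRAPH entries are made up for illustration.

```python
# Toy word -> definition-words mapping (illustrative).
DEF_GRAPH = {
    "crane": ["bird"],
    "bird": ["animal", "wings"],
    "animal": ["living", "thing"],
}

def expand(word, alpha=0.5, iterations=2):
    """Accumulate definition words reachable within `iterations` hops,
    weighting the i-th hop's contributions by alpha**(i-1)."""
    vec = {}
    frontier = [word]
    for i in range(1, iterations + 1):
        weight = alpha ** (i - 1)
        next_frontier = []
        for w in frontier:
            for wl in DEF_GRAPH.get(w, []):
                vec[wl] = vec.get(wl, 0.0) + weight
                next_frontier.append(wl)
        frontier = next_frontier
    return vec
```

Expanding "crane" for two iterations gives "bird" weight 1 (= α^0) and "animal"/"wings" weight α, so later iterations contribute progressively less, as the slide's α^(i−1) weighting intends.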
Measuring Similarities – 6/6 • Pearson’s product-moment correlation coefficient
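Pearson's product-moment correlation coefficient, used to compare two definition vectors, can be computed directly from its definition (covariance divided by the product of standard deviations):

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

In practice the two definition vectors would first be aligned over the same word dimensions; perfectly proportional vectors score 1, anti-correlated ones −1.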
Evaluation – Chinese Set (M&C data set)
Evaluation – Chinese Set • Mdc: • Suffers from the data-sparseness problem of the Chinese Web. • Search results in Chinese contain many more duplicates than their English counterparts. • Therefore, the double-checking approach is not suitable for the Chinese task. • Mhow: • Mhow outputs 1 for "鳥, 鶴" (bird, crane), since "鳥" (bird) is the hypernym of "鶴" (crane). • But for "熔爐" (melting furnace) and "火爐" (stove), Mhow outputs a very low similarity, which is contrary to human intuition.
Evaluation – M&C Data Set • Fig. 4 shows the effect of different values of α and of the iteration count on the English data set.
Conclusions • A novel method that uses a dictionary as the main resource for measuring word similarities. • Each dimension of the vector represents a word, and its value represents the importance of that word in the definition. • The importance value is calculated like tf*idf over definitions.