Learning Phonetic Similarity for Matching Named Entity Translations and Mining New Translations

Wai Lam, Ruizhang Huang, Pik-Shan Cheung • Department of Systems Engineering & Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong • {wlam,rzhuang,pscheung}@se.cuhk.edu.hk • SIGIR 04, 2004/09/09

Abstract • A novel named entity matching model that considers both semantic and phonetic clues. • The matching model is formulated as a bipartite weighted graph matching optimization problem. • Three learning algorithms are investigated for obtaining the similarity information of basic phoneme units from training examples.

Introduction • Using bilingual dictionaries • Systems will encounter difficulties • The OOV problem (new or unseen terms) • Submitted queries for news search consist of named entities or proper nouns (1997) • Consider phonetic information • A method called Convec was developed to generate a bilingual lexicon from a comparable corpus (1998) • A similarity-based backward transliteration approach (2002) • Automatic identification of word translations from unrelated English and German corpora (1999) • Mining parallel documents from parallel Web sites (SIGIR 1999) • Mining term translations from Web anchor text (2002)

Named Entity Matching Model (2.1) - Problem Nature - • Given a pair of named entities that are translations of each other, the task is to find which parts of the entities match. • Compute the similarity between two given named entities written in two languages. • Note that this is a different problem from cross-language transliteration. • Examples: • University of Akron → 阿克倫大學 • Palo Alto Chamber of Commerce → 帕洛阿爾商會 • Two issues (e.g. 阿克倫大學, 帕洛阿爾商會): tokenization, partial matching

Named Entity Matching Model (2.2) - Matching Model Investigation - • English entity E: <t1, t2, …, tm0> • Chinese entity C: <s1, s2, …, sn0> • Bilingual dictionary: from the Linguistic Data Consortium (LDC) • Three learning algorithms for phoneme units • A weight is associated with each word segment

Named Entity Matching Model (2.3) - Tokenization - • Consider a pair: • English entity E: <t1, t2, …, tm0> • Chinese entity C: <s1, s2, …, sn0> • Each tj is looked up in the bilingual dictionary. • The Chinese entity is scanned to find the word segment that maximally matches. • The degree of matching: • Treated as separate tokens: • If the degree of matching reaches or exceeds a certain threshold. • Adjacent terms that are not involved in any dictionary mapping are grouped together, e.g. the characters of 帕洛阿爾 are grouped into one token.
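The tokenization idea can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the slides: the function name `tokenize_chinese` and the toy dictionary are hypothetical, the "degree of matching" is replaced by a plain longest-substring test (the slide's formula and threshold are not reproduced here), and matched segments are assumed not to overlap.

```python
def tokenize_chinese(chinese, english_tokens, dictionary):
    """Split a Chinese entity into tokens using dictionary lookups.

    dictionary maps a lowercase English token to candidate Chinese
    translations (a hypothetical stand-in for the LDC dictionary).
    """
    matched = [False] * len(chinese)
    segment_end = {}  # start index -> end index of a dictionary-matched segment

    for tok in english_tokens:
        best = None
        for cand in dictionary.get(tok.lower(), []):
            pos = chinese.find(cand)
            # Keep the longest candidate found in the Chinese entity.
            if pos != -1 and (best is None or len(cand) > best[1] - best[0]):
                best = (pos, pos + len(cand))
        if best and not any(matched[best[0]:best[1]]):
            segment_end[best[0]] = best[1]
            for k in range(best[0], best[1]):
                matched[k] = True

    tokens, i = [], 0
    while i < len(chinese):
        if i in segment_end:
            # A segment mapped through the dictionary is a separate token.
            tokens.append(chinese[i:segment_end[i]])
            i = segment_end[i]
        else:
            # Group adjacent characters that took no part in any mapping.
            j = i
            while j < len(chinese) and not matched[j]:
                j += 1
            tokens.append(chinese[i:j])
            i = j
    return tokens


toy_dictionary = {"university": ["大學", "學"]}  # illustrative entries only
print(tokenize_chinese("阿克倫大學", ["University", "of", "Akron"], toy_dictionary))
```

In this toy case the dictionary accounts for 大學, and the remaining characters 阿克倫 stay together as one token, which is then left for phonetic matching.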

Named Entity Matching Model (2.4) Hybrid Semantic and Phonetic Matching Algorithm – 1/4 • Let • the English entity, E, be represented as tokens <e1,…,em> • the Chinese entity, C, be represented as tokens <c1,…,cn> • Construct an undirected bipartite weighted graph with vertex set V and edge set L. • The vertex set V is set to VE ∪ VC, where VE = {e1,…,em} and VC = {c1,…,cn}. • If a mapping is found, semantically or phonetically, between an English token ei and a Chinese token cj, there is an edge between them.

Named Entity Matching Model (2.4) Hybrid Semantic and Phonetic Matching Algorithm – 2/4 • Let the edge weight be m(ei,cj). • For a semantic mapping, m(ei,cj) = r. • For a phonetic mapping, m(ei,cj) ∈ (0,1] (described below). • After the edges and weights of the graph have been constructed: • The matching problem is reduced to finding a set of edges such that the total weight is maximized and each token is mapped to at most one token on the other side. • This requirement can be formulated as a bipartite weighted graph matching problem.

Named Entity Matching Model (2.4) Hybrid Semantic and Phonetic Matching Algorithm – 3/4 • Formal description of the problem: • Integer programming is NP-complete in general, but this assignment special case can be solved in polynomial time.
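One standard way to write the matching problem described in words on the previous slide is the following 0-1 program. It is a sketch, not necessarily the authors' exact formulation. The indicator variable x_{ij} is an assumption introduced here: it equals 1 when edge (e_i, c_j) is selected.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{(e_i,c_j)\in L} m(e_i,c_j)\,x_{ij} \\
\text{s.t.}\quad & \sum_{j:\,(e_i,c_j)\in L} x_{ij} \le 1 \quad \text{for every } e_i \in V_E, \\
& \sum_{i:\,(e_i,c_j)\in L} x_{ij} \le 1 \quad \text{for every } c_j \in V_C, \\
& x_{ij} \in \{0,1\}.
\end{aligned}
```

The two inequality families encode the requirement that each token is mapped to at most one token on the other side.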

Named Entity Matching Model (2.4) Hybrid Semantic and Phonetic Matching Algorithm – 4/4 • Reformulate the maximum cost assignment problem as a minimum cost assignment problem. • The Hungarian search algorithm can solve it efficiently. • Step 1: remove tokens that have no edges • Step 2: add dummy vertices • Step 3: add dummy edges with weight zero • Step 4: transform each edge weight m(ei,cj) to the cost F-m(ei,cj), where F =
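The four steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `match_tokens` is a hypothetical name, F is taken to be the largest edge weight (the slide's definition of F is not reproduced here), missing edges between real tokens are treated like zero-weight dummy edges, and an exhaustive search over permutations stands in for the Hungarian algorithm, so it is only practical for a handful of tokens.

```python
from itertools import permutations


def match_tokens(english, chinese, weights):
    """weights maps (english_token, chinese_token) -> m(ei, cj) > 0."""
    # Step 1: remove tokens that have no edges.
    eng = [e for e in english if any((e, c) in weights for c in chinese)]
    chi = [c for c in chinese if any((e, c) in weights for e in english)]
    if not eng or not chi:
        return [], 0.0

    # Steps 2 and 3: pad to a square matrix; dummy entries get weight zero.
    size = max(len(eng), len(chi))
    w = [[weights.get((eng[i], chi[j]), 0.0)
          if i < len(eng) and j < len(chi) else 0.0
          for j in range(size)]
         for i in range(size)]

    # Step 4: turn weights into costs F - m(ei, cj); F is assumed to be the max weight.
    F = max(max(row) for row in w)
    cost = [[F - x for x in row] for row in w]

    # Minimum cost assignment by brute force (stand-in for the Hungarian algorithm).
    best = min(permutations(range(size)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(size)))

    # Keep only real edges; dummy and zero-weight assignments are dropped.
    pairs = [(eng[i], chi[j]) for i, j in enumerate(best)
             if i < len(eng) and j < len(chi) and w[i][j] > 0]
    return pairs, sum(weights[p] for p in pairs)


toy_weights = {("e1", "c1"): 0.9, ("e1", "c2"): 0.4,
               ("e2", "c2"): 0.7, ("e2", "c3"): 0.8}
print(match_tokens(["e1", "e2", "e3"], ["c1", "c2", "c3"], toy_weights))
```

Because every cost is F minus a weight and every assignment uses the same number of entries, minimizing the total cost selects the same edges as maximizing the total weight.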

Phonetic Matching Model - Generating Phonetic Representation - • Similarity of two terms based on pronunciation. • Phonetic generation procedure: • English terms: using PRONLEX, a resource provided by LDC • For example: “father” → “faDR” • A letter-to-phoneme tagging lexicon and a set of transformation rules are used. • 458 basic phoneme units. • Chinese terms: using Pin-Yin symbols • For example: “港” → “gang3” • 791 basic phoneme units. • Cantonese terms: using Jyut-Ping symbols • For example: “爸” → “baa1” • 1139 basic phoneme units.

Phonetic Matching Model - Phonetic Matching Algorithm - • Given an English term and a Chinese term: • To calculate the similarity score, a phoneme pronunciation similarity (PPS) table must be prepared. • English-Mandarin: 348,831 entries • English-Cantonese: 502,299 entries • In particular, the number of entries for • English-Mandarin: 35,077 entries • English-Cantonese: 39,981 entries

Phonetic Matching Model - Phonetic Matching Algorithm - • Suppose: • An English term, A, is represented by the basic phoneme unit sequence <a1,…,am0> • A Chinese term, B, is represented by the basic phoneme unit sequence <b1,…,bn0> • Let Si,j be the optimal longest common subsequence similarity score, with the recursive formula as follows:
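A weighted longest-common-subsequence score of this kind can be sketched in Python. The recurrence used here is the common one, S[i][j] = max(S[i-1][j-1] + V(ai, bj), S[i-1][j], S[i][j-1]), and is an assumption: the authors' exact formula and any normalization are not reproduced. The function name `lcs_similarity` and the toy PPS entries are hypothetical.

```python
def lcs_similarity(a_units, b_units, pps):
    """Weighted LCS score between two phoneme unit sequences.

    pps maps (english_unit, chinese_unit) -> similarity in [0, 1];
    pairs missing from the table count as similarity 0.
    """
    m, n = len(a_units), len(b_units)
    # S[i][j] is the best score for the prefixes a_units[:i] and b_units[:j].
    S = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair_score = pps.get((a_units[i - 1], b_units[j - 1]), 0.0)
            S[i][j] = max(S[i - 1][j - 1] + pair_score,  # align ai with bj
                          S[i - 1][j],                   # skip ai
                          S[i][j - 1])                   # skip bj
    return S[m][n]


toy_pps = {("p", "p"): 1.0, ("a", "a"): 0.9, ("l", "lo"): 0.6, ("o", "lo"): 0.5}
print(lcs_similarity(["p", "a", "l", "o"], ["p", "a", "lo"], toy_pps))
```

In the toy example the unit "lo" can align with either "l" or "o" but not both, so the dynamic program picks the higher-scoring pairing.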

Learning Phonetic Similarity - The Widrow-Hoff Algorithm - • The Widrow-Hoff algorithm (learning the PPS table): • Yk: the similarity score of training example k. • Zk: 1 for a positive training example, 0 for a negative example. • Uk,i,j: a binary variable indicating whether example k involves phoneme unit i (English) and unit j (Chinese). • Vi,j: the score for a specific English phoneme unit i and Chinese phoneme unit j in the PPS table V. • ma (English) and mb (Chinese): the numbers of phoneme units.
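A single Widrow-Hoff (least mean squares) update can be sketched as below. Two assumptions are made for illustration: the score Yk is taken to be linear in the table, Yk = Σ Uk,i,j · Vi,j, which may differ from the authors' definition, and the learning rate `eta` and the function name `widrow_hoff_step` are hypothetical.

```python
def widrow_hoff_step(V, pairs, z, eta=0.1):
    """Update PPS table V in place on one training example.

    V:     dict mapping (english_unit, chinese_unit) -> score
    pairs: the unit pairs involved in the example (where U is 1)
    z:     target, 1 for a positive example and 0 for a negative one
    Returns the score Y computed before the update.
    """
    # Assumed linear score: sum of table entries for the involved pairs.
    y = sum(V.get(p, 0.0) for p in pairs)
    # Gradient step on the squared error (y - z)^2 for each involved entry.
    for p in pairs:
        V[p] = V.get(p, 0.0) - 2.0 * eta * (y - z)
    return y


table = {}
example = {("f", "b"), ("a", "aa")}  # illustrative unit pairs only
print(widrow_hoff_step(table, example, z=1))
print(widrow_hoff_step(table, example, z=1))
print(table)
```

Repeating the step on a positive example moves the involved entries upward so that the score approaches the target of 1.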

Learning Phonetic Similarity - The Exponentiated-Gradient Algorithm - • EG requires that the elements of V are nonnegative and sum to 1. • Each element in V is divided by max i,j (Vi,j). • The update is defined as: • where κ > 0 is the learning rate and Ψ is a normalization expression, namely the sum of the updated Vi,j.
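The standard exponentiated-gradient update can be sketched as follows, assuming the same linear score as in the usual EG setting, Yk = Σ Uk,i,j · Vi,j, and the usual multiplicative form Vi,j ← Vi,j · exp(−2κ(Yk − Zk)Uk,i,j) / Ψ. Both are assumptions rather than the slide's exact formula, and `eg_step` is a hypothetical name.

```python
import math


def eg_step(V, pairs, z, kappa=0.5):
    """Return a new PPS table after one exponentiated-gradient update.

    V must hold nonnegative entries that sum to 1.
    pairs are the unit pairs involved in the example (where U is 1).
    """
    y = sum(V[p] for p in pairs)  # assumed linear score
    updated = {}
    for p, v in V.items():
        u = 1.0 if p in pairs else 0.0
        # Multiplicative update; entries not involved keep their value here.
        updated[p] = v * math.exp(-2.0 * kappa * (y - z) * u)
    psi = sum(updated.values())  # normalization term
    return {p: v / psi for p, v in updated.items()}


uniform = {(i, j): 0.25 for i in "ab" for j in "xy"}  # illustrative units only
new_table = eg_step(uniform, {("a", "x")}, z=1)
print(new_table)
```

After normalization the entries still sum to 1, so raising the entry for an involved pair necessarily lowers the others.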

Learning Phonetic Similarity - The Genetic Algorithm - • Objective function: • Initial population • Fitness function • Selection • Crossover • Mutation

Experiments on Named Entity Matching Model • 20,000 Chinese-English person name pairs as training data. • 2,000 person name pairs, different from the training data, to evaluate the learning performance. • The average reciprocal rank (ARR) is used to measure the performance: • Manual: 0.78
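Average reciprocal rank is conventionally the mean of 1/rank over all test items, where rank is the position of the correct answer in the system's ranked list. A minimal sketch, with a hypothetical function name and made-up ranks used purely for illustration:

```python
def average_reciprocal_rank(ranks):
    """ranks: 1-based rank of the correct translation for each test name."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Three test names whose correct translations were ranked 1st, 2nd and 4th.
print(average_reciprocal_rank([1, 2, 4]))
```

A score of 1.0 would mean the correct translation is always ranked first.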

Experiments on Named Entity Matching Model • Evaluated the performance of the overall named entity matching model. • 1,000 named entities from the same corpus.

Mining New Entity Translations From News • Bilingual comparable news: • Online daily Web news stories. • To discover new, unseen named entities

Mining New Entity Translations From News

Experiments on Mining New Translations

Conclusions • A novel named entity matching model • Considers both semantic and phonetic clues • Three learning algorithms for training the phonetic similarity information • Flexible and comprehensive • The hybrid model can handle named entity matching. • Bilingual comparable news: • Online daily Web news stories. • To discover new, unseen named entities
