
Improving Named Entity Translation Combining Phonetic and Semantic Similarities


Presentation Transcript


  1. Improving Named Entity Translation Combining Phonetic and Semantic Similarities. Fei Huang, Stephan Vogel, Alex Waibel. Language Technologies Institute, School of Computer Science, CMU. NAACL 2004

  2. Introduction • In the 2001 C-E translation evaluation test data, 20% of the NEs are not covered by the 50K-entry LDC C-E translation lexicon. • Most previous studies focused only on phonetic information. • Some NEs are not translated by phonetic value (e.g. “南懷仁” → “Ferdinand Verbiest”). • This work combines phonetic similarity (transliteration) and semantic similarity (context) to cover these non-transliterated NEs. • Source language: Chinese • Target language: English

  3. Surface String Transliteration • Training data: • LDC C-E dictionary • Bootstrapping unsupervised learning • Learning transliteration probabilities between pinyin and English letters • Pre-processing: Romanize Chinese words into pinyin. • 0th iteration: use editing distance to generate mappings between Chinese and English word pairs. • The 3,000 word translations with minimum editing distance from the 0th iteration are used to estimate new transliteration probabilities. • New translation mappings are then repeatedly generated with the new transliteration probabilities. • In each iteration, an additional 500 pairs with minimum transliteration cost are added to the existing NE pair list to re-estimate the transliteration probabilities. • Repeat until adding more NE pairs no longer improves the extraction accuracy.
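A minimal sketch of this bootstrapping loop, assuming word pairs are plain pinyin/English character strings. It substitutes Levenshtein alignment for the paper's transliteration model and a fixed iteration count for the accuracy-based stopping rule; the seed_size, step, and smoothing values are illustrative, not the paper's.

```python
import math
from collections import defaultdict

def edit_align(src, tgt):
    """Levenshtein DP; returns (edit distance, matched/substituted char pairs)."""
    n, m = len(src), len(tgt)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    pairs, i, j = [], n, m          # backtrace, collecting character mappings
    while i > 0 and j > 0:
        sub = 0 if src[i - 1] == tgt[j - 1] else 1
        if d[i][j] == d[i - 1][j - 1] + sub:
            pairs.append((src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return d[n][m], pairs

def estimate_probs(word_pairs):
    """Estimate p(english_char | pinyin_char) from aligned character pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for pinyin, english in word_pairs:
        for s, t in edit_align(pinyin, english)[1]:
            counts[s][t] += 1
    return {s: {t: c / sum(ts.values()) for t, c in ts.items()}
            for s, ts in counts.items()}

def bootstrap(candidates, seed_size=3000, step=500, iterations=5):
    """0th iteration: seed with the lowest-edit-distance pairs; then repeatedly
    add the cheapest remaining pairs and re-estimate the probabilities."""
    accepted = sorted(candidates, key=lambda p: edit_align(*p)[0])[:seed_size]
    probs = estimate_probs(accepted)
    for _ in range(iterations):    # the paper stops when accuracy stops improving
        def cost(pair):            # transliteration cost = negative log-probability
            return -sum(math.log(probs.get(s, {}).get(t, 1e-6))
                        for s, t in edit_align(*pair)[1])
        chosen = set(accepted)
        remaining = [p for p in candidates if p not in chosen]
        accepted += sorted(remaining, key=cost)[:step]
        probs = estimate_probs(accepted)
    return probs
```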

  4. Contextual Semantic Similarity • Training data: a subset of the English Xinhua News corpus • Context vector selection: • POS • Phi-square statistic • Weight of POS • Distance • Weight of location • Weight vector
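The weighting formulas on this slide were images and did not survive the transcript. For reference, the standard phi-square co-occurrence statistic, which the paper uses to select context words, has the form below; the paper's exact POS and distance weights are not recoverable here, and the 2x2 contingency-table notation is mine.

```latex
% Phi-square association between a context word w and the NE e,
% from a 2x2 contingency table: a = both occur, b = only w occurs,
% c = only e occurs, d = neither occurs.
\phi^2(w, e) = \frac{(ad - bc)^2}{(a + b)(c + d)(a + c)(b + d)}
```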

  5. Contextual Semantic Similarity

  6. Contextual Semantic Similarity • Semantic similarity between context vectors: • P(vf|ve) is computed with a modified IBM translation Model 2 (Brown et al., 1993): • I: the length of the source vector • J: the length of the target vector • p(e|f): the word translation probability estimated from a C-E aligned corpus with IBM Model 1 • P(ve|vf) is estimated in a similar way
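The similarity formulas on this slide were likewise lost in extraction. Under the standard IBM Model 2 decomposition the slide cites, the vector translation probability would take the form below, with a(i | j, I, J) the position alignment probability; how the paper's modification departs from this form is not recoverable from the transcript.

```latex
% IBM Model 2 probability of the target context vector
% v_f = f_1 ... f_J given the source context vector v_e = e_1 ... e_I:
P(v_f \mid v_e) = \prod_{j=1}^{J} \sum_{i=1}^{I} p(f_j \mid e_i)\, a(i \mid j, I, J)
```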

  7. Cross-lingual Retrieval for NE translations

  8. Cross-lingual Retrieval for NE translations • English NEs in the retrieved text are automatically tagged with BBN's IdentiFinder™ (Bikel et al., 1997). • Overall similarity score: the transliteration and semantic similarities are combined. • The NE pairs with the highest overall similarity scores are considered translations. • Since an NE can be translated in several different ways, and typos occur at times, among the top NE hypotheses with similar spellings, the one with the highest frequency is chosen as the translation.
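The overall score's exact combination was on the slide image and is lost; the sketch below assumes a simple linear interpolation of the two similarities (the weight lam is hypothetical) and implements the frequency-based choice among similarly spelled top hypotheses, reusing edit_align from the transliteration sketch above as a stand-in for "similar spelling".

```python
from collections import Counter

def overall_score(phonetic_sim, semantic_sim, lam=0.5):
    """Combine the two similarities; linear interpolation is an assumption."""
    return lam * phonetic_sim + (1 - lam) * semantic_sim

def pick_translation(hypotheses, top_k=10, max_dist=2):
    """hypotheses: list of (english_ne, phonetic_sim, semantic_sim).
    Rank by overall score, then return the most frequent spelling variant
    among the top-k hypotheses within max_dist edits of the best one."""
    ranked = sorted(hypotheses, key=lambda h: overall_score(h[1], h[2]),
                    reverse=True)
    best = ranked[0][0]
    cluster = [ne for ne, _, _ in ranked[:top_k]
               if edit_align(best, ne)[0] <= max_dist]
    return Counter(cluster).most_common(1)[0][0]
```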

  9. Cross-lingual Retrieval for NE translations • Sentence-based or document-based indexing? • Test data: Chinese newswire documents • 114 Chinese NEs were selected and translated manually • Indexed corpus: 963,478 English documents from the Xinhua News Agency • Retrieval model: TF-IDF • The top 1,000 results are regarded as the relevant text • The recall of document-based indexing is better (70% compared with 60%).
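A minimal sketch of the document-based TF-IDF retrieval step using scikit-learn; the paper's actual indexing engine is not named on the slide, and english_docs / query are placeholders (the query being, e.g., the translated context of a Chinese NE).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_documents(english_docs, query, top_n=1000):
    """Index the corpus with TF-IDF and return the top_n documents
    most similar to the query, as (document, score) pairs."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(english_docs)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    top = scores.argsort()[::-1][:top_n]
    return [(english_docs[i], float(scores[i])) for i in top]
```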

  10. Experiment Results • Test dataset: • NIST 2002 Machine Translation Evaluation test data • 100 Chinese documents, 878 sentences, 25,430 words • 2,469 NEs are automatically tagged • (PER: 20%, LOC: 60%, ORG: 20%) • Only PER and LOC are considered. • Among the 1,898 tagged PERs and LOCs, 338 are true NEs not covered by the LDC lexicon. • Baseline system: the CMU statistical MT system (Vogel et al., 2003)

  11. Experiment Results
