
Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Takashi Tsunakawa (1), Naoaki Okazaki (1), Jun’ichi Tsujii (1,2). LREC 2008, 29 May 2008. (1) Department of Computer Science, Graduate School of Information Science and Technology, University of Tokyo


Presentation Transcript


  1. Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language • Takashi Tsunakawa (1), Naoaki Okazaki (1), Jun’ichi Tsujii (1,2) • LREC 2008, 29 May 2008 • (1) Department of Computer Science, Graduate School of Information Science and Technology, University of Tokyo • (2) School of Computer Science, University of Manchester / National Centre for Text Mining

  2. Introduction • Building bilingual lexicons via pivot languages • [Figure: the Chinese term 计步器 (jìbùqì) is linked by the C-E lexicon to the English terms “odometer” and “pedometer”, which the E-J lexicon links to the Japanese terms オドメーター (odomētā), ペドメータ (pedomēta), ペドメーター (pedomētā), 歩数計 (hosūkei), and 万歩計 (mampokei)]

  3. Introduction • Building bilingual lexicons via pivot languages • 计步器 (jìbùqì) • odometer: オドメーター (odomētā) • pedometer: ペドメータ (pedomēta), ペドメーター (pedomētā), 歩数計 (hosūkei), 万歩計 (mampokei) • [Images: Creative Commons Attribution ShareAlike 2.0 License, by skippy13]

  4. Advantages of the pivotal approach • Constructing a Japanese-Chinese lexicon from Japanese-English and English-Chinese lexicons through English terms • J-E and E-C lexicons are well supported for many terms and domains, compared to J-C lexicons • Especially for technical terms, there are few J-C lexicons because technical terms are in most cases first written in English • The pivotal approach could help us to (semi-)automatically find J-C translation term pairs

  5. Mismatch problem • We cannot find a Chinese-Japanese term pair whose terms do not share an identical English translation. • Is it possible to generate the following lexical item?

  6. Merging Two Bilingual Lexicons • “Exact merging” • cannot merge pairs that do not share an identical English translation (the mismatch problem) • Challenges to merge more terms • “Word-based merging” • “Alignment-based merging”
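
As an illustration of exact merging and the mismatch problem, here is a minimal Python sketch. It assumes each lexicon is a list of term pairs; the function name `exact_merge` and the sample entries are hypothetical and are not taken from the lexicons used in the talk.

```python
from collections import defaultdict


def exact_merge(src_pivot_pairs, pivot_tgt_pairs):
    """Join two lexicons on identical pivot (English) terms.

    src_pivot_pairs: iterable of (source_term, pivot_term)
    pivot_tgt_pairs: iterable of (pivot_term, target_term)
    """
    pivot_to_tgt = defaultdict(set)
    for pivot, tgt in pivot_tgt_pairs:
        pivot_to_tgt[pivot].add(tgt)

    merged = set()
    for src, pivot in src_pivot_pairs:
        # Only an identical pivot string produces a merged pair.
        for tgt in pivot_to_tgt.get(pivot, ()):
            merged.add((src, tgt))
    return merged


# Illustrative entries only (assumed for this sketch).
c_e = [("计步器", "pedometer"), ("全球变暖", "global heating")]
e_j = [("pedometer", "歩数計"), ("pedometer", "万歩計"),
       ("global warming", "地球温暖化")]

merged = exact_merge(c_e, e_j)
print(sorted(merged))
```

In this toy data, 全球变暖 stays unmerged because “global heating” and “global warming” are different strings, which is the mismatch the word-based and alignment-based methods try to overcome.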

  7. Word-based merging • Tokenize a term into word tokens, and • Translate each word by the bilingual lexicon • [Figure: 全球变暖 (quánqiú-biànnuǎn) – “global heating” – 地球 温暖化 (chikyū - ondanka)]

  8. Alignment-based merging: Overview • Align each word, • Calculate word translation probabilities, and • Translate each word by the probabilities • [Figure: word alignment of 全球 变暖 with the English words “global”, “heating”, and “warming”, and of those English words with 地球 温暖化]

  9. Alignment-based merging: Overview • [Flow diagram: the C-E lexicon and the J-E lexicon each yield translation word pairs (with probabilities); merging the word pairs and re-calculating the probabilities gives J-C translation word pairs (with probabilities); word-by-word translation (phrase-based SMT), with term frequencies on the Web added, produces Japanese translations of the C-E lexicon]

  10. Alignment-based merging • Apply word alignment (GIZA++; Och & Ney, 2003) to all term pairs • Calculate word translation probabilities from co-occurrence frequencies, for both bilingual lexicons: source (f)-pivot (p) and pivot (p)-target (e) • $C(w_p, w_f; a_{p\text{-}f})$: co-occurrence frequency of $w_p$ and $w_f$, which are aligned by GIZA++
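
The slide's formula is not preserved in this transcript. As a hedged sketch, the following Python assumes the usual relative-frequency estimate, p(w_f | w_p) = C(w_p, w_f) / Σ_{w'} C(w_p, w'), computed over word pairs that an aligner has already linked. The function name and the toy alignments are hypothetical; no GIZA++ call is made.

```python
from collections import Counter, defaultdict


def translation_probs(aligned_pairs):
    """Estimate p(w_f | w_p) by relative frequency of aligned word pairs.

    aligned_pairs: iterable of (w_p, w_f), one entry per alignment link.
    Returns a nested dict: probs[w_p][w_f] = p(w_f | w_p).
    """
    counts = Counter(aligned_pairs)      # C(w_p, w_f)
    totals = Counter()                   # sum over w_f of C(w_p, w_f)
    for (w_p, _), c in counts.items():
        totals[w_p] += c

    probs = defaultdict(dict)
    for (w_p, w_f), c in counts.items():
        probs[w_p][w_f] = c / totals[w_p]
    return probs


# Toy alignment links (assumed for illustration).
aligned = [("global", "全球"), ("global", "全球"),
           ("global", "全局"), ("heating", "变暖")]
p = translation_probs(aligned)
print(p["global"])
```

The same function would be applied once per lexicon, giving one table for the source-pivot side and one for the pivot-target side.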

  11. Alignment-based merging • Calculate word translation probabilities from a target-language word to a source-language word (Utiyama & Isahara, 2007):
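
The formula itself is missing from the transcript. A common way to combine two probability tables through a pivot, and the one assumed in this sketch, is to sum over pivot words: p(w_f | w_e) = Σ_{w_p} p(w_f | w_p) · p(w_p | w_e). The function name and the toy probabilities below are hypothetical.

```python
def pivot_probs(p_f_given_p, p_p_given_e):
    """Combine two translation tables through the pivot language.

    p_f_given_p[w_p][w_f] = p(w_f | w_p)
    p_p_given_e[w_e][w_p] = p(w_p | w_e)
    Returns out[w_e][w_f] = sum over w_p of p(w_f | w_p) * p(w_p | w_e).
    """
    out = {}
    for w_e, pivots in p_p_given_e.items():
        dist = {}
        for w_p, p_pe in pivots.items():
            for w_f, p_fp in p_f_given_p.get(w_p, {}).items():
                dist[w_f] = dist.get(w_f, 0.0) + p_fp * p_pe
        out[w_e] = dist
    return out


# Toy tables (assumed values, for illustration only).
p_p_given_e = {"温暖化": {"warming": 0.5, "heating": 0.5}}
p_f_given_p = {"warming": {"变暖": 1.0},
               "heating": {"变暖": 0.5, "加热": 0.5}}

p_f_given_e = pivot_probs(p_f_given_p, p_p_given_e)
print(p_f_given_e["温暖化"])
```

Because the sum runs over every pivot word, a pair can receive a nonzero probability even when the two lexicons never share an identical English term.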

  12. Alignment-based merging • Calculate the translation probabilities (scores) based on the noisy channel model (Brown et al., 1990), where the formula indexes the i-th word of $w_e$ • The language model $p(w_e)$ is calculated from the number of Web search results (Google) for the term $w_e$ • $p(w_e) \propto$ (hit count of $w_e$) • Generate the merged lexicon from the pairs whose translation probabilities are greater than zero in both directions • New_Lexicon = $\{(w_f, w_e) \mid \Pr(w_e \mid w_f) > 0 \text{ and } \Pr(w_f \mid w_e) > 0\}$
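
The scoring step can be sketched as follows, under several assumptions that go beyond the slide: translation is word-by-word and monotone, the score of a candidate is hit_count(w_e) · Π_i p(w_f,i | w_e,i), and hit counts come from a plain dictionary standing in for a search engine. All names and numbers are hypothetical, and the sketch covers only one direction of the bidirectional filter.

```python
from itertools import product


def translate_term(src_words, p_f_given_e, hit_count):
    """Rank target terms for a tokenized source term.

    src_words: list of source-language words (w_f tokens)
    p_f_given_e[w_e][w_f] = p(w_f | w_e), the channel model
    hit_count[term] = frequency used as an unnormalized language model
    Returns a list of (target_term, score), best first, scores > 0 only.
    """
    options = []
    for w_f in src_words:
        # Target words that can generate this source word.
        cands = [(w_e, dist[w_f]) for w_e, dist in p_f_given_e.items()
                 if dist.get(w_f, 0.0) > 0.0]
        if not cands:
            return []          # some word cannot be translated at all
        options.append(cands)

    scored = []
    for combo in product(*options):
        term = "".join(w_e for w_e, _ in combo)
        score = float(hit_count.get(term, 0))
        for _, prob in combo:
            score *= prob
        if score > 0.0:
            scored.append((term, score))
    return sorted(scored, key=lambda item: -item[1])


# Toy model and hit counts (assumed values).
channel = {"地球": {"全球": 0.6}, "全球": {"全球": 0.4},
           "温暖化": {"变暖": 0.75}}
hits = {"地球温暖化": 1000, "全球温暖化": 10}

ranked = translate_term(["全球", "变暖"], channel, hits)
print(ranked)
```

Here the hit counts let the more frequent term 地球温暖化 outrank 全球温暖化 even though both are licensed by the channel model.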

  13. Experimental settings • Used lexicons: Bilingual lexicons that consist of technical terms • C-E: Wanfang Data E-C & C-E Science and Technology Dictionary • J-E: JST Machine Translation Dictionary • By “exact merging,” we can translate about 22% of Japanese (or Chinese) terms (the utilization ratio)

  14. Experimental results • Utilization ratio • Alignment-based merging drastically improved the utilization ratio, and the size of merged lexicon also increased • Accuracy (by manual evaluation) • MRR: Mean Reciprocal Rank (Voorhees, 1999) calculates the mean of reciprocal ranks over all source terms • Prec1: Precision of the highest ranked terms • Prec10: Precision that the 10-best outputs include the correct one
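
The three accuracy measures can be computed as in the sketch below. It assumes that a source term with no correct output contributes zero to every measure, and it uses a hypothetical `gold` dictionary in place of the human judgments; the sample data are invented for illustration.

```python
def evaluate(ranked_outputs, gold):
    """Compute MRR, Prec1 and Prec10 over all source terms.

    ranked_outputs[src] = list of candidate translations, best first
    gold[src] = set of translations judged correct for src
    """
    n = len(ranked_outputs)
    if n == 0:
        return 0.0, 0.0, 0.0

    rr_sum, top1, top10 = 0.0, 0, 0
    for src, candidates in ranked_outputs.items():
        correct = gold.get(src, set())
        # Rank (1-based) of the highest-ranked correct candidate, if any.
        rank = next((i for i, c in enumerate(candidates, start=1)
                     if c in correct), None)
        if rank is None:
            continue
        rr_sum += 1.0 / rank
        top1 += rank == 1
        top10 += rank <= 10
    return rr_sum / n, top1 / n, top10 / n


# Invented outputs and judgments.
outputs = {"a": ["x", "y"], "b": ["p", "q"], "c": ["m"]}
gold = {"a": {"x"}, "b": {"q"}, "c": {"z"}}

mrr, prec1, prec10 = evaluate(outputs, gold)
print(mrr, prec1, prec10)
```

With ranks 1, 2, and none for the three terms, the reciprocal ranks are 1, 0.5, and 0, so MRR is 0.5.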

  15. Experimental results: Examples (1/2) • A Chinese-to-Japanese example of “角膜 实质 炎” (jiǎomó - shízhì - yán; keratitis parenchymatosa)

  16. Experimental results: Examples (2/2) • A J-to-C example of “発育 状態” (hatsuiku - jōtai; growth status)

  17. Conclusion • Alignment-based merging of two bilingual lexicons via a pivot language is proposed • The alignment-based merging achieved a utilization ratio of at least 75% in our experiments • The precision remains at 0.14 (Japanese-to-Chinese) and 0.20 (Chinese-to-Japanese), which could be improved by a more sophisticated scoring method • Future directions • To choose the correct translation by examining the context or semantic classes of source and target terms • To evaluate a machine translation system with this lexicon integrated

  18. Thank you for your attention • Acknowledgments • MEXT, Japan • Japan Science and Technology Agency (JST), Japan • NICT, Japan • Wanfang Data, China

  19. Experimental Results • Our system could generate at least one Japanese translation for 73.4% (385,509/525,259) of the items in the C-E lexicon • [Table columns: Chinese input term, Japanese reference translation; correct Japanese translations are highlighted] • Examples: (infectious hepatitis virus, 感染性肝炎ウイルス), (coliphage, 大腸菌ファージ)

  20. Experimental Results • Same characters, but the meanings are not identical • Examples: (acoustic delay line storage, 音響遅延線記憶装置), (complement form, 補数形式)

  21. Manual evaluation • A human evaluator checked the translation results of 200 Chinese terms classified in the category of “Computer” by the C-E lexicon • Terms that could be translated into Japanese: 181 (90.5%) • Terms whose top-10 translations included the correct one: 135 (67.5%) • Terms whose top translation was correct: 73 (36.5%) • MRR (mean reciprocal rank) = 0.466 • The average of the inverse of the rank of the highest-ranked correct translation

  25. Manual evaluation • 1. 数 组 元素 – array element – 配列 元素 • The Japanese translation is not used in real texts. • Possible solutions: • Strengthen the language model • Adjust the weights of the features • 2. 计算机 化 管理 学会 – ICM – 特 発 性 心筋 障害 • The Chinese means “Institution for Computerization Management”, and the Japanese means “Idiopathic Cardiomyopathy” • Possible solutions: • Special treatment for acronyms • 3. 信息量 – information content – 量 • The Japanese dropped the translation of “information” • Possible solutions: • Add parallel corpora for training • 4. 转镜 式激 光束 影像 记录 仪 – laser beam rotating mirror image recorder – (no Japanese translations) • All the English words seem to be common, but the system failed to generate Japanese translations (maybe because the score was below the threshold for searching hypotheses)

  26. Conclusion • We proposed a method that uses phrase-based SMT to construct a J-C lexicon from J-E and C-E lexicons. • We obtained J translations for 73.4% of the items in the C-E lexicon, outperforming “exact matching” (22.2%). • 36.5% of the top J translations were correct, and 67.5% of the top-10 J translations included the correct one. • This method could support the manual construction of bilingual dictionaries, and the resulting lexicon could be used for MT. • Future work • Parameter optimization of SMT by using existing J-C lexicons • Chinese character similarity that considers the similarity between individual characters • More sophisticated reordering model (considering parts of speech) • Other translation directions (EJ, JC, EC)

  27. Acquisition of Translation Pairs of Technical Terms • Large-scale translation dictionaries (lexicons) of technical terms are required for translating technical documents • For constructing such dictionaries, we must ask experts who can deal with both languages • This is very costly • We must keep up with the rapid increase of new terms • Automatic acquisition of translation candidates of technical terms • Support for constructing the dictionary • Improvement of the performance of machine translation systems

  28. J-E bilingual lexicon • 527,206 translation pairs • Numbers of distinct terms: 465,565 J terms, 509,259 E terms

  29. C-E bilingual lexicon • Wanfang Data E-C & C-E Science and Technology Dictionary • 525,259 pairs

  30. Construction of the C-J bilingual lexicon • Attach Japanese translations for each lexical item of C-E lexicon

  31. Overview of constructing J-C lexicon • We treat the C-E and J-E lexicons as parallel corpora and use them as training data for constructing a J-C SMT system • Applying an SMT approach to the C-E and J-E lexicons makes word/phrase-level merging in English possible • We apply C-J phrase-based SMT to the Chinese terms in the C-E lexicon • Statistical approaches seem to be effective because of similarities of semantics and word order between C and J • It is easy to introduce other clues such as Chinese character similarity

  32. Collecting J-E & C-E translation phrase pairs • Apply morphological analyzers, and obtain word alignments by GIZA++ (Och and Ney, 2003) for J-E and C-E lexicons • Collect phrase pairs by the “grow-diag-final” method (using Moses; Koehn et al., 2007) and calculate the probabilities by relative frequencies • Example from the J-E lexicon: ころがり 疲れ 寿命 – rolling fatigue life

  33. Merging phrase pairs (Utiyama & Isahara, 2007) (J-E & E-C phrases to J-C phrases)

  34. Merging phrase pairs (Utiyama & Isahara, 2007) (J-E & E-C phrases to J-C phrases) ($Z_e$ is a normalization factor)

  35. Features for learning of the log-linear model • We employ the following features h1-h4 for the log-linear model: • h1: Phrase translation probability, defined over the i-th phrase pairs of the translation • h2: 3-gram language model of the target language, where $p(w_e)$ is a language model probability from other monolingual corpora • h3: Phrase reordering penalty (Koehn et al., 2003) • h4: Chinese character similarity (Zhang et al., 2005)

  36. Feature 3: Phrase reordering penalty (Koehn et al., 2003) • The feature value is the sum of the penalties $d = -|a_i - b_{i-1} - 1|$ over the phrase pairs $w_e$, $w_f$ • where $a_i$ is the position of the first word of $w_f$ and $b_{i-1}$ is the position of the last word of the $w_f$ translated in the previous step • Example (source words f1…f8, target words e1…e6): • d(e1 e2, f1 f2 f3) = 0 • d(e3, f8) = −|8 − 3 − 1| = −4 • d(e4, f6 f7) = −|6 − 8 − 1| = −3 • d(e5 e6, f4 f5) = −|4 − 7 − 1| = −4 • h3(e1…e6, f1…f8) = −11
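
The worked example on this slide can be reproduced with a short Python sketch. It reads the penalty off the slide's arithmetic as d = −|a_i − b_{i−1} − 1| and assumes b_0 = 0 for the first phrase, which matches d(e1 e2, f1 f2 f3) = 0. The function name and the span representation are hypothetical.

```python
def reordering_penalty(source_spans):
    """Sum of distortion penalties for phrases listed in target order.

    source_spans: list of (first, last) 1-based source positions, one
    span per phrase pair, in the order the target phrases are produced.
    """
    total = 0
    prev_last = 0                      # assumed b_0 = 0
    for first, last in source_spans:
        total += -abs(first - prev_last - 1)
        prev_last = last
    return total


# Spans from the slide: (f1 f2 f3), (f8), (f6 f7), (f4 f5).
spans = [(1, 3), (8, 8), (6, 7), (4, 5)]
h3 = reordering_penalty(spans)
print(h3)  # -11
```

A fully monotone translation, where each phrase starts right after the previous one ends, gets a penalty of zero.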

  37. Feature 4: Chinese character similarity • Chinese and Japanese writing systems both have Chinese characters, and their similarity should be a powerful clue to derive the translation phrase pairs (Zhang et al., 2005) • We define the feature value h4 between $w_e$ and $w_f$ as follows: • Differences between the Chinese and Japanese forms of characters are ignored • Example: h4(万歩計, 计步器) = h4(万歩計, 計歩器) = h4(ABC, CBD) = 1 − 2/3 = 0.333
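
The definition of h4 is not preserved in this transcript. One reading that reproduces the slide's example, and the one assumed in this sketch, is h4 = 1 − (edit distance) / (length of the longer string), computed after mapping variant character forms to a common form. The variant table below is a tiny hypothetical stand-in for a real Chinese-Japanese character mapping.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def char_similarity(w_e, w_f, variant_map=None):
    """Assumed h4: 1 - edit_distance / max length, after normalization."""
    variant_map = variant_map or {}
    w_e = "".join(variant_map.get(c, c) for c in w_e)
    w_f = "".join(variant_map.get(c, c) for c in w_f)
    longest = max(len(w_e), len(w_f))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(w_e, w_f) / longest


# Hypothetical variant table: simplified Chinese form -> Japanese form.
variants = {"计": "計", "步": "歩"}

h4 = char_similarity("万歩計", "计步器", variants)
print(round(h4, 3))  # 0.333
```

After normalization the pair becomes 万歩計 and 計歩器, which differ in two of three positions, giving 1 − 2/3 as on the slide.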
