
Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi

Accurate Parallel Fragment Extraction from Quasi-Comparable Corpora using Alignment Model and Translation Lexicon. Chenhui Chu , Toshiaki Nakazawa , Sadao Kurohashi Graduate School of Informatics, Kyoto University. IJCNLP2013 (2013/10/17). Outline. Background Related Work


Presentation Transcript


  1. Accurate Parallel Fragment Extraction from Quasi-Comparable Corpora using Alignment Model and Translation Lexicon Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi Graduate School of Informatics, Kyoto University IJCNLP2013 (2013/10/17)

  2. Outline • Background • Related Work • Proposed Method • Experiments • Conclusion

  3. Outline • Background • Related Work • Proposed Method • Experiments • Conclusion

  4. Bilingual Corpora [Fung+ 2004] • Parallel corpora are scarce • Parallel sentences can be extracted from noisy and comparable corpora • Quasi-comparable corpora are more widely available, but they contain few parallel sentences

  5. Parallel Fragments • In quasi-comparable corpora, there could be parallel fragments in comparable sentences • Parallel fragments are also helpful for SMT • We aim to accurately extract parallel fragments from comparable sentences • Example: Zh: 应用/铅/离子/选择/电极/电位/滴定/法/测定/甘草/及/其/制品/中/的/甘草/酸 (Applying lead ion selective electrode potentiometric titration method to determine licorice and its products 's glycyrrhizic acid) Ja: </原/報/>/鉛/イオン/選択/性/電極を/用いる/混合/試料/中/の/…/と/電位/差/滴定/法/の/比較 (<Original Report> lead ion selective electrode used mixed sample 's … and potentiometric titration method 's comparison)

  6. Outline • Background • Related Work • Proposed Method • Experiments • Conclusion

  7. Parallel Sub-sentential Fragment Extraction [Munteanu+ 2006] • Extract a translation lexicon from a parallel corpus • Apply a lexicon filter to comparable sentences in two directions independently • Assign initial scores according to the lexicon • Smooth the scores to gain new knowledge that does not exist in the lexicon • Extract sub-sentential (not exactly parallel) fragments

  8. Lexicon Filter on Ja-to-Zh Direction [Figure: the lexicon filter applied to the Zh and Ja example sentences of slide 5, Ja-to-Zh direction]

  9. Lexicon Filter on Zh-to-Ja Direction [Figure: the lexicon filter applied to the Zh and Ja example sentences of slide 5, Zh-to-Ja direction]

  10. Outline • Background • Related Work • Proposed Method • Experiments • Conclusion

  11. System Overview • Use a more accurate lexicon filter • Use an alignment model to locate the source and target fragment candidates simultaneously [Figure: pipeline diagram with steps numbered (1) to (5); labels: source corpora, SMT, translated sentences, IR: top N results, target corpora, classifier, comparable sentences, alignment, parallel fragment candidates, parallel corpus, lexicon filter, parallel fragments]

  12. Parallel Fragment Candidate Detection by Alignment • Candidates are the monotonic, non-NULL and longest aligned fragments of more than 3 tokens
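
The candidate detection on slide 12 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a one-to-one word alignment given as a list where `alignment[i]` is the target index aligned to source token `i` (or `None` for a NULL alignment), and it reads "monotonic" as strictly increasing target indices. The function name and the input format are hypothetical.

```python
from typing import List, Optional, Tuple


def longest_monotonic_fragment(
    alignment: List[Optional[int]], min_len: int = 4
) -> Optional[Tuple[int, int]]:
    """Return the (start, end) source span, end exclusive, of the longest run
    of consecutive source tokens that are all aligned (non-NULL) and whose
    target indices strictly increase. Runs shorter than min_len tokens
    ("more than 3 tokens" means min_len = 4) are rejected."""
    best: Optional[Tuple[int, int]] = None
    start: Optional[int] = None
    prev: Optional[int] = None
    # A trailing None acts as a sentinel that closes the final run.
    for i, tgt in enumerate(alignment + [None]):
        if tgt is not None and (prev is None or tgt > prev):
            if start is None:
                start = i
            prev = tgt
            continue
        # The current run (if any) ends before position i.
        if start is not None and (best is None or i - start > best[1] - best[0]):
            best = (start, i)
        if tgt is not None:
            # An aligned token that breaks monotonicity starts a new run.
            start, prev = i, tgt
        else:
            start, prev = None, None
    if best is not None and best[1] - best[0] >= min_len:
        return best
    return None
```

With a symmetrized GIZA++ alignment, which can contain one-to-many links, the input would first have to be reduced to this simplified form; how that is done is not stated in the slides.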

  13. Lexicon Filter − Assign Initial Scores • Assign scores in two directions to the aligned word pairs in the candidates according to the translation lexicon
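
As a rough illustration of the scoring on slide 13, the sketch below looks up each aligned word pair in two directional lexicons. The dictionary layout, the function name, and the fallback score of -1.0 for pairs missing from a lexicon are assumptions made for this example; the slides do not give the exact scoring values.

```python
from typing import Dict, List, Tuple

Lexicon = Dict[Tuple[str, str], float]


def initial_scores(
    pairs: List[Tuple[str, str]],
    lex_s2t: Lexicon,
    lex_t2s: Lexicon,
    missing: float = -1.0,  # assumed score for pairs absent from a lexicon
) -> List[Tuple[str, str, float, float]]:
    """For each aligned (source, target) word pair, return the pair together
    with its source-to-target and target-to-source lexicon scores."""
    scored = []
    for src, tgt in pairs:
        fwd = lex_s2t.get((src, tgt), missing)
        bwd = lex_t2s.get((tgt, src), missing)
        scored.append((src, tgt, fwd, bwd))
    return scored


# Toy lexicon entries; the probabilities are made up for illustration.
toy_s2t = {("鉛", "铅"): 0.75, ("電極", "电极"): 0.5}
toy_t2s = {("铅", "鉛"): 0.5}
toy_pairs = [("鉛", "铅"), ("電極", "电极"), ("比較", "酸")]
toy_scored = initial_scores(toy_pairs, toy_s2t, toy_t2s)
```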

  14. Lexicon Filter − Score Smoothing • Only smooth a word with a negative score when both the left and right words around it have positive scores
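
The smoothing condition on slide 14 can be sketched as below. The condition itself (a negative score flanked by two positive scores) comes from the slide; replacing the score with the average of its two neighbours is an assumption for illustration, since the slide does not say which smoothed value is used.

```python
from typing import List


def smooth_scores(scores: List[float]) -> List[float]:
    """Smooth a negative score only when both neighbours are positive.
    The neighbours are read from the original scores, so one smoothed
    word never triggers the smoothing of another."""
    smoothed = list(scores)
    for i in range(1, len(scores) - 1):
        left, right = scores[i - 1], scores[i + 1]
        if scores[i] < 0 and left > 0 and right > 0:
            # Assumed smoothing rule: average of the two neighbours.
            smoothed[i] = (left + right) / 2
    return smoothed
```

For example, in `[0.5, -1.0, 0.25, -1.0, -1.0, 0.5]` only the second score is smoothed, because each of the two adjacent negative scores has a negative neighbour.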

  15. Fragment Extraction • Fragments of more than 3 tokens with continuous positive scores in both directions
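
A minimal sketch of the extraction rule on slide 15, assuming the scores of the two directions are given as two equal-length lists indexed by aligned word pair (a simplification; the function name and input format are hypothetical):

```python
from typing import List, Tuple


def extract_fragments(
    fwd: List[float], bwd: List[float], min_len: int = 4
) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, of maximal runs in which
    both directional scores are positive and the run has at least min_len
    positions ("more than 3 tokens" means min_len = 4)."""
    fragments = []
    start = None
    # A trailing negative pair acts as a sentinel that closes the last run.
    for i, (f, b) in enumerate(list(zip(fwd, bwd)) + [(-1.0, -1.0)]):
        if f > 0 and b > 0:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                fragments.append((start, i))
            start = None
    return fragments
```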

  16. Outline • Background • Related Work • Proposed Method • Experiments • Parallel Fragment Extraction • Translation • Conclusion

  17. Experimental settings (Parallel Fragment Extraction 1/2) • Parallel corpus: Zh-Ja abstract corpus (680k sentences, scientific domain) • Quasi-Comparable Corpora • Chinese corpora: CNKI (90k articles, 420k sentences, chemistry domain) • Japanese corpora: CiNii (880k articles, 5M sentences, scientific domain) • Comparable sentences: 30k chemistry domain sentences were extracted

  18. Experimental settings (Parallel Fragment Extraction 2/2) • Alignment: GIZA++ with symmetrization heuristics • Only: only use the extracted comparable sentences • External: together with 11k chemistry domain data in the parallel corpus • Translation lexicon • IBM Model 1 [Brown+ 1993] • Log-Likelihood-Ratio (LLR) [Munteanu+ 2006] • Sub-corpora sampling lexicon (SampLEX) [Vulic+ 2012] • Compare with [Munteanu+ 2006]
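
For readers unfamiliar with the LLR lexicon listed above, the sketch below computes the standard log-likelihood-ratio (G²) statistic for a 2×2 co-occurrence table of a source word and a target word. It is only an illustration of the general statistic; the exact variant and the thresholds used in [Munteanu+ 2006] or in these experiments are not given in the slides, and the function name and argument layout are hypothetical.

```python
import math


def llr(k11: int, k12: int, k21: int, k22: int) -> float:
    """G^2 statistic for a 2x2 contingency table.
    k11: sentence pairs where both words occur,
    k12: source word only, k21: target word only, k22: neither."""
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    total = 0.0
    for obs, row, col in (
        (k11, row1, col1),
        (k12, row1, col2),
        (k21, row2, col1),
        (k22, row2, col2),
    ):
        if obs > 0:  # an empty cell contributes nothing to the sum
            total += obs * math.log(obs * n / (row * col))
    return 2.0 * total
```

A table in which the two words occur independently gives a score of zero, while strongly co-occurring words give a large positive score.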

  19. Results ※ Accuracy: 100 extracted fragments were manually evaluated based on exact match

  20. Experimental Settings (Translation) • Baseline: Zh-Ja paper abstract corpus (680k with 11k chemistry domain sentences) • Tuning: 368 sentences of chemistry domain • Testing: 367 sentences of chemistry domain • Decoder: Moses • Language model: 5-gram language model trained on the Ja side of the parallel corpus using SRILM • Compare MT performance by appending the extracted fragments to the baseline training data

  21. BLEU-4 for Different Systems ※ “*” denotes that the result is significantly better than “Baseline” at p < 0.05

  22. Outline • Background • Related Work • Proposed Method • Experiments • Conclusion

  23. Conclusion • We proposed an accurate parallel fragment extraction system using an alignment model and a translation lexicon • Future Work • A method to deal with ordering • Parallel corpus independent method • Try other language pairs and domains

  24. Thank you for your attention!

  25. Examples of Extracted Fragment Pairs ※ Noise is written in red font • Most noise is due to the noisy translation lexicon (Examples 5-7) • Score smoothing also produces some noise (Example 8)
