
IBM Statistical Machine Translation for Spoken Languages

Young-Suk Lee, IBM T. J. Watson Research Center. IWSLT 2005, October 24-25, 2005. © 2005 IBM Corporation.


Presentation Transcript


  1. IBM T. J. Watson Research Center. IBM Statistical Machine Translation for Spoken Languages. Young-Suk Lee. IWSLT 2005, October 24-25, 2005. © 2005 IBM Corporation

  2. Outline
     • Baseline Phrase Translation System
     • Block Acquisition
     • Decoding
     • Performance Enhancing Techniques
     • Extended Block Acquisition Algorithm
     • System Combination
     • IWSLT 2005 Evaluations
     • Conclusions & Future Work

  3. Baseline System: Block Acquisition
     • Block (b): a phrase translation pair consisting of a source phrase and a target phrase
     [Figure: word alignment grid over words e1 to e6 and f1 to f3]
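
The slide defines a block but does not spell out how blocks are read off the alignment grid. As a minimal sketch, assuming the standard alignment-consistency criterion from phrase-based translation (not necessarily the exact IBM procedure), blocks could be collected as follows. The function name `extract_blocks`, the `max_len` limit, and the link representation are illustrative assumptions.

```python
from typing import List, Set, Tuple

# Assumed representation: a set of (source index, target index) alignment links.
Alignment = Set[Tuple[int, int]]


def extract_blocks(src: List[str], tgt: List[str], links: Alignment,
                   max_len: int = 4) -> List[Tuple[str, str]]:
    """Return (source phrase, target phrase) pairs whose words align only to each other."""
    blocks = []
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word in the source span.
            t_pos = [t for (s, t) in links if s_start <= s <= s_end]
            if not t_pos:
                continue
            t_start, t_end = min(t_pos), max(t_pos)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency check: no word in the target span links outside the source span.
            consistent = all(s_start <= s <= s_end
                             for (s, t) in links if t_start <= t <= t_end)
            if consistent:
                blocks.append((" ".join(src[s_start:s_end + 1]),
                               " ".join(tgt[t_start:t_end + 1])))
    return blocks
```

With a hypothetical alignment linking f1 to e1 and f2 to both e2 and e3, the function yields the blocks (f1, e1), (f1 f2, e1 e2 e3) and (f2, e2 e3).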

  4. Decoding I
     • Phrase translation models
     • Direct model:
     • Source channel model:
     • Block unigram model:
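
The formulas for the three phrase translation models did not survive in the transcript. The sketch below shows one common way such models are estimated, by relative frequency over the acquired blocks. It assumes that the direct model is p(target | source), the source channel model is p(source | target), and the block unigram model is p(b); these definitions, and the function name `estimate_models`, are assumptions rather than the formulas from the slide.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Block = Tuple[str, str]  # (source phrase, target phrase)


def estimate_models(blocks: Iterable[Block]) -> Dict[str, Dict[Block, float]]:
    """Relative-frequency estimates over a list of block occurrences (assumed definitions)."""
    block_counts = Counter(blocks)
    src_counts: Counter = Counter()
    tgt_counts: Counter = Counter()
    for (src, tgt), n in block_counts.items():
        src_counts[src] += n
        tgt_counts[tgt] += n
    total = sum(block_counts.values())
    return {
        # Assumed direct model: p(target | source)
        "direct": {b: n / src_counts[b[0]] for b, n in block_counts.items()},
        # Assumed source channel model: p(source | target)
        "source_channel": {b: n / tgt_counts[b[1]] for b, n in block_counts.items()},
        # Assumed block unigram model: p(b)
        "block_unigram": {b: n / total for b, n in block_counts.items()},
    }
```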

  5. Decoding II
     • IBM Model 1 cost per phrase in both directions
     • Word trigram language model
     • Word-level distortion models applied to blocks
     • Word count penalty
     • Block count penalty
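
The slides list the decoder's cost components but not how they are combined. A minimal sketch, assuming a weighted sum of negative log-probabilities plus the two count penalties, is given below. The combination rule, the weight names, and the function `hypothesis_cost` are hypothetical and only illustrate how such components can add up to a single hypothesis cost.

```python
import math
from typing import Dict, List


def hypothesis_cost(block_features: List[Dict[str, float]],
                    n_target_words: int,
                    weights: Dict[str, float]) -> float:
    """Weighted sum of per-block model costs plus word and block count penalties (assumed form)."""
    cost = 0.0
    for feats in block_features:
        # Each feature is a probability from one model (e.g. "direct", "model1", "lm").
        for name, prob in feats.items():
            cost += weights.get(name, 0.0) * -math.log(prob)
    # Penalties scale with the number of target words and the number of blocks used.
    cost += weights.get("word_count", 0.0) * n_target_words
    cost += weights.get("block_count", 0.0) * len(block_features)
    return cost
```

Lower cost means a better hypothesis under this convention; models without a weight in `weights` are ignored.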

  6. Extended Block Acquisition
     Arabic: lA Aryd AzAlthA (لا أريد إزالتها)
     English: I do n't want it extracted

  7. Extended Block Acquisition Algorithm
     [Figure: alignment of "I don't want it extracted" with lA Aryd AzAlthA (لا أريد إزالتها)]
     • Expansion word list: a list of target words typically aligned to null source words (e.g. I, do, it)
     • Extend the target phrase to include an expansion word if it occurs in the neighborhood of a seed block
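
The extension step can be sketched as below. This is an illustration under stated assumptions, not the algorithm as implemented at IBM: "neighborhood" is taken to mean up to `window` words directly adjacent to the seed block's target span, and the function name `extend_block` and the seed spans in the usage note are hypothetical.

```python
from typing import List, Set, Tuple


def extend_block(tgt: List[str], t_start: int, t_end: int,
                 expansion_words: Set[str], window: int = 1) -> Tuple[int, int]:
    """Grow the target span [t_start, t_end] over adjacent expansion words."""
    # Extend to the left while the neighboring word is an expansion word.
    steps = 0
    while t_start > 0 and steps < window and tgt[t_start - 1].lower() in expansion_words:
        t_start -= 1
        steps += 1
    # Extend to the right in the same way.
    steps = 0
    while (t_end < len(tgt) - 1 and steps < window
           and tgt[t_end + 1].lower() in expansion_words):
        t_end += 1
        steps += 1
    return t_start, t_end
```

For the tokenized sentence "I do n't want it extracted" and the expansion words I, do, it, a seed block covering only "extracted" grows to "it extracted"; a seed covering "n't" grows to "I do n't" when `window` is 2.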

  8. Impact of Extended Block Acquisition: A2E (Arabic-to-English)
     [Chart: BLEUr16n4 on the CSTAR 03 Dev Set and the IWSLT 04 Dev Set; labels: EXTENDED, Reordering Rules]

  9. Impact of Extended Block Acquisition: C2E (Chinese-to-English)
     [Chart: BLEUr16n4 on the CSTAR 03 Dev Set and the IWSLT 04 Dev Set; labels: EXTENDED, Reordering Rules]

  10. System Combination: Recipe
      [Diagram: Phrase Lexicon 1, Phrase Lexicon 2 and Phrase Lexicon 3 feed SYSTEM 1, SYSTEM 2 and SYSTEM 3; each system translates the input]
      • Algorithm: Select the Best

  11. Arabic-to-English Phrase Lexicons
      • llmEArDp 'of the opposition' → l# Al# EArD +p → l# Al# EArDp
      • lA Aryd AzAlthA → lA A# ryd AzAl +t +hA → lA Aryd AzAlt +hA
      [Table: OOV Ratio]

  12. System Combination Algorithm
      • h-sys (system producing the highest BLEU score) vs. l-sys1, l-sys2, ..., l-sysn
      • cost(h-sys) > cost(l-sys1) + threshold_1? YES → output(l-sys1); NO → next test
      • ...
      • cost(h-sys) > cost(l-sysn) + threshold_n? YES → output(l-sysn); NO → output(h-sys)
      • Combine the selected output as the final translation
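
One reading of the flowchart on this slide is a per-sentence cascade: the output of h-sys is kept unless a lower-ranked system produces a translation whose cost is lower by more than that system's threshold. The sketch below follows that reading. The order in which the l-sys candidates are tested, the `Candidate` type, and the function name `select_translation` are assumptions.

```python
from typing import List, Tuple

Candidate = Tuple[str, float]  # (translation, decoder cost)


def select_translation(h_sys: Candidate,
                       l_sys: List[Candidate],
                       thresholds: List[float]) -> str:
    """Pick one translation for a sentence by comparing decoder costs against thresholds."""
    h_output, h_cost = h_sys
    # Test l-sys1 ... l-sysn in order; take the first one that beats h-sys by its threshold.
    for (output, cost), threshold in zip(l_sys, thresholds):
        if h_cost > cost + threshold:
            return output
    # No lower-ranked system was sufficiently cheaper: keep the h-sys output.
    return h_output
```

Applying the function to every sentence and concatenating the selected outputs gives the combined translation. The thresholds are free parameters here; the slide does not say how they are set.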

  13. Impact of System Combination: IWSLT 05 A2E Unrestricted Data Track
      [Chart: BLEUr16n4; labels: system combination, morph segmented, morph analysis, unsegmented, Reordering Rules]

  14. Impact of System Combination: IWSLT 05 C2E Unrestricted Data Track
      [Chart: BLEUr16n4; labels: char seg & unreordered, system combination, word seg & reorder, char seg & reorder, Reordering Rules]

  15. IWSLT 2005: Training Corpora for A2E
      [Table] TM: number of sentence pairs; LM: number of words

  16. IWSLT 2005: IBM System Performances

  17. Conclusions & Future Work
      • Conclusions
        • Robust system performances on
          • Large & small training corpora
          • Various language pairs: A2E, C2E, S2E, E2S
        • System combination & extended block acquisition algorithm
          • Effective for A2E & C2E translations
      • Future Work: System Combination
        • Extend the technique to models derived by distinct algorithms
        • Refine the algorithm to discriminate effective decoder parameters
        • Apply the technique to TC-Star SLT partner systems

  18. IWSLT 2005: Training Corpora for C2E
      [Table] TM: number of sentence pairs; LM: number of words
