
Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets


Presentation Transcript


  1. Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets Jörg Tiedemann, Uppsala University Preslav Nakov, Qatar Computing Research Institute RANLP’2013 September 10, 2013, Hissar, Bulgaria

  2. Statistical Machine Translation (SMT): Trained on Bi-texts English example: Reach Out to Asia (ROTA) has announced its fifth Wheels ‘n’ Heels, Qatar’s largest annual community event, which will promote ROTA’s partnership with the Qatar Japan 2012 Committee. Held at the Museum of Islamic Art Park on 10 February, the event will celebrate 40 years of cordial relations between the two countries. Essa Al Mannai, ROTA Director, said: “A group of 40 Japanese students are traveling to Doha especially to take part in our event.” SMT systems: • learn from human-generated translations • extract useful knowledge and build models • use the models to translate new sentences

  3. The Problem: Not Enough Training Data for Most Language Pairs • Zipfian distribution of language resources

  4. The Lack of Training Bi-texts is a Big Issue Macedonian→English SMT (out-of-vocabulary Macedonian words pass through untranslated) • Ref: It's a simple matter of self-preservation. • SMT: It's simply a question of себесочувување. • Ref: Your girlfriend's very cynical. • SMT: Пријателката цинична you very much.

  5. Typical Solution: Pivoting Macedonian: Никогаш не сум преспала цела сезона. Bulgarian: Никога не съм спала цял сезон. English: I’ve never slept for an entire season. • For related languages • subword transformations • use character-level translation?
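
A quick runnable illustration of why this works (difflib is my choice for the demo, not the authors' tooling): the Macedonian and Bulgarian sentences above share most of their characters even where whole words differ.

    # Character overlap between the MK and BG pivot sentences from the slide.
    from difflib import SequenceMatcher

    mk = "Никогаш не сум преспала цела сезона."
    bg = "Никога не съм спала цял сезон."

    # ratio() returns a similarity in [0, 1]; it comes out high here even
    # though only a few full words match exactly.
    print(f"character-level similarity: {SequenceMatcher(None, mk, bg).ratio():.2f}")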

  6. Closely-Related Languages

  7. Character-Level SMT • MK: Никогаш не сум преспала цела сезона. • BG: Никога не съм спала цял сезон. • MK: Н и к о г а ш _ н е _ с у м _ п р е с п а л а _ ц е л а _ с е з о н а _ . • BG: Н и к о г а _ н е _ с ъ м _ с п а л а _ ц я л _ с е з о н _ .
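
A minimal sketch of this preprocessing (my own illustration; the paper's actual scripts are not shown here): every character becomes a token, and original spaces are kept as '_' so word boundaries survive translation. The input is assumed to be tokenized first, which is why the final period gets its own boundary marker.

    def to_char_level(sentence: str) -> str:
        # Replace real spaces with '_' and separate all characters with
        # spaces, turning a word-level sentence into a char-level one.
        return " ".join("_" if ch == " " else ch for ch in sentence)

    # Tokenized input (space before punctuation), as in the slide's example:
    print(to_char_level("Никога не съм спала цял сезон ."))
    # Н и к о г а _ н е _ с ъ м _ с п а л а _ ц я л _ с е з о н _ .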

  8. Character-Level Phrase Pairs Can cover: • word prefixes/suffixes • entire words • word sequences • combinations thereof Settings: max-phrase-length = 10, LM order = 10 (see the sketch below)
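
To make the coverage claim concrete, here is a small sketch (my illustration, over a toy string) that enumerates character "phrases" up to the stated maximum length of 10; note how a single unit can span a word suffix, a boundary, and a whole following word.

    def char_phrases(tokens, max_len=10):
        # All contiguous character spans up to max_len tokens long,
        # mirroring phrase extraction over a char-level sentence.
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
                yield " ".join(tokens[i:j])

    chars = "Н и к о г а _ н е _ с ъ м".split()
    # "г а _ н е" spans a word suffix, a boundary, and an entire word.
    print("г а _ н е" in set(char_phrases(chars)))  # True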

  9. Data: OPUS movie subtitles (cleansed & realigned) • Training • Development: 10K sentences • Test: 10K sentences

  10. N-gram Character Alignment
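
The conclusions report that bigram units work best for character alignment (+0.4 BLEU). One plausible encoding, sketched below under that assumption, slides a two-character window over the boundary-marked string so the aligner sees units with more lexical context than single characters.

    def char_bigrams(sentence: str):
        # Overlapping two-character units over the boundary-marked string.
        s = sentence.replace(" ", "_")
        return [s[i:i + 2] for i in range(len(s) - 1)]

    print(char_bigrams("не съм"))
    # ['не', 'е_', '_с', 'съ', 'ъм']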

  11. Character Alignment and Phrase Table Filtering (Macedonian-Bulgarian character-level SMT)
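
The exact filter is not spelled out in this transcript, so the sketch below makes an assumption: it drops Moses-style phrase table entries ("src ||| tgt ||| scores ...") whose direct translation probability falls below a threshold, which is one standard way to remove noisy pairs.

    def filter_phrase_table(lines, min_prob=1e-3):
        # Keep only entries whose direct phrase probability clears min_prob.
        # The score ordering (p(tgt|src) third) and the threshold value are
        # assumptions made for this illustration.
        for line in lines:
            fields = line.rstrip("\n").split(" ||| ")
            scores = [float(x) for x in fields[2].split()]
            if scores[2] >= min_prob:
                yield line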

  12. The Impact of Data Size: MK→BG (Macedonian-Bulgarian)

  13. Macedonian→Bulgarian→English

  14. Macedonian→Bulgarian→English

  15. Macedonian→Bulgarian→English

  16. Macedonian→Bulgarian→English

  17. Optimizing MK→BG→EN Pivot SMT: Local vs. Global Tuning combined = baseline + word-based + char-based

  18. Optimizing MK→BG→EN Pivot SMT: Local vs. Global Tuning combined = baseline + word-based + char-based • global tuning based on 20 × 20 n-best lists
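
Sketched below is how global tuning over 20 × 20 n-best lists could be wired up, assuming hypothetical nbest_mk_bg() and nbest_bg_en() wrappers (not the authors' code) that return (hypothesis, model score) pairs; ranking the 400 composed MK→EN candidates jointly lets the tuner optimize the whole pipeline rather than each leg in isolation.

    def nbest_mk_bg(sentence, n=20):
        """Hypothetical wrapper: n-best MK->BG hypotheses with model scores."""
        raise NotImplementedError

    def nbest_bg_en(sentence, n=20):
        """Hypothetical wrapper: n-best BG->EN hypotheses with model scores."""
        raise NotImplementedError

    def pivot_nbest(mk_sentence, n=20):
        # Compose the two n-best lists into n*n end-to-end candidates
        # and rank them by the combined (log-)score.
        candidates = []
        for bg, score1 in nbest_mk_bg(mk_sentence, n):
            for en, score2 in nbest_bg_en(bg, n):
                candidates.append((en, score1 + score2))
        return sorted(candidates, key=lambda c: c[1], reverse=True)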

  19. Example Translations

  20. Languages in Europe

  21. Slavic Languages in Europe

  22. Pivot Languages for MK→??→EN SMT: CZ, SL, SR, BG, MK

  23. Varying the Training Data Size MK→XX Pivoting (baseline MK→EN = 22.33 BLEU)

  24. Using Synthetic Data: Translate Bulgarian to Macedonian in a BG-EN corpus

  25. Using Synthetic Data: Translate Bulgarian to Macedonian in a BG-XX corpus

  26. Using Synthetic Data: Translate Bulgarian to Macedonian in a BG-XX corpus

  27. Using Synthetic Data: Translate Bulgarian to Macedonian in a BG-XX corpus • All synthetic data combined (+MK-EN): 36.69 BLEU
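
The synthetic-data recipe, sketched under the assumption of a hypothetical translate_bg_to_mk() wrapper (not the authors' code): translate the Bulgarian side of an existing BG-EN (or BG-XX) bitext into Macedonian with a character-level system, then pair the output with the untouched English side to obtain a synthetic MK-EN bitext.

    def translate_bg_to_mk(sentence: str) -> str:
        """Hypothetical wrapper around a char-level BG->MK SMT system."""
        raise NotImplementedError

    def make_synthetic_mk_en(bg_en_pairs):
        # bg_en_pairs: iterable of (bg_sentence, en_sentence) tuples.
        # The English side is kept; only the Bulgarian side is re-translated.
        for bg, en in bg_en_pairs:
            yield translate_bg_to_mk(bg), en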

  28. Human Judgments

  29. Conclusion and Future Work • Findings: • character alignment: use bigrams! (+0.4 BLEU) • phrase table filtering: removes noise! (+0.5 BLEU) • global tuning: better than local! (+1.0 BLEU) • bitext size: char-level beats word-level with little data! (+3.0 BLEU) • choice of pivot language: closer is better! • synthetic data: better than pivoting (+2.5 BLEU) • results confirmed by manual evaluation • Overall: +14 BLEU • Future Work: • robustness of char-level models • domain shifts • noisy inputs: spelling, tokenization, etc. • other language pairs Thank you! Thanks to Petya Kirova and Veno Pacovski for the manual judgments.
