
Approaching a New Language in Machine Translation

Approaching a New Language in Machine Translation. Anna Sågvall Hein, Per Weijnitz. A Swedish example: experiences of rule-based translation by means of translation software that was developed from scratch, and of statistical translation by means of publicly available software.


Presentation Transcript


  1. Approaching a New Language in Machine Translation • Anna Sågvall Hein, Per Weijnitz

  2. A Swedish example • Experiences of: rule-based translation by means of translation software that was developed from scratch; statistical translation by means of publicly available software • SALTMIL, LREC 2006, Sågvall Hein & Weijnitz

  3. Developing a robust transfer-based system for Swedish • collecting a small sv-en translation corpus from the automotive domain (Scania) • building a prototype of a core translation engine, Multra • extending the translation corpus to 50k words for each language and scaling up the dictionaries for the extended corpus • building a translation system, Mats, for hosting Multra and processing real-world documents • making the system robust, transparent and traceable • building an extended, more flexible version of Mats, Convertus

  4. Features of the Multra engine • transfer-based • modular • analysis by chart parsing • transfer based on unification • generation based on unification and concatenation • non-deterministic processing • preference machinery

  5. Features of the host system(s) • robust: always produces a translation • modular: a separate module for each translation step • transparent: text-based communication between modules • traceable: step-wise for each module • evaluation of the linguistic coverage: counting and collecting missing units from each module • process communication: MATS, unidirectional pipe; Convertus, blackboard
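
The host-system design listed on this slide (a separate module per translation step, text-based communication between modules, step-wise tracing) might be sketched roughly as follows. This is a minimal illustration and not the Mats or Convertus implementation: `run_pipeline` and the three stand-in modules are hypothetical names, and the modules only perform toy string operations.

```python
from typing import Callable, List, Tuple

Module = Callable[[str], str]

def run_pipeline(text: str, modules: List[Tuple[str, Module]]) -> Tuple[str, List[Tuple[str, str]]]:
    """Pass text through each module in order, recording every intermediate result."""
    trace = [("input", text)]
    for name, module in modules:
        text = module(text)          # text in, text out: the only interface between modules
        trace.append((name, text))   # step-wise trace for later inspection
    return text, trace

# Hypothetical stand-ins; real analysis, transfer and generation are far richer.
def analysis(text: str) -> str:
    return " ".join(text.split())    # normalise whitespace

def transfer(text: str) -> str:
    return text.lower()              # toy operation, not real transfer

def generation(text: str) -> str:
    return text[:1].upper() + text[1:]

result, trace = run_pipeline(
    "  Kontrollera   OLJENIVÅN ",
    [("analysis", analysis), ("transfer", transfer), ("generation", generation)],
)
for step, output in trace:
    print(f"{step}: {output!r}")
```

Because every module maps text to text, any intermediate result in `trace` can be inspected or replayed on its own, which is one plausible reading of "transparent" and "traceable" here.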

  6. Robustness • dictionary: complementary access to external dictionaries • analysis: exploiting partial analyses; concatenation of sub-strings in preserved order • transfer: only differences covered by rules • generation: token translations presented in source language order; fall-back generations cleaned up using a language model
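
The generation fall-back on this slide (token translations presented in source language order) could look something like the sketch below. It assumes a plain word-for-word dictionary lookup; `fallback_translate` and the three dictionary entries are hypothetical, and the language-model clean-up step is not implemented. Unknown tokens are passed through and collected, loosely in the spirit of counting missing units.

```python
from typing import Dict, List, Tuple

def fallback_translate(tokens: List[str], dictionary: Dict[str, str]) -> Tuple[str, List[str]]:
    """Translate token by token, keeping source order; collect tokens with no entry."""
    output: List[str] = []
    missing: List[str] = []
    for token in tokens:
        translation = dictionary.get(token.lower())
        if translation is None:
            missing.append(token)    # record the gap for coverage evaluation
            output.append(token)     # pass the source token through unchanged
        else:
            output.append(translation)
    return " ".join(output), missing

# Hypothetical sv-en entries, for illustration only.
sv_en = {"kontrollera": "check", "oljenivån": "the oil level", "i": "in"}

text, missing = fallback_translate("Kontrollera oljenivån i motorn".split(), sv_en)
print(text)     # check the oil level in motorn
print(missing)  # ['motorn']
```

A translation always comes out, even when the dictionary is incomplete, at the cost of untranslated tokens in the output.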

  7. Language resources, full system • analysis: dictionary, grammar • transfer: dictionary, grammar • generation: dictionary, grammar • external translation dictionary • target language model

  8. Language resources, simplified, direct translation system • analysis: dictionary • transfer: dictionary • generation: dictionary • external translation dictionary • target language model

  9. Achievements • BLEU scores of ~0.4-0.5 for training materials: automotive service literature; EU agricultural texts; security police communication; academic curricula

  10. Current project • Translation of curricula of Uppsala University from Swedish to English

  11. Current development • initial studies of automatic extraction of grammar rules from text and treebanks for parsing and generation • inspired by: Megyesi, B. (2002). Data-Driven Syntactic Analysis - Methods and Applications for Swedish. Ph.D. thesis. Department of Speech, Music and Hearing, KTH, Stockholm, Sweden. • Nivre, J., Hall, J. and Nilsson, J. (2006). MaltParser: A Data-Driven Parser-Generator for Dependency Parsing. In Proceedings of LREC.

  12. Statistical MT • Publicly available software: • decoder: Pharaoh (Koehn 2004) • translation models: UPlug (Tiedemann, J. 2003), GIZA++ (Och, F. J. and Ney, H. 2000), Thot (Ortiz-Martínez, D. et al. 2005) • language models: SRILM (Stolcke, A. 2002)

  13. Success factors • language differences • translation direction • size of training corpus • density of corpus • corpus density: lexical openness, degree of repetitiveness of n-grams, plus other significant factors • How can they be appropriately formalised? Measured? Combined?

  14. Experiments • limited amount of training data (assumed for minority languages): <=32k sentence pairs • Swedish represents the minority language • search for correlation between density of corpus and translation quality

  15. Mats automotive corpus

  16. Europarl

  17. Mats & Europarl, density in terms of type/occurrence ratio
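
For readers unfamiliar with the measure, an n-gram type/occurrence ratio can be computed along the lines below. This is a sketch under assumptions, not the authors' tooling: whitespace tokenisation, n-grams that do not cross sentence boundaries, and the function names are all choices made for the example, and the three-sentence corpus is invented. Under this reading, a lower ratio means more repetition.

```python
from collections import Counter
from typing import Iterable, List, Tuple

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams within one sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def type_occurrence_ratio(sentences: Iterable[str], n: int) -> float:
    """Distinct n-grams (types) divided by total n-gram occurrences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), n))   # assumption: whitespace tokenisation
    occurrences = sum(counts.values())
    return len(counts) / occurrences if occurrences else 0.0

# Invented toy corpus, for illustration only.
corpus = [
    "check the oil level",
    "check the oil filter",
    "check the brake fluid level",
]
for n in (1, 2, 3):
    print(n, round(type_occurrence_ratio(corpus, n), 3))
```

The ratio rises with n because longer n-grams repeat less often, so ratios are only comparable across corpora for the same n.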

  18. BLEU for Europarl: 10 SL->sv

  19. BLEU for Europarl: sv->10 TL

  20. 4-gram type/occurrence ratio, SL->sv

  21. 3-gram type/occurrence ratio, SL->sv

  22. Detailed view, Europarl, sv->en • Examining the correlation between SL n-gram type/occurrence ratio (density) and BLEU.

  23. Detailed view, Europarl, sv->fi • Examining the correlation between SL n-gram type/occurrence ratio (density) and BLEU.
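
The slides do not say how the correlation was examined. One common choice would be Pearson's correlation coefficient over paired (density, BLEU) measurements, sketched below. The function name is an assumption and the numbers are placeholders chosen to make the arithmetic obvious; they are not results from the presentation.

```python
import math
from typing import Sequence

def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Placeholder values only (NOT data from the slides): one pair per training-set size.
ratios = [4.0, 3.0, 2.0, 1.0]   # hypothetical type/occurrence measurements
scores = [1.0, 2.0, 3.0, 4.0]   # hypothetical quality scores
print(pearson(ratios, scores))
```

A coefficient near -1 would mean quality rises as the ratio falls; with only a handful of training-set sizes per language pair, any such figure should be read cautiously.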

  24. Rule-based and statistical - moving slightly off domain • MATS automotive corpus used for training, 16k • test data from Mats (outside training data) and from a separate, similar corpus: Scania98

  25. Correlation between overlap and performance - Pharaoh • MATS automotive corpus used for training, 16k • test data from MATS and Scania98 • measured occurrences of test data units that also occur in the training data • test and training source language data overlap: the precondition for successful data-driven MT
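
The overlap measurement described on this slide might be approximated as the share of test-side n-gram occurrences that also appear somewhere in the training data. The sketch below assumes that "units" means source-language n-grams and that tokens are whitespace-separated; both are assumptions, as are the function names and the invented sentences.

```python
from typing import Iterable, List, Set, Tuple

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams within one sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap(train: Iterable[str], test: Iterable[str], n: int) -> float:
    """Fraction of test n-gram occurrences whose n-gram was seen in training."""
    seen: Set[Tuple[str, ...]] = set()
    for sentence in train:
        seen.update(ngrams(sentence.split(), n))
    total = 0
    hits = 0
    for sentence in test:
        for gram in ngrams(sentence.split(), n):
            total += 1
            if gram in seen:
                hits += 1
    return hits / total if total else 0.0

# Invented sentences, for illustration only.
train = ["check the oil level", "replace the oil filter"]
test = ["check the oil filter"]
for n in (1, 2, 3, 4):
    print(n, overlap(train, test, n))
```

In this toy case every test n-gram up to length 3 is covered by the training data, but the full 4-gram is not, which illustrates why overlap is usually reported per n-gram length.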

  26. Summary • development of Convertus, a robust transfer-based system equipped with language resources for sv-en translation in several domains • BLEU measures of SMT using publicly available software (Pharaoh) and Europarl: 10 languages, two translation directions, and training intervals of 5k sentence pairs up to 32k • data on density of Europarl in terms of overlaps • comparing RBMT and SMT using Convertus and Pharaoh • searching for a formal way of quantifying how well a corpus will work for SMT, starting with density of source language

  27. Concluding remarks • building a rule-based system from scratch is a major undertaking; customizing existing software is better • SMT systems can be built fairly easily using publicly available software; restrictions on commercial use, though • factors influencing quality in SMT: size of training corpus; density of source side of training corpus; language differences and translation direction • other important factors (future work): quality of training corpus, alignment quality, …

  28. Concluding remarks (cont.) • SMT versus RBMT: SMT seems more sensitive to density than RBMT; error analysis and correction can be linguistically controlled in RBMT as opposed to SMT
