
METIS (Machine Translation for Low-Resource Languages)






Presentation Transcript


  1. METIS (Machine Translation for Low-Resource Languages)
     Maite Melero (GLiCom – BM)
     Seminari NLP-UPC

  2. Roadmap
     • METIS II (2004-2007)
     • ES-EN approach (GLiCom)
     • METIS II evaluation results
     • Rapid deployment of the METIS CA-EN pair

  3. Current approaches to MT
     • In industry: mainly rule-based
       • require lots of expensive manual labour
     • In academia: mostly data-driven (statistical and example-based MT)
       • require large parallel corpora
     • What happens with smaller languages?

  4. METIS II (2004-2007): the aims
     • Construct free text translations by
       • relying on hybrid techniques
       • employing basic resources
       • retrieving the basic stock for translations from large monolingual corpora of the target language only

  5. Similar approach: MATADOR
     • MATADOR (Habash and Dorr, 2002, 2003; Habash, 2003, 2004)
     • Main difference:
       • MATADOR aims at language pairs with resource asymmetry: low resources for the source language, high resources for the target language
       • METIS aims at low resources on both sides

  6. METIS II: the main ideas
     • Hybrid approach: a strong data-driven component plus a limited number of rules
     • Simple resources, readily available
     • Weights associated with resources and the search algorithm
     • TL corpus: processed off-line to construct the TL model
     • Language-specific components independent from the core search engine
     • Special data format for the core engine input (UDF)
     • Several language pairs test the feasibility of the approach: Dutch, German, Greek and Spanish → English

  7. METIS II architecture

  8. What are basic NLP resources?
     • Part-of-speech taggers
     • Lemmatizers
     • Manually corrected POS-tagged corpus (can be used to train a statistical tagger such as TnT (Brants, 2000))
     • (optionally) Chunkers

  9. METIS II: fields of experimentation
     • SL analysis
       • depth & richness of syntactic structure
     • Transfer
       • which pieces / structures of information
     • Generation
       • re-ordering of chunks and words

  10. METIS II: SL analysis (morphology). All language pairs provide:
     • Lemmatisation: abstraction from inflection
     • POS tagging: verb, noun, adjectives, articles, pronouns, etc., with subclasses according to properties of the SL
     • Nominal inflection: number, gender, case
     • Verbal inflection: number, person, tense, mood, type (ptc, fin, inf, etc.)

  11. METIS II: SL analysis (syntax)
     • No syntactic SL analysis: Spanish
     • Phrase detection (nominal, prepositional, verbal groups) and clause detection (main and subordinate clauses): Dutch, German & Greek
     • Recursive embedding of phrases and clauses:
       • one level, no embedding: German
       • two-level embedding: Greek
       • full recursivity: Dutch
     • Detection of phrase & clause heads: Dutch & Greek
     • Subject detection: German & Greek
     • Topological field analysis: German

  12. METIS II: source language analysis
     • Provides generalization:
       • smaller lexicon
       • less data sparsity in the TL corpus

  13. METIS II: transfer (mapping of SL features to TL)

  14. METIS II: TL generation (reordering)
     • Reordering of the transferred items into TL structure is conceived as a process of hypothesis generation and filtering, according to the most likely TL pattern (from the TL model).
     • Mostly pattern-based, using only information from the TL, but can also be partly rule-based and use information from the SL (Dutch and German).

  15. METIS II: TL generation (reordering)
     • Information to be matched in the TL model:
       • shallow syntactic information: all except Spanish
       • n-gram patterns of mapped POS & lemmata: Spanish
     • Matching procedure:
       • top-down: Greek
       • bottom-up: all except Greek

  16. METIS II: reordering mechanism for TL word-order generation

  17. METIS II Spanish-English translation paradigm
     Spanish sentence → Spanish preprocessing (POS tagger and lemmatizer) → translation model (bilingual flat lexicon, no structure transfer rules) → English generation (search over n-gram models extracted from an English corpus) → English sentence

  18. Main translation problems
     • Lexical selection, i.e. picking the right translation for a given word:
       • escribir una carta → write a letter
       • jugar una carta → play a card
     • Translation divergences, i.e. whenever word-by-word translation does not work:
       • ver a Juan → see (to) Juan
       • cruzar nadando → cross swimming (swim across)
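The lexical-selection problem above can be illustrated with a minimal sketch (not the METIS code): an ambiguous source word is disambiguated using only target-language n-gram statistics. The lexicon entries and bigram counts below are invented toy data.

```python
# Bilingual flat lexicon: one Spanish lemma may map to several English lemmas.
LEXICON = {"carta": ["letter", "card"], "escribir": ["write"], "jugar": ["play"]}

# Toy TL bigram counts, standing in for a model trained on an English corpus.
BIGRAM_COUNTS = {("write", "letter"): 120, ("write", "card"): 3,
                 ("play", "card"): 95, ("play", "letter"): 1}

def pick_translation(verb_es, noun_es):
    """Choose the noun translation that best fits the translated verb,
    by looking up how often the (verb, noun) pair occurs in the TL model."""
    verb_en = LEXICON[verb_es][0]          # toy: verbs here are unambiguous
    candidates = LEXICON[noun_es]
    return max(candidates, key=lambda n: BIGRAM_COUNTS.get((verb_en, n), 0))

print(pick_translation("escribir", "carta"))  # -> letter
print(pick_translation("jugar", "carta"))     # -> card
```

The point is that no bilingual disambiguation rule is needed: the monolingual TL corpus alone prefers "write a letter" over "write a card".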

  19. Translation divergences: how MT has addressed them
     • Linguistics-based MT systems devise data representations that minimize translation divergences:
       • [head] ver → [head] see; [arg2] Juan → [arg2] Juan
     • Remaining divergences need to be solved in the translation module:
       • hand-written bilingual mapping rules (transfer MT)
       • mappings automatically extracted from a parallel corpus (example-based MT)

  20. Translation divergences: our constraints
     • Very basic resources required, both for source and target languages: only a lemmatizer/POS tagger and a (TL) chunker
     • No deep linguistic analysis to minimize divergences
     • No parallel corpus, only a target-language corpus
     • Very simple translation model: only a bilingual lexicon
     • No mapping rules, either hand-written or automatically learned

  21. Translation divergences: our approach
     • Handle structure modifications in the TL generation component
     • Treatment independent of the SL, i.e. much more general and reusable

  22. SL preprocessing (Spanish)
     Tagger (CastCG) → statistical disambiguation → SL normalization

  23. Spanish tagger: CastCG
     Example input: "Me alojo en la casa de huéspedes." (tagger output not reproduced)

  24. SL Normalization: Tag Mapping

  25. SL Normalization: e.g. Pronoun Insertion in Pro-drop
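A hedged sketch of the normalization idea on this slide: Spanish drops subject pronouns, so before lexicon look-up a pronoun is inserted from the finite verb's person/number features. The tag names, feature encoding and rule table below are illustrative, not the system's actual UDF format.

```python
# Illustrative pronoun table keyed on (person, number) of the finite verb.
PRONOUNS = {("1", "sg"): "yo", ("2", "sg"): "tú", ("3", "sg"): "él/ella",
            ("1", "pl"): "nosotros", ("2", "pl"): "vosotros", ("3", "pl"): "ellos/ellas"}

def insert_pronoun(tokens):
    """tokens: list of dicts with 'lemma', 'pos' and, for verbs,
    'person'/'number'. If the clause has no overt subject, insert a
    pronoun matching the finite verb's agreement features."""
    out = []
    has_subject = any(t["pos"] in ("PRON", "NOUN") for t in tokens)
    for t in tokens:
        if t["pos"] == "VERB" and not has_subject:
            out.append({"lemma": PRONOUNS[(t["person"], t["number"])],
                        "pos": "PRON", "inserted": True})
            has_subject = True
        out.append(t)
    return out

# "alojo" is 1st person singular, so "yo" is inserted before the verb.
sent = [{"lemma": "alojar", "pos": "VERB", "person": "1", "number": "sg"}]
print([t["lemma"] for t in insert_pronoun(sent)])  # ['yo', 'alojar']
```

Normalizing pro-drop on the source side keeps the English side simple: the generator never has to invent a subject.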

  26. Translation model: Spanish-English lexicon look-up (Sp-Eng lexmetis, Oxford) → list of pseudo-English candidates (UDF)

  27. Translation model: compound detection (Sp-Eng lexmetis, Oxford)
     casa → house
     casa de huéspedes → boarding house

     <trans-unit id="6">
       <option id="1">
         <token-trans id="1">
           <lemma>boarding</lemma>
           <pos>VVG</pos>
         </token-trans>
         <token-trans id="2">
           <lemma>house</lemma>
           <pos>NN1</pos>
         </token-trans>
       </option>
     </trans-unit>
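Compound detection of this kind can be sketched as greedy longest-match look-up against the lexicon, so that "casa de huéspedes" is taken as one unit before "casa" alone is tried. The entries below are toy stand-ins for the Oxford-based Spanish-English lexicon.

```python
# Toy lexicon: multiword keys map to sequences of (lemma, BNC-style tag).
LEXICON = {
    ("casa",): [("house", "NN1")],
    ("casa", "de", "huéspedes"): [("boarding", "VVG"), ("house", "NN1")],
}

def lookup(lemmas):
    """Scan left to right, always preferring the longest lexicon match."""
    i, out = 0, []
    while i < len(lemmas):
        for j in range(len(lemmas), i, -1):        # try longest span first
            key = tuple(lemmas[i:j])
            if key in LEXICON:
                out.append(LEXICON[key])
                i = j
                break
        else:                                      # unknown word: pass through
            out.append([(lemmas[i], "UNK")])
            i += 1
    return out

print(lookup(["casa", "de", "huéspedes"]))
# -> [[('boarding', 'VVG'), ('house', 'NN1')]]
```

With the compound matched as a whole, the UDF output carries a single trans-unit with the two-token translation "boarding house", as in the slide.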

  28. Translation model: unfound words
     • Past participles, e.g. "denominado" > denominar (VM) > designate (VV) > designated (AJ0)
     • Adverbs, e.g. "técnicamente" > técnico (AQ) > technical (AJ0) > technically (AV0)
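A sketch of this back-off for words missing from the lexicon: strip a productive Spanish suffix, translate the recovered base lemma, then re-apply the corresponding English morphology. The lexicon and the two suffix rules are toy data covering just the slide's examples.

```python
# Toy base-lemma lexicon (Spanish lemma -> English lemma, tag).
LEXICON = {"denominar": ("designate", "VV"), "técnico": ("technical", "AJ0")}

def backoff(word):
    """Derive a translation for an unfound word via its base lemma."""
    if word.endswith("ado"):                 # past participle: -ado -> -ar
        base = word[:-3] + "ar"              # denominado -> denominar
        en, _ = LEXICON[base]
        return en + "d", "AJ0"               # designate -> designated
    if word.endswith("amente"):              # adverb: -amente -> -o
        base = word[:-6] + "o"               # técnicamente -> técnico
        en, _ = LEXICON[base]
        return en + "ly", "AV0"              # technical -> technically
    raise KeyError(word)

print(backoff("denominado"))    # ('designated', 'AJ0')
print(backoff("técnicamente"))  # ('technically', 'AV0')
```

Real Spanish morphology needs many more rules (irregular participles, -mente on -e adjectives, etc.); the point is only the derive-translate-reinflect pipeline.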

  29. TL generation (English)
     Pseudo-English UDF → search engine (TL models) → English lemmatized sentence → token generation → English translation

  30. Search engine (1st version)
     Lexical preselection → candidate expansion → candidate scoring, against the TL n-gram models

  31. Search engine (2nd version): beam search decoding
     Lexical pre-selection → candidate expansion → scoring, against the TL n-gram models
     Example: "the worker must carry helmet …", with candidate expansions such as wear/carry/drive for the verb and helmet/bottle/headphones for the noun

  32. Target language models
     • 1- to 5-gram models extracted from the BNC (6 M sentences)
     • Example: stay|VV in|PRP the|AT0 house|NN
     • For n > 2, the TL model allows substitution in one position (lemma|tag reduced to the bare tag): stay|VV in|PRP the|AT0 NN
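The one-position substitution trick can be sketched as follows: an n-gram (n > 2) matches the model either exactly or with one token backed off from lemma|tag to its bare tag, so n-grams containing unseen lemmas can still be scored. The model contents below are toy data.

```python
# Toy TL model: full n-grams plus generalized variants with one bare tag.
MODEL = {("stay|VV", "in|PRP", "the|AT0", "house|NN"),
         ("stay|VV", "in|PRP", "the|AT0", "NN")}      # one slot generalized

def matches(ngram):
    """True if the n-gram is in the model, exactly or with one
    position reduced to its POS tag (only for n > 2)."""
    if tuple(ngram) in MODEL:
        return True
    if len(ngram) > 2:
        for i, tok in enumerate(ngram):
            reduced = list(ngram)
            reduced[i] = tok.split("|")[1]            # keep the tag only
            if tuple(reduced) in MODEL:
                return True
    return False

print(matches(["stay|VV", "in|PRP", "the|AT0", "cottage|NN"]))  # True
print(matches(["stay|VV", "on|PRP", "a|AT0", "cottage|NN"]))    # False
```

"cottage|NN" was never seen, but the generalized entry ending in NN licenses it; changing two positions at once is rejected.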

  33. Handling structure divergences in TL generation: local structure modifications
     • Insertion of functional words: want|VV go|VV → want|VV to|TO0 go|VV
     • Deletion of functional words: at|PRP (the|AT0) home|NN → at|PRP home|NN
     • Permutation of content words: a|AT0 {day|NN happy|AJ0} → a|AT0 happy|AJ0 day|NN
     • Each modification is scored by n-gram frequency in the TL models (1- to 5-gram).
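The three local edit operations can be sketched as a candidate-expansion step that yields local variants of a token sequence; each variant would then be scored against the TL n-gram models. The functional-word list is an illustrative stand-in.

```python
# Toy set of functional words that may be inserted or deleted.
FUNCTIONAL = {"to|TO0", "the|AT0"}

def expansions(seq):
    """Yield local variants of a lemma|TAG sequence: insertions and
    deletions of functional words, and adjacent permutations."""
    outs = [list(seq)]                               # the unmodified sequence
    for i in range(len(seq) + 1):                    # insertions
        for f in FUNCTIONAL:
            outs.append(seq[:i] + [f] + seq[i:])
    for i, tok in enumerate(seq):                    # deletions
        if tok in FUNCTIONAL:
            outs.append(seq[:i] + seq[i + 1:])
    for i in range(len(seq) - 1):                    # adjacent permutations
        swapped = list(seq)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        outs.append(swapped)
    return outs

variants = expansions(["want|VV", "go|VV"])
print(["want|VV", "to|TO0", "go|VV"] in variants)    # True: 'to'-insertion
```

Only the TL model decides which variant survives: "want to go" outscores "want go" in English n-gram counts, with no Spanish-side rule involved.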

  34. Search engine: beam search decoding
     • Performance problems
       • Combinatorial explosion in the expansion step: suppose a source sentence of at least 35 words, each of which translates to at least two English words; the expansion step then generates at least 2^35 ≈ 3.4 × 10^10 candidates.

  35. Search engine: beam search decoding
     • Performance problems
       • Combinatorial explosion in the expansion step: at least 2^35 ≈ 3.4 × 10^10 candidates for a 35-word sentence with two translations per word.
       • The search space of candidates must be pruned.

  36. Search engine: beam search decoding
     • Performance problems
       • Combinatorial explosion in the expansion step
       • Combinatorial explosion in the scoring computation step

  37. Search engine: beam search decoding
     • Solution: incrementally build the search space (following Philipp Koehn's "Pharaoh: a Beam Search Decoder for Phrase-Based Statistical Machine Translation Models", 2004)

  38. Search engine: beam search decoding
     • Solution: incrementally build the search space (following Koehn's Pharaoh beam search decoder, 2004)
       • w1, …, wk are pushed onto the first stack; the stack is ranked and pruned up to a given stack depth.

  39. Search engine: beam search decoding
     • Solution: incrementally build the search space (following Koehn's Pharaoh beam search decoder, 2004)
       • Each candidate on the (i-1)-th stack is expanded via the dictionary and the edit operations; again, candidates are ranked and pruned up to the given stack depth.

  40. Search engine: beam search decoding
     • Solution: incrementally build the search space (following Koehn's Pharaoh beam search decoder, 2004)
       • The score of each partial translation is computed from the already computed, stored scores of its shorter prefixes.

  41. Search engine: beam search decoding
     • Solution: incrementally build the search space (following Koehn's Pharaoh beam search decoder, 2004)
       • At the N-th step (the source sentence contains N tokens) the decoding process stops; we get a ranked stack with the translation candidates.
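The stack decoding described on slides 37-41 can be condensed into a minimal sketch: one stack per number of covered source tokens; each step expands every hypothesis via the lexicon, then ranks and prunes to a fixed stack depth. The lexicon and the unigram "score" are toy stand-ins for the real TL n-gram models and edit operations.

```python
import math

# Toy lexicon and TL unigram probabilities.
LEXICON = {"el": ["the"], "obrero": ["worker", "laborer"], "lleva": ["wear", "carry"]}
UNIGRAM = {"the": 0.1, "worker": 0.02, "laborer": 0.001, "wear": 0.01, "carry": 0.03}
STACK_DEPTH = 2

def decode(source):
    """Beam search: extend hypotheses one source token at a time,
    keeping only the STACK_DEPTH best partial translations per stack."""
    stack = [((), 0.0)]                               # (partial translation, log score)
    for word in source:                               # one stack per source position
        expanded = [(hyp + (t,), score + math.log(UNIGRAM[t]))
                    for hyp, score in stack
                    for t in LEXICON[word]]
        expanded.sort(key=lambda h: h[1], reverse=True)
        stack = expanded[:STACK_DEPTH]                # rank and prune
    return stack                                      # final ranked stack

best, _ = decode(["el", "obrero", "lleva"])[0]
print(best)  # ('the', 'worker', 'carry')
```

Pruning keeps the stack size constant, so the work is linear in sentence length instead of exponential; partial scores are reused rather than recomputed, as slide 40 notes.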

  42. Handling structure divergences in TL generation: non-local movements
     • A syntactic model with chunk boundaries is extracted from the normalized, chunked BNC, e.g. [The man] [sleeps] [at the park] → AT0 NN | VV | PRP AT0 NN
     • Chunk-order patterns such as PRP AT0 NN | VV | AT0 NN, AT0 NN | PRP AT0 NN | VV and AT0 NN | VV | PRP AT0 NN license non-local reorderings.

  43. Evaluation of the final METIS prototype
     • Comparison with SYSTRAN:
       • widely used
       • available for all language pairs
       • rule-based, many man-years of development
     • Goal: get an estimate of what has been achieved

  44. Methodology: test sets
     • Two test sets:
       • 200 sentences manually chosen from Europarl
       • 200 sentences from the balanced test suite used to validate system development, covering a variety of domains:
         • 25% grammatical phenomena
         • 25% newspapers
         • 25% technical
         • 25% scientific

  45. Methodology: metrics
     • BLEU & NIST: measure n-gram overlap between the MT output and reference translations
     • TER (Translation Error Rate): measures the amount of editing that a human would have to perform
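What BLEU measures can be made concrete with a simplified single-sentence sketch: clipped n-gram precision of the candidate against the references, combined geometrically over n = 1..4 with a brevity penalty. Real BLEU is computed over a whole corpus with smoothing; this toy version is for illustration only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        if not cand:
            return 0.0
        # Clip each n-gram count by its maximum count in any reference.
        clipped = sum(min(c, max(ngrams(r, n)[g] for r in references))
                      for g, c in cand.items())
        precisions.append(max(clipped, 1e-9) / sum(cand.values()))
    # Brevity penalty against the closest reference length.
    closest = min((len(r) for r in references), key=lambda l: abs(l - len(candidate)))
    bp = 1.0 if len(candidate) > closest else math.exp(1 - closest / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

hyp = "the worker must wear a helmet".split()
ref = ["the worker must wear a helmet".split()]
print(round(bleu(hyp, ref), 2))  # 1.0
```

A perfect match scores 1.0; any missing or reordered n-gram lowers the score, which is why the tables that follow report values well below 1 for both systems.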

  46. Methodology: references
     • All metrics compare the MT output with human-created references.
     • Europarl: 5 references (4 from humans translating each SL into English + the original English one)
     • Development: 3 references

  47. Results on the Europarl test set (BLEU)

              METIS-II   SYSTRAN   difference   %
      NL-EN   0.1925     0.3828    0.1903       50%
      DE-EN   0.2816     0.3958    0.1142       71%
      EL-EN   0.1861     0.3132    0.1271       59%
      ES-EN   0.2784     0.4638    0.1854       60%

  48. Results on the development test suite (BLEU)

              METIS-II   SYSTRAN   difference   %
      NL-EN   0.2369     0.3777    0.1408       70%
      DE-EN   0.2231     0.3133    0.0902       71%
      EL-EN   0.3661     0.3946    0.0285       92%
      ES-EN   0.2941     0.4634    0.1693       63%

  49. METIS-II on both test sets (BLEU)

              Europarl   Dev       difference
      NL-EN   0.1925     0.2369     0.0444
      DE-EN   0.2816     0.2231    -0.0585
      EL-EN   0.1861     0.3661     0.1800
      ES-EN   0.2784     0.2941     0.0157

  50. Results according to text type (ES-EN on the development test suite, BLEU)

                 Grammar   News   Science   Tech
      METIS-II   0.22      0.33   0.29      0.26
      SYSTRAN    0.48      0.46   0.47      0.45
