
Statistical Machine Translation (SMT) – Basic Ideas








1. Statistical Machine Translation (SMT) – Basic Ideas
Stephan Vogel, MT Class, Spring Semester 2011

2. Overview
• Deciphering foreign text – an example
• Principles of SMT
• Data processing

3. Deciphering Example
• Apinaye – English
• Apinaye belongs to the Ge family of Brazil
• Spoken by 800 people (according to SIL, 1994)
• http://www.ethnologue.com/show_family.asp?subid=90784
• http://www.language-museum.com/a/apinaye.php
• Example from the Linguistic Olympics 2008, see http://www.naclo.cs.cmu.edu
• Parallel corpus (some characters adapted):
  1. Kukre kokoi – The monkey eats
  2. Ape kra – The child works
  3. Ape kokoi rats – The big monkey works
  4. Ape mi mets – The good man works
  5. Ape mets kra – The child works well
  6. Ape punui mi pinjets – The old man works badly
• Can we translate a new sentence?

4. Deciphering Example
• Parallel corpus (some characters adapted)
• Can we build a lexicon from these sentence pairs?
• Observations:
  • Apinaye: Kukre (1), Ape (5); English: The (6), works (5)
    Aha! -> first guess: Ape – works
  • monkey in 1, 3; child in 2, 5; man in 4, 6
    Different distributions over the corpus: do we find words with a similar distribution on the Apinaye side?

5. … Vocabularies
[Slide shows the corpus and the two vocabulary lists]
• Observations:
  • 9 Apinaye words, 11 English words
• Expectations:
  • English words without a translation?
  • Apinaye words corresponding to more than one English word?

6. … Word Frequencies
[Slide shows the corpus and the vocabularies with word frequencies]
• Suggestions:
  • ‘ape’ (5) could align to ‘The’ (6) or ‘works’ (5)
  • More likely that the content word ‘works’ has a match, i.e. ‘ape’ = ‘works’
  • Other word pairs are difficult to predict – too many similar frequencies
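A minimal Python sketch of this frequency count over the six sentence pairs from slide 3 (the corpus is taken from the slides; the variable names and use of collections.Counter are just one illustrative choice):

```python
from collections import Counter

# Toy Apinaye-English parallel corpus from slide 3, one pair per tuple.
corpus = [
    ("Kukre kokoi",          "The monkey eats"),
    ("Ape kra",              "The child works"),
    ("Ape kokoi rats",       "The big monkey works"),
    ("Ape mi mets",          "The good man works"),
    ("Ape mets kra",         "The child works well"),
    ("Ape punui mi pinjets", "The old man works badly"),
]

# Count how often each word type occurs on each side of the corpus.
apinaye_freq = Counter(w for src, _ in corpus for w in src.split())
english_freq = Counter(w for _, tgt in corpus for w in tgt.split())

print(apinaye_freq.most_common(3))  # [('Ape', 5), ('kokoi', 2), ('kra', 2)]
print(english_freq.most_common(3))  # [('The', 6), ('works', 5), ('monkey', 2)]
```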

7. … Location in Corpus
[Slide shows the corpus and the vocabularies with occurrence positions]
• Observations:
  • Same sentences: ‘kukre’ – ‘eats’, ‘kokoi’ – ‘monkey’, ‘ape’ – ‘works’, ‘kra’ – ‘child’, ‘rats’ – ‘big’, ‘mi’ – ‘man’
  • ‘mets’ (4 and 5) =? ‘good’ (4) and ‘well’ (5); makes sense
  • ‘punui’ and ‘pinjets’ match ‘old’ and ‘badly’ – which is which?
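The "which sentences does each word occur in" comparison can be mechanized as a co-occurrence check: two words that appear in exactly the same set of sentences are strong translation candidates. A small sketch over the same toy corpus (illustrative only; lowercasing is added so that sentence-initial capitals do not split word types):

```python
from collections import defaultdict

corpus = [
    ("Kukre kokoi",          "The monkey eats"),
    ("Ape kra",              "The child works"),
    ("Ape kokoi rats",       "The big monkey works"),
    ("Ape mi mets",          "The good man works"),
    ("Ape mets kra",         "The child works well"),
    ("Ape punui mi pinjets", "The old man works badly"),
]

# For each word, record the set of sentence numbers it occurs in.
src_occ, tgt_occ = defaultdict(set), defaultdict(set)
for i, (src, tgt) in enumerate(corpus, start=1):
    for w in src.lower().split():
        src_occ[w].add(i)
    for w in tgt.lower().split():
        tgt_occ[w].add(i)

# Words with identical occurrence sets are likely translations of each other.
for f, f_sents in src_occ.items():
    for e, e_sents in tgt_occ.items():
        if f_sents == e_sents:
            print(f"{f} - {e}  (sentences {sorted(f_sents)})")
# Recovers kukre-eats, kokoi-monkey, ape-works, kra-child, rats-big, mi-man,
# and also pairs punui/pinjets with both old and badly: exactly the ambiguity
# noted on the slide.
```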

8. … Location in Sentence
[Slide shows the corpus with word positions]
• Observations:
  • The first English word (‘The’) does not align; we say it aligns to the NULL word
  • The Apinaye verb is in first position
  • The last English word aligns to the 1st or 2nd position
  • English -> Apinaye: reverse word order (not strictly in sentence pair 5)
• Hypothesis:
  • The alignment for the last sentence pair is 1-0 2-4 3-3 4-1 5-2, i.e. ‘pinjets’ – ‘old’ and ‘punui’ – ‘badly’
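The alignment notation pairs English positions with Apinaye positions, with position 0 standing for the NULL word. A tiny sketch that expands such a string into word pairs; the interpretation of the format follows the slide, while the helper itself is just illustrative:

```python
def expand_alignment(alignment, english, apinaye):
    """Turn 'engPos-apPos' links (1-based, 0 = NULL) into word pairs."""
    src = ["NULL"] + apinaye.split()
    tgt = english.split()
    return [(tgt[int(e) - 1], src[int(a)])
            for e, a in (link.split("-") for link in alignment.split())]

pairs = expand_alignment("1-0 2-4 3-3 4-1 5-2",
                         "The old man works badly",
                         "Ape punui mi pinjets")
print(pairs)
# [('The', 'NULL'), ('old', 'pinjets'), ('man', 'mi'), ('works', 'Ape'), ('badly', 'punui')]
```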

9. … POS Information
[Slide shows the corpus with POS tags]
• Observations:
  • The English determiner (‘The’) does not align; perhaps there are no determiners in Apinaye
  • English Verb Adverb -> Apinaye Verb Adverb -> no reordering
  • English Adjective Noun -> Apinaye Noun Adjective -> reordering
• Hypothesis:
  • ‘pinjets’ is an Adj, making the phrase N Adj; ‘punui’ is an Adv (consistent with the alignment hypothesis)

10. Translate New Sentences: Ap - En
• Source sentence: Ape rats mi mets
  • Lexical information: works big man good/well
  • Reordering information: The good man works big
  • Better lexical choice: The good man works hard
  • Compare: Ape mi mets -> The good man works
• Source sentence: Kukre rats kokoi punui
  • Lexical information: eats big monkey badly
  • Reordering information: The bad monkey eats big
  • Better lexical choice: The bad monkey eats a lot
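A minimal gloss-lookup sketch of the first step ("lexical information") on this slide. The lexicon entries are the ones deduced on slides 6-9; the reordering step and the better lexical choices (good vs. well, big vs. hard/a lot) are not modelled here and would need the POS rules from slide 9:

```python
# Lexicon deduced from the toy corpus (slides 6-9); purely illustrative.
lexicon = {
    "kukre": "eats", "kokoi": "monkey", "ape": "works", "kra": "child",
    "rats": "big", "mi": "man", "mets": "good/well",
    "punui": "badly", "pinjets": "old",
}

def gloss(apinaye_sentence):
    """Word-by-word lexical lookup; unknown words are kept as-is."""
    return " ".join(lexicon.get(w.lower(), w) for w in apinaye_sentence.split())

print(gloss("Ape rats mi mets"))        # works big man good/well
print(gloss("Kukre rats kokoi punui"))  # eats big monkey badly
```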

11. Translate New Sentences: En - Ap
• Source sentence: The old monkey eats a lot
  • Lexical information: NULL pinjets kokoi kukre rats
  • Reordering information: kukre rats kokoi pinjets
• Or:
  • Deleting words: old monkey eats a lot
  • Rephrase: old monkey eats big
  • Reorder: eats big monkey old
  • Lexical information: kukre rats kokoi pinjets
• Source sentence: The big child works a long time
  • Delete plus rephrase: big child works big
  • Reorder: works big child big
  • Lexical information: Ape rats kra rats

12. Overview
• Deciphering foreign text – an example
• Principles of SMT
• Data processing

13. Principles of SMT
• We will use the same approach – learning from data
• Build translation models using frequency, co-occurrence, word position, etc.
• Use the models to translate new sentences
• Not manually, but fully automatically
  • The training is done automatically
  • There is still lots of manual work left: designing models, preparing data, running experiments, etc.

14. Statistical versus Grammar-Based
• Often statistical and grammar-based MT are seen as alternatives, even opposing approaches – wrong!!!
• The dichotomies are:
  • Use probabilities || everything is equally likely, yes/no decisions
  • Rich (deep) structure || no or only flat structure
• Both dimensions are continuous
• Examples:
  • EBMT: no/little structure and heuristics
  • SMT: (initially only) flat structure and probabilities
  • XFER: deep(er) structure and heuristics
• Goal: structurally rich probabilistic models
  • statXFER: deep structure and probabilities
  • Syntax-augmented SMT: deep structure and probabilities

15. Statistical Machine Translation
• A translator translates source text
• Use machine learning techniques to extract useful knowledge
  • Translation model: word and phrase translations
  • Language model: how likely words follow each other in a particular sequence
• The translation system (decoder) uses these models to translate new sentences
• Advantages:
  • Can quickly train for new languages
  • Can adapt to new domains
• Problems:
  • Need parallel data
  • All words, even punctuation, are equal
  • Difficult to pinpoint the causes of errors
[Diagram: Source/Target corpora -> Translation Model + Language Model; Source Sentence + models -> Translation]

16. Tasks in SMT
• Modelling: build statistical models which capture characteristic features of translation equivalences and of the target language
• Training: train the translation model on a bilingual corpus, train the language model on a monolingual corpus
• Decoding: find the best translation for new sentences according to the models
• Evaluation:
  • Subjective evaluation: fluency, adequacy
  • Automatic evaluation: WER, BLEU, etc.
• And all the nitty-gritty stuff:
  • Text preprocessing, data cleaning
  • Parameter tuning (minimum error rate training)
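Word error rate (WER), one of the automatic metrics mentioned above, is simply the word-level edit distance between a hypothesis and a reference, normalized by the reference length. A self-contained sketch using the standard dynamic-programming edit distance (not tied to any particular toolkit):

```python
def wer(hypothesis, reference):
    """Word error rate: Levenshtein distance over words / reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("The good man works big", "The good man works hard"))  # 0.2
```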

17. Noisy Channel View
"French is actually English, which has been garbled during transmission; recover the correct, original English."
• The speaker speaks English
• The noisy channel distorts it into French
• You hear French, but need to recover the English

18. Bayesian Approach
• Select the translation which has the highest probability:
  ê = argmax{ p(e | f) } = argmax{ p(e) p(f | e) }
• [Diagram labels: p(e) – source model, p(f | e) – channel model, argmax – search process]
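Spelled out, the second equality is just Bayes' rule plus the observation that p(f) does not depend on e and can be dropped inside the argmax:

```latex
\hat{e} = \operatorname*{argmax}_{e} \; p(e \mid f)
        = \operatorname*{argmax}_{e} \; \frac{p(e)\, p(f \mid e)}{p(f)}
        = \operatorname*{argmax}_{e} \; p(e)\, p(f \mid e)
```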

19. SMT Architecture
• p(e) – language model
• p(f | e) – translation model

20. Log-Linear Model
• In practice: ê = argmax{ log p(e) + log p(f | e) }
• The translation model (TM) and the language model (LM) may be of different quality:
  • simplifying assumptions
  • trained on different amounts of data
• Give different weights to the two models: ê = argmax{ w1 * log p(e) + w2 * log p(f | e) }
• Why not add more features? ê = argmax{ w1 * h1(e, f) + ... + wn * hn(e, f) }
• Note: we don't need the normalization constant for the argmax
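A minimal sketch of the log-linear decision rule over a handful of hypothetical candidate translations; the candidate strings, probabilities and weights are made up purely to show the weighted-sum-of-log-features argmax (weights would normally be tuned, e.g. by minimum error rate training):

```python
import math

# Hypothetical candidates with (language model prob, translation model prob).
candidates = {
    "the good man works hard": (1e-4, 2e-6),
    "the good man works big":  (1e-6, 8e-6),
    "good man the works hard": (1e-8, 2e-6),
}

w_lm, w_tm = 0.6, 0.4  # illustrative model weights

def score(p_lm, p_tm):
    # Weighted sum of log features; further features h_i(e, f) could be added here.
    return w_lm * math.log(p_lm) + w_tm * math.log(p_tm)

best = max(candidates, key=lambda e: score(*candidates[e]))
print(best)  # the good man works hard
```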

21. Overview
• Deciphering foreign text – an example
• Principles of SMT
• Data processing

22. Corpus Statistics
• We want to know how much data we have:
  • Corpus size: not file size, not documents, but words and sentences
  • Why is file size not important?
  • Vocabulary: number of word types
• We want to know some distributions:
  • How many words are seen only once?
  • Why is this interesting? Does it help to increase the corpus?
  • …
  • How long are the sentences?
  • Does it matter if we have many short or fewer, but longer sentences?
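A small sketch computing the statistics listed on this slide over a plain-text corpus with one sentence per line; the file name is just a placeholder:

```python
from collections import Counter

def corpus_stats(path):
    """Sentence count, token count, vocabulary size, singleton rate, average length."""
    n_sentences = n_tokens = 0
    vocab = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            words = line.split()
            if not words:
                continue
            n_sentences += 1
            n_tokens += len(words)
            vocab.update(words)
    singletons = sum(1 for c in vocab.values() if c == 1)
    return {
        "sentences": n_sentences,
        "tokens": n_tokens,
        "vocabulary": len(vocab),
        "singleton_rate": singletons / len(vocab) if vocab else 0.0,
        "avg_sentence_length": n_tokens / n_sentences if n_sentences else 0.0,
    }

print(corpus_stats("corpus.en"))  # "corpus.en" is a placeholder file name
```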

23. All Simple, Basic, Important
• Important: when you publish, these numbers matter
  • To be able to interpret the results, e.g. what works on small corpora may not work on large corpora
  • To make them comparable to other papers
• Basic: no deep thinking, nothing fancy
• Simple: a few Unix commands, a few simple scripts
  • wc, grep, sed, sort, uniq
  • perl, awk (my favorite), perhaps python, …
• Let's look at some data!

24. BTEC Spa-Eng
• Corpus statistics:
  • Corpus and vocabulary size
  • Percentage of singletons
  • Number of unknown words, out-of-vocabulary (OOV) rate
  • Sentence length balance
• Text normalization:
  • Spoken language forms: I'll, we're, but also I will, we are
• Note: this was shown online
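A crude sketch of the kind of normalization hinted at here, mapping a few spoken-language contractions to their full forms. The mapping table is illustrative and far from complete; a real system needs a curated list and must handle ambiguous cases (e.g. 's as "is" vs. "has"):

```python
import re

# Illustrative contraction table; a real system needs a much larger, curated list.
CONTRACTIONS = {
    "i'll": "i will", "we're": "we are", "don't": "do not",
    "can't": "cannot", "it's": "it is",
}

def normalize(sentence):
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS))
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0)], sentence.lower())

print(normalize("I'll go, but we're late"))  # i will go, but we are late
```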

25. Tokenization
• Punctuation attached to words
  • Example: ‘you’ ‘you,’ ‘you.’ ‘you?’
  • All different strings, i.e. all are different words
• Tokenization can be tricky
  • What about punctuation in numbers?
  • What about abbreviations (A5-0104/1999)?
• Numbers are not just numbers
  • Percentages: 1.2%
  • Ordinals: 1st, 2.
  • Ranges: 2000-2006, 3:1
  • And more: (A5-0104/1999)
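A small regex-based tokenizer sketch that splits ordinary punctuation off words while keeping number patterns like 1.2%, 2000-2006 and 3:1 in one piece. It only illustrates the issues listed above and is not a production tokenizer; ordinals such as "1st" and codes such as "A5-0104/1999" still need extra rules:

```python
import re

TOKEN_RE = re.compile(r"""
      \d+(?:[.,:/\-]\d+)*%?     # numbers: 1.2%  2000-2006  3:1  0104/1999
    | \w+(?:['\-]\w+)*          # words, incl. contractions (don't) and hyphens
    | [^\w\s]                   # any other single character: . , ? ( ) ...
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Did you see it? Growth was 1.2% in 2000-2006, a 3:1 ratio."))
# ['Did', 'you', 'see', 'it', '?', 'Growth', 'was', '1.2%', 'in',
#  '2000-2006', ',', 'a', '3:1', 'ratio', '.']
```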

26. GigaWord Corpus
• Distributed by LDC
• Collection of newswire text: NYT, Xinhua News, …
• > 3 billion words
• How large is the vocabulary?
• Some observations in the vocabulary:
  • Number of entries with digits
  • Number of entries with special characters
  • Number of strange ‘words’
• Some observations in the corpus:
  • Sentences with lots of numbers
  • Sentences with lots of punctuation
  • Sentences with very long words
• Note: this was shown online
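The vocabulary observations above can be checked with a few lines of Python once a vocabulary file has been extracted (assumed here to be one entry per line, optionally followed by a count); the file name and the "very long" threshold are placeholders:

```python
import re
from collections import Counter

def vocab_report(path):
    """Classify vocabulary entries: with digits, with special characters, very long."""
    with open(path, encoding="utf-8") as f:
        vocab = [line.split()[0] for line in f if line.strip()]
    report = Counter()
    for w in vocab:
        if re.search(r"\d", w):
            report["contains_digit"] += 1
        if re.search(r"[^\w\-']", w):
            report["contains_special_char"] += 1
        if len(w) > 30:                  # arbitrary threshold for 'strange' words
            report["very_long"] += 1
    report["total_entries"] = len(vocab)
    return report

print(vocab_report("gigaword.vocab"))    # placeholder file name
```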

27. And Then the More Interesting Stuff
• POS tagging
• Parsing
  • For syntax-based MT systems
  • How parallel are the parse trees?
• Word segmentation
• Morphological processing
• In all these tasks the central problem is: how to make the corpus more parallel?
