
Machine Translation Decoder for Phrase-Based SMT



Presentation Transcript


  1. Machine Translation Decoder for Phrase-Based SMT
  Stephan Vogel
  Spring Semester 2011

  2. Decoder
  • Decoding issues
  • Two-step decoding
    • Generation of translation lattice
    • Best-path search
    • With limited word reordering
  • Specific issues (next session)
    • Recombination of hypotheses
    • Pruning
    • N-best list generation
    • Future cost estimation

  3. Decoding Issues
  • The decoder takes a source sentence and all available knowledge (translation model, distortion model, language model, etc.) and generates a target sentence
  • Many alternative translations are possible
    • Too many to explore them all -> pruning is necessary
    • Pruning leads to search errors
  • The decoder outputs the model-best translation
    • Ranking of hypotheses according to the model differs from ranking according to an external metric
    • Bad translations can get better model scores than good translations -> model errors
  • Models see only limited context
    • Different hypotheses become identical under the model -> hypothesis recombination

  4. Decoding Issues
  • Languages have different word order
    • Modeled by distortion models
    • Exploring all possible reorderings is too expensive (essentially O(J!))
    • Need to restrict reordering -> different reordering strategies
  • Optimizing the system
    • We use a set of models (features) and need to optimize their scaling factors (feature weights)
    • Decoding is expensive
    • Optimize on an n-best list -> need to generate n-best lists

  5. Decoder: The Knowledge Sources
  • Translation models
    • Phrase translation table
    • Statistical lexicon and/or manual lexicon
    • Named entities
    • Translation information stored as transducers or extracted on the fly
  • Language model: standard n-gram LM
  • Distortion model: distance-based or lexicalized
  • Sentence length model
    • Typically simulated by a word-count feature
  • Other features
    • Phrase count
    • Number of untranslated words
    • …

  6. The Decoder: Two-Level Approach
  • Build translation lattice
    • Run left to right over the test sentence
    • Search for matching phrases between the source sentence and the phrase table (and other translation tables)
    • For each translation, insert edges into the lattice
  • First-best search (or n-best search)
    • Run left to right over the lattice
    • Apply the n-gram language model
    • Combine translation model scores and language model score
    • Recombine and prune hypotheses
    • At sentence end: add the sentence length model score
    • Trace back the best hypothesis (or n-best hypotheses)
  • Notice: this two-step view is convenient for describing the decoder
    • The implementation can interleave both processes
    • The implementation can make a difference due to pruning

  7. Building Translation Lattice
  • Sentence: ich komme morgen zu dir
  • Reference: I will come to you tomorrow
  • Search in the corpus for phrases and their translations
  • Insert edges into the lattice
  [Lattice diagram: nodes 0, 1, 2, …, J over the source words "ich komme morgen zu dir", with edges labeled with target phrases such as "I", "come", "I come", "I will come", "tomorrow", "to you", "to your office"]

  8. Phrase Table in Hash Map
  • Store the phrase table in a hash map (source phrase as key)
  • For each n-gram in the source sentence, access the hash map

    foreach j = 1 to J-1                        // start position of phrase
      foreach l = 0 to lmax-1                   // phrase length
        SourcePhrase = (wj … wj+l)
        TargetPhrases = Hashmap.Get( SourcePhrase )
        foreach TargetPhrase t in TargetPhrases
          create new edge e' = (j-1, j+l, t)    // add TM scores

  • Works fine for sentence input, but too expensive for lattices
    • Lattices from a speech recognizer
    • Paraphrases
    • Reordering as a preprocessing step
    • Hierarchical transducers
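
  Below is a minimal Python sketch of this hash-map lookup, assuming the phrase table is a plain dict from source-word tuples to (target phrase, score) pairs; the names and the edge format (start, end, target, score) are illustrative, not the actual STTK data structures.

    # Sketch: look up all source n-grams (up to length lmax) in a hash map
    # and create one lattice edge per matching target phrase.
    def build_edges(source_words, phrase_table, lmax=4):
        """Return edges (start, end, target_phrase, score) covering words start..end-1."""
        edges = []
        J = len(source_words)
        for j in range(J):                        # start position of the phrase
            for l in range(min(lmax, J - j)):     # l+1 = phrase length
                src = tuple(source_words[j:j + l + 1])
                for target, score in phrase_table.get(src, []):
                    edges.append((j, j + l + 1, target, score))
        return edges

    # Toy example (scores are made-up -log probabilities)
    phrase_table = {
        ("ich",): [("I", -0.2)],
        ("komme",): [("come", -0.5)],
        ("ich", "komme"): [("I will come", -0.7)],
        ("morgen",): [("tomorrow", -0.3)],
        ("zu", "dir"): [("to you", -0.6), ("to your office", -1.4)],
    }
    print(build_edges("ich komme morgen zu dir".split(), phrase_table))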

  9. Example: Paraphrase Lattice
  • Large: top-5 paraphrases
  • Pruned
  [Figure: paraphrase lattice, full (top-5 paraphrases) and pruned]

  10. Phrase Table as Prefix Tree
  [Figure: phrase table stored as a prefix tree]

  11. Phrase Table as Prefix Tree
  [Prefix-tree diagram over the source phrase "ja , okay dann Montag bei mir", with node translations such as "okay", "okay then", "on Monday then"]
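
  A small Python sketch of the prefix-tree idea: matches are extended one source word at a time, so all phrases starting at a position are found by walking a single path through the tree rather than hashing every substring. The TrieNode class and helper names are made up for this sketch.

    # Sketch: phrase table stored as a prefix tree (trie) over source words.
    # Each node may hold translations for the source phrase spelled out by the
    # path from the root to that node.
    class TrieNode:
        def __init__(self):
            self.children = {}          # source word -> TrieNode
            self.translations = []      # list of (target phrase, score)

    def insert(root, source_phrase, target, score):
        node = root
        for word in source_phrase:
            node = node.children.setdefault(word, TrieNode())
        node.translations.append((target, score))

    def match_all(root, source_words):
        """Find all phrase matches; returns (start, end, target, score) tuples."""
        edges = []
        J = len(source_words)
        for j in range(J):
            node = root
            for k in range(j, J):                   # extend the match word by word
                node = node.children.get(source_words[k])
                if node is None:
                    break                           # no phrase continues with this word
                for target, score in node.translations:
                    edges.append((j, k + 1, target, score))
        return edges

    # Toy example with the phrase from the slide (scores are made up)
    root = TrieNode()
    insert(root, ["okay"], "okay", -0.1)
    insert(root, ["okay", "dann"], "okay then", -0.4)
    insert(root, ["dann", "Montag"], "on Monday then", -0.8)
    print(match_all(root, "ja , okay dann Montag bei mir".split()))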

  12. Building the Translation Lattice
  • Book-keeping: hypothesis h = (n, n, s0, hprev, e)
    • n – node
    • s0 – initial state in the transducer
    • hprev – previous hypothesis
    • e – edge
  • Convert the sentence into a lattice structure
  • At each node n, insert an 'empty' hypothesis h = (n, n, s0, hprev = nil, e = nil) as the starting point for the phrase search from this position
  • Note: the previous hypothesis and edge are only needed for hierarchical transducers, to be able to 'propagate' partial translations

  13. Algorithm for Building Translation Lattice

    foreach node n = 0 to J
      create empty hypothesis h0 = (n, n, s0, NIL, NIL)
      Hyps( n ) = Hyps( n ) + h0
      foreach incoming edge e in n
        w = WordAt( e )
        nprev = FromNode( e )
        foreach hypothesis hprev = (nstart, nprev, sprev, hx, ex) in Hyps( nprev )
          if transducer T has transition (sprev -> s' : w)
            if s' is emitting state
              foreach translation t emitted in s'
                create new edge e' = (nstart, n, t)      // add TM scores
            if s' is not final state
              create new hypothesis h' = (nstart, n, s', hprev, e)
              Hyps( n ) = Hyps( n ) + h'
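
  The following Python sketch illustrates this loop for an input word lattice, reusing the TrieNode/insert helpers from the prefix-tree sketch as a simple stand-in for the transducer; node numbering (assumed topologically ordered) and the edge formats are assumptions of the sketch.

    # Sketch: build translation-lattice edges over an input word lattice.
    # A partial-match hypothesis is a pair (start node, trie node); one is kept at
    # every lattice node and extended over incoming edges, so phrase matching
    # works on lattices (e.g. ASR output), not only on flat sentences.
    def build_lattice_edges(num_nodes, word_edges, trie_root):
        """word_edges: list of (from_node, to_node, source_word); returns
        translation edges (from_node, to_node, target_phrase, score)."""
        hyps = {n: [] for n in range(num_nodes)}        # node -> [(start node, trie node)]
        incoming = {n: [] for n in range(num_nodes)}
        for e in word_edges:
            incoming[e[1]].append(e)
        trans_edges = []
        for n in range(num_nodes):                      # nodes in topological order
            hyps[n].append((n, trie_root))              # empty hypothesis at node n
            for (nprev, _, w) in incoming[n]:
                for (nstart, state) in hyps[nprev]:
                    nxt = state.children.get(w)
                    if nxt is None:
                        continue                        # no phrase continues with w
                    for target, score in nxt.translations:
                        trans_edges.append((nstart, n, target, score))
                    if nxt.children:                    # phrase may still be extended
                        hyps[n].append((nstart, nxt))
        return trans_edges

    # Usage with the trie from the previous sketch (linear lattice for "okay dann"):
    #   build_lattice_edges(3, [(0, 1, "okay"), (1, 2, "dann")], root)
    #   -> [(0, 1, 'okay', -0.1), (0, 2, 'okay then', -0.4)]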

  14. Searching for Best Translation
  • We have constructed a graph
    • Directed
    • No cycles
    • Each edge carries a partial translation (with scores)
  • Now we need to find the best path
    • Adding additional information (DM, LM, …)
    • Allowing for some reordering

  15. Monotone Search
  • Hypotheses describe partial translations
    • Coverage information, translation, scores
  • Expand a hypothesis over the outgoing edges
  [Lattice diagram over "ich komme morgen zu dir" with example expansions:
   h: c=0..3, t=I will come tomorrow
   h: c=0..4, t=I will come tomorrow to
   h: c=0..5, t=I will come tomorrow to your office]

  16. Reordering Strategies
  • Strategy 1: All permutations
    • Any reordering possible
    • Complexity of the traveling salesman problem -> only possible for very short sentences
  • Strategy 2: Small jumps ahead – filling the gaps pretty soon
    • Only local word reordering
    • Implemented in the STTK decoder
  • Strategy 3: Leaving a small number of gaps – fill in at any time
    • Allows for global but limited reordering
    • Similar decoding complexity – exponential in the number of gaps
  • Strategy 4: IBM-style reordering (described in an IBM patent)
    • Merging neighboring regions with swaps – no gaps at all
    • Allows for global reordering
    • Complexity lower than strategy 1, but higher than strategies 2 and 3

  17. IBM-Style Reordering
  • Example: the first word is translated last!
  [Step-by-step diagram (steps 0–7): regions are translated and merged left to right, leaving a gap, another gap, and a partially filled region, until the whole sentence is covered]
  • Resulting reordering: 2 3 7 8 9 10 11 5 6 4 12 13 14 1

  18. Sliding Window Reordering
  • Local reordering within a sliding window of size 6
  [Step-by-step diagram (steps 0–8): the window slides over the sentence while gaps open, are partially filled, and are closed]

  19. Coverage Information
  • Need to know which source words have already been translated
    • Don't want to miss some words
    • Don't want to translate words twice
    • Can compare hypotheses which cover the same words
  • Use a coverage vector to store this information
    • For 'small jumps ahead': position of the first gap plus a short bit vector
    • For 'small number of gaps': array of positions of uncovered words
    • For 'merging neighboring regions': left and right position

  20. Limited-Distance Word Reordering
  • Word and phrase reordering within a given window
    • From the first untranslated source word to the next k positions
    • Window length 1: monotone decoding
  • Restrict the total number of reorderings (typically 3 per 10 words)
  • Simple 'jump' model or lexicalized distortion model
  • Use a bit vector: 1001100… = words 1, 4, and 5 translated
  • For long sentences this means long bit vectors, but only limited reordering is allowed, therefore:
    Coverage = (first untranslated word, bit vector), i.e. 111100110… -> (4, 00110…)
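
  A small sketch of this coverage encoding in Python, with an integer standing in for the bit vector over the positions after the first gap; the helper names are made up for illustration.

    # Sketch: coverage stored as (first untranslated position, bit vector).
    # Bit k of 'bits' means position first+k is already translated.
    def normalize(first, bits):
        """Slide 'first' forward over leading 1-bits so it again points at a gap."""
        while bits & 1:
            bits >>= 1
            first += 1
        return first, bits

    def cover(first, bits, j):
        """Mark source position j (0-based) as translated."""
        assert j >= first and ((bits >> (j - first)) & 1) == 0, "position already covered"
        return normalize(first, bits | (1 << (j - first)))

    # Slide example 111100110…: words 1-4 and 7-8 translated (0-based: 0,1,2,3,6,7)
    cov = (0, 0)
    for j in [0, 1, 2, 3, 6, 7]:
        cov = cover(*cov, j)
    print(cov)   # -> (4, 12); 12 = 0b1100, matching the slide's (4, 00110…)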

  21. Jumping Ahead in the Lattice
  • Hypotheses describe a partial translation
    • Coverage information, translation, scores
  • Expand a hypothesis over an uncovered position (within the window)
  [Lattice diagram over "ich komme morgen zu dir" with example expansions:
   h: c=11000, t=I will come
   h: c=11011, t=I will come to your office
   h: c=11111, t=I will come to your office tomorrow]

  22. Hypothesis for Search
  • Organize the search according to the number of translated words c
  • It is expensive to expand the translation
    • Replace it by back-trace information
    • Generate the full translation only for the best (n-best) final translations
  • Book-keeping: hypothesis h = (Q, C, L, i, hprev, e)
    • Q – total cost (we also keep cumulative costs for the individual models)
    • C – coverage information: positions already translated
    • L – language model state: e.g. last n-1 words for an n-gram LM
    • i – number of target words
    • hprev – pointer to the previous hypothesis
    • e – edge traversed to expand hprev into h
  • hprev and e are the back-trace information, used to reconstruct the full translation
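
  As a sketch, the book-keeping record and the trace-back could look like this in Python; the field names follow the slide, while the edge format and everything else are illustrative assumptions.

    # Sketch: search hypothesis with back-trace information.  The translation is
    # not stored in the hypothesis; it is rebuilt by following hprev/e pointers.
    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class Hyp:
        Q: float                      # total cost so far
        C: Tuple[int, int]            # coverage: (first untranslated pos, bit vector)
        L: Tuple[str, ...]            # LM state, e.g. the last n-1 target words
        i: int                        # number of target words produced
        hprev: Optional["Hyp"]        # back-trace: previous hypothesis
        e: Optional[tuple]            # back-trace: edge (start, end, target phrase, score)

    def trace_back(h):
        """Follow the back-trace pointers and return the full target string."""
        phrases = []
        while h is not None and h.e is not None:
            phrases.append(h.e[2])    # target phrase of the edge used in this expansion
            h = h.hprev
        return " ".join(reversed(phrases))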

  23. Algorithm for Applying the Language Model

    foreach coverage c = 0 to J-1
      foreach h in Hyps( c )
        foreach node n within reordering window
          foreach outgoing edge e in n
            if no coverage collision between h.C and C(e)
              TMScore = -log p( t | s )        // typically several scores
              DMScore = -log p( jump )         // or lexicalized DM score
              // other scores like word count, phrase count, etc.
              foreach target word tk in t
                LMScore += -log p( tk | Lk-1 )
                Lk = Lk-1 ⊕ tk
              endfor
              Q' = k1*TMScore + k2*LMScore + k3*DMScore + …
              h' = ( h.Q + Q', h.C | C(e), L', h.i + |t|, h, e )
              Hyps( c' ) += h'
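
  A hedged Python sketch of the inner expansion step, scoring one hypothesis over one translation edge with weighted TM, DM and LM costs; it reuses the Hyp record and the cover helper from the earlier sketches, the weight keys are arbitrary, and lm_logprob is only a placeholder for a real n-gram language model.

    import math

    def lm_logprob(word, state, vocab_size=10000):
        # Placeholder for a real n-gram LM: log p(word | state); here uniform.
        return -math.log(vocab_size)

    def expand(h, edge, weights, lm_order=3):
        # The caller is assumed to have checked that the edge's source positions
        # do not collide with h.C (the "no coverage collision" test on the slide).
        start, end, target, tm_cost = edge          # tm_cost = -log p(t|s)
        dm_cost = abs(start - h.C[0])               # toy distance-based distortion cost
        lm_cost, L = 0.0, h.L
        for w in target.split():                    # apply the LM word by word
            lm_cost += -lm_logprob(w, L)
            L = (L + (w,))[-(lm_order - 1):]        # keep the last n-1 words as LM state
        C = h.C
        for j in range(start, end):                 # mark the source positions as covered
            C = cover(*C, j)
        Q = h.Q + (weights["tm"] * tm_cost + weights["lm"] * lm_cost
                   + weights["dm"] * dm_cost)
        return Hyp(Q, C, L, h.i + len(target.split()), h, edge)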

  24. Algorithm for Applying the LM (cont.)

    // coverage is now J, i.e. the sentence end is reached
    foreach h in Hyps( J )
      SLScore = -log p( h.i | J )              // sentence length model
      LMScore = -log p( </s> | Lh )            // end-of-sentence LM score
      L' = Lh ⊕ </s>
      Q' = a*LMScore + b*SLScore
      h' = ( h.Q + Q', h.C, L', h.i, h, e )
      Hyps( J+1 ) += h'
    Sort Hyps( J+1 ) according to total score Q
    Trace back over the sequence of (h, e) to construct the actual translation
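
  Continuing the same sketch, the sentence-end step could look as follows, using the placeholder lm_logprob and the trace_back helper from the earlier sketches; here the sentence length score is simply the word-count bonus discussed on the next slide, and the weight keys are again made up.

    # Sketch: finish hypotheses that cover the whole source sentence, then
    # return the translations best-first via the back-trace.
    def finish(final_hyps, weights):
        scored = []
        for h in final_hyps:
            lm_cost = -lm_logprob("</s>", h.L)      # end-of-sentence LM score
            sl_cost = -h.i                          # word-count bonus as a simple length model
            Q = h.Q + weights["lm"] * lm_cost + weights["wc"] * sl_cost
            scored.append((Q, h))
        scored.sort(key=lambda x: x[0])             # lowest total cost = best
        return [trace_back(h) for _, h in scored]   # best-first list of translations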

  25. Sentence Length Model
  • Different languages have different levels of 'wordiness'
  • A histogram of source sentence length vs. target sentence length shows that the distribution is rather flat -> p( J | I ) is not very helpful
  • Very simple sentence length model: the more – the better
    • i.e. give a bonus for each word (not a probabilistic model)
    • Balances the shortening effect of the LM
    • Can be applied immediately, as the absolute length is not important
    • However: this is insensitive to what's in the sentence
  • Optimizes the length of translations for the entire test set, not for each sentence
    • Some sentences are made too long to compensate for sentences which are too short
