
Machine Translation





Presentation Transcript


  1. Machine Translation Day 20

  2. Evaluating MT

  3. MT Evaluation Source: ズキズキ 痛み ます 。 16 human translations: • I have a throbbing pain. • I am experiencing a throbbing pain. • I am suffering from a throbbing pain. • I am feeling a throbbing pain. • It is a throbbing pain. • It's throbbing and it really hurts. • It's painful and it's throbbing. • It's throbbing with pain. • It's in throbbing pain. • It hurts so much it's throbbing. • I've got a throbbing pain. • I can feel a throbbing pain. • I am suffering from a throbbing pain. • I am experiencing a throbbing pain. • I have a painful throbbing. • I feel a painful throbbing. Data from the International Workshop on Spoken Language Translation

  4. MT Evaluation • No “right answer”! • What can we test instead? • Human adequacy / fluency ratings • Human efficacy in an application (e.g. question answering from translated foreign documents vs. native documents) • Very accurate, but slow & expensive • Agreement with reference translations • BLEU (BiLingual Evaluation Understudy: IBM) • Fast system development

  5. BLEU (Papineni, ACL 2002) • MT output: 1: It is a guide to action which ensures that the military always obeys the commands of the party. 2: It is to insure the troops forever hearing the activity guidebook that party direct. • Human (reference) translations: 1: It is a guide to action that ensures that the military will forever heed Party commands. 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. 3: It is the practical guide for the army always to heed the directions of the party.

  6. BLEU • MT output: 1: It is a guide to action which ensures that the military always obeys the commands of the party. 2: It is to insure the troops forever hearing the activity guidebook that party direct. • Human (reference) translations: 1: It is a guide to action that ensures that the military will forever heed Party commands. 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. 3: It is the practical guide for the army always to heed the directions of the party.

  7. BLEU • MT output: 1: It is a guide to action which ensures that the military always obeys the commands of the party. 2: It is to insure the troops forever hearing the activity guidebook that party direct. • Human (reference) translations: 1: It is a guide to action that ensures that the military will forever heed Party commands. 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. 3: It is the practical guide for the army always to heed the directions of the party.

  8. BLEU: observations 1: It is a guide to action which ensures that the military always obeys the commands of the party. 2: It is to insure the troops forever hearing the activity guidebook that party direct. • Observations • Word overlap is indicative • n-gram (word sequence) overlap is even more distinct • Drawing from multiple reference translations helps

  9. BLEU metric • Compute n-gram precisions: Pn = c(matched n-grams) / c(n-grams in candidate) • Compute a brevity penalty (prevents candidates from deleting difficult words): BP = exp(min(1 − r/c, 0)), where r = reference length, c = candidate length • Combine using a geometric mean: BLEU = BP ∙ (P1 ∙ P2 ∙ … ∙ PN)^(1/N) • Produces a score on a 0–1 scale, often expressed as a “percentage” (i.e., × 100)
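The metric above can be sketched in a few lines of Python. This is a minimal illustration (function names are my own) with clipped n-gram counts over multiple references; production implementations add smoothing, tokenization, and other details:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """BLEU sketch: clipped n-gram precisions, brevity penalty, geometric mean."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_p_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        # clip each candidate n-gram count by its max count in any one reference
        max_ref = Counter()
        for r in refs:
            for g, c in ngrams(r, n).items():
                max_ref[g] = max(max_ref[g], c)
        matched = sum(min(c, max_ref[g]) for g, c in cand_ngrams.items())
        total = sum(cand_ngrams.values())
        if matched == 0:
            return 0.0
        log_p_sum += math.log(matched / total)
    # r = length of the reference closest in length to the candidate
    r_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = math.exp(min(1 - r_len / len(cand), 0))
    return bp * math.exp(log_p_sum / max_n)
```

Note that any candidate with zero matches at some n-gram order scores 0 here; smoothed variants avoid that for sentence-level scoring.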

  10. BLEU results circa 2002 Distinguishes humans from machines… …correlates well with human judgments [from Papineni et al., ACL 2002] [from G. Doddington, NIST] However, nowadays we’re starting to see problems: • Some systems score better than human translations • In competitions, some “gaming of BLEU” • Rule-based systems are at a disadvantage after tuning

  11. MT Evaluation: Human • Absolute evaluation • Given a reference translation, human evaluators are asked to rate translation quality on a scale of 1-4: 4 = Ideal: grammatically correct, all information included. 3 = Acceptable: Not perfect, but definitely comprehensible, AND with accurate transfer of all important information. 2 = Possibly acceptable: may be interpretable given context/time, some information transferred accurately. 1 = Unacceptable: Absolutely not comprehensible and/or little or no information transferred accurately. • Relative evaluation • Human judges are presented with a reference translation and two machine translations in random order, and must pick the better of the two • Criteria for the decision are left up to the individual judge

  12. Absolute quality: SpanishEnglish Average quality scores: Babelfish=2.344 MSR-MT=2.727

  13. Extrinsic evaluation: Microsoft product support site • Microsoft support knowledge base • Thousands of customer support articles available at http://support.microsoft.com • However, most are only available in English • Translating all articles by hand is too expensive • Instead we present unedited MT articles • Available in Spanish, French, German, Japanese, etc. • Some of the publicly available data-driven translations (2002-2003)

  14. http://support.microsoft.com

  15. PSS survey results (Spanish) • Overall satisfaction with the article (scale: 1 to 9) • 86.0% scored between 5 and 9; US English = 74.2% • Technical accuracy of the article (1 to 9) • 75.3% scored between 5 and 9 • Task success • “Did the information in the (machine translated) knowledge base article help answer your question?” • Yes: • Machine translated Spanish = 49.7% • Human translated Spanish = 51.2% • US English = 53.6%

  16. Word alignment

  17. A very simple MT system • Get a translation dictionary • Assign a uniform distribution over all translations of each source word • Tokenize the input sentence, replace each word with its English translation: weil er gestern gegangen ist → because he yesterday gone is • Not terrible, but not very fluent
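This baseline fits in a few lines. The toy dictionary below is hypothetical, just enough to reproduce the slide's example:

```python
# a tiny illustrative German-English dictionary (not a real lexicon)
dictionary = {
    "weil": ["because"],
    "er": ["he", "him", "his"],
    "gestern": ["yesterday"],
    "gegangen": ["gone", "left"],
    "ist": ["is"],
}

def translate_word_for_word(sentence):
    """Replace each source token with its first listed English translation;
    pass unknown words through unchanged."""
    return " ".join(dictionary[w][0] if w in dictionary else w
                    for w in sentence.split())
```

Picking the first listed translation stands in for the uniform distribution here; with no model of context, any fixed choice is as good as another.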

  18. Simple Statistical Machine Translation • Given foreign f, find best English translation e* e* = argmaxe P(e | f) • Use Bayes’ rule to get the “noisy channel” model P(e | f) = P(f | e) ∙ P(e) / P(f) argmaxe P(e | f) = argmaxe P(f | e) ∙ P(e) (P(f) is constant with respect to e, so it drops out) • P(f | e) is the channel or translation model • P(e) is the language model

  19. Toy System A • Channel model reversed, otherwise identical • Now gives a probability of source given target • Uniform distribution over all source translations of a given target word • Word-based bigram model as language model • Improves translations in context • Improves fluency overall • Looks like an HMM tagger: • Find the Viterbi path through a lattice or trellis

  20. Toy System A: search Each partial hypothesis keeps track of the last word generated (for the LM score) and the total score so far. [Lattice figure: hypotheses such as <s> 0; because −3.2; he −5.6, him −5.4, his −5.9; yesterday −8.3; gone −9.9, left −10.4; eat −9.8, −10.3] Only need to keep the best hypothesis ending in a given word – the bigram LM can’t see beyond that (Viterbi!)
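The search above can be sketched as a Viterbi pass over the lattice, keeping one best hypothesis per last-generated word. Function and parameter names are illustrative:

```python
def viterbi_translate(source, translations, bigram_logprob):
    """Lattice (Viterbi) search: pick one translation per source word so the
    total bigram LM log-probability is maximal. Only the best hypothesis
    ending in each target word is kept -- a bigram LM can't see further back.
    translations: source word -> list of candidate target words
    bigram_logprob(prev, word): log P(word | prev)"""
    beam = {"<s>": (0.0, [])}  # last word -> (score, output so far)
    for src in source.split():
        new_beam = {}
        for prev, (score, seq) in beam.items():
            for cand in translations[src]:
                s = score + bigram_logprob(prev, cand)
                # Viterbi recombination: keep the best path ending in cand
                if cand not in new_beam or s > new_beam[cand][0]:
                    new_beam[cand] = (s, seq + [cand])
        beam = new_beam
    score, seq = max(beam.values(), key=lambda h: h[0])
    return " ".join(seq), score
```

With a toy LM that rewards fluent bigrams, the in-context translations win even though the dictionary offers several options per word.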

  21. Learning the translation model • Start from seminal work by IBM in the late 1980s – early 1990s • They developed models for identifying word correspondences (word alignments) in parallel data

  22. Learning the translation model • Say we had some word-aligned parallel data • How would we estimate a translation model? [Figure: aligned sentence pairs “the house” / “la maison”, “the flower” / “la fleur”, “the blue house” / “la maison bleu”]

  23. Learning the translation model • Say I had a model of P(french | english) • How can I find alignments? [Figure: the same three sentence pairs]

  24. Parameter estimation • Given lists of parallel sentences (e, f) • If we had the hidden alignments a, then we could estimate multinomial parameters based on counts: c(e, f) := number of times e was aligned to f; c(e) := number of occurrences of e; t(f | e) := c(e, f) / c(e) • On the other hand, if we knew the parameters t(∙ | ∙), we could find the most likely alignments • Bit of a chicken-and-egg problem…

  25. Expectation-Maximization • Enter the Expectation-Maximization algorithm • A method for optimizing parameters / finding hidden state in unsupervised problems • A procedural description for now: • Pick an initial set of parameters t0(f | e), set k = 0 • Until convergence… • Find expected values of the hidden states ak+1 for each pair assuming the parameters tk are correct (Expectation) • Find the most likely parameters tk+1 assuming the hidden states ak+1 are correct (Maximization) • Increment k
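Specialized to Model 1, the procedure above fits in a short function. This is a minimal sketch: t is initialized uniformly, a [null] word is prepended to every English sentence as in the slides that follow, and no convergence test is done (a fixed number of iterations instead):

```python
from collections import defaultdict

def model1_em(bitext, iterations=6):
    """EM for IBM Model 1. bitext: list of (english, french) sentence pairs.
    Returns t, mapping (french_word, english_word) -> t(f | e)."""
    pairs = [(["[null]"] + e.split(), f.split()) for e, f in bitext]
    f_vocab = {f for _, fs in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of e aligned to f
        total = defaultdict(float)   # expected count of e
        for es, fs in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)   # E-step: normalize over es
                for e in es:
                    p = t[(f, e)] / z            # posterior that e generated f
                    count[(f, e)] += p
                    total[e] += p
        t = defaultdict(float, {(f, e): c / total[e]
                                for (f, e), c in count.items()})  # M-step
    return t
```

Run on the three-sentence toy corpus from the next slides, a few iterations are enough for the intuitive pairs (house–maison, flower–fleur, blue–bleu) to dominate, mirroring the posterior tables shown there.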

  26. Model 1 [Figure: the same sentence pairs, each English sentence now prefixed with a [null] word: “[null] the house” / “la maison”, “[null] the flower” / “la fleur”, “[null] the blue house” / “la maison bleu”]

  27. Model 1, EM iteration 0 (uniform posterior alignment probabilities)
      “la maison”: [null]: la 0.33, maison 0.33; the: la 0.33, maison 0.33; house: la 0.33, maison 0.33
      “la fleur”: [null]: la 0.33, fleur 0.33; the: la 0.33, fleur 0.33; flower: la 0.33, fleur 0.33
      “la maison bleu”: each of [null], the, blue, house: la 0.25, maison 0.25, bleu 0.25

  28. Model 1, EM iteration 1
      “la maison”: [null]: la 0.34, maison 0.28; the: la 0.34, maison 0.28; house: la 0.31, maison 0.42
      “la fleur”: [null]: la 0.32, fleur 0.20; the: la 0.32, fleur 0.20; flower: la 0.36, fleur 0.60
      “la maison bleu”: [null]: la 0.27, maison 0.21, bleu 0.16; the: la 0.27, maison 0.21, bleu 0.16; blue: la 0.21, maison 0.25, bleu 0.45; house: la 0.25, maison 0.32, bleu 0.24

  29. Model 1, EM iteration 2
      “la maison”: [null]: la 0.37, maison 0.27; the: la 0.37, maison 0.27; house: la 0.26, maison 0.46
      “la fleur”: [null]: la 0.37, fleur 0.13; the: la 0.37, fleur 0.13; flower: la 0.26, fleur 0.74
      “la maison bleu”: [null]: la 0.31, maison 0.21, bleu 0.11; the: la 0.31, maison 0.21, bleu 0.11; blue: la 0.14, maison 0.21, bleu 0.60; house: la 0.23, maison 0.36, bleu 0.18

  30. Model 1, EM iteration 6
      “la maison”: [null]: la 0.44, maison 0.18; the: la 0.44, maison 0.18; house: la 0.11, maison 0.64
      “la fleur”: [null]: la 0.48, fleur 0.02; the: la 0.48, fleur 0.02; flower: la 0.05, fleur 0.96
      “la maison bleu”: [null]: la 0.44, maison 0.17, bleu 0.02; the: la 0.44, maison 0.17, bleu 0.02; blue: la 0.02, maison 0.08, bleu 0.91; house: la 0.11, maison 0.58, bleu 0.05

  31. IBM word-based translation (Brown et al., 1993) [Figure: word alignment of “[null] I do not speak French” with “je ne parle pas francais”] • Model P(f | e): French translations given English

  32. Model 1 • Lots of simplifying assumptions: • All lengths are equally likely: P(m | e) ≈ uniform = ε • All word alignments are equally likely: P(aj | a1..j−1, f1..j−1, m, e) ≈ uniform = 1 / (l + 1) • Each French word depends only on the English word it’s aligned to: P(fj | a1..j, f1..j−1, m, e) ≈ t(fj | eaj), a multinomial over French words for each English word • Resulting model: P(f, a | e) = ε / (l + 1)^m ∙ ∏j=1..m t(fj | eaj)
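Under these assumptions, scoring one (f, a, e) triple is just the product above. A small sketch (eps and the t table are placeholders supplied by the caller):

```python
def model1_prob(f_words, e_words, alignment, t, eps=1.0):
    """P(f, a | e) under Model 1: eps / (l+1)^m * prod_j t(f_j | e_{a_j}).
    e_words[0] is the [null] word; alignment[j] indexes into e_words,
    with 0 meaning the French word aligns to [null]."""
    l = len(e_words) - 1          # English length l, excluding [null]
    m = len(f_words)              # French length m
    p = eps / (l + 1) ** m        # uniform length and alignment terms
    for j, aj in enumerate(alignment):
        p *= t.get((f_words[j], e_words[aj]), 0.0)
    return p
```

Summing this quantity over all (l+1)^m alignments gives P(f | e); Model 1's independence assumptions let that sum be computed efficiently per position.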

  33. A generative story (IBM Models 1-2, HMM) • Pick the length of the French sentence • For each position in the French sentence… • Pick the English word aligned to the French word in that position, then… • Pick the French word in that position • E, F: English, French vocabularies • e = e1..l = (e1, …, el): English sentence, ei ∈ E • f = f1..m = (f1, …, fm): French sentence, fj ∈ F • a = a1..m = (a1, …, am): word alignment, aj ∈ [0..l] • P(f, a | e) = P(m | e) ∙ ∏j=1..m ( P(aj | a1..j−1, f1..j−1, m, e) ∙ P(fj | a1..j, f1..j−1, m, e) ) • Exact – chain rule!

  34. Progression of alignment models • Models of increasing complexity • Only Model 1 is convex • Models 3, 4, 5 each capture new aspects of the sentence • Capture “fertility” • Different movement models • Each model can initialize its successor – helps avoid local minima • Freely available tools for this task • GIZA++ • Berkeley aligner

  35. Toy System A’ • Our prior toy system used a uniform distribution for translations • Now we can plug in Model 1 parameters • Language model helps pick translations that are fluent • Translation model helps pick translations that are adequate • Looks just like an HMM!

  36. Toy System A’ Each translation is like a part-of-speech tag – the search becomes Bigram LM + Model 1! [Same lattice figure as before: <s> 0; because −3.2; he −5.6, him −5.4, his −5.9; yesterday −8.3; gone −9.9, left −10.4; eat −9.8, −10.3]

  37. Some questions: • What about standard translation dictionaries? Should we include them, and how? • What translation phenomena are we covering and what are we missing? • Does it work?

  38. Toy System B • System A finds better translations in context, but can’t reorder: “er gestern gegangen ist” → “he yesterday left had” (should be “he had left yesterday”) • System B: allow all possible permutations • Each hypothesis now remembers: • Last target word generated • Set of source words already translated • 5! = 120 permutations, 10! = 3.6M, 20! ≈ 2.43e18 • No way we can afford to keep all translations! • Group into stacks based on the count of words covered • Histogram pruning: limited number of hypotheses on any stack • Threshold pruning: only keep hypotheses within d of the best on the stack
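A stack decoder along these lines might look like the toy sketch below: it permutes freely, scores with only a bigram LM, and implements histogram pruning but skips threshold pruning and hypothesis recombination (names are illustrative):

```python
def stack_decode(source, translations, bigram_logprob, stack_limit=10):
    """Toy stack decoder: hypotheses are grouped into stacks by the number of
    source words covered; histogram pruning keeps at most stack_limit
    hypotheses per stack. Any permutation of the source is allowed."""
    src = source.split()
    n = len(src)
    # hypothesis: (score, last target word, covered source positions, output)
    stacks = [[] for _ in range(n + 1)]
    stacks[0] = [(0.0, "<s>", frozenset(), [])]
    for k in range(n):
        for score, last, covered, out in stacks[k]:
            for i in range(n):            # extend with any uncovered word
                if i in covered:
                    continue
                for cand in translations[src[i]]:
                    s = score + bigram_logprob(last, cand)
                    stacks[k + 1].append((s, cand, covered | {i}, out + [cand]))
        # histogram pruning on the newly filled stack
        stacks[k + 1].sort(key=lambda h: -h[0])
        del stacks[k + 1][stack_limit:]
    best = max(stacks[n], key=lambda h: h[0])
    return " ".join(best[3]), best[0]
```

Because any source word may be translated next, a reordering like "he left yesterday" can beat the monotone "he yesterday left" whenever the LM prefers it, which is exactly what System A could not do.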

  39. Toy System B: search Like an expanded Viterbi search, but each hypothesis also needs to remember which source words have been translated already! [Figure: Stack 0, Stack 1, Stack 2, … of hypotheses, each carrying a score and a coverage bit-vector, e.g. <s> 0 (000000); because −3.2 (100000); he −5.8 (110000); yesterday −1.9 (000100); yesterday −5.6 (100100); because −5.2 (100100); …]

  40. Beyond Toy System B • Many problems with this system: • System allows all possible reorderings, but some are much more likely than others • Contextual information is only captured by the target language model, not in the source • Multiple paths from here: • Better word alignment • Phrase-based translation: learn bigger translation units – this is crucial! • Better reordering models: syntax can help here

  41. Word-based MT results SRC: Le politique de la haine REF: Politics of hate WB: The policy of the hatred SRC: Où était le plan solide? REF: But where was the solid plan? WB: Where was the economic base? SRC: Nous avons signé le protocole. REF: We did sign the memorandum of agreement. WB: We have signed the protocol. SRC: 对外经济贸易合怍部今无提供的数据表明,今年至十一月中国实际利用外资四百六十九点五九亿美元,其中包括外商直接投资四百点零七亿美元。 REF: According to the data provided today by the Ministry of Foreign Trade and Economic Cooperation, as of November this year, China has actually utilized 46.959 billion US dollars of foreign capital, including 40.007 billion US dollars of direct investment from foreign businessmen. WB: The Ministry of Foreign Trade and Economic Cooperation, including foreign direct investment 40.007 billion US dollars today provide data include that year to November china actually using 46.959 billion US dollars and

  42. Word alignment and phrase extraction (Koehn, Och, Marcu 2003) Aligned sentence pair: “the blue house” / “a casa azul” (the ↔ a, blue ↔ azul, house ↔ casa)

  43. Word alignment and phrase extraction (Koehn, Och, Marcu 2003) Extracted phrase pairs: the ↔ a

  44. Word alignment and phrase extraction (Koehn, Och, Marcu 2003) Extracted phrase pairs: the ↔ a; blue ↔ azul

  45. Word alignment and phrase extraction (Koehn, Och, Marcu 2003) Extracted phrase pairs: the ↔ a; blue ↔ azul; house ↔ casa

  46. Word alignment and phrase extraction (Koehn, Och, Marcu 2003) Extracted phrase pairs: the ↔ a; blue ↔ azul; house ↔ casa; blue house ↔ casa azul

  47. Word alignment and phrase extraction (Koehn, Och, Marcu 2003) Extracted phrase pairs: the ↔ a; blue ↔ azul; house ↔ casa; blue house ↔ casa azul; the blue house ↔ a casa azul

  48. Word alignment and phrase extraction (Koehn, Och, Marcu 2003) Extracted phrase pairs (as above): the ↔ a; blue ↔ azul; house ↔ casa; blue house ↔ casa azul; the blue house ↔ a casa azul

  49. Word alignment and phrase extraction (Koehn, Och, Marcu 2003) Extracted phrase pairs (as above): the ↔ a; blue ↔ azul; house ↔ casa; blue house ↔ casa azul; the blue house ↔ a casa azul

  50. Phrase table • Extract phrases from all sentence pairs • Estimate P(src | tgt) with c(src, tgt) / c(tgt)
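The consistency criterion behind the extraction slides can be sketched as follows. This minimal version extracts a phrase pair whenever no alignment link crosses the span boundary; unlike the full Koehn/Och/Marcu procedure, it does not additionally extend phrases over unaligned boundary words:

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """Extract phrase pairs consistent with a word alignment.
    alignment: set of (src_index, tgt_index) links, 0-based."""
    src, tgt = src.split(), tgt.split()
    phrases = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # target positions linked to the source span [i1, i2]
            js = {j for (i, j) in alignment if i1 <= i <= i2}
            if not js:
                continue
            j1, j2 = min(js), max(js)
            # consistency: no link may connect the target span [j1, j2]
            # to a source word outside [i1, i2]
            if any(j1 <= j <= j2 and not (i1 <= i <= i2)
                   for (i, j) in alignment):
                continue
            if j2 - j1 < max_len:
                phrases.add((" ".join(src[i1:i2 + 1]),
                             " ".join(tgt[j1:j2 + 1])))
    return phrases
```

On the slide's example ("the blue house" / "a casa azul" with links the–a, blue–azul, house–casa), this yields exactly the five phrase pairs shown: the crossing links make "the blue" / "a casa" inconsistent, while "blue house" / "casa azul" and the full sentence pair survive.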
