
Adaptable Automatic Evaluation Metrics for Machine Translation

Presentation Transcript


  1. Adaptable Automatic Evaluation Metrics for Machine Translation. Lucian Vlad Lita, joint work with Alon Lavie and Monica Rogati

  2. Outline • BLEU and ROUGE metric families • BLANC: a family of adaptable metrics • All common skip n-grams • Local n-gram model • Overall model • Experiments and results • Conclusions • Future work • References

  3. Automatic Evaluation Metrics: translation quality (candidate | reference). Metrics along a rough timeline: • Manual human judgments • Edit distance (WER) • Word overlap (PER) • Metrics based on n-grams • n-gram precision (BLEU) • weighted n-grams (NIST) • longest common subsequence (Rouge-L) • skip 2-grams, i.e. pairs of ordered words (Rouge-S) • Integrate additional knowledge such as synonyms and stemming (METEOR)

  4. Automatic Evaluation Metrics: translation quality (candidate | reference) • Manual human judgments • Machine translation (MT) evaluation metrics • Manually created estimators of quality • Improvements often shown on the same data • Rigid notion of quality • Based on existing judgment guidelines • Goal: trainable evaluation metric

  5. Goal: Trainable MT Metric • Build on the features used by established metrics (BLEU, ROUGE) • Extendable – additional features/processing • Correlate well with human judgments • Trainable models • Different notions of “translation quality” • E.g. computer consumption vs. human consumption • Different features will be more important for different • Languages • Domains

  6. The WER Metric. R: the students asked the professor. C: the students talk professor. • Transform the reference (human) translation R into the candidate (machine) translation C • Levenshtein (edit) distance • $\text{WER} = \frac{\#\,\text{word insertions} + \#\,\text{deletions} + \#\,\text{substitutions}}{\#\,\text{words in } R}$
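A minimal sketch of the WER computation above, assuming whitespace tokenization and a single reference; the function names are illustrative, not from the slides:

```python
# Word Error Rate: word-level Levenshtein distance normalized by reference length.

def edit_distance(ref, cand):
    """Levenshtein distance over word lists (insertions, deletions, substitutions)."""
    m, n = len(ref), len(cand)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i            # delete all remaining reference words
    for j in range(n + 1):
        dp[0][j] = j            # insert all remaining candidate words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == cand[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[m][n]

def wer(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    return edit_distance(ref, cand) / len(ref)

print(wer("the students asked the professor", "the students talk professor"))
# 2 edits (substitute "asked" -> "talk", delete one "the") / 5 reference words = 0.4
```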

  7. The PER Metric. R: the students asked the professor. C: the students talk professor. • Word overlap between the candidate (machine) translation C and the reference (human) translation R • Bag of words • Position-Independent Error Rate: $\text{PER} = \frac{\sum_{w \in C} \left| \text{count}(w, R) - \text{count}(w, C) \right|}{\#\,\text{words in } R}$
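The same candidate/reference pair run through a PER sketch that follows the slide's formula (summing count differences over the candidate vocabulary; published PER definitions vary slightly in the exact normalization):

```python
# Position-Independent Error Rate: bag-of-words count mismatch over reference length.
from collections import Counter

def per_score(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    ref_counts, cand_counts = Counter(ref), Counter(cand)
    # Per-word count mismatch summed over the candidate vocabulary,
    # normalized by the reference length, as on the slide.
    diff = sum(abs(ref_counts[w] - cand_counts[w]) for w in cand_counts)
    return diff / len(ref)

print(per_score("the students asked the professor",
                "the students talk professor"))  # (1 + 0 + 1 + 0) / 5 = 0.4
```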

  8. The BLEU Metric. R: the students asked the professor. C: the students talk professor. • Contiguous n-gram overlap between the reference (human) translation R and the candidate (machine) translation C • Modified n-gram precisions • 1-gram precision = 3 / 4 • 2-gram precision = 1 / 3 • … • $\text{BLEU} = \left( \prod_{i=1}^{n} P_{i\text{-gram}} \right)^{1/n} \cdot (\text{brevity penalty})$
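A hedged sketch of modified (clipped) n-gram precision and the geometric-mean combination shown above, for a single segment and a single reference; real BLEU is computed at corpus level and supports multiple references. Function names are illustrative:

```python
# Modified n-gram precision and a BLEU-like geometric-mean combination.
import math
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def modified_precision(reference, candidate, n):
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(ngrams(candidate.split(), n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu_like(reference, candidate, max_n=4):
    precisions = [modified_precision(reference, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # the geometric mean collapses if any precision is zero
    r, c = len(reference.split()), len(candidate.split())
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

R = "the students asked the professor"
C = "the students talk professor"
print(modified_precision(R, C, 1))   # 3/4, as on the slide
print(modified_precision(R, C, 2))   # 1/3, as on the slide
print(bleu_like(R, C, max_n=2))
```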

  9. The BLEU Metric • BLEU is the most established evaluation metric in MT • Basic feature: contiguous n-grams of all sizes • Computes modified precision • Uses a simple formula to combine all precision scores • Bigram precision is “as important” as unigram precision • Brevity penalty – quasi recall

  10. The Rouge-L Metric. R: the students asked the professor. C: the students talk professor. • Longest common subsequence (LCS) of the candidate (machine) translation C and the reference (human) translation R • LCS = 3 (“the students … professor”) • $\text{Precision} = \frac{\text{LCS}(C,R)}{\#\,\text{words in } C}$, $\text{Recall} = \frac{\text{LCS}(C,R)}{\#\,\text{words in } R}$ • Rouge-L = harmonic mean(Precision, Recall) = 2PR / (P+R)
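A sketch of Rouge-L as described above (word-level LCS, harmonic mean of precision and recall), again assuming whitespace tokenization and a single reference:

```python
# Rouge-L: longest common subsequence precision/recall combined by F1.

def lcs_length(a, b):
    """Length of the longest common subsequence of word lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(cand, ref)
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r) if p + r else 0.0

print(rouge_l("the students asked the professor",
              "the students talk professor"))  # LCS = 3 -> P = 3/4, R = 3/5
```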

  11. The Rouge-S Metric. R: the students asked the professor. C: the students talk professor. • Skip 2-gram overlap of the candidate (machine) translation C and the reference (human) translation R • Skip2(C) = 6 { “the students”, “the talk”, “the professor”, “students talk”, “students professor”, “talk professor” } • Skip2(C,R) = 3 { “the students”, “the professor”, “students professor” }

  12. The Rouge-S Metric. R: the students asked the professor. C: the students talk professor. • Skip 2-gram overlap of the candidate (machine) translation C and the reference (human) translation R • $\text{Precision} = \frac{\text{Skip2}(C,R)}{\binom{|C|}{2}}$, $\text{Recall} = \frac{\text{Skip2}(C,R)}{\binom{|R|}{2}}$ • Rouge-S = harmonic mean(Precision, Recall)
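A sketch of Rouge-S with unlimited skip distance, using the "choose 2" normalization from the slide; the function names are illustrative:

```python
# Rouge-S: skip-bigram overlap normalized by the number of possible ordered pairs.
from itertools import combinations
from collections import Counter
from math import comb

def skip_bigrams(words):
    # All ordered word pairs, regardless of the gap between them.
    return Counter(combinations(words, 2))

def rouge_s(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    ref_sb, cand_sb = skip_bigrams(ref), skip_bigrams(cand)
    overlap = sum(min(c, ref_sb[g]) for g, c in cand_sb.items())
    p = overlap / comb(len(cand), 2)
    r = overlap / comb(len(ref), 2)
    return 2 * p * r / (p + r) if p + r else 0.0

R = "the students asked the professor"
C = "the students talk professor"
print(rouge_s(R, C))  # overlap = 3, |C| choose 2 = 6, |R| choose 2 = 10
```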

  13. The ROUGE Metrics • Rouge-L • Basic feature: longest common subsequence LCS • Size of the longest common skip n-gram • Weighted LCS • Rouge-S • Basic feature: skip bigrams • Skip bigram gap size irrelevant • Limited to n-grams of size 2 • Both use harmonic mean (F1-measure) to combine precision and recall

  14. Is BLEU Trainable? • Can we assign/learn the relative importance of P2 vs. P3? • Simplest model: regression • Train/test on past MT output [C,R] • Inputs: P1, P2, P3, … and the brevity penalty • (P1, P2, P3, BP) → HJ fluency score • $\text{BLEU} = \left( \prod_{i=1}^{n} P_{i\text{-gram}} \right)^{1/n} \cdot (\text{brevity penalty})$
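The "simplest model" above could look like the following regression sketch; the feature matrix and human-judgment values are made-up placeholders, and scikit-learn is just one convenient way to fit it:

```python
# Linear regression from per-order precisions plus brevity penalty to a
# human fluency score. All numbers below are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: [P1, P2, P3, P4, brevity_penalty] for one candidate/reference pair.
X = np.array([
    [0.75, 0.33, 0.10, 0.05, 0.95],
    [0.60, 0.25, 0.08, 0.02, 1.00],
    [0.85, 0.50, 0.30, 0.15, 0.90],
])
y = np.array([3.0, 2.0, 4.0])  # human fluency judgments (placeholder values)

model = LinearRegression().fit(X, y)
print(model.coef_)      # learned relative importance of P1..P4 and BP
print(model.predict([[0.70, 0.40, 0.20, 0.10, 0.98]]))
```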

  15. Is Rouge Trainable? • Simple regression on • Size of the longest common skip n-gram • Number of common skip 2-grams • Second-order parameters (dependencies): the model is no longer linear in its inputs • Window size (for computational reasons) • F-measure with parameter F (replacing the brevity penalty) • Potential models • Iterative methods • Hill climbing? • Non-linear model: (BP, |LCS|, Skip2, F, window size) → HJ fluency score

  16. The BLANC Metric Family • Generalization of established evaluation metrics • N-gram features used by BLEU and ROUGE • Trainable parameters • Skip n-gram contiguity in C • Relative importance of n (i.e. bigrams vs. trigrams) • Precision-recall balance • Adaptability to different: • Translation quality criteria, languages, domains • Allow additional processing/features (e.g. METEOR matching)

  17. All Common Skip N-grams. R: the new student brought the food. C: the one pure student brought the necessary condiments. Common word matches, written as word(reference position, candidate position): the(0,0), the(0,5), the(4,0), the(4,5), student(2,3), brought(3,4). Resulting counts: # 1-grams: 4, # 2-grams: 6, # 3-grams: 4, # 4-grams: 1. [Slide diagram: a grid of per-size counts over the matched words.]

  18. All Common Skip N-grams. R: the new student brought the food. C: the one pure student brought the necessary condiments. Instead of counting matches, assign each matched skip n-gram a score, e.g. score(the(0,0), student(2,3)), and accumulate score(1-grams), score(2-grams), score(3-grams), score(4-grams). [Slide diagram: the same grid with per-match scores in place of counts.]

  19. All Common Skip N-grams • Algorithms literature: all common subsequences • Listing vs. counting subsequences • Interested in counting • # common subsequences of size 1, 2, 3 … • Replace counting with a sum of scores over all common n-grams of the same size • Score(w1…wi, wi+1…wn) = Score(w1…wi) · Score(wi+1…wn) • BLANCi(C,R) = f(common i-grams of C,R)
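One way to do the counting (not necessarily BLANC's exact convention) is a dynamic program over both sentences that counts, for every size, all pairs of matching position sequences:

```python
# Count common skip n-grams by size with dynamic programming (counting, not listing).
# Every matching pair of position sequences is counted once.

def common_skip_ngram_counts(cand, ref, max_n):
    lc, lr = len(cand), len(ref)
    # dp[i][j][k]: number of common skip k-grams within cand[:i] and ref[:j].
    dp = [[[0] * (max_n + 1) for _ in range(lr + 1)] for _ in range(lc + 1)]
    for i in range(lc + 1):
        for j in range(lr + 1):
            dp[i][j][0] = 1  # the empty subsequence
    for i in range(1, lc + 1):
        for j in range(1, lr + 1):
            for k in range(1, max_n + 1):
                # Inclusion-exclusion over the two sub-rectangles, then add
                # the matches that end exactly at cand[i-1] / ref[j-1].
                dp[i][j][k] = (dp[i - 1][j][k] + dp[i][j - 1][k]
                               - dp[i - 1][j - 1][k])
                if cand[i - 1] == ref[j - 1]:
                    dp[i][j][k] += dp[i - 1][j - 1][k - 1]
    return {k: dp[lc][lr][k] for k in range(1, max_n + 1)}

R = "the new student brought the food".split()
C = "the one pure student brought the necessary condiments".split()
print(common_skip_ngram_counts(C, R, 4))
# -> {1: 6, 2: 6, 3: 4, 4: 1}; the slide reports 4 common 1-grams for this pair,
# so its unigram convention differs (it appears to limit repeated-word matches).
```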

  20. Modeling Gap Size Importance. Skip 3-grams with different gap sizes: • … the ____ ____ ____ ____ student ____ ____ has … • … the ____ student has … • … the student has …

  21. Modeling Gap Size Importance. C: … the __ __ __ __ student __ __ has … • Model the importance of skip n-gram gap size as an exponential function with one parameter (α) • Special cases • Gap size doesn’t matter (Rouge-S): α = 0 • No gaps are allowed (BLEU): α = a large number

  22. Modeling Candidate-Reference Gap Difference. Skip 3-gram match: • R: … the ____ student has … • C1: … the ____ ____ ____ ____ student ____ ____ has … • C2: … the student has …

  23. Modeling Candidate-Reference Gap Difference. R: … the __ student has … C: … the __ __ __ __ student __ __ has … • Model the importance of the gap size difference between the candidate and reference translations as an exponential function with one parameter (β) • Special cases • Gap size differences do not matter: β = 0 • Skip 2-gram overlap (Rouge-S): α = 0, β = 0, n = 2 • Largest skip n-gram (Rouge-L): α = 0, β = 0, n = LCS

  24. Skip N-gram Model • Incorporate simple scores into an exponential model • Skip n-gram gap size • Candidate-reference gap size difference • Possible to incorporate higher-level features • Partial skip n-gram matching (e.g. synonyms, stemming): “the __ students” vs. “the __ pupils”, “the __ students” vs. “the __ student” • From word classes to syntax, e.g. score(“students __ __ professor”) vs. score(“the __ __ of”)
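A hedged sketch, not the published BLANC formulation, of how one matched skip n-gram might be scored with an exponential model over the two local features above; `alpha` and `beta` stand in for the trainable gap-size and gap-difference parameters:

```python
# Exponential score for a single matched skip n-gram, using the gap sizes in
# the candidate and the candidate-reference gap difference as features.
import math

def gaps(positions):
    """Gap sizes between consecutive matched word positions."""
    return [b - a - 1 for a, b in zip(positions, positions[1:])]

def skip_ngram_score(cand_positions, ref_positions, alpha, beta):
    cand_gaps, ref_gaps = gaps(cand_positions), gaps(ref_positions)
    gap_term = sum(cand_gaps)
    diff_term = sum(abs(c - r) for c, r in zip(cand_gaps, ref_gaps))
    return math.exp(-(alpha * gap_term + beta * diff_term))

# Matched skip 3-gram "the ... student ... has":
# C: the _ _ _ _ student _ _ has  -> positions 0, 5, 8
# R: the _ student has            -> positions 0, 2, 3
print(skip_ngram_score([0, 5, 8], [0, 2, 3], alpha=0.1, beta=0.1))

# Special cases: alpha = beta = 0 scores every match 1 (Rouge-S-like counting);
# a very large alpha effectively keeps only contiguous n-grams (BLEU-like).
print(skip_ngram_score([0, 5, 8], [0, 2, 3], alpha=0.0, beta=0.0))  # 1.0
```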

  25. BLANC Overview. Pipeline: candidates and references → find all common skip n-grams → compute skip n-gram pair features and score each match with an exponential model $e^{-\sum_i \lambda_i f_i(s_n)}$ → combine all common skip n-gram scores using global parameters (precision/recall balance, f(skip n-gram size)) → compute correlation against the chosen criterion (adequacy, fluency, f(adequacy, fluency), other) with the chosen coefficient (Pearson, Spearman) → trained metric.

  26. Incorporating Global Features • Compute BLANC precision and recall for each n-gram size i • Global exponential model based on • N-gram size: BLANCi(C,R), i = 1..n • F-measure parameter Fi for each size i • Average reference segment size • Other scores (e.g. BLEU, ROUGE-L, ROUGE-S) • … • Train for average human judgment vs. train for best overall correlation (as the error function)
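A heavily hedged sketch of one possible global combination, assuming per-size precision and recall values have already been computed from the local skip n-gram scores; the weights and per-size F-measure parameters below stand in for the trainable global parameters and are not the published BLANC formula:

```python
# Combine per-size precision/recall into one score via a log-linear model.
import math

def f_measure(p, r, f):
    """Weighted harmonic mean: f = 0 returns precision, f = 1 returns recall."""
    if p == 0.0 or r == 0.0:
        return 0.0
    return p * r / ((1 - f) * r + f * p)

def global_score(precisions, recalls, weights, f_params):
    # Weighted geometric mean of the per-size F scores (a log-linear combination).
    total = sum(w * math.log(max(f_measure(p, r, f), 1e-9))
                for p, r, w, f in zip(precisions, recalls, weights, f_params))
    return math.exp(total / sum(weights))

# Example with made-up per-size scores for n = 1..3:
print(global_score(precisions=[0.75, 0.40, 0.15],
                   recalls=[0.60, 0.30, 0.10],
                   weights=[1.0, 1.0, 0.5],
                   f_params=[0.5, 0.5, 0.5]))
```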

  27. Experiment Setup • TIDES evaluation data • Arabic → English, 2003 and 2004 • Training and test sentences separated by year • Optimized: • n-gram contiguity • difference in gap size (C vs. R) • balance between precision and recall • Correlation measured with the Pearson correlation coefficient • Compared BLANC to BLEU and ROUGE • Trained BLANC for • fluency vs. adequacy • system level vs. sentence level
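Pearson correlation between metric scores and human judgments is straightforward to compute once both lists are available; the numbers below are placeholders, not TIDES data:

```python
# Pearson correlation between a metric's scores and human judgments.
import numpy as np

metric_scores   = [0.42, 0.55, 0.31, 0.60, 0.48]  # e.g. one metric value per system
human_judgments = [2.9, 3.4, 2.5, 3.8, 3.1]       # e.g. mean fluency/adequacy per system

r = np.corrcoef(metric_scores, human_judgments)[0, 1]
print(f"Pearson r = {r:.3f}")
```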

  28. Tides 2003 Arabic Evaluation • Pearson [-1,1] correlation with human judgments at system level and sentence level

  29. Tides 2004 Arabic Evaluation • Pearson [-1,1] correlation with human judgments at system level and sentence level

  30. Advantages of BLANC • Consistently good performance • Candidate evaluation is fast • Adaptable • fluency and adequacy • languages, domains • Help train MT systems for specific tasks • e.g. information extraction, information retrieval • Model complexity • Can be optimized for specific MT system performance levels

  31. Disadvantages of BLANC • Training data vs. number of parameters • Model complexity • Guarantees of the training process

  32. Conclusions • Move towards learning evaluation metrics • Quality criteria – e.g. fluency, adequacy • Correlation coefficients – e.g. Pearson, Spearman • Languages – e.g. English, Arabic, Chinese • BLANC – family of trainable evaluation metrics • Consistently performs well on evaluating machine translation output

  33. Future Work • Recently obtained a two year NSF Grant • Try different models and improve the training mechanism for BLANC • Is a local exponential model the best choice? • Is a global exponential model the best choice? • Explore different training methods • Integrate additional features • Apply BLANC to other tasks (summarization)

  34. References • Leusch, Ueffing, Vilar and Ney, “Preprocessing and Normalization for Automatic Evaluation of Machine Translation.” IEEMTS Workshop, ACL 2005 • Lin and Och, “Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics”, ACL 2004 • Lita, Lavie and Rogati, “BLANC: Learning Evaluation Metrics for MT”, HLT-EMNLP 2005 • Papineni, Roukos, Ward and Zhu, “BLEU: A Method for Automatic Evaluation of Machine Translation”, IBM Report 2002 • Akiba, Imamura and Sumita, “Using Multiple Edit Distances to Automatically Rank Machine Translation Output”, MT Summit VIII 2001 • Su, Wu and Chang, “A new Quantitative Quality Measure for a Machine Translation System”, COLING 1992

  35. Thank you

  36. Acronyms, acronyms … • Official: Broad Learning Adaptation for Numeric Criteria • Inspiration: white light contains light of all frequencies • Fun: Building on Legacy Acronym Naming Conventions • Bleu, Rouge, Orange, Pourpre … Blanc?
