This presentation covers automatic evaluation metrics for machine translation, focusing on the BLEU and ROUGE metric families and on BLANC, a family of adaptable, trainable metrics built on skip n-gram features. It reviews the WER, PER, BLEU, Rouge-L, and Rouge-S metrics, discusses whether BLEU and Rouge can be made trainable, considers different notions of translation quality, and explores integrating additional knowledge for improved evaluation.
Adaptable Automatic Evaluation Metrics for Machine Translation
Lucian Vlad Lita, joint work with Alon Lavie and Monica Rogati
Outline • BLEU and ROUGE metric families • BLANC – family of adaptable metrics • All common skip n-grams • Local n-gram model • Overall model • Experiments and results • Conclusions • Future work • References
Automatic Evaluation Metrics
translation quality ( candidate | reference ), over time:
• Manual human judgments
• Edit distance (WER)
• Word overlap (PER)
• Metrics based on n-grams
• n-gram precision (BLEU)
• weighted n-grams (NIST)
• longest common subsequence (Rouge-L)
• skip 2-grams (pairs of ordered words – Rouge-S)
• Integrate additional knowledge (synonyms, stemming) (METEOR)
Automatic Evaluation Metrics
translation quality ( candidate | reference )
• Manual human judgments
• Machine translation (MT) evaluation metrics
• Manually created estimators of quality
• Improvements often shown on the same data
• Rigid notion of quality
• Based on existing judgment guidelines
• Goal: trainable evaluation metric
Goal: Trainable MT Metric • Build on the features used by established metrics (BLEU, ROUGE) • Extendable – additional features/processing • Correlate well with human judgments • Trainable models • Different notions of “translation quality” • E.g. computer consumption vs. human consumption • Different features will be more important for different • Languages • Domains
The WER Metric
R: the students asked the professor
C: the students talk professor
• Transform reference (human) translation R into candidate (machine) translation C
• Levenshtein (edit) distance
Word Error Rate = (# of word insertions, deletions, and substitutions) / (# words in R)
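A minimal sketch of this computation (not from the original slides): word-level Levenshtein distance normalized by reference length. The function name and whitespace tokenization are illustrative choices.

```python
def word_error_rate(reference, candidate):
    """WER: word-level edit distance, normalized by reference length."""
    r, c = reference.split(), candidate.split()
    # dp[i][j] = edits to turn the first i words of r into the first j words of c
    dp = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(c) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(c) + 1):
            sub = 0 if r[i - 1] == c[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(r)][len(c)] / len(r)

# Slide example: 2 edits (substitute "asked", drop one "the") / 5 reference words = 0.4
print(word_error_rate("the students asked the professor",
                      "the students talk professor"))
```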
The PER Metric
R: the students asked the professor
C: the students talk professor
• Word overlap between candidate (machine) translation C and reference (human) translation R
• Bag of words
Position-independent Error Rate = Σ_{w in C} |count of w in R – count of w in C| / (# words in R)
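A small sketch of the PER formula as written above, assuming whitespace tokenization; following the slide's flattened fraction, the sum runs over the words appearing in C.

```python
from collections import Counter

def position_independent_error_rate(reference, candidate):
    """PER as defined on the slide: bag-of-words count mismatch over the
    candidate's vocabulary, normalized by the reference length."""
    r, c = Counter(reference.split()), Counter(candidate.split())
    mismatch = sum(abs(r[w] - c[w]) for w in c)  # missing Counter keys count as 0
    return mismatch / sum(r.values())

# Slide example: |2-1| for "the" + |0-1| for "talk" = 2, over 5 reference words
print(position_independent_error_rate("the students asked the professor",
                                      "the students talk professor"))
```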
The BLEU Metric
R: the students asked the professor
C: the students talk professor
• Contiguous n-gram overlap between reference (human) translation R and candidate (machine) translation C
• Modified n-gram precisions
• 1-gram precision = 3 / 4
• 2-gram precision = 1 / 3
• …
BLEU = ( Π_{i=1}^{n} P_{i-gram} )^{1/n} × ( brevity penalty )
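A single-reference BLEU sketch with clipped (modified) n-gram counts, a geometric mean over orders, and the brevity penalty; smoothing and multiple references are omitted, and the helper names are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, candidate, max_n=2):
    """Geometric mean of modified n-gram precisions times a brevity penalty."""
    r, c = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(c, n), ngrams(r, n)
        # Clip candidate counts against the reference counts ("modified" precision)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / len(c))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Slide example: P1 = 3/4, P2 = 1/3 for this candidate/reference pair
print(bleu("the students asked the professor",
           "the students talk professor", max_n=2))
```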
The BLEU Metric • BLEU is the most established evaluation metric in MT • Basic feature: contiguous n-grams of all sizes • Computes modified precision • Uses a simple formula to combine all precision scores • Bigram precision is “as important” as unigram precision • Brevity penalty – quasi recall
The Rouge-L Metric
R: the students asked the professor
C: the students talk professor
• Longest common subsequence (LCS) of the candidate (machine) translation C and reference (human) translation R
• LCS = 3 ("the students … professor")
Precision = LCS(C,R) / # words in C    Recall = LCS(C,R) / # words in R
Rouge-L = harmonic mean (Precision, Recall) = 2PR / (P+R)
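A short LCS-based Rouge-L sketch matching the precision/recall definitions above; the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(reference, candidate):
    r, c = reference.split(), candidate.split()
    lcs = lcs_length(r, c)
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall) if lcs else 0.0

# Slide example: LCS = 3 ("the students ... professor"), P = 3/4, R = 3/5
print(rouge_l("the students asked the professor",
              "the students talk professor"))
```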
The Rouge-S Metric
R: the students asked the professor
C: the students talk professor
• Skip 2-gram overlap of the candidate (machine) translation C and reference (human) translation R
• Skip2(C) = 6: { "the students", "the talk", "the professor", "students talk", "students professor", "talk professor" }
• Skip2(C,R) = 3: { "the students", "the professor", "students professor" }
The Rouge-S Metric (continued)
• Skip 2-gram overlap of the candidate (machine) translation C and reference (human) translation R
Precision = Skip2(C,R) / (|C| choose 2)    Recall = Skip2(C,R) / (|R| choose 2)
Rouge-S = harmonic mean (Precision, Recall)
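A skip 2-gram sketch using distinct ordered word pairs, which reproduces the counts in the example above (Skip2(C) = 6, Skip2(C,R) = 3); treating repeated word pairs as a single skip 2-gram is an assumption of this sketch.

```python
from itertools import combinations
from math import comb

def skip_bigrams(tokens):
    """All ordered word pairs (skip 2-grams), regardless of gap size."""
    return set(combinations(tokens, 2))

def rouge_s(reference, candidate):
    r, c = reference.split(), candidate.split()
    overlap = len(skip_bigrams(r) & skip_bigrams(c))
    precision = overlap / comb(len(c), 2)
    recall = overlap / comb(len(r), 2)
    return 2 * precision * recall / (precision + recall) if overlap else 0.0

# Slide example: Skip2(C,R) = 3, P = 3/6, R = 3/10
print(rouge_s("the students asked the professor",
              "the students talk professor"))
```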
The ROUGE Metrics • Rouge-L • Basic feature: longest common subsequence LCS • Size of the longest common skip n-gram • Weighted LCS • Rouge-S • Basic feature: skip bigrams • Skip bigram gap size irrelevant • Limited to n-grams of size 2 • Both use harmonic mean (F1-measure) to combine precision and recall
Is BLEU Trainable?
• Can we assign/learn relative importance between P2 and P3?
• Simplest model: regression
• Train/test on past MT output [C,R]
• Inputs: P1, P2, P3 … and brevity penalty
• (P1, P2, P3, bp) → HJ fluency score (see the sketch below)
BLEU = ( Π_{i=1}^{n} P_{i-gram} )^{1/n} × ( brevity penalty )
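A toy illustration of the regression idea, assuming scikit-learn is available; the feature rows and fluency scores below are entirely hypothetical, standing in for past MT output scored by human judges.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: one row per (candidate, reference) pair with
# BLEU component features [P1, P2, P3, brevity_penalty]; the target is a
# human fluency judgment. Real data would come from past MT evaluations.
X = np.array([
    [0.80, 0.55, 0.40, 0.95],
    [0.60, 0.30, 0.15, 0.85],
    [0.90, 0.70, 0.55, 1.00],
    [0.50, 0.20, 0.05, 0.70],
])
y = np.array([3.8, 2.5, 4.3, 1.9])   # hypothetical fluency scores (1-5 scale)

model = LinearRegression().fit(X, y)
print("learned relative importance of P1, P2, P3, BP:", model.coef_)
print("predicted fluency:", model.predict([[0.7, 0.4, 0.25, 0.9]]))
```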
Is Rouge Trainable?
• Simple regression on
• Size of the longest common skip n-gram
• Number of common skip 2-grams
• Second-order parameters (dependencies) – the model is no longer linear in its inputs
• Window size (computational reasons)
• F-measure balance F (replacing the brevity penalty)
• Potential models
• Iterative methods
• Hill climbing?
• Non-linear model: (bp, |LCS|, Skip2, F, ws) → HJ fluency score (see the sketch below)
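A hypothetical sketch of such a non-linear variant, again with made-up data and assuming scikit-learn: a degree-2 polynomial expansion over the (bp, |LCS|, Skip2, F, window size) features introduces the second-order dependencies mentioned above.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

# Hypothetical features per (C, R) pair: [brevity_penalty, LCS_length,
# skip2_overlap, F_balance, window_size]; target is a human fluency score.
X = np.array([
    [0.95, 12, 30, 0.5, 4],
    [0.80,  8, 18, 0.5, 4],
    [1.00, 15, 41, 0.5, 4],
    [0.70,  5,  9, 0.5, 4],
])
y = np.array([4.1, 3.0, 4.6, 2.2])

# The degree-2 expansion adds feature products, so the learned model is
# no longer linear in the raw inputs.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[0.9, 10, 25, 0.5, 4]]))
```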
The BLANC Metric Family • Generalization of established evaluation metrics • N-gram features used by BLEU and ROUGE • Trainable parameters • Skip n-gram contiguity in C • Relative importance of n (i.e. bigrams vs. trigrams) • Precision-recall balance • Adaptability to different: • Translation quality criteria, languages, domains • Allow additional processing/features (e.g. METEOR matching)
All Common Skip N-grams
R: the new student brought the food
C: the one pure student brought the necessary condiments
[Figure: the matched words between R and C ("the", "student", "brought", "the"), with their positions, e.g. the(0,0), student(2,3), brought(3,4), the(4,5), generate the common skip n-grams: # 1-grams: 4, # 2-grams: 6, # 3-grams: 4, # 4-grams: 1.]
All Common Skip N-grams (continued)
[Figure: the same R/C example, but each common skip n-gram match, e.g. score(the(0,0), student(2,3)), is assigned a score, and the per-size counts are replaced by score(1-grams), score(2-grams), score(3-grams), score(4-grams).]
All Common Skip N-grams
• Algorithms literature: all common subsequences
• Listing vs. counting subsequences
• Interested in counting: # common subsequences of size 1, 2, 3 …
• Replace counting with a score over all n-grams of the same size
• Score(w1 … wi wi+1 … wn) = Score(w1 … wi) · Score(wi+1 … wn)
• BLANCi(C,R) = f(common i-grams of C, R)
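One way to make "all common skip n-grams" concrete is brute-force enumeration over index tuples, sketched below on the small example from the earlier slides. Note this counts matched index pairs, whereas the figure above counts n-grams within one alignment, and the actual BLANC implementation replaces counting with scoring; a dynamic program is needed for longer segments.

```python
from itertools import combinations

def common_skip_ngrams(ref_tokens, cand_tokens, n):
    """All common skip n-grams of size n: pairs of index tuples, one in R
    and one in C, whose words match in order. Fine for sentence-length
    inputs; replace with a DP for anything longer."""
    matches = []
    for r_idx in combinations(range(len(ref_tokens)), n):
        r_words = tuple(ref_tokens[i] for i in r_idx)
        for c_idx in combinations(range(len(cand_tokens)), n):
            if tuple(cand_tokens[i] for i in c_idx) == r_words:
                matches.append((r_idx, c_idx))
    return matches

R = "the students asked the professor".split()
C = "the students talk professor".split()
for n in range(1, 5):
    print(n, len(common_skip_ngrams(R, C, n)))
```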
Modeling Gap Size Importance
Skip 3-grams with different gap sizes:
… the ____ ____ ____ ____ student ____ ____ has …
… the ____ student has …
… the student has …
Modeling Gap Size Importance
C: … the __ __ __ __ student __ __ has …
• Model the importance of skip n-gram gap size as an exponential function with one parameter (the gap-size weight)
• Special cases
• Gap size doesn't matter (Rouge-S): gap-size weight = 0
• No gaps are allowed (BLEU): gap-size weight = a large number
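A sketch of the gap-size term; "alpha" is a stand-in name for the parameter whose symbol was lost from the slide.

```python
import math

def gap_weight(gap_size, alpha):
    """Exponential penalty for the gap inside a skip n-gram in C.
    alpha is a stand-in name for the slide's gap-size parameter."""
    return math.exp(-alpha * gap_size)

# Special cases from the slide:
print(gap_weight(4, alpha=0.0))    # 1.0 -> gap size ignored (Rouge-S behaviour)
print(gap_weight(4, alpha=100.0))  # ~0  -> gaps effectively forbidden (BLEU behaviour)
print(gap_weight(0, alpha=100.0))  # 1.0 -> contiguous n-grams still count fully
```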
Modeling Candidate-Reference Gap Difference
Skip 3-gram match:
R: … the ____ student has …
C1: … the ____ ____ ____ ____ student ____ ____ has …
C2: … the student has …
Modeling Candidate-Reference Gap Difference
R: … the __ student has …
C: … the __ __ __ __ student __ __ has …
• Model the importance of the gap size difference between the candidate and reference translations as an exponential function with one parameter (the gap-difference weight)
• Special cases
• Gap size differences do not matter: gap-difference weight = 0
• Skip 2-gram overlap (Rouge-S): gap-size weight = 0, gap-difference weight = 0, n = 2
• Largest skip n-gram (Rouge-L): gap-size weight = 0, gap-difference weight = 0, n = LCS
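A companion sketch for the gap-difference term; "beta" is again a stand-in name for the lost parameter symbol.

```python
import math

def gap_difference_weight(gap_in_candidate, gap_in_reference, beta):
    """Exponential penalty on how much the candidate's gap differs from the
    reference's gap for the same skip n-gram match."""
    return math.exp(-beta * abs(gap_in_candidate - gap_in_reference))

# C has a gap of 4 where R has a gap of 1 -> penalized when beta > 0
print(gap_difference_weight(4, 1, beta=0.5))
# beta = 0 recovers the Rouge-S behaviour of ignoring gap differences
print(gap_difference_weight(4, 1, beta=0.0))
```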
Skip N-gram Model
• Incorporate simple scores into an exponential model
• Skip n-gram gap size
• Candidate-reference gap size difference
• Possible to incorporate higher-level features
• Partial skip n-gram matching (e.g. synonyms, stemming)
• "the __ students" vs. "the __ pupils", "the __ students" vs. "the __ student"
• From word classing to syntax
• e.g. score("students __ __ professor") vs. score("the __ __ of")
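A sketch of how such per-match features could be combined in a local exponential model; the feature names and weights below are illustrative, not the exact parameterization of the BLANC paper.

```python
import math

def skip_ngram_match_score(features, weights):
    """Local exponential model: score one candidate/reference skip n-gram
    match as exp(-sum_i w_i * f_i)."""
    return math.exp(-sum(weights[name] * value for name, value in features.items()))

# A hypothetical match of "the __ __ student" in C against "the __ student" in R:
features = {
    "gap_size_in_candidate": 2,   # words skipped inside C
    "gap_size_difference": 1,     # |gap in C - gap in R|
    "partial_match_penalty": 0,   # e.g. 1 if matched only via stemming/synonyms
}
weights = {"gap_size_in_candidate": 0.3,
           "gap_size_difference": 0.5,
           "partial_match_penalty": 0.8}
print(skip_ngram_match_score(features, weights))
```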
BLANC Overview
[Flow diagram: Candidates and References → Find All Common Skip N-grams → Compute Skip N-gram Pair Features, scored with an exponential model e^(–Σi λi fi(sn)) → Combine All Common Skip N-gram Scores using global parameters (precision/recall balance, f(skip n-gram size)) → Compute Correlation against a criterion (adequacy, fluency, f(adequacy, fluency), other) using a coefficient (Pearson, Spearman) → Trained Metric]
Incorporating Global Features
• Compute BLANC precision and recall for each n-gram size i
• Global exponential model based on
• N-gram size: BLANCi(C,R), i = 1..n
• F-measure parameter F for each size i
• Average reference segment size
• Other scores (e.g. BLEU, ROUGE-L, ROUGE-S)
• …
• Train for average human judgment vs. train for best overall correlation (as the error function)
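An illustrative combination of per-size precision and recall into a single score, with a per-size F-measure balance and trainable size weights; this follows the spirit of the bullets above, not the paper's exact global model.

```python
def blanc_global_score(precisions, recalls, size_weights, f_weights):
    """Combine per-n-gram-size precision/recall into one score. Each size i
    gets a weighted F-measure with its own balance parameter, and the sizes
    are mixed with trainable weights (illustrative combination only)."""
    total, norm = 0.0, sum(size_weights)
    for p, r, w, f in zip(precisions, recalls, size_weights, f_weights):
        if p + r == 0:
            continue
        # Weighted harmonic mean: f near 1 favours recall, near 0 favours precision
        fmeasure = (p * r) / (f * p + (1 - f) * r)
        total += w * fmeasure
    return total / norm

# Hypothetical per-size skip n-gram scores for n = 1..3
print(blanc_global_score(precisions=[0.8, 0.5, 0.3],
                         recalls=[0.7, 0.4, 0.2],
                         size_weights=[1.0, 0.7, 0.4],
                         f_weights=[0.5, 0.5, 0.5]))
```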
Experiment Setup • TIDES evaluation data • Arabic-to-English 2003, 2004 • Training and test sentences separated by year • Optimized: • n-gram contiguity • difference in gap size (C vs. R) • balance between precision and recall • Correlation using the Pearson correlation coefficient • Compared BLANC to BLEU and ROUGE • Trained BLANC for • fluency vs. adequacy • system level vs. sentence level
Tides 2003 Arabic Evaluation • Pearson [-1,1] correlation with human judgments at system level and sentence level
Tides 2004 Arabic Evaluation • Pearson [-1,1] correlation with human judgments at system level and sentence level
Advantages of BLANC • Consistently good performance • Candidate evaluation is fast • Adaptable • fluency and adequacy • languages, domains • Help train MT systems for specific tasks • e.g. information extraction, information retrieval • Model complexity • Can be optimized for specific MT system performance levels
Disadvantages of BLANC • Training data vs. number of parameters • Model complexity • Guarantees of the training process
Conclusions • Move towards learning evaluation metrics • Quality criteria – e.g. fluency, adequacy • Correlation coefficients – e.g. Pearson, Spearman • Languages – e.g. English, Arabic, Chinese • BLANC – family of trainable evaluation metrics • Consistently performs well on evaluating machine translation output
Future Work • Recently obtained a two-year NSF grant • Try different models and improve the training mechanism for BLANC • Is a local exponential model the best choice? • Is a global exponential model the best choice? • Explore different training methods • Integrate additional features • Apply BLANC to other tasks (summarization)
References
• Leusch, Ueffing, Vilar and Ney, "Preprocessing and Normalization for Automatic Evaluation of Machine Translation", IEEMTS Workshop, ACL 2005
• Lin and Och, "Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics", ACL 2004
• Lita, Lavie and Rogati, "BLANC: Learning Evaluation Metrics for MT", HLT-EMNLP 2005
• Papineni, Roukos, Ward and Zhu, "BLEU: A Method for Automatic Evaluation of Machine Translation", IBM Report 2002
• Akiba, Imamura and Sumita, "Using Multiple Edit Distances to Automatically Rank Machine Translation Output", MT Summit VIII 2001
• Su, Wu and Chang, "A New Quantitative Quality Measure for a Machine Translation System", COLING 1992
Acronyms, acronyms … • Official: Broad Learning Adaptation for Numeric Criteria • Inspiration: white light contains light of all frequencies • Fun: Building on Legacy Acronym Naming Conventions • Bleu, Rouge, Orange, Pourpre … Blanc?