
Re-evaluating Bleu

Presentation Transcript


  1. Re-evaluating Bleu Alison Alvarez Machine Translation Seminar February 16, 2006

  2. Overview • The Weaknesses of Bleu • Introduction • Precision and Recall • Fluency and Adequacy • Variations Allowed by Bleu • Bleu and Tides 2005 • An Improved Model • Overview of the Model • Experiment • Results • Conclusions Spring 2006 MT Seminar

  3. Introduction • Bleu has been shown to have high correlations with human judgments • Bleu has been used by MT researchers for five years, sometimes in place of manual human evaluations • But does the minimization of the error rate accurately show improvements in translation quality? Spring 2006 MT Seminar

  4. Precision and Bleu • Of my answers, how many are right/wrong? • Precision = |B ∩ C| / |C|, i.e. A/C • (Venn diagram: B = reference translation, C = hypothesis translation, A = their overlap B ∩ C) Spring 2006 MT Seminar

  5. Precision and Bleu • Bleu is a precision-based metric • The modified precision score, p_n: • p_n = Σ_{S∈C} Σ_{ngram∈S} Count_matched(ngram) / Σ_{S∈C} Σ_{ngram∈S} Count(ngram), where S ranges over the sentences of the candidate corpus C Spring 2006 MT Seminar
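A minimal sketch of this corpus-level modified precision, assuming whitespace tokenization and one list of reference strings per hypothesis sentence; the function names are illustrative, not from the Bleu paper.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of one tokenized sentence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hypotheses, references, n):
    """Corpus-level modified n-gram precision p_n with clipped counts (sketch)."""
    matched, total = 0, 0
    for hyp, refs in zip(hypotheses, references):
        hyp_counts = ngram_counts(hyp.split(), n)
        # Clip each hypothesis n-gram by its maximum count in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref.split(), n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched += sum(min(c, max_ref[gram]) for gram, c in hyp_counts.items())
        total += sum(hyp_counts.values())
    return matched / total if total else 0.0
```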

  6. Recall and Bleu • Of the potential answers, how many did I retrieve/miss? • Recall = |B ∩ C| / |B|, i.e. A/B • (Venn diagram: B = reference translation, C = hypothesis translation, A = their overlap B ∩ C) Spring 2006 MT Seminar
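To make the A/C versus A/B contrast concrete, here is a hedged single-sentence sketch over word types; Bleu itself works over clipped n-grams, so this only illustrates the two ratios.

```python
def overlap_precision_recall(hypothesis, reference):
    """Word-overlap precision (A/C) and recall (A/B) for one sentence pair (sketch)."""
    hyp = set(hypothesis.split())   # C: the hypothesis translation
    ref = set(reference.split())    # B: the reference translation
    overlap = len(hyp & ref)        # A: the region shared by both circles
    precision = overlap / len(hyp) if hyp else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return precision, recall
```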

  7. Recall and Bleu • Because Bleu uses multiple reference translations at once, recall cannot be calculated Spring 2006 MT Seminar

  8. Fluency and Adequacy to Evaluators • Fluency • “How do you judge the fluency of this translation?” • Judged with no reference translation and against the standard of written English • Adequacy • “How much of the meaning expressed in the reference is also expressed in the hypothesis translation?” Spring 2006 MT Seminar

  9. Variations • Bleu allows for variations in word and phrase order that reduce fluency • It places no constraints on the order in which matching n-grams occur Spring 2006 MT Seminar

  10. Variations Spring 2006 MT Seminar

  11. Variations The above two translations have the same bigram score. Spring 2006 MT Seminar

  12. Bleu and Tides 2005 • Bleu scores showed significant divergence from human judgments in the 2005 Tides Evaluation • It ranked the system considered the best by humans as sixth in performance Spring 2006 MT Seminar

  13. Bleu and Tides 2005 • Reference: Iran had already announced Kharazi would boycott the conference after Jordan’s King Abdullah II accused Iran of meddling in Iraq’s affairs • System A: Iran has already stated that Kharazi’s statements to the conference because of the Jordanian King Abdullah II in which he stood accused Iran of interfering in Iraqi affairs. • N-gram matches: 1-gram: 27; 2-gram: 20; 3-gram: 15; 4-gram: 10 • Human scores: Adequacy: 3,2; Fluency: 3,2 From Callison-Burch 2005 Spring 2006 MT Seminar

  14. Bleu and Tides 2005 • Reference: Iran had already announced Kharazi would boycott the conference after Jordan’s King Abdullah II accused Iran of meddling in Iraq’s affairs • System B: Iran already announced that Kharazi will not attend the conference because of statements made by Jordanian Monarch Abdullah II who has accused Iran of interfering in Iraqi affairs. • N-gram matches: 1-gram: 24; 2-gram: 19; 3-gram: 15; 4-gram: 12 • Human scores: Adequacy: 5,4; Fluency: 5,4 From Callison-Burch 2005 Spring 2006 MT Seminar

  15. An Experiment with Bleu Spring 2006 MT Seminar

  16. Bleu and Tides 2005 • “This opens the possibility that in order for Bleu to be valid only sufficiently similar systems should be compared with one another” Spring 2006 MT Seminar

  17. Additional Flaws • Multiple human reference translations are expensive • N-grams showing up in multiple reference translations are weighted the same • Content words are weighted the same as common words • ‘The’ counts the same as ‘Parliament’ • Bleu accounts for the diversity of human translations, but not synonyms Spring 2006 MT Seminar

  18. An Extension of Bleu • Described in Babych & Hartley, 2004 • Adds weights to matched items using • tf/idf • S-score Spring 2006 MT Seminar

  19. Addressing Flaws • Can work with only one human translation • Can actually calculate recall • The paper is not very clear about how this sentence is selected • Content words are weighted differently than common words • ‘The’ does not count the same as ‘Parliament’ Spring 2006 MT Seminar

  20. Calculating the tf/idf Score • tf.idf(i,j) = (1 + log(tf_i,j)) · log(N / df_i), if tf_i,j ≥ 1, where: • tf_i,j is the number of occurrences of the word w_i in the document d_j; • df_i is the number of documents in the corpus where the word w_i occurs; • N is the total number of documents in the corpus. From Babych 2004 Spring 2006 MT Seminar
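A hedged sketch of the tf/idf weight as defined on this slide; the slide does not state the logarithm base, so the natural log is assumed here, and the argument names are illustrative.

```python
import math

def tf_idf(word, document, corpus):
    """tf.idf(i,j) = (1 + log(tf_i,j)) * log(N / df_i), per the slide (sketch).

    `document` is a list of tokens (d_j); `corpus` is a list of such token lists.
    """
    tf = document.count(word)                      # occurrences of w_i in d_j
    if tf < 1:
        return 0.0                                 # the formula applies only when tf_i,j >= 1
    df = sum(1 for doc in corpus if word in doc)   # documents in which w_i occurs
    return (1.0 + math.log(tf)) * math.log(len(corpus) / df)
```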

  21. Calculating the S-Score • The S-score (formula given in Babych & Hartley, 2004) combines the following quantities: • P_doc(i,j) is the relative frequency of the word in the text; • P_corp-doc(i) is the relative frequency of the same word in the rest of the corpus, without this text; • (N – df(i)) / N is the proportion of texts in the corpus where this word does not occur; • P_corp(i) is the relative frequency of the word in the whole corpus, including this particular text. Spring 2006 MT Seminar

  22. Integrating the S-Score • If for a lexical item in a text the S-score > 1, all counts for the N-grams containing this item are increased by the S-score (not just by 1, as in the baseline BLEU approach). • If the S-score ≤ 1, the usual N-gram count is applied: the number is increased by 1. From Babych 2004 Spring 2006 MT Seminar
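A hedged sketch of this counting rule, assuming the S-scores have already been computed for the reference text; the helper name and the choice of the largest score when several items in an n-gram qualify are assumptions made here, not taken from the paper.

```python
def ngram_increment(ngram, s_scores):
    """Count increment for one matched n-gram under the S-score scheme (sketch).

    `s_scores` maps a lexical item to its precomputed S-score. An n-gram that
    contains an item with S-score > 1 contributes that score instead of 1;
    using the largest qualifying score is an assumption for concreteness.
    """
    best = max((s_scores.get(word, 0.0) for word in ngram), default=0.0)
    return best if best > 1.0 else 1.0
```

These increments would replace the unit counts inside a modified precision or recall loop like the one sketched after slide 5.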

  23. The Experiment • Used 100 French-English texts from the DARPA-94 evaluation corpus • Included two reference translations • Results from 4 Different MT systems Spring 2006 MT Seminar

  24. The Experiment • Stage 1: • tf/idf & S-scores are calculated on the two reference translations • Stage 2: • N-gram based evaluation using precision and recall of n-grams in the MT output • N-gram match counts were adjusted by the tf/idf weights or S-scores • Stage 3: • Comparison with human scores Spring 2006 MT Seminar

  25. Results for tf/idf Spring 2006 MT Seminar

  26. Results for S-Score Spring 2006 MT Seminar

  27. Results • The weighted n-gram model beats Bleu on correlation with adequacy • The f-score metric is more strongly correlated with fluency • Single-reference translations are stable (add stability chart?) Spring 2006 MT Seminar

  28. Conclusions • The Bleu model can be too coarse to differentiate between very different MT systems • Adequacy is harder to predict than fluency • Adding weights and using recall and f-scores can bring higher correlations with adequacy and fluency scores Spring 2006 MT Seminar

  29. References • Chris Callison-Burch, Miles Osborne and Philipp Koehn. 2006. Re-evaluating the Role of Bleu in Machine Translation Research. To appear in EACL-06. • Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02). Philadelphia, PA. July 2002. pp. 311-318. • Bogdan Babych and Anthony Hartley. 2004. Extending BLEU MT Evaluation Method with Frequency Weighting. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04). Barcelona, Spain. July 2004. • Dan Melamed, Ryan Green, and Joseph P. Turian. 2003. Precision and Recall of Machine Translation. In Proceedings of the Human Language Technology Conference (HLT-NAACL 2003), pages 61-63. Edmonton, Alberta. May 2003. http://citeseer.csail.mit.edu/melamed03precision.html • Deborah Coughlin. 2003. Correlating Automated and Human Assessments of Machine Translation Quality. In Proceedings of MT Summit IX. • LDC. 2005. Linguistic Data Annotation Specification: Assessment of Fluency and Adequacy in Translations. Revision 1.5. Spring 2006 MT Seminar

  30. Precision and Bleu • The Brevity Penalty is designed to compensate for overly terse translations • BP = 1 if c > r, and BP = e^(1 − r/c) if c ≤ r • c = length of the corpus of hypothesis translations • r = effective reference corpus length* Spring 2006 MT Seminar
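A direct sketch of the brevity penalty as defined above, with c and r given as corpus-level token counts.

```python
import math

def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r/c).

    c: length of the corpus of hypothesis translations
    r: effective reference corpus length
    """
    return 1.0 if c > r else math.exp(1.0 - r / c)
```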

  31. Precision and Bleu • Thus, the total Bleu score is: BLEU = BP · exp( Σ_{n=1}^{N} w_n log p_n ) Spring 2006 MT Seminar
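Putting the pieces together, a minimal sketch assuming the usual uniform weights w_n = 1/N over the modified precisions p_1..p_N and the brevity penalty sketched above.

```python
import math

def bleu(precisions, bp):
    """BLEU = BP * exp(sum over n of w_n * log p_n), with w_n = 1/N (sketch)."""
    if any(p == 0.0 for p in precisions):
        return 0.0          # log(0) is undefined; any zero precision gives a zero score
    n = len(precisions)
    return bp * math.exp(sum(math.log(p) / n for p in precisions))
```

For the standard setting of N = 4, this would be called as bleu([p1, p2, p3, p4], brevity_penalty(c, r)).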

  32. Flaws in the Use of Bleu • Experiments with Bleu, but no manual evaluation (Callison-Burch 2005) Spring 2006 MT Seminar
