Contextual Bitext-derived Paraphrases in Automatic MT Evaluation

Karolina Owczarzak, Declan Groves, Josef Van Genabith, Andy Way
National Centre for Language Technology, Dublin City University
HLT-NAACL, 09 June 2006

Overview
• Automatic MT evaluation and its limitations
• Generation of paraphrases from word alignments
• Using paraphrases in evaluation
• Correlation with human judgments
• Paraphrase quality

Automatic evaluation of MT quality
• Most popular metrics: BLEU and NIST
• Example: one candidate translation (T) scored with BLEU against three references (R1, R2, R3). Sentences on the slide:
  - But we don’t have an answer
  - But we have no answer to it
  - However we cannot react
  - However we have no reply to that
• N-gram matches (matched / total in the candidate):
  - 7-grams: 0/1
  - 6-grams: 0/2
  - 5-grams: 0/3
  - 4-grams: 0/4
  - 3-grams: 1/5
  - 2-grams: 3/6
  - 1-grams: 6/7
• BLEU = 0.0000 (no 4-gram matches), or 0.4392 with smoothing
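The n-gram counts in the BLEU example above can be reproduced with a few lines of Python. This is a minimal sketch of clipped n-gram matching only, not a full BLEU implementation (no brevity penalty, no smoothing; the slide does not say which smoothing gives 0.4392). It assumes whitespace tokenization and lowercasing. It also assumes that the candidate is "But we have no answer to it" and that the other three sentences are the references, since that is the assignment under which the counts on the slide come out. The function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_matches(candidate, references, n):
    """Return (matched, total) n-gram counts for the candidate.

    Each candidate n-gram is credited at most as many times as it
    occurs in the single reference where it is most frequent.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    max_ref = Counter()
    for ref in references:
        ref_counts = Counter(ngrams(ref.lower().split(), n))
        for gram, count in ref_counts.items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched, sum(cand_counts.values())


# Assumed assignment of the slide's sentences (see the note above).
candidate = "But we have no answer to it"
references = [
    "But we don't have an answer",
    "However we cannot react",
    "However we have no reply to that",
]

for n in range(7, 0, -1):
    matched, total = clipped_matches(candidate, references, n)
    print(f"{n}-grams: {matched}/{total}")
```

Because the 4-gram precision is 0/4, the geometric mean used by unsmoothed BLEU is zero, which is why the example scores 0.0000 despite the high unigram overlap.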

Automatic evaluation of MT quality
• Insensitive to admissible lexical differences
  - answer ≠ reply
• Insensitive to admissible syntactic differences
  - yesterday it was raining ≠ it was raining yesterday
  - we don’t have ≠ we have no

Automatic evaluation of MT quality
• Attempts to come up with better metrics:
  - word order:
    Translation Error Rate (Snover et al. 2005)
    Maximum Matching String (Turian et al. 2003)
  - lexical and word-order issues:
    CDER (Leusch et al. 2006)
    METEOR (Banerjee and Lavie 2005)
    linear regression model (Russo-Lassner et al. 2005)
• These need POS taggers, stemmers, thesauri, WordNet

Word and phrase alignment
• Statistical Machine Translation: source-language text aligned with target-language text
• [Figure: aligned sentence fragments]
  - …le… …agréable… ↔ …the… …nice…
  - …ce jour… …agréable… ↔ …that day… …pleasant…
  - …pays… …agréable… ↔ …country… …good…
  - …je suis… …agréable… ↔ …I am… …nice…
• Resulting alignments:
  - agréable {nice, pleasant, good}
  - nous n’avons pas {we don’t have, we have no}

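Generating paraphrases
• For each word/phrase e_i, find all words/phrases f_i1, …, f_in that e_i aligns with; then for each f_i, find all words/phrases e_(k≠i)1, …, e_(k≠i)n that f_i aligns with (Bannard and Callison-Burch 2005)
• Example for "nice":
  - nice → agréable 0.5, bon 0.25, bonne 0.25
  - agréable → pleasant 0.75, agreeable 0.25
  - bon → good 0.8, great 0.2
  - bonne → good 0.99
• Paraphrase probabilities (product of the two alignment probabilities):
  - pleasant: 0.5 * 0.75 = 0.375
  - agreeable: 0.5 * 0.25 = 0.125
  - good (via bon): 0.25 * 0.8 = 0.2
  - great: 0.25 * 0.2 = 0.05
  - good (via bonne): 0.25 * 0.99 = 0.2475
• Probabilities for the same paraphrase are summed over pivots: good = 0.2 + 0.2475 = 0.4475
• nice = {good (0.4475), pleasant (0.375), agreeable (0.125), great (0.05)}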
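The pivoting arithmetic for "nice" can be written as a short Python sketch. It assumes the alignment probabilities are already available as nested dictionaries; the names `e2f`, `f2e` and `pivot_paraphrases` are illustrative and not taken from the original work, and the assignment of "good 0.8, great 0.2" to bon and "good 0.99" to bonne follows the order of the numbers on the slide.

```python
from collections import defaultdict


def pivot_paraphrases(phrase, e2f, f2e):
    """Score paraphrases of `phrase` by pivoting through aligned foreign phrases.

    For every foreign phrase f aligned with `phrase`, and every other phrase e2
    aligned with f, add p(f | phrase) * p(e2 | f) to the score of e2.
    """
    scores = defaultdict(float)
    for f, p_f in e2f.get(phrase, {}).items():
        for e2, p_e2 in f2e.get(f, {}).items():
            if e2 != phrase:  # a phrase is not its own paraphrase
                scores[e2] += p_f * p_e2
    # Highest-scoring paraphrase first.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Toy alignment probabilities from the slide.
e2f = {"nice": {"agréable": 0.5, "bon": 0.25, "bonne": 0.25}}
f2e = {
    "agréable": {"pleasant": 0.75, "agreeable": 0.25},
    "bon": {"good": 0.8, "great": 0.2},
    "bonne": {"good": 0.99},
}

for paraphrase, score in pivot_paraphrases("nice", e2f, f2e).items():
    print(f"{paraphrase}: {score:.4f}")
```

Summing over pivots is what lifts "good" above "pleasant": neither bon nor bonne alone contributes more than 0.2475, but together they give 0.4475.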

Paraphrases in automatic MT evaluation
• Data: source (fr), translation (en), reference (en)
• Paraphrase sets derived from the aligned data: e_a = {p_ea1, …, p_ean} … e_z = {p_ez1, …, p_ezn}
• For each segment: reference + paraphrase sets = additional reference variants
• [Figure: the words w0 … w9 of the reference (Ref) are combined with their paraphrases to give Ref1, Ref2, Ref3, Ref4, … (all en)]

Experiment 1
• Test set: 2000 sentences, French-English Europarl
• Two translations:
  - Pharaoh: phrase-based SMT
  - Logomedia: rule-based MT
• Scored with BLEU and NIST against:
  - Original reference
  - Best-matching reference using paraphrases derived from the test set
• Paraphrase lists generated using GIZA++ and refined word alignment strategy (Och and Ney, 2003; Koehn et al., 2003; Tiedemann, 2004)
• Subset of 100 sentences from each translation scored by two human judges (accuracy, fluency)

Examples of paraphrases
• area – field, this area, sector, aspect, this sector
  Not: {country}, {sphere, domain, orbit, field, arena}, {region}, {expanse, surface area}
• above all – specifically, especially
• agreement – accordance
• believe that – believe, think that, feel that, think
• extensive – widespread, broad, wide
• make progress on – can move forward
• risk management – management of risks
Slide callout: Syntactic variation

Examples of reference segments

Example 1
Candidate translation: the question of climates with is a good example
Original reference: the climate issue is a good example of this
Best-match reference: the climate question is a good example of this

Example 2
Candidate translation: thank you very much mr commissioner
Original reference: thank you commissioner
Best-match reference: thank you very much commissioner
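One way to picture how a best-match reference such as those in the two examples could be built is sketched below. This is a simplified illustration under stated assumptions, not the procedure used in the experiment: it replaces a word or phrase in the reference whenever one of its paraphrases occurs in the candidate, scanning left to right and trying longer phrases first. The paraphrase table, the function names and the `max_len` parameter are hypothetical.

```python
def find_phrase(tokens, phrase):
    """Return True if the token list `phrase` occurs contiguously in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def best_match_reference(candidate, reference, paraphrases, max_len=3):
    """Rewrite `reference` using paraphrases that appear in `candidate`.

    `paraphrases` maps a reference word/phrase to a list of alternatives.
    """
    cand = candidate.split()
    ref = reference.split()
    out, i = [], 0
    while i < len(ref):
        replacement, step = [ref[i]], 1  # default: keep the original word
        # Try the longest phrase starting at position i first.
        for n in range(min(max_len, len(ref) - i), 0, -1):
            source = " ".join(ref[i:i + n])
            hit = next(
                (alt.split() for alt in paraphrases.get(source, [])
                 if find_phrase(cand, alt.split())),
                None,
            )
            if hit is not None:
                replacement, step = hit, n
                break
        out.extend(replacement)
        i += step
    return " ".join(out)


# Hypothetical paraphrase entries chosen to match the two examples.
paraphrases = {
    "issue": ["question"],
    "thank you": ["thank you very much"],
}

print(best_match_reference(
    "the question of climates with is a good example",
    "the climate issue is a good example of this",
    paraphrases,
))
print(best_match_reference(
    "thank you very much mr commissioner",
    "thank you commissioner",
    paraphrases,
))
```

The candidate is never modified; only the reference is moved closer to it, so any n-gram metric can then be run unchanged on the new candidate-reference pair.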

Results
Translation by Pharaoh on the Europarl (EP) test set of 2000 sentences
[Results table not preserved]

Pearson’s correlation with human judgment
Subset of 100 sentences from the translation by Pharaoh (Europarl (EP) test set of 2000 sentences)
[Correlation table not preserved]
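For readers unfamiliar with the statistic, Pearson's correlation between segment-level metric scores and human scores can be computed as in the sketch below. The function name and the numbers in the usage example are made up for illustration and are not results from the experiment.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical metric scores and human judgments for five segments.
metric_scores = [0.21, 0.35, 0.10, 0.48, 0.30]
human_scores = [2, 3, 1, 4, 3]
print(round(pearson(metric_scores, human_scores), 3))
```

A value near 1 means the metric ranks segments much as the human judges do; a value near 0 means the two are unrelated.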

Paraphrase quality
• 700,000 sentence pairs, French-English Europarl
• Paraphrase lists generated using GIZA++ and refined word alignment strategy (Och and Ney, 2003; Koehn et al., 2003; Tiedemann, 2004)
• Quality of paraphrases evaluated with respect to syntactic and semantic accuracy (Bannard and Callison-Burch, 2005)

Results
[Chart not preserved: syntactic accuracy and semantic accuracy of the paraphrases]

Filtering paraphrases
• Some inaccuracy is still useful:
  - be supported – support, supporting
• Filters:
  - exclude closed-class items: prepositions, personal pronouns, possessive pronouns, auxiliary verbs have and be
    Fr. à → Eng. to, in, at, but to ≠ in ≠ at
  - prevent paraphrases of the form e_i – (w) e_i (w), where w ∈ {prepositions, pronouns, auxiliary verbs, modal verbs, negation, conjunctions}
    *aspect – aspect is
    *hours – hours for
    *available – not available
  - POS taggers, parsers
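The two word-list filters above can be sketched as a single predicate. The word lists here are tiny illustrative samples, not the lists used in the original work, and the function name is hypothetical. As a simplification, the second rule rejects a paraphrase that consists of the original phrase plus any number of listed function words on either side.

```python
# Illustrative samples only; a real filter would use complete lists.
CLOSED_CLASS = {"to", "in", "at", "of", "for", "i", "you", "we", "my", "our",
                "have", "has", "be", "is", "are"}
FUNCTION_WORDS = CLOSED_CLASS | {"not", "no", "and", "or", "can", "must", "will"}


def keep_pair(phrase, paraphrase,
              closed_class=CLOSED_CLASS, function_words=FUNCTION_WORDS):
    """Return False if the (phrase, paraphrase) pair should be filtered out."""
    p, q = phrase.split(), paraphrase.split()

    # Filter 1: do not paraphrase single closed-class items (e.g. to / in / at).
    if len(p) == 1 and p[0] in closed_class:
        return False

    # Filter 2: reject "e - (w) e (w)", i.e. the phrase itself plus function words.
    n = len(p)
    for start in range(len(q) - n + 1):
        if q[start:start + n] == p:
            extra = q[:start] + q[start + n:]
            if extra and all(w in function_words for w in extra):
                return False
    return True


pairs = [
    ("aspect", "aspect is"),
    ("hours", "hours for"),
    ("available", "not available"),
    ("to", "in"),
    ("be supported", "support"),
    ("area", "field"),
]
for phrase, paraphrase in pairs:
    print(f"{phrase} - {paraphrase}: {'keep' if keep_pair(phrase, paraphrase) else 'drop'}")
```

The pair "be supported – support" passes both filters, in line with the point that some syntactically inexact paraphrases are still useful for matching.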

References
• Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization: 65-73.
• Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with Bilingual Parallel Corpora. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005): 597-604.
• Philipp Koehn, Franz Och and Daniel Marcu. 2003. Statistical Phrase-Based Translation. Proceedings of the Human Language Technology Conference (HLT-NAACL 2003): 48-54.
• Grazia Russo-Lassner, Jimmy Lin and Philip Resnik. 2005. A Paraphrase-based Approach to Machine Translation Evaluation. Technical Report LAMP-TR-125/CS-TR-4754/UMIACS-TR-2005-57, University of Maryland, College Park, MD.
• Matthew Snover, Bonnie Dorr, Richard Schwartz, John Makhoul, Linnea Micciulla and Ralph Weischedel. 2005. A Study of Translation Error Rate with Targeted Human Annotation. Technical Report LAMP-TR-126/CS-TR-4755/UMIACS-TR-2005-58, University of Maryland, College Park, MD.
• Jörg Tiedemann. 2004. Word to word alignment strategies. Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004): 212-218.
• Gregor Leusch, Nicola Ueffing and Hermann Ney. 2006. CDER: Efficient MT Evaluation Using Block Movements. To appear in Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006).
• Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29:19-51.
• Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur, Anoop Sarkar, Kenji Yamada, Alex Fraser, Shankar Kumar, Libin Shen, David Smith, Katherine Eng, Viren Jain, Zhen Jin and Dragomir Radev. 2003. Syntax for statistical machine translation. Technical report, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD.
