MEANT: a semi-automatic metric for MT evaluation via semantic frames
(an assembly of ACL11, IJCAI11, SSST11), Chi-kiu Lo & Dekai Wu
Presented by SUN Jun


Presentation Transcript


  1. MEANT: a semi-automatic metric for MT evaluation via semantic frames (an assembly of ACL11, IJCAI11, SSST11). Chi-kiu Lo & Dekai Wu. Presented by SUN Jun

  2. MT is often bad • MT3: So far , the sale in the mainland of China for nearly two months of SK – II line of products • MT1: So far , nearly two months sk – ii the sale of products in the mainland of China to resume sales. • MT2: So far, in the mainland of China to stop selling nearly two months of SK – 2 products sales resumed. • Ref: Until after their sales had ceased in mainland China for almost two months, sales of the complete range of SK – II products have now been resumed. • BLEU scores shown on the slide: 0.124, 0.012, 0.013
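
As a rough illustration of how such sentence-level BLEU scores can be computed, here is a minimal sketch using NLTK's smoothed sentence_bleu. The tokenization and smoothing choice are assumptions, and the slide's exact numbers come from a different implementation and settings, so the values will not match.

```python
# Rough sketch: sentence-level BLEU for the three MT outputs above, using
# NLTK's smoothed sentence_bleu. Tokenization and smoothing are assumptions;
# the slide's scores come from a different implementation, so values differ.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

ref = ("Until after their sales had ceased in mainland China for almost two "
       "months , sales of the complete range of SK - II products have now "
       "been resumed .").split()

hyps = {
    "MT1": "So far , nearly two months sk - ii the sale of products in the mainland of China to resume sales .",
    "MT2": "So far , in the mainland of China to stop selling nearly two months of SK - 2 products sales resumed .",
    "MT3": "So far , the sale in the mainland of China for nearly two months of SK - II line of products",
}

smooth = SmoothingFunction().method1  # avoids zero scores on short or poor matches
for name, hyp in hyps.items():
    score = sentence_bleu([ref], hyp.split(), smoothing_function=smooth)
    print(f"{name}: BLEU = {score:.3f}")
```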

  3. Metrics besides BLEU have problems • Lexical-similarity-based metrics (e.g., NIST, METEOR) • Good at capturing fluency • Correlate poorly with human judgment on adequacy • Syntax-based metrics (e.g., STM; Liu and Gildea, 2005) • Much better at capturing grammaticality • Still more fluency-oriented than adequacy-oriented • Non-automatic metrics (e.g., HTER) • Use human annotators to solve the non-trivial problem of finding the minimum edit distance, in order to evaluate adequacy • Require human training and are labor-intensive

  4. MEANT: SRL for MT evaluation • Intuition behind the idea: • A useful translation helps users accurately understand the basic event structure of the source utterance: "who did what to whom, when, where and why". • Hypothesis of the work: • MT utility can best be evaluated via SRL • Better than: • N-gram-based metrics like BLEU (adequacy) • Human-training-intensive metrics like HTER (time cost) • Complex aggregate metrics like ULC (representation transparency)

  5. Q. Do PRED & ARGj correlate with human adequacy judgments?

  6. Q. Do PRED & ARGj correlate with human adequacy judgments?

  7. Experimental settings • Exp settings 1 – Corpus • ACL11: 40 sentences drawn from the newswire portion of GALE P2.5 (with SRL annotation on reference/source; 3 system outputs) • IJCAI11: 40 and 35 sentences drawn from the previous data sets, plus 39 drawn from the broadcast news portion of WMT2010 MetricsMaTr

  8. Experimental settings • Exp settings 2 – Annotation of SRL on the MT reference and output • SRL: PropBank style

  9. Experimental settings • Exp settings 3 – SRL evaluation as MT evaluation • Each predicate and argument is judged correct, incorrect, or partial • Partial: part of the meaning is correctly translated • Extra meaning in a role filler is not penalized unless it belongs in another role • An incorrectly translated predicate means the entire frame is wrong (its arguments are not counted)

  10. Experimental settings • Exp settings 3 – SRL evaluation as MT evaluation • F-measure-based scores • Role weights tuned via a confusion matrix on the dev set
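
The slides do not reproduce the exact scoring formula, but the structure described here (precision and recall over matched role fillers, partial matches discounted, per-role weights) can be sketched as below. The role weights, the partial-match discount, and the counts are illustrative placeholders, not the tuned values from the paper.

```python
# Sketch of an F-measure over SRL match counts, in the spirit of the scoring
# described on this slide. Role weights and the partial-match discount are
# illustrative placeholders, not the paper's tuned values.

def weighted_srl_f(counts, weights, w_partial=0.5):
    """counts: role -> (correct, partial, fillers_in_MT, fillers_in_ref)
    weights: role -> importance weight (placeholder values here)."""
    matched = sum(weights[r] * (c + w_partial * p)
                  for r, (c, p, _, _) in counts.items())
    total_mt = sum(weights[r] * m for r, (_, _, m, _) in counts.items())
    total_ref = sum(weights[r] * g for r, (_, _, _, g) in counts.items())
    precision = matched / total_mt if total_mt else 0.0
    recall = matched / total_ref if total_ref else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Hypothetical per-sentence counts: (correct, partial, in MT, in reference)
counts = {"PRED": (1, 0, 1, 1), "ARG0": (1, 0, 1, 1), "ARG1": (0, 1, 1, 1)}
weights = {"PRED": 1.0, "ARG0": 0.5, "ARG1": 0.5}   # placeholder weights
print(f"weighted SRL F-score = {weighted_srl_f(counts, weights):.3f}")
```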

  11. Experimental settings • Exp settings 4 – Evaluation of the evaluation • WMT and NIST MetricsMaTr (2010) • Kendall's τ rank correlation coefficient • Evaluates the correlation of the proposed metric with human judgments on translation adequacy ranking • A higher τ indicates that the ranking produced by the evaluation metric is more similar to the human ranking • The possible values of the correlation coefficient lie in [-1, 1], where 1 means the metric ranks the systems identically to the human judges and -1 means the ranking is completely reversed
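
For concreteness, Kendall's τ between a metric-induced ranking and a human ranking can be computed as in the sketch below, using scipy's standard kendalltau. The rankings are made up, and the official WMT/MetricsMaTr τ is computed over pairwise ranking judgments and differs in detail.

```python
# Sketch: Kendall's tau between a metric's system ranking and a human
# adequacy ranking, using scipy's standard implementation. The rankings are
# hypothetical; the official WMT/MetricsMaTr tau over pairwise judgments
# differs in detail.
from scipy.stats import kendalltau

human_rank  = [1, 2, 3, 4, 5]   # hypothetical human adequacy ranking of 5 systems
metric_rank = [2, 1, 3, 4, 5]   # hypothetical ranking induced by the metric

tau, p_value = kendalltau(human_rank, metric_rank)
print(f"Kendall's tau = {tau:.2f}")  # 1.0 = identical ranking, -1.0 = fully reversed
```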

  12. Observations • HMEANT vs. other metrics

  13. Observations • HMEANT on CV data

  14. Observations • HMEANT annotated monolingually vs. bilingually • Error analysis: annotators drop parts of the meaning in the translation when trying to align it to the source input

  15. Observations • HMEANT vs. MEANT (automatic SRL) • SRL tool: ASSERT (Pradhan et al., 2004), about 87% accuracy • MEANT preserves about 80% of HMEANT's correlation

  16. Q2: Impact of each individual semantic role on the metric's correlation • A preliminary experiment • For each ARGj and PRED, we manually compared each English MT output against its reference translation. Using the counts thus obtained, we computed the precision, recall, and F-score for PRED and each ARGj type.
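
A per-role precision/recall/F computation of the kind described above could look like the sketch below. All counts are hypothetical and the partial-match discount is an assumption.

```python
# Sketch of per-role precision/recall/F-score, as in the preliminary
# experiment described above: PRED and each ARGj type scored separately
# from manual match counts. All counts are hypothetical, and the
# partial-match discount of 0.5 is an assumption.
from collections import defaultdict

def per_role_prf(matches, w_partial=0.5):
    """matches: iterable of (role, correct, partial, in_mt, in_ref),
    aggregated over all evaluated sentences."""
    agg = defaultdict(lambda: [0.0, 0.0, 0.0])  # role -> [matched, in_mt, in_ref]
    for role, correct, partial, in_mt, in_ref in matches:
        agg[role][0] += correct + w_partial * partial
        agg[role][1] += in_mt
        agg[role][2] += in_ref
    out = {}
    for role, (matched, in_mt, in_ref) in agg.items():
        p = matched / in_mt if in_mt else 0.0
        r = matched / in_ref if in_ref else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        out[role] = (p, r, f)
    return out

# Hypothetical aggregated counts: (role, correct, partial, in MT, in reference)
for role, (p, r, f) in per_role_prf(
        [("PRED", 30, 5, 40, 42), ("ARG0", 20, 8, 35, 38)]).items():
    print(f"{role}: P={p:.2f} R={r:.2f} F={f:.2f}")
```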

  17. IJCAI11: evaluating the individual impact • The preliminary experiment suggests the approach is effective • Proposes metrics for evaluating the individual impact of each role

  18. IJCAI11: evaluating the individual impact • The preliminary experiment suggests the approach is effective

  19. IJCAI11: evaluating the individual impact • Results

  20. IJCAI11: evaluating the individual impact • Results (2) • Automatic SRL tool: 76–93%

  21. Q: Can it be even more accurate? • SSST11...

  22. Conclusion • ACL11 • Introduces MEANT and HMEANT • HMEANT correlates with human adequacy judgment as well as the more expensive HTER does • Automatic SRL (MEANT) preserves about 80% of that correlation • IJCAI11 • Studies the impact of each individual semantic role • SSST11 • Proposes a length-based weighting scheme to evaluate the contribution of each semantic frame

  23. END
