Confidence Estimation for Machine Translation
Confidence Estimation for Machine Translation J. Blatz et al., COLING 04 SSLI MTRG 11/17/2004 Takahiro Shinozaki
Abstract • Detailed study of CE for machine translation • Various machine learning methods • CE for sentences and for words • Different definitions of correctness • Experiments • NIST 2003 Chinese-to-English MT evaluation
1 Introduction • CE can improve the usability of NLP-based systems • CE techniques are not well studied in machine translation • Investigate sentence- and word-level CE
2 Background • Strong vs. weak CE • Strong CE: requires correctness probabilities • Weak CE: requires only a binary classification (CE score → threshold → binary output), not necessarily a probability
2 Background • Has a distinct CE layer or not • No distinct CE layer: the NLP system itself outputs a confidence • Distinct CE layer (naïve Bayes, NN, SVM, etc.) on top of the NLP system • Requires a training corpus • Powerful and modular
3 Experimental Setting • ISI Alignment Template MT system translates source (Src) input sentences into N-best hypotheses (Hyp) • Hypotheses are labeled correct or not against reference sentences • Data is split into train / validation / test sets
3.1 Corpora • Chinese-to-English • Evaluation sets from NIST MT competitions • Multi reference corpus from LDC
3.2 CE Techniques • Data: a collection of pairs (x, c) • x: feature vector, c: correctness • Weak CE • x → score • x → MLP score (regressing the MT evaluation score) • Strong CE • x → naïve Bayes P(c=1|x) • x → MLP P(c=1|x)
3.2 Naïve Bayes (NB) • Assume features x1, x2, …, xD are statistically independent given the class C • Apply absolute discounting for smoothing
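The naïve Bayes model on this slide can be sketched as follows (a minimal illustration, not the paper's implementation; feature values are assumed to be discretized, and a discount of 0.5 is an arbitrary choice):

```python
import math
from collections import Counter

def train_nb(examples, labels, discount=0.5):
    """Naive Bayes: class prior P(c) and per-feature conditionals P(x_d | c),
    assuming the features are statistically independent given the class.
    Absolute discounting subtracts `discount` from each seen count and
    redistributes the freed mass uniformly over unseen feature values."""
    classes = sorted(set(labels))
    num_feats = len(examples[0])
    vocab = [sorted({x[d] for x in examples}) for d in range(num_feats)]
    prior = {c: labels.count(c) / len(labels) for c in classes}
    cond = {}
    for c in classes:
        xs = [x for x, y in zip(examples, labels) if y == c]
        for d in range(num_feats):
            counts = Counter(x[d] for x in xs)
            n = len(xs)
            unseen = [v for v in vocab[d] if v not in counts]
            if unseen:
                probs = {v: (counts[v] - discount) / n for v in counts}
                mass = discount * len(counts) / n
                for v in unseen:
                    probs[v] = mass / len(unseen)
            else:  # every value seen: plain maximum likelihood
                probs = {v: counts[v] / n for v in counts}
            cond[(c, d)] = probs
    return classes, prior, cond

def posterior(x, classes, prior, cond):
    """P(c | x) via Bayes' rule under the independence assumption."""
    log_joint = {c: math.log(prior[c])
                 + sum(math.log(cond[(c, d)][v]) for d, v in enumerate(x))
                 for c in classes}
    top = max(log_joint.values())
    unnorm = {c: math.exp(s - top) for c, s in log_joint.items()}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}
```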
3.2 Multi Layer Perceptron • Non-linear mapping of input features • Linear transformation layers • Non-linear transfer functions • Parameter estimation • Weak CE (Regression) • Target: MT evaluation score • Minimizing a squared error loss • Strong CE (Classification) • Target: Binary correct/incorrect class • Minimizing negative log likelihood
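The two MLP training criteria on this slide differ only in the target and the loss; a minimal sketch of the two losses (function names are illustrative, not from the paper):

```python
import math

def squared_error(pred, target):
    """Weak CE / regression: the target is the continuous MT
    evaluation score, minimized with a squared-error loss."""
    return (pred - target) ** 2

def neg_log_likelihood(p_correct, c):
    """Strong CE / classification: the target is the binary
    correct/incorrect class, minimized as a negative log-likelihood."""
    return -math.log(p_correct if c == 1 else 1.0 - p_correct)
```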
3.3 Metrics for Evaluation • Strong CE metric: evaluates the probability distribution • Normalized cross entropy (NCE) • Weak CE metrics: evaluate discriminability • Classification error rate (CER) • Receiver operating characteristic (ROC)
3.3 Normalized Cross Entropy • Cross entropy (negative log-likelihood): CE = -(1/N) Σᵢ log p̂(cᵢ|xᵢ), where p̂ is the probability estimated by the CE module • Normalized cross entropy: NCE = (CE_base - CE) / CE_base, where CE_base is the cross entropy of the empirical probability of correctness obtained from the test set
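Since the slide omits the formulas, here is one standard computation of NCE, normalizing against the baseline that always outputs the empirical rate of correct examples (treat the exact normalization as an assumption; the slide only names the quantities):

```python
import math

def nce(probs, labels):
    """Normalized cross entropy: relative improvement of the model's
    cross entropy over the baseline that always predicts the empirical
    prior P(correct). 0 = no better than the prior; 1 = perfect."""
    n = len(labels)
    p_prior = sum(labels) / n
    h_base = -sum(math.log(p_prior if c else 1 - p_prior) for c in labels) / n
    h_model = -sum(math.log(p if c else 1 - p) for p, c in zip(probs, labels)) / n
    return (h_base - h_model) / h_base
```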
3.3 Classification Error Rate • CER: ratio of samples with a wrong binary (correct/incorrect) prediction • Threshold optimization • Sentence-level experiments: on the test set • Word-level experiments: on the validation set • Baseline: always predict the more frequent class
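CER and the threshold search can be sketched as follows (illustrative helper names; the paper optimizes the threshold on the sets named above):

```python
def cer(scores, labels, threshold):
    """Classification error rate: fraction of wrong accept/reject decisions."""
    wrong = sum((s >= threshold) != bool(c) for s, c in zip(scores, labels))
    return wrong / len(labels)

def best_threshold(scores, labels):
    """Pick the threshold minimizing CER; it suffices to try each
    observed score plus one value that rejects everything."""
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    return min(candidates, key=lambda t: cer(scores, labels, t))
```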
3.3 Receiver Operating Characteristic • ROC curve: correct-accept ratio vs. correct-reject ratio as the decision threshold varies • IROC: integral (area) under the ROC curve; 0.5 for a random predictor, closer to 1 is better
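The area under the ROC curve equals the probability that a randomly chosen correct example outscores a randomly chosen incorrect one, which gives a compact sketch (this rank-sum formulation is an equivalent stand-in for integrating the curve):

```python
def iroc(scores, labels):
    """IROC via the rank-sum equivalence: the probability that a random
    correct example scores higher than a random incorrect one (ties
    count half). 0.5 = random guessing, 1.0 = perfect separation."""
    pos = [s for s, c in zip(scores, labels) if c]
    neg = [s for s, c in zip(scores, labels) if not c]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```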
4 Sentence Level Experiments • MT evaluation measures • WERg: normalized word error rate • NIST: sentence-level NIST score • "Correctness" definition • Thresholding WERg • Thresholding NIST • Threshold chosen so that either 5% or 30% of the examples count as "correct"
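Choosing the threshold so that a fixed fraction of examples is "correct" amounts to taking a quantile of the score distribution; a sketch for WERg, where lower is better (for NIST, higher is better, so the comparison would flip; names are illustrative):

```python
def correctness_labels(wer_g, frac_correct):
    """Mark the lowest `frac_correct` fraction of WERg scores as
    'correct' by thresholding at the corresponding quantile."""
    k = max(1, round(frac_correct * len(wer_g)))
    threshold = sorted(wer_g)[k - 1]
    return [w <= threshold for w in wer_g]
```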
4.1 Features • Total of 91 sentence-level features • Base-model-intrinsic • Outputs of the 12 feature functions of the maximum-entropy based base system • Pruning statistics • N-best list • Rank, score ratio to the best hypothesis, etc. • Source sentence • Length, n-gram frequency statistics, etc. • Target sentence • LM scores, parenthesis matching, etc. • Source/target correspondence • IBM Model 1 probabilities, semantic similarity, etc.
4.2 MLP Experiments • MLPs are trained on all features for the four problem settings • Classification models are better than regression models • Performance is better than the baseline • (Table 2: CERs for strong CE (classification) and weak CE (regression) under the NIST and WERg settings; baseline CERs 3.21, 32.5, 5.65, 32.5; some regression entries N/A)
4.3 Feature Comparison • Compare contributions of features • Individual feature • Group of features • All: All features • Base: base model scores • BD: base-model dependent • BI: base model independent • S: apply to source sentence • T: apply to target sentence • ST: apply to source and target sentence
4.3 Feature Comparison (results) • Base ≈ All • BD > BI • T > ST > S • CE layer > no CE layer • Experimental condition: NIST 30% (Table 3, Figure 1)
5 Word Level Experiments • Definition of word correctness: a word is correct if • Pos: it occurs at exactly the same position as in the reference • WER: it is aligned to a reference word • PER: it occurs anywhere in the reference • Select the "best" reference from the multiple references • Ratio of "correct" words: Pos (15%) < WER (43%) < PER (64%)
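The three correctness definitions can be sketched against a single reference (the paper selects the best of multiple references and uses an edit-distance alignment for WER; `difflib.SequenceMatcher` is an illustrative stand-in for that alignment):

```python
from difflib import SequenceMatcher

def word_correctness(hyp, ref):
    """Per-word correctness of a hypothesis under the three measures:
    Pos - same word at the same position in the reference,
    WER - word lies in a matched block of a hyp/ref alignment,
    PER - word occurs anywhere in the reference (position ignored)."""
    pos = [i < len(ref) and w == ref[i] for i, w in enumerate(hyp)]
    wer = [False] * len(hyp)
    for blk in SequenceMatcher(a=hyp, b=ref).get_matching_blocks():
        for i in range(blk.a, blk.a + blk.size):
            wer[i] = True
    ref_words = set(ref)
    per = [w in ref_words for w in hyp]
    return pos, wer, per
```

Pos is the strictest criterion and PER the laxest, consistent with the 15% < 43% < 64% ordering of "correct" word ratios on the slide.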
5.1 Features • Total of 17 features • SMT-model based features (2) • Identity of the alignment template; whether or not the word was translated by a rule • IBM Model 1 (1) • Averaged word translation probability • Word posterior probability (WPP) and related measures (3×3) • Three variants: WPP-any, WPP-source, WPP-target • Target-language based features (3+2) • Semantic features from WordNet • Syntax check, number of occurrences in the sentence
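The WPP-any variant can be sketched as: convert N-best scores into posteriors, then sum the posterior mass of the hypotheses containing the word (the score scaling and the occurrence matching of the -source/-target variants differ; everything below is an illustrative assumption):

```python
import math

def nbest_posteriors(log_scores, scale=1.0):
    """Turn N-best model scores (log domain) into posteriors via a
    scaled softmax, shifted by the max for numerical stability."""
    top = max(log_scores)
    exp = [math.exp(scale * (s - top)) for s in log_scores]
    total = sum(exp)
    return [e / total for e in exp]

def wpp_any(word, hyps, posts):
    """WPP-any: total posterior mass of the N-best hypotheses that
    contain `word` anywhere (the laxest of the three WPP variants)."""
    return sum(p for h, p in zip(hyps, posts) if word in h)
```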
5.2 Performance of Single Features • Experimental setting • Naïve Bayes classifier • PER-based correctness • WPP-any gives the best results • WPP-any > Model 1 > WPP-source • Combining the top 3 features beats any single feature • No further gain from using ALL features (Table 4)
5.3 Comparison of Different Models • Naïve Bayes vs. MLPs with different numbers of hidden units • All features, PER-based correctness • Naïve Bayes ≈ MLP0 • Naïve Bayes < MLP5 • MLP5 ≈ MLP10 ≈ MLP20 (Figure 2)
5.4 Comparison of Word Error Measures • Experimental settings • MLP20 • All features • PER is the easiest to learn (Table 5)
6 Conclusion • A separate CE layer is useful • Features derived from the base model are better than external ones • N-best based features are valuable • Target-based features are more valuable than features that do not use the target • MLPs with hidden units are better than naïve Bayes