

  1. A Hierarchical Phrase-Based Model for Statistical Machine Translation • Author: David Chiang • Presented by Achim Ruopp • Formulas/illustrations/numbers extracted from referenced papers

  2. Outline • Phrase Order in Phrase-based Statistical MT • Using synchronous CFGs to solve the issue • Integrating the idea into an SMT system • Results • Conclusions • Future work • My Thoughts/Questions

  3. Phrase Order in Phrase-based Statistical MT • Example from [Chiang2005]:

  4. Phrase Order in Phrase-based Statistical MT • Translation of the example with a phrase-based SMT system (Pharaoh, [Koehn2004]): [Aozhou] [shi] [yu] [Bei Han] [you] [bangjiao]1 [de shaoshu guojia zhiyi] → [Australia] [is] [dipl. rels.]1 [with] [North Korea] [is] [one of the few countries] • Uses learned phrase translations • Accomplishes local phrase reordering • Fails on the overall reordering of phrases • The problem is not limited to Chinese; it also appears in Japanese (SOV order) and German (scrambling)

  5. Idea: Rules for Subphrases • Motivation: “phrases are good for learning reorderings of words, we can use them to learn reorderings of phrases as well” • Rules with “placeholders” for subphrases • <yu [1] you [2], have [2] with [1]> • Learned automatically from bitext without syntactic annotation • Formally syntax-based but not linguistically syntax-based • “the result sometimes resembles a syntactician’s grammar but often does not”
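
To make the placeholder mechanism concrete, here is a minimal sketch of applying the target side of the rule <yu [1] you [2], have [2] with [1]>: the numbered gaps are filled with already-translated subphrases, which is what lets a single rule reorder whole phrases. The data structures and the sample sub-translations are invented for illustration, not Chiang's implementation.

# Toy representation: a target pattern is a list of words and integer
# placeholders; placeholders are filled from a dict of sub-translations.
def apply_rule(target_pattern, subtranslations):
    out = []
    for token in target_pattern:
        if isinstance(token, int):        # a numbered gap like [1] or [2]
            out.extend(subtranslations[token])
        else:                             # a plain target-language word
            out.append(token)
    return out

# target side of <yu [1] you [2], have [2] with [1]>
target_pattern = ["have", 2, "with", 1]
# hypothetical sub-translations already produced for the two gaps
subtranslations = {1: ["North", "Korea"], 2: ["diplomatic", "relations"]}
print(" ".join(apply_rule(target_pattern, subtranslations)))
# -> have diplomatic relations with North Korea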

  6. Synchronous CFGs • Developed in the 1960s for programming-language compilation [Aho1969] • Separate tutorial by Chiang describing them [Chiang2005b] • In NLP, synchronous CFGs have been used for • Machine translation • Semantic interpretation

  7. Synchronous CFGs • Like CFGs, but productions have two right-hand sides • Source side • Target side • Related through linked non-terminal symbols • E.g. VP → <V[1] NP[2], NP[2] V[1]> • One-to-one correspondence • A non-terminal is always linked to a non-terminal of the same type • Productions are applied in parallel to linked non-terminals on both sides
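
The following sketch spells out what “applied in parallel” means: one production rewrites a linked pair of non-terminals on the source and target sides at the same time, so the VP rule above yields swapped order on the target side. The toy grammar and the placeholder terminals (v_src, np_src, ...) are assumptions made purely for illustration.

# Toy synchronous CFG: each production maps a non-terminal to a pair of
# right-hand sides; (name, index) tuples mark linked non-terminal slots.
grammar = {
    "VP": [([("V", 1), ("NP", 2)], [("NP", 2), ("V", 1)])],  # VP -> <V[1] NP[2], NP[2] V[1]>
    "V":  [(["v_src"], ["v_tgt"])],                          # placeholder lexical rules
    "NP": [(["np_src"], ["np_tgt"])],
}

def expand(symbol):
    """Expand one non-terminal, building source and target strings in parallel."""
    src_rhs, tgt_rhs = grammar[symbol][0]      # always take the first production (toy)
    links, src_out, tgt_out = {}, [], []
    for item in src_rhs:
        if isinstance(item, tuple):            # linked non-terminal: expand once...
            links[item[1]] = expand(item[0])
            src_out.extend(links[item[1]][0])
        else:
            src_out.append(item)
    for item in tgt_rhs:
        if isinstance(item, tuple):            # ...and reuse the SAME expansion here
            tgt_out.extend(links[item[1]][1])
        else:
            tgt_out.append(item)
    return src_out, tgt_out

src, tgt = expand("VP")
print(" ".join(src), "|||", " ".join(tgt))     # v_src np_src ||| np_tgt v_tgt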

  8. Synchronous CFGs

  9. Synchronous CFGs • Limitations • No Chomsky normal form • Has implications for complexity of decoder • Only limited closure under composition • Sister-reordering only

  10. Model • Using the log-linear model [Och2002] • Presented by Bill last week
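
For reference, the log-linear model of [Och2002] scores a translation e of a source sentence f with weighted feature functions h_i; a standard way to write it (reconstructed here, not copied from the slides) is:

  P(e \mid f) = \frac{\exp\left(\sum_i \lambda_i h_i(f, e)\right)}{\sum_{e'} \exp\left(\sum_i \lambda_i h_i(f, e')\right)}

and decoding searches for the e (here, the derivation) that maximizes the numerator.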

  11. Model – Rule Features • P(γ|α) and P(α|γ) • Lexical weights Pw(γ|α) and Pw(α|γ) • Estimate of how well the words in α translate to the words in γ (and vice versa) • Phrase penalty exp(1) • Allows the model to learn a preference for longer or shorter derivations • Exception: glue rule weights • w(S → <X[1], X[1]>) = 1 • w(S → <S[1] X[2], S[1] X[2]>) = exp(−λg) • λg controls the model’s preference for hierarchical phrases over serial phrase combination

  12. Model – Additional Features • Separated out from rule weights • Notational convenience • Conceptually cleaner (necessary for polynomial-time decoding) • Derivation D • A set of triples <r, i, j>: apply grammar rule r to rewrite a non-terminal spanning positions i to j of the source string f(D) • Ambiguous: several derivations can produce the same translation
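
Putting the pieces together, the weight of a derivation D in [Chiang2005] multiplies the rule weights with the features kept outside the rules (language model and word penalty); reconstructed from the paper, it is roughly:

  w(D) = \prod_{\langle r, i, j \rangle \in D} w(r) \;\times\; p_{\mathrm{lm}}(e)^{\lambda_{\mathrm{lm}}} \;\times\; \exp\!\left(-\lambda_{\mathrm{wp}} \, |e|\right)

where e = e(D) is the English yield of the derivation, p_lm is the language model, and λ_wp is the word-penalty weight.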

  13. Training • Training starts from a symmetrized, word-aligned corpus • Adopted from [Och2004] and [Koehn2003] • How to get from a one-directional alignment to a symmetrized alignment • How to find initial phrase pairs • An alternative would be Marcu & Wong 2002, which Ping presented [Marcu2002]
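
A minimal sketch of how initial phrase pairs can be read off a symmetrized word alignment using the usual consistency criterion (no alignment link may cross the span boundary). The example sentence, the alignment points, and the omission of unaligned-boundary-word handling are simplifications for illustration, not the exact procedure of [Och2004]/[Koehn2003].

# alignment: set of (i, j) pairs linking src[i] to tgt[j]
def extract_phrase_pairs(src, tgt, alignment, max_len=10):
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(len(src), i1 + max_len)):
            # target positions linked to the source span [i1, i2]
            tgt_points = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_points:
                continue
            j1, j2 = min(tgt_points), max(tgt_points)
            # consistency: nothing inside [j1, j2] may align outside [i1, i2]
            consistent = all(i1 <= i <= i2
                             for (i, j) in alignment if j1 <= j <= j2)
            if consistent and (j2 - j1) < max_len:
                pairs.append((src[i1:i2 + 1], tgt[j1:j2 + 1]))
    return pairs

src = ["yu", "Bei", "Han", "you", "bangjiao"]
tgt = ["have", "diplomatic", "relations", "with", "North", "Korea"]
alignment = {(0, 3), (1, 4), (2, 5), (3, 0), (4, 1), (4, 2)}
for s, t in extract_phrase_pairs(src, tgt, alignment):
    print(" ".join(s), "|||", " ".join(t))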

  14. Training

  15. Training • Unfortunately, this scheme leads to • A very large number of rules • Rules with false ambiguity • The grammar is therefore filtered to • Balance grammar size and performance • Five filter criteria, e.g. • Rules contain at most two non-terminals • Initial phrase length limited to 10
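
A small sketch of what such a filter might look like; only the two criteria named on the slide are implemented, and the toy rule format (integers standing for non-terminal gaps) is an assumption, so this is not the full five-criterion filter of [Chiang2005].

# Keep a rule only if its source side has at most two non-terminal gaps
# and the initial phrase it was extracted from is at most 10 words long.
def keep_rule(source_side, initial_phrase_len, max_gaps=2, max_initial_len=10):
    gaps = sum(1 for tok in source_side if isinstance(tok, int))
    return gaps <= max_gaps and initial_phrase_len <= max_initial_len

print(keep_rule(["yu", 1, "you", 2], initial_phrase_len=6))   # True
print(keep_rule([1, 2, 3], initial_phrase_len=12))            # False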

  16. Decoding • Our good old friend - the CKY parser • Enhanced with • Beam search • Postprocessor to map French derivations to English derivations
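
To show the decoding machinery in miniature, here is a stripped-down CKY chart with per-cell beam pruning. The real decoder of [Chiang2005] parses the source side of the synchronous grammar and carries target-side derivations plus language-model state; that is omitted here, so treat this only as an illustration of the chart-plus-beam idea.

from collections import defaultdict
import math

def cky_beam(words, lexical, binary, beam_size=5):
    """lexical: word -> [(NT, logprob)]; binary: (B, C) -> [(A, logprob)]."""
    n = len(words)
    chart = defaultdict(list)                      # (i, j) -> [(NT, score)]
    for i, w in enumerate(words):
        chart[(i, i + 1)] = lexical.get(w, [])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = {}                              # best score per non-terminal
            for k in range(i + 1, j):              # try every split point
                for B, sb in chart[(i, k)]:
                    for C, sc in chart[(k, j)]:
                        for A, sr in binary.get((B, C), []):
                            score = sb + sc + sr
                            if score > cell.get(A, -math.inf):
                                cell[A] = score
            # beam: keep only the top-scoring items for this cell
            chart[(i, j)] = sorted(cell.items(), key=lambda kv: -kv[1])[:beam_size]
    return chart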

  17. Results • Baseline • Pharaoh [Koehn2003], [Koehn2004] • Minimum error rate training on the BLEU measure • Hierarchical model • 2.2 million rules after filtering down from 24 million • 7.5% relative improvement • Additional constituent feature • Additional feature favoring syntactic parses • Trained on 250k sentences of the Penn Chinese Treebank • Improved accuracy only on the development set

  18. Learned Feature Weights • Word = word penalty • Phr = phrase penalty (pp) • λg penalizes glue rules much less than λpp does regular rules • i.e. “This suggests that the model will prefer serial combination of phrases, unless some other factor supports the use of hierarchical phrases”

  19. Conclusions • Hierarchical phrase pairs can be learned from data without syntactic annotation • Hierarchical phrase pairs improve translation accuracy significantly • Added syntactic information (the constituent feature) did not provide a statistically significant gain

  20. Future Work • Move to a more syntactically motivated grammar • Reduce grammar size to allow more aggressive training settings

  21. My Thoughts/Questions • Really interesting approach to bringing “syntactic” information into SMT • The example sentence was not translated correctly • Missing words are problematic • Can phrase reordering also be learned by lexicalized phrase-reordering models [Och2004]? • Why did the constituent feature improve accuracy only on the development set, but not on the test set? • Does data sparseness influence the learned feature weights? • What syntactic features are already built into Pharaoh?

  22. References • [Aho1969]: Aho, A. V. and J. D. Ullman. 1969. Syntax directed translations and the pushdown assembler. Journal of Computer and System Sciences, 3:37–56. • [Chiang2005]: Chiang, David. 2005. A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proceedings of ACL 2005, pages 263–270. • [Chiang2005b]: http://www.umiacs.umd.edu/~resnik/ling645_fa2005/notes/synchcfg.pdf • [Koehn2003]: Koehn, Philipp. 2003. Noun Phrase Translation. Ph.D. thesis, University of Southern California. • [Koehn2004]: Koehn, Philipp. 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Proceedings of the Sixth Conference of the Association for Machine Translation in the Americas, pages 115–124. • [Marcu2002]: Marcu, Daniel and William Wong. 2002. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 133–139. • [Och2002]: Och, Franz Josef and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the ACL, pages 295–302. • [Och2004]: Och, Franz Josef and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30:417–449.
