
Learning Transfer Rules for Machine Translation with Limited Data


Presentation Transcript


  1. Learning Transfer Rules for Machine Translation with Limited Data
Thesis Defense
Katharina Probst
Committee: Alon Lavie (Chair), Jaime Carbonell, Lori Levin, Bonnie Dorr (University of Maryland)

  2. Introduction (I)
• Why has Machine Translation been applied to only a few language pairs?
• Bilingual corpora are available for only a few language pairs (English-French, Japanese-English, etc.)
• Natural Language Processing tools are available for only a few languages (English, German, Spanish, Japanese, etc.)
• Scaling to other languages is often difficult, time-consuming, and knowledge-intensive
• What can we do to change this?

  3. Introduction (II)
• This thesis presents a framework for automatic inference of transfer rules
• Transfer rules capture syntactic and morphological mappings between languages
• Learned from a small, word-aligned training corpus
• Rules are learned for unbalanced language pairs, where more data and tools are available for one language (L1) than for the other (L2)

  4. Training Data Example
SL: the widespread interest in the election
TL: h &niin h rxb b h bxirwt [gloss: the interest the widespread in the election]
Alignment: ((1,1),(1,3),(2,4),(3,2),(4,5),(5,6),(6,7))
Type: NP
Parse: (<NP> (DET the-1) (ADJ widespread-2) (N interest-3) (<PP> (PREP in-4) (<NP> (DET the-5) (N election-6))))
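To make the example's pieces concrete, here is a minimal sketch of how such a word-aligned training example could be held in code; the class and field names are illustrative, not from the thesis:

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    """One word-aligned training pair, mirroring the slide above.

    Alignment pairs are 1-based (SL position, TL position); one SL word
    may align to several TL words, e.g. 'the' -> TL positions 1 and 3.
    """
    sl: list[str]                      # source-language (L1) words
    tl: list[str]                      # target-language (L2) words
    alignment: list[tuple[int, int]]   # word alignment
    ctype: str                         # constituent type of the example
    parse: str                         # L1 parse as an s-expression string

example = TrainingExample(
    sl="the widespread interest in the election".split(),
    tl="h &niin h rxb b h bxirwt".split(),
    alignment=[(1, 1), (1, 3), (2, 4), (3, 2), (4, 5), (5, 6), (6, 7)],
    ctype="NP",
    parse="(<NP> (DET the-1) (ADJ widespread-2) (N interest-3) "
          "(<PP> (PREP in-4) (<NP> (DET the-5) (N election-6))))",
)
```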

  5. Transfer Rule Formalism
;;L2: h &niin h rxb b h bxirwt
;;L1: the widespread interest in the election
NP::NP [“h” N “h” Adj PP] -> [“the” Adj N PP]
((X1::Y1) (X2::Y3) (X3::Y1) (X4::Y2) (X5::Y4)
((Y3 num) = (X2 num))
((X2 num) = sg)
((X2 gen) = m))
Annotated parts: training example (the comment lines), rule type (NP::NP), component sequences (the bracketed L2 and L1 sequences), component alignments (Xi::Yj), agreement constraint ((Y3 num) = (X2 num)), value constraints ((X2 num) = sg, (X2 gen) = m)
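The same formalism can be captured in a small data structure; again a sketch with illustrative names, where constraints are kept as strings for readability:

```python
from dataclasses import dataclass, field

@dataclass
class TransferRule:
    """A transfer rule: context-free backbone plus unification constraints."""
    rule_type: tuple[str, str]           # (L2 type, L1 type), e.g. ("NP", "NP")
    l2_sequence: list[str]               # L2 component sequence
    l1_sequence: list[str]               # L1 component sequence
    alignments: list[tuple[int, int]]    # component alignments (Xi, Yj)
    constraints: list[str] = field(default_factory=list)

rule = TransferRule(
    rule_type=("NP", "NP"),
    l2_sequence=['"h"', "N", '"h"', "Adj", "PP"],
    l1_sequence=['"the"', "Adj", "N", "PP"],
    alignments=[(1, 1), (2, 3), (3, 1), (4, 2), (5, 4)],
    constraints=[
        "((Y3 num) = (X2 num))",  # agreement constraint
        "((X2 num) = sg)",        # value constraint
        "((X2 gen) = m)",         # value constraint
    ],
)
```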

  6. Research Goals (I)
• Develop a framework for learning transfer rules from bilingual data
• Training corpus: a set of sentences/phrases in one language with translations into the other language (= bilingual corpus), word-aligned
• Rules include a) a context-free backbone and b) unification constraints
• Improve the grammaticality of MT output with automatically learned rules
• Learned rules improve translation quality in the run-time system

  7. Research Goals (II)
• Learn rules in the absence of a parser for one of the languages
• Infer syntactic knowledge about the minor language using a) projection from the major language, b) analysis of word alignments, c) morphology information, and d) a bilingual dictionary
• Combine a set of different knowledge sources in a meaningful way
• Resources (parser, morphology modules, dictionary, etc.) often disagree
• Combine conflicting knowledge sources

  8. Research Goals (III)
• Address limited-data scenarios with 'frugal' techniques
• "Unbalanced" language pairs with little or no bilingual data
• Training corpus is small (~120 sentences and phrases), but carefully designed
• Push MT research in the direction of incorporating syntax into statistics-based systems
• Infer highly involved linguistic information; incorporate it with a statistical decoder in a hybrid system

  9. Thesis Statement (I)
• Given bilingual, word-aligned data, and given a parser for one of the languages in the translation pair, we can learn a set of syntactic transfer rules for MT.
• The rules consist of a context-free backbone and unification constraints, learned in two separate stages.
• The resulting rules form a syntactic translation grammar for the language pair and are used in a statistical transfer system to translate unseen examples.

  10. Thesis Statement (II)
• The translation quality of a run-time system that uses the learned rules is
• superior to a system that does not use the learned rules
• comparable to the performance of a small manual grammar written by an expert, on Hebrew→English and Hindi→English translation tasks
• The thesis presents a new approach to learning transfer rules for Machine Translation: the system learns syntactic models from text in a novel way and in a rich hypothesis space, aiming to emulate a human grammar writer.

  11. Talk Overview
• Setting the Stage: related work, system overview, training data
• Rule Learning
• Step 1: Seed Generation
• Step 2: Compositionality
• Step 3: Unification Constraints
• Experimental Results
• Conclusion

  12. Related Work: MT Overview
[Diagram: depth of analysis between source language and target language, from shallow to deep: analyze sequence (statistical MT, EBMT), analyze structure (syntax-based MT), analyze meaning (semantics-based MT)]

  13. Related Work (I)
• Traditional transfer-based MT: analysis, transfer, generation (Hutchins and Somers 1992, Senellart et al. 2001)
• Data-driven MT:
• EBMT: store a database of examples, possibly generalized (Sato and Nagao 1990, Brown 1997)
• SMT: usually a noisy-channel model: translation model + target language model (Vogel et al. 2003, Och and Ney 2002, Brown 2004)
• Hybrid (Knight et al. 1995, Habash and Dorr 2002)

  14. Related Work (II)
• Structure/syntax for MT
• EBMT (Alshawi et al. 2000, Watanabe et al. 2002)
• SMT (Yamada and Knight 2001, Wu 1997)
• Other approaches (Habash and Dorr 2002, Menezes and Richardson 2001)
• Learning from elicited data / small datasets (Nirenburg 1998, McShane et al. 2003, Jones and Havrilla 1998)

  15. Training Data Example
SL: the widespread interest in the election
TL: h &niin h rxb b h bxirwt [gloss: the interest the widespread in the election]
Alignment: ((1,1),(1,3),(2,4),(3,2),(4,5),(5,6),(6,7))
Type: NP
Parse: (<NP> (DET the-1) (ADJ widespread-2) (N interest-3) (<PP> (PREP in-4) (<NP> (DET the-5) (N election-6))))

  16. Transfer Rule Formalism
;;L2: h &niin h rxb b h bxirwt
;;[the interest the widespread in the election]
;;L1: the widespread interest in the election
NP::NP [“h” N “h” Adj PP] -> [“the” Adj N PP]
((X1::Y1) (X2::Y3) (X3::Y1) (X4::Y2) (X5::Y4)
((Y3 num) = (X2 num))
((X2 num) = sg)
((X2 gen) = m))
Annotated parts: training example (the comment lines), rule type (NP::NP), component sequences (the bracketed L2 and L1 sequences), component alignments (Xi::Yj), agreement constraint ((Y3 num) = (X2 num)), value constraints ((X2 num) = sg, (X2 gen) = m)

  17. Training Data Collection
• Elicitation Corpora
• Generally designed to cover major linguistic phenomena
• A bilingual user translates and word-aligns
• Structural Elicitation Corpus
• Designed to cover a wide variety of structural phenomena (Probst and Lavie 2004)
• 120 sentences and phrases
• Targets specific constituent types: AdvP, AdjP, NP, PP, SBAR, S, with subtypes
• Translated into Hebrew, Hindi

  18. Resources
• L1 parses: either from a statistical parser (Charniak 1999), or use data from the Penn Treebank
• L1 morphology: can be obtained or created (I created one for English)
• L1 language model: trained on a large amount of monolingual data
• L2 morphology: if available, use a morphology module; if not, use automated techniques such as (Goldsmith 2001) or (Probst 2003)
• Bilingual lexicon: gives word-level correspondences; created from the training data or previously existing

  19. Development and Testing Environment
• Syntactic transfer engine: takes the rules and lexicon and produces all possible partial translations
• Statistical decoder: uses word-to-word probabilities and a TL language model to extract the best combination of partial translations (Vogel et al. 2003)

  20. System Overview
[Diagram] Training time: bilingual training data + L1 parses & morphology → Rule Learner → learned rules.
Run time: L2 test data + learned rules + L2 morphology + bilingual lexicon → Transfer Engine → lattice → Statistical Decoder (with L1 language model) → final translation.
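Read as code, the run-time half of the diagram is a short pipeline. A minimal sketch, assuming the engine and decoder are passed in as callables (hypothetical stand-ins for the actual components):

```python
from typing import Callable

def translate(
    l2_sentence: str,
    transfer_engine: Callable[[str], list[str]],  # learned rules, L2 morphology,
                                                  # and bilingual lexicon baked in
    decode: Callable[[list[str]], str],           # statistical decoder with the
                                                  # L1 language model baked in
) -> str:
    """Run-time dataflow from the diagram: the transfer engine turns the
    L2 input into a lattice of partial translations, and the decoder
    extracts the best final translation from that lattice."""
    lattice = transfer_engine(l2_sentence)
    return decode(lattice)
```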

  21. Overview of Learning Phases
• Seed Generation: create initial guesses at rules based on specific training examples
• Compositionality: add context-free structure to rules, so that rules can combine
• Constraint Learning: learn appropriate unification constraints

  22. Seed Generation
• "Training example in rule format"
• Produce rules that closely reflect the training examples
• But: generalize to the POS level when words are 1-1 aligned
• Rules are fully functional but offer little generalization
• Seed rules are intended as input for the later two learning phases

  23. Seed Generation – Sample Learned Rule
;;L2: TKNIT H @IPWL H HTNDBWTIT
;;[plan the care the voluntary]
;;L1: THE VOLUNTARY CARE PLAN
;;C-Structure: (<NP> (DET the-1) (<ADJP> (ADJ voluntary-2)) (N care-3) (N plan-4))
NP::NP [N "H" N "H" ADJ] -> ["THE" ADJ N N]
((X1::Y4) (X3::Y3) (X5::Y2))

  24. Seed Generation Algorithm
• For a given training example, produce a seed rule (a sketch follows below)
• For all 1-1 aligned words, enter the POS tag (e.g. "N") into the component sequences
• Get POS tags from the morphology module and the parse
• Hypothesis: on unseen data, any word of this POS can fill this slot
• For all words that are not 1-1 aligned, put the actual words into the component sequences
• The L2 and L1 types are the parse's root label
• Derive the alignments from the training example
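A condensed sketch of this algorithm, reusing the TrainingExample and TransferRule classes from the earlier sketches; the 1-1 test and the POS lookups are simplified relative to the thesis:

```python
from collections import Counter

def seed_rule(ex: TrainingExample, sl_pos: list[str], tl_pos: list[str],
              root_label: str) -> TransferRule:
    """Produce a seed rule from one training example.

    sl_pos / tl_pos give a POS tag per word (from the L1 parse and the
    morphology modules). Words that are 1-1 aligned are generalized to
    their POS tag; all other words are kept as literals.
    """
    sl_deg = Counter(i for i, _ in ex.alignment)   # alignment degree per word
    tl_deg = Counter(j for _, j in ex.alignment)
    one2one = {(i, j) for i, j in ex.alignment
               if sl_deg[i] == 1 and tl_deg[j] == 1}
    sl_1to1 = {i for i, _ in one2one}
    tl_1to1 = {j for _, j in one2one}

    l1_seq = [sl_pos[i - 1] if i in sl_1to1 else f'"{w}"'
              for i, w in enumerate(ex.sl, start=1)]
    l2_seq = [tl_pos[j - 1] if j in tl_1to1 else f'"{w}"'
              for j, w in enumerate(ex.tl, start=1)]

    # Both types are the parse's root label; component alignments are
    # read off the word alignment (X = L2 position, Y = L1 position).
    return TransferRule(
        rule_type=(root_label, root_label),
        l2_sequence=l2_seq,
        l1_sequence=l1_seq,
        alignments=sorted((j, i) for i, j in ex.alignment),
    )
```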

  25. Compositionality
• Generalize seed rules to reflect structure
• Infer a partial constituent grammar for L2
• Rules map a mixture of:
• Lexical items (LIT)
• Parts of speech (PT)
• Constituents (NT)
• Analyze the L1 parse to find generalizations
• Produced rules are context-free

  26. Compositionality – Example
;;L2: $ BTWK H M&@PH HIH $M
;;[that inside the envelope was name]
;;L1: THAT INSIDE THE ENVELOPE WAS A NAME
;;C-Structure: (<SBAR> (SUBORD that-1) (<SINV> (<PP> (PREP inside-2) (<NP> (DET the-3) (N envelope-4))) (<VP> (V was-5)) (<NP> (DET a-6) (N name-7))))
SBAR::SBAR [SUBORD PP V NP] -> [SUBORD PP V NP]
((X1::Y1) (X2::Y2) (X3::Y3) (X4::Y4))

  27. Basic Compositionality Algorithm
• Traverse the parse tree in order to partition the sentence (a sketch follows below)
• For each subtree, if there is a previously learned rule that can account for the subtree and its translation, introduce a compositional element
• Compositional element: the subtree's root label, for both L1 and L2
• Adjust the alignments
• Note: preference for maximum generalization, because the tree is traversed from the top
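In outline, the traversal might look like the following; word_span, aligned_span, accounts_for, and replace_span are hypothetical helpers standing in for machinery the thesis describes:

```python
def compositionalize(example, node, learned_rules, rule):
    """Basic compositionality pass over an L1 parse (sketch).

    Walks the parse top-down. If some previously learned rule of the
    subtree's type accounts for the subtree and its aligned L2 words,
    the covered positions in both component sequences are collapsed
    into one nonterminal (the subtree's root label) and the component
    alignments are renumbered; otherwise recurse into the subtree.
    Starting at the root's children means larger subtrees are tried
    first, which gives the preference for maximum generalization.
    """
    for subtree in node.children:
        l1_span = subtree.word_span()             # L1 positions covered
        l2_span = aligned_span(example, l1_span)  # their L2 counterpart
        if any(r.rule_type == (subtree.label, subtree.label)
               and r.accounts_for(example, l1_span, l2_span)
               for r in learned_rules):
            rule.replace_span(l1_span, l2_span, subtree.label)
        else:
            compositionalize(example, subtree, learned_rules, rule)
    return rule
```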

  28. Maximum Compositionality
• Assume that lower-level rules exist
• The assumption is correct if the training data is completely compositional
• Introduce compositional elements for the direct children of the parse root node
• Results in a higher level of compositionality, and thus higher generalization power
• Can overgeneralize, but because of the strong decoder this is generally preferable

  29. Other Advanced Compositionality Techniques
• Techniques that generalize words that are not 1-1 aligned to the POS level
• Techniques that enhance the dictionary based on the training data
• Techniques that deal with noun compounds
• Rule filters that ensure no learned rule violates the axioms

  30. Constraint Learning
• Annotate context-free compositional rules with unification constraints that
a) limit the applicability of rules to certain contexts (thereby limiting parsing ambiguity)
b) ensure the passing of a feature value from source to target language (thereby limiting transfer ambiguity)
c) disallow certain target-language outputs (thereby limiting generation ambiguity)
• Value constraints and agreement constraints are learned separately

  31. Constraint Learning – Overview
• Introduce basic constraints: use the morphology module(s) and parses to introduce constraints for the words in a training example
• Create agreement constraints (where appropriate) by merging basic constraints
• Retain appropriate value constraints: they help restrict a rule to certain contexts or restrict the output

  32. Constraint Learning – Agreement Constraints (I)
• For example: in an NP, do the adjective and the noun agree in number?
• In Hebrew, 'the good boys':
• Correct: H ILDIM @WBIM (the.det.def boy.pl.m good.pl.m) "the good boys"
• Incorrect: H ILDIM @WB (the.det.def boy.pl.m good.sg.m) "the good boys"

  33. Constraint Learning – Agreement Constraints (II)
• E.g., number on a determiner and the corresponding noun
• Use a likelihood-ratio test to determine which value constraints can be merged into agreement constraints
• The log-likelihood ratio is defined by proposing distributions that could have given rise to the data:
• Null hypothesis: the values are independently distributed
• Alternative hypothesis: the values are not independently distributed
• For sparse data, use a heuristic test: check whether there is more evidence for than against the agreement constraint

  34. Constraint Learning – Agreement Constraints (III)
• Collect all instances in the training data where an adjective and a noun mark for number
• Count how often the feature value is the same and how often it is different
• The feature values are distributed by
• two multinomial distributions (if they are independent, i.e. the null hypothesis)
• one multinomial distribution (if they should agree, i.e. the alternative hypothesis)
• Compute the log-likelihood under each scenario and perform the LL-ratio or heuristic test (a sketch follows below)
• Generalize to the cross-lingual case
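A toy version of the test, under loudly stated assumptions: per-pair feature values are given directly, and disagreeing pairs under the agreement model receive a small fixed noise probability so the ratio stays finite (the thesis's exact model differs):

```python
import math
from collections import Counter

def agreement_llr(pairs: list[tuple[str, str]], eps: float = 1e-6) -> float:
    """Log-likelihood ratio for merging two value constraints into an
    agreement constraint.

    pairs: observed feature values for the two slots across the training
    data, e.g. [("pl", "pl"), ("sg", "sg"), ("pl", "sg"), ...] for
    number on an adjective and its noun. Positive values favor the
    agreement (one-multinomial) model over independence.
    """
    n = len(pairs)

    def ll(counts: Counter) -> float:
        # Log-likelihood of the counts under their own MLE multinomial.
        return sum(c * math.log(c / n) for c in counts.values())

    # Null hypothesis: two independent multinomials, one per slot.
    ll_indep = ll(Counter(a for a, _ in pairs)) + ll(Counter(b for _, b in pairs))

    # Alternative hypothesis: one multinomial over a shared value;
    # pairs that disagree are treated as noise with probability eps.
    agree = Counter(a for a, b in pairs if a == b)
    n_disagree = n - sum(agree.values())
    ll_agree = ll(agree) + n_disagree * math.log(eps)

    return ll_agree - ll_indep
```

For sparse data, the slide's heuristic amounts to checking whether agreeing pairs outnumber disagreeing ones, e.g. `sum(a == b for a, b in pairs) > len(pairs) / 2`.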

  35. Constraint Learning – Value Constraints
;;L2: ild @wb
;;[boy good]
;;L1: a good boy
NP::NP [N ADJ] -> [“A” ADJ N]
(... ((X1 NUM) = SG) ((X2 NUM) = SG) ...)

;;L2: ildim t@wbim
;;[boys good]
;;L1: good boys
NP::NP [N ADJ] -> [ADJ N]
(... ((X1 NUM) = PL) ((X2 NUM) = PL) ...)

Retain the value constraints to distinguish the two rules.

  36. Constraint Learning – Value Constraints
• Retain those value constraints that determine the structure of the L2 sentence's translation
• If two rules have
• the same L2 component sequence,
• different L1 component sequences,
• and differ in only a value constraint,
• then retain that value constraint to distinguish them (a sketch follows below)
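A minimal sketch of that retention check over the TransferRule objects from the earlier sketch; comparing constraint sets by symmetric difference is my simplification:

```python
from itertools import combinations

def distinguishing_value_constraints(rules: list[TransferRule]) -> set[str]:
    """Collect value constraints that are worth retaining.

    Whenever two rules share the L2 component sequence but map to
    different L1 component sequences, the value constraints on which
    they differ (SG vs. PL in the slide's example) are what select the
    correct output structure, so those are kept.
    """
    keep: set[str] = set()
    for r1, r2 in combinations(rules, 2):
        if (r1.l2_sequence == r2.l2_sequence
                and r1.l1_sequence != r2.l1_sequence):
            # Constraints present in one rule but not the other.
            keep |= set(r1.constraints) ^ set(r2.constraints)
    return keep
```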

  37. Constraint Learning – Sample Learned Rule
;;L2: ANI AIN@LIGN@I
;;[I intelligent]
;;L1: I AM INTELLIGENT
S::S [NP ADJP] -> [NP “AM” ADJP]
((X1::Y1) (X2::Y3)
((X1 NUM) = (X2 NUM))
((Y1 NUM) = (X1 NUM))
((Y1 PER) = (X1 PER))
(Y0 = Y2))

  38. Dimensions of Evaluation
• Learning phases / settings: default, Seed Generation only, Compositionality, Constraint Learning
• Evaluation: rule-based evaluation + pruning
• Test corpora: TestSet, TestSuite
• Run-time settings: length limit
• Portability: Hindi→English translation

  39. Test Corpora
• Test Corpus: newspaper text (Haaretz): 65 sentences, 1 reference translation
• Test Suite: specific phenomena: 138 sentences, 1 reference translation
• Hindi: 245 sentences, 4 reference translations
• Compare: statistical system only, system with manually written grammar, system with learned grammar
• Manually written grammar: written by an expert within about a month (both Hebrew and Hindi)

  40. Test Corpus Evaluation, Default Settings (I)

  41. Test Corpus Evaluation, Default Settings (II)
The learned grammar performs statistically significantly better than the baseline:
• Performed a one-tailed paired t-test
• BLEU with resampling: t-value 81.98, p-value 0 (df = 999) → significant at the 100% confidence level
• Median of differences: -0.0217, with 95% confidence interval [-0.0383, -0.0056]
• METEOR: t-value 1.73, p-value 0.044 (df = 61) → significant at higher than the 95% confidence level
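For the resampling, something like the following paired bootstrap is a plausible reconstruction; it assumes per-sentence metric scores as input, which is a simplification (BLEU is corpus-level), and the 1000 resamples match the df = 999 on the slide:

```python
import random
import statistics

def paired_bootstrap(sys_a: list[float], sys_b: list[float],
                     samples: int = 1000, seed: int = 0):
    """Resample the test set with replacement and record the score
    difference of the two systems on each resample; returns the median
    difference and a 95% confidence interval, the two quantities
    reported on the slide."""
    assert len(sys_a) == len(sys_b)
    rng = random.Random(seed)
    n = len(sys_a)
    diffs = []
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]  # one resampled test set
        diffs.append(statistics.fmean(sys_a[i] for i in idx)
                     - statistics.fmean(sys_b[i] for i in idx))
    diffs.sort()
    return (statistics.median(diffs),
            (diffs[round(0.025 * samples)], diffs[round(0.975 * samples) - 1]))
```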

  42. Test Corpus Evaluation, Default Settings (III)

  43. Test Corpus Evaluation, Different Settings (I)

  44. Test Corpus Evaluation, Different Settings (II)
System times (in seconds) and lattice sizes:
→ approximately 20% reduction in lattice size

  45. Evaluation with Rule Scoring (I)
• Estimate the translation power of the rules
• Use the training data: most training examples are actually unseen data for a given rule
• Match each arc against the reference translation
• A rule's score is the average of all its arcs' scores (a sketch follows below)
• Order the rules by precision score, then prune
• Goal of rule scoring: limit run-time
• Note the trade-off with decoder power
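A compact sketch of the scoring-and-pruning step; the input pairing of rules with their arc scores and the keep fraction are illustrative choices, not values from the thesis:

```python
def score_and_prune(rule_arc_scores, keep_fraction=0.5):
    """Score rules and prune the weakest to limit run-time.

    rule_arc_scores: list of (rule, arc_scores) pairs, where arc_scores
    are the match scores of the rule's lattice arcs against the
    reference translations. A rule's score is the average of its arcs'
    scores; rules are ordered by that precision score and only the top
    keep_fraction are kept.
    """
    scored = [(sum(scores) / len(scores), rule)
              for rule, scores in rule_arc_scores if scores]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, int(len(scored) * keep_fraction))
    return [rule for _, rule in scored[:cutoff]]
```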

  46. Evaluation with Rule Scoring (II)

  47. Evaluation with Rule Scoring (III)

  48. Test Suite Evaluation (I)
• Test suite designed to target specific constructions:
• conjunctions of PPs
• adverb phrases
• reordering of adjectives and nouns
• AdjP embedded in NP
• possessives
• …
• Designed in English, translated into Hebrew
• 138 sentences, one reference translation

  49. Test Suite Evaluation (II)

  50. Test Suite Evaluation (III)
The learned grammar performs statistically significantly better than the baseline:
• Performed a one-tailed paired t-test
• BLEU with resampling: t-value 122.53, p-value 0 (df = 999) → statistically significantly better at the 100% confidence level
• Median of differences: -0.0462, with 95% confidence interval [-0.0721, -0.0245]
• METEOR: t-value 47.20, p-value 0.0 (df = 137) → statistically significantly better at the 100% confidence level
