
(Statistical) Approaches to Word Alignment

Explore word alignment models for machine translation, learn how to translate words and phrases, and model associations between languages.





Presentation Transcript


  1. (Statistical) Approaches to Word Alignment. 11-734 Advanced Machine Translation Seminar, Sanjika Hewavitharana, Language Technologies Institute, Carnegie Mellon University, 02/02/2006

  2. Word Alignment Models • We want to learn how to translate words and phrases • Can learn this from parallel corpora • Typically work with sentence-aligned corpora • Available from LDC, etc.; for specific applications new data collection is required • Model the associations between the two languages • Word-to-word mapping -> lexicon • Differences in word order -> distortion model • ‘Wordiness’, i.e. how many words are needed to express a concept -> fertility • Statistical translation is based on word alignment models

  3. Alignment Example (figure omitted) • Observations: • Often 1-to-1 • Often monotone • Some 1-to-many • Some 1-to-nothing

  4. Word Alignment Models • IBM1 – lexical probabilities only • IBM2 – lexicon plus absolute position • IBM3 – plus fertilities • IBM4 – inverted relative position alignment • IBM5 – non-deficient version of Model 4 • HMM – lexicon plus relative position • BiBr – bilingual bracketing, lexical probabilities plus reordering via parallel segmentation • Syntactic alignment models [Brown et al. 1993, Vogel et al. 1996, Och et al. 1999, Wu 1997, Yamada et al. 2003]

  5. Notation • Source language • $f$ : source (French) word • $J$ : length of the source sentence • $j$ : position in the source sentence; $j = 1, 2, \ldots, J$ • $f_1^J = f_1 \ldots f_J$ : source sentence • Target language • $e$ : target (English) word • $I$ : length of the target sentence • $i$ : position in the target sentence; $i = 1, 2, \ldots, I$ • $e_1^I = e_1 \ldots e_I$ : target sentence

  6. SMT – Principle • Translate a ‘French’ string $f_1^J$ into an ‘English’ string $e_1^I$ • Bayes’ decision rule for translation: $\hat{e}_1^I = \arg\max_{e_1^I} \Pr(e_1^I)\,\Pr(f_1^J \mid e_1^I)$ • Based on the noisy channel model • We will call $f$ the source and $e$ the target

  7. Alignment as Hidden Variable • ‘Hidden’ alignments to capture word-to-word correspondences • Number of connections: $J \cdot I$ (each source word with each target word) • Number of alignments: $2^{J \cdot I}$ • Restricted alignment • Each source word has exactly one connection – the alignment is a function • $a_j = i$: source position $j$ is connected to target position $i$ of $e_i$ • Number of alignments is now: $I^J$ • $a_1^J = a_1 \ldots a_J$ : whole alignment • Relationship between translation model and alignment model: $\Pr(f_1^J \mid e_1^I) = \sum_{a_1^J} \Pr(f_1^J, a_1^J \mid e_1^I)$
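Both counts are easy to verify by brute force for toy sentence lengths. A minimal sketch (the values of `I` and `J` are illustrative):

```python
from itertools import product

I, J = 3, 2  # illustrative toy sentence lengths

# Restricted case: each source position j picks exactly one target
# position, so an alignment is a function {1..J} -> {1..I}.
functions = list(product(range(1, I + 1), repeat=J))
assert len(functions) == I ** J  # 3^2 = 9

# Unrestricted case: any subset of the J*I possible connections
# is an alignment, giving 2^(J*I) possibilities.
print(len(functions), 2 ** (J * I))  # 9 64
```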

  8. Empty Position (Null Word) • Sometimes a word has no correspondence • The alignment function aligns each source word to one target word, i.e. it cannot skip a source word • Solution: • Introduce the empty position 0 with the null word $e_0$ • ‘Skip’ source word $f_j$ by aligning it to $e_0$ • The target sentence is extended to: $e_0^I = e_0, e_1, \ldots, e_I$ • The alignment is extended to: $a_j \in \{0, 1, \ldots, I\}$

  9. Translation Model • Sum over all possible alignments: $\Pr(f_1^J \mid e_1^I) = \sum_{a_1^J} \Pr(f_1^J, a_1^J \mid e_1^I)$ • 3 probability distributions: • Length: $\Pr(J \mid e_1^I)$ • Alignment: $\Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, J, e_1^I)$ • Lexicon: $\Pr(f_j \mid f_1^{j-1}, a_1^{j}, J, e_1^I)$

  10. Model Assumptions • Decompose the interaction into pairwise dependencies • Length: source length dependent only on target length (very weak) • Alignment: • Zero-order model: target position dependent only on source position • First-order model: target position dependent only on previous target position • Lexicon: source word dependent only on the aligned target word

  11. IBM Model 1 • Length: source length dependent only on target length: $\Pr(J \mid e_1^I) = \epsilon$ • Alignment: uniform probability for the position alignment: $\Pr(a_j = i) = \frac{1}{I+1}$ • Lexicon: source word dependent only on the aligned word: $t(f_j \mid e_{a_j})$ • Alignment probability: $\Pr(f_1^J \mid e_1^I) = \frac{\epsilon}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} t(f_j \mid e_i)$

  12. IBM Model 1 – Generative Process • To generate a French string $f_1^J$ from an English string $e_1^I$: • Step 1: Pick the length $J$ of $f_1^J$ • All lengths are equally probable; $\Pr(J \mid e_1^I) = \epsilon$ is a constant • Step 2: Pick an alignment $a_1^J$ with probability $\frac{1}{(I+1)^J}$ • Step 3: Pick the French words with probability $\Pr(f_1^J \mid a_1^J, e_1^I) = \prod_{j=1}^{J} t(f_j \mid e_{a_j})$ • Final result: $\Pr(f_1^J, a_1^J \mid e_1^I) = \frac{\epsilon}{(I+1)^J} \prod_{j=1}^{J} t(f_j \mid e_{a_j})$
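Because the sum over alignments factorizes over source positions, the Model 1 translation probability can be evaluated directly from this final formula. A minimal sketch (the lexicon table `t` and the constant `epsilon` are illustrative assumptions):

```python
def ibm1_translation_prob(f_words, e_words, t, epsilon=1.0):
    """P(f|e) under IBM Model 1: eps / (I+1)^J * prod_j sum_i t(f_j|e_i)."""
    e_ext = ["NULL"] + e_words           # empty word e_0 catches unaligned words
    I, J = len(e_words), len(f_words)
    prob = epsilon / (I + 1) ** J
    for f in f_words:
        prob *= sum(t.get((f, e), 0.0) for e in e_ext)
    return prob
```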

  13. IBM Model 1 – Training • Parameters of the model: lexicon probabilities $t(f \mid e)$ • Training data: parallel sentence pairs • We adjust the parameters to maximize the likelihood of the training data • Normalization for each $e$: $\sum_f t(f \mid e) = 1$ • EM algorithm used for the estimation • Initialize the parameters uniformly • Collect counts for each pair in the corpus • Re-estimate the parameters using the counts • Repeat for several iterations • Model simple enough to sum over all alignments • The parameters do not depend on the initial values (the Model 1 likelihood has a single maximum)

  14. IBM Model 1 Training – Pseudo Code

    # Accumulation (over corpus)
    For each sentence pair
        For each source position j
            Sum = 0.0
            For each target position i
                Sum += p(fj|ei)
            For each target position i
                Count(fj,ei) += p(fj|ei)/Sum

    # Re-estimate probabilities (over count table)
    For each target word e
        Sum = 0.0
        For each source word f
            Sum += Count(f,e)
        For each source word f
            p(f|e) = Count(f,e)/Sum

    # Repeat for several iterations
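The pseudocode above translates almost line for line into Python. A self-contained sketch of the EM loop (uniform initialization; the toy corpus and variable names are illustrative):

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=5):
    """EM training of IBM Model 1 lexicon probabilities t(f|e).

    corpus: list of (source_words, target_words) sentence pairs.
    The NULL word is added to every target sentence.
    """
    f_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialization

    for _ in range(iterations):
        # Accumulation (E-step): fractional counts over all alignments
        count = defaultdict(float)
        for fs, es in corpus:
            es = ["NULL"] + es
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    count[(f, e)] += t[(f, e)] / norm
        # Re-estimation (M-step): normalize the counts for each target word e
        total = defaultdict(float)
        for (f, e), c in count.items():
            total[e] += c
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

# Toy usage:
corpus = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"])]
t = train_ibm1(corpus)
print(round(t[("la", "the")], 3))  # 'la' should concentrate on 'the'
```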

  15. IBM Model 2 • Only difference from Model 1 is in the alignment probability • Length: source length dependent only on target length • Alignment: target position depends on the source position (in addition to the source length and target length): $a(i \mid j, I, J)$ • Model 1 is a special case of Model 2, where $a(i \mid j, I, J) = \frac{1}{I+1}$ • Lexicon: source word dependent only on the aligned word

  16. IBM Model 2 – Generative Process • To generate a French string $f_1^J$ from an English string $e_1^I$: • Step 1: Pick the length $J$ of $f_1^J$ • All lengths are equally probable; $\Pr(J \mid e_1^I) = \epsilon$ is a constant • Step 2: Pick an alignment $a_1^J$ with probability $\prod_{j=1}^{J} a(a_j \mid j, I, J)$ • Step 3: Pick the French words with probability $\prod_{j=1}^{J} t(f_j \mid e_{a_j})$ • Final result: $\Pr(f_1^J, a_1^J \mid e_1^I) = \epsilon \prod_{j=1}^{J} a(a_j \mid j, I, J)\, t(f_j \mid e_{a_j})$

  17. IBM Model 2 – Training • Parameters of the model: $t(f \mid e)$ and $a(i \mid j, I, J)$ • Training data: parallel sentence pairs • We maximize the likelihood w.r.t. the translation and alignment parameters • EM algorithm used for the estimation • Initialize the alignment parameters uniformly and the translation probabilities from Model 1 • Accumulate counts, re-estimate the parameters (see the sketch below) • Model simple enough to sum over all alignments
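Model 2 training only changes the E-step weights: each link is now weighted by the product of the lexicon and alignment probabilities before normalizing. A sketch of the posterior computation (the table layouts `t[(f, e)]` and `a[(i, j, I, J)]` are illustrative assumptions):

```python
def ibm2_posteriors(fs, es, t, a):
    """Posterior link weights w[j][i] = P(a_j = i | f, e) under IBM Model 2."""
    es = ["NULL"] + es                  # position 0 is the null word
    I, J = len(es) - 1, len(fs)
    w = []
    for j, f in enumerate(fs, start=1):
        # Back off to the uniform Model 1 value for unseen alignment entries
        scores = [t.get((f, e), 0.0) * a.get((i, j, I, J), 1.0 / (I + 1))
                  for i, e in enumerate(es)]
        norm = sum(scores)
        w.append([s / norm for s in scores])
    return w
```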

  18. Fertility-based Alignment Models • Models 3–5 are based on fertility • Fertility: number of source words connected with a target word • $\phi_i$ : fertility value of $e_i$ • $n(\phi \mid e)$ : probability that $e$ is connected with $\phi$ source words • Alignment: defined in the reverse direction (target to source) • $d(j \mid i, I, J)$ : probability of French position $j$ given that the English position is $i$

  19. IBM Model 3 – Generative Process • To generate a French string $f_1^J$ from an English string $e_1^I$: • Step 1: Choose the $(I+1)$ fertilities $\phi_0, \phi_1, \ldots, \phi_I$ with probability $\prod_{i=1}^{I} n(\phi_i \mid e_i)$ (the null-word fertility $\phi_0$ is modeled separately with parameters $p_0, p_1$)

  20. IBM Model 3 – Generative Process • Step 2: For each $i = 0, 1, \ldots, I$ and $k = 1, \ldots, \phi_i$, choose a position $\pi_{ik} \in \{1, \ldots, J\}$ and a French word $\tau_{ik}$ with probability $d(\pi_{ik} \mid i, I, J)\, t(\tau_{ik} \mid e_i)$ • For a given alignment, there are $\prod_{i=0}^{I} \phi_i!$ orderings

  21. IBM Model 3 – Example [Knight 99]
    [e] : e0 Mary did not slap the green witch
    [choose fertilities (1 0 1 3 1 1 1)] : Mary not slap slap slap the green witch
    [fertility for e0] : Mary not slap slap slap NULL the green witch
    [choose translations] : Mary no daba una bofetada a la verde bruja
    [choose target positions j] : Mary no daba una bofetada a la bruja verde
    j : 1 2 3 4 5 6 7 8 9
    aj : 1 3 4 4 4 0 5 7 6
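The fertility vector in this example can be read directly off the alignment: $\phi_i$ is just the number of Spanish positions $j$ with $a_j = i$. A small sketch using the numbers from the slide:

```python
from math import factorial

def fertilities(alignment, I):
    """phi[i] = number of source words aligned to target position i (i = 0 is NULL)."""
    phi = [0] * (I + 1)
    for i in alignment:
        phi[i] += 1
    return phi

# From the slide: Spanish position j (1..9) -> English position a_j
a = [1, 3, 4, 4, 4, 0, 5, 7, 6]
phi = fertilities(a, I=7)
print(phi)                # [1, 1, 0, 1, 3, 1, 1, 1]: 'slap' (i=4) has fertility 3
print(factorial(phi[4]))  # 3! = 6 orderings of the words generated by 'slap'
```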

  22. IBM Model 3 – Training • Parameters of the model: $t(f \mid e)$, $n(\phi \mid e)$, $d(j \mid i, I, J)$, $p_0, p_1$ • EM algorithm used for the estimation • Not possible to compute exact EM updates • Initialize n, d, p uniformly and the translation probabilities from Model 2 • Accumulate counts, re-estimate the parameters • Cannot efficiently sum over all alignments • Only the Viterbi alignment is used • Model 3 is deficient • Probability mass is wasted on impossible translations (e.g. several words placed on the same position)

  23. IBM Model 4 • Tries to model the re-ordering of phrases • $d(j \mid i, I, J)$ is replaced with two sets of parameters: • one for placing the first word (head) of a group of words • one for placing the rest of the words relative to the head • Deficient • The alignment can generate source positions outside of the sentence length $J$ • Model 5 removes this deficiency

  24. HMM Alignment Model • Idea: relative position model (figure: alignment path over source and target positions) [Vogel 96]

  25. HMM Alignment • First-order model: target position dependent on the previous target position (captures movement of entire phrases) • Alignment probability: $\Pr(f_1^J, a_1^J \mid e_1^I) = \prod_{j=1}^{J} p(a_j \mid a_{j-1}, I)\, p(f_j \mid e_{a_j})$ • Alignment depends only on the relative position: $p(i \mid i', I) = \frac{c(i - i')}{\sum_{i''} c(i'' - i')}$ • Maximum approximation: $\Pr(f_1^J \mid e_1^I) \approx \max_{a_1^J} \prod_{j=1}^{J} p(a_j \mid a_{j-1}, I)\, p(f_j \mid e_{a_j})$
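With the maximum approximation, the best alignment path can be found by standard Viterbi dynamic programming over target positions. A minimal sketch (the lexicon table `t` and jump-width table `jump` are illustrative assumptions; smoothing and the empty word are omitted):

```python
def hmm_viterbi_align(f_words, e_words, t, jump):
    """Viterbi alignment under a first-order HMM alignment model (sketch).

    t[(f, e)]: lexicon probabilities; jump[d]: probability of a jump
    d = i - i' between successive aligned target positions.
    """
    I = len(e_words)
    floor = 1e-10  # crude floor instead of real smoothing
    # delta[i]: best score of generating f_1..f_j with a_j = i (0-based i)
    delta = [t.get((f_words[0], e_words[i]), floor) / I for i in range(I)]
    backptr = []
    for f in f_words[1:]:
        ptr, new = [], []
        for i in range(I):
            best = max(range(I), key=lambda k: delta[k] * jump.get(i - k, floor))
            ptr.append(best)
            new.append(delta[best] * jump.get(i - best, floor)
                       * t.get((f, e_words[i]), floor))
        delta = new
        backptr.append(ptr)
    # Backtrace the best path of target positions
    path = [max(range(I), key=lambda i: delta[i])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return [i + 1 for i in reversed(path)]  # 1-based alignment a_1..a_J
```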

  26. IBM2 vs. HMM (figure: comparison of alignment matrices) [Vogel 96]

  27. Enhancements to HMM & IBM Models • HMM model with empty word • Adding $I$ empty words to the target side • Model 6 • IBM4: predicts the distance between subsequent target positions • HMM: predicts the distance between subsequent source positions • Model 6: a log-linear combination of the IBM4 and HMM models • Smoothing • Alignment prob. – interpolate with a uniform distribution • Fertility prob. – depends on the number of letters in a word • Symmetrization • Heuristic postprocessing to combine the alignments from both translation directions (see the sketch below)
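A minimal sketch of one common symmetrization heuristic (intersect the two directional alignments, then grow with adjacent links from the union); the link-set representation is an illustrative assumption:

```python
def symmetrize(src2tgt, tgt2src):
    """Combine two directional word alignments, given as sets of (j, i) links."""
    result = set(src2tgt & tgt2src)     # intersection: high precision
    union = src2tgt | tgt2src           # union: high recall
    # Grow: repeatedly add union links adjacent to an already accepted link
    changed = True
    while changed:
        changed = False
        for j, i in sorted(union - result):
            if any(abs(j - j2) + abs(i - i2) == 1 for j2, i2 in result):
                result.add((j, i))
                changed = True
    return result

# Toy usage: two directional alignments of a 3-word sentence pair
print(sorted(symmetrize({(1, 1), (2, 2), (3, 2)}, {(1, 1), (2, 2), (3, 3)})))
```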

  28. Experimental Results [Och & Ney 03] • Refined models perform better • Models 4, 5, 6 better than Model 1 or the Dice coefficient model • HMM better than IBM2 • Alignment quality depends on the training method and the bootstrap scheme used • IBM1 -> HMM -> IBM3 better than IBM1 -> IBM2 -> IBM3 • Smoothing and symmetrization have a significant effect on alignment quality • Using more alignments in training yields better results • Using word classes • Improvement for large corpora but not for small corpora

  29. References • Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, Robert L. Mercer (1993). The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, vol. 19, no. 2. • Stephan Vogel, Hermann Ney, Christoph Tillmann (1996). HMM-Based Word Alignment in Statistical Translation. COLING, The 16th Int. Conf. on Computational Linguistics, Copenhagen, Denmark, August, pp. 836–841. • Franz Josef Och, Hermann Ney (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, vol. 29, no. 1, pp. 19–51. • Kevin Knight (1999). A Statistical MT Tutorial Workbook. Available at http://www.isi.edu/natural-language/mt/wkbk.rtf
