Morphology & Machine Translation
Eric Davis
MT Seminar 02/06/08
Professor Alon Lavie
Professor Stephan Vogel
Outline
• Intro
  • The Issue at Hand
  • Supervised MA
  • Unsupervised MA
  • Integration of Morphology into MT
• Papers
  • Morfessor
  • Bridging the Inflectional Morphology Gap --> Arabic SMT
  • Unsupervised MA w/ Finnish, Swedish, & Danish
  • Turkish SMT
• Discussion
  • The Good
  • The Bad
  • Future Directions
• Q&A
Morfessor
• morpheme segmentation & simple morphology induction algorithm
• utilized Finnish & English data sets used in Morpho Challenge
• unsupervised method for segmentation of words into morpheme-like units
• idea: propose substrings occurring frequently enough in several different word forms as morphs
  • words = concatenation of morphs
• look for optimal balance btwn compactness of morph lexicon & compactness of the representation of the corpus
  • very compact lexicon = individual letters --> as many morphs as letters in word
  • short rep of corpus = whole words --> large lexicon
• corpus represented as sequence of pointers to entries in morph lexicon
• uses probabilistic framework or MDL to produce segmentation resembling linguistic morpheme segmentation
• 3 'flavors': Baseline, Categories-ML, Categories-MAP
Morfessor Baseline
• context-independent splitting algorithm
• optimization criterion: maximize P(lexicon) · P(corpus|lexicon) = ∏α P(α) · ∏μ P(μ)
• lexicon = all distinct morphs, spelled out as strings of letters
  • α = the letters making up a morph in the lexicon
  • P(lexicon) = product of the probability of each letter in each morph string
• corpus --> sequence of morphs, i.e., a particular segmentation of the words in the corpus
  • P(corpus|lexicon) = product of the probability of each morph token μ
• letter & morph probs are maximum-likelihood estimates
• 3 error types:
  • 1) undersegmentation: frequent string stored whole b/c that is the most concise representation
  • 2) oversegmentation: infrequent string best coded in parts
  • 3) morphotactic violations: b/c model is context-independent
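A minimal Python sketch of this objective, assuming the maximum-likelihood letter & morph probabilities described above (an illustration of the two-part cost, not Morfessor's actual code):

```python
import math
from collections import Counter

def description_length(segmented_corpus):
    """Toy Morfessor-Baseline-style cost: -log P(lexicon) - log P(corpus|lexicon).
    Lower is better; the optimal segmentation balances the two terms."""
    # P(corpus | lexicon): product of ML probabilities of the morph tokens
    morph_counts = Counter(m for word in segmented_corpus for m in word)
    n_tokens = sum(morph_counts.values())
    corpus_cost = -sum(c * math.log(c / n_tokens) for c in morph_counts.values())

    # P(lexicon): each distinct morph spelled out letter by letter,
    # with ML letter probabilities estimated over the lexicon
    letter_counts = Counter(ch for morph in morph_counts for ch in morph)
    n_letters = sum(letter_counts.values())
    lexicon_cost = -sum(c * math.log(c / n_letters) for c in letter_counts.values())

    return lexicon_cost + corpus_cost

# whole-word lexicon vs a shared-stem segmentation of the same tiny corpus
words  = [["walked"], ["walking"], ["talked"], ["talking"]]
morphs = [["walk", "ed"], ["walk", "ing"], ["talk", "ed"], ["talk", "ing"]]
print(description_length(words), description_length(morphs))
```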
Morfessor Categories-ML
• introduce morph categories & use an HMM
  • transition probabilities between categories
  • emission probabilities of morphs from categories
• 4 categories, based on properties of morphs in the proposed segmentation:
  • prefix: morph preceding a large # of diff morphs (high right perplexity)
  • stem: morph that is not very short
  • suffix: morph following a large # of diff morphs (high left perplexity)
  • noise: morph that is not an obvious prefix, suffix, or stem in the position it occurs in
• use heuristics & the noise category to remove some errors from the Baseline:
  • split redundant morphs in lexicon to reduce undersegmentation
  • prohibit splitting into 'noise'; join morphs tagged as noise w/ neighbors to reduce oversegmentation
  • introduce context-sensitivity (HMM) to reduce morphotactic violations
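A toy sketch of the perplexity-based category assignment; the threshold, minimum stem length, & decision order are invented here for illustration, not Morfessor's actual heuristics:

```python
import math
from collections import Counter

def right_perplexity(segmentations, morph):
    """Perplexity of the distribution of morphs immediately following `morph`.
    High value --> precedes many different morphs --> prefix-like.
    Left perplexity (for suffixes) is the mirror-image computation."""
    followers = Counter()
    for word in segmentations:
        for i in range(len(word) - 1):
            if word[i] == morph:
                followers[word[i + 1]] += 1
    n = sum(followers.values())
    if n == 0:
        return 1.0
    entropy = -sum((c / n) * math.log(c / n) for c in followers.values())
    return math.exp(entropy)

def categorize(morph, segmentations, perp_thresh=2.0, min_stem_len=4):
    if right_perplexity(segmentations, morph) >= perp_thresh:
        return "PRE"
    if len(morph) >= min_stem_len:   # stem: morph that is not very short
        return "STM"
    return "NOI"  # left-perplexity test for SUF omitted for brevity

seg = [["un", "lock"], ["un", "tie"], ["un", "do"], ["lock", "ed"]]
print(categorize("un", seg), categorize("lock", seg))  # PRE STM
```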
Morfessor Categories-MAP
• 2 probabilities calculated: P(lexicon) & P(representation of corpus conditioned on lexicon)
• frequent strings represented as whole words in lexicon, but now w/ a hierarchical representation
  • morph --> either a string of letters or 2 sub-morphs
• expand morphs into sub-morphs to avoid undersegmentation
• do not expand nodes in tree if next level = 'noise', to avoid oversegmentation
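A minimal sketch of the hierarchical expansion idea, assuming a hypothetical lexicon mapping a morph either to a (left, right) pair of sub-morphs or to nothing (a plain string of letters):

```python
def expand(morph, lexicon, is_noise):
    """Recursively expand a morph into sub-morphs (against undersegmentation),
    but stop before 'noise' children (against oversegmentation)."""
    children = lexicon.get(morph)        # None, or a (left, right) pair
    if children is None:
        return [morph]                   # leaf: plain string of letters
    left, right = children
    if is_noise(left) or is_noise(right):
        return [morph]                   # do not expand into noise
    return expand(left, lexicon, is_noise) + expand(right, lexicon, is_noise)

lexicon = {"openminded": ("open", "minded"), "minded": ("mind", "ed")}
print(expand("openminded", lexicon, is_noise=lambda m: len(m) < 2))
# ['open', 'mind', 'ed']
```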
Experiments & Results
• Baseline entirely unsupervised; ML & MAP not unsupervised
  • optimize perplexity threshold, separately for the 3 languages
• ran the 3 models on Challenge data
  • ML & MAP > Baseline; Baseline did best on English
  • MAP had much higher precision than the other models BUT lower recall
  • MAP & ML: great improvement in recall BUT lower precision
• explanation: different complexities of morphology
  • Turkish/Finnish: high type/token ratio; word formation = concatenation of many morphemes --> proportion of frequently occurring word forms is lower
  • English: word formation uses fewer morphemes --> proportion of frequently occurring word forms is higher
BAMA & Arabic MT
• take advantage of source & target language context when conducting MA
• preprocess data w/ BAMA (Buckwalter Arabic Morphological Analyzer)
  • morphological analysis at word level
  • analyzes a word --> returns all possible segmentations for the word
  • segmentations --> prefixes, stems, suffixes
  • built-in word-based heuristics rank the candidates
  • gloss info provided by BAMA's manually constructed lexicon
• 3 methods of analysis:
  • 1) BAMA only
  • 2) BAMA & context
  • 3) BAMA & corresponding match
BAMA & Arabic MT: 3 Methods of Analysis
• 1) BAMA only
  • replace each Arabic word w/ the 1st possible split returned by BAMA
• 2) BAMA & context (see sketch below)
  • take full advantage of gloss info provided by BAMA's lexicon
  • each split --> a particular prefix, stem, suffix existing in the lexicon
  • set of possible translations (glosses) for each fragment
  • select fragment (split for source word) using context
    • winner = split w/ most target-side matches in the translation of the full sentence
  • save choice of split & use for all occurrences of that surface form in training & testing
• 3) BAMA & corresponding match
  • some Arabic info in the surface form is not present in English
    • confusing for word alignment unless such fragments are aligned to null
  • remove fragments w/ lexical info not present in English
    • found b/c their English translations in the BAMA lexicon are empty
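A toy sketch of the 'BAMA & context' selection step; the data format (candidate splits mapped to gloss sets) & the example fragments are hypothetical:

```python
def pick_split_by_context(splits, english_sentence):
    """Among the candidate splits for an Arabic word, pick the one whose
    lexicon glosses overlap most with the English side of the sentence."""
    english_words = set(english_sentence.lower().split())

    def matches(glosses):
        return sum(1 for g in glosses if g.lower() in english_words)

    # winner = split w/ most target-side matches in the full-sentence translation
    return max(splits, key=lambda s: matches(splits[s]))

splits = {
    ("w+", "ktb"): {"and", "book", "write"},   # hypothetical split & glosses
    ("wktb",):     {"correspondence"},
}
print(pick_split_by_context(splits, "and he wrote the book"))  # ('w+', 'ktb')
```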
BAMA & Arabic MT: Data & System
• data: BTEC IWSLT05 Arabic language data
  • 20,000 Arabic/English sentence pairs (training)
  • DevSet/Test05: 500 Arabic sentences each, w/ 16 reference translations per Arabic sentence
  • also evaluated on randomly sampled dev & test sets
    • worried test & dev sets too similar
• used Vogel system w/ reordering & future cost estimation
• baseline --> normalize (merge Alif, ta marbuta, ee)
• trained translation parameters for 10 scores (LM, word & phrase count, & 6 translation models)
• used MERT on dev set
• optimized system (separately) for both BLEU & NIST
Results
• NIST scores: steady improvement w/ better splitting techniques (up to 5% relative)
• improvements statistically significant
• better improvements for NIST than BLEU
  • NIST is sensitive to correctly translating certain high-gain words in the test corpus
  • inflectional splitting of unknown words --> correct translations of such words --> increased score
Unsupervised MA for Finnish, Swedish, & Danish SMT
• used morphological information found in an unsupervised way in SMT
• 3 languages: Danish, Swedish & Finnish (Danish & Swedish very close to each other)
• trained system on corpus containing 300,000 sentences from EuroParl
• typical IBM model --> translation model & LM, but used morphs as tokens NOT words
• used Morfessor Categories-MAP to find morpheme-like units
  • even works w/ agglutinative languages, e.g., Finnish
• reasoning: speech recognition using a morph-based vocabulary shown to improve results
• used MAP because:
  • 1) better segmentation accuracy than Morfessor Baseline or Categories-ML
  • 2) can handle unseen words
• word = (PRE* STM SUF*)+
Language Models & Phrases
• used basic n-gram LM --> based on sub-word units NOT words
• used varigram model --> gives a smaller n-gram model w/o restricting n too much
  • model grows incrementally & includes longer contexts only when necessary
• used 3 types of LM
  • 2 baseline 3-gram & 4-gram models trained w/ SRILM toolkit
  • 3rd --> varigram model trained w/ VariKN LM toolkit, based on (Siivola, 2007)
• observed --> translation quality improved by translating sequences of words (phrases)
• used Moses --> generalized phrase-based approach to work w/ morphology
  • used morphs w/o modifications to Moses
  • phrases constructed from morphs similarly to words
  • morphs suitable for translating compound words in parts
• morph category info (PRE, STM, SUF) part of morph label
• '+' marker --> not the last morph of a word --> necessary to reconstruct words from morphs in output (see sketch below)
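A minimal sketch of reconstructing surface words from the decoder's morph output, assuming the '+' marker format described above:

```python
def join_morphs(morph_tokens):
    """Rebuild surface words: a trailing '+' on a morph token means the
    word continues; a token without '+' ends the current word."""
    words, current = [], ""
    for tok in morph_tokens:
        if tok.endswith("+"):
            current += tok[:-1]          # word continues
        else:
            words.append(current + tok)  # last morph of the word
            current = ""
    return " ".join(words)

print(join_morphs(["talo+", "i+", "ssa", "on"]))  # -> "taloissa on"
```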
Data & Experiments
• ran all experiments on Moses & used BLEU to score
• data = European Parliament from 1996-2001 --> stripped bi-texts of XML tags & converted letters to lowercase
  • test --> last 3 months of 2000
  • dev --> sessions of September 2000
  • training --> rest (excluding above)
• trained Morfessor on training set & used it to segment dev & test sets
• created 2 data sets for each alignment pair --> 1 w/ words, 1 w/ morphs
• used training sets for LM training; used dev sets for parameter tuning
• Moses cleaning script removed mis-aligned sentences: a) 0 tokens, b) too many tokens, c) bad token ratio
• test set --> sentences had at least 5 words & at most 15 words
Results
• morphs shorter than words --> need longer n-grams to cover same amount of context info
  • 4-gram improves scores over 3-gram LM for morphs & for words (in 3/4 cases) --> use 4-gram LM
  • default phrase length in Moses = 7 --> not long enough for morphs --> increased to 10
• varigram model --> mixed results
• overall --> translation based on morph phrases worse; significantly worse in 2 cases: Finnish-Swedish & Swedish-Finnish
• reasons:
  • only 1 reference translation --> hurts score
  • Finnish has fewer words for the same text than Swedish or Danish
  • 1 mistake in the suffix of a word --> whole word counts as an error even if understandable
Untranslated Words
• word-based translation model only translates words present in training data
• data --> morphs have a notably lower type count
  • same vocabulary coverage w/ a smaller # of more frequently occurring units
  • reduces OOV problem
• results --> morph-based system translated many more sentences fully & translated more words
  • higher # of compound words & inflected word forms left untranslated by word-based system
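A small sketch of why the lower type count reduces OOV: the same toy text yields an unseen word form but mostly seen morphs (the Finnish-like forms are made up for illustration):

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens never seen in training."""
    vocab = set(train_tokens)
    return sum(1 for t in test_tokens if t not in vocab) / len(test_tokens)

train_words  = "taloissa talossa talot".split()
test_words   = "taloista".split()
train_morphs = "talo issa talo ssa talo t".split()
test_morphs  = "talo ista".split()
print(oov_rate(train_words, test_words))    # 1.0: whole word form unseen
print(oov_rate(train_morphs, test_morphs))  # 0.5: at least the stem is covered
```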
Performance on Baseforms
• translating into Finnish --> word & morph models both have trouble getting grammatical endings right
  • morph-based model translated more words
  • if words are restored to baseform, would the morph-based model improve?
• used FINTWOL (Finnish MA) to produce baseforms for each word in the output of Swedish-Finnish translation
  • 3.3% (word), 2.2% (morph), & 1.8% (ref) of words not recognized by the MA were left unchanged
• BLEU scores about 5% higher for modified data
• word-based model still outperformed morph-based model
• no test on other language pairs
Quality of Morfessor's Segmentation
• selected 500 words from data randomly & manually segmented them
• precision = proportion of morph boundaries proposed by Morfessor agreeing w/ linguistic segmentation
• recall = proportion of boundaries in linguistic segmentation found by Morfessor
• segmentation accuracy for Danish & Swedish very similar
• Finnish morphology more challenging --> results worse
• precision around 80% for all languages --> 4/5 morph boundaries suggested by Morfessor correct
  • prefer high precision --> proposed morph boundaries usually correct
• lower recall --> words generally undersegmented (segmentation more conservative)
  • difference btwn standard word representation & Morfessor segmentation smaller than difference btwn words & linguistic morphs
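A sketch of the boundary precision/recall computation described above, with segmentations given as lists of morphs (the example words are illustrative):

```python
def boundary_set(segmentation):
    """Character positions of morph boundaries, e.g. ['talo','ssa'] -> {4}."""
    positions, offset = set(), 0
    for morph in segmentation[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions

def boundary_precision_recall(proposed, gold):
    tp = fp = fn = 0
    for p, g in zip(proposed, gold):
        pb, gb = boundary_set(p), boundary_set(g)
        tp += len(pb & gb)   # proposed boundaries agreeing w/ linguistic ones
        fp += len(pb - gb)   # proposed but wrong
        fn += len(gb - pb)   # linguistic boundaries missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

proposed = [["talo", "issa"], ["kirjasto"]]
gold     = [["talo", "i", "ssa"], ["kirja", "sto"]]
print(boundary_precision_recall(proposed, gold))  # (1.0, 0.33...): undersegmented
```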
Closer Look at Segmentation
• looked for phrases not spanning entire words: at least 1 phrase boundary = morph boundary w/in a word
• 3 categories:
  • 1) same structure across languages
    • compound words common in the 3 languages studied
    • Danish & Swedish --> similar morphological structure
    • parallel structures when translating to or from Finnish N & V
  • 2) differing structures across languages
    • morph-based model captures these fairly well, but needs a way to re-order phrases
    • interesting: (written) Finnish turns V into N
  • 3) lexicalized forms split into phrases
    • Swedish & Danish: translate phrase piece by piece even though pieces may be very short & not morphologically productive
• data: 2/3 of translated sentences btwn Swedish & Danish have at least 1 phrase boundary w/in a word --> only 1/3 in Finnish
Conclusion
• unsupervised MA flexible --> provides language independence
• generalization ability increased through more refined phrases
• improvements:
  • specialize alignment algorithm for morphs instead of words
  • rescore translations w/ a word-based LM
  • combine allomorphs of the same morpheme into equivalence classes
  • use factored translation models to combine them in translation
English-Turkish SMT
• looked at sub-lexical structure b/c a Turkish word aligns to a complete phrase on the English side
  • phrase on English side may be discontinuous
• Turkish: 150 diff suffixes & 30,000 root words
• use morphs to alleviate sparseness
  • abstract away from word-internal details w/ a morph representation
  • words that appear different on the surface may be similar at the morph level
• Turkish has many more distinct word forms (2X Eng) but fewer distinct content words
• may overload distortion mechanisms b/c they must account for both word-internal morph sequence & sentence-level word ordering
• segmentation of a word might not be unique
  • generate representation w/ lexical & morphological features for all possible segmentations & interpretations of a word
  • disambiguate analyses w/ a statistical disambiguator using morph features
Exploiting Turkish Morphology
Process docs:
• 1) improve statistical alignment
  • segment words into lexical morphemes to remove differences due to word-internal details
• 2) tag English side w/ TreeTagger
  • lemma & POS for each word
  • remove any tags not implying a morpheme or exceptional form
• 3) extract sequence of roots for open-class content words from morph-segmented data
  • remove all closed-class words as well as tags signaling morphs on open-class words
• processing bolsters training corpus, improves alignment
• goal: align roots w/o additional noise from morphs or function words
Framework & Systems
• used monolingual Turkish text of 100,000 sentences & training data for LM
• decoded & rescored n-best lists
  • surface words directly recoverable from concatenated representation of segmentation
  • used word-based representation for the word-based LM used for rescoring
• used phrase-based SMT framework (Koehn) & Moses toolkit (Koehn) & SRILM toolkit (Stolcke)
• evaluated decoded translations w/ BLEU using a single reference translation
• 3 systems:
  • Baseline
  • fully morphologically segmented model
  • selectively segmented model
Baseline System
• trained model using default Moses parameters w/ word-based training corpus
• decoded English test set w/ default decoder parameters & w/ distortion limit set to unlimited
  • also tried distortion weight set to 0.1 to allow for long-distance distortions
• tried MERT but it did not improve scores
• added content word data & trained 2nd baseline model
  • adding content words hurt performance (16.29 vs. 16.13 & 20.16 vs. 19.77)
Fully Morphologically Segmented Model
• trained model w/ morphs, both w/ & w/o adding content words
• used 5-gram morpheme-based LM for decoding
  • goal: capture local morphotactic constraints & sentence-level ordering of words
  • ~2 morphs per word --> 5-gram covers ~2 words
• decoded 1000-best lists
  • converted 1000-best sentences into words & rescored w/ 4-gram word-based LM
  • goal: enforce more distant word-sequence constraints
• experimented w/ parameters & various linear combos of word-based LM & translation model w/ tuning (see sketch below)
• default decoding parameters used by Moses decoder gave bad results
  • English & Turkish word order very different --> need distortion
  • allowing longer distortions w/ less penalty --> 7-point BLEU improvement
• adding content words --> 6.2% improvement (no rescoring) --> better alignment
• rescoring 1000-best output w/ 4-gram word-based LM --> 4% relative improvement (0.79 BLEU points)
• best: allow distortion & rescore --> 1.96 BLEU points (9.4% relative)
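A minimal sketch of the rescoring step, combining the decoder's model score w/ a word-based LM score; the weight & the LM callback are hypothetical stand-ins for the tuned linear combination:

```python
import math

def rescore_nbest(nbest, word_lm_logprob, lm_weight=0.5):
    """Pick the best hypothesis from an n-best list of (words, model_score)
    pairs, where `words` are already joined back from morphs."""
    best, best_score = None, -math.inf
    for words, model_score in nbest:
        score = model_score + lm_weight * word_lm_logprob(words)
        if score > best_score:
            best, best_score = words, score
    return best

# toy usage w/ a fake word LM that just penalizes length
fake_lm = lambda sent: -2.0 * len(sent.split())
nbest = [("evde kitap var", -10.0), ("ev de kitap var", -9.5)]
print(rescore_nbest(nbest, fake_lm))  # "evde kitap var"
```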
Selectively Segmented Model
• analyzed GIZA++ files: certain morphemes on the Turkish side almost never aligned w/ anything
  • only derivational MA on the Turkish side
  • nominalization, agreement markers, etc. mostly unaligned
• for the above cases, attach morphemes to the root (intervening morphs for V too)
• case morphemes did align w/ prepositions on the English side, so left alone
• trained model w/ added content words & parameters from the best-scoring model on the previous slide
• 2.43 pts (11% relative) improvement in BLEU over the best model above
Model Iteration
• used iterative approach w/ multiple models, like post-editing
• used selective segmentation model & decoded English training & test sets to obtain T1 train & T1 test
• trained next model on T1 train & T train data
  • aim: model maps T1 --> T
  • model applied to T1 train & T1 test produces T2 train & T2 test; repeat
• did not include content word corpus in these experiments
  • preliminary experiments: word-based models perform better than morpheme-based models in later iterations
  • adding content words for word-based models not helpful
• decoded data on original test data using 3-gram word-based LM
• re-ranked 1000-best outputs using 4-gram LM
• 2nd iteration: 4.86 (24% relative) improvement in BLEU over 1st fully morph-segmented model (no rescoring)
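A schematic of the iteration; train_smt & decode are toy stand-ins (the paper used Moses), included only so the loop structure is runnable:

```python
def train_smt(src_sents, tgt_sents):
    """Stand-in for phrase-based training: a 'memorizing' toy model."""
    return dict(zip(src_sents, tgt_sents))

def decode(model, sentence):
    """Stand-in for decoding: look the sentence up, else pass it through."""
    return model.get(sentence, sentence)

def iterate_models(eng_train, tur_train, eng_test, rounds=2):
    """Round i trains a model mapping T_{i-1} output toward the reference
    Turkish T, then re-decodes train & test to produce T_i."""
    src_train, src_test = eng_train, eng_test
    for _ in range(rounds):
        model = train_smt(src_train, tur_train)
        src_train = [decode(model, s) for s in src_train]
        src_test = [decode(model, s) for s in src_test]
    return src_test  # final-iteration test output, rescored separately

print(iterate_models(["the book is here"], ["kitap burada"],
                     ["the book is here"]))  # ['kitap burada']
```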
Errors & Word Repair
• errors in any translated morpheme or in morphotactics --> whole word incorrect
  • get almost 65% of root words correct (1-gram precision)
  • whole-word 1-gram precision only about 52% w/ best model
• mismatches: poorly formed words where root is correct but morphs are not applicable or in the wrong position
  • in many cases mismatches are only 1 morpheme edit distance away from the correct word
• solution: morpheme-level 'spelling corrector' operating on segmented representations (see sketch below)
  • corrects forms w/ minor morpheme errors
  • form a lattice & use it to rescore for the contextually correct form
• used BLEU+ to investigate: recovering all words 1 or 2 morphs away raises word BLEU score to 29.86 & 30.48
  • oracle scores, BUT very close to the root-word BLEU scores
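A simplified sketch of generating correction candidates w/in 1 morpheme edit (deletion or substitution) of the system output; insertions & the lattice/rescoring step are omitted:

```python
def morph_candidates(segmented_word, vocab):
    """Known word forms reachable by deleting or substituting one morpheme
    in the (possibly malformed) system output."""
    morphs = tuple(segmented_word)
    candidates = set()
    # single-morpheme deletions
    for i in range(len(morphs)):
        cand = morphs[:i] + morphs[i + 1:]
        if cand in vocab:
            candidates.add(cand)
    # single-morpheme substitutions, drawn from morphs seen in the vocabulary
    all_morphs = {m for w in vocab for m in w}
    for i in range(len(morphs)):
        for m in all_morphs:
            cand = morphs[:i] + (m,) + morphs[i + 1:]
            if cand in vocab:
                candidates.add(cand)
    return candidates

# 'ev lar de' has a vowel-harmony error; both repairs are 1 morph edit away
vocab = {("ev", "ler", "de"), ("ev", "de")}
print(morph_candidates(["ev", "lar", "de"], vocab))
```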
Other Scoring Methods
• BLEU very harsh on Turkish & the morph-based approach
  • all-or-none nature of token comparison
  • possible to have almost interchangeable words w/ very similar semantics: not an exact match, so BLEU marks them wrong
• solution: alter the notion of token similarity, using stems & synonyms (METEOR)
  • score increases to 25.08
  • use root-word synonymy & Wordnet --> score increases to 25.45
  • combine rules & Wordnet --> score increases to 25.46
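A toy sketch of the relaxed token similarity idea: a unigram match rate where stem agreement or synonymy also counts as a match, w/ stand-in stemmer & synonym table:

```python
def relaxed_match_rate(hypothesis, reference, stem, synonyms):
    """Unigram match rate where tokens also match if their stems agree
    or they are listed as synonyms (METEOR-style relaxation of BLEU's
    all-or-none token comparison)."""
    matched, ref = 0, list(reference)
    for h in hypothesis:
        for r in ref:
            if h == r or stem(h) == stem(r) or r in synonyms.get(h, ()):
                matched += 1
                ref.remove(r)   # each reference token may match only once
                break
    return matched / len(hypothesis)

hyp = ["evlerde", "kitaplar", "var"]
ref = ["evlerde", "kitap", "mevcut"]
toy_stem = lambda w: w[:5]        # stand-in stemmer
toy_syn = {"var": {"mevcut"}}     # stand-in Wordnet-style synonymy
print(relaxed_match_rate(hyp, ref, toy_stem, toy_syn))  # 1.0 (exact match: 1/3)
```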
Conclusions
• morphology & rescoring --> significant boost in BLEU score
• other solutions to the morphotactics problem:
  • use skip LM in SRILM toolkit so content word order is directly used by decoder
  • use posterior probabilities to identify morphologically correct words that are OOV or assigned low probability by the LM
  • generate additional 'close' morphological words & construct a lattice that can be rescored