Learning Translation Lexicons from Comparable Corpora
Ling 575 Presentation, Ankit K. Srivastava

Comparable Corpora
• Definition
• Examples
• Applications in Machine Translation
Translation Lexicon - Ankit

Translation Lexicon
• Definition
• Examples
• How to learn a translation lexicon (TL)?
• State-of-the-Art

Link to Paper
Primary paper (July 2002): “Learning a Translation Lexicon from Monolingual Corpora”, Philipp Koehn & Kevin Knight

Contents of Koehn & Knight 2002 (KK02)
• Introduction
• Clues
• Experiments
• Conclusion

KK02 - Introduction
SWOT analysis of statistical machine translation (SMT)

KK02 - Introduction: Objective
• Generally: “Build a translation lexicon solely from monolingual corpora.”
• Specifically: “Automatically generate a one-to-one mapping of German & English nouns.”

KK02 - Introduction: Data & Evaluation Metrics
Corpora
• English: Wall Street Journal, 1990-1992
• German: German News Wire, 1995-1996
Verification
• Bilingual lexicon of 9,206 German & 10,645 English nouns

Contents of Koehn & Knight 2002 (KK02)
• Introduction
• Clues
• Experiments
• Conclusion

KK02 - Clues: Find mappings in corpora
• Identical words
• Similar spelling
• Similar context
• Similar words
• Frequent words

KK02 - Clues: Identical Words
• Used to build a seed lexicon
• Words have identical spellings - e.g. OR
• Words adapted through well-established transformation rules - e.g.
• Together, these two strategies were used to find 976 + 363 word mappings

KK02 - Clues: Identical Words
• Identical translations & length of the word
• Restricting mappings to words of length >= 6 yields 622 word pairs at 96% accuracy

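The length filter on identically spelled words can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `seed_lexicon`, the `min_length` parameter, and the toy word lists are assumptions made for the example.

```python
def seed_lexicon(source_vocab, target_vocab, min_length=6):
    """Pair words that are spelled identically in both vocabularies.

    Only words with at least `min_length` letters are kept, following the
    slide's observation that longer identical words are more reliable.
    """
    shared = set(source_vocab) & set(target_vocab)
    return {word: word for word in shared if len(word) >= min_length}


# Toy vocabularies (invented for illustration, lowercased for simplicity).
german_nouns = ["computer", "internet", "hand", "garten"]
english_nouns = ["computer", "internet", "hand", "garden"]

seed = seed_lexicon(german_nouns, english_nouns)
print(sorted(seed))  # "hand" is shared but dropped by the length filter
```

The transformation-rule variant would apply spelling rewrites to one vocabulary before intersecting; it is left out here because the transcript does not list the rules.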
KK02 - Clues: Similar Spelling
• Common language roots & adopted words
• Different from non-verbatim words above
• Cognates
• Greedy fashion
• Longest Common Subsequence Ratio
• Limitations
• Other approaches

KK02 - Clues: Similar Spelling
SIM = (number of letters common in sequence) / (length of the longer word)

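A small sketch of the SIM ratio above, computed with a standard dynamic-programming longest common subsequence. The function names and the example word pairs are chosen for illustration only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def spelling_similarity(a, b):
    """SIM: letters common in sequence divided by the longer word's length."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


# "haus"/"house" share h, u, s in sequence: 3 / 5
print(spelling_similarity("haus", "house"))
# "freund"/"friend" share f, r, e, n, d in sequence: 5 / 6
print(spelling_similarity("freund", "friend"))
```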
KK02 - Clues: Context
• Similar context: a context window based on the frequency of context words in surrounding positions
• The context paradigm is a 3-step process
• Step 2 is a chicken-and-egg problem, hence the need for a seed lexicon
• Without a seed lexicon, time complexity is high
• This approach
• Other approaches

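The transcript does not spell out the three steps, so the sketch below assumes a common reading: (1) build position-aware context vectors, (2) translate the context words through the seed lexicon, (3) compare vectors across languages. Cosine similarity, the window size, and all names and toy sentences are assumptions for illustration, not necessarily what the paper uses.

```python
import math
from collections import Counter


def context_vector(tokens, word, window=2):
    """Step 1: count context words by their position relative to `word`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        for offset in range(-window, window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                counts[(offset, tokens[j])] += 1
    return counts


def translate_vector(vector, seed):
    """Step 2: map context words through the seed lexicon, dropping the rest."""
    translated = Counter()
    for (offset, ctx), count in vector.items():
        if ctx in seed:
            translated[(offset, seed[ctx])] += count
    return translated


def cosine(u, v):
    """Step 3: compare two sparse vectors (cosine is an assumed choice)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


# Toy corpora and seed lexicon, invented for illustration.
german = "der hund bellt laut".split()
english = "the dog barks loudly".split()
seed = {"der": "the", "laut": "loudly"}

hund = translate_vector(context_vector(german, "hund"), seed)
for candidate in ("dog", "barks"):
    print(candidate, cosine(hund, context_vector(english, candidate)))
```

Step 2 is where the seed lexicon is needed: without known translations for some context words, the two languages' vectors share no dimensions to compare.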
KK02 - Clues: Context - Greedy Example

KK02 - Clues: Similarity
• Words that are similar in one language have translations that are similar in the other language
• Example: days of the week
• Strategies to measure word similarity

KK02 - Clues: Frequency
• In comparable corpora, the same concepts are used with similar frequency
• Not sequential order
• Ratio of word frequencies, normalized by corpus sizes

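One way to turn the frequency clue into a score is sketched below. Each count is divided by its corpus size, and the smaller relative frequency is divided by the larger so the score falls between 0 and 1. The min/max symmetry and the function name are assumptions; the slide only states that the ratio is normalized by corpus sizes.

```python
def frequency_similarity(count_a, size_a, count_b, size_b):
    """Ratio of two relative frequencies, in [0, 1]; 1.0 means equally frequent."""
    rel_a = count_a / size_a
    rel_b = count_b / size_b
    if rel_a == 0 or rel_b == 0:
        return 0.0
    return min(rel_a, rel_b) / max(rel_a, rel_b)


# A word seen 50 times in 1,000 tokens vs. 100 times in 2,000 tokens:
# both relative frequencies are 0.05, so the score is 1.0.
print(frequency_similarity(50, 1000, 100, 2000))
```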
Contents of Koehn & Knight 2002 (KK02)
• Introduction
• Clues
• Experiments
• Conclusion

KK02 - Experiments: Testing Grounds
• Greedy search is preferred to exploring the O(n!) possible traversals
• Evaluation 1: number of correct word-pair mappings
• Evaluation 2: against a word-level translation

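A greedy one-to-one matching can be sketched as below: take the best-scoring pair, lock in both words, and repeat. This is a simplified illustration under the assumption that all pair scores are precomputed in a dictionary; the function name and the toy scores are invented.

```python
def greedy_match(scores):
    """Build a one-to-one lexicon from a dict of (source, target) -> score.

    Pairs are visited from best to worst; a pair is accepted only if
    neither of its words has been matched already.
    """
    lexicon = {}
    used_targets = set()
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (src, tgt), _score in ranked:
        if src in lexicon or tgt in used_targets:
            continue
        lexicon[src] = tgt
        used_targets.add(tgt)
    return lexicon


# Toy scores, invented for illustration.
scores = {
    ("hund", "dog"): 0.9,
    ("katze", "dog"): 0.85,
    ("hund", "cat"): 0.4,
    ("katze", "cat"): 0.3,
}
print(greedy_match(scores))
```

Sorting the candidate pairs once keeps the cost far below enumerating every possible one-to-one assignment, at the price of possibly missing the globally best mapping.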
Contents of Koehn & Knight 2002 (KK02)
• Introduction
• Clues
• Experiments
• Conclusion

KK02 - Conclusion: Remarks
• Identical words
• Similar spelling
• Similar context
• Similar words
• Frequent words