Learning Translation Lexicons from Comparable Corpora. Ling 575 Presentation, Ankit K. Srivastava.
Learning Translation Lexicons from Comparable Corpora Ling 575 Presentation, Ankit K. Srivastava
Comparable Corpora • Definition • Examples • Applications in Machine Translation Translation Lexicon - Ankit
Translation Lexicon • Definition • Examples • How to learn a TL? • State-of-the-Art
Link to Paper Primary Paper: “Learning a Translation Lexicon from Monolingual Corpora”, Philipp Koehn & Kevin Knight, July 2002
Contents of KoehnKnight2002 • Introduction • Clues • Experiments • Conclusion
KK02 - Introduction SWOT analysis of statistical machine translation (SMT)
KK02 - Introduction Objective • Generally: “Build a translation lexicon solely from monolingual corpora.” • Specifically: “Automatically generate a one-to-one mapping of German & English nouns.”
KK02 - Introduction Data & Evaluation Metrics CORPORA • ENG: Wall Street Journal, 1990-1992 • GER: German News Wire, 1995-1996 VERIFY • Bilingual Lexicon of 9,206 German & 10,645 English nouns
KK02 - Clues Find mappings in corpora • Identical words • Similar spelling • Similar context • Similar words • Frequent words
KK02 - Clues Identical Words • Used to build a seed lexicon • Words have identical spellings (e.g. …) OR • Words are adapted through well-established transformation rules (e.g. …) • Together, these two strategies found 976 + 363 word mappings
KK02 - Clues Identical Words Identical translations & length of the word • Restricting mappings to words of length >= 6 yields 622 word pairs at 96% accuracy
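The identical-spelling clue with a length threshold can be sketched in a few lines. This is only an illustration, not the paper's code: the function name `seed_lexicon`, the case-insensitive comparison, and the toy vocabularies are assumptions made here.

```python
def seed_lexicon(source_vocab, target_vocab, min_length=6):
    """Pair words that are spelled identically in both vocabularies.

    Hypothetical helper: comparison is case-insensitive (an assumption,
    since German nouns are capitalized) and words shorter than
    `min_length` are skipped.
    """
    target_lower = {word.lower() for word in target_vocab}
    lexicon = {}
    for word in source_vocab:
        key = word.lower()
        if len(key) >= min_length and key in target_lower:
            lexicon[key] = key  # identical spelling: the word maps to itself
    return lexicon


# Toy vocabularies, invented for illustration only.
german = {"Computer", "Internet", "Kind", "Hund"}
english = {"computer", "internet", "kind", "dog"}
print(seed_lexicon(german, english))
```

In this toy data "Kind" (German for "child") is spelled like the English "kind" but is not a translation; the length threshold happens to filter it out.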
KK02 - Clues Similar Spelling • Common language roots & Adopted words • Different from non-verbatim words above • Cognates • Greedy fashion • Longest Common Subsequence Ratio • Limitations • Other approaches
KK02 - Clues Similar Spelling $\mathrm{SIM} = \dfrac{\text{number of letters common in sequence}}{\text{length of the longer word}}$
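A minimal sketch of this ratio, assuming "letters common in sequence" means the longest common subsequence of the two words (as the "Longest Common Subsequence Ratio" bullet suggests). The function names and the lowercasing step are choices made for this example.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def spelling_similarity(a: str, b: str) -> float:
    """SIM = letters common in sequence / length of the longer word."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0
    # Lowercasing is an assumption, so capitalized German nouns compare fairly.
    return lcs_length(a.lower(), b.lower()) / longer


print(spelling_similarity("Freund", "friend"))
```

"Freund" and "friend" share the letters f, r, e, n, d in order, so the ratio is 5/6.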
KK02 - Clues Context • Similar context: a window-based measure using the frequency of context words in surrounding positions • The context paradigm is a 3-step process • Step 2 is a chicken-and-egg problem, so a seed lexicon is needed • Without a seed lexicon, time complexity is high • This approach • Other approaches
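One way to picture the context clue is to count seed-lexicon words near a target word, map those counts into the other language through the seed lexicon, and compare the resulting vectors. The sketch below rests on several assumptions: the window is treated as a bag of words rather than tracking each position separately, cosine similarity stands in for whatever comparison the paper uses, and all function names are hypothetical.

```python
import math
from collections import Counter


def context_vector(tokens, target, seed_words, window=2):
    """Count seed words within `window` positions of each occurrence of target."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i and tokens[j] in seed_words:
                counts[tokens[j]] += 1
    return counts


def translate_vector(vector, seed_lexicon):
    """Re-express source-language context counts in target-language words."""
    translated = Counter()
    for word, count in vector.items():
        translated[seed_lexicon[word]] += count
    return translated


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A candidate pair would then be scored by building a context vector for the source word over the keys of the seed lexicon, translating it, and taking the cosine with the target word's vector built over the seed lexicon's values.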
KK02 - Clues Context – Greedy Example
KK02 - Clues Similarity • Words that are similar in one language have translations that are similar in the other language • Example: days of the week • Strategies to measure word similarity
KK02 - Clues Frequency • In comparable corpora, the same concepts are used with similar frequency • Not sequential order • Ratio of word frequencies, normalized by corpus sizes
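The frequency clue can be illustrated with a small scoring function. The slide gives only "ratio of word frequencies normalized by corpus sizes"; taking the smaller of the two possible ratios, so that the score lies between 0 and 1, is an assumption made here, as are the function and parameter names.

```python
def frequency_similarity(count_a, corpus_size_a, count_b, corpus_size_b):
    """Ratio of relative frequencies, oriented so the result is in [0, 1]."""
    rel_a = count_a / corpus_size_a
    rel_b = count_b / corpus_size_b
    if rel_a == 0 or rel_b == 0:
        return 0.0
    return min(rel_a / rel_b, rel_b / rel_a)


# A word seen 10 times in 1,000 tokens vs. 40 times in 2,000 tokens:
# relative frequencies 0.01 and 0.02.
print(frequency_similarity(10, 1000, 40, 2000))
```

A score near 1 means the two words are used about equally often relative to the size of their corpora.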
KK02 - Experiments Testing Grounds • Greedy search is preferred over exploring the O(n!) possible one-to-one mappings • Evaluation 1: number of correct word-pair mappings • Evaluation 2: against a word-level translation
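A greedy one-to-one assignment can be sketched as follows: sort all candidate pairs by score and accept each pair whose words are both still unmatched. The input format (a dict keyed by word pairs) and the toy scores are assumptions for this example, not the paper's data structures.

```python
def greedy_match(scores):
    """Build a one-to-one lexicon by accepting the best-scoring free pair first.

    scores: dict mapping (source_word, target_word) -> similarity score.
    """
    used_source, used_target = set(), set()
    lexicon = {}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (source, target), _score in ranked:
        if source in used_source or target in used_target:
            continue  # one of the two words is already matched
        lexicon[source] = target
        used_source.add(source)
        used_target.add(target)
    return lexicon


# Toy scores, invented for illustration only.
toy_scores = {
    ("hund", "dog"): 0.9,
    ("hund", "cat"): 0.8,
    ("katze", "dog"): 0.7,
    ("katze", "cat"): 0.2,
}
print(greedy_match(toy_scores))
```

Sorting costs O(m log m) for m candidate pairs, which avoids enumerating every possible mapping. The trade-off is that the result is not guaranteed to maximize the total score: with these toy numbers the greedy pairing sums to 1.1, while the alternative pairing would sum to 1.5.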
KK02 - Conclusion Remarks • Identical words • Similar spelling • Similar context • Similar words • Frequent words