550 likes | 725 Vues
Introduction to MT. Ling 580 Fei Xia Week 1: 1/03/06. Outline. Course overview Introduction to MT Major challenges Major approaches Evaluation of MT systems Overview of word-based SMT. Course overview. General info. Course website:
E N D
Introduction to MT Ling 580 Fei Xia Week 1: 1/03/06
Outline • Course overview • Introduction to MT • Major challenges • Major approaches • Evaluation of MT systems • Overview of word-based SMT
General info • Course website: • Syllabus (incl. slides and papers): updated every week. • Message board • ESubmit • Office hour: Fri: 10:30am-12:30pm. • Prerequisites: • Ling570 and Ling571. • Programming: C or C++, Perl is a plus. • Introduction to probability and statistics
Expectations • Reading: • Papers are online • Finish reading before class. Bring your questions to class. • Grade: • Leading discussion (1-2 papers): 50% • Project: 40% • Class participation: 10% • No quizzes, exams
Leading discussion • Indicate your choice via EPost by Jan 8. • You might want to read related papers. • Make slides with PowerPoint. • Email me your slides by 3:30am on the Monday before your presentation. • Present the paper in class and lead the discussion: 40-50 minutes.
Project • Details will be available soon. • Project presentation: 3/7/06 • Final report: due on 3/12/06 • Pongo account will be ready soon.
A brief history of MT (Based on work by John Hutchins) • Before the computer: In the mid 1930s, a French-Armenian Georges Artsrouni and a Russian Petr Troyanskii applied for patents for ‘translating machines’. • The pioneers (1947-1954): the first public MT demo was given in 1954 (by IBM and Georgetown University). • The decade of optimism (1954-1966): ALPAC (Automatic Language Processing Advisory Committee) report in 1966: "there is no immediate or predictable prospect of useful machine translation."
A brief history of MT (cont) • The aftermath of the ALPAC report (1966-1980): a virtual end to MT research • The 1980s: Interlingua, example-based MT • The 1990s: Statistical MT • The 2000s: Hybrid MT
Where are we now? • Huge potential/need due to the internet, globalization and international politics. • Quick development time due to SMT, the availability of parallel data and computers. • Translation is reasonable for language pairs with a large amount of resource. • Start to include more “minor” languages.
What is MT good for? • Rough translation: web data • Computer-aided human translation • Translation for limited domain • Cross-lingual IR • Machine is better than human in: • Speed: much faster than humans • Memory: can easily memorize millions of word/phrase translations. • Manpower: machines are much cheaper than humans • Fast learner: it takes minutes or hours to build a new system. Erasable memory • Never complain, never get tired, …
Translation is hard • Novels • Word play, jokes, puns, hidden messages • Concept gaps: go Greek, bei fen • Other constraints: lyrics, dubbing, poem, …
Major challenges • Getting the right words: • Choosing the correct root form • Getting the correct inflected form • Inserting “spontaneous” words • Putting the words in the correct order: • Word order: SVO vs. SOV, … • Unique constructions: • Divergence
Lexical choice • Homonymy/Polysemy: bank, run • Concept gap: no corresponding concepts in another language: go Greek, go Dutch, fen sui, lame duck, … • Coding (Concept lexeme mapping) differences: • More distinction in one language: e.g., kinship vocabulary. • Different division of conceptual space:
Choosing the appropriate inflection • Inflection: gender, number, case, tense, … • Ex: • Number: Ch-Eng: all the concrete nouns: ch_book book, books • Gender: Eng-Fr: all the adjectives • Case: Eng-Korean: all the arguments • Tense: Ch-Eng: all the verbs: ch_buy buy, bought, will buy
Inserting spontaneous words • Function words: • Determiners: Ch-Eng: ch_book a book, the book, the books, books • Prepositions: Ch-Eng: … ch_November … in November • Relative pronouns: Ch-Eng: … ch_buy ch_book de ch_person the person who bought /book/ • Possessive pronouns: Ch-Eng: ch_he ch_raise ch_hand He raised his hand(s) • Conjunction: Eng-Ch: Although S1, S2 ch_although S1, ch_but S2 • …
Inserting spontaneous words (cont) • Content words: • Dropped argument: Ch-Eng: ch_buy le ma Has Subj bought Obj? • Chinese First name: Eng-Ch: Jiang … ch_Jiang ch_Zemin … • Abbreviation, Acronyms: Ch-Eng: ch_12 ch_big the 12th National Congress of the CPC (Communist Party of China) • …
Major challenges • Getting the right words: • Choosing the correct root form • Getting the correct inflected form • Inserting “spontaneous” words • Putting the words in the correct order: • Word order: SVO vs. SOV, … • Unique construction: • Structural divergence
Word order • SVO, SOV, VSO, … • VP + PP PP VP • VP + AdvP AdvP + VP • Adj + N N + Adj • NP + PP PP NP • NP + S S NP • P + NP NP + P
“Unique” Constructions • Overt wh-movement: Eng-Ch: • Eng: Why do you think that he came yesterday? • Ch: you why think he yesterday come ASP? • Ch: you think he yesterday why come? • Ba-construction: Ch-Eng • She ba homework finish ASP She finished her homework. • He ba wall dig ASP CL hole He digged a hole in the wall. • She ba orange peel ASP skin She peeled the orange’s skin.
Translation divergences • Source and target parse trees (dependency trees) are not identical. • Example: I like Mary S: Marta me gusta a mi (‘Mary pleases me’) • More discussion next time.
How humans do translation? • Learn a foreign language: • Memorize word translations • Learn some patterns: • Exercise: • Passive activity: read, listen • Active activity: write, speak • Translation: • Understand the sentence • Clarify or ask for help (optional) • Translate the sentence Training stage Translation lexicon Templates, transfer rules Reinforced learning? Reranking? Decoding stage Parsing, semantics analysis? Interactive MT? Word-level? Phrase-level? Generate from meaning?
What kinds of resources are available to MT? • Translation lexicon: • Bilingual dictionary • Templates, transfer rules: • Grammar books • Parallel data, comparable data • Thesaurus, WordNet, FrameNet, … • NLP tools: tokenizer, morph analyzer, parser, … More resources for major languages, less for “minor” languages.
Major approaches • Transfer-based • Interlingua • Example-based (EBMT) • Statistical MT (SMT) • Hybrid approach
The MT triangle Meaning (interlingua) Synthesis Analysis Transfer-based Phrase-based SMT, EBMT Word-based SMT, EBMT word Word
Transfer-based MT • Analysis, transfer, generation: • Parse the source sentence • Transform the parse tree with transfer rules • Translate source words • Get the target sentence from the tree • Resources required: • Source parser • A translation lexicon • A set of transfer rules • An example: Mary bought a book yesterday.
Transfer-based MT (cont) • Parsing: linguistically motivated grammar or formal grammar? • Transfer: • context-free rules? A path on a dependency tree? • Apply at most one rule at each level? • How are rules created? • Translating words: word-to-word translation? • Generation: using LM or other additional knowledge? • How to create the needed resources automatically?
Interlingua • For n languages, we need n(n-1) MT systems. • Interlingua uses a language-independent representation. • Conceptually, Interlingua is elegant: we only need n analyzers, and n generators. • Resource needed: • A language-independent representation • Sophisticated analyzers • Sophisticated generators
Interlingua (cont) • Questions: • Does language-independent meaning representation really exist? If so, what does it look like? • It requires deep analysis: how to get such an analyzer: e.g., semantic analysis • It requires non-trivial generation: How is that done? • It forces disambiguation at various levels: lexical, syntactic, semantic, discourse levels. • It cannot take advantage of similarities between a particular language pair.
Example-based MT • Basic idea: translate a sentence by using the closest match in parallel data. • First proposed by Nagao (1981). • Ex: • Training data: • w1 w2 w3 w4 w1’ w2’ w3’ w4’ • w5 w6 w7 w5’ w6’ w7’ • w8 w9 w8’ w9’ • Test sent: • w1 w2 w6 w7 w9 w1’ w2’ w6’ w7’ w9’
EMBT (cont) • Types of EBMT: • Lexical (shallow) • Morphological / POS analysis • Parse-tree based (deep) • Types of data required by EBMT systems: • Parallel text • Bilingual dictionary • Thesaurus for computing semantic similarity • Syntactic parser, dependency parser, etc.
EBMT (cont) • Word alignment: using dictionary and heuristics exact match • Generalization: • Clusters: dates, numbers, colors, shapes, etc. • Clusters can be built by hand or learned automatically. • Ex: • Exact match: 12 players met in Paris last Tuesday 12 Spieler trafen sich letzen Dienstag in Paris • Templates: $num players met in $city $time $num Spieler trafen sich $time in $city
Statistical MT • Basic idea: learn all the parameters from parallel data. • Major types: • Word-based • Phrase-based • Strengths: • Easy to build, and it requires no human knowledge • Good performance when a large amount of training data is available. • Weaknesses: • How to express linguistic generalization?
Hybrid MT • Basic idea: combine strengths of different approaches: • Syntax-based: generalization at syntactic level • Interlingua: conceptually elegant • EBMT: memorizing translation of n-grams; generalization at various level. • SMT: fully automatic; using LM; optimizing some objective functions. • Types of hybrid HT: • Borrowing concepts/methods: • SMT from EBMT: phrase-based SMT; Alignment templates • EBMT from SMT: automatically learned translation lexicon • Transfer-based from SMT: automatically learned translation lexicon, transfer rules; using LM • … • Using two MTs in a pipeline: • Using transfer-based MT as a preprocessor of SMT • Using multiple MTs in parallel, then adding a re-ranker.
Evaluation • Unlike many NLP tasks (e.g., tagging, chunking, parsing, IE, pronoun resolution), there is no single gold standard for MT. • Human evaluation: accuracy, fluency, … • Problem: expensive, slow, subjective, non-reusable. • Automatic measures: • Edit distance • Word error rate (WER), Position-independent WER (PER) • Simple string accuracy (SSA), Generation string accuracy (GSA) • BLEU
Edit distance • The Edit distance (a.k.a. Levenshtein distance) is defined as the minimal cost of transforming str1 into str2, using three operations (substitution, insertion, deletion). • Use DP and the complexity is O(m*n).
WER, PER, and SSA • WER (word error rate) is edit distance, divided by |Ref|. • PER (position-independent WER): same as WER but disregards word ordering • SSA (Simple string accuracy) = 1 - WER • Previous example: • Sys: w1 w2 w3 w4 • Ref: w1 w3 w2 • Edit distance = 2 • WER=2/3 • PER=1/3 • SSA=1/3
Generation string accuracy (GSA) Example: Ref: w1 w2 w3 w4 Sys: w2 w3 w4 w1 Del=1, Ins=1 SSA=1/2 Move=1, Del=0, Ins=0 GSA=3/4
BLEU • Proposal by Papineni et. al. (2002) • Most widely used in MT community. • BLEU is a weighted average of n-gram precision (pn) between system output and all references, multiplied by a brevity penalty (BP).
N-gram precision • N-gram precision: the percent of n-grams in the system output that are correct. • Clipping: • Sys: the the the the the the • Ref: the cat sat on the mat • Unigram precision: • Max_Ref_count: the max number of times a ngram occurs in any single reference translation.
N-gram precision i.e. the percent of n-grams in the system output that are correct (after clipping).
Brevity Penalty • For each sent si in system output, find closest matching reference ri (in terms of length). • Longer system output is already penalized by the n-gram precision measure.
An example • Sys: The cat was on the mat • Ref1: The cat sat on a mat • Ref2: There was a cat on the mat • Assuming N=3 • p1=5/6, p2=3/5, p3=1/4, BP=1 BLEU=0.50 • What if N=4?
Summary • Course overview • Major challenges in MT • Choose the right words (root form, inflection, spontaneous words) • Put them in right positions (word order, unique constructions, divergences)
Summary (cont) • Major approaches • Transfer-based MT • Interlingua • Example-based MT • Statistical MT • Hybrid MT • Evaluation of MT systems • Edit distance • WER, PER, SSA, GSA • BLEU