

  1. LING / C SC 439/539 Statistical Natural Language Processing • Lecture 17 • 3/20/2013

  2. Recommended reading • Jurafsky & Martin Chapter 25, Machine Translation • Warren Weaver. 1949. Translation. Reprinted in: Locke, W.N. and Booth, A.D. (eds.) Machine translation of languages: fourteen essays (Cambridge, Mass.: Technology Press of the Massachusetts Institute of Technology, 1955), pp. 15-23. • P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin. 1990. A Statistical Approach to Machine Translation. Computational Linguistics 16(2). • P. Brown, S. Della Pietra, V. Della Pietra, & R. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19(2). • William Gale and Kenneth Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics 19(1). • K. Papineni et al. 2002. BLEU: a method for automatic evaluation of machine translation. Proceedings of ACL.

  3. Outline • Machine translation techniques • Evaluating translation • Noisy channel model of statistical translation • Alignments • Estimating alignments from parallel corpora • IBM Model 1

  4. Machine translation • Translate from a source language to a target language • Usually text-to-text translation • Also speech-to-speech translation, involving ASR (automatic speech recognition) and speech synthesis • Source language utterance → (ASR) → source language text → (MT) → target language text → (speech synthesis) → target language utterance • Application: develop wearable computers for soldiers in foreign countries

  5. Three approaches to MT: direct, transfer, interlingual (the Vauquois triangle) • [Figure: the Vauquois triangle. At the base, direct translation maps source word structure to target word structure via morphological analysis and generation. One level up, syntactic analysis yields syntactic structure, mapped by syntactic transfer and rendered by syntactic generation. Above that, semantic analysis yields semantic structure, mapped by semantic transfer and rendered by semantic generation. At the apex, semantic composition produces an interlingua, from which semantic decomposition leads to the target text.]

  6. Approaches to MT • Direct (word-for-word) translation • Syntactic transfer • Interlingua • Example-based • Statistical

  7. Word-for-word translation • Use a machine-readable bilingual dictionary to translate each word in a text • Advantages • Easy to implement; results give a rough idea of what the text is about • Disadvantages • Problems with word order mean that this yields low-quality translations (see the sketch below)
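A minimal sketch of this idea, assuming a toy machine-readable dictionary (the French-English entries below are made up for illustration):

```python
# Word-for-word translation: independent dictionary lookup per token.
# The tiny dictionary is a made-up illustration, not a real resource.
bilingual_dict = {
    "le": "the", "chien": "dog", "mange": "eats",
    "la": "the", "pomme": "apple",
}

def word_for_word(source_sentence):
    """Translate each token independently; keep unknown words unchanged."""
    return " ".join(bilingual_dict.get(w, w)
                    for w in source_sentence.lower().split())

print(word_for_word("Le chien mange la pomme"))  # -> "the dog eats the apple"
```

Note that nothing here can reorder words, which is exactly the weakness the slide points out.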

  8. Syntactic transfer rules • Three steps: parse sentence → rearrange constituents → translate words • Advantage: deals with the word-order problem • Disadvantages • For a set of languages, transfer rules must be constructed for each language pair • The mapping isn’t always straightforward: English word order is subject-verb-object, Japanese order is subject-object-verb (see the sketch below)
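A minimal sketch of one such transfer rule for the English-to-Japanese case, assuming the sentence has already been parsed into (subject, verb, object) constituents; the toy romaji lexicon is illustrative only:

```python
# One syntactic transfer rule: English SVO -> Japanese SOV.
# Assumes parsing has already produced (subject, verb, object);
# the lexicon entries are illustrative romaji glosses.
def transfer_svo_to_sov(subject, verb, obj, lexicon):
    """Rearrange constituents, then translate each word."""
    reordered = [subject, obj, verb]  # SVO -> SOV
    return " ".join(lexicon.get(w, w) for w in reordered)

lexicon = {"John": "Jon", "reads": "yomu", "books": "hon"}  # toy lexicon
print(transfer_svo_to_sov("John", "reads", "books", lexicon))  # -> "Jon hon yomu"
```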

  9. Interlingua • Represent sentence with an abstract logical form • John must not go = OBLIGATORY(NOT(GO(JOHN))) • John may not go = NOT(PERMITTED(GO(JOHN))) • Use logical form to generate a sentence in another language • Advantages • Use a single logical form to translate between all languages (interlingua) • Disadvantages • Difficult to define a single logical form for all languages • Lose stylistic information

  10. Example-based MT • Idea • People do not translate by doing deep linguistic analysis • They translate by decomposing sentence into fragments, translating each of those, and then composing them appropriately • Use a corpus of aligned examples • May be human-produced

  11. Example of example-based MT • Translate: He buys a book on international politics • With these examples: • (He buys) a notebook. (Kare ha) nouto (wo kau). • I read (a book on international politics). Watashi ha (kokusai seiji ni tsuite kakareta hon) wo yomu → (Kare ha) (kokusai seiji ni tsuite kakareta hon) (wo kau).

  12. Challenges for example-based MT • Corpus issues • Automatic alignment: locate similar sentences, align sub-sentential fragments • Human-produced examples may have limited coverage • Translation issues • Combining multiple fragments of example translations into a single sentence • Determining when it is appropriate to substitute one fragment for another • Selecting the best translation out of many candidates

  13. Statistical MT • Find the most probable target-language sentence given a source-language (foreign) sentence • Probabilities are determined automatically by training a statistical model on a parallel corpus • Automatically align words and phrases within sentence pairs in a parallel corpus

  14. Statistical MT • Advantages: • Can handle lexical ambiguity • Can be created for any language pair that has enough training data • Requires minimal human effort • No need for experts in the language • Can deal with idioms that occur in the training data • Disadvantages: • Does not explicitly deal with syntax

  15. Outline • Machine translation techniques • Evaluating translation • Noisy channel model of statistical translation • Alignments • Estimating alignments from parallel corpora • IBM Model 1

  16. Evaluation of MT • Human-based metrics • Semantic Invariance • Pragmatic Invariance • Lexical Invariance • Structural Invariance • Spatial Invariance • Fluency • Accuracy • “Do you understand it?” • Automatic metrics: BLEU

  17. BLEU: BiLingual Evaluation Understudy (Papineni et al., 2002) • Compares MT output to a set of reference translations • Reference translations: produced by humans • Judge “closeness” of translation numerically • Compare n-gram matches between the candidate translation and one or more reference translations

  18. Chinese-English translation example: Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party. Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct. Human-produced reference translations: Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party.

  19. Look for matching n-grams Chinese-English translation example: Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party. Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct. Human-produced reference translations: Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party.

  20. Compute BLEU unigram precision scores: candidate #1 It(1), is(1), a(1), guide(1), to(1), action(1), which(1), ensures(1), that(2), the(4), military(1), always(1), obeys(0), commands(1), of(1), party(1) (each count is the maximum number of times that unigram occurs in any single reference) Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. 15 / 16 distinct unigrams matched = 93.8%

  21. Unigram precision: candidate #2 It(1), is(1), to(1), insure(0), the(4), troops(0), forever(1), hearing(0), activity(0), guidebook(0), that(2), party(1), direct(0) Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. 7 / 13 distinct unigrams matched = 53.8%

  22. Bigram precision: candidate #1 It is(1), is a(1), a guide(1), guide to(1), to action(1), action which(0), which ensures(0), ensures that(1), that the(1), the military(1), military always(0), always obeys(0), obeys the(0), the commands(0), commands of(0), of the(1), the party(1) Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. 10 / 17 bigrams matched = 58.8%

  23. Bigram precision: candidate #2 It is(1), is to(0), to insure(0), insure the(0), the troops(0), troops forever(0), forever hearing(0), hearing the(0), the activity(0), activity guidebook(0), guidebook that(0), that party(0), party direct(0) Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. 1 / 13 bigrams matched = 7.7%
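The clipped (“modified”) n-gram precision behind these numbers can be sketched in a few lines of Python, following Papineni et al. (2002). One caveat: this standard version counts tokens with clipping, while the slides above tally distinct n-gram types, so its unigram figures differ slightly (e.g. 17/18 rather than 15/16 for candidate 1); the bigram results agree.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    cand = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

refs = [r.split() for r in (
    "It is a guide to action that ensures that the military will forever heed Party commands",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party",
    "It is the practical guide for the army always to heed the directions of the party",
)]
cand2 = ("It is to insure the troops forever hearing "
         "the activity guidebook that party direct").split()
print(modified_precision(cand2, refs, 2))  # 0.0769... = 1/13, as on the slide
```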

  24. BLEU score: combines the n-gram precisions (usually up to n = 4) via a weighted geometric mean, scaled by a brevity penalty for candidates shorter than the references
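Concretely, the standard formulation from Papineni et al. (2002) is

    BLEU = BP · exp( Σ_{n=1..N} w_n · log p_n ),   BP = 1 if c > r, else exp(1 − r/c)

where p_n is the modified n-gram precision, the weights w_n are typically uniform (w_n = 1/N with N = 4), c is the candidate length, and r is the effective reference length. The brevity penalty BP keeps a very short candidate from scoring well on precision alone.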

  25. Outline • Machine Translation techniques • Evaluating translation • Noisy channel model of statistical translation • Alignments • Estimating alignments from parallel corpora • IBM Model 1

  26. Statistical machine translation • http://www.nytimes.com/2010/03/09/technology/09translate.html • Franz Och, with a copy of the Rosetta Stone, said Google’s translation tool “can make the language barrier go away.”

  27. Translating a document statistically

  28. Warren Weaver, Translation (1949) • Warren Weaver • Mathematician/engineer • Worked on communication, information theory, statistics • Wrote a memorandum, Translation, in 1949 • Speculated on possible applications of computers to language • Sent the memorandum to his colleagues, initiating research in machine translation

  29. Weaver 1949 • Proposes local word sense disambiguation (a topic we will cover later in the course) • ‘If one examines the words in a book, one at a time through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of words. "Fast" may mean "rapid"; or it may mean "motionless"; and there is no way of telling which. • But, if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then, if N is large enough one can unambiguously decide the meaning. . .’

  30. Weaver 1949 • Proposes interlingua for machine translation • ‘Thus it may be true that the way to translate from Chinese to Arabic, or from Russian to Portuguese, is not to attempt the direct route, shouting from tower to tower. Perhaps the way is to descend, from each language, down to the common base of human communication—the real but as yet undiscovered universal language—and—then re-emerge by whatever particular route is convenient.’

  31. Weaver 1949 • Proposes statistical machine translation using information theory • ‘It is very tempting to say that a book written in Chinese is simply a book written in English which was coded into the "Chinese code." If we have useful methods for solving almost any cryptographic problem, may it not be that with proper interpretation we already have useful methods for translation?’

  32. Return to statistical MT • Brown et al. 1990: • ‘In 1949, Warren Weaver proposed that statistical techniques from the emerging field of information theory might make it possible to use modern digital computers to translate text from one natural language to another automatically. Although Weaver's scheme foundered on the rocky reality of the limited computer resources of the day, a group of IBM researchers in the late 1980's felt that the increase in computer power over the previous forty years made reasonable a new look at the applicability of statistical techniques to translation. Thus the "Candide" project, aimed at developing an experimental machine translation system, was born at IBM TJ Watson Research Center.’

  33. Return to statistical MT • ‘The Candide group adopted an information-theoretic perspective on the MT problem, which goes as follows. In speaking a French sentence F, a French speaker originally thought up a sentence E in English, but somewhere in the noisy channel between his brain and mouth, the sentence E got "corrupted" to its French translation F. The task of an MT system is to discover E* = argmax_{E′} p(F|E′) p(E′); that is, the MAP-optimal English sentence, given the observed French sentence. This approach involves constructing a model of likely English sentences, and a model of how English sentences translate to French sentences. Both these tasks are accomplished automatically with the help of a large amount of bilingual text.’

  34. How a statistical MT system learns

  35. Statistical MT Systems • [Diagram: Spanish/English bilingual text and monolingual English text each undergo statistical analysis; the result maps Spanish into “broken English” and broken English into fluent English. Example: “Que hambre tengo yo” → candidates “What hunger have I”, “Hungry I am so”, “I am so hungry”, “Have I that hunger”, … → “I am so hungry”.]

  36. Statistical MT Systems • [Diagram: the same pipeline with its components labeled. Bilingual text trains the translation model p(f|e); English text trains the language model p(e); the decoding algorithm computes argmax_e p(e) · p(f|e), mapping “Que hambre tengo yo” to “I am so hungry”.]

  37. [Table by Jason Eisner: candidate English translations of “On voit Jon à la télévision” (“We see Jon on television”).]

  38. Probability model for translation • Probability of a translation pair: p(e, f) = p(e) · p(f|e) • Most likely English translation for a source sentence in a foreign language: • argmax_e p(e, f) = argmax_e p(e) · p(f|e)

  39. Three Problems for Statistical MT • Language model • Assigns a higher probability to fluent / grammatical sentences • Estimated using monolingual corpora • Good English string → high p(e) • Random word sequence → low p(e) • Translation model • Assigns higher probability to sentences that have corresponding meanings • Estimated using bilingual corpora • <f,e> look like translations → high p(f|e) • <f,e> don’t look like translations → low p(f|e) • Decoding algorithm • Given a language model, a translation model, and a new sentence f, find the translation e maximizing p(e) · p(f|e) (a sketch follows below)
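A minimal sketch of the decoding step for the running Spanish example, with made-up probability tables standing in for trained models; a real decoder searches an enormous hypothesis space rather than scoring a short fixed list:

```python
import math

# Toy noisy-channel decoder for f = "Que hambre tengo yo":
# pick the e maximizing p(e) * p(f|e).
# Candidate list and probabilities are made-up illustrations.
candidates = ["I am so hungry", "Hungry I am so", "What hunger have I"]

p_e = {  # language model: fluent English scores higher
    "I am so hungry": 1e-4, "Hungry I am so": 1e-7, "What hunger have I": 1e-8,
}
p_f_given_e = {  # translation model: adequacy given each English candidate
    "I am so hungry": 1e-3, "Hungry I am so": 2e-3, "What hunger have I": 5e-3,
}

def decode(candidates):
    """argmax_e log p(e) + log p(f|e)"""
    return max(candidates,
               key=lambda e: math.log(p_e[e]) + math.log(p_f_given_e[e]))

print(decode(candidates))  # -> "I am so hungry"
```

Note how the fluent but less literal candidate wins: the language model compensates for a slightly lower translation-model score, which is the division of labor the next slide describes.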

  40. Components of statistical translation • Probability of a translation pair: p(e, f) = p(e)*p(f|e) • Word reordering in translation handled by p(e) • p(e) factor frees p(f|e) from worrying about word order in the source language • Word choice in translation handled by p(f|e) • p(f|e) factor frees p(e) from worrying about picking the right translation • Example of a noisy channel model • Also called source-channel model

  41. (from D. Klein)

  42. (from D. Klein)

  43. (from D. Klein)

  44. (from D. Klein) POS tagging as a noisy channel: p(tags | words) ∝ p(tags) · p(words | tags)

  45. Outline • Machine Translation techniques • Evaluating translation • Noisy channel model of statistical translation • Alignments • Estimating alignments from parallel corpora • IBM Model 1

  46. An example alignment • Fertility: a word may be aligned with multiple words • Distortion: a word may change position in the sentence • [Figure: an example sentence-pair alignment with fertility and distortion marked.]
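Concretely, an alignment is often represented as a set of position links between the two sentences; the English-French pair below is a hypothetical illustration of how fertility is read off such links:

```python
# Hypothetical aligned pair; links are (english_index, french_index), 0-based.
english = ["the", "program", "works"]
french = ["le", "programme", "fonctionne", "bien"]
links = [(0, 0), (1, 1), (2, 2), (2, 3)]  # "works" links to two French words

# Fertility of an English word = how many French words it is linked to.
fertility = {word: sum(1 for e, _ in links if e == i)
             for i, word in enumerate(english)}
print(fertility)  # {'the': 1, 'program': 1, 'works': 2}

# Distortion would show up as links (e, f) whose positions differ;
# here every link keeps roughly the same position, so distortion is small.
```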

  47. Where will we get p(F|E)? • Learn alignments from books in English paired with the same books in French, producing a p(F|E) model • We call collections of texts stored in two languages parallel corpora or parallel texts

  48. The Rosetta Stone (196 BC) • Egyptian: hieroglyphs (used from 3300 BC to 400 AD) • Egyptian: Demotic (a late cursive script) • Greek (the language of Ptolemy V, ruler of Egypt) • In 1799 a stone with an Egyptian text and its translation into Greek was found → humans could learn how to translate Egyptian

  49. What is “heaven” in Vietnamese? • English: In the beginning God created the heavens and the earth. Vietnamese: Ban dâu Dúc Chúa Tròi dung nên tròi dât. • English: God called the expanse heaven. Vietnamese: Dúc Chúa Tròi dat tên khoang không la tròi. • English: … you are this day like the stars of heaven in number. Vietnamese: … các nguoi dông nhu sao trên tròi. • Example by Jason Eisner
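A first rough guess at word translations can come from plain co-occurrence counts over aligned sentence pairs; this is simpler than the IBM models in the outline, and the toy corpus below loosely abbreviates the slide’s examples with diacritics stripped:

```python
from collections import Counter

# Count how often each (english_word, foreign_word) pair co-occurs in
# aligned sentence pairs; frequent pairs are translation candidates.
# Toy pairs, loosely abbreviated from the slide's examples.
pairs = [
    ("god created the heavens and the earth", "duc chua troi dung nen troi dat"),
    ("god called the expanse heaven", "duc chua troi dat ten khoang khong la troi"),
    ("like the stars of heaven", "dong nhu sao tren troi"),
]

cooc = Counter()
for en, vi in pairs:
    for e in set(en.split()):
        for v in set(vi.split()):
            cooc[(e, v)] += 1

# Vietnamese words most often co-occurring with "heaven":
heaven = [(v, c) for (e, v), c in cooc.items() if e == "heaven"]
print(sorted(heaven, key=lambda x: -x[1])[:3])  # "troi" tops the list
```

Raw co-occurrence is noisy (function words co-occur with everything); the EM-based IBM Model 1, coming next, fixes this by letting translation probabilities explain away the counts.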

  50. Variation in literalness of translations, from French-English European Parliament proceedings (French | corpus English | closely literal English translation):
• Le débat est clos . | The debate is closed . | The debate is closed.
• Accepteriez-vous ce principe ? | Would you accept that principle ? | Accept-you that principle?
• Merci , chère collègue . | Thank you , Mrs Marinucci . | Thank you, dear colleague.
• Avez-vous donc une autre proposition ? | Can you explain ? | Have you therefore another proposal?
