1 / 48

Machine Translation

Machine Translation. Introduction to MT. Machine Translation. Helping human translators. Fully automatic. Enter Source Text:.  这 不过 是 一 个 时间 的 问题. Translation from Stanford’s Phrasal :. This is only a matter of time. Google Translate. Fried ripe plantains:

amie
Télécharger la présentation

Machine Translation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Machine Translation Introduction to MT

  2. Machine Translation • Helping human translators • Fully automatic Enter Source Text:  这 不过 是 一 个 时间 的 问题 . Translation from Stanford’s Phrasal: This is only a matter of time.

  3. Google Translate • Fried ripe plantains: • http://laylita.com/recetas/2008/02/28/platanos-maduros-fritos/

  4. Machine Translation • The Story of the Stone (“The Dream of the Red Chamber”) • Cao Xueqin 1792 • Chinese gloss: Dai-yu alone at bed on think-of-with-gratitude Bao-chai… again listen to window outside bamboo tip plantain leaf of on, rain sound sigh drop, clear cold penetrate curtain, not feeling again fall down tears come. • Hawkes translation: As she lay there alone, Dai-yu’s thoughts turned to Bao-chai… Then she listened to the insistent rustle of the rain on the bamboos and plantains outside her window. The coldness penetrated the curtains of her bed. Almost without noticing it she had begun to cry.

  5. Difficulties in Chinese to English translation • Long Chinese sentences: 4 English sentences to 1 Chinese • Chinese no pronouns or articles (English the, a) • Chinese has locative post-positions, English prepositions • Chinese bed on, window outside, English on the bed, outside the window • Chinese rarely marks tense: • English as, turned to, had begun, • Chinesetou, ‘penetrate’ -> English penetrated • Chinese relative clauses are before the noun, English after • Chinese: [window outside bamboo on] rain • English: rain [on the bamboo outside the window] • Stylistic and cultural differences • Chinese bamboo tip plaintain leaf -> bamboos and plantains • Chinese rain sound sigh drop -> insistent rustle of the rain • Chinese ma ‘curtain’ -> curtains of her bed

  6. Alignment in Machine Translation

  7. Early MT History 1946 Booth and Weaver discuss MT in New York 1947-48 idea of dictionary-based direct translation 1947 Warren Weaver suggests translation by computer 1949 Weaver memorandum 1952 all 18 MT researchers in world meet at MIT 1954 IBM/Georgetown Demo Russian-English MT 1955-65 lots of labs take up MT http://www.hutchinsweb.me.uk/PPF-TOC.htm

  8. 1949 Weaver memorandum • http://www.mt-archive.info/Weaver-1949.pdf • “There are certain invariant properties which are… common to all languages” • ‘When I look at an article in Russian, I say "This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”’ • “[If] one can see… N words on either side, then, if N is large enough, one can unambiguously decide the meaning of the central word.”

  9. The History of MT: Pessimism • 1959/1960 • Yehoshua Bar-Hillel “Report on the state of MT in US and GB” • FAHQ MT too hardbecause we would have to encode all of human knowledge • Instead we should work on computer tools for human translators

  10. The claim that fully automatic high quality MT is impossible • Pen1: Enclosure for small children Yehoshua Bar-Hillel. 1960. A Demonstration of the Nonfeasibility of Fully Automatic High Quality Translation. • Little John was looking for his toy box. Finally he found it. The box was in the pen. John was very happy. • Pen1: Enclosure for small children • Pen2: Writing utensil

  11. The box was in the pen.

  12. The claim that fully automatic high quality MT is impossible Yehoshua Bar-Hillel, 1960 “I now claim that no existing or imaginable program will enable an electronic computer to determine…”

  13. The state of the art in MT

  14. The state of the art in MT

  15. History of MT: Further PessimismThe ALPAC report • Headed by John R. Pierce of Bell Labs • Conclusions: • MT doesn’t work • MT a failure: all current MT work had to be post-edited • Intelligibility and informativenessworse than human • We don’t need MT anyhow • Already too many human translators from Russian • Results: MT research suffered • Funding loss • Number of research labs declined • Association for Machine Translation and Computational Linguistics dropped MT from its name

  16. MT in the modern age • 1975-1985 Resurgence of MT in Europe and Japan • Domain-specific rule-based systems • 1990-present • Rise of Statistical Machine Translation

  17. Machine Translation Introduction to MT

  18. Machine Translation Language Divergences

  19. Language Similarities and Divergences • Typology: • the study of systematic cross-linguistic similarities and differences • What are the dimensions along which human languages vary?

  20. Syntactic Variation: Basic Word Orders In many languages one word order is more basic • SVO (Subject-Verb-Object) languages English, German, French, Mandarin I baked a pizza • SOV Languages Japanese, Hindi English: He adores listening to music Japanese: kare ha ongakuwokiku no gadaisukidesu he music to listening adores • VSO languages • Irish, Classical Arabic, Tagalog

  21. Morphology • Morpheme:“Minimal meaningful unit of language” Word = Morpheme + Morpheme + Morpheme +… • Stems: (base form, root) hope+ing hoping hop  hopping • Affixes • Prefixes: Antidisestablishmentarianism • Suffixes: Antidisestablishmentarianism • Infixes: hingi (borrow) – humingi (borrower) in Tagalog • Circumfixes: sagen (say) – gesagt (said) in German

  22. Morphemes per Word Joseph Greenberg. 1954. A Quantitative Approach to the Morphological Typology of Language. IJAL 26:3. isolating synthetic 1 2 3 4 1.06 1.68 2.17 2.55 3.72 Vietnamese English Yakut (Turkic) Swahili West Greenlandic (Eskimo-Inuit)

  23. Fewmorphemes per word: Cantonese “He said this was the biggest building in the whole country” Each word in this sentence has one morpheme (and one syllable): keuiwachyuhngwokjeuidaaihgaanngukhaih li gaan he say entire country most big bldg house is this bldg

  24. Many Morphemes per word: Turkish uygarlaştıramadıklarımızdanmışsınızcasına uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına Behaving as if you are among those whom we could not cause to become civilized

  25. Word SegmentationAre word boundaries marked in writing? • Some writing systems: boundaries between words not marked • Chinese, Japanese, Thai • Word segmentation becomes an important part of text normalization for MT • Some languages tend to have sentences that are quite long, closer to English paragraphs than sentences: • Modern Standard Arabic, Chinese • Sentence segmentation may be necessary for MT between these languages and languages like English

  26. Inferential Load: cold vs. hot languages BalthasarBickel. 2003. Referential density in discourse and syntactic typology. Language 79:2, 708-36 • Hot languages: • Who did what to whom is marked explicitly • English • Cold languages: • The hearer has more “figuring out” of who the various actors in the various events are • Japanese, Chinese

  27. Inferential Load: The blue noun phrases are not in the Chinese original 飓风丽塔已经减弱为第三级飓风, Rita weakened and was downgraded to a Category 3 storm; ø迫近美国德课萨斯州和路易斯安那州, [Rita/it/the storm]is moving close to Texas and Louisiana; 当局表示, the authorities announced; 虽然 ø在登陆前可能再稍微减弱, although [Rita/it/the storm] might weaken again before landing, 但 ø仍然会非常危险, [Rita/it/the storm] is still very dangerous; ø预料 ø会在当地时间星期六凌晨在德州和路易斯安那州之间登陆, [the authorities] predict [Rita/it/the storm] will arrive at the Texas-Louisiana border on Saturday morning local time; ø直接吹袭休斯敦市东面的主要炼油设施。 [Rita/it/the storm] will directly hit the oil-refining industry east of Houston.

  28. Lexical Divergences • Word to phrases: • English computer science • French informatique • Part of Speech divergences • English She likes to sing • German Siesingtgerne[She sings likefully] • English I’m hungry • Spanish Tengohambre[I have hunger]

  29. Lexical Specificity Divergences • Grammatical specificity • Spanish: plural pronouns have gender (ellos/ellas) • English: plural pronouns no gender (they) • So translating “they” from English to Spanish, need to figure out gender of the referent!

  30. Lexical Divergences: Semantic Specificity English brother Mandarin gege(older brother), didi(younger brother) English wall German Wand(inside) Mauer(outside) English fish Spanish pez(the creature) pescado(fish as food) Cantonesengau Englishcow beef

  31. Predicate Argument divergences L. Talmy. 1985. Lexicalization patterns: Semantic Structure in Lexical Form. • English Spanish The bottle floated out. La botellasalióflotando. The bottle exited floating • Satellite-framed languages: • direction of motion is marked on the satellite • Crawl out, float off, jump down, walk over to, run after • Most of Indo-European, Hungarian, Finnish, Chinese • Verb-framed languages: • direction of motion is marked on the verb • Spanish, French, Arabic, Hebrew, Japanese, Tamil, Polynesian, Mayan, Bantu families

  32. Predicate Argument divergences:Heads and Argument swapping Dorr, Bonnie J., "Machine Translation Divergences: A Formal Description and Proposed Solution," Computational Linguistics, 20:4, 597--633 Arguments: Spanish: Yme gusta English: I like Y German: Der Terminfälltmirein English: Iforgetthe date Heads: English: X swim across Y Spanish: X crucar Y nadando English: I liketo eat German: Ichessegern English: I’d prefer vanilla German: Mir wäreVanillelieber

  33. Predicate-Argument Divergence Counts B.Dorr et al. 2002. DUSTer: A Method for Unraveling Cross-Language Divergences for Statistical Word-Level Alignment Found divergences in 32% of sentences in UN Spanish/English Corpus

  34. Machine Translation Language Divergences

  35. Machine Translation Three classical methods for MT

  36. 3 Classical methods for MT Direct Transfer Interlingua

  37. Three MT Approaches: Direct, Transfer, Interlingual

  38. Direct Translation • Proceed word-by-word through text • Translating each word • No intermediate structures except morphology • Knowledge is in the form of • Huge bilingual dictionary • word-to-word translation information • After word translation, can do simple reordering • Adjective ordering English -> French/Spanish

  39. Direct MT Dictionary entry

  40. Direct MT

  41. Problems with direct MT German Chinese

  42. The Transfer Model • Idea: apply contrastive knowledge, i.e., knowledge about the difference between two languages • Steps: • Analysis: Syntactically parse source language • Transfer: Rules to turn this parse into parse for target language • Generation: Generate target sentence from parse tree

  43. English to French English: Adjective Noun French: Noun Adjective • This is not always true Route mauvaise ‘bad road, badly-paved road’ Mauvaise route ‘wrong road’ • But is a reasonable first approximation • Rule:

  44. Transfer rules

  45. Transferring the green witch….

  46. Interlingua • Instead of N2 sets of transfer rules • Use meaning as a representation language • Parse source sentence into meaning representation • Generate target sentence from meaning. • Intuition: Use other NLP applications to do MT work • Englishbookto Spanish: libroorreservar • Disambiguate book into conceptsBOOKVOLUME and RESERVE • Need 2N systems (a parser and generator for each language)

  47. Interlingua for Mary did not slap the green witch

  48. Machine Translation Three classical methods for MT

More Related