LING 138/238 SYMBSYS 138 Intro to Computer Speech and Language Processing


  1. LING 138/238 SYMBSYS 138 Intro to Computer Speech and Language Processing • Lecture 11: Machine Translation (I) • November 2, 2004 • Dan Jurafsky • Thanks to Bonnie Dorr for some of these slides!! LING 138/238 Autumn 2004

  2. Outline for MT Week • Intro and a little history • Language Similarities and Divergences • Four main MT Approaches • Transfer • Interlingua • Direct • Statistical • Evaluation

  3. What is MT? • Translating a text from one language to another automatically.

  4. Machine Translation • Dai-yu alone on bed top think-of-with-gratitude Bao-chai again listen to window outside bamboo tip plantain leaf of on-top rain sound sigh drop clear cold penetrate curtain not feeling again fall down tears come • As she lay there alone, Dai-yu’s thoughts turned to Bao-chai… Then she listened to the insistent rustle of the rain on the bamboos and plantains outside her window. The coldness penetrated the curtains of her bed. Almost without noticing it she had begun to cry.

  5. Machine Translation • The Story of the Stone • = The Dream of the Red Chamber (Cao Xueqin 1792) • Issues: • Breaking up into words • Breaking up into sentences • Zero-anaphora • Penetrate -> penetrated • Bamboo tip plantain leaf -> bamboos and plantains • Curtain -> curtains of her bed • Rain sound sigh drop -> insistent rustle of the rain

  6. What is MT not good for? • Really hard stuff • Literature • Natural spoken speech (meetings, court reporting) • Really important stuff • Medical translation in hospitals, 911

  7. What is MT good for? • Tasks for which a rough translation is fine • Web pages, email • Tasks for which MT can be post-edited • MT as first pass • “Computer-aided human translation” • Tasks in sublanguage domains where high-quality MT is possible

  8. Sublanguage domain • Weather forecasting • “Cloudy with a chance of showers today and Thursday” • “Low tonight 4” • Can be modeled completely enough to use raw MT output • Word classes and semantic features like MONTH, PLACE, DIRECTION, TIME POINT

  9. MT History • 1946 Booth and Weaver discuss MT at the Rockefeller Foundation in New York • 1947-48 idea of dictionary-based direct translation • 1949 Weaver memorandum popularized the idea • 1952 all 18 MT researchers in the world meet at MIT • 1954 IBM/Georgetown demo of Russian-English MT • 1955-65 lots of labs take up MT

  10. History of MT: Pessimism • 1959/1960: Bar-Hillel “Report on the state of MT in US and GB” • Argued FAHQT (fully automatic high-quality translation) too hard (semantic ambiguity, etc.) • Should work on semi-automatic instead of automatic • His argument: “Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy.” • Only human knowledge lets us know that ‘playpens’ are bigger than boxes, but ‘writing pens’ are smaller • His claim: we would have to encode all of human knowledge

  11. History of MT: Pessimism • The ALPAC report • Headed by John R. Pierce of Bell Labs • Conclusions: • Supply of human translators exceeds demand • All the Soviet literature is already being translated • MT has been a failure: all current MT work had to be post-edited • Sponsored evaluations which showed that intelligibility and informativeness were worse than those of human translations • Results: • MT research suffered • Funding loss • Number of research labs declined • Association for Machine Translation and Computational Linguistics dropped MT from its name

  12. History of MT • 1976 Meteo, weather forecasts from English to French • Systran (Babelfish) has been used for 40 years • 1970’s: • European focus in MT; mainly ignored in US • 1980’s: • Ideas of using AI techniques in MT (KBMT, CMU) • 1990’s: • Commercial MT systems • Statistical MT • Speech-to-speech translation

  13. Language Similarities and Divergences • Some aspects of human language are universal or near-universal, others diverge greatly. • Typology: the study of systematic cross-linguistic similarities and differences • What are the dimensions along which human languages vary?

  14. Morphological Variation • Isolating languages • Cantonese, Vietnamese: each word generally has one morpheme • Vs. Polysynthetic languages • Siberian Yupik (‘Eskimo’): single word may have very many morphemes • Agglutinative languages • Turkish: morphemes have clean boundaries • Vs. Fusional languages • Russian: single affix may have many morphemes

  15. Syntactic Variation • SVO (Subject-Verb-Object) languages • English, German, French, Mandarin • SOV languages • Japanese, Hindi • VSO languages • Irish, Classical Arabic • SVO lgs generally have prepositions: to Yuriko • SOV lgs generally have postpositions: Yuriko ni

  16. Segmentation Variation • Not every writing system has word boundaries marked • Chinese, Japanese, Thai, Vietnamese • Some languages tend to have sentences that are quite long, closer to English paragraphs than sentences: • Modern Standard Arabic, Chinese
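When word boundaries are not written, an MT system has to recover them before anything else can happen. One simple baseline, not named on the slide and used here only as an illustration, is greedy longest-match ("max-match") segmentation against a word list. The sketch below applies it to the unsegmented romanized Japanese string from the Direct MT example later in the lecture; the function name `max_match` and the tiny `LEXICON` are assumptions made for this example.

```python
def max_match(text, lexicon):
    """Greedy left-to-right longest-match segmentation (a baseline sketch)."""
    tokens = []
    longest = max(len(word) for word in lexicon)
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for j in range(min(len(text), i + longest), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it alone so the loop always advances.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy lexicon covering only the example sentence.
LEXICON = {"watashi", "ha", "tsukue", "no", "ue", "pen", "wo", "jon", "ni", "ageta"}

tokens = max_match("watashihatsukuenouenopenwojonniageta", LEXICON)
print(" ".join(tokens))
```

Greedy matching can go wrong whenever a longer lexicon entry happens to span a real word boundary, which is one reason segmentation is listed as a genuine source of difficulty rather than a solved preprocessing step.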

  17. Inferential Load • Some languages require the hearer to do more “figuring out” of who the various actors in the various events are: • Japanese, Chinese • Other languages are pretty explicit about saying who did what to whom. • English

  18. Lexical Divergences • Word to phrases: • English “computer science” = French “informatique” • POS divergences • Eng. ‘she likes/VERB to sing’ • Ger. ‘Sie singt gerne/ADV’ • Eng. ‘I’m hungry/ADJ’ • Sp. ‘tengo hambre/NOUN’

  19. Lexical Divergences: Specificity • Grammatical constraints • English has gender on pronouns, Mandarin does not. • So translating “3rd person” from Chinese to English, need to figure out gender of the person! • Similarly from English “they” to French “ils/elles” • Semantic constraints • English ‘brother’ • Mandarin ‘gege’ (older) versus ‘didi’ (younger) • English ‘wall’ • German ‘Wand’ (inside) ‘Mauer’ (outside) • German ‘Berg’ • English ‘hill’ or ‘mountain’

  20. Lexical Divergence: one-to-many

  21. Lexical Divergence: lexical gaps • Japanese: no word for privacy • English: no word for Cantonese ‘haauseun’ or Japanese ‘oyakoko’ (something like ‘filial piety’) • English ‘cow’ versus ‘beef’, Cantonese ‘ngau’

  22. Event-to-argument divergences • English • The bottle floated out. • Spanish • La botella salió flotando. • The bottle exited floating • Verb-framed lg: mark direction of motion on verb • Spanish, French, Arabic, Hebrew, Japanese, Tamil, Polynesian, Mayan, Bantu families • Satellite-framed lg: mark direction of motion on satellite • Crawl out, float off, jump down, walk over to, run after • Rest of Indo-European, Hungarian, Finnish, Chinese

  23. Structural divergences • G: Wir treffen uns am Mittwoch • E: We’ll meet on Wednesday

  24. Head Swapping • E: X swim across Y • S: X cruzar Y nadando • E: I like to eat • G: Ich esse gern • E: I’d prefer vanilla • G: Mir wäre Vanille lieber

  25. Thematic divergence • S: Y me gusta • E: I like Y • G: Mir fällt der Termin ein • E: I forget the date

  26. Divergence counts from Bonnie Dorr • 32% of sentences in UN Spanish/English Corpus (5K)

  27. MT on the web • Babelfish: • http://babelfish.altavista.com/

  28. 3 methods for MT • Direct • Transfer • Interlingua

  29. Three MT Approaches: Direct, Transfer, Interlingual • [Figure: Vauquois triangle. Left side, rising from Source Text: Morphological Analysis -> Word Structure, Syntactic Analysis -> Syntactic Structure, Semantic Analysis -> Semantic Structure, Semantic Composition -> Interlingua. Right side, descending to Target Text: Semantic Decomposition, Semantic Generation, Syntactic Generation, Morphological Generation. Horizontal links between the two sides: Direct (word level), Syntactic Transfer, Semantic Transfer.] • This slide from Bonnie Dorr! Original metaphor due to Bernard Vauquois

  30. The Transfer Model • Idea: apply contrastive knowledge, i.e., knowledge about the difference between two languages • Steps: • Analysis: Syntactically parse Source language • Transfer: Rules to turn this parse into parse for Target language • Generation: Generate Target sentence from parse tree

  31. Transfer architecture

  32. English to French • Generally • English: Adjective Noun • French: Noun Adjective • Note: not always true • Route mauvaise ‘bad road, badly-paved road’ • Mauvaise route ‘wrong road’ • But is a reasonable first approximation • Rule: Adjective Noun -> Noun Adjective
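The Adjective-Noun reordering above is a syntactic transfer rule: it operates on the parse tree, not on the word string. A minimal sketch, assuming parse trees are nested tuples of the form `(label, child, ...)` with plain strings as words; the representation and the function name `swap_adj_noun` are choices made for this example, not taken from the slides.

```python
def swap_adj_noun(tree):
    """Rewrite every [Adj Noun] pair of sister nodes as [Noun Adj]."""
    if isinstance(tree, str):
        return tree  # a leaf word: nothing to reorder
    label, *children = tree
    children = [swap_adj_noun(child) for child in children]
    # Collect the category labels of the children (None for bare words).
    labels = [c[0] if isinstance(c, tuple) else None for c in children]
    if labels == ["Adj", "Noun"]:
        children.reverse()
    return (label, *children)


english = ("NP", ("Det", "the"), ("Nominal", ("Adj", "bad"), ("Noun", "road")))
french_order = swap_adj_noun(english)
print(french_order)
```

The rule only changes the order of the nodes; choosing the French words is a separate lexical transfer step, and the ‘mauvaise route’ case shows why a real system would need lexical exceptions to this rule.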

  33. Example: English to Japanese Transfer

  34. English to Japanese Transfer • From “niwa no teire o suru ojiisan ita” • Add “ga” to mark subject • Choose verb to agree with subject • Inflect verbs • Linearize tree: • Niwa no teire o shite ita ojiisan ga ita • Garden GEN upkeep OBJ do PASTPROG old man SUBJ was

  35. E-to-J Transfer: rules used • Existential-There-Sentence • There1 Verb2 NP3 Postnominal4 • -> • (NP -> NP3 Relative-Clause4) Verb2 • NP -> NP1 Relative-Clause2 • -> • NP -> Relative-Clause2 NP1
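The two rules above can be tried out on the lecture's running sentence, “there was an old man gardening”. The sketch below assumes a flat tuple encoding of the parse, `(label, child, ...)`, and uses hypothetical function names; it performs only the structural transfer and leaves lexical choice and inflection to later steps.

```python
def child_labels(children):
    """Category labels of a list of children (None for bare word strings)."""
    return [c[0] if isinstance(c, tuple) else None for c in children]


def existential_there(tree):
    """There1 Verb2 NP3 Postnominal4 -> (NP -> NP3 Relative-Clause4) Verb2."""
    label, *kids = tree
    if child_labels(kids) == ["There", "Verb", "NP", "Postnominal"]:
        _, verb, np, postnominal = kids
        relative = ("Relative-Clause",) + postnominal[1:]
        return (label, ("NP", np, relative), verb)
    return tree


def front_relative_clause(tree):
    """NP -> NP1 Relative-Clause2  becomes  NP -> Relative-Clause2 NP1."""
    if isinstance(tree, str):
        return tree
    label, *kids = tree
    kids = [front_relative_clause(kid) for kid in kids]
    if label == "NP" and child_labels(kids) == ["NP", "Relative-Clause"]:
        kids.reverse()
    return (label, *kids)


source = ("S", ("There", "there"), ("Verb", "was"),
          ("NP", "an old man"), ("Postnominal", "gardening"))
transferred = front_relative_clause(existential_there(source))
print(transferred)
```

The result puts the relative clause before the noun and the verb last, which mirrors the order of the Japanese output “niwa no teire o shite ita ojiisan ga ita”.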

  36. Lexical Transfer • Man: • Ojiisan ‘old man’ • Man is the only linguistic animal -> • Ningen ‘man, human being’ • Or • Hito ‘person, persons’ • Can treat like lexical ambiguity, • Disambiguate during parsing

  37. Transfer: some problems • N² sets of transfer rules! • Grammar and lexicon full of language-specific stuff • Hard to build, hard to maintain

  38. MT Method 2: Interlingua • Intuition: Instead of lg-lg knowledge rules, use the meaning of the sentence to help • Steps: • 1) translate source sentence into meaning representation • 2) generate target sentence from meaning.

  39. Interlingua for “there was an old man gardening” • EVENT: GARDENING • AGENT: MAN (NUMBER: SG, DEFINITENESS: INDEF) • ASPECT: PROGRESSIVE • TENSE: PAST

  40. Interlingua • Idea is that some of the MT work that we need to do is part of other NLP tasks • E.g., disambiguating E:book S:‘libro’ from E:book S:‘reservar’ • So we could have concepts like BOOKVOLUME and RESERVE and solve this problem once for each language
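The ‘book’ example can be made concrete with two lookup tables: one that maps a source word to a language-neutral concept, and one per target language that realizes the concept. This is a toy sketch; using a part-of-speech tag as the disambiguation signal, and the names `SENSES`, `SPANISH`, `to_concept` and `realize`, are assumptions made for illustration.

```python
# Source-side analysis: (English word, part of speech) -> interlingua concept.
SENSES = {
    ("book", "NOUN"): "BOOKVOLUME",
    ("book", "VERB"): "RESERVE",
}

# Target-side generation: interlingua concept -> Spanish word.
SPANISH = {
    "BOOKVOLUME": "libro",
    "RESERVE": "reservar",
}


def to_concept(word, pos):
    """Analysis step: map a disambiguated source word to a concept."""
    return SENSES[(word.lower(), pos)]


def realize(concept, target_lexicon):
    """Generation step: map a concept to a word in the target language."""
    return target_lexicon[concept]


print(realize(to_concept("book", "NOUN"), SPANISH))
print(realize(to_concept("book", "VERB"), SPANISH))
```

Adding another target language means adding one more concept-to-word table; the English analysis side is untouched, which is the sense in which the problem is solved once per language.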

  41. Vauquois diagram

  42. Direct Translation • Idea: more robust, word-specific models • Start with a Source language sentence • Write little transformations, directly on words, to turn it into a Target language sentence.

  43. Direct MT J-to-E • Input: Watashihatsukuenouenopenwojonniageta. • 1) Morphological analysis: Watashi ha tsukue no ue no pen wo jon ni ageru PAST • 2) Lexical transfer of content words: I ha desk no ue no pen wo John ni give PAST • 3) Various preposition work: I ha pen on desk wo John to give PAST • 4) SVO rearrangements: I give PAST pen on desk John to • 5) Miscellany: I give PAST the pen on the desk to John • 6) Morphological generation: I gave the pen on the desk to John.
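Stage 2 of this pipeline is the easiest to show in code: content words are replaced through a bilingual dictionary while particles and the tense marker pass through untouched for later stages. A minimal sketch, assuming the morphological analysis from stage 1 is already available as a token list; the dictionary `CONTENT_WORDS` and the function name are hypothetical and cover only this sentence.

```python
# Hypothetical bilingual dictionary for the content words of the example.
CONTENT_WORDS = {
    "watashi": "I",
    "tsukue": "desk",
    "pen": "pen",
    "jon": "John",
    "ageru": "give",
}


def lexical_transfer(tokens, lexicon):
    """Replace known content words; leave particles and markers unchanged."""
    return [lexicon.get(token, token) for token in tokens]


# Output of stage 1 (morphological analysis), as given on the slide.
analyzed = "watashi ha tsukue no ue no pen wo jon ni ageru PAST".split()
transferred = lexical_transfer(analyzed, CONTENT_WORDS)
print(" ".join(transferred))
```

The remaining stages (particles to prepositions, reordering, article insertion, inflection) are where most of the hand-written, word-specific rules of a direct system live.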

  44. Direct MT stage 2 (ex. from Panov 1960 via Hutchins 1986)

      function direct-translate-much/many(word)
        if preceding word is ‘how’: return ‘skol’ko’
        else if preceding word is ‘as’: return ‘skol’ko zhe’
        else if word is ‘much’:
          if preceding word is ‘very’: return nil (not translated)
          else if following word is a noun: return ‘mnogo’
        else: /* word is ‘many’ */
          if preceding word is PREP and following word is NOUN: return ‘mnogii’
          else: return ‘mnogo’
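The much/many procedure translates almost line for line into Python. In this sketch the context is passed in as optional arguments, which is an assumption about how a caller would supply it. The slide does not say what happens for ‘much’ when it is neither preceded by ‘very’ nor followed by a noun, so that branch returns `None` here purely as a placeholder.

```python
def translate_much_many(word, prev_word=None, prev_pos=None, next_pos=None):
    """Pick a Russian rendering of 'much' or 'many' from local context."""
    if prev_word == "how":
        return "skol'ko"
    if prev_word == "as":
        return "skol'ko zhe"
    if word == "much":
        if prev_word == "very":
            return None  # not translated
        if next_pos == "NOUN":
            return "mnogo"
        return None  # case not covered by the slide; placeholder assumption
    # Otherwise the word is 'many'.
    if prev_pos == "PREP" and next_pos == "NOUN":
        return "mnogii"
    return "mnogo"


print(translate_much_many("much", prev_word="how"))
print(translate_much_many("many", prev_word="in", prev_pos="PREP", next_pos="NOUN"))
```

Every word with context-dependent translations needs its own procedure of this kind, which is the rule proliferation that direct systems are known for.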

  45. Three MT Approaches: Direct, Transfer, Interlingual • [Figure: Vauquois triangle, repeated from slide 29. Source Text rises through Morphological, Syntactic and Semantic Analysis to the Interlingua; Target Text is reached through Semantic, Syntactic and Morphological Generation; Direct, Syntactic Transfer and Semantic Transfer link the two sides.] • This slide from Bonnie Dorr! Original metaphor due to Bernard Vauquois

  46. 3 methods pros and cons • Thanks to Bonnie Dorr!

  47. Direct MT: pros and cons (thanks to Bonnie Dorr) • Pros • Fast • Simple • Cheap • No translation rules hidden in lexicon • Cons • Unreliable • Not powerful • Rule proliferation • Requires lots of context • Major restructuring after lexical substitution

  48. Interlingual MT: pros and cons (from B. Dorr) • Pros • Avoids the N² problem • Easier to write rules • Cons: • Semantics is HARD • Useful information lost (paraphrase)
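A quick count shows what avoiding the N² problem buys. The sketch below assumes one transfer module per ordered language pair, which gives N(N-1) modules (on the order of N²), against one analyzer plus one generator per language for an interlingua system; the function names are illustrative.

```python
def transfer_modules(n_languages):
    """One rule set per ordered (source, target) pair of distinct languages."""
    return n_languages * (n_languages - 1)


def interlingua_modules(n_languages):
    """One analyzer into and one generator out of the interlingua per language."""
    return 2 * n_languages


for n in (3, 10, 20):
    print(n, transfer_modules(n), interlingua_modules(n))
```

For three languages the two designs need the same number of modules; the gap opens quickly after that, which is why the argument matters most for large multilingual settings.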
