
Introduction to Machine Translation


Presentation Transcript


    1. Introduction to Machine Translation
        Mitch Marcus, CIS 530
        Some slides adapted from slides by John Hutchins, Bonnie Dorr, Martha Palmer, Language Weaver, and Kevin Knight

    2. Why use computers in translation? (CIS 530 - Intro to NLP)
        - Too much translation for humans
        - Technical materials too boring for humans
        - Greater consistency required
        - Need results more quickly
        - Not everything needs to be top quality
        - Reduce costs
        Any one of these may justify machine translation or computer aids.

    3. The Early History of NLP (Hutchins): MT in the 1950s and 1960s
        - Sponsored by government bodies in USA and USSR (also CIA and KGB)
            - assumed goal was fully automatic quality output (i.e. of publishable quality) [dissemination]
            - actual need was translation for information gathering [assimilation]
        - Survey by Bar-Hillel of MT research:
            - criticised assumption of FAHQT (fully automatic high-quality translation) as goal
            - demonstrated non-feasibility of FAHQT (without unrealisable encyclopedic knowledge bases)
            - advocated man-machine symbiosis, i.e. HAMT (human-aided machine translation) and MAHT (machine-aided human translation)
        - ALPAC 1966, set up by disillusioned funding agencies
            - compared latest systems with early unedited MT output (IBM-GU demo, 1954), criticised for still needing post-editing
            - advocated machine aids, and no further support of MT research
            - but failed to identify the actual needs of funders [assimilation]
            - therefore failed to see that the output of the IBM-USAF Translator and Georgetown systems was used and appreciated

    4. Consequences of ALPAC
        - MT research virtually ended in US
        - identification of actual needs: assimilation vs. dissemination
        - recognition that perfectionism (FAHQT) had neglected:
            - operational factors and requirements
            - expertise of translators
            - machine aids for translators
        - henceforth three strands of MT:
            - translation tools (HAMT, MAHT)
            - operational systems (post-editing, controlled languages, domain-specific systems)
            - research (new approaches, new methods)
        - computational linguistics born in the aftermath

    5. Review from Introductory Lectures

    7. How a Statistical MT System Learns

    8. Translating a New Document

    9. Latest Methods: Language Weaver
        Chinese Source Text, Sample 1: (characters lost to encoding in this transcript)

        Language Weaver Experimental Syntax MT, Sample 1:

        The motor show, has always been the 'barometer' of a national car consumption and market potential. The Beijing International Auto Show has more than 1,200 exhibitors from 24 countries and 8 days of receiving more than 40 million visitors, setting a new record in China's auto show, are deeply aware of the automobile market signals.

        "China is one of the largest automobile market in the world. Over the years, this phrase implies more auto businessmen. But now, more and more facts indicates that it is to become a reality.

        Data from the Motor Show is very convincing. The Beijing Qingnian Bao Report on-the-spot investigation showed that about 35 percent of 35-year-old visitors, 62.1 percent of the respondents said that the truck was mainly to buy a car in the near future to collect information, even at the exhibition may purchase or suitable products; 76% of respondents indicated in the past two years to buy private cars.

        Since the beginning of this year, the strong growth of the domestic car market. According to the figures released by the National Bureau of Statistics, in the first four months, the country produced 267,900 vehicles, up 27.6 percent; in particular, in April, the production of 90,000 vehicles, an increase of 50.5% over the same period last year, setting a record high for the monthly output growth over the past 10-odd years. In terms of sales in the first quarter, manufacturing enterprises in the country sold 188,000 cars, up 22 percent over the same period of last year, up 10.5 percent; 11,000 vehicles, dropping by nearly 25 percent lower than the beginning of the year.

    11. Introduction: Approaches & Difficulties

    12. MT Challenges: Ambiguity
        - Syntactic Ambiguity
            - I saw the man on the hill with the telescope
        - Lexical Ambiguity
            - E: book; S: libro, reservar
        - Semantic Ambiguity
            - Homography: ball (E) = pelota, baile (S)
            - Polysemy: kill (E) = matar, acabar (S)
        - Semantic granularity
            - esperar (S) = wait, expect, hope (E)
            - be (E) = ser, estar (S)
            - fish (E) = pez, pescado (S)

    13. MT Challenges: Divergences

    15. Divergence Frequency
        - 32% of sentences in UN Spanish/English Corpus (5K)
        - 35% of sentences in TREC El Norte Corpus (19K)
        Divergence Types
        - Categorial (X tener hambre ↔ X have hunger) [98%]
        - Conflational (X dar puñaladas a Z ↔ X stab Z) [83%]
        - Structural (X entrar en Y ↔ X enter Y) [35%]
        - Head Swapping (X cruzar Y nadando ↔ X swim across Y) [8%]
        - Thematic (X gustar a Y ↔ Y like X) [6%]

    16. MT Lexical Choice - WSD
        - Iraq lost the battle.
            Ilakuka centwey ciessta.
            [Iraq] [battle] [lost].
        - John lost his computer.
            John-i computer-lul ilepelyessta.
            [John] [computer] [misplaced].

    17. WSD with Source Language Semantic Class Constraints

    18. Lexical Gaps: English to Chinese
        break, smash, shatter, snap ?
        - da po: irregular pieces
        - da sui: small pieces
        - pie duan: line segments

    19. Three MT Approaches: Direct, Transfer, Interlingual (Vauquois triangle)

    20. Examples of Three Approaches
        - Direct: I checked his answers against those of the teacher → Yo comparé sus respuestas a las de la profesora
            Rule: [check X against Y] → [comparar X a Y]
        - Transfer: Ich habe ihn gesehen → I have seen him
            Rule: [clause agt aux obj pred] → [clause agt aux pred obj]
        - Interlingual: I like Mary → Mary me gusta a mí
            Rep: [BeIdent (I [ATIdent (I, Mary)] Like+ingly)]
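The transfer rule on this slide can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the role analysis (agt, aux, obj, pred) is supplied by position instead of being produced by a parser, and `LEXICON`, `SOURCE_ORDER`, `TARGET_ORDER`, and `transfer` are hypothetical names invented for the example.

```python
# Toy transfer step for the slide's rule:
#   [clause agt aux obj pred] -> [clause agt aux pred obj]
# Assumption: the source clause is already analysed, one token per role.
SOURCE_ORDER = ("agt", "aux", "obj", "pred")
TARGET_ORDER = ("agt", "aux", "pred", "obj")

# Hypothetical four-word bilingual lexicon, just enough for the example.
LEXICON = {"Ich": "I", "habe": "have", "ihn": "him", "gesehen": "seen"}


def transfer(tokens):
    """Reorder roles with the structural rule, then substitute words."""
    if len(tokens) != len(SOURCE_ORDER):
        raise ValueError("expected exactly one token per role")
    roles = dict(zip(SOURCE_ORDER, tokens))
    return [LEXICON[roles[role]] for role in TARGET_ORDER]


print(" ".join(transfer(["Ich", "habe", "ihn", "gesehen"])))
```

The structural rule and the lexicon are kept apart here, which mirrors the "no translation rules hidden in lexicon" point usually made for transfer systems.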

    21. Direct MT: Pros and Cons
        - Pros
            - Fast
            - Simple
            - Inexpensive
        - Cons
            - Unreliable
            - Not powerful
            - Rule proliferation
            - Requires too much context
            - Major restructuring after lexical substitution

    22. Transfer MT: Pros and Cons
        - Pros
            - Don't need to find language-neutral rep
            - No translation rules hidden in lexicon
            - Relatively fast
        - Cons
            - N^2 sets of transfer rules: difficult to extend
            - Proliferation of language-specific rules in lexicon and syntax
            - Cross-language generalizations lost

    23. Interlingual MT: Pros and Cons
        - Pros
            - Portable (avoids N^2 problem)
            - Lexical rules and structural transformations stated more simply on normalized representation
            - Explanatory adequacy
        - Cons
            - Difficult to deal with terms on primitive level: universals?
            - Must decompose and reassemble concepts
            - Useful information lost (paraphrase)
        (Is thought really language neutral??)
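The N^2 argument in the two slides above is simple arithmetic, worked through below. The exact counts are an assumption about how modules are tallied (one transfer module per ordered language pair, one analyser plus one generator per language for an interlingua); the slides only state the order of growth, and the function names are hypothetical.

```python
def transfer_modules(n_languages):
    """One transfer rule set per ordered (source, target) pair: grows as N^2."""
    return n_languages * (n_languages - 1)


def interlingua_modules(n_languages):
    """One analyser into and one generator out of the interlingua per language."""
    return 2 * n_languages


for n in (3, 10, 20):
    print(n, transfer_modules(n), interlingua_modules(n))
```

Under this tally, adding an eleventh language to a ten-language transfer system means 20 new rule sets, against 2 new modules for the interlingual design.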

    24. A Gentle Introduction to Statistical MT: Core ideas

    25. Warren Weaver 1949 Memorandum I
        Proposes Local Word Sense Disambiguation!

        If one examines the words in a book, one at a time through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of words. "Fast" may mean "rapid"; or it may mean "motionless"; and there is no way of telling which. But, if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then, if N is large enough one can unambiguously decide the meaning. . .
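Weaver's "slit in the mask" can be mimicked with a window of N words on either side of the ambiguous word. The sketch below is not a method from the memorandum or the lecture; it simply counts overlaps between the window and hand-made cue lists. `SENSE_CUES` and both function names are hypothetical.

```python
# Hypothetical cue words for the two senses of "fast" that Weaver mentions.
SENSE_CUES = {
    "rapid": {"car", "run", "runner", "speed"},
    "motionless": {"stuck", "held", "tied", "anchor"},
}


def context_window(tokens, index, n):
    """Return up to n words on either side of tokens[index]."""
    return tokens[max(0, index - n):index] + tokens[index + 1:index + 1 + n]


def disambiguate(tokens, index, n):
    """Pick the sense whose cues overlap the window most, or None if undecided."""
    context = set(context_window(tokens, index, n))
    scores = {sense: len(context & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


sentence = "the anchor held the boat fast".split()
target = sentence.index("fast")
for n in (0, 1, 3):
    print(n, disambiguate(sentence, target, n))
```

With N = 0 or N = 1 the window contains no cue word and the function stays undecided; widening it to N = 3 brings "held" into view, which is the effect the quoted passage describes.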

    26. Warren Weaver 1949 Memorandum II
        Proposes Interlingua for Machine Translation!

        Thus it may be true that the way to translate from Chinese to Arabic, or from Russian to Portuguese, is not to attempt the direct route, shouting from tower to tower. Perhaps the way is to descend, from each language, down to the common base of human communication, the real but as yet undiscovered universal language, and then re-emerge by whatever particular route is convenient.

    27. Warren Weaver 1949 Memorandum III
        Proposes Machine Translation using Information Theory!

        It is very tempting to say that a book written in Chinese is simply a book written in English which was coded into the "Chinese code." If we have useful methods for solving almost any cryptographic problem, may it not be that with proper interpretation we already have useful methods for translation?

        Weaver, W. (1949): Translation. Repr. in: Locke, W.N. and Booth, A.D. (eds.) Machine translation of languages: fourteen essays (Cambridge, Mass.: Technology Press of the Massachusetts Institute of Technology, 1955), pp. 15-23.

    28. IBM Adopts Statistical MT Approach I (early 1990s)
        In 1949, Warren Weaver proposed that statistical techniques from the emerging field of information theory might make it possible to use modern digital computers to translate text from one natural language to another automatically. Although Weaver's scheme foundered on the rocky reality of the limited computer resources of the day, a group of IBM researchers in the late 1980's felt that the increase in computer power over the previous forty years made reasonable a new look at the applicability of statistical techniques to translation. Thus the "Candide" project, aimed at developing an experimental machine translation system, was born at IBM TJ Watson Research Center.

    29. IBM Adopts Statistical MT Approach II
        The Candide group adopted an information-theoretic perspective on the MT problem, which goes as follows. In speaking a French sentence F, a French speaker originally thought up a sentence E in English, but somewhere in the noisy channel between his brain and mouth, the sentence E got "corrupted" to its French translation F. The task of an MT system is to discover E* = argmax(E') p(F|E') p(E'); that is, the MAP-optimal English sentence, given the observed French sentence. This approach involves constructing a model of likely English sentences, and a model of how English sentences translate to French sentences. Both these tasks are accomplished automatically with the help of a large amount of bilingual text.

        As wacky as this perspective might sound, it's no stranger than the view that an English sentence gets corrupted into an acoustic signal in passing from the person's brain to his mouth, and this perspective is now essentially universal in automatic speech recognition.
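The decision rule E* = argmax(E') p(F|E') p(E') follows from Bayes' rule: p(E'|F) = p(F|E') p(E') / p(F), and p(F) is the same for every candidate E', so it can be dropped from the argmax. Below is a minimal sketch of that rule over a tiny hand-made candidate list. The probability tables are invented for illustration and are not Candide's models; a real system searches a vast space instead of enumerating a dictionary.

```python
import math

# Hypothetical language model p(E) over three candidate English strings.
LANGUAGE_MODEL = {"the house": 0.60, "the home": 0.35, "house the": 0.05}

# Hypothetical translation model p(F | E) for one observed French sentence.
TRANSLATION_MODEL = {
    ("la maison", "the house"): 0.5,
    ("la maison", "house the"): 0.5,  # same words, so the same channel score
    ("la maison", "the home"): 0.3,
}


def decode(french):
    """Return argmax over E of p(F|E) * p(E), computed in log space."""
    def log_score(english):
        channel = TRANSLATION_MODEL.get((french, english), 1e-12)
        return math.log(channel) + math.log(LANGUAGE_MODEL[english])
    return max(LANGUAGE_MODEL, key=log_score)


print(decode("la maison"))
```

In this toy table the translation model cannot tell "the house" from "house the"; the language model breaks the tie, which is the division of labour the noisy-channel view relies on.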

    30. The Channel Model for Machine Translation

    31. Noisy Channel - Why useful?
        - Word reordering in translation handled by P(S)
            - P(S) factor frees P(T|S) from worrying about word order in the Source language
        - Word choice in translation handled by P(T|S)
            - P(T|S) factor frees P(S) from worrying about picking the right translation

    32. An Alignment

    33. Fertilities and Lexical Probabilities for not

    34. Fertilities and Lexical Probabilities for hear

    35. Schematic of Translation Model

    36. How do we evaluate MT?
        - Human-based Metrics
            - Semantic Invariance
            - Pragmatic Invariance
            - Lexical Invariance
            - Structural Invariance
            - Spatial Invariance
            - Fluency
            - Accuracy: Number of Human Edits required
                - HTER: Human Translation Error Rate
            - Do you get it?
        - Automatic Metrics: Bleu

    37. BiLingual Evaluation Understudy (BLEU; Papineni, 2001)
        - Automatic technique, but ...
            - Requires the pre-existence of human (reference) translations
        - Compare n-gram matches between candidate translation and 1 or more reference translations
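The n-gram comparison at the heart of BLEU can be sketched as a clipped ("modified") n-gram precision: each candidate n-gram is credited at most as many times as it appears in any single reference. This sketch covers only that one ingredient; the full metric also combines several n-gram orders and applies a brevity penalty, both omitted here. The function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against one or more references."""
    candidate_counts = ngram_counts(candidate, n)
    if not candidate_counts:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_reference = Counter()
    for reference in references:
        for gram, count in ngram_counts(reference, n).items():
            max_reference[gram] = max(max_reference[gram], count)
    clipped = sum(min(count, max_reference[gram])
                  for gram, count in candidate_counts.items())
    return clipped / sum(candidate_counts.values())


candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(modified_precision(candidate, references, 1))
print(modified_precision(candidate, references, 2))
```

Without clipping, the degenerate candidate above would score a perfect unigram precision because every word is "the"; clipping caps the credit at the two occurrences found in the first reference.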

    38. Bleu Metric

    39. Bleu Metric

    40. Thanks!
