
An Introduction to Machine Translation

Andy Way, DCU. Covers the rise and fall of different MT paradigms, the three main approaches to rule-based MT (direct, transfer, interlingua) as pictured in the Vauquois Pyramid, and system design concerns.


Presentation Transcript


  1. An Introduction to Machine Translation Andy Way, DCU

  2. The Rise & Fall of Different MT Paradigms

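  3. Three main approaches to RBMT [Figure: The Vauquois Pyramid. ANALYSIS runs from the source text up towards a language-neutral interlingua at the apex, GENERATION runs down to the target text, TRANSFER links the two sides part-way up, and direct translation links source text and target text at the base.]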

  4. System Design: Concerns
     • Multilingual vs. Bilingual
       • Multilingual:
         • Extreme: Eurotra, i.e. 72 language pairs
         • Modest: EN → DE, FR, ES, i.e. 3 language pairs
         • Intermediate: EN, FR, DE, ES, JP, but not all combinations
       • Bilingual:
         • Unidirectional vs. Bidirectional
           • EN → FR or FR → EN
         • Reversible vs. Non-reversible
           • EN ↔ FR, same EN, FR components for Analysis & Generation, and reversible transfer module
           • EN → FR & FR → EN, but different EN, FR components for Analysis & Generation, and different transfer modules. NB: lack of modularity …
     • Direct vs. Transfer vs. Interlingua
     • Batch vs. Interactive

  5. Advantages/Disadvantages of Direct Systems
     • Advantages
       • Engine's competence lies in its comparative grammar.
       • Highly robust. Does not break down or stop when it encounters unknown words, unknown grammatical constructs, or ill-formed input.
       • Designed for unidirectional translation between one pair of languages. Not conducive to genuine multilingual MT design.
     • Disadvantages
       • 'Word-for-word' translation + local reordering = poor translation, using a cheap bilingual dictionary & rudimentary knowledge of the target language.
       • Linguistically and computationally naive. No analysis of the internal structure of the input, especially w.r.t. the grammatical relationships between the main parts of sentences.

  6. Advantages/Disadvantages of Interlingual Systems
     • Advantages
       • Intermediate representation (IR) fully specified, i.e. no need to 'look back' at the Source in order to generate the Target.
       • Easy to extend to other languages.
       • Built-in back-translation: useful for testing.
     • Disadvantages
       • How to define an Interlingua for closely related languages?
       • Is a truly universal Interlingua possible?

  7. Advantages/Disadvantages of Transfer Systems
     • Advantages
       • No language-independent representations: the source IR is specific to a particular language, as is the target-language IR.
       • So the complexity of the Analysis & Generation components is much reduced …
       • Also, no necessary equivalence between source and target IRs for the same language!
     • Disadvantages
       • Not so easy to extend to other languages: n analysis modules, n generation modules, n × (n − 1) transfer modules, i.e. not much less than n² …
       • No guaranteed built-in back-translation.
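The module counts quoted for transfer systems can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the interlingua figure of 2n modules (one analysis and one generation module per language) is the usual textbook accounting rather than something stated on the slides.

```python
def language_pairs(n: int) -> int:
    """Directed language pairs among n languages (EN->FR and FR->EN count separately)."""
    return n * (n - 1)


def transfer_modules(n: int) -> int:
    """n analysis + n generation + one transfer module per directed pair."""
    return n + n + language_pairs(n)


def interlingua_modules(n: int) -> int:
    """Assumed accounting: one analysis and one generation module per language."""
    return 2 * n


if __name__ == "__main__":
    for n in (2, 4, 9):
        print(n, language_pairs(n), transfer_modules(n), interlingua_modules(n))
```

With n = 9 this reproduces the 72 language pairs quoted for Eurotra, and the transfer design needs 90 modules against 18 under the interlingua assumption, which is the sense in which transfer systems are harder to extend.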

  8. Direct, or Indirect?
     • Direct:
       • From the manufacturer's viewpoint, better, as it's more robust …
     • Indirect:
       • Falls over more easily.
       • Development phase can be trying.
       • Commercially, must be supplemented with techniques for dealing with unseen input.
     • What about Translation Quality?
       • Indirect systems clearly better in principle.
       • However, constructing an MT engine requires considerable effort.
       • Direct systems can achieve good performance.
     • Summary
       • Research: mostly Transfer-based, with rules automatically acquired from data
       • Industrially: we can expect highly-developed Direct systems to survive for some years to come …

  9. Other Material • Arnold, D. et al. (1994): Machine Translation - An Introductory Guide; NCC Blackwell, Oxford • Hutchins, J. & H. Somers (1992): An Introduction to MT; Academic Press, London • Trujillo, A. (1999): Translation Engines; Springer, London • Newer books include: • Bowker, L. (2002): Computer-Aided Translation Technology, U. of Ottawa Press. • Somers, H. (2003): Computers and Translation: A translator's guide, John Benjamins. • Bond, F. (2005): Translating the Untranslatable, CSLI. • Quah, C. (2006): Translation and Technology, Palgrave MacMillan.

  10. Why Corpus-Based MT? • the (relative) failure of rule-based approaches • the increasing availability of machine-readable text • the increase in capability of hardware (CPU, memory, disk space) with associated decrease in cost

  11. Corpus-Based MT is here to stay
      These approaches are now mainstream:
      • Most researchers are developing corpus-based systems;
      • First company to use SMT now exists: http://www.languageweaver.com;
      • CNGL partner Traslán uses EBMT/SMT hybrid;
      • In recent large-scale evaluations, corpus-based MT systems come first.
      Two caveats:
      • Most industrial systems are still rule-based (but cf. Google's systems, now all SMT);
      • Current mainstream evaluation metrics favour n-gram-based systems (i.e. bias towards SMT).

  12. Thanks to Kevin Knight …

  13. Centauri/Arcturan Exercise Slides already on CA446 webpage …

  14. Centauri/Arcturan [Knight, 1997]
      Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }
      • There are 6! different orders possible, i.e. 720 different translations.
      • Best order (according to placement in the TL side of the corpus) is as given above.
      • Not just unigrams, but n-grams also …

  15. It's Really Spanish-English!
      1a. Garcia and associates .
      1b. Garcia y asociados .
      2a. Carlos Garcia has three associates .
      2b. Carlos Garcia tiene tres asociados .
      3a. his associates are not strong .
      3b. sus asociados no son fuertes .
      4a. Garcia has a company also .
      4b. Garcia tambien tiene una empresa .
      5a. its clients are angry .
      5b. sus clientes estan enfadados .
      6a. the associates are also angry .
      6b. los asociados tambien estan enfadados .
      7a. the clients and the associates are enemies .
      7b. los clientes y los asociados son enemigos .
      8a. the company has three groups .
      8b. la empresa tiene tres grupos .
      9a. its groups are in Europe .
      9b. sus grupos estan en Europa .
      10a. the modern groups sell strong pharmaceuticals .
      10b. los grupos modernos venden medicinas fuertes .
      11a. the groups do not sell zenzanine .
      11b. los grupos no venden zanzanina .
      12a. the small groups are not modern .
      12b. los grupos pequenos no son modernos .
      Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa

  17. Some more to try … • iat lat pippat eneat hilat oloat at-yurp. • totat nnat forat arrat mat bat. • wat dat quat cat uskrat at-drubel. … if you have trouble sleeping at nights!

  18. What have we just seen? • what parallel corpora look like; • how relevant parallel corpora are for MT; • how to build bilingual dictionaries from parallel corpora; • how cognate information may be useful in MT; • how to do word alignment.
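One way to build a bilingual dictionary from the twelve sentence pairs of slide 15 is sketched below. It is a minimal illustration under stated assumptions, not the method used in the exercise: it scores every English/Spanish word pair with the Dice coefficient over sentence-level co-occurrence, and the names `PAIRS` and `dice_dictionary` are invented for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# The twelve sentence pairs of slide 15 (7b uses "clientes", matching 5b).
PAIRS = [
    ("Garcia and associates .", "Garcia y asociados ."),
    ("Carlos Garcia has three associates .", "Carlos Garcia tiene tres asociados ."),
    ("his associates are not strong .", "sus asociados no son fuertes ."),
    ("Garcia has a company also .", "Garcia tambien tiene una empresa ."),
    ("its clients are angry .", "sus clientes estan enfadados ."),
    ("the associates are also angry .", "los asociados tambien estan enfadados ."),
    ("the clients and the associates are enemies .",
     "los clientes y los asociados son enemigos ."),
    ("the company has three groups .", "la empresa tiene tres grupos ."),
    ("its groups are in Europe .", "sus grupos estan en Europa ."),
    ("the modern groups sell strong pharmaceuticals .",
     "los grupos modernos venden medicinas fuertes ."),
    ("the groups do not sell zenzanine .", "los grupos no venden zanzanina ."),
    ("the small groups are not modern .", "los grupos pequenos no son modernos ."),
]


def dice_dictionary(pairs):
    """Map each English word to (best Dice score, sorted list of tied Spanish words)."""
    en_freq, es_freq, joint = Counter(), Counter(), Counter()
    for en, es in pairs:
        en_words, es_words = set(en.split()), set(es.split())
        en_freq.update(en_words)   # number of sentences containing each word
        es_freq.update(es_words)
        joint.update(product(en_words, es_words))  # sentence-level co-occurrence
    best = {}
    for (e, s), n in joint.items():
        # Exact arithmetic so that ties are detected reliably.
        dice = Fraction(2 * n, en_freq[e] + es_freq[s])
        top, words = best.get(e, (Fraction(0), []))
        if dice > top:
            best[e] = (dice, [s])
        elif dice == top:
            words.append(s)
    return {e: (d, sorted(ws)) for e, (d, ws) in best.items()}


if __name__ == "__main__":
    for word, (score, candidates) in sorted(dice_dictionary(PAIRS).items()):
        print(f"{word:16} {float(score):.2f}  {candidates}")
```

On this corpus "associates", "groups" and "sell" each get a single best candidate, while "Europe" ties between "Europa" and "en" because both occur only in sentence 9. That tie is the kind of case where cognate information, or more data, would settle the alignment.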

  19. What else do we need to know? about word alignment on a larger scale; about phrasal alignment, the norm in real translation data; about unknown words; the importance of knowing the target language (vs. source) in making fluent translations; about locality in word order shifts; how to guess the meanings/translations of unknown words; about how much uncertainty the machine faces in working with limited data; about working on different domains; …

  21. Do such methods scale to ‘real’ MT? • Availability of monolingual and bilingual corpora? • Possibility of sentence-aligning bilingual corpora? • Can we write an algorithm to extract the translation dictionary? • Can we write an algorithm to extract the monolingual word pair counts? • Can we write an algorithm to generate translations using our translation dictionary and word pair counts? • WILL THE TRANSLATIONS PRODUCED BE ANY GOOD?
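The last two algorithmic questions (monolingual word pair counts, and generation from a dictionary plus those counts) can be sketched on the Spanish side of the slide 15 corpus. Everything named here is an assumption made for illustration: `GLOSS` is a hand-picked word-for-word dictionary read off the sentence pairs, sentences are padded with `<s>` and `</s>` markers, and the generator simply tries every word order and keeps the one whose adjacent pairs were seen most often. Exhaustive search only works for very short sentences.

```python
from collections import Counter
from itertools import permutations

# Target-language (Spanish) side of the slide 15 corpus.
SPANISH = [
    "Garcia y asociados .",
    "Carlos Garcia tiene tres asociados .",
    "sus asociados no son fuertes .",
    "Garcia tambien tiene una empresa .",
    "sus clientes estan enfadados .",
    "los asociados tambien estan enfadados .",
    "los clientes y los asociados son enemigos .",
    "la empresa tiene tres grupos .",
    "sus grupos estan en Europa .",
    "los grupos modernos venden medicinas fuertes .",
    "los grupos no venden zanzanina .",
    "los grupos pequenos no son modernos .",
]

# Hypothetical word-for-word dictionary, hand-picked for this example.
GLOSS = {"the": "los", "small": "pequenos", "groups": "grupos",
         "sell": "venden", "strong": "fuertes", "pharmaceuticals": "medicinas"}


def word_pair_counts(sentences):
    """Count adjacent word pairs, with sentence-boundary markers, ignoring '.'."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + [t for t in sentence.split() if t != "."] + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts


def score(order, counts):
    """Sum of corpus counts over the adjacent pairs of a candidate word order."""
    tokens = ["<s>", *order, "</s>"]
    return sum(counts[pair] for pair in zip(tokens, tokens[1:]))


def translate(english, gloss, counts):
    """Gloss word for word, then pick the best-scoring order of the glossed words."""
    words = [gloss[w] for w in english.split()]
    best = max(permutations(words), key=lambda order: score(order, counts))
    return words, list(best)


if __name__ == "__main__":
    counts = word_pair_counts(SPANISH)
    words, best = translate("the small groups sell strong pharmaceuticals", GLOSS, counts)
    print("word-for-word:", " ".join(words))
    print("reordered:    ", " ".join(best))
```

For this input the word-for-word gloss is "los pequenos grupos venden fuertes medicinas", and the pair counts move both adjectives after their nouns, giving "los grupos pequenos venden medicinas fuertes". The search covers all 6! = 720 orders, which is why real systems restrict reordering to local shifts rather than enumerating permutations.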

  22. Parallel Corpora • Hugely important … but not available in a wide range of language pairs: • Chinese—English: Hong Kong data • French—English: Canadian Hansards • Older EU pairs: Europarl [Koehn 04] • Newer EU pairs: JRC-Acquis Communautaire, very recently distributed updated Europarl • Arabic—English: LDC Data • NIST, IWSLT, TC-STAR Evaluations • …

  23. Caveat interpres! • Beware of sparse data! • Beware of unrepresentative corpora! • Beware of poor quality language! If the corpora are small, or of poor quality, or are unrepresentative, then our statistical language models will be poor, so any results we achieve will be poor.
