Automating machine translation from poorly studied languages

Presentation Transcript


  1. Automating machine translation from poorly studied languages John Goldsmith Departments of Linguistics and Computer Science John Goldsmith: DTRA Meeting

  2. Outline • The goal of automatic translation • The history of automatic translation • From the cybernetics era (1948 – 1960) • To the statistical era (1993 – present) • The problem of complex word structure in most languages • and our solution • Where we stand today

  3. 1. The goal of automatic translation There are over 6,000 human languages in use today. • Researchers need access to documents in all of these languages over the long run. • Defense analysts may need access to documents in any of these languages under pressing time constraints. • Languages can be intentionally used as encryption systems.

  4. 6,000 natural languages of the world

  5. Just the major languages of Africa alone

  6. Computational linguistic research • When we specify the problems that we tackle, they often sound superhumanly difficult. • When we begin to explain the methods, they can sound far too simple. • In fact, the methods are conceptually elegant and highly quantitative. The goals of linguistics, with the tools of computer science.

  7. 2. History of MT • A reminder of where computers actually came from: • World War II, where they were used to • aim artillery more accurately • help break German encryption systems • The post-war period of industrialization

  8. Warren Weaver’s memo (July 1949). Weaver was Director of the Natural Sciences Division of the Rockefeller Foundation. “It is very tempting to say that a book written in Chinese is simply a book written in English which was coded into the "Chinese code." If we have useful methods for solving almost any cryptographic problem, may it not be that with proper interpretation we already have useful methods for translation?”

  9. Efforts in the 1950s were stymied by the lack of sufficient computing power and by immature computing technology.

  10. Example (January 1954): They hadn’t reckoned on ambiguity when they set out to translate human languages…

  11. Progress during the 1970s and 1980s was incremental. • In the 1990s, a major sea change in computational linguistics occurred, based on data-driven statistical techniques. • IBM Research developed an approach to translation based on systems that learn from examples.

  12. Statistical Machine Translation (MT)

  13. 1999: The Egypt system • NSF funded a summer project at Johns Hopkins University: Egypt. • Open source and widely used in research. • Difficult to use in practice.

  14. What do we translate? • Do we translate sentences? • In a sense, yes. • French: L’entourage de Chirac est plus imperméable que celui de Nicolas Sarkozy. • English: Chirac’s inner circle is more tightly knit than that of Nicolas Sarkozy.

  15. Sentences = W + C • A sentence is a collection of words (W) and constructions (C). • We translate the words and the constructions. • We will therefore break the problem down into these two parts.

  16. Word-level alignments Given a parallel sentence pair, we can link (align) words or phrases that are translations of each other: “Le chien se est assis sur le tapis” :: “the dog sat down on the rug”. The system is given the two sentences, but without any information about how the words are aligned: the alignment links are inferred, not given.

  17. MT: first two tasks • Figure out word-to-word matchings (translations) • Figure out common alignments across the source and target languages: how their word orders differ. • French and English: quite similar • Japanese, Korean: verb appears at the end of the sentence.

  18. Just a taste… • This is our corpus:

  19. “NULL”? We often find that a word in one language corresponds to nothing in the other language, so we include “NULL” as an ever-present possible translation.

  20. Sentence pair: NULL the dog :: le chien. For j = 1 (le): total = P(le | NULL) + P(le | the) + P(le | dog) = 2/3 + 1/2 = 7/6 ≈ 1.17. tc = total count; tc(a | b) = total expected count of this joint occurrence.
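The expected-count step on this slide can be sketched in Python as a minimal expectation-maximization loop in the style of IBM Model 1. This is an illustration under stated assumptions, not the Egypt implementation: the function name `train_model1` and the three-sentence toy corpus are hypothetical (the deck's own corpus is not preserved in the transcript), and the probabilities start uniform. The name `tc` follows the slide's "total expected count".

```python
from collections import defaultdict

def train_model1(corpus, iterations=5):
    """Estimate t(f | e) from (english_tokens, french_tokens) pairs.

    Hypothetical helper for illustration only.
    """
    # Prepend NULL so a French word may correspond to nothing in English.
    pairs = [(["NULL"] + e, f) for e, f in corpus]
    f_vocab = {w for _, f in pairs for w in f}
    # Uniform initial values for every t(f | e).
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        tc = defaultdict(float)       # tc[(f, e)]: total expected count
        total_e = defaultdict(float)  # normalizer for each English word
        for e_sent, f_sent in pairs:
            for f in f_sent:
                # "total" on the slide: sum of P(f | e) over the sentence.
                total = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    frac = t[(f, e)] / total
                    tc[(f, e)] += frac
                    total_e[e] += frac
        # Re-estimate the probabilities from the expected counts.
        t = defaultdict(float,
                        {(f, e): c / total_e[e] for (f, e), c in tc.items()})
    return t

# Toy corpus (an assumption, not the corpus from the talk).
corpus = [
    ("the dog".split(), "le chien".split()),
    ("the cat".split(), "le chat".split()),
    ("a dog".split(), "un chien".split()),
]
t = train_model1(corpus)
```

On this toy corpus, `t[("le", "the")]` ends up larger than `t[("chien", "the")]`, because "le" co-occurs with "the" in two sentences while "chien" does so in only one.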

  21. Changes in probabilities: initialized values, iteration 2, and after 5 iterations.

  22. 3. What is morphology? • Morphology studies the internal structure of words • English: words = [ word + s ], findings = [ find + ing ] + s • Swahili: tunakisema “we speak it”
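To make the Swahili example concrete, here is a small Python sketch that splits a verb into subject marker, tense marker, optional object marker, and stem. It is a hand-written illustration, not the learning system described in this talk: the function `segment` and the tiny prefix tables are assumptions that cover only a few common Swahili markers, and the stem list must be supplied by the caller.

```python
# Small hand-written tables (illustrative subset, not a full grammar).
SUBJECT = {"ni": "I", "u": "you", "a": "he/she", "tu": "we", "wa": "they"}
TENSE = {"na": "present", "li": "past", "ta": "future"}
OBJECT = {"ki": "it", "m": "him/her", "wa": "them"}

def segment(word, stems):
    """Return every subject + tense + (object) + stem split of `word`."""
    analyses = []
    for subj in SUBJECT:
        if not word.startswith(subj):
            continue
        after_subj = word[len(subj):]
        for tense in TENSE:
            if not after_subj.startswith(tense):
                continue
            after_tense = after_subj[len(tense):]
            # The object slot is optional, so also try the empty marker.
            for obj in [""] + list(OBJECT):
                stem = after_tense[len(obj):]
                if after_tense.startswith(obj) and stem in stems:
                    analyses.append([m for m in (subj, tense, obj, stem) if m])
    return analyses

analysis = segment("tunakisema", {"sema"})
# analysis == [["tu", "na", "ki", "sema"]]: we + present + it + speak
```

A table-driven parser like this needs a linguist to write the tables for each language, which is the manual step that learning morphology directly from text is meant to avoid.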

  23. European languages are outliers. • From the morphological point of view, most languages of the world are much more complex than European languages.

  24. Linguistica • Computational linguistic project under development since 1997 • http://linguistica.uchicago.edu • Core engine: automatic morphology analyzer • Learns the morphological structure of a language directly from a (written) sample, with no human intervention.

  25. English illustration • Bear in mind: the system has no initial knowledge at all about English. • It takes about 15 seconds to analyze 200,000 words of English. • The C++ code is highly optimized and runs two orders of magnitude faster than comparable computational linguistic systems.

  26. Signatures: adjectives, verbs, nouns. We find these automatically.
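As a rough illustration of what a signature is, the Python sketch below groups stems by the set of suffixes they occur with. It is a deliberately naive approximation and should not be read as Linguistica's actual algorithm: the function `find_signatures`, its thresholds, and the word list are all assumptions made for this example.

```python
from collections import defaultdict

def find_signatures(words, min_stem=3, min_stems=2):
    """Map each suffix set (a signature) to the stems that take it."""
    stem_suffixes = defaultdict(set)
    for word in set(words):
        # Try every split point that leaves a stem of at least min_stem letters.
        for i in range(min_stem, len(word) + 1):
            stem_suffixes[word[:i]].add(word[i:] or "NULL")
    signatures = defaultdict(set)
    for stem, suffixes in stem_suffixes.items():
        if len(suffixes) >= 2:  # a stem seen with one suffix tells us little
            signatures[tuple(sorted(suffixes))].add(stem)
    # Keep only signatures supported by several different stems.
    return {sig: stems for sig, stems in signatures.items()
            if len(stems) >= min_stems}

words = ["jump", "jumps", "jumped", "jumping",
         "walk", "walks", "walked", "walking"]
signatures = find_signatures(words)
```

For this word list the only surviving signature is the suffix set NULL, ed, ing, s, shared by the stems "jump" and "walk". Spurious splits such as "jum" + "p" are dropped because no second stem shares their suffix set.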

  27. Compounds • English makes heavy use of compounds, which are best handled if we can break them apart: • eastward • eggshell • farmhouse • headdress
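One simple way to break such compounds apart is to look for a split point where both halves are already known words. The Python sketch below does only that; the function `split_compound`, the minimum part length, and the small vocabulary are assumptions for illustration, and a real system would still need some way to select or reject competing candidate splits.

```python
def split_compound(word, vocab, min_part=3):
    """Return all (left, right) splits where both parts are in vocab."""
    word = word.lower()
    return [
        (word[:i], word[i:])
        for i in range(min_part, len(word) - min_part + 1)
        if word[:i] in vocab and word[i:] in vocab
    ]

# Hypothetical vocabulary of free-standing words.
vocab = {"east", "ward", "egg", "shell", "farm", "house", "head", "dress"}
splits = {w: split_compound(w, vocab)
          for w in ["Eastward", "eggshell", "farmhouse", "headdress"]}
# splits["eggshell"] == [("egg", "shell")]
```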

  28. Compounds: candidate analyses, one selected and two rejected.

  29. 4. Where we stand today. Our project is working on: • improving automatic morphology • integrating Egypt statistical machine translation into our package for easy application • improving translation by using morphology • testing with Swahili-English

  30. 4.1 Improving automatic morphology • Swahili, Somali, Urdu, Finnish • Compounds: English, Finnish

  31. Swahili: nilimupenda, nitakamupenda

  33. Swahili verb

  34. 4.2 Integrating Egypt MT software into our front-end • Linguistica has a user-friendly front end • Linguistica is written in C++ and compiles under Windows, MacOS, and Linux • Open source

  35. 4.3 Improving translations using morphology • Developing mathematical models • A small amount of work has been done by other researchers, but the goal has largely been to use morphology to strip off affixes.

  36. 4.4 Testing with Swahili • 8 books from the New Testament available on the internet.

  37. The end