Automating machine translation from poorly studied languages
John Goldsmith
Departments of Linguistics and Computer Science
John Goldsmith: DTRA Meeting
Outline
• The goal of automatic translation
• The history of automatic translation
  • From the cybernetics era (1948 – 1960)
  • To the statistical era (1993 – date)
• The problem of complex word structure in most languages
  • and our solution
• Where we stand today
1. The goal of automatic translation
There are over 6,000 human languages in use today.
• Researchers need access to documents in all of these languages over the long run.
• Defense analysts may need access to documents in any of these languages under pressing time constraints.
• Languages can be used intentionally as encryption systems.
6,000 natural languages of the world
Just the major languages of Africa alone
Computational linguistic research
• When we specify the problems that we tackle, they often sound superhumanly difficult.
• When we begin to explain the methods, they can sound far too simple.
• In fact, the methods are conceptually elegant and highly quantitative.
The goals of linguistics, with the tools of computer science.
2. History of MT
• A reminder of where computers actually came from:
  • World War II, and their uses then:
    • aiming artillery more accurately
    • breaking German encryption systems
  • The post-war period of industrialization
Warren Weaver’s memo (July 1949)
Director, Natural Sciences Division of the Rockefeller Foundation
"It is very tempting to say that a book written in Chinese is simply a book written in English which was coded into the 'Chinese code.' If we have useful methods for solving almost any cryptographic problem, may it not be that with proper interpretation we already have useful methods for translation?"
Efforts in the 1950s
...stymied by insufficient computing power and immature computing technology
Example
They hadn’t reckoned on ambiguity when they set out to translate human languages…
January 1954
• Progress during the 1970s and 1980s was incremental.
• In the 1990s, a major sea change occurred in computational linguistics, based on data-driven statistical techniques.
• IBM Research developed an approach to translation based on systems that learn from examples.
Statistical Machine Translation (MT)
1999: The Egypt system
• NSF funded a summer project at Johns Hopkins University: Egypt.
• Open source and widely used in research.
• Difficult to use in practice.
What do we translate?
• Do we translate sentences?
• In a sense, yes.
• L’entourage de Chirac est plus imperméable que celui de Nicolas Sarkozy.
• Chirac’s inner circle is more tightly knit than that of Nicolas Sarkozy.
Sentences = W + C
• A sentence is a collection of words (W) and constructions (C).
• We translate the words and the constructions.
• We therefore break the problem down into these two parts.
Word-level alignments
Given a parallel sentence pair, we can link (align) words or phrases that are translations of each other:
Le chien se est assis sur le tapis
the dog sat down on the rug
The system is given the two sentences, but no information about how the words are aligned: the alignment links are inferred, not given.
MT: first two tasks
• Figure out word-to-word matchings (translations).
• Figure out common alignments across the source and target languages: how their word orders differ.
  • French and English: quite similar
  • Japanese, Korean: the verb appears at the end of the sentence
Just a taste…
• This is our corpus:
“NULL”?
We often find that a word in one language corresponds to nothing in the other language, so we include “NULL” as an ever-present possibility of translation.
NULL the dog :: le chien
j = 1 (le)
total = P(le | NULL) + P(le | the) + P(le | dog) = 2/3 + 1/2 = 7/6 ≈ 1.17
tc = total count
tc(a|b) = total expected count of this joint occurrence
Changes in probabilities
[Tables not preserved: initialized values; iteration 2; after 5 iterations]
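The expected-count step on the preceding slides can be sketched in Python as an expectation-maximization loop in the style of IBM Model 1. This is a minimal sketch and not the Egypt implementation: the three-sentence toy corpus and the name `train_ibm_model1` are assumptions made for illustration, since the corpus used in the talk is not preserved here. The variables `total` and `tc` mirror the quantities named on the slide.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=5):
    """Estimate t[(f, e)] = P(f | e) from (english_tokens, french_tokens) pairs."""
    # NULL is added to every English sentence as an ever-present option.
    pairs = [(["NULL"] + list(es), list(fs)) for es, fs in pairs]
    f_vocab = {f for _, fs in pairs for f in fs}

    # Initialization: every co-occurring pair gets the same probability.
    t = defaultdict(float)
    for es, fs in pairs:
        for e in es:
            for f in fs:
                t[(f, e)] = 1.0 / len(f_vocab)

    for _ in range(iterations):
        tc = defaultdict(float)       # expected count of f aligned to e
        e_total = defaultdict(float)  # expected count of e aligned to anything
        for es, fs in pairs:
            for f in fs:
                # "total" on the slide: sum of P(f | e) over the English side.
                total = sum(t[(f, e)] for e in es)
                for e in es:
                    share = t[(f, e)] / total
                    tc[(f, e)] += share
                    e_total[e] += share
        # Re-estimate: normalize expected counts per English word.
        for (f, e), count in tc.items():
            t[(f, e)] = count / e_total[e]
    return t

# Hypothetical toy corpus (assumption, not the corpus from the talk).
corpus = [
    ("the dog".split(), "le chien".split()),
    ("the cat".split(), "le chat".split()),
    ("a dog".split(), "un chien".split()),
]
t = train_ibm_model1(corpus, iterations=5)
for f in ("le", "chien", "chat"):
    print(f"P({f} | the) = {t[(f, 'the')]:.3f}")
```

No word alignments are supplied; the model only sees which words co-occur, and the probabilities for pairs such as le/the and chien/dog rise over the iterations because those pairs recur across sentences.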
3. What is morphology?
• Morphology studies the internal structure of words.
• English: words = [ word + s ]; findings = [ find + ing ] + s
• Swahili: tunakisema “we speak it”
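To make the Swahili example concrete, here is a small Python sketch that splits a verb into subject marker, tense marker, optional object marker, and root. The marker tables are hand-written and deliberately tiny (an assumption for illustration); this is the kind of knowledge a morphology learner has to discover, not a description of how any particular system does it.

```python
import re

# Hand-written, incomplete marker tables (illustrative assumption).
SUBJECT = {"ni": "I", "u": "you (sg.)", "a": "he/she",
           "tu": "we", "m": "you (pl.)", "wa": "they"}
TENSE = {"na": "present", "li": "past", "ta": "future", "me": "perfect"}
OBJECT = {"ni": "me", "ku": "you (sg.)", "m": "him/her",
          "tu": "us", "wa": "them", "ki": "it (class 7)"}

def alternation(table):
    """Regex alternation over the table's keys, longest first."""
    return "|".join(sorted(table, key=len, reverse=True))

VERB = re.compile(
    f"({alternation(SUBJECT)})({alternation(TENSE)})({alternation(OBJECT)})?(.+)"
)

def segment_verb(word):
    """Return (subject, tense, object or None, root), or None if no match."""
    match = VERB.fullmatch(word)
    return match.groups() if match else None

print(segment_verb("tunakisema"))  # ('tu', 'na', 'ki', 'sema')
```

A single Swahili word here carries what English spreads over three words ("we speak it"), which is why word-to-word alignment alone struggles with such languages.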
European languages are outliers.
• From the morphological point of view, most languages of the world are much more complex than European languages.
Linguistica
• Computational linguistic project under development since 1997
• http://linguistica.uchicago.edu
• Core engine: automatic morphological analyzer
• Learns the morphological structure of a language directly from a (written) sample, with no human intervention.
English illustration
• Bear in mind: the system has no initial knowledge at all about English.
• It takes about 15 seconds to analyze 200,000 words of English.
• The C++ code is highly optimized and runs two orders of magnitude faster than comparable computational linguistic systems.
Signatures
[Figure not preserved; panels: Adjectives, Verbs, Nouns]
We find these automatically.
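The idea of a signature (a set of suffixes shared by a group of stems) can be illustrated with a short Python sketch. This is a naive grouping under stated assumptions, not Linguistica's actual learning procedure: the function name `find_signatures`, the length limits, and the word list are all hypothetical choices for the example.

```python
from collections import defaultdict

def find_signatures(words, max_suffix_len=3, min_stem_len=3, min_stems=2):
    """Group stems by the exact set of suffixes they occur with."""
    vocab = set(words)

    # Try every stem/suffix split of every word; "NULL" is the empty suffix.
    stem_suffixes = defaultdict(set)
    for word in vocab:
        for k in range(max_suffix_len + 1):
            if len(word) - k < min_stem_len:
                break
            stem = word[:len(word) - k]
            suffix = word[len(word) - k:] or "NULL"
            stem_suffixes[stem].add(suffix)

    # A signature is a suffix set; keep stems that take at least two suffixes.
    signatures = defaultdict(set)
    for stem, suffixes in stem_suffixes.items():
        if len(suffixes) >= 2:
            signatures[frozenset(suffixes)].add(stem)

    # Keep only signatures supported by several stems.
    return {sig: stems for sig, stems in signatures.items()
            if len(stems) >= min_stems}

words = ["jump", "jumps", "jumped", "jumping",
         "walk", "walks", "walked", "walking",
         "dog", "dogs", "cat", "cats"]
signatures = find_signatures(words)
for sig, stems in signatures.items():
    print(sorted(sig), "->", sorted(stems))
```

On this word list the verb-like stems (jump, walk) and the noun-like stems (dog, cat) fall into different signatures without any part-of-speech labels being supplied, which is the effect the slide describes.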
Compounds
• English makes heavy use of compounds, which are best handled if we can break them apart:
  • eastward
  • eggshell
  • farmhouse
  • headdress
Compounds
[Figure not preserved: candidate analyses, one marked Selected and two marked Rejected]
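Generating candidate analyses for a compound can be sketched as follows. This Python example only enumerates two-part splits whose halves are both known words; how a system then selects one candidate and rejects the others is not shown. The function name `split_compound` and the small lexicon are assumptions for illustration.

```python
def split_compound(word, lexicon, min_part_len=3):
    """Return every two-part split of `word` whose parts are both in `lexicon`."""
    w = word.lower()
    return [
        (w[:i], w[i:])
        for i in range(min_part_len, len(w) - min_part_len + 1)
        if w[:i] in lexicon and w[i:] in lexicon
    ]

# Hypothetical lexicon (assumption); in practice it would come from the corpus.
lexicon = {"east", "ward", "egg", "shell", "farm", "house", "head", "dress"}

for compound in ("eastward", "eggshell", "farmhouse", "headdress"):
    print(compound, "->", split_compound(compound, lexicon))
```

Splitting compounds this way lets the translation model reuse what it knows about the parts instead of treating each compound as an unseen word.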
4. Where we stand today
Our project is working on:
• improving automatic morphology
• integrating Egypt statistical machine translation into our package for easy application
• improving translation by using morphology
• testing with Swahili-English
4.1 Improving automatic morphology
• Swahili, Somali, Urdu, Finnish
• Compounds: English, Finnish
Swahili
nilimupenda
nitakamupenda
Swahili verb
4.2 Integrating Egypt MT software into our front-end
• Linguistica has a user-friendly front end.
• Linguistica is written in C++ and compiles under Windows, MacOS, and Linux.
• Open source
4.3 Improving translations using morphology
• Developing mathematical models
• Other researchers have done a small amount of work in this area, but their goal has largely been to use morphology to strip off affixes.
4.4 Testing with Swahili
• Eight books from the New Testament are available on the internet.
The end