EBMT
• Example-Based Machine Translation as used in the Pangloss system at Carnegie Mellon University
• Dave Inman
Outline
• EBMT in outline
• What data do we need?
• How do we create a lexicon?
• Indexing the corpus
• Finding chunks to translate
• Matching a chunk against the target
• Quality of translation
• Speed of translation
• Good and bad points
• Conclusions
EBMT in outline - corpus
• Corpus
  • S1: The cat eats a fish. / Le chat mange un poisson.
  • S2: A dog eats a cat. / Un chien mange un chat.
  • …
  • S99,999,999 …
• Index (each source word points to the sentences that contain it)
  • the: S1
  • cat: S1, S2
  • eats: S1, S2
  • …
  • dog: S2
EBMT in outline - find chunks
• A source language sentence is input.
  • The cat eats a dog.
• Chunks of this sentence are matched against the corpus.
  • The cat: S1
  • The cat eats: S1
  • The cat eats a: S1
  • a dog: S2
How does EBMT work in outline - retrieve and align
• 1. The target language sentences are retrieved for each chunk.
  • Chunk: The cat eats (found in S1)
  • Corpus S1: The cat eats a fish. / Le chat mange un poisson.
• 2. The chunks are aligned with the target sentences (hard!).
  • The cat eats → Le chat mange
How does EBMT work in outline - score and combine
• Candidate translations of each chunk are scored to find a good match.
  • The cat eats → Le chat mange (score 78%)
  • The cat eats → Le chat dort (score 43%)
  • …
  • a dog → un chien (score 67%)
  • a dog → le chien (score 56%)
  • a dog → un arbre (score 22%)
• The best translated chunks are put together to make the final translation.
  • The cat eats → Le chat mange
  • a dog → un chien
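The final step above, picking the best-scoring candidate for each chunk and joining the results, can be sketched in a few lines. This is only an illustration using the scores from the slide: the function name `assemble` and the data layout are assumptions, and a real system would also have to handle overlapping chunks and target word order.

```python
# Candidate translations per source chunk, with the scores shown on the slide.
candidates = {
    "The cat eats": [("Le chat mange", 78), ("Le chat dort", 43)],
    "a dog": [("un chien", 67), ("le chien", 56), ("un arbre", 22)],
}

def assemble(chunks, candidates):
    """Join the top-scoring candidate for each chunk, keeping source order."""
    best = [max(candidates[c], key=lambda pair: pair[1])[0] for c in chunks]
    return " ".join(best)

chunks = ["The cat eats", "a dog"]
print(assemble(chunks, candidates))  # Le chat mange un chien
```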
What data do we need?
• A large corpus of parallel sentences, if possible in the same domain as the texts to be translated.
• A bilingual dictionary, although we can induce this from the corpus.
• A target language root/synonym list, so we can see the similarity between words and their inflected forms (e.g. verbs).
• Classes of words that are easily translated, such as numbers, towns and weekdays.
How to create a lexicon
• Take each sentence pair in the corpus.
• For each word in the source sentence, record every word in the target sentence as a possible translation and increment its co-occurrence count.
• Repeat for as many sentence pairs as possible.
• Use a threshold on the counts to select the possible alternative translations of each source word.
How to create a lexicon - example
• The cat eats a fish. / Le chat mange un poisson.
Create a lexicon - after many sentences
• Co-occurrence counts for the source word "the":
  • le, 956
  • la, 925
  • un, 235
  • ------ Threshold ------
  • chat, 47
  • mange, 33
  • poisson, 28
  • …
  • arbre, 18
Create a lexicon - after many sentences
• Co-occurrence counts for the source word "cat":
  • chat, 963
  • ------ Threshold ------
  • le, 604
  • la, 485
  • un, 305
  • mange, 33
  • poisson, 28
  • …
  • arbre, 47
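The counting procedure from the lexicon slides can be sketched as follows. This is a minimal sketch, not the Pangloss code: the names `build_cooccurrence` and `candidates`, the crude tokenizer, and the choice to count each word at most once per sentence pair are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def tokenize(sentence):
    # Lowercase and drop full stops; a real tokenizer would do much more.
    return sentence.lower().replace(".", " ").split()

def build_cooccurrence(pairs):
    """For every source word, count the target words that share a sentence pair with it."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        target_words = set(tokenize(target))
        for s_word in set(tokenize(source)):   # assumption: count once per sentence pair
            for t_word in target_words:
                counts[s_word][t_word] += 1
    return counts

def candidates(counts, word, threshold):
    """Target words whose count reaches the threshold, most frequent first."""
    return [t for t, c in counts[word].most_common() if c >= threshold]

pairs = [
    ("The cat eats a fish.", "Le chat mange un poisson."),
    ("A dog eats a cat.", "Un chien mange un chat."),
]
counts = build_cooccurrence(pairs)
print(candidates(counts, "cat", threshold=2))
```

With only these two sentence pairs, "chat", "mange" and "un" all tie as candidates for "cat", which is why the counts are accumulated over as many sentence pairs as possible before the threshold is applied.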
Indexing the corpus
• For speed, the corpus is indexed on the source language sentences.
• Each word in each source language sentence is stored with information about the corresponding target sentence.
• New sentences can be added to the corpus and the index updated easily.
• Tokens are used for common classes of words (e.g. numbers). This makes matching more effective.
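One way to picture such an index is an inverted index from source tokens to sentence ids and word positions, with numbers collapsed to a shared class token. The sketch below is an assumption about the general shape only: the class name `CorpusIndex`, the `<NUM>` token and the storage layout are hypothetical, not taken from Pangloss.

```python
import re
from collections import defaultdict

NUMBER = re.compile(r"^\d+([.,]\d+)?$")

def normalize(word):
    # Replace any number with a shared class token so "3" and "42" index alike.
    return "<NUM>" if NUMBER.match(word) else word.lower()

def tokenize(sentence):
    return [normalize(w) for w in re.findall(r"\d+(?:[.,]\d+)?|\w+", sentence)]

class CorpusIndex:
    def __init__(self):
        self.pairs = []                    # sentence id -> (source, target)
        self.postings = defaultdict(list)  # token -> [(sentence id, word position), ...]

    def add(self, source, target):
        """Append a sentence pair and update the index incrementally."""
        sid = len(self.pairs)
        self.pairs.append((source, target))
        for pos, tok in enumerate(tokenize(source)):
            self.postings[tok].append((sid, pos))
        return sid

index = CorpusIndex()
index.add("The cat eats 3 fish.", "Le chat mange 3 poissons.")
index.add("A dog eats 42 cats.", "Un chien mange 42 chats.")
print(index.postings["<NUM>"])  # [(0, 3), (1, 3)]
```

Because `add` only appends, new sentence pairs can be indexed without rebuilding anything, and the target sentence is reachable through the sentence id.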
Finding chunks to translate
• Look up each word of the source sentence in the index.
• Look for chunks of the source sentence (at least 2 adjacent words) which match the corpus.
• Select the most recent few matches against the corpus (translation memory).
• Pangloss uses the last 5 matches for any chunk.
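The chunk search can be sketched as below. For brevity this version scans the corpus directly instead of going through the index, and the names `find_chunks`, `min_len` and `max_matches` are assumptions; only the two-word minimum and the limit of 5 matches come from the slide. Sentence ids are 0-based here, so S1 is 0 and S2 is 1.

```python
def tokenize(sentence):
    return sentence.lower().rstrip(".").split()

def occurs_in(chunk, words):
    """True if the word list `chunk` appears as a contiguous run inside `words`."""
    n = len(chunk)
    return any(words[i:i + n] == chunk for i in range(len(words) - n + 1))

def find_chunks(sentence, corpus, min_len=2, max_matches=5):
    """Map each matching chunk to the ids of the most recent corpus sentences containing it."""
    words = tokenize(sentence)
    corpus_words = [tokenize(src) for src, _ in corpus]
    matches = {}
    for start in range(len(words)):
        for end in range(start + min_len, len(words) + 1):
            chunk = words[start:end]
            ids = [sid for sid, cw in enumerate(corpus_words) if occurs_in(chunk, cw)]
            if not ids:
                break  # a longer chunk from this start cannot match either
            matches[" ".join(chunk)] = ids[-max_matches:]  # keep the last few matches
    return matches

corpus = [
    ("The cat eats a fish.", "Le chat mange un poisson."),
    ("A dog eats a cat.", "Un chien mange un chat."),
]
for chunk, ids in find_chunks("The cat eats a dog.", corpus).items():
    print(chunk, ids)
```

For the running example this finds seven chunks, including "the cat eats a" in the first sentence only and "a dog" in the second sentence only.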
Matching a chunk against the target
• For each source chunk found previously, retrieve the target sentences from the corpus (using the index).
• Try to find the translation of the source chunk within these sentences.
• This is the hard bit!
• Look for the minimum and maximum segments in the target sentences which could correspond to the source chunk.
• Score each of these segments.
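The slide does not say how the minimum and maximum segments are found. One plausible reading, sketched here purely as an assumption, uses the bilingual lexicon: the minimum segment spans the target words known to translate words of the chunk, and the maximum segment grows outwards until it meets a target word that translates some source word outside the chunk. The function `segment_bounds` and the toy lexicon are hypothetical.

```python
def segment_bounds(chunk, source, target, lexicon):
    """Return ((min_start, min_end), (max_start, max_end)) as inclusive target indices, or None."""
    inside = {t for w in chunk for t in lexicon.get(w, ())}
    outside = {t for w in source if w not in chunk for t in lexicon.get(w, ())} - inside
    anchors = [i for i, t in enumerate(target) if t in inside]
    if not anchors:
        return None  # no lexicon evidence at all for this chunk
    lo, hi = min(anchors), max(anchors)
    # Grow the maximum segment until a word tied to the rest of the source is reached.
    max_lo, max_hi = lo, hi
    while max_lo > 0 and target[max_lo - 1] not in outside:
        max_lo -= 1
    while max_hi < len(target) - 1 and target[max_hi + 1] not in outside:
        max_hi += 1
    return (lo, hi), (max_lo, max_hi)

source = ["the", "cat", "eats", "a", "fish"]
target = ["le", "chat", "mange", "un", "poisson"]
chunk = ["the", "cat", "eats"]

# Toy lexicon with no entry for "eats", so "mange" is only a possible part of the segment.
lexicon = {"the": ["le", "la"], "cat": ["chat"], "a": ["un"], "fish": ["poisson"]}
print(segment_bounds(chunk, source, target, lexicon))  # ((0, 1), (0, 2))
```

Here the minimum segment is "le chat" and the maximum is "le chat mange"; each candidate segment between the two bounds would then be scored.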
Scoring a segment
• Unmatched words: higher priority is given to sentences containing all the words in an input chunk.
• Noise: higher priority is given to corpus sentences which have fewer extra words.
• Order: higher priority is given to sentences containing the input words in an order closer to their order in the input chunk.
• Morphology: higher priority is given to sentences in which words match exactly rather than against morphological variants.
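The four criteria can be combined into a single number in many ways. The sketch below expresses them as a weighted penalty where lower is better. The weights, the crude `stem` function and the penalty form are all assumptions for illustration; the slides report scores as percentages and do not give the formula Pangloss uses.

```python
def stem(word):
    # Crude stand-in for morphological analysis: strip a trailing "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def score(chunk, sentence, weights=(4.0, 1.0, 2.0, 0.5)):
    """Penalty for matching `chunk` against `sentence`; lower is better. Weights are illustrative."""
    w_unmatched, w_noise, w_order, w_morph = weights
    stems = [stem(w) for w in sentence]
    positions = []           # where each matched chunk word sits in the sentence
    unmatched = morph = 0
    for word in chunk:
        if word in sentence:                 # exact match
            positions.append(sentence.index(word))
        elif stem(word) in stems:            # match only via a morphological variant
            positions.append(stems.index(stem(word)))
            morph += 1
        else:
            unmatched += 1
    noise = len(sentence) - len(positions)   # extra words in the corpus sentence
    # Count adjacent chunk words that appear out of order in the sentence.
    order = sum(1 for a, b in zip(positions, positions[1:]) if b < a)
    return w_unmatched * unmatched + w_noise * noise + w_order * order + w_morph * morph

chunk = ["the", "cat", "eats"]
exact = ["the", "cat", "eats", "a", "fish"]
variant = ["the", "cats", "eat", "a", "fish"]
reordered = ["a", "fish", "eats", "the", "cat"]
unrelated = ["the", "dog", "sleeps"]
for s in (exact, variant, reordered, unrelated):
    print(" ".join(s), score(chunk, s))
```

With these weights the exact match gets the lowest penalty, followed by the morphological variant, the reordered sentence and finally the sentence with unmatched words.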
Whole sentence match
• If we are lucky, the whole sentence will be found in the corpus!
• In that case the target sentence is used directly, with no alignment step needed.
• Useful if translation memory is available (sentences recently translated are added to the corpus).
Quality of translation
• Pangloss was tested on source sentences from a different domain to the examples in the corpus.
• Pangloss “covered” about 70% of the input sentences.
• This means a match was found against the corpus…
• …but not necessarily a good match.
• Others report that around 60% of the translation can be understood by a native speaker. Systran manages about 70%.
Speed of translation
• Translation is much faster than with Systran.
• Simple sentences are translated in seconds.
• New text can be added to the corpus (translation memory) at about 6 MB per minute on a Sun SPARCstation.
• A 270 MB corpus takes 45 minutes to index (270 / 6 = 45).
Good points
• Fast
• Easy to add a new language pair
• No need to analyse languages (much)
• Can induce a dictionary from the corpus
• Allows easy implementation of translation memory
• Graceful degradation as the size of the corpus is reduced
Bad points
• Quality is second best at present (behind Systran)
• Depends on a large corpus of parallel, well-translated sentences
• 30% of the source has no coverage (no translation)
• Word matching is brittle: a human can see a match that Pangloss cannot
• The domain of the corpus should match the domain to be translated, so that chunks can be matched
Conclusions
• An alternative to Systran
• Faster
• Lower quality
• Quick to develop for a new language pair, if a corpus exists!
• Needs no linguistics
• Might improve as bigger corpora become available?