Bridging the Gap: Machine Translation for Lesser Resourced Languages. Christian Monson, Ariadna Font Llitjós, Lori Levin, Alon Lavie, Alison Alvarez, Roberto Aranovich, Jaime Carbonell, Robert Frederking, Erik Peterson, Kathrin Probst. Mapudungun 900,000 Speakers.
• Mapudungun: 900,000 speakers
• Inupiaq: hundreds of speakers
• Katrina: hundreds of speakers
• Quechua: 6 million speakers
Machine Translation (MT)
Source Language -> Target Language, at three depths of analysis:
• Direct: Statistical MT, Example-Based MT
  + Short development time
  - Requires a large bilingual corpus
• Transfer Rule-Based MT (Our Approach): Morphological Analysis + Syntactic Parsing, then Text Generation
• Interlingua: Semantic Analysis, then Sentence Planning
Deep-analysis MT (the levels above Direct):
  + High quality
  - Expertise-intensive development cycle
Goal: Automate the development of deep-analysis MT
Our Position: Linguistic structure and bilingual informants help automate the development of deep-analysis machine translation systems.
Sub-Problems • Morphology Induction • Syntax Refinement
Morphology Induction 1. Linguistic Structure 2. Bilingual Informants
Paradigm Discovery in 3 Steps
• Search out partial paradigms in a network of candidates
• Cluster overlapping partial paradigms
• Filter the clusters, keeping the largest clusters most likely to model true paradigms

[Figure: A portion of a Spanish paradigm candidate network. Each node lists a candidate suffix set, the number of stems that occur with every suffix in the set, and sample stems.]
e.er.erá.ido.ieron.ió    28: deb, escog, ofrec, reconoc, vend, ...
e.ido.ieron.ir.irá.ió    28: asist, dirig, exig, ocurr, sufr, ...
azar.e.ido.ieron.ir.ió    1: sal
e.er.erá.ieron.ió        32: deb, padec, romp, ...
e.erá.ido.ieron.ió       28: deb, escog, ...
e.er.ido.ieron.ió        46: deb, parec, recog, ...
e.ido.ieron.irá.ió       28: asist, dirig, ...
e.ido.ieron.ir.ió        39: asist, bat, sal, ...
e.ido.ieron.ió           86: asist, deb, hund, ...
e.erá.ieron.ió           32: deb, padec, ...
er.ido.ieron.ió          58: ascend, ejerc, recog, ...
ido.ieron.ir.ió          44: interrump, sal, ...
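The nodes in the candidate network can be read as "suffix set, plus the stems that take every suffix in the set". The sketch below illustrates only that idea on a tiny hand-picked word list; it is not the authors' system. The function names, the `max_suffix_len` cutoff, and the word list are assumptions made for illustration.

```python
from collections import defaultdict


def stems_by_suffix(words, max_suffix_len=5):
    """Map every candidate suffix to the set of stems it attaches to."""
    table = defaultdict(set)
    for word in words:
        # Try every split point that leaves a non-empty stem and suffix.
        for cut in range(1, len(word)):
            stem, suffix = word[:cut], word[cut:]
            if len(suffix) <= max_suffix_len:
                table[suffix].add(stem)
    return table


def scheme_stems(table, suffixes):
    """Return the stems that occur with every suffix in `suffixes`."""
    stem_sets = [table.get(s, set()) for s in suffixes]
    return set.intersection(*stem_sets) if stem_sets else set()


# Hypothetical mini-vocabulary: two -er verbs and one -ir verb.
words = [
    "debe", "deber", "debido", "debieron", "debió",
    "vende", "vender", "vendido", "vendieron", "vendió",
    "asiste", "asistir", "asistido", "asistieron", "asistió",
]
table = stems_by_suffix(words)

# A smaller suffix set is shared by more stems ...
small = scheme_stems(table, ["e", "ido", "ieron", "ió"])
# ... and adding a suffix can only keep or shrink the stem set.
larger = scheme_stems(table, ["e", "er", "ido", "ieron", "ió"])

print(sorted(small))   # ['asist', 'deb', 'vend']
print(sorted(larger))  # ['deb', 'vend']
```

This mirrors the pattern visible in the figure, where e.ido.ieron.ió covers 86 stems while the larger set e.er.ido.ieron.ió covers 46.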
Morpho Challenge 2007: Unsupervised Morphology Induction Competition
• English
  • 3rd Place Overall
  • Bested the Strong Baseline Morfessor (Creutz, 2006)
• German
  • 1st Place when Combined with Morfessor
• No Mapudungun yet
• Agglutinative sequences of suffixes coming soon
Our Machine Translation Architecture
INPUT TEXT
-> Morphology Analysis (uses the Morphology Analysis Lexicon)
-> Machine Translation System (uses the Grammar & Lexicon)
-> Morphology Generation (uses the Morphology Generation Lexicon)
-> OUTPUT TEXT
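The three-stage flow above can be sketched as three functions chained together. This is a toy illustration, not the actual system: every lexicon entry is a hypothetical stand-in filled in from the single example sentence used later in the talk, and the two grammar effects in `translate` are hard-coded where the real system would use transfer rules.

```python
# Hypothetical toy resources standing in for the three boxes in the diagram.
ANALYSIS_LEXICON = {
    "pelafiñ": {"stem": "pe", "neg": True, "subj": "1sg", "obj": "3"},
    "Maria": {"stem": "Maria", "human": True},
}
TRANSFER_LEXICON = {"pe": "ver", "Maria": "María"}
GENERATION_LEXICON = {("ver", "1sg", "past"): "vi"}


def analyze(text):
    """Morphology analysis: look up each word's stem and features."""
    return [dict(ANALYSIS_LEXICON[word]) for word in text.split()]


def translate(units):
    """MT system: lexical transfer plus two hard-coded grammar effects."""
    out = []
    for unit in units:
        lemma = TRANSFER_LEXICON[unit["stem"]]
        if "subj" in unit:  # treated as a verb in this toy example
            if unit.get("neg"):
                out.append({"word": "no"})
            # Tense is assigned here, as on the transfer slides.
            out.append({"lemma": lemma, "subj": unit["subj"], "tense": "past"})
        else:  # treated as a noun
            if unit.get("human"):
                out.append({"word": "a"})
            out.append({"word": lemma})
    return out


def generate(units):
    """Morphology generation: inflect lemmas, pass fixed words through."""
    words = []
    for unit in units:
        if "word" in unit:
            words.append(unit["word"])
        else:
            key = (unit["lemma"], unit["subj"], unit["tense"])
            words.append(GENERATION_LEXICON[key])
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:]


def run_pipeline(text):
    """INPUT TEXT -> analysis -> MT system -> generation -> OUTPUT TEXT."""
    return generate(translate(analyze(text)))


print(run_pipeline("pelafiñ Maria"))  # No vi a María
```

Keeping analysis and generation as separate stages means the middle stage only ever sees stems and features, never inflected surface forms.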
Sub-Problems • Morphology Induction • Syntax Refinement
Syntax Refinement 1. Linguistic Structure 2. Bilingual Informants
Linguistic Structure: Syntax
• English: I didn’t see Maria
• Mapudungun: pelafiñ Maria
    pe    -la    -fi      -ñ                    Maria
    see   -neg   -3.obj   -1.subj.indicative    Maria
• Spanish: No vi a María
    No    vi                            a     María
    neg   see.1.subj.past.indicative    acc   Maria
pe-la-fi-ñ Maria
The Mapudungun tree is built one node at a time:
1. V -> pe
2. VSuff -> la: negation = +
3. VSuffG -> VSuff (la): pass all features up
4. VSuff -> fi: object person = 3
5. VSuffG -> VSuffG VSuff (fi): pass all features up from both children
6. VSuff -> ñ: person = 1, number = sg, mood = ind
7. VSuffG -> VSuffG VSuff (ñ): pass all features up from both children
8. V -> V VSuffG: pass all features up from both children; check that 1) negation = + and 2) tense is undefined
9. N -> Maria, NP -> N: person = 3, number = sg, human = +
10. VP -> V NP: pass features up from V; check that NP is human = +
11. S -> VP

Resulting tree: [S [VP [V [V pe] [VSuffG [VSuffG [VSuffG [VSuff la]] [VSuff fi]] [VSuff ñ]]] [NP [N Maria]]]]
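The "pass features up" and "check that" steps can be sketched with plain dictionaries. This is a minimal illustration, assuming that passing features up from both children amounts to merging two feature sets and failing on a conflict; the function names are hypothetical, while the feature names and values are the ones shown on the slides.

```python
def unify(a, b):
    """Merge two feature dicts; return None if they disagree on a feature."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged


# Features contributed by each morpheme, as given on the slides.
MORPHEMES = {
    "pe": {},
    "la": {"negation": "+"},
    "fi": {"object person": "3"},
    "ñ": {"person": "1", "number": "sg", "mood": "ind"},
}


def analyze_verb(stem, suffixes):
    """Build V -> V VSuffG bottom-up, applying the checks on that node."""
    group = {}
    for suffix in suffixes:
        # VSuffG -> VSuffG VSuff: pass features up from both children.
        group = unify(group, MORPHEMES[suffix])
        if group is None:
            return None
    verb = unify(MORPHEMES[stem], group)
    # Checks shown for this rule: negation = + and tense is undefined.
    if verb is None or verb.get("negation") != "+" or "tense" in verb:
        return None
    return verb


def build_vp(verb, noun_phrase):
    """VP -> V NP: check that NP is human = +, then pass V's features up."""
    if noun_phrase.get("human") != "+":
        return None
    return dict(verb)


verb = analyze_verb("pe", ["la", "fi", "ñ"])
maria = {"person": "3", "number": "sg", "human": "+"}
vp = build_vp(verb, maria)
print(vp)
```

A verb without the negative suffix, for example `analyze_verb("pe", ["fi", "ñ"])`, fails the negation check and returns `None`, so this particular rule does not apply to it.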
Transfer to Spanish: Top-Down
1. Build the Spanish tree top-down from the Mapudungun tree: S, then VP.
2. Pass all features to the Spanish side.
3. Pass all features down.
4. Pass object features down.
5. The accusative marker “a” on objects is introduced because human = +. The rule that does this:

VP::VP [VBar NP] -> [VBar "a" NP]
(
  (X1::Y1)
  (X2::Y3)
  ((X2 type) = (*NOT* personal))
  ((X2 human) =c +)
  (X0 = X1)
  ((X0 object) = X2)
  (Y0 = X0)
  ((Y0 object) = (X0 object))
  (Y1 = Y0)
  (Y3 = (Y0 object))
  ((Y1 objmarker person) = (Y3 person))
  ((Y1 objmarker number) = (Y3 number))
  ((Y1 objmarker gender) = (Y3 gender))
)

6. Pass person, number, and mood features to the Spanish verb; assign tense = past.
7. “no” is introduced because negation = +.
8. The verb stem pe is transferred as ver.

Resulting Spanish tree: [S [VP [V “no” [V ver]] “a” [NP [N Maria]]]]
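To make the VP::VP rule concrete, here is one possible reading of it as a Python function. It is a sketch under assumptions, not the rule interpreter itself: X0, X1, X2 are taken to be the source-side VP, VBar, and NP, Y0, Y1, Y3 their target-side counterparts, and `=c` is read as a constraint that the feature must already be present with that value. The function name and the tuple-based output format are hypothetical.

```python
def apply_vp_rule(vbar, noun_phrase):
    """Apply VP::VP [VBar NP] -> [VBar "a" NP] to two feature dicts.

    Returns the target-side constituents, or None if the rule does not apply.
    """
    # ((X2 type) = (*NOT* personal)): the object's type must not be 'personal'.
    if noun_phrase.get("type") == "personal":
        return None
    # ((X2 human) =c +): the object must already be marked human = +.
    if noun_phrase.get("human") != "+":
        return None

    # (X0 = X1) and ((X0 object) = X2): the VP takes the verb's features
    # and records the NP's features as its object.
    x0 = dict(vbar)
    x0["object"] = dict(noun_phrase)

    # (Y0 = X0), (Y1 = Y0), (Y3 = (Y0 object)): copy to the Spanish side.
    y0 = x0
    y1 = dict(y0)
    y3 = dict(y0["object"])

    # ((Y1 objmarker F) = (Y3 F)) for person, number, and gender.
    y1["objmarker"] = {
        feature: y3[feature]
        for feature in ("person", "number", "gender")
        if feature in y3
    }
    return [("VBar", y1), ("literal", "a"), ("NP", y3)]


verb = {"negation": "+", "person": "1", "number": "sg", "mood": "ind"}
maria = {"person": "3", "number": "sg", "human": "+"}

result = apply_vp_rule(verb, maria)
print([label for label, _ in result])  # ['VBar', 'literal', 'NP']
print(result[0][1]["objmarker"])       # {'person': '3', 'number': 'sg'}
```

An object without human = +, such as `{"person": "3", "number": "sg"}`, makes the function return `None`, which corresponds to the rule not firing and no “a” being inserted.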