Noun Phrase driven Bootstrapping of Word Alignment for Syntax Projection. Vamshi Ambati, 14 Sept 2007, Student Research Symposium.
Agenda • Rule Learning for MT • Syntax Projection Task • Word Alignment Task • Bootstrapping Word Alignment • Experiment and Results
Machine Translation for Resource-poor Languages • A large share of human languages are ‘resource-poor’ • Few parallel corpora with major languages • Little monolingual text • Few annotation tools • Few grammarians • Few bilingual speakers • Machine translation in such a scenario is extremely difficult
Machine Translation for Resource-poor Languages • AVENUE [Lavie et al. ’03] –
Rule Learning for MT
NP::NP [PP NP] -> [NP PP]
(
  (X1::Y2)
  (X2::Y1)
  (X0 = X2)
  ((Y1 NUM) = (X2 NUM))
  ((Y1 NUM) = (Y2 NUM))
  ((Y1 PERS) = (Y2 PERS))
  (Y0 = Y1)
)
PP::PP [ADVP NP POSTP] -> [ADVP PREP NP]
(
  (X1::Y1)
  (X2::Y3)
  (X3::Y2)
  (X0 = X3)
  (Y0 = Y2)
)
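As a rough illustration of what the constituent links in such a rule do, the sketch below applies only the reordering part, that is, the `(Xi::Yj)` links, and ignores the unification constraints such as `((Y1 NUM) = (X2 NUM))`. The function name `apply_alignments` and the dictionary encoding of the links are assumptions made for this example; they are not the AVENUE rule format or its transfer engine.

```python
def apply_alignments(x_children, links, y_length):
    """Place each source constituent Xi into its linked target slot Yj.

    links maps 1-based source positions to 1-based target positions,
    e.g. {1: 2, 2: 1} encodes (X1::Y2) (X2::Y1).
    Target slots with no link stay None.
    """
    y_children = [None] * y_length
    for x_pos, y_pos in links.items():
        y_children[y_pos - 1] = x_children[x_pos - 1]
    return y_children


# NP::NP [PP NP] -> [NP PP] with (X1::Y2) (X2::Y1)
np_result = apply_alignments(["PP", "NP"], {1: 2, 2: 1}, 2)

# PP::PP [ADVP NP POSTP] -> [ADVP PREP NP] with (X1::Y1) (X2::Y3) (X3::Y2)
pp_result = apply_alignments(["ADVP", "NP", "POSTP"], {1: 1, 2: 3, 3: 2}, 3)

print(np_result)  # ['NP', 'PP']
print(pp_result)  # ['ADVP', 'POSTP', 'NP']
```

In the second rule the source postposition lands in the slot that the target side labels PREP, which is the head-final to head-initial reordering the rule encodes.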
How can such rules be learnt? • Given annotated data for the TL, there are creative ways to do this • Nothing is more valuable than annotated data • But these are ‘resource-poor’ languages • Can we start from the source side and transfer its annotation?
Syntax Projection • Named Entity Projection [Rama] ate an apple rAma ne ek apple khaya
Syntax Projection • Named Entity Projection [Rama] ate an apple [rAma] ne ek apple khaya
Syntax Projection • Base NP Projection [Rama] ate [an apple] rAma ne ek apple khaya
Syntax Projection • Base NP Projection [Rama] ate [an apple] [rAma] ne [ek apple] khaya
Syntax Projection • Constituent Phrase Projection rAma ne ek apple khaya
Syntax Projection • Constituent Phrase Projection
Rule Learning Goal • English: Rama ate an apple • Hindi: rAma ne apple khaya • S::S [NP NP VP] -> [NP VP NP] • S::S [NP NP ‘khaya’] -> [NP ‘ate’ NP] • S::S [‘rAma’ ‘ne’ NP ‘khaya’] -> [‘Rama’ ‘ate’ NP]
Word Alignment Task • Training data • Source language • f : source sentence (Hindi) • j = 1,2,...,J • Target language • e : target sentence (English) • i = 1,2,...,I
Word Alignment Models • IBM1 – lexical probabilities only • IBM2 – lexicon plus absolute position • IBM3 – plus fertilities • IBM4 – inverted relative position alignment • IBM5 – non-deficient version of model 4 • HMM – lexicon plus relative position [Brown et al. 1993, Vogel et al. 1996, Och et al. 1999]
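To make the simplest of these models concrete, here is a minimal sketch of IBM Model 1 trained with EM, followed by a Viterbi alignment read off the learned lexical probabilities t(f | e). It is a simplification under stated assumptions: the NULL word is omitted, the three-sentence German-English corpus is a standard toy example rather than data from this talk, and the names `train_ibm1` and `viterbi` are made up for the illustration. The experiments in this talk use IBM-4, which this sketch does not implement.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """pairs: list of (f_tokens, e_tokens). Returns t[(f, e)] ~ P(f | e)."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)
        for fs, es in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)  # normaliser for this f
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise expected counts per target word e
        t = defaultdict(float, {fe: count[fe] / total[fe[1]] for fe in count})
    return t


def viterbi(fs, es, t):
    """Link each source position j to its most probable target position i."""
    links = []
    for j, f in enumerate(fs, start=1):
        best_i = max(range(1, len(es) + 1), key=lambda i: t[(f, es[i - 1])])
        links.append((j, best_i))
    return links


# Toy corpus (assumption, not from the slides)
corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t = train_ibm1(corpus)
print(viterbi(["das", "Haus"], ["the", "house"], t))
```

Because Model 1 uses lexical probabilities only, each source word picks its target independently; the higher models in the list add position and fertility terms on top of this.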
Our Approach • Better Syntax Projection requires better Word Alignment • Our Hypothesis: Word Alignment can be improved using Syntax projection • Project Base NPs to TL and obtain a clean NP table • Perform a constrained alignment in Parallel Corpus using the NP table
Why Base NPs? • NPs are semantically and syntactically cohesive across languages • NPs show minimal categorical divergence compared with other phrase types • NPs are building blocks of a sentence and their translation improves MT quality [Philipp Koehn, PhD thesis 2003]
Constrained Alignment [PESA: Phrase Pair Extraction as Sentence Splitting, Vogel ’05 ]
Constrained Alignment • Ex: Rama ate [an apple] rAma ne [ek apple] khaya
Constrained Alignment • Ex: Rama ate [an apple] rAma ne khaya [ek apple]
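One way to read the constraint in this example is that a word inside a paired NP may only link to words inside its partner NP, and a word outside every NP may only link to words outside every NP. The sketch below enumerates the links that survive under that reading. This interpretation, the function name `allowed_links`, and the span encoding are assumptions for illustration; the slides do not spell out how the constraint is implemented.

```python
def allowed_links(src_len, tgt_len, np_pairs):
    """Return the set of 1-based (source_pos, target_pos) links that
    respect the NP pairs.

    np_pairs is a list of ((src_lo, src_hi), (tgt_lo, tgt_hi)) spans,
    inclusive. A link is allowed when both positions fall in the same
    NP pair, or both fall outside all NP pairs.
    """
    def region(pos, spans):
        for k, (lo, hi) in enumerate(spans):
            if lo <= pos <= hi:
                return k
        return -1  # outside every NP

    src_spans = [s for s, _ in np_pairs]
    tgt_spans = [t for _, t in np_pairs]
    return {
        (j, i)
        for j in range(1, src_len + 1)
        for i in range(1, tgt_len + 1)
        if region(j, src_spans) == region(i, tgt_spans)
    }


# Rama ate [an apple]  /  rAma ne [ek apple] khaya
links = allowed_links(4, 5, [((3, 4), (3, 4))])
print(sorted(links))
```

For this sentence pair the constraint cuts the 20 possible links down to 10: "an" and "apple" can only reach "ek" and "apple", while "Rama" and "ate" can only reach "rAma", "ne" and "khaya".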
NP based Bootstrapping: Algorithm • Word-align ‘S’ and ‘T’ using IBM-4 • Extract NPs on the source side using the parse • Extract their translations by harvesting the Viterbi alignment • Calculate features for NP pairs • Prune based on thresholds • Perform constrained word alignment and lexicon extraction using the NP table
Corpus
There are quite a large number of Malayalees living here .
is found in the west coast of Great Nicobar called the Magapod Island .
Plotemy calls them ' Nagadip ' , a Hindu name for naked island
malayAlama logoM kI bahuwa badZI saMKyA hE .
xvIpa ke paScimI wata para sWiwa mEgApOda xvIpa meM pAyA jAwA hE .
plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE .
Source side Parsed
There are quite a large number of Malayalees living here .
is found in the west coast of Great Nicobar called the Magapod Island .
Plotemy calls them ' Nagadip ' , a Hindu name for naked island
(S1 (S (NP (EX There)) (VP (AUX are) (NP (NP (PDT quite) (DT a) (JJ large) (NN number)) (PP (IN of) (NP (NP (NNS Malayalees)) (VP (VBG living) (ADVP (RB here))))))) (. .)))
(S1 (S (VP (AUX is) (VP (VBN found) (PP (IN in) (NP (NP (DT the) (JJ west) (NN coast)) (PP (IN of) (NP (NNP Great) (NNP Nicobar))) (VP (VBN called) (S (NP (DT the) (NNP Magapod) (NNP Island)))))))) (. .)))
(S1 (S (NP (NNP Plotemy)) (VP (VBZ calls) (SBAR (S (NP (PRP them)) (VP (POS ') (NP (NP (NNP Nagadip) (POS ')) (, ,) (NP (NP (DT a) (NNP Hindu) (NN name)) (PP (IN for) (NP (JJ naked) (NN island))))))))) (. .)))
Aligned Corpus
;;Sentence id = 1
SL:There are quite a large number of Malayalees living here .
TL:malayAlama logoM kI bahuwa badZI saMKyA hE .
Alignment:((1,8),(2,7),(11,8),(3,1),(4,8),(5,5),(6,6),(7,3),(8,1),(9,1),(10,1))
;;Sentence id = 2
SL:is found in the west coast of Great Nicobar called the Magapod Island .
TL:xvIpa ke paScimI wata para sWiwa mEgApOda xvIpa meM pAyA jAwA hE .
Alignment:((1,12),(2,10),(11,2),(12,7),(13,1),(14,13),(3,9),(4,2),(5,4),(6,4),(7,2),(8,7),(9,1),(10,11))
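A short sketch of how an alignment line in this format could be read back into word pairs. Reading each pair as (SL position, TL position), both 1-based, is an assumption; it is consistent with the NP pairs harvested later in the talk (for example, SL word 8 "Malayalees" links to TL word 1 "malayAlama"). The helper name `parse_alignment` is made up for the example.

```python
import re


def parse_alignment(line):
    """Parse 'Alignment:((1,8),(2,7),...)' into a list of (sl, tl) ints."""
    return [(int(s), int(t)) for s, t in re.findall(r"\((\d+),(\d+)\)", line)]


sl = "There are quite a large number of Malayalees living here .".split()
tl = "malayAlama logoM kI bahuwa badZI saMKyA hE .".split()
line = ("Alignment:((1,8),(2,7),(11,8),(3,1),(4,8),(5,5),(6,6),"
        "(7,3),(8,1),(9,1),(10,1))")

links = parse_alignment(line)
for s, t in sorted(links):
    # positions are 1-based in the file, lists are 0-based in Python
    print(f"{sl[s - 1]:>12} -> {tl[t - 1]}")
```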
Extract Source NPs • NP:1:There :. • NP:1:quite a large number :. • NP:1:Malayalees : • NP:2:the west coast : • NP:2:Great Nicobar : • NP:2:the Magapod Island : • NP:6:Plotemy : • NP:6:them : • NP:6:Nagadip ' : • NP:6:a Hindu name : • NP:6:naked island :
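The source NPs listed here match what one gets by taking every NP node in the bracketed parse that contains no other NP node. The sketch below does that with a small hand-written bracket parser so that it has no dependencies; the function names and the tuple representation of the tree are assumptions for the example, not the tooling used in the talk.

```python
import re


def parse_tree(text):
    """Parse a bracketed tree into (label, children); leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def leaves(tree):
    out = []
    for child in tree[1]:
        out.extend([child] if isinstance(child, str) else leaves(child))
    return out


def contains_np(tree):
    """True if any proper descendant of this node is labelled NP."""
    return any(
        not isinstance(c, str) and (c[0] == "NP" or contains_np(c))
        for c in tree[1]
    )


def base_nps(tree):
    """Collect NPs that dominate no other NP, in left-to-right order."""
    found = []
    if tree[0] == "NP" and not contains_np(tree):
        found.append(" ".join(leaves(tree)))
    for child in tree[1]:
        if not isinstance(child, str):
            found.extend(base_nps(child))
    return found


parse = ("(S1 (S (NP (EX There)) (VP (AUX are) (NP (NP (PDT quite) (DT a) "
         "(JJ large) (NN number)) (PP (IN of) (NP (NP (NNS Malayalees)) "
         "(VP (VBG living) (ADVP (RB here))))))) (. .)))")
nps = base_nps(parse_tree(parse))
print(nps)  # ['There', 'quite a large number', 'Malayalees']
```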
Extract NP translation Pairs • NP:1:There :. • NP:1:quite a large number :malayAlama logoM kI bahuwa badZI saMKyA hE . • NP:1:Malayalees :malayAlama • NP:2:the west coast :ke paScimI wata • NP:2:Great Nicobar :xvIpa ke paScimI wata para sWiwa mEgApOda • NP:2:the Magapod Island :xvIpa ke paScimI wata para sWiwa mEgApOda • NP:6:Plotemy :plotemI • NP:6:them :inheM • NP:6:Nagadip ' :plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke • NP:6:a Hindu name :plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE . • NP:6:naked island :plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna
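The pairs listed here for sentences 1 and 2 are consistent with a simple harvesting rule: take every TL position linked to a word of the source NP and return the contiguous TL span from the smallest to the largest of them. The sketch below applies that rule to sentence 1. The rule is inferred from the listing rather than stated on the slides, and the function name `harvest` is an assumption.

```python
def harvest(tl_tokens, alignment, np_span):
    """Return the TL phrase covering all words linked to a source NP.

    alignment: list of 1-based (sl_pos, tl_pos) links.
    np_span:   1-based inclusive (lo, hi) span of the NP on the SL side.
    Returns None when no word of the NP is aligned.
    """
    lo, hi = np_span
    linked = [t for s, t in alignment if lo <= s <= hi]
    if not linked:
        return None
    return " ".join(tl_tokens[min(linked) - 1:max(linked)])


tl = "malayAlama logoM kI bahuwa badZI saMKyA hE .".split()
alignment = [(1, 8), (2, 7), (11, 8), (3, 1), (4, 8), (5, 5),
             (6, 6), (7, 3), (8, 1), (9, 1), (10, 1)]

print(harvest(tl, alignment, (1, 1)))  # There
print(harvest(tl, alignment, (3, 6)))  # quite a large number
print(harvest(tl, alignment, (8, 8)))  # Malayalees
```

A single stray link is enough to stretch the span across the whole sentence, as happens for "quite a large number", which is why the later feature and pruning steps are needed.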
Feature Extraction for NP Pairs • Features • Source length in words • Target length in words • Absolute length difference • Frequency of the source base NP • Frequency of the target base NP • Frequency of the source-target pair • Source-to-target probability • Target-to-source probability
Calculate Features of NP pairs • NP:1:There:.:1:1:0:2449:2318:258:0.00129355:0.000321318 • NP:1:quite a large number:malayAlama logoM kI bahuwa badZI saMKyA hE .:4:8:4:3:1:1:1.95591979786667e-13:0 • NP:1:Malayalees:malayAlama:1:1:0:3:2:1:0.614945:0.0935706 • NP:2:the west coast:ke paScimI wata:3:3:0:15:2:1:2.40946933496697e-06:5.34403517215648e-11 • NP:2:Great Nicobar:xvIpa ke paScimI wata para sWiwa mEgApOda:2:7:5:6:2:1:1.28793968923196e-05:0 • NP:2:the Magapod Island:xvIpa ke paScimI wata para sWiwa mEgApOda:3:7:4:1:2:1:2.19930690076524e-06:0 • NP:6:Plotemy:plotemI:1:1:0:1:1:1:1:1 • NP:6:them:inheM:1:1:0:2153:27:16:0.0168737:0 • NP:6:Nagadip ':plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke:2:15:13:1:1:1:3.06461991111111e-05:0 • NP:6:a Hindu name:plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE .:3:20:17:1:1:1:1.31075474321488e-12:0 • NP:6:naked island:plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna:2:13:11:1:1:1:1.16829204884615e-06:0
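A small sketch for reading one of these records into named fields. It assumes the eight trailing colon-separated numbers follow the order of the feature list (source length, target length, absolute length difference, source frequency, target frequency, pair frequency, source-to-target probability, target-to-source probability); the slides do not state the order explicitly, and the class and field names are invented for the example.

```python
from typing import NamedTuple


class NPPair(NamedTuple):
    sentence_id: int
    source: str
    target: str
    src_len: int
    tgt_len: int
    len_diff: int
    src_freq: int
    tgt_freq: int
    pair_freq: int
    p_s2t: float
    p_t2s: float


def parse_record(line):
    """Split 'NP:id:source:target:<8 numeric features>' into an NPPair."""
    # Split the 8 numeric fields off the right so the phrases stay intact.
    head, *nums = line.rsplit(":", 8)
    _tag, sid, source, target = head.split(":", 3)
    ints = [int(n) for n in nums[:6]]
    return NPPair(int(sid), source, target, *ints,
                  float(nums[6]), float(nums[7]))


rec = parse_record(
    "NP:2:the west coast:ke paScimI wata:3:3:0:15:2:1:"
    "2.40946933496697e-06:5.34403517215648e-11"
)
print(rec.source, "|", rec.target, "|", rec.len_diff, rec.p_s2t)
```

The first three features can be recomputed from the phrases themselves, which gives a cheap sanity check on the assumed field order.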
Prune based on manual thresholds • NP:1:There:. • NP:1:Malayalees:malayAlama • NP:2:the west coast:ke paScimI wata • NP:6:Plotemy:plotemI • NP:6:them:inheM
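The thresholds themselves are not given on the slides. As a sketch only, the filter below uses a single hypothetical cutoff on the absolute length difference (`max_len_diff=2`, an invented value). On the candidates copied from the feature listing it happens to keep the same pairs as the pruned list, but the actual pruning may well have combined several features.

```python
# (source NP, harvested target, absolute length difference), copied from
# the feature listing for a subset of the candidate pairs.
candidates = [
    ("There", ".", 0),
    ("quite a large number", "malayAlama logoM kI bahuwa badZI saMKyA hE .", 4),
    ("Malayalees", "malayAlama", 0),
    ("the west coast", "ke paScimI wata", 0),
    ("Great Nicobar", "xvIpa ke paScimI wata para sWiwa mEgApOda", 5),
    ("Plotemy", "plotemI", 0),
    ("them", "inheM", 0),
]


def prune(pairs, max_len_diff=2):
    """Keep pairs whose length difference is within the (hypothetical) cutoff."""
    return [(src, tgt) for src, tgt, diff in pairs if diff <= max_len_diff]


np_table = prune(candidates)
for src, tgt in np_table:
    print(f"{src} : {tgt}")
```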
NP based Bootstrapping • Word-align ‘S’ and ‘T’ using IBM-4 • Extract NPs on the source side using the parse • Extract their translations by harvesting the Viterbi alignment • Calculate features for NP pairs • Prune based on thresholds • Perform constrained word alignment and lexicon extraction using the NP table (Folding)
Constrained Alignment: NP Folding (each NP from the NP table is replaced by a single NP token; the folded phrase is shown in brackets)
English:
NP[There] are quite a large number of NP[Malayalees] living here .
is found in NP[the west coast] of Great Nicobar called the Magapod Island .
NP[Plotemy] calls NP[them] ' Nagadip ' , a Hindu name for naked island .
Hindi:
NP[malayAlama] logoM kI bahuwa badZI saMKyA hE NP[.]
xvIpa NP[ke paScimI wata] para sWiwa mEgApOda xvIpa meM pAyA jAwA hE .
NP[plotemI] NP[inheM] nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE .
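A minimal sketch of the folding step: every span that appears in the NP table is collapsed into one placeholder token, so the subsequent word alignment runs over shorter sentences in which each NP pair behaves like a single word. The function name `fold`, the placeholder string, and the span encoding are assumptions for the illustration.

```python
def fold(tokens, spans, placeholder="NP"):
    """Replace each 1-based inclusive span with one placeholder token.

    Spans are assumed not to overlap.
    """
    out, pos = [], 1
    for lo, hi in sorted(spans):
        out.extend(tokens[pos - 1:lo - 1])  # words before this NP
        out.append(placeholder)
        pos = hi + 1
    out.extend(tokens[pos - 1:])            # words after the last NP
    return out


english = "There are quite a large number of Malayalees living here .".split()
hindi = "malayAlama logoM kI bahuwa badZI saMKyA hE .".split()

# Spans of the surviving NP pairs: There <-> "." and Malayalees <-> malayAlama
folded_en = fold(english, [(1, 1), (8, 8)])
folded_hi = fold(hindi, [(8, 8), (1, 1)])

print(" ".join(folded_en))
print(" ".join(folded_hi))
```

In a fuller version each placeholder would carry an identifier tying it to its partner on the other side, so that the aligner can be forced to link the two.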
Experiments • English Hindi (Resource constrained) • English Hindi
Word Alignment Experiments • Training: 5000 sentences • Testing: 200 sentences • Human Extracted NP table – 21,736
Word Alignment Results (5k corpus) • Experiments with 5k training corpus and 200 test sentences
NP Projection Results (5k) • Evaluation: 21736 NP_Table harvested from 5k test bed corpus
Word Alignment Experiments • Training: 50K Eng-Hin Corpus • Testing: 200 Eng-Hin aligned sentences • Human Extracted NP table – 21,736
Word Alignment Results (55k corpus) • Experiments with 55k training corpus and 200 test sentences
NP Projection Results (55k) • Evaluation: 21736 NP_Table created by human alignment
From here.. • Improvements • Reliable NP Projection • Hierarchical Word Alignment • Machine Translation • Rule Learning • Refined Probabilistic translation Lexicon • Clean Linguistically motivated Phrase table with probabilities