Noun Phrase driven Bootstrapping of Word Alignment for Syntax Projection

Vamshi Ambati • 14 Sept 2007 • Student Research Symposium

Agenda • Rule Learning for MT • Syntax Projection Task • Word Alignment Task • Bootstrapping Word Alignment • Experiment and Results

Machine Translation for Resource-poor Languages • A large share of human languages are ‘resource-poor’ • Little parallel text with major languages • Little monolingual text • Few annotation tools • Few grammarians • Few bilingual speakers • Machine Translation in such a scenario is extremely difficult

Machine Translation for Resource-poor Languages • AVENUE [Lavie et al. ’03] –

Rule Learning for MT
NP::NP [PP NP] -> [NP PP]
(
(X1::Y2)
(X2::Y1)
(X0 = X2)
((Y1 NUM) = (X2 NUM))
((Y1 NUM) = (Y2 NUM))
((Y1 PERS) = (Y2 PERS))
(Y0 = Y1)
)
PP::PP [ADVP NP POSTP] -> [ADVP PREP NP]
(
(X1::Y1)
(X2::Y3)
(X3::Y2)
(X0 = X3)
(Y0 = Y2)
)

How can such rules be learnt? • Given annotated data for the TL, there are creative ways to do this • Nothing is more valuable than annotated data • But these are “resource-poor” languages • Can we start from the source side and transfer its annotation?

Syntax Projection • Named Entity Projection [Rama] ate an apple rAma ne ek apple khaya

Syntax Projection • Named Entity Projection [Rama] ate an apple [rAma] ne ek apple khaya

Syntax Projection • Base NP Projection [Rama] ate [an apple] rAma ne ek apple khaya

Syntax Projection • Base NP Projection [Rama] ate [an apple] [rAma] ne [ek apple] khaya

Syntax Projection Constituent Phrase Projection rAma ne ek apple khaya

Syntax Projection Constituent Phrase Projection

Rule Learning Goal • English: Rama ate an apple • Hindi: rAma ne apple khaya • S::S [NP NP VP] -> [NP VP NP] • S::S [NP NP ‘khaya’] -> [NP ‘ate’ NP] • S::S [‘rAma’ ‘ne’ NP ‘khaya’] -> [‘Rama’ ‘ate’ NP]

Word Alignment Task • Training data • Source language • f : source sentence (Hindi) • j = 1,2,...,J • Target language • e : target sentence (English) • i = 1,2,...,I

Word Alignment Models • IBM1 – lexical probabilities only • IBM2 – lexicon plus absolute position • IBM3 – plus fertilities • IBM4 – inverted relative position alignment • IBM5 – non-deficient version of model 4 • HMM – lexicon plus relative position [Brown et al. 1993, Vogel et al. 1996, Och et al. 1999]
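To make the simplest entry in this list concrete, here is a minimal sketch of IBM Model 1 trained with EM, using the f/e notation from the Word Alignment Task slide. The function names `train_ibm1` and `viterbi_align`, the `NULL` token string, and the three-sentence toy corpus are illustrative assumptions; the experiments in this talk use IBM-4, which this sketch does not implement.

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=10):
    """corpus: list of (f_tokens, e_tokens) pairs. Returns t[(f, e)] ~ p(f | e)."""
    f_vocab = {f for f_sent, _ in corpus for f in f_sent}
    # Uniform initialisation of the lexical probabilities.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of f aligned to e
        total = defaultdict(float)   # expected count of e aligned to anything
        for f_sent, e_sent in corpus:
            e_with_null = ["NULL"] + list(e_sent)
            for f in f_sent:
                # E-step: distribute one unit of mass for f over all e.
                norm = sum(t[(f, e)] for e in e_with_null)
                for e in e_with_null:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the expected counts per e.
        t = defaultdict(float, {fe: c / total[fe[1]] for fe, c in count.items()})
    return t

def viterbi_align(f_sent, e_sent, t):
    """Best e position (0 = NULL) for each f position; positions are 1-based."""
    e_with_null = ["NULL"] + list(e_sent)
    links = []
    for j, f in enumerate(f_sent, start=1):
        best_i = max(range(len(e_with_null)),
                     key=lambda i: t.get((f, e_with_null[i]), 0.0))
        links.append((j, best_i))
    return links

# Toy corpus (illustrative only), reusing words from the running example.
toy = [(["rAma", "khaya"], ["Rama", "ate"]),
       (["rAma"], ["Rama"]),
       (["khaya"], ["ate"])]
t = train_ibm1(toy)
links = viterbi_align(["rAma", "khaya"], ["Rama", "ate"], t)
```

Model 1 ignores word positions entirely, which is why the later models in the list add position and fertility terms on top of the same lexical table.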

Our Approach • Better Syntax Projection requires better Word Alignment • Our Hypothesis: Word Alignment can be improved using Syntax projection • Project Base NPs to TL and obtain a clean NP table • Perform a constrained alignment in Parallel Corpus using the NP table

Why Base NPs ? • NPs are semantically and syntactically cohesive across languages • NPs show minimal categorical divergence compared with other phrase types • NPs are building blocks of a sentence, and translating them well improves MT quality [Philipp Koehn, PhD thesis 2003]

Constrained Alignment [PESA: Phrase Pair Extraction as Sentence Splitting, Vogel ’05 ]

Constrained Alignment • Ex: Rama ate [an apple] rAma ne [ek apple] khaya

Constrained Alignment • Ex: Rama ate [an apple] rAma ne khaya [ek apple]

NP based Bootstrapping: Algorithm • Word-align ‘S’ and ‘T’ using IBM-4 • Extract NPs on the source side using the parse • Extract translations by harvesting the Viterbi alignment • Calculate features for NP pairs • Prune based on thresholds • Perform constrained word alignment and lexicon extraction using the NP table

Corpus
There are quite a large number of Malayalees living here .
is found in the west coast of Great Nicobar called the Magapod Island .
Plotemy calls them ' Nagadip ' , a Hindu name for naked island
malayAlama logoM kI bahuwa badZI saMKyA hE .
xvIpa ke paScimI wata para sWiwa mEgApOda xvIpa meM pAyA jAwA hE .
plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE .

Source side Parsed
There are quite a large number of Malayalees living here .
is found in the west coast of Great Nicobar called the Magapod Island .
Plotemy calls them ' Nagadip ' , a Hindu name for naked island
(S1 (S (NP (EX There)) (VP (AUX are) (NP (NP (PDT quite) (DT a) (JJ large) (NN number)) (PP (IN of) (NP (NP (NNS Malayalees)) (VP (VBG living) (ADVP (RB here))))))) (. .)))
(S1 (S (VP (AUX is) (VP (VBN found) (PP (IN in) (NP (NP (DT the) (JJ west) (NN coast)) (PP (IN of) (NP (NNP Great) (NNP Nicobar))) (VP (VBN called) (S (NP (DT the) (NNP Magapod) (NNP Island)))))))) (. .)))
(S1 (S (NP (NNP Plotemy)) (VP (VBZ calls) (SBAR (S (NP (PRP them)) (VP (POS ') (NP (NP (NNP Nagadip) (POS ')) (, ,) (NP (NP (DT a) (NNP Hindu) (NN name)) (PP (IN for) (NP (JJ naked) (NN island))))))))) (. .)))

NP based Bootstrapping • Word-align ‘S’ and ‘T’ using IBM-4 • Extract NPs on the source side using the parse • Extract translations by harvesting the Viterbi alignment • Calculate features for NP pairs • Prune based on thresholds • Perform constrained word alignment and lexicon extraction using the NP table

Aligned Corpus
;;Sentence id = 1
SL:There are quite a large number of Malayalees living here .
TL:malayAlama logoM kI bahuwa badZI saMKyA hE .
Alignment:((1,8),(2,7),(11,8),(3,1),(4,8),(5,5),(6,6),(7,3),(8,1),(9,1),(10,1))
;;Sentence id = 2
SL:is found in the west coast of Great Nicobar called the Magapod Island .
TL:xvIpa ke paScimI wata para sWiwa mEgApOda xvIpa meM pAyA jAwA hE .
Alignment:((1,12),(2,10),(11,2),(12,7),(13,1),(14,13),(3,9),(4,2),(5,4),(6,4),(7,2),(8,7),(9,1),(10,11))

Extract Source NPs • NP:1:There :. • NP:1:quite a large number :. • NP:1:Malayalees : • NP:2:the west coast : • NP:2:Great Nicobar : • NP:2:the Magapod Island : • NP:6:Plotemy : • NP:6:them : • NP:6:Nagadip ' : • NP:6:a Hindu name : • NP:6:naked island :
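One way to obtain the source NPs listed on this slide is to read the bracketed parse and keep every NP that contains no other NP. The sketch below makes that assumption about what "base NP" means here; the helper names (`parse_tree`, `leaves`, `contains_label`, `base_nps`) are hypothetical and the tool actually used in the talk is not stated.

```python
def parse_tree(text):
    """Parse one Penn-style bracketing into nested lists: [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        node = [tokens[pos]]          # constituent or POS label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())   # nested constituent
            else:
                node.append(tokens[pos])  # word
                pos += 1
        pos += 1                      # skip ")"
        return node

    return read()

def leaves(node):
    """Words under a node, left to right."""
    words = []
    for child in node[1:]:
        if isinstance(child, list):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words

def contains_label(node, label):
    """True if any proper descendant of node carries the given label."""
    return any(isinstance(c, list) and (c[0] == label or contains_label(c, label))
               for c in node[1:])

def base_nps(node):
    """All NPs that do not dominate another NP, in left-to-right order."""
    found = []
    if node[0] == "NP" and not contains_label(node, "NP"):
        found.append(" ".join(leaves(node)))
    for child in node[1:]:
        if isinstance(child, list):
            found.extend(base_nps(child))
    return found

# Parse of sentence 1 from the "Source side Parsed" slide.
tree_1 = parse_tree(
    "(S1 (S (NP (EX There)) (VP (AUX are) (NP (NP (PDT quite) (DT a) (JJ large) "
    "(NN number)) (PP (IN of) (NP (NP (NNS Malayalees)) (VP (VBG living) "
    "(ADVP (RB here))))))) (. .)))"
)
nps_1 = base_nps(tree_1)
```

On sentence 1 this yields the same three source NPs as the slide: "There", "quite a large number" and "Malayalees".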

Extract NP translation Pairs • NP:1:There :. • NP:1:quite a large number :malayAlama logoM kI bahuwa badZI saMKyA hE . • NP:1:Malayalees :malayAlama • NP:2:the west coast :ke paScimI wata • NP:2:Great Nicobar :xvIpa ke paScimI wata para sWiwa mEgApOda • NP:2:the Magapod Island :xvIpa ke paScimI wata para sWiwa mEgApOda • NP:6:Plotemy :plotemI • NP:6:them :inheM • NP:6:Nagadip ' :plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke • NP:6:a Hindu name :plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE . • NP:6:naked island :plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna
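The pairs above are consistent with a simple harvesting rule: collect the target positions linked to any word of the source NP, then take the contiguous target span from the smallest to the largest of those positions. This is an assumption inferred from the examples on the slides, not a stated definition, and `harvest_target_span` is a hypothetical name.

```python
def harvest_target_span(src_start, src_end, alignment, tgt_tokens):
    """Target phrase for a source span (1-based, inclusive), or None if unaligned.

    alignment is a list of (source_position, target_position) links, 1-based,
    as in the Aligned Corpus slide.
    """
    positions = [t for s, t in alignment if src_start <= s <= src_end]
    if not positions:
        return None
    lo, hi = min(positions), max(positions)
    return " ".join(tgt_tokens[lo - 1:hi])

# Sentence 2 from the Aligned Corpus slide.
tgt_2 = "xvIpa ke paScimI wata para sWiwa mEgApOda xvIpa meM pAyA jAwA hE .".split()
align_2 = [(1, 12), (2, 10), (11, 2), (12, 7), (13, 1), (14, 13), (3, 9),
           (4, 2), (5, 4), (6, 4), (7, 2), (8, 7), (9, 1), (10, 11)]

west_coast = harvest_target_span(4, 6, align_2, tgt_2)      # "the west coast"
great_nicobar = harvest_target_span(8, 9, align_2, tgt_2)   # "Great Nicobar"
```

A single stray link stretches the span, which is how "Great Nicobar" ends up paired with seven target words; the feature and pruning steps exist to filter out such pairs.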

Feature Extraction for NP Pairs • Features • Source length in words • Target length in words • Absolute length difference • Freq Source base np • Freq of Target base np • Freq of the S-T pair • Source 2 Target probability • Target 2 Source probability
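The first six features can be computed by counting alone, as sketched below. The slide does not say how the two translation probabilities are obtained, so they are left out; `count_features`, the dictionary keys and the small input list are all illustrative assumptions.

```python
from collections import Counter

def count_features(np_pairs):
    """np_pairs: list of (source_np, target_np) strings harvested from a corpus.

    Returns {(source_np, target_np): feature dict} with the length and
    frequency features; the probability features are not covered here.
    """
    src_freq = Counter(s for s, _ in np_pairs)
    tgt_freq = Counter(t for _, t in np_pairs)
    pair_freq = Counter(np_pairs)
    rows = {}
    for (s, t), n in pair_freq.items():
        src_len, tgt_len = len(s.split()), len(t.split())
        rows[(s, t)] = {
            "src_len": src_len,
            "tgt_len": tgt_len,
            "len_diff": abs(src_len - tgt_len),
            "src_freq": src_freq[s],
            "tgt_freq": tgt_freq[t],
            "pair_freq": n,
        }
    return rows

# Illustrative input only; the counts are not taken from the real corpus.
pairs = [("Malayalees", "malayAlama"),
         ("the west coast", "ke paScimI wata"),
         ("Malayalees", "malayAlama"),
         ("Malayalees", "malayAlama logoM")]
features = count_features(pairs)
```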

Calculate Features of NP pairs • NP:1:There:.:1:1:0:2449:2318:258:0.00129355:0.000321318 • NP:1:quite a large number:malayAlama logoM kI bahuwa badZI saMKyA hE .:4:8:4:3:1:1:1.95591979786667e-13:0 • NP:1:Malayalees:malayAlama:1:1:0:3:2:1:0.614945:0.0935706 • NP:2:the west coast:ke paScimI wata:3:3:0:15:2:1:2.40946933496697e-06:5.34403517215648e-11 • NP:2:Great Nicobar:xvIpa ke paScimI wata para sWiwa mEgApOda:2:7:5:6:2:1:1.28793968923196e-05:0 • NP:2:the Magapod Island:xvIpa ke paScimI wata para sWiwa mEgApOda:3:7:4:1:2:1:2.19930690076524e-06:0 • NP:6:Plotemy:plotemI:1:1:0:1:1:1:1:1 • NP:6:them:inheM:1:1:0:2153:27:16:0.0168737:0 • NP:6:Nagadip ':plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke:2:15:13:1:1:1:3.06461991111111e-05:0 • NP:6:a Hindu name:plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE .:3:20:17:1:1:1:1.31075474321488e-12:0 • NP:6:naked island:plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna:2:13:11:1:1:1:1.16829204884615e-06:0

Prune based on manual thresholds • NP:1:There:. • NP:1:Malayalees:malayAlama • NP:2:the west coast:ke paScimI wata • NP:6:Plotemy:plotemI • NP:6:them:inheM
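The thresholds themselves are not given on the slides. As a hedged illustration, the sketch below parses the colon-separated feature records and applies a hypothetical cutoff on the absolute length difference, with an optional floor on the source-to-target probability. On the records shown earlier, a length-difference cutoff alone happens to keep the same five pairs as this slide, but that does not mean it is the rule that was used. `parse_record`, `prune`, `max_len_diff` and `min_s2t` are assumed names.

```python
def parse_record(line):
    """Split 'NP:id:source:target:f1:...:f8' into a dict.

    The eight numeric features are taken from the right, so a target phrase
    may contain spaces or punctuation; a source phrase containing ':' is
    not handled by this sketch.
    """
    head, *feats = line.rsplit(":", 8)
    _, sent_id, source, target = head.split(":", 3)
    return {
        "sent_id": int(sent_id),
        "source": source,
        "target": target,
        "src_len": int(feats[0]),
        "tgt_len": int(feats[1]),
        "len_diff": int(feats[2]),
        "src_freq": int(feats[3]),
        "tgt_freq": int(feats[4]),
        "pair_freq": int(feats[5]),
        "s2t_prob": float(feats[6]),
        "t2s_prob": float(feats[7]),
    }

def prune(records, max_len_diff=2, min_s2t=0.0):
    """Keep records under the (hypothetical) thresholds."""
    return [r for r in records
            if r["len_diff"] <= max_len_diff and r["s2t_prob"] >= min_s2t]

# Three records copied from the feature slide.
lines = [
    "NP:1:quite a large number:malayAlama logoM kI bahuwa badZI saMKyA hE .:4:8:4:3:1:1:1.95591979786667e-13:0",
    "NP:1:Malayalees:malayAlama:1:1:0:3:2:1:0.614945:0.0935706",
    "NP:6:them:inheM:1:1:0:2153:27:16:0.0168737:0",
]
kept = prune([parse_record(line) for line in lines])
kept_pairs = [(r["source"], r["target"]) for r in kept]
```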

NP based Bootstrapping • Word-align ‘S’ and ‘T’ using IBM-4 • Extract NPs on the source side using the parse • Extract translations by harvesting the Viterbi alignment • Calculate features for NP pairs • Prune based on thresholds • Perform constrained word alignment and lexicon extraction using the NP table (Folding)

Constrained Alignment: NP Folding
Source side, NP-table entries folded to NP:
NP are quite a large number of NP living here . (folded: There, Malayalees)
is found in NP of Great Nicobar called the Magapod Island . (folded: the west coast)
NP calls NP ' Nagadip ' , a Hindu name for naked island . (folded: Plotemy, them)
Target side, NP-table entries folded to NP:
NP logoM kI bahuwa badZI saMKyA NP (folded: malayAlama)
xvIpa NP para sWiwa mEgApOda xvIpa meM pAyA jAwA hE . (folded: ke paScimI wata)
NP NP nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE . (folded: plotemI, them)
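A minimal sketch of the folding idea: when both sides of an NP-table entry occur in a sentence pair, each side is replaced by one placeholder token, so that the remaining words can be aligned around a fixed anchor. The numbered placeholders (`NP1`, `NP2`, ...) and the function names are assumptions made for this illustration; the slide itself shows a plain NP token, and the constrained aligner is not reproduced here.

```python
def fold(tokens, phrase, placeholder):
    """Replace the first occurrence of phrase (a token list) with placeholder.

    Returns (new_tokens, True) on success and (tokens, False) otherwise.
    """
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return tokens[:i] + [placeholder] + tokens[i + n:], True
    return tokens, False

def fold_pair(src_tokens, tgt_tokens, np_table):
    """Fold every NP pair whose two sides both occur in the sentence pair."""
    for k, (src_np, tgt_np) in enumerate(np_table, start=1):
        placeholder = f"NP{k}"  # hypothetical numbering to keep pairs distinct
        new_src, found_src = fold(src_tokens, src_np.split(), placeholder)
        new_tgt, found_tgt = fold(tgt_tokens, tgt_np.split(), placeholder)
        if found_src and found_tgt:
            src_tokens, tgt_tokens = new_src, new_tgt
    return src_tokens, tgt_tokens

# Sentence pair 2 and one entry from the pruned NP table.
src = "is found in the west coast of Great Nicobar called the Magapod Island .".split()
tgt = "xvIpa ke paScimI wata para sWiwa mEgApOda xvIpa meM pAyA jAwA hE .".split()
folded_src, folded_tgt = fold_pair(src, tgt, [("the west coast", "ke paScimI wata")])
```

Folding shortens both sentences and removes the words already explained by the NP table, which is the sense in which the second alignment pass is constrained.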

Experiments • English Hindi (Resource constrained) • English Hindi

Word Alignment Experiments • Training: 5000 sentences • Testing: 200 sentences • Human Extracted NP table – 21,736

Word Alignment Results (5k corpus) • Experiments with 5k training corpus and 200 test sentences

NP Projection Results (5k) • Evaluation: 21736 NP_Table harvested from 5k test bed corpus

Word Alignment Experiments • Training: 50K Eng-Hin Corpus • Testing: 200 Eng-Hin aligned sentences • Human Extracted NP table – 21,736

Word Alignment Results (55k corpus) • Experiments with 55k training corpus and 200 test sentences

NP Projection Results (55k) • Evaluation: 21736 NP_Table created by human alignment

From here.. • Improvements • Reliable NP Projection • Hierarchical Word Alignment • Machine Translation • Rule Learning • Refined Probabilistic translation Lexicon • Clean Linguistically motivated Phrase table with probabilities

Questions ?

Thanks !
