
Noun Phrase driven Bootstrapping of Word Alignment for Syntax Projection



Presentation Transcript


  1. Noun Phrase driven Bootstrapping of Word Alignment for Syntax Projection Vamshi Ambati 14 Sept 2007 Student Research Symposium

  2. Agenda • Rule Learning for MT • Syntax Projection Task • Word Alignment Task • Bootstrapping Word Alignment • Experiment and Results

  3. Machine Translation for Resource-poor Languages • A major portion of human languages are ‘resource-poor’ • Little parallel corpus with major languages • Little monolingual corpus • Few annotation tools • Few grammarians • Few bilingual speakers • Machine Translation in such a scenario is extremely difficult

  4. Machine Translation for Resource-poor Languages • AVENUE [Lavie et al. ’03]

  5. Rule Learning for MT
  NP::NP [PP NP] -> [NP PP]
  ( (X1::Y2) (X2::Y1)
    (X0 = X2)
    ((Y1 NUM) = (X2 NUM))
    ((Y1 NUM) = (Y2 NUM))
    ((Y1 PERS) = (Y2 PERS))
    (Y0 = Y1) )
  PP::PP [ADVP NP POSTP] -> [ADVP PREP NP]
  ( (X1::Y1) (X2::Y3) (X3::Y2)
    (X0 = X3)
    (Y0 = Y2) )
  Read the first rule as: a [PP NP] sequence on one side corresponds to [NP PP] on the other, with X1::Y2 and X2::Y1 stating which constituents align, and the remaining equations propagating number and person agreement between constituents and the parent node.
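A toy sketch of what such a rule might look like as a data structure, and how the constituent alignments drive reordering. The dictionary layout, the function name apply_rule, and the placeholder constituent strings are illustrative assumptions, not the AVENUE transfer engine or its rule syntax.

```python
# Minimal representation of a transfer rule like "NP::NP [PP NP] -> [NP PP]".
rule = {
    "lhs": "NP", "rhs": "NP",
    "src_seq": ["PP", "NP"],
    "tgt_seq": ["NP", "PP"],
    "align": {1: 2, 2: 1},        # X1::Y2, X2::Y1 (1-based constituent indices)
}

def apply_rule(rule, src_constituents):
    """src_constituents: list of (label, text) matching rule['src_seq'];
    returns the constituents in target order as implied by the alignments.
    Feature constraints (NUM, PERS, ...) are not modelled in this sketch."""
    assert [label for label, _ in src_constituents] == rule["src_seq"]
    tgt = [None] * len(rule["tgt_seq"])
    for x, y in rule["align"].items():
        tgt[y - 1] = src_constituents[x - 1]
    return tgt

# Reorders a [PP NP] pair into [NP PP].
print(apply_rule(rule, [("PP", "<pp words>"), ("NP", "<np words>")]))
```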

  6. How can such rules be learnt? • Given annotated data for the TL, we have creative ways to do this • Nothing is more valuable than annotated data • But these are “resource-poor” languages • Can we look from the source side and transfer annotation?

  7. Syntax Projection • Named Entity Projection [Rama] ate an apple rAma ne ek apple khaya

  8. Syntax Projection • Named Entity Projection [Rama] ate an apple [rAma] ne ek apple khaya

  9. Syntax Projection • Base NP Projection [Rama] ate [an apple] rAma ne ek apple khaya

  10. Syntax Projection • Base NP Projection [Rama] ate [an apple] [rAma] ne [ek apple] khaya

  11. Syntax Projection • Constituent Phrase Projection rAma ne ek apple khaya

  12. Syntax Projection • Constituent Phrase Projection

  13. Rule Learning Goal • English: Rama ate an apple • Hindi: rAma ne ek apple khaya • S::S [NP NP VP] -> [NP VP NP] • S::S [NP NP ‘khaya’] -> [NP ‘ate’ NP] • S::S [‘rAma’ ‘ne’ NP ‘khaya’] -> [‘Rama’ ‘ate’ NP]

  14. Word Alignment Task • Training data • Source language • f : source sentence (Hindi) • j = 1,2,...,J • Target language • e : target sentence (English) • i = 1,2,...,I

  15. Word Alignment Models • IBM1 – lexical probabilities only • IBM2 – lexicon plus absolute position • IBM3 – plus fertilities • IBM4 – inverted relative position alignment • IBM5 – non-deficient version of Model 4 • HMM – lexicon plus relative position [Brown et al. 1993, Vogel et al. 1996, Och et al. 1999]
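For concreteness, a minimal sketch of the simplest of these, IBM Model 1, which estimates only the lexical translation table t(e|f) with EM. It omits the NULL word and the position and fertility components that the higher models and the HMM add, and the names below are illustrative, not any particular toolkit’s API.

```python
from collections import defaultdict

def ibm1_em(bitext, iterations=5):
    """Estimate lexical translation probabilities t(e|f) with IBM Model 1 EM.
    bitext: list of (f_words, e_words) sentence pairs (source, target)."""
    e_vocab = {e for _, e_sent in bitext for e in e_sent}
    t = defaultdict(lambda: 1.0 / len(e_vocab))      # uniform initialisation

    for _ in range(iterations):
        count = defaultdict(float)                   # expected counts c(e, f)
        total = defaultdict(float)                   # expected counts c(f)
        for f_sent, e_sent in bitext:
            for e in e_sent:
                z = sum(t[(e, f)] for f in f_sent)   # normalise over source words
                for f in f_sent:
                    delta = t[(e, f)] / z
                    count[(e, f)] += delta
                    total[f] += delta
        for (e, f), c in count.items():              # M-step: renormalise
            t[(e, f)] = c / total[f]
    return t

# Toy usage on the running example sentence pair.
bitext = [("rAma ne ek apple khaya".split(), "Rama ate an apple".split())]
t = ibm1_em(bitext)
```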

  16. Our Approach • Better Syntax Projection requires better Word Alignment • Our hypothesis: Word Alignment can be improved using Syntax Projection • Project base NPs to the TL and obtain a clean NP table • Perform a constrained alignment over the parallel corpus using the NP table

  17. Why Base NPs? • NPs are semantically and syntactically cohesive across languages • NPs show minimal categorical divergence compared to other constituent types • NPs are the building blocks of a sentence and their translation improves MT quality [Philipp Koehn, PhD thesis 2003]

  18. Constrained Alignment [PESA: Phrase Pair Extraction as Sentence Splitting, Vogel ’05]

  19. Constrained Alignment • Ex: Rama ate [an apple] rAma ne [ek apple] khaya

  20. Constrained Alignment • Ex: Rama ate [an apple] rAma ne khaya [ek apple]
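A minimal sketch of the sentence-splitting idea behind this constrained alignment: given a bracketed source NP, score every candidate target span by how well the NP’s words explain the words inside the span while the rest of the source sentence explains the words outside it. The scoring function, the single-direction lexicon t_prob, and all names below are assumptions for illustration, not the actual PESA implementation (which, among other things, also uses the reverse lexicon).

```python
import math

def best_target_span(src, tgt, np_start, np_end, t_prob, floor=1e-7):
    """Pick the target span (0-based, inclusive) best explained by the source NP
    src[np_start:np_end+1]; t_prob(tgt_word, src_word) is a lexical probability."""
    inside = src[np_start:np_end + 1]
    outside = src[:np_start] + src[np_end + 1:]

    def score(words, support):
        # log-probability of each word under its best supporting source word
        return sum(math.log(max(max((t_prob(w, s) for s in support), default=0.0),
                                floor))
                   for w in words)

    best, best_score = None, -math.inf
    for j0 in range(len(tgt)):
        for j1 in range(j0, len(tgt)):
            s = (score(tgt[j0:j1 + 1], inside) +
                 score(tgt[:j0] + tgt[j1 + 1:], outside))
            if s > best_score:
                best, best_score = (j0, j1), s
    return best
```

With the example above, the span [ek apple] should win for [an apple], since “khaya” is better explained by “ate” outside the bracket than by anything inside it.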

  21. NP-based Bootstrapping: Algorithm • Word-align ‘S’ and ‘T’ using IBM-4 • Extract NPs on the source side using the parse • Extract NP translations by harvesting the Viterbi alignment • Calculate features for NP pairs • Prune based on thresholds • Perform constrained word alignment and lexicon extraction using the NP table

  22. Corpus
  There are quite a large number of Malayalees living here .
  is found in the west coast of Great Nicobar called the Magapod Island .
  Plotemy calls them ' Nagadip ' , a Hindu name for naked island
  malayAlama logoM kI bahuwa badZI saMKyA hE .
  xvIpa ke paScimI wata para sWiwa mEgApOda xvIpa meM pAyA jAwA hE .
  plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE .

  23. Source side Parsed
  There are quite a large number of Malayalees living here .
  is found in the west coast of Great Nicobar called the Magapod Island .
  Plotemy calls them ' Nagadip ' , a Hindu name for naked island
  (S1 (S (NP (EX There)) (VP (AUX are) (NP (NP (PDT quite) (DT a) (JJ large) (NN number)) (PP (IN of) (NP (NP (NNS Malayalees)) (VP (VBG living) (ADVP (RB here))))))) (. .)))
  (S1 (S (VP (AUX is) (VP (VBN found) (PP (IN in) (NP (NP (DT the) (JJ west) (NN coast)) (PP (IN of) (NP (NNP Great) (NNP Nicobar))) (VP (VBN called) (S (NP (DT the) (NNP Magapod) (NNP Island)))))))) (. .)))
  (S1 (S (NP (NNP Plotemy)) (VP (VBZ calls) (SBAR (S (NP (PRP them)) (VP (POS ') (NP (NP (NNP Nagadip) (POS ')) (, ,) (NP (NP (DT a) (NNP Hindu) (NN name)) (PP (IN for) (NP (JJ naked) (NN island))))))))) (. .)))

  24. NP-based Bootstrapping • Word-align ‘S’ and ‘T’ using IBM-4 • Extract NPs on the source side using the parse • Extract NP translations by harvesting the Viterbi alignment • Calculate features for NP pairs • Prune based on thresholds • Perform constrained word alignment and lexicon extraction using the NP table

  25. Aligned Corpus
  ;;Sentence id = 1
  SL:There are quite a large number of Malayalees living here .
  TL:malayAlama logoM kI bahuwa badZI saMKyA hE .
  Alignment:((1,8),(2,7),(11,8),(3,1),(4,8),(5,5),(6,6),(7,3),(8,1),(9,1),(10,1))
  ;;Sentence id = 2
  SL:is found in the west coast of Great Nicobar called the Magapod Island .
  TL:xvIpa ke paScimI wata para sWiwa mEgApOda xvIpa meM pAyA jAwA hE .
  Alignment:((1,12),(2,10),(11,2),(12,7),(13,1),(14,13),(3,9),(4,2),(5,4),(6,4),(7,2),(8,7),(9,1),(10,11))
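For the later steps it helps to have this aligned corpus in memory. A small reader for the format shown above; the field interpretation is inferred from this example (in sentence 1 the first index runs up to 11, the length of the SL sentence, so the pairs are read here as 1-based (SL position, TL position)).

```python
import re

def read_aligned_corpus(path):
    """Read ';;Sentence id', 'SL:', 'TL:' and 'Alignment:' lines into records."""
    records, cur = [], None
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(";;Sentence id"):
                cur = {"id": int(line.split("=")[1])}
                records.append(cur)
            elif line.startswith("SL:"):
                cur["sl"] = line[3:].split()
            elif line.startswith("TL:"):
                cur["tl"] = line[3:].split()
            elif line.startswith("Alignment:"):
                # pairs like ((1,8),(2,7),...) -> 1-based (SL position, TL position)
                cur["align"] = [(int(a), int(b))
                                for a, b in re.findall(r"\((\d+),(\d+)\)", line)]
    return records
```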

  26. NP-based Bootstrapping • Word-align ‘S’ and ‘T’ using IBM-4 • Extract NPs on the source side using the parse • Extract NP translations by harvesting the Viterbi alignment • Calculate features for NP pairs • Prune based on thresholds • Perform constrained word alignment and lexicon extraction using the NP table

  27. Extract Source NPs • NP:1:There :. • NP:1:quite a large number :. • NP:1:Malayalees : • NP:2:the west coast : • NP:2:Great Nicobar : • NP:2:the Magapod Island : • NP:6:Plotemy : • NP:6:them : • NP:6:Nagadip ' : • NP:6:a Hindu name : • NP:6:naked island :
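A sketch of how the source NPs above could be read off the slide 23 parses: parse the bracketing and keep NP nodes that do not dominate another NP, a common definition of base NP. The slides do not spell out the exact extraction rule, so treat the details as assumptions.

```python
import re

def parse_tree(s):
    """Parse a Penn-Treebank-style bracketing into nested (label, children)
    tuples; leaf children are plain word strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0
    def node():
        nonlocal pos
        pos += 1                              # consume "("
        label = tokens[pos]; pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos]); pos += 1
        pos += 1                              # consume ")"
        return (label, children)
    return node()

def leaves(t):
    return [w for c in t[1] for w in (leaves(c) if isinstance(c, tuple) else [c])]

def base_nps(t):
    """Collect base NPs: NP nodes with no NP below them."""
    def has_np_below(t):
        return any(isinstance(c, tuple) and (c[0] == "NP" or has_np_below(c))
                   for c in t[1])
    out = []
    def walk(t):
        if not isinstance(t, tuple):
            return
        if t[0] == "NP" and not has_np_below(t):
            out.append(" ".join(leaves(t)))
        for c in t[1]:
            walk(c)
    walk(t)
    return out

tree = parse_tree("(S1 (S (NP (EX There)) (VP (AUX are) (NP (NP (PDT quite) "
                  "(DT a) (JJ large) (NN number)) (PP (IN of) (NP (NP (NNS "
                  "Malayalees)) (VP (VBG living) (ADVP (RB here))))))) (. .)))")
print(base_nps(tree))   # ['There', 'quite a large number', 'Malayalees']
```

On the first parse this yields exactly the sentence-1 entries listed above.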

  28. NP-based Bootstrapping • Word-align ‘S’ and ‘T’ using IBM-4 • Extract NPs on the source side using the parse • Extract NP translations by harvesting the Viterbi alignment • Calculate features for NP pairs • Prune based on thresholds • Perform constrained word alignment and lexicon extraction using the NP table

  29. Extract NP translation Pairs • NP:1:There :. • NP:1:quite a large number :malayAlama logoM kI bahuwa badZI saMKyA hE . • NP:1:Malayalees :malayAlama • NP:2:the west coast :ke paScimI wata • NP:2:Great Nicobar :xvIpa ke paScimI wata para sWiwa mEgApOda • NP:2:the Magapod Island :xvIpa ke paScimI wata para sWiwa mEgApOda • NP:6:Plotemy :plotemI • NP:6:them :inheM • NP:6:Nagadip ' :plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke • NP:6:a Hindu name :plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE . • NP:6:naked island :plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna
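The target strings above can be obtained by projecting each source NP through the Viterbi alignment: collect every target position linked to a word inside the NP and take the contiguous block from the smallest to the largest such position. This is also why loosely aligned NPs such as “Great Nicobar” pick up long, noisy target spans. A sketch, with the index conventions assumed from the slide 25 example:

```python
def project_np(np_start, np_end, alignment, tgt_words):
    """Project a source NP (1-based, inclusive span) onto the target side.
    alignment: (SL position, TL position) pairs as in the aligned-corpus
    example; returns the min..max target span covered by the NP's links."""
    linked = [t for s, t in alignment if np_start <= s <= np_end]
    if not linked:
        return None
    j0, j1 = min(linked), max(linked)
    return " ".join(tgt_words[j0 - 1:j1])

align = [(1,8),(2,7),(11,8),(3,1),(4,8),(5,5),(6,6),(7,3),(8,1),(9,1),(10,1)]
tl = "malayAlama logoM kI bahuwa badZI saMKyA hE .".split()
print(project_np(8, 8, align, tl))   # 'Malayalees' -> 'malayAlama'
```

Applied to “quite a large number” (source positions 3–6) this returns the whole target sentence, matching the noisy entry above.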

  30. NP-based Bootstrapping • Word-align ‘S’ and ‘T’ using IBM-4 • Extract NPs on the source side using the parse • Extract NP translations by harvesting the Viterbi alignment • Calculate features for NP pairs • Prune based on thresholds • Perform constrained word alignment and lexicon extraction using the NP table

  31. Feature Extraction for NP Pairs • Features • Source length in words • Target length in words • Absolute length difference • Frequency of the source base NP • Frequency of the target base NP • Frequency of the S-T pair • Source-to-Target probability • Target-to-Source probability
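A sketch of computing these features for one candidate pair. The frequency dictionaries come from the harvesting step, and the IBM-1-style phrase probability used for the last two features is an assumption: the slides do not say how the source-to-target and target-to-source scores are actually computed.

```python
def np_pair_features(src_np, tgt_np, pair_freq, src_freq, tgt_freq,
                     p_t_given_s, p_s_given_t):
    """Features for one (source NP, target NP) candidate, mirroring slide 31.
    pair_freq/src_freq/tgt_freq: dicts of counts harvested from the corpus;
    p_t_given_s / p_s_given_t: word-level lexicons keyed by (tgt_word, src_word)."""
    s, t = src_np.split(), tgt_np.split()

    def phrase_prob(src_words, tgt_words, lex):
        # product over target words of the averaged lexical probability
        p = 1.0
        for tw in tgt_words:
            p *= sum(lex.get((tw, sw), 0.0) for sw in src_words) / len(src_words)
        return p

    return {
        "src_len": len(s),
        "tgt_len": len(t),
        "abs_len_diff": abs(len(s) - len(t)),
        "freq_src_np": src_freq.get(src_np, 0),
        "freq_tgt_np": tgt_freq.get(tgt_np, 0),
        "freq_pair": pair_freq.get((src_np, tgt_np), 0),
        "p_tgt_given_src": phrase_prob(s, t, p_t_given_s),
        "p_src_given_tgt": phrase_prob(t, s, p_s_given_t),
    }
```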

  32. Calculate Features of NP pairs (fields: sentence id : source NP : target NP : source length : target length : length difference : source NP freq : target NP freq : pair freq : S2T probability : T2S probability) • NP:1:There:.:1:1:0:2449:2318:258:0.00129355:0.000321318 • NP:1:quite a large number:malayAlama logoM kI bahuwa badZI saMKyA hE .:4:8:4:3:1:1:1.95591979786667e-13:0 • NP:1:Malayalees:malayAlama:1:1:0:3:2:1:0.614945:0.0935706 • NP:2:the west coast:ke paScimI wata:3:3:0:15:2:1:2.40946933496697e-06:5.34403517215648e-11 • NP:2:Great Nicobar:xvIpa ke paScimI wata para sWiwa mEgApOda:2:7:5:6:2:1:1.28793968923196e-05:0 • NP:2:the Magapod Island:xvIpa ke paScimI wata para sWiwa mEgApOda:3:7:4:1:2:1:2.19930690076524e-06:0 • NP:6:Plotemy:plotemI:1:1:0:1:1:1:1:1 • NP:6:them:inheM:1:1:0:2153:27:16:0.0168737:0 • NP:6:Nagadip ':plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke:2:15:13:1:1:1:3.06461991111111e-05:0 • NP:6:a Hindu name:plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna logoM ke lie prayukwa howA hE .:3:20:17:1:1:1:1.31075474321488e-12:0 • NP:6:naked island:plotemI inheM nAgAxvIpa kahawA hE jo ki hiMxI kA Sabxa hE Ora nagna:2:13:11:1:1:1:1.16829204884615e-06:0

  33. NP-based Bootstrapping • Word-align ‘S’ and ‘T’ using IBM-4 • Extract NPs on the source side using the parse • Extract NP translations by harvesting the Viterbi alignment • Calculate features for NP pairs • Prune based on thresholds • Perform constrained word alignment and lexicon extraction using the NP table

  34. Prune based on manual thresholds • NP:1:There:. • NP:1:Malayalees:malayAlama • NP:2:the west coast:ke paScimI wata • NP:6:Plotemy:plotemI • NP:6:them:inheM
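The slides do not state the thresholds, only that they were set manually. As an illustration, a cutoff on absolute length difference alone would already reproduce the kept pairs above (the five survivors all have a difference of 0, the discarded candidates 4 or more); a sketch with placeholder cutoff values:

```python
def prune(candidates, max_len_diff=1, min_pair_freq=1):
    """Keep NP pairs that pass simple manual thresholds.
    candidates: list of (src_np, tgt_np, features) with features as on slide 32.
    The cutoff values here are placeholders, not the ones used in the experiments."""
    return [(s, t) for s, t, f in candidates
            if f["abs_len_diff"] <= max_len_diff
            and f["freq_pair"] >= min_pair_freq]
```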

  35. NP-based Bootstrapping • Word-align ‘S’ and ‘T’ using IBM-4 • Extract NPs on the source side using the parse • Extract NP translations by harvesting the Viterbi alignment • Calculate features for NP pairs • Prune based on thresholds • Perform constrained word alignment and lexicon extraction using the NP table (Folding)

  36. Constrained Alignment: NP Folding • Each pruned NP pair (There↔., Malayalees↔malayAlama, the west coast↔ke paScimI wata, Plotemy↔plotemI, them↔inheM) is collapsed into a single NP placeholder token on both the source and the target sentence before re-alignment, e.g. “There are quite a large number of Malayalees living here .” becomes “NP are quite a large number of NP living here .”
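A sketch of the folding step; the placeholder token and the naive first-occurrence string replacement are assumptions about how the collapsed corpus is produced.

```python
def fold_nps(src_words, tgt_words, np_pairs, token="NP"):
    """Collapse each matched (source NP, target NP) pair into one placeholder
    token on both sides, so the aligner treats the phrase as a single unit.
    Folds only when the NP string occurs on both sides (first occurrence,
    simple whitespace matching)."""
    def replace(words, phrase):
        sent, p = " ".join(words), " ".join(phrase)
        return sent.replace(p, token, 1).split() if f" {p} " in f" {sent} " else None
    for s_np, t_np in np_pairs:
        new_src = replace(src_words, s_np.split())
        new_tgt = replace(tgt_words, t_np.split())
        if new_src is not None and new_tgt is not None:
            src_words, tgt_words = new_src, new_tgt
    return src_words, tgt_words

src = "There are quite a large number of Malayalees living here .".split()
tgt = "malayAlama logoM kI bahuwa badZI saMKyA hE .".split()
print(fold_nps(src, tgt, [("There", "."), ("Malayalees", "malayAlama")]))
```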

  37. Experiments • English–Hindi (resource-constrained, 5k corpus) • English–Hindi (larger, 55k corpus)

  38. Word Alignment Experiments • Training: 5000 sentences • Testing: 200 sentences • Human-extracted NP table – 21,736 entries

  39. Word Alignment Results (5k corpus) • Experiments with 5k training corpus and 200 test sentences

  40. NP Projection Results (5k) • Evaluation: 21,736-entry NP table harvested from the 5k test-bed corpus

  41. Word Alignment Experiments • Training: 50K Eng-Hin corpus • Testing: 200 Eng-Hin aligned sentences • Human-extracted NP table – 21,736 entries

  42. Word Alignment Results (55k corpus) • Experiments with 55k training corpus and 200 test sentences

  43. NP Projection Results (55k) • Evaluation: 21,736-entry NP table created by human alignment

  44. From here… • Improvements • Reliable NP Projection • Hierarchical Word Alignment • Machine Translation • Rule Learning • Refined probabilistic translation lexicon • Clean, linguistically motivated phrase table with probabilities

  45. Questions ?

  46. Thanks !
