
Morphological Analysis for Phrase-Based Statistical Machine Translation

HYP update - part1. Morphological Analysis for Phrase-Based Statistical Machine Translation. Luong Minh Thang, WING group meeting – 15 Aug, 2008. Agenda: Introduction - what does my project title mean? Language pair. English-Finnish challenges. Related works. Project direction.


Presentation Transcript


  1. HYP update - part1 Morphological Analysis for Phrase-Based Statistical Machine Translation Luong Minh Thang WING group meeting – 15 Aug, 2008

  2. Agenda • Introduction - what does my project title mean? • Language pair • English-Finnish challenges • Related works • Project direction

  3. Introduction I: phrase-based SMT • Statistical: derive statistical information from large data • Phrase-based: capture local constraints [Figure: Source / Target]

  4. Introduction II - Morphology • Morpheme: minimal meaning-bearing unit • machines = machine + s • translation = translate + ion • goalkeeper = goal + keeper • English is a low-inflected language with simple morphological structure -> highly inflected languages are much more complicated!

  5. Introduction III – highly inflected languages • Concatenate a chain of morphemes to form a word • Finnish: oppositio + kansa + n + edusta + ja (opposition + people + of + represent + -ative) = opposition member of parliament • Turkish: uygarlaştıramadıklarımızdanmışsınızcasına (uygar + laş + tır + ama + dık + lar + ımız + dan + mış + sınız + casına) = (behaving) as if you are among those whom we could not cause to become civilized. This is a single word!

  6. Introduction IV – Why morphology-aware SMT? • Tackle the data sparseness problem [Table: statistics from 1,021,180 sentence pairs] • Capture the relations among words, e.g. Spanish máquina / máquinas and English machine / machines
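The data sparseness point on this slide can be illustrated with a minimal sketch. The segmentations below are hand-written for illustration (they are assumptions, not output of any real analyzer), and `count_types` is a hypothetical helper name. The sketch only shows that splitting words into morphemes lets related forms such as "machine" and "machines" share a unit, so the number of distinct types drops.

```python
from collections import Counter

# Hypothetical toy segmentations, hand-written for illustration only
# (not produced by Morfessor or any other tool).
SEGMENTATIONS = {
    "machine": ["machine"],
    "machines": ["machine", "+s"],
    "translate": ["translate"],
    "translation": ["translate", "+ion"],
    "translations": ["translate", "+ion", "+s"],
}


def count_types(tokens):
    """Return (distinct word types, distinct morpheme types) for a token list."""
    word_counts = Counter(tokens)
    # Unknown words are kept whole, i.e. treated as a single morpheme.
    morpheme_counts = Counter(
        m for t in tokens for m in SEGMENTATIONS.get(t, [t])
    )
    return len(word_counts), len(morpheme_counts)


tokens = "machines translate translations machine translation".split()
word_types, morpheme_types = count_types(tokens)
print(word_types, morpheme_types)
```

On a real corpus of a highly inflected language the gap between the two counts would be expected to be much larger than in this toy example.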

  7. Language pair I – our choice? • We chose English-Finnish as our main translation task [Figure: languages ordered from low-inflected (e.g. Vietnamese) to highly inflected (Dyer, 2007)]

  8. Language pair II – why Finnish? • Honestly, I don’t know Finnish … • But we chose it because: • Corpora are available • Finnish is an agglutinative, morphologically complex language, suitable for our project scope • Translation from low- to highly inflected languages is an area worth exploring, yet hard!

  9. English-Finnish challenges I – many-to-one word relationship • Finnish uses suffixes to express grammatical relations and also to derive new words (about 14-15 cases for nouns); suffixation is not mere concatenation • Many-to-one English-Finnish word relationship -> need word-morpheme correspondence
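As a small illustration of word-morpheme correspondence, the sketch below uses a standard textbook example that is not taken from the slides: Finnish "talossani" (talo + ssa + ni) corresponds to the three English words "in my house". The alignment table and the function name `align_to_morphemes` are assumptions made for this example.

```python
# Hand-made example (an assumption, not from the slides):
# "talossani" = talo (house) + ssa (in) + ni (my) = "in my house".
MORPHEMES = ["talo", "ssa", "ni"]

# Hypothetical English-word -> Finnish-morpheme alignment table.
ALIGNMENT = {"house": "talo", "in": "ssa", "my": "ni"}


def align_to_morphemes(english_tokens, alignment, morphemes):
    """Pair each English token with the index of its aligned morpheme.

    Tokens with no entry in the alignment table map to None.
    """
    pairs = []
    for token in english_tokens:
        morpheme = alignment.get(token)
        index = morphemes.index(morpheme) if morpheme in morphemes else None
        pairs.append((token, index))
    return pairs


print(align_to_morphemes("in my house".split(), ALIGNMENT, MORPHEMES))
```

Three English words align to three morphemes of one Finnish word, which is why alignment at the word level alone would lose information.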

  10. English-Finnish challenges II – word order • Word order is “free” in Finnish • Pete rakastaa Annaa = Pete loves Anna (normal) • Annaa Pete rakastaa: emphasizes Annaa • Rakastaa Pete Annaa: emphasizes rakastaa = Pete does love Anna • Pete Annaa rakastaa: stress on Pete • Rakastaa Annaa Pete: does not sound like a normal sentence, but is quite understandable

  11. English-Finnish challenges III – surface form generation • After translating English words -> Finnish morphemes, a surface generation step is needed: oppositio + kansa + n + edusta + ja -> oppositiokansanedustaja • What if morphemes are missing or the morpheme order changes? -> Need a more error-tolerant surface recovery algorithm
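One way to picture an error-tolerant recovery step is sketched below. This is not the project's algorithm: it simply concatenates the morphemes and, if the result is not a known word, falls back to the closest entry of a word list using Python's standard `difflib.get_close_matches`. The vocabulary, the `cutoff` value, and the function names are assumptions made for illustration.

```python
import difflib


def join_morphemes(morphemes):
    """Naive surface generation: concatenate the morphemes in order."""
    return "".join(morphemes)


def recover_surface(morphemes, vocabulary, cutoff=0.8):
    """Return a known surface form for a morpheme sequence when possible.

    If the plain concatenation is not in the vocabulary, fall back to the
    most similar known word (string similarity >= cutoff); otherwise
    return the concatenation unchanged.
    """
    candidate = join_morphemes(morphemes)
    if candidate in vocabulary:
        return candidate
    matches = difflib.get_close_matches(candidate, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else candidate


# Hypothetical vocabulary of known Finnish surface forms.
vocab = ["oppositiokansanedustaja", "kansanedustaja", "oppositio"]

# Complete morpheme sequence: concatenation is already a known word.
print(recover_surface(["oppositio", "kansa", "n", "edusta", "ja"], vocab))
# The genitive "n" is missing: the closest known word is chosen instead.
print(recover_surface(["oppositio", "kansa", "edusta", "ja"], vocab))
```

String similarity is a deliberately simple stand-in here; it handles a dropped morpheme but would not by itself resolve changes in morpheme order reliably.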

  12. Related works I – low-to-high inflected languages • Much work translates from highly to low-inflected languages, but very little addresses the opposite direction, which is considered hard in (Koehn, 2005) • (Yang & Kirchhoff, 2006): Finnish-English, backoff • (Oflazer & Durgar El-Kahlout, 2006, 2007): English-Turkish, word-morpheme translation, then simply concatenating morphemes • All use language-dependent tools & syntactic knowledge: TreeTagger, Snowball stemmer …

  13. Related works II – surface form recovery • (Toutanova et al., 2007, 2008): English-Russian, English-Arabic; translate stem-to-stem; predict inflection from stems using many different features (lexical, morphological, and syntactic) • (Avramidis & Koehn, 2008): English-Greek; use syntax to get the “missing” morphology, depending on the syntactic position; noun case agreement and verb person conjugation -> Rely mostly on manually annotated data

  14. Project direction • Use a language-independent tool (Morfessor), based on unannotated data only (i.e. no feature data or syntactic information) • Work on general surface-form recovery • We would like a unified view of the translation process, distinguishing low-low, low-high, high-low, high-high • We are here

  15. Reference I • Chris Dyer, 2007 http://www.ling.umd.edu/~redpony/edinburgh.pdf • Jurafsky, D., & Martin, J. H. (2007). Speech and language processing book • The Finnish language http://www.cs.tut.fi/~jkorpela/Finnish.html • Yang & Kirchhoff, 2006: Phrase-based backoff models for machine translation of highly inflected languages • Oflazer & Durgar El-Kahlout, 2006: Initial Explorations in English to Turkish Statistical Machine Translation

  16. Reference II • Oflazer & Durgar El-Kahlout, 2007: Exploring different representational units in English-to-Turkish statistical machine translation • Toutanova et al., 2007: Generating complex morphology for machine translation • Toutanova et al., 2008: Applying morphology generation models to machine translation • Avramidis & Koehn, 2008: Enriching morphologically poor languages for statistical machine translation

  17. Q & A?

  18. To be continued … • Thank you !!!
