1 / 28

Transliteration

Transliteration. CS 626 course seminar by Purva Joshi 08305907 Mugdha Bapat 07305916 Aditya Joshi 08305908 Manasi Bapat 08305906. Humans transliterate frequently for different reasons Can a machine do this? (Why would a machine have to do this?) If yes, how?.

sai
Télécharger la présentation

Transliteration

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Transliteration CS 626 course seminar by Purva Joshi 08305907 Mugdha Bapat 07305916 Aditya Joshi 08305908 Manasi Bapat 08305906

  2. Humans transliterate frequently for different reasons Can a machine do this? (Why would a machine have to do this?) If yes, how? Picture courtesy: Snapshot of Yahoo! Messenger

  3. Motivation • An important component of machine translation • When you cannot translate, transliterate • Generally used for named entities, technical terms and out of vocabulary words (OOV) • Issues specific to sounds, scripts and accents • Can a machine do this? If yes, how?

  4. What is transliteration? • Task of converting a word from one alphabetic script to another Used for: • Named entities • : Gandhiji • Out of vocabulary words • : Bank

  5. Linguistic issues • Accents : Thoda or thora? • Mapping of sounds • Mahaan: Kahaan: • Back-transliteration

  6. Linguistic Issues : Mapping of sounds Arabic Chinese Hindi / Japanese • Arabic b -> English p or b • English word: Paul transliterates to • Arabic word: Baul (issue in • Back-transliteration) • Origin of the proper noun determines • the symbol in Chinese language • Ideographic symbols in Chinese • Several English symbols do not map • to any Japanese symbols. So, often • mapped to closest sounding symbol • ice cream  aisukuriimu • Symbols map to different symbols • based on their position • America • Difference in origin • Restaurant • constant

  7. x Overview Source String Target String Transliteration Units Transliteration Units

  8. Contents Source String Target String Transliteration Units Transliteration Units Phoneme- based

  9. Phoneme-based approach Word in Source language Word in Target language P(wt) P( ps | ws) P ( wt | pt ) Pronunciation in Source language Pronunciation In target language P ( pt | ps ) Wt* = argmax (P (wt). P (wt | pt) . P (pt | ps) . P (ps | ws) ) Note:Phoneme is the smallest linguistically distinctive unit of sound.

  10. Phoneme-based approach Transliterating ‘BAPAT’ B A P A T Source word to phonemes B /ə/ /a:/ P /ə/ /a:/ T t Source phonemes to target phonemes B /ə/ /a:/ P /ə/ /a:/ T t Step II : Converting to phoneme seq. Step III : Converting to target phoneme seq. Step I : Consider each character of the word

  11. Phoneme-based approach Step IV : Phoneme sequence to target string B : /ə/ : /a:/ : P: /ə/ : T: t: /a:/ : Output :

  12. Concerns Check if the world is valid In target language Word in Source language Word in Target language Check if environment Is noise-free Pronunciation in Source language Pronunciation In target language

  13. Issues in phonetic model • Unknown pronunciations • Back-transliteration can be a problem Johnson   Jonson sanhita samhita

  14. Contents Source String Target String Transliteration Units Transliteration Units Phoneme- based Spelling- based

  15. Spelling-based model • Maps source word sequences to target word sequences (i.e. direct word to word) • The transliteration score: • P(w) Word in Target language Word in Source language Pronunciation in Source language Pronunciation In target language Letter trigram model included Thus, we can accommodate the words not included in the dictionary

  16. Comparison of the two methods

  17. Contents Source String Target String Transliteration Units Transliteration Units Phoneme- based Spelling- based Joint Source Channel

  18. The Third Method - Why? • Particularly developed for Chinese • Chinese : Highly ideographic • Example : • Two main steps: Modeling Decoding Image courtesy: wikimedia-commons

  19. Modeling Step Modeling step • A bilingual dictionary in the source and target language • From this dictionary, the character mapping between the source and target language is learnt The word “Geo” has two possible mappings, the “context” in which it occurs is important

  20. Modeling step … Modeling step … • N-gram Mapping : • < Geo, > < rge, > • < Geo, > < lo, > • This concludes the modeling step

  21. Decoding Step Decoding step • Consider the transliteration of the word “George”. • Alignments of George: • GeorgeGeorge • GeorgeGeorge

  22. Decoding step … Decision to be made between…. • The context mapping is present in the map-dictionary • Using ……

  23. Transliteration Alignment • Where do the n-gram statistics come from? Ans.: Automatic analysis of the bilingual dictionary • How to align this dictionary? Ans. : Using EM-algorithm

  24. EM Algorithm Bootstrap Bootstrap initial random alignment Update n-gram statistics to estimate probability distribution Apply the n-gram TM to obtain new alignment Expectation Derive a list of transliteration units from final alignment Maximization Transliteration Units

  25. Evaluation E2C Error rates for n-gram tests E2C v/s C2E for TM Tests

  26. Conclusion • Transliteration can make use of phonemes as an intermediate layer to move from a script to another • Spelling-based approach connects the word sequences of the two languages • The joint source channel method integrates optimization of alignment and transliteration • no pre-alignment needed • reduction in development efforts

  27. ( the end )

  28. References • For all Devnagari transliterations, www.quillpad.in/hindi/ • Phoneme and spelling-based models K. Knight and J. Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4):599–612. N. AbdulJaleel and L. S. Larkey. 2003. Statistical transliteration for English-Arabic cross language information retrieval. In CIKM, pages 139–146. Y. Al-Onaizan and K. Knight. 2002. Machine transliteration of names in Arabic text. In ACL Workshop on Comp. Approaches to Semitic Languages. • Joint source-channel model H. Li,M. Zhang, and J. Su. 2004. A joint source-channel model for machine transliteration. In ACL, pages 159–166. www.wikipedia.org

More Related