Transliteration

Transliteration CS 626 course seminar by Purva Joshi 08305907 Mugdha Bapat 07305916 Aditya Joshi 08305908 Manasi Bapat 08305906

Humans transliterate frequently for different reasons Can a machine do this? (Why would a machine have to do this?) If yes, how? Picture courtesy: Snapshot of Yahoo! Messenger

Motivation • An important component of machine translation • When you cannot translate, transliterate • Generally used for named entities, technical terms and out of vocabulary words (OOV) • Issues specific to sounds, scripts and accents • Can a machine do this? If yes, how?

What is transliteration? • Task of converting a word from one alphabetic script to another Used for: • Named entities • : Gandhiji • Out of vocabulary words • : Bank

Linguistic issues • Accents : Thoda or thora? • Mapping of sounds • Mahaan: Kahaan: • Back-transliteration

Linguistic Issues : Mapping of sounds Arabic Chinese Hindi / Japanese • Arabic b -> English p or b • English word: Paul transliterates to • Arabic word: Baul (issue in • Back-transliteration) • Origin of the proper noun determines • the symbol in Chinese language • Ideographic symbols in Chinese • Several English symbols do not map • to any Japanese symbols. So, often • mapped to closest sounding symbol • ice cream  aisukuriimu • Symbols map to different symbols • based on their position • America • Difference in origin • Restaurant • constant

x Overview Source String Target String Transliteration Units Transliteration Units

Contents Source String Target String Transliteration Units Transliteration Units Phoneme- based

Phoneme-based approach Word in Source language Word in Target language P(wt) P( ps | ws) P ( wt | pt ) Pronunciation in Source language Pronunciation In target language P ( pt | ps ) Wt* = argmax (P (wt). P (wt | pt) . P (pt | ps) . P (ps | ws) ) Note:Phoneme is the smallest linguistically distinctive unit of sound.

Phoneme-based approach Transliterating ‘BAPAT’ B A P A T Source word to phonemes B /ə/ /a:/ P /ə/ /a:/ T t Source phonemes to target phonemes B /ə/ /a:/ P /ə/ /a:/ T t Step II : Converting to phoneme seq. Step III : Converting to target phoneme seq. Step I : Consider each character of the word

Phoneme-based approach Step IV : Phoneme sequence to target string B : /ə/ : /a:/ : P: /ə/ : T: t: /a:/ : Output :

Concerns Check if the world is valid In target language Word in Source language Word in Target language Check if environment Is noise-free Pronunciation in Source language Pronunciation In target language

Issues in phonetic model • Unknown pronunciations • Back-transliteration can be a problem Johnson   Jonson sanhita samhita

Contents Source String Target String Transliteration Units Transliteration Units Phoneme- based Spelling- based

Spelling-based model • Maps source word sequences to target word sequences (i.e. direct word to word) • The transliteration score: • P(w) Word in Target language Word in Source language Pronunciation in Source language Pronunciation In target language Letter trigram model included Thus, we can accommodate the words not included in the dictionary

Comparison of the two methods

Contents Source String Target String Transliteration Units Transliteration Units Phoneme- based Spelling- based Joint Source Channel

The Third Method - Why? • Particularly developed for Chinese • Chinese : Highly ideographic • Example : • Two main steps: Modeling Decoding Image courtesy: wikimedia-commons

Modeling Step Modeling step • A bilingual dictionary in the source and target language • From this dictionary, the character mapping between the source and target language is learnt The word “Geo” has two possible mappings, the “context” in which it occurs is important

Modeling step … Modeling step … • N-gram Mapping : • < Geo, > < rge, > • < Geo, > < lo, > • This concludes the modeling step

Decoding Step Decoding step • Consider the transliteration of the word “George”. • Alignments of George: • GeorgeGeorge • GeorgeGeorge

Decoding step … Decision to be made between…. • The context mapping is present in the map-dictionary • Using ……

Transliteration Alignment • Where do the n-gram statistics come from? Ans.: Automatic analysis of the bilingual dictionary • How to align this dictionary? Ans. : Using EM-algorithm

EM Algorithm Bootstrap Bootstrap initial random alignment Update n-gram statistics to estimate probability distribution Apply the n-gram TM to obtain new alignment Expectation Derive a list of transliteration units from final alignment Maximization Transliteration Units

Evaluation E2C Error rates for n-gram tests E2C v/s C2E for TM Tests

Conclusion • Transliteration can make use of phonemes as an intermediate layer to move from a script to another • Spelling-based approach connects the word sequences of the two languages • The joint source channel method integrates optimization of alignment and transliteration • no pre-alignment needed • reduction in development efforts

( the end )

References • For all Devnagari transliterations, www.quillpad.in/hindi/ • Phoneme and spelling-based models K. Knight and J. Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4):599–612. N. AbdulJaleel and L. S. Larkey. 2003. Statistical transliteration for English-Arabic cross language information retrieval. In CIKM, pages 139–146. Y. Al-Onaizan and K. Knight. 2002. Machine transliteration of names in Arabic text. In ACL Workshop on Comp. Approaches to Semitic Languages. • Joint source-channel model H. Li,M. Zhang, and J. Su. 2004. A joint source-channel model for machine transliteration. In ACL, pages 159–166. www.wikipedia.org

Transliteration

Transliteration

Presentation Transcript

Named Entity Recognition and Transliteration for 50 Languages

Transliteration of Indic Scripts

Hindi – Urdu Transliteration issues

CLIR System Enhanced with Transliteration Generation and Mining

Automated Braille Transliteration System

Linguistic Enrichment of Statistical Transliteration

In Greek and Hebrew it is a transliteration

Transliteration: 3 sound: “ah” object: vulture Gardner: G1

Transliteration in ICU

A Joint Source-Channel Model for Machine Transliteration

Acts of transliteration: bridging scripts for learning in London schools

Unsupervised Constraint Driven Learning for Transliteration Discovery

Transliteration

Backward Machine Transliteration by Learning Phonetic Similarity

Automatic Name Transliteration via OCR and NLP

Translation - Transliteration

Transliteration involving English and Hindi languages using Syllabification Approach

Transliteration: mw sound: “mew, moo” object: waters Gardner: N35A

Transliteration: Show Me the English

Expert Translation - Transliteration Company

Using Transliteration with Entity Resolution for Arabic Datasets

Backward Machine Transliteration by Learning Phonetic Similarity