Linguistic Enrichment of Statistical Transliteration

Linguistic Enrichment of Statistical Transliteration (लिंगुइस्टिक एनरिच्मेंट ऑफ़ स्टटिस्टिकल ट्रांसलिटरेशन)
MTP Final Stage Presentation
Guided by: Prof. Pushpak Bhattacharyya
Presented by: Abhijeet Padhye (06305902)
Department of Computer Science & Engineering, IIT Bombay

Presentation Pathway
• Problem Statement
• Motivation
• What is Transliteration?
• Syllables and their Structure
• Sonority Theory
• Concept of Schwa
• Proposed Transliteration Model
• Experiments and Results
• Discussions
• Conclusion and Future Work
• References

Problem Statement
To exploit the phonological similarities between the Roman and Devanagari scripts in order to linguistically aid the process of statistical transliteration.

Motivation
• An important component of Machine Translation
• When you cannot translate, transliterate.
• Critical in tackling the problem of out-of-vocabulary (OOV) words and proper nouns
• Proves crucial in translating named entities for Cross-Lingual Information Retrieval (CLIR)
• Transliteration is a phonetic translation process
• Apt to exploit phonetic and phonological properties

What is Transliteration?
• A process of phonetically translating words, such as named entities or technical terms, from the source-language alphabet to the target-language alphabet.
• Examples:
  • Gandhiji – गाँधीजी
  • OOV words like नमस्कार – Namaskar

Humans translate/transliterate frequently for different reasons
An example of how transliteration comes to the rescue when no translation exists

Overview of Transliteration
[Diagram: Source Word, Transliteration Units (source), Transliteration Units (target), Target Word. Candidate transliteration units: character n-grams, syllables]

Basics of syllables
“Syllable is a unit of spoken language consisting of a single uninterrupted sound formed generally by a Vowel and preceded or followed by one or more consonants.”
• Vowels are the heart of a syllable (the most sonorous element)
• Consonants act as sounds attached to vowels.

Syllable Structure
• Simple syllables – Baba (Ba + ba), दादा (दा + दा)
• Complex syllables – Andrew (An + drew: VC? CVC?)
Alert! The basic structure does not suffice.

Possible syllable structures
• The Nucleus is always present
• Onset and Coda may be absent
• Possible structures:
  • V
  • CV
  • VC
  • CVC
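The four structures can be illustrated with a small letter-based check. This is only a sketch: it assumes the Roman letters a, e, i, o, u stand for vowels and every other letter for a consonant, and it collapses consonant clusters into a single C slot. The names `cv_skeleton` and `fits_basic_structure` are hypothetical and are not part of the presented system.

```python
from itertools import groupby

# Assumption: orthographic vowels only; real phonology needs phonemes.
VOWEL_LETTERS = set("aeiou")
BASIC_STRUCTURES = {"V", "CV", "VC", "CVC"}


def cv_skeleton(syllable):
    """Label each letter C or V, then collapse runs of the same label."""
    labels = ["V" if ch in VOWEL_LETTERS else "C" for ch in syllable.lower()]
    return "".join(label for label, _ in groupby(labels))


def fits_basic_structure(syllable):
    """True if the collapsed skeleton is one of V, CV, VC, CVC."""
    return cv_skeleton(syllable) in BASIC_STRUCTURES


for unit in ["ba", "an", "drew", "andrew"]:
    print(unit, cv_skeleton(unit), fits_basic_structure(unit))
```

An unsplit word such as "andrew" yields the skeleton VCVC, which fits none of the four shapes, so it has to be divided into syllables first.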

Introduction to sonority theory
“The Sonority of a sound is its loudness relative to other sounds with the same length, stress and pitch.”
• Some sounds are more sonorous than others
• Words in a language can be divided into syllables
• Sonority theory distinguishes syllables on the basis of sounds.

Sonority Hierarchy
• Obstruents can be further classified into:
  • Fricatives
  • Affricates
  • Stops

Sonority sequencing principle
“The Sonority Profile of a syllable must rise until its Peak (Nucleus), and then fall.”
[Diagram: sonority profile with Onset, Peak (Nucleus) and Coda]
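The principle can be checked mechanically once each sound has a sonority rank. The sketch below is illustrative only: the letter-to-rank table is an assumed, commonly cited ordering (vowels > glides > liquids > nasals > fricatives > affricates > stops) applied to Roman letters, not the hierarchy figure from these slides, and `obeys_ssp` is a hypothetical name.

```python
# Assumed sonority ranks for Roman letters (higher = more sonorous).
SONORITY = {}
for letters, rank in [
    ("aeiou", 7),     # vowels
    ("wy", 6),        # glides
    ("lr", 5),        # liquids
    ("mn", 4),        # nasals
    ("fvszh", 3),     # fricatives
    ("cj", 2),        # affricates (rough letter-level approximation)
    ("pbtdkgqx", 1),  # stops
]:
    for ch in letters:
        SONORITY[ch] = rank


def obeys_ssp(syllable):
    """True if sonority strictly rises to the peak and strictly falls after it."""
    profile = [SONORITY[ch] for ch in syllable.lower()]
    peak = profile.index(max(profile))
    rising = all(a < b for a, b in zip(profile[:peak], profile[1:peak + 1]))
    falling = all(a > b for a, b in zip(profile[peak:], profile[peak + 1:]))
    return rising and falling


print(obeys_ssp("plan"))  # onset p < l < a, coda a > n
print(obeys_ssp("lpan"))  # l > p before the peak violates the rise
```

Because the comparison is strict, vowel digraphs such as "ee" would need to be merged into one nucleus before this check is applied.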

Example
• ABHIJEET
[Figures: Sonority Profile 1 and Sonority Profile 2 of ABHIJEET, plotting its letters A, B, H, I, J, E, E, T by sonority]

The concept of schwa
• First letter of the IAL alphabet – {a}
• Unstressed and toneless neutral vowel
• Some schwas are deleted and some are not
• Schwa deletion is an important issue for grapheme-to-phoneme conversion
• Handled using a well-established schwa deletion algorithm
• Example: Priyatama – the last “a” changes the gender: प्रियतम vs. प्रियतमा

Proposed Transliteration Model
[Flow diagram with the following components: Source Language Words and Target Language Words; Syllabification Modules; Source Language Syllables and Target Language Syllables; Moses Training; Target Language Model (SRILM); Phrase translation tables; Moses Decoder, taking Source Language Words and producing the Transliterated output]

Transliteration system workflow
• Syllabification of a parallel list of names in Roman and Devanagari
• Using these parallel lists for:
  • Alignment of syllables
  • Training the Moses translation toolkit
• Language model generation using SRILM
• Decoding using the trained phrase-translation tables and language model
• Comparing results to analyze performance

Experiments and Results
• Syllabification of Roman and Devanagari words
Fig: Syllabification Algorithm

Syllabification results • A few examples

Transliteration Process
• Syllabification of a list of 10000 parallel names written in Roman and Devanagari, and preparation of a parallel aligned list of syllables.
• Training language models for the target language using the SRILM toolkit.
• Training Moses with an aligned corpus of 7500 names and the target language model as input.
• Testing with a list of 2500 proper names, using the trained model for transliteration.
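The data preparation described in these steps can be sketched as follows, assuming each syllable is treated as one space-separated token and each name as one line of the parallel corpus (the usual one-segment-per-line layout for parallel text). The function names and the tiny sample are hypothetical; the split size here merely stands in for the 7500/2500 division above.

```python
def split_corpus(pairs, n_train):
    """Return (training pairs, test pairs): the first n_train pairs train, the rest test."""
    return pairs[:n_train], pairs[n_train:]


def parallel_lines(pairs):
    """Turn syllabified name pairs into two aligned lists of lines.

    Each syllable becomes one whitespace-separated token, so a phrase-based
    trainer sees syllables where it would normally see words.
    """
    source_lines = [" ".join(src) for src, _ in pairs]
    target_lines = [" ".join(tgt) for _, tgt in pairs]
    return source_lines, target_lines


# Hypothetical, hand-syllabified sample (not taken from the experiment data).
pairs = [
    (["ba", "ba"], ["बा", "बा"]),
    (["da", "da"], ["दा", "दा"]),
    (["ma", "ma"], ["मा", "मा"]),
    (["na", "na"], ["ना", "ना"]),
]

train, test = split_corpus(pairs, n_train=3)
src_lines, tgt_lines = parallel_lines(train)
print(src_lines)
print(tgt_lines)
```

Writing `src_lines` and `tgt_lines` to two files, one line per name, gives the kind of aligned input a phrase-based toolkit trains on; the target-side file can also feed the language model.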

Roman to Devanagari Transliteration
Fig: Result for Roman to Devanagari Transliteration
Fig: Top-n Inclusion results

Devanagari to Roman Transliteration
Fig: Result for Devanagari to Roman Transliteration
Fig: Top-n translation results
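One plausible reading of the top-n figures is the fraction of test words whose reference transliteration appears among the first n candidates returned by the decoder. The sketch below computes that quantity under this assumption; `top_n_inclusion` and the sample data are hypothetical and do not reproduce the reported results.

```python
def top_n_inclusion(candidates, references, n):
    """Fraction of items whose reference is among the first n ranked candidates."""
    hits = sum(
        1 for ranked, ref in zip(candidates, references) if ref in ranked[:n]
    )
    return hits / len(references)


# Hypothetical ranked outputs for three test words (best candidate first).
candidates = [
    ["gandhi", "gaandhi"],
    ["namaskaar", "namaskar"],
    ["rama", "ram"],
]
references = ["gandhi", "namaskar", "raam"]

print(top_n_inclusion(candidates, references, 1))
print(top_n_inclusion(candidates, references, 2))
```

Inclusion can only grow as n increases, which is why such curves are usually plotted for several values of n.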

Comparison with Character n-gram based model
• Same experimental setup; transliteration units changed to n-grams
  • Bigrams (Sandeep → Sa, an, nd, de, ee, ep)
  • Trigrams (Sandeep → San, and, nde, dee, eep)
  • Quadrigrams (Sandeep → Sand, ande, ndee, deep)
• Observations suggest a performance improvement when syllables are used as transliteration units
• n-gram based models are blind to phonological properties such as unstressed vowels
Fig: Comparison with N-gram based model
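The n-gram units listed above are overlapping character windows. A minimal sketch of how they can be produced is given below; `char_ngrams` is a hypothetical helper name, not a function from the presented system.

```python
def char_ngrams(word, n):
    """Return all overlapping character n-grams of word, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


for n in (2, 3, 4):
    print(n, char_ngrams("Sandeep", n))
```

Unlike syllables, these windows are cut at fixed lengths, so a unit such as "nd" or "nde" can straddle a syllable boundary.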

Comparison with State-of-the-art Systems
• The Google transliteration engine and Quillpad were used as benchmarks for comparison
• A list of 1000 words written in the Roman alphabet was used as test input
• Our system outperforms Quillpad and falls just short of Google’s results.
• More intensive training with a larger training set might improve system performance.
Fig: Comparison with State-of-the-art transliteration systems

Discussions
• Accents
  • थोड़ा: Thoda or thora?
• Mapping of sounds
  • Mahaan – महान, Kahaan – कहाँ
• Silent letters
  • Psychiatrist – सायकेट्रिस्ट

Discussions (contd.)
• Improper schwa deletion
  • Venkatachalam – वेंकटचलम: वें + कट + च + लम or वेंक + टच + लम
• Improper placement (Onset or Coda)
  • सिराजउद्दीन – सि राज उद् दिन or सि रा जउद् दिन
• Similar phonological structure but different pronunciation
  • सोमलता (सोम लता) and कोमलता (को मल ता)

Conclusion and Future Work
• Transliteration can prove critical in supporting Machine Translation
• Phonologically aware transliteration units such as syllables show strong signs of performance improvement
• Syllable-based transliteration performs comparably to state-of-the-art systems.
• The syllabification algorithms need further improvement
• The developed system should be supplied with a larger and more accurate training set.
• Some of the linguistic issues discussed above are very challenging cases for future work on transliteration

References
• Pirkola A., Toivonen J., Keskustalo H., Visala K., Jarvelin K. 2003. Fuzzy Translation of Cross-Lingual Spelling Variants. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
• Gao W., Lam W., and Wang K. 2004. Phoneme-based Transliteration of Foreign Names for OOV Problem. International Joint Conference on Natural Language Processing.
• Fujimura O. 1975. Syllable as a Unit of Speech Recognition. IEEE Transactions on Acoustics, Speech and Signal Processing.
• Koehn P. et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. Annual Meeting of the Association for Computational Linguistics (ACL), Demonstration Session, Prague, Czech Republic.
• Laver J. 1994. Principles of Phonetics. Cambridge University Press. p. 114.
• Knight K. and Graehl J. 1997. Machine Transliteration. In Proceedings of ACL 1997. pp. 128-135.
• Stolcke A. 2002. SRILM – An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing.
• Choudhury M. and Basu A. 2002. A Rule Based Schwa Deletion Algorithm for Hindi. Technical Report. Dept. of Comp. Sci. & Engg., Indian Institute of Technology, Kharagpur.

Background Theory

Approaches towards Transliteration

Complex Syllable structure
Fig: Detailed syllable structure
Fig: Complex syllables fitting in above structure

Sonority theory & syllables
“A Syllable is a cluster of sonority, defined by a sonority peak acting as a structural magnet to the surrounding lower sonority elements.”
• Represented as waves of sonority, or the Sonority Profile of that syllable
[Diagram: sonority profile with Onset, Nucleus and Coda]

Sonority Hierarchy for English and Hindi
Fig: Sonority hierarchy for English
Fig: Sonority hierarchy for Hindi

Maximal Onset Principle
“The Intervocalic consonants are maximally assigned to the Onsets of syllables in conformity with Universal and Language-Specific Conditions.”
• When a word has two valid syllabifications, the one with the longer onset is preferred.
• Example – Diploma
  • Di + plo + ma (preferred: the onset “pl” is maximal)
  • Dip + lo + ma
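The principle can be sketched as a simple letter-based syllabifier: between two vowel groups, the longest consonant suffix that is a permissible onset goes to the following syllable, and whatever remains becomes the coda of the preceding one. This is an illustration under stated assumptions, not the syllabification algorithm from the presentation; the set `LEGAL_CLUSTERS` is a small assumed sample of permissible two-consonant onsets, and `syllabify` is a hypothetical name.

```python
import re

# Assumption: a small illustrative set of permissible two-consonant onsets.
LEGAL_CLUSTERS = {"bl", "br", "dr", "fl", "fr", "gl", "gr", "kl", "kr", "pl", "pr", "tr"}


def is_legal_onset(cluster):
    """Empty and single-consonant onsets are always allowed in this sketch."""
    return len(cluster) <= 1 or cluster in LEGAL_CLUSTERS


def syllabify(word):
    """Split word so that each intervocalic cluster gives the next syllable
    the longest legal onset (Maximal Onset Principle)."""
    word = word.lower()
    # Each run of vowel letters is treated as one nucleus.
    nuclei = [(m.start(), m.end()) for m in re.finditer(r"[aeiou]+", word)]
    if not nuclei:
        return [word]
    boundaries = []
    for (_, left_end), (right_start, _) in zip(nuclei, nuclei[1:]):
        cluster = word[left_end:right_start]
        # k consonants stay behind as coda; try the smallest k first so the
        # onset cluster[k:] is as long as possible.
        for k in range(len(cluster) + 1):
            if is_legal_onset(cluster[k:]):
                boundaries.append(left_end + k)
                break
    starts = [0] + boundaries
    ends = boundaries + [len(word)]
    return [word[s:e] for s, e in zip(starts, ends)]


print(syllabify("diploma"))
print(syllabify("andrew"))
```

With "pl" in the assumed onset set, "diploma" is split before "pl" rather than after "p", matching the preferred analysis on the slide.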

Schwa deletion algorithm
Procedure delete_schwa (DS)
Input: word (string of alphabets)
Output: input word with some schwas deleted
1. Mark all the full vowels, the consonants followed by vowels other than the inherent schwa (i.e. consonants with matras), and all the h's in the word as F, unless explicitly marked as half by use of a halant. Mark all the consonants immediately followed by consonants or halants (i.e. consonants of conjugate syllables) as H. Mark all the remaining consonants, which are followed by implicit schwas, as U.
2. If in the word y is marked as U and preceded by i, I, ri, u or U, mark it F.
3. If y, r, l or v are marked U and preceded by consonants marked H, then mark them F.
4. If a consonant marked U is followed by a full vowel, then mark that consonant as F.
5. While traversing the word from left to right, if a consonant marked U is encountered before any consonant or vowel marked F, then mark that consonant as F.
6. If the last consonant is marked U, mark it H.
7. If any consonant marked U is immediately followed by a consonant marked H, mark it F.
8. While traversing the word from left to right, for every consonant marked U, mark it H if it is preceded by F and followed by F or U; otherwise mark it F.
9. For all consonants marked H, if it is followed by a schwa in the original word, then delete the schwa from the word. The resulting new word is the required output.
End procedure delete_schwa

Example of Schwa deletion Fig : Application of Schwa deletion Algorithm

Examples • Correct Transliterations • Incorrect Transliteration
