
Linguistic Enrichment of Statistical Transliteration

Linguistic Enrichment of Statistical Transliteration (लिंगुइस्टिक एनरिच्मेंट ऑफ़ स्टटिस्टिकल ट्रांसलिटरेशन). MTP Final Stage Presentation. Guided by: Prof. Pushpak Bhattacharyya. Presented by: Abhijeet Padhye (06305902). Department of Computer Science & Engineering


Presentation Transcript


  1. Linguistic Enrichment of Statistical Transliteration (लिंगुइस्टिक एनरिच्मेंट ऑफ़ स्टटिस्टिकल ट्रांसलिटरेशन) • MTP Final Stage Presentation • Guided by: Prof. Pushpak Bhattacharyya • Presented by: Abhijeet Padhye (06305902) • Department of Computer Science & Engineering, IIT Bombay

  2. Presentation Pathway • Problem Statement • Motivation • What is Transliteration? • Syllables and their Structure • Sonority Theory • Concept of Schwa • Proposed Transliteration Model • Experiments and Results • Discussions • Conclusion and Future Work • References

  3. Problem Statement To exploit the Phonological similarities of Roman and Devanagari in order to linguistically aid the process of Statistical Transliteration.

  4. Motivation • An important component of Machine Translation • When you cannot translate, transliterate. • Critical in tackling the problem of out-of-vocabulary (OOV) words and proper nouns • Proves crucial in translating named entities for cross-language information retrieval (CLIR) • Transliteration is a phonetic translation process, so it is apt to exploit phonetic and phonological properties

  5. What is Transliteration? • A process of phonetically translating words, such as named entities or technical terms, from the source-language alphabet to the target-language alphabet. • Examples: • Gandhiji – गाँधीजी • OOV words like नमस्कार – Namaskar

  6. Humans translate/transliterate frequently for different reasons. Fig : An example of how transliteration comes to the rescue when no translation exists

  7. Overview of Transliteration • Fig : Source Word → Transliteration Units → Transliteration Units → Target Word • Candidate transliteration units: Character n-grams, Syllables

  8. Basics of syllables “Syllable is a unit of spoken language consisting of a single uninterrupted sound formed generally by a Vowel and preceded or followed by one or more consonants.” • Vowels are the heart of a syllable (the most sonorous element) • Consonants act as sounds attached to vowels.

  9. Syllable Structure • Simple syllables – Baba (Ba + ba), दादा (दा + दा) • Complex syllables – Andrew (An + drew): VC? CVC? • Alert!!! The basic structure doesn’t suffice

  10. Possible syllable structures • The Nucleus is always present • Onset and Coda may be absent • Possible structures • V • CV • VC • CVC
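The four structures above can be checked mechanically. The following is a minimal sketch, not the presenter's implementation: it assumes a simplified Roman vowel set (`aeiou`) and treats a whole consonant cluster as a single C slot, so that "drew" counts as CVC. The names `cv_pattern` and `is_valid_shape` are hypothetical.

```python
import re

# Assumption: a simplified vowel inventory for Roman script.
VOWELS = set("aeiou")

# The four structures listed on the slide (nucleus always present).
ALLOWED_SHAPES = {"V", "CV", "VC", "CVC"}


def cv_pattern(syllable: str) -> str:
    """Map each letter to C or V, then collapse runs so a cluster fills one slot."""
    marks = "".join("V" if ch in VOWELS else "C" for ch in syllable.lower())
    return re.sub(r"(.)\1+", r"\1", marks)


def is_valid_shape(syllable: str) -> bool:
    """True if the collapsed pattern is one of V, CV, VC, CVC."""
    return cv_pattern(syllable) in ALLOWED_SHAPES


for s in ["ba", "an", "drew", "sandeep"]:
    print(s, cv_pattern(s), is_valid_shape(s))
```

A whole word such as "sandeep" collapses to CVCVC, which is not a single-syllable shape and therefore signals that the word still has to be split.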

  11. Introduction to sonority theory “The Sonority of a sound is its loudness relative to other sounds with the same length, stress and pitch.” • Some sounds are more sonorous than others • Words in a language can be divided into syllables • Sonority theory identifies syllables on the basis of the relative sonority of their sounds.

  12. Sonority Hierarchy • Obstruents can be further classified into:- • Fricatives • Affricates • Stops

  13. Sonority sequencing principle “The Sonority Profile of a syllable must rise until its Peak (Nucleus), and then fall.” Fig : Sonority profile with Onset, Peak (Nucleus) and Coda

  14. Example • ABHIJEET • Fig : Sonority Profile 1 (letters A, B, H, I, J, E, E, T placed by sonority) • Fig : Sonority Profile 2 (letters A, B, H, I, J, E, E, T placed by sonority)
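The sequencing principle can be tested on a candidate syllable as sketched below. This is only an illustration under stated assumptions: the numeric ranks and the letter-to-class mapping in `SONORITY` are hypothetical (the slides give the hierarchy only as a figure), letters stand in for sounds, and plateaus such as "ee" are allowed.

```python
# Assumption: a letter-level sonority scale following
# vowels > glides > liquids > nasals > fricatives > affricates > stops.
SONORITY = {}
for letters, rank in [
    ("aeiou", 7),     # vowels
    ("wy", 6),        # glides
    ("lr", 5),        # liquids
    ("mn", 4),        # nasals
    ("fvszh", 3),     # fricatives
    ("cj", 2),        # affricates (rough letter-level stand-ins)
    ("pbtdkgqx", 1),  # stops
]:
    for ch in letters:
        SONORITY[ch] = rank


def sonority_profile(syllable: str) -> list:
    """Sonority rank of each letter, left to right."""
    return [SONORITY[ch] for ch in syllable.lower()]


def obeys_ssp(syllable: str) -> bool:
    """True if sonority never falls before the first peak and never rises after it."""
    profile = sonority_profile(syllable)
    peak = profile.index(max(profile))
    rising = all(a <= b for a, b in zip(profile[:peak], profile[1:peak + 1]))
    falling = all(a >= b for a, b in zip(profile[peak:], profile[peak + 1:]))
    return rising and falling


for s in ["drew", "jeet", "abhi"]:
    print(s, sonority_profile(s), obeys_ssp(s))
```

A string such as "abhi" fails the check because it contains two sonority peaks, which is the cue that it spans more than one syllable.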

  15. The concept of schwa • First letter of the IAL alphabet – {a} • Unstressed and toneless neutral vowel • Some schwas are deleted in pronunciation and some are not • Schwa deletion is an important issue for grapheme-to-phoneme conversion • Handled using a well-established schwa deletion algorithm • Example: Priyatama – the last “a” changes the gender: प्रियतम (masculine) vs. प्रियतमा (feminine)

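  16. Proposed Transliteration Model • Fig : Source Language Words → Syllabification Modules → Source Language Syllables • Target Language Words → Syllabification Modules → Target Language Syllables • Source and Target Language Syllables → Moses Training → Phrase translation tables • Target Language Syllables → SRILM → Target Language Model • Source Language Words + Phrase translation tables + Target Language Model → Moses Decoder → Transliterated output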

  17. Transliteration system workflow • Syllabification of a parallel list of names in Roman and Devanagari • Using these parallel lists for: • Alignment of syllables • Training the Moses translation toolkit • Language model generation using SRILM • Decoding using the trained phrase-translation tables and language model • Comparing results to analyze performance

  18. Experiments and Results • Syllabification of Roman and Devanagari words Fig : Syllabification Algorithm

  19. Syllabification results • A few examples

  20. Transliteration Process • Syllabification of list of 10000 parallel names written in Roman and Devanagari and preparing a parallel aligned list of syllables. • Training Language Models for target language using SRILM toolkit. • Training MOSES with aligned corpus of 7500 names and target language model as input. • Testing with a list of 2500 proper names using the trained model for transliteration.
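The data preparation in the steps above might look roughly like the following sketch. It assumes each name is already syllabified, treats every syllable as one whitespace-separated token (so a name plays the role of a "sentence" in line-aligned parallel text), and splits the list 75/25 to match the 7500/2500 figures. The helper names and the in-memory representation are hypothetical, and no Moses or SRILM command is shown.

```python
def to_token_line(syllables):
    """Join the syllables of one name into a space-separated token line."""
    return " ".join(syllables)


def split_corpus(pairs, train_fraction=0.75):
    """Split aligned (source, target) pairs into training and test portions."""
    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]


def parallel_lines(pairs):
    """Produce two line-aligned lists, one per language."""
    source = [to_token_line(src) for src, _ in pairs]
    target = [to_token_line(tgt) for _, tgt in pairs]
    return source, target


# Illustrative pairs only, taken from the syllable examples in the slides.
pairs = [(["ba", "ba"], ["बा", "बा"]), (["da", "da"], ["दा", "दा"])]
source_lines, target_lines = parallel_lines(pairs)
print(source_lines)  # ['ba ba', 'da da']
print(target_lines)  # ['बा बा', 'दा दा']
```

Writing `source_lines` and `target_lines` to two files, one line per name, would give the kind of line-aligned corpus a phrase-based training toolkit consumes.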

  21. Roman to Devanagari Transliteration Fig : Result for Roman to Devanagari Transliteration Fig : Top-n Inclusion results

  22. Devanagari to Roman Transliteration Fig : Result for Devanagari to Roman Transliteration Fig : Top-n translation results
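The result figures are not reproduced in this transcript, so the exact metric is not recoverable. A plausible reading of "Top-n inclusion" is the fraction of test words whose reference transliteration appears among the system's first n candidates; the sketch below implements that assumed definition with a hypothetical function name and placeholder data.

```python
def top_n_inclusion(candidates_per_word, references, n):
    """Fraction of words whose reference is among the first n ranked candidates.

    Assumption: candidates_per_word[i] is a best-first list for word i.
    """
    hits = sum(
        1
        for candidates, reference in zip(candidates_per_word, references)
        if reference in candidates[:n]
    )
    return hits / len(references)


# Placeholder data, not results from the presentation.
ranked = [["a", "b", "c"], ["x", "y", "z"]]
gold = ["b", "z"]
for n in (1, 2, 3):
    print(n, top_n_inclusion(ranked, gold, n))
```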

  23. Comparison with Character n-gram based model • Same experimental setup; transliteration units changed to n-grams • Bigrams (Sandeep → Sa, an, nd, de, ee, ep) • Trigrams (Sandeep → San, and, nde, dee, eep) • Quadrigrams (Sandeep → Sand, ande, ndee, deep) • Observations suggest that performance improves when syllables are used as transliteration units • n-gram based models are blind to phonological properties such as unstressed vowels Fig : Comparison with N-gram based model
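The n-gram units listed on this slide are overlapping character windows. A minimal sketch (the function name `char_ngrams` is hypothetical) that reproduces the "Sandeep" examples:

```python
def char_ngrams(word, n):
    """All overlapping character windows of length n, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(char_ngrams("Sandeep", 2))  # ['Sa', 'an', 'nd', 'de', 'ee', 'ep']
print(char_ngrams("Sandeep", 3))  # ['San', 'and', 'nde', 'dee', 'eep']
print(char_ngrams("Sandeep", 4))  # ['Sand', 'ande', 'ndee', 'deep']
```

Unlike syllables, these windows are cut at fixed lengths regardless of where vowels fall, which is why they carry no phonological information.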

  24. Comparison with State-of-the-art Systems • Google transliteration engine and Quillpad used as benchmarks for comparison • A list of 1000 words written in Roman alphabet used as test input • Our system outperforms Quillpad and just falls short of Google’s results. • A more intense training with larger training set might improve system performance. Fig : Comparison with State-of-the-art transliteration systems

  25. Discussions • Accents • थोड़ा: Thoda or thora? • Mapping of sounds • Mahaan – महान, Kahaan – कहाँ • Silent letters • Psychiatrist – सायकेट्रिस्ट

  26. Discussions (contd.) • Improper schwa deletion • Venkatachalam – वेंकटचलम: वें + कट + च + लम or वेंक + टच + लम • Improper placement (Onset or Coda) • सिराजउद्दीन – सि राज उद् दिन or सि रा जउद् दिन • Similar phonological structure but different pronunciation • सोमलता (सोम + लता) and कोमलता (को + मल + ता)

  27. Conclusion and Future work • Transliteration can prove critical in supporting Machine Translation • Phonologically aware transliteration units like syllables show strong signs of performance improvement over character n-grams • Syllable-based transliteration performs comparably to state-of-the-art systems: it outperforms Quillpad and falls just short of Google • Syllabification algorithms should be improved further • The developed system should be supplied with a larger and more accurate training set • Some of the linguistic issues discussed above are very challenging cases for future work on transliteration

  28. References • Pirkola A., Toivonen J., Keskustalo H., Visala K., Jarvelin K. 2003. Fuzzy Translation of Cross-Lingual Spelling Variants. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. • Gao W., Lam W., and Wang K. 2004. Phoneme-based Transliteration of Foreign Names for OOV Problem. International Joint Conference on Natural Language Processing. • Fujimura O. 1975. Syllable as a Unit of Speech Recognition. IEEE Transactions on Acoustics, Speech and Signal Processing. • Koehn P. et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. Annual Meeting of the Association for Computational Linguistics (ACL), Demonstration Session, Prague, Czech Republic. • Laver J. 1994. Principles of Phonetics. Cambridge University Press. p. 114. • Knight K. and Graehl J. 1997. Machine Transliteration. In Proceedings of ACL 1997. pp. 128-135. • Stolcke A. 2002. SRILM – An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing. • Choudhury M. and Basu A. 2002. A Rule Based Schwa Deletion Algorithm for Hindi. Technical Report. Dept. of Comp. Sci. & Engg., Indian Institute of Technology, Kharagpur.

  29. Background Theory

  30. Approaches towards Transliteration

  31. Complex Syllable structure Fig : Detailed syllable structure Fig : Complex syllables fitting in above structure

  32. Sonority theory & syllables “A Syllable is a cluster of sonority, defined by a sonority peak acting as a structural magnet to the surrounding lower sonority elements.” • Represented as waves of sonority, or the Sonority Profile of that syllable Fig : Sonority profile with Onset, Nucleus and Coda

  33. Sonority Hierarchy for English and Hindi Fig : Sonority hierarchy for English Fig : Sonority hierarchy for Hindi

  34. Maximal Onset Principle “The Intervocalic consonants are maximally assigned to the Onsets of syllables in conformity with Universal and Language-Specific Conditions.” • When a word has two valid syllabifications, the one with the longer onset is preferred. • Example – Diploma • Di + plo + ma (preferred: onset “pl”) • Dip + lo + ma (onset “l”)
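The principle can be sketched as a small syllabifier for Roman-script words. This is an illustration under loud assumptions, not the syllabification algorithm used in the experiments: vowels are taken to be `aeiou`, any single consonant is accepted as an onset, and `LEGAL_ONSETS` is a small hypothetical list of permitted onset clusters standing in for the language-specific conditions.

```python
import re

# Assumption: a tiny, illustrative set of consonant clusters allowed as onsets.
LEGAL_ONSETS = {"pl", "pr", "bl", "br", "dr", "tr", "kl", "kr", "gl", "gr", "fl", "sl"}


def syllabify(word):
    """Split a word so that each intervocalic cluster gives the next
    syllable the longest onset permitted by LEGAL_ONSETS."""
    w = word.lower()
    nuclei = [(m.start(), m.end()) for m in re.finditer(r"[aeiou]+", w)]
    if not nuclei:
        return [w]
    boundaries = []
    for (_, prev_end), (next_start, _) in zip(nuclei, nuclei[1:]):
        cluster = w[prev_end:next_start]
        # Try the longest candidate onset first, then shorter suffixes.
        for k in range(len(cluster) + 1):
            onset = cluster[k:]
            if len(onset) <= 1 or onset in LEGAL_ONSETS:
                boundaries.append(prev_end + k)
                break
    starts = [0] + boundaries
    ends = boundaries + [len(w)]
    return [w[s:e] for s, e in zip(starts, ends)]


print(syllabify("diploma"))  # ['di', 'plo', 'ma']
print(syllabify("andrew"))   # ['an', 'drew']
```

Because "pl" is a permitted onset, "diploma" is split as di + plo + ma rather than dip + lo + ma; in "andrew" the cluster "ndr" is not permitted, so only "dr" moves to the onset.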

  35. Schwa deletion algorithm
      Procedure delete_schwa (DS)
      Input : word (String of alphabets)
      Output : Input word with some schwas deleted.
      • Mark all the full vowels and consonants followed by vowels other than the inherent schwas (i.e. consonants with Matras) and all the hs in the word as F unless it is explicitly marked as half by use of halant. Mark all the consonants immediately followed by consonants or halants (i.e. consonants of conjugate syllables) as H. Mark all the remaining consonants, which are followed by implicit schwas, as U.
      • If in the word, y is marked as U and preceded by i, I, ri, u or U, mark it F.
      • If y, r, l or v are marked U and preceded by consonants marked H, then mark them F.
      • If a consonant marked U is followed by a full vowel, then mark that consonant as F.
      • While traversing the word from left to right, if a consonant marked U is encountered before any consonant or vowel marked F, then mark that consonant as F.
      • If the last consonant is marked U, mark it H.
      • If any consonant marked U is immediately followed by a consonant marked H, mark it F.
      • While traversing the word from left to right, for every consonant marked U, mark it H if it is preceded by F and followed by F or U, otherwise mark it F.
      • For all consonants marked H, if it is followed by a schwa in the original word, then delete the schwa from the word. The resulting new word is the required output.
      End procedure delete_schwa

  36. Example of Schwa deletion Fig : Application of Schwa deletion Algorithm

  37. Examples • Correct Transliterations • Incorrect Transliteration
