
A Method for Enhancing Search Using Transliteration of Mandarin Chinese. Vijay John, vijayjohn@mail.utexas.edu



Presentation Transcript


  1. A Method for Enhancing Search Using Transliteration of Mandarin Chinese • Vijay John • vijayjohn@mail.utexas.edu

  2. Transliterated Mandarin Search • Google suggests spelling correction

  3. Alternate Transliterations? • Want to say “Did you mean Peiching?”

  4. Transliteration Problems • “Beijing” provides many results • Google doesn’t find “Peiching,” “Peking,” “Bukgyeong,” etc. • Many pages using variety of transliterations • Transliterations unorganized • This paper organizes for Mandarin Chinese

  5. The Problem (Cont’d) • Why variety of transliterations? • Web content: 82% Romanized • Majority’s native languages: other scripts • Standard keyboards • Non-Romanized sources normally transliterated (esp. on Web) • Transliteration variations

  6. Example 1: Tibetan • Four languages: transliteration problems • Hello in Tibetan • Wylie (bkra shis bde legs) • Tibetan Pinyin • Several unofficial systems based on pronunciation • Spelled/transcribed in several ways (with some guidelines)

  7. Example 2: Malayalam • No official transliteration system • Transliteration based on personal preference (many unorganized variations) • Script conversion programs: more consistent systems • /maleja:m/ usu. transcribed “Malayalam” • malayaaLam (Maya), Malajal- (Slavic)

  8. Example 3: Romani • Vlax Romani standard • Literacy → few adopt standard • Different countries, different official languages → different spellings • No official systems (government) • Several transliteration systems exist (often inconsistent)—as in last 2 languages

  9. Example 4: Mandarin • Hànyŭ Pīnyīn • Tōngyòng Pīnyīn • Wade-Giles • Gwoyeu Romatzyh • (Yóuzhèngshì Pīnyīn) (etc.)

  10. Prior Work • In Mandarin: geared towards Chinese users searching for information from West • Western names-Hànzĭ-Hànyŭ Pīnyīn-Hànzĭ • Algorithms designed for Arabic & Japanese transliteration • Google • This method designed for Western users searching for Chinese information

  11. Initial Effort on Mandarin • Practical first step: increased trade with China • Simple transliteration problem (relatively) • Modifications for Tibetan, Romani, Hindustani, etc. • Intact for some other languages? (e.g. Russian, Arabic, Japanese, Korean) • Input = Hànyŭ Pīnyīn; output = other systems

  12. Initial Program • Combined many systems • Ying – yink – yenk – yenk’ – yemk’ – yermk’ – yarmk’ • Instead of “victory,” searched for “Yarmuk” River in Middle East • Transliteration systems organized by row but not by column

  13. Organize into Transliteration Table • Entries for “beijing” in two systems • (Purpose is to go from one column to another)

  14. Part of Patterns Table • 8 systems

  15. Decomposition • Search for “Beijing” in table • Delete one letter; search for “Beijin” • Beiji, Beij…B • Search for “eijing” (beijing – b) similarly • Ei found, search for “jing” • “J” found, search for “ing”
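The decomposition steps on this slide amount to a greedy longest-prefix match against the first (Hanyu Pinyin) column of the table. A minimal sketch follows, assuming a tiny hypothetical subset of table entries; the class name `PinyinDecomposer`, the `COMPONENTS` set, and the error handling are illustrative and not taken from the original program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PinyinDecomposer {
    // Hypothetical subset of first-column (Hanyu Pinyin) entries; the real table is larger.
    static final Set<String> COMPONENTS = Set.of("b", "p", "j", "e", "ei", "i", "in", "ing");

    // Splits a Pinyin word into table entries by repeatedly taking the longest matching prefix.
    static List<String> decompose(String word) {
        String rest = word.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        while (!rest.isEmpty()) {
            // Start with the whole remainder and delete one trailing letter at a time.
            int end = rest.length();
            while (end > 0 && !COMPONENTS.contains(rest.substring(0, end))) {
                end--;
            }
            if (end == 0) {
                throw new IllegalArgumentException("No table entry matches: " + rest);
            }
            parts.add(rest.substring(0, end));
            rest = rest.substring(end); // continue with what follows the matched entry
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(decompose("Beijing")); // prints [b, ei, j, ing]
    }
}
```

With this subset, “beijing” shrinks to “b” before a match is found, then “eijing” shrinks to “ei”, and so on, mirroring the sequence on the slide.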

  16. Composing new search terms • Components: b, ei, j, ing • B → b, p • ei → ei • j → j, ch • ing → ing
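Composing search terms from these components can be read as taking every combination of the alternatives listed for each component. The sketch below assumes exactly the four mappings on this slide; the class name `SearchTermComposer` and the `ALTERNATIVES` map are hypothetical, and the original program may order or filter the results differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SearchTermComposer {
    // Alternatives per component, copied from the slide (b -> b, p; j -> j, ch; others unchanged).
    static final Map<String, List<String>> ALTERNATIVES = Map.of(
            "b", List.of("b", "p"),
            "ei", List.of("ei"),
            "j", List.of("j", "ch"),
            "ing", List.of("ing"));

    // Builds every spelling obtainable by choosing one alternative per component, in order.
    static List<String> compose(List<String> components) {
        List<String> terms = new ArrayList<>();
        terms.add("");
        for (String component : components) {
            // Components without an entry are kept unchanged (an assumption of this sketch).
            List<String> options = ALTERNATIVES.getOrDefault(component, List.of(component));
            List<String> next = new ArrayList<>();
            for (String prefix : terms) {
                for (String option : options) {
                    next.add(prefix + option);
                }
            }
            terms = next;
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [beijing, beiching, peijing, peiching]
        System.out.println(compose(List.of("b", "ei", "j", "ing")));
    }
}
```

The last term, “peiching”, is the alternate spelling that slide 3 wants the search to suggest.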

  17. Implementation • Java program • After composition, how does algorithm search? • Connects to Google via Google API (Application Programming Interface) • Google searches • 1-2 second delay (due to Google)

  18. Transliteration Patterns • Transliterations organized into table • {"üe", "yue", "yue", "ue", "ve", "üeh", "üeh", "üeh"} • lüe, lyue, lue, lve, lüeh • 3 transliteration systems; at most 5 patterns • First column Hànyŭ Pīnyīn like “ing” “b” “ei”
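The row shown on this slide has eight columns but only five distinct spellings, which is where “at most 5 patterns” comes from. A minimal sketch of collapsing a row into its distinct variants follows; the class name `PatternRow` and the method `variants` are hypothetical, and reading the eight columns as HP, TP1, TP2, MHP1, MHP2, WG1, WG2, WG3 is an assumption based on the order given on the next slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class PatternRow {
    // The row from the slide; assumed column order: HP, TP1, TP2, MHP1, MHP2, WG1, WG2, WG3.
    static final String[] UE_ROW = {"üe", "yue", "yue", "ue", "ve", "üeh", "üeh", "üeh"};

    // Attaches an initial to each column entry and drops repeated spellings, keeping column order.
    static List<String> variants(String initial, String[] row) {
        Set<String> distinct = new LinkedHashSet<>();
        for (String ending : row) {
            distinct.add(initial + ending);
        }
        return new ArrayList<>(distinct);
    }

    public static void main(String[] args) {
        // Expected to list the five spellings from the slide: lüe, lyue, lue, lve, lüeh
        System.out.println(variants("l", UE_ROW));
    }
}
```

Deduplicating before searching avoids sending the same query more than once when several system variants agree on a spelling.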

  19. Transliteration Systems By Column • Only 3 systems (in effect) • Hànyŭ Pīnyīn (HP) • Tōngyòng Pīnyīn #1 (TP1) & Tōngyòng Pīnyīn #2 (TP2) • Modified Hànyŭ Pīnyīn #1 (MHP1) & Modified Hànyŭ Pīnyīn #2 (MHP2) • Wade-Giles #1 (WG1), Wade-Giles #2 (WG2), & Wade-Giles #3 (WG3)

  20. Differences Between Transliteration System Variants • TP1- iu, ui, ‘ • TP2- iou, uei, - • WG2- h’ung (not hung) • WG3- ts’u (not tz’u) • WG1- szu (not ssu)

  21. Web version • http://www.translitsearch.com/demos/demos.htm

  22. Web search

  23. What is the effect? • Search for 130 Pinyin cities/regions • 16 – no other transliteration • 60 – at least two others • 6 – three or more • How much did Xiaozhi find? (8% more) • 5 min. 12 sec. – entire search

  24. Further work 1 • Include Yale, GR (Gwoyeu Romatzyh), &c. • YZSPY (Yóuzhèngshì Pīnyīn) • Accents • Hanja- and Kanji-based transliterations • Application to research archives

  25. Further Work 2 • Improvements in accuracy of transliteration • Search in other transliterations • Japanese version of current paper • Hindustani version • Romani with Indic cognates • Extension to translation (transliterated Mandarin-Cantonese characters)

  26. Solutions for Tibetan • Start with Wylie • Xiaozhi with adjustments • Dzongkha • Dzongkha-based variations? • Analysis of common transliteration patterns (usu. based on closest pronunciation)

  27. Solutions for Malayalam • Start with Maya (script conversion program) • Include minor variations from other script conversion programs • Analysis of transliterations used

  28. Solutions for Romani • Start with Vlax Romani Standard • Regional variations • Some transliterations easier to use on computers • e.g. chh, sh to omit hacek

  29. Conclusions • Enhances search by finding alternate transliterations • Applied to Mandarin • Applicable to other languages • Applicable to lesser-studied (& other) languages • Language- (or script-) specific
