Chinese Romanization for Chinese Voice Browsing IBM China Research Lab
Index • Motivations & Proposals • IPA vs. Chinese Romanization • Chinese Romanization Standards • Implementations of Chinese Romanization in SSML • Extensions for other languages
IBM Speech Synthesis System • The IBM speech synthesis system supports about 20 languages. • For Asian languages, we cover: • Mandarin, • Cantonese, • Korean, • Japanese, • Thai.
Pronunciation Annotations are important for Chinese • A Chinese character represents a meaning more than a pronunciation. • Homographs are very common among Chinese characters. • So it is very helpful if the pronunciation can be given explicitly.
Proposals • We propose to use Chinese Romanization to annotate Chinese pronunciation in the “phoneme” element. • We also propose that SSML support the predefined, widely used pronunciation annotation standards of different languages. • SSML would then be more easily accepted and used around the world. • Note: Chinese Romanization = Hanyu Pinyin in this PPT.
Comparison Rule: Goal of SSML • The goal of SSML is to “provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications”. • To reach this goal, more and more users of SSML, such as ordinary Web application developers, need to be able to learn and use SSML easily. • So we need to define SSML based on ordinary people’s knowledge and skills rather than professional linguists’ knowledge. • Otherwise, it will be a long time before SSML is widely accepted and used around the world.
IPA is not a good fit for Chinese • IPA tries to collect an exhaustive set of pronunciations for all kinds of languages. • As a result, it has become very complicated and difficult to input. • A well-educated Chinese adult cannot annotate Chinese pronunciation in IPA without special training. • IPA is not widely used in China. • Special linguistic phenomena in Chinese, such as tone and retroflexion, cannot be conveniently described in IPA.
Chinese Romanization is a good fit for Chinese • Chinese Romanization is designed specifically for Chinese rather than for all languages. • Adding ‘r’ at the end describes a “retroflex” syllable. • Adding a ‘tone’ attribute describes the tone. • Chinese Romanization is widely used and learned. • Chinese people learn Chinese Romanization in primary school. • Many foreigners begin learning Chinese through Chinese Romanization. • Chinese Romanization is widely used to input Chinese characters on computers. • The Chinese government has put a standard for Chinese Romanization into effect. • It applies to education, publishing, information processing and other related industries in China.
Chinese Romanization Standard • The writing rules of Chinese Romanization conform to the P.R.C. state standard “Basic rules for Hanyu Pinyin Orthography”, published by the CSBQTS* in 1996. • This orthography is based on the “Hanyu Pinyin Schema” published in 1958. • Following the naming convention for alphabets, we propose “x-CSBQTS-96” as the name of the Chinese Romanization alphabet. Alternatively, we propose “x-Pinyin-96”, which is easier to remember. * CSBQTS: China State Bureau of Quality and Technical Supervision
Hanyu Pinyin Schema (published in 1958) • Character Set: • 25 letters from ‘a’ to ‘z’ (all except ‘v’), plus ‘ü’. • (For ease of input on computers, ü is replaced by v.) • Initial Set: • b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, r, z, c, s • Final Set: • i, u, ü, a, ia, ua, o, uo, e, ie, üe, ai, uai, ei, uei, • ao, iao, ou, iou, an, ian, uan, üan, en, in, uen, ün, • ang, iang, uang, eng, ing, ueng, ong, iong • Tone Annotation: • mā, má, mǎ, mà, ma • Separator: ' • pi’ao
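The initial set above can be used to split a toneless syllable into its initial and the remainder. The following is a minimal Python sketch, not part of the original proposal: the function name `split_syllable` is hypothetical, and it assumes lower-case input without tone marks. Because ‘y’ and ‘w’ are not in the initial set, syllables such as “yi” are returned with an empty initial.

```python
# Initials taken from the Hanyu Pinyin Schema slide. Two-letter initials come
# first so that "zh", "ch", "sh" are matched before "z", "c", "s".
INITIALS = ["zh", "ch", "sh",
            "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]


def split_syllable(syllable: str) -> tuple[str, str]:
    """Split a toneless Pinyin syllable into (initial, remainder).

    Hypothetical helper for illustration only. The keyboard-friendly 'v'
    is mapped back to 'ü' before splitting.
    """
    s = syllable.lower().replace("v", "ü")
    for initial in INITIALS:
        # Require at least one character after the initial.
        if s.startswith(initial) and len(s) > len(initial):
            return initial, s[len(initial):]
    # No initial matched: a zero-initial syllable such as "an" or "yi".
    return "", s
```

For example, `split_syllable("zhong")` yields the initial `zh` and the remainder `ong`, while `split_syllable("nv")` yields `n` and `ü`.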
Basic rules for Hanyu Pinyin Orthography (published in 1996) 1. Words are the basic units for spelling the Chinese Common Language. (A space is used to separate words.) • rén (person/people), péngyou (friend[s]), túshūguǎn (library/libraries) • gōngrén hé nóngmín (workers and farmers) 2. Structures of two or three syllables that indicate a complete concept are linked: • quánguó (the whole nation), duìbuqǐ (sorry) 3. Terms with four or more syllables are separated if they can be divided into words; otherwise all the syllables are linked: • wúfèng gāngbǐ (seamless pen), Hóngshízìhuì (Red Cross)
Basic rules for Hanyu Pinyin Orthography (published in 1996) 4. Reduplicated monosyllabic words are linked, but reduplicated disyllabic words are separated: • rénrén (everybody), chángshi chángshi (give it a try) 5. In certain situations, a hyphen can be added to make the words easier to read and understand: • huán-bǎo (environmental protection), shíqī-bā suì (17 or 18 years old)
Implementation 1
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="zh-CN">
  <phoneme alphabet="x-CSBQTS-96" ph="duìbuqǐ"> 对不起 </phoneme>
  <!-- This is an example of Chinese Romanization using the standard tone marks -->
</speak>
Implementation 2
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="zh-CN">
  <phoneme alphabet="x-CSBQTS-96" ph="dui4bu0qi3"> 对不起 </phoneme>
  <!-- This is an example of Chinese Romanization using numbers to describe tone -->
</speak>
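Documents like the two implementations could also be generated programmatically. The following is a minimal Python sketch using the standard-library `xml.etree.ElementTree` module; the function name `build_ssml` and its default arguments are assumptions for illustration, and the alphabet name is the one proposed in this deck, not a registered value.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize SSML elements without a prefix (default namespace).
ET.register_namespace("", SSML_NS)


def build_ssml(text: str, ph: str,
               alphabet: str = "x-CSBQTS-96", lang: str = "zh-CN") -> str:
    """Wrap `text` in a <phoneme> element carrying the Pinyin string `ph`.

    Hypothetical helper: the alphabet name is the proposal from these
    slides and may not be accepted by any particular SSML processor.
    """
    speak = ET.Element(f"{{{SSML_NS}}}speak",
                       {"version": "1.0", f"{{{XML_NS}}}lang": lang})
    phoneme = ET.SubElement(speak, f"{{{SSML_NS}}}phoneme",
                            {"alphabet": alphabet, "ph": ph})
    phoneme.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("对不起", "dui4bu0qi3"))
```

Building the element tree rather than concatenating strings lets the library handle attribute escaping and namespace declarations.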
Comparison between the two implementations • Implementation 1: <phoneme alphabet="x-CSBQTS-96" ph="duìbuqǐ"> 对不起 </phoneme> • Implementation 2: <phoneme alphabet="x-CSBQTS-96" ph="dui4bu0qi3"> 对不起 </phoneme> • Note: "x-CSBQTS-96" may be replaced by "x-Pinyin-96"
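The two notations carry the same information, so the numbered form can be converted to the tone-marked form mechanically. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, tone digits 0 and 5 are both treated as the neutral tone, and the mark is placed with the common convention (on ‘a’ or ‘e’ if present, on the ‘o’ of “ou”, otherwise on the last vowel), which is not spelled out on these slides.

```python
import re
import unicodedata

# Combining diacritics for tones 1-4; the neutral tone takes no mark.
TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}
VOWELS = "aeiouü"


def mark_syllable(syllable: str, tone: int) -> str:
    """Return one syllable with its tone mark (hypothetical helper)."""
    s = syllable.replace("v", "ü")  # keyboard-friendly 'v' stands for 'ü'
    if tone not in TONE_MARKS:
        return s  # tone 0 or 5: neutral, no mark
    if "a" in s:
        idx = s.index("a")
    elif "e" in s:
        idx = s.index("e")
    elif "ou" in s:
        idx = s.index("o")
    else:
        positions = [i for i, ch in enumerate(s) if ch in VOWELS]
        if not positions:
            return s  # no vowel to carry the mark
        idx = positions[-1]
    marked = s[:idx + 1] + TONE_MARKS[tone] + s[idx + 1:]
    # Compose base letter and diacritic into a single code point.
    return unicodedata.normalize("NFC", marked)


def numbered_to_marked(ph: str) -> str:
    """Convert e.g. 'dui4bu0qi3' to the tone-marked spelling."""
    return re.sub(r"([a-zü]+)([0-5])",
                  lambda m: mark_syllable(m.group(1), int(m.group(2))),
                  ph)
```

The reverse direction is harder, because linked syllables in the tone-marked form first have to be segmented; the digits in the numbered form already act as syllable boundaries.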
Extension for Cantonese • The Linguistic Society of Hong Kong published a simple, easy-to-learn and easy-to-use “LSHK Cantonese Romanization Scheme” in 1993. • This scheme is widely adopted in various areas: education, Cantonese information processing, computer input methods, etc. • So we also propose to use “The LSHK Cantonese Romanization Scheme” to annotate Cantonese pronunciation.
Extension for more languages • Though it is possible to define a general standard to annotate the pronunciation of all languages, such a standard may become very complex to use. • Another way is to use the predefined and widely accepted pronunciation annotation standards of the different languages. • At the least, these diverse standards should be an important complement to the general standard.
Korean Romanization • It is used in our Korean speech synthesis system.
Japanese Romanization • Japanese: • まだ覚えているでしょう 波音に包まれて • Japanese Romanization: • mada oboeteiru deshou nami oto ni tsutsumarete • English meaning: • Do you remember being surrounded by the sound of the waves?
Discussion of “Word” • What is the definition of “Word” in Chinese? • Prosodic word or grammatical word? • 你来还是不来？nǐ lái háishi bù lái? • Is “不来” a word? • What is the difference between ‘Word’ & ‘break’? • Ambiguity can be resolved by adding a ‘break’. • Can Word information be handled by ‘Hanyu Pinyin Orthography’? • In ‘Hanyu Pinyin Orthography’, a space is used to separate words.