Dialectal Chinese Speech Recognition

Thomas Fang Zheng, Oct. 29, 2004. Workshop of KFIS (Korea Fuzzy Logic and Intelligent Systems Society), Oct. 29-30, 2004, Kyungnam Univ., Masan, Korea

Motivation • Chinese ASR faces a problem that is bigger than in any other language: dialect. • There are 8 major dialectal regions in addition to Mandarin (Northern China):- • Wu (Southern Jiangsu, Zhejiang, and Shanghai); • Yue (Guangdong, Hong Kong, Nanning Guangxi); • Min (Fujian, Shantou Guangdong, Haikou Hainan, Taipei Taiwan); • Hakka (Meixian Guangdong, Hsin-chu Taiwan); • Gan (Jiangxi); • Xiang (Hunan); • Hui (Anhui); • Jin (Shanxi, Hohhot Inner Mongolia). • These can be further divided into over 40 sub-categories.

Chinese dialects share the same written language:- • The same Chinese pinyin set (canonically), • The same Chinese character set (canonically), and • The same vocabulary (canonically). • Standard Chinese (known as Putonghua, or PTH) is widely spoken in most regions of China. • However, speech is strongly influenced by the native dialects; • Most Chinese people speak both standard Chinese and their own dialect, resulting in dialectal Chinese, that is, Putonghua influenced by the native dialect.

In dialectal Chinese:- • Word usage, pronunciation, and syntax and grammar vary depending on the speaker's dialect. • ASR relies to a great extent on the consistent pronunciation and usage of words within a language. • ASR systems constructed to process PTH perform poorly for the great majority of the population.

Project Goal • To develop a general framework for modeling, in dialectal Chinese ASR tasks :- • Phonetic variability, • Lexical variability, and • Pronunciation variability • To find suitable methods to modify the baseline PTH recognizer to obtain a dialectal Chinese recognizer for the specific dialect of interest, methods which employ :- • dialect-related knowledge (syllable mapping, cross-dialect synonyms, …), and • training/adaptation data (in relatively small quantities, or even none) • Expectation: the resulting recognizer should also work for PTH; in other words, it should work well on a mixture of PTH and dialectal Chinese.

Overview (diagram): Standard Chinese Speech Recognizer + Dialectal Chinese Related Knowledge & Resources → Dialectal Chinese Speech Recognition Framework → Dialectal Chinese Speech Recognizer

Dialectal Chinese Speech Recognition Framework (diagram), applied to the Standard Chinese Speech Recognizer • LM Adapter • AM Adapter • Acoustic Regulator • Dialect-Related Lexical Entry Replacement Rules • Toned-Syllable Mappings: Word-Independent/-Dependent • Pronunciation Modeling (PM) Techniques: Accents & Spontaneous Speech • Language Post-Processing Algorithms

This proposal was selected as one of three projects for the 2003 Johns Hopkins University Summer Workshop, from tens of proposals collected from universities and companies around the world; • Postponed to 2004 due to SARS; • For practical reasons, during the summer we focused on only one specific dialect, the Wu dialect (Shanghai area), and the target language was Wu dialectal Chinese (WDC for short).

Why Wu dialect (1)? • Population: more than 70 million people speak the Wu dialect, the 2nd most widely spoken dialect in China; • Economy: Shanghai is one of the most advanced cities in China.

Why Wu dialect (2)? • The Wu dialect is a fully developed language; • The syntax of the Wu dialect is very complex; • Its vocabulary is even larger than that of Mandarin; • Many literary masterpieces were influenced by the Wu dialect (historically).

Spoken language changing from dialect to standard Chinese

Useful Dialect-Related Knowledge • Chinese Syllable Mapping (CSM) • This CSM is dialect-related. • Two types: • Word-independent CSM: e.g. in Southern Chinese, Initial mappings include zh→z, ch→c, sh→s, n→l, and so on, and Final mappings include eng→en, ing→in, and so on; • Word-dependent CSM: e.g. in dialectal Chuan Chinese, the pinyin 'guo2' is changed into 'gui0' in the word '中国 (China)' but only the tone is changed in the word '过去 (past)'.

The CSM is not exact. For any mapping A→B, the resulting pronunciation is usually not exactly B, but something quite similar to B, and more similar to B than to any other syllable. • The CSM could be N→1, 1→N, or crossed. • Bi is a variation of B, such as :- • nasalization, centralization, voiced, voiceless, rounding, syllabic, pharyngealization, aspiration
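The two CSM types can be illustrated with a minimal Python sketch. The rule tables below only contain the examples named on these slides; the helper names (`split_syllable`, `map_syllable`) and the simple pinyin splitting are assumptions made for illustration, not part of the original system, and the sketch ignores the fact that the real mapping is inexact.

```python
# Word-independent rules (examples from the slides): canonical -> dialectal.
INITIAL_MAP = {"zh": "z", "ch": "c", "sh": "s", "n": "l"}
FINAL_MAP = {"eng": "en", "ing": "in"}

# Word-dependent rules, keyed by (word, canonical toned syllable).
WORD_DEPENDENT = {("中国", "guo2"): "gui0"}

def split_syllable(syl):
    """Split a toned pinyin syllable into (initial, final, tone). Simplified."""
    tone = syl[-1] if syl[-1].isdigit() else ""
    body = syl[:-1] if tone else syl
    for ini in ("zh", "ch", "sh"):
        if body.startswith(ini):
            return ini, body[2:], tone
    if body[0] in "bpmfdtnlgkhjqxrzcsyw":
        return body[0], body[1:], tone
    return "", body, tone  # zero-initial syllable

def map_syllable(word, syl):
    """Apply a word-dependent rule if one exists, else the word-independent ones."""
    if (word, syl) in WORD_DEPENDENT:
        return WORD_DEPENDENT[(word, syl)]
    ini, fin, tone = split_syllable(syl)
    return INITIAL_MAP.get(ini, ini) + FINAL_MAP.get(fin, fin) + tone

print(map_syllable("中国", "zhong1"))  # word-independent rule zh -> z
print(map_syllable("中国", "guo2"))    # word-dependent rule
```

Checking the word-dependent table first reflects the idea that a word-specific mapping overrides the general one; that precedence is itself an assumption.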

Lexicon: • Linguists say the vocabulary similarity rate between PTH and the Wu dialect is about 60~70%.

A dialect-related lexicon contains two parts :- • a common part shared by standard Chinese and most dialectal Chinese languages (over 50k words), and • a dialect-related part (several hundred words). • And in this lexicon :- • each word has one pinyin string for the standard Chinese pronunciation and a kind of representation for the dialectal Chinese pronunciation, and • each of the dialect-related words corresponds to a word in the common part with the same meaning
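A two-part lexicon of this kind could be represented as in the following Python sketch. The class and field names are hypothetical, the two entries are toy examples, and `<wu-pron>` is a placeholder because the slides do not specify the dialectal pronunciation representation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LexiconEntry:
    word: str
    pth_pron: str                      # pinyin string, standard Chinese pronunciation
    dialect_pron: str                  # placeholder for the dialectal representation
    pth_synonym: Optional[str] = None  # set only for dialect-related words

# Common part (over 50k words in the real lexicon); one toy entry here.
common = {
    "做饭": LexiconEntry("做饭", "zuo4 fan4", "<wu-pron>"),
}
# Dialect-related part (several hundred words); each entry points into the common part.
dialect_related = {
    "烧饭": LexiconEntry("烧饭", "shao1 fan4", "<wu-pron>", pth_synonym="做饭"),
}

def common_equivalent(word):
    """Return the common-part word with the same meaning, or the word itself."""
    entry = dialect_related.get(word)
    return entry.pth_synonym if entry else word

# Consistency check: every dialect-related word maps to an existing common-part word.
consistent = all(e.pth_synonym in common for e in dialect_related.values())
```

Keeping the synonym link on the dialect-related entry makes the "same meaning" constraint checkable in one pass.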

Language • Though it is difficult to collect dialect texts, dialect-related lexical entry replacement rules could be learned in advance, and therefore • The language post-processing or language model adaptation techniques could be adopted.

• Dialectal words substitute for some words: 我做饭给你吃 (PTH) → 我烧饭给你吃 (Wu), "I'll cook for you" • Word-order changes: 你先走 (PTH) → 你走先 (Wu), "You go first"
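The two kinds of variation in these examples (word substitution and word-order change) suggest a simple rule-based language post-processing step. The Python sketch below is only an illustration under stated assumptions: input is already segmented into words, the rule tables hold just the two examples from the slide, and the function name `normalize_to_pth` is hypothetical.

```python
# Hypothetical replacement rules: Wu dialectal word -> PTH synonym.
SUBSTITUTIONS = {"烧饭": "做饭"}
# Hypothetical reordering rules on adjacent word pairs: Wu order -> PTH order.
REORDERINGS = {("走", "先"): ("先", "走")}

def normalize_to_pth(words):
    """Map a segmented dialectal word sequence to its PTH-style equivalent."""
    out = [SUBSTITUTIONS.get(w, w) for w in words]
    i = 0
    while i < len(out) - 1:
        pair = (out[i], out[i + 1])
        if pair in REORDERINGS:
            out[i], out[i + 1] = REORDERINGS[pair]
            i += 2  # skip past the pair just rewritten
        else:
            i += 1
    return out

print(normalize_to_pth(["我", "烧饭", "给", "你", "吃"]))
print(normalize_to_pth(["你", "走", "先"]))
```

The same rule tables could be applied in the opposite direction to rewrite PTH text into dialect-style text for language model adaptation; that use is a suggestion, not something the slides describe in detail.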

Our focus

Data Creation for WDC (pre-workshop work; diagram) • Tasks: e-Dictionary, IF & Syllable Set Definition, Database Collection, Speech Transcription • e-Dictionary labels: PTH Words, Wu Dialect Words, PTH Pron., Wu Dialect Pron., PTH Synonym • Database collection labels: Read Speech (PTH Words Only; PTH + Wu Words), Spontaneous Speech (Topics) • Transcription labels: C-Chars, Syllables, IFs/GIFs, Misc Info • IF: a Chinese Initial or Final; GIF: generalized IF; PTH: Putonghua (standard Chinese); WDC: Wu Dialectal Chinese

Wu Dialectal Chinese (WDC) Database Collection (1) • Collection: • 11 hours in total, about half read (R) and half spontaneous (S): • 100 Shanghai speakers * (3R + 3S) minutes / speaker • 10 Beijing speakers * 6S minutes / speaker • Read speech with well-balanced prompting sentences; • Type I: each sentence contains PTH words only (5-6k) • Type II: each sentence contains one or two of the most commonly used Wu dialectal words, while the others are PTH words • Spontaneous speech with pre-defined talking topics; • Conversations with a PTH speaker on a self-selected topic from: sports, policy/economy, entertainment, lifestyles, technology • Balanced speakers (gender, age, education, PTH level, …)
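As a quick check of the collection figures, the per-speaker minutes on this slide add up to the stated 11 hours. The variable names in this small Python calculation are illustrative only.

```python
# Minutes of speech, from the per-speaker figures on the slide.
shanghai_read = 100 * 3         # 100 Shanghai speakers, 3 minutes read each
shanghai_spontaneous = 100 * 3  # 100 Shanghai speakers, 3 minutes spontaneous each
beijing_spontaneous = 10 * 6    # 10 Beijing speakers, 6 minutes spontaneous each

total_hours = (shanghai_read + shanghai_spontaneous + beijing_spontaneous) / 60
read_hours = shanghai_read / 60
spontaneous_hours = (shanghai_spontaneous + beijing_spontaneous) / 60

print(total_hours, read_hours, spontaneous_hours)
```

So the split is 5 hours read and 6 hours spontaneous, which is what "about half and half" amounts to.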

WDC Data Diversity (figure: goal vs. actual)

Accent Assessment by experts • 1A. State-level radio broadcaster; 1B. Province-level radio broadcaster; • 2A. Quite good; 2B. Less accented; • 3A. More accented; 3B. Hard to understand, but recognizable as PTH

Accent Assessment according to age

Accent Assessment according to education level

Accent Assessment according to gender

Wu Dialectal Chinese (WDC) Database Collection (2) • Transcriptions include:- • For 100 Wu Dialectal Chinese speakers:- • Canonical Chinese Initial/Final labels, and • Generalized IF (GIF) labels. • For 10 Beijing speakers:- • Chinese character and pinyin transcriptions only

Dialectal Lexicon Construction • Establish a 50k-word electronic dialect dictionary with each word having :- • PTH pronunciation in PTH IF string • Wu dialect pronunciation in Wu IF string • Purpose: summarizing Dialect-Related Knowledge • Figure out Chinese syllable mappings:- • Same written form (character), different pronunciations; • Both word-independent and word-dependent; • Find dialect-related word variations:- • Same meanings in Chinese language; • Different written forms (character); • Uttered in standard Chinese manner; • For LM adaptation/modification

e-Dictionary Word Examples

Workshop Experiments • Experiment Conditions (1): • Using HTK 3.2.1 (latest version downloadable on web); • Data Set Division: • Using spontaneous speech data only • Data were split according to age (younger, older), education (higher, lower), and PTH level into • Train Set: 80 speakers • devTest Set: 20 speakers (a part of devTrain) • Test Set: 20 speakers

Experiment Conditions (2): • Acoustic model: • Trained from Mandarin Broadcast News (MBN); • 39 dimensional MFCC_E_D_A_Z; • diagonal covariance matrix; • 4 states per unit; • 103,041 units (triIF), 10,641 real units (triIF); • 3,063 different states (after state tying); • 16 mixtures per state, 28 mixtures per state for silence unit; • Language model: • Built on HKUST 100 hour CTS data, plus Hub5, plus Wu-Accented Training Data Transcriptions

Observation on WDC Data • IF-mapping / Syllable-mapping: • Influenced by the Wu dialect, a Wu dialectal Chinese (WDC) speaker often pronounces any of a certain set of IFs as another IF, following rules such as zh -> z, ch -> c, sh -> s, and so on. • Observations on three sets - Train (80 speakers), devTest (20), and Test (20): • Mapping pairs are almost the same across all three sets; • Mapping pairs are almost identical to experts' knowledge; • Mapping probabilities are also almost equal; • Remarks: • Experts' knowledge could be useful; • Mapping rules can be learned from a small amount of data.
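Mapping pairs and mapping probabilities of this kind could be estimated by counting, as in the Python sketch below. It assumes the canonical IF labels and the surface (GIF) labels have already been aligned one-to-one; the function name and the toy counts are made up for illustration and are not results from the WDC data.

```python
from collections import Counter, defaultdict

def mapping_probabilities(aligned_pairs):
    """Estimate P(surface IF | canonical IF) from aligned (canonical, surface) pairs."""
    counts = defaultdict(Counter)
    for canonical, surface in aligned_pairs:
        counts[canonical][surface] += 1
    return {
        canonical: {s: n / sum(c.values()) for s, n in c.items()}
        for canonical, c in counts.items()
    }

# Toy alignment (invented counts): "zh" realized as "z" three times out of four.
toy_pairs = [("zh", "z")] * 3 + [("zh", "zh")] + [("s", "s")] * 2
probs = mapping_probabilities(toy_pairs)
print(probs["zh"])
```

Running the same estimate separately on each data set and comparing the tables is one way to make the "almost the same across all three sets" observation concrete.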

Using only devTest set + dialect-based knowledge • Step 1: Apply PTH-IF mapping rules; • Step 2: Apply WDC-IF mapping rules; • Step 3: Apply syllable-dependent mapping rules; • Step 4: Perform multi-pronunciation expansion (MPE) based on unigram probability.

Why try this method? • "IF-mapping" in dialectal Chinese is a fact (humans use it); • "In-domain data training" will surely give a good result, but collecting data is a huge task, especially for the 40 sub-dialects of Chinese; • "Mere adaptation" will be easier and better, but might make it hard to distinguish the mapping pairs, since each pair tends to merge into a single IF; • This is not practical in applications such as call centers, where no further information about the speakers is available and a mixture of WDC and PTH is used; • It is expected that a knowledge-based method would give good overall performance for both WDC and PTH.

Step 1: Applying PTH-IF mapping rules • Rules are based on experts' knowledge (with AM unchanged) • (zh, z) (z, zh) • (ch, c) (c, ch) • (sh, s) (s, sh) • (eng, en) (en, eng) • (ing, in) (in, ing) • (r, l) • Gain not so significant: 0.5% Chinese Character Error Rate (CER) reduction • Pronunciation entry probability does not help improve performance
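With the acoustic model unchanged, rules like these act on the pronunciation lexicon. The Python sketch below shows one way to generate pronunciation variants from the rule pairs on this slide; representing a pronunciation as a list of IFs and expanding every combination are assumptions of this sketch, not a description of the workshop system.

```python
from itertools import product

# Rule pairs from the slide: (canonical IF, alternative IF).
RULES = [("zh", "z"), ("z", "zh"), ("ch", "c"), ("c", "ch"),
         ("sh", "s"), ("s", "sh"), ("eng", "en"), ("en", "eng"),
         ("ing", "in"), ("in", "ing"), ("r", "l")]

# For each IF, the list of allowed realizations, canonical form first.
ALTERNATIVES = {}
for src, dst in RULES:
    ALTERNATIVES.setdefault(src, [src]).append(dst)

def expand(pron):
    """Return all pronunciation variants of an IF sequence, canonical first."""
    options = [ALTERNATIVES.get(unit, [unit]) for unit in pron]
    return [list(variant) for variant in product(*options)]

variants = expand(["sh", "eng"])
print(variants)
```

Full expansion grows quickly with word length, which is one motivation for limiting which words receive multiple pronunciations.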

Step 2: Applying WDC-IF mapping rules • There are indeed some WDC-specific IFs, such as iao -> io^; • Rules learned from devTest • Newly introduced WDC-specific IFs trained from devTest using the adaptation method • 8.66% absolute CER reduction • MLLR adaptation outperforms MLLR+MAP • About 10% difference • Possibly due to the small amount of data • We refer to this as surface-form (WDC) MLLR adaptation; for comparison, base-form (PTH) MLLR adaptation, where only canonical IFs are used, is also evaluated.

Step 3: Apply syllable-dependent mapping rules • Assumption: most IF-mappings are context-independent, but some are syllable-dependent (such as iii|(sh iii) -> ii|(s ii)), and we believe there are others • Rules learned from devTest • We did not succeed in improving the accuracy; on the contrary, the character accuracy dropped by about 6% • We do not have a clear explanation yet • So we keep using context-free mapping rules
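The difference between a context-free and a syllable-dependent rule can be sketched in Python as follows. Only the (sh iii) -> (s ii) example comes from the slide; the table names, the function name, and the choice to let syllable-dependent rules take precedence are assumptions for illustration.

```python
# Context-free IF rules (apply regardless of the rest of the syllable).
CONTEXT_FREE = {"zh": "z", "ch": "c", "sh": "s"}
# Syllable-dependent rules, keyed by the whole (initial, final) syllable.
SYLLABLE_DEPENDENT = {("sh", "iii"): ("s", "ii")}

def surface_form(initial, final):
    """Return the (initial, final) surface form for one canonical syllable."""
    if (initial, final) in SYLLABLE_DEPENDENT:
        return SYLLABLE_DEPENDENT[(initial, final)]
    return CONTEXT_FREE.get(initial, initial), final

print(surface_form("sh", "iii"))  # the final changes only in this syllable
print(surface_form("sh", "an"))   # context-free rule, final unchanged
```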

Step 4: Multi-pronunciation expansion (MPE) based on unigram probability • Motivation: more pronunciations help model pronunciation variations, but lead to more confusion, so there should be a tradeoff; • Accumulated unigram probability (AccProb) is used as the criterion • Only words with higher unigram probabilities will each have multiple pronunciations; • Words with lower unigram probabilities will each have a single standard pronunciation;
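The AccProb criterion can be read as: sort words by unigram probability, and expand words from the top until the accumulated probability reaches the threshold. The Python sketch below follows that reading, which is an interpretation of the slide rather than the exact workshop procedure; the function name and the toy counts are invented.

```python
def select_words_for_expansion(unigram_counts, acc_prob):
    """Pick the most frequent words whose accumulated unigram mass reaches acc_prob.

    unigram_counts: dict word -> count (or probability)
    acc_prob: threshold in [0, 1]; 0 means no expansion, 1 means full expansion
    """
    total = sum(unigram_counts.values())
    selected, accumulated = set(), 0
    ranked = sorted(unigram_counts.items(), key=lambda kv: kv[1], reverse=True)
    for word, count in ranked:
        if accumulated >= acc_prob * total:
            break  # threshold reached: remaining words keep one pronunciation
        selected.add(word)
        accumulated += count
    return selected

# Toy unigram counts (invented for illustration).
toy_counts = {"的": 50, "我": 30, "烧饭": 15, "先": 5}
print(select_words_for_expansion(toy_counts, 0.8))
```

Words in the returned set would receive multiple pronunciations; all other words keep their single standard pronunciation.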

Base-form MLLR + PTH-IF mapping + MPE (CER; figure) • Best result achieved at a suitable AccProb value, say 94%, with VocSizeRatio=1.10 • AccProb: 0% means no multiple-pronunciation expansion, while 100% means full expansion

Surface-form MLLR + WDC-IF mapping + MPE (CER; figure) • Best result achieved at a suitable AccProb value, say 94%, with VocSizeRatio=1.24 • AccProb: 0% means no multiple-pronunciation expansion, while 100% means full expansion

Base-form MLLR + PTH-IF mapping + MPE vs. Surface-form MLLR + WDC-IF mapping + MPE (CER; figure) • Best result achieved at a suitable AccProb value, say 94%, with VocSizeRatio=1.24 • AccProb: 0% means no multiple-pronunciation expansion, while 100% means full expansion

Performance improvement comparison: overall, and in terms of speaker clusters

Q: How about recognizing PTH using the resulting WDC recognizer? • We obtain the WDC recognizer from the PTH recognizer; • We get a CER reduction of over 10% on average when recognizing WDC; • How about using it to recognize PTH?

Diagram: handling the sh/s mapping pair • Conventional method: adaptation • Our method: mapping rule + MPE

We can expect that performance will degrade when the WDC recognizer is used to recognize PTH; • But we expect that it will not degrade too much; • Results: using the WDC recognizer, we get • Over 10% CER reduction when recognizing WDC; • 0.62% CER increase when recognizing PTH.
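The results here are reported as Chinese Character Error Rate (CER). For reference, CER is conventionally computed as the character-level edit distance between reference and hypothesis divided by the reference length; the Python sketch below implements that standard definition and is not taken from the workshop scoring tools.

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein distance / number of reference characters."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        cur = [i]
        for j, h in enumerate(hypothesis, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(reference)

# One substituted character out of six.
print(cer("我做饭给你吃", "我烧饭给你吃"))
```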

Discussions • Using dialect-related knowledge is effective • In this project, there are several problems to solve: channel, speaking-style, dialect background, and domain problems. • It is easier to solve all these problems by simply using the adaptation method; • Our method focuses only on the dialect problem; • The results of our method could be better if we integrated methods that address channel and speaking style.

Future Plan • Continue on the current project, including: • Investigating the syllable-dependent mapping; • Rank-based Rescoring; • Language Model Adaptation;

Rank-based AM Rescoring • Assumption: when the recognizer derived from the PTH one is used to recognize WDC speech, the ranks in the lattice have a relatively stable distribution

• Generate a lattice ("SIL" marks pauses) for each sentence in devTest • Turn the lattice into a multiple alignment ("-" marks deletions); arc information in the lattice is retained for later back-tracking (Lidia Mangu et al. [1999])
