
Building A Highly Accurate Mandarin Speech Recognizer


Presentation Transcript


  1. Building A Highly Accurate Mandarin Speech Recognizer Mei-Yuh Hwang, Gang Peng, Wen Wang (SRI), Arlo Faria (ICSI), Aaron Heidel (NTU), Mari Ostendorf 12/12/2007

  2. Outline • Goal: A highly accurate Mandarin ASR • Background • Acoustic segmentation • Acoustic models and adaptation • Language models and adaptation • Cross adaptation • System combination • Error analysis → Future

  3. Background • 870 hours of acoustic training data. • N-gram-based (N=1) maximum-likelihood (ML) Chinese word segmentation. • 60K-word lexicon. • 1.2G words of training text; trigrams and 4-grams.
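
The slides do not show the segmenter itself, so here is a minimal sketch of what an N=1 (unigram) maximum-likelihood word segmenter can look like: a Viterbi-style dynamic program over a word lexicon. The lexicon, the probability values, and the OOV fallback are illustrative assumptions, not the actual SRI tooling.

    import math

    # Hypothetical unigram lexicon (word -> probability); purely illustrative values.
    LEXICON = {"中国": 0.002, "中": 0.001, "国": 0.001, "人民": 0.0015, "人": 0.002, "民": 0.0005}
    MAX_WORD_LEN = 4
    OOV_LOGPROB = math.log(1e-9)   # penalty for falling back to a single unknown character

    def segment(sentence):
        """Return the word sequence that maximizes the unigram log-likelihood."""
        n = len(sentence)
        best = [0.0] + [float("-inf")] * n    # best[i]: best score of sentence[:i]
        back = [0] * (n + 1)                  # back[i]: start index of the last word
        for i in range(1, n + 1):
            for j in range(max(0, i - MAX_WORD_LEN), i):
                word = sentence[j:i]
                if word in LEXICON:
                    logp = math.log(LEXICON[word])
                elif len(word) == 1:
                    logp = OOV_LOGPROB        # unknown single character
                else:
                    continue
                if best[j] + logp > best[i]:
                    best[i], back[i] = best[j] + logp, j
        words, i = [], n
        while i > 0:                          # trace back the best segmentation
            words.append(sentence[back[i]:i])
            i = back[i]
        return list(reversed(words))

    print(segment("中国人民"))   # ['中国', '人民']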

  4. Acoustic Segmentation [Figure: segmentation HMM with Start/null and End/null nodes connecting speech, silence, and noise states] • The former segmenter caused high deletion errors: it misclassified some speech segments as noise. • Minimum speech-segment duration: 18 frames × 30 ms = 540 ms ≈ 0.5 s.
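
One simple way to realize a minimum speech duration such as 18 frames × 30 ms is a post-pass over the frame-level speech/non-speech decisions that relabels any speech run shorter than the minimum; the sketch below takes that view (the constraint can also be built directly into the HMM topology). The frame rate and label strings are assumptions for illustration.

    MIN_SPEECH_FRAMES = 18            # 18 frames * 30 ms/frame = 540 ms, per the slide

    def enforce_min_speech(labels):
        """Relabel speech runs shorter than MIN_SPEECH_FRAMES as noise.
        labels: per-frame strings such as 'speech', 'silence', 'noise'."""
        out = list(labels)
        i = 0
        while i < len(out):
            if out[i] != "speech":
                i += 1
                continue
            j = i
            while j < len(out) and out[j] == "speech":
                j += 1
            if j - i < MIN_SPEECH_FRAMES:           # too short to be a real speech segment
                out[i:j] = ["noise"] * (j - i)
            i = j
        return out

    # A 10-frame (300 ms) burst is relabeled; a 20-frame (600 ms) segment survives.
    demo = ["silence"] * 5 + ["speech"] * 10 + ["silence"] * 5 + ["speech"] * 20
    print(enforce_min_speech(demo).count("speech"))   # 20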

  5. New Acoustic Segmenter [Figure: updated segmentation HMM with Start/null and End/null nodes connecting silence, noise, Foreign, Mandarin 1, and Mandarin 2 states] • Allow shorter speech duration. • Model Mandarin vs. Foreign (English) separately.

  6. Two Sets of Acoustic Models • For cross adaptation and system combo • Different error behaviors • Similar error rate performance

  7. MLP Phoneme Posterior Features • Compute Tandem features with pitch+PLP input. • Compute HATs features with 19 critical bands. • Combine the Tandem and HATs posterior vectors into one. • Log, then PCA: 71 → 32 dimensions. • MFCC + pitch + MLP = 74-dim. • 3500x128 Gaussians, MPE trained. • Both cross-word (CW) and nonCW triphones trained.
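
As a rough sketch of the pipeline on this slide: the two 71-dimensional posterior vectors are combined into one, log-compressed, reduced to 32 dimensions with PCA, and appended to a 42-dimensional MFCC+pitch vector, giving 74 dimensions per frame. The combination rule (a plain average here) and the use of scikit-learn's PCA are assumptions made only to illustrate the shapes.

    import numpy as np
    from sklearn.decomposition import PCA

    T = 1000                                          # number of frames
    tandem = np.random.dirichlet(np.ones(71), T)      # stand-in Tandem posteriors
    hats = np.random.dirichlet(np.ones(71), T)        # stand-in HATs posteriors
    mfcc_pitch = np.random.randn(T, 42)               # stand-in MFCC + pitch features

    posterior = 0.5 * (tandem + hats)                 # combine the two posterior vectors (assumed rule)
    log_post = np.log(posterior + 1e-10)              # log compression
    mlp_feat = PCA(n_components=32).fit_transform(log_post)   # 71 -> 32 dimensions

    features = np.hstack([mfcc_pitch, mlp_feat])      # 42 + 32 = 74-dim observation per frame
    print(features.shape)                             # (1000, 74)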

  8. Tandem Features [T1, T2, …, T71] [Figure: MLP of size (42x9) x 15000 x 71, with PLP (39x9) and pitch (3x9) inputs] • Input: 9 frames of PLP + pitch.
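
The MLP input here is a 9-frame window of 42-dimensional PLP+pitch vectors, i.e. 42 × 9 = 378 inputs per frame. Below is one way such stacked inputs can be built; padding the edges by repeating the first and last frames is an assumption.

    import numpy as np

    def stack_frames(feats, context=9):
        """Stack `context` consecutive frames (centered) into one vector per frame."""
        half = context // 2
        padded = np.vstack([feats[:1]] * half + [feats] + [feats[-1:]] * half)
        return np.hstack([padded[i:i + len(feats)] for i in range(context)])

    plp_pitch = np.random.randn(500, 42)    # 39-dim PLP + 3-dim pitch per frame
    mlp_input = stack_frames(plp_pitch)     # one 42*9 = 378-dim input vector per frame
    print(mlp_input.shape)                  # (500, 378)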

  9. HATs Features [H1, H2, …, H71] [Figure: per-band MLPs of size 51x60x71 over critical-band energies E1 … E19, and a merger MLP of size (60*19)x8000x71]
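
HATs (Hidden Activation TRAPS) uses one small MLP per critical band over a long temporal window and merges the per-band hidden activations in a second MLP. The forward pass below only illustrates the shapes implied by the figure (19 bands, 51-frame trajectories, 60 hidden units per band, an 8000-unit merger, 71 phone posteriors); the random weights and activation choices are assumptions.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    BANDS, WINDOW, HIDDEN, MERGER_HIDDEN, PHONES = 19, 51, 60, 8000, 71

    # Per-band nets (51 -> 60) and merger net ((60*19) -> 8000 -> 71); random weights for illustration.
    band_W = [np.random.randn(WINDOW, HIDDEN) * 0.01 for _ in range(BANDS)]
    merger_W1 = np.random.randn(BANDS * HIDDEN, MERGER_HIDDEN) * 0.01
    merger_W2 = np.random.randn(MERGER_HIDDEN, PHONES) * 0.01

    def hats_posteriors(energy_trajectories):
        """energy_trajectories: (19, 51) critical-band log energies around one frame."""
        hidden_acts = np.concatenate(
            [sigmoid(energy_trajectories[b] @ band_W[b]) for b in range(BANDS)])
        logits = sigmoid(hidden_acts @ merger_W1) @ merger_W2
        p = np.exp(logits - logits.max())
        return p / p.sum()                   # 71-dim phone posterior vector

    print(hats_posteriors(np.random.randn(BANDS, WINDOW)).shape)   # (71,)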

  10. Phone-81: Diphthongs for BC • Add diphthongs (4x4=16) for fast speech and for modeling longer triphone context. • Maintain unique syllabification. • Syllable-ending W and Y are no longer needed.

  11. Phone-81: Frequent Neutral Tones for BC • Neutral tones are more common in conversation. • Neutral tones were not modeled before; the 3rd tone was used as a replacement. • Add 3 neutral tones for frequent characters.

  12. Phone-81: Special CI Phones for BC • Filled pauses (hmm, ah) are common in BC; add two context-independent (CI) phones for them. • Add CI /V/ for English.

  13. Phone-81: Simplification of Other Phones • Now 72+14+3+3=92 phones: too many triphones to model. • Merge similar phones to reduce the number of triphones (e.g., I2 was modeled by I1, now i2). • 92 – (4x3–1) = 81 phones.

  14. PLP Models with fMPE Transform • PLP model with fMPE transform, to compete with the MLP model. • Smaller ML-trained Gaussian posterior model: 3500x32, CW+SAT. • 5 neighboring frames of Gaussian posteriors. • M is 42 x (3500*32*5); h is (3500*32*5) x 1. • Ref: Zheng, ICASSP 2007 paper.
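
The dimensions on this slide match the standard fMPE formulation (cf. the cited Zheng, ICASSP 2007 paper), in which each frame's feature vector is offset by a learned projection of a high-dimensional posterior vector; a sketch in our own notation, not the slide's:

    \[
      y_t \;=\; x_t + M\,h_t, \qquad
      x_t,\, y_t \in \mathbb{R}^{42}, \quad
      M \in \mathbb{R}^{42 \times (3500 \cdot 32 \cdot 5)}, \quad
      h_t \in \mathbb{R}^{(3500 \cdot 32 \cdot 5) \times 1}
    \]

Here h_t stacks the posteriors of the 3500x32 Gaussians over the 5 neighboring frames, and M is trained discriminatively (MPE).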

  15. Topic-based LM Adaptation [Figure: Latent Dirichlet Allocation topic model applied to one sentence, using the words {w | w in the same story (4 secs)}; topic weights q, prior q0] • The 4 s window is used to make adaptation more robust against ASR errors. • The words {w} are weighted based on distance.

  16. Topic-based LM Adaptation [Figure: Latent Dirichlet Allocation topic model applied to one sentence] • Training: one topic per sentence; train 64 topic-dependent 4-gram LMs: LM1, LM2, …, LM64. • Decoding: use the top n topics per sentence, where qi' > threshold.

  17. Improved Acoustic Segmentation (pruned trigram, SI nonCW-MLP MPE, on eval06)

  18. Different Phone Sets (pruned trigram, SI nonCW-PLP ML, on dev07) • Indeed different error behaviors, which is good for system combination.

  19. Decoding Architecture [Figure: decoding flow comprising an Aachen system, an MLP nonCW pass with qLM3, a PLP CW+SAT+fMPE pass with MLLR and LM3, an MLP CW+SAT pass with MLLR and LM3, qLM4 adapt/rescore steps, and a final Confusion Network Combination]

  20. Topic-based LM Adaptation (NTU) • Training, per sentence: • 64 topics: q = (q1, q2, …, qm). • Topic(sentence) = k = argmax {q1, q2, …, qm}. • Train 64 topic-dependent (TD) 4-grams. • Testing, per utterance: • {w}: N-best confidence-based weighting + distance weighting. • Pick all TD 4-grams whose qi is above a threshold. • Interpolate with the topic-independent 4-gram. • Rescore the N-best list.
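
A minimal sketch of the decoding-time procedure just described, assuming each LM object exposes a sentence log-probability method: topic-dependent 4-grams whose topic weight exceeds the threshold are linearly interpolated with the topic-independent 4-gram, and the mixture rescores the N-best list. All names, the .logprob() API, and the interpolation weights are illustrative assumptions.

    import math

    def rescore_nbest(nbest, q, topic_lms, background_lm,
                      threshold=0.05, lm_scale=1.0, lam_bg=0.5):
        """nbest: list of (hypothesis, asr_score); q: per-utterance topic weights (len 64);
        topic_lms / background_lm: objects with a .logprob(hypothesis) method (assumed API)."""
        chosen = [(i, qi) for i, qi in enumerate(q) if qi > threshold]
        if not chosen:                                 # fall back to the single best topic
            chosen = [max(enumerate(q), key=lambda x: x[1])]
        total = sum(qi for _, qi in chosen)
        rescored = []
        for hyp, asr_score in nbest:
            # Linear interpolation of the topic-independent and selected topic-dependent LMs.
            p = lam_bg * math.exp(background_lm.logprob(hyp))
            for i, qi in chosen:
                p += (1.0 - lam_bg) * (qi / total) * math.exp(topic_lms[i].logprob(hyp))
            rescored.append((hyp, asr_score + lm_scale * math.log(p)))
        return max(rescored, key=lambda x: x[1])[0]    # best hypothesis after rescoring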

  21. CERs with different LMs (internal use)

  22. Topic-based LM Adaptation (NTU) • "q" denotes "quick", i.e., tightly pruned. • Oracle CNC: 4.7%. Could the oracle path be a broken word sequence? This needs verification with word perplexity and HTER.

  23. 2006 ASR System vs. 2007: CER on Eval07 • 37% relative improvement!

  24. Eval07 BN ASR Error Distribution [Figure: CER (%) vs. % of snippets for the 66 BN snippets, SRI system; average CER 3.4%]

  25. Eval07 BC ASR Error Distribution [Figure: CER (%) vs. % of snippets for the 53 BC snippets, SRI system; average CER 15.9%]

  26. What Worked for Mandarin ASR? • MLP features • MPE • CW+SAT • fMPE • Improved acoustic segmentation, particularly for deletion errors. • CNC Rover.
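
The CNC/ROVER step above boils down to confidence-weighted voting once the system outputs have been aligned into confusion bins (the alignment itself is the hard part and is not shown here). A toy sketch, with made-up bins and system names:

    from collections import defaultdict

    def vote(bins, system_weights=None):
        """bins: list of confusion bins; each bin is a list of (system, word, confidence).
        Use word None for a deletion. Returns the consensus word sequence."""
        consensus = []
        for b in bins:
            scores = defaultdict(float)
            for system, word, conf in b:
                w = 1.0 if system_weights is None else system_weights[system]
                scores[word] += w * conf
            best = max(scores, key=scores.get)
            if best is not None:              # None means the bin votes for a deletion
                consensus.append(best)
        return consensus

    bins = [
        [("mlp", "升值", 0.6), ("plp", "甚至", 0.5), ("aachen", "升值", 0.7)],
        [("mlp", None, 0.4), ("plp", "了", 0.3), ("aachen", None, 0.5)],
    ]
    print(vote(bins))    # ['升值']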

  27. Small Help for ASR • Topic-dependent LM adaptation. • Outside regions for additional AM adaptation data. • A new phone set with diphthongs, to offer different error behaviors. • Pitch input in tandem features. • Cross adaptation with Aachen → a successful collaboration among 5 team members from 3 continents.

  28. Error Analysis on Extreme Cases • CER is not directly related to HTER; genre matters. • But better CER does ease MT.

  29. Error Analysis • (a) worst BN: OOV names • (b) worst BC: overlapped speech • (c) best BN: composite sentences • (d) best BC: simple sentences with disfluency and re-starts.

  30. Error Analysis • OOV (especially names): problematic for both ASR and MT. • Overlapped speech → what to do? • Content-word misrecognition (not all errors are equal!): • 升值 (increase in value) → 甚至 (even). Parsing scores?

  31. Error Analysis • MT BN high errors: • Composite syntax structure. • Syntactic parsing would be useful. • MT BC high errors: • Overlapped speech. • ASR high errors due to disfluency. • Conjecture: MT on perfect BC ASR is easy, given its simple/short sentence structure.

  32. Next ASR: Chinese OOV Org Names • Semi-automatic abbreviation generation for long words: • Segment a long word into a sequence of shorter words. • Extract the 1st character of each shorter word: • World Health Organization → WHO. • (Make sure the abbreviations are in the MT translation table, too.)
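
A minimal sketch of the abbreviation rule stated above, usable with any Chinese word segmenter (for example, the unigram one sketched earlier); the stand-in segmentation below is hard-coded purely for illustration.

    def abbreviate(long_word, segment):
        """Segment a long name into shorter words and keep the first character of each."""
        return "".join(w[0] for w in segment(long_word))

    # Stand-in segmentation of 世界卫生组织 ("World Health Organization"):
    fake_segment = lambda s: ["世界", "卫生", "组织"]
    print(abbreviate("世界卫生组织", fake_segment))    # 世卫组

Commonly used abbreviations do not always follow this rule exactly (e.g., 世卫组织 keeps the final word intact), which is presumably why the slide calls the process semi-automatic.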

  33. Next ASR: Chinese OOV Person Names • Mandarin has a high rate of homophones: 408 syllables vs. ~6000 common characters, i.e., about 14 homophone characters per syllable! • Given a spoken Chinese OOV name, there is no way to be sure which characters to use. But MT does not care, as long as the syllables are correct. • Recognize repetitions of the same name in the same snippet: CNC at the syllable level. • Xu {Chang, Cheng} {Lin, Min, Ming} • Huang Zhu {Qin, Qi} • After syllable CNC, apply the same name to all occurrences, in Pinyin.
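
A rough sketch of the "same name, same Pinyin" idea: gather the syllable hypotheses for every occurrence of the name, take a per-position majority vote, and apply the winning syllable sequence to all occurrences. The data layout and the Pinyin strings are assumptions.

    from collections import Counter

    def consensus_name(occurrences):
        """occurrences: list of name hypotheses, each a list of syllables (one per position),
        e.g. [['xu3', 'chang2', 'lin2'], ['xu3', 'cheng2', 'min3'], ...].
        Returns the per-position majority syllables, to be applied to every occurrence."""
        positions = zip(*occurrences)                  # group syllables by position
        return [Counter(p).most_common(1)[0][0] for p in positions]

    occurrences = [
        ["xu3", "chang2", "lin2"],
        ["xu3", "cheng2", "lin2"],
        ["xu3", "chang2", "min3"],
    ]
    print(consensus_name(occurrences))   # ['xu3', 'chang2', 'lin2']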

  34. Next ASR: English OOV Names • English spelling in Lexicon, with (multiple) Mandarin pronunciations: • Bush /bu4 shi2/ or /bu4 xi1/ • Bin Laden /ben1 la1 deng1/ or /ben1 la1 dan1/ • John /yue1 han4/ • Sadr /sa4 de2 er3/ • Name mapping from MT? • Need to do name tagging on training text (Yang Liu), convert Chinese names to English spelling, re-train n-gram.

  35. Next ASR: LM • LM adaptation with finer topics, each topic with a small vocabulary. • Spontaneous speech: n-gram backtraces to content words in search or N-best? Text-paring modeling? • 我想那(也)(也)也是 → 我想那也是 • "I think it, (too), (too), is, too." → "I think it is, too." • If optimizing CER, the stm reference needs to be designed such that disfluencies are optionally deletable.

  36. Next ASR: AM • Add explicit tone modeling (Lei07). • Prosody info: duration and pitch contour at the word level. • Various backoff schemes for infrequent words. • Better understanding of why outside regions do not help with AM adaptation. • Add SD MLLR regression trees (Mandal06). • Improve automatic speaker clustering: smaller clusters, better performance.

  37. ASR & MT Integration • Do we need to merge lexicons? ASR <= MT. • Do we need to use the same word segmenter? • Is word- or character-level CNC output better for MT? • Open questions and feedback!
