
Building A Highly Accurate Mandarin Speech Recognizer

Mei-Yuh Hwang, Gang Peng, Wen Wang (SRI), Arlo Faria (ICSI), Aaron Heidel (NTU), Mari Ostendorf. 12/12/2007.



Presentation Transcript

  1. Building A Highly Accurate Mandarin Speech Recognizer • Mei-Yuh Hwang, Gang Peng, Wen Wang (SRI), Arlo Faria (ICSI), Aaron Heidel (NTU), Mari Ostendorf • 12/12/2007

  2. Outline • Goal: A highly accurate Mandarin ASR • Baseline: System-2006 • Improvement • Acoustic segmentation • Two complementary comparable systems • Language models and adaptation • More Data • Error analysis → Future

  3. Background: System-2006 • 849M words training text • 60K-word lexicon • Static 5-gram rescoring • 465 hrs acoustic training • Two AMs (same phone-72 pronunciation) • MFCC+pitch (42-dim), SAT+fMPE, CW MPE, 3000x128 Gaussians. • MFCC+MLP+pitch (74-dim), SAT+fMPE, nonCW MPE, 3000x64 Gaussians • CER 18.4% on Eval06.

  4. 2007 Increased Training Data • 870 hours of acoustic training data. 3500x128 Gaussians. • 1.2G words of training text. Trigrams and 4-grams.

  5. Acoustic segmentation [Figure: segmenter state diagram with states Start / null, silence, noise, speech, End / null] • The former segmenter caused high deletion errors: it misclassified some speech segments as noise. • Speech segment minimum duration: 18*30 = 540 ms ≈ 0.5 s

  6. New Acoustic Segmenter [Figure: segmenter state diagram with states Start / null, noise, silence, Foreign, Mandarin 1, Mandarin 2, End / null] • Allow shorter speech duration • Model Mandarin vs. Foreign (English) separately.

  7. Improved Acoustic Segmentation (pruned trigram, SI nonCW-MLP MPE, on Eval06)

  8. Decoding Architecture [Figure: decoding flow diagram. Labels: Aachen; MLP nonCW, qLM3; PLP CW SAT+fMPE, MLLR, LM3; MLP CW SAT, MLLR, LM3; qLM4 Adapt/Rescore (twice); Confusion Network Combination]

  9. Two Sets of Acoustic Models • For cross adaptation and system combo • Different error behaviors • Similar error rate performance

  10. MLP Phoneme Posterior Features • Compute Tandem features with pitch+PLP input. • Compute HATS features with 19 critical bands • Combine Tandem and HATS posterior vectors into one. • PCA(Log(71)) → 32 • MFCC + pitch + MLP = 42 + 32 = 74-dim
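The feature pipeline on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the slide does not say how the Tandem and HATS posteriors are combined, so simple averaging is assumed here; the function name `mlp_features` is hypothetical, and the PCA is estimated on the same toy data, whereas a real system would estimate it on training data.

```python
import numpy as np

def mlp_features(tandem_post, hats_post, mfcc_pitch, out_dim=32):
    """Build per-frame features from two posterior streams (illustrative)."""
    # Assumption: combine the two 71-dim posterior vectors by simple averaging.
    combined = 0.5 * (tandem_post + hats_post)
    # Take logs of the posteriors before PCA; the small constant avoids log(0).
    log_post = np.log(combined + 1e-10)
    # PCA via SVD of the mean-centered data; keep the top out_dim directions.
    centered = log_post - log_post.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:out_dim].T
    # Append to the 42-dim MFCC+pitch vector: 42 + 32 = 74 dimensions.
    return np.hstack([mfcc_pitch, reduced])

# Toy data: 200 frames of random posteriors and random base features.
rng = np.random.default_rng(0)
tandem = rng.dirichlet(np.ones(71), size=200)
hats = rng.dirichlet(np.ones(71), size=200)
base = rng.normal(size=(200, 42))
feats = mlp_features(tandem, hats, base)
print(feats.shape)  # (200, 74)
```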

  11. Tandem Features [T1,T2,…,T71] • Input: 9 frames of PLP+pitch: PLP (39x9), Pitch (3x9) • MLP size: (42x9)x15000x71
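To make the input size concrete: stacking 9 frames of 42-dim PLP+pitch gives 42 x 9 = 378 inputs per frame. The sketch below shows one way to build such a context window. The helper `stack_context` is hypothetical, and padding utterance edges by repeating the first and last frame is an assumption, not something the slide specifies.

```python
import numpy as np

def stack_context(frames, context=9):
    """Stack `context` consecutive frames around each time step."""
    half = context // 2
    # Assumption: repeat the first/last frame to pad the utterance edges.
    padded = np.pad(frames, ((half, half), (0, 0)), mode="edge")
    n = len(frames)
    # Column block i holds the frame at offset (i - half) from the center.
    return np.hstack([padded[i:i + n] for i in range(context)])

# Toy utterance: 5 frames of 42 dims (39 PLP + 3 pitch).
plp_pitch = np.arange(5 * 42, dtype=float).reshape(5, 42)
mlp_input = stack_context(plp_pitch)
print(mlp_input.shape)  # (5, 378)
```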

  12. HATS Features [H1,H2,…,H71] [Figure: HATS network diagram. Labels: inputs E1, E2, …, E19; 51x60x71; (60*19)x8000x71]

  13. MLP and Pitch Features (nonCW ML, Hub4 Training, MLLR, LM2 on Eval04)

  14. Phone-81: Diphthongs for BC • Add diphthongs (4x4=16) for fast speech and modeling longer triphone context. • Maintain unique syllabification. • Syllable ending W and Y not needed anymore.

  15. Phone-81: Frequent Neutral Tones for BC • Neutral tones are more common in conversation. • Neutral tones were not modeled. The 3rd tone was used as replacement. • Add 3 neutral tones for frequent chars.

  16. Phone-81: Special CI Phones for BC • Filled pauses (hmm, ah) common in BC. Add two CI phones for them. • Add CI /V/ for English.

  17. Phone-81: Simplification of Other Phones • Now 72+14+3+3=92 phones, too many triphones to model. • Merge similar phones to reduce #triphones. I2 was modeled by I1, now i2. • 92 – (4x3–1) = 81 phones.

  18. Different Phone Sets (pruned trigram, SI nonCW-PLP ML, on dev07) • Indeed different error behaviors: good for system combo.

  19. PLP Models with fMPE Transform • PLP model with fMPE transform to compete with MLP model. • Smaller ML-trained Gaussian posterior model: 3500x32 CW+SAT • 5 Neighboring frames of Gaussian posteriors. • M is 42 x (3500*32*5), h is (3500*32*5)x1. • Ref: Zheng ICASSP 07 paper
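The dimensions on this slide match the usual fMPE form, y = x + M h, where x is the 42-dim feature vector and h is the long, sparse vector of Gaussian posteriors. Below is a minimal sketch, assuming h is stored as indices plus values and using a toy-sized M so the demo stays light; the function name `fmpe_offset` and all numbers other than the slide's dimensions are illustrative.

```python
import numpy as np

def fmpe_offset(x, M, active_idx, active_post):
    """Return y = x + M h, where h is sparse (indices + values)."""
    # Only the columns of M that match non-zero posteriors contribute.
    return x + M[:, active_idx] @ active_post

feat_dim = 42
h_dim_full = 3500 * 32 * 5      # 560000 posteriors at full scale (from the slide)
h_dim_toy = 1000                # toy size for this demo only

rng = np.random.default_rng(0)
M = rng.normal(scale=0.01, size=(feat_dim, h_dim_toy))
x = rng.normal(size=feat_dim)
idx = np.array([3, 250, 999])   # positions of the non-zero posteriors
post = np.array([0.6, 0.3, 0.1])
y = fmpe_offset(x, M, idx, post)
print(y.shape)  # (42,)
```

Because most Gaussian posteriors are close to zero for any given frame, the sparse form avoids multiplying by the full 42 x 560000 matrix.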

  20. Topic-based LM Adaptation [Figure: a Latent Dirichlet Allocation topic model applied to one sentence and to {w | w in same story (4 secs)}; labels q, q0] • A 4 s window is used to make adaptation more robust against ASR errors. • {w} are weighted based on distance.

  21. Topic-based LM Adaptation • Training: one topic per sentence • Train 64 topic-dependent LMs. • Testing: top n topics per sentence, weighting on neighboring 4s of speech
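The test-time step can be sketched as: pool topic posteriors over sentences within 4 s of the current one, keep the top n topics, and interpolate the matching topic-dependent LMs. Everything below is a sketch under assumptions: the inverse-distance weighting, the function names, and the toy numbers are illustrative, not the authors' method.

```python
def neighbor_weights(times, center, window=4.0):
    """Weight each sentence by its time distance from the center sentence."""
    weights = []
    for t in times:
        dist = abs(t - times[center])
        # Assumption: inverse-distance decay inside the window, zero outside.
        weights.append(1.0 / (1.0 + dist) if dist <= window else 0.0)
    return weights

def top_topics(topic_post, times, center, top_n=2, window=4.0):
    """Pool topic posteriors over neighbors and keep the top_n topics."""
    w = neighbor_weights(times, center, window)
    n_topics = len(topic_post[0])
    pooled = [sum(w[s] * topic_post[s][k] for s in range(len(times)))
              for k in range(n_topics)]
    best = sorted(range(n_topics), key=lambda k: pooled[k], reverse=True)[:top_n]
    total = sum(pooled[k] for k in best)
    # Renormalize so the interpolation weights sum to one.
    return {k: pooled[k] / total for k in best}

def adapted_prob(word_probs, mix):
    """Interpolate topic-dependent LM probabilities for one word."""
    return sum(lam * word_probs[k] for k, lam in mix.items())

# Toy example: 3 sentences, 3 topics (the talk uses 64 topic-dependent LMs).
topic_post = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.1, 0.1, 0.8]]
times = [0.0, 2.0, 10.0]          # sentence midpoints in seconds
mix = top_topics(topic_post, times, center=1)
p = adapted_prob([0.01, 0.03, 0.5], mix)
```

The third sentence lies 8 s away, so it falls outside the window and does not influence the topic weights of the center sentence.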

  22. Topic-based LM Adaptation • LMi still 60K-words? • Per-sentence adaptation? • Computational cost?

  23. LM Adaptation and CNC on Dev07 (UW 2 systems only)

  24. LM Adaptation and CNC on Eval07

  25. Eval07

  26. 2006 vs. 2007 on Eval07 • 37% relative improvement!

  27. Progress

  28. RWTH Demo • UW acoustic segmenter. • RWTH single-system ASR. Foreign (Korean) speech skipped. Mis-reco highlighted. • Manual sentence segmentation. • Machine translation. • Not real-time.

  29. MT Error Analysis on Extreme Cases • CER not directly related to HTER; genre matters. • Better CER does ease MT.

  30. MT Error Analysis • (a) worst BN: OOV names • (b) worst BC: overlapped speech • (c) best BN: composite sentences • (d) best BC: simple sentences with disfluency and re-starts. • *.html, *.wav

  31. Error Analysis • OOV (especially names): problematic for ASR, MT, distillation.

  32. Error Analysis • MT BN high errors • Composite syntax structure. • Syntactic parsing would be useful. • MT BC high errors • Overlapped speech • ASR high errors due to disfluency • Conjecture: MT on perfect BC ASR is easy, for its simple/short sentence structure

  33. Next ASR: Chinese Organization Names • Semi-auto abbreviation generation for long words. • Segment a long word into a sequence of shorter words • Extract the 1st char of each shorter word: • 世界卫生组织 → 世卫 (Make sure they are in MT translation table, too)
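A rough sketch of the candidate-generation step, assuming the long word has already been split by some word segmenter. The function `abbreviation_candidates` is hypothetical; it proposes the string of first characters plus its shorter prefixes, leaving the final choice to a human, which is one reading of "semi-auto".

```python
def abbreviation_candidates(segments, min_len=2):
    """Propose abbreviations from the first character of each segment."""
    initials = "".join(word[0] for word in segments)
    # Offer the full initial string and its shorter prefixes, longest first.
    return [initials[:n] for n in range(len(initials), min_len - 1, -1)]

# Assumed segmentation of 世界卫生组织 into three shorter words.
print(abbreviation_candidates(["世界", "卫生", "组织"]))  # ['世卫组', '世卫']
```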

  34. Next ASR: Chinese Person Names • Mandarin has a high rate of homophones: 408 syllables → 6000 common characters, about 14 homophone chars per syllable! • Given a spoken Chinese OOV name, there is no way to be sure which characters to use. But MT doesn't care, as long as the syllables are correct. • Recognizing repetition of the same name in the same snippet: CNC at syllable level • Xu → {Chang, Cheng} → {Lin, Min, Ming} • Huang → Zhu → {Qin, Qi} • After syllable CNC, apply the same name to all occurrences in Pinyin.
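The idea of combining repeated occurrences of a name at the syllable level can be illustrated with a simple per-position vote. This is a deliberate simplification, not real confusion network combination: it assumes the occurrences are already aligned, have equal length, and carry equal weight. The function name and toy hypotheses are illustrative.

```python
from collections import Counter

def vote_syllables(occurrences):
    """Pick the most frequent syllable at each position across occurrences."""
    length = len(occurrences[0])
    if any(len(o) != length for o in occurrences):
        raise ValueError("this sketch assumes pre-aligned, equal-length names")
    return [Counter(column).most_common(1)[0][0] for column in zip(*occurrences)]

# Three recognized occurrences of the same name within one snippet.
hyps = [
    ["xu", "chang", "lin"],
    ["xu", "cheng", "min"],
    ["xu", "chang", "min"],
]
name = vote_syllables(hyps)
print(name)  # ['xu', 'chang', 'min']
```

The voted syllable sequence would then replace every occurrence of the name in the snippet.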

  35. Next ASR: Foreign Names • English spelling in Lexicon, with (multiple) Mandarin pronunciations: • Bush /bu4 shi2/ or /bu4 xi1/ • Bin Laden /ben1 la1 deng1/ or /ben3 la1 deng1/ • John /yue1 han4/ • Sadr /sa4 de2 er3/ • Name mapping from MT? • Need to do name tagging on training text (Yang Liu), convert Chinese names to English spelling, re-train n-gram.

  36. Next ASR: LM • LM adaptation with fine topics, each topic with small vocabulary size. • Spontaneous speech: n-gram backtraces to content words in search or N-best? Text paring modeling? • 我想那(也)(也)也是 → 我想那也是 • I think it, (too), (too), is, too. → I think it is, too. • If optimizing CER, stm needs to be designed such that disfluency is optionally deletable, e.g. 小孩(儿)
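To show what "optionally deletable" means for scoring, here is a small character-level edit distance in which reference characters wrapped in parentheses can be dropped at no cost. It is a sketch of the scoring idea only, with hypothetical function names; it does not reproduce any particular scoring tool.

```python
def parse_ref(ref):
    """Parse '我想那(也)(也)也是' into [(char, is_optional), ...]."""
    tokens, optional = [], False
    for ch in ref:
        if ch == "(":
            optional = True
        elif ch == ")":
            optional = False
        else:
            tokens.append((ch, optional))
    return tokens

def char_errors(ref, hyp):
    """Edit distance where deleting an optional reference char is free."""
    r = parse_ref(ref)
    n, m = len(r), len(hyp)
    # d[i][j]: min errors aligning the first i ref tokens with j hyp chars.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        d[0][j] = j                      # insertions
    for i in range(1, n + 1):
        ch, opt = r[i - 1]
        del_cost = 0 if opt else 1       # optional chars delete for free
        d[i][0] = d[i - 1][0] + del_cost
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (0 if ch == hyp[j - 1] else 1)
            d[i][j] = min(sub, d[i - 1][j] + del_cost, d[i][j - 1] + 1)
    return d[n][m]

print(char_errors("我想那(也)(也)也是", "我想那也是"))  # 0
print(char_errors("小孩(儿)", "小孩"))                  # 0
```

With this convention a recognizer is not penalized either for outputting the disfluent repetitions or for dropping them.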

  37. Next ASR: AM • Add explicit tone modeling (Lei07). • Prosody info: duration and pitch contour at word level • Various backoff schemes for infrequent words • More understanding why outside regions not helping with AM adaptation. • Add SD MLLR regression tree (Mandal06). • Improve auto speaker clustering • Smaller clusters, better performance • Gender ID first.

  38. ASR & MT Integration • Do we need to merge lexicon? ASR MT. • Do we need to use the same word segmenter? • Is word/char-level CNC output better for MT? • Open questions and feedback!
