
Investigation on Mandarin Broadcast News Speech Recognition



Presentation Transcript


  1. Investigation on Mandarin Broadcast News Speech Recognition Mei-Yuh Hwang, Xin Lei, Wen Wang*, Takahiro Shinozaki University of Washington, *SRI 9/19/2006, Interspeech, Pittsburgh

  2. Outline • The task • Text training data and language modeling • Acoustic training data and acoustic modeling • Decoding structure • Experimental results • Recent progress and future direction

  3. The Task • Mandarin broadcast news (BN) transcription • Mainland Mandarin speech • TV/radio programs in China, USA • CCTV中央电视台 • NTDTV 新唐人电视台 • PHOENIX TV 凤凰卫视 • VOA 美国之音 • RFA 自由亚洲电台 • CNR 中国广播网

  4. Text Training Data • LM1: • 1997 Mandarin BN Hub4 transcriptions • Chinese TDT2,3,4 • Multiple-translation Chinese (MTC) corpus, parts 1, 2, 3 • LM2: Gigaword XIN 2001-2004 (China) • LM3: Gigaword ZBN 2001-2004 (Singapore) • LM4: Gigaword CNA 2001-2004 (Taiwan) • Altogether 420M words • The 4 LMs are interpolated into a single LM
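The interpolation step can be sketched as a weighted mixture of the component LMs' word probabilities. This is a minimal sketch: the probabilities and weights below are made-up placeholders; in practice the weights are tuned to minimize perplexity on held-out broadcast-news text (e.g. with EM).

```python
def interpolate(probs, weights):
    """Linearly interpolate per-LM probabilities of the same word
    given the same history; weights must sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * p for w, p in zip(weights, probs))

# Hypothetical probabilities of one word-in-context from the four
# component LMs (Hub4/TDT/MTC, XIN, ZBN, CNA), with made-up weights:
p_mix = interpolate([0.020, 0.010, 0.008, 0.004], [0.4, 0.3, 0.2, 0.1])
```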

  5. Chinese Word Segmentation • BBN 64k-word lexicon, derived from LDC • Longest-first match with the 64k lexicon • Choose the most frequent 49k words as the new lexicon • Train an n-gram LM • Use the unigram part to re-segment the text along the maximum-likelihood (ML) path
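The longest-first match step can be sketched as a greedy scan: at each position, take the longest lexicon entry that matches, falling back to a single character. The tiny lexicon below is a toy stand-in for the 64k/49k lexicons.

```python
def longest_first_segment(text, lexicon, max_len=8):
    """Greedy longest-first-match word segmentation."""
    out, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            # Take the longest lexicon match; a single character
            # always succeeds as the fallback.
            if L == 1 or text[i:i + L] in lexicon:
                out.append(text[i:i + L])
                i += L
                break
    return out

lexicon = {"民进党", "和亲", "民党", "和", "亲民党"}  # toy lexicon
segments = longest_first_segment("民进党和亲民党", lexicon)
```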

  6. Chinese Word Segmentation • Longest-first • 民进党/和亲/民党… • The Green Party made peace with the Min Party via marriage… • Maximum-likelihood • 民进党/和/亲民党… • The Green Party and the Qin-Min Party...
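The ML alternative can be sketched as dynamic programming over all lexicon-consistent segmentations, scored by unigram probabilities. The probabilities below are made-up: a rare 和亲 versus a frequent 和 is what flips the choice from the greedy result to the correct one.

```python
import math

def ml_segment(text, unigram, max_len=8):
    """Maximum-likelihood segmentation: DP over all segmentations
    consistent with the unigram lexicon, maximizing total log-prob."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)  # (score, backpointer)
    best[0] = (0.0, None)
    for i in range(n):
        score_i, _ = best[i]
        if score_i == -math.inf:
            continue
        for L in range(1, min(max_len, n - i) + 1):
            p = unigram.get(text[i:i + L])
            if p is None and L > 1:
                continue  # multi-char strings must be in the lexicon
            lp = math.log(p) if p else math.log(1e-9)  # OOV char floor
            if score_i + lp > best[i + L][0]:
                best[i + L] = (score_i + lp, i)
    out, j = [], n  # backtrace the best path
    while j > 0:
        _, i = best[j]
        out.append(text[i:j])
        j = i
    return out[::-1]

# Made-up unigram probabilities for the slide's example:
uni = {"民进党": 1e-4, "和": 1e-2, "和亲": 1e-7, "亲民党": 1e-4, "民党": 1e-6}
ml_result = ml_segment("民进党和亲民党", uni)
```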

  7. Perplexity • 49k-word lexicon

  8. Acoustic Training Data • *Automatically selected via a flexible alignment with closed captions

  9. Acoustic Feature Representation • 39-dim MFCC: cepstra + Δ + ΔΔ • 3-dim pitch: pitch + Δ + ΔΔ • Auto speaker clustering • VTLN per auto speaker • Speaker-based CMN+CVN for training
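Per-speaker CMN+CVN amounts to normalizing each feature dimension to zero mean and unit variance over all of one speaker's frames. A minimal numpy sketch (the frame matrix here is random stand-in data, not the actual pipeline):

```python
import numpy as np

def cmn_cvn(frames):
    """Cepstral mean + variance normalization over one speaker's frames.
    frames: (num_frames, num_dims) feature matrix."""
    mu = frames.mean(axis=0)
    sigma = frames.std(axis=0)
    return (frames - mu) / np.maximum(sigma, 1e-8)  # guard zero variance

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 42))  # 39 MFCC + 3 pitch dims
Xn = cmn_cvn(X)  # each column now has mean 0, std 1
```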

  10. Acoustic Models • 2500 senones (clustered states) × 32 Gaussians • ML training vs. MPE training with phone lattices • Gender independent • nonCW vs. CW (cross-word) triphones • Speaker-adaptive training (SAT): N(x; Aμ+b, AΣAᵀ) = |A|⁻¹ N(A⁻¹(x−b); μ, Σ), i.e. the linear transformation A⁻¹x + (−A⁻¹b) applied in the feature domain.
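The SAT identity on the slide can be checked numerically: evaluating the adapted Gaussian N(x; Aμ+b, AΣAᵀ) directly gives the same likelihood as transforming the feature to x̂ = A⁻¹(x−b), evaluating the original Gaussian, and scaling by |det A|⁻¹. A minimal numpy sketch with randomly generated A, b, μ, Σ, x:

```python
import numpy as np

def gauss_pdf(x, mu, S):
    """Multivariate normal density N(x; mu, S)."""
    d = len(x)
    diff = x - mu
    expo = -0.5 * diff @ np.linalg.solve(S, diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))

rng = np.random.default_rng(1)
d = 3
A = rng.normal(size=(d, d)) + 3 * np.eye(d)  # well-conditioned transform
b = rng.normal(size=d)
mu = rng.normal(size=d)
S = np.diag(rng.uniform(0.5, 1.5, size=d))   # diagonal covariance
x = rng.normal(size=d)

lhs = gauss_pdf(x, A @ mu + b, A @ S @ A.T)        # model-space adaptation
x_hat = np.linalg.inv(A) @ (x - b)                 # feature-space transform
rhs = gauss_pdf(x_hat, mu, S) / abs(np.linalg.det(A))
```

The feature-space form is what makes SAT cheap at decode time: one transform is applied to the observations instead of adapting every Gaussian in the model.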

  11. 2-Pass Search Architecture • Search 1: small bigram, nonCW, nonSAT, ML model → hypothesis • SAT + MLLR adaptation • Search 2: big 4-gram, CW, SAT, MPE model → final word sequence

  12. Adding Pitch: SA Results (CER)

  13. 2-pass Search Results (CER)

  14. More Recent Progress • Added more acoustic (440 hrs) and text training data (840M words) • Increased and improved lexicon (60k words) • fMPE training • Added the ICSI feature as a second system • 5-gram LM • Between the MFCC system and the ICSI system: • Cross adaptation • ROVER • 3.7% on dev04, 12.1% on eval04 • Submitted to ICASSP 2007

  15. Challenges • Channel compensation • Conversational speech • Overlapped speech • Speech with music background • Commercials • Language ID (in addition to English) • Is CER the best measure for MT?
