
A Feature Weighting Method for Robust Speech Recognition

20 Aug 01 at CUHK. A Feature Weighting Method for Robust Speech Recognition. -- Speech Activities in CST. Thomas Fang Zheng Center of Speech Technology State Key Lab of Intelligent Technology and Systems Department of Computer Science & Technology Tsinghua University


Presentation Transcript


  1. 20 Aug 01 at CUHK A Feature Weighting Method for Robust Speech Recognition -- Speech Activities in CST Thomas Fang Zheng Center of Speech Technology State Key Lab of Intelligent Technology and Systems Department of Computer Science & Technology Tsinghua University fzheng@sp.cs.tsinghua.edu.cn, http://sp.cs.tsinghua.edu.cn/~fzheng/

  2. Center of Speech Technology • Founded in 1979 and named the Speech Laboratory • Joined the State Key Laboratory of Intelligent Technology and Systems in 1999 and was renamed the Center of Speech Technology • http://sp.cs.tsinghua.edu.cn/ Center of Speech Technology, Tsinghua University

  3. Members of CST in 2001

  4. Funding Resources • State fundamental research plans: NSF, 863, 973, 985 • Collaboration with industry: • Microsoft • IBM • Intel • Lucent Technologies • Nokia • Weniwen • SoundTek • Keysun • ...

  5. Speech Research Activities • Acoustic Modeling: feature extraction and selection; accurate & fast AM; search; robustness (speech enhancement, fractals); speaker adaptation; speaker normalization; Chinese pronunciation modeling • Language Modeling and Search: characteristics of Chinese; LM adaptation & new word induction • Natural/Spoken Language Understanding (NLU/SLU): NLU - GLR-based parsing; SLU - keyword-based robust parsing; dialogue manager • Applications: command and control; keyword spotting; language learning; input method editor; Chinese dictation machine; spoken dialogues; speaker identification and verification • Resources

  6. Feature Extraction and Selection • Trying to extract discriminative features • Trying to select robust feature components from the existing features

  7. Introduction • Humans do not always use the same features to recognize objects. • The feature components used often vary with the objects to be recognized. • This is feature selection after feature extraction. • Feature selection can be regarded as a special case of feature weighting. • Before going to the topic, let’s look at a problem first...

  8. A Problem • Conditions: • Two opaque, completely separated rooms (Room 1 and Room 2) are very close to each other. • Room 1 contains 3 switches (A, B, C) and Room 2 contains 3 lights (X, Y, Z). • Each switch corresponds to one and only one light. • You can switch any switch on/off any number of times. • But you can enter each room only once. • Goal: find which switch corresponds to which light.

  9. The Answer (Step 1) • Actions: • Turn on Switch A. • Wait for a couple of minutes... • Turn off Switch A. (Room 2 remains unseen.)

  10. The Answer (Step 2) • Actions: • Turn on Switch B. • Immediately go to Room 2.

  11. The Answer (Step 3) • Conclusions: • The BRIGHT one (among the three)  Switch B. • The HOT one (of the remaining two)  Switch A. • The remaining one  Switch C.

  12. Those Behind the Answer • “Turning Switch A on/off and turning Switch B on in Room 1” is a feature extraction procedure; “status checking in Room 2” is feature selection and recognition. • Feature vector = (hot, bright)T. • Feature selecting vector: W=(wh, wb)T, where wh, wb ∈ {0, 1} • A hierarchical feature selecting procedure • Step 1: W=(0,1)T (“Bright” component) to tell “B” from the others (“A” & “C”) • Step 2: W=(1,0)T (“Hot” component) to tell “A” from “C” • Or alternatively • Step 1: W=(1,0)T to tell “A” from “B” & “C” • Step 2: W=(0,1)T to tell “B” from “C”

  13. Is this idea suitable for ASR? • It is a good idea. • But with MFCC/LPCC features, the components are not so separable, because each component contributes to the recognition of every unit. • A solution is to generalize feature selection into feature weighting, so that the differing contribution of each feature component to different speech recognition units is reflected. • In “feature weighting”, the value range of W’s elements is [0,1] instead of {0, 1}.

  14. What We Have Experimented With • Hierarchical feature weighting • Each speech recognition unit (SRU) subset shares the same fixed feature weighting vector (i.e., uses the same feature components); • the SRU set is divided into subsets according to a minimum classification error (MCE) criterion. • Sub-band feature weighting inside the MFCC calculation • The weight in each sub-band is based on its SNR level. • The method is combined with noise spectral subtraction.

  15. I. A Hierarchical Feature Weighting Method Based on Minimum Classification Error (MCE)

  16. Basic Idea • [Figure: feature vector X is weighted into X1 and X2 to separate the SRUs at >80% accuracy; A, B & C: SRUs; X, X1 & X2: feature vectors] • Human cognition - during recognition, different feature components are used for different objects to be distinguished • Pattern recognition - different feature components for different speech recognition unit (SRU) subsets (model subsets)

  17. Definition • For any feature vector X=[x1, x2, …, xD]T and a choosing vector W=[w1, w2, …, wD]T, we define the choosing operation as W∘X = X∘W = [x1·w1, x2·w2, …, xD·wD]T • In the figure on the previous page, X1 and X2 can be regarded as Xi = X∘Wi, i=1,2, where wid ∈ {0,1}, i=1,2, and d=1,2,…,D • If “choosing” is generalized into “weighting”: wd ∈ [0,1], d=1,2,…,D
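The choosing/weighting operation defined above is simply an element-wise product; a minimal NumPy sketch (the function name is illustrative, not from the talk):

```python
import numpy as np

def weight_features(x, w):
    """Element-wise weighting of a feature vector x by a weight vector w.

    With w in {0,1}^D this is feature *selection* (the "choosing"
    operation W o X); relaxing to w in [0,1]^D gives feature *weighting*.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    assert x.shape == w.shape
    return x * w

# Selection: keep only the "bright" component of (hot, bright)
x = np.array([0.9, 0.2])
print(weight_features(x, [0, 1]))        # -> [0.  0.2]
# Weighting: soft emphasis instead of a hard choice
print(weight_features(x, [0.3, 1.0]))
```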

  18. Feature Weighting • [Diagram: input signal → feature extraction → {X1, X2, …, XT} → weighting Yt(s) = Xt ∘ W(s) → {Y1, Y2, …, YT} → recognizer, where s is an SRU subset]

  19. Problems • How to divide the whole SRU set (or model set) into subsets {S}? • How to train the weighting vector W(S) for each model subset? • The model set division should be based on a minimum classification error (MCE) criterion.

  20. MCE Based Model Set Division (0) • E(S|W): error count for model subset S given weight W • W(S): optimal weight for subset S • Goal: to find an optimal weight for set S • [Figure: model space containing subset S with error E(S|W(S))]

  21. MCE Based Model Set Division (1) • S = S1 + S2 • E(S|W(S)) = E(S1|W(S)) + E(S2|W(S)) + E(S1,S2|W(S)) • Goal: to minimize the inter-subset error E(S1,S2|W(S)) • [Figure: S divided into S1 (error E(S1|W(S))) and S2 (error E(S2|W(S))) plus the inter-subset error]

  22. MCE Based Model Set Division (2) • E(S|W(S)) = E(S1|W(S)) + E(S2|W(S)) + E(S1,S2|W(S)) ≥ E(S1|W(S1)) + E(S2|W(S2)) + E(S1,S2|W(S)) • Goal: to find optimal weights for the new subsets, so that E(S1|W(S1)) ≤ E(S1|W(S)) and E(S2|W(S2)) ≤ E(S2|W(S))

  23. Optimal Division into 2 Subsets • [Figure: confusion graph divided into S1 and S2 by a cut set] • A confusion graph is established first (vertex: model; edge: inter-model classification error); an acceptable error threshold (AET) is used as the stopping criterion. • I. A graph maximum-flow algorithm is used to find the minimum cut set in the confusion graph (i.e., to minimize the inter-subset error). • II. The generalized probabilistic descent (GPD) algorithm is used to train the weight vectors.
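Step I can be sketched with a textbook Edmonds-Karp max-flow / min-cut. The toy confusion graph, edge counts, and seed models below are invented for illustration, and the GPD weight training of step II is not shown:

```python
from collections import deque, defaultdict

def min_cut_partition(edges, s, t):
    """Split a confusion graph into two subsets via max-flow / min-cut
    (Edmonds-Karp). `edges` maps (u, v) to an inter-model error count;
    the cut separating seed models s and t minimizes the total
    inter-subset error."""
    cap = defaultdict(int)
    adj = defaultdict(set)
    for (u, v), c in edges.items():          # undirected confusion edges
        cap[(u, v)] += c
        cap[(v, u)] += c
        adj[u].add(v)
        adj[v].add(u)

    def bfs_path():
        """Find a shortest augmenting path from s to t in the residual graph."""
        parent = {s: None}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    if v == t:
                        return parent
                    q.append(v)
        return None

    while (parent := bfs_path()) is not None:
        path, v = [], t                      # reconstruct the path
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        f = min(cap[e] for e in path)        # bottleneck capacity
        for u, v in path:
            cap[(u, v)] -= f
            cap[(v, u)] += f

    # nodes still reachable from s in the residual graph form subset S1
    seen, q = {s}, deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in seen and cap[(u, v)] > 0:
                seen.add(v)
                q.append(v)
    return seen, set(adj) - seen

# Toy confusion graph: A/B confuse often, C/D confuse often,
# only a weak confusion link between the two groups.
errors = {("A", "B"): 9, ("C", "D"): 8, ("B", "C"): 1}
s1, s2 = min_cut_partition(errors, "A", "D")
print(sorted(s1), sorted(s2))   # -> ['A', 'B'] ['C', 'D']
```

The cut value here is 1 (the weak B-C edge), so only one inter-subset confusion error is paid by the split.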

  24. Result after Subset Division - Classification Tree • [Figure: tree with root (S, W(S)); S splits into (S1, W(S1)) and (S2, W(S2)); a further split yields (S11={A,C}, W(S11)) and (S12, W(S12))]

  25. Training Algorithm Summary • [Flow: ML training of HMM parameters → graph splitting via maximum flow → weight updating via MCE (GPD) → terminal-node test against the threshold → feature-weighting recognition tree]

  26. A Hierarchical Recognition Procedure • [Figure: recognition descends the classification tree: at the root S, weight X by W(S); at S1 or S2, weight X by W(S1) or W(S2); continue down to S11={A,C} with W(S11) or S12 with W(S12)]
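The descent through the recognition tree can be sketched as follows, with a nearest-template rule standing in for the actual HMM scoring; the toy tree mirrors the switch puzzle from the earlier slides rather than real CD-IF models:

```python
import numpy as np

class Node:
    """A node of the hierarchical recognition tree: a leaf holds a final
    label; an internal node holds its weight vector W(S) and templates
    that route the weighted feature to one child subset. The
    distance-to-template rule is a stand-in for real HMM scoring."""
    def __init__(self, weight=None, templates=None, children=None, label=None):
        self.weight, self.templates = weight, templates
        self.children, self.label = children, label

def recognize(node, x):
    """Descend the tree: at each node compute Y = X o W(S), pick the
    closest subset template, and recurse until a leaf is reached."""
    while node.label is None:
        y = np.asarray(x, dtype=float) * node.weight
        d = [np.linalg.norm(y - np.asarray(t, dtype=float))
             for t in node.templates]
        node = node.children[int(np.argmin(d))]
    return node.label

# Toy tree for the switch puzzle with (hot, bright) features:
# the root uses only "bright" to split off B, then only "hot" for A vs C.
leaf_a, leaf_b, leaf_c = Node(label="A"), Node(label="B"), Node(label="C")
ac = Node(weight=[1, 0], templates=[[1, 0], [0, 0]], children=[leaf_a, leaf_c])
root = Node(weight=[0, 1], templates=[[0, 1], [0, 0]], children=[leaf_b, ac])
print(recognize(root, [1.0, 0.0]))   # hot, not bright -> "A"
```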

  27. Database • 863 Chinese Continuous Speech Database • Training (10 males) / testing (2 males) • 521 utterances (6~10 syllables per utterance) per speaker • Speech Recognition Unit: CD-IFs • Feature • (16-D MFCC + 1-D frame energy) + delta • Acoustic Model • HMM (left-to-right, 3-state, 16-Gaussian-mixture)

  28. Experimental Result (1) • Accuracy vs. Acceptable Error Threshold (AET)

  29. Experimental Result (2) • Accuracy comparison w/ AET=5%

  30. Conclusions • Hierarchical feature weighting is applied to ASR without normalization during recognition • The classification tree is constructed optimally based on the minimum classification error principle • The acceptable error threshold can be used to control classification complexity and avoid overfitting during training • The hierarchical recognition tree can be adapted to other training criteria (e.g. MMI) and parameters (e.g. HMM parameters).

  31. II. Sub-band Feature Weighting Combined with Noise Subtraction for Robust Speech Recognition

  32. Basic Idea • In adverse environments, speech is often polluted by noise, and the noise is often assumed to be of some known type. • Assumption: different (spectral) sub-bands of the speech are polluted by noise at different levels. • Goal: deemphasize the polluted sub-bands according to their noise levels. • This is a kind of multi-band ASR. • Key problems: • How to estimate the noise levels? • How to deemphasize the corresponding sub-bands?

  33. Multi-Band ASR - Overview: traditional

  34. Multi-Band ASR - Overview: LR (likelihood recombination) • A combination of the results from sub-band HMM recognizers • Disadvantages: (1) loss of joint spectral information, such as the shape of the cepstral envelope; (2) difficult to combine

  35. Multi-Band ASR - Overview: FR (feature recombination) • A combination of groups (sub-bands) of feature components after processing • FR outperforms LR • Our method is similar to the FR method

  36. Noise estimation in a specified sub-band • The observation is that, in a narrow band of the spectrogram of a long speech segment: • the speech distribution is often not uniform (pauses, tone sandhi, ...); but • the noise distribution is relatively uniform (for either white noise or irregular noise). • For either long-term speech or long-term noise, a peak can always be found in the log energy density distribution. See the figures on the next page.

  37. Log energy density distribution as a function of log energy: (a) clean speech; (b) white noise. * The energy is normalized by the maximum value in the focused sub-band.

  38.-41. [Figure-only slides (log energy density distributions); no recoverable text.]

  42. Observations: • For clean speech, the peak is well to the left of 0 energy. • When SNR = 0 dB, the peak lies at 0 energy. • The peak shifts rightwards as the SNR decreases (i.e., as the noise energy increases). • Conclusions/Assumptions: • The log (normalized) energy at which the peak resides is the average energy of the noise; or equivalently, • noise/silence has the maximum density in the log energy density distribution (of the sub-band being analyzed).

  43. The noise energy in a given sub-band b is estimated as the energy at which the log energy density distribution of E(f) peaks, where E(f) is the energy at frequency f in band b. • H. G. Hirsch, “Estimation of noise spectrum and its application to SNR-estimation and speech enhancement,” Technical Report TR-93-012, International Computer Science Institute, 1993. http://www.icsi.berkeley.edu/techreports/
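The peak-picking estimate can be sketched as a histogram mode over per-frame band energies. The bin count and synthetic data below are assumptions, and Hirsch's actual estimator differs in detail:

```python
import numpy as np

def estimate_band_noise_energy(band_energies, n_bins=40):
    """Estimate the noise energy in one sub-band as the mode (peak) of
    the log-energy density distribution, following the observation that
    noise/silence frames dominate that peak.

    band_energies: per-frame energies summed over the band.
    Returns the (linear) energy at the histogram peak."""
    log_e = np.log(np.asarray(band_energies, dtype=float) + 1e-12)
    counts, edges = np.histogram(log_e, bins=n_bins)
    k = int(np.argmax(counts))                    # peak of the distribution
    peak_log_e = 0.5 * (edges[k] + edges[k + 1])  # bin centre
    return float(np.exp(peak_log_e))

# Frames that are mostly low-level noise with occasional speech bursts
rng = np.random.default_rng(0)
noise = rng.uniform(0.9, 1.1, size=900)      # noise-only frames near 1.0
speech = rng.uniform(5.0, 50.0, size=100)    # sparse high-energy frames
est = estimate_band_noise_energy(np.concatenate([noise, speech]))
print(est)   # close to 1.0, the simulated noise level
```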

  44. How to deemphasize the noise-polluted speech? • Spectral subtraction • Sub-band weighting

  45. Spectral subtraction (SS)

  46. SS is performed on the magnitude spectrum. • Basic SS: |Ŝ(f)| = |Y(f)| − |N̂(f)| • Generalized SS: |Ŝ(f)|^γ = |Y(f)|^γ − α·|N̂(f)|^γ • α: over-subtraction factor to control the subtraction extent • γ: sharpness exponent; the bigger, the smoother • Best values: α=3, γ=2. • SS can be performed on each sub-band individually.
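A sketch of generalized spectral subtraction, |S|^γ = |Y|^γ − α|N|^γ, applied to one spectrum. The spectral floor is an added assumption to keep the result non-negative (the slides do not specify one):

```python
import numpy as np

def generalized_spectral_subtraction(Y_mag, N_mag, alpha=3.0, gamma=2.0,
                                     floor=0.01):
    """Generalized spectral subtraction on a magnitude spectrum:
        |S|^gamma = |Y|^gamma - alpha * |N|^gamma
    alpha is the over-subtraction factor, gamma the sharpness exponent
    (gamma=1: magnitude SS, gamma=2: power SS). The floor keeps the
    result from going negative; its value is an assumption."""
    Y_mag = np.asarray(Y_mag, dtype=float)
    N_mag = np.asarray(N_mag, dtype=float)
    diff = Y_mag ** gamma - alpha * N_mag ** gamma
    floored = np.maximum(diff, (floor * N_mag) ** gamma)
    return floored ** (1.0 / gamma)

noisy = np.array([10.0, 4.0, 1.0])   # |Y(f)| per bin
noise = np.array([1.0, 1.0, 1.0])    # estimated |N(f)| per bin
print(generalized_spectral_subtraction(noisy, noise))
```

The third bin shows why a floor is needed: over-subtraction would otherwise drive its power negative.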

  47. Sub-band Weighting (SW) • The spectrum of noisy speech is divided into sub-bands; • the spectrum of each sub-band is multiplied by a weighting function: P′y(f, b) = w(f, b)·Py(f, b). • Assume the noise in a specified sub-band is uniform: the weighting function can then be simplified as w(f, b) ≈ w(b). • The weighting function should be an increasing function of the SNR in that band, or, for simplification, an increasing function of the average signal energy and a decreasing function of the noise energy.

  48. Therefore, the weighting function can be defined accordingly. [Equations not reproduced in the transcript.]
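Since the transcript does not reproduce the actual weighting function, the sketch below uses an assumed ratio form w(b) = Es(b)/(Es(b) + β·En(b)), which has the stated properties (increasing in the band's signal energy, decreasing in its noise energy):

```python
import numpy as np

def subband_weight(signal_energy, noise_energy, beta=1.0):
    """One plausible weighting function with the properties stated on
    the slide. The exact function used in the talk is not reproduced
    in the transcript; this ratio form (and beta) is an assumption."""
    return signal_energy / (signal_energy + beta * noise_energy)

def apply_subband_weighting(P_y, band_slices, noise_energies):
    """P'_y(f, b) = w(b) * P_y(f, b), one weight per sub-band (noise
    assumed uniform within a band, so w(f, b) ~ w(b))."""
    P_out = np.array(P_y, dtype=float)
    for sl, n_e in zip(band_slices, noise_energies):
        s_e = P_out[sl].mean()                 # average band signal energy
        P_out[sl] *= subband_weight(s_e, n_e)  # deemphasize noisy bands
    return P_out

# Two bands of 4 bins each; the second band is much noisier
P = np.ones(8)
bands = [slice(0, 4), slice(4, 8)]
print(apply_subband_weighting(P, bands, noise_energies=[0.1, 10.0]))
```

The noisier band is scaled down by an order of magnitude more than the clean one, which is exactly the deemphasis the slides call for.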

  49. Integration of SS and/or SW into the MFCC calculation • MFCC takes human hearing perception into account (the mel scale); • sub-band division is embedded in the MFCC calculation.

  50. Flow Chart for the Modified MFCC Calculation • [Flow: spectrum calculation (DFT) → sub-band division (mel scale) → noise energy estimation → spectral subtraction → weight estimation → sub-band weighting → sub-band energy calculation → log|·| → DCT → MFCC]
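The flow above can be sketched end to end for a single frame. Rectangular bands stand in for the mel filterbank and the weighting function is assumed, so this is a structural sketch rather than the talk's exact pipeline:

```python
import numpy as np

def modified_mfcc_frame(frame, band_edges, noise_band_energies,
                        n_ceps=13, alpha=1.0):
    """Sketch of the modified MFCC flow: DFT -> sub-band division ->
    per-band noise subtraction and weighting -> log -> DCT.
    Rectangular bands stand in for the mel filterbank, and the
    weighting function is an assumed SNR-style ratio."""
    spec = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum
    log_band = []
    for (lo, hi), n_e in zip(band_edges, noise_band_energies):
        e = spec[lo:hi].sum()                         # sub-band energy
        e = max(e - alpha * n_e, 1e-10)               # spectral subtraction
        w = e / (e + n_e)                             # sub-band weighting
        log_band.append(np.log(w * e))                # log |.|
    log_band = np.array(log_band)
    # DCT-II over the log band energies -> cepstral coefficients
    B = len(log_band)
    n = np.arange(B)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / B)
    return dct @ log_band

# 64-sample frame of a sine in noise, 4 rectangular "mel" bands
rng = np.random.default_rng(1)
frame = (np.sin(2 * np.pi * 8 * np.arange(64) / 64)
         + 0.1 * rng.standard_normal(64))
bands = [(0, 8), (8, 16), (16, 24), (24, 33)]
c = modified_mfcc_frame(frame, bands, noise_band_energies=[0.5] * 4, n_ceps=4)
print(c.shape)   # (4,)
```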
