
A Feature Weighting Method for Robust Speech Recognition

20 Aug 01 at CUHK. A Feature Weighting Method for Robust Speech Recognition. -- Speech Activities in CST. Thomas Fang Zheng Center of Speech Technology State Key Lab of Intelligent Technology and Systems Department of Computer Science & Technology Tsinghua University


Presentation Transcript


  1. 20 Aug 01 at CUHK A Feature Weighting Method for Robust Speech Recognition -- Speech Activities in CST Thomas Fang Zheng Center of Speech Technology State Key Lab of Intelligent Technology and Systems Department of Computer Science & Technology Tsinghua University fzheng@sp.cs.tsinghua.edu.cn, http://sp.cs.tsinghua.edu.cn/~fzheng/

  2. Center of Speech Technology • Founded in 1979 and named the Speech Laboratory • Joined the State Key Laboratory of Intelligent Technology and Systems in 1999 and was renamed the Center of Speech Technology • http://sp.cs.tsinghua.edu.cn/ Center of Speech Technology, Tsinghua University

  3. Members of CST in 2001

  4. Funding Resources • State fundamental research plans: NSF, 863, 973, 985 • Collaboration with industry: • Microsoft • IBM • Intel • Lucent Technologies • Nokia • Weniwen • SoundTek • Keysun • ...

  5. Speech Research Activities • Acoustic Modeling: feature extraction and selection; accurate & fast AM; search; robustness (speech enhancement, fractals); speaker adaptation; speaker normalization; Chinese pronunciation modeling • Language Modeling and Search: characteristics of Chinese; LM adaptation & new word induction • Natural/Spoken Language Understanding (NLU/SLU): NLU - GLR-based parsing; SLU - keyword-based robust parsing; dialogue manager • Applications: command and control; keyword spotting; language learning; input method editor; Chinese dictation machine; spoken dialogues; speaker identification and verification • Resources

  6. Feature Extraction and Selection • Trying to extract discriminative features • Trying to select robust feature components from the existing features

  7. Introduction • Humans do not always use the same features to recognize objects. • The feature components used often vary with the objects to be recognized. • This is feature selection after feature extraction. • Feature selection can be regarded as a special case of feature weighting. • Before going to the topic, let’s look at a problem first...

  8. A Problem • Conditions: • Two opaque, completely separated rooms (Room 1 and Room 2) are very close to each other. • Room 1 contains 3 switches (A, B, C) and Room 2 contains 3 lights (X, Y, Z). • Each switch corresponds to one and only one light. • You can switch any switch on/off any number of times. • But you can enter each room only once. • Goal: find which switch corresponds to which light.

  9. The Answer (Step 1) • Actions: • Turn on Switch A. • Wait for a couple of minutes... • Turn off Switch A. (Room 2 remains unseen.)

  10. The Answer (Step 2) • Actions: • Turn on Switch B. • Immediately go to Room 2.

  11. The Answer (Step 3) • Conclusions: • The BRIGHT one (among the three)  Switch B. • The HOT one (of the remaining two)  Switch A. • The remaining one  Switch C.

  12. Those Behind the Answer • “Turning Switch A on/off and turning Switch B on in Room 1” is a feature extraction procedure; “status checking in Room 2” is feature selection and recognition. • Feature vector = (hot, bright)T. • Feature selecting vector: W=(wh, wb)T, where wh, wb ∈ {0, 1} • A hierarchical feature selecting procedure • Step 1: W=(0,1)T (“Bright” component) to tell “B” from the others (“A” & “C”) • Step 2: W=(1,0)T (“Hot” component) to tell “A” from “C” • Or alternatively • Step 1: W=(1,0)T to tell “A” from “B” & “C” • Step 2: W=(0,1)T to tell “B” from “C”

  13. Is this idea suitable for ASR? • It is a good idea. • But with MFCC/LPCC features, the components are not so separable, because each component contributes to the recognition of every unit. • A solution is to generalize feature selection into feature weighting, so that the differing contribution of each feature component to different speech recognition units is reflected. • In “feature weighting”, the value range of W’s elements is [0,1] instead of {0, 1}.

  14. What We Have Experimented With • Hierarchical feature weighting • Each speech recognition unit (SRU) subset shares the same fixed feature weighting vector (i.e., uses the same feature components); • the SRU set is divided into subsets according to a minimum classification error (MCE) criterion. • Sub-band feature weighting inside the MFCC calculation • The weight in each sub-band is based on its SNR level. • The method is combined with noise spectral subtraction.

  15. I. A Hierarchical Feature Weighting Method Based on Minimum Classification Error (MCE)

  16. Basic Idea • [Figure: feature vector X is weighted into X1 and X2 to separate the SRUs at >80% accuracy; A, B & C: SRUs; X, X1 & X2: feature vectors] • Human cognition - during recognition, different feature components are used for different objects to be distinguished • Pattern recognition - different feature components for different speech recognition unit (SRU) subsets (model subsets)

  17. Definition • For any feature vector X=[x1, x2, …, xD]T and a choosing vector W=[w1, w2, …, wD]T, we define the choosing operation as W∘X = X∘W = [x1·w1, x2·w2, …, xD·wD]T • In the figure on the previous page, X1 and X2 can be regarded as Xi = X∘Wi, i=1,2, where wid ∈ {0,1}, i=1,2, and d=1,2,…,D • If “choosing” is generalized into “weighting”: wd ∈ [0,1], d=1,2,…,D
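The choosing/weighting operation defined above is simply an element-wise product; a minimal NumPy sketch (the function name is illustrative, not from the talk):

```python
import numpy as np

def weight_features(x, w):
    """Element-wise weighting of a feature vector x by a weight vector w.

    With w in {0,1}^D this is feature *selection* (the "choosing"
    operation W o X); relaxing to w in [0,1]^D gives feature *weighting*.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    assert x.shape == w.shape
    return x * w

# Selection: keep only the "bright" component of (hot, bright)
x = np.array([0.9, 0.2])
print(weight_features(x, [0, 1]))        # -> [0.  0.2]
# Weighting: soft emphasis instead of a hard choice
print(weight_features(x, [0.3, 1.0]))
```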

  18. Feature Weighting • [Diagram: input signal → feature extraction → {X1, X2, …, XT} → weighting Yt(s) = Xt ∘ W(s) → {Y1, Y2, …, YT} → recognizer, where s is an SRU subset]

  19. Problems • How to divide the whole SRU set (or model set) into subsets {S}? • How to train the weighting vector W(S) for each model subset? • The model set division should be based on a minimum classification error (MCE) criterion.

  20. MCE Based Model Set Division (0) • E(S|W): error count for model subset S given weight W • W(S): optimal weight for subset S • Goal: to find an optimal weight for set S • [Figure: model space containing subset S with error E(S|W(S))]

  21. MCE Based Model Set Division (1) • S = S1 + S2 • E(S|W(S)) = E(S1|W(S)) + E(S2|W(S)) + E(S1,S2|W(S)) • Goal: to minimize the inter-subset error E(S1,S2|W(S)) • [Figure: S divided into S1 (error E(S1|W(S))) and S2 (error E(S2|W(S))) plus the inter-subset error]

  22. MCE Based Model Set Division (2) • E(S|W(S)) = E(S1|W(S)) + E(S2|W(S)) + E(S1,S2|W(S)) ≥ E(S1|W(S1)) + E(S2|W(S2)) + E(S1,S2|W(S)) • Goal: to find optimal weights for the new subsets, so that E(S1|W(S1)) ≤ E(S1|W(S)) and E(S2|W(S2)) ≤ E(S2|W(S))

  23. Optimal Division into 2 Subsets • [Figure: confusion graph divided into S1 and S2 by a cut set] • A confusion graph is established first (vertex: model; edge: inter-model classification error); an acceptable error threshold (AET) is used as the stopping criterion. • I. A graph maximum-flow algorithm is used to find the minimum cut set in the confusion graph (i.e., to minimize the inter-subset error). • II. The generalized probabilistic descent (GPD) algorithm is used to train the weight vectors.
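Step I can be sketched with a textbook Edmonds-Karp max-flow / min-cut. The toy confusion graph, edge counts, and seed models below are invented for illustration, and the GPD weight training of step II is not shown:

```python
from collections import deque, defaultdict

def min_cut_partition(edges, s, t):
    """Split a confusion graph into two subsets via max-flow / min-cut
    (Edmonds-Karp). `edges` maps (u, v) to an inter-model error count;
    the cut separating seed models s and t minimizes the total
    inter-subset error."""
    cap = defaultdict(int)
    adj = defaultdict(set)
    for (u, v), c in edges.items():          # undirected confusion edges
        cap[(u, v)] += c
        cap[(v, u)] += c
        adj[u].add(v)
        adj[v].add(u)

    def bfs_path():
        """Find a shortest augmenting path from s to t in the residual graph."""
        parent = {s: None}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    if v == t:
                        return parent
                    q.append(v)
        return None

    while (parent := bfs_path()) is not None:
        path, v = [], t                      # reconstruct the path
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        f = min(cap[e] for e in path)        # bottleneck capacity
        for u, v in path:
            cap[(u, v)] -= f
            cap[(v, u)] += f

    # nodes still reachable from s in the residual graph form subset S1
    seen, q = {s}, deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in seen and cap[(u, v)] > 0:
                seen.add(v)
                q.append(v)
    return seen, set(adj) - seen

# Toy confusion graph: A/B confuse often, C/D confuse often,
# only a weak confusion link between the two groups.
errors = {("A", "B"): 9, ("C", "D"): 8, ("B", "C"): 1}
s1, s2 = min_cut_partition(errors, "A", "D")
print(sorted(s1), sorted(s2))   # -> ['A', 'B'] ['C', 'D']
```

The cut value here is 1 (the weak B-C edge), so only one inter-subset confusion error is paid by the split.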

  24. Result after Subset Division - Classification Tree • [Figure: tree with root (S, W(S)); S splits into (S1, W(S1)) and (S2, W(S2)); a further split yields (S11={A,C}, W(S11)) and (S12, W(S12))]

  25. Training Algorithm Summary • [Flow: ML training of HMM parameters → graph splitting via maximum flow → weight updating via MCE (GPD) → terminal-node test against the threshold → feature-weighting recognition tree]

  26. A Hierarchical Recognition Procedure • [Figure: recognition descends the classification tree: at the root S, weight X by W(S); at S1 or S2, weight X by W(S1) or W(S2); continue down to S11={A,C} with W(S11) or S12 with W(S12)]
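The descent through the recognition tree can be sketched as follows, with a nearest-template rule standing in for the actual HMM scoring; the toy tree mirrors the switch puzzle from the earlier slides rather than real CD-IF models:

```python
import numpy as np

class Node:
    """A node of the hierarchical recognition tree: a leaf holds a final
    label; an internal node holds its weight vector W(S) and templates
    that route the weighted feature to one child subset. The
    distance-to-template rule is a stand-in for real HMM scoring."""
    def __init__(self, weight=None, templates=None, children=None, label=None):
        self.weight, self.templates = weight, templates
        self.children, self.label = children, label

def recognize(node, x):
    """Descend the tree: at each node compute Y = X o W(S), pick the
    closest subset template, and recurse until a leaf is reached."""
    while node.label is None:
        y = np.asarray(x, dtype=float) * node.weight
        d = [np.linalg.norm(y - np.asarray(t, dtype=float))
             for t in node.templates]
        node = node.children[int(np.argmin(d))]
    return node.label

# Toy tree for the switch puzzle with (hot, bright) features:
# the root uses only "bright" to split off B, then only "hot" for A vs C.
leaf_a, leaf_b, leaf_c = Node(label="A"), Node(label="B"), Node(label="C")
ac = Node(weight=[1, 0], templates=[[1, 0], [0, 0]], children=[leaf_a, leaf_c])
root = Node(weight=[0, 1], templates=[[0, 1], [0, 0]], children=[leaf_b, ac])
print(recognize(root, [1.0, 0.0]))   # hot, not bright -> "A"
```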

  27. Database • 863 Chinese Continuous Speech Database • Training (10 males) / testing (2 males) • 521 utterances (6~10 syllables per utterance) per speaker • Speech Recognition Unit: CD-IFs • Feature • (16-D MFCC + 1-D frame energy) + delta • Acoustic Model • HMM (left-to-right, 3-state, 16-Gaussian-mixture)

  28. Experimental Result (1) • Accuracy vs. Acceptable Error Threshold (AET)

  29. Experimental Result (2) • Accuracy comparison w/ AET=5%

  30. Conclusions • Hierarchical feature weighting is applied to ASR without normalization during recognition • The classification tree is constructed optimally based on the minimum classification error principle • The acceptable error threshold can be used to control classification complexity and avoid overfitting during training • The hierarchical recognition tree can be adapted to other training criteria (e.g. MMI) and parameters (e.g. HMM parameters).

  31. II. Sub-band Feature Weighting Combined with Noise Subtraction for Robust Speech Recognition

  32. Basic Idea • In adverse environments, speech is often polluted by noise, and the noise is often assumed to be of some known type. • Assumption: different (spectral) sub-bands of the speech are polluted by noise at different levels. • Goal: deemphasize the polluted sub-bands according to their noise levels. • This is a kind of multi-band ASR. • Key problems: • How to estimate the noise levels? • How to deemphasize the corresponding sub-bands?

  33. Multi-Band ASR - Overview: traditional

  34. Multi-Band ASR - Overview: LR (likelihood recombination) • A combination of the results from sub-band HMM recognizers • Disadvantages: (1) loss of joint spectral information, such as the shape of the cepstral envelope; (2) difficult to combine

  35. Multi-Band ASR - Overview: FR (feature recombination) • A combination of groups (sub-bands) of feature components after processing • FR outperforms LR • Our method is similar to the FR method

  36. Noise estimation in a specified sub-band • The observation is that, in a narrow band of the spectrogram of a long speech segment: • the speech distribution is often not uniform (pauses, tone sandhi, ...); but • the noise distribution is relatively uniform (for either white noise or irregular noise). • For either long-term speech or long-term noise, a peak can always be found in the log energy density distribution. See the figures on the next page.

  37. Log energy density distribution as a function of log energy: (a) clean speech; (b) white noise. * The energy is normalized by the maximum value in the focused sub-band.

  38.-41. [Figure-only slides (log energy density distributions); no recoverable text.]

  42. Observations: • For clean speech, the peak is well to the left of 0 energy. • When SNR = 0 dB, the peak lies at 0 energy. • The peak shifts rightwards as the SNR decreases (i.e., as the noise energy increases). • Conclusions/Assumptions: • The log (normalized) energy at which the peak resides is the average energy of the noise; or equivalently, • noise/silence has the maximum density in the log energy density distribution (of the sub-band being analyzed).

  43. The noise energy in a given sub-band b is estimated as the energy at which the log energy density distribution of E(f) peaks, where E(f) is the energy at frequency f in band b. • H. G. Hirsch, “Estimation of noise spectrum and its application to SNR-estimation and speech enhancement,” Technical Report TR-93-012, International Computer Science Institute, 1993. http://www.icsi.berkeley.edu/techreports/
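The peak-picking estimate can be sketched as a histogram mode over per-frame band energies. The bin count and synthetic data below are assumptions, and Hirsch's actual estimator differs in detail:

```python
import numpy as np

def estimate_band_noise_energy(band_energies, n_bins=40):
    """Estimate the noise energy in one sub-band as the mode (peak) of
    the log-energy density distribution, following the observation that
    noise/silence frames dominate that peak.

    band_energies: per-frame energies summed over the band.
    Returns the (linear) energy at the histogram peak."""
    log_e = np.log(np.asarray(band_energies, dtype=float) + 1e-12)
    counts, edges = np.histogram(log_e, bins=n_bins)
    k = int(np.argmax(counts))                    # peak of the distribution
    peak_log_e = 0.5 * (edges[k] + edges[k + 1])  # bin centre
    return float(np.exp(peak_log_e))

# Frames that are mostly low-level noise with occasional speech bursts
rng = np.random.default_rng(0)
noise = rng.uniform(0.9, 1.1, size=900)      # noise-only frames near 1.0
speech = rng.uniform(5.0, 50.0, size=100)    # sparse high-energy frames
est = estimate_band_noise_energy(np.concatenate([noise, speech]))
print(est)   # close to 1.0, the simulated noise level
```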

  44. How to deemphasize the noise-polluted speech? • Spectral subtraction • Sub-band weighting

  45. Spectral subtraction (SS)

  46. SS is performed on the magnitude spectrum. • Basic SS: |Ŝ(f)| = |Y(f)| − |N̂(f)| • Generalized SS: |Ŝ(f)|^γ = |Y(f)|^γ − α·|N̂(f)|^γ • α: over-subtraction factor to control the subtraction extent • γ: sharpness exponent; the bigger, the smoother • Best values: α=3, γ=2. • SS can be performed on each sub-band individually.
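A sketch of generalized spectral subtraction, |S|^γ = |Y|^γ − α|N|^γ, applied to one spectrum. The spectral floor is an added assumption to keep the result non-negative (the slides do not specify one):

```python
import numpy as np

def generalized_spectral_subtraction(Y_mag, N_mag, alpha=3.0, gamma=2.0,
                                     floor=0.01):
    """Generalized spectral subtraction on a magnitude spectrum:
        |S|^gamma = |Y|^gamma - alpha * |N|^gamma
    alpha is the over-subtraction factor, gamma the sharpness exponent
    (gamma=1: magnitude SS, gamma=2: power SS). The floor keeps the
    result from going negative; its value is an assumption."""
    Y_mag = np.asarray(Y_mag, dtype=float)
    N_mag = np.asarray(N_mag, dtype=float)
    diff = Y_mag ** gamma - alpha * N_mag ** gamma
    floored = np.maximum(diff, (floor * N_mag) ** gamma)
    return floored ** (1.0 / gamma)

noisy = np.array([10.0, 4.0, 1.0])   # |Y(f)| per bin
noise = np.array([1.0, 1.0, 1.0])    # estimated |N(f)| per bin
print(generalized_spectral_subtraction(noisy, noise))
```

The third bin shows why a floor is needed: over-subtraction would otherwise drive its power negative.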

  47. Sub-band Weighting (SW) • The spectrum of noisy speech is divided into sub-bands; • the spectrum of each sub-band is multiplied by a weighting function: P′y(f, b) = w(f, b)·Py(f, b). • Assume the noise in a specified sub-band is uniform: the weighting function can then be simplified as w(f, b) ≈ w(b). • The weighting function should be an increasing function of the SNR in that band, or, for simplification, an increasing function of the average signal energy and a decreasing function of the noise energy.

  48. Therefore, the weighting function can be defined accordingly. [Equations not reproduced in the transcript.]
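Since the transcript does not reproduce the actual weighting function, the sketch below uses an assumed ratio form w(b) = Es(b)/(Es(b) + β·En(b)), which has the stated properties (increasing in the band's signal energy, decreasing in its noise energy):

```python
import numpy as np

def subband_weight(signal_energy, noise_energy, beta=1.0):
    """One plausible weighting function with the properties stated on
    the slide. The exact function used in the talk is not reproduced
    in the transcript; this ratio form (and beta) is an assumption."""
    return signal_energy / (signal_energy + beta * noise_energy)

def apply_subband_weighting(P_y, band_slices, noise_energies):
    """P'_y(f, b) = w(b) * P_y(f, b), one weight per sub-band (noise
    assumed uniform within a band, so w(f, b) ~ w(b))."""
    P_out = np.array(P_y, dtype=float)
    for sl, n_e in zip(band_slices, noise_energies):
        s_e = P_out[sl].mean()                 # average band signal energy
        P_out[sl] *= subband_weight(s_e, n_e)  # deemphasize noisy bands
    return P_out

# Two bands of 4 bins each; the second band is much noisier
P = np.ones(8)
bands = [slice(0, 4), slice(4, 8)]
print(apply_subband_weighting(P, bands, noise_energies=[0.1, 10.0]))
```

The noisier band is scaled down by an order of magnitude more than the clean one, which is exactly the deemphasis the slides call for.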

  49. Integration of SS and/or SW into the MFCC calculation • MFCC takes human hearing perception into account (the mel scale); • sub-band division is embedded in the MFCC calculation.

  50. Flow Chart for the Modified MFCC Calculation • [Flow: spectrum calculation (DFT) → sub-band division (mel scale) → noise energy estimation → spectral subtraction → weight estimation → sub-band weighting → sub-band energy calculation → log|·| → DCT → MFCC]
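The flow above can be sketched end to end for a single frame. Rectangular bands stand in for the mel filterbank and the weighting function is assumed, so this is a structural sketch rather than the talk's exact pipeline:

```python
import numpy as np

def modified_mfcc_frame(frame, band_edges, noise_band_energies,
                        n_ceps=13, alpha=1.0):
    """Sketch of the modified MFCC flow: DFT -> sub-band division ->
    per-band noise subtraction and weighting -> log -> DCT.
    Rectangular bands stand in for the mel filterbank, and the
    weighting function is an assumed SNR-style ratio."""
    spec = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum
    log_band = []
    for (lo, hi), n_e in zip(band_edges, noise_band_energies):
        e = spec[lo:hi].sum()                         # sub-band energy
        e = max(e - alpha * n_e, 1e-10)               # spectral subtraction
        w = e / (e + n_e)                             # sub-band weighting
        log_band.append(np.log(w * e))                # log |.|
    log_band = np.array(log_band)
    # DCT-II over the log band energies -> cepstral coefficients
    B = len(log_band)
    n = np.arange(B)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / B)
    return dct @ log_band

# 64-sample frame of a sine in noise, 4 rectangular "mel" bands
rng = np.random.default_rng(1)
frame = (np.sin(2 * np.pi * 8 * np.arange(64) / 64)
         + 0.1 * rng.standard_normal(64))
bands = [(0, 8), (8, 16), (16, 24), (24, 33)]
c = modified_mfcc_frame(frame, bands, noise_band_energies=[0.5] * 4, n_ceps=4)
print(c.shape)   # (4,)
```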
