This work presents a Growth-Transformation (GT) approach to MCE-based model estimation for Automatic Speech Recognition. The method directly optimizes the MCE objective function, a smoothed classification-error rate, and is shown to scale to Microsoft's telephony speech recognition system, improving accuracy in real-world applications.
Discriminative Learning for Hidden Markov Models
Li Deng, Microsoft Research
EE 516, UW, Spring 2009
Minimum Classification Error (MCE) • The objective function of MCE training is a smoothed recognition error rate. • Traditionally, the MCE criterion is optimized by stochastic gradient descent (e.g., GPD). • In this work we propose a Growth-Transformation-based method for MCE model estimation.
Automatic Speech Recognition (ASR) Speech recognition: given an observation sequence X, find the most likely word string S.
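The equation on this slide was an image lost in conversion; it is presumably the standard MAP decision rule, which in LaTeX reads:

```latex
\hat{S} = \arg\max_{S} P(S \mid X)
        = \arg\max_{S} \; p(X \mid S; \Lambda)\, P(S)
```

Here p(X | S; Λ) is the acoustic model (HMM) score and P(S) the language model score.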
Models (feature functions) in ASR ASR in the log-linear framework. Λ is the parameter set of the acoustic model (HMM), which is the quantity of interest in the MCE training of this work.
MCE: Misclassification measure Define the misclassification measure (here using the correct string and the single top incorrect competing string). s_r,1: the top incorrect competing string (not equal to S_r).
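The measure itself was an image on the original slide; in standard MCE notation (for the correct string vs. the top-one competitor) it is the log-likelihood difference:

```latex
d_r(X_r, \Lambda) = -\log p(X_r, S_r \mid \Lambda) + \log p(X_r, s_{r,1} \mid \Lambda)
```

so d_r > 0 exactly when the competitor outscores the correct transcription, i.e., a recognition error.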
MCE: Loss function Classification error: d_r(X_r, Λ) > 0 → 1 classification error; d_r(X_r, Λ) < 0 → 0 classification error. Loss function: a smoothed error-count function (a sigmoid of d_r).
MCE: Objective function MCE objective function: L_MCE(Λ) is the smoothed recognition error rate at the string (token) level. The acoustic model is trained to minimize L_MCE(Λ), i.e., Λ* = argmin_Λ {L_MCE(Λ)}.
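The loss and objective of the last two slides can be sketched in a few lines of Python. This is an illustrative toy (the smoothing parameters `alpha`, `beta` and the log-likelihood values are made up), not the paper's implementation:

```python
import math

def misclassification_measure(logp_correct, logp_competitor):
    """d_r = -log p(X_r, S_r | L) + log p(X_r, s_r1 | L); d_r > 0 means an error."""
    return -logp_correct + logp_competitor

def smoothed_error(d, alpha=1.0, beta=0.0):
    """Sigmoid loss: tends to 1 for d >> 0 (error) and to 0 for d << 0 (correct)."""
    return 1.0 / (1.0 + math.exp(-alpha * d + beta))

def mce_objective(pairs, alpha=1.0):
    """L_MCE: average smoothed string-level error over training tokens."""
    return sum(smoothed_error(misclassification_measure(c, w), alpha)
               for c, w in pairs) / len(pairs)

# Token 1 is correctly recognized (correct path scores higher); token 2 is not.
pairs = [(-100.0, -105.0), (-80.0, -78.0)]
print(mce_objective(pairs))
```

Training then means adjusting Λ (the HMM parameters behind the log-likelihoods) to drive this average down.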
MCE: Optimization • Growth-Transformation-based MCE: if Λ = T(Λ′) ensures P(Λ) > P(Λ′), i.e., P(Λ) grows, then T(∙) is called a growth transformation of Λ for P(Λ). The chain of reformulations:
Minimizing L_MCE(Λ) = ∑ l﴾d(∙)﴿
⇔ Maximizing P(Λ) = G(Λ)/H(Λ)
→ Maximizing F(Λ;Λ′) = G − P′×H + D
→ Maximizing U(Λ;Λ′) = ∑ f′(∙) log f(∙)
→ GT formula: ∂U(∙)/∂Λ = 0 ⇒ Λ = T(Λ′)
MCE: Optimization Rewrite the MCE loss function so that minimizing L_MCE(Λ) is equivalent to maximizing Q(Λ), the smoothed count of correctly recognized tokens.
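The rewritten objective was an image on the original slide; since 1 − l(d_r) is a smoothed indicator of correct recognition, it presumably takes the standard form (shown here for sigmoid slope α = 1):

```latex
Q(\Lambda) = \sum_{r=1}^{R} \bigl(1 - l(d_r(X_r,\Lambda))\bigr)
           = \sum_{r=1}^{R}
             \frac{p(X_r, S_r \mid \Lambda)}
                  {p(X_r, S_r \mid \Lambda) + p(X_r, s_{r,1} \mid \Lambda)}
```

The second equality follows by substituting d_r into the sigmoid and simplifying.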
MCE: Optimization Q(Λ) is further reformulated into a single fractional (rational) function P(Λ) = G(Λ)/H(Λ).
MCE: Optimization Increasing P(Λ) can be achieved by maximizing F(Λ;Λ′) = G(Λ) − P(Λ′)H(Λ) + D, as long as D is a Λ-independent constant (Λ′ is the parameter set obtained from the last iteration). Substituting G(∙) and H(∙) into F(∙) gives the auxiliary objective to be maximized.
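A one-line argument (standard for growth transforms of rational functions, though not spelled out on the slide) shows why maximizing F increases P. Since F(Λ′;Λ′) = D,

```latex
F(\Lambda;\Lambda') - F(\Lambda';\Lambda')
  = G(\Lambda) - P(\Lambda')\,H(\Lambda)
  = H(\Lambda)\,\bigl(P(\Lambda) - P(\Lambda')\bigr)
```

so with H(Λ) > 0, any Λ that increases F over F(Λ′;Λ′) also satisfies P(Λ) > P(Λ′).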
MCE: Optimization Reformulate F(Λ;Λ′) into a form that is ready for EM-style optimization. Note: Γ(Λ′) is a constant, and log p(χ, q | s, Λ) is easy to decompose.
MCE: Optimization Increasing F(Λ;Λ′) can be achieved by maximizing U(Λ;Λ′). Use extended Baum-Welch for the E-step. log f(χ, q, s, Λ; Λ′) is decomposable w.r.t. Λ, so the M-step is easy to compute. The resulting growth transformation of Λ for the CDHMM follows.
MCE: Model estimation formulas For a Gaussian-mixture CDHMM, the GT updates of the mean and covariance of Gaussian component m take closed form, expressed in terms of numerator (correct-string) and denominator (competing-string) occupancy statistics.
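The update formulas were images on the original slide; the sketch below implements the standard EBW/GT form they presumably follow, for a single diagonal-covariance Gaussian: statistics from the correct strings minus statistics from the competing strings, smoothed toward the old parameters by the constant D_m. All names and the toy numbers are illustrative:

```python
import numpy as np

def gt_update_gaussian(num_occ, num_x, num_x2,   # stats from correct strings
                       den_occ, den_x, den_x2,   # stats from competing strings
                       mu_old, var_old, D):
    """EBW/GT-style re-estimation for one diagonal-covariance Gaussian.

    num_x / num_x2 are per-dimension sums of gamma * x and gamma * x^2
    over frames; D is the per-Gaussian smoothing constant D_m.
    """
    denom = (num_occ - den_occ) + D
    mu_new = (num_x - den_x + D * mu_old) / denom
    # Second-moment form of the variance update.
    var_new = (num_x2 - den_x2 + D * (var_old + mu_old**2)) / denom - mu_new**2
    return mu_new, var_new

# Toy 1-D check: with no denominator stats and a large D, the update
# stays close to the old parameters (heavy smoothing).
mu, var = gt_update_gaussian(10.0, np.array([5.0]), np.array([12.0]),
                             0.0, np.array([0.0]), np.array([0.0]),
                             np.array([0.0]), np.array([1.0]), D=100.0)
```

A large enough D_m keeps the denominator positive and the new variance positive, which is exactly the role the next slide assigns to it.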
MCE: Model estimation formulas Setting of D_m: theoretically, set D_m so that f(χ, q, s, Λ; Λ′) > 0; empirically, D_m is set by a practical rule.
MCE: Workflow
1. Start from the model Λ′ of the last iteration.
2. Recognize the training utterances to generate competing strings.
3. Combine the competing strings with the training transcripts.
4. Run GT-MCE to estimate the new model Λ.
5. Use Λ as Λ′ for the next iteration.
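The workflow above can be sketched as a loop. `recognize` and `gt_mce_reestimate` are hypothetical stand-ins for a real decoder and the growth-transformation update (here reduced to trivial stubs so the loop structure runs):

```python
def recognize(model, utt):
    """Stand-in decoder: returns a competing string for one utterance."""
    return f"competitor-of-{utt}"

def gt_mce_reestimate(model, utts, refs, competing):
    """Stand-in GT update: a real one would apply the formulas above."""
    return model + 1  # dummy "model" is just an iteration counter here

def gt_mce_train(model, train_utts, transcripts, n_iters=3):
    """GT-MCE workflow: decode -> re-estimate -> repeat."""
    for _ in range(n_iters):
        # Step 2: generate competing strings with the current model.
        competing = [recognize(model, x) for x in train_utts]
        # Steps 3-4: re-estimate from transcripts vs. competitors.
        model = gt_mce_reestimate(model, train_utts, transcripts, competing)
    return model

final = gt_mce_train(model=0, train_utts=["u1", "u2"], transcripts=["t1", "t2"])
```

The key point of the workflow is that competitors are regenerated every iteration, because the updated model changes which strings compete.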
Experiment: TI-DIGITS • Vocabulary: "1" to "9", plus "oh" and "zero" • Training set: 8623 utterances / 28329 words • Test set: 8700 utterances / 28583 words • 33-dimensional spectrum feature: energy + 10 MFCCs, plus ∆ and ∆∆ features • Model: Continuous-Density HMMs • Total number of Gaussian components: 3284
Experiment: TI-DIGITS GT-MCE vs. the ML (maximum likelihood) baseline: • Obtains the lowest error rate on this task • Reduces recognition Word Error Rate (WER) by 23% • Fast and stable convergence
Experiment: Microsoft Tele. ASR • Microsoft Speech Server – ENUTEL, a telephony speech recognition system • Training set: 2000 hours of speech / 2.7 million utterances • 33-dim spectrum features: (E + MFCCs) + ∆ + ∆∆ • Acoustic model: Gaussian-mixture HMM • Total number of Gaussian components: 100K • Vocabulary: 120K (delivered vendor lexicon) • CPU cluster: 100 CPUs @ 1.8GHz – 3.4GHz • Training cost: 4~5 hours per iteration
Experiment: Microsoft Tele. ASR • Evaluated on four corpus-independent test sets • Collected from sites other than the training-data providers • Covering the major commercial telephony ASR scenarios
Experiment: Microsoft Tele. ASR • Significant performance improvements across the board • The first time MCE has been successfully applied to a 2000-hour speech database • Growth-Transformation-based MCE training is well suited to large-scale modeling tasks