
14.0 Some Fundamental Problem-solving Approaches





  1. 14.0 Some Fundamental Problem-solving Approaches
References:
1. Sections 4.3.2 and 4.4.2 of Huang, or Sections 9.1–9.3 of Jelinek
2. Section 6.4.3 of Rabiner and Juang
3. Section 9.8 of Gold and Morgan, "Speech and Audio Signal Processing"
4. http://www.stanford.edu/class/cs229/materials.html
5. "Minimum Classification Error Rate Methods for Speech Recognition", IEEE Trans. Speech and Audio Processing, May 1997
6. "Segmental Minimum Bayes-Risk Decoding for Automatic Speech Recognition", IEEE Trans. Speech and Audio Processing, 2004
7. "Minimum Phone Error and I-smoothing for Improved Discriminative Training", International Conference on Acoustics, Speech and Signal Processing, 2002

  2. Parameter Estimation Principles
Goal: estimating the parameters of some probabilistic model based on some criterion, given observations X = [x1, x2, ..., xN]
• Maximum Likelihood (ML) Principle: find the model parameter set Θ such that the likelihood function is maximized, P(X|Θ) = max. For example, if Θ = {μ, Σ} are the mean and covariance of a normal distribution and X is i.i.d., then the ML estimates are the sample mean μ = (1/N) Σi xi and the sample covariance Σ = (1/N) Σi (xi − μ)(xi − μ)T
• Maximum A Posteriori (MAP) Principle: find the model parameter set Θ so that the a posteriori probability is maximized, i.e. P(Θ|X) = P(X|Θ)P(Θ)/P(X) = max, which is equivalent to P(X|Θ)P(Θ) = max
• EM (Expectation and Maximization) Algorithm
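As a small illustration of the ML principle above, the following sketch (Python/NumPy; the function and variable names are our own, not from the slides) computes the ML estimates of μ and Σ from i.i.d. Gaussian observations. Note the division by N rather than N−1: the ML covariance estimate is the (biased) sample covariance.

    import numpy as np

    def ml_gaussian_estimate(X):
        """ML estimates of a multivariate Gaussian from i.i.d. rows of X (N x d)."""
        N = X.shape[0]
        mu = X.mean(axis=0)                  # sample mean, (1/N) * sum_i x_i
        centered = X - mu
        Sigma = centered.T @ centered / N    # ML covariance (divide by N, not N-1)
        return mu, Sigma

    # Example: 1000 samples drawn from a known 2-D Gaussian
    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 0.5]], size=1000)
    mu_hat, Sigma_hat = ml_gaussian_estimate(X)
    print(mu_hat, Sigma_hat)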

  3. EM (Expectation and Maximization) Algorithm
Why EM? In some cases the evaluation of the objective function (e.g. the likelihood function) depends on some intermediate variables (latent data) which are not observable (e.g. the state sequence for HMM parameter training); direct estimation of the desired parameters without such latent data is impossible or difficult, e.g. it is almost impossible to estimate {A, B, π} for an HMM without considering the state sequence
Iterative procedure with two steps in each iteration (a minimal skeleton is sketched below):
• E (Expectation): expectation with respect to the possible distribution (values and probabilities) of the latent data, based on the current estimates of the desired parameters and conditioned on the given observations
• M (Maximization): generating a new set of estimates of the desired parameters by maximizing the objective function (e.g. according to ML or MAP)
The objective function increases after each iteration and eventually converges
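To make the two-step structure concrete, here is a minimal, generic EM skeleton in Python; e_step and m_step are hypothetical callbacks standing in for the model-specific computations (they are not defined on the slides).

    def em(observations, params, e_step, m_step, max_iter=100, tol=1e-6):
        """Generic EM loop: alternate E and M steps until the log-likelihood stalls.

        e_step(observations, params) -> (latent_posteriors, log_likelihood)
        m_step(observations, latent_posteriors) -> new params
        (both are model-specific callbacks supplied by the caller)
        """
        prev_ll = float("-inf")
        for _ in range(max_iter):
            posteriors, ll = e_step(observations, params)  # E: P(latent | obs, current params)
            params = m_step(observations, posteriors)      # M: maximize the expected log-likelihood
            if ll - prev_ll < tol:                         # non-decreasing by construction
                break
            prev_ll = ll
        return params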

  4. EM Algorithm: An Example
[Figure: two bottles A and B containing red (R) and green (G) balls; the observed output is the ball sequence RGG]
Observed data O: the "ball sequence" RGG; latent data q: the "bottle sequence", e.g. AAB
Parameters to be estimated: λ = {P(A), P(B), P(R|A), P(G|A), P(R|B), P(G|B)}
• First, randomly assign λ(0) = {P(0)(A), P(0)(B), P(0)(R|A), P(0)(G|A), P(0)(R|B), P(0)(G|B)}, for example {P(0)(A)=0.4, P(0)(B)=0.6, P(0)(R|A)=0.5, P(0)(G|A)=0.5, P(0)(R|B)=0.5, P(0)(G|B)=0.5}
• Expectation Step: find the expectation of log P(O, q|λ) over the 8 possible bottle sequences qi: {AAA}, {BBB}, {AAB}, {BBA}, {ABA}, {BAB}, {ABB}, {BAA}; for example, when qi = {AAB}, P(O, qi|λ) = P(A)P(R|A) · P(A)P(G|A) · P(B)P(G|B)
• Maximization Step: find λ(1) to maximize the expectation function Eq(log P(O, q|λ))
• Iterations: λ(0) → λ(1) → λ(2) → ... (a worked numerical sketch of one iteration follows below)
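A worked numerical sketch of one EM iteration for this example (Python), assuming the bottles are drawn independently at each step so that P(O, q|λ) = Πt P(qt)P(ot|qt); the numbers follow the initial guess λ(0) above, and the re-estimation uses standard expected-count (mixture-style) updates.

    from itertools import product

    # Initial guess lambda^(0) from the slide
    params = {"P(A)": 0.4, "P(B)": 0.6,
              "P(R|A)": 0.5, "P(G|A)": 0.5,
              "P(R|B)": 0.5, "P(G|B)": 0.5}
    O = ["R", "G", "G"]  # observed ball sequence

    def joint(q, params):
        """P(O, q | lambda): bottles drawn independently, one ball emitted per draw."""
        p = 1.0
        for bottle, ball in zip(q, O):
            p *= params[f"P({bottle})"] * params[f"P({ball}|{bottle})"]
        return p

    # E-step: posterior P(q | O, lambda^(0)) over all 8 bottle sequences
    seqs = list(product("AB", repeat=3))
    joints = {q: joint(q, params) for q in seqs}
    total = sum(joints.values())
    post = {q: p / total for q, p in joints.items()}

    # M-step: re-estimate from expected counts
    exp_draws = {"A": 0.0, "B": 0.0}
    exp_emit = {("R", "A"): 0.0, ("G", "A"): 0.0, ("R", "B"): 0.0, ("G", "B"): 0.0}
    for q, w in post.items():
        for bottle, ball in zip(q, O):
            exp_draws[bottle] += w
            exp_emit[(ball, bottle)] += w

    new_params = {}
    for b in "AB":
        new_params[f"P({b})"] = exp_draws[b] / len(O)
        for ball in "RG":
            new_params[f"P({ball}|{b})"] = exp_emit[(ball, b)] / exp_draws[b]
    print(new_params)  # lambda^(1)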

  5. EM Algorithm
In each iteration (assuming log P(x|Θ) is the objective function):
• E step: express the log-likelihood log P(x|Θ) in terms of the distribution of the latent data conditioned on [x, Θ(k)]
• M step: find a way to maximize the above function, such that it increases monotonically, i.e. log P(x|Θ(k+1)) ≥ log P(x|Θ(k))
The conditions for each iteration to proceed based on this criterion (the underlying decomposition is sketched below):
x: observed (incomplete) data, z: latent data, {x, z}: complete data
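The decomposition this slide relies on appeared as an image; a standard reconstruction (in LaTeX, using the slide's x, z, Θ notation) is:

    % complete-data identity
    \log P(x \mid \Theta) = \log P(x, z \mid \Theta) - \log P(z \mid x, \Theta)
    % take the expectation of both sides over z ~ P(z | x, \Theta^{(k)});
    % the left-hand side does not depend on z
    \log P(x \mid \Theta)
      = \underbrace{\sum_z P(z \mid x, \Theta^{(k)}) \log P(x, z \mid \Theta)}_{Q(\Theta, \Theta^{(k)})}
      - \underbrace{\sum_z P(z \mid x, \Theta^{(k)}) \log P(z \mid x, \Theta)}_{H(\Theta, \Theta^{(k)})}

Q is the auxiliary function introduced on the next slide; H is the matching term built from the latent-data posterior.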

  6. EM Algorithm
• For the EM iterations to proceed based on the criterion:
• to make sure log P(x|Θ[k+1]) ≥ log P(x|Θ[k])
• H(Θ[k+1], Θ[k]) ≤ H(Θ[k], Θ[k]) due to Jensen's Inequality
• the only requirement is to have Θ[k+1] such that Q(Θ[k+1], Θ[k]) − Q(Θ[k], Θ[k]) ≥ 0 (the argument is spelled out below)
• E-step: estimate Q(Θ, Θ[k]), the auxiliary function or Q-function: the expectation of the objective function in terms of the distribution of the latent data conditioned on (x, Θ[k])
• M-step: Θ[k+1] = arg maxΘ Q(Θ, Θ[k])
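Spelling out the monotonicity argument sketched above (a LaTeX reconstruction based on the decomposition from the previous slide):

    \log P(x \mid \Theta^{[k+1]}) - \log P(x \mid \Theta^{[k]})
      = \bigl[ Q(\Theta^{[k+1]}, \Theta^{[k]}) - Q(\Theta^{[k]}, \Theta^{[k]}) \bigr]
      - \bigl[ H(\Theta^{[k+1]}, \Theta^{[k]}) - H(\Theta^{[k]}, \Theta^{[k]}) \bigr]
    % Jensen's inequality bounds the H difference for any \Theta:
    H(\Theta, \Theta^{[k]}) - H(\Theta^{[k]}, \Theta^{[k]})
      = \sum_z P(z \mid x, \Theta^{[k]}) \log \frac{P(z \mid x, \Theta)}{P(z \mid x, \Theta^{[k]})}
      \le \log \sum_z P(z \mid x, \Theta) = \log 1 = 0

so any Θ[k+1] with Q(Θ[k+1], Θ[k]) ≥ Q(Θ[k], Θ[k]) guarantees log P(x|Θ[k+1]) ≥ log P(x|Θ[k]).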

  7. Example: Use of EM Algorithm in Solving Problem 3 of HMM
Observed data: observations O; latent data: state sequence q
The probability of the complete data is P(O, q|λ) = P(O|q, λ)P(q|λ)
E-Step: Q(λ, λ[k]) = E[log P(O, q|λ) | O, λ[k]] = Σq P(q|O, λ[k]) log[P(O, q|λ)]
λ[k]: k-th estimate of λ (known); λ: unknown parameter to be estimated
M-Step: find λ[k+1] such that λ[k+1] = arg maxλ Q(λ, λ[k])
Given the various constraints (e.g. Σj aij = 1, Σk bj(k) = 1, Σi πi = 1), it can be shown that the above maximization leads to the re-estimation formulas obtained previously (a sketch of how Q separates, and the resulting update for aij, is given below)
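The detailed formulas on this slide were images; a standard LaTeX reconstruction of how the Q-function separates over the HMM parameters, and the transition update that results, is:

    Q(\lambda, \lambda^{[k]})
      = \sum_q P(q \mid O, \lambda^{[k]})
        \Bigl[ \log \pi_{q_1}
             + \sum_{t=1}^{T-1} \log a_{q_t q_{t+1}}
             + \sum_{t=1}^{T}   \log b_{q_t}(o_t) \Bigr]
    % maximizing each term under its sum-to-one constraint (Lagrange multipliers) gives, e.g.
    \hat{a}_{ij}
      = \frac{\sum_{t=1}^{T-1} P(q_t = i, q_{t+1} = j \mid O, \lambda^{[k]})}
             {\sum_{t=1}^{T-1} P(q_t = i \mid O, \lambda^{[k]})}
      = \frac{\sum_t \xi_t(i, j)}{\sum_t \gamma_t(i)}

i.e. exactly the Baum–Welch re-estimation formula obtained previously; the updates for π and B follow the same pattern.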

  8. Minimum-Classification-Error (MCE) and Discriminative Training
General objective: find an optimal set of parameters (e.g. for recognition models) to minimize the expected error of classification
• the statistics of the test data may be quite different from those of the training data
• training data is never enough
Assume the recognizer operates with the following classification principle (a small sketch follows below):
• {Ci, i = 1, 2, ..., M}: M classes; λ(i): statistical model for Ci; Λ = {λ(i)}i=1...M: the set of all models for all classes; X: observations
• gi(X, Λ): class-conditioned likelihood function, for example gi(X, Λ) = P(X|λ(i))
• C(X) = Ci if gi(X, Λ) = maxj gj(X, Λ)
• an error happens when P(X|λ(i)) = max but X ∉ Ci
Conventional training criterion: find λ(i) such that P(X|λ(i)) is maximum (Maximum Likelihood) if X ∈ Ci
• this does not always lead to minimum classification error, since it doesn't consider the mutual relationship among competing classes
• the competing classes may give a higher likelihood for the test data
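A minimal sketch of the classification rule above (Python); the per-class scoring callables are hypothetical stand-ins for gi(X, Λ) = log P(X|λ(i)), since the slides do not fix a particular model:

    import numpy as np

    def classify(X, class_scores):
        """Return C(X) = C_i with the largest class-conditioned score g_i(X, Lambda).

        class_scores: one callable per class, each returning log P(X | lambda^(i))
        for the given observations X (model-specific, supplied by the caller).
        """
        scores = np.array([g(X) for g in class_scores])
        return int(np.argmax(scores))   # index i of the recognized class C_i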

  9. Minimum-Classification-Error (MCE) Training
One form of the misclassification measure: a comparison between the likelihood function for the correct class and those of the competing classes (a common concrete form is sketched below)
A continuous loss function l(d) is defined with:
• l(d) → 0 when d → −∞
• l(d) → 1 when d → ∞
• θ: the loss switches from 0 to 1 near θ (e.g. θ = 0)
• γ: determines the slope at the switching point
Overall classification performance measure: the loss summed over all training observations and classes, weighted by the indicator (1 if X ∈ Ci, 0 otherwise)
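The formulas on this slide were images; the sketch below (Python) uses forms common in the MCE literature cited in reference 5, so treat them as an assumption about what the slide showed: a misclassification measure di(X) comparing the correct-class score with a soft maximum over the competing classes, and a sigmoid loss l(d).

    import numpy as np

    def misclassification_measure(g, i, eta=1.0):
        """d_i(X) = -g_i + (1/eta) * log[ (1/(M-1)) * sum_{j != i} exp(eta * g_j) ].

        g: array of class scores g_j(X, Lambda) (e.g. log-likelihoods), i: correct class.
        d_i > 0 roughly means a competing class outscored the correct one.
        """
        g = np.asarray(g, dtype=float)
        competitors = np.delete(g, i)
        comp_term = np.log(np.mean(np.exp(eta * competitors))) / eta
        return -g[i] + comp_term

    def sigmoid_loss(d, gamma=1.0, theta=0.0):
        """Smooth 0/1 loss: l(d) -> 0 as d -> -inf, -> 1 as d -> +inf; gamma sets the slope."""
        return 1.0 / (1.0 + np.exp(-gamma * (d - theta)))

    # Example: correct class 0 narrowly beats its competitors
    g = [-10.0, -11.5, -12.0]
    d = misclassification_measure(g, i=0)
    print(d, sigmoid_loss(d, gamma=2.0))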

  10. Minimum-Classification-Error (MCE) Training
Find Λ such that the overall classification performance measure above is minimized
• the above objective function is in general difficult to minimize directly
• a local minimum can be obtained iteratively using the gradient (steepest) descent algorithm (a sketch follows below)
• every training observation may change the parameters of ALL models, not only the model for its own class
• partial differentiation is taken with respect to all the different parameters individually
• t: the t-th iteration; ε: the adjustment step size, which should be chosen carefully
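A minimal sketch of the descent loop (Python), implementing the update Λ(t+1) = Λ(t) − ε ∂L/∂Λ evaluated at Λ(t); loss_and_grad is a hypothetical placeholder for the MCE objective built from the loss on the previous slide together with its gradient.

    import numpy as np

    def gradient_descent(params, loss_and_grad, epsilon=0.01, num_iters=100):
        """Steepest descent: params(t+1) = params(t) - epsilon * dL/dparams at params(t).

        params: flat array of ALL model parameters (all classes together, since each
        training observation moves every model, not just the one for its own class).
        loss_and_grad(params) -> (loss value, gradient array), model-specific.
        """
        params = np.asarray(params, dtype=float)
        for t in range(num_iters):
            loss, grad = loss_and_grad(params)
            params = params - epsilon * grad   # step size epsilon must be chosen carefully
        return params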

  11. Discriminative Training for Large Vocabulary Speech Recognition
• Minimum Bayesian Risk (MBR)
• adjusting all model parameters to minimize the Bayesian risk
• Λ = {λi, i = 1, 2, ..., N}: acoustic models
• Γ: language model parameters
• Or: the r-th training utterance
• sr: the correct transcription of Or
• Bayesian risk: R(Λ, Γ) = Σr Σu PΛ,Γ(u|Or) L(u, sr) (a small computation sketch follows below)
• u: a possible recognition output
• L(u, sr): loss function
• PΛ,Γ(u|Or): posterior probability of u given Or based on Λ, Γ
• other definitions of L(u, sr) are possible
• Minimum Phone Error Rate (MPE) Training
• Acc(u, sr): phone accuracy, used in place of the loss (the expected phone accuracy is maximized)
• better features obtainable in the same way, e.g. yt = xt + Mht (feature-space MPE)
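As an illustration of the Bayesian risk above, the following sketch (Python) evaluates R over a set of training utterances using posteriors taken from an N-best list; the word-level edit distance used as L(u, sr) here is just one possible choice, not necessarily the one in the cited papers.

    import numpy as np

    def edit_distance(u, s):
        """Word-level Levenshtein distance, one possible loss L(u, s_r) (an assumption)."""
        m, n = len(u), len(s)
        D = np.zeros((m + 1, n + 1), dtype=int)
        D[:, 0] = np.arange(m + 1)
        D[0, :] = np.arange(n + 1)
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                D[i, j] = min(D[i - 1, j] + 1, D[i, j - 1] + 1,
                              D[i - 1, j - 1] + (u[i - 1] != s[j - 1]))
        return int(D[m, n])

    def bayesian_risk(nbest_per_utt, references):
        """R = sum_r sum_u P(u | O_r) * L(u, s_r), with P from normalized N-best posteriors."""
        risk = 0.0
        for (hyps, posts), s_r in zip(nbest_per_utt, references):
            posts = np.asarray(posts) / np.sum(posts)   # renormalize over the N-best list
            risk += sum(p * edit_distance(u, s_r) for u, p in zip(hyps, posts))
        return risk

    # Toy example: one utterance, two competing hypotheses
    nbest = [((["the", "cat", "sat"], ["a", "cat", "sat"]), (0.7, 0.3))]
    references = [["the", "cat", "sat"]]
    print(bayesian_risk(nbest, references))   # 0.7*0 + 0.3*1 = 0.3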
