
Bayesian Predictive Classification With Incremental Learning for Noisy Speech Recognition



Presentation Transcript


  1. Bayesian Predictive Classification With Incremental Learning for Noisy Speech Recognition 朱國華 89/12/06

  2. References • 簡仁宗、廖國鴻, “具有累進學習能力之貝氏預測法則在汽車語音辨識之應用” (Bayesian Predictive Rules with Incremental Learning Applied to Car Speech Recognition), ROCLING XIII, pp. 179-197, 2000. • H. Jiang, K. Hirose and Q. Huo, “Robust Speech Recognition Based on a Bayesian Prediction Approach”, IEEE Transactions on Speech and Audio Processing, vol. 7, no. 4, pp. 426-440, July 1999. • J.T. Chien, “Online Hierarchical Transformation of Hidden Markov Models for Speech Recognition”, IEEE Transactions on Speech and Audio Processing, vol. 7, no. 6, pp. 656-667, November 1999.

  3. Contents • Introduction • Problem Formulation • Some Decision Rules for ASR • Transform-Based Bayesian Predictive Classification (TBPC) • Derivation of the Bayesian Predictive Likelihood Measurement (BPLM) • Online Prior Evolution (OPE) • Experiments and Discussions

  4. Introduction • Transform-based Bayesian predictive classification (TBPC): a robust decision rule for noisy speech recognition. • Online prior evolution (OPE) to cope with the nonstationary testing environment (covering both environmental variation and speaker variation).

  5. Problem Formulation • Approximate MAP (Quasi-Bayesian, QB) estimation for ASR: n: index of the input test utterance. W: word content or syllable string of the input utterance. η: acoustic transformation parameter (function). X^(n) := {X1, X2, …, Xn}: the i.i.d. and successively observed block samples. φ^(n-1): the environmental statistics estimated from the previous input utterances X1, X2, …, X_(n-1).
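  The QB decision rule written out from these definitions (a reconstruction; the slide's equation is not preserved in this transcript, so the exact notation is assumed):

  $$(\hat{W}_n, \hat{\eta}_n) = \arg\max_{W,\,\eta}\; p(X_n \mid W, \eta)\; p(W, \eta \mid \varphi^{(n-1)})$$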

  6. Problem Formulation (cont.) • Assume W and η are independent, so the previous QB estimation can be rewritten as follows: {p(W): language model}
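  With the independence assumption p(W, η | φ^(n-1)) = p(W) p(η | φ^(n-1)), the rule factors as (a sketch consistent with the QB form above):

  $$\hat{W}_n = \arg\max_{W}\; p(W)\; \max_{\eta}\; p(X_n \mid W, \eta)\; p(\eta \mid \varphi^{(n-1)})$$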

  7. Some Decision Rules for ASR • Plug-in MAP Rule: • The performance of the plug-in MAP decision rule depends on the choice of estimation approach (ML, MAP, discriminative training, etc.), the nature and size of the training data, and the degree of mismatch between training and testing conditions. • Point estimation.
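  The plug-in MAP rule substitutes a point estimate of the acoustic model parameters, written here as Λ̂ (notation assumed):

  $$\hat{W} = \arg\max_{W}\; p(W)\; p(X \mid W, \hat{\Lambda})$$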

  8. Some Decision Rules for ASR (cont.) • Minimax Rule: • Nonparametric compensation. • Minimizes the upper bound of the worst-case probability of classification error. • Assumes the unknown true parameter Λ is a random variable with a uniform distribution in a neighborhood region Ω(Λ).
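  In the Merhav-Lee style minimax formulation, this leads to maximizing the likelihood over the neighborhood (a sketch; the symbols Λ and Ω(Λ) are assumed from the bullet above):

  $$\hat{W} = \arg\max_{W}\; p(W)\; \max_{\Lambda' \in \Omega(\Lambda)}\; p(X \mid W, \Lambda')$$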

  9. TBPC • Transform-Based Bayesian Predictive Classification (TBPC) Rule: • where the predictive likelihood is obtained by:
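  A reconstruction of the rule and its predictive likelihood from the surrounding slides (exact slide notation is assumed):

  $$\hat{W}_n = \arg\max_{W}\; p(W)\; \tilde{p}(X_n \mid W, \varphi^{(n-1)})$$

  $$\tilde{p}(X_n \mid W, \varphi^{(n-1)}) = \int p(X_n \mid W, G_{\eta}(\Lambda))\; p(\eta \mid \varphi^{(n-1)})\; d\eta$$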

  10. TBPC (cont.) • TBPC treats the transformed parameter as a random variable (not a point estimate). • The average is taken both with respect to the sampling variation in the expected testing data and with respect to the uncertainty of the transformation parameter described by the prior pdf p(η | φ^(n-1)). • TBPC can be applied in both supervised and unsupervised learning environments.

  11. TBPC (cont.) • Transformation-based Adaptation: • For a given HMM model with L states and K mixtures, Λ = {λ_i} = {ω_ik, μ_ik, r_ik}, i = 1…L, k = 1…K, the estimated transformation function G^(n)(·) for the given testing utterance X_n is defined as: • where c is the index of the transformation cluster (hierarchical transformation).
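  A minimal sketch of such a transformation, assuming the per-cluster mean-bias form used in Chien's online hierarchical transformation work (the slide's exact G is not preserved here):

  $$G^{(n)}_{\eta}(\mu_{ik}) = \mu_{ik} + \eta_c, \quad \text{for each Gaussian } (i,k) \text{ assigned to cluster } c$$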

  12. TBPC (cont.) • Implementation (Approach I): • Considering the missing-data problem, we use the Viterbi TBPC for the likelihood: • A frame-synchronous Viterbi Bayesian search algorithm can be utilized to reduce the memory space and computational load (Jiang, IEEE SAP 1999).
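  Viterbi TBPC replaces the sum over hidden state/mixture sequences with the single best path (a reconstruction; notation assumed):

  $$\hat{W}_n = \arg\max_{W}\; p(W)\; \max_{s,\,l}\; \int p(X_n, s, l \mid W, G_{\eta}(\Lambda))\; p(\eta \mid \varphi^{(n-1)})\; d\eta$$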

  13. TBPC (cont.) • Implementation (Approach I, cont.): • Jiang (IEEE SAP 1999) considered only the uncertainty of the mean vectors of CDHMMs with diagonal covariance matrices, and assumed the means are uniformly distributed in a neighborhood of the pretrained means (no online adaptation).

  14. TBPC (cont.) • Implementation (Approach II): • The Bayesian Predictive Density Based Model Compensation (BP-MC) of the K-mixture state observation pdf is: • where f(x_t^(n) | λ_ik) is the Bayesian predictive density, defined below:
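  A reconstruction of the two densities from the definitions above (notation assumed):

  $$b_i(x_t^{(n)}) = \sum_{k=1}^{K} \omega_{ik}\; f(x_t^{(n)} \mid \lambda_{ik})$$

  $$f(x_t^{(n)} \mid \lambda_{ik}) = \int p(x_t^{(n)} \mid \lambda_{ik}, \eta_c)\; p(\eta_c \mid \varphi_c^{(n-1)})\; d\eta_c$$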

  15. TBPC (cont.) • Implementation (Approach II, cont.): • The choice of prior pdf: • In Chien (ROCLING 2000), the prior is a multivariate Gaussian pdf, which is the conjugate prior for the mean of a Gaussian observation density.
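  Under that choice, the cluster prior and its hyperparameters take the form (a sketch; the hyperparameter names m_c and τ_c are assumptions):

  $$p(\eta_c \mid \varphi_c^{(n-1)}) = \mathcal{N}\!\left(\eta_c;\; m_c^{(n-1)},\; \big(\tau_c^{(n-1)}\big)^{-1}\right), \qquad \varphi_c^{(n-1)} = \{m_c^{(n-1)}, \tau_c^{(n-1)}\}$$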

  16. Derivation of the BPLM • Since p(x_t^(n) | λ_ik, η_c) and p(η_c | φ_c^(n-1)) are both Gaussian, we can derive f(x_t^(n) | λ_ik) in closed form (assume both τ_c and r_ik are diagonal precision matrices):
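  This is the standard conjugate-Gaussian result: integrating a Gaussian likelihood against a Gaussian prior on the mean bias yields another Gaussian whose covariances add (a sketch consistent with the assumptions above):

  $$f(x_t^{(n)} \mid \lambda_{ik}) = \mathcal{N}\!\left(x_t^{(n)};\; \mu_{ik} + m_c^{(n-1)},\; r_{ik}^{-1} + \big(\tau_c^{(n-1)}\big)^{-1}\right)$$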

  17. Online Prior Evolution • Viterbi Approach: • where (s_n^*, l_n^*) are the most likely state and mixture-component sequences corresponding to X_n, respectively.
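  A reconstruction of the Viterbi-approximated prior update (notation assumed): the evolved prior is the posterior given the best path,

  $$p(\eta \mid \varphi^{(n)}) \approx p(\eta \mid X_n, s_n^*, l_n^*, \varphi^{(n-1)}) \propto p(X_n, s_n^*, l_n^* \mid G_{\eta}(\Lambda))\; p(\eta \mid \varphi^{(n-1)})$$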

  18. Online Prior Evolution (cont.) • The parameter statistics of the c-th cluster are:
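  A sketch of the conjugate update under the Gaussian bias model above, with diagonal precisions; the sums run over frames t aligned by (s_n^*, l_n^*) to Gaussians (i, k) in cluster c (the slide's exact equations are assumed):

  $$\tau_c^{(n)} = \tau_c^{(n-1)} + \sum_{(t,i,k) \in c} r_{ik}$$

  $$m_c^{(n)} = \big(\tau_c^{(n)}\big)^{-1} \left( \tau_c^{(n-1)} m_c^{(n-1)} + \sum_{(t,i,k) \in c} r_{ik} \big(x_t^{(n)} - \mu_{ik}\big) \right)$$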

  19. Online Prior Evolution (cont.) • From the above derivation, we can adapt (learn) the hyperparameters φ_c^(n) online from φ_c^(n-1).

  20. Online Prior Evolution (cont.) • We can estimate the initial hyperparameters φ_c^(0) from the given training data.

  21. Experiments • Training and testing data set I (Mic1, clean): • 70 males and 70 females; each person recorded 10 continuous Mandarin digit sentences. • 50 males' and 50 females' utterances are used for training; the remaining 20 males' and 20 females' are used for testing. • Training and testing data set II (Mic2, noisy): • 2 males + 2 females in a Toyota Corolla 1.8. • 3 males + 3 females in a Nissan Sentra 1.6. • Each speaker individually recorded 10 sentences at idle speed, 20 sentences at 50 km/h, and 30 sentences at 90 km/h. • 5 sentences per speaker are chosen arbitrarily for training; the others are used for testing.

  22. Experiments (cont.) • Signal-to-Noise Ratio:
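  A standard definition of the signal-to-noise ratio, added here for reference (s is the speech signal, n the noise):

  $$\mathrm{SNR} = 10 \log_{10} \frac{\sum_t s^2(t)}{\sum_t n^2(t)} \;\; \mathrm{dB}$$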

  23. Experiments (cont.) • Recognizer Structure • Features: 12th-order LPC-derived cepstra and Δ-cepstra, plus Δ and ΔΔ log energy. • HMM models: 7 states and 4 mixtures for each digit model, plus 3 different single-state background-noise models.

  24. Experiments (cont.) • Baseline results:

  25. Experiments (cont.) • Supervised DER as a function of the number of training utterances.

  26. Experiments (cont.) • Unsupervised TBPC-OPE DER (values in parentheses are the % improvement). • In the 2-cluster case, the 10 digit models form one cluster and the 3 background-noise models form the other.

  27. Experiments (cont.) • Unsupervised Performance Comparison of different BPC approaches:

  28. Discussions • Jiang's results are the worst of all because the prior distribution is fixed. • Surendran's results are worse than TBPC-OPE because its prior-pdf adaptation relies only on the current input utterance rather than on the accumulated ones. • In the BP-MC approach, the mixture weights (Dirichlet distribution) and variances (Wishart distribution) of the HMM could also be adjusted jointly with the means.
