Dimensionality Reduction Methods for HMM Phonetic Recognition

Hongbing Hu, Stephen A. Zahorian
Department of Electrical and Computer Engineering, Binghamton University, Binghamton, NY 13902, USA
hongbing.hu@binghamton.edu, zahorian@binghamton.edu

[Block diagram: original features → neural network → network outputs (NLDA1) or middle-layer outputs (NLDA2) → PCA → dimensionality reduced, decorrelated features]

Introduction
• Accurate Automatic Speech Recognition (ASR) requires:
  • Highly discriminative features
  • Incorporation of nonlinear frequency scales and time dependency
  • Low-dimensionality feature spaces
  • Efficient recognition models (HMMs & Neural Networks)
• Neural Network Based Dimensionality Reduction
  • Neural Networks (NNs) are used to represent complex data while preserving the variability and discriminability of the original data
  • Combined with an HMM recognizer, they form a hybrid NN/HMM recognition model

NLDA Reduction
• Overview
  • Nonlinear Discriminant Analysis (NLDA): a multilayer neural network performs a nonlinear feature transformation of the input speech features
  • Phone models are trained on the transformed features using HMMs, with each state modeled by a GMM (Gaussian Mixture Model)
  • PCA performs a Karhunen-Loève (KL) transform to reduce the correlation of the network outputs

NLDA1 Method
• Dimensionality Reduced Features
  • Obtained at the output layer of the neural network
  • Feature dimensionality is further reduced by PCA
• Node Nonlinearity (Activation Function)
  • In the feature transformation, a linear function is used for the output layer and a sigmoid nonlinearity for the other layers
  • In NLDA1 training, all layers are nonlinear

NLDA2 Method
• Reduced Features
  • Use the outputs of the network's middle hidden layer
  • The reduced dimensionality is determined by the number of middle nodes, giving flexibility in the reduced feature dimensionality
  • Linear PCA is used only for feature decorrelation
• Nonlinearity
  • All nonlinear layers are used in both the feature transformation and network training

Experimental Setup
• TIMIT Database ('SI' and 'SX' sentences only)
  • 48-phoneme set mapped down from the 62-phoneme set
  • Training data: 3696 sentences (460 speakers)
  • Testing data: 1344 sentences (168 speakers)
• DCTC/DCSC Features
  • A total of 78 features (13 DCTCs × 6 DCSCs) were computed
  • 10 ms frames with 2 ms frame spacing, and 8 ms block spacing with 1 s block length
• HMM
  • 3-state left-to-right Markov models with no skip
  • 48 monophone HMMs were created using HTK (ver. 3.4)
  • Language model: phone bigram information from the training data
• Neural Networks in NLDA
  • 3 hidden layers: 500-36-500 nodes
  • Input layer: 78 nodes, corresponding to the feature dimensionality
  • Output layer: 48 nodes for the phoneme targets, or 144 nodes for the state level targets

Training Neural Networks
• Phone Level Targets
  • Each NN output corresponds to a specific phone
  • Straightforward to implement using a phonetically labeled training database
  • But why should an NN output be forced to the same value for the entire phone?
• State Level Targets
  • Each NN output corresponds to a single state of a phone HMM
  • But state boundaries must be determined:
    • Estimate using a percentage of total length, or
    • Use an initial training iteration, then Viterbi alignment
• For 3-state models, train using "Don't Cares" (see the sketch after this list)
  • For the 1st portion, the target is "1" for state 1 and "Don't Care" for states 2 and 3
  • For the 2nd portion, the target is "1" for state 2 and "Don't Care" for states 1 and 3
  • For the 3rd portion, the target is "1" for state 3 and "Don't Care" for states 1 and 2
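To make the state level targets with "Don't Cares" concrete, here is a minimal sketch in Python/NumPy, assuming the fixed 1:4:1 length ratio described above; the function name, the (phone, state) column layout, and the use of a separate mask array to encode "Don't Cares" are illustrative assumptions, not details taken from the poster.

import numpy as np

def state_level_targets(num_frames, phone_index, num_phones, ratio=(1, 4, 1)):
    # Build per-frame NN targets for one labeled phone segment.
    # targets: one column per (phone, state) pair, "1" for the active state.
    # mask: 1 where the output contributes to the training error, 0 for
    # "Don't Care" outputs (the other two states of the current phone).
    num_states = len(ratio)
    targets = np.zeros((num_frames, num_phones * num_states))
    mask = np.ones_like(targets)  # outputs for all other phones stay "0" and count

    # Assign each frame to a state using the fixed 1:4:1 length ratio
    bounds = np.cumsum(ratio) / np.sum(ratio)
    frame_pos = (np.arange(num_frames) + 0.5) / num_frames
    frame_state = np.minimum(np.searchsorted(bounds, frame_pos), num_states - 1)

    cols = phone_index * num_states + np.arange(num_states)
    for t, s in enumerate(frame_state):
        targets[t, cols[s]] = 1.0  # active state target is "1"
        mask[t, cols] = 0.0        # "Don't Care" for this phone's states...
        mask[t, cols[s]] = 1.0     # ...except the active one
    return targets, mask

# Example: a 12-frame segment of phone 5 in the 48-phone set -> 144 target columns
targets, mask = state_level_targets(12, 5, 48)

During training, the per-output error would be multiplied by the mask, so the two inactive states of the current phone place no constraint on the network; Viterbi-aligned boundaries (the "(A)" condition in the results below) would simply replace the fixed-ratio frame_state assignment.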
Experimental Results
• Control Experiment
  • Compare the original DCTC/DCSC features with the PCA- and LDA-reduced features
  • Use various numbers of mixtures in the HMMs
  • Accuracies reported for the original, PCA-, and LDA-reduced features (20 & 36 dimensions)
• NLDA Experiment
  • Evaluate NLDA1 and NLDA2 with and without PCA
  • 48-dimensional phoneme level targets used
  • Features reduced to 36 dimensions
  • Accuracies reported for the NLDA1 and NLDA2 reduced features, and for the reduced features without PCA processing
• State Level Target Exp 1
  • Compare state level targets with and without "Don't Cares"
  • 144-dimensional state level targets used
  • State boundaries obtained using the fixed state length method (3 states, 1:4:1)
  • Network training: 8×10^6 weight updates
  • Accuracies reported for state level targets with and without "Don't Cares"
• State Level Target Exp 2
  • Use a fixed length ratio and Viterbi alignment for the state targets
  • State level targets with "Don't Cares" used
  • Targets obtained using a fixed length ratio (3 states, 1:4:1) and Viterbi alignment
  • Network training: 4×10^7 weight updates
  • Accuracies reported; "(R)" indicates a fixed length ratio and "(A)" Viterbi forced alignment
• Literature Comparison
  • Recognition accuracy based on TIMIT

Conclusions
• Very high recognition accuracies were obtained using the outputs of the network middle layer, as in NLDA2
• The NLDA methods produce a low-dimensional yet effective representation of speech features
• The NLDA2 reduced features achieved a substantial improvement over the original features
• The original 78-dimensional features yield their highest accuracy of 73.2% using 64-mixture HMMs
• State level targets with "Don't Cares" result in higher accuracies
• The middle layer outputs of a network result in more effective features in a reduced space
• Accuracies improved by about 2% with PCA
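As a concrete recap of the NLDA2 pipeline evaluated above, the sketch below (Python/NumPy) wires together the 78-500-36-500-48 architecture from the Experimental Setup: the middle hidden layer supplies the 36-dimensional reduced features, and PCA is applied only to decorrelate them. The weights here are random placeholders rather than trained values, the PCA step is written out explicitly as a Karhunen-Loève transform, and all function names are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer sizes from the Experimental Setup: 78 inputs, 500-36-500 hidden, 48 outputs
sizes = [78, 500, 36, 500, 48]
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def middle_layer_outputs(X):
    # Forward pass up to the 36-node middle hidden layer (the NLDA2 features);
    # every layer on the way is sigmoid, matching "all nonlinear layers" in NLDA2.
    h = X
    for W, b in zip(weights[:2], biases[:2]):  # stop at the bottleneck layer
        h = sigmoid(h @ W + b)
    return h

def pca_decorrelate(F):
    # Karhunen-Loeve transform: project the bottleneck features onto the
    # eigenvectors of their covariance. Used only for decorrelation here,
    # with no further dimensionality reduction, as in NLDA2.
    Fc = F - F.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Fc, rowvar=False))
    order = np.argsort(eigvals)[::-1]  # largest variance first
    return Fc @ eigvecs[:, order]

X = rng.normal(size=(1000, 78))  # stand-in for 78-dim DCTC/DCSC feature frames
features = pca_decorrelate(middle_layer_outputs(X))  # 36-dim decorrelated features

The decorrelated 36-dimensional features would then replace the original 78-dimensional DCTC/DCSC vectors as observations for the monophone HMMs; decorrelation matters because each HMM state is modeled with a Gaussian mixture.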
