1 / 18

Linear Dynamic Model (LDM) for Automatic Speech Recognition

Linear Dynamic Model (LDM) for Automatic Speech Recognition. PhD Candidate: Tao Ma Advised by: Dr. Joseph Picone Institute for Signal and Information Processing (ISIP) Mississippi State University. An Example of Kalman Filter (another name of LDM).

tyrons
Télécharger la présentation

Linear Dynamic Model (LDM) for Automatic Speech Recognition

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Linear Dynamic Model (LDM) for Automatic Speech Recognition PhD Candidate: Tao Ma Advised by: Dr. Joseph Picone Institute for Signal and Information Processing (ISIP) Mississippi State University

  2. An Example of Kalman Filter (another name of LDM) • In control system engineering, Kalman Filter succeeds to model a system with noisy observations Filtering: Position at present time (remove noise effect) Predicting: Position at a future time Smoothing: Position at a time in the past Observation A Kalman Filter models the position evolution

  3. Outline • Why Linear Dynamic Model (LDM)? • Linear Dynamic Model • Pilot experiment: LDM phone classification on Aurora 4 • Hybrid HMM/LDM decoder architecture for LVCSR • Future work

  4. HMM & Speech Recognition System Hidden Markov Models

  5. Is HMM a perfect model for ASR? • Progress on improving the accuracy of HMM-based system has slowed in the past decade • Theory drawbacks of HMM • False assumption that frames are independent and stationary • Spatial correlation is ignored (diagonal covariance matrix) • Limited discrete state space Accuracy Clean Noisy Time

  6. Motivation of Linear Dynamic Model (LDM) Research • Motivation • A model which reflects the characteristics of speech signals will ultimately lead to great ASR performance improvement • LDM incorporates frame correlation information of speech signals, which is potential to increase recognition accuracy • “Filter” characteristic of LDM has potential to improve noise robustness of speech recognition • Fast growing computation capacity (thanks to Intel) make it realistic to build a two-way HMM/LDM hybrid speech engine

  7. State Space Model • Linear Dynamic Model (LDM) is derived from State Space Model • Equations of State Space Model: • y: observation feature vector • x: corresponding internal state vector • h(): relationship function between y and x at current time • f(): relationship functionbetween current state and all previous states • epsilon: noise component • eta: noise component

  8. Linear Dynamic Model • Equations of Linear Dynamic Model (LDM) • Current state is only determined by previous state • H, F are linear transform matrices • Epsilon and Eta are driving components • y: observation feature vector • x: corresponding internal state vector • H: linear transform matrix between y and x • F:linear transform matrix between current state and previous state • epsilon: driving component • eta: driving component

  9. Kalman filtering for state inference (E-Step of EM training) For a speech sound, Human Being Sound System e Kalman Filtering Estimation

  10. RTS smoother for better inference • Rauch-Tung-Striebel (RTS) smoother • Additional backward pass to minimize inference error • During EM training, computes the expectations of state statistics • Standard Kalman Filter • Kalman Filter with RTS smoother

  11. Maximum Likelihood Parameter Estimation (M-Step of EM training) LDM Parameters aa ae ah ao aw ay b ch d dh eh er Nothing but matrix multiplication! ………

  12. LDM for Speech Classification ^ x aa y x ^ x ch MFCC Feature Hypothesis ^ x eh ……… HMM-Based Recognition x ^ LDM-Based Recognition aa y x ^ MFCC Feature x ch Hypothesis ^ x eh ………

  13. Challenges of Applying LDM to ASR • Segment-based model • frame-to-phoneme information is needed before classification • EM training is sensitive to state initialization • Each phoneme is modeled by a LDM, EM training is to find a set of parameters for a specific LDM • No good mechanism for state initialization yet • More parameters than HMM (2~3x) • Currently mono-phone model, to build a tri-phone model for LVCSR would need more training data

  14. Pilot experiment: phone classification on Aurora 4 • Aurora 4: Wall Street Journal + six kinds of noises • Airport, Babble, Car, Restaurant, Street, and Train • Frame-to-phone alignment is generated by ISIP decoder (force align mode) • Adding language model will get 93% accuracy for clean data • 40 phones, one vs. all classifier

  15. Hybrid HMM/LDM decoder architecture for LVCSR Confidence Measurement Best Hypothesis

  16. Status and future work • The development of HMM/LDM hybrid decoder is still in progress • HMM/LDM hybrid decoder is Expected to be done in 2009 • ISIP HMM/SVM hybrid decoder acts as the reference for implementation • Future work • Research has proved the nonlinear effects in speech signals • Investigate the probability of replacing Kalman filtering with nonlinear filtering (such as Unscented Kalman Filter, Extended Kalman Filter)

  17. Thank you! • Questions?

  18. References • Digalakis, V., “Segment-based Stochastic Models of Spectral Dynamics for Continuous Speech Recognition,” Ph.D. Dissertation, Boston University, Boston, Massachusetts, USA, 1992. • Digalakis, V., Rohlicek, J. and Ostendorf, M., “ML Estimation of a Stochastic Linear System with the EM Algorithm and Its Application to Speech Recognition,” IEEE Transactions on Speech and Audio Processing, vol. 1, no. 4, pp. 431–442, October 1993. • Frankel, J., “Linear Dynamic Models for Automatic Speech Recognition,” Ph.D. Dissertation, The Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK, 2003.

More Related