Conditional Random Fields for Automatic Speech Recognition

Presentation Transcript

  1. Conditional Random Fields for Automatic Speech Recognition Jeremy Morris 06/03/2010

  2. Motivation • What is the purpose of Automatic Speech Recognition? • Take an acoustic speech signal… • …and extract higher-level information (e.g. words) from it: “speech”

  3. Motivation • How do we extract this higher-level information from the speech signal? • First extract lower-level information • Use it to build models of phones and words (“speech” → /s p iy ch/)

  4. Motivation • State-of-the-art ASR takes a top-down approach to this problem • Extract acoustic features from the signal • Model a process that generates these features • Use these models to find the word sequence that best fits the features (“speech” → /s p iy ch/)

  5. Motivation • A bottom-up approach • Look for evidence of speech in the signal • Phones, phonological features • Combine this evidence to find the most probable sequence of words in the signal (voicing? burst? frication? “speech” → /s p iy ch/)

  6. Motivation • How can we combine this evidence? • Conditional Random Fields (CRFs) • Discriminative, probabilistic sequence model • Models the conditional probability of a sequence given evidence (voicing? burst? frication? “speech” → /s p iy ch/)

  7. Outline • Motivation • CRF Models • Phone Recognition • HMM-CRF Word Recognition • CRF Word Recognition • Conclusions

  8. CRF Models • Conditional Random Fields (CRFs) • Discriminative probabilistic sequence model • Directly defines a posterior probability P(Y|X) of a label sequence Y given evidence X

  9. CRF Models • The structure of the evidence can be arbitrary • No assumptions of independence

  10. CRF Models • The structure of the evidence can be arbitrary • No assumptions of independence • States can be influenced by any evidence

  13. CRF Models • The structure of the evidence can be arbitrary • No assumptions of independence • States can be influenced by any evidence • Evidence can influence transitions between states

  15. CRF Models • Evidence is incorporated via feature functions (state feature functions)

  16. CRF Models • Evidence is incorporated via feature functions (state feature functions and transition feature functions)

  17. CRF Models • The form of the CRF is an exponential model of weighted state and transition feature functions • Weights are trained via gradient descent to maximize the conditional likelihood
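
For reference, a generic linear-chain CRF of this form can be written as follows (a standard textbook formulation, not copied from the slides):

    P(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{t}\sum_{i} \lambda_i\, s_i(y_t, X, t) + \sum_{t}\sum_{j} \mu_j\, f_j(y_{t-1}, y_t, X, t) \Big)

where the s_i are state feature functions, the f_j are transition feature functions, the λ_i and μ_j are the trained weights, and Z(X) is the normalizing partition function summed over all label sequences.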

  18. Outline • Motivation • CRF Models • Phone Recognition • HMM-CRF Word Recognition • CRF Word Recognition • Conclusions

  19. Phone Recognition • What evidence do we have to combine? • An MLP ANN trained to estimate frame-level posteriors for phonological features (e.g. P(voicing|X), P(burst|X), P(frication|X), …) • An MLP ANN trained to estimate frame-level posteriors for phone classes (e.g. P(/ah/|X), P(/t/|X), P(/n/|X), …)

  20. Phone Recognition • Use these MLP outputs to build state feature functions
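
As a rough illustration (a minimal Python sketch under assumed interfaces, not the thesis code; mlp_posteriors and the class names are hypothetical), a state feature function can simply pass through the MLP's posterior for a class whenever the CRF label being scored matches that class:

    # Minimal sketch: state feature functions built from MLP posterior outputs.
    # mlp_posteriors[t] is assumed to map a class name (a phone or a
    # phonological feature) to its per-frame posterior estimate.
    def make_state_feature(cls):
        """Return a state feature function s(y_t, X, t) tied to MLP output `cls`."""
        def s(label, mlp_posteriors, t):
            # Fire the MLP posterior as the feature value when the label matches.
            return mlp_posteriors[t][cls] if label == cls else 0.0
        return s

    # One such feature per phone class (and likewise per phonological feature).
    state_features = {ph: make_state_feature(ph) for ph in ("/ah/", "/t/", "/n/")}

A fuller setup would typically define a feature for every (label, MLP output) pair rather than only the matching one, so the CRF can also weight mismatched evidence.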

  23. Phone Recognition • Pilot task – phone recognition on TIMIT • ICSI Quicknet MLPs trained on TIMIT; their outputs used as inputs to the CRF models • Compared to Tandem and a standard PLP HMM baseline model • Phone class attributes (61 outputs) • Phonological feature attributes (44 outputs)

  24. Phone Recognition *Significantly (p<0.05) better than the comparable Tandem system (Morris & Fosler-Lussier 08)

  25. Phone Recognition • Moving forward: How do we make use of CRF classification for word recognition? • Attempt to fit CRFs into current state-of-the-art models for speech recognition? • Attempt to use CRFs directly? • Each approach has its benefits • Fitting CRFs into a standard framework lets us reuse existing code and ideas • A model that uses CRFs directly opens up new directions for investigation • Requires some rethinking of the standard model for ASR

  26. Outline • Motivation • CRF Models • Phone Recognition • HMM-CRF Word Recognition • CRF Word Recognition • Conclusions

  27. HMM-CRF Word Recognition • Inspired by Tandem HMM systems • Uses ANN outputs as input features to an HMM (“speech” → /s p iy ch/; PCA)

  28. HMM-CRF Word Recognition • Inspired by Tandem HMM systems • Uses ANN outputs as input features to an HMM • HMM-CRF system (Crandem) • Use a CRF to generate input features for the HMM • See if improved phone accuracy helps the system • Problem: CRFs estimate the probability of the entire sequence, not of individual frames (“speech” → /s p iy ch/; PCA)

  29. HMM-CRF Word Recognition • One solution: the Forward-Backward Algorithm • Used during CRF training to maximize the conditional likelihood • Provides an estimate of the posterior probability of a phone label given the input
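
A textbook forward-backward sketch in Python (not the thesis implementation; `potentials` is a hypothetical T x K x K array of exponentiated state-plus-transition scores, with index 0 of the first frame treated as the start state):

    import numpy as np

    def frame_posteriors(potentials):
        """potentials[t, i, j]: exp(score) of label i at frame t-1 followed by label j at frame t.
        Returns gamma[t, j] = P(y_t = j | X), the per-frame posteriors handed to the HMM."""
        T, K, _ = potentials.shape
        alpha = np.zeros((T, K))
        beta = np.zeros((T, K))
        alpha[0] = potentials[0, 0]              # transitions out of an assumed start state
        for t in range(1, T):
            alpha[t] = alpha[t - 1] @ potentials[t]
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = potentials[t + 1] @ beta[t + 1]
        gamma = alpha * beta
        return gamma / gamma.sum(axis=1, keepdims=True)   # normalize each frame

In practice the recursions are run in log space to avoid underflow on long utterances.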

  30. HMM-CRF Word Recognition • Original Tandem system (“speech” → /s p iy ch/; PCA)

  31. HMM-CRF Word Recognition • Modified Tandem system (Crandem): Local Feature Calc. → PCA (“speech” → /s p iy ch/)
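
A hedged sketch of that local feature calculation, assuming the per-frame CRF posteriors from the forward-backward step above: log-compress them, then decorrelate and reduce them with PCA before handing them to the HMM front end (the dimensionality and helper names are illustrative assumptions, not values from the slides):

    import numpy as np

    def crandem_features(crf_posteriors, n_components=39, eps=1e-10):
        """crf_posteriors: (frames, classes) per-frame posteriors from the CRF.
        Returns log-compressed, PCA-reduced features for the HMM."""
        logp = np.log(crf_posteriors + eps)      # log compression, as in Tandem
        logp -= logp.mean(axis=0)                # center before PCA
        _, _, vt = np.linalg.svd(logp, full_matrices=False)   # PCA via SVD
        return logp @ vt[:n_components].T        # keep the leading components

The 39-dimensional output mirrors a typical HMM front-end size; the exact dimensionality used in the experiments is an assumption here.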

  32. HMM-CRF Word Recognition • Pilot task – phone recognition on TIMIT • Same ICSI Quicknet MLP outputs used as inputs • Crandem compared to Tandem, a standard PLP HMM baseline model, and to the original CRF • Evidence on transitions • This work also examines the effect of using the same MLP outputs as transition features for the CRF

  33. HMM-CRF Word Recognition • Pilot Results 1 (Fosler-Lussier & Morris 08) *Significant (p<0.05) improvement at 0.6% difference between models

  34. HMM-CRF Word Recognition • Pilot Results 2 (Fosler-Lussier & Morris 08) *Significant (p<0.05) improvement at 0.6% difference between models

  35. HMM-CRF Word Recognition • Extension – Word recognition on WSJ0 • New MLPs and CRFs trained on the WSJ0 corpus of read speech • No phone-level assignments, only word transcripts • Initial alignments from HMM forced alignment of MFCC features • Compare Crandem baseline to Tandem and original MFCC baselines • WSJ0 5K Word Recognition task • Same bigram language model used for all systems

  36. HMM-CRF Word Recognition • Results (Morris & Fosler-Lussier 09) *Significant (p≤0.05) improvement at roughly 0.9% difference between models

  37. HMM-CRF Word Recognition *Significant (p≤0.05) improvement at roughly 0.06% difference between models

  38. HMM-CRF Word Recognition Comparison of MLP activation vs. CRF activation

  39. HMM-CRF Word Recognition Ranked average per-frame activation MLP vs. CRF

  40. HMM-CRF Word Recognition • Insights from these experiments • CRF posteriors are very different in flavor from MLP posteriors • Overconfident in the local decisions being made • Higher phone accuracy did not translate to lower WER • Further experiment to test this idea • Transform the posteriors by taking a root and renormalizing (sketched below) • Brings the classes closer together • Achieved results not significantly different from baseline, and no longer degraded with further epochs of training (though no improvement either)
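
The posterior-flattening experiment can be written in a couple of lines (a sketch of the idea only; the root r is a tuning parameter, not a value reported here):

    import numpy as np

    def flatten_posteriors(posteriors, r=2.0):
        """Soften overconfident per-frame posteriors: p <- p**(1/r), renormalized per frame."""
        rooted = posteriors ** (1.0 / r)
        return rooted / rooted.sum(axis=-1, keepdims=True)

Larger r flattens the distribution more aggressively; r = 1 leaves the posteriors unchanged.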

  41. Outline • Motivation • CRF Models • Phone Recognition • HMM-CRF Word Recognition • CRF Word Recognition • Conclusions

  42. CRF Word Recognition • Instead of feeding CRF outputs into an HMM… (“speech” → /s p iy ch/)

  43. CRF Word Recognition • Instead of feeding CRF outputs into an HMM • Why not decode words directly off the CRF? (“speech” → /s p iy ch/)

  44. CRF Word Recognition • The standard model of ASR uses likelihood-based acoustic models (acoustic model, lexicon model, language model) • CRFs provide a conditional acoustic model P(Φ|X)

  45. CRF Word Recognition • Decoding combines the CRF acoustic model and a phone penalty model with the lexicon model and language model
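
Written out as equations (a hedged reconstruction of the decomposition those model labels refer to, not a verbatim copy of the slide), the two decoders differ only in the acoustic term, with a phone penalty term Pen(Φ) attached to the CRF score:

    W^{*} = \arg\max_{W} \max_{\Phi} \; P(X \mid \Phi)\, P(\Phi \mid W)\, P(W)
            \quad \text{(acoustic model, lexicon model, language model)}

    W^{*} = \arg\max_{W} \max_{\Phi} \; P(\Phi \mid X)\, \mathrm{Pen}(\Phi)\, P(\Phi \mid W)\, P(W)
            \quad \text{(CRF acoustic model, phone penalty model, lexicon model, language model)}

A common choice is to take Pen(Φ) as an inverse phone prior or a per-phone insertion penalty so that the conditionally normalized CRF score can stand in for the likelihood term; the exact form used in this work is not spelled out in the transcript.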

  46. CRF Word Recognition • Models implemented using OpenFST • Viterbi beam search to find the best word sequence • Word recognition on WSJ0 • WSJ0 5K Word Recognition task • Same bigram language model used for all systems • Same MLPs used for the HMM-CRF (Crandem) experiments • CRFs trained using a 3-state phone model instead of a 1-state model • Compare to Tandem and original MFCC baselines
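
A self-contained toy sketch of Viterbi decoding with beam pruning (illustrative only; the actual system composes OpenFST transducers for the lexicon and language model rather than running a loop like this):

    import numpy as np

    def viterbi_beam(scores, trans, beam=10.0):
        """scores: (T, K) per-frame label log-scores; trans: (K, K) transition log-scores.
        Keeps only hypotheses within `beam` of the frame's best score."""
        T, K = scores.shape
        delta = scores[0].copy()
        backptr = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            cand = delta[:, None] + trans + scores[t][None, :]   # predecessor x label
            backptr[t] = cand.argmax(axis=0)
            delta = cand.max(axis=0)
            delta[delta < delta.max() - beam] = -np.inf          # beam pruning
        path = [int(delta.argmax())]                             # trace back best sequence
        for t in range(T - 1, 0, -1):
            path.append(int(backptr[t, path[-1]]))
        return path[::-1]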

  47. CRF Word Recognition • Results – Phone Classes only *Significant (p≤0.05) improvement at roughly 0.9% difference between models

  48. CRF Word Recognition • Results – Phone & Phonological features *Significant (p≤0.05) improvement at roughly 0.9% difference between models

  49. Outline • Motivation • CRF Models • Phone Recognition • HMM-CRF Word Recognition • CRF Word Recognition • Conclusions

  50. Conclusions & Future Work • Designed and developed software for CRF training for ASR • Developed a system for word-level ASR using CRFs • Meets the baseline performance of an MLE-trained HMM system • Provides a platform for further exploration