This presentation examines learning long-term temporal features for improved conversational speech recognition. It compares the TRAPS and HATS approaches to extracting such features and evaluates them on conversational telephone speech, where they demonstrate potential robustness and improved recognition accuracy. It also investigates combining TRAPS/HATS features with conventional PLP and MLP (Tandem) features to improve overall performance. The conclusions highlight the benefits of constrained learning and of non-linear discriminant transforms in speech recognition systems.
Learning Long-Term Temporal Features for Conversational Speech Recognition: A Comparative Study
Barry Chen, Qifeng Zhu, & Nelson Morgan (with many thanks to Hynek Hermansky and Andreas Stolcke)
AMI/Pascal/IM2/M4 Workshop
Conventional Feature Extraction
[Figure: conventional feature extraction pipeline operating on log-critical band energies]
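As a rough illustration of the shared starting point of both pipelines, here is a minimal sketch of computing log-critical band energies from a waveform. The sampling rate, window and hop sizes, and the mel-spaced triangular filterbank (used as a stand-in for Bark-scale critical bands) are assumptions for illustration, not details taken from the slides.

```python
# Minimal sketch: log-critical band energies from a waveform.
# Assumptions (not from the slides): 8 kHz telephone speech, 32 ms windows,
# 10 ms hop, mel-spaced triangular filters as a proxy for critical bands.
import numpy as np

def log_critical_band_energies(wav, sr=8000, n_fft=256, hop=80, n_bands=15):
    """wav: 1-D numpy array of samples; returns (T, n_bands) log energies."""
    win = np.hanning(n_fft)
    frames = [wav[i:i + n_fft] * win
              for i in range(0, len(wav) - n_fft, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (T, n_fft//2 + 1)

    # Mel-spaced triangular filterbank (proxy for critical bands).
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_bands + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_bands, n_fft // 2 + 1))
    for b in range(1, n_bands + 1):
        l, c, r = bin_pts[b - 1], bin_pts[b], bin_pts[b + 1]
        fbank[b - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[b - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge

    # Log energy in each band, per frame.
    return np.log(power @ fbank.T + 1e-10)
```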
TRAPS/HATS Feature Extraction
[Figure: TRAPS/HATS feature extraction pipeline operating on log-critical band energies]
What is a TRAP?
• TempoRal Patterns (TRAPs) were developed by our colleagues at OGI: Sharma, Jain, Sivadas, and Hermansky (the last two now at IDIAP)
• A TRAP is a narrow-frequency speech energy pattern over a period of time (0.5–1 second)
• TRAPs use neural networks trained to produce posteriors (as in Qifeng Zhu's talk)
• Hidden Activation TRAPS (HATS): a variant built on the hidden activations of these band networks
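To make the TRAP idea concrete, here is a minimal sketch of building the input for one band-specific network: the log-energy trajectory of a single critical band over a long context window, mean- and variance-normalized. The 51-frame window (about 0.5 s at a 10 ms frame rate) and the normalization details are assumptions for illustration.

```python
# Minimal sketch of building TRAP inputs for one critical band: a long
# temporal window of that band's log energy, centered on each frame.
import numpy as np

def trap_inputs(log_cb_energies, band, context=25):
    """log_cb_energies: (T, n_bands) array; returns (T, 2*context + 1)."""
    traj = log_cb_energies[:, band]
    padded = np.pad(traj, context, mode='edge')          # replicate edges
    T, width = len(traj), 2 * context + 1
    windows = np.stack([padded[t:t + width] for t in range(T)])
    # Per-window mean/variance normalization before the band MLP (assumed).
    windows -= windows.mean(axis=1, keepdims=True)
    windows /= windows.std(axis=1, keepdims=True) + 1e-8
    return windows
```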
TRAPS/HATS Motivation
• Psychoacoustics -> evidence for information at long time scales
• Mutual information -> useful information beyond 100 ms
• Potential robustness to speech degradations
Learn Everything in One Step?
Learn in Individual Bands?
One-Stage Approach
2-Stage Linear->Nonlinear Approaches
2-Stage MLP-Based Approaches
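The following sketch shows the shape of a two-stage, HATS-style architecture: one small MLP per critical band consumes that band's TRAP input, and a merger network operates on the concatenated per-band hidden activations. It is forward-pass only, with random stand-in weights and assumed layer sizes; in the actual system both stages are trained on phone targets, and the merger is itself an MLP, for which a single softmax layer stands in here.

```python
# Minimal sketch of a two-stage HATS-style structure (forward pass only).
# Weights are random stand-ins; layer sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

n_bands, context, n_hidden, n_targets = 15, 25, 40, 46
width = 2 * context + 1

# Stage 1: one small MLP per critical band, fed that band's TRAP input.
band_W1 = [rng.normal(scale=0.1, size=(width, n_hidden)) for _ in range(n_bands)]
# Stage 2: a merger over the concatenated per-band hidden activations.
merge_W = rng.normal(scale=0.1, size=(n_bands * n_hidden, n_targets))

def hats_forward(band_traps):
    """band_traps: list of n_bands arrays, each (T, width)."""
    # HATS keeps the (post-sigmoid) hidden activations of each band MLP
    # and feeds them to the merger, rather than the band-level outputs.
    hidden = [sigmoid(band_traps[b] @ band_W1[b]) for b in range(n_bands)]
    merged = np.concatenate(hidden, axis=1)      # (T, n_bands * n_hidden)
    return softmax(merged @ merge_W)             # posteriors, (T, n_targets)
```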
Two Questions
TRAPS and HATS: two-stage approaches to learning long-term temporal features
• Are these constrained approaches better than an unconstrained approach?
• Are the non-linear transformations of critical band trajectories necessary?
Experimental Setup
• Training: ~68 hours of conversational telephone speech from English CallHome, Switchboard I, and Switchboard Cellular
• Testing: 2001 Hub-5 Evaluation Set (Eval2001), 2,255,609 frames and 62,890 words
• Back-end recognizer: SRI's Decipher system, first-pass decoding with a bigram language model and within-word triphone acoustic models (thanks to Andreas Stolcke)
Frame Classification
Standalone Feature System
• Transform the MLP outputs by:
  • a log transform, to make the features more Gaussian
  • PCA, for decorrelation
• Similar to the Tandem setup introduced by Hermansky, Ellis, and Sharma (discussed by Qifeng Zhu on Monday)
• Use the transformed MLP outputs as front-end features for the SRI recognizer
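A minimal sketch of this Tandem-style post-processing, assuming the MLP outputs are posterior probabilities: log-transform them, then decorrelate with PCA. In practice the PCA would be estimated on training data and applied to test data; here it is computed on the given matrix for brevity, and the number of retained components is an assumption.

```python
# Minimal sketch of Tandem-style post-processing of MLP outputs:
# log transform (Gaussianize), then PCA (decorrelate and truncate).
import numpy as np

def tandem_features(posteriors, n_keep=25):
    """posteriors: (T, n_classes) MLP outputs; returns (T, n_keep) features."""
    logp = np.log(posteriors + 1e-10)            # log transform
    logp = logp - logp.mean(axis=0, keepdims=True)
    cov = np.cov(logp, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_keep]   # keep top components
    return logp @ eigvecs[:, order]
```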
ASR with Standalone Temporal Features
Combination w/ PLP+
• SRI's 2003 PLP front end uses 12th-order PLP with three deltas (52 dimensions); HLDA then reduces the 52 dimensions to 39
• Concatenate the PCA-truncated MLP features with the HLDA(PLP+3d) features
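A minimal sketch of this combination, under some assumptions: the 13 PLP coefficients plus three orders of deltas give the 52-dimensional stream, the HLDA projection to 39 dimensions is taken as given (it is estimated inside the SRI front end and not reproduced here), and the MLP stream is the PCA-truncated Tandem feature from the previous sketch. The delta computation and its edge handling are simplified.

```python
# Minimal sketch of the PLP+MLP feature combination, frame by frame.
import numpy as np

def deltas(x, window=2):
    """Standard regression deltas over +/- window frames; x is (T, D).
    Edges wrap around here for brevity."""
    num = sum(k * (np.roll(x, -k, axis=0) - np.roll(x, k, axis=0))
              for k in range(1, window + 1))
    return num / (2 * sum(k * k for k in range(1, window + 1)))

def plp_mlp_features(plp13, hlda_matrix, mlp_feats):
    """plp13: (T, 13); hlda_matrix: (52, 39), assumed given; mlp_feats: (T, d)."""
    d1 = deltas(plp13)
    d2 = deltas(d1)
    d3 = deltas(d2)
    plp52 = np.concatenate([plp13, d1, d2, d3], axis=1)   # PLP + 3 deltas
    plp39 = plp52 @ hlda_matrix                           # HLDA(PLP+3d)
    return np.concatenate([plp39, mlp_feats], axis=1)     # combined stream
```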
Combo w/ PLP Baseline Features
Interpretation
• The learning constraints introduced by the 2-stage approach are better than unconstrained learning
• The non-linear discriminant transform of HATS is better than linear discriminant transforms (LDA, or HATS taken before the sigmoid)
• Like TRAPS, HATS is complementary to more conventional features and combines synergistically with PLP/MLP (Tandem) features
The End