This presentation examines learning long-term temporal features for improved conversational speech recognition. It compares the TRAPS and HATS approaches to extracting such features and evaluates them on conversational telephone speech, where they demonstrate potential robustness and improved recognition accuracy. It also investigates combining TRAPS/HATS features with conventional PLP and MLP (Tandem) features to improve overall performance. The conclusions highlight the benefits of constrained learning and of non-linear discriminant transforms in speech recognition systems.
Learning Long-Term Temporal Features for Conversational Speech Recognition: A Comparative Study
Barry Chen, Qifeng Zhu, & Nelson Morgan (with many thanks to Hynek Hermansky and Andreas Stolcke)
AMI/Pascal/IM2/M4 Workshop
Conventional Feature Extraction
[Figure: conventional feature extraction pipeline operating on log-critical band energies]
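As a rough illustration of the shared starting point of both pipelines, here is a minimal sketch of computing log-critical band energies from a waveform. The sampling rate, window and hop sizes, and the mel-spaced triangular filterbank (used as a stand-in for Bark-scale critical bands) are assumptions for illustration, not details taken from the slides.

```python
# Minimal sketch: log-critical band energies from a waveform.
# Assumptions (not from the slides): 8 kHz telephone speech, 32 ms windows,
# 10 ms hop, mel-spaced triangular filters as a proxy for critical bands.
import numpy as np

def log_critical_band_energies(wav, sr=8000, n_fft=256, hop=80, n_bands=15):
    """wav: 1-D numpy array of samples; returns (T, n_bands) log energies."""
    win = np.hanning(n_fft)
    frames = [wav[i:i + n_fft] * win
              for i in range(0, len(wav) - n_fft, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (T, n_fft//2 + 1)

    # Mel-spaced triangular filterbank (proxy for critical bands).
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_bands + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_bands, n_fft // 2 + 1))
    for b in range(1, n_bands + 1):
        l, c, r = bin_pts[b - 1], bin_pts[b], bin_pts[b + 1]
        fbank[b - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[b - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge

    # Log energy in each band, per frame.
    return np.log(power @ fbank.T + 1e-10)
```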
TRAPS/HATS Feature Extraction
[Figure: TRAPS/HATS feature extraction pipeline operating on log-critical band energies]
What is a TRAP?
• TempoRal Patterns (TRAPs) were developed by our colleagues at OGI: Sharma, Jain, Sivadas, and Hermansky (the last two now at IDIAP)
• A TRAP is a narrow-frequency speech energy pattern over a period of time (0.5–1 second)
• TRAPs use neural networks trained to produce posteriors (as in Qifeng Zhu's talk)
• Hidden Activation TRAPS (HATS): a variant built on the hidden activations of these band networks
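To make the TRAP idea concrete, here is a minimal sketch of building the input for one band-specific network: the log-energy trajectory of a single critical band over a long context window, mean- and variance-normalized. The 51-frame window (about 0.5 s at a 10 ms frame rate) and the normalization details are assumptions for illustration.

```python
# Minimal sketch of building TRAP inputs for one critical band: a long
# temporal window of that band's log energy, centered on each frame.
import numpy as np

def trap_inputs(log_cb_energies, band, context=25):
    """log_cb_energies: (T, n_bands) array; returns (T, 2*context + 1)."""
    traj = log_cb_energies[:, band]
    padded = np.pad(traj, context, mode='edge')          # replicate edges
    T, width = len(traj), 2 * context + 1
    windows = np.stack([padded[t:t + width] for t in range(T)])
    # Per-window mean/variance normalization before the band MLP (assumed).
    windows -= windows.mean(axis=1, keepdims=True)
    windows /= windows.std(axis=1, keepdims=True) + 1e-8
    return windows
```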
TRAPS/HATS Motivation
• Psychoacoustics -> evidence for information at long time scales
• Mutual information -> useful information beyond 100 ms
• Potential robustness to speech degradations
Learn Everything in One Step?
Learn in Individual Bands?
One-Stage Approach
2-Stage Linear->Nonlinear Approaches
2-Stage MLP-Based Approaches
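The following sketch shows the shape of a two-stage, HATS-style architecture: one small MLP per critical band consumes that band's TRAP input, and a merger network operates on the concatenated per-band hidden activations. It is forward-pass only, with random stand-in weights and assumed layer sizes; in the actual system both stages are trained on phone targets, and the merger is itself an MLP, for which a single softmax layer stands in here.

```python
# Minimal sketch of a two-stage HATS-style structure (forward pass only).
# Weights are random stand-ins; layer sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

n_bands, context, n_hidden, n_targets = 15, 25, 40, 46
width = 2 * context + 1

# Stage 1: one small MLP per critical band, fed that band's TRAP input.
band_W1 = [rng.normal(scale=0.1, size=(width, n_hidden)) for _ in range(n_bands)]
# Stage 2: a merger over the concatenated per-band hidden activations.
merge_W = rng.normal(scale=0.1, size=(n_bands * n_hidden, n_targets))

def hats_forward(band_traps):
    """band_traps: list of n_bands arrays, each (T, width)."""
    # HATS keeps the (post-sigmoid) hidden activations of each band MLP
    # and feeds them to the merger, rather than the band-level outputs.
    hidden = [sigmoid(band_traps[b] @ band_W1[b]) for b in range(n_bands)]
    merged = np.concatenate(hidden, axis=1)      # (T, n_bands * n_hidden)
    return softmax(merged @ merge_W)             # posteriors, (T, n_targets)
```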
Two Questions
TRAPS and HATS: two-stage approaches to learning long-term temporal features
• Are these constrained approaches better than an unconstrained approach?
• Are the non-linear transformations of critical band trajectories necessary?
Experimental Setup
• Training: ~68 hours of conversational telephone speech from English CallHome, Switchboard I, and Switchboard Cellular
• Testing: 2001 Hub-5 Evaluation Set (Eval2001), 2,255,609 frames and 62,890 words
• Back-end recognizer: SRI's Decipher system, first-pass decoding with a bigram language model and within-word triphone acoustic models (thanks to Andreas Stolcke)
Frame Classification
Standalone Feature System
• Transform the MLP outputs by:
  • a log transform, to make the features more Gaussian
  • PCA, for decorrelation
• Similar to the Tandem setup introduced by Hermansky, Ellis, and Sharma (discussed by Qifeng Zhu on Monday)
• Use the transformed MLP outputs as front-end features for the SRI recognizer
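A minimal sketch of this Tandem-style post-processing, assuming the MLP outputs are posterior probabilities: log-transform them, then decorrelate with PCA. In practice the PCA would be estimated on training data and applied to test data; here it is computed on the given matrix for brevity, and the number of retained components is an assumption.

```python
# Minimal sketch of Tandem-style post-processing of MLP outputs:
# log transform (Gaussianize), then PCA (decorrelate and truncate).
import numpy as np

def tandem_features(posteriors, n_keep=25):
    """posteriors: (T, n_classes) MLP outputs; returns (T, n_keep) features."""
    logp = np.log(posteriors + 1e-10)            # log transform
    logp = logp - logp.mean(axis=0, keepdims=True)
    cov = np.cov(logp, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_keep]   # keep top components
    return logp @ eigvecs[:, order]
```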
ASR with Standalone Temporal Features
Combination w/ PLP+
• SRI's 2003 PLP front end uses 12th-order PLP with three deltas (52 dimensions); HLDA then reduces the 52 dimensions to 39
• Concatenate the PCA-truncated MLP features with the HLDA(PLP+3d) features
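A minimal sketch of this combination, under some assumptions: the 13 PLP coefficients plus three orders of deltas give the 52-dimensional stream, the HLDA projection to 39 dimensions is taken as given (it is estimated inside the SRI front end and not reproduced here), and the MLP stream is the PCA-truncated Tandem feature from the previous sketch. The delta computation and its edge handling are simplified.

```python
# Minimal sketch of the PLP+MLP feature combination, frame by frame.
import numpy as np

def deltas(x, window=2):
    """Standard regression deltas over +/- window frames; x is (T, D).
    Edges wrap around here for brevity."""
    num = sum(k * (np.roll(x, -k, axis=0) - np.roll(x, k, axis=0))
              for k in range(1, window + 1))
    return num / (2 * sum(k * k for k in range(1, window + 1)))

def plp_mlp_features(plp13, hlda_matrix, mlp_feats):
    """plp13: (T, 13); hlda_matrix: (52, 39), assumed given; mlp_feats: (T, d)."""
    d1 = deltas(plp13)
    d2 = deltas(d1)
    d3 = deltas(d2)
    plp52 = np.concatenate([plp13, d1, d2, d3], axis=1)   # PLP + 3 deltas
    plp39 = plp52 @ hlda_matrix                           # HLDA(PLP+3d)
    return np.concatenate([plp39, mlp_feats], axis=1)     # combined stream
```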
Combo w/ PLP Baseline Features
Interpretation
• The learning constraints introduced by the 2-stage approach are better than unconstrained learning
• The non-linear discriminant transform of HATS is better than linear discriminant transforms (LDA, or HATS taken before the sigmoid)
• Like TRAPS, HATS is complementary to more conventional features and combines synergistically with PLP/MLP (Tandem) features
The End