This work investigates incorporating voicing features into a traditional Mel-cepstral front end for improved speech recognition. Standard cepstral coefficients overlook discriminative cues that are essential for effective phonetic decoding. We propose augmenting Mel-cepstral features with voicing measures, optimizing front-end parameters for distinct phonetic states. Combining normalized peak autocorrelation with cepstral entropy yields notable reductions in word error rate (WER) across different speakers. Further evaluation and integration of such features promise additional gains in automatic speech recognition systems.
Voicing Features • Horacio Franco, Martin Graciarena • Andreas Stolcke, Dimitra Vergyri, Jing Zheng • STAR Lab, SRI International
Phonetically Motivated Features • Problem: • Cepstral coefficients fail to capture many discriminative cues. • The front end is optimized for traditional Mel cepstral features. • Front-end parameters are a compromise solution across all phones.
Phonetically Motivated Features • Proposal: • Enrich Mel cepstral feature representation with phonetically motivated features from independent front-ends. • Optimize each specific front-end to improve discrimination. • Robust broad class phonetic features provide “anchor points” in acoustic phonetic decoding. • General framework for multiple phonetic features. First approach: voicing features.
Voicing Features • Voicing feature algorithms: • Normalized peak autocorrelation (PA). For a time frame $x$, $PA(x) = \max_{\tau \in T_{pitch}} \frac{\sum_n x[n]\,x[n+\tau]}{\sum_n x[n]^2}$ • max computed over the lags $T_{pitch}$ corresponding to the pitch region 80 Hz to 450 Hz • Entropy of high-order cepstrum (EC) and linear spectrum (ES). If $Y_i = |C_i| / \sum_j |C_j|$ is the magnitude-normalized high-order cepstrum, and $H$ is the entropy of $Y$, $H(Y) = -\sum_i Y_i \log Y_i$, then $EC = H(Y)$ (and ES analogously on the linear spectrum) • Entropy computed in the pitch region 80 Hz to 450 Hz
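A minimal numpy sketch of the two measures as reconstructed above; the 8 kHz sampling rate, Hamming window, and 512-point FFT are assumptions, not taken from the slides:

```python
import numpy as np

FS = 8000                    # assumed sampling rate (telephone speech)
LAG_MIN = FS // 450          # shortest pitch period (450 Hz)
LAG_MAX = FS // 80           # longest pitch period (80 Hz)

def peak_autocorrelation(frame):
    """Normalized peak autocorrelation (PA) over the 80-450 Hz pitch lags."""
    x = frame - frame.mean()
    r = np.correlate(x, x, mode='full')[len(x) - 1:]   # r[tau], tau >= 0
    return r[LAG_MIN:LAG_MAX + 1].max() / (r[0] + 1e-10)

def entropy_cepstrum(frame, nfft=512):
    """Entropy of the high-order cepstrum (EC) over the pitch quefrencies.

    Voiced frames give a peaky cepstrum in the pitch region, hence low entropy.
    """
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft))
    cep = np.abs(np.fft.irfft(np.log(spec + 1e-10)))
    region = cep[LAG_MIN:LAG_MAX + 1]
    y = region / (region.sum() + 1e-10)        # normalize to a distribution Y
    return -np.sum(y * np.log(y + 1e-10))      # H(Y)
```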
Voicing Features • Correlation with template and DP alignment [Arcienega, ICSLP'02]. The Discrete Logarithmic Fourier Transform (DLFT) samples the spectrum at logarithmically spaced frequencies within the frequency band of interest for the speech signal • If IT is an impulse train, the template is the DLFT of IT; in the log-frequency domain a pitch change becomes a shift, so the correlation for frame j with the template is computed over all shifts of the signal DLFT • the DP optimal correlation is obtained by dynamic programming over the per-frame correlations • max computed in pitch region 80 Hz to 450 Hz
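The slide's equations did not survive extraction, so the following is only an illustrative sketch of the idea: compute a log-frequency spectrum, correlate it against an impulse-train template at every shift (each shift is a pitch candidate), and let dynamic programming pick a smooth path over frames. Here `template` is assumed to be the z-normalized log-magnitude DLFT of an impulse train, truncated so it can slide across the signal spectrum; the bin count and jump penalty are likewise assumptions:

```python
import numpy as np

def dlft_logmag(x, fs, fmin=80.0, fmax=4000.0, nbins=128):
    """Log-magnitude DFT sampled at log-spaced frequencies (DLFT sketch)."""
    freqs = np.geomspace(fmin, fmax, nbins)
    n = np.arange(len(x))
    basis = np.exp(-2j * np.pi * np.outer(freqs, n) / fs)
    return np.log(np.abs(basis @ x) + 1e-10)

def frame_correlations(frames, fs, template):
    """Correlation of each frame's DLFT with the template at every shift;
    in log frequency, a pitch change is a shift of the harmonic pattern."""
    rows = []
    for frame in frames:
        s = dlft_logmag(frame, fs)
        s = (s - s.mean()) / (s.std() + 1e-10)
        rows.append(np.correlate(s, template, mode='valid') / len(template))
    return np.array(rows)                      # (frames, shifts)

def dp_correlation(C, penalty=0.05):
    """DP-optimal average correlation, penalizing shift (pitch) jumps
    between consecutive frames."""
    score = C[0].copy()
    k = np.arange(C.shape[1])
    for t in range(1, len(C)):
        jump = penalty * np.abs(k[:, None] - k[None, :])
        score = C[t] + (score[None, :] - jump).max(axis=1)
    return score.max() / len(C)
```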
Voicing Features • Preliminary exploration of voicing features: • - Best feature combination: Peak Autocorrelation + Entropy Cepstrum • - Complementary behavior of the autocorrelation and entropy features for high and low pitch: • Low pitch: pitch periods are well separated in time, so the autocorrelation peak is well defined. • High pitch: harmonics are well separated in frequency, so the cepstrum is well defined.
Voicing Features • [Figure: voicing feature trajectories over time, aligned with the phone sequence "w er k ay n d ax f s: aw th ax v dh ey ax r"]
Voicing Features • Integration of Voicing Features: • 1 - Juxtaposing Voicing Features: • Append the two voicing features to the traditional Mel cepstral feature vector (MFCC) plus delta and delta-delta features (MFCC+D+DD). • Voicing feature front end: same frame rate as the MFCC front end, with the temporal window duration optimized separately.
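A trivial sketch of the juxtaposition, assuming a 39-dimensional MFCC+D+DD vector (13 cepstra plus first and second differences) and the two voicing features per 10 ms frame:

```python
import numpy as np

def juxtapose(mfcc_d_dd, voicing):
    """Append per-frame voicing features [PA, EC] to the cepstral vectors.

    mfcc_d_dd : (T, 39) MFCC+D+DD features at a 10 ms frame rate
    voicing   : (T, 2)  voicing features computed at the same frame rate
    """
    assert len(mfcc_d_dd) == len(voicing)
    return np.hstack([mfcc_d_dd, voicing])     # (T, 41)
```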
Voicing Features • Train on a small Switchboard set (64 hours); test on Dev 2001. WER reported for both sexes. • Features: MFCC+D+DD, 25.6 ms frames every 10 ms. • VTL normalization plus speaker-level mean and variance normalization. Genone acoustic model: non-crossword, MLE-trained, gender-dependent. Bigram LM.
Voicing Features • 2 - Voiced/Unvoiced Posterior Features: • Use a posterior voicing probability as the feature, computed from a 2-state HMM. The juxtaposed feature dimension is 40. • Similar setup as before; males-only results. • Soft V/UV transitions may not be captured, because the posterior feature behaves much like a binary feature.
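The slides do not give the HMM details; below is a minimal forward-backward sketch that turns assumed per-frame voiced/unvoiced log-likelihoods into the posterior voicing probability:

```python
import numpy as np
from scipy.special import logsumexp

def voicing_posterior(loglik, logA, logpi):
    """P(voiced | all frames) from a 2-state (unvoiced/voiced) HMM.

    loglik : (T, 2) per-frame log-likelihoods for the two states
    logA   : (2, 2) log transition matrix; logpi : (2,) log initial probs
    """
    T = len(loglik)
    fwd = np.zeros((T, 2)); bwd = np.zeros((T, 2))
    fwd[0] = logpi + loglik[0]
    for t in range(1, T):                      # forward pass
        fwd[t] = loglik[t] + logsumexp(fwd[t - 1][:, None] + logA, axis=0)
    for t in range(T - 2, -1, -1):             # backward pass
        bwd[t] = logsumexp(logA + (loglik[t + 1] + bwd[t + 1])[None, :], axis=1)
    post = fwd + bwd
    post -= logsumexp(post, axis=1, keepdims=True)
    return np.exp(post[:, 1])                  # per-frame P(voiced)
```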
Voicing Features • 3 - Window of Voicing Features + HLDA: • Juxtapose the MFCC features with a window of voicing features around the current frame. • Apply dimensionality reduction with HLDA; the final feature has 39 dimensions. • Same setup as before, MFCC+D+DD+3rd diffs. Both sexes. • The new baseline is 1.5% absolute better; voicing features improve it by a further 1%.
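A sketch of the windowing and projection, using scikit-learn's LDA as a stand-in for HLDA (which is not available off the shelf); the window radius, dummy dimensions, and labels are assumptions chosen to match the 10 voicing features and 39 output dimensions on the slides:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def stack_window(feat, radius=2):
    """Stack features from +/-radius frames around each frame (edge-padded)."""
    T = len(feat)
    pad = np.pad(feat, ((radius, radius), (0, 0)), mode='edge')
    return np.hstack([pad[i:i + T] for i in range(2 * radius + 1)])

T = 1000
mfcc = np.random.randn(T, 52)          # placeholder MFCC+D+DD+3rd diffs
voicing = np.random.randn(T, 2)        # placeholder [PA, EC] per frame
states = np.random.randint(0, 48, T)   # placeholder phone-state labels

feats = np.hstack([mfcc, stack_window(voicing)])   # 5-frame window -> (T, 62)
lda = LinearDiscriminantAnalysis(n_components=39).fit(feats, states)
feats39 = lda.transform(feats)                     # (T, 39), as on the slide
```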
Voicing Features • 4 - Delta of Voicing Features + HLDA: • Use delta and delta-delta features instead of a window of voicing features; apply HLDA to the juxtaposed feature. • Same setup as before, MFCC+D+DD+3rd diffs. Males only. • The reason may be that variability in the voicing features produces noisy deltas. • The HLDA weighting of the "window of voicing features" resembles an average. • ---------------------------------------------------------------------------------- • The best overall configuration was MFCC+D+DD+3rd diffs. plus 10 voicing features + HLDA.
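For reference, the standard regression formula for the deltas tried in this variant; the +/-2 frame regression window is an assumption:

```python
import numpy as np

def deltas(feat, radius=2):
    """Regression-based delta features over +/-radius frames:
    d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2)."""
    T = len(feat)
    pad = np.pad(feat, ((radius, radius), (0, 0)), mode='edge')
    num = sum(k * (pad[radius + k:radius + k + T] - pad[radius - k:radius - k + T])
              for k in range(1, radius + 1))
    return num / (2.0 * sum(k * k for k in range(1, radius + 1)))

# juxtaposed feature before HLDA, for voicing features v of shape (T, 2):
# np.hstack([mfcc, v, deltas(v), deltas(deltas(v))])
```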
Voicing Features • Voicing Features in the SRI CTS Eval Sept '03 System: • Adaptation of MMIE cross-word models with/without voicing features. • Used the best configuration of voicing features. • Train on the full SWBD+CTRANS data; test on EVAL'02. • Features: MFCC+D+DD+3rd diffs.+HLDA • Adaptation: 9 full-matrix MLLR transforms. • Adaptation hypotheses from an MLE non-crossword model with a PLP front end plus voicing features.
Voicing Features • Hypothesis Examples: • REF: OH REALLY WHAT WHAT KIND OF PAPER • HYP BASELINE: OH REALLY WHICH WAS KIND OF PAPER • HYP VOICING: OH REALLY WHAT WHAT KIND OF PAPER • REF: YOU KNOW HE S JUST SO UNHAPPY • HYP BASELINE: YOU KNOW YOU JUST I WANT HAPPY • HYP VOICING: YOU KNOW HE S JUST SO I WANT HAPPY
Voicing Features • Error analysis: • In one experiment, 54% of speakers got a WER reduction (some up to 4% absolute); the remaining 46% showed a small WER increase. • A more detailed study of speaker-dependent performance is still needed. • Implementation: • Implemented a voicing feature engine in the DECIPHER system. • Fast computation: one FFT and two IFFTs per frame yield both voicing features (sketched below).
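A sketch of that shared computation: by the Wiener-Khinchin theorem the autocorrelation is the inverse FFT of the power spectrum, and the cepstrum is the inverse FFT of the log spectrum, so both features reuse one forward FFT. The frame windowing and FFT size are assumptions:

```python
import numpy as np

def voicing_features(frame, lag_min, lag_max, nfft=512):
    """PA and EC from one FFT and two inverse FFTs per frame.

    nfft should be >= 2*len(frame) so the circular autocorrelation
    matches the linear one over the pitch lags of interest.
    """
    spec = np.fft.rfft(frame * np.hamming(len(frame)), nfft)  # the one FFT
    power = np.abs(spec) ** 2
    r = np.fft.irfft(power)                                   # IFFT #1 -> autocorrelation
    pa = r[lag_min:lag_max + 1].max() / (r[0] + 1e-10)
    cep = np.abs(np.fft.irfft(np.log(power + 1e-10)))         # IFFT #2 -> cepstrum
    region = cep[lag_min:lag_max + 1]
    y = region / (region.sum() + 1e-10)
    ec = -np.sum(y * np.log(y + 1e-10))
    return pa, ec
```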
Voicing Features • Conclusions: • Explored how to represent and integrate the voicing features for best performance. • Achieved a 1% abs. (~2% rel.) gain in the first pass (using the small training set) and >0.5% abs. (2% rel.) gain (using the full training set) in higher rescoring passes of the DECIPHER LVCSR system. • Future work: • Further explore feature combination/selection. • Develop more reliable voicing features, since the features do not always reflect actual voicing activity. • Develop other phonetically derived features (vowels/consonants, occlusion, nasality, etc.).