Pushing the Envelope

Pushing the Envelope - Aside • Nelson Morgan, Qifeng Zhu, Andreas Stolcke, Kemal Sönmez, Sunil Sivadas, Takahiro Shinozaki, Mari Ostendorf, Pratibha Jain, Hynek Hermansky, Dan Ellis, George Doddington, Barry Chen, Özgür Çetin, Hervé Bourlard, and Marios Athineos • Presenter: Shih-Hsiang • IEEE Signal Processing Magazine, September 2005

Reference • Ö. Çetin and M. Ostendorf, “Multi-rate and variable-rate modeling of speech at phone and syllable time scales,” in Proc. ICASSP, 2005 • B. Chen, Q. Zhu, and N. Morgan, “Learning long term temporal features in LVCSR using neural networks,” in Proc. ICSLP, 2004 • H. Hermansky and S. Sharma, “TRAPS—Classifiers of temporal patterns,” in Proc. ICSLP, 1998 • H. Hermansky, S. Sharma, and P. Jain, “Data-derived nonlinear mapping for feature extraction in HMM,” in Proc. ASRU, 1999

Reference (cont.) • C. Moreno, Q. Zhu, B. Chen, and N. Morgan, “Automatic data selection for MLP-based feature extraction for ASR,” in Proc. ASRU, 2005 • N. Morgan, B. Chen, Q. Zhu, and A. Stolcke, “Trapping conversational speech: Extending TRAP/TANDEM approaches to conversational telephone speech recognition,” in Proc. ICASSP, 2004

Today’s topic • Focus on three issues • Using MLPs to extract long-term features • TRAPs • HATs • Considerations when training on large amounts of data • A newly introduced HMM model (multi-scale) • Multi-Scale, Variable-Scale

Introduction • The core acoustic operation has essentially remained the same for decades • A single feature vector is compared to a set of distributions derived from training data • The feature vector is usually derived from the power spectral envelope over a 20-30 ms window, stepped forward by ~10 ms per frame • Systems using short-term cepstra for modeling have been successful both in the laboratory and in numerous applications • But there are still significant limitations to speech recognition performance, particularly for conversational speech and/or speech with significant acoustic degradations from noise or reverberation
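The short-term framing described above can be sketched as follows. This is only an illustration: the 25 ms window is one value inside the 20-30 ms range on the slide, the 8 kHz sampling rate is an assumption (typical for telephone speech), and `frame_signal` is a hypothetical helper name.

```python
import numpy as np

def frame_signal(signal, sample_rate, window_ms=25.0, step_ms=10.0):
    """Slice a 1-D signal into overlapping analysis frames (no padding)."""
    win = int(round(sample_rate * window_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    signal = np.asarray(signal)
    if len(signal) < win:
        return np.empty((0, win))
    n_frames = 1 + (len(signal) - win) // step
    # Row i holds samples [i * step, i * step + win).
    idx = np.arange(win)[None, :] + step * np.arange(n_frames)[:, None]
    return signal[idx]

# One second of (silent) audio at an assumed 8 kHz sampling rate.
frames = frame_signal(np.zeros(8000), sample_rate=8000)
print(frames.shape)
```

Each row of `frames` would then be turned into one spectral-envelope feature vector, so a conventional system sees roughly 100 vectors per second, each covering only a few tens of milliseconds.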

Introduction (cont.) • Human phonetic categorization is poor for extremely short segments (<100 ms) • suggesting that analysis of longer time regions is somehow essential to the task • In mid-2002, they began working on a DARPA-sponsored project, EARS • The fundamental goal of this multisite effort was to • Push the spectral envelope away from its role as the sole source of acoustic information incorporated by the statistical models of modern speech recognition systems (SRSs) • This would ultimately require both a revamping of acoustic feature extraction and a fresh look at the incorporation of these features into statistical models representing speech

Temporal Representation • Replace (or augment) the current notion of a spectral-energy-based vector at time t with variables • Based on posterior probabilities of speech categories for long and short time functions of the time-frequency plane • These features may be represented as multiple streams of probabilistic information • Working with narrow spectral subbands and long temporal windows (up to 500 ms or more, sufficiently long for two or more syllables) • TempoRAl Patterns (TRAPs) • Hidden Activation TRAPS (HATS)

TempoRAl Patterns (TRAPs) ICSLP 1998 • Replace the conventional spectral feature vector in phonetic classification with a 1 sec long temporal vector of logarithmic critical-band spectral energies (Bark critical bands)

Bark Critical Band • The scale ranges from 1 to 24 and corresponds to the first 24 critical bands of hearing • The successive band edges (in Hz) are: 0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500
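As a small illustration of how the edge table is used, the sketch below maps a frequency to its critical band number. The function name `bark_band` and the convention that each band includes its lower edge are assumptions made for the example, not something stated on the slide.

```python
import bisect

# The 25 band edges from the slide delimit 24 critical bands.
BARK_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                 6400, 7700, 9500, 12000, 15500]

def bark_band(freq_hz):
    """Return the 1-based band index (1..24) whose range contains freq_hz."""
    if not BARK_EDGES_HZ[0] <= freq_hz < BARK_EDGES_HZ[-1]:
        raise ValueError("frequency lies outside the 24 tabulated bands")
    # Number of edges <= freq_hz equals the band index (lower edge inclusive).
    return bisect.bisect_right(BARK_EDGES_HZ, freq_hz)

print(bark_band(450), bark_band(1000))
```

With this table, band 5 spans 400 to 510 Hz, which is the band referred to as the "fifth critical band" in the mean-TRAP figure.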

TempoRAl Patterns (cont.) Fig. Mean TRAPs for 16 phonemes at the fifth critical band

TempoRAl Patterns (cont.) ASRU 1999 • The TRAPS system consists of two stages of MLPs • In the first stage, critical band MLPs learn phone posterior probabilities conditioned on the input • In the second stage, a “merger” MLP merges the outputs of these individual critical band MLPs, resulting in overall phone posterior probabilities

TempoRAl Patterns (cont.) • Input to each TRAP is a 1 sec long temporal vector • Output of each TRAP is a vector of estimates of phoneme-specific likelihoods • Output from the merging MLP is a vector of estimates of phoneme-specific posterior probabilities • Figure labels: 15 critical bands; each TRAP has 101 input units, 300 hidden units, and 29 output phonetic classes
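The two-stage structure can be sketched as a forward pass in NumPy. This is a minimal sketch, not the authors' implementation: the layer sizes 15/101/300/29 come from the slide, while the merger's hidden size, the sigmoid and softmax nonlinearities, the direct concatenation of band outputs, and the random weights standing in for trained parameters are all assumptions.

```python
import numpy as np

N_BANDS, N_IN, N_HID, N_PHONES = 15, 101, 300, 29  # sizes quoted on the slide
MERGER_HID = 300  # assumption: the merger's hidden size is not given

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def init_mlp(rng, n_in, n_hid, n_out):
    # Small random weights stand in for trained parameters.
    return (rng.normal(0.0, 0.1, (n_in, n_hid)), np.zeros(n_hid),
            rng.normal(0.0, 0.1, (n_hid, n_out)), np.zeros(n_out))

def mlp_forward(params, x):
    w1, b1, w2, b2 = params
    hidden = 1.0 / (1.0 + np.exp(-(x @ w1 + b1)))  # sigmoid hidden layer
    return softmax(hidden @ w2 + b2)               # class probabilities

def traps_forward(band_mlps, merger, traps):
    """traps: (N_BANDS, N_IN) log critical band energy trajectories."""
    # Stage 1: one MLP per critical band, each seeing only its own band.
    band_out = [mlp_forward(p, traps[b]) for b, p in enumerate(band_mlps)]
    # Stage 2: the merger MLP combines all band-level outputs.
    return mlp_forward(merger, np.concatenate(band_out))

rng = np.random.default_rng(0)
band_mlps = [init_mlp(rng, N_IN, N_HID, N_PHONES) for _ in range(N_BANDS)]
merger = init_mlp(rng, N_BANDS * N_PHONES, MERGER_HID, N_PHONES)
traps = rng.normal(size=(N_BANDS, N_IN))
posteriors = traps_forward(band_mlps, merger, traps)
print(posteriors.shape)
```

The 101 inputs per band correspond to a 1 sec trajectory sampled at the 10 ms frame step, centered on the current frame.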

Hidden Activation TRAPS (HATS) ICSLP 2004 • Use the hidden activations of the critical band MLPs, instead of their outputs, as inputs to the “merger” MLP • Widening acoustic context by using more frames of full band speech energies as input to the MLP • Reducing the word error rate from 25.6% to 23.5% on the 2001 NIST evaluation set • Reducing the word error rate from 20.3% to 18.3% on the 2004 NIST evaluation set

Hidden Activation TRAPS (cont.)

Hidden Activation TRAPS (cont.) • PLP features were derived from short-term spectral analysis (25 ms time slices every 10 ms) • PLP/MLP used 9 frames of PLP features and HATs used 51 frames of log critical band energies
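The difference between HATS and TRAPS is which layer of the band MLPs feeds the merger. The sketch below shows that idea under stated assumptions: only the 51-frame input comes from the slides; the number of bands (15), the hidden size (300), the merger sizes, the nonlinearities, and the random weights are carried over or invented for illustration.

```python
import numpy as np

N_FRAMES = 51                    # frames of log critical band energy (slide)
N_BANDS, N_HID = 15, 300         # assumptions carried over from the TRAPS slide
MERGER_HID, N_PHONES = 300, 29   # assumptions

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hats_features(band_w1, band_b1, energies):
    """Hidden activations of every band MLP, concatenated.

    band_w1: (N_BANDS, N_FRAMES, N_HID), band_b1: (N_BANDS, N_HID),
    energies: (N_BANDS, N_FRAMES). The band MLPs' output layers are unused.
    """
    hidden = sigmoid(np.einsum("bf,bfh->bh", energies, band_w1) + band_b1)
    return hidden.reshape(-1)

def merger_forward(params, feats):
    w1, b1, w2, b2 = params
    return softmax(sigmoid(feats @ w1 + b1) @ w2 + b2)

rng = np.random.default_rng(0)
band_w1 = rng.normal(0.0, 0.1, (N_BANDS, N_FRAMES, N_HID))
band_b1 = np.zeros((N_BANDS, N_HID))
merger = (rng.normal(0.0, 0.01, (N_BANDS * N_HID, MERGER_HID)),
          np.zeros(MERGER_HID),
          rng.normal(0.0, 0.1, (MERGER_HID, N_PHONES)),
          np.zeros(N_PHONES))
energies = rng.normal(size=(N_BANDS, N_FRAMES))
feats = hats_features(band_w1, band_b1, energies)
posteriors = merger_forward(merger, feats)
print(feats.shape, posteriors.shape)
```

Under these assumed sizes the merger sees 15 × 300 = 4500 inputs rather than the 15 × 29 band-level outputs a TRAPS merger would see.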

Stability of Results • Switchboard (earlier) and Fisher (later) conversational data are extremely difficult to recognize • Due to their unconstrained vocabulary, speaking style, and range of telephones used • Increasing the amount of training data can achieve better performance

Some Practical Consideration • Larger and larger training sets can provide the best improvement • but this implies a quadratic growth in training time • Solutions • Hyper-threading on dual CPUs • Gender-specific training • Preliminary network training passes with fewer training patterns • Customization of the learning regimen to reduce the number of epochs (training iterations) • Using selected subsets of the data for later training passes

Some Practical Consideration (cont.) • Faster probabilistic inference algorithms and judicious model selection methods for controlling model complexity are needed

Some Practical Consideration (cont.) ASRU 2005 • Data selection is also an important issue • Reducing the redundancy in the database can reduce the cost of learning, achieving the same performance with less effort • Over-represented examples in the database can harm the generalization capabilities of a learning machine, biasing its model toward those classes • To select data with the filter approach, we need an evaluation method that lets us sort the data according to a sampling criterion, i.e., a definition of the usefulness of the data

Some Practical Consideration (cont.) • Evaluation method • First, train an MLP selector (classifier), s, on a small subset of the data, which yields a set of parameters • Then, given those parameters, obtain the posterior probabilities for every feature frame and phoneme, qk, in the rest of the data • Finally, compute the entropy of these posteriors for each feature frame
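The entropy formula itself did not survive in these notes. Assuming the usual Shannon entropy of the selector's posterior distribution, it would read as below, where $q_k$ is the phoneme symbol from the slide and $x_t$ (the feature frame at time $t$), $\Theta_s$ (the selector's parameters), and $K$ (the number of phoneme classes) are notation introduced here for illustration:

```latex
H(x_t) = -\sum_{k=1}^{K} P(q_k \mid x_t, \Theta_s)\,\log P(q_k \mid x_t, \Theta_s)
```

The value is 0 when the selector puts all probability on one phoneme and reaches its maximum, $\log K$, when the posterior is uniform.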

Some Practical Consideration (cont.) • Sampling criteria • High entropy values indicate that making a decision is going to be difficult • Low entropy values indicate that the decision is easy to make (not necessarily implying it will be the right one) • Very high entropy values may indicate outliers or mislabeled examples: non-separable data • Very low entropy values can indicate overrepresented or easily learnt examples • This overrepresentation can harm the classifier’s ability by forcing too much detail in the corresponding class
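One way to turn these criteria into a filter is to compute the per-frame entropy of the selector's posteriors and drop both tails. The sketch below is only illustrative: the quantile thresholds, the function names, and the toy posteriors are hypothetical, and the slides do not state the actual selection rule that was used.

```python
import numpy as np

def frame_entropy(posteriors, eps=1e-12):
    """Shannon entropy (in nats) of each row of a (frames, classes) array."""
    p = np.clip(posteriors, eps, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)

def select_frames(posteriors, low_q=0.1, high_q=0.9):
    """Keep frames whose entropy lies between two quantiles.

    The lowest tail (easy or over-represented frames) and the highest tail
    (possible outliers or mislabeled frames) are discarded.
    The quantile values are hypothetical tuning knobs.
    """
    h = frame_entropy(posteriors)
    lo, hi = np.quantile(h, [low_q, high_q])
    return np.flatnonzero((h >= lo) & (h <= hi))

# Toy posteriors over 4 phoneme classes, ordered from certain to uniform.
demo = np.array([
    [1.00, 0.00, 0.00, 0.00],
    [0.90, 0.10, 0.00, 0.00],
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
entropies = frame_entropy(demo)
kept = select_frames(demo)
print(kept)
```

In this toy case the fully certain frame and the uniform frame fall outside the quantile band, so only the three intermediate frames are retained.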

Some Practical Consideration (cont.) NIST 2001

Statistical Modeling for the New Features • HMMs are not well suited to long-term features • The use of HMMs as the core acoustic modeling technology might obscure the gains from new features, especially those from long time scales • This may be one reason why progress with novel techniques has been so difficult • The standard way to use a longer temporal scale with an HMM is simply to use a large analysis window and a small frame step • The successive features at the slow time scale are even more correlated than those at the fast time scale, leading to a bias in the posteriors of models that do not represent the high correlation between successive frames effectively

Statistical Modeling for the New Features (cont.) • They propose instead to focus on the problem of multistream and multirate process modeling • It is desirable to improve robustness to corruption of individual streams • The use of multiple streams introduces more flexibility in characterizing speech at different time and frequency scales • The statistical models and features interact, and simple HMM-based combination approaches might not fully utilize complementary information in different feature sequences • Multi-rate and variable-rate modeling is introduced

Multi-Rate and Variable-Rate Modeling ICASSP 2005 • The traditional approach for utilizing new features is to concatenate them with existing cepstral features after over-sampling and use them within a standard HMM-based model • HMMs have become so tuned to short-term features that their use might obscure the gains from new features • Traditional HMM
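The traditional concatenation approach mentioned above can be sketched as follows. The feature dimensions (39 and 10), the rate ratio of 3, and the use of simple frame repetition as the over-sampling method are assumptions chosen for illustration.

```python
import numpy as np

def oversample_and_concat(fast_feats, slow_feats, ratio):
    """Repeat each slow-rate frame `ratio` times, then append it to the
    fast-rate features so a single-rate HMM sees one vector per fast frame.

    fast_feats: (T_fast, D_fast), slow_feats: (T_slow, D_slow).
    """
    upsampled = np.repeat(slow_feats, ratio, axis=0)
    if len(upsampled) < len(fast_feats):
        raise ValueError("not enough slow-rate frames to cover the utterance")
    return np.hstack([fast_feats, upsampled[: len(fast_feats)]])

rng = np.random.default_rng(0)
fast = rng.normal(size=(9, 39))   # e.g. cepstral frames (dimension assumed)
slow = rng.normal(size=(3, 10))   # e.g. long-term features (dimension assumed)
combined = oversample_and_concat(fast, slow, ratio=3)
print(combined.shape)
```

Because each slow-rate vector is copied verbatim into several consecutive frames, the concatenated stream contains exactly the kind of strongly correlated successive observations that a standard HMM does not model well.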

Multi-Rate and Variable-Rate Modeling (cont.) • Basic multi-rate HMM • Figure: states and observations at a coarser scale (T1 = 3) and a finer scale (M2 = 3, T2 = M2 × T1 = 9)
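The figure's numbers can be read as a fixed-rate alignment between the two scales. The sketch below assumes that T1 is the number of coarse-scale frames, M2 is the number of fine-scale frames per coarse frame, and T2 is the resulting number of fine-scale frames; this reading, and the helper name `coarse_index`, are assumptions rather than something the slide states explicitly.

```python
T1 = 3          # coarse-scale frames (from the figure)
M2 = 3          # fine-scale frames per coarse-scale frame (from the figure)
T2 = M2 * T1    # fine-scale frames: 9

def coarse_index(t_fine, rate=M2):
    """Index of the coarse-scale frame that spans fine-scale frame t_fine."""
    return t_fine // rate

# For every fine-scale frame, the coarse-scale frame it is aligned with.
alignment = [coarse_index(t) for t in range(T2)]
print(alignment)
```

In the basic multi-rate model this ratio is the same everywhere; the variable-rate extension relaxes exactly that constraint.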

Multi-Rate and Variable-Rate Modeling (cont.) • Variable-rate extension (2-rate) • Figure: states and observations at a coarser scale and a finer scale

Multi-Rate and Variable-Rate Modeling (cont.) • In their experiment, they modeled speech using both recognition units and feature sequences corresponding to phone and syllable time scales • Short-time: traditional phone HMMs using cepstral features (PLP cepstra) • Long-time: characterizes syllable structure and lexical stress using HATs • Unlike the previously mentioned HAT features that were trained on phone targets, these HAT features are trained on broad consonant/vowel classes, with distinctions for syllable position (onset, coda, and ambi-syllabification) for consonants and low/high stress level for vowels • Result: 2% word error rate reduction on the NIST 2001 Hub-5 task

Multi-Rate and Variable-Rate Modeling (cont.) • The experimental results show that explicit modeling of speech at two time scales via a multirate, coupled-HMM architecture outperforms the simple HMM-based feature concatenation approach • The feature extraction and statistical modeling are tailored to focus more on information-bearing regions (e.g., phone transitions) as opposed to a uniform emphasis over the whole signal space • Research directions • Choice of the sampling rates according to the scale/rate of the larger time-window features • Multirate acoustic models with more than two time scales • The third or higher time scale can represent utterance-level effects such as speaking rate and style, gender, and noise

What could be next • Determine optimal window sizes and frame rates for different regions of speech, thus creating a signal-adaptive front end • The energy-based representations of temporal trajectories could be replaced by autoregressive models for these components of the time-frequency plane • FDLP, LP-TRAP • Perceptual linear prediction squared (PLP²) • A spectrogram-like signal representation that is iteratively approximated by all-pole models applied sequentially in the time and frequency directions of the spectrotemporal pattern • Unlike conventional feature processing, no frame-based spectral analysis occurs

Final Words • They wrote some words … • “We implored the reader not to be deterred by initial results that were poorer than those achieved by more conventional methods, since this was almost inevitable when wandering from a well-worn path. However, the goal was always to ultimately improve performance, and the explorations into relatively uncharted territory were only a path to that goal. This process can be slow and sometimes frustrating”
