“Pushing the Envelope” A six month report. By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé Bourlard, IDIAP/EPFL George Doddington, NA-sayer. Overview Nelson Morgan, ICSI.
The Current Cast of Characters • ICSI: Morgan, Q. Zhu, B. Chen, G. Doddington • UW: M. Ostendorf, Ö. Çetin • OGI: H. Hermansky, S. Sivadas, P. Jain • Columbia: D. Ellis, M. Athineos • SRI: K. Sönmez • IDIAP: H. Bourlard, J. Ajmera, V. Tyagi
Rethinking Acoustic Processing for ASR • Escape dependence on spectral envelope • Use multiple front-ends across time/freq • Modify statistical models to accommodate new front-ends • Design optimal combination schemes for multiple models
Task 1: Pushing the Envelope (aside) • Problem: Spectral envelope is a fragile information carrier • Solution: Probabilities from multiple time-frequency patches [Figure: OLD: one 10 ms frame gives an estimate of sound identity. PROPOSED: patches of up to 1 s along time give the ith, kth, and nth estimates, which feed information fusion to produce the estimate of sound identity]
Task 2: Beyond Frames… • Problem: Features & models interact; new features may require different models • Solution: Advanced features require advanced models, free of fixed-frame-rate paradigm [Figure: OLD: short-term features feed a conventional HMM. PROPOSED: advanced features feed a multi-rate, dynamic-scale classifier]
Today’s presentation • Infrastructure: training, testing, software • Initial Experiments: pilot studies • Directions: where we’re headed
Infrastructure Kemal Sönmez, SRI (SRI/UW/ICSI effort)
Initial Experimental Paradigm • Focus on a small task to facilitate exploratory work (later move to CTS) • Choose a task where LM is fixed & plays a minor role (to focus on acoustics) • Use mismatched train/test data: • To avoid tuning to the task • To facilitate later move to CTS • Task: OGI numbers/ Train: swbd+macrophone
Hub5 “Short” Training Set • Composition (total ~ 60 hours) * subset of SWB-1 hand-checked at SRI for accuracy of transcriptions and segmentations • WER 2-4% higher vs. full 250+ hour training
Reduced UW Training Set • A reduced training set to shorten experiment turn-around time • Choose training utterances with per-frame likelihood scores close to the training set average • One quarter of the original training set • Statistics (gender, data set constituencies) are similar to those of the full training set • For OGI Numbers, no significant WER sacrifice in the baseline HMM system (worse for Hub 5)
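The selection rule on this slide can be sketched as follows. This is a minimal illustration, not the UW procedure itself: the function name, the use of one average log-likelihood per utterance, and the unweighted mean over utterances are all assumptions.

```python
import numpy as np

def select_typical_utterances(avg_loglik, fraction=0.25):
    """Return sorted indices of the utterances whose per-frame log-likelihood
    lies closest to the average over the whole set (hypothetical helper)."""
    scores = np.asarray(avg_loglik, dtype=float)
    # Distance of each utterance score from the set average.
    distance = np.abs(scores - scores.mean())
    n_keep = max(1, int(round(fraction * len(scores))))
    # Keep the n_keep utterances with the smallest distance.
    keep = np.argsort(distance, kind="stable")[:n_keep]
    return np.sort(keep)

# Toy per-frame log-likelihoods for eight utterances (made-up numbers).
toy_scores = [-70.0, -60.0, -61.0, -59.0, -50.0, -40.0, -62.0, -80.0]
selected = select_typical_utterances(toy_scores, fraction=0.25)
```

With a quarter of eight utterances requested, the two scores nearest the mean of -60.25 are kept. Outliers on either side (very well or very poorly modeled utterances) are dropped.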
Development Test Sets • A “Core-Subset” of OGI’s Numbers95 corpus – telephone speech of people reciting addresses, telephone numbers, zip codes, or other miscellaneous items • “Core-Subset” or “CS” consists of utterances that were phonetically hand-transcribed, intelligible, and contained only numbers • Vocabulary Size: 32 words (digits + eleven, twelve… twenty… hundred… thousand, etc.)
Statistical Modeling Tools • HTK (Hidden Markov Toolkit) for establishing an HMM baseline, debugging • GMTK (Graphical Models Toolkit) for implementing advanced models with multiple feature/state streams • Allows direct dependencies across streams • Not limited by single-rate, single-stream paradigm • Rapid model specification/training/testing • SRI Decipher system for providing lattices to rescore (later in CTS expts) • Neural network tools from ICSI for posterior probability estimation, other statistical software from IDIAP
Baseline SRI Recognizer for the numbers task • Bottom-up state-clustered Gaussian mixture HMMs for acoustic modeling • Acoustic adaptation to speakers using affine mean and variance transforms [Not used for numbers] • Vocal-tract length normalization using maximum likelihood estimation [Not helpful for numbers] • Progressive search with lattice recognition and N-best rescoring [To be used in later work] • Bigram LM
Initial Experiments Barry Chen, ICSI Hynek Hermansky, OHSU (OGI) Özgür Çetin, UW
Goals of Initial Experiments • Establish performance baselines • HMM + standard features (MFCC, PLP) • HMM + current best from ICSI/OGI • Develop infrastructure for new models • GMTK for multi-stream & multi-rate features • Novel features based on large timespans • Novel features based on temporal fine structure • Provide fodder for future error analysis
ICSI Baseline experiments • PLP based - SRI system • “Tandem” PLP-based ANN + SRI system • Initial combination approach
Phonetically Trained Neural Net • Multi-Layer Perceptron (input, hidden, and output layer) • Trained Using Error-Backpropagation Technique – outputs interpreted as posterior probabilities of target classes • Training Targets: 47 mono-phone targets from forced alignment using SRI Eval 2002 system • Training Utterances: UW Reduced Hub5 Set • Training Features: PLP12+e+d+dd, mean & variance normalized on per-conversation side basis • MLP Topology: • 9 Frame Context Window (4 frames in past + current frame + 4 frames in future) • 351 Input Units, 1500 Hidden Units, and 47 Output Units • Total Number of Parameters: ~600k
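The topology figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes a fully connected three-layer net with one bias per hidden and output unit, and reads "PLP12+e+d+dd" as 13 static coefficients plus deltas and double-deltas (39 per frame); the function name is illustrative.

```python
def mlp_parameter_count(n_input, n_hidden, n_output, with_biases=True):
    """Count weights (and optionally biases) of a fully connected 3-layer MLP."""
    weights = n_input * n_hidden + n_hidden * n_output
    biases = (n_hidden + n_output) if with_biases else 0
    return weights + biases

context_frames = 4 + 1 + 4            # 4 past + current + 4 future
features_per_frame = 3 * (12 + 1)     # PLP12 + energy, with deltas and double-deltas
n_input = context_frames * features_per_frame

total = mlp_parameter_count(n_input, n_hidden=1500, n_output=47)
```

The input size comes out at 351, matching the slide, and the total is just under 600k, which agrees with the quoted "~600k".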
Baseline ICSI Tandem • Outputs of Neural Net before final softmax non-linearity used as inputs to PCA • PCA without dimensionality reduction • 4.1% Word and 11.7% Sentence Error Rate on Numbers95-CS test set
Baseline ICSI Tandem+PLP • PLP Stream concatenated with neural net posteriors stream • PCA reduces dimensionality of posteriors stream to 16 (keeping 95% of overall variance) • 3.3% Word and 9.5% Sentence Error Rate on Numbers95-CS test set
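The PCA step in the two Tandem slides can be sketched with NumPy. This is an illustration under assumptions: the function names are hypothetical, the random matrix merely stands in for pre-softmax net outputs, and the real systems may estimate the transform differently.

```python
import numpy as np

def fit_pca(x, variance_to_keep=0.95):
    """Fit PCA on rows of x; keep the fewest components reaching the variance target."""
    mean = x.mean(axis=0)
    cov = np.cov(x - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    n_keep = min(len(eigvals), int(np.searchsorted(cumulative, variance_to_keep)) + 1)
    return mean, eigvecs[:, :n_keep]

def apply_pca(x, mean, basis):
    """Project rows of x onto the retained principal directions."""
    return (x - mean) @ basis

# Stand-in for net outputs: 6 dimensions with very unequal variance.
rng = np.random.default_rng(0)
fake_outputs = rng.standard_normal((5000, 6)) * np.array([10.0, 5.0, 1.0, 0.1, 0.1, 0.1])
mean, basis = fit_pca(fake_outputs, variance_to_keep=0.95)
projected = apply_pca(fake_outputs, mean, basis)
```

Setting `variance_to_keep=1.0` corresponds to the "PCA without dimensionality reduction" case, where the transform only decorrelates the stream. The 0.95 setting mirrors the reduction used before concatenation with PLP.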
OGI Experiments: New Features in EARS • Develop on home-grown ASR system (phoneme-based HTK) • Pass the most promising to ICSI for running in SRI LVCSR system • So far • new features match the performance of the baseline PLP features but do not exceed it • advantage seen in combination with the baseline
Looking to the human auditory system for design inspiration • Psychophysics: Components within certain frequency range (several critical bands) interact [e.g. frequency masking] • Components within certain time span (a few hundreds of ms) interact [e.g. temporal masking] • Physiology: 2-D (time-frequency) matched filters for activity in auditory cortex [cortical receptive fields]
TRAP-based HMM-NN hybrid ASR [Figure: mean & variance normalized, Hamming-windowed critical-band trajectory (101-point input) feeds Multilayer Perceptrons (MLPs), which produce posterior probabilities of phonemes, followed by a search for the best match]
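The input preparation named in the figure can be sketched as below. The function name is hypothetical, and normalizing over the 101-point segment itself (rather than over a whole utterance or conversation side) is an assumption.

```python
import numpy as np

def extract_trap(band_energy, center, length=101):
    """Cut a temporal pattern of `length` frames around `center` from one
    critical-band energy trajectory, normalize it, and apply a Hamming window."""
    half = length // 2
    band_energy = np.asarray(band_energy, dtype=float)
    if center - half < 0 or center + half >= len(band_energy):
        raise ValueError("not enough context around the center frame")
    segment = band_energy[center - half: center + half + 1]
    # Mean and variance normalization (assumed to be per segment).
    segment = (segment - segment.mean()) / (segment.std() + 1e-12)
    return segment * np.hamming(length)

# Toy trajectory: a linear ramp standing in for log critical-band energy.
toy_energy = np.arange(300, dtype=float)
trap = extract_trap(toy_energy, center=150)
```

At a 10 ms frame step, 101 points span about one second, which matches the "up to 1 s" patches of Task 1. One such vector per critical band would be the input to that band's MLP.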
Feature estimation from linearly transformed temporal patterns [Figure: temporal patterns pass through a linear transform (? ? ?), then per-band MLPs, then TANDEM processing, then HMM ASR]
Preliminary TANDEM/TRAP results (OGI-HTK) WER% on OGI numbers, training on UW reduced training set, monophone models
Features from more than one critical-band temporal trajectory Studying KLT-derived basis functions, we observe: cosine transform + frequency derivative average
UW Baseline Experiments • Constructed an HTK-based HMM system that is competitive with the SRI system • Replicated the HMM system in GMTK • Move on to models which integrate information from multiple sources in a principled manner: • Multiple feature streams (multi-stream models) • Different time scales (multi-rate models) • Focus on statistical models not on feature extraction
HTK HMM Baseline • An HTK-based standard HMM system: • 3-state triphones with decision-tree clustering • Mixtures of diagonal Gaussians as state output distributions • No adaptation, fixed LM • Dimensions explored: • Front-end: PLP vs. MFCC, VTLN • Gender-dependent vs. gender-independent modeling • Conclusions: • No significant performance differences • Decided on PLPs, no VTLN, gender-independent models for simplicity
HMM Baselines (cont.) • Replicated HTK baseline with equivalent results in GMTK • To reduce experiment turn-around time, wanted to reduce the training set • For HMMs and Numbers95, three quarters of the training data can be safely ignored:
Multi-stream Models • Information fusion from multiple streams of features • Partially asynchronous state sequences [Figure: STATE TOPOLOGY: states of stream X against states of stream Y. GRAPHICAL MODEL: state seq. of stream X with feature stream X, state seq. of stream Y with feature stream Y]
Temporal envelope features (Columbia) • Temporal fine structure is lost (deliberately) in STFT features [Figure: 10 ms windows] • Need a compact, parametric description...
Frequency-Domain Linear Prediction (FDLP) • Extend LPC with LP model of spectrum • TD-LP: $y[n] = \sum_i a_i\, y[n-i]$ • FD-LP: $Y[k] = \sum_i b_i\, Y[k-i]$, where $Y$ is the DFT of $y$ • ‘Poles’ represent temporal peaks • Features ~ pole bandwidth, ‘frequency’
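The claim that FD-LP poles mark temporal peaks can be illustrated on a toy signal. The sketch below is an assumption-laden demonstration, not the Columbia implementation: it fits the predictor by least squares on the complex DFT, and the function names are hypothetical.

```python
import numpy as np

def fdlp_poles(x, order):
    """Fit Y[k] ~ sum_i b_i Y[k-i] on the DFT of x and return the model poles."""
    Y = np.fft.fft(x)
    n = len(Y)
    # Column i-1 holds Y[k-i] for k = order .. n-1.
    A = np.column_stack([Y[order - i: n - i] for i in range(1, order + 1)])
    b, *_ = np.linalg.lstsq(A, Y[order:], rcond=None)
    # Poles are the roots of z^p - b_1 z^(p-1) - ... - b_p.
    return np.roots(np.concatenate(([1.0], -b)))

def pole_times(poles, n):
    """Map pole angles back to sample positions within the n-sample window."""
    return np.sort(((-np.angle(poles)) % (2 * np.pi)) * n / (2 * np.pi))

# Toy signal: two impulses, i.e. two sharp temporal peaks.
toy = np.zeros(256)
toy[40] = 1.0
toy[100] = 0.5
times = pole_times(fdlp_poles(toy, order=2), len(toy))
```

An impulse at sample n0 has DFT values exp(-j 2 pi k n0 / N), so along the frequency axis it looks like a complex sinusoid, and a linear predictor places a pole at the matching angle. For real signals with broader peaks, the pole radius would reflect peak sharpness, which is where the "pole bandwidth" feature comes from.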
Preliminary FDLP Results • Distribution of pole magnitudes for different phone classes (in 4 bands): • NN Classifier Frame Accuracies:
Directions Dan Ellis, Columbia (SRI/UW/Columbia work) Nelson Morgan, ICSI (OGI/IDIAP/ICSI work + summary)
Multi-rate Models (UW) • Integrate acoustic information from different time scales • Account for dependencies across scales • Better robustness against time- and/or frequency-localized interference • Reduced redundancy gives better confidence estimates [Figure: cross-scale dependencies (example): long-term features with a coarse state chain, short-term features with a fine state chain]
SRI Directions • Task 1: Signal-adaptive weighting of time-frequency patches • Basis-entropy based representation • Matching pursuit search for optimal weighting of patches • Optimality based on minimum entropy criterion • Task 2: Graphical models of patch combinations • Tiling-driven dependency modeling • GM combines across patch selections • Optimality based on information in representation
Data-derived phonetic features (Columbia) • Find a set of independent attributes to account for phonetic (lexical) distinctions • phones replaced by feature streams • Will require new pronunciation models • asynchronous feature transitions (no phones) • mapping from phonetics (for unseen words) Joint work with Eric Fosler-Lussier
ICA for feature bases • PCA finds decorrelated bases; ICA finds independent bases • Lexically-sufficient ICA basis set?
OGI Directions: Targets in sub-bands • Initially context-independent and band-specific phonemes • Gradually shifted to band-specific 6 broad phonetic classes (stops, fricatives, nasals, vowels, silence, flaps) • Moving towards band-independent speech classes (vocalic-like, fricative-like, plosive-like, ???)
More than one temporal pattern? [Figure: mean & variance normalized, Hamming-windowed critical-band trajectory (101 dim) is projected by KLT1 … KLTn, each feeding an MLP]
Pre-processing by 2-D operators with subsequent TRAP-TANDEM [Figure: 2-D operators on the time-frequency plane: differentiate f, average t; differentiate t, average f; diff upwards, av downwards; diff downwards, av upwards]
IDIAP Directions: Phase AutoCorrelation Features • Traditional features: autocorrelation based; very sensitive to additive noise and other variations • Phase AutoCorrelation (PAC): derived from the autocorrelation coefficients of a frame [Formula: PAC definition]
Entropy Based Multi-Stream Combination • Combination of evidence from more than one expert to improve performance • Entropy as a measure of confidence • Experts with low entropy are more reliable than experts with high entropy • Inverse entropy weighting criterion • Relationship between entropy of the resulting (recombined) classifier and recognition rate
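Inverse entropy weighting can be sketched per frame as follows. This is a minimal illustration: the linear (weighted-sum) combination rule and the function names are assumptions, and the IDIAP system may combine streams differently, for example in the log domain.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (nats) along the last axis."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def inverse_entropy_combine(stream_posteriors, eps=1e-12):
    """Combine posteriors of shape (n_streams, n_frames, n_classes) with
    per-frame weights proportional to 1 / entropy of each stream."""
    post = np.asarray(stream_posteriors, dtype=float)
    inv = 1.0 / (entropy(post) + eps)                 # (n_streams, n_frames)
    weights = inv / inv.sum(axis=0, keepdims=True)    # sum to 1 over streams
    combined = np.sum(weights[..., None] * post, axis=0)
    return combined, weights

# Toy example: one frame, three classes, a confident and an uninformative expert.
confident = [[0.90, 0.05, 0.05]]
flat = [[1 / 3, 1 / 3, 1 / 3]]
combined, weights = inverse_entropy_combine([confident, flat])
```

Because the weights sum to one across streams, the combined vector is still a probability distribution. The confident expert receives the larger weight, so its preferred class remains the winner.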
ICSI Directions: Posterior Combination Framework • Combination of Several Discriminative Probability Streams
Improvement of the Combo Infrastructure • Improve basic features: • Add prosodic features: voicing level, energy continuity • Improve PLP by further removing the pitch difference among speakers • Tandem • Different targets, different training features, e.g. word boundary • Improve TRAP (OGI) • Combination • Entropy-based or accuracy-based stream weighting or stream selection
New types of tandem features • Input feature: • Traditional or improved PLP • Spectral continuity • Voicing, voicing continuity • Formant continuity feature • …more • Target posterior: • Phonemes • Word/syllable boundary • Broad phoneme classes • Manner / place / articulation… etc. [Figure: input feature feeds NN processing, which outputs the target posterior, e.g. possible word/syllable boundary]
Data Driven Subword Unit Generation (IDIAP/ICSI) • Motivation: • Phoneme-based units may not be optimal for ASR • Approach (based on speaker segmentation method): • Initial segmentation: large number of clusters • Is thresholdless BIC-like merging criterion met? • Yes: merge, re-segment, and re-estimate, then test again • No: stop
Summary • Staff and tools in place to proceed with core experiments • Pilot experiments provided coherent substrate for cooperation between 6 sites • Future directions for individual sites are all over the map, which is what we want • Possible exploration of collaborations w/MS in this meeting