HIWIRE Progress Report

Presentation Transcript

  1. HIWIRE Progress Report Technical University of Crete Speech Processing and Dialog Systems Group Presenter: Alex Potamianos (WP1) Vassilis Diakoloukas (WP2)

  2. Outline • Work package 1 • Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices) • Audio-Visual ASR: Baseline • Feature extraction and combination • Segment models for ASR • Blind Source Separation for multi-microphone ASR • Work package 2 • Adaptation • Data collection

  3. Outline • Work package 1 • Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices) • Audio-Visual ASR: Baseline • Feature extraction and combination • Segment models for ASR • Blind Source Separation for multi-microphone ASR • Work package 2 • Adaptation • Data collection

  4. Baseline • Baseline Performance Completed • Aurora 2 on HTK • Aurora 3 on HTK • Aurora 4 on HTK • Lattices for Aurora 4 • Baseline Performance Ongoing • WSJ1 (Decipher) • DMHMMs (Decipher)

  5. Aurora 2 Database • Based on TIDigits downsampled to 8 kHz • Noise artificially added at several SNRs • 3 sets of noises • A: subway, babble, car, exhibition hall • B: restaurant, street, airport, train station • C: subway, street (with different frequency characteristics) • Two training conditions • Training on clean data • Multi-condition training on noisy data
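
As a point of reference only, the mixing step (adding noise to a clean utterance at a target SNR) can be sketched as follows. This is a minimal Python/numpy illustration, not the official Aurora 2 mixing tool; the function name is made up, and the corpus additionally applies channel frequency characteristics (as noted for set C) that this sketch ignores.

    import numpy as np

    def add_noise_at_snr(clean, noise, snr_db):
        """Mix `noise` into `clean` so that the resulting SNR equals `snr_db` (dB)."""
        # Tile or trim the noise to the length of the clean utterance
        if len(noise) < len(clean):
            reps = int(np.ceil(len(clean) / len(noise)))
            noise = np.tile(noise, reps)
        noise = noise[:len(clean)]
        # Choose a scale so that 10*log10(P_clean / P_scaled_noise) == snr_db
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2)
        scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
        return clean + scale * noise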

  6. Aurora 2 Database • 8440 training sentences • 1001 test sentences / test set • Three front-end configurations • HTK default • WI007 (Aurora 2 distribution) • WI008 (Thanks to Prof. Segura)

  7. Aurora 2: Clean training • HTK default Front-End

  8. Aurora 2: Multi-Condition training • HTK default Front-End

  9. Aurora 2: Clean vs Multi-Condition Training

  10. Aurora 2 Front End Comparison: Clean Training

  11. Aurora 2 Front End Comparison: Multi-Condition Training

  12. Aurora 3 Database • 5 languages • Finnish • German • Italian • Spanish • Danish • 3 noise conditions • quiet • low noisy (low) • high noisy (high) • 2 recording modes • close-talking microphone (ch0) • hands-free microphone (ch1)

  13. Aurora 3 Database • 3 experimental setups • Well-Matched (WM) • 70% of all utts in “quiet, low, high” conditions were used for training • remaining 30% were used for testing • Medium Mismatched (MM) • 100% hands-free recordings from “quiet” and “low” for training • 100% hands-free recordings from “high” for testing • High Mismatched (HM) • 70% of close-talking recordings from all noise conditions for training • 30% of hands-free recordings from “low” and “high” for testing
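
For illustration only, the Well-Matched 70/30 partition could be reproduced along these lines. This is a hedged Python sketch; the utterance lists and fixed random seed are assumptions, not the official partition lists shipped with Aurora 3.

    import random

    def well_matched_split(utterance_ids, train_fraction=0.70, seed=0):
        """Randomly split utterances from all noise conditions 70/30 into train/test."""
        rng = random.Random(seed)
        shuffled = list(utterance_ids)
        rng.shuffle(shuffled)
        cut = int(round(len(shuffled) * train_fraction))
        return shuffled[:cut], shuffled[cut:]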

  14. Baseline Aurora 3 performance

  15. Baseline Aurora 3 performance

  16. Baseline Aurora 3 with WI007 FE (TUC - UGR comparison)

  17. Baseline Aurora 3 with WI007 FE (TUC - UGR comparison)

  18. Baseline Aurora 3 with WI008 FE (TUC - UGR comparison)

  19. Aurora 4 Database • Based on the WSJ phase 0 collection • 5000-word vocabulary • 7138 training utterances (ARPA evaluation) • 2 recording microphones • 6 different noises artificially added • Car, Babble, Restaurant, Street, Airport, Train Station

  20. Aurora 4 Training Data Sets • 3 Training Conditions (Clean - Multi-Condition - Noisy) • Clean training: 7138 utterances (as in the ARPA evaluation) • Multi-condition training: 7138 utterances, split into 3569 Sennheiser-microphone utterances and 3569 second-microphone utterances; in each half, 893 have no noise added and 2676 have 1 of 6 noises added at SNRs between 10 and 20 dB

  21. Aurora 4 Test Sets • 14 Test Sets • 2 sizes: small (166 utts) and large (330 utts) • Set 1: 330 utt. (Sennheiser microphone) • Sets 2-7: 330 utt. each (Sennheiser mic; Noises 1-6 added at SNRs between 5 and 15 dB) • Set 8: 330 utt. (2nd microphone) • Sets 9-14: 330 utt. each (2nd mic; Noises 1-6 added at SNRs between 5 and 15 dB)

  22. Lattices • Obtained from the SONIC recognizer • Real-time decoding for the WSJ 5k task • State-of-the-art performance (8% WERR) • Lattices obtained from clean models • Three lattice sizes: small, medium, large • Fixed branching factor for each lattice size (small=2.5, medium=4, large=5.5) • Speed-up factor compared to HTK decoding: x100, x50, x10

  23. Baseline Aurora 4 with Lattices

  24. Baseline Aurora 4 with Lattices

  25. Baseline Aurora 4 (Comparing Lattices)

  26. Aurora 4 Baseline: Conclusions on Lattices • Lattices speed up recognition • Medium-size lattice is ~60 times faster • Small-size lattice is ~108 times faster • Problem: improved performance on noisy test sets • Careful when using lattices in mismatched conditions (clean training, noisy data)! • Solution: • two sets of lattices: matched, mismatched

  27. Audio-Visual ASR: Database • Subset of the CUAVE database used: • 36 speakers (30 training, 6 testing) • 5 sequences of 10 connected digits per speaker • Training set: 1500 digits (30x5x10) • Test set: 300 digits (6x5x10) • The CUAVE database also contains more complex data sets: speaker moving around, speaker in profile, continuous digits, two speakers (to be used in future evaluations)

  28. CUAVE Database Speakers

  29. Audio-Visual ASR: Feature Extraction • Lip region of interest (ROI) tracking • A fixed-size ROI is detected using template matching • The ROI minimizes the RGB Euclidean distance to a given ROI template • The ROI template is selected from the 1st frame of each speaker • Continuity constraint: search within a 20x20 pixel window around the previous frame's ROI (does not work for rapid speaker movements); a sketch of this search follows below
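
A rough sketch of that template-matching search, under the assumptions above (numpy RGB arrays, a fixed-size template taken from the first frame; the function and variable names are illustrative, not the code actually used):

    import numpy as np

    def track_roi(frame, template, prev_top_left, search=20):
        """Find the ROI position in `frame` that minimizes the RGB Euclidean
        distance to `template`, searching a `search` x `search` pixel window
        centred on the previous frame's ROI position."""
        tmpl = template.astype(np.float64)
        h, w, _ = tmpl.shape
        py, px = prev_top_left
        best_dist, best_pos = np.inf, prev_top_left
        for dy in range(-search // 2, search // 2 + 1):
            for dx in range(-search // 2, search // 2 + 1):
                y, x = py + dy, px + dx
                if y < 0 or x < 0 or y + h > frame.shape[0] or x + w > frame.shape[1]:
                    continue
                patch = frame[y:y + h, x:x + w, :].astype(np.float64)
                dist = np.sum((patch - tmpl) ** 2)  # squared RGB Euclidean distance
                if dist < best_dist:
                    best_dist, best_pos = dist, (y, x)
        return best_pos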

  30. Audio-Visual ASR: Feature Extraction • Features extracted from the ROI • The ROI is transformed to grayscale • The ROI is decimated to a 16x16 pixel region • A 2D separable DCT is applied to the 16x16 pixel region • The upper-left 6x6 region is kept (excluding the first coefficient), giving 35 features • The 35-dimensional feature vector is resampled in time from 29.97 fps (NTSC) to 100 fps • First and second derivatives in time are computed using a 6-frame window (feature size 105) • Sanity check: unsupervised k-means clustering of the ROI results in …
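
The DCT step can be sketched as follows (an illustrative Python version using scipy's 1D DCT applied separably; grayscale conversion by channel averaging and decimation by simple subsampling are simplifications, and the time resampling and derivative computation are omitted):

    import numpy as np
    from scipy.fftpack import dct

    def roi_dct_features(roi_rgb):
        """Map a lip ROI to a 35-dimensional DCT feature vector:
        grayscale -> 16x16 decimation -> separable 2D DCT -> upper-left 6x6 minus DC."""
        gray = roi_rgb.astype(np.float64).mean(axis=2)      # simple grayscale conversion
        h, w = gray.shape
        ys = np.linspace(0, h - 1, 16).astype(int)          # decimate to 16x16 by subsampling
        xs = np.linspace(0, w - 1, 16).astype(int)
        small = gray[np.ix_(ys, xs)]
        # Separable 2D DCT: 1D DCT along rows, then along columns
        coeffs = dct(dct(small, axis=0, norm='ortho'), axis=1, norm='ortho')
        block = coeffs[:6, :6].flatten()                    # upper-left 6x6 block (36 coefficients)
        return block[1:]                                    # drop the first (DC) coefficient -> 35 features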

  31. Experiments • Recognition experiment: • Open loop digit grammar (50 digits per utterance, no endpointing) • Classification experiment: • Single digit grammar (endpointed digits based on provided segmentation)

  32. Models • Features: • Audio: 39 features (MFCC_D_A) • Visual: 105 features (ROIDCT_D_A) • Audio-Visual: 39+35 feats (MFCC_D_A+ROIDCT) • HMM models • 8 state, left-to-right HMM whole-digit models with no state skipping • Single Gaussian mixture • Audio-Visual HMM uses separate audio and video feature streams with equal weights (1,1)
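
With equal stream weights, the state observation likelihood of the audio-visual HMM takes the usual multi-stream form (standard formulation, written out here for clarity rather than copied from the slides):

\[
b_j(\mathbf{o}_t) \;=\; \prod_{s \in \{A,V\}} \big[\, b_{js}(\mathbf{o}_{st}) \,\big]^{\gamma_s},
\qquad \gamma_A = \gamma_V = 1,
\]

where \(b_{js}\) is the single-Gaussian output density of state \(j\) for stream \(s\), and \(\mathbf{o}_{st}\) is the audio (39-dimensional) or visual (35-dimensional) part of the observation.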

  33. Results (Word Accuracy) • Data • Training: 1500 digits (30 speakers) • Testing: 300 digits (6 speakers)

  34. Future Work • Multi-mixture models • Front-end (NTUA) • Tracking algorithms • Feature extraction • Feature Combination • Feature integration • Feature weighting

  35. Outline • Work package 1 • Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices) • Audio-Visual ASR: Baseline • Feature extraction and combination • Segment models for ASR • Blind Source Separation for multi-microphone ASR • Work package 2 • Adaptation • Data collection

  36. Feature extraction and combination • Noise Robust Features (NTUA) – m12 • AM-FM Features (NTUA) – m12 • Feature combination – m12 • Supra-segmental features (see also segment models) – m18

  37. Outline • Work package 1 • Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices) • Audio-Visual ASR: Baseline • Feature extraction and combination • Segment models for ASR • Blind Source Separation for multi-microphone ASR • Work package 2 • Adaptation • Data collection

  38. Segment Models • Baseline system • Supra-segmental features • Phone Transition modeling – m12 • Prosody modeling – m18 • Stress modeling – m18 • Parametric modeling of feature trajectories • Dynamical system modeling • Combine with HMMs

  39. Outline • Work package 1 • Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices) • Audio-Visual ASR: Baseline • Feature extraction and combination • Segment models for ASR • Blind Source Separation for multi-microphone ASR • Work package 2 • Adaptation • Data collection

  40. Blind Source Separation (Mokios, Sidiropoulos) • Based on PARallel FACtor (PARAFAC) analysis, i.e., low-rank decomposition of multi-dimensional tensorial data • Collects spatial covariance matrix estimates which are sufficiently separated in time (the model is reconstructed below) • Assumptions: • uncorrelated speaker signals and noise • D(t) is a diagonal matrix of speaker powers for measurement period t • the noise power is estimated from silence intervals
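
The equation itself is not preserved in the transcript; under the stated assumptions the covariance model is presumably of the standard form

\[
\hat{\mathbf{R}}(t) \;\approx\; \mathbf{A}\,\mathbf{D}(t)\,\mathbf{A}^{H} \;+\; \sigma_v^{2}\,\mathbf{I},
\qquad t = 1, \dots, T,
\]

where \(\mathbf{A}\) is the unknown mixing matrix, \(\mathbf{D}(t)\) the diagonal matrix of speaker powers in measurement period \(t\), and \(\sigma_v^{2}\) the noise power estimated from silence intervals; stacking the \(T\) covariance estimates gives a three-way array whose trilinear (PARAFAC) decomposition identifies \(\mathbf{A}\).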

  41. Outline • Work package 1 • Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices) • Audio-Visual ASR: Baseline • Feature extraction and combination • Segment models for ASR • Blind Source Separation for multi-microphone ASR • Work package 2 • Adaptation • Data collection

  42. Acoustic Model Adaptation • Adaptation Method: • Bayes’ Optimal Classification • Acoustic Models: • Discrete Mixture HMMs

  43. Bayes optimal classification • Classifier decision for a test data vector xtest • Choose the class that yields the highest value of the class predictive probability (reconstructed below)
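
The decision rule on the slide is dropped in the transcript; the standard Bayes-optimal (predictive) classification rule it refers to is

\[
\hat{c} \;=\; \arg\max_{c}\; p(c)\int p(\mathbf{x}_{\text{test}} \mid \boldsymbol{\theta}_c)\,
p(\boldsymbol{\theta}_c \mid \mathcal{D}_c)\, d\boldsymbol{\theta}_c ,
\]

where \(\mathcal{D}_c\) denotes the (adaptation) data available for class \(c\) and \(\boldsymbol{\theta}_c\) its model parameters.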

  44. Bayes optimal versus MAP • Assumption: the posterior is sufficiently peaked around the most probable point • MAP approximation (reconstructed below) • θMAP is the set of parameters that maximizes the posterior
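
Again the equations are not preserved; the MAP approximation being contrasted with the Bayes-optimal rule is the standard one:

\[
p(\mathbf{x}_{\text{test}} \mid \mathcal{D}) \;=\; \int p(\mathbf{x}_{\text{test}} \mid \boldsymbol{\theta})\,
p(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta}
\;\approx\; p(\mathbf{x}_{\text{test}} \mid \boldsymbol{\theta}_{\text{MAP}}),
\qquad
\boldsymbol{\theta}_{\text{MAP}} \;=\; \arg\max_{\boldsymbol{\theta}}\; p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta}).
\]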

  45. Why Bayes optimal classification • Optimal classification criterion • The prediction of all the parameter hypotheses is combined • Better discrimination • Less training data • Faster asymptotic convergence to the ML estimate

  46. Why Bayes optimal classification • However: • Computationally more expensive • Difficult to find analytical solutions • … hence some approximations should still be considered

  47. Discrete-Mixture HMMs (Digalakis et al., 2000) • Based on sub-vector quantization • Introduces a new form of observation distributions (sketched below)
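
In the general form described by Digalakis et al. (2000), the feature vector is split into sub-vectors, each sub-vector is quantized separately, and a state's output distribution is a mixture of products of discrete distributions over the sub-vector codewords, roughly

\[
b_j(\mathbf{o}_t) \;=\; \sum_{k=1}^{K} c_{jk} \prod_{s=1}^{S} p_{jks}\!\big( q_s(\mathbf{o}_t^{(s)}) \big),
\]

where \(\mathbf{o}_t^{(s)}\) is the \(s\)-th sub-vector, \(q_s(\cdot)\) its quantizer, and \(c_{jk}\) the mixture weights (the notation here is illustrative; the slide's exact symbols are not preserved in the transcript).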

  48. DMHMM benefits (Digalakis et al., 2000) • Quantization scheme driven by speech-recognition performance • Quantization of the acoustic space in sufficient detail • Mixtures capture the correlation between sub-vectors • Well-matched to client-server applications • Comparable performance to continuous HMMs • Faster decoding speeds