
HIWIRE Progress Report



  1. HIWIRE Progress Report Technical University of Crete Speech Processing and Dialog Systems Group Presenter: Alex Potamianos (WP1) Vassilis Diakoloukas (WP2)

  2. Outline • Work package 1 • Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices) • Audio-Visual ASR: Baseline • Feature extraction and combination • Segment models for ASR • Blind Source Separation for multi-microphone ASR • Work package 2 • Adaptation • Data collection

  3. Outline • Work package 1 • Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices) • Audio-Visual ASR: Baseline • Feature extraction and combination • Segment models for ASR • Blind Source Separation for multi-microphone ASR • Work package 2 • Adaptation • Data collection

  4. Baseline • Baseline Performance Completed • Aurora 2 on HTK • Aurora 3 on HTK • Aurora 4 on HTK • Lattices for Aurora 4 • Baseline Performance Ongoing • WSJ1 (Decipher) • DMHMMs (Decipher)

  5. Aurora 2 Database • Based on TIDigits downsampled to 8 kHz • Noise artificially added at several SNRs (see the sketch below) • 3 sets of noises • A: subway, babble, car, exhibition hall • B: restaurant, street, airport, train station • C: subway, street (with different frequency characteristics) • Two training conditions • Training on clean data • Multi-condition training on noisy data
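A minimal sketch of the noise-mixing step described above, assuming NumPy arrays at the same sample rate; the scaling follows directly from the SNR definition, and the function and variable names are illustrative, not the official Aurora tooling:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add."""
    noise = noise[:len(clean)]                 # assume noise is at least as long
    p_clean = np.mean(clean.astype(np.float64) ** 2)
    p_noise = np.mean(noise.astype(np.float64) ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# e.g. noisy = mix_at_snr(clean_digits, subway_noise, snr_db=10.0)
```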

  6. Aurora 2 Database • 8440 training sentences • 1001 test sentences / test set • Three front-end configurations • HTK default • WI007 (Aurora 2 distribution) • WI008 (Thanks to Prof. Segura)

  7. Aurora 2: Clean training • HTK default Front-End

  8. Aurora 2: Multi-Condition training • HTK default Front-End

  9. Aurora 2: Clean vs Multi-Condition Training

  10. Aurora 2 Front End Comparison: Clean Training

  11. Front End Comparison: Multi-Condition Training

  12. Aurora 3 Database • 5 languages • Finnish • German • Italian • Spanish • Danish • 3 noise conditions • quiet • low noise (low) • high noise (high) • 2 recording modes • close-talking microphone (ch0) • hands-free microphone (ch1)

  13. Aurora 3 Database • 3 experimental setups • Well-Matched (WM) • 70% of all utts in “quiet, low, high” conditions were used for training • remaining 30% were used for testing • Medium Mismatched (MM) • 100% hands-free recordings from “quiet” and “low” for training • 100% hands-free recordings from “high” for testing • High Mismatched (HM) • 70% of close-talking recordings from all noise conditions for training • 30% of hands-free recordings from “low” and “high” for testing
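As a concrete reading of these setups, here is a minimal sketch of the Well-Matched and Medium-Mismatched partitions, assuming hypothetical utterance records with `mic` and `cond` fields (the actual training/test lists ship with the Aurora 3 database):

```python
import random

def well_matched_split(utts, train_frac=0.70, seed=0):
    """WM: 70% of all utterances (all conditions) for training, 30% for testing."""
    utts = list(utts)
    random.Random(seed).shuffle(utts)
    cut = int(train_frac * len(utts))
    return utts[:cut], utts[cut:]

def medium_mismatch_split(utts):
    """MM: hands-free (ch1) quiet+low for training, hands-free high for testing."""
    train = [u for u in utts if u["mic"] == "ch1" and u["cond"] in ("quiet", "low")]
    test  = [u for u in utts if u["mic"] == "ch1" and u["cond"] == "high"]
    return train, test
```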

  14. Baseline Aurora 3 performance

  15. Baseline Aurora 3 performance

  16. Baseline Aurora 3 with WI007 FE (TUC-UGR comparison)

  17. Baseline Aurora 3 with WI007 FE (TUC-UGR comparison)

  18. Baseline Aurora 3 with WI008 FE (TUC-UGR comparison)

  19. Aurora 4 Database • Based on the WSJ phase 0 collection • 5000-word vocabulary • 7138 training utterances (as in the ARPA evaluation) • 2 recording microphones • 6 different noises artificially added • Car, Babble, Restaurant, Street, Airport, Train Station

  20. Aurora 4 Training Data Sets • 3 training conditions (Clean, Multi-Condition, Noisy)
  • Clean training: 7138 utterances (Sennheiser microphone, as in the ARPA evaluation)
  • Multi-condition training: 7138 utterances
  • 3569 utterances (Sennheiser microphone): 893 with no noise added, 2676 with 1 of 6 noises added at SNRs between 10 and 20 dB
  • 3569 utterances (2nd microphone): 893 with no noise added, 2676 with 1 of 6 noises added at SNRs between 10 and 20 dB

  21. Aurora 4 Test Sets • 14 test sets • 2 sizes: small (166 utts) and large (330 utts)
  • Sets 1-7 (Sennheiser microphone): Set 1 clean, 330 utts; Sets 2-7, 330 utts each with Noise 1-6 added at SNRs between 5 and 15 dB
  • Sets 8-14 (2nd microphone): Set 8 clean, 330 utts; Sets 9-14, 330 utts each with Noise 1-6 added at SNRs between 5 and 15 dB

  22. Lattices • Obtained from the SONIC recognizer • Real-time decoding for the WSJ 5k task • State-of-the-art performance (8% WERR) • Lattices obtained from clean models • Three lattice sizes: small, medium, large • Fixed branching factor for each lattice size (small=2.5, medium=4, large=5.5) • Speed-up factors compared to HTK decoding: x100, x50, x10

  23. Baseline Aurora 4 with Lattices

  24. Baseline Aurora 4 with Lattices

  25. Baseline Aurora 4 (Comparing Lattices)

  26. Aurora 4 Baseline: Conclusions on Lattices • Lattices speed up recognition • Medium-size lattice is ~60 times faster • Small-size lattice is ~108 times faster • Problem: improved performance on noisy test sets • Be careful when using lattices in mismatched conditions (clean training, noisy data)! • Solution: two sets of lattices (matched, mismatched)

  27. Audio-Visual ASR: Database • Subset of CUAVE database used: • 36 speakers (30 training, 6 testing) • 5 sequences of 10 connected digits per speaker • Training set: 1500 digits (30x5x10) • Test set: 300 digits (6x5x10) • CUAVE database also contains more complex data sets: speaker moving around, speaker shows profile, continuous digits, two speakers (to be used in future evaluations)

  28. CUAVE Database Speakers

  29. Audio-Visual ASR: Feature Extraction • Lip region of interest (ROI) tracking • A fixed-size ROI is detected using template matching • The ROI minimizes the RGB Euclidean distance to a given ROI template • The ROI template is selected from the 1st frame of each speaker • Continuity constraint: search within a 20x20-pixel window around the previous frame's ROI (does not work for rapid speaker movements); see the sketch below
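A minimal sketch of the tracking step, assuming NumPy RGB frames: template matching by squared RGB Euclidean distance, with the search confined to a small window around the previous frame's ROI position (all names are illustrative):

```python
import numpy as np

def track_roi(frame, template, prev_xy, search=20):
    """Return the top-left (x, y) of the fixed-size ROI in `frame` whose
    pixels are closest (RGB Euclidean distance) to `template`."""
    tpl = template.astype(np.float64)
    th, tw = tpl.shape[:2]
    px, py = prev_xy
    best_d, best_xy = np.inf, prev_xy
    # Continuity constraint: only scan a 20x20 window around the previous ROI.
    for y in range(max(0, py - search // 2), min(frame.shape[0] - th, py + search // 2) + 1):
        for x in range(max(0, px - search // 2), min(frame.shape[1] - tw, px + search // 2) + 1):
            d = np.sum((frame[y:y + th, x:x + tw].astype(np.float64) - tpl) ** 2)
            if d < best_d:
                best_d, best_xy = d, (x, y)
    return best_xy
```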

  30. Audio-Visual ASR: Feature Extraction • Features extracted from the ROI • The ROI is converted to grayscale • The ROI is decimated to a 16x16-pixel region • A 2D separable DCT is applied to the 16x16-pixel region • The upper-left 6x6 coefficient block is kept (excluding the first coefficient), giving a 35-dimensional feature vector • The feature vector is resampled in time from 29.97 fps (NTSC) to 100 fps • First and second time derivatives are computed over a 6-frame window (final feature size 105) • Sanity check: unsupervised k-means clustering of ROI results in …
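A minimal sketch of this visual front-end using NumPy/SciPy; the DCT normalization and linear interpolation are assumptions here, not necessarily the exact choices of the HIWIRE system:

```python
import numpy as np
from scipy.fftpack import dct

def roi_dct_features(roi_gray):
    """16x16 grayscale ROI -> 35 coefficients: upper-left 6x6 DCT block minus DC."""
    coeffs = dct(dct(roi_gray, axis=0, norm='ortho'), axis=1, norm='ortho')
    block = coeffs[:6, :6].flatten()       # 36 low-frequency coefficients
    return block[1:]                       # drop the first (DC) coefficient -> 35

def resample_track(feats, in_fps=29.97, out_fps=100.0):
    """Linearly resample a (T, 35) feature track from NTSC rate to 100 fps;
    first/second time derivatives would be appended afterwards (-> 105 dims)."""
    t_in = np.arange(len(feats)) / in_fps
    t_out = np.arange(0.0, t_in[-1], 1.0 / out_fps)
    return np.stack([np.interp(t_out, t_in, col) for col in feats.T], axis=1)
```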

  31. Experiments • Recognition experiment: • Open loop digit grammar (50 digits per utterance, no endpointing) • Classification experiment: • Single digit grammar (endpointed digits based on provided segmentation)

  32. Models • Features: • Audio: 39 features (MFCC_D_A) • Visual: 105 features (ROIDCT_D_A) • Audio-Visual: 39+35 feats (MFCC_D_A+ROIDCT) • HMM models • 8 state, left-to-right HMM whole-digit models with no state skipping • Single Gaussian mixture • Audio-Visual HMM uses separate audio and video feature streams with equal weights (1,1)
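A minimal sketch of the two-stream scoring implied above, assuming single-Gaussian state emissions: the state log-likelihood is the weighted sum of per-stream log-likelihoods, with both stream exponents set to 1 as on the slide (names are illustrative, not HTK's internals):

```python
import numpy as np
from scipy.stats import multivariate_normal

def av_state_loglik(o_audio, o_video, audio_pdf, video_pdf, w_audio=1.0, w_video=1.0):
    """log b(o) = w_audio * log b_a(o_audio) + w_video * log b_v(o_video)."""
    return w_audio * audio_pdf.logpdf(o_audio) + w_video * video_pdf.logpdf(o_video)

# Example with the slide's dimensions (illustrative parameters):
# audio_pdf = multivariate_normal(mean=np.zeros(39), cov=np.eye(39))  # MFCC_D_A stream
# video_pdf = multivariate_normal(mean=np.zeros(35), cov=np.eye(35))  # ROIDCT stream
# ll = av_state_loglik(np.zeros(39), np.zeros(35), audio_pdf, video_pdf)
```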

  33. Results (Word Accuracy) • Data • Training: 1500 digits (30 speakers) • Testing: 300 digits (6 speakers)

  34. Future Work • Multi-mixture models • Front-end (NTUA) • Tracking algorithms • Feature extraction • Feature Combination • Feature integration • Feature weighting

  35. Outline • Work package 1 • Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices) • Audio-Visual ASR: Baseline • Feature extraction and combination • Segment models for ASR • Blind Source Separation for multi-microphone ASR • Work package 2 • Adaptation • Data collection

  36. Feature extraction and combination • Noise Robust Features (NTUA) – m12 • AM-FM Features (NTUA) – m12 • Feature combination – m12 • Supra-segmental features (see also segment models) – m18

  37. Outline • Work package 1 • Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices) • Audio-Visual ASR: Baseline • Feature extraction and combination • Segment models for ASR • Blind Source Separation for multi-microphone ASR • Work package 2 • Adaptation • Data collection

  38. Segment Models • Baseline system • Supra-segmental features • Phone Transition modeling – m12 • Prosody modeling – m18 • Stress modeling – m18 • Parametric modeling of feature trajectories • Dynamical system modeling • Combine with HMMs

  39. Outline • Work package 1 • Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices) • Audio-Visual ASR: Baseline • Feature extraction and combination • Segment models for ASR • Blind Source Separation for multi-microphone ASR • Work package 2 • Adaptation • Data collection

  40. Blind Source Separation (Mokios, Sidiropoulos) • Based on PARallel FACtor (PARAFAC) analysis, i.e., low-rank decomposition of multi-dimensional tensorial data • Collect spatial covariance matrix estimates R(t), t = 1, ..., T, that are sufficiently separated in time: R(t) = A D(t) A^H + σ² I, where A is the (unknown) mixing matrix • Assumptions • Uncorrelated speaker signals and noise • D(t) is a diagonal matrix of speaker powers for measurement period t • σ² denotes the noise power (estimated from silence intervals)
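A minimal sketch, under the slide's assumptions, of forming the measurements that feed the PARAFAC step: block-wise spatial covariance estimates with the noise floor removed, so each R(t) - σ²I ≈ A D(t) A^H is low-rank and the stack over t has trilinear structure (an illustration, not the authors' implementation):

```python
import numpy as np

def spatial_covariances(x, n_blocks, sigma2):
    """x: (n_mics, n_samples) multi-microphone signal.
    Returns an (n_blocks, n_mics, n_mics) stack of sample spatial covariance
    estimates, one per time block, with the noise floor sigma2*I subtracted."""
    n_mics, n_samples = x.shape
    block = n_samples // n_blocks
    covs = []
    for t in range(n_blocks):
        seg = x[:, t * block:(t + 1) * block]
        r = seg @ seg.conj().T / block            # sample spatial covariance R(t)
        covs.append(r - sigma2 * np.eye(n_mics))  # remove estimated noise power
    return np.stack(covs)
```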

  41. Outline • Work package 1 • Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices) • Audio-Visual ASR: Baseline • Feature extraction and combination • Segment models for ASR • Blind Source Separation for multi-microphone ASR • Work package 2 • Adaptation • Data collection

  42. Acoustic Model Adaptation • Adaptation Method: • Bayes’ Optimal Classification • Acoustic Models: • Discrete Mixture HMMs

  43. Bayes optimal classification • Classifier decision for a test data vector x_test: choose the class ω with the highest value of P(ω) ∫ p(x_test | θ) p(θ | X_ω) dθ, i.e., the class likelihood averaged over the posterior of the model parameters θ given the class training data X_ω

  44. Bayes optimal versus MAP • Assumption: the posterior is sufficiently peaked around the most probable point θ_MAP • MAP approximation: p(x_test | X_ω) ≈ p(x_test | θ_MAP) • θ_MAP is the set of parameters that maximizes the posterior p(θ | X_ω) ∝ p(X_ω | θ) p(θ)
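A minimal sketch contrasting the two rules above, assuming posterior samples of the parameters are available (e.g., from a conjugate posterior): the Bayes optimal score averages the likelihood over the posterior, while MAP plugs in the single most probable parameter set (all names are illustrative):

```python
import numpy as np

def bayes_optimal_loglik(x, theta_samples, loglik):
    """Monte Carlo estimate of log p(x | class) = log E_theta[ p(x | theta) ],
    averaging the likelihood over posterior samples of the parameters."""
    lls = np.array([loglik(x, th) for th in theta_samples])
    m = lls.max()
    return float(m + np.log(np.mean(np.exp(lls - m))))   # stable log-mean-exp

def map_loglik(x, theta_map, loglik):
    """Plug-in approximation: use only the most probable parameter set."""
    return loglik(x, theta_map)

# e.g. for a 1-D Gaussian class with unknown mean:
# loglik = lambda x, mu: -0.5 * (x - mu) ** 2 - 0.5 * np.log(2.0 * np.pi)
```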

  45. Why Bayes optimal classification • Optimal classification criterion • Combines the predictions of all parameter hypotheses • Better discrimination • Needs less training data • Faster asymptotic convergence to the ML estimate

  46. Why Bayes optimal classification • However: • Computationally more expensive • Difficult to find analytical solutions • … hence some approximations should still be considered

  47. Discrete-Mixture HMMs (Digalakis et al., 2000) • Based on sub-vector quantization • Introduces a new form of observation distributions; see the sketch below
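A minimal sketch of the observation model described in Digalakis et al. (2000), with hypothetical pre-trained codebooks: the feature vector is cut into sub-vectors, each sub-vector is quantized against its own codebook, and the state likelihood is a mixture of products of discrete sub-vector distributions:

```python
import numpy as np

def quantize(subvec, codebook):
    """Index of the nearest codeword (Euclidean distance) for one sub-vector."""
    return int(np.argmin(np.sum((codebook - subvec) ** 2, axis=1)))

def dmhmm_state_loglik(obs, split_points, codebooks, mix_weights, discrete_probs):
    """obs: full feature vector; split_points: where to cut it into sub-vectors;
    discrete_probs[k][s][j] = P(codeword j | mixture k, sub-vector s)."""
    subvecs = np.split(obs, split_points)
    idx = [quantize(sv, cb) for sv, cb in zip(subvecs, codebooks)]
    mix = [w * np.prod([probs[s][idx[s]] for s in range(len(idx))])
           for w, probs in zip(mix_weights, discrete_probs)]
    return float(np.log(np.sum(mix)))

# e.g. a 39-dim vector cut into three 13-dim sub-vectors: split_points = [13, 26]
```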

  48. DMHMM benefits (Digalakis et al., 2000) • Quantization scheme driven by speech recognition performance • Quantizes the acoustic space in sufficient detail • Mixtures capture the correlation between sub-vectors • Well-suited to client-server applications • Comparable performance to continuous HMMs • Faster decoding speeds
