
Database and Visual Front End


Presentation Transcript


  1. Database and Visual Front End Makis Potamianos

  2. Active Appearance Model Visual Features Iain Matthews

  3. Acknowledgments • Cootes, Edwards, Taylor, Manchester • Sclaroff, Boston

  4. AAM Overview (diagram) • Landmarks define the shape • The shape gives a region of interest, which is warped to a reference frame to give the appearance • Shape & appearance are combined into the model

  5. Relationship to DCT Features (diagram: Face Detector → DCT features; AAM Tracker → AAM features) • External feature detector vs. model-based learned tracking • ROI ‘box’ vs. explicit shape + appearance modeling

  6. Training Data • 4,072 hand-labelled images = 2m 13s of video (out of 50h)

  7. Final Model (figure: mean shape and appearance, with model modes varied from −3 to +3 standard deviations)

  8. Fitting Algorithm (diagram) • Compute the current model projection and the image under the model, both warped to the reference frame • Their difference is the error image dI • Predicted update: dc = R dI, where c is all model parameters and R is a learned weight matrix • Apply the update and iterate until convergence
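
For concreteness, a minimal Python sketch of the fitting loop described on slide 8. The model interface (project, warp_to_reference) and the name R are hypothetical; R is assumed to have been learned offline by regressing known parameter perturbations against error images, as in the Cootes et al. formulation.

import numpy as np

def fit_aam(image, model, c0, max_iters=30, tol=1e-6):
    """Sketch of the iterative AAM fit; `model` is a hypothetical
    object providing project(c), warp_to_reference(image, c), and a
    precomputed update matrix R."""
    c = c0.copy()
    for _ in range(max_iters):
        # Appearance predicted by the current model parameters.
        model_app = model.project(c)
        # Image under the current model shape, warped to the reference frame.
        image_app = model.warp_to_reference(image, c)
        # Difference (error) image drives the predicted update.
        dI = image_app - model_app
        dc = model.R @ dI            # predicted update: dc = R dI
        c = c - dc                   # sign convention varies by formulation
        if np.linalg.norm(dc) < tol: # iterate until convergence
            break
    return c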

  9. Tracking Results • Worst sequence: mean mean-square error = 548.87 • Best sequence: mean mean-square error = 89.11

  10. Tracking Results • Full-face AAM tracker on a subset of the VVAV database • 4,952 sequences • 1,119,256 images @ 30fps = 10h 22m • Mean mean-square error per sentence = 254.21 • Tracking rate (m2p decode) ≈ 4 fps • Beard-area and lips-only models will not track • Do these regions lack the sharp texture gradients needed to locate the model?

  11. Features • Use AAM full-face features directly (86-dimensional)

  12. Audio Lattice Rescoring Results • Lattice random path = 78.14% • DCT with LM = 51.08% • DCT no LM = 61.06%
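
The rescoring step on slide 12 can be pictured as re-ranking lattice paths with a weighted combination of audio, visual, and language-model scores. The sketch below is an assumption about the combination form, not the workshop's exact recipe; visual_score and the (words, audio_loglik, lm_logprob) path tuples are hypothetical.

import math

def rescore_lattice(paths, visual_score, lam=0.7, lm_weight=1.0):
    """Pick the best lattice path under a weighted score combination.
    `paths`: iterable of (words, audio_loglik, lm_logprob) tuples;
    `visual_score(words)`: hypothetical visual-model log-likelihood;
    `lam` trades off audio vs. visual evidence."""
    best, best_score = None, -math.inf
    for words, audio_ll, lm_lp in paths:
        score = (lam * audio_ll
                 + (1.0 - lam) * visual_score(words)
                 + lm_weight * lm_lp)
        if score > best_score:
            best, best_score = words, score
    return best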

  13. Audio Lattice Rescoring Results (figure) • AAM vs. DCT vs. noise

  14. Tracking Error Analysis (figure) • AAM performance vs. tracking error

  15.–17. Analysis and Future Work (one slide, progressive build) • Models are undertrained; little more than face detection on 2m of training data • Reproject the face through a more compact model; retain only the useful articulation information? • Improve the reference shape; minimal information loss through the warping?

  18. Asynchronous Stream Modelling Juergen Luettin

  19. The Recognition Problem M* = argmax_M P(M | O_A, O_V) • M: word (phoneme) sequence • M*: most likely word sequence • O_A: acoustic observation sequence • O_V: visual observation sequence
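
Slide 19's criterion, written as code: a hedged sketch of MAP decoding over a candidate set, where loglik and logprior are hypothetical scoring callbacks for P(O_A, O_V | M) and P(M).

def map_decode(hypotheses, loglik, logprior):
    """MAP rule: M* = argmax_M P(O_A, O_V | M) P(M), taken in the
    log domain over an iterable of candidate word sequences."""
    return max(hypotheses, key=lambda M: loglik(M) + logprior(M))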

  20. Integration at the Feature Level • Assumptions: conditional dependence between modalities; integration at the feature level
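
A minimal sketch of what feature-level integration means in practice, assuming time-aligned (T, D) feature matrices per modality; the concatenated observations then feed a single recognizer, so no independence between modalities is assumed.

import numpy as np

def fuse_features(audio_feats, visual_feats):
    """Concatenate synchronized audio and visual feature vectors into
    one observation per frame. Both inputs are (T, D) arrays with one
    row per frame and are assumed time-aligned (e.g. visual features
    interpolated to the audio frame rate)."""
    assert len(audio_feats) == len(visual_feats)
    return np.hstack([audio_feats, visual_feats])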

  21. Integration at the Decision Level • Assumptions: conditional independence between modalities; integration at the unit level

  22. Multiple Synchronous Streams • Assumptions: conditional independence; integration at the state level • Two streams in each state • X: state sequence • a_ij: transition probability from state i to state j • b_j: observation probability density • c_jm: m-th mixture weight of the multivariate Gaussian N
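
The quantities defined on slide 22 combine, in the standard multi-stream formulation, into a per-state output probability that is a weighted product of per-stream GMM likelihoods: b_j(o_t) = ∏_s [ Σ_m c_jsm N(o_st; μ_jsm, Σ_jsm) ]^λ_s. A sketch under that assumption; the workshop system's exact form may differ.

import numpy as np
from scipy.stats import multivariate_normal

def state_output_prob(obs, streams, weights):
    """Synchronous multi-stream state output probability.
    `obs`: list of per-stream observation vectors for one frame;
    `streams`: per stream, a list of (c, mu, Sigma) GMM components;
    `weights`: stream exponents lambda_s."""
    b = 1.0
    for o_s, comps, lam in zip(obs, streams, weights):
        # Per-stream GMM likelihood: sum_m c_m N(o_s; mu_m, Sigma_m)
        gmm = sum(c * multivariate_normal.pdf(o_s, mean=mu, cov=S)
                  for c, mu, S in comps)
        b *= gmm ** lam
    return b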

  23. Multiple Asynchronous Streams • Assumptions: conditional independence; integration at the unit level • Decoding: individual best state sequences for audio and video

  24. Composite HMM Definition (figure: composite state lattice, states numbered 1–9) • Speech-noise decomposition (Varga & Moore, 1993) • Audio-visual decomposition (Dupont & Luettin, 1998)
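
A sketch of the composite state space implied by the figure: each composite state pairs an audio state with a video state, so two 3-state streams yield the nine numbered states. That the figure is a 3 × 3 product lattice, and that composite transitions factor into per-stream transitions, are assumptions here.

def composite_states(n_audio, n_video):
    """Enumerate composite states as (audio_state, video_state) pairs.
    Asynchrony within a unit is allowed because the two indices can
    advance independently; composite transition probabilities would be
    products of the per-stream transition probabilities (assumption)."""
    return [(i, j) for i in range(n_audio) for j in range(n_video)]

# 3 x 3 product lattice -> 9 composite states, as in the slide's figure
print(composite_states(3, 3))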

  25. Stream Clustering

  26. AVSR System • 3-state HMMs with 12 mixture components; 7-state HMMs for the composite model • context-dependent phone models (plus silence and short pause), tree-based state clustering • cross-word context-dependent decoding, using lattices computed at IBM • trigram language model • global stream weights in multi-stream models, estimated on a held-out set
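
The last bullet (global stream weights estimated on a held-out set) could be implemented, in the simplest case, as a grid search; decode_accuracy is a hypothetical callback, and the actual estimation procedure used in the system is not specified on the slide.

import numpy as np

def estimate_stream_weight(decode_accuracy, grid=np.linspace(0.0, 1.0, 21)):
    """Pick the global audio-stream weight lambda (video weight is
    1 - lambda) that maximizes held-out word accuracy.
    `decode_accuracy(lam)`: hypothetical callback that decodes the
    held-out set with the given weight and returns accuracy."""
    return max(grid, key=decode_accuracy)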

  27. Speaker-independent word recognition

  28. Conclusions • The AV two-stream asynchronous model beats the other models in noisy conditions • Future directions: • Transition matrices: context-dependent; pruning transitions with low probability; cross-unit asynchrony • Stream weights: model-based, discriminative • Clustering: taking stream-tying into account

  29. Phone Dependent Weighting Dimitra Vergyri

  30. Weight Estimation Hervé Glotin
