
Automatic Lip-Synchronization Using Linear Prediction of Speech


Presentation Transcript


  1. Automatic Lip-Synchronization Using Linear Prediction of Speech Christopher Kohnert and SK Semwal, University of Colorado, Colorado Springs

  2. Topics of Presentation • Introduction and Background • Linear Prediction Theory • Sound Signatures • Viseme Scoring • Rendering System • Results • Conclusions

  3. Justification • Need: • Existing methods are labor intensive • Poor results • Expensive • Solution: • Automatic method • “Decent” results

  4. Applications of Automatic System • Typical applications benefiting from an automatic method: • Real-time video communication • Synthetic computer agents • Low-budget animation scenarios: • Video game industry

  5. Automatic Is Possible • Spoken word is broken into phonemes • Phonemes comprehensively cover spoken sounds • Visemes are their visual correlates • Used in lip-reading and traditional animation

  6. Existing Methods of Synchronization • Text Based • Analyze text to extract phonemes • Speech Based • Volume tracking • Speech recognition front-end • Linear Prediction • Hybrids • Text & Speech • Image & Speech

  7. Speech Based is Best • Doesn’t need a script • Fully automatic • Can use the original sound sample (best quality) • Can use the source-filter model

  8. Source-Filter Model • Models a sound signal as a source passed through a filter • Source: lungs & vocal cords • Filter: vocal tract • Implemented using Linear Prediction
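
As a rough illustration of this model (not the authors' code), the sketch below drives an assumed all-pole filter with a pulse-train source to approximate a voiced sound; the sample rate, pitch, and coefficient values are arbitrary placeholders.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                        # sample rate in Hz (assumed)
pitch_hz = 120                   # pulse-train rate, roughly a male voice
n = fs // 10                     # 100 ms of output

# Source: an impulse train models voiced sounds; white noise would model unvoiced.
source = np.zeros(n)
source[::fs // pitch_hz] = 1.0

# Filter: all-pole IIR filter standing in for the vocal tract. The denominator
# is 1 followed by the negated a_k terms; these values are illustrative only.
den = np.array([1.0, -1.3, 0.7])
voiced = lfilter([1.0], den, source)
```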

  9. Speech Related Topics • Phoneme recognition • How many to use? • Mapping phonemes to visemes • Use visually distinctive ones (e.g. vowel sounds) • Coarticulation effect

  10. The Coarticulation Effect • The blending of sounds by adjacent phonemes (common in everyday speech) • Not captured by discrete phoneme recognition • Causes poor visual synchronization (transitions are jerky and unnatural)

  11. Speech Encoding Methods • Pulse Code Modulation (PCM) • Vocoding • Linear Prediction

  12. Pulse Code Modulation • Raw digital sampling • High quality sound • Very high bandwidth requirements (e.g. CD audio: 44.1 kHz × 16 bits ≈ 706 kbit/s per channel)

  13. Vocoding • Stands for VOice-enCODing • Origins in military applications • Models physical entities (tongue, vocal cord, jaw, etc.) • Poor sound quality (tin can voices) • Very low bandwidth requirements

  14. Linear Prediction • Hybrid method (of PCM and Vocoding) • Models sound source and filter separately • Uses original sound sample to calculate recreation parameters (minimum error) • Low bandwidth requirements • Pitch and intonation independence

  15. Linear Prediction Theory • Source-filter model • P filter coefficients (a_k) are calculated [slide diagram: a source signal passed through a filter]

  16. Linear Prediction Theory (cont.) • The a_k coefficients are found by minimizing the squared error between the original sound s_t and its prediction ŝ_t = Σ_{k=1..P} a_k·s_{t−k} • Can be solved using Levinson-Durbin recursion (sketched below)
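
A minimal numpy sketch of this computation, assuming an autocorrelation formulation; the function and variable names are mine, not from the paper.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Return a_1..a_P minimizing the squared prediction error (Levinson-Durbin)."""
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation lags R[0..order] of the frame.
    r = np.array([frame[:len(frame) - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order)      # predictor coefficients found so far
    err = r[0]               # prediction error energy (assumes a non-silent frame)
    for i in range(order):
        # Reflection coefficient for step i of the recursion.
        k = (r[i + 1] - a[:i] @ r[i:0:-1]) / err
        a_prev = a.copy()
        a[i] = k
        a[:i] = a_prev[:i] - k * a_prev[:i][::-1]
        err *= 1.0 - k * k
    return a
```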

  17. Linear Prediction Theory (cont.) • Coefficients represent the filter part • The filter is assumed constant for small “windows” on the original sample (10-30 ms windows) • Each window has its own coefficients • Sound source is either a pulse train (voiced) or white noise (unvoiced); see the sketch below
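
A sketch of the per-window pipeline under these assumptions. The 20 ms window size and the zero-crossing threshold are illustrative choices (the slides do not give the voiced/unvoiced test), and lpc_coefficients is the function sketched above.

```python
import numpy as np

def frames(signal, fs, window_ms=20):
    """Split the sample into fixed windows in which the filter is assumed constant."""
    step = int(fs * window_ms / 1000)   # 20 ms, inside the 10-30 ms range above
    return [signal[i:i + step] for i in range(0, len(signal) - step + 1, step)]

def is_voiced(frame):
    """Crude voiced/unvoiced guess: voiced frames have few zero crossings."""
    crossings = np.sum(np.abs(np.diff(np.sign(frame)))) / 2
    return crossings / len(frame) < 0.1   # threshold is an arbitrary assumption

# Per-window analysis: one coefficient set per window, e.g.
#   coeffs = [lpc_coefficients(w, 16) for w in frames(signal, fs)]
```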

  18. Linear Prediction for Recognition • Recognition on the raw coefficients is poor • Better to take the FFT of the coefficient vector • Keep only the first “half” of the FFT values • This is the “signature” of the sound (see the sketch below)
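
A sketch of the signature step; the 32-point FFT size is my assumption, chosen so that the first half yields the 16 values mentioned on the next slide.

```python
import numpy as np

def signature(coeffs, n_fft=32):
    """FFT the LPC coefficient vector and keep the first half's magnitudes."""
    spectrum = np.fft.fft(coeffs, n=n_fft)   # zero-padded FFT (size assumed)
    return np.abs(spectrum[:n_fft // 2])     # first "half" -> 16 values
```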

  19. Sound Signatures • 16 values represent the sound • Speaker independent • Unique for each phoneme • Easily recognized by machine
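
Machine recognition of a signature could look like the sketch below, assuming a simple nearest-neighbor matcher over Euclidean distance; the reference table contents are hypothetical placeholders, and the slides do not spell out the exact matching rule.

```python
import numpy as np

# Hypothetical reference table: phoneme -> stored 16-value signature.
REFERENCE = {
    "AA": np.zeros(16),   # placeholder vectors, for illustration only
    "EE": np.ones(16),
}

def classify(sig):
    """Return the phoneme whose reference signature is nearest (Euclidean)."""
    return min(REFERENCE, key=lambda p: np.linalg.norm(sig - REFERENCE[p]))
```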

  20. Viseme Scoring • Phonemes were chosen judiciously • Map one-to-one to visemes • Visemes scored independently using history: V_i = 0.9·V_(i−1) + 0.1·(1 if matched at step i, else 0) • Ramps up and down with successive matches/mismatches
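
The scoring rule on this slide transcribes directly to code; the surrounding dictionary scaffolding is my assumption about how the scores are stored.

```python
def update_scores(scores, matched_viseme):
    """One time step of V_i = 0.9*V_(i-1) + 0.1*(1 if matched, else 0)."""
    for v in scores:
        hit = 1.0 if v == matched_viseme else 0.0
        scores[v] = 0.9 * scores[v] + 0.1 * hit
    return scores
```

With the 0.9/0.1 split, a score reaches 1 − 0.9^n after n consecutive matches (about two-thirds after ten), which produces the ramping behavior noted above.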

  21. Rendering System • Uses Alias|Wavefront’s Maya package • Built-in support for “blend shapes” • Blend-shape weights mapped directly to viseme scores • Very expressive and flexible • An animation script is generated and later read into Maya (a sketch follows) • Rendered to a movie; QuickTime is used to add the original sound and produce the final movie
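
A sketch of generating such a script as Maya MEL setKeyframe commands. The node name "blendShape1" and per-viseme weight-attribute names are assumptions about the scene setup, not details from the paper.

```python
def write_mel(frame_scores, path="lipsync.mel"):
    """Write one setKeyframe command per viseme weight per frame.

    frame_scores: list of {viseme_name: score} dicts, one per frame.
    """
    with open(path, "w") as f:
        for frame, scores in enumerate(frame_scores):
            for viseme, value in scores.items():
                f.write('setKeyframe -time %d -value %.4f "blendShape1.%s";\n'
                        % (frame, value, viseme))
```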

  22. Results (Timing) • Precise timing can be achieved • Smoothing introduces “lag”

  23. Results (Other Examples) • A female speaker using the male phoneme set • Slower speech, male speaker

  24. Results (Other Examples) (cont.) • Accented speech with fast pace

  25. Results (Summary) • Good with basic speech • Good speaker independence (for normal speech) • Poor performance when speech: • Is too fast • Is accented • Contains phonemes not in the reference set (e.g. “w” and “th”)

  26. Conclusion • Linear Prediction provides several benefits: • Speaker independence • Easy to recognize automatically • Results are reasonable, but can be improved

  27. Future Work • Identify best set of phonemes and visemes • Phoneme classification could be improved with better matching algorithm (neural net?) • Larger phoneme reference set for more robust matching

  28. Results • Simple cases work very well • Timing is good and very responsive • Robust with respect to speaker • Cross-gender, multiple male speakers • Fails on: accents, speed, unknown phonemes • Problems with noisy samples • Can be smoothed but introduces “lag”

  29. End

  30. Automatic Is Possible • Spoken word is broken into phonemes • Phonemes comprehensively cover spoken sounds • Visemes are their visual correlates • Used in lip-reading and traditional animation • Physical speech (vocal cords, vocal tract) can be modeled • Source-filter model

  31. Sound Signatures (Speaker Independence)

  32. Sound Signatures (For Phonemes)

  33. Results (Normal Speech) • Normal speech, moderate pace
