
Automatic Lip-Synchronization Using Linear Prediction of Speech


Presentation Transcript


  1. Automatic Lip-Synchronization Using Linear Prediction of Speech Christopher Kohnert and SK Semwal, University of Colorado, Colorado Springs

  2. Topics of Presentation • Introduction and Background • Linear Prediction Theory • Sound Signatures • Viseme Scoring • Rendering System • Results • Conclusions

  3. Justification • Need: • Existing methods are labor intensive • Poor results • Expensive • Solution: • Automatic method • “Decent” results

  4. Applications of Automatic System • Typical applications benefiting from an automatic method: • Real-time video communication • Synthetic computer agents • Low-budget animation scenarios: • Video game industry

  5. Automatic Is Possible • Spoken word is broken into phonemes • Phonemes comprehensively cover spoken sounds • Visemes are their visual correlates • Used in lip-reading and traditional animation

  6. Existing Methods of Synchronization • Text Based • Analyze text to extract phonemes • Speech Based • Volume tracking • Speech recognition front-end • Linear Prediction • Hybrids • Text & Speech • Image & Speech

  7. Speech Based is Best • Doesn’t need a script • Fully automatic • Can use the original sound sample (best quality) • Can use the source-filter model

  8. Source-Filter Model • Models a sound signal as a source passed through a filter • Source: lungs & vocal cords • Filter: vocal tract • Implemented using Linear Prediction
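
As a rough illustration of this model (not the authors' code), the sketch below drives an assumed all-pole filter with a pulse-train source to approximate a voiced sound; the sample rate, pitch, and coefficient values are arbitrary placeholders.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                        # sample rate in Hz (assumed)
pitch_hz = 120                   # pulse-train rate, roughly a male voice
n = fs // 10                     # 100 ms of output

# Source: an impulse train models voiced sounds; white noise would model unvoiced.
source = np.zeros(n)
source[::fs // pitch_hz] = 1.0

# Filter: all-pole IIR filter standing in for the vocal tract. The denominator
# is 1 followed by the negated a_k terms; these values are illustrative only.
den = np.array([1.0, -1.3, 0.7])
voiced = lfilter([1.0], den, source)
```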

  9. Speech Related Topics • Phoneme recognition • How many to use? • Mapping phonemes to visemes • Use visually distinctive ones (e.g. vowel sounds) • Coarticulation effect

  10. The Coarticulation Effect • The blending of sounds by adjacent phonemes (common in everyday speech) • Not captured by discrete phoneme recognition • Causes poor visual synchronization (transitions are jerky and unnatural)

  11. Speech Encoding Methods • Pulse Code Modulation (PCM) • Vocoding • Linear Prediction

  12. Pulse Code Modulation • Raw digital sampling • High quality sound • Very high bandwidth requirements (e.g. CD audio: 44.1 kHz × 16 bits ≈ 706 kbit/s per channel)

  13. Vocoding • Stands for VOice-enCODing • Origins in military applications • Models physical entities (tongue, vocal cord, jaw, etc.) • Poor sound quality (tin can voices) • Very low bandwidth requirements

  14. Linear Prediction • Hybrid method (of PCM and Vocoding) • Models sound source and filter separately • Uses original sound sample to calculate recreation parameters (minimum error) • Low bandwidth requirements • Pitch and intonation independence

  15. Linear Prediction Theory • Source-filter model • P filter coefficients (a_k) are calculated [slide diagram: a source signal passed through a filter]

  16. Linear Prediction Theory (cont.) • The a_k coefficients are found by minimizing the squared error between the original sound s_t and its prediction ŝ_t = Σ_{k=1..P} a_k·s_{t−k} • Can be solved using Levinson-Durbin recursion (sketched below)
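
A minimal numpy sketch of this computation, assuming an autocorrelation formulation; the function and variable names are mine, not from the paper.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Return a_1..a_P minimizing the squared prediction error (Levinson-Durbin)."""
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation lags R[0..order] of the frame.
    r = np.array([frame[:len(frame) - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order)      # predictor coefficients found so far
    err = r[0]               # prediction error energy (assumes a non-silent frame)
    for i in range(order):
        # Reflection coefficient for step i of the recursion.
        k = (r[i + 1] - a[:i] @ r[i:0:-1]) / err
        a_prev = a.copy()
        a[i] = k
        a[:i] = a_prev[:i] - k * a_prev[:i][::-1]
        err *= 1.0 - k * k
    return a
```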

  17. Linear Prediction Theory (cont.) • Coefficients represent the filter part • The filter is assumed constant for small “windows” on the original sample (10-30 ms windows) • Each window has its own coefficients • Sound source is either a pulse train (voiced) or white noise (unvoiced); see the sketch below
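
A sketch of the per-window pipeline under these assumptions. The 20 ms window size and the zero-crossing threshold are illustrative choices (the slides do not give the voiced/unvoiced test), and lpc_coefficients is the function sketched above.

```python
import numpy as np

def frames(signal, fs, window_ms=20):
    """Split the sample into fixed windows in which the filter is assumed constant."""
    step = int(fs * window_ms / 1000)   # 20 ms, inside the 10-30 ms range above
    return [signal[i:i + step] for i in range(0, len(signal) - step + 1, step)]

def is_voiced(frame):
    """Crude voiced/unvoiced guess: voiced frames have few zero crossings."""
    crossings = np.sum(np.abs(np.diff(np.sign(frame)))) / 2
    return crossings / len(frame) < 0.1   # threshold is an arbitrary assumption

# Per-window analysis: one coefficient set per window, e.g.
#   coeffs = [lpc_coefficients(w, 16) for w in frames(signal, fs)]
```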

  18. Linear Prediction for Recognition • Recognition on the raw coefficients is poor • Better to take the FFT of the coefficient vector • Keep only the first “half” of the FFT values • This is the “signature” of the sound (see the sketch below)
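
A sketch of the signature step; the 32-point FFT size is my assumption, chosen so that the first half yields the 16 values mentioned on the next slide.

```python
import numpy as np

def signature(coeffs, n_fft=32):
    """FFT the LPC coefficient vector and keep the first half's magnitudes."""
    spectrum = np.fft.fft(coeffs, n=n_fft)   # zero-padded FFT (size assumed)
    return np.abs(spectrum[:n_fft // 2])     # first "half" -> 16 values
```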

  19. Sound Signatures • 16 values represent the sound • Speaker independent • Unique for each phoneme • Easily recognized by machine
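
Machine recognition of a signature could look like the sketch below, assuming a simple nearest-neighbor matcher over Euclidean distance; the reference table contents are hypothetical placeholders, and the slides do not spell out the exact matching rule.

```python
import numpy as np

# Hypothetical reference table: phoneme -> stored 16-value signature.
REFERENCE = {
    "AA": np.zeros(16),   # placeholder vectors, for illustration only
    "EE": np.ones(16),
}

def classify(sig):
    """Return the phoneme whose reference signature is nearest (Euclidean)."""
    return min(REFERENCE, key=lambda p: np.linalg.norm(sig - REFERENCE[p]))
```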

  20. Viseme Scoring • Phonemes were chosen judiciously • Map one-to-one to visemes • Visemes scored independently using history: V_i = 0.9·V_(i−1) + 0.1·(1 if matched at step i, else 0) • Ramps up and down with successive matches/mismatches
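
The scoring rule on this slide transcribes directly to code; the surrounding dictionary scaffolding is my assumption about how the scores are stored.

```python
def update_scores(scores, matched_viseme):
    """One time step of V_i = 0.9*V_(i-1) + 0.1*(1 if matched, else 0)."""
    for v in scores:
        hit = 1.0 if v == matched_viseme else 0.0
        scores[v] = 0.9 * scores[v] + 0.1 * hit
    return scores
```

With the 0.9/0.1 split, a score reaches 1 − 0.9^n after n consecutive matches (about two-thirds after ten), which produces the ramping behavior noted above.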

  21. Rendering System • Uses Alias|Wavefront’s Maya package • Built-in support for “blend shapes” • Blend-shape weights mapped directly to viseme scores • Very expressive and flexible • An animation script is generated and later read into Maya (a sketch follows) • Rendered to a movie; QuickTime is used to add the original sound and produce the final movie
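
A sketch of generating such a script as Maya MEL setKeyframe commands. The node name "blendShape1" and per-viseme weight-attribute names are assumptions about the scene setup, not details from the paper.

```python
def write_mel(frame_scores, path="lipsync.mel"):
    """Write one setKeyframe command per viseme weight per frame.

    frame_scores: list of {viseme_name: score} dicts, one per frame.
    """
    with open(path, "w") as f:
        for frame, scores in enumerate(frame_scores):
            for viseme, value in scores.items():
                f.write('setKeyframe -time %d -value %.4f "blendShape1.%s";\n'
                        % (frame, value, viseme))
```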

  22. Results (Timing) • Precise timing can be achieved • Smoothing introduces “lag”

  23. Results (Other Examples) • A female speaker using the male phoneme set • Slower speech, male speaker

  24. Results (Other Examples) (cont.) • Accented speech with fast pace

  25. Results (Summary) • Good with basic speech • Good speaker independence (for normal speech) • Poor performance when speech: • Is too fast • Is accented • Contains phonemes not in the reference set (e.g. “w” and “th”)

  26. Conclusion • Linear Prediction provides several benefits: • Speaker independence • Easy to recognize automatically • Results are reasonable, but can be improved

  27. Future Work • Identify best set of phonemes and visemes • Phoneme classification could be improved with better matching algorithm (neural net?) • Larger phoneme reference set for more robust matching

  28. Results • Simple cases work very well • Timing is good and very responsive • Robust with respect to speaker • Cross-gender, multiple male speakers • Fails on: accents, speed, unknown phonemes • Problems with noisy samples • Can be smoothed but introduces “lag”

  29. End

  30. Automatic Is Possible • Spoken word is broken into phonemes • Phonemes comprehensively cover spoken sounds • Visemes are their visual correlates • Used in lip-reading and traditional animation • Physical speech (vocal cords, vocal tract) can be modeled • Source-filter model

  31. Sound Signatures (Speaker Independence)

  32. Sound Signatures (For Phonemes)

  33. Results (Normal Speech) • Normal speech, moderate pace
