
Create Photo-Realistic Talking Face

Changbo Hu, 2001.11.26. This work was done while visiting Microsoft Research China with Baining Guo and Bo Zhang.

Presentation Transcript


  1. Create Photo-Realistic Talking Face. Changbo Hu, 2001.11.26. *This work was done while visiting Microsoft Research China with Baining Guo and Bo Zhang.

  2. Outline • Introduction of talking face • Motivations • System overview • Techniques • Conclusions

  3. Introduction • What is a talking face? • Face (lip) animation driven by voice • Applications • The process of creating a talking face • Face model • Motion capture • Mapping between audio and video • Rendering: photo-realistic?

  4. Literature • Waters, 93, DECface, 2D wireframe model • Terzopoulos, 95, Skin and muscle model • Bregler, 97, Video Rewrite, Sample-image based • T. S. Huang, 98, Mesh model from range data • Poggio, 98, MikeTalk, Viseme morphing • Guenter, 99, Making Faces, 3D from multiple cameras • Zhengyou Zhang, 00, 3D face modeling from video through the epipolar constraint • Cosatto, 00, Planar quads model

  5. Some Face models

  6. Motivations • Aim: a graphical interface for a conversational agent • Photo-realistic • Driven by Chinese speech • Smooth connection between sentences • Extended from “Video Rewrite”

  7. System overview: Pipeline of the system (1)

  8. System overview: Pipeline of the system (2) (diagram labels: New text; TTS system; Wav sound; Segmentation; Triphone sequence; Training database; Synthesized triphone sequence; Background sequence; Lip motion sequence; Rewrite to faces)

  9. Techniques • Analysis: audio processing, image processing • Synthesis: lip image, background image, stitching together

  10. Audio part: Sound Segmentation • Given the wav file and the script • Use an HMM to train the segmentation system • Segment the wav file into a phoneme sequence • Example of the segmentation result (phoneme, start, end):
      SILOPEN   0   23
      SILOPEN  24   42
      s        43   61
      if4      62   74
      j        75   80
      ia1      81   97
      sh       98  109
      ang1    110  121
      y       122  130
      e4      131  133
      y       134  145
      in2     146  154
      h       155  164
      ang2    165  194
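The segmentation result above is a flat run of label/start/end triples. A minimal Python sketch of reading it into records follows; the names `Segment` and `parse_segments` are illustrative and not from the original system, and the slide does not state the unit of the start and end values, so they are kept as plain integers.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    label: str  # phoneme label, e.g. "ang1" or "SILOPEN"
    start: int  # first index covered by the phoneme (unit not given on the slide)
    end: int    # last index covered by the phoneme, inclusive


def parse_segments(text: str) -> List[Segment]:
    """Parse whitespace-separated label/start/end triples."""
    tokens = text.split()
    if len(tokens) % 3 != 0:
        raise ValueError("expected label/start/end triples")
    return [
        Segment(tokens[i], int(tokens[i + 1]), int(tokens[i + 2]))
        for i in range(0, len(tokens), 3)
    ]


# The example string from the slide.
SLIDE_EXAMPLE = (
    "SILOPEN 0 23 SILOPEN 24 42 s 43 61 if4 62 74 j 75 80 ia1 81 97 "
    "sh 98 109 ang1 110 121 y 122 130 e4 131 133 y 134 145 in2 146 154 "
    "h 155 164 ang2 165 194"
)
segments = parse_segments(SLIDE_EXAMPLE)
```

In the slide's example each segment starts one past the end of the previous one, so the records tile the whole utterance without gaps.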

  11. Annotation with Phonemes • Video frames are annotated with phonemes • Each phoneme in a sentence corresponds to a short span of the video sequence

  12. Phoneme Distance Analysis • Phoneme and triphone basics • Chinese phonemes vs. English phonemes • Distance metric definitions • Results

  13. Phoneme Basics • Phonemes represent the basic elements of speech. All possible speech can be represented by a combination of phonemes. CH, JH, S, EH, EY, OY, AE, SIL… • A triphone is three consecutive phonemes. It represents not only pronunciation characteristics but also context information. T-IY-P, IY-P-AA, P-AA-T…
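As a small illustration of the triphone definition, the sketch below slides a window of three over a phoneme sequence. The function name `triphones` and the hyphen-joined output format are choices made for this example, matching the way the slide writes T-IY-P.

```python
from typing import List, Sequence


def triphones(phonemes: Sequence[str]) -> List[str]:
    """Return every run of three consecutive phonemes, joined with '-'."""
    return ["-".join(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


example = triphones(["T", "IY", "P", "AA", "T"])
```

A sequence of n phonemes yields n - 2 overlapping triphones, and each neighboring pair shares two phonemes.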

  14. Chinese Phonemes vs. English • Chinese phonemes fall into two basic groups: Initials and Finals. Initials: B, P, M, F, … Finals: a3, o1, e2, eng3, iang4, ue5, … • Each Chinese final has 5 tones: 1, 2, 3, 4, 5. Different tones: a1, a2, a3, a4, a5. • A Chinese final is not actually a basic element of speech. For example: iang1, iao1, uang1, iong1… • The Chinese phoneme set is much larger than the English one.

  15. Phoneme Distance Analysis • Define the distance between any two phonemes. • Since we synthesize only video, not sound, tone is ignored. • Lip shape motion is the core element of the distance metric.
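Ignoring tone can be sketched as stripping the trailing tone digit from a final's label, so that a1 through a5 collapse to one class. This assumes the labels follow the form shown on the slides (a final followed by one digit from 1 to 5); `strip_tone` is a hypothetical helper name.

```python
import re

_TONE = re.compile(r"[1-5]$")  # a single trailing tone digit


def strip_tone(label: str) -> str:
    """Drop the tone digit from a final; initials and other labels pass through."""
    return _TONE.sub("", label)


toneless = [strip_tone(p) for p in ["sh", "ang1", "y", "e4", "SILOPEN"]]
```

Collapsing tones this way shrinks the set of phoneme classes that the distance analysis has to compare.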

  16. Phoneme Distance Analysis • For each phoneme, collect its video samples (in the diagram, Phoneme 1 has Videos 1 to 4 and Phoneme 2 has Videos 1 and 2) • Time-align the samples to a uniform length • Average the aligned videos to get an average video for the phoneme • By comparing the aligned average videos of each pair of phonemes, we generate the distance matrix of the whole phoneme set.
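The align, average, and compare steps can be sketched as follows. Several things here are assumptions for illustration: each video is represented as one lip-shape feature vector per frame, time alignment is done by linear interpolation, and the distance between two average videos is the mean per-frame Euclidean distance. The slides do not specify any of these choices.

```python
import math
from typing import Dict, List, Sequence, Tuple

Video = Sequence[Sequence[float]]  # assumed: one lip-shape feature vector per frame


def resample(video: Video, length: int) -> List[List[float]]:
    """Linearly interpolate a video to exactly `length` frames."""
    if not video or length < 1:
        raise ValueError("need a non-empty video and length >= 1")
    if len(video) == 1 or length == 1:
        return [list(video[0]) for _ in range(length)]
    out = []
    for k in range(length):
        pos = k * (len(video) - 1) / (length - 1)  # fractional source index
        lo = int(pos)
        hi = min(lo + 1, len(video) - 1)
        t = pos - lo
        out.append([(1 - t) * a + t * b for a, b in zip(video[lo], video[hi])])
    return out


def average_video(videos: Sequence[Video], length: int) -> List[List[float]]:
    """Time-align all samples of one phoneme and average them frame by frame."""
    aligned = [resample(v, length) for v in videos]
    return [
        [sum(vals) / len(aligned) for vals in zip(*(a[k] for a in aligned))]
        for k in range(length)
    ]


def video_distance(a: Video, b: Video) -> float:
    """Mean per-frame Euclidean distance between two equal-length videos."""
    per_frame = [
        math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))
        for fa, fb in zip(a, b)
    ]
    return sum(per_frame) / len(per_frame)


def distance_matrix(
    samples: Dict[str, Sequence[Video]], length: int
) -> Dict[Tuple[str, str], float]:
    """Distance between the average videos of every pair of phonemes."""
    averages = {p: average_video(v, length) for p, v in samples.items()}
    return {
        (p, q): video_distance(averages[p], averages[q])
        for p in averages
        for q in averages
    }


samples = {
    "a": [[[0.0], [2.0]], [[0.0], [1.0], [2.0]]],  # two samples of phoneme "a"
    "b": [[[1.0], [3.0]]],                          # one sample of phoneme "b"
}
matrix = distance_matrix(samples, length=3)
```

Averaging after alignment means a phoneme with many samples and a phoneme with few are compared on the same footing, one average video each.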

  17. Image part: Pose Tracking • Assume a planar model for the face • Standard minimization method to find the transform matrix (affine transform) [Black, 95] • A mask is used to restrict tracking to the part of the face of interest (figures: template picture, mask image)
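To make the idea of "minimize to find an affine transform" concrete, here is a simplified stand-in: a least-squares affine fit to point correspondences. This is not the image-based minimization cited on the slide, which works on pixel intensities inside the mask; the point-based formulation and the names `fit_affine` and `solve3` are assumptions chosen to keep the example short.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve3(m: Sequence[Sequence[float]], r: Sequence[float]) -> List[float]:
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [list(row) + [rv] for row, rv in zip(m, r)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda i: abs(a[i][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, 3):
            f = a[row][col] / a[col][col]
            for k in range(col, 4):
                a[row][k] -= f * a[col][k]
    x = [0.0, 0.0, 0.0]
    for i in reversed(range(3)):
        s = a[i][3] - sum(a[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = s / a[i][i]
    return x


def fit_affine(src: Sequence[Point], dst: Sequence[Point]):
    """Least-squares fit of u = a*x + b*y + c and v = d*x + e*y + f.

    Returns ([a, b, c], [d, e, f]) minimizing the squared error over all pairs.
    """
    m = [[0.0] * 3 for _ in range(3)]
    ru, rv = [0.0] * 3, [0.0] * 3
    for (x, y), (u, v) in zip(src, dst):
        basis = (x, y, 1.0)
        for i in range(3):
            for j in range(3):
                m[i][j] += basis[i] * basis[j]  # normal-equation matrix
            ru[i] += basis[i] * u
            rv[i] += basis[i] * v
    return solve3(m, ru), solve3(m, rv)


# Points on a unit square, scaled by 2 and shifted by (3, -1).
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
dst = [(3.0, -1.0), (5.0, -1.0), (3.0, 1.0), (5.0, 1.0)]
row_u, row_v = fit_affine(src, dst)
```

In a masked setting, only correspondences that fall inside the mask would be passed to the fit, so that moving regions such as the mouth do not disturb the pose estimate.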

  18. Pose tracking • Motion prediction using parameters with physical meaning

  19. Pose Tracking Some tracking results:

  20. Lip Motion Tracking • Using Eigen Points (Covell, 91) • Feature points include the jaw, lips and teeth • Training database labeled manually • Automatic tracking through all pose-tracked images

  21. Lip motion tracking

  22. Lip Motion Tracking • Training database (hand-labeled) • Auto-tracking results

  23. Synthesizing new sentences • New text is converted to wav by the TTS system • The wav is segmented into a phoneme sequence • DP is used to find an optimal video sequence from the training database • Time-align the triphone videos and stitch them together. • Transform the lip sequence and paste it onto the background faces.

  24. Lip sequence synthesis (diagram labels: New phoneme sequences; Optimal phoneme sequences; Triphones 1 to 9 and A to C)

  25. Dynamic Programming (diagram labels: Begin; Triphone1 to Triphone5; End)

  26. Edge Cost Definition • Two parts: • Phoneme distance: the distances of the 3 phonemes added together • Lip shape distance for the overlapping portion of the triphone videos • The two parts are combined as a weighted sum
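The dynamic programming search with this two-part cost can be sketched as below. The data layout is an assumption: each database clip carries its triphone plus lip-shape feature vectors for its head and tail overlap regions, and the names `Clip`, `best_path`, `w_phoneme` and `w_lip` are hypothetical. The phoneme distance function is passed in, since in the described system it would come from the phoneme distance matrix.

```python
import math
from typing import Callable, List, NamedTuple, Sequence, Tuple

Triphone = Tuple[str, str, str]


class Clip(NamedTuple):
    triphone: Triphone
    head: Tuple[float, ...]  # assumed lip-shape features where the clip begins
    tail: Tuple[float, ...]  # assumed lip-shape features where the clip ends


def phoneme_cost(target: Triphone, clip: Clip,
                 dist: Callable[[str, str], float]) -> float:
    """Sum of the three phoneme-to-phoneme distances."""
    return sum(dist(a, b) for a, b in zip(target, clip.triphone))


def lip_cost(prev: Clip, nxt: Clip) -> float:
    """Euclidean distance between lip shapes across the overlap."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(prev.tail, nxt.head)))


def best_path(targets: Sequence[Triphone],
              candidates: Sequence[Sequence[Clip]],
              dist: Callable[[str, str], float],
              w_phoneme: float = 1.0,
              w_lip: float = 1.0) -> Tuple[List[Clip], float]:
    """Pick one clip per target triphone, minimizing the total weighted cost."""
    if not targets or len(targets) != len(candidates):
        raise ValueError("need one non-empty candidate list per target")
    # cost[j]: cheapest total cost of any path ending in candidates[t][j]
    cost = [w_phoneme * phoneme_cost(targets[0], c, dist) for c in candidates[0]]
    back: List[List[int]] = []
    for t in range(1, len(targets)):
        new_cost, pointers = [], []
        for c in candidates[t]:
            steps = [cost[i] + w_lip * lip_cost(p, c)
                     for i, p in enumerate(candidates[t - 1])]
            i_best = min(range(len(steps)), key=steps.__getitem__)
            new_cost.append(steps[i_best]
                            + w_phoneme * phoneme_cost(targets[t], c, dist))
            pointers.append(i_best)
        cost = new_cost
        back.append(pointers)
    j = min(range(len(cost)), key=cost.__getitem__)
    total = cost[j]
    path = [j]
    for pointers in reversed(back):  # walk the back-pointers to the start
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[t][k] for t, k in enumerate(path)], total


def zero_one(a: str, b: str) -> float:
    return 0.0 if a == b else 1.0  # toy phoneme distance for the demo


targets = [("a", "b", "c"), ("b", "c", "d")]
candidates = [
    [Clip(("a", "b", "c"), (0.0,), (0.0,)), Clip(("a", "b", "c"), (0.0,), (5.0,))],
    [Clip(("b", "c", "d"), (5.0,), (0.0,)), Clip(("b", "c", "e"), (0.0,), (0.0,))],
]
clips, total = best_path(targets, candidates, zero_one)
```

In this toy case the search picks the second clip for the first triphone because its tail matches the head of the exact-match clip for the second triphone, which a greedy per-triphone choice would not be guaranteed to find. The two weights trade phonetic accuracy against visual smoothness at the joins.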

  27. Background video generation • The background is a video sequence in which the virtual character is saying something else • Similarity measurement between background frames • Select a “standard frame”: the frame with the maximal number of frames similar to it • Filter out jerky frames
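Choosing the standard frame can be sketched as a simple counting problem. The frame representation (a feature vector per frame, for instance pose parameters), the Euclidean similarity test, and the `threshold` parameter are all assumptions for this example; the slide does not say how similarity is measured. Jerkiness filtering is not covered here.

```python
import math
from typing import Sequence


def select_standard_frame(frames: Sequence[Sequence[float]],
                          threshold: float) -> int:
    """Index of the frame that has the most frames within `threshold` of it."""
    if not frames:
        raise ValueError("no frames given")

    def dist(a: Sequence[float], b: Sequence[float]) -> float:
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # For each frame, count how many frames (itself included) are similar to it.
    counts = [sum(1 for g in frames if dist(f, g) <= threshold) for f in frames]
    return max(range(len(frames)), key=counts.__getitem__)


frames = [(0.0,), (0.1,), (0.2,), (5.0,)]
standard = select_standard_frame(frames, threshold=0.15)
```

The pairwise comparison is quadratic in the number of frames, which is acceptable for a short background clip but would need a smarter search for long footage.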

  28. Stitch the time-aligned result to background faces • Write back with a mask • Transform the synthesized lip to the background face
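Writing back with a mask can be illustrated as per-pixel blending. This sketch assumes grayscale images stored as nested lists, a mask with values in [0, 1], and a lip image that has already been transformed into the background frame's coordinates; `write_back` is a hypothetical name.

```python
from typing import List, Sequence

Image = Sequence[Sequence[float]]


def write_back(background: Image, lip: Image, mask: Image) -> List[List[float]]:
    """Blend lip pixels into the background: mask 1 keeps lip, mask 0 keeps background."""
    return [
        [m * l + (1.0 - m) * b for b, l, m in zip(bg_row, lip_row, mask_row)]
        for bg_row, lip_row, mask_row in zip(background, lip, mask)
    ]


background = [[10.0, 10.0], [10.0, 10.0]]
lip = [[20.0, 20.0], [20.0, 20.0]]
mask = [[0.0, 1.0], [0.5, 0.0]]
result = write_back(background, lip, mask)
```

Fractional mask values near the border of the mouth region feather the seam instead of leaving a hard edge.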

  29. Write-back operation (figures: mask image for the write-back operation; original background frame; write-back result for the same frame)

  30. More video results

  31. More video results

  32. Conclusion and Future Work • Pose tracking and lip motion tracking • Size of the training database • Talking face with expression • Real-time generation? • Fast modeling for different people

  33. Animation

  34. Thank you
