
Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion


Presentation Transcript


  1. Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion Presented by: Iwan Boksebeld and Marijn Suijten

  2. Authors Tero Karras (NVIDIA), Timo Aila (NVIDIA), Samuli Laine (NVIDIA), Antti Herva (Remedy Entertainment), Jaakko Lehtinen (NVIDIA & Aalto University)

  3. Goals of the paper • Animate a 3D face mesh from audio alone • With the use of CNNs • While keeping latency low • And factoring in emotions

  4. The Problem • Animate the full face mesh for added realism • Incorporate emotion into the animation • Deal with the inherent ambiguity of audio • Design and train a CNN that does all of this

  5. Related Work

  6. Linguistics-based animation • Input is audio, often with a transcript • Animation results from language-based rules • The strength of this method is the high level of control • The weakness is the complexity of the system • An example of such a model is the Dominance model

  7. Machine learning techniques • Mostly work in 2D • Learn the rules that linguistics-based animation specifies by hand • Blend and/or concatenate images to produce results • Not directly useful for the application here

  8. Capturing emotions • Mostly based on user-supplied parameters • Some work uses neural networks • Creates a mapping from emotion parameters to facial expressions

  9. The technical side

  10. Audio processing • 16 kHz mono; normalized volume • 260 ms of past and future samples, 520 ms in total • Window size chosen empirically • Split into 64 audio frames of 16 ms each • 2x overlap: every 8 ms of audio appears in two frames • Hann window: reduces temporal aliasing effects
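
For illustration, a minimal sketch (not from the paper's code) of the framing step the slide describes: a 520 ms, 16 kHz window split into 64 overlapping 16 ms frames with an 8 ms hop, each multiplied by a Hann window. Function and variable names are placeholders.

import numpy as np

def frame_audio(window, sample_rate=16000, frame_ms=16, hop_ms=8, num_frames=64):
    # window: 8320 samples = 520 ms of volume-normalized 16 kHz mono audio
    frame_len = sample_rate * frame_ms // 1000   # 256 samples per frame
    hop_len = sample_rate * hop_ms // 1000       # 128 samples -> 2x overlap
    hann = np.hanning(frame_len)                 # suppress temporal aliasing
    frames = [window[i * hop_len : i * hop_len + frame_len] * hann
              for i in range(num_frames)]
    return np.stack(frames)                      # shape (64, 256)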

  11. Autocorrelation • Calculate K = 32 autocorrelation coefficients per frame • 12 would be enough to identify individual phonemes • More are needed to also capture pitch • No special techniques applied to make phonemes linearly separable • Their tests indicate this representation is clearly superior
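
A minimal sketch of computing the K = 32 autocorrelation coefficients for each windowed frame, assuming the simple biased estimator (the slide does not state which estimator is used):

import numpy as np

def autocorr_features(frames, K=32):
    # frames: (num_frames, frame_len) Hann-windowed audio frames
    num_frames, frame_len = frames.shape
    feats = np.empty((num_frames, K))
    for i, frame in enumerate(frames):
        for k in range(K):
            # biased autocorrelation at lag k
            feats[i, k] = np.dot(frame[:frame_len - k], frame[k:]) / frame_len
    return feats   # (64, 32) input "image" for the formant analysis network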

  12. CNN Layout • Formant analysis network: • First layer performs the audio processing and autocorrelation • Time axis of 64 frames • 32 autocorrelation coefficients per frame • Followed by 5 convolution layers • Convert the raw formant information into 256 abstract feature maps
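
A hedged PyTorch sketch of what such a formant analysis network could look like: five strided convolutions along the autocorrelation axis collapse the 32 coefficients down to 1 while growing the channel count to 256, leaving the 64-frame time axis untouched. Kernel sizes, strides, and the intermediate channel counts are assumptions, not taken from the paper.

import torch
import torch.nn as nn

class FormantAnalysis(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [1, 72, 108, 162, 243, 256]   # illustrative channel progression
        self.layers = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1],
                          kernel_size=(1, 3), stride=(1, 2), padding=(0, 1)),
                nn.ReLU(),
            )
            for i in range(5)
        ])

    def forward(self, x):          # x: (batch, 1, 64, 32) autocorrelation "image"
        return self.layers(x)      # -> (batch, 256, 64, 1) abstract feature maps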

  13. CNN Layout • Articulation network • Analyzes the temporal evolution of the features • Also 5 convolution layers • Emotion vector concatenated to the layer inputs
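
A hedged PyTorch sketch of the articulation network's structure, under the same caveats: convolutions along the time axis, with the learned emotion vector tiled and concatenated onto the input of every layer. The emotion dimension, vertex count, and the simplified output head are placeholders, not the paper's exact design.

import torch
import torch.nn as nn

class Articulation(nn.Module):
    def __init__(self, emotion_dim=16, num_vertices=5000):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(256 + emotion_dim, 256, kernel_size=(3, 1),
                      stride=(2, 1), padding=(1, 0))
            for _ in range(5)
        ])
        self.relu = nn.ReLU()
        self.out = nn.Linear(256 + emotion_dim, 3 * num_vertices)

    def forward(self, feats, emotion):
        # feats: (B, 256, 64, 1) from the formant analysis network
        # emotion: (B, emotion_dim) learned emotion vector
        x = feats
        for conv in self.convs:
            e = emotion[:, :, None, None].expand(-1, -1, x.shape[2], x.shape[3])
            x = self.relu(conv(torch.cat([x, e], dim=1)))   # concat at every layer
        x = x.mean(dim=(2, 3))                              # collapse remaining time axis
        return self.out(torch.cat([x, emotion], dim=1))     # (B, 3 * num_vertices)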

  14. Working with emotions • Speech is highly ambiguous • Consider silence: what should the face look like?

  15. Representing Emotions • Emotional state stored as a "meaningless" E-dimensional vector • Learned jointly with the network • Vector concatenated to the convolution layers in the articulation network • Concatenating it to every layer gives significantly better results • This gives early layers nuanced control over details such as coarticulation • While later layers have more control over the output pose
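
For illustration, one simple way to realize a per-sample emotion vector that is trained jointly with the network weights is an embedding table keyed by training-sample index; the dimension E = 16 and the initialization are assumptions.

import torch
import torch.nn as nn

class EmotionDatabase(nn.Module):
    def __init__(self, num_training_samples, emotion_dim=16):
        super().__init__()
        # one trainable E-dimensional vector per training sample
        self.table = nn.Embedding(num_training_samples, emotion_dim)
        nn.init.normal_(self.table.weight, std=0.01)   # start close to neutral

    def forward(self, sample_indices):                 # (B,) long tensor of indices
        return self.table(sample_indices)              # (B, E) emotion vectors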

  16. Training

  17. Training Target • Use 9 cameras to obtain an unstructured mesh and optical flow • Project a template mesh onto the unstructured mesh • Link the optical flow to the template • The template mesh is then tracked across the performance • Use a subset of vertices to stabilize the head • Limitation: the tongue is not captured

  18. Training Data • Pangrams and in-character material • 3-5 minutes per actor (trade-off between quality and time/cost) • Pangrams: sentences designed to cover as many sounds of a language as possible • In-character: captures emotions relevant to the character's narrative • Time-shifting used for data augmentation
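
A minimal sketch of what the time-shifting augmentation could look like: re-extract the 520 ms input window at a small random offset around the target frame while keeping the same mesh target. The ±8 ms range is an assumption, not a figure from the paper.

import numpy as np

def random_time_shift(audio, center_sample, sample_rate=16000,
                      window_ms=520, max_shift_ms=8):
    # audio: full 16 kHz mono recording
    # center_sample: sample index of the video frame the mesh target belongs to
    half = sample_rate * window_ms // 2000                          # 4160 samples
    shift = np.random.randint(-max_shift_ms, max_shift_ms + 1) * sample_rate // 1000
    start = center_sample - half + shift
    return audio[start : start + 2 * half]                          # shifted 520 ms window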

  19. Loss Function • Loss function consists of 3 terms: • Position term • Motion term • Regularization term • A normalization scheme is used to balance these terms

  20. Position Term • Ensures correct vertex locations • V: number of vertices • y: desired vertex position • ŷ: predicted vertex position
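
The slide omits the formula; a plausible form, assuming a mean squared error over the 3V vertex coordinates of training sample x:

P(x) = \frac{1}{3V} \sum_{v=1}^{V} \left\lVert y_v^{(x)} - \hat{y}_v^{(x)} \right\rVert^2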

  21. Motion Term • Ensures correct motion • Compares paired frames • m[·]: difference between the paired frames
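
Again hedged: assuming the same mean-squared form applied to the finite difference m[·] taken between the two frames of a training pair (any extra weighting between the terms is omitted):

M(x) = \frac{1}{3V} \sum_{v=1}^{V} \left\lVert m\!\left[ y_v^{(x)} \right] - m\!\left[ \hat{y}_v^{(x)} \right] \right\rVert^2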

  22. Regularization Term • Keeps the emotion vector from varying erratically • Normalized to prevent the term from becoming ineffective • E: number of emotion components • e(i): i-th component of the emotion vector for sample x
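
A plausible form, assuming the regularizer penalizes the squared short-term change m[·] of each emotion component across a training pair; the normalization scheme the slide mentions is not reproduced here:

R(x) = \frac{1}{E} \sum_{i=1}^{E} m\!\left[ e_i^{(x)} \right]^2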

  23. Inference

  24. Inferring emotion • Step 1: Cull "wrong" emotion vectors • Bilabials -> closed mouth • Vowels -> open mouth • Step 2: Visually inspect the animation • Remove vectors with short-term effects • Step 3: Use a voice from a different actor • Unnatural results -> lack of generalization • Manually assign semantic meaning to the remaining vectors • Interpolate emotion vectors for transitions/complex emotions

  25. Results

  26. Results

  27. User Study Setup • Blind user study with 20 participants • Users were asked to choose which of two clips was more realistic • Two sets of experiments • Comparison against other methods • DM (dominance model) vs. PC (performance capture) vs. theirs • Audio from a validation set not used in training • 13 clips of 3-8 seconds • Generalization over language and gender • 14 clips from several languages • Taken from an online database without checking the output

  28. User Study Results

  29. User Study Results

  30. User Study Results • Clearly better than DM • Still quite a bit worse than PC • Generalizes quite well across languages • Even compared to the linguistics-based method

  31. Critical Review

  32. Drawbacks of solution • No residual motion • No blinking • No head movement • Assumes a higher-level system handles these • Problems with similar-sounding phonemes • E.g. B and G can be confused • Rapidly spoken languages are a problem • Novel data needs to be somewhat similar to the training data • Misses detail compared to PC • Emotion vectors have no predefined meaning

  33. Questions?

  34. Discussion

  35. Discussion • Is it useful to gather emotion states the way this paper describes? • Why not simply tell the network what the emotion is?

  36. Discussion • Should they have used more participants?

  37. Discussion • Why do you think blinking and eye/head motion are not covered by the network?
