

  1. Transforming The Emotion In Speech: Techniques for Prosodic and Spectral Emotion Conversion Zeynep Inanoglu & Steve Young Machine Intelligence Lab, CUED September 18th, 2007

  2. Agenda • Problem Definition • Emotional Speech Data • Emotion Conversion System (Interspeech 07) • Spectral Conversion • Duration Conversion • HMM-based F0 Generation • Subjective Evaluation • F0 Segment Selection • Unit Preparation • Segment Selection • Weight Determination • Comparative Evaluation

3. Problem Definition • Emotion conversion is useful in TTS frameworks where collecting a unit-selection corpus for each target emotion is costly. • Given a handful of utterances spoken in the desired emotion (a few hundred utterances vs. thousands), transform neutral speech into emotional speech. • Assume a TTS framework: • A high-quality neutral utterance has been synthesized. • Annotation files and transcriptions are available. • When transforming the emotional quality, ensure the speech remains natural and artifact-free overall.

4. Emotional Speech Data • Parallel data collection of neutral, happy, sad, surprised and angry speech.* • Professional female voice talent. • For the experiments reported here, 272 utterances per emotion were used for training and 28 for testing. • Post-processing: phonetic alignment, syllable boundaries, pitch marks, text analysis. • Perceptual test on the original speech data: 15 participants, 10 utterances per emotion. * All data collection efforts were supervised and funded by Toshiba Research.

5. A Multi-Level Emotion Conversion System [block diagram: Neutral Waveform → Spectral Conversion (GMM, target emotion) → Converted Waveform; Context VQ → Phone Duration Trees → Duration Tier; Syllable F0 HMMs → F0 contour; TD-PSOLA (Praat) → Final Waveform]

6. Step 1: Spectral Conversion • The goal is to convert a neutral spectrum into an emotional spectrum. • Analysis of long-term average spectra reveals emotion-specific patterns: • Anger has significantly more energy in the 1550-2500 Hz band than in 0-800 Hz. • Sadness has a sharper spectral tilt and more low-frequency energy. • Happy & surprised follow similar spectral patterns. [long-term average spectra shown for /ae/ and /ei/]

7. GMM-based Linear Transformation • A popular method for speaker conversion (Stylianou et al., 1998). • 100,000 parallel neutral and emotional LSF vectors aligned. • GMM (Q=16) fitted on the neutral (source) data. • A linear transformation F is learned from the aligned features, applied as a posterior-weighted sum: F(x_k) = Σ_{q=1..Q} P(c_q | x_k) f_q(x_k). • Pitch-synchronous source-filter analysis (order=30) and OLA synthesis.
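
To make the posterior-weighted transform concrete, here is a minimal numerical sketch in Python. It assumes plain per-component affine transforms f_q(x) = A_q x + b_q and a diagonal-covariance GMM; the variable names and toy parameters are illustrative only, not the trained system (Stylianou et al.'s formulation ties the transform to the GMM covariances).

```python
import numpy as np

# Hedged sketch: apply a mixture of per-component affine transforms to a
# source LSF vector, weighted by GMM posteriors. Shapes and the plain
# affine form A_q x + b_q are illustrative assumptions.

Q, D = 16, 30                      # mixture components, LSF order
rng = np.random.default_rng(0)

# GMM fitted on neutral (source) LSF vectors: weights, means, diagonal covariances.
w   = np.full(Q, 1.0 / Q)
mu  = rng.normal(size=(Q, D))
var = np.ones((Q, D))

# Per-component transform parameters learned from aligned neutral/emotional pairs.
A = np.stack([np.eye(D) for _ in range(Q)])   # (Q, D, D)
b = rng.normal(scale=0.01, size=(Q, D))       # (Q, D)

def posteriors(x):
    """P(c_q | x) under the diagonal-covariance GMM."""
    log_p = (np.log(w)
             - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=1))
    log_p -= log_p.max()
    p = np.exp(log_p)
    return p / p.sum()

def convert(x):
    """F(x) = sum_q P(c_q | x) (A_q x + b_q)."""
    p = posteriors(x)
    return np.einsum('q,qij,j->i', p, A, x) + p @ b

x_neutral = rng.normal(size=D)
print(convert(x_neutral))
```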

8. Spectral Conversion Results [long-term average spectra for phone /ae/ in the test data]

9. Step 2: Duration Conversion [context → duration trees → duration tier] • A regression tree was built for each broad class: vowels, nasals, glides and fricatives (3300 vowels, 826 nasals, 1025 glides, 1610 fricatives). • Relative trees outperform absolute trees. • Four feature groups were investigated: • FG0: original duration • FG1: FG0 + phone identity, previous phone, next phone • FG2: FG1 + lexical stress, sentence position • FG3: FG2 + word position, word length, part of speech • Cross-validation on the training data to find the best pruning level, with training data held out in groups of 10 utterances.
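
As an illustration of the relative-tree idea, the hedged sketch below fits a small regression tree that maps one-hot context features (FG1/FG2-style) to an emotional/neutral duration ratio and then scales a neutral duration by the prediction. The feature names, toy data and sklearn tooling are assumptions, not the original tree-building setup.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_extraction import DictVectorizer

# Hedged sketch of a "relative" duration tree for one broad class (vowels):
# predict the emotional/neutral duration ratio from contextual features.
# The contexts and ratios below are toy values, not the corpus.

train = [
    ({"phone": "ae", "prev": "k", "next": "t", "stress": 1, "spos": "initial"}, 1.25),
    ({"phone": "iy", "prev": "s", "next": "n", "stress": 0, "spos": "medial"},  0.95),
    ({"phone": "ao", "prev": "t", "next": "l", "stress": 1, "spos": "final"},   1.40),
]
contexts, ratios = zip(*train)

vec  = DictVectorizer(sparse=False)
X    = vec.fit_transform(contexts)          # one-hot context features (FG1/FG2-style)
tree = DecisionTreeRegressor(max_depth=4)   # depth stands in for a CV-chosen pruning level
tree.fit(X, ratios)

# At conversion time, scale the neutral duration by the predicted ratio.
neutral_dur = 0.110  # seconds
x_new = vec.transform([{"phone": "ae", "prev": "d", "next": "t",
                        "stress": 1, "spos": "initial"}])
print(neutral_dur * tree.predict(x_new)[0])
```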

  10. Duration Conversion Results

11. Step 3: F0 Generation [syllable context → intonation models → F0 contour] • Each syllable F0 segment is modelled as a three-state left-to-right HMM (HTK). • Each model is represented in terms of context-sensitive features: lexical stress, sentence position, word position, part of speech, part of speech of previous word, prevowel, postvowel. • Syllable models are trained on interpolated F0 values together with first- and second-order differentials. • 4200 models, for which decision-tree-based parameter tying was performed.

12. F0 Generation From HMM • The goal is to generate an optimal sequence of observations directly from the syllable HMMs, given the intonation models and state durations. • We used the parameter generation algorithm of the HTS synthesis system to generate smooth F0 contours (Tokuda et al., 1995). • Dynamic F0 features (Δf and ΔΔf) are used as constraints in generation. • With O = WF in matrix form, the maximization is carried out over the static parameters F rather than over the entire observation vector O.
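
For reference, a brief sketch of the constrained maximization in the standard formulation of Tokuda et al. (1995); μ_q and Σ_q denote the means and covariances of the chosen state sequence q, and the notation is ours rather than the slide's.

```latex
% O stacks static and dynamic F0 features; F holds the static values only;
% W is the window matrix that computes \Delta f and \Delta\Delta f from F.
\hat{F} \;=\; \arg\max_{F}\; \mathcal{N}\!\left(WF;\ \mu_{q},\ \Sigma_{q}\right)
\quad\Longrightarrow\quad
\left(W^{\top}\Sigma_{q}^{-1}W\right)\hat{F} \;=\; W^{\top}\Sigma_{q}^{-1}\,\mu_{q}
```

Maximizing over F rather than O is what forces the generated static contour to be consistent with its own delta features, which is the source of the smoothness noted above.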

13. F0 Synthesis from HMM [example contours: Angry, Surprised, Sad]

  14. Conversion Samples

15. Subjective Evaluation • 10 conversions per emotion were randomly presented to 15 listeners. • Listeners were asked to guess the intended emotion in a three-way forced-choice setup. • Binary quality ratings (natural / unnatural). • Recognition rates for sad and surprised are comparable with perception of genuine human emotional speech. • Recognition rates for anger lag behind perception of angry human speech. • Quality ratings for anger are not as good as the other conversions; one possible reason is oversmoothing of the F0 contours by the HMM framework.

16. F0 Segment Selection Overview • An alternative to HMM-based F0 generation: unit selection applied to syllable F0 segments. • A corpus of syllable F0 segments in the target emotion. • A parallel corpus of neutral F0 segments. • An input specification sequence I consisting of: • the syllable context of each syllable • the input F0 segment of each syllable. • Define a syllable target cost T and an inter-syllable concatenation cost J such that the total cost over S syllables, for a given unit sequence U and input sequence I, is as sketched below.
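
A hedged reconstruction of the total cost just described; the per-subcost weights introduced on the weight-estimation slides are left implicit here.

```latex
C(U, I) \;=\; \sum_{s=1}^{S} T(u_{s}, i_{s}) \;+\; \sum_{s=2}^{S} J(u_{s-1}, u_{s})
```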

17. F0 Unit Preparation • Training utterances are chopped to form syllable F0 segments. • Median smoothing is applied to the F0 segments. • Parallel neutral-emotional F0 pairs are saved in codebooks along with the syllable context: lexical stress, position in word, position in sentence, part of speech, part of speech of previous word, prevowel, postvowel. [codebook illustration: each entry pairs a neutral and an emotional F0 segment with its context features, e.g. Pre, Post, Lex, Pofs, spos]

18. F0 Segment Selection • Given a trellis of possible syllable F0 unit combinations for an utterance, choose the optimal path using a Viterbi search. • The target cost T is a Manhattan distance consisting of P subcosts, of two types: • a binary value (0 or 1) indicating context match for a given context feature; • the Euclidean distance between the input (neutral) F0 segment of i_j and the neutral F0 segment of unit u_j. (A sketch of this cost follows below.)
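
A sketch of the target cost as a weighted Manhattan distance over the P subcosts; the weights w_p are those estimated later (Case 1), and the symbols f_j^in (input neutral F0 segment) and f^neu(u_j) (the unit's stored neutral segment) are our notation, not the slide's.

```latex
T(i_{j}, u_{j}) \;=\; \sum_{p=1}^{P} w_{p}\, t_{p}(i_{j}, u_{j}),
\qquad
t_{p} \in \{0,1\}\ \text{(context subcost)}
\quad\text{or}\quad
t_{p} \;=\; \bigl\lVert f^{\mathrm{in}}_{j} - f^{\mathrm{neu}}(u_{j}) \bigr\rVert_{2}
```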

19. F0 Segment Selection Continued • Concatenation cost J: • zero if adjacent syllable contours are detached, i.e. separated by an unvoiced region; • non-zero if syllable contours are attached, i.e. part of one continuous voiced segment: J_{s-1,s} = |E_{s-1} - B_s|, the distance between the last F0 value E_{s-1} of unit s-1 and the first F0 value B_s of unit s. • Viterbi Search and Pruning • 3544 units in the corpus (U = 3544). • The Viterbi search runs in O(U²S) time, so pruning is important. • Duration-based pruning: only segments with durations similar to the input are considered. (A search sketch follows below.)
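
The hedged Python sketch below illustrates the search: one unit per syllable, a target cost mixing context-mismatch penalties with a neutral-F0 Euclidean distance, a concatenation cost only when syllables are attached, and duration-based pruning of the candidate lists. The toy corpus, field names and unit (unscaled) subcost weights are illustrative assumptions, not the real codebook.

```python
import numpy as np

# Hedged sketch of the syllable-level Viterbi search over F0 units.
corpus = [  # each unit: context, neutral F0 segment, emotional F0 segment, duration (s)
    {"ctx": {"lex": 1, "spos": "initial"}, "neutral": np.array([200., 210., 220.]),
     "emotional": np.array([240., 260., 250.]), "dur": 0.18},
    {"ctx": {"lex": 0, "spos": "medial"},  "neutral": np.array([180., 175., 170.]),
     "emotional": np.array([190., 185., 175.]), "dur": 0.12},
    {"ctx": {"lex": 1, "spos": "final"},   "neutral": np.array([210., 190., 160.]),
     "emotional": np.array([230., 200., 150.]), "dur": 0.20},
]

def target_cost(inp, unit):
    # Binary context-mismatch penalties plus the Euclidean distance between the
    # input (neutral) F0 segment and the unit's neutral F0 segment.
    # Weights are omitted (all 1) here; the real system scales each subcost.
    ctx_cost = sum(inp["ctx"][k] != unit["ctx"][k] for k in inp["ctx"])
    n = min(len(inp["f0"]), len(unit["neutral"]))
    return ctx_cost + float(np.linalg.norm(inp["f0"][:n] - unit["neutral"][:n]))

def concat_cost(prev_unit, unit, attached):
    # Zero for detached syllables; otherwise the jump between the last F0 value
    # of the previous unit and the first F0 value of this one.
    return abs(prev_unit["emotional"][-1] - unit["emotional"][0]) if attached else 0.0

def select(inputs, dur_tol=0.05):
    # Duration-based pruning: keep only units with durations close to the input's.
    cands = [[u for u in corpus if abs(u["dur"] - i["dur"]) <= dur_tol] or corpus
             for i in inputs]
    layers = [{id(u): (target_cost(inputs[0], u), [u]) for u in cands[0]}]
    for s in range(1, len(inputs)):
        layer = {}
        for u in cands[s]:
            tc = target_cost(inputs[s], u)
            cost, path = min(
                ((c + tc + concat_cost(p[-1], u, inputs[s]["attached"]), p + [u])
                 for c, p in layers[-1].values()),
                key=lambda t: t[0])
            layer[id(u)] = (cost, path)
        layers.append(layer)
    return min(layers[-1].values(), key=lambda t: t[0])[1]

inputs = [  # input specification: syllable context, neutral F0 segment, duration
    {"ctx": {"lex": 1, "spos": "initial"}, "f0": np.array([195., 205., 215.]),
     "dur": 0.17, "attached": False},
    {"ctx": {"lex": 1, "spos": "final"},   "f0": np.array([205., 185., 165.]),
     "dur": 0.19, "attached": True},
]
print([u["emotional"] for u in select(inputs)])
```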

20. Weight Estimation • Weights have two functions: • to normalize subcosts with different ranges (acoustic versus categorical costs); • to assign different significance levels to the costs. • Two sets of weights: wT and wTJ. • Case 1: the syllable is detached from the previous syllable (no concatenation cost, only the target cost); P weights are estimated. • Case 2: the syllables are attached; the concatenation cost also has a weight, so P+1 weights are estimated.

21. Weight Estimation (Case 1) • The weights wT are optimized using held-out utterances. • Least-squares framework with X equations and P unknowns, where X >> P. • For each held-out syllable we already have the target F0 segment we would like to predict. • Find the N-best and N-worst candidates in the unit database. • Each error E is the RMSE between the target contour and the corresponding best or worst unit.

22. Weight Estimation (Case 2) • The weights wTJ are estimated within a similar framework; instead of using single syllables, the errors of two attached syllables are summed. • Once again, N-best and N-worst values are included in the system of equations. (A least-squares sketch follows below.)
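
A minimal sketch of the least-squares weight fit, assuming each equation pairs the P subcosts of an N-best or N-worst candidate with its RMSE to the target contour; the toy numbers below stand in for the held-out data.

```python
import numpy as np

# Hedged sketch of the least-squares weight estimation (Case 1): each row of
# `subcosts` holds the P subcosts of one N-best or N-worst candidate for a
# held-out syllable; the target is the RMSE between that candidate's F0
# segment and the true emotional contour. Values are synthetic.

rng = np.random.default_rng(0)
P, n_eqs = 8, 400                        # P unknowns, X >> P equations

subcosts = rng.random((n_eqs, P))        # t_p values for best/worst candidates
true_w   = rng.random(P) * 5.0
rmse     = subcosts @ true_w + rng.normal(scale=0.1, size=n_eqs)

# Solve  subcosts @ w  ~  rmse  in the least-squares sense.
w_hat, *_ = np.linalg.lstsq(subcosts, rmse, rcond=None)
print(np.round(w_hat, 2))
```

Case 2 uses the same machinery with one extra column for the concatenation subcost, summing the errors of the two attached syllables in each equation.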

  23. HMMs versus Segment Selection

  24. Conclusions & Future Work • We have proposed a multi-level emotion conversion system. • We have implemented two alternative F0 generation/conversion methods to be plugged into the overall system. • Preference test showed that F0 segment selection method performed better than HMMs for anger and surprise. • A further forced-choice emotion identification test needs to be done with segment selection. • Segment selection can be improved to contain a more powerful concatenation cost.
