
CEICES: a “Vertical” Approach Towards Recognizing Emotion in Speech

HUMAINE Plenary, Paris, June 4th, 2007. Anton Batliner, Lehrstuhl für Mustererkennung (Informatik 5 / Chair for Pattern Recognition), Friedrich-Alexander-Universität Erlangen-Nürnberg.




Presentation Transcript


  1. CEICES: a “Vertical” Approach Towards Recognizing Emotion in Speech HUMAINE Plenary, Paris, June 4th, 2007 Anton Batliner Lehrstuhl für Mustererkennung (Informatik 5) (Chair for Pattern Recognition) Friedrich-Alexander-Universität Erlangen-Nürnberg

  2. What is CEICES?

  3. CEICES Initiative
• Combining Efforts for Improving automatic Classification of Emotional user States – a "forced co-operation" initiative
• Partners active:
  • from outside HUMAINE: TUM (Technische Universität München), FBK-irst (inside/outside)
  • from within HUMAINE, WP4: FAU, UA, LIMSI, TAU/AFEKA
• People: Anton Batliner, Stefan Steidl, Björn Schuller, Dino Seppi, Thurid Vogt, Johannes Wagner, Laurence Devillers, Laurence Vidrascu, Noam Amir, Loic Kessous, Vered Aharonson

  4. Idea Behind
• different research traditions at different sites
• somewhat fossilized approaches at each site
• co-operation pays off: pooling competence and feature sets

  5. Which data do we use?

  6. Database
• German corpus with recordings of 51 ten- to twelve-year-old children communicating with Sony's Aibo pet robot (9.2 hours of speech, 51,393 words)
• data more or less reverberated, transliterated, annotated: 5 labellers, 11 word-based "emotion" labels
• originator site (FAU) provides speech files, phonetic lexicon, definition of train and test samples, etc.
• effort for manual "pre-processing" alone: ~80 k€ of researcher time, ~80 k€ of student time (a conservative estimate)

  7. A "Vertical" Approach
• segmentation, transliteration
• emotion labelling
• annotation of interaction
• manual word segmentation
• manual correction of F0
• syntactic annotation
• rule-based chunking system
• …

  8. Basics of Chosen Approach
• children: not exotic but normal
• annotation: with context (it's speech, not sounds); majority voting (≥ 3 out of 5 labellers agree)
• unit of annotation: the word, because
  • link to ASR
  • link to higher processing (syntax, dialogue, semantics)
  • smallest possible emotional unit
  • can be combined into higher units of different size
• mapping onto 4 cover classes, due to sparse data (see the sketch below):
  • Motherese (positive valence)
  • default class Neutral
  • "pre-stage" to negative: Emphatic
  • negative (Angry) "dimension" (smearing fine-grained differences between touchy, reprimanding, angry)
• AMEN sub-sample
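A minimal sketch of this word-level labelling step, under stated assumptions: the label names and the example votes are illustrative, not the original CEICES tooling; only the voting threshold and the cover-class mapping are taken from the slide, and labels of the 11-label set not named there are omitted.

```python
from collections import Counter

# Mapping onto the 4 AMEN cover classes as named on the slide.
COVER = {
    "motherese": "M",                                  # positive valence
    "neutral": "N",                                    # default class
    "emphatic": "E",                                   # "pre-stage" to negative
    "touchy": "A", "reprimanding": "A", "angry": "A",  # negative "dimension"
}

def majority_label(votes, min_agree=3):
    """Keep a word-level label only if >= 3 of the 5 labellers agree."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None

# One word, five hypothetical labeller votes:
votes = ["touchy", "angry", "touchy", "neutral", "touchy"]
label = majority_label(votes)     # -> "touchy"
print(label, COVER.get(label))    # -> touchy A
```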

  9. The AMEN Sub-sample
• syntactically/semantically meaningful chunks with at least one AMEN word
• syntactic-prosodic chunking rules (see the sketch below):
  IF (synt. boundary = sentence/free phrase/between vocatives) OR (pause ≥ 500 ms at any other synt. boundary)
• frequencies:
  • Motherese: 586
  • Neutral: 1998
  • Emphatic: 1045
  • Angry: 914
• experiments so far with 2- or 3-fold speaker-independent cross-validation, upsampling for training
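A sketch of the chunking rule, assuming invented boundary tags and an invented Word container; only the rule itself (major syntactic boundary, or pause ≥ 500 ms at any other syntactic boundary) comes from the slide.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    pause_after_ms: float  # pause following this word
    boundary: str          # hypothetical tags: "sentence", "free_phrase",
                           # "vocative", "other", "none"

MAJOR = {"sentence", "free_phrase", "vocative"}

def chunk(words, pause_ms=500.0):
    """Split at sentence/free-phrase/vocative boundaries, OR at any
    other syntactic boundary followed by a pause >= 500 ms."""
    chunks, current = [], []
    for w in words:
        current.append(w)
        if w.boundary in MAJOR or (w.boundary == "other"
                                   and w.pause_after_ms >= pause_ms):
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks
```

For the speaker-independent cross-validation mentioned above, the folds would additionally have to be built so that no child occurs in both training and test data (e.g. with scikit-learn's GroupKFold).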

  10. Type of Database: acted – induced – natural
Scenario (+ = applies to our database, – = does not):
  – acted
  – prompted
  – real-life
  + elicited/induced
  + volunteering
  + application-oriented
  – emotion-oriented
Outcome:
  + spontaneous
  + natural
  + realistic
  – selected

  11. fully exploiting the state of the art: relevance of features

  12. Feature Encoding Scheme (WS at FAU 12/06)
SEM S8I02102M1D4R5111L002000A00.00.00.00.00.00.00.00.00.00C0000010000F00.41.00N00X0000000000T0000000000PPOV_positive_valence
BOW S5I01413M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_und
S S6I03001M1D4R5112L000000A00.00.00.10.00.00.00.00.00.00C0000010000F00.10.00N00X0000000000T0000000000Pspectral_cog_mean
E S1I00004M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000001000F00.10.00N00X0000000000T__________PeneMean___
BOW S5I01270M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_kom
BOW S5I01344M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_rum
E S8I01036M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F00.02.01N00X0000000000T0000000000PEnMax0_0_f36_min
S S5I04027M1D4R5102L000000A00.00.00.00.13.00.00.00.00.00C0000010000F10.30.00N00X0000000000T0000000000PTUM_0_fa1_band_stdd
SEM S8I02108M1D4R5111L002000A00.00.00.00.00.00.00.00.00.00C0000010000F00.41.91N04X0000000000T0000000000PPOV_positive_valence_norm
BOW S5I01440M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_wieder
E S8I01229M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F00.00.10N00X0000000000T0000000000PEnEneAbs0_0_f29_mean
POS S5I00042M1D4R5101L020000A00.00.00.00.00.00.00.00.00.00C0000010000F10.35.00N00X0000000000T0000000000PTUM_sum_APN
POS S5I00045M1D4R5101L050000A00.00.00.00.00.00.00.00.00.00C0000010000F10.35.00N00X0000000000T0000000000PTUM_sum_PAJ
SEM S8I02106M1D4R5111L009000A00.00.00.00.00.00.00.00.00.00C0000010000F00.41.00N00X0000000000T0000000000PRES_rest
D S8I01056M1D4R5111L000000A10.00.00.00.00.00.00.00.00.00C0000010000F00.99.01N00X0000000000T0000000000PDurAbsSyl0_0_f56_min
P S8I01263M1D4R5111L000000A00.00.10.00.00.00.00.00.00.00C0000010000F20.61.10N00X0000000000T0000000000PF0RegCoeff0_0_f63_mean
S S4I00080M1D4R5111L000000A00.00.00.10.00.00.04.00.00.00C0000010000F00.00.00N00X1000000000T0000000000Pvnhr
P S4I01055M1D4R5111L000000A00.00.10.00.00.00.00.00.00.00C0000010000F00.22.00N00X1000000000T0000000000Pprctilep4A
E S4I01001M1D4R5111L000000A00.10.10.00.00.00.00.00.00.00C0000010000F30.02.00N00X1000000000T0000000000Ploud_maxval
V S4I00075M1D4R5111L000000A00.00.00.00.00.00.02.00.00.00C0000010000F00.00.00N00X1000000000T0000000000Pvshimapq3
E S8I01029M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F00.00.01N00X0000000000T0000000000PEnEneAbs0_0_f29_min
V S4I00074M1D4R5111L000000A00.00.00.00.00.00.02.00.00.00C0000010000F00.00.00N00X1000000000T0000000000Pvshimloc
BOW S5I01217M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_halt
BOW S5I01382M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_sollst
C S5I03116M1D4R5102L000000A00.00.00.00.00.10.00.00.00.00C0000010000F10.10.00N00X0000000000T0000000000PTUM_MFCC10Average
C S5I05195M1D4R5102L000000A00.00.00.00.00.12.00.00.00.00C0000010000F11.18.00N00X0000000000T0000000000PTUM_0_mfcc_c12_d_cnt
E S6I01002M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F00.02.00N00X0000000000T0000000000Penergy_max
C S6I04113M1D4R5112L000000A00.00.00.00.00.04.00.00.00.00C0000010000F00.31.00N00X0000000000T0000000000Pmfcc4_var
E S1I00006M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000101010F00.49.00N00X0000000000T__________PeneTau____
S S6I03006M1D4R5112L000000A00.00.00.10.00.00.00.00.00.00C0000010000F00.21.00N00X0000000000T0000000000Pspectral_cog_median
SEM S8I02103M1D4R5111L003000A00.00.00.00.00.00.00.00.00.00C0000010000F00.41.00N00X0000000000T0000000000PNEV_negative_valence
E S6I01114M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F02.21.00N00X0000000000T0000000000Penergy_deltadelta_median

  13. Zoom on Feature Encoding Scheme
linguistic encoding (the L… field) vs. acoustic encoding (the A… field):
SEM S8I02102M1D4R5111L002000A00.00.00.00.00.00.00.00.00.00 .. PPOV_positive_valence
BOW S5I01413M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00 .. PTUM_logTF_ILS_und
S S6I03001M1D4R5112L000000A00.00.00.10.00.00.00.00.00.00 .. Pspectral_cog_mean
E S1I00004M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00 .. PeneMean___
BOW S5I01270M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00 .. PTUM_logTF_ILS_kom
BOW S5I01344M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00 .. PTUM_logTF_ILS_rum
E S8I01036M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00 .. PEnMax0_0_f36_min
S S5I04027M1D4R5102L000000A00.00.00.00.13.00.00.00.00.00 .. PTUM_0_fa1_band_stdd
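The meaning of each field is not spelled out on the slide; the splitter below merely segments an entry at the visible single-letter field markers (S, I, M, D, R, L, A, C, F, N, X, T, P), with guessed group names, to show how such a scheme can be handled mechanically.

```python
import re

FIELD_RE = re.compile(
    r"S(?P<s>\d+)I(?P<i>\d+)M(?P<m>\d+)D(?P<d>\d+)R(?P<r>\d+)"
    r"L(?P<linguistic>\d+)"   # non-zero for linguistic features
    r"A(?P<acoustic>[\d.]+)"  # non-zero for acoustic features
    r"C(?P<c>\d+)F(?P<f>[\d.]+)N(?P<n>\d+)X(?P<x>\d+)"
    r"T(?P<t>[\d_]+)"
    r"P(?P<name>.+)"          # human-readable feature name
)

entry = ("SEM S8I02102M1D4R5111L002000A00.00.00.00.00.00.00.00.00.00"
         "C0000010000F00.41.00N00X0000000000T0000000000P"
         "POV_positive_valence")
feat_type, code = entry.split(" ", 1)
fields = FIELD_RE.match(code).groupdict()
print(feat_type, fields["name"], fields["linguistic"], fields["acoustic"])
```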

  14. Impact of Feature Types (SVM), Separate and Combined Classification of Chunks, F Values: Acoustic and Linguistic Features

feature type       #     all   red. (150)  SFFSsep  SFFScomb
energy             265   58.5  60.0        56.9     56.3
duration           391   55.1  60.6        54.9     49.6
F0                 333   56.1  55.1        46.7     46.8
spectral/formant   656   54.4  56.0        49.9     46.2
cepstral           1699  52.7  57.1        50.4     46.4
voice quality      154   51.5  51.6        41.5     38.7
wavelets           216   56.0  56.3        44.9     35.3
bag of words       476   62.6  62.3        53.2     37.4
part-of-speech     31    54.7  -           54.9     48.1
higher semantics   12    57.6  -           57.9     56.0
non-verbal         8     24.2  -           -        -
disfluencies       4     26.8  -           -        -
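The "F values" here are presumably class-wise averaged F-measures, i.e. a macro-averaged F1 over the four AMEN cover classes in scikit-learn terms; a tiny sketch with made-up gold and predicted labels:

```python
from sklearn.metrics import f1_score

# Hypothetical gold and predicted AMEN labels for ten chunks.
y_true = ["A", "M", "E", "N", "N", "A", "E", "M", "N", "A"]
y_pred = ["A", "N", "E", "N", "E", "A", "E", "M", "N", "N"]

# Macro-averaged F1: the F-measure is computed per class and then
# averaged with equal class weights.
print(f"F = {100 * f1_score(y_true, y_pred, average='macro'):.1f}")
```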


  20. Types of Approaches, SFFS, F Values
• knowledge-based and sequential (FAU, FBK: 118)*: 58.8
• knowledge-based (TAU, LIMSI: 312): 53.3
• brute-force (TUM, UA: 3304): 54.9
• all acoustic features (3714): 63.4
• all linguistic features (531): 62.6
• all together (4245): 65.5
* word-based features, using manually corrected word boundaries, combined into chunk-based features
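A sketch of such a selection setup, assuming random stand-in data: SFFS (sequential floating forward selection) wrapped around a linear SVM, with speaker-independent folds. mlxtend's SequentialFeatureSelector with floating=True implements SFFS; everything else (data shapes, fold count, target subset size) is illustrative.

```python
import numpy as np
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.model_selection import GroupKFold
from sklearn.svm import SVC

# Random stand-ins: X = chunks x features, y = AMEN classes (0..3),
# speakers = child IDs used to keep folds speaker-independent.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = rng.integers(0, 4, size=200)
speakers = rng.integers(0, 10, size=200)

# Precomputed 3-fold speaker-independent splits.
folds = list(GroupKFold(n_splits=3).split(X, y, groups=speakers))

sffs = SFS(SVC(kernel="linear"),
           k_features=10,     # target subset size (150 on the slides)
           forward=True,
           floating=True,     # floating -> SFFS rather than plain SFS
           scoring="f1_macro",
           cv=folds)
sffs = sffs.fit(X, y)
print(sorted(sffs.k_feature_idx_))
```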

  21. beyond the state of the art: units of analysis

  22. Performance for Different Chunks, Preliminary Experiments at FBK, Small Feature Set, F Values

unit of analysis                            #      F
optimal, i.e. adjacent identical labels     7008   64.0
  (e.g.: NNN EE AAAA NNNNN M N MMM)
turns (pause > 1.5 sec.)                    3990   50.0
syntactic-prosodic rule system              9152   55.2
words                                       17611  55.0
syntactic rule system (clauses/phrases/…)   9102   53.9
prosodic rule system (pause > 0.5 sec.)     5129   53.0
LM2 (bi-gram language model)                5480   52.8
POS-LM (part-of-speech language model)      4637   56.0
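The "optimal" unit can be reproduced mechanically: merge runs of adjacent identical word labels, so that every chunk is emotionally homogeneous by construction. A one-liner on the slide's own example sequence:

```python
from itertools import groupby

# Word-level cover-class labels of the slide's example turn.
labels = "NNNEEAAAANNNNNMNMMM"

# Merge runs of adjacent identical labels into chunks.
chunks = ["".join(run) for _, run in groupby(labels)]
print(chunks)  # ['NNN', 'EE', 'AAAA', 'NNNNN', 'M', 'N', 'MMM']
```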

  23. Summing up our Results
• impact of acoustic feature types: energy most important, voice quality least important, other types in between (note domain dependency!)
• impact of linguistic feature types: very high; to be checked with real Automatic Speech Recognition (ASR) output
• sequential approach promising
• chunking is the right way to go
• emotion recognition seems to be less prone to noise than comparable speech processing tasks (ICASSP 2007)
• PDA (Pitch Detection Algorithm) extraction errors degrade performance consistently, but not severely (ICPhS 2007)

  24. In a Nutshell
• full exploitation of state-of-the-art approaches
  • > 4 k features
  • knowledge-based vs. brute-force
  • selection and classification
• and beyond the state of the art
  • towards new dimensions (UMUAI 2007)
  • meaningful units of analysis (chunking)
  • interaction/dialogue modelling
  • prototyping
  • personalization
  • …

  25. and the Message of the Day
• people stare at classification performance
• which is tuned explicitly by highly sophisticated classifiers
• and implicitly by settings not obvious to the 'normal' reader, such as
  • manual emotion chunking
  • using only prototypes
  • using acted data
  • and other devices

  26. Thank you for your attention
