
CEICES: a “Vertical” Approach Towards Recognizing Emotion in Speech

HUMAINE Plenary, Paris, June 4th, 2007. Anton Batliner, Lehrstuhl für Mustererkennung (Informatik 5 / Chair for Pattern Recognition), Friedrich-Alexander-Universität Erlangen-Nürnberg.




Presentation Transcript


  1. CEICES: a “Vertical” Approach Towards Recognizing Emotion in Speech HUMAINE Plenary, Paris, June 4th, 2007 Anton Batliner Lehrstuhl für Mustererkennung (Informatik 5) (Chair for Pattern Recognition) Friedrich-Alexander-Universität Erlangen-Nürnberg

  2. What is CEICES?

  3. CEICES Initiative
• Combining Efforts for Improving automatic Classification of Emotional user States – a "forced co-operation" initiative
• Partners active:
  • from outside HUMAINE: TUM (Technische Universität München), FBK-irst (inside/outside)
  • from within HUMAINE, WP4: FAU, UA, LIMSI, TAU/AFEKA
• People: Anton Batliner, Stefan Steidl, Björn Schuller, Dino Seppi, Thurid Vogt, Johannes Wagner, Laurence Devillers, Laurence Vidrascu, Noam Amir, Loic Kessous, Vered Aharonson

  4. Idea Behind
• different research traditions at different sites
• somewhat fossilized approaches at each site
• co-operation pays off: pooling competence and feature sets

  5. Which data do we use?

  6. Database
• German corpus with recordings of 51 ten- to twelve-year-old children communicating with Sony's Aibo pet robot (9.2 hours of speech, 51,393 words)
• data more or less reverberated, transliterated, annotated: 5 labellers, 11 word-based "emotion" labels
• originator site (FAU) provides speech files, phonetic lexicon, definition of train and test samples, etc.
• effort for manual "pre-processing" alone: ~80 k€ of researcher time, ~80 k€ of student time (a conservative estimate)

  7. A "Vertical" Approach
• segmentation, transliteration
• emotion labelling
• annotation of interaction
• manual word segmentation
• manual correction of F0
• syntactic annotation
• rule-based chunking system
• …

  8. Basics of Chosen Approach
• children: not exotic but normal
• annotation: with context (it's speech, not sounds); majority voting (≥ 3 out of 5 labellers agree)
• unit of annotation: the word, because
  • link to ASR
  • link to higher processing (syntax, dialogue, semantics)
  • smallest possible emotional unit
  • can be combined into higher units of different size
• mapping onto 4 cover classes, due to sparse data (see the sketch below):
  • Motherese (positive valence)
  • default class Neutral
  • "pre-stage" to negative: Emphatic
  • negative (Angry) "dimension" (smearing fine-grained differences between touchy, reprimanding, angry)
• AMEN sub-sample
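A minimal sketch of this word-level labelling step, under stated assumptions: the label names and the example votes are illustrative, not the original CEICES tooling; only the voting threshold and the cover-class mapping are taken from the slide, and labels of the 11-label set not named there are omitted.

```python
from collections import Counter

# Mapping onto the 4 AMEN cover classes as named on the slide.
COVER = {
    "motherese": "M",                                  # positive valence
    "neutral": "N",                                    # default class
    "emphatic": "E",                                   # "pre-stage" to negative
    "touchy": "A", "reprimanding": "A", "angry": "A",  # negative "dimension"
}

def majority_label(votes, min_agree=3):
    """Keep a word-level label only if >= 3 of the 5 labellers agree."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None

# One word, five hypothetical labeller votes:
votes = ["touchy", "angry", "touchy", "neutral", "touchy"]
label = majority_label(votes)     # -> "touchy"
print(label, COVER.get(label))    # -> touchy A
```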

  9. The AMEN Sub-sample
• syntactically/semantically meaningful chunks with at least one AMEN word
• syntactic-prosodic chunking rules (see the sketch below):
  IF (synt. boundary = sentence/free phrase/between vocatives) OR (pause ≥ 500 ms at any other synt. boundary)
• frequencies:
  • Motherese: 586
  • Neutral: 1998
  • Emphatic: 1045
  • Angry: 914
• experiments so far with 2- or 3-fold speaker-independent cross-validation, upsampling for training
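A sketch of the chunking rule, assuming invented boundary tags and an invented Word container; only the rule itself (major syntactic boundary, or pause ≥ 500 ms at any other syntactic boundary) comes from the slide.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    pause_after_ms: float  # pause following this word
    boundary: str          # hypothetical tags: "sentence", "free_phrase",
                           # "vocative", "other", "none"

MAJOR = {"sentence", "free_phrase", "vocative"}

def chunk(words, pause_ms=500.0):
    """Split at sentence/free-phrase/vocative boundaries, OR at any
    other syntactic boundary followed by a pause >= 500 ms."""
    chunks, current = [], []
    for w in words:
        current.append(w)
        if w.boundary in MAJOR or (w.boundary == "other"
                                   and w.pause_after_ms >= pause_ms):
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks
```

For the speaker-independent cross-validation mentioned above, the folds would additionally have to be built so that no child occurs in both training and test data (e.g. with scikit-learn's GroupKFold).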

  10. Type of Database: acted – induced – natural
Scenario (+ = applies to our database, – = does not):
  – acted
  – prompted
  – real-life
  + elicited/induced
  + volunteering
  + application-oriented
  – emotion-oriented
Outcome:
  + spontaneous
  + natural
  + realistic
  – selected

  11. fully exploiting the state of the art: relevance of features

  12. Feature Encoding Scheme (WS at FAU 12/06)
SEM S8I02102M1D4R5111L002000A00.00.00.00.00.00.00.00.00.00C0000010000F00.41.00N00X0000000000T0000000000PPOV_positive_valence
BOW S5I01413M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_und
S S6I03001M1D4R5112L000000A00.00.00.10.00.00.00.00.00.00C0000010000F00.10.00N00X0000000000T0000000000Pspectral_cog_mean
E S1I00004M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000001000F00.10.00N00X0000000000T__________PeneMean___
BOW S5I01270M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_kom
BOW S5I01344M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_rum
E S8I01036M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F00.02.01N00X0000000000T0000000000PEnMax0_0_f36_min
S S5I04027M1D4R5102L000000A00.00.00.00.13.00.00.00.00.00C0000010000F10.30.00N00X0000000000T0000000000PTUM_0_fa1_band_stdd
SEM S8I02108M1D4R5111L002000A00.00.00.00.00.00.00.00.00.00C0000010000F00.41.91N04X0000000000T0000000000PPOV_positive_valence_norm
BOW S5I01440M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_wieder
E S8I01229M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F00.00.10N00X0000000000T0000000000PEnEneAbs0_0_f29_mean
POS S5I00042M1D4R5101L020000A00.00.00.00.00.00.00.00.00.00C0000010000F10.35.00N00X0000000000T0000000000PTUM_sum_APN
POS S5I00045M1D4R5101L050000A00.00.00.00.00.00.00.00.00.00C0000010000F10.35.00N00X0000000000T0000000000PTUM_sum_PAJ
SEM S8I02106M1D4R5111L009000A00.00.00.00.00.00.00.00.00.00C0000010000F00.41.00N00X0000000000T0000000000PRES_rest
D S8I01056M1D4R5111L000000A10.00.00.00.00.00.00.00.00.00C0000010000F00.99.01N00X0000000000T0000000000PDurAbsSyl0_0_f56_min
P S8I01263M1D4R5111L000000A00.00.10.00.00.00.00.00.00.00C0000010000F20.61.10N00X0000000000T0000000000PF0RegCoeff0_0_f63_mean
S S4I00080M1D4R5111L000000A00.00.00.10.00.00.04.00.00.00C0000010000F00.00.00N00X1000000000T0000000000Pvnhr
P S4I01055M1D4R5111L000000A00.00.10.00.00.00.00.00.00.00C0000010000F00.22.00N00X1000000000T0000000000Pprctilep4A
E S4I01001M1D4R5111L000000A00.10.10.00.00.00.00.00.00.00C0000010000F30.02.00N00X1000000000T0000000000Ploud_maxval
V S4I00075M1D4R5111L000000A00.00.00.00.00.00.02.00.00.00C0000010000F00.00.00N00X1000000000T0000000000Pvshimapq3
E S8I01029M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F00.00.01N00X0000000000T0000000000PEnEneAbs0_0_f29_min
V S4I00074M1D4R5111L000000A00.00.00.00.00.00.02.00.00.00C0000010000F00.00.00N00X1000000000T0000000000Pvshimloc
BOW S5I01217M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_halt
BOW S5I01382M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_sollst
C S5I03116M1D4R5102L000000A00.00.00.00.00.10.00.00.00.00C0000010000F10.10.00N00X0000000000T0000000000PTUM_MFCC10Average
C S5I05195M1D4R5102L000000A00.00.00.00.00.12.00.00.00.00C0000010000F11.18.00N00X0000000000T0000000000PTUM_0_mfcc_c12_d_cnt
E S6I01002M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F00.02.00N00X0000000000T0000000000Penergy_max
C S6I04113M1D4R5112L000000A00.00.00.00.00.04.00.00.00.00C0000010000F00.31.00N00X0000000000T0000000000Pmfcc4_var
E S1I00006M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000101010F00.49.00N00X0000000000T__________PeneTau____
S S6I03006M1D4R5112L000000A00.00.00.10.00.00.00.00.00.00C0000010000F00.21.00N00X0000000000T0000000000Pspectral_cog_median
SEM S8I02103M1D4R5111L003000A00.00.00.00.00.00.00.00.00.00C0000010000F00.41.00N00X0000000000T0000000000PNEV_negative_valence
E S6I01114M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F02.21.00N00X0000000000T0000000000Penergy_deltadelta_median

  13. Zoom on Feature Encoding Scheme
linguistic encoding (the L… field) vs. acoustic encoding (the A… field):
SEM S8I02102M1D4R5111L002000A00.00.00.00.00.00.00.00.00.00 .. PPOV_positive_valence
BOW S5I01413M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00 .. PTUM_logTF_ILS_und
S S6I03001M1D4R5112L000000A00.00.00.10.00.00.00.00.00.00 .. Pspectral_cog_mean
E S1I00004M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00 .. PeneMean___
BOW S5I01270M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00 .. PTUM_logTF_ILS_kom
BOW S5I01344M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00 .. PTUM_logTF_ILS_rum
E S8I01036M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00 .. PEnMax0_0_f36_min
S S5I04027M1D4R5102L000000A00.00.00.00.13.00.00.00.00.00 .. PTUM_0_fa1_band_stdd
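The meaning of each field is not spelled out on the slide; the splitter below merely segments an entry at the visible single-letter field markers (S, I, M, D, R, L, A, C, F, N, X, T, P), with guessed group names, to show how such a scheme can be handled mechanically.

```python
import re

FIELD_RE = re.compile(
    r"S(?P<s>\d+)I(?P<i>\d+)M(?P<m>\d+)D(?P<d>\d+)R(?P<r>\d+)"
    r"L(?P<linguistic>\d+)"   # non-zero for linguistic features
    r"A(?P<acoustic>[\d.]+)"  # non-zero for acoustic features
    r"C(?P<c>\d+)F(?P<f>[\d.]+)N(?P<n>\d+)X(?P<x>\d+)"
    r"T(?P<t>[\d_]+)"
    r"P(?P<name>.+)"          # human-readable feature name
)

entry = ("SEM S8I02102M1D4R5111L002000A00.00.00.00.00.00.00.00.00.00"
         "C0000010000F00.41.00N00X0000000000T0000000000P"
         "POV_positive_valence")
feat_type, code = entry.split(" ", 1)
fields = FIELD_RE.match(code).groupdict()
print(feat_type, fields["name"], fields["linguistic"], fields["acoustic"])
```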

  14. Impact of Feature Types (SVM), Separate and Combined Classification of Chunks, F Values: Acoustic and Linguistic Features

feature type       #     all   red. (150)  SFFSsep  SFFScomb
energy             265   58.5  60.0        56.9     56.3
duration           391   55.1  60.6        54.9     49.6
F0                 333   56.1  55.1        46.7     46.8
spectral/formant   656   54.4  56.0        49.9     46.2
cepstral           1699  52.7  57.1        50.4     46.4
voice quality      154   51.5  51.6        41.5     38.7
wavelets           216   56.0  56.3        44.9     35.3
bag of words       476   62.6  62.3        53.2     37.4
part-of-speech     31    54.7  -           54.9     48.1
higher semantics   12    57.6  -           57.9     56.0
non-verbal         8     24.2  -           -        -
disfluencies       4     26.8  -           -        -
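The "F values" here are presumably class-wise averaged F-measures, i.e. a macro-averaged F1 over the four AMEN cover classes in scikit-learn terms; a tiny sketch with made-up gold and predicted labels:

```python
from sklearn.metrics import f1_score

# Hypothetical gold and predicted AMEN labels for ten chunks.
y_true = ["A", "M", "E", "N", "N", "A", "E", "M", "N", "A"]
y_pred = ["A", "N", "E", "N", "E", "A", "E", "M", "N", "N"]

# Macro-averaged F1: the F-measure is computed per class and then
# averaged with equal class weights.
print(f"F = {100 * f1_score(y_true, y_pred, average='macro'):.1f}")
```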


  20. Types of Approaches, SFFS, F Values
• knowledge-based and sequential (FAU, FBK: 118)*: 58.8
• knowledge-based (TAU, LIMSI: 312): 53.3
• brute-force (TUM, UA: 3304): 54.9
• all acoustic features (3714): 63.4
• all linguistic features (531): 62.6
• all together (4245): 65.5
* word-based features, using manually corrected word boundaries, combined into chunk-based features
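A sketch of such a selection setup, assuming random stand-in data: SFFS (sequential floating forward selection) wrapped around a linear SVM, with speaker-independent folds. mlxtend's SequentialFeatureSelector with floating=True implements SFFS; everything else (data shapes, fold count, target subset size) is illustrative.

```python
import numpy as np
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.model_selection import GroupKFold
from sklearn.svm import SVC

# Random stand-ins: X = chunks x features, y = AMEN classes (0..3),
# speakers = child IDs used to keep folds speaker-independent.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = rng.integers(0, 4, size=200)
speakers = rng.integers(0, 10, size=200)

# Precomputed 3-fold speaker-independent splits.
folds = list(GroupKFold(n_splits=3).split(X, y, groups=speakers))

sffs = SFS(SVC(kernel="linear"),
           k_features=10,     # target subset size (150 on the slides)
           forward=True,
           floating=True,     # floating -> SFFS rather than plain SFS
           scoring="f1_macro",
           cv=folds)
sffs = sffs.fit(X, y)
print(sorted(sffs.k_feature_idx_))
```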

  21. beyond the state of the art: units of analysis

  22. Performance for Different Chunks, Preliminary Experiments at FBK, Small Feature Set, F Values

unit of analysis                            #      F
optimal, i.e. adjacent identical labels     7008   64.0
  (e.g.: NNN EE AAAA NNNNN M N MMM)
turns (pause > 1.5 sec.)                    3990   50.0
syntactic-prosodic rule system              9152   55.2
words                                       17611  55.0
syntactic rule system (clauses/phrases/…)   9102   53.9
prosodic rule system (pause > 0.5 sec.)     5129   53.0
LM2 (bi-gram language model)                5480   52.8
POS-LM (part-of-speech language model)      4637   56.0
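The "optimal" unit can be reproduced mechanically: merge runs of adjacent identical word labels, so that every chunk is emotionally homogeneous by construction. A one-liner on the slide's own example sequence:

```python
from itertools import groupby

# Word-level cover-class labels of the slide's example turn.
labels = "NNNEEAAAANNNNNMNMMM"

# Merge runs of adjacent identical labels into chunks.
chunks = ["".join(run) for _, run in groupby(labels)]
print(chunks)  # ['NNN', 'EE', 'AAAA', 'NNNNN', 'M', 'N', 'MMM']
```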

  23. Summing up our Results
• impact of acoustic feature types: energy most important, voice quality least important, other types in between (note domain dependency!)
• impact of linguistic feature types: very high; to be checked with real Automatic Speech Recognition (ASR) output
• sequential approach promising
• chunking is the right way to go
• emotion recognition seems to be less prone to noise than comparable speech processing tasks (ICASSP 2007)
• PDA (Pitch Detection Algorithm) extraction errors degrade performance consistently, but not severely (ICPhS 2007)

  24. In a Nutshell
• full exploitation of state-of-the-art approaches
  • > 4 k features
  • knowledge-based vs. brute-force
  • selection and classification
• and beyond the state of the art
  • towards new dimensions (UMUAI 2007)
  • meaningful units of analysis (chunking)
  • interaction/dialogue modelling
  • prototyping
  • personalization
  • …

  25. and the Message of the Day
• people stare at classification performance
• which is tuned explicitly by highly sophisticated classifiers
• and implicitly by settings not obvious to the 'normal' reader, such as
  • manual emotion chunking
  • using only prototypes
  • using acted data
  • and other devices

  26. Thank you for your attention
