Emotional Speech

Emotional Speech CS 4706 Julia Hirschberg (thanks to Jackson Liscombe and Lauren Wilcox for some slides)

Outline
• Why study emotional speech?
• Why is modeling emotional speech so difficult?
• Production and perception studies
• Voice quality features: the holy grail

Why study emotional speech?
• Recognition
  • Customer-care centers
  • Tutoring systems
  • Automated agents (Wildfire)
• Generation
  • Characteristics of ‘emotional speech’ little understood, so hard to produce: …a voice that sounds friendly, sympathetic, authoritative….
  • TTS systems
  • Games

Emotion in Spoken Dialogue Systems
• Batliner, Huber, Fischer, Spilker, Nöth (2003)
  • Verbmobil (Wizard of Oz scenarios)
• Ang, Dhillon, Krupski, Shriberg, Stolcke (2002)
  • DARPA Communicator
• Liscombe, Riccardi, Hakkani-Tür (2005)
  • “How May I Help You?” call center
• Lee, Narayanan (2004)
  • Speechworks call center
• Liscombe, Hirschberg, Venditti (2005)
  • ITSpoke Tutoring System (physics)

Why is emotional speech so hard to model?
• Colloquial definitions of speakers and listeners ≠ technical definitions
• Utterances may convey multiple emotions simultaneously
• Result:
  • Human consensus low
  • Hard to get reliable training data

Spontaneous Corpora
• Unconstrained
  • [Campbell, 2003] [Roach, 2000]
  • [Cowie et al., 2001]
• Call centers
  • [Vidrascu & Devillers, 2005] [Ang et al., 2002]
  • [Litman and Forbes-Riley, 2004] [Batliner et al., 2003]
  • [Lee & Narayanan, 2005]
• Meetings
  • [Wrede and Shriberg, 2003]

Acted Corpora
happy, sad, angry, confident, frustrated, friendly, interested, anxious, bored, encouraging

LDC Emotional Prosody and Transcripts corpus
• Semantically neutral (dates and numbers)
• 8 actors
• 15 emotions

Are Emotions Mutually Exclusive?
• User study to classify tokens from LDC Emotional Prosody corpus
• 10 emotions only:
  • Positive: confident, encouraging, friendly, happy, interested
  • Negative: angry, anxious, bored, frustrated, sad
• Example

Emotion Intercorrelations (p < 0.001)

Results
• Emotions are heavily correlated
  • Positive with positive
  • Negative with negative
• Emotions are non-exclusive
• Can they be clustered empirically?
  • Activation
  • Valence
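One way to see what "heavily correlated" means here is to correlate the per-token ratings that listeners give for two emotion labels. The sketch below is illustrative only: the rating scale and the numbers are invented, not taken from the user study, and plain Pearson correlation is an assumed choice of statistic.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings (0 = not at all, 4 = extremely) for five tokens.
# These values are made up for illustration.
ratings = {
    "happy":      [4, 3, 0, 1, 0],
    "friendly":   [3, 4, 1, 1, 0],
    "frustrated": [0, 0, 4, 3, 2],
}

same_valence = pearson(ratings["happy"], ratings["friendly"])
opposite_valence = pearson(ratings["happy"], ratings["frustrated"])
print(round(same_valence, 2), round(opposite_valence, 2))
```

With ratings shaped like these, two positive emotions correlate positively and a positive/negative pair correlates negatively, which is the pattern the slide summarizes.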

Global Pitch Statistics: Different Valence/Activation

Different Valence/Same Activation

Identifying Emotions
• Automatic Acoustic-prosodic [Davitz, 1964] [Huttar, 1968]
  • Global characterization
    • pitch
    • loudness
    • speaking rate
  • Intonational Contours [Mozziconacci & Hermes, 1999]
  • Spectral Tilt [Banse & Scherer, 1996] [Ang et al., 2002]
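A "global characterization" reduces a whole utterance to a few summary numbers. The sketch below assumes a pitch tracker has already produced a frame-by-frame F0 contour in Hz, with 0 marking unvoiced frames, and that a syllable count is available. The function names and the input format are assumptions for illustration, not part of any particular toolkit.

```python
from statistics import mean, pstdev

def global_pitch_stats(f0_hz):
    """Summarize an F0 contour (Hz per frame) over voiced frames only."""
    voiced = [f for f in f0_hz if f > 0]  # assumption: 0 marks an unvoiced frame
    if not voiced:
        raise ValueError("contour has no voiced frames")
    return {
        "mean": mean(voiced),
        "min": min(voiced),
        "max": max(voiced),
        "range": max(voiced) - min(voiced),
        "stdev": pstdev(voiced),
    }

def speaking_rate(n_syllables, duration_s):
    """Speaking rate in syllables per second."""
    return n_syllables / duration_s

# Hypothetical contour: two unvoiced frames and three voiced frames.
stats = global_pitch_stats([0, 100, 200, 0, 150])
rate = speaking_rate(6, 2.0)
print(stats["mean"], stats["range"], rate)
```

Loudness can be summarized the same way from a frame-level energy track.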

Machine Learning Experiment
• RIPPER 90/10 split
  • Binary classification for each emotion
• Results
  • 62% average baseline
  • 75% average accuracy
  • Acoustic-prosodic features for activation
  • /H-L%/ for negative; /L-L%/ for positive
  • Spectral tilt for valence?
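To make the baseline and accuracy figures concrete, the sketch below shows how a multi-emotion label set becomes one binary task per emotion, and how a majority-class baseline is computed for such a task. It does not reproduce RIPPER or the corpus: the labels are invented, and the assumption that the baseline is the majority class is mine. The last helper expresses the slide's 62% and 75% as a relative error reduction.

```python
from collections import Counter

def one_vs_rest(labels, target):
    """Turn multi-class labels into a binary task: target emotion vs. everything else."""
    return [label == target for label in labels]

def majority_baseline(binary_labels):
    """Accuracy of always predicting the more frequent class."""
    counts = Counter(binary_labels)
    return max(counts.values()) / len(binary_labels)

def accuracy(predicted, gold):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

def relative_error_reduction(baseline_acc, model_acc):
    """Share of the baseline's errors that the model removes."""
    return (model_acc - baseline_acc) / (1.0 - baseline_acc)

# Hypothetical labels, for illustration only.
labels = ["angry", "sad", "happy", "angry", "bored"]
angry_task = one_vs_rest(labels, "angry")
baseline = majority_baseline(angry_task)

# The averages reported on the slide: 62% baseline, 75% accuracy.
reduction = relative_error_reduction(0.62, 0.75)
print(baseline, round(reduction, 2))
```

Under this reading, the 13-point absolute gain corresponds to removing roughly a third of the baseline's errors.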

Accuracy Distinguishing One Emotion from the Rest

A Call Center Application
• AT&T’s “How May I Help You?” system
• Customers often angry and frustrated

HMIHY Example: Very Frustrated, Somewhat Frustrated

Pitch, Energy and Rate

Features
• Automatic Acoustic-prosodic
• Contextual [Cauldwell, 2000]
• Lexical [Schröder, 2003] [Brennan, 1995]
• Pragmatic [Ang et al., 2002] [Lee & Narayanan, 2005]

Results

Tutoring Systems Should Respond to Uncertainty
• SCoT [Pon-Barry et al. 2006]
• Responding to uncertainty
  • Active listening
  • Hinting vs. paraphrasing
• Features examined
  • Latency
  • Filled pauses
  • Hedges
• Performance metric
  • Learning gain
• But no improvement by responding to uncertainty
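The three features examined (latency, filled pauses, hedges) can be read off a time-stamped transcript. The sketch below is a minimal illustration, not the SCoT implementation: the word lists, the function name, and the example utterance are all assumptions.

```python
import re

# Assumed lexicons for illustration; a real system would use a curated list.
FILLED_PAUSES = {"um", "uh", "er", "hmm"}
HEDGES = ("i think", "i guess", "maybe", "probably", "sort of", "kind of")

def uncertainty_cues(utterance, latency_s):
    """Count filled pauses and hedges in an utterance; pass latency through."""
    tokens = re.findall(r"[a-z']+", utterance.lower())
    normalized = " ".join(tokens)  # punctuation removed, single spaces
    filled = sum(token in FILLED_PAUSES for token in tokens)
    hedges = sum(
        len(re.findall(r"\b" + re.escape(hedge) + r"\b", normalized))
        for hedge in HEDGES
    )
    return {"latency_s": latency_s, "filled_pauses": filled, "hedges": hedges}

# Hypothetical student turn with a 2.4 s response latency.
cues = uncertainty_cues(
    "Um, I think mass is, uh, maybe the space an object takes up?", 2.4
)
print(cues)
```

Counts like these could then be fed to a classifier or a threshold rule that decides when the tutor should respond to uncertainty.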

What does uncertainty sound like?

[pr01_sess00_prob58]

Uncertainty in ITSpoke
um <sigh> I don’t even think I have an idea here ...... now .. mass isn’t weight ...... mass is ................ the .......... space that an object takes up ........ is that mass?
[71-67-1:92-113]

ITSpoke Experiment
• Human-Human Corpus
• AdaBoost(C4.5) 90/10 split in WEKA
• Classes: Uncertain vs Certain vs Neutral
• Results:

ITSpoke Results

Voice Quality and Emotion
• Perceptual coloring
• Derived from a variety of laryngeal and supralaryngeal features
• modal, creaky, whispered, harsh, breathy, ...
• Correlates with emotion
  • Laver ’80, Scherer ’86, Murray & Arnott ’93, Laukkanen ’96, Johnstone & Scherer ’99, Gobl & Ní Chasaide ’03, Fernandez ’00

Phonation Gestures
• Adductive tension: interarytenoid muscles adduct the arytenoid cartilages
• Medial compression: adductive force on vocal processes; adjustment of ligamental glottis
• Longitudinal tension: tension of vocal folds

Modal Voice
• “Neutral” mode
• Muscular adjustments moderate
• Vibration of vocal folds periodic, full closing of glottis, no audible friction
• Frequency of vibration and loudness in low to mid range for conversational speech

Tense Voice
• Very strong tension of vocal folds, very high tension in vocal tract

Whispery Voice
• Very low adductive tension
• Medial compression moderately high
• Longitudinal tension moderately high
• Little or no vocal fold vibration
• Turbulence generated by friction of air in and above larynx

Creaky Voice
• Vocal fold vibration at low frequency, irregular
• Low tension (only ligamental part of glottis vibrates)
• Vocal folds strongly adducted
• Longitudinal tension weak
• Moderately high medial compression

Breathy Voice
• Tension low
• Minimal adductive tension
• Weak medial compression
• Medium longitudinal vocal fold tension
• Vocal folds do not come together completely, leading to frication

Estimating Voice Quality
• Estimate with respect to a controlled neutral quality
  • But how do we know the control is truly “neutral”?
  • Must match the natural laryngeal behavior to laboratory “neutral”
• Our knowledge of models of vocal fold movements may be inadequate for describing real phonation
• Known relationships between acoustic signal and voice source are complex
• Voicing behavior can only be observed indirectly, so estimates are prone to error
• Direct source data is obtained by invasive techniques, which may interfere with the signal
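Estimating a quality "with respect to" a neutral control amounts to expressing each measurement as a deviation from that speaker's neutral baseline. The sketch below uses a z-score as one possible way to do this. The feature (for example, a spectral tilt value in dB), the numbers, and the choice of z-scoring are assumptions, and the result is only as trustworthy as the claim that the baseline recordings are truly neutral.

```python
from statistics import mean, stdev

def relative_to_neutral(value, neutral_values):
    """Express a measurement in standard deviations from the neutral baseline."""
    baseline_mean = mean(neutral_values)
    baseline_sd = stdev(neutral_values)  # sample standard deviation
    if baseline_sd == 0:
        raise ValueError("neutral baseline has no variation")
    return (value - baseline_mean) / baseline_sd

# Hypothetical values of one acoustic feature from a speaker's neutral
# recordings, and one value from a test utterance.
neutral_samples = [8.0, 10.0, 12.0]
deviation = relative_to_neutral(14.0, neutral_samples)
print(deviation)
```

A large deviation suggests a departure from the speaker's modal setting, but it cannot say which phonation gesture caused it.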

Next Class
• Deceptive Speech
