Emotional Speech

Emotional Speech Julia Hirschberg CS 6998

Today • Defining emotional speech • Emotional categories • Eliciting judgments • Producing emotional speech • Detecting emotional speech • A Subclass: Deceptive speech

Cowie ‘00 • Is there a good theoretical or practical definition of emotional speech? • “Full-blown” emotion vs. emotional state • Cause and effect descriptions • Primary and secondary (second order) • Everyday descriptions • Representations • Biological

Dimensions in continuous space, e.g. • Valence: positive or negative • Activation level: how disposed to take action • Structural models: different ways of appraising situation that evokes emotion • e.g. positive or negative? Does situation help agent to achieve his/her goals? • Timing as a key variable • sadness vs. grief vs. depression vs. gloominess

How are emotions expressed? • Display rules? In speech? • Mixing • Simulation

Schroeder ‘01: Emotion in Synthesis • How is a given emotion expressed in speech? • What are the properties of the emotion to be expressed? How are they related to those of other emotions? • What kind of synthesizer works best? • Formant • Diphone • Unit selection

Prosody rules: what to modify? • How do we evaluate the results? • Forced choice • Free response • Recognition rate • Perceived naturalness
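A rough sketch of how a recognition rate could be computed from a forced-choice listening test. The function names and the toy listener responses are hypothetical and only illustrate the scoring; they are not taken from Schroeder's paper.

```python
from collections import Counter


def recognition_rate(intended, chosen):
    """Fraction of stimuli for which the listener picked the intended emotion."""
    if len(intended) != len(chosen):
        raise ValueError("intended and chosen must have the same length")
    if not intended:
        raise ValueError("need at least one response")
    hits = sum(1 for i, c in zip(intended, chosen) if i == c)
    return hits / len(intended)


def confusion_counts(intended, chosen):
    """Count (intended, chosen) pairs to show which emotions get confused."""
    return Counter(zip(intended, chosen))


# Hypothetical forced-choice responses (made up for illustration).
intended = ["anger", "anger", "joy", "joy", "sadness", "sadness"]
chosen = ["anger", "joy", "joy", "joy", "sadness", "anger"]

rate = recognition_rate(intended, chosen)  # 4 of the 6 responses match
confusions = confusion_counts(intended, chosen)
```

Recognition rate alone says nothing about perceived naturalness, which would need a separate rating scale.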

Ten Bosch ‘00: Emotion Recognition • How hard is the problem? • Is ‘standard’ ASR technology well-suited to it? • Acoustic and language models target short local events • Feature extraction normalizes/excludes e.g. pitch, rate, amplitude -- why? • Interaction: emotional speech and ASR performance • Synthesis needs one good example but...

Ang et al • Challenges: • Use output from ASR system • Use automatic prosodic features • Find good speaker normalization • Combine with lexical features • Pioneered approach of “direct modeling” – no use of intermediate phonological units • Applications: detecting frustration, disappointment/tiredness, amusement/surprise • Results: prediction comparable to human accuracy 70-75%

Method: Prosodic Models • Extract pitch from signal • Speech recognizer outputs word and phone alignments (duration features) • Utterance-level features extracted (e.g., max speaker normalized pitch in the longest phone-normalized vowel, etc) • Decision trees created to provide posterior probabilities of emotion classes given features • Feature selection from development test set • Separate test set used for evaluation
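A minimal sketch of two steps in this pipeline: speaker normalization of a pitch feature, and a tree leaf returning posterior probabilities of emotion classes. It assumes a single feature and a one-split tree (a decision stump) chosen by Gini impurity, which is a simplification and not the decision-tree setup used by Ang et al. All names and the toy pitch values are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def speaker_normalize(values, speakers):
    """Z-score each value against its own speaker's mean and std deviation."""
    by_speaker = defaultdict(list)
    for v, s in zip(values, speakers):
        by_speaker[s].append(v)
    # Fall back to 1.0 when a speaker's std deviation is zero.
    stats = {s: (mean(vs), pstdev(vs) or 1.0) for s, vs in by_speaker.items()}
    return [(v - stats[s][0]) / stats[s][1] for v, s in zip(values, speakers)]


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def posterior(labels):
    """Relative class frequencies at a leaf, used as posterior estimates."""
    n = len(labels)
    return {k: c / n for k, c in Counter(labels).items()}


def fit_stump(x, y):
    """Pick the threshold with the lowest weighted Gini impurity."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [lab for v, lab in zip(x, y) if v <= t]
        right = [lab for v, lab in zip(x, y) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, t, posterior(left), posterior(right))
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]  # (threshold, left posterior, right posterior)


def predict_posterior(stump, value):
    threshold, left_post, right_post = stump
    return left_post if value <= threshold else right_post


# Hypothetical max-pitch values (Hz) for a low-pitched and a high-pitched speaker.
pitch = [110.0, 115.0, 150.0, 155.0, 210.0, 215.0, 260.0, 265.0]
speaker = ["A", "A", "A", "A", "B", "B", "B", "B"]
label = ["neutral", "neutral", "frustrated", "frustrated"] * 2

z = speaker_normalize(pitch, speaker)
stump = fit_stump(z, label)
```

In this toy data no single raw-Hz threshold separates the classes, because speaker A's raised pitch is still below speaker B's neutral pitch; after per-speaker normalization one split does.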

Prosodic Features • Duration features • Phone / Vowel / Syllable Durations • Normalized by Phone/Vowel Means, Speaker • Speaking rate features (vowels/time) • Pause features • Speech to pause ratio, number of long pauses • Maximum pause length • Energy features (RMS energy) • Pitch features • Used pitch stylization algorithm (Sonmez et al.) • LTM model of F0 to estimate speaker range • Pitch ranges, slopes, locations of interest • Spectral tilt features • Other (non-prosodic) features • Position of utterance in dialog • Repeat or correction
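The pause and speaking-rate features listed above could be computed from phone-level alignments roughly as follows. This is a sketch under stated assumptions: the `sil` pause label, the small vowel set, the 0.5 s long-pause threshold, and the toy alignment are all hypothetical choices, not values from the paper.

```python
VOWELS = {"aa", "ae", "ah", "eh", "ih", "iy", "ow", "uw"}  # hypothetical subset
PAUSE = "sil"  # hypothetical label for silence segments


def pause_and_rate_features(segments, long_pause=0.5):
    """Compute pause and rate features from (label, start, end) segments in seconds."""
    if not segments:
        raise ValueError("need at least one segment")
    pauses = [end - start for lab, start, end in segments if lab == PAUSE]
    speech = [end - start for lab, start, end in segments if lab != PAUSE]
    n_vowels = sum(1 for lab, _, _ in segments if lab in VOWELS)
    total = segments[-1][2] - segments[0][1]  # utterance duration
    pause_time = sum(pauses)
    return {
        "speech_to_pause_ratio": (
            sum(speech) / pause_time if pause_time > 0 else float("inf")
        ),
        "num_long_pauses": sum(1 for p in pauses if p >= long_pause),
        "max_pause": max(pauses, default=0.0),
        "vowels_per_second": n_vowels / total,
    }


# Hypothetical alignment for a short utterance.
segments = [
    ("hh", 0.0, 0.1), ("ah", 0.1, 0.3), ("sil", 0.3, 1.0),
    ("l", 1.0, 1.1), ("ow", 1.1, 1.5), ("sil", 1.5, 1.75),
    ("iy", 1.75, 2.0),
]
features = pause_and_rate_features(segments)
```

Duration features would additionally be normalized by phone and speaker means, which this sketch leaves out.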

Emotion in Deception • Motivation: why might such cues exist? • Deception evokes emotion in deceivers (e.g. Ekman ‘85-92) • Fear of discovery: higher pitch, faster, louder, pauses, disfluencies, indirect speech • Elation at successful deceiving: higher pitch, faster, louder, greater elaboration

Acoustic/Prosodic/Lexical Cues • Are deceivers less forthcoming? • Shorter speech with fewer details • Are lies less compelling than truths? • Less plausible, logical, more discrepancies • Less verbal and vocal ‘involvement’ • Less verbal ‘immediacy’: more passives, negations, indirect speech • More uncertainty (subjective) • More repetitions • Are liars less positive, pleasant?

More negative statements, complaints • Are liars more tense? • Nervous overall • Vocal tension • High pitch • Do lies contain fewer ‘imperfections’? • Fewer self-repairs • Fewer admissions of forgetfulness • Fewer scene descriptions, details • More mention of peripheral events or relationships

Current State-of-the-Art • No single cue to deceptive speech: most studied are visual • Other acoustic/prosodic features proposed, but evidence mixed so far • Loudness/intensity • Speaking rate • Response latency • Disfluencies • No attested method to detect deception automatically using acoustic/prosodic/lexical cues • All current findings are descriptive, suggestive • All proposed methods require human intervention

Our Approach • Elicit deceptive and non-deceptive corpus • Motivation: Identity-relevant (self-image) and instrumental (monetary) incentives • “Real” deception vs. acted • Good recording conditions • Tasks/interview paradigm • Transcription/annotation • Acoustic/prosodic/lexical analysis to identify features of interest, test validity of paradigm • Automatic feature extraction and analysis to train models of deceptive and non-deceptive speech

Corpus Collection • Subjects asked to perform tasks for comparison with target profile of 25 top entrepreneurs • Performance manipulated to produce performance same as/differing from target • Monetary incentive to convince an interviewer they matched target • Recorded interview/interrogation • Biographical information (t/f) • “Big lie” on task performance • “Local lie”: Pedal indicators of t/f for each answer

Collection • To date: 15 subjects, totaling ~3h of subject speech • Planned: 7-8 hours of subject speech

Results of Prosodic/Acoustic Analysis • On Arizona Mock Theft data subset: • 32 interviews/72m, required segmentation, recording issues (50/160m more being segmented) • Significant pitch feature differences between deceptive and non-deceptive speech, but... • Highly motivated speakers lower pitch when lying • Low motivation speakers raise pitch when lying • Males lower pitch when lying • Females raise pitch when lying

On Columbia corpus: • Preliminary analyses of 8 speakers for ‘local’ t/f • Significant differences in pitch range for six subjects, but differ from Mock Theft wrt gender • Lexical findings: • Preliminary analyses on Columbia data using LIWC show negative words more prevalent in deceptive speech
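A category-rate measure of the kind a word-count tool reports can be sketched as below. The word list is a tiny hypothetical stand-in, not the LIWC dictionary, and the example sentence is made up; the sketch only shows the proportion-of-tokens computation.

```python
import re

# Hypothetical toy lexicon; NOT the LIWC negative-emotion category.
NEGATIVE_WORDS = {"hate", "bad", "worthless", "awful", "angry"}


def negative_word_rate(text, lexicon=NEGATIVE_WORDS):
    """Proportion of word tokens in text that appear in the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in lexicon) / len(tokens)


# Made-up sentence: 2 of its 5 tokens are in the toy lexicon.
example_rate = negative_word_rate("I hate that awful test")
```

Comparing such rates between segments marked true and false would still need a significance test across speakers.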
