Modeling and Perceiving of (Un)Certainty in Articulatory Speech Synthesis

Charlotte Wollermann*, Eva Lasarcyk**
*Institute of Communication Sciences, University of Bonn
**Institute of Phonetics, Saarland University
cwo@ifk.uni-bonn.de, evaly@coli.uni-sb.de

Overview
1. Introduction
  1.1 Emotion/attitude and speech synthesis
  1.2 Previous studies on (un)certainty
  1.3 Goal of the current study
2. Modeling of (un)certainty in articulatory speech synthesis
  2.1 Acoustical criteria
  2.2 The articulatory speech synthesis system
3. Perception studies
  3.1 Experiment 1
  3.2 Experiment 2
4. Conclusions

1. Introduction
1.1 Emotion and speech synthesis
• The modeling of emotion and attitude has gained considerable importance in the last few years
• Goal: generating synthetic speech that is as natural and human-like as possible
• Multimodal speech synthesis systems as possible applications: Talking Heads (Beskow 2003), Embodied Conversational Agents (Cassell et al. 1999)
• Most emotional speech synthesis systems are based on the prototypical emotions according to Ekman (1972): happiness, sadness, anger, fear, surprise, disgust

1. Introduction
Emotion and speech synthesis
• A different approach from emotion psychology: using evaluation, activation and power as basic dimensions for representing emotional states (Wundt 1896)
  • Example: EmoSpeak as part of the TTS system MARY (Schröder 2004; Schröder, Trouvain 2003)
• Why investigate emotion and attitude in articulatory speech synthesis?
  • The modeling of attitude has barely been investigated
  • 3D articulatory synthesizer (Birkholz 2005): a high degree of freedom combined with precise adjustment of single parameters
  • (Un)certainty as a non-prototypical emotion

1. Introduction
1.2 Previous studies: Production and perception of (un)certainty in natural speech – Acoustic domain
• Smith, Clark (1993): studying memory processes in question answering
  • Feeling of Knowing (FOK) paradigm (Hart 1965)
  • Uncertainty is prosodically marked by rising intonation and delay, and by linguistic hedges like "I guess"
• Brennan, Williams (1995): perception of another speaker's uncertainty (Feeling of Another's Knowing, FOAK)
  • Relevant cues: intonation, form and latency of the answer; fillers like "hm", "uh"

1. Introduction
Previous studies: Production and perception of (un)certainty in natural speech – Audiovisual domain
• Swerts et al. (2003, 2005): production and perception of (un)certainty in audiovisual speech
  • Delay, pauses and fillers
  • Smiles, funny faces etc.
1.3 Goal of the current study
• Investigating the perception of uncertainty in human-machine interaction by using articulatory synthesis

Modeling of uncertainty in articulatory speech synthesis
General setting
• Stimuli are embedded into a context: a telephone dialog between a caller and a weather expert system
  • Caller: "Wie wird das Wetter nächste Woche in ... ?" ("How is the weather going to be next week in ... ?")
  • System: "Eher kalt." ("Rather cold.")
• Different levels of uncertainty are indicated by the presence of
  • High intonation
  • Delay
  • Fillers

Modeling of uncertainty in articulatory speech synthesis
Intonation
• Variation takes place on the last word
• Either rising or falling contour
Delay and filler structure
[Figure: stimulus timelines for Experiment 1 and Experiment 2, labeled C, U1, U2, U3. Each timeline starts with the question "Wie wird das Wetter nächste Woche in ... ?" and ends with the answer "Eher kalt". Timelines in the order shown:]
• Question, 1000 ms, answer
• Question, 1000 ms, answer
• Question, 2200 ms, answer
• Question, 1500 ms, "Hmm", 1000 ms, answer
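The delay and filler structure above can be written down as plain data. The following is a minimal sketch, not the authors' tooling. The class `Stimulus`, the event encoding, and in particular the assignment of the labels C, U1, U2, U3 and of the falling/rising contours to the four timelines are assumptions (labels taken in slide order, contours inferred from the later discussion). The duration of the filler itself is not given on the slide, so only the silent pauses are summed.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# An event is ("pause", duration_ms) or ("filler", text). Hypothetical encoding.
Event = Tuple[str, Union[int, str]]


@dataclass(frozen=True)
class Stimulus:
    contour: str                 # "falling" or "rising" on the last word
    events: Tuple[Event, ...]    # what happens between question and answer

    def silent_delay_ms(self) -> int:
        """Sum of the silent pauses before the answer (filler duration excluded)."""
        return sum(value for kind, value in self.events if kind == "pause")

    def has_filler(self) -> bool:
        return any(kind == "filler" for kind, _ in self.events)


# ASSUMPTION: label-to-timeline and label-to-contour mapping is inferred, not stated.
STIMULI = {
    "C":  Stimulus("falling", (("pause", 1000),)),
    "U1": Stimulus("rising",  (("pause", 1000),)),
    "U2": Stimulus("rising",  (("pause", 2200),)),
    "U3": Stimulus("rising",  (("pause", 1500), ("filler", "Hmm"), ("pause", 1000))),
}

if __name__ == "__main__":
    for label, stim in STIMULI.items():
        print(label, stim.contour, stim.silent_delay_ms(), stim.has_filler())
```

Keeping the cues as separate fields makes it easy to see which stimuli differ in exactly one cue and which differ in several at once.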

Modeling of uncertainty in articulatory speech synthesis: Overview
[Figure: overview of the articulatory synthesizer (Birkholz 2005). Components shown: gestural score, vocal tract, one-dimensional tube model, aerodynamic-acoustic simulation, speech signal.]
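To give a feel for what a one-dimensional tube model captures, here is a small textbook calculation. It is not the aerodynamic-acoustic simulation of the synthesizer: it only computes the resonances of an idealized uniform tube that is closed at the glottis and open at the lips, F_n = (2n - 1) c / (4 L). The function name, the tube length of 17.5 cm and the speed of sound of 350 m/s are illustrative assumptions.

```python
def uniform_tube_resonances(length_cm: float, count: int = 3,
                            speed_of_sound_cm_s: float = 35000.0) -> list:
    """Resonance frequencies (Hz) of a uniform tube closed at one end, open at the other.

    Quarter-wavelength resonator: F_n = (2n - 1) * c / (4 * L).
    """
    if length_cm <= 0:
        raise ValueError("tube length must be positive")
    return [(2 * n - 1) * speed_of_sound_cm_s / (4.0 * length_cm)
            for n in range(1, count + 1)]


if __name__ == "__main__":
    # ASSUMPTION: 17.5 cm is a commonly used textbook vocal tract length.
    print(uniform_tube_resonances(17.5))  # [500.0, 1500.0, 2500.0]
```

A full synthesizer replaces the single uniform tube by many short sections whose cross-sectional areas change over time, which is what moves these resonances around.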

[Figure: gestural score for "ziemlich kühl" ("quite cool")]

[Figure: gestural score for "... hm ..."]

3. Perception studies
3.1 Experiment 1
Goal
• Are subjects able to recognize intended certain/uncertain utterances in articulatory speech synthesis?
• Does certainty influence intelligibility?
Method
• Subjects: 38 students from the University of Bonn and Saarland University
• Audio presentation in a group experiment and in individual testing (two different random orders of the stimuli)
• Task: judging the certainty and intelligibility of each answer of the expert system on a 5-point Likert scale (1 = uncertain/unintelligible, 5 = certain/intelligible)
• Analysis: Wilcoxon signed-rank test
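The Wilcoxon signed-rank test compares paired ratings, for example the ratings one listener gives to two stimuli. The following is a minimal pure-Python sketch of the test with a normal approximation for the p-value. It is not the analysis script of the study, it omits the tie correction of the variance, and the ratings in the usage example are invented purely for illustration.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test (two-sided, normal approximation).

    Returns (w_plus, w_minus, z, p). Zero differences are dropped and tied
    absolute differences receive average ranks. No tie correction is applied
    to the variance, so p is only approximate, especially for small samples.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (min(w_plus, w_minus) - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return w_plus, w_minus, z, p


if __name__ == "__main__":
    # HYPOTHETICAL ratings on the 5-point scale, not data from the study.
    certain_stimulus = [5, 4, 5, 4, 5, 3, 4, 5]
    uncertain_stimulus = [2, 2, 3, 1, 2, 3, 2, 1]
    print(wilcoxon_signed_rank(certain_stimulus, uncertain_stimulus))
```

A rank-based test is a natural fit here because Likert ratings are ordinal, so differences between scale points should not be treated as exact distances.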

3. Perception studies
3.1 Experiment 1
Results
[Figures: perception of certainty (scale from uncertain to certain) and perception of intelligibility (scale from unintelligible to intelligible), with pairwise significance markers: * p < 0.05, ** p < 0.001, n.s. not significant.]
Discussion
• Technical problems: reason for the relatively low intelligibility of "relativ heiss" ("relatively hot")
• What role do fillers play in perceiving uncertainty?

3. Perception studies
3.2 Experiment 2
Goal
• To what extent do different combinations of acoustic cues affect the perception of uncertainty?
Method
• Subjects: 34 students from the University of Bonn
• Audio presentation in three group experiments (three different random orders of the stimuli)
• Same procedure as in Experiment 1, but this time *only* judging the certainty of each answer of the system on a 5-point Likert scale (1 = uncertain, 5 = certain)
• Analysis: Wilcoxon signed-rank test

3. Perception studies
3.2 Experiment 2
Results
[Figure: perception of the intended levels of uncertainty (scale from uncertain to certain). Two pairs are marked n.s. (not significant); all other pairs differ significantly with p < 0.001.]

3. Perception studies
3.2 Experiment 2
Discussion
• Results from Experiment 1 are generally confirmed: intended certain utterances can be clearly distinguished from uncertain ones (variation of intonation and delay)
• Levels of uncertainty: what do our data suggest?
  • Signaling uncertainty by high intonation alone is sufficient for perceiving uncertainty
  • Delay as an additional acoustic cue does not yield a higher degree of perceived uncertainty
  • The combination of fillers, delay and high intonation has the strongest effect
  • BUT: the role of delay and fillers "per se" cannot be inferred from our data

4. Conclusions
• The study presents a first step towards modeling certainty and different degrees of uncertainty by means of articulatory speech synthesis
• Perception: intonation by itself contributes to a higher degree of perceived uncertainty in our data; the combination of all three acoustic cues yields the highest degree of perceived uncertainty
• Open questions
  • Influence of fillers and of delay "per se"
  • Problems with judging a machine's meta-cognitive state
  • Influence of the choice of wording
• Future work: testing audiovisual stimuli for different degrees of uncertainty, and finally making use of the 3-D vocal tract provided by the articulatory synthesizer

Literature
• Beskow, J. (2003). Talking Heads – Models and Applications for Multimodal Speech Synthesis. Doctoral dissertation, KTH, Stockholm, Sweden.
• Birkholz, P. (2005). 3-D Artikulatorische Sprachsynthese. Berlin: Logos Verlag.
• Brennan, S. E. and Williams, M. (1995). "The feeling of another's knowing: Prosody and filled pauses as cues to listeners about the metacognitive states of speakers". In Journal of Memory and Language, 34, 383-398.
• Cassell, J., Bickmore, T., Billinghurst, M., Campbell, L., Chang, K., Vilhjálmsson, H., Yan, H. (1999). "Embodiment in conversational interfaces: REA". In Proceedings of ACM CHI 99, 520-527.
• Ekman, P. (1972). "Universals and cultural differences in facial expressions of emotion". In Cole, J. (ed.), Nebraska Symposium on Motivation 1971, vol. 19, 207-283. Lincoln, NE: University of Nebraska Press.
• Hart, J. T. (1965). "Memory and the feeling-of-knowing experience". In Journal of Educational Psychology, 56, 208-216.
• Lasarcyk, E. (2007). "Investigating Larynx Height With An Articulatory Speech Synthesizer". In Proceedings of the 16th ICPhS, Saarbrücken, August 2007.
• Lasarcyk, E. and Trouvain, J. (2007). "Imitating conversational laughter with an articulatory speech synthesizer". To appear in Proceedings of the Interdisciplinary Workshop on the Phonetics of Laughter, Saarbrücken, August 2007.
• Schröder, M. (2004). Speech and Emotion Research: An overview of research frameworks and dimensional approach to emotional speech synthesis. PhD thesis, PHONUS 7, Research Report of the Institute of Phonetics, Saarland University.
• Smith, V. and Clark, H. (1993). "On the course of answering questions". In Journal of Memory and Language, 32, 25-38.
• Swerts, M., Krahmer, E., Barkhuysen, P. and van de Laar, L. (2003). "Audiovisual cues to uncertainty". In Proceedings of ISCA workshop on error handling in spoken dialog systems, Chateau-d'Oex, Switzerland, August/September 2003.
• Swerts, M. and Krahmer, E. (2005). "Audiovisual prosody and feeling of knowing". In Journal of Memory and Language, 53:1, 81-94.
• Schröder, M. and Trouvain, J. (2003). "The German Text-to-Speech Synthesis System MARY: A Tool for Research, Development and Teaching". In International Journal of Speech Technology, 6, 365-377.
• Wundt, W. (1896). Grundriss der Psychologie. Leipzig: Verlag von Wilhelm Engelmann.
