Recognition and Understanding of Prosody

Scott Simpson • Ling 575: Spoken Dialog Systems

Overview • Prosody and Speech Recognition • Overview of prosody in the context of general and computational linguistics • Approaches to prosodic modeling • Overview of successful approaches to prosodic modeling • Applications of prosodic modeling • Summary of a few applications that use the framework • Questions

Prosody and Speech Recognition • Linguistic prosody • “use of suprasegmental features to convey sentence-level pragmatic meanings”, Ladd (1996) • Functions of prosody • Marks discourse structure or functions • Marks saliency • Conveys affective and emotional meaning

Prosody and Speech Recognition (cont.) • Prosodic aspects important to speech recognition • Prosodic structure • Prosodic prominence • Tune • Reasons for using prosodic features • Provides additional knowledge beyond the word sequence • May help in overcoming word recognition errors • Most problems for which prosody is a plausible knowledge source can be cast as statistical classification problems

Approaches to Prosodic Modeling • Probabilistic Framework • P(S|F), where S represents a target class for some linguistic unit (U) and F represents a set of features used to predict S. • P(S|W,F), where W represents the information contained in the word sequence associated with U. • Direct Modeling of Target Classes • Dependence between prosodic features and target classes is modeled directly in a statistical classifier. • No use of intermediate phonological categories • Hand annotation not needed
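As a rough illustration of direct modeling, P(S|F) can be estimated straight from labeled examples with no intermediate phonological categories. The sketch below uses a single binned feature and relative frequencies; the bin thresholds, the class names, and the toy data are all assumptions made for illustration and are not taken from the readings.

```python
from collections import Counter, defaultdict


def bin_pause(pause_sec):
    """Map a pause duration in seconds to a coarse bin (thresholds are illustrative)."""
    if pause_sec < 0.1:
        return "short"
    if pause_sec < 0.5:
        return "medium"
    return "long"


def estimate_posteriors(examples):
    """Estimate P(S | F) by relative frequency, where F is the binned pause."""
    counts = defaultdict(Counter)
    for pause_sec, label in examples:
        counts[bin_pause(pause_sec)][label] += 1
    posteriors = {}
    for feature_bin, label_counts in counts.items():
        total = sum(label_counts.values())
        posteriors[feature_bin] = {s: n / total for s, n in label_counts.items()}
    return posteriors


# Toy data, invented for illustration: (pause after a word in seconds, class label).
training = [
    (0.02, "no_boundary"), (0.05, "no_boundary"), (0.03, "no_boundary"),
    (0.30, "no_boundary"), (0.35, "boundary"),
    (0.80, "boundary"), (1.20, "boundary"), (0.60, "boundary"),
]
posteriors = estimate_posteriors(training)
for feature_bin in ("short", "medium", "long"):
    print(feature_bin, posteriors[feature_bin])
```

A real system would use many features and a trained classifier rather than a lookup table, but the quantity being estimated is the same.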

Approaches to Prosodic Modeling (cont.) • Prosodic Features • Features extracted from phone-level forced alignment of transcripts • Features include: pause duration, measures of lengthening, speaking rate • Postprocessing regularizes pitch-based features • Prosodic Models • Decision trees used as classifiers • Problems associated with decision trees • Greediness • Highly skewed class sizes
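The sketch below shows how two of the listed features might be computed from time-aligned output, and one common remedy for skewed class sizes: downsampling so that every class is equally frequent before training. The tuple layout of the alignment, the function names, and the toy data are assumptions for illustration; an actual aligner's output format will differ.

```python
import random


def pause_after_each_word(words):
    """words: list of (word, start_sec, end_sec) in time order.
    Returns the silence between each word and the next (0.0 after the last word)."""
    pauses = []
    for i, (_, _, end) in enumerate(words):
        if i + 1 < len(words):
            next_start = words[i + 1][1]
            pauses.append(max(0.0, next_start - end))
        else:
            pauses.append(0.0)
    return pauses


def speaking_rate(phones):
    """phones: list of (phone, start_sec, end_sec).
    Returns phones per second of speech, with pauses excluded."""
    speech_time = sum(end - start for _, start, end in phones)
    return len(phones) / speech_time if speech_time > 0 else 0.0


def downsample_to_equal_priors(examples, seed=0):
    """examples: list of (features, label).
    Randomly drops examples until every class is as frequent as the rarest one."""
    rng = random.Random(seed)
    by_label = {}
    for example in examples:
        by_label.setdefault(example[1], []).append(example)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Toy alignment, invented for illustration.
words = [("so", 0.00, 0.20), ("anyway", 0.25, 0.70), ("yes", 1.30, 1.60)]
phones = [("s", 0.0, 0.1), ("ow", 0.1, 0.2)]
pauses = pause_after_each_word(words)
rate = speaking_rate(phones)

# Eight non-boundary examples and two boundary examples: a skewed training set.
skewed = [((0.05,), "no_boundary")] * 8 + [((0.60,), "boundary")] * 2
balanced = downsample_to_equal_priors(skewed)
print(pauses, rate, len(balanced))
```

Downsampling discards data, so a classifier trained this way sees equal class priors; the true priors have to be reintroduced elsewhere, for example by the language model.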

Approaches to Prosodic Modeling (cont.) • Lexical Models • Target classes are cued by both lexical and prosodic information. • Uses statistical language models from speech recognition • P(S|W) used to predict possible classes • Model combination • Posterior interpolation: compute P(S|F,W) via the prosodic model and P(S|W) via the language model, and form a linear combination of the two. • Posteriors as features: compute P(S|W) and use the language model posterior estimates as an additional feature in the prosodic classifier. • HMM-based integration: compute the likelihoods P(F|S,W) from the prosody model and use them as observation likelihoods in a hidden Markov model derived from the language model.
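Two of the combination schemes above can be sketched in a few lines. The first function forms a weighted linear combination of the two posteriors; the second runs Viterbi decoding over a small HMM whose transitions stand in for the language model and whose observation scores stand in for the prosodic likelihoods P(F|S,W). This is a simplification made for illustration: the states here are bare class labels, and every probability and the interpolation weight are invented rather than taken from the readings.

```python
import math


def interpolate_posteriors(p_prosody, p_lm, weight):
    """Linear combination of two posterior distributions over the same classes."""
    return {s: weight * p_prosody[s] + (1.0 - weight) * p_lm[s] for s in p_prosody}


def viterbi(states, log_start, log_trans, log_obs):
    """Return the most likely state sequence.
    log_start[s]:    log probability of starting in state s
    log_trans[r][s]: log probability of moving from r to s (language model role)
    log_obs[t][s]:   log likelihood of the prosodic features at position t given s
    """
    best = [{s: log_start[s] + log_obs[0][s] for s in states}]
    back = []
    for t in range(1, len(log_obs)):
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + log_trans[r][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_obs[t][s]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))


# Posterior interpolation with an illustrative weight of 0.5.
combined = interpolate_posteriors(
    {"boundary": 0.8, "no_boundary": 0.2},  # stands in for P(S | F, W)
    {"boundary": 0.4, "no_boundary": 0.6},  # stands in for P(S | W)
    weight=0.5,
)

# HMM-based integration on a three-word toy sequence.
states = ["no_boundary", "boundary"]
log_start = {s: math.log(0.5) for s in states}
log_trans = {
    "no_boundary": {"no_boundary": math.log(0.8), "boundary": math.log(0.2)},
    "boundary": {"no_boundary": math.log(0.9), "boundary": math.log(0.1)},
}
log_obs = [
    {"no_boundary": math.log(0.9), "boundary": math.log(0.1)},
    {"no_boundary": math.log(0.1), "boundary": math.log(0.9)},
    {"no_boundary": math.log(0.7), "boundary": math.log(0.3)},
]
path = viterbi(states, log_start, log_trans, log_obs)
print(combined, path)
```

In practice the interpolation weight would be tuned on held-out data rather than fixed by hand.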

Applications • Sentence segmentation and disfluency detection • Topic segmentation in Broadcast News

Applications (cont.) • Dialog act labeling in conversational speech • Word recognition in conversational speech • Not optimized for speech recognition • Has had some success, but is still far from perfect

Readings • Primary: “Prosody Modeling for Automatic Speech Recognition and Understanding”, Shriberg & Stolcke, 2002 • Secondary: “Turn-taking cues in task-oriented dialogue”, Gravano & Hirschberg, 2011 • Secondary: “Prosody-Based Automatic Detection of Annoyance and Frustration in Human-Computer Dialog”, Ang et al., 2002 • Additional source: “Speech and Language Processing”, Daniel Jurafsky & James H. Martin, 2008, pp. 262-269

Questions?
