This lecture explores the role of prosody in communication, focusing on what information is conveyed by intonation patterns, including whether a statement is a question and the emotional state of the speaker. It discusses dialogue structure, the dynamics of interaction, and the need for robust labeling schemes like ToBI for analyzing speech patterns. Current trends in corpus-based research, machine learning applications, and the investigation of spontaneous speech phenomena are highlighted. Practical applications for systems such as speech recognition and synthesis (TTS) are also examined.
Lecture 22 Intonation and Discourse CS 4705
What does prosody convey? • In general, information about: • What the speaker is trying to convey • Is this a statement or a question? • The speaker state • Is the speaker getting angry, frustrated? • In dialogue, information about: • The structure of the dialogue • Is the user or the system trying to start a new topic? • Is the speaker talking about given or new information? • The state of the interaction: • Is the user having trouble being understood? • Is the user having trouble understanding the system?
Current Trends • New description schemes (e.g. ToBI) • Corpus-based research and machine learning • Emphasis on evaluation of algorithms and systems (NLE ‘00 special issue) • Investigation of spontaneous speech phenomena and variation in speaking style • Applications to CTS, ASR and SDS
Corpora • Public and semi-public databases • ATIS, SwitchBoard, Call Home, Meetings (NIST/DARPA/LDC) • TRAINS/TRIPS (U. Rochester), FM Radio (BU), BDC (Harvard, AT&T) • Private collections • Acquired for speech or dialogue research (August, KTH; Voicemail, AT&T, IBM) • Meetings, call centers, operator services, focus group collections • The Web • Newscasts, radio
To(nes and)B(reak)I(ndices) • Developed by prosody researchers in four meetings over 1991-94 • Goals: • devise a common labeling scheme for Standard American English that is robust and reliable • promote collection of large, prosodically labeled, shareable corpora • ToBI standards have also been proposed for Japanese, German, Italian, Spanish, British and Australian English, among others
Minimal ToBI transcription: • recording of speech • f0 contour • ToBI tiers: • orthographic tier: words • break-index tier: degrees of junction (Price et al ‘89) • tonal tier: pitch accents, phrase accents, boundary tones (Pierrehumbert ‘80) • miscellaneous tier: disfluencies, non-speech sounds, etc.
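One way to picture the four tiers is as parallel lists of time-stamped labels. The sketch below is only an illustration: the class and field names (`Label`, `ToBITranscription`, `is_aligned`) and the timings are hypothetical and are not part of any ToBI tool or file format. It assumes that each word carries one break index.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Label:
    time: float  # seconds from the start of the recording (assumed unit)
    text: str    # a word, a tone such as "H*", a break index, or a note


@dataclass
class ToBITranscription:
    words: List[Label] = field(default_factory=list)   # orthographic tier
    breaks: List[Label] = field(default_factory=list)  # break-index tier
    tones: List[Label] = field(default_factory=list)   # tonal tier
    misc: List[Label] = field(default_factory=list)    # miscellaneous tier

    def is_aligned(self) -> bool:
        """Check the assumption that every word has one break index."""
        return len(self.words) == len(self.breaks)


# Invented timings for "Take the bus", for illustration only.
t = ToBITranscription()
for time, word, brk in [(0.31, "take", "1"), (0.45, "the", "1"), (0.92, "bus", "4")]:
    t.words.append(Label(time, word))
    t.breaks.append(Label(time, brk))
t.tones.append(Label(0.80, "H*"))     # pitch accent
t.tones.append(Label(0.92, "L-L%"))   # phrase accent plus boundary tone
```

The recording and f0 contour would live alongside this structure as audio and pitch-track files; they are left out of the sketch.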
Online training material, available at: • http://www.ling.ohio-state.edu/phonetics/ToBI/ • Evaluation • Good inter-labeler reliability for expert and naive labelers: 88% agreement on presence/absence of tonal category, 81% agreement on category label, 91% agreement on break indices to within 1 level (Silverman et al. ‘92, Pitrelli et al. ‘94)
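Agreement figures of this kind can be computed by comparing every pair of labelers on every word. The following is a minimal sketch under that assumption; the function name `pairwise_agreement` and the break-index data are invented for illustration, and the cited studies may have differed in procedural detail.

```python
from itertools import combinations


def pairwise_agreement(labelings, same=lambda a, b: a == b):
    """Fraction of (labeler pair, word) comparisons that count as agreeing."""
    agree = total = 0
    for a, b in combinations(labelings, 2):
        for x, y in zip(a, b):
            agree += bool(same(x, y))
            total += 1
    return agree / total


# Three hypothetical labelers assigning break indices to the same five words.
breaks = [
    [1, 1, 3, 1, 4],
    [1, 1, 4, 1, 4],
    [1, 2, 3, 1, 4],
]

exact = pairwise_agreement(breaks)
within_one = pairwise_agreement(breaks, same=lambda a, b: abs(a - b) <= 1)
```

Passing a looser `same` function is what turns exact agreement into agreement "to within 1 level", the criterion quoted for break indices.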
Pitch Accent/Prominence in ToBI • Which items are made intonationally prominent and how? • Accent type: • H* simple high (declarative) • L* simple low (ynq) • L*+H scooped, late rise (uncertainty/ incredulity) • L+H* early rise to stress (contrastive focus) • H+!H* fall onto stress (implied familiarity)
Downstepped accents: • !H*, L+!H*, L*+!H • Degree of prominence: • within a phrase: HiF0 • across phrases
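The accent labels above share a regular shape: an optional leading tone, a starred tone, an optional trailing tone, and an optional "!" marking downstep. A small parser makes that structure explicit. This is a deliberately permissive sketch, not an official ToBI validator: the pattern and the name `parse_accent` are assumptions, and it accepts some tone combinations that are not in the inventory listed here.

```python
import re

# Optional leading tone, optional downstep, starred tone,
# then an optional (possibly downstepped) trailing tone.
ACCENT = re.compile(
    r"^(?:(?P<lead>[HL])\+)?(?P<ds1>!)?(?P<star>[HL])\*"
    r"(?:\+(?P<ds2>!)?(?P<trail>[HL]))?$"
)


def parse_accent(label):
    """Split a pitch accent label into its component tones."""
    m = ACCENT.match(label)
    if m is None:
        raise ValueError(f"not a pitch accent label: {label!r}")
    return {
        "starred": m.group("star"),
        "leading": m.group("lead"),
        "trailing": m.group("trail"),
        "downstepped": bool(m.group("ds1") or m.group("ds2")),
    }


INVENTORY = ["H*", "L*", "L*+H", "L+H*", "H+!H*", "!H*", "L+!H*", "L*+!H"]
parsed = {label: parse_accent(label) for label in INVENTORY}
```

Phrase accents (H-, L-) and boundary tones (H%, L%) are rejected by this pattern, since they are not pitch accents.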
Functions of Pitch Accent • Given/new information • S: Do you need a return ticket? • U: No, thanks, I don’t need a return. • Contrast (narrow focus) • U: No, thanks, I don’t need a RETURN…. (I need a time schedule, receipt,…) • Disambiguation of discourse markers • S: Now let me get you the train information. • U: Okay (thanks) vs. Okay….(but I really want…)
Predicting Accent: Is it accented or not? • Applications: TTS and CTS • Corpora: read and spontaneous speech • Features: POS window of 3, sentence position, position within NP, # of syllables, position in complex nominal, inferred given/new status, inferred focus, mutual information • Results: 75-85% correct, depending on genre
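A few of the listed features can be extracted directly from a POS-tagged sentence. The sketch below covers only the POS window of 3, sentence position, syllable count, and a crude stand-in for given/new status (whether the word was already mentioned). The function name, the feature names, and the given/new heuristic are assumptions for illustration; POS tags and syllable counts are taken as inputs rather than computed.

```python
def accent_features(tagged, syllables, prior_words=frozenset()):
    """Build one feature dict per word for an accent classifier.

    tagged      -- list of (word, POS tag) pairs for one sentence
    syllables   -- syllable count per word, same length as tagged
    prior_words -- lowercased words already mentioned in the discourse
    """
    n = len(tagged)
    feats = []
    for i, (word, pos) in enumerate(tagged):
        feats.append({
            "word": word,
            "pos_prev": tagged[i - 1][1] if i > 0 else "<s>",
            "pos": pos,
            "pos_next": tagged[i + 1][1] if i < n - 1 else "</s>",
            "position": i / (n - 1) if n > 1 else 0.0,  # 0.0 first, 1.0 last
            "n_syllables": syllables[i],
            "given": word.lower() in prior_words,       # crude given/new proxy
        })
    return feats


tagged = [("Do", "VBP"), ("you", "PRP"), ("need", "VB"),
          ("a", "DT"), ("return", "NN"), ("ticket", "NN")]
feats = accent_features(tagged, [1, 1, 1, 1, 2, 2])
```

Each dict could then be passed to any standard classifier trained on words labeled as accented or not. Features such as position within NP, inferred focus, and mutual information need a parser or corpus statistics and are not sketched here.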
Prosodic Phrasing in ToBI • ‘Levels’ of phrasing: • intermediate phrase: one or more pitch accents plus a phrase accent (H- or L- ) • intonational phrase: 1 or more intermediate phrases + boundary tone (H% or L% ) • ToBI break-index tier • 0 no word boundary • 1 word boundary • 2 strong juncture with no tonal markings • 3 intermediate phrase boundary • 4 intonational phrase boundary
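Because the break-index levels are nested, the same tier yields both levels of phrasing: splitting at indices of 3 or above gives intermediate phrases, and splitting at 4 gives intonational phrases. The sketch below assumes one break index per word, describing the juncture after that word; the function name `split_phrases` and the sample break indices are invented for illustration.

```python
def split_phrases(words, breaks, level):
    """Group words into phrases, closing a phrase after any word
    whose following break index is at least `level`."""
    phrases, current = [], []
    for word, brk in zip(words, breaks):
        current.append(word)
        if brk >= level:
            phrases.append(current)
            current = []
    if current:  # trailing words with no closing boundary
        phrases.append(current)
    return phrases


words = "you should buy the ticket with the discount coupon".split()
breaks = [1, 1, 1, 1, 3, 1, 1, 1, 4]  # hypothetical labeling

intermediate = split_phrases(words, breaks, level=3)
intonational = split_phrases(words, breaks, level=4)
```

With this labeling the utterance forms one intonational phrase containing two intermediate phrases, split after "ticket".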
Functions of Phrasing • Disambiguates syntactic constructions, e.g. PP attachment, restrictive/non-restrictive relative clause: • S: You should buy the ticket with the discount coupon. • S: The itinerary which I faxed includes deluxe accommodations. • Disambiguates scope ambiguities, e.g. Negation: • S: You aren’t booked through Rome because of the fare. • Or modifier scope: • S: This fare is restricted to retired politicians and civil servants.
Predicting Phrase Boundaries • Applications: TTS, CTS, ASR • Corpora: AP news, Penn Treebank, ATIS • Features: sentence position, sentence length, POS window of 4, location of previous predicted boundary, mutual information, constituent information, dependency structure • Results: 96% correct
Contours: Accent + Phrasing • What do intonational contours ‘mean’ (Ladd ‘80, Bolinger ‘89)? • Speech acts (statements, questions, requests) S: That’ll be credit card? (L* H- H%) • Propositional attitude (uncertainty, incredulity) S: You’d like an evening flight. (L*+H L- H%) • Speaker affect (anger, happiness, love) U: I said four SEVEN one! (L+H* L- L%) • “Personality” S: Welcome to the Sunshine Travel System.
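The tunes in parentheses follow the pattern of pitch accent(s), then a phrase accent, then a boundary tone. A short sketch can split such a string into those parts. It assumes the labels are separated by spaces and that the tune ends a single intonational phrase; the name `parse_tune` is hypothetical, and the code says nothing about what a contour means.

```python
PHRASE_ACCENTS = {"H-", "L-"}
BOUNDARY_TONES = {"H%", "L%"}


def parse_tune(tune):
    """Split a tune such as 'L* H- H%' into its three components."""
    parts = tune.split()
    if len(parts) < 3:
        raise ValueError(f"need accent(s), phrase accent, boundary tone: {tune!r}")
    *accents, phrase_accent, boundary = parts
    if phrase_accent not in PHRASE_ACCENTS or boundary not in BOUNDARY_TONES:
        raise ValueError(f"unexpected phrase accent or boundary tone in {tune!r}")
    return {
        "pitch_accents": accents,
        "phrase_accent": phrase_accent,
        "boundary_tone": boundary,
    }


tunes = {t: parse_tune(t) for t in ["L* H- H%", "L*+H L- H%", "L+H* L- L%"]}
```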
Pitch Range and Timing • Level of speaker engagement • S: Welcome to InfoTravel. How may I help you? • Contour interpretation • S: You can take the L*+H bus from Malpensa to Rome L-H%. • U: Take the bus. vs. Take the bus! • Discourse/topic structure • Topic beginnings have higher pitch range, faster, preceded by longer pauses • Endings the opposite
Prosody and Speaker Emotion • What makes an utterance sound angry? Sad? • How much comes from the lexical information? • How much from the acoustic/prosodic? • Does all anger, e.g., sound the same? • Cahn ‘88 (examples)
Applications • Text-to-Speech and Concept-to-Speech generation: improve naturalness • Speech Recognition: identify suprasegmental meaning • Spoken Dialogue Systems: understand when people are confused, angry • Audio Browsing: format corpora for browsing and search
Challenges • We don’t really know what most contours ‘mean’ • Our accent prediction needs to be more sensitive to better models of given/new status, focus, and grammatical function • Our phrasing prediction needs better information about e.g. attachment • We don’t know much about emotional speech or ‘personality’, both critical to applications