Srinivas Bangalore AT&T Research srini@research.att.com

Spoken Dialog Systems

Outline
• Components of a spoken dialog system
• Automatic Speech Recognition
• Spoken Language Understanding
• Dialog Management
• Language Generation
• Text-to-speech synthesis
• Evaluation issues in spoken dialog systems

Spoken dialog system architecture
(Figure: a processing loop driven by Data)
• Automatic Speech Recognition (ASR): customer’s request --> words spoken (“I dialed a wrong number”)
• Spoken Language Understanding (SLU): words --> meaning (“Billing credit”)
• Dialog Management: meaning --> what’s next? (“Determine correct number”)
• Language Generation / Text-to-Speech Synthesis (LG/TTS): “What number did you want to call?”

ASR: The Noisy Channel Model
(Figure: Source --> Noisy Channel --> Decoder)
• Input to channel: spoken sentence s
• Output from channel: an observation O
• Decoding task: find $s' = \arg\max_{s} P(s \mid O)$
• Using Bayes’ rule: $s' = \arg\max_{s} \frac{P(O \mid s)\, P(s)}{P(O)}$
• And since P(O) does not change across hypotheses s: $s' = \arg\max_{s} P(O \mid s)\, P(s)$
• P(O|s) is the observation likelihood, or Acoustic Model, and P(s) is the prior, or Language Model
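The decoding rule above can be illustrated with a toy example. This is only a sketch: the three candidate sentences and their log-scores are made-up numbers, and a real recognizer searches a very large hypothesis space rather than a fixed list.

```python
import math

# Hypothetical scores for three candidate sentences (illustrative numbers only).
# ACOUSTIC_LOGP[s] stands in for log P(O|s), LM_LOGP[s] for log P(s).
ACOUSTIC_LOGP = {
    "recognize speech": math.log(0.30),
    "wreck a nice beach": math.log(0.40),
    "wreck an ice peach": math.log(0.30),
}
LM_LOGP = {
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.0001),
    "wreck an ice peach": math.log(0.00001),
}


def decode(candidates, acoustic_logp, lm_logp):
    """Return the s maximizing P(O|s) P(s), computed as a sum of log-scores."""
    return max(candidates, key=lambda s: acoustic_logp[s] + lm_logp[s])


candidates = list(ACOUSTIC_LOGP)
best = decode(candidates, ACOUSTIC_LOGP, LM_LOGP)
acoustic_only = max(candidates, key=ACOUSTIC_LOGP.get)
print("acoustic model alone:", acoustic_only)
print("acoustic model + language model:", best)
```

With these numbers the acoustic model alone prefers a different hypothesis than the combined score, which is the reason the prior P(s) is kept in the product.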

What do we need to build and use an ASR system?
• Corpora for training and testing of components
• Feature extraction component
• Acoustic Model: maps acoustic features into phones
  • Waveform is discretized into a feature vector
  • Phones are assumed to have a three-state structure
  • HMM training
• Pronunciation Model: maps phone sequences into words
• Language Model: ranks word sequences according to their likelihood in a language
• Algorithms to search hypothesis space efficiently
• Components can be represented as finite-state automata/transducers
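To make the Language Model bullet concrete, here is a minimal sketch of a bigram model that ranks word sequences. The three training sentences are invented, add-one smoothing is an assumption chosen for brevity, and the helper names (`train_bigram`, `log_prob`, `perplexity`) are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts over a tiny corpus."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Add-one smoothed log P(sentence) under the bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
        for prev, cur in zip(tokens, tokens[1:])
    )


def perplexity(sentence, model):
    """Per-token perplexity; the end-of-sentence token counts as a prediction."""
    n_predictions = len(sentence.split()) + 1
    return math.exp(-log_prob(sentence, model) / n_predictions)


model = train_bigram([
    "i want a train to chicago",
    "i want to go to boston",
    "a train from baltimore to chicago",
])
for candidate in ["i want a train", "train a want i"]:
    print(candidate, round(perplexity(candidate, model), 2))
```

The word order seen in training gets the higher probability (lower perplexity), which is the ranking behavior the slide describes.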

Varieties of Speech Recognition
Mode             Isolated words --> continuous
Style            Read, prepared, spontaneous
Enrollment       Speaker-dependent or independent
Vocabulary size  <20 --> 5K --> 60K --> ~1M
Language Model   Finite state, n-grams, CFGs, CSGs
Perplexity       <10 --> >100
SNR              >30 dB (high) --> <10 dB (low)
Input device     Telephone, microphones

Challenges for Transcription
• Robustness to channel characteristics and noise
• Portability to new applications
• Adaptation: to speakers, to environments
• LMs: simple n-grams need help
• Confidence measures
• Out-of-vocabulary words
• New speaking styles/genres
• New applications

Spoken Language Understanding
• Transforming the recognized words to a semantic representation.
• What’s “semantic representation”?
  • Depends on the application
• Call routing application
  • Each utterance is assigned a label (the destination to route the call to)
• Airline reservation: each utterance has an attribute-value representation
  “I want to go from Boston to Baltimore on September 29”
  Domain concept   Value
  source city      Boston
  target city      Baltimore
  travel date      September 29
• Problem solving: more fine-grained representation for detailed reasoning
  • Parsing of a speech utterance
  • Transduce the parse into a meaning representation
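The attribute-value representation for the airline example can be sketched with a few hand-written patterns. This is an illustration only: the regular expressions and the `understand` helper are assumptions, they rely on capitalized city names (which raw ASR output usually lacks), and deployed SLU components are typically trained from data.

```python
import re

# Hypothetical patterns for the three domain concepts on the slide.
CITY = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"
MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")
PATTERNS = {
    "source city": re.compile(r"\bfrom " + CITY),
    "target city": re.compile(r"\bto " + CITY),
    "travel date": re.compile(r"\bon (" + MONTH + r" \d{1,2})"),
}


def understand(utterance):
    """Map an utterance to an attribute-value frame; unmatched slots are omitted."""
    frame = {}
    for slot, pattern in PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            frame[slot] = match.group(1)
    return frame


frame = understand("I want to go from Boston to Baltimore on September 29")
print(frame)
```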

How May I Help You? (HMIHY)
• Prompt is “AT&T. How may I help you?”
• User responds with totally unconstrained fluent speech
• System recognizes the words and determines the meaning of the user’s speech, then routes the call
• Dialog technology enables task completion
(Figure: HMIHY routes calls to destinations such as Local, Account Balance, Calling Plans, Unrecognized Number, . . .)

What happens when ASR goes wrong?
S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
U: I'd like to find a train from Ch- to Chicago from Baltimore.
S: Sorry, I can't understand you. Please repeat your utterance.
U: I'd like to find a train from Baltimore to Chicago.
S: I heard you say “leave today around 5 thirty in the morning.” Which city do you want to go to?
U: I would like to leave Saturday at eight in the morning.

S: You can say the arrival city name, such as “New York City.”
U: Baltimore to Chicago.
S: You can say the arrival city name, such as “New York City.”
U: Baltimore to Chicago. (I hate it.)
U: Train from Baltimore ...
S: ...I heard you s...
U: to Chicago.... Hello?
S: You can say the name of your departure city, such as “New York City.”
U: Baltimore.

Dialog Manager
• What is it?
  • Coordinates the interaction with the user
  • Interfaces with the back-end
  • Produces system responses that move the conversation to an “end state”
• Types of dialog managers
  • Finite-state/form-filling models of dialog managers
    • Pre-specified structure to the flow of the dialog
    • Attempts to fill a form specified by the task
    • Rigid in its interaction; the developer needs to know the task structure
    • Many practical systems are developed on this model
  • Plan-based dialog managers
    • Dialog is about building a shared plan
    • No a priori structure in the flow of the dialog
    • No dialog about planning and execution; does not track the decision-making process
    • More suitable for problem-solving dialogs
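A form-filling dialog manager of the kind described above can be sketched in a few lines. The slot names, prompts and the `run_dialog` driver are hypothetical; the point is only that the flow is fixed in advance by the form, which is what makes this model simple and rigid.

```python
# Hypothetical form for a travel task; the order of slots fixes the dialog flow.
FORM_SLOTS = ["source city", "target city", "travel date"]
PROMPTS = {
    "source city": "Which city are you leaving from?",
    "target city": "Which city do you want to go to?",
    "travel date": "On what date do you want to travel?",
}


def next_action(form):
    """Ask for the first unfilled slot; report completion when the form is full."""
    for slot in FORM_SLOTS:
        if slot not in form:
            return ("ask", slot)
    return ("done", None)


def run_dialog(user_turns):
    """Drive the dialog with scripted user answers; return the form and transcript."""
    form, transcript = {}, []
    turns = iter(user_turns)
    while True:
        action, slot = next_action(form)
        if action == "done":
            transcript.append("S: Thank you, I have everything I need.")
            return form, transcript
        transcript.append("S: " + PROMPTS[slot])
        answer = next(turns, None)
        if answer is None:  # the user stopped responding
            return form, transcript
        transcript.append("U: " + answer)
        form[slot] = answer  # assumes the answer was understood correctly


form, transcript = run_dialog(["Boston", "Baltimore", "September 29"])
print("\n".join(transcript))
```

Note that the sketch takes every answer at face value; handling misrecognitions is what the confirmation and re-prompting strategies are for.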

Components of a Dialog Manager
(Figure: the Dialog Manager comprises)
• Context interpretation
• Dialog strategies (modules)
• Backend (database access)

Contextual Interpretation
(Figure: User Input + Current Context, State(t) --> Action + New Context, State(t+1))
• Interpreting the meaning of an utterance in the context of everything that has thus far…
  • Resolving referents (for example)
  • Requires the representation of the dialog history
• Determine the next action based on the current context and the input utterance
• Move to a new state of the dialog
• What should be the vocabulary of the next actions?

Dialog Strategies
• Communication Related (General)
  • Greeting/Closing: maintains social protocol at the beginning and end of an interaction
  • Contextual Help: provides the user with help during periods of misunderstanding by either the system or the user
  • Confirmation: verifies that the system understood correctly; strategy may differ depending on the SLU confidence measure
  • Re-prompting: used when the system expected input but did not receive any or did not understand what it received
• Communication Related (Task Specific)
  • Completion (continuation): elicits missing information from the user
  • Constraining (disambiguation): reduces the scope of the request when multiple items have been retrieved
  • Context Shift: allows the user to change the focus of the conversation and shift to a different request
  • Mixed-initiative: allows both users and the system to manage the dialog at the ‘appropriate’ points within a conversation
• Back-end Access Related
  • Relaxation: increases the scope of the request when no information has been retrieved
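One way to picture how a dialog manager picks among these strategies is a small decision function. This is a sketch under stated assumptions: the thresholds `high` and `low`, the argument names, and the order of the checks are all hypothetical and would be tuned per application.

```python
def choose_strategy(slu_confidence, understood, missing_slots, num_results,
                    high=0.8, low=0.4):
    """Pick one strategy name from the current turn's evidence.

    slu_confidence: SLU confidence in [0, 1] (thresholds are assumed values)
    understood:     False if no input was received or nothing was understood
    missing_slots:  information the task still needs from the user
    num_results:    number of items the back-end returned for the request
    """
    if not understood or slu_confidence < low:
        return "re-prompting"
    if slu_confidence < high:
        return "confirmation"
    if missing_slots:
        return "completion"
    if num_results == 0:
        return "relaxation"
    if num_results > 1:
        return "constraining"
    return "present result"


print(choose_strategy(0.6, True, [], 1))
print(choose_strategy(0.9, True, ["travel date"], 0))
print(choose_strategy(0.9, True, [], 0))
```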

Who should drive the conversation?
(Figure: a spectrum of dialog styles)
• Directed/Menu based (“Please say ...”): structured, brittle, longer interactions
• Mixed Initiative
• Open/“Normal” (“How may I help you?”): natural, robust, shorter interactions

A Dialogue System with NLG
(Figure labels: Communicative Goals; Semantic (?) Representation; Text String/Lattice; Prosody-Annotated Text String)
• Language generation is achieved in three steps (Reiter 1994)
  • Text planning: what to say
    • Transform communicative goals into a sequence of elementary communicative goals
  • Sentence planning: how to say it
    • Choose linguistic resources to express the elementary communicative goals
  • Surface realization: say it
    • Produce surface word order according to the grammar of the language
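The three generation steps can be sketched as a small pipeline. Everything here is a toy assumption: the goal format, the phrase table `SLOT_PHRASES` and the two sentence templates are invented, and real surface realizers work from a grammar rather than fixed strings.

```python
# Hypothetical phrase table: the "linguistic resources" of this toy domain.
SLOT_PHRASES = {"source city": "from {}", "target city": "to {}", "travel date": "on {}"}
SLOT_ORDER = ["source city", "target city", "travel date"]


def text_planning(goal):
    """What to say: split one communicative goal into elementary goals."""
    return [(goal["act"], slot, goal["slots"][slot])
            for slot in SLOT_ORDER if slot in goal["slots"]]


def sentence_planning(elementary_goals):
    """How to say it: choose a phrase per goal and aggregate into one sentence plan."""
    acts = {act for act, _, _ in elementary_goals}
    return {
        "mood": "question" if acts == {"confirm"} else "statement",
        "modifiers": [SLOT_PHRASES[slot].format(value)
                      for _, slot, value in elementary_goals],
    }


def surface_realization(plan):
    """Say it: linearize the sentence plan into a word string."""
    modifiers = " ".join(plan["modifiers"])
    if plan["mood"] == "question":
        return f"Do you want to travel {modifiers}?"
    return f"You are traveling {modifiers}."


def generate(goal):
    return surface_realization(sentence_planning(text_planning(goal)))


goal = {"act": "confirm", "slots": {"target city": "Baltimore", "source city": "Boston"}}
print(generate(goal))
```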


Designing Dialog Systems
• Early focus on users and task:
  • Understand the task and the user population
  • Study human-human dialog data for the task
• Build prototypes (Wizard-of-Oz):
  • Collect data with a person operating the system
  • Useful to understand the architecture and operation without building one
• Iterative design
  • Build models from the data collected during WOZ
  • Revise the user interface after analysis of dialog flow

Issues in making human-machine interactions more natural

Collagen: Theory and Implementation
• SharedPlan Discourse Theory (Grosz, Sidner, Kraus, Lochbaum 1974-1998)
  • Intentional: purposes, contributes
  • Attentional: focus spaces, focus stack
  • Linguistic: segments, lexical items
• Java Implementation
  • purpose tree
  • focus stack

Collagen: Discourse Segments and Purposes (Grosz, 1974)
(fixing an air compressor, E = expert, A = apprentice)
E: Replace the pump and belt please.
A: Ok, I found a belt in the back.
A: Is that where it should be?
A: [removes belt]
A: It’s done.
E: Now remove the pump.
…
E: First you have to remove the flywheel.
…
E: Now take the pump off the base plate.
A: Already did.
(Figure: segment purposes: replace pump and belt; replace belt; replace pump)

Discourse state representation (Grosz & Sidner, 1986)
E: Replace the pump and belt please.
A: Ok, I found a belt in the back.
A: Is that where it should be?
A: [removes belt]
A: It’s done.
(Figure: Focus Stack: replace belt (current focus space), replace pump and belt. Purpose Tree: replace pump and belt, with replace belt and replace pump below it.)

Discourse interpretation algorithm (Lochbaum, 1998)
The current (communication or manipulation) act either:
• starts a new segment/focus space (push)
• ends the current segment/focus space (pop)
• continues (contributes to) the current segment/... (add)
An act contributes to the purpose of a segment if it:
• directly achieves the purpose
• is a step in the plan for the purpose *
• identifies the recipe used to achieve the purpose
• identifies who should perform the purpose or a step in the plan
• identifies a parameter of the purpose or a step in the plan
* does not include recursive plan recognition (see later topic)
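The push/pop/add bookkeeping can be sketched with a focus stack and a purpose tree. This is not Collagen's implementation: the `DiscourseState` class is hypothetical, and the decision of which operation each act triggers is hard-coded here for the air-compressor example, whereas the real algorithm has to recognize it.

```python
class DiscourseState:
    """Toy discourse state: a focus stack plus a purpose tree."""

    def __init__(self):
        self.focus_stack = []  # purposes in focus; the last item is the current one
        self.children = {}     # purpose tree: purpose -> list of sub-purposes
        self.acts = {}         # purpose -> acts that contributed to it

    def push(self, purpose, act):
        """The act starts a new segment whose purpose is a child of the current one."""
        if self.focus_stack:
            self.children[self.focus_stack[-1]].append(purpose)
        self.children[purpose] = []
        self.acts[purpose] = [act]
        self.focus_stack.append(purpose)

    def add(self, act):
        """The act contributes to the current segment."""
        self.acts[self.focus_stack[-1]].append(act)

    def pop(self, act):
        """The act ends the current segment."""
        self.add(act)
        return self.focus_stack.pop()


state = DiscourseState()
state.push("replace pump and belt", "E: Replace the pump and belt please.")
state.push("replace belt", "A: Ok, I found a belt in the back.")
state.add("A: Is that where it should be?")
state.add("A: [removes belt]")
state.pop("A: It's done.")
state.push("replace pump", "E: Now remove the pump.")
print(state.focus_stack)
print(state.children)
```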

COLLAGEN
• Separation of task from dialog/discourse engine
• Recipes / Domain plans / Task tree
  • Full-blown HTN (Hierarchical Task Network)
    • Hierarchical
    • Preconditions (constraints)
    • Effects
    • Completion / failure
    • Live nodes
• Stack to keep track of focus and discourse structure
• Tree explicitly contains agent and user nodes

Learning the Structure of Task-driven Human-Human Dialogs ACL 2006 paper

Motivation
• Robustness is a crucial feature for speech and language processing systems.
  • Graceful degradation in output when faced with unexpected input.
• Hand-crafted models:
  • Provide rich representation of the phenomena being modeled
  • Extensive human engineering
  • Not robust models
• Data-driven models:
  • Helped by the availability of large amounts of speech and language data
  • Quick turn-around, once data is available
  • Provide robust and adaptable models
  • Capture the distribution of the phenomena being modeled
• Hybrid approaches:
  • Combine rich representations with robust modeling techniques

Current Paradigm for Dialog System Creation
(Figure labels: User Experience Expert; Transcribed Wizard of Oz Sessions; Annotation Guide; Transcribed Call Data; Labelers; Transcribed & Annotated Data; Models (ASR, NLU, DM); Dialog Application; Application Quality Assurance (QA); TEST)
• Wizard-of-Oz based creation
  • Human-machine dialogs are collected from a few scenarios
  • Some components (ASR, NLU) are trained using that data
  • Data may be used to guide the designers of dialog managers
  • Expensive to collect and repeat
• Reusable dialog modules
  • Creation of domain-independent dialog modules
  • Credit-card, telephone number, dollar amount etc.
• Reinforcement learning
  • Tuning dialog systems
• Proposal is to bootstrap all components of a dialog system from data.
  • Inducing the structure of dialogs

Motivation
• In current dialog systems: most of dialog management and response generation is hand-crafted.
  • Typically based on Wizard-of-Oz data collection
  • Brittle and rigid dialogs
• Call centers: vast repositories of human-human dialog data
  • Data-driven models for all components of dialog systems
  • Analysis could help build robust hand-crafted human-machine dialog systems
  • May not be exactly how humans interact with machines
    • Speech recognition, understanding and synthesis errors
  • But a better starting point than Wizard-of-Oz data
    • Larger set of scenarios and unscripted dialogs
• CHILD: Copying Human Interactions through Learning and Discovery

Structural Analysis of Dialogs
• Task oriented dialogs
  • Incremental creation of a shared plan
• Shared plan is a single tree
  • Dominance/precedence among tasks
  • Sequence of dialog acts
  • Inter-clausal relations
  • Predicate/argument relations among words
• As dialog proceeds, utterances are accommodated incrementally into this tree
(Figure: tree levels, top to bottom: Dialog; Task; Topic/Subtask; Dialog Act; Utterance/Clause; Predicate/Argument)
Allows for tight coupling of Understanding, Dialog Management and Response Generation

Sample Task-oriented Dialog: Catalog ordering domain
• thank you for calling XYZ catalog this is mary how may i help you
• yes i would like to place an order
• Yes one second please
• thank you
• can i have your home telephone number with the area code
• <number>
• your name
• uh mine's alice smith it's probably under ronald smith
• Okay you're still on Third Court
• Yes mm hmm
• do you have access to a email address
• yes I do
• would you like to receive email announcements for sales promotions and other events
• No thank you
• may we deliver this order to your home
• Yes please
• ……….
Some segments of the dialog are agent-initiated and some are user-initiated

Dialog Structure
(Figure: tree for the sample dialog)
• Task: Order Item
• Subtasks: Opening, Contact-info, Shipping-Address, ….
• Dialog acts: Hello, Ack, Request(MakeOrder), Ack
• Utterances: “thank you for calling XYZ catalog this is mary how may i help you”, “yes i would like to place an order”, “yes one second please”, “thank you”, “can i have your home telephone number with the area code”, “may we deliver this order to your home”, “yes please”, ….

Utterance Segmentation: An Example
• Clean up of ASR output:
  • Sentence boundaries
  • Restarts
  • Coordinating conjunctions
  • Filled pauses
  • Discourse markers
• Segment an utterance into clauses using a series of classifiers.

Dialog Act Annotation
• Domain-specific dialog act tagging
  • More specific than DAMSL
  • Utility for natural language generation
• 67 dialog act tags
  • Composed of types and subtypes

Dialog Act Labeling: Prediction Model
• Assignment of the best dialog act sequence T* given the sequence of utterances U.
• Probability is estimated using a Maximum Entropy (Maxent) distribution.
• Features:
  • Word trigrams and supertags from current and previous utterances
  • Supertags: encapsulate predicate-argument information (TAGs)
• Multiclass Maxent is encoded as (1-vs-other) binary maxents in LLAMA (Haffner 2006)
  • Speed in training and scalability to large output class vocabulary
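As an illustration of the word-trigram features, the sketch below builds a feature set for one utterance from its own trigrams and those of the preceding utterance. The feature naming (`cur:`, `prev1:`), the boundary padding and the `history` parameter are assumptions; supertag features are left out because they require a supertagger, and the Maxent classifier itself is not shown.

```python
def word_trigrams(utterance):
    """Trigrams over the utterance, padded so that short utterances still yield features."""
    tokens = ["<s>", "<s>"] + utterance.split() + ["</s>"]
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


def extract_features(utterances, index, history=1):
    """Binary features for utterances[index] from it and up to `history` previous utterances."""
    features = {"cur:" + gram for gram in word_trigrams(utterances[index])}
    for back in range(1, history + 1):
        if index - back >= 0:
            features.update(
                f"prev{back}:{gram}" for gram in word_trigrams(utterances[index - back])
            )
    return features


dialog = [
    "yes i would like to place an order",
    "yes one second please",
]
features = extract_features(dialog, 1)
print(sorted(features))
```

A classifier trained on such feature sets would then score each dialog act label for the utterance.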

Dialog Act Labeling: Experiments
• 2K utterances (20 dialogs)
  • Annotated with dialog act labels
  • 10-fold cross-validation experiment on dialogs
• Error rates across different corpora
  • SWBD-DAMSL: 42 tags
    • DAMSL’s 375 tags clustered (Jurafsky et al. 1998)
    • Previous best: 28% error (Jurafsky et al. 1998)
  • Maptask: 13 tags
    • Speaker+Move
    • E.g., giver-instruct; follower-instruct
    • Previous best: 42.8% error (Poesio and Mikheev 1998)
• Use of previous utterances decreases error rate
• Use of supertags decreases error rate
• Increasing data decreases error rate

Subtask Structure Prediction
(Figure: Order Placement task with subtasks: summary, closing, payment info, opening, contact info, order item, delivery info, shipping info)
• Two models for structure prediction
  • Chunk based model
    • Recovers precedence relations, not the dominance relations
  • Parse based model
    • Recovers both precedence and dominance relations
• Structure prediction models can be used as on-line dialog models
  • Left-to-right, incremental predictions

Subtask Structure Prediction: Chunk based model
(Figure: a subtask ST spanning utterances ui … uj is labeled ST_begin, ST_middle, ST_end; consecutive subtasks cover ui…uj, uj+1…uk, …)
• Labels span multiple utterances
  • Tag set enriched to include begin, middle and end of subtasks
  • Each utterance classified using this label set
• Subtask labeling is similar to dialog act labeling
  • Local disambiguation
• Global sequence constraint:
  • Precedence constraint between labels (represented as L(G))
    • begin < middle < end
  • Classifier output encoded as a lattice
  • Lattice composed with regular expression L(G)
  • Off-line model
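The effect of the global sequence constraint can be sketched with a small dynamic program over per-utterance classifier scores. Two caveats: this swaps in a Viterbi-style search for the lattice/finite-state composition named on the slide, and the exact well-formedness rule used here (each subtask is begin, then zero or more middle, then end) is an assumption, as are the scores.

```python
import math


def allowed(prev, cur):
    """Assumed well-formedness: every subtask is begin, middle*, end."""
    if prev is None or prev[1] == "end":
        return cur[1] == "begin"
    return cur[0] == prev[0] and cur[1] in ("middle", "end")


def decode(scores):
    """scores: one dict per utterance mapping (subtask, position) -> log-probability.
    Returns the best label sequence satisfying the constraint, or None."""
    best = {None: (0.0, [])}  # last label -> (total score, label path)
    for dist in scores:
        step = {}
        for cur, logp in dist.items():
            options = [(total + logp, path + [cur])
                       for prev, (total, path) in best.items() if allowed(prev, cur)]
            if options:
                step[cur] = max(options, key=lambda option: option[0])
        best = step
    finals = [value for label, value in best.items()
              if label is not None and label[1] == "end"]
    return max(finals, key=lambda option: option[0])[1] if finals else None


def logs(probabilities):
    return {label: math.log(p) for label, p in probabilities.items()}


# Hypothetical classifier output for four utterances.
scores = [
    logs({("opening", "begin"): 0.9, ("contact-info", "begin"): 0.1}),
    logs({("opening", "middle"): 0.5, ("opening", "end"): 0.4, ("contact-info", "begin"): 0.1}),
    logs({("contact-info", "begin"): 0.6, ("opening", "end"): 0.3, ("contact-info", "middle"): 0.1}),
    logs({("contact-info", "end"): 0.8, ("contact-info", "middle"): 0.2}),
]
local = [max(dist, key=dist.get) for dist in scores]
constrained = decode(scores)
print("local:      ", local)
print("constrained:", constrained)
```

Here the locally best labels form an ill-formed sequence (a middle followed directly by another subtask's begin), while the constrained search returns a well-formed one. The search needs the whole utterance sequence, which is why the slide calls this an off-line model.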

Composition of the output of the classifier with the well-formedness constraint
(Figure: the output of the classifier, encoded as a lattice, is composed with the well-formedness constraint network over begin, middle and end labels)

Subtask Structure Prediction: Parse based model
• Parse based model captures precedence and dominance relations among subtasks (Plan Tree)
• Parse based model:
  • Roark (2001) statistical parser:
    • Top-down incremental parser with bottom-up information
    • Retrained using the plan trees from the catalog domain
  • Parse k-best subtask sequences provided by the chunk model
  • Desired: parses for subtask lattices (not implemented)
(Figure: plan tree of subtasks ST over utterance spans ui…uj, uj+1…uk, …)

Subtask Structure Prediction: Data and Features
• Data
  • Utterances from 915 dialogs
  • Annotated with 18 domain-specific subtask labels
• Features used for classification
  • Based on utterance information

Subtask Structure Prediction: Experiments and Results
• Random split of 915 annotated dialogs
  • 864 training dialogs; 51 test dialogs
• 18 x 3 = 54 labels
• Observations:
  • Increasing utterance context decreases error rate
  • Integrating precedence/parse constraint decreases error rate over no constraint
  • Integrating predicted SWBD-DA tags increases the error rate
  • Parser constraint does not improve significantly over precedence constraint
• Global precedence constraint requires the complete subtask sequence
  • Useful for off-line dialog analysis and mining
  • Unrealistic for on-line dialog modeling

Conclusions
• Robustness in dialog management can be achieved using data-driven approaches
• Call centers: source of large task-oriented dialog corpora
  • Better than Wizard-of-Oz collections
    • Size as well as variations
• Models that predict the structure of task-oriented human-human dialogs
  • Dialog act labeling
  • Subtask structure prediction
• Exploiting global structure information improves labeling accuracy
  • Useful for off-line dialog processing (dialog data mining)
• On-going work: learning response generation from this data
• Issue: call center data are proprietary
  • Community needs to collect a large dialog corpus

Evaluating a Dialog System

How do we evaluate Dialog Systems?
• PARADISE framework (Walker et al ’00)
• “Performance” of a dialog system is affected both by what gets accomplished by the user and the dialog agent and how it gets accomplished
(Figure: Maximize Task Success; Minimize Costs, measured by Efficiency Measures and Qualitative Measures)

What metrics should we use?
• Efficiency of the Interaction: User Turns, System Turns, Elapsed Time
• Quality of the Interaction: ASR rejections, Time Out Prompts, Help Requests, Barge-Ins, Mean Recognition Score (concept accuracy), Cancellation Requests
• User Satisfaction
• Task Success: perceived completion, information extracted

User Satisfaction: Sum of Many Measures
Annie: A dialog system to retrieve email messages over the phone
• Was Annie easy to understand in this conversation? (TTS Performance)
• In this conversation, did Annie understand what you said? (ASR Performance)
• In this conversation, was it easy to find the message you wanted? (Task Ease)
• Was the pace of interaction with Annie appropriate in this conversation? (Interaction Pace)
• In this conversation, did you know what you could say at each point of the dialog? (User Expertise)
• How often was Annie sluggish and slow to reply to you in this conversation? (System Response)
• Did Annie work the way you expected her to in this conversation? (Expected Behavior)
• From your current experience with using Annie to get your email, do you think you'd use Annie regularly to access your mail when you are away from your desk? (Future Use)

Performance Model
• Weights trained for each independent factor via multivariate (linear) regression modeling: how much does each contribute to User Satisfaction?
  • Sat = W · F
• Result useful for system development
  • Making predictions about system modifications
  • Distinguishing ‘good’ dialogs from ‘bad’ dialogs
• But … can we also tell on-line when a dialog is ‘going wrong’?
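The regression step can be sketched with ordinary least squares in plain Python. The data is synthetic: the three factors, their values and the weights `TRUE_W` are invented so that the fit can be checked, and satisfaction is generated without noise, which real survey data would never be.

```python
def fit_weights(factors, satisfaction):
    """Ordinary least squares: solve (F^T F) w = F^T sat by Gaussian elimination."""
    n = len(factors[0])
    a = [[sum(row[i] * row[j] for row in factors) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * sat for row, sat in zip(factors, satisfaction)) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))  # partial pivoting
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    weights = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(a[i][j] * weights[j] for j in range(i + 1, n))
        weights[i] = (b[i] - tail) / a[i][i]
    return weights


def predict(weights, factor_row):
    """Sat = W . F for one dialog."""
    return sum(w * f for w, f in zip(weights, factor_row))


# Synthetic dialogs: [task success, scaled user turns, ASR rejections] (assumed factors).
FACTORS = [
    [1, 0.2, 0],
    [0, 1.0, 2],
    [1, 0.5, 1],
    [0, 0.1, 0],
    [1, 0.9, 3],
]
TRUE_W = [2.0, -0.5, -1.0]  # hypothetical weights used to generate the data
SATISFACTION = [predict(TRUE_W, row) for row in FACTORS]

weights = fit_weights(FACTORS, SATISFACTION)
print([round(w, 3) for w in weights])
```

The sign and size of each fitted weight indicate how much that factor contributes to predicted satisfaction, which is what makes the model useful for comparing system modifications.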

