Spoken Dialogue Systems
Julia Hirschberg
CS 4706
Today • Some Swedish examples • Controlling the dialogue flow • State prediction • Controlling lexical choice • Learning from human-human dialogue • User feedback • Evaluating systems
The Waxholm Project at KTH
• Tourist information for the Stockholm archipelago
  • time-tables, hotels, hostels, camping, and dining possibilities
• Mixed-initiative dialogue
• Speech recognition
• Multimodal synthesis
• Graphic information
  • pictures, maps, charts, and time-tables
• Demos at http://www.speech.kth.se/multimodal
The Waxholm system
[Screenshot of the system in use; the speech bubbles include exchanges such as:]
• User: I think I want to go to Waxholm • Is it possible to eat in Waxholm? • I am looking for boats to Waxholm • When do the evening boats depart? • I want to go tomorrow • Where can I find hotels? • Which hotels are in Waxholm? • Where is Waxholm? • Thank you
• System: From where do you want to go? • Which day of the week do you want to go? • There are lots of boats from Stockholm to Waxholm on a Friday. At what time do you want to go? • This is a table of the boats... • Information about the restaurants in Waxholm is shown in this table • Information about the hotels in Waxholm is shown in this table • Waxholm is shown on this map • Thank you too
Today • Some Swedish examples • Controlling the dialogue flow • State prediction • Controlling lexical choice • Learning from human-human dialogue • User feedback • Evaluating systems
Dialogue control - state prediction
• Dialogue grammar specified by a number of states
• Each state associated with an action (database search, system question, ...)
• The most probable state is determined from semantic features
• Transition probabilities from one state to another
• Dialogue control design tool with a graphical interface
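The state-prediction idea above can be sketched as a small state machine: each state carries an action, and the next state is the one with the highest transition probability. This is a minimal illustration with made-up state names and probabilities, not the actual KTH implementation.

```python
# Sketch of Waxholm-style dialogue state control.
# State names, actions, and probabilities below are illustrative.

ACTIONS = {
    "TIME_TABLE": "database search (boat time-table)",
    "SHOW_MAP": "display map",
    "ASK_DEPARTURE": "system question: 'From where do you want to go?'",
}

# Transition probabilities between states, as would be estimated from data.
TRANSITIONS = {
    "GREETING":   {"TIME_TABLE": 0.6, "SHOW_MAP": 0.3, "ASK_DEPARTURE": 0.1},
    "TIME_TABLE": {"ASK_DEPARTURE": 0.7, "SHOW_MAP": 0.2, "TIME_TABLE": 0.1},
}

def next_state(current: str) -> str:
    """Predict the most probable next state from the transition model."""
    options = TRANSITIONS[current]
    return max(options, key=options.get)

state = next_state("GREETING")
print(state, "->", ACTIONS[state])
```

In the full system the prediction is refined by semantic features of the user's utterance, as the topic-selection table on the next slides shows.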
Waxholm Topics
• TIME_TABLE. Task: get a time-table. Example: "När går båten?" (When does the boat leave?)
• SHOW_MAP. Task: get a chart or a map displayed. Example: "Var ligger Vaxholm?" (Where is Vaxholm located?)
• EXIST. Task: display lodging and dining possibilities. Example: "Var finns det vandrarhem?" (Where are there hostels?)
• OUT_OF_DOMAIN. Task: the subject is out of the domain. Example: "Kan jag boka rum?" (Can I book a room?)
• NO_UNDERSTANDING. Task: no understanding of user intentions. Example: "Jag heter Olle." (My name is Olle.)
• END_SCENARIO. Task: end a dialogue. Example: "Tack." (Thank you.)
Topic selection: examples of p(topic | feature)

Feature       TIME_TABLE  SHOW_MAP  FACILITY  NO_UNDERST.  OUT_OF_DOM.  END
OBJECT           .062       .312      .073       .091         .067      .091
QUEST-WHEN       .188       .031      .024       .091         .067      .091
QUEST-WHERE      .062       .688      .390       .091         .067      .091
FROM-PLACE       .250       .031      .024       .091         .067      .091
AT-PLACE         .062       .219      .293       .091         .067      .091
TIME             .312       .031      .024       .091         .067      .091
PLACE            .091       .200      .500       .091         .067      .091
OOD              .062       .031      .122       .091         .933      .091
END              .062       .031      .024       .091         .067      .909
HOTEL            .062       .031      .488       .091         .067      .091
HOSTEL           .062       .031      .122       .091         .067      .091
ISLAND           .333       .556      .062       .091         .067      .091
PORT             .125       .750      .244       .091         .067      .091
MOVE             .875       .031      .098       .091         .067      .091

The topic is chosen as argmax_i { p(t_i | F) }.
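A small sketch of the argmax_i p(t_i | F) selection, using a few rows of the table above. Treating the observed semantic features as independent and multiplying their per-feature topic probabilities is an assumption of this sketch, not necessarily the exact Waxholm combination rule.

```python
# Topic selection as argmax over combined feature/topic probabilities.
# Only a subset of the table's rows is reproduced here.

TOPICS = ["TIME_TABLE", "SHOW_MAP", "FACILITY",
          "NO_UNDERSTANDING", "OUT_OF_DOMAIN", "END"]

P = {  # p(topic | feature), one row per semantic feature
    "QUEST-WHEN": [.188, .031, .024, .091, .067, .091],
    "MOVE":       [.875, .031, .098, .091, .067, .091],
    "PORT":       [.125, .750, .244, .091, .067, .091],
    "HOTEL":      [.062, .031, .488, .091, .067, .091],
}

def select_topic(features):
    """Return the topic with the highest combined score (naive product)."""
    scores = [1.0] * len(TOPICS)
    for f in features:
        for i, p in enumerate(P[f]):
            scores[i] *= p
    return TOPICS[scores.index(max(scores))]

# "När går båten?" contributes QUEST-WHEN and MOVE -> time-table topic.
print(select_topic(["QUEST-WHEN", "MOVE"]))
```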
Topic prediction results
[Bar chart: % topic prediction errors, two series ("All" vs. "no understanding" excluded), for three conditions: raw data (12.7 / 12.9%), extra-linguistic sounds removed (8.8 / 8.5%), complete parse (3.1 / 2.9%)]
Today • Some Swedish examples • Controlling the dialogue flow • State prediction • Controlling lexical choice • Learning from human-human dialogue • User feedback • Evaluating systems
User answers to questions
The answers to the question "What weekday do you want to go?" (Vilken veckodag vill du åka?):
• 22% Friday (fredag)
• 11% I want to go on Friday (jag vill åka på fredag)
• 11% I want to go today (jag vill åka idag)
• 7% on Friday (på fredag)
• 6% I want to go on a Friday (jag vill åka en fredag)
• - are there any hotels in Vaxholm? (finns det några hotell i Vaxholm?)
Examples of questions and answers
Question: "Hur ofta åker/reser du utomlands på semestern?" (How often do you go/travel abroad on vacation?)
Answers:
• jag reser en gång om året utomlands (I travel abroad once a year)
• jag reser inte ofta utomlands på semester, det blir mera i arbetet (I don't travel abroad on vacation often; it's more for work)
• jag reser utomlands på semestern vartannat år (I travel abroad on vacation every other year)
• jag reser utomlands en gång per semester (I travel abroad once per vacation)
• jag reser utomlands på semester ungefär en gång per år (I travel abroad on vacation about once a year)
• jag brukar resa utomlands på semestern åtminstone en gång i året (I usually travel abroad on vacation at least once a year)
• en gång per år kanske (once a year, maybe)
• en gång vart annat år (once every other year)
• varje år (every year)
• vart tredje år ungefär (about every third year)
• nu för tiden inte så ofta (these days, not that often)
• varje år brukar jag åka utomlands (every year I usually go abroad)
• jag åker en gång om året kanske (I go once a year, maybe)
• jag åker ganska sällan utomlands på semester (I quite rarely go abroad on vacation)
• jag åker nästan alltid utomlands under min semester (I almost always go abroad during my vacation)
• jag åker ungefär 2 gånger per år utomlands på semester (I go abroad on vacation about twice a year)
• jag åker utomlands nästan varje år (I go abroad almost every year)
• jag åker utomlands på semestern varje år (I go abroad on vacation every year)
• jag åker utomlands ungefär en gång om året (I go abroad about once a year)
• jag är nästan aldrig utomlands (I am almost never abroad)
• en eller två gånger om året (once or twice a year)
• en gång per semester (once per vacation)
• kanske en gång per år (maybe once a year)
• ungefär en gång per år (about once a year)
• åtminstone en gång om året (at least once a year)
• nästan aldrig (almost never)
Results
[Pie chart: reuse 52%, other 24%, ellipsis 18%, no reuse 4%, no answer 2%]
Today • Some Swedish examples • Controlling the dialogue flow • State prediction • Controlling lexical choice • Learning from human-human dialogue • User feedback • Evaluating systems
The August system
[Screenshot: example exchanges with the animated agent, including:]
• User: What is your name? • When were you born? • What do you do for a living? • How many people live in Stockholm? • Do you like it here? • Thank you! • Good bye!
• August: I call myself Strindberg, but I don't really have a surname • Strindberg was born in 1849 • Strindberg was married three times! • I come from the department of Speech, Music and Hearing • Over a million people live in the Stockholm area • The information is shown on the map • I can answer questions about Strindberg, KTH and Stockholm • People who live in glass houses should not throw stones • Yes, that was a smart thing to say! • The Royal Institute of Technology! • You are welcome! • Perhaps we will meet soon again! • Yes, it might be that we will!
Evidence from Human Performance • Users provide explicit positive and negative feedback • Corpus-based vs. laboratory experiments – do these tell us different things?
Feedback and ‘Grounding’: Bell & Gustafson ’00
• Positive and negative feedback
• Previous corpora: August system
  • 18% of users gave positive or negative feedback in a subcorpus
  • Push-to-talk
• Corpus: Adapt system
  • 50 dialogues, 33 subjects, 1845 utterances
  • Feedback utterances labeled with:
    • positive or negative
    • explicit or implicit
    • attention/attitude
• Results:
  • 18% of utterances contained feedback
  • 94% of users provided feedback
  • 65% positive; 2/3 explicit; equal amounts of attention vs. attitude
  • Large variation: some subjects provided feedback at almost every turn, some never did
• Utility of the study:
  • Use positive feedback to model the user better (preferences)
  • Use negative feedback in error detection
The HIGGINS domain
• The primary domain of HIGGINS is city navigation for pedestrians.
• Secondarily, HIGGINS is intended to provide simple information about the immediate surroundings.
[Figure: a 3D test environment]
Initial experiments
• Studies on human-human conversation
• The Higgins domain (similar to Map Task)
• Using ASR in one direction to elicit error-handling behaviour
[Diagram: the user speaks and the operator reads the ASR output; the operator speaks through a vocoder and the user listens]
Non-Understanding Error Recovery (Skantze ’03)
• Humans tend not to signal non-understanding:
  O: Do you see a wooden house in front of you?
  U: I pass the wooden house now (ASR: YES CROSSING ADDRESS NOW)
  O: Can you see a restaurant sign?
• This leads to:
  • an increased experience of task success
  • faster recovery from non-understanding
Today • Some Swedish examples • Controlling the dialogue flow • State prediction • Controlling lexical choice • Learning from human-human dialogue • User feedback • Evaluating systems
Evaluating Dialogue Systems
• PARADISE framework (Walker et al. ’00)
• “Performance” of a dialogue system is affected both by what gets accomplished by the user and the dialogue agent and how it gets accomplished
[Diagram: maximize task success; minimize costs (efficiency measures and qualitative measures)]
Task Success
• Task goals seen as an Attribute-Value Matrix (AVM)
• ELVIS e-mail retrieval task (Walker et al. ‘97)
  • “Find the time and place of your meeting with Kim.”

  Attribute            Value
  Selection Criterion  Kim or Meeting
  Time                 10:30 a.m.
  Place                2D516

• Task success defined by the match between the AVM values at the end of the dialogue and the “true” values for the AVM
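A minimal sketch of AVM-based task success, using the ELVIS scenario above: compare the attribute values logged at the end of the dialogue against the "true" key values. PARADISE actually aggregates such comparisons over many dialogues (using the kappa statistic to correct for chance agreement); here we simply count matching attributes.

```python
# AVM task-success sketch: fraction of attributes matching the scenario key.

KEY = {  # the "true" AVM for the ELVIS example
    "Selection Criterion": "Kim or Meeting",
    "Time": "10:30 a.m.",
    "Place": "2D516",
}

def task_success(observed: dict) -> float:
    """Fraction of AVM attributes whose final value matches the key."""
    hits = sum(observed.get(attr) == val for attr, val in KEY.items())
    return hits / len(KEY)

# All three attributes recovered correctly -> task success 1.0.
print(task_success({"Selection Criterion": "Kim or Meeting",
                    "Time": "10:30 a.m.",
                    "Place": "2D516"}))
```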
Metrics
• Efficiency of the interaction: User Turns, System Turns, Elapsed Time
• Quality of the interaction: ASR rejections, Time-Out Prompts, Help Requests, Barge-Ins, Mean Recognition Score (concept accuracy), Cancellation Requests
• User Satisfaction
• Task Success: perceived completion, information extracted
Experimental Procedures
• Subjects given specified tasks
• Spoken dialogues recorded
• Cost factors, states, dialogue acts automatically logged; ASR accuracy and barge-in hand-labeled
• Users specify task solution via web page
• Users complete User Satisfaction surveys
• Use multiple linear regression to model User Satisfaction as a function of Task Success and Costs; test for significant predictive factors
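The regression step can be sketched as an ordinary least-squares fit of satisfaction scores against the logged measures. The numbers below are made-up logged values for illustration, not data from the actual experiments.

```python
import numpy as np

# PARADISE-style regression sketch: User Satisfaction as a linear function
# of task success (COMP), mean recognition score (MRS), and elapsed time (ET).
# All data here is invented for illustration.

X = np.array([         # columns: COMP, MRS, ET
    [1.0, 0.95, 3.0],
    [1.0, 0.80, 5.0],
    [0.0, 0.60, 8.0],
    [1.0, 0.90, 4.0],
    [0.0, 0.50, 9.0],
])
y = np.array([4.5, 3.8, 2.0, 4.2, 1.7])  # survey-based satisfaction scores

# Add an intercept column and solve the least-squares problem.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("weights (COMP, MRS, ET, intercept):", coef.round(2))
```

Significance testing of the individual predictors (as in the slide) would require standard errors on top of this fit, e.g. from a statistics package.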
Was Annie easy to understand in this conversation? (TTS Performance) In this conversation, did Annie understand what you said? (ASR Performance) In this conversation, was it easy to find the message you wanted? (Task Ease) Was the pace of interaction with Annie appropriate in this conversation? (Interaction Pace) In this conversation, did you know what you could say at each point of the dialog? (User Expertise) How often was Annie sluggish and slow to reply to you in this conversation? (System Response) Did Annie work the way you expected her to in this conversation? (Expected Behavior) From your current experience with using Annie to get your email, do you think you'd use Annie regularly to access your mail when you are away from your desk? (Future Use) User Satisfaction:Sum of Many Measures
Performance Functions from Three Systems
• ELVIS: User Sat. = .21*COMP + .47*MRS - .15*ET
• TOOT: User Sat. = .35*COMP + .45*MRS - .14*ET
• ANNIE: User Sat. = .33*COMP + .25*MRS + .33*Help
• COMP: user perception of task completion (task success)
• MRS: mean recognition score (cost)
• ET: elapsed time (cost)
• Help: help requests (cost)
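A fitted performance function can then be applied directly to predict User Satisfaction for a new dialogue; a one-line sketch using the ELVIS weights above (the example input values are invented, and the measures are assumed to be normalised as in PARADISE).

```python
# Applying the ELVIS performance function: User Sat. = .21*COMP + .47*MRS - .15*ET

def elvis_user_sat(comp: float, mrs: float, et: float) -> float:
    """Predicted User Satisfaction from the ELVIS regression weights."""
    return 0.21 * comp + 0.47 * mrs - 0.15 * et

# High completion and recognition with a short dialogue scores well...
print(elvis_user_sat(comp=1.0, mrs=0.9, et=0.2))
# ...while a failed, long dialogue with poor recognition scores poorly.
print(elvis_user_sat(comp=0.0, mrs=0.4, et=0.9))
```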
Performance Model • Perceived task completion and mean recognition score are consistently significant predictors of User Satisfaction • Performance model useful for system development • Making predictions about system modifications • Distinguishing ‘good’ dialogues from ‘bad’ dialogues • But can we also tell on-line when a dialogue is ‘going wrong’?
Next Class • Turn-taking (J&M, Link to conversational analysis description, Beattie on Margaret Thatcher)