
constructing accurate beliefs in task-oriented spoken dialog systems

This presentation addresses understanding errors in spoken language interfaces and approaches for increasing robustness: handling errors gracefully through interaction, improving recognition, detecting problems, and deploying recovery strategies.



Presentation Transcript


  1. constructing accurate beliefs in task-oriented spoken dialog systems
  Dan Bohus (dbohus@cs.cmu.edu, www.cs.cmu.edu/~dbohus)
  Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213

  2. problem
  • spoken language interfaces lack robustness when faced with understanding errors
  • errors stem mostly from speech recognition
  • typical word error rates: 20-30%
  • significant negative impact on interactions

  3. more concretely …
  S: What city are you leaving from?
  U: Urbana Champaign [OKAY IN THAT SAME PAY]
  S: Sorry, I'm not sure I understood what you said. What city are you leaving from?
  U: Urbana Champaign [FOR MINUTE SINCE HEY]
  S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
  U: Chicago [CHICAGO]
  S: traveling from Chicago. Where would you like to go?
  U: Huntsville [SEOUL]
  S: traveling to Seoul. What day did you need to travel?
  U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
  S: traveling in the afternoon. Okay, what day would you be departing Chicago?
  U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
  S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
  U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
  S: traveling on Saturday, August 12th … I have a flight departing Chicago at 1:40pm arrives Seoul at …

  4. two types of understanding errors (same dialog as on the previous slide, annotated by error type)
  • NON-understanding: the system fails to extract any meaning from the user's utterance (e.g., the two failed "What city are you leaving from?" turns, which trigger reprompts)
  • MIS-understanding: the system extracts an incorrect meaning (e.g., "Huntsville" recognized as Seoul, "Birmingham" recognized as "Berlin P_M")

  5. approaches for increasing robustness
  • gracefully handle errors through interaction
  • improve recognition
  • detect the problems
  • develop a set of recovery strategies
  • know how to choose between them (policy)

  6. six not-so-easy pieces …
  (a grid: detection, strategies, and policy, each for misunderstandings and for non-understandings)

  7. today's talk …
  • construct more accurate beliefs by integrating information over multiple turns in a conversation (misunderstandings / detection)
  S: Where would you like to go?
  U: Huntsville [SEOUL / 0.65] → destination = {seoul/0.65}
  S: traveling to Seoul. What day did you need to travel?
  U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M / 0.60] → destination = {?}

  8. belief updating: problem statement
  • given
  • an initial belief P_initial(C) over concept C
  • a system action SA
  • a user response R
  • construct an updated belief
  • P_updated(C) ← f(P_initial(C), SA, R)
  example: destination = {seoul/0.65}; S: traveling to Seoul. What day did you need to travel? U: [THE TRAVELING TO BERLIN P_M / 0.60] → destination = {?}
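
To make the problem statement concrete, here is a minimal Python sketch of the update interface as this slide frames it. The `Belief` representation, class names, and example values are illustrative assumptions, not code from the talk.

```python
# Sketch of the belief-updating interface: P_updated(C) <- f(P_initial(C), SA, R).
# All names and types here are illustrative assumptions.

from dataclasses import dataclass

# a belief over a concept: hypothesis -> probability (unassigned mass = "other")
Belief = dict

@dataclass
class UserResponse:
    transcript: str    # top recognition hypothesis, e.g. "THE TRAVELING TO BERLIN P_M"
    confidence: float  # recognizer confidence for it, e.g. 0.60

def update_belief(p_initial: Belief, system_action: str,
                  response: UserResponse) -> Belief:
    """Placeholder for f; the talk learns this mapping from data
    rather than hand-coding it."""
    raise NotImplementedError

# the slide's running example, before the update:
destination: Belief = {"seoul": 0.65}
action = "ImplicitConfirm(destination=seoul)"
response = UserResponse("THE TRAVELING TO BERLIN P_M", 0.60)
```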

  9. outline • related work • a restricted version • data • user response analysis • experiments and results • current and future work

  10. current solutions
  • most systems only track values, not beliefs: new values overwrite old values
  • use confidence scores
  • explicit confirm + yes → trust hypothesis
  • explicit confirm + no → delete hypothesis
  • explicit confirm + "other" → non-understanding
  • implicit confirm: not much
  • "users who discover errors through incorrect implicit confirmations have a harder time getting back on track" [Shin et al, 2002]
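
A toy rendering of the heuristic this slide describes for explicit confirmations, assuming a belief tracked as {hypothesis: confidence}. The yes/no classification here is a crude keyword match, purely for illustration; real systems classify the response more carefully.

```python
# Heuristic update after an explicit confirmation, as sketched on the slide:
# yes -> trust hypothesis, no -> delete hypothesis, other -> non-understanding.

def heuristic_update(belief, confirmed_value, user_answer):
    answer = user_answer.lower().strip()
    if answer in {"yes", "right", "correct"}:        # yes -> trust hypothesis
        belief[confirmed_value] = 1.0
    elif answer in {"no", "wrong"}:                  # no -> delete hypothesis
        belief.pop(confirmed_value, None)
    else:                                            # other -> treat as non-understanding
        pass  # belief left unchanged; the system reprompts
    return belief

print(heuristic_update({"boston": 0.65}, "boston", "yes"))  # {'boston': 1.0}
print(heuristic_update({"boston": 0.65}, "boston", "no"))   # {}
```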

  11. confidence / detecting misunderstandings
  • traditionally focused on word-level errors [Chase, Cox, Bansal, Ravishankar, and many others]
  • recently: detecting misunderstandings [Walker, Wright, Litman, Bosch, Swerts, San-Segundo, Pao, Gurevych, Bohus, and many others]
  • machine learning approach: binary classification
  • in-domain, labeled dataset
  • features from different knowledge sources: acoustic, language model, parsing, dialog management
  • ~50% relative reduction in classification error

  12. detecting corrections
  • detect if the user is trying to correct the system [Litman, Swerts, Hirschberg, Krahmer, Levow]
  • machine learning approach: binary classification
  • in-domain, labeled dataset
  • features from different knowledge sources: acoustic, prosody, language model, parsing, dialog management
  • ~50% relative reduction in classification error

  13. integration
  • confidence annotation and correction detection are useful tools
  • but separately, neither solves the problem
  • bring them together in a unified approach to accurately track beliefs

  14. outline • related work • a restricted version • data • user response analysis • experiments and results • current and future work

  15. belief updating: general form
  • given
  • an initial belief P_initial(C) over concept C
  • a system action SA
  • a user response R
  • construct an updated belief
  • P_updated(C) ← f(P_initial(C), SA, R)

  16. two simplifications
  1. belief representation
  • system unlikely to "hear" more than 3 or 4 values for a concept within a dialog session
  • in our data [considering only the top hypothesis from recognition]: at most 3 conflicting values heard; in only 6.9% of cases was more than 1 value heard
  • compressed beliefs: top-K concept hypotheses + other (for now, K=1)
  2. updates following system confirmation actions
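
A small sketch of the compressed belief representation: keep the top-K hypotheses and lump the remaining probability mass into "other" (K=1 here, matching the slide). The function and the "<other>" key are illustrative, not from the original system.

```python
# Compress a belief to its top-K hypotheses plus an "other" bucket.

def compress_belief(belief: dict, k: int = 1) -> dict:
    top = sorted(belief.items(), key=lambda kv: kv[1], reverse=True)[:k]
    kept = dict(top)
    kept["<other>"] = max(0.0, 1.0 - sum(kept.values()))  # leftover mass
    return kept

print(compress_belief({"boston": 0.65, "austin": 0.11}, k=1))
# {'boston': 0.65, '<other>': 0.35}
```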

  17. belief updating: reduced version
  • given
  • an initial confidence score for the current top hypothesis Conf_init(th_C) for concept C
  • a system confirmation action SA
  • a user response R
  • construct an updated confidence score for that hypothesis
  • Conf_upd(th_C) ← f(Conf_init(th_C), SA, R)
  example: {boston/0.65; austin/0.11; …} + ExplicitConfirm(Boston) + [NOW] → {boston/?}

  18. outline • related work • a restricted version • data • user response analysis • experiments and results • current and future work

  19. data
  • collected with RoomLine: a phone-based, mixed-initiative spoken dialog system for conference room reservation
  • explicit and implicit confirmations
  • confidence threshold model (+ some exploration)
  • unplanned implicit confirmations, e.g. "I found 10 rooms for Friday between 1 and 3 p.m. Would you like a small room or a large one?" (the first sentence implicitly confirms the date and time even though no confirmation was planned)

  20. corpus
  • user study: 46 participants (naïve users), 10 scenario-based interactions each, compensated per task success
  • corpus: 449 sessions, 8848 user turns
  • orthographically transcribed
  • manually annotated: misunderstandings, corrections, correct concept values

  21. outline • related work • a restricted version • data • user response analysis • experiments and results • current and future work

  22. user response types
  • following [Krahmer and Swerts, 2000]: study on a Dutch train-table information system
  • 3 user response types:
  • YES: yes, right, that's right, correct, etc.
  • NO: no, wrong, etc.
  • OTHER
  • cross-tabulated against correctness of system confirmations
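
A rough sketch of this three-way response typing, using the slide's example markers. The study used manually annotated data; this keyword rule is only an illustration, and the marker lists are my own.

```python
# Classify a user response to a confirmation as YES, NO, or OTHER.

YES_MARKERS = {"yes", "right", "that's right", "correct", "yeah"}
NO_MARKERS = {"no", "wrong", "nope"}

def response_type(utterance: str) -> str:
    u = utterance.lower().strip()
    if u in YES_MARKERS:
        return "YES"
    if u in NO_MARKERS:
        return "NO"
    return "OTHER"

print(response_type("that's right"))                        # YES
print(response_type("no no I'm traveling to Birmingham"))   # OTHER (embedded correction)
```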

  23. user responses to explicit confirmations
  [table of response types (YES/NO/OTHER) cross-tabulated against correctness of the confirmation, with numbers in brackets from Krahmer & Swerts; the figures did not survive transcription. A "~10%" callout appears, likely the rate of OTHER responses given the next slide]

  24. other responses to explicit confirmations
  • ~70% of users repeat the correct value
  • ~15% of users don't address the question (attempt to shift the conversation focus)
  • how often do users correct the system?

  25. user responses to implicit confirmations
  [table of response types cross-tabulated against correctness, with numbers in brackets from Krahmer & Swerts; the figures did not survive transcription]

  26. ignoring errors in implicit confirmations
  • explanation: users correct later (40% of 118); users interact strategically and correct only if essential
  • how often do users correct the system?

  27. outline • related work • a restricted version • data • user response analysis • experiments and results • current and future work

  28. machine learning approach
  • problem: Conf_upd(th_C) ← f(Conf_init(th_C), SA, R)
  • need good probability outputs: low cross-entropy between model predictions and reality
  • logistic regression: sample efficient; stepwise approach → feature selection
  • logistic model tree for each action: root splits on response type
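
A compact sketch of this setup: a logistic regression trained to predict whether the top hypothesis was correct, evaluated by cross-entropy (soft error) and classification error (hard error). The toy data and feature names are invented for illustration.

```python
# Toy version of the learning setup: logistic regression per system action,
# scored by cross-entropy between predicted confidence and correctness.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# toy features: [initial_confidence, response_is_yes, barge_in]
X = np.array([[0.9, 1, 0], [0.6, 1, 0], [0.7, 0, 1], [0.3, 0, 0]])
y = np.array([1, 1, 0, 0])  # target: was the top hypothesis correct?

model = LogisticRegression().fit(X, y)
p = model.predict_proba(X)[:, 1]  # updated confidence scores

print("soft error (cross-entropy):", log_loss(y, p))
print("hard error:", np.mean((p >= 0.5) != y))
```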

  29. features. target.
  • target: was the top hypothesis correct?
  [the slide's feature table did not survive transcription]

  30. baselines
  • initial baseline: accuracy of system beliefs before the update
  • heuristic baseline: accuracy of the heuristic update rule used by the system
  • oracle baseline: accuracy if we knew exactly what the user said

  31. results: explicit confirmation
  [bar chart: hard error (%) and soft error for initial / heuristic / logistic model tree / oracle. Recoverable hard-error values: initial 31.15%, heuristic 8.41%, logistic model tree 3.57%, oracle 2.71%; soft-error values 0.51, 0.19, 0.12 also appear, though their exact attribution is uncertain from this transcript]

  32. results: implicit confirmation
  [bar chart: hard error (%) and soft error for initial / heuristic / logistic model tree / oracle. Recoverable hard-error values: initial 30.40%, heuristic 23.37%, logistic model tree 16.15%, oracle 15.33%; soft-error values 0.67, 0.61, 0.43 also appear, attribution uncertain]

  33. results: unplanned implicit confirmation
  [bar chart: hard error (%) and soft error for initial / heuristic / logistic model tree / oracle. Recoverable hard-error values: initial 15.40%, heuristic 14.36%, logistic model tree 12.64%, oracle 10.37%; soft-error values 0.46, 0.43, 0.34 also appear, attribution uncertain]

  34. informative features
  • initial confidence score
  • prosody features
  • barge-in
  • expectation match
  • repeated grammar slots
  • concept identity

  35. summary
  • data-driven approach for constructing accurate system beliefs
  • integrates information across multiple turns
  • bridges detection of misunderstandings and detection of corrections
  • performs better than current heuristics
  • user response analysis: users don't correct unless the error is critical

  36. outline • related work • a restricted version • data • user response analysis • experiments and results • current and future work

  37. current extensions
  • belief representation: top hypothesis + other (logistic regression model) → k hypotheses + other (multinomial GLM)
  • system action: confirmation actions only → all actions: confirmation (expl/impl), request, unexpected
  • features: added priors
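
A tiny sketch of the multinomial extension: instead of a binary correct/incorrect model over the single top hypothesis, fit a multinomial GLM over {hypothesis 1, hypothesis 2, other}. The features and toy data are my own illustration.

```python
# Toy multinomial GLM over {hyp1, hyp2, other}.

import numpy as np
from sklearn.linear_model import LogisticRegression

# features: initial confidence of hypothesis 1 and hypothesis 2
X = np.array([[0.8, 0.1], [0.5, 0.4], [0.2, 0.1], [0.6, 0.3], [0.1, 0.7]])
y = np.array([0, 1, 2, 0, 1])  # which outcome was correct: hyp1, hyp2, or other

# with more than two classes, sklearn fits a multinomial (softmax) model
glm = LogisticRegression().fit(X, y)
print(glm.predict_proba([[0.7, 0.2]]))  # updated distribution over the 3 outcomes
```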

  38. 2 hypotheses + other
  [bar charts of hard error for initial / heuristic / lmt(basic) / lmt(basic+concept) / oracle across five actions: explicit confirmation, implicit confirmation, unplanned implicit confirmation, request, and unexpected update. The per-condition figures are not reliably recoverable from this transcript; visible values range from ~5.5% (explicit confirmation, best models) to ~98% (unexpected update, initial)]

  39. other work
  • detection (misunderstandings): belief updating [ASRU-05]; transferring confidence annotators across domains [in progress]
  • detection (non-understandings): costs for errors; rejection threshold adaptation; non-understanding impact on performance [Interspeech-05]
  • strategies (non-understandings): comparative analysis of 10 recovery strategies [SIGdial-05]
  • policy (non-understandings): impact of policy on performance; towards learning non-understanding recovery policies [SIGdial-05]
  • RavenClaw: dialog management for task-oriented systems (RoomLine, Let's Go Public!, Vera, LARRI, TeamTalk, Sublime) [EuroSpeech-03, HLT-05]

  40. thank you! questions …

  41. a more subtle caveat
  • distribution of training data: confidence annotator + heuristic update rules
  • distribution of run-time data: confidence annotator + learned model
  • always a problem when interacting with the world!
  • hopefully, the distribution shift will not cause a large degradation in performance; this remains to be validated empirically
  • maybe a bootstrap approach?

  42. KL-divergence & cross-entropy • KL divergence: D(p||q) • Cross-entropy: CH(p, q) = H(p) + D(p||q) • Negative log likelihood
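
For reference, the standard discrete-case definitions behind these bullets, written out in my own notation; they match the slide's identity CH(p, q) = H(p) + D(p||q).

```latex
\[
D(p \,\|\, q) \;=\; \sum_x p(x) \log \frac{p(x)}{q(x)}, \qquad
H(p, q) \;=\; -\sum_x p(x) \log q(x) \;=\; H(p) + D(p \,\|\, q).
\]
% When p is the empirical distribution of labeled examples, minimizing the
% cross-entropy of the model q is the same as minimizing the negative
% log-likelihood:
\[
\mathrm{NLL} \;=\; -\frac{1}{N} \sum_{i=1}^{N} \log q(y_i \mid x_i).
\]
```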

  43. logistic regression
  • regression model for binomial (binary) dependent variables
  • fit a model using maximum likelihood (average log-likelihood); any stats package will do it for you
  • no R² measure; test fit using the "likelihood ratio" test
  • stepwise logistic regression (see the sketch below): keep adding variables while the data likelihood increases significantly; use the Bayesian information criterion to avoid overfitting
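
A minimal sketch of forward stepwise selection for logistic regression scored by BIC, as the slide suggests. It uses statsmodels on synthetic data; the data and variable names are my own, and this is one plausible rendering of "stepwise", not the exact procedure from the talk.

```python
# Forward stepwise logistic regression, keeping a feature only if it improves BIC.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                                  # 4 candidate features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

selected, remaining = [], list(range(X.shape[1]))
best_bic = np.inf
improved = True
while improved and remaining:
    improved = False
    for j in list(remaining):
        cols = selected + [j]
        model = sm.Logit(y, sm.add_constant(X[:, cols])).fit(disp=0)
        if model.bic < best_bic:        # candidate improves the criterion
            best_bic, best_j = model.bic, j
            improved = True
    if improved:
        selected.append(best_j)
        remaining.remove(best_j)

print("selected features:", selected, "BIC:", round(best_bic, 1))
```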

  44. logistic regression
  [the slide's formula/plot did not survive transcription]

  45. logistic model tree
  • a regression tree, but with logistic models on the leaves
  • example tree: the root splits on f (f=0 / f=1); one of its children splits on g (g<=10 / g>10); each leaf holds a logistic model
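
A toy logistic model tree matching the slide's picture: internal nodes split on features, leaves hold logistic models. Which branch carries the g split, and all the weights, are assumptions made for illustration.

```python
# Minimal logistic model tree: route to a leaf via the splits, then apply
# that leaf's logistic model to the remaining features.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# each leaf: (weights w, bias b) for a logistic model over the feature vector
LEAVES = {
    "f=0":       ([0.8, 0.0], -0.2),
    "f=1,g<=10": ([0.1, 0.3], 0.5),
    "f=1,g>10":  ([0.4, -0.2], -1.0),
}

def lmt_predict(f, g, features):
    # route to a leaf using the tree's splits ...
    if f == 0:
        w, b = LEAVES["f=0"]
    elif g <= 10:
        w, b = LEAVES["f=1,g<=10"]
    else:
        w, b = LEAVES["f=1,g>10"]
    # ... then apply that leaf's logistic model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features)) + b)

print(lmt_predict(f=1, g=7, features=[0.65, 1.0]))  # updated confidence
```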

  46. user study
  • 46 participants, 1st-time users
  • 10 scenarios, fixed order
  • presented graphically (explained during briefing)
  • participants compensated per task success
