This presentation by Dan Bohus and colleagues details the challenges and solutions in ensuring effective communication in spoken dialogue systems. It addresses the common problem of misunderstandings in conversation, which lead to user frustration and inefficient dialogues. Leveraging confidence annotation and a data-driven approach, the team developed a classifier that labels misunderstood utterances using features extracted from dialog logs. The findings point toward improved error handling and better classifier performance, ultimately aiming to enhance user experience in dialogue systems.
Is This Conversation on Track? Utterance Level Confidence Annotation in the CMU Communicator spoken dialog system Presented by: Dan Bohus (dbohus@cs.cmu.edu) Work by: Paul Carpenter, Chun Jin, Daniel Wilson, Rong Zhang, Dan Bohus, Alex Rudnicky Carnegie Mellon University – 2001
Outline
• The Problem, The Approach
• Training Data and Features
• Experiments and Results
• Conclusion, Future Work
The Problem
• Systems often misunderstand, take the misunderstanding as fact, and continue to act on invalid information
• Repair costs: increased dialog length, user frustration
• Confidence annotation provides critical information for effective confirmation and clarification in dialog systems
The Approach
• Treat the problem as a data-driven classification task
• Objective: accurately label misunderstood utterances
• Collect a training corpus
• Identify useful features
• Train several classifiers and identify the best-performing one for this task
Data
• Communicator logs & transcripts, collected over 2 months (Oct–Nov 1999)
• Eliminated conversations with < 5 turns
• Manually labeled OK (67%) / BAD (33%); BAD ~ RecogBAD / ParseBAD / OOD / NonSpeech
• Discarded mixed-label utterances (6%)
• Cleaned corpus: 4550 utterances / 311 dialogs
Feature Extraction
12 features from various levels:
• Decoder features: Word Number, Unconfident Percentage
• Parsing features: Uncovered Percentage, Fragment Transitions, Gap Number, Slot Number, Slot Bigram
• Dialog features: Dialog State, State Duration, Turn Number, Expected Slots
• Garble: handcrafted heuristic currently used by the CMU Communicator
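The 12-feature utterance representation above can be sketched as a simple record. This is illustrative only: the field names are hypothetical stand-ins for the features named on the slide, not identifiers from the original system.

```python
# Hypothetical sketch of the slide's 12-feature utterance vector.
# Field names are illustrative, not taken from the CMU Communicator code.
from dataclasses import dataclass, asdict

@dataclass
class UtteranceFeatures:
    # Decoder features
    word_number: int
    unconfident_percentage: float
    # Parsing features
    uncovered_percentage: float
    fragment_transitions: int
    gap_number: int
    slot_number: int
    slot_bigram_score: float
    # Dialog features
    dialog_state: str
    state_duration: int
    turn_number: int
    expected_slots: int
    # Existing handcrafted heuristic
    garble: bool

features = UtteranceFeatures(
    word_number=7, unconfident_percentage=0.29,
    uncovered_percentage=0.14, fragment_transitions=1,
    gap_number=0, slot_number=2, slot_bigram_score=-3.1,
    dialog_state="query_departure", state_duration=2,
    turn_number=5, expected_slots=1, garble=False)

print(len(asdict(features)))  # 12 features per utterance
```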
Experiments with 6 Different Classifiers
• Decision Tree
• Artificial Neural Network
• Naïve Bayes
• Bayesian Network (several network structures attempted)
• AdaBoost (individual feature-based binning estimators as weak learners, 750 boosting stages)
• Support Vector Machines (dot, polynomial, radial, neural, ANOVA kernels)
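To make one of these concrete, here is a minimal from-scratch Naïve Bayes over boolean features, classifying utterances as OK or BAD. This is a toy sketch with invented data, not the paper's implementation; it also makes the feature-independence assumption whose violation is blamed for Naïve Bayes underperforming later in the deck.

```python
# Minimal Naive Bayes sketch for OK/BAD utterance labels over boolean
# features. Toy data and feature names are illustrative only.
import math
from collections import defaultdict

def train_nb(rows, labels):
    """rows: list of {feature: bool}; labels: parallel list of 'OK'/'BAD'."""
    prior = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for row, y in zip(rows, labels):
        prior[y] += 1
        for f, v in row.items():
            counts[y][(f, v)] += 1

    def predict(row):
        best, best_lp = None, -math.inf
        for y, n in prior.items():
            lp = math.log(n / len(labels))           # class prior
            for f, v in row.items():
                # Laplace smoothing over the two boolean values
                lp += math.log((counts[y][(f, v)] + 1) / (n + 2))
            if lp > best_lp:
                best, best_lp = y, lp
        return best
    return predict

rows = [{"garble": True, "gap": True}, {"garble": False, "gap": False},
        {"garble": True, "gap": False}, {"garble": False, "gap": False}]
labels = ["BAD", "OK", "BAD", "OK"]
predict = train_nb(rows, labels)
print(predict({"garble": True, "gap": True}))   # BAD
```

The per-feature log-probability sum is exactly where the independence assumption enters: correlated features (e.g. parse gaps and uncovered words) get double-counted.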
Evaluating Performance
• Classification Error Rate (FP + FN)
• CDR = 1 − Fallout = 1 − (FP / N_BAD)
• The cost of misunderstanding in dialog systems depends on: error type (FP vs. FN), domain, dialog state
• Ideally, build a cost function for each type of error, and optimize for that
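The error rate and the cost-function idea above can be sketched in a few lines. The cost weights here are purely illustrative, standing in for the domain- and state-dependent costs the slide alludes to.

```python
# Sketch of the slide's error metrics plus a cost-weighted variant.
# The FP/FN cost weights are illustrative assumptions, not from the paper.
def classification_error_rate(fp, fn, n_total):
    """Fraction of utterances misclassified (FP + FN) / N."""
    return (fp + fn) / n_total

def weighted_cost(fp, fn, cost_fp=2.0, cost_fn=1.0):
    # A false acceptance (acting on misrecognized input) is often costlier
    # than a false rejection, hence the higher illustrative FP weight.
    return cost_fp * fp + cost_fn * fn

print(classification_error_rate(fp=10, fn=7, n_total=100))  # 0.17
print(weighted_cost(fp=10, fn=7))                           # 27.0
```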
Results – Individual Features
• Baseline error: 32.84% (when predicting the majority class)
• All experiments involved 10-fold cross-validation
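The 10-fold cross-validation protocol and the majority-class baseline can be sketched as follows; `train` and `evaluate` are placeholders for any of the classifiers on the previous slide, and the 67/33 toy corpus mirrors the OK/BAD label split reported earlier.

```python
# Minimal 10-fold cross-validation sketch (stdlib only). The classifier
# here is just the majority-class baseline; real models would plug into
# the same train/evaluate slots.
import random

def ten_fold_cv(data, train, evaluate, k=10, seed=0):
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]          # k disjoint folds
    errors = []
    for i in range(k):
        test = folds[i]
        train_set = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(train_set)
        errors.append(evaluate(model, test))
    return sum(errors) / k

def train_majority(rows):
    """Majority-class baseline: always predict the most frequent label."""
    labels = [y for _, y in rows]
    return max(set(labels), key=labels.count)

def error_rate(model, rows):
    return sum(y != model for _, y in rows) / len(rows)

# Toy corpus with the 67% OK / 33% BAD split from the Data slide.
data = [("utt", "OK")] * 67 + [("utt", "BAD")] * 33
print(round(ten_fold_cv(data, train_majority, error_rate), 2))  # 0.33
```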
Results – Classifiers
• A t-test showed no statistically significant difference between the classifiers, except for Naïve Bayes
• Explanation: the feature-independence assumption is violated
• Baseline error: 25.32% (GARBLE)
Future Work
• Improve the classifiers: additional features
• Develop a cost model for understanding errors in dialog systems
• Study/optimize tradeoffs between FP and FN
• Integrate value and confidence information to guide clarification in dialog systems
Confusion Matrix
• FP = false acceptance
• FN = false detection/rejection
• Fallout = FP / (FP + TN) = FP / N_BAD
• CDR = 1 − Fallout = 1 − (FP / N_BAD)
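The definitions above can be computed from (gold, predicted) label pairs. Following the slide's convention, "accepting" an utterance means classifying it OK, so FP counts BAD utterances falsely accepted and TN counts BAD utterances correctly rejected; the toy pairs below are invented for illustration.

```python
# Sketch: build the slide's confusion-matrix counts from (gold, predicted)
# label pairs, with "accept" meaning the utterance was classified OK.
from collections import Counter

def confusion(pairs):
    c = Counter()
    for gold, pred in pairs:
        if gold == "OK" and pred == "OK":
            c["TP"] += 1            # correct acceptance
        elif gold == "BAD" and pred == "OK":
            c["FP"] += 1            # false acceptance
        elif gold == "BAD" and pred == "BAD":
            c["TN"] += 1            # correct rejection
        else:
            c["FN"] += 1            # false rejection of an OK utterance
    return c

# Toy label pairs, one of each outcome:
pairs = [("OK", "OK"), ("BAD", "OK"), ("BAD", "BAD"), ("OK", "BAD")]
c = confusion(pairs)
fallout = c["FP"] / (c["FP"] + c["TN"])   # FP / N_BAD
cdr = 1 - fallout                          # CDR = 1 - Fallout
print(c["FP"], c["TN"], cdr)               # 1 1 0.5
```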