This paper discusses a method for evaluating naturalness in conversational dialog systems, focusing on the LifeLike virtual avatar project. It explores the background of early conversational systems, ALICEbot, recent advances, and proposed evaluation frameworks like PARADISE. The approach emphasizes a balance of quantitative and qualitative measures, including task success, dialog performance, and human-like interaction indicators.
Towards a Method For Evaluating Naturalness in Conversational Dialog Systems Victor Hung, Miguel Elvir, Avelino Gonzalez & Ronald DeMara Intelligent Systems Laboratory University of Central Florida IEEE International Conference on Systems, Man, and Cybernetics San Antonio, Texas October 12, 2009
Agenda
• Introduction
• Background
• Approach
• Project LifeLike
Introduction
• Interactive conversation agent evaluation
• Cannot rely solely on quantitative methods: subjectivity in 'naturalness'
• No general method to judge how well a conversation agent performs
• Pivotal focus will be defining naturalness: how well a chatbot can maintain a natural conversation flow
• LifeLike virtual avatar project as a backdrop: provide a suitable validation and verification method
Background: Early Systems
• Declarative knowledge to process data
• Explicitly defined rules
• Constrained knowledge
• Limited capacity to assess and adapt
• Goal-oriented and data-driven behavior
• ALICEbot
Background: Naturalness
• Automatic Speech Recognition
• Context retrieval experimentation
• Intelligent tutoring: Adaptive Control of Thought
• Knowledge acquisition agents: quality of the information received
• Conversation length metric: ALICE-based bots
Background: Recent Advances
• Sentence-based template matching with simple conversational memory: CMU's Julia, Extempo's Erin
• Interaction occurs in a reactive manner
• Wlodzislaw et al.: development of cognitive modules and human interface realism
• Ontologies, concept description vectors, semantic memory models, CYC
Background: Recent Advances
• Becker and Wachsmuth: representation and actuation of coherent emotional states
• Lars et al.: model for sustainable conversation
• Awareness of the human users and the conversation topics
• Relies on textual input similar to ELIZA
• Uses natural language processing for reasoning about human speech
Background: Conclusion
• Breadth of research using chatbots
• Focus on creating more sophisticated interpretative conversational modules
• Need exists for generalizable metrics
• Conversational agents are widely experimented with, but the field still lacks a basic framework for universal performance comparison
Approach: Previous Approaches
• Mix of quantitative and qualitative measures
• Subjective matters employ a human user questionnaire
• Semeraro et al.'s bookstore chatbot, rated on 7 characteristics: impression, command, effectiveness, navigability, ability to learn, ability to aid, comprehension
• Does not provide statistical conclusiveness, only a general indicator of performance
Approach: Previous Approaches
• Shawar and Atwell's universal chatbot evaluation system: ALICE-based Afrikaans conversation agent
• Dialog efficiency
• Dialog quality: responses rated as reasonable, weird but understandable, or nonsensical
• Users' satisfaction, qualitatively measured
• Proper assessment ultimately rests on how successfully the agent accomplishes its intended goals
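Shawar and Atwell's three quality categories lend themselves to a simple dialog-quality profile over judge-annotated responses. A minimal sketch, where the category labels come from the slide but the sample annotations are invented for illustration:

```python
from collections import Counter

# Hypothetical human-judge labels for six agent responses, using
# Shawar and Atwell's three quality categories.
labels = [
    "reasonable", "reasonable", "weird but understandable",
    "nonsensical", "reasonable", "weird but understandable",
]

counts = Counter(labels)
total = len(labels)
# Proportion of responses falling into each category.
quality_profile = {
    cat: counts[cat] / total
    for cat in ("reasonable", "weird but understandable", "nonsensical")
}
print(quality_profile)
```

Such a profile gives a per-category breakdown rather than a single score, which matches the qualitative flavor of the original evaluation.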
Approach: Previous Approaches
• Evaluation of naturalness similar to general chatbot assessment
• Rzepka et al.'s 1-to-10 scale metrics: degree of naturalness; willingness to continue the conversation
• Human judges used these measures to evaluate a conversation agent's utterances
• No concrete baseline for naturalness, but relative measurements of naturalness between dialog agents are possible
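With no absolute baseline, 1-to-10 judge ratings still support relative comparison between agents. A toy sketch, with invented scores:

```python
import statistics

# Hypothetical 1-to-10 naturalness ratings assigned by human judges
# to utterances from two agents (in the spirit of Rzepka et al.).
agent_a = [7, 8, 6, 7, 9]
agent_b = [4, 5, 6, 5, 4]

# No absolute threshold says what counts as "natural", but the mean
# ratings allow a relative ranking of the two agents.
print(statistics.mean(agent_a) > statistics.mean(agent_b))  # True
```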
Approach: Chatbot Objectives
• Walker et al.'s PARAdigm for DIalogue System Evaluation (PARADISE)
• Dialog performance relates to the experience of the interaction (means)
• Task success is concerned with the utility of the dialog exchange (ends)
• Objectives: better than other dialog system solutions; similar to a human-to-human (naturalness) interaction
Approach: Task Success
• Measure of goal satisfaction
• Attribute-value matrix, derived from PARADISE
• Expected vs. actual attribute values compared
• Task success (κ) computed as the percentage of correct responses
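Under the slide's simplified definition, task success is the fraction of attribute values the agent filled correctly in the attribute-value matrix. A sketch with invented attribute names (note that PARADISE's full κ additionally corrects the raw agreement for chance):

```python
# Attribute-value matrix comparison: expected vs. actual values.
# The attribute names and values are illustrative, not from the paper.
expected = {"destination": "Orlando", "date": "Oct 12", "time": "9am"}
actual   = {"destination": "Orlando", "date": "Oct 13", "time": "9am"}

correct = sum(actual[attr] == value for attr, value in expected.items())
kappa = correct / len(expected)  # percentage of correct responses
print(kappa)  # 2 of 3 attributes match
```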
Approach: Performance Function
• Derived from PARADISE: total effectiveness
• Task success (κ) weighted by (α)
• Dialog costs (ci) weighted by (wi)
• Normalization function (N) uses Z-score normalization to balance out (κ) and the (ci)
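Putting the pieces together, the PARADISE performance function is Performance = α·N(κ) − Σi wi·N(ci), where N is Z-score normalization against a population of dialogs. A minimal sketch; every number below is invented for illustration:

```python
import statistics

def z_normalize(x, population):
    # N(x) = (x - mean) / stdev: PARADISE's Z-score normalization,
    # so that kappa and the cost measures share a common scale.
    return (x - statistics.mean(population)) / statistics.stdev(population)

def performance(kappa, costs, alpha, weights,
                kappa_population, cost_populations):
    # Performance = alpha * N(kappa) - sum_i w_i * N(c_i)
    score = alpha * z_normalize(kappa, kappa_population)
    for c, w, pop in zip(costs, weights, cost_populations):
        score -= w * z_normalize(c, pop)
    return score

# One dialog's task success and two hypothetical cost measures
# (e.g. number of turns, elapsed seconds), normalized against a
# small population of four dialogs.
p = performance(
    kappa=0.9, costs=[12, 95.0], alpha=0.5, weights=[0.25, 0.25],
    kappa_population=[0.6, 0.7, 0.9, 0.8],
    cost_populations=[[10, 12, 15, 20], [80.0, 95.0, 110.0, 60.0]],
)
print(round(p, 3))
```

Because each factor is normalized, the weights α and wi express the relative importance of task success versus each cost, not unit conversions.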
Approach: Proposed System
• Task success
• Dialog costs
• Efficiency: resource consumption (quantitative)
• Quality: actual conversational content (quantitative or qualitative)
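The proposed breakdown of measures could be carried per dialog as a simple record; a sketch in which all field names are assumptions, not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class DialogEvaluation:
    # Task success (kappa): quantitative.
    task_success: float
    # Efficiency / resource-consumption costs: quantitative.
    efficiency_costs: dict = field(default_factory=dict)
    # Quality costs over actual conversational content:
    # quantitative or qualitative (e.g. judge ratings).
    quality_costs: dict = field(default_factory=dict)

ev = DialogEvaluation(
    task_success=0.9,
    efficiency_costs={"turns": 12, "elapsed_s": 95.0},
    quality_costs={"judged_naturalness": 7},  # hypothetical 1-to-10 rating
)
print(ev.task_success)
```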