This project focuses on utilizing simulation models for student dialogue systems evaluation, comparing real and simulated corpora, and assessing the effectiveness of reinforcement learning strategies. The study aims to improve dialogue system performance and explore cost-effective ways for corpus acquisition. Results indicate simulated models can mimic real corpus behaviors to an extent, with potential for enhancing future system developments.
Student Simulation and Evaluation • DOD meeting • Hua Ai (hua@cs.pitt.edu) • 03/03/2006
Outline • Motivations • Backgrounds • Corpus • Student Simulation Model • Comparisons • Conclusions & Future Work
Motivations • For a larger corpus • Reinforcement Learning (RL) is used to learn the best policy for spoken dialogue systems automatically • The best strategy is often not even present in a small dataset • For a cheaper corpus • Human subjects are expensive
[Figure: Strategy learning using a simulated user (Schatzmann et al., 2005). Components shown: Dialog Corpus, Simulation models, Simulated User, Dialog Manager, Reinforcement Learning, Strategy.]
Backgrounds (1) • Education community • Focusing on changes in the form of students’ internal knowledge representations • Usually not dialogue based • Simulated students (VanLehn et al., 1994) for • Tutor training • Collaborative learning
Backgrounds (2) • Dialogue community • Focusing on interactions and dialogue behaviors • Simulated users have limited actions to take • (Schatzmann et al., 2005) • Simulating at the dialogue act (DA) level
Corpus (1) • Spoken dialogue physics tutor (ITSPOKE)
Corpus (2) • Tutoring procedure • 5 problems
[Figure: tutoring procedure, alternating Dialogue steps (tutor (T) questions and student (S) answers) and Essay revision steps.]
Corpus (3) • Tutor’s behaviors • Defined in KCD (Knowledge Construction Dialogues)
[Figure: KCD diagram with branches labeled Correct and Incorrect/Partially Correct.]
Corpus (4) • f03 vs. s05 • Different groups of subjects
Simulation Models (1) • Simulating on word level • Students have more complex behaviors • DA info alone isn’t enough for the system • Two models (ProbCorrect, Random), each trained on two corpora (f03, s05), give four simulated corpora: 03ProbCorrect, 03Random, 05ProbCorrect, 05Random
Simulation Models (2) • ProbCorrect Model • Simulates the average knowledge level of real students • Simulates meaningful dialogue behaviors • Random Model • Gives nonsense answers • Serves as a contrast
ProbCorrect Model
• Real corpus
  • question1: Answer1_1 (c), Answer1_2 (ic), Answer1_3 (ic)
  • question2: Answer2_1 (c), Answer2_2 (ic)
• Candidate answers
  • For question1, c:ic = 1:2 (c: Answer1_1; ic: Answer1_2, Answer1_3)
  • For question2, c:ic = 1:1 (c: Answer2_1; ic: Answer2_2)
• ProbCorrect Model, answering question 1:
  • Choose to give a correct (c) or incorrect (ic) answer with the same average probability as the real students
  • Randomly choose one answer from the corresponding answer set
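The ProbCorrect procedure can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the study: the function names (`train_prob_correct`, `prob_correct`, `simulate_answer`) and the tuple layout of the corpus are assumptions, and the probability of a correct answer is assumed to be the fraction of observed answers labeled `c` for that question.

```python
import random
from collections import defaultdict


def train_prob_correct(corpus):
    """Group observed student answers by question and correctness label.

    corpus: iterable of (question, answer, label), where label is "c" or "ic".
    (Hypothetical data layout, chosen for this sketch.)
    """
    model = defaultdict(lambda: {"c": [], "ic": []})
    for question, answer, label in corpus:
        model[question][label].append(answer)
    return dict(model)


def prob_correct(model, question):
    """Fraction of real answers to this question that were correct."""
    sets = model[question]
    return len(sets["c"]) / (len(sets["c"]) + len(sets["ic"]))


def simulate_answer(model, question, rng=random):
    """Pick c/ic with the real students' average probability,
    then pick one answer uniformly from the matching answer set."""
    label = "c" if rng.random() < prob_correct(model, question) else "ic"
    return label, rng.choice(model[question][label])


# The toy corpus from the slide: c:ic = 1:2 for question1, 1:1 for question2.
real_corpus = [
    ("question1", "Answer1_1", "c"),
    ("question1", "Answer1_2", "ic"),
    ("question1", "Answer1_3", "ic"),
    ("question2", "Answer2_1", "c"),
    ("question2", "Answer2_2", "ic"),
]
model = train_prob_correct(real_corpus)
label, answer = simulate_answer(model, "question1", random.Random(0))
```

Passing a seeded `random.Random` instance makes a simulated corpus reproducible, which helps when comparing corpora generated from different training sets.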
Random Model
• Real corpus (HC03&05)
  • Question1: Answer1_1, Answer1_2, Answer1_3, Answer1_4
  • Question2: Answer2_1, Answer2_2
• Candidate answers: 1) Answer1_1 2) Answer1_2 3) Answer1_3 4) Answer1_4 5) Answer2_1 6) Answer2_2
• Big random model, for any question i:
  • Answer: any of the 6 answers with the same probability (regardless of the question!)
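For contrast, the Random model can be sketched as a single pooled answer list. Again, this is a hedged illustration rather than the study's code: `train_random` and `simulate_random_answer` are hypothetical names, and the corpus is assumed to be a list of (question, answer) pairs.

```python
import random


def train_random(corpus):
    """Pool every observed student answer, ignoring which question it answered."""
    return [answer for _question, answer in corpus]


def simulate_random_answer(pool, question, rng=random):
    """Return any pooled answer with equal probability.

    The question argument is deliberately unused: the Random model
    answers regardless of what was asked.
    """
    return rng.choice(pool)


# The toy corpus from the slide: six candidate answers in total.
real_corpus = [
    ("Question1", "Answer1_1"),
    ("Question1", "Answer1_2"),
    ("Question1", "Answer1_3"),
    ("Question1", "Answer1_4"),
    ("Question2", "Answer2_1"),
    ("Question2", "Answer2_2"),
]
pool = train_random(real_corpus)
answer = simulate_random_answer(pool, "Question2", random.Random(0))
```

Because the pool is shared, a simulated reply to Question2 is often an answer that a real student gave to Question1, which is what makes this model a nonsense contrast to ProbCorrect.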
Experiments • Comparisons between real corpora • Comparisons between real & simulated corpora • Comparisons between simulated corpora
Real Corpora Comparisons (1) • Evaluation metrics • High-level dialog features • Dialog style and cooperativeness • Dialog Success Rate and Efficiency • Learning Gains
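As a rough illustration of what high-level dialog features might look like, the sketch below computes a few simple counts for one dialogue. The specific features (turn counts, words per student turn, student-to-tutor word ratio), the function name `high_level_features`, and the example dialogue are all assumptions made for this sketch; the slides do not list the exact features used.

```python
def high_level_features(dialogue):
    """Compute simple surface features for one dialogue.

    dialogue: list of (speaker, utterance) pairs, speaker "T" (tutor)
    or "S" (student). Hypothetical layout chosen for this sketch.
    """
    student = [utt for speaker, utt in dialogue if speaker == "S"]
    tutor = [utt for speaker, utt in dialogue if speaker == "T"]
    student_words = sum(len(utt.split()) for utt in student)
    tutor_words = sum(len(utt.split()) for utt in tutor)
    return {
        "student_turns": len(student),
        "tutor_turns": len(tutor),
        "words_per_student_turn": student_words / len(student) if student else 0.0,
        "student_tutor_word_ratio": student_words / tutor_words if tutor_words else 0.0,
    }


# Made-up dialogue, for illustration only (not taken from the corpus).
dialogue = [
    ("T", "What forces act on the ball"),
    ("S", "gravity"),
    ("T", "In which direction"),
    ("S", "straight down"),
]
features = high_level_features(dialogue)
```

Averaging such features over every dialogue in a corpus gives one value per corpus, which can then be compared between real and simulated corpora.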
Real corpora comparisons (2) • High-level dialog features
Real corpora comparisons (3) • Dialogue style features
Real corpora comparisons (3) • Dialogue success rate
Real corpora comparisons (4) • Learning gains features
Results • Differences captured by these simple metrics are not enough to conclude whether a corpus is real or not (Schatzmann et al., 2005) • Differences could be due to different user populations
Results (1) • Most of the measurements are able to distinguish between the Random and ProbCorrect models • The ProbCorrect model generates more realistic behaviors • We can’t draw conclusions about the power of these metrics, since the two simulated corpora are really different
Results (2) • Differences between the real corpora and the Random models are captured clearly, but differences between the real corpora and the ProbCorrect models are not clear • We don’t expect this simple model to produce a very realistic corpus, so it’s surprising that the differences are small
Results (3) • s05 variety > f03 variety • 05ProbCorrect variety > 03ProbCorrect variety • However, we don’t get significantly more variety in the simulated corpora than in the real ones • Possibly because the computer tutor is simple (c/ic) • We’re using the same candidate answer set
Results (4) • ProbCorrect models trained on different real corpora are quite different • A ProbCorrect model is more similar to the real corpus it was trained on than to the other real corpus
Comparisons between simulated dialogues with different dialogue structure
Results • Larger differences between the two simulated corpora in prob7 than in prob34 • Dialogue structure of prob34 is more restricted • The power of these simple metrics is restricted by the dialogue structure
Conclusions
• The simple measurements can distinguish between
  • Real corpora
    • Different populations
  • Simulated and real corpora
    • To different extents
  • Simulated corpora
    • Different models
    • Trained on different corpora
    • Limited by the dialogue structure
Future work • Explore “deep” evaluation metrics • Test simulated corpus on policy • More simulation models • More human features (emotion, learning) • Special cases (quick learners, slow learners)