Comparing Synthesized vs Pre-Recorded Tutor Speech in an Intelligent Tutoring Spoken Dialogue System

Kate Forbes-Riley, Diane Litman, Scott Silliman, and Joel Tetreault • Learning Research and Development Center, University of Pittsburgh

Outline • Overview • System and Corpora • Evaluation Metrics and Methodology • Results • Conclusions, Future Work

Overview: Motivation • Intelligent tutoring systems adding speech capabilities (e.g. LISTEN Reading Tutor, SCoT, AutoTutor) • Enhance communication richness, increase effectiveness • Question: What is the relationship between quality of speech technology and system effectiveness? • Is pre-recorded tutor voice (costly, inflexible, human) more effective than synthesized tutor voice (cheaper, flexible, non-human)? • If not, put effort elsewhere in system design!

Overview: Recent Work (mixed results) • Math Tutor System: (Non-)Visual tutor with pre-recorded voice always rated higher and yielded deeper learning. (Atkinson et al., 2005) • Instructional Plan Tutor System: Pre-recorded voice always rated more engaging. Non-visual tutor: pre-recorded voice yields more motivation. Visual tutor: synthesized voice yields more motivation. (Baylor et al., 2003) • Smart-Home System: More natural-sounding voice preferred. Characteristics (effort, pleasantness) more important than type. (Moller et al., 2006)

Overview: Our Study • Two Tutoring System Versions: pre-recorded tutor voice, synthesized tutor voice (non-visual tutor) • Evaluate Effectiveness: student learning, system usability, dialogue efficiency across corpora (subsets) • Hypothesis: more human-sounding voice will perform better • Results: tutor voice quality has only a minor impact • Does not impact learning • May impact usability and efficiency: in certain corpora subsets, pre-recorded preferred, in others, synthesized

Intelligent Tutoring Spoken Dialogue System • Back-end: text-based Why2-Atlas system (VanLehn et al., 2002)

Sphinx2 speech recognizer - Why2-Atlas performs NLP on transcript

Scrollable dialogue history is available to the student

2 ITSPOKE 2005 Corpora • Pre-Recorded voice: paid voice talent, 5.85 hours of audio, 25 hours of time (at $120/hr) • Synthesized voice: Cepstral text-to-speech system voice of “Frank” for $29.95

Example of 2 ITSPOKE Tutor Voices TUTOR TURN: Right. Let's now analyze what happens to the keys. So what are the forces acting on the keys after they are released? Please, specify their directions (for instance, vertically up). ITSPOKE Pre-Recorded Tutor Voice (PR) ITSPOKE Synthesized Tutor Voice (SYN)

Experimental Procedure • Paid subjects w/o college physics recruited via UPitt ads: • Read a small background document • Took a pretest • Worked 5 training problems (dialogues) with ITSPOKE • Took a posttest • Took a User Satisfaction Survey

ITSPOKE User Satisfaction Survey S1. It was easy to learn from the tutor. S2. The tutor interfered with my understanding of the content. S3. The tutor believed I was knowledgeable. S4. The tutor was useful. S5. The tutor was effective on conveying ideas. S6. The tutor was precise in providing advice. S7. The tutor helped me to concentrate. S8. It was easy to understand the tutor. S9. I knew what I could say or do at each point in the conversations with the tutor. S10. The tutor worked the way I expected it to. S11. Based on my experience using the tutor to learn physics, I would like to use such a tutor regularly. ALMOST ALWAYS (5), OFTEN (4), SOMETIMES (3), RARELY (2), ALMOST NEVER (1)
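The five-point response scale above maps directly to the per-statement scores used later as the S1 to S11 metrics. A minimal sketch of that mapping follows; the function name `score_survey` and the input layout (a dict from statement ID to response label) are assumptions for illustration, not part of ITSPOKE. No reverse coding is applied to the negatively worded S2, since the slides report a raw score for each statement.

```python
# Response labels and scores exactly as listed on the survey slide.
LIKERT = {
    "ALMOST ALWAYS": 5,
    "OFTEN": 4,
    "SOMETIMES": 3,
    "RARELY": 2,
    "ALMOST NEVER": 1,
}


def score_survey(responses):
    """Convert {statement_id: response_label} into {statement_id: score}.

    Hypothetical helper: labels are matched case-insensitively.
    """
    return {item: LIKERT[label.strip().upper()] for item, label in responses.items()}


# Hypothetical responses from one student to two statements.
example_scores = score_survey({"S1": "Often", "S2": "rarely"})
```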

Evaluation Metrics • Student Learning Gains • SLG (standardized gain): posttest score – pretest score • NLG (normalized gain): (posttest score – pretest score) / (1 – pretest score) • Dialogue Efficiency • TOT (time on task): total time over all 5 dialogues (min.) • System Usability • S# (S1 – S11): score for each survey statement
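The two learning-gain metrics can be sketched as below. This assumes test scores are expressed as fractions in [0, 1], which the "1 – pretest score" denominator suggests; the function names and the example scores are hypothetical.

```python
def standardized_gain(pretest, posttest):
    """SLG as defined on the slide: posttest score minus pretest score."""
    return posttest - pretest


def normalized_gain(pretest, posttest):
    """NLG: the gain divided by the room left for improvement (1 - pretest)."""
    if pretest >= 1.0:
        # A perfect pretest leaves no room to improve, so NLG is undefined.
        raise ValueError("normalized gain is undefined for a perfect pretest")
    return (posttest - pretest) / (1.0 - pretest)


# Hypothetical student: 50% on the pretest, 75% on the posttest.
slg = standardized_gain(0.5, 0.75)  # 0.25
nlg = normalized_gain(0.5, 0.75)    # 0.5
```

NLG rewards the same raw gain more when the student started closer to the ceiling, which is why both metrics are reported.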

Evaluation Methodology • For each of 14 evaluation metrics, compute a 2-tailed t-test over student set in each corpus, for 13 student sets: • All students (PR and SYN) • Students who may be more susceptible to tutor voice quality, based on 3 criteria: • Highest/High/Low/Lowest Time on Task • Highest/High/Low/Lowest Pretest Score • Highest/High/Low/Lowest Word Error Rate • High/Low Partition: criterion median in corpora • Highest/Lowest Partition: cutoffs above/below median
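The partition-then-compare procedure above might look roughly like this. It is a sketch under stated assumptions: the slides do not say how ties at the median are handled or which t-test variant was used, so the tie rule and the pooled-variance (Student) statistic here are illustrative choices, and all names and data are hypothetical. Only the t statistic and degrees of freedom are computed; a two-tailed p-value additionally needs the t distribution (for example, `scipy.stats.ttest_ind` returns both).

```python
from math import sqrt
from statistics import mean, median, variance


def median_split(values):
    """Split values into (high, low) around their median.

    Assumption: values equal to the median go to the low group.
    """
    cut = median(values)
    high = [v for v in values if v > cut]
    low = [v for v in values if v <= cut]
    return high, low


def pooled_t(sample_a, sample_b):
    """Return (t, df) for an independent two-sample t-test with pooled variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err, df


# Hypothetical time-on-task values (minutes) used as a partition criterion.
high_tot, low_tot = median_split([10, 20, 30, 40])

# Hypothetical metric values for PR and SYN students within one student set.
t_stat, dof = pooled_t([0.2, 0.4, 0.6], [0.1, 0.3, 0.5])
```

Running one such comparison per metric per student set gives the 14 × 13 grid of tests described on the slide.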

Student Learning Results • No significant difference (p < .05) in learning gains (SLG or NLG) for any of the 13 student sets • No trend toward a significant difference (p < .10) in learning gains (SLG or NLG) for any of the 13 student sets • Students learned significantly in both conditions (p < .001)

Dialogue Efficiency Results • Most knowledgeable SYN students may take more time to read transcript (PR most efficient) • PR voice marginally slower than SYN voice (e.g. in our example, PR = 13 seconds, SYN = 10 seconds)

System Usability Results (1) • S3. The tutor believed I was knowledgeable: more human-like qualities attributed to more human voice (PR preferred)

System Usability Results (2) • S11. Based on my experience using the tutor to learn physics, I would like to use such a tutor regularly: more consistent with experience, not too human (SYN preferred)

System Usability Results (3) • S2. The tutor interfered with my understanding of the content: when voice & WER human-like, notice inflexible NLU/G (SYN preferred)

Summary • Evaluate impact of pre-recorded vs. synthesized tutor voice on system effectiveness in ITSPOKE • Student Learning Results: no impact • Dialogue Efficiency Results: little impact • High Pretest students took less time with PR (trend) • System Usability: little impact, mixed voice preference • All, High Word Error Rate and Highest TOT students felt PR believed them more knowledgeable (sig, trends) • Low Word Error Rate students felt SYN interfered less (trend) • High(est) Word Error Rate students preferred SYN for regular use (trend)

Conclusions and Future Work • Tutor voice quality has minimal impact in ITSPOKE • Why is text-to-speech sufficient in ITSPOKE context? • Transcript dilutes impact of voice • Students have time to get used to voice • Future Work: • Show transcript after tutor speech/not at all • Extend survey: how often is transcript read/how much effort to understand voice (Moller et al., 2006) • Try using other voices

Thank You! Questions? Further information: http://www.cs.pitt.edu/~litman/itspoke.html
