
Large Scale Evaluation of Corpus-based Synthesizers: The Blizzard Challenge 2005


Presentation Transcript


  1. Large Scale Evaluation of Corpus-based Synthesizers: The Blizzard Challenge 2005 Christina Bennett Language Technologies Institute Carnegie Mellon University Student Research Seminar September 23, 2005

  2. What is corpus-based speech synthesis? Transcript + voice talent speech = corpus; the speech synthesizer combines the corpus with new text to produce new speech.
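
To make the "+ / =" picture on this slide concrete, here is a minimal, purely illustrative Python sketch of the corpus-based idea: voice-talent recordings aligned with their transcript form an inventory of recorded units, and new text is spoken by selecting and concatenating units. All names here (Unit, build_corpus, synthesize) are hypothetical and not taken from any challenge system.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    """One recorded fragment: the text label it realizes plus its audio samples."""
    label: str     # e.g. a word; real systems work with phones or diphones
    samples: list  # placeholder for waveform samples

def build_corpus(aligned_recordings):
    """Transcript + voice-talent speech = corpus: an inventory of labeled units.
    `aligned_recordings` is assumed to already be segmented into (label, samples) pairs."""
    corpus = {}
    for label, samples in aligned_recordings:
        corpus.setdefault(label, []).append(Unit(label, samples))
    return corpus

def synthesize(corpus, new_text):
    """Corpus + new text = new speech: pick one unit per label and concatenate.
    Real unit-selection systems score many candidates; this sketch takes the first."""
    output = []
    for label in new_text.lower().split():
        candidates = corpus.get(label, [])
        if candidates:
            output.extend(candidates[0].samples)
    return output

# Tiny usage example with fake audio samples:
corpus = build_corpus([("hello", [0.1, 0.2]), ("world", [0.3, 0.4])])
print(synthesize(corpus, "hello world"))  # [0.1, 0.2, 0.3, 0.4]
```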

  3. Need for Speech Synthesis Evaluation Motivation • Determine effectiveness of our “improvements” • Closer comparison of various corpus-based techniques • Learn about users' preferences • Healthy competition promotes progress and brings attention to the field

  4. Blizzard Challenge Goals Motivation • Compare methods across systems • Remove effects of different data by providing & requiring same data to be used • Establish a standard for repeatable evaluations in the field • [My goal:] Bring need for improved speech synthesis evaluation to forefront in community (positioning CMU as a leader in this regard)

  5. Blizzard Challenge: Overview Challenge • Released first voices and solicited participation in 2004 • Additional voices and test sentences released Jan. 2005 • 1 - 2 weeks allowed to build voices & synthesize sentences • 1000 samples from each system (50 sentences x 5 tests x 4 voices)

  6. Evaluation Methods Challenge • Mean Opinion Score (MOS) • Evaluate sample on a numerical scale • Modified Rhyme Test (MRT) • Intelligibility test with tested word within a carrier phrase • Semantically Unpredictable Sentences (SUS) • Intelligibility test preventing listeners from using knowledge to predict words
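
As a reminder of how MOS results of this kind are usually summarized (standard practice, not the challenge's actual scoring scripts), a short sketch: every listener rates each sample from 1 to 5, and ratings are averaged per system.

```python
from collections import defaultdict
from statistics import mean

def mean_opinion_scores(ratings):
    """Average 1-5 naturalness ratings per system.
    `ratings` is an iterable of (system_id, score) pairs from all listeners."""
    by_system = defaultdict(list)
    for system_id, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"MOS ratings must be on a 1-5 scale, got {score}")
        by_system[system_id].append(score)
    return {system_id: mean(scores) for system_id, scores in by_system.items()}

# Usage with made-up ratings (A-F and X were the system IDs in the challenge):
print(mean_opinion_scores([("A", 3), ("A", 4), ("X", 5), ("X", 5)]))
# {'A': 3.5, 'X': 5}
```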

  7. Challenge setup: Tests Challenge • 5 tests from 5 genres • 3 MOS tests (1 to 5 scale) • News, prose, conversation • 2 “type what you hear” tests • MRT – “Now we will say ___ again” • SUS – ‘det-adj-noun-verb-det-adj-noun’ • 50 sentences collected from each system, 20 selected for use in testing
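
The SUS template quoted on this slide can be illustrated with a tiny generator: each slot is filled independently, so the sentence stays grammatical but gives listeners no semantic context to guess from. The word lists below are invented for illustration and are not the test vocabulary.

```python
import random

# Invented word lists; the actual test used its own vocabulary.
SLOTS = {
    "det":  ["the", "a"],
    "adj":  ["green", "silent", "broad"],
    "noun": ["table", "anthem", "pencil"],
    "verb": ["devours", "paints", "follows"],
}

# The det-adj-noun-verb-det-adj-noun template from the slide.
TEMPLATE = ["det", "adj", "noun", "verb", "det", "adj", "noun"]

def semantically_unpredictable_sentence(rng=random):
    """Fill every slot independently so no word can be predicted from its neighbours."""
    return " ".join(rng.choice(SLOTS[slot]) for slot in TEMPLATE)

print(semantically_unpredictable_sentence())
# e.g. "the silent anthem devours a broad pencil"
```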

  8. Challenge setup: Systems Challenge • 6 systems: (random ID A-F) • CMU • Delaware • Edinburgh (UK) • IBM • MIT • Nitech (Japan) • Plus 1: “Team Recording Booth” (ID X) • Natural examples from the 4 voice talents

  9. Challenge setup: Voices Challenge • CMU ARCTIC databases • American English; 2 male, 2 female • 2 from initial release • bdl (m) • slt (f) • 2 new DBs released for quick build • rms (m) • clb (f)

  10. Challenge setup: Listeners Challenge • Three listener groups: • S – speech synthesis experts (50) • 10 requested from each participating site • V – volunteers (60, 97 registered*) • Anyone online • U – native US English speaking undergraduates (58, 67 registered*) • Solicited and paid for participation *as of 4/14/05

  11. Challenge setup: Interface Challenge • Entirely online http://www.speech.cs.cmu.edu/blizzard/register-R.html http://www.speech.cs.cmu.edu/blizzard/login.html • Register/login with email address • Keeps track of progress through tests • Can stop and return to tests later • Feedback questionnaire at end of tests
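
A rough sketch of the "stop and return later" behaviour described on this slide, under the assumption that progress is keyed by the listener's registered email and kept server-side; this is illustrative only, not the actual challenge interface code.

```python
# Hypothetical in-memory stores; the real interface presumably persisted these server-side.
progress = {}    # email -> index of the next test item to present
responses = {}   # email -> answers collected so far

def next_item(email, test_items):
    """Return the next unfinished item for this listener, or None when all are done."""
    index = progress.get(email, 0)
    return test_items[index] if index < len(test_items) else None

def record_response(email, answer):
    """Save an answer and advance the listener's position so they can resume later."""
    responses.setdefault(email, []).append(answer)
    progress[email] = progress.get(email, 0) + 1

# Usage: a listener answers one item, logs out, and later resumes at the second item.
items = ["MOS news 1", "MOS news 2", "SUS 1"]
print(next_item("listener@example.edu", items))  # 'MOS news 1'
record_response("listener@example.edu", "4")
print(next_item("listener@example.edu", items))  # 'MOS news 2'
```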

  12. Fortunately, Team X is clear “winner” Results

  13. Team D consistently outperforms others Results

  14. Speech experts are biased toward “optimistic” scores Results

  15. Speech experts are in fact better “experts” Results

  16. Voice results: Listener preference Results • slt is most liked, followed by rms • Type S: • slt - 43.48% of votes cast; rms - 36.96% • Type V: • slt - 50% of votes cast; rms - 28.26% • Type U: • slt - 47.27% of votes cast; rms - 34.55% • But, preference does not necessarily match test performance…

  17. Voice results: Test performance Results Female voices - slt

  18. Voice results: Test performance Results Female voices - clb

  19. Voice results: Test performance Results Male voices - rms

  20. Voice results: Test performance Results Male voices - bdl

  21. Voice results: Natural examples Results What makes natural rms different?

  22. Voice results: By system Results • Only system B consistent across listener types: (slt best MOS, rms best WER) • Most others showed group trends, i.e. (with exception of B above and F*) • S: rms always best WER, often best MOS • V: slt usually best MOS, clb usually best WER • U: clb usually best MOS and always best WER • Again, people clearly don’t prefer the voices they most easily understand
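
The WER figures referred to here come from the type-in (MRT/SUS) tests. As a reference, word error rate in its standard sense can be computed with a word-level edit distance, as in this sketch (not the challenge's own scoring code).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed with word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("now we will say cat again", "now we will say bat again"))
# 0.1666... (one substitution out of six reference words)
```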

  23. Lessons learned: Listeners Lessons • Reasons to exclude listener data: • Incomplete test, failure to follow directions, inability to respond (type-in), unusable responses • Type-in tests very hard to process automatically: • Homophones, misspellings/typos, dialectal differences, “smart” listeners • Group differences: • V most variable, U most controlled, S least problematic but not representative
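
A hedged sketch of the kind of clean-up this slide describes for typed responses, applied before scoring: lower-casing, stripping punctuation, and mapping known typos and homophones so listeners are scored on what they heard rather than how they spelled it. The specific mappings below are invented for illustration.

```python
import re

# Invented examples of the problems listed above; a real mapping would be built
# from the actual test vocabulary and the responses observed.
HOMOPHONES = {"two": "to", "too": "to", "there": "their"}
COMMON_TYPOS = {"teh": "the", "recieve": "receive"}

def normalize_response(text):
    """Lowercase, strip punctuation, fix known typos, then collapse homophones.
    The same normalization would be applied to the reference before computing WER."""
    words = re.findall(r"[a-z']+", text.lower())
    corrected = (COMMON_TYPOS.get(w, w) for w in words)
    return [HOMOPHONES.get(w, w) for w in corrected]

print(normalize_response("Teh cat ran too the house!"))
# ['the', 'cat', 'ran', 'to', 'the', 'house']
```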

  24. Lessons learned: Test design Lessons • Feedback re tests: • MOS: Give examples to calibrate scale (ordering schema); use multiple scales (lay-people?) • Type-in: Warn about SUS; hard to remember SUS; words too unusual/hard to spell • Uncontrollable user test setup • Pros & Cons to having natural examples in the mix • Analyzing user response (+), differences in delivery style (-), availability of voice talent (?)

  25. Goals Revisited Lessons • One methodology clearly outshined rest • All systems used same data allowing for actual comparison of systems • Standard for repeatable evaluations in the field was established • [My goal:] Brought attention to need for better speech synthesis evaluation (while positioning CMU as the experts)

  26. For the Future Future • (Bi-)Annual Blizzard Challenge • Introduced at Interspeech 2005 special session • Improve design of tests for easier analysis post-evaluation • Encourage more sites to submit their systems! • More data resources (problematic for the commercial entities) • Expand types of systems accepted (& therefore test types) • e.g. voice conversion
