This presentation explores the evaluation of corpus-based speech synthesis systems using data from the 2005 Blizzard Challenge. It discusses the quality of speech corpora, the approaches to building text-to-speech systems, and the criteria for evaluating their effectiveness. Key topics include the reliability of different evaluation methods, the significance of common databases, and understanding quality measures like intelligibility and naturalness. The session aims to provide insights into the trade-offs between corpus quality, system development speed, and listener evaluations across various teams and systems.
Evaluation of Corpus based Synthesizers • The Blizzard Challenge – 2005: Evaluating corpus-based speech synthesis on common datasets (Alan W. Black and Keiichi Tokuda) • Large Scale Evaluation of Corpus based Synthesizers: Results and Lessons from the Blizzard Challenge 2005 (Christina L. Bennett) • Presented by: Rohit Kumar
What are they Evaluating? • Corpus-based speech synthesis systems • Two primary elements of any such system • The corpus (high-quality speech data) • The approach used to build a text-to-speech system from it • The quality of the resulting text-to-speech system is heavily tied to the quality of the speech corpus • So how do we evaluate the approach itself? • By using a common corpus (database)
What are they Evaluating? • The quality of the approach, not how good the corpus itself is • The capability to quickly build systems given the corpus • TTS development has evolved from being a science to being a toolkit • Again, the time to create the corpus is not considered • Tug of war between the time taken to create a high-quality corpus, the manual work of fine-tuning the system, and the merit of the approach itself • Reliability of each particular evaluation method (Black & Tokuda) • Reliability of each listener group for evaluation (Black & Tokuda) (Bennett)
Alright. How to Evaluate? • Create common databases • Issues with common databases • Design parameters, size of the databases, etc. • Non-technical logistics: the cost of creating databases • Using the CMU-ARCTIC databases
Alright. How to Evaluate? • Evaluate different quality measures • "Quality" is a really broad term: intelligibility, naturalness, etc. • 5 tests of 3 types (a MOS aggregation sketch follows this slide) • 3 Mean Opinion Score (MOS) tests in different domains: novels (in-domain), news, conversation • DRT/MRT: phonetically confusable words embedded in sentences • Semantically Unpredictable Sentences (SUS) • Create common databases
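As a rough illustration (not taken from the papers), a Mean Opinion Score test boils down to averaging listener ratings per system. The minimal sketch below assumes ratings are collected as (listener, system, score) tuples on a 1–5 scale; all identifiers and values are hypothetical.

```python
# Minimal sketch of aggregating Mean Opinion Scores per system.
# Assumes ratings arrive as (listener_id, system_id, score) tuples on a
# 1-5 MOS scale; the data and names below are hypothetical placeholders.
from collections import defaultdict
from statistics import mean, stdev

ratings = [
    ("L01", "sysA", 4), ("L01", "sysB", 3),
    ("L02", "sysA", 5), ("L02", "sysB", 2),
    ("L03", "sysA", 4), ("L03", "sysB", 3),
]

by_system = defaultdict(list)
for listener, system, score in ratings:
    by_system[system].append(score)

for system, scores in sorted(by_system.items()):
    spread = stdev(scores) if len(scores) > 1 else 0.0
    print(f"{system}: MOS = {mean(scores):.2f} (sd = {spread:.2f}, n = {len(scores)})")
```

Reporting the standard deviation alongside the mean is exactly the kind of variance figure the Discussion slide notes is missing from the published numbers.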
Alright. How to Evaluate? • Evaluate different quality measures • 5 tests of 3 types • Create common databases • 6 teams = 6 systems: different approaches • A 7th system added: real human speech
Alright. How to Evaluate? • Evaluate different quality measures • 5 tests of 3 types • Create common databases • 6 teams = 6 systems: different approaches • 2 databases released in Phase 1 to develop approaches (practice databases) • Another 2 databases released in Phase 2 (with time-bounded submission)
Alright. How to Evaluate? • Evaluate different quality measures • 5 tests of 3 types • Create common databases • 6 teams = 6 systems: different approaches • 2-phase challenge • Web-based evaluation • Participants choose a test and complete it • The whole set of tests can be done over multiple sessions • 100 sentences evaluated per participant
Alright. How to Evaluate? • Evaluate different quality measures • 5 tests of 3 types • Create common databases • 6 teams = 6 systems: different approaches • 2-phase challenge • Web-based evaluation • Different types of listeners: speech experts, volunteers, US undergrads • Special incentive to take the test a 2nd time
Alright. How to Evaluate? • Evaluate different quality measures • 5 tests of 3 types • Create common databases • 6 teams = 6 systems: different approaches • 2-phase challenge • Web-based evaluation • Different types of listeners • Any questions about the evaluation setup?
Fine. So what did they get? • Evaluation of 6 systems + 1 real-speech reference • Observations: • Real speech consistently best • A lot of inconsistency across tests, but agreement on the best system • Listener groups V & U very similar for the MOS tests (a rank-correlation sketch follows this slide)
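As a rough illustration of what "very similar" could mean in practice (not taken from the papers), agreement between two listener groups can be quantified with a Spearman rank correlation over their per-system MOS scores. The scores below are hypothetical placeholders, not the actual Blizzard 2005 results.

```python
# Minimal sketch: Spearman rank correlation between two listener groups'
# per-system MOS scores (no ties assumed). Values are made-up placeholders.
def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

group_v = [3.8, 3.1, 2.9, 3.5, 2.4, 3.3]  # MOS per system, group V (hypothetical)
group_u = [3.7, 3.0, 3.1, 3.4, 2.5, 3.2]  # MOS per system, group U (hypothetical)
print(spearman(group_v, group_u))  # close to 1.0 means the groups rank systems alike
```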
Additional Agenda • Comparing voices • Exit poll • Votes for voices • Inconsistencies between votes and scores • Consistency of votes for voices across listener groups
Discussion • The numbers given are all averages; no variance figures • How consistent are the scores of each system? • Ordering of tests: participant's choice • Measuring speed of development? • Nothing in the evaluation method itself measures speed of development • Some participants who submitted papers about their systems did give those figures • Also, no control over man-hours or computational power • The approach is tested on speech quality only; issues such as how much computational effort it takes are not looked at • Web-based evaluation (Black & Tokuda) • Uncontrolled random variables: participant's environment, network connectivity • Ensuring usage of the common database (and no additional corpus) • Voice conversion: similarity tests (Black & Tokuda) • Word Error Rate calculation for phonetically ambiguous pairs? (a WER sketch follows this slide) • Non-native participants' effect on Word Error Rates (Bennett) • Homophone words (bean/been) (Bennett) • Looking back at what they were evaluating
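As a rough illustration (not from the papers), transcription-based intelligibility tests are commonly scored by comparing what the listener typed against the reference text with a Word Error Rate. The sketch below uses a plain word-level edit distance; note that homophones such as bean/been would still count as errors under this comparison, which is exactly the concern raised above. The sentences and values are hypothetical.

```python
# Minimal sketch of a Word Error Rate computation, as could be used to score
# transcriptions from the phonetically-confusable-word and SUS tests.
# Homophones (e.g. "bean"/"been") still count as errors under this plain
# string comparison, illustrating the issue raised in the discussion.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Levenshtein distance over words (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the bean was cold", "the been was bold"))  # 0.5
```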