HMM-based Singing Voice Synthesis in Mandarin: A Comprehensive Overview

Synthesis Unit and Question Set Definition for Mandarin HMM-based Singing Voice Synthesis

Outline • Introduction • Background • Motivation • Related work • Singing voice synthesis system • Evaluation • Discussion • Conclusion • Future work

Introduction- Background • Speech and singing are both important ways to communicate and express emotion • Speech synthesizers can already generate fluent, natural speech, even with speaker-specific characteristics • Singing voice synthesis has recently become a popular, emerging research topic • It enables computers to sing any song without requiring a human to actually sing it

Introduction- Background • There are two main methods in corpus-based singing synthesis • Sample-based approach: unit selection • Appropriate sub-word units are selected from large speech databases • Pros: high-quality speech at the waveform level • Cons: requires a huge amount of recorded data, discontinuities between units, unstable quality, fixed voice characteristics [Figure: sample-based synthesis pipeline. Blocks: lyrics, note, score editor, synthesis score, sample selection, concatenation, singer library, synthesis output]

Introduction- Background • Sample-based approach: unit selection • Units are chosen from a singing voice corpus using the lyrics of the song and the corresponding MIDI file [Zhou, 2008] • Vocaloid • A singing synthesizer developed by Yamaha Corporation, initially released in January 2004 • Uses pitch conversion and timbre manipulation to concatenate samples smoothly

Introduction- Background • There are two main methods in corpus-based singing synthesis • Statistical approach: HMM-based • Acoustic parameters are modeled with context-dependent HMMs, and waveforms are generated from the HMMs • Pros: relatively little training data, smooth and stable quality, flexibility to control voice characteristics • Cons: vocoder-like sound, over-smoothing [Figure: HMM-based pipeline. Blocks: singing waveform, labels, parameter extraction, singing parameters, acoustic model training, acoustic model parameters, parameter generation, waveform generation, synthesis output]

Introduction- Background • Statistical approach: HMM-based • Sinsy • A free online singing voice synthesis service that provides Japanese and English versions • Users can obtain synthesized singing voices by uploading musical scores represented in MusicXML

Introduction- Background • Other methods for singing voice synthesis • HNM (Harmonic plus Noise Model) • HNM parameters of a source syllable are used to synthesize singing syllables of diverse pitches and durations [Gu, 2008] • Speech-to-singing • The singing voice is synthesized by a parameter control model from the lyrics of a song and its musical score [Akagi, 2007] • Lyrics are converted into speech by TTS, then a melody control model converts the speech signal into a singing voice by modifying the acoustic parameters [Cai, 2011]

Introduction- Motivation • To synthesize a smooth and continuous singing voice, we chose the HMM-based method to build our singing voice synthesis system • An HMM can model the temporal sequence of a singing voice • Parameters are generated from an HMM composed by concatenating phoneme HMMs [Figure: HMM state sequence, state durations, and generated spectral and lf0 parameters]

Introduction- Improvement in Sinsy • A series of papers written by the Sinsy team • [An HMM-based Singing Voice Synthesis System, 2006] • The first paper on an HMM-based singing voice synthesis system • [HMM-based Singing Voice Synthesis System using Pitch-shifted Pseudo Training Data, 2010] • To increase the amount of F0 training data, pitch-shifted pseudo data are prepared by shifting F0 up or down in semitone steps • [Recent Development of the HMM-based Singing Voice Synthesis System – Sinsy, 2010] • Introduces the free online singing voice synthesis service • [Pitch Adaptive Training For HMM-based Singing Voice Synthesis, 2012] • Model-level normalization of pitch

Singing voice synthesis system- feature extraction • STRAIGHT [H. Kawahara, 1997] • A high-quality analysis-synthesis method that offers high flexibility in parameter manipulation with no further degradation • Extracts parameters with relatively good performance even in a non-professional recording environment • Features: pitch (F0), smoothed spectrum, aperiodicity factors [Figure: STRAIGHT analysis and synthesis. Analysis: waveform, fixed-point analysis, F0 extraction, giving F0, smoothed spectrum, and aperiodic factors. Synthesis: mixed excitation with phase manipulation, giving the synthetic waveform]

Singing voice synthesis system- Proposed method for Mandarin singing • Speech vs. Singing • Pitch contour • Database, Model definition, question set

Singing voice synthesis system- Proposed method for Mandarin singing • Speech vs. Singing • Music score • Elements marked on the example score: pitch, duration, key, tempo, beat

Singing voice synthesis system- Proposed method for Mandarin singing • Different from Sinsy • Language: from Japanese to Mandarin • Database, model definition, question sets • Refinement • Japanese syllabary: hiragana • Japanese syllables are basically formed as "consonant + vowel" • Only five vowels • Bopomofo • 37 existing symbols (21 initials, 16 finals)

Singing voice synthesis system- Proposed method for Mandarin singing [Figure: components of the proposed method. Blocks: acoustic parameters, model, question sets, linguistic info, note info, cue info, singing database. Annotations: different from Sinsy, different from TTS, only for Mandarin, specially for singing]

Singing voice synthesis system- system structure • Training phase: singing voice database; excitation, spectral, and aperiodicity parameter extraction; labels and question set; training of HMMs; context-dependent HMMs and duration models with CART-based state tying • Synthesis phase: musical score; conversion to labels; state selection by CART; parameter generation from HMMs (excitation, spectral, and aperiodicity generation); synthesis filter; synthesized singing voice

Singing voice synthesis system- Proposed method for Mandarin singing • Singing Voice Database Construction • Building a singing voice database for training and synthesis • MHMC Singing Voice Database • Mandarin singing Model definition • Initial and final modification • Medial modification • Long duration models • Question sets definition of decision trees • Modification for Mandarin • Refinements • Pitch coverage by pitch-shift pseudo data • Vibrato

Singing voice synthesis system- singing voice database construction • Singing Voice Database Construction • Singing corpus design process [Figure: corpus design flow. Blocks: music score corpus, song selection, selected scores, phonetic transcription, singing signal, segmentation by phoneme, singing database]

Singing voice synthesis system- singing voice database construction • Singing Voice Database Construction • Song selection • Selecting scores • From music books and internet versions • Selection criteria and specialization • Simple songs that do not require advanced singing skills • Phone coverage • Digitizing data • Format: MusicXML • Transposition to an appropriate pitch range

Singing voice synthesis system- Model definition • MusicXML file [Figure: sheet music score is keyed in and converted to MusicXML format] • MusicXML is an XML-based file format for representing Western musical notation. The format is proprietary, but fully and openly documented.
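
To make the MusicXML step concrete, here is a minimal sketch of reading note pitch, duration, type, and lyric from a score with Python's standard library. The score fragment is hand-written for illustration and is not taken from the corpus; the element names (`divisions`, `pitch`, `step`, `alter`, `octave`, `duration`, `type`, `lyric`) are standard MusicXML, while `read_notes` and the returned dictionary keys are hypothetical names chosen here.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written MusicXML fragment (illustrative, not from the corpus).
SCORE = """<score-partwise version="3.0">
  <part id="P1">
    <measure number="1">
      <attributes>
        <divisions>2</divisions>
        <key><fifths>0</fifths></key>
        <time><beats>4</beats><beat-type>4</beat-type></time>
      </attributes>
      <note>
        <pitch><step>C</step><octave>4</octave></pitch>
        <duration>2</duration><type>quarter</type>
        <lyric><text>wa</text></lyric>
      </note>
      <note>
        <pitch><step>F</step><alter>1</alter><octave>4</octave></pitch>
        <duration>4</duration><type>half</type>
        <lyric><text>ku</text></lyric>
      </note>
      <note><rest/><duration>2</duration><type>quarter</type></note>
    </measure>
  </part>
</score-partwise>"""


def read_notes(xml_text):
    """Return one dict per note: pitch name, length in quarter notes, type, lyric."""
    root = ET.fromstring(xml_text)
    notes = []
    divisions = 1  # MusicXML duration units per quarter note
    for measure in root.iter("measure"):
        div = measure.find("attributes/divisions")
        if div is not None:
            divisions = int(div.text)
        for note in measure.findall("note"):
            pitch = note.find("pitch")
            if pitch is None:
                name = "rest"
            else:
                alter = int(pitch.findtext("alter", default="0"))
                accidental = {1: "#", -1: "b"}.get(alter, "")
                name = pitch.findtext("step") + accidental + pitch.findtext("octave")
            notes.append({
                "pitch": name,
                "quarters": int(note.findtext("duration")) / divisions,
                "type": note.findtext("type"),
                "lyric": note.findtext("lyric/text"),  # None for rests
            })
    return notes


for n in read_notes(SCORE):
    print(n)
```

The `divisions` value is what turns the integer `duration` into a musical length, which is why it is tracked across measures rather than assumed.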

Singing voice synthesis system- singing voice database construction • Singing Voice Database Construction • Singer selection and data processing • Finding candidates to record a demo • 4 candidates • Choosing the singer • Accuracy of pitch • Timbre • Checking recorded data • Noise is not allowed • Recordings must meet the recording criteria • Segmentation and normalization • Segmentation by phoneme • Reduce the energy of the singing voice data • Avoids the singing voice suddenly becoming loud • Too large a pitch range leads to poor synthesis

Singing voice synthesis system- singing voice database • NCKU Singing Voice Database • We chose 74 songs whose lyrics together cover all Mandarin phonemes

Singing voice synthesis system- Model definition [Figure: label generation flow. Inputs: text, MusicXML, wav, transcription. Processing blocks: word to phone, extract score information, initial and final processing, long duration processing, riffs and runs processing, pause processing, note calculation, user-defined phrase units. Extracted fields: song settings, note absolute pitch, note type, measure, song structure, note pitch, note duration. Output label combines linguistic information, note information, and cue information]

Singing voice synthesis system- Model definition • Initial and final processing • Tone • In singing, the main pitch of the note matters more than the original lexical tone of the word • e.g. 不 speech->bu wuH wuL sing->bu wu • Vowel • We define the phonemes by phonology • The medial is grouped with the rime rather than with the initial • When yi(ㄧ), wu(ㄨ), or yu(ㄩ) is the medial, the medial and rime are together treated as one kind of final [Figure: speech vs. singing]

Singing voice synthesis system- Model definition • Initial and final processing • Single initial • A syllable that has only an initial and no final • It is followed by an empty rime “帀“ so that it can be pronounced • Retroflex initials: ㄓㄔㄕㄖ + zr • Flat-tongue (dental sibilant) initials: ㄗㄘㄙ + sr • 59 phonemes in total (speech: 66)

Singing voice synthesis system- singing voice database • phonetic coverage • final • initial • final contains medial

Singing voice synthesis system- Model definition • Long duration model • Long notes are important for expressive singing • Shorter notes end quickly with no special effects, whereas a long tone gives the singer more room for expression • Simply lengthening a short-note model cannot fully reproduce a long note • Half or whole note -> Final + “L” • Example lyrics: 一起飛飛就飛叫就叫
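
The "Final + L" rule can be sketched in a few lines. This is an illustration only: the note type strings follow MusicXML (`half`, `whole`), and the function name and the final symbol `ei` are hypothetical, since the slides do not list the actual model names.

```python
# Note types that the slide treats as long notes.
LONG_NOTE_TYPES = {"half", "whole"}


def final_model_name(final, note_type):
    """Append "L" to the final's model name when it is sung on a half or whole note."""
    return final + "L" if note_type in LONG_NOTE_TYPES else final


print(final_model_name("ei", "quarter"))
print(final_model_name("ei", "half"))
```

Giving long notes their own model keeps their statistics separate from short notes instead of stretching a short-note model at synthesis time.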

Singing voice synthesis system- Model definition • Riffs and runs processing • One syllable corresponds to multiple notes • Repeat the last tonal • Pause processing • Represents the breathing pause or segment pause that occurs when a human sings • When the singer pauses for longer than a threshold (> 0.3 seconds) • A rest is inserted
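
A minimal sketch of both rules follows. It makes two assumptions that the slides do not confirm: that "repeat the last tonal" means repeating the final of the syllable on each extra note, and that pauses are detected from gaps between aligned segments given as `(label, start, end)` tuples in seconds. The names `insert_pauses`, `expand_run`, and the `pau` label are hypothetical.

```python
PAUSE_THRESHOLD = 0.3  # seconds, the threshold given on the slide


def insert_pauses(segments, threshold=PAUSE_THRESHOLD):
    """Insert a pause segment wherever the gap between two segments exceeds the threshold."""
    out = []
    prev_end = None
    for label, start, end in segments:
        if prev_end is not None and start - prev_end > threshold:
            out.append(("pau", prev_end, start))
        out.append((label, start, end))
        prev_end = end
    return out


def expand_run(initial, final, n_notes):
    """One syllable over n_notes: initial + final on the first note, final alone afterwards."""
    return [[initial, final]] + [[final] for _ in range(n_notes - 1)]


segments = [("wa", 0.0, 0.5), ("ku", 0.6, 1.0), ("le", 1.5, 2.0)]
print(insert_pauses(segments))
print(expand_run("b", "u", 3))
```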

Singing voice synthesis system- Model definition • Linguistic information • Phoneme • Current phoneme, {preceding, succeeding} two phonemes • Syllable • # of phonemes in the {preceding, current, succeeding} syllable • Phrase • # of phonemes/syllables in the {preceding, current, succeeding} phrase • Song • Average # of phonemes/syllables per measure in this song • # of phrases in this song • Riffs and runs
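
The phoneme-level context (current phoneme plus two on each side) can be sketched as a five-phoneme window. The padding symbol `x` and the `LL^L-C+R=RR` string layout are assumptions borrowed from common HTS-style labels; the slides do not show the actual label format.

```python
def phoneme_contexts(phonemes, pad="x"):
    """Return (prev2, prev1, current, next1, next2) for every phoneme in the sequence."""
    padded = [pad, pad] + list(phonemes) + [pad, pad]
    return [tuple(padded[i:i + 5]) for i in range(len(phonemes))]


def to_label(ctx):
    """Format one context as an HTS-style quinphone string (assumed layout)."""
    ll, l, c, r, rr = ctx
    return f"{ll}^{l}-{c}+{r}={rr}"


for ctx in phoneme_contexts(["b", "u", "sil"]):
    print(to_label(ctx))
```

Syllable, phrase, and song counts would be appended to the same label as further fields in a full system.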

Singing voice synthesis system- Model definition • Singing is the act of producing musical sounds with the voice; it augments regular speech with both tonality and rhythm • Note pitch • Pitches are compared as "higher" and "lower" in the sense associated with musical melodies • Note duration • An amount of time or a particular time interval. It is the length of a note and one of the bases of rhythm. • Song structure • The overall musical form or structure that the song adopts • The order of the music score

Singing voice synthesis system- Model definition • User-defined phrase units • Phrasing may be necessary for the singer to take catch breaths or to achieve a certain style • In music, a phrase is defined as ”a short passage or segment, often consisting of four measures or forming part of a smaller/larger unit” • We define the phrase unit according to the song structure • Used in the outside label to represent breathing pauses [Figure: examples with 4 measures / phrase and 2 measures / phrase]

Singing voice synthesis system- Model definition • Note Calculation • The basic information is not enough to describe a note completely • Relative pitch • The difference between the key note and the current note • The key note depends on the number of sharps or flats • Note position • Different note positions in the measure or phrase may be expressed differently because of breathing • Units: note, 0.1 second, thirty-second note, % • Note length • 0.1 second (absolute length), thirty-second note (relative length)
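
The note calculations above can be sketched as follows. This assumes a major key, so that the key note's pitch class is `(7 * fifths) mod 12`, where `fifths` is the signed count of sharps (positive) or flats (negative) as in MusicXML; the slides do not say how minor keys are handled. All function names are hypothetical.

```python
STEP_TO_PITCH_CLASS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def key_note_pitch_class(fifths):
    """Pitch class of the major-key tonic: each sharp moves up a fifth (7 semitones)."""
    return (7 * fifths) % 12


def relative_pitch(step, alter, fifths):
    """Semitone distance (0-11) of the current note above the key note."""
    return (STEP_TO_PITCH_CLASS[step] + alter - key_note_pitch_class(fifths)) % 12


def note_length(quarters, tempo_bpm):
    """Note length in thirty-second notes (relative) and in 0.1 s units (absolute)."""
    seconds = quarters * 60.0 / tempo_bpm
    return {
        "thirty_second_notes": round(quarters * 8),
        "tenths_of_second": round(seconds * 10),
    }


print(relative_pitch("A", 0, 1))   # A in G major (one sharp)
print(note_length(1, 120))         # quarter note at 120 BPM
```

Using both a tempo-independent unit and a wall-clock unit lets the decision trees separate "this is a long note in the score" from "this note actually lasts a long time".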

Singing voice synthesis system- Model definition • Note information • Note Pitch • Absolute pitch (C0-G9), relative pitch (0-11), pitch difference between previous and current / current and next notes • Note Duration • Length of note by syllable, thirty-second note, 0.1 second • Song Structure • Beat: 2/4, 3/4, 4/4 • Tempo: 90, 100, 120 • Key • Position • Counted by note, 0.1 second, thirty-second note, percentage in the measure/phrase • Number of phrases

Singing voice synthesis system- Question sets definition • Question sets definition for singing model clustering • (1) Phoneme (current and {preceding, succeeding} two phonemes) • Final • With or without medial • Initial • Initials pronunciation category • Finals pronunciation category • (2) Note • Pitch • Tempo • Beat • Duration • Position • (3) Phrase • # of phonemes/syllables • Preceding, current, succeeding phrase • (4) Song • # of phonemes/syllables • # of phrases
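
As an illustration of what one phoneme-level question might look like, the sketch below generates question strings in the `QS "name" {pattern,...}` style used by HTS question files. Both the choice of that file format and the `LL^L-C+R=RR` label layout the patterns match against are assumptions; the slides name neither. The phoneme symbols and `make_question` are hypothetical.

```python
# Wildcard templates for the preceding (L), current (C), and succeeding (R) phoneme,
# assuming labels of the form LL^L-C+R=RR.
TEMPLATES = {"L": "*^{p}-*", "C": "*-{p}+*", "R": "*+{p}=*"}


def make_question(name, phonemes, position="C"):
    """Build one decision-tree question that is true if the phoneme is in the given set."""
    patterns = ",".join(TEMPLATES[position].format(p=p) for p in phonemes)
    return f'QS "{position}-{name}" {{{patterns}}}'


# Hypothetical category: finals that contain a medial.
print(make_question("Final_with_medial", ["ia", "ua"]))
print(make_question("Retroflex_initial", ["zh", "ch", "sh", "r"], position="L"))
```

Each question splits the training contexts into two groups during tree-based clustering, so the categories chosen here decide which contexts can share model parameters.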

Singing voice synthesis system- Refinement • Pitch-shift pseudo data • Pitch coverage • For pitches missing from the corpus, use nearby notes from other songs and shift them to the corresponding frequency (Hz)
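
The frequency shift itself follows from equal temperament: moving by n semitones multiplies F0 by 2^(n/12). Below is a minimal sketch that applies this to a frame-level F0 contour, assuming unvoiced frames are marked with 0; that convention and the function name are assumptions, not details from the slides.

```python
def shift_f0(f0_hz, semitones):
    """Shift voiced F0 values by a number of semitones; unvoiced frames (0) stay 0."""
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


contour = [440.0, 0.0, 392.0]
print(shift_f0(contour, 1))    # up one semitone
print(shift_f0(contour, -1))   # down one semitone
```

Only F0 is changed here; in a full system the spectral and aperiodicity parameters of the source note would be reused alongside the shifted contour.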

Singing voice synthesis system- Refinement

Evaluation • Experimental Conditions • Database condition • Mel-cepstral analysis condition

Evaluation • Experiment settings • Baseline • RQ: reduced question sets (duplicate, indirect, and relative questions removed) • PS: pitch-shift pseudo data • VP: vibrato post-processing

Evaluation- Subjective evaluation • Pitch contour • Synthesized (baseline) vs. Music score • Synthesized (baseline) vs. Original singing

Evaluation- Subjective evaluation • Mean Opinion Scores (MOS) • 10 synthesized songs • 12 subjects • Quality and intelligibility evaluation • ABX test • A subject is presented with two known samples (A, the reference, and B, the alternative). X is randomly selected from A and B, and the subject identifies X as being either A or B
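
For reference, a sketch of how such ratings are typically summarized. The scores and answers below are made-up placeholders, not results from this study, and the 95% interval uses a normal approximation, which is rough for only 12 subjects.

```python
import math
from statistics import mean, stdev


def mos(scores):
    """Return the mean opinion score and the half-width of an approximate 95% interval."""
    m = mean(scores)
    half = 1.96 * stdev(scores) / math.sqrt(len(scores)) if len(scores) > 1 else 0.0
    return m, half


def abx_accuracy(answers, truth):
    """Fraction of ABX trials in which the subject identified X correctly."""
    correct = sum(a == t for a, t in zip(answers, truth))
    return correct / len(truth)


placeholder_scores = [3, 4, 5]  # hypothetical 1-5 ratings
print(mos(placeholder_scores))
print(abx_accuracy(["A", "B", "A", "A"], ["A", "B", "B", "A"]))
```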

Evaluation- Subjective • Quality evaluation Intelligibility evaluation

Demo • Outside test (audio samples): baseline, baseline+RQ, baseline+RQ+PS • Test lyrics: 娃娃哭了叫媽媽 / 推你摔下你又站起來

Evaluation- Subjective • With the reduced question set, the quality and intelligibility scores are lower than the baseline • The questions we removed included information that is important for classification • Too few questions remain: 5364 -> 1257 • A better version of the reduced question set needs to be found

Preference test • Naturalness: testing vibrato • Different pitches and situations call for different vibrato settings • Vibrato is not essential in children's songs [Figure: original vs. vibrato]

Discussion • Singing corpus quality • Recording in professional environment • Singer’s timbre • Context factor coverage • Too blurred • Not enough training corpus • modeled with priority of singing characteristics
