Joint Prosody Prediction and Unit Selection for Concatenative Speech Synthesis

Ivan Bulyko and Mari Ostendorf
Electrical Engineering Department
University of Washington, Seattle

Limited Domain Synthesis
[Figure: the standard approach and our approach, compared on the utterance "Will you return from Seattle to Boston?"]
• Standard Approach: Concept, Canonical Pronunciation, Prosody Prediction, Prosodic Target (H* L* L* H-H%), Unit Selection (Dynamic Search over the Unit DB), Sequence of Units, Waveform Concatenation, Speech
• Our Approach: Concept, A Network of Pronunciations with alternative prosodic labels (return[H*] or return[L*+H], Seattle[L*] or Seattle[none], Boston[L*][H-H%] or Boston[H*][H-H%]), Compose with the Unit DB, Find best path, Waveform Concatenation, Speech
• Cost label shown in the figure: C(i,j)

Choice of Units and Prosodic Categories
Example: "Will you return from Seattle to Boston", labeled H* L* L* H-H%
• Why symbolic prosodic targets?
• They capture categorical perceptual differences
• Boundary Tones: L-L%, L-H%, H-L%, H-H%
• Pitch Accents:
  • high: H*, L+H*
  • low: L*, L*+H
  • downstepped: !H*, L+!H*, H+!H*

Modeling Prosody with WFSTs
[Figure: a prosody template and a prosody prediction WFST for "Will you return from Seattle to Boston", combined by union. Accent classes allowed in the example: low/high, low/none, low/high, with boundary tone H-H%.]
• Template arcs (label / cost):
  • Will you return[high] from / 0.4
  • Will you return[low] from / 1.2
  • Seattle[low] / 0.5
  • Seattle[none] / 0.9
  • to Boston[low][H-H%]
  • to Boston[high][H-H%]
• Prosody prediction arcs (label / cost):
  • from[none] / 0.2, from[low] / 1.8, from[high] / 2.2, from[ds] / 2.7
  • Seattle[none] / 1.2, Seattle[low] / 0.3, Seattle[high] / 0.8, Seattle[ds] / 2.1
  • ...

Representing Decision Trees with WFSTs
• Tree leaf F=a: P(X=s)=0.8, P(X=t)=0.2
• Tree leaf F=b: P(X=s)=0.3, P(X=t)=0.7
• WFST arcs (input:output/cost): a:s/c(0.8), a:t/c(0.2), b:s/c(0.3), b:t/c(0.7)
• c(p) = -log(p)
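The mapping from tree leaves to weighted arcs can be sketched in a few lines of Python. This is only an illustration of the slide's toy example, not the authors' implementation: the function names `cost` and `tree_to_arcs` and the dictionary layout are assumptions made here.

```python
import math

def cost(p):
    """Arc weight from the slide: c(p) = -log(p)."""
    return -math.log(p)

def tree_to_arcs(leaves):
    """Turn decision-tree leaves into WFST arcs.

    leaves maps a feature value F to a distribution P(X | F).
    Each arc is (input_symbol, output_symbol, weight).
    The data layout is hypothetical, chosen for this sketch.
    """
    arcs = []
    for feature, dist in leaves.items():
        for label, p in dist.items():
            arcs.append((feature, label, cost(p)))
    return arcs

# Leaf distributions copied from the slide.
leaves = {
    "a": {"s": 0.8, "t": 0.2},
    "b": {"s": 0.3, "t": 0.7},
}

arcs = tree_to_arcs(leaves)
for inp, out, w in arcs:
    print(f"{inp}:{out}/{w:.3f}")
```

Because the weights are negative log probabilities, summing them along a path corresponds to multiplying the probabilities, so the lowest-cost path is the most probable one.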

Modular Structure of Prosody Model
• Utterance level: Phrase Break Template Prosody WFST + Phrase Break Prosody Prediction WFST, producing phrase breaks
• Phrase level: Accent & Tone Template Prosody WFST + Accent & Tone Prosody Prediction WFST, producing accents and tones
• Other levels (if necessary)

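Representing Unit DB as WFST
[Figure: units for "Seattle to Boston". An arc labeled to:uk/C(ui,uk) connects unit ui to unit uk; the diagram also shows the neighboring units ui+1 and uk-1 and two distances, d1 and d2, marked between these units.]
• Concatenation Cost: $C(u_i, u_k) = 0.5\,(d_1 + d_2)$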
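To make the joint search concrete, here is a minimal Python sketch that picks the cheapest unit sequence when prosodic-label costs and concatenation costs are added along a path. It uses a plain Viterbi search in place of WFST composition, and every name and number in it (`best_path`, the unit ids, the toy costs and distances) is a hypothetical stand-in rather than anything from the system described in the slides.

```python
def concat_cost(d1, d2):
    """Concatenation cost from the slide: C(u_i, u_k) = 0.5 * (d1 + d2)."""
    return 0.5 * (d1 + d2)

def best_path(candidates, prosody_cost, join_cost):
    """Viterbi search over one candidate list per word.

    candidates: list (one entry per word) of lists of (unit_id, label).
    prosody_cost(word_index, label) -> cost of that prosodic label.
    join_cost(prev_unit_id, unit_id) -> concatenation cost.
    Returns (total_cost, [unit_id, ...]).
    """
    # best[u] = (cost of the cheapest path ending in unit u, that path)
    best = {u: (prosody_cost(0, lab), [u]) for u, lab in candidates[0]}
    for i in range(1, len(candidates)):
        new_best = {}
        for unit, label in candidates[i]:
            c, path = min(
                (pc + join_cost(prev, unit), pp) for prev, (pc, pp) in best.items()
            )
            new_best[unit] = (c + prosody_cost(i, label), path + [unit])
        best = new_best
    return min(best.values())

# Toy data, invented for illustration only.
candidates = [
    [("from_1", "none"), ("from_2", "low")],
    [("seattle_1", "none"), ("seattle_2", "low")],
]
label_costs = {(0, "none"): 0.2, (0, "low"): 1.8, (1, "none"): 1.2, (1, "low"): 0.3}
distances = {  # (d1, d2) for each possible join
    ("from_1", "seattle_1"): (0.2, 0.4),
    ("from_1", "seattle_2"): (2.0, 2.0),
    ("from_2", "seattle_1"): (1.0, 1.0),
    ("from_2", "seattle_2"): (0.0, 0.2),
}

total, units = best_path(
    candidates,
    lambda i, lab: label_costs[(i, lab)],
    lambda a, b: concat_cost(*distances[(a, b)]),
)
print(total, units)
```

In this toy case the cheapest prosodic label for the second word is not on the winning path, because the join into it is expensive. That trade-off between prosody and concatenation costs is what searching over both at once is meant to capture.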

Experiments
• 14 target utterances in 3 versions:
  • A. no prosody prediction; unit selection is based entirely on the concatenation costs
  • B. only one zero-cost prosodic target in the template (all others have very high and equal costs)
  • C. a prosody template that allows alternative paths weighted according to their relative frequency
• Travel domain corpus from University of Colorado (~2 hrs)
  • Automatically segmented
  • Annotated with ToBI labels (220 utterances)
• 4 subjects, all native speakers of American English
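The difference between versions B and C can be sketched as two ways of assigning costs to the prosodic label sequences in a template. The sketch below rests on two assumptions not stated on the slide: that relative frequencies are turned into costs with the same -log(p) mapping used for decision trees, and that the single zero-cost target in version B is the most frequent sequence. The counts and the `high_cost` value are invented for illustration.

```python
import math
from collections import Counter

def version_c_costs(observed):
    """Version C (assumed form): cost = -log(relative frequency)."""
    counts = Counter(observed)
    total = sum(counts.values())
    return {seq: -math.log(n / total) for seq, n in counts.items()}

def version_b_costs(observed, high_cost=100.0):
    """Version B (assumed form): one zero-cost target, here the most
    frequent sequence; every other sequence gets the same high cost."""
    counts = Counter(observed)
    target, _ = counts.most_common(1)[0]
    return {seq: 0.0 if seq == target else high_cost for seq in counts}

# Label sequences for "return ... Seattle ... Boston"; counts are made up.
seq_1 = ("H*", "L*", "L*", "H-H%")
seq_2 = ("L*+H", "none", "H*", "H-H%")
observed = [seq_1] * 3 + [seq_2]

costs_b = version_b_costs(observed)
costs_c = version_c_costs(observed)
print(costs_b)
print(costs_c)
```

Under version B the search is effectively forced onto one prosodic target, while under version C a less frequent target can still win if it leads to cheaper concatenations.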

Conclusions and Future Work
• Combining prosody prediction and unit selection improves naturalness
• The WFST architecture is
  • flexible: accommodates variable size units and different forms of prosody generation
  • efficient: composition and finding the best path are fast operations, allowing real-time synthesis
• Future work will focus on making these techniques applicable to subword units
