
Joint Prosody Prediction and Unit Selection for Concatenative Speech Synthesis

Ivan Bulyko and Mari Ostendorf, Electrical Engineering Department, University of Washington, Seattle.


Presentation Transcript


  1. Joint Prosody Prediction and Unit Selection for Concatenative Speech Synthesis. Ivan Bulyko and Mari Ostendorf, Electrical Engineering Department, University of Washington, Seattle

  2. Limited Domain Synthesis
     Example: "Will you return from Seattle to Boston?"
     • Standard approach: concept → canonical pronunciation → prosody prediction → prosodic target (H* L* L* H-H%) → unit selection (dynamic search over the unit DB) → sequence of units → waveform concatenation → speech
     • Our approach: concept → a network of pronunciations with alternative prosodic labels, e.g. return[H*] or return[L*+H], Seattle[L*] or Seattle[none], Boston[L*][H-H%] or Boston[H*][H-H%] → compose with the unit DB (costs C(i,j)) → find best path → sequence of units → waveform concatenation → speech

  3. Choice of Units and Prosodic Categories
     Example: Will you return from Seattle to Boston (H* L* L* H-H%)
     • Why symbolic prosodic targets? They capture categorical perceptual differences
     • Boundary tones: L-L%, L-H%, H-L%, H-H%
     • Pitch accents:
       • high: H*, L+H*
       • low: L*, L*+H
       • downstepped: !H*, L+!H*, H+!H*

  4. Modeling Prosody with WFSTs
     Example: Will you return (low/high) from Seattle (low/none) to Boston (low/high, H-H%)
     • Template arcs (label / cost): Will you return[high] from / 0.4; Will you return[low] from / 1.2; Seattle[low] / 0.5; Seattle[none] / 0.9; to Boston[low][H-H%]; to Boston[high][H-H%]
     • Prosody prediction arcs (label / cost): from[none] / 0.2; from[low] / 1.8; from[high] / 2.2; from[ds] / 2.7; Seattle[none] / 1.2; Seattle[low] / 0.3; Seattle[high] / 0.8; Seattle[ds] / 2.1; ...
     • The template and prosody prediction WFSTs are combined by union (+)
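
The weighted template on this slide can be read as a small lattice in which costs add along a path and the cheapest path wins. The following is a minimal sketch of that idea in plain Python, not the authors' WFST implementation. It assumes each word slot chooses its alternative independently, and because the transcript shows no costs for the Boston arcs, 0.0 is used as a placeholder. The `best_path` helper is a hypothetical name.

```python
from itertools import product

# Each slot lists (label, cost) alternatives. Costs are taken from the slide
# where legible; the Boston costs are NOT given there, so 0.0 is a placeholder.
template = [
    [("Will you return[high] from", 0.4), ("Will you return[low] from", 1.2)],
    [("Seattle[low]", 0.5), ("Seattle[none]", 0.9)],
    [("to Boston[low][H-H%]", 0.0), ("to Boston[high][H-H%]", 0.0)],
]

def best_path(slots):
    """Return (total_cost, labels) for the cheapest path through the slots.

    Costs add along a path and the minimum over all paths is kept. Ties are
    resolved in favour of the first path enumerated.
    """
    best = None
    for path in product(*slots):
        total = sum(cost for _, cost in path)
        if best is None or total < best[0]:
            best = (total, [label for label, _ in path])
    return best

cost, labels = best_path(template)
print(round(cost, 3), labels)
```

Enumerating every path is only practical for a toy lattice; a WFST toolkit would find the same minimum with a shortest-path algorithm.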

  5. Representing Decision Trees with WFSTs
     • Decision tree leaves:
       • F=a: P(X=s)=0.8, P(X=t)=0.2
       • F=b: P(X=s)=0.3, P(X=t)=0.7
     • Equivalent WFST arcs (input:output/cost): a:s/c(0.8), a:t/c(0.2), b:s/c(0.3), b:t/c(0.7)
     • Cost function: c(p) = -log(p)
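
The mapping from decision tree leaves to weighted arcs can be sketched in a few lines of Python. This is an illustrative sketch using the probabilities from the slide. The tuple layout `(input, output, cost)` and the names `leaves` and `tree_to_arcs` are assumptions made for the example, not part of any WFST library.

```python
import math

def cost(p):
    """Arc cost used on the slide: c(p) = -log(p)."""
    return -math.log(p)

# Leaf distributions from the slide: feature value F -> P(X | F).
leaves = {
    "a": {"s": 0.8, "t": 0.2},
    "b": {"s": 0.3, "t": 0.7},
}

def tree_to_arcs(leaves):
    """Turn each (feature value, predicted label) pair into one weighted arc."""
    arcs = []
    for feature, distribution in leaves.items():
        for label, p in distribution.items():
            arcs.append((feature, label, cost(p)))
    return arcs

arcs = tree_to_arcs(leaves)
for inp, out, weight in arcs:
    print(f"{inp}:{out}/{weight:.3f}")
```

Because the costs are negative log probabilities, adding costs along a path corresponds to multiplying probabilities, so the lowest-cost path is the most probable one.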

  6. Modular Structure of Prosody Model
     • Utterance level: phrase break template prosody WFST + phrase break prosody prediction WFST → phrase breaks
     • Phrase level: accent & tone template prosody WFST + accent & tone prosody prediction WFST → accents, tones
     • Other levels (if necessary)

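  7. Representing Unit DB as WFST
     Example: Seattle to Boston
     • Arc from unit u_i to unit u_k: to:u_k / C(u_i, u_k)
     • Concatenation cost: C(u_i, u_k) = 0.5 (d1 + d2)
     • [Figure: distances d1 and d2 drawn between the units u_i, u_{i+1} and u_{k-1}, u_k]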
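
To illustrate how the concatenation cost drives unit selection, here is a minimal dynamic-programming sketch in Python. It uses only concatenation costs (no prosodic costs) and an explicit Viterbi-style loop rather than WFST composition. The unit names and the d1/d2 values are invented for the example and do not come from the presentation.

```python
def concat_cost(d1, d2):
    """Concatenation cost from the slide: C(u_i, u_k) = 0.5 * (d1 + d2)."""
    return 0.5 * (d1 + d2)

# Hypothetical candidate units for each word of "Seattle to Boston".
candidates = [["seattle_1", "seattle_2"], ["to_1", "to_2"], ["boston_1"]]

# Hypothetical (d1, d2) distances for each possible join.
distances = {
    ("seattle_1", "to_1"): (0.2, 0.4),
    ("seattle_1", "to_2"): (1.0, 0.6),
    ("seattle_2", "to_1"): (0.9, 0.7),
    ("seattle_2", "to_2"): (0.1, 0.1),
    ("to_1", "boston_1"): (0.5, 0.3),
    ("to_2", "boston_1"): (1.2, 0.8),
}

def join_cost(u, v):
    return concat_cost(*distances[(u, v)])

def select_units(candidates, join_cost):
    """Return (total_cost, unit_sequence) minimising the summed join costs."""
    # best[u] = (cost of the cheapest path ending in unit u, that path)
    best = {u: (0.0, [u]) for u in candidates[0]}
    for slot in candidates[1:]:
        new_best = {}
        for v in slot:
            total, path = min(
                ((c + join_cost(u, v), p) for u, (c, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[v] = (total, path + [v])
        best = new_best
    return min(best.values(), key=lambda item: item[0])

total, units = select_units(candidates, join_cost)
print(round(total, 3), units)
```

Note that the locally cheapest join (seattle_2 to to_2) is not on the best overall path, which is why the search has to consider whole sequences rather than choosing each join greedily.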

  8. Experiments
     • 14 target utterances in 3 versions:
       A. no prosody prediction; unit selection is based entirely on the concatenation costs
       B. only one zero-cost prosodic target in the template (all others have very high and equal costs)
       C. a prosody template that allows alternative paths weighted according to their relative frequency
     • Travel domain corpus from University of Colorado (~2 hours)
       • Automatically segmented
       • Annotated with ToBI labels (220 utterances)
     • 4 subjects, all native speakers of American English

  9. Conclusions and Future Work
     • Combining prosody prediction and unit selection improves naturalness
     • The WFST architecture is
       • flexible: accommodates variable size units and different forms of prosody generation
       • efficient: composition and finding the best path are fast operations, allowing real-time synthesis
     • Future work will focus on making these techniques applicable to subword units