Vocal Conversion from Speaking voice to Singing voice Using STRAIGHT

Takeshi SAITOU (1), Masataka GOTO (1), Masashi UNOKI (2) and Masato AKAGI (2)
(1) National Institute of Advanced Industrial Science and Technology (AIST)
(2) Japan Advanced Institute of Science and Technology (JAIST)

Introduction
Our research approach focuses on
• not text-to-singing (lyric-to-singing) synthesis,
• but speech-to-singing synthesis (vocal conversion).
⇒ Clarifying acoustic differences between singing and speaking.
⇒ Developing novel applications for computer music production.

Outline of the vocal conversion system
• The vocal conversion system is based on the speech manipulation system STRAIGHT (Kawahara et al., 1998) and comprises three types of models:
  • F0 control model
  • Duration control model
  • Spectral control model

Inputs
• Speaking voice: reading the lyrics of a song.
• Synchronization information
• Musical score
[Figure: speaking-voice waveform with segment labels c and v]

1st step: F0 control
[Diagram: musical score, musical notes, melody contour, F0 contour of singing voice]
F0 control model: adding four types of F0 fluctuation to the musical notes.
• Overshoot: deflection exceeding the target note after a note change.
• Vibrato: quasi-periodic frequency modulation at 4 to 7 Hz.
• Preparation: deflection in the direction opposite to a note change, observed just before the note change.
• Fine fluctuation: irregular fluctuations higher than 10 Hz, present throughout the contour.
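The F0 control step can be illustrated with a minimal sketch. It is not the authors' model: it assumes the melody contour is a step function in log-frequency, uses a generic underdamped second-order system to produce overshoot, a plain sinusoid for vibrato, and crudely high-passed noise for fine fluctuation. Preparation is left out. All function names and parameter values (`zeta`, `omega`, `vib_cents`, `fine_cents`) are hypothetical.

```python
import numpy as np

def melody_contour(notes_hz, durations_s, fs=200):
    """Step-wise melody contour in Hz, sampled at fs frames per second."""
    return np.concatenate(
        [np.full(int(round(d * fs)), f) for f, d in zip(notes_hz, durations_s)]
    )

def second_order_response(x, fs, omega=2 * np.pi * 5.5, zeta=0.55):
    """Track x with an underdamped second-order system (zeta < 1 gives overshoot)."""
    dt = 1.0 / fs
    y = np.empty_like(x, dtype=float)
    pos, vel = float(x[0]), 0.0
    for i, target in enumerate(x):
        acc = omega ** 2 * (target - pos) - 2 * zeta * omega * vel
        vel += acc * dt          # semi-implicit Euler: update velocity first
        pos += vel * dt
        y[i] = pos
    return y

def singing_f0(notes_hz, durations_s, fs=200, vib_rate=5.5,
               vib_cents=40.0, fine_cents=5.0, seed=0):
    """Assumed sketch: overshoot + vibrato + fine fluctuation (no preparation)."""
    step = np.log2(melody_contour(notes_hz, durations_s, fs))  # octaves
    smooth = second_order_response(step, fs)
    t = np.arange(len(step)) / fs
    vibrato = (vib_cents / 1200.0) * np.sin(2 * np.pi * vib_rate * t)
    # Fine fluctuation: white noise minus its moving average (crude high-pass
    # around 10 Hz), scaled to fine_cents standard deviation.
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(step))
    win = max(1, fs // 10)
    fine = noise - np.convolve(noise, np.ones(win) / win, mode="same")
    fine *= (fine_cents / 1200.0) / (np.std(fine) + 1e-12)
    return 2.0 ** (smooth + vibrato + fine)

f0 = singing_f0([220.0, 330.0], [1.0, 1.0])
```

Working in log-frequency keeps the overshoot and vibrato depth proportional to pitch, which is a design choice of this sketch rather than something stated on the slide.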

2nd step: analysis
Speaking voice → STRAIGHT (analysis part)

3rd step: duration control
Spectral sequence and AP (aperiodicity) sequence
Duration control model:
• The consonant part is lengthened according to the fixed rate.
• The boundary part between consonant and vowel is not lengthened.
• The vowel part is lengthened so that the duration of the whole combination corresponds to the note duration.

4th step: spectral control 1
Lengthened spectral and AP sequences: spectral envelope and AP of the vowel part.
Spectral control model 1: adding a singing formant by emphasizing the peak of the spectral envelope and the dip of the AP.
Result: modified spectral envelope and AP.
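As a rough illustration of spectral control 1, the sketch below boosts a magnitude spectral envelope and deepens an AP curve around 3 kHz with a Gaussian-shaped weighting in dB. The Gaussian shape, the 12 dB gain, the 500 Hz width, and the assumption that both envelope and AP are stored as linear magnitudes are illustrative choices, not values from the slides.

```python
import numpy as np

def add_singing_formant(envelope, ap, freqs_hz,
                        center_hz=3000.0, width_hz=500.0, gain_db=12.0):
    """Emphasize the envelope peak and deepen the AP dip near center_hz.

    envelope, ap: linear-magnitude arrays for one frame (assumed representation).
    freqs_hz: frequency of each bin in Hz.
    """
    # Gaussian weighting in dB, equal to gain_db at center_hz and ~0 far away.
    bump_db = gain_db * np.exp(-0.5 * ((freqs_hz - center_hz) / width_hz) ** 2)
    new_envelope = envelope * 10.0 ** (bump_db / 20.0)   # raise the peak
    new_ap = ap * 10.0 ** (-bump_db / 20.0)              # deepen the dip
    return new_envelope, new_ap

freqs = np.linspace(0.0, 8000.0, 513)          # bin 192 lies exactly at 3000 Hz
flat_env = np.ones_like(freqs)
flat_ap = np.ones_like(freqs)
env_out, ap_out = add_singing_formant(flat_env, flat_ap, freqs)
```

In a full system this would be applied only to vowel frames, as the slide indicates.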

5th step: spectral control 2
Modified spectral envelope and AP, generated F0 contour → STRAIGHT (synthesis) → synthesized singing voice
Spectral control model 2: adding amplitude modulation (AM) of formants synchronized with vibrato, by applying AM to the amplitude envelope of the synthesized singing voice during vibrato.
Result: synthesized singing voice (final version).
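The amplitude modulation in this step can be sketched as a time-varying gain applied only inside a vibrato interval. This is a minimal example under assumptions: the modulation is a sinusoid at the vibrato rate, it starts in phase with the vibrato onset, and the depth of 0.2 is arbitrary. The function name and parameters are hypothetical.

```python
import numpy as np

def apply_vibrato_am(signal, fs, vibrato_rate_hz, depth, start_s, end_s):
    """Multiply signal by 1 + depth*sin(...) between start_s and end_s."""
    t = np.arange(len(signal)) / fs
    in_vibrato = (t >= start_s) & (t < end_s)
    gain = np.ones(len(signal))
    # AM at the vibrato rate, with phase measured from the vibrato onset.
    gain[in_vibrato] = 1.0 + depth * np.sin(
        2 * np.pi * vibrato_rate_hz * (t[in_vibrato] - start_s)
    )
    return signal * gain

fs = 1000
voice = np.ones(fs)                       # stand-in for a synthesized waveform
modulated = apply_vibrato_am(voice, fs, vibrato_rate_hz=5.0, depth=0.2,
                             start_s=0.5, end_s=1.0)
```

Samples outside the vibrato interval are left unchanged, matching the slide's "during vibrato" restriction.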

Demonstration
• Speaking voice (input): (male → female)
• Synthesized singing voice: (male → female → chorus)

Thank you!!

F0 control model (Saitou et al., 2004)

Fixed rates for lengthening the consonant part
♪ The rates are calculated as the ratio of the duration of each consonant in singing voices to its duration in read speech.
Phoneme duration can be controlled by the manner of articulation rather than the place of articulation:
• fricative: 1.28
• plosive: 1.00
• semivowel: 2.37
• nasal: 1.43
• /y/: 1.22
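The rates above can be plugged into a small duration calculation. The sketch assumes each syllable is split into a consonant segment, an unstretched transition segment, and a vowel segment that absorbs the remaining note duration. The segment names, the function, and the example durations are hypothetical; only the rate values come from the slide.

```python
# Rates from the slide, keyed by manner of articulation.
FIXED_RATES = {
    "fricative": 1.28,
    "plosive": 1.00,
    "semivowel": 2.37,
    "nasal": 1.43,
    "/y/": 1.22,
}

def lengthen_syllable(consonant_s, boundary_s, vowel_s, manner, note_s):
    """Return (consonant, boundary, vowel, vowel_rate) durations after lengthening.

    The consonant is scaled by its fixed rate, the boundary is kept as is,
    and the vowel fills whatever remains of the note duration.
    """
    new_consonant = consonant_s * FIXED_RATES[manner]
    new_vowel = note_s - new_consonant - boundary_s
    if new_vowel <= 0:
        raise ValueError("note too short for the lengthened consonant and boundary")
    return new_consonant, boundary_s, new_vowel, new_vowel / vowel_s

# Example: a nasal-initial syllable sung on a 0.5 s note.
c, b, v, v_rate = lengthen_syllable(0.05, 0.04, 0.10, "nasal", 0.5)
```

The vowel stretch factor `v_rate` is returned so the spectral and AP sequences of the vowel can be resampled by that amount.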

Spectral characteristics of the singing voice
• Singers’ formant: a remarkable spectral peak at around 3 kHz (Sundberg, 1974).
• Amplitude modulation of formants synchronized with vibrato (Hirano, 1985).
Both features are prominent in professional singing voices.

Spectral control model
• Spectral control 1: singing formant, a remarkable spectral peak at around 3 kHz.
• Spectral control 2: amplitude modulation of formants synchronized with vibrato in F0.
