Toward a high-quality singing synthesizer with vocal texture control

Hui-Ling Lu, Center for Computer Research in Music and Acoustics (CCRMA), Stanford University, Stanford, CA 94305, USA


Presentation Transcript


  1. Toward a high-quality singing synthesizer with vocal texture control. Hui-Ling Lu, Center for Computer Research in Music and Acoustics (CCRMA), Stanford University, Stanford, CA 94305, USA

  2. Score-to-Singing system
     [Block diagram labels]
     • Inputs: Score, Lyrics, Singing style
     • Rule system: Musical rules, Lyrics-to-phoneme, Co-articulation rules, Acoustic rendering
     • Control parameters: Phoneme, F0, Sound level, Duration, Vibrato
     • Sound synthesis, with a Parametric Database
     • Output: Singing voice

  3. General sound synthesis approaches
     Physical Modeling
     • Cons: analysis/re-synthesis difficult; invasive measurements
     • Pros: flexible/intuitive control; expressive; co-articulation easy
     Source-filter Model and Spectral Modeling
     • Cons: less expressive; co-articulation difficult
     • Pros: analysis/re-synthesis easy

  4. Contributions
     A pseudo-physical model for singing voice synthesis which
     • is an approximate physical model.
     • can generate high-quality non-nasal singing voice.
     • has analysis/re-synthesis ability.
     • is computationally affordable.
     • provides flexible control of vocal textures.
     An automatic analysis procedure for analysis/re-synthesis
     A parametric model for vocal texture control

  5. Outline
     • Human voice production system
     • Synthesis model
     • Analysis procedure
     • Vocal texture parametric model
     • Vocal texture control demo
     • Contributions and future directions

  6. The human voice production system
     [Figure labels: Lungs, Muscle force, Vocal folds, Pharyngeal cavity, Tongue hump, Oral cavity, Velum, Nasal cavity, Oral sound output, Nasal sound output]

  7. Oscillation pattern of the vocal folds
     [Figure labels: Opening period, Closing period, Close phase, Open phase]
     • The oscillation results from the balancing of the subglottal pressure, the Bernoulli pressure and the elastic restoring force of the vocal folds.
     • Prephonatory position: the initial configuration of the vocal folds before the beginning of oscillation.

  8. Variation of vocal textures: Pressed, Normal, Breathy

  9. Simplified human voice production model
     [Block diagram labels: Glottal Source, Aspiration noise, Vocal Tract Filter, Radiation]
     • Source-tract interaction: the glottal waveform in general depends on the vocal tract configuration.
     • The source-tract interaction is neglected, since the glottal impedance is very high most of the time.

  10. Source-filter type synthesis model
     [Block diagram labels, production model: Glottal Source, Aspiration noise, Vocal Tract Filter, Radiation]
     [Block diagram labels, synthesis model: Glottal excitation (Derivative Glottal Wave, Aspiration noise), Filter (Vocal Tract Filter), Voice output]

  11. Overview of the proposed synthesis model
     [Block diagram labels]
     • Glottal excitation: derivative glottal wave (Transformed Liljencrants-Fant Model) and high-passed aspiration noise (Noise Residual Model)
     • Filter: All Pole Filter
     • Voice output

  12. Transformed Liljencrants-Fant (LF) model
     [Figure: derivative glottal wave from the LF model; three panels (pressed, normal and breathy phonation); amplitude versus time index]
     • The transformed LF model controls the wave shape of the derivative glottal wave via a single parameter, Rd (wave-shape control parameter).

  13. Transformed Liljencrants-Fant (LF) model
     • The transformed LF model is an extension of the LF model. It provides a control interface for the LF model to change the wave shape of the derivative glottal wave easily.
     • Synthesis: wave shape control parameter Rd, then Mapping, then direct synthesis timing parameters, then LF model, then derivative glottal wave.
     • Analysis: estimated derivative glottal wave, then LF fitting, then direct synthesis timing parameters, then inverse Mapping, then Rd.
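The mapping between Rd and the direct synthesis timing parameters can be sketched as follows. The slides do not give the mapping itself. The regression coefficients below are the ones commonly attributed to Fant's 1995 description of the transformed LF model, so treat them, along with the function names, as assumptions rather than as the author's implementation.

```python
def rd_to_r_params(rd):
    """Forward mapping Rd -> (Ra, Rk, Rg).

    Coefficients are assumed (Fant-style regressions), not taken from the slides.
    """
    ra = (-1.0 + 4.8 * rd) / 100.0
    rk = (22.4 + 11.8 * rd) / 100.0
    shape = 0.5 + 1.2 * rk
    # Solve Rd = (1/0.11) * shape * (Rk / (4 Rg) + Ra) for Rg.
    rg = rk * shape / (4.0 * (0.11 * rd - ra * shape))
    return ra, rk, rg


def r_params_to_rd(ra, rk, rg):
    """Inverse mapping, as would be used on the analysis side after LF fitting."""
    return (1.0 / 0.11) * (0.5 + 1.2 * rk) * (rk / (4.0 * rg) + ra)


def r_params_to_timing(ra, rk, rg, t0):
    """Convert the R parameters to LF timing parameters for a period t0 (seconds)."""
    tp = t0 / (2.0 * rg)   # instant of peak glottal flow
    te = tp * (1.0 + rk)   # instant of maximum excitation
    ta = ra * t0           # effective duration of the return phase
    return tp, te, ta


if __name__ == "__main__":
    ra, rk, rg = rd_to_r_params(1.0)
    print(r_params_to_timing(ra, rk, rg, t0=1.0 / 220.0))
```

Under these assumed regressions, a single number Rd fixes all three timing parameters, which is what makes it usable as a one-knob wave-shape control.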

  14. Liljencrants-Fant (LF) model

  15. Transformed Liljencrants-Fant (LF) model
     • The transformed LF model is an extension of the LF model. It provides a control interface for the LF model to change the wave shape of the derivative glottal wave easily.
     • Synthesis: wave shape control parameter Rd, then Mapping, then direct synthesis timing parameters, then LF model, then derivative glottal wave.
     • Analysis: estimated derivative glottal wave, then LF fitting, then direct synthesis timing parameters, then inverse Mapping, then Rd.

  16. Noise residual model
     [Block diagram labels: Gaussian Noise Generator, Amplitude Modulation, summing node (+), Noise residual; modulation amplitude An, noise floor Bn, lag L relative to the GCI]
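A minimal sketch of one way to read the noise residual diagram: Gaussian noise whose amplitude envelope is a constant floor Bn plus a pitch-synchronous bump of height An, placed a lag L after each glottal closure instant (GCI). The Hann window shape, the `duty_cycle` argument and all function names are assumptions made for illustration; the slide only names the blocks.

```python
import math
import random


def modulation_envelope(period, a_n, b_n, duty_cycle, lag):
    """Envelope for one glottal cycle of `period` samples.

    Assumed shape: noise floor b_n everywhere, plus a Hann-shaped bump of
    height a_n covering `duty_cycle` of the period, starting `lag` samples
    after the GCI (wrapping around the cycle if needed).
    """
    width = max(1, int(round(duty_cycle * period)))
    env = [b_n] * period
    for i in range(width):
        w = 0.5 - 0.5 * math.cos(2.0 * math.pi * (i + 0.5) / width)
        env[(lag + i) % period] += a_n * w
    return env


def noise_residual(gci, a_n, b_n, duty_cycle, lag, seed=0):
    """Amplitude-modulated Gaussian noise between consecutive GCIs (sample indices)."""
    rng = random.Random(seed)
    out = []
    for start, end in zip(gci[:-1], gci[1:]):
        env = modulation_envelope(end - start, a_n, b_n, duty_cycle, lag)
        out.extend(e * rng.gauss(0.0, 1.0) for e in env)
    return out


if __name__ == "__main__":
    x = noise_residual([0, 100, 205, 300], a_n=2.0, b_n=0.5, duty_cycle=0.4, lag=10)
    print(len(x))
```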

  17. Vocal tract filter
     • An all-pole filter.
     • The vocal tract is assumed to be a series of concatenated uniform lossless cylindrical acoustic tubes.
     • Sound waves are assumed to propagate as plane waves along the axis of the vocal tract.
     [Figure: tube model from the glottis to the lip end, with section areas A1, A2, ..., AN, Alip, volume velocities Ug and Ulip, and branch gains 1-kN, -kN and -1]

  18. Vocal tract filter
     Kelly-Lochbaum junction:
     [Figure: scattering junction between tube sections of area Am and Am+1, with scattering coefficient km, branch gains 1+km, 1-km, km and -km, and forward (+) and backward (-) components of Um and Um+1]
     • τ: the propagation time for a sound wave to travel one acoustic tube. N: the number of acoustic tubes excluding the glottis and the lip end.
     • If the sampling period T = 2τ, the transfer function of the vocal tract acoustic tubes can be shown to be an Nth-order all-pole filter.
     • The autoregressive coefficients of the vocal tract filter can be converted to scattering coefficients by Durbin's method.
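The link between tube areas, scattering coefficients and all-pole coefficients can be sketched in a few lines of Python. The sign convention k_m = (A_m - A_{m+1}) / (A_m + A_{m+1}) is an assumption (the slide's own formula did not survive, and texts differ on the sign). The conversions are the standard step-up and step-down forms of the Levinson-Durbin recursion, and the function names are hypothetical.

```python
def areas_to_reflection(areas):
    """Scattering coefficients from adjacent tube areas (assumed sign convention)."""
    return [(a0 - a1) / (a0 + a1) for a0, a1 in zip(areas[:-1], areas[1:])]


def step_up(k):
    """Scattering coefficients -> all-pole polynomial [1, a1, ..., aN]."""
    a = [1.0]
    for km in k:
        ext = a + [0.0]
        # a_i(m) = a_i(m-1) + k_m * a_{m-i}(m-1)
        a = [x + km * y for x, y in zip(ext, reversed(ext))]
    return a


def step_down(a):
    """All-pole polynomial -> scattering coefficients (backward recursion)."""
    a = [x / a[0] for x in a]
    k = []
    while len(a) > 1:
        km = a[-1]                      # highest-order coefficient is k_m
        k.append(km)
        denom = 1.0 - km * km           # requires |k_m| < 1
        a = [(x - km * y) / denom for x, y in zip(a, reversed(a))][:-1]
    return k[::-1]


if __name__ == "__main__":
    areas = [2.6, 8.0, 10.5, 3.2, 1.0]  # made-up areas, for illustration only
    k = areas_to_reflection(areas)
    print(step_up(k))
    print(step_down(step_up(k)))
```

Positive areas always give |k_m| < 1, which keeps the step-down recursion well defined and the resulting all-pole filter stable.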

  19. Overall synthesis model implementation
     [Block diagram labels]
     • Inputs: degree of breathiness, glottal excitation strength Ee, fundamental frequency F0
     • Vocal texture model, which supplies Rd
     • Transformed LF model, driven by Ee, F0 and Rd
     • Noise residual model, a gain of 0.8 and a summing node (+)
     • Output voice
     • Annotation: (No noise input)

  20. Analysis procedure
     [Block diagram labels]
     • Desired voice recording
     • Source-filter de-convolution
     • Inverse filtered glottal excitation
     • De-noising by Wavelet Packet Analysis
     • High-passed aspiration noise
     • Fitting the estimated derivative glottal wave via LF model
     • LF model coefficients

  21. Source-filter de-convolution
     • Synthesis model for analysis: KLGLOTT88 (KL) derivative glottal wave.
     [Block diagram labels: Basic Voicing Waveform (a, b, OQ), Low-pass filter, Nth-order all-pole vocal tract filter; the last two combine into an (N+1)th-order all-pole filter driven by the Basic Voicing Waveform (a, b, OQ)]
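For orientation, the KLGLOTT88 basic voicing waveform is usually written as a cubic flow pulse g(t) = a t^2 - b t^3 over the open phase, followed by zero flow. The sketch below assumes that standard form and works in sample units. It derives `a` from `b` through the closure condition g(OQ * T0) = 0, which is a common way to tie the two parameters together and may not be exactly how the slides parameterize the problem. The function name is hypothetical.

```python
def kl_basic_voicing(period, oq, b):
    """One cycle of an assumed KLGLOTT88-style pulse, in sample units.

    period : samples per glottal cycle (T0)
    oq     : open quotient, 0 < oq <= 1
    b      : cubic coefficient; a is derived so the flow returns to 0 at closure
    Returns (a, flow, derivative_flow).
    """
    n_open = int(round(oq * period))
    a = b * n_open                      # closure condition: a*n^2 - b*n^3 = 0 at n = n_open
    flow, dflow = [], []
    for n in range(period):
        if n <= n_open:
            flow.append(a * n * n - b * n ** 3)
            dflow.append(2.0 * a * n - 3.0 * b * n * n)
        else:                           # closed phase: no flow
            flow.append(0.0)
            dflow.append(0.0)
    return a, flow, dflow


if __name__ == "__main__":
    a, flow, dflow = kl_basic_voicing(period=100, oq=0.6, b=1e-6)
    print(a, max(flow), min(dflow))
```

With this form the flow peaks two thirds of the way through the open phase, and the derivative has its sharpest negative value at the instant of closure.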

  22. Source-filter de-convolution
     • Synthesis model for analysis: KLGLOTT88 (KL) derivative glottal wave.
     [Block diagram labels: Basic Voicing Waveform (a, b, OQ), Low-pass filter, Nth-order all-pole vocal tract filter; equivalently, Basic Voicing Waveform (a, b, OQ) driving an (N+1)th-order all-pole filter]

  23. Source-filter deconvolution estimation flowchart
     Input: voice signal after removing the low-frequency drift
     GCI detection, giving one glottal period of signal at a time
     Phase I
     • Loop for each period:
       • Loop over different OQ values: vocal tract filter and glottal source estimation via SUMT
       • Select and store the 5 best estimates
     • Loop for each period: enforce continuity constraints via Dynamic Programming
     Phase II
     • Smooth the vocal tract area by time averaging and linear interpolation
     Output: estimated model parameter sequence
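The dynamic programming step in Phase I can be illustrated with a small Viterbi-style search that picks one candidate per glottal period. The cost used here (each candidate's fit error plus a weighted squared jump between consecutive parameter vectors) is an assumption chosen for illustration. The slides state only that continuity constraints are enforced, so the actual cost terms may differ.

```python
def select_continuous_path(candidates, smooth_weight=1.0):
    """Pick one candidate index per period, minimizing fit error plus jump cost.

    candidates[t] is a list of (fit_error, params) tuples for period t,
    e.g. the 5 best estimates kept per period. Returns a list of indices.
    """
    def jump(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    cost = [err for err, _ in candidates[0]]
    back = []
    for t in range(1, len(candidates)):
        new_cost, ptr = [], []
        for err, params in candidates[t]:
            # Best predecessor for this candidate.
            totals = [cost[j] + smooth_weight * jump(candidates[t - 1][j][1], params)
                      for j in range(len(cost))]
            best_j = min(range(len(totals)), key=totals.__getitem__)
            new_cost.append(totals[best_j] + err)
            ptr.append(best_j)
        cost = new_cost
        back.append(ptr)

    # Trace back from the cheapest final state.
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return path[::-1]


if __name__ == "__main__":
    cands = [
        [(0.1, [1.0]), (0.0, [5.0])],
        [(0.1, [1.1]), (0.0, [9.0])],
        [(0.1, [1.0]), (0.0, [5.0])],
    ]
    print(select_continuous_path(cands, smooth_weight=1.0))
    print(select_continuous_path(cands, smooth_weight=0.0))
```

Setting `smooth_weight` to zero reduces the search to picking the best-fitting candidate in each period independently, which makes the effect of the continuity term easy to see.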

  24. Convex optimization formulation
     [Block diagram labels: Basic Voicing Waveform (a, b, OQ), (N+1)th-order all-pole filter, Inverse filter]
     • Estimate the parameters by minimizing the error between the basic voicing waveform and the estimated one.

  25. Convex optimization formulation
     • Error for one glottal cycle in vector form: [equation not preserved]
     • A convex optimization problem: minimize [objective not preserved] subject to [constraints not preserved].
     • The L2 norm is used.
     • The above problem can be solved by SUMT (sequential unconstrained minimization technique).

  26. De-convolution result (synthetic data)

  27. Effective analysis/re-synthesis
     [Block diagram labels: KLGLOTT88 (KL) derivative glottal wave, Basic Voicing Waveform (a, b, OQ), Low-pass filter, Nth-order all-pole vocal tract filter]
     Baritone examples:
     • Normal phonation: original, KLGLOTT88
     • Pressed phonation: original, KLGLOTT88

  28. Analysis procedure
     [Block diagram labels]
     • Desired voice recording
     • Source-filter de-convolution
     • Inverse filtered glottal excitation
     • De-noising by Wavelet Packet Analysis
     • High-passed aspiration noise
     • Fitting the estimated derivative glottal wave via LF model
     • LF model coefficients

  29. De-noising by Wavelet Packet Analysis
     De-noising by best basis thresholding:
     • A noisy data record: X = f + W
     • Transform the noisy data to another basis via Wavelet Packet Analysis: XB = fB + WB
     • Threshold out the smaller coefficients of XB, assuming that f can be compactly represented in the new basis by a few large coefficients.
     • Select the wavelet filter by the energy compactness criterion: 1/(number of coefficients needed to accumulate 0.9 of the total energy).
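The energy compactness criterion and the thresholding step can be sketched as below. To stay self-contained, the example uses a plain Haar wavelet transform as a stand-in basis; the slides use wavelet packet best-basis selection, which this sketch does not implement. The function names and the hard-threshold rule are assumptions.

```python
import math


def haar_transform(x):
    """Orthonormal Haar wavelet transform; len(x) must be a power of two."""
    s = 1.0 / math.sqrt(2.0)
    approx, details = list(x), []
    while len(approx) > 1:
        pairs = range(0, len(approx), 2)
        detail = [(approx[i] - approx[i + 1]) * s for i in pairs]
        approx = [(approx[i] + approx[i + 1]) * s for i in pairs]
        details = detail + details      # coarser levels go in front
    return approx + details


def energy_compactness(coeffs, fraction=0.9):
    """1 / (number of largest coefficients needed to reach `fraction` of the energy)."""
    energies = sorted((c * c for c in coeffs), reverse=True)
    target = fraction * sum(energies)
    acc = 0.0
    for count, e in enumerate(energies, start=1):
        acc += e
        if acc >= target:
            return 1.0 / count
    return 1.0 / len(energies)


def hard_threshold(coeffs, thr):
    """Zero every coefficient whose magnitude is below thr."""
    return [c if abs(c) >= thr else 0.0 for c in coeffs]


if __name__ == "__main__":
    x = [4.0, 2.0, -1.0, 3.0, 0.5, 0.5, 7.0, -2.0]
    c = haar_transform(x)
    print(energy_compactness(c), hard_threshold(c, 1.0))
```

A higher compactness score means fewer coefficients carry most of the energy, which is the property the thresholding step relies on when comparing candidate wavelet filters.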

  30. De-noising result (synthetic data)

  31. Analysis procedure
     [Block diagram labels]
     • Desired voice recording
     • Source-filter de-convolution
     • Inverse filtered glottal excitation
     • De-noising by Wavelet Packet Analysis
     • High-passed aspiration noise
     • Fitting the estimated derivative glottal wave via LF model
     • LF model coefficients

  32. Effective analysis/re-synthesis
     Baritone examples:
     • Normal phonation: original, LF
     • Pressed phonation: original, LF

  33. Vocal texture control
     • The parametric vocal texture control model determines the parameterizations of the glottal excitation to achieve the desired vocal texture.
     • Reduce the control complexity by exploring the correlations between the model parameters.
     [Block diagram labels]
     • Non-breathy mode: desired vocal texture and glottal excitation strength Ee determine the wave shape control parameter Rd for the transformed LF model (mapping marked "?")
     • Breathy mode: desired vocal texture, Ee and Rd determine the settings of the noise residual model (mapping marked "?")

  34. Vocal texture control (non-breathy mode)
     Pressed and normal modes:
     • The wave-shape control parameter Rd and the normalized glottal excitation strength Ee are highly correlated.

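  35. Vocal texture control (non-breathy mode)
     [Block diagram labels]
     • Degree of pressedness drives an interpolation between (a_press, b_press, c_press) and (a_normal, b_normal, c_normal), giving (a, b, c)
     • Glottal excitation strength Ee and (a, b, c) give the wave shape control parameter Rd
     • Rd drives the transformed LF model, which produces the glottal excitation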

  36. Vocal texture control (breathy mode)
     • NHR per glottal cycle = (high-passed noise energy) / (glottal excitation strength Ee)
     • NHR is an indicator for the degree of breathiness.
     • The contour of the noise strength is adjusted by NHR.
     [Block diagram labels: Desired vocal texture, NHR, Ee, Rd, Transformed LF model, Noise residual model (Bn = 1, An = 2.4138 * Bn + 0.213, duty cycle, window lag), gain, summing node (+), Glottal excitation]
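As a rough illustration of the breathy-mode control, the sketch below computes the modulation amplitude An from the noise floor Bn using the linear relation on the slide, then finds the gain that scales a unit-floor noise cycle to a target NHR. Reading NHR as noise energy divided by Ee, and measuring energy as a sum of squared samples, are assumptions. The function names are hypothetical.

```python
import math


def modulation_amplitude(b_n):
    """An as a linear function of the noise floor Bn (coefficients from the slide)."""
    return 2.4138 * b_n + 0.213


def noise_gain_for_nhr(unit_noise_cycle, target_nhr, ee):
    """Gain that makes (energy of scaled noise) / Ee equal target_nhr.

    unit_noise_cycle: one glottal cycle of noise generated with Bn = 1.
    Energy is taken as the sum of squared samples (an assumption).
    """
    energy = sum(x * x for x in unit_noise_cycle)
    if energy == 0.0:
        return 0.0
    return math.sqrt(target_nhr * ee / energy)


if __name__ == "__main__":
    cycle = [0.5, -1.0, 0.25, 2.0]      # placeholder noise samples
    print(modulation_amplitude(1.0))
    print(noise_gain_for_nhr(cycle, target_nhr=0.1, ee=3.0))
```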

  37. Overall synthesis model implementation
     [Block diagram labels]
     • Inputs: degree of breathiness, glottal excitation strength Ee, fundamental frequency F0
     • Vocal texture model, which supplies Rd
     • Transformed LF model, driven by Ee, F0 and Rd
     • Noise residual model, a gain of 0.8 and a summing node (+), giving the glottal excitation
     • Output voice

  38. Vocal texture control demo

  39. Contributions
     A pseudo-physical model for singing voice synthesis which
     • is an approximate physical model.
     • can generate high-quality non-nasal singing voice.
     • has analysis/re-synthesis ability.
     • is computationally affordable.
     • provides flexible control of vocal textures.
     An automatic analysis procedure for analysis/re-synthesis
     A parametric model for vocal texture control

  40. Future research
     • Build a complete score-to-singing system using the proposed synthesis model. Its associated analysis procedure will be used to construct the parametric database.
     • Investigate the potential use of the source-filter deconvolution algorithm for low-bit-rate, high-quality speech coding.
     • Explore the application of the analysis procedure to sound transformation of vocal textures.

  41. Thank you!
