Christophe d’Alessandro LIMSI-CNRS Orsay, France

New paradigms for speech analysis and processing: the source-filter model revisited and gesture-controlled analysis-by-synthesis
Christophe d’Alessandro, LIMSI-CNRS, Orsay, France
Speech Analysis and Processing for Knowledge Discovery

Acknowledgements
The contributions of Boris Doval, Nathalie Henrich, Baris Bozkurt, Thierry Dutoit, Nicolas Sturmel, Albert Rilliard and Sylvain Le Beux are gratefully acknowledged.

Voice, Speech, Singing, Meaning and Expression
Functions of voice in communication:
• Linguistic and pragmatic functions: to convey linguistic meaning (ideas, concepts, facts …) and to perform speech acts (command, promise …). Mainly associated with phonemes and words (double articulation). Can be noted in writing.
• Expressive function: to make attitudes, feelings, emotions, personality and mood audible. Speech beyond (or below) linguistic meaning: “the music of speech”. Mainly associated with prosody and voice quality. Difficult to note in writing.
• Musical function: singing, a non-linguistic but highly structured form of communication.

Challenges in speech processing
• Ubiquitous speech processing, human-machine dialogue
• Expressive speech synthesis (speaking machines vs. reading machines)
• Recognition of emotions, attitudes, moods, aging: robustness in automatic speech recognition
• Speaker-dependent features: speaker identification, voice aging
• Voice pathology and diagnosis

Static and dynamic features in speech
Knowledge discovery in speech analysis and processing is based on both static and dynamic features of speech signals.
• Static features correspond to the parameters of a model, or “settings”.
• Dynamic features correspond to parameter trajectories, or “gestures”.

Content of the talk
• Introduction
• Voice source, voice quality and parameter estimation
• Real-time instruments for synthesis of voice quality, and real-time analysis/modification/synthesis of prosody

Emergence of voice quality in speech studies
Until recently, voice quality and its functions in speech communication were only marginally considered in the speech communication community. However, there is some evidence that voice quality settings and voice quality modulations play a central role in human voice-based communication, i.e. speech, singing and other kinds of expressive vocalizations.

Voice source analysis
Voice source analysis is an important but difficult issue for speech processing. Applications:
• Source-tract decomposition (formant estimation, front-end for speech recognition, low-rate coding, etc.)
• Voice source parameter estimation (prosodic analysis, diagnosis, voice quality, speaker characterisation, singing …)
• Pitch period marking (speech synthesis, prosodic analysis, pitch-synchronous processing …)

Voice source analysis
Voice source analysis is an important but difficult issue for speech processing. Difficulties:
• No reference for the “true” source and vocal tract components
• Rapidly time-varying signals
• Wide individual and inter-subject variations
• Source-tract interactions

The source-filter model revisited
Two aspects of voice source analysis recently developed at LIMSI (Orsay) and FPMs (Mons) are discussed:
• Causal-anticausal linear model (CALM) of glottal flow signals (Doval, d’Alessandro, Henrich, 2003, 2006)
• Zeros of the Z-Transform (ZZT) speech representation (Bozkurt, Doval, d’Alessandro, Dutoit, Sturmel, 2005, 2006, 2008)

The spectrum of glottal flow models: Causal-Anticausal Linear Model (CALM)
(Doval, d’Alessandro, Henrich, Acta Acustica united with Acustica 2006; ISCA-ITRW Voqual’03)

Voice source models
• Signal models of glottal waveforms: a few acoustic parameters describing the shape of one cycle of the glottal flow
• Physical models of speech production: more than 19 physical parameters governing the behavior of the glottis
[Figure: 2-mass model (IF 1972), with labels k1, r1, k2, r2, kc, d1, d2, lg, Psubglottal]

Glottal flow models
• Rosenberg C (Rosenberg, JASA 1971)
• LF model (Liljencrants, Fant, Lin, KTH-STL, 1985)
• KLGLOTT88 (Klatt & Klatt, JASA 1990)

Glottal flow models: time domain
Examples:
• Rosenberg C (Rosenberg, 1971)
• LF (Liljencrants & Fant, 1985)
• Klatt (Klatt & Klatt, 1990)
• R++ (Veldhuis, 1998)

A unified set: 5 time-domain parameters (Doval, d’Alessandro & Henrich, Acta Acustica 2006)
• T0, fundamental period
• Av, voiced amplitude
• Oq, open quotient
• am, asymmetry coefficient (equivalent to speed quotient)
• Qa, return phase quotient
Other parameters of interest:
• J, total flow of a single pulse
• E, negative peak amplitude of the glottal flow derivative

Voice quality dimensions
Four main dimensions:
• Voice registers (voice “mechanisms”): creak, modal, falsetto, whistle
• Noise: breathiness, hoarseness
• Pressure: pressed/lax voice, “strangled” tones
• Effort: accentuation, force

Time-domain equations
In the case of Qa = 0 (abrupt closure), the glottal flow models (GFM) can all be expressed in a common form [equation not preserved in the transcript], where the normalized glottal flow model ng(x, am) depends on the model.
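As an illustration of such a common form, the sketch below assumes that the flow during the open phase is the normalized shape ng scaled in amplitude by Av and in time by Oq·T0, i.e. Ug(t) = Av·ng(t / (Oq·T0)). This scaling is an assumption made for the example, since the equation itself is not preserved here. The normalized shape is the KLGLOTT88 cubic, whose asymmetry is fixed at am = 2/3; the function names and numeric settings are hypothetical.

```python
def ng_klglott88(x):
    """Normalized KLGLOTT88 pulse shape on 0 <= x <= 1 (zero elsewhere).

    The cubic 27/4 * x^2 * (1 - x) peaks at x = 2/3 with value 1,
    so its asymmetry coefficient is fixed at am = 2/3.
    """
    if x < 0.0 or x > 1.0:
        return 0.0
    return 6.75 * x * x * (1.0 - x)


def glottal_cycle(t0, av, oq, fs):
    """One period of glottal flow with abrupt closure (Qa = 0).

    Assumed form (not taken from the slides): Ug(t) = av * ng(t / (oq * t0)).
    t0 is the period in seconds, fs the sampling rate in Hz.
    """
    n_samples = int(round(t0 * fs))
    open_duration = oq * t0
    return [av * ng_klglott88((n / fs) / open_duration) for n in range(n_samples)]


# Hypothetical settings: 100 Hz voice, open quotient 0.6, 16 kHz sampling.
cycle = glottal_cycle(t0=0.01, av=1.0, oq=0.6, fs=16000)
peak_index = max(range(len(cycle)), key=cycle.__getitem__)
print(len(cycle), peak_index, round(cycle[peak_index], 3))
```

Changing oq stretches or compresses the open phase without changing the pulse shape, while av only rescales it; a model with adjustable asymmetry would need a different ng.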

Glottal flow models: frequency domain
• Glottal flow: [equation not preserved in the transcript]
• Glottal flow derivative: [equation not preserved in the transcript]
• Ng(x, am): Fourier transform of ng(x, am)
• N’g(x, am): Fourier transform of n’g(x, am)
These two functions depend on the model.

Glottal flow models: spectral description
• “Glottal formant”: [equation not preserved in the transcript]
• Spectral slope: [equation not preserved in the transcript]

Spectral / time domain: open quotient, asymmetry

Spectral / time domain: spectral tilt
Effect of E and spectral tilt

Glottal flow model and anticausal linear filter

Causal-anticausal linear voice source model (CALM) (Doval, d’Alessandro, Henrich, 2003)
[Figure panels: anticausal filter; causal filter; convergence region for a stable CALM; frequency response; glottal pulse (CALM vs. R++)]
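The causal/anticausal idea can be sketched in a few lines. The example below is a simplified illustration, not the published CALM algorithm: it assumes a second-order resonator applied anticausally (by filtering the time-reversed signal and reversing the result) for the open phase, followed by a causal first-order low-pass for the return phase. All function names, frequencies and bandwidths are hypothetical.

```python
import math


def two_pole_resonator(x, freq, bandwidth, fs):
    """Stable causal resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth / fs)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / fs)
    a2 = -r * r
    y = []
    y1 = y2 = 0.0
    for sample in x:
        out = sample + a1 * y1 + a2 * y2
        y.append(out)
        y1, y2 = out, y1
    return y


def anticausal_resonator(x, freq, bandwidth, fs):
    """Anticausal version: filter the time-reversed input, then reverse the output."""
    return two_pole_resonator(x[::-1], freq, bandwidth, fs)[::-1]


def one_pole_lowpass(x, cutoff, fs):
    """Causal first-order low-pass, used here as a spectral-tilt filter."""
    a = math.exp(-2.0 * math.pi * cutoff / fs)
    y = []
    prev = 0.0
    for sample in x:
        prev = (1.0 - a) * sample + a * prev
        y.append(prev)
    return y


fs = 16000
gci = 120  # hypothetical glottal closure instant, in samples
impulse = [0.0] * 200
impulse[gci] = 1.0

# The anticausal response exists only up to the closure instant ...
open_phase = anticausal_resonator(impulse, freq=150.0, bandwidth=100.0, fs=fs)
# ... and the causal low-pass adds a decaying tail after it.
pulse = one_pole_lowpass(open_phase, cutoff=2000.0, fs=fs)
```

Time reversal keeps the recursion stable while placing the poles, seen in forward time, outside the unit circle, which is why the response grows towards the closure instant instead of decaying after it.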

Conclusions
• A unified view of glottal flow models
• A unified set of time-domain and spectral parameters
• Links between time-domain and spectral parameters
• A causal-anticausal linear model of the glottal flow signal

Zeros of the Z-Transform (ZZT) Representation of Speech

A new signal representation method (Bozkurt, Doval, d’Alessandro & Dutoit, IEEE Signal Processing Letters, 2005)
Inspired by:
• The mixed-phase nature of speech (causal/anticausal voice model, CALM)
• The group delay representation
• A remark by Yegnanarayana and Murthy (1989), explaining that group delay is noisy because of roots of the z-transform polynomial close to the unit circle
Result: the Zeros of the Z-Transform (ZZT) representation

Spectral analysis of signals
• z-transform
• Fourier transform
• Causality: time and phase (or group delay) domains

Zeros of the Z-Transform (ZZT) representation
• The zeros are almost impossible to study analytically for most signals, so numerical methods are used (the roots function of Matlab)
• Basic elementary signal: power series
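The power series case is one of the few that can be checked analytically. For a frame x(n) = a^n, n = 0 … N-1, the z-transform X(z) = Σ x(n) z^(-n) equals (1 - (a/z)^N) / (1 - a/z), so its N-1 zeros lie on the circle of radius a, equally spaced, with a gap at z = a. The sketch below verifies this numerically; it uses numpy.roots as a Python counterpart of the Matlab function mentioned above, and the helper name zzt and the values of a and N are chosen only for illustration.

```python
import numpy as np


def zzt(frame):
    """Zeros of the z-transform of a finite frame x(0..N-1).

    X(z) = sum_n x(n) z^(-n) has the same non-zero roots as the polynomial
    x(0) z^(N-1) + ... + x(N-1), whose coefficients are the samples themselves.
    """
    return np.roots(np.asarray(frame, dtype=float))


a, N = 0.9, 32
power_series = a ** np.arange(N)
zeros = zzt(power_series)
print(len(zeros), np.abs(zeros).min(), np.abs(zeros).max())
```

With a < 1 (a decaying, causal-like signal) all zeros fall inside the unit circle; with a > 1 they would fall outside, which is the property the ZZT representation exploits.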

Zero patterns for the LF model
[Figure panels: first phase; return phase]

ZZT representation of speech
• The first phase of the glottal flow adds zeros outside the unit circle
• Periodicity results in many zeros on the unit circle
• Vocal tract response zeros lie inside the unit circle
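This zero placement suggests a simple way to separate contributions: compute the zeros of a frame, split them at the unit circle, and rebuild one spectrum from each group. The following is a minimal sketch on a toy frame, not the decomposition algorithm presented in the talk; it ignores windowing and periodicity, assigns zeros lying exactly on the circle to the inner group by arbitrary choice, and uses hypothetical function names.

```python
import numpy as np


def split_by_unit_circle(frame):
    """Split the ZZT of a frame into zeros outside and inside the unit circle."""
    zeros = np.roots(np.asarray(frame, dtype=float))
    outside = zeros[np.abs(zeros) > 1.0]
    inside = zeros[np.abs(zeros) <= 1.0]
    return outside, inside


def magnitude_spectrum(zeros, n_bins=256):
    """Magnitude of prod(z - z_k) evaluated on the upper half of the unit circle."""
    omega = np.linspace(0.0, np.pi, n_bins)
    z = np.exp(1j * omega)
    return np.abs(np.prod(z[:, None] - zeros[None, :], axis=1))


# Toy frame: one zero inside (0.5) convolved with one zero outside (2.0).
frame = np.convolve([1.0, -0.5], [1.0, -2.0])
outside, inside = split_by_unit_circle(frame)
source_like = magnitude_spectrum(outside)  # anticausal (outside) contribution
tract_like = magnitude_spectrum(inside)    # causal (inside) contribution
```

On real speech the frame would first have to be windowed appropriately, since the position of the analysis window changes the zero pattern.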

ZZT of windowed speech
[Figure panels: non-GCI-synchronous windowing vs. GCI (glottal closure instant) synchronous windowing, both with a rectangular window]

ZZT of windowed speech
[Figure panels: rectangular window vs. Blackman window]

Source/tract decomposition algorithm

Example of decomposition

ZZT for source-tract separation
[Figure, real speech. Panels: original windowed speech; original amplitude spectrum; ZZT (zero decomposition); reconstructed glottal excitation; reconstructed glottal amplitude spectrum; reconstructed vocal tract response; reconstructed tract transfer function. Labels: copy synthesis; noise-excited tract]

Comparison of ZZT and LPC
• Tested methods:
  • LPC autocorrelation
  • LPC covariance
  • PSIAIF
  • ZZT
• Spectral distance analysis:
  • ZZT performs much better than inverse filtering for open quotient and asymmetry estimation
  • Robustness to noise: ZZT is not better than inverse filtering

Glottal formant: ZZT estimation
• Glottal formant frequency: Fg = f(F0, 1/open quotient, asymmetry)
[Figure panels: synthetic vowels a, u, i at f0 = 100 Hz and f0 = 200 Hz; real speech, with OQ from EGG]

Conclusions
• A new speech signal representation exploiting the phase structure of glottal flow signals
• An associated estimation algorithm
• Applications to source/tract decomposition
• Applications to voice source parameter estimation
• Better than inverse filtering (LPC) for glottal flow estimation

Source-filter model revisited
• Glottal flow models can be represented by a causal-anticausal (mixed-phase) filter
• A method designed for causal-anticausal decomposition is then proposed
• This method is applied to source/tract decomposition
• … and to voice source parameter estimation

Real-time synthesis of expressive voice: vocal instruments
• Real-time vocal instrument: voice quality synthesis
• Real-time intonation synthesis: a study of intonation gestures, towards modeling intonation in terms of movements (kinematic modeling)

Aims of real-time voice synthesis
• A gesture interface for driving (“conducting”) a speech synthesis system
• Aim: add expression and emotion to the speech flow
• Real-time modification of voice synthesis
• Gesture interpretation algorithms and speech signal modification algorithms

Vocal instruments:
• A short historical review
• Application to voice quality synthesis
• Computerized chironomy: hand-controlled vocal instruments
• Experiments in intonation reiteration

Vocal instruments
• Mechanical instrument: the von Kempelen machine (Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine, 1791)
[Figures: the machine in the Deutsches Museum, Munich; Liénard’s reconstruction, 1968]

Vocal instruments
• Electrical instrument: the Voder (1939)

Recent vocal instruments
• Sidney Fels (Univ. of British Columbia): Glove-Talk (1993)

Recent vocal instruments
• Perry Cook (Comp. Sci., Princeton): Lisa (2001)

[Diagram: mapping between devices, voice dimensions, model parameters and synthesizers]
• Voice dimensions: noise, vowel, pressure, effort, intonation
• Model parameters: formants, Oq, αm, Ta, AN, F0, AV, Bg, Fg, shimmer, tilt, jitter

CALM synthesis algorithm

Real-time voice quality synthesis
• Non-preferred hand: joystick
  • Central button: 2-D vocalic space
  • Front-rear: structural noise
  • Right-left: lax-tense (+ additive noise)
• Preferred hand: Wacom tablet
  • X-axis -> F0
  • Y-axis -> vocal effort
  • Z-axis -> amplitude
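A mapping of this kind, for the preferred hand, might look like the sketch below. Only the axis assignments (X to F0, Y to vocal effort, Z to amplitude) come from the slide; the normalized input range, the logarithmic pitch scale, the F0 limits and the function name are all assumptions made for illustration.

```python
def tablet_to_voice(x, y, z, f0_min=80.0, f0_max=640.0):
    """Map normalized tablet readings in [0, 1] to synthesis controls.

    x -> F0 in Hz (log scale between the hypothetical f0_min and f0_max),
    y -> vocal effort in [0, 1], z -> amplitude in [0, 1].
    """
    def clamp(v):
        return min(1.0, max(0.0, v))

    x, y, z = clamp(x), clamp(y), clamp(z)
    f0 = f0_min * (f0_max / f0_min) ** x
    return {"f0": f0, "effort": y, "amplitude": z}


controls = tablet_to_voice(x=0.5, y=0.3, z=0.8)
```

A logarithmic scale is used here so that equal hand displacements give equal musical intervals; a linear scale would be an equally valid choice for a first experiment.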
