
A Database of Vocal Tract Resonance Trajectories for Research in Speech Processing




Presentation Transcript


  1. A Database of Vocal Tract Resonance Trajectories for Research in Speech Processing. L. Deng (Microsoft Research, Redmond); X. Cui & A. Alwan (U. of California, Los Angeles); R. Pruvenok (Georgia Institute of Technology, Atlanta); J. Huang (Carnegie Mellon U., Pittsburgh); S. Momen (Princeton U., Princeton); Y. Chen (Cornell U., Ithaca)

  2. Introduction • Joint research project between MSR & IPAM at UCLA • Carried out during the 2005 NSF-RIPS summer program • Main goals: • Create a database of VTR/formant trajectories (ground truth) for research in speech processing • Quantitatively assess various existing automatic VTR/formant tracking algorithms

  3. Background • Vocal tract resonance (VTR, or formant-I) --- acoustic resonance of the human vocal tract during speech production • May differ from spectral peaks measured from the speech signal (formant-II) • VTR/formants are important for both speech perception and production • Many techniques exist for automatic VTR or formant-II extraction

  4. Background (cont’d) • Difficulty of automatic VTR/formant tracking: • When two formants are close to each other (e.g., /iy, y, uw, r/) • Consonant sounds whose VTRs are not directly visible in the spectrogram (e.g., nasals, fricatives, stops) • CV or VC transitions • Lack of a standard database for quantitative evaluation of tracking algorithms • Requirement for extensive human expertise

  5. Data Selection • Subset of TIMIT utterances • 538 utterances in total • 192 utterances in core test set • 346 utterances in training set (173 speakers; one SX & one SI for each) • Balance of speaker, dialect, gender, & phoneme distributions

  6. VTR Trajectory Labeling • Start from the results of a previous VTR tracking algorithm (ICASSP 2004 paper) • Develop a software tool for manual error correction using spectrogram display • Use human expertise

  7. GUI Tool for VTR Labeling/Correction

  8. Human Expertise • Prior knowledge of nominal VTR target values for individual phones • Contextual effects on VTR values (target-directed trajectories) • Overall spectral properties across the entire utterance (same phones at different times) • Effects of anti-resonances in splitting VTRs of nasalized vowels • Special formant movement patterns (e.g., velar pinch) • Etc.

  9. After correction

  10. Two Automatic Algorithms • WaveSurfer (http://www.speech.kth.se/wavesurfer) (same algorithm as ESPS/xwaves; Talkin et al.) • Based on LPC analysis and dynamic programming • MSR hidden-dynamic-model-based algorithm • Implemented by Kalman filter/smoother • Piecewise-linearized mapping from VTR to cepstra • By-product of a speech recognizer • Tying all phone VTR targets • Details in ICASSP 2004 paper
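For context, the LPC-based approach named above works by fitting a linear-prediction polynomial to each speech frame; the complex root pairs of that polynomial correspond to candidate resonances (formants). The following is an illustrative Python sketch of that classic idea only, not WaveSurfer's actual implementation (which adds dynamic-programming continuity constraints); the pre-emphasis coefficient, LPC order, and bandwidth threshold are assumed values.

```python
import numpy as np

def lpc_formants(frame, fs, order=12):
    """Estimate candidate formant frequencies (Hz) for one speech frame
    via LPC root-finding -- an illustrative sketch of the technique."""
    # Pre-emphasize (flatten spectral tilt) and window the frame.
    x = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    x = x * np.hamming(len(x))
    # Autocorrelation method: lags 0..order, then solve the normal equations.
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # Roots of the prediction polynomial A(z) = 1 - a1 z^-1 - ... - ap z^-p.
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]           # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)  # pole angle -> frequency (Hz)
    bws = -fs / np.pi * np.log(np.abs(roots))   # pole radius -> bandwidth (Hz)
    # Keep sharp, non-DC resonances; sorted ascending, so F1 comes first.
    return sorted(f for f, b in zip(freqs, bws) if f > 90 and b < 400)
```

A dynamic-programming tracker such as WaveSurfer's then picks, frame by frame, the candidate-to-formant assignment that minimizes a cost combining deviation from nominal values and frame-to-frame discontinuity.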

  11. Comparisons of Two Algorithms: "His failure to open the store by eight cost him his job."

  12. Comparisons of Two Algorithms: "We always thought we would die with our boots on."

  13. Cross-Labeler Variation Results

  14. Computing Formant Tracking Errors

  15. Computing Formant Tracking Errors: Focusing on Transitions
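One plausible form for the error computation in the two slides above is a per-formant mean absolute frequency deviation between the hand-corrected reference trajectory and an automatic track, optionally restricted to a frame mask marking CV/VC transition regions. The array layout and the choice of metric here are assumptions for illustration, not necessarily the paper's exact scoring protocol.

```python
import numpy as np

def formant_tracking_error(ref, hyp, mask=None):
    """Mean absolute deviation (Hz) between a reference VTR trajectory
    and an automatic track.

    ref, hyp: (frames x formants) arrays of frequencies in Hz.
    mask: optional boolean per-frame selector, e.g. True only on
          CV/VC transition frames, so errors are scored where
          tracking is hardest.
    Returns one error figure per formant (F1, F2, F3, ...)."""
    ref = np.asarray(ref, dtype=float)
    hyp = np.asarray(hyp, dtype=float)
    if mask is None:
        mask = np.ones(ref.shape[0], dtype=bool)
    return np.abs(ref[mask] - hyp[mask]).mean(axis=0)
```

With the transition-frame mask, the same function yields the "focusing on transitions" variant: only the masked frames contribute to the average.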

  16. Summary and Conclusions • VTR/formants are critical for speech production, perception, and processing • Prior to this work, no standard database existed • Created a database of VTR trajectories using human expertise • Immediate application: quantitative evaluation of automatic VTR/formant tracking algorithms • Second-pass verification & correction at MSR recently completed • Data soon to be publicly released from both the MSR and UCLA sites
