Statistical and Signal Processing Approaches for Voicing Detection Alex Park July 25th, 2003
Overview • Motivation and background for voicing detection • Overview of recent methods • Signal Processing approaches • Statistical approaches • Performance comparison of voicing detection methods • Detection error rates on small task • Example outputs • Conclusions and Future Work Introduction
Motivation • Voicing is not necessary for speech understanding • e.g. whispered speech – excitation is provided by aspiration • e.g. sinewave speech – no periodic excitation; resonances are produced directly • What is the value of adding voicing to the speech signal? • Separability? Pitch is useful for distinguishing between concurrent speakers and background • Redundancy? Harmonics provide regular structure from which we can detect speech in multiple bands • Robustness? Unvoiced speech has lower SNR than voiced speech • Whispering is intended to prevent unwanted listeners from hearing • Shouting/singing is not possible without voicing • Low frequencies are less attenuated over distance • Current speech recognition systems typically discard voicing information in the front end because • Energy is environment dependent and pitch is speaker dependent • Vocal tract configuration carries most of the phonetic information
Background • Voicing is produced by periodic vibration of the vocal folds • In time, voiced speech consists of repeated segments (pitch periods may be irregular) • In frequency, the spectrum has harmonic structure shaped by formant resonances (the fundamental itself may be missing) • Pitch estimation and the voicing decision can be made • In time, using the repetition rate and similarity of pitch periods • In frequency, using the spacing and relative height of harmonic peaks [Figure: time-domain and frequency-domain views of voiced speech, annotated with irregular pitch periods and a missing fundamental] Introduction
Signal Processing Approaches • Signal processing approaches are marked by the absence of a training phase • Voicing detection is typically paired with pitch extraction • Well-known approach: peak-picking (spectral or temporal) • Usually followed by smoothing of gross errors via Dynamic Programming • Many proposed solutions: • Spectral • Cepstral pitch tracking • Harmonic Product Spectrum • Logarithmic DFT pitch tracker (Wang) • Temporal • Autocorrelation • Sinusoid matching (Saul) • Synchrony (Seneff) • Exotic methods • Image-based pitch tracking (Quatieri) Signal Processing
I. Autocorrelation • Time-domain approach, used in the ESPS tool ‘get_f0’ • Compute the inner product of the signal with a shifted version of itself • If x[n] is a speech frame, its short-time autocorrelation is R[τ] = Σn x[n] x[n+τ] • Peaks occur at multiples of the fundamental period [Figure: speech frame and its short-time autocorrelation] Signal Processing
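The autocorrelation detector can be sketched in a few lines. This is a simplified illustration, not the ESPS ‘get_f0’ implementation: the 60–400 Hz search range, the normalized-peak voicing score, and the toy signals below are all assumptions.

```python
import numpy as np

def autocorr_voicing(frame, fs, fmin=60.0, fmax=400.0):
    """Short-time autocorrelation voicing sketch.

    Returns (voicing_score, f0_estimate): the score is the height of the
    normalized autocorrelation peak within the candidate pitch range.
    """
    frame = frame - np.mean(frame)
    n = len(frame)
    # R[tau] = sum_n x[n] x[n+tau], non-negative lags only
    r = np.correlate(frame, frame, mode="full")[n - 1:]
    if r[0] <= 0:
        return 0.0, 0.0
    r = r / r[0]                       # normalize so R[0] = 1
    lo = int(fs / fmax)                # smallest lag searched
    hi = min(int(fs / fmin), n - 1)    # largest lag searched
    tau = lo + int(np.argmax(r[lo:hi]))
    return float(r[tau]), fs / tau     # peak height ~ voicing, lag ~ period

# Toy check: a 200 Hz square-like tone vs. white noise
fs = 8000
t = np.arange(0, 0.03, 1 / fs)
voiced = np.sign(np.sin(2 * np.pi * 200 * t))     # strongly periodic
noise = np.random.default_rng(0).standard_normal(len(t))
```

A periodic frame yields a high normalized peak at the fundamental period, while a noise frame yields only small peaks, which is exactly the cue peak-picking methods threshold on.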
II. Band-limited Sinusoid Fitting (Saul 2002) • Signal preconditioning: low-pass filter, then half-wave rectify (max(x,0)) • Octave filterbank with 8 bands (e.g. 25–75 Hz up through 264–800 Hz); filter bandwidths allow at least one filter to resolve single harmonics • Frames of the filtered signals (sliding temporal window) are fit with a sinusoid of frequency w* and error u* • At each step, the lowest u* across bands gives the voicing probability, p(V) = f(u*), and the corresponding w* gives the pitch estimate F0 • Algorithm is fast and gives accurate pitch tracks [Block diagram: per-band estimates (wk*, uk*); winning band selected by min(uk*); output p(V) and F0] Signal Processing
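A minimal sketch of the per-band fitting idea: for each candidate frequency, fit a single sinusoid by least squares and keep the frequency whose residual error u* is lowest. The grid search below stands in for the paper's per-band estimator, and the frame length and frequency grid are assumptions.

```python
import numpy as np

def fit_sinusoid(frame, fs, f_grid):
    """Least-squares sinusoid fit over a grid of candidate frequencies.

    Returns (best_f, u): u is the fraction of frame energy left
    unexplained by the best-fitting sinusoid (low u => voiced-like).
    """
    n = np.arange(len(frame))
    energy = float(np.dot(frame, frame))
    if energy == 0.0:
        return 0.0, 1.0
    best_f, best_u = 0.0, 1.0
    for f in f_grid:
        w = 2 * np.pi * f / fs
        # Design matrix [cos(wn), sin(wn)]; solve min ||frame - A c||^2
        A = np.column_stack([np.cos(w * n), np.sin(w * n)])
        c, *_ = np.linalg.lstsq(A, frame, rcond=None)
        resid = frame - A @ c
        u = float(np.dot(resid, resid)) / energy
        if u < best_u:
            best_f, best_u = f, u
    return best_f, best_u

fs = 8000
samples = np.arange(240)
tone = np.sin(2 * np.pi * 150 * samples / fs)          # resolvable harmonic
hiss = np.random.default_rng(0).standard_normal(240)   # aperiodic
```

A band that resolves a single harmonic fits it almost perfectly (u* near 0), while an aperiodic band leaves most of its energy unexplained; that contrast is what drives the voicing probability.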
Statistical Approaches • Statistical voicing detectors are not strictly dependent on spectral features (though these are the most widely used) • Training data is useful for capturing acoustic cues of voicing not explicitly specified in signal processing approaches • Possible classifiers suitable for voicing detection include • GMM classifier (w/ MFCC features) • Structured Bayesian Network (alternative features) • Neural Network classifier • Support Vector Machines Statistical
I. GMM Classifier • Train two GMMs, p(x|V) and p(x|UV), on transcribed speech using frame-level feature vectors (MFCCs plus surrounding frames, for deltas and delta-deltas) • 50 mixtures each; dimensionality reduced to 50 via PCA • Using Bayes’ rule, the voicing score for an unknown frame x is the likelihood ratio: decide “voiced” if p(x|V) / p(x|UV) > λ, else “unvoiced” • Discriminative framework is useful because it uses knowledge of unvoiced speech characteristics in making the decision [Block diagram: training of V and UV GMMs from transcribed speech; testing via the likelihood-ratio decision rule] Statistical
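The likelihood-ratio decision can be illustrated with a simplified stand-in: a single full-covariance Gaussian per class in place of the 50-mixture GMMs, trained on toy three-dimensional data. The class names and toy distributions are assumptions, not the slides' actual models.

```python
import numpy as np

class GaussianClass:
    """Single multivariate Gaussian, standing in for one class's GMM."""
    def fit(self, X):
        self.mean = X.mean(axis=0)
        self.cov = np.cov(X.T) + 1e-6 * np.eye(X.shape[1])  # regularized
        self.inv = np.linalg.inv(self.cov)
        _, logdet = np.linalg.slogdet(self.cov)
        self.logz = -0.5 * (X.shape[1] * np.log(2 * np.pi) + logdet)
        return self

    def loglik(self, X):
        d = X - self.mean
        # Row-wise quadratic form d^T Sigma^{-1} d
        return self.logz - 0.5 * np.einsum("ij,jk,ik->i", d, self.inv, d)

def voicing_llr(x, model_v, model_uv):
    # Decide "voiced" when log p(x|V) - log p(x|UV) exceeds a threshold
    return model_v.loglik(x) - model_uv.loglik(x)

# Toy training data: well-separated "voiced" and "unvoiced" clusters
rng = np.random.default_rng(1)
Xv = rng.normal(loc=3.0, scale=1.0, size=(500, 3))
Xu = rng.normal(loc=-3.0, scale=1.0, size=(500, 3))
gm_v = GaussianClass().fit(Xv)
gm_uv = GaussianClass().fit(Xu)
```

Sweeping the threshold λ on this log-likelihood ratio trades miss errors against false alarms, which is how the operating points in the comparison slides are obtained.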
II. Bayesian Network (Saul/Rahim/Allen 1999) • Signal preconditioning: half-wave rectify (max(x,0)), then a 24-channel auditory (gammatone) filterbank • Feature vector constructed for frames of narrowband speech: (autocorrelation peaks and valleys) & (SNR estimate) = 5 dims/band/frame • Individual voicing decisions made on each channel; per-channel sigmoid decision weights (q’s) trained via the EM algorithm • AND layer combines a channel’s feature tests; OR layer combines channels, so the overall voicing decision is triggered by a positive example in any individual channel [Block diagram: filterbank → per-channel feature extraction and tests → AND layer → OR layer → voicing decision] Statistical
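The OR-layer combination can be sketched as a noisy-OR over per-channel sigmoid probabilities. Here the weights are assumed already trained (the slides train them via EM), and `channel_probs` and the toy inputs are illustrative, not the paper's network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_probs(features, thetas):
    """Per-channel voicing probability: sigmoid of a linear score over
    that channel's 5-dim feature vector (weights assumed pre-trained)."""
    return np.array([sigmoid(th @ f) for f, th in zip(features, thetas)])

def noisy_or(probs):
    # OR layer: voiced if any channel fires; P(V) = 1 - prod_k (1 - p_k)
    return 1.0 - np.prod(1.0 - np.asarray(probs))
```

One confident channel is enough to drive the overall voicing probability high, which matches the intuition that a single band resolving clean harmonic structure suffices to declare the frame voiced.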
Comparison: Matched Conditions • Trained on 410 TIMIT sentences from 40 speakers (126k frames) • Evaluated on 100 TIMIT sentences from 10 speakers (28k frames) • Speech was resampled to 8 kHz; phone labels used as the voicing reference • Also evaluated on the Keele database (laryngograph reference) [Figure: error-rate comparison, with Saul’s reported operating point marked] Results
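Frame-level scoring of this kind reduces to counting label disagreements against the reference. The helper below is a hypothetical illustration, not the actual evaluation script; it splits the overall error into the two error types that the comparison plots distinguish.

```python
import numpy as np

def voicing_errors(pred, ref):
    """Frame-level scoring against a voicing reference (phone labels for
    TIMIT, laryngograph for Keele).

    Returns (overall, miss, false_alarm):
      miss        = voiced reference frames labeled unvoiced
      false_alarm = unvoiced reference frames labeled voiced
    """
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    overall = float(np.mean(pred != ref))
    miss = float(np.mean(~pred[ref]))
    false_alarm = float(np.mean(pred[~ref]))
    return overall, miss, false_alarm
```

Reporting the two error types separately matters because the voiced/unvoiced prior is skewed, so a single overall rate can hide a detector that rarely fires.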
Sample Outputs: Matched Conditions • Some example voicing tracks output by the individual methods [Figure: voicing tracks for GMM w/ MFCCs, Bayesian network, sinusoid fitting, and autocorrelation; traces marked voiced / unvoiced / uncertain]
Comparison: Mismatched Conditions • Evaluated with different kinds of signal corruption • Condition not known a priori => same threshold as before • Threshold could be made adaptive to the environment (equivalent to modifying the output probability) • Overall error rates are unsatisfactory • GMM classifier has the best performance on clean data, but unpredictable results in varied conditions [Figure: error rates under corruption for GMM, autocorrelation, and sinusoid fitting]
Sample Outputs: Mismatched Conditions • Voicing tracks on an NTIMIT utterance [Figure: voicing tracks for GMM w/ MFCCs, Bayesian network, sinusoid fitting, and autocorrelation on the TIMIT vs. NTIMIT versions; traces marked voiced / unvoiced / uncertain]
Conclusions and Future Work • Error rates are still high compared with the literature • Post-processing to remove stray frames • Problem with the scoring procedure? • Statistical framework with knowledge-based features • Weight the contribution of multiple detectors using an SNR-based variable • Using the same approach, apply to phonetic detectors for voiced speech • Nasality – broad F1 bandwidth, low spectral slope in the F1:F2 region, stable low-frequency energy • Rounding – low F1, F2 • Retroflex – low F3, rising formants • Combine feature streams with SNR-based weights as input to an HMM Conclusions
References • L. K. Saul, D. D. Lee, C. L. Isbell, and Y. LeCun (2003). “Real time voice processing with audiovisual feedback: Toward autonomous agents with perfect pitch,” in S. Becker, S. Thrun, and K. Obermayer (eds.), Advances in Neural Information Processing Systems 15. MIT Press: Cambridge, MA. • L. K. Saul, M. G. Rahim, and J. B. Allen (2001). “A statistical model for robust integration of narrowband cues in speech,” Computer Speech and Language 15(2): 175–194. • C. Wang and S. Seneff (2000). “Robust Pitch Tracking for Prosodic Modeling in Telephone Speech,” in Proc. ICASSP ’00, Istanbul, Turkey. • S. Seneff (1985). “Pitch and spectral analysis of speech based on an auditory synchrony model,” Ph.D. Thesis, Dept. of Electrical Engineering, M.I.T., Cambridge, MA. • T. F. Quatieri (2002). “2-D Processing of Speech with Application to Pitch Estimation,” in Proc. ICSLP ’02, Denver, Colorado.