Statistical and Signal Processing Approaches for Voicing Detection Alex Park July 25th, 2003
Overview • Motivation and background for voicing detection • Overview of recent methods • Signal Processing approaches • Statistical approaches • Performance comparison of voicing detection methods • Detection error rates on small task • Example outputs • Conclusions and Future Work Introduction
Motivation • Voicing is not necessary for speech understanding • e.g. whispered speech – excitation is provided by aspiration • e.g. sinewave speech – no periodic excitation; resonances are produced directly • What is the value of adding voicing to the speech signal? • Separability? Pitch is useful for distinguishing between concurrent speakers and background • Redundancy? Harmonics provide regular structure from which we can detect speech in multiple bands • Robustness? Unvoiced speech has lower SNR than voiced speech • Whispering is intended to prevent unwanted listeners from hearing • Shouting/singing is not possible without voicing • Low frequencies are less attenuated over distance • Current speech recognition systems typically discard voicing information in the front end because • Energy is environment dependent and pitch is speaker dependent • Vocal tract configuration carries most of the phonetic information
Background • Voicing is produced by periodic vibration of the vocal folds • In time, voiced speech consists of repeated segments (pitch periods may be irregular) • In frequency, the spectrum has harmonic structure shaped by formant resonances (the fundamental itself may be missing) • Pitch estimation and the voicing decision can be made • In time, using the repetition rate and similarity of pitch periods • In frequency, using the spacing and relative height of harmonic peaks [Figure: time-domain and frequency-domain views of voiced speech, annotated with irregular pitch periods and a missing fundamental] Introduction
Signal Processing Approaches • Signal processing approaches are marked by the absence of a training phase • Voicing detection is typically paired with pitch extraction • Well-known approach: peak-picking (spectral or temporal) • Usually followed by smoothing of gross errors via Dynamic Programming • Many proposed solutions: • Spectral • Cepstral pitch tracking • Harmonic Product Spectrum • Logarithmic DFT pitch tracker (Wang) • Temporal • Autocorrelation • Sinusoid matching (Saul) • Synchrony (Seneff) • Exotic methods • Image-based pitch tracking (Quatieri) Signal Processing
I. Autocorrelation • Time-domain approach, used in the ESPS tool ‘get_f0’ • Compute the inner product of the signal with a shifted version of itself • If x[n] is a speech frame, its short-time autocorrelation is R[τ] = Σn x[n] x[n+τ] • Peaks occur at multiples of the fundamental period [Figure: speech frame and its short-time autocorrelation] Signal Processing
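The autocorrelation detector can be sketched in a few lines. This is a simplified illustration, not the ESPS ‘get_f0’ implementation: the 60–400 Hz search range, the normalized-peak voicing score, and the toy signals below are all assumptions.

```python
import numpy as np

def autocorr_voicing(frame, fs, fmin=60.0, fmax=400.0):
    """Short-time autocorrelation voicing sketch.

    Returns (voicing_score, f0_estimate): the score is the height of the
    normalized autocorrelation peak within the candidate pitch range.
    """
    frame = frame - np.mean(frame)
    n = len(frame)
    # R[tau] = sum_n x[n] x[n+tau], non-negative lags only
    r = np.correlate(frame, frame, mode="full")[n - 1:]
    if r[0] <= 0:
        return 0.0, 0.0
    r = r / r[0]                       # normalize so R[0] = 1
    lo = int(fs / fmax)                # smallest lag searched
    hi = min(int(fs / fmin), n - 1)    # largest lag searched
    tau = lo + int(np.argmax(r[lo:hi]))
    return float(r[tau]), fs / tau     # peak height ~ voicing, lag ~ period

# Toy check: a 200 Hz square-like tone vs. white noise
fs = 8000
t = np.arange(0, 0.03, 1 / fs)
voiced = np.sign(np.sin(2 * np.pi * 200 * t))     # strongly periodic
noise = np.random.default_rng(0).standard_normal(len(t))
```

A periodic frame yields a high normalized peak at the fundamental period, while a noise frame yields only small peaks, which is exactly the cue peak-picking methods threshold on.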
II. Band-limited Sinusoid Fitting (Saul 2002) • Signal preconditioning: low-pass filter, then half-wave rectify (max(x,0)) • Octave filterbank with 8 bands (e.g. 25–75 Hz up through 264–800 Hz); filter bandwidths allow at least one filter to resolve single harmonics • Frames of the filtered signals (sliding temporal window) are fit with a sinusoid of frequency w* and error u* • At each step, the lowest u* across bands gives the voicing probability, p(V) = f(u*), and the corresponding w* gives the pitch estimate F0 • Algorithm is fast and gives accurate pitch tracks [Block diagram: per-band estimates (wk*, uk*); winning band selected by min(uk*); output p(V) and F0] Signal Processing
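A minimal sketch of the per-band fitting idea: for each candidate frequency, fit a single sinusoid by least squares and keep the frequency whose residual error u* is lowest. The grid search below stands in for the paper's per-band estimator, and the frame length and frequency grid are assumptions.

```python
import numpy as np

def fit_sinusoid(frame, fs, f_grid):
    """Least-squares sinusoid fit over a grid of candidate frequencies.

    Returns (best_f, u): u is the fraction of frame energy left
    unexplained by the best-fitting sinusoid (low u => voiced-like).
    """
    n = np.arange(len(frame))
    energy = float(np.dot(frame, frame))
    if energy == 0.0:
        return 0.0, 1.0
    best_f, best_u = 0.0, 1.0
    for f in f_grid:
        w = 2 * np.pi * f / fs
        # Design matrix [cos(wn), sin(wn)]; solve min ||frame - A c||^2
        A = np.column_stack([np.cos(w * n), np.sin(w * n)])
        c, *_ = np.linalg.lstsq(A, frame, rcond=None)
        resid = frame - A @ c
        u = float(np.dot(resid, resid)) / energy
        if u < best_u:
            best_f, best_u = f, u
    return best_f, best_u

fs = 8000
samples = np.arange(240)
tone = np.sin(2 * np.pi * 150 * samples / fs)          # resolvable harmonic
hiss = np.random.default_rng(0).standard_normal(240)   # aperiodic
```

A band that resolves a single harmonic fits it almost perfectly (u* near 0), while an aperiodic band leaves most of its energy unexplained; that contrast is what drives the voicing probability.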
Statistical Approaches • Statistical voicing detectors are not strictly dependent on spectral features (though these are the most widely used) • Training data is useful for capturing acoustic cues of voicing not explicitly specified in signal processing approaches • Possible classifiers suitable for voicing detection include • GMM classifier (w/ MFCC features) • Structured Bayesian Network (alternative features) • Neural Network classifier • Support Vector Machines Statistical
I. GMM Classifier • Train two GMMs, p(x|V) and p(x|UV), on transcribed speech using frame-level feature vectors (MFCCs plus surrounding frames, for deltas and delta-deltas) • 50 mixtures each; dimensionality reduced to 50 via PCA • Using Bayes’ rule, the voicing score for an unknown frame x is the likelihood ratio: decide “voiced” if p(x|V) / p(x|UV) > λ, else “unvoiced” • Discriminative framework is useful because it uses knowledge of unvoiced speech characteristics in making the decision [Block diagram: training of V and UV GMMs from transcribed speech; testing via the likelihood-ratio decision rule] Statistical
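The likelihood-ratio decision can be illustrated with a simplified stand-in: a single full-covariance Gaussian per class in place of the 50-mixture GMMs, trained on toy three-dimensional data. The class names and toy distributions are assumptions, not the slides' actual models.

```python
import numpy as np

class GaussianClass:
    """Single multivariate Gaussian, standing in for one class's GMM."""
    def fit(self, X):
        self.mean = X.mean(axis=0)
        self.cov = np.cov(X.T) + 1e-6 * np.eye(X.shape[1])  # regularized
        self.inv = np.linalg.inv(self.cov)
        _, logdet = np.linalg.slogdet(self.cov)
        self.logz = -0.5 * (X.shape[1] * np.log(2 * np.pi) + logdet)
        return self

    def loglik(self, X):
        d = X - self.mean
        # Row-wise quadratic form d^T Sigma^{-1} d
        return self.logz - 0.5 * np.einsum("ij,jk,ik->i", d, self.inv, d)

def voicing_llr(x, model_v, model_uv):
    # Decide "voiced" when log p(x|V) - log p(x|UV) exceeds a threshold
    return model_v.loglik(x) - model_uv.loglik(x)

# Toy training data: well-separated "voiced" and "unvoiced" clusters
rng = np.random.default_rng(1)
Xv = rng.normal(loc=3.0, scale=1.0, size=(500, 3))
Xu = rng.normal(loc=-3.0, scale=1.0, size=(500, 3))
gm_v = GaussianClass().fit(Xv)
gm_uv = GaussianClass().fit(Xu)
```

Sweeping the threshold λ on this log-likelihood ratio trades miss errors against false alarms, which is how the operating points in the comparison slides are obtained.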
II. Bayesian Network (Saul/Rahim/Allen 1999) • Signal preconditioning: half-wave rectify (max(x,0)), then a 24-channel auditory (gammatone) filterbank • Feature vector constructed for frames of narrowband speech: (autocorrelation peaks and valleys) & (SNR estimate) = 5 dims/band/frame • Individual voicing decisions made on each channel; per-channel sigmoid decision weights (q’s) trained via the EM algorithm • AND layer combines a channel’s feature tests; OR layer combines channels, so the overall voicing decision is triggered by a positive example in any individual channel [Block diagram: filterbank → per-channel feature extraction and tests → AND layer → OR layer → voicing decision] Statistical
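The OR-layer combination can be sketched as a noisy-OR over per-channel sigmoid probabilities. Here the weights are assumed already trained (the slides train them via EM), and `channel_probs` and the toy inputs are illustrative, not the paper's network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_probs(features, thetas):
    """Per-channel voicing probability: sigmoid of a linear score over
    that channel's 5-dim feature vector (weights assumed pre-trained)."""
    return np.array([sigmoid(th @ f) for f, th in zip(features, thetas)])

def noisy_or(probs):
    # OR layer: voiced if any channel fires; P(V) = 1 - prod_k (1 - p_k)
    return 1.0 - np.prod(1.0 - np.asarray(probs))
```

One confident channel is enough to drive the overall voicing probability high, which matches the intuition that a single band resolving clean harmonic structure suffices to declare the frame voiced.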
Comparison: Matched Conditions • Trained on 410 TIMIT sentences from 40 speakers (126k frames) • Evaluated on 100 TIMIT sentences from 10 speakers (28k frames) • Speech was resampled to 8 kHz; phone labels used as the voicing reference • Also evaluated on the Keele database (laryngograph reference) [Figure: error-rate comparison, with Saul’s reported operating point marked] Results
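Frame-level scoring of this kind reduces to counting label disagreements against the reference. The helper below is a hypothetical illustration, not the actual evaluation script; it splits the overall error into the two error types that the comparison plots distinguish.

```python
import numpy as np

def voicing_errors(pred, ref):
    """Frame-level scoring against a voicing reference (phone labels for
    TIMIT, laryngograph for Keele).

    Returns (overall, miss, false_alarm):
      miss        = voiced reference frames labeled unvoiced
      false_alarm = unvoiced reference frames labeled voiced
    """
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    overall = float(np.mean(pred != ref))
    miss = float(np.mean(~pred[ref]))
    false_alarm = float(np.mean(pred[~ref]))
    return overall, miss, false_alarm
```

Reporting the two error types separately matters because the voiced/unvoiced prior is skewed, so a single overall rate can hide a detector that rarely fires.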
Sample Outputs: Matched Conditions • Some example voicing tracks output by the individual methods [Figure: voicing tracks for GMM w/ MFCCs, Bayesian network, sinusoid fitting, and autocorrelation; traces marked voiced / unvoiced / uncertain]
Comparison: Mismatched Conditions • Evaluated with different kinds of signal corruption • Condition not known a priori => same threshold as before • Threshold could be made adaptive to the environment (equivalent to modifying the output probability) • Overall error rates are unsatisfactory • GMM classifier has the best performance on clean data, but unpredictable results in varied conditions [Figure: error rates under corruption for GMM, autocorrelation, and sinusoid fitting]
Sample Outputs: Mismatched Conditions • Voicing tracks on an NTIMIT utterance [Figure: voicing tracks for GMM w/ MFCCs, Bayesian network, sinusoid fitting, and autocorrelation on the TIMIT vs. NTIMIT versions; traces marked voiced / unvoiced / uncertain]
Conclusions and Future Work • Error rates are still high compared with the literature • Post-processing to remove stray frames • Problem with the scoring procedure? • Statistical framework with knowledge-based features • Weight the contribution of multiple detectors using an SNR-based variable • Using the same approach, apply to phonetic detectors for voiced speech • Nasality – broad F1 bandwidth, low spectral slope in the F1:F2 region, stable low-frequency energy • Rounding – low F1, F2 • Retroflex – low F3, rising formants • Combine feature streams with SNR-based weights as input to an HMM Conclusions
References • L. K. Saul, D. D. Lee, C. L. Isbell, and Y. LeCun (2003). “Real time voice processing with audiovisual feedback: Toward autonomous agents with perfect pitch,” in S. Becker, S. Thrun, and K. Obermayer (eds.), Advances in Neural Information Processing Systems 15. MIT Press: Cambridge, MA. • L. K. Saul, M. G. Rahim, and J. B. Allen (2001). “A statistical model for robust integration of narrowband cues in speech,” Computer Speech and Language 15(2): 175–194. • C. Wang and S. Seneff (2000). “Robust Pitch Tracking for Prosodic Modeling in Telephone Speech,” in Proc. ICASSP ’00, Istanbul, Turkey. • S. Seneff (1985). “Pitch and spectral analysis of speech based on an auditory synchrony model,” Ph.D. Thesis, Dept. of Electrical Engineering, M.I.T., Cambridge, MA. • T. F. Quatieri (2002). “2-D Processing of Speech with Application to Pitch Estimation,” in Proc. ICSLP ’02, Denver, Colorado.