
Statistical and Signal Processing Approaches for Voicing Detection


Presentation Transcript


  1. Statistical and Signal Processing Approaches for Voicing Detection Alex Park July 25th, 2003

  2. Overview • Motivation and background for voicing detection • Overview of recent methods • Signal Processing approaches • Statistical approaches • Performance comparison of voicing detection methods • Detection error rates on small task • Example outputs • Conclusions and Future Work Introduction

  3. Motivation • Voicing is not necessary for speech understanding • E.g. Whispered speech – excitation is provided by aspiration • E.g. Sinewave speech – no periodic excitation, resonances produced directly • What is the value of adding voicing to the speech signal? • Separability? Pitch is useful for distinguishing between concurrent speakers and background • Redundancy? Harmonics provide regular structure from which we can detect speech in multiple bands • Robustness? Unvoiced speech has lower SNR than voiced speech • Whispering is intended to prevent unwanted listeners from hearing • Shouting/singing not possible without voicing • Low frequencies less attenuated over distances • Current speech recognition systems typically discard voicing information in the front end because • Energy is environment dependent, pitch is speaker dependent • Vocal tract configuration carries most phonetic information Introduction

  4. Background • Voicing produced by periodic vibrations of the vocal folds • In time, voiced speech consists of repeated segments • In frequency, spectrum has harmonic structure shaped by formant resonances • Pitch estimation and voicing decision can be made • In time, using repetition rate and similarity of pitch periods • In frequency, using spacing and relative height of harmonic peaks • [Figure: time-domain and frequency-domain views of voiced speech; annotations mark irregular pitch periods and a missing fundamental] Introduction

  5. Signal Processing Approaches • Signal processing approaches marked by lack of training phase • Voicing detection typically paired with pitch extraction • Well known approach: peak-picking (spectral or temporal) • Usually followed by smoothing of gross errors via Dynamic Programming • Many proposed solutions: • Spectral • Cepstral pitch tracking • Harmonic Product Spectrum (see the sketch below) • Logarithmic DFT pitch tracker (Wang) • Temporal • Autocorrelation • Sinusoid matching (Saul) • Synchrony (Seneff) • Exotic methods • Image based pitch tracking (Quatieri) Signal Processing
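As a concrete illustration of spectral peak-picking, here is a minimal sketch of the Harmonic Product Spectrum in Python; the windowing, zero-padding factor, and 60–400 Hz search range are illustrative assumptions, not values from the slides:

```python
import numpy as np

def harmonic_product_spectrum(frame, fs, n_harmonics=4, fmin=60.0, fmax=400.0):
    """Estimate F0 of one frame via the Harmonic Product Spectrum (HPS).

    Downsampling the magnitude spectrum by 2, 3, ... aligns the harmonics
    on top of the fundamental, so the product of the copies peaks at F0.
    """
    n_fft = 4 * len(frame)                       # zero-pad for finer frequency bins
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    hps = spec.copy()
    for h in range(2, n_harmonics + 1):
        n = len(spec) // h
        hps[:n] *= spec[::h][:n]                 # multiply in the h-times-downsampled copy
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)     # search only a plausible pitch range
    return freqs[band][np.argmax(hps[band])]
```

A voicing decision can then follow from how strongly the winning peak dominates the rest of the product spectrum.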

  6. I. Autocorrelation • Temporal domain approach, used in ESPS tool ‘get_f0’ • Compute inner product of signal with shifted version of itself • If x[n] is a speech frame, then the short-time autocorrelation is R[τ] = Σ_n x[n]·x[n+τ] • Peaks in R[τ] occur at multiples of the fundamental period • [Figure: speech frame and its short-time autocorrelation] Signal Processing
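A minimal sketch of the autocorrelation decision, assuming a normalized-peak threshold and a 60–400 Hz pitch range (illustrative choices, not get_f0's actual parameters):

```python
import numpy as np

def autocorr_voicing(frame, fs, fmin=60.0, fmax=400.0, threshold=0.3):
    """Voicing decision and pitch estimate from short-time autocorrelation.

    A frame is called voiced when the strongest autocorrelation peak in
    the candidate lag range is a large enough fraction of R[0] (energy).
    """
    frame = frame - frame.mean()
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # R[tau] for tau >= 0
    if r[0] <= 0:                                                 # silent frame
        return False, 0.0
    r = r / r[0]                                                  # normalize so R[0] = 1
    lo, hi = int(fs / fmax), int(fs / fmin)                       # lag range for fmax..fmin
    lag = lo + np.argmax(r[lo:hi])
    return bool(r[lag] > threshold), fs / lag
```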

  7. II. Band-limited Sinusoid Fitting (Saul 2002) • [Diagram: signal preconditioning (low pass filter, half-wave rectify max(x,0)) → octave filterbank (8 bands; e.g. band 1: 25–75 Hz, band 4: 134–407 Hz, band 8: 264–800 Hz) → sliding temporal window → per-band sinusoid fits (ω_b*, u_b*) → output: F0 = ω*/2π from the band with min(u_b*), p(V) = f(u*)] • Filter bandwidths allow at least one filter to resolve single harmonics • Frames of filtered signals fit with sinusoid of frequency ω* and error u* • At each step, lowest u* gives voicing probability, ω* gives pitch estimate • Algorithm is fast and gives accurate pitch tracks Signal Processing
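Saul's algorithm computes the fit with a fast closed-form update; the grid search below is only a conceptual sketch of the per-band step of finding the frequency ω* and normalized error u* (the candidate grid and error measure are assumptions):

```python
import numpy as np

def fit_sinusoid(frame, fs, candidate_freqs):
    """Least-squares fit of a single sinusoid to one band-limited frame.

    For each candidate frequency, project the frame onto a cos/sin basis
    and measure the residual; return the best frequency w* and its
    normalized error u* (low u* means a strongly periodic, voiced frame).
    """
    t = np.arange(len(frame)) / fs
    energy = float(np.dot(frame, frame)) + 1e-12
    best_f, best_u = candidate_freqs[0], np.inf
    for f in candidate_freqs:
        basis = np.column_stack([np.cos(2 * np.pi * f * t),
                                 np.sin(2 * np.pi * f * t)])
        coef, *_ = np.linalg.lstsq(basis, frame, rcond=None)
        u = float(np.sum((frame - basis @ coef) ** 2)) / energy
        if u < best_u:
            best_f, best_u = f, u
    return best_f, best_u   # analogous to (w*, u*) for one band in the diagram
```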

  8. Statistical Approaches • Statistical voicing detectors are not strictly dependent on spectral features (though these are the most widely used) • Training data useful for capturing acoustic cues of voicing not explicitly specified in signal processing approaches • Possible classifiers suitable for voicing detection include • GMM classifier (w/ MFCC features) • Structured Bayesian Network (alternative features) • Neural Network classifier • Support Vector Machines Statistical

  9. I. GMM Classifier • [Diagram: Training — transcribed speech trains a voiced GMM p(x|V) and an unvoiced GMM p(x|UV); Testing — an unknown frame x is scored by the likelihood ratio p(x|V)/p(x|UV), compared against a threshold λ: “voiced” if above, “unvoiced” if below] • Train two GMMs, p(x|V) and p(x|UV), using frame-level feature vectors (MFCCs + surrounding frames for Δ's and ΔΔ's) • 50 mixtures each, dimensions reduced to 50 via PCA • Using Bayes’ rule, voicing score is given by likelihood ratio • Discriminative framework is useful because it uses knowledge of unvoiced speech characteristics in making decision Statistical
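A minimal sketch of the training and scoring steps using scikit-learn; the feature matrices and threshold handling are placeholders, while the 50-mixture, 50-dimension setup follows the slide:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def train_voicing_gmms(X_voiced, X_unvoiced, n_mix=50, n_dims=50):
    """Fit p(x|V) and p(x|UV) on PCA-reduced frame-level features."""
    pca = PCA(n_components=n_dims).fit(np.vstack([X_voiced, X_unvoiced]))
    gmm_v = GaussianMixture(n_components=n_mix).fit(pca.transform(X_voiced))
    gmm_uv = GaussianMixture(n_components=n_mix).fit(pca.transform(X_unvoiced))
    return pca, gmm_v, gmm_uv

def voicing_scores(frames, pca, gmm_v, gmm_uv):
    """Per-frame log-likelihood ratio log p(x|V) - log p(x|UV).

    Frames whose score exceeds log(lambda) are labeled voiced; lambda
    sets the operating point on the ROC curve.
    """
    x = pca.transform(frames)
    return gmm_v.score_samples(x) - gmm_uv.score_samples(x)
```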

  10. II. Bayesian Network (Saul/Rahim/Allen 1999) • [Diagram: signal preconditioning (half-wave rectify max(x,0)) → auditory filterbank (24 gammatone filters) → per-channel feature extraction and channel tests σ(u) → AND layer within channels → OR layer across channels → voicing decision] • Feature vector constructed for frames of narrowband speech • (Autocorrelation peaks and valleys) & (SNR estimate) = 5 dims/band/frame • Individual voicing decisions made on each channel • Channel sigmoid decision weights (θ's) trained via EM algorithm • Overall voicing decision triggered by positive example in individual channels Statistical
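A sketch of how the per-channel sigmoid decisions might be combined across channels; the probabilistic-OR combination and the feature shapes below are my assumptions about the AND/OR structure described above, not the exact model of Saul/Rahim/Allen:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def channel_prob(z, theta):
    """Per-channel voicing probability: sigmoid over the channel's
    5-dim feature vector z, with theta[0] acting as a bias weight."""
    return sigmoid(theta[0] + z @ theta[1:])

def frame_voicing_prob(Z, thetas):
    """Combine channels with a probabilistic OR: the frame is voiced
    if at least one channel fires.  Z has shape (n_channels, 5)."""
    p = np.array([channel_prob(z, th) for z, th in zip(Z, thetas)])
    return 1.0 - np.prod(1.0 - p)
```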

  11. Comparison: Matched Conditions • Trained on 410 TIMIT sentences from 40 speakers (126k frames) • Evaluated on 100 TIMIT sentences from 10 speakers (28k frames) • Speech was resampled to 8 kHz, phone labels used as voicing reference • Also evaluated on Keele database (laryngograph reference) • [Figure: detection performance curves, with Saul's reported operating point marked] Results

  12. Sample Outputs: Matched Conditions • Some example voicing tracks output by individual methods • [Figure: voiced/unvoiced/uncertainty tracks for GMM w/ MFCCs, Bayesian Network, Sinusoid fit, and Autocorrelation] Results

  13. Comparison: Mismatched Conditions • Evaluated with different kinds of signal corruption • Condition not known a priori => same threshold as before • Threshold could be made adaptive to the environment (equivalent to modifying the output probability) • Overall error rates are unsatisfactory • GMM classifier has best performance on clean data, but unpredictable results in varied conditions • [Figure: error rates under corruption for GMM, Autocorrelation, and Sinusoid fit] Results

  14. Sample Outputs: Mismatched Conditions • Voicing tracks on an NTIMIT utterance • [Figure: voiced/unvoiced/uncertainty tracks for GMM w/ MFCCs, Bayesian Network, Sinusoid fit, and Autocorrelation on TIMIT vs. NTIMIT] Results

  15. Conclusions and Future Work • Error rates are still high compared with literature • Post processing to remove stray frames • Problem with scoring procedure? • Statistical framework with knowledge based features • Weight contribution of multiple detectors using SNR-based variable • Using same approach, apply to phonetic detectors for voiced speech • Nasality – broad F1 bandwidth, low spectral slope in F1:F2 region, stable low frequency energy • Rounding – Low F1, F2. • Retroflex – Low F3, rising formants. • Combine feature streams with SNR based weight as input to HMM Conclusions

  16. References • L. K. Saul, D. D. Lee, C. L. Isbell, and Y. LeCun (2003). “Real time voice processing with audiovisual feedback: toward autonomous agents with perfect pitch,” in S. Becker, S. Thrun, and K. Obermayer (eds.), Advances in Neural Information Processing Systems 15. MIT Press: Cambridge, MA. • L. K. Saul, M. G. Rahim, and J. B. Allen (2001). “A statistical model for robust integration of narrowband cues in speech,” Computer Speech and Language 15(2): 175–194. • C. Wang and S. Seneff (2000). “Robust pitch tracking for prosodic modeling in telephone speech,” in Proc. ICASSP ’00, Istanbul, Turkey. • S. Seneff (1985). “Pitch and spectral analysis of speech based on an auditory synchrony model,” Ph.D. thesis, Dept. of Electrical Engineering, M.I.T., Cambridge, MA. • T. F. Quatieri (2002). “2-D processing of speech with application to pitch estimation,” in Proc. ICSLP ’02, Denver, Colorado. Conclusions
