PROPAGATION OF STATISTICAL INFORMATION THROUGH NON-LINEAR FEATURE EXTRACTIONS FOR ROBUST SPEECH RECOGNITION. Overview: Introduction: automatic speech recognition. Problem: imperfect noise suppression. Proposed solution: uncertainty propagation. Tests and results. Conclusions.

Presentation Transcript


  1. PROPAGATION OF STATISTICAL INFORMATION THROUGH NON-LINEAR FEATURE EXTRACTIONS FOR ROBUST SPEECH RECOGNITION • R. F. Astudillo, D. Kolossa and R. Orglmeister, TU Berlin. Overview: • Introduction: automatic speech recognition. • Problem: imperfect noise suppression. • Proposed solution: uncertainty propagation. • Tests and results. • Conclusions.

  2. Automatic Speech Recognizer (ASR) • Feature extraction transforms the signal into a domain more suitable for recognition. • The speech recognizer models abstract speech components such as phonemes or triphones and generates a transcription. • Most speech recognition applications need noise suppression as a preprocessing step.

  3. Feature Extraction • Non-linear transformations that imitate the way humans process speech. • Robust against inter-speaker and intra-speaker variability. • Examples: Mel-cepstral and RASTA-PLP transformations.

  4. Speech Recognition Example: Mel-cepstral features • Statistical models are used to model speech. • Hidden Markov models with multivariate Gaussian mixtures for the emitting states.

  5. Noise Suppression • Most methods obtain an estimate of the short-time Fourier transform (STFT) of the clean signal. • MMSE-LSA Bayesian estimation [Ephraim1985] is one of the most widely used. • Leaves residual noise. • Introduces artifacts in the speech. Problem: imperfect estimation.

  6. Solution: Modeling Uncertainty of Estimation • We model each element of the STFT as a complex Gaussian random variable. • The mean is set equal to the estimated clean value. • A variance parameter controls the uncertainty.

  7. Propagation of Uncertainty • We propagate the first- and second-order moments of the distributions. • Correlations between features appear (non-zero covariances). • The resulting uncertainty is combined with the statistical model parameters for robust speech recognition.

  9. Approaches to Uncertainty Propagation • Analytic solutions: imply complex calculations; specific to each transformation. • Pseudo-Monte Carlo: unscented transform [Julier1996]; inefficient for a high number of dimensions (e.g. STFT, 256 dim./frame). ► Piecewise propagation: efficient combination of both methods; valid for many feature extractions (e.g. MELSPEC, MFCC, RASTA-PLP).

  10. Piecewise Uncertainty Propagation • Exemplified with the Mel-cepstral transformation: • Modulus extraction (non-linear). • Filterbank (linear). • Logarithm (non-linear). • Discrete cosine transform (linear). • Delta and acceleration coefficients (linear).
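The chain of steps on this slide can be sketched, for the deterministic (no uncertainty) case, as follows. This is only an illustration: the filterbank matrix `W`, the unnormalized DCT-II matrix and the function names are assumptions, not taken from the slides, and the delta and acceleration coefficients are omitted.

```python
import numpy as np

def dct_matrix(n_out, n_in):
    """Unnormalized DCT-II matrix of shape (n_out, n_in)."""
    k = np.arange(n_out)[:, None]
    m = np.arange(n_in)[None, :]
    return np.cos(np.pi * k * (m + 0.5) / n_in)

def mel_cepstrum(stft_frame, W, n_ceps):
    """Mel-cepstral features of one STFT frame (hypothetical helper)."""
    modulus = np.abs(stft_frame)                       # modulus extraction (non-linear)
    mel = W @ modulus                                  # filterbank (linear)
    log_mel = np.log(mel)                              # logarithm (non-linear)
    return dct_matrix(n_ceps, W.shape[0]) @ log_mel    # DCT (linear)
```

The alternation of linear and non-linear blocks is what makes a piecewise treatment attractive: each block can be handled with the method that suits it best.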

  11. Propagation through Modulus • By integrating out the phase of a complex Gaussian distribution we obtain the Rice distribution. • Its mean and variance can be calculated in closed form, • where L is the Laguerre polynomial.
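The slide's formulas did not survive in this transcript, so the following is a sketch based on the standard moments of the Rice distribution rather than on the authors' notation. It assumes a complex Gaussian with mean `mu` and total variance `lam` (so each of the real and imaginary parts has variance `lam / 2`); both names are illustrative. The order-1/2 Laguerre function is evaluated through a plain power series, which is adequate for moderate values of |mu|²/lam but overflows for very large ones.

```python
import math

def laguerre_half(x, max_terms=10000):
    """L_{1/2}(x) for x <= 0, computed as the confluent series 1F1(-1/2; 1; x)."""
    y = -x
    if y < 0:
        raise ValueError("this sketch only handles x <= 0")
    # Kummer's transformation: 1F1(-1/2; 1; -y) = exp(-y) * 1F1(3/2; 1; y).
    # The series on the right has only positive terms, so it sums stably.
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (1.5 + k) / (1.0 + k) * y / (k + 1.0)
        total += term
        if term < 1e-17 * total:
            break
    return math.exp(-y) * total

def modulus_moments(mu, lam):
    """Mean and variance of |X| for X ~ complex Gaussian(mean mu, variance lam)."""
    if lam <= 0:
        return abs(mu), 0.0                  # no uncertainty: |X| is deterministic
    mean = 0.5 * math.sqrt(math.pi * lam) * laguerre_half(-abs(mu) ** 2 / lam)
    var = abs(mu) ** 2 + lam - mean ** 2     # E|X|^2 = |mu|^2 + lam
    return mean, max(var, 0.0)

# Example call with a hypothetical STFT estimate and uncertainty.
mean_mod, var_mod = modulus_moments(1.0 + 2.0j, 0.5)
```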

  12. Propagation through Filterbank • Each filter output m is a weighted sum of frequency moduli. • It can be expressed as a matrix multiplication. • Because the transformation is linear, mean and variance can be calculated in closed form.
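For any linear transformation y = Wx, the mean becomes W times the input mean and the covariance becomes W Σ Wᵀ. A minimal sketch of this rule, using a hypothetical toy filterbank `W` with two overlapping filters over four frequency bins (all names and numbers are illustrative):

```python
import numpy as np

def propagate_linear(W, mean, cov):
    """Propagate mean and covariance through the linear map y = W @ x."""
    W = np.asarray(W, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    if cov.ndim == 1:            # a vector of variances is read as a diagonal covariance
        cov = np.diag(cov)
    return W @ mean, W @ cov @ W.T

# Hypothetical toy filterbank: 2 overlapping triangular filters, 4 frequency bins.
W = np.array([[0.5, 1.0, 0.5, 0.0],
              [0.0, 0.5, 1.0, 0.5]])
mean_mod = np.array([1.0, 2.0, 3.0, 4.0])   # means of the frequency moduli
var_mod = np.array([0.1, 0.2, 0.3, 0.4])    # their variances (uncorrelated bins)

mean_mel, cov_mel = propagate_linear(W, mean_mod, var_mod)
```

Because the two filters share frequency bins, `cov_mel` has non-zero off-diagonal entries even though the input bins were uncorrelated. The same function applies to any other linear stage by swapping in the corresponding matrix.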

  13. Full Covariance and Other Linear Transformations • The covariance after the filterbank is no longer diagonal. • This adds computational cost. • DCT, delta and acceleration coefficients can be computed similarly.

  14. Propagation through Logarithm • Non-linear transformation. • The distribution after the filterbank is difficult to model and its covariance is not diagonal. • The dimensionality of the Mel features is much smaller than that of the STFT features. • ► The unscented transform can be applied efficiently.

  15. Unscented Transform • Only 2n + 1 points must be propagated: • the mean and points on a covariance contour. • n = feature dimension. • Example for n = 2.

  16. Unscented Transform II • Mean and covariance are calculated by using weighted averages of the propagated points. • A scaling parameter allows higher moments of the distribution to be considered.
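A minimal sketch of the unscented transform in its common textbook form follows. The slides' own formulas are missing from this transcript, so the weights, the symbol `kappa` for the scaling parameter and its default value are assumptions.

```python
import numpy as np

def unscented_transform(f, mean, cov, kappa=1.0):
    """Approximate the mean and covariance of f(x) for x ~ N(mean, cov)."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    n = mean.size
    # Matrix square root: S @ S.T == (n + kappa) * cov.
    S = np.linalg.cholesky((n + kappa) * cov)
    # 2n + 1 sigma points: the mean, plus and minus each column of S.
    sigma = np.vstack([mean, mean + S.T, mean - S.T])
    weights = np.full(2 * n + 1, 0.5 / (n + kappa))
    weights[0] = kappa / (n + kappa)
    # Propagate every sigma point through the non-linearity.
    Y = np.array([f(p) for p in sigma])
    y_mean = weights @ Y
    D = Y - y_mean
    y_cov = (weights[:, None] * D).T @ D
    return y_mean, y_cov

# Example: logarithm of two Mel filter outputs with correlated uncertainty.
log_mean, log_cov = unscented_transform(
    np.log, [4.0, 6.0], [[0.3, 0.25], [0.25, 0.45]])
```

For the logarithm, every sigma point must stay positive, which this sketch does not check; a practical implementation would need to guard against that.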

  17. Use of Uncertainty • After propagation of uncertainty, missing feature techniques or uncertainty decoding may be applied. • These techniques combine uncertainty and model information to ignore or re-estimate noisy features.
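As an illustration of how uncertainty and model information can be combined, the sketch below shows one common form of uncertainty decoding: the feature variance is added to the variance of a Gaussian model component before the likelihood is evaluated. Diagonal covariances are assumed for brevity, and the function names are hypothetical; the slides do not specify which variant was used.

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance."""
    x, mean, var = (np.asarray(a, dtype=float) for a in (x, mean, var))
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var))

def uncertainty_decoding_loglik(feat_mean, feat_var, model_mean, model_var):
    """Score an uncertain feature: model variance inflated by feature variance."""
    total_var = np.asarray(model_var, dtype=float) + np.asarray(feat_var, dtype=float)
    return log_gauss_diag(feat_mean, model_mean, total_var)
```

A feature with large uncertainty thus contributes a flatter likelihood and has less influence on the decoding decision.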

  18. Use of Uncertainty II • Modified imputation [Kolossa2005] showed the best performance. • It re-estimates the features for state q by maximizing the probability, • assuming multivariate Gaussian distributions for both the uncertainty and the model.
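The formulas for this slide are missing from the transcript. A sketch under the stated Gaussian assumptions: if the re-estimated feature maximizes the product of the uncertainty Gaussian and the state Gaussian, the maximizer is the precision-weighted average of the two means. Whether this matches the exact expression in [Kolossa2005] is an assumption here, and all names are illustrative.

```python
import numpy as np

def modified_imputation(feat_mean, feat_cov, model_mean, model_cov):
    """Maximizer of N(x; feat_mean, feat_cov) * N(x; model_mean, model_cov)."""
    feat_mean = np.asarray(feat_mean, dtype=float)
    model_mean = np.asarray(model_mean, dtype=float)
    P_x = np.linalg.inv(np.asarray(feat_cov, dtype=float))    # feature precision
    P_q = np.linalg.inv(np.asarray(model_cov, dtype=float))   # model precision
    # Solve (P_x + P_q) x = P_x feat_mean + P_q model_mean.
    return np.linalg.solve(P_x + P_q, P_x @ feat_mean + P_q @ model_mean)

# A reliable first feature stays near its observation; an unreliable second
# feature is pulled toward the model mean.
x_hat = modified_imputation([2.0, 2.0], np.diag([0.01, 100.0]),
                            [0.0, 0.0], np.eye(2))
```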

  19. Recognition Tests • TI-DIGITS database. • 200 files (20 different speakers). • Results table: best and second-best results highlighted.

  20. Conclusions • The use of uncertainty in the Mel-cepstral domain is useful to compensate for imperfect estimation during noise suppression. • Piecewise uncertainty propagation is valid for multiple feature extractions. • Better estimation of uncertainty should improve the results.

  21. Thank You! Some literature: • [Ephraim1985] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing 33, 443–445 (1985). • [Julier1996] S. Julier and J. Uhlmann, "A general method for approximating nonlinear transformations of probability distributions," Tech. rep., University of Oxford, UK (1996). • [Kolossa2005] D. Kolossa, A. Klimas and R. Orglmeister, "Separation and robust recognition of noisy, convolutive speech mixtures using time-frequency masking and missing data techniques," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005, pp. 82–85.
