Speech Recognition in Adverse Environments

Speech Recognition in Adverse Environments Juan Arturo Nolazco-Flores Dpto. de Ciencias Computacionales ITESM, campus Monterrey

Talk Overview • Introduction • Parallel Model Combination (PMC) • SS-PMC • Comments and Conclusions


Introduction • Problem: • Automatic Speech Recognition (ASR) performance is highly degraded when speech is corrupted by noise (additive noise, convolutional noise, etc.). • Fact: • To be usable in real applications, speech recognisers must tackle this problem. • Knowledge: • ASR can be improved by either: • Enhancing speech before recognition. • Training models in the same environment in which the ASR system is going to be used.

Recognition using a CD-HMM Recogniser (diagram) • Input data are scored against a model for each unit of recognition (M1, M2, ..., MQ). • Each model yields a probability. • The model with the highest probability gives the recognised word.
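The recogniser described on this slide can be sketched as follows: score the input against every model and pick the most probable one. This is a minimal illustration, assuming single-Gaussian diagonal-covariance state pdfs and a log-domain forward algorithm; the function names (`log_gauss`, `forward_loglik`, `recognise`) and the model tuple layout are hypothetical and not taken from the slides.

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian, evaluated per frame."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def forward_loglik(obs, log_pi, log_A, means, variances):
    """Log P(obs | model) via the forward algorithm in the log domain.

    obs: (T, D) feature frames; means, variances: (S, D) per-state parameters;
    log_pi: (S,) log initial probabilities; log_A: (S, S) log transition matrix.
    """
    # Emission log-likelihoods for every frame and state, shape (T, S).
    log_b = np.stack(
        [log_gauss(obs, means[s], variances[s]) for s in range(len(means))], axis=1
    )
    alpha = log_pi + log_b[0]
    for t in range(1, len(obs)):
        # Sum over predecessor states, then apply the emission term.
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_b[t]
    return np.logaddexp.reduce(alpha)

def recognise(obs, models):
    """Return the label of the highest-scoring model and all scores."""
    scores = {label: forward_loglik(obs, *params) for label, params in models.items()}
    return max(scores, key=scores.get), scores
```

Here `models` maps each word label to a tuple `(log_pi, log_A, means, variances)`; a real system would use mixtures of Gaussians and trained parameters.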

Enhancing Speech • Features: • Models are trained with clean speech. • Corrupted speech is enhanced. • There are a number of well-studied techniques: • Subtracting a noise estimate obtained during non-speech activity. • Adaptive noise cancelling (ANC). • Successful for low to medium SNR (>0 dB).

Problems: • Enhancers are not perfect, therefore: • the speech is distorted, and • there is residual noise.

Training models in the same environment • ASR systems that use this technique can deal with low to high SNR (>0 dB). • For example, for an isolated digit recognition task where digits are corrupted by helicopter (Lynx) noise, you can get the following performance: • For TIMIT • Problem: • There are many possible environments (not practical).

However, using continuous HMMs it is possible to combine the clean speech model and the noise model to obtain a noisy speech model. • Techniques: • Model Decomposition • Parallel Model Combination

Parallel Model Combination (PMC) • Introduction • Scheme • Diagram

Introduction • It is an artificial way to simulate that the system has been trained in the adverse environment in which it is going to work. • The clean speech CHMM and the noise CHMM (estimated from the noise before the word is uttered) are combined to obtain models adapted to the adverse environment. • The combination is based on the assumption that the pdfs of the state distributions are completely defined by their means and variances.

Scheme • For simplicity, it is convenient to combine these models in a linear domain. • Problem: • High performance speech recognition is obtained in a non-linear domain (i.e. mel-cepstral domain). • Solution: • Transform coefficients to a linear domain.

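Diagram • Clean speech HMM: C->log, then exp() into the linear domain. • Noise HMM: C->log, then exp() into the linear domain. • The two models are added (+) in the linear domain; log() and C() then give the PMC HMM. • Simulates training in noise.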
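The chain in the diagram (cepstrum to log spectrum, exp() to the linear domain, addition, log(), back to cepstrum) can be sketched for a single pair of Gaussian state pdfs. This is a minimal sketch under stated assumptions: it uses the common log-normal approximation for the exp()/log() steps, a square orthonormal DCT for C, and a hypothetical `gain` term for level matching; none of these details are confirmed by the slides.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix (n x n); its inverse is its transpose."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

def log_to_linear(mu_log, cov_log):
    """Moments of exp(x) when x is Gaussian (log-normal moments)."""
    mu = np.exp(mu_log + 0.5 * np.diag(cov_log))
    cov = np.outer(mu, mu) * (np.exp(cov_log) - 1.0)
    return mu, cov

def linear_to_log(mu, cov):
    """Inverse mapping: Gaussian parameters of log(y) under a log-normal fit."""
    cov_log = np.log(cov / np.outer(mu, mu) + 1.0)
    mu_log = np.log(mu) - 0.5 * np.diag(cov_log)
    return mu_log, cov_log

def pmc_combine(mu_s_cep, cov_s_cep, mu_n_cep, cov_n_cep, gain=1.0):
    """Combine clean-speech and noise Gaussians given in the cepstral domain."""
    C = dct_matrix(len(mu_s_cep))
    C_inv = C.T

    def cep_to_linear(mu_c, cov_c):
        # C->log followed by exp() in the diagram.
        return log_to_linear(C_inv @ mu_c, C_inv @ cov_c @ C_inv.T)

    mu_s, cov_s = cep_to_linear(mu_s_cep, cov_s_cep)
    mu_n, cov_n = cep_to_linear(mu_n_cep, cov_n_cep)
    # Speech and noise are assumed additive and independent in the linear domain.
    mu_y = gain * mu_s + mu_n
    cov_y = gain ** 2 * cov_s + cov_n
    # log() followed by C() in the diagram.
    mu_y_log, cov_y_log = linear_to_log(mu_y, cov_y)
    return C @ mu_y_log, C @ cov_y_log @ C.T
```

In a full system this would be applied to every state (and mixture component) of the clean speech models, using the noise model estimated before the utterance.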

SS-PMC • Introduction • Hypothesis proof • SS Combination Development • Diagram • Results

Introduction • How can we improve recognition performance in highly adverse environments (SNR < 0 dB)? • PMC does not represent a solution for highly adverse environments (upper boundary conditions).

On the other hand, we know that the enhancer returns “cleaner” but distorted speech. • Therefore the question is: • Is it possible to improve recognition performance if the models were trained with this “cleaner” speech?

Hypothesis • Training HMMs with enhanced speech makes the HMM learn both the speech distortion and the residual noise. • If we show that this hypothesis is true, we can be confident that indeed we can improve recognition performance.

To prove this hypothesis: • An enhancement scheme was selected. • Models were trained with the enhanced speech. • Recognition performance was evaluated under the same conditions. • The recognition performance obtained in this experiment is compared with that obtained when models are trained in the same environment.

Hypothesis Proof • Introduction • Spectral Subtraction definition • Experiments and results • Conclusions

Introduction • Since it is a simple (and successful) scheme, Spectral Subtraction (SS) was selected.

Spectral Subtraction Definition • Before filterbank • After filterbank.
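As a generic illustration of spectral subtraction (not the exact definitions on this slide), the sketch below subtracts a scaled noise estimate from each frame and applies a spectral floor. It is a minimal sketch, assuming the noise is estimated from the first frames before speech starts; the over-subtraction factor `alpha`, the floor `beta`, and the parameter `n_noise_frames` are assumptions. The same function works on magnitude spectra (before the filterbank) or on filterbank outputs (after the filterbank).

```python
import numpy as np

def spectral_subtraction(mag, n_noise_frames=10, alpha=1.0, beta=0.01):
    """Subtract an estimated noise spectrum from every frame.

    mag: (frames, bins) non-negative spectra or filterbank outputs.
    Returns the enhanced frames and the noise estimate that was used.
    """
    # Noise estimate: average of the leading frames, assumed to be non-speech.
    noise_est = mag[:n_noise_frames].mean(axis=0)
    cleaned = mag - alpha * noise_est
    # Flooring avoids negative values and limits residual "musical" noise.
    enhanced = np.maximum(cleaned, beta * mag)
    return enhanced, noise_est
```

Larger `alpha` removes more noise at the cost of more speech distortion, which is the trade-off the models are later expected to learn.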

Experiments and Results • CHMMs were trained on speech enhanced by SS. • Recognition performance was evaluated on speech enhanced by SS under the same conditions.

Example 1 • Task: isolated digit recognition • Training: Using enhanced speech • Noise: Helicopter • Database: NOISEX-92 • Real noise is artificially added to clean speech, so that no Lombard effect can bias recognition performance.
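Adding recorded noise to clean speech at a chosen SNR, as in this experimental setup, can be sketched as below. This is a minimal sketch, assuming SNR is defined from average power over the whole utterance; the function name `add_noise_at_snr` is hypothetical.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power matches snr_db, then add."""
    # Repeat or truncate the noise recording to the length of the utterance.
    noise = np.resize(noise, len(speech))
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Required noise power is p_speech / 10^(snr_db / 10).
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

Because the speech itself was recorded in quiet, the speaker does not change articulation in response to the noise, which is why the Lombard effect is excluded.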

Results • bMSS: Training Models in Noise (PMC). These values represent the upper boundary of the ASR system.

bPSS Training Models in Noise (PMC)

Example 2: • Vocabulary: 30 words (numbers, e.g. “dos mil quinientos dólares”).

Example 3: • TIMIT

Conclusions • The hypothesis was proven to be true. • A new research area is open: • Try these experiments using other databases. • How can we combine CHMMs so that we do not need to train for all enhancement conditions? • Are all enhancement techniques suited for CHMM combination?

Now we know that ASR can be improved by either: • Enhancing speech before recognition. • Training CHMMs in the same environment in which the ASR system is going to be used. • Training CHMMs with the same enhancement technique that is used to get “cleaner” speech at recognition. • Advantage: • Training with a better enhancement technique means potentially better recognition performance.

SS Model Combination • Introduction • Spectral Subtraction Scheme

Introduction • It was shown that when CHMMs are trained and tested under the same enhancement condition, recognition performance improves. • How can we combine CHMMs without having to train for each enhancement and noise condition? • Observation: For CHMMs, the state pdfs are completely defined by their means and variances.

Spectral Subtraction Scheme • Assume Y and YD can be modelled as parametric distributions with means E[Y] and E[YD] and variances V[Y] and V[YD]. • It can be shown that these parameters are distorted as follows: (figure: pdf of Y)

Proof: where Re-arranging

Hence:

A(a,P(Y)) Assuming that Y is lognormal: Making ( )
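As a numerical illustration of how spectral subtraction distorts the mean and variance of a lognormal Y (a Monte Carlo check, not the closed-form result derived on these slides), the sketch below samples Y, applies SS, and measures E[YD] and V[YD]. The SS rule with factor `alpha` and floor `beta`, the fixed noise estimate `noise_est`, and the function name `ss_moments` are all assumptions.

```python
import numpy as np

def ss_moments(mu_log, sigma_log, noise_est, alpha=1.0, beta=0.01,
               n_samples=200_000, seed=0):
    """Estimate (E[Y], V[Y]) and (E[YD], V[YD]) by sampling.

    Y is drawn from a lognormal with log-domain mean mu_log and standard
    deviation sigma_log; YD is Y after spectral subtraction with flooring.
    """
    rng = np.random.default_rng(seed)
    y = rng.lognormal(mean=mu_log, sigma=sigma_log, size=n_samples)
    # Assumed SS rule: subtract the scaled noise estimate, floor at beta * Y.
    yd = np.maximum(y - alpha * noise_est, beta * y)
    return (y.mean(), y.var()), (yd.mean(), yd.var())
```

Comparing the two pairs of moments for several values of `alpha` shows how strongly the enhancer shifts the parameters that define each state pdf.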

Diagram (SS-PMC) • Clean speech HMM: C->log, then exp(). • Noise HMM: C->log, then exp(). • Adaptation calculations and the PMC addition (+) are applied; log() and C() then give the SS-PMC HMM. • Speech is pre-processed using SS.

Results • Systems compared: • No compensation scheme • Spectral Subtraction • PMC • Spectral Subtraction and parallel model combination

Comments and Conclusions • Training and recognition with the same speech enhancement scheme had not been tried before, so a new area of research is now open: • How can we combine CHMMs so that we do not need to train for all enhancement conditions? • Are all enhancement techniques suited for CHMM combination? • We showed how to combine clean speech and noise CHMMs for the SS scheme. • The equations for CHMM combination under the SS scheme turned out to be straightforward.

We expect that training with a better enhancement technique will also give better recognition performance. • Future work: • Develop equations and experiments for other enhancement techniques. • Obtain the optimal alpha for the SS scheme.
