170 likes | 268 Vues
Joint Constrained Maximum Likelihood Linear Regression for Overlapping Speech Recognition. Kenichi Kumatani, Rita Singh, Friedrich Faubel , John McDonough, Youssef Oualil. Motivation of Overlapping Speech Recognition. Overlapping speech is often observed in natural conversation.
E N D
Joint Constrained Maximum Likelihood Linear Regression for Overlapping Speech Recognition Kenichi Kumatani, Rita Singh, Friedrich Faubel, John McDonough, Youssef Oualil
Motivation of Overlapping Speech Recognition Overlapping speech is often observed in natural conversation. • It was reported that 30 % to 50 % of all utterances contain interfering speech from another speaker in meetings and telephone conversations. We want to recognize overlapping speech in such situations. It is also important to avoid wearing intrusive devices such as close-talking microphones. We develop a distant overlapping speech recognition system.
Technical Issues • Distant speech recognition performance is degraded when multiple talkers speak simultaneously. • Microphone array techniques can improve accuracy of overlapping speech recognition. • However, speech separation performance of the microphone array is in practice imperfect because of steering errors and so on. • The residual speech from an interfering speaker limits the gains from the conventional speaker adaptation techniques. • In this presentation, we propose a new microphone-array-based feature-space adaptation technique robust for overlapping speech.
Conventional Speech Separation System Speaker 2 Speaker 1 Microphone Array Speaker Tracker 1 Speaker Tracker 2 Beamformer 2 + Post-filter Beamformer 1 + Post-filter Feature Extraction 1 Feature Extraction 2 CMLLR 1 CMLLR 2 MLLR 1 MLLR 2 Transcript 1 Transcript 2
Problem in Conventional Speech Separation System Speaker 2 Speaker 1 Microphone Array Speaker Tracker 1 Speaker Tracker 2 Beamformer 2 + Post-filter Beamformer 1 + Post-filter Imperfect separation / heavy distortion Feature Extraction 1 Feature Extraction 2 CMLLR 1 CMLLR 2 MLLR 1 MLLR 2 Transcript 1 Transcript 2
Our Speech Separation System Speaker 2 Speaker 1 Microphone Array Speaker Tracker 1 Speaker Tracker 2 Beamformer 1 + Post-filter Beamformer 2 + Post-filter Feature Extraction 1 Feature Extraction 2 Joint Feature-Space Adaptation CMLLR 1 CMLLR 2 MLLR 1 MLLR 2 Transcript 1 Transcript 2
Formulation of JCMLLR • Let us denote each speech feature vector from the n-th speaker and • Consider the concatenated feature vector • Our goal is to estimate the linear transformation parameters on the concatenated vector as follows: • The non-diagonal matrices can attenuate interfering speech features by subtracting them. • With the non-diagonal matrices zero, this method will be similar with the conventional constrained maximum likelihood linear regression (CMLLR). • For each output , we can further perform the conventional CMLLR:
Estimation Algorithm for JCMLLR Feature Extraction 1 Feature Extraction 2 Joint CMLLR (JCMLLR) CMLLR 2 CMLLR 1 • Under the assumption that individual speech production processes are statistically independent, we can derive the log-probability model for the original vector as • Given observations, we have the log-likelihood function as: where ) is represented with the HMM trained with a sufficient amount of data offline. • Based on the gradient-based numerical optimization algorithm, we find the parameters that provide the maximum log-likelihood.
Summary of Our Unsupervised Adaptation Algorithm Initialize the JCMLLR parameters: and Compute the feature vector with the updated VTLN factor Compute the feature vector with the updated VTLN factor Update the JCMLLR parameters: and. Update the CMLLR parameter for the first speaker: and Update the CMLLR parameter for the first speaker and Update the model with MLLR: Update the model with MLLR: Yes Converged? No End
Test Data for Recognition Experiments: MC-WSJ-AV RT60 = 380 milliseconds • Two real humans read the WSJ sentences and real far-field speech data were captured with a circular, 8-channel microphone array with a diameter of 20 cm. • The audio data contain real reverberation, background noise from computers and the building ventilation as well as non-stationary noise such as car noise running outside. • A close-talking microphone was used to capture the cleanest signal as a reference.
Multi-Pass Speech Recognizer • We consider multi-passes for speech recognition experiments; More powerful adaptation techniques are used in the higher pass as follows. • 1st pass: VTLN & CMLLR • 2nd pass: VTLN, CMLLR & MLLR • 3rd pass: VTLN, CMLLR & MLLR & ML-SAT. • Thus, the higher pass provides the better speech recognition performance. • Every adaptation method is performed in the unsupervised way. • We employ warped minimum variance distortionless (MVDR) feature extraction and LDA is applied to capture the long-term dynamics of the speech feature with CMN.
Word Error Rates on the MC-WSJ-AV Data • Each beamforming algorithm can improve recognition performance for overlapping speech. • Even if the good speech separation performance is achieved at a signal level, the recognition performance can be further improved with JCMLLR. • The more beamformers’ outputs contain residual interfering speech, the more JCMLLR provides relative improvement; see the case of D&S beamforming.
Conclusion • We have proposed the new method which jointly estimates the CMLLR parameters for the feature vectors from multiple beamformers. • It has been demonstrated through a set of speech recognition experiments on the real array data that the new method can improve recognition performance for overlapping speech. • We have also showed that the cascade of JCMLLR and conventional speaker adaptation methods further improves recognition performance.
Word Error Rates on the MC-WSJ-AV Data • Each beamforming algorithm can improve recognition performance for overlapping speech. • Even if the good speech separation performance is achieved at a signal level, the recognition performance can be further improved with JCMLLR. • The more beamformers’ outputs contain residual interfering speech, the more JCMLLR provides relative improvement; see the case of D&S beamforming.
Organization of Presentation • Background • Conventional Overlapping Speech Recognition Method • Our Method • Joint Constrained Maximum Likelihood Regression (JCMLLR) Algorithm • Experiments • Conclusion