
AUDIO-VISUAL SPEAKER IDENTIFICATION USING THE CUAVE DATABASE

David Dean, Patrick Lucey and Sridha Sridharan
Speech, Audio, Image and Video Research Laboratory, QUT

ABSTRACT

The freely available nature of the CUAVE database allows it to provide a valuable platform for forming benchmarks and comparing research. This paper shows that the CUAVE database can successfully be used to test speaker identification systems, with performance comparable to existing systems implemented on other databases. Additionally, this research shows that the optimal configuration for decision fusion of an audio-visual speaker identification system relies heavily on the video modality in all but clean speech conditions.

[System block diagram: lip location and tracking feeds acoustic (MFCC) and visual (PCA) feature extraction; each feature stream is scored against acoustic and visual speaker models, and the two score streams are combined by decision fusion into a speaker decision.]

LIP LOCATION AND TRACKING

[Figure panel: lip location and tracking.]

MODELLING AND FUSION

For these experiments, text-dependent speaker modelling was performed, meaning that the speakers say the same utterance for both training and testing. Ten experiments were run, one for each digit in the CUAVE database. Using phoneme transcriptions obtained in earlier research on the CUAVE database, speaker-independent HMMs were trained for each phoneme separately in both the audio and visual modalities. The structure of these HMMs was identical for both modalities, with 3 hidden states for the phone models and 1 hidden state for the short-pause model. These speaker-independent models were then adapted into speaker-dependent models using MLLR adaptation. For each word under test in the CUAVE database, the scores of the ten most likely speakers in either modality are combined using weighted-sum late fusion of the normalised scores from each modality. The fusion experiments were conducted over a range of audio-weight (α) values to discover the best fusion parameters at each noise level. (Illustrative sketches of the feature extraction, modelling and fusion steps are given after the feature-extraction details below.)

RESULTS

Figure A shows the response of the fused system to speech-babble audio corruption over a range of signal-to-noise ratios (SNRs). The dashed line indicates the best possible performance at each noise level. The performance of this system is comparable to existing research on XM2VTS and other audio-visual databases. From Figure A, it would appear that the best non-adaptive fusion performance is obtained with relatively low audio weights. This is confirmed by Figure B, which shows that the optimal audio weight at each noise level favours the visual domain (is below 0.5) for all but clean acoustic speech.

FEATURE EXTRACTION

Acoustic features were represented by the first 15 Mel-frequency cepstral coefficients (MFCCs), a normalised energy coefficient, and the first and second derivatives of these 16 coefficients. Acoustic features were captured at 100 samples/second. Visual features were represented by a 50-dimensional vector, obtained by projecting each tracked-lip image into a PCA space built from 1,000 randomly selected tracked-lip images. Visual features were captured at 29.97 samples/second.
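The feature-extraction details above can be illustrated with a short sketch. This is not the authors' implementation: librosa, scikit-learn, the 16 kHz sample rate and the helper names (acoustic_features, fit_visual_pca, visual_features) are assumptions used for illustration; only the dimensions and frame rates come from the poster.

```python
# Hedged sketch of the two feature streams described above. Library choices,
# sample rate and helper names are assumptions; dimensions and frame rates
# follow the poster (15 MFCCs + energy + deltas at 100 frames/s; 50-dim PCA
# lip features at the 29.97 Hz video rate).
import numpy as np
import librosa
from sklearn.decomposition import PCA

def acoustic_features(wav_path):
    """15 MFCCs + a normalised log-energy coefficient, plus first and second
    derivatives of these 16 coefficients (48-dim vectors, ~100 per second)."""
    y, sr = librosa.load(wav_path, sr=16000)          # 16 kHz is an assumption
    hop = sr // 100                                   # 10 ms hop -> 100 frames/second
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=15, hop_length=hop)
    log_e = np.log(librosa.feature.rms(y=y, hop_length=hop) + 1e-10)
    log_e = (log_e - log_e.mean()) / (log_e.std() + 1e-10)   # normalised energy
    static = np.vstack([mfcc, log_e])                 # 16 static coefficients
    feats = np.vstack([static,
                       librosa.feature.delta(static),           # first derivatives
                       librosa.feature.delta(static, order=2)]) # second derivatives
    return feats.T                                    # shape (frames, 48)

def fit_visual_pca(training_lip_images, n_components=50):
    """Fit a 50-dimensional PCA space on (e.g. 1,000 randomly selected)
    tracked-lip images, each flattened to a pixel vector."""
    X = np.stack([img.ravel() for img in training_lip_images])
    return PCA(n_components=n_components).fit(X)

def visual_features(pca, lip_images):
    """Project each tracked-lip image of an utterance into the PCA space,
    giving one 50-dim vector per video frame (29.97 frames/second)."""
    X = np.stack([img.ravel() for img in lip_images])
    return pca.transform(X)
```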

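The modelling step (3-state phone HMMs and a 1-state short-pause model, trained separately per modality and later adapted to each speaker) might be sketched as follows. hmmlearn is an assumed stand-in and the function names are hypothetical; it provides no MLLR, so the speaker-adaptation step is only indicated by a comment.

```python
# Hedged sketch of the speaker-independent phone modelling described above,
# using hmmlearn as an assumed stand-in for the original tool chain.
import numpy as np
from hmmlearn import hmm

def train_phone_hmm(segments, n_states):
    """Train one Gaussian HMM from a list of (frames, dim) feature segments
    belonging to a single phone (in one modality)."""
    X = np.concatenate(segments)                 # stack all observation frames
    lengths = [len(s) for s in segments]         # segment boundaries for hmmlearn
    model = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag", n_iter=20)
    model.fit(X, lengths)
    return model

def train_modality_models(phone_segments):
    """phone_segments: dict mapping a phone label to its feature segments
    (acoustic MFCC segments or visual PCA segments). The short-pause model
    ('sp') gets 1 hidden state; all other phone models get 3."""
    models = {}
    for phone, segments in phone_segments.items():
        n_states = 1 if phone == "sp" else 3
        models[phone] = train_phone_hmm(segments, n_states)
    return models

# Speaker-dependent models would then be obtained by adapting these
# speaker-independent models to each speaker's enrolment data
# (MLLR adaptation in the original work; not shown here).
```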
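Finally, the weighted-sum late fusion and the sweep over the audio weight α could look like the sketch below. The min-max score normalisation and the helper names (fuse_scores, best_alpha) are assumptions; the poster states only that normalised scores from the two modalities are combined by a weighted sum and that α is swept at each noise level.

```python
# Hedged sketch of weighted-sum late fusion of normalised audio and visual
# speaker scores, with a sweep over the audio weight alpha. The min-max
# normalisation is an assumption; the poster does not name the method.
import numpy as np

def minmax_normalise(scores):
    """Map one modality's candidate-speaker scores onto [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-10)

def fuse_scores(audio_scores, visual_scores, alpha):
    """Weighted sum: alpha weights audio, (1 - alpha) weights video."""
    return (alpha * minmax_normalise(audio_scores)
            + (1.0 - alpha) * minmax_normalise(visual_scores))

def best_alpha(trials, alphas=np.linspace(0.0, 1.0, 21)):
    """trials: list of (audio_scores, visual_scores, true_speaker_index)
    tuples for one noise level. Returns the alpha with the highest
    identification rate, mirroring the poster's per-noise-level sweep."""
    rates = []
    for alpha in alphas:
        correct = sum(int(np.argmax(fuse_scores(a, v, alpha)) == truth)
                      for a, v, truth in trials)
        rates.append(correct / len(trials))
    best = int(np.argmax(rates))
    return alphas[best], rates[best]
```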