
The Robustness of MFCCs in Phoneme-Based Speaker Recognition using TIMIT

Rio Akasaka ’09, Youngmoo Kim, Ph.D.*
Department of Linguistics/Engineering, Swarthmore College
*Drexel University



Introduction

Mel-frequency cepstral coefficients (MFCCs) are quantitative representations of speech and are commonly used to label sound files. They are derived by taking the Fourier transform of a signal and mapping the result onto the mel scale, an auditory perception-based scale of pitch differences. With these unique labels on speech files, the similarity between two files can be measured by the Kullback–Leibler (KL) distance, which compares probability distributions, and, given a training set upon which to base the decision, the corresponding speaker can be identified. A minimal sketch of this pipeline appears below, after the corpus description.

The goal of this research is to test the robustness of MFCCs in speaker detection by varying the testing and training parameters with the following methods:
1) using segments of a whole speech file,
2) varying the number of speech files used, and
3) splicing together the vowel phones of a speech segment (see the second sketch below).

The TIMIT Corpus

The TIMIT corpus was created as a joint effort between Texas Instruments (TI) and MIT and consists of time-aligned orthographic, phonetic and word transcriptions for each of its 6300 16-bit, 16 kHz speech files. 630 speakers from the 8 major dialects of American English each read 10 'phonetically rich' texts, 2 of which are common across all speakers. For example:

PHONETIC      10160 10733 y
              10733 11880 axr
WORD          10160 11880 your
ORTHOGRAPHIC  0 57140 She had your dark suit in greasy wash water all year.

To investigate the distribution of the phonemes in TIMIT, a plot of phoneme frequencies across the corpus was generated: the individual texts may be phonetically rich, but taken as a whole the distribution of the phonemes is unbalanced.
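The poster describes this pipeline but does not reproduce its implementation. The following is a minimal Python sketch of the approach, assuming librosa for audio loading and MFCC extraction, a single Gaussian fit to each file's MFCC frames, and the symmetrized KL divergence between those Gaussians as the distance; the function and variable names are illustrative, not taken from the study's code.

    import numpy as np
    import librosa

    def gaussian_stats(wav_path, n_mfcc=13):
        # Fit a single Gaussian (mean vector, covariance matrix) to the
        # MFCC frames of one speech file.
        y, sr = librosa.load(wav_path, sr=16000)                 # TIMIT audio is 16 kHz
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, frames)
        return mfcc.mean(axis=1), np.cov(mfcc)

    def kl_gauss(p, q):
        # Closed-form KL(N_p || N_q) between two multivariate Gaussians.
        (mu0, c0), (mu1, c1) = p, q
        k = mu0.size
        inv1 = np.linalg.inv(c1)
        diff = mu1 - mu0
        _, logdet0 = np.linalg.slogdet(c0)
        _, logdet1 = np.linalg.slogdet(c1)
        return 0.5 * (np.trace(inv1 @ c0) + diff @ inv1 @ diff - k + logdet1 - logdet0)

    def kl_distance(p, q):
        # Symmetrized KL, a common choice when KL is used as a "distance".
        return kl_gauss(p, q) + kl_gauss(q, p)

    def predict_speaker(test_wav, train_models):
        # Return the speaker whose training model is closest to the test file.
        test_stats = gaussian_stats(test_wav)
        return min(train_models, key=lambda spk: kl_distance(test_stats, train_models[spk]))

Here train_models would map each TIMIT speaker ID to Gaussian statistics computed over that speaker's training material (for example, the pooled MFCC frames of five full files), and the percentages reported in the Results correspond to the fraction of test files for which the predicted speaker matches the actual one.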
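Method 3, which produces the vowel-only 'V' files used in the Results, relies on TIMIT's time-aligned phonetic transcriptions, whose .PHN rows follow the start-sample / end-sample / label layout shown in the corpus example above. Below is a minimal sketch of splicing the vowel phones out of a single utterance; the vowel label set is an illustrative subset rather than the study's exact inventory, and the soundfile library is assumed to handle TIMIT's NIST SPHERE audio.

    import numpy as np
    import soundfile as sf

    # Illustrative subset of TIMIT vowel labels (not necessarily the study's set).
    VOWELS = {"iy", "ih", "eh", "ae", "aa", "ah", "ao", "uh", "uw",
              "er", "ey", "ay", "oy", "aw", "ow", "ax", "ix", "axr"}

    def read_phn(phn_path):
        # Parse a .PHN file into (start_sample, end_sample, label) tuples.
        with open(phn_path) as f:
            rows = [line.split() for line in f]
        return [(int(s), int(e), lab) for s, e, lab in rows]

    def splice_vowels(wav_path, phn_path):
        # Concatenate only the vowel segments of one utterance; the sample
        # offsets in the .PHN file index directly into the 16 kHz waveform.
        audio, sr = sf.read(wav_path)
        pieces = [audio[s:e] for s, e, lab in read_phn(phn_path) if lab in VOWELS]
        return np.concatenate(pieces), sr

The output of splice_vowels can then be written out and fed to the same MFCC/KL pipeline sketched above, yielding the vowel-only files that the Results label 'V'.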
Nomenclature

The following nomenclature is adopted in this poster:
F: Full (complete) speech file
H: Speech file segmented at the middle
V: File consisting of vowel phones only

Results

• Control
Train: 5F, Test: 5F (not including SA)
Evaluated: 570  Correct: 493  Percentage: 0.864912

• Difference in number
Reducing both the number of training and testing files to be consistent results in an optimal (~84.2%) success rate, but only up to 3 files.
Train: 3F, Test: 5F
Evaluated: 570  Correct: 440  Percentage: 0.771930
Train: 5F, Test: 3F
Evaluated: 342  Correct: 292  Percentage: 0.853801
Train: 3F, Test: 3F
Evaluated: 342  Correct: 290  Percentage: 0.847953

• Difference in size
Performance is considerably better with full speech files during testing, regardless of which half of the file is used.
Train: 5F, Test: 5H
Evaluated: 570  Correct: 327  Percentage: 0.573684
Train: 5H, Test: 5F
Evaluated: 570  Correct: 442  Percentage: 0.775439
Train: 5H, Test: 3F
Evaluated: 342  Correct: 277  Percentage: 0.809942

• Vowels only
Individual phonemes are exceedingly difficult to use when predicting a speaker from entire wav files.
Train: 5V, Test: 3F
Evaluated: 342  Correct: 242  Percentage: 0.707602
Train: 5F, Test: 5V
Evaluated: 570  Correct: 53  Percentage: 0.092982

• Vowels only, cont.
If individual phoneme files are used for both training and testing, the results are impressive.
Train: 5V, Test: 5V
Evaluated: 570  Correct: 533  Percentage: 0.935088
Train: 3V, Test: 3V
Evaluated: 342  Correct: 278  Percentage: 0.812865

Figure 1. The predicted speaker ID plotted against the actual speaker, for 144 full speech files.
Figure 1. Speaker recognition based on 144 vowel-based files.

Further examination

In order to extract more information about the role that individual phones play in speaker recognition, the same algorithm was applied to recognition based on individual phonemes extracted from each speaker: the training set consists of files containing only the segments for a particular phoneme, which are then tested individually.

Figure 2. Speaker prediction based on individual phonemes. While recognition accuracy based on a single phoneme is considerably low (μ = 3.60%, σ = 2.34), the diagonal does show slightly higher recognition rates, as would be expected.

Figure 3. Predicted speaker vs. actual speaker, by speaker ID, testing the possibility that one speaker is consistently retrieved as the ideal candidate for a particular phoneme. Speaker 183 is selected most often in this scenario.

Conclusions

While optimal performance in speaker recognition is expected with a larger training set, the amount of testing material did not seem to affect performance as long as at least three files are used and the number of training files is equal to or greater than the number of testing files. This might be expected to extend to the length of the wav files as well, but it did not: testing on half of a file consistently produced poor results. Most importantly, testing and training with vowel phones only yielded impressive recognition rates of approximately 93%, meriting further study.

With regard to the contributions of individual phones to recognition, a single phoneme was not found to predict a speaker more effectively when the same phoneme is used for training than any other phoneme does. However, two phonemes consistently outperform the others in predicting a speaker: 'ae' and 'ay'. Across the five trials, 'ae' was the most highly recognized phoneme three times and 'ay' twice, and both were among the top two in four of the trials. More tests are being done to obtain a statistically significant conclusion.

Literature cited

Cole, Ronald A., et al. 1996. The Contribution of Consonants Versus Vowels to Word Recognition in Fluent Speech.
Van Heerden, C.J., and E. Barnard. 2008. Speaker-specific Variability of Phoneme Durations.
Fattah, Mohamed, Fuji Ren, and Shingo Kuroiwa. 2006. Phoneme Based Speaker Modeling to Improve Speaker Identification.

Acknowledgments

Grateful acknowledgement is made to Youngmoo Kim for providing insight and direction throughout my research, and to Jiahong Yuan for encouraging my pursuit of corpus phonetics.

For further information

Please contact rakasak1@swarthmore.edu. Further details about the methodology may be read online at wiki.rioleo.org.
