BioSec Multimodal Biometric Database in Text-Dependent Speaker Recognition

Presentation Transcript


  1. BioSec Multimodal Biometric Database in Text-Dependent Speaker Recognition D. T. Toledano, D. Hernández-López, C. Esteve-Elizalde, J. Fiérrez, J. Ortega-García, D. Ramos and J. Gonzalez-Rodriguez ATVS, Universidad Autonoma de Madrid, Spain LREC 2008, Marrakesh, Morocco, 28-30 May 2008

  2. Outline
  • 1. Introduction and Goals
  • 2. Databases for Text-Dependent Speaker Recognition
  • 3. BioSec and Related Databases
  • 4. Experiments with YOHO and BioSec Baseline
    • 4.1. Text-Dependent SR Based on Phonetic HMMs
    • 4.2. YOHO and BioSec Experimental Protocols
    • 4.3. Results with YOHO and BioSec Baseline
  • 5. Conclusions

  3. 1. Introduction and Goals
  • Text-Independent Speaker Recognition
    • Unknown lexical content
    • Research driven by yearly NIST SRE evaluations and databases
  • Text-Dependent Speaker Recognition
    • Lexical content of the test utterance is known by the system
    • Password set by the user or text prompted by the system
    • No competitive evaluations by NIST
    • Less research and fewer standard benchmarks
    • YOHO is probably the best-known benchmark
    • Newer databases are available, but results are difficult to compare
  • Goals
    • Study BioSec as a benchmark for text-dependent speaker recognition
    • Compare results on BioSec and YOHO with the same method

  4. 2. Databases for Text-Dependent Speaker Recognition
  • YOHO (Campbell & Higgins, 1994): speech
    – Clean mic. speech, 138 speakers, 24 utt. x 4 ses. for enrollment, 4 utt. x 10 ses. for test ("12-34-56")
    – Best-known benchmark
  • XM2VTS (Messer et al. 1999): speech, face
    – Clean microphone speech, 295 subjects, 4 sessions
  • BIOMET (Garcia-Salicetti et al. 2003): speech, face, fingerprint, hand, signature
    – Clean mic. speech, 130 subjects, 3 ses.
  • BANCA (Bailly-Baillière et al. 2003): speech, face
    – Clean and noisy mic. speech, 208 subjects, 12 sessions
  • MYIDEA (Dumas et al. 2005): speech, face, fingerprint, signature, hand geometry, handwriting
    – BIOMET + BANCA contents for speech, 104 subjects, 3 sessions
  • MIT Mobile Device Speaker Verification (Park and Hazen, 2006): speech
    – Mobile devices, realistic noisy conditions
  • M3 (Meng et al. 2006): speech, face, fingerprint
    – Microphone speech (3 devices), 39 subjects, 3 sessions (+108 single-session)
  • MBioID (Dessimoz et al. 2007): speech, face, iris, fingerprint, signature
    – Clean microphone speech, 120 subjects, 2 sessions

  5. 3. BioSec and Related Databases (i)
  • BioSec (Fiérrez-Aguilar et al. 2007):
    • Acquired under the FP6 EU BioSec IP
    • Sites involved: UPM, UPC, TID, MIFIN, UCOL, UTA, KULRD
    • Speech, fingerprint (3 sensors), face and iris
    • 250 subjects, 4 sessions
    • Speech is recorded using two microphones:
      • Head-mounted close-talking microphone
      • Distant webcam microphone
    • 4 utterances of the user's PIN + 3 utterances of other users' PINs (PIN = 8-digit number)
      • Simulation of informed forgeries
    • Both in English and Spanish
      • Most speakers are Spanish speakers
  • BioSec Baseline:
    • Subset of BioSec comprising 200 subjects and 2 sessions

  6. 3. BioSec and Related Databases (ii)
  • BioSecurID:
    • Speech, iris, face, handwriting, fingerprint, hand geometry and keystroking
    • Microphone speech in a realistic office-like scenario
    • 400 subjects
  • BioSecure:
    • Three scenarios: Internet, office-like and mobile
    • 1000 subjects (Internet), 700 subjects (office-like and mobile)
  • BioSecure and BioSecurID share subjects with BioSec, which allows long-term studies
  • BioSec has several other advantages over YOHO:
    • Multimodal, multilingual (Spanish/English), multichannel (close-talking and webcam)
    • Same lexical content for target trials, which allows simulation of informed forgeries
  • But it also has a clear disadvantage:
    • It is harder to compare results on BioSec

  7. 4. Experiments with YOHO and BioSec
  • Goals:
    • Study BioSec Baseline as a benchmark for text-dependent SR
    • Compare the difficulty of YOHO and BioSec Baseline
  • Goals achieved through:
    • A common text-dependent speaker recognition method
    • Clear evaluation protocols
    • Analysis of results for different conditions

  8. 4.1. Text-Dependent SR Based on Phonetic HMMs: Enrollment Phase
  • Speech parameterization (common to enrollment and test)
    • 25 ms Hamming windows with 10 ms window shift
    • 13 MFCCs + deltas + double deltas → 39 coefficients
  • Speaker-independent, context-independent phonetic HMMs used as base models
    • English: 39 phones trained on TIMIT, 3 states, left-to-right
    • Spanish: 23 phones trained on Albayzin, 3 states, left-to-right
  • Speaker-dependent phonetic HMMs built from transcribed enrollment audio
  [Block diagram: enrollment parameterized utterances + phonetic transcriptions (with optional silence) → speaker-independent models of the utterances, λI, built from the speaker-independent phonetic HMMs → Baum-Welch retraining or MLLR adaptation → speaker-dependent phonetic HMMs (speaker model)]
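
To make the front-end concrete, here is a minimal sketch of the 39-dimensional parameterization described above, written with librosa (the slides do not name the tooling; the 16 kHz sampling rate, the FFT size and the function names are assumptions):

```python
import numpy as np
import librosa

def parameterize(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return a (frames, 39) matrix: 13 MFCCs + deltas + double deltas."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=512,                    # FFT size >= 25 ms window at 16 kHz
        win_length=int(0.025 * sr),   # 25 ms Hamming window
        hop_length=int(0.010 * sr),   # 10 ms window shift
        window="hamming",
    )
    delta = librosa.feature.delta(mfcc)            # first-order deltas
    delta2 = librosa.feature.delta(mfcc, order=2)  # double deltas
    return np.vstack([mfcc, delta, delta2]).T      # (frames, 39)
```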

  9. 4.1. Text-Dependent SR Based on Phonetic HMMs: Verification Phase
  • Computation of acoustic scores for speaker-dependent and speaker-independent models
  • Acoustic scores → verification score (removing silences)
  [Block diagram: parameterized audio to verify + phonetic transcription (with optional silence) → Viterbi alignment against the speaker-independent model of the utterance, λI (speaker-independent acoustic scores), and against the speaker-dependent model, λD (speaker-dependent acoustic scores)]
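
The slides describe the score only qualitatively. A common formulation for this kind of system, assumed in the sketch below rather than taken from the paper, is the frame-normalized log-likelihood ratio between the speaker-dependent and speaker-independent Viterbi scores, computed over speech frames only:

```python
def verification_score(loglik_dep: float, loglik_indep: float,
                       n_speech_frames: int) -> float:
    """Frame-normalized log-likelihood ratio (assumed formulation).

    loglik_dep / loglik_indep: Viterbi log-likelihoods of the test
    utterance under the speaker-dependent (λD) and speaker-independent
    (λI) phonetic HMMs, accumulated over speech frames only (silence
    frames removed, as indicated on the slide).
    """
    return (loglik_dep - loglik_indep) / n_speech_frames
```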

  10. 4.2. YOHO Experimental Protocol
  • YOHO database
    • 138 speakers (106 male, 32 female)
    • Enrollment data: 4 sessions x 24 utterances = 96 utterances
    • Test data: 10 sessions x 4 utterances = 40 utterances
    • Utterance = 3 digit pairs (e.g., "twelve thirty-four fifty-six")
  • Usage of YOHO in this work
    • Enrollment: 3 different conditions
      • 6 utterances from the 1st enrollment session
      • 24 utterances from the 1st enrollment session
      • 96 utterances from the 4 enrollment sessions
    • Test: always with a single utterance
    • Target trials: 40 test utterances for each speaker (138 x 40 = 5,520)
    • Non-target trials: 137 test utterances for each speaker (138 x 137 = 18,906)
      • One random utterance from the test data of each of the other users
    • Text-prompted simulation: the utterance spoken is always the utterance expected by the system

  11. 4.2. BioSec Experimental Protocol
  • BioSec Baseline database
    • 200 speakers, 2 sessions
    • Session = 4 utterances of the user's PIN and 3 of other users' PINs
    • PIN = 8 digits (e.g., "one two three four five six seven eight")
  • Usage of BioSec in this work
    • Following the BioSec Baseline core protocol (Fiérrez et al. 2005)
      • This protocol limits the number of subjects to 150
    • Target trials: 150 x 4 x 4 = 2,400
      • Enrollment: one of the 4 utterances of the user's PIN from the 1st session
      • Test: one of the 4 utterances of the user's PIN from the 2nd session
    • Non-target trials: 150 x 149 / 2 = 11,175
      • Enrollment: 1st user's PIN from the 1st session
      • Test: 1st PIN from the 1st session of the rest of the users, avoiding symmetric matches
    • Enrollment and test always with a single utterance
    • Lexical content is the same in enrollment and test in target trials but is usually different in non-target trials
    • Text-prompted simulation: the utterance spoken is always the utterance expected by the system
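
As a sanity check, the trial counts quoted on the two protocol slides above can be recomputed directly (a worked recomputation, not code from the paper):

```python
# YOHO: 138 speakers, 40 target tests each, 137 impostor tests each
yoho_target = 138 * 40              # = 5,520
yoho_nontarget = 138 * 137          # = 18,906

# BioSec Baseline core protocol: 150 subjects, 4 single-utterance
# enrollments x 4 test utterances per subject; impostor trials pair
# each couple of distinct subjects once (symmetric matches avoided)
biosec_target = 150 * 4 * 4         # = 2,400
biosec_nontarget = 150 * 149 // 2   # = 11,175

print(yoho_target, yoho_nontarget, biosec_target, biosec_nontarget)
```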

  12. 4.3. Results with YOHO
  • DET curves and %EERs comparing:
    • Baum-Welch re-estimation vs. MLLR adaptation
    • Different amounts of enrollment material: 6, 24 or 96 utterances
  • MLLR adaptation provides better performance for all conditions
  • With 96 utterances for enrollment the EER is below 1%, with 24 it is about 2%, and with 6 it is below 5%
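
For reference, a minimal way to compute the %EER figures quoted on these results slides from two lists of trial scores (an illustrative implementation, not the authors' evaluation code):

```python
import numpy as np

def eer(target_scores: np.ndarray, nontarget_scores: np.ndarray) -> float:
    """Approximate the equal error rate by sweeping the decision
    threshold over all observed scores (accept if score >= threshold)."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    frr = np.array([(target_scores < t).mean() for t in thresholds])      # false rejections
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])  # false acceptances
    i = np.argmin(np.abs(frr - far))        # point where the two rates cross
    return 100.0 * (frr[i] + far[i]) / 2.0  # %EER
```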

  13. 4.3. Results with BioSec Baseline (i)
  • Spanish, head-mounted close-talking microphone, MLLR adaptation
  • Enrolling with a single utterance in BioSec yields better results (1.7% EER) than enrolling with 24 utterances in YOHO (2.1%)
    • Probably due to the lexical match between enrollment and verification in target trials

  14. 4.3. Results with BioSec Baseline (ii)
  • English, webcam distant microphone, MLLR adaptation
  • Results for English with the webcam distant microphone are almost an order of magnitude worse!
  • Possible causes:
    • Channel variation
    • Non-native speakers

  15. 4.3. Results with BioSec Baseline (iii)
  • New results, not in the paper
  • English, head-mounted close-talking microphone, MLLR adaptation
  • Results for English with the close-talking microphone (2.2% EER) are only slightly worse than for Spanish (1.7%)
    • Possibly due to non-native speakers
  • The main reason for the poor results with the distant microphone is therefore the channel

  16. 4.3. Results with BioSec Baseline (iv)
  • New results, not in the paper
  • Spanish, webcam distant microphone, MLLR adaptation
  • For Spanish with the distant microphone, results are again much worse than with the close-talking microphone
  • Huge impact of the channel on speaker recognition performance
    • No channel-robustness techniques used (only CMN)
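
CMN (cepstral mean normalization), the only channel-compensation step the slides mention, subtracts the per-utterance mean of each cepstral coefficient, which removes a stationary convolutional channel. A one-function sketch, reusing the feature matrix layout from the enrollment slide:

```python
import numpy as np

def cmn(features: np.ndarray) -> np.ndarray:
    """Cepstral mean normalization over one utterance.
    features: (frames, coeffs) matrix, e.g. the 39-dim vectors above."""
    return features - features.mean(axis=0, keepdims=True)
```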

  17. 5. Conclusions
  • We have studied BioSec Baseline as a benchmark for text-dependent speaker recognition
  • We have tried to facilitate comparison of results between YOHO and BioSec Baseline by evaluating the same method on both corpora
  • For the close-talking microphone, results on BioSec are much better than results on YOHO
    • Probably due to the lexical match between enrollment and verification
  • For the distant webcam microphone, results on BioSec are much worse than results on YOHO
    • Due to channel variation
    • No channel-robustness techniques used (only CMN)

  18. Thanks!
