IRISA 2003 SPEAKER RECOGNITION SYSTEM

Presentation Transcript


  1. IRISA 2003 SPEAKER RECOGNITION SYSTEM NIST Speaker Recognition Workshop, June 24-25, 2003 1sp DETECTION Limited Data M. BEN, G. GRAVIER, A. OZEROV & F. BIMBOT for the ELISA consortium

  2. Outline • IRISA 2003 system • Introduction • Description • NIST’03 SRE results • Experiments • Front-end • Modeling • Score normalization • Conclusions

  3. IRISA 2003 system • Introduction
  • IRISA is a member of the ELISA consortium
  • The IRISA 2003 system is based on audioseg, a newly developed audio segmentation toolkit
  • Web links:
    - IRISA/METISS: http://www.irisa.fr/metiss/accueil.html
    - ELISA consortium: http://elisa.ddl.ish-lyon.cnrs.fr

  4. IRISA 2003 system • Description: front-end
  • 20 ms frames every 10 ms
  • 24-filter bank over 340 - 3400 Hz → 16 LFCC (see the sketch after this slide)
  • RASTA filtering (secondary system)
  • deltas + delta log-energy are added
  • frame selection: bi-gaussian modeling of the energy with ML classification of the frames (speech/silence)
  • global feature normalization (zero mean, unit variance)
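
To make the front-end concrete, here is a minimal sketch of the framing / linear filter-bank / cepstrum chain, assuming 8 kHz telephone speech, a Hamming window, a power-of-two FFT and a DCT-II cepstrum; the function name and every parameter not quoted on the slide are illustrative assumptions, and RASTA filtering, deltas and frame selection are left out.

```python
# Hedged sketch of the slide's front-end: 20 ms frames every 10 ms,
# 24 linear triangular filters over 340-3400 Hz, 16 cepstral coefficients.
# Window type, FFT size and DCT formulation are assumptions.
import numpy as np

def lfcc_frontend(signal, sr=8000, frame_ms=20, hop_ms=10,
                  n_filters=24, n_ceps=16, fmin=340.0, fmax=3400.0):
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    window = np.hamming(frame_len)
    n_fft = 1 << (frame_len - 1).bit_length()
    # Linearly spaced (not mel-warped) triangular filters between fmin and fmax.
    edges = np.linspace(fmin, fmax, n_filters + 2)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fbank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, ctr, hi = edges[i], edges[i + 1], edges[i + 2]
        fbank[i] = np.clip(np.minimum((freqs - lo) / (ctr - lo),
                                      (hi - freqs) / (hi - ctr)), 0, None)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        log_e = np.log(fbank @ power + 1e-10)
        # DCT-II of the log filter-bank energies -> 16 static LFCCs.
        cep = [np.sum(log_e * np.cos(np.pi * k / n_filters *
                                     (np.arange(n_filters) + 0.5)))
               for k in range(n_ceps)]
        feats.append(cep)
    return np.asarray(feats)   # (n_frames, 16); deltas would be appended next
```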

  5. IRISA 2003 system • Description: background modeling and speaker models
  • background modeling:
    - gender-dependent background models
    - GMMs with 256 components and diagonal covariance matrices
    - primary system: cellular data (NIST’01)
    - secondary system: cellular + landline data (NIST’01)
  • speaker models:
    - adapted from the background models with MAP estimation of the parameters (mean-only adaptation, sketched below)
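
The slide only names the adaptation scheme (MAP, means only). The sketch below follows the usual relevance-MAP recipe for GMM speaker models; the relevance factor r and the function name are assumptions, not values given in the presentation.

```python
# Mean-only MAP adaptation of a diagonal-covariance GMM/UBM (relevance MAP).
import numpy as np

def map_adapt_means(ubm_weights, ubm_means, ubm_vars, frames, r=16.0):
    """ubm_means, ubm_vars: (K, D); frames: (T, D). Returns adapted means (K, D)."""
    # Posterior probability of each Gaussian for each frame.
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * ubm_vars), axis=1)      # (K,)
    diff = frames[:, None, :] - ubm_means[None, :, :]                   # (T, K, D)
    log_like = log_norm - 0.5 * np.sum(diff ** 2 / ubm_vars, axis=2)    # (T, K)
    log_post = np.log(ubm_weights) + log_like
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    # Zeroth- and first-order sufficient statistics.
    n_k = post.sum(axis=0)                                              # (K,)
    e_k = (post.T @ frames) / np.maximum(n_k, 1e-10)[:, None]           # (K, D)
    # Mean-only update: interpolate between data mean and UBM mean.
    alpha = (n_k / (n_k + r))[:, None]
    return alpha * e_k + (1.0 - alpha) * ubm_means
```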

  6. IRISA 2003 system • Description: scoring
  • frame score: log-likelihood ratio computed with the 10 best-matching Gaussians in the background model
  • utterance score: combines the frame scores over the utterance (NT: number of frames in the utterance; see the formula below)
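
The utterance-score formula itself did not survive the transcript. A common convention that is consistent with the quantities named on the slide, offered here only as an assumption, is the frame-averaged log-likelihood ratio:

```latex
S(X,\mathrm{spk}) \;=\; \frac{1}{N_T}\sum_{t=1}^{N_T}
\Big[\log p(x_t \mid \lambda_{\mathrm{spk}}) - \log p(x_t \mid \lambda_{\mathrm{bkg}})\Big]
```

where each frame likelihood is typically evaluated over the 10 best-scoring background Gaussians only, and over the corresponding components of the adapted speaker model.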

  7. IRISA 2003 system • Description: score normalization (DT-norm)
  • D-norm: based on D(spk), the symmetric Kullback-Leibler distance between the speaker (spk) and the background models
  • DT-norm: uses the mean and standard deviation of the D-norm scores of the test utterance against cohort impostor models (50 male + 50 female, from NIST’01 SRE); see the formulas below
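
The formulas were carried by the original slide graphics and are missing from the transcript; a plausible reconstruction, based on the published D-norm and T-norm definitions and offered as an assumption, is:

```latex
S_{\mathrm{D}}(X,\mathrm{spk}) = \frac{S(X,\mathrm{spk})}{D(\mathrm{spk})},
\qquad
S_{\mathrm{DT}}(X,\mathrm{spk}) =
\frac{S_{\mathrm{D}}(X,\mathrm{spk}) - \mu_{\mathrm{D}}(X)}{\sigma_{\mathrm{D}}(X)}
```

where D(spk) is the symmetric KL distance named on the slide, and μ_D(X), σ_D(X) are the mean and standard deviation of the D-norm scores of test utterance X against the 100 cohort impostor models.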

  8. IRISA 2003 system • NIST’03 SRE results: 1sp-limited DET curves
  • 2 systems submitted:
    - IRI_1: primary (baseline system)
    - IRI_2: secondary (RASTA front-end, mixed cellular + landline data for world models)
  • DCF (min / actual):
    - IRI_1: 0.3176 / 0.3205
    - IRI_2: 0.3333 / 0.3396

  9. Experiments • Front-end: frame selection
  • speech/silence classification based on a bi-gaussian modeling of the frame energy
  • ML classification or threshold-based selection? threshold t = μ₂ − c·σ₂, where G₁(μ₁, σ₁) and G₂(μ₂, σ₂) are the two Gaussians and c is a constant coefficient to optimize
  (figure: the two fitted Gaussians along the energy axis)

  10. Experiments • Front-end: frame selection
  • speech/silence classification based on a bi-gaussian modeling of the frame log-energy
  • ML classification or threshold-based selection? threshold t = μ₂ − c·σ₂, where G₁(μ₁, σ₁) and G₂(μ₂, σ₂) are the two Gaussians and c is a constant coefficient to optimize (code sketch after this slide)
  (figure: the two fitted Gaussians along the log-energy axis)
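
A minimal sketch of the two selection strategies being compared, using scikit-learn for the 2-component fit; the function name is illustrative, and the default c = 0.8 is the value slide 11 reports as optimal for raw energy.

```python
# Bi-gaussian frame selection: ML classification vs. threshold t = mu2 - c*sigma2.
import numpy as np
from sklearn.mixture import GaussianMixture

def select_frames(energies, method="ml", c=0.8):
    e = np.asarray(energies, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, covariance_type="diag",
                          random_state=0).fit(e)
    speech = int(np.argmax(gmm.means_[:, 0]))   # higher-energy component = speech
    if method == "ml":
        # Keep frames whose most likely component is the speech Gaussian.
        return gmm.predict(e) == speech
    # Threshold-based selection below the speech-component mean.
    mu2 = gmm.means_[speech, 0]
    sigma2 = float(np.sqrt(gmm.covariances_[speech, 0]))
    return e[:, 0] >= mu2 - c * sigma2
```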

  11. Experiments • Front-end: frame selection
  • SYS_fs1: ML selection (E)
  • SYS_fs2: optimal threshold-based selection (E), c = 0.8
  • SYS_fs3: ML selection (LogE)
  • SYS_fs4: optimal threshold-based selection (LogE), c = 2.5
  • energy (E) bi-gaussian modeling with ML selection of the frames performs best
  • drastic selection: about 50% of the frames are discarded!
  (results on NIST’03 SRE data)

  12. Experiments • Front-end: feature normalization
  • st-norm: short-term normalization (zero mean, unit variance) on a sliding window (3 sec.)
  • lt-norm: long-term normalization (zero mean, unit variance) on all features
  • st-norm is applied before frame selection; lt-norm can be applied before or after frame selection (both variants are sketched after this slide)
  • SYS_fn1: lt-norm + frame selection
  • SYS_fn2: st-norm + frame selection
  • SYS_fn3: frame selection + lt-norm
  (results on NIST’02 SRE data, subset)
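
A sketch of the two normalizations; the 3-second window corresponds to roughly 300 frames at the 10 ms hop, and the function names and the edge handling of the window are assumptions.

```python
# lt-norm: global zero-mean / unit-variance; st-norm: the same statistics on a
# centered sliding window (about 3 s, i.e. ~300 frames at a 10 ms frame rate).
import numpy as np

def lt_norm(features):
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-10
    return (features - mu) / sigma

def st_norm(features, window=300):
    out = np.empty(features.shape, dtype=float)
    half = window // 2
    for t in range(len(features)):
        seg = features[max(0, t - half): t + half + 1]
        out[t] = (features[t] - seg.mean(axis=0)) / (seg.std(axis=0) + 1e-10)
    return out

# Example ordering studied in SYS_fn3 / SYS_fn5: frame selection, then lt-norm.
# feats = lt_norm(feats[frame_mask])
```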

  13. Experiments • Front-end: feature normalization
  • SYS_fn5: frame selection + lt-norm (baseline primary system)
  • SYS_fn6: st-norm + frame selection + lt-norm
  • short-term normalization does not seem to work well (buggy?)
  • long-term normalization at the end of the front-end seems to be crucial
  • best results obtained with frame selection followed by long-term normalization of the remaining features
  (results on NIST’03 SRE data)

  14. Experiments • Modeling: does size matter?
  • SYS_nbg1: 256-component GMMs (baseline)
  • SYS_nbg2: 2048-component GMMs
  • no performance gain with 2048 Gaussians in the mixture
  • possibly because the frame selection process discards a large amount of frames (?)
  (results on NIST’02 SRE data, subset)

  15. Experiments • Score normalization
  • SYS_sn1: no score normalization
  • SYS_sn2: T-norm
  • SYS_sn3: DT-norm
  • SYS_sn4: DZT-norm
  • all score normalizations improve performance
  • DT-norm seems to perform better than T-norm and DZT-norm at the minimum DCF point (standard T-norm / Z-norm definitions are recalled below)
  (results on NIST’02 SRE data, subset)
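
For reference, the impostor-based normalizations compared here are the standard T-norm (statistics estimated per test utterance against a cohort of impostor models) and Z-norm (statistics estimated per speaker model against impostor utterances, offline); DT-norm and DZT-norm presumably apply the same statistics to D-normalized scores, which is an assumption consistent with slide 7:

```latex
S_{\mathrm{T}}(X,\mathrm{spk}) =
\frac{S(X,\mathrm{spk}) - \mu_{\mathrm{imp}}(X)}{\sigma_{\mathrm{imp}}(X)},
\qquad
S_{\mathrm{Z}}(X,\mathrm{spk}) =
\frac{S(X,\mathrm{spk}) - \mu_{\mathrm{imp}}(\mathrm{spk})}{\sigma_{\mathrm{imp}}(\mathrm{spk})}
```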

  16. Conclusions • IRISA participation in NIST’03 SRE:
  • validation of the new toolkit audioseg
  • the new baseline system performs well
  • frame selection is crucial for good performance
  Perspectives:
  • work on feature transformations (PCA, ICA, ...)
  • model adaptation on test data
  • hierarchical structural model adaptation
