
The CRIM Systems for the NIST 2008 SRE

The CRIM Systems for the NIST 2008 SRE. Patrick Kenny, Najim Dehak and Pierre Ouellet. Centre de recherche informatique de Montreal (CRIM). Systems. CRIM_2 was the primary system for all but the core condition Large stand-alone joint factor analysis (JFA) system trained on pre-2006 data





Presentation Transcript


  1. The CRIM Systems for the NIST 2008 SRE Patrick Kenny, Najim Dehak and Pierre Ouellet Centre de recherche informatique de Montreal (CRIM)

  2. Systems • CRIM_2 was the primary system for all but the core condition • Large stand-alone joint factor analysis (JFA) system trained on pre-2006 data • CRIM_1 was the primary system for the core condition • CRIM_1 = CRIM_2 + 3 other JFA systems with different feature sets • CRIM_3 = CRIM_2 + 2006 SRE data

  3. Overview • Tasks involving multiple enrollment recordings: • 8conv-short3, 3conv-short3 • Tasks involving 10 sec test recordings: • 10sec-10sec, short2-10sec, 8conv-10sec • Najim Dehak will talk about • JFA with unconventional features • Post-eval experiments on the interview data (following LPT and I4U)

  4. Factor Analysis Configuration • 2K Gaussians, 60-dimensional features • 20 Gaussianized MFCCs + first and second derivatives • 300 speaker factors • 100 channel factors for telephone speech • Additional 100 channel factors for microphone speech

  5. Speaker Variability • Prior distribution on speaker supervectors: s = m + vy + dz • m is the speaker-independent supervector • v is rectangular, low rank (eigenvoices) • d is diagonal • y, z are standard Normal random vectors (speaker factors)
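The prior on this slide can be illustrated with a minimal NumPy sketch that draws one speaker supervector from s = m + vy + dz. The dimensions are toy values (the configuration slide uses 2K Gaussians, 60-dimensional features and 300 speaker factors), the parameter values are random placeholders, and the function name `sample_speaker_supervector` is hypothetical.

```python
import numpy as np

def sample_speaker_supervector(m, v, d, rng):
    """Draw s = m + v y + d z with y and z standard Normal."""
    y = rng.standard_normal(v.shape[1])   # low-rank speaker factors
    z = rng.standard_normal(m.shape[0])   # one factor per supervector dimension
    return m + v @ y + d * z              # d is diagonal, stored as a vector

rng = np.random.default_rng(0)
C, F, R = 8, 3, 4                         # toy sizes: Gaussians, feature dim, rank of v
m = rng.standard_normal(C * F)            # placeholder speaker-independent supervector
v = rng.standard_normal((C * F, R))       # placeholder low-rank loading matrix
d = np.abs(rng.standard_normal(C * F))    # placeholder diagonal of d

s = sample_speaker_supervector(m, v, d, rng)

# Covariance of s implied by the prior: v v^T + d^2
prior_cov = v @ v.T + np.diag(d ** 2)
```

Setting `d` to zeros gives the purely low-rank model that the later slides refer to as d = 0.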

  6. Channel Variability • Each supervector M is assumed to be a sum of a speaker supervector and a channel supervector: M = s + c • Prior distribution on channel supervectors: c = ux • u is rectangular, low rank (eigenchannels) • x is a standard Normal random vector (channel factors)

  7. Enrollment: single utterance • The supervector for the utterance is m + dz + vy + ux • Calculate the MAP estimates of x, y and z • The speaker supervector is s = m + dz + vy • The full posterior distribution of s can be calculated in closed form (but this is messy unless d is 0)
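For the simpler case d = 0, the MAP estimates of the hidden factors can be sketched as follows. This assumes frame alignments are fixed (for example by the UBM), so that the utterance is summarized by zeroth-order counts and first-order statistics centred on m, and the posterior of the stacked factors h = (y, x) is Gaussian with precision I + WᵀΣ⁻¹NW, where W = [v u]. The function name `map_factors` and all numerical values are hypothetical.

```python
import numpy as np

def map_factors(N, F_centered, sigma, W):
    """Posterior mean (= MAP estimate) and covariance of the factors, d = 0.

    N          : (C,)     zeroth-order statistics per Gaussian
    F_centered : (C*D,)   first-order statistics centred on m
    sigma      : (C*D,)   diagonal covariances of the Gaussians
    W          : (C*D, R) stacked loading matrix, e.g. [v u]
    """
    D = sigma.size // N.size
    N_expanded = np.repeat(N, D)                  # one count per supervector dim
    Wt_Sinv = (W / sigma[:, None]).T              # W^T Sigma^-1
    precision = np.eye(W.shape[1]) + Wt_Sinv @ (N_expanded[:, None] * W)
    cov = np.linalg.inv(precision)
    mean = cov @ (Wt_Sinv @ F_centered)
    return mean, cov

rng = np.random.default_rng(1)
C, D, Ry, Rx = 4, 2, 3, 2                         # toy sizes
v = rng.standard_normal((C * D, Ry))              # placeholder eigenvoices
u = rng.standard_normal((C * D, Rx))              # placeholder eigenchannels
W = np.hstack([v, u])
sigma = np.ones(C * D)
N = np.array([50.0, 30.0, 20.0, 10.0])            # placeholder frame counts
F_centered = rng.standard_normal(C * D) * np.repeat(N, D)

h_map, h_cov = map_factors(N, F_centered, sigma, W)
y_map, x_map = h_map[:Ry], h_map[Ry:]             # speaker and channel factors
```

With no data (all statistics zero) the function returns the prior: zero mean and identity covariance.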

  8. Enrollment: 8conv case • Again the joint posterior distribution of the hidden variables can be calculated in closed form • Unless d is 0, this is very messy • Trick: pool the utterances together and ignore the fact that the x’s are different

  9. 10 second test conditions • Many labs have reported difficulty in getting channel factors or NAP to work under these conditions • The problem may be that it is unrealistic to attempt to produce point estimates (ML or MAP) of channel factors using 10 second test utterances • Probability rules say you should integrate over channel factors instead

  10. Why is this not an issue for long test utterances? If the test utterance is long, the posterior distribution of the channel factors will be sharply peaked in the neighbourhood of the point estimate (MAP or ML).
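A rough numerical illustration of this point, under assumed toy values: with fixed alignments the posterior covariance of the channel factors is (I + uᵀΣ⁻¹Nu)⁻¹, so it shrinks as the frame counts N grow. The frame counts below (1000 for roughly 10 seconds, 15000 for a long utterance) assume about 100 frames per second, and `channel_posterior_cov` is a hypothetical helper.

```python
import numpy as np

def channel_posterior_cov(n_frames, u, sigma, weights):
    """Posterior covariance of x when n_frames are spread over the Gaussians."""
    D = sigma.size // weights.size
    N_expanded = np.repeat(n_frames * weights, D)
    precision = np.eye(u.shape[1]) + u.T @ ((N_expanded / sigma)[:, None] * u)
    return np.linalg.inv(precision)

rng = np.random.default_rng(2)
C, D, R = 16, 4, 5                          # toy sizes
u = 0.1 * rng.standard_normal((C * D, R))   # placeholder eigenchannels
sigma = np.ones(C * D)
weights = np.full(C, 1.0 / C)               # assume uniform occupancy

# Average posterior standard deviation of the channel factors
spread = {}
for n_frames in (1000, 15000):
    cov = channel_posterior_cov(n_frames, u, sigma, weights)
    spread[n_frames] = np.sqrt(np.diag(cov)).mean()
    print(n_frames, spread[n_frames])
```

The longer utterance gives the smaller spread, which is why a point estimate is a reasonable stand-in for the full posterior in that case.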

  11. Research Problem How should factor analysis likelihoods and posteriors be evaluated so as to take account of all of the relevant uncertainties? - Uncertainty in the speaker factors - Uncertainty in the channel factors - Uncertainty in the assignment of observations to mixture components

  12. Current Solution • Use point estimate of speaker factors • Bayesian approach (using full posterior) doesn’t seem to help • Integrate over the channel factors • Use the UBM to align frames with mixture components • Tractable posterior + Jensen’s inequality gives lower bound on likelihood (Niko Brummer) • Very fast if combined with LPT assumption • Paradoxical results if speaker/channel dependent GMMs are used in place of the UBM
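The integration step can be sketched in closed form, assuming the alignments are fixed and the speaker supervector s is held at its point estimate. The log-likelihood is then quadratic in x, and integrating against the standard Normal prior adds the term −½ log|L| + ½ bᵀL⁻¹b to the log-likelihood at x = 0, where L = I + uᵀΣ⁻¹Nu and b = uᵀΣ⁻¹F̃ (first-order statistics centred on s). This is only the Gaussian integral, not the authors' full scoring procedure; `channel_integration_gain` and the toy values are hypothetical.

```python
import numpy as np

def channel_integration_gain(N, F_centered, sigma, u):
    """log of the integral of p(X | s, x) N(x; 0, I) dx, minus log p(X | s, x = 0)."""
    D = sigma.size // N.size
    N_expanded = np.repeat(N, D)
    b = u.T @ (F_centered / sigma)                       # linear term in x
    L = np.eye(u.shape[1]) + u.T @ ((N_expanded / sigma)[:, None] * u)
    _, logdet = np.linalg.slogdet(L)
    return -0.5 * logdet + 0.5 * b @ np.linalg.solve(L, b)

rng = np.random.default_rng(3)
C, D, R = 3, 2, 2                                        # toy sizes
u = 0.1 * rng.standard_normal((C * D, R))                # placeholder eigenchannels
sigma = np.ones(C * D)
N = np.array([5.0, 3.0, 2.0])                            # placeholder frame counts
F_centered = rng.standard_normal(C * D)                  # placeholder statistics

gain = channel_integration_gain(N, F_centered, sigma, u)
```

The same quantities L and b also give the point estimate L⁻¹b, so the integrated score costs little more than the MAP-based one.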

  13. Ideal Solution: Integrate over all hidden variables • Robbie Vogt (Odyssey 2004) did this for a diagonal factor analysis model • No speaker or channel factors • Exact dynamic programming solution • Variational Bayes offers an approximate solution in the general case • Assume that the posterior distribution factorizes into 3 terms (speaker factors, channel factors, assignments of frames to mixture components) • Cycle through the factors to update them (like EM) • Jensen’s inequality gives lower bound on the likelihood which increases on successive iterations
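The factorization and the bound on this slide can be written out as a standard mean-field sketch. The symbol a for the assignments of frames to mixture components and the notation q(·) are introduced here for illustration and are not in the slides.

```latex
% Assumed factorization of the posterior over the hidden variables
q(y, x, a) = q(y)\, q(x)\, q(a)

% Jensen's inequality gives a lower bound on the log-likelihood
\log P(X) \;\ge\; \mathrm{E}_{q}\!\left[\log P(X, y, x, a)\right]
          - \mathrm{E}_{q}\!\left[\log q(y, x, a)\right]

% Cycling through the factors: update one with the other two held fixed
\log q^{*}(y) = \mathrm{E}_{q(x)\,q(a)}\!\left[\log P(X, y, x, a)\right] + \text{const}
```

The updates for q(x) and q(a) have the same form with the roles of the factors exchanged, and each update cannot decrease the bound.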

  14. Fusion • Fusing long term and short term features • Unsupervised pseudo-syllable segmentation of prosodic and MFCC contours • Six Legendre polynomial coefficients for each contour • JFA without common factor (d=0) • Logistic regression function (FoCal)
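Representing a contour by six Legendre coefficients can be sketched with NumPy's Legendre module: a degree-5 least-squares fit yields six coefficients. The pitch values and the function name `contour_to_legendre` are hypothetical, and mapping each segment onto [-1, 1] is an assumption about how the contours are normalized.

```python
import numpy as np
from numpy.polynomial import legendre

def contour_to_legendre(contour, degree=5):
    """Least-squares Legendre fit; degree 5 gives six coefficients."""
    t = np.linspace(-1.0, 1.0, len(contour))    # map the segment onto [-1, 1]
    return legendre.legfit(t, contour, degree)

# Hypothetical pitch contour over one pseudo-syllable (values in Hz)
pitch = np.array([110.0, 118.0, 127.0, 133.0, 136.0,
                  134.0, 129.0, 121.0, 115.0, 112.0])
coeffs = contour_to_legendre(pitch)

# The smoothed contour can be rebuilt from the six coefficients
smoothed = legendre.legval(np.linspace(-1.0, 1.0, len(pitch)), coeffs)
```

Because the fit is done on a normalized time axis, segments of different durations all map to vectors of the same length.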

  15. Pseudo-syllable segmentation

  16. Long term features • Three long term systems: • 512 Gaussians, Features: Pitch + energy + duration (13 dimensions) • 1024 Gaussians, Features: 12 MFCC contours + energy + duration (79 dimensions) • 1024 Gaussians, Features: 12 MFCC contours + pitch + energy + duration (85 dimensions)
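The three feature dimensions are consistent with six Legendre coefficients per contour plus a single duration value. That reading is an assumption, checked by the arithmetic below; `feature_dim` is a hypothetical helper.

```python
N_COEFFS = 6   # Legendre coefficients per contour (from the Fusion slide)

def feature_dim(n_contours, n_scalars=1):
    """Contours contribute six coefficients each; duration is assumed scalar."""
    return n_contours * N_COEFFS + n_scalars

dims = {
    "pitch + energy + duration": feature_dim(2),              # 2 contours
    "12 MFCCs + energy + duration": feature_dim(13),          # 13 contours
    "12 MFCCs + pitch + energy + duration": feature_dim(14),  # 14 contours
}
```

The computed values are 13, 79 and 85, matching the dimensions listed on the slide.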

  17. Short2-short3 : Tel-Tel det7

  18. Short2-short3 : Tel-Tel det8

  19. How to deal with interview data? • Interview eigenchannels trained on interview development data (as LPT and I4U did) • Small configuration of the factor analysis • Features: 20 Gaussianized MFCCs + first derivatives • 300 speaker factors, d=0 (no common factor), 100 telephone channel factors • We carried out two experiments: • 50 Tel-Mic channel factors • 50 Tel-Mic channel factors + 50 interview channel factors

  20. NIST 2008 : Interview data, det1

  21. NIST 2008 : Interview data, det1

  22. References • A Study of Inter-Speaker Variability in Speaker Verification. • Modeling prosodic features with joint factor analysis for speaker verification. www.crim.ca/perso/patrick.kenny
