
Text-Constrained Speaker Recognition Using Hidden Markov Models

Text-Constrained Speaker Recognition Using Hidden Markov Models. Kofi A. Boakye International Computer Science Institute. Outline. Introduction Design and System Description Initial Results System Enhancements More words Higher order cepstra Cepstral Mean Subtraction Conclusions


Presentation Transcript


  1. Text-Constrained Speaker Recognition Using Hidden Markov Models Kofi A. Boakye International Computer Science Institute

  2. Outline • Introduction • Design and System Description • Initial Results • System Enhancements • More words • Higher order cepstra • Cepstral Mean Subtraction • Conclusions • Future Work

  3. Introduction • Speaker Recognition Problem: Determine if spoken segment is putative target • Also referred to as Speaker Verification/Authentication

  4. Introduction Method of solution requires two phases: a training phase and a testing phase. Similar to speech recognition, though “noise” (inter-speaker variability) is now signal. [Diagram: training phase and testing phase; in testing, the claimed identity is “Sally”]

  5. Introduction • Also like speech recognition, different domains exist • Two major divisions: • Text-dependent/Text-constrained • Highly constrained text spoken by person • Examples: fixed phrase, prompted phrase • Text-independent • Unconstrained text spoken by person • Example: conversational speech

  6. Introduction • Text-dependent systems can have high performance because of input constraints • More acoustic variation arises from speaker distinction (vs. phones) • Text-independent systems have greater flexibility

  7. Introduction Question: Is it possible to capitalize on advantages of text dependent systems in text-independent domains? Answer: Yes!

  8. Introduction Idea: Limit words of interest to a select group • Words should have high frequency in domain • Words should have high speaker-discriminative quality What kind of words match these criteria for conversational speech? 1) Discourse markers (like, well, now, …) 2) Filled pauses (um, uh) 3) Backchannels (yeah, right, uhhuh, …) These words are fairly spontaneous and represent an “involuntary speaking style” (Heck, WS2002)

  9. Design Likelihood Ratio Detector: Λ = p(X|S) / p(X|UBM) • Task is a detection problem, so use likelihood ratio detector • In implementation, log-likelihood is used • [Diagram: signal → feature extraction → speaker model and background model → Λ; the speaker model is adapted from the background model; accept if Λ > Θ, reject if Λ < Θ]
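
The accept/reject rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the log-likelihoods are assumed to come from some external scorer, and treating a score exactly equal to the threshold as a reject is an assumption (the slide only specifies the strict cases).

```python
def log_likelihood_ratio(log_p_speaker, log_p_ubm):
    """Log-domain form of Lambda = p(X|S) / p(X|UBM)."""
    return log_p_speaker - log_p_ubm


def decide(log_p_speaker, log_p_ubm, threshold):
    """Accept the claimed identity when the log ratio exceeds the threshold.

    Assumption: a score exactly equal to the threshold is rejected.
    """
    llr = log_likelihood_ratio(log_p_speaker, log_p_ubm)
    return "accept" if llr > threshold else "reject"
```

For example, `decide(-100.0, -105.0, 2.0)` compares a log ratio of 5.0 against a threshold of 2.0 and accepts.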

  10. Design • State-of-the-art speaker recognition systems use Gaussian Mixture Models • Speaker’s acoustic space is represented by many-component mixture of Gaussians [Figure: acoustic spaces of speaker 1 and speaker 2]

  11. Design • Speaker models are obtained via adaptation of a Universal Background Model (UBM) • Probabilistically align target training data into UBM mixture states • Update mixture weights, means and variances based on the number of occurrences in mixtures • Gives very good performance, but… [Figure: target training data aligned to UBM mixtures]
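
A minimal sketch of the adaptation step, restricted to means for brevity and assuming a diagonal-covariance GMM. The interpolation weight `count / (count + relevance)` follows the commonly used relevance-factor form of MAP adaptation; the `relevance` value of 16 and all function names are assumptions for illustration, not details taken from the slides.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def posteriors(x, weights, means, variances):
    """Probabilistic alignment of one frame to the UBM mixture components."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def map_adapt_means(weights, means, variances, frames, relevance=16.0):
    """Shift each UBM mean toward the target data, weighted by soft counts."""
    k, dim = len(weights), len(means[0])
    counts = [0.0] * k
    sums = [[0.0] * dim for _ in range(k)]
    for x in frames:
        for i, p in enumerate(posteriors(x, weights, means, variances)):
            counts[i] += p
            for d in range(dim):
                sums[i][d] += p * x[d]
    adapted = []
    for i in range(k):
        if counts[i] == 0.0:
            adapted.append(list(means[i]))  # no data: keep the UBM mean
            continue
        alpha = counts[i] / (counts[i] + relevance)
        data_mean = [s / counts[i] for s in sums[i]]
        adapted.append([alpha * dm + (1 - alpha) * m
                        for dm, m in zip(data_mean, means[i])])
    return adapted
```

A component that sees many target frames moves most of the way to the data mean, while a component with little or no data stays at the UBM mean.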

  12. Design • Concern: GMMs utilize a “bag-of-frames” approach • Frames assumed to be independent • Sequential information is not really utilized • Alternative: Use HMMs • Do likelihood test on output from recognizer, which is an accumulated log-probability score • Text-independent system has been analyzed (Weber et al. from Dragon Systems) • Let’s try a text-dependent one!

  13. System Word-level HMM-UBM detectors [Diagram: signal → word extractor → HMM-UBM 1, HMM-UBM 2, …, HMM-UBM N → combination → Λ] Topology: Left-to-right HMM with self-loops and no skips • 4 Gaussian components per state • Number of states related to number of phones and median number of frames for word
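
The topology described here (left-to-right, self-loops, no skips) corresponds to a transition matrix with non-zero entries only on the diagonal and the first superdiagonal. The sketch below builds such a matrix; the function name, the initial self-loop probability of 0.6, and the absorbing final state are assumptions made for illustration (in practice the probabilities are re-estimated during training, and HTK models additionally use non-emitting entry and exit states).

```python
def left_to_right_transitions(n_states, self_loop=0.6):
    """Transition matrix allowing only 'stay' or 'advance by one state'."""
    if n_states < 1:
        raise ValueError("n_states must be at least 1")
    if not 0.0 <= self_loop < 1.0:
        raise ValueError("self_loop must be in [0, 1)")
    a = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        a[i][i] = self_loop            # self-loop
        a[i][i + 1] = 1.0 - self_loop  # advance to the next state, no skips
    a[-1][-1] = 1.0                    # assumption: final state is absorbing
    return a
```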

  14. System HMMs implemented using the HMM Toolkit (HTK) - Used for speech recognition Input features were 12 mel-cepstra, first differences, and zeroth-order cepstrum (energy parameter) Adaptation: Means were adapted using Maximum A Posteriori adaptation In cases of no adaptation data, UBM was used - LLR score cancels

  15. Word Selection 13 Words: Discourse markers: {actually, anyway, like, see, well, now} Filled pauses: {um, uh} Backchannels: {yeah, yep, okay, uhhuh, right} Words account for approximately 8% of total tokens

  16. Recognition Task NIST Extended Data Evaluation: Training for 1, 2, 4, 8, and 16 complete conversation sides and testing on one side (side duration ~2.5 min) Uses Switchboard I corpus - Conversational telephone speech Cross-validation method where data is partitioned Test on one partition; use others for background models and normalization For project, used splits 4-6 for background and 1 for testing with 8-conversation training

  17. Scoring LLR(X) = log(p(X|S)) – log(p(X|UBM)) Target score: output of adapted HMM scoring forced alignment recognition of word from true transcripts (aligned via SRI recognizer) UBM score: output of non-adapted HMM scoring same forced alignment Frame normalization: [formula not preserved in the transcript] Word normalization: Average of word-level frame normalizations N-best normalization: Frame normalization on n best matching (i.e. high log-prob) words
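
The three normalizations can be sketched as follows. Because the frame-normalization formula did not survive in the transcript, this sketch assumes it is the total log-likelihood ratio divided by the total number of frames. Ranking tokens for the n-best variant by per-frame target log-probability is likewise an assumption, as are the function names and the `(target_logp, ubm_logp, n_frames)` token layout.

```python
def frame_norm(tokens):
    """Assumed form: summed LLR over all word tokens / total frame count."""
    total_llr = sum(target - ubm for target, ubm, _ in tokens)
    total_frames = sum(n for _, _, n in tokens)
    return total_llr / total_frames


def word_norm(tokens):
    """Average of the per-token frame-normalized LLRs."""
    per_word = [(target - ubm) / n for target, ubm, n in tokens]
    return sum(per_word) / len(per_word)


def n_best_norm(tokens, n):
    """Frame normalization restricted to the n best-matching tokens.

    Assumption: 'best matching' means highest per-frame target log-prob.
    """
    ranked = sorted(tokens, key=lambda tok: tok[0] / tok[2], reverse=True)
    return frame_norm(ranked[:n])
```

Each token is a tuple such as `(-50.0, -60.0, 10)`: adapted-model log-probability, UBM log-probability, and number of frames in the aligned word.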

  18. Initial Results Observations: 1) Frame norm result = word norm result 2) EER of n-best decreases with increasing n -Suggests benefit from an increase in data
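
Results throughout are reported as equal error rate (EER), the operating point at which the miss rate equals the false-alarm rate. A rough sketch of how it can be estimated from two lists of scores is given below; the function name is hypothetical, and picking the threshold where the two rates are closest and averaging them is one simple convention among several.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by sweeping thresholds over all observed scores."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap = None
    best_eer = None
    for t in thresholds:
        # A trial is accepted when its score is at least the threshold.
        miss = sum(1 for s in target_scores if s < t) / len(target_scores)
        false_alarm = (sum(1 for s in impostor_scores if s >= t)
                       / len(impostor_scores))
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (miss + false_alarm) / 2
    return best_eer
```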

  19. Initial Results Comparable results: Sturim et al. text-dependent GMM yielded EER of 1.3% - Larger word pool (50 words) - Channel normalization

  20. Initial Results Observations: EERs for most words lie in a small range around 7% - Suggests that words, as a group, share some qualities - Last two may differ greatly partly because of data scarcity Best word (“yeah”) yielded EER of 4.63% compared with 2.87% for all words

  21. System Enhancements

  22. System Enhancements: New Words Some discourse markers and backchannels are bigrams 6 Additional Words (Bigrams): Discourse markers: {you_know, you_see, i_think, i_mean} Backchannels: {i_see, i_know} Total coverage of ~10% with these additional words

  23. System Enhancements: New Words Results • EER reduced from 2.87% to 2.53% • Significant reduction, especially given the size of coverage increase

  24. System Enhancements: New Words Results • Observations: • Well-performing bigrams have comparable EERs • Poorly-performing bigrams suffer from a paucity of data • Suggests possibility of frequency threshold for performance

  25. System Enhancements: More Cepstra Idea: Higher order cepstra may possess more variability that can be used for speaker discrimination Input features modified from 12 to 19 mel-cepstra

  26. System Enhancements: More Cepstra Results EER Reduced from 2.87% to 1.88%

  27. System Enhancements: CMS • Idea: Channel response may introduce undesirable variability (e.g., the same speaker on different handsets), so try to remove it • Common approach is to perform Cepstral Mean Subtraction (CMS) • Convolutional effects in the time domain become multiplicative in the frequency domain and additive in the log power domain: • X(ω,t) = S(ω,t)C(ω,t) • log|X(ω,t)|² = log|S(ω,t)|² + log|C(ω,t)|²
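
Because a fixed channel contributes an additive constant in the log (and hence cepstral) domain, subtracting the per-coefficient mean over an utterance removes it. A minimal sketch, assuming the features arrive as a list of equal-length frames; the function name and the choice to average over the whole utterance are assumptions for illustration.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean (over all frames) from every frame."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]
```

Adding the same offset to every frame, as a stationary channel would, leaves the output unchanged, which is the property the slide relies on.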

  28. System Enhancements: CMS Results • EER reduced from 2.87% to 1.35% • Poor performance in low false alarm region • possibly due to small number of data points • also may have removed ‘good’ channel info

  29. System Enhancements: Combined System Results “grab bag” system yields EER of 1.01% Suffers from same problem of poor performance for low false alarms

  30. Conclusions • Well-performing text-dependent speaker recognition in an unconstrained speech domain is very feasible • Benefit of sequential information appears to have been established • Benefits of higher order cepstra and CMS for input features have been demonstrated

  31. Future Work -Analyze performance with ASR output -Closer analysis of word frequency to performance -More words! -Normalizations (Hnorm, Tnorm) -Examine influence of word context (e.g., “well” as discourse marker and as adverb)

  32. Fin
