Speaker Verification: From Research to Reality

Speaker Verification: From Research to Reality. ICASSP Tutorial Salt Lake City, UT 7 May 2001. Douglas A. Reynolds, PhD Senior Member of Technical Staff M.I.T. Lincoln Laboratory. Larry P. Heck, PhD Speaker Verification R&D Nuance Communications.

Presentation Transcript


  1. Speaker Verification: From Research to Reality ICASSP Tutorial Salt Lake City, UT 7 May 2001 Douglas A. Reynolds, PhD Senior Member of Technical Staff M.I.T. Lincoln Laboratory Larry P. Heck, PhD Speaker Verification R&D Nuance Communications This work was sponsored by the Department of Defense under Air Force contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the Department of Defense.

  2. Speaker Verification: From Research to Reality ICASSP Tutorial Salt Lake City, UT 7 May 2001 This material may not be reproduced in whole or part without written permission from the authors

  3. Tutorial Outline • Part I : Background and Theory • Overview of area • Terminology • Theory and structure of verification systems • Channel compensation and adaptation • Part II : Evaluation and Performance • Evaluation tools and metrics • Evaluation design • Publicly available corpora • Performance survey • Part III : Applications and Deployments • Brief overview of commercial speaker verification systems • Design requirements for commercial verifiers • Steps to deployment • Examples of deployments

  4. Goals of Tutorial • Understand major concepts behind modern speaker verification systems • Identify the key elements in evaluating performance of a speaker verification system • Define the main issues and tasks in deploying a speaker verification system

  5. Part I : Background and Theory - Outline • Overview of area • Applications • Terminology • General Theory • Features for speaker recognition • Speaker models • Verification decision • Channel compensation • Adaptation • Combination of speech and speaker recognizers

  6. Extracting Information from Speech Goal: Automatically extract information transmitted in the speech signal [Diagram: the speech signal feeds speech recognition (words: “How are you?”), language recognition (language name: English), and speaker recognition (speaker name: James Wilson)]

  7. Evolution of Speaker Recognition [Timeline, 1930s-2001: aural and spectrogram matching; template matching; dynamic time-warping; vector quantization; Hidden Markov Models; Gaussian Mixture Models; commercial application of speaker recognition technology. Early work used small databases of clean, controlled speech; current work uses large databases of realistic, unconstrained speech] • This tutorial will focus on techniques and performance of state-of-the-art systems

  8. Speaker Recognition Applications • Access Control: physical facilities; computer networks and websites • Transaction Authentication: telephone banking; remote credit card purchases • Law Enforcement: forensics; home parole • Speech Data Management: voice mail browsing; speech skimming • Personalization: intelligent answering machine; voice-web / device customization

  9. Terminology • The general area of speaker recognition can be divided into two fundamental tasks: identification and verification • Any work on speaker recognition should identify which task is being addressed

  10. Terminology - Identification • Determines who is talking from a set of known voices • No identity claim from user (one-to-many mapping) • Often assumed that the unknown voice must come from the set of known speakers - referred to as closed-set identification (“Whose voice is this?”)

  11. Terminology - Verification/Authentication/Detection • Determine whether a person is who they claim to be • User makes an identity claim (one-to-one mapping) • Unknown voice could come from a large set of unknown speakers - referred to as open-set verification • Adding a “none-of-the-above” option to closed-set identification gives open-set identification (“Is this Bob’s voice?”)

  12. Terminology - Segmentation and Clustering • Determine when a speaker change has occurred in the speech signal (segmentation) • Group together speech segments from the same speaker (clustering) • Prior speaker information may or may not be available (“Where are the speaker changes? Which segments are from the same speaker?”)

  13. Terminology - Speech Modalities Application dictates different speech modalities: • Text-dependent recognition • Recognition system knows the text spoken by the person • Examples: fixed phrase, prompted phrase • Used for applications with strong control over user input • Knowledge of spoken text can improve system performance • Text-independent recognition • Recognition system does not know the text spoken by the person • Examples: user-selected phrase, conversational speech • Used for applications with less control over user input • More flexible system but also a more difficult problem • Speech recognition can provide knowledge of spoken text

  14. Terminology - Voice Biometric • Speaker verification is often referred to as a voice biometric • Biometric: a human-generated signal or attribute for authenticating a person’s identity • Voice is a popular biometric: • natural signal to produce • does not require a specialized input device • ubiquitous: telephones and microphone-equipped PCs • Voice biometric can be combined with other forms of security for strongest security: • Something you have - e.g., badge • Something you know - e.g., password • Something you are - e.g., voice

  15. Part I : Background and Theory - Outline • Overview of area • Applications • Terminology • General Theory • Features for speaker recognition • Speaker models • Verification decision • Channel compensation • Adaptation • Combination of speech and speaker recognizers

  16. General Theory - Phases of Speaker Verification System Two distinct phases to any speaker verification system • Enrollment phase: enrollment speech for each speaker (Bob, Sally); feature extraction; model training; produces a model (voiceprint) for each speaker • Verification phase: claimed identity (Sally); feature extraction; verification decision; Accepted!

  17. General Theory - Features for Speaker Recognition • Humans use several levels of perceptual cues for speaker recognition [Hierarchy of perceptual cues: high-level cues (learned traits) are difficult to automatically extract; low-level cues (physical traits) are easy to automatically extract] • There are no exclusive speaker identity cues • Low-level acoustic cues are most applicable for automatic systems

  18. General Theory - Features for Speaker Recognition • Desirable attributes of features for an automatic system (Wolf ‘72): practical (occur naturally and frequently in speech; easily measurable), robust (not change over time or be affected by speaker’s health; not be affected by reasonable background noise nor depend on specific transmission characteristics), secure (not be subject to mimicry) • No feature has all these attributes • Features derived from the spectrum of speech have proven to be the most effective in automatic systems

  19. General Theory - Speech Production [Diagram: glottal pulses passed through the vocal tract produce the speech signal] • Speech production model: source-filter interaction • Anatomical structure (vocal tract/glottis) conveyed in speech spectrum

  20. General Theory - Features for Speaker Recognition [Figure: vocal tract cross sections and magnitude spectra (dB vs. frequency, 0-6000 Hz) of the sounds /I/ and /AE/ for a male and a female speaker] • Different speakers will have different spectra for similar sounds • Differences are in the location and magnitude of peaks in the spectrum • Peaks are known as formants and represent resonances of the vocal cavity • The spectrum captures the formant locations and, to some extent, pitch without explicit formant or pitch tracking

  21. General Theory - Features for Speaker Recognition • Speech is a continuous evolution of the vocal tract • Need to extract a time series of spectra • Use a sliding window - 20 ms window, 10 ms shift - and take the magnitude of the Fourier transform of each window • Produces the time-frequency evolution of the spectrum
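The sliding-window analysis above can be sketched as follows (an illustrative NumPy sketch, not the tutorial's code; the 8 kHz sample rate and Hamming window are assumptions):

```python
import numpy as np

# Slice a waveform into overlapping frames and take the magnitude
# spectrum of each frame. Window/shift follow the slide: 20 ms / 10 ms.
def frame_spectra(signal, sample_rate=8000, win_ms=20, shift_ms=10):
    win = int(sample_rate * win_ms / 1000)      # 160 samples at 8 kHz
    shift = int(sample_rate * shift_ms / 1000)  # 80 samples at 8 kHz
    n_frames = 1 + (len(signal) - win) // shift
    hamming = np.hamming(win)
    frames = np.stack([signal[i*shift : i*shift + win] * hamming
                       for i in range(n_frames)])
    # Magnitude of the FFT gives the short-time spectrum per frame
    return np.abs(np.fft.rfft(frames, axis=1))

t = np.arange(8000) / 8000.0            # 1 s of a 1 kHz tone
spectra = frame_spectra(np.sin(2*np.pi*1000*t))
print(spectra.shape)                    # (99, 81): ~100 frames, 81 freq bins
```

Each row of the result is one short-time spectrum; stacked rows give the time-frequency evolution the slide describes.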

  22. General Theory - Features for Speaker Recognition • The number of discrete Fourier transform samples representing the spectrum is reduced by averaging frequency bins together • Typically done by a simulated filterbank • A perceptually based filterbank is used, such as a Mel or Bark scale filterbank • Linearly spaced filters at low frequencies • Logarithmically spaced filters at high frequencies
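A mel-scale filterbank of the kind described can be built like this (a sketch under common assumptions: the 2595/700 mel formula and triangular filters are typical choices, not taken from the tutorial):

```python
import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0**(m / 2595.0) - 1.0)

# Triangular filters spaced linearly on the mel scale: roughly linear
# spacing at low frequencies, logarithmic at high frequencies.
def mel_filterbank(n_filters=24, n_fft=512, sample_rate=8000):
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2),
                          n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fbank

fb = mel_filterbank()
print(fb.shape)   # (24, 257): 24 filters over 257 FFT bins
```

Multiplying a frame's magnitude spectrum by this matrix (transposed) gives the reduced filterbank energies.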

  23. General Theory - Features for Speaker Recognition [Pipeline: Fourier transform, magnitude, Log(), cosine transform; one feature vector (e.g., 3.4 3.6 2.1 0.0 -0.9 0.3 0.1) every 10 ms] • The primary features used in speaker recognition systems are cepstral feature vectors • The Log() function turns linear convolutional effects into additive biases • Easy to remove using blind-deconvolution techniques • The cosine transform helps decorrelate elements in the feature vector • Less burden on the model and empirically better performance
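The log-plus-cosine-transform step can be sketched as below (illustrative; the DCT-II basis and 13 coefficients are conventional assumptions, not specified by the tutorial):

```python
import numpy as np

# Turn filterbank outputs into cepstral vectors: log() makes
# convolutional channel effects additive; the cosine (DCT-II)
# transform decorrelates the elements of each vector.
def cepstra(fbank_energies, n_ceps=13):
    log_e = np.log(fbank_energies + 1e-10)          # avoid log(0)
    n = log_e.shape[1]
    k = np.arange(n)
    # DCT-II basis, applied along the filter axis
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2*k + 1) / (2.0*n)))
    return log_e @ basis.T                          # one vector per frame

frames = np.abs(np.random.randn(100, 24)) + 1.0     # stand-in filter energies
c = cepstra(frames)
print(c.shape)   # (100, 13): one 13-dim cepstral vector per 10 ms frame
```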

  24. General Theory - Features for Speaker Recognition [Figure: feature-extraction pipeline - Fourier transform, magnitude, Log(), cosine transform]

  25. General Theory - Features for Speaker Recognition • Additional processing steps for speaker recognition features • To help capture some temporal information about the spectra, delta cepstra are often computed and appended to the cepstral feature vector • 1st-order linear fit used over a 5-frame (50 ms) span • For telephone speech processing, only the voice pass-band frequency region is used • Use only the output of filters in the range 300-3300 Hz
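The delta computation described (a first-order linear fit over a 5-frame span) can be sketched as follows; the edge handling by repeating the first/last frame is an assumption:

```python
import numpy as np

# Delta cepstra: regression-style slope over a +-2 frame (5-frame) span,
# appended to the static cepstra.
def add_deltas(ceps, span=2):
    padded = np.concatenate([np.repeat(ceps[:1], span, axis=0), ceps,
                             np.repeat(ceps[-1:], span, axis=0)])
    denom = 2.0 * sum(k*k for k in range(1, span + 1))      # = 10 for span 2
    deltas = sum(k * (padded[span + k : span + k + len(ceps)] -
                      padded[span - k : span - k + len(ceps)])
                 for k in range(1, span + 1)) / denom
    return np.hstack([ceps, deltas])                         # static + delta

c = np.cumsum(np.ones((20, 13)), axis=0)   # ramp: slope 1 per frame
cd = add_deltas(c)
print(cd.shape)        # (20, 26)
print(cd[10, 13])      # 1.0: the fitted slope of the ramp
```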

  26. General Theory - Features for Speaker Recognition • To help remove channel convolutional effects, cepstral mean subtraction (CMS) or RASTA filtering is applied to the cepstral vectors [Diagram: the channel filter h(t) passes through the front end - FT(), |.|, Log(), cosine transform - and appears as an additive cepstral bias] • Some speaker information is lost, but generally CMS is highly beneficial to performance • RASTA filtering is like a time-varying version of CMS (Hermansky, 92)
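CMS itself is one line; the sketch below shows why it removes a fixed channel, which appears as a constant additive bias in the cepstral domain (toy data, not the tutorial's):

```python
import numpy as np

# Cepstral mean subtraction: subtract the per-utterance mean from every
# cepstral vector, cancelling any constant (channel) bias.
def cms(ceps):
    return ceps - ceps.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 13))           # stand-in "clean" cepstra
channel_bias = rng.normal(size=(1, 13))      # convolutional channel -> bias
observed = clean + channel_bias
# After CMS the channel-corrupted and clean versions are identical
print(np.allclose(cms(observed), cms(clean)))   # True
```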

  27. General Theory - Phases of Speaker Verification System Two distinct phases to any speaker verification system • Enrollment phase: enrollment speech for each speaker (Bob, Sally); feature extraction; model training; produces a model (voiceprint) for each speaker • Verification phase: claimed identity (Sally); feature extraction; verification decision; Accepted!

  28. General Theory - Speaker Models • Speaker models are used to represent the speaker-specific information conveyed in the feature vectors • Desirable attributes of a speaker model • Theoretical underpinning • Generalizable to new data • Parsimonious representation (size and computation) • Modern speaker verification systems employ some form of Hidden Markov Model (HMM) • Statistical model for speech sound representation • Solid theoretical basis • Existing parameter estimation techniques

  29. General Theory - Speaker Models [Diagram: a sequence of observed feature vectors generated by the speaker (source) from hidden speech states] • Treat the speaker as a hidden random source generating the observed feature vectors • The source has “states” corresponding to different speech sounds

  30. General Theory - Speaker Models • Feature vectors generated from each state follow a Gaussian mixture distribution • Transitions between states are based on the modality of speech • Text-dependent case will have ordered states • Text-independent case will allow all transitions • Model parameters • Transition probabilities • State mixture parameters (feature distribution for each state) • Parameters are estimated from training speech using the Expectation-Maximization (EM) algorithm
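An EM estimate for the single-state (GMM) case can be sketched as below; this is a toy diagonal-covariance implementation under assumed initialization, not the tutorial's trainer:

```python
import numpy as np

# Minimal EM for a diagonal-covariance Gaussian mixture (the GMM /
# single-state HMM case from the slides).
def em_gmm(X, n_mix=2, n_iter=25):
    n, d = X.shape
    w = np.full(n_mix, 1.0 / n_mix)                       # mixture weights
    mu = X[np.linspace(0, n - 1, n_mix).astype(int)]      # spread-out init
    var = np.ones((n_mix, d))                             # diagonal variances
    for _ in range(n_iter):
        # E-step: posterior probability of each mixture for each frame
        logp = (-0.5 * (((X[:, None] - mu) ** 2 / var).sum(-1)
                        + np.log(2*np.pi*var).sum(-1)) + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)
        post = np.exp(logp); post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        nk = post.sum(0)
        w = nk / n
        mu = (post.T @ X) / nk[:, None]
        var = (post.T @ X**2) / nk[:, None] - mu**2 + 1e-6
    return w, mu, var

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 1, (300, 2)), rng.normal(3, 1, (300, 2))])
w, mu, var = em_gmm(X)
print(np.sort(mu[:, 0]).round(1))   # means near -3 and +3
```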

  31. General Theory - Speaker Models • HMMs encode the temporal evolution of the features (spectrum) • HMMs represent underlying statistical variations in the speech state (e.g., phoneme) and temporal changes of speech between the states • This provides a statistical model of how a speaker produces sounds • Designer needs to set • Topology (# states and allowed transitions) • Number of mixtures

  32. General Theory - Speaker Models Form of HMM depends on the application: • Fixed phrase (“Open sesame”): word/phrase models • Prompted phrases/passwords: phoneme models (/t/ /e/ /n/) • Text-independent (general speech): single-state HMM (GMM)

  33. General Theory - Speaker Models • The dominant model factor in speaker recognition performance is the number of mixtures used (Matsui and Furui, ICASSP92) • Selection of mixture order depends on a number of factors • Topology of the HMM • Amount of training data • Desired model size • No good theoretical technique exists to pick the mixture order • Usually set empirically • Parameter-tying techniques can help increase the effective number of Gaussians with limited total parameter increase

  34. General Theory - Speaker Models [Diagram: trellis of states versus time for feature vectors x(1) x(2) x(3) x(4)] • The likelihood of an HMM given a sequence of feature vectors is computed either as the full likelihood score (summing over all state paths) or as the Viterbi (best-path) score
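The two scores can be contrasted on a toy HMM (illustrative values only; the forward recursion gives the full likelihood, the max recursion the Viterbi score):

```python
import numpy as np

def log_gauss(x, mu, var):
    # Log density of a 1-D Gaussian, vectorized over states
    return -0.5 * (np.log(2*np.pi*var) + (x - mu)**2 / var)

# Full (forward) log-likelihood vs. Viterbi (best-path) log score
def score_hmm(obs, log_pi, log_A, mu, var):
    fwd = log_pi + log_gauss(obs[0], mu, var)     # forward log-probs
    vit = fwd.copy()                              # Viterbi log-probs
    for x in obs[1:]:
        e = log_gauss(x, mu, var)
        fwd = e + np.logaddexp.reduce(fwd[:, None] + log_A, axis=0)
        vit = e + np.max(vit[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(fwd), np.max(vit)

log_pi = np.log([0.5, 0.5])
log_A = np.log([[0.9, 0.1], [0.1, 0.9]])          # sticky transitions
obs = np.array([0.1, -0.2, 4.9, 5.2, 5.0])
full, best = score_hmm(obs, log_pi, log_A, mu=np.array([0.0, 5.0]),
                       var=np.array([1.0, 1.0]))
print(full >= best)    # True: the best path is one term of the full sum
```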

  35. General Theory - Phases of Speaker Verification System Two distinct phases to any speaker verification system • Enrollment phase: enrollment speech for each speaker (Bob, Sally); feature extraction; model training; produces a model (voiceprint) for each speaker • Verification phase: claimed identity (Sally); feature extraction; verification decision; Accepted!

  36. General Theory - Verification Decision • The verification task is fundamentally a two-class hypothesis test • H0: the speech S is from an impostor • H1: the speech S is from the claimed speaker • We select the most likely hypothesis (Bayes test for minimum error): accept if p(S|H1) / p(S|H0) exceeds a threshold • This is known as the likelihood ratio test

  37. General Theory - Verification Decision [Diagram: speech S passes through front-end processing; the claimed speaker model contributes positively and the impostor model negatively to the score L] • Usually the log-likelihood ratio is used • The H1 likelihood is computed using the claimed speaker model • Requires an alternative or impostor model for the H0 likelihood
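A minimal sketch of the log-likelihood-ratio decision, with 1-D Gaussians standing in for the real speaker and impostor models (all numeric values hypothetical):

```python
import numpy as np

def loglike(frames, mu, var):
    # Total log-likelihood of the frames under a 1-D Gaussian "model"
    return np.sum(-0.5 * (np.log(2*np.pi*var) + (frames - mu)**2 / var))

# Accept when the per-frame log-likelihood ratio exceeds a threshold
def verify(frames, spk, imp, threshold=0.0):
    llr = (loglike(frames, *spk) - loglike(frames, *imp)) / len(frames)
    return llr, llr > threshold

rng = np.random.default_rng(2)
spk_model, imp_model = (1.0, 1.0), (0.0, 4.0)    # (mean, variance) pairs
target = rng.normal(1.0, 1.0, 500)      # speech from the claimed speaker
impostor = rng.normal(-1.5, 2.0, 500)   # speech from someone else
print(verify(target, spk_model, imp_model)[1])    # True (accepted)
print(verify(impostor, spk_model, imp_model)[1])  # False (rejected)
```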

  38. General Theory - Background Model • There are two main approaches for creating an alternative model for the likelihood ratio test • Cohorts/Likelihood Sets/Background Sets (Higgins, DSPJ91) • Use a collection of other speaker models • The likelihood of the alternative is some function, such as the average, of the individual impostor model likelihoods • General/World/Universal Background Model (Carey, ICASSP91) • Use a single speaker-independent model • Trained on speech from a large number of speakers to represent general speech patterns

  39. General Theory - Background Model • The background model is crucial to good performance • Acts as a normalization to help minimize non-speaker-related variability in the decision score • Just using the speaker model’s likelihood does not perform well • Too unstable for setting decision thresholds • Influenced by too many non-speaker-dependent factors • The background model should be trained using speech representing the expected impostor speech • Same type of speech as speaker enrollment (modality, language, channel) • Representation of impostor genders and microphone types to be encountered

  40. General Theory - Background Model • Selected highlights of research on background models • Near/far cohort selection (Reynolds, SpeechComm95) • Select cohort speakers to cover the speaker space around the speaker model • Phonetic-based cohort selection (Rosenberg, ICASSP96) • Select speech and speakers to match the same speech modality as used for speaker enrollment • Microphone-dependent background models (Heck, ICASSP97) • Train the background model using speech from the same type of microphone as used for speaker enrollment • Adapting the speaker model from the background model (Reynolds, Eurospeech97, DSPJ00) • Use Maximum A Posteriori (MAP) estimation to derive the speaker model from a background model
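The MAP-adaptation idea from the last bullet can be sketched as below. This is a means-only adaptation with a relevance factor r, a common simplification in the spirit of the cited work; the background model and enrollment data are toy values:

```python
import numpy as np

# MAP-adapt the mixture means of a background model toward enrollment
# data: each adapted mean interpolates between the background mean and
# the data mean, weighted by how much data the mixture saw.
def map_adapt_means(X, ubm_w, ubm_mu, ubm_var, r=16.0):
    # Posterior of each mixture for each enrollment frame
    logp = (-0.5 * (((X[:, None] - ubm_mu) ** 2 / ubm_var).sum(-1)
                    + np.log(2*np.pi*ubm_var).sum(-1)) + np.log(ubm_w))
    logp -= logp.max(axis=1, keepdims=True)
    post = np.exp(logp); post /= post.sum(axis=1, keepdims=True)
    nk = post.sum(0)                               # soft counts per mixture
    ex = (post.T @ X) / np.maximum(nk[:, None], 1e-10)
    alpha = (nk / (nk + r))[:, None]               # data-dependent weight
    return alpha * ex + (1 - alpha) * ubm_mu       # adapted means

ubm_w = np.array([0.5, 0.5])
ubm_mu = np.array([[-3.0], [3.0]])
ubm_var = np.ones((2, 1))
enroll = np.random.default_rng(3).normal(3.5, 0.5, (200, 1))
mu_spk = map_adapt_means(enroll, ubm_w, ubm_mu, ubm_var)
print(mu_spk.round(1))   # the mixture near +3 moves toward 3.5; the other stays
```

Mixtures with little enrollment data keep their background parameters, which is what makes the adapted model robust with limited enrollment speech.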

  41. General Theory - Components of Speaker Verification System [Diagram: input speech (“My name is Bob”) and identity claim; feature extraction; scoring against Bob’s speaker model and impostor model(s); decision: ACCEPT or REJECT]

  42. Part I : Background and Theory - Outline • Overview of area • Applications • Terminology • General Theory • Features for speaker recognition • Speaker models • Verification decision • Channel compensation • Adaptation • Combination of speech and speaker recognizers

  43. Channel Compensation The largest challenge to practical use of speaker verification systems is channel variability • Variability refers to changes in channel effects between enrollment and successive verification attempts • Channel effects encompass several factors • The microphones • Carbon-button, electret, hands-free, etc. • The acoustic environment • Office, car, airport, etc. • The transmission channel • Landline, cellular, VoIP, etc. • Anything which affects the spectrum can cause problems • Speaker and channel effects are bound together in the spectrum, and hence in the features used by speaker verifiers • Unlike speech recognizers, speaker verifiers cannot “average out” these effects using large amounts of speech • Limited enrollment speech

  44. Channel Compensation - Examples Different!

  45. Channel Compensation - Examples The Same!

  46. Channel Compensation - Error Rates Using compensation techniques has driven down error rates in NIST evaluations [Chart labels: factor of 20 worse; factor of 2.5 worse] • Three areas where compensation has been applied • Feature-based approaches: CMS and RASTA; nonlinear mappings • Model-based approaches: handset-dependent background models; Speaker Model Synthesis (SMS) • Score-based approaches: Hnorm, Tnorm

  47. Channel Compensation - Feature-based Approaches [Diagram: linear and non-linear filters mapping electret speech to carbon-button speech; an ANN transform from input features to output features, trained discriminatively, feeding the speaker recognition system] • CMS and RASTA only address linear channel effects on features • Several approaches have looked at non-linear effects • Non-linear mapping (Quatieri, TrSAP 2000) • Use a Volterra series to map speech between different types of handsets • Discriminative feature design (Heck, SpeechCom 2000) • Use a neural net to find features that discriminate speakers, not channels

  48. Channel Compensation - Model-based Approaches [Diagram: synthesis of cellular and carbon-button speaker models from an electret model] • It is generally difficult to get enrollment speech from all microphone types to be used • The SMS approach addresses this by synthetically generating speaker models as if they came from different microphones (Teunen, ICSLP 2000) • A mapping of model parameters between different microphone types is applied

  49. Channel Compensation - Score-based Approaches [Plot: LR scores vs. Hnorm scores for electret and carbon-button utterances against two speaker models] • Speaker model LR scores have different biases and scales for utterances from different handset types • Hnorm attempts to remove these bias and scale differences from the LR scores (Reynolds, NIST eval96) • Estimate the mean and standard deviation of impostor, same-sex utterance scores from different microphone types • During verification, normalize the LR score based on the microphone label of the utterance
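Hnorm can be sketched as below (the handset labels and impostor score distributions are hypothetical stand-ins):

```python
import numpy as np

# Per handset type, estimate mean/std of impostor LR scores against a
# speaker model; at test time, normalize with the labeled handset's stats.
def hnorm_stats(impostor_scores_by_handset):
    return {h: (np.mean(s), np.std(s))
            for h, s in impostor_scores_by_handset.items()}

def hnorm(score, handset, stats):
    mu, sd = stats[handset]
    return (score - mu) / sd

rng = np.random.default_rng(4)
# Hypothetical impostor scores against one speaker model on two handsets
stats = hnorm_stats({"elec": rng.normal(-1.0, 0.5, 1000),
                     "carb": rng.normal(-2.5, 0.8, 1000)})
# The same raw score means different things on different handsets
print(round(hnorm(0.0, "elec", stats), 1), round(hnorm(0.0, "carb", stats), 1))
```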

  50. Channel Compensation - Score-based Approaches [Diagram: the test utterance is scored against the speaker model and a set of cohort models to produce the Tnorm score] • Tnorm/HTnorm - estimates bias and scale parameters for score normalization using a “cohort” set of speaker models (Auckenthaler, DSP Journal 2000) • Test-time score normalization • Normalizes the target score relative to a non-target model ensemble • Similar to standard cohort normalization except for the standard-deviation scaling • Uses cohorts of the same gender and channel as the speaker • Can be used in conjunction with Hnorm
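Tnorm can be sketched as below (all scores hypothetical):

```python
import numpy as np

# Tnorm: score the same test utterance against a cohort of other speaker
# models, then normalize the target score by the cohort mean and std.
def tnorm(target_score, cohort_scores):
    c = np.asarray(cohort_scores, dtype=float)
    return (target_score - c.mean()) / c.std()

# Hypothetical scores for one test utterance
cohort = [-0.8, -1.2, -0.5, -1.0, -0.9]   # non-target (cohort) model scores
print(round(tnorm(1.5, cohort), 2))       # target stands well above the cohort
```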
