
Ayako Ikeno and John H.L. Hansen

Perceptual In-Set Speaker ID using Neutral Speech and Lombard Speech. Ayako Ikeno and John H.L. Hansen. Center for Robust Speech Systems (CRSS) Erik Jonsson School of Engineering & Computer Science University of Texas at Dallas Richardson, Texas 75083-0688, U.S.A.





Presentation Transcript


  1. Perceptual In-Set Speaker ID using Neutral Speech and Lombard Speech Ayako Ikeno and John H.L. Hansen Center for Robust Speech Systems (CRSS) Erik Jonsson School of Engineering & Computer Science University of Texas at Dallas Richardson, Texas 75083-0688, U.S.A. IAFPA-2006 July 23-26, 2006

  2. Outline • CRSS & Speech Processing Overview • Previous Studies on Stress & Lombard Effect • Perceptual Speaker ID with Lombard Speech • Speech Corpus - UTScope • Experimental Setup • Results • Summary & Impact

  3. Overview of CRSS-Hansen Research • Dialect & Accent: UAE, Egypt, Palestine, etc.; Cuba, Peru, Puerto Rico; Cambridge, Irish, Welsh, etc. • In-Set / Out-of-Set Speaker Detection • Normalization: Speaker, Environment, Language • UTDrive & CU-Move: In-Vehicle Voice Navigation • Spoken Document Retrieval: http://SpeechFind.utdallas.edu • Speech Under Stress • Speech Enhancement

  4. Why Speech Systems Break? • Diagram labels: American English speaker; stress; Lombard effect; environment noise; channel noise; voice communications channel; speech recognition; speaker recognition; language/accent recognition; human (auditory) recognition • SPEAKER-BASED PROBLEMS: Stress & Emotion; Lombard Effect / Noise; Psychological Task Demands; Accent/Language; Speaker Differences (age, sex, vocal tract); Spontaneous Speech ("Um, I just wanna, I just want to say, I don't know what I want to say.") • CONTEXT-BASED EFFECTS: Homonyms (English +10,000; Japanese 120); Confusable: (Take, Stake, Straight; Cake, Kate); Ambiguous: "Jeet yet?"; "It's ours" vs. "It sours"; "Nice guys" vs. "Nice skies" • COMMUNICATION-BASED / ENVIRONMENTAL-BASED: Microphone; Acoustic Noise; Voice Compression; Room Reverberation; Channel / Mobile Cellular; Physical Task Demands

  5. Speech Production: Phonetics & Acoustics • Diagram labels: Speaker, Stress, Noise, Microphone, Speech Physiology, Acoustic Speech Waveform (Neutral, Stress)

  6. DOES STRESS VARIABILITY IMPACT SPEAKER RECOGNITION? • Limited research on speaker recognition over stress, Lombard effect, etc. • The NATO RSG.10 report showed probe experimental results with the SUSAS corpus (NATO, 2000)

  7. EFFECT OF STRESS ON SPEECH FEATURES (earlier studies by Hansen (1988): 200 speech features, 10,000 statistical tests) • Pitch • Glottal Spectral Slope • Formant Location

  8. EFFECT OF STRESS ON SPEECH FEATURES • Phone Duration • RMS Intensity

  9. STRESS DETECTION USING PITCH • Probability distribution • Detection (ROC) curves • Conditional Gaussian fit (Zhou, Hansen 1997) • Classification error rate • Neutral vs. Loud: 7.24% (Neutral), 8.28% (Loud) • Neutral vs. Lombard: 20.69% (Neutral), 19.31% (Lombard)
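
The slide names a conditional Gaussian fit on pitch and reports per-class error rates. As a rough illustration only, the sketch below fits one Gaussian per class and assigns a pitch value to the class with the higher likelihood. It is a generic two-class Gaussian classifier, not the procedure of Zhou and Hansen (1997), and the pitch values, function names, and the "stressed" label are all hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import Sequence, Tuple

Gaussian = Tuple[float, float]  # (mean, standard deviation)


def fit_gaussian(values: Sequence[float]) -> Gaussian:
    """Fit a 1-D Gaussian using the sample mean and population std. deviation."""
    return mean(values), pstdev(values)


def log_likelihood(x: float, model: Gaussian) -> float:
    """Log of the Gaussian density at x."""
    mu, sigma = model
    return -math.log(sigma * math.sqrt(2.0 * math.pi)) - 0.5 * ((x - mu) / sigma) ** 2


def classify(x: float, neutral: Gaussian, stressed: Gaussian) -> str:
    """Pick the class whose Gaussian gives the higher likelihood."""
    if log_likelihood(x, neutral) >= log_likelihood(x, stressed):
        return "neutral"
    return "stressed"


def error_rate(values: Sequence[float], true_label: str,
               neutral: Gaussian, stressed: Gaussian) -> float:
    """Fraction of values assigned to the wrong class."""
    wrong = sum(classify(v, neutral, stressed) != true_label for v in values)
    return wrong / len(values)


# Hypothetical mean-pitch values in Hz (assumption: NOT data from the slides).
neutral_pitch = [110.0, 118.0, 121.0, 125.0, 130.0, 115.0]
loud_pitch = [160.0, 172.0, 181.0, 190.0, 168.0, 175.0]

neutral_model = fit_gaussian(neutral_pitch)
loud_model = fit_gaussian(loud_pitch)

neutral_error = error_rate(neutral_pitch, "neutral", neutral_model, loud_model)
loud_error = error_rate(loud_pitch, "stressed", neutral_model, loud_model)
print(f"neutral error: {neutral_error:.2%}, loud error: {loud_error:.2%}")
```

The toy classes are well separated, so this sketch says nothing about the error rates on the slide. Heavily overlapping pitch distributions would drive the per-class error rates up.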

  10. ROC CURVES STRESS DETECTION

  11. PAST STRESS DETECTION STUDIES USING TRADITIONAL FEATURES • Stress/Neutral Error Rates • Individual Feature: Pitch 6-21%; Glottal Spectral Slope 18-36%; Intensity 18-36%; Phone Duration 28-46%; Formant Location: 1st formant 38-46%, 2nd formant 50-58% • Feature Fusion (Duration + Intensity + mean Pitch): 0-17%

  12. TEAGER ENERGY OPERATOR • Discrete-time TEO: $\Psi[x(n)] = x^2(n) - x(n+1)\,x(n-1)$ • Continuous-time TEO: $\Psi_c[x(t)] = \left(\frac{dx(t)}{dt}\right)^2 - x(t)\,\frac{d^2x(t)}{dt^2}$, where $\Psi$ is the Teager Energy Operator • TEO-CB-Auto-Env: Critical Band based TEO AUTOcorrelation ENVelope • Critical Frequency 17 Band Partition = based on Auditory Perception • Ref: Zhou, Hansen, Kaiser, IEEE Transactions on Speech & Audio Processing, vol. 9(2): 201-216, March 2001
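
A minimal sketch of the discrete-time Teager Energy Operator in its standard form, psi[x(n)] = x(n)^2 - x(n+1) * x(n-1). It covers only the operator itself, not the critical-band filtering or the autocorrelation-envelope step of TEO-CB-Auto-Env. The function name and the test sinusoid are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def teager_energy(x: Sequence[float]) -> List[float]:
    """Discrete-time TEO: psi[x(n)] = x(n)^2 - x(n+1) * x(n-1).

    The result has len(x) - 2 samples because the first and last samples
    lack a neighbour on one side.
    """
    return [x[n] * x[n] - x[n + 1] * x[n - 1] for n in range(1, len(x) - 1)]


# Illustrative input (assumption): a pure sinusoid A * cos(omega * n).
# For this signal the operator is constant and equal to A^2 * sin(omega)^2.
A, omega = 2.0, 0.3
signal = [A * math.cos(omega * n) for n in range(50)]
psi = teager_energy(signal)
expected = (A * math.sin(omega)) ** 2
print(max(abs(v - expected) for v in psi))  # deviation from the closed form
```

The constant output for a pure tone is why the operator is read as tracking both amplitude and frequency: the value grows with A and with omega.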

  13. Assessment for NATO SUSC-0 Military Cockpit Recordings • Neutral HMM Model vs. Stress-trained HMM Model

  14. Detection of Speech Under Stress: WRAIR • GOAL: (1) Identify, Model, and Classify Speech Under Stress in Military-Related Task Conditions, and (2) Improve Automatic Speech Coding under Stress • APPROACH: • Effective “Soldier of the Quarter Board” Paradigm • Monitor and Track “Biometrics” of Stress: heart rate, blood pressure, stress hormones, psychometrics • Engineering: Focus on NONLINEAR Air Turbulent Model (Teager Energy Operator); Identify Stress-Dependent Performance across speakers, phonemes • Ref: Rahurkar, Hansen, Meyerhoff, Saviolakis, Koenig, "Frequency Distribution based Weighted Sub-Band Approach for Classification of Emotional/Stressful Content in Speech," Interspeech, pp. 721-724, Geneva, Switzerland, Sept. 2003 (another paper at Interspeech-2005)

  15. Lombard Effect • First observed by Etienne Lombard in 1911 • Change in speech production in response to noise to increase communication performance • “Lombard Test”: standard test for hearing loss in U.S. (ASHA); measures dB-SPL change in speech production • Hansen (1988): evaluation of 200 features with +10,000 statistical tests on 11 different stressed speech conditions to quantify changes in speech production

  16. Speech Corpus: UTScope • Audio samples for the perceptual experiment were extracted from the UTScope corpus (Speech under COgnitive and Physical stress & Emotion) • Consists of 4 Domains • Lombard Effect: noise levels & types • Physical Stress: stair climbing/stepper • Cognitive Stress: driving (simulator & actual) • Emotion (Angry, Fear, Anxiety, Frustration) • IAFPA-06: focus on “Lombard Effect”

  17. Lombard Effect • Goal: obtain Lombard Speech at different noise levels • Quantify ground truth with biometric analysis • Lombard Effect Speech: 9 conditions (3 noise types, 3 levels) • Highway Noise (windows ½ open): 70, 80, 90 dB-SPL • Large Crowd Noise: 70, 80, 90 dB-SPL • Pink Noise: 65, 75, 85 dB-SPL • 1 sec. duration

  18. UTScope Corpus • LARGE CROWD NOISE: 70, 80, 90 dB-SPL • HIGHWAY DRIVING, WINDOWS HALF OPEN: 70, 80, 90 dB-SPL • PINK NOISE: 65, 75, 86 dB-SPL • PURETONE HEARING SCREENING • NOISE LEVELS CALIBRATED WITH QUEST SLM • OPEN-AIR HEADPHONES FOR SPEECH FEEDBACK

  19. UTScope Corpus • 20 TIMIT SENTENCES • 5 DIGIT STRINGS • 1 MINUTE SPONTANEOUS SPEECH • 8-CHANNEL DAT RECORDER • 100 SPEAKERS • P-MIC • CLOSE-TALKING MIC • FAR-FIELD MIC

  20. UTScope Data Collection • The ASHA-certified sound booth and recording equipment

  21. Spectrogram Examples • Panels: Male, Neutral; Male, Lombard • Lombard Effect impacts Temporal and Spectral Structure (as expected) • Evaluation: Perceptual Experiments to assess Speaker Recognition

  22. Perceptual In-Set Speaker ID • Listener Test • Speakers: Corpus: UTScope; Native US English speakers; Female speakers only • Speech Conditions: Noise Type: Highway driving; Noise Level: 90 dB-SPL

  23. Perceptual In-Set Speaker ID • Speech Materials • Read speech • TIMIT sentences: phonetically balanced • 3 sentences per audio sample (.wav, 16 kHz) • Ref: "Basketball can be an entertaining sport." "My problem is, the cat’s meow always hurts my ears." "The causeway ended abruptly at the shore." • Test: "Youngsters commonly love chocolate and candies as treats." "December and January are nice months to spend in Miami." "There were other farmhouses nearby."

  24. Perceptual In-Set Speaker ID • Listener Test • Listeners (12: 2 female / 10 male as of May ‘06; 41 as of July ‘06): India (4), China (1), Korea (1), Mexico (1), Pakistan (1), Thailand (1), Turkey (1), US (1), Vietnam (1) • Task: In-Set vs. Out-of-Set Speaker Identification • Reference/Training: 12 In-Set female speakers • Test: 8 In-Set speakers, 4 Out-of-Set speakers

  25. In-Set Speaker ID User Interface • Reference audio: Neutral / Lombard • Test audio: Neutral / Lombard

  26. In-Set Speaker ID Results • The effect of speech condition: significant (p=.0024). • Mismatched condition (NL-LD) accuracy: chance level (52%). • Lombard speech (LD-LD, 79%): higher accuracy than neutral speech (NL-NL, 67%). • The Lombard effect may emphasize speech characteristics and so improve accuracy in perceptual speaker ID.

  27. Automated System Performance (SUSAS Corpus) • Closed-Set Speaker ID Accuracy (%) • The trend holds for the automated system. • LOSS 5-74% • Angry 62% • Lombard 48% • Loud 74% • (See Hansen et al., The Impact of Speech Under `Stress' on Military Speech Technology, NATO Research & Tech. Org. RTO-TR-10, March 2000.)

  28. In-Set Speaker ID Results • In-Set accuracy: affected by the speech condition significantly (p<.0001). • Out-of-Set accuracy: the effect of speech condition marginally significant (p=.0450). • Mismatched condition: • In-Set accuracy below chance level. • Out-of-Set accuracy & false alarm rate high. • Matched conditions: In-Set accuracy & false acceptance rate higher.
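
To make the reported quantities concrete, the sketch below scores a set of listener judgments separately for in-set and out-of-set test speakers. The trial layout, function name, and responses are hypothetical. Only the 8 in-set / 4 out-of-set split mirrors the test design described in the slides, and the counts of correct answers are invented.

```python
from typing import Dict, List, Tuple

# (speaker_is_in_set, listener_judged_in_set)
Trial = Tuple[bool, bool]


def in_set_metrics(trials: List[Trial]) -> Dict[str, float]:
    """Accuracy on in-set trials, on out-of-set trials, and overall."""
    in_set_judgments = [judged for truth, judged in trials if truth]
    out_set_judgments = [judged for truth, judged in trials if not truth]
    in_set_correct = sum(in_set_judgments)                  # judged in-set, is in-set
    out_set_correct = sum(not j for j in out_set_judgments)  # judged out, is out
    return {
        "in_set_accuracy": in_set_correct / len(in_set_judgments),
        "out_of_set_accuracy": out_set_correct / len(out_set_judgments),
        "overall_accuracy": (in_set_correct + out_set_correct) / len(trials),
    }


# Hypothetical responses from one listener: 8 in-set and 4 out-of-set test speakers.
trials: List[Trial] = (
    [(True, True)] * 6 + [(True, False)] * 2      # 6 of 8 in-set speakers recognised
    + [(False, False)] * 3 + [(False, True)] * 1  # 3 of 4 out-of-set speakers rejected
)
metrics = in_set_metrics(trials)
print(metrics)
```

Reporting the two accuracies separately matters because a listener who answers "out-of-set" for most trials scores well on out-of-set speakers while falling below chance on in-set speakers.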

  29. In-Set Speaker ID Results • Accuracy, Confidence Ratings, and Token Coverage • Confidence rating 3 (somewhat sure) and above: • Accuracy: 72% or higher. • Token coverage: 95% or more. • Confidence ratings can be used to filter listener responses for higher accuracy without significantly reducing token coverage.
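
The filtering idea can be sketched as follows: keep only responses at or above a confidence threshold, then report accuracy on the kept responses and the fraction of tokens retained. The 1-to-5 rating scale, the function name, and the responses are assumptions for illustration. The slides state only that a rating of 3 means "somewhat sure".

```python
from typing import List, Tuple

# (response_was_correct, confidence_rating)
Response = Tuple[bool, int]


def accuracy_and_coverage(responses: List[Response],
                          min_confidence: int) -> Tuple[float, float]:
    """Accuracy over responses rated >= min_confidence, and the share of tokens kept."""
    kept = [correct for correct, rating in responses if rating >= min_confidence]
    coverage = len(kept) / len(responses)
    accuracy = sum(kept) / len(kept) if kept else 0.0
    return accuracy, coverage


# Hypothetical listener responses (assumption: ratings on a 1-5 scale).
responses: List[Response] = [
    (True, 5), (True, 4), (False, 1), (True, 3), (False, 2),
    (True, 5), (True, 3), (False, 3), (True, 4), (True, 2),
]

for threshold in (1, 3):
    acc, cov = accuracy_and_coverage(responses, threshold)
    print(f"rating >= {threshold}: accuracy {acc:.0%}, coverage {cov:.0%}")
```

Raising the threshold trades coverage for accuracy. In this toy data, the rating 3 cutoff discards more tokens than the slide reports for the real experiment.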

  30. Summary & Conclusions • Results: • Mismatched condition: • The ID accuracy is at chance level. • The Out-of-Set accuracy and false alarm rate are high. • Matched conditions: • The ID accuracy is higher with Lombard speech than with neutral speech.

  31. Summary & Conclusions • Production analysis: Characteristics of speech features change in different manners depending on the noise type and noise level (Varadarajan and Hansen, 2006). • Implications: The ID accuracy is higher with Lombard speech if the noise type and level match for the reference and test audio. • Future Work: The effect of noise type & noise level on perceptual speaker ID.

  32. Monitoring Speaker State: Soldier Of Quarter (SOQ) Corpus • Board Setup (diagram label: Candidate) • Test Time: AB: T-7 day; C: T-20 min; D: D.o.Board; E: T+20 min; FG: T+7 day • 6 Questions • Data Recorded: Speech, Biometrics

  33. Monitoring Speaker State: BIOMETRIC MEASURES • Change: +34.2%, +33.0%, +22.5% • HR = Heart Rate (beats per minute). sBP = Systolic Blood Pressure. dBP = Diastolic Blood Pressure. • * Chemical analysis of saliva & blood samples was also done

  34. Monitoring Speaker State: Classification Error Results for Closed Speaker Set • Chart labels: Baseline, Weighted CB; Stress, Neutral
