1 / 20

“TalkPrinting”: Improving Speaker Recognition by Modeling Stylistic Features



  1. “TalkPrinting”: Improving Speaker Recognition by Modeling Stylistic Features S. Kajarekar, K. Sonmez, L. Ferrer, V. Gadde, A. Venkataraman, E. Shriberg, A. Stolcke, H. Bratt SRI International, Menlo Park, CA Funding: KDD. Notes: 9 months into project; results updated from paper. NSF-NIJ Symposium, June 2-3, 2003

  2. Outline • Project motivation and goal • Overview of approach • Selected results to date: • TalkPrint features: lexical, prosodic • Adding TalkPrint features to baseline • Effect of amount of training data • Effect of amount of test data • Summary, conclusions and future work

  3. Motivation • Significant distal communication occurs by voice only (e.g., telephone conversations) • Vast amounts of data captured for intelligence, law enforcement • Analysts can listen to only a small percentage • Need technology to help filter out the majority of uninteresting cases and mark the ones that contain interesting speakers and/or content • Must be completely automatic from audio

  4. Objective • Model patterns in the way people talk to find out: • Who is talking? (speaker recognition) • What type of conversation is it? (style recognition) • Is a speaker acting strange? (anomaly detection: emotion, cognitive state, health, etc.) • Today: report on speaker recognition only • “Tag” speech data with this information (probabilistically), to aid analysts

  5. Standard Approach • Slice speech into tiny (10 ms) time regions, model energy/frequency distributions • Each speaker = Gaussian mixture model (GMM) • Frames are independent (unordered) • No longer-range information [Slide figure: the word “Okay” sliced into frames]
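The frame-independence assumption on this slide can be made concrete with a minimal sketch: under a GMM, the log-likelihood of a frame sequence is just the sum of per-frame log-likelihoods, so frame order carries no information. For simplicity the sketch uses scalar frames and a 1-D two-component mixture with made-up (illustrative) parameters, not real cepstral features or a trained model.

```python
import math

def gmm_log_likelihood(frames, weights, means, variances):
    """Total log-likelihood of a frame sequence under a 1-D Gaussian
    mixture. Frames are treated as independent, so per-frame
    log-likelihoods simply add up -- reordering frames changes nothing."""
    total = 0.0
    for x in frames:
        # Mixture likelihood of one frame: weighted sum over components.
        p = sum(
            w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances)
        )
        total += math.log(p)
    return total

# Toy two-component "speaker model" (illustrative parameters only).
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [0.5, 0.5]
frames = [0.9, -1.1, 1.2]
score = gmm_log_likelihood(frames, weights, means, variances)
```

Because the per-frame terms are summed, scoring the frames in reverse order yields the same total — exactly the "unordered" limitation the next slide criticizes.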

  6. Limitations of Current Approach • Works very well under certain conditions, BUT: • Degrades with channel variation and noise • Can’t distinguish people with similar vocal tracts • Fails to capture longer-range properties: • habitual word patterns, disfluency rates and types • prosodic (pausing, temporal, and intonation) patterns • turn-length, turn-taking patterns • Long-range cues also useful for style, anomalies.

  7. New Approach: ‘TalkPrinting’ • Capture behavioral patterns in how a person talks (speaking rate, intonation, word usage, etc.) • Humans use these patterns (e.g., identifying a speaker through a wall) • Patterns reflect different underlying causes: dialectal, social, pragmatic, cognitive, affective • While behavioral, many patterns are hard to fake well • Combine TalkPrint features with conventional (voiceprint) features

  8. Decision Paradigm • Build model for general population • Build model for the target speaker • Compare the models using a likelihood-ratio test: score is the difference in log scores for the data given each model • This score is compared to a threshold for making discrete decisions • Threshold determines the tradeoff between error types (misses, false alarms)
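The decision paradigm above reduces to a few lines: score the same data under both models, take the difference of log-likelihoods, and compare to a threshold. The log-likelihood values below are placeholders, not outputs of a real system.

```python
def llr_score(loglik_target, loglik_background):
    """Likelihood-ratio score: difference between the log-likelihood of
    the data under the target-speaker model and under the
    general-population (background) model."""
    return loglik_target - loglik_background

def decide(score, threshold):
    """Discrete accept/reject decision. A low threshold admits more
    false alarms; a high threshold produces more misses."""
    return "target" if score >= threshold else "impostor"

# Data fits the target model (log-lik -40) better than the background (-45).
score = llr_score(-40.0, -45.0)
```

Sweeping the threshold is what traces out the DET curve discussed on slide 10: each threshold setting yields one (miss rate, false-alarm rate) operating point.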

  9. Experiments • Task: speaker ID (no data yet for others) • Data from telephone conversations on various topics (Switchboard corpus) • Built competitive baseline system in order to assess gain from TalkPrint features • Built TalkPrint systems from new features • To date, fused systems at the score level, using neural network (updated from paper) • Eventual goal: fuse at the feature level
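Score-level fusion means each subsystem produces one score per trial and those scores are combined into a single decision score. The slide says the actual combiner is a neural network; as a much simpler hedged stand-in, the sketch below uses a fixed weighted sum with illustrative weights.

```python
def fuse_scores(scores, weights):
    """Score-level fusion: combine the per-system scores for one trial
    into a single score. A weighted sum stands in here for the neural
    network used in the actual system; the weights are illustrative."""
    assert len(scores) == len(weights)
    return sum(s * w for s, w in zip(scores, weights))

# Hypothetical baseline (cepstral GMM), language-model, and duration
# scores for one trial, fused with made-up weights.
fused = fuse_scores([2.1, 0.4, 0.7], [0.6, 0.2, 0.2])
```

Feature-level fusion (the stated eventual goal) would instead concatenate or jointly model the underlying features before any per-system scoring.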

  10. Standard Evaluation Metrics • Annual speaker recognition evaluations conducted by NIST • Various metrics: • Detection Error Trade-off (DET) curves: show the dependence between miss and false alarm rates • Equal Error Rate (EER): point on the DET curve at which miss rate = false alarm rate • Cost-weighted error rate (application dependent)
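The EER definition on this slide can be computed directly from a set of trial scores: sweep candidate thresholds, measure the miss rate on genuine (target) trials and the false-alarm rate on impostor trials, and report the error rate where the two are closest to equal. The scores below are toy values, not results from the paper.

```python
def equal_error_rate(genuine, impostor):
    """Equal Error Rate: sweep every observed score as a threshold and
    return the error rate at the point where miss rate and false-alarm
    rate are (closest to) equal."""
    best = None
    for t in sorted(genuine + impostor):
        miss = sum(s < t for s in genuine) / len(genuine)    # target rejected
        fa = sum(s >= t for s in impostor) / len(impostor)   # impostor accepted
        gap = abs(miss - fa)
        if best is None or gap < best[0]:
            best = (gap, (miss + fa) / 2)
    return best[1]

# Toy scores: higher means "more likely the target speaker".
eer = equal_error_rate([0.6, 0.7, 0.8, 0.9], [0.1, 0.2, 0.3, 0.65])
```

The cost-weighted error rate mentioned on the slide differs only in weighting misses and false alarms unequally before picking the operating point.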

  11. Questions: • Can we improve speaker recognition by augmenting baseline system with: • Language features? • Prosodic (rate, rhythm, melody) features? • How is performance affected by: • Amount of training data? • Amount of test data?

  12. Language Features &amp; ASR • Language model yields probability of frequent words/pairs: uh-huh, yeah, I mean, you know, etc. • Need to recognize the words first: requires a large-vocabulary conversational ASR engine • Word error rates on conversational speech are high (&gt;20%) even for state-of-the-art systems • Used a purposely stripped-down version of SRI’s state-of-the-art LVCSR system: 38% WER on this data • A test of whether we can get by with high WER!
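A minimal sketch of the language-feature idea: estimate each speaker's habitual-word probabilities from training transcripts and score new (errorful) ASR output against them. This simplified stand-in models single words only with add-one smoothing, whereas the slide says the actual system also models word pairs; the vocabulary and transcripts below are invented.

```python
import math
from collections import Counter

def word_logprob_model(training_words, vocab):
    """Unigram model over a small set of habitual words ("uh-huh",
    "yeah", "you know", ...), with add-one smoothing so words unseen
    in training do not zero out the score."""
    counts = Counter(w for w in training_words if w in vocab)
    total = sum(counts.values()) + len(vocab)
    return {w: math.log((counts[w] + 1) / total) for w in vocab}

def score_transcript(words, model):
    """Sum of log-probabilities of in-vocabulary words in an
    (errorful) ASR transcript; out-of-vocabulary words are skipped."""
    return sum(model[w] for w in words if w in model)

vocab = {"uhhuh", "yeah", "i", "mean", "you", "know"}
model = word_logprob_model("yeah i mean you know yeah".split(), vocab)
```

Skipping out-of-vocabulary words is one reason such features can tolerate high WER: the frequent filler words the model relies on are short, common, and comparatively well recognized.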

  13. Prosodic Features • Long line of work at SRI on prosody modeling • Notable aspect: model prosody directly from the signal (no intermediate phonological labels) • Raw feature types: • Duration (phonemes, syllables, words; normalized) • Pause location and duration • Intonation (pitch contours, stylized using spline fits) • Energy (also stylized contours) • Duration and pause features use time alignments from recognition hypothesis
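The pause features above come straight from the recognizer's word time alignments: a pause is a gap between one word's end time and the next word's start time. A minimal sketch, with an invented alignment and an illustrative minimum-gap threshold to ignore alignment jitter:

```python
def pauses(word_alignments, min_pause=0.05):
    """Extract pause locations and durations from word time alignments,
    given as (start_sec, end_sec, word) tuples in order. Gaps shorter
    than min_pause seconds are ignored as alignment jitter (the 50 ms
    threshold is an illustrative choice, not the system's)."""
    out = []
    for (s1, e1, w1), (s2, e2, w2) in zip(word_alignments, word_alignments[1:]):
        gap = s2 - e1
        if gap >= min_pause:
            out.append({"after": w1, "start": e1, "duration": gap})
    return out

# Hypothetical alignment: a 550 ms hesitation after "i".
align = [(0.00, 0.30, "well"), (0.32, 0.55, "i"), (1.10, 1.40, "think")]
p = pauses(align)
```

Statistics over these pause events (rate, mean duration, position relative to turn boundaries) then become speaker features.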

  14. Sample Prosodic Features • Duration features: • vector of durations of phones in a word and of “states” (3 subphone units) in a phone, e.g.: “Tucson” t uw s ah n • “NERFs” (New Extraction Region Features) • Sample: pitch and duration features in regions between consecutive pauses (few parameters)
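The slide's per-word duration vector, with the normalization mentioned on slide 13, can be sketched as z-scoring each phone's duration against population statistics. The "Tucson" phone sequence is from the slide; the durations and population means/standard deviations below are invented for illustration, and the real system also uses sub-phone "state" durations.

```python
def normalized_duration_vector(phone_durations, pop_stats):
    """Turn raw phone durations for one word into a feature vector by
    z-scoring each phone against population (mean, std) statistics, so
    habitually stretched or clipped phones stand out per speaker."""
    return [(d - pop_stats[ph][0]) / pop_stats[ph][1]
            for ph, d in phone_durations]

# "Tucson" -> t uw s ah n, with hypothetical durations in seconds and
# hypothetical population statistics per phone.
pop_stats = {"t": (0.06, 0.02), "uw": (0.10, 0.03), "s": (0.09, 0.03),
             "ah": (0.07, 0.02), "n": (0.06, 0.02)}
word = [("t", 0.08), ("uw", 0.16), ("s", 0.09), ("ah", 0.07), ("n", 0.10)]
vec = normalized_duration_vector(word, pop_stats)
```

Positive entries mark phones this speaker draws out relative to the population; near-zero entries mark typical durations.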

  15. Results: Features (16 training conversations, 1 test conversation) • LM and Dur combine well with each other • Fusing LM+Dur with baseline dramatically improves performance • First NERF results show further gain (fewer misses) • Note: DCF penalizes false alarms (suits access applications); filtering applications would instead penalize misses

  16. Effect of Amount of Training Data • 1 conversation = approx. 3 minutes • Performance improves with added training data • Effect similar for both baseline &amp; TalkPrint systems • Intelligence apps likely to keep adding training data

  17. Effect of Amount of Test Data • EER: false alarms = misses • Combined system = Baseline + Dur • Duration significantly aids performance • Helps even at 10 sec of test data • Baseline seems to saturate at 2 minutes; duration keeps improving with length

  18. Summary &amp; Conclusions (1) • Automatic tagging of massive amounts of audio data for speaker, likely content, and anomalies can preprocess data for human analysts • Conventional speaker recognition fails to capture beyond-the-frame behavioral patterns • We find such behavioral patterns aid speaker recognition when added to a state-of-the-art baseline system (frame-based features) • Useful TalkPrint features include both language and prosody

  19. Summary &amp; Conclusions (2) • Language and prosody features complement both each other and the baseline features • Both language and prosody features help despite nearly 40% of the words being misrecognized! • Performance of TalkPrint features improves with both added training and added test data • In contrast, baseline features appear to saturate after about 2 minutes of test data

  20. Future Work • Improve TalkPrint features • Develop feature selection and fusion methods • Investigate effect of various factors: • word error rate of ASR system • noise (are TalkPrint features more robust?) • Work with government to assess performance on relevant conversational data • Extend approach to capture information about type of conversation and to detect anomalies