Towards Superhuman Speech Recognition

Mukund Padmanabhan and Michael Picheny
Human Language Technologies Group, IBM Thomas J. Watson Research Center
Special thanks to: Stan Chen, Satya Dharanipragada, Geoff Zweig and members of the Telephony Speech Algorithms Group

Common UI Folklore
• “Except when interacting with video games, a user does not take very well to surprises” (Human-Computer Interaction; Dix, Finlay, Abowd and Beale)
• “Golden Rule #3: Make the interface consistent” (Elements of User Interface Design; Mandel)
• “Computer users usually seek predictable responses and are discouraged if they must engage in clarification dialogs frequently” (Designing the User Interface; Shneiderman)

Speech Recognition Progress

Human Performance(Lippmann, 1997)

Problem Categorization

Domain Dependence

Observations
1. Spontaneous speech: largest effect on WER (Switchboard, Voicemail, Meetings, real-world speech)
2. Multi-environment speech sources (16K, 8K, far-field microphone, noisy ...)
3. Multi-domain speech sources (dictation, travel, call center, small vocab, broadcast news)
4. Domain-dependence of performance

Objective: Develop speech recognition system that mimics human performance (independent of environment, domain, works as well for spontaneous as for carefully enunciated speech)

Focus areas

Improve spontaneous speech models
1. Articulatory modeling
2. Prosodic features
3. Segmental graphical models
4. Joint parameter estimation
5. Speaker separation for multi-speaker speech
6. Data collection for "meeting speech"

Multi-environment
1. Non-linear feature space transformation
2. Hidden observations

Multi-domain
1. Multistyle training
2. Domain independent LM

30% Improvement • No initial decoding

ASR Workshop

A Language Model that Works Well on Many Domains
• Different (static) language models work best on different domains
• Use dynamic adaptation to make a generic LM act like a domain-specific LM
• Generic LM – linear interpolation of collection of domain-specific LMs (SWB, BN, digit/date grammar, etc.)
• Adapt by dynamically adjusting interpolation weights
• Want to be able to adapt quickly
• At the word/sentence level, not at the document level

Example utterance that switches domains mid-stream: "Um, yeah. Well, anyway, I’ll be arriving at four twenty two p.m. on flight fifty six. Say hi to mom. Oh, and don’t forget to buy IBM at one forty-four."
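The "generic LM as a linear interpolation of domain-specific LMs" idea can be sketched in a few lines. This is only an illustrative toy, not the system described in the talk: the unigram tables, their probabilities, the domain names and the `interpolated_prob` helper are all assumptions made up for the example.

```python
# Toy domain LMs as unigram tables (hypothetical numbers, illustration only).
DOMAIN_LMS = {
    "swb": {"um": 0.5, "flight": 0.1, "four": 0.2, "buy": 0.2},
    "travel": {"um": 0.1, "flight": 0.5, "four": 0.3, "buy": 0.1},
    "finance": {"um": 0.1, "flight": 0.1, "four": 0.3, "buy": 0.5},
}


def interpolated_prob(word, weights, lms=DOMAIN_LMS, floor=1e-6):
    """Return P(word) = sum_d weights[d] * P_d(word).

    `weights` maps domain name -> interpolation weight and must sum to 1.
    Unseen words get a small floor probability (an assumption of this sketch).
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return sum(w * lms[d].get(word, floor) for d, w in weights.items())


# A generic LM uses fixed weights; adaptation shifts them toward one domain.
generic = {"swb": 1 / 3, "travel": 1 / 3, "finance": 1 / 3}
travel_like = {"swb": 0.1, "travel": 0.8, "finance": 0.1}

p_generic = interpolated_prob("flight", generic)
p_adapted = interpolated_prob("flight", travel_like)
```

With weights shifted toward the travel LM, the same interpolated model assigns "flight" a higher probability than the uniform mixture does, which is the sense in which a generic LM can be made to act like a domain-specific one.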

Adapting Language Model Interpolation Weights
• Simplest approach: re-estimate the weights to maximize the likelihood of the adaptation data (like dynamic deleted interpolation)
• This can be quite slow, because a lot of evidence has to be accumulated
• Alternative: add a hidden variable to the model that tracks which domain LM is currently being used (Bayesian adaptation)
• The rate of adaptation can then be fast, can depend on context, and can be trained on domain-labelled data
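The two adaptation strategies above can be contrasted with a small sketch. It is a toy under stated assumptions, not the authors' implementation: the unigram tables, the function names, the EM update used for the maximum-likelihood weights, and the "sticky" transition model with a single `stay` probability for the hidden domain variable are all choices made for illustration.

```python
# Toy unigram domain LMs (hypothetical numbers, illustration only).
TOY_LMS = {
    "swb": {"um": 0.5, "flight": 0.1, "four": 0.2, "buy": 0.2},
    "travel": {"um": 0.1, "flight": 0.5, "four": 0.3, "buy": 0.1},
    "finance": {"um": 0.1, "flight": 0.1, "four": 0.3, "buy": 0.5},
}


def em_reestimate(words, weights, lms=TOY_LMS, iterations=20, floor=1e-6):
    """Maximum-likelihood interpolation weights for `words`, found with EM."""
    weights = dict(weights)
    for _ in range(iterations):
        counts = {d: 0.0 for d in weights}
        for w in words:
            # E-step: posterior probability that domain d generated this word.
            joint = {d: weights[d] * lms[d].get(w, floor) for d in weights}
            total = sum(joint.values())
            for d in weights:
                counts[d] += joint[d] / total
        # M-step: new weight = average posterior over the adaptation data.
        weights = {d: counts[d] / len(words) for d in weights}
    return weights


def track_domain(words, prior, lms=TOY_LMS, stay=0.9, floor=1e-6):
    """Word-by-word posterior over a hidden 'current domain' variable.

    Assumes the domain persists with probability `stay` and otherwise
    switches uniformly to another domain (needs at least two domains).
    Returns one belief dict per word.
    """
    domains = list(prior)
    switch = (1.0 - stay) / (len(domains) - 1)
    belief = dict(prior)
    history = []
    for w in words:
        # Predict: propagate the belief through the transition model.
        predicted = {
            d: sum(belief[e] * (stay if e == d else switch) for e in domains)
            for d in domains
        }
        # Update: reweight by how well each domain LM explains the word.
        joint = {d: predicted[d] * lms[d].get(w, floor) for d in domains}
        total = sum(joint.values())
        belief = {d: joint[d] / total for d in domains}
        history.append(belief)
    return history


uniform = {"swb": 1 / 3, "travel": 1 / 3, "finance": 1 / 3}
ml_weights = em_reestimate(["flight", "flight", "four", "flight"], uniform)
beliefs = track_domain(["um", "um", "flight", "flight"], uniform)
```

In this toy, the EM weights summarize the whole adaptation set at once, while the tracked belief changes after every word: it favors the conversational LM after the two fillers and moves to the travel LM within two words of the topic change. The `stay` parameter is the kind of quantity that could, in principle, be estimated from domain-labelled data.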

Other Factors Driving Progress

What Types of Data Do We Need?

[Figure labels: 2000 hours/year; 50000 hours/year (25); 5000 hours of speech; cost ~ $1M]

Some Concrete Suggestions

Target: 5000 hours of transcribed spontaneous speech

Sources of new data:
• Supergirl, by David Odell (Script - Revised Screenplay, Word Document)
• Superman: The Motion Picture, by Mario Puzo (Early Draft Script)
• Superman: The Motion Picture, by Mario Puzo (Shooting Script)
• Superman II, directed by Richard Donner (Script - Early Version)
• Superman II, directed by Richard Lester (Script - Later Version)
• Superman II (Shooting Script)
• Superman IV: The Quest for Peace, by Christopher Reeve (Script -)
• Superman: The Man Of Steel, by Alex Ford & J Ellison (Script - Unproduced)
• Superman Lives, by Kevin Smith (Script - Unproduced)
• Superman Lives, by Dan Gilroy (Script synopsis, Unproduced)

Test data: mixture of current and new sources
• Switchboard, Voicemail, BN, DC, OGI
• SPEECON, Meetings

Conclusions
• Speech recognition performance is not yet adequate
• Human performance figures suggest that we still have enormous room for improvement
• Presented several new algorithms to attack the problem aggressively
• Suggested a training and test methodology to drive research
• Communal participation is critical to push ahead
