
Articulatory Feature-Based Speech Recognition



Presentation Transcript


  1. Articulatory Feature-Based Speech Recognition. JHU WS06 Final team presentation, August 17, 2006. [Title-slide figure: multistream DBN with variables word, ind1, U1, S1, ind2, U2, S2, ind3, U3, S3, and synchrony variables sync1,2 and sync2,3.]

  2. Project Participants Team members: Karen Livescu (MIT) Arthur Kantor (UIUC) Ozgur Cetin (ICSI Berkeley) Partha Lal (Edinburgh) Mark Hasegawa-Johnson (UIUC) Lisa Yung (JHU) Simon King (Edinburgh) Ari Bezman (Dartmouth) Nash Borges (DoD, JHU) Stephen Dawson-Haggerty (Harvard) Chris Bartels (UW) Bronwyn Woods (Swarthmore) Satellite members/advisors: Jeff Bilmes (UW), Nancy Chen (MIT), Xuemin Chi (MIT), Ghinwa Choueiter (MIT), Trevor Darrell (MIT), Edward Flemming (MIT), Eric Fosler-Lussier (OSU), Joe Frankel (Edinburgh/ICSI), Jim Glass (MIT), Katrin Kirchhoff (UW), Lisa Lavoie (Elizacorp, Emerson), Mathew Magimai (ICSI), Erik McDermott (NTT), Daryush Mehta (MIT), Florian Metze (Deutsche Telekom), Kate Saenko (MIT), Janet Slifka (MIT), Stefanie Shattuck-Hufnagel (MIT), Amar Subramanya (UW)

  3. Why are we here? • Why articulatory feature-based ASR? • Improved modeling of co-articulation • Potential application to audio-visual and multilingual ASR • Improved ASR performance with feature-based observation models in some conditions • Potential savings in training data • Compatibility with more recent theories of phonology (autosegmental phonology, articulatory phonology) • Why now? • A number of sites working on complementary aspects of this idea, e.g. • U. Edinburgh (King et al.) • UIUC (Hasegawa-Johnson et al.) • MIT (Livescu, Saenko, Glass, Darrell) • Recently developed tools (e.g. GMTK) for systematic exploration of the model space

  4. A brief history • Many have argued for replacing the single phone stream with multiple sub-phonetic feature streams (Rose et al. ‘95, Ostendorf ‘99, ‘00, Nock ‘00, ‘02, Niyogi et al. ‘99 (for AVSR)) • Many have worked on parts of the problem • AF classification/recognition (Kirchhoff, King, Frankel, Wester, Richmond, Hasegawa-Johnson, Borys, Metze, Fosler-Lussier, Greenberg, Chang, Saenko, ...) • Pronunciation modeling (Livescu & Glass, Bates) • Many have combined AF classifiers with phone-based recognizers (Kirchhoff, King, Metze, Soltau, ...) • Some have built HMMs by combining AF states into product states (Deng et al., Richardson and Bilmes) • Only very recently, work has begun on end-to-end recognition with multiple streams of AF states (Hasegawa-Johnson et al. ‘04, Livescu ‘05) • No prior work on AF-based models for AVSR

  5. A (partial) taxonomy of design issues. [Figure: decision tree over design choices; only partially recoverable from the transcript.] Axes of the taxonomy: factored state (multistream structure)? factored observation model? degree of state asynchrony (coupled state transitions, free within unit, soft asynchrony within word, soft asynchrony cross-word); observation model type (GM, SVM, NN); context dependence (CD). Example systems cited: [Deng ’97, Richardson ’00], [Livescu ’04], [Livescu ’05], [WS04], [Kirchhoff ’96, Wester et al. ‘04], [Metze ’02], [Kirchhoff ’02], [Juneja ’04], with FHMMs and CHMMs as reference points. (Not to mention choice of feature sets... same in hidden structure and observation model?)

  6. Definitions: Pronunciation and observation modeling. Language model P(w): w = “makes sense...”. Pronunciation model P(q|w): s = [ m m m ey1 ey1 ey2 k1 k1 k1 k2 k2 s ... ]. Observation model P(o|q): o = [acoustic observations; figure not recoverable].
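To make the roles of the three distributions explicit, here is the standard noisy-channel decomposition they plug into (a sketch of the usual assumption, not spelled out on the slide), where q ranges over sub-word state sequences:

    % Decoding objective: observations o are assumed to depend on the
    % word sequence w only through the state sequence q.
    \hat{w} = \arg\max_{w} P(w \mid o)
            = \arg\max_{w} P(o \mid w)\, P(w)
            = \arg\max_{w} \Big( \textstyle\sum_{q} P(o \mid q)\, P(q \mid w) \Big)\, P(w)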

  7. Project goals Building complete AF-based recognizers and understanding the design issues involved A world of areas to explore... • Comparisons of Observation models: Gaussian mixtures over acoustic features, hybrid models [Morgan & Bourlard 1995], tandem models [Ellis et al. 2001] Pronunciation models: Articulatory asynchrony and reduction models • Analysis of articulatory phenomena: Dependence on context, speaker, speaking rate, speaking style, ... • Application of AFSR to audio-visual speech recognition • All require some resources... Feature sets Manual and automatic AF alignments Tools

  8. That was the vision... What we focused on at WS06 • Comparisons of AF-based observation models in the context of phone-based recognizers • Comparisons of AF-based pronunciation models using Gaussian mixture-based observation models • AF-based audio-visual speech recognition • Resources Feature sets Manual AF alignments Tools: tying, visualization

  9. Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines (Karen, Simon) • Hybrid observation models (Simon) • Tandem observation models (Ozgur, Arthur) • Multistream AF-based pronunciation models (Karen, Chris, Nash, Lisa, Bronwyn) • AF-based audio-visual speech recognition (Mark, Partha) • Analysis (Nash, Lisa, Ari) BREAK • Structure learning (Steve) • Student proposals (Arthur, Chris, Partha, Bronwyn?) • Summary, conclusions, future work (Karen)

  10. Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines • Hybrid observation models • Tandem observation models • Multistream AF-based pronunciation models • AF-based audio-visual speech recognition • Analysis • BREAK • Structure learning • Student proposals • Summary, conclusions, future work

  11. Bayesian networks (BNs) • Directed acyclic graph (DAG) with one-to-one correspondence between nodes and variables X1, X2, ..., XN • Node Xi with parents pa(Xi) has a “local” probability function p(Xi | pa(Xi)) • Joint probability = product of local probabilities: p(x1, ..., xN) = ∏i p(xi | pa(xi)) • Example (graph A → B → C, with D depending on B and C): p(a,b,c,d) = p(a) p(b|a) p(c|b) p(d|b,c)
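A minimal Python sketch of the factorization on this slide, using made-up binary tables for the A, B, C, D example (the numbers are purely illustrative, not from the slides):

    # Toy Bayesian network: A -> B -> C, with D depending on B and C.
    # All probability tables below are invented for illustration.
    p_a = {0: 0.6, 1: 0.4}
    p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # p_b_given_a[a][b]
    p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # p_c_given_b[b][c]
    p_d_given_bc = {(0, 0): {0: 0.5, 1: 0.5}, (0, 1): {0: 0.3, 1: 0.7},
                    (1, 0): {0: 0.8, 1: 0.2}, (1, 1): {0: 0.1, 1: 0.9}}

    def joint(a, b, c, d):
        # Joint probability = product of each node's local probability given its parents.
        return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c] * p_d_given_bc[(b, c)][d]

    print(joint(0, 1, 1, 0))   # p(a=0, b=1, c=1, d=0) = 0.6 * 0.3 * 0.6 * 0.1 = 0.0108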

  12. Dynamic Bayesian networks (DBNs) [Figure: the same network over variables A, B, C, D repeated in frames i-1, i, i+1.] • BNs consisting of a structure that repeats an indefinite (i.e. dynamic) number of times • Useful for modeling time series (e.g. speech!)

  13. Notation: Representing an HMM as a DBN. [Figure: a 3-state left-to-right HMM and its DBN form, with state variables qi-1, qi, qi+1 and observation variables obsi-1, obsi, obsi+1; legend: variable, state, allowed dependency, allowed transition.] Observation model: P(obsi | qi). Transition table P(qi | qi-1):
      from q=1: .7 to q=1, .3 to q=2, 0 to q=3
      from q=2: 0 to q=1, .8 to q=2, .2 to q=3
      from q=3: 0 to q=1, 0 to q=2, 1 to q=3
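The transition table above is all that is needed to roll the HMM forward as a DBN; a small Python sketch (observation model omitted, so this only propagates state probabilities):

    import numpy as np

    # P(q_i | q_{i-1}) from the slide: 3-state left-to-right HMM.
    # Rows = previous state, columns = next state.
    A = np.array([[0.7, 0.3, 0.0],
                  [0.0, 0.8, 0.2],
                  [0.0, 0.0, 1.0]])
    pi = np.array([1.0, 0.0, 0.0])   # start in state 1

    def state_marginals(num_frames):
        """P(q_i = s) for each frame i, ignoring observations."""
        probs = [pi]
        for _ in range(num_frames - 1):
            probs.append(probs[-1] @ A)
        return np.vstack(probs)

    print(state_marginals(5))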

  14. Inference • Definition: computation of the probability of one subset of the variables given another subset • Inference is a subroutine of: • Viterbi decoding: q* = argmax_q p(q | obs) • Maximum-likelihood parameter estimation: θ* = argmax_θ p(obs | θ) • For WS06, all models implemented, trained, and tested using the Graphical Models Toolkit (GMTK) [Bilmes 2002]
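Since Viterbi decoding is named as one use of inference, here is a minimal HMM Viterbi sketch in Python; GMTK performs the equivalent computation generically over arbitrary DBNs, so this illustrates the idea only and is not the toolkit's API:

    import numpy as np

    def viterbi(log_A, log_pi, log_obs):
        """log_A: (S, S) log transition probs; log_pi: (S,) log initial probs;
        log_obs: (T, S) per-frame log likelihoods log p(obs_t | q_t = s)."""
        T, S = log_obs.shape
        delta = log_pi + log_obs[0]              # best log score ending in each state
        back = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_A      # scores[i, j]: end in i, then go i -> j
            back[t] = scores.argmax(axis=0)      # best predecessor for each state j
            delta = scores.max(axis=0) + log_obs[t]
        # Trace back q* = argmax_q p(q | obs).
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]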

  15. Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines • Hybrid observation models • Tandem observation models • Multistream AF-based pronunciation models • AF-based audio-visual speech recognition • Analysis • BREAK • Structure learning • Student proposals • Summary, conclusions, future work

  16. Feature set for pronunciation modeling • Based on articulatory phonology [Browman & Goldstein 1990] • Assuming complete synchrony among lip, tongue, glottis/velum features, and limited substitutions, can combine into 3 streams
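To make the 3-stream combination concrete, here is a purely hypothetical sketch of a phone-to-feature lookup; the lips / tongue / glottis-velum grouping comes from the slide, but the phone labels and feature values are invented for illustration:

    # Hypothetical mapping: each phone corresponds to one value per stream,
    # so a phone sequence can be expanded into three parallel feature streams.
    PHONE_TO_FEATURES = {
        "b":  {"lips": "closed",    "tongue": "neutral",    "glottis_velum": "voiced-oral"},
        "m":  {"lips": "closed",    "tongue": "neutral",    "glottis_velum": "voiced-nasal"},
        "iy": {"lips": "unrounded", "tongue": "high-front", "glottis_velum": "voiced-oral"},
    }

    def feature_streams(phones):
        """Expand a phone sequence into the three feature streams."""
        streams = {"lips": [], "tongue": [], "glottis_velum": []}
        for ph in phones:
            for stream, value in PHONE_TO_FEATURES[ph].items():
                streams[stream].append(value)
        return streams

    print(feature_streams(["b", "iy", "m"]))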

  17. Feature set for observation modeling

  18. Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines • Hybrid observation models • Tandem observation models • Multistream AF-based pronunciation models • AF-based audio-visual speech recognition • Analysis • BREAK • Structure learning • Student proposals • Summary, conclusions, future work

  19. Manual feature transcriptions • Purpose: Testing of AF classifiers, automatic alignments • Main transcription guideline: Should correspond to what we would like our AF classifiers to detect

  20. Manual feature transcriptions • Main transcription guideline: The output should correspond to what we would like our AF classifiers to detect • Details • 2 transcribers: phonetician (Lisa Lavoie), PhD student in speech group (Xuemin Chi) • 78 SVitchboard utterances • 9 utterances from Switchboard Transcription Project for comparison • Multipass transcription using WaveSurfer (KTH) • 1st pass: Phone-feature hybrid • 2nd pass: All-feature • 3rd pass: Discussion, error-correction • Some basic statistics • Overall speed ~1000 x real-time • High inter-transcriber agreement (93% avg. agreement, 85% avg. string accuracy) • First use to date of human-labeled articulatory data for classifier/recognizer testing
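The slide's two agreement figures suggest a frame-level measure and an edit-distance-based one; the exact scoring used at WS06 is not given, so the following Python sketch is only one plausible reading:

    def frame_agreement(labels_a, labels_b):
        """Fraction of frames on which the two transcribers agree."""
        assert len(labels_a) == len(labels_b)
        return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

    def string_accuracy(ref, hyp):
        """1 - (edit distance / reference length), ASR-style string accuracy."""
        m, n = len(ref), len(hyp)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(d[i - 1][j] + 1,                                  # deletion
                              d[i][j - 1] + 1,                                  # insertion
                              d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))     # substitution
        return 1.0 - d[m][n] / max(m, 1)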

  21. Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines • Hybrid observation models • Tandem observation models • Multistream AF-based pronunciation models • AF-based audio-visual speech recognition • Analysis • BREAK • Structure learning • Student proposals • Summary, conclusions, future work

  22. SIMON: SVitchboard, baselines, gmtkTie MLPs, hybrid models

  23. OZGUR & ARTHUR: Tandem models intro Our models & results

  24. KAREN, CHRIS, NASH, LISA, BRONWYN: Multistream AF-based pronunciation models

  25. Reminder: phone-based models. [Figure: per-frame DBN repeated for frame 0, frame i, last frame.] Per-frame variables and their values:
      word: {“one”, “two”, ...}
      wordTransition: {0, 1}
      subWordState: {0, 1, 2, ...}
      stateTransition: {0, 1}
      phoneState: {w1, w2, w3, s1, s2, s3, ...}
      observation
    (Note: missing pronunciation variants)
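The wordTransition / stateTransition bookkeeping in this table is deterministic given the state index; a toy Python sketch of one plausible reading (hypothetical logic, not the actual GMTK conditional probability tables):

    def advance(sub_word_state, num_states, state_transition):
        """Return (next subWordState, wordTransition) for one frame.
        stateTransition moves subWordState forward; leaving the last
        sub-word state fires wordTransition."""
        if not state_transition:
            return sub_word_state, 0          # stay in the same state
        if sub_word_state + 1 < num_states:
            return sub_word_state + 1, 0      # advance within the word
        return 0, 1                           # word boundary: reset and signal transition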

  26. Multistream pronunciation models. [Figure: DBN with word and wordTransition at the top and two parallel feature streams, L and T, each with its own wordTransition, subWordState, stateTransition, and phoneState variables, linked by an async variable.] (Differences from actual model: 3rd feature stream, pronunciation variants, word transition bookkeeping.)

  27. A first attempt: 1-state monofeat • Analogous to 1-state monophone with minimum duration of 3 frames • All three states of each phone map to the same feature values • INSERT phoneState2feat TABLE HERE • One state of asynchrony allowed between L and T, and between G and {L,T}
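A sketch of the asynchrony constraint stated above, with hypothetical index-variable names (one reading of "between G and {L,T}" is assumed; the slide does not pin it down):

    def async_allowed(index_L, index_T, index_G, max_async=1):
        """True if the per-stream sub-word indices respect the stated limits:
        at most one state of lag between L and T, and between G and {L, T}."""
        if abs(index_L - index_T) > max_async:
            return False
        # Compare G against the closer of L and T (an assumption).
        if min(abs(index_G - index_L), abs(index_G - index_T)) > max_async:
            return False
        return True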

  28. Problems with 1-state monofeat • Much higher WER than monophone; possible causes: • By collapsing three states into one, we’ve lost sequencing information, which suggests further splitting AF states • Asynchrony modeling is too simple • Asynchrony is likely dependent on context, e.g. part of speech, word/syllable position, speaking rate • Asynchrony often occurs across word boundaries (e.g. “greem beans”) • Asynchrony between streams may not be symmetric • Modeling substitutions as well as asynchrony may be crucial • Improper handling of silence • Synchronous model outperforms asynchronous one → asynchronous states may be poorly trained, suggesting • Training with different initializations • Tying states • We have addressed the above issues to varying extents; enter Chris, Nash, Lisa, Bronwyn...

  29. CHRIS, NASH, LISA, BRONWYN: Multistream AF-based pronunciation models

  30. Multistream pronunciation models: Summary • So far, our models perform worse than baseline monophone models • Much work to be done! • Better training of low-occupancy states • Improved tying strategies and tree clustering questions • Improved training schedules—e.g. incorporate Gaussian vanishing as well as splitting • Initialization from independent AF HMM alignments • Cross-word asynchrony, context-dependent asynchrony, substitution modeling have only just begun

  31. MARK & PARTHA: AVSR

  32. NASH, LISA, ARI: Analysis: Manual transcriber agreement & “canonicalness”, MLP performance analysis Recognizer error analysis FAFA

  33. BREAK

  34. STEVE: Structure learning

  35. CHRIS, PARTHA, ARTHUR, BRONWYN? Proposals

  36. KAREN: Summary & conclusions

  37. Summary • Main results & take-home messages: • Tandem AF models: Beat phone-based models, at least monophone; try them at home! • Hybrid AF models: TBA! • Multistream AF-based pron models: Close to phone-based; more work to be done • AVSR: TBA! • Embedded training: works? • Obtained improved articulatory alignments over phone-based ones • Other contributions: • gmtkTie • Manual transcriptions • Wavesurfer analysis tool • New SVB baselines (monophone & triphone)

  38. This is just the beginning... • Further experimentation with WS06 models • How do tandem and hybrid results vary with amount of data? • Hybrid results with Fisher- vs. SVitchboard-trained MLPs • Better initializations for multistream pronunciation models • Application of AVSR models to more complex tasks: connected digits, larger vocabularies, more challenging data (e.g. AVICAR) • Fulfilling ze dream • More work on combining multistream pronunciation models with new observation models (tandem, hybrid) • More work on substitution modeling, cross-word asynchrony, context-dependence • Embedded training with fancier pronunciation models • More work on alignments • Improving alignments: Best model is not necessarily the lowest-WER one • Analysis • Learning pronunciation models from aligned data

  39. Acknowledgments Jeff Bilmes (UW), Nancy Chen (MIT), Xuemin Chi (MIT), Ghinwa Choueiter (MIT), Trevor Darrell (MIT), Edward Flemming (MIT), Eric Fosler-Lussier (OSU), Joe Frankel (Edinburgh/ICSI), Jim Glass (MIT), Katrin Kirchhoff (UW), Lisa Lavoie (Elizacorp, Emerson), Mathew Magimai (ICSI), Erik McDermott (NTT), Daryush Mehta (MIT), Florian Metze (Deutsche Telekom), Kate Saenko (MIT), Janet Slifka (MIT), Stefanie Shattuck-Hufnagel (MIT), Amar Subramanya (UW) NSF DARPA DoD CLSP

  40. EXTRA SLIDES
