
Articulatory Feature-Based Speech Recognition




Presentation Transcript


  1. Articulatory Feature-Based Speech Recognition. JHU WS06 Final team presentation, August 17, 2006. [Title-slide figure: a multistream DBN with per-stream variables word, ind, U, sync, and S across three feature streams]

  2. Project Participants Team members: Karen Livescu (MIT) Arthur Kantor (UIUC) Özgür Çetin (ICSI) Partha Lal (Edinburgh) Mark Hasegawa-Johnson (UIUC) Lisa Yung (JHU) Simon King (Edinburgh) Ari Bezman (Dartmouth) Nash Borges (DoD, JHU) Stephen Dawson-Haggerty (Harvard) Chris Bartels (UW) Bronwyn Woods (Swarthmore) Satellite members/advisors: Jeff Bilmes (UW), Nancy Chen (MIT), Xuemin Chi (MIT), Ghinwa Choueiter (MIT), Trevor Darrell (MIT), Edward Flemming (MIT), Eric Fosler-Lussier (OSU), Joe Frankel (Edinburgh/ICSI), Jim Glass (MIT), Katrin Kirchhoff (UW), Lisa Lavoie (Elizacorp, Emerson), Mathew Magimai (ICSI), Erik McDermott (NTT), Daryush Mehta (MIT), Florian Metze (Deutsche Telekom), Kate Saenko (MIT), Janet Slifka (MIT), Stefanie Shattuck-Hufnagel (MIT), Amar Subramanya (UW)

  3. Why are we here? • Why articulatory feature-based ASR? • Improved modeling of co-articulation • Application to audio-visual and multilingual ASR • Potential savings in training data • Compatibility with more recent theories of phonology (autosegmental phonology, articulatory phonology) • Improved ASR performance with feature-based observation models in some conditions [e.g. Kirchhoff ‘02, Soltau et al. ‘02] • Improved lexical access in experiments with oracle feature transcriptions [Livescu & Glass ’04, Livescu ‘05] • Why now? • A number of sites working on complementary aspects of this idea: U. Edinburgh (King et al.), UIUC (Hasegawa-Johnson et al.), MIT (Livescu et al.) • Recently developed tools (e.g. GMTK) for systematic exploration of the model space

  4. A brief history • Many have argued for replacing the single phone stream with multiple sub-phonetic feature streams [Rose et al. ‘95, Ostendorf ‘99, ‘00, Nock ‘00, ‘02, Niyogi et al. ’99] • Many have worked on parts of the problem • AF classification/recognition [Kirchhoff, King, Frankel, Wester, Richmond, Hasegawa-Johnson, Borys, Metze, Fosler-Lussier, Greenberg, Chang, Saenko, ...] • Pronunciation modeling [Livescu & Glass, Bates] • Many have combined AF classifiers with phone-based recognizers [Kirchhoff, King, Metze, Soltau, ...] • Some have built HMMs by combining AF states into product states [Deng et al., Richardson and Bilmes] • Only very recently, work has begun on end-to-end recognition with multiple streams of AF states [Hasegawa-Johnson et al. ‘04, Livescu ’05] • No prior work on AF-based models for AVSR

  5. A (partial) taxonomy of design issues [Figure: decision tree over model design choices, summarized here] • Factored state (multistream structure)? • If not: choice of observation model (GM, SVM, or NN), factored observation model or not, context-dependent (CD) or not [e.g. Metze ’02, Kirchhoff ’02, Juneja ’04] • If so: how much state asynchrony is allowed, ranging from coupled state transitions [Deng ’97, Richardson ’00] through asynchrony free within a unit [Kirchhoff ’96, Wester et al. ‘04] and soft asynchrony within a word [Livescu ‘04, Livescu ’05, WS04] to soft asynchrony across word boundaries; FHMMs and CHMMs appear as special cases, and cells marked ??? in the figure are largely unexplored • (Not to mention choice of feature sets... same in pronunciation and observation models?)

  6. Definitions: Pronunciation and observation modeling • Language model P(w): w = “makes sense...” • Pronunciation model P(q|w): q = [ m m m ey1 ey1 ey2 k1 k1 k1 k2 k2 s ... ] • Observation model P(o|q): o = acoustic observations [image]

  7. Project goals Building complete AF-based recognizers and understanding the design issues involved. A world of areas to explore... • Comparisons of observation models: Gaussian mixtures over acoustic features, hybrid models [Morgan & Bourlard 1995], tandem models [Ellis et al. 2001] • Comparisons of pronunciation models: articulatory asynchrony and substitution models • Analysis of articulatory phenomena: dependence on context, speaker, speaking rate, speaking style, ... • Application of AFSR to audio-visual speech recognition • All require some resources: feature sets, manual and automatic AF alignments, tools

  8. That was the vision... At WS06, we focused on... • AF-based observation models in the context of phone-based recognizers • AF-based pronunciation models with Gaussian mixture-based observation models • AF-based audio-visual speech recognition • Resources: feature sets, manual AF alignments, tools (tying, visualization) We did not focus on... • Combination of AF-based pronunciation models with different observation models • Analysis of feature alignment data

  9. Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines (Karen, Simon) • Hybrid observation models (Simon) • Tandem observation models (Ozgur, Arthur) • Multistream AF-based pronunciation models (Karen, Chris, Nash, Lisa, Bronwyn) • AF-based audio-visual speech recognition (Mark, Partha) • Analysis (Nash, Lisa, Ari) BREAK • Structure learning (Steve) • Student proposals (Arthur, Chris, Partha, Bronwyn?) • Summary, conclusions, future work (Karen)

  10. Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines • Hybrid observation models • Tandem observation models • Multistream AF-based pronunciation models • AF-based audio-visual speech recognition • Analysis • BREAK • Structure learning • Student proposals • Summary, conclusions, future work

  11. Bayesian networks (BNs) • Directed acyclic graph (DAG) with one-to-one correspondence between nodes and variables X1, X2, ..., XN • Node Xi with parents pa(Xi) has a “local” probability function p(Xi | pa(Xi)) • Joint probability = product of local probabilities: p(x1, ..., xN) = ∏i p(xi | pa(xi)) • Example [figure: DAG over A, B, C, D with local probabilities p(a), p(b|a), p(c|b), p(d|b,c)]: p(a,b,c,d) = p(a) p(b|a) p(c|b) p(d|b,c)
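
To make the factorization concrete, here is a minimal runnable sketch (not from the slides; the binary variables and CPT values are made up for illustration) that evaluates p(a,b,c,d) = p(a) p(b|a) p(c|b) p(d|b,c) and checks that the joint sums to one:

```python
# Minimal sketch of the factorization on slide 11, using binary variables and
# made-up CPT values (purely illustrative; not from the presentation).
p_a1 = 0.6                                   # p(A=1)
p_b1_given_a = {0: 0.2, 1: 0.7}              # p(B=1 | A=a)
p_c1_given_b = {0: 0.3, 1: 0.9}              # p(C=1 | B=b)
p_d1_given_bc = {(0, 0): 0.1, (0, 1): 0.4,   # p(D=1 | B=b, C=c)
                 (1, 0): 0.5, (1, 1): 0.8}

def bern(p1, x):
    """p(X=x) for a binary variable with p(X=1) = p1."""
    return p1 if x == 1 else 1.0 - p1

def joint(a, b, c, d):
    """p(a,b,c,d) = p(a) p(b|a) p(c|b) p(d|b,c): the product of local probabilities."""
    return (bern(p_a1, a)
            * bern(p_b1_given_a[a], b)
            * bern(p_c1_given_b[b], c)
            * bern(p_d1_given_bc[(b, c)], d))

print(joint(1, 1, 0, 1))                     # 0.6 * 0.7 * 0.1 * 0.5 = 0.021
print(sum(joint(a, b, c, d)                  # sums to 1 over all 16 configurations
          for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1)))
```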

  12. Dynamic Bayesian networks (DBNs) [Figure: the BN over A, B, C, D repeated in frames i-1, i, i+1] • BNs consisting of a structure that repeats an indefinite (i.e. dynamic) number of times • Useful for modeling time series (e.g. speech!)

  13. Representations of HMMs as DBNs [Figure: a 3-state finite-state network (FSN) and the equivalent DBN; in the DBN, each frame has a state variable qi with transition distribution P(qi | qi-1) and an observation obsi with emission distribution P(obsi | qi)] • Transition probabilities P(qi | qi-1): from q=1: 0.7 to 1, 0.3 to 2, 0 to 3; from q=2: 0 to 1, 0.8 to 2, 0.2 to 3; from q=3: 0 to 1, 0 to 2, 1 to 3 • Figure legend: variable, state, dependency, allowed transition
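
To tie the numbers above to the DBN factorization, here is a minimal sketch (the start in the first state and the example state path are illustrative assumptions) that scores a state sequence under P(qi | qi-1) alone:

```python
import numpy as np

# Transition matrix P(q_i | q_{i-1}) from slide 13: a 3-state left-to-right HMM
# with self-loops 0.7, 0.8, 1.0 (states indexed 0, 1, 2 here instead of 1, 2, 3).
A = np.array([[0.7, 0.3, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])   # assume the model starts in the first state

def path_probability(states):
    """Probability of a state sequence under the Markov chain alone; the emission
    terms P(obs_i | q_i) would multiply in frame by frame."""
    p = pi[states[0]]
    for prev, cur in zip(states[:-1], states[1:]):
        p *= A[prev, cur]
    return p

print(path_probability([0, 0, 1, 1, 1, 2]))  # 1.0 * 0.7 * 0.3 * 0.8 * 0.8 * 0.2 = 0.02688
```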

  14. A phone HMM-based recognizer [Figure: DBN unrolled from frame 0 to the last frame] Variables and their values: • word: {“one”, “two”, ...} • wordTransition: {0, 1} • subWordState: {0, 1, 2, ...} • stateTransition: {0, 1} • phoneState: {w1, w2, w3, s1, s2, s3, ...} • observation: acoustic feature vector
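
A rough sketch of the deterministic bookkeeping among these variables, under a toy two-word lexicon and a hand-picked pattern of stateTransition values (the actual model expresses this structure as a GMTK graph, not code):

```python
# Rough sketch of the per-frame bookkeeping in the phone HMM-based DBN of
# slide 14. Variable names follow the slide; the toy lexicon, the fixed
# stateTransition pattern, and the Python framing are illustrative assumptions.
LEXICON = {"one": ["w1", "w2", "w3"], "two": ["t1", "t2", "t3"]}

def run(words, state_transitions):
    """Walk through frames; stateTransition = 1 means the current subWordState
    ends after this frame, and wordTransition fires on the last sub-word state."""
    word_idx, subWordState = 0, 0
    for frame, stateTransition in enumerate(state_transitions):
        word = words[word_idx]
        phoneState = LEXICON[word][subWordState]
        wordTransition = int(stateTransition and subWordState == len(LEXICON[word]) - 1)
        print(frame, word, subWordState, phoneState, stateTransition, wordTransition)
        if wordTransition:
            word_idx, subWordState = word_idx + 1, 0
        elif stateTransition:
            subWordState += 1

run(["one", "two"], [0, 1, 0, 1, 1, 1, 1, 1])   # "one" over 5 frames, "two" over 3
```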

  15. Inference • Definition: computation of the probability of one subset of the variables given another subset • Inference is a subroutine of: • Viterbi decoding: q* = argmax_q p(q | obs) • Maximum-likelihood parameter estimation: θ* = argmax_θ p(obs | θ) • For WS06, all models implemented, trained, and tested using the Graphical Models Toolkit (GMTK) [Bilmes 2002]
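
For reference, a minimal Viterbi sketch for the 3-state HMM of slide 13; the observation likelihoods are arbitrary stand-ins for P(obs_i | q_i), and in the workshop systems this inference is carried out by GMTK on the full DBN rather than by hand-written code:

```python
import numpy as np

# Minimal Viterbi sketch (q* = argmax_q p(q | obs)) for the 3-state HMM of slide 13.
A = np.array([[0.7, 0.3, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])
B = np.array([[0.8, 0.1, 0.1],    # B[t, q] = P(obs_t | q): made-up likelihoods
              [0.6, 0.3, 0.1],
              [0.1, 0.7, 0.2],
              [0.1, 0.2, 0.7]])

def viterbi(pi, A, B):
    T, Q = B.shape
    delta = np.zeros((T, Q))              # best log-score ending in state q at frame t
    back = np.zeros((T, Q), dtype=int)    # best predecessor of state q at frame t
    delta[0] = np.log(pi + 1e-300) + np.log(B[0] + 1e-300)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A + 1e-300)   # (prev state, state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[t] + 1e-300)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

print(viterbi(pi, A, B))   # best state path for these scores, e.g. [0, 0, 1, 1]
```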

  16. Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines • Hybrid observation models • Tandem observation models • Multistream AF-based pronunciation models • AF-based audio-visual speech recognition • Analysis • BREAK • Structure learning • Student proposals • Summary, conclusions, future work

  17. Articulatory feature sets • We use separate feature sets for pronunciation and observation modeling • Why? • For observation modeling, want features that are acoustically distinguishable • For pronunciation modeling, want features that can be modeled as independent streams

  18. Feature set for pronunciation modeling [Figure: the tract variables LIP-LOC, LIP-OP, TT-LOC, TT-OP, TB-LOC, TB-OP, VELUM, GLOTTIS] • Based on articulatory phonology [Browman & Goldstein ‘90], adapted for pronunciation modeling [Livescu ’05] • Under some simplifying assumptions, can combine into 3 streams

  19. Feature set for observation modeling

  20. Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines • Hybrid observation models • Tandem observation models • Multistream AF-based pronunciation models • AF-based audio-visual speech recognition • Analysis • BREAK • Structure learning • Student proposals • Summary, conclusions, future work

  21. Manual feature transcriptions • Purpose: Testing of AF classifiers, automatic alignments, NOT training • Main transcription guideline: Should correspond to what we would like our AF classifiers to detect

  22. Manual feature transcriptions • Main transcription guideline: The output should correspond to what we would like our AF classifiers to detect • Details • 2 transcribers: phonetician (Lisa Lavoie), PhD student in speech group (Xuemin Chi) • 78 SVitchboard utterances • 9 utterances from Switchboard Transcription Project for comparison • Multipass transcription using WaveSurfer (KTH) • 1st pass: Phone-feature hybrid • 2nd pass: All-feature • 3rd pass: Discussion, error-correction • Some basic statistics • Overall speed ~1000 x real-time • High inter-transcriber agreement (93% avg. agreement, 85% avg. string accuracy) • First use to date of human-labeled articulatory data for classifier/recognizer testing

  23. Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines • Hybrid observation models • Tandem observation models • Multistream AF-based pronunciation models • AF-based audio-visual speech recognition • Analysis • BREAK • Structure learning • Student proposals • Summary, conclusions, future work

  24. SIMON: SVitchboard, baselines, gmtkTie; MLPs, hybrid models

  25. OZGUR & ARTHUR: Tandem models intro; our models & results

  26. KAREN, CHRIS, NASH, LISA, BRONWYN: Multistream AF-based pronunciation models

  27. Multi-stream AF-based pronunciation models [Figure] • Phone-based: a single phonetic state q generates the observation vector o • AF-based: a set of states qi (state of AF i) generates the observation vector o
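
One way to write the contrast is the following sketch of the two factorizations (N feature streams with per-stream first-order transitions; the actual WS06 models add asynchrony/synchronization variables and pronunciation variants on top of this):

```latex
% Phone-based: a single hidden state stream
p(o_{1:T}, q_{1:T}) \;=\; \prod_{t=1}^{T} p(q_t \mid q_{t-1})\, p(o_t \mid q_t)

% AF-based sketch: N feature streams q^{(1)}, \ldots, q^{(N)} jointly generate o_t
p\big(o_{1:T}, q^{(1)}_{1:T}, \ldots, q^{(N)}_{1:T}\big)
  \;=\; \prod_{t=1}^{T} \Bigg[\prod_{i=1}^{N} p\big(q^{(i)}_t \mid q^{(i)}_{t-1}\big)\Bigg]
        \, p\big(o_t \mid q^{(1)}_t, \ldots, q^{(N)}_t\big)
```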

  28. Motivation: Pronunciation variation [From data of Greenberg et al. ‘96] Baseform vs. surface (actual) pronunciations, with counts: • probably (baseform p r aa b ax b l iy): (2) p r aa b iy, (1) p r ay, (1) p r aw l uh, (1) p r ah b iy, (1) p r aa l iy, (1) p r aa b uw, (1) p ow ih, (1) p aa iy, (1) p aa b uh b l iy, (1) p aa ah iy • don’t (baseform d ow n t): (37) d ow n, (16) d ow, (6) ow n, (4) d ow n t, (3) d ow t, (3) d ah n, (3) ow, (3) n ax, (2) d ax n, (2) ax, (1) n uw, ... • sense (baseform s eh n s): (1) s eh n t s, (1) s ih t s • everybody (baseform eh v r iy b ah d iy): (1) eh v r ax b ax d iy, (1) eh v er b ah d iy, (1) eh ux b ax iy, (1) eh r uw ay, (1) eh b ah iy

  29. Pronunciation variation and ASR performance • Automatic speech recognition (ASR) is strongly affected by pronunciation variation • Words produced non-canonically are more likely to be mis-recognized [Fosler-Lussier ‘99] • Conversational speech is recognized at twice the error rate of read speech [Weintraub et al. ‘96]

  30. Phone-based pronunciation modeling • Address the pronunciation variation issue by substituting, inserting, or deleting segments, e.g. a [t]-insertion rule maps the dictionary pronunciation / s eh n s / (“sense”) to the surface form [ s eh n t s ] • Problems: low coverage of conversational pronunciations and sparse data; increased inter-word confusability; partial changes are not well described [Saraclar et al. ‘03]
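
As a toy illustration of this style of rule (the rule context, function name, and string representation are made up for the example, not the workshop's actual dictionary machinery):

```python
# Toy sketch of a phone-level [t]-insertion rule like the one illustrated on
# slide 30: insert [t] between [n] and [s], so /s eh n s/ -> [s eh n t s].
def t_insertion(pron):
    phones = pron.split()
    out = []
    for i, p in enumerate(phones):
        out.append(p)
        if p == "n" and i + 1 < len(phones) and phones[i + 1] == "s":
            out.append("t")               # epenthetic [t] between the nasal and [s]
    return " ".join(out)

print(t_insertion("s eh n s"))            # -> "s eh n t s"
```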

  31. Revisiting examples: feature-value view of “sense” • Dictionary (phones s eh n s): GLO: open, critical, open; VEL: closed, open, closed; TB: mid / uvular, mid / palatal, mid / uvular; TT: critical / alveolar, mid / alveolar, closed / alveolar, critical / alveolar • Surface variant #1 (phones s eh n t s): GLO: open, critical, open; VEL: closed, open, closed; TB: mid / uvular, mid / palatal, mid / uvular; TT: critical / alveolar, mid / alveolar, closed / alveolar, critical / alveolar • Surface variant #2 (phones s ih t s): GLO: open, critical, open; VEL: closed, open, closed; TB: mid / uvular, mid-nar / palatal, mid / uvular; TT: critical / alveolar, mid-nar / alveolar, closed / alveolar, critical / alveolar

  32. A more complex example everybody [ eh r uw ay ] (INSERT DIFFERENT EXAMPLE USING ARI’S TOOL)

  33. Can we take advantage of these intuitions? • In lexical access experiments with oracle feature alignments, yes: • Lexical access accuracy improves from ?? to ?? using articulatory model with asynchrony and context-independent substitutions [Livescu & Glass ’04] • Scaling up to a complete recognizer—issues: • Computational complexity • Noisy observations

  34. Reminder: phone-based model [Figure: DBN unrolled from frame 0 to the last frame] Variables and their values: • word: {“one”, “two”, ...} • wordTransition: {0, 1} • subWordState: {0, 1, 2, ...} • stateTransition: {0, 1} • phoneState: {w1, w2, w3, s1, s2, s3, ...} • observation (Note: missing pronunciation variants)

  35. Multistream pronunciation models [Figure: DBN in which each feature stream (L, T) has its own subWordState, stateTransition, phoneState, and wordTransition variables, with a shared word variable and an async variable coupling the streams] (differences from actual model: 3rd feature stream, pronunciation variants, word transition bookkeeping)

  36. A first attempt: 1-state monofeat • Analogous to 1-state monophone with minimum duration of 3 frames • All three states of each phone map to the same feature values • INSERT PART OF phoneState2feat TABLE HERE • One state of asynchrony allowed between L and T, and between G and {L,T}
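
A simplified sketch of this asynchrony constraint (the stream position indices and the use of the mean L/T position for the G check are assumptions made for illustration; in the actual model the constraint is enforced through the DBN's async variables):

```python
# Simplified sketch of the 1-state-monofeat asynchrony constraint on slide 36:
# at most one state of asynchrony between L and T, and between G and {L, T}.
def asynchrony_ok(idx_L, idx_T, idx_G, max_async=1):
    if abs(idx_L - idx_T) > max_async:
        return False
    if abs(idx_G - (idx_L + idx_T) / 2.0) > max_async:
        return False
    return True

print(asynchrony_ok(3, 4, 3))   # True: L and T differ by one state
print(asynchrony_ok(3, 5, 3))   # False: L and T are two states apart
```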

  37. A first attempt: 1-state monofeat • (INSERT EXAMPLE USING ARI’s TOOL)

  38. Results: 1-state monofeat • Much higher WER than monophone—possible remedies • Improved modeling with the same structure—concerted effort at WS06 • Alternative structures—begun to explore • Cross-word asynchrony • Context-dependent asynchrony • Substitutions

  39. CHRIS, NASH, LISA, BRONWYN: Multistream AF-based pronunciation models

  40. Improving the Model • Design Challenges • Multiple states per feature • Initialization • Tying • Silence synchronization

  41. Design Challenges • Optimal parameters vary widely for different models • Number of components • Language model scale • Language model penalty

  42. Design Challenges • Experimentation time grows with model complexity • Adding features to monofeat graph: [figure]

  43. 3-State Monofeat • Monofeat usually maps all states of a phone to the same feature value

  44. 3-State Monofeat • 3-State Monofeat assigns a unique feature value to each phone state

  45. 3-State Monofeat • Why? • Forces a sequence of states • Models context

  46. 3-State Monofeat

  47. Initialization • Problem: • Low occupancy leads to many poorly trained Gaussians • Potential Solution • Better initialization • Train through 8 components per mixture with asynchrony parameters clamped at: • p(synchronous)=0.5, p(asynchronous)=0.5 • p(synchronous)=0.6, p(asynchronous)=0.4

  48. Initialization Original Initialization 0.5/0.5 Initialization

  49. Initialization

  50. Tying • Problem: • Low occupancy leads to many poorly trained Gaussians • Potential Solution • Parameter tying • Uses gmtkTie
