Feature-based Pronunciation ModelingUsing Dynamic Bayesian Networks Karen Livescu JHU Workshop Planning Meeting April 16, 2004 Joint work with Jim Glass
Preview • The problem of pronunciation variation for automatic speech recognition (ASR) • Traditional methods: phone-based pronunciation modeling • Proposed approach: pronunciation modeling via multiple sequences of linguistic features • A natural framework: dynamic Bayesian networks (DBNs) • A feature-based pronunciation model using DBNs • Proof-of-concept experiments • Ongoing/future work • Integration with SVM feature classifiers
The problem of pronunciation variation • Conversation from the Switchboard speech database; example phrases: "neither one of them", "decided", "never really", "probably" • Noted as an obstacle for ASR (e.g., [McAllester et al. 1998])
The problem of pronunciation variation (2) • More acute in casual/conversational than in read speech. Observed pronunciations of "probably", with counts:

p r aa b iy — 2
p r ay — 1
p r aw l uh — 1
p r ah b iy — 1
p r aa lg iy — 1
p r aa b uw — 1
p ow ih — 1
p aa iy — 1
p aa b uh b l iy — 1
p aa ah iy — 1
Traditional solution: phone-based pronunciation modeling • Transformation rules are typically of the form p1 → p2 / p3 __ p4 (where p_i may be null) • E.g. the [p] insertion rule Ø → p / m __ {non-labial}, which maps the dictionary form /w ao r m th/ of "warmth" to the surface form [w ao r m p th] • Rules are derived from • Linguistic knowledge (e.g. [Hazen et al. 2002]) • Data (e.g. [Riley & Ljolje 1996]) • Powerful, but: • Sparse data issues • Increased inter-word confusability • Some pronunciation changes not well described • Limited success in recognition experiments
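A rewrite rule of this form can be sketched in a few lines of code. This is a minimal illustration of the [p] insertion rule only; the phone labels and the labial set are assumptions for the example, not definitions from the talk.

```python
# Sketch of applying the phone rewrite rule  Ø -> p / m __ {non-labial},
# which turns "warmth" /w ao r m th/ into the surface form [w ao r m p th].
# Phone labels and the labial inventory here are illustrative.

LABIALS = {"p", "b", "m", "w", "f", "v"}

def apply_p_insertion(phones):
    """Insert [p] after [m] when the following phone is non-labial."""
    out = []
    for i, ph in enumerate(phones):
        out.append(ph)
        nxt = phones[i + 1] if i + 1 < len(phones) else None
        if ph == "m" and nxt is not None and nxt not in LABIALS:
            out.append("p")
    return out

print(apply_p_insertion(["w", "ao", "r", "m", "th"]))
# ['w', 'ao', 'r', 'm', 'p', 'th']
```

Real rule sets compile many such context-dependent rules into finite-state transducers and compose them with the dictionary.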
A feature-based approach • Speech can alternatively be described using sub-phonetic features: LIP-OP, TT-LOC, TT-OPEN, TB-LOC, TB-OPEN, VELUM, VOICING • (This feature set is based on articulatory phonology [Browman & Goldstein 1990])
Feature-based pronunciation modeling • warmth → [ w ao r m p th ]: dictionary feature values (lips & velum desynchronize):

voicing:      V    V    V    V    !V
velum:        Clo  Clo  Clo  Op   Clo
lip opening:  Nar  Mid  Mid  Clo  Mid
...           ...  ...  ...  ...  ...

• instruments [ih_n s ch em ih_n n s] • wants [w aa_n t s] — Phone deletion?? • several [s eh r v ax l] — Exchange of two phones??? • everybody [eh r uw ay]
Related work • Much work on classifying features: • [King et al. 1998] • [Kirchhoff 2002] • [Chang, Greenberg, & Wester 2001] • [Juneja & Espy-Wilson 2003] • [Omar & Hasegawa-Johnson 2002] • [Niyogi & Burges 2002] • Less work on the "non-phonetic" relationship between words and features: • [Deng et al. 1997], [Richardson & Bilmes 2000]: "fully-connected" state space via hidden Markov model • [Kirchhoff 1996]: features independent, except for synchronization at syllable boundaries • [Carson-Berndsen 1998]: bottom-up, constraint-based approach • Goal: Develop a general feature-based pronunciation model • Capable of using known independence assumptions • Without overly strong assumptions
Approach: Main Ideas ([HLT/NAACL-2004]) • Begin with usual assumption: Each word has one or more underlying pronunciations, given by a dictionary • E.g. "warmth", underlying feature values by index:

index:        0    1    2    3    4
voicing:      V    V    V    V    !V
velum:        Off  Off  Off  On   Off
lip opening:  Nar  Mid  Mid  Clo  Mid
...           ...  ...  ...  ...  ...

• Surface (actual) feature values can stray from underlying values via: • Substitution – modeled by confusion matrices P(s|u) • Asynchrony • Assign an index (counter) to each feature, and allow index values to differ • Apply constraints on the difference between the mean indices of feature subsets • Natural to implement using graphical models, in particular dynamic Bayesian networks (DBNs)
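The two deviation mechanisms above can be sketched concretely. The feature names, value inventories, probabilities, and the asynchrony bound below are invented for illustration; the real model's tables are hand-set or learned.

```python
# Toy sketch of the two mechanisms by which surface feature values
# stray from underlying (dictionary) values. All numbers are invented.

# Substitution: P(surface | underlying) as a per-feature confusion matrix.
P_SUB = {
    "lip-open": {
        "Clo": {"Clo": 0.7, "Cri": 0.2, "Nar": 0.1},
        "Cri": {"Cri": 0.7, "Nar": 0.2, "Mid": 0.1},
    },
}

def p_substitution(feature, underlying, surface):
    """Look up the substitution probability; unseen pairs get 0."""
    return P_SUB[feature][underlying].get(surface, 0.0)

# Asynchrony: each feature keeps its own index into the dictionary entry;
# a configuration is allowed only if the mean indices of two feature
# subsets differ by at most `max_async`.
def async_ok(indices_a, indices_b, max_async=1.0):
    mean_a = sum(indices_a) / len(indices_a)
    mean_b = sum(indices_b) / len(indices_b)
    return abs(mean_a - mean_b) <= max_async

print(p_substitution("lip-open", "Clo", "Cri"))  # 0.2
print(async_ok([2, 2], [3]))                     # True
print(async_ok([1, 1], [3]))                     # False
```

In the DBN these two pieces become conditional probability tables on the surface variables and a constraint variable over the per-feature index counters.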
Aside: Dynamic Bayesian networks • Bayesian network (BN): Directed-graph representation of a distribution over a set of variables • Graph node ↔ variable + its distribution given its parents • Graph edge ↔ "dependency" • (Example BN: speaking rate, # questions, lunchtime) • Dynamic Bayesian network (DBN): BN with a repeating structure • Example: HMM, with state S and observation O repeated across frames i-1, i, ... • Uniform algorithms for (among other things) • Finding the most likely values of a subset of the variables, given the rest (analogous to the Viterbi algorithm for HMMs) • Learning model parameters via EM
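For the HMM special case mentioned above, the most-likely-values computation is the familiar Viterbi recursion. A minimal sketch, with a toy two-state model and invented parameters (not the talk's model):

```python
# Minimal Viterbi decoder for an HMM, the simplest DBN instance.
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations `obs`."""
    # V[t][s] = (best probability of any path ending in s at time t, that path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            prob, path = max(
                (V[-1][r][0] * trans_p[r][s] * emit_p[s][o], V[-1][r][1])
                for r in states
            )
            row[s] = (prob, path + [s])
        V.append(row)
    return max(V[-1].values())[1]

states = ("Rain", "Sun")
start = {"Rain": 0.6, "Sun": 0.4}
trans = {"Rain": {"Rain": 0.7, "Sun": 0.3}, "Sun": {"Rain": 0.4, "Sun": 0.6}}
emit = {"Rain": {"walk": 0.1, "umbrella": 0.9},
        "Sun": {"walk": 0.8, "umbrella": 0.2}}
print(viterbi(["umbrella", "umbrella", "walk"], states, start, trans, emit))
# ['Rain', 'Rain', 'Sun']
```

General DBN inference (as in GMTK) runs an analogous dynamic program over the whole factored state space rather than a single state chain.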
Approach: A DBN-based Model • Example DBN using 3 features; the dictionary variable encodes baseform pronunciations • Example substitution confusion matrix P(surface | underlying) for one feature:

        CLO  CRI  NAR  N-M  MID  …
CLO     .7   .2   .1   0    0    …
CRI     0    .7   .2   .1   0    …
NAR     0    0    .7   .2   .1   …
…       …    …    …    …    …    …

• (Simplified to show important properties! Implemented model has additional variables.)
Approach: A DBN-based Model (2) • "Unrolled" DBN (the frame structure repeated over time) • Parameter learning via Expectation Maximization (EM) • Training data: • Articulatory databases • Detailed phonetic transcriptions
A proof-of-concept experiment • Task: classify an isolated word from the Switchboard corpus, given a detailed phonetic transcription (from ICSI Berkeley, [Greenberg et al. 1996]) • Convert the transcription into feature vectors S_i, one per 10 ms • For each word w in a 3k+ word vocabulary, compute P(w | S_1, …, S_N) • Output w* = arg max_w P(w | S_1, …, S_N) • Used GMTK [Bilmes & Zweig 2002] for inference and EM parameter training • Note: the ICSI transcription is somewhere between phones and features; not ideal, but as good as we have
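The classification step itself is just an arg max over word scores. A minimal sketch of the loop, where the per-word likelihood function stands in for the DBN inference that GMTK actually performs; all names and numbers here are hypothetical.

```python
import math

# Sketch of isolated-word classification: score each vocabulary word by
# log P(S|w) + log P(w) and return the arg max. `log_likelihood` is a
# stand-in for DBN inference over the feature streams.
def classify(frames, vocab, log_likelihood, log_prior):
    return max(vocab, key=lambda w: log_likelihood(frames, w) + log_prior(w))

# Toy example with invented per-word scores and a uniform prior.
scores = {"warmth": -10.0, "warm": -12.5, "worth": -11.0}
best = classify(
    frames=None,
    vocab=list(scores),
    log_likelihood=lambda f, w: scores[w],
    log_prior=lambda w: math.log(1 / 3),
)
print(best)  # warmth
```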
Results (development set)

Model                                   | Error rate (%) | Failure rate (%)
Baseforms only (1.7 prons/word)         | 63.6           | 61.2
+ phonological rules (4 prons/word)     | 50.3           | 47.9
Synchronous feature-based               | 35.2           | 24.8
Asynchronous feature-based              | 29.7           | 16.4
Asynch. + segmental constraint          | 32.7           | 19.4
Asynch. + segmental constraint + EM     | 27.8           | 19.4

• What didn't work? • Some deletions ([ax], [t]) • Vowel retroflexion • Alveolar + [y] → palatal • (Cross-word effects) • (Speech/transcription errors…) • When did asynchrony matter? • Vowel nasalization & rounding • Nasal + stop → nasal • Some schwa deletions • instruments [ih_n s ch em ih_n n s] • everybody [eh r uw ay]
Sample Viterbi path: everybody [ eh r uw ay ]
Ongoing/future work • Trainable synchrony constraints ([ICSLP 2004?]) • Context-dependent distributions for underlying (Ui) and surface (Si) feature values • Extension to more complex tasks (multi-word sequences, larger vocabularies) • Implementation in a complete recognizer (cf. [Eurospeech 2003]) • Articulatory databases for parameter learning/testing • Can we use such a model to learn something about speech?
Integration with feature classifier outputs • Use (hard) classifier decisions as observations for S_i • Convert classifier scores to posterior probabilities and use as "soft evidence" for S_i (connected to the rest of the model) • Landmark-based classifier outputs to DBN S_i's: • Convert landmark-based features to one feature vector per frame • (Possibly) convert from the SVM feature set to the DBN feature set
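The score-to-posterior conversion above can be sketched as a softmax over per-class scores (e.g. SVM margins). The temperature parameter and the scores themselves are assumptions for illustration; in practice the mapping would be calibrated.

```python
import math

# Turn raw per-class classifier scores into posterior-like probabilities
# that can be attached to the surface variables S_i as soft evidence.
def softmax(scores, temperature=1.0):
    exps = [math.exp(s / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

margins = [2.0, 0.5, -1.0]   # one hypothetical score per feature value
posteriors = softmax(margins)
print([round(p, 3) for p in posteriors])
print(abs(sum(posteriors) - 1.0) < 1e-9)  # True: a proper distribution
```

With hard decisions, each S_i is simply observed; with soft evidence, the DBN instead multiplies in this distribution, so uncertain classifier frames contribute proportionally less.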
Acknowledgment • Jeff Bilmes, U. Washington
Background: Continuous Speech Recognition • Given a waveform with acoustic features A, find the most likely word string W*. By Bayes' rule, summing over possible pronunciations U (typically phone strings):

W* = arg max_W P(W | A) = arg max_W Σ_U P(A | U) P(U | W) P(W)

where P(A | U) is the acoustic model, P(U | W) the pronunciation model, and P(W) the language model.

• Assuming U* is much more likely than all other U:

W* ≈ arg max_W max_U P(A | U) P(U | W) P(W)
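The decomposition above can be instantiated on toy numbers. Everything below (words, pronunciation labels, probabilities) is invented purely to show the max-over-U approximation in action.

```python
# Toy numeric instance of W* = arg max_W max_U P(A|U) P(U|W) P(W).
P_A_given_U = {"u1": 0.3, "u2": 0.05, "u3": 0.2}          # acoustic model
P_U_given_W = {("u1", "warmth"): 0.8, ("u2", "warmth"): 0.2,
               ("u3", "worth"): 1.0}                       # pronunciation model
P_W = {"warmth": 0.5, "worth": 0.5}                        # language model

def score(w):
    """Best-pronunciation score for word w."""
    return max(P_A_given_U[u] * p * P_W[w]
               for (u, ww), p in P_U_given_W.items() if ww == w)

best = max(P_W, key=score)
print(best)  # warmth  (0.3 * 0.8 * 0.5 = 0.12 beats 0.2 * 1.0 * 0.5 = 0.10)
```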
Example: "warmth" → "warmpth" • Phone-based view: Brain: Give me a []! Lips, tongue, velum, glottis: Right on it, sir! • (Articulatory) feature-based view: Brain: Give me a []! Lips: Huh? Tongue: Umm…yeah, OK. Velum, glottis: Right on it, sir!
Graphical models for hidden feature modeling • Most ASR approaches use hidden Markov models (HMMs) and/or finite-state transducers (FSTs) • Efficient and powerful, but limited • Only one state variable per time frame • Graphical models (GMs) allow for • Arbitrary numbers of variables and dependencies • Standard algorithms over large classes of models • Straightforward mapping between feature-based models and GMs • Potentially large reduction in number of parameters • GMs for ASR: • Zweig (e.g. PhD thesis, 1998), Bilmes (e.g. PhD thesis, 1999), Stephenson (e.g. Eurospeech 2001) • Feature-based ASR with GMs suggested by Zweig, but not previously investigated
Background • Brief intro to ASR • Words are written in terms of sub-word units; acoustic models compute the probability of acoustic (spectral) features given sub-word units, or vice versa • Pronunciation model: a mapping between words and strings of sub-word units
Possible solution? • Allow every pronunciation in some large database • Unreliable probability estimation due to sparse data • Unseen words • Increased confusability
Phone-based pronunciation modeling (2) • Generalize across words • But: • Data still sparse • Still increased confusability • Some pronunciation changes not well described by phonetic rules • Limited gains in speech recognition experiments
Approach • Begin with usual assumption that each word has one or more “target” pronunciations, given by the dictionary • Model the evolution of multiple feature streams, allowing for: • Feature changes on a frame-by-frame basis • Feature desynchronization • Control of asynchrony—more “synchronous” feature configurations are preferable • Dynamic Bayesian networks (DBNs): Efficient parameterization and computation when state can be factored