A few thoughts about ASAT • Some slides from NSF workshop presentation on knowledge integration • Thoughts about “islands of certainty” • Neural networks: the good, the bad, and the ugly • Short intro to the OSU team du jour
Outline (or, rather, my list of questions) • What is Knowledge Integration (KI)? • How has KI influenced ASR to date? • Where should KI be headed? • What types of cues should we be looking for? • How should cues be combined?
What is Knowledge Integration? • It means different things to different people • Combining multiple hypotheses • Bringing linguistic information to bear in ASR • Working definition: • Combining multiple sources of evidence to produce a final (or intermediate) hypothesis • Traditional ASR process uses KI • Combines acoustic, lexical, and syntactic information • But this is only the tip of the iceberg
KI examples in ASR [Diagram: Feature Calculation; Acoustic Modeling, P(X|Q), emitting phone hypotheses such as k @; Pronunciation Modeling, P(Q|W): cat: k@t, dog: dog, mail: mAl, the: D&, DE, …; Language Modeling, P(W): cat dog: 0.00002, cat the: 0.0000005, the cat: 0.029, the dog: 0.031, the mail: 0.054, …; SEARCH; output: “The cat chased the dog”] • Acoustic model gives state hypotheses from features • Search integrates knowledge from acoustic, pronunciation, and language models • Statistical models have “simple” dependencies
KI: Statistical Dependencies [Diagram: same ASR pipeline as in “KI examples in ASR”] • “Side information” from the speech waveform • Speaking rate • Prosodic information • Syllable boundaries
KI: Statistical Dependencies [Diagram: same ASR pipeline as in “KI examples in ASR”] • Information from sources outside “traditional” system • Class n-grams, CFG/Collins-style parsers • Sentence-level stress • Vocal-tract length normalization
KI: Statistical Dependencies [Diagram: same ASR pipeline as in “KI examples in ASR”] • Information from “internal” knowledge sources • Pronunciations w/ multi-words, LM probabilities • State-level pronunciation modeling • Buried Markov Models
KI: Statistical Dependencies [Diagram: same ASR pipeline as in “KI examples in ASR”] • Information from errors made by system • Discriminative acoustic, pronunciation, and language modeling
KI: Model Combination [Diagram: two complete pipelines (Feature Calculation, Acoustic Modeling, Pronunciation Modeling, Language Modeling) joined at X; output: “The cat chased the dog”] • Integrate multiple “final” hypotheses • ROVER • Word sausages (Mangu et al.)
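A minimal sketch of the voting step behind hypothesis combination, for illustration only. ROVER first aligns the recognizer outputs into a word transition network and can also weigh confidence scores; this sketch skips the alignment and assumes the hypotheses are already aligned slot by slot, with an empty string marking a deletion. The function name `vote_aligned` and the example hypotheses are hypothetical.

```python
from collections import Counter


def vote_aligned(hypotheses):
    """Majority vote per slot over pre-aligned word sequences.

    Assumption: every hypothesis has the same number of slots and an
    empty string marks "no word in this slot".
    """
    if len({len(h) for h in hypotheses}) != 1:
        raise ValueError("hypotheses must be aligned to equal length")
    result = []
    for slot in zip(*hypotheses):
        # Pick the most frequent entry in this slot across systems.
        word, _count = Counter(slot).most_common(1)[0]
        if word:  # drop slots where the winning entry is a deletion
            result.append(word)
    return result


# Three hypothetical recognizer outputs, already aligned.
hyps = [
    ["the", "cat", "chased", "the", "dog"],
    ["the", "cat", "chase", "the", "dog"],
    ["a", "cat", "chased", "the", "dog"],
]
combined = vote_aligned(hyps)
print(" ".join(combined))
```

Each system makes a different single error here, so the per-slot vote recovers a sentence that no two systems disagree on in the same place.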
KI: Model Combination [Diagram: one pipeline (Feature Calculation, Acoustic Modeling, Pronunciation Modeling, Language Modeling) and a second Feature Calculation + Acoustic Modeling stream, joined at X; output: “The cat chased the dog”] • Combine multiple “non-final” hypotheses • Multi-stream modeling • Synchronous phonological feature modeling • Boosting • Interpolated language models
Summary: Current uses of KI • Probability conditioning: P(A|B) -> P(A|B,X,Y,Z) • More refined (accurate?) models • Can complicate overall equation • Model merging: P(A|B) -> f(P1(A|B),w1) + f(P2(A|B),w2) • Different views of information are (usually) good • But sometimes combination methods are not as principled as one would like
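One concrete instance of model merging is linear interpolation, where f(P, w) is simply w · P and the weights sum to one. The sketch below assumes that choice (log-linear combination would be another option) and uses made-up distributions; `interpolate` is a hypothetical helper, not something from the slides.

```python
def interpolate(p1, p2, w1):
    """Linearly merge two distributions over the same events.

    Assumption: f(P, w) = w * P, with weights w1 and (1 - w1).
    """
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie in [0, 1]")
    events = set(p1) | set(p2)
    # Events missing from one model get probability 0 from that model.
    return {e: w1 * p1.get(e, 0.0) + (1.0 - w1) * p2.get(e, 0.0)
            for e in events}


# Two hypothetical models of P(A|B) that disagree.
model_1 = {"cat": 0.6, "dog": 0.4}
model_2 = {"cat": 0.2, "dog": 0.8}
merged = interpolate(model_1, model_2, w1=0.75)
print(merged)
```

Because both inputs are normalized and the weights sum to one, the merged result is again a probability distribution, which is one reason this form is popular for interpolated language models.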
Where should we go from here? • As a field, we have investigated many sources of knowledge • We learn more about language this way • Cf. “More data is better data” school • To make an impact we need • A common framework • Easy ways to combine knowledge • “Interesting” sources of knowledge
KI in Event-Driven ASR • Phonological features as events (from Chin’s proposal) [Diagram: detector event streams for the word “can’t”; labels, in order: mid-low, closure, burst, closure, burst, nasal, consonant, vowel, consonant, back, alveolar]
KI in Event-Driven ASR • Integrating multiple detectors • Easy if detectors are of the same type • Use both conditioning and model combination [Diagram: detector event streams for the word “can’t”; labels, in order: mid-low, closure, burst, closure, burst, nasal, consonant, vowel, consonant, back, alveolar; two detectors for the same feature give P(back|detector1) and P(back|detector2)]
KI in Event-Driven ASR • Integrating multiple cross-type detectors • Simplest to use Naïve Bayes assumption: P(X|e1,e2,e3) = P(e1|X) P(e2|X) P(e3|X) P(X) / Z [Diagram: detector event streams for the word “can’t”; labels, in order: mid-low, closure, burst, closure, burst, nasal, consonant, vowel, consonant, back, alveolar; combined into P(k|features)]
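The Naïve Bayes combination above can be sketched in a few lines. All numbers are invented for illustration, and `naive_bayes_posterior` is a hypothetical helper; the computation runs in the log domain so that products of many small detector likelihoods do not underflow.

```python
import math


def naive_bayes_posterior(prior, likelihoods, events):
    """Return P(X | events) under the Naive Bayes assumption.

    prior:       {X: P(X)}
    likelihoods: {X: {event: P(event | X)}}
    events:      list of observed detector events
    """
    log_scores = {
        x: math.log(p_x) + sum(math.log(likelihoods[x][e]) for e in events)
        for x, p_x in prior.items()
    }
    # Subtract the max before exponentiating for numerical stability.
    top = max(log_scores.values())
    unnorm = {x: math.exp(s - top) for x, s in log_scores.items()}
    z = sum(unnorm.values())  # the normalizer Z
    return {x: v / z for x, v in unnorm.items()}


# Hypothetical detector likelihoods for two candidate phones.
prior = {"k": 0.5, "t": 0.5}
likelihoods = {
    "k": {"back": 0.8, "closure": 0.9, "burst": 0.9},
    "t": {"back": 0.1, "closure": 0.9, "burst": 0.9},
}
posterior = naive_bayes_posterior(prior, likelihoods,
                                  ["back", "closure", "burst"])
print(posterior)
```

In this toy setup only the "back" detector separates the two phones, so the closure and burst evidence cancels out in the normalization.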
KI in Event-Driven ASR • Breakdown in Naïve Bayes • Detectors aren’t always independent [Diagram: detector event streams for the word “can’t”; labels, in order: k, high, closure, burst, closure, burst, nasal, consonant, vowel, consonant, back, alveolar; annotations: “New non-independent detector”, “Feature spreading correlated with vowel raising”]
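A small numeric illustration of why dependence hurts, under invented numbers: if a second detector merely repeats what the first one saw, Naïve Bayes still multiplies its likelihood ratio in, and the posterior becomes overconfident. The two-class helper `posterior_k` is hypothetical.

```python
def posterior_k(likelihood_ratios, prior_k=0.5):
    """P(k | evidence) for a two-class problem under Naive Bayes.

    likelihood_ratios: per-detector values of P(e | k) / P(e | not k),
    multiplied together as if the detectors were independent.
    """
    odds = prior_k / (1.0 - prior_k)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


# One detector favoring k by 8:1.
single = posterior_k([8.0])
# The same evidence counted twice via a perfectly correlated detector.
double_counted = posterior_k([8.0, 8.0])
print(round(single, 4), round(double_counted, 4))
```

No new information arrived between the two calls, yet the second posterior is noticeably higher; that inflation is the breakdown the slide refers to.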
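KI in Event-Driven ASR • Wanted: Gestalt detector • View overall shape of detector streams [Diagram: detector event streams for the word “can’t”; labels, in order: k, high, closure, burst, closure, burst, nasal, consonant, vowel, consonant, back, alveolar; the whole pattern is the conditioning variable: P(can’t | overall pattern of the streams)]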
The Challenge of Plug-n-Play • Shouldn’t have to re-learn entire system every time a new detector is added • Can’t have one global P(can’t|all variables) • Changes should be localized • Implies need for hierarchical structure • Composition structure should enable combination of radically different forms of information • E.g., audio-visual speech recognition
The Challenge of Plug-n-Play • Perhaps need three types of structures • Event integrators • Is this a CVC syllable? • Problems like feature spreading become local • Hypothesis generators • I think the word “can’t” is here. • Combines evidence from top-level integrators • Hypothesis validators • Is this hypothesis consistent? • Language model, word boundary detection, … • Still probably have Naïve Bayes problems
What type of detectors should we be thinking about? • Phonological features • Phones • Syllables? Words? Function Words? • Syllable/word boundaries • Prosodic stress • … and a whole bunch of other things • We’ve already looked at a number of them • And Jim’s already made some of these points
Putting it all together • Huge multi-dimensional graph search • Should not be strictly “left-to-right” • “Islands of certainty” • People tend to emphasize the important words • …and we can usually detect them better • Work backwards to firm up uncertain segments
Summary • As a field, we have looked at many influences on our probabilistic models • Have gained expertise in • Probability conditioning • Model combination • Event-driven ASR may provide a challenging but interesting framework for incorporating different ideas
We can’t parse everything • At least not on the first pass • Need to find ways to cleverly reduce computation: center around things that we’re sure about • Can we use confidence values from “light” detectors and refine? (likely) • Can we use external sources of knowledge to help guide search? (likely)
Word/syllable onset detection • Several findings point to the existence of cues that can help with word segmentation • Psychology experiments have suggested that phonotactics plays a big role (e.g., Saffran et al.) • Shire (at ICSI) was able to train a pretty reliable syllable boundary detector from acoustics • Syllable onsets are pronounced more canonically than nuclei or codas: 84% vs. 65% on Switchboard, 90% vs. 62%/80% on TIMIT (Fosler-Lussier et al., 1999) • Can we build “island of certainty” models by looking at a combination of acoustic/phonetic factors?
Integrating multiple units • Naïve method: just try to combine everything in sight • Refined method: process left to right, but process a buffer (e.g., 0.5-2 s) • Look for islands • Back-fit other material in a way that makes sense given the islands • Can use external measures like speaking rate to validate likelihood of inferred structure
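One way to picture the buffered "islands first" idea, as a rough sketch only: treat each segment in the buffer as having a detector confidence, mark the segments above a threshold as islands, and then visit the uncertain segments nearest-to-an-island first. The threshold of 0.8, the confidence values, and the function names are all assumptions made for this example.

```python
def find_islands(confidences, threshold=0.8):
    """Indices of segments whose confidence meets the (assumed) threshold."""
    return [i for i, c in enumerate(confidences) if c >= threshold]


def backfit_order(confidences, threshold=0.8):
    """Order in which to revisit uncertain segments in one buffer.

    Segments closest to an island come first; ties are broken
    left to right. With no islands, fall back to left-to-right.
    """
    islands = find_islands(confidences, threshold)
    if not islands:
        return list(range(len(confidences)))
    island_set = set(islands)
    uncertain = [i for i in range(len(confidences)) if i not in island_set]
    return sorted(
        uncertain,
        key=lambda i: (min(abs(i - j) for j in islands), i),
    )


# Hypothetical per-segment confidences within one buffer.
buffer_conf = [0.3, 0.5, 0.9, 0.4, 0.2, 0.6, 0.85]
islands = find_islands(buffer_conf)
order = backfit_order(buffer_conf)
print(islands, order)
```

A real system would re-score each uncertain segment conditioned on its neighboring islands; this sketch only shows the visiting order.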
Neural nets • ANNs are good as non-linear discriminators • But they have a problem: when they’re wrong, they are often REALLY wrong • Ex: training on TI digits (30 phones, easy) • CV frame-level margin: P(correct) - P(next competitor) • 9% of frames with margin < -0.4, 8% with margin between -0.4 and 0 • 8% with margin between 0 and 0.4, 75% with margin > 0.4 • Could chalk this up to “pronunciation variation” • Current thinking: if training were more responsive to margin, we might move some of that 9% upward.
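The frame-level margin statistic is easy to reproduce from network posteriors. The sketch below uses four made-up frames and the same four margin ranges as the slide; how the exact bin edges (0 and ±0.4) are assigned is an assumption, and both function names are hypothetical.

```python
def frame_margin(posteriors, correct):
    """P(correct) - P(best competitor) for one frame."""
    p_correct = posteriors[correct]
    p_competitor = max(p for label, p in posteriors.items()
                       if label != correct)
    return p_correct - p_competitor


def margin_histogram(margins):
    """Fraction of frames falling in each of the four margin ranges."""
    counts = {"< -0.4": 0, "-0.4 to 0": 0, "0 to 0.4": 0, "> 0.4": 0}
    for m in margins:
        if m < -0.4:
            counts["< -0.4"] += 1
        elif m < 0.0:
            counts["-0.4 to 0"] += 1
        elif m <= 0.4:
            counts["0 to 0.4"] += 1
        else:
            counts["> 0.4"] += 1
    return {k: v / len(margins) for k, v in counts.items()}


# Hypothetical frames: (network posteriors, correct label).
frames = [
    ({"k": 0.90, "t": 0.05, "p": 0.05}, "k"),  # confidently right
    ({"k": 0.10, "t": 0.80, "p": 0.10}, "k"),  # confidently wrong
    ({"k": 0.50, "t": 0.30, "p": 0.20}, "k"),  # right, small margin
    ({"k": 0.30, "t": 0.40, "p": 0.30}, "k"),  # wrong, small margin
]
margins = [frame_margin(post, label) for post, label in frames]
hist = margin_histogram(margins)
print(hist)
```

Frames in the "< -0.4" bin are the "REALLY wrong" cases: the network puts most of its probability mass on a competitor.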
Current personnel • Me • Keith as consultant • Anton Rytting (Linguistics): part-time senior grad student, works on word segmentation in Greek; currently twisting his arm • Linguistics student TBA 1/05. • Incoming students (we’ll see who works) • 1 ECE student (signal processing) • 2 CSE students (MS in reinforcement learning, BA in genetic algorithms)