CS 416 Artificial Intelligence

CS 416 Artificial Intelligence • Lecture 19 • Reasoning over Time • Chapter 15

Hidden Markov Models (HMMs) • Represent the state of the world with a single discrete variable • If your state has multiple variables, form one variable whose value takes on all possible tuples of multiple variables • A two-variable system (heads/tails and red/green/blue) becomes • A single-variable system with six values (heads/red, tails/red, …)
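
The coin/color example above can be sketched in a few lines of Python. This is only an illustration: the names `coin`, `color`, `joint_states`, and `state_index` are made up here and do not come from the slides.

```python
from itertools import product

# Two separate state variables (from the slide's example).
coin = ["heads", "tails"]
color = ["red", "green", "blue"]

# Form one variable whose values are all tuples of the original variables.
joint_states = list(product(coin, color))

# Map each tuple to an integer index so it can address rows/columns of T and O.
state_index = {state: i for i, state in enumerate(joint_states)}

print(len(joint_states))  # 6
```

The combined variable has 2 x 3 = 6 values, which is why the state count grows multiplicatively as variables are added.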

HMMs • Let number of states be S • Transition model T is an S×S matrix filled by P(X_t | X_{t-1}) • Probability of transitioning from any state to another • Consider obtaining evidence e_t at each timestep • Construct an S×S matrix O consisting of P(e_t | X_t = i) along the diagonal and zero elsewhere
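
A minimal sketch of the two matrices, assuming a two-state world (rain / no rain) with an umbrella observation. The numbers are illustrative values, and `observation_matrix` is a hypothetical helper, not something defined in the lecture.

```python
# States: 0 = rain, 1 = no rain (illustrative two-state world, S = 2).

# T[i][j] = P(X_t = j | X_{t-1} = i); each row sums to 1.
T = [[0.7, 0.3],
     [0.3, 0.7]]

# Sensor model for one piece of evidence: P(e_t = umbrella | X_t = i).
p_umbrella = [0.9, 0.2]

def observation_matrix(likelihoods):
    """Return the S x S matrix O with P(e_t | X_t = i) on the diagonal, zero elsewhere."""
    n = len(likelihoods)
    return [[likelihoods[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

O_umbrella = observation_matrix(p_umbrella)
```

A different O is built for each distinct evidence value, since the diagonal depends on what was observed at that timestep.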

HMMs • Rewriting the FORWARD algorithm • Constructing the predicted sequence of states from 0 to t+1 given e_0, …, e_{t+1} • f_{1:t+1} = α FORWARD(f_{1:t}, e_{t+1})
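
In matrix form the update is usually written f_{1:t+1} = α O_{t+1} Tᵀ f_{1:t}. The sketch below assumes a two-state umbrella world with illustrative numbers; `forward_step` is a hypothetical name, and the diagonal of O is passed as a plain list rather than as a full matrix.

```python
def forward_step(f, T, likelihoods):
    """One filtering update: f_{1:t+1} = alpha * O_{t+1} * T^T * f_{1:t}.

    f           -- current belief over states, sums to 1
    T           -- T[i][j] = P(X_{t+1} = j | X_t = i)
    likelihoods -- diagonal of O_{t+1}: P(e_{t+1} | X_{t+1} = j)
    """
    n = len(f)
    # Prediction: (T^T f)[j] = sum_i T[i][j] * f[i]
    predicted = [sum(T[i][j] * f[i] for i in range(n)) for j in range(n)]
    # Multiply by the diagonal of O (weight by evidence likelihood).
    unnormalized = [likelihoods[j] * predicted[j] for j in range(n)]
    # alpha is the normalizing constant.
    alpha = 1.0 / sum(unnormalized)
    return [alpha * u for u in unnormalized]

# Illustrative values: state 0 = rain, state 1 = no rain.
T = [[0.7, 0.3], [0.3, 0.7]]
umbrella = [0.9, 0.2]

f0 = [0.5, 0.5]
f1 = forward_step(f0, T, umbrella)  # belief after one umbrella observation
f2 = forward_step(f1, T, umbrella)  # belief after two umbrella observations
```

With these numbers the belief in rain rises to about 0.818 after the first umbrella and about 0.883 after the second.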

HMMs • Optimizations • FORWARD and BACKWARD can be written in matrix form • Matrix forms permit reinspection for speedups • Consult book if interested in these for assignment

Speech recognition vs. Speech understanding • Recognition • Convert acoustic signal into words • P(words | signal) = α P(signal | words) P(words) • Understanding • Recognizing the context and semantics of the words • Slide callouts: “We have a model of this too”; “We have a model of this”
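
The Bayes-rule decomposition can be illustrated with a toy comparison of two candidate transcriptions. Every number below is invented for illustration; real acoustic and language models produce these scores.

```python
# Hypothetical scores for one acoustic signal (made-up numbers, not real data).
acoustic = {  # P(signal | words): how well the words explain the sound
    "recognize speech": 0.0010,
    "wreck a nice beach": 0.0012,
}
language = {  # P(words): how plausible the word sequence is a priori
    "recognize speech": 0.020,
    "wreck a nice beach": 0.001,
}

# P(words | signal) = alpha * P(signal | words) * P(words)
unnormalized = {w: acoustic[w] * language[w] for w in acoustic}
alpha = 1.0 / sum(unnormalized.values())
posterior = {w: alpha * p for w, p in unnormalized.items()}

best = max(posterior, key=posterior.get)
print(best)  # recognize speech
```

Here the acoustically slightly better candidate loses because the language model considers it far less likely.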

Applications • NaturallySpeaking (interesting story from Wired), ViaVoice… • A 90% hit rate is a 10% error rate • Want 98% or 99% success rate • Dictation • Cheaper to play doctor’s audio tapes into telephone so someone in India can type the text and email it back • User control of devices • “Call home”

Spectrum of choices

Waveform to phonemes • 40–50 phones (sounds) in all human languages • 48 phonemes (distinguishable units) in English (according to ARPAbet) • Ceiling = [s iy l ih ng] [s iy l ix ng] [s iy l en] • Nothing is precise here, so HMM with state variable X_t corresponding to the phone uttered at time t • P(E_t | X_t): given phoneme, what is its waveform • Must have models that adjust for pitch, speed, volume…

Analog to digital (A to D) • Diaphragm of microphone is displaced by movement of air • Analog to digital converter samples the signal at discrete time intervals (8 – 16 kHz, 8-bit for speech)

Data compression • 8 kHz at 8 bits is 0.5 MB for one minute of speech • Too much information for constructing P(X_{t+1} | X_t) tables • Reduce signal to overlapping frames (10 ms) • Frames have features that are evaluated based on signal
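
The storage arithmetic and the framing step can be checked with a short sketch. The 50% overlap (a hop of 5 ms) is an assumption made here; the slide only says the 10 ms frames overlap. `frames` is a hypothetical helper.

```python
sample_rate_hz = 8000
bits_per_sample = 8
seconds = 60

# 8000 samples/s * 60 s * 1 byte/sample
bytes_per_minute = sample_rate_hz * seconds * bits_per_sample // 8
print(bytes_per_minute)  # 480000 bytes, roughly 0.5 MB

def frames(samples, frame_len, hop):
    """Split samples into overlapping frames of frame_len samples, advancing by hop."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

frame_len = sample_rate_hz * 10 // 1000  # 10 ms -> 80 samples at 8 kHz
hop = frame_len // 2                     # assumed 50% overlap (not stated on the slide)

one_second = list(range(sample_rate_hz))  # stand-in for real audio samples
second_of_frames = frames(one_second, frame_len, hop)
print(len(second_of_frames))  # 199
```

Each frame is then summarized by a small number of features, which is where the actual reduction in data happens.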

More data compression • Features are still too big • Consider n features with 256 values each • 256^n possible frames • A table of P(features | phones) would be too large • Cluster! • Reduce number of options from 256^n to something manageable
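
A rough sketch of why clustering helps: each frame is replaced by the label of its nearest cluster center, so the table only needs one column per cluster. The value n = 3 and the three-entry `codebook` are invented for illustration, and nearest-center assignment is one simple way to cluster, not necessarily the method used in any particular recognizer.

```python
n_features = 3  # hypothetical number of features per frame
print(256 ** n_features)  # 16777216 possible frames for n = 3

def nearest_cluster(frame, codebook):
    """Return the index of the codebook entry closest to frame (squared Euclidean distance)."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(frame, center))
    return min(range(len(codebook)), key=lambda k: squared_distance(codebook[k]))

# Hypothetical cluster centers; in practice they would be learned from data.
codebook = [(10, 10, 10), (200, 50, 30), (120, 240, 90)]

label = nearest_cluster((190, 60, 25), codebook)
print(label)  # 1
```

With the frames reduced to cluster labels, P(features | phones) needs only as many entries per phone as there are clusters.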

Phone subdivision • Phones last 5-10 frames • Possible to subdivide a phone into three parts • Onset, mid, end • [t] = [silent beginning, small explosion, hissing end] • The sound of a phone changes based on surrounding phones • Brain coordinates ending of one phone and beginning of upcoming ones (coarticulation) • Sweet vs. stop • State space is increased, but improved accuracy

Words • You say [t ow m ey t ow] • P (t ow m ey t ow | “tomato”) • I say [t ow m aa t ow]

Words - coarticulation • The first syllable changes based on dialect • There are four ways to say “tomato” and we would store P( [pronunciation] | “tomato”) for each • Remember diagram would have three stages per phone
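
One way to get the four pronunciation probabilities is to treat each vowel choice as an independent branch and multiply along the path. The branch probabilities below are illustrative assumptions, not measured values.

```python
from itertools import product

# Assumed branch probabilities for the two variable vowels in "tomato".
first_vowel = {"ow": 0.2, "ah": 0.8}
second_vowel = {"ey": 0.5, "aa": 0.5}

# P([pronunciation] | "tomato") for each of the four paths through the word model.
pronunciations = {}
for (v1, p1), (v2, p2) in product(first_vowel.items(), second_vowel.items()):
    phones = ("t", v1, "m", v2, "t", "ow")
    pronunciations[phones] = p1 * p2

for phones, p in pronunciations.items():
    print(" ".join(phones), p)
```

The four probabilities sum to 1, so together they form a proper distribution over pronunciations of the word.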

Words - segmentation • “Hearing” words in sentences seems easy to us • Waveforms are fuzzy • There are no clear gaps to designate word boundaries • One must work the probabilities to decide if current word is continuing with another syllable or if it seems likely that another word is starting

Sentences • Bigram Model • P(w_i | w_{1:i-1}) has a lot of values to determine • P(w_i | w_{i-1}) is much more manageable • We make a first-order Markov assumption about word sequences • Easy to train this through text files • Much more complicated models are possible that take syntax and semantics into account
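
Training a bigram model from text amounts to counting adjacent word pairs. The sketch below uses a tiny made-up corpus and no smoothing, so unseen pairs simply have no entry; `train_bigram` is a hypothetical helper.

```python
from collections import Counter

def train_bigram(tokens):
    """Estimate P(w_i | w_{i-1}) by relative frequency of adjacent word pairs."""
    # Count each word as a "previous word" (the last token has no successor).
    previous_counts = Counter(tokens[:-1])
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {(prev, word): count / previous_counts[prev]
            for (prev, word), count in pair_counts.items()}

# Tiny invented corpus; a real model would be trained on large text files.
corpus = "call home please call home now call the office".split()
model = train_bigram(corpus)

print(model[("call", "home")])  # 2 of the 3 occurrences of "call" are followed by "home"
```

A practical system would add smoothing so that word pairs missing from the training text do not get probability zero.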

Bringing it together • Each transformation is pretty inaccurate • Lots of choices • User “error” – stutters, bad grammar • Subsequent steps can rule out choices from previous steps • Disambiguation

Bringing it together • Continuous speech • Words composed of p 3-state phones • W words in vocabulary • 3pW states in HMM • 10 words, 4 phones each, 3 states per phone = 120 states • Compute likelihood of all words in sequence • Viterbi algorithm from 15.2
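
The state count and the Viterbi step can be sketched as follows. The Viterbi code runs on a tiny two-state umbrella model with illustrative numbers rather than a 120-state speech HMM; `viterbi` and its argument layout are assumptions made for this example.

```python
# State count from the slide: 3 states per phone * p phones per word * W words.
words, phones_per_word, states_per_phone = 10, 4, 3
n_states = states_per_phone * phones_per_word * words
print(n_states)  # 120

def viterbi(prior, T, likelihoods_per_step):
    """Return the most likely state sequence.

    prior                -- P(X_1 = j)
    T                    -- T[i][j] = P(X_{t+1} = j | X_t = i)
    likelihoods_per_step -- one list per timestep: P(e_t | X_t = j)
    """
    n = len(prior)
    # delta[j] = probability of the best path that ends in state j.
    delta = [prior[j] * likelihoods_per_step[0][j] for j in range(n)]
    backpointers = []
    for lik in likelihoods_per_step[1:]:
        new_delta, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * T[i][j])
            pointers.append(best_i)
            new_delta.append(lik[j] * delta[best_i] * T[best_i][j])
        delta = new_delta
        backpointers.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(range(n), key=lambda j: delta[j])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Illustrative model: state 0 = rain, 1 = no rain.
T = [[0.7, 0.3], [0.3, 0.7]]
umbrella, no_umbrella = [0.9, 0.2], [0.1, 0.8]
evidence = [umbrella, umbrella, no_umbrella, umbrella, umbrella]

path = viterbi([0.5, 0.5], T, evidence)
print(path)  # [0, 0, 1, 0, 0]
```

For long utterances the products underflow, so practical implementations usually work with log probabilities and add instead of multiply.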

A final note • Where do all the transition tables come from? • Word probabilities from text analysis • Pronunciation models have been manually constructed for many hours of speaking • Some have multiple-state phones identified • Because this annotation is so expensive to perform, can we annotate or label the waveforms automatically?

Expectation Maximization (EM) • Learn HMM transition and sensor models sans labeled data • Initialize models with hand-labeled data • Use these models to predict states at multiple times t • Use these predictions as if they were “fact” and update HMM transition table and sensor models • Repeat
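
The loop on this slide can be sketched as "hard" EM (sometimes called Viterbi training): label each sequence with its most likely state path, treat those labels as fact, recount, and repeat. This is a simplification chosen to keep the example short; full EM for HMMs (Baum-Welch) uses expected counts from forward-backward instead of a single best path. All names, the pseudo-count, and the toy data are assumptions.

```python
def viterbi(prior, T, E, obs):
    """Most likely state path; E[j][o] = P(symbol o | state j)."""
    n = len(prior)
    delta = [prior[j] * E[j][obs[0]] for j in range(n)]
    back = []
    for o in obs[1:]:
        ptrs = [max(range(n), key=lambda i, j=j: delta[i] * T[i][j]) for j in range(n)]
        delta = [E[j][o] * delta[ptrs[j]] * T[ptrs[j]][j] for j in range(n)]
        back.append(ptrs)
    state = max(range(n), key=lambda j: delta[j])
    path = [state]
    for ptrs in reversed(back):
        state = ptrs[state]
        path.append(state)
    return path[::-1]

def normalize_rows(rows):
    """Scale each row so that it sums to 1."""
    return [[c / sum(row) for c in row] for row in rows]

def reestimate(labeled, n_states, n_symbols, pseudo=1.0):
    """Recount transitions and emissions from predicted paths, treated as if they were fact."""
    # The pseudo-count (an assumption) keeps unseen events from getting probability 0.
    trans = [[pseudo] * n_states for _ in range(n_states)]
    emit = [[pseudo] * n_symbols for _ in range(n_states)]
    for path, obs in labeled:
        for a, b in zip(path, path[1:]):
            trans[a][b] += 1
        for s, o in zip(path, obs):
            emit[s][o] += 1
    return normalize_rows(trans), normalize_rows(emit)

def hard_em(prior, T, E, sequences, iterations=5):
    """Alternate between predicting state paths and updating the models."""
    for _ in range(iterations):
        labeled = [(viterbi(prior, T, E, obs), obs) for obs in sequences]
        T, E = reestimate(labeled, len(T), len(E[0]))
    return T, E

# Toy setup: 2 states, 2 observation symbols, rough initial models (all invented).
prior = [0.5, 0.5]
T0 = [[0.6, 0.4], [0.4, 0.6]]
E0 = [[0.7, 0.3], [0.3, 0.7]]
sequences = [[0, 0, 1, 1, 0], [1, 1, 1, 0, 0], [0, 0, 0, 1, 1]]

T_learned, E_learned = hard_em(prior, T0, E0, sequences)
```

The initial models `T0` and `E0` play the role of the models initialized from hand-labeled data; the unlabeled `sequences` then refine them without further annotation.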
