Hidden Markov Models


Credits and Acknowledgments • Materials used in this course were taken from the textbook “Pattern Classification” by Duda et al., John Wiley & Sons, 2001 with the permission of the authors and the publisher; and also from • Other material on the web: • Dr. A. Aydin Atalan, Middle East Technical University, Turkey • Dr. Djamel Bouchaffra, Oakland University • Dr. Ricardo Gutierrez-Osuna, Texas A&M University • Dr. Adam Krzyzak, Concordia University • Dr. Andrew W. Moore, Carnegie Mellon University, http://www.cs.cmu.edu/~awm/tutorials • Dr. Joseph Picone, Mississippi State University • Dr. Robi Polikar, Rowan University • Dr. Stefan A. Robila, University of New Orleans • Dr. Sargur N. Srihari, State University of New York at Buffalo • David G. Stork, Stanford University • Dr. Godfried Toussaint, McGill University • Dr. Chris Wyatt, Virginia Tech • Dr. Alan L. Yuille, University of California, Los Angeles • Dr. Song-Chun Zhu, University of California, Los Angeles

Reading • DHS 3.10 • Larry Rabiner’s tutorial on HMMs

Outline • Discrete Markov Processes • Hidden Markov Models • Illustrative examples • The Viterbi algorithm

Time-based Models • The models we’ve looked at so far: • Simple parametric distributions • Discrete distribution estimates • These are typically based on what is called the “independence assumption”: each data point is independent of the others, and there is no time-sequencing or ordering. • What if the data has correlations based on its order, as in a time series?

Applications of time-based models • Sequential pattern recognition is a relevant problem in a number of disciplines • Human-computer interaction: Speech recognition • Bioengineering: ECG and EEG analysis • Robotics: mobile robot navigation • Bioinformatics: DNA base sequence alignment

Andrei Andreyevich Markov • Born: 14 June 1856 in Ryazan, Russia • Died: 20 July 1922 in Petrograd (now St Petersburg), Russia • Markov is particularly remembered for his study of Markov chains, sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors. This work launched the theory of stochastic processes.

Markov random processes • A random sequence has the Markov property if the distribution of its next state is determined solely by its current state. Any random process having this property is called a Markov random process. • For observable state sequences (the state is known from the data), this leads to a Markov chain model. • For non-observable states, this leads to a Hidden Markov Model (HMM).

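Chain Rule & Markov Property
• Chain rule (repeated application of Bayes rule): $P(q_1, q_2, \dots, q_T) = P(q_1) \prod_{t=2}^{T} P(q_t \mid q_1, \dots, q_{t-1})$
• Markov property: $P(q_t \mid q_1, \dots, q_{t-1}) = P(q_t \mid q_{t-1})$, so that $P(q_1, q_2, \dots, q_T) = P(q_1) \prod_{t=2}^{T} P(q_t \mid q_{t-1})$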

A Markov System
• Has N states, called $s_1, s_2, \dots, s_N$
• There are discrete timesteps, t=0, t=1, …
• On the t’th timestep the system is in exactly one of the available states. Call it $q_t$. Note: $q_t \in \{s_1, s_2, \dots, s_N\}$
• Between each timestep, the next state is chosen randomly.
• The current state determines the probability distribution for the next state.
• Example with N = 3: at t=0 the current state is $q_0 = s_3$; at t=1 it is $q_1 = s_2$.
• Transition probabilities $P(q_{t+1} = s_j \mid q_t = s_i)$, often notated with arcs between states:
    from \ to    s1     s2     s3
    s1           0      0      1
    s2           1/2    1/2    0
    s3           1/3    2/3    0

Markov Property
• $q_{t+1}$ is conditionally independent of $\{q_{t-1}, q_{t-2}, \dots, q_1, q_0\}$ given $q_t$.
• In other words: $P(q_{t+1} = s_j \mid q_t = s_i) = P(q_{t+1} = s_j \mid q_t = s_i, \text{any earlier history})$
• Notation: $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$
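The three-state system above can be simulated in a few lines. The following is a minimal sketch, assuming the transition probabilities listed on the slide; the function name `simulate_chain`, the seed and the number of steps are illustrative choices, not part of the original material.

```python
import random

# Transition matrix from the slide: row i is the distribution of the next
# state given that the current state is s_(i+1).
A = [
    [0.0, 0.0, 1.0],   # from s1: always go to s3
    [1/2, 1/2, 0.0],   # from s2: s1 or s2, equally likely
    [1/3, 2/3, 0.0],   # from s3: s1 with prob. 1/3, s2 with prob. 2/3
]


def simulate_chain(A, start, steps, rng):
    """Return the 0-based state indices q_0, q_1, ..., q_steps."""
    path = [start]
    for _ in range(steps):
        current = path[-1]
        # Markov property: the next state depends only on the current one.
        nxt = rng.choices(range(len(A)), weights=A[current])[0]
        path.append(nxt)
    return path


rng = random.Random(0)  # fixed seed (arbitrary) so runs are repeatable
path = simulate_chain(A, start=2, steps=10, rng=rng)  # q_0 = s3, as on the slide
print(["s%d" % (q + 1) for q in path])
```

Because the sampler only ever looks at `path[-1]`, the earlier history cannot influence the next state, which is exactly the conditional independence stated above.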

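Example: A Simple Markov Model For Weather Prediction
• Any given day, the weather can be described as being in one of three states:
• State 1: precipitation (rain, snow, hail, etc.)
• State 2: cloudy
• State 3: sunny
• Transitions between states are described by the transition matrix $A = \{a_{ij}\}$ (values as in Rabiner’s tutorial, the source of this example): $A = \begin{pmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{pmatrix}$
• This model can then be described by a directed graph with one node per state and one arc per nonzero $a_{ij}$.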

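Basic Calculations-1
• Example: What is the probability that the weather for eight consecutive days is “sun-sun-sun-rain-rain-sun-cloudy-sun”?
• Solution: O = sun sun sun rain rain sun cloudy sun = 3 3 3 1 1 3 2 3
• $P(O \mid \text{model}) = P(3)\,P(3|3)\,P(3|3)\,P(1|3)\,P(1|1)\,P(3|1)\,P(2|3)\,P(3|2) = \pi_3\, a_{33}\, a_{33}\, a_{31}\, a_{11}\, a_{13}\, a_{32}\, a_{23}$, where $\pi_3$ is the probability that the first day is sunny.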
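The calculation above can be checked numerically. This is a sketch under two stated assumptions: the transition matrix is the one from Rabiner's tutorial (it agrees with the self-transition value 0.8 for sunny used later in these slides), and the first day is known to be sunny, so its probability is 1. The helper name `sequence_probability` is illustrative.

```python
# ASSUMPTION: transition matrix taken from Rabiner's tutorial.
# States are 0-based here: 0 = precipitation, 1 = cloudy, 2 = sunny.
A = [
    [0.4, 0.3, 0.3],
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8],
]
RAIN, CLOUDY, SUNNY = 0, 1, 2


def sequence_probability(states, A, initial_prob=1.0):
    """P(first state) times the product of the one-step transition probabilities."""
    p = initial_prob
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p


O = [SUNNY, SUNNY, SUNNY, RAIN, RAIN, SUNNY, CLOUDY, SUNNY]
# ASSUMPTION: the first day is known to be sunny, so initial_prob = 1.
p = sequence_probability(O, A)
print(p)
```

Under these assumptions the product is 0.8 · 0.8 · 0.1 · 0.4 · 0.3 · 0.1 · 0.2, about 1.536 × 10⁻⁴. With an unknown first day, the result is simply scaled by the initial probability of sunny.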

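Basic Calculations-2
• Example: Given that the system is in a known state, what is the probability that it stays in that state for exactly d days?
• O = i i i ... i j, with d copies of state i followed by some state j ≠ i
• $P(O \mid \text{model}, q_1 = i) = (a_{ii})^{d-1} (1 - a_{ii}) = p_i(d)$
• Note the exponential character of this distribution.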

Basic Calculations-3
• We can compute the expected number of observations in a state given that we started in that state: $\bar{d}_i = \sum_{d=1}^{\infty} d\,(a_{ii})^{d-1}(1 - a_{ii}) = \frac{1}{1 - a_{ii}}$
• Thus, the expected number of consecutive sunny days is 1/(1-0.8) = 5; the expected number of cloudy days is 2.5, etc.
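The closed form for the expected duration can be compared against a direct (truncated) evaluation of the sum. This is a small sketch; the self-transition value 0.8 for sunny is from the slide, 0.6 for cloudy is the value implied by the stated 2.5 days, and the function names and truncation length are illustrative.

```python
def duration_pmf(a_ii, d):
    """P(exactly d consecutive steps in state i | started in state i)."""
    return a_ii ** (d - 1) * (1.0 - a_ii)


def expected_duration(a_ii):
    """Closed form 1 / (1 - a_ii) of the expected duration."""
    return 1.0 / (1.0 - a_ii)


def expected_duration_numeric(a_ii, max_d=10_000):
    """Truncated version of sum_{d >= 1} d * p_i(d), for comparison."""
    return sum(d * duration_pmf(a_ii, d) for d in range(1, max_d + 1))


for name, a_ii in [("sunny", 0.8), ("cloudy", 0.6)]:
    print(name, expected_duration(a_ii), expected_duration_numeric(a_ii))
```

The two columns agree up to floating-point rounding, which illustrates that the geometric distribution of durations is what produces the 1/(1 − a_ii) rule.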

From Markov To Hidden Markov
• The previous model assumes that each state can be uniquely associated with an observable event
• Once an observation is made, the state of the system is then trivially retrieved
• This model, however, is too restrictive to be of practical use for most realistic problems
• To make the model more flexible, we will assume that the outcomes or observations of the model are a probabilistic function of each state
• Each state can produce a number of outputs according to its own probability distribution, and each distinct output can potentially be generated at any state
• These are known as Hidden Markov Models (HMMs), because the state sequence is not directly observable; it can only be inferred from the sequence of observations produced by the system

The coin-toss problem
• To illustrate the concept of an HMM, consider the following scenario
• Assume that you are placed in a room with a curtain
• Behind the curtain there is a person performing a coin-toss experiment
• This person selects one of several coins, and tosses it: heads (H) or tails (T)
• The person tells you the outcome (H,T), but not which coin was used each time
• Your goal is to build a probabilistic model that best explains a sequence of observations O={o1,o2,o3,o4,…}={H,T,T,H,…}
• The coins represent the states; these are hidden because you do not know which coin was tossed each time
• The outcome of each toss represents an observation
• A “likely” sequence of coins may be inferred from the observations, but this state sequence will not be unique
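To see why the coin sequence is not unique, one can enumerate every possible coin sequence for a short run of outcomes. The sketch below uses a hypothetical two-coin setup: the coin biases, the start probabilities and the switching probabilities are invented for illustration and do not come from the slides.

```python
from itertools import product

# HYPOTHETICAL parameters, chosen only for illustration.
P_HEADS = {"coin1": 0.5, "coin2": 0.8}   # P(H | coin)
P_START = {"coin1": 0.5, "coin2": 0.5}   # which coin is tossed first
P_SWITCH = {                              # P(next coin | current coin)
    "coin1": {"coin1": 0.5, "coin2": 0.5},
    "coin2": {"coin1": 0.5, "coin2": 0.5},
}


def emission(coin, outcome):
    """P(outcome | coin) for outcome 'H' or 'T'."""
    return P_HEADS[coin] if outcome == "H" else 1.0 - P_HEADS[coin]


def joint_probability(coins, outcomes):
    """P(coin sequence AND outcome sequence) under the toy model."""
    p = P_START[coins[0]] * emission(coins[0], outcomes[0])
    for prev, cur, o in zip(coins, coins[1:], outcomes[1:]):
        p *= P_SWITCH[prev][cur] * emission(cur, o)
    return p


outcomes = ["H", "T", "T", "H"]
joint = {
    coins: joint_probability(coins, outcomes)
    for coins in product(P_HEADS, repeat=len(outcomes))
}
likelihood = sum(joint.values())   # P(outcomes), summed over all hidden paths
best = max(joint, key=joint.get)   # single most probable coin sequence
print(len(joint), likelihood, best)
```

All 16 coin sequences have nonzero probability of producing H,T,T,H here, so the observations alone never pin down the hidden sequence; the most that can be said is which sequence is most probable.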

The Coin Toss Example – 2 coins

From Markov to Hidden Markov Model: The Coin Toss Example – 3 coins

The urn-ball problem • To further illustrate the concept of an HMM, consider this scenario • You are placed in the same room with a curtain • Behind the curtain there are N urns, each containing a large number of balls with M different colors • The person behind the curtain selects an urn according to an internal random process, then randomly grabs a ball from the selected urn • He shows you the ball, and places it back in the urn • This process is repeated over and over • Questions? • How would you represent this experiment with an HMM? • What are the states? • Why are the states hidden? • What are the observations?

Doubly Stochastic System The Urn-and-Ball Model O = {green, blue, green, yellow, red, ..., blue} How can we determine the appropriate model for the observation sequence given the system above?

HMM Notation (from Rabiner’s Survey*)
• The states are labeled $S_1, S_2, \dots, S_N$
• For a particular trial, let T be the number of observations; T is also the number of states passed through
• $O = O_1 O_2 \dots O_T$ is the sequence of observations
• $Q = q_1 q_2 \dots q_T$ is the notation for a path of states
• $\lambda = \langle N, M, \{\pi_i\}, \{a_{ij}\}, \{b_i(j)\} \rangle$ is the specification of an HMM
*L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.

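HMM Formal Definition
An HMM, λ, is a 5-tuple consisting of
• N, the number of states
• M, the number of possible observations
• $\{\pi_1, \pi_2, \dots, \pi_N\}$, the starting state probabilities: $P(q_0 = S_i) = \pi_i$
• The state transition probabilities $P(q_{t+1} = S_j \mid q_t = S_i) = a_{ij}$: $\begin{pmatrix} a_{11} & a_{12} & \dots & a_{1N} \\ a_{21} & a_{22} & \dots & a_{2N} \\ \vdots & \vdots & & \vdots \\ a_{N1} & a_{N2} & \dots & a_{NN} \end{pmatrix}$
• The observation probabilities $P(O_t = k \mid q_t = S_i) = b_i(k)$: $\begin{pmatrix} b_1(1) & b_1(2) & \dots & b_1(M) \\ b_2(1) & b_2(2) & \dots & b_2(M) \\ \vdots & \vdots & & \vdots \\ b_N(1) & b_N(2) & \dots & b_N(M) \end{pmatrix}$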
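One way to hold the 5-tuple in code is a small container whose N and M are implied by the array shapes. This is only a sketch: the class name `HMM`, the method `is_valid` and the toy parameters are hypothetical and not part of the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HMM:
    """lambda = (N, M, pi, A, B); N and M are implied by the array shapes."""
    pi: List[float]        # pi[i]   = P(q_0 = S_(i+1))
    A: List[List[float]]   # A[i][j] = P(q_(t+1) = S_(j+1) | q_t = S_(i+1))
    B: List[List[float]]   # B[i][k] = P(O_t = symbol k+1 | q_t = S_(i+1))

    @property
    def N(self) -> int:
        return len(self.pi)

    @property
    def M(self) -> int:
        return len(self.B[0])

    def is_valid(self, tol: float = 1e-9) -> bool:
        """True if shapes agree and every row is a probability distribution."""
        if len(self.A) != self.N or len(self.B) != self.N:
            return False
        if any(len(row) != self.N for row in self.A):
            return False
        if any(len(row) != self.M for row in self.B):
            return False
        rows = [self.pi] + self.A + self.B
        return all(min(row) >= 0 and abs(sum(row) - 1.0) <= tol for row in rows)


# HYPOTHETICAL two-state, two-symbol model, used only to exercise the class.
toy = HMM(pi=[0.5, 0.5], A=[[0.7, 0.3], [0.0, 1.0]], B=[[0.8, 0.2], [0.4, 0.6]])
print(toy.N, toy.M, toy.is_valid())
```

Checking that π and each row of the two matrices sum to 1 catches transcription slips such as mislabeled indices before any inference is run.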

Here’s an HMM
• Start randomly in state 1 or 2
• Choose one of the output symbols in each state at random.
• N = 3, M = 3
• $\pi_1 = 1/2$, $\pi_2 = 1/2$, $\pi_3 = 0$
• $a_{11} = 0$, $a_{12} = 1/3$, $a_{13} = 2/3$
• $a_{21} = 1/3$, $a_{22} = 0$, $a_{23} = 2/3$
• $a_{31} = 1/3$, $a_{32} = 1/3$, $a_{33} = 1/3$
• $b_1(X) = 1/2$, $b_1(Y) = 1/2$, $b_1(Z) = 0$
• $b_2(X) = 0$, $b_2(Y) = 1/2$, $b_2(Z) = 1/2$
• $b_3(X) = 1/2$, $b_3(Y) = 0$, $b_3(Z) = 1/2$
[State diagram: S1 emits X or Y, S2 emits Y or Z, S3 emits Z or X; arcs carry the transition probabilities listed above.]
Let’s generate a sequence of observations:
• Choose the start state: 50-50 choice between S1 and S2.
• In S1, choose an output: 50-50 choice between X and Y.
• Choose the next state: go to S3 with probability 2/3 or S2 with probability 1/3.
• In S3, choose an output: 50-50 choice between Z and X.
• Choose the next state: from S3, each of the three next states is equally likely.
• In S3 again, choose an output: 50-50 choice between Z and X.
This is what the observer has to work with: only the output symbols, not the states that produced them.
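The generation procedure walked through above can be sketched as a short sampler. The parameters are those of the example HMM, under the assumption that the transition rows are a21, a22, a23 = 1/3, 0, 2/3 and a31, a32, a33 = 1/3, 1/3, 1/3 (the slide's index labels for rows 2 and 3 appear to be mislabeled). The function name `generate` and the seed are illustrative.

```python
import random

STATES = ["S1", "S2", "S3"]
SYMBOLS = ["X", "Y", "Z"]
PI = [1/2, 1/2, 0]          # start in S1 or S2, never S3
A = [                        # ASSUMPTION: rows 2 and 3 relabeled so rows sum to 1
    [0,   1/3, 2/3],
    [1/3, 0,   2/3],
    [1/3, 1/3, 1/3],
]
B = [                        # B[i][k] = P(symbol k | state i)
    [1/2, 1/2, 0],           # S1 emits X or Y
    [0,   1/2, 1/2],         # S2 emits Y or Z
    [1/2, 0,   1/2],         # S3 emits X or Z
]


def generate(T, rng):
    """Sample T (state, output) pairs; only the outputs would be observed."""
    states, outputs = [], []
    q = rng.choices(range(3), weights=PI)[0]
    for _ in range(T):
        states.append(q)
        outputs.append(rng.choices(range(3), weights=B[q])[0])
        q = rng.choices(range(3), weights=A[q])[0]
    return states, outputs


rng = random.Random(1)  # arbitrary seed for repeatability
states, outputs = generate(8, rng)
print("hidden  :", [STATES[q] for q in states])
print("observed:", [SYMBOLS[o] for o in outputs])
```

Printing both lists side by side makes the "doubly stochastic" nature visible: one random draw picks the state, a second picks the symbol, and the observer sees only the second line.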

Hidden Markov Models
• Question 1: Probability Evaluation (and State Estimation). What is $P(O_1 O_2 \dots O_T)$ (and $P(q_T = S_i \mid O_1 O_2 \dots O_T)$)? It will turn out that a new cute D.P. trick will get this for us.
• Question 2: Most Probable Path. Given $O_1 O_2 \dots O_T$, what is the most probable state path? And what is that probability? Yet another famous D.P. trick, the VITERBI algorithm, gets this.
• Question 3: Learning HMMs. Given $O_1 O_2 \dots O_T$, what is the maximum likelihood HMM that could have produced this string of observations? Very very useful. Uses the E.M. Algorithm.
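The slides do not spell out the dynamic-programming step for Question 1. A standard choice, described in Rabiner's tutorial, is the forward recursion; the sketch below assumes that is the method meant. The function name `forward` is illustrative, and the parameters are borrowed from Model 1 of the two-model example later in these slides, with symbols A and B encoded as 0 and 1.

```python
def forward(pi, A, B, obs):
    """Forward recursion.

    Returns (P(O_1..O_T), alpha_T) where
    alpha_T[i] = P(O_1..O_T and q_T = S_(i+1)).
    """
    N = len(pi)
    # Initialisation: start in state i and emit the first symbol.
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    # Induction: sum over all predecessors i, then emit the next symbol in j.
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
            for j in range(N)
        ]
    return sum(alpha), alpha


# Model 1 from the example later in the slides; observation sequence A, B, B.
pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.0, 1.0]]
B = [[0.8, 0.2], [0.4, 0.6]]
obs = [0, 1, 1]

likelihood, alpha_T = forward(pi, A, B, obs)
posterior = [a / likelihood for a in alpha_T]   # P(q_T = S_i | O_1..O_T)
print(likelihood, posterior)
```

The recursion costs on the order of N²T multiplications, instead of the N^T terms needed to sum over every state path explicitly, and the normalised final α values give the state estimate asked for in the same question.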

Most probable path given observations

Let's look at a particular type of shortest path problem. Suppose we wish to get from A to J in the road network of [figure not reproduced].

You now continue working back through the stages one by one, each time completely computing a stage before continuing to the preceding one. The results are: [Stage 2 and Stage 1 tables not reproduced].

Efficient MPP computation: Viterbi
We’re going to compute the following variables:
$\delta_t(i) = \max_{q_1 q_2 \dots q_{t-1}} P(q_1 q_2 \dots q_{t-1} \wedge q_t = S_i \wedge O_1 \dots O_t)$
which is the probability of the most probable path that accounts for the first t observations and ends at state $S_i$.
DEFINE: $mpp_t(i)$ = that path
So: $\delta_t(i) = \mathrm{Prob}(mpp_t(i))$

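The Viterbi Algorithm
Now, suppose we have all the $\delta_t(i)$’s and $mpp_t(i)$’s for all i. How do we get $\delta_{t+1}(j)$ and $mpp_{t+1}(j)$?
[Trellis figure: each state $S_1, \dots, S_N$ at time t ($q_t$) carries its path $mpp_t(i)$ with Prob = $\delta_t(i)$; the question is which of them leads into $S_j$ at time t+1 ($q_{t+1}$).]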

The Viterbi Algorithm
• The most probable path with last two states $S_i S_j$ is the most probable path to $S_i$, followed by the transition $S_i \to S_j$.
• What is the probability of that path? $\delta_t(i) \times P(S_i \to S_j \wedge O_{t+1} \mid \lambda) = \delta_t(i)\, a_{ij}\, b_j(O_{t+1})$
• So the most probable path to $S_j$ has $S_{i^*}$ as its penultimate state, where $i^* = \arg\max_i \delta_t(i)\, a_{ij}\, b_j(O_{t+1})$
• Summary: $\delta_{t+1}(j) = \delta_t(i^*)\, a_{i^*j}\, b_j(O_{t+1})$ and $mpp_{t+1}(j) = mpp_t(i^*)\, S_j$, with $i^*$ as defined above.
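The recursion summarised above can be sketched with back-pointers instead of storing every full path. The function name `viterbi` is illustrative. The demonstration uses the three-state X/Y/Z example HMM from earlier in these slides, assuming transition rows 2 and 3 are (1/3, 0, 2/3) and (1/3, 1/3, 1/3), with symbols X, Y, Z encoded as 0, 1, 2.

```python
def viterbi(pi, A, B, obs):
    """Return (most probable 0-based state path, its joint probability)."""
    N = len(pi)
    # delta[i] = probability of the best path ending in state i so far.
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]
    backptr = []  # backptr[t][j] = best predecessor i* of state j
    for o in obs[1:]:
        prev = delta
        step_ptr, delta = [], []
        for j in range(N):
            # b_j(o) does not depend on i, so it can be applied after the argmax.
            i_star = max(range(N), key=lambda i: prev[i] * A[i][j])
            step_ptr.append(i_star)
            delta.append(prev[i_star] * A[i_star][j] * B[j][o])
        backptr.append(step_ptr)
    # Pick the best final state, then follow the back-pointers to the start.
    last = max(range(N), key=lambda i: delta[i])
    path = [last]
    for step_ptr in reversed(backptr):
        path.append(step_ptr[path[-1]])
    path.reverse()
    return path, delta[last]


# Three-state example HMM (ASSUMPTION: rows 2 and 3 of A as stated above).
pi = [1/2, 1/2, 0]
A = [[0, 1/3, 2/3], [1/3, 0, 2/3], [1/3, 1/3, 1/3]]
B = [[1/2, 1/2, 0], [0, 1/2, 1/2], [1/2, 0, 1/2]]
obs = [0, 1, 2]  # X, Y, Z

path, prob = viterbi(pi, A, B, obs)
print(["S%d" % (q + 1) for q in path], prob)
```

For the observations X, Y, Z the only path with nonzero probability is S1, S2, S3, with probability 1/2 · 1/2 · 1/3 · 1/2 · 2/3 · 1/2 = 1/72, which makes this a convenient hand-checkable case.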


EXAMPLE
Consider the following two 2-state HMMs where both states have two possible output symbols, A and B.
• Model 1: Transition probabilities: a11 = 0.7; a12 = 0.3; a21 = 0.0; a22 = 1.0. Output probabilities: b1(A) = 0.8; b1(B) = 0.2; b2(A) = 0.4; b2(B) = 0.6. Initial probabilities: π1 = 0.5; π2 = 0.5
• Model 2: Transition probabilities: a11 = 0.6; a12 = 0.4; a21 = 0.0; a22 = 1.0. Output probabilities: b1(A) = 0.9; b1(B) = 0.1; b2(A) = 0.3; b2(B) = 0.7. Initial probabilities: π1 = 0.4; π2 = 0.6
A. Sketch the state diagram for the two models.
B. Which model is more likely to produce the observation sequence {A,B,B}?
C. Given the observation sequence {A,B,B}, find the Viterbi path for each model. Would your answer in (B) be different if you used the likelihood of the Viterbi paths to approximate the likelihood of the model?
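With only two states and three observations there are just eight state paths per model, so parts B and C can be checked by brute-force enumeration straight from the definitions. This is a sketch for verifying hand calculations, not the intended solution method; the names `path_joint` and `evaluate` are illustrative, and symbols A and B are encoded as 0 and 1.

```python
from itertools import product

MODELS = {
    "Model 1": dict(pi=[0.5, 0.5], A=[[0.7, 0.3], [0.0, 1.0]],
                    B=[[0.8, 0.2], [0.4, 0.6]]),
    "Model 2": dict(pi=[0.4, 0.6], A=[[0.6, 0.4], [0.0, 1.0]],
                    B=[[0.9, 0.1], [0.3, 0.7]]),
}
OBS = [0, 1, 1]  # the observation sequence A, B, B


def path_joint(model, path, obs):
    """P(state path AND observations | model), with 0-based state indices."""
    pi, A, B = model["pi"], model["A"], model["B"]
    p = pi[path[0]] * B[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        p *= A[prev][cur] * B[cur][o]
    return p


def evaluate(model, obs):
    """Return (P(obs | model), most probable path, probability of that path)."""
    n_states = len(model["pi"])
    joint = {
        path: path_joint(model, path, obs)
        for path in product(range(n_states), repeat=len(obs))
    }
    best = max(joint, key=joint.get)
    return sum(joint.values()), best, joint[best]


results = {name: evaluate(model, OBS) for name, model in MODELS.items()}
for name, (likelihood, best, best_prob) in results.items():
    print(name, likelihood, [s + 1 for s in best], best_prob)
```

Summing over all paths answers part B, while the single largest term is the Viterbi-path probability for part C, so the two rankings of the models can be compared directly.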
