Hidden Markov Models Ellen Walker Bioinformatics Hiram College, 2008
State Machine to Recognize “AUG” [Figure: state diagram labeled with a start state, transitions, and a final state] • Each character causes a transition to the next state
Deterministic Finite Automaton (DFA) • States • One start state • One or more accept states • Transitions • For every state, for every character • Outputs • Optional: states can emit outputs, e.g. “Stop” at accept state
Why DFAs? • Every regular expression has an associated state machine that recognizes it (and vice versa) • State machines are easy to implement in very low level code (or hardware) • Sometimes the state machine is easier to describe than the regular expression
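To illustrate how little code a state machine needs, here is a minimal sketch of a DFA as a transition table. It is an assumption-laden variant of the “AUG” machine: the names `TRANSITIONS` and `contains_aug` are hypothetical, and this version scans a whole RNA string for “AUG” rather than reading exactly three characters.

```python
# Hypothetical sketch: state k means "the last k characters match the first k of AUG".
START, ACCEPT = 0, 3

# A DFA needs one transition for every (state, character) pair.
TRANSITIONS = {
    0: {"A": 1, "C": 0, "G": 0, "U": 0},
    1: {"A": 1, "C": 0, "G": 0, "U": 2},
    2: {"A": 1, "C": 0, "G": 3, "U": 0},
    3: {"A": 3, "C": 3, "G": 3, "U": 3},  # accept state absorbs the rest of the input
}

def contains_aug(rna):
    """Return True if the DFA ends in the accept state after reading rna."""
    state = START
    for ch in rna:
        state = TRANSITIONS[state][ch]  # the input character fully determines the next state
    return state == ACCEPT
```

For example, `contains_aug("CCAAUGGU")` ends in the accept state, while `contains_aug("AUC")` does not.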
Hidden Markov Models • Also a form of state machine • Transitions based on probabilities, not inputs • Every state has (probabilistic) output (or emission) • “Hidden” because only emissions are visible, not states or transitions
HMM vs. DFA • DFA is deterministic • Each decision (which state next? What to output?) is fully determined by the input string • HMM is probabilistic • HMM makes both decisions based on probability distributions
HMM vs. DFA (2) • A DFA model is explicit and used directly, like a program. • An HMM must be inferred from data. Only emissions (outputs) can be observed. The states and transitions, as well as the probability distributions for transitions and outputs, are hidden.
HMM Example: Fair Bet Casino • The casino has two coins, a Fair coin (F) and a Biased coin (B) • Fair coin has 50% H, 50% T • Biased coin has 75% H, 25% T • Before each flip, with probability 10%, the dealer will switch coins. • Can you tell, based only on a sequence of H and T which coin is used when?
“Fair Bet Casino” HMM Image from Jones & Pevzner 2004
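The casino model can be written down as three small probability tables. The sketch below is illustrative only: the names `START_P`, `TRANS_P`, `EMIT_P`, and `simulate` are hypothetical, and the 50/50 choice of the first coin is an assumption taken from the worked “HHT” example.

```python
import random

STATES = ("F", "B")
START_P = {"F": 0.5, "B": 0.5}  # assumption: either coin is equally likely at the start
TRANS_P = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}  # 10% chance to switch
EMIT_P = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.75, "T": 0.25}}

def pick(dist, rng):
    """Draw one key from a {key: probability} table."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]

def simulate(n, seed=0):
    """Generate n flips; return the hidden coin sequence and the visible flips."""
    rng = random.Random(seed)
    states, flips = [], []
    state = pick(START_P, rng)
    for _ in range(n):
        states.append(state)
        flips.append(pick(EMIT_P[state], rng))  # emission depends only on the current coin
        state = pick(TRANS_P[state], rng)       # dealer may switch before the next flip
    return "".join(states), "".join(flips)
```

An observer sees only the second string returned by `simulate`; the first is the hidden path that decoding tries to recover.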
The Decoding Problem • Given an HMM and a sequence of outputs, what is the most likely path through the HMM that generated the outputs?
Viterbi Algorithm • Uses dynamic programming • Starting point: • When the output string is “”, the most likely state is the start state (and there is no path) • Taking a step: • Likelihood of this state is maximum of all ways to get here, measured as: • Likelihood of previous state * Likelihood of transition to this state * Likelihood of output from this state
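The step described above can be written as a recurrence. The symbols are assumptions introduced here for illustration: $v_k(i)$ is the likelihood of the best path ending in state $k$ after $i$ outputs, $a_{lk}$ is the transition probability from state $l$ to state $k$, and $e_k(x_i)$ is the probability that state $k$ emits the $i$-th output $x_i$.

```latex
v_{\text{start}}(0) = 1, \qquad
v_k(i) = e_k(x_i) \cdot \max_{l} \bigl[\, v_l(i-1) \cdot a_{lk} \,\bigr]
```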
Example: “HHT” (flip 1: H) • Initial -> F • Prev = 1, Trans = 0.5, Out = 0.5, total = 0.25 • Initial -> B • Prev = 1, Trans = 0.5, Out = 0.75, total = 0.375 • Result: F = 0.25, B = 0.375
Example: “HHT” (flip 2: H) • F -> F • Prev = 0.25, Trans = 0.9, Out = 0.5, total = 0.1125 • B -> F • Prev = 0.375, Trans = 0.1, Out = 0.5, total = 0.01875 • F -> B • Prev = 0.25, Trans = 0.1, Out = 0.75, total = 0.01875 • B -> B • Prev = 0.375, Trans = 0.9, Out = 0.75, total = 0.253125 • Result (maximum for each state): F = 0.1125, B = 0.253125
Example: “HHT” (flip 3: T) • F -> F • Prev = 0.1125, Trans = 0.9, Out = 0.5, total = 0.0506 • B -> F • Prev = 0.253125, Trans = 0.1, Out = 0.5, total = 0.0127 • F -> B • Prev = 0.1125, Trans = 0.1, Out = 0.25, total = 0.00281 • B -> B • Prev = 0.253125, Trans = 0.9, Out = 0.25, total = 0.0570 • Result (maximum for each state): F = 0.0506, B = 0.0570
Tracing Back • Pick the highest result from the last step, follow the highest transition from each previous step (just like Smith-Waterman) • Result: initial->B->B->B • Biased coin always used • What if the next flip is T?
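The worked example and the traceback can be reproduced with a short dynamic-programming sketch. This is one possible implementation, not a reference one: the function name `viterbi` and the table names are hypothetical, the output string is assumed to be non-empty, and plain probabilities are used, so it is only suitable for short strings.

```python
STATES = ("F", "B")
START_P = {"F": 0.5, "B": 0.5}  # taken from the "Initial ->" step of the example
TRANS_P = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT_P = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.75, "T": 0.25}}

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state path for obs and the table of best scores."""
    # scores[t][s]: likelihood of the best path that ends in state s at output t
    scores = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]  # back[t][s]: previous state on that best path
    for t in range(1, len(obs)):
        scores.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: scores[t - 1][p] * trans_p[p][s])
            scores[t][s] = scores[t - 1][prev] * trans_p[prev][s] * emit_p[s][obs[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return "".join(reversed(path)), scores
```

Calling `viterbi("HHT", STATES, START_P, TRANS_P, EMIT_P)` gives the path `"BBB"`, matching the traceback above. With one more T (`"HHTT"`), the best final state becomes F and the traceback under this sketch is `"FFFF"`: a single extra flip can change the entire decoded path.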
Log Probabilities • Probabilities are increasingly small, as you multiply numbers less than one • Computers have limits to precision • Therefore, it’s better to use a log probability format • 1/10 * 1/10 = 1/100 (10^-1 * 10^-1 = 10^-2) • -1 + -1 = -2
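A small sketch of why this matters in practice: multiplying many probabilities underflows to zero in floating point, while adding their logs stays well within range. The helper names `product` and `log_score` are hypothetical, and base-10 logs are used only to match the example above.

```python
import math

def product(probs):
    """Multiply probabilities directly (underflows for long sequences)."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_score(probs):
    """Add base-10 log probabilities instead of multiplying."""
    return sum(math.log10(p) for p in probs)

steps = [0.1] * 400          # 400 steps, each with probability 1/10
direct = product(steps)      # too small for a double: becomes 0.0
logged = log_score(steps)    # about -400, still easy to compare with other paths
```

Because the logarithm is monotonic, taking the maximum over log scores picks the same path as taking the maximum over the raw products.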
GC Rich Islands • A GC Rich Island is an area of a genome where GC content is significantly greater than in the genome as a whole • GC Rich Islands are like Biased Coins • Can recognize them using the same HMM • Genome-wide GC content plays the role of p(H) for the fair coin • The higher GC content inside islands plays the role of p(H) for the biased coin • Estimate the probabilities of entering and leaving a GC Rich Island to set the “changing coin” probabilities
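One way to reuse the coin model is to recode a DNA string so that G or C counts as “heads” and A or T counts as “tails”. The helper `gc_flips` below is a hypothetical illustration of that recoding only; the emission and switching probabilities would still have to be estimated from the genome in question.

```python
def gc_flips(dna):
    """Recode DNA as coin flips: G/C -> 'H', anything else -> 'T'."""
    return "".join("H" if base in "GC" else "T" for base in dna.upper())
```

For example, `gc_flips("ATGCgc")` returns `"TTHHHH"`, which can then be decoded with the same two-state HMM.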
Probability of State Sequence, Given Output Sequence • Given an HMM and an output string, what is the probability that the HMM is in state S at time t? • Forward: similar formulation to the decoding problem, except take the sum over all paths instead of the max over all paths (times from 0 to t-1) • Backward: similar, but work from the end of the string (times from t+1 to the end of the sequence)
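The forward and backward passes can be combined to give the probability of each state at each time. The following is a minimal sketch under stated assumptions: the name `posterior` and the parameter tables are hypothetical, the casino parameters (with a 50/50 start) are reused for illustration, the output string is non-empty, and raw probabilities are used, so it suits short strings only.

```python
STATES = ("F", "B")
START_P = {"F": 0.5, "B": 0.5}
TRANS_P = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT_P = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.75, "T": 0.25}}

def posterior(obs, states, start_p, trans_p, emit_p):
    """Return, for each time t, P(state at t = s | whole output string)."""
    n = len(obs)
    # Forward: same recurrence as decoding, with sum in place of max.
    fwd = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for t in range(1, n):
        fwd.append({
            s: emit_p[s][obs[t]] * sum(fwd[t - 1][p] * trans_p[p][s] for p in states)
            for s in states
        })
    # Backward: probability of the remaining outputs, working from the end.
    bwd = [None] * n
    bwd[n - 1] = {s: 1.0 for s in states}
    for t in range(n - 2, -1, -1):
        bwd[t] = {
            s: sum(trans_p[s][q] * emit_p[q][obs[t + 1]] * bwd[t + 1][q] for q in states)
            for s in states
        }
    total = sum(fwd[n - 1][s] for s in states)  # probability of the whole string
    return [{s: fwd[t][s] * bwd[t][s] / total for s in states} for t in range(n)]
```

Each entry of the returned list sums to 1 over the states, so it can be read directly as "how likely was each coin at this flip".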
Parameter Estimation • Given many strings, what are the parameters of the HMM that generated them? • Assume we know the states and transitions, but not the probabilities of transitions or outputs • This is an optimization problem
Characteristics of an Optimization Problem • Each potential solution has a “goodness” value (in this case, probability) • We want the best solution • Perfect answer: try all possibilities (not usually possible) • Good, but not perfect answer: use a heuristic
Hill Climbing (an Optimization Heuristic) • Start with a solution (could be random) • Consider one or more “steps”, or perturbations to the solution • Choose the “step” that most improves the score • Repeat until the score is good enough, or no better score can be reached
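The loop described above can be sketched generically. The function `hill_climb` and its parameters `neighbors` and `score` are hypothetical names chosen for illustration; the step limit is an added assumption to guarantee termination.

```python
def hill_climb(start, neighbors, score, max_steps=1000):
    """Repeatedly move to the best-scoring neighbor until no neighbor improves."""
    current = start
    for _ in range(max_steps):
        best = max(neighbors(current), key=score, default=current)
        if score(best) <= score(current):
            break  # local optimum: no step improves the score
        current = best
    return current
```

For instance, `hill_climb(0, lambda x: [x - 1, x + 1], lambda x: -(x - 3) ** 2)` walks from 0 up to 3. On a score with several peaks, the result depends on the starting point, which is why this is a heuristic rather than a perfect answer.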
Hill Climbing for HMM • Guess a state sequence • Using the string(s), estimate transition and emission probabilities • Using the probabilities, generate a new state sequence using the decoding algorithm • Repeat until the sequence stabilizes
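The loop above can be sketched as follows. All names (`viterbi_path`, `estimate`, `train`) are hypothetical, and several choices are assumptions not stated in the slides: start probabilities are held fixed, a pseudocount of 1 is added to every count so that no probability becomes zero, and raw probabilities are used, so the sketch is suited to short strings only.

```python
def viterbi_path(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for a non-empty output string."""
    scores = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    back = []
    for x in obs[1:]:
        step, new = {}, {}
        for s in states:
            prev = max(states, key=lambda p: scores[p] * trans_p[p][s])
            step[s] = prev
            new[s] = scores[prev] * trans_p[prev][s] * emit_p[s][x]
        back.append(step)
        scores = new
    path = [max(states, key=lambda s: scores[s])]
    for step in reversed(back):
        path.append(step[path[-1]])
    return "".join(reversed(path))

def estimate(obs, path, states, symbols, pseudo=1.0):
    """Count transitions and emissions along path, then normalize each row."""
    trans = {s: {t: pseudo for t in states} for s in states}
    emit = {s: {x: pseudo for x in symbols} for s in states}
    for i, (s, x) in enumerate(zip(path, obs)):
        emit[s][x] += 1
        if i + 1 < len(path):
            trans[s][path[i + 1]] += 1
    def normalize(row):
        total = sum(row.values())
        return {k: v / total for k, v in row.items()}
    return ({s: normalize(trans[s]) for s in states},
            {s: normalize(emit[s]) for s in states})

def train(obs, guess, states, symbols, start_p, max_iter=50):
    """Alternate estimation and decoding until the state sequence stabilizes."""
    path = guess
    for _ in range(max_iter):
        trans_p, emit_p = estimate(obs, path, states, symbols)
        new_path = viterbi_path(obs, states, start_p, trans_p, emit_p)
        if new_path == path:
            break  # sequence stabilized
        path = new_path
    return path, trans_p, emit_p
```

A call such as `train("HHHHTHTTHT", "FFFFFBBBBB", ("F", "B"), ("H", "T"), {"F": 0.5, "B": 0.5})` returns the final state sequence together with the estimated tables. As with any hill climbing, the result depends on the initial guess.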
HMM for Sequence Profiles • Three kinds of states: • Insertion • Deletion • Match • Probability estimations indicate how often each occurs • Logos are direct representations of HMMs in this format
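As a rough illustration of "how often each occurs", the sketch below counts, for each column of a small alignment, the residue frequencies and the fraction of gaps. It is a simplification under stated assumptions: the name `column_profile` is hypothetical, every column is treated as a match column, gaps are read as deletions, and insertion states are not modeled at all.

```python
from collections import Counter

def column_profile(alignment, gap="-"):
    """Per-column residue frequencies and gap (deletion) fraction."""
    profile = []
    for column in zip(*alignment):  # iterate over alignment columns
        residues = [c for c in column if c != gap]
        counts = Counter(residues)
        emissions = {c: n / len(residues) for c, n in counts.items()} if residues else {}
        profile.append({
            "emissions": emissions,                       # match-state emission estimate
            "deletion": column.count(gap) / len(column),  # how often the column is skipped
        })
    return profile
```

For the alignment `["ACG", "A-G", "TCG"]`, the first column has A with frequency 2/3 and T with 1/3, and the second column has a deletion fraction of 1/3.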