## Hidden Markov Models


**Hidden Markov Models**
Ellen Walker, Bioinformatics, Hiram College, 2008

**State Machine to Recognize “AUG”**
- (Slide diagram: start state, a transition for each character, final state)
- Each character causes a transition to the next state

**Deterministic Finite Automaton (DFA)**
- States
  - One start state
  - One or more accept states
- Transitions
  - For every state, for every character
- Outputs
  - Optional: states can emit outputs, e.g. “Stop” at the accept state

**Why DFAs?**
- Every regular expression has an associated state machine that recognizes it (and vice versa)
- State machines are easy to implement in very low-level code (or hardware)
- Sometimes the state machine is easier to describe than the regular expression

**Hidden Markov Models**
- Also a form of state machine
- Transitions based on probabilities, not inputs
- Every state has a (probabilistic) output (or emission)
- “Hidden” because only emissions are visible, not states or transitions

**HMM vs. DFA**
- A DFA is deterministic
  - Each decision (which state next? what to output?) is fully determined by the input string
- An HMM is probabilistic
  - An HMM makes both decisions based on probability distributions

**HMM vs. DFA (2)**
- The DFA model is explicit and used directly, like a program.
- The HMM model must be inferred from data. Only emissions (outputs) can be observed. The states and transitions, as well as the probability distributions for transitions and outputs, are hidden.

**HMM Example: Fair Bet Casino**
- The casino has two coins, a Fair coin (F) and a Biased coin (B)
  - The Fair coin has 50% H, 50% T
  - The Biased coin has 75% H, 25% T
- Before each flip, with probability 10%, the dealer will switch coins.
- Can you tell, based only on a sequence of H and T, which coin is used when?

**“Fair Bet Casino” HMM**
Image from Jones & Pevzner, 2004

**The Decoding Problem**
- Given an HMM and a sequence of outputs, what is the most likely path through the HMM that generated the outputs?

**Viterbi Algorithm**
- Uses dynamic programming
- Starting point:
  - When the output string is “”, the most likely state is the start state (and there is no path)
- Taking a step:
  - The likelihood of this state is the maximum over all ways to get here, measured as:
  - Likelihood of previous state × likelihood of transition to this state × likelihood of output from this state

**Example: “HHT”**
- Initial -> F
  - Prev = 1, Trans = 0.5, Out = 0.5, total = 0.25
- Initial -> B
  - Prev = 1, Trans = 0.5, Out = 0.75, total = 0.375
- Result: F = 0.25, B = 0.375

**Example: “HHT”**
- F -> F
  - Prev = 0.25, Trans = 0.9, Out = 0.5, total = 0.1125
- B -> F
  - Prev = 0.375, Trans = 0.1, Out = 0.5, total = 0.01875
- F -> B
  - Prev = 0.25, Trans = 0.1, Out = 0.75, total = 0.01875
- B -> B
  - Prev = 0.375, Trans = 0.9, Out = 0.75, total = 0.253125
- Result: F = 0.1125, B = 0.253125

**Example: “HHT”**
- F -> F
  - Prev = 0.1125, Trans = 0.9, Out = 0.5, total = 0.0506
- B -> F
  - Prev = 0.253125, Trans = 0.1, Out = 0.5, total = 0.0127
- F -> B
  - Prev = 0.1125, Trans = 0.1, Out = 0.25, total = 0.00281
- B -> B
  - Prev = 0.253125, Trans = 0.9, Out = 0.25, total = 0.0570
- Result: F = 0.0506, B = 0.0570

**Tracing Back**
- Pick the highest result from the last step, then follow the highest transition from each previous step (just like Smith-Waterman)
- Result: initial -> B -> B -> B
  - The Biased coin was used throughout
- What if the next flip is T?

**Log Probabilities**
- Probabilities become increasingly small as you multiply numbers less than one
- Computers have limits to precision
- Therefore, it’s better to use a log-probability format
  - 1/10 × 1/10 = 1/100 (10⁻¹ × 10⁻¹ = 10⁻²)
  - In log space: −1 + −1 = −2

**GC Rich Islands**
- A GC Rich Island is an area of a genome where GC content is significantly greater than in the genome as a whole
- GC Rich Islands are like Biased coins
  - Can recognize them using the same HMM
  - GC content is p(H) for the fair coin
  - A larger number is p(H) for the biased coin
  - Estimate the probability of entering vs. leaving a GC Rich Island for the “changing coin” probability

**Probability of State Sequence, Given Output Sequence**
- Given an HMM and an output string, what is the probability that the HMM is in state S at time t?
- Forward: similar formulation to the decoding problem, except take the sum over all paths instead of the max over all paths (times from 0 to t−1)
- Backward: similar, but work from the end of the string (times from t+1 to the end of the sequence)

**Parameter Estimation**
- Given many strings, what are the parameters of the HMM that generated them?
- Assume we know the states and transitions, but not the probabilities of transitions or outputs
- This is an optimization problem

**Characteristics of an Optimization Problem**
- Each potential solution has a “goodness” value (in this case, a probability)
- We want the best solution
- Perfect answer: try all possibilities (not usually possible)
- Good, but not perfect, answer: use a heuristic

**Hill Climbing (an Optimization Heuristic)**
- Start with a solution (could be random)
- Consider one or more “steps”, or perturbations to the solution
- Choose the “step” that most improves the score
- Repeat until the score is good enough, or no better score can be reached

**Hill Climbing for HMM**
- Guess a state sequence
- Using the string(s), estimate transition and emission probabilities
- Using the probabilities, generate a new state sequence using the decoding algorithm
- Repeat until the sequence stabilizes

**HMM for Sequence Profiles**
- Three kinds of states:
  - Insertion
  - Deletion
  - Match
- Probability estimates indicate how often each occurs
- Logos are direct representations of HMMs in this format
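
The Viterbi computation worked through in the “HHT” example can be sketched in Python. The dict-based model layout below is an illustrative assumption, not from the slides; the probabilities come from the Fair Bet Casino description, with the initial coin assumed equally likely (matching Trans = 0.5 in the worked example). A `forward` variant is included to show that the forward algorithm differs only in summing over predecessors instead of taking the max.

```python
# Sketch of Viterbi decoding for the Fair Bet Casino HMM.
states = ["F", "B"]                    # Fair coin, Biased coin
start = {"F": 0.5, "B": 0.5}           # initial coin assumed equally likely
trans = {"F": {"F": 0.9, "B": 0.1},    # 10% chance the dealer switches coins
         "B": {"F": 0.1, "B": 0.9}}
emit = {"F": {"H": 0.5, "T": 0.5},     # fair coin: 50% H
        "B": {"H": 0.75, "T": 0.25}}   # biased coin: 75% H

def viterbi(seq):
    """Per-state likelihoods of the best path, plus the path itself."""
    v = {s: start[s] * emit[s][seq[0]] for s in states}   # first flip
    back = []                                             # traceback pointers
    for out in seq[1:]:
        # best predecessor for each state, then the step recurrence:
        # prev likelihood * transition likelihood * output likelihood
        prev = {s: max(states, key=lambda p: v[p] * trans[p][s])
                for s in states}
        v = {s: v[prev[s]] * trans[prev[s]][s] * emit[s][out]
             for s in states}
        back.append(prev)
    path = [max(states, key=lambda s: v[s])]   # highest result in last step
    for ptr in reversed(back):                 # follow pointers backwards
        path.append(ptr[path[-1]])
    return v, path[::-1]

def forward(seq):
    """Total probability of the output sequence: sum instead of max."""
    f = {s: start[s] * emit[s][seq[0]] for s in states}
    for out in seq[1:]:
        f = {s: sum(f[p] * trans[p][s] for p in states) * emit[s][out]
             for s in states}
    return sum(f.values())

v, path = viterbi("HHT")
print(v)     # F = 0.050625, B = 0.056953125, matching the slides
print(path)  # ['B', 'B', 'B'] -- the biased coin throughout
```

The per-step values reproduce the slide numbers (F = 0.25/0.1125/0.050625, B = 0.375/0.253125/0.056953125), and the traceback recovers initial -> B -> B -> B.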
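
The precision point behind log probabilities can be checked numerically. This minimal sketch (using base-10 logs, as in the slide’s 10⁻¹ × 10⁻¹ example) shows a product of probabilities underflowing to zero while the running log probability stays well-behaved.

```python
import math

# Multiplying many probabilities below 1 underflows double precision;
# adding their logs does not -- the motivation for log-probability format.
p, log_p = 1.0, 0.0
for _ in range(1000):
    p *= 0.1                  # probability space: shrinks toward underflow
    log_p += math.log10(0.1)  # log space: add instead of multiply

print(p)      # underflows to 0.0
print(log_p)  # -1000.0
```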