Markov models and applications. Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ sroy@biostat.wisc.edu Oct 15 th , 2013. Key concepts. Markov chains Computing the probability of a sequence Estimating parameters of a Markov model Hidden Markov models States
Markov models and applications Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ sroy@biostat.wisc.edu Oct 15th, 2013
Key concepts • Markov chains • Computing the probability of a sequence • Estimating parameters of a Markov model • Hidden Markov models • States • Emission and transition probabilities • Parameter estimation • Forward and backward algorithm • Viterbi algorithm
Why Markov models? • So far we have assumed in our models that there is no dependency between consecutive base pair locations • This assumption is rarely true in practice • Markov models allow us to model the dependencies inherent in sequential data such as DNA or protein sequence
Applications of Markov models • Genome annotation • Given a genome sequence, find functional units of the genome • Genes, CpG islands, promoters, ... • Sequence classification • A hidden Markov model (profile HMM) can represent a family of proteins • Sequence alignment
Markov models • Provide a generative model for sequence data • A Markov chain is the simplest type of Markov model • Described by • A collection of states, each state representing a symbol observed in the sequence • At each state a symbol is emitted • At each state there is some probability of transitioning to another state
Markov chain model notation • $X_t$ denotes the state at time $t$ • $a_{ij} = P(X_t = j \mid X_{t-1} = i)$ denotes the transition probability from $i$ to $j$
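A Markov chain model for DNA sequence [Figure: state diagram with a begin state and one state each for A, C, G, T; arrows are state transitions labeled with transition probabilities; the surviving labels are .38, .16, .34, .12]
Markov chain models • can also have an end state; this allows the model to represent • a distribution over sequences of different lengths • preferences for ending sequences with certain symbols [Figure: state diagram with begin, A, C, G, T, and end states]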
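The generative view with begin and end states can be sketched in Python. This is a minimal illustration, not the lecture's own code: the transition probabilities in `TRANSITIONS` and the helper name `sample_sequence` are made up for the example, and `max_len` is only a safety cap.

```python
import random

# Hypothetical transition probabilities (illustration only, not from the slides).
# Each row sums to 1; every symbol state has some probability of moving to "end".
TRANSITIONS = {
    "begin": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "A": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.2, "end": 0.1},
    "C": {"A": 0.2, "C": 0.3, "G": 0.2, "T": 0.2, "end": 0.1},
    "G": {"A": 0.2, "C": 0.2, "G": 0.3, "T": 0.2, "end": 0.1},
    "T": {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.3, "end": 0.1},
}


def sample_sequence(transitions, rng, max_len=1000):
    """Walk the chain from "begin" until "end" is reached, emitting each state visited."""
    seq = []
    state = "begin"
    while len(seq) < max_len:
        next_states = list(transitions[state])
        weights = [transitions[state][s] for s in next_states]
        state = rng.choices(next_states, weights=weights, k=1)[0]
        if state == "end":
            break
        seq.append(state)
    return "".join(seq)


rng = random.Random(0)  # fixed seed so repeated runs give the same samples
samples = [sample_sequence(TRANSITIONS, rng) for _ in range(5)]
print(samples)
```

Because the walk stops whenever the end state is drawn, the sampled sequences differ in length, which is what the end state adds to the model.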
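Computing the probability of a sequence from a first-order Markov model • Let $X$ be a sequence of random variables $X_1 \ldots X_L$ representing a biological sequence • From the chain rule of probability: $P(X) = P(X_L, X_{L-1}, \ldots, X_1) = P(X_L \mid X_{L-1}, \ldots, X_1)\, P(X_{L-1} \mid X_{L-2}, \ldots, X_1) \cdots P(X_1)$
Computing the probability of a sequence from a first-order Markov model • Key property of a (1st order) Markov chain: the probability of each $X_t$ depends only on the value of $X_{t-1}$, so $P(X) = P(X_L \mid X_{L-1})\, P(X_{L-1} \mid X_{L-2}) \cdots P(X_2 \mid X_1)\, P(X_1) = P(X_1) \prod_{t=2}^{L} P(X_t \mid X_{t-1})$ • $P(X_1)$ can be written as an initial probability or as a transition probability from the “begin” state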
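The first-order factorization translates directly into a loop over consecutive symbols. The sketch below assumes made-up transition probabilities (`TRANSITIONS` is hypothetical, and the initial probabilities are stored as transitions out of a "begin" state); it sums log probabilities to avoid underflow on long sequences.

```python
import math

# Hypothetical transition probabilities (illustration only, not from the slides).
# The "begin" row holds the initial probabilities P(X_1).
TRANSITIONS = {
    "begin": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "A": {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1},
    "C": {"A": 0.1, "C": 0.3, "G": 0.4, "T": 0.2},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4},
}


def sequence_log_prob(seq, transitions):
    """Return log P(seq) = log P(x_1) + sum over t of log P(x_t | x_{t-1})."""
    log_p = 0.0
    prev = "begin"
    for symbol in seq:
        log_p += math.log(transitions[prev][symbol])
        prev = symbol
    return log_p


def sequence_prob(seq, transitions):
    """Return P(seq) by exponentiating the log probability."""
    return math.exp(sequence_log_prob(seq, transitions))


# With the made-up table above: 0.25 * 0.4 * 0.3 * 0.2
print(sequence_prob("CGGT", TRANSITIONS))
```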
Example of computing the probability from a Markov chain [Figure: state diagram with begin, A, C, G, T, and end states] • What is P(CGGT)?
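Assuming the model in the figure includes both a begin and an end state, P(CGGT) factors into one transition per step, written here in the $a_{ij}$ notation. The numeric values come from the edge labels on the diagram, which did not survive in this transcript, so only the symbolic form is given.

```latex
P(\mathrm{CGGT})
  = a_{\mathrm{begin},\mathrm{C}} \cdot a_{\mathrm{CG}} \cdot a_{\mathrm{GG}}
    \cdot a_{\mathrm{GT}} \cdot a_{\mathrm{T},\mathrm{end}}
```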
Learning Markov model parameters • Model parameters • Transition probabilities • Estimated via maximum likelihood
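Estimating model parameters • Assume we have some training data that we know was generated from a first-order Markov model • Need to estimate the transition probabilities $P(X_t \mid X_{t-1})$
Estimating model parameters • Zeroth order: $P(X = a) = \frac{n_a}{\sum_b n_b}$, where $n_a$ is the number of occurrences of character $a$ • First order: $P(X_t = x \mid X_{t-1} = \mathrm{G}) = \frac{n_{\mathrm{G}x}}{\sum_{x'} n_{\mathrm{G}x'}}$, where $n_{\mathrm{G}x}$ is the number of times $x$ follows G
Estimating parameters example • Assume the following training data: ACCGCGCTTAGCTTAGTGACTAGCCGTTAC • Fill in the zeroth and first order transition probability values [Tables to fill in, indexed by A, T, C, G; one entry is annotated "Do we really want to set this to 0?"]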
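The counting estimates for this training sequence can be checked with a short Python sketch. The function names are hypothetical, and rows with no observations are set to 0.0 here purely as a convention for the example.

```python
from collections import Counter


def zeroth_order_mle(seq, alphabet="ACGT"):
    """Maximum likelihood estimate of P(a): count of a divided by the sequence length."""
    counts = Counter(seq)
    total = sum(counts[a] for a in alphabet)
    return {a: counts[a] / total for a in alphabet}


def first_order_mle(seq, alphabet="ACGT"):
    """Maximum likelihood estimate of P(next | prev) from counts of adjacent pairs."""
    pair_counts = Counter(zip(seq, seq[1:]))
    probs = {}
    for prev in alphabet:
        row_total = sum(pair_counts[(prev, nxt)] for nxt in alphabet)
        probs[prev] = {
            # A row with no observations is left at 0.0 (a convention for this sketch).
            nxt: (pair_counts[(prev, nxt)] / row_total if row_total else 0.0)
            for nxt in alphabet
        }
    return probs


training = "ACCGCGCTTAGCTTAGTGACTAGCCGTTAC"
print(zeroth_order_mle(training))
print(first_order_mle(training)["A"])
```

In this training sequence A is never followed by A, so the estimate of P(A | A) is exactly 0 and any sequence containing AA would be assigned probability 0. That is the motivation for pseudocounts.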
Laplace estimates of parameters • Instead of estimating parameters strictly from the data, we could start with some prior belief for each • For example, we could use Laplace estimates: $P(a) = \frac{n_a + 1}{\sum_i (n_i + 1)}$ • where $n_i$ represents the number of occurrences of character $i$, and the added 1 is a pseudocount • Using Laplace estimates with the sequences • gccgcgcttg • gcttggtggc • tggccgttgc
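A minimal Python sketch of the zeroth-order Laplace estimate on these three sequences follows. The function name and the `pseudocount` parameter are assumptions made for the example; a pseudocount of 1 gives the Laplace estimate.

```python
from collections import Counter


def laplace_zeroth_order(seqs, alphabet="acgt", pseudocount=1):
    """Estimate P(a) as (n_a + pseudocount) / sum_i (n_i + pseudocount)."""
    counts = Counter("".join(seqs))
    denom = sum(counts[a] + pseudocount for a in alphabet)
    return {a: (counts[a] + pseudocount) / denom for a in alphabet}


seqs = ["gccgcgcttg", "gcttggtggc", "tggccgttgc"]
probs = laplace_zeroth_order(seqs)
for symbol, p in probs.items():
    print(symbol, round(p, 4))
```

The character a never occurs in these sequences, yet its estimate is 1/34 rather than 0, because each of the four characters contributes one pseudocount to the denominator of 30 + 4.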
Estimation for 1st order probabilities • Using Laplace estimates with the sequences for first order probabilities • gccgcgcttg • gcttggtggc • tggccgttgc
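The first-order case applies the same idea to each row of the transition table. In this sketch (hypothetical function name), adjacent pairs are counted within each sequence only, so no transition is counted across the boundary between two training sequences.

```python
from collections import Counter


def laplace_first_order(seqs, alphabet="acgt", pseudocount=1):
    """Estimate P(next | prev) as (n_{prev,next} + pc) / sum_x (n_{prev,x} + pc)."""
    pair_counts = Counter()
    for seq in seqs:
        # Count adjacent pairs within one sequence at a time.
        pair_counts.update(zip(seq, seq[1:]))
    probs = {}
    for prev in alphabet:
        denom = sum(pair_counts[(prev, nxt)] + pseudocount for nxt in alphabet)
        probs[prev] = {
            nxt: (pair_counts[(prev, nxt)] + pseudocount) / denom
            for nxt in alphabet
        }
    return probs


seqs = ["gccgcgcttg", "gcttggtggc", "tggccgttgc"]
probs = laplace_first_order(seqs)
print(probs["g"])  # c follows g in 7 of 12 cases, so P(c | g) = (7 + 1) / (12 + 4)
print(probs["a"])  # a never occurs, so its row falls back to the uniform 1/4
```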
Using Markov chains to classify sequences as CpG islands or not • CpG islands are isolated regions of the genome corresponding to high concentrations of CG dinucleotides • Range from 100-5000 bps • For example regions upstream of genes or gene promoters • Given a short sequence of DNA, how can we classify it as a CpG island?
Build a Markov chain model for CpG islands • Learn two Markov models • One from sequences that look like CpG (positive) • One from sequences that don’t look like CpG (negative) • Parameters estimated from ~60,000 nucleotides [Tables: transition probabilities for the CpG and not-CpG models] • G is much more likely to follow A in the CpG positive model
Using the Markov chains to classify a new sequence • Let $y$ denote a new sequence • To use the Markov models to classify $y$ we compute the log odds ratio $S(y) = \log \frac{P(y \mid \text{CpG})}{P(y \mid \text{not CpG})}$ • The larger the value of $S(y)$, the more likely $y$ is a CpG island • Note: must also normalize by the length
Applying the CpG Markov chain models • Is CGCGA a CpG island? • Un-normalized score = 3.3684 • Is CCTGG a CpG island? • Un-normalized score = 0.3746
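The log odds score can be sketched as a sum of per-transition log ratios. Several things here are assumptions for illustration: the two transition tables `POS` and `NEG` are made up (they are not the tables estimated from the ~60,000 nucleotides), the logarithm is taken base 2, and the first symbol is assumed equally likely under both models so that only transitions contribute.

```python
import math

# Made-up transition tables (illustration only); each row sums to 1.
POS = {  # "CpG" model
    "A": {"A": 0.2, "C": 0.3, "G": 0.4, "T": 0.1},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.3, "G": 0.4, "T": 0.1},
    "T": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
NEG = {  # "not CpG" model
    "A": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    "C": {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3},
    "G": {"A": 0.2, "C": 0.2, "G": 0.3, "T": 0.3},
    "T": {"A": 0.2, "C": 0.2, "G": 0.3, "T": 0.3},
}


def log_odds_score(y, pos, neg, normalize=False):
    """Sum log2(a+_{ij} / a-_{ij}) over adjacent pairs; optionally divide by len(y)."""
    score = sum(math.log2(pos[a][b] / neg[a][b]) for a, b in zip(y, y[1:]))
    return score / len(y) if normalize else score


for y in ["CGCG", "ATAT"]:
    print(y, log_odds_score(y, POS, NEG), log_odds_score(y, POS, NEG, normalize=True))
```

A positive score favors the CpG model and a negative score favors the background model. Dividing by the length makes scores of sequences of different lengths comparable.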
Extensions to Markov chains • Hidden Markov models (Next lectures) • Higher-order Markov models • Inhomogeneous Markov models
Order of a Markov model • Describes how much history we want to keep • First order Markov models are most common • Second order • nth order
Higher order Markov models • An nth order Markov model over alphabet $A$ is equivalent to a first order Markov model over $A^n$ • $A^n$ corresponds to the alphabet of n-tuples • For example • A 2nd order Markov model over A, T, G, C • and a 1st order Markov model over AA, AT, AG, AC, TA, TG, TC, TT, GA, GT, GC, GG, CA, CT, CC, CG • are equivalent • However, the higher the order, the more parameters we need to estimate
A first order Markov chain for a second-order Markov chain with {A,B} alphabet [Figure: state diagram over the states AA, AB, BA, BB]
Inhomogeneous Markov chains • So far the Markov chain has the same transition probability for the entire sequence • Sometimes it is useful to switch between Markov chains depending on where we are in the sequence • For example, recall the genetic code that specifies what triplets of bases code for an amino acid • There are different levels of redundancy for each of the first, second, and third codon positions • This could be modeled by three Markov chains
Inhomogeneous Markov chain • Let 1, 2 and 3 denote the three Markov chains, one per codon position • So our transition probabilities look like $a_{ij}^{(1)}, a_{ij}^{(2)}, a_{ij}^{(3)}$ • Let $x_1$ be in codon position 3 • Then $x_2$ is in codon position 1 and $x_3$ in position 2, so (using chain $k$ for the transition into codon position $k$) the probabilities of $x_2, x_3, \ldots$ would be $a_{x_1 x_2}^{(1)}, a_{x_2 x_3}^{(2)}, \ldots$ • Exercise: write down the terms for $x_5$ and $x_6$
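The equivalence between an nth order chain and a first order chain over n-tuples can be made concrete with a small Python sketch. All function names are hypothetical. A tuple state can only move to a tuple that overlaps it in n-1 symbols, which is why each state in the {A,B} example has exactly two outgoing transitions.

```python
from itertools import product


def tuple_states(alphabet, n):
    """All n-tuples over the alphabet: the states of the equivalent first order chain."""
    return ["".join(p) for p in product(alphabet, repeat=n)]


def to_tuple_sequence(seq, n):
    """Rewrite a sequence as its overlapping n-tuples."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def allowed_transitions(alphabet, n):
    """State s can move to state t only if the last n-1 symbols of s start t."""
    states = tuple_states(alphabet, n)
    return {s: [t for t in states if s[1:] == t[:-1]] for s in states}


def num_transition_probs(alphabet_size, n):
    """Number of transition probabilities in an nth order model: |A|^n rows of |A| entries."""
    return alphabet_size ** n * alphabet_size


print(tuple_states("AB", 2))
print(to_tuple_sequence("ABBA", 2))
print(allowed_transitions("AB", 2))
print([num_transition_probs(4, n) for n in (1, 2, 3)])
```

For DNA the table grows from 16 entries at first order to 64 at second order and 256 at third order, which is the cost of keeping more history.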
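A sketch of scoring a sequence with three position-specific chains follows. It rests on assumptions not stated in the slides: chain k is taken to be the one used for the transition into codon position k, the three tables are made up by the hypothetical helper `make_chain`, and the initial probabilities are uniform.

```python
import math


def make_chain(stay, alphabet="ACGT"):
    """Made-up transition table: probability `stay` of repeating a symbol, rest split evenly."""
    move = (1.0 - stay) / (len(alphabet) - 1)
    return {a: {b: (stay if a == b else move) for b in alphabet} for a in alphabet}


# Hypothetical chains for codon positions 1, 2 and 3 (illustration only).
CHAINS = {1: make_chain(0.4), 2: make_chain(0.7), 3: make_chain(0.1)}
INITIAL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def inhomogeneous_log_prob(seq, initial, chains, first_position=3):
    """Log probability when the chain used depends on the codon position being entered."""
    log_p = math.log(initial[seq[0]])
    position = first_position
    for prev, cur in zip(seq, seq[1:]):
        position = position % 3 + 1  # codon positions cycle 3 -> 1 -> 2 -> 3
        log_p += math.log(chains[position][prev][cur])
    return log_p


# x1 is in codon position 3, so the transitions use chains 1, 2, 3 in turn.
print(math.exp(inhomogeneous_log_prob("AAAA", INITIAL, CHAINS)))
```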
Summary • Markov models allow us to model dependencies in sequential data • Markov models are described by • states, each state corresponding to a letter in our observed alphabet • Transition probabilities between states • Parameter estimation can be done by counting the number of transitions between consecutive states (first order) • Laplace correction is often applied • Often used for classifying sequences as CpG islands or not • Extensions of Markov models • Order • Inhomogeneous Markov models