
Markov Models

Markov Models. BMI/CS 576 www.biostat.wisc.edu/bmi576.html Sushmita Roy sroy@biostat.wisc.edu Oct 23rd, 2012. Motivation for Markov models in computational biology.


Presentation Transcript


  1. Markov Models BMI/CS 576 www.biostat.wisc.edu/bmi576.html Sushmita Roy sroy@biostat.wisc.edu Oct 23rd, 2012 BMI/CS 576

  2. Motivation for Markov models in computational biology • there are many cases in which we would like to represent the statistical regularities of some class of sequences • genes • various regulatory sites in DNA (e.g. promoters) • proteins in a given family • etc. • Markov models are well suited to this type of task

  3. Example application • CpG islands • CG dinucleotides are rarer in eukaryotic genomes than expected given the marginal probabilities of C and G • but the regions upstream of genes are richer in CG dinucleotides than elsewhere – CpG islands • useful evidence for finding genes • could predict CpG islands with Markov chains • one to represent CpG islands • one to represent the rest of the genome
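
The claim that CG dinucleotides are rarer than the marginal probabilities of C and G would suggest can be checked with an observed/expected ratio. The following is a minimal sketch, assuming the expected count is approximated as n · P(c) · P(g); the function name `cpg_observed_expected` is hypothetical and not part of the lecture.

```python
def cpg_observed_expected(seq: str) -> float:
    """Observed CG dinucleotide count divided by the count expected
    if C and G occurred independently at their marginal frequencies."""
    seq = seq.lower()
    n = len(seq)
    n_c, n_g = seq.count("c"), seq.count("g")
    if n_c == 0 or n_g == 0:
        return 0.0
    observed = seq.count("cg")  # "cg" cannot overlap itself, so count() is exact
    expected = n_c * n_g / n    # approximately n * P(c) * P(g)
    return observed / expected


# Toy inputs, chosen only for illustration.
ratio_rich = cpg_observed_expected("cgcgcg")
ratio_even = cpg_observed_expected("ccgg")
```

A ratio well below 1 indicates CG depletion, and a ratio near or above 1 is the kind of signal expected inside a CpG island.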

  4. A Markov chain model [Figure: a begin state and four states a, c, g, t joined by transitions; callouts label a state, a transition, and the transition probabilities .38, .16, .34, .12]

  5. Markov chain models • can also have an end state; allows the model to represent • a distribution over sequences of different lengths • preferences for ending sequences with certain symbols [Figure: begin state, states a, c, g, t, and an end state]

  6. Markov chain models • a Markov chain model is defined by • a set of states • some states emit symbols • other states (e.g. the begin and end states) are silent • a set of transitions with associated probabilities • the transitions emanating from a given state define a distribution over the possible next states
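
The definition above maps naturally onto a nested dictionary. This is a minimal sketch, assuming a toy chain whose probabilities are invented for illustration; the names `transitions`, `SILENT`, and `is_valid_chain` are hypothetical.

```python
import math

DNA = "acgt"
SILENT = {"begin", "end"}  # silent states emit no symbol

# Hypothetical toy parameters (not from the slides): uniform start,
# and every emitting state ends the sequence with probability 0.1.
transitions = {"begin": {s: 0.25 for s in DNA}}
for state in DNA:
    row = {nxt: 0.225 for nxt in DNA}
    row["end"] = 0.1
    transitions[state] = row


def is_valid_chain(trans) -> bool:
    """True if the transitions leaving every state form a
    probability distribution over the possible next states."""
    return all(math.isclose(sum(row.values()), 1.0) for row in trans.values())


emitting_states = [s for s in transitions if s not in SILENT]
```

The end state has no outgoing transitions, so it does not appear as a key.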

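  7. Markov chain models • Let X be a sequence of random variables $X_1 \ldots X_L$ representing a biological sequence • from the chain rule of probability: $P(X) = P(X_L \mid X_{L-1}, \ldots, X_1)\, P(X_{L-1} \mid X_{L-2}, \ldots, X_1) \cdots P(X_1)$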

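  8. Markov chain models • from the chain rule we have $P(X) = P(X_L \mid X_{L-1}, \ldots, X_1)\, P(X_{L-1} \mid X_{L-2}, \ldots, X_1) \cdots P(X_1)$ • key property of a (1st order) Markov chain: the probability of each $X_i$ depends only on the value of $X_{i-1}$, so $P(X) = P(X_1) \prod_{i=2}^{L} P(X_i \mid X_{i-1})$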

  9. The probability of a sequence for a given Markov chain model [Figure: begin state, states a, c, g, t, and an end state]

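  10. Markov chain notation • the transition parameters can be denoted by $a_{x_{i-1} x_i}$ where $a_{x_{i-1} x_i} = P(X_i = x_i \mid X_{i-1} = x_{i-1})$ • similarly we can denote the probability of a sequence $x$ as $a_{B x_1} \prod_{i=2}^{L} a_{x_{i-1} x_i}$, where $a_{B x_1}$ represents the transition from the begin state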
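
The probability of a sequence is the begin transition multiplied by one transition per adjacent pair of symbols. The sketch below assumes a chain without an end state and uses invented parameters; `begin`, `trans`, and `sequence_probability` are hypothetical names.

```python
import math

DNA = "acgt"

# Hypothetical parameters, for illustration only.
begin = {s: 0.25 for s in DNA}  # transitions out of the begin state
trans = {prev: {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1} for prev in DNA}


def sequence_probability(x, begin, trans):
    """Probability of a non-empty sequence x under a 1st order chain:
    begin transition to x[0], then one transition per adjacent pair."""
    p = begin[x[0]]
    for prev, cur in zip(x, x[1:]):
        p *= trans[prev][cur]
    return p


p_cggt = sequence_probability("cggt", begin, trans)  # 0.25 * 0.4 * 0.4 * 0.1
```

For long sequences the product underflows, so summing log probabilities is the usual alternative.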

  11. Estimating the model parameters • Given some data, how can we determine the probability parameters of our model? • one approach: maximum likelihood estimation • given a set of data D • set the parameters $\theta$ to maximize $P(D \mid \theta)$ • i.e. make the data D look as likely as possible under the model

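  12. Maximum likelihood estimation • suppose we want to estimate the parameters P(a), P(c), P(g), P(t) • and we’re given the sequences accgcgctta, gcttagtgac, tagccgttac • then the maximum likelihood estimates are $P(a) = n_a / \sum_i n_i$, where $n_i$ is the number of occurrences of character $i$: P(a) = 6/30 = 0.2, P(c) = 9/30 = 0.3, P(g) = 7/30 ≈ 0.233, P(t) = 8/30 ≈ 0.267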
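
For this zeroth order model the maximum likelihood estimate of each parameter is a relative frequency. A minimal sketch on the three sequences from the slide follows; the function name `ml_estimates` is hypothetical, and exact fractions are used to keep the arithmetic visible.

```python
from collections import Counter
from fractions import Fraction


def ml_estimates(sequences, alphabet="acgt"):
    """Maximum likelihood estimate of P(ch): count of ch over total count."""
    counts = Counter("".join(sequences))
    total = sum(counts[ch] for ch in alphabet)
    return {ch: Fraction(counts[ch], total) for ch in alphabet}


estimates = ml_estimates(["accgcgctta", "gcttagtgac", "tagccgttac"])
```

A character that never occurs in the data gets an estimate of exactly 0 under this scheme.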

  13. Maximum likelihood estimation • suppose instead we saw the following sequences gccgcgcttg, gcttggtggc, tggccgttgc • then the maximum likelihood estimates are P(a) = 0/30 = 0, P(c) = 9/30 = 0.3, P(g) = 13/30 ≈ 0.433, P(t) = 8/30 ≈ 0.267 • do we really want to set P(a) to 0?

  14. A Bayesian approach • instead of estimating parameters strictly from the data, we could start with some prior belief for each • for example, we could use Laplace estimates $P(a) = (n_a + 1) / \sum_i (n_i + 1)$ • where $n_i$ represents the number of occurrences of character $i$ and the added 1 is a pseudocount • using Laplace estimates with the sequences gccgcgcttg, gcttggtggc, tggccgttgc: P(a) = (0 + 1)/34, P(c) = (9 + 1)/34, P(g) = (13 + 1)/34, P(t) = (8 + 1)/34
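
Laplace estimation adds a pseudocount of 1 to every character count before normalizing. A minimal sketch on the slide's sequences, with the hypothetical function name `laplace_estimates`:

```python
from collections import Counter
from fractions import Fraction


def laplace_estimates(sequences, alphabet="acgt"):
    """Laplace estimate of P(ch): (count + 1) / sum over characters of (count + 1)."""
    counts = Counter("".join(sequences))
    total = sum(counts[ch] + 1 for ch in alphabet)
    return {ch: Fraction(counts[ch] + 1, total) for ch in alphabet}


laplace = laplace_estimates(["gccgcgcttg", "gcttggtggc", "tggccgttgc"])
```

The unseen character a now receives a small nonzero probability instead of 0.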

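  15. A Bayesian approach • a more general form: m-estimates $P(a) = (n_a + p_a m) / (\sum_i n_i + m)$ • where $p_a$ is the prior probability of a and $m$ is the number of “virtual” instances • with m = 8 and uniform priors on the sequences gccgcgcttg, gcttggtggc, tggccgttgc: P(c) = (9 + 0.25 × 8)/(30 + 8) = 11/38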
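
An m-estimate spreads m virtual observations across the alphabet according to the prior. The sketch below assumes uniform priors as on the slide; `m_estimates` is a hypothetical name.

```python
from collections import Counter
from fractions import Fraction


def m_estimates(sequences, priors, m):
    """m-estimate of P(ch): (count + prior * m) / (total count + m)."""
    counts = Counter("".join(sequences))
    total = sum(counts[ch] for ch in priors)
    return {ch: (counts[ch] + priors[ch] * m) / (total + m) for ch in priors}


sequences = ["gccgcgcttg", "gcttggtggc", "tggccgttgc"]
uniform = {ch: Fraction(1, 4) for ch in "acgt"}
m8 = m_estimates(sequences, uniform, m=8)
m4 = m_estimates(sequences, uniform, m=4)  # one virtual instance per character
```

With a uniform prior over four characters, m = 4 adds one virtual instance per character and so reproduces the Laplace estimates.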

  16. Estimation for 1st order probabilities • to estimate a 1st order parameter, such as P(c|g), we count the number of times that c follows the history g in our given sequences • using Laplace estimates with the sequences gccgcgcttg, gcttggtggc, tggccgttgc: P(a|g) = (0 + 1)/(12 + 4), P(c|g) = (7 + 1)/(12 + 4), P(g|g) = (3 + 1)/(12 + 4), P(t|g) = (2 + 1)/(12 + 4)
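
First order estimation counts adjacent pairs rather than single characters. This minimal sketch counts pairs within each sequence (never across sequence boundaries) and applies a pseudocount of 1 per possible next character; `first_order_laplace` is a hypothetical name.

```python
from collections import Counter
from fractions import Fraction


def first_order_laplace(sequences, alphabet="acgt"):
    """Laplace estimates of P(cur | prev) from adjacent-pair counts."""
    pair_counts = Counter()
    for seq in sequences:
        pair_counts.update(zip(seq, seq[1:]))  # pairs inside one sequence only
    estimates = {}
    for prev in alphabet:
        total = sum(pair_counts[(prev, cur)] + 1 for cur in alphabet)
        estimates[prev] = {
            cur: Fraction(pair_counts[(prev, cur)] + 1, total) for cur in alphabet
        }
    return estimates


given_g = first_order_laplace(["gccgcgcttg", "gcttggtggc", "tggccgttgc"])["g"]
```

In these sequences g is followed by another character 12 times (c 7 times, g 3 times, t twice), which gives the denominators of 12 + 4.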

  17. Higher order Markov chains • the Markov property specifies that the probability of a state depends only on the value of the previous state • but we can build more “memory” into our states by using a higher order Markov model • in an nth order Markov model $P(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_1) = P(x_i \mid x_{i-1}, \ldots, x_{i-n})$

  18. Selecting the order of a Markov chain model • higher order models remember more “history” • additional history can have predictive value • example: • predict the next word in this sentence fragment “… the __” (duck, end, grain, tide, wall, …?) • now predict it given more history “… against the __” (duck, end, grain, tide, wall, …?) • “swim against the __” (duck, end, grain, tide, wall, …?)

  19. Selecting the order of a Markov chain model • but the number of parameters we need to estimate grows exponentially with the order • for modeling DNA we need on the order of $4^{n+1}$ parameters for an nth order model • the higher the order, the less reliable we can expect our parameter estimates to be • estimating the parameters of a 2nd order Markov chain from the complete genome of E. coli, we’d see each word > 72,000 times on average • estimating the parameters of an 8th order chain, we’d see each word ~ 5 times on average
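
The trade-off can be made concrete by dividing the genome length by the number of distinct words of length n + 1. This is a rough sketch: the genome length of about 4.64 million bases is an assumption not stated on the slide, and the names `num_words` and `mean_occurrences` are hypothetical.

```python
GENOME_LENGTH = 4_640_000  # assumed approximate E. coli genome size, in bases


def num_words(order, alphabet_size=4):
    """Distinct words of length order + 1: one per (history, next symbol) pair."""
    return alphabet_size ** (order + 1)


def mean_occurrences(order, genome_length=GENOME_LENGTH):
    """Average times each word is seen, treating the number of windows
    as roughly equal to the genome length."""
    return genome_length / num_words(order)


for order in (2, 8, 9):
    print(order, num_words(order), round(mean_occurrences(order), 1))
```

With these assumed inputs the 2nd order figure matches the slide, while the 8th order figure comes out closer to 18; an average near 5 corresponds to 10-letter words, i.e. a 9th order model.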

  20. Higher order Markov chains • an nth order Markov chain over some alphabet $A$ is equivalent to a first order Markov chain over the alphabet $A^n$ of n-tuples • example: a 2nd order Markov model for DNA can be treated as a 1st order Markov model over alphabet AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT • caveat: we process a sequence one character at a time [Figure: the sequence A C G G T is read as the overlapping states AC → CG → GG → GT]
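
The equivalence can be sketched by sliding a window of width n over the sequence, so consecutive tuple states overlap in n − 1 characters. The helper names `tuple_states` and `tuple_transition_probability`, and the probabilities in `p_next`, are hypothetical.

```python
def tuple_states(seq, n):
    """Overlapping n-tuples visited when reading seq one character at a time."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def tuple_transition_probability(src, dst, p_next):
    """First order probability of moving from tuple src to tuple dst.
    Only tuples that overlap correctly are reachable; the probability is
    then P(last symbol of dst | src), looked up in p_next[src]."""
    if src[1:] != dst[:-1]:
        return 0.0
    return p_next[src][dst[-1]]


states = tuple_states("ACGGT", 2)

# Hypothetical conditional distribution for the history "AC" only.
p_next = {"AC": {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}}
p_ac_cg = tuple_transition_probability("AC", "CG", p_next)
p_ac_gg = tuple_transition_probability("AC", "GG", p_next)
```

Each tuple state therefore has only four possible successors, not sixteen, which is why the parameter count stays at 4 per history.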

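  21. A fifth-order Markov chain [Figure: states are 5-mers (aaaaa, …, ctaca, ctacc, ctacg, ctact, …, gctac); the begin state reaches gctac with probability P(gctac); from gctac, the transition to ctaca has probability P(a|gctac) and the transition to ctacc has probability P(c|gctac)]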

  22. CpG islands as a classification task • train two Markov models: one to represent CpG island sequence regions, another to represent other sequence regions (null) • given a test sequence, use two models to • determine probability that sequence is a CpG island • classify the sequence (CpG or null) [Figure: two Markov chains, each with begin, a, c, g, t, and end states]

  23. Markov chains for discrimination • parameters estimated for CpG and null models • human sequences containing 48 CpG islands • 60,000 nucleotides [Tables: transition probabilities for the CpG model and the null model]

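  24. Markov chains for discrimination • using Bayes’ rule tells us $P(\text{CpG} \mid x) = \frac{P(x \mid \text{CpG})\, P(\text{CpG})}{P(x \mid \text{CpG})\, P(\text{CpG}) + P(x \mid \text{null})\, P(\text{null})}$ • if we don’t take into account prior probabilities of two classes ($P(\text{CpG})$ and $P(\text{null})$) then we just need to compare $P(x \mid \text{CpG})$ and $P(x \mid \text{null})$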
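
Comparing the two likelihoods is usually done as a log-odds score, and adding the priors turns the score into a posterior. The sketch below uses two invented transition tables (`CPG` and `NULL` are hypothetical and are not the parameters estimated in the lecture) and assumes a uniform begin distribution in both models.

```python
import math

# Hypothetical transition tables, for illustration only; each row sums to 1.
CPG = {
    "a": {"a": 0.2, "c": 0.3, "g": 0.4, "t": 0.1},
    "c": {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2},
    "g": {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2},
    "t": {"a": 0.1, "c": 0.3, "g": 0.4, "t": 0.2},
}
NULL = {
    "a": {"a": 0.3, "c": 0.2, "g": 0.3, "t": 0.2},
    "c": {"a": 0.3, "c": 0.3, "g": 0.1, "t": 0.3},
    "g": {"a": 0.2, "c": 0.2, "g": 0.3, "t": 0.3},
    "t": {"a": 0.2, "c": 0.2, "g": 0.3, "t": 0.3},
}


def log_likelihood(x, trans, begin=0.25):
    """log P(x | model) for a 1st order chain with a uniform begin transition."""
    ll = math.log(begin)
    for prev, cur in zip(x, x[1:]):
        ll += math.log(trans[prev][cur])
    return ll


def log_odds(x):
    """log of P(x | CpG) / P(x | null); positive favors the CpG model."""
    return log_likelihood(x, CPG) - log_likelihood(x, NULL)


def posterior_cpg(x, prior_cpg=0.5):
    """P(CpG | x) from Bayes' rule, written in terms of the log-odds score."""
    prior_ratio = (1.0 - prior_cpg) / prior_cpg
    return 1.0 / (1.0 + prior_ratio * math.exp(-log_odds(x)))
```

With equal priors the classification reduces to the sign of `log_odds(x)`; a smaller `prior_cpg` raises the score a sequence needs before it is called a CpG island.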

  25. Markov chains for discrimination • light bars represent negative sequences • dark bars represent positive sequences (i.e. CpG islands) • the actual figure here is not from a CpG island discrimination task, however • Figure from A. Krogh, “An Introduction to Hidden Markov Models for Biological Sequences” in Computational Methods in Molecular Biology, Salzberg et al. editors, 1998.

  26. Inhomogeneous Markov chains • in the Markov chain models we have considered so far, the probabilities do not depend on our position in a given sequence • in an inhomogeneous Markov model, we can have different distributions at different positions in the sequence • consider modeling codons in protein coding regions

  27. An inhomogeneous Markov chain [Figure: a begin state followed by three columns of states a, c, g, t, labeled pos 1, pos 2, and pos 3]
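
One way to sketch a codon-aware inhomogeneous chain is to keep a separate transition table per codon position and select the table by position modulo 3. Everything here is an assumption for illustration: the sequence is taken to start in frame, and the toy tables in `TABLES` ignore the previous symbol, which a trained model would not.

```python
import math

DNA = "acgt"


def constant_table(dist):
    """Toy transition table whose rows are all the same distribution."""
    return {prev: dict(dist) for prev in DNA}


# Hypothetical tables: TABLES[k] governs the transition INTO a symbol
# at codon position k + 1 (pos 1, pos 2, pos 3).
TABLES = [
    constant_table({"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}),
    constant_table({"a": 0.4, "c": 0.2, "g": 0.2, "t": 0.2}),
    constant_table({"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1}),
]
INITIAL = {s: 0.25 for s in DNA}


def inhomogeneous_probability(x, initial, tables):
    """P(x) where the transition table depends on the position in the codon."""
    p = initial[x[0]]
    for i in range(1, len(x)):
        table = tables[i % len(tables)]  # position-specific parameters
        p *= table[x[i - 1]][x[i]]
    return p


p_atg = inhomogeneous_probability("atg", INITIAL, TABLES)
```

A homogeneous chain is the special case in which all three tables are identical.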

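  28. Why we need an end state to define a distribution over varying length sequences • without an end state (begin → state 1 with probability 1.0; state 1 emits A with probability 0.6 and T with probability 0.4 and returns to itself with probability 1.0): P(A) = 0.6, P(T) = 0.4, P(AA) = 0.36, P(AT) = 0.24, P(TA) = 0.24, P(TT) = 0.16 • with an end state (state 1 returns to itself with probability 0.8 and moves to end with probability 0.2): P(A) = 0.12, P(T) = 0.08, P(AA) = 0.0576, P(AT) = 0.0384, P(TA) = 0.0384, P(TT) = 0.0256, P(L=1) = 0.2, P(L=2) = 0.16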
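
The numbers on this slide can be reproduced by multiplying the emission probabilities by the stay and end transitions. A minimal sketch, using the slide's parameters and the hypothetical names `prob_with_end` and `length_probability`:

```python
import math
from itertools import product

EMIT = {"A": 0.6, "T": 0.4}  # emission probabilities from the slide
P_STAY, P_END = 0.8, 0.2     # self-transition and transition to end


def prob_with_end(x):
    """P(x) with an end state: emissions, len(x) - 1 stays, then one end."""
    p = 1.0
    for ch in x:
        p *= EMIT[ch]
    return p * P_STAY ** (len(x) - 1) * P_END


def length_probability(length):
    """P(L = length), summed over every sequence of that length."""
    return sum(prob_with_end("".join(s)) for s in product("AT", repeat=length))


mass_up_to_10 = sum(length_probability(k) for k in range(1, 11))
```

With the end state the total mass over lengths 1 to k is 1 − 0.8^k, which approaches 1. Without it, the sequences of each single length already sum to 1, so the model defines no single distribution across lengths.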
