Hidden Markov Models For DNA Sequence Alignment

Rich Burns
CS 790 – Bioinformatics, Spring 2001
Wright State University

Presentation Outline
• Introduction
  • What do we want to know?
  • Why do we want to know?
  • How do we find this out?
• Hidden Markov Models (HMM)
  • What is an HMM?
  • How does an HMM work?

Presentation Outline 2
• Ground-Up Examples
  • Regular Expressions
  • Motif Example
• Profile HMM
• Sequence Alignment
• References

What do we want to know?
• Missing Children
  • Database searching for other members of a sequence family
• Multiple Alignments
  • Align a set of sequences and score their fit to the family

Why do we want to know?
• “The most important contribution of computational biology has been in the development of methods for extracting information from the biopolymer sequence databases via sequence comparison, characterization and classification… …Sequence alignment methodology is central to all of these methods” [Liu et al. 1999]

Why do we want to know?
• Sequence comparison methods help reveal
  • Information about structure and function
  • Information about the process of molecular evolution

How do we find out?
• Hidden Markov Models

What is an HMM?
• A statistical model (customizable)
• Describes a series of observations by a hidden stochastic process
• Defines a probability distribution over possible sequences
• Our case:
  • Observations = nucleotides
  • Series = sequence of observations

How does an HMM work?
• Consider regular expressions
  • Grep
• Consider the following DNA motif
    A C A - - - A T G
    T C A A C T A T C
    A C A C - - A G C
    A G A - - - A T C
    A C C G - - A T C

How does an HMM work? (Regular Expression)
• [AT] [CG] [AC] [ACGT]* A [TG] [GC]
    A C A - - - A T G
    T C A A C T A T C
    A C A C - - A G C
    A G A - - - A T C
    A C C G - - A T C

How does an HMM work?
• The regular expression can:
  • Determine whether or not the sequence in question fits the criteria of the search
• The regular expression cannot:
  • Determine how well the sequence in question fits the criteria of the search
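
The limitation above can be checked directly. The following is a minimal sketch using Python's standard `re` module; the gap-free spellings of the sequences and the variable names are assumptions made for illustration. It tests the five family members and the sequence T G C T - - A G G, which later slides call exceptional.

```python
import re

# The regular expression from the slide, with the spaces removed.
motif = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

# The five family members, written without gap characters.
family = ["ACAATG", "TCAACTATC", "ACACAGC", "AGAATC", "ACCGATC"]

# T G C T - - A G G without gaps: an unlikely member of the family.
exceptional = "TGCTAGG"

# fullmatch() answers only yes or no; it carries no notion of fit quality.
family_hits = [bool(motif.fullmatch(seq)) for seq in family]
exceptional_hit = bool(motif.fullmatch(exceptional))

print(family_hits, exceptional_hit)
```

Every sequence, including the exceptional one, is accepted, so the expression alone cannot rank candidates by how well they fit.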

Deriving the HMM
• Deriving the HMM from a known alignment
  • Statistics
  • Each column in the alignment generates a state
  • Count the occurrences of [ATGC] in each column to determine probabilities for each state
  • Insertions are trickier

Deriving the HMM
    A C A - - - A T G
    T C A A C T A T C
    A C A C - - A G C
    A G A - - - A T C
    A C C G - - A T C
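
The counting procedure can be sketched as follows. This is an illustration under stated assumptions, not necessarily the exact construction used in the talk: the three gapped columns are assumed to share a single insert state, the six ungapped columns each become one main state, and all names (`MATCH_COLS`, `frequencies`, and so on) are hypothetical.

```python
from collections import Counter

alignment = [
    "ACA---ATG",
    "TCAACTATC",
    "ACAC--AGC",
    "AGA---ATC",
    "ACCG--ATC",
]
MATCH_COLS = [0, 1, 2, 6, 7, 8]   # assumed: ungapped columns become main states
INSERT_COLS = [3, 4, 5]           # assumed: gapped columns share one insert state

def frequencies(symbols):
    """Turn an iterable of symbols into relative frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}

# Emission probabilities of each main state: per-column nucleotide frequencies.
match_emissions = [frequencies(seq[c] for seq in alignment) for c in MATCH_COLS]

# Nucleotides each sequence places in the insert region (gaps dropped).
inserted = ["".join(seq[c] for c in INSERT_COLS).replace("-", "") for seq in alignment]
insert_emissions = frequencies("".join(inserted))

# Transitions: how many sequences enter the insert state, and how often
# an inserted nucleotide is followed by another inserted nucleotide.
n_enter = sum(1 for s in inserted if s)
n_inserted = sum(len(s) for s in inserted)
p_enter_insert = n_enter / len(alignment)
p_stay_insert = (n_inserted - n_enter) / n_inserted
p_leave_insert = n_enter / n_inserted

print(match_emissions[0], insert_emissions, p_enter_insert, p_stay_insert)
```

The insert state is the tricky part because its statistics pool three columns and also need transition counts, whereas a main state needs only one column of nucleotide counts.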

Using the HMM
• Remember the goal:
  • How well does the given sequence fit the family?
• Let’s try it
  • Exceptional Sequence: T G C T - - A G G
  • Consensus Sequence: A C A C - - A T C

Using the HMM
• Exceptional Sequence
  • P(TGCT--AGG) = (.2*1)*(.2*1)*(.2*.6)*(.2*.6)*(1*1)*(.2*1)*(.2) ≈ 0.0023 × 10^-2
• Consensus Sequence
  • P(ACAC--ATC) = (.8*1)*(.8*1)*(.8*.6)*(.4*.6)*(1*1)*(.8*1)*(.8) ≈ 4.7 × 10^-2
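
The two products can be reproduced with a short script. This is a sketch, not the presenter's code: the parameter values are those obtained by counting columns of the example alignment (they agree with the factors in the products above), the single shared insert state is an assumption, and `sequence_probability` is a hypothetical helper name.

```python
# Emission probabilities of the six main states, in order.
MATCH_EMISSIONS = [
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"T": 0.8, "G": 0.2},
    {"C": 0.8, "G": 0.2},
]
# Emission probabilities of the single insert state (assumed structure).
INSERT_EMISSIONS = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}
P_ENTER_INSERT, P_SKIP_INSERT = 0.6, 0.4
P_STAY_INSERT, P_LEAVE_INSERT = 0.4, 0.6

def sequence_probability(aligned):
    """Probability of a 9-character row: 3 main, 3 insert slots, 3 main."""
    head, middle, tail = aligned[:3], aligned[3:6].replace("-", ""), aligned[6:]
    p = 1.0
    for state, base in zip(MATCH_EMISSIONS[:3], head):
        p *= state.get(base, 0.0)          # transitions between main states are 1
    if middle:
        p *= P_ENTER_INSERT
        for i, base in enumerate(middle):
            p *= INSERT_EMISSIONS.get(base, 0.0)
            # Loop back for another insertion, or leave after the last one.
            p *= P_LEAVE_INSERT if i == len(middle) - 1 else P_STAY_INSERT
    else:
        p *= P_SKIP_INSERT
    for state, base in zip(MATCH_EMISSIONS[3:], tail):
        p *= state.get(base, 0.0)
    return p

p_consensus = sequence_probability("ACAC--ATC")
p_exceptional = sequence_probability("TGCT--AGG")
print(p_consensus, p_exceptional)
```

The consensus sequence comes out near 0.047 and the exceptional sequence near 0.000023, a difference of roughly three orders of magnitude.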

Using the HMM

Problem with Probability
• Exceptional Sequence
  • P(TGCT--AGG) = (.2*1)*(.2*1)*(.2*.6)*(.2*.6)*(1*1)*(.2*1)*(.2) ≈ 0.0023 × 10^-2
• Consensus Sequence
  • P(ACAC--ATC) = (.8*1)*(.8*1)*(.8*.6)*(.4*.6)*(1*1)*(.8*1)*(.8) ≈ 4.7 × 10^-2
• Sequence length dependent
  • Not always a good score to use
• Penalizes insertions, favors deletions
  • Bias: who’s to say that insertions are bad and deletions are good?
• Remedy: log-odds

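Log-odds
• Log-odds is computed as: log-odds(S) = log( P(S) / 0.25^L ) = log P(S) − L log 0.25
  • P(S): probability of the sequence under the HMM, same as before
  • 0.25^L: null model for a sequence of length L
    • Considers the overall sequence of nucleotides as random
    • Better estimate: use the overall frequency of nucleotides in the organism’s genome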

Log-odds
• Consensus Sequence
  • LO(ACACATC) = 1.16+0+1.16+0+1.16-0.51+0.47-0.51+1.39+0+1.16+0+1.16 = 6.64
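
As a cross-check, the same score can be computed from the whole-sequence probability rather than term by term. The sketch below assumes the natural logarithm (which is what the individual terms above correspond to, e.g. ln(0.8/0.25) ≈ 1.16) and a uniform 0.25 background; `log_odds` is a hypothetical helper name.

```python
import math

def log_odds(p_sequence, length, background=0.25):
    """log( P(S) / background**length ), using the natural logarithm."""
    return math.log(p_sequence) - length * math.log(background)

# Non-unit factors of P(ACAC--ATC): five emissions of 0.8, one insert
# emission of 0.4, and two transitions of 0.6.
p_consensus = 0.8**5 * 0.4 * 0.6**2

# Non-unit factors of P(TGCT--AGG): six emissions of 0.2 and two transitions of 0.6.
p_exceptional = 0.2**6 * 0.6**2

# Both sequences contain 7 nucleotides once the gaps are removed.
lo_consensus = log_odds(p_consensus, 7)
lo_exceptional = log_odds(p_exceptional, 7)
print(round(lo_consensus, 2), round(lo_exceptional, 2))
```

The consensus score comes out at about 6.65, matching the 6.64 above up to rounding of the individual terms, while the exceptional sequence scores slightly below zero, meaning the null model explains it about as well as the HMM does.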

Profile HMM
• Much more complex structure
  • No numerical example
• Designed to allow position-dependent gap penalties

Profile HMM
• Bottom row: main (match) states
• Middle row: insert states
• Top row: delete states

A drawback and pseudocounts
• Dangerous to estimate a probability distribution from just a few examples
  • All professors are interested in bioinformatics
  • All computers run Windows
• Pseudocount = fake count
  • Pretend you saw a nucleotide in a position even though it wasn’t there; this allows for the small possibility that something other than what you have observed may occur

How pseudocounts help
• If, for instance, you have only the first 2 sequences and you are looking at sequence 4
  • P(4) = .5*1*0*1*… = 0
    A C A - - - A T G
    T C A A C T A T C
    A C A C - - A G C
    A G A - - - A T C
    A C C G - - A T C
• When in fact we already know that sequence 4 is part of the same family
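
A minimal sketch of the effect, assuming the simplest scheme of adding the same pseudocount to every nucleotide (the slides do not say which scheme was used; `column_probabilities` and the pseudocount value of 1 are illustrative choices). It looks at the second alignment column, where the first two sequences both have C and sequence 4 has G.

```python
from collections import Counter

BASES = "ACGT"

def column_probabilities(column, pseudocount=0.0):
    """Nucleotide probabilities for one column, with an optional pseudocount."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(BASES)
    return {b: (counts[b] + pseudocount) / total for b in BASES}

# Second column of the first two sequences, ACA---ATG and TCAACTATC.
column2 = ["C", "C"]

raw = column_probabilities(column2)                       # plain frequencies
smoothed = column_probabilities(column2, pseudocount=1.0) # add-one pseudocounts

print(raw["G"], smoothed["G"])
```

Without pseudocounts the G in sequence 4 gets probability 0 and wipes out the whole product; with a pseudocount of 1 it gets 1/6, so sequence 4 receives a small but non-zero probability.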

Multiple alignments from unaligned sequences
• Start with a model of random probabilities
  • Or a reasonable guess if one is available
• Build a model from this alignment
• Use the alignment to improve the probabilities
  • May lead to a slightly different alignment
• Iterate; stop when the alignment no longer changes

Multiple alignment algorithms
• Viterbi
• Forward-Backward
• Baum-Welch
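
The slide only names the algorithms. As a rough illustration of the first, here is a generic Viterbi decoder for a small discrete HMM, which finds the most probable state path for an observed sequence. The two-state "AT-rich / GC-rich" model and all of its numbers are invented for this sketch and do not come from the talk; a profile HMM would need the same recursion extended to handle silent delete states.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its log probability."""
    def lg(x):
        return math.log(x) if x > 0 else float("-inf")

    # best[s]: log probability of the best path ending in state s.
    best = {s: lg(start_p[s]) + lg(emit_p[s][obs[0]]) for s in states}
    back = []                                # back-pointers, one dict per step
    for symbol in obs[1:]:
        new_best, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[r] + lg(trans_p[r][s]))
            new_best[s] = best[prev] + lg(trans_p[prev][s]) + lg(emit_p[s][symbol])
            pointers[s] = prev
        best = new_best
        back.append(pointers)

    # Trace the back-pointers from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

# Toy model (assumed values): two hidden states with sticky transitions.
states = ["AT", "GC"]
start_p = {"AT": 0.5, "GC": 0.5}
trans_p = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
emit_p = {
    "AT": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}

path, log_p = viterbi("AATTGGCC", states, start_p, trans_p, emit_p)
print(path, log_p)
```

Working in log space avoids numerical underflow on long sequences, which is the same reason log-odds scores are added rather than probabilities multiplied.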

Advantages of HMMs
• Built on a formal probabilistic basis
  • Can use Bayesian probability theory to guide the scoring parameters
  • Probability theory allows an HMM to be trained from unaligned sequences if the alignment is not known or trusted
• Consistent theory behind gap/insertion penalties
• Less skill and intervention needed to train a good HMM vs. a hand-constructed profile
• Can make libraries of hundreds of profile HMMs and apply them on a large scale (whole genome)

Drawbacks of HMMs
• Do not capture higher-order correlations
  • Assumes identity of a particular position is independent of the identity of all other positions
• Scoring by probability / training heuristics
  • Pseudocounts

References
• Salzberg et al., Computational Methods in Molecular Biology (Chapter 4: Krogh), Elsevier, 1998
  • http://www.cs.jhu.edu/~salzberg/compbio-book.html
  • http://www.cbs.dtu.dk/krogh/refs.html
• Rabiner, L. R., A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77:257-286, 1989
  • Good introduction to HMMs
• Krogh, et al., Hidden Markov Models in Computational Biology (applications to protein modeling). J. Mol. Biol. (1994) 235, 1501-1531
  • Krogh is a name I saw a lot in this area
• Liu, J. S., et al., Markovian Structures in Biological Sequence Alignments, Journal of the American Statistical Association, March 1999, Vol. 94, No. 445
• HMMER user’s guide: http://be.embnet.org/HMMERman/node9.html
