
Hidden Markov Models








  1. Hidden Markov Models: Fundamentals and applications to bioinformatics.

  2. Markov Chains • Given a finite discrete set S of possible states, a Markov chain process occupies one of these states at each unit of time. • The process either stays in the same state or moves to some other state in S. • This occurs in a stochastic way, rather than in a deterministic one. • The process is memoryless (the next state depends only on the current state, not on the earlier history) and time homogeneous (the transition probabilities do not change over time).

  3. Transition Matrix • Let S = {S1, S2, S3}. A Markov chain is described by a table (matrix) of transition probabilities: entry (i, j) is the probability of moving from state Si to state Sj in one unit of time, so each row sums to 1.

  4. A simple example • Consider a 3-state Markov model of the weather. We assume that once a day the weather is observed as being one of the following: rainy or snowy, cloudy, sunny. • We postulate that on day t the weather is characterized by a single one of the three states above, and give ourselves a 3×3 transition probability matrix A over these states.

  5. - 2 - • Given that the weather on day 1 is sunny, what is the probability that the weather for the next 7 days will be “sun-sun-rain-rain-sun-cloudy-sun”?
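
A minimal Python sketch of this calculation. The numerical values of A below are an assumption (chosen so that the sunny state's self-transition probability is 0.8, matching the expected run of 5 sunny days on the next slide); the probability of the sequence is simply the product of the corresponding transition probabilities, conditioned on day 1 being sunny.

```python
# Assumed 3-state weather transition matrix (illustrative values only).
A = {
    "rain":   {"rain": 0.4, "cloudy": 0.3, "sunny": 0.3},
    "cloudy": {"rain": 0.2, "cloudy": 0.6, "sunny": 0.2},
    "sunny":  {"rain": 0.1, "cloudy": 0.1, "sunny": 0.8},
}

def sequence_probability(states, A):
    """P(states[1], ..., states[-1] | states[0]) for a Markov chain."""
    p = 1.0
    for prev, nxt in zip(states, states[1:]):
        p *= A[prev][nxt]
    return p

# Day 1 is sunny; the next 7 days are sun-sun-rain-rain-sun-cloudy-sun.
days = ["sunny", "sunny", "sunny", "rain", "rain", "sunny", "cloudy", "sunny"]
print(sequence_probability(days, A))   # 1.536e-04 with the assumed A
```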

  6. - 3 - • Given that the model is in a known state Si, what is the probability it stays in that state for exactly d days? • The answer is the geometric probability p_i(d) = (a_ii)^(d−1) (1 − a_ii). • Thus the expected number of consecutive days in the same state is Σ_{d≥1} d · p_i(d) = 1 / (1 − a_ii). • So the expected number of consecutive sunny days, according to the model, is 5.
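
A quick numeric check of these two formulas, assuming a self-transition probability a_ii = 0.8 for the sunny state (the value implied by the expected run of 5 sunny days):

```python
# Expected duration of a stay in a state with self-transition probability 0.8.
a_ii = 0.8
p = lambda d: a_ii ** (d - 1) * (1 - a_ii)     # P(stay exactly d days)
print(sum(d * p(d) for d in range(1, 5000)))   # ~ 1 / (1 - a_ii) = 5.0
```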

  7. Hidden? • What if each state does not correspond to an observable (physical) event? What if the observation is a probabilistic function of the state? • To clarify, let us analyze another simple example, before formally defining Hidden Markov Models, or simply HMMs. • The Urn and Ball Model

  8. Elements of an HMM An HMM is characterized by the following: • N, the number of states in the model. • M, the number of distinct observation symbols per state. • The state transition probability distribution A = {a_ij}, where a_ij = P[q_{t+1} = Sj | q_t = Si]. • The observation symbol probability distribution in state Sj, B = {b_j(k)}, where b_j(k) = P[O_t = v_k | q_t = Sj] is the probability that the k-th observation symbol is emitted at time t, given that the model is in state Sj. • The initial state distribution π = {π_i}, where π_i = P[q_1 = Si].
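
These elements can be written down directly in code; the concrete numbers in the sketch below are illustrative assumptions, not values from the slides.

```python
import numpy as np

N, M = 2, 3                          # number of states, number of observation symbols
A = np.array([[0.7, 0.3],            # a_ij = P(q_{t+1} = S_j | q_t = S_i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],       # b_j(k) = P(O_t = v_k | q_t = S_j)
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])            # pi_i = P(q_1 = S_i)

# Each row of A and B, and pi itself, must be a probability distribution.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```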

  9. Three Basic Problems for HMMs • Given the observation sequence O = O1 O2 … OT and a model λ = (A, B, π), how do we efficiently compute P(O | λ)? • Given the observation sequence O and a model λ, how do we choose a corresponding state sequence Q = q1 q2 … qT which is optimal in some meaningful sense? • How do we adjust the model parameters to maximize P(O | λ)?

  10. Solution to Problem (1) • Given an observed output sequence O, we have that P[O | λ] = Σ over all state sequences Q = q1 q2 … qT of π_{q1} b_{q1}(O1) a_{q1 q2} b_{q2}(O2) ··· a_{q_{T−1} q_T} b_{q_T}(OT). • This sum has N^T terms, each a product of roughly 2T factors, so the total number of operations is on the order of 2T · N^T. • Fortunately, there is a much more efficient algorithm, called the forward algorithm.
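
A brute-force sketch of this direct sum, enumerating all N^T state sequences; it is feasible only for tiny T, which is exactly the point. The small model parameters are assumed for illustration.

```python
from itertools import product
import numpy as np

A  = np.array([[0.7, 0.3], [0.4, 0.6]])          # assumed toy parameters
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
O  = [0, 2, 1]                                   # observation symbol indices
N, T = A.shape[0], len(O)

total = 0.0
for Q in product(range(N), repeat=T):            # all N^T state sequences
    p = pi[Q[0]] * B[Q[0], O[0]]
    for t in range(1, T):
        p *= A[Q[t - 1], Q[t]] * B[Q[t], O[t]]
    total += p
print(total)                                     # P(O | lambda)
```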

  11. The Forward Algorithm • It focuses on the calculation of the quantity α_t(i) = P(O1 O2 … Ot, q_t = Si | λ), which is the joint probability that the sequence of observations seen up to and including time t is O1,…,Ot, and that the state of the HMM at time t is Si. Once these quantities are known, P(O | λ) = Σ_i α_T(i).

  12. …continuation • The calculation of the α_t(i)'s is by induction on t, starting from α_1(i) = π_i b_i(O1). From the definition of α we get the recursion α_{t+1}(j) = [ Σ_i α_t(i) a_ij ] b_j(O_{t+1}).
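
A minimal sketch of this recursion, using the same assumed toy parameters as above; it returns the α table and P(O | λ), which matches the brute-force sum.

```python
import numpy as np

def forward(O, A, B, pi):
    N, T = A.shape[0], len(O)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                      # alpha_1(i) = pi_i * b_i(O_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]  # alpha_{t+1}(j) recursion
    return alpha, alpha[-1].sum()                   # P(O | lambda) = sum_i alpha_T(i)

A  = np.array([[0.7, 0.3], [0.4, 0.6]])             # assumed toy parameters
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
alpha, prob = forward([0, 2, 1], A, B, pi)
print(prob)
```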

  13. Backward Algorithm • Another approach is the backward algorithm. • Specifically, we calculate β_t(i) = P(O_{t+1} O_{t+2} … OT | q_t = Si, λ) by the formula β_t(i) = Σ_j a_ij b_j(O_{t+1}) β_{t+1}(j), with β_T(i) = 1. • Again, by induction one can find the β_t(i)'s starting with the value t = T − 1, then for the value t = T − 2, and so on, eventually working back to t = 1.
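
A matching sketch of the backward recursion (same assumed toy parameters); combining β_1 with the initial distribution and the first emission recovers the same P(O | λ) as the forward pass.

```python
import numpy as np

def backward(O, A, B):
    N, T = A.shape[0], len(O)
    beta = np.ones((T, N))                            # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])  # beta_t(i) recursion
    return beta

A  = np.array([[0.7, 0.3], [0.4, 0.6]])               # assumed toy parameters
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
O  = [0, 2, 1]
beta = backward(O, A, B)
print((pi * B[:, O[0]] * beta[0]).sum())              # P(O | lambda), same as forward
```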

  14. Solution to Problem (2) • Given an observed sequence O = O1,…,OT of outputs, we want to compute efficiently a state sequence Q = q1,…,qT that has the highest conditional probability given O. • In other words, we want to find a Q that makes P[Q | O] maximal. • There may be many Q’s that make P[Q | O] maximal. We give an algorithm to find one of them.

  15. The Viterbi Algorithm • It is divided in two steps. First it finds max_Q P[Q | O], and then it backtracks to find a Q that realizes this maximum. • First define, for arbitrary t and i, δ_t(i) to be the maximum probability of all ways to end in state Si at time t and have observed sequence O1 O2 … Ot. • Then max_Q P[Q and O] = max_i δ_T(i).

  16. - 2 - • But P[Q | O] = P[Q and O] / P[O]. • Since the denominator on the RHS does not depend on Q, the Q that maximizes P[Q | O] is the same Q that maximizes P[Q and O]. • We calculate the δ_t(i)'s inductively: δ_1(i) = π_i b_i(O1) and δ_{t+1}(j) = [ max_i δ_t(i) a_ij ] b_j(O_{t+1}).

  17. - 3 - • Finally, we recover the q_t's as follows. Define S(T) = argmax_i δ_T(i) and put q_T = S(T). This is the last state in the state sequence desired. The remaining q_t for t < T are found recursively, working backwards, by defining S(t) = argmax_i [ δ_t(i) a_{i,S(t+1)} ] and putting q_t = S(t).
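
A sketch of the two steps just described: the forward pass fills δ together with argmax backpointers, and the backtracking pass recovers one optimal state sequence. The toy parameters are again assumptions.

```python
import numpy as np

def viterbi(O, A, B, pi):
    N, T = A.shape[0], len(O)
    delta = np.zeros((T, N))                  # delta_t(i)
    psi = np.zeros((T, N), dtype=int)         # argmax backpointers
    delta[0] = pi * B[:, O[0]]
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A     # trans[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, O[t]]
    # Backtrack from the best final state.
    q = np.zeros(T, dtype=int)
    q[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]
    return q, delta[-1].max()                 # optimal path, max_Q P[Q and O]

A  = np.array([[0.7, 0.3], [0.4, 0.6]])       # assumed toy parameters
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
path, prob = viterbi([0, 2, 1], A, B, pi)
print(path, prob)
```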

  18. Solution to Problem (3) • We are given a set of observed data from an HMM for which the topology is known. We wish to estimate the parameters of that HMM. • We briefly describe the intuition behind the Baum-Welch method of parameter estimation. • Assume that the observation alphabet (of size M) and the number of states N are fixed at the outset. • The data we use to estimate the parameters constitute a set of observed sequences {O(d)}.

  19. The Baum-Welch Algorithm • We start by setting the parameters π_i, a_ij, b_i(k) at some initial values. • We then calculate, using these initial parameter values: 1) π_i* = the expected proportion of times in state Si at the first time point, given {O(d)}.

  20. - 2 - 2) a_ij* = E[N_ij | {O(d)}] / E[N_i | {O(d)}], and 3) b_i*(k) = E[N_i(k) | {O(d)}] / E[N_i | {O(d)}], where N_ij is the random number of times q_t(d) = Si and q_{t+1}(d) = Sj for some d and t; N_i is the random number of times q_t(d) = Si for some d and t; and N_i(k) equals the random number of times q_t(d) = Si and it emits symbol k, for some d and t.

  21. Upshot • It can be shown that if λ = (π_i, a_ij, b_i(k)) is substituted by λ* = (π_i*, a_ij*, b_i*(k)), then P[{O(d)} | λ*] ≥ P[{O(d)} | λ], with equality holding if and only if λ* = λ. • Thus successive iterations continually increase the probability of the data, given the model. Iterations continue until a local maximum of the probability is reached.
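
A sketch of a single Baum-Welch re-estimation step for one observed sequence, written in terms of the forward and backward tables from the earlier slides; the expected posterior counts play the role of N_ij, N_i and N_i(k) above. The toy parameters are assumptions, and in practice the step is iterated until P(O | λ) stops improving.

```python
import numpy as np

def baum_welch_step(O, A, B, pi):
    O = np.asarray(O)
    N, T = A.shape[0], len(O)
    # Forward and backward tables.
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    prob = alpha[-1].sum()                        # P(O | lambda)
    # gamma[t, i] = P(q_t = S_i | O);  xi[t, i, j] = P(q_t = S_i, q_{t+1} = S_j | O)
    gamma = alpha * beta / prob
    xi = (alpha[:-1, :, None] * A[None] * (B[:, O[1:]].T * beta[1:])[:, None, :]) / prob
    # Re-estimates: ratios of expected counts.
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[O == k].sum(axis=0) / gamma.sum(axis=0)
    return A_new, B_new, pi_new, prob

A  = np.array([[0.7, 0.3], [0.4, 0.6]])           # assumed initial parameters
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
A, B, pi, prob = baum_welch_step([0, 2, 1, 0, 1], A, B, pi)
print(prob)
```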

  22. Applications

  23. Some preliminary remarks • Sequence alignment is useful for discovering functional, structural, and evolutionary information in biological research. • Different metrics (or notions of distance) could be defined to compare sequences. • Mathematician Peter Sellers (1974) showed that if a sequence alignment is formulated in terms of distances instead of similarity, a biologically more appealing interpretation of gaps is possible. • The latter is an evolution-motivated definition, relying on the concept of ancestry.

  24. Modeling Protein Families • The states of our HMM will be divided into match states, insert states and delete states. • It is useful to include an initial state and a final one, and we assume that no match or delete state is visited more than once. • The alphabet M consists of the twenty amino acids together with one dummy symbol representing "delete". Delete states output the dummy symbol only. • Each insert and match state has its own distribution over the 20 amino acids, and does not emit the dummy symbol.

  25. - 2 - • If the emission probabilities for the match and insert states are uniform over the 20 amino acids, the model will produce random sequences not having much in common. • If each state emits one specific amino acid with probability 1, then the model will always produce the same sequence. • Somewhere in between these two extremes, the parameters can be set so that the model is interesting.

  26. - 3 - • Each choice of parameters produces a different family of sequences. • This family can be rather "tight", or it can be rather "loose". • The tightness may also occur only locally, in some regions of the sequences. • Allowing gap penalties and substitution probabilities to vary along the sequences reflects biological reality better.

  27. - 4 - • Dynamic programming and BLAST are essential for certain applications, but HMMs are more efficient for modeling large families of sequences. • The HMM model is sufficiently flexible to model the varying features of a protein along its length. • The model described has proven in practice to provide a good compromise between flexibility and tractability. Such HMMs are called profile HMMs.

  28. - 5 - • All applications start with training. • This estimation procedure uses the Baum-Welch algorithm. • The model is chosen to have length equal to the average length of a sequence in the training set, and all parameters are initialized by using uniform distributions.

  29. Multiple Sequence Alignment • The msa of a set of sequences may be viewed as an evolutionary history of the sequences. • HMMs often provide an msa as good as, if not better than, that produced by other methods. • The approach is well grounded in probability theory. • No sequence ordering is required. • Insertion/deletion penalties are not needed. • Experimentally derived information may be incorporated.

  30. Description • In this section we describe how to use the theory of the previous section to compute msa for a set of sequences. • The sequences to be aligned are used as the training data, to train the parameters of the model. • For each sequence, the Viterbi algorithm is then used to determine a path most likely to have produced that sequence.

  31. - 2 - • Consider the sequences CAEFDDH and CDAEFPDDH. • Suppose the model has length 10 and their most likely paths through the model are m0 m1 m2 m3 m4 d5 d6 m7 m8 m9 m10 and m0 m1 i1 m2 m3 m4 d5 m6 m7 m8 m9 m10. • The alignment induced is found by aligning positions that were generated by the same match state. This leads to the alignment (a sketch of the computation follows below):
C-AEF-DDH
CDAEFPDDH
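
A small sketch of how this induced alignment can be computed from the two paths. It assumes (consistently with the residue counts in the example) that m0 and m10 are non-emitting begin and end states; residues emitted by the same match state share a column, insert emissions get gaps in the other sequence, and columns in which every sequence used a delete state are dropped.

```python
L = 10                                            # model length
seqs  = ["CAEFDDH", "CDAEFPDDH"]
paths = ["m0 m1 m2 m3 m4 d5 d6 m7 m8 m9 m10".split(),
         "m0 m1 i1 m2 m3 m4 d5 m6 m7 m8 m9 m10".split()]

def emissions(seq, path):
    """Map each visited match position to its residue; collect insert emissions."""
    match, insert = {}, {}
    residues = iter(seq)
    for state in path:
        kind, pos = state[0], int(state[1:])
        if kind == "m" and 0 < pos < L:           # emitting match state
            match[pos] = next(residues)
        elif kind == "i":                         # insert state (may emit repeatedly)
            insert.setdefault(pos, []).append(next(residues))
    return match, insert

emitted = [emissions(s, p) for s, p in zip(seqs, paths)]

columns = []
for j in range(1, L):
    col = [m.get(j, "-") for m, _ in emitted]     # match column j ('-' = delete)
    if any(c != "-" for c in col):
        columns.append(col)
    inserts = [i.get(j, []) for _, i in emitted]  # insert columns after position j
    for k in range(max(map(len, inserts))):
        columns.append([s[k] if k < len(s) else "-" for s in inserts])

for row in zip(*columns):
    print("".join(row))                           # C-AEF-DDH / CDAEFPDDH
```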

  32. Pfam • Pfam is a web-based resource maintained by the Sanger Center http://www.sanger.ac.uk/Pfam • Pfam uses the basic theory described above to determine protein domains in a query sequence. • Suppose that a new protein is obtained for which no information is available except the raw sequence. • We wish to “annotate” this sequence.

  33. - 2 - • The typical starting point is a BLAST search. • Pfam returns a family of protein domains, which enriches the information obtained by a BLAST search. • The domains in Pfam are determined based on expert knowledge, sequence similarity, and other protein family databases. • Currently, Pfam contains more than 2000 domains.

  34. - 3 - • For each domain a set of examples of this domain is selected. • The sequences representing each domain are put into an alignment, and the alignments themselves are used to set the parameters. • Recall that an alignment implies for each sequence in the alignment a path through the HMM, as in the previous sections.

  35. - 4 - • The proportion of times these paths take a given transition, or emit a given symbol, is used to estimate the transition and the emission probabilities. • Given the HMMs for all the domains, a query sequence is then run past each one using the forward algorithm. • When a portion of the query sequence has a probability of having been produced by one of these HMMs that exceeds a certain cutoff, the corresponding domain is reported.

  36. Gene Finding • Currently, a popular and successful gene finder for human DNA sequences is GENSCAN (Burge et al. 1997.) • It is based on a generalization of HMMs, called Semihidden Markov Models. • The algorithms involved in this model are an order of magnitude more complex than for a regular HMM. • The gene-finding application requires a generalization of the Viterbi algorithm.

  37. - 2 - • Burge (1997) observed that if the lengths of the long intergenic regions can be taken as having geometric distributions, and if the sequences in these regions are generated in an approximately i.i.d. fashion, then the algorithm can be adjusted so that practical running times can be obtained.

  38. Final Remarks • HMMs have been used to model alignments of three-dimensional structure in proteins (Stultz et al. 1993; Hubbard and Park 1995; Di Francesco et al. 1997, 1999; FORREST Web server at http://absapha.dcrt.nih.gov:8008/). • In one example of this approach, the models are trained on patterns of α helices, β strands, tight turns, and loops in specific structural classes, which then may be used to provide the most probable structure and structural class of a protein.

  39. Well… those weren't the final remarks • A version of GeneMark (Borodovsky and McIninch 1993) called GeneMark.HMM uses a particular type of HMM (called a fifth-order Markov model) to search for E. coli genes (Lukashin and Borodovsky 1998). • The success of the HMM method depends on having appropriate initial or prior conditions, i.e., a good prior model for the sequences and a sufficient number of sequences to train the model.

  40. Finally • Another consideration in using HMMs is the number of sequences. If a good prior model is used, it should be possible to train the HMM with as few as 20 sequences. • In general, the smaller the sequence number, the more important the prior conditions. • HMMs are more effective if methods to inject statistical noise into the model are used during the training procedure.
