Lecture #8: Parameter Estimation for HMM with Hidden States: the Baum-Welch Training; Viterbi Training; Extensions of HMM. Background Readings: Chapters 3.3, 11.2 in the textbook, Biological Sequence Analysis, Durbin et al., 2001. Shlomo Moran, following Dan Geiger and Nir Friedman.
The Parameters. An HMM model is defined by the probability parameters m_kl and e_k(b), for all states k, l and all symbols b. θ denotes the collection of these parameters. To determine the values of the parameters θ, we use a training set {x1,...,xn}, where each xj is a sequence which is assumed to fit the model.
Maximum Likelihood Parameter Estimation for HMM looks for the θ which maximizes the likelihood of the training set: p(x1,...,xn|θ) = ∏_j p(xj|θ).
Finding ML parameters for HMM when all states are known: Let M_kl = #(transitions from k to l) in the training set, and E_k(b) = #(emissions of symbol b from state k) in the training set. We look for parameters θ = {m_kl, e_k(b)} that maximize the likelihood. The optimal ML parameters θ are given by the normalized counts: m_kl = M_kl / Σ_{l'} M_{kl'}, and e_k(b) = E_k(b) / Σ_{b'} E_k(b').
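In code, this case is just counting and normalizing. A minimal Python sketch; the function name `ml_estimate` and the dict-of-tuples parameter representation are illustrative assumptions, not from the slides:

```python
from collections import defaultdict

def ml_estimate(paths, outputs):
    """ML parameters when state paths are known: count transitions and
    emissions in the training set, then normalize per state.
    paths[j][i] is the state at position i of sequence j; outputs[j][i] its symbol."""
    M = defaultdict(int)   # M[k, l] = #(transitions from k to l)
    E = defaultdict(int)   # E[k, b] = #(emissions of symbol b from state k)
    for s, x in zip(paths, outputs):
        for i in range(len(s) - 1):
            M[s[i], s[i + 1]] += 1
        for k, b in zip(s, x):
            E[k, b] += 1
    # m_kl = M_kl / sum_l' M_kl',  e_k(b) = E_k(b) / sum_b' E_k(b')
    m_tot, e_tot = defaultdict(int), defaultdict(int)
    for (k, _), c in M.items():
        m_tot[k] += c
    for (k, _), c in E.items():
        e_tot[k] += c
    m = {(k, l): c / m_tot[k] for (k, l), c in M.items()}
    e = {(k, b): c / e_tot[k] for (k, b), c in E.items()}
    return m, e
```

For the training pair (state path AABB, output xxyy), state A makes one A→A and one A→B transition, so m_AA = m_AB = 0.5.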
Case 2: State paths are unknown. We need to maximize p(x|θ) = Σ_s p(x,s|θ), where the summation is over all state sequences s which produce the output sequence x. Finding the θ* which maximizes Σ_s p(x,s|θ) is hard. [Unlike finding the θ* which maximizes p(x,s|θ) for a single sequence (x,s).]
Parameter Estimation when States are Unknown. The general process for finding θ in this case is:
• Start with an initial value of θ.
• Find θ* so that p(x|θ*) > p(x|θ).
• Set θ = θ*.
• Repeat until some convergence criterion is met.
A general algorithm of this type is the Expectation Maximization (EM) algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.
Baum-Welch Training. We start with some values of m_kl and e_k(b), which define prior values of θ. Then we use an iterative algorithm which attempts to replace θ by a θ* such that p(x|θ*) > p(x|θ). This is done by "imitating" the algorithm for Case 1, where all states are known:
When states were known, we counted. In Case 1 we computed the optimal values of m_kl and e_k(b) (for the optimal θ) by simply counting the number M_kl of transitions from state k to state l, and the number E_k(b) of emissions of symbol b from state k, in the training set. This was possible since we knew all the states.
When states are unknown, M_kl and E_k(b) are taken as averages, computed according to the current distribution θ. That is: M_kl = Σ_s M^s_kl · p(s|x,θ), where M^s_kl is the number of k to l transitions in the state sequence s. Similarly, E_k(b) = Σ_s E^s_k(b) · p(s|x,θ), where E^s_k(b) is the number of times k emits b in the sequence s with output x.
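On tiny examples, these averaged counts can be computed by brute force: enumerate every state sequence s and weight its transition count M^s_kl by p(s|x,θ). A sketch of that naive computation; it assumes an explicit initial-state distribution `init`, which the slides fold into θ, and it is exponential in L, so it is for illustration and checking only:

```python
from itertools import product

def brute_force_Mkl(states, init, m, e, x, k, l):
    """Expected number of k->l transitions, M_kl = sum_s M^s_kl p(s|x,theta),
    by explicit enumeration of all |states|^L state sequences."""
    num, px = 0.0, 0.0
    for s in product(states, repeat=len(x)):
        p = init[s[0]] * e[s[0], x[0]]          # p(s, x | theta), built stepwise
        for i in range(1, len(x)):
            p *= m[s[i - 1], s[i]] * e[s[i], x[i]]
        px += p                                  # accumulates p(x | theta)
        num += p * sum(1 for i in range(1, len(x))
                       if s[i - 1] == k and s[i] == l)
    return num / px                              # divide to condition on x
```

With a symmetric two-state model (uniform transitions, state A favoring symbol x and state B favoring y) and output "xy", the expected number of A→B transitions works out to 0.81.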
Computing averages of state transitions: Since the number of state sequences s is exponential in L, it is too expensive to compute M_kl = Σ_s M^s_kl · p(s|x,θ) in the naïve way. Hence, we use dynamic programming: for each pair (k,l) and for each edge s_{i-1}→s_i we compute the average number of "k to l" transitions over this edge. Then we take M_kl to be the sum over all edges.
...and of letter emissions. Similarly, for each position i with x_i = b and each state k, we compute the average number of times that s_i = k, which is the expected number of "k → b" emissions at this position. Then we take E_k(b) to be the sum over all such positions. These expected values are computed by assuming the current parameters θ:
Baum-Welch: Step E for M_kl (count average number of state transitions). For computing the averages, Baum-Welch computes, for each index i and states k, l, the following probability: P(s_{i-1}=k, s_i=l | x, θ). For this, it uses the forward and backward algorithms.
Baum-Welch: Step E for M_kl. Claim: By the probability distribution of HMM, P(s_{i-1}=k, s_i=l | x, θ) = F_k(i-1) · m_kl · e_l(x_i) · B_l(i) / P(x|θ), where m_kl and e_l(x_i) are the parameters defined by θ, and F_k(i-1), B_l(i) are the outputs of the forward and backward algorithms.
Proof of claim: P(x1,…,xL, s_{i-1}=k, s_i=l | θ) = P(x1,…,x_{i-1}, s_{i-1}=k | θ) · m_kl · e_l(x_i) · P(x_{i+1},…,xL | s_i=l, θ). The first factor is F_k(i-1), computed via the forward algorithm; the last factor is B_l(i), computed via the backward algorithm. Hence P(x, s_{i-1}=k, s_i=l | θ) = F_k(i-1) · m_kl · e_l(x_i) · B_l(i), and dividing by P(x|θ) gives P(s_{i-1}=k, s_i=l | x, θ) = F_k(i-1) · m_kl · e_l(x_i) · B_l(i) / P(x|θ).
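The claim can be sketched directly in code: build the forward and backward tables, then evaluate F_k(i-1) · m_kl · e_l(x_i) · B_l(i) / P(x). The dict-based tables and the explicit initial distribution `init` are assumptions of this sketch:

```python
def forward(states, init, m, e, x):
    """F[i][k] = P(x_1..x_i, s_i = k | theta)."""
    F = [{k: init[k] * e[k, x[0]] for k in states}]
    for i in range(1, len(x)):
        F.append({l: sum(F[-1][k] * m[k, l] for k in states) * e[l, x[i]]
                  for l in states})
    return F

def backward(states, m, e, x):
    """B[i][k] = P(x_{i+1}..x_L | s_i = k, theta)."""
    B = [{k: 1.0 for k in states}]
    for i in range(len(x) - 1, 0, -1):
        B.insert(0, {k: sum(m[k, l] * e[l, x[i]] * B[0][l] for l in states)
                     for k in states})
    return B

def transition_posterior(states, init, m, e, x, i, k, l):
    """P(s_{i-1}=k, s_i=l | x, theta) = F_k(i-1) m_kl e_l(x_i) B_l(i) / P(x)."""
    F, B = forward(states, init, m, e, x), backward(states, m, e, x)
    px = sum(F[-1][q] for q in states)    # P(x | theta)
    return F[i - 1][k] * m[k, l] * e[l, x[i]] * B[i][l] / px
```

On the same symmetric two-state example used earlier (uniform transitions, A favoring x, B favoring y, output "xy"), this matches the brute-force answer, P(s_0=A, s_1=B | x) = 0.81.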
Step E for M_kl (end). For each pair (k,l), compute the expected number of state transitions from k to l as the sum of the expected numbers of k to l transitions over all L edges: M_kl = Σ_i P(s_{i-1}=k, s_i=l | x, θ).
Step E for M_kl, with many sequences: Claim: When we have n independent input sequences (x1,...,xn) of lengths L1,...,Ln, then M_kl is given by summing the per-sequence expected counts: M_kl = Σ_j Σ_i P(s_{i-1}=k, s_i=l | xj, θ), where the inner sum runs over the edges of the j-th sequence.
Proof of Claim: When we have n independent input sequences (x1,...,xn), the probability space is the product of n spaces. The probability of a simple event ((s1,x1),...,(sn,xn)) in this space with parameters θ is given by: p((s1,x1),...,(sn,xn) | θ) = ∏_j p(sj,xj | θ).
Proof of Claim (cont.): The probability of that simple event given x = (x1,...,xn) is p(s1,...,sn | x1,...,xn, θ) = ∏_j p(sj | xj, θ). Hence the probability of the compound event (sj,xj) given x = (x1,...,xn) is just p(sj | xj, θ), so the expected counts of the n sequences simply add up.
Baum-Welch: Step E for E_k(b) (count expected number of letter emissions). For each state k and each symbol b, for each i where x_i = b, compute the expected number of times that s_i = k.
Baum-Welch: Step E for E_k(b). For each state k and each symbol b, compute the expected number of emissions of b from k as the sum of the expected number of times that s_i = k, over all i's for which x_i = b: E_k(b) = Σ_{i: x_i=b} P(s_i=k | x, θ) = Σ_{i: x_i=b} F_k(i) · B_k(i) / P(x|θ).
Step E for E_k(b), with many sequences. Exercise: when we have n sequences (x1,...,xn), the expected number of emissions of b from k is given by summing the per-sequence expected counts: E_k(b) = Σ_j Σ_{i: xj_i=b} P(s_i=k | xj, θ).
Summary: the E part of the Baum-Welch training. This part computes the expected numbers M_kl of k→l transitions for all pairs of states k and l, and the expected numbers E_k(b) of emissions of symbol b from state k, for all states k and symbols b. The next step is the M step, which is identical to the computation of optimal ML parameters when all states are known.
Baum-Welch: Step M. Use the M_kl's and E_k(b)'s to compute the new values of m_kl and e_k(b). These values define θ*. The correctness of the EM algorithm implies that, if θ* ≠ θ, then p(x1,...,xn|θ*) > p(x1,...,xn|θ); i.e., θ* increases the probability of the data, unless it is equal to θ. (This will follow from the correctness of the EM algorithm, to be proved later.) This procedure is iterated until some convergence criterion is met.
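The M step itself is a single normalization pass over the expected counts, exactly as in the known-states case. A minimal sketch (names are illustrative):

```python
def m_step(states, symbols, M, E):
    """Turn expected counts into new parameters:
    m*_kl = M_kl / sum_l' M_kl',  e*_k(b) = E_k(b) / sum_b' E_k(b')."""
    m_new, e_new = {}, {}
    for k in states:
        tot_m = sum(M.get((k, l), 0.0) for l in states)
        for l in states:
            m_new[k, l] = M.get((k, l), 0.0) / tot_m if tot_m else 0.0
        tot_e = sum(E.get((k, b), 0.0) for b in symbols)
        for b in symbols:
            e_new[k, b] = E.get((k, b), 0.0) / tot_e if tot_e else 0.0
    return m_new, e_new
```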
Viterbi training: Maximizing the probability of the most probable path
Assume that rather than finding the θ which maximizes the likelihood of the input x1,...,xn, we wish to maximize the probability of a most probable path, i.e., to find parameters θ and state paths s(x1),...,s(xn) such that the value of p(s(x1),...,s(xn), x1,...,xn | θ) is maximized. Clearly, s(xj) should be the most probable path for xj under the parameters θ. This is done by Viterbi training. Below we assume only one sequence (n=1).
Viterbi training. Start from given values of m_kl and e_k(b), which define prior values of θ. Each iteration: Step 1: Use Viterbi's algorithm to find a most probable path s(x), which maximizes p(s(x), x|θ).
Viterbi training (cont.). Step 2: Use the ML method for HMM when the states are known to find the θ* which maximizes p(s(x), x|θ*). Note: If after Step 2 we have p(s(x), x|θ*) = p(s(x), x|θ), then it must be that θ = θ*. In this case the next iteration will be identical to the current one, and hence we may terminate the algorithm.
Viterbi training (cont.). Step 3: If θ ≠ θ*, set θ ← θ* and repeat. If θ = θ*, stop.
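The three steps can be sketched as follows. The add-one pseudocounts in the re-estimation step are an assumption of this sketch (they keep re-estimated probabilities nonzero); the slides describe plain ML counting. The explicit `init` distribution is likewise assumed:

```python
def viterbi(states, init, m, e, x):
    """Most probable state path for x under parameters (init, m, e)."""
    V = [{k: (init[k] * e[k, x[0]], None) for k in states}]
    for i in range(1, len(x)):
        V.append({l: max(((V[-1][k][0] * m[k, l] * e[l, x[i]], k)
                          for k in states), key=lambda t: t[0])
                  for l in states})
    last = max(states, key=lambda k: V[-1][k][0])
    path = [last]
    for i in range(len(x) - 1, 0, -1):   # trace back through the pointers
        path.append(V[i][path[-1]][1])
    return list(reversed(path))

def viterbi_training(states, symbols, init, m, e, x, max_iters=20):
    """Alternate: (1) find a most probable path under theta (Viterbi);
    (2) re-estimate theta by counting along that path.
    Stops when the path no longer changes, i.e. theta = theta*."""
    prev = None
    for _ in range(max_iters):
        s = viterbi(states, init, m, e, x)
        if s == prev:
            break
        prev = s
        M = {(k, l): 1.0 for k in states for l in states}   # pseudocounts
        E = {(k, b): 1.0 for k in states for b in symbols}  # (an assumption)
        for i in range(len(s) - 1):
            M[s[i], s[i + 1]] += 1
        for k, b in zip(s, x):
            E[k, b] += 1
        m = {(k, l): M[k, l] / sum(M[k, l2] for l2 in states)
             for k in states for l in states}
        e = {(k, b): E[k, b] / sum(E[k, b2] for b2 in symbols)
             for k in states for b in symbols}
    return m, e, prev
```

On the two-state example with output "xxyy", the most probable path is A A B B, and the re-estimated emissions sharpen toward A emitting x and B emitting y.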
1. Modeling probabilities of repetitions. Markov chains are rather limited in describing sequences of symbols with non-random structures. For instance, a Markov chain forces the distribution of segments in which some state is repeated k+1 times to be (1-p)p^k, for some p. By adding states we may bypass this restriction:
1. State duplications. An extension of Markov chains which allows the distribution of segments in which a state is repeated k+1 times to have any desired value: assign k+1 states A1,...,A_{k+1} to represent the same "real" state. This may model k repetitions (or fewer) with any desired probability.
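The duplication idea can be made concrete: to realize a desired length distribution, give each duplicated state A_i the exit probability equal to the hazard rate of that distribution (the probability of leaving now, given we have not left yet). This is a standard construction; the function names are illustrative, not from the slides:

```python
def duplication_exit_probs(pmf):
    """Given pmf[i] = desired P(run length = i+1) over lengths 1..k+1,
    return the exit probability at each duplicated state A_1..A_{k+1}
    (hazard rates) that realize exactly this distribution."""
    exits, remaining = [], 1.0
    for q in pmf:
        exits.append(q / remaining if remaining > 1e-15 else 1.0)
        remaining -= q
    return exits

def realized_pmf(exits):
    """Run-length distribution produced by a chain A_1 -> A_2 -> ... in
    which state A_i is left with probability exits[i]."""
    pmf, stay = [], 1.0
    for ex in exits:
        pmf.append(stay * ex)   # survive to A_i, then exit
        stay *= (1 - ex)
    return pmf
```

Round-tripping confirms the construction: the chain built from a target distribution reproduces that distribution, which a single self-loop (geometric lengths) cannot do.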
2. Silent states. • States which do not emit symbols. • Can be used to model repetitions. • Also used to allow arbitrary jumps (may be used to model deletions). • We need to generalize the forward and backward algorithms to arbitrary acyclic digraphs to account for the silent states.
E.g., the forward algorithm should, at each step, propagate probabilities both along edges into regular (emitting) states and along edges through the silent states, which contribute transitions without consuming a symbol. Directed cycles of silent (or other non-emitting) states complicate things, and should be avoided.
3. High-Order Markov Chains. Markov chains in which the transition probabilities depend on the last k states: P(x_i | x_{i-1},...,x_1) = P(x_i | x_{i-1},...,x_{i-k}). These can be represented by a standard Markov chain with more states; e.g., for k=2 over the alphabet {A,B}, use the pair states AA, AB, BA, BB.
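The k=2 case can be sketched by lifting the chain to pair states: the state (a,b) records the last two symbols, and emitting c moves it to (b,c). Here `p2` is an assumed dict giving the order-2 probabilities P(next = c | last two = a,b):

```python
from itertools import product

def lift_to_pairs(symbols, p2):
    """Represent an order-2 chain P(x_i | x_{i-1}, x_{i-2}) as a standard
    (order-1) chain over pair states (x_{i-1}, x_i).
    p2[(a, b), c] = P(next symbol = c | last two symbols = a, b)."""
    m = {}
    for (a, b), c in product(product(symbols, repeat=2), symbols):
        # pair state (a, b) can only move to a pair state of the form (b, c)
        m[(a, b), (b, c)] = p2[(a, b), c]
    return m
```

Note that the lifted chain is sparse: from (a,b) only states whose first component is b are reachable, so most of the |Σ|² × |Σ|² transition matrix is zero.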
4. Inhomogeneous Markov Chains. • An important task in analyzing DNA sequences is recognizing the genes which code for proteins. • A triplet of 3 nucleotides (a codon) codes for an amino acid. • It is known that in the parts of DNA which code for genes, the three codon positions have different statistics. • Thus a Markov chain model for DNA should represent not only the nucleotide (A, C, G or T), but also its position: the same nucleotide in different positions will have different transition probabilities. Used in the GENEMARK gene-finding program (1993).
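A minimal sketch of such a position-dependent chain: keep three transition matrices and pick the one for the current codon position i mod 3. The list-of-dicts representation is an assumption of this example, not the GENEMARK implementation:

```python
def inhomogeneous_prob(x, init, mats):
    """Probability of sequence x under an inhomogeneous (period-3) Markov
    chain: the transition into position i uses matrix mats[i % 3],
    modeling the different statistics of the three codon positions."""
    p = init[x[0]]
    for i in range(1, len(x)):
        p *= mats[i % 3][x[i - 1], x[i]]
    return p
```

With all three matrices equal, this reduces to an ordinary homogeneous chain, which gives an easy sanity check.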