
 Shlomo Moran, following Dan Geiger and Nir Friedman

Lecture #8: Parameter Estimation for HMM with Hidden States: the Baum-Welch Training; Viterbi Training; Extensions of HMM. Background Readings: Chapters 3.3, 11.2 in the textbook, Biological Sequence Analysis, Durbin et al., 2001.





Presentation Transcript


  1. Lecture #8: Parameter Estimation for HMM with Hidden States: the Baum-Welch Training; Viterbi Training; Extensions of HMM. Background Readings: Chapters 3.3, 11.2 in the textbook, Biological Sequence Analysis, Durbin et al., 2001. Shlomo Moran, following Dan Geiger and Nir Friedman.

  2. Parameter Estimation in HMM: What we saw until now

  3. The Parameters. An HMM model is defined by the probability parameters: m_kl and e_k(b), for all states k, l and all symbols b. θ denotes the collection of these parameters. To determine the values of the parameters θ, we use a training set {x^1,...,x^n}, where each x^j is a sequence which is assumed to fit the model.

  4. Maximum Likelihood Parameter Estimation for HMM looks for θ which maximizes: p(x^1,...,x^n|θ) = ∏_j p(x^j|θ).

  5. Finding ML parameters for HMM when all states are known: Let M_kl = #(transitions from k to l) in the training set, and E_k(b) = #(emissions of symbol b from state k) in the training set. We look for parameters θ = {m_kl, e_k(b)} that maximize the likelihood of the training set. The optimal ML parameters θ are given by:
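The formula on the original slide is not in the transcript; the standard ML estimates (normalized counts, as in Durbin et al., Ch. 3) are:

```latex
% Standard ML estimates from the observed counts (reconstruction of the
% slide's missing formula):
m_{kl} = \frac{M_{kl}}{\sum_{l'} M_{kl'}}, \qquad
e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}
```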

  6. Case 2: State paths are unknown. We need to maximize p(x|θ) = ∑_s p(x,s|θ), where the summation is over all the state paths s which produce the output sequence x. Finding θ* which maximizes ∑_s p(x,s|θ) is hard. [Unlike finding θ* which maximizes p(x,s|θ) for a single pair (x,s).]

  7. Parameter Estimation when States are Unknown. The general process for finding θ in this case is: • Start with an initial value of θ. • Find θ* so that p(x|θ*) > p(x|θ). • Set θ = θ*. • Repeat until some convergence criterion is met. A general algorithm of this type is the Expectation Maximization (EM) algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.
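A minimal sketch of this generic loop (the function names `e_step_m_step` and `log_likelihood` are placeholders, not from the slides):

```python
# A minimal sketch of the generic improvement loop described above.
# `e_step_m_step` and `log_likelihood` are hypothetical callables standing for
# the Baum-Welch steps developed in the following slides.
def iterative_training(x, theta, e_step_m_step, log_likelihood, tol=1e-6, max_iter=100):
    ll = log_likelihood(x, theta)
    for _ in range(max_iter):
        theta_star = e_step_m_step(x, theta)   # find theta* with p(x|theta*) >= p(x|theta)
        ll_star = log_likelihood(x, theta_star)
        if ll_star - ll < tol:                 # convergence criterion
            break
        theta, ll = theta_star, ll_star
    return theta
```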

  8. Baum-Welch Training. We start with some values of m_kl and e_k(b), which define initial values of θ. Then we use an iterative algorithm which attempts to replace θ by a θ* s.t. p(x|θ*) > p(x|θ). This is done by "imitating" the algorithm for Case 1, where all states are known:

  9. When states were known, we counted… In Case 1 we computed the optimal values of m_kl and e_k(b) (for the optimal θ) by simply counting the number M_kl of transitions from state k to state l, and the number E_k(b) of emissions of symbol b from state k, in the training set. This was possible since we knew all the states.

  10. When states are unknown, M_kl and E_k(b) are taken as averages: M_kl and E_k(b) are computed according to the current distribution θ, that is: M_kl = ∑_s M^s_kl p(s|x,θ), where M^s_kl is the number of k to l transitions in the state path s. Similarly E_k(b) = ∑_s E^s_k(b) p(s|x,θ), where E^s_k(b) is the number of times k emits b in the state path s with output x.

  11. Computing averages of state transitions: Since the number of state paths s is exponential in L, it is too expensive to compute M_kl = ∑_s M^s_kl p(s|x,θ) in the naïve way. Hence, we use dynamic programming: For each pair (k,l) and for each edge s_{i-1}→s_i we compute the average number of "k to l" transitions over this edge. Then we take M_kl to be the sum over all edges.

  12. …and of Letter Emissions. Similarly, for each emission edge s_i → x_i with x_i = b and each state k, we compute the average number of times that s_i = k, which is the expected number of "k → b" emissions on this edge. Then we take E_k(b) to be the sum over all such edges. These expected values are computed by assuming the current parameters θ:

  13. Baum-Welch, step E for M_kl: Count the average number of state transitions. For computing the averages, Baum-Welch computes, for each index i and states k, l, the following probability: P(s_{i-1}=k, s_i=l | x,θ). For this, it uses the forward and backward algorithms.

  14. Baum-Welch, Step E for M_kl. Claim: By the probability distribution of HMM (m_kl and e_l(x_i) are the parameters defined by θ, and F_k(i-1), B_l(i) are the outputs of the forward/backward algorithms):
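The claim's equation is missing from the transcript; reconstructed from the proof on the next slide, it reads:

```latex
% The claim, in its standard Baum-Welch form (reconstruction of the slide's
% missing equation):
P(s_{i-1}=k,\; s_i=l \mid x,\theta)
  \;=\; \frac{F_k(i-1)\, m_{kl}\, e_l(x_i)\, B_l(i)}{P(x \mid \theta)}
```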

  15. Proof of claim: P(x_1,…,x_L, s_{i-1}=k, s_i=l | θ) = P(x_1,…,x_{i-1}, s_{i-1}=k | θ) · m_kl · e_l(x_i) · P(x_{i+1},…,x_L | s_i=l, θ) = F_k(i-1) m_kl e_l(x_i) B_l(i), where the first factor is computed via the forward algorithm and the last via the backward algorithm. Dividing by P(x|θ) gives p(s_{i-1}=k, s_i=l | x,θ) = F_k(i-1) m_kl e_l(x_i) B_l(i) / P(x|θ).

  16. Step E for M_kl (end). For each pair (k,l), compute the expected number of state transitions from k to l, as the sum of the expected number of k to l transitions over all L edges:
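The summation the slide refers to (reconstructed from the previous claim; the slide's own formula is not in the transcript):

```latex
M_{kl} \;=\; \sum_{i=1}^{L} P(s_{i-1}=k,\; s_i=l \mid x,\theta)
       \;=\; \frac{1}{P(x\mid\theta)} \sum_{i=1}^{L} F_k(i-1)\, m_{kl}\, e_l(x_i)\, B_l(i)
```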

  17. Step E for M_kl, with many sequences. Claim: When we have n independent input sequences (x^1,...,x^n) of lengths L_1,...,L_n, then M_kl is given by:

  18. Proof of Claim: When we have n independent input sequences (x^1,...,x^n), the probability space is the product of n spaces. The probability of a simple event in this space with parameters θ is given by:

  19. Proof of Claim (cont.): The probability of that simple event given x = (x^1,...,x^n): The probability of the compound event (s^j, x^j) given x = (x^1,...,x^n):

  20. Proof of Claim (end):

  21. Baum-Welch, Step E for E_k(b): count the expected number of letter emissions. For state k and each symbol b, for each i where x_i = b, compute the expected number of times that s_i = k.

  22. Baum-Welch, Step E for E_k(b): For each state k and each symbol b, compute the expected number of emissions of b from k as the sum of the expected number of times that s_i = k, over all i's for which x_i = b.
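The slide's formula is missing from the transcript; the standard expression in terms of the forward and backward values is:

```latex
% Expected emission counts (reconstruction of the slide's missing formula):
E_k(b) \;=\; \sum_{i:\, x_i = b} P(s_i = k \mid x,\theta)
       \;=\; \frac{1}{P(x\mid\theta)} \sum_{i:\, x_i = b} F_k(i)\, B_k(i)
```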

  23. Step E for E_k(b), many sequences. Exercise: when we have n sequences (x^1,...,x^n), the expected number of emissions of b from k is given by:

  24. Summary: the E part of the Baum-Welch training. This part computes the expected numbers M_kl of k→l transitions for all pairs of states k and l, and the expected numbers E_k(b) of emissions of symbol b from state k, for all states k and symbols b. The next step is the M step, which is identical to the computation of optimal ML parameters when all states are known.

  25. Baum-Welch: step M. Use the M_kl's and E_k(b)'s to compute the new values of m_kl and e_k(b). These values define θ*. The correctness of the EM algorithm implies that, if θ* ≠ θ, then p(x^1,...,x^n|θ*) > p(x^1,...,x^n|θ), i.e., θ* increases the probability of the data, unless it is equal to θ (this will follow from the correctness of the EM algorithm, to be proved later). This procedure is iterated until some convergence criterion is met.
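The E and M steps above can be condensed into a short sketch. This is a minimal illustration, not the slides' code: it assumes x is a sequence of symbol indices, pi is an assumed start distribution (not discussed in the slides), m is a K×K transition matrix (m_kl) and e a K×|Σ| emission matrix (e_k(b)); all function names are made up, and zero-count smoothing is omitted.

```python
import numpy as np

def forward(x, pi, m, e):
    """F[i, k] = P(x_1..x_i, s_i = k); returns F and P(x|theta)."""
    L, K = len(x), len(pi)
    F = np.zeros((L, K))
    F[0] = pi * e[:, x[0]]
    for i in range(1, L):
        F[i] = (F[i - 1] @ m) * e[:, x[i]]
    return F, F[-1].sum()

def backward(x, m, e):
    """B[i, k] = P(x_{i+1}..x_L | s_i = k)."""
    L, K = len(x), m.shape[0]
    B = np.zeros((L, K))
    B[-1] = 1.0
    for i in range(L - 2, -1, -1):
        B[i] = m @ (e[:, x[i + 1]] * B[i + 1])
    return B

def baum_welch_step(x, pi, m, e):
    """One E+M iteration: expected counts M_kl, E_k(b), then re-normalize."""
    F, px = forward(x, pi, m, e)
    B = backward(x, m, e)
    L, K = len(x), len(pi)
    M = np.zeros((K, K))                      # E step for M_kl (slides 13-16)
    for i in range(1, L):
        M += np.outer(F[i - 1], e[:, x[i]] * B[i]) * m / px
    E = np.zeros_like(e)                      # E step for E_k(b) (slides 21-22)
    for i in range(L):
        E[:, x[i]] += F[i] * B[i] / px
    m_new = M / M.sum(axis=1, keepdims=True)  # M step: normalize counts (slide 25)
    e_new = E / E.sum(axis=1, keepdims=True)
    pi_new = F[0] * B[0] / px                 # re-estimate the assumed start distribution
    return pi_new, m_new, e_new, px
```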

  26. Viterbi training: Maximizing the probability of the most probable path

  27. Assume that rather than finding θ which maximizes the likelihood of the input x^1,...,x^n, we wish to maximize the probability of a most probable path, i.e., to find parameters θ and state paths s(x^1),...,s(x^n) s.t. the value of p(s(x^1),...,s(x^n), x^1,...,x^n | θ) is maximized. Clearly, s(x^j) should be the most probable path for x^j under the parameters θ. This is done by Viterbi Training. Below we assume only one sequence (n=1).

  28. Viterbi training. Start from given values of m_kl and e_k(b), which define initial values of θ. Each iteration: Step 1: Use Viterbi's algorithm to find a most probable path s(x), which maximizes p(s(x), x|θ).

  29. Viterbi training (cont.). Step 2: Use the ML method for HMM when the states are known to find θ* which maximizes p(s(x), x|θ*). Note: If after Step 2 we have p(s(x), x|θ*) = p(s(x), x|θ), then it must be that θ = θ*. In this case the next iteration will be identical to the current one, and hence we may terminate the algorithm.

  30. Viterbi training (cont.). Step 3: If θ ≠ θ*, set θ ← θ* and repeat. If θ = θ*, stop.
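A compact sketch of Steps 1-3, under the same assumed matrix parameterization as the Baum-Welch sketch above; `viterbi_path` and `viterbi_training` are illustrative names, the start distribution pi is assumed and not re-estimated, and states never visited by the chosen path are simply left with zero rows.

```python
import numpy as np

def viterbi_path(x, pi, m, e):
    """Step 1: a most probable state path s(x) maximizing p(s(x), x | theta)."""
    L, K = len(x), len(pi)
    V = np.zeros((L, K))                      # V[i, l] = best log-prob of a path ending in l
    ptr = np.zeros((L, K), dtype=int)
    with np.errstate(divide="ignore"):
        log_pi, log_m, log_e = np.log(pi), np.log(m), np.log(e)
    V[0] = log_pi + log_e[:, x[0]]
    for i in range(1, L):
        scores = V[i - 1][:, None] + log_m    # scores[k, l] = V[i-1, k] + log m_kl
        ptr[i] = scores.argmax(axis=0)
        V[i] = scores.max(axis=0) + log_e[:, x[i]]
    path = [int(V[-1].argmax())]
    for i in range(L - 1, 0, -1):             # trace the best path back
        path.append(int(ptr[i][path[-1]]))
    return path[::-1]

def viterbi_training(x, pi, m, e, n_symbols, max_iter=100):
    for _ in range(max_iter):
        s = viterbi_path(x, pi, m, e)
        M = np.zeros_like(m)                  # Step 2: count as if states were known
        E = np.zeros((m.shape[0], n_symbols))
        for i in range(1, len(x)):
            M[s[i - 1], s[i]] += 1
        for i in range(len(x)):
            E[s[i], x[i]] += 1
        m_new = M / np.maximum(M.sum(axis=1, keepdims=True), 1)
        e_new = E / np.maximum(E.sum(axis=1, keepdims=True), 1)
        if np.allclose(m_new, m) and np.allclose(e_new, e):
            break                             # Step 3: theta = theta*, stop
        m, e = m_new, e_new
    return pi, m, e
```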

  31. Extensions of HMM

  32. 1. Modeling probabilities of repetitions. Markov chains are rather limited in describing sequences of symbols with non-random structures. For instance, a Markov chain forces the distribution of segments in which some state is repeated k+1 times to be (1-p)p^k, for some p. By adding states we may bypass this restriction:

  33. 1. State duplications. [Diagram: duplicated states A1 → A2 → A3 → A4.] An extension of Markov chains which allows the distribution of segments in which a state is repeated k+1 times to have any desired value: Assign k+1 states to represent the same "real" state. This may model k repetitions (or fewer) with any desired probability.

  34. 2. Silent states • States which do not emit symbols. • Can be used to model repetitions. • Also used to allow arbitrary jumps (may be used to model deletions). • Need to generalize the Forward and Backward algorithms for arbitrary acyclic digraphs to account for the silent states, with separate update rules for silent and regular states:

  35. E.g., the forward algorithm should look as follows, with separate updates for regular and silent states. Directed cycles of silent (or other) states complicate things, and should be avoided.
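The slide's recursion itself is missing from the transcript; a standard way to write the generalized forward updates (a reconstruction in the style of Durbin et al., not the slide's exact formula) is:

```latex
% Regular (emitting) state l:
F_l(i) \;=\; e_l(x_i) \sum_{k} F_k(i-1)\, m_{kl}
% Silent state l (no symbol consumed; silent states processed in topological order):
F_l(i) \;=\; \sum_{k} F_k(i)\, m_{kl}
```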

  36. 3. High Order Markov Chains. Markov chains in which the transition probabilities depend on the last k states: P(x_i|x_{i-1},...,x_1) = P(x_i|x_{i-1},...,x_{i-k}). Can be represented by a standard Markov chain with more states, e.g. for k=2 over {A,B}: the pair states AA, AB, BA, BB.
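As a small illustration of this encoding (with made-up probabilities, not from the lecture), a second-order chain over {A, B} can be flattened into a first-order chain over the pair states listed above:

```python
from itertools import product

# Second-order transition probabilities P(next | prev2, prev1); values are
# placeholders chosen only for the example.
p2 = {
    ("A", "A"): {"A": 0.7, "B": 0.3},
    ("A", "B"): {"A": 0.4, "B": 0.6},
    ("B", "A"): {"A": 0.5, "B": 0.5},
    ("B", "B"): {"A": 0.2, "B": 0.8},
}

# First-order chain over pair states: (u, v) can only move to (v, w),
# with probability P(w | u, v); all other transitions get probability 0.
pair_states = list(product("AB", repeat=2))
first_order = {
    (u, v): {(v2, w): (p2[(u, v)][w] if v2 == v else 0.0)
             for (v2, w) in pair_states}
    for (u, v) in pair_states
}

# Transitions out of pair state ('A', 'B'): only ('B', .) is reachable.
print(first_order[("A", "B")])
```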

  37. 4. Inhomogeneous Markov Chains • An important task in analyzing DNA sequences is recognizing the genes which code for proteins. • A triplet of nucleotides (a codon) codes for an amino acid. • It is known that in parts of DNA which code for genes, the three codon positions have different statistics. • Thus a Markov chain model for DNA should represent not only the nucleotide (A, C, G or T), but also its position: the same nucleotide in different positions will have different transition probabilities. Used in the GENEMARK gene-finding program (1993).
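A minimal sketch of such a position-dependent (period-3) chain, with placeholder matrices rather than real codon statistics:

```python
import numpy as np

# One transition matrix per codon position, cycled with period 3.
# The matrices are random placeholders, not real codon statistics.
rng = np.random.default_rng(0)
nucleotides = "ACGT"
codon_matrices = [rng.dirichlet(np.ones(4), size=4) for _ in range(3)]  # P_0, P_1, P_2

def sequence_log_prob(seq, start_probs, codon_matrices):
    """log P(seq) when the transition matrix depends on position mod 3."""
    idx = [nucleotides.index(c) for c in seq]
    logp = np.log(start_probs[idx[0]])
    for i in range(1, len(idx)):
        P = codon_matrices[i % 3]              # matrix chosen by codon position
        logp += np.log(P[idx[i - 1], idx[i]])
    return logp

print(sequence_log_prob("ATGGCGTAA", np.full(4, 0.25), codon_matrices))
```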
