CS 6243 Machine Learning


1. CS 6243 Machine Learning: Markov Chains and Hidden Markov Models

2. Outline • Background on probability • Hidden Markov models • Algorithms • Applications

3. Probability Basics • Definition (informal) • Probabilities are numbers assigned to events that indicate “how likely” it is that the event will occur when a random experiment is performed • A probability law for a random experiment is a rule that assigns probabilities to the events in the experiment • The sample space S of a random experiment is the set of all possible outcomes

4. Probabilistic Calculus • All probabilities are between 0 and 1 • If A and B are mutually exclusive: P(A ∪ B) = P(A) + P(B) • Thus: P(not(A)) = P(Aᶜ) = 1 – P(A)

5. Conditional probability • The joint probability of two events A and B, P(A ∩ B), or simply P(A, B), is the probability that events A and B occur at the same time. • The conditional probability P(A|B) is the probability that A occurs given that B occurred. P(A | B) = P(A ∩ B) / P(B) ⇔ P(A ∩ B) = P(A | B) P(B) ⇔ P(A ∩ B) = P(B | A) P(A)

6. Example • Roll a die • If I tell you the number is less than 4 • What is the probability of an even number? • P(d = even | d < 4) = P(d = even ∩ d < 4) / P(d < 4) = P(d = 2) / P(d = 1, 2, or 3) = (1/6) / (3/6) = 1/3
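The conditional-probability calculation above can be checked by enumerating the six equally likely outcomes; a minimal Python sketch (using `Fraction` to keep the arithmetic exact):

```python
from fractions import Fraction

# Enumerate the six equally likely outcomes of a fair die and compute
# P(even | d < 4) = P(even and d < 4) / P(d < 4) by counting.
p = Fraction(1, 6)            # probability of each face
outcomes = range(1, 7)

p_lt4 = sum(p for d in outcomes if d < 4)
p_even_and_lt4 = sum(p for d in outcomes if d % 2 == 0 and d < 4)

print(p_even_and_lt4 / p_lt4)  # 1/3
```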

7. Independence • A and B are independent iff: P(A | B) = P(A) and P(B | A) = P(B) • Therefore, if A and B are independent: P(A ∩ B) = P(A) P(B) • These two constraints are logically equivalent

8. Examples • Are P(d = even) and P(d < 4) independent? • P(d = even ∩ d < 4) = 1/6 • P(d = even) = ½ • P(d < 4) = ½ • ½ × ½ = ¼ > 1/6, so they are not independent • If your die actually has 8 faces, will P(d = even) and P(d < 5) be independent? • Are P(even in first roll) and P(even in second roll) independent? • For a playing card, are the suit and rank independent?

9. Theorem of total probability • Let B1, B2, …, BN be mutually exclusive events whose union equals the sample space S. We refer to these sets as a partition of S. • An event A can be represented as: A = A ∩ S = (A ∩ B1) ∪ (A ∩ B2) ∪ … ∪ (A ∩ BN) • Since B1, B2, …, BN are mutually exclusive: P(A) = P(A ∩ B1) + P(A ∩ B2) + … + P(A ∩ BN) • And therefore: P(A) = P(A|B1)·P(B1) + P(A|B2)·P(B2) + … + P(A|BN)·P(BN) = Σi P(A | Bi) · P(Bi) (marginalization, or exhaustive conditionalization)

10. Example • A loaded die: • P(6) = 0.5 • P(1) = … = P(5) = 0.1 • Probability of an even number? P(even) = P(even | d < 6) · P(d < 6) + P(even | d = 6) · P(d = 6) = 2/5 × 0.5 + 1 × 0.5 = 0.7

11. Another example • A box of dice: • 99% fair • 1% loaded • P(6) = 0.5. • P(1) = … = P(5) = 0.1 • Randomly pick a die and roll, P(6)? • P(6) = P(6 | F) * P(F) + P(6 | L) * P(L) • 1/6 * 0.99 + 0.5 * 0.01 = 0.17
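The same calculation in code, marginalizing over which die was picked (a minimal sketch of the total-probability sum from the slide):

```python
# P(6) = P(6 | fair) P(fair) + P(6 | loaded) P(loaded)
p_fair, p_loaded = 0.99, 0.01
p6_given_fair, p6_given_loaded = 1 / 6, 0.5

p6 = p6_given_fair * p_fair + p6_given_loaded * p_loaded
print(round(p6, 3))  # 0.17
```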

12. Bayes theorem • P(A ∩ B) = P(B) · P(A | B) = P(A) · P(B | A) ⇒ P(B | A) = P(A | B) P(B) / P(A), where P(A | B) is the conditional probability (likelihood), P(B) is the prior of B, P(B | A) is the posterior probability, and P(A) is the prior of A (the normalizing constant). This is known as Bayes Theorem or Bayes Rule, and is (one of) the most useful relations in probability and statistics. Bayes Theorem is the fundamental relation in Statistical Pattern Recognition.

13. Bayes theorem (cont’d) • Given B1, B2, …, BN, a partition of the sample space S. Suppose that event A occurs; what is the probability of event Bj? • P(Bj | A) = P(A | Bj) · P(Bj) / P(A) = P(A | Bj) · P(Bj) / Σj P(A | Bj) · P(Bj), where P(Bj | A) is the posterior probability, P(Bj) the prior of Bj, P(A | Bj) the likelihood, and the denominator the normalizing constant (theorem of total probability). • Bj: different models / hypotheses. • On observing A, should you choose the model that maximizes P(Bj | A) or P(A | Bj)? It depends on how much you know about Bj!

14. Example • A test for a rare disease reports positive for 99.5% of people with the disease, and negative 99.9% of the time for those without. • The disease is present in the population at 1 in 100,000 • What is P(disease | positive test)? • P(D|+) = P(+|D)P(D)/P(+) ≈ 0.01 • What is P(disease | negative test)? • P(D|−) = P(−|D)P(D)/P(−) ≈ 5e-8
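These posteriors can be reproduced with Bayes' rule plus total probability; the helpers below are a sketch (the names `sensitivity` and `specificity` are mine, not the slide's):

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """P(D | +) = P(+|D) P(D) / [P(+|D) P(D) + P(+|no D) P(no D)]."""
    p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_pos

def posterior_given_negative(prior, sensitivity, specificity):
    """P(D | -) = P(-|D) P(D) / [P(-|D) P(D) + P(-|no D) P(no D)]."""
    p_neg = (1 - sensitivity) * prior + specificity * (1 - prior)
    return (1 - sensitivity) * prior / p_neg

prior = 1e-5  # disease prevalence: 1 in 100,000
print(posterior_given_positive(prior, 0.995, 0.999))  # ~0.01
print(posterior_given_negative(prior, 0.995, 0.999))  # ~5e-8
```

Note how the tiny prior dominates: even a 99.5%-sensitive test leaves only about a 1% chance of disease after a positive result.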

15. Another example • We’ve talked about the casino’s box of dice: 99% fair, 1% loaded (P(6) = 0.5) • We said that if we randomly pick a die and roll it, we have a 17% chance of getting a six • If we get three sixes in a row, what’s the chance that the die is loaded? • How about five sixes in a row?

16. P(loaded | 666) = P(666 | loaded) · P(loaded) / P(666) = 0.5³ × 0.01 / (0.5³ × 0.01 + (1/6)³ × 0.99) ≈ 0.21 • P(loaded | 66666) = P(66666 | loaded) · P(loaded) / P(66666) = 0.5⁵ × 0.01 / (0.5⁵ × 0.01 + (1/6)⁵ × 0.99) ≈ 0.71
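A short sketch of this posterior update: the run of `n` consecutive sixes enters only through the two likelihoods.

```python
def p_loaded_given_sixes(n, prior_loaded=0.01):
    """P(loaded | n sixes in a row) by Bayes' rule."""
    num = 0.5 ** n * prior_loaded                   # P(sixes | loaded) P(loaded)
    den = num + (1 / 6) ** n * (1 - prior_loaded)   # + P(sixes | fair) P(fair)
    return num / den

print(round(p_loaded_given_sixes(3), 2))  # 0.21
print(round(p_loaded_given_sixes(5), 2))  # 0.71
```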

17. Simple probabilistic models for DNA sequences • Assume nature generates a type of DNA sequence as follows: • A box of dice, each with four faces: {A, C, G, T} • Select a die suitable for the type of DNA • Roll it, and append the symbol to a string • Repeat the roll until all symbols have been generated • Given a string, say X = “GATTCCAA…”, and two dice • M1 has the distribution pA = pC = pG = pT = 0.25 • M2 has the distribution pA = pT = 0.20, pC = pG = 0.30 • What is the probability of the sequence being generated by M1 or M2?

18. Model selection by maximum likelihood criterion • X = GATTCCAA • P(X | M1) = P(x1, x2, …, xn | M1) = Πi=1..n P(xi | M1) = 0.25⁸ = 1.53e-5 • P(X | M2) = P(x1, x2, …, xn | M2) = Πi=1..n P(xi | M2) = 0.2⁵ × 0.3³ = 8.64e-6 • P(X|M1) / P(X|M2) = Π P(xi|M1) / P(xi|M2) = (0.25/0.2)⁵ × (0.25/0.3)³ • LLR = Σ log(P(xi|M1) / P(xi|M2)) = nA·SA + nC·SC + nG·SG + nT·ST = 5 × log(1.25) + 3 × log(0.833) = 0.57 (natural log), where Si = log(P(i | M1) / P(i | M2)), i = A, C, G, T (the log likelihood ratio, LLR)
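The likelihoods and the LLR (natural log) can be checked directly; a minimal sketch:

```python
import math

X = "GATTCCAA"
m1 = {b: 0.25 for b in "ACGT"}                     # uniform model
m2 = {"A": 0.20, "T": 0.20, "C": 0.30, "G": 0.30}  # GC-rich model

p_m1 = math.prod(m1[b] for b in X)
p_m2 = math.prod(m2[b] for b in X)
llr = sum(math.log(m1[b] / m2[b]) for b in X)      # log likelihood ratio

print(f"{p_m1:.2e}")  # 1.53e-05
print(f"{p_m2:.2e}")  # 8.64e-06
print(round(llr, 2))  # 0.57
```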

19. Model selection by maximum a posteriori probability criterion • Take the prior probabilities of M1 and M2 into consideration if known: log(P(M1|X) / P(M2|X)) = LLR + log(P(M1)) – log(P(M2)) = nA·SA + nC·SC + nG·SG + nT·ST + log(P(M1)) – log(P(M2)) • If P(M1) ≈ P(M2), the result will be similar to the LLR test

20. Markov models for DNA sequences • So far we have assumed independence of nucleotides in different positions, which is unrealistic in biology

21. Example: CpG islands • CpG: two adjacent nucleotides on the same strand (not a base pair; the “p” stands for the phosphodiester bond of the DNA backbone) • In mammalian promoter regions, CpG is more frequent than in other regions of the genome • CpG islands often mark gene-rich regions

22. CpG islands • CpG islands • More CpG than elsewhere • More C and G than elsewhere, too • Typical length: a few hundred to a few thousand bp • Questions • Is a short sequence (say, 200 bp) a CpG island or not? • Given a long sequence (say, 10–100 kb), where are the CpG islands?

23. Markov models • A sequence of random variables is a k-th order Markov chain if, for all i, the i-th value is independent of all but the previous k values: P(xi | x1, …, xi−1) = P(xi | xi−k, …, xi−1) • First order (k = 1): P(xi | x1, …, xi−1) = P(xi | xi−1) • Second order: P(xi | x1, …, xi−1) = P(xi | xi−2, xi−1) • 0th order: P(xi | x1, …, xi−1) = P(xi) (independence)

24. First order Markov model

25. A 1st order Markov model for CpG islands • Essentially a finite state automaton (FSA) • Transitions are probabilistic (instead of deterministic) • 4 states: A, C, G, T • 16 transitions: ast = P(xi = t | xi−1 = s) • Plus Begin/End states

26. Probability of emitting sequence x • P(x) = P(x1, x2, …, xn) = P(x1) Πi=2..n P(xi | xi−1) (by the first-order Markov property)

27. Probability of a sequence • What’s the probability of ACGGCTA in this model? P(A) × P(C|A) × P(G|C) × … × P(A|T) = aBA × aAC × aCG × … × aTA • Equivalently: follow the path in the automaton, and multiply the transition probabilities on the path

28. Training • Estimate the parameters of the model • CpG+ model: count the transition frequencies from known CpG islands • CpG− model: likewise, count the transition frequencies from sequences without CpG islands • ast = #(s→t) / #(s→·), giving separate estimates a+st and a−st
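Counting transition frequencies is the whole training step; a minimal sketch (the two toy training strings are made up for illustration, standing in for real CpG+ sequences):

```python
from collections import defaultdict

def train_transitions(seqs):
    """Maximum-likelihood estimates a_st = #(s -> t) / #(s -> anything)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in seqs:
        for s, t in zip(seq, seq[1:]):   # count each adjacent pair s -> t
            counts[s][t] += 1
    return {s: {t: n / sum(nexts.values()) for t, n in nexts.items()}
            for s, nexts in counts.items()}

a_plus = train_transitions(["ACGCG", "CGCGC"])  # stand-in for CpG+ data
print(a_plus["C"]["G"])  # 1.0 here: every C in the toy data is followed by G
```

With real data one would also add pseudo-counts (as the later slide notes) so that unseen transitions do not get probability zero.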

29. Discrimination / Classification • Given a sequence, is it a CpG island or not? • Log likelihood ratio (LLR): βCG = log2(a+CG / a−CG) = log2(0.274 / 0.078) = 1.812, βBA = log2(a+BA / a−BA) = log2(0.591 / 1.047) = −0.825

30. Example • X = ACGGCGACGTCG • S(X) = βBA + βAC + βCG + βGG + βGC + βCG + βGA + βAC + βCG + βGT + βTC + βCG = βBA + 2βAC + 4βCG + βGG + βGC + βGA + βGT + βTC = −0.825 + 2 × 0.419 + 4 × 1.812 + 0.313 + 0.461 − 0.624 − 0.730 + 0.573 = 7.25
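Using the β values from the slide, the score is just a sum over transitions (B denotes the begin state); a minimal sketch:

```python
# Log-odds scores beta_st from the slide (base-2 logs of a+_st / a-_st).
beta = {("B", "A"): -0.825, ("A", "C"): 0.419, ("C", "G"): 1.812,
        ("G", "G"): 0.313, ("G", "C"): 0.461, ("G", "A"): -0.624,
        ("G", "T"): -0.730, ("T", "C"): 0.573}

X = "ACGGCGACGTCG"
# Pair each symbol with its predecessor, starting from the begin state B.
score = sum(beta[(s, t)] for s, t in zip("B" + X, X))
print(round(score, 2))  # 7.25
```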

31. CpG island scores Figure 3.2 (Durbin book) The histogram of length-normalized scores for all the sequences. CpG islands are shown with dark grey and non-CpG with light grey.

32. Questions • Q1: given a short sequence, is it more likely from the CpG+ model or the CpG− model? • Q2: given a long sequence, where are the CpG islands (if any)? • Approach 1: score fixed windows (e.g., 100 bp) • Pro: simple • Con: arbitrary fixed length, inflexible • Approach 2: combine the +/− models

33. Combined model • Given a long sequence, predict which state each position is in. (states are hidden: Hidden Markov model)

34. Hidden Markov Model (HMM) • Introduced in the 1970s for speech recognition • Shown to be good models for biosequences • Alignment • Gene prediction • Protein domain analysis • … • Observed sequence data modeled by a Markov chain • State path unknown • Model parameters known or unknown • Observed data: emission sequence X = (x1 x2 … xn) • Hidden data: state sequence Π = (π1 π2 … πn)

35. Hidden Markov model (HMM) Definition: A hidden Markov model (HMM) is a five-tuple: • Alphabet = { b1, b2, …, bM } • Set of states Q = { 1, ..., K } • Transition probabilities between any two states: aij = transition probability from state i to state j, with ai1 + … + aiK = 1, for all states i = 1…K • Start probabilities a0i, with a01 + … + a0K = 1 • Emission probabilities within each state: ek(b) = P(xi = b | πi = k), with ek(b1) + … + ek(bM) = 1, for all states k = 1…K

36. HMM for the Dishonest Casino • A casino has two dice: • Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6 • Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2 • The casino player switches back and forth between the fair and the loaded die once in a while

37. The dishonest casino model • Transition probabilities: aFF = 0.95, aFL = 0.05, aLF = 0.05, aLL = 0.95 • Emission probabilities: Fair: eF(1) = eF(2) = eF(3) = eF(4) = eF(5) = eF(6) = 1/6; Loaded: eL(1) = eL(2) = eL(3) = eL(4) = eL(5) = 1/10, eL(6) = 1/2

38. Simple scenario • You don’t know the probabilities • The casino player lets you observe which die he/she uses every time • The “state” of each roll is known • Training (parameter estimation): • How often does the casino player switch dice? • How “loaded” is the loaded die? • Simply count the frequency with which each face appeared and the frequency of die switching • May add pseudo-counts if the number of observations is small

39. More complex scenarios • The “state” of each roll is unknown: • You are given the results of a series of rolls • You don’t know which numbers were generated by which die • You may or may not know the parameters • How “loaded” is the loaded die • How frequently the casino player switches dice

40. The three main questions on HMMs • Decoding: GIVEN an HMM M and a sequence x, FIND the sequence π of states that maximizes P(x, π | M) [sometimes written as P(x, π) for simplicity] • Evaluation: GIVEN an HMM M and a sequence x, FIND P(x | M) [or P(x) for simplicity] • Learning: GIVEN an HMM M with unspecified transition/emission probabilities, and a sequence x, FIND the parameters θ = (ei(.), aij) that maximize P(x | θ)

41. Question #1 – Decoding • GIVEN an HMM with parameters, and a sequence of rolls by the casino player: 1245526462146146136136661664661636616366163616515615115146123562344 • QUESTION: What portion of the sequence was generated with the fair die, and what portion with the loaded die? • This is the DECODING question in HMMs

42. A parse of a sequence • Given a sequence x = x1……xN and an HMM with K states, a parse of x is a sequence of states π = π1, ……, πN • (The slide shows the trellis: at each position xi, the path may pass through any of the states 1…K)

43. Probability of a parse • Given a sequence x = x1……xN and a parse π = π1, ……, πN, how likely is the parse (given our HMM)? P(x, π) = P(x1, …, xN, π1, ……, πN) = P(xN, πN | πN−1) P(xN−1, πN−1 | πN−2) …… P(x2, π2 | π1) P(x1, π1) = P(xN | πN) P(πN | πN−1) …… P(x2 | π2) P(π2 | π1) P(x1 | π1) P(π1) = a0π1 aπ1π2 …… aπN−1πN eπ1(x1) …… eπN(xN)

44. Example • What’s the probability of π = Fair, Fair, Fair, Fair, Load, Load, Load, Load, Fair, Fair and X = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4? • Model: aFF = aLL = 0.95, aFL = aLF = 0.05 • P(s|F) = 1/6 for s = 1…6; P(s|L) = 1/10 for s = 1…5, P(6|L) = 1/2

45. Example • What’s the probability of π = Fair, Fair, Fair, Fair, Load, Load, Load, Load, Fair, Fair and X = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4? P = ½ · P(1 | F) P(F | F) … P(5 | F) P(L | F) P(6 | L) P(L | L) … P(4 | F) = ½ × 0.95⁷ × 0.05² × (1/6)⁶ × (1/10)² × (1/2)² ≈ 5 × 10⁻¹¹
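Evaluating P(x, π) term by term (a minimal sketch; the start probability of ½ per state follows the slide):

```python
# Dishonest-casino parameters.
a = {("F", "F"): 0.95, ("F", "L"): 0.05,
     ("L", "L"): 0.95, ("L", "F"): 0.05}
e = {"F": {r: 1 / 6 for r in range(1, 7)},
     "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

x = [1, 2, 1, 5, 6, 2, 1, 6, 2, 4]
pi = list("FFFFLLLLFF")

p = 0.5 * e[pi[0]][x[0]]                        # start prob * first emission
for i in range(1, len(x)):
    p *= a[(pi[i - 1], pi[i])] * e[pi[i]][x[i]]  # transition * emission
print(f"{p:.1e}")  # 4.7e-11, i.e. roughly the 5e-11 on the slide
```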

46. Decoding • The parse (path) is unknown. What to do? • Alternative algorithms: • Most probable single path (Viterbi algorithm) • Sequence of most probable states (forward-backward algorithm)

47. The Viterbi algorithm • Goal: to find π* = argmaxπ P(x, π) • This is equivalent to finding π* = argmaxπ log P(x, π)

48. The Viterbi algorithm • View decoding as finding a path through a trellis: a begin state B, then one of {F, L} at each position • Objective: maximize the product of transition and emission probabilities, or equivalently maximize the sum of log probabilities • P(s|L) = 1/10 for s in [1..5], P(6|L) = 1/2; P(s|F) = 1/6 for s in [1..6] • Edge weights: transition log probabilities (symbol independent); node weights: emission log probabilities (depend on the symbols in the sequence)

49. The Viterbi algorithm • Define edge weights wFF = log(aFF), etc., and node weights rF(xi+1) = log(eF(xi+1)), etc., over positions x1 x2 x3 … xn • VF(i+1) = rF(xi+1) + max { VF(i) + wFF, VL(i) + wLF } = weight of the best parse of (x1…xi+1), with xi+1 emitted by state F • VL(i+1) = rL(xi+1) + max { VF(i) + wFL, VL(i) + wLL } = weight of the best parse of (x1…xi+1), with xi+1 emitted by state L

50. Recursion from the FSA directly • Edge weights: wFF = wLL = log(0.95) = −0.05; wFL = wLF = log(0.05) = −3.00 • Node weights: rF(s) = log(1/6) = −1.8 for s = 1…6; rL(6) = log(1/2) = −0.7, rL(s) = log(1/10) = −2.3 for s = 1…5 • Log space: VF(i+1) = rF(xi+1) + max { VL(i) + wLF, VF(i) + wFF }; VL(i+1) = rL(xi+1) + max { VL(i) + wLL, VF(i) + wFL } • Probability space: PF(i+1) = eF(xi+1) · max { PL(i) aLF, PF(i) aFF }; PL(i+1) = eL(xi+1) · max { PL(i) aLL, PF(i) aFL }
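Putting the recursion together, here is a log-space Viterbi sketch for the two-state casino model (variable names are mine; the parameters are the slide's):

```python
import math

def viterbi(x, states, start, trans, emit):
    """Most probable state path argmax_pi P(x, pi), computed in log space."""
    V = {s: math.log(start[s]) + math.log(emit[s][x[0]]) for s in states}
    backptrs = []
    for obs in x[1:]:
        V_new, ptr = {}, {}
        for t in states:
            # V_t(i+1) = r_t(x_{i+1}) + max_s { V_s(i) + w_st }
            best = max(states, key=lambda s: V[s] + math.log(trans[s][t]))
            V_new[t] = V[best] + math.log(trans[best][t]) + math.log(emit[t][obs])
            ptr[t] = best
        V = V_new
        backptrs.append(ptr)
    state = max(V, key=V.get)           # best final state
    path = [state]
    for ptr in reversed(backptrs):      # trace pointers back to the start
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

states = ["F", "L"]
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

rolls = [6, 6, 6, 6, 1, 2, 3, 4, 5, 1]
print(viterbi(rolls, states, start, trans, emit))  # LLLLFFFFFF
```

Working in log space avoids the numerical underflow that the raw probability-space recursion suffers on long sequences.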