# Uncovering Sequences Mysteries With Hidden Markov Model

##### Presentation Transcript

1. Uncovering Sequences Mysteries With Hidden Markov Model, Cédric Notredame

2. Our Scope: Look once under the hood. Understand the principle of HMMs. Understand HOW HMMs are used in biology.

3. Outline: -Reminder of Bayesian probabilities -HMMs and Markov chains -Application to gene prediction -Application to Tm predictions -Application to domain/protein family prediction -Future applications

4. Conditional Probabilities And Bayes Theorem

5. I now send you an essay which I have found among the papers of our deceased friend Mr Bayes, and which, in my opinion, has great merit... In an introduction which he has writ to this Essay, he says, that his design at first in thinking on the subject of it was, to find out a method by which we might judge concerning the probability that an event has to happen, in given circumstances, upon supposition that we know nothing concerning it but that, under the same circumstances, it has happened a certain number of times, and failed a certain other number of times. Bayes

6. “The Durbin…”

7. What is a Probabilistic Model? Dice = Probabilistic Model. -Each possible outcome has a probability (1/6). -Biological questions: what kind of dice would generate coding DNA? Non-coding DNA?

8. Which Parameters? Dice = Probabilistic Model. Parameters: probability of each outcome. -A priori estimation: 1/6 for each number, OR -Through observation: measure frequencies on a large number of events.

9. Which Parameters? Parameters: probability of each outcome. Model: Intra/Extra-Cell Proteins. 1- Make a set of Inside proteins using annotation. 2- Make a set of Outside proteins using annotation. 3- COUNT frequencies on the two sets. Model accuracy depends on the training set.

10. Maximum Likelihood Models. Model: Intra/Extra-Cell Proteins. 1- Make a training set. 2- Count frequencies. Model accuracy depends on the training set. Maximum Likelihood Model: the model whose probabilities MAXIMISE the probability of the data.
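
The count-frequencies step can be sketched in Python as follows. This is a minimal illustration, not the author's code: the function name `ml_frequencies` and the two tiny "training sets" are made up here and are not real annotated proteins. For a model in which each symbol is drawn independently, the observed frequencies are the maximum likelihood parameters.

```python
from collections import Counter

def ml_frequencies(sequences):
    """Maximum likelihood estimate: observed frequency of each symbol."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}

# Hypothetical toy training sets (not real data).
inside_set = ["MKKLLE", "MEEKLK"]
outside_set = ["MCCSWC", "MSCWSC"]

inside_model = ml_frequencies(inside_set)
outside_model = ml_frequencies(outside_set)
print(inside_model)
print(outside_model)
```

The quality of both models depends entirely on how well the two sets were annotated, which is the point the slide makes about the training set.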

11. Maximum Likelihood Models. Model: Intra/Extra-Cell Proteins. Maximum Likelihood Model: Model probability MAXIMISES Data probability AND Data probability MAXIMISES Model probability, so P(Model ¦ Data) is maximised. ¦ means GIVEN!

12. Maximum Likelihood Models. Model: Intra/Extra-Cell Proteins. Maximum Likelihood Model: P(Model ¦ Data) is maximised and P(Data ¦ Model) is maximised. Model probability MAXIMISES Data probability AND Data probability MAXIMISES Model probability.

13. Maximum Likelihood Models. Data: 11121112221212122121112221112121112211111. The data contain only 1s and 2s, so a two-outcome model explains them better than a six-outcome model: P(Coin ¦ Data) > P(Dice ¦ Data).
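
A small Python sketch comparing how likely the data on this slide are under the two models. It assumes a fair coin whose sides are labelled 1 and 2 and a fair six-sided dice; both assumptions, and the helper name `log_likelihood`, are introduced here for illustration.

```python
import math

data = "11121112221212122121112221112121112211111"

def log_likelihood(observations, model):
    """Sum of log P(outcome | model); -inf if an outcome is impossible."""
    total = 0.0
    for outcome in observations:
        p = model.get(outcome, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

# Assumed models: fair coin with sides "1" and "2", fair six-sided dice.
coin = {"1": 1 / 2, "2": 1 / 2}
dice = {str(face): 1 / 6 for face in range(1, 7)}

ll_coin = log_likelihood(data, coin)
ll_dice = log_likelihood(data, dice)
print(ll_coin, ll_dice, ll_coin > ll_dice)
```

Because every observation is a 1 or a 2, each roll contributes log(1/2) under the coin and log(1/6) under the dice, so the coin model has the higher likelihood. With equal priors on the two models it also has the higher posterior.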

14. Conditional Probabilities

15. Conditional Probabilities: P(Win Lottery ¦ Participation). The probability that something happens IF something else ALSO happens.

16. Conditional Probability. Dice 1: P(6 ¦ Dice 1) = 1/6. Dice 2: P(6 ¦ Dice 2) = 1/2 (Loaded!). The probability that something happens IF something else ALSO happens.

17. Joint Probability: the probability that something happens AND something else ALSO happens. P(6 ¦ D1) = 1/6, P(6 ¦ D2) = 1/2. P(6, D2) = P(6 ¦ D2) * P(D2) = 1/2 * 1/100 = 1/200. The comma means AND.

18. Joint Probability. Question: What is the probability of making a 6, given that the loaded dice (DL) is used 1% of the time and the fair dice (DF) 99% of the time? P(6 ¦ DF and DL) = P(6, DF) + P(6, DL) = P(6 ¦ DF) * P(DF) + P(6 ¦ DL) * P(DL) = 1/6 * 0.99 + 1/2 * 0.01 = 0.170 (0.167 for an unloaded dice).

19. Joint Probability. Unsuspected heterogeneity in the training set leads to inaccurate parameter estimation: P(6 ¦ DF and DL) = P(6, DF) + P(6, DL) = P(6 ¦ DF) * P(DF) + P(6 ¦ DL) * P(DL) = 1/6 * 0.99 + 1/2 * 0.01 = 0.170 (0.167 for an unloaded dice).
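
The arithmetic of the fair/loaded mixture can be checked with a few lines of Python. This is only a sketch; the variable names are chosen here for illustration.

```python
p_fair, p_loaded = 0.99, 0.01              # P(DF), P(DL)
p6_given_fair, p6_given_loaded = 1 / 6, 1 / 2

# Sum of the two joint probabilities P(6, DF) + P(6, DL).
p6 = p6_given_fair * p_fair + p6_given_loaded * p_loaded
print(round(p6, 3))  # 0.17
```

The difference from a purely fair dice (1/6, about 0.167) is small, which is why a little hidden heterogeneity in a training set is easy to miss.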

20. Bayes Theorem: $P(X_i \mid Y) = \dfrac{P(Y \mid X_i)\,P(X_i)}{\sum_j P(Y \mid X_j)\,P(X_j)}$. X: Model or Data or any Event. Y: Model or Data or any Event.

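21. Bayes Theorem: $P(X \mid Y) = \dfrac{P(Y \mid X)\,P(X)}{P(Y \mid X)\,P(X) + P(Y \mid \bar{X})\,P(\bar{X})}$. The denominator is $P(Y, X) + P(Y, \bar{X}) = P(Y)$, where $\bar{X}$ is "not X" and $X_T = X + \bar{X}$. X: Model or Data or any Event. Y: Model or Data or any Event.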

22. Bayes Theorem: $P(X \mid Y) = \dfrac{P(Y \mid X)\,P(X)}{P(Y)}$. The numerator is the probability of observing Y AND X simultaneously. 'Remove' P(Y) to get $P(X \mid Y)$, the probability of observing X IF Y is fulfilled. X: Model or Data or any Event. Y: Model or Data or any Event.

23. Bayes Theorem: $P(X \mid Y) = \dfrac{P(X, Y)}{P(Y)}$. $P(X, Y)$ is the probability of observing Y and X simultaneously. 'Remove' P(Y) to get $P(X \mid Y)$, the probability of observing X IF Y is fulfilled. X: Model or Data or any Event. Y: Model or Data or any Event.

24. Using Bayes Theorem. Question: The dice gave three 6s in a row. IS IT LOADED? We will use Bayes Theorem to test our belief: if the dice was loaded (Model), what would be the probability of this Model given the Data (three 6s in a row)?

25. Using Bayes Theorem. Question: The dice gave three 6s in a row. IS IT LOADED? Occasionally Dishonest Casino: P(D1) = 0.99, P(D2) = 0.01, P(6 ¦ D1) = 1/6, P(6 ¦ D2) = 1/2.

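26. Using Bayes Theorem. Question: The dice gave three 6s in a row. IS IT LOADED? $P(D1) = 0.99$, $P(D2) = 0.01$, $P(6 \mid D1) = 1/6$, $P(6 \mid D2) = 1/2$. Apply $P(X \mid Y) = \dfrac{P(Y \mid X)\,P(X)}{P(Y)}$ with Y: $6^3$ (three 6s in a row) and X: D2: $P(D2 \mid 6^3) = \dfrac{P(6^3 \mid D2)\,P(D2)}{P(6^3 \mid D1)\,P(D1) + P(6^3 \mid D2)\,P(D2)}$. The denominator adds $6^3$ with D1 and $6^3$ with D2.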

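27. Using Bayes Theorem. Question: The dice gave three 6s in a row. IS IT LOADED? $P(D1) = 0.99$, $P(D2) = 0.01$, $P(6 \mid D1) = 1/6$, $P(6 \mid D2) = 1/2$. With $P(X \mid Y) = \dfrac{P(X, Y)}{P(Y)}$: $P(D2 \mid 6^3) = \dfrac{P(6^3 \mid D2)\,P(D2)}{P(6^3 \mid D1)\,P(D1) + P(6^3 \mid D2)\,P(D2)} = \dfrac{(1/2)^3 \times 0.01}{(1/6)^3 \times 0.99 + (1/2)^3 \times 0.01} = 0.21$. Probably NOT.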

28. Posterior Probability. Question: The dice gave three 6s in a row. IS IT LOADED? $P(D2 \mid 6^3) = \dfrac{P(6^3 \mid D2)\,P(D2)}{P(6^3 \mid D1)\,P(D1) + P(6^3 \mid D2)\,P(D2)} = 0.21$. 0.21 is a posterior probability: it was estimated AFTER the Data was obtained. $P(6^3 \mid D2)$ is the likelihood of the Hypothesis.
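
The posterior for the occasionally dishonest casino can be reproduced with a short Python sketch. The function name and its default arguments are illustrative choices; the numbers are the ones given on the slides (prior 0.01 for the loaded dice, 1/6 and 1/2 for a 6).

```python
def posterior_loaded(n_sixes, prior_loaded=0.01, p6_fair=1 / 6, p6_loaded=1 / 2):
    """P(loaded | n sixes in a row) by Bayes' theorem."""
    joint_loaded = (p6_loaded ** n_sixes) * prior_loaded
    joint_fair = (p6_fair ** n_sixes) * (1 - prior_loaded)
    return joint_loaded / (joint_loaded + joint_fair)

print(round(posterior_loaded(3), 2))  # 0.21

# How the belief changes as more 6s are observed in a row.
for n in range(1, 7):
    print(n, round(posterior_loaded(n), 3))
```

Three 6s raise the probability of the loaded dice from the 1% prior to about 21%, so the dice is still probably fair; a longer run of 6s shifts the posterior further towards the loaded dice.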

29. Debunking Headlines. "50% of the crimes are committed by Migrants." Question: Are 50% of the Migrants Criminals? P(Migrant) = 0.1, P(Criminal) = 0.0001, $P(M \mid C) = 0.5$. $P(C \mid M) = \dfrac{P(M \mid C)\,P(C)}{P(M)} = \dfrac{0.5 \times 0.0001}{0.1} = 0.0005$. NO: only 0.05% of Migrants are Criminals (NOT 50%!).

30. Debunking Headlines. "50% of Gene Promoters contain TATA." Question: Is TATA a good gene predictor? P(T) = 0.1, P(P) = 0.0001, $P(T \mid P) = 0.5$. $P(P \mid T) = \dfrac{P(T \mid P)\,P(P)}{P(T)} = \dfrac{0.5 \times 0.0001}{0.1} = 0.0005$. NO.
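
The same inversion applies to both headlines, so it can be written once as a tiny helper. A minimal sketch; the function name `bayes` and its parameter names are made up for this illustration.

```python
def bayes(p_y_given_x, p_x, p_y):
    """P(X | Y) = P(Y | X) * P(X) / P(Y)."""
    return p_y_given_x * p_x / p_y

# TATA example: X = promoter, Y = TATA present (numbers from the slide).
p_promoter_given_tata = bayes(p_y_given_x=0.5, p_x=0.0001, p_y=0.1)
print(p_promoter_given_tata)
```

The result is about 0.0005: even though half of the promoters contain TATA, only about 0.05% of TATA occurrences mark a promoter.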

31. Bayes Theorem. TATA = High Sensitivity / Low Specificity. Bayes Theorem reveals the trade-off between Sensitivity (finding ALL the genes) and Specificity (finding ONLY genes).

32. Markov Chains

33. What is a Markov Chain? Simple Chain: One Dice. -Each roll is the same. -A roll does not depend on the previous one. Markov Chain: Two Dice. -You only use ONE dice at a time: the fair OR the loaded. -The dice you roll only depends on the previous roll.

34. What is a Markov Chain? Biological sequences tend to behave like Markov Chains. Question/Example: Is it possible to tell whether my sequence is a CpG island?

35. What is a Markov Chain? Question: Identify CpG island sequences. Old-Fashioned Solution: -Slide a window of size: Captain's Height/π -Measure the % of CpG -Plot it against the sequence -Decide

36. Sliding Window Methods [Figures: sliding window; average sliding window]
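
The old-fashioned sliding-window approach can be sketched in Python as below. The window size of 8, the toy sequence and the function name `cpg_fraction_windows` are arbitrary choices made for this illustration; the slides do not specify them.

```python
def cpg_fraction_windows(sequence, window=8):
    """Fraction of dinucleotide positions that are 'CG' in each window."""
    sequence = sequence.upper()
    fractions = []
    for start in range(len(sequence) - window + 1):
        chunk = sequence[start:start + window]
        pairs = [chunk[i:i + 2] for i in range(window - 1)]
        fractions.append(pairs.count("CG") / len(pairs))
    return fractions

seq = "ATATATATCGCGCGCGATATATAT"   # hypothetical toy sequence
profile = cpg_fraction_windows(seq, window=8)
print([round(f, 2) for f in profile])
```

Plotting `profile` against the window start position gives the kind of curve the slide refers to. The result depends on the arbitrary window size, which is the weakness the "Captain's Height" joke points at.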

37. What is a Markov Chain? Question: Identify CpG island sequences. Bayesian Solution: -Make a CpG Markov Chain -Run the sequence through the chain -What is the likelihood for the chain to produce the sequence?

38. [Diagram: the states A, C, G, T linked by transitions.] Transition Probabilities: the probability of a transition from G to C is $A_{GC} = P(X_i = C \mid X_{i-1} = G)$.

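39. $P(\text{sequence}) = P(X_L, X_{L-1}, X_{L-2}, \ldots, X_1)$. Remember: $P(X, Y) = P(X \mid Y)\,P(Y)$. In the Markov Chain, $X_L$ only depends on $X_{L-1}$, so $P(\text{sequence}) = P(X_L \mid X_{L-1})\,P(X_{L-1} \mid X_{L-2}) \cdots P(X_2 \mid X_1)\,P(X_1)$.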

40. $A_{GC} = P(X_i = C \mid X_{i-1} = G)$. $P(\text{sequence}) = P(X_L \mid X_{L-1})\,P(X_{L-1} \mid X_{L-2}) \cdots P(X_2 \mid X_1)\,P(X_1) = P(x_1) \prod_{i=2}^{L} A_{x_{i-1} x_i}$.
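
The product formula translates directly into code. The sketch below works in log space to avoid underflow on long sequences; the uniform initial and transition probabilities are placeholders invented for this example, not values from the presentation.

```python
import math

def chain_log_prob(sequence, initial, transition):
    """log P(sequence) = log P(x1) + sum over i >= 2 of log A[x_{i-1}][x_i]."""
    log_p = math.log(initial[sequence[0]])
    for prev, curr in zip(sequence, sequence[1:]):
        log_p += math.log(transition[prev][curr])
    return log_p

# Hypothetical uniform chain: every transition has probability 1/4.
alphabet = "ACGT"
initial = {s: 0.25 for s in alphabet}
transition = {a: {b: 0.25 for b in alphabet} for a in alphabet}

print(math.exp(chain_log_prob("GCGC", initial, transition)))
```

Here `transition[prev][curr]` plays the role of A with the previous symbol first and the current symbol second, matching the subscript order of the slide's formula.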

41. [Diagram: states B, A, C, G, T.] Arbitrary Beginning and End states can be added to the chain. By convention, only the Beginning state is added.

42. [Diagram: states B, A, C, G, T, E.] Adding an End state with a transition probability T defines length probabilities: $P(\text{all the sequences of length } L) = T(1-T)^{L-1}$.

43. [Diagram: states B, A, C, G, T, E.] The transitions are probabilities. The sum of the probabilities of all the possible sequences of all possible lengths is 1.
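
A short check of that claim, assuming (as on slide 42) a constant probability T of moving to the End state after each symbol and sequence lengths L of at least 1. Summing the length probabilities is a geometric series:

```latex
\sum_{L=1}^{\infty} T(1-T)^{L-1}
  = T \sum_{k=0}^{\infty} (1-T)^{k}
  = T \cdot \frac{1}{1-(1-T)}
  = 1, \qquad 0 < T \le 1.
```

Since the probabilities of all sequences of a given length add up to the probability of that length, the total over all lengths is 1.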

44. Using Markov Chains To Predict

45. What is a Prediction. Given a sequence, we want to know the probability that this sequence is a CpG island. 1- We need a training set: -CpG+ sequences -CpG- sequences. 2- We will measure the transition frequencies, and treat them like probabilities.

46. What is a Prediction. Is my sequence a CpG island? 2- We will measure the transition frequencies, and treat them like probabilities. Transition GC: G followed by a C, as in GCCGCTGCGCGA. $A^+_{GC}$ is the ratio between the number of GC transitions and the number of all transitions G->X: $A^+_{GC} = \dfrac{N^+_{GC}}{\sum_X N^+_{GX}}$.
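
The counting step can be sketched in Python as follows, using the short example sequence from this slide as a one-sequence "training set". The function name `transition_probabilities` is an illustrative choice, and no pseudocounts are added, so transitions never seen in training simply do not appear in the result.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate A[s][t] = N(s->t) / sum over X of N(s->X) from training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, curr in zip(seq, seq[1:]):
            counts[prev][curr] += 1
    return {
        prev: {curr: n / sum(nexts.values()) for curr, n in nexts.items()}
        for prev, nexts in counts.items()
    }

A = transition_probabilities(["GCCGCTGCGCGA"])
print(A["G"])
```

In GCCGCTGCGCGA, G is followed by C four times and by A once, so the estimate for the G to C transition is 4/5 = 0.8. A real model would be trained separately on the CpG+ and CpG- sets.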

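47. What is a Prediction. Is my sequence a CpG island? 2- We will measure the transition frequencies, and treat them like probabilities. Each column of the two tables sums to about 1.

    | + | A | C | G | T |
    |---|------|------|------|------|
    | A | 0.18 | 0.17 | 0.16 | 0.08 |
    | C | 0.27 | 0.36 | 0.33 | 0.35 |
    | G | 0.42 | 0.27 | 0.37 | 0.38 |
    | T | 0.12 | 0.18 | 0.12 | 0.18 |

    | - | A | C | G | T |
    |---|------|------|------|------|
    | A | 0.30 | 0.32 | 0.25 | 0.17 |
    | C | 0.21 | 0.30 | 0.25 | 0.24 |
    | G | 0.28 | 0.08 | 0.30 | 0.29 |
    | T | 0.21 | 0.30 | 0.20 | 0.29 |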

48. What is a Prediction. Is my sequence a CpG island? 3- Evaluate the probability for each of these models to generate our sequence, using the + and - transition tables of slide 47: $P(\text{seq} \mid M^+) = \prod_{i=1}^{L} A^+_{x_{i-1} x_i}$ and $P(\text{seq} \mid M^-) = \prod_{i=1}^{L} A^-_{x_{i-1} x_i}$ (with $x_0$ the Beginning state).
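
Comparing the two models is usually done with a log-odds score, log P(seq ¦ M+) minus log P(seq ¦ M-). The Python sketch below uses the numbers from the + and - tables of slide 47. Two assumptions are made here and are not stated on the slides: the tables are read with the row as the next base and the column as the previous base (because the columns, not the rows, sum to about 1), and the first-base term is taken to be the same in both models, so it cancels.

```python
import math

# Tables as printed on the slide, read as ROWS[next][index of previous]
# (assumption: each column sums to about 1, so columns are the previous base).
BASES = "ACGT"
PLUS_ROWS = {
    "A": [0.18, 0.17, 0.16, 0.08],
    "C": [0.27, 0.36, 0.33, 0.35],
    "G": [0.42, 0.27, 0.37, 0.38],
    "T": [0.12, 0.18, 0.12, 0.18],
}
MINUS_ROWS = {
    "A": [0.30, 0.32, 0.25, 0.17],
    "C": [0.21, 0.30, 0.25, 0.24],
    "G": [0.28, 0.08, 0.30, 0.29],
    "T": [0.21, 0.30, 0.20, 0.29],
}

def transition(rows, prev, curr):
    """P(curr | prev): row = next base, column = previous base."""
    return rows[curr][BASES.index(prev)]

def log_odds(sequence):
    """log P(seq | M+) - log P(seq | M-), ignoring the first-base term."""
    score = 0.0
    for prev, curr in zip(sequence, sequence[1:]):
        score += math.log(transition(PLUS_ROWS, prev, curr))
        score -= math.log(transition(MINUS_ROWS, prev, curr))
    return score

print(log_odds("CGCGCGCG"), log_odds("ATATATAT"))
```

A positive score means the CpG+ chain explains the sequence better than the CpG- chain; a negative score means the opposite. The CG-rich toy sequence scores positive and the AT-rich one negative.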