
Fisica Computazionale applicata alle Macromolecole





Presentation Transcript


  1. Fisica Computazionale applicata alle Macromolecole Pier Luigi Martelli Università di Bologna gigi@biocomp.unibo.it 051 2094005 338 3991609 Modelli probabilistici per Sequenze Biologiche

  2. PROLOGUE: Pitfalls of standard alignments

  3. Scoring a pairwise alignment A: ALAEVLIRLITKLYP B: ASAKHLNRLITELYP Blosum62

  4. Alignment of a family (globins) [Figure: multiple sequence alignment of the globin family] Different positions are not equivalent

  5. Sequence logos http://weblogo.berkeley.edu/cache/file5h2DWc.png The substitution score IN A FAMILY should depend on the position (the same holds for gaps). For modelling families we need more flexible tools

  6. Probabilistic Models for Biological Sequences • What are they?

  7. Probabilistic models for sequences • Generative definition: objects producing different outcomes (sequences) with different probabilities • The probability distribution over the sequence space determines the model specificity [Figure: probability distribution over the sequence space] M generates si with probability P(si | M) e.g.: M is the representation of the family of globins

  8. Probabilistic models for sequences • We don’t need a generator of new biological sequences: the generative definition is useful as an operative definition • Associative definition: objects that, given an outcome (sequence), compute a probability value [Figure: probability distribution over the sequence space] M associates the probability P(si | M) to si e.g.: M is the representation of the family of globins

  9. Probabilistic models for sequences Most useful probabilistic models are trainable systems: the probability density function over the sequence space is estimated from known examples by means of a learning algorithm. [Figure: from known examples to a pdf estimate over the sequence space (generalization)] e.g.: writing a generic representation of the sequences of globins starting from a set of known globins

  10. Probabilistic Models for Biological Sequences • What are they? • Why use them?

  11. Modelling a protein family [Figure: a probabilistic model assigns a probability to each sequence, e.g. Seq1: 0.98, Seq2: 0.21, Seq3: 0.12, Seq4: 0.89, Seq5: 0.47, Seq6: 0.78] Given a protein class (e.g. Globins), a probabilistic model trained on this family can compute a probability value for each new sequence. This value measures the similarity between the new sequence and the family described by the model

  12. Probabilistic Models for Biological Sequences • What are they? • Why use them? • Which probabilities do they compute?

  13. P( s | M ) or P( M | s ) ? A model M associates to a sequence s the probability P( s | M ). This probability answers the question: what is the probability that a model describing the Globins generates the sequence s? The question we want to answer is: given a sequence s, is it a Globin? We need to compute P( M | s ) !!

  14. Bayes’ Theorem • Joint probability: P(X,Y) = P(X | Y) P(Y) = P(Y | X) P(X) • So: P(Y | X) = P(X | Y) P(Y) / P(X) • For a model M and a sequence s: P(M | s) = P(s | M) P(M) / P(s), where P(M) and P(s) are the a priori probabilities • P(s | M) is the probability of the evidence s given the conclusion M; P(M | s) is the probability of the conclusion M given the evidence s

  15. Bayes’ rule: Example • A rare disease affects 1 out of 100,000 people. • A test shows positive • with probability 0.99 when applied to an ill person, and • with probability 0.01 when applied to a healthy person. • What is the probability that you have the disease given that you test positive?

  16. Bayes’ rule: Example P(+|ill) = 0.99, P(+|healthy) = 0.01, P(ill) = 10^-5 P(ill | +) = P(+|ill) P(ill) / [ P(+|ill) P(ill) + P(+|healthy) P(healthy) ] ≈ 10^-3 Happy end: it is far more likely that the test is incorrect!!
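A minimal numeric check of this example (a sketch, not part of the original slides; the probabilities are those given above):

```python
# Bayes' rule applied to the rare-disease example (numbers from the slide).
p_ill = 1e-5                    # prior: 1 person in 100,000 is ill
p_pos_given_ill = 0.99          # P(+ | ill)
p_pos_given_healthy = 0.01      # P(+ | healthy)

# Total probability of a positive test
p_pos = p_pos_given_ill * p_ill + p_pos_given_healthy * (1 - p_ill)

# Posterior: P(ill | +) = P(+ | ill) P(ill) / P(+)
p_ill_given_pos = p_pos_given_ill * p_ill / p_pos
print(f"P(ill | +) = {p_ill_given_pos:.5f}")   # about 0.001: the test is most likely wrong
```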

  17. Is the pope an alien? Since the probability P(Pope|Human) = 1/6,000,000,000, does this imply that the Pope is not a human being? Beck-Bornholdt HP, Dubben HH, Nature 381, 730 (1996)

  18. P(Pope|Human) is not the same as P(Human|Pope); since P(Alien) ~ 0, P(Human|Pope) ~ 1.0. The Pope is (probably) not an alien. (S Eddy and D McKay’s answer)

  19. The a priori probabilities P(M | s) = P(s | M) P(M) / P(s) P(M) is the probability of the model (i.e. of the class described by the model) BEFORE we know the sequence: it can be estimated as the abundance of the class. P(s) is the probability of the sequence in the sequence space: it cannot be reliably estimated!!

  20. Comparison between models We can overcome the problem by comparing the probabilities of generating s from different models: P(M1 | s) / P(M2 | s) = [ P(s | M1) P(M1) / P(s) ] / [ P(s | M2) P(M2) / P(s) ] = P(s | M1) P(M1) / [ P(s | M2) P(M2) ] P(M1) / P(M2) is the ratio between the abundances of the two classes

  21. Null model Otherwise we can score a sequence for a model M by comparing it to a Null Model N: a model that generates ALL the possible sequences with probabilities depending ONLY on the statistical amino acid abundance. S(M, s) = log [ P(s | M) / P(s | N) ] [Figure: distributions of S(M, s) for sequences belonging and not belonging to model M] In this case we need a threshold and a statistic for evaluating the significance (E-value, P-value)
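As a rough illustration of the log-odds score, a sketch with made-up frequencies; a toy 4-letter alphabet stands in for the 20 amino acids and a simple 0-order (composition-only) model plays the role of M:

```python
from math import log

# Hypothetical composition of model M and of the null model N (background frequencies only).
model_freq = {'A': 0.10, 'C': 0.05, 'G': 0.40, 'T': 0.45}
null_freq  = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}

def log_odds_score(seq, model, null):
    """S(M, s) = log[ P(s|M) / P(s|N) ]; positive scores favour membership in M."""
    return sum(log(model[c] / null[c]) for c in seq)

print(log_odds_score("GTGTAG", model_freq, null_freq))   # > 0: the sequence fits M better than N
```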

  22. The simplest probabilistic models: Markov Models • Definition

  23. Markov Models Example: Weather Register the weather conditions day by day: as a first hypothesis, the weather condition on a day depends ONLY on the weather conditions of the day before. Define the conditional probabilities P(C|C), P(C|R), …, P(R|C), … States: C: Clouds, R: Rain, F: Fog, S: Sun. The probability of the 5-day record CRRCS is P(CRRCS) = P(C)·P(R|C)·P(R|R)·P(C|R)·P(S|C)

  24. Markov Model A stochastic generator of sequences in which the probability of the state at position i depends ONLY on the state at position i-1. Given an alphabet C = {c1, c2, c3, …, cN}, a Markov model is described by N×(N+2) parameters {a_rt, a_BEGIN,t, a_r,END ; r, t ∈ C}: a_rq = P( s_i = q | s_i-1 = r ) a_BEGIN,q = P( s_1 = q ) a_r,END = P( s_T = END | s_T-1 = r ) [Figure: states c1 … cN plus BEGIN and END] Constraints: Σ_t a_rt + a_r,END = 1 for every r; Σ_t a_BEGIN,t = 1

  25. Markov Models Given the sequence s = s1 s2 s3 s4 s5 … sT with s_i ∈ C = {c1, c2, c3, …, cN}: P( s | M ) = P( s1 ) · Π_{i=2..T} P( s_i | s_i-1 ) · P( END | s_T ) = a_BEGIN,s1 · Π_{i=2..T} a_s(i-1),s(i) · a_sT,END Example: P(“ALKALI”) = a_BEGIN,A · a_A,L · a_L,K · a_K,A · a_A,L · a_L,I · a_I,END
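A short Python sketch of this computation; the transition values below are invented for illustration only (they are not taken from the slides), but each row is normalised as the constraints require:

```python
# First-order Markov model: P(s|M) = a_BEGIN,s1 * prod_{i=2..T} a_{s_{i-1},s_i} * a_{s_T,END}
begin = {'A': 0.4, 'L': 0.3, 'K': 0.2, 'I': 0.1}          # a_BEGIN,t (sums to 1)
trans = {                                                  # a_rt, including the END transition
    'A': {'A': 0.0, 'L': 0.5, 'K': 0.2, 'I': 0.2, 'END': 0.1},
    'L': {'A': 0.3, 'L': 0.0, 'K': 0.3, 'I': 0.3, 'END': 0.1},
    'K': {'A': 0.6, 'L': 0.2, 'K': 0.0, 'I': 0.1, 'END': 0.1},
    'I': {'A': 0.2, 'L': 0.2, 'K': 0.2, 'I': 0.2, 'END': 0.2},
}

def markov_prob(seq):
    p = begin[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        p *= trans[prev][curr]
    return p * trans[seq[-1]]['END']

# a_BEGIN,A * a_AL * a_LK * a_KA * a_AL * a_LI * a_I,END
print(markov_prob("ALKALI"))
```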

  26. Markov Models: Exercise [Figure: two weather Markov models over the states C, R, S, F; some transition probabilities are given, the others are marked ??] 1) Fill in the undefined values of the transition probabilities

  27. Markov Models: Exercise [Figure: the two completed weather Markov models over the states C, R, S, F] 2) Which model better describes the weather in summer? Which one describes the weather in winter?

  28. Markov Models: Exercise [Figure: the Winter and Summer weather Markov models] 3) Given the sequence CSSSCFS, which model gives the higher probability? [Consider the starting probabilities: P(X|BEGIN) = 0.25]

  29. Markov Models: Exercise [Figure: the Winter and Summer weather Markov models] P(CSSSCFS | Winter) = 0.25 × 0.1 × 0.2 × 0.2 × 0.3 × 0.2 × 0.2 = 1.2 × 10^-5 P(CSSSCFS | Summer) = 0.25 × 0.4 × 0.8 × 0.8 × 0.1 × 0.1 × 1.0 = 6.4 × 10^-4 4) Can we conclude that the observation sequence refers to a summer week?

  30. Markov Models: Exercise [Figure: the Winter and Summer weather Markov models] P(Seq | Winter) = 1.2 × 10^-5, P(Seq | Summer) = 6.4 × 10^-4 P(Summer | Seq) / P(Winter | Seq) = [ P(Seq | Summer) P(Summer) ] / [ P(Seq | Winter) P(Winter) ]
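To make the comparison explicit, a small sketch with assumed equal priors P(Summer) = P(Winter) = 0.5 (the slides do not fix these values; with different priors the ratio scales accordingly):

```python
# Posterior ratio between the two models for the observed week, via Bayes' rule.
p_seq_winter = 1.2e-5
p_seq_summer = 6.4e-4
p_winter, p_summer = 0.5, 0.5       # assumed a priori probabilities

ratio = (p_seq_summer * p_summer) / (p_seq_winter * p_winter)
print(f"P(Summer | Seq) / P(Winter | Seq) = {ratio:.1f}")   # about 53: Summer is strongly favoured
```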

  31. Simple Markov Model for DNA sequences [Figure: fully connected model over the states A, C, G, T] DNA alphabet: C = {Adenine, Cytosine, Guanine, Thymine}. 16 transition probabilities (12 of which are independent) + 4 Begin probabilities + 4 End probabilities. The parameters of the model are different in different zones of DNA. They describe the overall composition and the recurrence of nucleotide pairs

  32. Example of Markov Models: GpC Islands [Figure: two Markov models over the states A, C, G, T, one for GpC Islands (e.g. GATGCGTCGC) and one for Non-GpC Islands (e.g. CTACGCAGCG)] In the Markov Model of GpC Islands a_GC is higher than in the Markov Model of Non-GpC Islands. Given a sequence s we can evaluate P(GpC | s) = P(s | GpC) P(GpC) / [ P(s | GpC) P(GpC) + P(s | nonGpC) P(nonGpC) ]
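A compact sketch of this posterior with two illustrative first-order models; the raised a_GC value and the prior P(GpC) = 0.1 are made-up numbers, not real genomic statistics:

```python
# Two first-order Markov models over {A, C, G, T}; in the GpC-island model a_GC is raised.
uniform = {b: 0.25 for b in "ACGT"}
gpc_model = {b: dict(uniform) for b in "ACGT"}
gpc_model['G'] = {'A': 0.15, 'C': 0.55, 'G': 0.15, 'T': 0.15}   # higher a_GC
non_gpc_model = {b: dict(uniform) for b in "ACGT"}

def seq_prob(seq, model, start=0.25):
    p = start                                   # P(s1), taken uniform here
    for prev, curr in zip(seq, seq[1:]):
        p *= model[prev][curr]
    return p

def posterior_gpc(seq, prior_gpc=0.1):
    """P(GpC | s) = P(s|GpC) P(GpC) / [ P(s|GpC) P(GpC) + P(s|nonGpC) P(nonGpC) ]"""
    num = seq_prob(seq, gpc_model) * prior_gpc
    den = num + seq_prob(seq, non_gpc_model) * (1 - prior_gpc)
    return num / den

print(posterior_gpc("GATGCGTCGC"))   # about 0.16: above the 0.1 prior for this GC-pair-rich sequence
```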

  33. The simplest probabilistic models: Markov Models • Definition • Training

  34. Training of Markov Models • Let θ_M be the set of parameters of model M. • During the training phase, the parameters θ_M are estimated from the set of known data D. • Maximum Likelihood Estimation (ML): θ_ML = argmax_θ P( D | M, θ ) • It can be proved that the ML estimate is the frequency of occurrence as counted in the data set D: a_ik = n_ik / Σ_j n_ij • Maximum A Posteriori Estimation (MAP): θ_MAP = argmax_θ P( θ | M, D ) = argmax_θ [ P( D | M, θ ) · P( θ ) ]
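A sketch of this counting procedure in Python; the training sequences are a toy data set invented for the example, with BEGIN and END handled as extra states as in the slide:

```python
from collections import defaultdict

def train_markov(sequences):
    """Maximum Likelihood estimate a_ik = n_ik / sum_j n_ij from observed transitions."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        path = ['BEGIN'] + list(seq) + ['END']
        for prev, curr in zip(path, path[1:]):
            counts[prev][curr] += 1
    # Normalise each row of counts into transition probabilities
    return {i: {k: n / sum(row.values()) for k, n in row.items()}
            for i, row in counts.items()}

params = train_markov(["ALKALI", "ALKALA", "ILKA"])   # toy data set D
print(params['A'])   # ML transition probabilities out of state A
```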

  35. Maximum Likelihood training: Proof Given a sequence s contained in D: s = s1 s2 s3 s4 s5 … sT, we can count the number of transitions between any two states j and k: n_jk, where states 0 and N+1 are BEGIN and END. Normalisation constraints are taken into account using the Lagrange multipliers λ_k

  36. Hidden Markov Models • Preliminary examples

  37. Loaded dice We have 99 regular dice (R) and 1 loaded die (L).
        P(1)   P(2)   P(3)   P(4)   P(5)   P(6)
   R    1/6    1/6    1/6    1/6    1/6    1/6
   L    1/10   1/10   1/10   1/10   1/10   1/2
  Given a sequence: 4156266656321636543662152611536264162364261664616263 we don’t know the sequence of dice that generated it: RRRRRLRLRRRRRRRLRRRRRRRRRRRRLRLRRRRRRRRLRRRRLRRRRLRR

  38. Loaded dice Hypothesis: we choose a different die for each roll. Two stochastic processes give rise to the sequence of observations: 1) choosing the die (R or L); 2) rolling the die. The sequence of dice is hidden. The first process is assumed to be Markovian (in this case a 0-order MM); the outcome of the second process depends only on the state reached in the first process (that is, the chosen die)

  39. Loaded dice: Model [Figure: two-state model with states R and L, self-transitions 0.99 and switch transitions 0.01] • Each state (R and L) generates a character of the alphabet C = {1, 2, 3, 4, 5, 6} • The emission probabilities depend only on the state • The transition probabilities describe a Markov model that generates a state path: the hidden sequence (π) • The observation sequence (s) is generated by two concomitant stochastic processes

  40. Some semi-serious examples 1) DEMOGRAPHY Observable: number of births and/or deaths per year in a place Hidden variable: economic status (to a first approximation, if we regard the economic “fortune” of a year as a random process, the accumulated wealth is a Markovian process) ---> can we reconstruct the economic status from the civil registries? 2) WEATHER-SENSITIVE TEACHER Observable: daily average of the marks recorded in the register of a weather-sensitive teacher Hidden variable: weather conditions ---> can we reconstruct the weather from the teacher’s register?

  41. Some more serious examples 1) SECONDARY STRUCTURE Observable: protein sequence Hidden variable: secondary structure ---> can we predict the secondary structure from the sequence? 2) ALIGNMENTS Observable: protein sequence Hidden variable: position of each residue in the alignment of a protein family ---> can we reconstruct the alignment of the sequence to the protein family starting from the sequence?

  42. GpC Islands [Figure: eight-state model with states A+, C+, G+, T+ (GpC Islands) and A-, C-, G-, T- (Non-GpC Islands)] Given a long non-annotated DNA sequence, we want to localise the GpC Islands (if they exist). We build a model that unifies the two Markov models for GpC Islands and Non-GpC Islands; transitions between any state of the first and any state of the second are added. Now there is no longer a one-to-one correspondence between states and symbols

  43. GpC Islands: conditioning events Instead of the eight-state model, can we use a model similar to that of the dice example, with two states (GpC Islands / Non-GpC Islands) emitting the alphabet C = {A, G, C, T}? [Figure: two-state model, each state emitting A, C, G, T] Using such a model, all the characters of the generated sequence would be independent of the preceding character

  44. Hidden Markov Models • Preliminary examples • Formal definition

  45. Formal definition of Hidden Markov Models • A HMM is a stochastic generator of sequences characterised by: • N states • A set of transition probabilities between two states {a_kj}: a_kj = P( π(i) = j | π(i-1) = k ) • A set of starting probabilities {a_0k}: a_0k = P( π(1) = k ) • A set of ending probabilities {a_k0}: a_k0 = P( π(i) = END | π(i-1) = k ) • An alphabet C with M characters • A set of emission probabilities for each state {e_k(c)}: e_k(c) = P( s_i = c | π(i) = k ) • Constraints: Σ_k a_0k = 1; a_k0 + Σ_j a_kj = 1 ∀k; Σ_{c∈C} e_k(c) = 1 ∀k • Here s is the sequence and π is the path through the states

  46. Generating a sequence with a HMM Choose the initial state π(1) following the probabilities a_0k and set i = 1. Choose the character s_i from the alphabet C following the probabilities e_k(c). Choose the next state following the probabilities a_kj and a_k0. Is the END state chosen? If no, set i ← i + 1 and repeat; if yes, end.
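A sketch of this generation loop in Python, using the parameters of the two-state GpC Island model shown on the next slide:

```python
import random

start = {'Y': 0.2, 'N': 0.8}                              # a_0k
trans = {'Y': {'Y': 0.7, 'N': 0.2, 'END': 0.1},           # a_kj plus a_k0 (to END)
         'N': {'Y': 0.1, 'N': 0.8, 'END': 0.1}}
emit  = {'Y': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},   # e_k(c)
         'N': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}}

def choose(dist):
    """Draw one key from a {outcome: probability} dictionary."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate():
    seq, path = [], []
    state = choose(start)                  # choose pi(1) from a_0k
    while state != 'END':
        path.append(state)
        seq.append(choose(emit[state]))    # emit s_i following e_k(c)
        state = choose(trans[state])       # choose the next state (possibly END)
    return ''.join(seq), ''.join(path)

print(generate())    # e.g. ('GCGTAAT', 'YYYNNNN')
```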

  47. GpC Island, simple model BEGIN: a_0Y = 0.2, a_0N = 0.8. States: Y (GpC Island), N (Non-GpC Island). Transitions: a_YY = 0.7, a_YN = 0.2, a_NY = 0.1, a_NN = 0.8; to END: a_Y0 = 0.1, a_N0 = 0.1. Emissions: e_Y(A) = 0.1, e_Y(G) = 0.4, e_Y(C) = 0.4, e_Y(T) = 0.1; e_N(A) = 0.25, e_N(G) = 0.25, e_N(C) = 0.25, e_N(T) = 0.25. • s: AGCGCGTAATCTG • π: YYYYYYYNNNNNN • P( s, π | M ) can be easily computed

  48. GpC Island, simple model (same parameters as above) P( s, π | M ) can be easily computed: s: A G C G C G T A A T C T G π: Y Y Y Y Y Y Y N N N N N N Emission: 0.1 × 0.4 × 0.4 × 0.4 × 0.4 × 0.4 × 0.1 × 0.25 × 0.25 × 0.25 × 0.25 × 0.25 × 0.25 Transition: 0.2 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.2 × 0.8 × 0.8 × 0.8 × 0.8 × 0.8 × 0.1 Multiplying all the probabilities gives the probability of having the sequence AND the path through the states

  49. Evaluation of the joint probability of the sequence and the path: P( s, π | M ) = a_0,π(1) · Π_{i=1..T} [ e_π(i)( s_i ) · a_π(i),π(i+1) ], where π(T+1) = END
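A sketch of this evaluation in Python, using the GpC Island parameters of the previous slides; the printed value is the product of the emission and transition factors listed on slide 48:

```python
start = {'Y': 0.2, 'N': 0.8}
trans = {'Y': {'Y': 0.7, 'N': 0.2, 'END': 0.1},
         'N': {'Y': 0.1, 'N': 0.8, 'END': 0.1}}
emit  = {'Y': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
         'N': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}}

def joint_prob(seq, path):
    """P(s, pi | M): start probability times all emissions and transitions, ending in END."""
    p = start[path[0]]
    for i, (c, k) in enumerate(zip(seq, path)):
        p *= emit[k][c]                                      # emission e_k(s_i)
        nxt = path[i + 1] if i + 1 < len(path) else 'END'    # last state transitions to END
        p *= trans[k][nxt]                                   # transition a_kj (or a_k0)
    return p

print(joint_prob("AGCGCGTAATCTG", "YYYYYYYNNNNNN"))
```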

  50. Hidden Markov Models • Preliminary examples • Formal definition • Three questions
