Molecular Evolution

Molecular Evolution
Distance Methods
Biol. Luis Delaye
Facultad de Ciencias, UNAM

Mainly a STATISTICAL problem!

• Models of sequence evolution
• Sequence similarity
• Estimating the number of substitutions between two sequences
• Phylogenetic reconstruction

Evolution at the molecular level is the substitution of one allele by another.

[Figure: frequency (0 to 1) of alleles A, B and C plotted against time]

The basic forces are mutation, genetic drift and natural selection.

By this process, a DNA sequence accumulates substitutions through time.

[Figure: DNA sequences along a time axis t: TAGCGTAGG, ATCGCATCC, ATTGCGTAC, TAACCCATG]

In the study of molecular evolution, these changes in a DNA sequence are used for both:
• Estimating the rate of molecular evolution
• Reconstructing the evolutionary history

Models of sequence evolution

Models of DNA evolution

[Figure: nucleotide A is replaced by nucleotide C with probability p during time interval t]

To study the dynamics of nucleotide substitution we must make assumptions regarding the probability (p) of substitution of one nucleotide by another at the end of time interval t.

For instance, $p_{AC}$ represents the probability that a site that started with nucleotide i (A in this case) has changed to nucleotide j (C in this case) at the end of interval t.

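Models of DNA evolution using matrix theory

Substitution probability matrix (rows: initial nucleotide; columns: final nucleotide; order A, C, G, T):

\[
P_t = \begin{pmatrix}
p_{AA} & p_{AC} & p_{AG} & p_{AT} \\
p_{CA} & p_{CC} & p_{CG} & p_{CT} \\
p_{GA} & p_{GC} & p_{GG} & p_{GT} \\
p_{TA} & p_{TC} & p_{TG} & p_{TT}
\end{pmatrix}
\]

Base composition of sequences:

\[
f = [\, f_A \;\; f_C \;\; f_G \;\; f_T \,]
\]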

The Jukes and Cantor’s One-Parameter Model

[Figure: the four nucleotides A, G, C and T; every substitution between any two of them occurs at the same rate α]

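The Jukes and Cantor’s One-Parameter Model

Substitution probability matrix:

\[
P_t = \begin{pmatrix}
* & \alpha & \alpha & \alpha \\
\alpha & * & \alpha & \alpha \\
\alpha & \alpha & * & \alpha \\
\alpha & \alpha & \alpha & *
\end{pmatrix},
\qquad
p_{ii} = 1 - \sum_{j \neq i} p_{ij}
\]

where each diagonal entry * is $p_{ii}$.

Base composition of sequences:

\[
f = [\, \tfrac{1}{4} \;\; \tfrac{1}{4} \;\; \tfrac{1}{4} \;\; \tfrac{1}{4} \,]
\]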
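As a minimal sketch, the one-step Jukes and Cantor matrix can be built and checked numerically. The value of alpha and the helper names `jc_step_matrix` and `propagate` are illustrative assumptions, not part of the slides. The check confirms that every row sums to 1 and that the uniform base composition f is left unchanged by one step.

```python
# Row/column order A, C, G, T, as in the substitution probability matrix.
NUCLEOTIDES = "ACGT"


def jc_step_matrix(alpha):
    """One-step Jukes-Cantor matrix: alpha off the diagonal, 1 - 3*alpha on it."""
    if not 0.0 <= alpha <= 1.0 / 3.0:
        raise ValueError("alpha must lie in [0, 1/3] so every entry is a probability")
    return [[1.0 - 3.0 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]


def propagate(f, P):
    """Base composition after one step: the row vector f times the matrix P."""
    return [sum(f[i] * P[i][j] for i in range(4)) for j in range(4)]


alpha = 0.01  # illustrative value, not taken from the slides
P = jc_step_matrix(alpha)
row_sums = [sum(row) for row in P]
f = [0.25, 0.25, 0.25, 0.25]
f_next = propagate(f, P)

for nucleotide, row in zip(NUCLEOTIDES, P):
    print(nucleotide, row)
print("row sums:", row_sums)
print("f after one step:", f_next)
```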

The Jukes and Cantor’s One-Parameter Model

What is the probability of having an A at a site in a DNA sequence at time t = 1, given that the site started with an A at time t = 0?

$p_A(0) = 1$, since we started with A.
$p_A(1) = 1 - 3\alpha$, the probability that the nucleotide has remained unchanged.

The Jukes and Cantor’s One-Parameter Model

What is the probability of having an A at a site in a DNA sequence at time t = 2? There are two scenarios (after Li, 1997):

Scenario 1: A at t = 0, A at t = 1 (no substitution), A at t = 2 (no substitution). Probability: $p_A(1)\,(1 - 3\alpha)$.
Scenario 2: A at t = 0, not A at t = 1 (substitution), A at t = 2 (substitution). Probability: $[1 - p_A(1)]\,\alpha$.

The two scenarios are mutually exclusive, so their probabilities add.

The Jukes and Cantor’s One-Parameter Model

What is the probability of having an A at a site in a DNA sequence at time t = 2?

\[
p_A(2) = (1 - 3\alpha)\,p_A(1) + \alpha\,[1 - p_A(1)]
\]

The first term is the probability of no change:
• $p_A(1)$: the probability of not having a substitution from t = 0 to t = 1
• $(1 - 3\alpha)$: the probability of not having a substitution from t = 1 to t = 2

The second term is the probability of reversible change:
• $[1 - p_A(1)]$: the probability of having a substitution from A to not A, from t = 0 to t = 1
• $\alpha$: the probability of having a substitution from not A to A, from t = 1 to t = 2

The Jukes and Cantor’s One-Parameter Model

The following recurrence equation holds for any t:

\[
p_A(t + 1) = (1 - 3\alpha)\,p_A(t) + \alpha\,[1 - p_A(t)]
\]
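The recurrence can be iterated directly. The sketch below uses an illustrative alpha and a hypothetical helper name `p_a_recurrence`. It checks that p_A(2) equals the sum of the two scenarios, and that after many steps the probability settles at the fixed point of the recurrence, p = 1/4 (obtained by solving p = (1 - 3α)p + α(1 - p)).

```python
def p_a_recurrence(alpha, steps, p0=1.0):
    """Iterate p_A(t+1) = (1 - 3*alpha)*p_A(t) + alpha*(1 - p_A(t))."""
    p = p0
    for _ in range(steps):
        p = (1.0 - 3.0 * alpha) * p + alpha * (1.0 - p)
    return p


alpha = 0.01  # illustrative value, not taken from the slides

p1 = p_a_recurrence(alpha, 1)
p2 = p_a_recurrence(alpha, 2)

# The two scenarios for t = 2, starting from A at t = 0.
scenario_1 = p1 * (1.0 - 3.0 * alpha)  # A -> A -> A
scenario_2 = (1.0 - p1) * alpha        # A -> not A -> A

p_long = p_a_recurrence(alpha, 2000)   # many steps later

print("p_A(1) =", p1)
print("p_A(2) =", p2, "scenario sum =", scenario_1 + scenario_2)
print("p_A(2000) =", p_long)
```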

The Jukes and Cantor’s One-Parameter Model

Rewriting this equation in terms of the amount of change per time step, $\Delta p_A(t) = p_A(t + 1) - p_A(t)$, and doing some algebra:

\[
\begin{aligned}
p_A(t + 1) - p_A(t) &= (1 - 3\alpha)\,p_A(t) + \alpha\,[1 - p_A(t)] - p_A(t) \\
p_A(t + 1) - p_A(t) &= p_A(t) - 3\alpha\,p_A(t) + \alpha\,[1 - p_A(t)] - p_A(t) \\
\Delta p_A(t) &= -3\alpha\,p_A(t) + \alpha\,[1 - p_A(t)] \\
\Delta p_A(t) &= -4\alpha\,p_A(t) + \alpha
\end{aligned}
\]

The Jukes and Cantor’s One-Parameter Model

Rewriting this equation for a continuous time model:

\[
\frac{d\,p_A(t)}{dt} = -4\alpha\,p_A(t) + \alpha
\]

The solution is given by:

\[
p_A(t) = \tfrac{1}{4} + \left( p_A(0) - \tfrac{1}{4} \right) e^{-4\alpha t}
\]
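One way to convince oneself that the solution satisfies the differential equation is a numerical check. This sketch approximates the derivative of the closed-form solution by a central finite difference and compares it with the right-hand side -4α p_A(t) + α. The values of alpha, p0 and the step h, and the function names, are illustrative assumptions.

```python
import math


def p_a(t, alpha, p0):
    """Closed-form solution p_A(t) = 1/4 + (p_A(0) - 1/4) * exp(-4*alpha*t)."""
    return 0.25 + (p0 - 0.25) * math.exp(-4.0 * alpha * t)


def rhs(p, alpha):
    """Right-hand side of dp_A/dt = -4*alpha*p_A + alpha."""
    return -4.0 * alpha * p + alpha


alpha, p0, h = 0.01, 1.0, 1e-4  # illustrative values

max_residual = 0.0
for t in (0.0, 5.0, 25.0, 100.0):
    # Central finite difference approximates the derivative at time t.
    derivative = (p_a(t + h, alpha, p0) - p_a(t - h, alpha, p0)) / (2.0 * h)
    residual = abs(derivative - rhs(p_a(t, alpha, p0), alpha))
    max_residual = max(max_residual, residual)
    print(f"t = {t:6.1f}  dp/dt = {derivative:.6e}  residual = {residual:.2e}")
```

The residuals should be at the level of floating-point noise, and p_a(0, alpha, p0) returns p0 as required by the initial condition.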

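The Jukes and Cantor’s One-Parameter Model

Since we started with A, $p_A(0) = 1$:

\[
p_A(t) = \tfrac{1}{4} + \left( 1 - \tfrac{1}{4} \right) e^{-4\alpha t} = \tfrac{1}{4} + \tfrac{3}{4}\, e^{-4\alpha t}
\]

And if we start with not A, $p_A(0) = 0$:

\[
p_A(t) = \tfrac{1}{4} + \left( 0 - \tfrac{1}{4} \right) e^{-4\alpha t} = \tfrac{1}{4} - \tfrac{1}{4}\, e^{-4\alpha t}
\]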

The Jukes and Cantor’s One-Parameter Model

We can write the equations in a more explicit form.

The probability of initially having A, and still having A at time t, is:

\[
p_{AA}(t) = \tfrac{1}{4} + \tfrac{3}{4}\, e^{-4\alpha t}
\]

The probability of initially having G, and then having A at time t, is:

\[
p_{GA}(t) = \tfrac{1}{4} - \tfrac{1}{4}\, e^{-4\alpha t}
\]

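The Jukes and Cantor’s One-Parameter Model

And since all nucleotides are equivalent under the JC model, $p_{GA}(t) = p_{CA}(t) = p_{TA}(t)$. In general:

\[
p_{ii}(t) = \tfrac{1}{4} + \tfrac{3}{4}\, e^{-4\alpha t},
\qquad
p_{ij}(t) = \tfrac{1}{4} - \tfrac{1}{4}\, e^{-4\alpha t}
\quad \text{where } i \neq j
\]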

$p_A(t)$ can also be interpreted as the frequency of A in a DNA sequence. For example, if we start with a sequence made of A’s only, then $p_A(0) = 1$, and $p_A(t)$ is the expected frequency of A in the sequence at time t.
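The frequency interpretation can be illustrated with a small simulation. This is only a sketch under stated assumptions: time advances in discrete steps, each site changes to each of the three other nucleotides with probability alpha per step, and the number of sites, alpha, the number of steps and the random seed are arbitrary choices. Because the simulation is discrete, agreement with the continuous-time formula is approximate.

```python
import math
import random


def simulate_frequency_of_a(n_sites, alpha, steps, seed=0):
    """Evolve a sequence made of A's only and return the final frequency of A."""
    rng = random.Random(seed)
    others = {"A": "CGT", "C": "AGT", "G": "ACT", "T": "ACG"}
    sites = ["A"] * n_sites
    for _ in range(steps):
        for k in range(n_sites):
            # A site changes with total probability 3*alpha, and the new
            # nucleotide is chosen uniformly among the other three.
            if rng.random() < 3.0 * alpha:
                sites[k] = rng.choice(others[sites[k]])
    return sites.count("A") / n_sites


alpha, steps = 0.01, 50  # illustrative values
observed = simulate_frequency_of_a(20000, alpha, steps)
expected = 0.25 + 0.75 * math.exp(-4.0 * alpha * steps)

print("simulated frequency of A:", observed)
print("p_A(t) from the formula: ", expected)
```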

The Jukes and Cantor’s One-Parameter Model

[Figure: temporal changes in the probability of having a certain nucleotide at a given nucleotide site (α = 5 × 10⁻⁹ substitutions/site/year). x-axis: time (million years), 0 to 200; y-axis: probability, 0 to 1. The curve for p_ii starts at 1 and the curve for p_ij starts at 0; both approach ¼.]
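The two curves in the figure can be tabulated from the formulas for p_ii(t) and p_ij(t), using the rate quoted in the caption. The function names and the 20-million-year spacing are illustrative choices. By 200 million years both probabilities are close to ¼ (about 0.26 and 0.25).

```python
import math

ALPHA = 5e-9  # substitutions/site/year, the value quoted in the figure caption


def p_same(years, alpha=ALPHA):
    """p_ii(t): the site holds the same nucleotide it started with."""
    return 0.25 + 0.75 * math.exp(-4.0 * alpha * years)


def p_diff(years, alpha=ALPHA):
    """p_ij(t), i != j: the site holds one particular other nucleotide."""
    return 0.25 - 0.25 * math.exp(-4.0 * alpha * years)


# One row per 20 million years, covering the 0 to 200 range of the x-axis.
table = [(myr, p_same(myr * 1e6), p_diff(myr * 1e6)) for myr in range(0, 201, 20)]

for myr, same, diff in table:
    print(f"{myr:3d} Myr   p_ii = {same:.3f}   p_ij = {diff:.3f}")
```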

Other models of sequence evolution

The Kimura two-Parameter Model

[Figure: the four nucleotides A, G, C and T. Transitions (A ↔ G and C ↔ T) occur at rate α; transversions (A ↔ C, A ↔ T, G ↔ C, G ↔ T) occur at rate β.]

The Kimura two-Parameter Model

[Figure: number of transitions and transversions between pairs of bovid mammal mitochondrial sequences (684 base pairs from the COII gene) against the estimated time of divergence. x-axis: time since divergence (Myr), 0 to 25; y-axis: base pair differences, 0 to 100.]

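The Kimura two-Parameter Model

Substitution probability matrix (order A, C, G, T; α for transitions, β for transversions):

\[
P_t = \begin{pmatrix}
* & \beta & \alpha & \beta \\
\beta & * & \beta & \alpha \\
\alpha & \beta & * & \beta \\
\beta & \alpha & \beta & *
\end{pmatrix},
\qquad
p_{ii} = 1 - \sum_{j \neq i} p_{ij}
\]

Base composition of sequences:

\[
f = [\, \tfrac{1}{4} \;\; \tfrac{1}{4} \;\; \tfrac{1}{4} \;\; \tfrac{1}{4} \,]
\]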
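A minimal sketch of the one-step Kimura matrix follows, assuming the nucleotide order A, C, G, T. Each nucleotide has one possible transition and two possible transversions, so the diagonal entry is 1 - α - 2β. The parameter values and helper names are illustrative; setting α equal to β recovers the Jukes and Cantor matrix.

```python
NUCLEOTIDES = "ACGT"
PURINES = {"A", "G"}


def is_transition(x, y):
    """True for A<->G (purine to purine) or C<->T (pyrimidine to pyrimidine)."""
    return x != y and ((x in PURINES) == (y in PURINES))


def k2p_step_matrix(alpha, beta):
    """One-step Kimura matrix: alpha for transitions, beta for transversions."""
    if alpha < 0.0 or beta < 0.0 or alpha + 2.0 * beta > 1.0:
        raise ValueError("need alpha, beta >= 0 and alpha + 2*beta <= 1")
    P = []
    for x in NUCLEOTIDES:
        row = []
        for y in NUCLEOTIDES:
            if x == y:
                row.append(1.0 - alpha - 2.0 * beta)
            elif is_transition(x, y):
                row.append(alpha)
            else:
                row.append(beta)
        P.append(row)
    return P


P = k2p_step_matrix(alpha=0.02, beta=0.005)  # illustrative values
P_equal = k2p_step_matrix(alpha=0.01, beta=0.01)  # alpha == beta: Jukes-Cantor

for nucleotide, row in zip(NUCLEOTIDES, P):
    print(nucleotide, row)
```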

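The Felsenstein (1981) Model

Substitution probability matrix (order A, C, G, T):

\[
P_t = \begin{pmatrix}
* & \pi_C\,\alpha & \pi_G\,\alpha & \pi_T\,\alpha \\
\pi_A\,\alpha & * & \pi_G\,\alpha & \pi_T\,\alpha \\
\pi_A\,\alpha & \pi_C\,\alpha & * & \pi_T\,\alpha \\
\pi_A\,\alpha & \pi_C\,\alpha & \pi_G\,\alpha & *
\end{pmatrix},
\qquad
p_{ii} = 1 - \sum_{j \neq i} p_{ij}
\]

This model assumes that there is variation in base composition.

Base composition of sequences:

\[
f = [\, \pi_A \;\; \pi_C \;\; \pi_G \;\; \pi_T \,]
\]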

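The Hasegawa, Kishino and Yano (1985) Model

Substitution probability matrix (order A, C, G, T; α for transitions, β for transversions):

\[
P_t = \begin{pmatrix}
* & \pi_C\,\beta & \pi_G\,\alpha & \pi_T\,\beta \\
\pi_A\,\beta & * & \pi_G\,\beta & \pi_T\,\alpha \\
\pi_A\,\alpha & \pi_C\,\beta & * & \pi_T\,\beta \\
\pi_A\,\beta & \pi_C\,\alpha & \pi_G\,\beta & *
\end{pmatrix},
\qquad
p_{ii} = 1 - \sum_{j \neq i} p_{ij}
\]

This model assumes that there is variation in base composition and that transitions and transversions occur at different rates.

Base composition of sequences:

\[
f = [\, \pi_A \;\; \pi_C \;\; \pi_G \;\; \pi_T \,]
\]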

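The General Reversible (REV) Model

Substitution probability matrix (order A, C, G, T):

\[
P_t = \begin{pmatrix}
* & \pi_C\,a & \pi_G\,b & \pi_T\,c \\
\pi_A\,a & * & \pi_G\,d & \pi_T\,e \\
\pi_A\,b & \pi_C\,d & * & \pi_T\,f \\
\pi_A\,c & \pi_C\,e & \pi_G\,f & *
\end{pmatrix},
\qquad
p_{ii} = 1 - \sum_{j \neq i} p_{ij}
\]

This model assumes that there is variation in base composition and that each substitution has its own probability.

Base composition of sequences:

\[
f = [\, \pi_A \;\; \pi_C \;\; \pi_G \;\; \pi_T \,]
\]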

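Comparing the Models

[Figure, from Page and Holmes (1998): how the models are nested]

• Jukes-Cantor to Kimura 2 parameter: allow for a transition/transversion (α/β) bias
• Jukes-Cantor to Felsenstein (1981): allow for base frequency to vary
• Kimura 2 parameter to Hasegawa, Kishino and Yano (1985): allow for base frequency to vary
• Felsenstein (1981) to Hasegawa, Kishino and Yano (1985): allow for a transition/transversion (α/β) bias
• Hasegawa, Kishino and Yano (1985) to General Reversible (REV): allow all six pairs of substitutions to have different rates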
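The nesting of the models can be sketched in code by building every matrix from one general function, with off-diagonal entries p_ij = π_j × rate(i, j) as in the REV matrix, and then restricting the base frequencies and the six pair rates. All numerical values, the `skewed` base composition and the helper names are made-up illustrations. The last helper checks the reversibility condition π_i p_ij = π_j p_ji, which holds for every model built this way because the pair rates are symmetric.

```python
NUCLEOTIDES = "ACGT"
# The six unordered pairs, in the order of the rates a, b, c, d, e, f.
PAIRS = ["AC", "AG", "AT", "CG", "CT", "GT"]


def rev_step_matrix(pi, rates):
    """p_ij = pi[j] * rate(i, j) for i != j; the diagonal makes rows sum to 1."""
    rate = {}
    for pair, r in zip(PAIRS, rates):
        rate[pair] = rate[pair[::-1]] = r  # the same rate in both directions
    P = []
    for x in NUCLEOTIDES:
        row = [0.0 if x == y else pi[y] * rate[x + y] for y in NUCLEOTIDES]
        row[NUCLEOTIDES.index(x)] = 1.0 - sum(row)
        P.append(row)
    return P


def two_rate(alpha, beta):
    """Pair rates with alpha for transitions (AG, CT) and beta for transversions."""
    return [alpha if pair in ("AG", "CT") else beta for pair in PAIRS]


def is_reversible(pi, P, tol=1e-12):
    """Check pi_i * p_ij == pi_j * p_ji for every pair of nucleotides."""
    return all(abs(pi[x] * P[i][j] - pi[y] * P[j][i]) <= tol
               for i, x in enumerate(NUCLEOTIDES)
               for j, y in enumerate(NUCLEOTIDES))


uniform = dict.fromkeys(NUCLEOTIDES, 0.25)
skewed = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}  # made-up base composition

models = {
    "Jukes-Cantor": (uniform, [0.04] * 6),            # one rate, equal frequencies
    "Kimura 2 parameter": (uniform, two_rate(0.08, 0.02)),
    "Felsenstein (1981)": (skewed, [0.04] * 6),        # one rate, unequal frequencies
    "HKY (1985)": (skewed, two_rate(0.08, 0.02)),
    "REV": (skewed, [0.01, 0.08, 0.02, 0.03, 0.07, 0.015]),
}

matrices = {name: rev_step_matrix(pi, rates) for name, (pi, rates) in models.items()}
for name, (pi, _) in models.items():
    print(f"{name:20s} reversible: {is_reversible(pi, matrices[name])}")
```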

Among site rate variation

Among site rate variation

For protein-coding sequences, not all sites have the same probability of change (there is among site rate variation). If this effect is not taken into account, the number of substitutions per site between two sequences can be underestimated (Li and Graur, 1991).

Effect of among site rate variation on sequence divergence

[Figure, from Page and Holmes (1998): sequence divergence under two scenarios]
(A) Substitution rate of 0.5 % per Myr and 80 % of the sites free to vary
(B) Substitution rate of 2 % per Myr and 50 % of the sites free to vary

Gamma distribution

\[
f(r) = \frac{b^{a}}{\Gamma(a)}\, e^{-br}\, r^{a-1}
\qquad \text{where} \qquad
\Gamma(a) = \int_0^{\infty} e^{-t}\, t^{a-1}\, dt
\]
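The density can be evaluated with the standard library, since `math.gamma` computes Γ(a). The sketch below uses illustrative values a = b = 2 and a simple midpoint rule (a hypothetical helper, `midpoint_integral`) to check numerically that the density integrates to 1 and that the mean rate is a/b, a standard property of this distribution. With a = b the mean rate is 1, and smaller values of a give a more skewed distribution of rates among sites.

```python
import math


def gamma_density(r, a, b):
    """f(r) = b**a / Gamma(a) * exp(-b*r) * r**(a - 1), for r > 0."""
    return b ** a / math.gamma(a) * math.exp(-b * r) * r ** (a - 1.0)


def midpoint_integral(g, lo, hi, n=100_000):
    """Approximate the integral of g over [lo, hi] with the midpoint rule."""
    h = (hi - lo) / n
    return h * sum(g(lo + (k + 0.5) * h) for k in range(n))


a, b = 2.0, 2.0  # illustrative shape and scale parameters

# The upper limit 40 stands in for infinity; the tail beyond it is negligible here.
total = midpoint_integral(lambda r: gamma_density(r, a, b), 0.0, 40.0)
mean = midpoint_integral(lambda r: r * gamma_density(r, a, b), 0.0, 40.0)

print("integral of f(r):", total)
print("mean rate:", mean, "expected a/b =", a / b)
```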

The shape parameter a

Time reversibility

Time reversibility in the Jukes and Cantor’s One-Parameter Model

[Figure: two sites A descend from a common ancestral site A, each along a branch of length t. Each branch is labelled with the probability p_AA(t), and the pair of branches with p_AA(t)^2. A linear path A, A, A at t = 0, t = 1 and t = 2 carries the same labels.]
