Bioinformatics Algorithms and Data Structures

Bioinformatics Algorithms and Data Structures Probabilistic Approaches to Phylogeny Lecturer: Dr. Rose Slides by: Dr. Rose April 24, 2003

Probabilistic Approaches to Phylogeny Principal Goal: rank trees according to: • Their likelihood P(data|tree) or • Their posterior probability P(tree|data) Secondary goal: • Find the probability of some taxonomic feature • Example: the grouping of a set of taxa on a branch

Probabilistic Approaches to Phylogeny Notation and definitions: Let P(x•|T,t•) denote the probability of a set of data given a tree, where: • x• denotes n sequences • T denotes a tree with n leaves with sequence j at leaf j • t• denotes the edge lengths of the tree The definition of P(x•|T,t•) depends on our choice of model of evolution.

Probabilistic Approaches to Phylogeny Let P(x|y,t) denote the probability that sequence y evolves into x along an edge of length t. Assume that we can define P(x|y,t). If we can do this for each edge of T  we can calculate the probability of T. Let’s look at an example.

Probabilistic Approaches to Phylogeny There is a small problem: normally, we do not have sequences for internal nodes. Solution: calculate P(x1,.., x3 | T, t•) by summing over all possible ancestors x4 & x5. Given this model we can search for T and t• that maximizes P(x• | T, t•) Maximizing P(x• | T, t•) requires: • Searching over possible tree topologies, T • Searching over possible lengths of edges, t•

Probabilistic Approaches to Phylogeny Concerning the search over tree topologies: Q: How many rooted binary trees are there with n leaves? A: (2n – 3)!! rooted binary trees with n leaves. For n = 10 there are ~ 40 million trees For n = 20 there are ~ 4.4 * 1021 An efficient search procedure might prove useful.

Probabilistic Models of Evolution Ridiculously simplistic model of evolution: • Every site is independent • Deletions and insertions do not occur • Substitution accounts for all evolution Let P(b|a, t) denote the probability of the substitution of residue b for residue a over an edge length of t. Extending to aligned gapless sequences x and y, P(x | y, t) = PuP(xu|yu, t), where u indexes over sites

Probabilistic Models of Evolution Consider the substitution matrix which specifies P(b|a,t): A residue alphabet of size K, entails a K-by-K subsitution matrix.

Probabilistic Models of Evolution A nice property for substitution matrices is the multiplicative property in the sense that: S(t)S(s) = S(t + s) for all values of the lengths s and t. This is equivalent to: SbP(a|b, t)P(b|c, s) = P(a|c, s + t) for all a, c, s, and t.

Probabilistic Models of Evolution Viewing t as a time variable, multiplicativity is a consequence of the nature of the substitution process. The process is: • Markovian and • Stationary, i.e., the substitution of a at time t for b at time s depends only on the time interval (t-s)

Probabilistic Models of Evolution Example: Jukes & Cantor [1969] model In this model all nucleotides undergo transitions at the same rate a, giving the rate substitution matrix R: A C G T A C G T

Probabilistic Models of Evolution The substitution matrix for a short time S(e) is approximately I+Re where I is the identity matrix. Recall from the previous slide

Probabilistic Models of Evolution We want to derive the substitution matrix for time t. S(t+e) = S(t)S(e)  S(t)(I+Re) by multiplicativity S(t+e)  S(t)I+ S(t)Re multiplying out S(t+e) - S(t) S(t)Resubtract S(t) from both sides [S(t+e) - S(t)]/e  S(t)R divide by e S´(t) = S(t)R as e 0

Probabilistic Models of Evolution Noting that the rate matrix is symmetrical, we expect something of the following form for S(t): • Substituting our matrix S(t) into S´(t) = S(t)R gives: • ŕ = -3ar +3as, • ś = -as + ar

Probabilistic Models of Evolution From the previous slide: substituting our matrix S(t) into S´(t) = S(t)R gives: ŕ = -3ar +3as, ś = -as + ar Which are satisfied by: rt = ¼(1 + 3e-4at) st = ¼(1 - e-4at) Substituting rt and st into S(t) give the Jukes-Cantor model

Probabilistic Models of Evolution Jukes-Cantor Model:

Probabilistic Models of Evolution • When t = , rt = st = ¼. • Hence the nucleotide equilibrium frequencies implied by the Jukes-Cantor model are: qA = qC = qG = qT = ¼. • This model is too simple even for nucleotide substitution. Q: what obvious substitution issue is ignored? A: The difference between transitions and transversions.

Probabilistic Models of Evolution Transitions: • purine  purine, i.e., A  G • pyrimidine  pyrimidine, i.e., C  T Transversions: • purine  pyrimidine • A  C • A  T • G  C • G  T

Probabilistic Models of Evolution Kimura [1980] proposed a rate matrix that accounts for the difference between transitions and tranversions:

Probabilistic Models of Evolution We can integrate the Kimura rate matrix to give the general time-dependent form: • Where: • rt = ¼(1 - e-4bt) • ut = ¼(1 + e-4bt - 2e-2(a+b)t) • rt = 1 - 2st - ut

Probabilistic Models of Evolution This model is still unrealistic: • The equilibrium frequencies are equal qA = qC = qG = qT = ¼. • This is unrealistic for some taxa where there is a preponderance of bias in AT vs GC ratio.

Probabilistic Models of Evolution • Moving on to protein sequences, we consider evolution models for amino acids • Dayhoff et. al. compiled a matrix Aab • hypothetical phylogenetic tree of 71 families. • compilation of frequecies of transitions between paired residues • Gives pab, the probability of a aligned with b.

Probabilistic Models of Evolution • Next, a second matrix Bab is derived from Aab • Bab = Aab/ScAac • Gives the conditional probability p(b|a) • This is the short time interval estimate. • Let qa denote the frequency of occurrence of residue a in a protein. Similarly for qb. • The expected number of substitutions in a protein is then Sa,bqaqbBab

Probabilistic Models of Evolution • A substitution matrix where Sa,bqaqbBab= 0.01 is a 1 PAM matrix. • Dayhoff et. al. define a 1 PAM (point accepted mutation) matrix as the expected number of substitutions is 1% • Next Bab is scaled to produce a 1 PAM matrix of substitution probabilities

Probabilistic Models of Evolution A third matrix Cab is derived from Bab • Cab = sBab for a  b • Caa = sBaa + (1 – s) • i.e., scale off-diagonals by s and adjust diagonals to maintain a row sum of 1. • s is chosen so that Cab is a 1 PAM matrix. • Let S(1) denote this 1 PAM matrix

Probabilistic Models of Evolution Q: How can substitution matrices for longer times be created? A: raise S(1) to a power n. Example: • S(2) = S(1)S(1) • Entries P(a|b, t = 2) = ScP(a|c, t = 1)P(c|b, t = 1) • These are the substitutions from b to a via any intermediary c

Probabilistic Models of Evolution • S(n) can be viewed as the result of a n steps in a Markov chain with 20 states. • The 20 states correspond to the 20 amino acids • Each step in this Markov chain has probability given by S(1).

Probabilistic Models of Evolution Recap: S(1) was defined by • normalizing the rows of the symmetric matrix A • then rescaling These operations can be interchanged • rescale • then normalize Note: the rescaled matrix is still symmetrical  it can be diagonalized

Probabilistic Models of Evolution We can thus write S(1) = UD(li)U-1 where • U is a coordinate transformation and • D(li) is the diagonal matrix with • eigenvalues l1…l20 on the diagonal. • the eigenvalues range between 0 and 1 and can be written l1 = exp(-mi) • Powers of S(1) are represented very simply in the diagonal matrix system • S(2) = S(1)S(1) = UD(li)U-1UD(li)U-1 = UD(li2)U-1

Probabilistic Models of Evolution For arbitrary t, S(t) = UD(lit)U-1 , i.e., • If Ai is the ith amino acid then • P(Aj|Ai, t) = Skuikexp(-mkt) vkj where • uik and vkj are entries in U and U-1, respectively

Probabilistic Models of Evolution Letting t = , we get the PAM matrix: Where the qAi are the equilibrium frequencies for amino acids.

Likelihood for ungapped alignments Q: What is the simplest tree that we might consider? A: A tree comprised of two leaves, x1 and x2. Q: Why is this interesting? We already know the topology. A: We don’t know the edge lengths!

Likelihood for ungapped alignments Q: How does the likelihood of the tree vary with the edge lengths? A: That is the question  Consider a single site with residues x1, and x2. Assign the variable a to the root as shown below:

Likelihood for ungapped alignments The probability of • having a at the root (we assume the equilibrium distribution) and • substituting a by x1 and • Substituting a by x2 is given by: P(x1, x2, a|T, t1, t2) = qa P(x1|a,t1)P(x2|a,t2) Recall: Since we don’t know what residue is at the root, we must sum over all possibilities.

Likelihood for ungapped alignments Thus the generalized form of the equation is: P(x1, x2|T, t1, t2) = Sa qa P(x1|a,t1)P(x2|a,t2) The generalization of this equations to N sites is:

Likelihood for ungapped alignments Example: likelihood of two sequences CCGGCCGCGCG CGGGCCGGCCG Q: What is the likelihood of these sequences in our simple tree of two leaves? For this example, assume the Jukes-Cantor model. We will need: • The substitution equations from the previous section. • The single site equation from the previous slide.

Likelihood for ungapped alignments Based on: • the substitution matrix from the previous section, • the single site equation from the previous slide, • P(x1, x2|T, t1, t2) = Sa qa P(x1|a,t1)P(x2|a,t2) • The probability of C occurring in both leaves is:

Likelihood for ungapped alignments Q: Why does the first term use the probabilities rt1 and rt2 while the other three terms use st1 and st2? A: the first term models the case where there is no change. • In the first term the root residue is C and both leaves are also C, so there is no substitution  rt1 and rt2 • The other terms model the substitution of the residue at the root by C at the leaves  st1 and st2

Likelihood for ungapped alignments Recall that the equilibrium distribution for Jukes-Cantor entails qA = qC = qG = qT = ¼. Hence: By symmetry P(G,G|T,t1,t2) = P(C,C|T,t1,t2) Likewise, P(G,C|T,t1,t2) = P(C,G|T,t1,t2) = ?

Likelihood for ungapped alignments Recall: rt and st from the previous section: rt = ¼(1 + 3e-4at) st = ¼(1 - e-4at) Substituting gives:

Likelihood for ungapped alignments Ok, now we have probabilities for single sites. We derive the probability for sequences by taking the product of site probabilities, i.e., If there are match sites where the sequences match and mismatch sites where they mismatch we get:

Likelihood for ungapped alignments Q: How can we get the likelihood for multiple sequences? Consider a tree T with edge lengths t. Denote the ancestor of node i by a(i) Let P(x1,..,xn |T, t) denote the probability of generating residues x1,.., xn at the n leaves P(x1,.., xn |T, t) is found by multiplying the probabilities of all edges of the tree.

Likelihood for ungapped alignments P(x1,..,xn |T, t) = Note: this can be computed from the leaves up.

Likelihood for ungapped alignments • Let Lk denote the leaves below node k. • Let P(Lk|a) denote the probability of these leaves given that the residue at k is a. • Let i and j be the children of k. • P(Lk|a) is computed from P(Li|b) and P(Lj|c) for all residues b and c.

Felsenstein’s Likelihood Algorithm Finally, site likelihood values are combined to give likelihood of sequences at leaves. This step assumes independence of sites:

Likelihood for ungapped alignments Let’s look at a simple example. Tree with 3 sequences (only GC bases) CCGGCCGCGCG CGGGCCGGCCG GCCGCCGGGCC Assume Jukes-Cantor model Consider site where C occurs at all leaves: P(C, C, C | T, t1, t2, t3) = ?

Likelihood for ungapped alignments • Equilibrium probability of residue C at the root • C is at all nodes • 4 does not have residue C • Equilibrium probabilities of other residues at root • 4 has same residue as root, i.e., not C • 4 has a different residue from root, but not C • 4 has residue C

Likelihood for ungapped alignments N.B. The sorcery at the colored boxes is a consequence of multiplicativity of the Jukes-Cantor matrices.

Bioinformatics Algorithms and Data Structures