Molecular evolution, cont. Comparing estimation methods. Application to human and mouse sequences

Molecular evolution, cont.Comparing estimation methods.Application to human and mouse sequences Lecture 16, Statistics 246 March 16, 2004

The resolvent method • Müller and Vingron (2000) proposed a fast estimation method for sequence pairs based on resolvents. It can in fact be applied to multiple aligments generated by a reversible Markov process. For  > 0, the resolvent R of a rate matrix Q is given by • R = (I - Q)-1. • Solving for Q gives the following formula: • Q = I - R-1. • It turns out that the resolvent is the Laplace transform of the transition matrices,

Resolvent method, cont. • This is the key formula. (Prove it.) Given many pairs of sequences, not necessarily disjoint, that are separated by t PAMs, an unbiased estimate of P(t) can be obtained by normalizing the symmetrized sum of frequency tables. If P(t) can be estimated for a wide range of t, then we can get an estimate of Q via the last two equations. That is the idea. • In practice, there are two issues: (i) the distances are unknown, and so must be estimated by ML, and (ii), the estimated distances are discrete, and so interpolation must be used to estimate the rate matrix. • Let the estimated distances be 0 < t1 <….< tN . Then the integral is approximately equal to the sum of N pieces:

Resolvent method, cont. • which can be evaluated exactly, after replacing the Ps by their estimates. Summing these integrals gives an estimate of R, and by inversion, of Q.

REVIEW: BLOck SUbstitution Matrices (BLOSUM) • Henikoff and Henikoff (1992) used an ad hoc method that takes time inhomogeneity into account to construct the BLOSUM (block substitution matrix) matrices. The input is a set of blocks, which are gap-free multiplealignments of segments of homologous amino acid sequences. A frequency table is derived from the blocks by summing over the match and mismatch patterns from all within-block pair-wise comparisons. Since a mismatch such as an A aligned to an S, can be written in two ways, AS and SA, we get rid • of the ambiguity by using only AS. In general, a mismatch is represented by • sequences XY, where X precedes Y alphabetically. For example, suppose that in a block with six sequences, two columns are as follows: • ..AD.. • ..AD.. • ..AE.. • ..AE.. • ..AD.. • ..SD..

REVIEW: BLOSUM matrices, cont. • There are a total of 15 pairwise comparisons. The left column contributes 10 AA and 5 AS pairs to the frequency table. Similarly, the right column contributes 6 DD, 1 EE and 8 DE pairs. Adding these column contributions within the block, and then across all blocks, gives a triangular frequency table. The matrix is symmetrized by adding itself to its transpose. Dividing the matrix by its sum yields a symmetric joint distribution, and a substitution matrix is obtained as described for the PAM matrices. • To downweight the contribution of the more closely related sequences to the frequency table, the sequences within each block are clustered. Let  be a fixed number between 0 and 100. Sequences that are more than % similar are “greedily” clustered. In other words, any two sequences that are more than % similar are put in the same cluster, and if each sequence already belongs to some cluster, then the two clusters are combined to form a larger cluster. In the end, the sequences within a block are partitioned into disjoint clusters, so that any two sequences from distinct clusters are less than % similar. It is clear that the clusters are well-defined, i.e., independent of the initial choice of sequences.

REVIEW: BLOSUM matrices, cont. • Sequences in the same cluster are downweighted by the cluster size in cross-cluster pairwise comparisons, and pairwise comparisons of sequence in the same cluster do not contribute to the frequency table. In the example, suppose that the first four sequences are clustered while the last two sequences are not. Then the contribution of the left column is the same as an A-A-S column: 1 AA, 2 AS pairs. The right column is effectively (D/E)-D-D, where D/E represents half a D and half an E. Its contribution is 2 DD (1 + 1/2 + 1/2) and 1 DE (1/2 + 1/2) pairs. Equivalently, sequences in the same cluster are replaced by an “average” sequence with fractional number of residues at each position. Then the frequency table is derived as if the average sequences are real sequences; blocks that have only one cluster are left out. Let the symmetric joint distribution be denoted by f . Let  be the row of column sum of f . Then the substitution matrix BLOSUM is given by • S(, a, b) = 2log2{f(a,b)/(a)(b)}

REVIEW: BLOSUM, final remarks. • If  is 100, then every cluster is of size 1, so f is an average of the substitution patterns over all distances. If  is a small value, such as 20, then there are a small number of large clusters, and the similarity between the sequences from distinct clusters ranges from 0 to 20%, so f depends only on the substitution patterns of the distantly related sequences. • Thus reducing the threshold  gives substitution matrices which are more suitable for aligning distantly related sequences. In this context, BLOSUM40 is to be preferred to BLOSUM80 when aligning distantly related protein sequences.

An inhomogeneous process • In order to compare methods of estimating rate matrices, we are going to define a simple class of inhomogeneous processes, i.e. processes for which no single rate matrix Q describes the evolution.Let Q1 be a calibrated reversible rate matrix, with all off-diagonal entries strictly positive, and let be its stationary distribution. Define a reversible matrixQ0as follows: • Q0(a,b) = (b), if a b; Q0(a,a) = - ba(b), if a=b. • Clearly Q0 is the simplest reversible rate matrix with stationary distribution . Calibrate Q0by scaling, and note that Q0 commutes with Q1. Consider the process Q defined by • Q(x) = Q0 if 0 ≤ x ≤ 100; Q(x) = Q1 if 100 ≤ x ≤ 200. • Note that Q has stationary distribution , is reversible and calibrated. Indeed the same would be true of the more general form • Q(x) = Q0 + g(x)[Q1 - Q0], x≥ 0, for any integrable g : [0, )  [0,1]. • Using some (product integral) theory we omit, we can get well defined inhomogeneous Markov processes having the Q(x) as instantaneous rate matrices at time x≥ 0. We use one further notion: a joint distribution f on amino acids that is embeddable in a continuous time Markov process can be written f = *exp(Q*t*) for a calibrated Q*, stationary distribution * and effective divergence time t*.

Simulation comparison of methods • We will compare the BLOSUM, resolvent and ML methods in the following way. Let   {30, 35, …, 85}. Given sequence blocks generated by a substitution process, run BLOSUM at threshold  (i.e. using sequences more than  % similar) to get f with effective divergence time t*. Apply the resolvent and ML methods to get estimated rate matrices leading to fresand fmle. Define the deviations • dblo = d(f , f(t*)) • dres = d(fres(t*), f(t*)) • dmle = d(fmle(t*), f(t*)), • where d(, ) = a |(a) - (a)|. Suppose that the substitution model is fixed, i.e. the substitution process, the trees and the sequence lengths are all fixed. Then the average of the deviations over all realizations is the sequences is a measure of the relative performance of the methods.

Simulation comparison, cont. • Our simulations are based on a pair of rate matricesQ0 and Q1. Here Q1is the rate matrix estimated by ML from 13,255 pairwise amino acid alignments of length at least 100, which were obtained from the SYSTERS database courtesy of Tobias Müller. Q0is constructed fromQ1as indicated earlier. In every simulation 30 sequence pairs separated by the distances 10, 20, …, 300 PAMs are generated according to a substitution model. In other words, there are 30 blocks, and each block has 2 sequences. Also, all trees have depth 150 PAMs, so that a sequence pair separated by 100 PAMs have a most recent common ancestor 50 PAMs ago, and the most recent common ancestor of a sequence pair separated by 300 PAMs sits at the top of the tree. The simulations differ through the substitution models, coming next, and all sequences have length 5,000. • 1) A reversible homogeneous process generated by Q1. • 2) An inhomogeneous process with Q(t) = Q0 + (Q1 -Q0)t/150, 0 ≤ t ≤150. • 3) An inhomogeneous process with • Q(t) = Q0 , 0 ≤ t ≤100; Q(t) = Q1 , 100 ≤ t ≤150.

Reversible homogeneous process ML better than RES better than BLOSUM, note pattern

Smooth inhomogeneous process ML still best, now BLOSUM next, and RES worst.

Discontinuous inhomogeneous process At low thresholds, BLOSUM beats ML handily (as it should!)

Summary of comparison • For an inhomogeneous process, the BLOSUM method is rather more stable and less biased than either the resolvent or ML methods, at low thresholds, but not so good at high thresholds. BLOSUM is only slightly worse than ML and resolvent at high thresholds. • The simulations suggest that • The BLOSUM method is a robust method for deriving substitution matrices at large evolutionary distances; while • The ML and resolvent methods are better than the BLOSUM (and ML appreciably better than resolvent) for deriving substitution matrices at small distances. These conclusions have been reached by simulation, for pair data, but are probably true more widely.

Modelling genomic DNA base substitution • In the final part of this lecture, we look at estimates of rate matrices connecting human chromosome 22 sequence with the homologous mouse sequence. More precisely, a method previously described is applied to a large collection of local alignments, so-called high scoring pairs, or HSPs. We focus on intergenic regions, and gap-free HSPs, for simplicity. Our assumption is that in a gap-free alignment, every pair of aligned bases has evolved (descended) from a single ancestral base. • We examine the estimated rate matrices for strand-symmetry, and homogeneity across varying base composition. We find approximate strand-symmetry, but considerable heterogeneity. Grouping by GC content helps, but we still see evidence of non-stationarity.

Data preprocessing and summary • The essentially complete sequence of human chromosome 22 was aligned to the mouse sequence using blastz in Pipmaker, giving 47,166 HSPs. Those that intersected genes were removed, leaving 26,593 HSPs totally in the intergenic region. Those closer than 10 bp were merged, and the 10,638 of length greater than 100 bp were used. • Given an HSP, the count matrix is the 4x4 matrix that has the numbers of each of the 16 possible pairs of aligned bases in the HSP. The %identity is the sum of the diagonal counts divided by the total sum = the HSP length. • The percent non-identity is a monotonely increasing function of evolutionary distance, and approximately linear for small distances. • The two marginal frequencies of the count matrix are the base compositions of the human and mouse sequences, respectively. The base composition of an HSP is the average of these two, while the GC content is the fraction of baases that are either G or C. (Protein-coding genes tend to be located in the regions with higher GC-content.) • We now give plots of some of these quantities.

Plot of percent identity against position along chr 22 for each HSP

Plot of percent identity against position for the first 1,000 HSPs Gaps are due to repetetive elements.

Plot of percent identity against position for the 5,001st to 6,000th HSP

Plot of human GC-content against position along chromosome 22 Pretty variable!

Scatterplot of human and mouse GC-content for each HSP

Comparing base compositions • If we denote the human and mouse base compositions for an HSP by h and m, then the difference in base compositions is • D = a |h -m |. • If the sequences are generated by a stationary model, then D  0 as the length n of the HSP increases, at approximately rate n-1/2. • The standardized quantityn-1/2D is a measure of composition imbalance between the two sequences making up an HSP.

Histogram of imbalance between human and mouse base compositions The mean and SD of the histogram of imbalances simulated under a fitted model are 1.3 and 0.6 Some evidence of non-stationarity in substitution.

Plot of composition imbalance against position along chromosome 22

Estimating reversible calibrated rate matrices • We turn now to estimating rate matrices by ML. In principle, each position in each HSP could evolve at a different rate, but with just two species, we have no hope of discovering this. What we do is permit each HSP to have its own evolutionary distance in PAMs separating the human and mouse components. These are estimated by ML, along with the reversible calibrated rate matrices, initially a common one, and later we stratify by GC-content. The methods have been given in the last lecture.

First results • We find that 10,000 times the MLE Q^ is • A C G T • A -99 19 62 18 • C 20 -100 16 64 • G 65 16 -101 20 • T 18 63 19 -100 • The transition rates (AG, CT, bold) are around 60, which is about 3 times the transversion rates. The stationary distribution of Q^ is almost uniform: • A G C T • .26 .25 .24 .25 • while the symmetric part, R^ is • A C G T • A -386 79 253 71 • C 79 -407 65 254 • G 253 65 -413 79 • T 71 254 79 -396 • The transition rates here are almost identical, but the transversion rates are more variable. Clearly Q^ is nearly strand symmetric, i.e., the rate for AC is essentially the same as for TG (the complements), and so on.

Plot of percent identity against estimated distance in PAMs. Curvature corresponds to repeated substitutions. Increasing spread probably corresponds to larger variance and/or lack of fit.

Goodness-of-fit • A diagnostic tool for evaulating the goodness of fit of the model to all the 4x4 tables is the Pearson chi-squared statistic. Suppose that ni and tiare the length and estimated distance for the ith HSP. Then the expected count matrix is • Define the statistic

Goodness-of-fit, cont. • Under the model, if the number N of HSPs is large, and they are all long, then the N 2 statistics are approximately independent from a 214 distribution. To check this we simulate 1,000 HSPs of the same size as the first 1,000 in the data, using the MLEs for Q and the corresponding ti. The model parameters are then estimated and the 2 statistics computed. Their mean and SD are about 14 and 5.3 = √(2x14), see the histogram on the next slide. • The mean and the SD of the 10,638 observed 2statistics turns out to be 53 and 54 respectively, indicating that the fit is very bad. This is partly due to compositional heterogeneity. Removing non-stationary HSPs helps, e,g, using only those HSPs with √nD < 2 leaves us with a mean 40 and SD of 33, a modest improvement.

Histogrtam of 1,000  2statistics on simulated HSPs. Mean  14, SD  √28 as expected.

GC-specific rate matrices

GC-specific rate matrices: symmetric parts and stationary distributions Transitions in bold type: note similarity across the tables..

Goodness-of-fit revisited Here are the summaries of the 2 statistics. Although far from good, they are much better. We conclude that heterogeneity contributes to the lack of fit of the model.

Goodness-of-fit, almost concluded. • Dividing our HSPs into three groups based on GC-content helps, but still leaves considerable heterogeneity in each group. An important issue is base composition, related to but more complex than GC-content. • We now consider smaller, more homogeneous groups of HSPs, and check the adequacy of model within them. • Consider those with composition close to  = (.26, .25, .24, .25). If an HSP has length n and marginal frequencies hand m, its standardized distance from  is given by • √n/2 x a|h(a) + m(a) - 2 (a)|. • A rough cut-off can be obtained by simulation based on the overall rate matrix and associated distances, and we find the mean and SD of simulated standardized distances are 1.2 and 0.5, respectively. Fitting the model to the 1,305 HSPs with standardized distance <1.7 gives and estimated rate matrix quite close to the overall rate matrix. Further, the mean and SD of the 2 statistics are 18 and 10 respectively. The qq-plot is the next page but one.

214 qq-plot of HSPs with composition close to (.2,.3,.2,.3)

Goodness-of-fit, concluded. • We see some HSPs not well fitted by our model. These tend to have high imbalance. Removing those HSPs with imbalance exceeding 2 leaves 745 HSPs. Fitting the model to them gives the same rate matrix, but now the mean and SD of the 2are 14,5 and 5.1 respectively. This suggests that the lack of fit is mainly due to non-stationarity, and that stationary HSPs seem to have evolved according to a reversible process, which is also nearly strand-symmetric. • Applying the same procedure for the compositions (.2,.3,.3,.2) and (.3,.2,.2,.3) and (.2,.3,.2,.3) leads to similar observations: up to 40% of the HSPs in each group are non-stationary, and removing them makes the fit very good. For the strand-symmetric compositions (all exept (.2,.3,.2,.3)), the estimated rate matrices are all nearly strand-symmetric, indicating that the evolution of those stationary HSPs is probably strand-symmetric.

Molecular evolution, cont. Comparing estimation methods. Application to human and mouse sequences

Molecular evolution, cont. Comparing estimation methods. Application to human and mouse sequences

Presentation Transcript

Introduction to Molecular Evolution

Human Molecular Evolution Lecture 2

Describing and Comparing Motions cont d

Major Application Areas of Molecular Evolution

MCB5472 Computer methods in molecular evolution

MCB5472 Computer methods in molecular evolution

Molecular Evolution

MCB5472 Computer methods in molecular evolution

Methods cont –

Human Molecular Evolution Lecture 1

Evolution Vocabulary, cont.

Molecular Evolution

Comparing Two Protein Sequences

Molecular evolution, cont. Estimating rate matrices

Comparing Protein Sequences

Comparing Estimation Methods in Product Surveys

Methods(cont.)

Molecular Evolution

Introduction to Molecular Evolution

Comparing Two Protein Sequences

Textual Analysis of DNA Sequences and Molecular Evolution