This article provides an introduction to amino acid exchange matrices and their use in inferring protein evolution models. It covers topics such as protein sequences, sequence alignment, and phylogenetic methods. The text also explores the evolution of amino acids, the degeneracy of the DNA code, and the impact of protein structure on mutation patterns.
 
                
MODELS OF PROTEIN EVOLUTION: AN INTRODUCTION TO AMINO ACID EXCHANGE MATRICES Robert Hirt Institute for Cell and Molecular Biosciences, Newcastle University, UK
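Inferring trees is difficult! 1. The method problem [Figure: two trees for taxa A, B and C inferred from the same data (Dataset 1) with Method 1 and Method 2; the two methods give different topologies. Which one is correct?]
Inferring trees is difficult! 2. The dataset problem [Figure: two trees for taxa A, B and C inferred with the same method (Method 1) from Dataset 1 and Dataset 2; the two datasets give different topologies. Which one is correct?]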
From DNA/protein sequences to trees • 1. Sequence data • 2. Align sequences: phylogenetic signal? Patterns -> evolutionary processes? • 3. Distance methods or character-based methods; for distance methods, distance calculation (which model?) • 4. Choose a method: character-based methods: MB (model?), ML (model?), MP (weighting of sites and changes?); distance methods: optimality criterion (LS, ME) or single tree (NJ) • Calculate or estimate the best-fit tree • 5. Test phylogenetic reliability • Modified from Hillis et al. (1993). Methods in Enzymology 224, 456-487
Agenda • Some general considerations • Why protein phylogenetics? • What are we comparing? Protein sequences - some basic features • Protein structure/function and its impact on patterns of mutations • Amino acid exchange matrices: where do they come from and when do we use them? • Database searches (e.g. Blast, FASTA) • Sequence alignment (e.g. ClustalX) • Phylogenetics (model based methods: distance, ML & Bayesian)
Why protein phylogenies? • For historical reasons - the first sequences • Most genes encode proteins • To study protein structure, function and evolution • Comparing DNA and protein based phylogenies can be useful • Different genes - e.g. 18S rRNA versus EF-2 protein • Protein encoding gene - codons versus amino acids
Proteins were the first molecular sequences to be used for phylogenetic inference • Fitch and Margoliash (1967). Construction of phylogenetic trees. Science 155, 279-284.
Phylogenies from proteins • Parsimony • Distance matrices • Maximum likelihood • Bayesian methods
Evolutionary models for amino acid changes • All methods have explicit or implicit evolutionary models • These can take the form of simple formulae • Kimura or Poisson corrections to estimate distances • Most models for amino acid changes typically include • A 20x20 rate matrix (or a reduced version of it, e.g. a 6x6 rate matrix) • Correction for rate heterogeneity among sites (Γ [α] + pinv) • Assume stationarity and neutrality - what if there are biases in composition, or non-neutral changes such as selection? • Some more recent models include corrections for compositional heterogeneity across sites (CAT model) or across the tree (NDCH - node discrete compositional heterogeneity --- see Peter’s lectures)
Character states in DNA and protein alignments • DNA sequences have four states (five): A, C, G, T, (and ± indels) • Proteins have 20 states (21): A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y (and ± indels) • —> more information in DNA or protein alignments?
DNA->Protein: the code • 3 nucleotides (a codon) code for one amino acid (61 codons! 61x61 rate matrices?) • Degeneracy of the code: most amino acids are coded by several codons —> more data/information in DNA?
DNA—>Protein • The code is degenerate: 20 amino acids are encoded by 61 possible codons (3 stop codons) • Complex patterns of changes among codons: • Synonymous/non synonymous changes • Synonymous changes correspond to codon changes not affecting the coded amino acid
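The degeneracy described above can be checked directly against the standard genetic code. The following is a minimal Python sketch, assuming the standard (nuclear) code table; the function names `degeneracy` and `is_synonymous` are illustrative and do not come from any package mentioned in these slides.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons are enumerated in TCAG order
# (first base, then second, then third). "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def degeneracy():
    """Number of codons encoding each amino acid (stop codons excluded)."""
    return Counter(aa for aa in CODON_TABLE.values() if aa != "*")


def is_synonymous(codon_a, codon_b):
    """True when both codons encode the same amino acid."""
    return CODON_TABLE[codon_a] == CODON_TABLE[codon_b]


counts = degeneracy()
sense_codons = sum(counts.values())
for aa, n in sorted(counts.items(), key=lambda item: -item[1]):
    print(aa, n)
print("sense codons:", sense_codons)
```

Under this table, leucine, serine and arginine each have six codons, while methionine and tryptophan have one each; a change between two leucine codons such as CTG and TTA is synonymous.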
Codon degeneracy: protein alignments as a guide for DNA alignments
Glu-Gly-Ser-Ser-Trp-Leu-Leu-Leu-Gly-Ser
Glu-Gly-Ser-Ser-Tyr-Leu-Leu-Ile-Gly-Ser
Asp-Gly-Ser-Ala-Trp-Leu-Leu-Leu-Gly-Ser
Asp-Gly-Ser-Ala-Tyr-Leu-Leu-Ala-Gly-Ser
GAA-GGA-AGC-TCC-TGG-TTA-CTC-CTG-GGA-TCC
GAG-GGT-TCC-AGC-TAT-CTA-TTA-ATT-GGT-AGC
GAC-GGC-AGT-GCA-TGG-TTG-CTT-TTG-GGC-AGT
GAT-GGG-TCA-GCT-TAC-CTC-CTG-GCC-GGG-TCA
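To make the point of this slide concrete, the four DNA sequences can be translated and compared at both levels. This is a small Python sketch assuming the standard genetic code; `translate` and `count_differences` are hypothetical helper names used only for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" is a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# The four codon-delimited DNA sequences from the slide.
DNA_ALIGNMENT = [
    "GAA-GGA-AGC-TCC-TGG-TTA-CTC-CTG-GGA-TCC",
    "GAG-GGT-TCC-AGC-TAT-CTA-TTA-ATT-GGT-AGC",
    "GAC-GGC-AGT-GCA-TGG-TTG-CTT-TTG-GGC-AGT",
    "GAT-GGG-TCA-GCT-TAC-CTC-CTG-GCC-GGG-TCA",
]


def translate(dna):
    """Translate a dash-delimited codon string into one-letter amino acids."""
    return "".join(CODON_TABLE[codon] for codon in dna.split("-"))


def count_differences(a, b):
    """Number of aligned positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


proteins = [translate(dna) for dna in DNA_ALIGNMENT]
codon_diffs = count_differences(DNA_ALIGNMENT[0].split("-"), DNA_ALIGNMENT[1].split("-"))
aa_diffs = count_differences(proteins[0], proteins[1])
print(proteins)
print("codon differences:", codon_diffs, "amino acid differences:", aa_diffs)
```

For the first two sequences every codon differs, yet only two amino acids do, which is why the protein alignment is the more reliable guide for placing the codons.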
DNA->Protein: code usage • Differences in codon usage can lead to large base composition biases - in which case one often needs to remove the 3rd codon position, the most bias-prone site, and possibly the 1st • Comparing protein sequences can reduce the compositional bias problem -> more information in DNA or protein?
Models for DNA and Protein evolution • DNA: 4 x 4 rate matrices • “Easy” to estimate - can be combined with tree search in some cases • Protein: 20 x 20 matrices • More complex: time and estimation problems (rare changes?) -> • Empirical models from large datasets are typically used • One can correct for the amino acid frequencies of a given dataset • Reduced alphabets (e.g. a 6 x 6 matrix) allow the model to be optimized for a given dataset
Proteins and their amino acids • Proteins determine the shape and structure of cells and carry out most catalytic processes - 3D structure • Proteins are polymers of 20 different amino acids • The amino acid sequence determines the structure (secondary, tertiary…) and function of the protein • Amino acids can be categorized by the physicochemical properties of their side chains • Size (small versus large) • Polarity (hydrophobic versus hydrophilic, +/- charges)
Amino acid physico-chemical properties • Major factor in protein folding • Key to protein functions • -> Major influence on the pattern of amino acid mutations • As for Ts versus Tv in DNA sequences, some amino acid changes are more common than others: fundamental for sequence comparisons (alignments and phylogenetics!) • Small <-> small > small <-> big
Estimation of relative rates of residue replacement (models of evolution) • Differences/changes in protein alignments can be pooled and the patterns of changes investigated • Patterns of changes give insights into the evolutionary processes underlying protein diversification -> estimation of evolutionary models • The choice of protein evolutionary model can be important for the sequence analyses we perform (database searching, sequence alignment, phylogenetics)
Amino acid substitution matrices based on observed substitutions: “empirical models” • Summarise the substitution pattern from large amounts of existing data (‘average’ models) • Based on a selection of proteins • Globular proteins, membrane proteins? • Mitochondrial proteins? • Use a given counting method and set of recorded changes • tree dependent/independent • restriction on the sequence divergence
Amino acid physico-chemical properties • Size • Polarity • Charges (acidic/basic) • Hydrophilic (polar) • Hydrophobic (non polar)
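Taylor’s Venn diagram of amino acid properties [Figure: Venn diagram placing the 20 amino acids in overlapping sets: tiny, small, aliphatic, polar, charged (+/-), hydrophobic and aromatic; cysteine appears twice, as C(S-S) and C(S-H).] Taylor (1986). J. Theor. Biol. 119: 205-218
[Figure: amino acids arranged along two axes, hydrophilic versus hydrophobic and small versus large.] Kosiol et al. (2004). J. Theor. Biol. 228: 97-106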
Amino acids categories 1: Doolittle (1985). Sci. Am. 253, 74-85. • Small polar: S, G, D, N • Small non-polar: T, A, P, C • Large polar: E, Q, K, R • Large non-polar: V, I, L, M, F • Intermediate polarity: W, Y, H
Amino acids categories 2 (PAM matrix) • Sulfhydryl: C • Small hydrophilic: S, T, A, P, G • Acid, amide: D, E, N, Q • Basic: H, R, K • Small hydrophobic: M, I, L, V • Aromatic: F, Y, W
Amino acids categories 3 (implemented in SEAVIEW colour coding) • Tiny 1, non-polar: C • Tiny 2, non-polar: G • Imino acid: P • Non-polar: M, V, L, I, A, F, W • Acid: D, E • Basic: R, K • Aromatic: Y, H • Uncharged polar: S, T, Q, N
Amino acids categories • Changes within a category are more common than changes between categories • Colour coding of alignments to help visualise their quality (ClustalX, SEAVIEW) • Differential weighting of cost matrices in parsimony analyses • Mutational data matrices in model based methods (e.g. ML and Bayesian framework) • Recoding of the 20 amino acids into bins to focus on changes between bins (categories) (e.g. 6x6 matrix)
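Recoding into bins can be illustrated with the six PAM-based categories listed under "Amino acids categories 2". The sketch below is only one possible way to do it: the digit labels 0-5 and the choice to map any other character (gaps, ambiguity codes) to "-" are assumptions made for this example.

```python
# Six categories taken from the "Amino acids categories 2 (PAM matrix)" slide.
BINS = {
    "sulfhydryl": "C",
    "small hydrophilic": "STAPG",
    "acid, amide": "DENQ",
    "basic": "HRK",
    "small hydrophobic": "MILV",
    "aromatic": "FYW",
}

# Assumption: each bin is labelled with a digit, in the order listed above.
RECODE = {
    aa: str(index)
    for index, residues in enumerate(BINS.values())
    for aa in residues
}


def recode(sequence, unknown="-"):
    """Replace every amino acid by the label of its bin.

    Characters outside the 20 amino acids are mapped to `unknown`.
    """
    return "".join(RECODE.get(aa, unknown) for aa in sequence.upper())


print(recode("EGSSWLLLGS"))
print(recode("EGSSYLLIGS"))
```

The two sequences differ at two positions (W/Y and L/I), but both changes fall within a bin, so the recoded strings are identical: only changes between categories remain visible to a 6x6 model.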
—> Colour coding of different categories is useful for protein alignment visual inspection
Phylogenetic trees from protein alignments • Parsimony based methods - unweighted/weighted • Distance methods - model for distance estimation • probability of amino acid changes, site rate heterogeneity • Maximum likelihood and Bayesian methods - model for ML calculations • probability of amino acid changes, site rate heterogeneity
Trees from protein alignment: Parsimony methods - cost matrices • All changes weighted equally • Differential weighting of changes: an attempt to correct for homoplasy! • Based on the minimal number of nucleotide changes required for each amino acid substitution: the genetic code matrix (PHYLIP-PROTPARS) • Weights based on physico-chemical properties of amino acids • Weights based on observed frequency of amino acid substitutions in alignments
Parsimony: unweighted matrix for amino acid changes • Ile-> Leu cost = 1 • Trp -> Asp cost = 1 • Ser -> Arg cost = 1 • Lys -> Asp cost = 1
Parsimony: weighted matrix for amino acid changes, the genetic code matrix • Ile-> Leu cost = 1 • Trp -> Asn cost = 3 • Ser -> Arg cost = 2 • Lys -> Asp cost = 2
Weighting matrix based on minimal amino acid changes • PROTPARS in PHYLIP - also in SEAVIEW • Columns 1 and 2 are the two serine codon groups (TCN and AGY)
    A C D E F G H I K L M N P Q R 1 2 T V W Y
[A] 0 2 1 1 2 1 2 2 2 2 2 2 1 2 2 1 2 1 1 2 2
[C] 2 0 2 2 1 1 2 2 2 2 2 2 2 2 1 1 1 2 2 1 1
[D] 1 2 0 1 2 1 1 2 2 2 2 1 2 2 2 2 2 2 1 2 1
[E] 1 2 1 0 2 1 2 2 1 2 2 2 2 1 2 2 2 2 1 2 2
[F] 2 1 2 2 0 2 2 1 2 1 2 2 2 2 2 1 2 2 1 2 1
[G] 1 1 1 1 2 0 2 2 2 2 2 2 2 2 1 2 1 2 1 1 2
[H] 2 2 1 2 2 2 0 2 2 1 2 1 1 1 1 2 2 2 2 2 1
[I] 2 2 2 2 1 2 2 0 1 1 1 1 2 2 1 2 1 1 1 2 2
[K] 2 2 2 1 2 2 2 1 0 2 1 1 2 1 1 2 2 1 2 2 2
[L] 2 2 2 2 1 2 1 1 2 0 1 2 1 1 1 1 2 2 1 1 2
[M] 2 2 2 2 2 2 2 1 1 1 0 2 2 2 1 2 2 1 1 2 3
[N] 2 2 1 2 2 2 1 1 1 2 2 0 2 2 2 2 1 1 2 3 1
[P] 1 2 2 2 2 2 1 2 2 1 2 2 0 1 1 1 2 1 2 2 2
[Q] 2 2 2 1 2 2 1 2 1 1 2 2 1 0 1 2 2 2 2 2 2
[R] 2 1 2 2 2 1 1 1 1 1 1 2 1 1 0 2 1 1 2 1 2
[1] 1 1 2 2 1 2 2 2 2 1 2 2 1 2 2 0 2 1 2 1 1
[2] 2 1 2 2 2 1 2 1 2 2 2 1 2 2 1 2 0 1 2 2 2
[T] 1 2 2 2 2 2 2 1 1 2 1 1 1 2 1 1 1 0 2 2 2
[V] 1 2 1 1 1 1 2 1 2 1 1 2 2 2 2 2 2 2 0 2 2
[W] 2 1 2 2 2 1 2 2 2 1 2 3 2 2 1 1 2 2 2 0 2
[Y] 2 1 1 2 1 2 1 2 2 2 3 1 2 2 2 1 2 2 2 2 0
W: TGG versus N: AAC or AAT: all three codon positions differ, so a minimum of 3 changes is needed at the DNA level for W<->N
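The entries of such a genetic code matrix can be derived from the codon table by taking, for each pair of amino acids, the smallest number of nucleotide differences between any of their codons. The Python sketch below assumes the standard genetic code and is not the PROTPARS implementation; `min_changes` is a hypothetical name.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" is a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Group the codons by the amino acid they encode.
CODONS_BY_AA = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS_BY_AA.setdefault(aa, []).append("".join(codon))


def min_changes(aa_a, aa_b):
    """Fewest nucleotide differences between any codon of aa_a and any codon of aa_b."""
    return min(
        sum(x != y for x, y in zip(codon_a, codon_b))
        for codon_a in CODONS_BY_AA[aa_a]
        for codon_b in CODONS_BY_AA[aa_b]
    )


for pair in [("I", "L"), ("W", "N"), ("K", "D"), ("M", "Y")]:
    print(pair, min_changes(*pair))
```

Note that this sketch pools all six serine codons into one state, so it returns 1 for Ser <-> Arg (AGC -> AGA), whereas the table above lists serine under two separate columns (1 and 2), one of which is two changes away from arginine.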
Phylogenetic trees from protein alignments • Parsimony based methods - unweighted/weighted • Distance methods - model for distance estimation • probability of amino acid changes, site rate heterogeneity • Maximum likelihood and Bayesian methods - model for ML calculations • probability of amino acid changes, site rate heterogeneity
Distance methods A two step approach - two choices! 1) Estimate all pairwise distances Choose a method (100s) - has an explicit model for sequence evolution 2) Estimate a tree from the distance matrix Choose a method: with or without an optimality criterion?
Estimation of protein pairwise distances • Simple formulae • More complex models • 20 x 20 matrices (evolutionary model): • Identity matrix • Genetic code matrix • Mutational data matrices (MDMs) • Correction for rate heterogeneity between sites (Γ [α] + pinv)
The Kimura and Poisson corrections • Kimura: $d_{ij} = -\ln\left(1 - D_{ij} - D_{ij}^2/5\right)$ • Poisson: $d_{ij} = -\ln\left(1 - D_{ij}\right)$ • $D_{ij}$ is the observed proportion of differences between sequences i and j ($0 \le D_{ij} \le 1$) • Both can give good estimates for small $D_{ij}$ • The Kimura correction approximates the PAM matrix well • However, if $D_{ij} = 1$ (Poisson) or $D_{ij} \ge 0.8541$ (Kimura), $d_{ij}$ is infinite • Neither takes into account which amino acids are changing, and both assume that all sites change at the same rate • Both are implemented in SEAVIEW
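Both corrections are simple enough to compute by hand or in a few lines of code. The following Python sketch follows the two formulas on this slide; the function names, the choice to skip positions containing a gap character, and returning infinity when the logarithm is undefined are assumptions of this example, not a description of how SEAVIEW implements them.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Observed proportion of differing sites (D_ij), ignoring gapped positions."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def poisson_distance(p):
    """Poisson correction: d = -ln(1 - D)."""
    if p >= 1.0:
        return math.inf
    return -math.log(1.0 - p)


def kimura_distance(p):
    """Kimura correction: d = -ln(1 - D - D^2 / 5)."""
    inner = 1.0 - p - p * p / 5.0
    if inner <= 0.0:  # reached at about D = 0.8541
        return math.inf
    return -math.log(inner)


d_obs = p_distance("AIDESLIIASIATATI", "AGDEALILASAATSTI")
print(d_obs, poisson_distance(d_obs), kimura_distance(d_obs))
```

For these two 16-residue sequences five sites differ, so D = 0.3125; the Kimura estimate is slightly larger than the Poisson one because of the extra D²/5 term.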
Amino acid substitution matrices (MDMs) • Sequence alignments based matrices PAM, JTT, BLOSUM, WAG, LG... • Structure alignments based matrices STR (for highly divergent sequences)
Protein distance measurements with MDM 20 x 20 matrices: • PAM, BLOSUM, WAG… matrices • Maximum likelihood calculation which takes into account: • All sites in the alignment • All pairwise rates in the matrix • Branch length • $d_{ij} = \mathrm{ML}\left[P(n), X_{ij}, (\Gamma, p_{inv})\right]$ (dodgy notation!) • By contrast, the Kimura correction depends only on $D_{ij}$: $d_{ij} = -\ln\left(1 - D_{ij} - D_{ij}^2/5\right) = F(D_{ij})$
How is an MDM inferred? • Observed raw changes are corrected for: • The amino acid relative mutability • The amino acid normalised frequency • Differences between MDM come from: • Choice of proteins used (membrane, globular) • Range of sequence similarities used • Counting methods • On a tree [MP, ML] • Pairwise comparison from an alignment -> empirical models from large datasets are typically used
How is an MDM inferred? The raw data: observed changes in pairwise comparisons in an alignment or on a tree
seq.1 AIDESLIIASIATATI
      |*||*||*||*||*||
seq.2 AGDEALILASAATSTI
seq.1 AIDESLIIASIATATI
      |**|*||*||*||*||
seq.2 AGEEALILASAATSTI
Raw matrix (symmetrical, so only the lower triangle is shown):
   A  S  T  G  I  L  E  D
A  3
S  2  1
T  0  0  2
G  0  0  0  0
I  1  0  0  1  2
L  0  0  0  0  1  1
E  0  0  0  0  0  0  1
D  0  0  0  0  0  0  1  0
-> The larger the dataset the better the estimates!
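Counting the raw matrix from a pairwise comparison is a matter of tallying unordered residue pairs column by column. The sketch below does this in Python for the two sequences on this slide; the function name `raw_counts` and the decision to skip gapped columns are assumptions made for illustration.

```python
from collections import Counter


def raw_counts(seq_a, seq_b, gap="-"):
    """Tally unordered residue pairs over the aligned columns of two sequences.

    Pairs are stored with their residues sorted, so (A, S) and (S, A)
    fall into the same cell and the matrix is symmetrical by construction.
    """
    counts = Counter()
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumption: gapped columns carry no exchange information
        counts[tuple(sorted((a, b)))] += 1
    return counts


counts = raw_counts("AIDESLIIASIATATI", "AGEEALILASAATSTI")
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Identical pairs such as ('A', 'A') fill the diagonal of the raw matrix and mixed pairs such as ('A', 'S') fill the off-diagonal cells; the counts over all cells add up to the number of compared columns.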
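Amino acid exchange matrices
$$Q = \begin{pmatrix} - & s_{1,2} & s_{1,3} & \cdots & s_{1,20} \\ s_{1,2} & - & s_{2,3} & \cdots & s_{2,20} \\ s_{1,3} & s_{2,3} & - & \cdots & s_{3,20} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ s_{1,20} & s_{2,20} & s_{3,20} & \cdots & - \end{pmatrix} \times \mathrm{diag}(\pi_1, \ldots, \pi_{20})$$
• $Q$: rate matrix • $Q_{ij}$: instantaneous rates of change of amino acids • $s_{ij}$: exchangeabilities of amino acid pairs ij • $s_{ij} = s_{ji}$: time reversibility • $\pi_i$: stationary amino acid frequencies (typically the observed proportion of residues in the dataset)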
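The construction of Q from exchangeabilities and frequencies can be sketched on a toy alphabet. The Python example below uses three states instead of twenty, with made-up values for s and π; it assumes the usual convention that each diagonal entry is set so that the row sums to zero, which the slide marks only with "-".

```python
def build_rate_matrix(exchangeabilities, frequencies):
    """Return Q with Q[i][j] = s[i][j] * pi[j] for i != j and rows summing to zero."""
    n = len(frequencies)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exchangeabilities[i][j] * frequencies[j]
        # The diagonal is still 0.0 here, so this is minus the off-diagonal row sum.
        q[i][i] = -sum(q[i])
    return q


# Hypothetical symmetric exchangeabilities and stationary frequencies (3 states).
S = [
    [0.0, 2.0, 1.0],
    [2.0, 0.0, 0.5],
    [1.0, 0.5, 0.0],
]
PI = [0.5, 0.3, 0.2]

Q = build_rate_matrix(S, PI)
for row in Q:
    print(row)
```

Because s is symmetric, π_i Q_ij equals π_j Q_ji for every pair of states, which is the time-reversibility property listed on the slide.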
Amino acid exchange matrices • F: raw matrix - observed changes (counted on an MP tree or in pairwise comparisons) • R: relative rate matrix (no composition, no branch length) • Q: rate matrix (with composition, no branch length) • P: probability matrix (composition + branch length) • Can be estimated using ML on a tree • R: relatedness odds matrix - used for scoring alignments (BlastP, ClustalX) • Modified from Peter Foster
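The step from the rate matrix Q to the probability matrix P is where the branch length comes in: under a continuous-time Markov model, P(t) = exp(Qt). The Python sketch below computes this with a truncated Taylor series plus scaling and squaring, in pure Python and on a hypothetical two-state Q; real programs typically use an eigendecomposition of the 20 x 20 matrix instead, so treat this only as an illustration of the relationship.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def transition_probabilities(q, t, terms=30):
    """P(t) = exp(Q t) by scaling and squaring of a truncated Taylor series."""
    n = len(q)
    norm = max(sum(abs(x) for x in row) for row in q) * t
    squarings = math.ceil(math.log2(norm)) if norm > 1.0 else 0
    scale = t / (2 ** squarings)
    a = [[x * scale for x in row] for row in q]

    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes A^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[r + s for r, s in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]

    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical two-state rate matrix with stationary frequencies (0.4, 0.6).
Q = [[-0.6, 0.6],
     [0.4, -0.4]]

P_short = transition_probabilities(Q, 0.5)
P_long = transition_probabilities(Q, 50.0)
print(P_short)
print(P_long)
```

On a short branch most of the probability stays on the diagonal (no change); on a very long branch every row of P approaches the stationary frequencies, so the starting state no longer matters.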
The PAM and JTT matrices • PAM - Dayhoff et al. 1968 • Nuclear encoded genes, ~100 proteins • JTT - Jones et al. 1992 • 59,190 accepted point mutations for 16,300 proteins Jones, Taylor & Thornton (1992). CABIOS 8, 275-282
The BLOSUM matrices Henikoff & Henikoff (1992). Proc Natl Acad Sci USA 89, 10915-9 • BLOcks SUbstitution Matrices • The matrix values are based on 2000 conserved amino acid patterns (blocks) - pairwise comparisons —> more efficient for distantly related proteins —> more agreement with 3D structure data BLOSUM62 - 62% minimum sequence identity (BlastP default) BLOSUM50 - 50% minimum sequence identity BLOSUM42 - 42% minimum sequence identity (BlastP)
The WAG matrix Whelan and Goldman (2001) Mol. Biol. Evol. 18, 691-699 • Globular protein sequences • 3,905 sequences from 182 protein families • Produced a phylogenetic tree for every family and used maximum likelihood to estimate the relative rate values in the rate matrix (overall lnL over 182 different trees) • Better fit of the model with most data (significant improvement of the tree lnL when compared to PAM or JTT matrices) • Might not be the best option in some cases, such as for mitochondria-encoded proteins or other membrane proteins…
Further improvements: the LG matrix Le and Gascuel (2008) Mol. Biol. Evol. 25, 1307-20 • Used the same phylogenetic approach as for WAG • Further refined the method by incorporating the variability of evolutionary rates across sites when estimating the matrix, and increased the number of sequences used • Better fit of the model with most data (significant improvement of the tree lnL when compared to WAG and other matrices) • Might not be the best option in some cases, such as for mitochondria-encoded proteins or other membrane proteins… • Incorporated in ProtTest, which compares the fit of various matrices to a given alignment, and available in PhyML (in SEAVIEW) for phylogenetic inference