Protein homology I: Evolution and comparison of protein sequences Biochem 565, Fall 2008 09/17/08 Cordes
Outline
1. Homology and kinds of homology
2. Mutations and sequence conservation
3. Pairwise alignment--global vs. local
4. Sequence identity and homology
5. Sequence similarity and homology--use of substitution matrices
6. Alignment scores and statistics
7. Limitations of pairwise alignment
8. Remote homologies--use of evolutionary profiles
Evolutionary relationships between proteins
[Diagram: boxes represent protein-coding genes. Gene A1 undergoes gene duplication to give A1 and A2; speciation then yields A1 and A2 in each descendant species. Orthologs and paralogs are labeled.]
Key terms to describe evolutionary relationships among proteins:
homologous: descended from a common ancestor, e.g. “A1 and A2 are homologous”. Also sometimes defined as “similar due to descent from a common ancestor.” Homology is either/or--there is no such thing as “percent homology”! Homologous is not a synonym for “similar”! It is, however, possible for only a part of two sequences to be homologous, for instance one domain in a multidomain protein.
paralogous: related by gene duplication
orthologous: related by speciation
Orthologous and paralogous proteins As a general rule, orthologous proteins tend to perform the same function in different species, while paralogous proteins tend to have diversified somewhat in function--duplication is a very common way in which evolution gives proteins the freedom to develop new functions. For example, the chymotrypsin serine proteases are orthologous to each other, and they retain not only the same general function (proteolysis using a catalytic triad including a serine), but also the same substrate specificity (cleavage at positions following aromatic side chains). The chymotrypsins are paralogous to the trypsins and the elastases. These proteins share the same general serine protease function but have evolved different substrate specificities. These proteins also have paralogs which have lost all protease activity.
Homology at the domain level
• proteins often have a modular organization
• a single polypeptide chain may be divisible into smaller independent units of tertiary structure called domains
• different domains in a protein are also often associated with different functions carried out by the protein, though some functions occur at the interface between domains
• domains are a more fundamental unit of protein homology than a full protein--it is possible for two proteins to have one or more domains that are homologous combined with one or more that aren’t. In other words, domains can be “shuffled” in evolution.
[Figure: domain organization of the P53 tumor suppressor, with residue positions 1, 60, 100, 300, 324, 355, 363 and 393 marked and domains labeled: activation domain, sequence-specific DNA-binding domain, non-specific DNA-binding domain, tetramerization domain]
Simple mutations in protein-coding gene sequences
The diagram shows DNA and translated protein sequence for two sequences related by mutations:

    MetGluGlyTyrCysValAla...
    ATGGAAGGGTACTGCGTGGCA...
    ATGGAGGGGTACAGC---GCA...
    MetGluGlyTyrSer---Ala...

• silent or synonymous substitutions--change in codon but not in translated amino acid (here GAA->GAG, both Glu)
• nonsynonymous substitutions--change in codon and in translated amino acid (here TGC->AGC, Cys->Ser)
• deletions and insertions (indels)--if they occur in multiples of three, they lead to deletion/insertion of amino acids (here GTG is deleted, removing Val). Otherwise they produce frameshifts, which change the entire downstream sequence.
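The three kinds of change in the diagram can be told apart mechanically by translating each aligned codon. The following is a minimal sketch, assuming the two DNA sequences are already aligned in frame with deleted codons written as ---; the function name classify_codons and the compact encoding of the standard genetic code are choices made for this illustration, not part of the lecture.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_codons(dna_a, dna_b):
    """Compare two in-frame, gap-aligned DNA sequences codon by codon."""
    calls = []
    for i in range(0, min(len(dna_a), len(dna_b)) - 2, 3):
        codon_a, codon_b = dna_a[i:i + 3], dna_b[i:i + 3]
        if "-" in codon_a or "-" in codon_b:
            calls.append("indel")          # any gap character in the codon
        elif codon_a == codon_b:
            calls.append("identical")
        elif CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            calls.append("synonymous")     # codon differs, amino acid does not
        else:
            calls.append("nonsynonymous")  # codon and amino acid both differ
    return calls


# The two sequences from the slide.
print(classify_codons("ATGGAAGGGTACTGCGTGGCA", "ATGGAGGGGTACAGC---GCA"))
```

This sketch does not model frameshifts: a gap that is not a multiple of three would shift the reading frame of everything downstream, which a codon-by-codon comparison of the aligned DNA cannot capture.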
Acceptance and rejection of mutations
Depends upon many factors, among which are:
• Is the mutation a substitution, indel, frameshift?
• If it is a substitution, is the mutation nonsynonymous or synonymous?
• If it is nonsynonymous, is it “conservative”? Does it preserve the approximate physicochemical properties of the amino acid mutated, or does it change them radically?
• What protein does it occur in? Some proteins are more essential and more tightly constrained by natural selection than others.
• Where does it occur in the protein? Is it important for function or the stability of the structure, or both/neither?
Synonymous and nonsynonymous substitutions
Table. Substitution rates in genes encoding orthologous rodent and human proteins. Rates are in substitutions per site per billion years.

    protein      nonsynonymous rate   synonymous rate   KA/KS
    histone 3    0.00                 6.38              0
    actin α      0.01                 3.68              0.002
    insulin      0.13                 4.02              0.03
    myoglobin    0.56                 4.44              0.126
    β-globin     0.80                 3.05              0.262
    urokinase    1.28                 3.92              0.362

KA/KS is the ratio of nonsynonymous to synonymous changes in the gene, and is a measure of the functional selection on a protein. In general, synonymous changes are more likely to be accepted than nonsynonymous changes, but how much more likely varies a lot: the sequences of proteins with highly constrained function tend to evolve more slowly and have lower KA/KS values. This includes critical proteins with multiple levels of function and regulation, such as histones.
Adapted from Protein Evolution by L. Patthy, Blackwell Science, 1999 and from Fundamentals of Molecular Evolution by Li & Graur, Sinauer, 1991
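The KA/KS column is just the quotient of the two rate columns. A minimal sketch of the arithmetic, using three rows of the table; the function name ka_ks is a label chosen here.

```python
def ka_ks(nonsynonymous_rate, synonymous_rate):
    """Ratio of nonsynonymous to synonymous substitution rates."""
    if synonymous_rate == 0:
        raise ValueError("synonymous rate must be non-zero")
    return nonsynonymous_rate / synonymous_rate


# (nonsynonymous rate, synonymous rate) taken from the table.
rates = {
    "histone 3": (0.00, 6.38),
    "myoglobin": (0.56, 4.44),
    "beta-globin": (0.80, 3.05),
}
for protein, (ka, ks) in rates.items():
    print(f"{protein}: KA/KS = {ka_ks(ka, ks):.3f}")
```

Recomputing from the tabulated rates reproduces the myoglobin and beta-globin entries. The urokinase rates give about 0.33 rather than the tabulated 0.362, so one of the values in that row may contain a transcription slip.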
Generalized substitution matrices
The likelihood of a nonsynonymous substitution occurring and being accepted also depends upon whether the mutation is “conservative”, meaning that it preserves similar properties, or “nonconservative”. Substitutions observed in alignments of related proteins have been used to construct generalized substitution matrices (e.g. BLOSUM, PAM, Gonnet), which reflect the average likelihood of a mutation occurring and being accepted in a protein.
• Cys and Trp are the least mutable and the most unique in their properties.
• Polar residues are more mutable than hydrophobic ones.
• Polar residues are more likely to be substituted by polar residues, and hydrophobic by hydrophobic.
[Figure: the PAM 250 matrix (Margaret Dayhoff)]
Position-specific conservation and sequence variation
Multiple alignments of members of families of related proteins, color coded by categories of amino acids, can reveal conservation at specific positions in the sequence.
[Figure: multiple alignment, labeled with the names of family members and the position in the sequence alignment. Color coding: orange = conserved small; green = conserved aliphatic; red = conserved basic; blue = conserved aromatic. The level of the bar indicates the level of conservation, or lack of tolerance to mutation. Some positions are variable, others are not.]
Position-specific conservation and sequence variation
[Figure: a multiple alignment and an alternative representation of it, a sequence logo]
Logos represent sequence conservation in an easy-to-read format, with letter heights essentially representing the frequency with which a residue type occurs at a position in an alignment, relative to the frequency with which it would occur at random. The units of the y-axis are “bits” of information, which is to say that if a residue did not occur more often than expected at random, it would not offer us any information and the letter height would be zero. Note that the letter heights only become very high when a residue really dominates in the alignment, like Ala at the fifth position here.
weblogo server: http://weblogo.berkeley.edu
sequence logos paper: Schneider and Stephens, Nucleic Acids Res 18, 6097 (1990).
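The letter heights in a logo can be sketched numerically. The example below assumes the common simplification in which every one of the 20 amino acids is equally likely at random and no small-sample correction is applied. The function names are chosen for this illustration, and a column is simply a string holding the residues seen at one alignment position (no gap characters).

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column.

    Maximum possible uncertainty, log2(alphabet_size), minus the observed
    Shannon entropy of the residue frequencies in the column.
    """
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
    return math.log2(alphabet_size) - entropy


def letter_heights(column, alphabet_size=20):
    """Height of each letter: its frequency times the column's information."""
    info = column_information(column, alphabet_size)
    n = len(column)
    return {aa: (c / n) * info for aa, c in Counter(column).items()}


print(letter_heights("AAAAAAAAAA"))   # one residue dominates: a single tall letter
print(letter_heights("AAAAAGGSTV"))   # mixed column: much shorter letters
```

A column in which all 20 residue types occur equally often carries zero bits under these assumptions, which is the "no information" case described above.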
Classic studies of sequence conservation: the globins
The globins are the best studied family in terms of sequence conservation, partly because they were one of the first families for which multiple members were sequenced, and partly because some of the earliest protein structures (in fact, the earliest) solved were globins. The classic papers of Perutz, Kendrew and Watson were the first to correlate sequence conservation with aspects of protein structure and function. They drew their conclusions based on only a few aligned sequences. Later globin studies, such as those of Bashford, Chothia and Lesk, expanded the analyses of globin sequence conservation to include hundreds of sequences.
Perutz, Kendrew & Watson J Mol Biol 13, 669 (1965)
Bashford, Chothia & Lesk J Mol Biol 196, 199 (1987)
[Figure: Scapharca inaequivalvis oxygenated hemoglobin]
Conservation of functional residues
There were only 2 perfectly conserved residues among the 8 known globin structures at the time Bashford et al. did their study. These are residues critical in binding of heme and/or interaction with heme-bound oxygen. It will often be found that the best conserved (least tolerant of mutation) residues in related proteins are those involved in critical aspects of the general function.
[Figure: heme, with Phe 43 and His 87 labeled]
Residues involved in more specific aspects of function may or may not be conserved, depending upon the relationship between the proteins under consideration. For example, residues involved in substrate specificity for serine proteases may be conserved among orthologs, such as the chymotrypsins, but not between paralogs, such as chymotrypsins and trypsins.
Conservation at buried (interior) positions
• Core or buried residues, which are usually hydrophobic, often tolerate conservative substitutions, i.e. to other hydrophobics
• overall core volume is well-conserved (Lim & Ptitsyn, 1970) though individual core positions tolerate variation in volume
• this reflects what we know about the packing in protein interiors and the effects of interior mutations on stability--thus, sequence conservation is partly related to maintaining a stable structure!
[Figure: portion of an alignment of prokaryotic and eukaryotic globins, colored yellow = small, green = hydrophobic, pink/red = neutral polar/acidic, blue = basic. Residues on one face of this helix are in the interior. Structure shown: human hemoglobin beta chain, with Tyr 140, His 156 and the buried residues labeled.]
Conservation at solvent-exposed positions
• Solvent-exposed (surface) positions are mutable and usually tolerate mutation to many residue types including hydrophobics. Bashford et al., however, noted that for globins at least, some surface positions do not tolerate large hydrophobics. Since polar-to-hydrophobic mutations on protein surfaces do not reduce stability, this conservation could reflect constraints on solubility. Indeed, it is clear that the overall polar character of the surface is conserved for soluble, globular proteins, even though a certain number of hydrophobics may be tolerated.
[Figure: the same alignment and helix, colored yellow = small, green = hydrophobic, pink/red = neutral polar/acidic, blue = basic. Residues on the other face of this helix are exposed to solvent. Structure shown: human hemoglobin beta chain, with Tyr 140, His 156 and examples of surface residues labeled.]
Conservation of loops and turns
• Loops and turns that connect regular secondary structures are often hypermutable and vary not only in sequence but in length, tolerating insertion and deletion events (which are not well-tolerated within regular secondary structure elements).
[Figure: human hemoglobin α chain, with part of an alignment of animal hemoglobin α and β chains]
Covariation analysis
Substitution patterns at different positions in a sequence alignment are not necessarily independent. This is sometimes referred to as covariation or correlated evolution.

    name  sequence
    A     YADLGRIKS
    B     YSDLGSEKE
    C     IDDFGEIAA
    D     IDDFGVIGT

For example, in the mini multiple alignment shown above, the identity of the residue at the 4th position is correlated with the identity of the residue at the 1st position (Y goes with L, I goes with F). A statistical perturbation analysis can be used to characterize this covariation. An alignment of related sequences is “perturbed” by considering only those sequences in which, for example, the first position is Y. The effect of this perturbation on the residue distribution observed at other positions is then measured. If the distribution changes significantly, covariation between sequence changes at the first site and other sites in the alignment is inferred.
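The perturbation idea can be tried directly on the four-sequence mini alignment. The sketch below is only an illustration: it measures the change in residue distribution with a simple total variation distance (half the summed absolute frequency differences), which is an assumption made here and not necessarily the statistic used in published perturbation analyses. Positions are 0-based in the code.

```python
from collections import Counter

# The mini alignment from the slide.
ALIGNMENT = ["YADLGRIKS", "YSDLGSEKE", "IDDFGEIAA", "IDDFGVIGT"]


def column_distribution(seqs, pos):
    """Residue frequencies observed at one alignment position."""
    counts = Counter(s[pos] for s in seqs)
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}


def perturbation_shift(seqs, fixed_pos, fixed_residue, test_pos):
    """How much the distribution at test_pos changes when the alignment is
    restricted to sequences carrying fixed_residue at fixed_pos.
    Returns a value between 0 (no change) and 1 (completely different)."""
    subset = [s for s in seqs if s[fixed_pos] == fixed_residue]
    if not subset:
        raise ValueError("no sequence carries that residue at that position")
    full = column_distribution(seqs, test_pos)
    sub = column_distribution(subset, test_pos)
    residues = set(full) | set(sub)
    return 0.5 * sum(abs(full.get(r, 0.0) - sub.get(r, 0.0)) for r in residues)


# Perturb on "first position is Y" and look at every other position.
for pos in range(1, len(ALIGNMENT[0])):
    print(pos + 1, perturbation_shift(ALIGNMENT, 0, "Y", pos))
```

With only four sequences nothing here is statistically meaningful; a real analysis needs a large alignment and a significance test for the observed shift.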
Covariation and hydrophobic core packing
The hydrophobic core residues in related proteins tend to be covariant due to constraints on core packing. One sees compensatory volume changes at different positions. Davidson and coworkers found that for 266 aligned SH3 domain sequences, the strongest covariation was observed for a cluster of central hydrophobic residues. For example, substitution of a smaller residue (Ala->Gly) at 39 was strongly correlated with substitution of a larger residue (Ile->Phe) at 50.
[Figure: hydrophobic core of SH3 domains, with the most frequently covarying residues shown in yellow]
S.M. Larson, A.A. DiNardo and A.R. Davidson, J Mol Biol 303, 433 (2000)
Some recent studies (Suel et al.) have suggested a connection between covarying clusters of residues and transduction of signals between distant sites in proteins. For example, G-protein coupled receptors bind a ligand on one side of a membrane, and then transduce that signal to the other side through conformational change. Suel et al. showed that the main clusters of covarying residues tended to connect the ligand and G-protein binding sites.
[Figure: receptor in the membrane, showing the ligand, the covarying networks (brown) and the G-protein binding sites]
Suel et al. Nat Struct Biol 2003
Inferring homology between proteins The simplest way of identifying homology is by sequence comparison. If two protein sequences are sufficiently similar (we’ll talk about what similarity means in a moment), they can be statistically inferred to be homologous. In addition, if a sequence obeys conservation patterns observed in a known family of related sequences, it can be inferred to be a member of that family. For sequences of statistically borderline similarity, structural and functional comparison, if such information is available, can be used as a supplement to establish common ancestry. If similarity between two sequences is really statistically weak, very strong structural and functional similarity can still make a convincing argument for homology. Finally, gene context can play a role--for example, do two genes occupy the same location within an operon in different organisms? We will next focus on identification of homology through sequence comparison. We will begin with simple pairwise comparison.
Pairwise alignment of sequences--global and local

GLOBAL ALIGNMENT

    F R T Y I A E W Q R T E P G A D H
    F Q T Y A A D Y - R T E P S S D H
    *   * *   *       * * * *     * *

The entire length of the sequences is aligned--about 60% identity over 17 residues. Note that allowance for gaps improves the % identity. The best alignment would be determined by using some optimization algorithm in combination with a scoring scheme, e.g. +1 for every identity and 0 for every mismatch or gap (identity matrix).

LOCAL ALIGNMENT

    - - - - - - - - - R T E P G A D H
    - - - - - - - - - R T E P S S D H
                      * * * *     * *

Only the best matching portion(s) of the sequences is (are) included in the alignment--75% identity over 8 residues. How does a local alignment algorithm decide where to stop? By lengthening the alignment only insofar as it increases the score. For example, one could increase the score by +2 for every identical amino acid, while assigning a penalty of -1 for every mismatch or gap. Such penalties would prevent the alignment from extending to dissimilar regions.
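The "lengthen only while the score increases" rule is what the Smith-Waterman recurrence implements: every cell of the scoring table is floored at zero, and the alignment is traced back from the highest-scoring cell until a zero is reached. Below is a minimal sketch using the +2 / -1 / -1 scheme mentioned above, with a simple linear gap penalty. The function name and return format are choices made here, and real tools use substitution matrices and affine gap penalties instead.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment start afresh anywhere.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# A shared block flanked by unrelated residues: only the block is reported.
print(smith_waterman("GGGGRTEPGGGG", "AAAARTEPAAAA"))
# The two sequences from the slide.
print(smith_waterman("FRTYIAEWQRTEPGADH", "FQTYAADYRTEPSSDH"))
```

Under this particular +2/-1 scheme the full-length alignment of the slide's two sequences scores 13 (10 identities, 6 mismatches, 1 gap), more than the 10 scored by the eight-residue RTEP...DH block alone, so the optimal local alignment is not confined to that block. A harsher mismatch or gap penalty would favor the shorter alignment.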
Pairwise alignment of sequences--global vs. local Local alignment is more versatile than global and is thus more widely used. It can be used to align proteins that are not related throughout their lengths but share a conserved domain, as well as proteins with very unevenly distributed sequence similarity. Many such cases exist. Thus, when one has no prior knowledge of what to expect, local alignment routines are preferable. This will especially be the case if one is using pairwise alignment to search a database for sequences that are related to a query sequence. Thus, alignment algorithms for database searching essentially always use local alignment. It should be noted that the scoring scheme used can be tailored to favor longer or shorter local alignments. Global alignment is usually used to align sequences that are approximately the same length and are already known to be related. Once we’ve aligned all or part of a pair of sequences, how do we decide whether they are homologous?
Percent sequence identity and homology
Common rule-of-thumb: 30% identical residues between two aligned protein sequences indicates homology. This is too simplistic and only works if the 30% is measured over a long stretch of amino acids!
[Figure, from Brenner et al. PNAS 95, 6073 (1998): a high level of identity between unrelated proteins is common at short alignment lengths. 20-30% identity is called the “twilight zone”: it is difficult to assess relatedness from identity.]
The 30% identity threshold for identification of homology only works for long alignments, i.e. >100-150 amino acids.
Sequence identity and homology: false positives
[Figure: alignment of two unrelated proteins, from Brenner et al. PNAS 95, 6073 (1998)]
Sequence identity is 39% over 64 residues, yet the two proteins are unrelated--this would be a false positive using a 30% cutoff rule. Use of a length-dependent cutoff would help. Note also that gaps are allowed in this alignment--identity would be lower if gaps were not allowed. However, gaps are common among true homologs.
Sequence identity and homology: poor coverage
[Figure: structures of myoglobin (1MBO) and hemoglobin (1HBB)]
The two proteins have the same fold, and both bind heme and oxygen in the same place: good independent structural/functional evidence for homology. Yet alignments of their sequences reveal only 24% identity. There are also many examples of related globins and other proteins with much lower identity than this. Any reasonable sequence identity criterion, whether it is a flat percent cutoff or a length-dependent cutoff, will give incomplete coverage--in other words, it will fail to identify many distant but true relationships.
“Sequence similarity” and homology Sequence identity is one specific way of assessing sequence similarity, and it’s not a very good one. If you just use sequence identity, you are throwing away a lot of information. As we have just learned, not all mutations are equally likely to occur and be accepted during the course of evolution. Knowledge of what substitutions commonly occur among related proteins can be put to use both in aligning sequences and in using sequence similarity to identify homology/common ancestry. Various methods have been developed which use such knowledge to assess sequence similarity. The most widely used and familiar of these methods work by using generalized amino acid substitution matrices (aka scoring matrices) in tandem with effective computational alignment algorithms that find the best (highest scoring) alignment. This is coupled with a statistical assessment of the significance of the alignment score obtained between two sequences using a given matrix.
Percent similarity in sequence alignments

      G  D  A  Y  M  -  -  V  R  D  W  I
      G  +     Y  M        +  R  D  W
      G  E  R  Y  M  Q  P  L  R  D  W  G
      6  2 -1  7  5        2  5  6 11 -4

Each number is the substitution matrix element assessing the probability of mutations exchanging the two aligned amino acids. In the middle line, letters mark identical amino acids and + marks similar amino acids (positive matrix element). These two sequences have 50% identity, but 67% similarity.
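The two percentages can be recomputed from the columns of the alignment. In this sketch the dictionary holds only the matrix elements printed on the slide for the residue pairs that occur here, so it is not a complete substitution matrix. Dividing by the full alignment length, gap columns included, is the convention that reproduces the slide's numbers; the function name is chosen for illustration.

```python
# Matrix elements read off the slide, keyed by the unordered residue pair.
SLIDE_SCORES = {
    frozenset("G"): 6, frozenset("DE"): 2, frozenset("AR"): -1,
    frozenset("Y"): 7, frozenset("M"): 5, frozenset("VL"): 2,
    frozenset("R"): 5, frozenset("D"): 6, frozenset("W"): 11,
    frozenset("IG"): -4,
}


def identity_and_similarity(top, bottom, scores):
    """Percent identity and percent similarity ("positives") of an alignment."""
    columns = list(zip(top, bottom))
    identical = sum(1 for x, y in columns if x == y and x != "-")
    # A column is "positive" when neither residue is a gap and the
    # matrix element for the pair is greater than zero.
    positive = sum(1 for x, y in columns
                   if "-" not in (x, y) and scores[frozenset((x, y))] > 0)
    n = len(columns)
    return 100.0 * identical / n, 100.0 * positive / n


ident, simil = identity_and_similarity("GDAYM--VRDWI", "GERYMQPLRDWG", SLIDE_SCORES)
print(f"{ident:.0f}% identity, {simil:.0f}% similarity")
```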
Scoring alignments using substitution matrices

       G   D   A   Y   M   -   -   F   R   D   W   I
       G   E   R   Y   M   Q   P   L   R   D   W   G
       6   2  -1   7   5 -11  -1   0   5   6  11  -4   = 25

Here -11 is the gap opening penalty and the -1 under the second gap position is the gap extension penalty. Substitution penalties are just elements from a substitution matrix. The overall score is the sum of the scores at each position.
A more sophisticated way to assess similarity is to actually “score” the alignment using the substitution matrix. One must also apply penalties for introducing and lengthening gaps in the alignment. In theory, the raw alignment score is related to the odds or probability that the alignment represents an actual homologous relationship between two proteins. Because scoring matrices are in logarithmic odds form, the overall alignment score is the sum of the scores at each position rather than the product.
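The sum on the slide can be reproduced in a few lines. As a sketch under stated assumptions: the dictionary contains only the matrix elements shown on the slide for this alignment, the first position of a gap is charged the opening penalty (-11) and each further position the extension penalty (-1), as in the slide's numbers, and the function name is chosen here.

```python
# Matrix elements read off the slide, keyed by the unordered residue pair.
SLIDE_SCORES = {
    frozenset("G"): 6, frozenset("DE"): 2, frozenset("AR"): -1,
    frozenset("Y"): 7, frozenset("M"): 5, frozenset("FL"): 0,
    frozenset("R"): 5, frozenset("D"): 6, frozenset("W"): 11,
    frozenset("IG"): -4,
}


def score_alignment(top, bottom, scores, gap_open=-11, gap_extend=-1):
    """Sum per-column scores, with affine (open + extend) gap penalties."""
    total = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            # First gap column pays the opening penalty, later ones the extension.
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += scores[frozenset((x, y))]
            in_gap = False
    return total


print(score_alignment("GDAYM--FRDWI", "GERYMQPLRDWG", SLIDE_SCORES))
```

For brevity the sketch treats a gap in either sequence as the same gap; a fuller implementation would open a new gap when the gapped sequence switches.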
Common pairwise alignment methods
Smith-Waterman dynamic programming algorithm: Mathematically guaranteed to find the highest scoring alignment for a given set of input parameters. The tradeoff is that it is slow, although computer speed is getting to the point where this is less of a problem. The global counterpart of Smith-Waterman is called Needleman-Wunsch. If one were simply comparing any 2 sequences to see if they are homologous, Smith-Waterman would be the method of choice.
BLAST (Basic Local Alignment Search Tool) and FASTA: These two are very similar--both achieve a speed advantage over Smith-Waterman by initially looking for short “words” of 2 or 3 residues that (nearly) exactly match. Alignments are then built from these initial seed matches. The tradeoff for the speed advantage is that some homologies may be missed. Because of their speed, BLAST and FASTA are used in searches of large databases for homologues. This is a very common application--I have a protein, and I want to ask, is it related to anything about which anything is known?
Variables in local alignment-based search algorithms
scoring matrix: the generalized log odds substitution matrix used to score the alignment--BLOSUM and PAM are the most commonly used. BLOSUM 62 is the default on BLAST and BLOSUM 50 on most FASTA servers.
gap penalties: gap opening penalty (for initiating a gap) and gap extension penalty (for adding elements to an existing gap)
“word size” (“ktup” parameter in FASTA): BLAST and FASTA are so fast partly because they start by looking for short “words” that match exactly and build up a longer alignment from these words. The size of the starting words can be varied with this parameter (the shorter the word, the more it slows down the program).
filter: filters the sequence to get rid of “low complexity” regions. Such regions can lead to false positives due to their compositional bias.
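The word-seeding step and the effect of the word size can be illustrated with a short sketch. It is deliberately simplified: it finds exact word matches only, whereas BLAST also accepts near-matching words, and it stops at the seeds without extending them into alignments. The function name and the (query position, subject position) output format are choices made here.

```python
from collections import defaultdict


def seed_matches(query, subject, word_size=3):
    """Return 0-based (query_pos, subject_pos) pairs where the same word occurs."""
    # Index every word of the query once...
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    # ...then scan the subject and look each of its words up in the index.
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], ()):
            hits.append((i, j))
    return hits


query = "FRTYIAEWQRTEPGADH"
subject = "FQTYAADYRTEPSSDH"
print(seed_matches(query, subject, word_size=3))  # few seeds to extend
print(seed_matches(query, subject, word_size=2))  # more seeds, more work
```

Shorter words produce more seeds, including chance matches, which is why lowering the word size makes a search more sensitive but slower.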
Statistical significance of alignment scores: The extreme value distribution
Raw alignment scores by themselves are not particularly meaningful. In order to assess the statistical significance of an alignment, i.e. the chances that it represents a real relationship, one must understand what the distribution of alignment scores would be for random pairs of sequences of similar length and composition. Such scores obey what is called an extreme value distribution, which is like a normal distribution but has a positively skewed tail. The exact characteristics of the distribution will depend upon the scoring matrix, the gap penalties employed, the composition of the sequences, etc.
[Figure: example of an extreme value distribution, # of occurrences vs. alignment score. What is the probability P that a random alignment will have a given score or higher?]
Altschul et al. Nucleic Acids Research 25, 3389 (1997)
Statistical significance of alignment scores: Z-scores, P-values and E-values
A Z-score is the number of standard deviations by which the alignment score exceeds the mean of the distribution of scores expected for unrelated sequences, treating that distribution as normal. The FASTA algorithm reports Z-scores in its output.
A P-value is the probability that an alignment between two random sequences will have a score equal to or greater than the observed score, as calculated from the extreme value distribution.
The E-value or expect value represents the number of times that the observed score or higher would be expected to occur by chance when searching a database of D sequences. For cases where P < 0.1, E ~ D*P. Both FASTA and BLAST report E-values for alignments.
Basically, to be confident that a match between two sequences represents true homology, you generally want an E-value < 0.01. That means there is roughly a 1 in 100 chance that a match this good turns up by chance in the search, i.e. that you have a false positive. It has been shown (Brenner et al. 1998) that FASTA and BLAST E-values do a pretty good job of distinguishing true and false positives.
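The relationship between a score, its P-value and its E-value can be sketched with the standard (Gumbel) form of the extreme value distribution. The location parameter mu and scale parameter beta below are hypothetical placeholders: in practice they depend on the scoring matrix, gap penalties and sequence lengths and have to be fitted or looked up. The E-value uses the approximation E ~ D*P quoted above, and the database size in the demo is likewise arbitrary.

```python
import math


def gumbel_p_value(score, mu, beta):
    """P(random score >= score) for a Gumbel distribution.

    mu is the location (the peak of the distribution) and beta the scale;
    both are assumed to have been estimated elsewhere.
    """
    # 1 - exp(-exp(-(score - mu) / beta)), written with expm1 for accuracy
    # when the probability is very small.
    return -math.expm1(-math.exp(-(score - mu) / beta))


def e_value(p_value, database_size):
    """Expected number of chance hits in a database of D sequences (E ~ D*P)."""
    return database_size * p_value


# Hypothetical parameters, for illustration only.
mu, beta, database_size = 30.0, 5.0, 100_000
for score in (40, 60, 80, 100, 120):
    p = gumbel_p_value(score, mu, beta)
    print(score, p, e_value(p, database_size))
```

Because the tail of the distribution falls off exponentially, each fixed increase in score lowers the E-value by a constant factor, which is why modest score differences can separate significant from insignificant matches.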
Sample BLAST output
[Figure: sample BLAST output, annotated to show the alignment score, the E-value and the GenBank identifier]
“Positives” means positions at which the scoring matrix element is positive. Percent positives is sometimes also called “percent similarity”.
BLAST and FASTA can identify some homologues in the “twilight zone”--20 to 30% identity

BLAST alignment of hemoglobin and myoglobin:

 Score = 43.5 bits (101), Expect = 0.001
 Identities = 36/145 (24%), Positives = 56/145 (37%), Gaps = 2/145 (1%)

Query: 2   LSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDL 61
           L+  E   V  +W KV  D  G   + L RL   +P T   F+ F  L T   +  +  +
Sbjct: 4   LTPEEKSAVTALWGKVNVDEVGG--EALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 61

Query: 62  KKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPG 121
           K HG  VL A    L    + +     L++ H  K  +  +    +   ++ VL
Sbjct: 62  KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 121

Query: 122 DFGADAQGAMNKALELFRKDIAAKY 146
           +F    Q A  K +      +A KY
Sbjct: 122 EFTPPVQAAYQKVVAGVANALAHKY 146

Even though sequence identity here is low, the E-value is statistically significant.
Comparing pairs of sequences will not detect all homologies
Matrix-scored pairwise alignments with robust statistics like E-values do a good job of avoiding false positives--however, their coverage is imperfect (though it’s better than just using % identity). That is, there will be many relationships that they will miss because the sequences have drifted too far apart!
[Figure, from Brenner et al. PNAS 95, 6073 (1998). Black bars: relationships successfully identified by sequence comparison; most are pairs with more than 20% identical sequences. White bars: pairs of “remote homologs” missed by pairwise alignment; homology was identified independently in this trial database by known structural/functional similarity. EPQ means errors per query, ideally like E < 0.01 (1 in 100 chance of false positive). SSEARCH: Smith-Waterman algorithm.]
Multiple alignment of sequences Conservation patterns observed in families of homologous sequences carry much more useful information than do single sequences, both from the point of view of understanding structure and function for a family, as well as for identifying whether a particular sequence is homologous to a particular family. Obtaining this information depends upon the ability to generate alignments of multiple related sequences. We aren’t going to have time to talk about methods for multiple alignment. Some of the better known methods/websites, such as ClustalX for global multiple alignment, will be listed as links on the course website. I recommend Chapter 4 of David Mount’s Bioinformatics for thorough coverage of the topic. We’re going to focus instead on what one can do with multiply aligned sequences.
Position-dependent scoring matrices or “profiles” of sequence families can be generated from multiple alignments
Each row in the matrix is constructed by weighting a generalized substitution matrix by the appearance of the different amino acids at that position in the alignment. For example, a row for a position at which E, G, V and L each appear equally often might be made from an equal weight of the E, G, V and L columns in, say, a PAM250 matrix. The resulting matrix contains position-dependent information about sequence conservation within a particular family of sequences, as opposed to a generalized scoring matrix, which is constructed by averaging general sequence conservation tendencies among many families of related sequences.
Gribskov, McLachlan & Eisenberg, PNAS 84, 4355, 1987
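The weighting that builds one profile row can be written out as a short sketch. To avoid inventing matrix values, the example uses the simple identity matrix mentioned earlier (+1 for an identity, 0 otherwise) as a stand-in for PAM250; any pair-scoring function could be passed in its place. The function names are chosen here, and published profile methods also add sequence weighting and gap terms that are left out.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def identity_matrix_score(a, b):
    """Stand-in substitution matrix: +1 for an identity, 0 for a mismatch."""
    return 1.0 if a == b else 0.0


def profile_row(column, score=identity_matrix_score):
    """Score of each amino acid at one alignment position.

    The generalized matrix column for each residue observed at this position
    is weighted by how often that residue appears in the column.
    """
    counts = Counter(column)
    n = len(column)
    return {
        aa: sum((c / n) * score(observed, aa) for observed, c in counts.items())
        for aa in AMINO_ACIDS
    }


# A position at which E, G, V and L each appear once.
row = profile_row("EGVL")
print({aa: value for aa, value in row.items() if value > 0})
```

With a real substitution matrix the row would also give partial credit to residues similar to E, G, V or L, which is what makes a profile more sensitive than a plain frequency table.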
Examples of models generated from multiple alignments
• profiles
• position-specific scoring matrices (PSSM)
  (these two are almost the same thing)
• hidden Markov models (HMM)
These models can be generated for lengthy sequences or for short ungapped conserved regions (blocks or motifs).
PSI-BLAST (Position-Specific Iterated BLAST)
Altschul et al. Nucleic Acids Research 25, 3389 (1997)
query sequence -> initial BLAST search -> hits with significant similarity (e.g. E < 0.005) -> multiple alignment of hits -> PSSM -> iterated BLAST search using the PSSM as query
The utility obviously depends on getting some seed hits. The utility of PSI-BLAST in finding more remote homologues than simple pairwise searches has been demonstrated. An example of a similar program that uses a hidden Markov model rather than a PSSM is SAM-T99 (now SAM-T02).
Example of utility of PSI-BLAST
• two BRCT domains from BRCA1 used as query
• initial BLAST with a cutoff of E < 0.01 brings up only BRCT domains from other BRCA1s (orthologues)
• repeated rounds of PSI-BLAST bring up many others and reveal the first plant protein to contain BRCTs
• few false positives were found using the E < 0.01 cutoff
Altschul et al. Nucleic Acids Research 25, 3389 (1997)
Searching profile databases
[Diagram: a query sequence is searched against a database of HMMs or PSSMs]
A number of researchers have used similarity searches to cluster the known proteins into homologous groups, and then generated profiles for each cluster using HMMs or PSSMs. Servers now allow one to do similarity searches of these database profiles using a single query sequence. This is qualitatively the reverse of what is done in PSI-BLAST, in which one generates a profile and uses it to match individual database sequences. Some of these profiles represent motifs or short ungapped “blocks”, whereas others are the length of entire domains. Among the best known collections of domain profiles are SMART and Pfam. These two form part of what is now called the Conserved Domain Database (CDD). BLAST searches with the NCBI server will now automatically do a search against the CDD unless you opt not to.