610 likes | 628 Vues
Explore protein families and alignments using BLOSUM Matrices in biological sequence analysis. Understand substitutions, odds, logodds, and clustering for related enzyme sequences.
 
                
                E N D
Biological Sequence Analysis • Chapter 3
Protein Families Organism 1 Enzyme 1 Related Sequences MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS ::::::.::::::::::::::::::::.:::::::::::::::::::::::::::::::::::::::::: MSEKKQTVDLGLLEEDDEFEEFPAEDWTGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS Closely related Same Function Protein Family Enzyme 2 Organism 2
Alignments ACDEFGHIKLMN ACEDFGHIPLMN 75%ID ACDEFGHIKLMN ACACFGKIKLMN 75%ID
Substitutions D Aspartic acid E Glutamic acid
Substitutions T Threonine S Serine
Substitutions A Alanine W Tryptophane
Deriving Substitution ScoresBLOSUM, Henikoff & Henikoff, 1992 Protein Family Block A Block B
BLOSUM MatricesHenikoff & Henikoff, 1992 w ...A... A 8 AA 1 AS ...A... 7 AA 1 AS ...A... 6 AA 1 AS ...A... 5 AA 1 AS s ...A... 4 AA 1 AS ...A... 3 AA 1 AS ...A... 2 AA 1 AS ...S... 0 AA 2 SA ...A... 1 AA 0 AS ...A... f 36 9 ws(s-1)/2 = 1x10x9/2 = 45
AA VY AR AC AS WW AD AT WY AE AV YY AW AY CC CD 0 0 0.2 0 0 0 0 0 0 0 0 0 0.8 0 0 0 BLOSUM MatricesHenikoff & Henikoff, 1992 Raw Frequencies ......... ......... q
BLOSUM MatricesHenikoff & Henikoff, 1992 45 pairs = 90 participants in pairs AA AS A’s in pairs: 36x2 + 9x1 = 81 S’s in pairs: 0x2 + 9x1 = 9 Probability pA for encountering an A: 81/90 = 0.9 Probability pS for encountering an S: 9/90 = 0.1 The probability of occurrence of the i’th amino acid in an i, j pair is:
BLOSUM MatricesHenikoff & Henikoff, 1992 Expected probability, e, of occurrence of pairs: eAA = pApA = 0.9x0.9 = 0.81 eSS = pSpS = 0.1x0.1 = 0.01 eAS = pApS + pSpA = 0.9x0.1 + 0.1x0.9 = 2x(0.9x0.1) = 0.18
BLOSUM MatricesHenikoff & Henikoff, 1992 Odds and logodds: Odd ratio: logodd, s: means that the observed frequencies are as expected means that the observed frequencies are lower than expected means that the observed frequencies are higher than expected In the final BLOSUM matrices values are presented in half-bits, i.e., logodds are multiplied with 2 and rounded to nearest integer.
BLOSUM MatricesHenikoff & Henikoff, 1992 • Segment clustering • Sequences with more than X% ID are represented as one average sequence (cluster) • Sequences are added to the cluster if it has more than X% ID to any of the sequences already in the cluster • If the clustering level is more than 50% ID, the final Matrix is a BLOSUM50, more than 62% leads to the BLOSUM62 matrix, etc.
BLOSUM MatricesHenikoff & Henikoff, 1992 A R N D C Q E G H I L K M F P S T W Y V B Z X * A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4 B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4 Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4 * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1 ACDEFGHIKLMN ACACFGKIKLMN ACDEFGHIKLMN ACEDFGHIPLMN 4+9-2-4+6+6-1+4+5+4+5+6 = 42 4+9+2+2+6+6+8+4-1+4+5+6 = 55
Gaps A 10 20 30 40 50 60 70 humanD -----MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS ..: . : :.:. : . .. ...:. :::::::::::::::..::::.:::: Anophe MSDKENKDKPKLDLGLLEEDDEFEEFPAEDWAGNKEDEEELSVWEDNWDDDNVEDDFNQQLRAQLEKHK------ 10 20 30 40 50 60 B 10 20 30 40 50 60 70 humanD ----MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS ....:...:::::::::::::::::::::..:::..........::....:..::.......... Anophe MSDKENKDKPKLDLGLLEEDDEFEEFPAEDWAGNKEDEEELSVWEDNWDDDNVEDDFNQQLRAQLEKHK----- 10 20 30 40 50 60 Figure 3.3: (A) The human proteasomal subunit aligned to the mosquito homolog using the BLOSUM50 matrix. (B) The human proteasomal subunit aligned to the mosquito homolog using identity scores. 10 20 30 40 50 60 70 humanD ----MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAH-VWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS ....:...:::::::::::::::::::::..:::... :::::::::::::::..::::.:::: Anophe MSDKENKDKPKLDLGLLEEDDEFEEFPAEDWAGNKEDEEELSVWEDNWDDDNVEDDFNQQLRAQLEKHK------ 10 20 30 40 50 60
Gap Penalties • A gap is a kind like a mismatch but... • Often the gap score (gap penalty) has an even lower value than the lowest mismatch score • Having only one type of gap penalties is called a linear gap cost • Biologically gaps are often inserted/deleted as a one or more event • In most alignment algorithms is two gap penalties. • One for making the first gap • Another (higher score) for making an additional gap • Affine gap penalty
Dynamic Programming The rest of the slides are stolen from Anders Gorm Petersen
Alignment depicted as path in matrix T C G C A T C C A TCGCA TC-CA T C G C A T C C A TCGCA T-CCA
Alignment depicted as path in matrix T C G C A T C C A Meaning of point in matrix: all residues up to this point have been aligned (but there are many different possible paths). x Position labeled “x”: TC aligned withTC --TC-TCTC TC--T-CTC
Dynamic programming: computation of scores T C G C A T C C A Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”). => Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities. x
Dynamic programming: computation of scores T C G C A T C C A Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”). => Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities. x score(x,y-1) - gap-penalty score(x,y) = max
Dynamic programming: computation of scores T C G C A T C C A Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”). => Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities. x score(x,y-1) - gap-penalty score(x-1,y-1) + substitution-score(x,y) score(x,y) = max
Dynamic programming: computation of scores T C G C A T C C A Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”). => Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities. x score(x,y-1) - gap-penalty score(x-1,y-1) + substitution-score(x,y) score(x-1,y) - gap-penalty score(x,y) = max
Dynamic programming: computation of scores T C G C A T C C A Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”). => Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities. x Each new score is found by choosing the maximum of three possibilities. For each square in matrix: keep track of where best score came from. Fill in scores one row at a time, starting in upper left corner of matrix, ending in lower right corner. score(x,y-1) - gap-penalty score(x-1,y-1) + substitution-score(x,y) score(x-1,y) - gap-penalty score(x,y) = max
Dynamic programming: example A C G T A1-1 -1 -1 C -11-1 -1 G -1 -11-1 T -1 -1 -11 Gaps: -2
Dynamic programming: example T C G C A : : : : T C - C A 1+1-2+1+1 = 2
Global versus local alignments Global alignment: align full length of both sequences. (The “Needleman-Wunsch” algorithm). Local alignment: find best partial alignment of two sequences (the “Smith-Waterman” algorithm). Global alignment Seq 1 Local alignment Seq 2
Local alignment overview • The recursive formula is changed by adding a fourth possibility: zero. This means local alignment scores are never negative. • Trace-back is started at the highest value rather than in lower right corner • Trace-back is stopped as soon as a zero is encountered score(x,y-1) - gap-penalty score(x-1,y-1) + substitution-score(x,y) score(x-1,y) - gap-penalty 0 score(x,y) = max
Substitution matrices and sequence similarity • Substitution matrices come as series of matrices calculated for different degrees of sequence similarity (different evolutionary distances). • ”Hard” matrices are designed for similar sequences – Hard matrices a designated by high numbers in the BLOSUM series (e.g., BLOSUM80) – Hard matrices yield short, highly conserved alignments • ”Soft” matrices are designed for less similar sequences – Soft matrices have low BLOSUM values (45) – Soft matrices yield longer, less well conserved alignments
Alignments: things to keep in mind “Optimal alignment” means “having the highest possible score, given substitution matrix and set of gap penalties”. This is NOT necessarily the biologically most meaningful alignment. Specifically, the underlying assumptions are often wrong: substitutions are not equally frequent at all positions, affine gap penalties do not model insertion/deletion well, etc. Pairwise alignment programs always produce an alignment - even when it does not make sense to align sequences.
Database searching Using pairwise alignments to search databases for similar sequences Query sequence Database
Database searching Most common use of pairwise sequence alignments is to search databases for related sequences. For instance: find probable function of newly isolated protein by identifying similar proteins with known function. Most often, local alignment ( “Smith-Waterman”) is used for database searching: you are interested in finding out if ANY domain in your protein looks like something that is known. Often, full Smith-Waterman is too time-consuming for searching large databases, so heuristic methods are used (fasta, BLAST).
Database searching: heuristic search algorithms FASTA (Pearson 1995) Uses heuristics to avoid calculating the full dynamic programming matrix Speed up searches by an order of magnitude compared to full Smith-Waterman The statistical side of FASTA is still stronger than BLAST BLAST (Altschul 1990, 1997) Uses rapid word lookup methods to completely skip most of the database entries Extremely fast One order of magnitude faster than FASTA Two orders of magnitude faster than Smith-Waterman Almost as sensitive as FASTA
BLAST flavors BLASTN Nucleotide query sequence Nucleotide database BLASTP Protein query sequence Protein database BLASTX Nucleotide query sequence Protein database Compares all six reading frames with the database TBLASTN Protein query sequence Nucleotide database ”On the fly” six frame translation of database TBLASTX Nucleotide query sequence Nucleotide database Compares all reading frames of query with all reading frames of the database
Searching on the web: BLAST at NCBI Very fast computer dedicated to running BLAST searches Many databases that are always up to date Nice simple web interface But you still need knowledge about BLAST to use it properly
When is a database hit significant? • Problem: • Even unrelated sequences can be aligned (yielding a low score) • How do we know if a database hit is meaningful? • When is an alignment score sufficiently high? • Solution: • Determine the range of alignment scores you would expect to get for random reasons (i.e., when aligning unrelated sequences). • Compare actual scores to the distribution of random scores. • Is the real score much higher than you’d expect by chance?
Random alignment scores follow extreme value distributions Searching a database of unrelated sequences result in scores following an extreme value distribution The exact shape and location of the distribution depends on the exact nature of the database and the query sequence
Significance of a hit: one possible solution (1) Align query sequence to all sequences in database, note scores (2) Fit actual scores to a mixture of two sub-distributions: (a) an extreme value distribution and (b) a normal distribution (3) Use fitted extreme-value distribution to predict how many random hits to expect for any given score (the “E-value”)
Significance of a hit: example Search against a database of 10,000 sequences. An extreme-value distribution (blue) is fitted to the distribution of all scores. It is found that 99.9% of the blue distribution has a score below 112. This means that when searching a database of 10,000 sequences you’d expect to get 0.1% * 10,000 = 10 hits with a score of 112 or better for random reasons 10 is the E-value of a hit with score 112. You want E-values well below 1!
Database searching: E-values in BLAST BLAST uses precomputed extreme value distributions to calculate E-values from alignment scores For this reason BLAST only allows certain combinations of substitution matrices and gap penalties This also means that the fit is based on a different data set than the one you are working on A word of caution: BLAST tends to overestimate the significance of its matches E-values from BLAST are fine for identifying sure hits One should be careful using BLAST’s E-values to judge if a marginal hit can be trusted (e.g., you may want to use E-values of 10-4 to 10-5).
Refresher: pairwise alignments • Most used substitution matrices are themselves derived empirically from simple multiple alignments A/A 2.15% A/C 0.03% A/D 0.07% ... Calculate substitution frequencies Multiple alignment Convert to scores Freq(A/C),observed Freq(A/C),expected Score(A/C) = log
Multiple alignments: what use are they? • Starting point for studies of molecular evolution
Multiple alignments: what use are they? • Characterization of protein families: • Identification of conserved (functionally important) sequence regions • Prediction of structural features (disulfide bonds, amphipathic alpha-helices, surface loops, etc.)
Scoring a multiple alignment:the “sum of pairs” score AA: 4, AS: 1, AT:0 AS: 1, AT: 0 ST: 1 Score: 4+1+0+1+0+1 = 7 ...A... ...A... ...S... ...T... One column from alignment ⇒In theory, it is possible to define an alignment score for multiple alignments (there are many alternative scoring systems)
Multiple alignment: dynamic programming is only feasible for very small data sets • In theory, optimal multiple alignment can be found by dynamic programming using a matrix with more dimensions (one dimension per sequence) • BUT even with dynamic programming finding the optimal alignment very quickly becomes impossible due to the astronomical number of computations • Full dynamic programming only possible for up to about 4-5 protein sequences of average length • Even with heuristics, not feasible for more than 7-8 protein sequences • Never used in practice Dynamic programming matrix for 3 sequences For 3 sequences, optimal path must come from one of 7 previous points