420 likes | 594 Vues
Sequence Alignments and Database Searching. Aug 27, 2008 Biochemistry 111 Thang Chiu, MEB 7E1, x2024. Adapted from DKW lecture. Protein A of interest to you. ornithine decarboxylase?. Why compare protein sequences?.
E N D
Sequence Alignments and Database Searching Aug 27, 2008 Biochemistry 111 Thang Chiu, MEB 7E1, x2024 Adapted from DKW lecture
Protein A of interest to you. ornithine decarboxylase? Why compare protein sequences? Significant sequence similarities allow associations based upon known functions.
Homology vs. similarity Possible for proteins to possess high sequence identity/ similarity between segments and not be homologous Homologous proteins (ie having similar structures) need not posess high sequence identity / similarity: S. griseus trypsin 36% S. griseus protease A 25% cytochrome c4, has reasonably high sequence identity/ similarity with trypsins, yet does not have common ancestor, nor common fold. subtilisin has same spatial arrangement of active site residues, but is not related to trypsins Extracted from ISMB2000 tutorial, WR Pearson, U. of Virginia
Homology vs. similarity • Homologous proteins always share a commonthree- • dimensional fold, often with common active or binding site. • Proteins that share a common ancestor are homologous. • Proteins that possess >25% identity across entirelength generally will be homologous. • Proteins with <20% identity are not necessarily not homologous
Homology vs. similarity Extracted from ISMB2000 tutorial, WR Pearson, U. of Virginia Orthologous cyctochrome c isozymes Homologous sequences are either: 1) orthologous, or 2) paralogous Orthologs - sequence differences arises from divergence in different species (i.e. cyctochrome c) Paralogs - sequence differences arise after gene duplication within a given species (i.e. GPCRs, hemoglobins) Hemoglobins contain both orthologs and paralogs • For orthologs - sequence divergence and evolutionary relationships will agree. • For paralogs - no necessary linkage between sequence divergence and speciation.
We’ve all seen and/or used sequence alignments, but how are they accomplished? Sequence searches and alignments using DNA/RNA are usually not as informative as searches and alignments using protein sequences. However. DNA/RNA searches are intuitively easier to understand: AGGCTTAGCAAA........TCAGGGCCTAATGCG |||||||| ||| ||||||||||| ||| AGGCTTAGGAAACTTCCTAGTCAGGGCCTAAAGCG The above alignment could be scored giving a “1” for each identical nucleotide, A zero for a mismatch, and a -4 for “opening a “gap” and a -1 for each extension of the gap. So score = 25 – 11= 14
Protein sequence alignments are much more complicated. How would this alignment be scored? ARDTGQEPSSFWNLILMY.........DSCVIVHKKMSLEIRVH | | | | | ||| | | || ||| AKKSAEQPTSYWDIVILYESTDKNDSGDSCTLVKKRMSIQLRVH Unlike nucleotide sequence alignments, which are either identical or not identical at a given position, protein sequence alignments include “shades of grey” where one might acknowledge that a T is sort of equivalent to an S etc. But how equivalent? What number would you assign to an S-T mismatch? And what about gaps? Since alanine is a common amino acid, couldn’t the A-A match be by chance? Since Trp and Cys are uncommon, should those matches be given higher scores? Do you see that accurately aligning sequences and accurately finding related sequences are the same problem?
Needleman-Wunsch global sequence alignment (JMB (1970), 48, 443-453) Assign score to all cells A B C N J R A B C N J R A B C N J R A B C N J R A B C N J R A B C N J R A J C J N R A 1 0 0 0 0 0 J 01 1 1 2 1 C 01 1 0 0 0 J 01 0 0 1 0 N 01 0 1 0 0 R 01 0 0 0 1 A 1 0 0 0 0 0 J 01 1 1 2 1 C 01 2 1 1 2 J 01 1 2 3 2 N 01 1 3 2 3 R 01 1 2 2 4 A 1 0 0 0 0 0 J 0 0 0 0 1 0 C 0 0 1 0 0 0 J 0 0 0 0 1 0 N 0 0 0 1 0 0 R 0 0 0 0 0 1 A 1 0 0 0 0 0 J 01 1 1 2 1 C 01 2 1 1 2 J 01 1 2 3 2 N 01 1 3 2 3 R 01 1 2 2 4 A 1 0 0 0 0 0 J 0 0 0 0 1 1 C 0 0 1 0 0 0 J 0 0 0 0 1 0 N 0 0 0 1 0 0 R 0 0 0 0 0 1 Traceback A B C N J R A J C J N R A B C N J R A J C J N R SUM S(I,j) with max of S(I,j) of previous column/row OR
Databases Nucleotide: GenBank (NCBI), EMBL, DDBJ Protein: SwissProt, TrEMBL, GenPept(GenBank) Huge databases – share much information. Many entries linked to other databases (e.g. PDB). SwissProt small but well “curated”. NCBI non-redundant (nr) protein sequence database is very large but sometimes confusing. These databases can be searched in a number of ways. Can search only human or metazoan sequences. Can eliminate entries made before a given Date. Etc.
Continued…. NCBI GI numbers:a series of digits that are assigned consecutively by NCBI to each sequence it processes. Version numbers:consist of the accession number followed by a dot and a version number. Nucleotide sequence: GI: 6995995VERSION: NM_000492.2 Protein translation: GI: 6995996VERSION: NP_000483.2 >gi|897557|gb|AAA98443.1| TIAM1 protein http://www.ornl.gov/sci/techresources/Human_Genome/posters/chromosome/geneguide.shtml
We’ve got the data, now how do we score/search? First, we need a way to assign numbers to “shades of grey” matches. Genetic code scoring system – This assumes that changes in protein sequence arise from mutations. If only one point mutation is needed to change a given AA to another (at a specific position in alignment), the two amino-acids are more closely related than if two point mutations were required. Physicochemical scoring system – a Thr is like a Ser, a Trp is not like an Ala…… These systems are seldom used because they have problems. Why try to second guess Nature? Since there are many related sequences out there, we can look at some (trusted) alignments to SEE which sub- stitutions have occurred and the frequency with which they occur.
PAM (Percent Acceptable Mutation) matrices • Are derived from studying global alignments of well-characterized protein families. • PAM1 = only 1% of residues has changed (ie short evolutionary distance) • Raise this to 250 power to get 250% change of two sequences (greater • evolutionary distance), or about 20% sequence identity. • Therefore, • a PAM 30 would be used to analyze more closely related proteins, • a PAM 400 is used for finding and analyzing distantly related proteins. • PAMx = PAM1x • (Dayhoff, Atlas of Protein Sequence and Structure, vol. 5, suppl 3, p 345-352)
Block substitution matrices (BLOSUM) Arederived from studying local alignments (blocks) of sequences from related proteins that differ by no more than X%. (Henikoff & Henikoff, PNAS ‘92, 89, p10915-10919) In other words, one might use the portions of aligned sequences from related proteins that have no more than 62% identity (in the portions or blocks) to derive the BLOSUM 62 scoring matrix. One might use only the blocks that have <80% identity to derive the BLOSUM 80 matrix. 3) BLOSUM and PAM substitution matrices have the opposite effects: The higher the number of the BLOSUM matrix (BLOSUM X), the more closely related proteins you are looking for. The higher the number of the PAM matrix (PAM X), the more distantly related proteins you are looking for.
Amino acid substitution matrices • Negative scores - unlikely substitutions Note that for identical matches, scores vary depending upon observed frequencies. That is, rare amino acid (i.e. Trp) that are not substituted have high scores; frequently occuring amino acids (i.e. Ala) are down-weighted because of the high probability of aligning by chance. PAM250 matrix Extracted from ISMB2000 tutorial, WR Pearson, U. of Virginia
Gap penalties – Intuitively one recognizes that there should be a penalty for introducing (requiring) a gap during identification/alignment of a given sequence. But if two sequences are related, the gaps may well be located In loop regions which are more tolerant of mutational events and probably have little impact on structure. Therefore, a new gap should be penalized, but extending an existing gap should be penalized very little. Filtering – many proteins and nucleotides contain simple repeats or regions of low sequence complexity. These must be excluded from searches and alignments. Why? Significance of a “hit” during a search - More important than an arbitrary score is an estimation of the likelihood of finding a hit through pure chance. Ergo the “Expectation value” or E-value. E-values can be as low as 10-70.
E-value So, for sufficiently large databases (so can apply statistics): E = Kmne-S m- query length n - database length E - expectation value K - scale factor for search space (database) - scale factor for scoring system S - score, dependent on substitution matrix, gap-penalties, etc. Doubling either sequence string doubles number of sequences with a given expectation value; similarly, double the score and expectation value decreases exponentially Expectation value - probability that given score will occur by chance given the query AND database strings
Removing length bias from scoring statistics • Must account for increases in similarity score due to increase in sequence length searched. • Scaling with against the sequence length allows the detection of distantly-related sequences. • solids = individual sequence • opens = average score Extracted from ISMB2000 tutorial, WR Pearson, U. of Virginia
Global versus local alignments • Global scores require alignment of entire sequence length. • Cannot be used to detect relationships between domains in mosaic proteins. Local alignments are necessary to detect domains within mosaic proteins, internal duplications. Extracted from ISMB2000 tutorial, WR Pearson, U. of Virginia
Basic local alignment search tool (BLAST) • Break query up into “words” e.g. ASTGHKDLLV • AST • WORDS STG • TGH • 2) Generate expanded list of words that would match with (i.e. PAM250) • a score of at least T – You’re acknowledging that you may not have any • exact matches with original list of words. • 3) Use expanded list of words to search database for exact matches. • 4) Extend alignments from where word(s) found exact match. Heuristic algorithm – Uses guesses. Increases speed without a great loss of accuracy (BLASTP, FASTA (local Hueristic), S-W local rigorous, Needleman-Wunsch global, rigorous)
Pictorial representation of BLAST algorithm (Basic Local Alignment Search Tool). Query sequence Words (they overlap) Expand list of words Search database, find exact hits, extend alignments Report sorted list of hits
BLAST ATCGCCATGCTTAATTGGGCTT CATGCTTAATT exact word match one hit Nucleotide BLAST looks for exact matches Protein BLAST requires two hits GTQITVEDLFYNI SEI YYN neighborhood words NCBI two hits
FASTA Instead of breaking up query into words (and then generating a list of similar words), find all sequences in the database that contain short sequences that are exact or nearly exact matches for sequences within the query. Score these and sort. Sort of reverse methodology to BLAST Database sequence Query sequence
sorted by e values 5 X 10-98 S’ = (λS –lnK)/ln2 E=mn2-S’ link to entrez LocusLink
Identifying distant homologies (use several different query sequences) Also remember - If A is homologous to B, and B to C, then A should be homologous to C Examine output carefully. A lack of statistical significance doesn’t necessarily mean a lack of homology! Extracted from ISMB2000 tutorial, WR Pearson, U. of Virginia
PSI-BLAST Very sensitive, but must not include a non-member sequence! • Regular BLAST search • Sequences above a certain threshold (< specified E-value) are • included. Assumed to be related proteins. This group of sequences • is used to define a “profile” that contains the essence of the “family”. • Now with the important sequence positions highlighted, can look • for more distantly related sequences that should still have the essence • of the protein family. • Inclusion of more distantly related sequences modifies the profile • further (further defines the essence) and allows for identification of • even more distantly related sequences. Etc. Note: PSI-BLAST may find and then subsequently lose a homologous sequence during the iteration process! “Drifting” of the program, would be the gradual loss of close homologs during the iteration process.
>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE AMINOH MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFLAKFDYY VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDLVNQGLQ EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYEGAVKNG RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAWDPKTTH VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKKELLERLY PSI-BLAST: initial run NCBI e value cutoff for PSSM
PSI-BLAST: first PSSM search Other purine nucleotide metabolizing enzymes not found by ordinary BLAST NCBI
PSI-BLAST: importance of original query (remember, if A is like B….) iteration 1 iteration 2 PSI-Blast of human Tiam1
PSI-BLAST: importance of original query iteration 1 iteration 2 PSI-Blast of mouse Tiam2 (~90% identity with human Tiam1) Ras-binding domains iteration 3
Weakly conserved serine Active site serine Position specific scoring matrix (PSSM) (learning from your “hits”) NCBI
Position specific scoring matrix (PSSM) A R N D C Q E G H I L K M F P S T W Y V 206 D 0 -2 0 2 -4 2 4 -4 -3 -5 -4 0 -2 -6 1 0 -1 -6 -4 -1 207 G -2 -1 0 -2 -4 -3 -3 6 -4 -5 -5 0 -2 -3 -2 -2 -1 0 -6 -5 208 V -1 1 -3 -3 -5 -1 -2 6 -1 -4 -5 1 -5 -6 -4 0 -2 -6 -4 -2 209 I -3 3 -3 -4 -6 0 -1 -4 -1 2 -4 6 -2 -5 -5 -3 0 -1 -4 0 210 S -2 -5 0 8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5 1 -3 -7 -5 -6 211 S 4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 4 3 -6 -5 -3 212 C -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5 0 -7 -4 -4 -5 0 -4 213 N -2 0 2 -1 -6 7 0 -2 0 -6 -4 2 0 -2 -5 -1 -3 -3 -4 -3 214 G -2 -3 -3 -4 -4 -4 -5 7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6 215 D -5 -5 -2 9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7 216 S -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6 -5 -5 217 G -3 -6 -4 -5 -6 -5 -6 8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7 218 G -3 -6 -4 -5 -6 -5 -6 8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7 219 P -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7 9 -4 -4 -7 -7 -6 220 L -4 -6 -7 -7 -5 -5 -6 -7 0 -1 6 -6 1 0 -6 -6 -5 -5 -4 0 221 N -1 -6 0 -6 -4 -4 -6 -6 -1 3 0 -5 4 -3 -6 -2 -1 -6 -1 6 222 C 0 -4 -5 -5 10 -2 -5 -5 1 -1 -1 -5 0 -1 -4 -1 0 -5 0 0 223 Q 0 1 4 2 -5 2 0 0 0 -4 -2 1 0 0 0 -1 -1 -3 -3 -4 224 A -1 -1 1 3 -4 -1 1 4 -3 -4 -3 -1 -2 -2 -3 0 -2 -2 -2 -3 Serine scored differently in these two positions Active site nucleophile NCBI
Multiple sequence alignments (MSAs) In this example, an MSA is used to identify regions of high sequence conservation presumably reflecting structural and functional constraints. Useful for delimiting known domains and potential new functional regions (e.g. the Ras-binding domain in yellow and the blue box of currently unknown function).
Fun with MSA... MSA used to locate functional residues and domain boundaries in homologs of Dbl-proteins with known structure (Dbs and Tiam1). Red amino acids directly interact with GTPases. Blue residues directly interact with phosphoinositides.
Tutorial on Jalview for MSA Determining domain boundary for construct to express Secondary, possible 3D structural information to help narrow down 5’ and 3’ regions for PCR primers
What you should know Homology If two proteins are homologous, they have a common fold and a common ancestor If two proteins have >25% identity across their entire length, they are likely to be Homologs. However, sometimes true homologs have quite low sequence identity! Orthologs Homologous (and equivalent) proteins from different species. Arise from speciation. Paralogs Homologous (and equivalent) proteins found in same species. Divergence of sequences NOT from speciation. Alignments How to score? Minimum # of mutations?, Physicochemical properties (as perceived by us)?, Or learn from nature? Scoring schemes PAM, BLOSUM
E values What it means in words E = Kmne -λS Alignment algorithms BLAST (Basic Local Alignment Search Tool) FASTA (Fast Alignment) Needleman-Wunsch (Global alignment) Why use local alignment algorithm?