Protein Evolution and Sequence Analysis
Central Premise
Significant sequence similarity is a direct consequence of evolutionary relationships, and it allows one to assign a function to an unknown protein based on the properties of known proteins.
Speciation - Evolution of a new gene/protein that is genetically independent of the ancestral gene from which it arose.
Homolog - A gene/protein related to a second gene/protein by descent from a common ancestral gene.
Ortholog - Genes/proteins in different species that evolved from a common ancestral gene by speciation and that typically retain the same function.
Paralog - Genes/proteins related by duplication of a common ancestral gene; the copies may evolve new functions, even if these remain related to that of the ancestor.
Convergent evolution - Independent evolution of similar features or properties in genes/proteins of different genetic lineages.
Divergent and Convergent Evolution Among the Serine Proteases
[Figure: chymotrypsin, trypsin, and their overlay (PDB 3NKK, 1ACB); subtilisin (PDB 1SBT)]
Mechanisms Involved in Molecular Evolution of Genes/Proteins
Mutation - Stochastic single-point changes in the genetic material caused by errors in DNA replication during cell division, radiation exposure, chemical or environmental stressors, or viruses and transposable elements. Mutations accumulate at a slow but roughly constant rate (the molecular clock) of 10^-9 to 10^-8 mutations per base per generation. Splicing errors in eukaryotes that retain introns also contribute.
Recombination - Exchange of genes or portions of genes between different chromosomes to create new combinations of elements.
Gene duplication - Duplication of a gene or portions of a gene; one copy continues the original function and the other is free to evolve and acquire new functions.
Retrotransposition - Incorporation of mRNA sequences back into DNA, frequently inserting into new locations with different expression patterns.
The mechanisms by which new genes/proteins arise make it possible to use sequence analysis to infer functional and structural relationships among different sequences.
Sequence alignments are methods of arranging DNA, RNA, or protein sequences to identify regions of similarity or identity, with the goal of inferring structure, function, or both. Sequence searches and alignments using DNA/RNA are usually not as informative as those using protein sequences. However, DNA/RNA alignments are intuitively easier to understand:

AGGCTTAGCAAA........TCAGGGCCTAATGCG
|||||||| |||        ||||||||||| |||
AGGCTTAGGAAACTTCCTAGTCAGGGCCTAAAGCG

The above pairwise alignment could be scored by giving +1 for each identical nucleotide, 0 for a mismatch, -4 for opening a gap, and -1 for each extension of the gap. There are 25 identities and 2 mismatches, and the single 8-position gap costs 4 for opening plus 7 for the extensions beyond its first position, so score = 25 - 11 = 14.
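The scoring scheme above can be checked with a few lines of Python. This is only a minimal sketch: the function name `score_alignment` and its parameter names are illustrative choices, and it assumes that the opening penalty covers the first position of a gap, which is the convention that reproduces the score of 14.

```python
GAP_CHARS = ".-"  # characters treated as gap symbols

def score_alignment(top, bottom, match=1, mismatch=0, gap_open=-4, gap_extend=-1):
    """Score a pairwise alignment given as two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_side = None  # which sequence the current gap run is in, if any
    for a, b in zip(top, bottom):
        if a in GAP_CHARS or b in GAP_CHARS:
            side = "top" if a in GAP_CHARS else "bottom"
            # The first column of a gap run pays the opening penalty;
            # every further column of the same run pays the extension penalty.
            score += gap_extend if side == gap_side else gap_open
            gap_side = side
        else:
            score += match if a == b else mismatch
            gap_side = None
    return score

TOP = "AGGCTTAGCAAA........TCAGGGCCTAATGCG"
BOTTOM = "AGGCTTAGGAAACTTCCTAGTCAGGGCCTAAAGCG"

print(score_alignment(TOP, BOTTOM))  # 25 identities - 11 gap cost = 14
```

Changing `gap_open` or `gap_extend` shows how strongly the total depends on the gap model rather than on the number of identities.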
Protein sequence alignments are much more complicated but are more informative because each position can hold any of 20 amino acids rather than any of 4 bases.

ARDTGQEPSSFWNLILMY.........DSCVIVHKKMSLEIRVH
|      | | |     |         |||  | | ||   |||
AKKSAEQPTSYWDIVILYESTDKNDSGDSCTLVKKRMSIQLRVH

Unlike nucleotide alignments, in which a position is either identical or not, protein sequence alignments include “shades of grey” where one might acknowledge that a T is sort of equivalent to an S. But how equivalent? What number would you assign to an S-T mismatch? And what about gaps? Since alanine is a common amino acid, couldn’t the A-A match be by chance? Since Trp and Cys are uncommon, should those matches be given higher scores? Therefore, accurately aligning sequences and accurately finding related sequences are approximately the same problem.
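To make the “shades of grey” concrete, the sketch below sorts each column of the protein alignment above into identical, similar, different, or gap. The similarity groups in `SIMILAR_GROUPS` are a coarse grouping assumed here purely for illustration; they are not taken from any published scoring matrix, and a different grouping would give different counts.

```python
TOP = "ARDTGQEPSSFWNLILMY.........DSCVIVHKKMSLEIRVH"
BOTTOM = "AKKSAEQPTSYWDIVILYESTDKNDSGDSCTLVKKRMSIQLRVH"

GAP_CHARS = ".-"

# Hypothetical, coarse groups of chemically similar residues (an assumption).
SIMILAR_GROUPS = ("ST", "KRH", "DE", "NQ", "ILVM", "FYW", "AG")

def classify_columns(top, bottom, groups=SIMILAR_GROUPS):
    """Count identical, similar, different, and gapped alignment columns."""
    counts = {"identical": 0, "similar": 0, "different": 0, "gap": 0}
    for a, b in zip(top, bottom):
        if a in GAP_CHARS or b in GAP_CHARS:
            counts["gap"] += 1
        elif a == b:
            counts["identical"] += 1
        elif any(a in group and b in group for group in groups):
            counts["similar"] += 1
        else:
            counts["different"] += 1
    return counts

counts = classify_columns(TOP, BOTTOM)
aligned = len(TOP) - counts["gap"]  # columns without a gap
print(counts)
print(f"identity over aligned columns: {counts['identical'] / aligned:.1%}")
```

Only 15 of the 35 aligned positions are identical, yet most of the remaining positions pair residues from the same group, which is the kind of information an identity-only score throws away.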
Multiple Sequence Alignments
Sequence comparisons fall into two categories: local alignments, in which regions of large sequences are compared to identify regions of similarity such as domains, and global alignments, in which sequences of similar length are compared to analyze overall similarity. Various methods are available, depending on the assumptions of the algorithm and the types of sequences to be analyzed. All require a scoring scheme for dealing with substitutions and with gaps (insertions and deletions).
Clustal is a commonly used global alignment algorithm for performing multiple sequence alignments. The algorithm is executed in three stages: (1) a pairwise sequence comparison is performed across all sequences; (2) the pairwise information is used to create a guide tree; (3) the guide tree is used to build the final alignment progressively, starting from the most similar sequences.
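The first two Clustal stages can be illustrated with a toy example. This is a simplified sketch, not Clustal's actual procedure: it assumes equal-length toy sequences, uses the fraction of differing positions as the distance, and joins clusters by average linkage, whereas Clustal derives its distances and guide tree differently. All names and sequences are hypothetical.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of positions that differ between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def guide_tree_order(seqs):
    """Return the merges of an average-linkage clustering, closest pair first."""
    # Stage 1: compare every pair of sequences.
    dist = {frozenset((a, b)): p_distance(seqs[a], seqs[b])
            for a, b in combinations(seqs, 2)}

    def avg_dist(c1, c2):
        # Mean distance between all members of two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    # Stage 2: repeatedly join the two closest clusters to form the guide tree.
    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        merges.append((c1, c2))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
    return merges

SEQS = {
    "seq_a": "ACGTACGTAC",
    "seq_b": "ACGTACGTAA",
    "seq_c": "ACGAACCTAA",
    "seq_d": "TCGAACCTGA",
}

for left, right in guide_tree_order(SEQS):
    print(left, "+", right)
```

In stage 3 a progressive aligner would follow this merge order, aligning the closest pair first and then adding the remaining sequences or groups.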
PAM (Point Accepted Mutation) matrices
• Are derived from studying global alignments of well-characterized protein families.
• PAM1 = only 1% of residues have changed (i.e., a short evolutionary distance).
• Raise this to the 250th power to get PAM250, which corresponds to 250 accepted changes per 100 residues (a greater evolutionary distance), or about 20% sequence identity, because many sites change more than once.
• Therefore, PAM30 would be used to analyze more closely related proteins, whereas PAM400 would be used for finding and analyzing distantly related proteins.
• PAMx = (PAM1)^x
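The relation PAMx = (PAM1)^x is ordinary matrix exponentiation of a mutation-probability matrix. The sketch below uses a made-up two-letter alphabet with a 1% chance of change per step, so `TOY_PAM1` is an assumption for illustration and not the real 20 x 20 Dayhoff matrix.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result

# Hypothetical two-letter "PAM1": each letter has a 1% chance of changing per step.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

Even after 250 steps the probability of seeing the original letter stays well above zero, because repeated and reverse changes at the same site do not add further differences. The same effect is why a distance of 250 changes per 100 residues still leaves measurable sequence identity.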
Blocks substitution matrices (BLOSUM)
Are derived from studying local alignments (blocks) of sequences from related proteins that share no more than X% identity. In other words, one might use the portions of aligned sequences from related proteins that have no more than 62% identity (in the portions or blocks) to derive the BLOSUM62 scoring matrix. One might use only the blocks that have no more than 80% identity to derive the BLOSUM80 matrix.
BLOSUM and PAM substitution matrices are numbered in opposite directions: The higher the number of the BLOSUM matrix (BLOSUM X), the more closely related the proteins you are looking for. The higher the number of the PAM matrix (PAM X), the more distantly related the proteins you are looking for.
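A substitution matrix is, in practice, a lookup table of scores for residue pairs. The sketch below holds a handful of entries quoted from the commonly distributed BLOSUM62 table; it is deliberately incomplete, the helper name `pair_score` is illustrative, and the values are worth checking against the full published matrix before any real use.

```python
# A few BLOSUM62 entries, keyed by the alphabetically sorted residue pair.
# This is a small subset for illustration, not the full 20 x 20 matrix.
BLOSUM62_SUBSET = {
    ("A", "A"): 4,
    ("S", "S"): 4,
    ("T", "T"): 5,
    ("S", "T"): 1,
    ("I", "L"): 2,
    ("C", "C"): 9,
    ("W", "W"): 11,
}

def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up the score for a residue pair; the order of a and b does not matter."""
    key = tuple(sorted((a, b)))
    return matrix[key]  # raises KeyError for pairs outside the subset

for pair in [("A", "A"), ("S", "T"), ("C", "C"), ("W", "W")]:
    print(pair, pair_score(*pair))
```

In this table a match between two rare residues such as Trp or Cys scores far higher than an Ala-Ala match, and a conservative Ser-Thr exchange still earns a small positive score.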
Gap penalties - Intuitively, one recognizes that there should be a penalty for introducing a gap when identifying or aligning a given sequence. But if two sequences are related, the gaps may well be located in loop regions, which are more tolerant of mutational events and where insertions and deletions probably have little impact on structure. Therefore, opening a new gap should be penalized, but extending an existing gap should be penalized very little.
Filtering - Many protein and nucleotide sequences contain simple repeats or regions of low sequence complexity. These must be excluded from searches and alignments.
Significance of a “hit” during a search - More important than an arbitrary score is an estimate of how many hits with a score at least as good would be expected through pure chance (the lower the value, the more certain the match). Hence the “Expectation value”, or E-value. E-values can be as low as 10^-70.
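For BLAST-style searches, the E-value follows from the normalized bit score S' and the size of the search space as E = m x n x 2^(-S'), where m is the query length and n is the database length. The sketch below applies that relation directly. It ignores the effective-length corrections that real search programs apply, and the lengths and scores in the example are hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score.

    Uses E = m * n * 2**(-S'), with m the query length, n the total
    database length, and S' the bit score. No edge-effect correction.
    """
    return query_length * database_length * 2.0 ** (-bit_score)

# Hypothetical search: a 300-residue query against 1e9 database residues.
for bits in (30, 50, 100):
    print(bits, expect_value(bits, 300, 1e9))
```

Each additional bit of score halves the E-value, and searching a larger database raises it, which is why the same alignment can be significant in a small database and not in a large one.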
Useful Bioinformatics Sites
National Center for Biotechnology Information (NCBI) - National Institutes of Health sponsored sites with a rich array of resources and databases. [http://www.ncbi.nlm.nih.gov/pubmed]
ExPASy (Swiss Institute of Bioinformatics) - Large number of different tools for sequence and function analysis. [http://www.expasy.org/tools/]
RCSB Protein Data Bank - Largest curated database of protein structures. [http://www.rcsb.org/pdb/home/home.do]
BioGRID - Large database of curated protein interaction datasets. [http://thebiogrid.org/]
Osprey - Software and interactome analysis tools for visualizing interaction datasets. [http://en.bio-soft.net/protein/Osprey.html]
Tree of Life website - Database of information on phylogenetic relationships among organisms, with useful link-outs. [http://tolweb.org/tree/]