Lecture 11:

Page 1 Lecture 11: Database Search Hui Lu BioE 480 Introduction to Bioinformatics Bioinformatics Program University of Illinois at Chicago (note: This lecture contains about 1.5 lectures’ materials)

Page 2 Why search databases? • To find out if a new sequence found in the Lab, shares similarities with sequences already deposited in the databanks. • To find proteins homologous to a putative coding ORF. • To find similar non-coding DNA stretches in the database (for example: repeat elements, regulatory sequences). • To locate false priming sites for a set of PCR oligonucleotides. Main algorithms for database searching • FastA • Is theoretically better for nucleotides than blast (statistics are more rigorous) • BLAST - Basic Local Alignment Search Tool • Better for proteins than for nucleotides

Page 3 BLAST: Basic Local Alignment Search Tool • Database: a large set of catalogued sequences. • Segment: a substring of a sequence. • Segment pair between two sequences: a pair of segments of the same length, one from each sequence. It is essentially a gapless local alignment. • Can have a gapless alignment: with a scoring matrix, and no gap penalty needed. • HSPs or high scoring segment pairs: segment pairs whose scores can not be improved by extension or trimming. They lie on the same diagonal and within a certain distance of each other. • Maximum segment pair (MSP): a segment pair of maximum score. • Seed = an occurrence of a MSP between two sequences • BLAST: for a query sequence, it returns all segment pairs between the query and a database sequence with scores above a threshold S. • S can be set by the user, but a default value is provided.

Page 4 BLAST • List of words of length w (eg. 3) in the query protein sequence is made. • length 11-12 for DNA sequences. • Words are evaluated for matches with any other combination of 3 amino amino acids using Blosum 62 scoring matrix as default. • matches of PQG to PEG would score 15, to PRG 14, to PSG 13 and to PQA 12 • For DNA words, a match score of +5 and a mismatch score of -4 is used based on the changes expected in sequences separated by a PAM distance of 40. • Cut-off score T (neighborhood word score threshold) is selected to reduce the number of matches. • Each word has about 50 high scoring amino acid words above T. • Repeat the above procedure for each 3-letter word in the query sequence. • For a sequence of length 250 amino acids, the total number of words to search for is approximately 50 x 250 = 12,500. • Each database sequence is scanned for an exact match to one of the 50 high scoring amino acid words corresponding to every position in the query sequence. • In Blast2 or gapped Blast, short matched regions (HSPs) are extended in each direction as long as the score keeps rising. • HSPs of score greater than a cutoff score S are kept.

Page 5 BLAST • In practice, very common words generated from the query sequences are ignored. • Low complexity regions: repeats. • BLAST can be viewed as a three-step algorithmic procedure: 1. Compile list of high-scoring words. 2. Search for hits -- each hit gives a seed. 3. Extend seeds.

Page 6 Using BLAST • http://www.ncbi.nlm.nih.gov/BLAST/ • The BLAST search pages allow you to select from several different programs. • Blastp: Compares an amino acid query sequence against a protein sequence database. • Blastn: Compares a nucleotide query sequence against a nucleotide sequence database. • Blastx: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence. • Tblastn: Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames. • Tblastx: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. The tblastx program is computationally intensive.

Program Input Database 1 blastnDNADNA 1 blastpproteinprotein 6 blastxDNAprotein 6 tblastnproteinDNA 36 tblastxDNADNA Page 7 Choose the BLAST program 5’ CAT CAA 5’ ATC AAC 5’ TCA ACT 5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’ 3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’ 5’ GTG GGT 5’ TGG GTA 5’ GGG TAG

Page 8 Selecting the BLAST Database • Can select several NCBI databases to compare your query sequences against. • some databases are specific to proteins or nucleotides and cannot be used in combination with certain BLAST programs (for example a blastn search against swissprot). • Proteins Database

Page 9 Nucleotides Database

Page 10 FASTA format • A sequence in FASTA format begins with • a single-line description, followed by • lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that • all lines of text be shorter than 80 characters in length. >gi|532319|pir|TVFV2E|TVFV2E envelope protein ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL LAAVEAQQQMLKLTIWGVK • Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: • lower-case letters are accepted and are mapped into upper-case; • a single hyphen or dash can be used to represent a gap of indeterminate length; • in amino acid sequences, U and * are acceptable letters (see below). • Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).

Page 11 Nucleic acid codes & amino acid codes • The nucleic acid codes supported are: A --> adenosine M --> A C (amino) C --> cytidine S --> G C (strong) G --> guanine W --> A T (weak) T --> thymidine B --> G T C U --> uridine D --> G A T R --> G A (purine) H --> A C T Y --> T C (pyrimidine) V --> G C A K --> G T (keto) N --> A G C T (any) - gap of indeterminate length For those programs that use amino acid query sequences (BLASTP and TBLASTN), • The accepted amino acid codes are: A alanine P proline B aspartate or asparagine Q glutamine C cystine R arginine D aspartate S serine E glutamate T threonine F phenylalanine U selenocysteine G glycine V valine H histidine W tryptophan I isoleucine Y tyrosine K lysine Z glutamate or glutamine L leucine X any M methionine * translation stop N asparagine - gap of indeterminate length

Page 12 Submit job • Can also submit by GI or accession number. • Submitting your Search • 1. Select the correct BLAST program and BLAST database. • 2. Entered your FASTA sequence or an Accession or GI number, click the "Submit Query Button". • 3. BLAST will now open a new window and tell you it is working on your search. • 4. Once your results are computed they will be presented in the window. An example: query AA001614 (insulin receptor) using blastn, obtain 96 hits.

Page 13 http://www.ncbi.nlm.nih.gov/BLAST

Page 14 Online BLAST Server

Page 15 Online BLAST Server

Page 16 Blast results Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples 2,905,734 sequences; 1,000,656,685 total letters If you have any problems or questions with the results of this search please refer to the BLAST FAQs Taxonomy reports Query= gi|532319|pir|TVFV2E|TVFV2E envelope protein (379 letters)

Page 17 Blast results Score E Sequences producing significant alignments: (Bits) Value gi|1314215|gb|AAA99542.1| envelope protein 682 0.0 gi|882106|gb|AAA99544.1| envelope protein 617 2e-175 gi|119491|sp|P27757|ENV_SIVA1 Envelope polyprotein GP160 prec... 612 5e-174 gi|334407|gb|AAA91919.1| env polyprotein [Simian immunodefici... 603 3e-171 gi|1085778|pir||S46345 env polyprotein - simian immunodeficie... 594 1e-168 . . . . . . . . . >gi|1314215|gb|AAA99542.1| envelope protein Length=379 Score = 682 bits (1760), Expect = 0.0 Identities = 344/379 (90%), Positives = 346/379 (91%), Gaps = 0/379 (0%) Query 1 ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCXXXXXXXXXXXXXXXGSYSENRT 60 ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT Sbjct 1 ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT 60 Query 61 QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC 120 QIWQKHRTSNDS LILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC Sbjct 61 QIWQKHRTSNDSVLILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC 120 Query 121 HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK 180 HF NWKGAWKEVKEEIV LPKERYRGTND KRIF QRQWGDPETANLWFNCHGEFFYCK Sbjct 121 HFQGNWKGAWKEVKEEIVKLPKERYRGTNDTKRIFLQRQWGDPETANLWFNCHGEFFYCK 180 Query 181 MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK 240 MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVI TISKK Sbjct 181 MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVINDWYTISKK 240 Query 241 TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF 300 TY+PPREGHLECTSTVTGMTVELNYI +NRTNVTLSPQIESIWAAELDRYKLVEITPIGF Sbjct 241 TYSPPREGHLECTSTVTGMTVELNYINQNRTNVTLSPQIESIWAAELDRYKLVEITPIGF 300 Query 301 APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL 360 APTEVRRYTGGHERQKRVPFV VQSQHLLAGILQQQKNL Sbjct 301 APTEVRRYTGGHERQKRVPFVLGFLGFLGAAGTAMGAAATALTVQSQHLLAGILQQQKNL 360 Query 361 LAAVEAQQQMLKLTIWGVK 379 LAAVEAQQQMLKLTIWGVK Sbjct 361 LAAVEAQQQMLKLTIWGVK 379

Page 18 Blast results (another example) Sequences producing significant alignments: (bits) Value emb|X02160|HSIRPR Human mRNA for insulin receptor precursor 438 e-121 ref|NM_000208.1|| Homo sapiens insulin receptor (INSR) mRNA 131 2e-28 gb|M32972|HUMINSR22 Human insulin receptor (INSR) gene, exo... 131 2e-28 gb|M10051|HUMINSR Human insulin receptor mRNA, complete cds. 131 2e-28 gb|M29014|RATINSAB Rat insulin receptor mRNA, complete cds. 88 2e-15 gb|L07782|HUMINSRE Human insulin receptor gene, last exon 64 4e-08 ……………… gb|AC006017.2|AC006017 Homo sapiens PAC clone RP5-981O7 from 7q34-q36, complete sequence Length = 162556 Score = 46.1 bits (23), Expect = 0.008 Identities = 31/34 (91%) Strand = Plus / Plus Query: 52 aacaaaagttcgaaagaaaaaacaanacaaaaac 85 |||||||||| ||||||||||| || |||||||| Sbjct: 3983 aacaaaagttggaaagaaaaaaaaaaacaaaaac 4016 gb|AC008018.18|AC008018 Homo sapiens, complete sequence Length = 180894 Score = 44.1 bits (22), Expect = 0.033 Identities = 22/22 (100%) Strand = Plus / Minus Query: 276 gtgtgtatgcacgtgtgtgtgt 297 |||||||||||||||||||||| Sbjct: 122199 gtgtgtatgcacgtgtgtgtgt 122178

Page 19 FASTA • FastA accelerates database searching by using several passes over the database. • only retaining a `best matching' subset for further analysis at each pass, `pruning down' the database progressively. • The first pass: short sequence `words' are compared for rapid detection of small regions of similarity. • FastA uses a small word size (called k-tuple or KTup): by default, 6 for nucleic acids, and 2 for proteins (versus 11 and 3 for BLAST). • FastA requires the words to match perfectly, and may overlook at this stage some weak but significant similarity between protein sequences (for example, a Lys-Arg match will be ignored). • The initial short regions of similarity are then extended into alignments without gaps, and a scoring matrix is used to calculate a score for these alignments. • The score for the best of these alignments in a given database sequence is called the init1 score for this sequence. • When a database sequence contains more than one of these aligned regions, FastA joins them together into a gapped alignment if possible. • Calculates another alignment score, called initn, for the whole joined region.

Page 20 FASTA • Finally, regions with a high initn are aligned with the query sequence using a slower, more sensitive method. • The threshold parameter controls how many sequences in the database are retained for this final phase. • To speed up the search, only residues in the neighborhood of those used for the init1 alignments are considered in the sensitive alignment. The size of this neighborhood is controlled by the Width parameter. • The score calculated for these alignments is called the opt score. By default this score is used to sort the sequences in the output. • Statistical evaluation of FastA: (less rigorous than BLAST) • The statistical model used assumes alignments without gaps. • A normalized z-score is derived from one of the other scores (by default, the opt score). • Z score is then converted into a statistical expectation value, which approximates the probability that a given match is a chance occurrence. • Database sequences with an expectation value higher than the Cutoff expectation parameter are listed in the program output (as long as their number does not exceed the value of the Show how many scores parameter).

Page 21 Terminology • Selectivity: the ability of a program to discriminate between true matches and matches occurring by chance alone. • A decrease in selectivity results in more false positives being reported. • Sensitivity: the ability of a program to identify weak but biologically significant sequence similarity. True positive True positive all positive all true • init1: The alignment score derived by FastA from the best initial alignment found between the query sequence and a database sequence in the first phase of a search. • initn: The alignment score calculated in the second phase of a FastA search, after the shorter alignments found in the first phase of the search are joined together. • opt: The alignment score derived from the `optimised' sensitive alignment derived in the third phase of a FastA search.

Page 22 FastA Output • A histogram showing the distribution of the z- scores and of the statistical expectations in the database. • The numbers of sequences with a particular z-score are shown using = symbols, and the expected distribution of these scores (if the database and the query sequence were random) is indicated using * symbols. • The number of sequences which had a much better score than expected can be seen at a glance. • The high similarity sequences are clustered on the tail of the histogram, and is magnified. • A summary of the statistics and of the program parameters follows the histogram. • An important number in this summary is the Kolmogorov-Smirnov statistic, which indicates how well the actual data fit the theoretical statistical distribution. The lower this value, the better the fit, and the more reliable the statistical estimates. • In general, a Kolmogorov-Smirnov statistic under 0.1 indicates a good fit with the theoretical model. If the statistic is higher than 0.2, the statistics may not be valid, and it is recommended to repeat the search, using more stringent (more negative) values for the gap penalty parameters. • The best scoring database sequences are listed next as one line descriptions, with the various search scores. • A statistical expectation below 0.05 is usually considered statistically significant at the 95% confidence level. However, some biologically significant similarities may have a much higher expectation value.

Page 23 Example FASTA histogram • The y axis is the score. growing from top to bottom and printed on the left column. • The x axis shows the number of matching database records having the score. • The observed distribution of the score is plotted by bars of ``='' signs and printed in second column. • The expected random distribution of the score is plotted by ``*'' signs and printed on the third column. • For example, a score of 34 was attained by 3145 sequences when the query was searched, compared to 7603 expected sequences with a random sequence search. FastA output: The first lines contain general information about the search parameters. Score lines are made of nine rows: 1-3 details the name and the annotation of the hit, 4-9 are the FastA scores.

Page 24 Customizing a FastA search • The default parameter values provide a good compromise between speed and sensitivity. • Improving the sensitivity of a search in order to detect sequences with low similarity is best achieved by lowering the KTup value. • Setting the KTup to 1 with a protein query sequence decreases the chance that weak protein similarities will be missed in the initial stage of the search, but slows down the search significantly. • Changing the threshold value to 1 causes every sequence in the database to be aligned to the query, but should not be done routinely since it slows down the search about 5-fold for an often unnecessary gain in sensitivity. • For DNA database searches, forcing a Smith-Waterman alignment may in some cases improve sensitivity, but at the expense of a 2 to 5-fold increase in search time. • If a rigorous statistical analysis is required, it is important to check that the Kolmogorov-Smirnov statistic in the program output is under 0.1. • If this is not the case, the absolute value of the gap penalties may need to be increased. For example, for a protein sequence search, the gap creation penalty could be changed from -12 to -16, and the gap extension penalty from -2 to -4.

Page 25 Customizing a FastA search • A special case requiring modification of the parameters: cDNA query sequence to search for genomic DNA in the database. • Because the final alignment is expected to comprise short regions of similarity (exons) surrounded by extremely large gaps (introns), it is crucial that long gaps be not penalized. It is therefore essential to set the gap extension penalty to 0. • In addition, FastA only looks at the residues around the init1 region when creating the final alignment. In a cDNA-genomic DNA alignment, the init1 region will correspond to the highest scoring exon, and other exons which may be very distant from this region will not be aligned, unless the Width parameter is set to a very large value (10000 is a good start). • This should only be done in specific cases, since it slows down the search considerably. • FastA offers a range of scoring matrices for protein sequence comparisons. When performing a thorough search for sequences with weak similarity, it is a good idea to try more than one scoring matrix. • In all cases, it is recommended first to use the default values when running the program, and to modify the parameters only if the results are not satisfactory.

Page 26 Similarity scores and sequence length

Page 27 FASTA vs. BLAST • BLAST's major advantage is its speed. • 2-3 minutes for BLAST versus several hours for a sensitive FastA search of the whole of GenBank. • When both programs use their default setting, BLAST is usually more sensitive than FastA for detecting protein sequence similarity. • Since it doesn't require a perfect sequence match in the first stage of the search. BLAST offers the added sensitivity. • Weakness of BLAST: • The long word size it uses in the initial stage of DNA sequence similarity searches was chosen for speed, and not sensitivity. • For a thorough DNA similarity search, FastA is the program of choice, especially when run with a lowered KTup value. • FastA is also better suited to the specialized task of detecting genomic DNA regions using a cDNA query sequence, because it allows the use of a gap extension penalty of 0. BLAST, which only creates ungapped alignments, will usually detect only the longest exon, or fail altogether. • In general, a BLAST search using the default parameters should be the first step in a database similarity search strategy. In many cases, this is all that may be required to yield all the information needed, in a very short time.

Page 28 Accurate Statistics for Smith-Waterman • If the statistical estimates calculated from the database search are accurate, then: • There should be good agreement between the actual distribution of similarity scores and the predicted distribution (figure below) • The highest scoring unrelated sequence should have an expectation value between 0.5 - 2.0. In the table below, the highest scoring unrelated sequence has an expectation value of 4.4.

Page 29 In the table below, >>...<< indicate high-scoring, unrelated sequences; ** indicate related sequences with expectations values < 11 that were detected with SSEARCH but not FASTA, ktup=2. The best scores are: s-w Z-score E(43470) GTB1_MOUSE GLUTATHIONE S-TRANSFERASE GT8.7 (EC 2.5 1490 2006.4 0 GTM1_HUMAN GLUTATHIONE S-TRANSFERASE HB SUBUNIT 4 1235 1661.9 0 GT2_CHICK GLUTATHIONE S-TRANSFERASE 2 (EC 2.5.1.18 954 1282.1 0 GT26_FASHE GLUTATHIONE S-TRANSFERASE 26 KD (EC 2.5 703 942.9 0 GTP_PIG GLUTATHIONE S-TRANSFERASE P (EC 2.5.1.18) 371 494.8 4.1e-21 GTP_MOUSE GLUTATHIONE S-TRANSFERASE P (EC 2.5.1.18 361 481.2 2.3e-20 GT28_SCHMA GLUTATHIONE S-TRANSFERASE 28 KD (EC 2.5 196 258.1 6.2e-08 GT5A_MOUSE GLUTATHIONE S-TRANSFERASE GST 5.7 (EC 2 183 240.1 6.3e-07 GT28_SCHJA GLUTATHIONE S-TRANSFERASE 28 KD (EC 2.5 169 221.9 6.5e-06 GT2_DROME GLUTATHIONE S-TRANSFERASE 2 (EC 2.5.1.18 164 213.4 1.9e-05 SC1_OCTVU S-CRYSTALLIN 1. 159 209.0 3.4e-05 GTAC_CHICK GLUTATHIONE S-TRANSFERASE, CL-3 SUBUNIT 144 187.1 0.0006 SC18_OMMSL S-CRYSTALLIN SL18. 131 163.0 0.01 GT1_MUSDO GLUTATHIONE S-TRANSFERASE 1 (EC 2.5.1.18 122 158.3 0.02 GTTR_RAT GLUTATHIONE S-TRANSFERASE YRS-YRS (EC 2.5 123 158.1 0.02 **GT1_MAIZE GLUTATHIONE S-TRANSFERASE I (EC 2.5.1.18 120 155.3 0.03 ** ARP_TOBAC AUXIN-REGULATED PROTEIN (STR246C PROTEIN 117 151.0 0.06 **GT32_MAIZE GLUTATHIONE S-TRANSFERASE III (EC 2.5.1 115 148.2 0.08 ** GT1_DROME GLUTATHIONE S-TRANSFERASE 1-1 (EC 2.5.1. 100 128.5 1.0 GT_PROMI GLUTATHIONE S-TRANSFERASE GST-6.0 (EC 2.5 97 124.7 1.7 **DCMA_METSP DICHLOROMETHANE DEHALOGENASE (EC 4.5.1. 98 122.7 2.6 ** GTT1_RAT GLUTHATHIONE S-TRANSFERASE 5 (EC 2.5.1.18 93 117.8 4.1 >>MOD5_YEAST TRNA ISOPENTENYLTRANSFERASE (EC 2.5.1.8 100 117.2 4.4 << GT2_WHEAT GLUTATHIONE S-TRANSFERASE 2 (EC 2.5.1.18 92 114.5 6.2 >>MYSP_MOUSE MYOSIN HEAVY CHAIN, PERINATAL SKELETAL 81 113.5 7.0 << LIGE_PSEPA BETA-ETHERASE (BETA-ARYL ETHER CLEAVING 91 113.5 7.0 >>YFHE_ECOLI HYPOTHETICAL 20.1 KD PROTEIN IN HSCA 5' 86 113.5 7.1 << **EF1G_HUMAN ELONGATION FACTOR 1-GAMMA (EF-1-GAMMA) 94 113.3 7.2 ** GT_ECOLI GLUTATHIONE S-TRANSFERASE (EC 2.5.1.18). 88 112.7 7.9 >>ABF2_YEAST ARS-BINDING FACTOR 2 PRECURSOR. 87 112.2 8.3 << >>KKQ1_YEAST PROBABLE SERINE/THREONINE-PROTEIN KINASE 92 110.7 10.1 << **EF1G_RABIT ELONGATION FACTOR 1-GAMMA (EF-1-GAMMA). 92 110.6 10.2 ** ARP3_TOBAC AUXIN-INDUCED PROTEIN PCNT103. 87 110.3 10.6

Page 30 Accurate Statistics for FASTA The best scores are: init1 opt z-sc E(43267) GTB1_MOUSE GLUTATHIONE S-TRANSFERASE GT8.7 (EC 2. 1490 1490 1798.4 0 GTM1_HUMAN GLUTATHIONE S-TRANSFERASE HB SUBUNIT 4 1235 1235 1492.4 0 GT2_CHICK GLUTATHIONE S-TRANSFERASE 2 (EC 2.5.1.1 954 954 1155.0 0 GT26_FASHE GLUTATHIONE S-TRANSFERASE 26 KD (EC 2. 567 703 853.8 0 GTP_MOUSE GLUTATHIONE S-TRANSFERASE P (EC 2.5.1.1 153 361 443.6 2.9e-18 GTP_HUMAN GLUTATHIONE S-TRANSFERASE P (EC 2.5.1.1 155 356 437.6 6.2e-18 GTP_CAEEL GLUTATHIONE S-TRANSFERASE P (EC 2.5.1.1 129 325 400.4 7.4e-16 GTP_PIG GLUTATHIONE S-TRANSFERASE P (EC 2.5.1.18) 128 321 395.7 1.4e-15 GTA2_MOUSE GLUTATHIONE S-TRANSFERASE YA CHAIN (EC 57 229 284.8 2.0e-09 SC11_OMMSL S-CRYSTALLIN SL11 (MAJOR LENS POLYPEPT 87 217 270.9 1.2e-08 GTC_MOUSE GLUTATHIONE S-TRANSFERASE YC (EC 2.5.1. 91 215 268.0 1.7e-08 GTH2_HUMAN GLUTATHIONE S-TRANSFERASE A2-2 (EC 2.5 42 198 247.6 2.4e-07 GT2_DROME GLUTATHIONE S-TRANSFERASE 2 (EC 2.5.1.1 60 161 202.4 7.9e-05 GTAC_CHICK GLUTATHIONE S-TRANSFERASE, CL-3 SUBUNI 56 144 182.6 0.001 SC1_OCTDO S-CRYSTALLIN 1 (OL1). 78 132 168.6 0.006 GT1_MUSDO GLUTATHIONE S-TRANSFERASE 1 (EC 2.5.1.1 52 122 156.8 0.027 SC1_OCTVU S-CRYSTALLIN 1. 58 121 155.4 0.033 ARP_TOBAC AUXIN-REGULATED PROTEIN (STR246C PROTEI 57 117 150.4 0.062 SC3_OCTDO S-CRYSTALLIN 3 (OL3). 47 111 143.4 0.15 GT1_DROME GLUTATHIONE S-TRANSFERASE 1-1 (EC 2.5.1 50 100 130.3 0.81 GTTR_RAT GLUTATHIONE S-TRANSFERASE YRS-YRS (EC 2. 55 97 125.7 1.5 GT_PROMI GLUTATHIONE S-TRANSFERASE GST-6.0 (EC 2. 37 95 124.5 1.7 >>SPCB_HUMAN SPECTRIN BETA CHAIN, ERYTHROCYTE. 44 108 124.0 1.8 << ARP2_TOBAC AUXIN-INDUCED PROTEIN PGNT35/PCNT111. 47 93 121.5 2.5 >>YHC9_YEAST HYPOTHETICAL 77.8 KD PROTEIN 33 96 117.5 4.2 << GT_ECOLI GLUTATHIONE S-TRANSFERASE (EC 2.5.1.18). 32 88 116.2 5.0 >>VMP_SOCMV MOVEMENT PROTEIN 38 90 115.8 5.3 << >>PPA5_RAT TARTRATE-RESISTANT ACID PHOSPHATASE 38 90 115.3 5.6 << ARP3_TOBAC AUXIN-INDUCED PROTEIN PCNT103. 34 87 114.3 6.4 >>YJJV_ECOLI HYPOTHETICAL 23.7 KD PROTEIN 68 86 113.5 7.1 << >>GUX1_PHACH EXOGLUCANASE I PRECURSOR 52 91 113.3 7.2 << ARP4_TOBAC AUXIN-INDUCED PROTEIN PCNT107. 59 85 112.0 8.6 >>PPA5_MOUSE TARTRATE-RESISTANT ACID PHOSPHATASE 29 87 111.7 8.9 << GT31_MAIZE GLUTATHIONE S-TRANSFERASE III (EC 2.5. 40 84 110.8 9.9

Page 31 Accurate statistical estimates for DNA The best scores are: init1 opt z-sc E(77684) MUSGLUTA Mouse glutathione S-transferase class 5625 5625 4637.2 0 HUMGSTM1B Human glutathione transferase M1B (GST 2330 2382 1960.3 0 GGGSTCL2 Chicken mRNA for glutathionine S-trans 1354 1357 1115.9 0 S62935 GSTM1 (GSTM1b)=class-mu glutathione tr 313 320 266.8 1.5e-07 BTGST Bovine GST mRNA for gluthathione S-tra 190 249 201.1 0.0002 HSGSTPI Human mRNA for class Pi glutathione S- 163 237 192.0 0.0008 MUSGTF Mus musculus glutathione transferase m 154 196 156.4 0.06 RRGTS8 R.rattus mRNA for glutathione transfer 163 182 145.1 0.27 >>HUMKAL2 Human glandular kallikrein gene, compl 81 170 122.6 0.7 << >>HUMTROPI01 Human troponin I, slow-twitch form 123 170 131.9 0.9 << RNGSTYC2F R.norvegicus (Fischer 344) GST Yc2 mRN 136 170 133.7 0.9 MUSGSTYC Mouse glutathione S-transferase (GST Y 133 168 133.1 1.2 MMGLUT M.musculus mRNA for glutathione S-tran 133 168 133.3 1.2 >>MUSTHYGP Mouse Thy-1.2 glycoprotein gene, compl 84 163 117.5 1.5 << CHKDPCP Chicken DNA for pancreatic polypeptide 112 162 123.6 2.0

Page 32 Protein vs DNA sequence comparison DNA vs Protein Comparison score E(DNA) E(prot) MUSGLUTA Mouse GST class mu 5625 0 0 MUSGSTA Mouse GST GT9.3 mu 3953 0 0 HUMGSTAA Homo sapiens GST 1257 0 0 MAMGLUTRA M. auratus mu class GST 399 10[-11] 0 RATGSTYD Rat GST Yb subunit 399 10[-11] 0 HSGSTM4 H. sapiens GSTM4 gene for GST 390 10[-10] 0 RATGSTY Rattus norvegicus GST 372 10[-9] 0 HSGSTM1B H.sapiens GSTM1b gene for GST 358 10[-9] 0 HSGSTMU3 Human GSTmu3 gene for a GST 322 10[-7] 0 HSGST145 Human GST-1 gene for GST 308 10[-6] 0 BTGST Bovine GST mRNA for GST 249 0.0002 10[-16] HSGSTPI1 Human mRNA for anionic GST 237 0.0008 10[-17] MUSGTF Mus musculus GST mu 196 0.06 CRUGSTP Chinese hamster GST 196 0.06 10[-16] CRUGSTPIE Cricetulus griseus GST pi 196 0.06 10[-16] HAMGSTPIE Mesocricetus auratus GST pi 191 0.1 10[-16] RRGTS8 R.rattus mRNA for GST 182 0.2 >>HUMKAL2 Human glandular kallikrein gene 170 0.6 >>100 << >>HUMTROPI01Human troponin I, slow-twitch form 170 0.8 >>100 << RNGSTYC2F R.norvegicus GST Yc2 170 0.8 10[-7] MMGLUT M.musculus mRNA for GST 168 1.1 10[-7] >>MUSTHYGP Mouse Thy-1.2 glycoprotein 163 1.3 >>100 << HUMLGTH1 Human liver GST 157 3.4 10[-5] >>ATCON430S1Rattus norvegicus connexin 155 3.6 << >>HUMA1AR2 Human a-1-antitrypsin-related prot. 154 3.6 << >>HUMVLDLR Human VLDL protein receptor 152 4.5 << RABGSTB Oryctolagus cuniculus GST 153 5.1 10[-9] >>HUMHSF1 Human heat shock factor 1 (TCF5) 151 5.5 << >>RATRIIA Rat type I reg. subunit of cAMP 151 5.9 << RNGSTYC1F R.norvegicus GST Yc1 148 8.5 10[-6] RATGSTYC Rat liver GST Yc 148 8.6 10[-6] ... HUMGSTH Human GST 144 14 10[-6] HUMGST2 Human GST 2 144 14 10[-6] S49975 Human GST A1-1 144 14 10[-6

Page 33 Examining marginal sequence relationships • Frequently, diverse protein families contain members that will produce pair-wise similarity scores with expectation values > 0.05. How can we tell if sequences with high, but not statistically significant similarity scores, are related to the query sequence? • If the family is large, a relatively simple test is available. Simply perform additional searches using the sequences with marginal similarity scores. Thus, in the search below with the bacterial DCMA_METSP sequence shows clearly that this protein belongs to the glutathione transferase superfamily. All of the sequences that with statistically significant similarity scores belong to the family. The best scores are: s-w Z-score E(43470) DCMA_METSP DICHLOROMETHANE DEHALOGENASE (EC 4.5.1. 1982 2721.7 0 GTTR_RAT GLUTATHIONE S-TRANSFERASE YRS-YRS (EC 2.5 344 463.2 2.3e-19 GTT1_RAT GLUTHATHIONE S-TRANSFERASE 5 (EC 2.5.1.18 330 444.1 2.7e-18 GT1_MUSDO GLUTATHIONE S-TRANSFERASE 1 (EC 2.5.1.18 271 364.0 7.9e-14 GT1_DROME GLUTATHIONE S-TRANSFERASE 1-1 (EC 2.5.1. 231 308.7 9.4e-11 GT32_MAIZE GLUTATHIONE S-TRANSFERASE III (EC 2.5.1 204 271.0 1.2e-08

Page 34 Examining marginal sequence relationships • In contrast, a search with the yeast isopentyl transferase, the highest scoring unrelated sequence, shows high (but not statistically significant) with a variety of unrelated sequences, as would be expected by chance. The best scores are: s-w Z-score E(43470) MOD5_YEAST TRNA ISOPENTENYLTRANSFERASE (EC 2.5.1.8 2876 3535.1 0 MIAA_AGRTU TRNA DELTA(2)-ISOPENTENYLPYROPHOSPHATE 436 527.2 6.4e-23 MIAA_ECOLI TRNA DELTA(2)-ISOPENTENYLPYROPHOSPHATE 397 478.4 3.3e-20 MIAA_SALTY TRNA DELTA(2)-ISOPENTENYLPYROPHOSPHATE 139 172.2 0.004 GTB3_MOUSE GLUTATHIONE S-TRANSFERASE GT9.3 (EC 2.5 118 137.9 0.31 G6PD_ECOLI GLUCOSE-6-PHOSPHATE 1-DEHYDROGENASE (EC 119 130.9 0.76 GTB1_CRILO GLUTATHIONE S-TRANSFERASE Y1 (EC 2.5.1. 110 128.0 1.1 CAMT_PETCR CAFFEOYL-COA O-METHYLTRANSFERASE (EC 2. 107 123.2 2.0 GTMU_MESAU GLUTATHIONE S-TRANSFERASE (EC 2.5.1.18) 104 120.6 2.8 ACCO_PERAE 1-AMINOCYCLOPROPANE-1-CARBOXYLATE OXIDA 107 120.4 2.9 YIK4_YEAST HYPOTHETICAL 59.2 KD PROTEIN IN PFK26-S 110 119.4 3.3 G6PD_ERWCH GLUCOSE-6-PHOSPHATE 1-DEHYDROGENASE (EC 109 118.5 3.7 RASX_XENLA TRANSFORMING PROTEIN P21/K-RAS. 101 118.4 3.8 GNTV_ECOLI GLUCONOKINASE (EC 2.7.1.12) (GLUCONATE 100 117.2 4.4 TFC1_YEAST TRANSCRIPTION FACTOR TAU 95 KD SUBUNIT 110 116.9 4.6 GTM4_HUMAN GLUTATHIONE S-TRANSFERASE MUSCLE (EC 2. 101 116.9 4.6 • While there are several glutathione transferases on this list, none has a statistically significant similarity score. The presence of several family members is not very informative because they are all very closely related.

Page 35 Relationships between G-protein-coupled receptor families Three different families of G-protein coupled receptors have been characterized; the largest is the rhodopsin family, which includes cationic amine receptors (dopamine, acetylcholine, etc), prostanoids, odorants, and many other hormones. A schematic evolutionary tree for the family is shown above. In addition to the rhodopsin family, two other families of G-protein-coupled receptors are known - the secretin/calcitonin family and the glutamate receptor family.

Page 36 Relationships between G-protein-coupled receptor families • It is possible to find links between the rhodopsin and calcitonin/secretin families of G-protein by looking at sequences that are distantly related to opsin, and then using those sequences to re-search the sequence databases. The table below shows a greatly abbreviated summary of a search using human opsin (OPSD_HUMAN) against SwissPROT. In this example, no members of the secretin/calcitonin family are found with expectation values < 10 (results are shown only to 1.0). However,many members of the rhodopsin family have expectation values > 0.1. The best scores are: s-w Z-score E(43470) OPSD_HUMAN RHODOPSIN. 2347 2978.5 0 OPSG_CHICK GREEN-SENSITIVE OPSIN (GREEN CONE PHOTO 1791 2266.9 0 OPSG_HUMAN GREEN-SENSITIVE OPSIN (GREEN CONE PHOTO 1002 1261.9 0 OPS1_DROME OPSIN RH1 (OUTER R1-R6 PHOTORECEPTOR CE 527 658.4 3.1e-30 SSR5_HUMAN SOMATOSTATIN RECEPTOR TYPE 5. 431 538.2 1.5e-23 TXKR_HUMAN PUTATIVE TACHYKININ RECEPTOR. 419 513.8 3.5e-22 CKR1_HUMAN C-C CHEMOKINE RECEPTOR TYPE 1 (C-C CKR- 280 363.4 8.5e-14 • Additional searches using a very distantly related member of this family - CAR1_DICDI - show the link between OPSD_HUMAN and the calcitonin/secretin family. • It is easy to show that he BIOX_BACSH protein is not a G-protein-coupled receptor. Additional searches with this sequence do not turn up significant matches with any members of that family. ACM1_MOUSE MUSCARINIC ACETYLCHOLINE RECEPTOR M1. 238 305.1 1.5e-10 ... OLFI_HUMAN OLFACTORY RECEPTOR-LIKE PROTEIN HGMP07I 146 179.9 0.001 PAFR_MACMU PLATELET ACTIVATING FACTOR RECEPTOR (PA 130 167.0 0.007 OLF2_RAT OLFACTORY RECEPTOR-LIKE PROTEIN F12. 135 165.6 0.009 MAS_RAT MAS PROTO-ONCOGENE. 131 164.9 0.01 CAR1_DICDI CYCLIC AMP RECEPTOR 1. 130 162.0 0.01 OLF2_CHICK OLFACTORY RECEPTOR-LIKE PROTEIN COR2. 129 158.1 0.02 R3_DICDI CYCLIC AMP RECEPTOR 3. 124 152.2 0.05 MAS_HUMAN MAS PROTO-ONCOGENE. 120 150.2 0.06 LIVM_ECOLI HIGH-AFFINITY BRANCHED-CHAIN AMINO ACID 109 133.3 0.55 >>APPC_ECOLI PROBABLE CYTOCHROME OXIDASE SUBUNIT I 110 133.1 0.57 << >>BIOX_BACSH BIOX PROTEIN. 102 131.5 0.69 << RTA_RAT PROBABLE G PROTEIN-COUPLED RECEPTOR RTA. 109 131.0 0.74

Page 37 Linking the rhodopisin and calcitonin/secretin families • A search of SwissPROT using CAR1_DICDI is shown below. This sequence seems to bridge the two G-protein coupled receptor families by showing significant similarity with both the calcitonin/secretin family C/S and the rhodopsin family R. The best scores are: s-w Z-score E(43470) CAR1_DICDI CYCLIC AMP RECEPTOR 1. 2678 3208.3 0 CAR3_DICDI CYCLIC AMP RECEPTOR 3. 1524 1819.9 0 CAR2_DICDI CYCLIC AMP RECEPTOR 2. 1497 1789.8 0 CALR_HUMAN CALCITONIN RECEPTOR PRECURSOR (CT-R). 167 189.4 0.00042 C/S IL8B_HUMAN HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B 161 185.1 0.00073 R CLRA_RAT CALCITONIN RECEPTOR A PRECURSOR (CT-R-A) 162 183.6 0.00087 C/S CLRB_RAT CALCITONIN RECEPTOR B PRECURSOR (CT-R-B) 162 183.0 0.00095 C/S DIHR_MANSE DIURETIC HORMONE RECEPTOR (DH-R). 150 170.9 0.0045 C/S YOW3_CAEEL PROBABLE G PROTEIN-COUPLED RECEPTOR ZK6 151 169.7 0.0052 CALR_PIG CALCITONIN RECEPTOR PRECURSOR (CT-R). 145 163.2 0.012 C/S GLR_RAT GLUCAGON RECEPTOR PRECURSOR (GL-R). 145 163.1 0.012 C/S IL8B_RABIT HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B 141 161.0 0.016 R RDC1_HUMAN G PROTEIN-COUPLED RECEPTOR RDC1 HOMOLOG 139 158.5 0.022 R G10D_RAT PROBABLE G PROTEIN-COUPLED RECEPTOR G10D 133 150.5 0.061 R >>CYB_CROLA CYTOCHROME B (EC 1.10.2.2). 132 149.6 0.069 << NY1R_XENLA NEUROPEPTIDE Y RECEPTOR TYPE 1 (NPY1-R) 131 148.7 0.077 R OPSD_HUMAN RHODOPSIN. 130 148.0 0.085 R VIPR_HUMAN VASOACTIVE INTESTINAL POLYPEPTIDE RECEP 131 146.8 0.098 C/S OPSD_SPHSP OPSIN. 129 146.1 0.11 R SCRC_RAT SECRETIN RECEPTOR PRECURSOR. 129 144.6 0.13 C/S IL8A_HUMAN HIGH AFFINITY INTERLEUKIN-8 RECEPTOR A 127 144.3 0.14 R GLPR_RAT GLUCAGON-LIKE PEPTIDE 1 RECEPTOR PRECURSO 128 143.1 0.16 C/S AG2S_XENLA TYPE-1-LIKE ANGIOTENSIN II RECEPTOR 2 ( 126 142.8 0.16 R Note that in this example, there are several cytochrome B sequences (which are not members of the G-protein-coupled reseptor family) that obtain very high similarity scores. Thus, one might be suspicious of the conclusion, since unrelated sequences have expectation values << 1.0. The next example shows the results of searching SwissPROT using the SSEARCH program with very slightly higher gap penalties, -14/-2 instead of -12/-2. >>CYB_MONDO CYTOCHROME B (EC 1.10.2.2). 126 142.4 0.17 << >>CYB_ANGRO CYTOCHROME B (EC 1.10.2.2) (FRAGMENT). 122 141.8 0.19 << AG2R_BOVIN TYPE-1 ANGIOTENSIN II RECEPTOR (AT1). 125 141.7 0.19 R >>CYB_CYPCA CYTOCHROME B (EC 1.10.2.2). 125 141.2 0.20 << OPSD_CANFA RHODOPSIN. 124 140.8 0.21 R >>CYB_XENLA CYTOCHROME B (EC 1.10.2.2). 122 137.6 0.32 << OPRM_RAT MU-TYPE OPIOID RECEPTOR (MOR-1) (OPIOID R 122 137.2 0.34 R >>CYB_PARLI CYTOCHROME B (EC 1.10.2.2). 115 129.2 0.94 <<

Page 38 Better statistics with higher gap penalties • A second search of SwissPROT using CAR1_DICDI is shown below. The SSEARCH (Smith-Waterman) program was used with the BLOSUM50 matrix and gap penalties of -14 for the first residue in a gap and -2 for each additional residue. Now, the highest scoring non-G-protein-coupled receptor sequence has an expectation value of E()=0.4. As before, rhodopsin R and calcitonin/secretin C/S members are noted. The best scores are: s-w Z-score E(43470) CAR1_DICDI CYCLIC AMP RECEPTOR 1. 2678 3668.2 0 CAR3_DICDI CYCLIC AMP RECEPTOR 3. 1520 2074.2 0 CAR2_DICDI CYCLIC AMP RECEPTOR 2. 1493 2039.4 0 CALR_HUMAN CALCITONIN RECEPTOR PRECURSOR (CT-R). 155 197.6 0.00015 C/S CLRA_RAT CALCITONIN RECEPTOR A PRECURSOR (CT-R-A) 150 190.9 0.00035 C/S CLRB_RAT CALCITONIN RECEPTOR B PRECURSOR (CT-R-B) 150 190.2 0.00038 C/S YOW3_CAEEL PROBABLE G PROTEIN-COUPLED RECEPTOR ZK6 141 177.8 0.0019 DIHR_MANSE DIURETIC HORMONE RECEPTOR (DH-R). 132 167.8 0.0067 IL8B_HUMAN HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B 128 163.2 0.012 R GLR_RAT GLUCAGON RECEPTOR PRECURSOR (GL-R). 128 160.5 0.017 C/S VIPR_HUMAN VASOACTIVE INTESTINAL POLYPEPTIDE RECEP 123 154.2 0.038 C/S CALR_PIG CALCITONIN RECEPTOR PRECURSOR (CT-R). 123 153.7 0.040 C/S G10D_RAT PROBABLE G PROTEIN-COUPLED RECEPTOR G10D 118 148.6 0.079 R IL8B_RABIT HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B 115 145.3 0.12 R RDC1_HUMAN G PROTEIN-COUPLED RECEPTOR RDC1 HOMOLOG 114 143.8 0.14 R SCRC_RAT SECRETIN RECEPTOR PRECURSOR. 115 143.3 0.15 C/S GLPR_RAT GLUCAGON-LIKE PEPTIDE 1 RECEPTOR PRECURSO 115 143.1 0.16 C/S IL8A_HUMAN HIGH AFFINITY INTERLEUKIN-8 RECEPTOR A 111 140.0 0.24 R GRFR_RAT GROWTH HORMONE-RELEASING HORMONE RECEPTOR 112 138.9 0.27 C/S OPSD_SPHSP OPSIN. 109 136.6 0.36 R VIPR_RAT VASOACTIVE INTESTINAL POLYPEPTIDE RECEPTO 110 136.3 0.38 C/S >>VME1_CVPFS E1 GLYCOPROTEIN PRECURSOR 106 135.6 0.41 << GRFR_PIG GROWTH HORMONE-RELEASING HORMONE RECEPTOR 109 135.0 0.45 C/S IL8B_RAT HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B (I 102 127.5 1.18 R AG2S_XENLA TYPE-1-LIKE ANGIOTENSIN II RECEPTOR 2 ( 101 125.9 1.44 R >>CYB_CROLA CYTOCHROME B (EC 1.10.2.2). 101 125.5 1.51<< GRFR_HUMAN GROWTH HORMONE-RELEASING HORMONE RECEPT 101 124.6 1.70 C/S >>PACR_RAT PITUITARY ADENYLATE CYCLASE ACTIVATING 102 124.1 1.81<< >>CYB_CYPCA CYTOCHROME B (EC 1.10.2.2). 100 124.1 1.81<<

Page 39 Distantly related calcium binding domains • The EF-hand calcium binding proteins comprise a very large, well-characterized family of soluble proteins that uses a distinctive helix-loop-helix structural motif to bind Ca++. Frequently multiple EF-hand domains are found in Ca-regulated proteins together with a larger structural or functional domain. Recently it was reported that the BM40/SPARC/osteonectin family of extracellular calcium binding proteins contain an EF-hand-like domain that can be readily superimposed on the traditional calmodulin-type EF-hand domain. • In this case, a "bridge" protein is: HSU06863_1 follistatin-related protein precursor [Homo sapiens] • However, when the full-length version of this protein is used to search the database, only the SPARC/osteonectin/follistatin proteins are detected: The best scores are: s-w Z-score E(43470) SPRC_CAEEL SPARC PRECURSOR (SECRETED PROTEIN ACIDI 257 280.5 3.5e-09 FSA_SHEEP FOLLISTATIN PRECURSOR (FS) (ACTIVIN-BIND 249 270.0 1.4e-08 FSA_HUMAN FOLLISTATIN 1 AND 2 PRECURSOR (FS) (ACTI 249 269.9 1.4e-08 FSA_PIG FOLLISTATIN A AND A' PRECURSOR (FS) (ACTIV 249 269.8 1.4e-08 SPRC_XENLA SPARC PRECURSOR (SECRETED PROTEIN ACIDI 206 223.7 5.1e-06 QR1_COTJA QR1 PROTEIN PRECURSOR. 207 218.9 9.5e-06 SC1_RAT MATRIX GLYCOPROTEIN SC1 PRECURSOR. 197 208.4 3.6e-05 FSA_XENLA FOLLISTATIN (FS) (ACTIVIN-BINDING PROTEI 180 197.1 0.00016 Note that here, again, it appears that possibly unrelated sequences are obtaining "significant" similarity scores. The result is quite different if the search is performed with the calcium-binding domain only. SPRC_MOUSE SPARC PRECURSOR (SECRETED PROTEIN ACIDI 180 195.2 0.00020 SPRC_HUMAN SPARC PRECURSOR (SECRETED PROTEIN ACIDI 175 189.7 0.0004 SPRC_BOVIN SPARC PRECURSOR (SECRETED PROTEIN ACIDI 168 182.0 0.0011 IAC2_BOVIN ACROSIN INHIBITORS IIA AND IIB (BUSI-II 144 167.9 0.0066 IACA_PIG SEMINAL PLASMA ACROSIN INHIBITOR A1. 130 151.6 0.053 IOVO_CHICK OVOMUCOID PRECURSOR. 135 148.5 0.079 IACS_PIG SPERM-ASSOCIATED ACROSIN INHIBITOR. 126 146.6 0.10 IOV7_CHICK OVOINHIBITOR PRECURSOR. 132 139.4 0.25 NCS1_RAT NEURONAL CALCIUM SENSOR 1 (NCS-1). 124 137.3 0.34 IPS2_RAT PANCREATIC SECRETORY TRYPSIN INHIBITOR II 111 129.4 0.92 IBP_TURRS CHELONIANIN (BASIC PROTEASE INHIBITOR) ( 112 128.0 1.1 PE60_PIG PEPTIDE PEC-60 PRECURSOR. 109 126.5 1.3 IPST_ANGAN PROBABLE PANCREATIC PROTEINASE INHIBITO 105 124.7 1.7 … …

Page 40 Searching with a short domain • Results of searching SwissProt using SSEARCH (Smith-Waterman) with gap penalties of -14, -1 using a 76 residue putative calcium-binding domain from "HSU06863_1 follistatin-related protein precursor". Now the calmodulin type EF-hand domains are very well represented, although in this case only one has statistically significant similarity. NCS1_RAT NEURONAL CALCIUM SENSOR 1 (NCS-1). 107 163.3 0.012 TPCS_RANES TROPONIN C, SKELETAL MUSCLE. 96 147.6 0.089 VIS3_RAT VISININ-LIKE PROTEIN 3 (VILIP-3) (NEURAL 92 140.0 0.24 CABI_BOVIN VITAMIN D-DEPENDENT CALCIUM-BINDING PRO 86 138.5 0.29 YLJ5_CAEEL HYPOTHETICAL CALCIUM-BINDING PROTEIN C5 89 136.1 0.39 LPSA_LYTPI CALCIUM-BINDING PROTEIN LPS1-ALPHA. 92 135.7 0.41 BDR1_HUMAN CALCIUM-BINDING PROTEIN BDR-1. 89 135.4 0.43 SP2A_STRPU SPEC 2A PROTEIN. 86 132.9 0.59 SP2C_STRPU SPEC 2C PROTEIN. 86 132.8 0.59 TPCS_RABIT TROPONIN C, SKELETAL MUSCLE. 86 132.4 0.63 CALG_PIG CALGIZZARIN (S100C PROTEIN). 83 131.8 0.67 CALG_RABIT CALGIZZARIN (S100C PROTEIN). 83 131.6 0.70 S10P_HUMAN S-100P PROTEIN. 82 130.6 0.79 TPCS_HUMAN TROPONIN C, SKELETAL MUSCLE. 84 129.3 0.93 PSOR_HUMAN PSORIASIN. 81 128.6 1.00 • This is clearly an example of a sequence relationship that is best appreciated retrospectively, with the data from the tertiary structure. Without that evidence, one could not conclude that these two types of Ca++-binding proteins were homologous.

Page 41 Searching with a short domain

Page 42 Protein sequence similarity searches are most powerful when they are used to infer sequence homology by identifying statistically significant sequence similarity. Homologous sequences are thought to share a common ancestor; i.e. at some point in evolutionary history, there was a single protein sequence, which, through processes of speciation or gene duplication and divergence, produced the two homologous sequences we see today. The inference of homology --- common ancestry --- is the most powerful conclusion that one can draw from a similarity search because homologous proteins share similar three-dimensional structures. This can be seen the figure on the left, where the structures of three members of the serine protease superfamily are shown. Two of these proteins, bovine chymotrypsin and S. griseus trypsin, share strong sequence similarity while the third related sequence, S. griseus protease A, does not share significant similarity (E()< 66) yet the protein has a very similar structure. Thus, homologous proteins need not share statistically significant, or even detectable, sequence similarity

Page 43 Identifying distantly related proteins • Protein sequence comparison is the most powerful tool available today for inferring structure and function from sequence because of the constraints of protein evolution---a protein fold into a functional structure. • Protein sequence similarity can routinely be used to infer relationships between proteins that last shared a common ancestor 1--2.5 billion years ago. • Ability to identify distantly related proteins has improved with the development of accurate statistical estimates, which have provided better normalization methods, and with the use of optimized scoring parameters. • In using sequence similarity to infer homology, one should remember: • Always compare protein sequences if the genes encode proteins. Protein sequence comparison will typically double the look back time over DNA sequence comparison. • While most sequences that share statistically significant similarity are homologous, many distantly related homologous sequences do not share significant homology. (Low complexity regions display significant similarity in the absense of homology). Homologous sequences are usually similar over an entire sequence or domain. Matches that are more than 50% identical in a 20--40 amino acid region occur frequently by chance. • Homologous sequences share a common ancestor, and thus a common protein fold. Depending on the evolutionary distance and divergence path, two or more homologous sequences may have very few absolutely conserved residues. However, if homology has been inferred between A and B, between B and C, and between C and D, A and D must be homologous, even if they share no significant similarity. • Similarity searching techniques can be improved either by increasing the ability of a method to recognize distantly related sequences---increased sensitivity. or by lowering scores for unrelated sequences---increased selectivity.

Page 44 The Statistics of Sequence Similarity Scores • To assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone. • "chance" means the comparison of • real but non-homologous sequences; • real sequences that are shuffled to preserve compositional properties; • sequences that are generated randomly based upon a DNA or protein sequence model. • Analytic statistical results use the (iii) definition of chance, while empirical results based on simulation and curve-fitting may use any of the definitions.

Page 45 The Statistics of Sequence Similarity Scores • The statistics of global sequence comparison: very little is known about the random distribution of optimal global alignment scores. • Monte Carlo experiments: rough distributional results for some specific scoring systems and sequence compositions, but no generalization. • For a particular global alignment: generate many random sequence pairs of appropriate length and composition, and calculate the optimal alignment score for each. • Express the score in terms of standard deviations from the mean (Z-value). • But the distribution of the score is not normal and the Z-value cannot be converted into a P-value: the tail behavior of global alignment scores is unknown. • P-value: Probability (x > x (observed)) • If 100 random alignments have score inferior to the alignment of interest, the P-value in question is likely less than 0.01. • Avoid exaggerating the significance of a result found among multiple tests. • When many alignments have been generated, e.g. in a database search, the significance of the best must be discounted accordingly. An alignment with P-value 0.0001 in the context of a single trial may be assigned a P-value of only 0.1 if it was selected as the best among 1000 independent trials.

Page 46 The statistics of local sequence comparison without gaps • To analyze how high a score is likely to arise by chance, a model of random sequences is needed. • For proteins, the simplest model chooses the amino acid residues in a sequence independently, with specific background probabilities for the various residues. • The expected score for aligning a random pair of amino acid is required to be negative. Were this not the case, long alignments would tend to have high score independently of whether the segments aligned were related, and the statistical theory would break down. • The maximum of a large number of independent identically distributed (i.i.d) random variables tends to an extreme value distribution. • Cf. The sum of a large number of i.i.d. random variables tends to a normal distribution (central limit theorem).

Page 47 The statistics of local sequence comparison without gaps • The standard Gumbel distribution (pdf): • Cumulative distribution function of the Gumbel distribution is:

Page 48 The statistics of local sequence comparison without gaps With sufficiently large sequence lengths m and n, the statistics of HSP scores are characterized by two parameters, K and lambda. • E-value for the score S: The expected number of HSPs with score at least S is given by the formula • Doubling the length of either sequence should double the number of HSPs attaining a given score. • For an HSP to attain the score 2x it must attain the score x twice in a row, so one expects E to decrease exponentially with score. • The parameters K and lambda are natural scales for the search space size and the scoring system, respectively. Bit scores:raw scores are meaningless, like stating length without knowing the units. • A "bit score" S’ is obtained by normalizing a raw score using: • which has a standard set of units.

Page 49 The statistics of local sequence comparison without gaps • The E-value corresponding to a given bit score is Bit scores contains the statistical essence of the scoring system employed. To calculate significance one needs to know in addition only the size of the search space. • P-values The number of random HSPs with score >= S (E-value) is described by a Poisson distribution. • Assuming a random distribution of the maximum score of the HSPs and considering that >=S is a rare event • The probability of finding exactly a HSPs with score >=S is: • The chance of finding zero (a = 0) HSPs with score >=S is e-E, so the probability of finding at least one such HSP isThis is the P-value associated with the score S. • For example, if one expects to find three HSPs with score >= S, the probability of finding at least one is 0.95 = 1 - e^(-3) = 1 - 0.0498

Page 50 The statistics of local sequence comparison without gaps • BLAST reports E-value rather than P-values: it is easier to understand the difference between, for example, E-value of 5 and 10 than P-values of 0.993 and 0.99995. • However, when E < 0.01, P-values and E-value are nearly identical. Database searches • How to assess the significance of an alignment when comparing a protein of length m to a database containing many different proteins, of varying lengths? • The E-value is for the comparison of two proteins of lengths m and n. • One view is that all proteins in the database are a priori equally likely to be related to the query. • A low E-value for an alignment involving a short sequence should carry the same weight as a low E-value for an alignment with a long sequence. • To calculate a "database search" E-value, the pairwise-comparison E-value is multiplied by the number of sequences in the database. Eg. As in FASTA. • An alternative view: a query is a priori more likely to be related to a long sequence, because long sequences are often composed of multiple distinct domains. • Assume the a priori chance of relatedness is proportional to sequence length, the pairwise E-value involving a database sequence of length n should be normalized by N/n, where N is the total length of the database in residues. • Can treat the database as a single long sequence of length N. eg. BLAST program.

Lecture 11:

Lecture 11:

Presentation Transcript

Lecture 2 : Visual Astronomy -- Stars and Planets

Lecture 2: Software Platforms

CS101 Introduction to Computing Lecture 6 Developing & Hosting a Web Page (Web Development Lecture 2)

Is Advertising Immoral? Lecture 1

BCB 444/544

Mobile Programming Lecture 2

Lecture 1. Introduction

Lecture 3: Dynamic ILP

Topics in Space Weather Lecture 11 The Upper Atmosphere

6.096 Lecture 10

Cold atoms

“Elementary Particles” Lecture 6

Week 1 Lecture Review

“Elementary Particles” Lecture 6

ENT Undergraduate Lecture

Lecture 2. Optimal Sequence Alignment

Cold atoms

Introduction to Regression Lecture 2.1

Lecture 2: Software Platforms

Is Advertising Immoral? Lecture 1

Lecturer name: Dr. fahad albadr Radiology chiarman Lecture Date: 20-9-2011

Lecture 11:

Lecture 11:

Presentation Transcript

Lecture 2 : Visual Astronomy -- Stars and Planets

Lecture 2: Software Platforms

CS101 Introduction to Computing Lecture 6 Developing &amp; Hosting a Web Page (Web Development Lecture 2)

Is Advertising Immoral? Lecture 1

BCB 444/544

Mobile Programming Lecture 2

Lecture 1. Introduction

Lecture 3: Dynamic ILP

Topics in Space Weather Lecture 11 The Upper Atmosphere

6.096 Lecture 10

Cold atoms

“Elementary Particles” Lecture 6

Week 1 Lecture Review

“Elementary Particles” Lecture 6

ENT Undergraduate Lecture

Lecture 2. Optimal Sequence Alignment

Cold atoms

Introduction to Regression Lecture 2.1

Lecture 2: Software Platforms

Is Advertising Immoral? Lecture 1

Lecturer name: Dr. fahad albadr Radiology chiarman Lecture Date: 20-9-2011

CS101 Introduction to Computing Lecture 6 Developing & Hosting a Web Page (Web Development Lecture 2)