Computing in Molecular Biology

Computing in Molecular Biology Hugues Sicotte National Center for Biotechnology Information sicotte@ncbi.nlm.nih.gov

Alignment methods Sequence Alignment representation using a dot plot. For a query of N letters against a subject sequence of M letters, it requires MxN comparisons. Query sequence Subject sequence

query sequence MLIIKRDELVISWASHERE H A S H I N G M E T H O D S Hashing is a common method for accelerating database searches MLI LII IIK IKR all overlappingwords of size 3 Compile “dictionary” of words from the query sequence. Put each word in a look-up table that points to the original position in the sequence. Thus given one word, you can know if it is in the query in a single operation. KRD RDE DEL ELV LVI VIS ISW SWA WAS ASH SHE HER ERE

Index lookup • Each word is assigned a unique integer. • E.g. for a word of 3 letters made up of an alphabet of 20 letters. • Assign a code to each letter Code(l) (0 to 19) • For a word of 3 letters L1 L2 L3 the code is • index = Code(L1)*202 + Code(L2)*201 + Code(L3) • 3. Have an array with a list of the positions that have that word. AAA AAB MLI MLJ 0 1 2 3 1 Position in query sequence of word

query sequence MLIIKRDELVISWASHERE H A S H I N G M E T H O D S Building the dictionary for the query sequence requires (N-2) operations. MLI LII IIK IKR all overlappingwords of size 3 KRD RDE DEL ELV The database contains (M-2) words, and it takes only one operation to see if the word was in the query. LVI VIS ISW SWA WAS ASH SHE HER ERE

H A S H I N G M E T H O D S Query sequence Scan the subject, looking up words in the dictionary Use word hits to determine were to search for alignments fills the dynamic programming matrix in (N-2)+(M-2) operations instead of MxN. Subject sequence

H A S H I N G M E T H O D S Query sequence Scan the database, looking up words in the dictionary Use word hits to determine were to search for alignments Subject sequence FASTA searches in a band

H A S H I N G M E T H O D S Query sequence Scan the database, looking up words in the dictionary Use word hits to determine were to search for alignments Database sequence BLAST extends from word hits

Database Search Space Simplest Database searching could is a large dynamic programming example. With all the database sequences concatenated one after another. Query sequence Concanated Database sequence

Database Search Space Which alignment is more significant? Query sequence Concanated Database sequence

Database Search Space Score can be used to judge alignments. But a score absolute value is a function of the score parameters. Match=+1,Mismatch=-1, Gap_open=5, gap_extend=1 Yields same alignments as Match=+10,Mismatch=-10, Gap_open=50, gap_extend=10 Scores useful for relative ranking. Query sequence Concanated Database sequence

Database Search Space To Judge relevancy of an alignment, need to judge if match is significant. E-value = Expect(S) is a function of the score, database size and composition, and query size. Number of Aligments with scores >= S expected if the query was a random given the database size and composition. Expect of 0.0 means a very good match unlikely to be random. Query sequence Concanated Database sequence

D A T A B A S E S E A R C H I N G Compare one query sequence against an entire database > fasta myquery swissprot -ktup 2 search program querysequence sequencedatabase optionalparameters A typical search has four basic elements

D A T A B A S E S E A R C H I N G With exponential database growth, searches keep taking more time > fasta myquery swissprot -ktup 2 searching . . . . . .

E-value “Hits” can be sorted according to their E-value or their score. The E-value is better known as the EXPECT value and is a function of score, database size and query sequence length. E-value: Number of alignments with a score >=S that you expect to find if the database was a collection of random letters. e.g. For a score of 1, one only requires 1 match, and there should be an enormous amount of alignments. One expects to find less alignments with a score of 5, and so on.. Eventually when the score is big enough, one expects to find an insignificant number of of alignments that could be due to chance. E-value of less than 1e-6 (1* 10-6 in scientific notation) are usually very good and for proteins, E<1e-2 is usually considered significant. It is still possible for a Hit with E>1 to be biologically meaningful, but more analysis is required to comfirm that. Even for VERY good hits, it is possible that the hit is due to a biological artifact (sequencing/cloning vector, repeats, low-complexity sequence…)

D A T A B A S E S E A R C H I N G The “hit list” gives titles and scores for matched sequences > fasta myquery swissprot -ktup 2 The best scores are: initn init1 opt z-sc E(77110) gi|1706794|sp|P49789|FHIT_HUMAN BIS(5'-ADENOSYL)- 996 996 996 1262.1 0 gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL) 412 382 395 507.6 1.4e-21 gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEI 238 133 316 407.4 5.4e-16 gi|3915958|sp|Q58276|Y866_METJA HYPOTHETICAL HIT- 153 98 190 253.1 2.1e-07 gi|3916020|sp|Q11066|YHIT_MYCTU HYPOTHETICAL 15.7 163 163 184 244.8 6.1e-07 gi|3023940|sp|O07513|HIT_BACSU HIT PROTEIN 164 164 170 227.2 5.8e-06 gi|2506515|sp|Q04344|HNT1_YEAST HIT FAMILY PROTEI 130 91 157 210.3 5.1e-05 gi|2495235|sp|P75504|YHIT_MYCPN HYPOTHETICAL 16.1 125 125 148 199.7 0.0002 gi|418447|sp|P32084|YHIT_SYNP7 HYPOTHETICAL 12.4 42 42 140 191.3 0.00058 gi|3025190|sp|P94252|YHIT_BORBU HYPOTHETICAL 15.9 128 73 139 188.7 0.00082 gi|1351828|sp|P47378|YHIT_MYCGE HYPOTHETICAL HIT- 76 76 133 181.0 0.0022 gi|418446|sp|P32083|YHIT_MYCHR HYPOTHETICAL 13.1 27 27 119 165.2 0.017 gi|1708543|sp|P49773|IPK1_HUMAN HINT PROTEIN (PRO 66 66 118 163.0 0.022 gi|2495231|sp|P70349|IPK1_MOUSE HINT PROTEIN (PRO 65 65 116 160.5 0.03 gi|1724020|sp|P49774|YHIT_MYCLE HYPOTHETICAL HIT- 52 52 117 160.3 0.031 gi|1170581|sp|P16436|IPK1_BOVIN HINT PROTEIN (PRO 66 66 115 159.3 0.035 gi|2495232|sp|P80912|IPK1_RABIT HINT PROTEIN (PRO 66 66 112 155.5 0.057 gi|1177047|sp|P42856|ZB14_MAIZE 14 KD ZINC-BINDIN 73 73 112 155.4 0.058 gi|1177046|sp|P42855|ZB14_BRAJU 14 KD ZINC-BINDIN 76 76 110 153.8 0.072 gi|1169825|sp|P31764|GAL7_HAEIN GALACTOSE-1-PHOSP 58 58 104 138.5 0.51 gi|113999|sp|P16550|APA1_YEAST 5',5'''-P-1,P-4-TE 47 47 103 137.8 0.56 gi|1351948|sp|P49348|APA2_KLULA 5',5'''-P-1,P-4-T 63 63 98 131.3 1.3 gi|123331|sp|P23228|HMCS_CHICK HYDROXYMETHYLGLUTA 58 58 99 129.4 1.6 gi|1170899|sp|P06994|MDH_ECOLI MALATE DEHYDROGENA 70 48 91 122.9 3.7 gi|3915666|sp|Q10798|DXR_MYCTU 1-DEOXY-D-XYLULOSE 75 50 92 121.9 4.3 gi|124341|sp|P05113|IL5_HUMAN INTERLEUKIN-5 PRECU 36 36 85 121.3 4.7 gi|1170538|sp|P46685|IL5_CERTO INTERLEUKIN-5 PREC 36 36 84 120.0 5.5 gi|121369|sp|P15124|GLNA_METCA GLUTAMINE SYNTHETA 45 45 90 118.9 6.3 gi|2506868|sp|P33937|NAPA_ECOLI PERIPLASMIC NITRA 48 48 92 117.4 7.6 gi|119377|sp|P10403|ENV1_DROME RETROVIRUS-RELATED 59 59 89 117.0 8 gi|1351041|sp|P48415|SC16_YEAST MULTIDOMAIN VESIC 48 48 97 117.0 8 gi|4033418|sp|O67501|IPYR_AQUAE INORGANIC PYROPHO 38 38 83 116.8 8.3

D A T A B A S E S E A R C H I N G Detailed alignments are shown farther down in the output > fasta myquery swissprot -ktup 2 >>gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)-TETR (182 aa) initn: 412 init1: 382 opt: 395 z-score: 507.6 E(): 1.4e-21 Smith-Waterman score: 395; 52.3% identity in 109 aa overlap 10 20 30 40 50 gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRPVERFHDLRPDEVADLF : X: .:.:: :.:: ::..:::::: : : : :..:: :.:..::: gi|170 MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLVIPQRAVPRLKDLTPSELTDLF 10 20 30 40 50 60 60 70 80 90 100 110 gi|170 QTTQRVGTVVEKHFHGTSLTFSMQDGPEAGQTVKHVHVHVLPRKAGDFHRNDSIYEELQK ....: :.:: : ... ....::: .::::: :::::..::: .:: .:: .: :X.: gi|170 TSVRKVQQVIEKVFSASASNIGIQDGVDAGQTVPHVHVHIIPRKKADFSENDLVYSELEK 70 80 90 100 110 120 120 130 140 gi|170 HDKEDFPASWRSEEEMAAEAAALRVYFQ .. gi|170 NEGNLASLYLTGNERYAGDERPPTSMRQAIPKDEDRKPRTLEEMEKEAQWLKGYFSEEQE 130 140 150 160 170 180 >>gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEIN 2 (217 aa) initn: 238 init1: 133 opt: 316 z-score: 407.4 E(): 5.4e-16 Smith-Waterman score: 316; 37.4% identity in 131 aa overlap 10 20 30 40 gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRP-VER :.. :. .v^: :.. ..:::: ::.::::::. ::X :

Concanated Database sequence Database Search Space Some matches are non-meaningful because they occur VERY often in database. e.g. nucleotide AAA (from polyA) Biological repeated elements(retroposons ALU) Low-complexity repeated patterns. (CAGCAG, QQQ,KKK,…) These elements should be FILTERED or MASKED to avoid generating false ‘hits’.. It is ‘OK’ to align through them if they are near meaningful diagonal ‘hits’ Query sequence

Score and Statistics Some amino acids mutations do not affect structure/function very much. Amino acids with similar physico-chemical and steric properties can often replace each other. Scoring system that doesn’t penalize very much mutations to similar amino acid. PAM Matrices: Point Accepted Mutations. Defined in terms of a divergence of 1 percent PAM. For distant sequences use PAM250, while for closer sequences (like DNA) use PAM100. Some sites accumulate mutations some others don’t, thus use of the PAM100 matrice doesn’t mean that the sequences compared were 100% mutated. BLOSUM: BLOCK substitution matrices. Started with the BLOCKS database of multiple alignment only involving distant sequences. BLOSUM62 means that the proteins compated were never closer than 62% Identity. BLOSUM50 matrices involved alignment of more distant sequences. Recommend use BLOSUM matrices (BLOSUM62) for most protein alignments.

S C O R I N G S Y S T E M S A 4 R -1 5 N -2 0 6 D -2 -2 1 6 C 0 -3 -3 -3 9 Q -1 1 0 0 -3 5 E -1 0 0 2 -4 2 5 G 0 -2 0 -1 -3 -2 -2 6 H -2 0 1 -1 -3 0 0 -2 8 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 A R N D C Q E G H I L K M F P S T W Y V Some amino acid substitutions are more common than others BLOSUM62 Substitution scores come from an odds ratio based on measured substitution rates Figure 7.8

S C O R I N G S Y S T E M S A 4 R -1 5 N -2 0 6 D -2 -2 1 6 C 0 -3 -3 -3 9 Q -1 1 0 0 -3 5 E -1 0 0 2 -4 2 5 G 0 -2 0 -1 -3 -2 -2 6 H -2 0 1 -1 -3 0 0 -2 8 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 A R N D C Q E G H I L K M F P S T W Y V Identities get positive scores, but some are better than others BLOSUM62 Figure 7.8

S C O R I N G S Y S T E M S A 4 R -1 5 N -2 0 6 D -2 -2 1 6 C 0 -3 -3 -3 9 Q -1 1 0 0 -3 5 E -1 0 0 2 -4 2 5 G 0 -2 0 -1 -3 -2 -2 6 H -2 0 1 -1 -3 0 0 -2 8 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 K -1 2 0 -1 -3 11 -2 -1 -3 -2 5 M -1 -1 -2 -3 -1 0 -2 -3 -2 12 -1 5 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 V 0 -3 -3 -3 -1 -2 -2 -3 -3 31 -2 1 -1 -2 -2 0 -3 -1 4 A R N D C Q E G H I L K M F P S T W Y V Some non-identities have positive scores, but most are negative BLOSUM62 Figure 7.8

BLAST and BLAST2SEQUENCES BLAST is a database search engine based on using hashing to accelerate the search. blastn (nucleotide query against nucleotide database) blastp (protein query against protein database) blastx (nucleotide query against protein database) - translates a nucleotide query in all 6 reading frames and compare it to a protein database. tblastn (protein query against nucleotide database) - compare a protein against a nucleotide database translated in all 6 reading frames. tblastx (nucleotide query against nucleotide database) - compares a nucleotide sequence against a nucleotide database by translating the query and database in all 6 reading frames. Very slow! A pairwise alignment implementation of this program is available at: http://www.ncbi.nlm.nih.gov/gorf/bl2.html

Protein BLAST databases nr All non-redundant GenBank CDS+ translations+PDB+ SwissProt + PIR + PRF month All new or revised GenBank CDS translation+PDB+SwissProt+PIR+PRF released in the last 30 days. swissprot Last major release of the SWISS-PROT protein sequence database (no updates) Drosophila Drosophila genome proteins provided by Celera and Berkeley Drosophila Genome Project (BDGP). yeast Yeast (Saccharomyces cerevisiae) genomic CDS translations ecoli Escherichia coli genomic CDS translations pdb Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank kabat [kabatpro] Kabat's database of sequences of immunological interest alu Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences.

Nucleotide BLAST databases nr All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant". month All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days. Drosophila genome Drosophila genome provided by Celera and Berkeley Drosophila Genome Project (BDGP). dbest Database of GenBank+EMBL+DDBJ sequences from EST Divisions dbsts Database of GenBank+EMBL+DDBJ sequences from STS Divisions htgs Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2 (finished, phase 3 HTG sequences are in nr) gss Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences. yeast Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences E. coli Escherichia coli genomic nucleotide sequences

Nucleotide BLAST databases pdb Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank kabat [kabatnuc] Kabat's database of sequences of immunological interest vector Vector subset of GenBank(R), NCBI, in ftp://ncbi.nlm.nih.gov/blast/db/ mito Database of mitochondrial sequences alu Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. epd Eukaryotic Promotor Database found on the web at http://www.genome.ad.jp/dbget-bin/www_bfind?epd

BLASTN SEARCH (M29204) Search Nucleotide sequence M29204 against nr. http://www.ncbi.nlm.nih.gov/blast/blast.cgi?Jform=1

BLASTP and filtering. Search using blastp against nr With filtering ON (default)\ Then with filtering OFF. >GCF MKKRVTNRERHWTHRRRRQRTRKKKKKKKRVLGRRALGPRPWLTGRKGLFGSARLIPATA

BLASTN vs BLASTX Search blastn against nr (nucleotide) U15595 Now search using blastx against nr (protein) Now Search blastx against ALU

TBLASTX against dbEST Search tblastx against dbEST Picks up homologs based on protein homology of translations. >OCRL-selected mRNA, partial sequenceTTGAACATCATGAAACATGAGGTTGTCATTTGGTTGGGAGATTTGAATTATAGACTTTGCATGCCTGATGCCAATGAGGTGAAAAGTCTTATTAATAAGAAAGACCTTCAGAGACTCTTGAAATTCGACCAGCTAAATATTCAGCGCACACAGAAAAAAGCTTTTGTTGACTTCAATGAAGGGGAAATCAAGTTCATCCCCACTTATAAGTATGACTCTAA

Prosite search Search prosite for NP_000271 (Pax6a) http://www.expasy.ch/prosite

PHI-Blast search Search Prosite db using the NCBI’s PHI-blast.(Pattern-Hit-Initiated blast) using the pattern for Pax6a. [LIVMFYG]-[ASLVR]-X(2)-[LIVMSTACN]-X-(4)-[LIV]-[RKNQESTAIY]-[LIVFSTNKH]-W -e 2e-14

PSI-Blast search Search AB026911 using PSI-blast. (at NCBI). Position-Specific-Iteration. .. Modifies the scoring matrix as a function of conserved or unconserved residues in alignments.

ONLINE tutorials Details of Blast methodology. http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html Blast usage and Tutorial http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html Quick overview of terminology. http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/similarity.html

Computing in Molecular Biology

Computing in Molecular Biology

Presentation Transcript

Molecular Biology

Molecular Biology

Molecular Biology

Molecular Biology

Molecular Biology

Molecular Biology

Molecular Biology

Molecular Biology

Molecular Biology

Molecular Biology

Molecular Biology

Molecular Biology

Molecular Biology

Molecular Biology

Molecular Biology

Molecular Biology

Molecular Biology

Computing in Molecular Biology

Molecular Biology

Molecular Biology

Molecular Biology

Molecular Biology