Database Searching

Database Searching

Searching for Data • Text Patterns • LookUp • Sequence Patterns • FindPatterns • ProfileSearch • Sequence Similarity • FastA, TFastA • BLAST, NetBLAST

Introduction to Data Base Searching • What are you looking for?

"Exact" matches • "Have I cloned something that someone else has already worked on?"

"Related" sequences • Is there something similar to my sequence • Evolutionary relationships • Convergent function

Search Program Considerations • Sensitivity • Stringency • Speed • Cost

Speed and Cost • Time and cost of the search is dependent on the size of the database and the size of the query • Restrict the size of the database • Use the -batch qualifier to save money • Use GenBank's Services

Results • Histogram • Plot of 'match scores" vs. number of sequences • Allows you to distinguish background noise from significant matches • Sequence Names • Alignments

FindPatterns • Locate short sequence patterns in sequences • Nucleic acid or Protein • Searches both strands of a nucleic acid sequence

Pattern Definitions • Findpatterns, Map, Mapsort, Mapplot, and Motifs all let you search with ambiguous expressions • Expressions can include any legal GCG sequence character • Expressions can also specify: • OR and NOT matching • Begin and end constraints • Repeat counts

Repeats • Parentheses () enclose one or more symbols that can be repeated • Braces {} enclose numbers that tell how many times the symbol(s) must be found • (GA){2,10} - GA repeated 2 to 10 times • G{2,} - G repeated 2 to 350,000 times • (GAT){,10} - GAT repeated 0 to 10 times

TAATA(N){20,30}ATG • TAATA, followed by 20 to 30 of any base, followed by ATG

OR Matching • Enclose the different choices in parentheses and separate the choices with commas • RGF(Q,A)S • RGF followed by either Q or A followed by S. • GAT(TG,T,G){1,4}A means • GAT followed by any combination of TG, T, or G repeated from 1 to 4 times followed by A

NOT Matching • Use the ~ symbol • GC~CAT • GC, followed by any symbol except C followed by AT • GC~(A,T)CC • GC followed by any symbol except A or T, followed by CC.

BEGIN AND END Constraints • The pattern <GACCAT can only be found at the beginning of the sequence • The pattern GACCAT> can only be found at the end of the sequence

analyze% findpatterns -check FindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal. Minimal Syntax: % findpatterns [-INfile=]Genbank:Humig* -Default Prompted Parameters: -PATterns=GAATTC,RGGAY patterns to be found [-OUTfile=]findpatterns.find the output file name Local Data Files: -DATa=pattern.dat a file with a set of patterns

Optional Parameters: -MISmatch=1 allows mismatches in the search for your subsequence -NAMes writes the output as a list file -ONEstrand searches only the top strand of nucleotide sequences -SIXbase searches only for patterns with six or more symbols -CIRcular searches all sequences as if they were circular -ALL does an "overlapping-set" search in nucleotide sequences -PERFect looks only for perfect matches -APPend appends the pattern data file to the output file -SHOw shows every file searched even if there are no finds -TERminal writes output to the terminal screen instead of a file -NOMONitor suppresses the screen trace showing each file -ONCe limits finds to patterns found a maximum of 1 time -MINCuts=1 limits finds to patterns found a minimum of 1 time -MAXCuts=3 limits finds to patterns found a maximum of 3 times -EXCLude=n1,n2 excludes patterns found between positions n1 and n2 -SINce=6.90 limits search to sequences dated on or after June 1990 -BATch Submits the program to run in the batch queue Add what to the command line ?

FINDPATTERNS in what sequence(s) ? swp:* Enter patterns individually, one per line. End the list with a blank line. Pattern 1: ygdd Pattern 2: What should I call the output file (* findpatterns.find *) ? ygdd.find ** findpatterns will run as a batch or at job. ** findpatterns was submitted using the command: " atnow " Job class000.894911339.a will be run at Mon May 11 13:28:59 CDT 1998. analyze%

! FINDPATTERNS on swp:* allowing 0 mismatches ! 1 YGDD May 11, 1998 11:02 .. AAC1_PSEAE ck: 7052 len: 177 ! P23181 pseudomonas aeruginosa. gentamicin 3'-acetyltransferase (ec 2.3.1.6 1 YGDD 148: YVQAD YGDD PAVAL AMDZ_YEAST ck: 8601 len: 464 ! Q03557 saccharomyces cerevisiae (baker's yeast). probable amidase ymr293c 1 YGDD 450: QVVGQ YGDD STVLD AMOB_NITEU ck: 4649 len: 420 ! Q04508 nitrosomonas europaea. ammonia monooxygenase (ec 1.13.12.-). 2/96 1 YGDD 227: RVLLA YGDD LLMDP AMYM_BACST ck: 5976 len: 717 ! P19531 bacillus stearothermophilus. maltogenic alpha-amylase precursor (ec

POLG_HRV1B VPSGCSGTSI FNTMINNIII RTLVLDAYKN IDLDKLKIIAYGDDVIFSYK POLG_HRV2 VPSGCSGTSI FNTMINNIII RTLVLDAYKN IDLDKLKIIA YGDDVIFSYI POLG_HRV89 MPSGCAGTSI FNTIINNIII RTLVLDAYKN IDLDKLKILA YGDDVIFSYN POLG_CXA16 MPSGCSGTSI FNSMINNIII RTLLIKTFKG IDLDELNMVA YGDDVLASYP POLG_HE71M MPSGCSGTSI FNSMINNIII RTLLIKTFKG IDLDELNMVA YGDDVLASYP POLG_HE71B MPSGCSGTSI FNSMINNIII RTLLIKTFKG IDLDELKMVA YGDDVLASYP POLG_SVDVU MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP POLG_SVDVH MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP POLG_COXB5 MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP POLG_COXB3 MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP POLG_CXA9 MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP POLG_COXB4 MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP POLG_COXB1 MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP POLG_EC11G MPSGYSGTSM FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP POLG_FMDV1 MPSGCSATSI INTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD POLG_FMDVO MPSGCSATSI INTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD POLG_FMDVZ MPSGCSATSI INTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD POLG_FMDVA MPSDCSATGI INTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD POLG_FMDVS MPSGCSATSI VNTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD POLG_TMEVB LPSGCAATSM LNTIMNNVII RAALYLTYSN FDFDDIKVLS YGDDLLIGTN POLG_TMEVG LPSGCAATSM LNTIMNNVII RAALYLTYSN FEFDDIKVLS YGDDLLIGTN POLG_TMEVD LLSGCAATSM LNTIMNNVII RAALYLTYSN FEFDDIKVLS YGDDLLIGTN POLG_EMCVD LPSGCAATSM LNTIMNNIII RAGLYLTYKN FEFDDVKVLS YGDDLLVATN POLG_EMCVB LPSGCAATSM LNTIMNNIII RAGLYLTYKN FEFDDVKVLS YGDDLLVATN POLG_EMCV LPSGCAATSM LNTIMNNIII RAGLYLTYKN FEFDDVKVLS YGDDLLVATN

! FINDPATTERNS on swp:* allowing 0 mismatches ! 1 (L,I,V)(S,A)YGDD(L,I,V){2} May 11, 1998 11:31 .. AMOB_NITEU ck: 4649 len: 420 ! Q04508 nitrosomonas europaea. ammonia monooxygenase (ec 1.13.12.-). 2/96 1 (L,I,V)(S,A)YGDD(L,I,V){2} (L)(A)YGDD(L){2} 225: RSRVL LAYGDDLL MDPMD POLG_BOVEV ck: 7260 len: 2,175 ! P12915 bovine enterovirus (strain vg-5-27) genome polyprotein (coat protei 1 (L,I,V)(S,A)YGDD(L,I,V){2} (I)(A)YGDD(L,V){2} 2,038: DDLKI IAYGDDVL ASYPY POLG_COXB1 ck: 4153 len: 2,182 ! P08291 coxsackievirus b1. genome polyprotein (coat proteins vp1 to vp4; co 1 (L,I,V)(S,A)YGDD(L,I,V){2} (I)(A)YGDD(I,V){2} 2,045: DQFRM IAYGDDVI ASYPW POLG_COXB3 ck: 7699 len: 2,185 ! P03313 coxsackievirus b3. genome polyprotein (coat proteins vp1 to vp4; co

FastA • Search nucleotide sequences with a nucleotide query • Search protein sequences with a peptide query

FastA Algorithm • Uses a word search algorithm • Breaks the search into steps • Only the sequences with the best scores are searched in subsequent steps • Relatively fast • Sensitive

Step 1 • Scan the sequence database for the best hits • Uses a word-match type search • Looks for runs of short, perfect matches • Essentially a dotplot-like search • Then find the 10 Best diagonals for each sequence pair

Initial Scan (DotPlot) DatabaseSequences Query Sequence(s)

Best Diagonals

Step 2 • Rescore the initial diagonals • Conservative replacements • Uses the Blosum50 symbol comparison table

Initial regions • Regions are an area of diagonals with the highest scores • Score reported as Init1

Diagonals with the Highest Scores

Step 3 • Join adjacent diagonals. • Find the optimal subset of initial regions which can be joined together. • Corrects for gaps • Score reported as Init n

Joined Diagonals

Step 4 • Align the sequences with the best matches • Uses BestFit Algorithm • Aligns the joined diagonals from step 3 with the query sequence • Score reported as Opt

FastA Summary SequenceAlignment

Specifying the Word Size • 1 to 6 for nt • 1 or 2 for aa • Smaller words • Increased sensitivity • Decreased stringency • Higher backgrounds • Increases cpu time

Output • Histogram • Shows number of sequences falling in a particular score range • Sequence names with scores • Alignment of sequences to the query

Features • More Sensitive than BLAST (?) • Slower than BLAST

analyze% fasta -check FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST. Minimal Syntax: % fasta [-INfile1=]ggamma.pep -Default Prompted Parameters: [-INfile2=]pir:* specifies the search set [-OUTfile=]ggamma.fasta specifies the output file name -BEGin=1 -END=148 sets the range of interest -WORdsize=2 sets the word size -EXPect=2.0 lists scores until E() value reaches 2.0 Local Data Files: -MATRix=fastadna.cmp assigns the scoring matrix for nucleic acids -MATRix=blosum50.cmp assigns the scoring matrix for proteins

Optional Parameters: -PROCessors=2 sets the number of threads devoted to the analysis on a multiprocessor computer Press q to quit or <Return> for more: -MINLength=1000 searches only sequences of 1000 or more residues -MAXLength=5000 searches only sequences of 5000 or fewer residues -SINce=6.90 limits search to sequences dated on or after June 1990 -ONEstrand searches using only the top strand of nucleotide queries -PAMfactor uses scoring matrix to calculate initial diagonal scores -GAPweight=16 sets gap creation penalty (12 is protein default) -LENgthweight=4 sets gap extension penalty (2 is protein default) -OPTall=20 computes opt score when the initn score is 20 or higher; sorts on opt score -NOOPTall doesn't compute opt score during search; sorts on initn -SWalign creates final alignment as unlimited Smith-Waterman for nuc -LIStsize=40 shows the best 40 scores (overrides EXPect) -ALIgn=20 shows the best 20 alignments -NOALIgn suppresses sequence alignments -SHOWall shows complete sequences in alignment, not just overlaps -MARKx=3 sets the alignment display mode -NOHIStogram suppresses printing the histogram -LINesize=60 sets number of sequence symbols per line of the alignment -NODOCLines suppresses sequence documentation in the alignment -BATch submits the program to run in the batch queue -NOMONitor suppresses the screen trace for each search set sequence Add what to the command line ?

FASTA with what query sequence ? pol.pep Begin (* 1 *) ? End (* 461 *) ? Search for query in what sequence(s) (* SwissProt:* *) ? What word size (* 2 *) ? Don't show scores whose E() value exceeds: (* 10.0 *): What should I call the output file (* pol.fasta *) ? ** fasta will run as a batch or at job. ** fasta was submitted using the command: " atnow " job class000.894911721.a at Mon May 11 13:35:21 1998

!!SEQUENCE_LIST 1.0 (Nucleotide) FASTA of: pol.seq from: 1 to: 1383 May 11, 1998 12:44 TO: GenEMBL:* Sequences: 436,425 Symbols: 769,709,871 Word Size: 6 Sequences too short to analyze: 25 (113 symbols) Databases searched: GenBank, Release 105.0, Released on 15Feb1998, Formatted on 19Feb1998 EMBL, Release 53.0, Released on 16Dec1997, Formatted on 20Feb1998 Searching with both strands of the query. Scoring matrix: GenRunData:fastadna.cmp Constant pamfactor used Gap creation penalty: 16 Gap extension penalty: 4 Histogram Key: Each histogram symbol represents 1771 search set sequences Each inset symbol represents 16 search set sequences z-scores computed from opt scores

z-score obs exp (=) (*) < 20 216 0 :* 22 57 0 :* 24 75 1 :* 26 114 18 :* 28 100 197 :* 30 433 1195 :* 32 1343 4619 := * 34 6372 12526 :==== * 36 17948 25726 :=========== * 38 37141 42515 :===================== * 40 65424 59305 :=================================*=== 42 82308 72494 :========================================*======

84 2119 1599 :*= 86 1556 1237 :* 88 1236 957 :* 90 1075 741 :* 92 755 573 :* :===================================*==== 94 635 443 :* :===========================*============ 96 442 343 :* :=====================*====== 98 395 266 :* :================*======== 100 276 205 :* :============*===== 102 218 159 :* :=========*==== 104 164 123 :* :=======*=== 106 121 95 :* :=====*== 108 105 74 :* :====*== 110 67 57 :* :===*= 112 61 44 :* :==*= 114 57 34 :* :==*= 116 73 26 :* :=*=== 118 50 20 :* :=*== >120 353 16 :* :*======================

z-score obs exp (=) (*) < 20 216 0 :* 22 57 0 :* 24 75 1 :* 26 114 18 :* 28 100 197 :* 30 433 1195 :* 32 1343 4619 := * 34 6372 12526 :==== * 36 17948 25726 :=========== * 38 37141 42515 :===================== * 40 65424 59305 :=================================*=== 42 82308 72494 :========================================*====== 44 106229 79967 :=============================================*============== 46 85717 81448 :=============================================*=== 48 80192 77978 :============================================*= 50 64551 71155 :===================================== * 52 54078 62557 :=============================== * 54 48895 53435 :============================ * 56 37786 44634 :====================== * 58 33037 36644 :=================== * 60 26505 29684 :=============== * 62 22308 23798 :=============* 64 18406 18926 :==========* 66 15179 14959 :========* 68 13671 11766 :======*= 70 10703 9221 :=====*= 72 7870 7205 :====* 74 6143 5618 :===* 76 5397 4372 :==*= 78 3856 3399 :=*= 80 3153 2639 :=* 82 2636 2019 :=* 84 2119 1599 :*= 86 1556 1237 :* 88 1236 957 :* 90 1075 741 :* 92 755 573 :* :===================================*==== 94 635 443 :* :===========================*============ 96 442 343 :* :=====================*====== 98 395 266 :* :================*======== 100 276 205 :* :============*===== 102 218 159 :* :=========*==== 104 164 123 :* :=======*=== 106 121 95 :* :=====*== 108 105 74 :* :====*== 110 67 57 :* :===*= 112 61 44 :* :==*= 114 57 34 :* :==*= 116 73 26 :* :=*=== 118 50 20 :* :=*== >120 353 16 :* :*======================

Results sorted and z-values calculated from opt score 1614 scores saved that exceeded 99 471579 optimizations performed Joining threshold: 62, optimization threshold: 47, opt. width: 16 The best scores are: init1 initn opt z-sc E(867084).. GB_VI:POL1 Begin: 5987 End: 7369 ! J02281 Poliovirus type 1 (Mahoney s... 6915 6915 6915 6938.1 0 GB_VI:POLIO1B Begin: 5987 End: 7369 ! V01149 Genome of human poliovirus t... 6915 6915 6915 6938.1 0 GB_VI:POL1B31B Begin: 889 End: 2271 ! M17494 Poliovirus type 1 (Mahoney) ... 6879 6879 6879 6908.0 0 GB_VI:POLIO1A Begin: 5979 End: 7361 ! V01148 Genome of human poliovirus t... 6879 6879 6879 6901.9 0 GB_VI:POLIOS1 Begin: 5987 End: 7369 ! V01150 Genome of human poliovirus, ... 6825 6825 6825 6847.6 0 GB_PAT:I00480 Begin: 1 End: 1227 ! I00480 Sequence 9 from Patent US 47... 6135 6135 6135 6162.7 0 GB_VI:PIPOLS2 Begin: 5986 End: 7368 ! X00595 Poliovirus type 2 genome (st... 5275 5275 5340 5353.8 0 GB_VI:CXA24CG Begin: 6010 End: 7392 ! D90457 Coxsackievirus A24, complete... 5142 5142 5142 5154.6 0 GB_PAT:I22065 Begin: 5978 End: 7360 ! I22065 Sequence 1 from patent US 55... 5124 5124 5124 5136.5 0 GB_VI:PIPO3119 Begin: 5978 End: 7360 ! X01076 Poliovirus type 3 complete s... 5115 5115 5115 5127.5 0 GB_VI:POL3L37 Begin: 5978 End: 7360 ! K01392 Poliovirus P3/Leon/37 (type ... 5115 5115 5115 5127.5 0

FastA Results (nt search)Polio Polymerase vs GenEMBL pdf file

FastA Results (protein search)Polio Polymerase vs SwissProt word=2 pdf file

FastA Results (protein search)Polio Polymerase vs SwissProt word=1 pdf file

TFastA • Translates the nucleotide sequence database in all 6 reading frames • Search the translated sequences with a peptide query • Algorithm is the same as for FastA

analyze% tfasta -check TFastA does a Pearson and Lipman search for similarity between a query peptide sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied peptide sequences in a nucleotide sequence database are similar to my peptide sequence?" Minimal Syntax: % tfasta [-INfile1=]ggamma.pep -Default Prompted Parameters: [-INfile2=]GenEMBL:* search set (all of GenEMBL) [-OUTfile=]ggamma.tfasta output file name -BEGin=1 -END=148 range of interest -WORdsize=2 word size -EXPect=2.0 lists scores until E() value reaches 2.0 Local Data Files: -MATRix=blosum50.cmp scoring matrix for peptides

Optional Parameters: -GAPweight=16 gap creation penalty -LENgthweight=4 gap extension penalty -SINce=6.90 limits search to sequences dated on or after June 1990 -THREEFrames translates and searches only the three forward reading frames -FRAme=1 translates and searches only the frame specified. -NOPAMfactor uses a constant factor to calculate initial diagonal scores -LIStsize=40 shows the best 40 scores (overrides EXPect) -NOATTRibutes suppresses writing the Begin, End, and Strand list attributes to the list of best scores -ALIgn=20 shows the best 20 alignments -NOALIgn suppresses sequence alignments -OPTall=20 immediately computes opt score when the initn score is 20 or higher; sorts on opt score -NOOPTall doesn't compute opt score during search; sorts on initn -SWalign does final alignment as Smith-Waterman -SHOWall shows complete sequences in alignment, not just overlaps -MARKx=3 determines the alignment display mode -NOHIStogram suppresses printing the histogram -LINEsize=60 number of sequence symbols per line of the alignment -NODOCLines suppresses sequence documentation in the alignment -NOMONitor suppresses the screen trace for each search set sequence -BATch submits the program to run in the batch queue -MINLength=1000 searches only sequences of 1000 or more residues -MAXLength=5000 searches only sequences of 5000 or fewer residues Add what to the command line ?

Database Searching