Database Searching

Database Searching Using alignment algorithms for finding similar sequences

Why do we want to compare sequences? • Evolutionary relationships • Phylogenetic trees can be constructed based on comparison of the sequences of a molecule (example: 16S rRNA) taken from different species • Residues conserved during evolution play an important role • Prediction of protein structure and function • Proteins which are very similar in sequence generally have similar 3D structure and function as well • By searching a sequence of unknown structure against a database of known proteins the structure and/or function can in many cases be predicted

Things to keep in mind when working with alignments • Pairwise alignment programs always find the optimal alignment of two sequences • They do so even if it does not make any sense at all to align the two sequences • ”Optimal” means optimal according to the substitution matrix and gap penalties you choose – also if you choose the wrong ones • Generally the underlying assumptions are wrong • The frequency of substitution is not the same at all positions • Nor is the frequencies of insertions and deletions the same • Affine gap penalties do not properly model indel events

Using sequence alignment to search databases • The most common usage of pairwise sequence alignment is searching databases for related sequences • Although the alignments themselves may be unreliable the alignment scores gives a lot of information about which sequences are related and which are not • Having a set of related sequences is a lot more informative than just one sequence - even if nothing is known about the related sequences

Requirements in addition to an alignment method • A very fast method to find potentially related sequences • Systematically searching through the databases with the alignment methods take too long even though dynamic programming is fast • Some method to initially identify possible matches is therefore needed to speed up the search • A method to evaluate which matches to trust • Statistics on the alignment score distributions can be used to calculate the significance of an alignment • This way we can not only rank which matches are better than others but also tell if any of them are good at all

Local or global alignment • Generally local alignment is used for performing database searches • For most cases you would be interested in knowing if any parts of you sequences looks like something else • The protein sequence databases have not been split into domains • It is not always the optimal thing to do but … • In the case where the complete sequence should match the local alignment score will be almost identical to the global one • If you really want a global alignment you can make it afterwards

Differences between global and local alignments • Extra constraint on scoring function: The expected score for a random alignment must be negative • Because you can to start a new alignment anywhere dynamic programming scores cannot become negative • The trace-back is started at the highest values rather than the lower right corner • The trace-back is stopped as soon as a zero is encountered

The Smith-Waterman algorithm(local alignment)

Alignment score distributions • The local similarity scores for ungapped alignment of random sequences can be shown to follow an extreme value distribution:P(Sx) = 1-exp(-Kmne-x),where m and n are the sequence lengths while K and  are free parameters • This turns out to be a very good approximation for gapped alignment as long as reasonably large gap penalties are used

Database searching • Positive reporting: When searching in a database we report only the few good matches • The expected number of database hits with a score of at least x can be calculated as: E(Sx) = DP(Sx),where D is the number of entries in the database • E-values are much better for evaluating alignments than raw alignment scores or ”percent identity”

A curse or a blessing? • Large databases are a blessing … • They are more likely to contain something similar to the query • … and a curse • Increasing the size of the database decreases the significance of the hits you get • Searching huge databases requires fast computers • What requirements this puts on software development • The programs must be speeded up or database searches will take longer and longer • The false positive rate must be reduced to not lose specificity

FASTA (Pearson 1995) Uses heuristics to avoid calculating the full dynamic programming matrix Speed up searches by an order of magnitude compared to full Smith-Waterman The statistical side of FASTA is still stronger than BLAST BLAST (Altschul 1990, 1997) Uses rapid word lookup methods to completely skip most of the database entries Extremely fast One order of magnitude faster than FASTA Two orders of magnitude faster than Smith-Waterman Almost as sensitive as FASTA Heuristic search algorithms

Coffee breakTop 10 ways to tell you drink too much coffee 10 Juan Valdez names his donkey after you 9 You get a speeding ticket even when you're parked 8 You grind your coffee beans in your mouth 7 You sleep with your eyes open 6 You watch videos in fast-forward 5 You lick your coffeepot clean 4 Your eyes stay open when you sneeze 3 The nurse needs a scientific calculator to take your pulse 2 You can type sixty words a minute with your feet 1 You can jump-start your car without jumper cables.

How BLAST works • The search is speeded up by indexing the sequence databases in a so-called suffix array • Three letter subsequences are used as keys to the sequences • Closely related substitutions are also included • This gives ~150 index keys for each sequence • This is used in two ways • To quickly discard sequences that are not similar at all before even beginning to align them • To constrain the alignment and thereby speed up the alignment procedure itself

BLASTN Nucleotide query sequence Nucleotide database BLASTP Protein query sequence Protein database BLASTX Nucleotide query sequence Protein database Compares all six reading frames with the database TBLASTN Protein query sequence Nucleotide database ”On the fly” six frame translation of database TBLASTX Nucleotide query sequence Nucleotide database Compares all reading frames of query with all reading frames of the database Variations on a theme

BLAST at NCBIhttp://www.ncbi.nlm.nih.gov/BLAST/ • Very fast computer dedicated to running BLAST searches • Many databases that are always up to date • Nice simple web interface • But you still need to knowledge about BLAST to use it properly

Performing a simple BLAST search • We will now do a small exercise together • The purpose of the exercise is simply to performing a simple BLAST search ”hands on” • Open a web browser on the page http://www.cbs.dtu.dk/dtucourse/cookbooks/ljj/exercise1.html

The most common and effective way to ruin your database search • What you should never ever do: take the nucleotide sequence of a gene and compare it with a database at the nucleotide level • Unfortunately this is a very intuitive thing to do • On the NCBI BLAST homepage nucleotide search methods are listed before protein search – making it even more intuitive • What you should do instead • Extract the coding part of the DNA sequence, translate it, and search with the resulting protein sequence • Use a search method (such as BLASTX or TBLASTX) which compares the sequences at the protein level

The limits of sequence similarity

Expectation values in BLAST • BLAST uses precomputed extreme value distributions to calculate E-values from alignment scores • For this reason BLAST only allows certain combinations of substitution matrices and gap penalties • This also means that the fit is based on a different data set than the one you are working on • A word of caution: BLAST tends to overestimate the significance of its matches • E-values from BLAST are fine for identifying sure hits • One should be careful using BLAST’s E-values to judge if a marginal hit can be trusted

Evaluating BLAST results • We will now do a second exercise together • The main point of this exercise is careful interpretation of the BLAST output • Open a web browser on the page http://www.cbs.dtu.dk/dtucourse/cookbooks/ljj/exercise2.html

Pairwise alignment of hemoglobin alpha chain and myoglobin 24.7% identity; Global alignment score: 130 10 20 30 40 50 HBA_HU VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKG--- ::: .. : .:.:: : .. .: . : :.: : : : : .: . :..:. MYG_PH VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED 10 20 30 40 50 60 60 70 80 90 100 110 HBA_HU ---HGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNF-KLLSHCLLVTLAAHL :: : :: . . :. :.. :: : .. :... ...:. .. .: .. MYG_PH LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI-PIKYLEFISEAIIHVLHSRH 70 80 90 100 110 120 130 140 HBA_HU PAEFTPAVHASLDKFLASVSTVLTSKYR------ :..: ......: : ...::. MYG_PH PGDFGADAQGAMNKALELFRKDIAAKYKELGYQG 120 130 140 150

A multiple sequence alignment of globins HBB_HUMAN --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST HBB_HORSE --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN HBA_HUMAN ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS- HBA_HORSE ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS- MYG_PHYCA ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT GLB5_PETMA PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT LGB2_LUPLU --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE *: : : * . : .: * : * : . HBB_HUMAN PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL HBB_HORSE PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL HBA_HUMAN ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL HBA_HORSE ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL MYG_PHYCA EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF GLB5_PETMA ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV LGB2_LUPLU VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV . .:: *. : . : *. * . : .

Why multiple alignment is better • More sequences contain more information • Multiple sequence alignment allows us to compare all related proteins simultaneously • It allows us to identify features that are conserved among the sequences • Using a multiple sequence alignment (a profile) one can find more related sequences than by simple pairwise comparison

Coffee break Coffee break quiz: Why is the lasT gene inE. coli called lasT? Did some researchers fail to get the joke?

An iterative scheme for using profiles in database searches • Search your sequence against a large database using a pairwise alignment method (often BLAST) to obtain a set of closely related sequences • Make a multiple sequence alignment (using ClustalW) and estimate a profile • Search the profile against the database in an attempt to find more distantly related sequences • Include these in the profile and redo the profile search

If only life was so simple … • In the databases one may find large cluster of almost identical sequences • These will heavily bias the profile towards ”their sequence” • To avoid this a sequence weighing scheme must be used during construction of the profile • How should one estimate the frequencies of rare mutations that have not been observed • A more general problem: What to do when you have too few observations to make a reliable estimate of a frequency • The solution is called regularization which involved using prior knowledge on mutations (such as substitution matrices)

Regularization by pseudo counts • In addition to the real counts actually observed in the sequences some extra pseudo counts are added • The simplest approach is to simply add 1 to all counters before calculating sequences • PSI-BLAST adds pseudo counts based on observations multiplied by a substitution matrix • This means that pseudo counts are mainly added to the amino acids which are similar to the observed ones • The number of pseudo counts is adjusted so that pseudo counts are mainly used when few real counts are available

An overview of PSI-BLAST • A fast heuristic method for doing profile searches which is almost as good as ”the real thing” • Outline of the algorithm • First ordinary BLAST is used to find close homologs • Rather than making a real multiple alignment the close homologs are all just aligned to the query sequence (a master-slave alignment) • A profile is constructed using a very simple empirical weighing scheme combined with substitution matrix pseudo-counts • Ignoring the positional variation of indels the profile is again searched against the database

PSI-BLAST’s E-values • BLAST generally tends to overestimated the significance of database hits • PSI-BLAST E-values are not the E-value of the query sequence matching the database sequence • Instead the E-values represent the expectation value of the profile matching the database sequence • The profile might be wrong due the spurious hits in earlier iterations!

Using PSI-BLAST • We will now use PSI-BLAST to find more homologs to our query sequence • Again the emphasis is on the interpretation of the results • Open a web browser on the page http://www.cbs.dtu.dk/dtucourse/cookbooks/ljj/exercise3.html

Conserved domain BLAST • PSI-BLAST attempts to build a profile for the query sequence and search it against a sequence database • CD-BLAST instead builds a database of profiles and searches the query sequence against this • This means that CD-BLAST is not iterative and thus faster • CD-BLAST works for sequences with no close homologs • The profiles come from the PFAM database which is checked by experts to make sure that no unrelated sequence are included in the profiles • However CD-BLAST can only identify conserved domains which are in the PFAM database

Is it really worth the trouble? • Yes! • Profile based search methods like PSI-BLAST have been shown to find ~3 times as many homologs without increasing the number of false positives • This essentially translates into three times higher chance of finding a homolog with known structure or function • Using profiles rather than single sequences improved secondary structure prediction by ~10%

Searching for conserved protein domains using CD-BLAST • We will now use CD-BLAST to see if the query sequence has matches to any known protein families • The results of this search should be compared to those found using PSI-BLAST • Open a web browser on the page http://www.cbs.dtu.dk/dtucourse/cookbooks/ljj/exercise4.html

Important things to remember when using alignment to search databases • When searching in databases, size does matter! • Searching large databases take very long time • The significance of matches drops when the database is expanded • Doing things differently can lead to different conclusions • Nucleotide comparison vs. protein comparison • CD-BLAST vs. PSI-BLAST • Think before and after you search • The obvious thing to do is not always the right thing to do • E-values should be interpreted with care • Conclusions based on matches should be drawn with greater care

Database Searching