BLAST algorithms Having a BLAST MLW2013, 2011 BiGCaT bioinformatics
Topics of this lecture • Introduction to BLAST • Details on the BLAST algorithm • Performing a BLAST • Pitfalls • Advanced BLAST • PSI-BLAST • PHI-BLAST
History of BLAST • Local alignment: • alignment may contain just a portion of either sequence • appropriate for finding matched domains (or limited regions of similarity) between sequences • local alignment is almost always used for database searches. • Smith & Waterman algorithm: • Advantage: guaranteed to find optimal local alignments • Disadvantage: computationally VERY expensive
History of BLAST (2) • Myers and Miller (1988) sought to improve the alignment algorithms so local alignment required less time and memory • BLAST: Basic Local Alignment Search Tool • a heuristic approximation of the Smith & Waterman algorithm • allows rapid comparison of a query sequence against a database, or alignment of two sequences • Advantage: runs much faster than S&W (50 times faster) • Disadvantage: does not necessarily find the optimal solution
Why BLAST? • BLAST searching is fundamental to understanding the relatedness of any query sequence to other known proteins or DNA sequences. • Applications include: • identifying orthologs and paralogs • discovering new genes or proteins • discovering variants of genes or proteins • investigating expressed sequence tags (ESTs) • exploring protein structure and function
The BLAST algorithm in a nutshell • Three phases: • Phase 1: compile a list of words from a query sequence to search for in the database. • Phase 2: scan the database for hits to the words in the list from phase 1 • Phase 3: extend the hits in either direction until the alignment score drops below a certain cut-off • The algorithm will be explained in more detail next time... • Now we will focus on how to apply it using the BLAST website
FSG | SGT | GTW | TWY | WYA • The query is split into subwords of a certain length • The length is determined by the word size parameter • For each of these words, find all words of equal length that are similar enough: • The pairwise alignment score threshold parameter T gives the minimum score for words to be put in the list • Combine all these words as input for Phase 2
Query wordlist Step 1: compile a list of words from the query sequence (for example word size w = 3) Example: for a human RBP query …FSGTWYA… the words are FSG, SGT, GTW, TWY, WYA
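The word list of Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not BLAST's own code; the function name query_words is made up for this example.

```python
def query_words(query, w=3):
    """Return every overlapping word of length w in the query, in order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


# The RBP fragment from the slide, with word size w = 3.
print(query_words("FSGTWYA"))
```

A query of length n yields n - w + 1 words, so the 7-residue fragment gives 5 words.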
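Changing word size w and threshold T • Figure: sensitivity (worse to better) plotted against search speed (faster to slower); the slow, sensitive end is labelled "large w, lower T" and the fast, less sensitive end "small w, higher T" • For proteins, default word size is 3 (This yields a more accurate result than 2)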
Matching wordlist (T = 11) Step 2: Compile a list of matching words for each query word, given a pairwise alignment score threshold T. Neighborhood words for GTW with T = 11:
Word   Scores per position   Total
GTW    6, 5, 11              22
GSW    6, 1, 11              18
ATW    0, 5, 11              16
NTW    0, 5, 11              16
GTY    6, 5, 2               13
(the words above are neighborhood word hits, total > threshold)
GNW                          10
GAW                          9
(the last two words are below the threshold)
(4) Pairwise alignment scores between words are determined using a scoring matrix such as BLOSUM62
Phase 1: compile a list of words (5) • After matching words have been collected for each word in the original list, everything is combined as input for the database search • Figure: the query words FSG, SGT, GTW, TWY, WYA from …FSGTWYA… together with their matching words (for GTW: GTW, GSW, ATW, NTW, GTY, …) form the list used to search the database
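The scoring of candidate words against the threshold T can be sketched as follows. This is a hedged illustration: the dictionary holds only the per-position scores printed on the slide for GTW, not the full BLOSUM62 matrix, and the names SLIDE_SCORES, word_score and matching_words are made up for this example. Words are kept when their score is at least T, following the "minimum score" wording of the slide.

```python
# Per-position substitution scores copied from the slide (symmetric pairs).
# Assumption: this small subset stands in for a full matrix such as BLOSUM62.
SLIDE_SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("T", "S"): 1, ("G", "A"): 0, ("G", "N"): 0, ("W", "Y"): 2,
}


def pair_score(a, b):
    """Look up a residue pair in either order."""
    if (a, b) in SLIDE_SCORES:
        return SLIDE_SCORES[(a, b)]
    return SLIDE_SCORES[(b, a)]


def word_score(query_word, candidate):
    """Sum the substitution scores position by position."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))


def matching_words(query_word, candidates, threshold):
    """Keep the candidates whose score reaches the threshold T."""
    scored = [(c, word_score(query_word, c)) for c in candidates]
    return [(c, s) for c, s in scored if s >= threshold]


candidates = ["GTW", "GSW", "ATW", "NTW", "GTY"]
for word, score in matching_words("GTW", candidates, threshold=11):
    print(word, score)
```

Raising the threshold shortens the list: with threshold=17 only GTW and GSW remain.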
Phase 2: scan the database FSG | SGT | GTW | TWY | WYA makivlcmvllafgrqMKGLDIQKVAGTWYSLAMAASDrrfilqailssfedvcdqlsklsfil Scan the database for entries that match the compiled list of words from phase 1. This is fast and relatively easy.
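A minimal sketch of the scan, assuming the word list from phase 1 is held in a Python set and all words have the same length. The function name scan_database is made up for this example; the actual BLAST implementation uses more efficient lookup structures than this simple sliding window.

```python
def scan_database(subject, words):
    """Return (position, word) for every listed word found in the subject."""
    subject = subject.upper()
    word_set = set(words)
    w = len(next(iter(word_set)))  # assumption: all words share one length
    hits = []
    for i in range(len(subject) - w + 1):
        window = subject[i:i + w]
        if window in word_set:
            hits.append((i, window))
    return hits


# Database entry from the slide; query words plus the matching words for GTW.
entry = "makivlcmvllafgrqMKGLDIQKVAGTWYSLAMAASDrrfilqailssfedvcdqlsklsfil"
word_list = ["FSG", "SGT", "GTW", "TWY", "WYA", "GSW", "ATW", "NTW", "GTY"]
print(scan_database(entry, word_list))
```

Positions are 0-based offsets into the database entry.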
KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)
(Figure: the hit is extended in both directions.) When two hits are found in close proximity to each other, these hits are extended. Extension continues until the alignment is no longer strong enough.
Phase 3: extend hits • When a match between a “word” and a database entry is found (a hit): • Extend the alignment of the hit in either direction to find high-scoring segment pairs (HSPs) • If the score is sufficiently high: gapped extension • Keep track of the score (again the scoring matrix is used) • Stop when the score drops below some cutoff • Figure: the RBP (query) and lactoglobulin (hit) alignment shown above, with the hit extended in both directions
Phase 3: extend hits (2) Some history • In the original (1990) implementation of BLAST, hits were extended in either direction. • In a 1997 refinement of BLAST: • Two independent hits are required • The hits must occur in close proximity to each other • With this modification, only one seventh as many extensions occur, greatly reducing the time required for a search.
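The ungapped extension step can be sketched as below. Several things here are assumptions for illustration only: a flat match/mismatch score (+5/-4) replaces the scoring matrix, the cutoff is modelled as a maximum allowed drop (x_drop) below the best score seen so far, and the function name extend_hit is made up. Gapped extension is not shown.

```python
def extend_hit(query, subject, q_start, s_start, word_size,
               match=5, mismatch=-4, x_drop=10):
    """Extend a word hit without gaps; return (q_begin, q_end, score).

    q_end is exclusive. Extension in a direction stops once the running
    score falls more than x_drop below the best score seen so far.
    """
    def pair(a, b):
        return match if a == b else mismatch

    best = sum(pair(query[q_start + k], subject[s_start + k])
               for k in range(word_size))

    # Extend to the right of the word.
    running, best_right = best, word_size
    qi, si = q_start + word_size, s_start + word_size
    while qi < len(query) and si < len(subject):
        running += pair(query[qi], subject[si])
        if running > best:
            best, best_right = running, qi - q_start + 1
        elif best - running > x_drop:
            break
        qi, si = qi + 1, si + 1

    # Extend to the left of the word, starting from the best score so far.
    running, best_left = best, 0
    qi, si = q_start - 1, s_start - 1
    while qi >= 0 and si >= 0:
        running += pair(query[qi], subject[si])
        if running > best:
            best, best_left = running, q_start - qi
        elif best - running > x_drop:
            break
        qi, si = qi - 1, si - 1

    return q_start - best_left, q_start + best_right, best


query = "KENFDKARFSGTWYAMAKKDPEG"    # RBP fragment from the slide
subject = "MKGLDIQKVAGTWYSLAMAASD"   # lactoglobulin fragment from the slide
begin, end, score = extend_hit(query, subject, 10, 10, word_size=3)
print(query[begin:end], score)
```

With these toy scores the GTW hit grows by one residue to GTWY; a real substitution matrix also rewards similar but non-identical residues and can give a longer segment.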
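The two-hit requirement can be sketched as a filter on word hits. The slide only says the hits must be close together; the sketch additionally assumes that the two hits lie on the same diagonal (same offset between query and database position), do not overlap, and are at most window positions apart. The default of 40 and the function name two_hit_seeds are assumptions for this example.

```python
from collections import defaultdict


def two_hit_seeds(hits, word_size=3, window=40):
    """Return the hits that have an earlier hit nearby on the same diagonal.

    hits is a list of (query_pos, subject_pos) pairs.
    """
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            distance = second - first
            # Non-overlapping (distance >= word_size) and within the window.
            if word_size <= distance <= window:
                seeds.append((second, second + diagonal))
    return seeds


# Two hits on diagonal 0 and one isolated hit on another diagonal.
print(two_hit_seeds([(10, 10), (13, 13), (5, 30)]))
```

Only the seeds returned here would go on to the extension step, which is how the number of extensions is reduced.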
Different matrices available • Just as in other sequence alignment applications, matrices tuned to more and less divergent sequences can be used • Recall that a higher PAM number corresponds to a lower BLOSUM number!
More on substitution matrices • For blastp several substitution matrices are available: • PAM30 • PAM70 • BLOSUM45 • BLOSUM62 (default) • BLOSUM80 • Others… • They are used for scoring local alignments in phase 1 (word list creation) and phase 3 (hit extension) of the BLAST algorithm
The expect value E • The expect value E of a score S is the number of alignments with scores greater than or equal to S that are expected to occur by chance in a database search • An E value is related to a probability value p: $p = 1 - e^{-E}$
The expect value E (2) • Very small E values are very similar to p values. • E values of about 1 to 10 are far easier to interpret than corresponding p values.
E        p
10       0.99995460
5        0.99326205
2        0.86466472
1        0.63212056
0.1      0.09516258 (about 0.1)
0.05     0.04877058 (about 0.05)
0.001    0.00099950 (about 0.001)
0.0001   0.0001000 (identical!)
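The table can be reproduced from the relation p = 1 - e^(-E). A minimal sketch; the function name p_from_e is made up for this example.

```python
import math


def p_from_e(e_value):
    """Convert an expect value E into the probability p = 1 - exp(-E)."""
    # -expm1(-E) equals 1 - exp(-E) and stays accurate for very small E.
    return -math.expm1(-e_value)


for e_value in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
    print(f"{e_value:<8} {p_from_e(e_value):.8f}")
```

For small E the two numbers nearly coincide, which is why very small E values can be read almost like p values.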
Results of changing expect value threshold* E (RBP sequence) * don’t confuse the expect value threshold with the threshold T mentioned before
BLAST in five steps Go to the BLAST website: • (1) Select the BLAST program • (2) Choose the query sequence • (3) Choose the database to search • (4) Choose the sub-program • (5) Choose optional parameters Then click “BLAST” and you're off!
Step 1: Select the BLAST program blastn (nucleotide BLAST) blastp (protein BLAST) blastx (translated BLAST) tblastn (translated BLAST) tblastx (translated BLAST)
Step 1: Select the BLAST program (2) • blastn: • BLAST a nucleotide query sequence to a nucleotide database • blastp: • BLAST a protein query sequence to a protein database • blastx: • BLAST all six frame translations of a nucleotide query sequence to a protein database • tblastn: • BLAST a protein query sequence to all six frame translations of a nucleotide database • tblastx: • BLAST all six frame translations of a nucleotide query sequence to all six frame translations of a nucleotide database
Step 1: Select the BLAST program (3)
Comparisons   Program   Input     Database
1             blastn    DNA       DNA
1             blastp    protein   protein
6             blastx    DNA       protein
6             tblastn   protein   DNA
36            tblastx   DNA       DNA
Step 2: Choose the query sequence This can be an accession number or a sequence in FASTA format
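The six frame translations used by blastx, tblastn and tblastx can be sketched as follows: three reading frames on the given strand and three on the reverse complement. The sketch uses the standard genetic code, translates stop codons to "*" and unknown codons to "X"; the function names are made up for this example.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; 'X' for unknown codons."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return a dict such as {'+1': ..., '-3': ...} of the six translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCC").items():
    print(frame, protein)
```

tblastx compares each of the six query translations with each of the six database translations, which gives the 6 x 6 = 36 comparisons in the table.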
Step 2: Choose the query sequence (2) Recall the details of the FASTA format • First line is a description • Always starts with > • Next lines form the sequence • Layout, formatting, and invalid characters are ignored
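A minimal FASTA reader along the lines described above. It is a sketch under assumptions: "invalid characters are ignored" is read here as "keep letters only", so digits, spaces and symbols are dropped; the function name parse_fasta and the example header are made up.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Assumption: only letters belong to the sequence.
            chunks.append("".join(ch for ch in line if ch.isalpha()).upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>example_query RBP fragment (hypothetical header)
kenfdkarf sgtwy 14
AMAKKDPEG 23
"""
print(parse_fasta(example))
```

Line breaks, spacing, case and position numbers in the input do not change the parsed sequence.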
Step 3: Choose the database • nr = non-redundant (most general database) • Refseq = all reference sequences • for nucleotide BLAST: est = database of expressed sequence tags • for protein BLAST: swissprot = protein database • select organism
Step 4: Choose the sub-program Sub-program availability depends on selected main program
Step 5: Select optional parameters • Expect • Word size • Scoring matrix • Filter • Further explained next time
Step 5: Select optional parameters (2) Filter: low complexity regions (e.g. repeats) are not used in the BLAST search
Looking at BLAST output • Labelled parts of the results page: program, query, database, reports (e.g. taxonomy), domains, the hits
Looking at BLAST output (2) • High scores = low E-values • When close to 0, an E-value resembles a p-value • Cut-off: 0.05? 0.00005? 0.000000005? • More details are given next time
Looking at BLAST output (3) Clicking on a result shows the alignment
Looking at BLAST output (4) Format options can be changed after getting the results, without rerunning BLAST
BLAST format options: view multiple sequence alignment • multiple sequence alignment only showing differences
Finding your settings in the output • BLOSUM62 matrix • Expect value threshold = 10 • Threshold T = 11
Problem: match with high E • Problem: Sometimes a real match has a high E value • Possible solution: BLAST the hit sequence as a new query to confirm the similarity
Example: RBP4 and PAEP Problem: Low score, E is 0.49 and only 24% identity… …but they are indeed homologous. Try a BLAST search with PAEP as a query and you will find many other lipocalins!
Problem: E and score don’t say everything • Sometimes a similar E value and score occur for: • a short exact match (large number of identities/positives) • a long less exact match (low number of identities/positives)
Problem: multidomain proteins Problem: BLAST with a multi-domain protein may result in hits at just the domain(s) Example: searching bacterial sequences with the pol protein sequence
PSI-BLAST • PSI-BLAST: Position specific iterated BLAST • The purpose of PSI-BLAST is to look deeper into the database for matches to your query protein sequence by using results obtained so far in new rounds of BLASTing. • All results with an E-value below a certain threshold are included, but you can select/unselect hits by hand • Useful for finding distant relatives of a protein.
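The "position specific" part can be illustrated with a toy profile built from the hits of one round: each alignment column gets its own scores, which a next round could use in place of a general matrix. This is a strongly simplified sketch, not PSI-BLAST's actual procedure: it assumes ungapped hits of equal length, a uniform background, and a fixed pseudocount. The third sequence in the example and the name build_profile are made up.

```python
import math
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Return one dict of log-odds scores (residue -> score) per column."""
    background = 1 / len(AMINO)  # assumption: uniform background frequency
    profile = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO)
        total = sum(counts.values()) + pseudocount * len(AMINO)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO
        })
    return profile


# GTWYA (RBP) and GTWYS (lactoglobulin) come from the earlier slides;
# GSWYA is a hypothetical third hit added for illustration.
profile = build_profile(["GTWYA", "GTWYS", "GSWYA"])
for residue in "TSA":
    print(residue, round(profile[1][residue], 2))
```

In column 2 the conserved T scores highest, the S seen once scores lower, and an unseen residue such as A scores below zero, so the profile is more tolerant at variable positions than at conserved ones.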