
Having a BLAST

BLAST algorithms. Having a BLAST. MLW2013, 2011. BiGCaT bioinformatics. Topics of this lecture: introduction to BLAST, details on the BLAST algorithm, performing a BLAST, pitfalls, and advanced BLAST (PSI-BLAST, PHI-BLAST).


Presentation Transcript


  1. BLAST algorithms Having a BLAST MLW2013, 2011 BiGCaT bioinformatics

  2. Topics of this lecture • Introduction to BLAST • Details on the BLAST algorithm • Performing a BLAST • Pitfalls • Advanced BLAST • PSI-BLAST • PHI-BLAST

  3. Introduction to BLAST

  4. History of BLAST • Local alignment: • alignment may contain just a portion of either sequence • appropriate for finding matched domains (or limited regions of similarity) between sequences • local alignment is almost always used for database searches. • Smith & Waterman algorithm: • Advantage: guaranteed to find optimal local alignments • Disadvantage: computationally VERY expensive

  5. History of BLAST (2) • Myers and Miller (1988) sought to improve the alignment algorithms so local alignment required less time and memory • BLAST: Basic Local Alignment Search Tool • a heuristic approximation for the Smith & Waterman algorithm • allows rapid comparison of a query sequence against a database, or alignment of two sequences • Advantage: runs much faster than S&W (50 times faster) • Disadvantage: does not necessarily find the optimal solution

  6. Why BLAST? • BLAST searching is fundamental to understanding the relatedness of any query sequence to other known proteins or DNA sequences. • Applications include: • identifying orthologs and paralogs • discovering new genes or proteins • discovering variants of genes or proteins • investigating expressed sequence tags (ESTs) • exploring protein structure and function

  7. The BLAST algorithm in a nutshell • Three phases: • Phase 1: compile a list of words from a query sequence to search for in the database. • Phase 2: scan the database for hits to the words in the list from phase 1 • Phase 3: extend the hits in either direction until the alignment score drops below a certain cut-off • The algorithm will be explained in more detail next time... • Now we will focus on how to apply it using the BLAST website

  8. Details on the BLAST algorithm

  9. FSG | SGT | GTW | TWY | WYA • The query is split into subwords of a certain length • The length is determined by the word size parameter • For each of these words, find all words of equal length that are similar enough: • The pairwise alignment score threshold parameter T gives the minimum score for words to be put in the list • Combine all these words as input for Phase 2

  10. Query wordlist. Step 1: compile a list of words from the query sequence (for example, word size w = 3). Example: for a human RBP query …FSGTWYA… the words are FSG, SGT, GTW, TWY, WYA.
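The word-splitting step can be sketched in a few lines of Python. This is only an illustration of the sliding window described on the slide; the function name `query_words` is made up for this example and is not part of any BLAST tool.

```python
def query_words(query: str, w: int = 3) -> list[str]:
    """Return every overlapping word of length w in the query, left to right."""
    # A query of length n yields n - w + 1 words (none if the query is shorter than w).
    return [query[i:i + w] for i in range(len(query) - w + 1)]


# The RBP fragment from the slide, with word size w = 3.
words = query_words("FSGTWYA", w=3)
print(words)
```

For the fragment FSGTWYA this produces the five words listed on the slide: FSG, SGT, GTW, TWY, WYA.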

  11. Changing word size w and threshold T. Sensitivity is better and the search is slower with large w and lower T; the search is faster and sensitivity is worse with small w and higher T. For proteins, the default word size is 3 (this yields a more accurate result than 2).

  12. Matching wordlist (T = 11). Step 2: Compile a list of matching words for each query word, given a pairwise alignment score threshold T. Example: GTW for T = 11.
      Word   Scores per residue   Total
      Neighborhood word hits > threshold:
      GTW    6, 5, 11             22
      GSW    6, 1, 11             18
      ATW    0, 5, 11             16
      NTW    0, 5, 11             16
      GTY    6, 5, 2              13
      Neighborhood word hits < threshold:
      GNW                         10
      GAW                         9
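As a rough sketch of how the matching wordlist is built, the Python below scores candidate words against the query word GTW and keeps those that reach the threshold T. The score table contains only the residue pairs whose scores appear on the slide, not a full substitution matrix, and the names `SLIDE_SCORES`, `word_score` and `matching_words` are hypothetical. Treating a score equal to T as a match is also an assumption of this sketch.

```python
# Residue-pair scores copied from the slide (a small subset of a scoring matrix).
SLIDE_SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("T", "S"): 1, ("G", "A"): 0, ("G", "N"): 0, ("W", "Y"): 2,
}


def pair_score(a: str, b: str) -> int:
    """Look up a residue pair; the table is symmetric, so try both orders."""
    if (a, b) in SLIDE_SCORES:
        return SLIDE_SCORES[(a, b)]
    return SLIDE_SCORES[(b, a)]


def word_score(query_word: str, candidate: str) -> int:
    """Sum the per-position scores of two words of equal length."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))


def matching_words(query_word: str, candidates: list[str], threshold: int):
    """Keep (word, score) for candidates scoring at least the threshold (assumed >=)."""
    scored = [(c, word_score(query_word, c)) for c in candidates]
    return [(c, s) for c, s in scored if s >= threshold]


candidates = ["GTW", "GSW", "ATW", "NTW", "GTY"]
for word, score in matching_words("GTW", candidates, threshold=11):
    print(word, score)
```

With T = 11 all five candidates are kept; raising the threshold in this sketch to 17 would keep only GTW and GSW, which shows how a higher T shortens the list.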

  13. Phase 1: compile a list of words (4) Pairwise alignment scores between words are determined using a scoring matrix such as BLOSUM62

  14. Results of changing threshold T (RBP sequence)

  15. Phase 1: compile a list of words (5) • After matching words have been collected for each word in the original list, everything is combined as input for the database search: the query words from …FSGTWYA… (FSG, SGT, GTW, TWY, WYA) together with their matching words (for GTW: GTW, GSW, ATW, NTW, GTY, ...)

  16. Phase 2: scan the database. Scan the database for entries that match the compiled list of words from phase 1 (FSG | SGT | GTW | TWY | WYA). This is fast and relatively easy. Example database entry: makivlcmvllafgrqMKGLDIQKVAGTWYSLAMAASDrrfilqailssfedvcdqlsklsfil
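A minimal sketch of the scanning phase, assuming the word list is held in a Python set and the database entry is a plain string. The function name `scan_for_hits` is invented for this example, and upper-casing the whole entry (the slide shows part of it in lower case) is an assumption made to keep the lookup simple. Real implementations use faster lookup structures than a sliding window.

```python
def scan_for_hits(subject: str, wordlist: set[str], w: int = 3) -> list[tuple[int, str]]:
    """Return (position, word) for every length-w window of subject that is in wordlist."""
    subject = subject.upper()
    hits = []
    for i in range(len(subject) - w + 1):
        window = subject[i:i + w]
        if window in wordlist:  # set membership makes each check cheap
            hits.append((i, window))
    return hits


# Query words plus the matching words for GTW, as collected in phase 1.
wordlist = {"FSG", "SGT", "GTW", "TWY", "WYA", "GSW", "ATW", "NTW", "GTY"}
subject = "makivlcmvllafgrqMKGLDIQKVAGTWYSLAMAASDrrfilqailssfedvcdqlsklsfil"

for position, word in scan_for_hits(subject, wordlist):
    print(position, word)
```

On the slide's example entry this reports two overlapping hits, GTW and TWY, both inside the GTWY stretch.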

  17. When two hits are found in close proximity to each other, these hits are extended. Extending is continued until the alignment is no longer strong enough.
      Query (RBP):         KENFDKARFSGTWYAMAKKDPEG  50
      Hit (lactoglobulin): MKGLDIQKVAGTWYSLAMAASD.  44

  18. Phase 3: extend hits • When a match between a “word” and a database entry is found (a hit): • Extend the alignment of the hit in either direction to find high-scoring segment pairs (HSPs) • If the score is sufficiently high: gapped extension • Keep track of the score (again the scoring matrix is used) • Stop when the score drops below some cutoff
      Query (RBP):         KENFDKARFSGTWYAMAKKDPEG  50
      Hit (lactoglobulin): MKGLDIQKVAGTWYSLAMAASD.  44
      The hit is extended in both directions.
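The ungapped part of the extension can be sketched as follows. Two simplifications are assumptions of this example and not taken from the slides: scoring uses a toy scheme (+2 for identical residues, -1 otherwise) instead of a substitution matrix, and the stopping rule is a drop-off rule (stop once the running score falls more than `x_drop` below the best score seen). All function names are hypothetical.

```python
def match_score(a: str, b: str) -> int:
    """Toy scoring (assumption): +2 for identical residues, -1 otherwise."""
    return 2 if a == b else -1


def extend_one_direction(scores, x_drop: int) -> tuple[int, int]:
    """Walk away from the seed; return (best score gain, residues kept)."""
    best_gain, best_len, running = 0, 0, 0
    for n, s in enumerate(scores, start=1):
        running += s
        if running > best_gain:
            best_gain, best_len = running, n
        elif best_gain - running > x_drop:
            break  # the score dropped too far below the best seen so far
    return best_gain, best_len


def extend_hit(query: str, subject: str, q_pos: int, s_pos: int, w: int, x_drop: int = 3):
    """Extend a word hit of length w without gaps; return (score, query part, subject part)."""
    seed = sum(match_score(query[q_pos + k], subject[s_pos + k]) for k in range(w))
    right = (match_score(a, b) for a, b in zip(query[q_pos + w:], subject[s_pos + w:]))
    left = (match_score(a, b) for a, b in zip(reversed(query[:q_pos]), reversed(subject[:s_pos])))
    r_gain, r_len = extend_one_direction(right, x_drop)
    l_gain, l_len = extend_one_direction(left, x_drop)
    total = seed + l_gain + r_gain
    return total, query[q_pos - l_len:q_pos + w + r_len], subject[s_pos - l_len:s_pos + w + r_len]


query = "KENFDKARFSGTWYAMAKKDPEG"   # RBP fragment from the slide
subject = "MKGLDIQKVAGTWYSLAMAASD"  # lactoglobulin fragment from the slide
print(extend_hit(query, subject, query.find("GTW"), subject.find("GTW"), w=3))
```

With this toy scoring the GTW seed grows by one residue to GTWY and then stops. With a real substitution matrix, similar but non-identical residues also score positively, so the extension would typically reach further.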

  19. Phase 3: extend hits (2) Some history: in the original (1990) implementation of BLAST, hits were extended in either direction. In a 1997 refinement of BLAST: • Two independent hits are required • The hits must occur in close proximity to each other • With this modification, only one seventh as many extensions occur, greatly reducing the time required for a search.
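The two-hit requirement can be illustrated with a small filter over word hits. This is a sketch under stated assumptions: hits are given as (query position, database position) pairs, "close proximity" is taken to mean two non-overlapping hits on the same diagonal within a fixed window, and the window of 40 residues is an illustrative default. The function name `two_hit_seeds` is hypothetical.

```python
def two_hit_seeds(hits, w: int = 3, window: int = 40):
    """Return (diagonal, first_q, second_q) for consecutive hits that trigger an extension."""
    # Group hits by diagonal: hits on one diagonal align without gaps.
    by_diagonal: dict[int, list[int]] = {}
    for q_pos, s_pos in hits:
        by_diagonal.setdefault(s_pos - q_pos, []).append(q_pos)

    seeds = []
    for diagonal, q_positions in by_diagonal.items():
        q_positions.sort()
        # Only consecutive hits on the diagonal are compared in this sketch.
        for first, second in zip(q_positions, q_positions[1:]):
            distance = second - first
            if w <= distance <= window:  # non-overlapping and close enough
                seeds.append((diagonal, first, second))
    return seeds


# Hypothetical hits: three on diagonal 0, one isolated hit on another diagonal.
example_hits = [(10, 10), (11, 11), (16, 16), (3, 30)]
print(two_hit_seeds(example_hits))
```

In this example the hits at 10 and 11 overlap and are not counted as a pair, the hits at 11 and 16 form a pair, and the isolated hit never triggers an extension. Skipping isolated hits is what reduces the number of extensions.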

  20. Different matrices available. Just as in other sequence alignment applications, matrices tuned to more and less divergent sequences can be used. Recall that a higher PAM number corresponds to a lower BLOSUM number!

  21. More on substitution matrices • For blastp several substitution matrices are available: • PAM30 • PAM70 • BLOSUM45 • BLOSUM62 (default) • BLOSUM80 • Others… • They are used for scoring local alignments in phase 1 (word list creation) and phase 3 (hit extension) of the BLAST algorithm

  22. Results of changing the matrix (RBP sequence)

  23. The expect value E. The expect value E of a score S is the number of alignments with scores greater than or equal to S that are expected to occur by chance in a database search. An E value is related to a probability value p: $p = 1 - e^{-E}$

  24. The expect value E (2) • Very small E values are very similar to p values. • E values of about 1 to 10 are far easier to interpret than corresponding p values.
      E        p
      10       0.99995460
      5        0.99326205
      2        0.86466472
      1        0.63212056
      0.1      0.09516258 (about 0.1)
      0.05     0.04877058 (about 0.05)
      0.001    0.00099950 (about 0.001)
      0.0001   0.0001000 (identical!)
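The relation between E and p (p equals one minus e to the power of minus E) is easy to check numerically. The short Python sketch below recomputes the values for the E values listed above; the function name `p_from_e` is made up for this example.

```python
import math


def p_from_e(e_value: float) -> float:
    """Convert an expect value E to a probability p = 1 - exp(-E)."""
    # expm1(x) computes exp(x) - 1 accurately for small x, so -expm1(-E) = 1 - exp(-E).
    return -math.expm1(-e_value)


for e in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
    print(f"E = {e:<7} p = {p_from_e(e):.8f}")
```

For small E the two numbers are nearly identical, while for E of 1 or more p saturates towards 1. This is why E values are easier to read in that range.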

  25. Results of changing the expect value threshold (RBP sequence). Note: don't confuse the expect value threshold with the threshold T mentioned before

  26. Results of changing E, T, Matrix (overview)

  27. Performing a BLAST

  28. BLAST in five steps Go to the BLAST website: • (1) Select the BLAST program • (2) Choose the query sequence • (3) Choose the database to search • (4) Choose the sub-program • (5) Choose optional parameters Then click “BLAST” and you're off!

  29. Step 1: Select the BLAST program • blastn (nucleotide BLAST) • blastp (protein BLAST) • blastx (translated BLAST) • tblastn (translated BLAST) • tblastx (translated BLAST)

  30. Step 1: Select the BLAST program (2) • blastn: • BLAST a nucleotide query sequence to a nucleotide database • blastp: • BLAST a protein query sequence to a protein database • blastx: • BLAST all six frame translations of a nucleotide query sequence to a protein database • tblastn: • BLAST a protein query sequence to all six frame translations of a nucleotide database • tblastx: • BLAST all six frame translations of a nucleotide query sequence to all six frame translations of a nucleotide database
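To make "all six frame translations" concrete, the sketch below splits a nucleotide sequence into its six reading frames: three offsets on the given strand and three on the reverse complement. It stops at the codon level; turning codons into amino acids would additionally need a genetic code table, which is left out here. The function names and the example sequence are invented for illustration.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frames(dna: str) -> dict[str, list[str]]:
    """Return the codons of frames +1, +2, +3 (given strand) and -1, -2, -3 (reverse strand)."""
    frames = {}
    for label, strand in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Step through the strand in triplets, dropping an incomplete final codon.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{label}{offset + 1}"] = codons
    return frames


for name, codons in six_frames("ATGGCCTAA").items():
    print(name, " ".join(codons))
```

A search of a translated query against a translated database compares each of the six query frames with each of the six database frames, which is why it is the most expensive of the five programs.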

  31. Step 1: Select the BLAST program (3)
      Program   Input     Database   Comparisons
      blastn    DNA       DNA        1
      blastp    protein   protein    1
      blastx    DNA       protein    6
      tblastn   protein   DNA        6
      tblastx   DNA       DNA        36

  32. Step 2: Choose the query sequence. This can be an accession number or a sequence in FASTA format

  33. Step 2: Choose the query sequence (2) Recall the details of the FASTA format • First line is a description • Always starts with > • Next lines form the sequence • Layout, formatting, and invalid characters are ignored
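The FASTA layout described above can be read with a few lines of Python. This is a minimal sketch, not the parser used by the BLAST website: the function name `parse_fasta` and the example record are made up, and the choice to keep only letters plus `*` and `-` is an assumption about what counts as a valid sequence character.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return a list of (description, sequence) records from FASTA-formatted text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is not None:
            # Ignore layout such as spaces and position numbers inside sequence lines.
            chunks.append("".join(ch for ch in line if ch.isalpha() or ch in "*-").upper())
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


example = """>example_protein hypothetical record for illustration
KENFDKARFS GTWYA
MAKKDPEG 23
"""
print(parse_fasta(example))
```

The spaces and the trailing number in the example are dropped, so the two sequence lines are joined into a single sequence.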

  34. Step 3: Choose the database
      nr = non-redundant (most general database)
      Refseq = all reference sequences
      for nucleotide BLAST: est = database of expressed sequence tags
      for protein BLAST: swissprot = protein database
      Optionally, select an organism

  35. Step 4: Choose the sub-program Sub-program availability depends on selected main program

  36. Step 5: Select optional parameters (further explained next time) • Expect • Word size • Scoring matrix • Filter

  37. Step 5: Select optional parameters (2) Filter: low complexity regions (e.g. repeats) are not used in the BLAST search

  38. Looking at BLAST output. The output page shows: the program, the query, the database, reports (e.g. taxonomy), domains, and the hits

  39. Looking at BLAST output (2) High scores = low E-values. When close to 0, an E-value resembles a p-value. Cut-off: 0.05? 0.00005? 0.000000005? More details are given next time

  40. Looking at BLAST output (3) Clicking on a result shows the alignment

  41. Looking at BLAST output (4) Format options can be changed after getting the results, without rerunning BLAST

  42. BLAST format options: view multiple sequence alignment; multiple sequence alignment only showing differences

  43. Finding your settings in the output: BLOSUM62 matrix; Expect value threshold = 10; Threshold T = 11

  44. Pitfalls

  45. Problem: match with high E. Sometimes a real match has a high E value. Possible solution: BLAST the resulting hit sequence as a new query to confirm the similarity

  46. Example: RBP4 and PAEP Problem: Low score, E is 0.49 and only 24% identity… …but they are indeed homologous. Try a BLAST search with PAEP as a query and you will find many other lipocalins!

  47. Problem: E and score don’t say everything. Sometimes a similar E value and score occur for: • a short exact match (large number of identities/positives) • a long less exact match (low number of identities/positives)

  48. Problem: multidomain proteins Problem: BLAST with a multi-domain protein may result in hits at just the domain(s) Example: searching bacterial sequences with the pol protein sequence

  49. Advanced BLAST: PSI-BLAST

  50. PSI-BLAST • PSI-BLAST: Position specific iterated BLAST • The purpose of PSI-BLAST is to look deeper into the database for matches to your query protein sequence by using results obtained so far in new rounds of BLASTing. • All results with an E-value below a certain threshold are included, but you can select/unselect hits by hand • Useful for finding distant relatives of a protein.
