CS 6990 Bioinformatics BLAST

CS 6990 Bioinformatics BLAST. Fall 2003, Dr. Susan Bridges. Overview: Basic Local Alignment Search Tool. BLAST is a collection of programs developed by Altschul et al. It is a simplification of the Smith-Waterman dynamic programming algorithm.

Presentation Transcript


  1. CS 6990 Bioinformatics: BLAST. Fall 2003. Dr. Susan Bridges. Department of Computer Science and Engineering, Bioinformatics

  2. Overview
     • BLAST stands for Basic Local Alignment Search Tool
     • BLAST is a collection of programs
     • Developed by Altschul et al.
     • Simplification of the Smith-Waterman dynamic programming algorithm
     • Like FASTA, it looks for exact matches of short words
     • Unlike FASTA, it scores using all values in a similarity matrix

  3. BLAST Terminology
     • Segment: a substring of a sequence
     • Segment pair of two sequences: a pair of segments of the same length (no gaps), one from each sequence
     • w-mer: a substring (or word) of w characters

  4. Goal
     • Form a gapless alignment of each segment pair and score the alignment using an amino acid substitution matrix.
     • Example (using PAM120):
          K   A   L   M   R
          V   A   K   N   S
         -4   3  -4  -3  -1
       Total score of alignment = -9
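The scoring on this slide can be reproduced with a short sketch. The dictionary below holds only the five residue-pair scores shown in the slide's example, not the full PAM120 matrix, and the function names are illustrative assumptions rather than part of any BLAST distribution.

```python
# Pair scores copied from the slide's PAM120 example; this is NOT the full matrix.
PAM120_EXAMPLE = {
    ("K", "V"): -4,
    ("A", "A"): 3,
    ("L", "K"): -4,
    ("M", "N"): -3,
    ("R", "S"): -1,
}


def pair_score(a, b, matrix):
    """Look up a symmetric substitution score for residues a and b."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def segment_pair_score(seg1, seg2, matrix):
    """Score a gapless alignment of two equal-length segments."""
    if len(seg1) != len(seg2):
        raise ValueError("segments in a segment pair must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seg1, seg2))


print(segment_pair_score("KALMR", "VAKNS", PAM120_EXAMPLE))  # -9
```

Because a segment pair has no gaps, the score is just the sum of one matrix lookup per column.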

  5. Steps in the Algorithm
     • Compile a list of high-scoring words in the query sequence
     • Find matches in the db for each high-scoring word and its synonyms
     • For each match in the db, extend the alignment in both directions

  6. Step 0 (optional)
     • Filter regions of low complexity or repeats.
     • Filtering is applied to the query sequence, not the db sequences. These regions are marked with an X in protein sequences and an N in nucleotide sequences and are then ignored by BLAST.
     • This makes the search focus on the more important parts of the sequence.

  7. Step 1
     • Compile a list of high-scoring words in the query sequence
     • Defaults are w=3 for proteins and w=11 for nucleic acid sequences
     • A query sequence of length n yields n-w+1 words
     • Each candidate word has a score t against a query word, computed using the scoring matrix
     • Threshold T: a word pair whose score t is above T is treated as a pair of synonyms (T is called the neighborhood word score threshold)

  8. Step 1 Example (w=2)
     Adipokinetic hormone II of migratory locust: q l n f s a g w
     Words: ql, ln, nf, fs, sa, ag, gw
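As a minimal sketch of the word list in this example, the overlapping w-mers can be generated with a sliding window. The function name `query_words` is an illustrative assumption.

```python
def query_words(query, w):
    """Return every overlapping w-mer of the query, in order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


words = query_words("qlnfsagw", 2)
print(words)       # ['ql', 'ln', 'nf', 'fs', 'sa', 'ag', 'gw']
print(len(words))  # 7, which is n - w + 1 = 8 - 2 + 1
```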

  9. Step 1 continued
     • Find all words in the db that are synonyms of the high-scoring query words

  10. Example continued (T=8, PAM120 Scoring Matrix)
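The synonym (neighborhood) search can be sketched by enumerating every possible w-mer and keeping those that score well against a query word. This is only an illustration: `toy_score` is a hypothetical stand-in for a real substitution matrix such as PAM120, the three-letter alphabet is arbitrary, and treating a score equal to T as passing is an assumption (the slides say only "above T").

```python
from itertools import product


def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix: +5 match, -2 mismatch."""
    return 5 if a == b else -2


def word_score(word1, word2, score):
    """Sum the position-by-position scores of two equal-length words."""
    return sum(score(a, b) for a, b in zip(word1, word2))


def neighborhood(query_word, alphabet, score, T):
    """Return all words over `alphabet` scoring at least T against query_word."""
    w = len(query_word)
    hits = []
    for letters in product(alphabet, repeat=w):
        candidate = "".join(letters)
        if word_score(query_word, candidate, score) >= T:
            hits.append(candidate)
    return hits


print(neighborhood("ql", "lnq", toy_score, T=3))  # ['ll', 'nl', 'ql', 'qn', 'qq']
print(neighborhood("ql", "lnq", toy_score, T=8))  # ['ql']
```

Raising T shrinks the list toward exact matches only, while lowering it lengthens the word list.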

  11. Another Example

  12. Step 2 (original BLAST)
      • For each word or synonym from the query sequence, search each db sequence for hits
      • Each hit is considered a seed alignment and is extended in both directions as long as the cumulative score can be increased. Extension is halted when one of the following occurs:
        • The cumulative alignment score falls off by the quantity X from its maximum achieved value
        • The cumulative score goes to zero or below due to the accumulation of one or more negative-scoring residue alignments
        • The end of either sequence is reached
      • High-scoring segment pairs are called HSPs
      • The highest-scoring segment pair for the pairwise comparison of the query sequence and a db sequence is referred to as the maximal-scoring segment pair (MSP)
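The three stopping rules above can be sketched as an ungapped extension from a seed hit. This is a simplified illustration, not the BLAST source: `toy_score` is a hypothetical scoring function, each direction is extended independently starting from the seed's score, and all names are assumptions.

```python
def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix: +5 match, -2 mismatch."""
    return 5 if a == b else -2


def _extend(pairs, start_score, score, X):
    """Walk outward over residue pairs; return (best_gain, residues_kept)."""
    running = best = start_score
    kept = 0
    for steps, (a, b) in enumerate(pairs, start=1):
        running += score(a, b)
        if running > best:
            best, kept = running, steps
        # Halt when the score falls X below its maximum or reaches zero or below.
        # Reaching the end of either sequence ends the loop on its own.
        if best - running >= X or running <= 0:
            break
    return best - start_score, kept


def extend_hit(query, subject, q_pos, s_pos, w, score, X):
    """Extend a w-long seed hit in both directions without gaps.

    Returns (total_score, (query_start, subject_start, length)).
    """
    seed = sum(score(a, b) for a, b in
               zip(query[q_pos:q_pos + w], subject[s_pos:s_pos + w]))
    right = zip(query[q_pos + w:], subject[s_pos + w:])
    left = zip(reversed(query[:q_pos]), reversed(subject[:s_pos]))
    gain_r, len_r = _extend(right, seed, score, X)
    gain_l, len_l = _extend(left, seed, score, X)
    return seed + gain_r + gain_l, (q_pos - len_l, s_pos - len_l, w + len_l + len_r)


# Seed "QL" at position 2 of both sequences; the extension recovers "QLNFSA".
print(extend_hit("TTQLNFSAGG", "CCQLNFSACC", 2, 2, 2, toy_score, X=4))
# (30, (2, 2, 6))
```

The extension keeps only the residues up to the point of maximum score, so the trailing mismatches that triggered the drop-off are not part of the reported segment pair.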

  13. Step 2 (BLAST2 or Gapped BLAST)
      • Uses a lower value for T in the previous step to give a longer word list
      • Uses short matched sequences on the same diagonal within distance A of each other as starting points for a longer ungapped alignment. Joined regions are extended as before, allowing small gaps.
      [Figure: word hits (x) plotted with the database sequence on one axis and the query sequence on the other]
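A minimal sketch of the same-diagonal test: two hits lie on the same diagonal when the difference between their database and query positions is equal. The function name and the hit representation are assumptions, and this simplified version does not check that the two words are non-overlapping.

```python
from collections import defaultdict


def two_hit_seeds(hits, A):
    """Find pairs of word hits on the same diagonal within distance A.

    `hits` is an iterable of (query_pos, db_pos) word-hit start positions.
    Returns a list of ((q1, d1), (q2, d2)) pairs ordered along each diagonal.
    """
    by_diagonal = defaultdict(list)
    for q, d in hits:
        by_diagonal[d - q].append((q, d))

    seeds = []
    for diagonal in sorted(by_diagonal):
        positions = sorted(by_diagonal[diagonal])
        # Compare each hit with the next one along the same diagonal.
        for first, second in zip(positions, positions[1:]):
            if 0 < second[0] - first[0] <= A:
                seeds.append((first, second))
    return seeds


hits = [(3, 10), (12, 19), (40, 47), (5, 30), (6, 50)]
print(two_hit_seeds(hits, A=20))  # [((3, 10), (12, 19))]
```

Here the hits at (3, 10) and (12, 19) share diagonal 7 and are 9 positions apart, so they qualify. The hit at (40, 47) is on the same diagonal but too far away, and the remaining hits are alone on their diagonals.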

  14. Step 3
      • The HSPs found across the entire database are compared to a cutoff score S; those scoring greater than S are listed.

  15. Step 4
      • Statistical significance calculations are done for each HSP score.

  16. Step 5
      • The segments are aligned using an efficient version of dynamic programming that divides the task into subalignments based on the HSPs in the sequences
      • The alignment score is obtained
      • The E() value for this score is calculated
      • If the calculated E() for the database sequence meets the user-given E() threshold for the program, this score is reported

  17. BLAST output
      • The list of hits
      • Database accession codes, name, description, and general information about the hit
      • Score in bits: the alignment score expressed in units of information
      • Expectation value E()
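The slides do not give formulas for the bit score or E(). A minimal sketch using the standard Karlin-Altschul relations, E = K·m·n·e^(-λS) and S' = (λS - ln K) / ln 2, is shown here. The λ and K values in the example are illustrative placeholders (they depend on the scoring scheme in use), and m and n stand for the query length and the database length; all of these are assumptions for illustration.

```python
import math


def bit_score(raw_score, lam, K):
    """Convert a raw alignment score S to bits: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expect_value(raw_score, m, n, lam, K):
    """Expected number of chance HSPs scoring at least S: K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * raw_score)


# Placeholder parameters for illustration only.
lam, K = 0.318, 0.134
m, n = 250, 50_000_000   # hypothetical query length and database length
S = 80                   # hypothetical raw score

bits = bit_score(S, lam, K)
e_from_raw = expect_value(S, m, n, lam, K)
e_from_bits = m * n * 2 ** (-bits)   # same value expressed through the bit score

print(f"bit score = {bits:.1f}")
print(f"E from raw score  = {e_from_raw:.3g}")
print(f"E from bit score  = {e_from_bits:.3g}")
```

Expressing the score in bits folds λ and K into the score itself, which is why bit scores from searches with different scoring matrices can be compared directly.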

  18. BLAST programs
      • BLASTN: nucleotide query searching a nucleotide db
      • BLASTP: protein query searching a protein db
      • BLASTX: translated nucleotide query (6 frames) searching a protein db
      • TBLASTN: protein query searching a translated nucleotide (6 frames) db
      • TBLASTX: translated nucleotide query (6 frames) searching a translated nucleotide (6 frames) db
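The "6 frames" used by the translating programs are the three reading frames on each strand. A minimal sketch of six-frame translation with the standard genetic code follows; the function names and the frame labels (+1 to +3, -1 to -3) are illustrative assumptions, not BLAST's internal representation.

```python
BASES = "TCAG"
# Standard genetic code (code 1), amino acids listed in TCAG codon order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; codons with unknown bases become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Translate all three frames of both strands; '*' marks a stop codon."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translations("ATGGCCAAGTAA").items():
    print(name, protein)
# +1 MAK*
# +2 WPS
# +3 GQV
# -1 LLGH
# -2 YLA
# -3 TWP
```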

  19. Scoring Schemes
      • The default scoring matrix for blastp, blastx, tblastn, and tblastx is BLOSUM62
      • BLOSUM62 is considered a good general-purpose scoring matrix
      • PAM matrices are also provided in the BLAST distribution. When the evolutionary distance is unknown, Altschul (1991, 1992) recommends trying searches with at least PAM40, PAM120, and PAM250.

  20. Scoring Schemes for BLASTN
      • The value used for w is 11, which only allows detection of moderately diverged homologs
      • Two parameters, M and N, are used for scoring nucleotide sequence matches
      • The reward for a match is M (must be positive)
      • The penalty for a mismatch is N (must be negative)
      • Default values are M = 5 and N = -4, a ratio M/|N| of 1.25
      • Values of M and N with a ratio of 3.0 or greater are not allowed
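A minimal sketch of the match/mismatch scheme, using the defaults from the slide (M = 5, N = -4) on an ungapped alignment. The function name is an illustrative assumption.

```python
def nucleotide_score(seq1, seq2, M=5, N=-4):
    """Score an ungapped nucleotide alignment: M per match, N per mismatch."""
    if M <= 0 or N >= 0:
        raise ValueError("M must be positive and N must be negative")
    if len(seq1) != len(seq2):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return sum(M if a == b else N for a, b in zip(seq1, seq2))


# 7 matches and 1 mismatch: 7 * 5 + 1 * (-4) = 31
print(nucleotide_score("ACGTACGT", "ACGTTCGT"))  # 31
print(5 / abs(-4))                               # 1.25, the default M/|N| ratio
```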

  21. Genetic Codes
      • Default is the Standard or Universal Code (1)
      • Other alternatives are selected with the dbgcode and C parameters:
        • Vertebrate mitochondrial (2)
        • Yeast mitochondrial (3)
        • Mold, protozoan, coelenterate mitochondrial and mycoplasma/spiroplasma (4)
        • Invertebrate mitochondrial (5)
        • Ciliate macronuclear (6)
        • Echinoderm mitochondrial (9)
        • Alternative ciliate macronuclear (10)
        • Eubacterial (11)
        • Alternative yeast (12)
        • Ascidian mitochondrial (13)
        • Flatworm mitochondrial (14)
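To illustrate why the choice of genetic code matters for translated searches, the sketch below compares the Standard code (1) with the vertebrate mitochondrial code (2). Only these two tables are included, written as 64-character strings in TCAG codon order; the function names are illustrative assumptions.

```python
BASES = "TCAG"
# Amino acids for all 64 codons in TCAG order; '*' marks a stop codon.
GENETIC_CODES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    2: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",
}


def codon_table(code_id):
    """Build a codon -> amino acid dictionary for one genetic code."""
    amino_acids = GENETIC_CODES[code_id]
    table = {}
    for i, b1 in enumerate(BASES):
        for j, b2 in enumerate(BASES):
            for k, b3 in enumerate(BASES):
                table[b1 + b2 + b3] = amino_acids[16 * i + 4 * j + k]
    return table


def differing_codons(code_a, code_b):
    """Return {codon: (aa_in_a, aa_in_b)} for codons the two codes read differently."""
    table_a, table_b = codon_table(code_a), codon_table(code_b)
    return {
        codon: (table_a[codon], table_b[codon])
        for codon in table_a
        if table_a[codon] != table_b[codon]
    }


print(differing_codons(1, 2))
# {'TGA': ('*', 'W'), 'ATA': ('I', 'M'), 'AGA': ('R', '*'), 'AGG': ('R', '*')}
```

Translating a mitochondrial sequence with the Standard code would therefore insert a false stop at every TGA and read through the real stops at AGA and AGG.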
