Overview

Overview • Repeats • …

BLAST - Basic Local Alignment Search Tool • BLAST programs use a heuristic search algorithm. The programs use the statistical methods of Karlin and Altschul (1990, 1993). • BLAST programs were designed for fast database searching, with minimal sacrifice of sensitivity to distantly related sequences.

BLAST - Basic Local Alignment Search Tool • BLAST programs search databases in a special compressed format. To use your own private database with BLAST, you need to convert it to the BLAST format.

BLAST Programs • BLAST is actually a family of programs • BLASTN - Nucleotide query searching a nucleotide database. • BLASTP - Protein query searching a protein database. • BLASTX - Translated nucleotide query sequence (6 frames) searching a protein database. • TBLASTN - Protein query searching a translated nucleotide (6 frames) database. • TBLASTX - Translated nucleotide query (6 frames) searching a translated nucleotide (6 frames) database.
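The "6 frames" used by BLASTX, TBLASTN and TBLASTX can be illustrated with a short sketch. This is not BLAST's own code; it assumes the standard genetic code, and the function names (`six_frames`, `translate`, `reverse_complement`) are made up for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate a nucleotide sequence in three forward and three reverse frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```

Each of the six protein strings is then compared against the protein database (BLASTX), or each database entry is translated this way before comparison (TBLASTN, TBLASTX).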

Where to find the BLAST programs? • BLAST searches can be done on the WWW BLAST server at NIH: http://www.ncbi.nlm.nih.gov/BLAST/ • On a standalone computer such as dapsas1 at the Weizmann Institute. • From the GCG software package.

Blast method • Compare query to each sequence in database • Use heuristic to speed pairwise comparison • Create 'sequence abstraction' by listing exact and similar words • on the fly for the query • in advance for the database • Find similar words between query and each database sequence • Extend such words to obtain high-scoring sequence pairs (HSPs) • Calculate statistics analytically
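The word-listing and extension steps above can be sketched in a few lines. This is a deliberately simplified illustration, not the actual BLAST implementation: it uses exact words only (no "similar" words from a scoring matrix), a +1/-1 match/mismatch score, and an ungapped extension that stops once the score drops a fixed amount below its best value. The word size, scores and drop-off threshold are assumptions chosen for the example.

```python
from collections import defaultdict


def word_index(seq, w):
    """List every word of length w in seq with its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def extend(query, subject, qi, si, w, match=1, mismatch=-1, xdrop=3):
    """Ungapped extension of a word hit; returns (score, q_start, s_start, length)."""
    score = best = w * match
    right = k = 0
    # Extend to the right, remembering the best-scoring end point.
    while qi + w + k < len(query) and si + w + k < len(subject):
        score += match if query[qi + w + k] == subject[si + w + k] else mismatch
        k += 1
        if score > best:
            best, right = score, k
        elif best - score >= xdrop:
            break
    score = best
    left = k = 0
    # Extend to the left in the same way.
    while qi - k - 1 >= 0 and si - k - 1 >= 0:
        score += match if query[qi - k - 1] == subject[si - k - 1] else mismatch
        k += 1
        if score > best:
            best, left = score, k
        elif best - score >= xdrop:
            break
    return best, qi - left, si - left, left + w + right


def find_hsps(query, subject, w=3, min_score=5):
    """Find word hits between query and subject and extend them to HSPs."""
    index = word_index(subject, w)
    hsps = set()
    for qi in range(len(query) - w + 1):
        for si in index.get(query[qi:qi + w], ()):
            hsp = extend(query, subject, qi, si, w)
            if hsp[0] >= min_score:
                hsps.add(hsp)
    return sorted(hsps, reverse=True)


if __name__ == "__main__":
    for hsp in find_hsps("ACGTACGT", "TTACGTACGTTT"):
        print(hsp)
```

Several word hits on the same diagonal extend to the same segment pair, which is why the results are collected in a set.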

Gapped BLAST • BLAST 2.0 is a new version with new capabilities such as Gapped-Blast and Psi-Blast. • The Gapped Blast algorithm allows gaps to be introduced into the alignments. That means that similar regions are not broken into several segments (as in the older versions). • This method reflects biological relationships much better.

PSI - BLAST • PSI (Position Specific Iterated) Blast provides a new automatic “profile like” search. • The program first performs a gapped blast search of the database. Information from the significant alignments is then used by the program to construct a “position specific” score matrix. This matrix replaces the query sequence in the next round of database searching. • The program may be iterated until no new significant alignments are found.
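To make the idea of a "position specific" score matrix concrete, here is a toy sketch. It is not how PSI-BLAST actually builds its matrix: it assumes ungapped, equal-length aligned segments, a uniform background frequency, and a fixed pseudocount of 1, and it skips the sequence weighting a real profile method would apply. `build_pssm` and `score_segment` are hypothetical names.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Build log-odds scores (in bits) for each column of the aligned segments."""
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = []
    for col in range(len(aligned[0])):
        counts = {a: pseudocount for a in alphabet}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append(
            {a: math.log2((counts[a] / total) / background) for a in alphabet}
        )
    return pssm


def score_segment(pssm, segment):
    """Score a segment against the matrix, one column per residue."""
    return sum(column[res] for column, res in zip(pssm, segment))


if __name__ == "__main__":
    pssm = build_pssm(["MKV", "MKV", "MRV"])
    for candidate in ("MKV", "MRV", "WWW"):
        print(candidate, round(score_segment(pssm, candidate), 2))
```

In the next search round, database segments are scored against the matrix columns rather than against the residues of the original query.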

Blast output • The list of hits • Database accession codes, name, description, general information about the hit • Score in bits: the alignment score expressed in units of information. Usually 30 bits are required for significance • Expectation value E(): how many hits we expect to find by chance with this score when comparing this query to the database. It is important to keep in mind that the E() value does not represent a measure of similarity between the two sequences.
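The link between the bit score and E() can be written down directly. In Karlin-Altschul statistics the expected number of chance hits with bit score S' is E = m · n · 2^(-S'), where m is the query length and n the total database length. The sketch below ignores the effective-length corrections that real BLAST applies, and the lengths in the usage example are invented for illustration.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


def bits_needed(expect, query_len, db_len):
    """Bit score required to reach a given E(): S' = log2(m * n / E)."""
    return math.log2(query_len * db_len / expect)


if __name__ == "__main__":
    m, n = 250, 50_000_000  # hypothetical query and database lengths
    for bits in (30, 40, 50):
        print(bits, "bits -> E =", expect_value(bits, m, n))
    print("bits for E = 0.05:", bits_needed(0.05, m, n))
```

Because E() grows with the size of the search space, the same bit score is less significant in a larger database, which is one reason E() is not a measure of similarity.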

Blast output • The information for each hit • A header including hit name, description, length • The same for all additional entries removed due to redundancy • Composite expectation value • Each hit may contain several HSPs • score and expectation value • how many identical residues • how many residues contributing positively to the score • The local alignment itself

The Smith-Waterman Tools • Smith-Waterman searching method: • Compare query to each sequence in database • Do full Smith-Waterman pairwise comparisons • Use search results to generate statistics
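The "full Smith-Waterman pairwise comparison" is a dynamic-programming calculation over every pair of positions. The sketch below computes only the best local alignment score, with a simple match/mismatch score and a linear gap penalty; these scoring values are assumptions for the example, whereas real searches use a substitution matrix and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman("TTACGTTT", "GGACGGG"))
```

Every cell of the matrix is filled for every database sequence, so the work grows with query length times database length; this is why the method is exhaustive and slow compared with the heuristic tools.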

Where to find the SW programs? • Since SW searching is exhaustive, it is the slowest method. We use special hardware + software (Bioccelerator) to run the programs. • Bioccelerator is available here in TAU at the • at the Weizmann Institute http://dapsas1.weizmann.ac.il/bcd/bcd_parent/bcd_bioccel/bioccel.html • The Bioccelerator from the command line on dapsas1 or life2.

Comparison of programs • Concept: • SW and BLAST: local alignments • FASTA: global alignments • BLAST can report more than one HSP per database entry; FASTA reports only one segment (match). • Speed: • BLAST > FASTA >> SW • Sensitivity: • SW > FASTA > BLAST (old version!)

Comparison of programs • Sensitivity: • FASTA is more sensitive and misses fewer homologues (the opposite can also happen if no identical residues are conserved, but this is infrequent). • FASTA gives a better separation between true homologues and random hits. • Usually when FASTA gives an unexpected hit, it is an even more distant homologue.

Comparison of programs • Statistics: • BLAST calculates probabilities • sometimes fails entirely if some assumptions are invalid • FASTA calculates significance 'on the fly' from the given dataset • more relevant • problematic if the dataset is small

Tips for DB searches • Use the latest database version. • Run Blast first, then, depending on your results, run a finer tool (fasta, ssearch, SW, blocks, etc.). • Where possible use the translated sequence. • E() < 0.05 is statistically significant and usually biologically interesting. Also check 0.05 < E() < 10, because you might find interesting hits. • Pay attention to abnormal composition of the query sequence; it usually causes biased scoring.

Tips for DB searches • Split a large query sequence (if >1000 for DNA, >200 for protein). • If the query has repeated segments, remove them and repeat the search.
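Splitting a long query can be done with a small helper such as the one below. The overlap between neighbouring chunks is an assumption added here so that a match spanning a chunk boundary is not cut in half; the slide itself only gives the length thresholds. `split_query` is a hypothetical name.

```python
def split_query(seq, chunk_len, overlap=0):
    """Split seq into chunks of chunk_len; return (start_offset, chunk) pairs."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be >= 0 and smaller than chunk_len")
    step = chunk_len - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, seq[start:start + chunk_len]))
        if start + chunk_len >= len(seq):
            break
        start += step
    return chunks


if __name__ == "__main__":
    dna = "ACGT" * 600  # stand-in for a 2400-base query
    for start, chunk in split_query(dna, chunk_len=1000, overlap=100):
        print(start, len(chunk))
```

Keeping the start offset of each chunk makes it possible to map hit coordinates back onto the original query.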
