BLAST

BLAST • Basic Local Alignment Search Tool • developed at NCBI by Altschul and collaborators • also uses matching of words to table for each sequence as starting point • uses a quantitative cutoff rather than strict identity

So for a k-tuple of 3, each triplet of residues from the query sequence is compared to a table of all triplets, each scored with a numerical similarity value based on the scoring matrix • All triplets with scores above a cutoff value, T, are considered a match and all sequences that contain the above-threshold triplets are selected for alignment extension from the seed
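The seeding step can be sketched in a few lines. This is a minimal illustration, not the NCBI implementation: the four-letter `ALPHABET`, the toy `score` function (+5 match, -1 mismatch) and the helper names are all assumptions made for the example, and a real search would use a full substitution matrix such as BLOSUM62. Words scoring at or above T are accepted here; the exact boundary convention is also an assumption.

```python
from itertools import product

ALPHABET = "ACDE"  # toy alphabet (assumption); real protein searches use 20 residues


def score(a, b):
    # Toy substitution score (assumption), standing in for a real scoring matrix.
    return 5 if a == b else -1


def neighborhood(word, threshold):
    """Return every word of the same length that scores >= threshold against `word`."""
    hits = []
    for cand in product(ALPHABET, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, cand))
        if total >= threshold:
            hits.append("".join(cand))
    return hits


def seed_table(query, k, threshold):
    """Map each above-threshold word to the query positions it matches."""
    table = {}
    for pos in range(len(query) - k + 1):
        for word in neighborhood(query[pos:pos + k], threshold):
            table.setdefault(word, []).append(pos)
    return table


def find_seeds(query, subject, k=3, threshold=9):
    """Return (query_pos, subject_pos) pairs where a subject word hits the table."""
    table = seed_table(query, k, threshold)
    seeds = []
    for spos in range(len(subject) - k + 1):
        for qpos in table.get(subject[spos:spos + k], []):
            seeds.append((qpos, spos))
    return seeds
```

With the toy scores, a threshold of 15 accepts only the identical triplet, while a threshold of 9 also accepts triplets with one mismatch. That is the sense in which T is a quantitative cutoff rather than a strict identity test.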

Once starting points are identified, the alignment is extended (with gapping) • If the cumulative score starts dropping before reaching a cutoff, S, the match is discarded • If the cumulative score exceeds S, the match is recorded as an HSP (High-scoring Segment Pair) • The alignment is extended until the score drops below its maximum by another cutoff, X, then the alignment is truncated back to the maximum score and reported
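The X cutoff can be illustrated with a simplified extension routine. This sketch is ungapped, which is a deliberate simplification of the gapped extension described above, and the toy `score` function and the name `extend` are assumptions. It extends a seed in both directions, remembers the best cumulative score, and stops once the score has fallen X below that best value; the reported segment is truncated back to the maximum.

```python
def score(a, b):
    # Toy substitution score (assumption), standing in for a real scoring matrix.
    return 5 if a == b else -4


def extend(query, subject, qpos, spos, k, x_drop):
    """Extend a length-k seed at (qpos, spos) without gaps.

    Returns (best_score, query_start, query_end), with query_end exclusive.
    """
    seed_score = sum(score(query[qpos + i], subject[spos + i]) for i in range(k))

    def one_direction(q, s, step):
        best = running = 0
        best_len = steps = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += score(query[q], subject[s])
            steps += 1
            if running > best:
                best, best_len = running, steps   # new maximum: remember its length
            elif best - running >= x_drop:
                break                              # dropped by X: stop extending
            q += step
            s += step
        return best, best_len                      # truncated back to the maximum

    right, right_len = one_direction(qpos + k, spos + k, +1)
    left, left_len = one_direction(qpos - 1, spos - 1, -1)
    return seed_score + left + right, qpos - left_len, qpos + k + right_len


def is_hsp(segment_score, s_cutoff):
    # A segment is kept as an HSP only if it reaches the cutoff S.
    return segment_score >= s_cutoff
```

For example, `extend("GGACGTACGG", "TTACGTACTT", 2, 2, 3, 8)` grows the seed `ACG` rightwards over `TAC` and then stops after two mismatches have pulled the running score 8 below its maximum.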

several HSP regions can be returned from a single database sequence • Scores are combined from all hits to yield a full pair-wise comparison score

Handling Different Data • BLAST has several options implemented to handle various combinations of protein and nucleotide sequence in the query and database • BLASTN - nucleotide vs nucleotide • BLASTP - protein vs protein

BLASTX - nucleic acid query is translated in all six frames and screened against a protein database • TBLASTN - a protein query is searched against a nucleic acid database translated in all six frames • TBLASTX - translate nucleic acid query and nucleic acid database in all six frames, search as protein sequences
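The six-frame translation used by BLASTX, TBLASTN and TBLASTX can be sketched as follows. The helper names are assumptions; the codon table is the standard genetic code, built from the usual T, C, A, G ordering. Codons containing ambiguous bases are translated to `X` in this sketch.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return translations of the three forward and three reverse frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]
```

Each of the six protein strings would then be searched as an ordinary protein sequence.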

BLAST Parameters • BLAST is implemented with several user-adjustable parameters that control the sensitivity of the search • E value = Expectation value, number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.
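The E value is usually computed from the Karlin-Altschul relation E = K·m·n·e^(−λS), where m and n are the query and database lengths and K and λ depend on the scoring system. The sketch below assumes that relation; the function names are hypothetical, and the λ and K values used in the usage note are placeholders, not values from the slides or from any particular matrix.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring >= raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score, so that E = m * n * 2 ** (-bit_score)."""
    return (lam * raw_score - math.log(k)) / math.log(2)
```

With placeholder parameters such as `lam=0.3` and `k=0.1`, raising the raw score lowers the E value exponentially, while doubling the database length doubles it. This is why the same alignment looks less significant in a larger database.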

Gap scores (opening and extension): the higher the values, the shorter and more compact the saved alignments and the less sensitive the search • Filter: preprocesses the query sequence to mask out “low complexity” stretches that would otherwise produce an abnormally high number of hits. These are usually sequences with many repeats
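The idea behind the filter can be shown with a simple entropy-based mask. This is a stand-in for illustration only: it is not the SEG or DUST algorithm, and the window size, entropy threshold and function names are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy of the residue composition, in bits."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.0, mask_char="X"):
    """Replace every residue that lies in a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)
```

A run of identical residues has zero entropy and is masked completely, whereas a window of twelve different residues is left untouched. Masked positions then contribute no seeds to the search.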

Multiple Alignments Finding homologous sites in many related proteins

Why Multiple Alignments • Usually there are more than two sequences that are related to each other at some arbitrary level of similarity • Pair-wise alignments between different pairs of sequences will usually yield inconsistencies • How do we align an arbitrary number of sequences in an optimal manner?

Dynamic Programming • The 2-dimensional solution for pair-wise alignments can be generalized to i sequences by generating an i-dimensional matrix • Computationally this problem is O(n^i) in both time and memory, where n is the length of an individual sequence and i is the number of sequences

There are program optimizations that allow one to decrease the memory requirement and to avoid some parts of the dynamic programming matrix that are demonstrably suboptimal, but there is as yet no algorithm that avoids the core problem of O(n^i) size • Therefore, need some heuristic method that will usually yield a near-optimal solution
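A little arithmetic shows how quickly the full matrix becomes impractical. The sequence length of 300 and the 4 bytes per cell are assumptions chosen only for illustration.

```python
def dp_cells(seq_len, num_seqs):
    """Number of cells in the full dynamic programming matrix: n ** i."""
    return seq_len ** num_seqs


def dp_memory_bytes(seq_len, num_seqs, bytes_per_cell=4):
    """Memory needed to hold one score per cell."""
    return dp_cells(seq_len, num_seqs) * bytes_per_cell


# Print the growth for 2 to 6 sequences of 300 residues each.
for i in range(2, 7):
    print(i, dp_cells(300, i), dp_memory_bytes(300, i))
```

Two sequences need 90,000 cells, three need 27 million, and every further sequence multiplies the total by another factor of 300.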

Heuristic Approach • Start with scores of alignments of all pair-wise combinations of sequences • Build a distance tree based on the pair-wise scores • Align from most similar to least similar • For each alignment also generate a consensus that can be aligned with either single sequences or other consensuses
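The order in which sequences are aligned can be derived from the pair-wise distances by clustering. The sketch below uses UPGMA (average linkage) because it is short; that choice is an assumption for illustration, and real tools may build the tree differently (CLUSTAL W, for example, uses neighbour joining). The function name and the input format are hypothetical.

```python
def upgma_order(names, dist):
    """Return the merge order for progressive alignment.

    `dist` maps frozenset({a, b}) to the distance between a and b and must
    contain every pair of names. Merged clusters are labelled "(a,b)".
    """
    clusters = {name: 1 for name in names}   # cluster label -> number of sequences
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        pair = min(d, key=lambda p: d[p])    # most similar pair first
        a, b = sorted(pair)
        new = f"({a},{b})"
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        for other in clusters:
            # Size-weighted average distance from the merged cluster to the rest.
            d[frozenset((new, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        d = {p: v for p, v in d.items() if a not in p and b not in p}
        clusters[new] = size_a + size_b
        merges.append((a, b))
    return merges
```

Each merge in the returned list corresponds to one alignment step: two sequences, a sequence and a consensus, or two consensuses.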

CLUSTAL W • Freely available, fast and heuristic • Sequential pair-wise alignments rather than simultaneous alignment • Changes weighting of individual sequences, weighting of gap penalties, scoring matrices throughout the process to more closely reflect expected behaviour

Discussion • What elements of “reality” do the various CLUSTAL W features address? • Paper available in the library, Nucleic Acids Research (1994) vol 22: 4673-4680 • http://bimas.dcrt.nih.gov/clustalw/clustalw.html

Implicit Assumptions • the pair-wise and multiple alignment algorithms that we have talked about are based on the idea that the information inherent in the sequence is completely local • no way of handling a major known factor, that protein folding is based on non-local interactions

T-COFFEE 3D • Implements an objective function that compares consistency between multiple alignment and library of pair-wise alignments (original T-COFFEE v 2.0) • Also checks sequence alignment against known 3D structures by threading sequences onto best common matches and aligning residues that match up on the 3D model

MUSCLE • Newest high performance tool • Builds up multiple alignment in a heuristic sequence • Measures rapid k-tuple distances, builds a distance tree, does multiple alignment • Calculates Kimura distance from alignment, generates new tree, redoes alignment

Divide tree into two parts by removing one edge • Calculate a profile (PSSM) for sequences from each half • Align the two profiles • If new profile alignment is better than original, keep as new best • Otherwise, discard, go back and divide tree into different two parts • Iterate until all divisions have been tested
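The refinement loop can be sketched as below. The tree is represented as nested tuples of sequence names, and `realign` and `score` are hypothetical callbacks standing in for the profile-profile alignment and the objective function; none of these names come from MUSCLE itself.

```python
def leaves(tree):
    """List the sequence names under a node (a name or a tuple of subtrees)."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def bipartitions(tree):
    """Yield each distinct split of the leaves obtained by removing one edge."""
    all_leaves = frozenset(leaves(tree))
    seen, result = set(), []

    def visit(node, is_root):
        if not is_root:
            inside = frozenset(leaves(node))
            outside = all_leaves - inside
            key = frozenset((inside, outside))
            if outside and key not in seen:   # skip the duplicate split at the root
                seen.add(key)
                result.append((sorted(inside), sorted(outside)))
        if not isinstance(node, str):
            for child in node:
                visit(child, False)

    visit(tree, True)
    return result


def refine(tree, alignment, realign, score):
    """Try every split once; keep a re-alignment only if it scores better."""
    best, best_score = alignment, score(alignment)
    for group_a, group_b in bipartitions(tree):
        candidate = realign(best, group_a, group_b)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best
```

For a tree with four leaves, `bipartitions` returns five splits, one per edge of the unrooted tree, so the loop terminates after a fixed number of profile alignments.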

MUSCLE Advantages • Fast for large numbers of sequences • Implements probabilistic model for generating profiles • Most accurate, or tied for most accurate, in four standardized test sets of sequences

Machine Learning Approaches • a whole class of algorithmic approaches to sequence alignment incorporate the possibility of long range interactions • Hidden Markov models • Bayesian belief networks • implementation of these approaches is underway, but these are computationally intensive

Hidden Markov Model • The HMM implements a different formalism from the dynamic programming example • The HMM must be “trained” on a known data set to generate the internal parameters - generates PSSM • Quality of result depends on how well the training reflects reality

Extra Reading • Biological Sequence Analysis: probabilistic models of proteins and nucleic acids, R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Cambridge University Press, 1998

Improving Sensitivity of Database Searching Probabilistic Analysis

Position Specific Scoring Matrices • multiple alignments give information about the underlying information content of a family of related sequences • can generate a scoring matrix where the score is based on: • amino acid • position of the residue being scored within a window
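A position specific scoring matrix can be built from the columns of a multiple alignment as log-odds scores. The sketch below is one common construction, not necessarily the one used by any particular program: the uniform background frequencies, the pseudocount of 1 and the function names are assumptions.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet, background=None, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    if background is None:
        background = {a: 1 / len(alphabet) for a in alphabet}  # uniform (assumption)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in alphabet)   # gaps are ignored
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log2((counts[a] + pseudocount) / total / background[a])
            for a in alphabet
        })
    return pssm


def score_window(pssm, window):
    """Sum the position-specific scores of the residues in `window`."""
    return sum(col[res] for col, res in zip(pssm, window))


def best_hit(pssm, seq):
    """Return (score, start) of the best-scoring window in `seq`."""
    width = len(pssm)
    return max(
        (score_window(pssm, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )
```

Unlike a general scoring matrix, the same residue receives a different score in each column: a residue that is conserved at a position scores highly there and is penalised elsewhere.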

Use of Alignments for more Sensitive Searches • Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

PSI-BLAST • variant on BLAST, with iterative, user-directed extraction of a PSSM from alignments • runs standard BLAST protein search and returns set of highest scoring alignments • user selects which sequences to include in PSSM generation • All gaps in query sequence discarded • Therefore targets PSSM to only the query

PSI-BLAST • runs BLAST search, but uses PSSM instead of normal hashing and dynamic programming • returns results, user can once again select the new sequences to add to the PSSM • repeat until no new hits appear

PSI-BLAST • repeated iteration will detect new proteins that match the PSSM, but did not match the original sequence • WARNING: Increased sensitivity leads to increased risk of false hits. • effective use of this tool requires knowledge of the subject problem

PSI-BLAST • this is a simple HMM • why? • also have the option of searching starting with a pre-existing PSSM
