BLAST and searching sequence databases

Dr Alexei Drummond, Department of Computer Science, alexei@cs.auckland.ac.nz. BIOSCI 359, Semester 2, 2006

Sequence Homology • Homologous protein or DNA sequences share common ancestry • A statement of homology is therefore an evolutionary hypothesis • Homology need not imply similar function • Homology is a binary property: a pair of sequences is either homologous or not homologous • There is no such thing as a degree of homology • Homology is often inferred from sequence similarity [Figure: a and b are homologous when they share a common ancestor x; a and b are not homologous when they descend from separate ancestors x and y]

Orthology and paralogy "Where the homology is a result of gene duplication so that both copies have descended side by side during the history of an organism, (for example, alpha and beta hemoglobin) the genes should be called paralogous (para=in parallel). Where the homology is the result of speciation so that the history of the gene reflects the history of the species (for example alpha hemoglobin in man and mouse) the genes should be called orthologous (ortho=exact)." Fitch WM. Distinguishing homologous from analogous proteins. Systematic Zoology 1970 Jun;19(2):99-113.

Orthology, paralogy and multigene families [Figure reproduced from NCBI education website]

What are good scores for searching databases? • We want these scores to distinguish the related sequences from the unrelated sequences • So we select alignment parameters for database searching that give us the best distinguishing scores • These may not be the parameters that will give us the most accurate alignment

Database searching is an experiment • “Database searching is the application of knowledge gained from previous experiments to the problem of discovering the biochemistry and physiology of a newly discovered gene or its protein.” • It demands the same careful thought and execution as your bench or laboratory investigations! • Garbage in, garbage out

Sources of previous knowledge • Similarity scores - PAM, Blosum similarity matrices • Rescue us from having to assume that all amino acid changes are equally likely and equally harmful • Different similarity matrices are appropriate for different degrees of evolutionary divergence • More later

Second source of previous knowledge • Computer algorithm • Dynamic programming is the most sensitive and the least selective • Local, global, repeats, overlap • BLAST and FASTA are much faster and more selective, which can be an advantage • No program is always best at finding distantly related sequences for all gene or protein families, but dynamic programming is guaranteed to give the optimal alignment for a given scoring scheme

Third source of previous knowledge • Database itself • Large store of previously acquired knowledge • Making the best use of this knowledge can save you many months of expensive laboratory experimentation • The size of this potential gain is the determining factor in deciding how much effort to devote to any particular database search

Database search assumptions • The sequences sought share an evolutionary ancestral sequence with the “query sequence” • Not all substitutions are equally likely, and they should be weighted to account for this • Insertions and deletions are less likely than substitutions and should be weighted to account for this.

FASTA FASTA • 2 step algorithm • (1) word search, using a specific word size, finds regions with a high number of identical word matches • (2) Smith-Waterman alignment centered on these regions and bounded by a window size which limits the number of insertions or deletions one sequence can accumulate with respect to the other sequence

FASTA FASTA Heuristics • Heuristic approximation to Smith-Waterman • Runs faster • Loses some sensitivity • Restrictions on the model of sequence evolution • First heuristic: the word size parameter, usually 2 for proteins and 6 for nucleic acids • FASTA constrains the evolution between a pair of sequences to preserve a number of unchanged dipeptides or hexanucleotides

FASTA FASTA Algorithm • Divide the query sequence into its constituent overlapping words of length two for proteins or six for nucleic acids • Each sequence in the database is broken up in the same way • The two word lists are compared to find all identical words in both sequences • An initial score is computed based on how many identities are found within a small region of the dot plot.
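The word-list comparison can be sketched in a few lines of Python. This is a minimal illustration, not FASTA's actual implementation: the function names, the toy sequences and the choice to summarise "a small region of the dot plot" as a single diagonal are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def words(seq, k):
    """Return (word, start) for every overlapping word of length k."""
    return [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]

def diagonal_hits(query, subject, k=2):
    """Count identical k-letter words on each dot-plot diagonal.

    A diagonal is identified by the offset (subject start - query start).
    """
    index = defaultdict(list)          # word -> query positions
    for word, pos in words(query, k):
        index[word].append(pos)
    hits = Counter()
    for word, s_pos in words(subject, k):
        for q_pos in index.get(word, []):
            hits[s_pos - q_pos] += 1
    return hits

# Toy protein sequences (hypothetical); word size 2 as for proteins.
query = "MKVLAAGIVG"
subject = "TTMKVLAAGLVG"
hits = diagonal_hits(query, subject)
best_diagonal, count = hits.most_common(1)[0]
```

A diagonal with many identical words marks a region worth rescoring; diagonals with few or no hits are never examined further, which is where the speed comes from.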

FASTA FASTA Algorithm cont. • If the initial score is high enough, a second score is computed by evaluating which initial identities can be joined into a consistent alignment using only gaps of less than the window size. • If the secondary score is high enough, then a Smith-Waterman alignment is performed within the same region of the dot plot defined by the concentrated identities and using the same window size. • This third score is reported as the optimal score.

FASTA Creating a Word List

FASTA First Pathological Example • Two proteins that share 50% identity, but the proper alignment consists of alternating matches and mismatches. • With a word size of two, there would be no matches along the main diagonal of the dot plot and the proper alignment would not be found.
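The effect is easy to reproduce with a pair of made-up sequences. The sequences and the helper name below are hypothetical; the point is only that 50% identity can coexist with zero shared two-letter words.

```python
def shared_words_on_main_diagonal(a, b, k):
    """Count offsets where a and b carry the same word of length k."""
    n = min(len(a), len(b))
    return sum(a[i:i + k] == b[i:i + k] for i in range(n - k + 1))

# Hypothetical pair: every second residue matches.
a = "ACDEFGHIKL"
b = "AWDYFVHPKR"

identity = sum(x == y for x, y in zip(a, b)) / len(a)
dipeptide_hits = shared_words_on_main_diagonal(a, b, 2)
single_hits = shared_words_on_main_diagonal(a, b, 1)
```

With a word size of 1 the five matching positions are visible; with a word size of 2 there is nothing for the word search to find.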

FASTA Second Pathological Example • Two proteins that are almost identical, except the second protein has a 20 residue insertion into the middle of the sequence. • If the window size is 15, then the Smith-Waterman alignment phase of FASTA will align the protein to either the sequence prior to or following the insertion, thus missing the fact that the proteins were basically identical (with only one long insertion).
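A small experiment illustrates the window effect. The sketch below is a plain Smith-Waterman score with an optional band around the main diagonal. It is not FASTA's code: the scores (+5, -4, linear gap -2), the synthetic sequences and the choice to centre the band on the main diagonal are assumptions for the illustration.

```python
def local_score(a, b, band=None, match=5, mismatch=-4, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    If band is given, only cells with |i - j| <= band are filled,
    mimicking a window around the main diagonal of the dot plot.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if band is not None and abs(i - j) > band:
                continue  # outside the window: the cell stays at 0
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

# Synthetic 60-residue protein (contains no W) and a copy carrying a
# 20-residue insertion of W in the middle.
protein = "ACDEFGHIKLMNPQRSTVY" * 3 + "ACD"
with_insert = protein[:30] + "W" * 20 + protein[30:]

full = local_score(protein, with_insert)              # unrestricted
windowed = local_score(protein, with_insert, band=15)  # window of 15
```

The unrestricted run pays for the 20-residue gap and aligns both halves; the windowed run cannot reach the diagonal on which the second half lies and reports a lower score.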

FASTA Second Heuristic • Window size is the second heuristic used by FASTA. • Its effect is more variable than that of word size. • If the best alignment, as defined by a full Smith-Waterman analysis, goes outside the window then a lower scoring alignment will be found by FASTA. This will lead users to conclude the sequences are not homologous when in fact they are and the homology could have been inferred from a full Smith-Waterman alignment. • In practice these pathological cases are very unlikely. However, similar cases do occur and loss of sensitivity caused by the use of these heuristics will be seen.

BLAST BLAST • Approximates a simplification of Smith-Waterman known as the maximal segment pairs (MSP) algorithm. • MSP alignments do not allow gaps and are specified by an equation that includes only the first and fourth terms of the Smith-Waterman equation. • MSP alignment’s statistics are well understood and so we can compute a significance probability.

BLAST Significance Probabilities • Thus the evolutionary model requires that there be a fairly long stretch of sequence that has evolved without insertions or deletions, or at least with a complementary pattern of insertions and deletions that does not significantly disrupt the alignment • Recent advances in MSP statistics allow several independent segment alignments to be used in evaluating the significance probability.
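For ungapped segment pairs, the expected number of chance hits with score at least S is usually written E = K·m·n·exp(-λS), where m and n are the query and database lengths, and the probability of at least one such hit is 1 - exp(-E). The sketch below simply evaluates those two expressions. The values of λ and K are placeholders of roughly the magnitude quoted for ungapped BLOSUM62 scoring and should be treated as assumptions; the function names are made up for the example.

```python
import math

def expected_hits(score, query_len, db_len, lam, k):
    """Expected number of chance segment pairs scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)

def p_value(e_value):
    """Probability of at least one chance hit, given the expected count."""
    return -math.expm1(-e_value)

LAMBDA, K = 0.318, 0.134  # assumed parameters, not taken from the slides

# Same score and query length, databases of 10^6 and 10^8 residues.
small = expected_hits(60, 250, 1_000_000, LAMBDA, K)
large = expected_hits(60, 250, 100_000_000, LAMBDA, K)
```

Because E grows in proportion to database size, a score that looks convincing against a small database can be unremarkable against a large one.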

BLAST Brief Comparison of BLAST and FASTA • BLAST is less sensitive than Smith-Waterman but correspondingly more selective. • For proteins, BLAST is more sensitive than FASTA even though BLAST uses a word size of 3 for proteins while FASTA uses a word size of 2. • BLAST uses a word size of 11 for nucleic acids. The recent modifications that make it more sensitive for proteins do not seem to work for nucleic acids. • FASTA should therefore be used instead of BLAST when searching with nucleic acid sequences.

BLAST BLAST Algorithm • It creates a word list (same as FASTA). • Then it expands this list in order to recover sensitivity lost by only using exact matches. • Any word that scores at least a minimum threshold (T) when aligned with any of the initial list of words is added to the list. • BLAST then examines the database for words that exactly match any word in the expanded word list. • Equivalent to looking for gapless alignments of score at least T in the database.
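The expansion step can be sketched as follows. To keep the example self-contained it uses a toy scoring function (identity +5, same chemical group +2, otherwise -2) instead of a real PAM or Blosum table; the grouping, the scores and all function names are assumptions, not BLAST's own.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical residue groups used only by the toy scoring function.
GROUPS = ["AVLIM", "FWY", "KRH", "DE", "STNQ", "C", "G", "P"]
GROUP_OF = {aa: g for g in GROUPS for aa in g}

def toy_score(x, y):
    """Toy similarity score; a stand-in for a PAM or Blosum lookup."""
    if x == y:
        return 5
    return 2 if GROUP_OF[x] == GROUP_OF[y] else -2

def word_score(w1, w2):
    return sum(toy_score(x, y) for x, y in zip(w1, w2))

def neighborhood(word, threshold):
    """All words of the same length scoring >= threshold against `word`."""
    candidates = ("".join(c) for c in product(AMINO_ACIDS, repeat=len(word)))
    return {w for w in candidates if word_score(word, w) >= threshold}

def expanded_word_list(query, k, threshold):
    """Map each neighborhood word to the query positions it came from."""
    expanded = {}
    for i in range(len(query) - k + 1):
        for w in neighborhood(query[i:i + k], threshold):
            expanded.setdefault(w, []).append(i)
    return expanded

near_lk = neighborhood("LK", 7)    # "LK" plus conservative replacements
nothing = neighborhood("LK", 11)   # threshold above the self-score of 10
```

The database is then scanned for exact occurrences of any word in the expanded list. When the threshold exceeds a word's score against itself, that word contributes nothing to the list at all.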

BLAST BLAST example • The example shows an expanded list of 47 words from the original 7. The expanded list contains any word that scores at least eight when aligned with the initial word and scored with the PAM 120 similarity table.

BLAST BLAST Paradox • Notice that there is no word that scores 8 or more when aligned with the initial word “sa”, even the word “sa” itself. • This situation does occur in actual BLAST searches. • The user has the option to force the initial word into the final list.

BLAST BLAST default • The default is to not include such low scoring words because they contain so little information that they are unlikely to be critical in finding a maximal segment pair alignment. • BLAST has a word length of 3 for protein searches with a threshold score of T=13 using the Blosum62 similarity scoring matrix.

BLAST Blast final step • The occurrence of a word hit is followed by an attempt to find a locally optimal ungapped alignment. • This is accomplished by accumulating the score as the alignment is extended in both directions. • When a run of mostly negative scores is encountered, the cumulative score will drop substantially. When this happens it is unlikely that the score will rebound. • This observation provides the basis for an additional heuristic whereby the extension of a hit is terminated when the reduction in score exceeds a dropoff threshold. • The local alignment with the highest score is returned.
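A minimal sketch of the extension with a dropoff threshold is given below. The +5/-4 scoring, the dropoff value of 10, the toy sequences and the function names are all assumptions chosen for illustration.

```python
def extend(query, subject, q_pos, s_pos, step, score_fn, dropoff):
    """Extend in one direction; return (best score, residues kept)."""
    best = running = 0
    best_len = n = 0
    while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
        running += score_fn(query[q_pos], subject[s_pos])
        n += 1
        if running > best:
            best, best_len = running, n
        elif best - running > dropoff:
            break  # score has fallen too far below the best seen so far
        q_pos += step
        s_pos += step
    return best, best_len

def ungapped_extension(query, subject, q_hit, s_hit, word_len, score_fn, dropoff):
    """Extend a word hit both ways; return (score, query start, query end)."""
    word = sum(score_fn(query[q_hit + i], subject[s_hit + i])
               for i in range(word_len))
    right, r_len = extend(query, subject, q_hit + word_len, s_hit + word_len,
                          +1, score_fn, dropoff)
    left, l_len = extend(query, subject, q_hit - 1, s_hit - 1,
                         -1, score_fn, dropoff)
    return word + left + right, q_hit - l_len, q_hit + word_len + r_len

def identity_score(x, y):
    return 5 if x == y else -4

# The word "GTA" at position 5 of both sequences is the initial hit.
query = "CCCACGTACGTGGG"
subject = "TTTACGTACGTAAA"
result = ungapped_extension(query, subject, 5, 5, 3, identity_score, dropoff=10)
```

Here the extension recovers the whole shared segment `query[3:11]` and stops a few residues into each mismatching flank, without scanning to the ends of the sequences.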

BLAST Improvements to BLAST • The growth of the sequence database has raised the minimum score and hence the length of alignment that must be found by BLAST for a match to be significant. • Speed and sensitivity can be improved by requiring that the algorithm find two matches above some (lower) threshold rather than one match. Both matches must be on the same diagonal.

BLAST New BLAST settings • The increase in speed results from fewer sequences being completely evaluated. • BLAST now looks for 2 words of length 3 that each score at least 11 using Blosum62. The matches must be within 40 amino acids of each other on the same diagonal. • As the database grows, new techniques will need to be constantly devised.
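The two-hit requirement amounts to grouping word hits by diagonal and keeping pairs that are close together. The sketch below assumes hits are given as (query position, subject position) pairs, measures distance between word starts, and only compares neighbouring hits on a diagonal; these details and the function name are assumptions rather than BLAST's exact bookkeeping.

```python
from collections import defaultdict

def two_hit_seeds(hits, word_len=3, max_distance=40):
    """Return (diagonal, first, second) for pairs of non-overlapping word
    hits on the same diagonal whose starts are at most max_distance apart."""
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)
    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if word_len <= second - first <= max_distance:
                seeds.append((diagonal, first, second))
    return seeds

# Three hits on diagonal 15 and one isolated hit on diagonal 48.
hits = [(10, 25), (30, 45), (100, 115), (12, 60)]
seeds = two_hit_seeds(hits)
```

Only the pair at query positions 10 and 30 triggers an extension; the isolated hits are discarded without further work.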

BLAST Gapped BLAST • Builds the alignment out from a central high scoring pair of aligned amino acids analogous to the way BLAST extends the initial maximal segment pair alignment. • The initial pair of amino acids is chosen as the middle pair of the highest scoring window of 11 amino acids. • Smith-Waterman alignment is extended in all directions in the path graph until it falls below a fixed percentage of the highest score yet computed in the Smith-Waterman phase.

BLAST Guarantees on Gapped Blast • Will find the best scoring Smith-Waterman alignment if: • The calculation is extended until a score of 0 is reached. Stopping earlier accepts a small risk of not finding the complete alignment in return for a very large saving in computer resources. • The initial pair of amino acids selected as the midpoint must actually be part of the alignment that would be reported from a full Smith-Waterman alignment.

BLAST BLAST Warning • Before publishing an alignment: prudent to do a complete Smith-Waterman analysis. • Further it is suggested to make use of the Waterman-Eggert extensions to Smith-Waterman (MaxSegs algorithm) in order to look at the best several independent local alignments and to examine each sequence for repeated motifs.

BLAST Finding Distant Homologies • Many functionally and evolutionarily important protein similarities are recognizable only through comparison of three-dimensional structures. • When structures are not available, patterns of conservation identified from the alignment of related sequences can aid the recognition of distant similarities • These patterns are called motifs, profiles, position-specific score matrices, and Hidden Markov Models.

BLAST PSI-BLAST • Position-Specific Iterative BLAST • Designed to detect weak relationships by using a profile that is constructed automatically from the multiple alignment of the highest scoring hits in the initial BLAST search. • The profile is created by calculating position-specific scores for every amino acid at every position in the alignment.

BLAST How it works • If a residue is highly conserved at a particular position, it will receive a high score, and others will be assigned high negative scores. • At weakly conserved positions all residues receive scores near 0. • Position specific scores can also be assigned to potential insertions and deletions
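A simple way to see how such scores arise is to compute log-odds values column by column. The sketch below is a simplification, not PSI-BLAST's procedure: it assumes a uniform background frequency, a fixed pseudocount of 0.1 and no sequence weighting, and the toy alignment and function names are made up.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=0.1):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # assumed uniform background
    pssm = []
    for column in zip(*alignment):
        residues = [r for r in column if r in AMINO_ACIDS]  # ignore gaps
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (residues.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score_sequence(pssm, seq):
    """Score an ungapped sequence of the same length as the profile."""
    return sum(col[aa] for col, aa in zip(pssm, seq))

# Toy alignment: columns 1 and 2 are conserved, columns 3 and 4 vary.
alignment = ["GWAC", "GWDE", "GWFH", "GWIK",
             "GWLM", "GWNP", "GWQR", "GWS-"]
pssm = build_pssm(alignment)
```

In the conserved first column G scores strongly positive and every other residue negative, while in the variable third column each observed residue earns only a modest score.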

BLAST Iteration • The power of profile methods can be further enhanced through iteration of the search procedure. • After a profile is run against a database, new similar sequences can be detected. In each iteration: • A new multiple alignment, which includes these new sequences can be constructed. • A new profile abstracted. • A new database search performed. • The procedure can be iterated as often as desired or until convergence (when no new statistically significant sequences are detected).

BLAST Design Goals of PSI-BLAST • Speed, simplicity, automatic operation • Unlike most profile-based search methods, PSI-BLAST runs one program starting with a single protein sequence as input, and the intermediate steps of multiple alignment and profile construction are invisible to the user.

BLAST PSI-BLAST Details • It uses the gapped BLAST program for the database searches. A PSI-BLAST query is identical to a Gapped BLAST query with the addition of an expectation value (E-value) cut-off for inclusion of a match in an iteration (default is 0.001). • The E-value cut-off can be overridden by the user on a case-by-case basis if a sequence hit of interest is worse than the threshold. • The multiple alignment and profile will have lengths identical to that of the query

BLAST Notes on using PSI-BLAST • The WWW version requires the user to decide after each iteration whether to continue. It has the advantage that the user can hand-pick sequences used for each profile construction, regardless of E-value, by checking boxes next to the sequence descriptions. • A stand-alone version of PSI-BLAST, obtainable from NCBI, allows the user to run the program for a chosen number of iterations or until convergence. • This version also allows the user to save the profile produced and use it subsequently to search another database.

BLAST Warnings on using PSI-BLAST • PSI-BLAST is a powerful tool and it requires caution. • The sources of error are the same as for standard BLAST, but are easily amplified by iteration!

BLAST Sources of Errors • The major source of deceptive alignments is the presence within proteins of regions with highly biased amino acid composition - low complexity. • If such a region is included during production of a profile, otherwise unrelated sequences containing similarly biased regions will creep into subsequent iterations, rendering the search nearly worthless.

BLAST How to stop bias • PSI-BLAST filters out biased regions of query sequences by default, using the SEG program. • SEG parameters are set to avoid masking potentially important regions, so some bias may still persist. PSI-BLAST can therefore still generate compositionally rooted artifacts. • These cases can usually be identified by inspection, especially when sequences that have a known bias, such as myosins or collagens, are retrieved. • SEG can also be set to eliminate nearly all biased regions, or other filtering procedures, such as COILS, can be used before submitting the appropriately masked sequence to PSI-BLAST.
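The idea of masking compositionally biased regions can be illustrated with a sliding-window Shannon entropy filter. This is explicitly not the SEG algorithm, only a rough stand-in; the window length of 12, the entropy cut-off of 2.2 bits, the toy sequence and the function names are assumptions.

```python
import math
from collections import Counter

def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())

def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace every residue in a low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)

# Hypothetical protein with a serine-rich stretch in the middle.
seq = "MKTAYIAKQRQISFVK" + "S" * 16 + "HFSRQLEERLGLIEVQ"
masked = mask_low_complexity(seq)
```

A stricter cut-off masks more of the sequence, which mirrors the trade-off between removing bias and hiding potentially important regions.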

BLAST PHI-BLAST • Pattern Hit Initiated BLAST searches for particular patterns in protein queries • It takes a protein query and a pattern contained in that sequence as input. • It searches the database for protein sequences that • contain the input pattern and also • have significant similarity to the query sequence in the region of the pattern occurrences.
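The first of the two conditions, containing the input pattern, can be sketched with a regular expression. The example assumes the pattern is written in PROSITE-style syntax and handles only the basic elements (`x`, `[...]`, `{...}`, `(n)` and `(n,m)`); the pattern, the toy database and the function names are hypothetical, and the second condition (significant similarity around the pattern) is not implemented.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex."""
    return (pattern.replace("-", "")
                   .replace("x", ".")
                   .replace("{", "[^").replace("}", "]")
                   .replace("(", "{").replace(")", "}"))

def pattern_hits(database, pattern):
    """Return {name: (start, end)} for sequences containing the pattern."""
    regex = re.compile(prosite_to_regex(pattern))
    hits = {}
    for name, seq in database.items():
        match = regex.search(seq)
        if match:
            hits[name] = match.span()
    return hits

pattern = "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R"
database = {
    "seq1": "MK" + "LGEAGL" + "AAAAAAA" + "R" + "TT",  # contains the pattern
    "seq2": "MKTAYIAKQRQISFVK",                         # does not
}
hits = pattern_hits(database, pattern)
```

The reported span marks the region around which similarity to the query would then be assessed.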

BLAST PHI-BLAST and PSI-BLAST • The statistical significance of PHI-BLAST is reported using E-values like other forms of BLAST, but the method for computing the E-values is different. • PHI-BLAST is integrated with PSI-BLAST so the results of a PHI-BLAST can be used to initiate one or more iterations of PSI-BLAST searching. • PHI-BLAST is under development and may change substantially over time.

Similarity Matrices Which Similarity Matrix to Use? • Database searches or sequence alignments perform much better if the similarity matrix is based on replacement patterns that correspond to the degree of divergence of the sequences being aligned or discovered. • In database searching, a PAM or Blosum matrix corresponding to an inappropriate degree of divergence can cause you to fail to discover homologous sequences that are present in the database. • Therefore a thorough database search will involve using at least 2 and most likely 3 different matrices. • Using different matrices usually has a higher payoff than using different programs and search algorithms.

Similarity Matrices Comparable Blosum and PAM Matrices

Similarity Matrices What the Comparability Table Means • The comparability is based on matrix entropy. Entropy is defined by information theory as the average amount of information per position in a sequence alignment that is available to determine whether or not a sequence is homologous. • This amount of information is available only if the matrix used in the database search is matched for the appropriate degree of sequence divergence. • As will be shown later this can be used to get a rough indication of whether or not a specific database search result is significant.

Similarity Matrices Scores for nucleic acids • If possible use an amino acid sequence for a database search because: • There is redundancy in the genetic code with up to 6 codons translated as the same amino acid so there is more information in amino acid sequences once the sequences have diverged beyond about 50 PAMs (~60% identical) • Compositional bias is found in many organisms and organelles • Some nucleic acid sequences are derived from messenger RNA while others are genomic DNA with exons and introns, and the exons may be too short to give a significant alignment with a messenger-derived sequence.

Similarity Matrices Search with Nucleic Acids • There are circumstances when there is no choice but to search with nucleic acid sequences. (for 98.9% of the human genome!) • BLAST uses a very long word size, 11, for nucleic acids and the modifications to the heuristic to improve sensitivity for protein sequences do not work as well for nucleic acids because they have only a four letter alphabet and the similarity scores are usually calculated with equal rates of replacement for all of the nucleotides. • Thus FASTA is more sensitive than BLAST for nucleic acid sequences and should be used instead of BLAST if you want to use one of the faster searching programs.

Similarity Matrices Nucleic Acid Matrices • There are replacement matrices for nucleic acids, just as has been recommended for proteins. • Commonly used are PAM 47 scores that assume equal rates of transitions and transversions. This assumption leaves us with only two scores: 5 for identities (matches) and -4 for non-identities (mismatches). • It is possible to create nucleic acid scores that do not assume equal rates of transitions and transversions. • For example, scores that assume a three-to-one transition-to-transversion ratio might be more appropriate than the defaults.
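The difference between the two kinds of matrix is easy to show in code. The +5/-4 values come from the slide; the transition and transversion scores in the second matrix (-1 and -5) are invented for illustration and are not derived from any particular transition-to-transversion ratio. The function names are likewise assumptions.

```python
PURINES = {"A", "G"}

def nucleotide_matrix(match, transition, transversion):
    """Build a 4x4 score table keyed by (base, base)."""
    matrix = {}
    for x in "ACGT":
        for y in "ACGT":
            if x == y:
                matrix[x, y] = match
            elif (x in PURINES) == (y in PURINES):
                matrix[x, y] = transition      # A<->G or C<->T
            else:
                matrix[x, y] = transversion    # purine <-> pyrimidine
    return matrix

def ungapped_score(a, b, matrix):
    return sum(matrix[x, y] for x, y in zip(a, b))

uniform = nucleotide_matrix(5, -4, -4)   # equal rates, as on the slide
biased = nucleotide_matrix(5, -1, -5)    # hypothetical unequal rates

transition_pair = ("ACGT", "GCGT")       # A -> G is a transition
transversion_pair = ("ACGT", "CCGT")     # A -> C is a transversion
```

Under the uniform matrix both pairs score the same; under the biased matrix the pair that differs by a transition scores higher, reflecting that such changes are treated as more likely.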
