Mathematics and computation behind BLAST and FASTA

Mathematics and computation behind BLAST and FASTA Xuhua Xia xxia@uottawa.ca http://dambe.bio.uottawa.ca

Bioinformatics-enabled research Sequence variation: UUCUCAACCAACCAUAAAGAUAU UUCUCUACAAACCACAAAGACAU UUCUCAACCAACCAUAAAGAUAU UUCUCAACCAACCACAAAGACAU UUCUCCACGAACCACAAAGAUAU UUCUCUACAAACCACAAAGAUAU UUCUCAACCAACCACAAAGACAU UUCUCUACUAACCACAAAGACAU • Difference in • coding sequences • Regulatory sequences • transcription • splicing • translation • translated sequences • Difference in • protein abundance • protein structure • cellular localization • protein interaction partners Difference in biochemical function • Difference in • susceptibility to diseases • response to medicine • Fitness (survival and reproductive success) • Difference in phenotype • morphological • physiological • behavioural Personalized medicine Conservation strategies Evolutionary mechanisms ... Nurturing environment Slide 2

Why string matching? • Efficient search against large sequence databases • Practical significance from early applications • Sequence similarity between an oncogene (genes in viruses that cause a cancer-like transformation of the infected cells), v-sis, and the platelet-derived growth factor (PDGF) • M. D. Waterfield et al. 1983. Nature 304:35-39 • R. F. Doolittle et al. 1983. Science 221:275-227 • Contig assembly • Functional annotation by homology search • Fast computational methods in string matching • FASTA • BLAST • Local pair-wise alignment by dynamic programming Slide 3

Basic stats in string matching • Given PA, PC, PG, PT in a target (database) sequence, the probability of a query sequence, say, ATTGCC, having a perfect match of the target sequence is:prob = PAPTPT PGPCPC = PA (PC)2 PG (PT)2 • Let M be the target sequence length and N be the query sequence length, the “matching operation” can be performed (M – N +1) times, e.g., Query: ATGTarget CGATTGCCCG • The probability distribution of the number of matches follows (approximately) a binomial distribution with p = prob and n = (M – N +1) Slide 4

Basic stats in string matching • Probability of having a sequence match: p • Probability of having no match: q = 1-p • Binomial distribution: • When np > 50, the binomial distribution can be approximated by the normal distribution with the mean = np and variance = npq • When np < 1 and n is very large, binomial distribution can be approximated by the Poisson distribution with mean and variance equal to np (i.e.,  = 2 = np). Slide 5

From Binomial to Poisson Slide 6

Matching two sequences without gap • Assuming equal nucleotide frequencies, the probability of a nucleotide site in the query sequence matching a site in the target sequence is p = 0.25. • The probability of finding an exact match of L letters is a = pL = 0.25L = 2-2L = 2-S, where S is called the bit score in BLAST. • M: query length; N: target length, e.g., M = 8, N = 5, L = 3AACGGTTCCGGTT • A sequence of length L can move at (M – L +1) distinct sites along the query and (N – L +1) distinct sites along the target. • m = (M-L+1) and n = (N-L+1) are called effective lengths of the two sequences. • The expected number of matches with length L is mn2-S, which is called E-value in ungapped BLAST. • S is calculated differently in the gapped BLAST Slide 7

Blast Output (Nuc. Seq.) BLASTN 2.2.4 [Aug-26-2002] ... Query= Seq1 38 Database: MgCDS 480 sequences; 526,317 total letters Score E Sequences producing significant alignments: (bits) Value MG001 1095 bases 34 7e-004 Score = 34.2 bits (17), Expect = 7e-004 Identities = 35/40 (87%), Gaps = 2/40 (5%) Query: 1 atgaataacg--attatttccaacgacaaaacaaaaccac 38 |||||||||| ||||||||||| |||||| |||||||| Sbjct: 1 atgaataacgttattatttccaataacaaaataaaaccac 40 Lambda K H 1.37 0.711 1.31 Matrix: blastn matrix:1 -3 Gap Penalties: Existence: 5, Extension: 2 … effective length of query: 26 effective length of database: 520,557 Constant gap penalty vs affine function penalty Typically one would count only 1 GE here. Matches: 35*1 = 35 Mismatches: 3*(-3) = -9 Gap Open: 1*5 = 5 Gap extension: 2*2 =4 R = 35 - 9 - 5 - 4 = 17 S = [λR – ln(K)]/ln(2) =[1.37*17-ln(0.711)]/ln(2) = 34 E = mn2-S = 26 * 520557 * 2-34 = 7.878E-04 x p(x) 0 0.999265217 1 0.000734513 2 0.000000270 3 0.000000000 Slide 8

Lambda () and K BLAST output includes lambda () and K. Mathematically,  is defined as follows: where pi, pj are nucleotide frequencies (i,j = A, C, G, or T), and sij is the match (when i = j) or mismatch (when i  j) score. In nucleotide BLAST by default, we have sii = 1 and sij = -3. In the simplest case with equal nucleotide frequencies, i.e., when pi = 0.25, the equation above is reduced to (for amino acid sequences) See the updated Chapter 1 and BLASTParameter.xlsm on how to compute K.

E-Value in BLAST • The e-value is the expected number of random matches that is equally good or better than the reported match. It can be a number near zero or much larger than 1. • It is NOT the probability of finding the reported match. • Only when the e-value is extremely small can it be interpreted as the probability of finding 1 match that is as good as the reported one (see next slide). Slide 10

E-value and P(1) Slide 11

Gapped BLAST • Adapted from Crane & Raymer 2003 • Input sequence: AILVPTVIGCTVPT • Algorithm: • Break the query sequence into words:AILV, ILVP, LVPT, VPTV, PTVI, TVIG, VIGC, IGCT, GCTV, CTVP, TVPT • Discard common words (i.e., words made entirely of common amino acids) • Search for matches against database sequences, assess significance and decide whether to discard to continue with extension using dynamic programming:AILVPTVIGCTVPTMVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC Slide 12

BLAST Programs Slide 13

FASTA • Another commonly used family of alignment and search tools • Generally considered to be more sensitive than BLAST. • Illustration with two fictitious sequences used in the Contig Assembly lecture:Seq1: ACCGCGATGACGAATASeq2: GAATACGACTGACGATGGASeq1: ACCGCGATGACGAATASeq2: GAATACGACTGACGATGGA Slide 14

String Match in FASTA Left and Right: -n means moving the query left by n sites and n means moving the query right by n sites. Slide 15

Alternative Matched Strings Query: ACCGCGATGACGAATA Target:GAATACGACTGACGATGGA From lecture on contig assembly: Query: ACCGCGATGACGAATA Target: GAATACGACTGACGATGGA From FASTA algorithm: Query: ACCGCGATGACGAATA Target: GAATACGACTGACGATGGA Query: ACCGCGATGACGAATA Target: GAATACGACTGACGATGGA Which one is best based on YOUR judgment? One of the three 3rd best Best 2nd best Slide 16

Word length of 2 Query: ACCGCGATGACGAATA Target: GAATACGACTGACGATGGA Query: ACCGCGATGACGAATA Target: GAATACGACTGACGATGGA One of the three 2nd best Best Slide 17

Comparison: BLAST and FASTA • BLAST starts with exact string matching, while FASTA starts with inexact string matching (or exact string matching with a shorter words). BLAST is faster than FASTA. • For the examples given, both BLAST and FASTA will find the same best match, i.e., shifting the query sequence by 2 sites to the right. • Both perform dynamic programming for extending the match after the initial match. Slide 18

Optional: BLAST Parameters • Lambda  and Karlin-Altschul (K) parameters are important because they directly affect the computation of E value. • Both  and K depend on • nucleotide (or aminon acid) frequencies • match-mismatch matrix • All BLAST implementations generally assume that nucleotide (or amino acid) sequences have roughly equal frequencies. • For nucleotide (or amino acid) sequences with strongly biased frequencies, BLAST E value obtained with the assumption can be quite misleading, i.e., one should use appropriate  and K.

Case 1: equal , (-3,1)

Case 2: Different , (-3, 1)

Case 3: Different , s/v

K: case 1

K: Case 2

Bioinformatics research workflow Accumulation of nucleotide and amino acid sequences: UUCUCAACCAACCAUAAAGAUAU UUCUCUACAAACCACAAAGACAU UUCUCAACCAACCAUAAAGAUAU UUCUCAACCAACCACAAAGACAU UUCUCCACGAACCACAAAGAUAU UUCUCUACAAACCACAAAGAUAU UUCUCAACCAACCACAAAGACAU UUCUCUACUAACCACAAAGACAU Functional genomics Systems biology Digital cells • Storage and annotation of the sequences • Structural annotation with homology search and de novo gene prediction • Functional annotation with gene ontologies Species-specific gene dictionaries, e.g., yeastgenome.org • Gene/Protein families (e.g., Pfam) • Cluster of orthologous genes (e.g., COG) • Supermatrix of gene presence/absence • Genome-based pair-wise distance distributions • Comparative genomics (the origin of new genes, new features and new species) • Phylogenetics (cladogenic process, dating of speciation and gene duplication events) • Phylogeny-based inference. Mutation Selection Adaptation Slide 25

Mathematics and computation behind BLAST and FASTA