1 / 25

Mathematics and computation behind BLAST and FASTA

Mathematics and computation behind BLAST and FASTA. Xuhua Xia xxia@uottawa.ca http://dambe.bio.uottawa.ca. Bioinformatics-enabled research. Sequence variation: UU C U C AA CC AA CC A U AAA G A U A U UU C U C U A C AAA CC A C AAA G A C A U UU C U C AA CC AA CC A U AAA G A U A U

desma
Télécharger la présentation

Mathematics and computation behind BLAST and FASTA

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Mathematics and computation behind BLAST and FASTA Xuhua Xia xxia@uottawa.ca http://dambe.bio.uottawa.ca

  2. Bioinformatics-enabled research Sequence variation: UUCUCAACCAACCAUAAAGAUAU UUCUCUACAAACCACAAAGACAU UUCUCAACCAACCAUAAAGAUAU UUCUCAACCAACCACAAAGACAU UUCUCCACGAACCACAAAGAUAU UUCUCUACAAACCACAAAGAUAU UUCUCAACCAACCACAAAGACAU UUCUCUACUAACCACAAAGACAU • Difference in • coding sequences • Regulatory sequences • transcription • splicing • translation • translated sequences • Difference in • protein abundance • protein structure • cellular localization • protein interaction partners Difference in biochemical function • Difference in • susceptibility to diseases • response to medicine • Fitness (survival and reproductive success) • Difference in phenotype • morphological • physiological • behavioural Personalized medicine Conservation strategies Evolutionary mechanisms ... Nurturing environment Slide 2

  3. Why string matching? • Efficient search against large sequence databases • Practical significance from early applications • Sequence similarity between an oncogene (genes in viruses that cause a cancer-like transformation of the infected cells), v-sis, and the platelet-derived growth factor (PDGF) • M. D. Waterfield et al. 1983. Nature 304:35-39 • R. F. Doolittle et al. 1983. Science 221:275-227 • Contig assembly • Functional annotation by homology search • Fast computational methods in string matching • FASTA • BLAST • Local pair-wise alignment by dynamic programming Slide 3

  4. Basic stats in string matching • Given PA, PC, PG, PT in a target (database) sequence, the probability of a query sequence, say, ATTGCC, having a perfect match of the target sequence is:prob = PAPTPT PGPCPC = PA (PC)2 PG (PT)2 • Let M be the target sequence length and N be the query sequence length, the “matching operation” can be performed (M – N +1) times, e.g., Query: ATGTarget CGATTGCCCG • The probability distribution of the number of matches follows (approximately) a binomial distribution with p = prob and n = (M – N +1) Slide 4

  5. Basic stats in string matching • Probability of having a sequence match: p • Probability of having no match: q = 1-p • Binomial distribution: • When np > 50, the binomial distribution can be approximated by the normal distribution with the mean = np and variance = npq • When np < 1 and n is very large, binomial distribution can be approximated by the Poisson distribution with mean and variance equal to np (i.e.,  = 2 = np). Slide 5

  6. From Binomial to Poisson Slide 6

  7. Matching two sequences without gap • Assuming equal nucleotide frequencies, the probability of a nucleotide site in the query sequence matching a site in the target sequence is p = 0.25. • The probability of finding an exact match of L letters is a = pL = 0.25L = 2-2L = 2-S, where S is called the bit score in BLAST. • M: query length; N: target length, e.g., M = 8, N = 5, L = 3AACGGTTCCGGTT • A sequence of length L can move at (M – L +1) distinct sites along the query and (N – L +1) distinct sites along the target. • m = (M-L+1) and n = (N-L+1) are called effective lengths of the two sequences. • The expected number of matches with length L is mn2-S, which is called E-value in ungapped BLAST. • S is calculated differently in the gapped BLAST Slide 7

  8. Blast Output (Nuc. Seq.) BLASTN 2.2.4 [Aug-26-2002] ... Query= Seq1 38 Database: MgCDS 480 sequences; 526,317 total letters Score E Sequences producing significant alignments: (bits) Value MG001 1095 bases 34 7e-004 Score = 34.2 bits (17), Expect = 7e-004 Identities = 35/40 (87%), Gaps = 2/40 (5%) Query: 1 atgaataacg--attatttccaacgacaaaacaaaaccac 38 |||||||||| ||||||||||| |||||| |||||||| Sbjct: 1 atgaataacgttattatttccaataacaaaataaaaccac 40 Lambda K H 1.37 0.711 1.31 Matrix: blastn matrix:1 -3 Gap Penalties: Existence: 5, Extension: 2 … effective length of query: 26 effective length of database: 520,557 Constant gap penalty vs affine function penalty Typically one would count only 1 GE here. Matches: 35*1 = 35 Mismatches: 3*(-3) = -9 Gap Open: 1*5 = 5 Gap extension: 2*2 =4 R = 35 - 9 - 5 - 4 = 17 S = [λR – ln(K)]/ln(2) =[1.37*17-ln(0.711)]/ln(2) = 34 E = mn2-S = 26 * 520557 * 2-34 = 7.878E-04 x p(x) 0 0.999265217 1 0.000734513 2 0.000000270 3 0.000000000 Slide 8

  9. Lambda () and K BLAST output includes lambda () and K. Mathematically,  is defined as follows: where pi, pj are nucleotide frequencies (i,j = A, C, G, or T), and sij is the match (when i = j) or mismatch (when i  j) score. In nucleotide BLAST by default, we have sii = 1 and sij = -3. In the simplest case with equal nucleotide frequencies, i.e., when pi = 0.25, the equation above is reduced to (for amino acid sequences) See the updated Chapter 1 and BLASTParameter.xlsm on how to compute K.

  10. E-Value in BLAST • The e-value is the expected number of random matches that is equally good or better than the reported match. It can be a number near zero or much larger than 1. • It is NOT the probability of finding the reported match. • Only when the e-value is extremely small can it be interpreted as the probability of finding 1 match that is as good as the reported one (see next slide). Slide 10

  11. E-value and P(1) Slide 11

  12. Gapped BLAST • Adapted from Crane & Raymer 2003 • Input sequence: AILVPTVIGCTVPT • Algorithm: • Break the query sequence into words:AILV, ILVP, LVPT, VPTV, PTVI, TVIG, VIGC, IGCT, GCTV, CTVP, TVPT • Discard common words (i.e., words made entirely of common amino acids) • Search for matches against database sequences, assess significance and decide whether to discard to continue with extension using dynamic programming:AILVPTVIGCTVPTMVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC Slide 12

  13. BLAST Programs Slide 13

  14. FASTA • Another commonly used family of alignment and search tools • Generally considered to be more sensitive than BLAST. • Illustration with two fictitious sequences used in the Contig Assembly lecture:Seq1: ACCGCGATGACGAATASeq2: GAATACGACTGACGATGGASeq1: ACCGCGATGACGAATASeq2: GAATACGACTGACGATGGA Slide 14

  15. String Match in FASTA Left and Right: -n means moving the query left by n sites and n means moving the query right by n sites. Slide 15

  16. Alternative Matched Strings Query: ACCGCGATGACGAATA Target:GAATACGACTGACGATGGA From lecture on contig assembly: Query: ACCGCGATGACGAATA Target: GAATACGACTGACGATGGA From FASTA algorithm: Query: ACCGCGATGACGAATA Target: GAATACGACTGACGATGGA Query: ACCGCGATGACGAATA Target: GAATACGACTGACGATGGA Which one is best based on YOUR judgment? One of the three 3rd best Best 2nd best Slide 16

  17. Word length of 2 Query: ACCGCGATGACGAATA Target: GAATACGACTGACGATGGA Query: ACCGCGATGACGAATA Target: GAATACGACTGACGATGGA One of the three 2nd best Best Slide 17

  18. Comparison: BLAST and FASTA • BLAST starts with exact string matching, while FASTA starts with inexact string matching (or exact string matching with a shorter words). BLAST is faster than FASTA. • For the examples given, both BLAST and FASTA will find the same best match, i.e., shifting the query sequence by 2 sites to the right. • Both perform dynamic programming for extending the match after the initial match. Slide 18

  19. Optional: BLAST Parameters • Lambda  and Karlin-Altschul (K) parameters are important because they directly affect the computation of E value. • Both  and K depend on • nucleotide (or aminon acid) frequencies • match-mismatch matrix • All BLAST implementations generally assume that nucleotide (or amino acid) sequences have roughly equal frequencies. • For nucleotide (or amino acid) sequences with strongly biased frequencies, BLAST E value obtained with the assumption can be quite misleading, i.e., one should use appropriate  and K.

  20. Case 1: equal , (-3,1)

  21. Case 2: Different , (-3, 1)

  22. Case 3: Different , s/v

  23. K: case 1

  24. K: Case 2

  25. Bioinformatics research workflow Accumulation of nucleotide and amino acid sequences: UUCUCAACCAACCAUAAAGAUAU UUCUCUACAAACCACAAAGACAU UUCUCAACCAACCAUAAAGAUAU UUCUCAACCAACCACAAAGACAU UUCUCCACGAACCACAAAGAUAU UUCUCUACAAACCACAAAGAUAU UUCUCAACCAACCACAAAGACAU UUCUCUACUAACCACAAAGACAU Functional genomics Systems biology Digital cells • Storage and annotation of the sequences • Structural annotation with homology search and de novo gene prediction • Functional annotation with gene ontologies Species-specific gene dictionaries, e.g., yeastgenome.org • Gene/Protein families (e.g., Pfam) • Cluster of orthologous genes (e.g., COG) • Supermatrix of gene presence/absence • Genome-based pair-wise distance distributions • Comparative genomics (the origin of new genes, new features and new species) • Phylogenetics (cladogenic process, dating of speciation and gene duplication events) • Phylogeny-based inference. Mutation Selection Adaptation Slide 25

More Related