
CS 5263 Bioinformatics



Presentation Transcript


  1. CS 5263 Bioinformatics Lecture 7: Heuristic Sequence Alignment Algorithms (BLAST)

  2. Roadmap • Last lecture review • Sequence alignment statistics • Today • Gene finding by alignment • Heuristic alignment algorithms • Basic Local Alignment Search Tools

  3. Sequence Alignment Statistics • Substitution matrices • How is the BLOSUM matrix made • How to make your own substitution matrix • What’s the meaning of an arbitrary substitution matrix • Significance of sequence alignments • P-value estimation for • Global alignment scores • Local alignment scores

  4. What is a p-value? • The probability of observing an effect as strong as or stronger than the one you observed, given the null hypothesis. i.e., “How likely is this effect to occur by chance?” • Pr(x ≥ S | null)

  5. What is a null-hypothesis? • A statistician’s way of characterizing “chance.” • Generally, a mathematical model of randomness with respect to a particular set of observations. • The purpose of most statistical tests is to determine whether the observed data can be explained by the null hypothesis.

  6. For sequence alignment • Your null hypothesis is “the two sequences are unrelated” • Your alternative hypothesis is “the two sequences are related” • You obtained a score S • How likely is it that you would obtain such a score if the null hypothesis were true?

  7. How to test that null-hypothesis? • Randomly generate some pairs of “unrelated” sequences • See what alignment scores you may get for those “unrelated” sequences • Must keep other factors in mind • Your random sequences must be as close as possible to your true sequences • Except that they are “unrelated” (i.e., not from a common ancestor)
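The shuffling procedure above can be sketched in a few lines of Python; `score_fn` is a placeholder for any alignment scorer (hypothetical, not a specific library call):

```python
import random

def shuffle_seq(seq):
    """Return a random permutation of seq: same length and letter
    composition as the real sequence, but "unrelated" to its partner."""
    chars = list(seq)
    random.shuffle(chars)
    return "".join(chars)

def empirical_pvalue(score_fn, x, y, real_score, n_shuffles=200):
    """Estimate P(random score >= real_score) by shuffling y n_shuffles
    times and re-aligning against x."""
    hits = sum(score_fn(x, shuffle_seq(y)) >= real_score
               for _ in range(n_shuffles))
    # If no shuffled pair scores as high, we can only bound the p-value
    # from above by 1/n_shuffles.
    return max(hits, 1) / n_shuffles
```

Note the bound in the last line: with 200 shuffles the smallest reportable p-value is 1/200 = 0.005, which is exactly the "cannot estimate small p-values" limitation discussed on the next slides.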

  8. Possible ways to get unrelated sequences • Which is better? • Randomly pick some sequences from a database • Randomly pick some sequences from a database and truncate to the same length as your real sequences • Generate random sequences according to the frequency that each letter is used by your real sequences • Randomly shuffle your sequences

  9. Possible ways to get random sequences • Which is better? • Randomly pick some sequences from a database • Randomly pick some sequences from a database and truncate to the same length as your real sequences • Generate random sequences according to the frequency that each letter is used by your real sequences • Randomly shuffle your sequences

  10. Random shuffling is what we do to estimate p-values for global alignment

  11. [Figure: alignment of mouse HEXA vs. human HEXA, score = 732]

  12. [Figure: distribution of the alignment scores between mouse HEXA and 200 randomly shuffled human HEXA sequences; the real score of 732 lies far beyond the right tail] • P-value: less than 1/200 = 0.005

  13. Advantages • Easy to implement • You don’t need to know a lot of theory to do this • Disadvantages • Slow • Cannot estimate small p-values • If we had repeated 1,000 times, would we get a score as high as 732? • Based on what I’ve already seen, I would guess probably not • What about 1,000,000 times?

  14. When theory exists • Things get much better • You don’t really need to go there to know what’s there • (I know roughly how many times you can get a score as high as 732 if you repeat your experiment a billion times…) • That is what happened for local alignment

  15. For ungapped alignment between two sequences of lengths M and N: E(S) = KMN exp(-λS) (the expected value, or E-value) • K and λ depend on sequence composition and the substitution matrix • Can be obtained either empirically or analytically

  16. P-value = 1 - exp(-E(S)) ≈ E(S) when P is small.
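Both formulas are one-liners; this sketch uses the constants from the worked example later in the deck (K = 0.1, λ = ln 3), which are assumptions tied to that particular scoring scheme:

```python
import math

def evalue(S, m, n, K, lam):
    """Karlin-Altschul E-value: expected number of ungapped local
    alignments scoring >= S between sequences of lengths m and n."""
    return K * m * n * math.exp(-lam * S)

def pvalue(E):
    """P = 1 - exp(-E); for small E, P is approximately E."""
    return 1.0 - math.exp(-E)
```

For example, `evalue(20, 1000, 1000, 0.1, math.log(3))` gives about 2.9 × 10^-5, matching the slide-19 calculation.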

  17. [Figure: distribution of alignment scores for 1000 random sequence pairs, with an extreme value distribution curve overlaid] • Theory says the score distribution should have this shape • The experiment shows that the theory seems correct

  18. Example • You are aligning two sequences, each with 1000 bases • m = 1, s = -1, d = -∞ (ungapped alignment) • You obtain a score of 20 • Is this score significant?

  19. λ = ln 3 ≈ 1.1 • E(S) = KMN exp(-λS) • E(20) = 0.1 × 1000 × 1000 × 3^-20 ≈ 3 × 10^-5 • P-value ≈ 3 × 10^-5 << 0.05 • The alignment is significant

  20. [Figure: score distribution of 1000 random sequence pairs; the observed score of 20 lies far beyond the right tail]

  21. Multiple-testing problem • You are searching a 1000-base sequence against a database of 10^6 sequences (average length 1000 bases) • You get a score of 20 • You are essentially comparing 1000 bases with 1000 × 10^6 = 10^9 bases (ignoring edge effects) • E(20) = 0.1 × 1000 × 10^9 × 3^-20 ≈ 30 • By chance we would expect to see about 30 matches this good • P-value = 1 - e^-30 ≈ 1 • Not significant at all
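The same formula reproduces the database-search numbers above (again assuming K = 0.1 and λ = ln 3 from the earlier example):

```python
import math

# 1000-base query vs. a database of 10**6 sequences of ~1000 bases each
K, lam, S = 0.1, math.log(3), 20
m, n = 1000, 1000 * 10**6          # effective search space; edge effects ignored
E = K * m * n * math.exp(-lam * S) # ~29 expected chance matches (slide rounds to 30)
P = 1 - math.exp(-E)               # ~1: a score of 20 is not significant here
```

The only change from the two-sequence case is n: the query is effectively compared against the whole concatenated database, so the expected count scales up by a factor of 10^6.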

  22. A better way to understand p-values • Your p-value is 0.99999 • You have very low confidence (< 0.00001) in saying that the null hypothesis is wrong • Is the null hypothesis true, then (i.e., the two sequences are unrelated)? • You don’t know • Your test was not designed to tell you that

  23. In practice • You search the sequence against a database of 10^6 sequences • You get 35 matches • You expect to get about 30 by chance • All 35 could be real, or none, or some • You have already reduced your target from 10^6 sequences to 35 • Take all 35 sequences with caution and look for other evidence

  24. Statistics for gapped local alignment • Theory not well developed • Extreme value distribution works well empirically • Need to estimate K and λ empirically
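One common way to do the empirical estimation is a method-of-moments fit of the extreme value (Gumbel) distribution to scores from shuffled sequence pairs; this sketch (function name is my own) uses the standard Gumbel moment identities, stdev = π/(λ√6) and mean = μ + γ/λ:

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329

def fit_gumbel(scores):
    """Method-of-moments fit of a Gumbel distribution to a sample of
    alignment scores: solve the moment identities for lambda and mu."""
    lam = math.pi / (statistics.stdev(scores) * math.sqrt(6))
    mu = statistics.mean(scores) - EULER_GAMMA / lam
    return mu, lam
```

Given `mu` and `lam`, the p-value of an observed score S is then 1 - exp(-exp(-lam * (S - mu))), the Gumbel tail probability.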

  25. Exercising FSA • How do you make an FSA for the Needleman-Wunsch algorithm?

  26. Exercising FSA • How do you make an FSA for the Needleman-Wunsch algorithm? [Diagram: three states F, Ix, Iy; F has self-loop (xi, yj)/σ; Ix is entered and extended by (xi, -)/d, Iy by (-, yj)/d; transitions (xi, yj)/σ return from Ix and Iy to F]

  27. Simplify • [Diagram: two states F and I; F has self-loop (xi, yj)/σ; I is entered and extended by (xi, -)/d or (-, yj)/d; (xi, yj)/σ returns from I to F]

  28. Simplify more • [Diagram: a single state F with self-loops (xi, yj)/σ, (xi, -)/d, and (-, yj)/d] • F(i,j) = max { F(i-1, j-1) + σ(xi, yj), F(i-1, j) + d, F(i, j-1) + d }
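The single-state recurrence maps directly to code; a minimal sketch with a linear gap penalty d and a pluggable scoring function sigma:

```python
def needleman_wunsch(x, y, sigma, d):
    """Global alignment score via the one-state recurrence:
    F(i,j) = max(F(i-1,j-1) + sigma(xi,yj), F(i-1,j) + d, F(i,j-1) + d)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: all gaps in y
        F[i][0] = F[i-1][0] + d
    for j in range(1, n + 1):          # first row: all gaps in x
        F[0][j] = F[0][j-1] + d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i-1][j-1] + sigma(x[i-1], y[j-1]),
                          F[i-1][j] + d,
                          F[i][j-1] + d)
    return F[m][n]
```

A traceback through F would recover the alignment itself; only the score is kept here to stay close to the recurrence on the slide.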

  29. A more difficult alignment problem • (A gene finder, indeed!) • X is a genomic sequence (DNA) • X encodes a gene • May contain introns • Y is an ORF from another species • Contains only exons • We want to compare X against Y • Conservation is at the amino acid level

  30. [Figure: from gene to ORF. DNA is transcribed into pre-mRNA (5’ UTR, exons, introns, 3’ UTR); splicing removes the introns, giving the mature mRNA whose open reading frame (ORF) runs from the start codon to the stop codon]

  31. We have a predicted gene • We know the positions of the start codon and stop codon • But we don’t know where the splice sites are • Not even the number of introns • [Diagram: gene with start codon, exons, introns, stop codon]

  32. Most splice sites start with GT and end with AG • But there are lots of GTs and AGs in the sequence • Aligning to an orthologous gene with a known ORF may help us determine the splice sites • Orthologous genes: two genes that evolved from the same ancestor • Coding regions are likely conserved at the amino acid level • UUA and UUG encode the same amino acid • So do UCA, UCU, UCG, and UCC • [Diagram: mouse putative gene with a GT…AG intron, aligned against a human ORF]

  33. [Figure: the genetic code — the 64-codon table arranged by first, second, and third letter]

  34. If we know where the exons are • Easy • [Diagram: mouse putative gene → remove introns → mouse putative ORF → translate → global alignment ← translate ← human ORF]
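The translate step is easy to make concrete; here is a sketch that builds the standard genetic code from the compact TCAG-ordered encoding (variable names are my own, and the input is assumed to be a clean ORF whose length is a multiple of 3):

```python
# Standard genetic code: 64 codons in TCAG order, stops written as '*'.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
               for i, b1 in enumerate(BASES)
               for j, b2 in enumerate(BASES)
               for k, b3 in enumerate(BASES)}

def translate(orf):
    """Translate a DNA ORF (length a multiple of 3) into amino acids."""
    return "".join(CODON_TABLE[orf[i:i+3]] for i in range(0, len(orf), 3))
```

With both ORFs translated this way, the comparison reduces to an ordinary global protein alignment with a 20 × 20 substitution matrix such as BLOSUM.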

  35. Or directly align triplets • [Diagram: mouse putative gene → remove introns → mouse putative ORF → global alignment (on codons) ← human ORF]

  36. Codon substitution scores • A 64 × 64 substitution matrix

  37. FSA for aligning genomic DNA to ORF (considering only exons) • [Diagram: state A (codon match) with self-loop (xi-2xi-1xi, yj-2yj-1yj)/σ; state B (codon gap) entered from A by (xi-2xi-1xi, -) or (-, yj-2yj-1yj)/d and extended by the same moves at cost e; (xi-2xi-1xi, yj-2yj-1yj)/σ returns from B to A]

  38. We don’t know exactly where the splice sites are • The length of an intron may not be a multiple of 3 • If we convert the whole sequence into triplets, the reading frame may shift • [Diagram: mouse putative gene with an intron labeled “17 bases?”, aligned against a human ORF]

  39. Model introns • Most splice sites start with GT and end with AG • For simplicity, assume the length of each exon is a multiple of 3 • Not true in reality • Only a little more work is needed without this assumption • [Diagram: mouse putative gene with a GT…AG intron, 126 nt = 42 aa; human ORF, 120 nt = 40 aa]

  40. Aligning genomic DNA to ORF • A fixed cost for having an intron • Alignment with an affine gap penalty

  41. FSA for aligning genomic DNA to ORF (considering only exons) • [Diagram: states A (codon match, self-loop (xi-2xi-1xi, yj-2yj-1yj)/σ) and B (codon gap, opened at cost d, extended at cost e)]

  42. FSA for aligning genomic DNA to ORF • [Diagram: the same A and B states, plus a new state C entered from B by (-, GT)/s] • Start an intron

  43. FSA for aligning genomic DNA to ORF • [Diagram: as before, with a self-loop (-, yj)/0 added on state C] • Continue in intron

  44. FSA for aligning genomic DNA to ORF • [Diagram: as before, with a transition (-, AG)/s added from C back to A] • Close an intron

  45. [Diagram: the full three-state FSA] • A(i, j) = max { A(i-3, j-3) + σ(xi-2xi-1xi, yj-2yj-1yj), B(i-3, j-3) + σ(xi-2xi-1xi, yj-2yj-1yj), C(i, j-2) + s if yj-1yj == ‘AG’ }

  46. [Diagram: the full three-state FSA] • B(i, j) = max { A(i, j-3) + d, A(i-3, j) + d, B(i, j-3) + e, B(i-3, j) + e }

  47. [Diagram: the full three-state FSA] • C(i, j) = max { B(i, j-2) + s if yj-1yj == ‘GT’, C(i, j-1) }
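Collecting the three recurrences into one dynamic program gives the following sketch (function and penalty names are illustrative; note that it follows the recurrences exactly as written, so the intron letters GT…AG are consumed from y):

```python
NEG = float("-inf")

def align_dna_to_orf(x, y, sigma, d, e, s):
    """Sketch of the three-state DP.  A: codon-vs-codon alignment,
    B: codon gaps with affine penalty (open d, extend e), C: inside an
    intron.  sigma scores a pair of triplets; s is the fixed cost paid
    at each GT / AG intron boundary."""
    m, n = len(x), len(y)
    A = [[NEG] * (n + 1) for _ in range(m + 1)]
    B = [[NEG] * (n + 1) for _ in range(m + 1)]
    C = [[NEG] * (n + 1) for _ in range(m + 1)]
    A[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i >= 3 and j >= 3:                      # align two codons
                cod = sigma(x[i-3:i], y[j-3:j])
                A[i][j] = max(A[i][j], A[i-3][j-3] + cod, B[i-3][j-3] + cod)
            if j >= 2 and y[j-2:j] == "AG":            # close an intron
                A[i][j] = max(A[i][j], C[i][j-2] + s)
            if i >= 3:                                 # codon gap (consume x)
                B[i][j] = max(B[i][j], A[i-3][j] + d, B[i-3][j] + e)
            if j >= 3:                                 # codon gap (consume y)
                B[i][j] = max(B[i][j], A[i][j-3] + d, B[i][j-3] + e)
            if j >= 2 and y[j-2:j] == "GT":            # start an intron
                C[i][j] = max(C[i][j], B[i][j-2] + s)
            if j >= 1:                                 # continue in intron, cost 0
                C[i][j] = max(C[i][j], C[i][j-1])
    return max(A[m][n], B[m][n])
```

Because A and B move in steps of 3 while C moves one base at a time, the intron state is exactly what absorbs length differences that are not multiples of 3.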

  48. ACGGATGCGATCAGTTGTACTACGAGCTGACGGTCCTCAGACTTGATTA

  49. There is a close relationship between dynamic programming, FSAs, regular expressions, and regular grammars • Using FSAs, you can design more complex alignment algorithms • If you can draw the state diagram for a problem, it can easily be formulated as a DP problem • In particular, Hidden Markov Models • Will discuss more in a few weeks

  50. Heuristic Local Aligners: BLAST and the like
