
CS 5263 Bioinformatics


Presentation Transcript


  1. CS 5263 Bioinformatics Lecture 5: Local Sequence Alignment Algorithms

  2. Poll • Who has learned and still remembers Finite State Machines/Automata, regular grammars, and context-free grammars?

  3. Roadmap • Review of last lecture • Local Sequence Alignment • Statistics of sequence alignment • Substitution matrix • Significance of alignment

  4. Bounded Dynamic Programming • O(kM) time • O(kM) memory • Possibly O(M+k) memory [Figure: DP matrix for x1…xM vs. y1…yN with a band of width k around the main diagonal]
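To make the banded idea concrete, here is a minimal Python sketch; the function name, the simple match/mismatch/gap values, and the test call are illustrative assumptions, not from the slides. For clarity it allocates a full (M+1) x (N+1) table, so it only illustrates the O(kM) time bound; storing just the band would give the reduced memory mentioned above.

```python
# Banded global alignment: only cells with |i - j| <= k are filled in.
# Scoring values (match/mismatch/gap) are illustrative assumptions.
def banded_global_align(x, y, k, match=1, mismatch=-2, gap=-2):
    NEG = float("-inf")
    M, N = len(x), len(y)
    F = [[NEG] * (N + 1) for _ in range(M + 1)]  # cells outside the band stay -inf
    F[0][0] = 0
    for j in range(1, min(k, N) + 1):
        F[0][j] = j * gap
    for i in range(1, min(k, M) + 1):
        F[i][0] = i * gap
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):  # O(k) columns per row
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # match / mismatch
                          F[i - 1][j] + gap,     # gap in y
                          F[i][j - 1] + gap)     # gap in x
    return F[M][N]

print(banded_global_align("GACGCCGAACG", "GACGCACG", k=4))
```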

  5. Linear-space alignment • O(M+N) memory • 2MN time [Figure: divide-and-conquer at row M/2; the optimal path crosses the middle row at some column k*, splitting the problem into subproblems of size (M/2, k*) and (M/2, N-k*)]
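A sketch of the core of the linear-space idea, under the same illustrative scoring assumptions as the previous sketch: Needleman-Wunsch scores are computed keeping only the previous row, and the divide-and-conquer step finds the column k* where an optimal path crosses row M/2 by combining a forward pass on the top half with a backward pass on the reversed bottom half.

```python
def nw_last_row(x, y, match=1, mismatch=-2, gap=-2):
    # Global-alignment scores of x against every prefix of y, in O(len(y)) memory.
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap] + [0] * len(y)
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev

def split_point(x, y):
    # Column k* at which an optimal path crosses row M/2; recursing on the two
    # halves yields the full alignment in O(M+N) memory and roughly 2MN time.
    mid = len(x) // 2
    top = nw_last_row(x[:mid], y)
    bottom = nw_last_row(x[mid:][::-1], y[::-1])[::-1]
    return max(range(len(y) + 1), key=lambda k: top[k] + bottom[k])
```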

  6. Graph representation of seq alignment • An optimal alignment is a longest path from (0, 0) to (m, n) on the alignment graph [Figure: alignment graph from (0,0) to (3,4) with weight +1 on match edges and -1 on mismatch/gap edges]

  7. Question • If I change the scoring scheme, will it change the alignment? • Match = 1, mismatch = gap = -2, changed to match = 2, mismatch = gap = -1? • Answer: Yes

  8. Proof • Let F1 be the score of an optimal alignment under the scoring scheme • Match = m > 0 • Mismatch = s < 0 • Gap = d < 0 • Let a1, b1, c1 be the number of matches, mismatches, and gaps in the alignment • F1 = a1m + b1s + c1d

  9. Proof (cont’) • Let F2 be the score of a sub-optimal alignment under the same scoring scheme • Let a2, b2, c2 be the number of matches, mismatches, and gaps in the alignment • F2 = a2m + b2s + c2d • Let F1 = F2 + k, where k > 0

  10. Proof (cont’) • Now we change the scoring scheme, so that • Match = m + 1 • Mismatch = s + 1 • Gap = d + 1

  11. Proof (cont’) • The new scores for the two alignments become: F1’ = a1(m+1) + b1(s+1) + c1(d+1) = a1m + b1s + c1d + (a1+b1+c1) = F1 + L1, and F2’ = a2(m+1) + b2(s+1) + c2(d+1) = F2 + (a2+b2+c2) = F2 + L2, where L1 and L2 are the lengths of alignments 1 and 2

  12. Proof (cont’) • F1’ – F2’ = F1 – F2 + (a1+b1+c1) – (a2+b2+c2) = k + L1 – L2 • In order for F1’ < F2’, we need k + L1 – L2 < 0, i.e. L2 – L1 > k

  13. Proof (cont’) • This means: if under the original scoring scheme F1 is greater than F2 by k, but alignment 2 is at least k+1 longer than alignment 1, then F2’ will be greater than F1’ under the new scoring scheme • We only need one example showing that such a pair of alignments exists

  14. Example: F1 = 2m + 3s (alignment 1: 2 matches, 3 mismatches), F2 = 3m + 4d (alignment 2: 3 matches, 4 gaps) [Figure: the two alignment paths through the alignment graph]

  15. With m = 1, s = d = –2: F1 = 2 – 6 = –4, F2 = 3 – 8 = –5, so F1 > F2

  16. With m = 2, s = d = –1: F1’ = 4 – 3 = 1, F2’ = 6 – 4 = 2, so F2’ > F1’

  17. Concrete example: aligning AACAG with ATCGT. Alignment 1 (no gaps): AACAG over ATCGT has 2 matches and 3 mismatches, so F1 = 2x1 – 3x2 = –4 and F1’ = 2x2 – 3x1 = 1. Alignment 2 (with gaps): AA-CAG- over -ATC-GT has 3 matches and 4 gaps, so F2 = 3x1 – 4x2 = –5 and F2’ = 3x2 – 4x1 = 2
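A quick numeric check of this example (the counts and scores below are taken directly from slides 14-17): shifting every score by +1 flips which alignment wins, while doubling the scores does not.

```python
def score(n_match, n_mismatch, n_gap, m, s, d):
    return n_match * m + n_mismatch * s + n_gap * d

# (m, s, d): original scheme, all scores shifted by +1, all scores doubled
for m, s, d in [(1, -2, -2), (2, -1, -1), (2, -4, -4)]:
    F1 = score(2, 3, 0, m, s, d)   # AACAG / ATCGT: 2 matches, 3 mismatches
    F2 = score(3, 0, 4, m, s, d)   # AA-CAG- / -ATC-GT: 3 matches, 4 gaps
    print(f"m={m}, s={s}, d={d}: F1={F1}, F2={F2}, winner={'F1' if F1 > F2 else 'F2'}")
```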

  18. On the other hand, if we had doubled our scores, so that m’ = 2m, s’ = 2s, d’ = 2d • F1’ = 2F1 • F2’ = 2F2 • The optimal alignment would not change

  19. Today • How to model gaps more accurately? • Local sequence alignment • Statistics of alignment

  20. Which is a better alignment? Alignment 1: GACGCCGAACG over GACG-C-A-CG (scattered gaps), score = 8 x m – 3 x d. Alignment 2: GACGCCGAACG over GACGC---ACG (one gap of length 3), score = 8 x m – 3 x d. Under a linear gap penalty they score the same. • However, gaps usually occur in bunches • During evolution, chunks of DNA may be lost entirely • Aligning genomic sequence vs. cDNA (reverse complementary to mRNA)

  21. Model gaps more accurately • Current model: a gap of length n incurs penalty nd • General: a convex gap penalty function γ(n) • E.g. γ(n) = c * sqrt(n) [Figure: γ(n) vs. n for the linear and convex penalties]

  22. General gap dynamic programming • Initialization: same • Iteration: F(i, j) = max { F(i-1, j-1) + s(xi, yj), max_{k=0…i-1} F(k, j) – γ(i-k), max_{k=0…j-1} F(i, k) – γ(j-k) } • Termination: same • Running Time: O(N²M) (cubic) • Space: O(NM) (linear-space algorithm not applicable)
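A direct Python sketch of this cubic recurrence, using γ(n) = c*sqrt(n) from the previous slide as the gap penalty; the match/mismatch values, the constant c, and the boundary choice (one long gap) are illustrative assumptions.

```python
import math

def general_gap_align(x, y, match=1, mismatch=-2, c=2.0):
    gamma = lambda n: c * math.sqrt(n)          # a gap of length n costs gamma(n)
    M, N = len(x), len(y)
    F = [[0.0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -gamma(i)                     # boundary: one long gap
    for j in range(1, N + 1):
        F[0][j] = -gamma(j)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,
                max(F[k][j] - gamma(i - k) for k in range(i)),  # gap of length i-k in y
                max(F[i][k] - gamma(j - k) for k in range(j)),  # gap of length j-k in x
            )
    return F[M][N]
```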

  23. Compromise: affine gaps • γ(n) = d + (n – 1)e, where d = gap open penalty and e = gap extension penalty • With match = 2, gap open = 5, gap extension = 1: GACGCCGAACG over GACGC---ACG scores 8x2 – 5 – 2 = 9, while GACGCCGAACG over GACG-C-A-CG scores 8x2 – 3x5 = 1

  24. Additional states • The amount of state needed increases • In scoring a single entry in our matrix, we need to remember an extra piece of information • Are we continuing a gap in x? (if not, starting one is more expensive) • Are we continuing a gap in y? (if not, starting one is more expensive) • Are we continuing from a match between xi and yj?

  25. Finite State Automaton [Figure: three-state automaton with states “Xi and Yj aligned”, “Xi aligned to a gap”, and “Yj aligned to a gap”; entering a gap state costs d, staying in one costs e, and aligning xi with yj scores σ(xi, yj)]

  26. Dynamic programming • We encode this information in three different matrices • For each element (i,j) we use three variables • F(i,j): best alignment of x1..xi & y1..yj if xi aligns to yj • Ix(i,j): best alignment of x1..xi & y1..yj if yj aligns to gap • Iy(i,j): best alignment of x1..xi & y1..yj if xi aligns to gap

  27. F(i, j) = σ(xi, yj) + max { F(i – 1, j – 1), Ix(i – 1, j – 1), Iy(i – 1, j – 1) } (continuing alignment / closing a gap in x / closing a gap in y) • Ix(i, j) = max { F(i, j – 1) – d, Iy(i, j – 1) – d, Ix(i, j – 1) – e } (opening a gap in x / gap extension in x) • Iy(i, j) = max { F(i – 1, j) – d, Ix(i – 1, j) – d, Iy(i – 1, j) – e } (opening a gap in y / gap extension in y)

  28. [Figure: the three matrices F, Ix, Iy and the dependencies between their entries]

  29. [Figure: the three matrices F, Ix, Iy (continued)]

  30. If we stack all three matrices • No cyclic dependency • We can fill in all three matrices in order

  31. Algorithm • for i = 1:M • for j = 1:N • Fill in F(i, j), Ix(i, j), Iy(i, j) • end • end • F(M, N) = max(F(M, N), Ix(M, N), Iy(M, N)) • Time: O(MN) • Space: O(MN), or O(N) when combined with the linear-space algorithm
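A sketch of the three-matrix fill described on slides 27 and 31; the boundary conditions, the mismatch score, and the test call are illustrative assumptions (match = 2, d = 5, e = 1 follow the example on slide 23).

```python
def affine_global_align(x, y, match=2, mismatch=-1, d=5, e=1):
    NEG = float("-inf")
    M, N = len(x), len(y)
    F  = [[NEG] * (N + 1) for _ in range(M + 1)]  # xi aligned to yj
    Ix = [[NEG] * (N + 1) for _ in range(M + 1)]  # yj aligned to a gap
    Iy = [[NEG] * (N + 1) for _ in range(M + 1)]  # xi aligned to a gap
    F[0][0] = 0
    for j in range(1, N + 1):
        Ix[0][j] = -d - (j - 1) * e               # leading gap in x
    for i in range(1, M + 1):
        Iy[i][0] = -d - (i - 1) * e               # leading gap in y
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j]  = s + max(F[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(F[i][j - 1] - d, Iy[i][j - 1] - d, Ix[i][j - 1] - e)
            Iy[i][j] = max(F[i - 1][j] - d, Ix[i - 1][j] - d, Iy[i - 1][j] - e)
    return max(F[M][N], Ix[M][N], Iy[M][N])

print(affine_global_align("GACGCCGAACG", "GACGCACG"))  # 9: the bunched-gap alignment of slide 23
```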

  32. To simplify • F(i, j) = max { F(i – 1, j – 1) + σ(xi, yj), I(i – 1, j – 1) + σ(xi, yj) } • I(i, j) = max { F(i, j – 1) – d, I(i, j – 1) – e, F(i – 1, j) – d, I(i – 1, j) – e } • I(i, j): best alignment between x1…xi and y1…yj if either xi or yj is aligned to a gap • This is possible because alternating gaps (a gap in x immediately followed by a gap in y, or vice versa) are not allowed
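The same computation using only the two matrices of this slide; again a sketch under the same illustrative scoring assumptions as the previous one.

```python
def affine_two_matrix(x, y, match=2, mismatch=-1, d=5, e=1):
    NEG = float("-inf")
    M, N = len(x), len(y)
    F = [[NEG] * (N + 1) for _ in range(M + 1)]   # xi aligned to yj
    I = [[NEG] * (N + 1) for _ in range(M + 1)]   # xi or yj aligned to a gap
    F[0][0] = 0
    for j in range(1, N + 1):
        I[0][j] = -d - (j - 1) * e
    for i in range(1, M + 1):
        I[i][0] = -d - (i - 1) * e
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = s + max(F[i - 1][j - 1], I[i - 1][j - 1])
            I[i][j] = max(F[i][j - 1] - d, I[i][j - 1] - e,
                          F[i - 1][j] - d, I[i - 1][j] - e)
    return max(F[M][N], I[M][N])
```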

  33. To summarize • Global alignment • Basic algorithm: Needleman-Wunsch • Variants: • Overlap detection • Longest common subsequences • Achieved by varying initial conditions or scoring • Bounded DP (pruning search space) • Linear space (divide-and-conquer) • Affine gap penalty

  34. Local alignment

  35. The local alignment problem • Given two strings X = x1……xM, Y = y1……yN • Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum • e.g. X = abcxdex, X’ = cxde; Y = xxxcde, Y’ = c-de

  36. Why local alignment • Conserved regions may be a small part of the whole • “Active site” of a protein • Scattered genes or exons among “junk” • We may not have the whole sequence • Global alignment might miss them if the flanking “junk” outweighs the similar regions

  37. Genes are shuffled between genomes [Figure: the same genes A, B, C, D appear in different orders in different genomes]

  38. Naïve algorithm • for all substrings X’ of X and Y’ of Y: align X’ & Y’ via dynamic programming; retain the pair with the max value; end • Output the retained pair • Time: O(n²) choices for X’, O(m²) for Y’, O(nm) for each DP, so O(n³m³) total (see the sketch below)
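A brute-force sketch purely to make the O(n³m³) cost concrete; global_align stands for any O(nm) global aligner passed in by the caller and is an assumed parameter, not something defined on the slide.

```python
def naive_local(X, Y, global_align):
    best = (float("-inf"), "", "")
    for i in range(len(X)):                           # O(n^2) substrings of X
        for j in range(i + 1, len(X) + 1):
            for k in range(len(Y)):                   # O(m^2) substrings of Y
                for l in range(k + 1, len(Y) + 1):
                    s = global_align(X[i:j], Y[k:l])  # O(nm) per pair
                    if s > best[0]:
                        best = (s, X[i:j], Y[k:l])
    return best                                       # (score, X', Y')
```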

  39. Reminder • The overlap detection algorithm • We do not penalize gaps at the two ends (free end gaps) [Figure: DP matrix with free gaps along the boundary]

  40. Similar here • We pay no penalty for the unaligned regions

  41. The big idea • Whenever we get to some bad region (negative score), we ignore the previous alignment • Reset score to zero

  42. The Smith-Waterman algorithm • Initialization: F(0, j) = F(i, 0) = 0 • Iteration: F(i, j) = max { 0, F(i – 1, j) – d, F(i, j – 1) – d, F(i – 1, j – 1) + σ(xi, yj) }

  43. The Smith-Waterman algorithm • Termination: • If we want the best local alignment: F_OPT = max_{i,j} F(i, j) • If we want all local alignments scoring > t: for all i, j find F(i, j) > t, and trace back
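A minimal Smith-Waterman sketch following the recurrence and termination above, using the scoring of the worked example on the following slides (match 2, mismatch -1, gap penalty 1); the function name and the test call (the pair from slide 35) are illustrative.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=1):
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]     # F(0, j) = F(i, 0) = 0
    best, best_ij = 0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,                      # restart: ignore a bad prefix
                          F[i - 1][j - 1] + s,    # align xi with yj
                          F[i - 1][j] - d,        # xi aligned to a gap
                          F[i][j - 1] - d)        # yj aligned to a gap
            if F[i][j] > best:
                best, best_ij = F[i][j], (i, j)
    # Trace back from best_ij until a 0 cell to recover the alignment itself.
    return best, best_ij

print(smith_waterman("abcxdex", "xxxcde"))  # e.g. cxde vs c-de, score 5 under this scheme
```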

  44. The correctness of the algorithm can be proved by induction using the alignment graph [Figure: alignment graph with example node scores]

  45–50. Worked example (animation): Match: 2, Mismatch: -1, Gap: -1 [Figures: the Smith-Waterman matrix being filled in step by step under this scoring scheme]
