html5-img
1 / 47

CS 5263 Bioinformatics

CS 5263 Bioinformatics. Lecture 4: Local Sequence Alignment, More Efficient Sequence Alignment Algorithms. Roadmap. Review of last lecture Local sequence alignment More efficient sequence alignment algorithms. Given a scoring scheme, Match: m Mismatch: -s Gap: -d

fathia
Télécharger la présentation

CS 5263 Bioinformatics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS 5263 Bioinformatics Lecture 4: Local Sequence Alignment, More Efficient Sequence Alignment Algorithms

  2. Roadmap • Review of last lecture • Local sequence alignment • More efficient sequence alignment algorithms

  3. Given a scoring scheme, • Match: m • Mismatch: -s • Gap: -d • We can easily compute an optimal alignment by dynamic programming

  4. Look at any column of an alignment between two sequences X = x1x2…xM, Y = y1y1…yN • Only three cases: • xi is aligned to yj • xi is aligned to a gap • yj is aligned to a gap F(i-1, j-1) +  (xi, yj) F(i, j) = max F(i-1, j) - d F(i, j-1) - d

  5. Example F(i,j) j = 0 1 2 3 4 i = 0 A A A A G - G - T T T T A A A A 1 2 3

  6. Equivalent graph problem S1 = G A T A (0,0) : a gap in the 2nd sequence : a gap in the 1st sequence : match / mismatch 1 1 S2 = A 1 T Value on vertical/horizontal line: -d Value on diagonal: m or -s 1 1 A (3,4) • Number of steps: length of the alignment • Path length: alignment score • Optimal alignment: find the longest path from (0, 0) to (3, 4) • General longest path problem cannot be found with DP. Longest path on this graph can be found by DP since no cycle is possible.

  7. Variants of Needleman-Wunsch alg • LCS: longest common subsequence • No penalty for gaps or mutations • Change: score function • Overlapping variants • No penalty for starting/ending gaps • Change: initial / termination step • Other variants • cDNA-genome alignment

  8. Local alignment

  9. The local alignment problem Given two strings X = x1……xM, Y = y1……yN Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum e.g. X = abcxdex X’ = cxde Y = xxxcde Y’ = c-de x y

  10. Why local alignment • Conserved regions may be a small part of the whole • Global alignment might miss them if flanking “junk” outweighs similar regions • Genes are shuffled between genomes C D B A D B A C

  11. Naïve algorithm for all substrings X’ of X and Y’ of Y Align X’ & Y’ via dynamic programming Retain pair with max value end ; Output the retained pair • Time: O(n2) choices for A, O(m2) for B, O(nm) for DP, so O(n3m3 ) total.

  12. Reminder • The overlap detection algorithm • We do not give penalty to gaps in the ends Free gap Free gap

  13. The local alignment idea • Do not penalize the unaligned regions (gaps or mismatches) • The alignment can start anywhere and ends anywhere • Strategy: whenever we get to some low similarity region (negative score), we restart a new alignment • By resetting alignment score to zero

  14. The Smith-Waterman algorithm Initialization: F(0, j) = F(i, 0) = 0 0 F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + (xi, yj) Iteration: F(i, j) = max

  15. The Smith-Waterman algorithm Termination: • If we want the best local alignment… FOPT = maxi,j F(i, j) • If we want all local alignments scoring > t For all i, j find F(i, j) > t, and trace back

  16. Match: 2 Mismatch: -1 Gap: -1

  17. Match: 2 Mismatch: -1 Gap: -1

  18. Match: 2 Mismatch: -1 Gap: -1

  19. Match: 2 Mismatch: -1 Gap: -1

  20. Match: 2 Mismatch: -1 Gap: -1

  21. Match: 2 Mismatch: -1 Gap: -1

  22. Match: 2 Mismatch: -1 Gap: -1

  23. Trace back Match: 2 Mismatch: -1 Gap: -1

  24. Trace back Match: 2 Mismatch: -1 Gap: -1 cxde | || c-de x-de | || xcde

  25. No negative values in local alignment DP array • Optimal local alignment will never have a gap on either end • Local alignment: “Smith-Waterman” • Global alignment: “Needleman-Wunsch”

  26. Analysis • Time: • O(MN) for finding the best alignment • Time to report all alignments depends on the number of sub-opt alignments • Memory: • O(MN) • O(M+N) possible

  27. More efficient alignment algorithms

  28. Given two sequences of length M, N • Time: O(MN) • ok • Space: O(MN) • bad • 1Mb seq x 1Mb seq = 1000G memory • Can we do better?

  29. Bounded alignment Good alignment should appear near the diagonal

  30. Bounded Dynamic Programming If we know that x and y are very similar Assumption: # gaps(x, y) < k xi Then, | implies | i – j | < k yj

  31. Bounded Dynamic Programming Initialization: F(i,0), F(0,j) undefined for i, j > k Iteration: For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+ (xi, yj) F(i, j) = max F(i, j – 1) – d, if j > i – k F(i – 1, j) – d, if j < i + k Termination: same x1 ………………………… xM yN ………………………… y1 k

  32. Analysis • Time: O(kM) << O(MN) • Space: O(kM) with some tricks => M M 2k 2k

  33. Given two sequences of length M, N • Time: O(MN) • ok • Space: O(MN) • bad • 1mb seq x 1mb seq = 1000G memory • Can we do better?

  34. Linear space algorithm • If all we need is the alignment score but not the alignment, easy! We only need to keep two rows (You only need one row, with a little trick) But how do we get the alignment?

  35. Linear space algorithm • When we finish, we know how we have aligned the ends of the sequences XM YN Naïve idea: Repeat on the smaller subproblem F(M-1, N-1) Time complexity: O((M+N)(MN))

  36. (0, 0) M/2 (M, N) Key observation: optimal alignment (longest path) must use an intermediate point on the M/2-th row. Call it (M/2, k), where k is unknown.

  37. (0,0) • Longest path from (0, 0) to (6, 6) is max_k (LP(0,0,3,k) + LP(6,6,3,k)) (3,2) (3,0) (3,4) (3,6) (6,6)

  38. Hirschberg’s idea • Divide and conquer! Y X Forward algorithm Align x1x2…xM/2 with Y M/2 F(M/2, k) represents the best alignment between x1x2…xM/2 and y1y2…yk

  39. Backward Algorithm Y X Backward algorithm Align reverse(xM/2+1…xM) with reverse(Y) M/2 B(M/2, k) represents the best alignment between reverse(xM/2+1…xM) and reverse(ykyk+1…yN )

  40. Linear-space alignment Using 2 (4) rows of space, we can compute for k = 1…N, F(M/2, k), B(M/2, k) M/2

  41. Linear-space alignment Now, we can find k* maximizing F(M/2, k) + B(M/2, k) Also, we can trace the path exiting column M/2 from k* Conclusion: In O(NM) time, O(N) space, we found optimal alignment path at row M/2

  42. Linear-space alignment • Iterate this procedure to the two sub-problems! M/2 k* M/2 N-k*

  43. Analysis • Memory: O(N) for computation, O(N+M) to store the optimal alignment • Time: • MN for first iteration • k M/2 + (N-k) M/2 = MN/2 for second • … k M/2 M/2 N-k

  44. MN MN/2 MN/4 • MN + MN/2 + MN/4 + MN/8 + … • = MN (1 + ½ + ¼ + 1/8 + 1/16 + …) • = 2MN = O(MN) MN/8

More Related