Download Presentation
## CS 5263 Bioinformatics

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**CS 5263 Bioinformatics**Lecture 4: Local Sequence Alignment, More Efficient Sequence Alignment Algorithms**Roadmap**• Review of last lecture • Local sequence alignment • More efficient sequence alignment algorithms**Given a scoring scheme,**• Match: m • Mismatch: -s • Gap: -d • We can easily compute an optimal alignment by dynamic programming**Look at any column of an alignment between two sequences X =**x1x2…xM, Y = y1y1…yN • Only three cases: • xi is aligned to yj • xi is aligned to a gap • yj is aligned to a gap F(i-1, j-1) + (xi, yj) F(i, j) = max F(i-1, j) - d F(i, j-1) - d**Example**F(i,j) j = 0 1 2 3 4 i = 0 A A A A G - G - T T T T A A A A 1 2 3**Equivalent graph problem**S1 = G A T A (0,0) : a gap in the 2nd sequence : a gap in the 1st sequence : match / mismatch 1 1 S2 = A 1 T Value on vertical/horizontal line: -d Value on diagonal: m or -s 1 1 A (3,4) • Number of steps: length of the alignment • Path length: alignment score • Optimal alignment: find the longest path from (0, 0) to (3, 4) • General longest path problem cannot be found with DP. Longest path on this graph can be found by DP since no cycle is possible.**Variants of Needleman-Wunsch alg**• LCS: longest common subsequence • No penalty for gaps or mutations • Change: score function • Overlapping variants • No penalty for starting/ending gaps • Change: initial / termination step • Other variants • cDNA-genome alignment**The local alignment problem**Given two strings X = x1……xM, Y = y1……yN Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum e.g. X = abcxdex X’ = cxde Y = xxxcde Y’ = c-de x y**Why local alignment**• Conserved regions may be a small part of the whole • Global alignment might miss them if flanking “junk” outweighs similar regions • Genes are shuffled between genomes C D B A D B A C**Naïve algorithm**for all substrings X’ of X and Y’ of Y Align X’ & Y’ via dynamic programming Retain pair with max value end ; Output the retained pair • Time: O(n2) choices for A, O(m2) for B, O(nm) for DP, so O(n3m3 ) total.**Reminder**• The overlap detection algorithm • We do not give penalty to gaps in the ends Free gap Free gap**The local alignment idea**• Do not penalize the unaligned regions (gaps or mismatches) • The alignment can start anywhere and ends anywhere • Strategy: whenever we get to some low similarity region (negative score), we restart a new alignment • By resetting alignment score to zero**The Smith-Waterman algorithm**Initialization: F(0, j) = F(i, 0) = 0 0 F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + (xi, yj) Iteration: F(i, j) = max**The Smith-Waterman algorithm**Termination: • If we want the best local alignment… FOPT = maxi,j F(i, j) • If we want all local alignments scoring > t For all i, j find F(i, j) > t, and trace back**Match: 2**Mismatch: -1 Gap: -1**Match: 2**Mismatch: -1 Gap: -1**Match: 2**Mismatch: -1 Gap: -1**Match: 2**Mismatch: -1 Gap: -1**Match: 2**Mismatch: -1 Gap: -1**Match: 2**Mismatch: -1 Gap: -1**Match: 2**Mismatch: -1 Gap: -1**Trace back**Match: 2 Mismatch: -1 Gap: -1**Trace back**Match: 2 Mismatch: -1 Gap: -1 cxde | || c-de x-de | || xcde**No negative values in local alignment DP array**• Optimal local alignment will never have a gap on either end • Local alignment: “Smith-Waterman” • Global alignment: “Needleman-Wunsch”**Analysis**• Time: • O(MN) for finding the best alignment • Time to report all alignments depends on the number of sub-opt alignments • Memory: • O(MN) • O(M+N) possible**Given two sequences of length M, N**• Time: O(MN) • ok • Space: O(MN) • bad • 1Mb seq x 1Mb seq = 1000G memory • Can we do better?**Bounded alignment**Good alignment should appear near the diagonal**Bounded Dynamic Programming**If we know that x and y are very similar Assumption: # gaps(x, y) < k xi Then, | implies | i – j | < k yj**Bounded Dynamic Programming**Initialization: F(i,0), F(0,j) undefined for i, j > k Iteration: For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+ (xi, yj) F(i, j) = max F(i, j – 1) – d, if j > i – k F(i – 1, j) – d, if j < i + k Termination: same x1 ………………………… xM yN ………………………… y1 k**Analysis**• Time: O(kM) << O(MN) • Space: O(kM) with some tricks => M M 2k 2k**Given two sequences of length M, N**• Time: O(MN) • ok • Space: O(MN) • bad • 1mb seq x 1mb seq = 1000G memory • Can we do better?**Linear space algorithm**• If all we need is the alignment score but not the alignment, easy! We only need to keep two rows (You only need one row, with a little trick) But how do we get the alignment?**Linear space algorithm**• When we finish, we know how we have aligned the ends of the sequences XM YN Naïve idea: Repeat on the smaller subproblem F(M-1, N-1) Time complexity: O((M+N)(MN))**(0, 0)**M/2 (M, N) Key observation: optimal alignment (longest path) must use an intermediate point on the M/2-th row. Call it (M/2, k), where k is unknown.**(0,0)**• Longest path from (0, 0) to (6, 6) is max_k (LP(0,0,3,k) + LP(6,6,3,k)) (3,2) (3,0) (3,4) (3,6) (6,6)**Hirschberg’s idea**• Divide and conquer! Y X Forward algorithm Align x1x2…xM/2 with Y M/2 F(M/2, k) represents the best alignment between x1x2…xM/2 and y1y2…yk**Backward Algorithm**Y X Backward algorithm Align reverse(xM/2+1…xM) with reverse(Y) M/2 B(M/2, k) represents the best alignment between reverse(xM/2+1…xM) and reverse(ykyk+1…yN )**Linear-space alignment**Using 2 (4) rows of space, we can compute for k = 1…N, F(M/2, k), B(M/2, k) M/2**Linear-space alignment**Now, we can find k* maximizing F(M/2, k) + B(M/2, k) Also, we can trace the path exiting column M/2 from k* Conclusion: In O(NM) time, O(N) space, we found optimal alignment path at row M/2**Linear-space alignment**• Iterate this procedure to the two sub-problems! M/2 k* M/2 N-k***Analysis**• Memory: O(N) for computation, O(N+M) to store the optimal alignment • Time: • MN for first iteration • k M/2 + (N-k) M/2 = MN/2 for second • … k M/2 M/2 N-k**MN**MN/2 MN/4 • MN + MN/2 + MN/4 + MN/8 + … • = MN (1 + ½ + ¼ + 1/8 + 1/16 + …) • = 2MN = O(MN) MN/8