CS 5263 Bioinformatics

CS 5263 Bioinformatics Lecture 4: Global Sequence Alignment Algorithms

Roadmap • Review of last lecture • More global sequence alignment algorithms

Given a scoring scheme, • Match: m • Mismatch: -s • Gap: -d • We can easily compute an optimal alignment by dynamic programming

In a completed alignment between a pair of sequences X = x1x2…xM, Y = y1y2…yN • If we look at any column of the alignment, there are only three possibilities • xi is aligned to yj • xi is aligned to a gap • yj is aligned to a gap

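Since the alignment score F(M, N) is a sum over all aligned columns, it can be broken down to:
$$F(M, N) = \max \begin{cases} F(M-1, N-1) + \sigma(x_M, y_N) \\ F(M-1, N) - d \\ F(M, N-1) - d \end{cases}$$
where σ(a, b) is the column score for aligning a to b: m for a match, -s for a mismatch.

And recursively:
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$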
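The recurrence can be turned into a short program. The following is a minimal sketch, not a reference implementation: the function name `global_align` and the default scores (m = s = d = 1) are assumptions chosen for illustration, and ties in the trace-back are broken in favour of the diagonal move.

```python
def global_align(x, y, m=1, s=1, d=1):
    """Fill F with the recurrence, then trace back one optimal alignment."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):          # first column: x prefix against gaps
        F[i][0] = -i * d
    for j in range(1, N + 1):          # first row: y prefix against gaps
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub,   # xi aligned to yj
                          F[i - 1][j] - d,         # xi aligned to a gap
                          F[i][j - 1] - d)         # yj aligned to a gap
    # Trace back from (M, N) to (0, 0), preferring the diagonal on ties.
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        sub = (m if x[i - 1] == y[j - 1] else -s) if i > 0 and j > 0 else None
        if sub is not None and F[i][j] == F[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))
```

For example, `global_align("GATA", "ATA")` aligns the three shared letters and pays for one gap.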

Trace-back • [Figure: the matrix F(i, j) for i = 0…3, j = 0…4, with the trace-back path and the alignment it spells out.]

Graph representation • S1 = GATA (across the top), S2 = ATA (down the side); the grid runs from (0,0) to (3,4)
• Horizontal edge: a gap in the 2nd sequence • Vertical edge: a gap in the 1st sequence • Diagonal edge: match / mismatch
• Values on vertical/horizontal edges: -d • Values on diagonal edges: m or -s (here 1 and -1, with d = 1)
• Number of steps: length of the alignment • Path length: alignment score • Alignment: find the longest path from (0, 0) to (3, 4)
• The general longest-path problem cannot be solved by DP. The longest path on this graph can be found by DP since no cycle is possible.

Question • If we change the scoring scheme, will the optimal alignment be changed? • Original: Match = 1, mismatch = gap = -1 • New: match = 2, mismatch = gap = 0 • New: Match = 2, mismatch = gap = -2?

Number of alignments • Is equal to the number of distinct paths from (0, 0) to (m, n) • Example: the five alignments of A with BC
  A-    A--   --A   -A-   -A
  BC    -BC   BC-   B-C   BC

How to count? • Homework assignment • Hint: dynamic programming • Or analytically
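One way to follow the dynamic-programming hint is to count paths cell by cell: every path into (i, j) arrives by a diagonal, vertical, or horizontal step. The sketch below assumes that reading of the hint; the name `count_alignments` is made up for illustration.

```python
def count_alignments(m, n):
    """Number of distinct paths from (0, 0) to (m, n) using diagonal,
    vertical and horizontal unit steps."""
    # Along the first row or column there is exactly one path (all gaps).
    C = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            C[i][j] = C[i - 1][j - 1] + C[i - 1][j] + C[i][j - 1]
    return C[m][n]
```

`count_alignments(1, 2)` gives 5, matching the five alignments of A with BC.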

However • Biologically meaningful “distinct” alignments may be far fewer • All three alignments below may be considered equivalent • In each, A, B, and C are all aligned to gaps
  A--   --A   -A-
  -BC   BC-   B-C

Number of alignments • We only care about who is aligned to whom, not the gaps • For two sequences of length m, n, there may be k matches, k = 0 to min(m, n) • Number of alignments:
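The formula on this slide did not survive, so the following is a reconstruction under an assumption rather than the slide's own expression. If an alignment is identified only by which k letters of each sequence are paired (in order), there are C(m, k) · C(n, k) choices for each k, and by Vandermonde's identity the sum over k equals C(m+n, m). The function name `count_pairings` is hypothetical.

```python
from math import comb

def count_pairings(m, n):
    """Count alignments of lengths m and n when only 'who is aligned
    to whom' matters: choose k positions in each sequence, pair in order."""
    return sum(comb(m, k) * comb(n, k) for k in range(min(m, n) + 1))
```

For A versus BC this gives 3: A paired with B, A paired with C, or nothing paired.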

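Furthermore • A gap in one sequence immediately followed by a gap in the other (two edges, -d each) can be replaced by a single diagonal edge (m or -s):
  A--        A-
  -BC   =>   BC
• Alternating gaps are discouraged / prohibited. • With most scoring schemes, alternating gaps will never happen: as long as 2d > s, even a mismatch (-s) scores better than two gaps (-2d).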

• Do we need a special trick to exclude alternating gaps? • No. In most scoring schemes this is achieved automatically • 2d > s • The five alignments of A with BC:
  A-    A--   --A   -A-   -A
  BC    -BC   BC-   B-C   BC

Number of alignments • Homework assignment • Dynamic programming • Multiple matrices • Three states: • Came from diagonal. Can go any of the three directions • Came from left, cannot go down • Came from above, cannot turn right
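The three states can be kept in three matrices, one per kind of last step. This is only a sketch of one way to read the slide: the names `D`, `H`, `V` and `count_no_alternating` are made up, the start cell is treated as a "diagonal" state so that any first move is allowed, and a horizontal step is taken to mean "came from the left".

```python
def count_no_alternating(m, n):
    """Count paths from (0, 0) to (m, n) in which a horizontal step is
    never immediately followed by a vertical step, or vice versa."""
    D = [[0] * (n + 1) for _ in range(m + 1)]  # last step was diagonal
    H = [[0] * (n + 1) for _ in range(m + 1)]  # last step came from the left
    V = [[0] * (n + 1) for _ in range(m + 1)]  # last step came from above
    D[0][0] = 1  # assumption: the start behaves like a diagonal arrival
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:   # diagonal may follow any state
                D[i][j] = D[i - 1][j - 1] + H[i - 1][j - 1] + V[i - 1][j - 1]
            if j > 0:             # horizontal may not follow vertical
                H[i][j] = D[i][j - 1] + H[i][j - 1]
            if i > 0:             # vertical may not follow horizontal
                V[i][j] = D[i - 1][j] + V[i - 1][j]
    return D[m][n] + H[m][n] + V[m][n]
```

Under these rules only two of the five alignments of A with BC remain: A paired with B, and A paired with C.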

Given two sequences of length M, N • Time: O(MN) (OK) • Space: O(MN) (bad) • 1 Mb seq x 1 Mb seq = 10^12 cells, about 1000 GB of memory even at one byte per cell • Can we do better?

In biology, this kind of alignment is unlikely to be meaningful:
  abcde----
  ----vwxyz

Good alignment should appear near the diagonal

Bounded Dynamic Programming • If we know that x and y are very similar • Assumption: #gaps(x, y) < k • Then xi aligned to yj implies | i - j | < k

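Bounded Dynamic Programming
Initialization: F(i, 0), F(0, j) undefined for i, j > k
Iteration:
For i = 1…M
  For j = max(1, i - k)…min(N, i + k)
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ F(i, j-1) - d, & \text{if } j > i - k \\ F(i-1, j) - d, & \text{if } j < i + k \end{cases}$$
where σ(xi, yj) is the match/mismatch score (m or -s).
Termination: same
[Figure: the matrix of x1…xM against y1…yN, with only a band of width k around the diagonal filled in.]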

Analysis • Time: O(kM) << O(MN) • Space: O(kM) with some tricks: store only the band, as an M x 2k array instead of the full M x N matrix

What if we don’t know k? • Iterate: for k = 2, 4, 8, 16, … • For each k, compute an optimal bounded alignment, with score Sk • Stop when (min(N, M) - k) * m - 2kd < Sk, since a larger k then cannot give a higher score
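The bounded recurrence and the doubling loop can be sketched together as follows. Everything here is illustrative: the names `banded_score` and `score_with_unknown_k` are made up, the band is taken as |i - j| <= k (the loop bounds on the iteration slide), cells outside the band are treated as minus infinity, only the score is returned, and the stopping test is copied from the slide as stated rather than re-derived.

```python
NEG = float("-inf")  # stands in for "undefined" cells outside the band

def banded_score(x, y, k, m=1, s=1, d=1):
    """Best score over alignments whose path stays within |i - j| <= k."""
    M, N = len(x), len(y)
    if abs(M - N) > k:
        return NEG  # the end cell (M, N) lies outside the band
    prev = {j: -j * d for j in range(min(N, k) + 1)}  # row 0 of the band
    for i in range(1, M + 1):
        cur = {}
        for j in range(max(0, i - k), min(N, i + k) + 1):
            if j == 0:
                cur[j] = -i * d
                continue
            sub = m if x[i - 1] == y[j - 1] else -s
            cur[j] = max(prev.get(j - 1, NEG) + sub,  # diagonal
                         cur.get(j - 1, NEG) - d,     # from the left
                         prev.get(j, NEG) - d)        # from above
        prev = cur
    return prev[N]

def score_with_unknown_k(x, y, m=1, s=1, d=1):
    """Double k until the slide's stopping test says a wider band cannot help."""
    M, N = len(x), len(y)
    k = 2
    while True:
        S_k = banded_score(x, y, k, m, s, d)
        bound = (min(M, N) - k) * m - 2 * k * d
        if bound < S_k or k >= max(M, N):
            return S_k, k
        k *= 2
```

Each row is kept as a small dictionary of at most 2k + 1 entries, so this sketch uses O(k) memory for the score alone; recovering the alignment would need the O(kM) band described in the analysis.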

Linear space algorithm • If all we need is the alignment score but not the alignment, easy! • We only need to keep two rows (if you are crafty enough, you only need one row) • But how do we get the alignment?
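The two-row idea can be sketched as below. The function name `alignment_score` and the default scores are assumptions for illustration; the recurrence is the same one used for the full matrix, but only the previous and current rows are kept.

```python
def alignment_score(x, y, m=1, s=1, d=1):
    """Optimal global alignment score using two rows of length len(y) + 1."""
    prev = [-j * d for j in range(len(y) + 1)]      # row 0
    for i in range(1, len(x) + 1):
        cur = [-i * d] + [0] * len(y)               # column 0 of row i
        for j in range(1, len(y) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            cur[j] = max(prev[j - 1] + sub, prev[j] - d, cur[j - 1] - d)
        prev = cur                                  # row i becomes row i - 1
    return prev[-1]
```

The rows that would be needed for a trace-back are overwritten as the loop advances, which is why this version yields the score but not the alignment.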

Linear space algorithm • When we finish, we know how the ends of the sequences, xM and yN, have been aligned • Naïve idea: repeat on the smaller subproblem, e.g. F(M-1, N-1) • Each repeat costs O(MN) and an alignment has up to M+N columns • Time complexity: O((M+N)(MN))

Hirschberg’s idea • Divide and conquer! • Forward algorithm: align x1x2…xM/2 with Y • F(M/2, k) represents the best alignment between x1x2…xM/2 and y1y2…yk

Backward algorithm • Align reverse(xM/2+1…xM) with reverse(Y) • B(M/2, k) represents the best alignment between reverse(xM/2+1…xM) and reverse(yk+1…yN)

Lemma • F(M/2, k) + B(M/2, k) is the score of the best alignment under the constraint that x1…xM/2 is aligned with y1…yk, i.e. the path passes through cell (M/2, k) • Therefore
$$F(M, N) = \max_{k = 0 \ldots N} \big( F(M/2, k) + B(M/2, k) \big)$$
• k* denotes a k that attains the maximum

• Longest path from (0, 0) to (6, 6) is max_k ( LP(0,0,3,k) + LP(3,k,6,6) ) • [Figure: grid from (0,0) to (6,6) with candidate midpoints (3,0), (3,2), (3,4), (3,6) on row 3.]

Linear-space alignment • Now, using 2 rows of space, we can compute F(M/2, k) and B(M/2, k) for k = 0…N

Linear-space alignment • Now, we can find k* maximizing F(M/2, k) + B(M/2, k) • Also, we can trace the path exiting row M/2 from k* • Conclusion: in O(NM) time and O(N) space, we have found where the optimal alignment path crosses row M/2

Linear-space alignment • Iterate this procedure on the two sub-problems! • x1…xM/2 against y1…yk* (size M/2 x k*) • xM/2+1…xM against yk*+1…yN (size M/2 x (N-k*))
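Putting the forward pass, the backward pass, and the recursion together gives roughly the following. This is a sketch under stated assumptions, not the lecture's own code: the names `nw_last_row` and `hirschberg` are made up, the split point is the first k that attains the maximum, and the single-letter base case is handled directly instead of by a small full-matrix alignment.

```python
def nw_last_row(x, y, m=1, s=1, d=1):
    """Scores of x against every prefix of y, keeping two rows only."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            cur[j] = max(prev[j - 1] + sub, prev[j] - d, cur[j - 1] - d)
        prev = cur
    return prev

def hirschberg(x, y, m=1, s=1, d=1):
    """Return one optimal global alignment (ax, ay) in linear space."""
    if len(x) == 0:
        return "-" * len(y), y
    if len(y) == 0:
        return x, "-" * len(x)
    if len(x) == 1:
        j = y.find(x)                     # a matching letter, if any
        if j < 0:
            if s > 2 * d:                 # two gaps beat a mismatch
                return x + "-" * len(y), "-" + y
            j = 0                         # otherwise accept a mismatch
        return "-" * j + x + "-" * (len(y) - j - 1), y
    mid, N = len(x) // 2, len(y)
    F = nw_last_row(x[:mid], y, m, s, d)                  # forward pass
    B = nw_last_row(x[mid:][::-1], y[::-1], m, s, d)      # backward pass
    # B[N - k] scores the second half of x against y[k:].
    k = max(range(N + 1), key=lambda j: F[j] + B[N - j])
    ax1, ay1 = hirschberg(x[:mid], y[:k], m, s, d)
    ax2, ay2 = hirschberg(x[mid:], y[k:], m, s, d)
    return ax1 + ax2, ay1 + ay2
```

For example, `hirschberg("GATA", "ATA")` splits GATA into GA and TA, finds k* = 1, and joins the two sub-alignments into GATA over -ATA. The string slices copy their inputs for readability; passing index ranges instead would avoid the copies.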

Analysis • Memory: O(N) for computation, O(N+M) to store the optimal alignment • Time: • MN for first iteration • k M/2 + (N-k) M/2 = MN/2 for second • …

• Work per level of recursion: MN, MN/2, MN/4, MN/8, … • MN + MN/2 + MN/4 + MN/8 + … • = MN (1 + ½ + ¼ + 1/8 + 1/16 + …) • = 2MN = O(MN)
