Aligning Alignments

Soni Mukherjee, 11/11/04


Presentation Transcript


  1. Aligning Alignments. Soni Mukherjee, 11/11/04

  2. Pairwise Alignment
     • Given two sequences, find their optimal alignment
     • Score = (#matches) * m - (#mismatches) * s - (#gaps) * d, where #gaps counts gapped positions
     • The optimal alignment is the alignment with the maximum score
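The scoring rule above can be sketched in a few lines. This is a minimal illustration, not code from the slides: the function name `alignment_score`, the default values of `m`, `s` and `d`, and the choice to skip columns where both rows hold a gap are all assumptions.

```python
def alignment_score(a, b, m=1, s=1, d=2, gap="-"):
    """Score two gapped rows of equal length, column by column.

    m, s, d are the match reward, mismatch penalty and per-position gap
    penalty; the default values are illustrative assumptions.
    """
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for p, q in zip(a, b):
        if p == gap and q == gap:
            continue            # assumption: a double-gap column scores nothing
        if p == gap or q == gap:
            score -= d          # one gapped position
        elif p == q:
            score += m          # match
        else:
            score -= s          # mismatch
    return score
```

For example, `alignment_score("GATTACA", "GA-TACA")` counts six matches and one gapped position.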

  3. Dynamic Programming
     • We want to align $x_1 \dots x_m$ and $y_1 \dots y_n$
     • $D(i,j)$ = optimal score of aligning $x_1 \dots x_i$ and $y_1 \dots y_j$
     • The solution is $D(m, n)$

  4-8. Dynamic Programming
     Example alignment:
         C--GCCTAG-CT--AG
         CT-GC-TAT-CTTTAG
     Three possible cases for computing $D(i,j)$:
     1. $x_i$ aligns to $y_j$:
            x_1 … x_{i-1}  x_i
            y_1 … y_{j-1}  y_j
        $D(i,j) = D(i-1,j-1) + m$ if $x_i = y_j$, and $D(i,j) = D(i-1,j-1) - s$ otherwise
     2. $x_i$ aligns to a gap:
            x_1 … x_{i-1}  x_i
            y_1 … y_j      -
        $D(i,j) = D(i-1,j) - d$
     3. $y_j$ aligns to a gap:
            x_1 … x_i      -
            y_1 … y_{j-1}  y_j
        $D(i,j) = D(i,j-1) - d$

  9. Dynamic Programming
     • Inductive assumption: $D(i-1, j-1)$, $D(i-1, j)$ and $D(i, j-1)$ are optimal
     • Recurrence:
       $$D(i,j) = \max \begin{cases}
         D(i-1,j-1) + s(x_i, y_j) \\
         D(i-1,j) - d \\
         D(i,j-1) - d
       \end{cases}$$
       where $s(x_i, y_j) = m$ if $x_i = y_j$, and $-s$ otherwise
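The recurrence can be filled in row by row. The sketch below is illustrative only: the name `nw_table`, the default parameter values, and the boundary rows $D(i,0) = -i \cdot d$ and $D(0,j) = -j \cdot d$ are assumptions, since the slides do not state the initialization.

```python
def nw_table(x, y, m=1, s=1, d=2):
    """Fill the table D for the linear-gap recurrence and return it."""
    M, N = len(x), len(y)
    D = [[0] * (N + 1) for _ in range(M + 1)]
    # Assumed boundary: a prefix aligned against nothing is all gaps.
    for i in range(1, M + 1):
        D[i][0] = -i * d
    for j in range(1, N + 1):
        D[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            D[i][j] = max(
                D[i - 1][j - 1] + sub,  # x_i aligns to y_j
                D[i - 1][j] - d,        # x_i aligns to a gap
                D[i][j - 1] - d,        # y_j aligns to a gap
            )
    return D
```

The optimal score is the bottom-right entry, for example `nw_table("GATTACA", "GATACA")[-1][-1]`.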

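  10. Dynamic Programming: Matrix D
     [Figure: a cell of matrix D is reached from its diagonal neighbor with +s(X[i],Y[j]) and from each of its two adjacent neighbors with -d.]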

  11. Needleman-Wunsch
     • Every non-decreasing path from (0,0) to (M,N) corresponds to an alignment of the two sequences
     [Figure: DP matrix with $y_1 \dots y_N$ along one axis and $x_1 \dots x_M$ along the other.]
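One way to see the correspondence between paths and alignments is to walk back from (M,N) to (0,0) and emit one alignment column per step. The sketch below assumes linear gaps; the name `nw_align`, the default parameters, the boundary initialization and the tie-breaking order (diagonal first) are assumptions.

```python
def nw_align(x, y, m=1, s=1, d=2):
    """Return (aligned_x, aligned_y, score) for one optimal path."""
    M, N = len(x), len(y)
    D = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        D[i][0] = -i * d
    for j in range(1, N + 1):
        D[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            D[i][j] = max(D[i - 1][j - 1] + sub, D[i - 1][j] - d, D[i][j - 1] - d)

    # Trace the path backwards; each step becomes one alignment column.
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        sub = m if i > 0 and j > 0 and x[i - 1] == y[j - 1] else -s
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay)), D[M][N]
```

When several paths are optimal, this returns only one of them.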

  12-15. Scoring Gaps More Accurately
     • Linear gap model: a gap of length n incurs penalty $p(n) = n \cdot d$
     • Convex gap model: for all n, $p(n+1) - p(n) < p(n) - p(n-1)$
     • Recurrence with a general gap penalty p:
       $$D(i,j) = \max \begin{cases}
         D(i-1,j-1) + s(x_i, y_j) \\
         \max_{k=0 \dots i-1} D(k,j) - p(i-k) \\
         \max_{k=0 \dots j-1} D(i,k) - p(j-k)
       \end{cases}$$
     • Running time = $O(N^3)$
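The cubic running time comes from the two inner maximizations over k. A minimal sketch, assuming the penalty is passed in as a Python function `p` and that `D[0][0] = 0` is the only initialization needed (the function name and defaults are hypothetical):

```python
def nw_general_gap(x, y, p, m=1, s=1):
    """Optimal score when a gap of length n costs p(n)."""
    M, N = len(x), len(y)
    NEG = float("-inf")
    D = [[NEG] * (N + 1) for _ in range(M + 1)]
    D[0][0] = 0
    for i in range(M + 1):
        for j in range(N + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:
                best = D[i - 1][j - 1] + (m if x[i - 1] == y[j - 1] else -s)
            for k in range(i):            # gap covering x_{k+1} .. x_i
                best = max(best, D[k][j] - p(i - k))
            for k in range(j):            # gap covering y_{k+1} .. y_j
                best = max(best, D[i][k] - p(j - k))
            D[i][j] = best
    return D[M][N]
```

With `p = lambda n: 2 * n` this reproduces the linear-gap score; any other penalty function can be substituted without changing the loop structure.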

  16-17. Affine Gaps
     • $p(n) = d + (n-1) \cdot e$, where d = gap open penalty (charged for the first gapped position) and e = gap extend penalty (charged for each further position)
     [Figure: gap penalty plot labeled with d and e.]
     • Now we need three matrices:
       $D(i,j)$ = score of alignment $x_1 \dots x_i$ to $y_1 \dots y_j$ if $x_i$ aligns to $y_j$
       $H(i,j)$ = score of alignment $x_1 \dots x_i$ to $y_1 \dots y_j$ if $y_j$ aligns to a gap
       $V(i,j)$ = score of alignment $x_1 \dots x_i$ to $y_1 \dots y_j$ if $x_i$ aligns to a gap

  18-19. Needleman-Wunsch with Affine Gaps
     • $$D(i,j) = \max \begin{cases}
         D(i-1,j-1) + s(x_i, y_j) \\
         H(i-1,j-1) + s(x_i, y_j) \\
         V(i-1,j-1) + s(x_i, y_j)
       \end{cases}$$
     • $$H(i,j) = \max \begin{cases}
         D(i,j-1) - d \\
         H(i,j-1) - e \\
         V(i,j-1) - d
       \end{cases}$$
     • $$V(i,j) = \max \begin{cases}
         D(i-1,j) - d \\
         H(i-1,j) - d \\
         V(i-1,j) - e
       \end{cases}$$
     • Running time = $O(MN)$
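The three recurrences translate directly into three tables. The sketch below is an illustration under stated assumptions: the name `nw_affine`, the default penalties, and the boundary values (only all-gap prefixes are reachable in row 0 and column 0, everything else is minus infinity) are not given in the slides.

```python
def nw_affine(x, y, m=1, s=1, d=3, e=1):
    """Optimal score with gap open penalty d and gap extend penalty e."""
    NEG = float("-inf")
    M, N = len(x), len(y)
    D = [[NEG] * (N + 1) for _ in range(M + 1)]  # x_i aligned to y_j
    H = [[NEG] * (N + 1) for _ in range(M + 1)]  # y_j aligned to a gap
    V = [[NEG] * (N + 1) for _ in range(M + 1)]  # x_i aligned to a gap
    D[0][0] = 0
    # Assumed boundary: a leading gap is opened once and then extended.
    for j in range(1, N + 1):
        H[0][j] = -d - (j - 1) * e
    for i in range(1, M + 1):
        V[i][0] = -d - (i - 1) * e
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            D[i][j] = max(D[i - 1][j - 1], H[i - 1][j - 1], V[i - 1][j - 1]) + sub
            H[i][j] = max(D[i][j - 1] - d, H[i][j - 1] - e, V[i][j - 1] - d)
            V[i][j] = max(D[i - 1][j] - d, H[i - 1][j] - d, V[i - 1][j] - e)
    return max(D[M][N], H[M][N], V[M][N])
```

Each cell does a constant amount of work in each of the three tables, which is where the O(MN) bound comes from.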

  20. Affine Gaps
     • Essentially, when there is a gap, the algorithm looks back one column to determine whether the gap opens a new gap or continues a previous one:

           - x          z x          z x
           y -          y -          - -
         Starts       Starts       Continues
         new gap      new gap      old gap

  21. Multiple Sequence Alignment
     • Given N sequences $x_1, x_2, \dots, x_N$, insert gaps in each sequence $x_i$ such that:
       • All sequences have the same length L
       • The global score is maximum
     • Motivation:
       • Faint similarity between two sequences becomes significant if present in many
       • Multiple alignments can help improve pairwise alignments

  22. Induced Pairwise Alignments
     • Multiple alignment:
           x: AC_GCGG_C
           y: AC_GC_GAG
           z: GCCGC_GAG
     • Induces three pairwise alignments:
           x: ACGCGG_C      x: AC_GCGG_C      y: AC_GCGAG
           y: ACGC_GAG      z: GCCGC_GAG      z: GCCGCGAG
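An induced pairwise alignment is obtained by taking two rows of the multiple alignment and dropping the columns where both rows hold a gap. A minimal sketch (the function name `induced_pair` is hypothetical; the `_` gap symbol follows the slide):

```python
def induced_pair(row_a, row_b, gap="_"):
    """Project a multiple alignment onto two of its rows."""
    # Keep every column except those where both rows are gapped.
    cols = [(p, q) for p, q in zip(row_a, row_b) if not (p == gap and q == gap)]
    return "".join(p for p, _ in cols), "".join(q for _, q in cols)
```

Applied to the rows x and y of the slide, `induced_pair("AC_GCGG_C", "AC_GC_GAG")` drops the third column, where both rows are gapped.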

  23. Sum of Pairs
     • The Sum of Pairs score of a multiple alignment m is the sum of the scores of all induced pairwise alignments:
       $$S(m) = \sum_{k<l} s(m_k, m_l)$$
       where $s(m_k, m_l)$ = score of induced alignment (k, l)
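As a rough illustration, the Sum of Pairs score can be computed by scoring every pair of rows and skipping double-gap columns, which is the same as scoring the induced alignment. The names `pair_score` and `sum_of_pairs`, the linear gap model and the default parameter values are assumptions made for this sketch.

```python
from itertools import combinations


def pair_score(a, b, m=1, s=1, d=2, gap="_"):
    """Linear-gap score of the alignment induced by rows a and b."""
    total = 0
    for p, q in zip(a, b):
        if p == gap and q == gap:
            continue                  # column vanishes in the induced alignment
        if p == gap or q == gap:
            total -= d
        elif p == q:
            total += m
        else:
            total -= s
    return total


def sum_of_pairs(rows, m=1, s=1, d=2, gap="_"):
    """Sum the induced pairwise scores over all pairs k < l."""
    return sum(
        pair_score(rows[k], rows[l], m, s, d, gap)
        for k, l in combinations(range(len(rows)), 2)
    )
```

For N rows this evaluates N(N-1)/2 induced alignments.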

  24. Multidimensional Dynamic Programming
     • Example in 3-D (3 sequences)
     • 7 neighbors per cell
       $$F(i,j,k) = \max \begin{cases}
         F(i-1,j-1,k-1) + S(x_i, x_j, x_k) \\
         F(i-1,j-1,k) + S(x_i, x_j, -) \\
         F(i-1,j,k-1) + S(x_i, -, x_k) \\
         F(i-1,j,k) + S(x_i, -, -) \\
         F(i,j-1,k-1) + S(-, x_j, x_k) \\
         F(i,j-1,k) + S(-, x_j, -) \\
         F(i,j,k-1) + S(-, -, x_k)
       \end{cases}$$

  25. Multidimensional Dynamic Programming
     • L = length of each sequence
     • N = number of sequences
     • Size of matrix = $L^N$
     • Neighbors per cell = $2^N - 1$
     • Running time = $O(2^N L^N)$
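The $2^N - 1$ neighbors are exactly the non-zero 0/1 vectors of length N: each sequence either contributes its next letter to the column or contributes a gap. The sketch below works for any small N; the names `column_score` and `msa_score`, the sum-of-pairs column score with linear gaps, and the default parameters are assumptions.

```python
from itertools import combinations, product


def column_score(col, m=1, s=1, d=2, gap="-"):
    """Sum-of-pairs score of one alignment column (assumed scoring)."""
    total = 0
    for p, q in combinations(col, 2):
        if p == gap and q == gap:
            continue
        if p == gap or q == gap:
            total -= d
        elif p == q:
            total += m
        else:
            total -= s
    return total


def msa_score(seqs, m=1, s=1, d=2):
    """Optimal multiple-alignment score by N-dimensional DP."""
    n = len(seqs)
    # All 2^N - 1 ways to advance at least one sequence.
    moves = [mv for mv in product((0, 1), repeat=n) if any(mv)]
    F = {(0,) * n: 0}
    # Lexicographic order guarantees predecessors are filled first.
    for cell in product(*(range(len(q) + 1) for q in seqs)):
        if not any(cell):
            continue
        best = float("-inf")
        for mv in moves:
            prev = tuple(c - b for c, b in zip(cell, mv))
            if min(prev) < 0:
                continue
            col = [seqs[t][cell[t] - 1] if mv[t] else "-" for t in range(n)]
            best = max(best, F[prev] + column_score(col, m, s, d))
        F[cell] = best
    return F[tuple(len(q) for q in seqs)]
```

The dictionary holds one entry per lattice cell, so memory as well as time grows exponentially with the number of sequences; this is only practical for a handful of short sequences.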

  26. Progressive Alignment
     • Align two of the sequences $x_i$ and $x_j$
     • Fix that alignment
     • Align a third sequence/alignment to the alignment $x_i x_j$
     • Repeat until all sequences are aligned

  27. Progressive Alignment
     • When the evolutionary tree is known, align closest first, in order of the tree:
       • Align (x, y)
       • Align (w, z)
       • Align (xy, wz)
     [Figure: evolutionary tree with leaves x, y, z, w.]

  28-30. Aligning three sequences
     [Figure: two panels, "Multidimensional Dynamic Programming" and "Progressive Alignment", each drawn on axes X, Y and Z.]

  31. Sequence vs Alignment
     • The score at each entry adds the score of aligning the column (letter) in y to the column in the alignment xz
     [Figure: DP matrix with the alignment of $x_1 \dots x_M$ and $z_1 \dots z_L$ along one axis and $y_1 \dots y_N$ along the other.]

  32. Example
     • ith letter of y: A
     • jth column of xz:
           x: -
           z: A
     • $$D(i,j) = \max \begin{cases}
         D(i-1,j-1) - d + s(A, A) \\
         D(i-1,j) - d - d \\
         D(i,j-1) + 0 - d
       \end{cases}$$
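The example generalizes to any number of rows: each move adds the score of the new letter or gap of y against every entry of the alignment column. The sketch below uses linear gaps and only scores the pairs that involve y, since the pairs inside the fixed alignment do not change. The function name `seq_vs_alignment`, the defaults and the boundary initialization are assumptions.

```python
def seq_vs_alignment(y, rows, m=1, s=1, d=2, gap="-"):
    """Best score of aligning sequence y to a fixed alignment (linear gaps)."""
    k, L = len(rows), len(rows[0])

    def pair(p, q):
        if p == gap and q == gap:
            return 0                      # gap against gap costs nothing
        if p == gap or q == gap:
            return -d
        return m if p == q else -s

    def column(j):
        return [r[j] for r in rows]

    D = [[0] * (L + 1) for _ in range(len(y) + 1)]
    for i in range(1, len(y) + 1):
        D[i][0] = D[i - 1][0] - k * d     # y_i faces a gap in every row
    for j in range(1, L + 1):
        D[0][j] = D[0][j - 1] + sum(pair(gap, q) for q in column(j - 1))
    for i in range(1, len(y) + 1):
        for j in range(1, L + 1):
            col = column(j - 1)
            D[i][j] = max(
                D[i - 1][j - 1] + sum(pair(y[i - 1], q) for q in col),
                D[i - 1][j] - k * d,
                D[i][j - 1] + sum(pair(gap, q) for q in col),
            )
    return D[len(y)][L]
```

For the single column of the slide, `seq_vs_alignment("A", ["-", "A"])` picks the first case, `-d + s(A, A)`.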

  33. Affine Gaps
     • ith letter of y matched with jth column of xz
     • (j-1)th column of xz gapped
           y: - A
           x: - -
           z: A A
     • This induces the yx alignment:
           y: - A
           x: - -

  34-36. Affine Gaps
     • Recall that for pairwise alignment there were three cases when determining whether a gap starts or continues a gap:

           - x          z x          z x
           y -          y -          - -
         Starts       Starts       Continues
         new gap      new gap      old gap

     • When aligning a sequence and an alignment, a fourth case arises:

           - x
           - -
         Starts or continues a gap???

  37. Aligning Alignments. John D. Kececioglu and Weiqing Zhang, 1998
     • Optimistic and pessimistic gap counts for sequence vs alignment
     • Exact gap counts for sequence vs alignment

  38. Sequence vs Alignment
     • $A = a_1 \dots a_m$ is a sequence of length m
     • B is a multiple alignment of length n of k sequences
       • represented by a k x n matrix
       • each entry $b_{ij}$ is either a letter or a gap

  39. Optimistic and Pessimistic Gap Counts
     • When we have
           - x
           - -
     • The optimistic gap count assumes that this continues a previous gap
     • The pessimistic gap count assumes that this starts a new gap
     • Running time = $O(kmn)$
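The two counts differ only in how the ambiguous double-gap predecessor is charged. The helper below is a hypothetical illustration of that decision for one pair of rows and one looked-back column; it is not the algorithm of Kececioglu and Zhang. The name `gap_char_cost`, the `mode` flag and the default penalties are assumptions.

```python
def gap_char_cost(prev, cur, d=3, e=1, mode="pessimistic", gap="-"):
    """Cost of the gapped position in column cur, given the column before it.

    prev and cur are (top, bottom) pairs; cur holds exactly one gap and
    prev is None at the start of the alignment.
    """
    g = 0 if cur[0] == gap else 1       # which row is gapped in cur
    if prev is None or prev[g] != gap:
        return d                        # previous entry is a letter: new gap
    if prev[1 - g] != gap:
        return e                        # letter over gap before: old gap continues
    # prev is a double-gap column: one look-back cannot decide.
    return e if mode == "optimistic" else d
```

For example, `gap_char_cost(("-", "-"), ("x", "-"), mode="optimistic")` charges the extension penalty, while the pessimistic mode charges the open penalty for the same pair of columns.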

  40. Exact Gap Counts
     • Recall the matrices:
       $D(i,j)$ = score of alignment $a_1 \dots a_i$ to $b_1 \dots b_j$ if $a_i$ aligns to $b_j$
       $H(i,j)$ = score of alignment $a_1 \dots a_i$ to $b_1 \dots b_j$ if $b_j$ aligns to a gap
       $V(i,j)$ = score of alignment $a_1 \dots a_i$ to $b_1 \dots b_j$ if $a_i$ aligns to a gap
     • The only ways to get
           - x
           - -
       are the cases HH, HV, and HD, generalized as HX

  41-43. Exact Gap Counts
     • Three possibilities:
       • … DH…HX
       • … VH…HX
       • H………HX
     • Is $b_{ij}$ the first character in its row encountered during the run?
     • Algorithm with lots of matrices runs in O(kn + kmn + mn), where two of the terms carry an exponent of 2 whose placement is not recoverable from this transcript

  44-46. Comparison
     • Sequence vs Alignment: only three types of paths can cause
           … - - x
           … - - -
     • Alignment vs Alignment: any path can cause
           … - - x
           … - - -

  47. Aligning Alignments Exactly. John Kececioglu and Dean Starrett, 2003
     • Aligning two alignments is NP-complete
     • Exact algorithm
     • Time and space complexity
     • Pruning
     • Results

  48. NP-Completeness
     • Reduction from the Maximum Cut Problem
     • Still NP-complete if:
       • Strings are of length at most 5
       • Every row has at most 3 gaps
       • At most 1 gap in the interior of each string

  49. Exact Algorithm
     • It is sufficient to know the relative order of the rightmost element in the row for each pair:
           x: - A
           y: - -
     • If x's rightmost element is to the right of y's rightmost element, this is an extension
     • Otherwise, it is a startup
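The rule can be sketched for a single pair of rows. The helper below assumes the part of the alignment built so far is available as two gapped prefix strings and that the new column puts a letter in x over a gap in y; the names `rightmost_letter` and `classify_gap` are hypothetical.

```python
def rightmost_letter(row, gap="-"):
    """Index of the last non-gap character in row, or -1 if there is none."""
    for pos in range(len(row) - 1, -1, -1):
        if row[pos] != gap:
            return pos
    return -1


def classify_gap(prefix_x, prefix_y, gap="-"):
    """Classify the gap in y created by appending a letter of x over a gap.

    Only the relative order of the two rightmost letters matters, not the
    columns in between.
    """
    rx = rightmost_letter(prefix_x, gap)
    ry = rightmost_letter(prefix_y, gap)
    return "extension" if rx > ry else "startup"
```

For instance, `classify_gap("AC-", "A--")` is an extension because the C of x already faces a gap in y, whereas `classify_gap("AC-", "AG-")` is a startup because the last non-empty column pairs two letters.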

  50. Shapes
         A: -AGGCTATCACCTGACCTCCAGG
         B: TAG-CTATCAC--GACCGC----
         C: CAG-CTATCAC--GACCGC----
         D: CAGCCTATCACC-GAACGCCA--
