1 / 38

CS 5263 Bioinformatics

CS 5263 Bioinformatics. Lecture 8: Multiple Sequence Alignment. Multiple Sequence Alignment. Definition Given N sequences x 1 , x 2 ,…, x N : Insert gaps (-) in each sequence x i , such that All sequences have the same length L Score of the alignment is maximum Two issues

natan
Télécharger la présentation

CS 5263 Bioinformatics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS 5263 Bioinformatics Lecture 8: Multiple Sequence Alignment

  2. Multiple Sequence Alignment • Definition • Given N sequences x1, x2,…, xN: Insert gaps (-) in each sequence xi, such that • All sequences have the same length L • Score of the alignment is maximum • Two issues • How to score an alignment? • How to find a (nearly) optimal alignment?

  3. Scoring function - first assumption • Columns are independent • Similar in pair-wise alignment • Therefore, the score of an alignment is the sum of all columns • Need to decide how to score a single column

  4. Scoring function (cont’d) • Ideally: • An n-dimensional matrix, where n is the number of sequences • E.g. (A, C, C, G, -) for aligning 5 sequences • Total number of parameters: (k+1)n, where k is the alphabet size • Direct estimation of such scores is difficult • Too many parameters to estimate • Even more difficult if need to consider phylogenetic relationships x y Phylogenetic tree or evolution tree (more detail in future lectures) z ? w v

  5. Scoring Function (cont’d) • Compromises: • Compute from pair-wise scores • Option 1: Based on sum of all pair-wise scores • Option 2: Based on scores with a consensus sequence • Other options • Consider tree topology explicitly • Information-theory based score • Difficult to optimize

  6. Scoring Function: Sum Of Pairs Definition: Induced pairwise alignment A pairwise alignment induced by the multiple alignment Example: x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG Induces: x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG - - - -

  7. Sum Of Pairs (cont’d) • The sum-of-pairs (SP) score of an alignment is the sum of the scores of all induced pairwise alignments S(m) = k<l s(mk, ml) s(mk, ml): score of induced alignment (k,l)

  8. Example: x:AC-GCGG-C y:AC-GC-GAG z:GCCGC-GAG (A,A) + (A,G) x 2 = -1 (G,G) x 3 = 3 (-,A) x 2 + (A,A) = -1 Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5

  9. Sum Of Pairs (cont’d) • Drawback: no evolutionary characterization • Every sequence derived from all others • Heuristic way to incorporate evolution tree • Weighted Sum of Pairs: • S(m) = k<l wkl s(mk, ml) • wkl: weight decreasing with distance Human Mouse Duck Chicken

  10. Consensus score -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Consensus sequence: • Find optimal consensus string m* to maximize S(m) = i s(m*, mi) s(mk, ml): score of pairwise alignment (k,l)

  11. Multiple Sequence Alignments Algorithms • Can also be global or local • We only talk about global for now • A simple method • Do pairwise alignment between all pairs • Combine the pairwise alignments into a single multiple alignment • Is this going to work?

  12. Compatible pairwise alignments AAAATTTT AAAATTTT---- ----TTTTGGGG AAAATTTT---- AAAA----GGGG AAAATTTT---- ----TTTTGGGG AAAA----GGGG TTTTGGGG AAAAGGGG ----TTTTGGGG AAAA----GGGG

  13. Incompatible pairwise alignments AAAATTTT AAAATTTT---- ----TTTTGGGG ----AAAATTTT GGGGAAAA---- ? TTTTGGGG GGGGAAAA TTTTGGGG---- ----GGGGAAAA

  14. Multidimensional Dynamic Programming (MDP) Generalization of Needleman-Wunsh: • Find the longest path in a high-dimensional cube • As opposed to a two-dimensional grid • Uses a N-dimensional matrix • As apposed to a two-dimensional array • Entry F(i1, …, ik) represents score of optimal alignment for s1[1..i1], … sk[1..ik] F(i1,i2,…,iN) = max(all neighbors of a cell) (F(nbr)+S(current))

  15. Multidimensional Dynamic Programming (MDP) • Example: in 3D (three sequences): • 23 – 1 = 7 neighbors/cell (i-1,j-1,k-1) (i-1,j,k-1) (i-1,j-1,k) (i-1,j,k) F(i-1,j-1,k-1) + S(xi, xj, xk), F(i-1,j-1,k ) + S(xi, xj, -), F(i-1,j ,k-1) + S(xi, -, xk), F(i,j,k) = max F(i ,j-1,k-1) + S(-, xj, xk), F(i-1,j ,k ) + S(xi, -, -), F(i ,j-1,k ) + S(-, xj, -), F(i ,j ,k-1) + S(-, -, xk) (i,j,k-1) (i,j-1,k-1) (i,j-1,k) (i,j,k)

  16. Multidimensional Dynamic Programming (MDP) Running Time: • Size of matrix: LN; Where L = length of each sequence N = number of sequences • Neighbors/cell: 2N – 1 Therefore………………………… O(2N LN)

  17. Faster MDP • Carrillo & Lipman, 1988 • Branch and bound • Other heuristics • Implemented in a tool called MSA • Practical for about 6 sequences of length about 200-300.

  18. Faster MDP • Basic idea: bounds of the optimal score of a multiple alignment can be pre-computed • Upper-bound: sum of optimal pair-wise alignment scores, i.e. S(m) = k<l s(mk, ml)  k<l s(k, l) • lower-bounded: score computed by any approximate algorithm (such as the ones we’ll talk next) • For any partial path, if Scurrent + Sperspective < lower-bound, can give up that path • Guarantees optimality Optimal msa Score of optimal alignment between k and l Score of the alignment between k and l induced by m

  19. Progressive Alignment • Multiple Alignment is NP-hard • Most used heuristic: Progressive Alignment Algorithm: • Align two of the sequences xi, xj • Fix that alignment • Align a third sequence xk to the alignment xi,xj • Repeat until all sequences are aligned Running Time: O(NL2) Each alignment takes O(L2) Repeat N times

  20. Progressive Alignment x • When evolutionary tree is known: • Align closest first, in the order of the tree Example: Order of alignments: 1. (x,y) 2. (z,w) 3. (xy, zw) y z w

  21. Progressive Alignment: CLUSTALW CLUSTALW: most popular multiple protein alignment Algorithm: • Find all dij: alignment dist (xi, xj) • High alignment score => short distance • Construct a tree (Neighbor-joining hierarchical clustering. Will discuss in future) • Align nodes in order of decreasing similarity + a large number of heuristics

  22. CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD

  23. CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD Distance matrix

  24. CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD s1 s3 s2 s4

  25. CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD -ALSK NA-SK s1 s3 s2 s4

  26. CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD -ALSK NA-SK -TNSD NT-SD s1 s3 s2 s4

  27. CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD -ALSK NA-SK -ALSK -TNSD NA-SK NT-SD -TNSD NT-SD s1 s3 s2 s4

  28. Iterative Refinement Problems with progressive alignment: • Depend on pair-wise alignments • If sequences are very distantly related, much higher likelihood of errors • Initial alignments are “frozen” even when new evidence comes Example: x: GAAGTT y: GAC-TT z: GAACTG w: GTACTG Frozen! Now clear: correct y should be GA-CTT

  29. Iterative Refinement Algorithm (Barton-Stenberg): • Align most similar xi, xj • Align xk most similar to (xixj) • Repeat 2 until (x1…xN) are aligned • For j = 1 to N, Remove xj, and realign to x1…xj-1xj+1…xN • Repeat 4 until convergence Progressive alignment

  30. allow y to vary x,z fixed projection Iterative Refinement (cont’d) For each sequence y • Remove y • Realign y (while rest fixed) z x y Note: Guaranteed to converge (why?) Running time: O(kNL2), k: number of iterations

  31. Iterative Refinement Example: align (x,y), (z,w), (xy, zw): x: GAAGTTA y: GAC-TTA z: GAACTGA w: GTACTGA After realigning y: x: GAAGTTA y: G-ACTTA + 3 matches z: GAACTGA w: GTACTGA

  32. Iterative Refinement • Example not handled well: x: GAAGTTA y1: GAC-TTA y2: GAC-TTA y3: GAC-TTA z: GAACTGA w: GTACTGA Realigning any single yi changes nothing

  33. Restricted MDP • Similar to bounded DP in pair-wise alignment • Construct progressive multiple alignment m • Run MDP, restricted to radius R from m z x y Running Time: O(2N RN-1 L)

  34. Restricted MDP x: GAAGTTA y1: GAC-TTA y2: GAC-TTA y3: GAC-TTA z: GAACTGA w: GTACTGA • Within radius 1 of the optimal  Restricted MDP will fix it.

  35. Other approaches • Statistical learning methods • Profile Hidden Markov Models • Will discuss in future lectures • Consistency-based methods • Still rely on pairwise alignment • But consider a third seq when aligning two seqs • If block A in seq x aligns to block B in seq y, and both aligns to block C in seq z, we have higher confidence to say that the alignment between A-B is reliable • Essentially: change scoring system according to consistency • Than applied DP as in other approaches • Pioneered by a tool called T-Coffee

  36. Multiple alignment tools • Clustal W (Thompson, 1994) • Most popular • T-Coffee (Notredame, 2000) • Another popular tool • Consistency-based • Slower than clustalW, but generally more accurate for more distantly related sequences • MUSCLE (Edgar, 2004) • Iterative refinement • More efficient than most others • DIALIGN (Morgenstern, 1998, 1999, 2005) • “local” • Align-m (Walle, 2004) • “local” • PROBCONS (Do, 2004) • Probabilistic consistency-based • Best accuracy on benchmarks • ProDA (Phuong, 2006) • Allow repeated and shuffled regions

  37. In summary • Multiple alignment scoring functions • Sum of pairs • Other funcs exist, but less used • Multiple alignment algorithms: • MDP • Optimal • too slow • Branch & Bound doesn’t solve the problem entirely • Progressive alignment: clustalW • Iterative refinement • Restricted MDP • Consistency-based Heuristic

More Related