CS 5263 Bioinformatics

CS 5263 Bioinformatics Lecture 8: Multiple Sequence Alignment

Multiple Sequence Alignment • Definition • Given N sequences x1, x2,…, xN: Insert gaps (-) in each sequence xi, such that • All sequences have the same length L • Score of the alignment is maximum • Two issues • How to score an alignment? • How to find a (nearly) optimal alignment?

Scoring function - first assumption • Columns are independent • Similar in pair-wise alignment • Therefore, the score of an alignment is the sum of all columns • Need to decide how to score a single column

Scoring function (cont’d) • Ideally: • An n-dimensional matrix, where n is the number of sequences • E.g. (A, C, C, G, -) for aligning 5 sequences • Total number of parameters: (k+1)n, where k is the alphabet size • Direct estimation of such scores is difficult • Too many parameters to estimate • Even more difficult if need to consider phylogenetic relationships x y Phylogenetic tree or evolution tree (more detail in future lectures) z ? w v

Scoring Function (cont’d) • Compromises: • Compute from pair-wise scores • Option 1: Based on sum of all pair-wise scores • Option 2: Based on scores with a consensus sequence • Other options • Consider tree topology explicitly • Information-theory based score • Difficult to optimize

Scoring Function: Sum Of Pairs Definition: Induced pairwise alignment A pairwise alignment induced by the multiple alignment Example: x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG Induces: x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG - - - -

Sum Of Pairs (cont’d) • The sum-of-pairs (SP) score of an alignment is the sum of the scores of all induced pairwise alignments S(m) = k<l s(mk, ml) s(mk, ml): score of induced alignment (k,l)

Example: x:AC-GCGG-C y:AC-GC-GAG z:GCCGC-GAG (A,A) + (A,G) x 2 = -1 (G,G) x 3 = 3 (-,A) x 2 + (A,A) = -1 Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5

Sum Of Pairs (cont’d) • Drawback: no evolutionary characterization • Every sequence derived from all others • Heuristic way to incorporate evolution tree • Weighted Sum of Pairs: • S(m) = k<l wkl s(mk, ml) • wkl: weight decreasing with distance Human Mouse Duck Chicken

Consensus score -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Consensus sequence: • Find optimal consensus string m* to maximize S(m) = i s(m*, mi) s(mk, ml): score of pairwise alignment (k,l)

Multiple Sequence Alignments Algorithms • Can also be global or local • We only talk about global for now • A simple method • Do pairwise alignment between all pairs • Combine the pairwise alignments into a single multiple alignment • Is this going to work?

Compatible pairwise alignments AAAATTTT AAAATTTT---- ----TTTTGGGG AAAATTTT---- AAAA----GGGG AAAATTTT---- ----TTTTGGGG AAAA----GGGG TTTTGGGG AAAAGGGG ----TTTTGGGG AAAA----GGGG

Incompatible pairwise alignments AAAATTTT AAAATTTT---- ----TTTTGGGG ----AAAATTTT GGGGAAAA---- ? TTTTGGGG GGGGAAAA TTTTGGGG---- ----GGGGAAAA

Multidimensional Dynamic Programming (MDP) Generalization of Needleman-Wunsh: • Find the longest path in a high-dimensional cube • As opposed to a two-dimensional grid • Uses a N-dimensional matrix • As apposed to a two-dimensional array • Entry F(i1, …, ik) represents score of optimal alignment for s1[1..i1], … sk[1..ik] F(i1,i2,…,iN) = max(all neighbors of a cell) (F(nbr)+S(current))

Multidimensional Dynamic Programming (MDP) • Example: in 3D (three sequences): • 23 – 1 = 7 neighbors/cell (i-1,j-1,k-1) (i-1,j,k-1) (i-1,j-1,k) (i-1,j,k) F(i-1,j-1,k-1) + S(xi, xj, xk), F(i-1,j-1,k ) + S(xi, xj, -), F(i-1,j ,k-1) + S(xi, -, xk), F(i,j,k) = max F(i ,j-1,k-1) + S(-, xj, xk), F(i-1,j ,k ) + S(xi, -, -), F(i ,j-1,k ) + S(-, xj, -), F(i ,j ,k-1) + S(-, -, xk) (i,j,k-1) (i,j-1,k-1) (i,j-1,k) (i,j,k)

Multidimensional Dynamic Programming (MDP) Running Time: • Size of matrix: LN; Where L = length of each sequence N = number of sequences • Neighbors/cell: 2N – 1 Therefore………………………… O(2N LN)

Faster MDP • Carrillo & Lipman, 1988 • Branch and bound • Other heuristics • Implemented in a tool called MSA • Practical for about 6 sequences of length about 200-300.

Faster MDP • Basic idea: bounds of the optimal score of a multiple alignment can be pre-computed • Upper-bound: sum of optimal pair-wise alignment scores, i.e. S(m) = k<l s(mk, ml)  k<l s(k, l) • lower-bounded: score computed by any approximate algorithm (such as the ones we’ll talk next) • For any partial path, if Scurrent + Sperspective < lower-bound, can give up that path • Guarantees optimality Optimal msa Score of optimal alignment between k and l Score of the alignment between k and l induced by m

Progressive Alignment • Multiple Alignment is NP-hard • Most used heuristic: Progressive Alignment Algorithm: • Align two of the sequences xi, xj • Fix that alignment • Align a third sequence xk to the alignment xi,xj • Repeat until all sequences are aligned Running Time: O(NL2) Each alignment takes O(L2) Repeat N times

Progressive Alignment x • When evolutionary tree is known: • Align closest first, in the order of the tree Example: Order of alignments: 1. (x,y) 2. (z,w) 3. (xy, zw) y z w

Progressive Alignment: CLUSTALW CLUSTALW: most popular multiple protein alignment Algorithm: • Find all dij: alignment dist (xi, xj) • High alignment score => short distance • Construct a tree (Neighbor-joining hierarchical clustering. Will discuss in future) • Align nodes in order of decreasing similarity + a large number of heuristics

CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD

CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD Distance matrix

CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD s1 s3 s2 s4

CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD -ALSK NA-SK s1 s3 s2 s4

CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD -ALSK NA-SK -TNSD NT-SD s1 s3 s2 s4

CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD -ALSK NA-SK -ALSK -TNSD NA-SK NT-SD -TNSD NT-SD s1 s3 s2 s4

Iterative Refinement Problems with progressive alignment: • Depend on pair-wise alignments • If sequences are very distantly related, much higher likelihood of errors • Initial alignments are “frozen” even when new evidence comes Example: x: GAAGTT y: GAC-TT z: GAACTG w: GTACTG Frozen! Now clear: correct y should be GA-CTT

Iterative Refinement Algorithm (Barton-Stenberg): • Align most similar xi, xj • Align xk most similar to (xixj) • Repeat 2 until (x1…xN) are aligned • For j = 1 to N, Remove xj, and realign to x1…xj-1xj+1…xN • Repeat 4 until convergence Progressive alignment

allow y to vary x,z fixed projection Iterative Refinement (cont’d) For each sequence y • Remove y • Realign y (while rest fixed) z x y Note: Guaranteed to converge (why?) Running time: O(kNL2), k: number of iterations

Iterative Refinement Example: align (x,y), (z,w), (xy, zw): x: GAAGTTA y: GAC-TTA z: GAACTGA w: GTACTGA After realigning y: x: GAAGTTA y: G-ACTTA + 3 matches z: GAACTGA w: GTACTGA

Iterative Refinement • Example not handled well: x: GAAGTTA y1: GAC-TTA y2: GAC-TTA y3: GAC-TTA z: GAACTGA w: GTACTGA Realigning any single yi changes nothing

Restricted MDP • Similar to bounded DP in pair-wise alignment • Construct progressive multiple alignment m • Run MDP, restricted to radius R from m z x y Running Time: O(2N RN-1 L)

Restricted MDP x: GAAGTTA y1: GAC-TTA y2: GAC-TTA y3: GAC-TTA z: GAACTGA w: GTACTGA • Within radius 1 of the optimal  Restricted MDP will fix it.

Other approaches • Statistical learning methods • Profile Hidden Markov Models • Will discuss in future lectures • Consistency-based methods • Still rely on pairwise alignment • But consider a third seq when aligning two seqs • If block A in seq x aligns to block B in seq y, and both aligns to block C in seq z, we have higher confidence to say that the alignment between A-B is reliable • Essentially: change scoring system according to consistency • Than applied DP as in other approaches • Pioneered by a tool called T-Coffee

Multiple alignment tools • Clustal W (Thompson, 1994) • Most popular • T-Coffee (Notredame, 2000) • Another popular tool • Consistency-based • Slower than clustalW, but generally more accurate for more distantly related sequences • MUSCLE (Edgar, 2004) • Iterative refinement • More efficient than most others • DIALIGN (Morgenstern, 1998, 1999, 2005) • “local” • Align-m (Walle, 2004) • “local” • PROBCONS (Do, 2004) • Probabilistic consistency-based • Best accuracy on benchmarks • ProDA (Phuong, 2006) • Allow repeated and shuffled regions

In summary • Multiple alignment scoring functions • Sum of pairs • Other funcs exist, but less used • Multiple alignment algorithms: • MDP • Optimal • too slow • Branch & Bound doesn’t solve the problem entirely • Progressive alignment: clustalW • Iterative refinement • Restricted MDP • Consistency-based Heuristic

CS 5263 Bioinformatics