Multiple Sequence Alignment

Chitta Baral, Arizona State University

Motivation and Introduction
• Need for multiple sequence alignment
  • We have the sequences of several proteins that have similar function in a number of different species.
  • We may want to know which parts of these sequences are similar and which parts are different.
• What is multiple alignment?
  • Let s_1, …, s_k be a set of sequences over the same alphabet.
  • Spaces are inserted in s_1, …, s_k to make them all the same length.
  • When the extended sequences are aligned, no column may consist exclusively of spaces.
• An example
  M Q P I L L L
  M L R - L L -
  M K - I L L L
  M P P V L I L
• First important issue: defining the quality of an alignment.

The 'sum-of-pairs' (SP) measure
• Requirements for a good measure of alignment quality
  • It should be additive over the columns of the alignment.
  • It must be independent of the order of its arguments.
  • It should reward the presence of many equal or strongly related symbols in the same column, and penalize unrelated symbols and spaces.
• SP function: the sum of the pairwise scores of all pairs of symbols in the column
  • SP-score(I, -, I, V) = s(I,-) + s(I,I) + s(I,V) + s(-,I) + s(-,V) + s(I,V)
  • s(-,-) = 0
• Theorem: Let α be a multiple alignment of the set of sequences s_1, …, s_k, and let α_ij denote the pairwise alignment of s_i and s_j induced by α. Then SP-score(α) = Σ_{i<j} score(α_ij).
  • This holds only if s(-,-) = 0.
  • The reason is that, in the induced pairwise alignment, columns in which both sequences have a space are dropped, so such columns must contribute nothing to the SP-score.
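The SP measure and the theorem can be checked numerically. The sketch below is only an illustration: the scoring values (match +1, mismatch -1, space -2) and the function names are assumptions, not part of the slides. It scores the example alignment column by column and compares the result with the sum of the induced pairwise scores.

```python
from itertools import combinations

GAP = "-"

def s(p, q, match=1, mismatch=-1, gap=-2):
    """Hypothetical symbol score; the theorem needs s('-', '-') = 0."""
    if p == GAP and q == GAP:
        return 0
    if p == GAP or q == GAP:
        return gap
    return match if p == q else mismatch

def sp_score_column(column):
    """Sum of s over all unordered pairs of symbols in one column."""
    return sum(s(p, q) for p, q in combinations(column, 2))

def sp_score(alignment):
    """SP-score of a multiple alignment given as equal-length rows."""
    return sum(sp_score_column(col) for col in zip(*alignment))

def induced_pair_score(alignment, i, j):
    """Score of the pairwise alignment of rows i and j induced by the
    multiple alignment; columns with two spaces are dropped."""
    return sum(s(p, q)
               for p, q in zip(alignment[i], alignment[j])
               if (p, q) != (GAP, GAP))

alignment = ["MQPILLL", "MLR-LL-", "MK-ILLL", "MPPVLIL"]
total = sp_score(alignment)
pairwise = sum(induced_pair_score(alignment, i, j)
               for i, j in combinations(range(len(alignment)), 2))
print(total, pairwise)  # the two numbers agree because s('-', '-') = 0
```

Changing `s` so that two aligned spaces score anything other than 0 makes the two totals differ on alignments where some column holds two or more spaces.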

Optimal alignment using dynamic programming
• Consider k sequences, each of length n.
• Use a k-dimensional array A of length n+1 in each dimension.
• Initialize A[0, …, 0] = 0.
• A[i] = max over b of { A[i − b] + SP-score(Column(s, i, b)) }
  • Here i = (i_1, …, i_k) and b = (b_1, …, b_k) are k-tuples.
  • b ranges over all non-zero binary vectors of k elements for which i − b has no negative component.
  • Column(s, i, b) = (c_j), 1 ≤ j ≤ k, with c_j = s_j[i_j] if b_j = 1 and c_j = '-' if b_j = 0.
• For k = 3, A[i_1, i_2, i_3] is the max of
  • A[i_1, i_2, i_3−1] + SP-score(-, -, s_3[i_3])
  • A[i_1, i_2−1, i_3] + SP-score(-, s_2[i_2], -)
  • A[i_1, i_2−1, i_3−1] + SP-score(-, s_2[i_2], s_3[i_3])
  • A[i_1−1, i_2, i_3] + SP-score(s_1[i_1], -, -)
  • A[i_1−1, i_2, i_3−1] + SP-score(s_1[i_1], -, s_3[i_3])
  • A[i_1−1, i_2−1, i_3] + SP-score(s_1[i_1], s_2[i_2], -)
  • A[i_1−1, i_2−1, i_3−1] + SP-score(s_1[i_1], s_2[i_2], s_3[i_3])
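The recurrence can be written for general k by enumerating the binary vectors b. This is a minimal sketch, assuming a simple scoring function (match +1, mismatch -1, space -2); the names `msa_score`, `sp_score`, and `s` are illustrative and do not come from the slides. It stores the table in a dictionary keyed by index tuples and returns only the optimal score, not the alignment.

```python
from itertools import combinations, product

GAP = "-"

def s(p, q, match=1, mismatch=-1, gap=-2):
    """Hypothetical symbol score with s('-', '-') = 0."""
    if p == GAP and q == GAP:
        return 0
    if p == GAP or q == GAP:
        return gap
    return match if p == q else mismatch

def sp_score(column):
    return sum(s(p, q) for p, q in combinations(column, 2))

def msa_score(seqs):
    """Optimal SP-score of a multiple alignment of seqs (full table)."""
    k = len(seqs)
    lengths = [len(t) for t in seqs]
    # All non-zero binary vectors b of k elements.
    steps = [b for b in product((0, 1), repeat=k) if any(b)]
    A = {(0,) * k: 0}
    # product() yields index tuples in lexicographic order, so every
    # predecessor i - b is already filled when cell i is visited.
    for i in product(*(range(n + 1) for n in lengths)):
        if not any(i):
            continue  # the origin is already initialized
        best = None
        for b in steps:
            if any(bx > ix for bx, ix in zip(b, i)):
                continue  # i - b would leave the table
            prev = tuple(ix - bx for ix, bx in zip(i, b))
            # Column(s, i, b): symbol s_x[i_x] where b_x = 1, space otherwise.
            column = [seqs[x][i[x] - 1] if b[x] else GAP for x in range(k)]
            cand = A[prev] + sp_score(column)
            if best is None or cand > best:
                best = cand
        A[i] = best
    return A[tuple(lengths)]

print(msa_score(["AT", "A", "T", "AT", "AT"]))
```

The sequences need not have equal length here; the table simply has length n_x + 1 in dimension x.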

Complexity analysis of the dynamic programming algorithm
• Running time:
  • The table has (n+1)^k entries.
  • For each entry we need to find the maximum of 2^k − 1 elements.
  • Finding the SP-score corresponding to each element means adding O(k^2) numbers.
  • Total = O(k^2 2^k n^k), i.e., exponential with respect to k.
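To get a feel for these counts, the three factors can simply be multiplied out for concrete k and n. The helper below is an illustrative sketch (the name `dp_cost` is made up here); it counts exactly k(k−1)/2 pairwise terms per column, which is the O(k^2) factor.

```python
def dp_cost(k, n):
    """Return (table entries, candidates per entry, pair scores per
    candidate, total pair-score additions) for k sequences of length n."""
    entries = (n + 1) ** k
    candidates = 2 ** k - 1
    pairs_per_column = k * (k - 1) // 2
    return entries, candidates, pairs_per_column, entries * candidates * pairs_per_column

for k in (2, 3, 5, 8):
    print(k, dp_cost(k, 100))
```

Each additional sequence multiplies the table size by n + 1 and roughly doubles the number of candidates per entry.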

A heuristic-based approach
• Outline of the approach
  • We have k sequences of length n each, and we want to compute an optimal alignment according to the SP measure.
  • We use dynamic programming, but avoid filling all entries of the k-dimensional array and fill only the 'relevant' ones.
• Which cells are relevant, and why?
  • Idea: look at the pairwise projections of cells.
  • Note: an optimal multiple alignment may induce pairwise projections that are not optimal.
  • Example: the multiple alignment
    A T
    A -
    - T
    A T
    A T
    is optimal, but its projection onto the second and third sequences (A - over - T) is not an optimal pairwise alignment.

Heuristic-based approach (cont.)
• Recall that F(i,j) denoted the score of the best alignment between the initial segments x[1..i] and y[1..j]. Let us denote it by sim(x[1..i], y[1..j]) and refer to it as a_xy[i,j]; i.e., a_xy[i,j] = sim(x[1..i], y[1..j]).
• Let b_xy[i,j] = sim(x[i+1..n], y[j+1..m]).
  • It is computed like a_xy, but backwards.
• Let c_xy[i,j] = a_xy[i,j] + b_xy[i,j].
  • This is the highest score of an alignment of x and y that cuts at (i,j).
• Using the c matrix, it is easy to find an optimal alignment.
  • Find a path from [n,m] to [0,0] along which every cell has the value c_xy[n,m].
• Suppose we know a lower bound L_xy on sim(x,y), i.e., we know for sure that sim(x,y) ≥ L_xy.
  • In that case, c_xy[i,j] < L_xy means that no alignment cutting at (i,j) reaches the score L_xy, so the cut at (i,j) does not lead to a best alignment.
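The three matrices can be computed with two runs of the usual pairwise recurrence, one on the sequences and one on their reversals. This is a sketch under assumed scoring values (match +1, mismatch -1, space -2); the function names are illustrative. Strings are 0-indexed in Python, so `a[i][j]` scores the prefixes `x[:i]` and `y[:j]`, and `b[i][j]` scores the suffixes `x[i:]` and `y[j:]`.

```python
GAP_SCORE = -2  # assumed score for aligning a symbol with a space

def s(p, q, match=1, mismatch=-1):
    """Hypothetical symbol score for two non-space symbols."""
    return match if p == q else mismatch

def prefix_scores(x, y):
    """a[i][j] = sim(x[:i], y[:j]) for all i, j."""
    n, m = len(x), len(y)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        a[i][0] = a[i - 1][0] + GAP_SCORE
    for j in range(1, m + 1):
        a[0][j] = a[0][j - 1] + GAP_SCORE
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a[i][j] = max(a[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          a[i - 1][j] + GAP_SCORE,
                          a[i][j - 1] + GAP_SCORE)
    return a

def cut_matrices(x, y):
    """Return (a, b, c) with c[i][j] = a[i][j] + b[i][j]."""
    n, m = len(x), len(y)
    a = prefix_scores(x, y)
    # Suffix scores: run the same recurrence on the reversed sequences.
    rev = prefix_scores(x[::-1], y[::-1])
    b = [[rev[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]
    c = [[a[i][j] + b[i][j] for j in range(m + 1)] for i in range(n + 1)]
    return a, b, c

a, b, c = cut_matrices("ATTGC", "ATGC")
best = c[len("ATTGC")][len("ATGC")]
# Cells whose cut score equals sim(x, y) lie on some optimal alignment path.
on_optimal_path = [(i, j) for i, row in enumerate(c)
                   for j, v in enumerate(row) if v == best]
print(best, on_optimal_path)
```

No cell of `c` can exceed `sim(x, y)`, and both corners `c[0][0]` and `c[n][m]` equal it.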

Heuristic-based approach (cont.): a theorem
• Theorem: Let α be an optimal alignment involving s_1, …, s_k. If SP-score(α) ≥ L, then score(α_ij) ≥ L_ij, where L_ij = L − Σ_{x<y, (x,y)≠(i,j)} sim(s_x, s_y).
• Proof:
  • SP-score(α) ≥ L iff Σ_{x<y} score(α_xy) ≥ L
  • iff Σ_{x<y, (x,y)≠(i,j)} score(α_xy) ≥ L − score(α_ij)
  • implies Σ_{x<y, (x,y)≠(i,j)} sim(s_x, s_y) ≥ L − score(α_ij), because sim(s_x, s_y) is the best pairwise score and hence is greater than or equal to score(α_xy)
  • iff score(α_ij) ≥ L − Σ_{x<y, (x,y)≠(i,j)} sim(s_x, s_y) = L_ij.
• Implication of this theorem:
  • Suppose we have a lower bound L on the optimal SP-score (for example, the SP-score of any known alignment).
  • Then a cell with index (i_1, …, i_k) is relevant if the score of the best alignment (say α) that cuts through (i_1, …, i_k) is ≥ L.
  • By the theorem, this implies that for all x, y with 1 ≤ x < y ≤ k we have score(α_xy) ≥ L_xy,
  • which means c_xy[i_x, i_y] ≥ L_xy.
  • This is because the projection α_xy cuts through (i_x, i_y), and c_xy[i_x, i_y] is the best score of any pairwise alignment that does so.
• Idea of the algorithm:
  • Pick a lower bound L; compute c_xy and L_xy for each pair x, y with 1 ≤ x < y ≤ k.
  • Start with (0, …, 0), expand its influence to dependent relevant cells, and continue until the final corner cell is reached.
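The pairwise bounds L_xy follow directly from L and the pairwise similarities. The sketch below assumes a simple scoring function (match +1, mismatch -1, space -2) and uses illustrative names (`sim`, `pair_lower_bounds`). With the three sequences AT, A, T and the alignment AT / A- / -T, whose SP-score is -6 under this scoring, every pair gets the bound -4.

```python
from itertools import combinations

GAP_SCORE = -2  # assumed score for aligning a symbol with a space

def s(p, q, match=1, mismatch=-1):
    """Hypothetical symbol score for two non-space symbols."""
    return match if p == q else mismatch

def sim(x, y):
    """Optimal global pairwise alignment score, computed row by row."""
    prev = [j * GAP_SCORE for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * GAP_SCORE]
        for j in range(1, len(y) + 1):
            cur.append(max(prev[j - 1] + s(x[i - 1], y[j - 1]),
                           prev[j] + GAP_SCORE,
                           cur[j - 1] + GAP_SCORE))
        prev = cur
    return prev[-1]

def pair_lower_bounds(seqs, L):
    """L_xy = L minus the sum of sim over all pairs other than (x, y)."""
    pairs = list(combinations(range(len(seqs)), 2))
    sims = {(x, y): sim(seqs[x], seqs[y]) for x, y in pairs}
    total = sum(sims.values())
    return {p: L - (total - sims[p]) for p in pairs}

# L = -6 is the SP-score of the alignment AT / A- / -T under this scoring.
print(pair_lower_bounds(["AT", "A", "T"], -6))
```

In this example the projection of that alignment onto the second and third sequences (A- over -T) scores exactly -4, so the bound is met with equality for that pair.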

The heuristic-based algorithm
• Input: s = (s_1, …, s_k) and a lower bound L
• Output: the value of an optimal alignment
• For all x, y, 1 ≤ x < y ≤ k: compute c_xy
• For all x, y, 1 ≤ x < y ≤ k: L_xy ← L − Σ_{p<q, (p,q)≠(x,y)} sim(s_p, s_q)
• pool ← {0}; a[0] ← 0, where 0 = (0, …, 0)
• While pool is not empty do
  • i ← the lexicographically smallest cell in the pool
  • pool ← pool \ {i}
  • If c_xy[i_x, i_y] ≥ L_xy for all x, y, 1 ≤ x < y ≤ k, then
    • For all j dependent on i (j = i + b for some non-zero binary vector b) do
      • If j is not in pool then pool ← pool ∪ {j}; a[j] ← a[i] + SP-score(Column(s, j, j − i))
      • else a[j] ← max(a[j], a[i] + SP-score(Column(s, j, j − i)))
• Return a[n_1, …, n_k]
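A compact way to try the algorithm is to keep the pool in a heap of index tuples, since Python compares tuples lexicographically. This is a sketch under assumptions: the scoring values (match +1, mismatch -1, space -2), the function names, and the use of a dictionary `a` to record which cells have ever entered the pool are all choices made here, not taken from the slides. If L is not a valid lower bound, the final cell may never be reached and the function returns `None`.

```python
import heapq
from itertools import combinations, product

GAP = "-"

def s(p, q, match=1, mismatch=-1, gap=-2):
    """Hypothetical symbol score with s('-', '-') = 0."""
    if p == GAP and q == GAP:
        return 0
    if p == GAP or q == GAP:
        return gap
    return match if p == q else mismatch

def sp_score(column):
    return sum(s(p, q) for p, q in combinations(column, 2))

def prefix_scores(x, y):
    """a[i][j] = sim(x[:i], y[:j])."""
    a = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            options = []
            if i > 0:
                options.append(a[i - 1][j] + s(x[i - 1], GAP))
            if j > 0:
                options.append(a[i][j - 1] + s(GAP, y[j - 1]))
            if i > 0 and j > 0:
                options.append(a[i - 1][j - 1] + s(x[i - 1], y[j - 1]))
            if options:
                a[i][j] = max(options)
    return a

def cut_scores(x, y):
    """c[i][j] = best score of a pairwise alignment cutting at (i, j)."""
    n, m = len(x), len(y)
    a, rev = prefix_scores(x, y), prefix_scores(x[::-1], y[::-1])
    return [[a[i][j] + rev[n - i][m - j] for j in range(m + 1)]
            for i in range(n + 1)]

def pruned_msa_score(seqs, L):
    k = len(seqs)
    pairs = list(combinations(range(k), 2))
    c = {(x, y): cut_scores(seqs[x], seqs[y]) for x, y in pairs}
    sims = {p: c[p][0][0] for p in pairs}      # c_xy[0][0] = sim(s_x, s_y)
    total = sum(sims.values())
    bound = {p: L - (total - sims[p]) for p in pairs}   # L_xy
    goal = tuple(len(t) for t in seqs)
    steps = [b for b in product((0, 1), repeat=k) if any(b)]
    a = {(0,) * k: 0}
    pool = [(0,) * k]
    while pool:
        i = heapq.heappop(pool)                # lexicographically smallest
        if any(c[x, y][i[x]][i[y]] < bound[x, y] for x, y in pairs):
            continue                           # cell i is not relevant
        for b in steps:
            j = tuple(p + q for p, q in zip(i, b))
            if any(j[x] > goal[x] for x in range(k)):
                continue
            column = [seqs[x][j[x] - 1] if b[x] else GAP for x in range(k)]
            cand = a[i] + sp_score(column)
            if j not in a:                     # first time j enters the pool
                a[j] = cand
                heapq.heappush(pool, j)
            elif cand > a[j]:
                a[j] = cand
    return a.get(goal)

print(pruned_msa_score(["AT", "AT", "AT"], 6))
```

Passing `float("-inf")` as L disables pruning, so the function then fills the whole table; a tighter valid bound visits fewer cells but returns the same value.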

Star alignment
• Let s_1, …, s_k be k sequences that we want to align.
• Pick one of the sequences, s_c, as the center.
• For each index i ≠ c, find an optimal alignment between s_i and s_c.
• Aggregate these alignments using the "once a gap, always a gap" principle.
  • Start with one pairwise alignment and keep adding the alignment with respect to another string, using s_c as a guide and adding gaps when necessary.
• One way to select s_c is to try all possibilities and pick the one that results in the best score.
• Another way is to compute all optimal pairwise alignments and select as the center the string that maximizes Σ_{i≠c} sim(s_i, s_c).
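The steps above can be sketched as follows. The scoring values (match +1, mismatch -1, space -2), the tie-breaking in the traceback, and the names `align` and `star_align` are assumptions made for illustration. The center is chosen by the second rule (maximum sum of pairwise similarities), and the merge walks the current center row and the newly aligned center row side by side, inserting spaces wherever only one of them has a gap.

```python
GAP = "-"
GAP_SCORE = -2  # assumed score for aligning a symbol with a space

def s(p, q, match=1, mismatch=-1):
    return match if p == q else mismatch

def align(x, y):
    """Optimal global alignment; returns (score, gapped_x, gapped_y)."""
    n, m = len(x), len(y)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        a[i][0] = a[i - 1][0] + GAP_SCORE
    for j in range(1, m + 1):
        a[0][j] = a[0][j - 1] + GAP_SCORE
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a[i][j] = max(a[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          a[i - 1][j] + GAP_SCORE, a[i][j - 1] + GAP_SCORE)
    gx, gy, i, j = [], [], n, m
    while i > 0 or j > 0:                      # traceback from the corner
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            gx.append(x[i - 1]); gy.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + GAP_SCORE:
            gx.append(x[i - 1]); gy.append(GAP); i -= 1
        else:
            gx.append(GAP); gy.append(y[j - 1]); j -= 1
    return a[n][m], "".join(reversed(gx)), "".join(reversed(gy))

def star_align(seqs):
    """Return (index of center, list of gapped rows in input order)."""
    k = len(seqs)
    pair = {(i, j): align(seqs[i], seqs[j])
            for i in range(k) for j in range(k) if i != j}
    c = max(range(k), key=lambda i: sum(pair[i, j][0] for j in range(k) if j != i))
    rows = {c: seqs[c]}
    for j in range(k):
        if j == c:
            continue
        _, gc, gj = pair[c, j]                 # gapped center, gapped s_j
        center, out, new = rows[c], {r: [] for r in rows}, []
        p = q = 0
        while p < len(center) or q < len(gc):
            old_gap = p < len(center) and center[p] == GAP
            new_gap = q < len(gc) and gc[q] == GAP
            if old_gap and not new_gap:        # keep an existing gap column
                for r in rows:
                    out[r].append(rows[r][p])
                new.append(GAP); p += 1
            elif new_gap and not old_gap:      # open a new gap column
                for r in rows:
                    out[r].append(GAP)
                new.append(gj[q]); q += 1
            else:                              # both agree: copy the column
                for r in rows:
                    out[r].append(rows[r][p])
                new.append(gj[q]); p += 1; q += 1
        rows = {r: "".join(v) for r, v in out.items()}
        rows[j] = "".join(new)
    return c, [rows[i] for i in range(k)]

center_index, aligned = star_align(["ATTGCC", "ATGC", "ATTGC", "AGGCC"])
print(center_index)
print("\n".join(aligned))
```

Removing the spaces from each output row gives back the input sequences, and every column contains at least one symbol, as the definition of a multiple alignment requires.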

Tree alignment
• Motivation: sometimes we have an evolutionary tree for the sequences involved.
  • In that case we can compute the overall similarity based on pairwise alignments along tree edges.
• Input: k sequences and a tree whose leaves are these sequences.
• Goal: find an assignment of sequences to the internal nodes of the tree such that the sum of the similarities between the sequences along the edges is maximized.
• Tree alignment is NP-hard, but approximation algorithms exist.
• Note: star alignment can be viewed as a special case of tree alignment.
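The objective can be made concrete by scoring one given assignment. The sketch below does not solve tree alignment; it only evaluates the sum of pairwise similarities along the edges for a fixed labeling, under assumed scoring values (match +1, mismatch -1, space -2). The node names and the restriction of candidate labels to the leaf sequences are illustrative assumptions. The example tree is a star with one internal node, which mirrors the remark about star alignment.

```python
GAP_SCORE = -2  # assumed score for aligning a symbol with a space

def s(p, q, match=1, mismatch=-1):
    return match if p == q else mismatch

def sim(x, y):
    """Optimal global pairwise alignment score, computed row by row."""
    prev = [j * GAP_SCORE for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * GAP_SCORE]
        for j in range(1, len(y) + 1):
            cur.append(max(prev[j - 1] + s(x[i - 1], y[j - 1]),
                           prev[j] + GAP_SCORE,
                           cur[j - 1] + GAP_SCORE))
        prev = cur
    return prev[-1]

def tree_score(edges, label):
    """Sum of sim over all tree edges for a full node labeling."""
    return sum(sim(label[u], label[v]) for u, v in edges)

# Hypothetical star-shaped tree: one internal node "r", three leaves.
leaves = {"l1": "AT", "l2": "A", "l3": "T"}
edges = [("r", "l1"), ("r", "l2"), ("r", "l3")]

# Try each leaf sequence as the label of the internal node.
scores = {cand: tree_score(edges, {**leaves, "r": cand})
          for cand in leaves.values()}
best_label = max(scores, key=scores.get)
print(scores, best_label)
```

For trees with several internal nodes the labels interact, which is where the hardness of the general problem comes from; this evaluator works unchanged for any edge list.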
