Multiple Global Alignment and Phylogenetic tree

Multiple Global Alignment and Phylogenetic tree

Outline • Multiple sequence alignment—MSA • Motivation • The sum of pairs method (SP) • Phylogenetic tree • Clustering • Neighbour joining • Clustalw

What is a Multiple Sequence Alignment • MSA is the alignment of more than two sequences VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLTCLVKGFYPSD--IAVEWWSNG— * * An example of MSA alignment

Dynamic Programming in 3D QUESTION: Which alignment would be generated For DQLF, DNVQ, QGL?

Dynamic Programming in 3D D--Q-LF DNVQ--- ---QGL-

si1 - si1si1 --si1 sj2 sj2 -sj2 - sj2 - sk3 sk3 sk3 - sk3 - - How many cases do we need to consider? • In standard dynamic programming we considered 3 cases, namely match/mismatch, insert, and delete • For three sequences s1, s2, s3 there are 7 possibilities: • For m sequences there are 2m -1 possibilities QUESTION: Why is it “2”?

Complexity • For m sequences each of length n the matrix has nm cells and for each we must check 2m -1 possibilities: That’s prohibitive! • Solution: Use pruning techniques (cut-offs) and heuristics to guide the search for the best solution

A little excursion to Romania:A* Search Further reading Russel/Norvig, Artificial Intelligence, Chapter 4. Prentice-Hall

Problem: Find the shortest path from Arad to Bucharest Neamt Oradea 71 Zerind 87 Iasi 75 151 92 Arad Optimal route is (140+80+97+101) = 418 miles 140 140 Vaslui 118 99 Sibiu Sibiu Faragas Timisoara 142 80 80 111 211 Lugoj Rimnicu Rimnicu 98 Urziceni Hirsova 70 86 97 97 Pitesti Pitesti Mehadia 146 101 101 Bucharest 86 75 138 Dobreta 90 120 Craiova Eforie Giurgui

Straight Line Distances to Bucharest

Greedy search Go to neighboring city v, which minimizes distance Fv to goal Oradea Zerind Arad Sibiu Faragas Timisoara Lugoj Rimnicu Urziceni Hirsova Pitesti Mehadia Bucharest Dobreta Craiova Eforie Giurgui

Greedy search Go to neighboring city v, which minimizes distance Fv to goal Oradea Zerind QUESTION: Any problems? Why is it called “greedy” search? Arad Sibiu Faragas Timisoara Lugoj Rimnicu Urziceni Hirsova Pitesti Mehadia Bucharest Dobreta Craiova Eforie Giurgui

Problems of greedy search • Not optimal • Greedy search from Arad to Bucharestvia Fagaras, optimum via Rimnicu • Problem: Greedy algorithm does not include distance already covered • A*: • Pursue best node first with scoring function of distance so far plus under estimate to goal (e.g. shortest line distance) • v is a node • SvBest score to go from start to node v • FvEstimate for going from v to goal • Tv = Sv + Fv Total score • Organize nodes to be visited sorted by total score(TODO list in next slides)

A* search of the Romanian map featured in the previous slide. Note: Nodes are labelled with Tv = Sv+ Fv. However,we will be using the abbreviations T, S and F to make the notation simpler Zerind Oradea T= 0 + 366 T= 366 Arad Fagaras Sibiu Bucharest(2) Rimnicu Timisoara Bucharest Bucharest Bucharest Pitesti Craiova

We begin with the initial state of Arad. The cost of reaching Arad from Arad (or S value) is 0 miles. The straight line distance from Arad to Bucharest (or F value) is 366 miles. This gives us a total value of ( T = S + F )366 miles. Expand the initial state of Arad. DONE = [] TODO = [Arad/366] Zerind Oradea T= 0 + 366 T= 366 Arad Fagaras Sibiu Bucharest(2) Rimnicu Timisoara Bucharest Bucharest Bucharest Pitesti Craiova

Once Arad is expanded we look for the node with the lowest cost. Sibiu has the lowest value for T. (The cost to reach Sibiu from Arad is 140 miles, and the straight line distance from Sibiu to the goal state is 253 miles. This gives a total of 393 miles). DONE = [Arad] TODO = [Sibiu/393, Timisoara/447, Zerind/449] Zerind Oradea T= 75 + 374 T= 449 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 140 + 253 T= 393 Bucharest(2) T= 118 + 329 T= 447 Rimnicu Timisoara Bucharest Pitesti Craiova

We now expand Sibiu (that is, we expand the node with the lowest value of T). DONE = [Arad, Sibiu] TODO = [Rimnicu/413, Fagaras/417, Timisoara/447, Zerind/449, Oradea/671] Zerind Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 Rimnicu Timisoara Bucharest Pitesti Craiova

We now expand Rimnicu (that is, we expand the node with the lowest value of T ). DONE = [Arad, Sibiu] TODO = [Rimnicu/413, Fagaras/417, Timisoara/447, Zerind/449, Oradea/671] Zerind Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 Rimnicu Timisoara Bucharest Pitesti Craiova

Once Rimnicu is expanded we look for the node with the lowest cost. As you can see, Pitesti has the lowest value for T. (The cost to reach Pitesti from Arad is 317 miles, and the straight line distance from Pitesti to the goal state is 98 miles. This gives a total of 415 miles DONE = [Arad, Sibiu, Rimnicu] TODO = [Pitesti/415, Fagaras/417, Timisoara/447, Zerind/449, Craiova/526, Oradea/671] Zerind Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 Rimnicu Timisoara Bucharest Pitesti T= 317 + 98 T= 415 Craiova T= 366 + 160 T= 526

We now expand Pitesti (that is, we expand the node with the lowest value of T). DONE = [Arad, Sibiu, Rimnicu, Pitesti] TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671] Zerind Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 Rimnicu Timisoara Bucharest T= 418 + 0 T= 418 Pitesti T= 317 + 98 T= 415 Craiova T= 366 + 160 T= 526

In actual fact, the algorithm will not really recognise that we have found Bucharest. It just keeps expanding the lowest cost nodes (based on T ) until it finds a goal state AND it has the lowest value of T. So, we must now move to Fagaras and expand it. DONE = [Arad, Sibiu, Rimnicu, Pitesti] TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671] Zerind Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 Rimnicu Timisoara Bucharest T= 418 + 0 T= 418 Pitesti T= 317 + 98 T= 415 Craiova T= 366 + 160 T= 526

We have just expanded a node (Pitesti) that revealed Bucharest, but it has a cost of 418. If there is any other lower cost node (and in this case there is one cheaper node, Fagaras, with a cost of 417) then we need to expand it in case it leads to a better solution to Bucharest than the 418 solution we have already found. DONE = [Arad, Sibiu, Rimnicu, Pitesti] TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671] Zerind Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 Rimnicu Timisoara Bucharest T= 418 + 0 T= 418 Pitesti T= 317 + 98 T= 415 Craiova T= 366 + 160 T= 526

We now expand Fagaras (that is, we expand the node with the lowest value of T ). DONE = [Arad, Sibiu, Rimnicu, Pitesti] TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671] Zerind Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 T= 450 + 0 T= 450 Rimnicu Timisoara Bucharest T= 418 + 0 T= 418 Pitesti T= 317 + 98 T= 415 Craiova T= 366 + 160 T= 526

Once Fagaras is expanded we look for the lowest cost node. As you can see, we now have two Bucharest nodes. One of these nodes ( Arad – Sibiu – Rimnicu – Pitesti – Bucharest ) has an T value of 418. The other node (Arad – Sibiu – Fagaras – Bucharest(2) ) has an T value of 450. We therefore move to the first Bucharest node and expand it. DONE = [Arad, Sibiu, Rimnicu, Pitesti, Fagaras] TODO = [Bucharest/418, Timisoara/447, Zerind/449, Bucharest/450, Craiova/526, Oradea/671] Zerind Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 T= 450 + 0 T= 450 Rimnicu Timisoara Bucharest T= 418 + 0 T= 418 Pitesti T= 317 + 98 T= 415 Craiova T= 366 + 160 T= 526

We have now arrived at Bucharest. As this is the lowest cost node AND the goal state we can terminate the search. If you look back over the slides you will see that the solution returned by the A* search pattern ( Arad – Sibiu – Rimnicu – Pitesti – Bucharest ), is in fact the optimal solution. DONE = [Arad, Sibiu, Rimnicu, Pitesti, Fagaras] TODO = [Bucharest/418, Timisoara/447, Zerind/449, Bucharest/450, Craiova/526, Oradea/671] Zerind Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 T= 450 + 0 T= 450 Rimnicu Timisoara Bucharest Bucharest Bucharest Bucharest T= 418 + 0 T= 418 Pitesti T= 317 + 98 T= 415 Craiova T= 366 + 160 T= 526

Additional optimization • Let‘s assume we have an (over)-estimate K for the best solution, i.e. the optimal solution will be better than K • Do not consider any node with total score Tv worse than K • If Tv > K then remove v from TODO list

Additional optimization Zerind • Assume K = 430, then we can remove nodes Zerind, Oradea, Timisoara, Craiova Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 T= 450 + 0 T= 450 Rimnicu Timisoara Bucharest Bucharest Bucharest Bucharest T= 418 + 0 T= 418 Pitesti QUESTION: What if K is equal to optimum? What if K is poorely chosen? What if rule is “If Tv >= K then remove v“? Problem? T= 317 + 98 T= 415 Craiova T= 366 + 160 T= 526

F must be under-estimate • For algorithm to work F must be an under-estimate • Example: Direct distance is always shorter than road QUESTION: What happens if F is not under-estimate?

F must be under-estimate • For algorithm to work F must be an under-estimate • Example: Direct distance is always shorter than road • Then it cannot be guaranteed that optimal solution is found • E.g. FRiminicu = 10.000 in example for Riminicu? • Then TRiminicu = 10.220 > K = 450, so Riminicu would be removed, and optimal solution would not be found QUESTION: What happens if F is not under-estimate?

From Romania to Dresden • So, what does that mean for multiple sequence alignment? QUESTIONS: What does a node (city) correspond to? What does an edge between nodes correspond to? What does the cost between two nodes correspond to? How could we define S? How could we define F? How could we define K?

The Sum of Pairs Method • As in the pairwise case, not all MSA’s are equally good. We need a scoring method to determine when one MSA is better than another one • The Sum of Pairs (SP) method: • For each column in the alignment, sum up the score of each pair of residues. • M: a MSA of the sequences of (s1, s2, ...sm) • s’iis the projection of si, i.e. the sequence siwith gaps • S(s’i,s’j): the score of the projections • The final score is

An Example of Using the SP Method • Example s1 = AVP s’1: A-VP- s2 = AVT s’2: A-V-T s3 = PSVPT s’3: PSVPT • Scores: • Match = 1 • Mismatch, insertion, deletion = -1 • S(-, -) = 0 to prevent the double counting of gaps. QUESTION: What is the score of the alignment?

An Example of Using the SP Method • Example s1 = AVP s’1: A-VP- s2 = AVT s’2: A-V-T s3 = PSVPT s’3: PSVPT • Scores: • Match = 1 • Mismatch, insertion, deletion = -1 • S(-, -) = 0 to prevent the double counting of gaps. • Then the SP score is S(s’1,s’2) + S(s’1,s’3) + S(s’2, s’3) = 0 + (-1) + (-1) = -2

1 MSA vs. n SA • What is the difference between making one multiple sequence alignment to making many pairwise sequence comparisons? • The score S(s’i,s’j) for the alignment s’i,s’j in a multiple sequence alignment is less than score S(si,sj) for aligning si,sj directly • S(s’i,s’j) <= S(si,sj)

Computing all cells in the dynamic programming solution is expensive, therefore we want to avoid computing as many cells as possible Can we rule out any cells? Let us assume that we know already that there is a known alignment of score K Let v = (i1,i2,….im) be a cell of the DP matrix for which want to determine whether we need to consider it (and its neighbours) or not Let Svbe the score of the best path from the start cell to cell v Let FVbe an upper bound for the highest-scoring alignment from v to the end of DP matrix, i.e. we can only find a path from v to the end which is less than FV Then we know the following: If Sv+ Fv< K, then v cannot lie on the path of the best alignment SV v <=Fv Pruning the search space

Dynamic Pruning with Forward Recursion • D(v,w) is the score to be added when moving from v to its forward (east, southeast, south) neighbor w. • I.e. the overall score Sv+D(v,w) is sent to w. • The value of Sw is the maximum of all values sent to w from its backward (west, north, northwest) neighbor cells. sj2 SV - g v SV - g SV + R(si1,sj2) si1 From cell v values are sent to all its neighbor cells

One more thing: A queue • We need a data structure before we list the algorithm • A queue is a list of elements with two special operators • Push: to add an element at the end of the queue • Pop: to remove an element from the top of a queue

proc F(v, hN) a procedure which finds an upper bound of the score of the alignment from a cell v to the end-cell hN begin v = h0; P(v) = 0; push(v,Q) push start cell on queue while Q is not empty do pop(v,Q); S(v) = P(v) v has got all values from its neighbours IfS(v) + F(v, hN) >= K then for all forward neighbours w of vdo ifw doesn’t belong to Q then push (w,Q); P(w) = S(v) + D(v,w) else P(w) = max( P(w), S(v)+D(v,w) ) end for end while end const h0 the start cell of the DP matrix (H0,0…0) hN the end cell of the DP matrix (Hn1,n2…nm) K a lower bound for the score of the whole alignment var u, v,w denote cells S(u) the best score of an alignment from h0 to u P(u) the score of the best alignment from h0 to u found so far D(u, v) the score for extending the alignment from cell u to cell v Q a queue of the cells u for which a value for P(u) is found but u is not visited yet Algorithm: Forward-recursion with pruning

Finding upper limits for scores • For any multiple sequence alignment M of sequences {s1,s2,….sm} we know that • the score for the multiple sequence alignment S(M) is less then the sum of pairwise comparisons of the sequences {s1,s2,….sm} • The procedure F should find an upper bound for the alignment of the subsequences s1i1+1…n1 , s2i2+1…n2 , ….. smim+1…nm • This can be done as follows: (4.6)

Questions QUESTION: What is the score of the multiple sequence alignmentwhen the algorithm is done? QUESTION: How can we get alignment from algorithm?

Answers • The score for the multiple sequence alignment is S(hN) • How can we get an alignment from the algorithm? • We need another variable Dir to store the direction from which we were coming • Let‘s assume we are at node v and its neighbour w is not pruned • If w is new in queue then Dir(w)={v} • If w is already in queue and S(v)+D(v,w)>P(w) then • P(w) = S(v)+D(v,w) and Dir(w) = {v} • If w is already in queue and S(v)+D(v,w)=P(w) then • Add v to Dir(w)

proc F(v, hN) a procedure which finds an upper bound of the score of the alignment from a cell v to the end-cell hN begin v = h0; P(v) = 0; push(v,Q) push start cell on queue while Q is not empty do pop(v,Q); S(v) = P(v) v has got all values from its neighbours IfS(v) + F(v, hN) >= K then for all forward neighbours w of vdo ifw doesn’t belong to Q then push (w,Q); P(w) = S(v) + D(v,w) Dir(w) = {v} else if S(v)+D(v,w) > P(w) then P(w) = S(v)+D(v,w) Dir(w) = {v} else if S(v)+D(v,w) = P(w) then Add v to Dir(w) end for end while end const h0 the start cell of the DP matrix (H0,0…0) hN the end cell of the DP matrix (Hn1,n2…nm) K a lower bound for the score of the whole alignment var u, v,w denote cells S(u) the best score of an alignment from h0 to u P(u) the score of the best alignment from h0 to u found so far D(u, v) the score for extending the alignment from cell u to cell v Q a queue of the cells u for which a value for P(u) is found but u is not visited yet Dir(w) stores nodes v from which best scores were obtained Algorithm: Forward-recursion with pruning

Printing the alignment: printMSA(hN,0) • printMSA is recursive function, which takes a node v and a position k in the alignment to be generated as input • B is a matrix, which contains the aligment • printMSA(v,k): • If v = h0then print B • Else • Let i1,…,im be the indices of v • For all u in Dir(v) do • Let i‘1,…,i‘m be the indices of w • For j from 0 to m-1 do • If ij = i‘j then Bk,j = „-“ • Else Bk,j = sequence j at position ij • printMSA(u,k+1)

Questions QUESTION: Why is Dir a set and not a single node? QUESTION: Does printMSA print one multiple sequence alignmentor all possible ones?

Example Let’s align DQLF, DNVQ, QGL with match = 3 and insertion, deletion, mismatch = -1 <0,0,0> <3,3,2>

Example • We need a lower bound for the overall result.Let’s assume we have got already the following alignment • What is K, the sum of pairs for this alignment? DQ-LF DNVQ- -QGL-

Example • We need a lower bound for the overall result.Let’s assume we have got already the following alignment • K = -1 -4 + 3 = -2 DQ-LF DNVQ- -QGL-

Example • Upper bound for the score from <0,0,0> to <3,3,2> (match = 3 and insertion, deletion, mismatch = -1) • F( <0,0,0>, <3,3,2> ) = +2 +3 -2 = +3 D--QLF DQ-LF DNVQ-- DNVQ-- -QGL- ---QGL +2 +3 -2

begin v = h0; P(v) = 0; push(v,Q) while Q is not empty do pop(v,Q); S(v) = P(v) IfS(v) + F(v, hN) >= K then for all forward neighbours w of vdo ifw doesn’t belong to Q then push (w,Q); P(w) = S(v) + D(v,w) else P(w) = max( P(w), S(v)+D(v,w) ) end end end end end Q: <0,0,0> P( <0,0,0> ) = 0 S( <0,0,0> ) = 0 Example

begin v = h0; P(v) = 0; push(v,Q) while Q is not empty do pop(v,Q); S(v) = P(v) IfS(v) + F(v, hN) >= K then for all forward neighbours w of vdo ifw doesn’t belong to Q then push (w,Q); P(w) = S(v) + D(v,w) else P(w) = max( P(w), S(v)+D(v,w) ) end end end end end S( <0,0,0> ) + F( <0,0,0>, <3,3,2>) = 0+3 >= -2 Q: <0,0,1>, <0,1,0>, <0,1,1>, … , <1,1,1> P( <0,0,1> ) = 0 + -2 --Q P( <0,1,0> ) = 0 + -2 -D- P( <0,1,1> ) = 0 + -3 -DQ P( <1,0,0> ) = 0 + -2 D-- P( <1,0,1> ) = 0 + -3 D-Q P( <1,1,0> ) = 0 + 1 DD- P( <1,1,1> ) = 0 + 1 DDQ Example

Multiple Global Alignment and Phylogenetic tree