Lab 3

Lab 3 Rhys Price JonesAnne Haake Rochester Institute of Technology

What is an alignment? In a previous lab, we studied exact pattern matching – finding an exact occurrence of a target sequence (say AGTAA) within a search sequence (say CGAAGTAACG) A solution identifies an occurrence of the target as in CGAAGTAACG The new problem seeks to align two sequences. As with pattern matching, we look for common occurrences of the same symbol. But we also allow for an occasional mismatch, an insertion, or a deletion. Here are two possible alignments between the sequences CGAAT and CCGATA -CGAATC-GA-AT |||XX| || | CCGATACCGATA- One has three matches, two mismatches and a single indel. “indel” is shorthand for insertion/deletion. This instance has an insertion in the lower sequence or a deletion in the upper sequence. The second alignment has four matches, no mismatches, and three indels. Which alignment is better?

Rating alignments -CGAATC-GA-AT |||XX| || | CCGATACCGATA- Which alignment is better? This is a BIG question. To answer it, we need to specify a way to assign a score to an alignment. The alignment with the higher score will be considered the better. Suppose we assign a score of +2 for a match, -2 for a mismatch and –1 for an indel. Then the first of our alignments scores +6 for three matches, -4 for two mismatches and –1 for one indel. For a total of 1 point. Before you move on to the next slide, see if you can figure the score of the second of our two alignments using this scoring scheme.

Rating alignments (2) -CGAATC-GA-AT |||XX| || | CCGATACCGATA- Which alignment is better? This is a BIG question. To answer it, we need to specify a way to assign a score to an alignment. The alignment with the higher score will be considered the better. Suppose we assign a score of +2 for a match, -2 for a mismatch and –1 for an indel. Then the first of our alignments scores +6 for three matches, -4 for two mismatches and –1 for one indel. For a total of 1 point. The second of our two alignments scores +8 for four matches and –3 for three indels for a total of 5 points. So the second alignment is better than the first. Is it really?

Rating alignments (3) -CGAATC-GA-AT |||XX| || | CCGATACCGATA- Assigning a score of +2 for a match, -2 for a mismatch and –1 for an indel, we saw that our first alignment scores 1 and the second scores 5, suggesting the second is better. But suppose our scoring system gives +4 for a match, -1 for a mismatch, and –4 for an indel. Can you figure out the scores for the two alignments before you move on to the next slide?

Rating alignments (4) -CGAATC-GA-AT |||XX| || | CCGATACCGATA- Now the first scores 12 – 2 – 4 = 6, and the second scores 16 - 12 = 4. This makes the first alignment better. So one rating scheme rated the second alignment better than the first. But another rating scheme says the first alignment beats the second. How do we decide on a useful rating scheme?

Rating the rating schemes Let’s think about why we want to align two sequences. Typically it’s because we have sequences from different organisms and we want to measure the degree of homology between the two. Biological conditions must prevail. We want the rating scheme for alignments to reflect the biological likelihoods of occurrence of single point mutations (mismatches) and other mutations leading to various kinds of indel sequence. We will return to the biological considerations later in the lab. For now, let us adopt a scheme that gives 2 points for a match, 0 points for a mismatch, and –1 points for an indel. Before moving on to the next slide, cast your mind back to what you remember about exact pattern matching algorithms. Do you think they have much relevance now that we are turning our attention to alignments?

Exact Pattern Matching Recall that a program to search for occurrences of AGGT within any other sequence might start by constructing a “virtual machine”. C G C T A G G T C G T C T A A A Run through the next few slides quickly to see how this virtual machine would handle search sequence AGCAGGTA

Demo 1 C G C T A G G T C G T C T A A A A GCAGGTA

Demo 2 C G C T A G G T C G T C T A A A AG CAGGTA

Demo 3 C G C T A G G T C G T C T A A A AGC AGGTA

Demo 4 C G C T A G G T C G T C T A A A AGCA GGTA

Demo 5 C G C T A G G T C G T C T A A A AGCAG GTA

Demo 6 C G C T A G G T C G T C T A A A AGCAGG TA

Demo 7 C G C T A G G T C G T C T A A A AGCAGGT A

Demo 8 C G C T A G G T C G T C T A A A AGCAGGTA Nowhere to go. Stop and report “found”.

Picture->Program The virtual machine easily translates into a program. Each state is a method. C G C T A G G T q4 q0 q1 q2 q3 void q4(){ this.accept(); } C G T C T A A A void q0(){ switch (read()) { A: this.q1(); C,G,T: this.q0(); } } void q1(){ switch (read()) { A: this.q1(); C,T: this.q0(); G: this.q2(); } } void q3(){ switch (read()) { A: this.q1(); C,G: this.q0(); T: this.q4(); } } void q2(){ switch (read()) { A: this.q1(); C,T: this.q0(); G: this.q3(); } }

VM for Alignment?? Can we use a similar technique for building a virtual machine to investigate alignment? C G -6 C T -4 A2 G2 G2 T2 q4 q0 q1 q2 q3 C G T 0 C T -2 A 0 A -2 A -4 Step 1: We have added a score to each transition edge, so that if a perfect match is found, it will score +8

VM for Alignment 2 Step 2 A C G -1 C G -6 C T -4 A2 G2 G2 T2 q4 q0 q1 q2 q3 C G T 0 C T -2 A 0 A -2 A -4 C G T -1 A C T -1 A C T -1 Step 2: Allow for mismatches – with appropriate scores – Let’s look at this resulting machine for a moment:

VM for Alignment 3 Do you notice something that makes this machine difficult to translate into a program? A C G -1 C G -6 C T -4 A2 G2 G2 T2 q4 q0 q1 q2 q3 A C T -1 C G T 0 C T -2 A 0 A -2 A -4 A C T -1 C G T -1 What makes it hard to write a program equivalent to this machine? Think it through before you move on to the next slide

VM for Alignment 4 The machine is NONDETERMINISTIC. Given a current state and a current symbol, it is no longer the case that there is only one possible destination. For example, the code for the method q0 would have to be something like: A C G -1 C G -6 C T -4 A2 G2 G2 T2 q4 q0 q1 q2 q3 A C T -1 void q0() { switch (read()) { A: {score 2; q1();} C,G,T: EITHER {score 0; q0();} OR {score –1; q1();} } } C G T 0 C T -2 A 0 A -2 A -4 A C T -1 C G T -1 Is it possible to implement such programs?

VM for Alignment 5 It is possible to implement such nondeterministic machines as programs. The key is to use backtracking. A simple way to implement backtracking is to write your program so that when it encounters a possible fork, it evaluates all possible branches and selects the best In our present case, the “best” means the branch that leads to the highest scoring alignment. We are now ready for your first exercise!

Exercise 1 a. Implement the derived virtual machine as a backtracking program. Remember the alignment scheme that gives 2 points for a match, 0 points for a mismatch, and –1 points for an indel. With input AGGT you should get the value 8 With input CCTAGGT you should get the value 8 With input CCTAGTT you should get the value 5 With input AGTAGTT you should get the value 5 With input AGTAGGT you should get the value 8 With input AGTAGCAGTAA you should get the value 2 b. Add the arcs and scores necessary so that the virtual machine correctly scores indels. You will find it is “even more nondeterministic”. Implement the resulting highly nondeterministic virtual machine as a program. Demonstrate its correctness. c. If you were attempting to use this technique to align a sequence of n nucleotides with a target sequence, show that the running time of the derived program is exponential in n.

Exercise 2 Do any of the techniques you know for exact pattern matching adapt easily for use in finding sequence alignments? Write a paragraph detailing how you’ve thought about adapting each of three pattern-matching algorithms. Include a discussion of the expected running time (for two sequences of length n) of the adapted algorithm. Describe why you think it unlikely that a O(n) pattern matching algorithm could be adapted to give an alignment algorithm with similar running time. (Alternatively, you could achieve fame and fortune by answering this question with just such an adapted algorithm!)

A Better Algorithm We are going to develop a recurrence relation for the score of the best alignment between two sequences a1,a2,..,an and b1,b2,..,bm using the principle of Mathematical Induction. This is a common, useful and reliable way to develop algorithms. We will write A(k,l) to denote the best alignment score between the prefix sequences a1,a2,..,ak and b1,b2,..,bl . So, assume that we know A(k,l) for every pair k,l that precedes pair i,j. Note that k,l precedes i,j if either (k<i and lj) or (l<j and ki). We will proceed to derive an expression for A(i,j). What does the best alignment for a1,a2,..,ai and b1,b2,..,bj look like? There are three possible scenarios: Alignment for a1,a2,..,ai-1 and b1,b2,..,bj-1 followed by ai matched or mismatched with bj Alignment for a1,a2,..,ai and b1,b2,..,bj-1 followed by a gap with bj Alignment for a1,a2,..,ai-1 and b1,b2,..,bj followed by a gap with ai In the first case, we would get a value of A(i-1,j-1)+match/mismatch depending on whether or not ai and bj match. In the second case, we would get A(i,j-1) + gap score. In the third case, we would get A(I-1,j) + gap score. Since we want the highest possible score, this leads to the recurrence A(i,j) = Max{A(i-1,j-1)+s(ai, bj), A(i,j-1)+g, A(i-1,j)+g} where s(ai, bj) is the matching score between the two symbols and g is the gap score.

gives a better program The recurrence A(i,j) = Max{A(i-1,j-1)+s(ai, bj), A(i,j-1)+g, A(i-1,j)+g} suggests the program portion: (define A (lambda (as bs) ..... ; termination conditions needed here (max (+ (A (cdr as) (cdr bs)) (s (car as) (car bs))) (+ (A as (cdr bs)) g) (+ (A (cdr as) bs) g)))) but, of course, we need base cases. Because of the double induction, we need two sets of base cases. Base cases in inductive proofs correspond to termination conditions in related recursive programs. What are the base cases? You’ll need to know this in order to complete Exercise 3. Think about it now. Hint: You need to know the best alignment between an empty sequence and a potentially nonempty sequence. How many matches/mismatches can there be? OK, so the rest are all indels (gaps).

Exercise 3 Using (define A (lambda (as bs) (cond ((null? as) ???) ((null? bs) ???) (else (max (+ (A (cdr as) (cdr bs)) (s (car as) (car bs))) (+ (A as (cdr bs)) g) (+ (A (cdr as) bs) g)))) as your guide, write a program for figuring the best alignment score between two sequences. You need to figure out the base cases to resolve the ???s You also need to write the match-value function s and define a value for g. Remember that we adopted the scheme that gives 2 points for a match, 0 points for a mismatch, and –1 points for an indel. This means that g is –1 and s is a function of two arguments; if they are the same, s returns 2, otherwise s returns 0. Test your program. BUT DO NOT test on any sequences of length exceeding 8!!!

Discussion of Inefficiency There is undoubtedly a striking elegance about programs derived, as was our alignment program above, from an inductive proof of its essential recurrence. Unfortunately, such elegantly derived programs are often inefficient. We’ll demonstrate this by looking at a simpler but equally elegant and similarly-derived program to compute the n’th Fibonacci number. The Fibonacci sequence, as you know, begins 1 1 2 3 5 8 13 21 ... Each term is the sum of its two immediate predecessors. And the base cases are that the 0’th and 1’th Fibonacci numbers are both 1. This paragraph gives an inductive definition of the Fibonacci numbers. The inductive definition immediately suggests a recursive program: (define fib (lambda (n) (if (<= n 1) 1 (+ (fib (- n 2)) (fib (- n 1)))))) Take a moment to study this, and note how it is an almost exact translation of the inductive definition into a programming language. So close is the translation that a proof of the correctness of our program would be superfluous.

Discussion of Inefficiency (2) Our program (define fib (lambda (n) (if (<= n 1) 1 (+ (fib (- n 2)) (fib (- n 1)))))) was derived from an inductive definition of the Fibonacci numbers. As such, it is elegant and clearly correct. Unfortunately, the table below indicates how the running time for (fib n) starts to grow on my computer once n gets to be up to 25: n 25 26 27 28 29 30 31 ms to compute (fib n) 110 170 280 390 710 1100 1810 Can you make a reasonable guess as to the growth rate of the running time for (fib n)? How long, at this rate, would it take to compute (fib 100)? Take a moment with pencil and paper to figure this out before you move on to the next slide where you will see my solution.

Discussion of Inefficiency (3) The ratios of successive times is the key to figuring out the relationship. The ratio of time for (fib 26) and (fib 25) is (/ 170 110) or about 1.5.The ratio for times of (fib 27) and fib 26) is (/ 280 170) or about 1.6.Succeeding ratios are about 1.4, 1.8, 1.5 and 1.6. This looks very much like exponential growth to a base close to 1.6 or so. A formula that will approximate tn – the running time for (fib n) is: tn = 110 * 1.6(t-25) If this formula is correct (and it is actually somewhat of an underestimate), then the calculation of (fib 100) with this program will take 110 * 1.675 milliseconds. This is about 224 073 957 396 794 460 milliseconds, which amount to about 7 100 475 years. Yes, more than seven million years! Just to calculate (fib 100)! I could do it quicker with paper and pencil!

Discussion of Inefficiency (4) To figure out why the fib program takes so long, consider the following “talking” version of fib: (define talkfib (lambda (n) (printf "I am computing (fib ~s)~%" n) (if (<= n 1) 1 (+ (talkfib (- n 2)) (talkfib (- n 1)))))) Let’s run it to calculate (fib 5): > (talkfib 5) I am computing (fib 5) I am computing (fib 3) I am computing (fib 1) I am computing (fib 2) I am computing (fib 0) I am computing (fib 1) I am computing (fib 4) I am computing (fib 2) I am computing (fib 0) I am computing (fib 1) I am computing (fib 3) I am computing (fib 1) I am computing (fib 2) I am computing (fib 0) I am computing (fib 1) 8 See how often (fib 0) is computed?

Discussion of Inefficiency (5) Let’s look at the tree of recursive calls: 5 4 3 3 2 2 1 2 1 1 0 1 0 See how many calls there are to (fib 1)? Imagine the tree for a call to (fib 100)! 1 0

Discussion of Inefficiency (6) That’s the root of the problem. The Fibonacci program we wrote makes exponentially large numbers of identical recursive calls. And it’s exactly the same with our alignment program from a few slides back. It makes many-times-repeated calls with exactly the same values. A solution is:

Dynamic Programming We can change the order in which the program does its computations. Instead of working “top down” from large numbers to smaller, we make it fill in an array of computed values from the “bottom up”. Here’s the dynamicized version of fib: (define fibd (lambda (n) (let ((array (make-vector (1+ n)))) (letrec ((loop (lambda (k) (cond ((<= k 1) (vector-set! array k 1) (loop (1+ k))) ((> k n) 'done) (else (vector-set! array k (+ (vector-ref array (- k 2)) (vector-ref array (- k 1)))) (loop (1+ k))))))) (loop 0) (vector-ref array n))))) OK: Study it. See how we use an array? See how we reverse the order of computation? Are you sure the program is correct?

Dynamic Programming Are you sure the program is correct? Absolutely not!! We have made changes to the program that detract from its elegance. It is no longer clear that the program fibd is equivalent to the inductive definition of the Fibonacci sequence. We’ve lost a lot. But we’ve gained in efficiency. Instead of taking 7 million years to calculate (fib 100), (fibd 100) takes 0 milliseconds (I.e. an unmeasurably small time) on my computer. In fact, (fibd 10000) takes much less than a second, and most of that time is preparing the output. Exercise 4a(do this or 4b): Use this technique to come up with a dynamic programming version of the alignment algorithm you developed for Exercise 3. You will need a two-dimensional array and you will need to loop on both indexes. But you can do it! Soon, we will discuss a way in which you can cause the programming language you are using to automatically create a version of your recursive programming that is almost as efficient as the version you can produce using dynamic programming. Exercise 4b will ask you to do that. You need only complete one of Exercise 4a and Exercise 4b.

Memoize We noted that the process of converting an inefficient recursion using the techniques of dynamic programming gives us an incredible speedup. But that is at the cost of an elegant simplicity of derivation of a clearly-correct program. It is no longer a trivial matter to prove our programs correct. Wouldn’t it be nice if we could automate the process of converting an elegant but inefficient program into an equally elegant but efficient program. One way to do this is to automatically convert functions so that, before embarking on a long computation, they consult a hash table to see if they have already made that computation. If so, they just return the value found in the hash table. Otherwise, they do the computation and store the result in the hash table before returning it. If you look in the file memolambda.ss given to you in lectures, you will see how I managed this by a syntax extension. All I do is (define fibm (memolambda (n) (if (<= n 1) 1 (+ (fibm (- n 2)) (fibm (- n 1)))))) and (fibm 10000) is only about five times slower than (fibd 10000). The explanation for the five times factor is that there is some overhead in hashing arguments in the very general way that our syntax extension manages it. How would you do memoization in other programming languages, such as Java?

Exercise 4b (do this or 4a) Rewrite the recursive algorithm we developed above in Exercise 3, but this time you may use memoization. Remember, you need only do one of Exercises 4a and 4b. Once you have done either Exercise 4a or 4b, you will have programmed the Needleman-Wunsch algorithm for local alignment. Be sure that the scoring system is easily changeable.

Global and Local Alignments So far in this lab, we have developed programs for performing Global Alignment between two sequences. This means that we have looked for the best alignment between the whole of the first sequence X and the whole of the second sequence Y. Perhaps it is more “biologically useful” to find a subsequence of X and a subsequence of Y such that the global alignment score between the two subsequences is as large as possible. We call that a local alignment between X and Y. It takes very little to modify our alignment program to be give the best local alignment between two sequences. As long as your alignment scoring method is such that two empty strings align with score 0 and mismatches and indels both contribute negative scores and matches contribute positive scores, all you need do is ensure that you never insert a negative number into your matrix. After generating the matrix, search it for the largest score. From that, you can deduce the best local alignment within your sequences. Can you see why that works? Try to come up with a single sentence explaining why? If you cannot, then please consult with an instructor.

For example with match 2, mismatch –1, indel –2, a global alignment of ACTACT and GTAC is - A C T A C T - 0 -2 -4 -6 -8 -10 -12 G -2 -1 -3 -5 -7 -9 -11 T -4 -3 -2 -1 -3 -5 -7 A -6 -2 -4 -3 1 -1 -3 C -8 -4 0 -2 -1 3 1 indicating the alignment ACTACT -X|||- -GTAC- with global score 1 but with the same scoring, an optimal local alignment is given by: - A C T A C T - 0 0 0 0 0 0 0 G 0 0 0 0 0 0 0 T 0 0 0 2 0 0 2 A 0 2 0 0 4 2 0 C 0 0 4 2 2 6 4 indicating the alignment acTACt ||| gTAC (lower case letters indicate nucleotides not involved in the alignment)

Exercise 5 Rewrite your program from Exercise 4a or 4b so it finds the best local alignment between the two sequences. If you wish, you may use this program as a guide. (define score (lambda (s1 s2) (cond ((null? s1) (* g (length s2))) ; left column values ((null? s2) (* g (length s1))) ; top row values (else (max 0 (+ (score (cdr s1) (cdr s2)) (s (car s1) (car s2))) (+ (score (cdr s1) s2) g) ; indel from s2 (+ (score s1 (cdr s2)) g)))))) ; indel from s1 You will need to ensure that you never create a negative score (that’s teh essential difference for local alignment vs global). And you’ll need to memoize it (or otherwise convert to dynamic programming). When you have completed this exercise, you will have programmed the Smith-Waterman algorithm for local sequence alignment.

Exercise 6 Analyse your Smith-Waterman program. You may do your analysis either mathematically or empirically. We need to know if your program runs in time O(n), O(n2), O(2n), or ...

Exercise 7 This is a research paper exercise. Find out what you can about the use of dotplots for sequence alignment. Write a five to six page amply illustrated paper about your discoveries. It has been suggested that the use of dotplots is a convincing example of the utility of scientific visualization. Humans are very good at quick pattern recognition, and can make useful deductions very quickly by looking at a dotplot. A computer can very quickly produce a dotplot, but the algorithms for detecting patterns within the dotplot are not particularly fast. So, a “collaboration” between human and computer – with the computer producing the dotplot and the human analysing it offers a way to exploit the strengths of both humans and computers. Comment on the above paragraph. Do you agree/disagree? How do you feel about the possibilities of collaborative problem-solving algorithms that involve both human and computer?

To be Continued • Lab 4 will continue from here. We will look at: • BLOSUM • BLAST

Exercises will include Smith-Waterman with BLOSUM Some investigations of BLAST Comparative analysis of BLAST and Smith-Waterman on some local database Adaptation for nonlinear gap penalties Modifications for multiple alignment

Lab 3

Lab 3

Presentation Transcript

LAB#3

Lab 3

Lab 3

Lab 3

Lab 3

Lab 3

Lab #3

LAB # 3

Lab 3

LAB 3

Lab#3

Lab #3

LAB #3

Lab 3

Lab 3

LAB#3

Lab 3

Lab 3

Lab 3

Lab 3