Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

A simple, practical and complete O(n3/log(n))-time Algorithmfor RNA foldingusing the Four-Russians Speedup Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

Background -Problem of computationally predicting the secondary structure of RNA molecules was first introduced more than thirty years ago by: Nussinov et. Al. (1978, 1980) Waterman and Smith (1978) Zuker M. and Stiegler (1981) They presented Dynamic programming solutions with an asymptotic runtime of O(n3). Where n is the length of the RNA sequence.

There have been several improvements to the running time of this problem: • A complex worse case speed up • O(n3*(loglogn)/(logn) 1/2)– (Akutsu 1999) • Practical heuristic speed up O(nP) where P in [n, n2] (Wexler 2007, Backofen, 2009). This is still O(n3) in terms of n. It is important to note that P is empirically much less then n2. Our approach has an O(n3/log(n)) running time using the FOUR RUSSIANS speedup.

Four Russians is well known • The Four Russians method is a general technique for speeding up Dynamic programs. - The method involves extensive preprocessing of a wide-range of possible inputs, before any actual input is seen. However, it has not been applied to RNA folding, despite an ‘expectation’ that it could be.

Possible reasons for difficulty in applying the FOUR RUSSIANS speedup • Widely exposed version of the original dynamic-programming algorithm does not lend itself to application of the Four-Russians technique. (Solution: choose a different order of evaluation.) • It doesn’t seem possible to separate the preprocessing and the computation. (Solution: interleave the two.) We made use of the insights presented in Graham et. al. (1980) paper which gives a Four-Russians solution to the problem of Context-Free Language recognition, where similar problems were encountered.

Basic RNA-folding problem - Find a maximum number of non-crossing matches of complimentary nucleotides in an RNA sequence of length n. Enhancements: -Richer scoring schemes -Minimum energy calculations DP.

Input to the basic RNA-folding problem consists of a string K of length n over a four-letter alphabet {A,U,C,G} A matching consists of non-crossing disjoint pairs of sites in K. i i’ j j’ i

B(i,j) B(i,j) is the score given to the match between nucleotides in sites i and j of K. In a simple scoring scheme a score of 1 is given for complementary pairs ( A, U or C, G), otherwise 0.

Introduction to the original O(n3) DP Let S(i,j) represent the score for the optimal folding of the subsequence consisting of the sites in K from i to j>i inclusively. The DP recurrence relations are based on the different possibilities of pairing nucleotides.

S(i,j) will equal the maximum of : S(i, j-1) S(i+1,j-1)+B(i,j)

Tail Head S(i+1, j)

Head Tail As a result the recurrences for the optimal fold score S(i,j) are:

j …… i Usually evaluated in increasing length of subsequence i.e. distance between i and j.

j …… i -We choose an alternative permissible order. -The algorithm will evaluate in order of increasing j and decreasing i.

forj= 2 to n fori= 1 to j-1 • S(i+1,j-1)+B(i,j) Rule a • S(i,j-1) Rule b • for i = j-1 to 1 • S(i,j) = max( S(i,j), S(i+1,j) ) Rule c • for k=j-1 to i+1 {Rule d loop} • S(i,j)=max( S(i,j) , S(i,k-1)+S(k,j) ) Rule d O(n2) S(i,j)=max

Bottleneck forj= 2 to n fori= 1 to j-1 • S(i+1,j-1)+B(i,j) Rule a • S(i,j-1) Rule b • for i = j-1 to 1 • S(i,j) = max( S(i,j), S(i+1,j) ) Rule c • for k=j-1 to i+1 {Rule d loop} • S(i,j)=max( S(i,j) , S(i,k-1)+S(k,j) ) O(n) S(i,j)=max O(n3) O(n2) O(n)

Four Russians speedup: first hint Idea: Change the decrement by one to a constant number of operations in each group of size q. forj= 2 to n fori= 1 to j-1 • S(i+1,j-1)+B(i,j) Rule a • S(i,j-1) Rule b • for i = j-1 to 1 • S(i,j) = max( S(i,j), S(i+1,j) ) Rule c • for k=j-1 to i+1 {Rule d loop} • S(i,j)=max( S(i,j) , S(i,k-1)+S(k,j) ) S(i,j)=max

What will Rule d loop look like then?

Computation components for speeding up Rule d loop There are several components for speeding up Rule d loop. • Rgroup • Vg vector • Little vg vector • k* • R Table At first these may seem unclear and extraneous but will lead to a speed-up.

Rgroup g Conceptually divide each column in the S matrix into groups of q rows. Each is called an Rgroup, indexed by g.

j Rgroup 0 …… i Rgroup 1 Rgroup 2 Rgroup example where q=3

Computation components for speeding up Rule d loop • Rgroup • Vg vector • Little vg vector • k* • R Table

Vg vector For a fixed j, consider an Rgroup g consisting of rows z, z-1, … z-q+1 for some z<j. Let Vg = {S(z,j), S(z-1, j) ,… S(z-q+1,j)} for all Rgroups g. V0 V1

Computation components for speeding up Rule d loop • Rgroup • Vg vector • Little vg vector • k*

Little vg vector Observe that for the simple scoring scheme: B(i,j)=1 when (i,j) is a permitted match B(i,j)=0 when (i,j) is not a permitted match S(z-1,j) – S(z,j)= {1 or 0}. Hence, the difference between consecutive values in any Vg vector are either 0 or 1. 5-5=0 V0 V1 5-4=1

Little vg vector The changes in consecutive values of Vg could therefore be encoded into little vg ,a vector of length q-1. encode* 5-5=0 v0 V0 V1 5-4=1 Base of V0

Little vector vg - Encoding each Vg into little vg takes O(q) time. - Each column has j/q different Rgroups. - O(n) time to do encoding for the entire column. V0 encode 4-3=1 v1 V1 3-3=0

Decoding the offset- non formal definition Decode: vgV`; where - V’ is a vector of length q-1 - if V`[l]=4 then Vg[l]=x+4 where x is the value in Vg(0) or the base of the Vg vector.

decode formal definition and example It will be a running sum of the values in little vector vg. decode: vg ->V’ ; where Example q=7 V’ Vg decode vg encode 3+1=4 8-7=1 3+0=3 7-7=0 Offset from the base 2+1=3 7-6=1 1+1=2 6-5=1 1+0=1 5-5=0 1 5-4=1 base

Computation components for speeding up Rule d loop • Rgroup • Vg vector • Little vg vector • k* • Table R

Defining k* for an Rgroup g In column j, let k*(i,g,j) be the index k such that the sum S(i,k-1)+S(k,j) is maximized over the indices in Rgroup g.

Example For example, assuming column j=11 and row i=3: To directly compute k*(i,g,j) for Rgroup1 (g=1) we would have to find max of S(3,4)+S(5,11) S(3,5)+S(6,11) S(3,6)+S(7,11) and store the k that corresponds to that max bipartition. This would require O(q) time operations.

Computation components for speeding up Rule d loop Rgroup Vg vector Little vg vector k* Table R 33

Table R Introducing Table R (currently a black box) that given (i,g,vg) returns k*(i,g,j) in O(1) operations.

Modified Rule d loop using Table R forj= 2 to n …. for i = j-1 to 1 for k=j-1 to i+1 • get vg given Rgroup g • retrieve k*(i,g,j) from R(i,g,vg) • S(i,j)=max( S(i,j) , S(i,k*-1)+S(k*,j) ) forg=(j-1)/q to (i+1)/q {Rule d loop}

Run-time of Modified Rule d loop using Table R The asymptotic run-time for Rule d loop is O(n2/q) forj= 2 to n …. for i = j-1 to 1 • for g=(j-1)/q to (i+1)/q {Rule d loop: where Rgroup g is fully complete} • get vg given Rgroup g • retrieve k*(i,g,j) from R(i,g,vg) • S(i,j)=max( S(i,j) , S(i,k*-1)+S(k*,j) ) O(n) O(n/q) q*n2

Modified Rule d loop using Table R Having k* speeds up the Rule d loop. How can we know k* when we need it? Answer: Four-Russians preprocessing. forj= 2 to n …. for i = j-1 to 1 • for g=(j-1)/q to (i+1)/q { where Rgroup g is fully complete} • get vg given Rgroup g • retrieve k*(i,g,j) from R(i,g,vg) • S(i,j)=max( S(i,j) , S(i,k*-1)+S(k*,j) )

Preprocessing

Preprocessing Table R: Cgroups We conceptually divide the columns into groups of size q. Example with q=3.

Preprocessing Table R Assume that we run the ‘Second Dynamic Programming Algorithm’ until j=q.  That means that all the S(i,j) values in Cgroup 0 have been computed. At this point we can compute the following: for each binary vector v of length q-1 V’=decode(v) for each i such that i < q-1 R(i, 0, v) is set to the index k in Rgroup 0 such that S(i,k-1) + V’[k] is maximized.

Preprocessing in general In general, for Cgroup g > 0, we could do a similar preprocessing after all the entries in columns of Cgroup g have been computed. That is, R(i,g,v) could be found and stored for all i < g*q. Total for pre-processing for all Cgroups is O(qn*2q-1) * O(n/q)=O(n2*2q-1) time O(qn 2q-1) • once Cgroupg=(j/q) is complete • for each binary vector v of length q-1 V’=decode(v) • for each i such that i < q-1 • R(i, g, v) is set to the index k in Rgroup g such that S(i,k-1) + V’[k] is maximized.

RNA folding algorithm with Four-Russians Speedup

RNA folding algorithm with Four-Russians Speedup • forj= 2 to n • fori= 1 to j-1 • S(i,j) = max( S(i+1,j-1)+B(i,j) , S(i,j-1)) • for i = j-1 to 1 • for g=(j-1)/q to (i+1)/q {Rule loop d} • get vg given Rgroup g • retrieve k*(i,g,j) from R(i,g,vg) • S(i,j)=max( S(i,j) , S(i,k*-1)+S(k*,j) ) • once Cgroup g=(j/q) is complete • for each binary vector v of length q-1 V’=decode(v) • for each i such that i < q-1 • R(i, g, v) is set to the index k in Rgroup g such that S(i,k-1) + V’[k] is maximized. O(n3/q) O(n2*2q-1)

Run-Time Setting q= log(n) the total run-time of O(n2*2q-1) + O(n3/q)+ O(n2q) simplifies to O(n3/log(n)) time total. Computation Pre-processing

Empirical Results - We compare our Four-Russians algorithm to the original O(n3)-time algorithm. • The empirical results shown in Table 1 give the average time for 50 tests of randomly generated RNA sequences and 25 downloaded sequences from genBank, for each size between 1,000bp and 6,000bp. • The algorithm performs identically for randomly generated and gen-Bank sequences of equal length. • This is to be expected because the algorithm's run time is sequence character independent.

* *seconds

Variations on scoring scheme B The Four Russians Speed-Up could be extended to any B(i,j) for which no differences between S(i,j) and S(i+1,j) depend on n. Let C denote the size of the set of all possible differences. Then asymptotic time is: when and

Parallel Computing When running this algorithm in parallel on n processors we can reduce the total computation to O(log(n)*n2). (Could be reduced to O(n2) time by a re-ordering method; space trade off)

Conclusion - A practical and easy to implement Four-Russians Speed Up algorithm for prediction of RNA secondary structure, could be applied to a wide variety of scoring schemes. -The asymptotic time of O(n3/log(n)) is achieved by interleaving computation and preprocessing. -This time can be further lowered to O(log(n)*n2) when using n processors in parallel.

Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

Presentation Transcript

Department of Computer Science

UC Davis

Department of Computer Science

Department of Computer Science

Department of computer science

Department of Computer Science and Computer Engineering

Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

Yun S. Song, Yufeng Wu and Dan Gusfield University of California, Davis

Department of Computer Science

Bertram Ludäscher Dept. of Computer Science, UC Davis UC Davis Genome Center

D.L. Cox, Department of Physics, UC Davis

UC Davis Student Computer Purchase Program

DEPARTMENT OF COMPUTER SCIENCE

UC DAVIS

Department of Computer Science

UC Davis

Department of Computer Science

UC Davis

Department of Computer Science

Dan Gusfield UC Davis

Department of Computer Science