InCoB2007 - August 30, 2007 - HKUST

“Speedup Bioinformatics Applications on Multicore-based Processor using Vectorizing & Multithreading Strategies” • Kridsadakorn Chaichoompu (kridsadakorn.cha@biotec.or.th) • Dr. Sissades Tongsima • Dr. Surin Kittitornkun • National Center for Genetic Engineering and Biotechnology, Thailand • King Mongkut’s Institute of Technology, Ladkrabang, Thailand

Outline • Introduction • Case Study • Existing works • Speedup of our approach • Comparison • Discussion • Our strategies • Limitation • Conclusion

Motivation • New multicore processors are being launched (dual-core CPU, quad-core CPU) • How can we make use of these new technologies?

Motivation [2] • What is the difference between old and new CPUs? • Dual-core: max. speedup ~2x • Quad-core: max. speedup ~4x

Problems • Is old sequential software still used? • Yes, especially scientific and bioinformatics tools • Why do scientists still use it? • They mostly care about novel algorithms and knowledge, not about speed • Why not use a PC cluster? • It is very expensive and consumes much more electric power. A PC cluster is unnecessary for small software that searches, matches or groups data

Our Contribution • The hardware has changed, so old sequential software should change too. To harness the power of the new multicore architecture, certain compiler techniques must be considered • Using the popular ClustalW application as our case study, optimization and multithreading techniques were applied to speed up ClustalW

Case Study: ClustalW • ClustalW is a general-purpose multiple alignment program for DNA or proteins.

ClustalW example • Input sequences: S1 ALSK, S2 TNSD, S3 NASK, S4 NTSD • Pipeline: all pairwise alignments -> distance matrix -> neighbor joining -> multiple alignment • Multiple Alignment Steps • 1. Align S1 with S3: -ALSK / NA-SK • 2. Align S2 with S4: -TNSD / NT-SD • 3. Align (S1, S3) with (S2, S4): -ALSK / -TNSD / NA-SK / NT-SD

Existing works • ClustalW-MPI: ClustalW analysis using distributed and parallel computing • K.B. Li, Bioinformatics 19, 2003 • Parallel MSA: Parallel Multiple Sequence Alignment with Dynamic Scheduling • J. Luo, I. Ahmad, M. Ahmed and R. Paul, ITCC’05 • SGI: Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL • D. Mikhailov, Haruna C., and R. Gomperts, SGI ChemBio

Speedup of our approach • Test data: 800 sequences, 1000 amino acids • Data set: protein sequences from NCBI • Elapsed times (ms) per stage and overall speedup:
Running mode*  Distance Matrix  Neighbor Joining  Progressive Alignment  Overall speedup
I                   11,918,672           932,718                333,110                -
II                  10,387,046           881,125                338,016             1.14
III                  9,656,750           880,969                327,985             1.21
IV                   7,009,875           511,047                252,984             1.70
V                    5,900,891           473,359                253,188             1.98
VI                   5,472,407           474,109                244,672             2.12
*Note: Running modes are defined as follows: (I) ClustalW without optimization (II) ClustalW with optimization (III) ClustalW with optimization and our assist (IV) MT-ClustalW without optimization (V) MT-ClustalW with optimization (VI) MT-ClustalW with optimization and our assist
• Run time: from 3 h 40 min down to 1 h 43 min
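The overall speedup column can be re-derived from the per-stage times. The following is a minimal sketch in C, assuming the three stages run one after another so that their times simply add up; the names `StageTimes`, `total_ms` and `overall_speedup` are illustrative and not taken from the ClustalW sources.

```c
/* Elapsed times in ms for one running mode, copied from the table. */
typedef struct {
    double distance_matrix;
    double neighbor_joining;
    double progressive_alignment;
} StageTimes;

/* Total elapsed time of a run, assuming the stages execute sequentially. */
static double total_ms(const StageTimes *t) {
    return t->distance_matrix + t->neighbor_joining + t->progressive_alignment;
}

/* Overall speedup of `run` relative to `baseline`. */
static double overall_speedup(const StageTimes *baseline, const StageTimes *run) {
    return total_ms(baseline) / total_ms(run);
}

/* Mode I (ClustalW without optimization) and mode VI
   (MT-ClustalW with optimization and our assist). */
static const StageTimes MODE_I  = {11918672.0, 932718.0, 333110.0};
static const StageTimes MODE_VI = {5472407.0, 474109.0, 244672.0};
```

Calling `overall_speedup(&MODE_I, &MODE_VI)` gives about 2.13, which agrees with the reported 2.12 up to rounding. The same totals show that the distance matrix stage accounts for roughly 90% of the mode I run time.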

ClustalW Speedup of the optimized versions of ClustalW as a function of number of sequences. The sequence lengths are fixed at 800 and 1000 amino acids.

Multithreaded ClustalW • Speedup of the optimized versions of MT-ClustalW as a function of number of sequences. The sequence lengths are fixed at 800 and 1000 amino acids.

Comparison • Why is the speedup over 2x? • Because of the special unit in the new CPU (the vector unit targeted by our vectorizing strategy) • Does the special unit normally work with common software? • No, we have to enable it, e.g. through the compiler's vectorizing option.

Speedup > 2x for dual-CPU? [1] • Amdahl’s Law (S = speedup)
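Amdahl's law is commonly stated as follows. Only S comes from the slide; P (the fraction of the run time that can be parallelized) and N (the number of cores) are symbols assumed here.

```latex
S = \frac{1}{(1 - P) + \dfrac{P}{N}}, \qquad S \le N
```

With N = 2, S can never exceed 2 however large P is, so a measured speedup above 2x on a dual-core CPU cannot come from multithreading alone.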

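Speedup > 2x for dual-CPU? [2] • Optimization and our assist without multithreading (mode III): speedup 1.21 • Multithreading without optimization (mode IV): speedup 1.70 • Data set: 800 sequences, 1000 amino acids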
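One rough way to read these two numbers is to assume that the gain from optimization and the gain from multithreading are independent and therefore multiply. This is an assumption made here for illustration, not a claim from the slide, and the subscripts are illustrative labels.

```latex
S_{\text{combined}} \approx S_{\text{opt}} \times S_{\text{MT}} = 1.21 \times 1.70 \approx 2.06
```

That estimate is close to the overall speedup of 2.12 measured for MT-ClustalW with optimization and our assist.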

Our strategies • Step 1: Analyzing and Profiling • To find the software structure and where the bottleneck is • Step 2: Applying the methodologies • Multithreading & Vectorizing (one of the optimization methods) • Step 3: Validating • To compare the result with the original one and make sure it is unchanged

Strategy: Multithreading • The Proposed Multithreading Strategy • To remove the bottleneck of the software, which is the non-threaded part • To raise the throughput of the program by applying the multithreading strategy • To reduce the overhead of thread creation

Profile the software • Profiled by Intel Thread Profiler • Stages shown: distance matrix, neighbor joining, progressive alignment

Implementation • Apply the thread library to this loop

Trick • Reduce Thread Creation Overhead • 4 Threads: T1 T2 T3 T4 • Parameters: P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12
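The trick can be sketched with POSIX threads as below: four threads are created once and each one works through several parameter sets, instead of one thread being created per parameter set. This is a minimal sketch, not the MT-ClustalW code. It assumes the slide's layout means thread T1 takes P1, P5, P9, thread T2 takes P2, P6, P10, and so on. The names `Task`, `process_param` and `run_with_fixed_threads` are hypothetical, and `process_param` is a stand-in for the real per-parameter work.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4

/* Work description handed to one thread. */
typedef struct {
    const int *params;  /* shared, read-only parameter sets */
    long *results;      /* one output slot per parameter set */
    size_t n;           /* total number of parameter sets */
    size_t first;       /* index of the first parameter set for this thread */
} Task;

/* Hypothetical stand-in for the real work done on one parameter set. */
static long process_param(int p) {
    return (long)p * p;
}

/* Each thread handles indices first, first + NUM_THREADS, first + 2*NUM_THREADS, ... */
static void *worker(void *arg) {
    const Task *task = (const Task *)arg;
    for (size_t i = task->first; i < task->n; i += NUM_THREADS) {
        task->results[i] = process_param(task->params[i]);
    }
    return NULL;
}

/* Creates NUM_THREADS threads once, regardless of n.
   Returns 0 on success, -1 if a thread could not be created. */
static int run_with_fixed_threads(const int *params, long *results, size_t n) {
    pthread_t threads[NUM_THREADS];
    Task tasks[NUM_THREADS];
    int created = 0;
    int status = 0;

    for (int t = 0; t < NUM_THREADS; t++) {
        tasks[t] = (Task){params, results, n, (size_t)t};
        if (pthread_create(&threads[t], NULL, worker, &tasks[t]) != 0) {
            status = -1;
            break;
        }
        created++;
    }
    /* Wait only for the threads that were actually started. */
    for (int t = 0; t < created; t++) {
        pthread_join(threads[t], NULL);
    }
    return status;
}
```

With 12 parameter sets this creates 4 threads rather than 12, and no locking is needed because every thread writes to its own result slots. The program has to be linked against the pthread library (for example with `-pthread` on GCC or Clang).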

Strategy: Vectorizing • Proposed Optimizing and Vectorizing Methodology • Find the frequently used functions in the program • Apply the Loop Optimizing Methodologies • Use the Intel C++ Compiler to optimize the code, with its vectorizing option enabled

Frequently used functions • Profiled by Intel VTune

Loop Reversal • That is, to run a loop backward. Reversal is legal when the loop's iterations are independent of each other, since the result then does not depend on the order in which the index set is visited.
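A minimal sketch of loop reversal in C, using an illustrative scaling loop rather than code from ClustalW. Each iteration touches only its own element, so the forward and reversed versions produce the same array. A loop such as a running sum, where iteration i reads the value written by iteration i - 1, could not be reversed this way.

```c
#include <stddef.h>

/* Forward version: visits indices 0, 1, ..., n-1. */
static void scale_forward(float *a, size_t n, float k) {
    for (size_t i = 0; i < n; i++) {
        a[i] *= k;
    }
}

/* Reversed version: visits indices n-1, ..., 1, 0.
   The loop test compares against zero, which is one reason
   reversal is sometimes used as a loop optimization. */
static void scale_reversed(float *a, size_t n, float k) {
    for (size_t i = n; i-- > 0; ) {
        a[i] *= k;
    }
}
```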

Loop Fission • A single loop can be broken into two or more smaller loops. Loop fission can separate a block of conditionally executed statements from the rest of the loop body.
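A minimal sketch of loop fission in C, again with illustrative functions that are not taken from ClustalW. The fused loop mixes a plain element-wise addition with a conditional count. After fission the first loop is branch-free, which generally makes it an easier candidate for a vectorizing compiler, while the conditional part sits in its own loop.

```c
#include <stddef.h>

/* Fused: the addition and the conditional count share one loop body. */
static size_t add_and_count_fused(const int *a, const int *b, int *sum,
                                  size_t n, int threshold) {
    size_t above = 0;
    for (size_t i = 0; i < n; i++) {
        sum[i] = a[i] + b[i];
        if (sum[i] > threshold) {
            above++;
        }
    }
    return above;
}

/* Fissioned: the same work split into two simpler loops. */
static size_t add_and_count_fissioned(const int *a, const int *b, int *sum,
                                      size_t n, int threshold) {
    /* Loop 1: straight-line arithmetic with no branches. */
    for (size_t i = 0; i < n; i++) {
        sum[i] = a[i] + b[i];
    }
    /* Loop 2: the conditionally executed statement on its own. */
    size_t above = 0;
    for (size_t i = 0; i < n; i++) {
        if (sum[i] > threshold) {
            above++;
        }
    }
    return above;
}
```

Both versions fill `sum` identically and return the same count. Fission is valid here because the second loop only reads values that the first loop has already finished writing.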

Limitation • Available compilers and programming languages • C/C++: Intel C++ compiler (Windows, Linux, Mac) • Fortran: Intel Fortran compiler (Windows, Linux, Mac) • Available processors • CPU with Hyper-Threading Technology or above (Intel, AMD)

Conclusion • Generic compiling strategy to assist the compiler in improving the performance of bioinformatics applications written in C/C++ • Proposed framework: multithreading and vectorizing strategies • Higher speedup by taking advantage of multicore architecture technology • The proposed optimization could be more appropriate than parallelization on a small cluster computer

Thank you Questions?
