Author : Yongchao Liu, Douglas L Maskell, Bertil Schmidt Publisher: BMC Research Notes, 2009

CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units
Speaker: De Yu Chen
Date: 2011/4/20

Outline • Introduction • Smith-Waterman Algorithm • CUDA Programming Model • Methods • Results and discussion

Introduction
In this paper, the compute power of CUDA-enabled GPUs is further explored to accelerate Smith-Waterman (SW) sequence database searches. Two versions of CUDASW++ are implemented: a single-GPU version and a multi-GPU version. Our CUDASW++ implementations provide better performance guarantees for protein sequence database searches than:
• SWPS3: PlayStation 3, 2008
• CBESW: PlayStation 3, 2008
• SW-CUDA: CUDA-enabled GPU, 2008
• NCBI-BLAST: the Basic Local Alignment Search Tool program, designed by the National Center for Biotechnology Information, USA, 1997

Smith-Waterman Algorithm

CUDA Programming Model
CUDA execution model
[Figure: CPU (host) and GPU (device)]

CUDA Programming Model (cont.)
CUDA hardware model

CUDA Programming Model (cont.)
※ Shared memory access patterns
__shared__ int data[32]
[Figure: the 32 elements of data laid out in pairs (0 and 16, 1 and 17, ..., 15 and 31)]
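The pairing in the figure (0 with 16, 1 with 17, and so on) is what a 16-bank shared memory would produce. The sketch below models that mapping in plain Python. It assumes 16 banks of 32-bit words, as on compute capability 1.x devices such as the GTX 280; `bank_of` and `conflict_degree` are hypothetical helper names, not CUDA API calls.

```python
NUM_BANKS = 16  # assumption: 16 banks, one 32-bit word wide each

def bank_of(word_index, num_banks=NUM_BANKS):
    """Bank that holds the 32-bit word data[word_index]."""
    return word_index % num_banks

def conflict_degree(indices, num_banks=NUM_BANKS):
    """Largest number of distinct words requested from any single bank.

    1 means conflict-free; k > 1 means the access is serialized k ways.
    """
    words_per_bank = {}
    for i in set(indices):
        words_per_bank.setdefault(bank_of(i, num_banks), set()).add(i)
    return max(len(words) for words in words_per_bank.values())

half_warp = range(16)
stride1 = [t for t in half_warp]      # thread t reads data[t]
stride2 = [2 * t for t in half_warp]  # thread t reads data[2 * t]

print(conflict_degree(stride1))  # each thread hits its own bank
print(conflict_degree(stride2))  # pairs of threads share a bank
```

In this model a stride of 1 touches every bank once, while a stride of 2 sends two different words to each even-numbered bank.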

CUDA Programming Model (cont.)
[Figure: FPGA implementation and GPU implementation, with forward array and backward array]

Methods
Considering the optimal local alignment of a query sequence and a subject sequence as a task, we have investigated two approaches for parallelizing the sequence database searches using CUDA.
• Inter-task parallelization: each task is assigned to exactly one thread, and dimBlock tasks are performed in parallel by different threads in a thread block.
[Figure axes: subject, query]
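To make the notion of a task concrete, here is a minimal sequential sketch of one alignment and of a database search as a set of independent tasks. It is not the CUDASW++ kernel: the scoring scheme (match 2, mismatch -1, linear gap -1) and the names `sw_score` and `search` are illustrative assumptions, whereas protein searches normally use a substitution matrix and affine gap penalties.

```python
def sw_score(query, subject, match=2, mismatch=-1, gap=-1):
    """Optimal local alignment score, keeping only two rows of the matrix."""
    prev = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            h = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(h)
            best = max(best, h)
        prev = curr
    return best

def search(query, database):
    """One independent task per subject; inter-task parallelization would
    hand each of these calls to a different thread."""
    return [sw_score(query, subject) for subject in database]

print(search("ACGT", ["ACGT", "TTTT", "CG"]))  # [8, 2, 4]
```

Because no task reads another task's cells, the loop in `search` can be distributed over threads without any synchronization.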

Methods (cont.)
• Intra-task parallelization: each task is assigned to one thread block, and all dimBlock threads in the thread block cooperate to perform the task in parallel, exploiting the parallel characteristics of cells in the minor diagonals.
[Figure axes: subject, query]
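The minor-diagonal property can be checked with a small sketch that fills the alignment matrix one diagonal at a time. Every cell with i + j = d depends only on diagonals d - 1 and d - 2, so the inner loop is the part that cooperating threads could split among themselves. The scoring values and the name `sw_score_wavefront` are assumptions made for illustration, and the loop runs sequentially here.

```python
def sw_score_wavefront(query, subject, match=2, mismatch=-1, gap=-1):
    """Local alignment score computed diagonal by diagonal (wavefront order)."""
    m, n = len(query), len(subject)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for d in range(2, m + n + 1):                # all cells with i + j == d
        i_lo, i_hi = max(1, d - n), min(m, d - 1)
        for i in range(i_lo, i_hi + 1):          # independent of each other
            j = d - i
            sub = match if query[i - 1] == subject[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # diagonal d - 2
                          H[i - 1][j] + gap,      # diagonal d - 1
                          H[i][j - 1] + gap)      # diagonal d - 1
            best = max(best, H[i][j])
    return best

print(sw_score_wavefront("ACGT", "ACGT"))  # 8
```

The diagonals near the corners of the matrix hold few cells, which is one reason this ordering keeps fewer threads busy than one task per thread.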

Methods (cont.)
Inter-task parallelization occupies more device memory but achieves better performance than intra-task parallelization. Intra-task parallelization, however, occupies significantly less device memory and can therefore support longer query/subject sequences.
In our implementation, two stages are used: the first stage exploits inter-task parallelization and the second exploits intra-task parallelization. Subject sequences whose length is less than or equal to a threshold are aligned with the query sequence in the first stage, in order to maximize performance. Subject sequences longer than the threshold are aligned in the second stage. In our implementation, the threshold is set to 3,072.
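The two-stage rule amounts to partitioning the database by subject length. A minimal sketch follows, with `split_by_stage` as a hypothetical helper name; only the threshold value 3,072 comes from the slide.

```python
THRESHOLD = 3072  # value stated on the slide

def split_by_stage(subjects, threshold=THRESHOLD):
    """Return (stage 1 subjects, stage 2 subjects).

    Stage 1 (inter-task) takes lengths <= threshold,
    stage 2 (intra-task) takes lengths > threshold.
    """
    stage1 = [s for s in subjects if len(s) <= threshold]
    stage2 = [s for s in subjects if len(s) > threshold]
    return stage1, stage2

subjects = ["A" * 10, "A" * 3072, "A" * 3073]
stage1, stage2 = split_by_stage(subjects)
print([len(s) for s in stage1], [len(s) for s in stage2])  # [10, 3072] [3073]
```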

Methods (cont.)
※ Device memory access patterns for coalescing:
(1) Placing matrix elements into linear order
[Figure: matrix M and its linear layout in memory]

Methods (cont.)
(2) A coalesced access pattern
[Figure: matrix M with the access direction by threads; threads T(0) to T(3) shown for each load iteration]

Methods (cont.)
(3) An uncoalesced access pattern
[Figure: matrix M with the access direction by threads; threads T(0) to T(3) shown for load iteration 1 and load iteration 2]
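The difference between the two patterns shows up in the linear addresses that the threads touch in one load iteration. The sketch below assumes a row-major layout for M and four threads; `linear_address`, `addresses`, and `is_contiguous` are hypothetical names used only for this illustration.

```python
def linear_address(row, col, width):
    """Offset of M[row][col] when M is stored row by row."""
    return row * width + col

def addresses(pattern, iteration, num_threads, width):
    """Addresses touched by threads 0..num_threads-1 in one load iteration."""
    if pattern == "coalesced":      # thread t reads M[iteration][t]
        return [linear_address(iteration, t, width) for t in range(num_threads)]
    if pattern == "uncoalesced":    # thread t reads M[t][iteration]
        return [linear_address(t, iteration, width) for t in range(num_threads)]
    raise ValueError(pattern)

def is_contiguous(addrs):
    """True when neighbouring threads touch neighbouring addresses."""
    return all(b - a == 1 for a, b in zip(addrs, addrs[1:]))

print(addresses("coalesced", 1, 4, 4))    # [4, 5, 6, 7]
print(addresses("uncoalesced", 1, 4, 4))  # [1, 5, 9, 13]
```

In the first case the hardware can serve the four reads from one contiguous segment; in the second the reads are a full row apart.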

Methods (cont.)
Our implementation uses three techniques to improve performance:
(I) Coalesced subject sequence arrangement.
(II) Coalesced global memory access.
(III) Cell block division method.
I. Coalesced subject sequence arrangement
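One plausible way to arrange subject sequences for coalesced reads is to interleave them, so that the symbols at the same position in neighbouring sequences sit next to each other in memory. The sketch below illustrates that idea only; the padding symbol, the function names, and the exact layout are assumptions rather than details taken from the slides.

```python
PAD = "*"  # hypothetical padding symbol for sequences shorter than the longest

def interleave(sequences, pad=PAD):
    """Pack sequences so position p of sequence k is at index p * width + k."""
    longest = max(len(s) for s in sequences)
    width = len(sequences)
    buf = []
    for pos in range(longest):
        for s in sequences:
            buf.append(s[pos] if pos < len(s) else pad)
    return buf, width

def symbol(buf, width, seq_idx, pos):
    """Symbol that the thread handling sequence seq_idx reads at step pos."""
    return buf[pos * width + seq_idx]

buf, width = interleave(["ABC", "DE", "F"])
print("".join(buf))  # ADFBE*C**
```

With this layout, threads 0, 1, and 2 read adjacent addresses at every step. Grouping sequences of similar length would reduce the padding.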

Methods (cont.)
II. Coalesced global memory access

Methods (cont.)
III. Cell block division method
To maximize performance and to reduce the bandwidth demand on global memory, we propose a cell block division method for the inter-task parallelization, in which the alignment matrix is divided into cell blocks of equal size.
[Figure: alignment matrix of query sequences against subject sequences, divided into n × n cell blocks]
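The division itself can be sketched as a tiling of the matrix index ranges. The block size n is left as a parameter, and `cell_blocks` is a hypothetical helper. Edge blocks are clipped here for simplicity, whereas an implementation with strictly equal-sized blocks would pad the sequences to a multiple of n.

```python
def cell_blocks(query_len, subject_len, n):
    """Return blocks as (row_start, row_end, col_start, col_end), end exclusive."""
    blocks = []
    for i0 in range(0, query_len, n):
        for j0 in range(0, subject_len, n):
            blocks.append((i0, min(i0 + n, query_len),
                           j0, min(j0 + n, subject_len)))
    return blocks

for block in cell_blocks(16, 16, 8):
    print(block)
```

A thread that processes one block at a time can keep the block's cells in fast memory and touch global memory only for the block's boundary row and column, which is where the bandwidth saving would come from.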

Results and discussion
Cell Updates Per Second (CUPS) is a commonly used performance measure in bioinformatics because it removes the dependency on the particular query sequences and databases used in different tests. Given a query sequence Q and a database D, the GCUPS (billion cell updates per second) value is calculated by:

$$\mathrm{GCUPS} = \frac{|Q| \times |D|}{t \times 10^{9}}$$

• |Q| is the total number of symbols in the query sequence
• |D| is the total number of symbols in the database
• t is the runtime in seconds
In this paper, for the single-GPU version, the runtime t includes the transfer time of the query sequences from host to GPU, the calculation time of the SW algorithm, and the transfer-back time of the scores.
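As a worked example, take GCUPS as |Q| × |D| / (t × 10^9) with t in seconds. The numbers below are made up for illustration and are not measurements from the paper.

```python
def gcups(query_len, db_len, seconds):
    """Billion cell updates per second for one query against one database."""
    return query_len * db_len / (seconds * 1e9)

# Hypothetical run: a 1,000-symbol query against a database of
# 10**9 symbols, finishing in 100 seconds.
print(gcups(1000, 10**9, 100))  # 10.0
```

Because both sizes appear in the numerator, the measure stays comparable when the query length or the database changes between tests.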

Results and discussion (cont.)
(1) Database: Swiss-Prot release 56.6
(2) Number of query sequences: 25
(3) Query length: 144 ~ 5,478
(4) Single-GPU: NVIDIA GeForce GTX 280 (30 SMs, 240 cores, 1 GB RAM)
(5) Multi-GPU: NVIDIA GeForce GTX 295 (60 SMs, 480 cores, 1.8 GB RAM)

Results and discussion (cont.)

Results and discussion (cont.)
(1) Number of query sequences: 1
(2) Number of database sequences: 30
(3) Database sequence length: 256 ~ 5,000
(4) Single-GPU: NVIDIA Tesla C1060 (30 SMs, 240 cores, 4 GB RAM)
(5) CPU: Intel E6420, 1.8 GHz
