
GPU computing and CUDA






Presentation Transcript


  1. GPU computing and CUDA
  Marko Mišić (marko.misic@etf.rs)
  Milo Tomašević (mvt@etf.rs)
  YUINFO 2012, Kopaonik, 29.02.2012.

  2. Introduction to GPU computing (1)
  - Graphics Processing Units (GPUs) have been used for non-graphics computation for several years
  - This trend is called General-Purpose computation on GPUs (GPGPU)
  - GPGPU applications can be found in: computational physics/chemistry/biology, signal processing, computational geometry, database management, computational finance, computer vision

  3. Introduction to GPU computing (2)
  - The GPU is a highly parallel processor, good at data-parallel processing with many calculations per memory access
  - The same computation is executed on many data elements in parallel, with high arithmetic intensity
  - Running the same computation everywhere lowers the requirement for sophisticated flow control
  - High arithmetic intensity and many data elements mean that memory access latency can be hidden with calculations instead of big data caches

  4. CPU vs. GPU trends (1)
  - The CPU is optimized to execute tasks: big caches hide memory latencies, sophisticated flow control
  - The GPU is specialized for compute-intensive, highly parallel computation
  - More transistors can be devoted to data processing rather than data caching and flow control
  [Figure: transistor budget of a CPU (large control logic and cache, few ALUs) vs. a GPU (many ALUs, small control and cache), each backed by DRAM]

  5. CPU vs. GPU trends (2)
  - The GPU has evolved into a very flexible and powerful processor
  - Programmable using high-level languages
  - Computational power: ~1 TFLOPS vs. ~100 GFLOPS
  - Memory bandwidth: roughly 10x higher
  - A GPU is found in almost every workstation

  6. CPU vs. GPU trends (3)
  [Figure: "CUDA Advantage" chart with reported speedups of 10x to 197x across application domains]
  - Rigid body physics solver
  - Matrix numerics: BLAS1 60+ GB/s, BLAS3 100+ GFLOPS
  - Wave equation: FDTD 1.2 Gcells/s, FFT 52 GFLOPS (GFLOPS as defined by benchFFT)
  - Biological sequence match: SSEARCH 5.2 Gcells/s
  - Finance: Black-Scholes 4.7 GOptions/s

  7. History of GPU programming
  - The fast-growing video game industry puts strong pressure on GPU vendors and forces constant innovation
  - GPUs evolved from fixed-function pipeline processors into more programmable, general-purpose processors
  - Programmable shaders (2000): programmed through the OpenGL and DirectX APIs, with many limitations (memory access, ISA, floating-point support, etc.)
  - NVIDIA CUDA (2007)
  - AMD/ATI (Brook+, FireStream, Close-To-Metal)
  - Microsoft DirectCompute (DirectX 10 / DirectX 11)
  - Open Computing Language, OpenCL (2009)

  8. CUDA overview (1)
  - Compute Unified Device Architecture (CUDA)
  - A new hardware and software architecture for issuing and managing computations on the GPU
  - Started with the NVIDIA 8000 (G80) series GPUs
  - General-purpose programming model, SIMD / SPMD: the user launches batches of threads on the GPU
  - The GPU can be seen as a dedicated super-threaded, massively data-parallel coprocessor
  - Explicit and unrestricted memory management

  9. CUDA overview (2)
  - The GPU is viewed as a compute device that is a coprocessor to the CPU (host): it executes the compute-intensive part of the application, runs many threads in parallel, and has its own DRAM (device memory)
  - Data-parallel portions of an application are expressed as device kernels which run on many threads (a minimal kernel sketch follows below)
  - GPU threads are extremely lightweight, with very little creation overhead
  - The GPU needs thousands of threads for full efficiency, while a multicore CPU needs only a few
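
  A minimal sketch of such a device kernel (not from the original slides; the name vecAdd and the element-wise operation are illustrative). Each lightweight thread handles exactly one output element:

    // One thread computes one element of c = a + b.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard threads past the end
            c[i] = a[i] + b[i];
    }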

  10. CUDA overview (3)
  - Dedicated software stack: runtime and driver
  - C-language extension for easier programming, plus a targeted API for advanced users
  - Complete tool chain: compiler, debugger, profiler
  - Libraries and 3rd party support: GPU Computing SDK, cuFFT, cuBLAS, FORTRAN, C++, Python, MATLAB, Thrust, GMAC, ... (a cuBLAS sketch follows below)
  [Figure: software stack with the CPU application on top of the CUDA libraries (FFT, BLAS), CUDA runtime, CUDA driver, and the GPU]
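
  As an illustration of the library support, a hedged sketch of calling cuBLAS (v2 API) to compute a SAXPY on data already in device memory; the helper function name is illustrative and error checking is omitted:

    #include <cublas_v2.h>

    // d_x and d_y are assumed to be device pointers holding n floats each.
    void saxpy_with_cublas(int n, float alpha, const float *d_x, float *d_y)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);                          // initialize the library context
        cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1); // y = alpha * x + y on the GPU
        cublasDestroy(handle);
    }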

  11. Programming model (1)
  - A CUDA application consists of two parts: sequential parts are executed on the CPU (host), compute-intensive parts are executed on the GPU (device)
  - The CPU is responsible for data management, memory transfers, and the GPU execution configuration (a host-side sketch follows below)
  - Typical flow: serial code on the host, then a parallel kernel on the device, e.g. KernelA<<< nBlk, nTid >>>(args); more serial host code; then another kernel, e.g. KernelB<<< nBlk, nTid >>>(args)
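
  A hedged sketch of these host-side responsibilities, reusing the illustrative vecAdd kernel from the sketch under slide 9; error checking is omitted for brevity:

    #include <cuda_runtime.h>

    void run_vec_add(const float *h_a, const float *h_b, float *h_c, int n)
    {
        size_t bytes = n * sizeof(float);
        float *d_a, *d_b, *d_c;

        cudaMalloc((void **)&d_a, bytes);                    // allocate device memory
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice); // host -> device transfers
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        int nTid = 256;                                      // threads per block
        int nBlk = (n + nTid - 1) / nTid;                    // enough blocks to cover n
        vecAdd<<<nBlk, nTid>>>(d_a, d_b, d_c, n);            // execution configuration

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost); // copy results back
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    }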

  12. Programming model (2)
  - A kernel is executed as a grid of thread blocks
  - A thread block is a batch of threads that can cooperate with each other by efficiently sharing data through shared memory and by synchronizing their execution
  - Two threads from two different blocks cannot cooperate
  [Figure: the host launches Kernel 1 and Kernel 2, each as a grid of 2D blocks; one block, Block (1, 1), is expanded into its 2D arrangement of threads]

  13. Programming model (3)
  - Threads and blocks have IDs, so each thread can decide what data to work on
  - Block ID: 1D or 2D; thread ID: 1D, 2D, or 3D
  - This simplifies memory addressing when processing multidimensional data, e.g. image processing or solving PDEs on volumes (see the 2D indexing sketch below)
  [Figure: grid of 2D blocks, with Block (1, 1) expanded into its 2D arrangement of threads]
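
  A small sketch of 2D indexing for image processing (not from the slides; the kernel name and the per-pixel operation are illustrative). Each thread maps its 2D block and thread IDs to one pixel:

    // Hypothetical per-pixel kernel: 2D IDs map directly to image coordinates.
    __global__ void invert(unsigned char *img, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index
        if (x < width && y < height)
            img[y * width + x] = 255 - img[y * width + x];
    }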

  14. Memory model (1)
  Each thread can:
  - Read/write per-thread registers
  - Read/write per-thread local memory
  - Read/write per-block shared memory
  - Read/write per-grid global memory
  - Read only per-grid constant memory
  - Read only per-grid texture memory
  [Figure: grid of blocks, each with shared memory and per-thread registers and local memory; the host accesses the per-grid global, constant, and texture memory]

  15. Memory model (2)
  - The host can read/write global, constant, and texture memory, all stored in device DRAM
  - Global memory accesses are slow, around ~200 cycles
  - The memory architecture is optimized for high bandwidth: memory banks, transactions
  [Figure: device with per-block shared memory and per-thread registers, backed by global memory (DRAM) that is also accessible from the host]

  16. Memory model (3)
  - Shared memory is a fast on-chip memory that allows threads in a block to share intermediate data
  - Access time ~3-4 cycles
  - Can be seen as a user-managed cache (scratchpad): threads are responsible for bringing data into shared memory and moving it back out (a staging sketch follows below)
  - Small in size (up to 48 KB)
  [Figure: multiprocessors with control logic, ALUs, cache, and shared memory, staging data elements d0-d7 to and from DRAM]
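
  A minimal sketch of staging data through shared memory (not from the slides). It assumes 256-thread blocks and an array length that is a multiple of the block size; each block loads its chunk into the scratchpad, synchronizes, and writes it back reversed:

    __global__ void reverseBlock(float *data)
    {
        __shared__ float tile[256];                     // fast on-chip scratchpad
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = data[i];                    // global -> shared
        __syncthreads();                                // wait until the whole tile is loaded
        data[i] = tile[blockDim.x - 1 - threadIdx.x];   // write back in reversed order
    }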

  17. A common programming strategy
  - Local and global memory reside in device memory (DRAM), with much slower access than shared memory
  - A common way of performing computation on the device is to block it up (tile it) to take advantage of fast shared memory
  - Partition the data set into subsets that fit into shared memory
  - Handle each data subset with one thread block by: loading the subset from global memory to shared memory; performing the computation on the subset from shared memory, so each thread can efficiently make multiple passes over any data element; copying the results from shared memory back to global memory

  18. Matrix Multiplication Example (1)
  - P = M * N, all of size WIDTH x WIDTH
  - Without blocking: one thread handles one element of P
  - M and N are loaded WIDTH times from global memory (a sketch of this naive kernel follows below)
  [Figure: matrices M, N, and P, each WIDTH x WIDTH]
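
  A sketch of the naive kernel described above (names are illustrative; matrices are stored row-major as flat arrays):

    // Naive version: one thread computes one element of P; elements of M and N
    // are re-read from global memory WIDTH times across a block.
    __global__ void matMulNaive(const float *M, const float *N, float *P, int width)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < width && col < width) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k)
                sum += M[row * width + k] * N[k * width + col];
            P[row * width + col] = sum;
        }
    }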

  19. Matrix Multiplication Example (2)
  - P = M * N, all of size WIDTH x WIDTH
  - With blocking: one thread block handles one BLOCK_SIZE x BLOCK_SIZE sub-matrix Psub of P
  - M and N are loaded only WIDTH / BLOCK_SIZE times from global memory
  - Great saving of memory bandwidth! (a sketch of the tiled kernel follows below)
  [Figure: M, N, and P partitioned into BLOCK_SIZE x BLOCK_SIZE tiles, with the Psub tile of P highlighted]
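
  A sketch of the tiled kernel along these lines (not the slides' exact code). It assumes BLOCK_SIZE x BLOCK_SIZE thread blocks and WIDTH a multiple of BLOCK_SIZE:

    #define BLOCK_SIZE 16

    // Tiled version: each block stages BLOCK_SIZE x BLOCK_SIZE tiles of M and N
    // in shared memory, so each element is read from global memory only
    // WIDTH / BLOCK_SIZE times.
    __global__ void matMulTiled(const float *M, const float *N, float *P, int width)
    {
        __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < width / BLOCK_SIZE; ++t) {
            Ms[threadIdx.y][threadIdx.x] = M[row * width + t * BLOCK_SIZE + threadIdx.x];
            Ns[threadIdx.y][threadIdx.x] = N[(t * BLOCK_SIZE + threadIdx.y) * width + col];
            __syncthreads();                            // tile fully loaded

            for (int k = 0; k < BLOCK_SIZE; ++k)        // compute from shared memory
                sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
            __syncthreads();                            // done with this tile
        }
        P[row * width + col] = sum;
    }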

  20. CUDA API (1)
  The CUDA API is an extension to the C programming language consisting of:
  - Language extensions to target portions of the code for execution on the device
  - A runtime library split into: a common component providing built-in vector types and a subset of the C runtime library in both host and device code; a host component to control and access one or more devices from the host; a device component providing device-specific functions

  21. CUDA API (2)
  - Function declaration qualifiers: __global__, __host__, __device__
  - Variable qualifiers: __device__, __constant__, __shared__, etc.
  - Built-in variables: gridDim, blockDim, blockIdx, threadIdx
  - Mathematical functions
  - Kernel calling convention (execution configuration): myKernel<<< DimGrid, DimBlock >>>(arg1, ...);
  - The programmer explicitly specifies the block and grid organization: 1D, 2D, or 3D (see the configuration sketch below)
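
  A hedged sketch of a 2D execution configuration, reusing the illustrative invert kernel from the sketch under slide 13; block and grid sizes are arbitrary examples:

    // d_img is assumed to be a device pointer to the image data.
    void launch_invert(unsigned char *d_img, int width, int height)
    {
        dim3 dimBlock(16, 16);                                // 16 x 16 = 256 threads per block
        dim3 dimGrid((width  + dimBlock.x - 1) / dimBlock.x,  // enough blocks to cover the image
                     (height + dimBlock.y - 1) / dimBlock.y);
        invert<<<dimGrid, dimBlock>>>(d_img, width, height);  // execution configuration
    }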

  22. Hardware implementation (1)
  - The device is a set of multiprocessors; each multiprocessor is a set of 32-bit processors with a SIMD architecture
  - At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp, including branches
  - This allows scalable execution of kernels: adding more multiprocessors improves performance
  [Figure: device containing N multiprocessors, each with M processors and a shared instruction unit]

  23. Hardware implementation (2)
  [Figure: GPU block diagram with the host, input assembler, and thread execution manager feeding an array of multiprocessors with parallel data caches, texture units, and load/store units connected to global memory]

  24. Hardware implementation (3)
  - Each thread block of a grid is split into warps that get executed by one multiprocessor; a warp consists of threads with consecutive thread IDs
  - Each thread block is executed by only one multiprocessor
  - The shared memory space resides in the on-chip shared memory; registers are allocated among the threads
  - A kernel that requires too many registers will fail to launch
  - A multiprocessor can execute several blocks concurrently; shared memory and registers are allocated among the threads of all concurrent blocks
  - Decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently

  25. Memory architecture (1)
  - In a parallel machine, many threads access memory, so memory is divided into banks; this is essential to achieve high bandwidth
  - Each bank can service one address per cycle, so a memory can service as many simultaneous accesses as it has banks
  - Multiple simultaneous accesses to a bank result in a bank conflict; conflicting accesses are serialized
  - Shared memory is organized in a similar fashion (a padding sketch follows below)
  [Figure: memory organized into Bank 0 through Bank 15]
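
  A hedged sketch of the classic padding trick for shared-memory banks (not from the slides). It assumes 16 x 16 thread blocks, a square matrix whose width is a multiple of 16, and 16-bank hardware as in the figure above:

    #define TILE 16

    // When a half-warp writes out[...] = tile[threadIdx.x][threadIdx.y], the accessed
    // shared-memory addresses are 16 words apart and would all fall into the same bank;
    // padding each row by one element spreads them across 16 different banks.
    __global__ void transpose(const float *in, float *out, int width)
    {
        __shared__ float tile[TILE][TILE + 1];               // +1 column avoids bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // load the tile row by row
        __syncthreads();                                     // whole tile loaded

        x = blockIdx.y * TILE + threadIdx.x;                 // transposed block position
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * width + x] = tile[threadIdx.x][threadIdx.y]; // conflict-free thanks to padding
    }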

  26. Memory architecture (2)
  - When accessing global memory, accesses are combined into transactions
  - Peak bandwidth is achieved when all threads in a half-warp access contiguous memory locations ("memory coalescing"); in that case, there are no bank conflicts
  - The programmer is responsible for organizing the algorithm so that data is accessed in an appropriate fashion (a small sketch contrasting access patterns follows below)
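
  Two hypothetical kernels contrasting the access patterns (names and the stride parameter are illustrative, not from the slides):

    // Coalesced: consecutive threads touch consecutive addresses, so a
    // half-warp's loads and stores combine into a single memory transaction.
    __global__ void scaleCoalesced(float *data, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= s;
    }

    // Strided: consecutive threads are 'stride' elements apart, so the same
    // half-warp needs many separate transactions and wastes bandwidth.
    __global__ void scaleStrided(float *data, float s, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) data[i] *= s;
    }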

  27. Performance considerations
  - CUDA has a low learning curve: it is easy to write a correct program
  - Performance, however, can vary greatly depending on the resource constraints of the particular device architecture
  - Performance-conscious programmers still need to be aware of those constraints to make good use of contemporary hardware
  - It is essential to understand the hardware and memory architecture: thread scheduling and execution, suitable memory access patterns, shared memory utilization, resource limitations

  28. Conclusion
  - The highly multithreaded architecture of modern GPUs is very suitable for solving data-parallel problems and vastly improves performance in certain domains
  - It is expected that GPU architectures will evolve to further broaden the application domains
  - We are at the dawn of heterogeneous computing
  - Software support is developing rapidly: mature tool chain, libraries, available applications, OpenCL

  29. References
  - David Kirk, Wen-mei Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2010.
  - Course ECE498AL, University of Illinois, Urbana-Champaign, http://courses.engr.illinois.edu/ece498/al/
  - Dan Connors, OpenCL and CUDA Programming for Multicore and GPU Architectures, ACACES 2011, Fiuggi, Italy, 2011.
  - David Kirk, Wen-mei Hwu, Programming and Tuning Massively Parallel Systems, PUMPS 2011, Barcelona, Spain, 2011.
  - NVIDIA CUDA C Programming Guide 4.0, 2011.
  - Mišić, Đurđević, Tomašević, "Evolution and Trends in GPU Computing", MIPRO 2012, Abbazia, Croatia, 2012. (to be published)
  - NVIDIA Developer Zone, http://developer.nvidia.com/category/zone/cuda-zone
  - http://en.wikipedia.org/wiki/GPGPU
  - http://en.wikipedia.org/wiki/CUDA
  - GPU training wiki, https://hpcforge.org/plugins/mediawiki/wiki/gpu-training/index.php/Main_Page

  30. GPU computing and CUDA: Questions?
  Marko Mišić (marko.misic@etf.rs)
  Milo Tomašević (mvt@etf.rs)
  YUINFO 2012, Kopaonik, 29.02.2012.
