GPU Optimization using CUDA Framework


Presentation Transcript


  1. GPU Optimization using CUDA Framework Anusha Vidap, Divya Sree Yamani, Raffath Sulthana

  2. Talk Structure • GPU? • History • CUDA • Is optimization really needed? • Techniques to optimize • Project research goal

  3. Why GPU? What is SIMD? • SIMD (Single Instruction, Multiple Data) means one instruction operates on many data elements at once; the GPU applies this model across thousands of threads.

  4. In this era, the GPU is much more than a graphics processor. • But then, why do we still need CPUs?

  5. CUDA

  6. Why only CUDA?

  7. Automatic Scalability

  8. DYNAMIC PARALLELISM
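A hedged sketch of dynamic parallelism (kernel names assumed; requires a device of compute capability 3.5 or newer and compilation with -rdc=true): a parent kernel launches a child kernel directly on the device, with no round trip to the host.

    __global__ void child(int *data) {
        data[threadIdx.x] += 1;
    }

    __global__ void parent(int *data) {
        // One thread launches a child grid from the device; the child grid
        // is guaranteed to finish before the parent grid completes.
        if (threadIdx.x == 0) {
            child<<<1, 32>>>(data);
        }
    }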

  9. CUDA driving GPU growth

  10. HOST CODE vs CUDA CODE
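As a hedged sketch of the split (the kernel name addOne and the sizes are illustrative, not from the deck): the __global__ function below is CUDA code compiled for the GPU, while main() is ordinary host code that allocates device memory and launches the kernel.

    #include <cuda_runtime.h>

    // CUDA code: executes on the GPU, one thread per array element.
    __global__ void addOne(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    // Host code: executes on the CPU and launches the kernel.
    int main() {
        int n = 1024;
        int *d_data;
        cudaMalloc(&d_data, n * sizeof(int));
        addOne<<<(n + 255) / 256, 256>>>(d_data, n);  // <<<blocks, threads>>>
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }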

  11. CUDA Memories Overview • Global • Local • Constant • Texture • Shared • Registers

  12. Threads, Blocks, Grid, and Warp

  13. Representation:
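Since the representation slide is an image, here is a small illustrative sketch (all names assumed) of how the grid/block/thread hierarchy maps to indices, with warps as groups of 32 threads:

    // Grid -> blocks -> threads: each thread derives a unique global index.
    __global__ void indexDemo(float *out) {
        int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;  // position in the grid
        int warpId    = threadIdx.x / 32;   // warps are groups of 32 threads
        int laneId    = threadIdx.x % 32;   // position within the warp
        out[globalIdx] = warpId * 32 + laneId;
    }

    // Host-side launch: a grid of 4 blocks, 128 threads (4 warps) per block.
    // indexDemo<<<4, 128>>>(d_out);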

  14. Global Memory • Global memory resides in device DRAM. • The name "global" refers to scope: it can be accessed and modified from both host and device. • Declared statically with the __device__ qualifier. • cudaMalloc() allocates it dynamically and assigns it to a regular C pointer variable. • Allocated explicitly by the host (CPU) thread.

  15. Global Memory __device__ • For data available to all threads on the device. • Declared outside function bodies. • Scope of the grid, lifetime of the application.

  16. Basic Operation
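A minimal sketch of these basic operations (buffer names assumed): a statically declared __device__ variable is copied by symbol, while cudaMalloc() memory is handled through a regular pointer:

    #include <cuda_runtime.h>

    __device__ float d_scale;   // static global: declared outside function bodies

    int main() {
        // Dynamic global memory: allocated by the host, held in a plain C pointer.
        float *d_buf;
        cudaMalloc(&d_buf, 1024 * sizeof(float));

        float h_buf[1024] = {0};
        cudaMemcpy(d_buf, h_buf, sizeof(h_buf), cudaMemcpyHostToDevice);

        // Static __device__ variables are copied by symbol, not by pointer.
        float scale = 2.0f;
        cudaMemcpyToSymbol(d_scale, &scale, sizeof(scale));

        cudaFree(d_buf);
        return 0;
    }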

  17. Local Memory • Resides in device memory. • A thread cannot read another thread's local memory. • Much slower than register memory. • Used to hold arrays when they are not indexed with constant values. • Holds variables when no more registers are available for them.
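As an illustrative sketch of the array rule above (kernel names assumed): an array indexed only with compile-time constants can live in registers, while a runtime index typically forces it into per-thread local memory:

    // Likely register-resident: small array indexed only with constants.
    __global__ void inRegisters(float *out) {
        float a[4] = {1, 2, 3, 4};
        out[threadIdx.x] = a[0] + a[3];    // constant indices
    }

    // Likely spilled to local memory: the index is not known at compile time.
    __global__ void inLocalMemory(float *out, int k) {
        float a[4] = {1, 2, 3, 4};
        out[threadIdx.x] = a[k % 4];       // runtime index forces local memory
    }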

  18. Constant Memory • Step 1: A constant memory request for a warp is first split into two requests, one for each half-warp, that are issued independently. • Step 2: A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests. • Final Step: The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.

  19. Constant memory __constant__ • For data not altered by the device. • Although stored in global memory, it is cached and fast to access. • Declared outside function bodies.
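A minimal sketch of __constant__ usage (coeff and the kernel are assumed names): declared outside function bodies, filled from the host with cudaMemcpyToSymbol(), and only read by the device:

    #include <cuda_runtime.h>

    __constant__ float coeff[16];      // declared outside function bodies

    __global__ void applyCoeff(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= coeff[i % 16];   // all threads read, none write
    }

    // Host side fills constant memory by symbol:
    // float h_coeff[16] = { /* values */ };
    // cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));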

  20. Textures: • The GPU's sophisticated texture memory may also be used for general-purpose computing. • Although NVIDIA designed the texture units for the classical OpenGL and DirectX rendering pipelines, texture memory has some properties that make it extremely useful for computing. • Texture memory is cached on chip. • Texture caches are designed for graphics applications.

  21. Textures.. • Reads go through a dedicated addressing mode called texture fetching. • All threads share texture memory. • Textures are addressed as one-dimensional or two-dimensional arrays. • Elements of the array are called texels, short for "texture elements".
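A hedged sketch of a texture fetch using the texture-object API (the deck may have used the older texture-reference API; all names here are assumed): a linear device buffer is wrapped in a cudaTextureObject_t and read through the cached tex1Dfetch() path:

    #include <cuda_runtime.h>
    #include <cstring>

    __global__ void readViaTexture(cudaTextureObject_t tex, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = tex1Dfetch<float>(tex, i);   // cached texture fetch
    }

    // Host side: wrap a linear device buffer d_in (n floats) in a texture object.
    cudaTextureObject_t makeTexture(float *d_in, int n) {
        cudaResourceDesc resDesc;
        memset(&resDesc, 0, sizeof(resDesc));
        resDesc.resType = cudaResourceTypeLinear;
        resDesc.res.linear.devPtr = d_in;
        resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
        resDesc.res.linear.sizeInBytes = n * sizeof(float);

        cudaTextureDesc texDesc;
        memset(&texDesc, 0, sizeof(texDesc));
        texDesc.readMode = cudaReadModeElementType;

        cudaTextureObject_t tex;
        cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
        return tex;
    }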

  22. Shared Memory • Each block has its own shared memory. • Enables fast communication between threads in a block. • Used to hold data that will be read and written by multiple threads. • Reduces memory latency. • Access to it is very fast.

  23. Shared memory __shared__ • Shared memory is on the GPU chip and very fast. • A separate copy of the data is available to all threads in one block. • Declared inside function bodies. • So each block would have its own array A[N].
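A minimal sketch of the per-block array A[N] (N and the kernel name assumed): each block gets its own copy, and __syncthreads() makes the writes visible before other threads read them:

    #define N 256

    __global__ void reverseInBlock(float *d_data) {
        __shared__ float A[N];            // one copy of A per block
        int t = threadIdx.x;
        int i = blockIdx.x * N + t;

        A[t] = d_data[i];                 // each thread writes one element
        __syncthreads();                  // wait until the whole block has written

        d_data[i] = A[N - 1 - t];         // read a value written by another thread
    }
    // Launch with block size N: reverseInBlock<<<numBlocks, N>>>(d_data);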

  24. Registers: • Registers are the fastest memory on the GPU. • Unlike a CPU, a GPU has thousands of registers. • Per-thread register usage limits how many threads can be resident per block.

  25. Registers • The compiler will place variables declared in a kernel in registers when possible. • There is a limit to the number of registers. • Registers are divided across warps (groups of 32 threads that operate in SIMT mode) and have the lifetime of the warp: __global__ void kernel() { int x, y, z; … }

  26. Overview of all Memories

  27. Declaration:
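Since the declaration slide is an image, the usual qualifier summary can be sketched in plain text (scopes and lifetimes as commonly tabulated for CUDA):

    // Declaration                    Memory     Scope    Lifetime
    // int v;                         register   thread   kernel
    // int a[N]; (runtime-indexed)    local      thread   kernel
    // __shared__ int s;              shared     block    kernel
    // __device__ int g;              global     grid     application
    // __constant__ int c;            constant   grid     application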

  28. So, how does this GPU actually work? • Data transfer and communication between host and GPU

  29. Processing Flow:
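An end-to-end sketch of this flow (names and sizes assumed): allocate on both sides, copy host to device, launch, copy back, free:

    #include <cuda_runtime.h>
    #include <cstdlib>

    __global__ void scale(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) h[i] = (float)i;

        float *d;
        cudaMalloc(&d, bytes);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // 1. host -> device
        scale<<<(n + 255) / 256, 256>>>(d, n);             // 2. launch kernel
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // 3. device -> host

        cudaFree(d);
        free(h);
        return 0;
    }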

  30. Techniques to optimize • Coalescing data transfers to and from global memory • Shared memory bank conflicts • Performance of L1 cache settings • Optimization through CUDA Streams (see the sketch below)
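The first three techniques are expanded on the slides below; streams are not, so here is a hedged sketch of copy/compute overlap with CUDA streams, reusing the scale kernel and the h/d buffers from the processing-flow sketch above (h would need to be pinned via cudaMallocHost for the async copies to actually overlap):

    // Split the work into chunks and pipeline copy/compute across two streams.
    cudaStream_t s[2];
    for (int i = 0; i < 2; i++) cudaStreamCreate(&s[i]);

    int chunk = n / 2;
    for (int i = 0; i < 2; i++) {
        int off = i * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        scale<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < 2; i++) cudaStreamDestroy(s[i]);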

  31. Coalescing data transfers to and from global memory

  32. Memory Coalescing • A coalesced memory transaction is one in which all of the threads in a half-warp access global memory at the same time. • This is oversimplified, but the practical rule is simply to have consecutive threads access consecutive memory addresses.
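An illustrative contrast (kernel names assumed): in the first kernel a warp's 32 threads read 32 consecutive floats, which the hardware combines into a few transactions; in the second, a stride of 32 scatters the accesses so they cannot coalesce:

    // Coalesced: thread i reads element i, so a warp reads consecutive floats.
    __global__ void coalesced(float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: thread i reads element i*32, so a warp's accesses are scattered
    // across many memory segments and cannot be combined.
    __global__ void strided(float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * 32 < n) out[i] = in[i * 32];
    }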

  33. Shared memory bank conflicts

  34. Memory Banks The device can fetch A[0], A[1], A[2], A[3] … A[B-1] at the same time, where B is the number of banks.
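A classic illustration of a conflict and the padding fix (assuming the common configuration of 32 banks of 4-byte words; names assumed): reading down a column of a 32-wide shared tile hits one bank 32 times, while padding rows to 33 elements spreads the column across all banks:

    __global__ void columnRead(float *out) {
        // 32-way conflict: a warp reads down column 0; addresses differ by
        // 32 words, so they all map to the same bank and serialize.
        __shared__ float tile[32][32];
        tile[threadIdx.y][threadIdx.x] = threadIdx.x;   // fill the tile
        __syncthreads();
        out[threadIdx.y * 32 + threadIdx.x] = tile[threadIdx.x][0];
    }

    __global__ void columnReadPadded(float *out) {
        // Fix: pad each row by one element (33 columns) so consecutive rows
        // of a column fall in different banks.
        __shared__ float tile[32][33];
        tile[threadIdx.y][threadIdx.x] = threadIdx.x;
        __syncthreads();
        out[threadIdx.y * 32 + threadIdx.x] = tile[threadIdx.x][0];
    }
    // Launch with dim3 block(32, 32): columnRead<<<1, block>>>(d_out);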
