
CUDA Lecture 4 CUDA Programming Basics


Presentation Transcript


  1. CUDA Lecture 4: CUDA Programming Basics Prepared 6/22/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

  2. Parallel Programming Basics • Things we need to consider: • Control • Synchronization • Communication • Parallel programming languages offer different ways of dealing with the above CUDA Programming Basics – Slide 2

  3. Overview • CUDA programming model – basic concepts and data types • CUDA application programming interface - basic • Simple examples to illustrate basic concepts and functionalities • Performance features will be covered later CUDA Programming Basics – Slide 3

  4. Outline of CUDA Basics • Basic kernels and execution on GPU • Basic memory management • Coordinating CPU and GPU execution • See the programming guide for the full API CUDA Programming Basics – Slide 4

  5. CUDA – C with no shader limitations! • Integrated host + device application program in C • Serial or modestly parallel parts in host C code • Highly parallel parts in device SPMD kernel C code • Programming model • Parallel code (kernel) is launched and executed on a device by many threads • Launches are hierarchical • Threads are grouped into blocks • Blocks are grouped into grids • Familiar serial code is written for a thread • Each thread is free to execute a unique code path • Built-in thread and block ID variables CUDA Programming Basics – Slide 5

  6. CUDA – C with no shader limitations! [Figure: execution alternates between serial host code and parallel device kernels – serial code, then KernelA<<< nBlk, nTid >>>(args); on the device, more serial code, then KernelB<<< nBlk, nTid >>>(args);] CUDA Programming Basics – Slide 6

  7. CUDA Devices and Threads • A compute device • Is a coprocessor to the CPU or host • Has its own DRAM (device memory) • Runs many threads in parallel • Is typically a GPU but can also be another type of parallel processing device • Data-parallel portions of an application are expressed as device kernels which run on many threads CUDA Programming Basics – Slide 7

  8. CUDA Devices and Threads • Differences between GPU and CPU threads • GPU threads are extremely lightweight • Very little creation overhead • GPU needs 1000s of threads for full efficiency • Multi-core CPU needs only a few CUDA Programming Basics – Slide 8

  9. G80 – Graphics Mode [Figure: the G80 graphics pipeline – host, input assembler, setup/raster/ZCull, vertex/geometry/pixel thread issue, and a thread processor managing arrays of streaming processors (SP) with texture filter (TF) units and L1 caches, backed by L2 caches and frame buffer (FB) partitions] • The future of GPUs is programmable processing • So – build the architecture around the processor CUDA Programming Basics – Slide 9

  10. G80 CUDA Mode – A Device Example [Figure: host, input assembler and thread execution manager feeding an array of processors, each group with a parallel data cache and texture unit, with load/store paths to global memory] • Processors execute computing threads • New operating mode/hardware interface for computing CUDA Programming Basics – Slide 10

  11. High Level View [Figure: the GPU contains multiple multiprocessors, each with shared memory (SMEM), all attached to global memory; the GPU connects to the CPU and chipset over PCIe] CUDA Programming Basics – Slide 11

  12. Blocks of Threads Run on an SM [Figure: a single thread runs on a streaming processor, using its own registers and per-thread memory; a thread block runs on a streaming multiprocessor and shares per-block shared memory (SMEM)] CUDA Programming Basics – Slide 12

  13. Whole Grid Runs on GPU [Figure: many blocks of threads spread across the multiprocessors (each with its SMEM), all sharing global memory] CUDA Programming Basics – Slide 13

  14. Extended C CUDA Programming Basics – Slide 14
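
The syntax examples from this slide are not in the transcript; the sketch below (illustrative, not the original slide code) shows the main C extensions introduced so far: declaration qualifiers, built-in index variables, a barrier intrinsic, and the kernel-launch syntax.

    __device__ float filter[64];            // qualifier: variable lives in device global memory

    __global__ void convolve(float *image)  // qualifier: kernel function, launched from the host
    {
        __shared__ float region[128];       // qualifier: per-block shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // built-in index variables
        region[threadIdx.x] = image[i];
        __syncthreads();                    // intrinsic: barrier for all threads in the block
        image[i] = region[threadIdx.x] * filter[threadIdx.x % 64];
    }

    // Host side: the <<< >>> execution-configuration extension
    // convolve<<<100, 128>>>(d_image);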

  15. Extended C [Figure: the CUDA compilation flow – integrated source (foo.cu) passes through cudacc (EDG C/C++ frontend); the CPU host code (foo.cpp) is compiled by gcc / cl, while the device code goes through the Open64 global optimizer to GPU assembly (foo.s) and then OCG to G80 SASS (foo.sass). Source: Mark Murphy, “NVIDIA’s Experience with Open64,” www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc] CUDA Programming Basics – Slide 15

  16. Arrays of Parallel Threads [Figure: an array of threads with threadID 0 1 2 3 4 5 6 7 …, each executing the same fragment: float x = input[threadID]; float y = func(x); output[threadID] = y;] • A CUDA kernel is executed by an array of threads • All threads run the same code (SPMD) • Each thread has an ID that it uses to compute memory addresses and make control decisions CUDA Programming Basics – Slide 16

  17. Thread Blocks: Scalable Cooperation [Figure: the thread array split into Thread Block 0, Thread Block 1, …, Thread Block N - 1; within each block the threads (threadID 0 … 7) execute the same fragment: float x = input[threadID]; float y = func(x); output[threadID] = y;] • Divide monolithic thread array into multiple blocks • Threads within a block cooperate via shared memory, atomic operations and barrier synchronization • Threads in different blocks cannot cooperate CUDA Programming Basics – Slide 17

  18. Thread Hierarchy • Threads launched for a parallel section are partitioned into thread blocks • Grid = all blocks for a given launch • Thread block is a group of threads that can • Synchronize their executions • Communicate via shared memory CUDA Programming Basics – Slide 18

  19. Blocks Must Be Independent • Any possible interleaving of blocks should be valid • Presumed to run to completion without preemption • Can run in any order • Can run concurrently OR sequentially • Blocks may coordinate but not synchronize • Shared queue pointer: OK • Shared lock: BAD … can easily deadlock • Independence requirement gives scalability CUDA Programming Basics – Slide 19

  20. Basics of CUDA Programming • A CUDA program has two pieces • Host code on the CPU which interfaces to the GPU • Kernel code which runs on the GPU • At the host level, there is a choice of 2 APIs (Application Programming Interfaces): • Runtime: simpler, more convenient • Driver: much more verbose, more flexible, closer to OpenCL • We will only use the Runtime API in this course CUDA Programming Basics – Slide 20

  21. Basics of CUDA Programming • At the host code level, there are library routines for: • memory allocation on graphics card • data transfer to/from device memory • constants • texture arrays (useful for lookup tables) • ordinary data • error-checking • timing • There is also a special syntax for launching multiple copies of the kernel process on the GPU. CUDA Programming Basics – Slide 21

  22. Block IDs and Thread IDs • Each thread uses IDs to decide what data to work on • Block ID: 1-D or 2-D • Unique within a grid • Thread ID: 1-D, 2-D or 3-D • Unique within a block • Dimensions set at launch • Can be unique for each grid CUDA Programming Basics – Slide 22

  23. Block IDs and Thread IDs • Built-in variables • threadIdx, blockIdx • blockDim, gridDim • Simplifies memory addressing when processing multidimensional data • Image processing • Solving PDEs on volumes • … CUDA Programming Basics – Slide 23
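
As an illustration (not from the slide), the usual pattern is to combine these built-ins into one global index so that each thread addresses its own array element:

    __global__ void scale(float *data, float alpha, int n)
    {
        // global index = block offset + position within the block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)              // guard: the last block may be only partly used
            data[i] = alpha * data[i];
    }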

  24. Basics of CUDA Programming • In its simplest form a kernel launch looks like: kernel_routine<<<gridDim, blockDim>>>(args); where • gridDim is the number of copies of the kernel (the “grid” size) • blockDim is the number of threads within each copy (the “block” size) • args is a limited number of arguments, usually mainly pointers to arrays in graphics memory, and some constants which get copied by value • The more general form allows gridDim and blockDim to be 2-D or 3-D to simplify application programs CUDA Programming Basics – Slide 24

  25. Basics of CUDA Programming • At the lower level, when one copy of the kernel is started on an SM it is executed by a number of threads, each of which knows about: • some variables passed as arguments • pointers to arrays in device memory (also arguments) • global constants in device memory • shared memory and private registers/local variables • some special variables: • gridDim size (or dimensions) of grid of blocks • blockIdx index (or 2-D/3-D indices) of block • blockDim size (or dimensions) of each block • threadIdx index (or 2-D/3-D indices) of thread CUDA Programming Basics – Slide 25

  26. Basics of CUDA Programming • Suppose we have 1000 blocks, and each one has 128 threads – how does it get executed? • On current Tesla hardware, would probably get 8 blocks running at the same time on each SM, and each block has 4 warps => 32 warps running on each SM • Each clock tick, SM warp scheduler decides which warp to execute next, choosing from those not waiting for • data coming from device memory (memory latency) • completion of earlier instructions (pipeline delay) • Programmer doesn’t have to worry about this level of detail, just make sure there are lots of threads / warps CUDA Programming Basics – Slide 26

  27. Basics of CUDA Programming • In the simplest case, we have a 1-D grid of blocks, and a 1-D set of threads within each block. • If we want to use a 2-D set of threads, then blockDim.x, blockDim.y give the dimensions, and threadIdx.x, threadIdx.y give the thread indices • To launch the kernel we would use something like dim3 nthreads(16,4); my_new_kernel<<<nblocks,nthreads>>>(d_x); where dim3 is a special CUDA datatype with 3 components .x, .y, .z each initialized to 1. CUDA Programming Basics – Slide 27

  28. For Example • Launch with dim3 dimGrid(2, 2); dim3 dimBlock(4, 2, 2); kernelFunc<<<dimGrid,dimBlock>>>(…); • Zoomed in on block with blockIdx.x = blockIdx.y = 1, blockDim.x = 4, blockDim.y = blockDim.z = 2 • Each thread in block has coordinates (threadIdx.x, threadIdx.y, threadIdx.z) CUDA Programming Basics – Slide 28

  29. Basics of CUDA Programming • A similar approach is used for 3-D threads and/or 2-D grids. This can be very useful in 2-D / 3-D finite difference applications. • How do 2-D / 3-D threads get divided into warps? • 1-D thread ID defined by threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y and this is then broken up into warps of size 32. CUDA Programming Basics – Slide 29
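
For reference, a small device helper (not in the original slide) that evaluates the formula above:

    __device__ int flatThreadId(void)
    {
        // 1-D ID used to assign threads to warps of 32
        return threadIdx.x
             + threadIdx.y * blockDim.x
             + threadIdx.z * blockDim.x * blockDim.y;
    }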

  30. CUDA Memory Model Overview • Global memory • Main means of communicating R/W data between host and device • Contents visible to all threads • Long latency access • We will focus on global memory for now • Constant and texture memory will come later [Figure: a grid of blocks, each block containing shared memory and per-thread registers; the host and all threads read and write global memory] CUDA Programming Basics – Slide 30

  31. Memory Model [Figure: sequential kernels (Kernel 0, Kernel 1, …) all read and write the same per-device global memory] CUDA Programming Basics – Slide 31

  32. CUDA API Highlights: Easy and Lightweight • The API is an extension to the ANSI C programming language • Low learning curve • The hardware is designed to enable lightweight runtime and driver • High performance CUDA Programming Basics – Slide 32

  33. Memory Spaces • CPU and GPU have separate memory spaces • Data is moved across the PCIe bus • Use functions to allocate/set/copy memory on GPU • Very similar to corresponding C functions • Pointers are just addresses • Can’t tell from the pointer value whether the address is on CPU or GPU • Must exercise care when dereferencing • Dereferencing CPU pointer on GPU will likely crash and vice-versa CUDA Programming Basics – Slide 33

  34. CUDA Device Memory Allocation • cudaMalloc() • Allocates object in the device global memory • Requires two parameters • Address of a pointer to the allocated object • Size of allocated object • cudaFree() • Frees objects from device global memory • Pointer to freed object [Figure: the memory-model diagram – grid, blocks with shared memory and registers, host and global memory; cudaMalloc()/cudaFree() manage global memory] CUDA Programming Basics – Slide 34

  35. CUDA Device Memory Allocation • Code example • Allocate a 64-by-64 single precision float array • Attach the allocated storage to Md • “d” is often used to indicate a device data structure CUDA Programming Basics – Slide 35
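
The code itself is not reproduced in this transcript; one way the example might look (a sketch under those assumptions: a 64-by-64 float array attached to Md) is:

    int width = 64;
    float *Md;
    int size = width * width * sizeof(float);

    cudaMalloc((void**)&Md, size);   // pass the address of the pointer and the size in bytes
    // ... use Md from kernels ...
    cudaFree(Md);                    // release the device allocation when done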

  36. CUDA Host-Device Data Transfer • cudaMemcpy() • Memory data transfer • Requires four parameters • Pointer to destination • Pointer to source • Number of bytes copied • Type of transfer • Host to host • Host to device • Device to host • Device to device • Asynchronous transfer [Figure: the memory-model diagram – cudaMemcpy() moves data between host memory and device global memory] CUDA Programming Basics – Slide 36

  37. Memory Model [Figure: cudaMemcpy() moves data between host memory and the memory of device 0 or device 1] • cudaMemcpy() • Returns after the copy is complete • Blocks the CPU thread until all bytes have been copied • Doesn’t start copying until previous CUDA calls complete • Non-blocking copies are also available CUDA Programming Basics – Slide 37

  38. CUDA Host-Device Data Transfer • Code example • Transfer a 64-by-64 single precision float array • M is in host memory and Md is in device memory • cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost and cudaMemcpyDeviceToDevice are symbolic constants CUDA Programming Basics – Slide 38
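
Again the slide’s code is not in the transcript; a sketch consistent with the description (M on the host, Md on the device, size bytes) would be:

    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);   // destination, source, byte count, direction
    // ... kernels operate on Md ...
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);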

  39. First Simple CUDA Example CUDA Programming Basics – Slide 39
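
The example’s source code is not captured in this transcript. A minimal stand-in that exercises the same pieces (allocation, launch, copy-back) might look like this; the kernel and variable names are illustrative only.

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void fill(int *out)
    {
        // each thread records its own global index
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = i;
    }

    int main(void)
    {
        const int nblocks = 2, nthreads = 4, n = nblocks * nthreads;
        int h_out[n];
        int *d_out;

        cudaMalloc((void**)&d_out, n * sizeof(int));            // device allocation
        fill<<<nblocks, nthreads>>>(d_out);                     // asynchronous kernel launch
        cudaMemcpy(h_out, d_out, n * sizeof(int),
                   cudaMemcpyDeviceToHost);                     // blocking copy back to the host

        for (int i = 0; i < n; i++) printf("%d ", h_out[i]);    // prints: 0 1 2 3 4 5 6 7
        printf("\n");
        cudaFree(d_out);
        return 0;
    }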

  40. Code Executed on GPU • C/C++ with some restrictions • Can only access GPU memory • No variable number of arguments • No static variables • No recursion • No dynamic polymorphism • Must be declared with a qualifier • __global__ : launched by CPU, cannot be called from GPU • __device__ : called from other GPU functions, cannot be called by the CPU • __host__ : can be called by the CPU CUDA Programming Basics – Slide 40

  41. CUDA Function Declarations • __global__ defines a kernel function • Must return void • __device__ and __host__ can be used together • Sample use: overloading operators CUDA Programming Basics – Slide 41
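
In code form (an illustrative summary, not the original slide’s table):

    __device__ float deviceFunc(float x);  // executed on the device, callable only from device code
    __global__ void  kernelFunc(float *d); // executed on the device, launched from host code; must return void
    __host__   float hostFunc(float x);    // executed on the host (the default for unqualified functions)

    // __host__ and __device__ combined: one definition compiled for both sides,
    // e.g. handy when overloading operators for use in host and kernel code alike.
    __host__ __device__ float square(float x) { return x * x; }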

  42. CUDA Function Declarations __device__ int reduction_lock = 0; • The __device__ prefix tells nvcc this is a global variable in the GPU, not the CPU. • The variable can be read and modified by any kernel • Its lifetime is the lifetime of the whole application • Can also declare arrays of fixed size • Can be read/written by host code using special routines cudaMemcpyToSymbol, cudaMemcpyFromSymbol or with standard cudaMemcpy in combination with cudaGetSymbolAddress CUDA Programming Basics – Slide 42
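
A sketch of host-side access to such a variable (the helper functions here are illustrative, not from the slide):

    __device__ int reduction_lock = 0;     // global variable in device memory

    void resetLock(void)
    {
        int zero = 0;
        // copy from host memory into the device symbol
        cudaMemcpyToSymbol(reduction_lock, &zero, sizeof(int));
    }

    int readLock(void)
    {
        int value;
        // copy from the device symbol back to host memory
        cudaMemcpyFromSymbol(&value, reduction_lock, sizeof(int));
        return value;
    }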

  43. CUDA Function Declarations • __device__ functions cannot have their address taken • For functions executed on the device • No recursion • No static variable declarations inside the function • No variable number of arguments CUDA Programming Basics – Slide 43

  44. Calling a Kernel Function – Thread Creation • As seen a kernel function must be called with an execution configuration: • Any call to a kernel function is asynchronous from CUDA 1.0 on, explicit synch needed for blocking CUDA Programming Basics – Slide 44
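
A sketch of such a call (the names and sizes here are illustrative):

    dim3 dimGrid(100, 50);     // 5000 thread blocks
    dim3 dimBlock(4, 8, 8);    // 256 threads per block

    KernelFunc<<<dimGrid, dimBlock>>>(d_arg);   // returns to the host immediately (asynchronous)

    cudaDeviceSynchronize();   // explicit synchronization: wait for the kernel to finish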

  45. Basics of CUDA Programming • The kernel code looks fairly normal once you get used to two things: • code is written from the point of view of a single thread • quite different to OpenMP multithreading • similar to MPI, where you use the MPI “rank” to identify the MPI process • all local variables are private to that thread • need to think about where each variable lives • any operation involving data in the device memory forces its transfer to/from registers in the GPU • no cache on old hardware so a second operation with the same data will force a second transfer • often better to copy the value into a local register variable CUDA Programming Basics – Slide 45

  46. Next CUDA Example: Vector Addition CUDA Programming Basics – Slide 46
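
The kernel’s source is not in the transcript; a plausible reconstruction matching the description on the next slide is:

    __global__ void vecAdd(const float *A, const float *B, float *C)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique index per thread
        C[i] = A[i] + B[i];                              // each thread sets one element of C
    }

    // Host launch, assuming the vector length N is a multiple of the block size:
    // vecAdd<<<N / 256, 256>>>(d_A, d_B, d_C);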

  47. Next CUDA Example: Vector Addition • The __global__ identifier says it’s a kernel function • Each thread sets one element of the C[] array • Within each block of threads, threadIdx.x ranges from 0 to blockDim.x-1, so each thread has a unique value for i CUDA Programming Basics – Slide 47

  48. Kernel Variations and Output CUDA Programming Basics – Slide 48
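
The variations shown on this slide are not preserved in the transcript. A commonly used set (assumed here, not taken from the slide) writes a constant, the block index, or the thread index, so the output array makes the grid structure visible:

    __global__ void kernelConst(int *a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        a[i] = 7;              // every element the same
    }

    __global__ void kernelBlock(int *a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        a[i] = blockIdx.x;     // one value per block
    }

    __global__ void kernelThread(int *a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        a[i] = threadIdx.x;    // position within the block
    }

    // With 4 blocks of 4 threads the three kernels produce, respectively:
    // 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
    // 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3
    // 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3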

  49. Next CUDA Example: Kernel with 2-D Addressing CUDA Programming Basics – Slide 49
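
The code is again missing from the transcript; a sketch of what 2-D addressing typically looks like (illustrative names):

    __global__ void brighten(float *img, int width, int height, float delta)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
        if (x < width && y < height)                     // guard against partial edge blocks
            img[y * width + x] += delta;                 // flatten (x, y) into a row-major offset
    }

    // dim3 block(16, 16);
    // dim3 grid((width + 15) / 16, (height + 15) / 16);
    // brighten<<<grid, block>>>(d_img, width, height, 0.1f);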

  50. A Simple Running Example: Matrix Multiplication • A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs • Leave shared memory usage until later • Local, register usage • Thread ID usage • Memory data transfer API between host and device • Assume square matrix for simplicity CUDA Programming Basics – Slide 50
