Introduction to GPU Programming Volodymyr (Vlad) Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT)
Tutorial Goals • Become familiar with NVIDIA GPU architecture • Become familiar with the NVIDIA GPU application development flow • Be able to write and run a simple NVIDIA GPU application in CUDA
Tutorial Outline
• Introduction (15 minutes)
  • Why use Graphics Processing Units (GPUs) for general-purpose computing
  • Modern GPU architecture
    • NVIDIA
  • GPU programming
    • Libraries, CUDA C, OpenCL, PGI x64+GPU
• Hands-on: getting started with NCSA GPU cluster (15 minutes)
  • Cluster architecture overview
  • How to login and check out a node
  • How to compile and run an existing application
• Hands-on: Anatomy of a GPU application (25 minutes)
  • Host side
  • Device side
  • CUDA programming model
• Hands-on: Porting matrix multiplier to GPU (25 minutes)
• More on CUDA programming (40 minutes)
Introduction • Why use Graphics Processing Units (GPUs) for general-purpose computing • Modern GPU architecture • NVIDIA • GPU programming • Libraries • CUDA C • OpenCL • PGI x64+GPU
Why GPUs? Raw Performance Trends
[Graph: raw performance trends across NVIDIA GPU generations (5800, 5950 Ultra, 6800 Ultra, 7800 GTX); graph is courtesy of NVIDIA]
Why GPUs? Memory Bandwidth Trends
[Graph: memory bandwidth trends across NVIDIA GPU generations (5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285); graph is courtesy of NVIDIA]
GPU vs. CPU Silicon Use
[Graph is courtesy of NVIDIA]
NVIDIA GPU Architecture • A scalable array of multithreaded Streaming Multiprocessors (SMs), each SM consists of • 8 Scalar Processor (SP) cores • 2 special function units for transcendentals • A multithreaded instruction unit • On-chip shared memory • GDDR3 SDRAM • PCIe interface Figure is courtesy of NVIDIA
NVIDIA GeForce 9400M G GPU
• 16 streaming processors arranged as 2 streaming multiprocessors
• At 0.8 GHz this provides
  • 54 GFLOPS in single-precision (SP)
• 128-bit interface to off-chip GDDR3 memory
  • 21 GB/s bandwidth
[Block diagram: one TPC with a geometry controller and SMC, two SMs (each with I-cache, MT issue, C-cache, 8 SPs, 2 SFUs, and shared memory), texture units with texture L1, a 128-bit interconnect, and two ROP/L2 partitions attached to DRAM]
NVIDIA Tesla C1060 GPU
• 240 streaming processors arranged as 30 streaming multiprocessors
• At 1.3 GHz this provides
  • 1 TFLOPS SP
  • 86.4 GFLOPS DP
• 512-bit interface to off-chip GDDR3 memory
  • 102 GB/s bandwidth
[Block diagram: 10 TPCs (TPC 1 … TPC 10), each with a geometry controller, SMC, three SMs (each with I-cache, MT issue, C-cache, 8 SPs, 2 SFUs, and shared memory), and texture units with texture L1; a 512-bit memory interconnect with ROP/L2 partitions attached to eight DRAM channels]
NVIDIA Tesla S1070 Computing Server
• 4 T10 GPUs, each with 4 GB of GDDR3 SDRAM
[Block diagram: four Tesla GPUs paired behind two NVIDIA switches, each switch attached to a PCIe x16 interface; the 1U chassis also houses the power supply, thermal management, and system monitoring. Graph is courtesy of NVIDIA]
GPU Use/Programming • GPU libraries • NVIDIA’s CUDA BLAS and FFT libraries • Many 3rd party libraries • Low abstraction lightweight GPU programming toolkits • CUDA C • OpenCL • High abstraction compiler-based tools • PGI x64+GPU
Getting Started with NCSA GPU Cluster • Cluster architecture overview • How to login and check out a node • How to compile and run an existing application
GPU Cluster Architecture
• Servers: 32
• CPU cores: 128
• Accelerator Units: 32
• GPUs: 128
[Diagram: ac (head node) connected to compute nodes ac01 … ac32]
GPU Cluster Node Architecture
• HP xw9400 workstation
  • AMD Opteron 2216, 2.4 GHz, dual socket, dual core
  • 8 GB DDR2
  • InfiniBand QDR
• S1070 1U GPU Computing Server
  • 1.3 GHz Tesla T10 processors
  • 4x4 GB GDDR3 SDRAM
[Diagram: the compute node's HP xw9400 workstation connects to the cluster over QDR InfiniBand and to the Tesla S1070 over two PCIe x16 links; each of the S1070's PCIe interfaces serves two T10 GPUs with their own DRAM]
Accessing the GPU Cluster
• Use a Secure Shell (SSH) client to access AC
• `ssh USER@ac.ncsa.uiuc.edu` (User: gpu001 - gpu200; Password: CHiPS-09)
• You will see something like this printed out:

See machine details and a technical report at:
http://www.ncsa.uiuc.edu/Projects/GPUcluster/

Machine Description and HOW TO USE. See: /usr/local/share/ac.readme
CUDA wrapper readme: /usr/local/share/cuda_wrapper.readme

*IMPORTANT* If you are using multiple GPU devices per host, be sure to understand how the cuda_wrapper changes this system!!
…
July 03, 2009
Nvidia compute exclusive mode made default. If this breaks your application, "touch ~/FORCE_NORMAL" to create an override for all your jobs.

Questions? Contact Jeremy Enos jenos@ncsa.uiuc.edu

[gpuXYZ@ac ~]$ _
Installing Tutorial Examples
• Run this sequence to retrieve and install the tutorial examples:

cd
cp /tmp/chips_tutorial.tgz .
tar -xvzf chips_tutorial.tgz
cd chips_tutorial
ls
src1 src2 src3
Accessing the GPU Cluster
[Diagram: laptops 1–30 connect to ac (the head node, "you are here"); compute nodes ac01 … ac32 sit behind it]
Requesting a Cluster Node for Interactive Use
• Run `qstat` to see what other users are doing, just for the fun of it
• Run `qsub -I -l walltime=02:00:00` to request a node with a single GPU for 2 hours of interactive use
• You will see something like this printed out:

qsub: waiting for job 64424.acm to start
qsub: job 64424.acm ready

[gpuXYZ@acAB ~]$ _
Requesting a Cluster Node
[Diagram: same cluster layout; after qsub, "you are here" moves from the head node ac to one of the compute nodes ac01 … ac32]
Checking GPU Characteristics
• Run `deviceQuery`

CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA

Device 0: "Tesla C1060"
  CUDA Capability Major revision number:     1
  CUDA Capability Minor revision number:     3
  Total amount of global memory:             4294705152 bytes
  Number of multiprocessors:                 30
  Number of cores:                           240
  Total amount of constant memory:           65536 bytes
  Total amount of shared memory per block:   16384 bytes
  …
  Clock rate:                                1.30 GHz
  Compute mode:                              Exclusive (only one host thread at a time can use this device)
Compiling and Running an Existing Application
• cd chips_tutorial/src1
  • vecadd.c – reference C implementation
  • vecadd.cu – CUDA implementation
• Compile & run CPU version

gcc vecadd.c -o vecadd_cpu
./vecadd_cpu
Running CPU vecAdd for 16384 elements
C[0]=2147483648.00 ...

• Compile & run GPU version

nvcc vecadd.cu -o vecadd_gpu
./vecadd_gpu
Running GPU vecAdd for 16384 elements
C[0]=2147483648.00 ...
nvcc
• Any source file containing CUDA C language extensions must be compiled with nvcc
• nvcc is a compiler driver that invokes many other tools to accomplish the job
• Basic nvcc usage
  • nvcc <filename>.cu [-o <executable>]
    • Builds release mode
  • nvcc -deviceemu <filename>.cu
    • Builds device emulation mode (all code runs on the CPU)
  • The -g flag builds debug mode for the gdb debugger
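As an example, the tutorial's vecadd.cu could be built in each of these modes as follows (a sketch; the -deviceemu flag belongs to the CUDA toolkit generation used in this tutorial and was removed in later releases):

nvcc vecadd.cu -o vecadd_gpu              # release build
nvcc -deviceemu vecadd.cu -o vecadd_emu   # device emulation: all code runs on the CPU
nvcc -g vecadd.cu -o vecadd_dbg           # debug build for use with gdb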
Anatomy of a GPU Application • Host side • Device side • CUDA programming model
CPU-Only Version

void vecAdd(int N, float* A, float* B, float* C)   // computational kernel
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

int main(int argc, char **argv)
{
    int N = 16384; // default vector size

    // memory allocation
    float *A = (float*)malloc(N * sizeof(float));
    float *B = (float*)malloc(N * sizeof(float));
    float *C = (float*)malloc(N * sizeof(float));

    vecAdd(N, A, B, C); // kernel invocation

    // memory de-allocation
    free(A); free(B); free(C);
}
Adding GPU support

int main(int argc, char **argv)
{
    int N = 16384; // default vector size
    float *A = (float*)malloc(N * sizeof(float));
    float *B = (float*)malloc(N * sizeof(float));
    float *C = (float*)malloc(N * sizeof(float));

    // memory allocation on the GPU card
    float *devPtrA, *devPtrB, *devPtrC;
    cudaMalloc((void**)&devPtrA, N * sizeof(float));
    cudaMalloc((void**)&devPtrB, N * sizeof(float));
    cudaMalloc((void**)&devPtrC, N * sizeof(float));

    // copy data from the CPU (host) memory to the GPU (device) memory
    cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(devPtrB, B, N * sizeof(float), cudaMemcpyHostToDevice);
Adding GPU support

    // kernel invocation
    vecAdd<<<N/512, 512>>>(devPtrA, devPtrB, devPtrC);

    // copy results from device memory to the host memory
    cudaMemcpy(C, devPtrC, N * sizeof(float), cudaMemcpyDeviceToHost);

    // device memory de-allocation
    cudaFree(devPtrA); cudaFree(devPtrB); cudaFree(devPtrC);

    free(A); free(B); free(C);
}
GPU Kernel
• CPU version

void vecAdd(int N, float* A, float* B, float* C)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

• GPU version

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}
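The GPU version above relies on N being an exact multiple of the block size (here 16384 = 32 × 512), so every launched thread has valid work. When that cannot be guaranteed, a common variant (a sketch, not part of the tutorial's vecadd.cu) passes N to the kernel and guards the store:

__global__ void vecAdd(int N, float* A, float* B, float* C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)                // the last block may contain threads past the end of the arrays
        C[i] = A[i] + B[i];
}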
CUDA Programming Model
• A CUDA kernel is executed by an array of threads
• All threads run the same code (SPMD)
• Each thread has an ID that it uses to compute memory addresses and make control decisions

    …
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
    …

• Threads are arranged as a grid of thread blocks
• Threads within a block have access to a segment of shared memory
[Diagram: a grid of thread blocks 0 … N-1, each block with its own shared memory]
Kernel Invocation Syntax

vecAdd<<<32, 512>>>(devPtrA, devPtrB, devPtrC);

• <<<32, 512>>> gives the grid and thread block dimensionality: 32 thread blocks in the grid and 512 threads per block
• Inside the kernel, each thread computes its global index from its block ID within the grid, the number of threads per block, and its thread ID within the thread block:

int i = blockIdx.x * blockDim.x + threadIdx.x;

[Diagram: a grid of thread blocks 0 … N-1, each block with its own shared memory]
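When N is not an exact multiple of the block size, the grid dimension is usually rounded up with integer ceiling division and paired with the guarded kernel variant sketched after the GPU Kernel slide (again an illustration, not the tutorial code):

int threadsPerBlock = 512;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;   // round up
vecAdd<<<blocksPerGrid, threadsPerBlock>>>(N, devPtrA, devPtrB, devPtrC);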
Mapping Threads to the Hardware
• Blocks of threads are transparently assigned to SMs
• A block of threads executes on one SM & does not migrate
• Several blocks can reside concurrently on one SM
• Each block can execute in any order relative to other blocks
[Diagram: the same kernel grid of blocks 0–7 is scheduled over time onto two devices with different numbers of SMs; more SMs means more blocks run concurrently]
Slide is courtesy of NVIDIA
CUDA Programming Model
• A kernel is executed as a grid of thread blocks
• A grid of blocks can be 1- or 2-dimensional
• Thread blocks can be 1-, 2-, or 3-dimensional
• Different kernels can have different grid/block configurations
• Threads from the same block have access to a shared memory and their execution can be synchronized
[Diagram: the host launches Kernel 1 on Grid 1, a 3×2 arrangement of blocks (0,0) … (2,1), and Kernel 2 on Grid 2; one block of Grid 2 is expanded to show its 3D array of threads (0,0,0) … (3,1,1)]
Slide is courtesy of NVIDIA
GPU Memory Hierarchy
[Diagram: the host (CPU, chipset, DRAM) connects to the device (GPU). Each multiprocessor has registers, shared memory, and constant and texture caches; device DRAM holds the local, global, constant, and texture memory spaces]
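These memory spaces correspond to distinct declarations in CUDA C. A minimal sketch (the kernel and array names are illustrative, not from the tutorial sources):

__constant__ float coeff[16];            // constant memory, cached; filled from the host with cudaMemcpyToSymbol

__global__ void scale(float *in, float *out)   // 'in' and 'out' point to global memory (device DRAM)
{
    __shared__ float tile[256];          // on-chip shared memory, one copy per thread block (256-thread blocks assumed)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[i];                     // scalars such as 'v' and 'i' live in registers
    tile[threadIdx.x] = v;
    __syncthreads();
    out[i] = tile[threadIdx.x] * coeff[0];
}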
Porting matrix multiplier to GPU
• cd ../chips_tutorial/src2
• Compile & run CPU version

icc -O3 mmult.c -o mmult
./mmult
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
...
msec = 2215478  GFLOPS = 0.969
int main(int argc, char* argv[])
{
    int N = 1024;
    struct timeval t1, t2;
    long msec1, msec2;
    float flop, mflop, gflop;

    float *a = (float *)malloc(N*N*sizeof(float));
    float *b = (float *)malloc(N*N*sizeof(float));
    float *c = (float *)malloc(N*N*sizeof(float));

    minit(a, b, c, N);

    gettimeofday(&t1, NULL);
    mmult(a, b, c, N); // a = b * c
    gettimeofday(&t2, NULL);

    mprint(a, N, 5);
    free(a); free(b); free(c);

    msec1 = t1.tv_sec * 1000000 + t1.tv_usec;
    msec2 = t2.tv_sec * 1000000 + t2.tv_usec;
    msec2 -= msec1;
    flop = N*N*N*2.0f;
    mflop = flop / msec2;
    gflop = mflop / 1000.0f;
    printf("msec = %10ld GFLOPS = %.3f\n", msec2, gflop);
}

// a = b * c
void mmult(float *a, float *b, float *c, int N)
{
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                a[i+j*N] += b[i+k*N]*c[k+j*N];
}

void minit(float *a, float *b, float *c, int N)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++) {
            a[i+N*j] = 0.0f;
            b[i+N*j] = 1.0f;
            c[i+N*j] = 1.0f;
        }
}

void mprint(float *a, int N, int M)
{
    for (int j = 0; j < M; j++) {
        for (int i = 0; i < M; i++)
            printf("%.2f ", a[i+N*j]);
        printf("...\n");
    }
    printf("...\n");
}
Matrix Representation in Memory

for (i = 0; i < n; ++i)
    for (j = 0; j < n; ++j)
        for (k = 0; k < n; ++k)
            a[i+n*j] += b[i+n*k] * c[k+n*j];

• Matrices are stored in column-major order
• For reference, the jki-ordered version runs at 1.7 GFLOPS on a 3 GHz Intel Xeon (single core)
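Column-major order means element (i, j) of an N×N matrix sits at offset i + N*j, so consecutive row indices i are adjacent in memory. A small helper makes the indexing explicit (the idx function is ours, not part of mmult.c):

static inline int idx(int i, int j, int N) { return i + N * j; }   // column-major offset of element (i, j)

for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k)
            a[idx(i, j, n)] += b[idx(i, k, n)] * c[idx(k, j, n)];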
Map this code:

for (i = 0; i < n; ++i)
    for (j = 0; j < n; ++j)
        for (k = 0; k < n; ++k)
            a[i+n*j] += b[i+n*k] * c[k+n*j];

(A = B*C; for example, a(1,2) = b(1,1)*c(1,2) + b(1,2)*c(2,2))

into this (logical) architecture:

[Diagram: a grid of thread blocks; within block blockIdx.x, threads are numbered threadIdx.x = 0 … blockDim.x-1, and the global thread index blockIdx.x * blockDim.x + threadIdx.x runs 0, 1, 2, … across the grid]
A 32x1024 grid of thread blocks, each a block of 32x1x1 threads:

dim3 grid(1024/32, 1024);   // 32x1024 grid of thread blocks, indexed by (blockIdx.x, blockIdx.y)
dim3 threads(32);           // blocks of 32x1x1 threads, indexed by (threadIdx.x)

{
    int i = blockIdx.x*32 + threadIdx.x;
    int j = blockIdx.y;
    float sum = 0.0f;
    for (int k = 0; k < n; k++)
        sum += b[i+n*k] * c[k+n*j];
    a[i+n*j] = sum;
}

// 32 thread blocks of 32 threads each cover the i dimension; 1024 thread blocks cover the j dimension
Kernel

Original CPU kernel:

void mmult(float *a, float *b, float *c, int N)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                a[i+j*N] += b[i+k*N]*c[k+j*N];
}

GPU kernel:

__global__ void mmult(float *a, float *b, float *c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y;
    float sum = 0.0;
    for (int k = 0; k < N; k++)
        sum += b[i+N*k] * c[k+N*j];
    a[i+N*j] = sum;
}

Kernel invocation:

dim3 dimGrid(32, 1024);
dim3 dimBlock(32);
mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N);
int main(int argc, char* argv[])
{
    int N = 1024;
    struct timeval t1, t2;
    long msec1, msec2;
    float flop, mflop, gflop;

    float *a = (float *)malloc(N*N*sizeof(float));
    float *b = (float *)malloc(N*N*sizeof(float));
    float *c = (float *)malloc(N*N*sizeof(float));

    minit(a, b, c, N);

    // allocate device memory
    float *devPtrA, *devPtrB, *devPtrC;
    cudaMalloc((void**)&devPtrA, N*N*sizeof(float));
    cudaMalloc((void**)&devPtrB, N*N*sizeof(float));
    cudaMalloc((void**)&devPtrC, N*N*sizeof(float));

    // copy input arrays to the device memory
    cudaMemcpy(devPtrB, b, N*N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(devPtrC, c, N*N*sizeof(float), cudaMemcpyHostToDevice);
    gettimeofday(&t1, NULL);
    msec1 = t1.tv_sec * 1000000 + t1.tv_usec;

    // define grid and thread block sizes
    dim3 dimGrid(32, 1024);
    dim3 dimBlock(32);

    // launch GPU kernel
    mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N);

    // check for errors
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "CUDA error: %s.\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // wait until GPU kernel is done
    cudaThreadSynchronize();

    gettimeofday(&t2, NULL);
    msec2 = t2.tv_sec * 1000000 + t2.tv_usec;
    // copy results to host
    cudaMemcpy(a, devPtrA, N*N*sizeof(float), cudaMemcpyDeviceToHost);

    mprint(a, N, 5);

    // free device memory
    cudaFree(devPtrA); cudaFree(devPtrB); cudaFree(devPtrC);
    free(a); free(b); free(c);

    msec2 -= msec1;
    flop = N*N*N*2.0f;
    mflop = flop / msec2;
    gflop = mflop / 1000.0f;
    printf("msec = %10ld GFLOPS = %.3f\n", msec2, gflop);
}
Porting matrix multiplier to GPU
• Compile & run GPU version

nvcc mmult.cu -o mmult
./mmult
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
...
msec = 91363  GFLOPS = 23.505
More on CUDA Programming
• Language extensions
  • Function type qualifiers
  • Variable type qualifiers
  • Execution configuration
  • Built-in variables
• Common runtime components
  • Built-in vector types
• Device runtime components
  • Intrinsic functions
  • Synchronization and memory fencing functions
  • Atomic functions
• Host runtime components (runtime API only)
  • Device management
  • Memory management
  • Error handling
• Debugging in the device emulation mode
• Exercise
Function Type Qualifiers

                                  Executed on the:   Only callable from the:
__device__ float DeviceFunc()     device             device
__global__ void KernelFunc()      device             host
__host__ float HostFunc()         host               host

• __device__ and __global__ functions do not support recursion, cannot declare static variables inside their body, and cannot have a variable number of arguments
• __device__ functions cannot have their address taken
• __host__ and __device__ qualifiers can be used together, in which case the function is compiled for both
• __global__ and __host__ qualifiers cannot be used together
• A __global__ function must have a void return type, its execution configuration must be specified, and the call is asynchronous
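A short sketch of how the qualifiers combine (the function names are hypothetical):

__host__ __device__ float square(float x) { return x * x; }          // compiled for both CPU and GPU

__device__ float scaled(float x, float s) { return square(x) * s; }  // device-only helper, callable from kernels

__global__ void apply(float *data, float s)   // kernel: void return type, launched from the host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = scaled(data[i], s);
}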
Variable Type Qualifiers
• __shared__ and __constant__ variables have implied static storage
• __device__, __shared__ and __constant__ variables cannot be defined using the extern keyword
• __device__ and __constant__ variables are only allowed at file scope
• __constant__ variables cannot be assigned to from the device; they are initialized from the host only
• __shared__ variables cannot have an initialization as part of their declaration
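A sketch of typical declarations (names are illustrative; assumes 128-thread blocks):

__device__   float devScale = 2.0f;    // file-scope variable in device (global) memory
__constant__ float devTable[32];       // constant memory; written from the host with cudaMemcpyToSymbol

__global__ void kernel(float *out)
{
    __shared__ float buf[128];         // per-block shared memory; no initializer allowed
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = devScale * devTable[threadIdx.x % 32];
    __syncthreads();
    out[i] = buf[threadIdx.x];
}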
Execution Configuration
• Function declared as
  • __global__ void kernel(float* param);
• must be called like this:
  • kernel<<<Dg, Db, Ns, S>>>(param);
• where
  • Dg (type dim3) specifies the dimension and size of the grid, such that Dg.x*Dg.y equals the number of blocks being launched;
  • Db (type dim3) specifies the dimension and size of each block of threads, such that Db.x*Db.y*Db.z equals the number of threads per block;
  • optional Ns (type size_t) specifies the number of bytes of shared memory dynamically allocated per block for this call, in addition to the statically allocated memory
  • optional S (type cudaStream_t) specifies the stream associated with this kernel call
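A sketch of a launch that supplies all four arguments (the kernel and stream names are illustrative):

dim3 Dg(64, 64);                           // 64 x 64 = 4096 blocks in the grid
dim3 Db(16, 16);                           // 16 x 16 = 256 threads per block
size_t Ns = Db.x * Db.y * sizeof(float);   // dynamic shared memory per block, in bytes
cudaStream_t S;
cudaStreamCreate(&S);
kernel<<<Dg, Db, Ns, S>>>(param);          // asynchronous launch into stream S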
Built-in Variables
• gridDim (type dim3): dimensions of the grid
• blockIdx (type uint3): block index within the grid
• blockDim (type dim3): dimensions of the block
• threadIdx (type uint3): thread index within the block
• warpSize (type int): warp size in threads
• It is not allowed to take the address of any of the built-in variables
• It is not allowed to assign values to any of the built-in variables
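A sketch that reads the built-in variables to compute a unique global thread index for a 2D grid of 2D blocks (illustrative only):

__global__ void whoAmI(int *out)
{
    int threadsPerBlock = blockDim.x * blockDim.y;           // threads per block
    int blockId  = blockIdx.x + gridDim.x * blockIdx.y;      // linear block index within the grid
    int threadId = threadIdx.x + blockDim.x * threadIdx.y;   // linear thread index within the block
    out[blockId * threadsPerBlock + threadId] = threadId;
}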
Built-in Vector Types
• Vector types derived from basic integer and float types
  • char1, char2, char3, char4
  • uchar1, uchar2, uchar3, uchar4
  • short1, short2, short3, short4
  • ushort1, ushort2, ushort3, ushort4
  • int1, int2, int3, int4
  • uint1, uint2, uint3 (dim3), uint4
  • long1, long2, long3, long4
  • ulong1, ulong2, ulong3, ulong4
  • longlong1, longlong2
  • float1, float2, float3, float4
  • double1, double2
• They are all structures, like this:

typedef struct { float x, y, z, w; } float4;

• They all come with a constructor function of the form make_<type name>, e.g.:

int2 make_int2(int x, int y);
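A sketch of how the constructors and components are used:

int2   p = make_int2(3, 7);                            // p.x == 3, p.y == 7
float4 v = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
float  len2 = v.x*v.x + v.y*v.y + v.z*v.z + v.w*v.w;   // components are accessed as .x .y .z .w
dim3   grid(32, 1024);                                 // unspecified dim3 components default to 1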