
Lecture 8: CUDA



  1. Lecture 8: CUDA

  2. CUDA • A scalable parallel programming model for GPUs and multicore CPUs • Provides facilities for heterogeneous programming • Allows the GPU to be used as both a graphics processor and a computing processor.

  3. Pollack’s Rule • Performance increase is roughly proportional to the square root of the increase in complexity: performance ∝ √complexity • Power consumption increase is roughly linearly proportional to the increase in complexity: power consumption ∝ complexity
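
  A brief worked example of the trade-off (not on the slide, but following directly from the two proportionalities): quadrupling the complexity of a single core yields roughly √4 = 2× the performance at about 4× the power, whereas four cores of the original complexity can deliver up to 4× the throughput on parallel work for roughly the same 4× power. This is the argument for spending transistors on many simple cores, as GPUs do, rather than on one increasingly complex core.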

  4. CUDA • SPMD (Single Program Multiple Data) Programming Model • Programmer writes code for a single thread and GPU runs thread instances in parallel • Extends C and C++
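
  To make the single-thread programming style concrete, here is a minimal sketch (the function names and the scaling operation are illustrative, not taken from the lecture):

    // Sequential C: one thread loops over all n elements.
    void scale_cpu(float* a, int n, float s)
    {
        for (int i = 0; i < n; i++)
            a[i] = s * a[i];
    }

    // CUDA SPMD: the body is written for a single thread; the GPU runs one
    // thread instance per element, so the loop disappears.
    __global__ void scale_gpu(float* a, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global index of this thread
        if (i < n)                                       // guard: the grid may be rounded up past n
            a[i] = s * a[i];
    }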

  5. CUDA Three key abstractions • A hierarchy of thread groups • Shared memories • Barrier synchronization CUDA provides fine-grained data parallelism and thread parallelism nested within coarse-grained data parallelism and task parallelism

  6. CUDA Kernel: a function designed to be executed by many threads Thread block: a set of concurrent threads that can cooperate among themselves through barrier synchronization and through shared-memory access Grid: a set of thread blocks that execute the same kernel function

  7. CUDA Three key abstractions • A hierarchy of thread groups • Shared memories • Barrier synchronization CUDA provides fine-grained data parallelism and thread parallelism nested within coarse-grained data parallelism and task parallelism

  8. CUDA
  __global__ void mykernel(int a, …)
  {
    ...
  }

  main()
  {
    ...
    nblocks = N / 512;                     // max. 512 threads per block
    mykernel <<< nblocks, 512 >>> (aa, …); // launch nblocks blocks of 512 threads each
    ...
  }

  9. CUDA

  10. CUDA • Thread management is performed by hardware • Max. 512 threads per block • The number of blocks can exceed the number of processors • Blocks execute independently and in any order • Threads of a block can communicate through shared memory • Atomic memory operations are available on global memory
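
  As a small illustration of the last bullet (the histogram kernel and its parameter names are my own, not from the slides), atomicAdd lets many threads update the same global-memory location safely:

    // Each thread classifies one element and atomically increments a shared counter;
    // atomicAdd serializes the conflicting updates to the same bin.
    __global__ void histogram(const int* data, int n, int* bins, int nbins)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[data[i] % nbins], 1);
    }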

  11. CUDA Memory Types Local Memory: private to a thread Shared Memory: shared by all threads of the block __shared__ Device memory: shared by all threads of an application __device__
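
  A hedged sketch of how the three memory types appear in code (the variable names and the computation are illustrative):

    __device__ float d_scale;            // device (global) memory: visible to all threads of the application

    __global__ void scale_tile(const float* in, float* out)
    {
        __shared__ float tile[512];      // shared memory: one copy per thread block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];       // stage data in on-chip shared memory
        __syncthreads();                 // make the whole tile visible to the block
        float tmp = tile[threadIdx.x] * d_scale;   // tmp is local: private to this thread
        out[i] = tmp;
    }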

  12. Pollack’s Rule • Performance increase is roughly proportional to the square root of the increase in complexity: performance ∝ √complexity • Power consumption increase is roughly linearly proportional to the increase in complexity: power consumption ∝ complexity

  13. CUDA Three key abstractions • A hierarchy of thread groups • Shared memories • Barrier synchronization CUDA provides fine-grained data parallelism and thread parallelism nested within coarse-grained data parallelism and task parallelism

  14. CUDA
  __global__ void mykernel(float* a, …)
  {
    ...
  }

  main()
  {
    ...
    int nbytes = N * sizeof(float);
    float* ha = (float*)malloc(nbytes);        // host array
    float* da = 0;
    cudaMalloc((void**)&da, nbytes);           // device array
    cudaMemcpy(da, ha, nbytes, cudaMemcpyHostToDevice);
    mykernel <<< N/blocksize, blocksize >>> (da, …);
    cudaMemcpy(ha, da, nbytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    ...
  }

  15. CUDA Synchronization Barrier: threads wait until all threads in the block arrive at the barrier __syncthreads() When a thread reaches the barrier, it increments the barrier count and the scheduler marks it as waiting. When all threads of the block have arrived at the barrier, the scheduler releases all waiting threads.
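
  A small hedged example of why the barrier matters (the kernel is illustrative): every thread of the block must reach the same __syncthreads() call, so it is placed between the phase that writes shared memory and the phase that reads it, outside any divergent branch:

    __global__ void reverse_block(int* data)
    {
        __shared__ int buf[512];                        // assumes blocks of at most 512 threads
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;
        buf[t] = data[base + t];                        // phase 1: every thread writes one element
        __syncthreads();                                // wait until the whole block has written buf
        data[base + t] = buf[blockDim.x - 1 - t];       // phase 2: read elements written by other threads
    }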

  16. CUDA
  __global__ void shift_reduce(int *inp, int N, int *tot)
  {
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int x[blocksize];            // blocksize: compile-time constant, threads per block
    x[tid] = (i < N) ? inp[i] : 0;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s = s / 2)
    {
      if (tid < s)
        x[tid] += x[tid + s];
      __syncthreads();
    }
    if (tid == 0)
      atomicAdd(tot, x[tid]);
  }
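
  A possible host-side launch for this kernel (a sketch, assuming d_inp already holds the N input integers on the device and blocksize is the compile-time block size used above):

    int nblocks = (N + blocksize - 1) / blocksize;   // round up so every element is covered
    int *d_tot, total;
    cudaMalloc((void**)&d_tot, sizeof(int));
    cudaMemset(d_tot, 0, sizeof(int));               // the kernel accumulates into *tot, so clear it first
    shift_reduce <<< nblocks, blocksize >>> (d_inp, N, d_tot);
    cudaMemcpy(&total, d_tot, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_tot);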

  17. CUDA SPMD (Single Program Multiple Data) programming model • All threads execute the same program • Threads coordinate with barrier synchronization • Threads of a block express fine-grained data parallelism and thread parallelism • Independent blocks of a grid express coarse-grained data parallelism • Independent grids express coarse-grained task parallelism

  18. CUDA

  19. CUDA Scheduler • Hardware management and scheduling of threads and thread blocks • Scheduler has minimal runtime overhead

  20. CUDA Multithreading • Memory and texture fetch latency requires hundreds of processor clock cycles • While one thread is waiting for a load or texture fetch, the processor can execute another thread • Thousands of independent threads can keep many processors busy

  21. CUDA GPU Multiprocessor Architecture • Lightweight thread creation • Zero-overhead thread scheduling • Fast barrier synchronization Each thread has its own • Private registers • Private per-thread memory • PC • Thread execution state These features support very fine-grained parallelism

  22. CUDA

  23. CUDA GPU Multiprocessor Architecture Each SP core • contains scalar integer and floating-point units • is hardware multithreaded and supports up to 64 threads • is pipelined and executes one instruction per thread per clock • has a large register file (RF) of 1024 32-bit registers • has its registers partitioned among the assigned threads (Programs declare their register demand and the compiler optimizes register allocation. Ex: (a) 32 registers per thread => 256 threads per block, or (b) fewer registers – more threads, or (c) more registers – fewer threads)
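
  To spell out the arithmetic behind example (a), assuming the 8 SP cores per multiprocessor of the original Tesla-generation GPUs (the per-multiprocessor core count is not stated on this slide):

    8 SP cores × 1024 registers = 8192 registers per multiprocessor
    8192 registers ÷ 32 registers per thread = 256 threads per block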

  24. CUDA Single Instruction Multiple Thread (SIMT) SIMT: a processor architecture that applies one instruction to multiple independent threads in parallel Warp: the set of parallel threads that execute the same instruction together in a SIMT architecture • Warp size is 32 threads (4 threads per SP, executed in 4 clock cycles) • Threads in a warp start at the same program address, but they can branch and execute independently. • Individual threads may be inactive due to independent branching
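
  A hedged sketch of the independent branching mentioned above (the kernel is illustrative): when threads of one warp take different paths, the SIMT hardware executes each path in turn with the non-participating threads inactive, so a divergent warp pays roughly the cost of both paths:

    __global__ void divergent(const int* in, int* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (in[i] & 1)                   // odd values: part of the warp takes this path...
            out[i] = 3 * in[i] + 1;
        else                             // ...and the rest takes this one; the two groups run serially
            out[i] = in[i] / 2;
    }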

  25. CUDA

  26. CUDA SIMT Warp Execution • There are 4 thread lanes per SP • An issued warp instruction executes in 4 processor cycles • Instruction scheduler selects a warp every 4 clocks • The controller: • Groups threads into warps • Allocates a warp • Allocates registers for the warp threads (it can start a warp only when it can allocate the requested register count) • Starts warp execution • When all warp threads exit, it frees the registers

  27. CUDA Streaming Processor (SP) • Has 1024 32-bit registers (RF) • Can perform 32-bit and 64-bit integer operations: arithmetic, comparison, conversion, logic operations • Can perform 32-bit floating-point operations: add, multiply, min, max, multiply-add, etc. SFU (Special Function Unit) • Pipelined unit • Generates one 32-bit floating-point special-function result per cycle: square root, sin, cos, 2^x, log2 x
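
  As an illustration (the kernel is my own; NVIDIA documents the fast-math intrinsics as being served by the SFU, but treat the exact instruction selection as compiler-dependent):

    // The fast, reduced-precision intrinsics __sinf, __expf and rsqrtf use the
    // SFU pipeline; the plain library calls sinf/expf are more precise but
    // compile to longer instruction sequences.
    __global__ void sfu_demo(const float* x, float* y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = __sinf(x[i]) + __expf(x[i]) + rsqrtf(x[i]);
    }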

  28. CUDA

  29. CUDA Memory System • Global memory – external DRAM • Shared memory – on chip • Per-thread local memory – external DRAM • Constant memory – in external DRAM, cached on chip • Texture memory – in external DRAM, read through on-chip texture caches
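
  A small hedged sketch of constant memory in use (the polynomial and its names are illustrative):

    __constant__ float c_coeff[4];       // small read-only table, served through the constant cache

    __global__ void poly(const float* x, float* y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = ((c_coeff[3] * x[i] + c_coeff[2]) * x[i] + c_coeff[1]) * x[i] + c_coeff[0];
    }

    // Host side, before the launch:
    // cudaMemcpyToSymbol(c_coeff, h_coeff, 4 * sizeof(float));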

  30. Project Performance Measurement, Evaluation and Prediction of Multicore and GPU systems Multicore systems • CPU performance (instruction execution time, pipelining, etc.) • Cache performance • Performance using algorithmic structures GPU systems (NVIDIA-CUDA) • GPU core performance (instruction execution time, pipelining, etc.) • Global and shared memory performance (2) • Performance using algorithmic structures GPU performance in MATLAB environment
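
  For the GPU measurements, one common approach (a sketch, not a requirement of the project; mykernel, da and blocksize are reused from the earlier slides) is to time kernels with CUDA events:

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                    // enqueue start marker before the kernel
    mykernel <<< N/blocksize, blocksize >>> (da, …);
    cudaEventRecord(stop, 0);                     // enqueue stop marker after the kernel
    cudaEventSynchronize(stop);                   // wait for the kernel and the stop marker to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);       // elapsed GPU time in milliseconds
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);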
