Training Program on GPU Programming with CUDA

Presentation Transcript


  1. Training Program on GPU Programming with CUDA 31st July, 7th Aug, 14th Aug 2011 CUDA Teaching Center @ UoM

  2. Day 1, Session 2 CUDA Programming Model CUDA Threads Training Program on GPU Programming with CUDA Sanath Jayasena CUDA Teaching Center @ UoM

  3. Outline for Day 1 Session 2 CUDA Programming Model, CUDA Threads • Data Parallelism • CUDA Program Structure • Memory Model & Data Transfer (Brief) • Kernel Functions & Threading (Discussion with Example: Matrix Multiplication) CUDA Training Program

  4. Data Parallelism • Data Parallelism • A problem/program property • Many arithmetic operations can be safely performed on the data structures simultaneously • Example: matrix multiplication (next slide) • CUDA devices can exploit data parallelism to accelerate execution of applications CUDA Training Program

  5. Example: Matrix Multiplication • P = M · N • Each element in P is computed as the dot product between a row of M and a column of N • All elements in P can be computed independently and simultaneously • [Figure: matrices M, N and P, each of size width × width] CUDA Training Program
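
  As a point of reference, a minimal sequential C sketch of this computation (the names M, N, P and width follow the slides; row-major 1-D layout as on slide 19) makes the parallelism visible: every (row, col) iteration is independent of every other:

      // Sequential reference for P = M * N; all matrices are width x width,
      // stored row-major as 1-D arrays.
      void MatrixMulOnHost(const float *M, const float *N, float *P, int width)
      {
          for (int row = 0; row < width; ++row) {
              for (int col = 0; col < width; ++col) {
                  float sum = 0.0f;                       // dot-product accumulator
                  for (int k = 0; k < width; ++k)
                      sum += M[row * width + k] * N[k * width + col];
                  P[row * width + col] = sum;             // independent of all other elements
              }
          }
      }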

  6. CUDA Program Structure • A CUDA program consists of one or more phases executed on either the host (CPU) or a device (GPU), supplied as a single source code • Little or no data parallelism → host code • ANSI C, compiled with a standard compiler • Significant data parallelism → device code • Extended ANSI C to specify kernels, data structures • The NVIDIA C Compiler (nvcc) separates the two, compiling device code itself and passing host code to a standard C compiler CUDA Training Program
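
  As a minimal illustration (a sketch, not from the slides) of how one .cu source file combines both kinds of code, with the device part marked by the __global__ qualifier:

      #include <stdio.h>

      // Device code: an extended-C kernel, marked with __global__
      __global__ void scale(float *data, float factor)
      {
          data[threadIdx.x] *= factor;   // each thread handles one element
      }

      // Host code: plain ANSI C, handled by the host compiler
      int main(void)
      {
          float h[4] = {1, 2, 3, 4};
          float *d;
          cudaMalloc((void**)&d, sizeof(h));
          cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
          scale<<<1, 4>>>(d, 10.0f);     // launch 4 device threads
          cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
          cudaFree(d);
          printf("%f %f %f %f\n", h[0], h[1], h[2], h[3]);
          return 0;
      }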

  7. Execution of a CUDA Program • [Figure: serial host code interleaved with two kernel invocations, each launching a grid of device threads] CUDA Training Program

  8. Execution of a CUDA Program • Execution starts on the host (CPU) • When a kernel is invoked, execution moves to the device (GPU) • A large number of threads is generated • Grid: the collection of all threads generated by a kernel • (Previous slide shows two grids of threads) • Once all threads in a grid complete execution, the grid terminates and execution continues on the host CUDA Training Program

  9. Example: Matrix Multiplication

      int main(void) {
          // 1. Allocate and initialize matrices M, N, P
          //    I/O to read the input matrices M and N
          ...
          // 2. M * N on the device
          MatrixMulOnDevice(M, N, P, width);
          // 3. I/O to write the output matrix P
          //    Free matrices M, N, P
          ...
          return 0;
      }

  A simple CUDA host code skeleton for matrix multiplication CUDA Training Program

  10. CUDA Device Memory Model • Host and devices have separate memory spaces • E.g., hardware cards with their own DRAM • To execute a kernel on a device • Need to allocate memory on the device • Transfer data: host memory → device memory • After device execution • Transfer results: device memory → host memory • Free device memory no longer needed CUDA Training Program
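
  A minimal sketch of that lifecycle (assuming M is a host array of width × width floats; error checking omitted):

      float *Md;                                        // device pointer
      int size = width * width * sizeof(float);
      cudaMalloc((void**)&Md, size);                    // allocate device memory
      cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);  // host -> device
      // ... launch kernel(s) that read/write Md ...
      cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);  // device -> host
      cudaFree(Md);                                     // free device memory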

  11. CUDA Device Memory Model CUDA Training Program

  12. CUDA API : Memory Mgt. • [Table: the device memory management functions cudaMalloc() and cudaFree(), used in the example on the next slide] CUDA Training Program

  13. CUDA API : Memory Mgt. • Example

      float *Md;
      int size = Width * Width * sizeof(float);
      cudaMalloc((void**)&Md, size);
      ...
      cudaFree(Md);

  CUDA Training Program

  14. CUDA API : Data Transfer • [Table: cudaMemcpy() and its transfer-direction constants, e.g. cudaMemcpyHostToDevice] CUDA Training Program
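
  Continuing the Md example from the previous slide (Nd, Pd and the host arrays N, P are assumed to be set up the same way), the two transfer directions look like this:

      // Host -> device: copy the input matrices into device memory
      cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
      cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
      // ... kernel launch goes here ...
      // Device -> host: copy the result matrix back
      cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);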

  15. Example: Matrix Multiplication CUDA Training Program
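
  A plausible sketch of the MatrixMulOnDevice() host function from slide 9, assembled from the allocate / copy / launch / copy-back / free pattern of the surrounding slides (single-block configuration as on slide 24; error handling omitted):

      void MatrixMulOnDevice(float *M, float *N, float *P, int width)
      {
          int size = width * width * sizeof(float);
          float *Md, *Nd, *Pd;

          // 1. Allocate device memory and copy the inputs to the device
          cudaMalloc((void**)&Md, size);
          cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
          cudaMalloc((void**)&Nd, size);
          cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
          cudaMalloc((void**)&Pd, size);

          // 2. Launch the kernel; one block requires width*width <= 1024 (slide 23)
          dim3 dimBlock(width, width, 1);
          dim3 dimGrid(1, 1, 1);
          MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);

          // 3. Copy the result back and free device memory
          cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
          cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
      }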

  16. Kernel Functions & Threading • A kernel function specifies the code to be executed by all threads of a parallel phase • All threads of a parallel phase execute the same code → single-program multiple-data (SPMD), a popular programming style for parallel computing • Need a mechanism to • Allow threads to distinguish themselves • Direct themselves to the specific parts of the data they are supposed to work on CUDA Training Program

  17. Kernel Functions & Threading • Keywords "threadIdx.x" and "threadIdx.y" • Thread indices of a thread • Allow a thread to identify itself at runtime (by accessing hardware registers associated with it) • Can refer to a thread as Thread(threadIdx.x, threadIdx.y) • Thread indices reflect a multi-dimensional organization of threads CUDA Training Program

  18. Example: Matrix Multiplication Kernel See next slide for more details on accessing relevant data CUDA Training Program
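
  Based on the description on the next slide (each thread uses its tx, ty indices to form one dot product), a plausible single-block version of the kernel is:

      // Kernel executed by every thread; one thread per element of Pd.
      // Works for a single thread block, as discussed on slide 23.
      __global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int width)
      {
          int tx = threadIdx.x;   // column index of the Pd element
          int ty = threadIdx.y;   // row index of the Pd element

          float Pvalue = 0;
          for (int k = 0; k < width; ++k)
              Pvalue += Md[ty * width + k] * Nd[k * width + tx];

          Pd[ty * width + tx] = Pvalue;   // write the dot-product result
      }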

  19. Thread Indices & Accessing Data Relevant to a Thread • [Figure: Md, Nd and Pd with thread coordinates tx, ty; Pd is laid out in memory as a 1-D array (row 0, row 1, ...), so a thread's element sits at offset ty * width + tx] • Each thread uses tx, ty to identify the relevant row of Md, the column of Nd and the element of Pd in the for loop • E.g., Thread(2,3) will perform the dot product between row 2 of Md and column 3 of Nd and write the result into element (2,3) of Pd CUDA Training Program

  20. Threading & Grids • When a kernel is invoked/launched, it is executed as a grid of parallel threads • A CUDA thread grid can have millions of lightweight GPU threads per kernel invocation • To fully utilize the hardware → enough threads required → large data parallelism required • Threads in a grid have a two-level hierarchy • A grid consists of 1 or more thread blocks • All blocks in a grid have the same # of threads CUDA Training Program

  21. CUDA Thread Organization • [Figure: a grid of thread blocks, each block containing a multi-dimensional array of threads] CUDA Training Program

  22. Threading with Grids & Blocks • Each thread block has a unique 2-D coordinate given by the CUDA keywords "blockIdx.x" and "blockIdx.y" • All blocks must have the same structure and number of threads • Each block has a 3-D array of threads, with at most 1024 threads in total • Coordinates of threads in a block are defined by the indices threadIdx.x, threadIdx.y, threadIdx.z • (Not all apps will use all 3 dimensions) CUDA Training Program
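
  A common idiom (a sketch, not shown on the slides) combines the block and thread coordinates into a unique global position for each thread:

      // Inside a kernel: a thread's global 2-D position
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      int col = blockIdx.x * blockDim.x + threadIdx.x;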

  23. Our Example: Matrix Multiplication • The kernel is shown 5 slides before (slide 18) • This can only use one thread block • The block is organized as a 2D-array • The code can compute a product matrix Pd of only up to 1024 elements • As a block can have a max of 1024 threads • Each thread computes one element in Pd • Is this sufficient / acceptable? CUDA Training Program

  24. Our Example: Matrix Multiplication • When the host code invokes the kernel, the grid and block dimensions are set by passing them as parameters • Example

      // Set up the execution configuration
      dim3 dimBlock(16, 16, 1);   // Width = 16, as an example
      dim3 dimGrid(1, 1, 1);      // last 1 ignored
      // Launch the device computation threads!
      MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, 16);

  CUDA Training Program
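
  To go beyond the 1024-element limit raised on slide 23, the matrix can be tiled across a grid of blocks. A sketch, not from the slides; it assumes a kernel that derives its global row/column from blockIdx and blockDim, as shown after slide 22:

      int tile = 16;
      dim3 dimBlock(tile, tile, 1);
      dim3 dimGrid(Width / tile, Width / tile, 1);  // assumes Width is a multiple of tile
      MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);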

  25. Here is an Exercise… • Implement Matrix Multiplication • Execute it with different matrix dimensions using (a) CPU only, (b) GPUs and (c) GPUs with different grid/block organizations • Fill in a table like the following • [Table: matrix dimensions vs. execution time for configurations (a), (b), (c)] CUDA Training Program

  26. Conclusion • We discussed the CUDA Programming Model and CUDA Thread Basics • Data Parallelism • CUDA Program Structure • Memory Model & Data Transfer (briefly) • Kernel Functions & Threading • (Discussion with Example: Matrix Multiplication) CUDA Training Program

  27. References for this Session • Chapter 2 of: D. Kirk and W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann, 2010 • Chapters 4-5 of: E. Kandrot and J. Sanders, CUDA by Example, Addison-Wesley, 2010 • Chapter 2 of: NVIDIA CUDA C Programming Guide, v3.2/4.0, NVIDIA Corp., 2010-2011 CUDA Training Program
