
GPU Programming with CUDA – Optimisation Mike Griffiths


Presentation Transcript


  1. GPU Programming with CUDA – Optimisation Mike Griffiths GPUComputing@Sheffield http://gpucomputing.sites.sheffield.ac.uk/

  2. Hardware Model [Diagram: the CPU with its DRAM and the GPU with its GDRAM are connected by the PCIe bus; main program code runs on the CPU, while GPU kernel code is launched onto the GPU's streaming multiprocessors (SMs), each with its own shared memory.]

  3. Performance Inhibitors • Data transfer to/from device memory • Device under-utilisation • GPU memory bandwidth • Code Branching

  4. Performance Inhibitors • Data transfer to/from device memory • Device under-utilisation and occupancy • GPU memory bandwidth • Code Branching

  5. Data Transfer • CPU (host) and GPU (device) have separate, dedicated memory • All data read/written on the device must be copied via the PCIe bus • This is a very expensive operation • Optimisation Technique: Minimise data copies • Keep resident data on the device • May have to move some computation to the GPU even if it is not computationally expensive • Might be quicker to re-calculate data on the device than to copy it

  6. Data Transfer Example
  Original code:
    Loop over timesteps
      inexpensive_routine_on_host(data_on_host)
      copy data from host to device
      expensive_routine_on_device(data_on_device)
      copy data from device to host
    End loop over timesteps
  Optimised code (port the inexpensive routine to the device and minimise transfers by moving the copies out of the loop):
    copy data from host to device
    Loop over timesteps
      inexpensive_routine_on_device(data_on_device)
      expensive_routine_on_device(data_on_device)
    End loop over timesteps
    copy data from device to host
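
  A minimal CUDA C sketch of the optimised pattern above (the kernel bodies, the array size n and the helper function are illustrative assumptions, not part of the original slides): the data is copied to the device once, stays resident for every timestep, and is copied back once at the end.

    #include <cuda_runtime.h>

    // Hypothetical stand-ins for the two routines, both now running on the device
    __global__ void inexpensive_routine(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    __global__ void expensive_routine(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    void run_timesteps(float *h_data, int n, int num_timesteps) {
        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));
        // single host-to-device copy, moved out of the timestep loop
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        for (int t = 0; t < num_timesteps; t++) {
            inexpensive_routine<<<blocks, threads>>>(d_data, n);
            expensive_routine<<<blocks, threads>>>(d_data, n);
        }

        // single device-to-host copy after the loop
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_data);
    }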

  7. Performance Inhibitors • Data transfer to/from device memory • Device under-utilisation and occupancy • GPU memory bandwidth • Code Branching

  8. Exposing Parallelism • GPU performance relies on the use of many threads • Degree of parallelism must be much higher than on the CPU • Ideally need many more threads than cores • Effort must be made to expose as much parallelism as possible • May require re-engineering your problem • If significant sections of code are serial then GPU acceleration will be limited • Amdahl’s Law
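
  To make Amdahl's law concrete (an illustration, not from the original slide): if a fraction p of the runtime can be parallelised and that part is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). With p = 0.9, even an infinitely fast GPU gives at most 1 / 0.1 = 10x, so the remaining serial 10% of the code caps the achievable acceleration.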

  9. Memory Latency • Access to GPU memory has several hundred cycles of latency • When a thread is waiting for data it is stalled • GPUs have very fast context switching • Stalled threads can be swapped out for active threads • Switching hides memory latency as long as other threads are performing compute • This requires many threads, ideally performing large amounts of computation • Optimisation Technique: Have lots of threads with high arithmetic intensity • Arithmetic intensity is defined as the ratio of arithmetic computation to memory accesses
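
  As an illustration of arithmetic intensity (these hypothetical kernels are not from the original slides), the first kernel performs roughly one arithmetic operation per two memory accesses, while the second performs many register-only operations per load/store pair and is therefore better able to keep the GPU busy while other warps wait on memory.

    // Low arithmetic intensity: one multiply per load/store pair
    __global__ void low_intensity(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];
    }

    // Higher arithmetic intensity: 128 register-only operations per load/store pair
    __global__ void high_intensity(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = in[i];                  // one load
            for (int k = 0; k < 64; k++)
                v = v * 1.0001f + 0.5f;       // 64 multiplies + 64 adds in registers
            out[i] = v;                       // one store
        }
    }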

  10. Exposing Parallelism Example
  Original code:
    Loop over i from 1 to 512
      Loop over j from 1 to 512
        independent iteration
  1D decomposition (✖ only 512 threads):
    Calc i from thread/block ID
    Loop over j from 1 to 512
      independent iteration
  2D decomposition (✔ 262,144 threads):
    Calc i & j from thread/block ID
    independent iteration
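
  A hedged CUDA C sketch of the two decompositions (the array, its size and the element-wise operation are illustrative assumptions): the 1D version launches only one thread per row, while the 2D version launches one thread per element.

    #define N 512

    // 1D decomposition: one thread per i, each looping over j -> only 512 threads
    __global__ void iterate_1d(float (*data)[N]) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            for (int j = 0; j < N; j++)
                data[i][j] *= 2.0f;           // the independent iteration
    }

    // 2D decomposition: one thread per (i, j) element -> 512 * 512 = 262,144 threads
    __global__ void iterate_2d(float (*data)[N]) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        int i = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < N && j < N)
            data[i][j] *= 2.0f;               // the independent iteration
    }

    void launch_examples(float (*d_data)[N]) {
        iterate_1d<<<N / 256, 256>>>(d_data);     // 2 blocks of 256 threads = 512 threads

        dim3 threads(16, 16);
        dim3 blocks(N / 16, N / 16);              // 32 x 32 blocks of 256 threads = 262,144 threads
        iterate_2d<<<blocks, threads>>>(d_data);
    }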

  11. Performance Inhibitors • Data transfer to/from device memory • Device under-utilisation and occupancy • GPU memory bandwidth • Code Branching

  12. Memory Coalescing • GPUs have high peak memory bandwidth • Maximum bandwidth is achieved when data is accessed in large requests rather than many small requests • Large requests must come from multiple threads, otherwise memory accesses are serialised, degrading performance • Memory coalescing: consecutive threads accessing consecutive memory locations • Optimisation Technique: Coalesced memory accesses reduce the number of requests and achieve higher bandwidth

  13. Coalescing Example
    index = blockIdx.x*blockDim.x + threadIdx.x;
    output[index] = 2*input[index];
  Consecutive threads are those with consecutive threadIdx.x values. Question: do consecutive threads access consecutive memory locations?

  14. Coalescing Example
    index = blockIdx.x*blockDim.x + threadIdx.x;
    output[index] = 2*input[index];
  Consecutive threads are those with consecutive threadIdx.x values. Question: do consecutive threads access consecutive memory locations? Yes: consecutive threadIdx.x values give consecutive index values, so consecutive threads access consecutive memory locations.

  15. Coalescing Example 2
  Version 1:
    i = blockIdx.x*blockDim.x + threadIdx.x;
    for (j = 0; j < N; j++)
      output[i][j] = 2*input[i][j];
  Version 2:
    j = blockIdx.x*blockDim.x + threadIdx.x;
    for (i = 0; i < N; i++)
      output[i][j] = 2*input[i][j];
  Question: do consecutive threads access consecutive memory locations?

  16. Coalescing Example 2
  Version 1:
    i = blockIdx.x*blockDim.x + threadIdx.x;
    for (j = 0; j < N; j++)
      output[i][j] = 2*input[i][j];
  Version 2:
    j = blockIdx.x*blockDim.x + threadIdx.x;
    for (i = 0; i < N; i++)
      output[i][j] = 2*input[i][j];
  Question: do consecutive threads access consecutive memory locations?
  Version 1 – No: consecutive threadIdx.x values correspond to consecutive i values, so consecutive threads access locations a whole row apart.
  Version 2 – Yes: consecutive threadIdx.x values correspond to consecutive j values, so consecutive threads access adjacent locations within a row.

  17. Memory Coalescing in 2D • What about 2D or 3D decompositions? • Exactly the same: it is always threadIdx.x that should increment with consecutive threads • E.g. matrix addition:
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    c[i][j] = a[i][j] + b[i][j];
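
  A fuller, hedged sketch of the matrix addition example (the dimension N and the launch configuration are assumptions for illustration). Because C arrays are stored row by row, mapping threadIdx.x to the column index j means consecutive threads read and write consecutive memory locations, so the accesses coalesce.

    #define N 1024

    __global__ void matrix_add(const float (*a)[N], const float (*b)[N], float (*c)[N]) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;   // column: consecutive threads -> consecutive j
        int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
        if (i < N && j < N)
            c[i][j] = a[i][j] + b[i][j];                 // coalesced loads and stores
    }

    void launch_matrix_add(const float (*d_a)[N], const float (*d_b)[N], float (*d_c)[N]) {
        dim3 threads(32, 8);                             // 32 threads in x: a whole warp walks along a row
        dim3 blocks(N / 32, N / 8);
        matrix_add<<<blocks, threads>>>(d_a, d_b, d_c);
    }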

  18. Performance Inhibitors • Data transfer to/from device memory • Device under-utilisation and occupancy • GPU memory bandwidth • Code Branching

  19. Code Branching • On NVIDIA GPUs there are fewer instruction scheduling units than cores • Threads are scheduled in groups of 32 (a warp) • Threads within a warp execute the same instruction in lock-step • Single Instruction Multiple Data (SIMD) • CUDA C kernels are free to specify branches • BUT if threads within a warp diverge, the warp must execute every code path taken by any of its threads • Optimisation Technique: Avoid intra-warp branching (divergence within a warp) wherever possible

  20. Branching Example You want to split your threads into two groups:
    i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i%2 == 0) … else …

    i = blockIdx.x*blockDim.x + threadIdx.x;
    if ((i/32)%2 == 0) … else …
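
  A hedged sketch (not from the original slides) of why the second form is preferred: threads are scheduled in warps of 32, so making the branch depend on (i / 32) % 2 means every thread in a given warp takes the same path and no warp has to execute both branches, whereas i % 2 makes adjacent threads in the same warp diverge.

    // Assumed illustration: warp-aligned branching avoids divergence within a warp
    __global__ void split_work(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        // if (i % 2 == 0) ...  -> adjacent threads in one warp take different paths,
        //                         so every warp executes both branches in turn
        if ((i / 32) % 2 == 0) {   // whole warps (threads 0-31, 64-95, ...) take this path
            data[i] = data[i] * 2.0f;
        } else {                   // the other warps take this path
            data[i] = data[i] + 1.0f;
        }
    }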

  21. CUDA Profiling • Set the COMPUTE_PROFILE environment variable to 1 • A log file will be created at runtime, e.g. cuda_profile_0.log • Contains timing information for kernels and data transfers • Possible to output more metrics (see doc/Compute_Profiler.txt)
  Example log:
    # CUDA_PROFILE_LOG_VERSION 2.0
    # CUDA_DEVICE 0 Tesla M1060
    # CUDA_CONTEXT 1
    # TIMESTAMPFACTOR fffff6e2e9ee8858
    method,gputime,cputime,occupancy
    method=[ memcpyHtoD ] gputime=[ 37.952 ] cputime=[ 86.000 ]
    method=[ memcpyHtoD ] gputime=[ 37.376 ] cputime=[ 71.000 ]
    method=[ memcpyHtoD ] gputime=[ 37.184 ] cputime=[ 57.000 ]
    method=[ _Z23inverseEdgeDetect1D_colPfS_S_ ] gputime=[ 253.536 ] cputime=[ 13.000 ] occupancy=[ 0.250 ]
    ...
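
  As a usage note (the application name below is a placeholder): the profiler is enabled by setting the environment variable before launching the program, e.g. COMPUTE_PROFILE=1 ./myApplication, after which the timings can be read from cuda_profile_0.log in the working directory.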

  22. Conclusions • GPUs offer higher floating-point and memory bandwidth performance than CPUs • A number of factors will inhibit execution performance • A number of techniques can be applied to circumvent these • Some techniques may require re-engineering your problem • If your application can't be adapted, GPU performance will not be good! • It is important to have an understanding of the application, the architecture and the programming model
