GPU Optimization using CUDA Framework


Presentation Transcript


  1. GPU Optimization using CUDA Framework Anusha Vidap, Divya Sree Yamani, Raffath Sulthana

  2. Talk Structure • GPU? • History • CUDA • Is optimization really needed? • Techniques to optimize • Project research goal

  3. Why GPU? What is SIMD? • SIMD (Single Instruction, Multiple Data) means one instruction operates on many data elements at once; the GPU applies this model across thousands of threads.

  4. In this era, the GPU is much more than a graphics processor. • But then, why do we still need CPUs?

  5. CUDA

  6. Why only CUDA?

  7. Automatic Scalability

  8. DYNAMIC PARALLELISM
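A hedged sketch of dynamic parallelism (kernel names assumed; requires a device of compute capability 3.5 or newer and compilation with -rdc=true): a parent kernel launches a child kernel directly on the device, with no round trip to the host.

    __global__ void child(int *data) {
        data[threadIdx.x] += 1;
    }

    __global__ void parent(int *data) {
        // One thread launches a child grid from the device; the child grid
        // is guaranteed to finish before the parent grid completes.
        if (threadIdx.x == 0) {
            child<<<1, 32>>>(data);
        }
    }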

  9. CUDA driving GPU growth

  10. HOST CODE vs CUDA CODE
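As a hedged sketch of the split (the kernel name addOne and the sizes are illustrative, not from the deck): the __global__ function below is CUDA code compiled for the GPU, while main() is ordinary host code that allocates device memory and launches the kernel.

    #include <cuda_runtime.h>

    // CUDA code: executes on the GPU, one thread per array element.
    __global__ void addOne(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    // Host code: executes on the CPU and launches the kernel.
    int main() {
        int n = 1024;
        int *d_data;
        cudaMalloc(&d_data, n * sizeof(int));
        addOne<<<(n + 255) / 256, 256>>>(d_data, n);  // <<<blocks, threads>>>
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }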

  11. CUDA Memories Overview • Global • Local • Constant • Texture • Shared • Registers

  12. Threads, Blocks, Grid, and Warp

  13. Representation:
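Since the representation slide is an image, here is a small illustrative sketch (all names assumed) of how the grid/block/thread hierarchy maps to indices, with warps as groups of 32 threads:

    // Grid -> blocks -> threads: each thread derives a unique global index.
    __global__ void indexDemo(float *out) {
        int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;  // position in the grid
        int warpId    = threadIdx.x / 32;   // warps are groups of 32 threads
        int laneId    = threadIdx.x % 32;   // position within the warp
        out[globalIdx] = warpId * 32 + laneId;
    }

    // Host-side launch: a grid of 4 blocks, 128 threads (4 warps) per block.
    // indexDemo<<<4, 128>>>(d_out);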

  14. Global Memory • Global memory resides in device DRAM. • The name "global" refers to scope: it can be accessed and modified from both host and device. • Declared statically with the __device__ qualifier. • cudaMalloc() allocates it dynamically and assigns it to a regular C pointer variable. • Allocated explicitly by the host (CPU) thread.

  15. Global Memory __device__ • For data available to all threads on the device. • Declared outside function bodies. • Scope of the grid, lifetime of the application.

  16. Basic Operation
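A minimal sketch of these basic operations (buffer names assumed): a statically declared __device__ variable is copied by symbol, while cudaMalloc() memory is handled through a regular pointer:

    #include <cuda_runtime.h>

    __device__ float d_scale;   // static global: declared outside function bodies

    int main() {
        // Dynamic global memory: allocated by the host, held in a plain C pointer.
        float *d_buf;
        cudaMalloc(&d_buf, 1024 * sizeof(float));

        float h_buf[1024] = {0};
        cudaMemcpy(d_buf, h_buf, sizeof(h_buf), cudaMemcpyHostToDevice);

        // Static __device__ variables are copied by symbol, not by pointer.
        float scale = 2.0f;
        cudaMemcpyToSymbol(d_scale, &scale, sizeof(scale));

        cudaFree(d_buf);
        return 0;
    }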

  17. Local Memory • Resides in device memory. • A thread cannot read another thread's local memory. • Much slower than register memory. • Used to hold arrays when they are not indexed with constant values. • Holds variables when no more registers are available for them.
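As an illustrative sketch of the array rule above (kernel names assumed): an array indexed only with compile-time constants can live in registers, while a runtime index typically forces it into per-thread local memory:

    // Likely register-resident: small array indexed only with constants.
    __global__ void inRegisters(float *out) {
        float a[4] = {1, 2, 3, 4};
        out[threadIdx.x] = a[0] + a[3];    // constant indices
    }

    // Likely spilled to local memory: the index is not known at compile time.
    __global__ void inLocalMemory(float *out, int k) {
        float a[4] = {1, 2, 3, 4};
        out[threadIdx.x] = a[k % 4];       // runtime index forces local memory
    }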

  18. Constant Memory • Step 1: A constant memory request for a warp is first split into two requests, one for each half-warp, that are issued independently. • Step 2: A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests. • Final Step: The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.

  19. Constant memory __constant__ • For data not altered by the device. • Although stored in global memory, it is cached and fast to access. • Declared outside function bodies.
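A minimal sketch of __constant__ usage (coeff and the kernel are assumed names): declared outside function bodies, filled from the host with cudaMemcpyToSymbol(), and only read by the device:

    #include <cuda_runtime.h>

    __constant__ float coeff[16];      // declared outside function bodies

    __global__ void applyCoeff(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= coeff[i % 16];   // all threads read, none write
    }

    // Host side fills constant memory by symbol:
    // float h_coeff[16] = { /* values */ };
    // cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));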

  20. Textures: • The GPU's sophisticated texture memory may also be used for general-purpose computing. • Although NVIDIA designed the texture units for the classical OpenGL and DirectX rendering pipelines, texture memory has some properties that make it extremely useful for computing. • Texture memory is cached on chip. • Texture caches are designed for graphics applications.

  21. Textures.. • Reads go through a dedicated addressing mode called texture fetching. • All threads share texture memory. • Textures are addressed as one-dimensional or two-dimensional arrays. • Elements of the array are called texels, short for "texture elements".
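A hedged sketch of a texture fetch using the texture-object API (the deck may have used the older texture-reference API; all names here are assumed): a linear device buffer is wrapped in a cudaTextureObject_t and read through the cached tex1Dfetch() path:

    #include <cuda_runtime.h>
    #include <cstring>

    __global__ void readViaTexture(cudaTextureObject_t tex, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = tex1Dfetch<float>(tex, i);   // cached texture fetch
    }

    // Host side: wrap a linear device buffer d_in (n floats) in a texture object.
    cudaTextureObject_t makeTexture(float *d_in, int n) {
        cudaResourceDesc resDesc;
        memset(&resDesc, 0, sizeof(resDesc));
        resDesc.resType = cudaResourceTypeLinear;
        resDesc.res.linear.devPtr = d_in;
        resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
        resDesc.res.linear.sizeInBytes = n * sizeof(float);

        cudaTextureDesc texDesc;
        memset(&texDesc, 0, sizeof(texDesc));
        texDesc.readMode = cudaReadModeElementType;

        cudaTextureObject_t tex;
        cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
        return tex;
    }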

  22. Shared Memory • Each block has its own shared memory. • Enables fast communication between threads in a block. • Used to hold data that will be read and written by multiple threads. • Reduces memory latency. • Access to it is very fast.

  23. Shared memory __shared__ • Shared memory is on the GPU chip and very fast. • A separate copy of the data is available to all threads in one block. • Declared inside function bodies. • So each block would have its own array A[N].
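A minimal sketch of the per-block array A[N] (N and the kernel name assumed): each block gets its own copy, and __syncthreads() makes the writes visible before other threads read them:

    #define N 256

    __global__ void reverseInBlock(float *d_data) {
        __shared__ float A[N];            // one copy of A per block
        int t = threadIdx.x;
        int i = blockIdx.x * N + t;

        A[t] = d_data[i];                 // each thread writes one element
        __syncthreads();                  // wait until the whole block has written

        d_data[i] = A[N - 1 - t];         // read a value written by another thread
    }
    // Launch with block size N: reverseInBlock<<<numBlocks, N>>>(d_data);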

  24. Registers: • Registers are the fastest memory on the GPU. • Unlike a CPU, a GPU has thousands of registers. • Per-thread register usage limits how many threads can be resident per block.

  25. Registers • The compiler will place variables declared in a kernel in registers when possible. • There is a limit to the number of registers. • Registers are divided across warps (groups of 32 threads that operate in SIMT mode) and have the lifetime of the warp: __global__ void kernel() { int x, y, z; … }

  26. Overview of all Memories

  27. Declaration:
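Since the declaration slide is an image, the usual qualifier summary can be sketched in plain text (scopes and lifetimes as commonly tabulated for CUDA):

    // Declaration                    Memory     Scope    Lifetime
    // int v;                         register   thread   kernel
    // int a[N]; (runtime-indexed)    local      thread   kernel
    // __shared__ int s;              shared     block    kernel
    // __device__ int g;              global     grid     application
    // __constant__ int c;            constant   grid     application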

  28. So, how does this GPU actually work? • Data transfer and communication between host and GPU

  29. Processing Flow:
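An end-to-end sketch of this flow (names and sizes assumed): allocate on both sides, copy host to device, launch, copy back, free:

    #include <cuda_runtime.h>
    #include <cstdlib>

    __global__ void scale(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) h[i] = (float)i;

        float *d;
        cudaMalloc(&d, bytes);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // 1. host -> device
        scale<<<(n + 255) / 256, 256>>>(d, n);             // 2. launch kernel
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // 3. device -> host

        cudaFree(d);
        free(h);
        return 0;
    }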

  30. Techniques to optimize • Coalescing data transfers to and from global memory • Shared memory bank conflicts • Performance of L1 cache settings • Optimization through CUDA Streams (see the sketch below)
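The first three techniques are expanded on the slides below; streams are not, so here is a hedged sketch of copy/compute overlap with CUDA streams, reusing the scale kernel and the h/d buffers from the processing-flow sketch above (h would need to be pinned via cudaMallocHost for the async copies to actually overlap):

    // Split the work into chunks and pipeline copy/compute across two streams.
    cudaStream_t s[2];
    for (int i = 0; i < 2; i++) cudaStreamCreate(&s[i]);

    int chunk = n / 2;
    for (int i = 0; i < 2; i++) {
        int off = i * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        scale<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < 2; i++) cudaStreamDestroy(s[i]);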

  31. Coalescing data transfers to and from global memory

  32. Memory Coalescing • A coalesced memory transaction is one in which all of the threads in a half-warp access global memory at the same time. • This is oversimplified, but the practical rule is simply to have consecutive threads access consecutive memory addresses.
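An illustrative contrast (kernel names assumed): in the first kernel a warp's 32 threads read 32 consecutive floats, which the hardware combines into a few transactions; in the second, a stride of 32 scatters the accesses so they cannot coalesce:

    // Coalesced: thread i reads element i, so a warp reads consecutive floats.
    __global__ void coalesced(float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: thread i reads element i*32, so a warp's accesses are scattered
    // across many memory segments and cannot be combined.
    __global__ void strided(float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * 32 < n) out[i] = in[i * 32];
    }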

  33. Shared memory bank conflicts

  34. Memory Banks The device can fetch A[0], A[1], A[2], A[3] … A[B-1] at the same time, where B is the number of banks.
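A classic illustration of a conflict and the padding fix (assuming the common configuration of 32 banks of 4-byte words; names assumed): reading down a column of a 32-wide shared tile hits one bank 32 times, while padding rows to 33 elements spreads the column across all banks:

    __global__ void columnRead(float *out) {
        // 32-way conflict: a warp reads down column 0; addresses differ by
        // 32 words, so they all map to the same bank and serialize.
        __shared__ float tile[32][32];
        tile[threadIdx.y][threadIdx.x] = threadIdx.x;   // fill the tile
        __syncthreads();
        out[threadIdx.y * 32 + threadIdx.x] = tile[threadIdx.x][0];
    }

    __global__ void columnReadPadded(float *out) {
        // Fix: pad each row by one element (33 columns) so consecutive rows
        // of a column fall in different banks.
        __shared__ float tile[32][33];
        tile[threadIdx.y][threadIdx.x] = threadIdx.x;
        __syncthreads();
        out[threadIdx.y * 32 + threadIdx.x] = tile[threadIdx.x][0];
    }
    // Launch with dim3 block(32, 32): columnRead<<<1, block>>>(d_out);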
