CS179: GPU Programming
Presentation Transcript

  1. CS179: GPU Programming Lecture 5: Memory

  2. Today • GPU Memory Overview • CUDA Memory Syntax • Tips and tricks for memory handling

  3. Memory Overview • Very slow access: • Between host and device • Slow access: • Global memory, local memory (both stored off-chip) • Fast access: • Shared memory, constant memory, texture memory • Very fast access: • Register memory

  4. Global Memory • Read/write • Shared between blocks and grids • Same across multiple kernel executions • Very slow to access • No caching!

  5. Constant Memory • Read-only in device • Cached in multiprocessor • Fairly quick • Cache can broadcast to all active threads

  6. Texture Memory • Read-only in device • 2D cached -- quick access • Filtering methods available

  7. Shared Memory • Read/write per block • Memory is shared within block • Generally quick • Has bad worst-cases

  8. Local Memory • Read/write per thread • Not too fast (stored off-chip, in device memory) • Each thread can only see its own local memory • Indexable (can do arrays)

  9. Register Memory • Read/write per thread • Extremely fast • Each thread can only see its own register memory • Not indexable (can’t do arrays)

  10. Syntax: Register Memory • Default memory type • Declare as normal -- no special syntax • int var = 1; • Only accessible by current thread
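A minimal sketch of register usage inside a kernel (the kernel name `scale` is hypothetical, not from the lecture):

```cuda
__global__ void scale(float *out, const float *in, float factor) {
    // i and tmp are ordinary automatic variables: the compiler
    // places them in registers, private to this thread.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float tmp = in[i] * factor;   // computed entirely in registers
    out[i] = tmp;                 // result written back to global memory
}
```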

  11. Syntax: Local Memory • Per-thread storage, backed by off-chip device memory • No explicit qualifier in CUDA C -- the compiler places variables there automatically • Used for per-thread arrays indexed at runtime, and for registers that spill • Each thread sees only its own copy
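A sketch of a declaration that typically ends up in local memory (the kernel name `hist8` is a hypothetical example, assuming a compiler that cannot keep a runtime-indexed array in registers):

```cuda
__global__ void hist8(int *out, const unsigned char *in, int n) {
    // A per-thread array indexed with a runtime value usually cannot
    // live in registers, so the compiler places it in (off-chip)
    // local memory, private to this thread.
    int counts[8] = {0};
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = t; i < n; i += blockDim.x * gridDim.x)
        counts[in[i] & 7]++;          // runtime index -> local memory
    for (int b = 0; b < 8; b++)
        out[t * 8 + b] = counts[b];   // copy private results out
}
```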

  12. Syntax: Shared Memory • Shared across threads in block, not across blocks • Cannot use pointers, but can use array syntax for arrays • Declare with the __device__ __shared__ keywords • __device__ __shared__ int var[N]; • Can also just use __shared__ • Don’t need to declare size for dynamically sized arrays: declare extern __shared__ int var[]; and pass the size at kernel launch
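A sketch of the dynamically sized form, where the array size is supplied at launch time (the kernel name `reverseBlock` is a hypothetical example):

```cuda
// Size is not declared here; it comes from the third launch parameter.
extern __shared__ float tile[];

__global__ void reverseBlock(float *d) {
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = d[base + t];            // stage block's data in shared memory
    __syncthreads();                  // wait until the whole block has loaded
    d[base + t] = tile[blockDim.x - 1 - t];   // read back reversed
}

// Launch with the shared-memory size as the third configuration argument:
//   reverseBlock<<<numBlocks, threads, threads * sizeof(float)>>>(d);
```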

  13. Syntax: Global Memory • Created with cudaMalloc • Can pass pointers between host and kernel • Transfer is slow! • Declare with __device__ keyword • __device__ int var = 1;
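A host-side sketch of the cudaMalloc path (the host buffer `h_buf` and kernel `myKernel` are hypothetical placeholders):

```cuda
int n = 1024;
float *d_buf;                                   // device (global) pointer
cudaMalloc(&d_buf, n * sizeof(float));          // allocate global memory
cudaMemcpy(d_buf, h_buf, n * sizeof(float),     // host -> device transfer:
           cudaMemcpyHostToDevice);             // slow, so batch it
myKernel<<<n / 256, 256>>>(d_buf);              // pointer passed to kernel
cudaMemcpy(h_buf, d_buf, n * sizeof(float),     // device -> host
           cudaMemcpyDeviceToHost);
cudaFree(d_buf);
```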

  14. Syntax: Constant Memory • Declare with __device__ __constant__ keyword • __device__ __constant__ int var = 1; • Can also just use __constant__ • Set using cudaMemcpyToSymbol (or cudaMemcpy) • cudaMemcpyToSymbol(var, src, count);
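A sketch combining the declaration, the host-side copy, and the broadcast-friendly access pattern (the names `coeffs`, `h_coeffs`, and `filter` are hypothetical):

```cuda
__constant__ float coeffs[16];     // device constant memory, cached

// Host side: fill the constant array from a host buffer h_coeffs.
// cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));

__global__ void filter(float *out, const float *in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Every thread reads the same coeffs entries, so the constant
    // cache can broadcast them to all active threads in one read.
    out[i] = in[i] * coeffs[0] + coeffs[1];
}
```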

  15. Syntax: Texture Memory • To be discussed later…

  16. Memory Issues • Each multiprocessor has a fixed amount of memory • This limits the number of blocks it can run at once • (# of blocks) x (memory used per block) <= total memory • Either get lots of blocks using little memory, or fewer blocks using lots of memory

  17. Memory Issues • Register memory is limited! • Similar to shared memory in blocks • Can have many threads using fewer registers, or few threads using many registers • Former is better, more parallelism

  18. Memory Issues • Global accesses: slow! • Can be sped up when memory is contiguous • Memory coalescing: making memory contiguous • Coalesced accesses are: • Contiguous accesses • In-order accesses • Aligned accesses

  19. Memory Coalescing: Aligned Accesses • Threads read 4, 8, or 16 bytes at a time from global memory • Accesses must be aligned in memory! • Good: reads starting at 0x00, 0x04, 0x14 (each 4-byte aligned) • Bad: reads starting at 0x00, 0x07, 0x14 (0x07 is unaligned) • Which is worse, reading 16 bytes from 0xABCD0 or 0xABCDE?

  20. Memory Coalescing: Aligned Accesses • Also bad: a run of accesses whose beginning is unaligned

  21. Memory Coalescing: Aligned Accesses • Built-in types force alignment • float3 (12B) takes up the same space as float4 (16B) • float3 arrays are not aligned! • To align a struct, use __align__(x) // x = 4, 8, 16 • cudaMalloc aligns the start of each block automatically • cudaMallocPitch aligns the start of each row for 2D arrays
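A sketch of the struct-alignment trick (the struct name `Vec3` is a hypothetical example):

```cuda
// float3 is 12 bytes, so consecutive float3 array elements drift off
// 16-byte boundaries. __align__ forces a padded, aligned layout:
struct __align__(16) Vec3 {
    float x, y, z;   // 12 bytes of data, padded to 16 by __align__(16)
};

// sizeof(Vec3) == 16, so every element of a Vec3 array stays
// 16-byte aligned, at the cost of 4 padding bytes per element.
```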

  22. Memory Coalescing: Contiguous Accesses • Contiguous = memory is together • Example: non-contiguous memory • Threads 3 and 4 have swapped accesses!

  23. Memory Coalescing: Contiguous Accesses • Which is better? • index = threadIdx.x + blockDim.x * (blockIdx.x + gridDim.x * blockIdx.y); • index = threadIdx.x + blockDim.y * (blockIdx.y + gridDim.y * blockIdx.x);
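The first form is the coalesced one: threadIdx.x varies fastest across a warp, so multiplying everything else by blockDim.x makes consecutive threads touch consecutive addresses. A sketch (the kernel name `copyCoalesced` is hypothetical):

```cuda
__global__ void copyCoalesced(float *dst, const float *src) {
    // threadIdx.x varies fastest within a warp, so it maps to the
    // fastest-varying (contiguous) part of the address:
    int index = threadIdx.x
              + blockDim.x * (blockIdx.x + gridDim.x * blockIdx.y);
    dst[index] = src[index];   // consecutive threads, consecutive words
}
```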

  24. Memory Coalescing: Contiguous Accesses • Case 1: Contiguous accesses • [Figure: threads thread[0][0], thread[1][0], thread[0][1], thread[1][1] mapped across banks bank[0]–bank[3]]

  26. Memory Coalescing: In-order Accesses • In-order accesses • Do not skip addresses • Access addresses in order in memory • Bad example: • Left: address 140 skipped • Right: lots of skipped addresses

  27. Memory Coalescing • Good example:

  28. Memory Coalescing • Not as much of an issue in new hardware • Many restrictions relaxed -- e.g., do not need to have sequential access • However, memory coalescing and alignment still good practice!

  29. Memory Issues • Shared memory: • Also can be limiting • Broken up into banks • Optimal when entire warp is reading shared memory together • Banks: • Each bank services only one thread at a time • Bank conflict: when two threads try to access the same bank • Causes slowdowns in program!

  30. Bank Conflicts • Bad: • Many threads trying to access the same bank

  31. Bank Conflicts • Good: • Few to no bank conflicts

  32. Bank Conflicts • Banks service 32-bit words at a time; addresses map to banks mod 64 bytes • Bank 0 services 0x00, 0x40, 0x80, etc., bank 1 services 0x04, 0x44, 0x84, etc. • Want to avoid multiple thread access to same bank • Keep data spread out • Split data that is larger than 4 bytes into multiple accesses • Be careful of data elements with power-of-two strides
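A common way to break a power-of-two stride is to pad the shared array by one column, as in this matrix-transpose sketch (kernel name `transposeTile` and tile size are illustrative assumptions; n is assumed to be a multiple of TILE):

```cuda
#define TILE 16

__global__ void transposeTile(float *out, const float *in, int n) {
    // The extra +1 column gives each row an odd stride, so reading a
    // column (stride TILE) no longer hits the same bank repeatedly.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced load
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;              // transposed block
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free read
}
```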

  33. Broadcasting • Fast distribution of data to threads • Happens when entire warp tries to access same address • Memory will get broadcast to all threads in one read

  34. Summary • Best memory management: • Balances memory optimization with parallelism • Break problem up into coalesced chunks • Process data in shared memory, then copy back to global • Remember to avoid bank conflicts!
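The whole recipe -- coalesced load, work in shared memory, one write back to global -- can be sketched as a per-block sum reduction (kernel name `blockSum` is hypothetical; assumes a power-of-two block size of at most 256):

```cuda
__global__ void blockSum(float *out, const float *in) {
    __shared__ float s[256];                  // assumes <= 256 threads/block
    int t = threadIdx.x;
    s[t] = in[blockIdx.x * blockDim.x + t];   // coalesced global read
    __syncthreads();

    // Tree reduction entirely in shared memory; consecutive threads
    // hit consecutive banks, so there are no bank conflicts.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride) s[t] += s[t + stride];
        __syncthreads();
    }

    if (t == 0) out[blockIdx.x] = s[0];       // one global write per block
}
```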

  35. Next Time • Texture memory • CUDA Applications in graphics