
CS 179: Lecture 3


Presentation Transcript


  1. CS 179: Lecture 3

  2. A bold problem! • Suppose we have a polynomial P(r) with coefficients c_0, …, c_{n-1}, given by: P(r) = \sum_{k=0}^{n-1} c_k r^k • We want, for r_0, …, r_{N-1}, the sum: \sum_{i=0}^{N-1} P(r_i)
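A quick sketch of the per-thread evaluation (this code isn't in the transcript; the helper name and the Horner's-rule formulation are assumptions):

  __device__ float evalPolynomial(const float *c, int n, float r) {
      //Horner's rule: P(r) = c[0] + r*(c[1] + r*(c[2] + ...))
      float p = 0.0f;
      for (int k = n - 1; k >= 0; --k)
          p = p * r + c[k];
      return p;
  }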

  3. A kernel !!
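The kernel itself isn't captured in this transcript; the "!!" flags a race condition, so it presumably resembled this sketch (parameter names, and the evalPolynomial helper above, are assumptions):

  __global__ void polynomialSumBad(const float *c, int n,
                                   const float *r, int N, float *output) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < N)
          *output += evalPolynomial(c, n, r[idx]); //BUG: unsynchronized
                                                   //read-modify-write on *output
  }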

  4. A more correct kernel
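Again the code isn't in the transcript, but given the next slide's complaint about atomicAdd(), the fix is presumably along these lines (a sketch, same assumed names):

  __global__ void polynomialSumAtomic(const float *c, int n,
                                      const float *r, int N, float *output) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < N)
          atomicAdd(output, evalPolynomial(c, n, r[idx])); //correct: the
              //hardware serializes each update, so there is no race
  }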

  5. Performance problem! • Serialization of atomicAdd(): the hardware applies each atomic update to *output one at a time, so all N threads queue up behind a single memory location

  6. Parallel accumulation

  7. Parallel accumulation
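The diagrams on these two slides aren't in the transcript. A sketch of the "linear accumulation" variant they lead up to (slide 14 below describes its behavior; names are assumptions): each block combines its threads' results in shared memory and issues a single atomicAdd().

  __global__ void polynomialSumLinear(const float *c, int n,
                                      const float *r, int N, float *output) {
      extern __shared__ float partial_outputs[]; //one slot per thread
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      partial_outputs[threadIdx.x] =
          (idx < N) ? evalPolynomial(c, n, r[idx]) : 0.0f;
      __syncthreads();
      if (threadIdx.x == 0) {
          //thread 0 walks the block's partial results serially
          float sum = 0.0f;
          for (int i = 0; i < blockDim.x; ++i)
              sum += partial_outputs[i];
          atomicAdd(output, sum); //one atomic per block, not per thread
      }
  }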

  8. Accumulation methods – A visual • Purely atomic: every thread adds directly into *output • “Linear accumulation”: per-thread partial results are folded into *output by one thread

  9. GPU: Internals • Multiprocessor (blocks go here) • SIMD processing unit (warps go here)

  10. Warp • 32 threads that execute simultaneously! • For the A[] + B[] -> C[] problem, the warp boundaries didn’t matter (sketch below): • Warp of threads 0-31: A[idx] + B[idx] -> C[idx] • Warp of threads 32-63: A[idx] + B[idx] -> C[idx] • Warp of threads 64-95: A[idx] + B[idx] -> C[idx] • …
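For reference, the kernel body behind that earlier problem is a single line, something like this sketch (names assumed):

  __global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < N)
          C[idx] = A[idx] + B[idx]; //every warp runs these same instructions
  }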

  11. Divergence – Example
  //Suppose we have a pointer to some floating-point
  //value *value...
  if (threadIdx.x % 2 == 0)
      *value += 2; //Branch A
  else
      *value /= 3; //Branch B
  • Branches result in different instructions!

  12. Divergence • What happens: • Executes normally until if-statement • Branches to calculate Branch A (blue threads) • Goes back (!) and branches to calculate Branch B (red threads)

  13. Calculating polynomial values – does it matter? • Warp of threads 0-31: • (calculate some values) • Warp of threads 32-63: • (calculate some values) • Warp of threads 64-95: • (calculate some values) • Same instructions! Doesn’t matter!

  14. Linear reduction - does it matter? • (after calculating values…) • Warp of threads 0-31: • Thread 0: Accumulate sum • Threads 1-31: Do nothing • Warp of threads 32-63: • Do nothing • Warp of threads 64-95: • Do nothing • Doesn’t really matter… in this case.

  15. Improving our reduction • More threads participating in the process! • “Binary tree”

  16. Improving our reduction

  17. Improving our reduction
  //Let our shared memory block be partial_outputs[]...
  synchronize threads before starting...
  set offset to 1
  while ((offset * 2) <= block dimension):
      if (thread index % (offset * 2) is 0) AND you won’t exceed your block dimension:
          add partial_outputs[thread index + offset] to partial_outputs[thread index]
      double the offset
      synchronize threads
  Get thread 0 to atomicAdd() partial_outputs[0] to output
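A CUDA rendering of that pseudocode (a sketch; only partial_outputs[] and output come from the slide):

      __syncthreads(); //make sure partial_outputs[] is fully written
      for (unsigned int offset = 1; offset * 2 <= blockDim.x; offset *= 2) {
          if (threadIdx.x % (offset * 2) == 0 &&
                  threadIdx.x + offset < blockDim.x)
              partial_outputs[threadIdx.x] +=
                  partial_outputs[threadIdx.x + offset];
          __syncthreads(); //each round must finish before the next reads
      }
      if (threadIdx.x == 0)
          atomicAdd(output, partial_outputs[0]); //one atomic per block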

  18. Improving our reduction • What will happen? • Each warp will do meaningful execution on ~16 threads • You use lots more warps than you have to! • A “divergent tree” • How to improve?

  19. “Non-divergent tree”

  20. “Non-divergent tree”
  //Let our shared memory block be partial_outputs[]...
  set offset to highest power of 2 that’s less than the block dimension
  //For the first iteration, check that you don’t access
  //out of range memory
  while (offset >= 1):
      if (thread index < offset):
          add partial_outputs[thread index + offset] to partial_outputs[thread index]
      halve the offset
      synchronize threads
  Get thread 0 to atomicAdd() partial_outputs[0] to output
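And a CUDA rendering of the non-divergent version (same caveats as above; the bounds check handles the first iteration when blockDim.x isn't a power of 2):

      unsigned int offset = 1;
      while (offset * 2 < blockDim.x)
          offset *= 2; //highest power of 2 below the block dimension
      while (offset >= 1) {
          if (threadIdx.x < offset && threadIdx.x + offset < blockDim.x)
              partial_outputs[threadIdx.x] +=
                  partial_outputs[threadIdx.x + offset];
          offset /= 2;
          __syncthreads();
      }
      if (threadIdx.x == 0)
          atomicAdd(output, partial_outputs[0]);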

  21. “Non-divergent tree” • Suppose we have our block of 512 threads… • Warp of threads 0-31: • Accumulate result • … • Warp of threads 224-255: • Accumulate result • Warp of threads 256-287: • Do nothing • … • Warp of threads 480-511: • Do nothing

  22. “Non-divergent tree” • Suppose we’re now down to 32 threads… • Warp of threads 0-31: • Threads 0-15: Accumulate result • Threads 16-31: Do nothing • Much less divergence • Divergence only occurs in the middle, if ever!

  23. Reduction – Four Approaches • Atomic only • Divergent tree • Linear • Non-divergent tree • (diagram: all four accumulate into *output)

  24. Notes on files? (An aside) • Labs and CUDA programming typically have the following files involved: • ____.cc • Allows C++ code • g++ compiles this • ____.cu • Allows CUDA syntax • nvcc compiles this • ____.cuh • CUDA header file (declare accessible functions)
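To make the split concrete, a hypothetical layout for a lab like this one (file and function names are made up for illustration):

  //polynomial.cuh: CUDA header. Declares the entry points; safe to
  //#include from both .cc and .cu files.
  #pragma once
  void callPolynomialSum(const float *c, int n,
                         const float *r, int N, float *output);

  //polynomial.cu: compiled by nvcc; CUDA syntax allowed. Defines the
  //__global__ kernels and the wrapper that launches them, e.g.
  //  polynomialSum<<<blocks, threadsPerBlock>>>(c, n, r, N, output);

  //main.cc: compiled by g++; plain C++ only. Includes polynomial.cuh
  //and calls callPolynomialSum() without touching CUDA syntax.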

  25. Big Picture (so far) • CUDA allows large speedups! • Parallelism with CUDA: Different from CPU parallelism • Threads are “small” • Memory to think about • Increase speedup by thinking about problem constraints! • Reduction • (and more!)

  26. Big Picture (so far) • Steps in these two problems are widely applicable! • Dot product • Norm calculation • (and more)

  27. Other notes • Office hours: • Monday: 8-10 PM • Tuesday: 8-10 PM • Lab 1 due Wednesday, 5 PM
