Understanding GPU Thread Hierarchy and Divergence for Enhanced Performance

CS 179: Lecture 4 Lab Review 2

Groups of Threads (Hierarchy) (largest to smallest) • “Grid”: • All of the threads • Size: (number of threads per block) * (number of blocks) • “Block”: • Size: User-specified • Should be a multiple of 32, the warp size (often, larger is better) • Upper limit given by hardware (512 threads per block in Tesla, 1024 in Fermi) • Features: • Shared memory • Synchronization
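As a small worked example of the size arithmetic above, here is a host-side sketch in plain Python (not GPU code). The problem size `n` and the helper names `blocks_needed` and `grid_size` are assumptions made for illustration; they do not come from the lecture.

```python
WARP_SIZE = 32  # threads per warp, as stated in the slides


def blocks_needed(n, threads_per_block):
    """Smallest number of blocks whose threads cover n elements (ceiling division)."""
    return (n + threads_per_block - 1) // threads_per_block


def grid_size(threads_per_block, num_blocks):
    """Total threads in the grid: (threads per block) * (number of blocks)."""
    return threads_per_block * num_blocks


n = 1000                  # hypothetical number of elements to process
threads_per_block = 256   # a multiple of 32, below the 512-thread Tesla limit
num_blocks = blocks_needed(n, threads_per_block)

print(num_blocks, grid_size(threads_per_block, num_blocks))  # 4 1024
```

The grid ends up with 24 more threads than elements, which is why kernels usually check that a thread's index is in range before doing any work.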

Groups of Threads • “Warp”: • Group of 32 threads • Execute in lockstep (same instructions) • Susceptible to divergence!
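To make the grouping concrete, the sketch below maps a thread index within a block to its warp and to its position (lane) inside that warp. It is plain Python rather than GPU code, and it assumes warps are formed from consecutive thread indices, which matches the way the later analysis slides talk about the "warp of threads 0-31". The function name `warp_and_lane` is hypothetical.

```python
WARP_SIZE = 32  # threads per warp


def warp_and_lane(thread_index):
    """Return (warp number, lane within the warp) for a thread index in a block."""
    return thread_index // WARP_SIZE, thread_index % WARP_SIZE


print(warp_and_lane(0))    # (0, 0)
print(warp_and_lane(70))   # (2, 6)
print(warp_and_lane(511))  # (15, 31)
print(512 // WARP_SIZE)    # 16 warps in a 512-thread block
```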

Divergence “Two roads diverged in a wood… …and I took both”

Divergence • What happens: • Executes normally until if-statement • Branches to calculate Branch A (blue threads) • Goes back (!) and branches to calculate Branch B (red threads)
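The cost of taking both roads can be sketched with a deliberately simplified model: a warp has to run every branch body that at least one of its threads takes, one after the other. The Python below is a host-side illustration of that model only, not a description of the actual hardware scheduler, and `branch_passes` is a hypothetical helper.

```python
WARP_SIZE = 32  # threads per warp


def branch_passes(takes_branch_a):
    """Number of branch bodies a warp executes at one if/else.

    takes_branch_a holds one boolean per thread in the warp. Under the
    simplified model, the warp runs Branch A if any thread takes it and
    Branch B if any thread takes it, serially.
    """
    assert len(takes_branch_a) == WARP_SIZE
    return len(set(takes_branch_a))


uniform = [True] * WARP_SIZE                        # every thread takes Branch A
mixed = [lane % 2 == 0 for lane in range(WARP_SIZE)]  # even lanes A, odd lanes B

print(branch_passes(uniform), branch_passes(mixed))  # 1 2
```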

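“Divergent tree” (figure) Assume 512 threads in block… Accumulating threads at each level: … 506, 508, 510 (every 2nd thread) • … 500, 504, 508 (every 4th) • … 488, 496, 504 (every 8th) • … 464, 480, 496 (every 16th)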

“Divergent tree” Assumes block size is power of 2…
//Let our shared memory block be partial_outputs[]...
synchronize threads before starting...
set offset to 1
while ( (offset * 2) <= block dimension):
    if (thread index % (offset * 2) is 0):
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    double the offset
    synchronize threads
Get thread 0 to atomicAdd() partial_outputs[0] to output
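The pseudocode above can be checked on the host with a sequential simulation. The following is a minimal Python sketch, not a CUDA kernel: the inner `for` loop stands in for the threads of one block, and finishing the loop before doubling the offset plays the role of "synchronize threads". The function name `divergent_tree_sum` is an assumption for illustration.

```python
def divergent_tree_sum(partial_outputs):
    """Simulate the divergent-tree reduction over one block's shared memory."""
    data = list(partial_outputs)
    n = len(data)  # block dimension
    assert n > 0 and n & (n - 1) == 0, "block size must be a power of 2"

    offset = 1
    while offset * 2 <= n:
        # Each simulated thread whose index is a multiple of (offset * 2)
        # accumulates its neighbor; every other thread does nothing.
        for thread_index in range(n):
            if thread_index % (offset * 2) == 0:
                data[thread_index] += data[thread_index + offset]
        offset *= 2  # the end of the for loop acts as the barrier

    return data[0]  # thread 0 would atomicAdd() this to the output


print(divergent_tree_sum(range(512)))  # 130816
```

Running the threads one after another is safe here because, within an iteration, the elements being read (index + offset) are never elements that another active thread writes.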

“Non-divergent tree” (figure) Example purposes only! Real blocks are way bigger!

“Non-divergent tree” Assumes block size is power of 2…
//Let our shared memory block be partial_outputs[]...
synchronize threads before starting...
set offset to highest power of 2 that’s less than the block dimension
while (offset >= 1):
    if (thread index < offset):
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    halve the offset
    synchronize threads
Get thread 0 to atomicAdd() partial_outputs[0] to output
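As with the divergent version, this pseudocode can be simulated sequentially on the host. The Python sketch below is an illustration rather than a CUDA kernel; the inner loop stands in for the active threads of one block, and `non_divergent_tree_sum` is a hypothetical name.

```python
def non_divergent_tree_sum(partial_outputs):
    """Simulate the non-divergent-tree reduction over one block's shared memory."""
    data = list(partial_outputs)
    n = len(data)  # block dimension
    assert n > 0 and n & (n - 1) == 0, "block size must be a power of 2"

    # For a power-of-2 block, the highest power of 2 below n is n / 2.
    offset = n // 2
    while offset >= 1:
        # Only the first `offset` threads accumulate, so the active threads
        # are contiguous instead of being spread across every warp.
        for thread_index in range(offset):
            data[thread_index] += data[thread_index + offset]
        offset //= 2  # the end of the for loop acts as the barrier

    return data[0]  # thread 0 would atomicAdd() this to the output


print(non_divergent_tree_sum(range(512)))  # 130816
```

The result matches the divergent tree; only the assignment of work to threads differs.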

“Divergent tree” Where is the divergence? • Two branches: • Accumulate • Do nothing • If the second branch does nothing, then where is the performance loss?

“Divergent tree” – Analysis • First iteration: (Reduce 512 -> 256): • Warp of threads 0-31: (After calculating polynomial) • Thread 0: Accumulate • Thread 1: Do nothing • Thread 2: Accumulate • Thread 3: Do nothing • … • Warp of threads 32-63: • (same thing!) • … • (up to) Warp of threads 480-511 • Number of executing warps: 512 / 32 = 16

“Divergent tree” – Analysis • Second iteration: (Reduce 256 -> 128): • Warp of threads 0-31: (After calculating polynomial) • Thread 0: Accumulate • Threads 1-3: Do nothing • Thread 4: Accumulate • Threads 5-7: Do nothing • … • Warp of threads 32-63: • (same thing!) • … • (up to) Warp of threads 480-511 • Number of executing warps: 16 (again!)

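“Divergent tree” – Analysis • (Process continues with all 16 warps executing until the spacing between accumulating threads, offset * 2, exceeds the warp size of 32; only then do entire warps go idle)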
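The per-iteration warp counts in this analysis can be reproduced with a short host-side calculation. The Python sketch below assumes a 512-thread block and warps made of consecutive thread indices, as in the slides; a warp counts as executing if at least one of its threads accumulates. The name `active_warps_divergent` is hypothetical.

```python
WARP_SIZE = 32  # threads per warp


def active_warps_divergent(block_dim):
    """Executing warps per iteration of the divergent-tree reduction."""
    counts = []
    offset = 1
    while offset * 2 <= block_dim:
        # Collect the warps that contain at least one accumulating thread.
        warps = {
            thread_index // WARP_SIZE
            for thread_index in range(block_dim)
            if thread_index % (offset * 2) == 0
        }
        counts.append(len(warps))
        offset *= 2
    return counts


counts = active_warps_divergent(512)
print(counts)       # [16, 16, 16, 16, 16, 8, 4, 2, 1]
print(sum(counts))  # 95
```

All 16 warps stay busy for the first five iterations even though the number of accumulating threads halves each time.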

“Non-divergent tree” – Analysis • First iteration: (Reduce 512 -> 256): (Part 1) • Warp of threads 0-31: • Accumulate • Warp of threads 32-63: • Accumulate • … • (up to) Warp of threads 224-255 • Then what?

“Non-divergent tree” – Analysis • First iteration: (Reduce 512 -> 256): (Part 2) • Warp of threads 256-287: • Do nothing! • … • (up to) Warp of threads 480-511 • Number of executing warps: 256 / 32 = 8 (Was 16 previously!)

“Non-divergent tree” – Analysis • Second iteration: (Reduce 256 -> 128): • Warp of threads 0-31, …, 96-127: • Accumulate • Warp of threads 128-159, …, 480-511 • Do nothing! • Number of executing warps: 128 / 32 = 4 (Was 16 previously!)
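The same kind of host-side count for the non-divergent tree shows how quickly whole warps drop out. This Python sketch again assumes a 512-thread block and warps of consecutive thread indices; `active_warps_non_divergent` is a hypothetical name.

```python
WARP_SIZE = 32  # threads per warp


def active_warps_non_divergent(block_dim):
    """Executing warps per iteration of the non-divergent-tree reduction."""
    counts = []
    offset = block_dim // 2  # highest power of 2 below a power-of-2 block size
    while offset >= 1:
        # Threads 0 .. offset-1 accumulate; count the warps they occupy.
        warps = {thread_index // WARP_SIZE for thread_index in range(offset)}
        counts.append(len(warps))
        offset //= 2
    return counts


counts = active_warps_non_divergent(512)
print(counts)       # [8, 4, 2, 1, 1, 1, 1, 1, 1]
print(sum(counts))  # 20
```

Once the offset falls below 32, the single remaining warp is only partly active, so some branching inside a warp still occurs in the last few iterations; the saving comes from the other warps having nothing to execute at all.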

What happened? • “Implicit divergence”

Why did we do this? • Performance improvements • Reveals GPU internals!

Final Puzzle • What happens when the polynomial order increases? • All these threads that we think are competing… are they?

The Real World

In medicine… • More sensitive devices -> more data! • More intensive algorithms • Real-time imaging and analysis • Most are parallelizable problems! http://www.varian.com

MRI • “k-space” – Inverse FFT • Real-time and high-resolution imaging http://oregonstate.edu

CT, PET • Low-dose techniques • Safety! • 4D CT imaging • X-ray CT vs. PET CT • Texture memory! http://www.upmccancercenter.com/

Radiation Therapy • Goal: Give sufficient dose to cancerous cells, minimize dose to healthy cells • More accurate algorithms possible! • Accuracy = safety! • 40 minutes -> 10 seconds http://en.wikipedia.org

Notes • Office hours: • Kevin: Monday 8-10 PM • Ben: Tuesday 7-9 PM • Connor: Tuesday 8-10 PM • Lab 2: Due Wednesday (4/16), 5 PM
