CS 179: Lecture 4 – Lab Review 2
Groups of Threads (Hierarchy) (largest to smallest) • “Grid”: • All of the threads • Size: (number of threads per block) * (number of blocks) • “Block”: • Size: User-specified • Should at least be a multiple of 32 (often, higher is better) • Upper limit given by hardware (512 in Tesla, 1024 in Fermi) • Features: • Shared memory • Synchronization
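To make the grid/block numbers concrete, here is a minimal, hypothetical launch. The kernel, array, and sizes are invented for illustration and are not part of the lab:

#include <cuda_runtime.h>

// Illustrative kernel: each thread handles one array element.
__global__ void scale(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n)
        data[idx] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    // Block size: user-specified, a multiple of 32, within the hardware
    // cap (512 on Tesla, 1024 on Fermi).
    const int threadsPerBlock = 256;
    // Grid size = (number of blocks) * (threads per block) total threads.
    const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    scale<<<numBlocks, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}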
Groups of Threads • “Warp”: • Group of 32 threads • Execute in lockstep (same instructions) • Susceptible to divergence!
Divergence “Two roads diverged in a wood… …and I took both”
Divergence • What happens: • Executes normally until if-statement • Branches to calculate Branch A (blue threads) • Goes back (!) and branches to calculate Branch B (red threads)
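A minimal sketch of what triggers this, with an invented kernel and condition; within a warp, the hardware runs Branch A with one set of threads masked off, then goes back and runs Branch B with the other set masked off:

__global__ void diverge(float *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx % 2 == 0)
        out[idx] = 1.0f;   // Branch A: even-numbered threads
    else
        out[idx] = -1.0f;  // Branch B: odd-numbered threads
    // Every warp executes both branches serially here, since each warp
    // contains both even and odd thread indices.
}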
“Divergent tree” – Assume 512 threads in block…
[Diagram: reduction tree over thread indices 0–511; at each level the surviving indices double their spacing (…, 506, 508, 510 → …, 500, 504, 508 → …, 488, 496, 504 → …, 464, 480, 496)]
“Divergent tree” – Assumes block size is a power of 2…

// Let our shared memory block be partial_outputs[]...
synchronize threads before starting
set offset to 1
while ((offset * 2) <= block dimension):
    if (thread index % (offset * 2) is 0):
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    double the offset
    synchronize threads
Get thread 0 to atomicAdd() partial_outputs[0] to output
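For concreteness, here is one way that pseudocode could look as a CUDA kernel. Everything beyond the slide’s pseudocode is an assumption for illustration: the input array, the fixed 512-thread block, and the omitted polynomial step.

__global__ void reduce_divergent(const float *input, float *output) {
    // Assumes the kernel is launched with exactly 512 threads per block
    // (a power of 2, as the slide requires).
    __shared__ float partial_outputs[512];

    unsigned int tid = threadIdx.x;
    // Each thread loads one element (the lab's polynomial evaluation
    // would go here instead of a plain load).
    partial_outputs[tid] = input[blockIdx.x * blockDim.x + tid];
    __syncthreads();  // synchronize threads before starting

    // Stride doubles each pass; active threads (index divisible by
    // offset * 2) sit next to idle ones in the same warp -- divergence.
    for (unsigned int offset = 1; offset * 2 <= blockDim.x; offset *= 2) {
        if (tid % (offset * 2) == 0)
            partial_outputs[tid] += partial_outputs[tid + offset];
        __syncthreads();
    }

    // Thread 0 folds this block's result into the global output.
    // (atomicAdd on float needs compute capability 2.0 or newer.)
    if (tid == 0)
        atomicAdd(output, partial_outputs[0]);
}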
“Non-divergent tree” (Example purposes only! Real blocks are way bigger!)
“Non-divergent tree” – Assumes block size is a power of 2…

// Let our shared memory block be partial_outputs[]...
synchronize threads before starting
set offset to highest power of 2 that’s less than the block dimension
while (offset >= 1):
    if (thread index < offset):
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    halve the offset
    synchronize threads
Get thread 0 to atomicAdd() partial_outputs[0] to output
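The same CUDA sketch for the non-divergent version, under the same assumptions (512-thread blocks, a plain load in place of the polynomial step). Only the loop changes: active threads are packed into the lowest-numbered warps.

__global__ void reduce_nondivergent(const float *input, float *output) {
    __shared__ float partial_outputs[512];  // assumes 512 threads per block

    unsigned int tid = threadIdx.x;
    partial_outputs[tid] = input[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Stride halves each pass; threads 0..offset-1 do all the work, so
    // whole warps fall idle together instead of diverging internally.
    for (unsigned int offset = blockDim.x / 2; offset >= 1; offset /= 2) {
        if (tid < offset)
            partial_outputs[tid] += partial_outputs[tid + offset];
        __syncthreads();
    }

    if (tid == 0)
        atomicAdd(output, partial_outputs[0]);
}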
“Divergent tree” – Where is the divergence? • Two branches: • Accumulate • Do nothing • If the second branch does nothing, then where is the performance loss?
“Divergent tree” – Analysis • First iteration: (Reduce 512 -> 256): • Warp of threads 0-31: (After calculating polynomial) • Thread 0: Accumulate • Thread 1: Do nothing • Thread 2: Accumulate • Thread 3: Do nothing • … • Warp of threads 32-63: • (same thing!) • … • (up to) Warp of threads 480-511 • Number of executing warps: 512 / 32 = 16
“Divergent tree” – Analysis • Second iteration: (Reduce 256 -> 128): • Warp of threads 0-31: (After calculating polynomial) • Thread 0: Accumulate • Threads 1-3: Do nothing • Thread 4: Accumulate • Threads 5-7: Do nothing • … • Warp of threads 32-63: • (same thing!) • … • (up to) Warp of threads 480-511 • Number of executing warps: 16 (again!)
“Divergent tree” – Analysis • (Process continues with 16 executing warps per iteration, until the stride offset * 2 exceeds the warp size of 32; only then do entire warps fall idle and the count drops)
“Non-divergent tree” – Analysis • First iteration: (Reduce 512 -> 256): (Part 1) • Warp of threads 0-31: • Accumulate • Warp of threads 32-63: • Accumulate • … • (up to) Warp of threads 224-255 • Then what?
“Non-divergent tree” – Analysis • First iteration: (Reduce 512 -> 256): (Part 2) • Warp of threads 256-287: • Do nothing! • … • (up to) Warp of threads 480-511 • Number of executing warps: 256 / 32 = 8 (Was 16 previously!)
“Non-divergent tree” – Analysis • Second iteration: (Reduce 256 -> 128): • Warps of threads 0-31, …, 96-127: • Accumulate • Warps of threads 128-159, …, 480-511: • Do nothing! • Number of executing warps: 128 / 32 = 4 (Was 16 previously!)
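To summarize both analyses in one place (a back-of-the-envelope count, not from the slides): for a 512-thread block, with stride $s = \mathit{offset} \cdot 2$, the executing-warp counts per iteration are

$$W_{\text{div}}(s) = \begin{cases} 16, & s \le 32 \\ 512/s, & s > 32 \end{cases} \qquad W_{\text{non-div}}(\mathit{offset}) = \max\!\left(\frac{\mathit{offset}}{32},\, 1\right)$$

Summing over all nine iterations gives 95 warp-iterations for the divergent tree against 20 for the non-divergent one, which is where the speedup comes from.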
What happened? • “Implicit divergence” • The branch is still there, but every thread within a warp now takes the same side of it, so warps whose threads all “do nothing” are simply never issued; divergence between warps costs nothing
Why did we do this? • Performance improvements • Reveals GPU internals!
Final Puzzle • What happens when the polynomial order increases? • All these threads that we think are competing… are they?
In medicine… • More sensitive devices -> more data! • More intensive algorithms • Real-time imaging and analysis • Most are parallelizable problems! http://www.varian.com
MRI • “k-space” – Inverse FFT • Real-time and high-resolution imaging http://oregonstate.edu
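As a sketch of how that reconstruction maps onto the GPU: cuFFT can take sampled k-space data to image space with an inverse 2D FFT. The buffer names and dimensions here are hypothetical, and a real pipeline would also handle sampling density and coil combination.

#include <cufft.h>

// d_kspace and d_image are ny x nx device buffers of cufftComplex
// (names and sizes are illustrative).
void reconstruct(cufftComplex *d_kspace, cufftComplex *d_image,
                 int ny, int nx) {
    cufftHandle plan;
    cufftPlan2d(&plan, ny, nx, CUFFT_C2C);                  // complex-to-complex plan
    cufftExecC2C(plan, d_kspace, d_image, CUFFT_INVERSE);   // k-space -> image
    cufftDestroy(plan);
    // Note: cuFFT's inverse transform is unnormalized; scale the result
    // by 1/(nx*ny) to recover the image.
}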
CT, PET • Low-dose techniques • Safety! • 4D CT imaging • X-ray CT vs. PET CT • Texture memory! http://www.upmccancercenter.com/
Radiation Therapy • Goal: Give sufficient dose to cancerous cells, minimize dose to healthy cells • More accurate algorithms possible! • Accuracy = safety! • 40 minutes -> 10 seconds http://en.wikipedia.org
Notes • Office hours: • Kevin: Monday 8-10 PM • Ben: Tuesday 7-9 PM • Connor: Tuesday 8-10 PM • Lab 2: Due Wednesday (4/16), 5 PM