
ECE 569 High Performance Processors and Systems


Presentation Transcript


  1. ECE 569 High Performance Processors and Systems • Administrative • HW2 due Thursday 2/13 @ start of class • GPUs • Conditional execution • Current state of the art ECE 569 -- 11 Feb 2014

  2. GPUs == SIMD [Figure: SIMD organization: a single Control Processor fed by Instruction Memory broadcasts each instruction to an array of processing elements, each with its own Registers and ALU, all connected to RAM] ECE 569 -- 11 Feb 2014

  3. A Simple Execution Model • No branch prediction • Just evaluate branch targets and wait for resolution • But wait is only a small number of cycles once data is loaded from global memory • No speculation • Only execute useful instructions Lecture 07 ECE 569 -- 11 Feb 2014

  4. Example: conditionals if (threadIdx.x >= 2) { out[threadIdx.x] += 100; } else { out[threadIdx.x] += 10; } Every thread in the warp first executes: compare threadIdx.x, 2 [Figure: Instruction Unit broadcasting to processors P0, P1, ..., PM-1, each with its own register file, over shared Memory] Lecture 07 ECE 569 -- 11 Feb 2014

  5. then part if (threadIdx.x >= 2) { out[threadIdx.x] += 100; } else { out[threadIdx.x] += 10; } All threads in the warp where the condition is true do the then part: load … add 100 store … The other threads in the warp are "masked off" and wait… [Figure: Instruction Unit broadcasting to processors P0, P1, ..., PM-1; processors with the condition true are active (✔), the rest are masked off (X)] Lecture 07 ECE 569 -- 11 Feb 2014

  6. else part if (threadIdx.x >= 2) { out[threadIdx.x] += 100; } else { out[threadIdx.x] += 10; } All threads in the warp where the condition is false do the else part: load … add 10 store … The other threads in the warp are "masked off" and wait… [Figure: Instruction Unit broadcasting to processors P0, P1, ..., PM-1; the processors that ran the then part are now masked off (X), the rest are active (✔)] Lecture 07 ECE 569 -- 11 Feb 2014
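To make slides 4-6 concrete, here is a minimal self-contained CUDA sketch of the same branch. The kernel name branch_demo, the 8-thread launch, and the host-side setup are illustrative assumptions, not part of the slides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread takes one side of the branch; within a warp the hardware masks
// off the threads on the other side, so the two sides are issued one after
// the other.
__global__ void branch_demo(int *out)
{
    int t = threadIdx.x;
    if (t >= 2) {
        out[t] += 100;   // "then" part: threads 2, 3, ...
    } else {
        out[t] += 10;    // "else" part: threads 0 and 1
    }
}

int main()
{
    const int N = 8;                       // illustrative size, not from the slide
    int h_out[N] = {0};

    int *d_out;
    cudaMalloc((void **)&d_out, N * sizeof(int));
    cudaMemcpy(d_out, h_out, N * sizeof(int), cudaMemcpyHostToDevice);

    branch_demo<<<1, N>>>(d_out);          // one block of 8 threads
    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
        printf("out[%d] = %d\n", i, h_out[i]);   // 10, 10, 100, 100, ...
    cudaFree(d_out);
    return 0;
}
```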

  7. Terminology • Divergent paths • Different threads within a warp take different control flow paths within a kernel function • N divergent paths in a warp? • result from nested conditionals… • An N-way divergent warp is serially issued over the N different paths using a hardware stack and per-thread predication logic to only write back results from the threads taking each divergent path. • Performance decreases by about a factor of N Lecture 07 ECE 569 -- 11 Feb 2014
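As a rough illustration of how nested conditionals multiply paths, the sketch below (kernel name and data layout are invented for this example) leaves a single warp with up to four divergent paths, which the hardware would issue serially at roughly 4x the cost of one path.

```cuda
// Hypothetical kernel: two nested conditions give up to four divergent paths
// inside one warp (true/true, true/false, false/true, false/false), so the
// warp is serially issued over as many as 4 paths.
__global__ void four_way_divergence(int *out, const int *a, const int *b)
{
    int t = threadIdx.x;
    if (a[t] > 0) {
        if (b[t] > 0) out[t] = 1;   // path 1
        else          out[t] = 2;   // path 2
    } else {
        if (b[t] > 0) out[t] = 3;   // path 3
        else          out[t] = 4;   // path 4
    }
}
```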

  8. Example: Vector Reduction [Figure: reduction tree over the array elements. Iteration 1: even-numbered threads (Thread 0, 2, 4, 6, 8, 10, …) add adjacent pairs (0+1, 2+3, 4+5, 6+7, 8+9, 10+11, …). Iteration 2: partial sums of 0..3, 4..7, 8..11, …. Iteration 3: partial sums of 0..7 and 8..15. Vertical axis: iterations; horizontal axis: array elements]

  9. Implementation unsigned int t = threadIdx.x; for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) { __syncthreads(); if (t % (2*stride) == 0) d_out[t] += d_out[t+stride]; } Lecture 07 ECE 569 -- 11 Feb 2014
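The loop on slide 9 is only the inner piece of a kernel. A plausible complete kernel built around it might look like the sketch below; the kernel name naive_reduce, the shared-memory staging of the input, and the per-block partial-sum output are assumptions for illustration, not from the slide.

```cuda
// Naive tree reduction over one block's worth of data.  Each iteration halves
// the number of active threads, but the survivors are the threads whose index
// is a multiple of 2*stride, so they stay scattered across the warps and
// almost every warp remains divergent until it is completely idle.
__global__ void naive_reduce(const float *d_in, float *d_result)
{
    extern __shared__ float d_out[];             // one element per thread
    unsigned int t = threadIdx.x;
    d_out[t] = d_in[blockIdx.x * blockDim.x + t];

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();                         // finish the previous level
        if (t % (2 * stride) == 0)
            d_out[t] += d_out[t + stride];
    }
    if (t == 0)
        d_result[blockIdx.x] = d_out[0];         // one partial sum per block
}
```

A launch of the form naive_reduce<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_result), with a power-of-two block size, would supply the dynamic shared-memory array assumed above.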

  10. Some Observations • In each iteration there are two control flow paths • Threads that perform the addition and threads that do not • Threads that do not perform the addition may still cost extra cycles, depending on how divergence is implemented • No more than half of the threads are executing at any time • All odd-index threads are disabled from the very first iteration! • On average, fewer than ¼ of the threads across all warps are active over the course of the reduction Lecture 07 ECE 569 -- 11 Feb 2014

  11. A better approach [Figure: for a 32-element array, in iteration 1 threads 0..15 each add the element 16 positions away into their own element (0+16, …, 15+31); later iterations halve the stride, so the active threads always form a contiguous block starting at thread 0]

  12. A better implementation unsigned int t = threadIdx.x; for (unsigned int stride = blockDim.x >> 1; stride >= 1; stride >>= 1) { __syncthreads(); if (t < stride) d_out[t] += d_out[t+stride]; } Lecture 07 ECE 569 -- 11 Feb 2014
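For comparison, the same wrapper around the improved loop; as before, the kernel name better_reduce and the surrounding staging code are illustrative assumptions rather than slide content.

```cuda
// Improved reduction: the active threads are always the contiguous range
// 0..stride-1, so whole warps go idle (and can be retired) as the stride
// shrinks; only the last 5 iterations (stride < 32) diverge within a warp.
__global__ void better_reduce(const float *d_in, float *d_result)
{
    extern __shared__ float d_out[];
    unsigned int t = threadIdx.x;
    d_out[t] = d_in[blockIdx.x * blockDim.x + t];

    for (unsigned int stride = blockDim.x >> 1; stride >= 1; stride >>= 1) {
        __syncthreads();
        if (t < stride)
            d_out[t] += d_out[t + stride];
    }
    if (t == 0)
        d_result[blockIdx.x] = d_out[0];
}
```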

  13. Some Observations • Only the last 5 iterations will have divergence within a warp • Entire warps will be shut down as iterations progress • For a 512-thread block, it takes 4 iterations to shut down all but one warp in each block • Better resource utilization: warps and blocks retire faster Lecture 07 ECE 569 -- 11 Feb 2014

  14. Optimization: Predicated Execution <p1> LDR r1,r2,0 • If p1 is TRUE, instruction executes normally • If p1 is FALSE, instruction treated as NOP ECE 569 -- 11 Feb 2014

  15. Predication Example
Source:
  ...
  if (X == 10) c = c + 1;
  ...
Predicated code:
  ...
  LDR r5, X
  p1 <- r5 eq 10
  <p1> LDR r1, C
  <p1> ADD r1, r1, 1
  <p1> STR r1, C
  ...
ECE 569 -- 11 Feb 2014
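At the CUDA source level, the same idea can be encouraged by writing the conditional branch-free, so the compiler can emit predicated or select instructions instead of a branch. The rewrite below of the earlier conditional is a sketch of that pattern, not something shown on the slides; whether the compiler actually predicates is its decision.

```cuda
// Branch version: the compiler may emit an actual (potentially divergent) branch.
__global__ void with_branch(int *out)
{
    int t = threadIdx.x;
    if (t >= 2) out[t] += 100;
    else        out[t] += 10;
}

// Branchless version: the condition only selects a value, so the compiler is
// free to use predicated/select instructions and no divergence is needed.
__global__ void with_predication(int *out)
{
    int t = threadIdx.x;
    out[t] += (t >= 2) ? 100 : 10;
}
```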

  16. Predication optimizes execution of a simple if-then-else [Figure: control-flow graph with branch A → (B or C) → D, collapsed by predication into the straight-line sequence A, B, C, D] ECE 569 -- 11 Feb 2014

  17. Predication == better scheduling
Before scheduling (all of B, then all of C):
  p1,p2 <- r5 eq 10
  <p1> inst 1 from B
  <p1> inst 2 from B
  ...
  <p2> inst 1 from C
  <p2> inst 2 from C
  ...
After scheduling (B and C interleaved):
  p1,p2 <- r5 eq 10
  <p1> inst 1 from B
  <p2> inst 1 from C
  <p1> inst 2 from B
  <p2> inst 2 from C
  ...
ECE 569 -- 11 Feb 2014

  18. State of the Art with GPUs Lecture 07 ECE 569 -- 11 Feb 2014

  19. nVidia Kepler GK110 • 7.1B transistors • 2,880 cores • 2012 model :-) ECE 569 -- 11 Feb 2014

  20. TITAN: 2nd fastest supercomputer • 560,640 total cores: ½ CPU and ½ GPU • CPUs: 18,688 AMD Opteron (16-core) • GPUs: 18,688 nVidia Tesla K20x • 710,144 GB RAM • 8,000 kWatts of power ECE 569 -- 11 Feb 2014
