  1. EECS 583 – Class 21: Research Topic 3: Compilation for GPUs. University of Michigan. December 12, 2011 – Last Class!!

  2. Announcements & Reading Material • This class's reading: “Program optimization space pruning for a multithreaded GPU,” S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, J. Stratton, and W. Hwu, Proc. Intl. Symp. on Code Generation and Optimization, Mar. 2008 • Project demos • Dec 13–16 and 19 (the 19th is full) • Send me an email with a date and a few timeslots if your group does not have a slot • Almost all groups have signed up already

  3. Project Demos • Demo format • Each group gets 30 mins • Strict deadlines because many groups are scheduled back to back • Don't be late! • Plan for 20 mins of presentation (no more!) and 10 mins of questions • Some slides are helpful; try to have all group members say something • Talk about what you did (basic idea, previous work), how you did it (approach + implementation), and results • A demo or real code examples are good • Report • 5 pages double-spaced, including figures – same content as the presentation • Due either when you do your demo or on Dec 19 at 6pm

  4. Midterm Results: Mean 97.9, StdDev 13.9, High 128, Low 50. The answer key is on the course webpage; pick up graded exams from Daya. If you did poorly, all is not lost – this is a grad class, and the project is by far the most important!!

  5. Why GPUs?

  6. Efficiency of GPUs • High Flop per Watt: GTX 285: 5.2 GFLOP/W; i7: 0.78 GFLOP/W • High Flop per Dollar: GTX 285: 3.54 GFLOP/$; i7: 0.36 GFLOP/$ • High Memory Bandwidth: GTX 285: 159 GB/sec; i7: 32 GB/sec • High Flop Rate: GTX 285: 1062 GFLOPS; i7: 102 GFLOPS • High DP Flop Rate: GTX 285: 88.5 GFLOPS; GTX 480: 168 GFLOPS; i7: 51 GFLOPS
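
  (Sanity check on those ratios, assuming the commonly quoted board powers of roughly 204 W for the GTX 285 and 130 W for the Nehalem-era i7: GFLOP/W is just peak rate over power, 1062 / 204 ≈ 5.2 and 102 / 130 ≈ 0.78.)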

  7. GPU Architecture • [diagram: 30 streaming multiprocessors (SM 0–SM 29), each with 8 scalar processors (0–7), a shared memory, and a register file; the SMs connect through an interconnection network to global (device) memory, and the host CPU and host memory attach via a PCIe bridge]
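
  The structural parameters in this diagram can be read back at runtime through the CUDA runtime API. A minimal sketch (assumes device 0 and omits error checking):

     #include <cstdio>
     #include <cuda_runtime.h>

     int main() {
         cudaDeviceProp prop;
         cudaGetDeviceProperties(&prop, 0);               // query device 0
         printf("SMs: %d\n", prop.multiProcessorCount);   // 30 on a GTX 285
         printf("Shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
         printf("Registers per block: %d\n", prop.regsPerBlock);
         printf("Global memory: %zu bytes\n", prop.totalGlobalMem);
         return 0;
     }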

  8. CUDA • “Compute Unified Device Architecture” • General-purpose programming model • User kicks off batches of threads on the GPU • Advantages of CUDA • Interface designed for compute – a graphics-free API • Orchestration of on-chip cores • Explicit GPU memory management • Full support for integer and bitwise operations

  9. Programming Model • [diagram: the host launches Kernel 1 on the device as Grid 1 and, later in time, Kernel 2 as Grid 2; each kernel executes as a grid of thread blocks]
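
  A minimal sketch of this model in CUDA C. The kernel names, sizes, and data are made up for illustration; the point is that the host launches each kernel as a grid of thread blocks, and back-to-back launches on the default stream execute in order (Grid 1, then Grid 2):

     #include <cuda_runtime.h>

     __global__ void kernel1(float *d, int n) {
         int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
         if (i < n) d[i] = 2.0f * i;
     }

     __global__ void kernel2(float *d, int n) {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n) d[i] += 1.0f;
     }

     int main() {
         int n = 1 << 20;
         float *d;
         cudaMalloc(&d, n * sizeof(float));      // explicit GPU memory management
         int threads = 256, blocks = (n + threads - 1) / threads;
         kernel1<<<blocks, threads>>>(d, n);     // host launches Grid 1
         kernel2<<<blocks, threads>>>(d, n);     // then Grid 2, after Grid 1 drains
         cudaDeviceSynchronize();
         cudaFree(d);
         return 0;
     }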

  10. GPU Scheduling • [diagram: the thread blocks of Grid 1 are distributed across the 30 SMs for execution; each SM supplies its 8 scalar processors, shared memory, and registers to the blocks it hosts]

  11. Warp Generation • [diagram: SM 0 holds Blocks 0–3; within Block 0, threads are grouped by ThreadId into 32-wide warps – Warp 0 holds threads 0–31, Warp 1 holds threads 32–63 – and all warps on the SM share its shared memory and registers]
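
  The hardware fixes the warp size at 32 threads on this generation, so each thread can recover its warp and lane from its thread index. A small sketch (buffer names are hypothetical), launched to mirror the diagram's four blocks of 64 threads (two warps each):

     #include <cuda_runtime.h>

     __global__ void warp_ids(int *warp_id, int *lane_id) {
         int t = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
         warp_id[t] = threadIdx.x / 32;  // threads 0-31 of a block form warp 0, 32-63 form warp 1
         lane_id[t] = threadIdx.x % 32;  // position within the warp
     }

     int main() {
         const int blocks = 4, threads = 64, n = blocks * threads;
         int *w, *l;
         cudaMalloc(&w, n * sizeof(int));
         cudaMalloc(&l, n * sizeof(int));
         warp_ids<<<blocks, threads>>>(w, l);    // Blocks 0-3, each split into Warp 0 and Warp 1
         cudaDeviceSynchronize();
         cudaFree(w); cudaFree(l);
         return 0;
     }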

  12. Memory Hierarchy • Per-thread registers: int RegisterVar • Per-thread local memory: int LocalVarArray[10] • Per-block shared memory: __shared__ int SharedVar • Per-app global memory: __device__ int GlobalVar • Per-app constant memory: __constant__ int ConstVar • Per-app texture memory: texture<float, 1, cudaReadModeElementType> TextureVar • Registers and local memory are private to a thread; shared memory is visible to all threads of a block; global, constant, and texture memory are visible to the whole app (host and device)
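
  Pulled together into one compilable sketch using the 2011-era API (the texture-reference form shown here is deprecated in modern CUDA); note that global-memory variables take the __device__ qualifier, since __global__ marks kernel functions:

     #include <cuda_runtime.h>

     __device__   int GlobalVar;    // per-app global (device) memory
     __constant__ int ConstVar;     // per-app constant memory (cached, read-only in kernels)
     texture<float, 1, cudaReadModeElementType> TextureVar;  // per-app texture memory (legacy)

     __global__ void scopes(int *out) {
         __shared__ int SharedVar;        // per-block shared memory: one copy per thread block
         int RegisterVar = threadIdx.x;   // per-thread register
         int LocalVarArray[10];           // per-thread local memory (arrays may spill here)
         LocalVarArray[0] = RegisterVar;
         if (threadIdx.x == 0) SharedVar = GlobalVar;
         __syncthreads();                 // make SharedVar visible to the whole block
         out[threadIdx.x] = SharedVar + ConstVar + LocalVarArray[0];
     }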

  13. Discussion Points • Who has written CUDA? How have you optimized it, and how long did it take? • Did you tune using a better algorithm than trial and error? • Is there any hope of building a GPU compiler that can automatically do what CUDA programmers do? • How would you do it? • What's the input language? C, C++, Java, StreamIt? • Are GPUs a compiler writer's best friend or worst enemy? • What about non-scientific codes – can they be mapped to GPUs? • How can GPUs be made more “general”?
