This document covers the final session of EECS 583 at the University of Michigan, focusing on compilation for GPUs. It lists the assigned reading ("Program optimization space pruning for a multithreaded GPU") and specifies project demo requirements and timelines: each group gets 30 minutes, roughly 20 for presentation and 10 for questions. The slides also cover GPU efficiency, the CUDA programming model, and the challenges of optimizing GPU applications, along with discussion questions that prompt critical thinking about GPU architectures and compiler design.
EECS 583 – Class 21 Research Topic 3: Compilation for GPUs University of Michigan December 12, 2011 – Last Class!!
Announcements & Reading Material • This class reading • “Program optimization space pruning for a multithreaded GPU,” S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, J. Stratton, and W. Hwu, Proc. Intl. Sym. on Code Generation and Optimization, Mar. 2008. • Project demos • Dec 13-16, 19 (19th is full) • Send me an email with a date and a few timeslots if your group does not have a slot • Almost all groups have signed up already
Project Demos • Demo format • Each group gets 30 mins • Strict deadlines because many groups are scheduled back to back • Don’t be late! • Plan for 20 mins of presentation (no more!) and 10 mins of questions • Some slides are helpful; try to have all group members say something • Talk about what you did (basic idea, previous work), how you did it (approach + implementation), and your results • A demo or real code examples are good • Report • 5 pages double spaced including figures – same content as the presentation • Due either when you do your demo or by Dec 19 at 6pm
Midterm Results • Mean: 97.9, StdDev: 13.9, High: 128, Low: 50 • Answer key is on the course webpage; pick up graded exams from Daya • If you did poorly, all is not lost. This is a grad class; the project is by far the most important!!
Efficiency of GPUs • High Flop Per Watt – GTX 285: 5.2 GFLOP/W, i7: 0.78 GFLOP/W • High Flop Per Dollar – GTX 285: 3.54 GFLOP/$, i7: 0.36 GFLOP/$ • High Memory Bandwidth – GTX 285: 159 GB/Sec, i7: 32 GB/Sec • High Flop Rate – GTX 285: 1062 GFLOPS, i7: 102 GFLOPS • High DP Flop Rate – GTX 285: 88.5 GFLOPS, GTX 480: 168 GFLOPS, i7: 51 GFLOPS
GPU Architecture [Figure: 30 streaming multiprocessors (SM 0 – SM 29), each with 8 cores (0–7), a register file, and shared memory, connected through an interconnection network to global (device) memory; the host CPU and host memory attach via a PCIe bridge.]
CUDA • “Compute Unified Device Architecture” • General-purpose programming model • User kicks off batches of threads on the GPU • Advantages of CUDA • Interface designed for compute – graphics-free API • Orchestration of on-chip cores • Explicit GPU memory management • Full support for integer and bitwise operations
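A minimal sketch of the workflow described above (the scale kernel, array size, and launch configuration are hypothetical, not from the slides): the host explicitly allocates device memory, copies data over, and kicks off a batch of threads.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical kernel: each thread scales one element of the array.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i)
        h_data[i] = 1.0f;

    // Explicit GPU memory management: allocate device memory, copy host -> device.
    float *d_data;
    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Kick off a batch of threads on the GPU: 4096 blocks of 256 threads each.
    scale<<<n / 256, 256>>>(d_data, 2.0f, n);

    // Copy the result back and release device memory.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    printf("h_data[0] = %f\n", h_data[0]);
    free(h_data);
    return 0;
}
```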
Programming Model [Figure: the host launches Kernel 1, which executes as Grid 1 of thread blocks on the device; it then launches Kernel 2, which executes as Grid 2; the grids run one after another over time.]
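A host-side sketch of the timeline in the figure, with hypothetical kernels kernel1 and kernel2 and made-up grid/block dimensions: each launch creates a grid of thread blocks, and the two grids execute back to back in the default stream.

```cuda
// Hypothetical kernels standing in for Kernel 1 and Kernel 2 in the figure.
__global__ void kernel1(float *data) { /* ... */ }
__global__ void kernel2(float *data) { /* ... */ }

void run_two_kernels(float *d_data)
{
    dim3 grid1(4, 2);          // Grid 1: 4 x 2 thread blocks
    dim3 block1(8, 8);         // 64 threads per block
    kernel1<<<grid1, block1>>>(d_data);

    dim3 grid2(8, 4);          // Grid 2: a differently shaped grid
    dim3 block2(16, 16);       // 256 threads per block
    kernel2<<<grid2, block2>>>(d_data);   // runs after Grid 1 (same default stream)
}
```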
GPU Scheduling [Figure: the thread blocks of Grid 1 are distributed across the streaming multiprocessors (SM 0, SM 1, SM 2, SM 3, ..., SM 30), each SM with its own cores 0–7, registers, and shared memory.]
Warp Generation [Figure: Block 0 (ThreadIds 0–63) is assigned to SM0 and split into Warp 0 (threads 0–31) and Warp 1 (threads 32–63), sharing SM0's registers and shared memory; Blocks 1–3 wait to be scheduled.]
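A small device-side sketch (hypothetical kernel and output array) of the warp mapping in the figure: consecutive threads of a block are grouped into warps of 32, so threads 0–31 of Block 0 form Warp 0 and threads 32–63 form Warp 1.

```cuda
// Illustrative kernel: record which warp each thread belongs to within its block.
__global__ void show_warp_mapping(int *warp_of_thread)
{
    int global_tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_in_block = threadIdx.x / warpSize;   // warpSize is 32 on these GPUs
    warp_of_thread[global_tid] = warp_in_block;   // 0 for threads 0-31, 1 for 32-63, ...
}
```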
Memory Hierarchy • Per-thread Register – int RegisterVar • Per-thread Local Memory – int LocalVarArray[10] • Per-Block Shared Memory – __shared__ int SharedVar • Per-app Global Memory – __global__ int GlobalVar • Per-app Constant Memory – __constant__ int ConstVar • Per-app Texture Memory – Texture<float,1,ReadMode> TextureVar
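The same declarations placed in context in a sketch kernel. The variable names come from the slide; the texture reference syntax is the legacy API of that era, and the slide's __global__ qualifier on GlobalVar is written here as __device__, which is what nvcc actually accepts for a global device variable.

```cuda
__device__   int GlobalVar;    // per-app global (device) memory
__constant__ int ConstVar;     // per-app constant memory
texture<float, 1, cudaReadModeElementType> TextureVar;  // per-app texture memory
                                                        // (legacy reference API; binding omitted)

__global__ void memory_spaces_demo(int *out)
{
    __shared__ int SharedVar;           // per-block shared memory
    int RegisterVar = threadIdx.x;      // per-thread register
    int LocalVarArray[10];              // per-thread local memory (small arrays may
                                        // still be promoted to registers)
    LocalVarArray[0] = RegisterVar;

    if (threadIdx.x == 0)
        SharedVar = GlobalVar + ConstVar;   // one thread initializes the shared value
    __syncthreads();

    out[blockIdx.x * blockDim.x + threadIdx.x] = SharedVar + LocalVarArray[0];
}
```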
Discussion Points • Who has written CUDA, how have you optimized it, and how long did it take? • Did you tune using a better algorithm than trial and error? • Is there any hope of building a GPU compiler that can automatically do what CUDA programmers do? • How would you do it? • What’s the input language? C, C++, Java, StreamIt? • Are GPUs a compiler writer’s best friend or worst enemy? • What about non-scientific codes – can they be mapped to GPUs? • How can GPUs be made more “general”?