  1. EECS 583 – Class 21: Research Topic 3: Compilation for GPUs. University of Michigan. December 12, 2011 – Last Class!!

  2. Announcements & Reading Material • This class's reading: “Program optimization space pruning for a multithreaded GPU,” S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, J. Stratton, and W. Hwu, Proc. Intl. Symp. on Code Generation and Optimization, Mar. 2008 • Project demos • Dec 13–16 and 19 (the 19th is full) • Send me an email with a date and a few timeslots if your group does not have a slot • Almost all groups have signed up already

  3. Project Demos • Demo format • Each group gets 30 mins • Strict deadlines because many groups are scheduled back to back • Don't be late! • Plan for 20 mins of presentation (no more!) and 10 mins of questions • Some slides are helpful; try to have all group members say something • Talk about what you did (basic idea, previous work), how you did it (approach + implementation), and results • A demo or real code examples are good • Report • 5 pages double-spaced, including figures – same content as the presentation • Due either when you do your demo or on Dec 19 at 6pm

  4. Midterm Results: Mean 97.9, StdDev 13.9, High 128, Low 50. The answer key is on the course webpage; pick up graded exams from Daya. If you did poorly, all is not lost – this is a grad class, and the project is by far the most important!!

  5. Why GPUs?

  6. Efficiency of GPUs • High Flop per Watt: GTX 285: 5.2 GFLOP/W; i7: 0.78 GFLOP/W • High Flop per Dollar: GTX 285: 3.54 GFLOP/$; i7: 0.36 GFLOP/$ • High Memory Bandwidth: GTX 285: 159 GB/sec; i7: 32 GB/sec • High Flop Rate: GTX 285: 1062 GFLOPS; i7: 102 GFLOPS • High DP Flop Rate: GTX 285: 88.5 GFLOPS; GTX 480: 168 GFLOPS; i7: 51 GFLOPS
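
  (Sanity check on those ratios, assuming the commonly quoted board powers of roughly 204 W for the GTX 285 and 130 W for the Nehalem-era i7: GFLOP/W is just peak rate over power, 1062 / 204 ≈ 5.2 and 102 / 130 ≈ 0.78.)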

  7. GPU Architecture • [diagram: 30 streaming multiprocessors (SM 0–SM 29), each with 8 scalar processors (0–7), a shared memory, and a register file; the SMs connect through an interconnection network to global (device) memory, and the host CPU and host memory attach via a PCIe bridge]
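
  The structural parameters in this diagram can be read back at runtime through the CUDA runtime API. A minimal sketch (assumes device 0 and omits error checking):

     #include <cstdio>
     #include <cuda_runtime.h>

     int main() {
         cudaDeviceProp prop;
         cudaGetDeviceProperties(&prop, 0);               // query device 0
         printf("SMs: %d\n", prop.multiProcessorCount);   // 30 on a GTX 285
         printf("Shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
         printf("Registers per block: %d\n", prop.regsPerBlock);
         printf("Global memory: %zu bytes\n", prop.totalGlobalMem);
         return 0;
     }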

  8. CUDA • “Compute Unified Device Architecture” • General-purpose programming model • User kicks off batches of threads on the GPU • Advantages of CUDA • Interface designed for compute – a graphics-free API • Orchestration of on-chip cores • Explicit GPU memory management • Full support for integer and bitwise operations

  9. Programming Model • [diagram: the host launches Kernel 1 on the device as Grid 1 and, later in time, Kernel 2 as Grid 2; each kernel executes as a grid of thread blocks]
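
  A minimal sketch of this model in CUDA C. The kernel names, sizes, and data are made up for illustration; the point is that the host launches each kernel as a grid of thread blocks, and back-to-back launches on the default stream execute in order (Grid 1, then Grid 2):

     #include <cuda_runtime.h>

     __global__ void kernel1(float *d, int n) {
         int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
         if (i < n) d[i] = 2.0f * i;
     }

     __global__ void kernel2(float *d, int n) {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n) d[i] += 1.0f;
     }

     int main() {
         int n = 1 << 20;
         float *d;
         cudaMalloc(&d, n * sizeof(float));      // explicit GPU memory management
         int threads = 256, blocks = (n + threads - 1) / threads;
         kernel1<<<blocks, threads>>>(d, n);     // host launches Grid 1
         kernel2<<<blocks, threads>>>(d, n);     // then Grid 2, after Grid 1 drains
         cudaDeviceSynchronize();
         cudaFree(d);
         return 0;
     }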

  10. GPU Scheduling • [diagram: the thread blocks of Grid 1 are distributed across the 30 SMs for execution; each SM supplies its 8 scalar processors, shared memory, and registers to the blocks it hosts]

  11. Warp Generation • [diagram: SM 0 holds Blocks 0–3; within Block 0, threads are grouped by ThreadId into 32-wide warps – Warp 0 holds threads 0–31, Warp 1 holds threads 32–63 – and all warps on the SM share its shared memory and registers]
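
  The hardware fixes the warp size at 32 threads on this generation, so each thread can recover its warp and lane from its thread index. A small sketch (buffer names are hypothetical), launched to mirror the diagram's four blocks of 64 threads (two warps each):

     #include <cuda_runtime.h>

     __global__ void warp_ids(int *warp_id, int *lane_id) {
         int t = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
         warp_id[t] = threadIdx.x / 32;  // threads 0-31 of a block form warp 0, 32-63 form warp 1
         lane_id[t] = threadIdx.x % 32;  // position within the warp
     }

     int main() {
         const int blocks = 4, threads = 64, n = blocks * threads;
         int *w, *l;
         cudaMalloc(&w, n * sizeof(int));
         cudaMalloc(&l, n * sizeof(int));
         warp_ids<<<blocks, threads>>>(w, l);    // Blocks 0-3, each split into Warp 0 and Warp 1
         cudaDeviceSynchronize();
         cudaFree(w); cudaFree(l);
         return 0;
     }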

  12. Memory Hierarchy • Per-thread registers: int RegisterVar • Per-thread local memory: int LocalVarArray[10] • Per-block shared memory: __shared__ int SharedVar • Per-app global memory: __device__ int GlobalVar • Per-app constant memory: __constant__ int ConstVar • Per-app texture memory: texture<float, 1, cudaReadModeElementType> TextureVar • Registers and local memory are private to a thread; shared memory is visible to all threads of a block; global, constant, and texture memory are visible to the whole app (host and device)
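
  Pulled together into one compilable sketch using the 2011-era API (the texture-reference form shown here is deprecated in modern CUDA); note that global-memory variables take the __device__ qualifier, since __global__ marks kernel functions:

     #include <cuda_runtime.h>

     __device__   int GlobalVar;    // per-app global (device) memory
     __constant__ int ConstVar;     // per-app constant memory (cached, read-only in kernels)
     texture<float, 1, cudaReadModeElementType> TextureVar;  // per-app texture memory (legacy)

     __global__ void scopes(int *out) {
         __shared__ int SharedVar;        // per-block shared memory: one copy per thread block
         int RegisterVar = threadIdx.x;   // per-thread register
         int LocalVarArray[10];           // per-thread local memory (arrays may spill here)
         LocalVarArray[0] = RegisterVar;
         if (threadIdx.x == 0) SharedVar = GlobalVar;
         __syncthreads();                 // make SharedVar visible to the whole block
         out[threadIdx.x] = SharedVar + ConstVar + LocalVarArray[0];
     }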

  13. Discussion Points • Who has written CUDA? How have you optimized it, and how long did it take? • Did you tune using a better algorithm than trial and error? • Is there any hope of building a GPU compiler that can automatically do what CUDA programmers do? • How would you do it? • What's the input language? C, C++, Java, StreamIt? • Are GPUs a compiler writer's best friend or worst enemy? • What about non-scientific codes – can they be mapped to GPUs? • How can GPUs be made more “general”?
