
ECE 569 High Performance Processors and Systems



Presentation Transcript


  1. ECE 569 High Performance Processors and Systems • Administrative • HW2 available, due next Thursday 2/13 @ start of class • GPUs • Thrust library • Execution model ECE 569 -- 06 Feb 2014

  2. CPU vs. GPU • Fundamentally different designs: the CPU devotes its die to control logic and cache around a few complex cores; the GPU devotes it to simpler, slower, massively-more cores. [Figure: side-by-side block diagrams — CPU (Control, ALUs, Cache, DRAM) vs. GPU (many small ALUs, DRAM)] ECE 569 -- 06 Feb 2014

  3. GPGPU languages • CUDA • OpenCL • Microsoft DirectCompute • … ECE 569 -- 06 Feb 2014

  4. Higher-level Abstractions • Microsoft AMP • Thrust • … ECE 569 -- 06 Feb 2014

  5. Thrust • STL-like approach to GPU programming • A template library for CUDA • Installed as part of CUDA 4.0 and newer • 80-20 rule: make 80% of CUDA programming easier…

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main()
{
  int N = 100;
  thrust::host_vector<int> h_vec(N);

  thrust::generate(h_vec.begin(), h_vec.end(), rand);      // generate data on host
  thrust::device_vector<int> d_vec = h_vec;                // copy to device
  thrust::sort(d_vec.begin(), d_vec.end());                // sort on device
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin()); // copy back to host
}

ECE 569 -- 06 Feb 2014

  6. Demo • Sorting in Thrust…
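The demo itself isn't captured in the transcript. As a rough stand-in, here is a minimal sketch of such a sorting demo, assuming random input from rand and a thrust::is_sorted check as the verification step (the size N and the output format are made up):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>
#include <cstdio>

int main()
{
  int N = 1 << 20;                                     // 1M elements (illustrative size)
  thrust::host_vector<int> h_vec(N);
  thrust::generate(h_vec.begin(), h_vec.end(), rand);  // random data on host

  thrust::device_vector<int> d_vec = h_vec;            // copy to device
  thrust::sort(d_vec.begin(), d_vec.end());            // sort on the GPU

  bool ok = thrust::is_sorted(d_vec.begin(), d_vec.end());
  printf("** Sorted: %s\n", ok ? "yes" : "no");
  return 0;
}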

  7. Demo • Summing a vector in Thrust…

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <stdio.h>

int main()
{
  int N = 100;
  thrust::host_vector<int> h_vec(N);

  for (int i = 0; i < N; i++)   // fill with 1, 2, 3, ..., N:
    h_vec[i] = (i + 1);

  thrust::device_vector<int> d_vec = h_vec;   // copy to device

  int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());

  printf("** Sum: %d\n\n", sum);
  return 0;
}
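As a side note, thrust::reduce generalizes to other binary operators. A hedged variant, assuming the same d_vec as above, that computes the maximum instead of the sum (INT_MIN replaces 0 as the identity value):

#include <climits>   // INT_MIN

int mx = thrust::reduce(d_vec.begin(), d_vec.end(),
                        INT_MIN, thrust::maximum<int>());
printf("** Max: %d\n", mx);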

  8. High-level GPU Architecture [Figure: the host with its own host memory connected to the GPU and its global memory] ECE 569 -- 06 Feb 2014
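In raw CUDA terms, that host-memory / global-memory split is exactly what the runtime's allocation and copy calls manage. A minimal sketch of the round trip (the names h and d are placeholders):

#include <cuda_runtime.h>
#include <cstdlib>

int main()
{
  const int N = 1024;
  size_t bytes = N * sizeof(float);

  float *h = (float *)malloc(bytes);   // host memory
  for (int i = 0; i < N; i++) h[i] = 1.0f;

  float *d = NULL;
  cudaMalloc((void **)&d, bytes);      // global memory on the device

  cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // host -> global
  // ... kernel launches would operate on d here ...
  cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // global -> host

  cudaFree(d);
  free(h);
  return 0;
}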

  9. Streaming Multiprocessor (SM) • 8 Streaming Processors (SP) • 2 Super Function Units (SFU) • Multi-threaded instruction dispatch • 1 to 512 threads active • Shared instruction fetch per 32 threads • Covers latency of texture/memory loads • 20+ GFLOPS • 16 KB shared memory • DRAM texture and memory access [Figure: SM block diagram — Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, 2 SFUs, backed by Global Memory]

  10. Memory Architecture • The local, global, constant, and texture spaces are regions of device memory (DRAM) • Each multiprocessor has: • A set of 32-bit registers per processor • On-chip shared memory • A read-only constant cache • A read-only texture cache • A data cache (Fermi only) [Figure: a device with Streaming Multiprocessors 1 through N, each with registers, shared memory, an instruction unit, processors 1 through M, a constant cache, a texture cache, and (Fermi only) a data cache, all backed by device memory holding the global, constant, and texture spaces]
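A hedged sketch of how those spaces appear in CUDA source, using a hypothetical scale kernel; the memory-space qualifiers, not the sizes, are the point:

#include <cuda_runtime.h>

__constant__ float coeff[16];   // constant space: read-only, cached on chip

__global__ void scale(float *out, const float *in, int n)  // in/out point into global memory
{
  __shared__ float tile[256];   // on-chip shared memory, one copy per block

  int i = blockIdx.x * blockDim.x + threadIdx.x;  // i lives in a register

  if (i < n) {
    tile[threadIdx.x] = in[i];  // global -> shared (used here only to show the qualifier)
    out[i] = tile[threadIdx.x] * coeff[0];  // shared * constant -> global
  }
}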

  11. Terminology • Device = GPU = set of multiprocessors • Streaming Multiprocessor = set of processors & shared memory • Kernel = GPU program • Grid = array of thread blocks that execute a kernel • Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory • Warp = a subset of a thread block (typically 32 threads) that forms the basic unit of scheduling
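A hedged sketch tying those terms to the built-in variables a kernel actually sees (the whoAmI kernel and the launch shape are made up):

__global__ void whoAmI(int *out)   // the kernel: a GPU program
{
  int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's index within the grid
  int warp = threadIdx.x / 32;                       // which warp of its block it belongs to
  out[tid] = warp;
}

// launched as a grid of thread blocks, e.g.:
// whoAmI<<<16, 256>>>(d_out);  // grid of 16 blocks, 256 threads (8 warps) each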

  12. NVIDIA GPU Execution Model • I. SIMD execution within a warp • II. Multithreaded execution across different warps / blocks • III. Each thread block mapped to a single SM [Figure: thread blocks distributed over SMs above a shared Global Memory]

  13. SIMT = Single-Instruction Multiple Threads • Term coined by NVIDIA • Combines SIMD execution within a warp with SPMD execution across warps
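One practical consequence of SIMD execution within a warp is branch divergence. A hedged sketch of a branch that splits a warp, so the hardware executes the two paths back to back with the inactive lanes masked off (the kernel is hypothetical):

__global__ void divergent(int *out)
{
  int i = threadIdx.x;
  if (i % 2 == 0)
    out[i] = 2 * i;   // pass 1: even lanes execute, odd lanes masked off
  else
    out[i] = 3 * i;   // pass 2: odd lanes execute, even lanes masked off
}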

  14. CUDA Thread Block Overview • All threads in a block execute the same kernel program (SPMD) • Programmer declares the block: • Block size: 1 to 512 concurrent threads • Block shape: 1D, 2D, or 3D • Block dimensions in threads • Each thread has a thread id within its block • The thread program uses its thread id to select work and address shared data (see the sketch below) • Threads in the same block share data and synchronize while doing their share of the work • Threads in different blocks cannot cooperate • Each block can execute in any order relative to other blocks! [Figure: a CUDA thread block with thread ids 0, 1, 2, 3, …, m running the thread program; courtesy John Nickolls, NVIDIA]
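A hedged sketch of those rules in code, using a hypothetical blockSum kernel (assumes a power-of-two block size of 256): each thread uses its id to select its work, partial results are shared through shared memory, and __syncthreads() synchronizes — all strictly within one block:

__global__ void blockSum(const int *in, int *blockTotals, int n)
{
  __shared__ int partial[256];              // one slot per thread in the block

  int tid = threadIdx.x;
  int i   = blockIdx.x * blockDim.x + tid;  // thread id selects this thread's work
  partial[tid] = (i < n) ? in[i] : 0;
  __syncthreads();                          // all loads done before reading neighbors

  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (tid < stride)
      partial[tid] += partial[tid + stride];  // tree reduction in shared memory
    __syncthreads();
  }

  if (tid == 0)
    blockTotals[blockIdx.x] = partial[0];   // one result per block; blocks never cooperate
}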

  15. Launching a Kernel Function • A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);

dim3 DimGrid(100, 50);       // 5000 thread blocks
dim3 DimBlock(4, 8, 8);      // 256 threads per block
size_t SharedMemBytes = 64;  // 64 bytes of shared memory

KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(...);

The shared-memory argument is only for data that is not statically allocated.
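For the common 1D case, the grid dimension is usually derived from the problem size at run time. A hedged sketch of that pattern (the vecInit kernel, N, and d_out are made up for illustration):

__global__ void vecInit(float *out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = 0.0f;   // guard: the last block may be partial
}

// ...
int N = 100000;
int threadsPerBlock = 256;
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // round up
vecInit<<<blocks, threadsPerBlock>>>(d_out, N);            // d_out: device buffer, assumed allocated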

  16. Example: Thread Scheduling on G80 • Each block is executed as 32-thread warps • Warps are the scheduling units in an SM • If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM? • A total of 768 threads (the G80 maximum per SM) • Each block consists of 256 / 32 = 8 warps • 8 warps × 3 blocks = 24 warps [Figure: the warps of Blocks 1–3 (threads t0 t1 t2 … t31 each) queued on one Streaming Multiprocessor — Instruction L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, 2 SFUs]

  17. SM Warp Scheduling • SM hardware implements zero-overhead warp scheduling • Warps whose next instruction has its operands ready for consumption are eligible for execution • Eligible warps are selected for execution using a prioritized scheduling policy • On G80, 4 clock cycles are needed to dispatch the same instruction for all threads in a warp • If one global memory access is needed every 4 instructions, a warp does 4 × 4 = 16 cycles of useful work per access, so a minimum of ceil(200 / 16) = 13 warps are needed to fully tolerate a 200-cycle memory latency [Figure: the SM's multithreaded warp scheduler interleaving warps over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, warp 3 instruction 96, …]

  18. How is context switching so efficient? • Large register file (16K registers/block) • Each thread is assigned a “window” of physical registers • This works as long as the thread block’s registers do not exceed capacity (otherwise, compilation fails) • Similarly, shared memory requirements must not exceed capacity for all simultaneously scheduled blocks [Figure: one register file partitioned into per-thread windows — Block 0 threads 0, 1, …, 256 through Block 8 threads 0, 1, …, 256]

  19. [Figure: blocks of threads t0 t1 t2 … tm assigned to SM 0 and SM 1, each with an MT issue unit, SPs, and shared memory] • Threads run concurrently • The SM maintains thread/block id numbers • The SM manages and schedules thread execution
