
GPU Architecture Challenges for Throughput Computing
Tor M. Aamodt, Assistant Professor, UBC


Presentation Transcript


  1. GPU Architecture Challenges for Throughput Computing. Tor M. Aamodt, Assistant Professor, UBC. March 17, 2011, University of Toronto

  2. Outline: GPU Computing; GPGPU-Sim / Manycore Accelerators; (Micro)Architecture Challenges: Branch Divergence (DWF, TBC), On-Chip Interconnect

  3. What do these have in common?

  4. Source: AMD Hot Chips 19

  5. GPU Computing. Technology trends mean we want "simpler" cores (less power). GPUs represent an extreme in terms of computation per unit area. Current GPUs tend to work well for applications with regular parallelism (e.g., dense matrix multiply). Research questions: Can we make GPUs better for a wider class of parallel applications? Can we make them even more efficient?

  6. Split the problem between CPU and GPU: GPU (most computation here), CPU (sequential code "accelerator").

  7. Heterogeneous Computing. [Figure: execution timeline in which the CPU spawns work on the GPU, waits for it to finish ("done"), runs more CPU code, then spawns the GPU again.]
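
  A minimal host-side sketch of this spawn/wait pattern, assuming a hypothetical kernel and data already resident on the GPU (the kernel name, sizes, and the doubling computation are placeholders, not code from the talk):

    // Hypothetical sketch of the CPU-spawn / GPU-done pattern.
    #include <cuda_runtime.h>

    __global__ void heavy_kernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;                    // stand-in for the real work
    }

    void run_phase(float *d_data, int n) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        heavy_kernel<<<blocks, threads>>>(d_data, n);  // CPU "spawns" GPU work
        cudaDeviceSynchronize();                       // CPU waits until the GPU is "done"
        // ... sequential CPU code runs here, then the next kernel is spawned ...
    }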

  8. CUDA Thread Hierarchy. A kernel is a grid of blocks of warps of scalar threads.
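
  The hierarchy maps directly onto CUDA's built-in index variables. A small sketch, assuming the 32-thread warps discussed in this talk (the output arrays are illustrative):

    // Sketch: how a scalar thread locates itself in the grid/block/warp hierarchy.
    __global__ void where_am_i(int *global_id, int *warp_id, int *lane_id) {
        int t   = threadIdx.x;                  // position within the block
        int gid = blockIdx.x * blockDim.x + t;  // position within the grid
        global_id[gid] = gid;
        warp_id[gid]   = t / 32;  // warps are consecutive groups of 32 threads
        lane_id[gid]   = t % 32;  // position within the warp (the SIMD lane)
    }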

  9. CUDA Example [Luebke]. Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    main() { … saxpy_serial(n, 2.0, x, y); }

  10. CUDA Example [Luebke]. CUDA code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    main()
    {
        // omitted: allocate and initialize memory
        // Invoke parallel SAXPY kernel with 256 threads/block
        int nblocks = (n + 255) / 256;
        saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
        // omitted: transfer results from GPU to CPU
    }

  11. GPU Microarchitecture Overview (10,000-foot view). [Figure: several shader cores connected through an on-chip interconnection network to multiple memory controllers, each driving off-chip GDDR DRAM.]

  12. Single Instruction, Multiple Thread (SIMT). All threads in a kernel grid run the same "code". A given block in the kernel grid runs on a single "shader core". A warp in a block is a set of threads grouped to execute in SIMD lockstep. Stack hardware and/or predication can support different branch outcomes per thread within a warp. [Figure: several thread warps share a common PC; the scalar threads W, X, Y, Z of the selected warp feed the lanes of a SIMD pipeline.]

  13. "Shader Core" Microarchitecture. Heavily multithreaded: 32 "warps", each representing 32 scalar threads. Designed to tolerate long-latency operations rather than avoid them.

  14. "GPGPU-Sim" (ISPASS 2009). GPGPU simulator developed by my group at UBC. Goal: a platform for architecture research on manycore accelerators running massively parallel applications. Supports CUDA's "virtual instruction set" (PTX). Provides a timing model with "good enough" accuracy for architecture research.

  15. GPGPU-Sim Usage Input: Unmodified CUDA or OpenCL application Output: Clock cycles required to execute + statistics that can be used to determine where cycles were lost due to “microarchitecture level” inefficiency.

  16. Accuracy vs. hardware (GPGPU-Sim 2.1.1b): correlation ~0.90. (Architecture simulators give up accuracy to enable flexibility; they can explore more of the design space.)

  17. GPGPU-Sim Visualizer (ISPASS 2010)

  18. GPGPU-Sim with SASS (via decuda) + microarchitecture tuning (under development): ~0.976 correlation on the subset of the CUDA SDK that currently runs. Currently adding support for the Fermi microarchitecture; don't ask when it will be available. Correlation ~0.95.

  19. First Problem: Control Flow. Scalar threads are grouped into warps; branch divergence occurs when threads inside a warp want to follow different execution paths. [Figure: a branch splits a warp's threads between Path A and Path B.]
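
  A minimal, hypothetical CUDA kernel that produces this kind of divergence (it mirrors the if/else example used later in the talk; array names are illustrative):

    __global__ void divergent_kernel(const int *A, const int *C, int *B, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        int K = A[tid];          // data-dependent value
        if (K > 10)              // threads in the same warp may take different
            K = 10;              //   sides of this branch, so the SIMD hardware
        else                     //   must serialize Path A and Path B
            K = 0;
        B[tid] = C[tid] + K;     // reconvergence point
    }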

  20. Current GPUs: Stack-Based Reconvergence (building upon Levinthal & Porter, SIGGRAPH '84). Each warp has a reconvergence stack; each entry holds a reconvergence PC, a next PC, and an active mask, and the top-of-stack entry determines which threads execute next under a common PC. Our version: immediate post-dominator reconvergence. [Figure: control-flow graph A/1111, B/1111, C/1001, D/0110, E/1111, G/1111 for threads 1-4, with the stack contents and executed blocks over time.]
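
  A simplified software sketch of such a per-warp reconvergence stack (an illustration of the mechanism, not the exact hardware or GPGPU-Sim data structures):

    #include <cstdint>
    #include <vector>

    struct StackEntry {
        uint32_t reconv_pc;    // PC where the paths re-join (immediate post-dominator)
        uint32_t next_pc;      // PC the active threads execute next
        uint32_t active_mask;  // one bit per thread in the warp
    };

    struct WarpStack {
        std::vector<StackEntry> s;   // s.back() is the top of stack (TOS)

        // On a divergent branch: point the current entry at the reconvergence
        // PC, then push one entry per taken path.
        void diverge(uint32_t reconv_pc,
                     uint32_t taken_pc, uint32_t taken_mask,
                     uint32_t fallthru_pc, uint32_t fallthru_mask) {
            s.back().next_pc = reconv_pc;  // resume here once both paths finish
            s.push_back({reconv_pc, fallthru_pc, fallthru_mask});
            s.push_back({reconv_pc, taken_pc, taken_mask});
        }

        // When the warp reaches the TOS entry's reconvergence PC, pop it;
        // execution continues with the wider active mask below.
        void maybe_reconverge(uint32_t pc) {
            if (!s.empty() && pc == s.back().reconv_pc) s.pop_back();
        }
    };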

  21. Dynamic Warp Formation (MICRO '07 / TACO '09). Consider multiple warps reaching the same divergent branch: is there an opportunity? [Figure: two warps each split between Path A and Path B.]

  22. Dynamic Warp Formation. Idea: form new warps at divergence. If enough threads branch to each path, full new warps can be created (see the sketch below).
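
  An illustrative software sketch of that grouping (it ignores the register-file lane constraints the real hardware must respect; names are hypothetical):

    #include <algorithm>
    #include <cstdint>
    #include <map>
    #include <vector>

    constexpr int WARP_SIZE = 32;

    // Group scalar threads by the PC they branch to, then carve full warps
    // (and one leftover partial warp per PC) out of each group.
    std::vector<std::vector<int>> form_warps(const std::vector<int> &thread_ids,
                                             const std::vector<uint32_t> &next_pc) {
        std::map<uint32_t, std::vector<int>> by_pc;   // target PC -> thread IDs
        for (size_t i = 0; i < thread_ids.size(); ++i)
            by_pc[next_pc[i]].push_back(thread_ids[i]);

        std::vector<std::vector<int>> new_warps;
        for (auto &kv : by_pc)
            for (size_t i = 0; i < kv.second.size(); i += WARP_SIZE) {
                size_t end = std::min(kv.second.size(), i + WARP_SIZE);
                new_warps.emplace_back(kv.second.begin() + i,
                                       kv.second.begin() + end);
            }
        return new_warps;
    }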

  23. Dynamic Warp Formation: Example. [Figure: warps x and y each execute basic blocks A through G. In the baseline, each warp serializes its divergent paths on its own stack; with dynamic warp formation, a new warp is created from the scalar threads of both warp x and warp y that are executing at basic block D, so the two warps finish in less total time.]

  24. Dynamic Warp Formation: Implementation. [Figure: new warp-formation logic added to the pipeline, together with a modified register file.]

  25. Thread Block Compaction (HPCA 2011)

  26. DWF Pathologies: Starvation. Majority scheduling (best performing in prior work) prioritizes the largest group of threads with the same PC, but it leads to starvation and poor reconvergence, and hence lower SIMD efficiency; the key obstacle is variable memory latency. Example: B: if (K > 10) C: K = 10; else D: K = 0; E: B = C[tid.x] + K; [Figure: execution trace in which some threads are starved for 1000s of cycles before reaching E.]

  27. DWF Pathologies: Extra Uncoalesced Accesses. Coalesced memory access is "memory SIMD", the first-order CUDA programmer optimization, and it is not preserved by DWF. For E: B = C[tid.x] + K, the static warps touch 3 memory blocks (#Acc = 3); with DWF the regrouped threads touch 9 (#Acc = 9), with L1 cache port conflicts, although the L1 cache absorbs some of the redundant memory traffic. [Figure: accesses to blocks 0x100, 0x140, 0x180 with and without DWF.]
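
  For reference, the contrast between a coalesced and an uncoalesced pattern in CUDA source (illustrative kernels, not from the benchmarks in the talk):

    __global__ void coalesced(const float *C, float *B) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        B[tid] = C[tid];            // consecutive lanes touch consecutive words:
                                    // few wide memory transactions per warp
    }

    __global__ void uncoalesced(const float *C, float *B, int stride) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        B[tid] = C[tid * stride];   // lanes scatter across many memory blocks:
                                    // many transactions per warp
    }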

  28. DWF Pathologies: Implicit Warp Sync. Some CUDA applications depend on the lockstep execution of "static warps", e.g., a task queue in ray tracing (Warp 0 = threads 0 ... 31, Warp 1 = threads 32 ... 63, Warp 2 = threads 64 ... 95):

    int wid = tid.x / 32;
    if (tid.x % 32 == 0) {
        sharedTaskID[wid] = atomicAdd(g_TaskID, 32);
    }
    my_TaskID = sharedTaskID[wid] + tid.x % 32;
    ProcessTask(my_TaskID);

  29. Observation. Compute kernels usually contain divergent and non-divergent (coherent) code segments, and coalesced memory accesses usually occur in the coherent segments, where DWF gives no benefit. [Figure: static warps execute coherent code; dynamic warps execute the divergent region up to the reconvergence point; warps are then reset to the static arrangement before coalesced loads/stores.]

  30. Thread Block Compaction. Block-wide reconvergence stack: regroup threads within a block. Better reconvergence stack: likely convergence, i.e., converge before the immediate post-dominator. Robust: on average a 22% speedup on divergent CUDA apps, with no penalty on the others. [Figure: the per-warp stacks of warps 0-2 (PC, RPC, active mask) are merged into one block-wide stack for thread block 0, from which compacted warps (X, Y, T, U) are issued for the divergent paths.]

  31. Thread Block Compaction. Run a thread block like a warp: the whole block moves between coherent and divergent code, and a block-wide stack tracks execution paths and reconvergence. A barrier at each branch/reconvergence point ensures all available threads arrive at the branch, making the scheme insensitive to warp scheduling (this addresses implicit warp sync and starvation). Warp compaction regroups using all available threads; if there is no divergence, it yields the static warp arrangement (this addresses the extra uncoalesced memory accesses).

  32. Thread Block Compaction: Example. A: K = A[tid.x]; B: if (K > 10) C: K = 10; else D: K = 0; E: B = C[tid.x] + K; [Figure: the block-wide stack (PC, RPC, active threads) and the compacted warps as the block executes B, then C and D, then reconverges at E.]

  33. Thread Block Compaction. A barrier at every basic block?! (idle pipeline). Solution: switch to warps from other thread blocks; multiple thread blocks run on a core, which is already the case in most CUDA applications. [Figure: while block 0 waits at a branch for warp compaction, blocks 1 and 2 keep the core busy.]

  34. Microarchitecture Modifications. Per-warp stack → block-wide stack; I-buffer + TIDs → warp buffer (stores the dynamic warps). New unit: thread compactor, which translates active masks into compacted dynamic warps. [Figure: pipeline with I-cache, fetch, decode, warp buffer, scoreboard, issue, register file, ALUs, and MEM stage, plus the block-wide stack (branch target PC, active mask, valid[1:N], predicate) feeding the thread compactor.]
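
  A simplified sketch of what the thread compactor computes: pack the active thread IDs of a block-wide mask into as few warps as possible (it ignores the per-lane restrictions the real register file imposes):

    #include <vector>

    constexpr int WARP_SIZE = 32;

    std::vector<std::vector<int>> compact(const std::vector<bool> &active_mask) {
        std::vector<std::vector<int>> warps;
        std::vector<int> current;
        for (int tid = 0; tid < static_cast<int>(active_mask.size()); ++tid) {
            if (!active_mask[tid]) continue;          // inactive thread: skip
            current.push_back(tid);
            if (static_cast<int>(current.size()) == WARP_SIZE) {
                warps.push_back(current);             // emit a full compacted warp
                current.clear();
            }
        }
        if (!current.empty()) warps.push_back(current);  // last, partially full warp
        return warps;
    }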

  35. Likely-Convergence. The immediate post-dominator is conservative: all paths from a divergent branch must merge there, but convergence can happen earlier, when any two of the paths merge. We extend the reconvergence stack to exploit this; with TBC it gives a 30% speedup for ray tracing. Example (F is the immediate post-dominator of A; one path is rarely taken):

    while (i < K) {
        X = data[i];
    A:  if (X == 0)
    B:      result[i] = Y;
    C:  else if (X == 1)
    D:      break;
    E:  i++;
    }
    F: return result[i];

  36. Experimental Results. Two benchmark groups: COHE = non-divergent CUDA applications, DIVG = divergent CUDA applications. DWF suffers serious slowdowns from its pathologies; TBC shows no penalty on COHE and a 22% speedup on DIVG. [Figure: IPC relative to the per-warp-stack baseline.]

  37. Next: How should the on-chip interconnect be designed? (MICRO 2010)

  38. Throughput-Effective Design Two approaches: • Reduce Area • Increase performance Look at properties of bulk-synchronous parallel (aka “CUDA”) workloads

  39. Throughput vs. inverse of area

  40. Many-to-Few-to-Many Traffic Pattern. [Figure: many cores C0 ... Cn send requests over the request network to a few memory controllers MC0 ... MCm, whose replies return over the reply network; core injection bandwidth, MC input bandwidth, and MC output bandwidth are labeled.]

  41. Exploit Traffic Pattern Somehow? • Keep bisection bandwidth the same, reduce router area… • Half-router: limited connectivity, no turns allowed; might save ~50% of router crossbar area. [Figure: half-router connectivity.]

  42. Checkerboard Routing, Example • Routing from a half-router to a half-router • an even number of columns away • not in the same row • Solution: needs two turns • (1) route to an intermediate full-router using YX • (2) then route to the destination using XY (see the sketch below)
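
  A hedged sketch of that two-phase route (how the intermediate full-router is chosen is left abstract here; in the real design the checkerboard layout constrains it):

    #include <utility>
    #include <vector>

    using Node = std::pair<int, int>;  // (x, y) mesh coordinates

    // Move one hop at a time in dimension order: Y then X if y_first,
    // otherwise X then Y, appending every intermediate node to the path.
    static void walk(std::vector<Node> &path, Node from, Node to, bool y_first) {
        Node cur = from;
        for (int phase = 0; phase < 2; ++phase) {
            bool do_y = ((phase == 0) == y_first);
            int &coord = do_y ? cur.second : cur.first;
            int target = do_y ? to.second : to.first;
            while (coord != target) {
                coord += (target > coord) ? 1 : -1;
                path.push_back(cur);
            }
        }
    }

    // Phase 1: YX routing to an intermediate full-router; phase 2: XY routing
    // from there to the destination, as described on the slide above.
    std::vector<Node> checkerboard_route(Node src, Node intermediate, Node dst) {
        std::vector<Node> path{src};
        walk(path, src, intermediate, /*y_first=*/true);
        walk(path, intermediate, dst, /*y_first=*/false);
        return path;
    }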

  43. Multi-port routers at MCs • Increase the injection ports of the memory controller routers • Only increases the terminal bandwidth of the few nodes • No change in bisection bandwidth • Minimal area overhead (~1% in NoC area) • Speedups of up to 25% • Reduces the bottleneck at the few nodes

  44. Results • Harmonic-mean (HM) speedup of 13% across 24 benchmarks • Total router area reduction of 14.2%

  45. Next: GPU Off-chip Memory Bandwidth Problem (MICRO'09)

  46. Background: DRAM. [Figure: memory controller, memory array, row and column decoders, and per-bank row buffers.] • Row access: activate a row of a DRAM bank and load it into the row buffer (slow). • Column access: read and write data in the row buffer (fast). • Precharge: write the row buffer data back into the row (slow).
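
  A toy model of a single bank's row buffer, only to make the row-hit vs. row-switch distinction concrete (the address stream is made up, and the locality metric printed at the end anticipates the definition on the next slide):

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Bank {
        int64_t open_row = -1;   // -1 means no row is currently open
        long accesses = 0, row_switches = 0;

        void access(int64_t row) {
            ++accesses;
            if (row != open_row) {   // row miss: precharge + activate (slow)
                ++row_switches;
                open_row = row;
            }                        // row hit: column access only (fast)
        }
    };

    int main() {
        Bank bank;
        std::vector<int64_t> rows = {7, 7, 7, 7, 3, 3, 7, 7};  // example stream
        for (int64_t r : rows) bank.access(r);
        std::printf("row access locality = %.2f accesses per row switch\n",
                    double(bank.accesses) / double(bank.row_switches));
        return 0;
    }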

  47. Background: DRAM Row Access Locality. Definition: the number of accesses to a row between row switches. (GDDR uses multiple banks to hide latency.) tRC = row cycle time, tRP = row precharge time, tRCD = row activate time. Higher row access locality gives higher achievable DRAM bandwidth, and hence higher performance.

  48. Interconnect Arbitration Policy: Round-Robin. [Figure: a router with N/E/S/W inputs round-robins between request streams, so accesses to rows A, B, C, X, and Y arrive interleaved at memory controllers 0 and 1.]

  49. The Trend: DRAM Access Locality in Many-Core. Inside the interconnect, interleaving of memory request streams reduces the DRAM access locality seen by the memory controller. [Figure: pre-interconnect access locality (good) vs. post-interconnect access locality (bad).]

  50. Today's Solution: Out-of-Order Scheduling. The memory controller reorders its request queue so that requests hitting the currently open row are served first. • Queue size needs to increase as the number of cores increases • Requires fully-associative logic • Circuit issues: cycle time, area, power. [Figure: a request queue holding accesses to rows A and B, oldest to youngest; serving the row-A requests together avoids switching the open row.] (A sketch of the reordering idea follows below.)
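
  A minimal sketch of that reordering idea, in the spirit of first-ready, first-come-first-served scheduling (an illustration, not the exact policy of any particular GPU memory controller):

    #include <cstdint>
    #include <deque>
    #include <optional>

    struct Request { int64_t row; };

    struct Scheduler {
        std::deque<Request> queue;   // oldest request at the front
        int64_t open_row = -1;

        std::optional<Request> pick_next() {
            if (queue.empty()) return std::nullopt;
            // First pass: oldest request to the open row (a row-buffer hit).
            for (size_t i = 0; i < queue.size(); ++i) {
                if (queue[i].row == open_row) {
                    Request r = queue[i];
                    queue.erase(queue.begin() + i);
                    return r;
                }
            }
            // No hit available: take the oldest request and switch rows.
            Request r = queue.front();
            queue.pop_front();
            open_row = r.row;
            return r;
        }
    };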
