Warp-Level Divergence in GPUs: Characterization , Impact , and Mitigation

Ping Xiang,Yi Yang,HuiyangZhou Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation The 20th IEEE International Symposium On High Performance Computer Architecture, Orlando, Florida, USA 1

Outline • Background • Motivation • Mitigation: WarpMan • Experiments • Conclusions 2

Overview of GPU Architecture Control Register File ALU ALU Warp TB Threads Warp ALU ALU Shared Memory Cache Warp ALU ALU … ALU ALU DRAM 3

Motivation: Resource Fragmentation • Typically large TB size (512, e.g.) • More efficient data sharing/communication within a TB • Limited total TB number Register File TB TB Unused Registers 4

Motivation Warp-Level Divergence: warps within the same TB don’t finish at the same time …. Warp1 Warp2 Finished Warp3 Resources cannot be released promptly Finished Warp4 TB Unused Resources 5

Outline • Background • Motivation • Characterization: • Mitigation: WarpMan • Experiments • Conclusions 6

Characterization: Temporal Resource underutilization Register File TB Finished Finished TB Unused Resources Spatial Resource underutilization 7

Spatial Resource Underutilization Register resource as an example 46% 28% 17% 8

Temporal Resource Underutilization RTRU: ratio of temporal resource underutilization • Case Study: Ray Tracing • 6 warps per TB • Study TB0 as an example RTRU = 49.7% 9

Why There Is Temporal Resource Underutilization? • Input-dependent workload imbalance • Same code, different input: “if(a < 123)” • Program-dependent workload imbalance • Code like if(tid < 32) • Memory divergence • Some warps experience more cache hits than others • Warp schedulingpolicy • Scheduler prioritizes certain warps than others 10

Characterization: RTRU 11

Outline • Background • Motivation • Characterization: • Micro-benchmarking • Mitigation: WarpMan • Experiments • Conclusions 12

Micro-benchmark • Code runs on both GTX480 and GTX 680 • 1. __global__ void TB_resource_kernel(…, bool call = false){ • 2. if(call) bloatOccupancy(start, size); • ... • 3. clock_tstart_clock = clock(); • 4. if(tid < 32){ //tid is the threadid within a TB • 5 clock_offset = 0; • 6. while( clock_offset < clock_count ) { • 7. clock_offset = clock() - start_clock; • 8. } • 9. } • 10. clock_tend_clock = clock(); • 11. d_o[index] = start_clock; //index is the global thread id • 12. d_e[index] = end_clock; • 13.} 13

Micro-benchmarking • Results > Using CUDA device [0]: GeForce GTX 480 > Detected Compute SM 3.0 hardware with 8 multi-processors. … CTA 250 Warp 0: start 80, end 81 CTA 269 Warp 0: start 80, end 81 CTA 272 Warp 0: start 80, end 81 CTA 283 Warp 0: start 80, end 81 CTA 322 Warp 0: start 80, end 81 CTA 329 Warp 0: start 80, end 81 … 14

Outline • Background • Motivation • Mitigation: WarpMan • Experiments • Conclusions 15

WarpMan Warp Level Resource Management TB-level Resource Management SM Finished Warp0 TB0 TB2 Finished Warp1 Finished Warp2 TB1 Unused Resources Workload TB0 TB2 Warp0 cycle Warp1 Warp2 TB1 16

WarpMan Warp Level Resource Management TB-level Resource Management WarpMan SM SM Finished Warp TB0 TB0 TB2 Finished Warp Finished Warp Finished Released Resource Warp2 From TB2 TB1 TB1 Unused Resources Warp0 From TB2 Warp1 From TB2 Workload TB0 Saved Cycle Warp0 cycle Warp1 Warp2 warp2 TB1 Warp0 and warp 1 17

WarpMan ---- Design • Dispatch logic • Traditional TB-level dispatching logic • Add partial TB dispatch logic • Workload buffer • Store the dispatched but not running partial TBs 18

Dispatching Shared memory Warp entries TB entries Registers Workload to be dispatched Resources required for a TB TB-level Resource Check A full TB The shared memory is still allocated at the TB level Warp-level Resource Check A partial TB Resources required for a Warp 19

Workload Buffer • Store the dispatched but not running TB • Hardware TB id (assigned by the hardware) • Software TB id (defined by the software) • Start warp id • End warp id • Valid bit 3 26 5 5 1 40bits 20

Workload Buffer Store the dispatched but not running TB TB120 WarpMan SM Workload buffer TB118 Finished Warp2 From TB120 TB Num 120 TB117 0 Start Warp ID 1 2 2 End Warp ID 1 0 Valid Unused Resources Warp0 From TB120 Warp1 From TB120 21

Outline • Background • Motivation: • Mitigation: WarpMan • Experiments • Conclusions 22

Methodology • Use GPUWattch for both timingand energyevaluation • Baseline Architecture: (GTX480) • 15 SMs, with SIMD size of 32, running at 1.4Ghz • Max TBs per SM is 8, Max threads per SM is 1536 • Scheduling policy: round robin / two level • 16KB L1 cache, 48 KB shared memory. 128KB regs • Applicationsfrom: • Nvidia CUDA SDK • Rodinia Benchmark Suit • GPGPUsim 23

Performance Results: • temp: allow early finished warps to release resource for new warps • temp + spatial: resources are allocated /released at warp level • The performance improvements can be as high as 71%/76% • On average, 15.3% improvements 24

Energy Results The energy savings can be as high as over 20%, and 6% on average

A Software Alternative a smaller TB size • Change the softwareto have a smaller TB size • Change the hardwareto enable more concurrent TBs • Inefficientshared memory usage / synchronization • Decreasethe data locality • Moreas we proceed to the experimental results… 26

Comparing to the Software Alternative • CT and ST: software approach decreases L1 locality • NN and BT: reduced total number of threads • On average: 25% improvement VS 48% degradation 125% 52% 27

Related Work • Resource underutilization due to branch divergence or thread-level divergence has been well studied. • Yi Yang et al [Pact-21] targets at the shared memoryresource management and is complementary to our proposed WarpMan scheme. • D. Tarjan, et al [US Patent-2009], proposes to use virtual register table to manage physical register file to enable more concurrent TBs 28

Conclusion • We highlight the limitations of TB-level resource management • we characterizewarp-level divergence and reveal the fundamental reasons for such divergent behavior; • we propose WarpManand show that it can be implemented with minor hardware changes • we show that our proposed solution is highly effective and achieves significant performanceimprovements and energy savings Questions? 29

Warp-Level Divergence in GPUs: Characterization , Impact , and Mitigation

Warp-Level Divergence in GPUs: Characterization , Impact , and Mitigation

Presentation Transcript

Characterization of Pore Structure: Foundation

FEMA’S BENEFIT COST ANALYSIS

Application and Mitigation Requirements

Site Characterization and Analysis

Module 3 Mitigation Options

Monte Carlo implementations on GPUs

Characterization Creating Characters Direct Characterization Indirect Characterization Motivation Your Turn

WARP

Session 10 Disaster Mitigation

Session 10 Disaster Mitigation

Divergence Trading Workshop Day One

CUDA Programming Model Overview Memory Hierarchy and Optimization

Divergence Theorem

State of Louisiana Hazard Mitigation Plan Update

D. Casey Levy State Hazard Mitigation Officer

Nonlinearity characterization and modelling

Materials Characterization

The Truth About Sharks

KOL Management Skills

Properties, Characterization, and Applications of Carbon Nanotubes

3 rd NEO-ASE Workshop on Asteroid Threat Mitigation Earth University, Costa Rica