Enhancing GPU Performance Through Fetch and Memory Scheduling
Explore the impact of instruction fetching and memory scheduling on GPU efficiency, with detailed insights into architecture, merging strategies, code examples, and optimization techniques for improved performance.
Presentation Transcript
Effect of Instruction Fetch and Memory Scheduling on GPU Performance
Nagesh B Lakshminarayana, Hyesoon Kim
Outline
• Background and Motivation
• Policies
• Experimental Setup
• Results
• Conclusion
GPU Architecture (based on the Tesla architecture)
• SM – Streaming Multiprocessor
• SP – Scalar Processor
• SIMT – Single Instruction, Multiple Thread
SM Architecture (based on the Tesla architecture)
• Fetch mechanism
  • Fetch 1 instruction for the selected warp
  • Stall fetch for a warp when it executes a load/store or encounters a branch
• Scheduler policy
  • Oldest-first and in-order (within a warp)
• Caches
  • I-cache, shared memory, constant cache, and texture cache
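To make the baseline fetch behavior concrete, here is a minimal sketch of round-robin warp selection with the stall conditions described above (WarpState and the function name are hypothetical, not the simulator's actual interface):

    // Minimal sketch of baseline round-robin (RR) fetch: scan from the
    // warp after the one fetched last cycle and pick the first warp
    // that is active and not stalled on a load/store or branch.
    // WarpState and select_warp_rr are illustrative names.
    struct WarpState {
        bool active;        // warp still has instructions to execute
        bool fetch_stalled; // waiting on a load/store or branch resolution
    };

    int select_warp_rr(const WarpState warps[], int num_warps, int last) {
        for (int i = 1; i <= num_warps; ++i) {
            int w = (last + i) % num_warps;
            if (warps[w].active && !warps[w].fetch_stalled)
                return w; // fetch exactly 1 instruction for this warp
        }
        return -1; // every warp is stalled this cycle
    }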
Handling Multiple Memory Requests
• MSHR / memory request queue
  • Allows merging of memory requests (intra-core)
• DRAM controller
  • Allows merging of memory requests (inter-core)
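As a rough illustration of intra-core merging, the sketch below shows an MSHR-like structure that merges a new request to a cache line that already has an outstanding miss (all names are hypothetical, not the simulator's actual structures):

    #include <cstdint>
    #include <map>
    #include <vector>

    // Sketch of MSHR-style intra-core merging: a second load to a line
    // with an in-flight request is recorded against the existing entry
    // instead of generating another DRAM access.
    struct Mshr {
        std::map<uint64_t, std::vector<int>> pending; // line -> waiting warps

        // Returns true if the request merged with an in-flight one.
        bool add_request(uint64_t line_addr, int warp_id) {
            auto it = pending.find(line_addr);
            if (it != pending.end()) {
                it->second.push_back(warp_id); // merge: no new DRAM request
                return true;
            }
            pending[line_addr].push_back(warp_id); // new miss: issue to DRAM
            return false;
        }
    };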
Code Example – Intra-Core Merging
• From MonteCarlo in the CUDA SDK:

    for (iSum = threadIdx.x; iSum < SUM_N; iSum += blockDim.x) {
        …
        for (int i = iSum; i < pathN; i += SUM_N) {
            real r = d_Samples[i];
            real callValue = endCallValue(S, X, r, MuByT, VBySqrtT);
            sumCall.Expected += callValue;
            sumCall.Confidence += callValue * callValue;
        }
        …
    }

• Notation: A(X, Y) is the value of variable A in block X, thread Y
• iSum(0, 2) = iSum(1, 2) = iSum(2, 2) = 2, hence i(0, 2) = i(1, 2) = i(2, 2) = 2 and r(0, 2) = r(1, 2) = r(2, 2) = d_Samples[2]
• Multiple blocks are assigned to the same SM, and threads with corresponding IDs in different blocks access the same memory locations, so their requests can be merged within the core
Why Look at Fetch?
• Allows implicit control over the resources allocated to a warp within a core
• Can control the progress of a warp
• Can boost performance by fetching more instructions for critical warps
Why Look at DRAM Scheduling?
• The memory system is a performance bottleneck for several applications
• DRAM scheduling decides the order in which memory requests are granted
• Can prioritize warps based on criticality
• Implicit performance control across cores
By controlling Fetch and DRAM Scheduling, we can control performance.
How Is This Useful?
• Understand applications and their behavior better
• Detect patterns or behavioral groups across applications
• Design new policies for GPGPU applications to improve performance
Fetch Policies
• Round Robin (RR) [default in the Tesla architecture]
• FAIR – ensures uniform progress of all warps
• ICOUNT [Tullsen'96] – same as ICOUNT in SMT; tries to increase throughput by giving priority to fast-moving threads
• Least Recently Fetched (LRF) – prevents starvation of warps
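These policies differ only in how the next warp is selected. As one example, here is a minimal sketch of LRF selection (struct and function names are hypothetical):

    #include <cstdint>

    // Sketch of Least Recently Fetched (LRF): among warps that can
    // fetch this cycle, pick the one whose last fetch was longest ago,
    // so no ready warp starves.
    struct WarpFetchState {
        bool     active;
        bool     fetch_stalled;
        uint64_t last_fetch_cycle; // cycle of this warp's most recent fetch
    };

    int select_warp_lrf(const WarpFetchState warps[], int num_warps) {
        int best = -1;
        for (int w = 0; w < num_warps; ++w) {
            if (!warps[w].active || warps[w].fetch_stalled) continue;
            if (best < 0 ||
                warps[w].last_fetch_cycle < warps[best].last_fetch_cycle)
                best = w; // least recently fetched so far
        }
        return best; // -1 if no warp can fetch this cycle
    }

FAIR could be sketched the same way by comparing a per-warp progress counter (e.g., instructions fetched so far) instead of the last fetch cycle.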
New Oracle-Based Fetch Policies
• ALL
  • Gives priority to longer warps (total length until termination)
  • Ensures all warps finish at the same time, which results in higher occupancy
• Example priorities: warp 0 > warp 1 > warp 2 > warp 3
New Oracle-Based Fetch Policies
• BAR
  • Gives priority to warps with a greater number of instructions remaining to the next barrier
  • The idea is to reduce wait time at barriers
• Example priorities (two scenarios from the slide's figures):
  • warp 0 > warp 1 > warp 2 > warp 3
  • warp 2 > warp 1 > warp 0 > warp 3
New Oracle-Based Fetch Policies
• MEM_BAR
  • Similar to BAR, but gives higher priority to warps with more memory instructions
  • Priority(Wa) > Priority(Wb) if MemInst(Wa) > MemInst(Wb), or if MemInst(Wa) = MemInst(Wb) and Inst(Wa) > Inst(Wb)
• Example priorities (two scenarios):
  • warp 0 > warp 2 > warp 1 = warp 3
  • warp 1 > warp 0 = warp 2 > warp 3
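The MEM_BAR rule above maps directly to a comparator; a minimal sketch (names are illustrative, and the per-warp counts would come from the oracle):

    // Sketch of the MEM_BAR priority rule: more memory instructions to
    // the next barrier wins; ties break on total instructions to the
    // next barrier.
    struct WarpOracleInfo {
        int mem_inst; // MemInst(W): memory instructions to next barrier
        int inst;     // Inst(W): all instructions to next barrier
    };

    // True if warp a has strictly higher fetch priority than warp b.
    bool mem_bar_higher_priority(const WarpOracleInfo& a,
                                 const WarpOracleInfo& b) {
        if (a.mem_inst != b.mem_inst)
            return a.mem_inst > b.mem_inst;
        return a.inst > b.inst;
    }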
DRAM Scheduling Policies
• FCFS
• FRFCFS [Rixner'00]
• FR_FAIR (new policy)
  • Row hit with fairness
  • Ensures uniform progress of warps
• REM_INST (new Oracle-based policy)
  • Row hit with priority for warps with a greater number of instructions remaining until termination
  • Prioritizes longer warps
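For context, FRFCFS prefers requests that hit in the currently open row of a DRAM bank and otherwise falls back to the oldest request; the new policies keep the row-hit-first rule but change the tie-breaker. A minimal sketch of FRFCFS selection for one bank (names are illustrative):

    #include <cstdint>
    #include <vector>

    struct MemRequest {
        uint64_t row;     // DRAM row this request targets
        uint64_t arrival; // cycle the request entered the queue
    };

    // FR-FCFS: pick the oldest row-buffer hit if any exists; otherwise
    // pick the oldest request overall. FR_FAIR and REM_INST would rank
    // candidates by warp progress or remaining instructions instead of
    // purely by age.
    int frfcfs_pick(const std::vector<MemRequest>& q, uint64_t open_row) {
        int oldest = -1, oldest_hit = -1;
        for (int i = 0; i < (int)q.size(); ++i) {
            if (q[i].row == open_row &&
                (oldest_hit < 0 || q[i].arrival < q[oldest_hit].arrival))
                oldest_hit = i;
            if (oldest < 0 || q[i].arrival < q[oldest].arrival)
                oldest = i;
        }
        return oldest_hit >= 0 ? oldest_hit : oldest; // -1 if queue empty
    }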
Experimental Setup
• Simulated GPU architecture
  • 8 SMs
  • Frontend: 1-wide, 1KB I-cache, stall on branch
  • Execution: 8-wide SIMD execution unit, in-order scheduling, 4-cycle latency for most instructions
  • Caches: 64KB software-managed cache, 8 load accesses/cycle
  • Memory: 32B-wide bus, 8 DRAM banks
  • Baseline: RR fetch, FRFCFS DRAM scheduling
• Trace-driven, cycle-accurate simulator
• Per-warp traces generated using GPU Ocelot [Kerr'09]
Benchmarks
• Taken from
  • CUDA SDK 2.2 – MonteCarlo, Nbody, ScalarProd
  • PARBOIL [UIUC'09] – MRI-Q, MRI-FHD, CP, PNS
  • RODINIA [Che'09] – Leukocyte, Cell, Needle
• Classification based on the lengths of warps
  • Symmetric, if divergence is <= 2%
  • Asymmetric, otherwise (results included in the paper)
Results – Symmetric Applications (baseline: RR + FRFCFS)
• Compute-intensive: no variation with different fetch policies
• Memory-bound: improvement with fairness-oriented fetch policies, i.e., FAIR, ALL, BAR, MEM_BAR
Results – Symmetric Applications (baseline: RR + FRFCFS)
• On average, the new DRAM scheduling policies perform better than FRFCFS
• MersenneTwister shows a huge improvement
• The REM_INST DRAM policy performs similarly to FR_FAIR
Analysis: MonteCarlo (FRFCFS DRAM scheduling)
• Fairness-oriented fetch policies improve performance by increasing intra-core merging
Analysis: MersenneTwister (baseline: RR + FRFCFS)
• Fair DRAM scheduling (FR_FAIR, REM_INST) improves performance by increasing the DRAM row buffer hit ratio
Analysis: BlackScholes (FRFCFS DRAM scheduling)
• Fairness-oriented fetch policies increase MLP (memory-level parallelism)
• The combined increase in MLP and row buffer hit ratio improves performance
Conclusion
• Compute-intensive applications
  • Fetch and DRAM scheduling policies do not matter
• Symmetric memory-intensive applications
  • Fairness-oriented fetch policies (FAIR, ALL, BAR, MEM_BAR) and DRAM policies (FR_FAIR, REM_INST) provide performance improvement
  • MonteCarlo (40%), MersenneTwister (50%), BlackScholes (18%)
• Asymmetric memory-intensive applications
  • No correlation between performance and fetch/DRAM scheduling policies
THANK YOU!