Paragon: Collaborative Speculative Loop Execution on GPU and CPU

Paragon: Collaborative Speculative Loop Execution on GPU and CPU MehrzadSamadi1 Amir Hormati2 Janghaeng Lee1 and Scott Mahlke1 1University of Michigan - Ann Arbor 2Microsoft Research, Microsoft

Amdahl’s Law • GPGPU may have <100x speedup but... 50% 50% NO GPU utilization GPU Executable Even 1000x here does NOT bring more than 2x in overall NO GPU utilization Execution Time

General Purpose Computing on GPU • Limitation of • Massive Data-Parallelism • Linear array access • NO Indirect array access • NO Pointers • Leaves GPUs underutilized • GPUs are not so much generalized GPU Executable How can GPUs be more GENERAL?

Motivation – More Generalization • Reduce Sections • Non-Linear array access • Indirect array access • Array access through pointers • Difficult for programmers to verify • Loop-Carried Dependencies NO GPU utilization for(y=0; y<ny; y++) for(x=0; x<nx; x++){xr = x %squaresize[XUP];yr = y %squaresize[YUP];i = xr + yr;lattice[i].x = x;lattice[i].y = y; } for(i=1; i<m; i++) for(j=iaL[i]; j<iaL[i+1]-1; j++) x[i] = x[i] - aL[j] * x[jaL[j]]; for(int i=0; i<n; i++){ *c = *a + *b; a++; b++; c++; }

Motivation – More Generalization • Reduce Sections • Non-Linear array access • Indirect array access • Array access through pointers • Difficult for programmers to verify • Loop-Carried Dependencies NO GPU utilization

Paragon Execution Loop 2 Sequential Loop 1 Loop 3 Sequential Sequential DO-ALL Possibly-Parallel DO-ALL Sequential CPU Sequential Sequential L2 CPU L3 L1 L2 GPU Conflict Check

Paragon Execution with Conflict Loop 2 Sequential Loop 1 Loop 3 Sequential Sequential DO-ALL Possibly-Parallel DO-ALL Sequential CPU Sequential Sequential L2 CPU L1 L2 L3 GPU Conflict

Paragon Process Flow Input: Sequential Code Offline Compilation Runtime Kernel Management Loop Classification Profiling Conflict Management Unit Instrumentation Executionwithout Profiling CUDA + pThread

Offline Compilation • Loop classification • Sequential Loops • Dependence determined at compile-time • Assign to CPU statically • DO-ALL Loops • Assign to GPU statically • Possible DO-ALL Loops • Dependence can be determined at RUNTIME

Runtime Profiling • Spawns thread on CPU • Sequential execution thread • Monitoring thread • Keeps track of memory foot print • Marks loop • Sequential • If many conflicts • Parallelizable • If no/few conflicts Assigned to CPU and GPU

Conflict Detection - Logging • Lazy conflict detection • Allocate memory when executing kernel • “write-set” for store • “read-set” for load intC_wr_log[sizeof_C]; boolC_rd_log[sizeof_C]; for (i = 0; i < N; i++){ idx = I[i]; C[idx] = A[idx] + B[idx]; } for (i = tid; i < N; i += ThreadCnt){ idx = I[i]; C[idx] = A[idx] + B[idx]; } AtomicInc(C_wr_log[idx]);

Conflict Detection - Checking • Done in parallel following kernel • Conflict if • Address written more than once • Address read and written at least once C_rd_log C_wr_log Thread 1 [0] [0] F 0 OK Thread 2 [1] [1] F 0 Thread 3 [2] [2] F 1 Thread 4 [3] [3] F 0 ... ... ... Conflict F T 0 Thread N F T 2 1 [N] [N]

Experimental Setup • CPU • Intel Core i7 • GPU • NVIDIA GTX 560 with 2GB DDR5 • Benchmark • Loops with pointers • FDTD, Siedel, Jacobi2d, GEMM, TMV • Indirect/Non-Linear access • Saxpy, House, Ipvec, Ger, Gemver, SOR, FWD

Results for Pointer Access 36x

Results for Indirect Access 3.8x

Conclusion • Paragon improves performance • More GPU Utilization • Speculatively run possibly-parallel loops on GPU • No performance penalty on mis-speculation • Letting CPU run sequentially at the same time • Conflict checking is done in GPU

Q & A

Overhead Breakdown

Paragon: Collaborative Speculative Loop Execution on GPU and CPU

Paragon: Collaborative Speculative Loop Execution on GPU and CPU

Presentation Transcript

Lazy and Speculative Execution

Loop-Extended Symbolic Execution on Binary Programs

Heterogeneous CPU/GPU co-processor clusters

Future of GPU/CPU Computing and Programming

GPU Parallel Execution Model / Architecture

Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations

iGPU : Exception Support and Speculative Execution on GPUs

Speculative Execution In Distributed File System and External Synchrony

SPECULATIVE EXECUTION IN A DISTRIBUTED FILE SYSTEM

Speculative Execution in a Distributed File System

A Configurable Simulator for OOO Speculative Execution

GPU and CPU: The Differences

Mixed Speculative Multithreaded Execution Models

Processor Verification with Precise Exceptions and Speculative Execution

Speculative Execution in a Distributed File System

GPU vs. CPU

GPU Parallel Execution Model / Architecture

Out-of-Order Speculative Execution

Loop-Extended Symbolic Execution on Binary Programs

Lecture 7 : Speculative Execution and Recovery

Lazy and Speculative Execution

Out-of-Order Speculative Execution