
GKLEE: Concolic Verification and Test Generation for GPUs


Presentation Transcript


  1. GKLEE: Concolic Verification and Test Generation for GPUs. Guodong Li 1,2, Peng Li 1, Geof Sawaya 1, Ganesh Gopalakrishnan 1, Indradeep Ghosh 2, Sreeranga P. Rajan 2 (1: University of Utah, 2: Fujitsu Labs of America). Feb. 2012

  2. GPUs are widely used! (images courtesy of Nvidia, www.engadget.com, AMD, and Intel) About 40 of the top 500 machines are GPU based, and personal supercomputers used for scientific research (biology, physics, …) are increasingly based on GPUs. In such application domains, it is important that GPU computations yield correct answers and are bug-free.

  3. Existing GPU Testing Methods are Inadequate • Insufficient branch-coverage and interleaving-coverage, leading to • Missed data races

  4. Existing GPU Testing Methods are Inadequate • Insufficient branch-coverage and interleaving-coverage, leading to • Missed data races (figure: conflicting accesses such as Write(a)/Read(a) and Write(a)/Write(a))

  5. Existing GPU Testing Methods are Inadequate • Data races are a huge problem • Testing is NEVER conclusive • One has to infer a data race's ill effects indirectly through corrupted values • Even instrumented race checking gives results only for a specific platform, not for future validations (for example under a different warp scheduling, e.g. the change-over from the old Tesla to the new Fermi architecture)

  6. Existing GPU Testing Methods are Inadequate • Insufficient branch-coverage and interleaving-coverage, leading to • Missed data races • Missed deadlocks

  7. Existing GPU Testing Methods are Inadequate • Insufficient branch-coverage and interleaving-coverage, leading to • Missed data races • Missed deadlocks (figure: threads blocked at __syncthreads())

  8. Existing GPU Testing Methods are Inadequate • Insufficient branch-coverage and interleaving-coverage, leading to • Missed data races • Missed deadlocks • Insufficient measurement of performance penalties due to • Warp Divergence

  9. Existing GPU Testing Methods are Inadequate • Insufficient branch-coverage and interleaving-coverage, leading to • Missed data races • Missed deadlocks • Insufficient measurement of performance penalties due to • Warp Divergence

  10. Existing GPU Testing Methods are Inadequate • Insufficient branch-coverage and interleaving-coverage, leading to • Missed data races • Missed deadlocks • Insufficient measurement of performance penalties due to • Warp Divergence • Non-coalesced memory accesses

  11. Existing GPU Testing Methods are Inadequate • Insufficient branch-coverage and interleaving-coverage, leading to • Missed data races • Missed deadlocks • Insufficient measurement of performance penalties due to • Warp Divergence • Non-coalesced memory accesses (figure: memory access pattern)

  12. Existing GPU Testing Methods are Inadequate • Insufficient branch-coverage and interleaving-coverage, leading to • Missed data races • Missed deadlocks • Insufficient measurement of performance penalties due to • Warp Divergence • Non-coalesced memory accesses • Bank conflicts (figure: shared-memory banks); both performance hazards are illustrated in the sketch below
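
For concreteness, here is a small illustrative pair of kernels (not from the slides; the names and the STRIDE constant are chosen here) exhibiting the two performance hazards just listed: a shared-memory bank conflict and a non-coalesced global load.

    #define STRIDE 32

    // Launch with 32 threads per block. With a stride of 32 four-byte words,
    // all threads of a warp map to the same shared-memory bank: a 32-way conflict.
    __global__ void bank_conflict_demo(float *out) {
        __shared__ float tile[32 * STRIDE];
        tile[threadIdx.x * STRIDE] = threadIdx.x;
        __syncthreads();
        out[threadIdx.x] = tile[threadIdx.x * STRIDE];
    }

    // Adjacent threads read addresses 32 elements apart, so one warp's loads
    // are scattered over many memory segments instead of being coalesced.
    __global__ void non_coalesced_demo(const float *in, float *out, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx * STRIDE < n)
            out[idx] = in[idx * STRIDE];
    }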

  13. Existing GPU Testing Methods are Inadequate • CUDA GDB Debugger • Manually debug the code and check for races and deadlocks • CUDA Profiler • Reports numbers that are difficult to interpret • Low coverage (i.e., not all possible inputs) • GKLEE • A better tool for verification and testing • Can address all of the previously mentioned points • e.g., has found bugs in real SDK kernels previously thought to be bug-free, and gives the root causes of the bugs

  14. Our Contributions GKLEE: a Symbolic Virtual GPU for Verification, Analysis, and Test-generation GKLEE reports Races, Deadlocks, Bank Conflicts, Non-Coalesced Accesses, Warp Divergences GKLEE generates Tests to Run on GPU Hardware 14

  15. Architecture of GKLEE (figure: tool flow) A C++ GPU program with symbolic inputs is compiled by the LLVM-GCC compiler with a CUDA syntax handler (rather than NVCC) into LLVM-CUDA bytecode; GKLEE (executor, scheduler, checker, test generator), given a GPU configuration, produces statistics/bugs and test cases, which are then replayed on a real GPU.

  16. Rest of the Talk • Simple CUDA example • Details of the Symbolic Virtual GPU • Analysis details: races, deadlocks, degree of warp divergence, bank conflicts, non-coalesced accesses, functional correctness • Automatic test generation • Coverage-directed test-case reduction

  17. CUDA • A simple dialect of C++ with CUDA directives • Thread blocks / teams -- SIMD “warps” • Synchronization through barriers / atomics (GKLEE being extended to handle atomics) 17

  18. Example: Increment Array Elements. Increment an N-element array A by scalar b (figure: thread t_i computes A[i] = A[i] + b).

__global__ void inc_gpu(int *A, int b, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N)
    A[idx] = A[idx] + b;
}
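
A minimal host-side launch for inc_gpu might look like the following sketch (not shown on the slide; N, the block size, and the use of cudaMemset are choices made here for illustration).

    #include <cuda_runtime.h>

    int main() {
        const int N = 1 << 20, blockSize = 256, b = 3;
        int *d_A;
        cudaMalloc(&d_A, N * sizeof(int));            // device array
        cudaMemset(d_A, 0, N * sizeof(int));          // initialize to zero
        inc_gpu<<<(N + blockSize - 1) / blockSize, blockSize>>>(d_A, b, N);
        cudaDeviceSynchronize();                      // wait for the kernel
        cudaFree(d_A);
        return 0;
    }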

  19. Illustration of a Race. Increment an N-element vector A by scalar b, but with a shifted read (figure: t0 reads A[63] while t63 writes A[63]: RACE!).

__global__ void inc_gpu(int *A, int b, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N)
    A[idx] = A[(idx - 1) % N] + b;   // RACE: thread idx reads the element that thread (idx - 1) mod N writes
}

  20. Illustration of a Deadlock. Increment an N-element vector A by scalar b (figure: threads with idx < N wait at the barrier, threads with idx ≥ N never reach it: DEADLOCK!). A deadlock-free variant is sketched below.

__global__ void inc_gpu(int *A, int b, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N) {
    A[idx] = A[idx] + b;
    __syncthreads();   // barrier inside a divergent branch
  }
}
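
One way to repair the kernel, sketched below on the assumption that the barrier is only needed after the update, is to hoist __syncthreads() out of the divergent branch so every thread in the block reaches it.

    __global__ void inc_gpu_fixed(int *A, int b, int N) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            A[idx] = A[idx] + b;
        __syncthreads();   // reached by all threads of the block: no deadlock
    }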

  21. Example of a Race Found by GKLEE.

__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
  const int threadPos = ((threadIdx.x & (~63)) >> 0) | ((threadIdx.x & 15) << 2) | ((threadIdx.x & 48) >> 4);
  ...
  __syncthreads();
  for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN;
       pos += IMUL(blockDim.x, gridDim.x)) {
    unsigned data4 = d_Data[pos];
    ...
    addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU);
  }
  __syncthreads();
  ...
}

inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data) {
  s_Hist[threadPos + IMUL(data, THREAD_N)]++;
}

“GKLEE: Is there a Race?”

  22. Example of a Race Found by GKLEE (same histogram64Kernel / addData64 code as on slide 21). GKLEE: Threads 5 and 13 have a WW race when d_Data[5] = 0x04040404 and d_Data[13] = 0.

  23. Example of Test Coverage due to GKLEE.

inline void swap(unsigned &a, unsigned &b) {
  unsigned tmp = a; a = b; b = tmp;
}

__global__ void Bitonic_Sort(unsigned *values) {
  __shared__ unsigned shared[NUM];
  unsigned int tid = threadIdx.x;
  shared[tid] = values[tid];
  __syncthreads();
  for (unsigned k = 2; k <= blockDim.x; k *= 2)
    for (unsigned j = k / 2; j > 0; j /= 2) {
      unsigned ixj = tid ^ j;
      if (ixj > tid) {
        if ((tid & k) == 0) {
          if (shared[tid] > shared[ixj]) swap(shared[tid], shared[ixj]);
        } else {
          if (shared[tid] < shared[ixj]) swap(shared[tid], shared[ixj]);
        }
      }
      __syncthreads();
    }
  values[tid] = shared[tid];
}

  24. Example of Test Coverage due to GKLEE (same Bitonic_Sort kernel as on slide 23). “How do we test this?”

  25. Example of Test Coverage due to GKLEE (same Bitonic_Sort kernel as on slide 23). Answer 1: “Random + ”

  26. Example of Test Coverage due to GKLEE (same Bitonic_Sort kernel as on slide 23). Answer 2: Ask GKLEE. Here are 5 tests with 100% source-code coverage and 79% average thread + barrier-interval coverage.

  27. GKLEE: Symbolic Virtual GPU (figure: the CUDA programming model: the host launches kernels as grids of thread blocks, each block an array of threads; GKLEE provides a virtual CPU and a virtual GPU) • GKLEE models a GPU in software • The virtual GPU represents the CUDA programming model (hence hides many hardware details) • Similar to the CUDA emulator in this respect, but with many unique features • Can simulate CPU+GPU

  28. Concolic Execution on the Virtual GPU • The values can be CONCrete or symbOLIC (CONCOLIC) in GKLEE • A value may be a complicated symbolic expression • Symbolic expressions are handled by constraint solvers • Determine satisfiability • Give concrete values as evidence • Constraint solving has become 1,000x faster over the last 10 years 28
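
Since GKLEE extends KLEE, a concolic driver can mark kernel inputs symbolic in the KLEE style. The sketch below is illustrative only: it reuses inc_gpu from slide 18, and the memory handling is simplified (on real hardware A would be device memory; GKLEE's shipped drivers may follow different conventions).

    #include <klee/klee.h>
    #define N 64

    int main() {
        int A[N], b;
        klee_make_symbolic(A, sizeof(A), "A");   // array contents are symbolic
        klee_make_symbolic(&b, sizeof(b), "b");  // the increment is symbolic too
        inc_gpu<<<1, N>>>(A, b, N);              // executed on the symbolic virtual GPU
        return 0;
    }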

  29. Comparing Concrete and Symbolic Execution. All values are concrete. Program: b = a * 2; c = a + b; if (c > 100) assert(0); With the concrete input a = 10: b = 20 and c = 30, so the branch c > 100 is not taken and the assert is unreachable on this run.

  30. Comparing Concrete and Symbolic Execution. The values can be concrete or symbolic. Program: b = a * 2; c = a + b; if (c > 100) assert(0); else … With a symbolic input a = x ∈ (-∞, +∞): b = 2x and c = 3x, and both branches are reachable: the then-branch e.g. with x = 40, the else-branch e.g. with x = 30, where the path condition is 3x ≤ 100.
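
To make the contrast concrete, here is a self-contained toy (in no way GKLEE's implementation) that tracks the slide's program symbolically as linear terms k·x + m and uses simple arithmetic in place of an SMT solver to produce a witness input for each side of the branch.

    #include <cstdio>

    struct Lin { int k, m; };                  // value of the form k*x + m

    int main() {
        Lin a = {1, 0};                        // a is symbolic: x
        Lin b = {2 * a.k, 2 * a.m};            // b = a * 2  ->  2x
        Lin c = {a.k + b.k, a.m + b.m};        // c = a + b  ->  3x

        // Branch "c > 100": 3x > 100 holds iff x > 33 for integer inputs.
        int bound = (100 - c.m) / c.k;         // = 33
        std::printf("then-branch: x = %d gives c = %d > 100\n",
                    bound + 7, c.k * (bound + 7) + c.m);   // e.g. x = 40, as on the slide
        std::printf("else-branch: x = %d gives c = %d, path condition 3x <= 100\n",
                    bound - 3, c.k * (bound - 3) + c.m);   // e.g. x = 30, as on the slide
        return 0;
    }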

  31. GKLEE Works on LLVM Bytecode • CUDA C++ programs are compiled to LLVM bytecode by LLVM-GCC with our CUDA syntax handler (LLVM-CUDA syntax and semantics) • Our online technical report contains a detailed description • GKLEE extends KLEE to handle CUDA features

  32. Thread Scheduling: In General, an Exponential Number of Schedules! It is like shuffling decks of cards: more than 13 trillion shuffles exist for 5 decks of 5 cards each, so more than 13 trillion schedules exist for 5 threads with 5 instructions each. More precisely, 25! / (5!)^5.
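
For the record, evaluating that multinomial coefficient shows the count is far above 13 trillion:

\[
\frac{25!}{(5!)^{5}} \;=\; \binom{25}{5,\,5,\,5,\,5,\,5} \;=\; 623{,}360{,}743{,}125{,}120 \;\approx\; 6.2 \times 10^{14}.
\]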

  33. GKLEE Avoids Examining Exp. Schedules !! Instead of considering all Schedules and All Potential Races… 33

  34. GKLEE Avoids Examining Exp. Schedules!! Instead of considering all schedules and all potential races, consider JUST THIS SINGLE CANONICAL SCHEDULE!! Folk Theorem (proved in our paper): “We will find A RACE if there is ANY race”!!

  35. Closer Look: Canonical Scheduling. Race-free operations can be exchanged (figure: two valid schedules of accesses a1..a6 by threads t1 and t2; since race-free operations commute, it suffices to examine one of them, e.g. the canonical schedule). The scheduler: • applies the canonical schedule; • checks for races at the barriers; • if there is no race, continues; otherwise reports the race and terminates.

  36. SIMD-aware Canonical Scheduling in GKLEE. SIMD/barrier-aware canonical scheduling within a warp/block (figure: threads t1..t64 executing instructions 1-6, grouped into barrier intervals BI1 and BI2). Record the accesses made in the canonical schedule, then check whether the accesses conflict (e.g. target the same address).

  37. SIMD-aware Race Checking in GKLEE. Races are checked on the fly, in the canonical schedule (figure: the same barrier intervals, with intra-warp races checked within a warp and inter-warp / inter-block races checked across warps and blocks).

  38. SIMD-aware Race Checking in GKLEE (same figure as slide 37): intra-warp races and inter-warp / inter-block races are both detected within the canonical schedule. A minimal sketch of the per-barrier-interval check follows.
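
A minimal sketch, in plain C++, of the per-barrier-interval check the last two slides depict (hypothetical data structures; GKLEE's actual checker works on LLVM bytecode, distinguishes intra-warp from inter-warp/inter-block accesses, and discharges address-equality checks to the constraint solver when addresses are symbolic).

    #include <cstdint>
    #include <vector>

    // One shared/global memory access recorded while executing a barrier
    // interval under the canonical schedule.
    struct Access { int tid; std::uint64_t addr; bool isWrite; };

    // Two accesses race if different threads touch the same address and at
    // least one access is a write.
    static bool conflict(const Access &a, const Access &b) {
        return a.tid != b.tid && a.addr == b.addr && (a.isWrite || b.isWrite);
    }

    // Invoked at every __syncthreads(): scan the interval's access log for a
    // conflicting pair; on a hit, report the two threads and stop.
    bool raceInBarrierInterval(const std::vector<Access> &log) {
        for (std::size_t i = 0; i < log.size(); ++i)
            for (std::size_t j = i + 1; j < log.size(); ++j)
                if (conflict(log[i], log[j]))
                    return true;
        return false;
    }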

  39. SDK Kernel Example: Race Checking (histogram64Kernel / addData64 as on slide 21). Per-thread trace for t1 and t2: threadPos = …; data = (data4 >> 26) & 0x3FU; s_Hist[threadPos + data * THREAD_N]++;

  40. SDK Kernel Example: Race Checking. RW set: t1 writes s_Hist[(((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 64], …; t2 writes s_Hist[(((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 64], … Do the two writes s_Hist[threadPos + data * THREAD_N]++ conflict? The solver is asked whether there is an assignment to t1, t2, d_Data such that (t1 ≠ t2) ∧ ((((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 64) == ((((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 64)

  41. SDK Kernel Example: Race Checking. For the RW set above, GKLEE finds that the two write addresses are equal when t1 = 5, t2 = 13, d_Data[5] = 0x04040404, and d_Data[13] = 0, indicating a Write-Write race on s_Hist[threadPos + data * THREAD_N]++.

  42. Experimental Results, Part I (checking correctness and performance issues) • The results of running GKLEE on CUDA SDK 2.0 kernels. GKLEE checks: (1) well-synchronized barriers; (2) races; (3) functional correctness; (4) bank conflicts; (5) memory coalescing; (6) warp divergence; (7) the required volatile keyword.

  43. Automatic Test Generation (figure: per-thread decision trees over branch conditions c1…c4 and their negations, combined into per-path constraints for t1+t2, e.g. c1 ∧ c2 ∧ c3 ∧ c4, which are solved to give concrete tests) • GKLEE is guaranteed to explore all paths w.r.t. the given inputs • The path constraint at the end of each path is solved to generate a concrete test case • GKLEE supports many heuristic reduction techniques

  44. SDK Example: Comprehensive Testing (BitonicKernel: the same bitonic sort as on slide 23). Branching on the symbolic data yields path conditions such as shared[0] ≤ shared[1] vs. shared[0] > shared[1], then shared[1] < shared[2] vs. shared[1] ≥ shared[2], and so on. Infeasible paths are pruned, e.g. shared[0] > shared[1] ∧ shared[1] ≥ shared[2] ∧ shared[0] ≤ shared[2] is unsatisfiable.

  45. SDK Example: Comprehensive Verification (figure: the tree of explored paths, each ending in a concrete assignment to values). Functional correctness: the output is sorted, i.e. values[0] ≤ values[1] ≤ … ≤ values[n]. A host-side sketch of this oracle follows.
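
The oracle itself is simple; below is a host-side sketch (the helper name is hypothetical, the array name follows the slides) that can be applied to the kernel's output, whether produced on the virtual GPU or when replaying a generated test on hardware.

    #include <cassert>

    // Post-condition for Bitonic_Sort: the output array is sorted.
    void check_sorted(const unsigned *values, int n) {
        for (int i = 0; i + 1 < n; ++i)
            assert(values[i] <= values[i + 1]);   // values[0] <= values[1] <= ...
    }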

  46. Experimental Results, Part II (Automatic Test Generation). Coverage information about the generated tests for some CUDA kernels. Covt and CovTBt measure bytecode coverage w.r.t. threads. No test reductions were used in generating this table. Execution times are on a typical workstation.

  47. Experimental Results, Part II (Coverage-Directed Test Reduction). Results after applying reduction. Heuristics RedTB and RedBI prune paths according to the coverage information of Thread+Barrier and Barrier intervals, respectively: essentially, a path is pruned if it is unlikely to contribute new coverage.

  48. Additional GKLEE Features • GKLEE employs an efficient memory organization • Employs many expression-evaluation optimizations • Simplifies concolic expressions on the fly • Dynamically caches results • Applies dependency analysis before constraint solving • Uses manually optimized C/C++ libraries • GKLEE also handles all of the C++ syntax • GKLEE never generates false alarms

  49. Experimental Results, Part III (performance comparison of two tools). Execution times (in seconds) of GKLEE and PUG [SIGSOFT FSE 2010] for the functional-correctness check. #T is the number of threads. Time is reported in the format GPU time (entire time); T.O. means > 5 minutes.

  50. Other Details • Diverged warp scheduling; intra-warp and inter-warp/-block race checking; textually aligned barrier checking • Checking performance issues: warp divergence, bank conflicts, global memory coalescing • Path/test reduction techniques • Volatile declaration checking • Handling symbolic aliasing and pointers • Drivers for the kernels and replaying on the real GPU • Other results, e.g. on CUDA SDK 4.0 programs • CUDA's relaxed memory model and semantics
