Protecting Critical Instructions Against Soft Errors: A Program Analysis Approach

Assuring Application-level Correctness Against Soft Errors Jason Cong and Karthik Gururaj

Motivation • Soft errors – issue for correct operation of CMOS circuits • Problem becomes more severe – ITRS 2009 • Smaller device sizes • Low supply voltages • Effect of soft errors on circuits • Karnik 2004, Nguyen 2003 • Effect of soft errors on software and processors • Li et al 2005, Wang et al 2004

Motivation • Traditional notion of correctness • Every last bit of every variable in a program should be correct • Referred to as numerical correctness • Application-level correctness • Several applications can tolerate a degree of error • Image viewer, video decoding etc • However, there exist critical instructions even in such applications • Example: state machine in video decoder

Motivation • Goal: Detect all “critical” instructions in the program • Protect “critical” instructions in the program against soft errors • Using duplication

Outline • Motivation • Definition of critical instructions • Program representation • Static analysis to detect critical instructions • Profiling and runtime monitoring • Results

Defining critical instructions • Elastic outputs – program outputs which can tolerate a certain amount of error • Media applications – image, video etc • Heuristics – Support vector machine • Characterizing quality of elastic outputs – Fidelity metric • Example: PSNR (peak signal to noise ratio) for JPEG, bit error rate

Defining critical instructions • Given application A: • I is the input to the application • A set of outputs Oc – numerical correctness required • A set of elastic outputs O • Fidelity metric F(I,O) for elastic outputs • T – threshold for acceptable output • An execution of A is said to satisfy application-level correctness if: • All outputs ∈ Oc are numerically correct • F(I,O) ≥ T for elastic outputs • Nmin – the minimum number of elements of O that need to be erroneous for F(I,O) to fall below T
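
The definition above can be read as a simple predicate. The sketch below is illustrative only: it assumes PSNR as the fidelity metric F and measures it against a fault-free reference output, and the names `psnr` and `satisfies_app_correctness` are hypothetical rather than taken from the slides.

```python
import math

def psnr(reference, observed, max_value=255):
    """PSNR in dB between two equal-length sequences of pixel values."""
    mse = sum((r - o) ** 2 for r, o in zip(reference, observed)) / len(reference)
    if mse == 0:
        return math.inf  # identical outputs: no error at all
    return 10 * math.log10(max_value ** 2 / mse)

def satisfies_app_correctness(critical_out, critical_ref,
                              elastic_out, elastic_ref, threshold_db):
    """Assumed reading of the definition: Oc exact, and F >= T for O."""
    # Outputs in Oc must match exactly (numerical correctness).
    if critical_out != critical_ref:
        return False
    # Elastic outputs only need to stay at or above the fidelity threshold T.
    return psnr(elastic_ref, elastic_out) >= threshold_db
```

A single off-by-one pixel keeps the PSNR far above 35 dB, whereas one wrong value in a critical output fails the check regardless of image quality.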

Example: JPEG decoder • PSNR of 35dB is assumed to be good quality • MSE = 20.56 • Using 8-bit pixel values (MAX=255), • Max error = 255 • For a 1024x768 pixel image, Nmin ~ 251
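
The arithmetic behind this example can be checked with a short sketch. It assumes the standard definition PSNR = 10·log10(MAX² / MSE) and a worst-case model in which every erroneous pixel is off by the full 255; the function names are hypothetical. Under these assumptions the MSE budget comes out at about 20.56 and the pixel count at 249, close to the roughly 251 quoted on the slide; the small gap presumably reflects rounding or a slightly different error model.

```python
import math

def mse_for_psnr(psnr_db, max_value=255):
    """Invert PSNR = 10*log10(MAX^2 / MSE) to get the MSE at a given PSNR."""
    return max_value ** 2 / 10 ** (psnr_db / 10)

def n_min(psnr_db, num_pixels, max_value=255):
    """Fewest maximally wrong pixels that push the image past the MSE budget."""
    mse_budget = mse_for_psnr(psnr_db, max_value)
    total_squared_error = mse_budget * num_pixels  # error budget for the whole image
    # Assumption: each bad pixel contributes the maximum squared error MAX^2.
    return math.ceil(total_squared_error / max_value ** 2)

mse = mse_for_psnr(35)
pixels_needed = n_min(35, 1024 * 768)
```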

Defining critical instructions • An instruction X is said to be critical if • X affects one of the outputs in Oc (numerical correctness required) OR • X affects at least Nmin elements of the elastic outputs O

Program representation • LLVM compiler infrastructure • LLVM intermediate representation • Weighted program dependence graph (PDG) – G

Example LLVM IR – 3 address code

Example PDG - based on LLVM IR

Example • Node X: computes X • Node out_i: computes C[Z]+X • Node so: stores C[Z]+X into the output array • Edge X → out_i: dependence between X and out_i • Edge out_i → so: dependence between out_i and so

Assigning edge weights • Edge weight of u→v: how many instances of node v are affected by one instance of u? • Example: • X is outside the loop and out_i is inside the loop, so the edge weight is N • Nodes out_i and so are in the same basic block, so the edge weight is 1

Static analysis for detecting critical instructions • Find how many instances of output O are affected by node x • propagate(x →v) is the number of instances of v that are affected by an instance of x

Example • propagate(u→v) initialized to edge weight for all edges (u →v) • propagate(X →out_i) = N • w(out_i →so) = 1 • propagate(X →so) = propagate(X →out_i) * w(out_i →so) • More formally
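
The slide's formal definition is not reproduced here. As a rough illustration of the product rule in the example, the sketch below walks a weighted dependence graph and multiplies edge weights along each path. Several things are assumptions: the graph is acyclic, contributions from multiple paths are summed, and the names `propagate`, `is_critical` and the trip count `N` are hypothetical.

```python
def propagate(edges, source, target):
    """Estimate how many instances of `target` one instance of `source` affects.

    `edges` maps a node to a list of (successor, weight) pairs.
    Assumes an acyclic graph; summing over paths is an assumption.
    """
    memo = {}

    def reach(node):
        if node == target:
            return 1
        if node not in memo:
            # Multiply the edge weight by what the successor reaches.
            memo[node] = sum(w * reach(succ) for succ, w in edges.get(node, []))
        return memo[node]

    return reach(source)

def is_critical(edges, node, elastic_output, n_min):
    """Elastic-output half of the criticality test: affects >= Nmin instances."""
    return propagate(edges, node, elastic_output) >= n_min

# The graph from the example: X -> out_i (weight N), out_i -> so (weight 1).
N = 1024 * 768  # hypothetical loop trip count, one iteration per pixel
pdg = {"X": [("out_i", N)], "out_i": [("so", 1)]}
```

With this graph, `propagate(pdg, "X", "so")` reduces to N * 1, matching propagate(X →out_i) * w(out_i →so).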

Profiling and runtime monitoring • Static analysis is conservative in nature • May produce overly pessimistic results • Main reason – edge weights are initialized too high • Profiling with test inputs to estimate edge weights

Example • Assume static analysis overestimates edge weight between sc and c_z • Profiling gives value of 1 • Node sc is likely non-critical (LNC) • Contrast this with node X which is static critical

Profiling and runtime monitoring • Likely critical instructions – duplicated and checked in software • Using the SWIFT method proposed by Reis et al 2005 • Likely non-critical instructions – monitored using lightweight runtime monitoring technique • Static non-critical instructions – no error checking
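
One way to picture how the three protection levels could be assigned is sketched below. The decision rule is an assumption, not taken from the slides: an instruction is treated as static non-critical when even the conservative static count stays below Nmin, as likely non-critical when only the profiled count does, and as likely critical otherwise. The sketch does not distinguish "static critical" from "likely critical", and the names `classify` and `PROTECTION` are hypothetical.

```python
def classify(static_count, profiled_count, n_min):
    """Assumed rule: compare affected-output counts against Nmin."""
    if static_count < n_min:
        # Even the pessimistic static estimate cannot reach Nmin outputs.
        return "static non-critical"
    if profiled_count < n_min:
        # Static analysis says critical, but profiling suggests otherwise.
        return "likely non-critical"
    return "likely critical"

# Protection applied to each class, as listed on the slide.
PROTECTION = {
    "likely critical": "duplicate and check in software (SWIFT)",
    "likely non-critical": "lightweight runtime monitoring",
    "static non-critical": "no error checking",
}
```

For instance, a node whose static estimate is a full image's worth of outputs but whose profiled edge weight is 1 would land in the likely non-critical class and receive only runtime monitoring.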

Results • Benchmarks from MediaBench, SPEC, MiBench • Simics/GEMS simulation infrastructure

Static instruction classification • Significant number of instructions are non-critical • Profiling helps to determine likely non-critical instructions

Comparison with previous work • Significant savings over the approach proposed by Thaker et al • Their approach protects all instructions which compute memory addresses and control flow

Conclusion • Static + dynamic technique for detecting critical instructions • Detect several non-critical instructions • Reduce overall energy by 25%
