Explore how to ensure application-level correctness in the face of soft errors in CMOS circuits. Detect and safeguard crucial program instructions, diving into critical instruction definition, program representation, and analysis methods. Learn how to characterize the quality of elastic outputs and utilize static analysis and profiling for robust results.
Assuring Application-level Correctness Against Soft Errors • Jason Cong and Karthik Gururaj
Motivation • Soft errors – issue for correct operation of CMOS circuits • Problem becomes more severe – ITRS 2009 • Smaller device sizes • Low supply voltages • Effect of soft errors on circuits • Karnik 2004, Nguyen 2003 • Effect of soft errors on software and processors • Li et al 2005, Wang et al 2004
Motivation • Traditional notion of correctness • Every last bit of every variable in a program should be correct • Referred to as numerical correctness • Application-level correctness • Several applications can tolerate a degree of error • Image viewer, video decoding etc • However, there exist critical instructions even in such applications • Example: state machine in video decoder
Motivation • Goal: Detect all “critical” instructions in the program • Protect “critical” instructions in the program against soft errors • Using duplication
Outline • Motivation • Definition of critical instructions • Program representation • Static analysis to detect critical instructions • Profiling and runtime monitoring • Results
Defining critical instructions • Elastic outputs – program outputs which can tolerate a certain amount of error • Media applications – image, video, etc. • Heuristics – support vector machine • Characterizing quality of elastic outputs – fidelity metric • Examples: PSNR (peak signal-to-noise ratio) for JPEG, bit error rate
Defining critical instructions • Given application A: • I is the input to the application • A set of outputs Oc – numerical correctness required • A set of elastic outputs O • Fidelity metric F(I,O) for elastic outputs • T – threshold for acceptable output • An execution of A is said to satisfy application-level correctness if: • All outputs ∈ Oc are numerically correct • F(I,O) ≥ T for elastic outputs • Nmin – the minimum number of elements of O that need to be erroneous for F(I,O) to fall below T
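The two conditions above can be read as a simple predicate. The following is a minimal Python sketch, not the authors' implementation: the function names, the reference ("golden") outputs used to judge numerical correctness, and the toy fidelity metric are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def satisfies_app_level_correctness(
    exact_outputs: Sequence,
    reference_exact_outputs: Sequence,
    inputs: Sequence,
    elastic_outputs: Sequence,
    fidelity: Callable[[Sequence, Sequence], float],
    threshold: float,
) -> bool:
    """Return True if every output in Oc is numerically correct
    and the fidelity metric F(I, O) is at least the threshold T."""
    # Condition 1: outputs in Oc must match the reference exactly.
    exact_ok = list(exact_outputs) == list(reference_exact_outputs)
    # Condition 2: elastic outputs only need F(I, O) >= T.
    return exact_ok and fidelity(inputs, elastic_outputs) >= threshold


def toy_fidelity(inputs: Sequence, outputs: Sequence) -> float:
    """Hypothetical metric: fraction of elastic outputs equal to the input."""
    matches = sum(1 for a, b in zip(inputs, outputs) if a == b)
    return matches / len(outputs)


# One wrong elastic element out of four still clears a threshold of 0.7.
ok = satisfies_app_level_correctness(
    [7], [7], [1, 2, 3, 4], [1, 2, 3, 9], toy_fidelity, 0.7
)
# A wrong exact output fails even when all elastic outputs are perfect.
bad_exact = satisfies_app_level_correctness(
    [8], [7], [1, 2, 3, 4], [1, 2, 3, 4], toy_fidelity, 0.7
)
print(ok, bad_exact)
```

In a real setting the fidelity function would be a metric such as PSNR, and the threshold would come from the application's quality requirement.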
Example: JPEG decoder • PSNR of 35dB is assumed to be good quality • MSE = 20.56 • Using 8-bit pixel values (MAX=255), • Max error = 255 • For a 1024x768 pixel image, Nmin ~ 251
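The arithmetic behind this example can be approximated with a short Python sketch. It assumes the standard definition PSNR = 10·log10(MAX² / MSE) and assumes that each erroneous pixel is off by the maximum error of 255; the function names are hypothetical and the slide does not show its own rounding.

```python
import math


def mse_from_psnr(psnr_db: float, max_value: int = 255) -> float:
    """Invert PSNR = 10 * log10(MAX^2 / MSE) to get the MSE at a given PSNR."""
    return max_value ** 2 / 10 ** (psnr_db / 10)


def estimate_n_min(psnr_db: float, width: int, height: int,
                   max_value: int = 255) -> int:
    """Smallest number of pixels that, each wrong by the maximum error,
    bring the image MSE up to the MSE at the threshold PSNR."""
    total_squared_error = mse_from_psnr(psnr_db, max_value) * width * height
    return math.ceil(total_squared_error / max_value ** 2)


mse = mse_from_psnr(35.0)
pixels = estimate_n_min(35.0, 1024, 768)
print(f"MSE at 35 dB: {mse:.2f}")
print(f"Nmin estimate: {pixels}")
```

Under these assumptions the sketch reproduces the MSE of about 20.56 and gives an Nmin estimate of roughly 249, in the same range as the ~251 quoted above; the small difference presumably comes from rounding or assumptions not shown on the slide.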
Defining critical instructions • An instruction X is said to be critical if • X affects one of the outputs in Oc (numerical correctness required) OR • X affects at least Nmin elements of the elastic outputs O
Program representation • LLVM compiler infrastructure • LLVM intermediate representation • Weighted program dependence graph (PDG) – G
Example • [Figure: LLVM IR in 3-address code]
Example PDG, based on the LLVM IR • Node for computing X • Node (out_i) to compute C[Z]+X • Node (so) to store C[Z]+X into the output array • Edge to represent dependence between X and out_i • Edge to represent dependence between out_i and so
Assigning edge weights • Edge weight u→v - how many instances of node v are affected by 1 instance of u? • Example: • X outside the loop, out_i inside the loop • Edge weight N • Nodes out_i and so are in the same basic block – • Edge weight 1
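A weighted PDG of this kind can be sketched as a small Python data structure. This is an illustrative assumption, not the authors' LLVM-based implementation: the `WeightedPDG` class is hypothetical, and it derives an edge weight from per-node execution counts, which reproduces the two cases in the example (weight N from outside a loop into it, weight 1 within the same basic block).

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class WeightedPDG:
    """Hypothetical weighted program dependence graph."""
    executions: Dict[str, int] = field(default_factory=dict)
    edges: Dict[Tuple[str, str], int] = field(default_factory=dict)

    def add_node(self, name: str, executions: int) -> None:
        # executions: how many times the instruction runs (assumed known).
        self.executions[name] = executions

    def add_edge(self, u: str, v: str) -> None:
        # Assumed rule: one instance of u feeds executions(v) / executions(u)
        # instances of v, and never fewer than one.
        weight = max(1, self.executions[v] // self.executions[u])
        self.edges[(u, v)] = weight


N = 64  # loop trip count, chosen arbitrarily for the example
pdg = WeightedPDG()
pdg.add_node("X", executions=1)      # computed once, outside the loop
pdg.add_node("out_i", executions=N)  # computed on every iteration
pdg.add_node("so", executions=N)     # store in the same basic block
pdg.add_edge("X", "out_i")
pdg.add_edge("out_i", "so")
print(pdg.edges)
```

Static analysis would have to bound the execution counts conservatively when the trip count is not known at compile time.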
Static analysis for detecting critical instructions • Find how many instances of output O are affected by node x • propagate(x →v) is the number of instances of v that are affected by an instance of x
Example • propagate(u→v) initialized to edge weight for all edges (u →v) • propagate(X →out_i) = N • w(out_i →so) = 1 • propagate(X →so) = propagate(X →out_i) * w(out_i →so) • More formally
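One way to generalize the example is to multiply edge weights along each path and combine the paths that reach the same node. The Python sketch below is an assumption-laden illustration: it handles only acyclic graphs and sums over paths, which may overcount but stays on the conservative side. The function name `propagate_from` and the dictionary-based graph encoding are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def propagate_from(source: str,
                   weights: Dict[Tuple[str, str], int]) -> Dict[str, int]:
    """For each node v reachable from source, estimate how many instances
    of v are affected by one instance of source (acyclic graphs only)."""
    successors: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for (u, v), w in weights.items():
        successors[u].append((v, w))

    # Depth-first search to obtain a topological order from the source.
    order: List[str] = []
    seen = set()

    def visit(node: str) -> None:
        if node in seen:
            return
        seen.add(node)
        for nxt, _ in successors[node]:
            visit(nxt)
        order.append(node)

    visit(source)
    order.reverse()

    # propagate(x -> v) = sum over predecessors u of propagate(x -> u) * w(u -> v)
    affected: Dict[str, int] = defaultdict(int)
    affected[source] = 1
    for u in order:
        for v, w in successors[u]:
            affected[v] += affected[u] * w
    del affected[source]
    return dict(affected)


N = 64  # loop trip count, chosen arbitrarily
result = propagate_from("X", {("X", "out_i"): N, ("out_i", "so"): 1})
print(result)
```

For the two-edge example this gives propagate(X → out_i) = N and propagate(X → so) = N · 1, matching the bullets above.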
Profiling and runtime monitoring • Static analysis is conservative in nature • May produce overly pessimistic results • Main reason – edge weights are initialized too high • Profiling with test inputs to estimate edge weights
Example • Assume static analysis overestimates the edge weight between sc and c_z • Profiling gives a value of 1 • Node sc is likely non-critical (LNC) • Contrast this with node X, which is static critical
Profiling and runtime monitoring • Likely critical instructions – duplicated and checked in software • Using the SWIFT method proposed by Reis et al 2005 • Likely non-critical instructions – monitored using lightweight runtime monitoring technique • Static non-critical instructions – no error checking
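The mapping from classification to protection scheme can be sketched as follows. This is a simplified Python illustration under stated assumptions: it compares a static and a profiled estimate of affected elastic outputs against Nmin, ignores outputs in Oc, and does not distinguish static critical from likely critical instructions. The names `Protection` and `choose_protection` are hypothetical.

```python
from enum import Enum


class Protection(Enum):
    DUPLICATE_AND_CHECK = "duplicate and check in software"
    RUNTIME_MONITOR = "lightweight runtime monitoring"
    NONE = "no error checking"


def choose_protection(static_affected: int, profiled_affected: int,
                      n_min: int) -> Protection:
    """Pick a protection scheme from static and profiled estimates of
    how many elastic output elements an instruction can affect."""
    if static_affected < n_min:
        # Non-critical even under conservative static weights.
        return Protection.NONE
    if profiled_affected < n_min:
        # Static analysis flags it, profiling does not: likely non-critical.
        return Protection.RUNTIME_MONITOR
    # Still above the threshold after profiling: treat as critical.
    return Protection.DUPLICATE_AND_CHECK


n_min = 249  # illustrative threshold
print(choose_protection(1024, 1024, n_min))  # e.g. a node like X
print(choose_protection(1024, 1, n_min))     # e.g. a node like sc
print(choose_protection(1, 1, n_min))        # statically non-critical
```

Monitoring the likely non-critical instructions at runtime guards against inputs that behave differently from the profiled test inputs.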
Results • Benchmarks from MediaBench, SPEC, MiBench • Simics/GEMS simulation infrastructure
Static instruction classification • Significant number of instructions are non-critical • Profiling helps to determine likely non-critical instructions
Comparison with previous work • Significant savings over the approach proposed by Thaker et al. • Their approach protects all instructions which compute memory addresses and control flow
Conclusion • Static + dynamic technique for detecting critical instructions • Detect several non-critical instructions • Reduce overall energy by 25%