Application-level Techniques to Improve System Resilience

Vishal Chandra Sharma*, Arvind Haran*, Zvonimir Rakamaric*, Ganesh Gopalakrishnan*§
{vcsharma, haran, zvonimir, ganesh}@cs.utah.edu
School of Computing, University of Utah
*Supported in part by NSF Award CCF 1255776 and SRC contract 2013-TJ-2426.
§Faculty Associate, SUPER (http://super-scidac.org/)

Research Goals
• Robust Evaluation Infrastructure
  - Released KULFI, an open source instruction-level fault injector
  - Evaluation of sorting routines done using KULFI, results shared at PRDC'13
• Lightweight Application-level Detectors
  - Developed FUSED, a soft-error detection framework
  - Preliminary results to be presented at SELSE'14
  - Further work in progress to develop heuristics to optimize detector placement
• Identifying Vulnerable Code-Regions
  - Applications include detector placement optimization
  - Work in progress

KULFI: A Soft-Error Injector
• Flexible evaluation infrastructure using KULFI
• Active collaborations to promote usage of KULFI in other resilience studies
  - Current collaborators: Greg Bronevetsky (LLNL), Sui Chen, Lu Peng (LSU)


Motivating Example
LSB of x flipped
    int x = 3;
    int y = 11;
    printf("x=%d, y=%d", x, y);
    if (x < 3 && y > 10) y++; else x++;
SDC in the output value of x
Program output: x=4, y=12

A Software-Level Approach to Fault Detection
    int x = 2; int y = 11;
    PP0: if (x < 3 && y > 10) {
    PP1:     y++;
    PP2: } else {
    PP3:     x++;
    PP4: }
    PP5: printf("x=%d, y=%d", x, y);
Program Conditionals: x < 3, y > 10
Program Points: PP0, PP1, PP2, PP3, PP4, PP5
Predicate State at PP0: <PP0, TT>
Predicate State at PP1: <PP1, TT>
Example Predicate State Transition: <PP0, TT> → <PP1, TT>
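The predicate states in this example can be reproduced with a small C sketch. It is only an illustration, not the FUSED implementation: the PredState struct, the bit encoding (bit 1 for x < 3, bit 0 for y > 10), and the function names are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical encoding: bit 1 holds (x < 3), bit 0 holds (y > 10). */
typedef struct {
    int pp;       /* program point index: 0 for PP0, ..., 5 for PP5 */
    unsigned bv;  /* bit-vector of predicate values at that point */
} PredState;

static unsigned eval_predicates(int x, int y) {
    return ((unsigned)(x < 3) << 1) | (unsigned)(y > 10);
}

/* Runs the slide's example program and records the predicate state at
 * each visited program point. Returns the number of recorded states. */
static int trace_example(int x, int y, PredState out[], int cap) {
    int n = 0;
    assert(cap >= 4);  /* either path visits exactly four program points */
    out[n++] = (PredState){0, eval_predicates(x, y)};      /* PP0 */
    if (x < 3 && y > 10) {
        out[n++] = (PredState){1, eval_predicates(x, y)};  /* PP1 */
        y++;
        out[n++] = (PredState){2, eval_predicates(x, y)};  /* PP2 */
    } else {
        out[n++] = (PredState){3, eval_predicates(x, y)};  /* PP3 */
        x++;
        out[n++] = (PredState){4, eval_predicates(x, y)};  /* PP4 */
    }
    out[n++] = (PredState){5, eval_predicates(x, y)};      /* PP5 */
    return n;
}

/* Prints a trace in the slide's notation, using "->" for transitions. */
static void print_trace(const PredState s[], int n) {
    for (int i = 0; i < n; i++)
        printf("<PP%d, %c%c>%s", s[i].pp,
               (s[i].bv & 2u) ? 'T' : 'F',
               (s[i].bv & 1u) ? 'T' : 'F',
               i + 1 < n ? " -> " : "\n");
}
```

Calling trace_example(2, 11, ...) follows the then-branch, so the first transition recorded is the one from PP0 to PP1 with both predicates true, matching the example transition above.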

PTD : Visualizing Spurious Transitions

FUSED Soft-Error Detection Framework
• Automatically synthesizes and inserts detectors
• Uses profilers to generate likely invariants
• Likely invariants are used for soft-error detection

Preliminary Experimental Results
• FUSED is evaluated using the SuperLU scientific library
• Up to 90% of soft errors are detected
• Detectors are inserted only into the top-level LU factorization routine
• Average execution overhead of 19% due to the detectors
• Future work: optimize detector placement to reduce overhead

PTD for SuperLU [slu99, slu05, slu11]

Identifying Vulnerable Code-Regions (Work in Progress)
• Identify highly active code-regions w.r.t. data-flow
• Compute activity cost for data-flow edges
• Highly active code-regions are the most likely to be hit by soft errors
• Applications include detector placement optimization

Concluding Remarks & Future Work
• KULFI, an open source fault injector for evaluation infrastructure
  - Try out KULFI: https://github.com/soar-lab/KULFI
• FUSED error detection framework
  - Continue working to develop heuristics for detector placement optimization
  - Plan for open source release
• Identifying Vulnerable Code-Regions
  - Applications include detector placement optimization
  - Characterizing resilience properties of a program

References
[arg13] M. Snir et al., "Addressing Failures in Exascale Computing," No. ANL/MCS-TM-33, Argonne National Laboratory (ANL), 2013
[lanl05] S. E. Michalak et al., "Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer," IEEE Transactions on Device and Materials Reliability, 2005
[llvm04] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in International Symposium on Code Generation and Optimization (CGO), 2004
[pct05] T. Ball, "A theory of predicate-complete test coverage and generation," in International Conference on Formal Methods for Components and Objects (FMCO), 2005
[iswat08] S. K. Sahoo, M.-L. Li, P. Ramachandran, S. V. Adve, V. S. Adve, and Y. Zhou, "Using likely program invariants to detect hardware errors," in IEEE International Conference on Dependable Systems and Networks (DSN), 2008
[sloan13] J. Sloan, R. Kumar, and G. Bronevetsky, "An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance," in IEEE International Conference on Dependable Systems and Networks (DSN), 2013

References (continued)
[slu99] J. W. Demmel et al., "A supernodal approach to sparse partial pivoting," SIAM Journal on Matrix Analysis and Applications, 1999
[slu05] X. S. Li, "An overview of SuperLU: Algorithms, implementation, and user interface," ACM Transactions on Mathematical Software (TOMS), 2005
[slu11] X. S. Li, J. W. Demmel, J. R. Gilbert, L. Grigori, M. Shao, and I. Yamazaki, SuperLU Users' Guide, 2011. URL: http://crd.lbl.gov/~xiaoye/SuperLU/superlu_ug.pdf
[sprs11] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software (TOMS), 2011
[parsec08] C. Bienia, S. Kumar, J. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," ser. PACT, 2008
[relax10] M. de Kruijf, S. Nomura, and K. Sankaralingam, "Relax: An architectural framework for software recovery of hardware faults," in International Symposium on Computer Architecture (ISCA), 2010
[schen13] S. Chen, personal communication, 2013

Backup

Closely Related Work
• Low-cost software-level detectors are needed
  - iSWAT by Sahoo et al. [iswat08] uses likely program invariants
    - Derives likely invariants by monitoring program properties
    - Uses a hardware-assisted framework to detect false positives
    - Not based on predicate abstraction
  - Error localization by Sloan et al. [sloan13] uses an algorithm-based approach
• A fault injector is needed as part of the evaluation infrastructure
  - LLVM-level fault injector developed by de Kruijf et al. [relax10]
    - Not publicly available
  - A recent study by a user [schen13] suggests KULFI has better fine-grained options
  - LLFI fault injector by Thomas et al.
    - Developed around the same time as KULFI, shares many similar features

KULFI: Fault Injection Logic
Start → For all dynamic instructions → Feasible?
  - Yes: Inject fault with user-provided probability
  - No: no fault injected
→ Stop
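The flowchart's decision logic can be sketched in plain C. This is a simplified model, not KULFI's code: KULFI instruments LLVM instructions, whereas here a hypothetical hook maybe_inject is called with each instruction's result. The feasibility flag, the 32-bit value width, and the use of rand() are all assumptions of the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Flips one bit of a 32-bit value, modeling a single-bit soft error. */
static uint32_t flip_bit(uint32_t value, unsigned bit) {
    assert(bit < 32);
    return value ^ (UINT32_C(1) << bit);
}

/* Hypothetical per-instruction hook. `feasible` says whether this dynamic
 * instruction is a valid injection site; `prob` is the user-provided
 * injection probability. `*injected` counts the faults injected so far. */
static uint32_t maybe_inject(uint32_t result, int feasible, double prob,
                             unsigned *injected) {
    assert(prob >= 0.0 && prob <= 1.0);
    if (!feasible)
        return result;                                /* "No" branch */
    double draw = rand() / ((double)RAND_MAX + 1.0);  /* uniform in [0, 1) */
    if (draw >= prob)
        return result;                                /* no fault this time */
    ++*injected;
    return flip_bit(result, (unsigned)(rand() % 32)); /* random bit position */
}
```

With prob set to 0.0 the hook never injects, and with 1.0 it injects at every feasible instruction; a driver that wants exactly one fault per run could stop calling the hook once *injected reaches 1.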

Case Study
• Empirically study resiliency of sorting algorithms
  - Bubblesort, Quicksort, Mergesort, Radixsort, Heapsort
• Inject exactly one fault in a randomly chosen dynamic instruction of a sorting routine
• 1 fault injection experiment = 100 runs with exactly one fault injected
• Categorize outcome into SDC, Benign, or Segmentation fault categories
  - Benign: 41, Segmentation: 29, SDC: 30

Case Study
• Executed 200 fault injection experiments per sorting routine
• Total number of fault injections = 5 * 200 * 100 = 100,000
• Plotted fault counts from each outcome category for each fault injection experiment
• Results show a strong clustering pattern with a statistically significant distribution for each outcome category
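One fault injection experiment from the case study can be approximated in C without an LLVM-level injector. The sketch below is an assumption-laden stand-in, not the study's harness: it flips one bit in the left operand of one randomly chosen bubblesort comparison and compares the result with a fault-free run. Because the flip touches only a compared value, never an address, the Segmentation category cannot occur here; a real harness would need to run each trial in its own process to observe crashes. All names and the MAX_N bound are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum Outcome { BENIGN = 0, SDC = 1 };

#define MAX_N 64  /* hypothetical bound on the input size */

/* Bubblesort in which the left operand of the `target`-th dynamic
 * comparison (0-based) has `bit` flipped. target = -1 means no fault. */
static void bubblesort_with_fault(int a[], size_t n, long target, unsigned bit) {
    long step = 0;
    assert(bit < 31);
    for (size_t i = 0; i + 1 < n; i++) {
        for (size_t j = 0; j + 1 < n - i; j++) {
            int lhs = a[j];
            if (step++ == target)
                lhs ^= 1 << bit;          /* the single injected fault */
            if (lhs > a[j + 1]) {
                int tmp = a[j];
                a[j] = a[j + 1];
                a[j + 1] = tmp;
            }
        }
    }
}

/* One run: sort with and without the fault, then compare the outputs. */
static enum Outcome run_once(const int input[], size_t n, long target, unsigned bit) {
    int golden[MAX_N], faulty[MAX_N];
    assert(n <= MAX_N);
    memcpy(golden, input, n * sizeof *input);
    memcpy(faulty, input, n * sizeof *input);
    bubblesort_with_fault(golden, n, -1, 0);
    bubblesort_with_fault(faulty, n, target, bit);
    return memcmp(golden, faulty, n * sizeof *input) == 0 ? BENIGN : SDC;
}

/* One experiment: `runs` runs, each with exactly one random fault. */
static void experiment(const int input[], size_t n, int runs, int counts[2]) {
    assert(n >= 2);
    long steps = (long)(n * (n - 1) / 2);  /* comparisons per sort */
    counts[BENIGN] = counts[SDC] = 0;
    for (int r = 0; r < runs; r++)
        counts[run_once(input, n, rand() % steps, (unsigned)(rand() % 31))]++;
}
```

Calling experiment(input, n, 100, counts) mirrors the "100 runs with exactly one fault" setup; repeating it 200 times per routine gives the per-experiment counts that the case study plots.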

Results

Overview
• Introduction
• KULFI: An LLVM-Level Fault Injector
• Case Study
• Fault Detector
• Concluding Remarks

A Software-Level Approach to Fault Detection
• Predicates: Pure boolean program conditionals
• Predicate State: <PP, BV>
  - PP: Program point between two successive program statements
  - BV: Bit-vector representing concrete boolean values of program conditionals at a given program point
• Predicate State Transition: <PP, BV> → <PP', BV'>
  - PP' is a program point which is an immediate successor of PP
  - BV' is the bit-vector representing concrete boolean values of program conditionals at PP'

A Software-Level Approach to Fault Detection
Profiling flow: Start → Program → Execute Program → Extract all valid predicate transitions → Stop
Detection flow: Start → Program → Execute Program → Get Predicate Transition → Check if Valid?
  - No: Fault Detected → Stop
  - Yes: Is last transition? If No, get the next predicate transition; if Yes, Stop
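The two flows, collecting valid transitions and then checking each observed transition against them, can be sketched in C as a lookup table. This is a minimal illustration under stated assumptions, not FUSED's implementation: the bounds MAX_PP and MAX_BV, the dense table, and the names learn and first_invalid are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_PP 8  /* hypothetical bound on program points */
#define MAX_BV 4  /* two predicates give four bit-vector values */

typedef struct { int pp; unsigned bv; } PredState;

/* valid[pp][bv][pp2][bv2] is true if <pp, bv> -> <pp2, bv2> was observed
 * during fault-free profiling. */
typedef struct { bool valid[MAX_PP][MAX_BV][MAX_PP][MAX_BV]; } TransitionSet;

static void ts_init(TransitionSet *ts) { memset(ts, 0, sizeof *ts); }

static void check_state(PredState s) {
    assert(s.pp >= 0 && s.pp < MAX_PP && s.bv < MAX_BV);
}

/* Profiling flow: record every transition of a fault-free trace. */
static void learn(TransitionSet *ts, const PredState trace[], int n) {
    for (int i = 0; i + 1 < n; i++) {
        check_state(trace[i]);
        check_state(trace[i + 1]);
        ts->valid[trace[i].pp][trace[i].bv][trace[i + 1].pp][trace[i + 1].bv] = true;
    }
}

/* Detection flow: return the index of the first transition that was never
 * seen in profiling (fault detected), or -1 if every transition is valid. */
static int first_invalid(const TransitionSet *ts, const PredState trace[], int n) {
    for (int i = 0; i + 1 < n; i++) {
        check_state(trace[i]);
        check_state(trace[i + 1]);
        if (!ts->valid[trace[i].pp][trace[i].bv][trace[i + 1].pp][trace[i + 1].bv])
            return i;
    }
    return -1;
}
```

Because the table holds only transitions seen during profiling, a transition that is legal but was never exercised would also be flagged, which is why such invariants are called likely invariants.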

Predicate Transition Diagram (PTD)
Start → Program, then two executions:
  - Execute Program → Track Predicate Transitions
  - Execute Program with Inject Fault → Track Predicate Transitions
Merge → Predicate Transition Diagram → Stop

PTD for Blackscholes in PARSEC 3.0 [parsec08]

Acknowledgements
• Pedro Diniz
• Prabhakar Kudva
• Shuvendu Lahiri
• Karthik Pattabiraman
• Sui Chen
• Anonymous reviewers of the PRDC conference who reviewed our paper
