
Application-level Techniques to Improve System Resilience

Application-level Techniques to Improve System Resilience. Vishal Chandra Sharma * , Arvind Haran * , Zvonimir Rakamaric * , Ganesh Gopalakrishnan *§ { vcsharma , haran , zvonimir , ganesh }@cs.utah.edu School of Computing, University of Utah.


Presentation Transcript


  1. Application-level Techniques to Improve System Resilience • Vishal Chandra Sharma*, Arvind Haran*, Zvonimir Rakamaric*, Ganesh Gopalakrishnan*§ • {vcsharma, haran, zvonimir, ganesh}@cs.utah.edu • School of Computing, University of Utah • *Supported in part by NSF Award CCF 1255776 and SRC contract 2013-TJ-2426. • §Faculty Associate, SUPER (http://super-scidac.org/)

  2. Research Goals • Robust Evaluation Infrastructure • Released KULFI, an open source instruction-level fault injector • Evaluation of sorting routines done using KULFI, results shared at PRDC’13 • Lightweight Application-level Detectors • Developed FUSED, soft-error detection framework • Preliminary results to be presented at SELSE’14 • Further work in progress to develop heuristics to optimize detector placement • Identifying Vulnerable Code-Regions • Application includes detector placement optimization • Work in progress

  3. KULFI: A Soft-Error Injector • Flexible evaluation infrastructure using KULFI • Active collaborations to promote usage of KULFI in other resilience studies • Current collaborators: Greg Bronevetsky (LLNL), Sui Chen, Lu Peng (LSU)

  4. Motivating Example • LSB position of x flipped • int x = 3; int y = 11; printf("x=%d, y=%d", x, y); if (x < 3 && y > 10) y++; else x++;

  5. Motivating Example • LSB position of x flipped • int x = 3; int y = 11; printf("x=%d, y=%d", x, y); if (x < 3 && y > 10) y++; else x++; • SDC in the output value of x • Program output: x=4, y=12

  6. A Software-Level Approach to Fault Detection • int x = 2; int y = 11; PP0: if (x < 3 && y > 10) { PP1: y++; PP2: } else { PP3: x++; PP4: } PP5: printf("x=%d, y=%d", x, y); • Program Conditionals: x<3, y>10 • Program Points: PP0, PP1, PP2, PP3, PP4, PP5 • Predicate State at PP0: <PP0, TT> • Predicate State at PP1: <PP1, TT> • Example Predicate State Transition: <PP0, TT> → <PP1, TT>
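The predicate state on this slide can be sketched in C as a program point paired with a small bit-vector. This is only an illustration under stated assumptions: the type `PredicateState`, the functions `predicate_state` and `format_state`, and the bit layout (bit 1 for `x < 3`, bit 0 for `y > 10`) are hypothetical names and choices, not taken from FUSED.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical encoding: bit 1 holds (x < 3), bit 0 holds (y > 10). */
typedef struct {
    int pp;       /* program point index, e.g. 0 for PP0 */
    unsigned bv;  /* bit-vector of predicate truth values */
} PredicateState;

/* Evaluate both conditionals of the slide's example at a program point. */
PredicateState predicate_state(int pp, int x, int y) {
    PredicateState s;
    assert(pp >= 0 && pp <= 5);  /* the example has PP0..PP5 */
    s.pp = pp;
    s.bv = ((unsigned)(x < 3) << 1) | (unsigned)(y > 10);
    return s;
}

/* Render a state as "<PP0, TT>"; buf should hold at least 16 bytes. */
void format_state(PredicateState s, char *buf, size_t n) {
    snprintf(buf, n, "<PP%d, %c%c>", s.pp,
             (s.bv & 2u) ? 'T' : 'F',
             (s.bv & 1u) ? 'T' : 'F');
}
```

With `x = 2` and `y = 11`, both conditionals hold at PP0 and PP1, so both states carry the bit-vector TT, matching the transition shown on the slide.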

  7. PTD: Visualizing Spurious Transitions

  8. FUSED Soft-Error Detection Framework • Automatically synthesizes and inserts detectors • Uses profilers to generate likely invariants • Likely invariants are used for soft error detection

  9. Preliminary Experimental Results • FUSED is evaluated using the SuperLU scientific library • Up to 90% of soft errors are detected • Detectors are inserted only into the top-level LU factorization routine • The detectors add an average execution overhead of 19% • Future work: optimize detector placement to reduce overhead

  10. PTD for SuperLU [slu99, slu05, slu11]

  11. Identifying Vulnerable Code-Regions • Identify highly active code-regions w.r.t. data-flow • Compute activity cost for data-flow edges • Highly active code-regions are the most likely to be hit by soft errors • Applications include detector placement optimization • Work in progress

  12. Concluding Remarks & Future Work • KULFI, an open source fault injector for evaluation infrastructure • Try out KULFI: https://github.com/soar-lab/KULFI • FUSED error detection framework • Continue working to develop heuristics for detector placement optimization • Plan for open source release • Identifying Vulnerable Code-Regions • Applications include detector placement optimization • Characterizing resilience properties of a program

  13. References • [arg13] Snir, M., et al. Addressing Failures in Exascale Computing. No. ANL/MCS-TM-33. Argonne National Laboratory (ANL), 2013. • [lanl05] Michalak, Sarah E., et al. "Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer." IEEE Transactions on Device and Materials Reliability, 2005. • [llvm04] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in International Symposium on Code Generation and Optimization (CGO), 2004. • [pct05] T. Ball, "A theory of predicate-complete test coverage and generation," in International Conference on Formal Methods for Components and Objects (FMCO), 2005. • [iswat08] S. K. Sahoo, M.-L. Li, P. Ramachandran, S. V. Adve, V. S. Adve, and Y. Zhou, "Using likely program invariants to detect hardware errors," in IEEE International Conference on Dependable Systems and Networks (DSN), 2008. • [sloan13] Sloan, Joseph, Rakesh Kumar, and Greg Bronevetsky. "An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance," in IEEE International Conference on Dependable Systems and Networks (DSN), 2013.

  14. References • [slu99] Demmel, James W., et al. "A supernodal approach to sparse partial pivoting." SIAM Journal on Matrix Analysis and Applications, 1999. • [slu05] Li, Xiaoye S. "An overview of SuperLU: Algorithms, implementation, and user interface." ACM Transactions on Mathematical Software (TOMS), 2005. • [slu11] Li, X. S., Demmel, J. W., Gilbert, J. R., Grigori, L., Shao, M., and Yamazaki, I. SuperLU Users' Guide, 2011. URL: http://crd.lbl.gov/~xiaoye/SuperLU/superlu_ug.pdf • [sprs11] Davis, Timothy A., and Yifan Hu. "The University of Florida sparse matrix collection." ACM Transactions on Mathematical Software (TOMS), 2011. • [parsec08] C. Bienia, S. Kumar, J. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," ser. PACT, 2008. • [relax10] M. de Kruijf, S. Nomura, and K. Sankaralingam, "Relax: An architectural framework for software recovery of hardware faults," in International Symposium on Computer Architecture (ISCA), 2010. • [schen13] S. Chen, personal communication, 2013.

  15. Backup

  16. Closely Related Work • Low-cost software-level detectors are needed • iSWAT by Sahoo et al. [iswat08] uses likely program invariants • Derives likely invariants by monitoring program properties • Uses hardware-assisted framework to detect false positives • Not based on predicate abstraction • Error localization by Sloan et al. [sloan13] uses an algorithm-based approach • A fault injector is needed as part of the evaluation infrastructure • LLVM-level fault injector developed by de Kruijf et al. [relax10] • Not publicly available • A recent user study [schen13] suggests KULFI has better fine-grained options • LLFI fault injector by Thomas et al. • Developed around the same time as KULFI, shares many similar features

  17. KULFI: Fault Injection Logic • [Flowchart: Start → for all dynamic instructions → Feasible? → if yes, inject fault with user-provided probability; if no, skip the instruction → Stop]
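The injection loop in this flowchart might look roughly like the following C sketch. It is not KULFI's code: KULFI works on LLVM instructions, whereas this sketch assumes a plain array of 32-bit instruction results, and the names `flip_bit`, `inject_faults`, the `feasible` flag array, and the `seed` parameter are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Flip one bit of a 32-bit value; models a single-bit soft error. */
uint32_t flip_bit(uint32_t value, unsigned bit) {
    assert(bit < 32);
    return value ^ (UINT32_C(1) << bit);
}

/*
 * Walk a trace of dynamic instruction results. For each feasible one,
 * flip a randomly chosen bit with probability `prob`.
 * `feasible[i]` is an assumed per-instruction flag (nonzero = injectable).
 * Returns the number of faults injected.
 */
size_t inject_faults(uint32_t *results, const int *feasible, size_t n,
                     double prob, unsigned seed) {
    size_t injected = 0;
    assert(prob >= 0.0 && prob <= 1.0);
    srand(seed);  /* a fixed seed keeps an experiment reproducible */
    for (size_t i = 0; i < n; i++) {
        if (!feasible[i])
            continue;  /* "Feasible? No": skip this instruction */
        /* rand() / (RAND_MAX + 1.0) lies in [0, 1) */
        double r = rand() / (RAND_MAX + 1.0);
        if (r < prob) {
            results[i] = flip_bit(results[i], (unsigned)(rand() % 32));
            injected++;
        }
    }
    return injected;
}
```

Setting `prob` to 0.0 leaves the trace untouched and 1.0 corrupts every feasible entry, which is a quick sanity check for the loop.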

  18. Case Study • Empirically study resiliency of sorting algorithms: Bubblesort, Quicksort, Mergesort, Radixsort, Heapsort • Inject exactly one fault in a randomly chosen dynamic instruction of a sorting routine • 1 fault injection experiment = 100 runs, each with exactly one fault injected • Categorize each outcome as SDC, Benign, or Segmentation fault • Example experiment: Benign: 41, Segmentation: 29, SDC: 30
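The outcome bookkeeping for one experiment can be sketched as below. The harness inputs (`crashed`, `output_matches_golden`) and all type and function names are assumptions made for illustration; the slides do not describe how the study's scripts classify runs.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    OUTCOME_BENIGN,    /* run finished with correct output */
    OUTCOME_SEGFAULT,  /* run crashed */
    OUTCOME_SDC,       /* run finished with silently corrupted output */
    OUTCOME_COUNT
} Outcome;

typedef struct {
    unsigned counts[OUTCOME_COUNT];
} Experiment;

/* Classify one run from two flags supplied by a hypothetical harness. */
Outcome classify_run(int crashed, int output_matches_golden) {
    if (crashed)
        return OUTCOME_SEGFAULT;
    return output_matches_golden ? OUTCOME_BENIGN : OUTCOME_SDC;
}

/* Add one run's outcome to the experiment tally. */
void record_outcome(Experiment *e, Outcome o) {
    assert(e != NULL && o < OUTCOME_COUNT);
    e->counts[o]++;
}

/* Total runs recorded so far; one experiment should reach 100. */
unsigned total_runs(const Experiment *e) {
    unsigned total = 0;
    for (size_t i = 0; i < OUTCOME_COUNT; i++)
        total += e->counts[i];
    return total;
}
```

Checking that `total_runs` equals 100 after an experiment guards against runs that were dropped or counted twice.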

  19. Case Study • Executed 200 fault injection experiments per sorting routine • Total number of fault injections = 5*200*100 = 100000 • Plotted fault counts from each outcome category for each fault injection experiment • Results show a strong clustering pattern with a statistically significant distribution for each outcome category

  20. Results

  21. Results

  22. Results

  23. Overview • Introduction • KULFI: An LLVM-Level Fault Injector • Case Study • Fault Detector • Concluding Remarks

  24. A Software-Level Approach to Fault Detection • Predicates: Pure boolean program conditionals • Predicate State: <PP, BV> • PP: Program point between two successive program statements • BV: Bit-vector representing concrete boolean values of program conditionals at a given program point • Predicate State Transition: <PP, BV> → <PP’, BV’> • PP’ is a program point which is an immediate successor of PP • BV’ is the bit-vector representing concrete boolean values of program conditionals at PP’

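  25. A Software-Level Approach to Fault Detection • [Flowchart, profiling phase: Start Program → Execute Program → Extract all valid predicate transitions → Stop] • [Flowchart, detection phase: Start Program → Execute Program → Get Predicate Transition → Check if valid? → if no, Fault Detected; if yes, Is last transition? → if no, get the next predicate transition; if yes, Stop]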
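The validity check at the heart of this slide can be sketched as a lookup against the profiled set of transitions. This is a minimal sketch, assuming a linear scan over a small array; the types and function names are hypothetical and the slides do not say how FUSED stores or searches its transitions.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int pp;       /* program point */
    unsigned bv;  /* predicate bit-vector at that point */
} PredicateState;

typedef struct {
    PredicateState from;
    PredicateState to;
} Transition;

int same_state(PredicateState a, PredicateState b) {
    return a.pp == b.pp && a.bv == b.bv;
}

/* Return 1 if (from -> to) is in the profiled set of valid transitions. */
int is_valid_transition(const Transition *valid, size_t n,
                        PredicateState from, PredicateState to) {
    assert(valid != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (same_state(valid[i].from, from) && same_state(valid[i].to, to))
            return 1;
    }
    return 0;
}

/*
 * Scan an observed trace of predicate states. Return the index of the
 * first state reached through an invalid transition (a detected fault),
 * or -1 if every transition in the trace is valid.
 */
long first_invalid_transition(const Transition *valid, size_t n,
                              const PredicateState *trace, size_t len) {
    for (size_t i = 1; i < len; i++) {
        if (!is_valid_transition(valid, n, trace[i - 1], trace[i]))
            return (long)i;
    }
    return -1;
}
```

A linear scan keeps the sketch short; a real detector would more plausibly use a hash set or a per-program-point table to keep the runtime check cheap.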

  26. Predicate Transition Diagram (PTD) • [Flowchart: Start Program → two executions of the program, one fault-free and one with an injected fault → Track Predicate Transitions in each execution → Merge → Predicate Transition Diagram → Stop]
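The merge step can be sketched as a set union over transition edges. The fixed-capacity `Ptd` struct, the `Edge` layout, and the function names are assumptions for illustration only; the slides do not specify how the PTD is represented.

```c
#include <assert.h>
#include <stddef.h>

#define PTD_MAX_EDGES 64  /* arbitrary capacity chosen for this sketch */

typedef struct {
    int from_pp;
    unsigned from_bv;
    int to_pp;
    unsigned to_bv;
} Edge;

typedef struct {
    Edge edges[PTD_MAX_EDGES];
    size_t count;
} Ptd;

int ptd_contains(const Ptd *p, Edge e) {
    for (size_t i = 0; i < p->count; i++) {
        const Edge *k = &p->edges[i];
        if (k->from_pp == e.from_pp && k->from_bv == e.from_bv &&
            k->to_pp == e.to_pp && k->to_bv == e.to_bv)
            return 1;
    }
    return 0;
}

/* Add an edge if absent. Returns 1 if it was new, 0 if already present. */
int ptd_add(Ptd *p, Edge e) {
    if (ptd_contains(p, e))
        return 0;
    assert(p->count < PTD_MAX_EDGES);
    p->edges[p->count++] = e;
    return 1;
}

/* Merge src into dst. Returns how many edges were new to dst. */
size_t ptd_merge(Ptd *dst, const Ptd *src) {
    size_t added = 0;
    for (size_t i = 0; i < src->count; i++)
        added += (size_t)ptd_add(dst, src->edges[i]);
    return added;
}
```

When the fault-free edges are merged first, the count returned by merging a faulty run is the number of transitions seen only under fault injection, which are the candidates for spurious transitions.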

  27. PTD for Blackscholes in Parsec 3.0[parsec08]

  28. Acknowledgements • Pedro Diniz • Prabhakar Kudva • Shuvendu Lahiri • Karthik Pattabiraman • Sui Chen • Anonymous reviewers of our PRDC paper
