
SWAT: Designing Resilient Hardware by Treating Software Anomalies

Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet Ulya Karpuzcu, Sarita Adve, Vikram Adve, Yuanyuan Zhou. Department of Computer Science, University of Illinois at Urbana-Champaign.


Presentation Transcript


  1. SWAT: Designing Resilient Hardware by Treating Software Anomalies. Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet Ulya Karpuzcu, Sarita Adve, Vikram Adve, Yuanyuan Zhou. Department of Computer Science, University of Illinois at Urbana-Champaign. swat@cs.uiuc.edu

  2. Motivation • Hardware failures will happen in the field • Aging, soft errors, inadequate burn-in, design defects, … Need in-field detection, diagnosis, recovery, repair • Reliability problem pervasive across many markets • Traditional redundancy (e.g., nMR) too expensive • Piecemeal solutions for specific fault model too expensive • Must incur low area, performance, power overhead Today: low-cost solution for multiple failure sources

  3. Observations • Need to handle only hardware faults that propagate to software • Fault-free case remains common, must be optimized Watch for software anomalies (symptoms) • Hardware fault detection ~ Software bug detection • Zero to low overhead “always-on” monitors Diagnose cause after symptom detected • May incur high overhead, but rarely invoked → SWAT: SoftWare Anomaly Treatment

  4. SWAT Framework Components [Figure: execution timeline with labels Checkpoint, Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair] • Detection: Symptoms of software misbehavior, minimal backup hardware • Recovery: Hardware/software checkpoint and rollback • Diagnosis: Rollback/replay on multicore • Repair/reconfiguration: Redundant, reconfigurable hardware • Flexible control through firmware

  5. SWAT Framework Components [Figure: execution timeline with labels Checkpoint, Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair] 1. Detectors w/ simple hardware [ASPLOS ’08] 2. Detectors w/ compiler support [DSN ’08a] 3. Trace-Based Fault Diagnosis [DSN ’08b] 4. Accurate Fault Models [HPCA ’09]

  6. Simple Hardware-only Symptom-based Detection • Observe anomalous symptoms for fault detection • Incur low overheads for “always-on” detectors • Minimal support from hardware, no software support • Fatal traps generated by hardware • Division by Zero, RED State, etc. • Hangs detected using simple hardware hang detector • High OS activity detected with performance counter • Typical OS invocations take 10s or 100s of instructions
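The high-OS-activity symptom on this slide can be pictured as a counter of contiguous OS-mode instructions that fires once a run grows far beyond a typical OS invocation. The sketch below is a software illustration only, not the hardware described in the talk; the function name `first_high_os_symptom` and the threshold passed to it are assumptions made for the example.

```python
from typing import Iterable, Optional


def first_high_os_symptom(in_os_mode: Iterable[bool], threshold: int) -> Optional[int]:
    """Return the index of the instruction at which a run of contiguous
    OS-mode instructions first exceeds `threshold`, or None if it never does.

    `in_os_mode` yields one flag per retired instruction (True = OS mode).
    `threshold` is a hypothetical tuning value, not a number from the talk.
    """
    run_length = 0
    for index, in_os in enumerate(in_os_mode):
        # Extend the current OS run, or reset it when user code resumes.
        run_length = run_length + 1 if in_os else 0
        if run_length > threshold:
            return index
    return None


# A short OS invocation (3 instructions) is ignored; a long one raises the symptom.
stream = [False] * 5 + [True] * 3 + [False] + [True] * 10
symptom_at = first_high_os_symptom(stream, threshold=4)
```

Because the counter resets whenever user-mode code resumes, short and frequent OS invocations never trigger it, which matches the slide's point that typical invocations are brief.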

  7. Experimental Methodology • Microarchitecture-level fault injection • GEMS ooo timing models + Simics full-system simulation • SPEC apps on OpenSolaris, UltraSPARC III ISA • Fault model • Stuck-at, bridging faults in latches of 8 arch structures • 12,800 faults, <0.3% error @ 95% confidence • Also studied transients, but this talk on permanents • Simulate impact of fault in detail for 10M instructions [Figure: after fault injection, 10M instr of timing simulation; if no symptom in 10M instr, run to completion in functional simulation; outcomes: app masked, symptom > 10M, or silent data corruption (SDC)]
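As a rough illustration of the stuck-at fault model named on this slide, the sketch below forces one bit of a latch value to 0 or 1 on every read. It is not the GEMS/Simics injector used in the study; the function names and the 64-bit default width are assumptions for the example.

```python
def inject_stuck_at(value: int, bit: int, stuck_value: int, width: int = 64) -> int:
    """Return `value` as it would be read from a latch whose bit `bit`
    is permanently stuck at `stuck_value` (0 or 1)."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    if stuck_value not in (0, 1):
        raise ValueError("stuck_value must be 0 or 1")
    mask = 1 << bit
    # Stuck-at-1 forces the bit on; stuck-at-0 forces it off.
    faulty = value | mask if stuck_value == 1 else value & ~mask
    return faulty & ((1 << width) - 1)


def is_activated(value: int, bit: int, stuck_value: int, width: int = 64) -> bool:
    """A fault is activated only when the stuck bit differs from the correct bit."""
    correct = value & ((1 << width) - 1)
    return inject_stuck_at(value, bit, stuck_value, width) != correct
```

The second helper shows why many injected faults are masked: a stuck-at-1 fault has no effect whenever the correct bit is already 1.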

  8. Efficacy of Simple HW Only Detectors - Coverage Permanent faults • 98% of unmasked faults detected in 10M instr (w/o FPU) • 0.4% of injected faults result in SDC (w/o FPU) • Need hardware support or other monitors for FPU

  9. Latency to Detection from Software Corruption • 88% detected within 100K instructions, rest within 10M instr • Can use hardware recovery methods – SafetyNet, ReVive

  10. Conclusions So Far SWAT approach feasible and attractive • Very low-cost hardware detectors already effective • 98% coverage, only 0.4% SDC for 7 of 8 structures • Next • Can we get even better coverage, especially SDC rate?

  11. SWAT Framework Components [Figure: execution timeline with labels Checkpoint, Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair] 1. Detectors w/ simple hardware [ASPLOS ’08] 2. Detectors w/ compiler support [DSN ’08a] 3. Trace-Based Fault Diagnosis [DSN ’08b] 4. Accurate Fault Models [HPCA ’09]

  12. Improving SWAT Detection Coverage Can we improve coverage, SDC rate further? • SDC faults primarily corrupt data values • Illegal control/address values caught by other symptoms • Need detectors to capture “semantic” information • Software-level invariants capture program semantics • Use when higher coverage desired • Sound program invariants → expensive static analysis • We use likely program invariants

  13. Likely Program Invariants • Likely program invariants • Hold on all observed inputs, expected to hold on others • But suffer from false positives • Use SWAT diagnosis to detect false positives on-line • iSWAT invariant detectors • Range-based value invariants [Sahoo et al. DSN ’08] • Check MIN ≤ value ≤ MAX on data values • Disable invariant when diagnosis reports a false positive
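The range-based invariant on this slide can be sketched as a small object that learns MIN and MAX during training, checks MIN ≤ value ≤ MAX afterwards, and can be switched off once diagnosis reports a false positive. This is a minimal illustration, not the iSWAT compiler instrumentation; the class and method names are hypothetical.

```python
from typing import Optional


class RangeInvariant:
    """Likely invariant of the form MIN <= value <= MAX for one program value."""

    def __init__(self) -> None:
        self.lo: Optional[int] = None
        self.hi: Optional[int] = None
        self.enabled = True

    def train(self, value: int) -> None:
        # Training phase: widen the range to cover every observed value.
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def violated(self, value: int) -> bool:
        # Detection phase: a disabled or untrained invariant never fires.
        if not self.enabled or self.lo is None or self.hi is None:
            return False
        return not (self.lo <= value <= self.hi)

    def disable(self) -> None:
        # Called when diagnosis decides a violation was a false positive.
        self.enabled = False


inv = RangeInvariant()
for observed in (3, 7, 5):
    inv.train(observed)
in_range = inv.violated(4)      # inside [3, 7]
out_of_range = inv.violated(9)  # outside [3, 7]
inv.disable()
after_disable = inv.violated(9)
```

Disabling rather than widening the range is the behavior the slide describes; widening would be a different design choice and is not shown here.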

  14. iSWAT Implementation [Figure: training phase. A compiler pass in LLVM adds invariant monitoring code to the application; runs on test, train, and external inputs produce ranges for input #1 through input #n, which feed the invariant ranges.]

  15. iSWAT Implementation [Figure: training phase and fault detection phase. Training: a compiler pass in LLVM adds invariant monitoring code to the application; runs on test, train, and external inputs produce ranges for input #1 through input #n, which feed the invariant ranges. Fault detection: a compiler pass in LLVM adds invariant checking code to the application, which runs on the ref input under full-system simulation with injected faults; an invariant violation goes to SWAT diagnosis, which reports either a fault detection or a false positive (disable invariant).]

  16. iSWAT Results • Evaluated iSWAT on 5 apps w/ previous methodology • Key results • Undetected faults reduced by 30% • SDCs reduced by 73% (33 to 9) • Runtime overhead • 5% on x86, 14% on UltraSPARC IIIi • Can be further reduced with optimized invariants • Exploring more sophistication to raise coverage, lower overhead

  17. SWAT Framework Components [Figure: execution timeline with labels Checkpoint, Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair] 1. Detectors w/ simple hardware [ASPLOS ’08] 2. Detectors w/ compiler support [DSN ’08a] 3. Trace-Based Fault Diagnosis [DSN ’08b] 4. Accurate Fault Models [HPCA ’09]

  18. Fault Diagnosis • Symptom-based detection is cheap but • High latency from fault activation to detection • Difficult to diagnose root cause of fault • How to diagnose SW bug vs. transient vs. permanent fault? • For permanent fault within core • Disable entire core? Wasteful! • Disable/reconfigure µarch-level unit? • How to diagnose faults to µarch unit granularity? • Key ideas • Single core fault model, multicore → fault-free core available • Checkpoint/replay for recovery → replay on good core, compare • Synthesizing DMR, but only for diagnosis

  19. SW Bug vs. Transient vs. Permanent • Rollback/replay on same/different core • Watch if symptom reappears [Figure: decision tree. Symptom detected: rollback on faulty core. No symptom on replay: transient h/w bug or non-deterministic s/w bug; continue execution. Symptom on replay: permanent h/w bug or deterministic s/w bug or false positive (iSWAT); rollback/replay on good core. No symptom on good core: permanent h/w fault, needs repair! Symptom on good core: false positive (iSWAT) or deterministic s/w bug (send to s/w layer).]
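The decision tree on this slide reduces to two replays and two yes/no observations. The sketch below encodes only that logic; the replay callbacks stand in for the real rollback/replay machinery, and all names are hypothetical.

```python
from enum import Enum
from typing import Callable


class Diagnosis(Enum):
    TRANSIENT_OR_NONDETERMINISTIC_SW = "continue execution"
    PERMANENT_HW_FAULT = "needs repair"
    SW_BUG_OR_FALSE_POSITIVE = "send to software layer"


def diagnose(
    symptom_on_faulty_core_replay: Callable[[], bool],
    symptom_on_good_core_replay: Callable[[], bool],
) -> Diagnosis:
    """Classify a detected symptom by replaying from the last checkpoint.

    Each callback performs one replay and returns True if the symptom reappears.
    """
    # Step 1: roll back and replay on the core that showed the symptom.
    if not symptom_on_faulty_core_replay():
        return Diagnosis.TRANSIENT_OR_NONDETERMINISTIC_SW
    # Step 2: the symptom is repeatable, so replay on a fault-free core.
    if not symptom_on_good_core_replay():
        return Diagnosis.PERMANENT_HW_FAULT
    # Symptom follows the software, not the core.
    return Diagnosis.SW_BUG_OR_FALSE_POSITIVE


permanent = diagnose(lambda: True, lambda: False)
transient = diagnose(lambda: False, lambda: False)
software = diagnose(lambda: True, lambda: True)
```

The second replay is only attempted when the first one reproduces the symptom, so the common transient case costs a single rollback.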

  20. Diagnosis Framework [Figure: Symptom detected → Diagnosis → software bug, transient fault, or permanent fault; permanent fault → microarchitecture-level granularity diagnosis → “Unit X is faulty”]

  21. Trace-Based Fault Diagnosis (TBFD) [Figure: permanent fault detected → invoke TBFD; faulty core execution and fault-free core execution are compared (=?) and fed to the diagnosis algorithm]

  22. Trace-Based Fault Diagnosis (TBFD) [Figure: permanent fault detected → invoke TBFD; rollback faulty core to checkpoint; replay execution, collect info; compare (=?) with fault-free core execution; diagnosis algorithm]

  23. Trace-Based Fault Diagnosis (TBFD) • What info to collect? • What info to compare? • What to do on divergence? [Figure: permanent fault detected → invoke TBFD; rollback faulty core to checkpoint and replay execution, collect info; load checkpoint on fault-free core for fault-free instruction exec; compare (=?); diagnosis algorithm]

  24. Trace-Based Fault Diagnosis (TBFD) [Figure: permanent fault detected → invoke TBFD; rollback faulty core to checkpoint and replay execution, collect µarch info (faulty trace); load checkpoint on fault-free core for fault-free instruction exec (test trace); compare (=?); on divergence, synch state] • Diagnosis algorithm: 1. Front-end 2. Meta-datapath 3. Datapath
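The comparison step (=?) between the faulty trace and the fault-free trace can be sketched as a search for the first point of divergence. This shows only that comparison, not the TBFD diagnosis algorithm itself; the trace entry layout `(program counter, result value)` and the function name are assumptions for the example.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical trace entry: (program counter, result value) per retired instruction.
TraceEntry = Tuple[int, int]


def first_divergence(
    faulty_trace: Sequence[TraceEntry],
    fault_free_trace: Sequence[TraceEntry],
) -> Optional[int]:
    """Return the index of the first instruction at which the two traces
    differ, or None if they match completely."""
    for index, (faulty, good) in enumerate(zip(faulty_trace, fault_free_trace)):
        if faulty != good:
            return index
    # One trace being longer than the other also counts as a divergence.
    if len(faulty_trace) != len(fault_free_trace):
        return min(len(faulty_trace), len(fault_free_trace))
    return None


good = [(0x100, 5), (0x104, 9), (0x108, 14)]
bad = [(0x100, 5), (0x104, 8), (0x108, 13)]
diverged_at = first_divergence(bad, good)
```

In the slide's flow, state would be synchronized after a divergence so that later mismatches are not just consequences of the first one; that step is omitted here.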

  25. Diagnosis Results • 98% of detected faults are diagnosed • 89% diagnosed to unique unit/array entry • Meta-datapath faults in out-of-order execution mislead TBFD

  26. SWAT Framework Components [Figure: execution timeline with labels Checkpoint, Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair] 1. Detectors w/ simple hardware [ASPLOS ’08] 2. Detectors w/ compiler support [DSN ’08a] 3. Trace-Based Fault Diagnosis [DSN ’08b] 4. Accurate Fault Models [HPCA ’09]

  27. SwatSim: Fast and Accurate Fault Modeling • Need accurate µarch-level fault models • Gate level injections accurate but too slow • µarch (latch) level injections fast but inaccurate • Can we achieve µarch-level speed at gate-level accuracy? • SwatSim – Hierarchical (mixed mode) simulation • Simulate mostly at µarch level • Simulate only faulty component at gate-level, on-demand • Invoke gate-level simulation online for permanent faults • Simulating fault effect with real-world vectors • Used OpenSPARC RTL models

  28. SWAT-Sim: Gate-level Accuracy at µarch Speeds [Figure: flow. µarch-level simulation of r3 ← r1 op r2; faulty unit used? No: continue µarch simulation. Yes: pass input stimuli to gate-level fault simulation, take output response r3 (fault propagated to output), then continue µarch simulation.]
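The flow on this slide amounts to a per-instruction dispatch: use the fast µarch model unless the instruction exercises the faulty unit, in which case the detailed model supplies the result. The sketch below illustrates that dispatch with toy stand-ins; it does not call NCVerilog or any real simulator, and every name, the 8-bit width, and the stuck-at-1 output bit are assumptions for the example.

```python
from typing import Callable

WIDTH_MASK = 0xFF  # toy 8-bit datapath, chosen only for the example

Model = Callable[[str, int, int], int]


def uarch_model(op: str, a: int, b: int) -> int:
    """Fast functional model of a unit (fault-free behavior)."""
    if op == "add":
        return (a + b) & WIDTH_MASK
    if op == "and":
        return a & b & WIDTH_MASK
    raise ValueError(f"unsupported op: {op}")


def gate_level_model_with_fault(op: str, a: int, b: int) -> int:
    """Stand-in for gate-level fault simulation: output bit 0 stuck at 1."""
    return uarch_model(op, a, b) | 0x01


def simulate_instruction(
    unit: str, op: str, a: int, b: int,
    faulty_unit: str, fast: Model, detailed: Model,
) -> int:
    """Compute r3 = r1 op r2, invoking the detailed model only on demand."""
    if unit == faulty_unit:
        # Faulty unit used: feed the inputs to the detailed model and
        # take its output response as the architectural result.
        return detailed(op, a, b)
    return fast(op, a, b)


through_faulty_alu = simulate_instruction(
    "alu", "add", 2, 2, "alu", uarch_model, gate_level_model_with_fault)
through_other_unit = simulate_instruction(
    "agen", "add", 2, 2, "alu", uarch_model, gate_level_model_with_fault)
```

Because the detailed model runs only for instructions that touch the faulty unit, most of the simulation proceeds at the speed of the fast model, which is the trade-off the slide title describes.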

  29. Results from SwatSim • SwatSim implemented within full-system simulation • GEMS+Simics for µarch simulation • NCVerilog + VPI for gate-level ALU, AGEN from OpenSPARC models • Performance overhead • 100,000X faster than gate level full processor simulation • 2X slowdown over µarch level simulation • Accuracy of µarch fault models using SWAT coverage/latency • Compared µarch stuck-at with SwatSim stuck-at, delay • µarch fault models generally inaccurate • Accuracy varies depending on structure, fault model • Differences in activation rate, multi-bit flips • Unsuccessful attempts to derive more accurate µarch fault models Need SwatSim, at least for now

  30. Summary – SWAT Works! [Figure: execution timeline with labels Checkpoint, Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair] 1. Detectors w/ simple hardware [ASPLOS ’08] 2. Detectors w/ compiler support [DSN ’08a] 3. Trace-Based Fault Diagnosis [DSN ’08b] 4. Accurate Fault Models [HPCA ’09]

  31. Summary: SWAT Advantages • Handles all faults that matter • Oblivious to low-level failure modes & masked faults • Low, amortized overheads • Optimize for common case, exploit s/w reliability solutions • Holistic systems view enables novel, synergistic solutions • Invariant detectors use diagnosis mechanisms • Diagnosis uses recovery mechanisms • Customizable and flexible • Firmware control can adapt to specific reliability needs • E.g., hybrid, app-specific recovery (TBD) • Beyond hardware reliability • SWAT treats hardware faults as software bugs • Long-term goal: unified system (hw + sw) reliability at lowest cost • Potential applications to post-silicon test and debug

  32. Ongoing and Future Work • Complete SWAT system implementation • Recovery and firmware control w/ OpenSPARC hypervisor/OS • Multithreaded software on multicore: Initial results promising • More aggressive detection • More aggressive software reliability techniques • H/W assertions to complement software (w/ Shobha Vasudevan) • Modeling • Comprehensive SWATSim w/ OpenSPARC RTL for more h/w modules • Off-core faults • Validation on FPGA (w/ Michigan) using Leon-based system • Would be nice to have state-of-the-art multicore SPARC system • Post-silicon debug and test • Engagements with Sun • Student summer intern w/ Dr. Ishwar Parulkar, teleconferences, visits
