SWAT: Designing Resilient Hardware by Treating Software Anomalies
Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet Ulya Karpuzcu, Sarita Adve, Vikram Adve, Yuanyuan Zhou
Department of Computer Science, University of Illinois at Urbana-Champaign
swat@cs.uiuc.edu
Motivation
• Hardware failures will happen in the field
  • Aging, soft errors, inadequate burn-in, design defects, …
  ⇒ Need in-field detection, diagnosis, recovery, repair
• Reliability problem pervasive across many markets
• Traditional redundancy (e.g., nMR) too expensive
• Piecemeal solutions for specific fault models too expensive
• Must incur low area, performance, power overhead
Today: a low-cost solution for multiple failure sources
Observations
• Need to handle only hardware faults that propagate to software
• Fault-free case remains common, must be optimized
⇒ Watch for software anomalies (symptoms)
  • Hardware fault detection ~ software bug detection
  • Zero to low overhead "always-on" monitors
⇒ Diagnose cause after symptom detected
  • May incur high overhead, but rarely invoked
SWAT: SoftWare Anomaly Treatment
SWAT Framework Components
[Timeline: Checkpoint … Checkpoint → Fault → Error → Symptom detected → Recovery → Diagnosis → Repair]
• Detection: symptoms of software misbehavior, minimal backup hardware
• Recovery: hardware/software checkpoint and rollback
• Diagnosis: rollback/replay on multicore
• Repair/reconfiguration: redundant, reconfigurable hardware
• Flexible control through firmware
SWAT Framework Components
1. Detectors w/ simple hardware [ASPLOS '08]
2. Detectors w/ compiler support [DSN '08a]
3. Trace-Based Fault Diagnosis [DSN '08b]
4. Accurate Fault Models [HPCA '09]
Simple Hardware-only Symptom-based Detection
• Observe anomalous symptoms for fault detection
  • Incur low overheads for "always-on" detectors
  • Minimal support from hardware, no software support
• Fatal traps generated by hardware
  • Division by zero, RED state, etc.
• Hangs detected using a simple hardware hang detector
• High OS activity detected with a performance counter
  • Typical OS invocations take 10s or 100s of instructions
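The hang detector's heuristic can be sketched in software. This is a hypothetical illustration (the function name and the `window` and `threshold` parameters are invented, not from the talk), assuming a hang manifests as a tiny set of branch targets repeating for a long stretch:

```python
from collections import deque

def detect_hang(branch_pcs, window=8, threshold=100):
    """Flag a hang when the recent branch-target working set stays tiny
    for `threshold` consecutive branches (hypothetical parameters)."""
    recent = deque(maxlen=window)   # sliding window of branch PCs
    stuck = 0
    for pc in branch_pcs:
        recent.append(pc)
        if len(set(recent)) <= 2:   # execution spinning in a tight loop
            stuck += 1
            if stuck >= threshold:
                return True
        else:
            stuck = 0               # working set grew: normal execution
    return False
```

A real implementation is a small hardware structure watching retired branches, but the thresholding idea is the same.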
Experimental Methodology
[Diagram: functional simulation, then fault injected and timing simulation for 10M instr; if no symptom in 10M instr, run to completion → app masked, symptom after 10M instr, or silent data corruption (SDC)]
• Microarchitecture-level fault injection
  • GEMS OoO timing models + Simics full-system simulation
  • SPEC apps on OpenSolaris, UltraSPARC III ISA
• Fault model
  • Stuck-at, bridging faults in latches of 8 arch structures
  • 12,800 faults, <0.3% error @ 95% confidence
  • Also studied transients, but this talk focuses on permanents
• Simulate impact of fault in detail for 10M instructions
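The two fault models can be illustrated with a short sketch; `inject_stuck_at` and `inject_bridging` are hypothetical helpers, and the wired-AND bridging model below is one common textbook choice, not necessarily the exact model the study used:

```python
def inject_stuck_at(latch_value, bit, stuck_high):
    """Stuck-at fault: force one bit of a latch value to 0 or 1."""
    mask = 1 << bit
    return latch_value | mask if stuck_high else latch_value & ~mask

def inject_bridging(latch_value, bit_a, bit_b):
    """Bridging fault sketch: two latch bits short together;
    modeled here as wired-AND of the bridged bits."""
    a = (latch_value >> bit_a) & 1
    b = (latch_value >> bit_b) & 1
    v = a & b                      # both bits take the AND of the pair
    out = (latch_value & ~(1 << bit_a)) | (v << bit_a)
    out = (out & ~(1 << bit_b)) | (v << bit_b)
    return out
```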
Efficacy of Simple HW-Only Detectors – Coverage (Permanent Faults)
• 98% of unmasked faults detected in 10M instr (w/o FPU)
• 0.4% of injected faults result in SDC (w/o FPU)
• Need hardware support or other monitors for FPU
Latency to Detection from Software Corruption
• 88% detected within 100K instructions, rest within 10M instr
• Can use hardware recovery methods – SafetyNet, ReVive
Conclusions So Far
• SWAT approach feasible and attractive
• Very low-cost hardware detectors already effective
  • 98% coverage, only 0.4% SDC for 7 of 8 structures
• Next: can we get even better coverage, especially the SDC rate?
SWAT Framework Components
1. Detectors w/ simple hardware [ASPLOS '08]
2. Detectors w/ compiler support [DSN '08a]
3. Trace-Based Fault Diagnosis [DSN '08b]
4. Accurate Fault Models [HPCA '09]
Improving SWAT Detection Coverage
• Can we improve coverage and the SDC rate further?
• SDC faults primarily corrupt data values
  • Illegal control/address values caught by other symptoms
• Need detectors to capture "semantic" information
  • Software-level invariants capture program semantics
  • Use when higher coverage desired
• Sound program invariants require expensive static analysis
  ⇒ We use likely program invariants
Likely Program Invariants
• Likely program invariants
  • Hold on all observed inputs, expected to hold on others
  • But suffer from false positives
  • Use SWAT diagnosis to detect false positives on-line
• iSWAT invariant detectors
  • Range-based value invariants [Sahoo et al., DSN '08]
  • Check MIN ≤ value ≤ MAX on data values
  • Disable an invariant when diagnosis identifies a false positive
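A range-based value invariant amounts to a tiny train/check state machine. This sketch (class and method names are invented for illustration) mirrors the MIN ≤ value ≤ MAX check and the disable-on-false-positive step:

```python
class RangeInvariant:
    """Per-program-point likely invariant: MIN <= value <= MAX.
    Trained on observed runs; a runtime violation is a symptom."""

    def __init__(self):
        self.lo = None
        self.hi = None
        self.enabled = True

    def train(self, value):
        # Training phase: widen the observed range.
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def check(self, value):
        # Detection phase: False means a violation (possible fault).
        if not self.enabled:
            return True
        return self.lo <= value <= self.hi

    def disable(self):
        # Diagnosis identified a false positive: stop checking.
        self.enabled = False
```

In iSWAT this logic is emitted inline by an LLVM compiler pass rather than called as a class, but the semantics are the same.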
iSWAT Implementation
[Diagram, training phase: a compiler pass in LLVM adds invariant monitoring code to the application; runs over test, train, and external inputs produce per-input value ranges, which are merged into the invariant ranges.]
[Diagram, fault detection phase: a compiler pass in LLVM adds invariant checking code; the application runs on the ref input under full-system simulation with injected faults; an invariant violation triggers SWAT diagnosis, which either confirms fault detection or disables the invariant as a false positive.]
iSWAT Results
• Evaluated iSWAT on 5 apps with the previous methodology
• Key results
  • Undetected faults reduced by 30%
  • SDCs reduced by 73% (33 to 9)
• Runtime overhead
  • 5% on x86, 14% on UltraSPARC IIIi
  • Can be further reduced with optimized invariants
• Exploring more sophisticated invariants to improve coverage and overhead
SWAT Framework Components
1. Detectors w/ simple hardware [ASPLOS '08]
2. Detectors w/ compiler support [DSN '08a]
3. Trace-Based Fault Diagnosis [DSN '08b]
4. Accurate Fault Models [HPCA '09]
Fault Diagnosis
• Symptom-based detection is cheap, but
  • High latency from fault activation to detection
  • Difficult to diagnose root cause of fault
  • How to diagnose s/w bug vs. transient vs. permanent fault?
• For a permanent fault within a core
  • Disable entire core? Wasteful!
  • Disable/reconfigure a µarch-level unit?
  • How to diagnose faults at µarch-unit granularity?
• Key ideas
  • Single-core fault model ⇒ on a multicore, a fault-free core is available
  • Checkpoint/replay for recovery ⇒ replay on good core, compare
  • Synthesizing DMR, but only for diagnosis
SW Bug vs. Transient vs. Permanent
• Rollback/replay on same/different core; watch whether the symptom reappears
• Symptom detected ⇒ rollback on faulty core
  • No symptom ⇒ transient h/w fault or non-deterministic s/w bug ⇒ continue execution
  • Symptom ⇒ permanent h/w fault, deterministic s/w bug, or false positive (iSWAT) ⇒ rollback/replay on good core
    • No symptom ⇒ permanent h/w fault, needs repair!
    • Symptom ⇒ false positive (iSWAT) or deterministic s/w bug (send to s/w layer)
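The decision tree above can be written down directly. In this sketch the two replay callbacks are hypothetical stand-ins that report whether the symptom recurred on the faulty and the good core, respectively:

```python
def diagnose(symptom_on_faulty_replay, symptom_on_good_replay):
    """Classify a detected symptom from two rollback/replay outcomes.
    Each argument is a callable returning True if the symptom recurred."""
    if not symptom_on_faulty_replay():
        # Did not recur even on the faulty core.
        return "transient h/w fault or non-deterministic s/w bug"
    if not symptom_on_good_replay():
        # Recurred on the faulty core but not on a good one.
        return "permanent h/w fault: repair"
    # Recurred on both cores: the hardware is not the culprit.
    return "deterministic s/w bug or iSWAT false positive: to s/w layer"
```

For example, `diagnose(lambda: True, lambda: False)` classifies the fault as a permanent hardware fault needing repair.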
Diagnosis Framework
[Diagram: symptom detected → diagnosis → software bug | transient fault | permanent fault → µarch-level granularity diagnosis → "unit X is faulty"]
Trace-Based Fault Diagnosis (TBFD)
[Diagram: permanent fault detected → invoke TBFD; rollback the faulty core to a checkpoint and replay execution, collecting µarch info; load the checkpoint on a fault-free core and execute instructions fault-free; synchronize state and compare the faulty trace against the test trace, looking for divergence]
• Key questions: what info to collect? what info to compare? what to do on divergence?
• Diagnosis algorithm checks, in order: 1. front-end, 2. meta-datapath, 3. datapath
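A minimal sketch of the trace comparison, assuming a simplified trace of (pc, dest-reg, value) tuples; the mapping from the diverging field to front-end/meta-datapath/datapath is an illustration of the idea, not the exact TBFD algorithm:

```python
def tbfd_compare(faulty_trace, good_trace):
    """Walk the faulty and fault-free traces in lockstep and report the
    first divergence; the diverging field hints at the faulty unit.
    Each trace entry is a (pc, dest_reg, value) tuple (assumed format)."""
    for i, (f, g) in enumerate(zip(faulty_trace, good_trace)):
        if f != g:
            f_pc, f_reg, f_val = f
            g_pc, g_reg, g_val = g
            if f_pc != g_pc:
                return (i, "front-end")      # wrong instruction fetched
            if f_reg != g_reg:
                return (i, "meta-datapath")  # wrong register used/renamed
            return (i, "datapath")           # wrong value computed
    return None                              # no divergence observed
```

The real algorithm also resynchronizes state after each divergence so that one fault does not cascade into the rest of the comparison.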
Diagnosis Results
• 98% of detected faults are diagnosed
• 89% diagnosed to a unique unit/array entry
• Meta-datapath faults in out-of-order execution mislead TBFD
SWAT Framework Components
1. Detectors w/ simple hardware [ASPLOS '08]
2. Detectors w/ compiler support [DSN '08a]
3. Trace-Based Fault Diagnosis [DSN '08b]
4. Accurate Fault Models [HPCA '09]
SWAT-Sim: Fast and Accurate Fault Modeling
• Need accurate µarch-level fault models
  • Gate-level injections accurate but too slow
  • µarch (latch) level injections fast but inaccurate
• Can we achieve µarch-level speed at gate-level accuracy?
• SWAT-Sim – hierarchical (mixed-mode) simulation
  • Simulate mostly at the µarch level
  • Simulate only the faulty component at the gate level, on demand
  • Invoke gate-level simulation online for permanent faults
  • Simulate fault effect with real-world vectors
  • Used OpenSPARC RTL models
SWAT-Sim: Gate-Level Accuracy at µarch Speeds
[Diagram: during µarch-level simulation of r3 ← r1 op r2, ask "faulty unit used?"; if yes, pass the input stimuli to gate-level fault simulation, capture the output response r3, propagate the fault to the output, and continue µarch simulation]
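The mixed-mode dispatch can be sketched as follows, with a toy stuck-at-1 adder standing in for the gate-level model; all names, the 32-bit width, and the choice of faulty bit are assumptions for illustration:

```python
def gate_level_add_stuck_at(a, b, bit, stuck):
    """Toy gate-level model of an adder whose output bit `bit`
    is stuck at `stuck` (hypothetical fault)."""
    result = (a + b) & 0xFFFFFFFF
    if stuck:
        result |= (1 << bit)       # force the stuck bit high
    else:
        result &= ~(1 << bit)      # force the stuck bit low
    return result

def swat_sim_add(a, b, faulty=False):
    """Mixed-mode step: fast µarch-level add in the common case,
    dropping to the gate-level faulty model only when the
    faulty unit is actually exercised."""
    if faulty:
        return gate_level_add_stuck_at(a, b, bit=0, stuck=1)
    return (a + b) & 0xFFFFFFFF    # fault-free common case stays fast
```

The speed win comes from the common case: the expensive gate-level path runs only for operations that touch the faulty unit.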
Results from SWAT-Sim
• SWAT-Sim implemented within full-system simulation
  • GEMS + Simics for µarch simulation
  • NC-Verilog + VPI for gate-level ALU, AGEN from OpenSPARC models
• Performance overhead
  • 100,000X faster than gate-level full-processor simulation
  • 2X slowdown over µarch-level simulation
• Accuracy of µarch fault models, measured via SWAT coverage/latency
  • Compared µarch stuck-at with SWAT-Sim stuck-at and delay faults
  • µarch fault models generally inaccurate
  • Accuracy varies depending on structure, fault model
  • Differences in activation rate, multi-bit flips
• Unsuccessful attempts to derive more accurate µarch fault models
⇒ Need SWAT-Sim, at least for now
Summary – SWAT Works!
1. Detectors w/ simple hardware [ASPLOS '08]
2. Detectors w/ compiler support [DSN '08a]
3. Trace-Based Fault Diagnosis [DSN '08b]
4. Accurate Fault Models [HPCA '09]
Summary: SWAT Advantages
• Handles all faults that matter
  • Oblivious to low-level failure modes and masked faults
• Low, amortized overheads
  • Optimize for the common case, exploit s/w reliability solutions
• Holistic systems view enables novel, synergistic solutions
  • Invariant detectors use diagnosis mechanisms
  • Diagnosis uses recovery mechanisms
• Customizable and flexible
  • Firmware control can adapt to specific reliability needs
  • E.g., hybrid, app-specific recovery (TBD)
• Beyond hardware reliability
  • SWAT treats hardware faults as software bugs
  • Long-term goal: unified system (h/w + s/w) reliability at lowest cost
  • Potential applications to post-silicon test and debug
Ongoing and Future Work
• Complete SWAT system implementation
  • Recovery and firmware control w/ OpenSPARC hypervisor/OS
  • Multithreaded software on multicore: initial results promising
• More aggressive detection
  • More aggressive software reliability techniques
  • H/W assertions to complement software (w/ Shobha Vasudevan)
• Modeling
  • Comprehensive SWAT-Sim w/ OpenSPARC RTL for more h/w modules
  • Off-core faults
• Validation on FPGA (w/ Michigan) using a Leon-based system
  • Would be nice to have a state-of-the-art multicore SPARC system
• Post-silicon debug and test
  • Engagements with Sun
  • Student summer intern w/ Dr. Ishwar Parulkar, teleconferences, visits