SWAT: Designing Resilient Hardware by Treating Software Anomalies

Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet Ulya Karpuzcu, Sarita Adve, Vikram Adve, Yuanyuan Zhou • Department of Computer Science, University of Illinois at Urbana-Champaign • swat@cs.uiuc.edu

Motivation • Hardware failures will happen in the field • Aging, soft errors, inadequate burn-in, design defects, … Need in-field detection, diagnosis, recovery, repair • Reliability problem pervasive across many markets • Traditional redundancy (e.g., nMR) too expensive • Piecemeal solutions for specific fault model too expensive • Must incur low area, performance, power overhead Today: low-cost solution for multiple failure sources

Observations • Need to handle only hardware faults that propagate to software • Fault-free case remains common, must be optimized • Watch for software anomalies (symptoms) • Hardware fault detection ~ Software bug detection • Zero to low overhead “always-on” monitors • Diagnose cause after symptom detected • May incur high overhead, but rarely invoked • SWAT: SoftWare Anomaly Treatment

SWAT Framework Components • Detection: Symptoms of S/W misbehavior, minimal backup H/W • Recovery: Hardware/Software checkpoint and rollback • Diagnosis: Rollback/replay on multicore • Repair/reconfiguration: Redundant, reconfigurable hardware • Flexible control through firmware [Figure: timeline of checkpoint, checkpoint, fault, error, symptom detected, followed by recovery, diagnosis, repair]

SWAT • 1. Detectors w/ Hardware support [ASPLOS ‘08] • 2. Detectors w/ Software support [Sahoo et al., DSN ‘08] • 3. Trace Based Fault Diagnosis [Li et al., DSN ‘08] • 4. Accurate Fault Modeling [Figure: timeline of checkpoint, checkpoint, fault, error, symptom detected, followed by recovery, diagnosis, repair]

Hardware-Only Symptom-based detection • Observe anomalous symptoms for fault detection • Incur low overheads for “always-on” detectors • Minimal support from hardware • Fatal traps generated by hardware • Division by Zero, RED State, etc. • Hangs detected using simple hardware hang detector • High OS activity detected with performance counter • Typical OS invocations take 10s or 100s of instructions
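
The high-OS-activity symptom can be illustrated with a minimal sketch. It assumes a per-instruction stream of privilege modes and a hypothetical threshold; the slides state only that typical OS invocations take tens to hundreds of instructions, so the default value below is an illustration, not the authors' setting.

```python
from typing import Iterable, Optional

# Hypothetical threshold (not from the slides): far above the tens to
# hundreds of instructions that a typical OS invocation is said to take.
OS_ACTIVITY_THRESHOLD = 20_000


def first_high_os_symptom(modes: Iterable[str],
                          threshold: int = OS_ACTIVITY_THRESHOLD) -> Optional[int]:
    """Return the index at which a run of contiguous OS-mode instructions
    first exceeds `threshold`, or None if no such run occurs.

    `modes` yields "os" or "app" for each retired instruction.
    """
    run = 0  # length of the current contiguous run of OS instructions
    for i, mode in enumerate(modes):
        run = run + 1 if mode == "os" else 0
        if run > threshold:
            return i
    return None
```

In hardware this would be a counter that resets on return to application mode, which is why the monitor can stay always on at near-zero cost.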

Experimental Methodology • Microarchitecture-level fault injection • GEMS timing models + Simics full-system simulation • SPEC workloads on Solaris-9 OS • Permanent fault models • Stuck-at, bridging faults in latches of 8 arch structures • 12,800 faults, <0.3% error @ 95% confidence • Simulate impact of fault in detail for 10M instructions [Figure: fault injected, then 10M instr of timing simulation; if no symptom in 10M instr, run to completion in functional simulation; outcomes: app masked, symptom > 10M, or silent data corruption (SDC)]

Efficacy of Hardware-only Detectors • Coverage: Percentage of unmasked faults detected • 98% of faults detected, 0.4% give SDC (w/o FPU) • Additional support required for FPU-like units • 66% of detected faults corrupt OS state, need recovery • Despite low OS activity in fault-free execution • Latency: Number of instr between activation and detection • HW recovery for up to 100k instr, SW for longer latencies • App in 87% of detections recoverable using HW • OS recoverable in virtually all detections using HW • OS recovery using SW hard

Improving SWAT Detection Coverage • Can we improve coverage, SDC rate further? • SDC faults primarily corrupt data values • Illegal control/address values caught by other symptoms • Need detectors to capture “semantic” information • Software-level invariants capture program semantics • Use when higher coverage desired • Sound program invariants → expensive static analysis • We use likely program invariants

Likely Program Invariants • Likely program invariants • Hold on all observed inputs, expected to hold on others • But suffer from false positives • Use SWAT diagnosis to detect false positives on-line • iSWAT - Compiler-assisted symptom detectors • Range-based value invariants [Sahoo et al., DSN ‘08] • Check MIN ≤ value ≤ MAX on data values • Disable invariant when diagnosed as false positive
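
A range-based likely invariant can be sketched as follows. This is an illustration only: iSWAT inserts its checks through an LLVM compiler pass, whereas the class, method names, and the string site identifiers here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple


@dataclass
class RangeInvariants:
    """Per-site [MIN, MAX] ranges learned from training inputs (hypothetical API)."""
    ranges: Dict[str, Tuple[int, int]] = field(default_factory=dict)
    disabled: Set[str] = field(default_factory=set)

    def train(self, site: str, values: Iterable[int]) -> None:
        """Training phase: widen the range for `site` to cover observed values."""
        for v in values:
            lo, hi = self.ranges.get(site, (v, v))
            self.ranges[site] = (min(lo, v), max(hi, v))

    def check(self, site: str, value: int) -> bool:
        """Detection phase: True if MIN <= value <= MAX holds for `site`.
        Sites with no trained range or a disabled invariant always pass."""
        if site in self.disabled or site not in self.ranges:
            return True
        lo, hi = self.ranges[site]
        return lo <= value <= hi

    def disable(self, site: str) -> None:
        """Called once diagnosis classifies a violation as a false positive."""
        self.disabled.add(site)
```

A violation reported by `check` is only a symptom; it is the diagnosis step that decides whether it indicates a fault or a false positive to be disabled.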

iSWAT implementation: Training Phase [Figure: a compiler pass in LLVM adds invariant monitoring code to the application; runs on test, train, and external inputs produce ranges for i/p #1 through i/p #n, which are combined into the invariant ranges]

iSWAT implementation: Training Phase and Fault Detection Phase [Figure: training phase as before (compiler pass in LLVM, invariant monitoring code, test/train/external inputs, ranges for i/p #1 through i/p #n, invariant ranges); in the fault detection phase a compiler pass in LLVM adds invariant checking code to the application, which runs on the ref input in full-system simulation with injected faults; an invariant violation goes to SWAT diagnosis, which reports either fault detection or a false positive (disable invariant)]

iSWAT Results • Explored SWAT with 5 apps on previous methodology • Undetected faults reduced by 30% • Invariants reduce SDCs by 73% (33 to 9) • Overheads: 5% on x86, 14% on UltraSparc IIIi • Reasonably low overheads on some machines • Un-optimized invariants used, can be further reduced • Exploring more sophistication for coverage, overheads

Fault Diagnosis • Symptom-based detection is cheap but • High latency from fault activation to detection • Difficult to diagnose root cause of fault • How to diagnose SW bug vs. transient vs. permanent fault? • For permanent fault within core • Disable entire core? Wasteful! • Disable/reconfigure µarch-level unit? • How to diagnose faults to µarch unit granularity? • Key ideas • Single core fault model, multicore → fault-free core available • Checkpoint/replay for recovery → replay on good core, compare • Synthesizing DMR, but only for diagnosis

SW Bug vs. Transient vs. Permanent • Rollback/replay on same/different core • Watch if symptom reappears [Figure: decision flow. Symptom detected → rollback/replay on faulty core. No symptom: transient or non-deterministic s/w bug, continue execution. Symptom: false positive (iSWAT), deterministic s/w bug, or permanent h/w fault → rollback/replay on good core. No symptom on good core: permanent h/w fault, needs repair! Symptom on good core: false positive (iSWAT) or deterministic s/w bug, send to s/w layer]
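
The rollback/replay decision flow can be written down as a short sketch. The two callbacks are hypothetical stand-ins for rolling back to the checkpoint and replaying on the given core; each returns True if the symptom reappears.

```python
from enum import Enum
from typing import Callable


class Diagnosis(Enum):
    TRANSIENT_OR_NONDETERMINISTIC_SW = "continue execution"
    PERMANENT_HW_FAULT = "needs repair"
    DETERMINISTIC_SW_OR_FALSE_POSITIVE = "send to s/w layer"


def diagnose(symptom_on_faulty_core: Callable[[], bool],
             symptom_on_good_core: Callable[[], bool]) -> Diagnosis:
    """Classify a detected symptom by replaying from the checkpoint.

    Each callback (hypothetical) performs one rollback/replay and reports
    whether the symptom was observed again.
    """
    # Step 1: replay on the original core. If the symptom is gone, the cause
    # was a transient fault or a non-deterministic software bug.
    if not symptom_on_faulty_core():
        return Diagnosis.TRANSIENT_OR_NONDETERMINISTIC_SW
    # Step 2: replay on a core assumed fault-free. If the symptom disappears
    # there, the original core has a permanent hardware fault.
    if not symptom_on_good_core():
        return Diagnosis.PERMANENT_HW_FAULT
    # The symptom follows the software: deterministic bug or iSWAT false positive.
    return Diagnosis.DETERMINISTIC_SW_OR_FALSE_POSITIVE
```

The second replay is only attempted when the first one reproduces the symptom, so the common transient case costs a single rollback.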

Diagnosis Framework [Figure: symptom detected → diagnosis → one of software bug, transient fault, or permanent fault; a permanent fault goes to microarchitecture-level diagnosis, which reports “Unit X is faulty”]

Trace-Based Fault Diagnosis (TBFD) • What info to collect? • What info to compare? • What to do on divergence? [Figure: permanent fault detected → invoke TBFD → rollback faulty core to checkpoint and replay execution, collecting info; load checkpoint on fault-free core for fault-free instruction exec; the two executions are compared (=?) and the result feeds the diagnosis algorithm]
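
The comparison step (=?) can be sketched as below. The trace record layout is an assumption made for illustration; the slides pose "what info to collect" as an open design question and do not fix a format.

```python
from typing import List, NamedTuple, Sequence


class TraceEntry(NamedTuple):
    """One retired instruction (hypothetical record layout)."""
    pc: int      # program counter of the instruction
    dest: str    # destination architectural register
    value: int   # value written to the destination


def divergent_indices(faulty: Sequence[TraceEntry],
                      good: Sequence[TraceEntry]) -> List[int]:
    """Positions where the faulty-core replay differs from the fault-free replay.

    Entries are compared position by position over the common prefix length,
    which is only meaningful while both replays follow the same instruction
    stream.
    """
    return [i for i, (f, g) in enumerate(zip(faulty, good)) if f != g]
```

Each divergent position names an instruction whose hardware usage becomes a clue for the diagnosis algorithm.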

Can a Divergent Instruction Lead to Diagnosis? • Simpler case: ALU fault • Both divergent instructions used same ALU → ALU1 faulty [Figure: table listing, for add r1,r3,r5 and sub r6,r1,r2, the HW used (dec, alu, dst preg) alongside the fault-free and faulty results]
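
The reasoning in the simple ALU case amounts to intersecting the hardware used by the divergent instructions. A minimal sketch follows; the unit names and instance numbers are illustrative and are not read off the slide's table.

```python
from typing import Dict, List, Set, Tuple


def suspect_units(divergent: List[Dict[str, int]]) -> Set[Tuple[str, int]]:
    """Return the (unit type, instance) pairs used by every divergent instruction.

    Each dict maps a unit type such as "dec" or "alu" to the instance the
    instruction used (hypothetical encoding).
    """
    if not divergent:
        return set()
    common = set(divergent[0].items())
    for used in divergent[1:]:
        common &= set(used.items())
    return common


# Illustrative values: both instructions went through ALU 1 but used
# different decoders and destination physical registers.
add_hw = {"dec": 0, "alu": 1, "dst_preg": 5}
sub_hw = {"dec": 2, "alu": 1, "dst_preg": 7}
print(suspect_units([add_hw, sub_hw]))  # {('alu', 1)}
```

This shortcut works only when the divergent instructions themselves touch the faulty unit, which is not the case for the meta-datapath example that follows in the deck.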

Can a Divergent Instruction Lead to Diagnosis? • Complex example: Fault in register alias table (RAT) entry • Divergent instructions do not directly lead to faulty unit • Instead, look backward/forward in instruction stream • Need to collect and analyze instruction trace [Figure: RAT (logical → physical) and register file (physical → value) contents while executing IA: r3 ← r2 + r2 and IB: r1 ← r5 * r2; fault-free r1 = 12, but the faulty execution diverges at IB even though IB does not use faulty HW]

Diagnosing Permanent Fault to µarch Granularity • Trace-based fault diagnosis (TBFD) • Compare instruction trace of faulty vs. good execution • Divergence → faulty hardware used → diagnosis clues • Diagnose faults to µarch units of processor • Check µarch-level invariants in several parts of processor • Front end, Meta-datapath, datapath faults • Diagnosis in out-of-order logic (meta-datapath) complex • Results • 98% of the faults detected by SWAT successfully diagnosed • TBFD flexible for other detectors/granularity of repair

SWAT • 1. Detectors w/ Hardware support [ASPLOS ‘08] • 2. Detectors w/ Software support [Sahoo et al., DSN ‘08] • 3. Trace Based Fault Diagnosis [Li et al., DSN ‘08] • 4. Accurate Fault Modeling [Figure: timeline of checkpoint, checkpoint, fault, error, symptom detected, followed by recovery, diagnosis, repair]

SWAT-Sim: Fast and Accurate Fault Models • Need accurate µarch-level fault models • Gate-level injections accurate but too slow • µarch (latch) level injections fast but inaccurate • Can we achieve µarch-level speed at gate-level accuracy? • Mixed-mode (hierarchical) Simulation • µarch-level + Gate-level simulation • Simulate only faulty component at gate-level, on-demand • Invoke gate-level sim online for permanent faults • Simulating fault effect with real-world vectors

SWAT-Sim: Gate-level Accuracy at µarch Speeds [Figure: µarch-level simulation of r3 ← r1 op r2. Faulty unit used? No: continue µarch simulation. Yes: send input stimuli to gate-level fault simulation; the output response r3 carries the fault propagated to the output; continue µarch simulation]
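
The on-demand hand-off between the two simulation levels can be sketched as follows. This is a toy model under stated assumptions: the "gate-level" side is a plain function that forces one output bit of an adder to a stuck value, standing in for the NCVerilog simulation of the real netlist, and all names are hypothetical.

```python
from typing import Callable, List, Tuple

MASK = 0xFFFFFFFF  # 32-bit datapath assumed for illustration


def uarch_add(a: int, b: int) -> int:
    """Fast functional model used whenever the faulty unit is not involved."""
    return (a + b) & MASK


def make_stuck_at_adder(bit: int, stuck_value: int) -> Callable[[int, int], int]:
    """Stand-in for gate-level simulation of the faulty unit: the given
    output bit is stuck at 0 or 1. A real gate-level model would inject the
    fault on an internal net and let it propagate (or be masked)."""
    def faulty_add(a: int, b: int) -> int:
        result = (a + b) & MASK
        if stuck_value:
            return result | (1 << bit)
        return result & ~(1 << bit) & MASK
    return faulty_add


def simulate(ops: List[Tuple[int, int, int]], faulty_unit: int,
             gate_level: Callable[[int, int], int]) -> List[int]:
    """Run (alu_id, a, b) operations, invoking the slow model only on demand."""
    results = []
    for alu_id, a, b in ops:
        if alu_id == faulty_unit:
            # Faulty unit used: pass the real input stimuli to the detailed model.
            results.append(gate_level(a, b))
        else:
            results.append(uarch_add(a, b))
    return results
```

With ALU 1 faulty and output bit 0 stuck at 0, `simulate([(0, 1, 2), (1, 1, 2), (1, 4, 4)], 1, make_stuck_at_adder(0, 0))` yields `[3, 2, 8]`: the second operation is corrupted, while the third is masked because the correct result already has bit 0 clear. Activation therefore depends on the actual input vectors, which is the effect that simple µarch-level stuck-at models tend to miss.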

Results from SWAT-Sim • SWAT-Sim implemented within full-system simulation • NCVerilog + VPI for gate-level sim of ALU/AGEN modules • SWAT-Sim: High accuracy at low overheads • 100,000x faster than gate-level, same modeling fidelity • 2x slowdown over µarch-level, at higher accuracy • Accuracy of µarch models using SWAT coverage/latency • µarch stuck-at models generally inaccurate • Differences in activation rate, multi-bit flips • Complex manifestations → Hard to derive better models • Need SWAT-Sim, at least for now

SWAT Summary • SWAT: SoftWare Anomaly Treatment • Handle all and only faults that matter • Low, amortized overheads • Holistic systems view enables novel solutions • Customizable and flexible • Prior results: • Low-cost h/w detectors gave high coverage, low SDC rate • This talk: • iSWAT: Higher coverage w/ software-assisted detectors • TBFD: µarch level fault diagnosis by synthesizing DMR • SWAT-Sim: Gate-level fault accuracy at µarch level speed

Future Work • Recovery: hybrid, application-specific • Aggressive use of software reliability techniques • Leverage diagnosis mechanism • Multithreaded software • Off-core faults • Post-silicon debug and test • Use faulty trace as fault-model oblivious test vector • Validation on FPGA (w/ Michigan) • Hardware assertions to complement software symptoms

BACKUP SLIDES

Breakdown of Detections by SW symptoms • 98% of unmasked faults detected within 10M instr (w/o FPU) • Need HW support or SW monitoring for FPU

SW Components Corrupted • 66% of faults corrupt system state before detection • Need to recover system state

Latency from Application mismatch • 86% of faults detected under 100k instr • 42% detected under 10k instr

Latency from OS mismatch • 99% of faults detected under 100k instr

Trace-Based Fault Diagnosis (TBFD) [Figure: permanent fault detected → invoke diagnosis → rollback faulty core to checkpoint and replay execution, collecting µarch info (faulty trace); load checkpoint on fault-free core for fault-free instruction exec (test trace); the faulty trace and test trace are compared (=?) and TBFD classifies the result as faults in front-end, meta-datapath faults, or datapath faults]

Fault Diagnosability • 98% of detected faults are diagnosed • 89% diagnosed to unique unit/array entry • Meta-datapath faults in out-of-order exec mislead TBFD

Accuracy of existing Fault Models • SWAT-Sim implemented within full-system simulator • NCVerilog + VPI to simulate gate-level ALU and AGEN • Existing µarch-level fault models inaccurate • Differences in activation rate, multi-bit flips • Accurate models hard to derive → need SWAT-Sim!

Summary: SWAT Advantages • Handles all faults that matter • Oblivious to low-level failure modes & masked faults • Low, amortized overheads • Optimize for common case, exploit s/w reliability solutions • Holistic systems view enables novel solutions • Invariant detectors use diagnosis mechanisms • Diagnosis uses recovery mechanisms • Customizable and flexible • Firmware based control affords hybrid, app-specific recovery (TBD) • Beyond hardware reliability • SWAT treats hardware faults as software bugs • Long-term goal: unified system (hw + sw) reliability at lowest cost • Potential applications to post-silicon test and debug

Transients Results • 6400 transient faults injected across 8 structures • 83% of unmasked faults detected within 10M instr • Only 0.4% of injected faults result in SDCs
