Self-calibrating Online Wearout Detection
Authors: Jason Blome, Shuguang Feng, Shantanu Gupta, Scott Mahlke
MICRO-40, December 3, 2007
Motivation
• “Designing Reliable Systems from Unreliable Components…” - Shekhar Borkar (Intel)
• Failures will be wearout induced
• More failures to come
[Srinivasan, DSN‘04] [Borkar, MICRO‘05]
Current Approaches
• Traditional
  • Design margins
  • Burn-in
• Detection: based on replication of computation
  • TMR (Tandem/HP NonStop servers)
  • DIVA (Bower, MICRO’05)
• Prediction: utilizes precise analytical models and/or sensors
  • Canary circuits (Sentinel Silicon, RidgeTop)
  • RAMP (Srinivasan, UIUC/IBM)
Slide annotations: Impractical, Static, Costly
Wearout Mechanisms
• Many failure mechanisms have been shown to be progressive
  • Hot carrier injection (HCI)
  • Negative Bias Temperature Instability (NBTI)
  • Electromigration (EM)
  • Oxide Breakdown (OBD)
Objective
• Propose a failure prediction technique that exploits the progressive nature of wearout
• Monitor impact on path delays
• Prediction
  • Monitors evolution of wearout
  • Proactive: enables failure avoidance/mitigation
  • Continuous feedback
  • False negatives and positives
• Detection
  • Identifies existing fault
  • Reactive: enables failure recovery
  • End-of-life feedback
  • False negatives
Oxide Breakdown (OBD)
• Accumulation of defects leads to a conductive path
[Figure: Percolation Model [Stathis, JAP‘06]]
OBD HSPICE Model
• Post-breakdown leakage modeling
[Rodriguez, Stathis, Linder, IRPS ‘03] [BSIM4.6.0, ‘06]
Characterization Testbench
• 90nm standard cell library
[Figure: testbench diagram; labels: t_circuit, t_cell]
Delay Profiling Unit (DPU)
[Figure: DPU block diagram; labels: uArch Module, input signal, Latency Sampling]
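The DPU figure itself is not preserved here, so the following is only a minimal sketch of one way latency sampling could work, not the authors' circuit. It assumes the monitored signal is captured by a chain of latches, each seeing the signal after one more buffer delay, so the captured bit pattern indicates how much slack remains before the clock edge. The names `sample_latency` and `slack_lower_bound`, the tap count, and all timing values (in picoseconds) are hypothetical.

```python
def sample_latency(arrival_ps, clock_ps, num_taps, tap_delay_ps):
    """Return the bits captured by a chain of progressively delayed latches.

    Assumption: tap k sees the signal k buffer delays late, and captures
    a 1 only if the delayed signal still arrives by the clock edge.
    """
    bits = []
    for k in range(num_taps):
        delayed_arrival = arrival_ps + k * tap_delay_ps
        bits.append(1 if delayed_arrival <= clock_ps else 0)
    return bits


def slack_lower_bound(bits, tap_delay_ps):
    """Slack is at least (number of passing taps - 1) buffer delays."""
    passing = sum(bits)
    return max(passing - 1, 0) * tap_delay_ps


# A fresh path (arrives at 700 ps) versus a worn path (arrives at 820 ps),
# both sampled against a 1000 ps clock with 50 ps taps.
fresh_bits = sample_latency(700, 1000, 8, 50)
worn_bits = sample_latency(820, 1000, 8, 50)
```

As the path slows with age, fewer taps capture the signal in time, so the count of passing taps gives a coarse latency sample that later stages can track.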
TRIX Analysis
• Magnitude of divergence between TRIX_global and TRIX_local reflects amount of degradation
TRIX Analysis Details
• Exponential Moving Average (EMA)
• Triple-smoothed Exponential Moving Average (TRIX)
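The equations from this slide did not survive extraction. As a reminder of the standard definitions, and not necessarily the exact form shown in the talk, an EMA over latency samples x_t with a smoothing factor α (an assumed symbol) and its triple-smoothed cascade can be written as:

```latex
\begin{align*}
E^{(1)}_t &= \alpha\, x_t + (1-\alpha)\, E^{(1)}_{t-1}, \qquad 0 < \alpha \le 1 \\
E^{(2)}_t &= \alpha\, E^{(1)}_t + (1-\alpha)\, E^{(2)}_{t-1} \\
E^{(3)}_t &= \alpha\, E^{(2)}_t + (1-\alpha)\, E^{(3)}_{t-1} \\
\mathrm{TRIX}_t &= E^{(3)}_t
\end{align*}
```

Here E^(1) is the plain EMA, and TRIX is taken to be the third stage of the cascade, following the slide's description of it as a triple-smoothed EMA. A smaller α gives a slower, longer-memory average; a larger α tracks recent samples more closely.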
Noisy Latency Profile
[Figure: Percent Nominal Delay (%) plotted against increasing age]
DPU with TRIX Hardware
[Figure: block diagram; labels: input signal, Latency Sampling, TRIXl Calculation, TRIXg Calculation, Prediction]
Wearout Detection Unit (WDU)
[Figure: block diagram; labels: Latency Sampling, TRIXl Calculation, TRIXg Calculation, Prediction]
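The flow in the diagram (latency sampling, local and global TRIX calculation, prediction) can be illustrated in software. This is a rough sketch under stated assumptions, not the WDU hardware: the local TRIX is taken to use a large smoothing factor and the global TRIX a small one, and a failure is predicted when their relative divergence exceeds a threshold. The class names, the smoothing factors, and the 5% threshold are all hypothetical.

```python
class TripleEMA:
    """Three cascaded exponential moving averages (triple smoothing)."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.stages = None  # state of the three cascaded EMAs

    def update(self, sample):
        if self.stages is None:
            # Seed every stage with the first sample.
            self.stages = [sample, sample, sample]
        else:
            value = sample
            for i in range(3):
                self.stages[i] = (self.alpha * value
                                  + (1 - self.alpha) * self.stages[i])
                value = self.stages[i]  # output of stage i feeds stage i+1
        return self.stages[2]


class WearoutDetector:
    """Flags wearout when local TRIX diverges upward from global TRIX."""

    def __init__(self, alpha_local=0.2, alpha_global=0.01, threshold=0.05):
        # All three parameter values are illustrative assumptions.
        self.trix_local = TripleEMA(alpha_local)
        self.trix_global = TripleEMA(alpha_global)
        self.threshold = threshold

    def update(self, latency):
        local = self.trix_local.update(latency)
        overall = self.trix_global.update(latency)
        divergence = (local - overall) / overall
        return divergence > self.threshold


# Stable latency should not trigger; a sustained 20% slowdown should.
detector = WearoutDetector()
stable_flags = [detector.update(100.0) for _ in range(200)]
aged_flags = [detector.update(120.0) for _ in range(100)]
```

Because the global average adapts slowly, it acts as a self-calibrated baseline for each monitored signal, while the local average follows recent behavior; raising or lowering the threshold trades earlier warnings against false positives.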
Evaluation Framework
• Fully synthesized, placed and routed (P&R) OR1200 core
[Figure: framework diagram; labels: Monte Carlo Simulator; Gate-level Processor Simulator; OR1200 Verilog; Synthesis and Place and Route; 90nm Library; Timing, Power, and Temperature Simulations; MediaBench Suite; Workload Simulator; HSPICE Simulations; OBD Wearout Model; Wearout Simulator]
WDU Accuracy
WDU Overhead
WDU Overhead (continued)
Long-term Vision
• Introspective Reliability Management (IRM)
  • Intelligent reliability management directed by on-chip sensor feedback
• Prospective sensors
  • Delay (WDU)
  • Leakage/Vt
  • Temperature
Conclusions
• Many progressive wearout phenomena impact device-level performance
• It is possible to characterize this impact and anticipate failures
• WDU performance
  • Failure predicted within 20% of end of life (tunable)
  • Area overhead < 3% (hybrid)
• Low-level sensors can be used to enable intelligent reliability management
Questions?