1 / 43

Virtually-Aged Sampling DMR Unifying Circuit F ailure P rediction and Detection

Virtually-Aged Sampling DMR Unifying Circuit F ailure P rediction and Detection. Raghuraman Balasubramanian Karthikeyan Sankaralingam. Microprocessor Reliability . More devices will fail on the field in f uture technology nodes. 10nm. 16nm. 32nm. Failure Rate. Time (years). 2.

tamera
Télécharger la présentation

Virtually-Aged Sampling DMR Unifying Circuit F ailure P rediction and Detection

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Virtually-Aged Sampling DMR Unifying Circuit Failure Prediction and Detection Raghuraman Balasubramanian Karthikeyan Sankaralingam

  2. Microprocessor Reliability  More devices will fail on the field in future technology nodes 10nm 16nm 32nm Failure Rate Time (years) 2 • A lot of research on how to… • Mitigate / Recover / Repair … • Detect : DMR, Diva, Argus, BIST, SWAT… • Predict : Canaries, Razor, WearMon… • Coverage, detection latency, fault type…

  3. Circuit Failure Prediction 3 • Our goals • Low Design Complexity • Low Overheads • High Accuracy • Full Coverage

  4. To get there… Lets start from a good baseline Sampling-DMR 4

  5. Sampling+DMR Nomura, Shuou, et al. "Sampling+dmr: practical and low-overhead permanent fault detection." International Symposium on Computer Architecture (ISCA), 2011 5 • Permanent fault detection • 100% coverage • < 2% Energy overheads

  6. But There is a Problem Time (years) A Gate Fails Architectural Errors Sampling-DMR SamplingWindows With Infrequently Occurring Errors Missed Errors  Sampling-DMR Virtually Aged Virtual aging makes the gates behave as if they were 6 months older 6

  7. Virtually Aged Sampling DMR • Virtual Aging • Fault Exposure • In most gates the faults are automatically exposed • A new mechanism to expose faults in other gates • Detect Errors 7

  8. Executive Summary 8 • Virtually Aged Sampling-DMR • Microprocessor Failure Prediction • Full logic coverage • With < 0.7% energy overhead • Negligible performance overhead

  9. Outline 9 Motivation and Overview Virtual aging? Are all gates covered? Evaluation Methodology Results Related work Questions

  10. Virtual Aging As a chip wears out, the gates become slower As we decrease Vdd, the gates become slower Virtual aging => Reducing Vdd == 6-month Delay Degradation 10

  11. Outline 11 Motivation and Overview Virtual aging Are all gates covered? Evaluation Methodology Results Related work Questions

  12. Are all gates covered? 12 • Most gates (near-critical paths) ✔ • Initial worst-case propagation delay ∼clock period • Wearout ➔ propagation delay↑ > clock period • Delay fault is naturally exposed • Some gates (non-critical paths) ✗ • Initial worst-case propagation delay << clock period • Wearout ➔ propagation delay↑ < clock period ⇒ Fault is not manifested • Delay degradation is benign • Eventually catastrophic breakdown  Photo credit : Wikimedia Commons

  13. Soft and Hard breakdown Degradation = f(utilization, operating conditions, process variations) Any gate may fail. 13

  14. Fault Capture Logic for Non-Critical Paths 14

  15. Fault Capture Logic for Non-Critical Paths Comprehensive Logic coverage 15

  16. Virtually Aged SDMR 16

  17. Outline 17 Motivation and Overview Virtual aging?? Are all gates covered?? Evaluation Methodology Results Related work Questions

  18. Evaluation Methodology Applications Applications DMR Error?? Input Sequences Fault Vector Delay Aware Simulation • Full SPEC benchmarks • OpenRISC Processor • ~400,000 Fault Injection • Experiments Synopsys HSPICE + MOSRA Delay as a function of Time/Vdd 18

  19. Outline 19 Motivation and Overview Virtual aging? Are all gates covered? Evaluation Methodology Results Related work Questions

  20. Results Paper includes results on running 10 SPEC benchmarks to completion spanning almost 400,000 experimental runs 20 Is delay degradation measurably observable? Can voltage reduction mimic virtual aging? Do the manifested faults get exposed to the μ arch and cause timing faults? Do the faults exposed to the microarchitecture translate to architectural errors, then detected? What are the overheads?

  21. Is delay degradation measurably observable? 21 5 gates represent fault sites Model paths through these gates in HSPICE MOSRA wearout models

  22. Can voltage reduction mimic virtual aging? 22 HSPICE @ Vdd = 1.2 V, Vdd = 1.15V

  23. What are the overheads? 23 Synthesized with 32nm Synopsys process Implemented additional logic for fast paths

  24. Results - Summary 24 • Experimental Result Predict failures 9 months in advance using a Vdd reduction of 50mV • Empirical result + Mathematical modeling Can predict failure within 0.4 days in all but 1 of 1 billion chips

  25. Outline 25 Motivation and Overview Virtual aging? Are all gates covered? Evaluation Methodology Results Related work Questions

  26. Circuit Failure Prediction 26 • Predict the onset of failures • Low Design Complexity • Low Overheads • High Accuracy • Full Coverage

  27. Related Work On-chip test circuits 27

  28. Related Work Detect aging in select near-critical paths 27

  29. Related Work Periodic testing (offline) using on-chip test vectors 27

  30. Related Work Measure + Analyze (online) 27

  31. Related Work Reduce Vdd + Expose Faults 27

  32. Contributions Thank You 28 • Virtually Aged Sampling-DMR • Microprocessor Failure Prediction • Full logic coverage • With < 0.7% energy overhead • Negligible performance overhead • A new state-of-the-art in evaluation • Accurate wearout models at the gate level • And impact on full system (running full benchmarks)

  33. How Devices Degrade Target failure mechanisms for which delay degradation is a symptom 33 NBTI, HCI, TDDB Over time, Threshold Voltage Increases Propagation Delay Increases NOT covered: Electromigration, thermal runaway

  34. Variations • Process variations (Static) • Some processors are more susceptible • Voltage variations (Dynamic) • Variations ~1 order of magnitude smaller compared to degradation • Similar conditions in actual failure& virtual aging Reddi, Vijay Janapa, et al. "Voltage noise in production processors." Micro, IEEE 31.1 (2011).

  35. When does this not work? • Only when the conditions change drastically between prediction and actual failure • Change in program behavior • Operating conditions (Temperature, Voltage etc.,) • Program hides fault exposure (but stresses it) • As long as the fault is manifested 0.4 days before the actual failure – Aged-SDMR works.

  36. Evaluation setup

  37. Do the manifested faults get exposed to the μ-arch and cause timing faults? We saw timing faults appear during the sampling windows • Delay Aware Simulation • Input sequences from OpenRISC FPGA • 10 benchmarks (6 SPEC INT, 4 SPEC FP) • 5 million cycle traces x 3 phases of the program • Cycle accurate fault vectors

  38. Do the faults exposed in the microarchitecture translate to architectural errors, then detected? Architecture error rate using 100000 cycle sampling windows Fault vector from delay aware simulation Injected on OpenRISC on FPGA + DMR emulation

  39. Canary based 39/29

  40. Age Detection Latches 40/29

  41. BIST/DFT Based (Offline) 41/29

  42. Continuous Degradation Tracking 42/29

  43. Evaluation Methodology : Key Challenges Run full applications on a full system simulator& model wearout at the device level • Aged-SDMR is a Cross-layered Approach • Wearout is a gate-level phenomenon • Sampling-DMR works at the architecture level • Application dependency • Technique relies on the application to expose faults

More Related