
Dark Silicon Phenomenon

Combining Error Detection and Transactional Memory for Energy-Efficient Computing below Safe Operation Margin.





Presentation Transcript


  1. Combining Error Detection and Transactional Memory for Energy-Efficient Computing below Safe Operation Margin. Gulay Yalcin, Anita Sobe, Alexey Voronin, Jons-Tobias Wamhoff, Derin Harmanci, Adrián Cristal, Osman Unsal, Pascal Felber, Christof Fetzer. PDP 2014, Turin, Italy, 13 February 2014

  2. Dark Silicon Phenomenon • The number of transistors on a chip can be increased. • To stay within the chip's power budget, some of them must remain "dark". • One solution: scale down the supply voltage (Vdd).

  3. How about Reliability? When Vdd is reduced, the error rate increases exponentially [1]. Our goal: investigate how far the voltage can be reduced while error recovery still yields a net reduction in energy consumption. [1] Dan Ernst et al. "Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation." In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, pages 7–18, 2003.

  4. Agenda / Overview • Motivation • Experiment: Scaling Vdd in a Real System • Basics of Reliability • Error Recovery with TM • Error Detection Schemes • Analysis • Conclusion

  5. Reducing Vdd in a Real System • AMD FX-6100 • 6-core CPU • CPU-heavy execution • Every 10 seconds, reduce Vdd by 12.5 mV • Monitor: incorrect results, system crashes, Machine Check Architecture (MCA) reports. Errors occur in the instruction cache (37%), the execution unit (61%), and other units (less than 2%). The system encounters errors that MCA cannot correct after only a 10% reduction in Vdd.

  6. Basics of Reliability Transactional Memory can provide lightweight coordinated local checkpointing [2]. [2] Gulay Yalcin et al. "FaulTM: Fault Tolerance Using Hardware Transactional Memory", DATE 2013.

  7. TM provides checkpointing/rollback [Figure: Processor 1, P2, P3, P4, ..., Pn, each with its own checkpoint (log area); the checkpoints are synchronized.] Data versioning provides a synchronization mechanism between checkpoints. TM write-sets log the tentative memory updates.
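The checkpoint/rollback idea on this slide can be sketched in software. This is a minimal illustration, not the hardware TM mechanism the talk relies on: the `Transaction` class, the `run_with_recovery` helper, and the `error_detected` callback are all hypothetical names introduced here. Tentative writes go to a write-set (the "log area"); memory itself acts as the checkpoint until commit.

```python
class Transaction:
    """Buffers tentative writes in a write-set; memory changes only on commit."""

    def __init__(self, memory):
        self.memory = memory   # dict standing in for main memory (assumption)
        self.write_set = {}    # tentative updates, i.e. the log area

    def read(self, addr):
        # A transaction sees its own tentative writes before committed memory.
        return self.write_set.get(addr, self.memory.get(addr))

    def write(self, addr, value):
        self.write_set[addr] = value

    def commit(self):
        # Publish the tentative updates; this becomes the new checkpoint.
        self.memory.update(self.write_set)
        self.write_set.clear()

    def abort(self):
        # Rollback: drop the tentative updates; memory was never modified.
        self.write_set.clear()


def run_with_recovery(memory, body, error_detected, max_retries=3):
    """Run body in a transaction; abort and re-execute if an error is detected."""
    for _ in range(max_retries):
        tx = Transaction(memory)
        body(tx)
        if error_detected(tx):
            tx.abort()
            continue
        tx.commit()
        return True
    return False
```

Because writes are buffered, recovery costs only the re-execution of the aborted transaction, which is why transaction size matters for the recovery overhead.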

  8. Error Detection Schemes - Replication • Execute instruction streams multiple times • Compare the results of the executions • Fewer comparisons with TM • Dual/Triple Modular Redundancy • + High error detection rate • - High energy overhead
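One way to read "fewer comparisons with TM" is that the replicas are compared once per transaction, on their write-sets, rather than after every instruction. The sketch below assumes that reading; `run_replica`, `dmr_commit`, and the single-bit fault injection are hypothetical and only illustrate dual modular redundancy.

```python
def run_replica(updates, fault_at=None):
    """Apply (addr, value) updates to a fresh write-set.

    If fault_at is given, the value written by that update gets one bit
    flipped, standing in for a hardware fault (illustrative only).
    """
    write_set = {}
    for i, (addr, value) in enumerate(updates):
        if i == fault_at:
            value ^= 1  # injected single-bit fault
        write_set[addr] = value
    return write_set


def dmr_commit(updates, fault_at=None):
    """Execute the same updates twice and compare once, at commit time."""
    primary = run_replica(updates, fault_at)
    shadow = run_replica(updates)
    return primary == shadow  # True: safe to commit; False: abort and retry
```

For example, `dmr_commit([("a", 4), ("b", 7)])` agrees, while the same call with `fault_at=1` reports a mismatch.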

  9. Error Detection Schemes - Assertions/Invariants • Assertions: conditions referring to the current and previous state of the program. • Check the state • Added manually or automatically • TM facilitates inserting invariants • Ex:
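The slide's own example is not part of this transcript. As an independent sketch of an invariant checked before a transaction commits, consider a transfer between accounts whose total balance must not change. The `transfer` function, the account layout, and the `corrupt` flag are assumptions made for illustration.

```python
def transfer(accounts, src, dst, amount, corrupt=False):
    """Move amount from src to dst, committing only if the invariant holds."""
    tentative = dict(accounts)  # tentative copy, playing the role of a write-set
    tentative[src] -= amount
    tentative[dst] += amount
    if corrupt:
        tentative[dst] ^= 1 << 4  # illustrative injected bit flip

    # Invariant over previous and current state: the total is preserved.
    if sum(tentative.values()) != sum(accounts.values()):
        return False  # abort: accounts is left untouched

    accounts.update(tentative)  # commit
    return True
```

The check compares the tentative state with the previous state, so a corrupted update is discarded before it becomes visible.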

  10. Error Detection Schemes - Symptoms • Monitor program execution for symptoms of hardware faults. • Symptoms: • Mispredictions in high-confidence branches, • High OS activity, • Fatal traps (e.g. undefined instruction code) • Reliability at a low cost

  11. Error Detection Schemes - Encoded Processing • Apply software coding (ECC-like) techniques • Redundancy is added by applying arithmetic codes to the values. • Arithmetic codes: AN, ANBDmem, etc. • With TM, the validation of a code word can be deferred until a TX commits. • Ex:
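The slide's own example is not part of this transcript. The following is a minimal sketch of a plain AN code with validation deferred to commit: every value x is stored as A * x, so valid code words are exactly the multiples of A, and sums of code words stay valid. The constant `A = 61` and all function names are illustrative assumptions, not values from the talk.

```python
A = 61  # illustrative odd constant; real systems choose A more carefully


def encode(x):
    return x * A


def is_valid(code_word):
    # Valid AN code words are multiples of A.
    return code_word % A == 0


def decode(code_word):
    if not is_valid(code_word):
        raise ValueError("corrupted code word")
    return code_word // A


def commit(memory, write_set):
    """Validate every code word once, at commit, instead of after each operation."""
    if not all(is_valid(value) for value in write_set.values()):
        return False  # abort: memory is left untouched
    memory.update(write_set)
    return True
```

Since A is odd and greater than 1, a single bit flip changes a code word by a power of two, which is never a multiple of A, so the corrupted value fails the check.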

  12. Comparing Error Detection Schemes

  13. Analysis • Gem5 full-system simulator • 1 GHz in-order cores • 4 cores • x86 ISA • 64 KB L1 data and instruction caches • Unified 2 MB L2 cache • SPLASH2 benchmark suite

  14. Energy Analysis [Figure labels: Error Detection Rate, Vdd, Fault Injection, TX size, Recovery Overhead, Error-free Overhead.] Energy model: $E \approx C \times V_{dd}^2$
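The quadratic model makes the trade-off easy to estimate. The sketch below assumes, as a simplification not stated on the slide, that the error-free overhead and the recovery overhead simply add extra work that scales energy linearly. The function names and the sample overheads are hypothetical; only the 10% Vdd reduction echoes the real-system experiment.

```python
def relative_energy(vdd_scale, error_free_overhead=0.0, recovery_overhead=0.0):
    """Energy relative to nominal-Vdd execution without error detection.

    vdd_scale is the reduced Vdd divided by the nominal Vdd; the overheads
    are fractions of extra work (0.10 means 10% more work).
    """
    return vdd_scale ** 2 * (1.0 + error_free_overhead + recovery_overhead)


def break_even_overhead(vdd_scale):
    """Largest total overhead for which reducing Vdd still saves energy."""
    return 1.0 / vdd_scale ** 2 - 1.0


# A 10% Vdd reduction with illustrative 10% detection and 5% recovery overhead.
saving = 1.0 - relative_energy(0.9, 0.10, 0.05)
print(f"energy saving: {saving:.1%}")
print(f"break-even overhead at 0.9 x Vdd: {break_even_overhead(0.9):.1%}")
```

Under these assumptions, a 10% voltage reduction pays off as long as detection plus recovery add less than roughly 23% extra work.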

  15. Energy Reduction

  16. Reliability of the System

  17. Conclusion • The energy consumption of CPUs can be reduced if we have efficient hardware support for Transactional Memory and for Error Detection.

  18. Future Work: Combining DMR and Symptoms

  19. Thanks! Gulay Yalcin, gyalcin@bsc.es
