This presentation explores the various aspects of microprocessor reliability, focusing on both hard and soft errors that can impact performance and longevity. It details the causes of these errors, including radiation and process variabilities, as well as their implications for device yield and reliability under extreme operating conditions. The discussion includes various solutions at the device, circuit, and architectural levels to mitigate these issues, highlighting the importance of error detection and correction mechanisms to enhance microprocessor performance and lifespan.
Microprocessor Reliability Robert Pawlowski ECE 570 – 2/19/2013
Reliability • Involves different aspects of a processor that can affect performance and functionality. • Ultimately can reduce the lifetime of the processor. • Issues typically manifest themselves at the device level. • Solutions can be implemented at multiple design levels.
Why the concern? • Operating at highest frequencies and/or lowest power possible increases sensitivity to process-related variabilities. • Gate length/doping concentration variations • Temperature • Supply voltage droops • This decreases processor yield • Decreasing device sizes → increased effect of external issues
Outline • Error Classification • Hard Errors • Soft Errors • Sources of radiation • Device/Circuit approaches • Architectural approaches • Error detection • Error correction • System level impact
Processor Error Classification • Hard Errors will result in permanent processor failure. • Processor lifetime is inversely proportional to hard error rate. • Soft Errors do not permanently damage the device.
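As a rough illustration of the inverse relationship between lifetime and hard error rate, the sketch below assumes a constant failure rate expressed in FIT (failures per 10^9 device-hours). The constant-rate model and the function names are assumptions made for illustration; the slides do not specify a lifetime model, and wear-out failures in particular do not follow a constant rate.

```python
HOURS_PER_YEAR = 24 * 365


def mttf_hours(fit_rate):
    """Mean time to failure in hours, assuming a constant failure rate.

    fit_rate is in FIT: failures per 1e9 device-hours.
    """
    if fit_rate <= 0:
        raise ValueError("fit_rate must be positive")
    return 1e9 / fit_rate


def mttf_years(fit_rate):
    """Same quantity expressed in years (365-day years)."""
    return mttf_hours(fit_rate) / HOURS_PER_YEAR
```

Under this model, doubling the hard error rate halves the expected lifetime: `mttf_hours(1000)` is 1,000,000 hours (roughly 114 years), and `mttf_hours(2000)` is 500,000 hours.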
Hard Errors • Extrinsic failures • Caused by process and manufacturing defects • Occur with decreasing rate over time • No impact from micro-architecture • Intrinsic failures • Related to processor wear-out • Occur with increasing rate over time • Related to wafer packaging, process parameters, and processor design.
Soft Errors • Occur in both memory and logic • External radiation main issue in memory • Alpha particles • High energy neutrons • Thermal neutrons • Different causes of transient errors in logic • External radiation • Supply voltage droop • Power supply fluctuations • Ground bounce, cross-talk • Process variation, temperature • Affect delay of computational paths
Radiation-Induced Soft Errors • Ionized particle strike causing a state change • No permanent damage (unlike a hard error) • Combinational logic: Single Event Transients (SET) • Memory cells: Single Bit Upset (SBU), Multi Bit Upset (MBU) • Three causes of soft errors • Alpha particles • Thermal neutrons • High-energy neutrons
Alpha-Particles • Emitted from impurities in packaging materials. • Create electron-hole pairs through direct ionization • Range for a 10 MeV particle < 100 µm • Typical energy 4-9 MeV • Improved manufacturing trends → reduced effect • Purified materials • Shielding layers
Neutrons • Result of cosmic ray reactions with atmosphere • High-energy neutrons react with chip materials. • Concrete is the only shielding material • 1.4x lower flux per foot of thickness
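The shielding figure can be turned into a quick estimate. The sketch below assumes the 1.4x reduction compounds with each additional foot of concrete (exponential attenuation); that reading, and the function names, are assumptions and not something the slides state explicitly.

```python
import math


def shielded_flux(flux0, concrete_feet, factor_per_foot=1.4):
    """Neutron flux behind concrete, assuming the per-foot factor compounds."""
    return flux0 / (factor_per_foot ** concrete_feet)


def feet_for_reduction(reduction, factor_per_foot=1.4):
    """Concrete thickness (feet) needed to cut the flux by `reduction` times."""
    return math.log(reduction) / math.log(factor_per_foot)
```

Under this assumption a 10x flux reduction needs roughly 6.8 feet of concrete, which shows why shielding is rarely a practical answer to high-energy neutrons.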
Neutrons • Thermal neutrons (<< 1 MeV) react with Boron-Doped Phosphosilicate Glass (BPSG) dielectric layer. • Produce ionized particles that can cause soft errors • Solution: remove BPSG from advanced processes • Mostly solved, but SEUs still found in 45nm, 90nm
Device-level solutions • Larger device sizes → larger capacitance • Increases the amount of charge necessary to flip a bit (critical charge) • Multiple VT design • Sensitivity to variation at low VDD may limit effectiveness. • Body biasing is also common to both radiation hardening and variation tolerance
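The link between capacitance and critical charge can be sketched with the common first-order approximation Qcrit ≈ Cnode × VDD. This approximation ignores the restoring current of the cell, and the capacitance and voltage values used in the note below are hypothetical, chosen only to show the trend.

```python
def critical_charge_fc(node_capacitance_ff, vdd_volts):
    """First-order critical charge in femtocoulombs (fF * V = fC)."""
    return node_capacitance_ff * vdd_volts


def is_upset(collected_charge_fc, node_capacitance_ff, vdd_volts):
    """True if the charge collected from a particle strike exceeds Qcrit."""
    return collected_charge_fc > critical_charge_fc(node_capacitance_ff, vdd_volts)
```

With a hypothetical 2 fF node, 1.5 fC of collected charge does not flip the bit at 1.0 V (Qcrit = 2 fC) but does at 0.5 V (Qcrit = 1 fC), which is why low-VDD operation is more sensitive to soft errors.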
Circuit-level solutions • DICE (dual interlocked storage cell) • Used for SRAM, flip-flops, latches • Built-in current sensors on supply lines of memory cells.
Modular redundancy • Dual Modular Redundancy • Triple Modular Redundancy
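A minimal behavioral sketch of the two schemes, treating each module output as an integer bit vector. The function names are illustrative. DMR can only detect a mismatch, while TMR can mask a single faulty module through majority voting.

```python
def dmr_check(a, b):
    """Dual modular redundancy: return (value, error_detected).

    A mismatch is detected but cannot be corrected, because there is
    no way to tell which copy is wrong.
    """
    return a, a != b


def tmr_vote(a, b, c):
    """Triple modular redundancy: bitwise majority vote of three outputs."""
    return (a & b) | (a & c) | (b & c)
```

For example, `tmr_vote(0b1010, 0b1010, 0b0010)` returns `0b1010`: the upset in the third copy is outvoted by the other two.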
Redundant Circuits • Redundancy increases area/power • DMR/TMR in sub/near-VT • Timing variation between circuits increases • Utilization of redundant lanes for parallel operation can increase throughput at low-VDD
Self-Checking Circuits • Partition circuit into smaller blocks • Error checker for each block • Use error detection codes • Berger codes • Arithmetic codes • Increases circuit delay for error computation
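As one example of the detection codes listed above, a Berger code appends the count of 0 bits in the information word as its check symbol, which detects all unidirectional errors (errors that flip bits in only one direction). The sketch below is a software model with illustrative function names, not a circuit description.

```python
def berger_check(data, width):
    """Check symbol: number of 0 bits in the width-bit information word."""
    ones = bin(data & ((1 << width) - 1)).count("1")
    return width - ones


def berger_encode(data, width):
    """Return the (information word, check symbol) pair."""
    return data, berger_check(data, width)


def berger_valid(data, check, width):
    """True if the stored check symbol matches the recomputed one."""
    return berger_check(data, width) == check
```

A 1-to-0 error in the data increases the zero count, while a 1-to-0 error in the check symbol can only decrease its value, so the two can never cancel out.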
Circuit-Level Speculation • Uses approximated circuit implementation • Goal is to reduce critical path
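One way to picture an approximated implementation is an adder whose carry chain is cut short: each sum bit only looks at a limited window of lower bits, so the critical path shrinks, at the cost of occasional wrong results that must be detected. The sketch below is an assumption chosen for illustration (the slides do not name a specific circuit), and the function names and the `window` parameter are hypothetical.

```python
def speculative_add(a, b, width, window):
    """Add two width-bit values, letting carries propagate at most
    `window` bit positions. Exact (mod 2**width) when window >= width."""
    result = 0
    for i in range(width):
        lo = max(0, i - window)
        mask = (1 << (i + 1)) - (1 << lo)      # selects bits lo..i
        partial = ((a & mask) >> lo) + ((b & mask) >> lo)
        bit = (partial >> (i - lo)) & 1        # sum bit i, carry-in to lo assumed 0
        result |= bit << i
    return result


def speculation_failed(a, b, width, window):
    """Reference check: True when the approximate sum differs from the exact one."""
    exact = (a + b) & ((1 << width) - 1)
    return speculative_add(a, b, width, window) != exact
```

Long carry chains are rare for typical operands, so most additions complete correctly on the short path; a case such as `0x7F + 1` with a 2-bit window fails and would need correction.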
Tunable Replica Circuits • Mirrors delay of critical path • Monitors for errors over voltage/frequency changes
Timing Speculation • Razor timing error detection • Designed for transient faults • Effective against SETs and SBUs on flip-flops • Requires error recovery
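The idea behind Razor-style detection can be sketched behaviorally: a main flip-flop samples at the clock edge, a shadow latch samples the same signal slightly later, and a mismatch between the two flags a timing error. The model below is a simplification with hypothetical names and abstract time units; it is not the actual Razor circuit.

```python
from dataclasses import dataclass


@dataclass
class RazorResult:
    main_value: int     # value captured by the main flip-flop
    shadow_value: int   # value captured by the delayed shadow latch
    error: bool         # True when the two disagree


def razor_sample(old_value, new_value, arrival_time, clock_period, shadow_delay):
    """Model one clock cycle: data that arrives after the clock edge is
    missed by the main flip-flop but may still be caught by the shadow latch."""
    main = new_value if arrival_time <= clock_period else old_value
    shadow = new_value if arrival_time <= clock_period + shadow_delay else old_value
    return RazorResult(main, shadow, main != shadow)
```

When an error is flagged, the shadow value is the correct one and recovery restores it into the pipeline. Data arriving later than `clock_period + shadow_delay` goes undetected in this model, which is why the shadow window has to cover the worst-case path delay.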
Error Recovery Options in Scalar Processors • Clock Gating: • Global error signal • Clock gating • 1-cycle penalty
Error Recovery Options in Scalar Processors • Multiple Issue: • Error signals propagated to control unit • Instructions must be flushed • Error instruction then replayed • 2N-cycle penalty
Error Recovery Options in Scalar Processors • Counter-flow pipelining • Micro-rollback
Error correcting codes for memories • Most common is Hamming code • Check bits stored when data written • Identifies error and erroneous bit position
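A small Hamming(7,4) model shows how the check bits written alongside the data let a read identify and correct the erroneous bit position. Real memories use wider codes; the 4-bit data word and function names here are chosen only to keep the sketch short.

```python
def hamming74_encode(data_bits):
    """Encode [d1, d2, d3, d4] into a 7-bit codeword (positions 1..7).

    Parity bits sit at positions 1, 2 and 4; data at 3, 5, 6 and 7.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error found."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # syndrome equals the flipped position
    if position:
        c[position - 1] ^= 1          # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], position
```

Flipping any single bit of a codeword produces a syndrome equal to that bit's position, so the decoder recovers the original data. Two simultaneous flips would be miscorrected, which is one reason multi-bit upsets call for stronger codes.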
Error correcting codes for memories • Single-bit ECC adds area/power and delay • Low VDD → increased delay • Hybrid VDD operation will reduce delay • Overhead increases for multi-bit ECC • Increased memory density → higher probability of MBU • Current research: increase in ratio of MBU to total SER in sub-VT
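The storage overhead of single-bit ECC follows from the Hamming bound for single-error correction: r check bits can protect m data bits when 2^r ≥ m + r + 1. The helper below is a sketch with illustrative names; it covers only single-error-correcting codes and says nothing about the cost of multi-bit schemes.

```python
def sec_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 1
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def sec_overhead(data_bits):
    """Check bits as a fraction of data bits."""
    return sec_check_bits(data_bits) / data_bits
```

For 8, 32 and 64 data bits this gives 4, 6 and 7 check bits, so the relative overhead drops as the protected word gets wider.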
System-Level Impact • Soft errors can have a large effect on processor functionality • Increasing issue with further device scaling • All methods of error detection/correction are costly • Need to be added to system blocks wisely • SEU distribution • Effects of process variation
System-Level Impact • How to determine what blocks have the highest system-level impact? • Mostly through simulation • For radiation: all-encompassing • Includes fault injection at the circuit level • Different models have been developed • ReStore – University of Illinois at Urbana-Champaign • Focuses on system level effect of radiation-induced errors • RAMP – IBM • Directed more towards hard errors and processor failure.
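Fault injection can be illustrated with a toy campaign: flip each bit of each input to a block, compare against the fault-free ("golden") output, and count how often the flip is masked. This is a generic sketch with hypothetical names; it does not reproduce how ReStore or RAMP work.

```python
def flip_bit(value, bit):
    """Model a single-bit upset at the given bit position."""
    return value ^ (1 << bit)


def masking_rate(block, inputs, width):
    """Fraction of single-bit flips on the input that leave the
    block's output unchanged (exhaustive over inputs and bit positions)."""
    masked = total = 0
    for x in inputs:
        golden = block(x)
        for bit in range(width):
            total += 1
            if block(flip_bit(x, bit)) == golden:
                masked += 1
    return masked / total
```

For a block that only uses the low nibble of an 8-bit input, such as `lambda x: x & 0x0F`, half of all injected flips are masked. Blocks with a low masking rate are the ones where protection pays off most.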