Microprocessor Reliability

Microprocessor Reliability Robert Pawlowski ECE 570 – 2/19/2013

Reliability • Covers the aspects of a processor that can affect its performance and functionality. • Reliability problems can ultimately reduce the lifetime of the processor. • Issues typically manifest themselves at the device level. • Solutions can be implemented at multiple design levels.

Why the concern? • Operating at the highest frequencies and/or lowest power possible increases sensitivity to process-related variability. • Gate length/doping concentration variations • Temperature • Supply voltage droops • This decreases processor yield • Decreasing device sizes → increased effect of external issues

Outline • Error Classification • Hard Errors • Soft Errors • Sources of radiation • Device/Circuit approaches • Architectural approaches • Error detection • Error correction • System level impact

Processor Error Classification • Hard Errors will result in permanent processor failure. • Processor lifetime is inversely proportional to hard error rate. • Soft Errors do not permanently damage the device.
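The inverse relationship between lifetime and hard error rate can be sketched numerically. This is a minimal sketch that assumes a constant failure rate expressed in FIT (failures per 10^9 device-hours); the FIT unit and the function name `mttf_years` are illustrative assumptions, not taken from the slides.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year


def mttf_years(fit):
    """Mean time to failure in years for a constant failure rate in FIT.

    Assumes 1 FIT = 1 failure per 1e9 device-hours, so MTTF = 1e9 / FIT hours.
    """
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return (1e9 / fit) / HOURS_PER_YEAR


if __name__ == "__main__":
    for fit in (1000, 2000, 4000):
        print(f"{fit} FIT -> {mttf_years(fit):.1f} years")
```

Doubling the failure rate halves the estimated lifetime, which is the proportionality the slide states.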

Hard Errors • Extrinsic failures • Caused by process and manufacturing defects • Occur with decreasing rate over time • No impact from micro-architecture • Intrinsic failures • Related to processor wear-out • Occur with increasing rate over time • Related to wafer packaging, process parameters, and processor design.
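The decreasing extrinsic failure rate and the increasing intrinsic (wear-out) failure rate are often illustrated with a Weibull hazard function. The slides do not name a model, so the Weibull choice, the shape values, and the function name below are assumptions made only for illustration.

```python
def weibull_hazard(t, shape, scale=1.0):
    """Instantaneous failure rate h(t) = (k / s) * (t / s) ** (k - 1).

    shape < 1 gives a rate that falls over time (extrinsic, early-life defects);
    shape > 1 gives a rate that rises over time (intrinsic, wear-out).
    """
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape, and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


if __name__ == "__main__":
    for t in (1, 5, 10):
        early = weibull_hazard(t, shape=0.5)   # hypothetical extrinsic shape
        wearout = weibull_hazard(t, shape=2.0)  # hypothetical intrinsic shape
        print(f"t={t}: extrinsic={early:.3f}, intrinsic={wearout:.3f}")
```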


Soft Errors • Occur in both memory and logic • External radiation is the main issue in memory • Alpha particles • High-energy neutrons • Thermal neutrons • Transient errors in logic have several causes • External radiation • Supply voltage droop • Power supply fluctuations • Ground bounce, cross-talk • Process variation, temperature • These affect the delay of computational paths

Radiation-Induced Soft Errors • An ionizing particle strike causes a state change • No permanent damage (unlike a hard error) • Combinational logic: Single Event Transients (SET) • Memory cells: Single Bit Upset (SBU), Multi Bit Upset (MBU) • Three causes of soft errors • Alpha particles • Thermal neutrons • High-energy neutrons

Alpha Particles • Emitted from impurities in packaging materials. • Create electron-hole pairs through direct ionization • Range for a 10 MeV particle < 100 µm • Typical energy 4-9 MeV • Improved manufacturing trends → reduced effect • Purified materials • Shielding layers

Neutrons • Result of cosmic ray reactions with the atmosphere • High-energy neutrons react with chip materials. • Concrete is the only effective shielding material • Flux drops by 1.4x per foot of thickness
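The 1.4x-per-foot figure implies an exponential fall-off with concrete thickness. The arithmetic can be sketched as follows; the function names are hypothetical, and the model assumes the same factor applies to every additional foot.

```python
import math


def flux_after_concrete(flux0, thickness_ft, factor_per_ft=1.4):
    """Neutron flux remaining after thickness_ft feet of concrete."""
    return flux0 / factor_per_ft ** thickness_ft


def thickness_for_reduction(reduction, factor_per_ft=1.4):
    """Feet of concrete needed to cut the flux by the given factor."""
    return math.log(reduction) / math.log(factor_per_ft)


if __name__ == "__main__":
    print(f"Flux after 3 ft: {flux_after_concrete(1.0, 3):.3f} of original")
    print(f"Feet for a 10x reduction: {thickness_for_reduction(10):.1f}")
```

Under this assumption a 10x reduction takes close to seven feet of concrete, which illustrates why shielding is rarely a practical fix.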

Neutrons • Thermal neutrons (<< 1 MeV) react with the Boron-Doped Phosphosilicate Glass (BPSG) dielectric layer. • Produce ionized particles that can cause soft errors • Solution → remove BPSG from advanced processes • Mostly solved, but SEUs are still found in 45 nm and 90 nm processes

Device-level solutions • Larger device sizes → larger capacitance • Increases the amount of charge necessary to flip a bit (critical charge) • Multiple VT design • Sensitivity to variation at low-VDD may limit effectiveness. • Body biasing also common to both radiation hardening and variation tolerance
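The link between capacitance and critical charge can be shown with a first-order estimate, Q_crit ≈ C_node × V_DD. This approximation ignores the restoring current of the cell and is an assumption introduced here, not a formula from the slides; the capacitance and voltage values are made up.

```python
def critical_charge_fc(node_capacitance_ff, vdd_volts):
    """First-order critical charge in femtocoulombs (fF * V = fC).

    Assumes Q_crit ~= C_node * V_DD, ignoring feedback and restoring current.
    """
    if node_capacitance_ff <= 0 or vdd_volts <= 0:
        raise ValueError("capacitance and VDD must be positive")
    return node_capacitance_ff * vdd_volts


if __name__ == "__main__":
    # Hypothetical node: upsizing doubles capacitance, low-VDD halves the supply.
    print("baseline :", critical_charge_fc(1.0, 1.0), "fC")
    print("upsized  :", critical_charge_fc(2.0, 1.0), "fC")
    print("low-VDD  :", critical_charge_fc(1.0, 0.5), "fC")
```

In this simplified view, upsizing raises the charge a strike must deposit, while lowering VDD reduces it.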

Circuit-level solutions • DICE cell • Used for SRAM, FFs, latches • Built-in current sensors on supply lines of memory cells.

Modular redundancy • Dual Modular Redundancy • Triple Modular Redundancy
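A behavioral sketch of the two schemes: DMR compares two copies and can only detect a mismatch, while TMR takes a bitwise majority vote over three copies and can mask a fault in any single copy. The function names are illustrative, and the module outputs are modeled as plain integers.

```python
def dmr_mismatch(a, b):
    """DMR: report an error when the two module outputs disagree."""
    return a != b


def tmr_vote(a, b, c):
    """TMR: bitwise majority of three module outputs."""
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    good = 0b1010
    faulty = good ^ 0b1000  # one bit flipped in one copy
    print("DMR detects error:", dmr_mismatch(good, faulty))
    print("TMR output correct:", tmr_vote(good, good, faulty) == good)
```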

Redundant Circuits • Redundancy increases area/power • DMR/TMR in sub/near-VT • Timing variation between circuits increases • Utilization of redundant lanes for parallel operation can increase throughput at low-VDD

Self-Checking Circuits • Partition circuit into smaller blocks • Error checker for each block • Use error detection codes • Berger codes • Arithmetic codes • Increases circuit delay for error computation
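As one example of the error detection codes listed above, a Berger code appends the count of zero bits in the data word as its check value, which lets a checker detect any unidirectional error (all flips in the same direction). The sketch below models this in software; the function names and the 4-bit word width are assumptions for illustration.

```python
def berger_check_bits(data, width):
    """Berger check value: the number of 0 bits in a width-bit data word."""
    ones = bin(data & ((1 << width) - 1)).count("1")
    return width - ones


def berger_is_valid(data, check, width):
    """Recompute the check value and compare it with the stored one."""
    return berger_check_bits(data, width) == check


if __name__ == "__main__":
    word, width = 0b1011, 4
    check = berger_check_bits(word, width)
    corrupted = word & 0b0011  # a 1 -> 0 (unidirectional) error in the data
    print("original valid :", berger_is_valid(word, check, width))
    print("corrupted valid:", berger_is_valid(corrupted, check, width))
```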

Circuit-Level Speculation • Uses approximated circuit implementation • Goal is to reduce critical path

Tunable Replica Circuits • Mirrors delay of critical path • Monitors for errors over voltage/frequency changes

Timing Speculation • Razor timing error detection • Designed for transient faults • Effective against SETs and SBUs on flip-flops • Requires error recovery
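Razor-style detection pairs each main flip-flop with a shadow latch that samples the same data slightly later; a mismatch between the two flags a timing error. The following is a simplified behavioral model, not the published circuit: the timing parameters, the function name, and the single-transition data model are all assumptions.

```python
def razor_sample(arrival_time, clock_period, shadow_delay, old_value, new_value):
    """Return (main, shadow, error) for one clock cycle.

    The main flip-flop samples at clock_period; the shadow latch samples at
    clock_period + shadow_delay. Data that arrives late leaves the old value.
    """
    main = new_value if arrival_time <= clock_period else old_value
    shadow = new_value if arrival_time <= clock_period + shadow_delay else old_value
    return main, shadow, main != shadow


if __name__ == "__main__":
    for arrival in (0.9, 1.2, 1.6):
        main, shadow, error = razor_sample(arrival, 1.0, 0.5, old_value=0, new_value=1)
        print(f"arrival={arrival}: main={main} shadow={shadow} error={error}")
```

In this model, data that arrives after the shadow latch has also sampled goes undetected, so the detection window bounds how aggressive the timing speculation can be.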

Error Recovery Options in Scalar Processors • Clock Gating: • Global error signal • Clock gating • 1-cycle penalty

Error Recovery Options in Scalar Processors • Multiple Issue: • Error signals propagated to control unit • Instructions must be flushed • Error instruction then replayed • 2N-cycle penalty
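The throughput cost of a recovery scheme depends on both its per-error penalty and how often errors occur. The sketch below compares the 1-cycle and 2N-cycle penalties quoted above; the slides do not define N, so it is treated as a caller-supplied parameter, and the error rate, N value, and function name are hypothetical.

```python
def expected_cycles(instructions, error_rate, penalty_cycles):
    """Expected cycle count for an ideal 1-instruction-per-cycle pipeline
    where each instruction independently incurs penalty_cycles with
    probability error_rate."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return instructions + instructions * error_rate * penalty_cycles


if __name__ == "__main__":
    n = 5               # hypothetical value for the undefined N
    error_rate = 0.01   # hypothetical timing-error rate per instruction
    print("clock gating :", expected_cycles(1_000_000, error_rate, 1))
    print("flush/replay :", expected_cycles(1_000_000, error_rate, 2 * n))
```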

Error Recovery Options in Scalar Processors • Counter-flow pipelining • Micro-rollback

Error correcting codes for memories • Most common is Hamming code • Check bits stored when data written • Identifies error and erroneous bit position
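The write-then-check flow can be illustrated with the small Hamming(7,4) code: three check bits are computed when four data bits are written, and on a read the recomputed parity (the syndrome) gives the position of a single flipped bit. Real memories use wider codes; the (7,4) size and the function names are chosen here only to keep the sketch short.

```python
def hamming74_encode(data_bits):
    """Encode [d1, d2, d3, d4] into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # syndrome = 1-based index of the bad bit
    if position:
        c[position - 1] ^= 1  # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    stored = hamming74_encode([1, 0, 1, 1])
    stored[4] ^= 1  # simulate a single-bit upset at position 5
    print(hamming74_correct(stored))
```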

Error correcting codes for memories • Single-bit ECC adds area/power and delay • Low-VDD → increased delay • Hybrid VDD operation will reduce delay • Overhead increases for multi-bit ECC • Increased memory density → higher probability of MBU • Current research: the ratio of MBU to total SER increases in sub-VT

System-Level Impact • Soft errors can have a large effect on processor functionality • Increasing issue with further device scaling • All methods of error detection/correction are costly • Need to be added to system blocks wisely • SEU distribution • Effects of process variation

System-Level Impact • How to determine which blocks have the highest system-level impact? • Mostly through simulation • For radiation: all-encompassing • Includes fault injection at the circuit level • Different models have been developed • ReStore – University of Illinois at Urbana-Champaign • Focuses on the system-level effect of radiation-induced errors • RAMP – IBM • Directed more towards hard errors and processor failure.
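Fault-injection simulation can be sketched as: run a block once to get a golden output, flip one input bit at a time, and count how many flips change the result. The toy block below is hypothetical and far simpler than the circuit-level models the slides refer to; it only shows how masking makes some upsets invisible at the output.

```python
def toy_block(a, b):
    """Hypothetical block: only the low nibble of `a` affects the output."""
    return (a & 0x0F) + b


def visible_fault_fraction(block, a, b, width=8):
    """Flip each bit of `a` in turn; return the fraction of flips that
    change the block output relative to the fault-free (golden) run."""
    golden = block(a, b)
    visible = 0
    for bit in range(width):
        if block(a ^ (1 << bit), b) != golden:
            visible += 1
    return visible / width


if __name__ == "__main__":
    fraction = visible_fault_fraction(toy_block, a=0b10100110, b=3)
    print(f"Fraction of single-bit upsets visible at the output: {fraction}")
```

Blocks with a high visible fraction are the ones where added detection or correction is most likely to pay for its cost.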

Questions?
