Transient and Permanent Faults in Nanoelectronic ICs: Compensation and Repair

Problems, Solutions, Limitations
H. T. Vierhaus, BTU Cottbus, Computer Engineering

Outline
1. Introduction: Nanostructure Problems
2. Transient Faults
3. Repair of Permanent Faults
4. Bus Structures and NoCs
5. Diagnostic Test
6. A Lot of Things to do ...

1. Introduction A bunch of new problems from nanostructures ...

Nanoelectronic Problems
- Lithography: The wavelength used to "map" structural information from masks to wafers is larger (4 times or more) than the minimum structural features (193 nm versus 90 / 65 / 45 nm). Layouts must be adapted to correct mapping faults.
- Parameter variations: The number of atoms in MOS transistor channels becomes so small that statistical variations of doping densities have an impact on device parameters such as threshold voltages.

Doping Fluctuations in MOS Transistors
[Figure: MOS transistor cross-section with poly-Si gate, n-doped regions and p-substrate, showing individual doping atoms.]
Density and distribution of doping atoms cause shifts in transistor threshold voltages!

Nanostructure Problems
- Individual device characteristics such as Vth depend more strongly on statistical variations of underlying physical features such as doping profiles.
- A significant share of basic devices will be "out of spec" and need to be replaced by backup elements for yield improvement after production.
- Smaller features mean higher stress (field strength, current density), so early failures "in the field" are also more likely and must be compensated.
- Recognizing and compensating transient errors "in time" is becoming a must, e.g. because charged particles can discharge circuit nodes.

Key Technologies
Fault-tolerant computing
- Is required to handle intermittent and transient fault effects, e.g. induced by radiation.
- An old technology that is already heavily used in everyday computing (e.g. memory interfaces with ECC check and correction).
- Can handle only a limited number of permanent faults!
Built-in self test (BIST) and self-repair (BISR)
- Is required to handle permanent faults by self-repair using redundant elements.
- State-of-the-art for memories, not for logic.
- Can handle multiple faults (sequentially) until the resource of redundancy is exhausted.
Algorithms that are fully or partially "fault hard"
- Most DSP algorithms show an inherent "stability" and work even under fault conditions with reduced precision.
- The effect can be "HW-enhanced".

System-on-a-Chip (SoC)
SoCs are heterogeneous systems that require test & repair strategies for:
- logic (also in processors)
- memory blocks
- interconnects
- analog and D/A components

Fault Tolerant Computing
[Figure: a fault event handled at three levels, ranging from universal to very specific.]
- Software-based fault detection & compensation (specific): works only for transient faults!
- HW logic & RT-level detection & compensation (universal): typically works for transient and permanent faults!
- Transistor- and switch-level compensation (very specific): typically works for specific types of transient faults only!

2. Transient Fault Effects

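Storage Nodes and Particles
[Figure: charge Q in fC (logarithmic axis, 1 to 100) on storage nodes versus technology (0.35, 0.25, 0.18, 0.09 µm), compared with the charge generated by an alpha particle.]
A 1 MeV alpha particle generates 42 fC of charge!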

Contribution to Soft-Error Rates
- Static combinational logic: 11 %
- Sequential elements (FFs, latches): 49 %
- Unprotected SRAM: 40 %
Source: S. Mitra, N. Seifert, M. Zhang, Q. Shi, K. S. Kim, "Robust System Design with Built-In Soft Error Resilience", IEEE Computer, Vol. 38, No. 2, Feb. 2005, pp. 43-52

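Spikes and Clock Rates in Logic
[Figure: two timing diagrams of a disturbance pulse against the clock. Source: pulse of 100 ps. In one case charge / status restoration is possible, in the other charge / status restoration is impossible, depending on where the pulse falls relative to the clock.]
Fault probability in digital logic is roughly proportional to clock frequency!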
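The proportionality to clock frequency can be made plausible with a rough back-of-the-envelope model. The sketch below assumes that a pulse arrives at a uniformly random time within a clock period and is captured only if it overlaps a capture window around the clock edge. The function name and the window width are illustrative assumptions, not values from the slides; only the 100 ps pulse width is taken from the figure.

```python
def capture_probability(pulse_width_s, window_s, clock_hz):
    """Rough probability that one pulse overlaps the capture window.

    Assumption: the pulse start time is uniformly distributed over one
    clock period, and exactly one capture window exists per period.
    """
    period_s = 1.0 / clock_hz
    # The pulse overlaps the window if it starts within a span of
    # (pulse width + window width); cap the ratio at 1.
    return min(1.0, (pulse_width_s + window_s) / period_s)


# 100 ps pulse (from the slide), hypothetical 50 ps capture window.
for f in (0.5e9, 1e9, 2e9):
    p = capture_probability(100e-12, 50e-12, f)
    print(f"{f / 1e9:.1f} GHz: {p:.3f}")
```

As long as the clock period is much longer than pulse plus window, doubling the clock frequency doubles the capture probability in this model.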

Logic Structures and Fault Events
[Figure: combinational logic between input FFs and output FFs, hit by particle radiation.]
Flip-flops need fault tolerance / fault hardening in the first place; logic close to the outputs comes next.

Muller-C-Element

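Fault-Tolerant Latch Design
[Figure: input "in" feeds Latch 1 and Latch 2; their outputs outl1 and outl2 drive a Muller C-element that produces "out". The figure also carries the label CL and a clock waveform v(t).]
- If clock is high: outl1 = in, outl2 = in, out = in
- Otherwise: outl1, outl2 latched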

Fault Handling
Muller C-element:
- If both inputs are equal: out = outl1 = outl2
- If the inputs are not equal: out = previous value (from when outl1 and outl2 last agreed)
Under local fault conditions on the latch outputs (one of the 2 latches false), the C-element preserves the output condition from the "charge" phase of the latch. Essentially 3 latches!
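The behaviour of the C-element can be sketched as a small behavioural model. This is a minimal illustration, assuming a non-inverting C-element with two inputs; the class and method names are hypothetical and no timing or electrical effects are modelled.

```python
class MullerCElement:
    """Behavioural model of a 2-input, non-inverting Muller C-element."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, outl1, outl2):
        # Equal inputs are passed on; unequal inputs leave the
        # previously stored output value untouched.
        if outl1 == outl2:
            self.out = outl1
        return self.out


c = MullerCElement()
c.update(1, 1)           # both latches hold 1: out follows
after_upset = c.update(0, 1)   # latch 1 upset: out keeps the old value
print(after_upset)
```

A single upset in one latch therefore never reaches the output; only a disagreement-free pair of latch outputs can change it.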

Intel's Scan Path Element

Intel's Scan Path Element plus Fault Compensation

TMR-Latch / Flip-Flop
[Figure: input "in" feeds FF1, FF2 and FF3, all driven by "clock"; an XOR produces the control signal cout for a MUX that drives Out.]
- Out = L1out with cout = 1
- Out = L2out with cout = 0
- Works with latches or flip-flops
- Can compensate static or dynamic faults in latches / FFs!
- FF1 is untestable (active redundancy)
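The XOR / MUX selection can be sketched as follows. The wiring is an assumption read off the figure labels: the XOR is taken to compare FF2 and FF3, L1out and L2out are taken to be the outputs of FF1 and FF2, and the function name is hypothetical. Under that reading the selector behaves like a majority voter for any single faulty element.

```python
from itertools import product


def tmr_select(ff1, ff2, ff3):
    """XOR/MUX selector (assumed wiring: cout = FF2 xor FF3)."""
    cout = ff2 ^ ff3
    # cout = 0: FF2 and FF3 agree, so FF2 is trusted.
    # cout = 1: FF2 or FF3 is wrong, so FF1 is used instead.
    return ff1 if cout else ff2


# Compare against a plain 2-out-of-3 majority for all 8 states.
for a, b, c in product((0, 1), repeat=3):
    majority = 1 if a + b + c >= 2 else 0
    assert tmr_select(a, b, c) == majority
```

Since FF1 only reaches the output when FF2 and FF3 disagree, a fault in FF1 is masked in normal operation, which matches the remark that FF1 is untestable.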

TMR-Scan-Element

TMR Scan-Element
- Fault tolerant in functional mode
- Fault tolerant in scan mode
- Optional support of test strategies that require a specific sequence of 2 input bits!

Fault tolerant Latches and FFs

Fault Compensation in Combinational Logic
[Figure: combinational logic behind input FFs, hit by particle radiation; each output passes through elements labelled MC and D.]

Fault Compensation in Combinational Logic
[Figure: three waveforms V(t): fault-free signal, signal with glitch, signal with delayed glitch. Annotations: latch close; time left to capture; MC no capture / hold; MC capture.]

3. Repair of Permanent Faults
Compensation of transient faults is not enough. Some technologies for transient compensation can handle permanent faults, too, but not in the long run and not in the presence of additional transient faults!

Memory Test & Repair
[Figure: memory array with lines selected by the line address, columns, read / write lines, and a spare column.]

Memory Test & Repair (2)
[Figure: the same memory array with an added memory BIST controller.]
... is already state-of-the-art!

Logic Self Repair

Granularity of Replacement

Levels of Repair

Replacement in Regular Structures (e.g. for DSP)

Parallel Backup Transistors
[Figure: two schematics between VDD and GND with inputs in1, in2 and output out: a basic gate, and the same gate with redundant transistors in parallel.]

Redundancy by "Active" Parallel Transistors
- Active redundancy is not testable. Therefore there is no way to monitor the status of "available" redundancy in a logic circuit.
- Parallel transistors cannot compensate a fault of the "stuck-on" type (transistor always conducting).
- Faulty "backup" transistors may produce additional faults that cannot be corrected!
- Adding redundancy is not enough; fault isolation is a real problem!
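Why parallel backup transistors cover stuck-off but not stuck-on faults can be illustrated with a simple switch-level model. The sketch assumes a 2-input CMOS NAND gate as the basic gate; the transistor names (p1, n1, backup p1b, n1b, ...) and the result codes "X" (VDD-GND conflict) and "Z" (floating output) are illustrative choices, not taken from the slides.

```python
def conducts(kind, gate, fault=None):
    """Switch-level transistor: n conducts on gate=1, p on gate=0."""
    if fault == "stuck-on":
        return True
    if fault == "stuck-off":
        return False
    return gate == 1 if kind == "n" else gate == 0


def nand2(in1, in2, faults=None, redundant=False):
    """NAND2 with optional parallel backup transistor per position."""
    faults = faults or {}

    def position(kind, name, gate):
        on = conducts(kind, gate, faults.get(name))
        if redundant:  # backup transistor in parallel, same gate signal
            on = on or conducts(kind, gate, faults.get(name + "b"))
        return on

    pull_up = position("p", "p1", in1) or position("p", "p2", in2)
    pull_down = position("n", "n1", in1) and position("n", "n2", in2)
    if pull_up != pull_down:
        return 1 if pull_up else 0
    return "X" if pull_up else "Z"  # conflict or floating output


# Stuck-off n1: fatal without redundancy, masked with a parallel backup.
print(nand2(1, 1, {"n1": "stuck-off"}), nand2(1, 1, {"n1": "stuck-off"}, redundant=True))
# Stuck-on n1: the backup in parallel cannot switch the path off again.
print(nand2(0, 1, {"n1": "stuck-on"}, redundant=True))
```

In this model a stuck-on transistor creates a conducting path that no parallel element can interrupt, which is why the later slides turn to fault isolation.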

Configuration and Fault Isolation
[Figure: gate between VDD and GND with inputs in1, in2 and output out; one transistor is marked with a stuck-on fault.]

The Gate-Short Problem
[Figure: a driver feeding Load 1 and Load 2; one load input has a gate short.]
GND shorts at gate inputs affect the whole fan-in network and make redundancy obsolete!

Gate Turn-off

Schematic Layout with VDD/GND Switches
[Figure: two layouts: gate with parallel redundancy, and gate with parallel redundancy and fault isolation.]

Transistor-Level Overhead (estimates)

Redundancy                 Overhead (cells only)   Stuck-off coverage   Stuck-on coverage   Gate-short coverage   Control lines
Parallel transistors       30-40 %                 yes                  no                  no                    none
VDD / GND switches         60-80 %                 yes                  yes                 no                    one wire
Separate gate poly lines   100-150 %               yes                  yes                 yes                   multiple wires

Duplicate Standard Cells
[Figure: Gate 1 and Gate 2 with inputs in1, in2 and output out; each gate has its own supply (VDD1, VDD2) behind a VDD switch driven by a control signal; both connect to GND.]

Again: Fault Isolation
[Figure: the same duplicate-cell circuit, marking a gate input short and an output VDD / GND short.]

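Administrated Duplicate Cells
[Figure: Gate 1 and Gate 2 share "gate in" and "gate out"; VDD power switches connect VDD to VDD1 / VDD2 and GND switches connect GND1 / GND2 to GND, controlled by Act 1 and Act 2; a gate short is marked. The 0 / 1 / X switch-state annotations are not recoverable from the extraction.]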

Features
- Uses "normal" cell designs.
- Four states of operation:
  Config. 1: Gate 1 active, Gate 2 isolated
  Config. 2: Gate 2 active, Gate 1 isolated
  Config. 3: Both gates active, operating in parallel
  Config. 4: Both gates isolated from VDD / GND
- Operations like "high / low power" are possible.
- Cells can be put to temporary "sleep" for stress relief.
- Permanent repair functions.
- The active cell output is connected only to "floating" outputs of the other cell.
- If twin tubs are used and cell-internal tubs are also disconnected, gate input / GND shorts are inhibited.
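The four configurations can be written down as a small state table. The sketch below is only an illustration: the enum, the tuple encoding (gate 1 powered, gate 2 powered) and the selection policy in choose_config are hypothetical and not part of the slides, which do not describe how a configuration is chosen.

```python
from enum import Enum


class CellConfig(Enum):
    # Value: (gate 1 connected to VDD/GND, gate 2 connected to VDD/GND)
    GATE1_ACTIVE = (True, False)    # Config. 1
    GATE2_ACTIVE = (False, True)    # Config. 2
    BOTH_ACTIVE = (True, True)      # Config. 3, parallel operation
    BOTH_ISOLATED = (False, False)  # Config. 4, e.g. temporary sleep


def choose_config(gate1_ok, gate2_ok, high_power=False):
    """Hypothetical policy: pick a configuration from test results."""
    if gate1_ok and gate2_ok:
        return CellConfig.BOTH_ACTIVE if high_power else CellConfig.GATE1_ACTIVE
    if gate1_ok:
        return CellConfig.GATE1_ACTIVE
    if gate2_ok:
        return CellConfig.GATE2_ACTIVE
    # Redundancy exhausted: isolate both gates from the supply.
    return CellConfig.BOTH_ISOLATED


print(choose_config(gate1_ok=False, gate2_ok=True).name)
```

Once both gates are faulty the cell can only be isolated, which is the point at which a further level of redundancy would be needed.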

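Bistable Switching Cell
[Figure: Gate 1 and Gate 2 between VDD and GND switches driven by a bistable element (Act), with output separation; the switches are annotated with complementary 0 / 1 states.]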

Cell Duplication and Power Switch
- Possible for all types of cells (also flip-flops).
- Granularity of partitioning for replacements (single gates, blocks) can be selected on demand.
- Combines well with dynamic circuit optimization.
- Good coverage potential for transistor faults.
- Significant overhead (above 100 %), but most likely below Triple Modular Redundancy (TMR).
- Redundancy may become exhausted and then requires a further level of redundancy!

Gate Replacement
[Figure: row of standard cells (gates) with a gate fault and a backup cell; insertion of the replacement cell.]

Regular Logic Wiring
[Figure: logic gates with drive and feed wiring, a backup cell, a config block, and links to the next cell.]

Faults on Irregular Interconnects
[Figure: routing tree from signal source S to four sinks C, with a single fault (line break).]

Redundant Wiring
[Figure: routing tree with loops from signal source S to four sinks C; an extra wire covers a single fault (line break).]
... plus double vias!
Problem: classic delay calculation works well on trees only!
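The effect of an extra wire on a single line break can be checked with a simple reachability test on the routing graph. The topology below (source S, branch points a and b, sinks C1 to C4) is a made-up example, not the net from the figure, and the function name is hypothetical. It only checks connectivity and says nothing about the delay problem named above.

```python
from collections import deque


def reachable_sinks(edges, source, sinks, broken=None):
    """Return the sinks still connected to the source after a line break."""
    adjacency = {}
    for a, b in edges:
        if broken is not None and {a, b} == set(broken):
            continue  # this wire segment is interrupted
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {source}
    queue = deque([source])
    while queue:  # breadth-first search from the signal source
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return {s for s in sinks if s in seen}


sinks = {"C1", "C2", "C3", "C4"}
tree = [("S", "a"), ("a", "C1"), ("a", "C4"), ("a", "b"), ("b", "C2"), ("b", "C3")]
looped = tree + [("C4", "C3")]  # one extra wire closes a loop

print(sorted(reachable_sinks(tree, "S", sinks, broken=("a", "b"))))
print(sorted(reachable_sinks(looped, "S", sinks, broken=("a", "b"))))
```

In the pure tree the break cuts off every sink behind it, while the looped net still reaches all four sinks over the extra wire.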

4. Bus Structures and "Networks on Chip" (NoCs)
Technology forecasts predict that nano-wires may become the most vulnerable and unreliable circuit elements ...
