
SDRAM Error Modes— Characterization, Rate Calculation and Mitigation



  1. SDRAM Error Modes— Characterization, Rate Calculation and Mitigation –or– Why Am I Doing This to Myself? Presented by Ray Ladbury at MAPLD, JHU/APL, Laurel, MD September 10-12, 2002

  2. The Radiation Sins of the SDRAM • Single-Event Latchup (SEL) • SEL seen for most devices • Cannot predict susceptibility across revisions, let alone generations • May be getting worse as device sizes shrink • Single-event functional interrupt (SEFI) • Increasingly complicated as device complexity increases • Power cycle required for recovery, so often results in loss of all data on chip. • Some SEFI are high current and can damage the IC • Single-Event Upset (SEU) • Smaller cell sizes mean lower single-bit rates, but higher multi-bit rates • Stuck bits • To date, mainly a problem at EOL, but for the next generation…? • Lot-to-lot variations in SEE rates and TID hardness can be >10x • IF you can even define a lot!!

  3. So why am I doing this to myself? • Greater Density • SDRAMs provide memory densities 8-32x those of commercial SRAMs and 64-256x those of rad hard SRAMs • Need to use Commercial Software • Commercial solutions are often cheaper than dedicated development, but... • Commercial software developers assume adding memory is cheaper than adding processing power • Higher Speed • Lower Power (on a per bit basis, anyway) • Rapid development of commercial memories offers the possibility of organic growth to meet future system needs • That is, Moore’s law actually still applies in the commercial world!!!

  4. The Trouble with SDRAMs I: Economics • Market Forces Can Drive Radiation Performance • Tech slowdown is not news to memory manufacturers • Memory manufacturers have been under pressure since before 1998. • Most companies now have negative profit and operating margins. • During PC crunch memory prices decreased 85% • Result is pressure to innovate and cut costs--at all costs • Short product cycles: If you find a part you like, grab it quickly! • Little attention to process parameters that don’t affect commercial performance, i.e. radiation response • Lack of stability in fabrication process • Important: SEL design rules result in a penalty for efficiency and density • Note: Some memory manufacturers have begun to pay attention to the effects of α particles and sometimes neutrons. This may mean future memories exhibit somewhat greater radiation hardness.

  5. The Trouble with SDRAMs II: Technology • Memory Cells • Not bistable--must be continually refreshed to preserve data • leads to asymmetric upset susceptibilities: 0→1 ≠ 1→0 • Susceptibility may change for different modes of operation (Refresh, Read, Write…) • Access FET is vulnerable element • Very small dimensions • small vulnerable area for upset, but • vulnerable to multi-bit upsets • Logical organization may not correspond to physical layout • Some memories store data as ‘0’ = charged, ‘1’ = uncharged; some use opposite scheme; some are mixed!!

  6. Trouble with SDRAMs II: Technology (Cont.) • (Figure: Architecture of a 256-Mbit SDRAM) • Control Logic is implemented in deep submicron bulk CMOS • High SEL risk • To date: all 256 Mbit SDRAMs exhibit SEL or high-current SEFI • a few 64 Mbit parts do not latch up • small vulnerable areas, but low LET thresholds mean moderately high SEFI rates • Complex architecture • results in complex SEFI behavior of the device • Multibit upsets may also result from control logic upsets • Single and Multi-bit SEU probabilities may be mode dependent

  7. Error Modes and Effects: SEL • SEL--potentially destructive condition seen in parasitic bipolar elements of CMOS (top right) caused by passage of a single ion. • In DRAMs, only control logic is vulnerable • No consistent trends across generations even for the same manufacturer • If not destructive, may still result in latent damage that compromises reliability (see lower right) • Recovery from nondestructive SEL results in loss of all data stored on the memory • SEL protection circuitry can itself upset, resulting in spurious SEL indications and loss of data. • Mitigation is limited • SEL detection and protection doesn’t always work and can be very disruptive • Redundancy can mitigate against loss of capability if system is flexible enough. See Becker et al., to be published in Dec. 2002 IEEE TNS.

  8. Error Modes and Effects: SEU • SEU--flip of a single bit or multiple bits caused by passage of a single ion through sensitive volume(s). • Effect--depends on application; may range from negligible degradation of data integrity and reliability to catastrophic failure. • Rate and susceptibility to single and multi-bit upsets depend on memory organization, mode and even memory contents. • Multi-bit upsets can result from two causes: • upsets in control logic affecting multiple bits, or • if logically adjacent bits are also physically adjacent--from a single ion passing through multiple adjacent bits. • Mitigation measures can include: • Error Detection and Correction (EDAC) --effectiveness depends on number of bits in error that can be detected or corrected. Effectiveness may also depend on application conditions--even on memory contents. • Memory organization--effective against multi-bit upsets • Triplicate and Vote--depends on hardness of voting circuitry; can be very effective, but negates density advantage of SDRAM
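
The dependence of EDAC effectiveness on the number of bits in error can be made concrete with a small sketch. The Python below implements an extended Hamming (13,8) single-error-correct, double-error-detect code for one 8-bit word. The bit layout, the function names (`encode`, `decode`, `flip`) and the status strings are illustrative assumptions, not anything specified in the presentation.

```python
# Sketch of an extended Hamming (13,8) SEC-DED code for one 8-bit word.
# The bit layout and all names are illustrative assumptions.

DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]  # codeword positions holding data bits
PARITY_POS = [1, 2, 4, 8]               # Hamming parity positions
# Position 0 holds an overall parity bit covering the whole codeword.


def encode(byte):
    """Encode an 8-bit value as a list of 13 bits."""
    code = [0] * 13
    for i, pos in enumerate(DATA_POS):
        code[pos] = (byte >> i) & 1
    for p in PARITY_POS:
        # Parity bit p covers every position whose index has bit p set.
        code[p] = sum(code[k] for k in range(1, 13) if k & p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions upset."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


def decode(code):
    """Return (byte, status); byte is None when the word is rejected."""
    syndrome = 0
    for k in range(1, 13):
        if code[k]:
            syndrome ^= k
    odd_parity = sum(code) % 2 == 1  # an odd number of bits has flipped
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity and syndrome <= 12:
        # Treated as a single upset; syndrome 0 points at the overall parity bit.
        code = flip(code, syndrome)
        status = "corrected"
    else:
        return None, "rejected"
    byte = sum(code[pos] << i for i, pos in enumerate(DATA_POS))
    return byte, status
```

With this layout a single upset is corrected and a double upset is rejected, while some triple upsets look like a single upset and are "corrected" into a wrong value that the system accepts. For example, `decode(flip(encode(0xA5), 3, 5, 6))` returns a value other than `0xA5` with status `"corrected"`.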

  9. SDRAM Scaling and Rate Calculation • Simplicity of DRAM memory cells means they are among the smallest of electronic devices, with physical cross sections < 1 µm². • Should be among the first to exhibit new phenomena and scaling effects. • Charge collection may be one important example • Charge collected by “drift” in the high-field depletion region • Outside depletion region, charge collected by diffusion

  10. SDRAM Scaling and Rate Calculation • High SDRAM densities also increase the importance of diffusion for both single- and multi-bit upsets (note increase of multi-bit upset cross section relative to single-bit cross section at high vs. low LET for DRAM in the accompanying plot). • Current rate calculation tools largely ignore diffusion. • Future devices may require new tools to calculate both single- and multi-bit rates • Again, trend is expected to worsen as device sizes shrink.

  11. Error Modes and Effects: SEFI • Single-Event Functional Interrupt--any interruption of the normal memory operations resulting from an ion strike to a vulnerable element--usually control logic. • Any function of memory control logic--even those with no external access--can be affected by SEFI • Loss of functionality can be localized, partial or total, depending on hit location • Recovery usually involves power cycling the chip with consequent loss of some or all data on the chip. • Some SEFIs involve high current states that could cause permanent damage. • Mitigation is difficult • Memory Organization + EDAC (see next chart) • Triplicate and Vote • Rapid detection and recovery • All mitigation of SEFIs carries penalties for performance and size of memory required

  12. SEFI Mitigation • (Figure: memory chips labeled Nibble 1 through Nibble 8, Spare, and EDAC ...) • Storing a single nibble on each of several memory chips makes it feasible to use single- or multi-nibble EDAC to correct even SEFIs resulting in the loss of all data on a single chip. • NOTE: In this configuration, destructive failure of one chip negates single-nibble EDAC; two failures render the memory inoperable. This may be mitigated by the ability to swap spare chips in as primaries fail. Similar considerations apply to triplicate and vote strategies.
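
The benefit of storing one nibble per chip can be sketched as follows. This is not the single-nibble EDAC the slide refers to: the sketch swaps in a much simpler XOR check nibble, which can only rebuild a lost nibble when the failed chip is already known (for example, flagged by SEFI detection). The chip count, the extra check chip, and the names `store` and `recover` are assumptions for illustration.

```python
from functools import reduce

# Assumed layout: a 32-bit word striped as 8 nibbles, one per data chip,
# plus one hypothetical check chip holding the XOR of the 8 data nibbles.
N_DATA_CHIPS = 8


def split_nibbles(word):
    """Split a 32-bit word into 8 nibbles, least significant first."""
    return [(word >> (4 * i)) & 0xF for i in range(N_DATA_CHIPS)]


def join_nibbles(nibbles):
    """Reassemble a 32-bit word from its 8 data nibbles."""
    return sum(n << (4 * i) for i, n in enumerate(nibbles))


def store(word):
    """Return the stripe written across the chips: 8 data nibbles + check."""
    nibbles = split_nibbles(word)
    check = reduce(lambda a, b: a ^ b, nibbles)
    return nibbles + [check]  # index 8 is the check chip


def recover(stripe, failed_chip):
    """Rebuild the word when one known chip has lost its contents."""
    others = [n for i, n in enumerate(stripe) if i != failed_chip]
    rebuilt = reduce(lambda a, b: a ^ b, others)
    repaired = list(stripe)
    repaired[failed_chip] = rebuilt
    return join_nibbles(repaired[:N_DATA_CHIPS])
```

Because each chip contributes only one nibble to any word, a SEFI that wipes a whole chip costs each word a single nibble, which is exactly the error pattern a nibble-oriented code can handle. Losing a second chip leaves two unknown nibbles per word, which this simple scheme cannot rebuild, matching the slide's note that two failures render the memory inoperable.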

  13. Stuck Bits--Showstopper of the Future?? • Stuck Bits--localized TID failures of access FETs for individual bits. • To date: only an issue at end of life, after part has sustained TID damage. • New devices with smaller access FETs may be more vulnerable. • LET threshold and TID tolerance could both be lower. • Bits may become leaky before they fail, requiring increased refresh rate. • Note: Higher refresh rate may change SEU vulnerabilities. • Failures will tend to be randomly distributed throughout memory • Mitigation • Annealing may automatically bring about partial recovery over time. • Implement ability to map around bad bits/words/sectors • EDAC--note that stuck bits decrease EDAC effectiveness: Hamming code becomes like a parity check. • All active mitigation significantly increases system complexity.

  14. Mitigation and Dominant Error Modes • With no error mitigation, the single-bit SEU rate dominates the system SEU rate. • Introduction of Hamming Code type EDAC results in significant improvement, but multi-bit SEUs now dominate • Note I: In current generation devices multi-bit upsets can occur at ~10% of the rate of single-bit upsets. This number will rise as device sizes shrink. • Note II: This depends on the physical-to-logical mapping of the SDRAM. If logically adjacent bits are never physically adjacent, multi-bit rate is low. • Multi-bit/nibble EDAC results in improvement of several orders of magnitude. SEFIs will typically dominate. • Mitigation against SEFIs again gives significant improvement. System error rate now dominated by multiple independent, concurrent error/failure conditions. • Important: Implementation of mitigation techniques must consider device and application characteristics.

  15. Mitigation: Unintended Effects • Mitigation measures can adversely affect system performance • SEL detection and protection circuitry can give rise to spurious SEL indications, resulting in data loss • SEFI mitigation by memory organization can result in greater vulnerability of the entire memory array to failure of a single chip. • Stuck bit mitigation by increased refresh rate may change the upset vulnerabilities of the chip. • Proper implementation of mitigation measures requires knowledge of device characteristics: • physical-to-logical bit mapping, failure modes (especially SEL), types of SEFIs, probabilities of different error modes • “Preemptive” implementation without this knowledge could actually make things worse. • Testing (e.g. SEU or laser testing, DPA) is usually the only way to gather this information, even with vendor assistance.

  16. Example: Multi-bit Upsets and Hamming Code • This example is intended to be purely illustrative. • Multi-bit Upsets may occur in two ways: 1) Upsets to control logic, especially during Refresh or WRITE operations 2) Single ion strike through multiple physically (and logically) adjacent bits • Asymmetry of upset characteristics means 2nd mechanism depends on memory contents as well as mode; 1st depends only on mode • Hamming type EDAC: 1 upset bit--corrected and accepted; 2 upset bits--upset is detected and value rejected; 3 or more--not detected, value accepted. • Assume: • System uses 8 bit words • “bit=1” = uncharged storage element; “bit=0” = charged storage element • Upsets 0→1 occur via mechanisms 1 and 2; 1→0 by mechanism 2 only • Probability of double upset ∝ number of adjacent ‘0’s in the word • Use of Hamming code can introduce a systematic error. • Unless the operating modes of the memory are known in detail, the error is truly systematic and not correctable
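
The content dependence in this example can be sketched in a few lines of Python. The sketch reads "number of adjacent '0's" as the number of adjacent pairs of 0 bits in the 8-bit word, and assumes that logically adjacent bits are also physically adjacent; both are interpretations made here for illustration, as is the function name `adjacent_zero_pairs`.

```python
def adjacent_zero_pairs(value, width=8):
    """Count adjacent bit pairs in which both bits are 0 (charged cells).

    Under the slide's assumption, the chance that a single ion produces a
    double upset, and hence a rejected word, scales with this count.
    """
    bits = [(value >> i) & 1 for i in range(width)]
    return sum(1 for a, b in zip(bits, bits[1:]) if a == 0 and b == 0)


# Relative double-upset (rejection) weight for every possible 8-bit value.
REJECTION_WEIGHT = {v: adjacent_zero_pairs(v) for v in range(256)}

# Values with the largest and smallest weights.
most_exposed = max(REJECTION_WEIGHT, key=REJECTION_WEIGHT.get)
least_exposed = [v for v, w in REJECTION_WEIGHT.items() if w == 0]
```

Under these assumptions the word `0x00` has seven adjacent zero pairs while `0xFF` and `0x55` have none, so the Hamming code rejects some stored values more often than others. That differential acceptance is the source of the systematic error.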

  17. Example (continued) • Even if errors occur with a moderately low bit-error rate, asymmetry of upset characteristics may introduce systematic error.

  18. Effect on Small Signal + Large Background • Differential acceptance of EDAC can introduce a systematic “noise” into measurements. This noise can obscure small signals, particularly if they rest on large backgrounds. Also, because error rates are mode dependent, the systematic bias is effectively uncorrectable.
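
A rough sketch of how differential acceptance shifts a measured quantity: if each stored 8-bit value is rejected with a probability proportional to its number of adjacent 0 pairs, the mean of the accepted values no longer equals the mean of what was written. The per-pair rejection probability `p_pair`, the uniform input distribution and the function names are hypothetical choices for illustration, not figures from the presentation.

```python
def adjacent_zero_pairs(value, width=8):
    """Count adjacent bit pairs in which both bits are 0 (charged cells)."""
    bits = [(value >> i) & 1 for i in range(width)]
    return sum(1 for a, b in zip(bits, bits[1:]) if a == 0 and b == 0)


def accepted_mean(values, p_pair):
    """Expected mean of the values that survive EDAC rejection.

    p_pair is a hypothetical probability that a given adjacent 0 pair
    suffers a double upset (and the word is rejected) before readout.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for v in values:
        accept = 1.0 - p_pair * adjacent_zero_pairs(v)
        weighted_sum += v * accept
        total_weight += accept
    return weighted_sum / total_weight


true_mean = accepted_mean(range(256), 0.0)     # no rejections
biased_mean = accepted_mean(range(256), 0.01)  # assumed 1% per pair
bias = biased_mean - true_mean
```

For values drawn uniformly from 0 to 255, the accepted mean comes out above the true mean of 127.5, because words with many adjacent zeros tend to be numerically small and are rejected more often. Since `p_pair` would in practice vary with the operating mode of the memory, the size of the shift is not known in advance, which is why the bias is hard to correct.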

  19. Discussion • Several factors make it very difficult to correct for the errors introduced by the differential acceptance across values • Upset rates and (more important) the asymmetry in upset rates (0→1 vs. 1→0) depend on the mode history of the memory device. • Details of SDRAM operation are typically proprietary. • In effect, improper use of Hamming Code introduces an uncorrectable systematic bias. • Effect of triple- and higher-bit upsets under assessment • Not detectable by Hamming Code, so would be accepted • Expected to exhibit a similar differential bias • Also dependent on mode and charge collection characteristics • Could contribute at a level of a few percent of the per-bit error rate • Multibit upset rates will worsen as device sizes shrink. • Different memory organization or mixture of representations of ‘0’ and ‘1’ will change the outcome drastically

  20. Conclusions • The commercial nature of SDRAMs poses inherent challenges • Market forces impose rapid pace of development and process instability • SDRAM architecture and technology impose radiation vulnerabilities. • Nevertheless, SDRAM densities are strong motivation to develop mitigation strategies • Lower weight, power, faster development pace… • Successful mitigation necessitates consideration of error mode characteristics • SEL susceptibility, upset asymmetries, mode dependence, etc. • Failure to do so can introduce systematic biases--or even more serious, system-level failure modes--into the resulting data set. • The SEU/Hamming Code example given is not unique: similar examples are possible for other error modes (SEL, SEFIs, stuck bits) and other mitigation strategies. • Testing is the only way to identify and measure these characteristics. • Future commercial memories promise new issues • Greater stuck bit vulnerability; diffusion as dominant charge collection process...

  21. Something to Think About • Question for future consideration: • Could the same characteristics that introduce such unintended consequences be used to optimize mitigation techniques and implementation?
