
1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques



Presentation Transcript


  1. 1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME

  2. Outline • SEU Basic Facts • Special technologies for SEU protection • Mitigation techniques • Circuit Techniques • In logic • In registers • In RAMS • Logic (Redundancy) Techniques • Coding techniques • Error detection only techniques • Conclusions

  3. Significant also in industry • “Terrestrial cosmic rays and soft errors”, Vol. 40, No. 1, 1996 • “Soft Errors in Circuits and Systems”, Vol. 52, No. 3, 2008

  4. SEU errors in “analog” circuitry • We live in a (mostly) digital world: • (Occasional) errors in analog circuitry will be ignored or will be fixed at the digital level • Particle strike at sensing elements: • Happens all the time at particle detectors • System should be designed to cope with a single wrong measurement • Can happen easily in photo-receivers • Particle strikes at critical nodes • Biasing nodes • Self-recovery • Hits at high current nodes are probably going to remain unobserved • DAC registers • Not self-recovering, but detectable digitally • Oscillator circuits and PLLs: • Recovery could take ms, but should eventually occur • May require training or synchronization sequences to be sent • Can cause long sequences of errors in applications such as self-clocking serial streams

  5. SEU Basics

  6. SEU: where does it occur? [Figure] From Darracq et al., IEEE Trans. on Nuclear Science, Vol. 49, No. 3, June 2002

  7. Are all particles equally “dangerous” for SEU? • Energy loss (dE/dx) for protons in Si • Bethe-Bloch energy loss equation; for reference see: http://pdg.lbl.gov/2008/reviews/rpp2008-rev-passage-particles-matter.pdf

  8. When and where should we care? “I have this particular component in my system, should I be worried about SEU?”

  9. SEU: Impact on components (*) Both user and configuration logic are sensitive (**) Only user logic is sensitive

  10. SEU in a circuit • SEU can occur in several places in a circuit: • In a storage node (Register, Latch or RAM) • Along a logic path (needs to be synchronized with clock sampling to be relevant) • On a clock line (rather bad!) • On a global line such as Reset (catastrophic!) • Different techniques are necessary to protect from these different events • No one-size-fits-all solution!

  11. Device Techniques

  12. Device level SEU protection: SOI [Figure: device cross-section showing STI, oxide, well and substrate with generated +/- charge pairs. WARNING: Drawing not to scale!] The majority of commercial ICs are fabricated in bulk technologies. Charge can be collected from several microns of silicon under a device. In thin-film SOI, the active silicon layer can be very thin (< 300 nm), therefore little free charge can be produced.

  13. SOI and SEU [Figure panels: Bulk SRAM - A, SOI SRAM 1, Bulk SRAM - B, SOI SRAM 2] From J. Doff, TNS, 8/2007

  14. SOI based ASIC design • SOI could be considered for specific and very demanding custom designs, but: • Requires special technology (few vendors) • Has virtually no library support • Has little if any IP available • Requires high volume • Price: Expensive to very expensive, no second source • What about the other chips in your system? • Still, it is used in space and military applications

  15. Circuit Techniques

  16. Single Event Upset in logic [Figure: logic gates with inputs A and B, output Y, and clock CLK] • If the spike is longer than the typical gate delay, it will propagate down the logic path and possibly be sampled in the next FF • This used to be a very rare event in logic up to the 0.25 µm generation • Unfortunately it is common in 130, 90 and 65 nm (which means in most commercial chips today)

  17. Protection against SEU in logic [Figure labels: Register; Regular (fast) gates; Slow gates (filter glitches)] ... or double sample at the register

  18. Circuit level mitigation techniques [Figure: four latch schematics, each with Din, CK and CK*: Normal Latch, Strong Feedback Latch, Extra Cap Latch, Large Size Latch]

  19. Special topology D-FF cell • SEU robust FF: DICE cell • From Calin et al., IEEE TNS, Dec 1996

  20. Single Event Upset in SRAM [Figure: SRAM cell with bit lines BL and BL*, word line WL, and stored values 1 and 0] Sensitive nodes are the drains of off-state transistors

  21. Circuit level protection From Canaris, Whitaker: “Circuit Techniques for the Radiation Environment of Space”, IEEE 1995 Custom Integrated Circuits Conference

  22. Remarks about SEU in RAMs • In today’s technologies, cells are so small (< 1 µm²) that a single ion can hit two or more locations at once; multiple SEUs are common. • Single-bit EDAC is likely not sufficient! • While it is true that most of the memory area is covered by the matrix of cells, hits in other areas (decoder, sense-amp), though rare, can be even more catastrophic

  23. A 65 nm 2-Billion Transistor Itanium

  24. More on SER…

  25. Logic Techniques

  26. Redundancy • Redundancy is actually a coding technique, technically a simple “repetition” code, where the information is duplicated or triplicated and checked at convenient boundaries • Redundancy is well suited to control blocks • Data paths are better protected by other techniques, such as parity etc.

  27. Repetition Code • Take each symbol si in S and repeat it n times. This is an (n, 1) code. • For example, the word {s1 s2 s3} becomes the codeword {s1 s1 s1 s2 s2 s2 s3 s3 s3} • Efficiency (= rate) of the code is: 1/n • The minimum distance (see later) is n, and the number of errors t that can be corrected is: t = ⌊½ (n – 1)⌋
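
A minimal Python sketch of the (n, 1) repetition code on this slide, assuming the decoder is a simple majority vote over each group of n received symbols. The function names and the corrupted symbol "s9" are illustrative and do not come from the slides.

```python
from collections import Counter

def rep_encode(symbols, n):
    """Repeat every symbol n times: an (n, 1) repetition code."""
    return [s for s in symbols for _ in range(n)]

def rep_decode(codeword, n):
    """Majority-vote each group of n received symbols."""
    groups = [codeword[i:i + n] for i in range(0, len(codeword), n)]
    return [Counter(g).most_common(1)[0][0] for g in groups]

def correctable_errors(n):
    """Errors per group that majority voting is guaranteed to fix."""
    return (n - 1) // 2

word = ["s1", "s2", "s3"]
code = rep_encode(word, 3)       # s1 s1 s1 s2 s2 s2 s3 s3 s3
corrupted = list(code)
corrupted[4] = "s9"              # one upset inside the second group
decoded = rep_decode(corrupted, 3)
```

With n = 3 the single upset is voted away and `decoded` equals the original word; two upsets in the same group would outvote the correct symbol.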

  28. Triple redundancy (Triple Module Redundancy) • Three copies of same user logic + state_register • Voting logic decides 2 out of 3 (majority) • Used regularly in: • High reliability electronics • Mainframes • Problems: • 300% area and power • Corrects only 1 error • Can go very wrong with two errors • Problem: How do you make sure that the voting logic itself is not affected by SEU? [Figure: Input feeds FSM1, FSM2 and FSM3, clocked by CLK; voting logic produces Output. Logic for Voting is built from the pairs A B, A C and B C]
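
The 2-out-of-3 vote can be sketched in Python as the bitwise majority function (A AND B) OR (A AND C) OR (B AND C), which matches the A B, A C, B C pairs named in the figure. The 4-bit state value and the upset patterns are made-up test data.

```python
def majority(a, b, c):
    """Bitwise 2-out-of-3 vote: (A AND B) OR (A AND C) OR (B AND C)."""
    return (a & b) | (a & c) | (b & c)

good = 0b1011                    # hypothetical state held by all three copies
upset = 0b0100                   # bit flipped by a particle strike

# One corrupted copy is outvoted by the two good ones.
one_upset = majority(good, good ^ upset, good)

# The same bit flipped in two copies wins the vote: the output is wrong.
two_upsets = majority(good ^ upset, good ^ upset, good)
```

The second case is the "can go very wrong with two errors" situation: the voter has no way to tell that the majority is the corrupted value.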

  29. Example of triplicated design • Gigabit Optical Link (GOL), CERN design • 0.8 and 1.60 Gb/s optical link • Unidirectional • < 300 mW • G-Link and Gigabit Ethernet protocol • Redundant logic • More than 20,000 units in ATLAS, CMS, LHCb and ALICE • http://proj-gol.web.cern.ch/proj-gol/

  30. Double redundancy (Reduced Module Redundancy) • Two copies of same user logic + state_register • Comparison logic decides if outputs are unequal • If mismatch: Report to system • Problems: • 200% area and power • Can’t be used in “real-time”, but may be sufficient for many applications [Figure: Input feeds FSM1 and FSM2, clocked by CLK; FSM output goes to Output; comparison logic raises Reset Request]
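
A small Python sketch of the duplicate-and-compare idea, assuming a hypothetical 4-bit counter as the user logic (`step` is not from the slides). The comparison can only report a mismatch; it cannot say which copy is right, which is why the slide pairs it with a reset request instead of a correction.

```python
def step(state, inp):
    """Hypothetical user logic: 4-bit counter that adds inp each clock."""
    return (state + inp) & 0xF

def dmr_cycle(state1, state2, inp):
    """Clock both copies once and compare their next states."""
    next1 = step(state1, inp)
    next2 = step(state2, inp)
    reset_request = next1 != next2   # mismatch: report to the system
    return next1, next2, reset_request

# Both copies agree: no request.
clean = dmr_cycle(3, 3, 1)

# An upset flips the top bit of the second state register.
upset = dmr_cycle(3, 3 ^ 0b1000, 1)
```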

  31. What to duplicate? [Figures: two duplication schemes built from Input, Logic, Reg, Comparison logic and Output blocks; the drawings were not preserved] • Use one scheme if clock frequency is low and technology is “old”. • Use the other if clock frequency is high and technology is “advanced”.

  32. FSM general structure [Figures: two FSM structures built from Input, Logic, Reg, Comparison logic and Output blocks; the drawings were not preserved] • Not this. • Do this!

  33. Redundancy in time (Temporal Redundancy) • Single user logic block and two state_registers • Two clocks (F1 and F2) • Comparison logic decides if outputs are unequal at completion of F2 • If error: Compute again • Problems: • Needs time for 3 evaluations (… not really, three transient time constants are enough) • No problem at 40 MHz and “modern” technology • Needs multi-phase clock [Figure: Input feeds Logic; Reg1 samples on CLK1 and Reg2 on CLK2; comparison logic raises Re-evaluate Request; Output]
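
A toy Python model of sampling the same logic output at two different instants, assuming the transient simply inverts the output for a fixed window. Times are in arbitrary units, and all names and numbers are illustrative, not taken from the slides.

```python
def logic_output(t, value, glitch_start=None, glitch_len=0):
    """Combinational output at time t; inverted while a transient is active."""
    if glitch_start is not None and glitch_start <= t < glitch_start + glitch_len:
        return value ^ 1
    return value

def sample_twice(value, t1, t2, glitch_start=None, glitch_len=0):
    """Reg1 samples at t1 (CLK1), Reg2 at t2 (CLK2); then compare."""
    reg1 = logic_output(t1, value, glitch_start, glitch_len)
    reg2 = logic_output(t2, value, glitch_start, glitch_len)
    re_evaluate = reg1 != reg2
    return reg1, reg2, re_evaluate

no_glitch = sample_twice(1, t1=10, t2=14)
short_glitch = sample_twice(1, t1=10, t2=14, glitch_start=9, glitch_len=2)
long_glitch = sample_twice(1, t1=10, t2=14, glitch_start=9, glitch_len=6)
```

In this model the short transient corrupts only Reg1, so the comparison requests a re-evaluation. The long one covers both sampling instants and goes unnoticed, which is why the spacing between the two clocks has to exceed the transient duration.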

  34. Memory Boundary Redundancy • Check for consistency only when results will be committed to memory: • For instance when two computers/microcontrollers perform a STORE operation • Advantages: • Processors can be “standard” • Write operations are relatively rare and therefore requirements on comparison resources are small • Fewer resources needed for checking • Used in some mainframes with triple redundancy • Problem: if you detect an error in a processor, how do you resync it? [Figure: uP 1 and uP 2 feed comparison logic in front of the Shared Memory; Error output]

  35. I/O Boundary Redundancy • Check for consistency only when results will be used by external devices: • For instance when two computers/microcontrollers want to commit results to disk • Advantages: • Synchronization is less of a problem • Fewer resources needed for checking • In some cases it could even be done in software • uP architectures and/or hardware could even be different • Used in high-reliability computer boxes and avionics [Figure: two uP blocks, each with its own memory and I/O interface, feed comparison logic in front of the I/O device; Re-evaluate Request; I/O CLK]

  36. Mission critical redundancy • Various computer configurations used during a Shuttle mission. From: NASA Shuttle documentation

  37. Redundancy in avionics From: IEEE Aerospace & Electronic Systems Magazine, October 2000

  38. Coding Techniques

  39. Hamming Coding • “Two weekends in a row I came in and found that all my stuff had been dumped and nothing was done. I was really aroused and annoyed and I wanted those answers and two weekends had been lost. And so I said, ‘Damn it, if the machine can detect an error, why can’t it locate the position of the error and correct it?’” From an interview with R. Hamming, February 3-4, 1977, quoted in T. Thompson, p. 17 • “The purpose of this memorandum is to give some practical codes which may detect and correct all errors of a given probability of occurrence, and which detect errors of even a rarer occurrence”. From R. Hamming, ‘Self-Correcting Codes – Case 20878’, Memorandum 1130-RWH-MFW, Bell Telephone Laboratories, July 27, 1947

  40. Coding for memory repair

  41. Mitigating SEU: Forward Error Correction [Figure: Transmitter with data D, f(D) and TP; Receiver with R, f(R) and RP; comparison (=?) yields OK/NotOK] • Examples of FEC: • Simple Parity (actually only error detection) • EDC: Hamming coding • Single error correction capability, popular in computer DRAM • BCH • Sophisticated multiple bit error detection and correction; requires complex logic • Reed-Solomon • Sophisticated and efficient multi-word error detection and correction; requires complex logic

  42. Mitigating SEU: FEC (2) [Figure: Receiver with R, f(R) and RP; comparison (=?) yields OK/NotOK; f⁻¹(R) recovers D = R] The “parity” function must be such that, if an error is detected, one can also use it to recover the right data!

  43. Families of Error Control Methods • Block Codes: codeword built only on current message-word • Non-block codes: codeword depends on the current message word and on some past words, e.g.: • Convolutional, used (obviously) in streaming channels • Examples of codes: • Hamming • Bose-Chaudhuri-Hocquenghem (BCH) • Golay • Reed-Solomon (RS) • Reed-Muller • Low Density Parity Check Codes • Turbo Codes • …

  44. Parity • In B = {0,1}, start with a message word: S = {s1 s2 s3 s4 s5 s6 s7} • Compute a “Parity” character s8 defined as: $s_8 = s_1 \oplus s_2 \oplus s_3 \oplus s_4 \oplus s_5 \oplus s_6 \oplus s_7$, where $\oplus$ is the exclusive-OR (or the sum mod 2). • Parity check can detect all single errors (but cannot give the position) • Parity check cannot detect double (or any even number of) errors • Used: • often in computer memories • in serial terminal data transmission
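
A short Python sketch of the even-parity scheme on this slide: s8 is the XOR of the seven message bits, and a received word passes the check when the XOR over all eight bits is 0. The message bits and the flipped positions are arbitrary examples.

```python
from functools import reduce
from operator import xor

def parity_bit(bits):
    """XOR (sum mod 2) of all bits."""
    return reduce(xor, bits, 0)

def check(word):
    """True if the 8-bit word (7 data bits + s8) has even overall parity."""
    return parity_bit(word) == 0

msg = [1, 0, 1, 1, 0, 0, 1]          # s1 .. s7
word = msg + [parity_bit(msg)]       # append s8

single = list(word)
single[2] ^= 1                       # one upset: detected

double = list(word)
double[2] ^= 1
double[5] ^= 1                       # two upsets: parity looks fine again
```

`check(single)` is False while `check(double)` is True, which illustrates that any even number of errors slips through.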

  45. Two-Dimensional Parity [Figure: data block with ParityX and ParityY check bits; example with 2 errors]

  46. Two-Dimensional Parity
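
A Python sketch of two-dimensional parity, assuming one even-parity bit per row and one per column of a small data block. A single upset shows up as exactly one failing row and one failing column, which gives its position. The block contents and the status strings are illustrative.

```python
def parities(block):
    """Even-parity bit for every row and every column of the block."""
    rows = [sum(r) % 2 for r in block]
    cols = [sum(c) % 2 for c in zip(*block)]
    return rows, cols

def diagnose(block, row_par, col_par):
    """Compare stored parities with recomputed ones and classify the result."""
    rows, cols = parities(block)
    bad_rows = [i for i, (a, b) in enumerate(zip(rows, row_par)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(cols, col_par)) if a != b]
    if not bad_rows and not bad_cols:
        return "ok", None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return "single", (bad_rows[0], bad_cols[0])   # position to flip back
    return "multiple", None                           # detected, not locatable

data = [[1, 0, 1, 1],
        [0, 1, 1, 0],
        [1, 1, 0, 0]]
row_par, col_par = parities(data)

one = [row[:] for row in data]
one[1][2] ^= 1                       # single upset at row 1, column 2

two = [row[:] for row in data]
two[0][0] ^= 1
two[0][3] ^= 1                       # two upsets in the same row
```

For `two`, the row parity still matches and only two column checks fail, so the errors are detected but cannot be pinned to a row.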

  47. Hamming (intuitive version) [Figure: overlapping circles containing source bits s1, s2, s3, s4 and parity bits c5, c6, c7] • Definition: cj = computed to give even parity in the circle • Notice: • the 16 code words in Hamming(7,4) differ from each other by at least 3 bits.
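
The distance claim can be checked by brute force. The sketch below assumes one common textbook assignment of source bits to circles (c5 covers s1, s2, s3; c6 covers s2, s3, s4; c7 covers s1, s3, s4); the actual circle membership in the slide's figure was not preserved, so treat this assignment as an assumption.

```python
from itertools import combinations, product

def encode(s1, s2, s3, s4):
    """Hamming(7,4): each parity bit gives even parity inside its circle."""
    c5 = s1 ^ s2 ^ s3        # assumed circle membership
    c6 = s2 ^ s3 ^ s4        # assumed circle membership
    c7 = s1 ^ s3 ^ s4        # assumed circle membership
    return (s1, s2, s3, s4, c5, c6, c7)

def distance(a, b):
    """Hamming distance: number of positions where a and b differ."""
    return sum(x != y for x, y in zip(a, b))

codewords = [encode(*bits) for bits in product((0, 1), repeat=4)]
min_distance = min(distance(a, b) for a, b in combinations(codewords, 2))
```

With this assignment `min_distance` comes out as 3, so a single flipped bit always leaves the received word closer to the original codeword than to any other.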

  48. Hamming Codes (3) Hardware for encoder [Figure: data bits a0, a1, a2, a3 pass straight through; parity bits p0, p1, p2 are generated from them]

  49. Hamming Codes (4) Hardware for decoder [Figure: received bits a’0, a’1, a’2, a’3 and p’0, p’1, p’2 feed the Correction Logic, whose outputs are combined (+) with the received data bits to give a0, a1, a2, a3]
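
A software sketch of the encoder and decoder pair, using the a0..a3 and p0..p2 names from the slides. The parity equations in `PARITY_TAPS` are an assumption (the wiring in the figures was not preserved); the correction step follows the usual syndrome approach, where the recomputed parity is XORed with the received parity and the result selects the bit to flip.

```python
# Assumed parity equations: data-bit indices feeding p0, p1 and p2.
PARITY_TAPS = [(0, 1, 2), (1, 2, 3), (0, 2, 3)]

def parity_bits(a):
    """Compute p0, p1, p2 from the four data bits a0..a3."""
    return [a[i] ^ a[j] ^ a[k] for i, j, k in PARITY_TAPS]

def encode(a):
    """Codeword layout: a0 a1 a2 a3 p0 p1 p2."""
    return list(a) + parity_bits(a)

def syndrome(r):
    """Recomputed parity XOR received parity; all zeros means no error seen."""
    return tuple(p ^ q for p, q in zip(parity_bits(r[:4]), r[4:]))

# Syndrome caused by a single flipped bit at each of the 7 positions.
SYNDROME_TO_POS = {}
for pos in range(7):
    error = [0] * 7
    error[pos] = 1
    SYNDROME_TO_POS[syndrome(error)] = pos

def decode(r):
    """Correct at most one flipped bit and return the data bits a0..a3."""
    r = list(r)
    s = syndrome(r)
    if any(s):
        r[SYNDROME_TO_POS[s]] ^= 1   # correction logic: flip the indicated bit
    return r[:4]

sent = encode([1, 0, 1, 1])
received = list(sent)
received[1] ^= 1                     # single upset on a1
recovered = decode(received)
```

Because the seven single-bit syndromes are distinct and nonzero, the lookup always finds a position. A double error would produce a valid-looking syndrome and be miscorrected, which is the limit of a pure SEC code.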

  50. Cost of Hamming SEC
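
The content of this slide was not preserved. As an illustration of the cost, the sketch below computes the minimum number of check bits r for k data bits from the standard single-error-correction condition 2^r ≥ k + r + 1 (r syndrome bits must distinguish "no error" from an error in any of the k + r positions). The chosen word sizes are examples, not values from the slide.

```python
def sec_check_bits(k):
    """Smallest r with 2**r >= k + r + 1 (single error correction)."""
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r

# Check bits needed for some common data widths.
overhead = {k: sec_check_bits(k) for k in (4, 8, 16, 32, 64)}
# -> {4: 3, 8: 4, 16: 5, 32: 6, 64: 7}
```

The relative overhead shrinks as the word gets wider: 3 check bits for 4 data bits, but only 7 for 64.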
