Towards Soft Error

Towards Soft Error • Feb 16, 2006 • ACES (Architectures and Compilers for Embedded Systems) Laboratory • Kyoungwoo Lee

Agenda • Soft Error • Definition • Causes • Trend • Challenges • Related Works • The Effects of Energy Management on Reliability • Cache Size Selection for Performance, Energy and Reliability

1. Towards Soft Error • What is a soft error? • Why are soft errors important? • How do we recover from soft errors?

Definition of Soft Error • Soft Error (SE) • Transient Fault = Bit Flip = Single Event Upset (SEU) • A charged particle strikes an electronic circuit and changes the amount of charge stored at a sensitive node, which changes the logic state (e.g., ‘0’ to ‘1’ or vice versa) • Random, non-catastrophic, non-destructive, recoverable • Caused by Radiation • Neutrons • Alpha particles • High-energy cosmic rays • Solar particles • Source: Robert Bauman, “Soft Errors in Advanced Computer Systems,” IEEE Design and Test of Computers, 2005

Soft Error Rate and Importance • Soft Error Rate (SER) • FIT (Failures in Time): number of failures per one billion device-hours • e.g., 1,000 FITs per Mbit ≈ 114 years MTTF (Mean Time To Failure) • $\mathrm{SER} \propto N_{flux} \cdot CS \cdot \exp(-Q_{critical}/Q_S)$ • $N_{flux}$: intensity of the neutron flux • $CS$: area of the cross section of the node • $Q_S$: charge collection efficiency • $Q_{critical}$: minimum collected charge that upsets the data stored in a cell, $Q_{critical} = C \cdot V$ for capacitance $C$ and voltage $V$ • Critical SE • High Integration and Density • e.g., 1 GB memory with 1,000 FITs per Mbit → $8 \times 10^6$ FITs per memory → about 5 days MTTF • Technology Advancements • e.g., 1,000 FITs per Mbit in 0.18 µm technology → 10,000 to 100,000 FITs per Mbit in 0.13 µm technology • Latitude and Altitude • e.g., 10 to 100 times higher SER at flight altitude than at ground level • Voltage Scaling • e.g., lower voltage decreases $Q_{critical}$, which increases SER exponentially
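The FIT and MTTF figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes a 365-day year and takes 1 GB as 8 × 1024 Mbit, so it lands slightly below the rounded $8 \times 10^6$ FITs used above; the function names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365   # assumption: 365-day year
FIT_HOURS = 1e9             # 1 FIT = 1 failure per 10^9 device-hours


def mttf_hours(fit):
    """Mean time to failure in hours for a given FIT rate."""
    if fit <= 0:
        raise ValueError("FIT rate must be positive")
    return FIT_HOURS / fit


def memory_fit(fit_per_mbit, size_mbits):
    """Total FIT rate of a memory, assuming FITs add up linearly per Mbit."""
    return fit_per_mbit * size_mbits


# 1,000 FITs per Mbit, a single Mbit
per_mbit_years = mttf_hours(1000) / HOURS_PER_YEAR

# 1 GB = 8 * 1024 Mbit at 1,000 FITs per Mbit
gb_fit = memory_fit(1000, 8 * 1024)
gb_days = mttf_hours(gb_fit) / 24

print(f"1 Mbit: {per_mbit_years:.0f} years MTTF")
print(f"1 GB: {gb_fit:.2e} FITs, {gb_days:.1f} days MTTF")
```

The same memory technology that looks safe per Mbit (over a century between failures) fails within about a week at gigabyte scale, which is the point of the "High Integration and Density" bullet.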

SER Trend • [Figure panels: A. DRAM; B. SRAM; C. Core Logic; D. Contributions in Processors] • Sources: Robert Bauman, “Soft Errors in Advanced Computer Systems,” IEEE Design and Test of Computers, 2005; S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” IEEE Computer, 2005

SE Detection and Recovery • [Figure labels: Coding, Data, Extra, Decoding] • Information Redundancy • e.g., ECC (Error Correction Coding) and Parity • Hardware Redundancy • e.g., TMR (Triple Modular Redundancy) • Temporal Redundancy • e.g., Checkpointing and Recovery • Effects of Redundancy on Cost, Performance and Power • e.g., ECC • implemented by Hamming Code using 250 nm libraries • Cost: coding/decoding modules and extra bits • Performance: 1.45 ns for coding and 2.66 ns for decoding • Power: 14.5 mW for coding and 26.3 mW for decoding • Source: L. Li, V. Degalahal, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “Soft Error and Energy Consumption Interactions: A Data Cache Perspective,” Proc. of ISLPED, pp. 132-137, 2004
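To make information redundancy concrete, here is a minimal sketch of a textbook Hamming(7,4) code, which stores 3 extra parity bits per 4 data bits and corrects any single bit flip. It is not the ECC design measured by Li et al. (the slide only says that design uses a Hamming code); the function names and the 4-bit word size are assumptions chosen for illustration.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data bits, syndrome); corrects at most one flipped bit."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of the bad bit, 0 if clean
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1                        # simulate a soft error in one stored bit
recovered, position = hamming74_decode(stored)
print(recovered == word, position)
```

The extra parity bits and the XOR trees in the encoder and decoder are the area, delay, and power overheads that the slide quantifies for a 250 nm implementation.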

Related Works • Reliability and Power Management • Dr. D. Mossé group in Univ. of Pittsburgh • D. Zhu, R. Melhem, and D. Mossé, “The Effects of Energy Management on Reliability in Real-Time Embedded Systems,” Proc. of ICCAD, Nov. 2004. • D. Zhu, R. Melhem, D. Mossé, and E. Elnozahy, “Analysis of an Energy Efficient Optimistic TMR Scheme,” Proc. of ICPDS, Jul. 2004. • Dr. G. De Micheli group in Stanford Univ. • K. Mihic, T. Simunic, and G. De Micheli, “Reliability and Power Management of Integrated Systems,” Proc. of EuroMicro Systems on Digital System Design, 2004. • T. Simunic, K. Mihic, and G. De Micheli, “Optimization of Reliability and Power Consumption in Systems on a Chip,” Proc. of PATMOS, 2005. • (Cache) Architecture • Dr. M. J. Irwin and Dr. N. Vijaykrishnan group in PSU • L. Li, V. Degalahal, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “Soft Error and Energy Consumption Interactions: A Data Cache Perspective,” Proc. of ISLPED, pp. 132-137, 2004 • Dr. S. M. Reddy group in Univ. of Iowa • Y. Cai, M. T. Schmitz, A. Ejlali, B. M. Al-Hashimi, and S. M. Reddy, “Cache Size Selection for Performance, Energy and Reliability of Time-Constrained Systems,” ASP-DAC 2006 • Soft Error and Core Logic • Intel • S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” IEEE Computer, pp. 43-51, 2005 • Dr. K. Roy group in Purdue Univ. • A. Goel, S. Bhunia, H. Mahmoodi, and K. Roy, “Low-Overhead Design of Soft-Error-Tolerant Scan Flip-Flops with Enhanced-Scan Capability,” ASP-DAC 2006

(1) "The Effects of Energy Management on Reliability in Real-Time Embedded Systems" in ICCAD 2004 Dakai Zhu, Rami Melhem and Daniel Mossé from the PARTS project at the University of Pittsburgh

Outline • Motivation • Problem • Main Idea • Simulation Results • Conclusion and Contribution

Motivation • For autonomous critical real-time embedded applications (e.g., satellite and surveillance systems), both high reliability and low energy consumption are desired • Slack time can be used for reliability and for energy saving in real-time applications • Temporal redundancy, such as recovery through rollback and re-execution • DVS (Dynamic Voltage Scaling) • Tradeoff between power management and reliability • Frequent fault-tolerance actions increase energy consumption • Saving more energy with DVS or DFS (Dynamic Frequency Scaling) lowers reliability, since less slack time remains for fault tolerance • The soft error rate depends on operating frequency and supply voltage

Problem and Main Work • Explore the tradeoff between reliability and energy consumption in real-time embedded systems, taking fault-rate changes into account • Propose two fault rate models related to frequency and voltage scaling • Application model: a frame-based real-time application (slack time) • Power model: frequency-dependent and frequency-independent components • Fault model: exponential relationship for DVS • Analyze the effects of energy management on reliability • Performability: the probability of finishing the application correctly within its deadline in the presence of faults • Expected energy consumption

Application Model • A frame-based real-time application that is executed repeatedly, once in every frame • $L$: worst-case execution time at $f_{max}$ • $D$: deadline ($L \le D$) • Assumption: execution time is inversely proportional to frequency • e.g., if the frequency is halved, the execution time doubles

Power Model • $P = P_s + h (P_{ind} + P_d)$, where $P_d = C V^2 f$ • $P_s$: sleep power (e.g., maintain basic circuits, keep the clock running) • $P_{ind}$: frequency-independent active power (e.g., memory and component power) • $P_d$: frequency-dependent active power (e.g., processor dynamic power) • $h = 0$ during the sleep state and $h = 1$ otherwise • Assumptions • The system is always on because of the large overhead of turning the system on and off • $P_s$ is treated as zero since it does not affect energy savings • $P = P_{ind} + C V^2 f$

Energy Model (Voltage Scaling) • Reduce the supply voltage for lower frequency • $E = P \cdot T = P_{ind} (L/f) + C f^2 L$ • $P = P_{ind} + C V^2 f = P_{ind} + C f^3$ • For frequency $f$, the corresponding $V = f V_{max} = f$, since $V_{max} = 1$ after normalization • $T = L (f_{max}/f)$ • Lower supply voltage → lower frequency → less frequency-dependent energy, but more execution time → more frequency-independent energy
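The two opposing energy terms can be seen by evaluating the model at a few frequencies. This is a minimal sketch under the slide's normalization ($f_{max} = V_{max} = 1$, $V = f$); the effective capacitance defaults to $C = 1$, and the values $L = 30$ and $P_{ind} = 0.2$ are borrowed from the simulation setup later in the deck.

```python
def energy_per_run(f, L, p_ind, c_eff=1.0):
    """Energy of one execution at normalized frequency f (0 < f <= 1).

    Assumes V = f, so P = p_ind + c_eff * f**3 and T = L / f.
    """
    if not 0.0 < f <= 1.0:
        raise ValueError("normalized frequency must be in (0, 1]")
    exec_time = L / f                  # execution stretches as f drops
    power = p_ind + c_eff * f ** 3     # frequency-independent + dynamic power
    return power * exec_time


for freq in (1.0, 0.75, 0.5, 0.35):
    print(f"f = {freq:.2f}: E = {energy_per_run(freq, L=30, p_ind=0.2):.2f}")
```

At high frequencies the $C f^2 L$ term dominates and slowing down saves energy; at low frequencies the $P_{ind} L / f$ term grows and eventually outweighs the savings.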

Fault Model • Average fault rate at $f$ and $V$ • $\lambda(f,V) = \lambda_0 g(f,V)$ • $\lambda_0$ is the average fault rate at $V_{max}$ and $f_{max}$ • Exponential Fault Model for DVS • Fault rates in processors as well as in memory increase exponentially when the supply voltage decreases • Why: with reduced supply voltage the critical charge becomes smaller, which increases the fault rate exponentially • $\lambda(f,V) = \lambda_0 g(f,V) = \lambda_0 10^{d(1-f)/(1-f_{min})}$ • with $V = f V_{max} = f$ ($V_{max} = 1$) • $\lambda_{max} = \lambda_0 10^d$ at $f = f_{min}$ and $\lambda_{min} = \lambda_0$ at $f = f_{max}$ • Reducing the supply voltage for lower frequency results in exponentially increased fault rates • Larger $d$ indicates that the fault rate is more sensitive to voltage scaling
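A short sketch of the exponential fault model follows. The slides do not state $f_{min}$, so the value 0.1 used here is purely an assumption for illustration, as are the function and parameter names.

```python
import math


def fault_rate(f, lambda0, d, f_min):
    """Fault rate at normalized frequency f under the exponential DVS model."""
    if not f_min <= f <= 1.0 or f_min >= 1.0:
        raise ValueError("require f_min <= f <= 1 and f_min < 1")
    return lambda0 * 10 ** (d * (1.0 - f) / (1.0 - f_min))


LAMBDA0 = 1e-6   # faults per second at f_max (from the simulation setup)
F_MIN = 0.1      # hypothetical minimum frequency, not given in the slides

for d in (0, 2, 4, 6):
    rates = [fault_rate(f, LAMBDA0, d, F_MIN) for f in (1.0, 0.55, F_MIN)]
    print(d, [f"{r:.1e}" for r in rates])
```

With $d = 0$ the rate stays at $\lambda_0$ regardless of frequency; with $d = 6$ it grows by six orders of magnitude between $f_{max}$ and $f_{min}$.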

Performability • $R_f = 1 - \rho_f^{k_f+1}$ • $\rho_f^{k_f+1}$ is the probability that every execution (the original plus all $k_f$ recoveries) experiences a fault • $k_f = \lfloor D/(L/f) \rfloor - 1$: the number of possible recovery executions at $f$ • $\rho_f = 1 - e^{-\lambda(f,V) L/f}$ • The probability of having at least one fault during one run of the application • With voltage scaling, • $\rho_f = 1 - e^{-\lambda(f,V) L/f} = 1 - e^{-\lambda_0 10^{d(1-f)/(1-f_{min})} L/f}$ • $\rho_f$ increases when frequency $f$ decreases • Lower frequencies leave time for fewer recoveries • Therefore, $R_f$ decreases when the supply voltage is reduced for lower frequency
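The performability formula can be evaluated directly. The sketch below returns the failure probability $\rho_f^{k_f+1} = 1 - R_f$ rather than $R_f$ itself, because $R_f$ is so close to 1 that it rounds to 1.0 in double precision. The value $f_{min} = 0.1$ is an assumption (the slides do not give it), and the small epsilon inside the floor guards against floating-point rounding at exact multiples.

```python
import math


def failure_probability(f, L, D, lambda0, d, f_min):
    """Return (k_f, rho_f, rho_f ** (k_f + 1)) at normalized frequency f."""
    k = math.floor(D * f / L + 1e-9) - 1       # number of possible recoveries
    if k < 0:
        return k, 1.0, 1.0                     # even one run misses the deadline
    lam = lambda0 * 10 ** (d * (1.0 - f) / (1.0 - f_min))
    rho = -math.expm1(-lam * L / f)            # 1 - exp(-lam * L / f), accurately
    return k, rho, rho ** (k + 1)


# D = 100, L = 30, lambda0 = 1e-6 as in the simulation setup; f_min is hypothetical
for freq in (1.0, 0.7, 0.5):
    k, rho, fail = failure_probability(freq, 30, 100, 1e-6, d=2, f_min=0.1)
    print(f"f = {freq}: k_f = {k}, rho_f = {rho:.2e}, 1 - R_f = {fail:.2e}")
```

Slowing down hurts twice: each run is longer and more fault-prone, and fewer recovery runs fit before the deadline.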

Energy Consumption • Expected Energy Consumption • $E_f = (P_{ind} + C V^2 f) L/f$ • the energy consumed to execute the application once at $f$ • $\rho_f E_f$ • the expected energy of the first recovery, which runs only if the original execution fails (probability $\rho_f$) • $\rho_f^i E_f$ • the expected energy of the $i$-th recovery, which runs only if the original execution and every previous recovery failed (probability $\rho_f^i$) • $EE_f = E_f (1 + \rho_f + \rho_f^2 + \dots + \rho_f^{k_f}) = E_f (1 - \rho_f^{k_f+1})/(1 - \rho_f)$ • the expected energy to execute the original application and up to $k_f$ recoveries • Under voltage scaling, • $EE_f$ is always less than the expected energy at $f_{max}$ ($EE_{f_{max}}$) if $P_{ind} = 0$ and $f^3 D/L \le 1$ • Otherwise, $EE_f$ may increase at lower frequency and voltage, when the probability of executing recoveries is high because of rapidly increasing fault rates
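The expected-energy sum is easy to compute term by term. This sketch assumes the normalized models from the earlier slides ($V = f$, $C = 1$) and a hypothetical $f_{min} = 0.1$, which the slides do not specify; the function name is illustrative.

```python
import math


def expected_energy(f, L, D, p_ind, lambda0, d, f_min, c_eff=1.0):
    """Expected energy of the original run plus up to k_f recoveries."""
    k = math.floor(D * f / L + 1e-9) - 1
    if k < 0:
        raise ValueError("a single run at this frequency misses the deadline")
    e_run = (p_ind + c_eff * f ** 3) * L / f            # E_f with V = f
    lam = lambda0 * 10 ** (d * (1.0 - f) / (1.0 - f_min))
    rho = -math.expm1(-lam * L / f)                     # per-run fault probability
    # The i-th recovery only runs if all i earlier executions failed.
    return e_run * sum(rho ** i for i in range(k + 1))


# D = 100, L = 30, lambda0 = 1e-6 from the simulation setup; f_min is hypothetical
for freq in (1.0, 0.66, 0.5, 0.35):
    ee = expected_energy(freq, 30, 100, p_ind=0.0, lambda0=1e-6, d=6, f_min=0.1)
    print(f"f = {freq}: EE_f = {ee:.3f}")
```

With $P_{ind} = 0$, every frequency printed above satisfies $f^3 D/L \le 1$ except $f_{max}$ itself, so each stays below $EE_{f_{max}}$, matching the condition on the slide.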

Simulation Setup • Assumptions • Normalized frequency and voltage • $f_{max} = 1$ and $V_{max} = 1$ • $P_{d,max} = C V_{max}^2 f_{max} = 1$: maximum frequency-dependent active power • Power consumption • Pentium M processor: 25 W peak power, 1 W sleep power • Rambus memory: 300 mW active, 30 mW sleep power • $P_{ind}$ = 0.0, 0.2, and 0.4 • The rate of radiation-induced faults • $10^{-6}$ faults per second at $V_{max}$ and $f_{max}$ = $3.6 \times 10^6$ FITs per system • Sensitivity to supply voltage: $d$ = 0, 2, 4, or 6 • Application with $D$ = 100 time units and $L$ = 30 at $f_{max}$
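The fault-rate conversion on this slide is a unit change from faults per second to failures per billion hours. A minimal check, with an illustrative function name:

```python
import math

SECONDS_PER_HOUR = 3600
FIT_HOURS = 1e9   # 1 FIT = 1 failure per 10^9 hours


def faults_per_second_to_fit(rate_per_second):
    """Convert a fault rate in faults per second to FITs."""
    return rate_per_second * SECONDS_PER_HOUR * FIT_HOURS


system_fit = faults_per_second_to_fit(1e-6)
print(f"{system_fit:.1e} FITs per system")
```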

Simulation Results (DVS) • Decreasing frequency and voltage decreases the energy consumption • For d = 6, energy consumption increases at lower frequency when 0.6 < f < 0.7 → high probability of recovery • Lower frequency, lower performability • Larger d leads to worse performability • The performability for a fixed fault rate (d = 0) is much better than that for a variable fault rate • Ignoring the effects of voltage scaling on the fault rate may lead to unsatisfactory performability

Conclusion and Contribution • Energy management through frequency and voltage scaling has significant effects on system reliability • Ignoring the effects of energy management on the fault rate is too optimistic and may lead to unmet reliability goals • The first attempt to model the relationship between reliability and power management with fault-rate changes caused by voltage scaling

(2) “Cache Size Selection for Performance, Energy and Reliability of Time-Constrained Systems” in ASP-DAC 2006 Y. Cai, M. T. Schmitz, A. Ejlali, B. M. Al-Hashimi & S. M. Reddy from the University of Iowa and the University of Southampton (UK)

Outline • Motivation • Problem • Main Idea • Simulation Results • Conclusion and Contribution

Motivation • Cache size has the largest impact on performance and energy consumption • Also cache size affects the reliability since • The probability of particle-hits in the smaller active area is reduced • The probability of particle-hits during a longer execution time increases • It affects the number of possible re-executions

Problem • Goal • Examine the combined effect of cache size selection on energy consumption, performance and reliability • Transient Fault Model • Performability Model • Cache Energy Consumption

Simulation • Platform: MPARM, a cycle-accurate simulator including ARM7 • Cache size: 32 bytes to 256 Kbytes • Benchmarks: fixed-point FFT, CRC, matrix multiplication, matrix addition, and quicksort • Inject $10^6$ faults for each benchmark • Each fault hits a single bit, chosen at random in the X-Y domain • The clock cycle at which each fault occurs is also chosen at random
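The random fault placement described above can be sketched as follows. This is not MPARM code: the helper names are hypothetical, and treating the X-Y domain as the row and column of a cache bit array is an assumption made for illustration.

```python
import random


def sample_fault(rows, cols, total_cycles, rng):
    """Pick a random bit (row, col) and a random clock cycle for one fault."""
    return rng.randrange(rows), rng.randrange(cols), rng.randrange(total_cycles)


def inject(bit_array, row, col):
    """Model a single event upset by flipping one stored bit in place."""
    bit_array[row][col] ^= 1


rng = random.Random(2006)                  # fixed seed for repeatable experiments
rows, cols = 32, 8                         # hypothetical 32-byte cache, 8 bits per row
cache = [[0] * cols for _ in range(rows)]

row, col, cycle = sample_fault(rows, cols, total_cycles=100_000, rng=rng)
inject(cache, row, col)
print(f"bit ({row}, {col}) flipped at cycle {cycle}")
```

In a full campaign, the simulator would run the benchmark until the sampled cycle, apply the flip, and then record whether the run still finishes correctly within its deadline.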

Simulation Results • Experiments for FPFFT on Data Cache • [Figure annotations] • Large caches: larger energy per cache access; the application does not use the extra cache; larger area of active cache • Small caches: high miss rate increases the number of cache accesses; not enough slack for re-execution • $2^{10}$ bytes is an appropriate cache size for FPFFT, which minimizes cache energy and maximizes performability

Conclusion • Choose the optimal cache size considering • Performability • Energy consumption • Satisfaction of application requirements • Dynamically changing the cache size to suit the particular application is beneficial not only for energy but also for the system’s performability

Summary • Soft Error • Related Works considering: • Energy Consumption • Performance • Reliability • Challenges against Soft Error • Take soft errors into account • Performance + Power + Reliability • Recovery for Core Logic • Soft errors in core logic are increasingly important • Dynamic adjustment of system configuration • How to adapt the system to satisfy reliability requirements with minimum energy consumption for a given application

References • T. Simunic, K. Mihic, and G. De Micheli, “Optimization of Reliability and Power Consumption in Systems on a Chip,” Proc. of PATMOS, 2005. • L. Li, V. Degalahal, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “Soft Error and Energy Consumption Interactions: A Data Cache Perspective,” Proc. of ISLPED, pp. 132-137, 2004 • D. Zhu, R. Melhem, and D. Mossé, “The Effects of Energy Management on Reliability in Real-Time Embedded Systems,” Proc. of ICCAD, Nov. 2004. • K. Mihic, T. Simunic, and G. De Micheli, “Reliability and Power Management of Integrated Systems,” Proc. of EuroMicro Systems on Digital System Design, 2004. • D. Zhu, R. Melhem, D. Mossé, and E. Elnozahy, “Analysis of an Energy Efficient Optimistic TMR Scheme,” Proc. of ICPDS, Jul. 2004. • W. Leung, F. C. Hsu, and M. E. Jones, “The Ideal SoC Memory: 1T-SRAM,” Proc. of IEEE SoC/ASIC Conference, pp. 32-36, Sep. 2000. • N. Seifert, D. Moyer, N. Leland, and R. Hokinson, “Historical Trend in Alpha-Particle Induced Soft Error Rates for the Alpha Microprocessor,” IEEE annual IRPS, pp. 259-265, 2001 • Dhiraj K. Pradhan, “Fault-Tolerant Computer System Design”, Prentice Hall, 1996, ISBN 0-13-057887-8 • Jean-Claude Geffroy and Gilles Motet, “Design of Dependable Computing Systems”, Kluwer Academic Publishers, 2002, ISBN 1-4020-0437-0 • S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” IEEE Computer, pp. 43-51, 2005 • R. Bauman, “Soft Errors in Advanced Computer Systems,” IEEE Design and Test of Computers, pp. 258-266, 2005
