
A Microarchitectural Analysis of Soft Error Propagation in a Production-Level Embedded Microprocessor

Jason Blome, Scott Mahlke, Daryl Bradley*, Krisztián Flautner*
Advanced Computer Architecture Lab, University of Michigan
*ARM Ltd.


Presentation Transcript


  1. A Microarchitectural Analysis of Soft Error Propagation in a Production-Level Embedded Microprocessor
     Jason Blome, Scott Mahlke, Daryl Bradley*, Krisztián Flautner*
     Advanced Computer Architecture Lab, University of Michigan
     *ARM Ltd.

  2. Embedded Everywhere
     • Not just cellphones
     • Safety-critical applications:
       • Automotive
       • Healthcare
     (Source: Patterson and Hennessy 2005)

  3. Embedded Domain Constraints
     • Power-efficient performance
       • Longer clock cycle times
       • Increased logic depth between stages
       • Higher area ratio of combinational logic to state elements
     • Less speculative state
       • Potentially less masking
     • Limited real estate
     All of these high-level constraints affect the behavior of faults and the potential of fault tolerance techniques.

  4. Objectives
     • Understand the effects of transient faults on a typical embedded design
       • Architectural contributions to soft error effects
     • Production-grade core
       • Reference synthesis flow
       • Design-for-test methodologies
     • Simulate faults in both combinational and sequential logic

  5. Soft Error Rate Contributions
     [Figures: soft error rate contributions, from Mitra 2005 and Shivakumar 2002]
     Increasing contribution of faults in combinational logic to the overall soft error rate

  6. Processor Model
     • ARM926EJ-S
     • Cell library characterized for 130 nm
     • 5 ns clock cycle time
     [Block diagram of the ARM926EJ-S: Instruction Fetch, Instruction Decode, Instruction Address Logic, Instruction cache, MMU, Register Bank, Mux Array, ALU, Shift, Multiply, Data Address Logic, Data Interface, Data cache, MMU, Write Buffer/Bus Interface, Bus Interface]

  7. Analysis Infrastructure
     [Diagram: a testbench runs a benchmark on a reference design and a test design; a fault injection/error analysis framework contains the fault injection scheduler, error checking and logging, and report generation]
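The lockstep idea in the diagram above can be sketched in a few lines. This is only an illustrative toy, not the authors' framework: the two-register `step` function stands in for the real core, and the names `step` and `run_campaign` are hypothetical.

```python
def step(state, inp):
    """One clock cycle of a toy 2-register design (hypothetical stand-in for the core).

    state = (acc, pipe): pipe latches the input, acc accumulates the
    previous pipe value with 8-bit wraparound.
    """
    acc, pipe = state
    return ((acc + pipe) & 0xFF, inp & 0xFF)


def run_campaign(benchmark, inject_cycle, reg_index, bit):
    """Run a reference and a test design in lockstep on the same benchmark.

    At inject_cycle, one bit of one register in the test design is flipped.
    Returns a log of (cycle, [indices of mismatching registers]).
    """
    ref = test = (0, 0)
    log = []
    for cycle, inp in enumerate(benchmark):
        ref = step(ref, inp)
        test = step(test, inp)
        if cycle == inject_cycle:
            # Fault injection: flip a single bit in the test design only.
            regs = list(test)
            regs[reg_index] ^= 1 << bit
            test = tuple(regs)
        # Error checking: compare every register against the reference.
        diffs = [i for i in range(len(ref)) if ref[i] != test[i]]
        if diffs:
            log.append((cycle, diffs))
    return log


# Flip bit 0 of register 1 (pipe) at cycle 1 and watch the error move into acc.
print(run_campaign([1, 2, 3, 4], inject_cycle=1, reg_index=1, bit=0))
```

In this toy run the error first shows up in the pipeline register and then persists in the accumulator on later cycles, which is the kind of cycle-by-cycle propagation trace the case study slides report.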

  8. Fault Masking
     [Timing diagram: CLK edge with the t_setup and t_hold intervals marked]
     • Logical: the faulted value does not affect the logical operation of the circuit
     • Architectural/Software: incorrect state is overwritten before it is read
     • Latching-Window: the fault pulse does not reach a state element within the latching window
     • Electrical: the fault pulse is electrically attenuated by subsequent gates in the circuit
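Two of these masking effects are easy to illustrate with a small sketch. The model below is deliberately simplified and rests on assumptions: a pulse is treated as capturable whenever it overlaps the interval from `clk_edge - t_setup` to `clk_edge + t_hold`, and the setup and hold values are made up for illustration (only the 5 ns cycle time comes from the Processor Model slide). Times are in picoseconds.

```python
def logically_masked(a, b_fault_free, b_faulted):
    """Logical masking on a 2-input AND gate.

    A fault on input b is masked when the gate output is the same
    with and without the fault.
    """
    return (a & b_fault_free) == (a & b_faulted)


def overlaps_latching_window(pulse_start, pulse_width, clk_edge, t_setup, t_hold):
    """Latching-window check (simplified overlap model, an assumption).

    Returns True when the fault pulse overlaps the window
    [clk_edge - t_setup, clk_edge + t_hold] and so could be captured.
    """
    window_start = clk_edge - t_setup
    window_end = clk_edge + t_hold
    pulse_end = pulse_start + pulse_width
    return pulse_start < window_end and pulse_end > window_start


CLK_EDGE_PS = 5000   # 5 ns cycle time, from the Processor Model slide
T_SETUP_PS = 100     # hypothetical value
T_HOLD_PS = 50       # hypothetical value

# A 0 on the other AND input hides the fault; a 1 lets it through.
print(logically_masked(0, 1, 0), logically_masked(1, 1, 0))

# A pulse early in the cycle misses the window; one near the edge does not.
print(overlaps_latching_window(1000, 200, CLK_EDGE_PS, T_SETUP_PS, T_HOLD_PS))
print(overlaps_latching_window(4850, 200, CLK_EDGE_PS, T_SETUP_PS, T_HOLD_PS))
```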

  9. Observed Error Rates
     [Charts: Faults Occurring in Registers; Faults Occurring in Combinational Logic. Annotated values: 94%, 7%, 16%, 4%]
     At the software interface, error rates within 3%

  10. Observed Error Rates
     [Charts: Faults Occurring in Registers; Faults Occurring in Combinational Logic]
     Faults in combinational logic have a much more dramatic effect on system state

  11. Architectural Errors per Cycle
     [Charts: Faults Occurring in Registers; Faults Occurring in Combinational Logic]

  12. Architectural Corruption Characteristics
     [Charts: Bits per Architectural Register Corrupted; Number of Architectural Registers Corrupted]

  13. Results Summary
     • Faults occurring in logic:
       • Will likely be much more frequent in embedded designs
       • Tend to have a more dramatic effect on system state
       • Multi-bit/multi-register architectural errors common
     • Design-for-test methodologies can greatly impact soft error characteristics
     • Error rates at the software interface consistent with those observed in high-performance microprocessors

  14. Traditional Error Detection/Protection
     • Reliable Encoding
       • ECC/Parity
       • Limited use for faults in logic
       • Unclear where/how much to protect
     • Redundant Computation
       • In space
         • Area/energy overhead
       • In time
         • Energy overhead
         • Requires performance slack
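A small sketch may help show why parity is of limited use for faults in logic. It assumes a single even-parity bit per stored word; the helper names `parity`, `store`, and `check` are hypothetical. A bit flipped after encoding is caught, but a value corrupted in logic before encoding gets a matching parity bit and passes the check.

```python
def parity(word):
    """Even-parity bit of an integer: 1 if the number of set bits is odd."""
    return bin(word).count("1") & 1


def store(value):
    """Encode a value with its parity bit, as a protected register might."""
    return (value, parity(value))


def check(stored):
    """Return True when the stored value still matches its parity bit."""
    value, p = stored
    return parity(value) == p


# Fault in a state element: a bit flips after the parity bit was computed.
good = store(0b1010)
flipped_in_storage = (good[0] ^ 0b0001, good[1])
print(check(flipped_in_storage))   # False: the storage fault is detected

# Fault in combinational logic: the ALU result is already wrong when encoded,
# so the parity bit is computed over the corrupted value.
faulty_alu_result = (3 + 4) ^ 0b0100
print(check(store(faulty_alu_result)))   # True: the logic fault goes unnoticed
```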

  15. Case Study I
     • Cycle 1: 51 Errors: instr_reg_ID[0, 16, 22, 31], ID_decode_info[0, 16, 31], stored_instr[29, 30]
     • Cycle 2: 51 Errors: instr_reg_EX[0, 16, 22, 31], EX_decode_info[0, 16, 31]
     • Cycle 3: 17 Errors: ALU_out[0, 1, 2, 3, 4, 5, 6]
     • Cycle 4: 18 Errors: ALU_result_wb[0, 1, 2, 3, 4, 5, 6]
     • Cycle 5: 29 Errors: Reg0_reg[0, 1, 2, 3, 4, 5, 6]
     [Block diagram of the ARM926EJ-S with IRoute labeled: Instruction Fetch, Instruction Decode, Instruction Address Logic, Instruction cache, MMU, Register Bank, Mux Array, ALU, Shift, Multiply, Data Address Logic, Data Interface, Data cache, MMU, Write Buffer/Bus Interface, Bus Interface]

  16. Case Study II
     • Cycle 1: 9 Errors: instr_reg_ID[3, 12, 17, 18, 24, 26, 29, 30, 31]
     • Cycle 2: 62 Errors: instr_reg_EX, shifter_data_opEx_reg, Shifter_data_reg, alu_cc_reg
     • Cycle 3: 49 Errors: Shifter_data_EX, alu_out_reg
     • Cycle 4: 183 Errors: writeback and forwarding state, register bank
     [Block diagram of the ARM926EJ-S with IPipe labeled: Instruction Fetch, Instruction Decode, Instruction Address Logic, Instruction cache, MMU, Register Bank, Mux Array, ALU, Shift, Multiply, Data Address Logic, Data Interface, Data cache, MMU, Write Buffer/Bus Interface, Bus Interface]

  17. Fault Characteristics
     • Case Study I: uCORE.uIRoute.U600
       • First cycle error sites: 51 errors
         • uIRoute.INSTRHeld_reg[0]
         • uIRoute.INSTRHeld_reg[16]
         • uIRoute.INSTRHeld_reg[22]
         • uIRoute.INSTRHeld_reg[31]
         • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[0]
         • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[16]
         • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[31]
         • u9EJ.uARM9.uCORECTL.uIPIPE.StoredInstrInt_reg[29]
         • u9EJ.uARM9.uCORECTL.uIPIPE.StoredInstrInt_reg[30]
     • Case Study II: uCORE.u9EJ.uARM9.uCORECTL.uIPIPE.U3626
       • First cycle error sites: 9 errors
         • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[3]
         • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[12]
         • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[17]
         • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[18]
         • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[24]
         • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[26]
         • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[29]
         • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[30]
         • u9EJ.uARM9.uCORECTL.uIPIPE.IDarmDeint_reg[31]

  18. Embedded Design Space Potential
     • Leverage significant signal fanout
     • Determine that a fault has occurred during the cycle that it occurs
       • Transition detection circuits
     • Selectively deploy fault detection units
       • Intersection of high fanout fault targets
     • No roll-back necessary: simply flush the pipeline
     • Low cost/area overhead critical for embedded designs
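One way to read "intersection of high fanout fault targets" is as a literal set intersection over the state bits that different faults corrupt in their first cycle. That reading is an assumption, and the helpers `shared_sites` and `coverage` below are hypothetical. The site lists are transcribed from the Fault Characteristics slide, with hierarchy prefixes dropped for brevity.

```python
# First-cycle error sites from the Fault Characteristics slide,
# written as (register name, bit index) pairs.
CASE_I = {
    ("INSTRHeld_reg", 0), ("INSTRHeld_reg", 16),
    ("INSTRHeld_reg", 22), ("INSTRHeld_reg", 31),
    ("IDarmDeint_reg", 0), ("IDarmDeint_reg", 16), ("IDarmDeint_reg", 31),
    ("StoredInstrInt_reg", 29), ("StoredInstrInt_reg", 30),
}
CASE_II = {("IDarmDeint_reg", b) for b in (3, 12, 17, 18, 24, 26, 29, 30, 31)}


def shared_sites(*fault_site_sets):
    """Return the state bits corrupted by every one of the given faults."""
    return set.intersection(*fault_site_sets)


def coverage(monitored, fault_site_sets):
    """Count faults that corrupt at least one monitored bit in their first cycle."""
    return sum(1 for sites in fault_site_sets if sites & monitored)


candidates = shared_sites(CASE_I, CASE_II)
print(candidates)                                # {('IDarmDeint_reg', 31)}
print(coverage(candidates, [CASE_I, CASE_II]))   # 2
```

For these two faults, a single detector on that one shared bit would observe both in the cycle they occur. A real placement study would need many more injected faults than the two case studies shown here.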

  19. Conclusion
     • Design domain critical:
       • Affects fault behavior
       • Limits applicable tolerance techniques
     • Key observations:
       • Faults in combinational logic much more likely in embedded designs
       • Faults in combinational logic behave dramatically differently than those in state elements
       • Fault fanout offers potential for low overhead detection

  20. Soft Error Terminology
     [Diagram labels: transistor, transient fault, soft error]

  21. Dependence on Fault Duration

  22. Pulse Detection
     [Circuit diagram labels: flip-flop (D, Q, ~Q, CLK), shadow latch, error]

  23. Microarchitectural Errors per Cycle
     [Charts: Faults Occurring in Registers; Faults Occurring in Combinational Logic]
     Multi-bit errors common for faults in combinational logic
