
StageWeb : Interweaving Pipeline Stages into a Wearout and Variation Tolerant CMP Fabric

Shantanu Gupta, Amin Ansari, Shuguang Feng, Scott Mahlke. University of Michigan - Ann Arbor. June 29, 2010.


Presentation Transcript


  1. StageWeb: Interweaving Pipeline Stages into a Wearout and Variation Tolerant CMP Fabric. Shantanu Gupta, Amin Ansari, Shuguang Feng, Scott Mahlke. University of Michigan - Ann Arbor. June 29, 2010.

  2. Reliability Threats
     • Transient faults due to cosmic rays and alpha particles (increase exponentially with the number of devices on chip)
     • Silicon defects (manufacturing defects and device wear-out): electromigration, negative bias threshold inversion, oxide breakdown
     • Process variation (random and systematic variations): speed binning on a die, intra-die ILD thickness
     [Figures: grid of cores (C) binned by frequency; intra-die ILD thickness]

  3. Fault Tolerance Aspects
     • Detect and diagnose: Has anything gone wrong? Figure out the cause.
     • Reconfigure: Isolate the broken components.
     • Recover: Resume execution from a safe point.

  4. Reconfiguring a Multi-core
     • At the coarsest level, cores can be disabled.
     • Rumors that industry already uses this: IBM Cell with 7 SPEs, AMD Tri-Core.
     • Can't scale to higher failure rates!
     [Figure: grid of cores (C) at Year 1, Year 3, Year 5 and Year 7]

  5. Reconfiguration Granularity
     For 100% area overhead (redundancy). Finer granularity gives better resource utilization; coarser granularity gives lower complexity.
     • CORE level: 100% MTTF ↑
        + Easy to implement
        -- Poor MTTF gains
        Prior work: ElastIC, DT'06; Reunion, MICRO'06; Configurable Isolation, ISCA'07
     • STAGE level (FETCH, DEC, EXEC, MEM, WB): 170% MTTF ↑
        + Good MTTF gains
        + Circuit / architectural boundary
        + Full coverage
     • MODULE level: 200% MTTF ↑
        + Best MTTF gains
        -- Complex implementation
        Prior work: Online Diagnosis of Hard Faults, MICRO'05; Ultra Low-Cost Defect Protection, ASPLOS'06

  6. [Figure: a pipeline drawn as Stage1, Stage2, Stage3, ..., StageN separated by latches; a CMP fabric built from four such pipelines, Core 0 to Core 3]

  7. The StageNet (SN) Fabric
     • StageNet Slice (SNS): Stage1, Stage2, Stage3, ..., StageN
     • Crossbar switch (inputs, outputs) connecting the stages of different slices
     • Wearout sensors: delay, temperature, current
     • Configuration Manager

  8. A 4-Slice SN chip
     [Figure: four slices, each with Fetch, Decode, Issue and Ex/Mem stages, plus a Configuration Manager]
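
The difference between disabling whole cores and reconfiguring at stage granularity can be illustrated with a small count. The sketch below is not taken from the slides: it assumes an idealized full crossbar in which any working stage can be paired with any working stage of the next type, and the helper names `core_level_pipelines` and `stage_level_pipelines` are made up for illustration.

```python
from typing import Dict, List

# Stage names as they appear in the 4-slice SN chip figure.
STAGES = ("fetch", "decode", "issue", "exmem")


def core_level_pipelines(chip: List[Dict[str, bool]]) -> int:
    """Usable pipelines when one broken stage disables its whole core."""
    return sum(all(core[s] for s in STAGES) for core in chip)


def stage_level_pipelines(chip: List[Dict[str, bool]]) -> int:
    """Usable pipelines when working stages can be mixed across slices.

    Assumes an ideal, fault-free full crossbar: the scarcest stage type
    limits how many logical pipelines can be formed.
    """
    return min(sum(core[s] for core in chip) for s in STAGES)


# Four slices, each with a different broken stage (True means working).
chip = [
    {"fetch": False, "decode": True, "issue": True, "exmem": True},
    {"fetch": True, "decode": False, "issue": True, "exmem": True},
    {"fetch": True, "decode": True, "issue": False, "exmem": True},
    {"fetch": True, "decode": True, "issue": True, "exmem": False},
]

print(core_level_pipelines(chip))   # 0: every core contains a broken stage
print(stage_level_pipelines(chip))  # 3: three working stages of each type
```

With one fault per core, core-level disabling leaves nothing to run on, while stage-level reconfiguration still forms three logical pipelines.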

  9. Performance Comparison: Pipeline vs. SN Slice
     • 5-stage pipeline: Fetch, Decode, Issue, Ex/Mem, WB, separated by latches; Gen PC, branch predictor, register file; branch resolution, bypass and register write-back paths
     • SN slice: Fetch, Decode, Issue, Ex/Mem, connected through double buffers; Gen PC, branch predictor, register file
     • Sources of slowdown in the SN slice: 1. Control stall 2. Data forwarding 3. Transmission delays
     • > 5X slowdown
     [Figure: commit time of instructions 1 to 10, including a branch (BR) and a register dependency, on the 5-stage pipeline and on the SN slice]

  10. SN Slice Microarchitecture [MICRO'08]
     1. Control handling: Stream ID (SID)
        • Control flow handling
        • Eliminates flush signals
     2. Data forwarding: Bypass $
        • Stores previous results
        • Fully associative structure
        • Emulates data forwarding
     3. Transmission delays: Macro-ops (MOPs)
        • Send instruction bundles
        • Amortizes transfer delay
        • Increases system utilization
     [Figure: SN slice with Fetch, Decode, Issue and Ex/Mem stages, macro-op generator, bypass $, register file, Gen PC, branch predictor and double buffers]
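
To make the bypass $ idea concrete, here is a minimal software sketch of a small fully associative structure that stores previous results so a later instruction can pick up a value that has not yet reached the register file. The class name `BypassCache`, the default capacity, and the oldest-first replacement policy are assumptions for illustration; the slides do not specify them.

```python
from collections import OrderedDict
from typing import Optional


class BypassCache:
    """Toy model of a bypass cache in the execute stage.

    Fully associative: any entry can hold any destination register.
    Capacity and oldest-first eviction are illustrative assumptions.
    """

    def __init__(self, entries: int = 4) -> None:
        self.entries = entries
        self._results: "OrderedDict[int, int]" = OrderedDict()

    def record(self, dest_reg: int, value: int) -> None:
        """Remember the result just produced for dest_reg."""
        self._results.pop(dest_reg, None)      # keep only the newest value
        self._results[dest_reg] = value
        if len(self._results) > self.entries:
            self._results.popitem(last=False)  # evict the oldest entry

    def lookup(self, src_reg: int) -> Optional[int]:
        """Return a recent result for src_reg, or None on a miss."""
        return self._results.get(src_reg)


cache = BypassCache(entries=2)
cache.record(1, 10)      # r1 <- 10
cache.record(2, 20)      # r2 <- 20
print(cache.lookup(1))   # 10: a dependent instruction reads r1 early
cache.record(3, 30)      # capacity exceeded, the r1 entry is evicted
print(cache.lookup(1))   # None: fall back to the register file value
```

On a miss, the consumer would simply use the operand value that came from the register file, so the structure only has to cover the window in which results are still in flight.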

  11. SN Slice Performance [MICRO'08]
     [Figure: normalized runtime (axis 0 to 6) for wc, mcf, idct, eqn, grep, 3des, rijndael, rawcaudio, rawdaudio, g721encode, g721decode and the mean; bars for SNS + StreamID, SNS + StreamID + Bypass$, and SNS + StreamID + Bypass$ + MOPs; annotation: 10% slowdown]

  12. SN System - scaling to 100+ cores?
     1. Crossbars don't scale well due to wiring / layout complexity
        - Area
        - Delay
        - Power
     2. Interconnection prone to failures
        - Single point of failure
        - Links have no redundancy

  13. StageWeb: Scaling to 100+ cores
     In a large many-core system, small groups of cores can form SN islands. What's the right size for an SN island?
     [Figure: traditional many-core vs. StageWeb many-core built from SN islands]

  14. StageWeb: Scaling to 100+ cores
     In a large many-core system, small groups of cores can form SN islands. What's the right size for an SN island?
     Unfortunately, a single crossbar can't scale to 8-10 pipelines!
     [Figure: regions of good scaling and poor scaling]

  15. Interconnection Alternatives
     1. Connectivity
        • Single
        • Single + Front-Back
        • Overlap
        • Overlap + Front-Back
     [Figure: Islands 1 to 4, each made of Fetch, Decode, Issue, Ex/Mem pipelines, with front-end and back-end halves marked]

  16. Interconnection Alternatives
     1. Connectivity
        • Single
        • Single + Front-Back
        • Overlap
        • Overlap + Front-Back
     2. Reliability
        a) crossbar
        b) crossbar with spares
        c) fault-tolerant crossbar
     [Figure: Islands 1 to 4, each made of Fetch, Decode, Issue, Ex/Mem pipelines; inputs and outputs of the three crossbar types]

  17. Interconnection Configuration
     Faults in stages, crossbar ports, or links force a reconfiguration.
     [Figure: Islands 1 and 2, each made of Fetch, Decode, Issue, Ex/Mem pipelines]

  18. Interconnection Configuration
     • Single crossbar configuration
     • Local to every island
     [Figure: Islands 1 and 2, each made of Fetch, Decode, Issue, Ex/Mem pipelines]

  19. Interconnection Configuration
     • Overlap crossbar configuration
     • Sweep islands, forming pipelines opportunistically
     [Figure: Islands 1 to 3, each made of Fetch, Decode, Issue, Ex/Mem pipelines]
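
The two configuration styles can be compared with a simplified greedy sweep. This is a sketch under stated assumptions rather than the configuration algorithm from the talk: each crossbar is modeled as a window of adjacent slices, a pipeline must take all four stages from slices inside one window, and crossbars and links are assumed fault-free. The function name `sweep` and the window layout are hypothetical.

```python
from typing import List, Sequence

NUM_STAGES = 4  # Fetch, Decode, Issue, Ex/Mem


def sweep(healthy: List[List[bool]], windows: Sequence[range]) -> int:
    """Visit the windows in order and form pipelines opportunistically.

    healthy[s][t] is True when stage t of slice s works. Each window lists
    the slices one crossbar can reach. Stages claimed by an earlier window
    are not reused. Greedy, so not guaranteed to be optimal.
    """
    used = [[False] * NUM_STAGES for _ in healthy]
    pipelines = 0
    for window in windows:
        # Working, still unclaimed stages of each type within this window.
        free = [
            [s for s in window if healthy[s][t] and not used[s][t]]
            for t in range(NUM_STAGES)
        ]
        formed = min(len(candidates) for candidates in free)
        for t in range(NUM_STAGES):
            for s in free[t][:formed]:
                used[s][t] = True
        pipelines += formed
    return pipelines


# Slice 1 has a broken Fetch stage, slice 2 a broken Decode stage.
healthy = [
    [True, True, True, True],
    [False, True, True, True],
    [True, False, True, True],
    [True, True, True, True],
]
single = [range(0, 2), range(2, 4)]                # disjoint islands
overlap = [range(0, 2), range(1, 3), range(2, 4)]  # neighbors share slices

print(sweep(healthy, single))   # 2
print(sweep(healthy, overlap))  # 3
```

In this example the overlapping window lets the working stages of slices 1 and 2 combine into one extra pipeline that the disjoint islands cannot form.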

  20. StageWeb Benefits
     • Scalability: scaling SN to benefit 100+ core systems
     • Interconnection reliability: handling faults in crossbars and links
     • Process variation: slower components can be isolated in a multi-core chip

  21. Mitigating Process Variation
     Severe process variation and lifetime wearout can result in a disparity of health for various resources. StageNet can effectively isolate strong/weak resources.
     [Figure: Fetch, Decode, Issue and Ex/Mem stages regrouped into fast, medium and slow pipelines by frequency]
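
A pipeline can only be clocked as fast as its slowest stage, which is why regrouping stages by speed helps. The sketch below assumes an ideal full crossbar and uses made-up normalized frequencies; `slice_frequencies` and `regrouped_frequencies` are illustrative names, not part of StageWeb.

```python
from typing import List


def slice_frequencies(freqs: List[List[float]]) -> List[float]:
    """Frequency of each fixed pipeline: limited by its slowest stage."""
    return sorted((min(row) for row in freqs), reverse=True)


def regrouped_frequencies(freqs: List[List[float]]) -> List[float]:
    """Frequencies after grouping the k-th fastest stage of every type.

    Assumes any stage can be connected to any other (ideal crossbar).
    """
    by_type = [sorted(column, reverse=True) for column in zip(*freqs)]
    return [min(stages) for stages in zip(*by_type)]


# freqs[slice][stage]: hypothetical normalized maximum frequencies.
freqs = [
    [1.0, 0.7, 1.0, 0.9],
    [0.8, 1.0, 0.7, 1.0],
    [0.9, 0.9, 0.9, 0.8],
]
target = 0.9

print(slice_frequencies(freqs))      # [0.8, 0.7, 0.7]
print(regrouped_frequencies(freqs))  # [1.0, 0.9, 0.7]
print(sum(f >= target for f in slice_frequencies(freqs)))      # 0
print(sum(f >= target for f in regrouped_frequencies(freqs)))  # 2
```

With fixed slices no pipeline meets the 0.9 target, whereas regrouping yields two pipelines that do and confines the slow stages to a single pipeline.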

  22. Evaluation
     • OpenRISC 1200 cores (4-stage in-order)
     • 12 configurations compared (interconnections and crossbar types), 64 cores each
     • Experiments
        • Lifetime evaluations: throughput and total work
        • Process variation: speed binning on a die

  23. Lifetime Reliability Evaluations
     • Monte Carlo simulation with 300+ lifetime experiments
     • Each lifetime experiment involves:
        • Assigning a time-to-failure to all stages
        • Killing components at their failure times
        • Reconfiguring the system to isolate broken components
        • Repeating this until no logical pipeline can be formed
     • Cumulative work and throughput are recorded
     • Number of cores: 64
     • Technology node: 90 nm
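
The lifetime experiment described above can be sketched as a short Monte Carlo loop. This is a simplified illustration, not the authors' simulator: the Weibull time-to-failure distribution and its parameters are assumptions, only stage failures are modeled (no crossbar or link faults), and stage-level reconfiguration assumes an ideal full crossbar. All function names are hypothetical.

```python
import random
from typing import Callable, List

NUM_STAGES = 4  # Fetch, Decode, Issue, Ex/Mem

Counter = Callable[[List[List[bool]]], int]


def core_level(alive: List[List[bool]]) -> int:
    """Pipelines when a broken stage disables its whole core."""
    return sum(all(core) for core in alive)


def stage_level(alive: List[List[bool]]) -> int:
    """Pipelines when working stages can be mixed (ideal crossbar)."""
    return min(sum(core[t] for core in alive) for t in range(NUM_STAGES))


def lifetime_work(cores: int, count: Counter, rng: random.Random,
                  scale: float = 10.0, shape: float = 2.0) -> float:
    """Run one lifetime experiment and return the cumulative work.

    Work is the number of logical pipelines integrated over time.
    scale and shape are assumed Weibull parameters, not measured data.
    """
    # Assign a time-to-failure to every stage and sort the failure events.
    events = sorted(
        (rng.weibullvariate(scale, shape), c, t)
        for c in range(cores) for t in range(NUM_STAGES)
    )
    alive = [[True] * NUM_STAGES for _ in range(cores)]
    work, now = 0.0, 0.0
    for when, c, t in events:
        pipelines = count(alive)       # reconfigure around broken parts
        if pipelines == 0:             # no logical pipeline can be formed
            break
        work += pipelines * (when - now)
        now = when
        alive[c][t] = False            # kill the component at its time
    return work


def mean_work(cores: int, count: Counter, trials: int = 300) -> float:
    """Average cumulative work over several lifetime experiments."""
    return sum(
        lifetime_work(cores, count, random.Random(seed))
        for seed in range(trials)
    ) / trials


print("core-level :", mean_work(64, core_level))
print("stage-level:", mean_work(64, stage_level))
```

Reusing the same seed for both schemes gives them identical failure times, so the difference in cumulative work comes only from the reconfiguration granularity.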

  24. Cumulative Work ~70% more work!

  25. Cumulative Work (area neutral)
     • Best StageWeb configuration
        • Overlapping interconnection network
        • 52 cores
        • 6 adjacent slices connected by each crossbar
        • Fault-tolerant crossbars

  26. Throughput over time

  27. Mitigating Process Variation
     For a given frequency target, StageWeb can operate:
     • More cores, OR
     • The same number of cores at a lower voltage
     [Figure: frequency (Freq) plot]

  28. Conclusions
     • Architectural innovations will be crucial in tackling technological uncertainties
     • StageWeb is a potential solution
        • Allows fine-grained isolation of failures
        • Most reliability gains from grouping 8-10 pipelines
        • Scalable to 100+ cores
     • StageWeb can also mitigate process variation by grouping together faster and slower parts

  29. Thank You http://cccp.eecs.umich.edu

  30. Back up slides

  31. Impact of Defects on CMP Yield

  32. Overlapping Network

  33. Simple + 2nd Level Crossbars

  34. Overlapping + 2nd Level Crossbar

  35. Interconnection Alternatives
     1. Connectivity
        • Simple
        • Simple + Front-Back
        • Overlap
        • Overlap + Front-Back
     2. Reliability
        a) crossbar
        b) crossbar with spares
        c) fault-tolerant crossbar
     [Figure: Islands 1 and 2, each made of Fetch, Decode, Issue, Ex/Mem pipelines, with front-end and back-end halves marked; inputs and outputs of the three crossbar types]

  36. SN System Level Issues
     1. Crossbars don't scale well due to wiring / layout complexity
        - Area
        - Delay
        - Power
     2. Interconnection prone to failures
        - Single point of failure
        - Links have no redundancy
