
Impact of Soft Errors on Large-Scale FPGA Cloud Computing


Presentation Transcript


  1. Impact of Soft Errors on Large-Scale FPGA Cloud Computing • Andrew M. Keller and Michael J. Wirthlin • National Science Foundation SHREC, Brigham Young University, Provo, Utah • Configurable Computing Lab, Electrical & Computer Engineering • Supported by the Utah Space Grant Consortium and by the I/UCRC Program of the National Science Foundation under Grant No. 1738550

  2. Overview • Terrestrial Soft Errors • Soft Errors in FPGAs • FPGAs in the Cloud • Large-Scale System Failure Rates • SEU Detection and Recovery

  3. Terrestrial Soft Errors • Radiation on Earth can cause soft errors • Primary sources of terrestrial radiation: • Cosmic rays • High-energy neutrons (>10 MeV) • Thermal neutrons (<0.4 eV) • Alpha particles (4 to 9 MeV) • NYC reference flux: • 13 high-energy neutrons cm⁻² h⁻¹ • 6.6 to 10 thermal neutrons cm⁻² h⁻¹ • [Figure: a primary cosmic ray produces a complex cascade of secondary, tertiary, … particles (muons, pions, protons, neutrons); a secondary neutron particle strikes an IC]

  4. Single Event Upset (SEU) • A change in the value of a memory element caused by a single energetic particle strike • [Figure: an ionizing particle passes through an N-type MOSFET (source, gate, drain), depositing charge in the substrate via hole and electron currents; the collected charge flips the stored bit from its original value – an SEU]

  5. Soft Error Rates (SER) • [Figure: relative neutron flux compared to NYC] • [Figure: neutron configuration memory (CRAM) SER for two 28-nm FPGAs]

  6. Soft Errors in FPGAs • [Figure: breakdown of state in a Stratix V GX A7 FPGA] • Most state is dedicated to configuration memory • Upsets in configuration memory directly affect the underlying circuitry • [Figure: example circuit (lookup table and flip-flop) before and after a configuration bit flip]

  7. FPGAs in Cloud Computing • Hundreds of thousands of FPGAs are being used in cloud computing • 50,000 Stratix V FPGAs in Catapult • 200,000+ Stratix 10 FPGAs in Brainwave • AWS F1 instances available in 4 sites • Common to network FPGAs and/or couple them with a CPU host • With so many FPGAs deployed, soft errors must be addressed (Photo credits: Microsoft, Intel)

  8. MTTU of a Single Upset within an FPGA Cluster • More devices means more upsets • Catapult example: with an MTTU of 1025 days per node (~8x flux) and 50,000 nodes, the cluster sees an upset roughly every half hour, about 2 per hour (a small scaling sketch follows)
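
A minimal sketch (in Python, not part of the original slides) of the MTTU scaling behind this example, assuming upsets on each node are independent and occur at a constant rate:

```python
# Cluster-level mean time to upset (MTTU): with independent nodes, the
# cluster upset rate is just the per-node rate times the node count.

def cluster_mttu_hours(node_mttu_days: float, num_nodes: int) -> float:
    """Return the cluster-wide MTTU in hours."""
    node_rate_per_hour = 1.0 / (node_mttu_days * 24.0)  # upsets per hour, one node
    cluster_rate = node_rate_per_hour * num_nodes       # upsets per hour, whole cluster
    return 1.0 / cluster_rate

# Catapult-style example from the slide: 1025 days per node, 50,000 nodes.
mttu = cluster_mttu_hours(1025, 50_000)
print(f"Cluster MTTU ≈ {mttu:.2f} hours (≈ {1 / mttu:.1f} upsets per hour)")
```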

  9. System Failure Modes • Host unresponsive • FPGA unavailable • Silent data corruption – difficult to detect • [Figure: host CPU coupled to an FPGA over PCIe, with shared DRAM, a DDR controller, and an acceleration task; numbers mark where each failure mode manifests]

  10. How Susceptible Are Designs to Failure? • Not all upsets in CRAM cause failure (unused resources, masked faults, …) • The probability that an upset in CRAM results in failure varies by application • Probability of failure given an upset = architectural vulnerability factor (AVF) • Experimental goals: • Measure the AVF of several designs for each failure mode • Compare AVF against a critical-bits predictor • Apply failure rates to a large-scale system
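
As a rough illustration (not taken from the slides), the AVF relation reduces to a one-line formula; the 20-year MTTU and 10% AVF used here are example values:

```python
# AVF ties the upset rate to the failure rate:
#   failure_rate = upset_rate * AVF,  so  MTTF = MTTU / AVF.
# The inputs below are illustrative, not measured results from the paper.

def mttf_from_mttu(mttu_years: float, avf: float) -> float:
    """Mean time to failure given mean time to upset and the design's AVF."""
    return mttu_years / avf

print(mttf_from_mttu(20.0, 0.10))  # 20-year MTTU with a 10% AVF -> 200-year MTTF
```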

  11. Failure Rate Estimation Through Fault Injection • Goal of fault injection: examine the AVF of several designs • Test infrastructure: • Host – Dell Precision T7610, Xeon, 16 GB RAM • FPGA – Intel Stratix V GX A7 (DE5-Net board) • Faults injected into CRAM using Quartus • [Figure: experiment setup for fault injection testing – a test operator host connected to the FPGA via USB-to-JTAG and Ethernet, with FPGA power control and a fault-injection debugger]

  12. Fault Injection Flow • [Figure: flowchart – start from a working state, inject a fault, run the application, perform diagnostics, repair/recover, then repeat until done] • Takes about 1 minute to test each fault • Includes reprogramming the FPGA, injecting the fault, running test vectors, and recovery • Injected 130,000+ faults across 15 designs – statistically significant • Represents 100 days of continuous testing • A simplified sketch of this loop follows
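
A simplified Python sketch of this loop (not the authors' actual test harness; the helper below stubs out the hardware steps with made-up outcome probabilities so the snippet runs stand-alone):

```python
import random

def inject_and_test(_bit_index: int) -> str:
    """Stub for: inject one CRAM fault, run test vectors, classify the outcome.
    The outcome probabilities here are arbitrary placeholders."""
    return random.choices(
        ["no_effect", "sdc", "fpga_unavailable", "host_unresponsive"],
        weights=[0.95, 0.03, 0.015, 0.005],
    )[0]

def fault_injection_campaign(num_cram_bits: int, num_faults: int):
    counts = {"no_effect": 0, "sdc": 0, "fpga_unavailable": 0, "host_unresponsive": 0}
    for _ in range(num_faults):
        bit = random.randrange(num_cram_bits)   # pick a random configuration bit
        counts[inject_and_test(bit)] += 1       # inject, run test vectors, diagnose
        # (real flow: repair/recover the FPGA here before the next injection)
    avf = 1 - counts["no_effect"] / num_faults  # fraction of upsets that caused failure
    return avf, counts

# CRAM size below is a placeholder, not the actual Stratix V bit count.
print(fault_injection_campaign(num_cram_bits=50_000_000, num_faults=10_000))
```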

  13. Benchmark Designs • Intel's OpenCL example FPGA designs, using 20-80% of the device • [Figure: device resource utilization per design] • Critical bits: bits used by the design that may cause failure if upset • Ratio is the percentage of ALMs used to the percentage of critical bits

  14. Results • 10% sensitivity corresponds to an MTTF of approximately 180 years • The dominant failure mode is SDC • Critical-bits percentage compared to AVF gives a relative sensitivity: fewer than 1 in 5 critical bits cause failure if upset • Larger designs tend to be more sensitive, but this varies by application
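
The relative sensitivity on this slide is just the ratio of measured AVF to the critical-bit fraction; a small example with illustrative numbers (not values from any specific benchmark):

```python
def relative_sensitivity(avf: float, critical_bit_fraction: float) -> float:
    """Fraction of critical bits that actually cause a failure when upset."""
    return avf / critical_bit_fraction

# e.g. a design with 40% of CRAM bits reported critical but only a 6% AVF:
print(relative_sensitivity(0.06, 0.40))  # 0.15 -> fewer than 1 in 5 critical bits
```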

  15. Large-Scale System Failure Rates • Consider a 100,000-node system deployed in Denver, CO (3.8x flux increase) • FIT – failures in time: failures per billion hours of operation • [Table: failure rates for a 100,000-node system in Denver, CO]
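
A sketch of the scaling behind this large-scale example, assuming the per-device upset rate scales linearly with neutron flux and node count (the 20-year NYC MTTU comes from the conclusion slide; the 10% AVF is illustrative):

```python
HOURS_PER_YEAR = 8766

def system_failure_stats(node_mttu_nyc_years, flux_factor, num_nodes, avf):
    node_mttu_h = node_mttu_nyc_years * HOURS_PER_YEAR / flux_factor
    system_mttu_h = node_mttu_h / num_nodes  # upsets arrive num_nodes times faster
    system_mttf_h = system_mttu_h / avf      # only a fraction of upsets cause failure
    fit = 1e9 / system_mttf_h                # system failures per billion hours (FIT)
    return system_mttu_h, system_mttf_h, fit

mttu_h, mttf_h, fit = system_failure_stats(20, 3.8, 100_000, 0.10)
print(f"system MTTU ≈ {mttu_h * 60:.0f} min, MTTF ≈ {mttf_h:.1f} h, FIT ≈ {fit:.2e}")
# The slides quote ~25 min MTTU; small differences come from rounding of the inputs.
```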

  16. Reliable Computing and Mission Time • MTTF reflects the overall failure rate • Mission time (MT) reflects the amount of time the system stays above a certain reliability • 63% of failures occur before the MTTF; some occur much earlier • [Figure: reliability curve over time, marking the mission time at a reliability of 0.99 and the MTTF]
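
Under the usual constant-failure-rate (exponential) model, R(t) = exp(-t/MTTF), so MT(R) = -MTTF * ln(R), and R(MTTF) = exp(-1) ≈ 0.37, which is where the "63% of failures occur before the MTTF" figure comes from. The model itself is an assumption here; the slides only state the resulting numbers:

```python
import math

def mission_time_hours(mttf_hours: float, reliability: float) -> float:
    """Hours the system stays at or above the target reliability (exponential model)."""
    return -mttf_hours * math.log(reliability)

# Example from the conclusion slide: a 3.8-hour MTTF gives MT(0.99) of about 2 minutes.
print(f"{mission_time_hours(3.8, 0.99) * 60:.1f} minutes")  # ≈ 2.3 minutes
```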

  17. SEU Detection and Recovery • Vendors provide SEU detection and repair mechanisms – CRAM scrubbing • A Stratix V scrub takes 47 ms to 24.2 sec • With a 100 ms scrub cycle and a 100 MHz clock, an upset persists for 50 ms on average – about 5 million clock cycles • Scrubbing does not eliminate failures • [Figure: configuration memory scrubbing timeline – a scrub cycle completes, an SEU occurs and SDC propagates, a later scrub cycle detects, corrects, and reports the SEU, and the SDC either flushes out or remains]
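
The exposure-window arithmetic behind the "5 million clock cycles" figure, as a back-of-the-envelope sketch assuming the upset occurs at a uniformly random point in the scrub cycle:

```python
scrub_cycle_s = 0.100               # 100 ms scrub cycle
clock_hz = 100e6                    # 100 MHz user clock
avg_exposure_s = scrub_cycle_s / 2  # on average the upset sits half a cycle uncorrected
print(avg_exposure_s * clock_hz)    # 5,000,000 clock cycles before correction
```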

  18. SEU Response Options • Scrubbing can restore the system, but any SDC that occurred goes unaddressed • Possible responses: • Disable scrubbing and respond to detectable failures • Enable scrubbing and respond only when a failure is detected • Respond every time an SEU is detected • Respond only to upsets in the most critical regions

  19. Conclusion • One FPGA in NYC – 20-year MTTU; 100,000 FPGAs in Denver, CO – 25-minute MTTU • 0.3% to 11.6% AVF across 15 designs, depending on size and application • Fewer than 20% of critical bits cause failure when upset • Dominant failure mode: silent data corruption (3.8-hour MTTF in the example) • MTTF vs. mission time: a 3.8-hour MTTF gives only 2 minutes for MT(0.99) • Configuration scrubbing helps restore the system but does not eliminate SDC • Users should consider failure rates, use available SEU features, and plan a response • Ongoing work: • Neutron radiation testing of the same experiments – December 2018 – NSREC • Identifying critical regions and protecting them with mitigation techniques • Reliability modeling of partial replication for mission-critical applications
