
Impact of Soft Errors on Large-Scale FPGA Cloud Computing


Presentation Transcript


  1. Impact of Soft Errors on Large-Scale FPGA Cloud Computing • Andrew M. Keller and Michael J. Wirthlin • National Science Foundation SHREC, Brigham Young University, Provo, Utah • Configurable Computing Lab, Electrical & Computer Engineering • Supported by the Utah Space Grant Consortium and by the I/UCRC Program of the National Science Foundation under Grant No. 1738550

  2. Overview • Terrestrial Soft Errors • Soft Errors in FPGAs • FPGAs in the Cloud • Large-Scale System Failure Rates • SEU Detection and Recovery

  3. Terrestrial Soft Errors • Radiation on Earth can cause soft errors • Primary sources of terrestrial radiation: • Cosmic rays • High-energy neutrons (>10 MeV) • Thermal neutrons (<0.4 eV) • Alpha particles (4 to 9 MeV) • NYC reference flux: • 13 high-energy neutrons cm⁻² h⁻¹ • 6.6 to 10 thermal neutrons cm⁻² h⁻¹ • [Figure: a primary cosmic ray produces a complex cascade of secondary, tertiary, … particles (muons, pions, protons, neutrons); a secondary neutron particle strikes an IC]

  4. Single Event Upset (SEU) • A change in the value of a memory element caused by a single energetic particle strike • [Figure: an ionizing particle passes through an N-type MOSFET (source, gate, drain), depositing charge in the substrate via hole and electron currents; the collected charge flips the stored bit from its original value – an SEU]

  5. Soft Error Rates (SER) • [Figure: relative neutron flux compared to NYC] • [Figure: neutron configuration memory (CRAM) SER for two 28-nm FPGAs]

  6. Soft Errors in FPGAs • [Figure: breakdown of state in a Stratix V GX A7 FPGA] • Most state is dedicated to configuration memory • Upsets in configuration memory directly affect the underlying circuitry • [Figure: example circuit (lookup table and flip-flop) before and after a configuration bit flip]

  7. FPGAs in Cloud Computing • Hundreds of thousands of FPGAs are being used in cloud computing • 50,000 Stratix V FPGAs in Catapult • 200,000+ Stratix 10 FPGAs in Brainwave • AWS F1 instances available in 4 sites • Common to network FPGAs and/or couple them with a CPU host • With so many FPGAs deployed, soft errors must be addressed (Photo credits: Microsoft, Intel)

  8. MTTU of a Single Upset within an FPGA Cluster • More devices means more upsets • Catapult example: with an MTTU of 1025 days per node (~8x flux) and 50,000 nodes, the cluster sees an upset roughly every half hour, about 2 per hour (a small scaling sketch follows)
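
A minimal sketch (in Python, not part of the original slides) of the MTTU scaling behind this example, assuming upsets on each node are independent and occur at a constant rate:

```python
# Cluster-level mean time to upset (MTTU): with independent nodes, the
# cluster upset rate is just the per-node rate times the node count.

def cluster_mttu_hours(node_mttu_days: float, num_nodes: int) -> float:
    """Return the cluster-wide MTTU in hours."""
    node_rate_per_hour = 1.0 / (node_mttu_days * 24.0)  # upsets per hour, one node
    cluster_rate = node_rate_per_hour * num_nodes       # upsets per hour, whole cluster
    return 1.0 / cluster_rate

# Catapult-style example from the slide: 1025 days per node, 50,000 nodes.
mttu = cluster_mttu_hours(1025, 50_000)
print(f"Cluster MTTU ≈ {mttu:.2f} hours (≈ {1 / mttu:.1f} upsets per hour)")
```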

  9. System Failure Modes • Host unresponsive • FPGA unavailable • Silent data corruption – difficult to detect • [Figure: host CPU coupled to an FPGA over PCIe, with shared DRAM, a DDR controller, and an acceleration task; numbers mark where each failure mode manifests]

  10. How Susceptible Are Designs to Failure? • Not all upsets in CRAM cause failure (unused resources, masked faults, …) • The probability that an upset in CRAM results in failure varies by application • Probability of failure given an upset = architectural vulnerability factor (AVF) • Experimental goals: • Measure the AVF of several designs for each failure mode • Compare AVF against a critical-bits predictor • Apply failure rates to a large-scale system
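
As a rough illustration (not taken from the slides), the AVF relation reduces to a one-line formula; the 20-year MTTU and 10% AVF used here are example values:

```python
# AVF ties the upset rate to the failure rate:
#   failure_rate = upset_rate * AVF,  so  MTTF = MTTU / AVF.
# The inputs below are illustrative, not measured results from the paper.

def mttf_from_mttu(mttu_years: float, avf: float) -> float:
    """Mean time to failure given mean time to upset and the design's AVF."""
    return mttu_years / avf

print(mttf_from_mttu(20.0, 0.10))  # 20-year MTTU with a 10% AVF -> 200-year MTTF
```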

  11. Failure Rate Estimation Through Fault Injection • Goal of fault injection: examine the AVF of several designs • Test infrastructure: • Host – Dell Precision T7610, Xeon, 16 GB RAM • FPGA – Intel Stratix V GX A7 (DE5-Net board) • Faults injected into CRAM using Quartus • [Figure: experiment setup for fault injection testing – a test operator host connected to the FPGA via USB-to-JTAG and Ethernet, with FPGA power control and a fault-injection debugger]

  12. Fault Injection Flow • [Figure: flowchart – start from a working state, inject a fault, run the application, perform diagnostics, repair/recover, then repeat until done] • Takes about 1 minute to test each fault • Includes reprogramming the FPGA, injecting the fault, running test vectors, and recovery • Injected 130,000+ faults across 15 designs – statistically significant • Represents 100 days of continuous testing • A simplified sketch of this loop follows
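
A simplified Python sketch of this loop (not the authors' actual test harness; the helper below stubs out the hardware steps with made-up outcome probabilities so the snippet runs stand-alone):

```python
import random

def inject_and_test(_bit_index: int) -> str:
    """Stub for: inject one CRAM fault, run test vectors, classify the outcome.
    The outcome probabilities here are arbitrary placeholders."""
    return random.choices(
        ["no_effect", "sdc", "fpga_unavailable", "host_unresponsive"],
        weights=[0.95, 0.03, 0.015, 0.005],
    )[0]

def fault_injection_campaign(num_cram_bits: int, num_faults: int):
    counts = {"no_effect": 0, "sdc": 0, "fpga_unavailable": 0, "host_unresponsive": 0}
    for _ in range(num_faults):
        bit = random.randrange(num_cram_bits)   # pick a random configuration bit
        counts[inject_and_test(bit)] += 1       # inject, run test vectors, diagnose
        # (real flow: repair/recover the FPGA here before the next injection)
    avf = 1 - counts["no_effect"] / num_faults  # fraction of upsets that caused failure
    return avf, counts

# CRAM size below is a placeholder, not the actual Stratix V bit count.
print(fault_injection_campaign(num_cram_bits=50_000_000, num_faults=10_000))
```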

  13. Benchmark Designs • Intel's OpenCL example FPGA designs, using 20-80% of the device • [Figure: device resource utilization per design] • Critical bits: bits used by the design that may cause failure if upset • Ratio is the percentage of ALMs used to the percentage of critical bits

  14. Results • 10% sensitivity corresponds to an MTTF of approximately 180 years • The dominant failure mode is SDC • Critical-bits percentage compared to AVF gives a relative sensitivity: fewer than 1 in 5 critical bits cause failure if upset • Larger designs tend to be more sensitive, but this varies by application
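
The relative sensitivity on this slide is just the ratio of measured AVF to the critical-bit fraction; a small example with illustrative numbers (not values from any specific benchmark):

```python
def relative_sensitivity(avf: float, critical_bit_fraction: float) -> float:
    """Fraction of critical bits that actually cause a failure when upset."""
    return avf / critical_bit_fraction

# e.g. a design with 40% of CRAM bits reported critical but only a 6% AVF:
print(relative_sensitivity(0.06, 0.40))  # 0.15 -> fewer than 1 in 5 critical bits
```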

  15. Large-Scale System Failure Rates • Consider a 100,000-node system deployed in Denver, CO (3.8x flux increase) • FIT – failures in time: failures per billion hours of operation • [Table: failure rates for a 100,000-node system in Denver, CO]
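
A sketch of the scaling behind this large-scale example, assuming the per-device upset rate scales linearly with neutron flux and node count (the 20-year NYC MTTU comes from the conclusion slide; the 10% AVF is illustrative):

```python
HOURS_PER_YEAR = 8766

def system_failure_stats(node_mttu_nyc_years, flux_factor, num_nodes, avf):
    node_mttu_h = node_mttu_nyc_years * HOURS_PER_YEAR / flux_factor
    system_mttu_h = node_mttu_h / num_nodes  # upsets arrive num_nodes times faster
    system_mttf_h = system_mttu_h / avf      # only a fraction of upsets cause failure
    fit = 1e9 / system_mttf_h                # system failures per billion hours (FIT)
    return system_mttu_h, system_mttf_h, fit

mttu_h, mttf_h, fit = system_failure_stats(20, 3.8, 100_000, 0.10)
print(f"system MTTU ≈ {mttu_h * 60:.0f} min, MTTF ≈ {mttf_h:.1f} h, FIT ≈ {fit:.2e}")
# The slides quote ~25 min MTTU; small differences come from rounding of the inputs.
```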

  16. Reliable Computing and Mission Time • MTTF reflects the overall failure rate • Mission time (MT) reflects the amount of time the system stays above a certain reliability • 63% of failures occur before the MTTF; some occur much earlier • [Figure: reliability curve over time, marking the mission time at a reliability of 0.99 and the MTTF]
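
Under the usual constant-failure-rate (exponential) model, R(t) = exp(-t/MTTF), so MT(R) = -MTTF * ln(R), and R(MTTF) = exp(-1) ≈ 0.37, which is where the "63% of failures occur before the MTTF" figure comes from. The model itself is an assumption here; the slides only state the resulting numbers:

```python
import math

def mission_time_hours(mttf_hours: float, reliability: float) -> float:
    """Hours the system stays at or above the target reliability (exponential model)."""
    return -mttf_hours * math.log(reliability)

# Example from the conclusion slide: a 3.8-hour MTTF gives MT(0.99) of about 2 minutes.
print(f"{mission_time_hours(3.8, 0.99) * 60:.1f} minutes")  # ≈ 2.3 minutes
```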

  17. SEU Detection and Recovery • Vendors provide SEU detection and repair mechanisms – CRAM scrubbing • A Stratix V scrub takes 47 ms to 24.2 sec • With a 100 ms scrub cycle and a 100 MHz clock, an upset persists for 50 ms on average – about 5 million clock cycles • Scrubbing does not eliminate failures • [Figure: configuration memory scrubbing timeline – a scrub cycle completes, an SEU occurs and SDC propagates, a later scrub cycle detects, corrects, and reports the SEU, and the SDC either flushes out or remains]
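
The exposure-window arithmetic behind the "5 million clock cycles" figure, as a back-of-the-envelope sketch assuming the upset occurs at a uniformly random point in the scrub cycle:

```python
scrub_cycle_s = 0.100               # 100 ms scrub cycle
clock_hz = 100e6                    # 100 MHz user clock
avg_exposure_s = scrub_cycle_s / 2  # on average the upset sits half a cycle uncorrected
print(avg_exposure_s * clock_hz)    # 5,000,000 clock cycles before correction
```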

  18. SEU Response Options • Scrubbing can restore the system, but any SDC that occurred goes unaddressed • Possible responses: • Disable scrubbing and respond to detectable failures • Enable scrubbing and respond only when a failure is detected • Respond every time an SEU is detected • Respond only to upsets in the most critical regions

  19. Conclusion • One FPGA in NYC – 20-year MTTU; 100,000 FPGAs in Denver, CO – 25-minute MTTU • 0.3% to 11.6% AVF across 15 designs, depending on size and application • Fewer than 20% of critical bits cause failure when upset • Dominant failure mode: silent data corruption (3.8-hour MTTF in the example) • MTTF vs. mission time: a 3.8-hour MTTF gives only 2 minutes for MT(0.99) • Configuration scrubbing helps restore the system but does not eliminate SDC • Users should consider failure rates, use available SEU features, and plan a response • Ongoing work: • Neutron radiation testing of the same experiments – December 2018 – NSREC • Identifying critical regions and protecting them with mitigation techniques • Reliability modeling of partial replication for mission-critical applications
