From Anonymity to Ubiquity: A Study of Our Increasing Reliance on Fault-Tolerant Computing

Elwin Ong, MIT SERL, NASA Goddard OLD, December 9, 2003 1

Abstract This presentation introduces the role of fault tolerance in major computing systems. A literature review outlines some fundamental elements of the field. A comparison and discussion of the application of fault tolerance in three classes of safety-critical systems (spacecraft, aircraft, and automotive) follows. Aerospace systems to be discussed include the Space Shuttle, Hubble Space Telescope, Galileo, Landsat7, ST-5, New Horizons, and C-17. There is also a short overview of the time-triggered protocols TTP/C and FlexRay, intended for automotive drive-by-wire systems. 2

Background • How I came to be at Goddard and OLD • Educational Background • UCLA Aerospace Engineering • Boeing Satellite Systems • MIT Aero/Astro • Systems Engineering Research Lab • Nancy Leveson • Safety-Critical Systems • Fault Tolerant Systems 3

Purpose of Study • What I hope to gain for myself • In-depth review of fault tolerance • Catch up on the state of the art • Investigate applications of fault tolerance • Become more familiar with the spacecraft design process 4

Purpose of Study • What I hope you will gain • A review of fault tolerance • An overview of fault tolerance in various safety-critical industries • Opportunities to learn and improve upon existing techniques 5

Purpose of Study • What I hope to gain from you • An active discussion of fault tolerance as it is currently practiced in your projects • What are good practices? What works? What doesn’t? • Suggestions for advancements in the field 6

Presentation Outline • Literature Review • Spacecraft Fault Tolerance • Aircraft Fault Tolerance • Automotive Fault Tolerance • Discussion & Conclusion 7

Literature Review Outline • What is Fault Tolerance? • Define scope of study • Fault Tolerance Techniques • Fault Intolerance • Fault Detection and Reconfiguration • Fault Masking and Reconfiguration • What about Software? 8

What is a Fault? • There are various definitions • Must first identify scope: • Computationally intensive systems • Real Time and Safety-Critical (and Distributed) • Spacecraft • Modern Aircraft Systems • Automotive x-by-Wire, drive train controllers • Nuclear and Chemical Processing, Maritime systems, IT Networks, etc. 9

Definition of a Fault • Fault: An incorrect state of hardware or software resulting from failures of components, physical interference from the environment, operator error, or incorrect design. • Error: The manifestation of a fault. • Failure: A deviation of the delivered service from the specified service, caused by an error or fault. 10

Fault Classifications There are various classification methods: Based on Lala & Harper, IEEE 1994 11

Fault Classifications 12

Fault Distribution Models • Permanent Fault Distribution Models • Exponential Distribution • Weibull Distribution • Geometric Distribution • Must match sampled data to distribution models • MIL-HDBK-217 Model • Various Intermittent and Transient Fault Models 13
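
The exponential and Weibull models named above can be written as reliability functions R(t), the probability that a component has not yet failed permanently by time t. The following is a minimal Python sketch; the function and parameter names are illustrative assumptions, not taken from MIL-HDBK-217.

```python
import math

def exponential_reliability(t, failure_rate):
    """R(t) = exp(-lambda * t): constant failure rate lambda."""
    return math.exp(-failure_rate * t)

def weibull_reliability(t, scale, shape):
    """R(t) = exp(-(t / scale) ** shape).

    shape < 1 models a decreasing failure rate (infant mortality),
    shape > 1 an increasing one (wear-out); shape == 1 reduces to
    the exponential model with lambda = 1 / scale.
    """
    return math.exp(-((t / scale) ** shape))

# Hypothetical numbers: lambda = 1e-3 failures per hour, 1000-hour mission.
r_exp = exponential_reliability(1000.0, 1e-3)
r_wei = weibull_reliability(1000.0, scale=1000.0, shape=1.0)
```

With shape set to 1 the two functions agree, which is a quick sanity check when fitting sampled failure data to either model.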

How to Defeat Faults • Fault Intolerance/Prevention Methods • Fault Tolerant Methods • Redundancy • Fault Detection and Reconfiguration • Fault Masking • Software Fault Tolerance 14

Fault Tolerance Taxonomy 15

Fault Intolerant Techniques • Increase Signal-to-Noise Ratio • Lower Power Dissipation • Burn-in Testing • Factors that most affect failure rates • Environment • Quality • Complexity • See MIL-HDBK-217E, NASA Standards 16

Fault Tolerant Systems • Redundancy • Fault Detection & Reconfiguration • Duplication, Error Detecting Codes, Self-tests, Self-Checking Pairs, etc. • Fault Masking & Reconfiguration • Error Correcting Codes, TMR, NMR • Issues related to Fault Tolerant Systems 17

Redundancy • All Fault Tolerant Systems employ redundancy • Forms of Redundancy • Temporal (Retry, Restart) • Physical (Duplication) • Functional (Analytical Modeling) • “The only thing (redundancy) guarantees is a higher fault arrival rate compared to a non-redundant system…” [Lala & Harper, IEEE 1994] 18

Fault Detection & Reconfig. • Based on simplex systems with active or passive backups. • Requires accurate fault detection • Employs all 3 types of redundancy • Common on unmanned spacecraft 19

Duplication • Simplest technique • Compare two identical copies • Fault identified when copies diverge • Does not identify which copy has failed • Use in conjunction with other techniques 20

Error Detecting Codes • Employ physical redundancy • Use extra bits in transmission • Hamming Distance: • The number of bit positions in which two code words differ. • Minimum distance, d, of a code is defined as the minimum Hamming distance found between any 2 code words. • Number of errors detectable, t, satisfies t < d (at most d - 1) 21

Hamming Distance 22
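
The Hamming distance and minimum distance defined on the previous slide can be computed directly. This is a minimal Python sketch; the four-word example code (two data bits plus an even-parity bit) is chosen here for illustration and does not come from the original slides.

```python
from itertools import combinations

def hamming_distance(a, b):
    """Number of bit positions in which two equal-length code words differ."""
    if len(a) != len(b):
        raise ValueError("code words must have the same length")
    return sum(x != y for x, y in zip(a, b))

def minimum_distance(code):
    """Smallest Hamming distance between any two distinct code words."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))

# Illustrative code: every 3-bit word with an even number of 1s.
code = ["000", "011", "101", "110"]
d = minimum_distance(code)
detectable = d - 1  # largest t satisfying t < d
```

For this code d is 2, so it detects any single-bit error but cannot detect all double-bit errors.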

Parity Checks • Use 1 extra bit at the end of a word • Simplest and least expensive • Detects all single bit errors and all errors that involve an odd number of bits • Odd parity or even parity check • All 0’s failure • All 1’s failure • Ex. MIL-STD-1553 23
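
A minimal Python sketch of an even-parity check, assuming a word is represented as a list of 0/1 integers (the function names are illustrative). It shows why an odd number of flipped bits is detected while an even number is not.

```python
def add_even_parity(bits):
    """Append one parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def check_even_parity(word):
    """True if the word, including its parity bit, has an even number of 1s."""
    return sum(word) % 2 == 0

word = add_even_parity([1, 0, 1, 1])

one_flip = word.copy()
one_flip[0] ^= 1             # single-bit error: detected

two_flips = one_flip.copy()
two_flips[1] ^= 1            # second error cancels the first: not detected
```

Here check_even_parity(word) holds, check_even_parity(one_flip) fails, and check_even_parity(two_flips) passes again even though the word is corrupted.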

Checksums • Form the checksum of a block of s words by adding together all of the words in the block modulo n, where n is arbitrary. • Takes a long time to detect faults; not well suited to online processing. • Low diagnostic resolution: the fault can be in the block of s words, the stored checksum, or the checking circuitry. • Ex. Hard Drives 24

Checksum Example 25
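
As a stand-in for the example, here is a minimal Python sketch of a modulo-n checksum over a block of words. The block contents and the choice n = 256 are assumptions made for illustration.

```python
def checksum(block, n=256):
    """Sum of all words in the block, modulo n."""
    return sum(block) % n

def verify(block, stored_checksum, n=256):
    """True if the recomputed checksum matches the stored one."""
    return checksum(block, n) == stored_checksum

block = [0x12, 0xA4, 0x7F, 0x03]   # hypothetical block of s = 4 words
stored = checksum(block)

corrupted = block.copy()
corrupted[2] ^= 0x10               # flip one bit in the third word
```

A mismatch only says that something in the block, the stored checksum, or the checker is wrong; it does not say which word failed, which is the low diagnostic resolution noted above.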

Cyclic Codes • Cyclic Redundancy Check (CRC) • Easy to Implement with XOR gates • Detects all single errors, all burst errors of length b ≤ (n-k) • Ex. CDs, TTP/C, FlexRay Protocols 26
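
The shift-and-XOR structure of a CRC can be sketched in a few lines of Python. The 8-bit generator polynomial x^8 + x^2 + x + 1 (0x07) is an assumption chosen for brevity; it is not the polynomial specified by TTP/C or FlexRay.

```python
def crc8(data, poly=0x07):
    """Bitwise CRC-8, MSB first, initial value 0, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                # Top bit set: shift left and XOR in the generator polynomial.
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc

message = b"fault"                          # hypothetical payload
frame = message + bytes([crc8(message)])    # append the check byte

# Receiver side: a CRC of zero over the whole frame means no error detected.
frame_ok = crc8(frame) == 0
```

With this configuration, recomputing the CRC over message plus check byte yields zero for an intact frame, and any single flipped bit makes it nonzero.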

Control Flow Monitoring • Used to detect Sequential Errors 27

Self-Tests • Built-in-Tests • Exercise part or all of circuit and logic and compare to oracle • Extensive use in aerospace systems • Consistency & Sanity Checks • Capability Checks • Watchdog Timers • Implemented in Hardware or Software 28
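
One of the self-test mechanisms listed above, the watchdog timer, can be sketched in software as follows. This is a minimal Python illustration with hypothetical names; time is passed in explicitly so the behavior is deterministic, whereas a real watchdog would read a hardware counter.

```python
class WatchdogTimer:
    """Declares a fault if the monitored task fails to check in on time."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_kick = 0.0

    def kick(self, now):
        """Called periodically by the healthy task to reset the timer."""
        self.last_kick = now

    def expired(self, now):
        """True if more than `timeout` has elapsed since the last kick."""
        return now - self.last_kick > self.timeout

wd = WatchdogTimer(timeout=1.0)
wd.kick(now=0.0)
healthy = not wd.expired(now=0.5)   # task checked in recently
hung = wd.expired(now=1.5)          # no kick for 1.5 time units
```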

Self-Checking Pairs • Combination of Duplication and Self Tests 29

Self-Checking Variations 30

Model-Based Diagnosis • Employs Analytic Redundancy • Compare actual components with an analytic model (mathematical model) • Depends on the validity of the model, and the ability to accurately model a system • Relatively straightforward for linear systems, difficult for nonlinear systems (most software-based systems) 31

Analytical Redundancy 32

Model-Based Diagnosis • Residual Generation & Decision-Making 33
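
Residual generation and decision-making can be illustrated with a minimal Python sketch: the residual is the difference between a measurement and the model prediction, and the decision step flags samples whose residual exceeds a threshold. The data, the threshold, and the function names are assumptions for illustration only.

```python
def residuals(measured, predicted):
    """Residual generation: measurement minus model prediction, per sample."""
    return [m - p for m, p in zip(measured, predicted)]

def faulty_samples(measured, predicted, threshold):
    """Decision-making: indices whose residual magnitude exceeds the threshold."""
    return [i for i, r in enumerate(residuals(measured, predicted))
            if abs(r) > threshold]

measured = [1.0, 2.1, 2.9, 6.0]    # hypothetical sensor readings
predicted = [1.0, 2.0, 3.0, 4.0]   # hypothetical model output
flagged = faulty_samples(measured, predicted, threshold=0.5)
```

The threshold trades missed detections against false alarms, and its value depends on how well the model matches the real system.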

Parameter Estimation • Based on assumption that faults are reflected in the physical system parameters such as friction, mass, viscosity, resistance, capacitance, etc. • Compare online estimations and measurements with parameters of model to identify faults. 34

Livingstone Engine • Developed at NASA Ames • Livingstone accepts a model of the components of a complex system such as a spacecraft or chemical plant and infers from them the overall behavior of the system. 35

Fault Masking Techniques • Mask faults by “out-voting” failed components • Error Correcting Codes • Triple Modular Redundancy (TMR) • NMR • Extensive applications in aircraft and manned spacecraft 36

Error Correcting Codes • Hamming SEC/DED Codes • Extensive usage in memories • High performance-to-cost ratio • Reed-Solomon • There are other more advanced ECCs employed, including convolutional codes (communication, coding theory) 37

Hamming SEC Code 38
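
A single-error-correcting Hamming code can be sketched in Python using the classic (7,4) layout, with parity bits at positions 1, 2, and 4. The choice of the (7,4) code and the function names are assumptions for illustration; memory systems typically use wider SEC/DED variants.

```python
def hamming74_encode(data):
    """Encode 4 data bits as [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(codeword):
    """Correct up to one flipped bit; return (data bits, error position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of the bad bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome

sent = hamming74_encode([1, 0, 1, 1])
received = sent.copy()
received[4] ^= 1                      # corrupt position 5 in transit
recovered, position = hamming74_decode(received)
```

The syndrome is the binary address of the corrupted bit, so correction is a single bit flip. Two simultaneous errors would be miscorrected, which is why SEC/DED codes add an overall parity bit.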

TMR & NMR • Very simple concept, includes many different variations 39

TMR & NMR Variations 40
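
The basic voting step behind TMR and NMR can be sketched in Python as follows. The function names are illustrative; this shows plain majority voting only, not any of the specific variations.

```python
from collections import Counter

def tmr_vote(a, b, c):
    """Bitwise 2-of-3 majority of three integer channel outputs."""
    return (a & b) | (a & c) | (b & c)

def nmr_vote(values):
    """Strict-majority vote over N channel outputs; raises if none exists."""
    value, count = Counter(values).most_common(1)[0]
    if count <= len(values) // 2:
        raise ValueError("no majority among channels")
    return value

masked = tmr_vote(0b1010, 0b1010, 0b0110)   # third channel is faulty
voted = nmr_vote([7, 7, 3, 7, 9])           # 5-channel NMR, two faulty
```

The voter masks the faulty channel without identifying it explicitly. Reconfiguration would additionally compare each channel against the voted value to locate the failed unit.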

Redundancy Issues • Large Overhead? • More difficult to validate • Asynchronous vs. Synchronous • Near Coincidence Errors • Generic Faults 41

Asynchronous Issues • Voted value is mean, median, or some other heuristic-based value. • Must set thresholds so that failures are caught, but also limit false alarms • Can be very difficult to guarantee robustness • Requires extensive analyses and testing • Ex. F-16B FBW 42
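
For asynchronous channels whose outputs agree only approximately, the voting and threshold steps can be sketched in Python with mid-value selection. The channel values and the threshold are hypothetical.

```python
import statistics

def mid_value_select(channels):
    """Voted value: the median of the channel outputs."""
    return statistics.median(channels)

def flag_outliers(channels, threshold):
    """Mark each channel whose output deviates from the voted value by more than the threshold."""
    voted = mid_value_select(channels)
    return [abs(c - voted) > threshold for c in channels]

channels = [10.0, 10.2, 14.5]   # third channel has drifted
voted = mid_value_select(channels)
flags = flag_outliers(channels, threshold=1.0)
```

A threshold that is too tight raises false alarms from normal skew between channels, while one that is too loose lets real failures pass. This is the tuning problem described above.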

Synchronous Issues • Inputs must be the same for each channel • Each channel must be synchronized • Fault detection is simple, unless… • Interactive Consistency • Near Coincidence • Generic Faults • Most systems are what are termed “loosely synchronous” 43

Byzantine Generals • Affects inputs to synchronous system as well as cross-channel voting • Stop and restart errors • Babbling Idiot Problem • Failed component sends different outputs to voting elements, confuses good components. • Intentional or intelligent malicious attacks • See Lamport et al. ACM 1982 44

Interactive Consistency 45

Byzantine Resiliency • Fault Containment Region (FCR) • A FCR is a collection of components that operate correctly regardless of any arbitrary logical fault outside the region. • Each FCR requires at least an independent power supply and clock signal. • May also need to be physically separated 46

Byzantine Resiliency • To tolerate f Byzantine faults requires: • 3f+1 FCRs • FCRs must be interconnected through 2f+1 disjoint paths • Inputs must be exchanged f+1 times between FCRs • FCRs must be synchronized to bounded skew • Simple TMR majority voter circuit is not Byzantine Resilient 47
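
The requirements above scale with the number of faults to be tolerated. A minimal Python sketch (the function name and dictionary keys are illustrative) that tabulates them for a given f:

```python
def byzantine_requirements(f):
    """Minimum resources needed to tolerate f Byzantine faults."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return {
        "fcrs": 3 * f + 1,             # fault containment regions
        "disjoint_paths": 2 * f + 1,   # connectivity between FCRs
        "exchange_rounds": f + 1,      # rounds of input exchange
    }

single_fault = byzantine_requirements(1)
```

For f = 1 this gives 4 FCRs, 3 disjoint paths, and 2 exchange rounds, which is why a three-channel TMR arrangement with a simple majority voter falls short of Byzantine resilience.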

Near Coincidence • Possibility that a second fault will occur before the system can recover from the first fault. • Must be accounted for in the design of redundancy management, e.g., 777 FBW 48

Generic Faults • Externally Induced • Physical damage • Lightning strike • Power transients • Internally Induced • Hardware & Firmware defects, COTS O/S • Latent failures • Clock anomalies • Bad Design? 49

What about Software? • Software faults are much more difficult to characterize • Software is • an abstract mathematical object or • a concept of “how to make a group of hardware (system) work together in order to perform a specified function” • includes Hardware design as well • Software fault = Design fault 50
