Fault Tolerant Computing Basics
Presentation Transcript
Fault Tolerant Computing Basics. Dan Siewiorek, Carnegie Mellon University, June 2012
Preview • Many terms have multiple usages, which can lead to confusion when they are used out of context • Sources of error • Faults go through at least ten stages from inception to repair, so the designer should plan for all ten stages • Relationship between the sequence of events in handling a fault and mathematical measures
Outline • Introduction • Definitions • Sources of Errors
WHY RELIABILITY? • Three of the driving factors: • Critical applications • Computer outage or error can cause loss of money, time, or life • No longer just in aerospace, but in more mundane applications (customer expectations) • Increasing system complexity • More components, more likelihood of failure (counter: increased reliability of VLSI) • Lower signal/noise ratios at higher VLSI speeds: more likelihood of transient errors • Diagnosis is more difficult, downtime is longer, repair costs rise, and inventory costs increase too • Relative cost is less
AVAILABILITY EXAMPLE • 90 MINUTES DOWNTIME PER WEEK • AVAILABILITY 0.991 • RESERVATION SYSTEM -- $36,000/MINUTE DOWN • $3.24 MILLION PER WEEK • .1% AVAILABILITY = 10 MINUTES = $360,000.00
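The arithmetic on this slide can be checked with a short sketch. The function names below are illustrative and not from the talk; the only inputs taken from the slide are 90 minutes of downtime per week and $36,000 per minute of downtime, and a flat per-minute cost is assumed.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes in a week


def weekly_availability(downtime_minutes: float) -> float:
    """Fraction of the week during which the system is up."""
    return 1.0 - downtime_minutes / MINUTES_PER_WEEK


def downtime_cost(downtime_minutes: float, cost_per_minute: float) -> float:
    """Cost of an outage, assuming a flat cost per minute of downtime."""
    return downtime_minutes * cost_per_minute


if __name__ == "__main__":
    print(round(weekly_availability(90), 3))    # availability for 90 min/week down
    print(downtime_cost(90, 36_000))            # weekly cost of that downtime
    print(round(0.001 * MINUTES_PER_WEEK, 2))   # minutes in 0.1% of a week
    print(downtime_cost(10, 36_000))            # cost of roughly 0.1% availability
```

Under these assumptions, each 0.1% of availability is worth about ten minutes of uptime per week, which is why small availability gains carry a large dollar value in this example.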
Univac I Checkers • Parity • Memory • Input to function table • Output from function table, odd number of selected gates. Dummy lines preserve parity • Unitypes • 1-of-n • Intermediate line function table • Memory bank select
Univac I Checkers (cont’d) • Duplication • Registers • Adder • Comparator • Multiplier-quotient coupler • Bus amplifier • Bus interface • Automatic voltage monitoring system tests every DC voltage at rate of one per minute • “720 checker” counts 720 characters per I/O block
Definitions • RELIABILITY: SURVIVAL PROBABILITY • When repair is costly or function is critical • AVAILABILITY: THE FRACTION OF TIME A SYSTEM MEETS ITS SPECIFICATION • When service can be delayed or denied • REDUNDANCY: EXTRA HARDWARE, SOFTWARE, TIME
Stages in the development of a system
STAGE | ERROR SOURCES | ERROR DETECTION
Specification & design | Algorithm design; formal specification | Simulation; consistency checks, model checking
Prototype | Algorithm design; wiring & assembly; timing; component failure | Stimulus/response testing
Manufacture | Wiring & assembly; component failure | System testing; diagnostics
Installation | Assembly; component failure | System testing; diagnostics
Field operation | Component failure; operator errors; environmental factors | Diagnostics
Cause-effect sequence • FAILURE: component does not provide service • FAULT: deviation of logic function from design value • Hard, Transient • ERROR: manifestation of a fault by incorrect value
Fault Classification • DURATION: • Transient - design errors, environment • Intermittent - repair by replacement • Permanent - repair by replacement • EXTENT: • Local (independent) • Distributed (related) • VALUE: • Determinate (stuck at X) • Indeterminate (variable)
Basic Steps in Fault Handling • Fault Confinement -- contain it before it can spread • Fault Detection -- find out about it to prevent acting on bad data • Fault Masking -- mask effects • Retry -- since most problems are transient, just try again • Diagnosis -- figure out what went wrong as prelude to correction • Reconfiguration -- work around a defective component • Recovery -- resume operation after reconfiguration in degraded mode • Restart -- re-initialize (warm restart; cold restart) • Repair -- repair defective component • Reintegration -- after repair, go from degraded to full operation
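The retry step can be sketched as a bounded loop that hands off to diagnosis once retries are exhausted. This is a minimal sketch and not a method from the talk: `TransientFault`, `with_retry`, `make_flaky`, and the limit of three attempts are hypothetical names and values chosen only for illustration.

```python
class TransientFault(Exception):
    """Hypothetical exception standing in for a detected, possibly transient error."""


def with_retry(operation, max_attempts=3):
    """Run operation(); on a detected fault, try again up to max_attempts times.

    Returns (result, attempts_used). If every attempt fails, the last fault is
    re-raised so the caller can move on to diagnosis and reconfiguration.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except TransientFault:
            if attempt == max_attempts:
                raise  # fault persisted across all retries: no longer treat it as transient


def make_flaky(failures_before_success):
    """Build a demo operation that fails a fixed number of times, then succeeds."""
    state = {"remaining": failures_before_success}

    def operation():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientFault("simulated transient fault")
        return "ok"

    return operation


if __name__ == "__main__":
    # Two simulated transient faults, then success on the third attempt.
    print(with_retry(make_flaky(2)))
```

Bounding the number of attempts matters: a fault that survives every retry is more likely hard or intermittent, and the later stages in the list (diagnosis, reconfiguration, repair) then apply.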
MTBF -- MTTD -- MTTR • Availability = MTTF / (MTTF + MTTR)
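As a rough illustration of the availability formula, the sketch below evaluates MTTF / (MTTF + MTTR) and converts the result into expected downtime over a period. The function names and the example values (1000 hours MTTF, 10 hours MTTR) are assumptions for illustration, not figures from the talk.

```python
HOURS_PER_YEAR = 8760.0  # 365 days


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability as MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


def expected_downtime_hours(mttf_hours: float, mttr_hours: float,
                            period_hours: float = HOURS_PER_YEAR) -> float:
    """Expected hours of downtime over a period at the given availability."""
    return (1.0 - availability(mttf_hours, mttr_hours)) * period_hours


if __name__ == "__main__":
    # Hypothetical system: fails every 1000 hours on average, 10 hours to repair.
    print(round(availability(1000, 10), 4))
    print(round(expected_downtime_hours(1000, 10), 1))
```

The formula shows two levers: raising MTTF (fewer failures) or lowering MTTR (faster detection, diagnosis, and repair) both increase availability.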
Error Containment Levels • For distributed systems there are additional levels • Containment to a single node or FTU • Containment to a single bus or subsystem • Containment to a single vehicle/piece of equipment in a national infrastructure
“Mainframe” Outage Sources (* the sum of these sources was 0.75)
Tandem Causes of System Failures (Up is good; down is bad)
Tandem Hardware Causes of Outage • Disks 49% • Communications 24% • Processors 18% • Timing 9% • Spares 1%
Tandem Operations Causes of Outage • Procedures 42% • Configurations 39% • Move 13% • Overflow 4% • Upgrade 1%
Tandem Maintenance Causes of Outage • Disk 67% • Communication 20% • Processor 13%
Tandem Environmental Outages • Extended Power Loss 80% • Earthquake 5% • Flood 4% • Fire 3% • Lightning 3% • Halon Activation 2% • Air Conditioning 2% • Total MTBF about 20 years • MTBAoG* about 100 years • Roadside highway equipment will be more exposed than this • * AoG = “Act of God”
CMU Andrew File Server Study • Configuration • 13 SUN II Workstations with 68010 processor • 4 Fujitsu Eagle Disk Drives • Observations • 21 Workstation Years • Frequency of events • Permanent Failures 29 • Intermittent Faults 610 • Transient Faults 446 • System Crashes 298 • Mean Time To • Permanent Failures 6552 hours • Intermittent Faults 58 hours • Transient Faults 354 hours • System Crash 689 hours
Some Interesting Ratios • Permanent Outages/Total Crashes = 0.1 • Intermittent Faults/Permanent Failures = 21 • Thus first symptom appears over 1200 hours prior to repair (about 21 intermittent faults at a mean of 58 hours apart) • (Crashes - Permanent)/Total Faults = 0.255 • 14/29 failures had three or fewer error log entries • 8/29 had no error log entries
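The ratios on this slide can be reproduced from the event counts on the CMU Andrew File Server Study slide. The sketch below is illustrative: the variable names are hypothetical, and reading the "1200 hours" figure as the intermittent-to-permanent ratio multiplied by the 58-hour mean time to an intermittent fault is an assumption about how that number was derived.

```python
# Event counts and mean time copied from the Andrew File Server slide.
PERMANENT_FAILURES = 29
INTERMITTENT_FAULTS = 610
TRANSIENT_FAULTS = 446
SYSTEM_CRASHES = 298
MEAN_HOURS_TO_INTERMITTENT = 58


def andrew_ratios() -> dict:
    """Recompute the slide's ratios from the raw event counts."""
    intermittent_per_permanent = INTERMITTENT_FAULTS / PERMANENT_FAILURES
    total_faults = INTERMITTENT_FAULTS + TRANSIENT_FAULTS
    return {
        "permanent_per_crash": PERMANENT_FAILURES / SYSTEM_CRASHES,
        "intermittent_per_permanent": intermittent_per_permanent,
        # Assumed derivation: symptoms per failure times hours between symptoms.
        "warning_hours_before_repair": intermittent_per_permanent
        * MEAN_HOURS_TO_INTERMITTENT,
        "nonpermanent_crashes_per_fault": (SYSTEM_CRASHES - PERMANENT_FAILURES)
        / total_faults,
    }


if __name__ == "__main__":
    for name, value in andrew_ratios().items():
        print(f"{name}: {value:.3f}")
```

Under that reading, a permanent failure is preceded on average by about 21 intermittent symptoms, which is where the long warning interval before repair comes from.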