Fault-Tolerant

Presentation Transcript


  1. Fault-Tolerant CSc 8320 Xiuyun Shen

  2. Fault Tolerance • Computer systems can fail due to a fault in some component, such as a processor, memory, I/O device, cable, or software. A fault is a malfunction and can be caused by: • design error • programming error • physical damage • operator error, etc. • OK, we still have faults • Either: • We did not find them (Design faults), or • They arise during operation (Degradation faults) • What can we do? • Only one thing: tolerate them • This is the least desirable option • It is difficult to do • It requires additional resources during operation

  3. Fault tolerance techniques for distributed systems • Fault tolerance is the ability of a system to perform its function correctly even in the presence of internal faults. The purpose of fault tolerance is to increase the dependability of a system. A complementary but separate approach to increasing dependability is fault prevention. This consists of techniques, such as inspection, whose intent is to eliminate the circumstances by which faults arise.

  4. Fault Classifications

  5. Figure 1: Different Classifications of Faults • Based on duration, faults can be classified as transient or permanent. A transient fault will eventually disappear without any apparent intervention, whereas a permanent one will remain unless it is removed by some external agency. While it may seem that permanent faults are more severe, from an engineering perspective they are much easier to diagnose and handle. A particularly problematic type of transient fault is the intermittent fault that recurs, often unpredictably. • A different way to classify faults is by their underlying cause. Design faults are the result of design failures, such as a coding error. While it may appear that in a carefully designed system all such faults should be eliminated through fault prevention, this is usually not realistic in practice. For this reason, many fault-tolerant systems are built with the assumption that design faults are inevitable, and that mechanisms need to be put in place to protect the system against them. Operational faults, on the other hand, are faults that occur during the lifetime of the system and are invariably due to physical causes, such as processor failures or disk crashes. • Finally, based on how a failed component behaves once it has failed, faults can be classified into the following categories: • Crash faults -- the component either completely stops operating or never returns to a valid state; • Omission faults -- the component completely fails to perform its service; • Timing faults -- the component does not complete its service on time; • Byzantine faults -- these are faults of an arbitrary nature.
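The transient/permanent distinction has a direct practical consequence: a transient fault can often be masked simply by retrying the operation, while a permanent fault will exhaust any retry budget and must be handled by other means. The Python sketch below illustrates this; the operation, the choice of IOError as the "possibly transient" failure, and the retry parameters are illustrative assumptions, not something specified in the slides.

```python
import time

def call_with_retry(operation, attempts=3, delay_s=0.1):
    """Retry an operation whose failures may be transient (illustrative sketch).

    A transient fault is expected to disappear on its own, so a bounded
    retry loop is often enough to ride it out.  A permanent fault will
    exhaust every attempt and has to be escalated instead.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except IOError as err:      # assume I/O errors are possibly transient
            last_error = err
            time.sleep(delay_s)     # brief pause before trying again
    raise RuntimeError("fault appears to be permanent") from last_error
```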

  6. Fault Tolerant Design • Every fault tolerant design must deal with one or more of the following aspects ([Nelson 90], [Anderson 81]): • · Detection: A basic element of a fault tolerant design is error detection. Error detection is a critical prerequisite for other fault tolerant mechanisms. • · Containment: In order to deal with the large number of possible effects of faults in a complex computer system, it is necessary to define confinement boundaries for the propagation of errors. Containment regions are usually arranged hierarchically throughout the modular structure of the system. Each boundary protects the rest of the system from errors that occur within it and enables the designer to count on a certain number of correctly operating components by means of which the system can continue to perform its function.
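As a rough illustration of detection and containment together, the sketch below wraps a component behind a boundary: an exception from the component is the detected error, and the wrapper confines the damage by taking only that component out of service so the rest of the system can keep operating. The class and method names are hypothetical, not taken from the cited sources.

```python
class ContainmentRegion:
    """Confine errors raised by one component behind a boundary (sketch)."""

    def __init__(self, name, component):
        self.name = name            # label used when reporting the failure
        self.component = component  # any callable implementing the service
        self.failed = False         # set once an error has been detected

    def invoke(self, *args, **kwargs):
        if self.failed:
            raise RuntimeError(f"{self.name} has been taken out of service")
        try:
            return self.component(*args, **kwargs)
        except Exception as err:
            # Detection: the exception signals the error.
            # Containment: only this region is marked faulty; callers outside
            # the boundary can continue with the remaining components.
            self.failed = True
            raise RuntimeError(f"{self.name} isolated after an error") from err
```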

  7. · Masking: For some applications, the timely flow of information is a critical design issue. In such cases, it is not possible to just stop the information processing to deal with detected errors. Masking is the dynamic correction of errors. In general, masking errors is difficult to perform inline with a complex component. Masking, however, is much simpler when redundant copies of the data in question are available. • · Diagnosis: After an error is detected, the system must assess its health in order to decide how to proceed. If the containment boundaries are highly secure, diagnosis is reduced to just identifying the enclosed components. If the established boundaries are not completely secure, then more involved diagnosis is required to identify which other areas are affected by propagated errors. • · Repair/reconfiguration: In general, systems do not actually try to repair component-level faults in order to continue operating. Because faults are either physical or design-related, repair techniques are based on finding ways to work around faults, either by effectively removing the affected components from operation or by rearranging the activity within the system to prevent the activation of the faults. • · Recovery and Continued Service: After an error is detected, a system must be returned to proper service by ensuring an error-free state. This usually involves restoration to a previous or predefined state, or rebuilding the state by means of known-good external information.
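The recovery step described above, restoration to a previous error-free state, can be sketched with a simple checkpoint-and-rollback wrapper. The names (CheckpointedService, commit_checkpoint) are invented for illustration; the point is only that a known-good copy of the state is kept and restored when an error is detected.

```python
import copy

class CheckpointedService:
    """Restore a previous error-free state after an error is detected (sketch)."""

    def __init__(self, initial_state):
        self.state = initial_state
        self.checkpoint = copy.deepcopy(initial_state)   # last known-good state

    def commit_checkpoint(self):
        """Record the current state as known-good."""
        self.checkpoint = copy.deepcopy(self.state)

    def apply(self, update):
        """Apply an update; roll back to the checkpoint if it fails."""
        try:
            update(self.state)                           # may raise or corrupt the state
        except Exception:
            self.state = copy.deepcopy(self.checkpoint)  # recovery: restore known-good state
            raise
```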

  8. Hardware and software fault tolerance • Redundancy • Fault tolerance depends entirely on redundancy • Redundancy in some form facilitates all phases • Redundancy does not necessarily mean duplication • Redundancy can be applied at any level: • Small-component redundancy: harder, but better survival • Large-component redundancy: easier, but worse survival

  9. Static And Dynamic Redundancy • Static: • System Uses Several ‘‘Parallel’’ Units • All Parallel Units Operate All The Time • Parallel Units Detect Errors By Observing Discrepancy • Loss Of Unit Implies: • Removal Or Containment • Service Provided By Those That Remain • Dynamic: • System Operates With Fewer ‘‘Parallel’’ Units • Some Form Of Redundant Units In Standby Mode • Loss Of Unit Implies: • Removal Or Containment • Introducing Standby Unit

  10. N Modular Redundancy • Error detection: • Incomplete agreement in the voter • Damage assessment: • Assume that the minority is suspect • State restoration: • Remove faulty unit • Continued service: • Use remaining modules • Special case: • N = 3, triple modular redundancy (TMR) • After one failure, shut down one remaining processor
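A minimal majority voter makes the N modular redundancy steps above concrete: disagreement is the detected error, the minority is the suspected (and removable) module, and the majority value is the continued service. The sketch below assumes the replica outputs are directly comparable values; the function name is illustrative.

```python
from collections import Counter

def nmr_vote(replica_outputs):
    """Majority voter for N modular redundancy (N = 3 gives TMR).  Sketch only.

    Returns the majority value plus the indices of the disagreeing
    (suspect) modules, which would then be removed from service.
    """
    counts = Counter(replica_outputs)
    value, votes = counts.most_common(1)[0]
    if votes <= len(replica_outputs) // 2:
        raise RuntimeError("no majority: the voter cannot mask this fault")
    suspects = [i for i, out in enumerate(replica_outputs) if out != value]
    return value, suspects

# Example: with TMR, one faulty module is outvoted and identified.
# nmr_vote([42, 42, 17])  ->  (42, [2])
```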

  11. Dynamic Redundancy

  12. Dynamic Redundancy In duplication with comparison, error detection is achieved by comparing the outputs of two modules performing the same function. If the outputs of the modules disagree, an error condition is raised, followed by diagnosis and repair actions to return the system to operation. In a similar approach, only one module actually performs the intended function, with the other component being a dissimilar monitor that checks the outputs looking for errors.
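A sketch of duplication with comparison: two modules do the same work, and a mismatch between their outputs raises the error condition that triggers diagnosis and repair. Unlike majority voting, comparison can detect a disagreement but cannot decide which module is at fault. The function name and exception below are illustrative.

```python
def duplication_with_comparison(primary, duplicate, inputs):
    """Detect errors by comparing the outputs of two identical modules (sketch)."""
    a = primary(inputs)
    b = duplicate(inputs)
    if a != b:
        # Comparison only detects the error; it cannot tell which module is
        # wrong, so the system moves on to diagnosis and repair.
        raise RuntimeError("output mismatch detected between the two modules")
    return a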

  13. Hybrid Redundancy • N-S modular redundancy with ‘‘S’’ spares • As members of the N-S fail, spares switched in • Able to tolerate up to N-2 failures • Spares may be unpowered: • Saves power • Unpowered units much more reliable than powered • Attention required to infant mortality • Clearly applicable to: • Long-duration systems • Systems with no repair opportunity
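A rough sketch of hybrid redundancy: a voted core of active modules backed by a pool of spares that are switched in as core members are voted out. The class below keeps the bookkeeping deliberately simple (modules are plain callables, spares are assumed good once powered up); all names are illustrative.

```python
from collections import Counter

class HybridRedundancy:
    """Voted core of (N - S) active modules plus S standby spares (sketch)."""

    def __init__(self, core_modules, spare_modules):
        self.core = list(core_modules)      # powered modules participating in the vote
        self.spares = list(spare_modules)   # unpowered spares, switched in on demand

    def step(self, inputs):
        outputs = [m(inputs) for m in self.core]
        value, _ = Counter(outputs).most_common(1)[0]   # majority result
        survivors = []
        for module, out in zip(self.core, outputs):
            if out == value:
                survivors.append(module)                # module agreed with the majority
            elif self.spares:
                survivors.append(self.spares.pop())     # switch a spare in for the faulty module
        self.core = survivors
        return value
```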

  14. Standby Synchronization For redundancy to work, the standby unit needs to be kept synchronized with the active unit at all times. This is required so that the standby can take over the active's role if the active fails. Standby synchronization can be achieved in the following ways: • Bus Cycle Level Synchronization • Memory Mirroring • Message Level Synchronization • Checkpoint Level Synchronization • Reconciliation on Takeover
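Of the methods listed above, checkpoint level synchronization is the easiest to sketch: the active unit pushes a snapshot of its state to the standby after each change, and on takeover the standby resumes from the last snapshot it received (anything newer is handled by reconciliation). The classes and method names below are hypothetical, not taken from the slides.

```python
class StandbyUnit:
    """Standby that keeps the most recent checkpoint from the active (sketch)."""

    def __init__(self):
        self.last_checkpoint = {}

    def receive_checkpoint(self, snapshot):
        self.last_checkpoint = snapshot          # remember the latest active state

    def take_over(self):
        # On active failure, resume from the last checkpoint; any work the
        # active did after that point must be reconciled or redone.
        return dict(self.last_checkpoint)


class ActiveUnit:
    """Active unit that pushes checkpoints to its standby (sketch)."""

    def __init__(self, standby):
        self.state = {}
        self.standby = standby

    def handle(self, key, value):
        self.state[key] = value
        self.standby.receive_checkpoint(dict(self.state))   # checkpoint-level sync
```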

  15. Software Fault Tolerance Architectures • Four fault-tolerant architectures are introduced in this section: Recovery Block, N-version programming, N self-checking programming, and consensus recovery blocks. • N-version architecture
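Of the four architectures, the Recovery Block is the most compact to sketch: independently designed alternates are tried in order, and an acceptance test decides whether a result can be delivered (a full recovery block would also restore a checkpoint before each retry). The function and the usage comment below are illustrative assumptions.

```python
def recovery_block(alternates, acceptance_test, inputs):
    """Recovery Block sketch: try alternates in order until one passes the test."""
    for alternate in alternates:
        try:
            result = alternate(inputs)
        except Exception:
            continue                      # this alternate failed outright: try the next
        if acceptance_test(result):
            return result                 # first acceptable result is delivered
    raise RuntimeError("all alternates failed the acceptance test")

# Hypothetical usage: two differently designed sort routines with a simple
# acceptance test that checks the output really is ordered.
# recovery_block([fast_sort, simple_sort],
#                lambda xs: all(a <= b for a, b in zip(xs, xs[1:])),
#                [3, 1, 2])
```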

  16. Distributed Systems • We define a distributed software system as: a system with two or more independent processing sites that communicate with each other over a medium whose transmission delays may exceed the time between successive state changes. • Figure: A Distributed System • From a fault-tolerance perspective, distributed systems have a major advantage: they can easily be made redundant, which, as we have seen, is at the core of all fault-tolerance techniques. Unfortunately, distribution also means that the imperfect and fault-prone physical world cannot be ignored, so that as much as they help in supporting fault tolerance, distributed systems may also be the source of many failures. In this section we briefly review these problems.

  17. Failures in distributed systems • Processing Site Failures • Communication Media Failures • Transmission Delays

  18. A Fault-Tolerant Pattern for Distributed Systems • We now examine a specific pattern that has been successfully used to construct complex, fault-tolerant embedded systems in a distributed environment. This pattern is suitable for a class of distributed applications that is characterized by the star-like topology shown in Figure 6, which commonly occurs in practice.

  19. In this system, the permanent failure of the central controller would lead to the loss of all global functionality. Hence, it is a single point of failure that needs to be made fault tolerant. However, we would like to avoid the overhead of a full hot standby -- or even a warm standby -- for this component. • The essential feature of this approach is to distribute the state information about the system as a whole between the controller and the agents. This is done in such a way that (a) the controller holds the global state information that comprises the state information for each agent, and (b) each agent keeps its own copy of its state information. Every time the local state of an agent changes, it informs the controller of the change. The controller caches this information. Thus, we have state redundancy. Note that there is no need for a centralized consistent checkpoint of the entire system, which, as we have mentioned, is a complex and high-overhead operation. • Obviously, the state redundancy does not protect us from permanent failures of the controller. Note, however, that any local functions performed by the agents are unaffected by the failure of the controller. Global functions, on the other hand, may have to be put on hold until the controller recovers. Thus, it is critical to be able to recover the controller. • Controller recovery can be easily achieved by using a simple, low-overhead, cold standby scheme. The standby controller can recover the global state information of the failed controller simply by querying each agent in turn. Once it has the full set, the system can resume its operation; the only effect is the delay incurred during the recovery of the controller. Of course, it is also possible to use other standby schemes with this approach. • If an agent fails and recovers, it can restore its local state information from the controller. One interesting aspect of this topology is that it is not necessary for the controller to monitor its agents. A recovering agent needs merely to contact its controller to get its state information. This eliminates costly polling.
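The essentials of this pattern fit in a short sketch: agents report state changes to the controller, a recovering (cold standby) controller rebuilds the global state by querying each agent, and a recovering agent restores its local state from the controller's cache. All class and method names below are invented for illustration.

```python
class Agent:
    """Agent that owns its local state and reports every change upward (sketch)."""

    def __init__(self, name, controller=None):
        self.name = name
        self.local_state = {}
        self.controller = controller

    def update(self, key, value):
        self.local_state[key] = value
        if self.controller is not None:
            self.controller.report(self.name, dict(self.local_state))  # state redundancy

    def query_state(self):
        return dict(self.local_state)        # answered when a controller recovers

    def restore_from(self, controller):
        self.local_state = controller.cached_state(self.name)  # agent recovery


class Controller:
    """Central controller caching each agent's state; recoverable as a cold standby."""

    def __init__(self, agents):
        self.agents = list(agents)
        self.global_state = {}

    def report(self, agent_name, state):
        self.global_state[agent_name] = state          # cache the agent's latest state

    def cached_state(self, agent_name):
        return dict(self.global_state.get(agent_name, {}))

    def recover(self):
        # A cold-standby controller rebuilds the global state simply by
        # querying each agent in turn; no centralized checkpoint is needed.
        self.global_state = {a.name: a.query_state() for a in self.agents}
```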

  20. Conclusion • Fault tolerance is achieved by applying a set of analysis and design techniques to create systems with dramatically improved dependability. As new technologies are developed and new applications arise, new fault-tolerance approaches are also needed. In the early days of fault-tolerant computing, it was possible to craft specific hardware and software solutions from the ground up, but now chips contain complex, highly integrated functions, and hardware and software must be crafted to meet a variety of standards to be economically viable. Thus a great deal of current research focuses on implementing fault tolerance using COTS (Commercial-Off-The-Shelf) technology. • Recent developments include the adaptation of existing fault-tolerance techniques to RAID disks, where information is striped across several disks to improve bandwidth and a redundant disk is used to hold encoded information so that data can be reconstructed if a disk fails. Another area is the use of application-based fault-tolerance techniques to detect errors in high-performance parallel processors. Fault-tolerance techniques are expected to become increasingly important in deep sub-micron VLSI devices to combat increasing noise problems and improve yield by tolerating defects that are likely to occur on very large, complex chips.

  21. References • http://en.wikipedia.org/wiki/Fault-tolerance • http://hissa.nist.gov/chissa/SEI_Framework/framework_1.html • http://www.cs.utexas.edu/~lorenzo/lft.html • http://www.pmg.csail.mit.edu/papers/osdi99.pdf • http://www.eventhelix.com/RealtimeMantra/HardwareFaultTolerance.htm • Leigh E. Hodge and Roger M. Whitaker (Cardiff University), Fault Tolerance in Bluetooth Scatternet Topologies
