5. Basic Approaches to Achieve Fault Tolerance in Multiprocessors

5.1 Static, or Masking Redundancy
N copies of each processor are used; the minimum degree of replication is three (triplication). The replicated results are voted on.

5.2 Dynamic, or Standby Redundancy
First, the presence of a faulty processor is detected. The faulty processor is then replaced with a spare by performing network reconfiguration and error recovery.
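The standby approach can be illustrated with a minimal sketch. It assumes that fault detection is done by an acceptance test on the result and that recovery is a plain re-execution on the next spare; the names `run_with_standby`, `faulty_square`, `good_square`, and `looks_like_square` are hypothetical and do not come from the slides.

```python
import math
from typing import Callable, Sequence, Tuple

Processor = Callable[[int], int]

def run_with_standby(
    processors: Sequence[Processor],
    is_valid: Callable[[int, int], bool],
    x: int,
) -> Tuple[int, int]:
    """Try the active processor, then each spare, until a result passes the check.

    Returns (result, index of the processor that produced it).
    """
    for index, processor in enumerate(processors):
        result = processor(x)       # error recovery here is a simple re-execution
        if is_valid(x, result):     # fault detection via an acceptance test
            return result, index
        # Otherwise "reconfigure": fall through to the next spare.
    raise RuntimeError("active processor and all spares failed the check")

def faulty_square(x: int) -> int:
    return x * x + 1                # hypothetical faulty processor

def good_square(x: int) -> int:
    return x * x                    # hypothetical spare

def looks_like_square(x: int, result: int) -> bool:
    # Acceptance test (assumption): result is a perfect square whose root is |x|.
    if result < 0:
        return False
    root = math.isqrt(result)
    return root * root == result and root == abs(x)

result, used = run_with_standby([faulty_square, good_square], looks_like_square, 4)
print(result, used)
```

Unlike masking redundancy, only one processor runs at a time here, so the cost of a fault is the extra time spent detecting it and switching to the spare.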

6. Fault Tolerance Through Static Redundancy

Three forms of static redundancy:
• Redundancy for Availability
• Redundancy for Safety
• Redundancy for Non-Classical Faults

6.1 Redundancy for Availability
Used in the form of hardware, software, time, or information redundancy: n copies of a module perform the computation simultaneously, and their results are voted on. Combining this scheme with a disagreement detector (voter/comparator) and a switching unit produces a hybrid redundant system.

This approach can be applied at several levels in a distributed system:
• Each processor can be replicated and the result of each processor's computation voted on.
• The entire multiprocessor can be replicated and the combined result voted on.
• The P processors of the multiprocessor can be divided into P/N groups of N processors, with each group voting on its results before communicating with other groups.

To provide robust communication, all critical transactions between groups may be replicated and voted upon.
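A minimal sketch of majority voting with a disagreement detector, including the variant that splits P outputs into P/N groups of N. The function names `vote` and `vote_by_groups` are hypothetical, and the sketch assumes that module outputs can be compared for exact equality.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

def vote(outputs: Sequence[Hashable]) -> Tuple[Hashable, List[int]]:
    """Return (majority value, indices of the modules that disagree with it)."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise ValueError("no strict majority among module outputs")
    # Disagreement detector: these modules are candidates for switching out.
    disagreeing = [i for i, out in enumerate(outputs) if out != value]
    return value, disagreeing

def vote_by_groups(
    outputs: Sequence[Hashable], n: int
) -> List[Tuple[Hashable, List[int]]]:
    """Split P outputs into P/N groups of N and vote within each group."""
    if n <= 0 or len(outputs) % n != 0:
        raise ValueError("P must be a positive multiple of N")
    return [vote(outputs[i:i + n]) for i in range(0, len(outputs), n)]

print(vote([7, 7, 9]))
print(vote_by_groups([1, 1, 2, 5, 4, 5], 3))
```

In a hybrid redundant system, the list of disagreeing indices would drive the switching unit that replaces a suspect module with a spare.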

6.2 Redundancy for Safety
Reliability refers to the probability that the system produces correct output. Safety is defined as the probability that the system output is either correct, or that the error in the output is detectable [Johnson, 1989]. High safety is ensured by making the probability of an undetected error in the output negligible. When an uncorrectable error in the output is detected, a recovery or safe shutdown can be carried out. In practice, a fault-tolerance scheme must be chosen that meets both the reliability and the safety requirements.

Reliability (R) and safety (S) of a five-module system:

$$R = \sum_{i=k}^{5} \binom{5}{i}\, p_{good}^{\,i}\, (1 - p_{good})^{5-i}$$

$$S = 1 - \sum_{i=k}^{5} \binom{5}{i}\, p_{good}^{\,5-i}\, (1 - p_{good})^{i}$$
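The two sums above can be evaluated numerically. The sketch below reads them as binomial sums over five modules, with i running from k to 5. It assumes that p_good is the probability that a single module is fault-free and that k is the threshold in the sums; the slide defines neither symbol, and the function names are hypothetical.

```python
from math import comb

def reliability(p_good: float, k: int, n: int = 5) -> float:
    """R = sum_{i=k}^{n} C(n, i) * p_good^i * (1 - p_good)^(n - i)."""
    return sum(
        comb(n, i) * p_good**i * (1 - p_good) ** (n - i)
        for i in range(k, n + 1)
    )

def safety(p_good: float, k: int, n: int = 5) -> float:
    """S = 1 - sum_{i=k}^{n} C(n, i) * p_good^(n - i) * (1 - p_good)^i."""
    return 1 - sum(
        comb(n, i) * p_good ** (n - i) * (1 - p_good) ** i
        for i in range(k, n + 1)
    )

# Example values (assumptions): per-module probability 0.9, threshold k = 3.
print(reliability(0.9, 3))
print(safety(0.9, 3))
```

Under this reading, the first sum is the probability that at least k modules are good and the second is the probability that at least k modules are faulty, so S is at least R whenever k exceeds half the number of modules.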

Design strategies can achieve both high reliability and high safety using the generic model below.

[Figure: Safe Modular Redundant (SMR) system. Module 1, Module 2, ..., Module n feed an arbiter, which produces a Data output and an Unsafe flag.]

The outputs of the arbiter constitute the system outputs, which consist of two components: (I) the data output and (II) the unsafe flag. An arbitration strategy is the function implemented by the arbiter to decide what the correct output is, and to determine when the errors in the module outputs exceed the correction capability so that the correct output cannot be provided.
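One possible arbitration strategy can be sketched as follows. The rule used here (accept a value only if at least `min_agree` modules agree on it, otherwise raise the unsafe flag) is an assumption chosen for illustration, not the strategy prescribed by the slides; `arbitrate` and `min_agree` are hypothetical names.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

def arbitrate(
    module_outputs: Sequence[int], min_agree: int
) -> Tuple[Optional[int], bool]:
    """Return (data, unsafe) for one set of module outputs.

    data   -- the agreed value, or None when no value can be trusted
    unsafe -- True when errors exceed the correction capability
    """
    if not module_outputs:
        raise ValueError("at least one module output is required")
    # Requiring more than half of the modules keeps the accepted value unique.
    if min_agree <= len(module_outputs) // 2:
        raise ValueError("min_agree must exceed half the number of modules")
    value, count = Counter(module_outputs).most_common(1)[0]
    if count >= min_agree:
        return value, False
    return None, True

print(arbitrate([3, 3, 3, 8, 3], 4))   # four of five agree
print(arbitrate([3, 3, 8, 8, 1], 4))   # too much disagreement
```

Raising `min_agree` trades reliability for safety: fewer output sets are accepted, but an accepted value is less likely to be wrong.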

6.3 Redundancy for Tolerating Non-Classical Faults
Even malicious failures, where two or more faulty nodes may cooperate in an attempt to foil the operation, must be tolerated.
• The Byzantine Fault Model (BFM) protocol was proposed for precisely such an environment by Pease, Shostak, and Lamport (1982).
• BFM considers that a faulty node may not only produce incorrect values, but also send different values to different destinations instead of identical values, as expected.
• Such failures are typically complex and timing-related, which makes agreement among good processors difficult in the presence of faulty processors.
• BFM does not require foreknowledge of component misbehavior and can tolerate faulty components with even the most malevolent behavior, thus avoiding the costly task of proving the validity of assumptions about how faulty components misbehave.

For a BFM protocol to tolerate m faults, the following requirements must be met:
• At least (3m + 1) nodes must participate;
• At least (2m + 1) disjoint communication paths must exist between nodes;
• At least (m + 1) rounds of communication must take place;
• All nodes must be synchronized within a known skew of each other.

[Figure: message exchanges among nodes A, B, C, and D (values a and r), illustrating Byzantine agreement and Byzantine disagreement.]
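The node-count requirement can be illustrated with a small simulation of the classic oral-messages algorithm OM(m), used here as a stand-in for the protocol on the slides, which do not give its details. Everything in the sketch is an assumption made for illustration: the names `om`, `send`, and `majority`, the "attack"/"retreat" values, the default of "retreat" on a tie, and the faulty behavior, in which a faulty node sends "attack" to even-numbered nodes and "retreat" to odd-numbered ones.

```python
from collections import Counter
from typing import Dict, List, Set

def majority(values: List[str], default: str = "retreat") -> str:
    """Strict majority of the values, or the default when there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default

def send(sender: int, receiver: int, value: str, faulty: Set[int]) -> str:
    """A good node forwards the value; a faulty node sends conflicting values."""
    if sender in faulty:
        return "attack" if receiver % 2 == 0 else "retreat"
    return value

def om(m: int, commander: int, lieutenants: List[int], value: str,
       faulty: Set[int]) -> Dict[int, str]:
    """Run OM(m) and return the value each lieutenant decides on."""
    received = {l: send(commander, l, value, faulty) for l in lieutenants}
    if m == 0:
        return received
    # Each lieutenant relays what it received by acting as commander in OM(m-1).
    relayed = {
        j: om(m - 1, j, [l for l in lieutenants if l != j], received[j], faulty)
        for j in lieutenants
    }
    return {
        i: majority([received[i]] + [relayed[j][i] for j in lieutenants if j != i])
        for i in lieutenants
    }

# m = 1 with 3m + 1 = 4 nodes: good lieutenants 1 and 2 follow the good commander.
four = om(1, 0, [1, 2, 3], "attack", faulty={3})
# m = 1 with only 3 nodes: good lieutenant 1 cannot tell who is lying.
three = om(1, 0, [1, 2], "attack", faulty={2})
print(four[1], four[2], three[1])
```

With four nodes the good lieutenants decide on the value sent by the good commander, while with three nodes the single good lieutenant receives one vote each way and falls back to the default. Entries for faulty nodes in the returned dictionary carry no meaning.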

Example: m = 7
• At least (3 x 7 + 1) = 22 nodes
• At least (2 x 7 + 1) = 15 disjoint communication paths
• At least (7 + 1) = 8 rounds of communication
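The arithmetic in this example generalizes to any m. A small helper, with the hypothetical name `bfm_requirements`, applies the three bounds stated for the BFM protocol:

```python
from typing import Dict

def bfm_requirements(m: int) -> Dict[str, int]:
    """Minimum resources needed to tolerate m Byzantine faults."""
    if m < 0:
        raise ValueError("m must be non-negative")
    return {
        "nodes": 3 * m + 1,            # participating nodes
        "disjoint_paths": 2 * m + 1,   # disjoint communication paths
        "rounds": m + 1,               # rounds of communication
    }

print(bfm_requirements(7))
```

The synchronization requirement is not numeric and is therefore not part of this helper.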
