Fault Tolerant Computing Based on Diversity

Fault Tolerant Computing Based on Diversity by Seda Demirağ 2005701688

INTRODUCTION • Software faults in a real-time system: • Concurrency-control faults: • These faults involve inter-process communication and synchronization, data coherence and protection, and deadlock. • Timing faults: • A task is not completed in the specified amount of time. • Error-detection and error-recovery faults: • These faults occur when the detection and recovery mechanism cannot handle an error, or is invoked when no error exists.

INTRODUCTION • Software fault tolerance techniques: • are designed to allow a system to tolerate software faults that remain in the system after its development • provide mechanisms that keep those remaining faults from causing a system failure • have been used mostly in the aerospace, nuclear power, healthcare, telecommunications and ground transportation industries, where failures can be catastrophic. • In this term paper, I will discuss the fault tolerance techniques based on design and data diversity.

SOFTWARE FAULT TOLERANCE TECHNIQUES: DATA and DESIGN DIVERSITY • Multiple data representation environment: • Data diverse techniques are used in a multiple data representation environment • They utilize different representations of the input data to provide tolerance to software design faults • Multiple version software environment: • Design diverse techniques are used in a multiple version software environment • They use the functionality of independently developed software versions to provide tolerance to software design faults

Design Diversity Techniques • Two or more variants of the software, developed by different teams but to a common specification, are used. • These variants are then used in a time or space redundant manner to achieve fault tolerance. • The main disadvantage of design diversity is the high cost of developing multiple variants of the software.

Design Diversity Techniques • Popular techniques which are based on the design diversity concept for fault tolerance in software are: • Recovery Block • N-Version Programming • N-Self-Checking Programming

Design Diversity Techniques: Recovery Block (RcB) • It was introduced in 1974 by Horning, with early implementations developed by Randell in 1975 and Hecht in 1981 • The variant whose result is used is selected during program execution, based on the result of the acceptance test (AT) • The basic RcB scheme consists of an executive, an acceptance test, and primary and alternate try blocks (variants) • Many implementations of RcB, especially for real-time applications, include a watchdog timer • The RcB is categorized as a dynamic technique

Design Diversity Techniques: Recovery Block (RcB) This figure illustrates the structure and operation of the basic RcB technique with a watchdog timer. The RcB figure states that the technique will first attempt to ensure the AT by using the primary try block. If the primary algorithm’s result does not pass the AT, then the n-1 alternates are attempted in turn until an alternate’s result passes the AT. If no alternate is successful, an error occurs.
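The recovery block flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `recovery_block`, `faulty_sqrt`, `newton_sqrt` and `sqrt_acceptance_test` are hypothetical, the watchdog timer is left out, and no checkpoint is restored between attempts because the example variants are pure functions.

```python
from typing import Callable, Sequence


class RecoveryBlockError(Exception):
    """Raised when no variant produces a result that passes the acceptance test."""


def recovery_block(
    variants: Sequence[Callable[[float], float]],
    acceptance_test: Callable[[float, float], bool],
    x: float,
) -> float:
    """Try the primary variant first, then each alternate, until one passes the AT."""
    for variant in variants:
        try:
            result = variant(x)
        except Exception:
            # A crashing variant is treated like a failed acceptance test.
            continue
        if acceptance_test(x, result):
            return result
    raise RecoveryBlockError("all variants failed the acceptance test")


# Hypothetical variants: a deliberately faulty primary and a correct alternate.
def faulty_sqrt(x: float) -> float:
    return x / 2.0  # design fault: only right for x == 0 and x == 4


def newton_sqrt(x: float) -> float:
    guess = x if x > 1.0 else 1.0
    for _ in range(60):
        guess = 0.5 * (guess + x / guess)
    return guess


def sqrt_acceptance_test(x: float, result: float) -> bool:
    # Accept the result only if squaring it reproduces the input closely.
    return abs(result * result - x) <= 1e-9 * max(1.0, x)


print(recovery_block([faulty_sqrt, newton_sqrt], sqrt_acceptance_test, 9.0))
```

The acceptance test here checks the result against the input rather than recomputing it, which is what keeps the test cheaper than the variants it guards.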

Design Diversity Techniques: N-Version Programming (NVP) • NVP was suggested by Elmendorf in 1972 and developed by Avizienis and Chen in 1977–1978 • In contrast to RcB, NVP is a static technique. That means a task: • is executed by several processes or programs, and a result is accepted only if it is adjudicated as acceptable, usually via a majority vote.

Design Diversity Techniques: N-Version Programming (NVP) This figure illustrates the structure and operation of the basic NVP technique. The NVP technique uses a decision mechanism (DM) and forward recovery to accomplish fault tolerance. The technique uses at least two independently designed, functionally equivalent versions (variants) of a program developed from the same specification. The variants are run in parallel and a DM examines the results and selects the “best” result, if one exists.

Design Diversity Techniques: N-Version Programming (NVP) General syntax:

    run Version 1, Version 2, ..., Version n
    if (Decision Mechanism (Result 1, Result 2, ..., Result n))
        return Result
    else
        failure exception

The NVP syntax above states that the technique executes the n versions concurrently. The results of these executions are provided to the DM, which operates upon them to determine if a correct result can be adjudicated. If one can, then it is returned. If a correct result cannot be determined, then an error occurs.
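As a rough illustration of the NVP syntax above, the following Python sketch runs three versions on the same input and adjudicates with a strict majority vote. The function names and the three example versions are assumptions made for this sketch; a thread pool stands in for truly independent processes or hardware, and a version that raises an exception simply aborts the run.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


class NoConsensusError(Exception):
    """Raised when the decision mechanism cannot adjudicate a result."""


def majority_vote(results: Sequence[int]) -> int:
    """Decision mechanism: return the value produced by a strict majority."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 > len(results):
        return value
    raise NoConsensusError(f"no majority among {list(results)}")


def n_version(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version on the same input, then adjudicate the results."""
    with ThreadPoolExecutor(max_workers=len(versions)) as pool:
        results = list(pool.map(lambda version: version(x), versions))
    return majority_vote(results)


# Three hypothetical "independently developed" versions of 1 + 2 + ... + n.
def version_formula(n: int) -> int:
    return n * (n + 1) // 2


def version_loop(n: int) -> int:
    return sum(range(1, n + 1))


def version_buggy(n: int) -> int:
    return n * n // 2  # design fault: wrong for every n >= 1


print(n_version([version_formula, version_loop, version_buggy], 10))  # 55
```

With three versions the vote masks one faulty version. If two versions shared the same fault, the majority would be wrong, which is why the versions need to be developed independently.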

Design Diversity Techniques: N Self-Checking Programming (NSCP) • NSCP is a design diverse technique developed by Laprie. • The hardware fault tolerance architecture related to NSCP is active dynamic redundancy. • The self-checking property results either from the application of an AT to a variant’s results or from the application of a comparator to the results of two variants.

Design Diversity Techniques: N Self-Checking Programming (NSCP) This figure illustrates the structure and operation of the basic NSCP technique.

Design Diversity Techniques: N Self-Checking Programming (NSCP) General syntax:

    run Variants 1 and 2 on Hardware Pair 1, Variants 3 and 4 on Hardware Pair 2
    compare Results 1 and 2
        if not (match) set NoMatch1
        else set Result Pair 1
    compare Results 3 and 4
        if not (match) set NoMatch2
        else set Result Pair 2
    if NoMatch1 and not NoMatch2, Result = Result Pair 2
    else if NoMatch2 and not NoMatch1, Result = Result Pair 1
    else if NoMatch1 and NoMatch2, raise exception
    else if not NoMatch1 and not NoMatch2
        then compare Result Pair 1 and 2
            if not (match), raise exception
            if (match), Result = Result Pair 1 or 2
    return Result

The NSCP syntax above states that the technique executes the n variants concurrently, on n/2 hardware pairs. The results of the paired variants are compared. If any pair’s results do not match, a flag is set indicating pair failure. If a single pair failure has occurred, then the nonfailing pair’s result is returned as the NSCP result. If both pairs failed to match, then an exception is raised. If neither pair failed, then the results of the two pairs are compared with each other. If they match, then the result is set to one of the matching values and returned as the NSCP result. If the two pair results do not match, then an exception is raised.
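The four-variant NSCP logic above can be sketched as follows. This is only an illustration under simplifying assumptions: the variants run sequentially in one process rather than on two hardware pairs, `None` is used as the "no match" flag, and all names (`run_pair`, `nscp`, the four absolute-value variants) are hypothetical.

```python
import math
from typing import Callable, Optional, Sequence

Variant = Callable[[int], int]


class NSCPError(Exception):
    """Raised when no trustworthy result can be delivered."""


def run_pair(first: Variant, second: Variant, x: int) -> Optional[int]:
    """Self-checking component: the agreed result, or None on a mismatch."""
    a, b = first(x), second(x)
    return a if a == b else None


def nscp(variants: Sequence[Variant], x: int) -> int:
    """Two self-checking pairs built from four variants."""
    if len(variants) != 4:
        raise ValueError("this sketch expects exactly four variants")
    pair1 = run_pair(variants[0], variants[1], x)
    pair2 = run_pair(variants[2], variants[3], x)
    if pair1 is None and pair2 is None:
        raise NSCPError("both pairs failed their internal comparison")
    if pair1 is None:
        return pair2  # only pair 2 is trustworthy
    if pair2 is None:
        return pair1  # only pair 1 is trustworthy
    if pair1 != pair2:
        raise NSCPError("the two pairs disagree with each other")
    return pair1


# Hypothetical variants of an absolute-value function; the last one is faulty.
def abs_builtin(x: int) -> int:
    return abs(x)


def abs_branch(x: int) -> int:
    return x if x >= 0 else -x


def abs_isqrt(x: int) -> int:
    return math.isqrt(x * x)


def abs_buggy(x: int) -> int:
    return x  # design fault: wrong for negative inputs


print(nscp([abs_builtin, abs_branch, abs_isqrt, abs_buggy], -3))  # 3
```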

Data Diversity Techniques • Data diversity, a technique for fault tolerance in software, was introduced by Ammann and Knight. • While the design diversity approaches to fault tolerance rely on multiple versions of the software written to the same specification, the data diversity approach uses only one version of the software. • This approach relies on the observation that software sometimes fails only for certain values in the input space, and that such a failure can be averted by a minor perturbation of the input data that is still acceptable to the software.

Data Diversity Techniques • This technique is cheaper to implement than the design diversity technique. • Popular techniques which are based on the data diversity concept for fault tolerance in software are: • Retry Blocks • N-Copy Programming

Data Diversity Techniques: Retry Blocks (RtB) • A retry block is a modification of the recovery block structure that uses data diversity instead of design diversity. • Rather than the multiple alternate algorithms used in a recovery block, a retry block uses only one algorithm. • A retry block's acceptance test has the same form and purpose as a recovery block's acceptance test.

Data Diversity Techniques: Retry Blocks (RtB) This figure illustrates the structure and operation of the basic RtB technique. A retry block executes the single algorithm normally and evaluates the acceptance test. If the acceptance test passes, the retry block is complete. If the acceptance test fails, the algorithm executes again after the data have been re-expressed. The system repeats this process until it violates a deadline or produces a satisfactory output.

Data Diversity Techniques: Retry Blocks (RtB) General syntax:

    ensure Acceptance Test
    by Primary Algorithm (Original Input)
    else by Primary Algorithm (Re-expressed Input)
    else by Primary Algorithm (Re-expressed Input)
    ...
    ...
    [Deadline Expires]
    else by Backup Algorithm (Original Input)
    else failure exception

The RtB syntax above states that the technique will first attempt to ensure the AT by using the primary algorithm. If the primary algorithm’s result does not pass the AT, then the input data are re-expressed and the same algorithm is attempted again, until a result passes the AT or the watchdog timer (WDT) deadline expires. If the deadline expires, the backup algorithm is invoked with the original inputs. If this backup algorithm is not successful, an error occurs.
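A minimal Python sketch of the retry block syntax above is given below. Everything named here is an assumption for illustration: `retry_block`, the `sinc` primary that fails at exactly zero, the `nudge` re-expression (an approximate one, since it shifts the input slightly), and the `in_range` acceptance test. The watchdog timer is modeled by a simple monotonic-clock deadline.

```python
import math
import time
from typing import Callable


class RetryBlockError(Exception):
    """Raised when neither the primary nor the backup algorithm passes the AT."""


def retry_block(
    primary: Callable[[float], float],
    backup: Callable[[float], float],
    reexpress: Callable[[float], float],
    acceptance_test: Callable[[float], bool],
    x: float,
    deadline_s: float,
) -> float:
    """Retry one algorithm on re-expressed input until the AT passes or time runs out."""
    deadline = time.monotonic() + deadline_s
    candidate = x  # the first attempt uses the original input
    while True:
        try:
            result = primary(candidate)
            if acceptance_test(result):
                return result
        except Exception:
            pass  # a crash counts as a failed acceptance test
        if time.monotonic() >= deadline:
            break  # watchdog deadline expired
        candidate = reexpress(candidate)
    # Deadline expired: fall back to the backup algorithm on the original input.
    try:
        result = backup(x)
    except Exception as exc:
        raise RetryBlockError("backup algorithm crashed") from exc
    if acceptance_test(result):
        return result
    raise RetryBlockError("backup result failed the acceptance test")


# Hypothetical primary with a fault at exactly x == 0 (division by zero).
def sinc(x: float) -> float:
    return math.sin(x) / x


def sinc_backup(x: float) -> float:
    return 1.0 if x == 0.0 else math.sin(x) / x


def nudge(x: float) -> float:
    return x + 1e-9  # approximate re-expression: a small perturbation


def in_range(result: float) -> bool:
    return math.isfinite(result) and abs(result) <= 1.0 + 1e-12


print(retry_block(sinc, sinc_backup, nudge, in_range, 0.0, deadline_s=1.0))
```

The primary runs at least once even with a zero deadline, which matches the syntax above: the original input is always tried before any re-expression or backup.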

Data Diversity Techniques: N-Copy Programming (NCP) • An N-copy system is similar to an N-version system but uses data diversity instead of design diversity. • N copies of a program execute in parallel, each on a set of data produced by re-expression. • The system selects the output to be used by an enhanced voting scheme.

Data Diversity Techniques: N-Copy Programming (NCP) This figure illustrates the structure and operation of the basic NCP technique. The NCP technique uses a decision mechanism (DM) and forward recovery to accomplish fault tolerance. The technique uses one or more data re-expression algorithms (DRAs) and at least two copies of a program. The system inputs are run through the DRA(s) to re-express the inputs. The copies execute in parallel using the re-expressed data as input. A DM examines the results of the copy executions and selects the “best” result, if one exists.

Data Diversity Techniques: N-Copy Programming (NCP) The basic NCP technique consists of an executive, 1 to n DRAs, n copies of the program or function, and a DM. The executive orchestrates the NCP technique operation, which has the general syntax:

    run DRA 1, DRA 2, ..., DRA n
    run Copy 1 (result of DRA 1), Copy 2 (result of DRA 2), ..., Copy n (result of DRA n)
    if (Decision Mechanism (Result 1, Result 2, ..., Result n))
        return Result
    else
        failure exception

The NCP syntax above states that the technique first runs the DRAs concurrently to re-express the input data, then executes the n copies concurrently. The results of the copy executions are provided to the DM, which operates upon the results to determine if a correct result can be adjudicated. If one can (i.e., the Decision Mechanism statement above evaluates to TRUE), then it is returned. If a correct result cannot be determined, then an error occurs.
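The NCP syntax above might look like this in Python. The sketch makes several assumptions: the names `n_copy`, `buggy_max` and the three DRAs are invented for illustration, the DRAs run sequentially rather than concurrently, a thread pool stands in for parallel copies, and a plain strict-majority vote replaces the enhanced voting scheme. The DRAs only reorder the list, so every re-expression is an exact one for a "maximum" function.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Data = List[int]


class NoConsensusError(Exception):
    """Raised when the decision mechanism cannot adjudicate a result."""


def n_copy(
    program: Callable[[Data], int],
    dras: Sequence[Callable[[Data], Data]],
    data: Data,
) -> int:
    """Run one program on several re-expressions of the input and vote."""
    reexpressed = [dra(list(data)) for dra in dras]
    with ThreadPoolExecutor(max_workers=len(dras)) as pool:
        results = list(pool.map(program, reexpressed))
    value, count = Counter(results).most_common(1)[0]
    if count * 2 > len(results):
        return value
    raise NoConsensusError(f"no majority among {results}")


# Hypothetical single program with a design fault: it never examines
# the last element, so it fails only when the maximum is stored there.
def buggy_max(values: Data) -> int:
    best = values[0]
    for v in values[1:-1]:
        if v > best:
            best = v
    return best


# Exact data re-expression algorithms: reordering does not change the maximum.
def identity(values: Data) -> Data:
    return values


def reverse(values: Data) -> Data:
    return values[::-1]


def rotate_left(values: Data) -> Data:
    return values[1:] + values[:1]


print(n_copy(buggy_max, [identity, reverse, rotate_left], [3, 1, 7]))  # 7
```

On the input `[3, 1, 7]` the copy that sees the original ordering returns 3, while the two copies that see reordered data return 7, so the vote masks the fault without a second version of the program.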

Environment Diversity Techniques • Environment diversity is the newest approach to fault tolerance in software. • The environment diversity approach requires re-executing the software in a different environment. • Transient faults typically occur in computer systems because design faults in the software lead to unacceptable and erroneous states in the OS environment. • When the software fails, it is restarted in a different, error-free OS environment state, which is achieved by some clean-up operations.
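The restart-after-clean-up idea above can be illustrated with a toy model. This is a loose sketch, not an account of how any real system does it: the "OS environment" is just a dictionary, and `run_with_restart`, `open_log` and `fresh_environment` are hypothetical names. The point it shows is that the same code is re-executed unchanged and only the environment state differs.

```python
from typing import Callable, Dict

Environment = Dict[str, int]


def run_with_restart(
    task: Callable[[Environment], str],
    clean_up: Callable[[], Environment],
    environment: Environment,
    max_restarts: int = 1,
) -> str:
    """Re-execute the same task in a cleaned-up environment after a failure."""
    if max_restarts < 0:
        raise ValueError("max_restarts must be non-negative")
    for attempt in range(max_restarts + 1):
        try:
            return task(environment)
        except RuntimeError:
            if attempt == max_restarts:
                raise  # restarts exhausted: report the failure
            environment = clean_up()  # same code, different environment state
    raise AssertionError("unreachable")


# Hypothetical task that fails only in a degraded environment state.
def open_log(environment: Environment) -> str:
    if environment["free_handles"] <= 0:
        raise RuntimeError("no free file handles")
    environment["free_handles"] -= 1
    return "log opened"


def fresh_environment() -> Environment:
    return {"free_handles": 8}


print(run_with_restart(open_log, fresh_environment, {"free_handles": 0}))  # log opened
```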

CONCLUSION • Many techniques have been developed for achieving fault tolerance in software. • The application of all of these techniques is relatively new to the area of fault tolerance. • Furthermore, each technique will need to be tailored to particular applications. • This should also be based on the cost of the fault tolerance effort required by the customer. • The differences between the techniques provide some flexibility of application.

REFERENCES • [1] P. E. Ammann and J. C. Knight, “Data Diversity: An Approach to Software Fault Tolerance”, IEEE Transactions on Computers, Vol. 37, No. 4, April 1988, pp. 418-425. • [2] Chris Inacio, “Software Fault Tolerance”, Carnegie Mellon University, 18-849b Dependable Embedded Systems, Spring 1998. • [3] Peter Popov, Bev Littlewood and Lorenzo Strigini, “Design Diversity: an Update from Research on Reliability Modelling”, Safety Critical Symposium 2001 (Springer, 2001). • [4] B. Littlewood, P. Popov and L. Strigini, “Modelling software design diversity: a review”, ACM Computing Surveys, 33(2):177-208, 2001. • [5] Zaipeng Xie, Hongyu Sun and Kewal Saluja, “A Survey of Software Fault Tolerance Techniques”.

Thank You!! Any Questions?
