Presentation-2 Group-A1

Presentation-2 Group-A1 Professor: Mohamed Khalil Anita Kanuganti Hemanth Rao

Fault Tolerance in Distributed Systems

Outline • Overview of Fault Tolerance • Importance in Distributed systems • Types of Faults • Measurement of Faults • Failure Models • Redundancy and Forms of Redundancy • Software Fault Tolerance Techniques • Reliable Communication • Distributed Commit • Failure Recovery • On-going Research • References

Fault Tolerance • The ability of a system to respond gracefully to an unexpected Hardware or Software Failure • There are many levels of Fault tolerance, the lowest being the ability to continue operation in the event of power failure.

Importance in Distributed Systems • Computer systems are not very reliable --- OS crashes frequently (Windows), buggy software, unreliable hardware, SW/HW incompatibilities --- Growing popularity of Internet/World Wide Web --- Example: what if your TV (or car) broke down every day? Users don’t want to restart a TV or fix it by opening it up. • So we need to make our computer systems more reliable and dependable.

Types of Faults • Nature ---Systematic ---Random • Duration ---Transient ---Intermittent ---Permanent • Extent ---Global ---Local

Measurement of Faults • Fault Removal Coverage • Fault Detection Coverage • Fault Tolerance Coverage

Failure Models

Redundancy and its Forms • Redundancy performs the same computation ‘n’ times, so if one instance fails another can take over • Forms of Redundancy --- Hardware Redundancy --- Software Redundancy --- Information Redundancy --- Temporal (time) Redundancy
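
The idea of running the same computation several times can be sketched as a majority vote over replicas. This is a minimal illustration, not a method taken from the slides: the names `run_redundant`, `good`, `crashing` and `faulty` are hypothetical, and results are assumed to be hashable so they can be counted.

```python
from collections import Counter

def run_redundant(replicas, x):
    """Run every replica on the same input and return the majority result."""
    results = []
    for replica in replicas:
        try:
            results.append(replica(x))
        except Exception:
            # A crashed replica simply contributes no vote.
            continue
    if not results:
        raise RuntimeError("all replicas failed")
    value, votes = Counter(results).most_common(1)[0]
    # Require agreement from more than half of ALL replicas (assumption).
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority among replicas")
    return value

# Hypothetical replicas: one correct, one that crashes, one that is wrong.
def good(x):
    return x * x

def crashing(x):
    raise RuntimeError("replica down")

def faulty(x):
    return x * x + 1
```

With three replicas the sketch masks one crashed or one wrong replica; with one wrong and one crashed replica there is no majority and it raises instead of guessing.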

Software Fault Tolerance Techniques • N-Version Programming --- Different implementations of the same program, so that the versions do not share identical design faults • Block Recovery --- Duplication of various critical software modules
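
One common way to use duplicated modules is to try a primary version, check its result with an acceptance test, and fall back to an alternate on failure. The sketch below assumes that reading of the block recovery bullet; the slides do not spell out the mechanism, and `recovery_block`, `fast_sqrt`, `newton_sqrt` and `close_enough` are hypothetical names.

```python
def recovery_block(alternates, acceptance_test, x):
    """Try each alternate in order; return the first result that passes the test."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            # A crashing alternate is treated like a failed acceptance test.
            continue
        if acceptance_test(x, result):
            return result
    raise RuntimeError("no alternate passed the acceptance test")

def fast_sqrt(x):
    # Hypothetical buggy primary: only correct when x == 4.
    return x / 2

def newton_sqrt(x):
    # Independent alternate using Newton's iteration.
    guess = x if x > 1 else 1.0
    for _ in range(50):
        guess = 0.5 * (guess + x / guess)
    return guess

def close_enough(x, result):
    # Acceptance test: squaring the result must give back the input.
    return abs(result * result - x) < 1e-6
```

Calling `recovery_block([fast_sqrt, newton_sqrt], close_enough, 9.0)` rejects the primary's answer and returns the alternate's.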

Reliable Communication • One-one communication --- Use reliable transport protocols (TCP) or handle at the application layer --- Possibilities • Client unable to locate server • Lost request messages • Server crashes after receiving request • Lost reply messages • Client crashes after sending request
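
Handling a lost reply at the application layer can be sketched with client retries plus request IDs, so that a retried request is not executed twice. Everything here is an in-memory simulation under stated assumptions: `Server`, `LossyChannel` and `deposit` are hypothetical names, and a lost reply is modeled as a `TimeoutError`.

```python
import itertools

class Server:
    def __init__(self):
        self.balance = 0
        self.replies = {}  # request_id -> cached reply

    def handle(self, request_id, amount):
        # A retransmitted request gets the cached reply, not a second deposit.
        if request_id in self.replies:
            return self.replies[request_id]
        self.balance += amount
        self.replies[request_id] = self.balance
        return self.balance

class LossyChannel:
    """Hypothetical channel that loses the replies whose call index is listed."""
    def __init__(self, server, drop_replies):
        self.server = server
        self.drop_replies = set(drop_replies)
        self.calls = 0

    def send(self, request_id, amount):
        index = self.calls
        self.calls += 1
        reply = self.server.handle(request_id, amount)  # the request arrives
        if index in self.drop_replies:
            raise TimeoutError("reply lost")  # the client never sees the reply
        return reply

_request_ids = itertools.count(1)

def deposit(channel, amount, max_attempts=3):
    request_id = next(_request_ids)  # the same ID is reused on every retry
    for _ in range(max_attempts):
        try:
            return channel.send(request_id, amount)
        except TimeoutError:
            continue
    raise ConnectionError("server unreachable after retries")
```

The client cannot tell a lost request from a lost reply, which is why the server, not the client, has to filter duplicates.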

Cont’d • One-many Communication --- Reliable Multicast • Lost messages need to be retransmitted --- Possibilities • ACK-based schemes: sender can become a bottleneck • NACK-based schemes
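
A NACK-based scheme can be sketched with sequence numbers: a receiver stays silent while messages arrive in order and only contacts the sender when it sees a gap. This is a simplified illustration; `Sender` and `Receiver` are hypothetical classes, and retransmission is assumed to be reliable and synchronous.

```python
class Sender:
    def __init__(self):
        self.history = {}   # seq -> payload, kept for retransmission
        self.next_seq = 0

    def multicast(self, payload):
        seq = self.next_seq
        self.history[seq] = payload
        self.next_seq += 1
        return seq, payload

    def retransmit(self, seq):
        return seq, self.history[seq]

class Receiver:
    def __init__(self, sender):
        self.sender = sender
        self.expected = 0    # next sequence number we expect to deliver
        self.delivered = []
        self.nacks_sent = 0

    def receive(self, seq, payload):
        if seq < self.expected:
            return  # duplicate, already delivered
        # A gap means earlier messages were lost: NACK each missing one.
        while self.expected < seq:
            self.nacks_sent += 1
            _, missing = self.sender.retransmit(self.expected)
            self.delivered.append(missing)
            self.expected += 1
        self.delivered.append(payload)
        self.expected += 1
```

Because feedback is sent only on loss, the sender handles one NACK here instead of one ACK per message per receiver.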

Distributed Commit • Atomic multicast: either all processes in a group perform an operation or none do • Problem of distributed commit --- all-or-nothing operations in a group of processes • Possible approaches --- 2-phase commit and 3-phase commit

Cont’d 2-Phase & 3-Phase commit • Coordinator process coordinates the operation • Involves 2 phases --- Voting phase: processes vote on whether to commit --- Decision phase: actually commit or abort • Problem: if the coordinator crashes, the processes block • 3-Phase commit: variant of 2-phase that avoids blocking
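
The voting and decision phases can be sketched as follows. This is a minimal in-memory illustration, assuming no crashes, timeouts or logging, so it does not show the blocking problem; `Participant` and `two_phase_commit` are hypothetical names.

```python
class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "INIT"

    def vote(self):
        # Phase 1: a participant that votes yes must be ready to commit later.
        self.state = "READY" if self.will_commit else "ABORTED"
        return self.will_commit

    def decide(self, commit):
        # Phase 2: apply the coordinator's global decision.
        self.state = "COMMITTED" if commit else "ABORTED"

def two_phase_commit(participants):
    """Coordinator: commit only if every participant votes yes."""
    votes = [p.vote() for p in participants]   # voting phase
    decision = all(votes)
    for p in participants:                     # decision phase
        p.decide(decision)
    return decision
```

A single no vote aborts the operation for the whole group, which gives the all-or-nothing behavior.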

Recovery • Techniques thus far allow failure handling • Recovery means the operations that must be performed after a failure to return to a correct state • Techniques: • Check Pointing • Message Logging

Check Pointing • Periodically checkpoint state • Upon a crash, roll back to a previous checkpoint with a consistent state • Types: -- Independent Checkpointing -- Coordinated Checkpointing

Message Logging • Check pointing is expensive • All processes restart from previous consistent cut • Taking a snapshot is expensive • All computations from previous snapshot have to be redone. • Combine check pointing (expensive) with message logging (cheap) • Take infrequent checkpoints • Log all messages between checkpoints to local stable storage • To recover: simply replay messages from the previous checkpoint and avoid recomputations from the previous checkpoint
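
Combining infrequent checkpoints with a message log can be sketched for a single process as below. This is an illustration under assumptions: `LoggedProcess` is a hypothetical class, message handling is deterministic, and the in-memory `_checkpoint` and `_log` attributes stand in for stable storage that survives the crash.

```python
import copy

class LoggedProcess:
    def __init__(self):
        self.state = {"total": 0, "count": 0}
        self._checkpoint = copy.deepcopy(self.state)  # stand-in for stable storage
        self._log = []                                # messages since the checkpoint

    def deliver(self, message):
        self._log.append(message)  # log first, then apply
        self._apply(message)

    def _apply(self, message):
        # Deterministic handler, so replaying the log rebuilds the same state.
        self.state["total"] += message
        self.state["count"] += 1

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()  # older messages are covered by the checkpoint

    def crash(self):
        self.state = None  # volatile state is lost

    def recover(self):
        # Restore the last checkpoint, then replay the logged messages.
        self.state = copy.deepcopy(self._checkpoint)
        for message in self._log:
            self._apply(message)
```

Only the messages received after the last checkpoint are replayed, so checkpoints can be taken less often without losing work.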

On-going Research • Intelligent / Adaptive Fault Tolerance • Summary

References: • “Fault Tolerance in Distributed Systems” by Pankaj Jalote • “Adaptive Fault tolerance in Distributed Systems” by Roger Bharath, Melanie Dumas and Mevlut Erdem Kurul

Questions?
