
Presentation-2 Group-A1

Presentation-2 Group-A1. Professor: Mohamed Khalil. Anita Kanuganti, Hemanth Rao. Fault Tolerance in Distributed Systems. Outline: Overview of Fault Tolerance, Importance in Distributed Systems, Types of Faults, Measurement of Faults, Failure Models, Redundancy and Forms of Redundancy


Presentation Transcript


  1. Presentation-2 Group-A1 • Professor: Mohamed Khalil • Anita Kanuganti • Hemanth Rao

  2. Fault Tolerance in Distributed Systems

  3. Outline • Overview of Fault Tolerance • Importance in Distributed Systems • Types of Faults • Measurement of Faults • Failure Models • Redundancy and Forms of Redundancy • Software Fault Tolerance Techniques • Reliable Communication • Distributed Commit • Failure Recovery • On-going Research • References

  4. Fault Tolerance • The ability of a system to respond gracefully to an unexpected hardware or software failure • There are many levels of fault tolerance, the lowest being the ability to continue operation in the event of a power failure.

  5. Importance in Distributed Systems • Computer systems are not very reliable --- Frequent OS crashes (Windows), buggy software, unreliable hardware, SW/HW incompatibilities --- Growing popularity of the Internet/World Wide Web --- Example: what if your TV (or car) broke down every day? Users don’t want to restart the TV or fix it by opening it up. • So we need to make our computer systems more reliable and dependable.

  6. Types of Faults • Nature --- Systematic --- Random • Duration --- Transient --- Intermittent --- Permanent • Extent --- Global --- Local

  7. Measurement of Faults • Fault Removal Coverage • Fault Detection Coverage • Fault Tolerance Coverage

  8. Failure Models

  9. Redundancy and its Forms • Redundancy performs the same computation n times, so that if one instance fails another can take over • Forms of Redundancy --- Hardware Redundancy --- Software Redundancy --- Information Redundancy --- Temporal (time) Redundancy
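The idea of repeating a computation n times can be sketched as majority voting over replicas. This is a minimal illustration, not part of the original slides: the `vote` function and the `good`, `faulty`, and `crashing` replicas are hypothetical names, and a crash is modelled as a raised exception.

```python
from collections import Counter
from typing import Callable, Sequence


def vote(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every replica on the same input and return the majority result."""
    results = []
    for replica in replicas:
        try:
            results.append(replica(x))
        except Exception:
            # A crashed replica simply contributes no vote.
            continue
    if not results:
        raise RuntimeError("all replicas failed")
    value, count = Counter(results).most_common(1)[0]
    # Require a strict majority of ALL replicas, not just the survivors.
    if count <= len(replicas) // 2:
        raise RuntimeError("no majority among replicas")
    return value


def good(x: int) -> int:
    return x * x


def faulty(x: int) -> int:
    return x * x + 1  # hypothetical replica that returns a wrong value


def crashing(x: int) -> int:
    raise RuntimeError("replica down")  # hypothetical replica that crashes


voted = vote([good, faulty, good], 4)  # the two correct replicas outvote the faulty one
```

With three replicas the scheme masks one faulty or crashed replica; if no value reaches a strict majority, `vote` raises instead of guessing.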

  10. Software Fault Tolerance Techniques • N-Version Programming --- Different implementations of the same program, used to avoid identical design faults • Block Recovery --- Duplication of various critical software modules
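As a rough sketch of block recovery, assuming the slide refers to the classic recovery-block scheme (a primary module, one or more alternates, and an acceptance test that checks each result): the names `recovery_block`, `primary_sqrt`, `alternate_sqrt`, and `close_enough` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence


def recovery_block(
    variants: Sequence[Callable[[float], float]],
    acceptance_test: Callable[[float, float], bool],
    x: float,
) -> float:
    """Try each variant in order; return the first result that passes the test."""
    for variant in variants:
        try:
            result = variant(x)
        except Exception:
            continue  # a crashing variant is skipped
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all variants failed the acceptance test")


def primary_sqrt(x: float) -> float:
    return x / 2  # hypothetical primary with a design fault


def alternate_sqrt(x: float) -> float:
    return x ** 0.5  # independently written alternate


def close_enough(x: float, result: float) -> bool:
    # Acceptance test: squaring the result must give back the input.
    return abs(result * result - x) < 1e-9


root = recovery_block([primary_sqrt, alternate_sqrt], close_enough, 9.0)
```

Unlike N-version programming, which runs all versions and votes, this scheme runs the alternate only when the primary's result is rejected.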

  11. Reliable Communication • One-to-one communication --- Use a reliable transport protocol (TCP) or handle failures at the application layer --- Possible failures • Client unable to locate server • Lost request messages • Server crashes after receiving request • Lost reply messages • Client crashes after sending request
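Handling a lost reply at the application layer can be sketched as client retries combined with duplicate detection on the server. This is a simulation under stated assumptions, not a real network stack: `Server`, `LossyChannel`, and `send_with_retry` are hypothetical names, and a lost reply is modelled as a `TimeoutError`.

```python
class Server:
    """Executes each request at most once by caching replies per request id."""

    def __init__(self):
        self.balance = 0
        self.replies = {}  # request id -> cached reply

    def handle(self, request_id, amount):
        if request_id in self.replies:
            return self.replies[request_id]  # duplicate: do not re-execute
        self.balance += amount
        self.replies[request_id] = self.balance
        return self.balance


class LossyChannel:
    """Delivers every request but drops the reply for the given call indices."""

    def __init__(self, server, drop_replies):
        self.server = server
        self.drop_replies = set(drop_replies)
        self.calls = 0

    def call(self, request_id, amount):
        index = self.calls
        self.calls += 1
        reply = self.server.handle(request_id, amount)
        if index in self.drop_replies:
            raise TimeoutError("reply lost")  # the client never sees the reply
        return reply


def send_with_retry(channel, request_id, amount, max_attempts=3):
    """Resend the same request id until a reply arrives or attempts run out."""
    for _ in range(max_attempts):
        try:
            return channel.call(request_id, amount)
        except TimeoutError:
            continue
    raise TimeoutError("server unreachable")


server = Server()
channel = LossyChannel(server, drop_replies={0})  # the first reply is lost
result = send_with_retry(channel, request_id=1, amount=50)
```

Because the retry reuses the same request id, the server applies the deposit once even though it receives the request twice.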

  12. Cont’d • One-to-many communication --- Reliable multicast • Lost messages need to be retransmitted --- Possibilities • ACK-based schemes: the sender can become a bottleneck • NACK-based schemes
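A NACK-based scheme can be sketched with sequence numbers: a receiver stays silent while messages arrive in order and reports only the gaps it detects. The `Sender` and `Receiver` classes below are hypothetical, and the network is replaced by direct method calls.

```python
class Sender:
    """Numbers each message and keeps it so it can be retransmitted."""

    def __init__(self):
        self.history = {}  # sequence number -> payload
        self.next_seq = 0

    def multicast(self, payload):
        seq = self.next_seq
        self.history[seq] = payload
        self.next_seq += 1
        return seq, payload

    def retransmit(self, seq):
        return seq, self.history[seq]


class Receiver:
    """Delivers messages in order and reports gaps as negative acknowledgements."""

    def __init__(self):
        self.expected = 0
        self.pending = {}  # out-of-order messages waiting for a gap to fill
        self.delivered = []

    def receive(self, seq, payload):
        """Accept one message; return the sequence numbers to NACK, if any."""
        if seq >= self.expected:
            self.pending[seq] = payload
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        if not self.pending:
            return []
        return [s for s in range(self.expected, max(self.pending))
                if s not in self.pending]


sender, receiver = Sender(), Receiver()
m0, m1, m2 = (sender.multicast(p) for p in ("a", "b", "c"))
receiver.receive(*m0)
nacks = receiver.receive(*m2)  # m1 was lost in transit, so a gap is detected
for seq in nacks:
    receiver.receive(*sender.retransmit(seq))
```

The sender hears from a receiver only when something is missing, which avoids the per-message acknowledgement load that makes the sender a bottleneck in ACK-based schemes.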

  13. Distributed Commit • Atomic multicast --- all processes in a group perform an operation, or none of them do • Problem of distributed commit --- all-or-nothing operations in a group of processes • Possible approaches --- 2-phase commit and 3-phase commit

  14. Cont’d 2-Phase & 3-Phase commit • A coordinator process coordinates the operation • Involves 2 phases --- Voting phase: processes vote on whether to commit --- Decision phase: actually commit or abort • Problem: if the coordinator crashes, the processes block • 3-Phase commit – a variant of 2-phase commit that avoids blocking
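The two phases can be sketched as follows. This is a single-process illustration with no failures or timeouts; `Participant` and `two_phase_commit` are hypothetical names, and the state labels are chosen for readability.

```python
class Participant:
    """A process that votes in phase 1 and obeys the decision in phase 2."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def vote(self):
        # After voting yes, the participant waits in READY for the decision.
        self.state = "READY" if self.can_commit else "ABORT"
        return self.can_commit

    def decide(self, commit):
        self.state = "COMMIT" if commit else "ABORT"


def two_phase_commit(participants):
    """Coordinator logic: commit only if every participant votes yes."""
    # Phase 1 (voting): collect a vote from every participant.
    votes = [p.vote() for p in participants]
    decision = all(votes)
    # Phase 2 (decision): broadcast the global commit or abort.
    for p in participants:
        p.decide(decision)
    return decision


group = [Participant("p1"), Participant("p2"), Participant("p3", can_commit=False)]
committed = two_phase_commit(group)
```

A participant in `READY` cannot decide on its own, which is where the blocking problem comes from: if the coordinator crashed between the two phases, every `READY` participant would have to wait for it to return.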

  15. Recovery • The techniques covered so far allow failure handling • Recovery means the operations that must be performed after a failure to return the system to a correct state • Techniques: • Checkpointing • Message Logging

  16. Checkpointing • Periodically checkpoint the state • Upon a crash, roll back to a previous checkpoint with a consistent state • Types: -- Independent checkpointing -- Coordinated checkpointing
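For a single process, periodic checkpointing and rollback can be sketched as below. The `CheckpointedProcess` class is hypothetical, the checkpoint lives in memory rather than on stable storage, and the crash is simulated by discarding the state.

```python
import copy


class CheckpointedProcess:
    """Keeps a copy of its state that it can roll back to after a crash."""

    def __init__(self):
        self.state = {"counter": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def step(self):
        self.state["counter"] += 1

    def checkpoint(self):
        # Deep copy so later updates cannot modify the saved state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._checkpoint)


proc = CheckpointedProcess()
for i in range(1, 8):
    proc.step()
    if i % 3 == 0:       # checkpoint every third step (an arbitrary interval)
        proc.checkpoint()

proc.state = None        # simulated crash: the in-memory state is lost
proc.rollback()          # recover from the checkpoint taken after step 6
```

The work done in step 7 is lost and would have to be redone. With several communicating processes, the chosen checkpoints must also be mutually consistent, which is the concern that independent and coordinated checkpointing address differently.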

  17. Message Logging • Checkpointing is expensive • All processes restart from the previous consistent cut • Taking a snapshot is expensive • All computations since the previous snapshot have to be redone • Combine checkpointing (expensive) with message logging (cheap) • Take infrequent checkpoints • Log all messages between checkpoints to local stable storage • To recover: simply replay the messages logged since the previous checkpoint, which avoids recomputing everything from that checkpoint
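Combining an infrequent checkpoint with a message log can be sketched as follows. The `LoggedProcess` class is hypothetical, a Python list stands in for local stable storage, and message handling is assumed to be deterministic so that replay reproduces the lost state.

```python
class LoggedProcess:
    """Checkpoints rarely and logs every message delivered in between."""

    def __init__(self):
        self.total = 0
        self._checkpoint = 0
        self._log = []  # stand-in for a log on local stable storage

    def deliver(self, amount):
        self._log.append(amount)  # log the message before processing it
        self.total += amount

    def checkpoint(self):
        self._checkpoint = self.total
        self._log.clear()  # messages before the checkpoint are no longer needed

    def recover(self):
        # Restore the checkpoint, then replay the logged messages in order.
        self.total = self._checkpoint
        for amount in self._log:
            self.total += amount


proc = LoggedProcess()
proc.deliver(5)
proc.deliver(10)
proc.checkpoint()        # checkpointed total: 15
proc.deliver(3)
proc.deliver(4)

proc.total = None        # simulated crash after the last message
proc.recover()           # checkpoint (15) plus replayed messages 3 and 4
```

After recovery the process is back in its pre-crash state without asking the senders to repeat anything, which is the benefit over rolling every process back to the last consistent cut.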

  18. On-going Research • Intelligent / Adaptive Fault Tolerance • Summary

  19. References: • “Fault Tolerance in Distributed Systems” by Pankaj Jalote • “Adaptive Fault tolerance in Distributed Systems” by Roger Bharath, Melanie Dumas and Mevlut Erdem Kurul

  20. Questions?
