
ECE 753: FAULT-TOLERANT COMPUTING






Presentation Transcript


  1. ECE 753: FAULT-TOLERANT COMPUTING
     Kewal K. Saluja, Department of Electrical and Computer Engineering
     High-Level Fault Tolerance: Checkpointing and Recovery
     Introductory material

  2. Overview
     • Introduction and basic concept
     • Fault model and fault coverage
     • Checkpointing and backward error recovery (rollback)
     • General principles
     • Uniprocessor systems
     • Summary
     • Cost, overhead, and latency issues
     • Distributed systems
     ECE 753 Fault Tolerant Computing

  3. Introduction
     • References
       • Text, Chapter 6
       • [Prad:96] Chapter 3, sections on rollback and reconfiguration

  4. Introduction (contd.)
     • Somewhat higher level than ECC and watchdog techniques; uses re-execution as the basic recovery strategy
     • In practice, it is a hardware-assisted software method
     • Basic concept: save a fault-free state of the system; if and when an error is detected, reload the fault-free state and re-execute

  5. Introduction - Basic Concept (contd.)
     • Three phases of recovery
       • Error detection
       • Damage assessment
       • Recovery: eliminate the error and return to the point where the error was detected
         • Often entails restarting fresh on a system presumed to be fault-free
     • Backward error recovery
       • The current process is rolled back to some error-free point and re-executes
       • Trivial solution: start afresh from the beginning of the program
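The save, reload, and re-execute idea can be sketched in a few lines. This is a minimal illustration, not the course's implementation: the names `DetectedError`, `run_with_rollback`, and `make_flaky_increment` are hypothetical, the state is assumed to be a plain dictionary, and error detection is assumed to be reliable and to surface as an exception.

```python
import copy


class DetectedError(Exception):
    """Raised by the (assumed reliable) error detection mechanism."""


def run_with_rollback(state, steps, max_retries=3):
    """Run steps in order, saving a checkpoint before each one.

    On a detected error, reload the saved state and re-execute the step.
    """
    for step in steps:
        checkpoint = copy.deepcopy(state)          # save the fault-free state
        for _attempt in range(max_retries + 1):
            try:
                step(state)                        # normal execution
                break                              # step finished, move on
            except DetectedError:
                # Roll back: restore the contents in place.
                state.clear()
                state.update(copy.deepcopy(checkpoint))
        else:
            # Re-execution did not remove the error, so it is not transient.
            raise RuntimeError("error persists after rollback")
    return state


def make_flaky_increment(fail_times):
    """Hypothetical step: increments x, but fails the first fail_times calls."""
    remaining = [fail_times]

    def step(state):
        state["x"] += 1
        if remaining[0] > 0:
            remaining[0] -= 1
            state["x"] = -999                      # simulate corrupted state
            raise DetectedError("transient fault")

    return step


state = {"x": 0}
run_with_rollback(state, [make_flaky_increment(1), make_flaky_increment(0)])
print(state)  # {'x': 2}
```

Checkpointing before every step is the simplest possible policy; it trades a high saving cost for a very short re-execution time.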

  6. Fault model and fault coverage
     • Possible scenarios
       • Hardware is faulty, software is fault-free
       • A fault detection mechanism exists, in hardware or in software form
       • Hardware is fault-free, software is faulty
       • Both hardware and software are faulty
     • Assumptions for backward error recovery
       • A reliable error detection mechanism exists
       • The error can be removed by re-execution
       • The process state can be restored to a previous error-free state

  7. Fault model and fault coverage (contd.)
     • Based on the assumptions stated:
       • The method is normally applicable when an error detection mechanism exists, the hardware faults are transient, and there are no software faults
       • Methods that address other fault scenarios are
         • Reconfiguration
         • Software fault tolerance, e.g. recovery blocks and N-version programming

  8. Checkpointing and Rollback
     • General principles
       • Time redundancy is permissible
       • Hardware errors are transient
       • For software errors (design or otherwise), alternative modules exist, or the errors are timing errors that may be resolved during re-execution
       • A reliable error detection mechanism exists
       • It is feasible to determine checkpoints (system states that need to be saved) in an application
       • The method can apply to redundant as well as nonredundant systems

  9. Checkpointing and Rollback (contd.)
     • General issues: checkpointing & rollback
       • Save the system state at regular intervals
       • How often to save: the checkpoint interval
       • How much to save: can be as little as the PC and status flags (just one instruction), or as much as a log of all messages, the complete program, and the associated data values at a given time
       • How long a delay between fault occurrence and its detection (error latency) is tolerable: a large error latency often makes this method less than ideal
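The "how often to save" question can be made concrete with a first-order cost model. This model is not stated in the slides; it is the commonly used approximation (often attributed to Young) in which the overhead fraction is roughly C/T + T/(2M), where T is the checkpoint interval, C the time to write one checkpoint, and M the mean time between failures. All three symbols and the function names below are assumptions introduced for illustration.

```python
import math


def overhead_fraction(interval, checkpoint_cost, mtbf):
    """Approximate fraction of time lost to checkpointing and re-execution.

    checkpoint_cost / interval : time spent saving state
    interval / (2 * mtbf)      : expected rework, assuming a failure loses
                                 half an interval on average
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


def best_interval(checkpoint_cost, mtbf):
    """Interval that minimizes overhead_fraction under this model."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


# Example: 2 s per checkpoint, one failure per hour on average.
t = best_interval(2.0, 3600.0)
print(t)                                   # 120.0 seconds
print(overhead_fraction(t, 2.0, 3600.0))   # roughly 0.033
```

Under these assumed numbers, saving much more often than every two minutes wastes time writing checkpoints, and saving much less often wastes time on re-execution.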

  10. Checkpointing and Rollback (contd.)
      • General issues: checkpointing & rollback
        • Rollback recovery
          • Where do we go back to: damage assessment
          • Rollback: load the state vector (the state of the processor and the data that may have been altered or corrupted)
          • Restart the computation
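The "where do we go back to" question depends on error latency: a checkpoint taken after the fault occurred may already be corrupted. The sketch below assumes a known upper bound on error latency and picks the newest checkpoint taken strictly before the earliest possible fault time. The class `CheckpointLog` and its methods are hypothetical names for illustration.

```python
import bisect
import copy


class CheckpointLog:
    """Keeps timestamped checkpoints, oldest first."""

    def __init__(self):
        self._times = []
        self._states = []

    def save(self, time, state):
        # Assumption: checkpoints are saved in increasing time order.
        self._times.append(time)
        self._states.append(copy.deepcopy(state))

    def rollback_target(self, detect_time, max_latency):
        """Return (time, state) of the newest checkpoint that is certainly
        older than the fault, given a bound on the error latency."""
        earliest_fault = detect_time - max_latency
        # Index of the last checkpoint with time < earliest_fault.
        i = bisect.bisect_left(self._times, earliest_fault) - 1
        if i < 0:
            raise LookupError("no safe checkpoint; restart from the beginning")
        return self._times[i], copy.deepcopy(self._states[i])


log = CheckpointLog()
for t in (0, 10, 20, 30):
    log.save(t, {"progress": t})

# Error detected at t=34 with latency up to 8: the fault may date from t=26,
# so the checkpoint at t=30 is suspect and the one at t=20 is used.
print(log.rollback_target(detect_time=34, max_latency=8))  # (20, {'progress': 20})
```

A larger latency bound forces the rollback further into the past, which is one way to see why a large error latency makes the method less attractive.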

  11. Checkpointing and Rollback (contd.)
      • What do we need
        • Error detection mechanism
          • Various self-checking mechanisms, e.g. error detection, timers, watchdog, acceptance tests
        • Storage for state/data saving
          • Large enough storage: PC, stack, data segments (static and dynamic), information about user and system files that may be open
          • Access time: an issue during storing and retrieval
          • Volatility and stability of the storage
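The storage stability requirement can be illustrated with a file-based checkpoint that is never left half-written. This is a sketch under stated assumptions: the state is picklable, the checkpoint lives in an ordinary file, and the helper names `save_checkpoint` and `load_checkpoint` are hypothetical. It writes to a temporary file, forces the data to disk, and then renames it over the old checkpoint, so a crash during saving leaves the previous checkpoint intact.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to path so that path always holds a complete checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())       # push the data to the storage device
        os.replace(tmp_path, path)     # swap in the new checkpoint
    except BaseException:
        # Saving failed: discard the partial file, keep the old checkpoint.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Read back the most recently completed checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

A typical use is to call `save_checkpoint("job.ckpt", state)` at each checkpoint and `load_checkpoint("job.ckpt")` on restart; the file name is only an example. The access-time cost mentioned above shows up here as the `fsync` call, which is usually the slowest part.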

  12. Checkpointing and Rollback (contd.)
      • What do we need (contd.)
        • Events
          • Messages and transactions that should be logged and replayed
        • Procedures to handle errors and restart computation
          • What if errors continue to exist? A mechanism is needed to handle this
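Logging and replaying messages can be sketched as follows. The sketch assumes the message handler is deterministic, so replaying the same messages from the same checkpoint reproduces the same state; `LoggedProcess` and `add_amount` are hypothetical names used only for this example.

```python
import copy


class LoggedProcess:
    """Process whose inputs since the last checkpoint are logged for replay."""

    def __init__(self, state, handler):
        self.state = state
        self.handler = handler                 # handler(state, message) mutates state
        self._checkpoint = copy.deepcopy(state)
        self._log = []                         # messages delivered since the checkpoint

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()                      # older messages are no longer needed

    def deliver(self, message):
        self._log.append(message)              # log first, then process
        self.handler(self.state, message)

    def recover(self):
        """Reload the checkpoint and replay the logged messages in order."""
        self.state = copy.deepcopy(self._checkpoint)
        for message in self._log:
            self.handler(self.state, message)
        return self.state


def add_amount(state, message):
    state["total"] += message


p = LoggedProcess({"total": 0}, add_amount)
p.deliver(5)
p.checkpoint()
p.deliver(7)
p.deliver(1)
p.state["total"] = -1          # simulate corruption detected after the checkpoint
print(p.recover())             # {'total': 13}
```

Without the log, rolling back to the checkpoint would lose the two messages delivered after it, and the senders would have to be rolled back as well.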

  13. Checkpointing: Uniprocessor systems
      • Equivalence of uniprocess and uniprocessor systems
      • Simplest scheme
        • Instruction re-execution
          • Hardware (parity, self-checking, duplication) reports an error
          • The instruction is re-executed using the previous data and state
        • Issues
          • Register file update (commit)
          • Latency, especially in pipelined systems
      • The key is to determine the state to be saved
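Instruction re-execution with a delayed register file commit can be modeled in software. This is a toy model, not a description of any real processor: the register file is a dictionary, `check` stands in for the hardware detection mechanism, and the names `execute_with_retry` and `make_glitchy_add` are hypothetical.

```python
def execute_with_retry(registers, dest, operation, check, max_retries=2):
    """Compute a result into a temporary and commit it only if check passes.

    Because the register file is untouched until the commit, a retry sees
    exactly the previous data and state.
    """
    for _ in range(max_retries + 1):
        result = operation(registers)      # reads registers, writes nothing
        if check(result):
            registers[dest] = result       # commit to the register file
            return result
    raise RuntimeError("error persists after retries; fault is not transient")


def make_glitchy_add(glitches):
    """Hypothetical adder that flips the low bit on its first few uses."""
    remaining = [glitches]

    def add(regs):
        value = regs["r1"] + regs["r2"]
        if remaining[0] > 0:
            remaining[0] -= 1
            return value ^ 1               # transient single-bit error
        return value

    return add


regs = {"r1": 2, "r2": 3, "r3": 0}


def duplicate_check(result):
    # Models detection by duplication: compare against a second, fault-free unit.
    return result == regs["r1"] + regs["r2"]


print(execute_with_retry(regs, "r3", make_glitchy_add(1), duplicate_check))  # 5
```

The commit issue listed above is visible here: if the destination were written before the check, a retry of an instruction such as `r1 = r1 + r2` would start from corrupted data.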

  14. Checkpointing: Uniprocessor systems (contd.)
      • Process control systems
        • A program that monitors a process behaves in a predetermined manner: known control flow, typically periodic
        • Checkpoints can be defined statically

  15. Checkpointing: Uniprocessor systems (contd.)
      • Process control systems (contd.)
        • Typical objectives
          • Recovery is possible within a given time
          • Minimize the total number of checkpoints
        • Methods of this nature were studied in the 1960s

  16. Checkpointing: Uniprocessor systems (contd.)
      • General purpose systems
        • How much information to save
          • System state consisting of register file, PC, stack, etc.
          • Data?
            • All of it? Can be prohibitive (space and time)
            • So?
            • Only the data that has been modified since the last checkpoint
          • How do we do this efficiently?
            • Caches provide a nice boundary to achieve this
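Saving only the data modified since the last checkpoint can be sketched with an undo log: the first write to an item after a checkpoint records its old value, and rollback restores just those items. This is a software analogy chosen for illustration, not the cache-based scheme itself; `IncrementalStore` is a hypothetical name, and stored values are assumed to be replaced rather than mutated in place.

```python
class IncrementalStore:
    """Key-value data with checkpoints that cover only modified items."""

    _MISSING = object()    # marks keys that did not exist at the checkpoint

    def __init__(self, data=None):
        self._data = dict(data or {})
        self._undo = {}    # key -> value held at the last checkpoint

    def read(self, key):
        return self._data[key]

    def write(self, key, value):
        # Record the old value only on the first write since the checkpoint.
        if key not in self._undo:
            self._undo[key] = self._data.get(key, self._MISSING)
        self._data[key] = value

    def checkpoint(self):
        """Establish a new checkpoint; return how many items had changed."""
        changed = len(self._undo)
        self._undo.clear()
        return changed

    def rollback(self):
        """Restore every item modified since the last checkpoint."""
        for key, old in self._undo.items():
            if old is self._MISSING:
                self._data.pop(key, None)
            else:
                self._data[key] = old
        self._undo.clear()

    def snapshot(self):
        return dict(self._data)


store = IncrementalStore({"a": 1, "b": 2})
store.write("a", 10)
store.write("c", 3)
store.rollback()
print(store.snapshot())  # {'a': 1, 'b': 2}
```

The undo log plays a role loosely comparable to the set of modified lines in a cache: its size is bounded by how much was written since the last checkpoint, not by the total amount of data.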

  17. Summary
      • Discussed classical studies of checkpointing
