EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing

EEC 693/793Special Topics in Electrical EngineeringSecure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University wenbing@ieee.org

Outline • Midterm#2 result • Group communication systems • Membership protocols • Agreed and safe delivery • Checkpointing and recovery • Reference: • Reliable distributed systems, by K. P. Birman, Springer; Chapter 14-16 EEC693: Secure & Dependable Computing

Midterm#2 Result • High 98, low 79, mean 92.7 • Average Q1-18.9, Q2-17.6, Q3-18.3, Q4-19.1, Q5-18.9 EEC693: Secure & Dependable Computing

Unreliable Failure Detection • Recall that failures are hard to distinguish from network delay • So we accept risk of mistake • If p is running a protocol to exclude q because “q has failed”, all processes that hear from p will cut channels to q • Avoids “messages from the dead” • q must rejoin to participate in GMS again EEC693: Secure & Dependable Computing

Basic GMP • Someone reports that “q has failed” • Leader (process p) runs a 2-phase commit protocol • Announces a “proposed new GMS view” • Excludes q, or might add some members who are joining, or could do both at once • Waits until a majority of members of current view have voted “ok” • Then commits the change EEC693: Secure & Dependable Computing

GMP Example • Proposes new view: {p,r} [-q] • Needs majority consent: p itself, plus one more (“current” view had 3 members) • Can add members at the same time Proposed V1 = {p,r} Commit V1 p q r OK V0 = {p,q,r} V1 = {p,r} EEC693: Secure & Dependable Computing

Special Concerns? • What if someone doesn’t respond? • P can tolerate failures of a minority of members of the current view • New first-round “overlaps” its commit: • “Commit that q has left. Propose add s and drop r” • P must wait if it can’t contact a majority • Avoids risk of partitioning EEC693: Secure & Dependable Computing

What If Leader Fails? • Here we do a 3-phase protocol • New leader identifies itself based on age ranking (oldest surviving process) • It runs an inquiry phase • “The adored leader has died. Did he say anything to you before passing away?” • Note that this causes participants to cut connections to the adored previous leader • Then run normal 2-phase protocol but “terminate” any interrupted view changes leader had initiated EEC693: Secure & Dependable Computing

GMP Example p • New leader first sends an inquiry • Then proposes new view: {r,s} [-p] • Needs majority consent: q itself, plus one more (“current” view had 3 members) • Again, can add members at the same time Inquire [-p] Proposed V1 = {r,s} Commit V1 q r OK: nothing was pending OK V0 = {p,q,r} V1 = {r,s} EEC693: Secure & Dependable Computing

Safe and Agreed Delivery • For totally ordered reliable multicast, there are two delivery policies • Safe delivery: a message is delivered only when all correct processes have received it • Agreed delivery: a message is delivered as long as it is the next message in total order EEC693: Secure & Dependable Computing

Safe and Agreed Delivery • Safe delivery guarantees the uniformity of multicast: • If a message is delivered to any process, it is delivered by all correct processes • Agreed delivery does not: • It is possible that a message is delivered in one (or more) process, but is not delivered by some correct process EEC693: Secure & Dependable Computing

Checkpointing • Checkpointing: the act of taking a snapshot of an entity so that we can restore it later • A replica is a process running in an operating system. The state of a process • Processes' memory, stack and registers • Threads • Open or mmap'ed files • Current working directory • Interprocess communication: • Semaphores, shared memory, pipes, sockets • Dynamic Load Libraries • … EEC693: Secure & Dependable Computing

Checkpointing • Many tools are available to perform checkpointing transparently or semi-transparently • http://www.checkpointing.org/ • Condor, libckpt, etc. • Checkpoints taken in general are not portable • Checkpoint size might be big EEC693: Secure & Dependable Computing

Checkpointing of Application State • Sometimes it is more efficient to save and store the application state only • Checkpoints can be very portable and compact in size • class Counter { int counter; Counter(int initVal) { counter = initVal; } void increment() {counter++; } void decrement() {counter--; } void setState(int c) {counter = c; } int getState() { return counter;}|} EEC693: Secure & Dependable Computing

Logging • Logging of messages • Checkpointing in general is expensive • Logging of messages is cheaper => we can periodically do checkpointing, or do checkpointing on demand and log all messages in between • Logging of other non-deterministic activities • Access order to shared data EEC693: Secure & Dependable Computing

Recovery • Roll-backward recovery • Used primarily by transaction processing • When a failure occurs, roll back using the most recent checkpoint (and retry) • Roll-forward recovery • Used primarily in space redundancy • To recover a repaired replica, transfer the state from a current replica to the recovering replica EEC693: Secure & Dependable Computing

Roll-Forward Recovery • With replication in space, it is possible to recover a fault while the system is progressing ahead • Roll-forward recovery is made possible by • Checkpointing of replica state • Logging of incoming messages • Reliable, totally ordered group communication system EEC693: Secure & Dependable Computing

Roll-Forward Recovery • We want to ensure the newly admitted replica to have a consistent state with others when it starts • Steps of adding a new replica into a group (with on-demand checkpointing) • A recovered (or a new) replica joins a group • A join message is multicast in total order • On receiving the join message, it is put into incoming message queue and wait for processing • When the join message is at the head of the queue, a checkpoint is taken and it is transferred to the new replica EEC693: Secure & Dependable Computing

Roll-Forward Recovery • At the new replica, it starts queueing messages after it receives the join messages (sent by itself) • When the checkpoint is received by the new replica, its state is restored using the received checkpoint (the checkpoint is delivered out of order!) • The queued messages are delivered in order, at the new replica • Other replicas do not stop and wait for the new replica • Steps of adding a new replica into a group with periodic checkpointing is similar EEC693: Secure & Dependable Computing

Steps of Roll-Forward Recovery EEC693: Secure & Dependable Computing

EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing

EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing

Presentation Transcript

Special topics in electrical and systems engineering: Systems Biology

EEE 498/598 Overview of Electrical Engineering

Cluster Computing in the Classroom: Topics, Guidelines, and Experiences

Wind Turbine Simulation Phase IV

NEW STUDENT ORIENTATION Welcome to the College of Engineering!

802307 Electrical Engineering for Civil Engineers

From Dependable Architectures To Dependable Systems

Electrical Engineering

802307 Electrical Engineering for Mechinacal Engineers

A Large-Scale Study of Failures in High-Performance Computing Systems

A secure broadcasting cryptosystem and its application to grid computing

Fundamental of Electrical Engineering EMT 113/4

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing

Twin Clouds: An Architecture for Secure Cloud Computing

Vishwani D. Agrawal James J. Danaher Professor Department of Electrical and Computer Engineering

Secure Processing On-Chip

Grid Computing (Special Topics in Computer Engineering)

Quantum Computing: Opportunities and Challenges

Is Your Web Server Suffering from Undue Stress due to Duplicate Requests?

Lynn Choi School of Electrical Engineering