EEC 693/793 Special Topics in Electrical Engineering: Secure and Dependable Computing
Lecture 11
Wenbing Zhao
Department of Electrical and Computer Engineering
Cleveland State University
wenbing@ieee.org
Outline
• Reminder: wiki page due 4/5
• Dependability concepts (some review)
• Fault, error and failure (some review)
• Fault/failure detection in distributed systems
• Consensus in asynchronous distributed systems
Dependable System
• Dependability:
  • Ability to deliver service that can justifiably be trusted
  • Ability to avoid service failures that are more frequent or more severe than is acceptable
• When service failures are more frequent or more severe than acceptable, we say there is a dependability failure
• For a system to be dependable, it must be:
  • Available - e.g., ready for use when we need it
  • Reliable - e.g., able to provide continuity of service while we are using it
  • Safe - e.g., does not have a catastrophic consequence on the environment
  • Secure - e.g., able to preserve confidentiality
Approaches to Achieving Dependability
• Fault Avoidance - how to prevent, by construction, the occurrence or introduction of faults
• Fault Removal - how to minimize, by verification, the presence of faults
• Fault Tolerance - how to provide, by redundancy, a service complying with the specification in spite of faults
• Fault Forecasting - how to estimate, by evaluation, the presence, the creation, and the consequences of faults
Graceful Degradation
• If a specified fault scenario develops, the system must still provide a specified level of service; ideally, the performance of the system degrades gracefully
• The system must not suddenly collapse when a fault occurs, or as the number and severity of faults increase
• Rather, it should continue to execute part of the workload correctly
Quantitative Dependability Measures
• Reliability - a measure of the continuous delivery of proper service - or, equivalently, of the time to failure
• It is the probability of surviving (potentially despite component failures) over an interval of time
• For example, the requirement might be stated as a reliability of 0.999999 for a 10-hour mission; in other words, the probability of failure during the mission may be at most 10^-6
• Hard real-time systems such as flight control and process control demand high reliability, because a failure could mean loss of life
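As a worked sketch of how such a requirement translates into a failure rate - assuming the common constant-failure-rate (exponential) model, which the slide does not state explicitly:

    R(t) = e^{-\lambda t}, \qquad
    R(10\,\mathrm{h}) \ge 0.999999
    \;\Rightarrow\;
    \lambda \le \frac{-\ln(0.999999)}{10\,\mathrm{h}} \approx 10^{-7}\ \mathrm{h}^{-1}
    \;\Rightarrow\;
    \mathrm{MTTF} = 1/\lambda \approx 10^{7}\ \mathrm{hours}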
Quantitative Dependability Measures
• Availability - a measure of the delivery of correct service with respect to the alternation between correct service and outages
• It is the probability of being operational at a given instant of time
• An availability of 0.999999 means that the system is non-operational for at most one hour in a million hours
• A system with high availability may in fact fail; however, the failure frequency and recovery time should be small enough to achieve the desired availability
• Soft real-time systems such as telephone switching and airline reservation systems require high availability
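In steady state, availability is commonly computed from the mean time to failure (MTTF) and the mean time to repair (MTTR); this standard formula is not on the slide, but it makes the "one hour in a million hours" figure explicit:

    A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}, \qquad
    1 - A = 10^{-6}
    \;\Rightarrow\;
    \text{expected downtime} \approx 10^{-6} \times 10^{6}\ \mathrm{hours} = 1\ \text{hour per million hours of operation}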
Fault, Error, and Failure
• The adjudged or hypothesized cause of an error is called a fault
• An error is the manifestation of a fault in a system, in which the logical state of an element differs from its intended value
• A service failure occurs if the error propagates to the service interface and causes the service delivered by the system to deviate from correct service
• The failure of a component causes a permanent or transient fault in the system that contains the component
• The service failure of a system causes a permanent or transient external fault for the other system(s) that receive service from that system
Fault
• Faults can arise during all stages of a computer system's evolution - specification, design, development, manufacturing, assembly, and installation - and throughout its operational life
• Most faults that occur before full system deployment are discovered through testing and eliminated
• Faults that are not removed can reduce a system's dependability when it is in the field
• A fault can be classified by its duration, the nature of its output, and its correlation with other faults
Fault Types - Based on Duration
• Permanent faults are caused by irreversible device/software failures within a component due to damage, fatigue, improper manufacturing, or bad design and implementation
  • Permanent software faults are also called Bohrbugs
  • Easier to detect
• Transient/intermittent faults are triggered by environmental disturbances or incorrect design
  • Transient software faults are also referred to as Heisenbugs
  • Studies show that Heisenbugs make up the majority of software faults
  • Harder to detect
Fault Types - Based on Nature of Output
• Malicious fault: a fault that causes a unit to behave arbitrarily or maliciously; also referred to as a Byzantine fault
  • E.g., a sensor sending conflicting outputs to different processors
  • E.g., a compromised software system that attempts to cause service failure
• Non-malicious faults: the opposite of malicious faults
  • Faults that are not caused with malicious intention
  • Faults that exhibit themselves consistently to all observers, e.g., fail-stop
• Malicious faults are much harder to detect than non-malicious faults
Fail-Stop System
• A system is said to be fail-stop if it responds to up to a certain maximum number of faults by simply stopping, rather than producing incorrect output
• A fail-stop system typically has multiple processors running the same tasks and comparing the outputs; if the outputs do not agree, the whole unit turns itself off (see the sketch below)
• A system is said to be fail-safe if one or more safe states can be identified that can be reached in case of a system failure, in order to avoid catastrophe
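A minimal Java sketch of the duplicate-and-compare idea behind fail-stop behavior; the class and method names are illustrative, not from the lecture:

    import java.util.List;
    import java.util.Objects;
    import java.util.function.Supplier;

    // Run the same task on several replicas and halt the whole unit on any
    // disagreement, rather than emitting a possibly incorrect output.
    public final class FailStopUnit<T> {
        private final List<Supplier<T>> replicas;   // the same task, replicated
        private volatile boolean halted = false;

        public FailStopUnit(List<Supplier<T>> replicas) {
            this.replicas = replicas;
        }

        // Returns the agreed-upon result, or halts (throws) instead of returning a bad value.
        public T execute() {
            if (halted) {
                throw new IllegalStateException("unit is halted");
            }
            T reference = replicas.get(0).get();
            for (Supplier<T> replica : replicas.subList(1, replicas.size())) {
                if (!Objects.equals(reference, replica.get())) {
                    halted = true;   // fail-stop: stop rather than produce incorrect output
                    throw new IllegalStateException("replica outputs disagree: unit halted");
                }
            }
            return reference;
        }
    }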
Fault Types - Based on Correlation
• Component faults may be independent of one another or correlated
• A fault is said to be independent if it does not directly or indirectly cause another fault
• Faults are said to be correlated if they are related; faults could be correlated due to physical or electrical coupling of components
• Correlated faults are more difficult to detect than independent faults
Fail Fast to Reduce Heisenbugs
• The bugs that software developers hate most:
  • The ones that show up only after hours of successful operation, under unusual circumstances
  • The stack trace usually does not provide useful information
• Such bugs can have many causes, for example:
  • Not checking the boundary of an array
  • Invalid defensive programming <= what fail fast addresses
• Reference: http://www.martinfowler.com/ieeeSoftware/failFast.pdf
Fail Fast to Reduce Heisenbugs
• Invalid defensive programming
  • Making your software robust by working around problems automatically
  • This results in the software "failing slowly"
  • That is, it facilitates error propagation - the program continues working right after an error but fails in strange ways later on
• Example:

    public int maxConnections() {
        String property = getProperty("maxConnections");
        if (property == null) {
            // Silently falls back to a default, hiding the broken configuration
            return 10;
        } else {
            return Integer.parseInt(property);
        }
    }
Fail Fast to Reduce Heisenbugs
• Fail-fast programming
  • When a problem occurs, the program fails immediately and visibly
  • It may sound like this would make your software more fragile, but it actually makes it more robust
  • Bugs are easier to find and fix, so fewer go into production
• Example:

    public int maxConnections() {
        String property = getProperty("maxConnections");
        if (property == null) {
            // Fail fast: surface the missing configuration immediately and visibly
            throw new NullPointerException(
                "maxConnections property not found in " + this.configFilePath);
        } else {
            return Integer.parseInt(property);
        }
    }
Failure Detection in Distributed Systems
• Consider the failure detection problem in an asynchronous distributed system, where there is
  • No upper bound on process execution time
  • No upper bound on clock drift rate
  • No upper bound on network delay
• In an asynchronous distributed system, you cannot tell a crashed process from a slow one, even if you can assume that messages are sequenced and retransmitted (an arbitrary number of times) so that they eventually get through
• This led Fischer, Lynch, and Paterson to prove that it is impossible to guarantee consensus in a fully asynchronous distributed system
Modeling Real Systems
• The asynchronous model is too weak, since it has no clocks (real systems have clocks, and "most" timing meets expectations... but with heavy tails)
• The synchronous model is too strong (real systems usually lack a way to implement synchronized rounds)
• Partially synchronous model: impose bounds on some properties
• Timed asynchronous model: bounds on clock drift rates and message delays
Consensus Problem
• Assumptions
  • Asynchronous distributed system
  • Complete network graph
  • Reliable FIFO broadcast communication
  • Deterministic processes, {0,1} initial values
  • Fail-stop failures are possible
• Solution requirements for consensus (see the interface sketch below)
  • Agreement: All processes decide on the same value
  • Validity: If a process decides on a value, then there was a process that started with that value
  • Termination: All processes that do not fail eventually decide
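The three requirements can be read as the contract of a small interface. The following Java sketch is only illustrative; the interface shape and names are assumptions, not part of the lecture:

    // Hypothetical interface capturing the binary consensus problem stated above.
    public interface BinaryConsensus {
        // Each process calls propose() once with its initial value (0 or 1).
        void propose(int initialValue);

        // Blocks until this process decides. A correct solution must guarantee:
        //   Agreement:   every process that returns gets the same value
        //   Validity:    the returned value was some process's proposed value
        //   Termination: every process that does not fail eventually returns
        int decide() throws InterruptedException;
    }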
Impossibility Results
• FLP impossibility of consensus
  • A single faulty process can prevent consensus
  • Because a slow process is indistinguishable from a crashed one
• Chandra and Toueg showed that the FLP impossibility applies to many problems, not just consensus
  • In particular, they showed that FLP applies to group membership and reliable multicast
  • So these practical problems are also impossible in asynchronous systems
  • They also looked at the weakest condition under which consensus can be solved
• Ways to bypass the impossibility result
  • Use an unreliable failure detector
  • Use a randomized consensus algorithm
Chandra/Toueg Idea
• Separate the problem into
  • The consensus algorithm itself
  • A "failure detector" - a form of oracle that announces suspected failures
• The aim is to determine the weakest oracle for which consensus is always solvable
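The oracle can be pictured as a tiny query interface that the consensus layer consults instead of using raw timeouts. This Java sketch is an assumption for illustration, not Chandra and Toueg's formal definition:

    import java.util.Set;

    // Hypothetical failure-detector oracle: the consensus algorithm periodically
    // asks which processes are currently suspected of having crashed.
    public interface FailureDetector {
        Set<Integer> suspects();
    }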
Failure Detector Properties
• Completeness: detection of every crash
  • Strong completeness: Eventually, every process that crashes is permanently suspected by every correct process
  • Weak completeness: Eventually, every process that crashes is permanently suspected by some correct process
Failure Detector Properties
• Accuracy: does it make mistakes?
  • Strong accuracy: No process is suspected before it crashes
  • Weak accuracy: Some correct process is never suspected
  • Eventual {strong/weak} accuracy: There is a time after which {strong/weak} accuracy is satisfied
A Sampling of Failure Detectors
Perfect Detector
• Named Perfect, written P
• Strong completeness and strong accuracy
• Immediately detects all failures
• Never makes mistakes
Example of a Failure Detector
• The detector they call ◊W: "eventually weak"
  • More commonly read as "diamond-W"
• Defined by two properties:
  • There is a time after which every process that crashes is suspected by some correct process {weak completeness}
  • There is a time after which some correct process is never suspected by any correct process {eventual weak accuracy}
• E.g., we can eventually agree upon a leader; if it crashes, we eventually, accurately detect the crash
◊W: Weakest Failure Detector
• ◊W is the weakest failure detector for which consensus is guaranteed to be achievable
• Algorithm
  • Rotate a token around a ring of processes
  • A decision can occur once the token makes it around once without a change in failure-suspicion status for any process
  • Subsequently, as the token is passed, each recipient learns the decision outcome
Building Systems with ◊W
• Unfortunately, this failure detector is not implementable
  • Even though it is the weakest failure detector that solves consensus
• Using timeouts, we can make mistakes at arbitrary times
  • A correct process might be suspected
• Nevertheless, timeout is the most widely used failure detection mechanism (see the sketch below)
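A minimal sketch of the timeout-based detection mentioned above, assuming each process periodically sends heartbeat messages; the class and method names are illustrative. Because a merely slow process can exceed the timeout, this detector can wrongly suspect correct processes - exactly the kind of mistake the slide warns about, so it only approximates ◊W rather than implementing it:

    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.stream.Collectors;

    // Suspect any process whose most recent heartbeat is older than a fixed timeout.
    public final class HeartbeatFailureDetector {
        private final long timeoutMillis;
        private final Map<Integer, Long> lastHeartbeat = new ConcurrentHashMap<>();

        public HeartbeatFailureDetector(long timeoutMillis) {
            this.timeoutMillis = timeoutMillis;
        }

        // Call whenever a heartbeat message arrives from the given process.
        public void onHeartbeat(int processId) {
            lastHeartbeat.put(processId, System.currentTimeMillis());
        }

        // Processes that have been silent for longer than the timeout.
        public Set<Integer> suspects() {
            long now = System.currentTimeMillis();
            return lastHeartbeat.entrySet().stream()
                    .filter(entry -> now - entry.getValue() > timeoutMillis)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toSet());
        }
    }

This class could serve as one concrete (but mistake-prone) implementation of the oracle interface sketched earlier.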
A Randomized Algorithm for Consensus
• Assumptions
  • n - total number of processes
  • f - total number of faulty processes
  • n > 2f
• Algorithm (run by each process)

    Iteration = 0
    x = initial value (0 or 1)
    Do forever:
        Iteration = Iteration + 1
        Step 1
        Step 2
A Randomized Algorithm for Consensus

    Step 1:
        Broadcast Proposal(Iteration, x)
        Wait for n-f messages of type Proposal(Iteration, *)
        If at least n/2 + 1 messages have the same value v
            then x = v
            else x = undefined

    Step 2:
        Broadcast Bid(Iteration, x)
        Wait for n-f messages of type Bid(Iteration, *)
        Let v be the real value (0 or 1) occurring most often (undefined does not count)
        and let m be the number of occurrences of v
        If m >= f+1 then Decide(x = v)
        Else if m >= 1 then x = v
        Else x = random(0 or 1)
A Randomized Protocol for Consensus
• In any round, either x is in {1, undefined} for all processes, or x is in {0, undefined} for all processes (the defined values never conflict)
• If all correct processes start with a value v, then within one round they will all decide v
• If, in some round r, some correct process decides v in Step 2, then all other correct processes will decide v within the next round
• Number of rounds needed (see the bound worked out below):
  • Landslide probability per round (all processes ending up with the same value by chance): at least 1/2^n
  • Pr[landslide within k rounds] >= 1 - (1 - 1/2^n)^k
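Spelling out the round-count bound implied by the two lines above (standard geometric-distribution reasoning; the expectation step is an addition, not on the slide):

    \Pr[\text{landslide in one round}] \ge 2^{-n}
    \;\Rightarrow\;
    \Pr[\text{landslide within } k \text{ rounds}] \ge 1 - \left(1 - 2^{-n}\right)^{k}
    \;\Rightarrow\;
    \mathbb{E}[\text{rounds until termination}] \le 2^{n}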