Diverse Data for Fault Detection in Programs

Greg Bronevetsky ED4I: Error Detection by Diverse Data and Duplicated Instructions

ED4I Background • A code transformation system developed at the Stanford Center for Reliable Computing. • Authors: Nahmsuk Oh, Subhasish Mitra, Edward J. McCluskey • ED4I allows us to run a program on two slightly different inputs and still be able to compare results at the end.

Motivation • The simplest way to detect Byzantine Faults is to run the same program on multiple processors and compare results. • ED4I is Byzantine Fault detection for uniprocessors. • Must take into account both temporary and permanent faults.

Definitions • Temporary Faults – any fault that temporarily affects a processor, long enough to execute several instructions. • Ex: Radiation hitting wires, frayed wires. • Permanent Faults – a fault that affects a processor for a long period of time. • Ex: Spilling Coke on the chip, cut wires.

Problem Statement • We can detect Byzantine Failures by running each program or procedure twice and comparing the results. • However, this does not guard against permanent faults since the results of both runs will be the same. • Need to make the two runs different so that the same fault will affect the results differently. • Overhead = 100%.

Key Idea • Let's feed two different sets of data into the program and then compare the results. • Key Insight: • If the program only uses arithmetic operations, we can alter the input by multiplying all input numbers by a constant. • Then the modified output will be (the real output) * (the constant). • Thus, you can verify that the two computations succeeded AND the two computations will be affected by errors differently.
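A minimal sketch of the key idea, assuming a toy arithmetic-only program (a weighted sum with fixed integer coefficients). The names `program` and `run_with_diversity` are hypothetical and used only for illustration:

```python
def program(xs):
    """Toy arithmetic-only program: a linear combination of its inputs."""
    return 3 * xs[0] + 5 * xs[1] - 2 * xs[2]


def run_with_diversity(xs, k):
    """Run once on xs and once on k*xs, then check the expected relation."""
    original = program(xs)
    diverse = program([k * x for x in xs])
    # In a fault-free run the diverse output equals k times the original one.
    return original, diverse, diverse == k * original


print(run_with_diversity([1, 2, 3], -2))
```

Because the two runs push different bit patterns through the hardware, a fault that corrupts both runs is unlikely to corrupt them in a way that preserves the factor-of-k relation.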

New Program • If we alter the input to the program, we must alter the program to work with this modified input. • The transformation is given the constant k (called the “diversity factor”) and it creates the “k-factor diverse program”. • The new program will have the same control flow graph as the old program but all the variables will be k-multiples of the original ones.

Transformations • If k<0, branches flip directions (> ↔ <, ≥ ↔ ≤) • All constants in code get multiplied by k. • Addition and Subtraction of variables unchanged. • Multiplication: v1*v2*...*vn → (v1*v2*...*vn)/k^(n-1) • Division: v1/v2 → (v1/v2)*k
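The rules above can be sketched on a small hand-transformed example. This is an illustrative sketch only: `original` and `transformed` are hypothetical functions, integers are unbounded Python ints (so overflow is ignored), and the transformation is written by hand rather than produced by the ED4I tool:

```python
def original(a, b):
    y = a * b + 4
    if y > 10:
        return y - a
    return y + b


def transformed(a_k, b_k, k):
    """k-factor diverse version; a_k and b_k are already k*a and k*b."""
    # Product of n = 2 variables is divided by k^(n-1) = k; constant 4 becomes 4*k.
    y = (a_k * b_k) // k + 4 * k
    # The constant 10 becomes 10*k, and the comparison flips when k < 0.
    above = (y > 10 * k) if k > 0 else (y < 10 * k)
    if above:
        return y - a_k  # subtraction of variables is unchanged
    return y + b_k      # addition of variables is unchanged


k = -2
for a, b in [(3, 4), (1, 2)]:
    print(original(a, b), transformed(k * a, k * b, k))
```

In a fault-free run every value in the transformed program is k times the corresponding value in the original, so both take the same path through the control flow graph.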

Fault Detection Probability • For functional unit hi (such as the adder), fault f and diversity factor k: • Xi = the set of inputs to hi • Ei = subset of Xi containing the inputs that will result in erroneous output due to the fault. • E'i = subset of Ei that will escape detection • Ci(k) = Probability of catching an error in hi.

Data Integrity Probability • For functional unit hi, fault f and diversity factor k: • Xi = the set of inputs to hi • Ei = subset of Xi containing the inputs that will result in erroneous output due to the fault. • E'i = subset of Ei that will escape detection • Di(k) = Probability of missing no errors in hi.
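The slides' formulas for these two quantities did not survive in this transcript. One natural way to write them from the set definitions above, assuming every input in Xi is equally likely (an assumption, not something the slides state), is:

```latex
C_i(k) = 1 - \frac{|E'_i|}{|E_i|},
\qquad
D_i(k) = 1 - \frac{|E'_i|}{|X_i|}
```

Under this reading, Ci(k) is the fraction of error-producing inputs that are caught, while Di(k) is the fraction of all inputs that do not lead to an undetected error. Since Ei is a subset of Xi, Di(k) is never smaller than Ci(k).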

Choosing the value of k • For some functional units we can derive Ci(k) and Di(k) analytically for each k. • This is too hard in general so we resort to trying out a range of k's empirically to determine Ci(k) and Di(k).

Bus Signal Line • Bus wire stuck at either 0 or 1. • Derived results for a 12-bit bus:
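A rough simulation sketch of a stuck-at bus line, not a reproduction of the derived results on the slide. It assumes 12-bit two's complement values, considers only inputs x for which k*x does not overflow, counts an input as erroneous when the original transfer is corrupted, and counts it as escaped when the corrupted values still satisfy (diverse value) = k * (original value). It reports C = 1 - escaped/erroneous and D = 1 - escaped/inputs. All function names are hypothetical:

```python
BITS = 12
MASK = (1 << BITS) - 1


def to_signed(pattern):
    """Interpret a 12-bit pattern as a two's complement integer."""
    return pattern - (1 << BITS) if pattern & (1 << (BITS - 1)) else pattern


def stuck_at(pattern, bit, value):
    """Force one bus line to 0 or 1."""
    return pattern | (1 << bit) if value else pattern & ~(1 << bit) & MASK


def bus_metrics(k, bit, value):
    lo, hi = -(1 << (BITS - 1)), (1 << (BITS - 1)) - 1
    inputs = [x for x in range(lo, hi + 1) if lo <= k * x <= hi]
    erroneous = escaped = 0
    for x in inputs:
        seen = to_signed(stuck_at(x & MASK, bit, value))
        if seen == x:
            continue  # the fault does not disturb this input
        erroneous += 1
        seen_k = to_signed(stuck_at((k * x) & MASK, bit, value))
        if seen_k == k * seen:
            escaped += 1  # both runs are wrong in a consistent way
    c = 1 - escaped / erroneous if erroneous else 1.0
    d = 1 - escaped / len(inputs)
    return c, d


for k in (1, -1, -2, 3):
    print(k, bus_metrics(k, bit=0, value=1))
```

With k = 1 (plain duplication) every error escapes, which is the motivation for diverse data in the first place.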

Adder • Experimental results for a 12-bit ripple carry adder: • Experimental results for a 12-bit carry look-ahead adder:

Multiplier & Divider • Experimental Results for • 12-bit array multiplier • 8-bit Wallace Tree multiplier • SRT divider

Shifter • Experimental Results for 16-bit multiplexer-based shifter:

Using Benchmarks to pick k • Need to determine how much each functional unit is used in the average program. • Add, sub, mult and shift use the obvious functional units. • “memory access” uses the memory bus • “branch” uses a carry-lookahead adder

Benchmarked Data Integrity • Calculated Data Integrity=Di(k) given above usage statistics. (high Di(k) top priority) • Highlighted columns provide the best data integrity for each benchmark.

Benchmarked Detection Probability • Calculated Detection Probability=Ci(k) given above usage statistics. • Highlighted columns provide the best detection probability for each benchmark.

Optimum k • Optimum k selected: • Must maximize the Data Integrity=Di(k). • Given maximum Di(k), maximize Ci(k). • For each program, should get an estimate of how it uses the different functional units and pick k accordingly.
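The selection rule above can be sketched as a lexicographic maximization. This sketch assumes the per-program value is a usage-weighted sum of the per-unit probabilities, which is an assumption about how the benchmark tables were combined. The function name `pick_k` is hypothetical and the numbers are made-up placeholders, not values from the slides:

```python
def pick_k(usage, d_table, c_table):
    """Pick k maximizing weighted Di(k) first, then weighted Ci(k)."""
    def weighted(table, k):
        return sum(usage[unit] * table[unit][k] for unit in usage)

    candidates = next(iter(d_table.values())).keys()
    return max(candidates,
               key=lambda k: (weighted(d_table, k), weighted(c_table, k)))


# Placeholder numbers for illustration only.
usage = {"adder": 0.5, "bus": 0.5}
d_table = {"adder": {-1: 0.75, -2: 0.75}, "bus": {-1: 0.5, -2: 0.5}}
c_table = {"adder": {-1: 0.5, -2: 0.75}, "bus": {-1: 0.5, -2: 0.5}}
print(pick_k(usage, d_table, c_table))
```

Here both candidates tie on data integrity, so the detection probability breaks the tie.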

Dealing with Overflow • By multiplying all variables by k, we may cause them to overflow. • Can scale variables up to next largest type. • Scale down variables by dividing by k. Must only check higher order bits when comparing new results to results of original program. • Can use compile-time range checking to determine vulnerability to overflow and pick k accordingly
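A minimal sketch of the range-checking idea, assuming the compiler (or the programmer) already knows an inclusive value range for a variable and that values are stored as two's complement integers. The helper names `fits` and `safe_factors` are hypothetical:

```python
def fits(value_range, k, bits=32):
    """True if every value in value_range stays representable after scaling by k."""
    lo, hi = value_range
    min_int, max_int = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    scaled = (k * lo, k * hi)  # the extremes of a linear scaling
    return min_int <= min(scaled) and max(scaled) <= max_int


def safe_factors(value_range, candidates, bits=32):
    """Filter candidate diversity factors down to those that cannot overflow."""
    return [k for k in candidates if fits(value_range, k, bits)]


print(safe_factors((-60, 60), [-1, -2, -3], bits=8))
```

Intermediate results inside the program would need the same check. This sketch only looks at a single variable's range.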

Floating Point Numbers • The above technique fails for floating point numbers. • IEEE 754 format: • k = -2 will only change the sign bit and some bits in the exponent. • Solution: pick separate k's for the exponent and the mantissa and run the program once with each k. • Overhead = 200%.
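The claim about k = -2 can be checked directly by splitting an IEEE 754 single-precision value into its fields. A small sketch using Python's standard `struct` module; the helper name `fields` is hypothetical:

```python
import struct


def fields(x):
    """Return (sign, exponent, mantissa) of x encoded as IEEE 754 single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


x = 5.75
print(fields(x))
print(fields(-2 * x))
```

The two mantissa fields are identical, so a fault in a mantissa bit affects both runs in the same way and the comparison cannot expose it.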

Picking k for the mantissa • To find errors in mantissa, pick k to be 3/2. • A stuck-at-1 fault: • In original program, variable x's value corrupted to: • In transformed program,Since However, the mantissa must be <2, so if • the mantissa is right shifted by 1 and normalized.

Transformed variables • So now, the value in transformed program is: • Value in original program is:

Fault Detection in Mantissa • If there is a stuck-at-1 fault • Value in transformed program: • Value in original program * k (for checking):

We can detect Mantissa errors! • Note that the error values for the original and the transformed programs are different! • We actually use k= in order to flip the sign bit for improved detection capability
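A small experiment sketch for the mantissa case, using k = 3/2 from the "Picking k for the mantissa" slide and comparing it with a power-of-two factor. It assumes single-precision values, a stuck-at-1 fault on one stored bit, and a check of the form (transformed value) = k * (original value). The helper names are hypothetical:

```python
import struct


def to_bits(x):
    return struct.unpack(">I", struct.pack(">f", x))[0]


def from_bits(b):
    return struct.unpack(">f", struct.pack(">I", b))[0]


def stuck_at_1(x, bit):
    """Value read back when one bit of the stored single is stuck at 1."""
    return from_bits(to_bits(x) | (1 << bit))


def detected(x, k, bit):
    faulty_original = stuck_at_1(x, bit)
    faulty_diverse = stuck_at_1(from_bits(to_bits(k * x)), bit)
    # The check passes only if the diverse result is k times the original one.
    return faulty_diverse != from_bits(to_bits(k * faulty_original))


print(detected(1.0, 1.5, 22))   # k = 3/2, fault on the top mantissa bit
print(detected(1.0, -2.0, 22))  # k = -2 leaves the mantissa untouched
```

With k = -2 both runs store the same mantissa, so the stuck bit corrupts them consistently. With k = 3/2 the mantissas differ and the corrupted values no longer differ by a factor of k.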

k for exponents • In order to flip all the bits of the exponent, need to transform program to use k= and k= • If a fault invalidates a bit of the exponent, the fault will be detected by comparing to the exponents of one of the two transformed programs.

Effectiveness for Mantissa • Effectiveness of k= (for IEEE 754 single precision)

Effectiveness for Exponent • Effectiveness of k= (for IEEE 754 single precision)

Summary • ED4I effectively detects Byzantine Failures in numerical applications on uniprocessors. • Purely software solution using Data Diversity. • Detects permanent and temporary faults. • Works with fixed-point and floating point numbers. • Compatible with arithmetic and logical operations (probably with any bitwise logical operation if it can be recast into arithmetic) • High overhead: 100% or 200%.
