Filter Creation for Application Level Fault Tolerance and Detection

Eric Ciocca, Israel Koren, C.M. Krishna
ECE Department, UMass Amherst

Overview
• Our approach to fault detection and tolerance relies on an application’s inherent familiarity with its own data
• Fault detection and tolerance are handled at the application level
• Applications do not need hardware or middleware to provide fault recovery
• To recognize trends in application data, the developer must be familiar with what the data represents
• An existing trend can be used for fault detection, but it must be defined quantitatively so that the application can detect it

What is ALFTD?
• ALFTD complements existing system or algorithm level fault tolerance by leveraging information available only at the application level
• Using such application level semantic information significantly reduces the overall cost of providing fault tolerance
• ALFTD may be used alone or to supplement other fault detection schemes
• ALFTD is tunable: it permits users to trade off fault tolerance against computation overhead
• Allowing more overhead for ALFTD produces better results

Principles of ALFTD
[Figure: four nodes (Node 1 to Node 4) with primaries P1, P2, P3, P4 and secondaries S4, S1, S2, S3]
• Every physical node runs its own work (P, primary) as well as a scaled-down copy of a neighboring node’s work (S, secondary)
• If a fault corrupts a process, the corresponding secondary of that task will still produce output, albeit at a lower (but acceptable) quality
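A minimal sketch of this placement, assuming a ring layout in which node i hosts the secondary of node i-1. That rule is inferred from the label order in the diagram (S4, S1, S2, S3) and is not stated on the slide; the function name `assign_tasks` is hypothetical.

```python
def assign_tasks(num_nodes):
    """Map each node to (its primary, the secondary of its ring neighbor).

    Assumption: node i hosts the secondary of node i-1, wrapping around.
    """
    assignment = {}
    for node in range(1, num_nodes + 1):
        # Node 1 wraps around to the last node in the ring.
        neighbor = num_nodes if node == 1 else node - 1
        assignment[node] = (f"P{node}", f"S{neighbor}")
    return assignment


print(assign_tasks(4))
```

With this layout, losing one node still leaves a scaled-down copy of its work on the next node in the ring.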

Fault Detection
• Faults do not always completely disable a node
• Malformed and corrupted data are also possible
• Hardware-disabling faults are easy to detect with watchdog hardware and “I am alive” messages
• Faulty data is difficult to detect without application syntax
• Fault detection is necessary for ALFTD to decide which secondaries to schedule
• Secondary processes can provide verification for ambiguously faulty data

Principles of ALFTD Filters
[Figure: flow diagram with the labels Results from Primary, Filter 1, Filter 2, Pass, Fail, Data is OK, Secondary Task Queue]
• Faults are detected by passing results through one or more acceptance filters
• Filters are unique to applications with certain data characteristics
• Value bound tests are applicable to most applications
• Sanity checks require knowledge of the expected output value and format
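The flow in the diagram can be sketched as a chain of acceptance filters. This is an illustration under assumptions, not the authors' implementation: `run_filters`, the two example filters, and the numeric bounds are all hypothetical.

```python
from collections import deque


def run_filters(task_id, result, filters, secondary_queue):
    """Return True if every acceptance filter passes.

    On the first failure, the task is placed on the secondary task queue.
    """
    for accept in filters:
        if not accept(result):
            secondary_queue.append(task_id)
            return False
    return True


# Hypothetical filters over a list of temperature values (made-up bounds).
def non_empty(values):
    return len(values) > 0


def within_bounds(values):
    return all(200.0 < v < 350.0 for v in values)


queue = deque()
ok = run_filters("P1", [280.0, 281.5], [non_empty, within_bounds], queue)
bad = run_filters("P2", [280.0, 0.0], [non_empty, within_bounds], queue)
print(ok, bad, list(queue))
```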

OTIS Characteristics
• ALFTD was applied to OTIS (Orbital Thermal Imaging Spectrometer), part of the REE suite
• OTIS reads radiation values from various bands and calculates temperature data
• The output can be viewed graphically or numerically
• OTIS lends itself to ALFTD because the output data (temperature) has
  • Local Correlation: data changes gradually over an area
  • Absolute Bounds: data falls within some expected realistic range

ALFTD in OTIS
• Local Correlation and Absolute Bounds on the data led to the creation of two data fault filters
  • Spatial Locality Filter: if the difference between pixel (x,y) and pixel (x-1,y) is greater than some threshold, the pixel may be the result of faulty data
  • Absolute Bounds Filter: any pixel whose value does not fall within a fixed range (lower bound < value < upper bound) may be the result of faulty data
• The filter thresholds are set based on the sample datasets provided
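The two filters above can be sketched as follows. The image is assumed to be a list of rows of temperature values, and the threshold and bounds used in the example are made-up numbers, not the calibrated OTIS settings.

```python
def spatial_locality_flags(image, threshold):
    """Flag pixel (x, y) when it differs from (x-1, y) by more than threshold."""
    flags = []
    for row in image:
        row_flags = [False]  # the first pixel in a row has no left neighbor
        for x in range(1, len(row)):
            row_flags.append(abs(row[x] - row[x - 1]) > threshold)
        flags.append(row_flags)
    return flags


def bounds_flags(image, low, high):
    """Flag any pixel that is not strictly inside (low, high)."""
    return [[not (low < v < high) for v in row] for row in image]


image = [
    [280.0, 281.0, 120.0, 282.0],
    [279.5, 280.5, 281.0, 400.0],
]
print(spatial_locality_flags(image, 5.0))
print(bounds_flags(image, 200.0, 350.0))
```

In this sketch the spatial locality filter also flags the healthy pixel that follows an outlier, because its left neighbor is the outlier. How the original implementation treats that case is not stated in the slides.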

OTIS Datasets
[Figure: the “Blob”, “Stripe”, and “Spots” datasets, shown faultless and faulty]

OTIS Datasets with ALFTD
[Figure: the “Blob”, “Stripe”, and “Spots” datasets, shown faulty and ALFTD corrected]

Problem
• ALFTD filters require calibration
• Calibration constants are context sensitive
• Filter values can be approximated, but well-tuned filters improve detection efficiency
• Heuristics are created based on characteristics of the most frequent data

Frequency Plots (bounds filter)
• Frequency of temperature values

Frequency Plots (spatial locality filter)
• Frequency of differences between adjacent pixels
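The data behind both frequency plots can be tabulated with a simple binned count. This is a sketch only: the bin widths, the sample image, and the function names are assumptions for illustration.

```python
from collections import Counter


def value_histogram(image, bin_width):
    """Count pixels per bin; each key is the lower edge of a bin."""
    counts = Counter()
    for row in image:
        for v in row:
            counts[(v // bin_width) * bin_width] += 1
    return counts


def difference_histogram(image, bin_width):
    """Count differences between horizontally adjacent pixels per bin."""
    counts = Counter()
    for row in image:
        for left, right in zip(row, row[1:]):
            d = right - left
            counts[(d // bin_width) * bin_width] += 1
    return counts


sample = [
    [280.0, 281.0, 283.0],
    [279.0, 280.0, 290.0],
]
print(value_histogram(sample, 5.0))
print(difference_histogram(sample, 2.0))
```

Sparse bins at the edges of either histogram are natural candidates for filter cutoffs.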

Approach
• To test the detection characteristics of a scheme, an erroneous case and a control case of the same data are needed
• Errors may produce different kinds and intensities of faults, so it is important to decide which sort of errors we want to detect
• In the case of OTIS, intensely faulty data (set-to-zero errors, memory gibberish) is easily detected, as it seldom falls inside the prescribed filters
• Our experiments include moderately faulty data: offsets in input values of up to 30%
• These faults tend to blend in with non-faulty data, making them especially hard to detect
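One way to build an erroneous case from a control case is sketched below. The slides only state that offsets of up to 30% were applied to input values; the uniform, symmetric, relative offset, the per-value fault rate, and the name `inject_offsets` are assumptions made for this example.

```python
import random


def inject_offsets(values, fault_rate, max_offset=0.30, seed=0):
    """Return (faulty_values, fault_mask); the control list is not modified.

    Each value is corrupted with probability fault_rate by a relative
    offset drawn uniformly from [-max_offset, +max_offset].
    """
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    faulty, mask = [], []
    for v in values:
        if rng.random() < fault_rate:
            offset = rng.uniform(-max_offset, max_offset)
            faulty.append(v * (1.0 + offset))
            mask.append(True)
        else:
            faulty.append(v)
            mask.append(False)
    return faulty, mask


control = [280.0] * 1000
faulty, mask = inject_offsets(control, fault_rate=0.1)
print(sum(mask), "values corrupted out of", len(control))
```

Keeping the fault mask alongside the faulty data makes it possible to count detections and false alarms later.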

Approach
• Filters can be adjusted in steps of increasing complexity
• A single filter has a high and a low cutoff
  • The “left” and “right” bounds of the data are usually exclusive, so their detections act cumulatively
  • In each filter, a tradeoff must be resolved between the desired fault detection rate and the number of incurred false alarms
• Multiple filters are independently calibrated
  • Multiple filters will not necessarily detect different faults
  • Many filters working at a low expected detection rate may detect as many or more faults for a system than a single filter working at a high expected detection rate
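The tradeoff within a single two-sided filter can be measured as sketched here. The definitions are assumptions for the example: a detection is a flagged faulty value, and a false alarm is a flagged fault-free value. The sample values and cutoffs are made up.

```python
def score_filter(values, fault_mask, low, high):
    """Return (detection_rate, false_alarm_rate) for cutoffs low and high."""
    detections = 0
    false_alarms = 0
    for value, is_fault in zip(values, fault_mask):
        flagged = not (low < value < high)
        if flagged and is_fault:
            detections += 1
        elif flagged:
            false_alarms += 1
    n_faults = sum(fault_mask)
    n_clean = len(fault_mask) - n_faults
    detection_rate = detections / n_faults if n_faults else 0.0
    false_alarm_rate = false_alarms / n_clean if n_clean else 0.0
    return detection_rate, false_alarm_rate


values = [250.0, 300.0, 180.0, 360.0, 290.0]
fault_mask = [False, False, True, True, True]
loose = score_filter(values, fault_mask, 200.0, 350.0)
tight = score_filter(values, fault_mask, 295.0, 350.0)
print(loose, tight)
```

Raising the low cutoff catches the third fault but also flags a fault-free value, which is the tradeoff described above.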

Detection Plots (single side)
• Fault detections and false alarms on a left-sided filter

Detection Plots (both sides)
• By overlaying the left and right filter plots, general detection traits can be observed

Fault Detections, Numerically
• This table is used to find the possible configurations that satisfy a minimum fault detection rate
[Table: Bounds Filter Fault Detections. Columns = left filter, rows = right filter]

False Alarms, Numerically
• Of the possible combinations chosen from the previous table, we can choose the one with the minimum number of false alarms
[Table: Bounds Filter False Alarms. Columns = left filter, rows = right filter]
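The two-table selection can be sketched as a constrained minimum. The tables here are dictionaries keyed by hypothetical (left, right) cutoff pairs, and every number is made up for illustration; the original table values are not available in this transcript.

```python
def choose_cutoffs(detection_table, false_alarm_table, min_detection_rate):
    """Pick the (left, right) pair with the fewest false alarms among
    those that meet the minimum fault detection rate.

    Returns None when no configuration is good enough.
    """
    feasible = [
        pair for pair, rate in detection_table.items()
        if rate >= min_detection_rate
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda pair: false_alarm_table[pair])


# Made-up rates for four candidate configurations.
detection_table = {
    (210, 340): 0.95, (210, 350): 0.90,
    (200, 340): 0.88, (200, 350): 0.80,
}
false_alarm_table = {
    (210, 340): 0.12, (210, 350): 0.05,
    (200, 340): 0.07, (200, 350): 0.02,
}
best = choose_cutoffs(detection_table, false_alarm_table, 0.85)
print(best)
```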

Detection Plots (both sides, spatial locality filter)
• By overlaying the left and right filter plots, general detection traits can be observed

Multiple Filters
• By combining multiple filters, fault detection is increased
• To be effective, filters should have distinct fault detection domains
[Figure: detection domains of the bounds filter and the spatial locality filter]
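A minimal sketch of combining two filters, assuming each filter has already produced one boolean flag per pixel. A pixel is treated as suspect if either filter flags it, and the overlap count gives a rough check on how distinct the two detection domains are. The flag lists are made-up examples.

```python
def combine_flags(flags_a, flags_b):
    """Return (combined_flags, overlap_count) for two per-pixel flag lists."""
    combined = [a or b for a, b in zip(flags_a, flags_b)]
    overlap = sum(1 for a, b in zip(flags_a, flags_b) if a and b)
    return combined, overlap


bounds = [True, False, False, True, False]
spatial = [False, False, True, True, False]
combined, overlap = combine_flags(bounds, spatial)
print(combined, overlap)
```

A large overlap relative to the total number of flags suggests the second filter adds little beyond the first.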

Relation Between Datasets
• “Blob” is an average dataset, but we also need to analyze the behavior of the other datasets
• “Stripe”: any filter setting achieves the same false alarm and fault detection rates as on “Blob”, within a few percent
• “Spots”: this does not hold for the bounds filter
  • Its average temperature is 10 K lower than the others, pushing it closer to the “faulty” region of the bounds filter
  • We can relax the filter and accept the loss in efficiency, or predict when the “Spots” climate should be expected and use modified filters
• This is the drawback of using absolute, rather than differential, data as criteria for the filters

Extensions to Other Applications
• OTIS was a likely candidate for ALFTD because of the regularity of its data
• Natural phenomena tend to have regular and predictable behavior. Other applications dealing with temperature, imaging (NGST), or even geological surveys could have success with these two basic filters
• These filter settings are only useful in environments similar to our sample datasets, but the method of calibrating filters is general enough to apply to other datasets and similar applications

Extensions to Novel Datasets
• Once a working set of filters is devised, it should be applicable to any dataset that has the same characteristics
• Precalculated filter calibrations could be created to allow for higher fault detection in very specific, localized datasets
• General purpose filters can also be extracted by running through many datasets, but they incur performance penalties

Dynamic Filter Calibration
• Approximate settings are possible, but these may perform poorly when encountering new data cases
• The application may need to reconfigure its filters for the new data
• This process could be automated, assuming the calibrating computer can obtain at least one control (fault-free) dataset
• Without prior exposure to these novel datasets, automated dynamic reconfiguration should be implemented as a numerically based decision process
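One possible numerically based rule, sketched under assumptions: given a single fault-free control dataset, choose bounds so that no more than a target fraction of the control values would be flagged. The percentile-style rule, the inclusive acceptance range, and the name `calibrate_bounds` are illustrative choices, not the method used by the authors.

```python
def calibrate_bounds(control_values, max_false_alarm_rate):
    """Return (low, high) such that at most max_false_alarm_rate of the
    control values fall outside the inclusive range [low, high]."""
    if not control_values:
        raise ValueError("need at least one control value")
    if not 0.0 <= max_false_alarm_rate < 1.0:
        raise ValueError("max_false_alarm_rate must be in [0, 1)")
    ordered = sorted(control_values)
    n = len(ordered)
    # Number of control values that may be flagged on each side.
    per_side = int(n * max_false_alarm_rate / 2)
    return ordered[per_side], ordered[n - 1 - per_side]


control = list(range(100))  # stand-in for fault-free temperature values
low, high = calibrate_bounds(control, 0.1)
flagged = sum(1 for v in control if not (low <= v <= high))
print(low, high, flagged)
```

A rule like this only bounds the false alarm rate; the resulting detection rate still has to be checked against an erroneous case of the same data.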

Conclusion
• Filters are a critical part of ALFTD
• The efficiency of the ALFTD method is contingent on having a successful method of fault detection
• Careful calibration of filters can greatly improve the fault detection capability of ALFTD
• Options for novel datasets
  • General purpose filter calibrations
  • Precalculated filter calibrations
  • Dynamic calibration

Thank you!
• For further information, please contact:
  • Israel Koren (koren@ecs.umass.edu)
  • C.M. Krishna (krishna@ecs.umass.edu)
  • Eric Ciocca (eciocca@cyberlore.com)
