Design of High Availability Systems and Networks, Lecture 3: Error Detection Techniques

Presentation Transcript


  1. Design of High Availability Systems and Networks, Lecture 3: Error Detection Techniques. Prof. Ravi K. Iyer, Center for Reliable and High-Performance Computing, Department of Electrical and Computer Engineering and Coordinated Science Laboratory, University of Illinois at Urbana-Champaign. iyer@crhc.uiuc.edu http://www.crhc.uiuc.edu/DEPEND

  2. Outline • Watchdog timer • Heartbeats • Data audits • Runtime generated assertions • Control flow checking • Application – DHCP server • Consistency and capability checking (exception handling) • Exception handling in Linux

  3. Watchdog Timer (hardware/software)
• An inexpensive method of error detection.
• The process being watched must reset the timer before the timer expires; otherwise the watched process is assumed to be faulty.
• Watchdog timers detect only errors that manifest as control-flow errors such that the system stops resetting the timer.
• Only processes with relatively deterministic runtimes can be checked, since the error detection is based entirely on the time between timer resets.
• A watchdog timer provides only an indication of possible process failure; a partially failed process may still be able to reset the timer.
• Coverage is limited, as neither the data nor the results are checked.
• When used to reset the system, a watchdog timer can improve availability (the mean time to recovery is shortened) but not reliability (failures are just as likely to occur).
• When the availability of a system is more important than the loss of data under some conditions, using a watchdog timer to reset the system on error detection is an appropriate choice. A minimal software sketch follows.
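A minimal software sketch of the mechanism, assuming POSIX threads (the helper names kick_watchdog and watchdog_thread are illustrative, not from the original):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>

    #define WATCHDOG_PERIOD_SEC 2          /* timeout chosen for illustration */

    static atomic_int kicked = 1;          /* set by watched code, cleared by the monitor */

    /* The watched process calls this on every completed iteration of its main loop. */
    void kick_watchdog(void) { atomic_store(&kicked, 1); }

    /* Monitor thread: if the flag was not set during the last period, assume a
     * control-flow error and trigger recovery (here: just report the failure). */
    void *watchdog_thread(void *arg) {
        (void)arg;
        for (;;) {
            sleep(WATCHDOG_PERIOD_SEC);
            if (!atomic_exchange(&kicked, 0))
                fprintf(stderr, "watchdog: deadline missed, initiating recovery\n");
        }
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, watchdog_thread, NULL);
        for (int i = 0; i < 3; i++) { sleep(1); kick_watchdog(); }  /* healthy phase */
        sleep(5);                       /* simulated hang: the watchdog fires */
        return 0;
    }

Note how the sketch exhibits the limitation mentioned above: a partially failed process that still reaches kick_watchdog() goes undetected.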

  4. Example Applications of Watchdog Timers
• Pluribus Reliable Multiprocessor
• Hardware and software timers (ranging from 5 seconds to 2 minutes in duration) monitor almost every subsystem.
• Example:
• Consider the failure of the mutual exclusion locks that are on each subsystem. A lock failure can leave a resource locked when no subsystem is using it.
• Since the lock failed, the resource will never become free.
• A 1/15-second timer interrupts the processor, which arbitrarily unlocks the resource.
• Aside from the temporary (1/15-second) degradation in system performance, the system is unaffected by the error.

  5. Example Applications of Watchdog Timers
• VAX-11/780
• A multiprocessor system for commercial applications.
• The console processor runs a watchdog process that is reset when an interrupt line is strobed.
• If the interrupt line is not strobed by a processor within 200 microseconds, this indicates a failure, and the console processor attempts to determine the reason for the failure.
• Bell System Telephone Switch
• External watchdog timers monitor correct program operation by triggering recovery when timers are not periodically reset.
• This allows early detection (before the error propagates) of problems caused by software errors and, consequently, easier recovery.

  6. Heartbeats
• A common approach to detecting process and node failures in a distributed (networked) computing environment.
• Periodically, a monitoring entity sends a message (a heartbeat) to a monitored node or process and waits for a reply.
• If the monitored node does not respond within a predefined timeout interval, it is declared failed and appropriate recovery action is initiated.
• Issues:
• The timeout period is pre-negotiated by the two parties or sometimes even hard-coded by the programmer.
• A predefined timeout value cannot adapt to changes in network traffic or to load variability on individual nodes.
• The monitored node is assumed to be healthy if it is able to respond to a heartbeat message.
• The process/thread responding to the heartbeat message may operate correctly while other processes/threads are deadlocked or operating incorrectly. A basic monitor is sketched below.
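A minimal fixed-timeout heartbeat monitor, sketched with UDP sockets (the "HB" message format and the heartbeat_ok helper are illustrative assumptions):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <unistd.h>

    /* Returns 1 if the monitored node answered within timeout_sec, 0 otherwise. */
    int heartbeat_ok(const char *ip, int port, int timeout_sec) {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };
        setsockopt(s, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);

        struct sockaddr_in peer = { .sin_family = AF_INET, .sin_port = htons(port) };
        inet_pton(AF_INET, ip, &peer.sin_addr);

        char buf[16];
        sendto(s, "HB", 2, 0, (struct sockaddr *)&peer, sizeof peer);
        ssize_t n = recv(s, buf, sizeof buf, 0);  /* blocks for at most timeout_sec */
        close(s);
        return n > 0;                             /* timeout or error: declare failure */
    }

The sketch illustrates both issues from the list above: the timeout is a fixed constant, and any reply counts as "healthy" regardless of the internal state of the replying process.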

  7. Adaptive & Smart Heartbeat
• [Figure: a heartbeat monitor sends an HB message to a heartbeat replier and receives an HB ack; the round-trip time (RTT), heartbeat period, and timeout expiration are marked.]
• Adaptive heartbeat - the timeout value used by the monitor process is not fixed but is periodically negotiated between the two parties to adapt to changes in the network traffic or node load.
• Smart heartbeat - the entity being monitored executes a set of predefined checks to verify the robustness of the entire process and only then responds to the monitoring process.
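The slide does not prescribe an adaptation algorithm; one plausible sketch borrows the retransmission-timeout estimator from TCP, smoothing the measured heartbeat RTT and its variance:

    /* Adaptive timeout sketch: derive the timeout from observed round-trip
     * times instead of hard-coding it (gains 0.125/0.25 follow TCP practice). */
    struct hb_timeout {
        double srtt;    /* smoothed RTT estimate, seeded with the first sample */
        double rttvar;  /* smoothed mean deviation of the RTT */
    };

    /* Update the estimator with a new RTT measurement and return the timeout. */
    double hb_update(struct hb_timeout *t, double rtt) {
        const double alpha = 0.125, beta = 0.25;
        double err = rtt - t->srtt;
        t->srtt   += alpha * err;
        t->rttvar += beta * ((err < 0 ? -err : err) - t->rttvar);
        return t->srtt + 4.0 * t->rttvar;     /* timeout = SRTT + 4 * RTTVAR */
    }

A monitor would call hb_update() after every acknowledged heartbeat and use the returned value as the next timeout, so sustained congestion or node load stretches the timeout instead of triggering false failure declarations.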

  8. Data Audits
• Widely used in the telecommunications industry.
• A broad range of custom and ad hoc application-level techniques for detecting and recovering from errors in a switching environment (in particular, in a database).
• Data-specific techniques deeply embedded in the application can provide significant improvement in availability.
• Static and Dynamic Data Check
• Corruption in the static data region is detected by computing a golden checksum of all static data at startup (e.g., a 32-bit Cyclic Redundancy Code) and comparing it with a periodically recomputed checksum; see the sketch below.
• For dynamic data, the range of allowable values for database fields is stored in the database system catalog. This information is used to perform a range check on the dynamic fields in the database.
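A minimal sketch of the static-data check, assuming a bitwise CRC-32 as the checksum (the audit_* helper names are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* Bitwise CRC-32 (IEEE polynomial, reflected form). */
    uint32_t crc32(const void *data, size_t len) {
        const uint8_t *p = data;
        uint32_t crc = 0xFFFFFFFFu;
        while (len--) {
            crc ^= *p++;
            for (int i = 0; i < 8; i++)
                crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
        }
        return ~crc;
    }

    static uint32_t golden;   /* golden checksum captured once at startup */

    void audit_init(const void *region, size_t len) { golden = crc32(region, len); }

    /* Periodic audit: returns 0 if the static region no longer matches. */
    int audit_static(const void *region, size_t len) {
        return crc32(region, len) == golden;
    }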

  9. Data Audits
• Structural Check
• The structure of the database in the controller system is established by header fields that precede the data portion in every record of each table.
• The audit calculates the offset of each record header from the beginning of the database, based on record sizes stored in system tables (all record sizes are fixed and known).
• The database structure (in particular, the alignment of each record and table within the database) is checked by comparing all header fields at the computed offsets with expected values.
• Semantic Referential Integrity Check
• Traces logical relationships among records in different tables to verify the consistency of the logical loops formed by the records.
• Detects resource leaks.
• Corruption of key attributes in a database leads to lost records, i.e., records participating in semantic relationships disappear without being properly updated.

  10. Data Audits – Semantic Referential Integrity Check Example
• Consider the data structure established when servicing a voice connection:
• A thread must be spawned to manage the connection.
• A new record needs to be written into each of the following three tables: Process Table (Process ID, Name, Connection ID, Status, …), Connection Table (Connection ID, Channel ID, Caller ID, …), Resource Table (Channel ID, Process ID, Status, …).
• The audit can follow the dependency loop Connection ID → Channel ID → Process ID for each active record in each of the three tables and detect violations of semantic constraints, as in the sketch below.
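A sketch of that loop audit over the three tables (the record layouts are reduced to the key fields named on the slide; everything else is an illustrative assumption):

    /* Reduced record layouts mirroring the slide's three tables. */
    struct process    { int process_id;    int connection_id; int active; };
    struct connection { int connection_id; int channel_id;    int active; };
    struct resource   { int channel_id;    int process_id;    int active; };

    /* Audit one active process record by walking the dependency loop
     * Process -> Connection -> Resource -> Process.  Any broken link is a
     * semantic-integrity violation (e.g., a lost record or resource leak). */
    int audit_loop(const struct process *p,
                   const struct connection *conns, int nc,
                   const struct resource *res, int nr) {
        const struct connection *c = 0;
        for (int i = 0; i < nc; i++)
            if (conns[i].active && conns[i].connection_id == p->connection_id)
                c = &conns[i];
        if (!c) return 0;                      /* dangling Connection ID */
        for (int i = 0; i < nr; i++)
            if (res[i].active && res[i].channel_id == c->channel_id)
                return res[i].process_id == p->process_id;   /* loop must close */
        return 0;                              /* dangling Channel ID */
    }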

  11. Prioritized Audit Triggering
• Based on the assumption that database objects have different importance and that their importance varies during system operation.
• Criteria to determine the importance of database objects (combined in the sketch below):
• The access frequency of database tables - tables updated more frequently are more liable to be corrupted by software misbehavior and are more likely to cause error propagation to processes that use the data.
• The nature of the database object - takes into account the criticality of database objects, e.g., the database system catalog, which is referenced on every database access.
• The number of errors detected in each table - based on the assumption of temporal locality of data errors, i.e., an area where more errors occurred in the recent past is likely to contain more errors in the near future.
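One way to combine the three criteria is a weighted score; the weights and field names below are invented for the sketch, not taken from the original scheme:

    /* Per-table statistics feeding the audit scheduler. */
    struct table_stats {
        double update_rate;    /* updates per second (access frequency)  */
        int    is_catalog;     /* 1 for critical objects, e.g., catalog  */
        int    recent_errors;  /* errors detected in the recent past     */
    };

    /* Higher score means the table is audited sooner / more often. */
    double audit_priority(const struct table_stats *t) {
        return 1.0 * t->update_rate     /* frequently updated => more exposed */
             + 5.0 * t->is_catalog      /* criticality of the object          */
             + 2.0 * t->recent_errors;  /* temporal locality of data errors   */
    }

An audit scheduler would then visit tables in decreasing priority order, re-scoring them as update rates and error counts change during operation.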

  12. Runtime Generated Assertions
• Goals
• Generate runtime assertions by monitoring the values of selected variables in a program.
• Use the monitored data to abstract out, via statistical pattern recognition techniques, the key relationships between the variables, separately and jointly, and to establish their probabilistic behavior.
• Approach
• Identify clusters of values traversed by different variables.
• Use this information to automatically generate runtime assertions capable of capturing abnormal behavior of an application due to hardware or software errors (see the sketch below).
• Cross-check with other entities in the system their views of the state of selected variables:
• if a variable is globally accessible, multiple entities (e.g., multiple execution threads) may have their own opinions about the correct value of the variable
• this can improve coverage and reduce false alarms
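A deliberately simplified sketch: the statistical clustering is reduced here to tracking the min/max envelope of observed values, which then becomes a generated range assertion (the slack parameter widens the range to trade coverage against false alarms):

    /* Learning phase: track the observed value cluster of one variable. */
    struct value_range { double lo, hi; int trained; };

    void observe(struct value_range *r, double v) {
        if (!r->trained)    { r->lo = r->hi = v; r->trained = 1; }
        else if (v < r->lo) r->lo = v;
        else if (v > r->hi) r->hi = v;
    }

    /* Detection phase: the generated assertion flags values that fall
     * outside the learned cluster, widened by a slack margin. */
    int assert_in_range(const struct value_range *r, double v, double slack) {
        return v >= r->lo - slack && v <= r->hi + slack;
    }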

  13. Control-flow Monitoring Using Signatures: Hardware Approaches
• [Figure: a watchdog processor observes the address and data buses between the main processor and memory.]
• Employ a watchdog (a simple co-processor) to monitor the behavior of a main processor.
• Suitable for a single embedded application with little or no caching.
• Limited applicability in off-the-shelf systems, as these techniques require additional specialized resources, e.g., a watchdog processor and a pre-compiler.
• Embedded Signature Monitoring
• Pre-computed signatures are embedded in the application program.
• Autonomous Signature Monitoring
• The watchdog processor stores pre-computed signatures in its memory and mimics the control flow of the application.

  14. Control Flow Error Detection: Example Hardware Approaches. [Figure only.]

  15. Control-flow Monitoring Using Signatures: Software Approaches
• Software techniques partition the application into blocks, either in the assembly language or in the high-level language.
• Appropriate instrumentation is inserted at the beginning and/or the end of the blocks.
• The checking code is inserted in the instruction stream, eliminating the need for a hardware watchdog processor.
• Two classes of approaches:
• non-preemptive signature checking
• preemptive signature checking

  16. Control Flow Error Detection: Example Software Approaches. [Figure only.]

  17. Problems with Control Flow Signatures
• [Figure: three execution scenarios over code blocks. Each block is a branch-free interval (BFI) at the assembly level terminated by a control flow instruction (CFI); an assertion block (AB) is inserted ahead of each CFI. Correct execution follows the error-free path from Block i through Block i+1 and Block i+2. Without preemptive checking, an erroneous change in control flow jumps into an arbitrary block, the AB is not reached, and the incorrect execution continues until a crash. With preemptive checking, the check detects the erroneous control flow before the jump and the computation stops.]

  18. Preemptive Control Signatures (PECOS)
• PECOS determines the runtime target address and compares it against the valid addresses before the jump to the target address is made.
• High-level control structure of the Assertion Block:
• Determine the runtime target address, Xout.
• Extract the list of valid target addresses, {X1, X2}.
• Calculate ID := Xout * (1/P), where P = !((Xout − X1) * (Xout − X2)).
• The calculation of ID raises a DIV-BY-ZERO exception in case of error: if Xout equals a valid target, one factor is zero, so P = !0 = 1 and ID = Xout; if Xout is invalid, the product is nonzero, P = 0, and the division traps before the erroneous jump executes.
• The Assertion Block does not introduce any new control flow instructions.
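The formula transcribes almost directly into C (a sketch only: integer division by zero is undefined behavior in C, which is why PECOS emits this check at the assembly level, where the divide instruction reliably raises the exception):

    #include <stdint.h>

    /* PECOS-style assertion block for a branch with two valid targets.
     * Valid xout: one factor is zero, so p = 1 and the value passes through.
     * Invalid xout: p = 0 and the division raises a divide-by-zero trap
     * (SIGFPE on x86) before the corrupted jump can be taken. */
    uintptr_t pecos_check(uintptr_t xout, uintptr_t x1, uintptr_t x2) {
        uintptr_t p = !((xout - x1) * (xout - x2));
        return xout / p;            /* traps here on a control-flow error */
    }

Note that the check itself is straight-line code, matching the slide's point that the Assertion Block adds no new control flow instructions.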

  19. What Can We Cover with Preemptive Software Control Signatures?
• [Figure: CPU, address and data buses, and memory.]
• Errors in the cache: not covered.
• Errors on the bus: covered.
• Errors in the memory: covered.
• Future: insert a programmable error detection core into the CPU.

  20. Application: PECOS Protection of DHCP Server • DHCP Application • A critical service for mobile network environments • Manages a common pool of IP addresses which are allocated to clients • Clients depend on a DHCP server’s high availability

  21. DHCP Evaluation: Fault Injection Results
• Example: improvement in system downtime because of the reduction in system-level detection.
• Baseline recovery time: 0.54 * y (y: time for process recovery).
• PECOS recovery time: 0.07 * y + 0.47 * y/100 ≈ 0.075 * y.
• Improvement: 0.54 / 0.075 ≈ 7.2.

  22. DHCP Performance Overhead
• DHCP client-server protocol phases:
• Phase 1: DHCPDISCOVER (client → server), DHCPOFFER (server → client).
• Phase 2: DHCPREQUEST (client → server), DHCPACK (server → client).
• Measured overhead:

                                Phase 2                 Phase 1
                                Server    Client        Server    Client
    Selective Instrumentation   13.8%     5.2%          18.0%     10.9%
    Complete Instrumentation    29.9%     25.0%         25.1%     15.3%

  23. Consistency and Capability Checking (Exception Handling)
• Capability Checking
• Can be implemented as a hardware mechanism or as part of the operating system (usually the case).
• Access to objects (memory segments, I/O devices) is limited to users (processors or processes) with the proper authorization.
• Examples: virtual address management (the MMU usually performs a capability check); permission vs. activity checks - if these are not valid, there is an error trap; password checking.
• Consistency Checks (sketched below)
• Range check - confirms that a computed value is in a valid range; e.g., a computed probability must be in the range 0 to 1.
• Address check - verifies that the address to be accessed exists.
• Opcode check - checks whether the instruction to be executed has one of the defined (documented) opcodes.
• Arithmetic overflow and underflow.
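Minimal sketches of three consistency checks (helper names and region bounds are illustrative):

    #include <limits.h>
    #include <stdint.h>

    /* Range check: a computed probability must lie in [0, 1]. */
    int check_probability(double p) { return p >= 0.0 && p <= 1.0; }

    /* Address check: verify an address falls inside a known-valid region
     * before it is used (bounds would come from the memory map). */
    int check_address(const void *p, const void *base, const void *limit) {
        uintptr_t a = (uintptr_t)p;
        return a >= (uintptr_t)base && a < (uintptr_t)limit;
    }

    /* Overflow check: detect signed addition overflow before performing it. */
    int add_would_overflow(int a, int b) {
        return (b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b);
    }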

  24. Example: Exception Handling in Linux

  25. Interrupts & Exceptions
• Interrupts
• Maskable interrupts: all IRQs issued by I/O devices give rise to maskable interrupts.
• Nonmaskable interrupts: sent to the NMI (Non-Maskable Interrupt) pin.
• Processor-detected exceptions
• Faults
• Unintentional. The instruction can be resumed when the exception handler terminates. Examples: page faults (recoverable), protection faults (unrecoverable).
• Traps
• Intentional. Triggered only when there is no need to re-execute the instruction that terminated; mainly used for debugging. Example: breakpoint traps.
• Aborts
• Unintentional and unrecoverable. A serious error occurred; the control unit is in trouble.
• Programmed exceptions (software interrupts)
• Occur at the request of the programmer. Triggered by the into (check for overflow), int3, and bound (check on address bound) instructions. System calls are one of the common uses.
• The init code loads the base address of the IDT (Interrupt Descriptor Table) into the idtr register of the CPU and initializes all the entries of the IDT.
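A small user-space illustration of a programmed exception (x86-specific, and a sketch only): executing int3 raises the breakpoint trap (vector 3), which Linux delivers to the process as SIGTRAP; because traps resume at the next instruction, the program continues after the handler returns.

    #include <signal.h>
    #include <stdio.h>

    static void on_trap(int sig) {
        (void)sig;
        printf("caught SIGTRAP (breakpoint trap, vector 3)\n");
    }

    int main(void) {
        signal(SIGTRAP, on_trap);
        __asm__ volatile ("int3");   /* programmed exception */
        printf("resumed after the trap\n");
        return 0;
    }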

  26. Sample Exceptions and Their Handlers (Daniel P. Bovet, Marco Cesati, Understanding the Linux Kernel, O'Reilly, 2001)

    Vector  Exception                      Exception Handler               Type   Signal
    0       "Divide error"                 divide_error()                  fault  SIGFPE
    1       "Debug"                        debug()                                SIGTRAP
    2       NMI                            nmi()
    3       "Breakpoint"                   int3()                          trap   SIGTRAP
    4       "Overflow"                     overflow()                      trap   SIGSEGV
    5       "Bounds check"                 bounds()                        fault  SIGSEGV
    6       "Invalid opcode"               invalid_op()                    fault  SIGILL
    7       "Device not available"         device_not_available()          fault  SIGSEGV
    8       "Double fault"                 double_fault()                  abort  SIGSEGV
    9       "Coprocessor segment overrun"  coprocessor_segment_overrun()   abort  SIGFPE
    10      "Invalid TSS"                  invalid_tss()                   fault  SIGSEGV
    11      "Segment not present"          segment_not_present()           fault  SIGBUS
    12      "Stack exception"              stack_segment()                 fault  SIGBUS
    13      "General protection"           general_protection()            fault  SIGSEGV
    14      "Page fault"                   page_fault()                    fault  SIGSEGV
    15      Intel reserved
    16      "Floating point error"         coprocessor_error()             fault  SIGFPE
    17      "Alignment check"              alignment_check()               fault  SIGSEGV

Interrupt/exception vector assignments:
• 0 to 31 correspond to exceptions and nonmaskable interrupts.
• 32 (0x20) to 47 (0x2f) are maskable interrupts, caused by IRQs (say, the keyboard).
• 48 to 255 may be used to identify software interrupts; 128 (0x80) is used to implement system calls.

  27. Hardware Handling of Exceptions
• [Flowchart, summarized:]
• After finishing the last instruction, the control unit checks whether an interrupt or exception has occurred; if not, it executes the next instruction.
• Otherwise, it determines the vector i in [0, 255] and reads the i-th entry of the IDT, referred to by the idtr register.
• It gets the base address of the GDT (global descriptor table) from gdtr, and combines the GDT with the i-th IDT entry to obtain the segment address of the exception handler and the TSSD (task state segment descriptor).
• It saves the contents of eflags, cs, eip, ss, and esp on the stack, then enters the software exception handler (e.g., for an illegal opcode or a general protection fault; page-fault cases include copy-on-write faults, NULL pointer page faults, and bad paging faults).
• iret returns from the exception, restoring the saved eflags, cs, eip, ss, and esp from the stack.
• Page faults originate in the MMU: CR3 (Control Register 3) stores the base address of the Directory Page Address (DPA), which points to the corresponding page entries. If the translation is in the TLB, there is no fault; otherwise the page table is located (in the cache, or by searching memory) and used for the access. If the page table or the page itself is not in memory, a page fault is raised.

  28. Software Handling of Exceptions
• [Flowchart, summarized:]
• Exception raised in User Mode: the kernel sends a SIGSEGV signal to the current process and terminates the user process by invoking do_exit().
• Bad system call parameters passed from User Mode: the exception handler looks up the fault address and sends an error code back to the process.
• Kernel bug in Kernel Mode (kernel oops): die() prints the CPU registers and the kernel stack on the console, saves the data in the system message buffer, and terminates the process by invoking do_exit().
• do_exit() removes references (memory, open files, semaphores, etc.) to the terminating process from kernel data structures, sets the process state to TASK_ZOMBIE, updates parenthood relationships, and invokes schedule().
• Crash: alternatively, the kernel may enter an infinite loop or eat all system resources.

  29. Linux Kernel Behavior Under Errors

  30. Objective
• Determine Linux kernel sensitivity to errors:
• single points of failure (dependability bottlenecks)
• causes of kernel crashes
• error propagation patterns
• placement of detection and recovery mechanisms
• Establish an error-injection-based benchmarking procedure for analyzing and comparing different platforms, e.g., Linux running on different processors (Intel, PowerPC).
• Facilitate analysis of cost–reliability–performance tradeoffs in selecting a computing platform.

  31. Error Model and Injection Targets
• Error Model
• Injecting bit errors into the kernel instruction stream.
• Four Subsystems Targeted
• architecture-dependent code (arch)
• file subsystem (fs)
• kernel subsystem (kernel)
• memory management (mm)
• Workload
• UnixBench benchmark suite.
• Profiling
• The 32 most frequently used kernel functions, responsible for 95% of kernel usage. (Based on Evolution in Open Source Software, 2000, Michael W. Godfrey @ UWaterloo)

  32. Example Error Locations • [Figure: steps involved in performing a read() operation, with example error locations marked; workload: UnixBench.] (Adapted from A. Rubini, http://www.linux.it/kerneldocs/ksys/ksys.html)

  33. Experimental Setup • [Figure: automated error-injection process using NFTAPE. A control host coordinates the workload; the target system (Linux kernel 2.4.19) runs a driver-based injector and a crash handler; a hardware monitor observes the target.]

  34. Error Injection Campaigns and Outcome Categories
• Error injection campaign targets:
• any random (non-branch) instruction (Campaign ①)
• all conditional branch instructions within selected functions:
• a random bit of the instruction (Campaign ②)
• the bit that reverses the condition of the branch instruction (Campaign ③)
• A user-space analogue of the injection step is sketched below.
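For illustration only, a user-space analogue of flipping a random bit in a function's instruction bytes (the real campaigns used NFTAPE's driver-based injector in kernel mode; on hardened systems, W^X protections may block this sketch):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Flip one random bit within the first len bytes of func's code. */
    void flip_random_bit(void *func, size_t len) {
        long pagesz = sysconf(_SC_PAGESIZE);
        uintptr_t page = (uintptr_t)func & ~(uintptr_t)(pagesz - 1);
        /* Code pages are normally read+execute; make them writable too. */
        if (mprotect((void *)page, (size_t)pagesz * 2,
                     PROT_READ | PROT_WRITE | PROT_EXEC) != 0) {
            perror("mprotect");
            return;
        }
        size_t bit = (size_t)rand() % (len * 8);
        ((uint8_t *)func)[bit / 8] ^= (uint8_t)(1u << (bit % 8));
    }

Calling the corrupted function afterwards then falls into one of the outcome categories of the study: activated with no effect, a fail silence violation, or a crash.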

  35. Failure Distribution
• For the random branch error, nearly half (47.5%) of the activated errors have no effect.
• More interestingly, 33% of errors that alter the control path also have no effect - inherent redundancy in the code.
• mm is the most sensitive subsystem, followed by kernel and fs, while arch is the least sensitive subsystem.
• A few functions cause the vast majority of observed crashes:
• do_page_fault (page fault handler, from arch)
• schedule (process scheduler, from kernel)
• [From the slide table, per campaign ①/②/③: errors activated 13,370 (46.1%), 2,779 (63.8%), 1,228 (56.1%); functions 51, 81, 176.]

  36. Crash Severity
• A significant percentage (33%) of errors that alter the control path (Campaign ③) have no effect.
• 90% of the most severe crashes are due to reversing the condition of a branch instruction.
• The most severe crashes require a complete reformatting of the file system on the disk (9 cases):
• it can take nearly an hour to recover the system
• profound impact on availability
• To achieve 5 NINES of availability (5 minutes/year downtime), one can afford such a failure only once in 12 years.

  37. A Most Severe Crash Example

  38. Crash Causes
• 95% of known kernel crashes are due to four major causes:
• unable to handle kernel NULL pointer
• unable to handle kernel paging request
• invalid opcode
• general protection fault (exceeding a segment limit, writing to a read-only code or data segment, loading a selector with a system descriptor, reading an execute-only code segment)

  39. Example of a Crash
• Crash cause: invalid opcode (Intel IA-32).
• Corresponding C code:

    /* include/asm/page.h */
    #define BUG() \
        __asm__ __volatile__( "ud2\n" \
                              "\t.word %c0\n" \
                              "\t.long %c1\n" \
                              : : "i" (__LINE__), "i" (__FILE__))

    /* mm/page_alloc.c */
    if (PageActive(page))
        BUG();

The BUG() macro deliberately executes ud2, an undefined opcode on IA-32, so a detected kernel inconsistency raises the invalid-opcode exception (with the line and file encoded after the instruction).

  40. Crash Latency
• [Figure: crash-latency distribution for campaigns ①/②/③. CPU: Intel P4, 1.5 GHz; 1 cycle = 0.67 ns.]
• 40% of crashes occur within 10 CPU cycles.
• 20% of crashes have longer latency (>100,000 cycles).

  41. Error Propagation Between Kernel Subsystems
• [Figure: for each crash, the subsystem where the error was injected, the crash cause, and the subsystem where the kernel crashed.]
• 10% of the crashes are associated with fault propagation.
• Analysis of error propagation can guide strategic placement of assertions (or error detection) within the code.

  42. Fail Silence Violation (Cont.)
• Code is taken from arch/i386/kernel/time.c/gettimeofday().
• Assembly code:

    c010e93a:  81 fa 3f 42 0f 00   cmp    $0xf423f,%edx
    c010e940:  76 1d               jbe    c010e95f

• The original branch is taken. Flipping one bit turns 76 1d (jbe) into 77 1d (ja c010e95f), reversing the branch condition.
• C code:

    while (usec >= 1000000) {
        usec -= 1000000;
        sec++;
    }

• Fail silence violation: the timer runs much faster, because the reversed branch executes sec++ when it should not.
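A plain-C reconstruction of the effect (the real fault flipped a bit in the compiled branch; here the loop condition is reversed by hand, and unsigned arithmetic, as in the kernel, makes the wrapped usec terminate the loop):

    #include <stdio.h>

    int main(void) {
        unsigned long sec, usec;

        /* Correct normalization, as in gettimeofday(). */
        sec = 0; usec = 1500000;
        while (usec >= 1000000) { usec -= 1000000; sec++; }
        printf("correct : sec=%lu usec=%lu\n", sec, usec);   /* 1, 500000 */

        /* Reversed branch: the body now runs for a perfectly normal usec
         * value, bumping sec and wrapping usec around - the call returns
         * plausible-looking but wrong time, a fail silence violation. */
        sec = 0; usec = 500000;
        while (usec < 1000000) { usec -= 1000000; sec++; }
        printf("reversed: sec=%lu usec=%lu\n", sec, usec);
        return 0;
    }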

  43. Summary
• Experimental study of Linux kernel behavior in the presence of transient errors.
• Extensive error injection experiments targeting the most frequently used functions in the selected kernel subsystems.
• The analysis of the obtained data shows:
• 95% of known crashes are due to a few common exceptions, e.g., unable to handle kernel NULL pointer
• 10% of the crashes are associated with fault propagation
• errors in the kernel can result in crashes that require reformatting the file system (nearly an hour of recovery time)
• Established a benchmarking procedure for analyzing and comparing different platforms, e.g., Linux running on different processors (Intel, PowerPC).
