Automatic Software Self-Healing using Rescue Points
This paper presents ASSURE, an innovative automatic software self-healing system developed by researchers at Columbia University. Motivated by the challenges posed by buggy and crash-prone software, ASSURE focuses on enhancing software integrity and availability by employing rescue points for recovery. By mapping potential faults to managed recovery states, the system leverages existing application code to facilitate quick fault recovery without downtime. This approach significantly reduces the operational costs associated with software failures while improving system resilience against various types of faults.
Presentation Transcript
Automatic Software Self-Healing using Rescue Points • Angelos Keromytis, Jason Nieh, Sal Stolfo • Department of Computer Science • Columbia University
Motivation • Software remains buggy and crash-prone • Problem for high-availability systems, remote attacks, high-volume events, non-exploitable bugs • High cost of downtime • In the absence of perfect software, error toleration and recovery techniques become a necessary complement to existing techniques
Dealing with Failures • Programming Language Design (avoid failures) • Software Verification (prove failure-free) • Software Testing (expose failures) • Failure Detection (detect failures) • [Figure labels: Development, bugs, Deployment]
I detected a failure, now what? • User/Administrator • Restart application • File bug report • Developer • Locate bug • Create patch • Test patch • Deploy patch
Integrity vs. Availability • Terminate execution when fault is detected • Recurring faults (worms, etc.) • Applications that build a lot of state • Collateral damage • But is this not the only sane thing to do? • Life after death?
Our Work: ASSURE • New automatic software self-healing system • Augments software integrity with availability • Works on commercial-off-the-shelf software
Software Elasticity • Assumption: Behind every complex system lies a well-tested core • Programmers build error handling • They just can’t cover every corner case • Feature creep, complexity, etc.
Rescue Points • Recover using the program’s own code • Mapping between the set of faults that could occur and those explicitly handled by the program code • Profile programs during “bad” test runs • Build behavioral model • Discover candidate recovery (rescue) points • Induce faults at locations that are known (or suspected) to propagate faults correctly • Work on binaries (COTS)
High-level Example • Call chain: a() → b() → c() → d()
int d() {
    /* Slice-off functionality */
    return -1; /* Error */
}
int c() {
    int res;
    if ((res = d()) < 0)
        return -1; /* Error */
    else
        /* Do useful work */
        return 0; /* OK */
}
Why does this work? • Focus on server applications • Short error propagation distance • Errors in one request do not affect the computation of subsequent requests • Servers inherently support error handling (bad requests)
Self-Healing Process • Monitor: rescue point discovery, fault monitoring • Diagnose: fault reproduction, rescue point selection • Adapt: rescue point creation • Test: rescue point testing and deployment
ASSURE: Time Line • [Timeline figure. Labels: Production System, Fault Detected, Rescue-point Analysis (offline), Vulnerability Window, Dynamic Patch, Patched Production System. Axis: Time]
Rescue Point Discovery • Dynamic analysis via fuzzing • No access to source code • Dynamic binary instrumentation (Dyninst) • Examine behavior under “bad” input • Identify candidate rescue points • Log most frequent error return values • Happens off-line • Need only do it once
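The "log most frequent error return values" step can be pictured as a small tally kept per profiled function. The sketch below is only an illustration under assumptions: the names ret_profile, profile_record and profile_most_frequent are hypothetical and are not taken from ASSURE or Dyninst, and the instrumentation that would feed the tally is left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DISTINCT 16

/* Hypothetical per-function profile: distinct return values seen
 * while the function ran under "bad" input, with their counts. */
struct ret_profile {
    long values[MAX_DISTINCT];
    unsigned counts[MAX_DISTINCT];
    size_t n; /* number of distinct values stored */
};

/* Record one observed return value. Values beyond MAX_DISTINCT
 * distinct entries are dropped to keep the sketch bounded. */
void profile_record(struct ret_profile *p, long value)
{
    assert(p != NULL);
    for (size_t i = 0; i < p->n; i++) {
        if (p->values[i] == value) {
            p->counts[i]++;
            return;
        }
    }
    if (p->n < MAX_DISTINCT) {
        p->values[p->n] = value;
        p->counts[p->n] = 1;
        p->n++;
    }
}

/* Store the most frequent value in *out and return 1,
 * or return 0 if nothing has been recorded yet. */
int profile_most_frequent(const struct ret_profile *p, long *out)
{
    assert(p != NULL && out != NULL);
    if (p->n == 0)
        return 0;
    size_t best = 0;
    for (size_t i = 1; i < p->n; i++) {
        if (p->counts[i] > p->counts[best])
            best = i;
    }
    *out = p->values[best];
    return 1;
}
```

A function whose tally is dominated by a single value such as -1 under bad input is a plausible rescue-point candidate, and that value is what a forced error return would later use.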
Fault Detection • Fault detection viewed as a black box • Map detected faults to signals • Lightweight sensors on application • Simply give indication of failure • Watchdog process • ProPolice, StackGuard, etc.
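One way to read "map detected faults to signals" is that a sensor only has to raise a signal and a monitor only has to notice it. The following is a minimal sketch using the POSIX sigaction and raise calls. The helper names (install_fault_monitor, report_fault, take_pending_fault) are hypothetical, and the choice of signal is left to the caller; the slides do not say which signals ASSURE uses.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Last fault signal seen by the monitor; 0 means none pending. */
static volatile sig_atomic_t last_fault_signal = 0;

/* Handler does the minimum that is async-signal-safe: record the signal. */
static void fault_handler(int signo)
{
    last_fault_signal = signo;
}

/* Install the monitor for one signal. Returns 0 on success, -1 on error. */
int install_fault_monitor(int signo)
{
    struct sigaction sa;
    assert(signo > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = fault_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Hypothetical sensor hook: a detector reports a fault by raising
 * the signal in the current process. Returns 0 on success. */
int report_fault(int signo)
{
    return raise(signo);
}

/* Return the pending fault signal (0 if none) and clear it. */
int take_pending_fault(void)
{
    int s = last_fault_signal;
    last_fault_signal = 0;
    return s;
}
```

Because the monitor sees nothing but a signal number, the detector behind it can be swapped without touching the recovery logic, which is the sense in which detection is treated as a black box.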
Fault Reproduction • Network inputs • Good for deterministic failures • Cannot fully reproduce system state • Deterministic replay • Record all interactions between processes and their environment
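The record-and-replay idea can be sketched for a single input channel: in record mode every value obtained from the environment is appended to a log, and in replay mode the same values are returned from the log instead. Everything here is an assumption made for illustration (event_log, logged_input, start_replay, demo_source); a real deterministic-replay system records far more than one stream of integers.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_CAPACITY 128

enum replay_mode { MODE_RECORD, MODE_REPLAY };

/* Hypothetical log of values the process obtained from its environment. */
struct event_log {
    enum replay_mode mode;
    int events[LOG_CAPACITY];
    size_t len; /* number of recorded events */
    size_t pos; /* next event to hand out during replay */
};

/* Fetch one input value. In record mode the value comes from source()
 * and is appended to the log. In replay mode it comes from the log and
 * source is not called. Returns 0 on success, -1 if the log is full
 * (record) or exhausted (replay). */
int logged_input(struct event_log *log, int (*source)(void), int *out)
{
    assert(log != NULL && out != NULL);
    if (log->mode == MODE_RECORD) {
        if (log->len == LOG_CAPACITY)
            return -1;
        *out = source();
        log->events[log->len++] = *out;
        return 0;
    }
    if (log->pos == log->len)
        return -1;
    *out = log->events[log->pos++];
    return 0;
}

/* Switch the log to replay mode, starting from the first event. */
void start_replay(struct event_log *log)
{
    assert(log != NULL);
    log->mode = MODE_REPLAY;
    log->pos = 0;
}

/* Stand-in for environment input: a different value on every call. */
int demo_source(void)
{
    static int next = 7;
    next = next * 3 + 1;
    return next;
}
```

Replaying from the log makes the faulty run repeatable even though demo_source itself would return new values on every call.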
Rescue Point Selection • Replay and detect failure • Extract stack trace • Find candidate rescue point that is closest to failure
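Selection as described above amounts to walking the stack trace outward from the faulting frame and stopping at the first function that is also a candidate rescue point. A minimal sketch follows, assuming the trace and the candidate set are available as arrays of function names; select_rescue_point and its start parameter are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Walk the stack trace from the faulting frame outward and return the
 * index of the first frame that is a candidate rescue point.
 * trace[0] is the function where the fault was detected and
 * trace[depth - 1] is the outermost caller. Frames before 'start' are
 * skipped, so a rejected choice can be followed by the next one.
 * Returns -1 if no candidate is on the stack. */
int select_rescue_point(const char *const *trace, size_t depth,
                        const char *const *candidates, size_t ncand,
                        size_t start)
{
    assert(trace != NULL && candidates != NULL);
    for (size_t i = start; i < depth; i++) {
        for (size_t j = 0; j < ncand; j++) {
            if (strcmp(trace[i], candidates[j]) == 0)
                return (int)i;
        }
    }
    return -1;
}
```

With the call chain from the high-level example (a calls b, b calls c, c calls d) and candidates a and c, a fault in d selects c. Calling again with start set past c falls back to a.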
Rescue Point Creation • Inject using dynamic binary instrumentation • Take checkpoint at rescue point • If fault detected, restore and steer execution • Cause application to rollback to checkpoint • Force error return using return value from rescue point discovery
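The checkpoint, restore and forced-error-return sequence can be imitated inside a single process. This sketch is not ASSURE's mechanism: ASSURE injects rescue points into binaries and checkpoints whole processes, whereas here a struct copy stands in for the checkpoint and setjmp/longjmp stands in for steering execution back to the rescue point. All names (app_state, handle_request, parse_request, fault_detected) and the "length over 64" fault condition are hypothetical.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical application state protected by the rescue point. */
struct app_state {
    int requests_served;
    int buffer_len;
};

struct app_state state;      /* live state */
struct app_state checkpoint; /* copy taken at the rescue point */
static jmp_buf rescue_env;   /* where to resume after a fault */

/* Stand-in for a fault sensor firing: restore the checkpoint and
 * steer execution back to the rescue point. */
static void fault_detected(void)
{
    state = checkpoint;
    longjmp(rescue_env, 1);
}

/* Work below the rescue point; it mutates state before faulting. */
static void parse_request(int len)
{
    state.buffer_len = len;
    if (len > 64) /* pretend a sensor caught an overflow here */
        fault_detected();
    state.requests_served++;
}

/* The rescue point. Returns 0 on success. After a fault it returns -1,
 * standing in for the error value learned during discovery. */
int handle_request(int len)
{
    assert(len >= 0);
    checkpoint = state;          /* take checkpoint */
    if (setjmp(rescue_env) != 0)
        return -1;               /* forced error return */
    parse_request(len);
    return 0;
}
```

From the caller's point of view a faulty request looks like an ordinary failed call: handle_request returns -1 and the state is as it was before the request, so the caller's existing error handling takes over.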
Rescue Point Implementation
int rescue_point(int a, int b) {
    int rid = rescue_capture(id, fault);
    if (rid < 0)        /* checkpoint/restore error */
        handle_error(id);
    else if (rid == 0)  /* error virtualization */
        return rescue_ret_val(fault);
    else
        /* rescue-point identifier */
        ...
}
Checkpoint/Rollback • Based on Zap • OS virtualization layer • Checkpoints kept in memory • Standard copy-on-write semantics • Consistent checkpoints of multi-process applications • File-system snapshot
Rescue Point Testing and Deployment • Test for survivability, correctness and performance • Repeat selection process if needed • Deployment via binary injection into running application on production server • Avoid: patch, compile, stop, restart
Evaluation • Implemented ASSURE for Linux • Tested several popular server applications • Metrics: survivability, correctness, performance • All tests on stripped binaries
Related Work • Number of proposals: • Failure-oblivious computing [OSDI 04] • Rx: Treating bugs as allergies [SOSP 05] • Automatic data-structure repair [OOPSLA 03] • Most try to mask the occurrence of faults • Problem with ensuring program semantics on recovery (unanticipated execution paths) • Our approach is to force an error!
Conclusions • Full system that enables automatic software self-healing • Introduced rescue-points • Programmer-tested recovery points • Experimental evaluation • Automatically fixed 8 real bugs • With minimal performance overhead