
Triage: Diagnosing Production Run Failures at the User’s Site


Presentation Transcript


  1. Triage: Diagnosing Production Run Failures at the User’s Site Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos, and Yuanyuan Zhou Department of Computer Science, University of Illinois, Urbana-Champaign

  2. Despite all of our effort, production runs still fail • What do we do about these failures? CS-UIUC

  3. What is (currently) done about end-user failures? • Dumps leave much manual effort to diagnose • We still need to reproduce the bug • This is hard, if not impossible, to do CS-UIUC

  4. Why on-site diagnosis of production run failures? • Production run bugs are valuable • Not caught in testing • Potentially environment specific • Causing real damage to end users • We can’t diagnose production failures off-site • Reproduction is hard • The programmer doesn’t have the end-user environment • Privacy concerns limit even the reports we do get • We must diagnose at the end-user’s site CS-UIUC

  5. What do we mean by diagnosis? • Diagnosis traces back to the underlying fault • Core dumps tell you about the failure • Bug detection tells you about some errors • Existing diagnosis tools are offline • [Diagram] Fault propagation chain: a trigger activates the fault (the root cause, e.g. a buggy line of code), which produces an error (incorrect state, e.g. a smashed stack), which eventually causes the failure (a service interruption) CS-UIUC
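As a toy illustration (not from the paper) of this fault → error → failure chain, consider a buffer overrun where the crash happens far from the buggy line:

  #include <string.h>

  /* Hypothetical example: the fault, the error, and the failure are three
     different things at three different places in the execution. */
  void read_name(const char *input)
  {
      char buf[8];
      strcpy(buf, input);   /* FAULT: the buggy line, no bounds check        */
                            /* ERROR: a long input has now smashed the stack */
  }                         /* FAILURE: the corrupted return address crashes
                               the program here, away from the fault         */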

  6. What do we need to perform diagnosis? (1) • We need information about the failure • What is the fault, the error, the propagation tree? • Off-site: • Repeatedly inspect the bug (e.g. with a debugger) • We run analysis tools targeted at the failure, or at suspected failures • Off-site techniques don’t work on-site • Reproducing the bug is non-trivial • We don’t know what specific failures will occur • Existing analysis tools are too expensive CS-UIUC

  7. What do we need to perform diagnosis? (2) • We need guidance as to what to do next • What analysis should we perform, what is likely to work well, and what variables are interesting? • Off-site: • The programmer decides, based on past knowledge • On-site, there is no programmer. • Any decisions as to action must be made automatically. CS-UIUC

  8. What do we need to perform diagnosis? (3) • We need to try “what-ifs” with the execution • If we change this input, what happens? Skip this function? • Off-site: • Programmers run many input variations, and even code variations • This is difficult on-site • Most replay focuses on minimizing variance • And without a programmer it is hard to interpret what the varied results mean CS-UIUC

  9. What does Triage contribute? • Enables on-site diagnosis • Uses systems techniques to make offline analysis tools feasible on-site • Addresses the three previous challenges • Allows a new technique, delta analysis • Human study • Real programmers and real bugs • Shows large savings in time-to-fix CS-UIUC

  10. Overview • Introduction • Addressing the three challenges • Diagnosis process & design • Experimental results • Human study • Overhead • Related work • Conclusions CS-UIUC

  11. Getting information about the failure • Checkpoint/re-execution can capture the bug • The environment, input, memory state, etc. • Everything we need to reproduce the bug • Benefits: • We can relive the failure over and over • Dynamically plug in analysis tools “on-demand” • Makes the expensive cheap • Normal-run overhead is low too CS-UIUC
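A minimal sketch of the fork-based, in-memory checkpointing idea this slide relies on (in the spirit of Rx; the function names are illustrative placeholders, not Triage's actual API):

  #include <signal.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Sketch only: a checkpoint is a stopped copy-on-write child process;
     rolling back means resuming that child and discarding the failed run. */
  static pid_t snapshot = -1;

  void take_checkpoint(void)
  {
      pid_t pid = fork();
      if (pid > 0) {
          snapshot = pid;        /* parent: keep running the real work       */
      } else if (pid == 0) {
          raise(SIGSTOP);        /* child: sleep, holding a copy-on-write
                                    snapshot of the pre-failure state        */
          /* When resumed, execution continues from this point, and analysis
             tools can be plugged in "on-demand" before re-running the
             failing interval. */
      }
  }

  void rollback_and_reexecute(void)
  {
      kill(snapshot, SIGCONT);   /* wake the snapshot to replay from here    */
      _exit(1);                  /* discard the failed execution             */
  }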

  12. Guidance about what to do next • A human-like diagnosis protocol can guide the diagnosis process • Repeated replay lets us diagnose incrementally • Based on past results, we can pick the next step • E.g. if the bug doesn’t always repeat, we should look for races CS-UIUC
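A hedged sketch of what such a protocol's control flow might look like; every function below is a placeholder named after the analyses the talk mentions, not Triage's real interfaces:

  /* Accumulated diagnosis results (illustrative). */
  struct report { int deterministic; int race_found; int memory_bug_found; };

  /* Placeholder analyses assumed to be provided elsewhere. */
  int replay_reproduces_failure(void);   /* plain deterministic replay      */
  int run_race_detectors(void);          /* happens-before, lockset         */
  int run_memory_checkers(void);         /* bounds checking, taint analysis */

  struct report diagnose(void)
  {
      struct report r = {0, 0, 0};
      /* Step 1: replay the checkpoint deterministically. */
      r.deterministic = replay_reproduces_failure();
      /* Step 2: let the previous result pick the next analysis. */
      if (!r.deterministic)
          r.race_found = run_race_detectors();   /* non-repeatable: suspect races */
      else
          r.memory_bug_found = run_memory_checkers();
      /* Later steps (backward slicing, delta analysis) refine the
         fault-propagation chain in the same incremental way. */
      return r;
  }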

  13. Trying “what-ifs” with the execution • Flexible re-execution lets us play with what-ifs • Three types of re-execution • Plain – deterministic • Loose – allow some variance • Wild – introduce (potentially large) variations • Extracts how they differ with delta analysis CS-UIUC
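A sketch of how a replay driver might expose these three modes; the hooks are placeholders, while the perturbations listed in the comments are the ones named in the talk:

  /* Placeholder hooks assumed to sit on top of the checkpointing subsystem. */
  void restore_checkpoint(void);
  void replay_logged_inputs_and_schedule(void);  /* deterministic replay    */
  void replay_logged_inputs_only(void);          /* scheduling may differ   */
  void perturb_run(void);                        /* drop/mutate inputs, pad
                                                    buffers, skip code,
                                                    reschedule threads,
                                                    reorder messages        */

  typedef enum { PLAIN, LOOSE, WILD } replay_mode;

  void reexecute(replay_mode mode)
  {
      restore_checkpoint();
      switch (mode) {
      case PLAIN: replay_logged_inputs_and_schedule(); break; /* deterministic     */
      case LOOSE: replay_logged_inputs_only();          break; /* some variance     */
      case WILD:  perturb_run();                        break; /* large variations  */
      }
  }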

  14. Main idea of Triage • How to get information about the failure? • Capture the bug with checkpoint/re-execution • Relive the bug with various diagnostic techniques • How to decide what to do? • Use a human-like protocol to select analysis • Incrementally increase our understanding of the bug • How to try out “what-if” scenarios? • Flexible re-execution allows varied executions • Delta analysis points out what makes them different CS-UIUC

  15. Overview • Introduction • Addressing the three challenges • Diagnosis process & design • Experimental results • Human study • Overhead • Related work • Conclusions CS-UIUC

  16. Triage Architecture • [Diagram] Main components: the Checkpointing Subsystem, the Control Unit (Protocol), and the Analysis Tools (e.g. backward slicing, bug detection) CS-UIUC

  17. Triage vs. Rx • Both are in-memory • Both support variations in execution • Triage has no output-commit constraint • Triage has no need for safety • It can even skip code • Triage considers why the failure occurs • It tries to analyze the failure CS-UIUC

  18. Failure analysis & delta generation (stages 1 and 2) • Analysis tools and their typical slowdowns: assertion checking (1x), static core analysis (1x), bounds checking (1.1x), taint analysis (2x), happens-before (12x), lockset analysis (20x), atomicity detection (60x), dynamic slicing (1000x), symbolic execution (1000x) • Delta-generation variations: rearrange allocation, drop inputs, mutate inputs, pad buffers, change file state, drop code, reschedule threads, change libraries, reorder messages • The differences caused by variations are useful as well CS-UIUC

  19. Delta analysis • Compute the basic block vector for each run: which basic blocks (here A–G, X, Y) were executed • One run: {A:1 B:1 C:1 D:1 X:0 E:1 F:1 G:1 Y:0} • Another run: {A:1 B:1 C:1 D:0 X:1 E:1 F:0 G:1 Y:1} • Their difference: {A:0 B:0 C:0 D:1 X:1 E:0 F:1 G:0 Y:1} CS-UIUC
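For concreteness, a small sketch (illustrative data structures, not Triage's) of how basic block vectors can be compared to find the most similar runs:

  /* Per-run basic block vectors over the nine blocks A..G, X, Y from the
     slide (illustrative only). */
  #define NUM_BLOCKS 9

  typedef struct { int count[NUM_BLOCKS]; } bbv_t;

  /* Distance between two runs' vectors; the smallest distance identifies the
     "most similar" pair of runs to diff in detail. */
  int bbv_distance(const bbv_t *a, const bbv_t *b)
  {
      int d = 0;
      for (int i = 0; i < NUM_BLOCKS; i++) {
          int diff = a->count[i] - b->count[i];
          d += diff < 0 ? -diff : diff;
      }
      return d;
  }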

  20. Delta analysis • From delta generation’s many runs, Triage finds the “most similar” • Compare the basic block vectors • Triage will diff the two closest runs • The minimum edit distance, a.k.a. shortest edit script • [Diagram] e.g. diffing the block traces A B C D E F G and A B C X E G Y: the edit script captures the differing blocks (D vs. X, the dropped F, and the added Y) CS-UIUC
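A sketch of the minimum-edit-distance computation over the two runs' block traces; a standard dynamic program stands in here for whatever diff algorithm Triage actually uses, and the function name and size limit are illustrative:

  #include <string.h>

  #define MAXLEN 64

  /* Levenshtein distance between two basic-block traces encoded as strings;
     tracing back through the table would give the shortest edit script. */
  int edit_distance(const char *run1, const char *run2)
  {
      int n = strlen(run1), m = strlen(run2);
      int d[MAXLEN + 1][MAXLEN + 1];

      for (int i = 0; i <= n; i++) d[i][0] = i;   /* delete all of run1 */
      for (int j = 0; j <= m; j++) d[0][j] = j;   /* insert all of run2 */

      for (int i = 1; i <= n; i++)
          for (int j = 1; j <= m; j++) {
              int sub = d[i-1][j-1] + (run1[i-1] != run2[j-1]);
              int del = d[i-1][j] + 1;
              int ins = d[i][j-1] + 1;
              int best = sub < del ? sub : del;
              d[i][j] = best < ins ? best : ins;
          }
      return d[n][m];   /* e.g. edit_distance("ABCDEFG", "ABCXEGY") == 3 */
  }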

  21. A bug in TAR

  /* savedir () */
  char *
  savedir (const char *dir)
  {
    DIR *dirp;
    struct dirent *dp;
    char *name_space;
    size_t allocated = NAME_SIZE_DEFAULT;
    size_t used = 0;
    int save_errno;

    dirp = opendir (dir);
    if (dirp == NULL)
      return NULL;                      /* returns NULL if opendir fails */

    name_space = xmalloc (allocated);
    errno = 0;
    while ((dp = readdir (dirp)) != NULL)
      {
        char const *entry = dp->d_name;
        if (entry[entry[0] != '.' ? 0 : entry[1] != '.' ? 1 : 2] != '\0')
          {
            size_t entry_size = strlen (entry) + 1;
            if (used + entry_size < used)
              xalloc_die ();
            if (allocated <= used + entry_size)
              {
                do
                  {
                    if (2 * allocated < allocated)
                      xalloc_die ();
                    allocated *= 2;
                  }
                while (allocated <= used + entry_size);
    ...

  /* get_directory_contents () */
  char *
  get_directory_contents (char *path, dev_t device)
  {
    struct accumulator *accumulator;

    /* Recursively scan the given PATH. */
    {
      char *dirp = savedir (path);      /* may be NULL */
      char const *entry;
      size_t entrylen;
      char *name_buffer;
      size_t name_buffer_size;
      size_t name_length;
      struct directory *directory;
      enum children children;

      if (! dirp)
        savedir_error (path);           /* reports the error but execution continues */

      errno = 0;
      name_buffer_size = strlen (path) + NAME_FIELD_SIZE;
      name_buffer = xmalloc (name_buffer_size + 2);
      strcpy (name_buffer, path);
      if (! ISSLASH (path[strlen (path) - 1]))
        strcat (name_buffer, "/");
      name_length = strlen (name_buffer);
      directory = find_directory (path);
      children = directory ? directory->children : CHANGED_CHILDREN;
      accumulator = new_accumulator ();

      if (children != NO_CHILDREN)
        for (entry = dirp;
             (entrylen = strlen (entry)) != 0;   /* null pointer dereference */
             entry += entrylen + 1)
    ...

  Execution difference (from delta analysis): the failing run takes the
  "dirp == NULL" path in savedir. Failure: segmentation fault from a null
  pointer dereference in strlen. CS-UIUC

  22. Sample Triage report • Failure point: segfault in library strlen; stack & heap OK • Bug detection: deterministic bug; null pointer at incremen.c:207 • Fault propagation: dirp = opendir (dir); → if (dirp == NULL) return NULL; → dirp = savedir (path); → entry = dirp; → strlen(entry) CS-UIUC

  23. Results – Human Study • We tested Triage with a human study • 15 programmers drawn from faculty, research programmers, and graduate students • No undergraduates! • Measured time to repair bugs, with/without Triage • Everybody got core dumps, sample inputs, instructions on how to replicate, and access to many debugging tools • Including Valgrind • 3 simple toy bugs, & 2 real bugs • The TAR bug you just saw • A copy-paste error in BC CS-UIUC

  24. Time to fix a bug • We hope that the report is easy to check • We cut out the reproduction step • This is quite unfair to Triage • Also, we put a time limit • Going over time is counted as the max time • [Diagram] Without Triage: reproduce → find failure → …error → …fault → fix it; with Triage: check Triage report → fix it CS-UIUC

  25. Results – Human study • For the real bugs, Triage strongly helps (47% time savings) • Better than 99.99% confidence that time-to-fix with Triage < time-to-fix without CS-UIUC

  26. Results – Other Bugs CS-UIUC

  27. Results – Normal Run Overhead • Identical to checkpoint system (Rx) overhead • Under 5% CS-UIUC

  28. Results – Diagnosis Overhead • CPU bound is the worst case • Still reasonable because we’re only redoing 200ms • Delta analysis is somewhat costly • Should be run in the background CS-UIUC

  29. Related work • Checkpointing & re-execution • Zap [Osman, OSDI’02], TTVM [King, USENIX’05] • Bug detection & diagnosis • Valgrind [Nethercote], CCured [Necula, POPL’02], Purify [Hastings, USENIX’92] • Eraser [Savage, TOCS’97], [Netzer, PPoPP’91] • Backward slicing [Weiser, CACM’82] • Innumerable others • Execution variation • Input variation • Delta debugging [Zeller, FSE’02], Fuzzing [B. So] • Environment variation • Rx [Qin, SOSP’05], DieHard [Berger, PLDI’06] CS-UIUC

  30. Conclusions & Future Work • On-site diagnosis can be made feasible • Checkpointing can effectively capture the failure • Expensive off-line analysis can be done on-site • Privacy issues are minimized • Also useful for in-house testing • Reduces the manual portion of analysis • Future work • Automatic bug hot fixes • Visualization of delta analysis CS-UIUC

  31. Thank you • Questions? Special thanks to Hewlett-Packard for student scholarship support. This work was supported by NSF, DoE, and Intel. CS-UIUC
