
Triage: Diagnosing Production Run Failures at the User’s Site


Presentation Transcript


  1. Triage: Diagnosing Production Run Failures at the User’s Site Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos, and Yuanyuan Zhou Department of Computer Science, University of Illinois, Urbana-Champaign

  2. Despite all of our effort, production runs still fail • What do we do about these failures? CS-UIUC

  3. What is (currently) done about end-user failures? • Dumps leave much manual effort to diagnose • We still need to reproduce the bug • This is hard, if not impossible, to do CS-UIUC

  4. Why on-site diagnosis of production run failures? • Production run bugs are valuable • Not caught in testing • Potentially environment specific • Causing real damage to end users • We can’t diagnose production failures off-site • Reproduction is hard • The programmer doesn’t have the end-user environment • Privacy concerns limit even the reports we do get • We must diagnose at the end-user’s site CS-UIUC

  5. What do we mean by diagnosis? • Diagnosis traces back to the underlying fault • Core dumps tell you about the failure • Bug detection tells you about some errors • Existing diagnosis tools are offline • [Diagram] Fault propagation chain: a trigger activates the fault (the root cause, e.g. a buggy line of code), which produces an error (incorrect state, e.g. a smashed stack), which eventually causes the failure (a service interruption) CS-UIUC
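As a toy illustration (not from the paper) of this fault → error → failure chain, consider a buffer overrun where the crash happens far from the buggy line:

  #include <string.h>

  /* Hypothetical example: the fault, the error, and the failure are three
     different things at three different places in the execution. */
  void read_name(const char *input)
  {
      char buf[8];
      strcpy(buf, input);   /* FAULT: the buggy line, no bounds check        */
                            /* ERROR: a long input has now smashed the stack */
  }                         /* FAILURE: the corrupted return address crashes
                               the program here, away from the fault         */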

  6. What do we need to perform diagnosis? (1) • We need information about the failure • What is the fault, the error, the propagation tree? • Off-site: • Repeatedly inspect the bug (e.g. with a debugger) • We run analysis tools targeted at the failure, or at suspected failures • Off-site techniques don’t work on-site • Reproducing the bug is non-trivial • We don’t know what specific failures will occur • Existing analysis tools are too expensive CS-UIUC

  7. What do we need to perform diagnosis? (2) • We need guidance as to what to do next • What analysis should we perform, what is likely to work well, and what variables are interesting? • Off-site: • The programmer decides, based on past knowledge • On-site, there is no programmer. • Any decisions as to action must be made automatically. CS-UIUC

  8. What do we need to perform diagnosis? (3) • We need to try “what-ifs” with the execution • If we change this input, what happens? Skip this function? • Off-site: • Programmers run many input variations, and even code variations • This is difficult on-site • Most replay focuses on minimizing variance • And without a programmer it is hard to interpret what the varied results mean CS-UIUC

  9. What does Triage contribute? • Enables on-site diagnosis • Uses systems techniques to make offline analysis tools feasible on-site • Addresses the three previous challenges • Allows a new technique, delta analysis • Human study • Real programmers and real bugs • Shows large savings in time-to-fix CS-UIUC

  10. Overview • Introduction • Addressing the three challenges • Diagnosis process & design • Experimental results • Human study • Overhead • Related work • Conclusions CS-UIUC

  11. Getting information about the failure • Checkpoint/re-execution can capture the bug • The environment, input, memory state, etc. • Everything we need to reproduce the bug • Benefits: • We can relive the failure over and over • Dynamically plug in analysis tools “on-demand” • Makes the expensive cheap • Normal-run overhead is low too CS-UIUC
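A minimal sketch of the fork-based, in-memory checkpointing idea this slide relies on (in the spirit of Rx; the function names are illustrative placeholders, not Triage's actual API):

  #include <signal.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Sketch only: a checkpoint is a stopped copy-on-write child process;
     rolling back means resuming that child and discarding the failed run. */
  static pid_t snapshot = -1;

  void take_checkpoint(void)
  {
      pid_t pid = fork();
      if (pid > 0) {
          snapshot = pid;        /* parent: keep running the real work       */
      } else if (pid == 0) {
          raise(SIGSTOP);        /* child: sleep, holding a copy-on-write
                                    snapshot of the pre-failure state        */
          /* When resumed, execution continues from this point, and analysis
             tools can be plugged in "on-demand" before re-running the
             failing interval. */
      }
  }

  void rollback_and_reexecute(void)
  {
      kill(snapshot, SIGCONT);   /* wake the snapshot to replay from here    */
      _exit(1);                  /* discard the failed execution             */
  }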

  12. Guidance about what to do next • A human-like diagnosis protocol can guide the diagnosis process • Repeated replay lets us diagnose incrementally • Based on past results, we can pick the next step • E.g. if the bug doesn’t always repeat, we should look for races CS-UIUC
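A hedged sketch of what such a protocol's control flow might look like; every function below is a placeholder named after the analyses the talk mentions, not Triage's real interfaces:

  /* Accumulated diagnosis results (illustrative). */
  struct report { int deterministic; int race_found; int memory_bug_found; };

  /* Placeholder analyses assumed to be provided elsewhere. */
  int replay_reproduces_failure(void);   /* plain deterministic replay      */
  int run_race_detectors(void);          /* happens-before, lockset         */
  int run_memory_checkers(void);         /* bounds checking, taint analysis */

  struct report diagnose(void)
  {
      struct report r = {0, 0, 0};
      /* Step 1: replay the checkpoint deterministically. */
      r.deterministic = replay_reproduces_failure();
      /* Step 2: let the previous result pick the next analysis. */
      if (!r.deterministic)
          r.race_found = run_race_detectors();   /* non-repeatable: suspect races */
      else
          r.memory_bug_found = run_memory_checkers();
      /* Later steps (backward slicing, delta analysis) refine the
         fault-propagation chain in the same incremental way. */
      return r;
  }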

  13. Trying “what-ifs” with the execution • Flexible re-execution lets us play with what-ifs • Three types of re-execution • Plain – deterministic • Loose – allow some variance • Wild – introduce (potentially large) variations • Extracts how they differ with delta analysis CS-UIUC
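A sketch of how a replay driver might expose these three modes; the hooks are placeholders, while the perturbations listed in the comments are the ones named in the talk:

  /* Placeholder hooks assumed to sit on top of the checkpointing subsystem. */
  void restore_checkpoint(void);
  void replay_logged_inputs_and_schedule(void);  /* deterministic replay    */
  void replay_logged_inputs_only(void);          /* scheduling may differ   */
  void perturb_run(void);                        /* drop/mutate inputs, pad
                                                    buffers, skip code,
                                                    reschedule threads,
                                                    reorder messages        */

  typedef enum { PLAIN, LOOSE, WILD } replay_mode;

  void reexecute(replay_mode mode)
  {
      restore_checkpoint();
      switch (mode) {
      case PLAIN: replay_logged_inputs_and_schedule(); break; /* deterministic     */
      case LOOSE: replay_logged_inputs_only();          break; /* some variance     */
      case WILD:  perturb_run();                        break; /* large variations  */
      }
  }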

  14. Main idea of Triage • How to get information about the failure? • Capture the bug with checkpoint/re-execution • Relive the bug with various diagnostic techniques • How to decide what to do? • Use a human-like protocol to select analysis • Incrementally increase our understanding of the bug • How to try out “what-if” scenarios? • Flexible re-execution allows varied executions • Delta analysis points out what makes them different CS-UIUC

  15. Overview • Introduction • Addressing the three challenges • Diagnosis process & design • Experimental results • Human study • Overhead • Related work • Conclusions CS-UIUC

  16. Triage Architecture • [Diagram] Main components: the Checkpointing Subsystem, the Control Unit (Protocol), and the Analysis Tools (e.g. backward slicing, bug detection) CS-UIUC

  17. Triage vs. Rx • Both are in-memory • Both support variations in execution • Triage has no output-commit constraint • Triage has no need for safety • It can even skip code • Triage considers why the failure occurs • It tries to analyze the failure CS-UIUC

  18. Failure analysis & delta generation (stages 1 and 2) • Analysis tools and their typical slowdowns: assertion checking (1x), static core analysis (1x), bounds checking (1.1x), taint analysis (2x), happens-before (12x), lockset analysis (20x), atomicity detection (60x), dynamic slicing (1000x), symbolic execution (1000x) • Delta-generation variations: rearrange allocation, drop inputs, mutate inputs, pad buffers, change file state, drop code, reschedule threads, change libraries, reorder messages • The differences caused by variations are useful as well CS-UIUC

  19. Delta analysis • Compute the basic block vector for each run: which basic blocks (here A–G, X, Y) were executed • One run: {A:1 B:1 C:1 D:1 X:0 E:1 F:1 G:1 Y:0} • Another run: {A:1 B:1 C:1 D:0 X:1 E:1 F:0 G:1 Y:1} • Their difference: {A:0 B:0 C:0 D:1 X:1 E:0 F:1 G:0 Y:1} CS-UIUC
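For concreteness, a small sketch (illustrative data structures, not Triage's) of how basic block vectors can be compared to find the most similar runs:

  /* Per-run basic block vectors over the nine blocks A..G, X, Y from the
     slide (illustrative only). */
  #define NUM_BLOCKS 9

  typedef struct { int count[NUM_BLOCKS]; } bbv_t;

  /* Distance between two runs' vectors; the smallest distance identifies the
     "most similar" pair of runs to diff in detail. */
  int bbv_distance(const bbv_t *a, const bbv_t *b)
  {
      int d = 0;
      for (int i = 0; i < NUM_BLOCKS; i++) {
          int diff = a->count[i] - b->count[i];
          d += diff < 0 ? -diff : diff;
      }
      return d;
  }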

  20. Delta analysis • From delta generation’s many runs, Triage finds the “most similar” • Compare the basic block vectors • Triage will diff the two closest runs • The minimum edit distance, a.k.a. shortest edit script • [Diagram] e.g. diffing the block traces A B C D E F G and A B C X E G Y: the edit script captures the differing blocks (D vs. X, the dropped F, and the added Y) CS-UIUC
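A sketch of the minimum-edit-distance computation over the two runs' block traces; a standard dynamic program stands in here for whatever diff algorithm Triage actually uses, and the function name and size limit are illustrative:

  #include <string.h>

  #define MAXLEN 64

  /* Levenshtein distance between two basic-block traces encoded as strings;
     tracing back through the table would give the shortest edit script. */
  int edit_distance(const char *run1, const char *run2)
  {
      int n = strlen(run1), m = strlen(run2);
      int d[MAXLEN + 1][MAXLEN + 1];

      for (int i = 0; i <= n; i++) d[i][0] = i;   /* delete all of run1 */
      for (int j = 0; j <= m; j++) d[0][j] = j;   /* insert all of run2 */

      for (int i = 1; i <= n; i++)
          for (int j = 1; j <= m; j++) {
              int sub = d[i-1][j-1] + (run1[i-1] != run2[j-1]);
              int del = d[i-1][j] + 1;
              int ins = d[i][j-1] + 1;
              int best = sub < del ? sub : del;
              d[i][j] = best < ins ? best : ins;
          }
      return d[n][m];   /* e.g. edit_distance("ABCDEFG", "ABCXEGY") == 3 */
  }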

  21. A bug in TAR

  /* savedir () */
  char *
  savedir (const char *dir)
  {
    DIR *dirp;
    struct dirent *dp;
    char *name_space;
    size_t allocated = NAME_SIZE_DEFAULT;
    size_t used = 0;
    int save_errno;

    dirp = opendir (dir);
    if (dirp == NULL)
      return NULL;                      /* returns NULL if opendir fails */

    name_space = xmalloc (allocated);
    errno = 0;
    while ((dp = readdir (dirp)) != NULL)
      {
        char const *entry = dp->d_name;
        if (entry[entry[0] != '.' ? 0 : entry[1] != '.' ? 1 : 2] != '\0')
          {
            size_t entry_size = strlen (entry) + 1;
            if (used + entry_size < used)
              xalloc_die ();
            if (allocated <= used + entry_size)
              {
                do
                  {
                    if (2 * allocated < allocated)
                      xalloc_die ();
                    allocated *= 2;
                  }
                while (allocated <= used + entry_size);
    ...

  /* get_directory_contents () */
  char *
  get_directory_contents (char *path, dev_t device)
  {
    struct accumulator *accumulator;

    /* Recursively scan the given PATH. */
    {
      char *dirp = savedir (path);      /* may be NULL */
      char const *entry;
      size_t entrylen;
      char *name_buffer;
      size_t name_buffer_size;
      size_t name_length;
      struct directory *directory;
      enum children children;

      if (! dirp)
        savedir_error (path);           /* reports the error but execution continues */

      errno = 0;
      name_buffer_size = strlen (path) + NAME_FIELD_SIZE;
      name_buffer = xmalloc (name_buffer_size + 2);
      strcpy (name_buffer, path);
      if (! ISSLASH (path[strlen (path) - 1]))
        strcat (name_buffer, "/");
      name_length = strlen (name_buffer);
      directory = find_directory (path);
      children = directory ? directory->children : CHANGED_CHILDREN;
      accumulator = new_accumulator ();

      if (children != NO_CHILDREN)
        for (entry = dirp;
             (entrylen = strlen (entry)) != 0;   /* null pointer dereference */
             entry += entrylen + 1)
    ...

  Execution difference (from delta analysis): the failing run takes the
  "dirp == NULL" path in savedir. Failure: segmentation fault from a null
  pointer dereference in strlen. CS-UIUC

  22. Sample Triage report • Failure point: segfault in library strlen; stack & heap OK • Bug detection: deterministic bug; null pointer at incremen.c:207 • Fault propagation: dirp = opendir (dir); → if (dirp == NULL) return NULL; → dirp = savedir (path); → entry = dirp; → strlen(entry) CS-UIUC

  23. Results – Human Study • We tested Triage with a human study • 15 programmers drawn from faculty, research programmers, and graduate students • No undergraduates! • Measured time to repair bugs, with/without Triage • Everybody got core dumps, sample inputs, instructions on how to replicate, and access to many debugging tools • Including Valgrind • 3 simple toy bugs, & 2 real bugs • The TAR bug you just saw • A copy-paste error in BC CS-UIUC

  24. Time to fix a bug • We hope that the report is easy to check • We cut out the reproduction step • This is quite unfair to Triage • Also, we put a time limit • Going over time is counted as the max time • [Diagram] Without Triage: reproduce → find failure → …error → …fault → fix it; with Triage: check Triage report → fix it CS-UIUC

  25. Results – Human study • For the real bugs, Triage strongly helps (47% time savings) • Better than 99.99% confidence that time-to-fix with Triage < time-to-fix without CS-UIUC

  26. Results – Other Bugs CS-UIUC

  27. Results – Normal Run Overhead • Identical to checkpoint system (Rx) overhead • Under 5% CS-UIUC

  28. Results – Diagnosis Overhead • CPU bound is the worst case • Still reasonable because we’re only redoing 200ms • Delta analysis is somewhat costly • Should be run in the background CS-UIUC

  29. Related work • Checkpointing & re-execution • Zap [Osman, OSDI’02], TTVM [King, USENIX’05] • Bug detection & diagnosis • Valgrind [Nethercote], CCured [Necula, POPL’02], Purify [Hastings, USENIX’92] • Eraser [Savage, TOCS’97], [Netzer, PPoPP’91] • Backward slicing [Weiser, CACM’82] • Innumerable others • Execution variation • Input variation • Delta debugging [Zeller, FSE’02], Fuzzing [B. So] • Environment variation • Rx [Qin, SOSP’05], DieHard [Berger, PLDI’06] CS-UIUC

  30. Conclusions & Future Work • On-site diagnosis can be made feasible • Checkpointing can effectively capture the failure • Expensive off-line analysis can be done on-site • Privacy issues are minimized • Also useful for in-house testing • Reduces the manual portion of analysis • Future work • Automatic bug hot fixes • Visualization of delta analysis CS-UIUC

  31. Thank you • Questions? Special thanks to Hewlett-Packard for student scholarship support. This work was supported by NSF, DoE, and Intel. CS-UIUC
