270 likes | 379 Vues
This paper presents an innovative approach to checkpointing and replaying Java software, aiming to enhance debugging and fault tolerance. By leveraging static analyses and code instrumentation, our method captures and restores program state effectively. It tackles challenges such as ease of use, managing application-level checkpointing without JVM support, and optimizing recorded states for large programs. Our approach outputs two program versions: one that augments checkpointing, and another that prunes redundant states for efficient replaying. We evaluate our technique through experimental analysis, demonstrating its potential benefits for long-running applications.
E N D
Guoqing Xu, Atanas Rountev, Yan Tang, Feng Qin Ohio State University ESEC/FSE 07 Efficient Checkpointing of Java Software using Context-Sensitive Capture and Replay
Outline Motivation Challenges for checkpointing/replaying Java software Summary of our approach Contributions Static analyses Multiple execution regions Experimental evaluation Conclusions
Motivation Checkpointing/replaying has been used for a variety of purposes at system level Originally designed to support fault tolerance Debugging of OS and of parallel and distributed software Checkpointing can benefit a number of software engineering tasks Reduce the cost of manual debugging and testing Support for automated techniques for debugging and testing: e.g., dynamic slicing and delta-debugging Inspired by both system-level checkpointing [Pan-PDD88, Dunlap-OSDI02, King-USENIX05] and “saving-and-restoring” software engineering techniques [Saff-ASE05, Orso-WODA05,Orso-WODA06, Elbaum-FSE06]
Challenges Ease of use and deployment Application-level checkpointing: no JVM/runtime support, just code analysis and instrumentation Challenge: no direct access to the call stack; no control over thread scheduling or external resources (files, etc.) Reduce the size of the recorded state Dumping the entire heap may be prohibitively expensive, especially for large programs Challenge: static analyses to prune redundant state Static and dynamic overhead Static analysis cost is amortized over multiple runs Approach is intended for long-running applications
Summary of Our Approach Tool input: program + checkpoint definition Performs static analyses and code instrumentation Tool output: two program versions First, an augmentedcheckpointing version is executed once to record (parts of) the run-time program states At the checkpoint: heap objects, static fields, locals At certain points along the call chain leading to the checkpoint Next, a prunedreplaying version is executed multiple times Restore variables saved at the checkpoint Restore variables saved at points along the call chain How do we resume execution from the checkpoint? Step 1: control flow quickly reaches the checkpoint Step 2: recover state at checkpoint Step 3: incrementally recover state after call sites along the call chain leading to the checkpoint
Definitions Crosscut call chain (CC-chain) A programmer-specified call chain that leads to the method that contains the checkpoint E.g. main(44) -> run(28) Decision points A call site on the CC-chain (e.g. m.run) – due to polymorphism A predicate on which a decision point or the checkpoint is control-dependent At a decision point, the checkpointing version records the control-flow outcome The replaying version uses this info to force the control flow to reach the checkpoint
Replaying, Step 1: Recover the Call Stack Predicate decision point: recover boolean value Call site decision point o.m(a1…, an) Recover the run-time type of the receiver object; instantiated during replaying using sun.misc.Unsafe
Checkpointing Version void run(String[] args) { processCmdLine(args); loadNecessaryClasses(); Set wp_packs = getWpacks(); Set body_packs = getBpacks(); boolean b = Options.v().whole_jimple(); =>save(b); if (b){// DP getPack("cg").apply(); // --- checkpoint --- => save(…); getPack("wjtp").apply(); getPack("wjop").apply(); getPack("wjap").apply(); } retrieveAllBodies(); … } ... } static void main(String[] args) { Main m = new Main(); boolean b = args.length !=0; =>save(b); if (b) // DP => save(type_of(m)); m.run(args); // DP }
Replaying Version void run(String[] args) { processCmdLine(args); loadNecessaryClasses(); Set wp_packs = getWpacks(); Set body_packs = getBpacks(); boolean b = Options.v().whole_jimple(); =>read(b); if (b){// DP getPack("cg").apply(); // --- checkpoint --- =>read(…); getPack("wjtp").apply(); getPack("wjop").apply(); getPack("wjap").apply(); } retrieveAllBodies(); … } static void main(String[] args) { Main m = new Main(); boolean b = args.length !=0; =>read(b); if (b) // DP => read(type_of(m)); => unsafe.allocate(m); => args = null; m.run(args); // DP }
Step 2: Recover at the Checkpoint Our static analysis selects locals for recording(for checkpointing)/recovering(for replaying) when They are written before the checkpoint They are read after the checkpoint Record primitive-typed values or entire object graphs on the heap (all reachable objects) Static fields are selected based on the same idea void run(String[] args) { processCmdLine(args); loadNecessaryClasses(); Set wp_packs = getWpacks(); Setbody_packs= getBpacks(); if (Options.v().whole_jimple()) { getPack("cg").apply(); // --- checkpoint --- getPack("wjtp").apply(); getPack("wjop").apply(); getPack("wjap").apply(); } retrieveAllBodies(); for (Iterator i = body_packs.iterator(); i.hasNext();) { … }… } body_packs
Selection of Static Fields A whole program Mod/Use analysis A static field is “written” if its value is changed, or any heap object reachable from it is mutated A static field is “read” if its value is directly read Analysis algorithm Context-sensitive and flow-insensitive; uses the points-to solution and the call graph from Spark [Lhotak CC-03] Bottom-up traversal of the SCC-DAG of the call graph For each method m, a set Cmis maintained to contain all objects from which a mutated object can be reached Propagate backwards the objects in Cmthat escape a callee method to its callers Select a static field fld if PointsToSet(fld)∩ Cm ≠ ∅
Step 3: Recover after the Checkpoint Replaying only at decision points and the checkpoint is not enough to guarantee correct execution after the checkpoint Additionally record/recover local variables that will be read after each call site in CC-chain void main(){ Set hs = new HashSet(); B b = new B(hs); //-- reco/rest // (type_of(b)) b.m(); //-- extra reco/rest (hs) if(hs == b.s){ … } } class B{ Set s; void m(){ B r0 = this; r0.s = new HashSet(); //-- checkpoint //-- reco/rest (r0) r0.s.add(“”); } } hsuninitialized
Additional Issues A checkpoint can have multiple run-time instances If a method in CC-chain has callers that are not in the chain, it has to be replicated Currently do not support multi-threaded programs Our technique does not guarantee the correctness of the execution, when the post-checkpoint part of the program Depends on external resources, such as files, databases Depends on unique-per-execution values, such as clock Is modified with new cross-checkpoint dependencies Multiple execution regions Designated by a starting point and an ending point Specified by two CC-chains
Study 1: Static Analysis Program #R #IP compress 1 6 socksproxy 3 11 socksecho 3 14 raytrace 3 10 soot-2.2.3 10 35 muffin 3 20 sablecc 4 11 jess 3 8 violet 4 9 javacup 4 9 jtar-1.21 2 4 db 2 5 jflex 2 8 jb-6.1 3 5 jlex-1.2.6 3 8
Static Analysis Cost Phase 1: Soot infrastructure cost Between 1.64ms and 30.6ms per thousand Jimple statements On average, 11.1ms/1000 statements Phase 2: Our analysis cost Between 1.67ms and 26.6ms per thousand Jimple statements On average, 9.4ms/1000 statements This should be amortized across multiple runs of the replaying version
Study 2: Run-Time Performance (compress) Original program: compressing and decompressing 5 big tar files several times Evaluated for five checkpoint definitions One checkpoint, close to the beginning of the program Two regions of compression and decompression A region containing the process of compression A region containing the process of decompression One checkpoint, close to the end of the program
compress Performance • Normalized running time • Normalized size of captured program state
Study 2: Run-Time Performance (soot) Input: soot-2.2.3 itself containing 2227333 methods Phases Enabling cg.spark, wjtp, wjop.ji, wjap.uft,jtp, jop.cp Evaluated for six checkpoint definitions Before whole-program packs After cg After wjtp After wjop After wjap After body packs
soot Performance • Normalized running time • Normalized size of captured program state
Study 2: Run-Time Performance (jflex-1.4.1) Input: a .flex grammar file corresponding to a DFA containing 21769 states Evaluated for four checkpoint definitions After NFA is generated After DFA is generated to DFA After minimization After emission
jflex Performance • Normalized running time • Normalized size of captured program state
Summary of Evaluation Static analysis successfully reduces the size of program state recorded and recovered It is more meaningful to checkpoint/replay long-running programs Checkpoints are better taken after a phase of long computation with (relatively) small output state √compress: small program state, short running time √ soot: large program state, but very long computation time X jflex: large program state, short running time
Conclusions A static-analysis-based checkpointing/replaying technique An implementation and an evaluation that shows our technique can be an interesting candidate for testing, debugging, and dynamic slicing of long-running programs Future work Language-level checkpointing/replaying multi-threaded programs More precise static analyses could be employed to reduce the size of program state to be captured The run-time support for object reading and writing could be improved