
Compiler-Generated Staggered Checkpointing



Presentation Transcript


  1. Compiler-Generated Staggered Checkpointing
     Alison N. Norman, Department of Computer Sciences, The University of Texas at Austin
     Sung-Eun Choi, Los Alamos National Laboratory
     Calvin Lin, Department of Computer Sciences, The University of Texas at Austin

  2. The Importance of Clusters
     • Scientific computation is increasingly performed on clusters
     • Cost-effective: created from commodity parts
     • Scientists want more computational power
     • Cluster computational power is easy to increase by adding processors
     ⇒ Cluster size keeps increasing!

  3. Clusters Are Not Perfect
     • Failure rates are increasing
     • The number of moving parts is growing (processors, network connections, disks, etc.)
     • Mean Time Between Failure (MTBF) is shrinking
     How can we deal with these failures?

  4. Options for Fault-Tolerance
     • Redundancy in space
       • Each participating process has a backup process
       • Expensive!
     • Redundancy in time
       • Processes save state and then roll back for recovery
       • Lighter-weight fault tolerance

  5. Today's Answer
     Programmers place checkpoints:
     • Small checkpoint size
     • Synchronous
       • Every process checkpoints in the same place in the code
       • Global synchronization before and after checkpoints (see the sketch below)
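     To make the barrier-bracketed scheme concrete, here is a minimal sketch of synchronous, application-level checkpointing over MPI. It is an illustration under simple assumptions, not the mechanism used in the talk; save_application_state() and the checkpoint file name are hypothetical.

         /* Synchronous checkpointing: every process writes its state at the same
          * program point, bracketed by global barriers. */
         #include <mpi.h>
         #include <stdio.h>

         static void save_application_state(const char *path) {
             FILE *f = fopen(path, "wb");   /* hypothetical helper: serialize live data here */
             if (f) fclose(f);
         }

         void synchronous_checkpoint(int rank, long step) {
             char path[256];
             snprintf(path, sizeof(path), "ckpt_step%ld_rank%d.dat", step, rank);

             MPI_Barrier(MPI_COMM_WORLD);   /* all processes reach the checkpoint together */
             save_application_state(path);  /* everyone writes at once: network and
                                               file-system contention grows with cluster size */
             MPI_Barrier(MPI_COMM_WORLD);   /* no process proceeds until all have finished */
         }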

  6. What's the Problem?
     • Future systems will be larger
     • Checkpointing will hurt program performance
       • Many processes checkpointing synchronously will result in network and file system contention
       • Checkpointing to local disk is not viable
     • Application programmers are only willing to pay 1% overhead for fault-tolerance
     • The solution: avoid synchronous checkpoints

  7. Solution: Staggered Checkpointing
     • Spread individual checkpoints in time to reduce network and file system contention
     • Possible approaches exist
       • Dynamic approaches add runtime overhead
       • They do not guarantee reduced contention
     • This talk explains:
       • Why staggered checkpointing is a good solution
       • The difficulties of staggered checkpointing
       • How a compiler can help

  8. Contributions
     • Show that synchronous checkpointing will suffer significant contention
     • Show that staggered checkpointing improves performance:
       • Reduces checkpoint latency by up to a factor of 23
       • Enables more frequent checkpoints
     • Describe a prototype compiler for identifying staggered checkpoints
     • Show that there is great potential for staggering checkpoints within applications

  9. Talk Outline
     • Motivation
     • Our Solution
       • Build communication graph
       • Create vector clocks
       • Identify recovery lines
     • Results
     • Future Work
     • Related Work
     • Conclusion

  10. Understanding Staggered Checkpointing
      (animated diagram: processes vs. time, showing checkpoints and recovery lines)
      • Today: every process checkpoints synchronously; with more processes and more data, that means contention
      • Tomorrow: even larger machines (the diagram shows 64K processes), so the contention only gets worse
      • "That's easy! We'll stagger the checkpoints…" Not so fast: there is communication between processes
        • If a receive is saved but its matching send is not, the saved state is inconsistent: it could not have existed, so the recovery line is invalid
        • If the send is saved, the recovery line is valid: the saved state is consistent, one that could have existed [Randell 75]
      (the validity condition is stated formally below)
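      Stated formally, this is the textbook consistency condition for recovery lines (a reconstruction, not text taken from the slide):

          % A recovery line L = \{c_0, \dots, c_{n-1}\}, with one checkpoint c_i per
          % process i, is valid iff no message crosses the line backwards:
          \forall \text{ messages } m:\;
            \operatorname{recv}(m) \prec c_{\operatorname{dst}(m)}
            \;\Longrightarrow\;
            \operatorname{send}(m) \prec c_{\operatorname{src}(m)}
          % where x \prec c means event x occurs before checkpoint c on its own process.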

  11. Complications with Staggered Checkpointing
      Checkpoints must be placed carefully:
      • Want valid recovery lines
      • Want low contention
      • Want small state
      This is difficult!

  12. Our Solution
      The compiler places staggered checkpoints:
      • Builds the communication graph
      • Calculates vector clocks
      • Identifies valid recovery lines

  13. Assumptions in our Prototype Compiler
      • Number of nodes is known at compile time
      • Communication depends only on:
        • Node rank
        • Number of nodes in the system
        • Other constants
      • Explicit communication
        • Implementation assumes MPI

  14. First Step: Build Communication Graph
      • Find the neighbor at each communication call
        • Symbolic expression analysis
        • Constant propagation and folding, e.g.:
            MPI_irecv(x, x, x, from_process, …)
            from_process = node_rank % sqrt(no_nodes) - 1
      • Instantiate each process
        • Control-dependence analysis
        • Not all communication calls are executed every time
      • Match sends with receives, etc.
      (a sketch of the per-rank instantiation follows below)
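      As an illustration of how such folded expressions drive graph construction, here is a minimal sketch that instantiates the receive neighbor for every rank once no_nodes is fixed at compile time. It is not the authors' compiler; comm_edge() and the boundary guard are assumptions of this sketch.

          #include <math.h>
          #include <stdio.h>

          /* Hypothetical: record an edge "sender -> receiver" in the communication graph. */
          static void comm_edge(int sender, int receiver) {
              printf("edge: %d -> %d\n", sender, receiver);
          }

          /* Evaluate the slide's expression  from_process = node_rank % sqrt(no_nodes) - 1
           * for every rank, adding one graph edge per posted receive. */
          void instantiate_receives(int no_nodes) {
              int p = (int)sqrt((double)no_nodes);
              for (int node_rank = 0; node_rank < no_nodes; node_rank++) {
                  int from_process = node_rank % p - 1;   /* constant-folded neighbor */
                  if (from_process >= 0)                  /* skip boundary ranks in this sketch */
                      comm_edge(from_process, node_rank);
              }
          }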

  15. Example: Communication Graph
      (diagram: the communication among processes 0, 1, and 2 over time)

  16. Second Step: Calculate Vector Clocks
      • Use the communication graph
      • Create vector clocks (we will review!)
      • Iterate through calls
        • Track dependences
        • Keep current clocks with each call
      (the standard update rules are sketched below)
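      For reference, a minimal sketch of the standard vector-clock update rules ([Lamport 78], Fidge/Mattern) that the slides review, written as they would be applied while walking the calls of the communication graph; NPROCS and the event representation are assumptions of this sketch.

          #define NPROCS 3   /* matches the three-process example on the next slide */

          typedef struct { int v[NPROCS]; } vclock_t;

          /* A purely local event on process p (computation, or a candidate checkpoint). */
          void on_local_event(vclock_t *c, int p) {
              c->v[p]++;
          }

          /* A send ticks the local component; the message carries a copy of the clock. */
          vclock_t on_send(vclock_t *c, int p) {
              c->v[p]++;
              return *c;                     /* piggybacked on the outgoing message */
          }

          /* A receive ticks locally, then takes the component-wise maximum. */
          void on_receive(vclock_t *c, int p, const vclock_t *piggybacked) {
              c->v[p]++;
              for (int i = 0; i < NPROCS; i++)
                  if (piggybacked->v[i] > c->v[i])
                      c->v[i] = piggybacked->v[i];
          }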

  17. Example: Calculate Vector Clocks
      (diagram: the three-process timeline, with each event labeled by a scalar event count within its process and by a vector clock [P0, P1, P2], e.g. [1,0,0], [1,1,0], [2,4,3])
      • Scalar counters track events within a process
      • Vector clocks capture inter-process dependences [Lamport 78]

  18. Next Step: Identify All Possible Valid Recovery Lines
      (diagram: the same timeline annotated with many candidate recovery lines)
      • There are so many!
      Final step: choose some, and then place them in the code…
      (a validity check in terms of the vector clocks is sketched below)
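      Using the vector clocks sketched above, the validity check for a candidate recovery line reduces to a pairwise comparison. This is the standard consistency test, shown as a sketch under the same assumptions, not the compiler's actual code.

          /* A candidate recovery line supplies one checkpoint per process, represented by
           * the vector clock each process had at its checkpoint. The line is valid iff no
           * process has already observed an event that occurs after another process's
           * checkpoint, i.e. for all i, j: clock[j].v[i] <= clock[i].v[i]. */
          int recovery_line_is_valid(const vclock_t clock[NPROCS]) {
              for (int i = 0; i < NPROCS; i++)
                  for (int j = 0; j < NPROCS; j++)
                      if (clock[j].v[i] > clock[i].v[i])
                          return 0;   /* a message sent after checkpoint i was received
                                         before checkpoint j: inconsistent state */
              return 1;
          }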

  19. Talk Outline
      • Motivation
      • Our Solution
      • Results
        • Methodology
        • Contention Effects
        • Benchmark Results
      • Future Work
      • Related Work
      • Conclusion

  20. Methodology
      (toolchain diagram: Compiler, Trace Generator, Simulator, File System (FS))
      • Event-driven simulator
        • Models computation events, communication events, and checkpointing events
        • Network and file system modeled optimistically
        • Cluster characteristics are modeled after an actual cluster
      • Compiler implementation
        • Implemented in the Broadway Compiler [Guyer & Lin 2000]
        • Accepts C code, generates C code with checkpoints
      • Trace generator
        • Generates traces from pre-existing benchmarks
        • Uses static analysis and profiling

  21. Synthetic Benchmark
      • Large number of sequential instructions
      • 2 checkpoint locations per process
      • Simulated with 2 checkpointing policies
        • Policy 1: Synchronous
          • Every process checkpoints simultaneously
          • Barrier before, barrier after
        • Policy 2: Staggered
          • Processes checkpoint in groups of four
          • Spread evenly throughout the sequential instructions

  22. Staggering Improves Performance
      (graphs: checkpoint time with 16 GB and with 256 GB checkpointed by the system)

  23. What About a Fixed Problem Size?
      (graphs: average checkpoint time per process, time (s) vs. number of processes, for synchronous and staggered checkpointing at 16 GB of data checkpointed by the system; these numbers are represented in the previous graph)
      • Staggered checkpointing improves performance
      • Staggered checkpointing becomes more helpful as
        • the number of processes increases
        • the amount of data checkpointed increases

  24. What About a Fixed Problem Size?
      (graph: average checkpoint time per process vs. number of processes, for synchronous and staggered checkpointing at 16 GB and 256 GB, showing up to a 23x improvement; these numbers are represented in the previous graph)
      • Staggered checkpointing improves performance
      • Staggered checkpointing becomes more helpful as
        • the number of processes increases
        • the amount of data checkpointed increases

  25. Staggering Allows More Checkpoints
      • Staggered checkpointing allows processes to checkpoint more often
      • Can checkpoint 9.3x more frequently for 4K processes

  26. Benchmark Characteristics
      • IS and BT are versions of the NAS Parallel Benchmarks
      • ek-simple is a CFD benchmark

  27. Unique Valid Recovery Lines
      (chart: number of statically unique valid recovery lines per benchmark)
      • Lots of point-to-point communication means many unique valid recovery lines
      • ek-simple is the most representative of real applications
      • These recovery lines differ only with respect to dependence-creating communication

  28. Future Work
      • Develop a heuristic to identify good recovery lines
        • Determining the optimal recovery line is NP-complete [Li et al 94]
      • Scalable simulation
        • Develop more realistic contention models
      • Relax assumptions in the compiler
        • Dynamically changing communication patterns

  29. Related Work
      • Checkpointing with compilers
        • Compiler-Assisted [Beck et al 1994]
        • Automatic Checkpointing [Choi & Deitz 2002]
        • Application-Level Non-Blocking [Bronevetsky et al 2003]
      • Dynamic fault-tolerant protocols
        • Message logging [Elnozahy et al 2002]

  30. Conclusions
      • Synchronous checkpointing suffers from contention
      • Staggered checkpointing reduces contention
        • Reduces checkpoint latency by up to a factor of 23
        • Allows the application to tolerate more failures without a corresponding increase in overhead
      • A compiler can identify where to stagger checkpoints
      • Unique valid recovery lines are numerous in applications with point-to-point communication

  31. Thank you!

  32. Dynamic Valid Recovery Lines
      (chart: number of dynamically unique valid recovery lines per benchmark)

  33. Fault Model
      (diagram of fault classes: crash, send omission, receive omission, general omission, arbitrary failures with message authentication, arbitrary (Byzantine) failures)

  34. Vector Clock Formula
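      The formula on this slide did not survive extraction; as a reconstruction, the standard vector-clock rules in the Fidge/Mattern formulation (an assumption, since the slide's exact notation is unknown) are:

          \begin{align*}
            V_i[i] &\leftarrow V_i[i] + 1
              && \text{at each event (local, send, or receive) on process } i,\\
            V_i[k] &\leftarrow \max\bigl(V_i[k],\, V_m[k]\bigr) \;\; \forall k
              && \text{when process } i \text{ receives a message carrying clock } V_m.
          \end{align*}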

  35. Message Logging
      • Saves all messages sent to stable storage
      • In the future, storing this data will be untenable
      • Message logging relies on checkpointing so that logs can be cleared

  36. In-flight messages: Why we don't care
      • We reason about them at the application level, so…
        • Messages are assumed received at the actual receive call or at the wait
      • We will know if any messages crossed the recovery line, and we can prepare for recovery by checkpointing that information

  37. C-Breeze
      • In-house compiler
      • Allows us to reason about code at various phases of compilation
      • Allows us to add our own phases

  38. In the future…
      • Systems will be more complex
      • Programs will be more complex
      • Checkpointing will be more complex
      • The programmer should not waste time and talent handling fault-tolerance
      (diagram of the software stack: Algorithm, FORTRAN/C/C++, MPI, Checkpointing)

  39. (figure-only slide; no transcript text)

  40. Solution Goals
      • Transparent checkpointing
        • Use the compiler to place checkpoints
      • Low failure-free execution overhead
        • Stagger checkpoints
        • Minimize checkpoint state
      • Support legacy code

  41. The Intuition
      • Fault-tolerance requires valid recovery lines
      • Many possible valid recovery lines
        • Find them
        • Automatically choose a good one
          • Small state, low contention
      • Flexibility is key

  42. Our Solution
      • Where is the set of valid recovery lines?
        • Determine communication pattern
        • Use vector clocks
      • Which recovery line should we use?
        • Develop heuristics based on cluster architecture and application (not done yet)

  43. Overview: Status
      • Discover communication pattern
      • Create vector clocks
      • Identify possible recovery lines
      • Select recovery line
      • Experimentation
      • Performance model and heuristic

  44. Finding neighbors
      Find the neighbors for each process (taken from NAS benchmark bt):

          p = sqrt(no_nodes)
          cell_coord[0][0] = node % p
          cell_coord[1][0] = node / p
          j = cell_coord[0][0] - 1
          i = cell_coord[1][0] - 1
          from_process = (i - 1 + p) % p + p * j
          MPI_irecv(x, x, x, from_process, …)

      After constant propagation and folding:

          from_process = (node / sqrt(no_nodes) - 1 - 1 + sqrt(no_nodes)) % sqrt(no_nodes)
                         + sqrt(no_nodes) * (node % sqrt(no_nodes) - 1)

  45. Final Step: Recovery Lines
      • Discover possible recovery lines
      • Choose a good one
        • Determining the optimal recovery line is NP-complete [Li 94]
        • Develop a heuristic
          • Rough performance model for staggering
      • Goals
        • Valid recovery line
        • Reduce bandwidth contention
        • Reduce storage space

  46. What about a fixed problem size?
      (graph: average checkpoint speedup per process (staggered over synchronous) vs. number of processes, grouped by the amount of data checkpointed by the system; these numbers are represented in the previous graph)
      • Staggered checkpointing improves performance
      • Staggered checkpointing becomes more helpful as
        • the number of processes increases
        • the amount of data checkpointed increases
