Two Lectures on Grid Computing Keshav Pingali Cornell University
What is grid computing? • Two models: • Distributed collaboration: • Large system is built from software components and instruments residing at different sites on the Internet • High-performance distributed computation • Utility computing: • Metaphor of the electrical power grid • Programs execute wherever there are computational resources • Programs are mobile to take advantage of changing resource availability
Application-level Checkpoint-Restart for High-performance Computing Keshav Pingali Cornell University Joint work with Greg Bronevetsky, Rohit Fernandes, Daniel Marques, Paul Stodghill
Trends in HPC (i) • Old picture of high-performance computing: • Turn-key big-iron platforms • Short-running codes • Modern high-performance computing: • Large clusters of commodity parts • Lower reliability than turn-key systems • Example: MTBF of ASCI is 6 hours • Long-running codes • Protein-folding on Blue Gene/L: one year • Program runtimes greatly exceed mean time between failures ⇒ fault tolerance is critical
Fault tolerance • Fault tolerance comes in different flavors • Mission-critical systems: (eg) air traffic control system • No down-time, fail-over, redundancy • Computational science applications • Restart after failure, minimizing lost work • Fault models • Fail-stop: • Failed process dies silently • No corrupted messages or shared data • Byzantine: • Arbitrary misbehavior is allowed • Our focus: • Computational science applications • Fail-stop faults • One solution: checkpoint/restart (CPR)
Trends in HPC (ii) • Grid computing: utility computing • Programs execute wherever there are computational resources • Programs are mobile to take advantage of changing resource availability • Key mechanism: checkpoint/restart • Identical platforms at different sites on the grid ⇒ platform-dependent checkpoints (cf. Condor) • Different platforms at different sites on the grid ⇒ platform-independent (portable) checkpoints
Checkpoint/restart (CPR) • System-level checkpointing (SLC), (eg) Condor • core-dump style snapshots of computations • mechanisms are very architecture- and OS-dependent • checkpoints are not portable • Application-level checkpointing (ALC) • programs are self-checkpointing and self-restarting • (eg) n-body codes save and restore positions and velocities of particles • amount of state saved can be much smaller than SLC • IBM's BlueGene protein folding: megabytes vs terabytes • Alegra (Sandia): application-level restart file only 5% of core size • Disadvantages of current application-level checkpointing • manual implementation • requires global barriers in programs
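As a concrete illustration of the ALC idea for the n-body bullet above (a hedged sketch, not from the original slides; N, pos, vel, save_particles, and restore_particles are illustrative names), the application saves only its live particle arrays instead of a core-dump-sized image:

#include <stdio.h>

#define N 1024                       /* illustrative problem size */
static double pos[N][3], vel[N][3];  /* the only state this code needs to save */

/* Write just the particle data and the current step, not the whole address space. */
static int save_particles(const char *file, int step) {
    FILE *f = fopen(file, "wb");
    if (!f) return -1;
    fwrite(&step, sizeof step, 1, f);
    fwrite(pos, sizeof pos, 1, f);
    fwrite(vel, sizeof vel, 1, f);
    return fclose(f);
}

/* On restart, read the same data back and resume from the saved step. */
static int restore_particles(const char *file, int *step) {
    FILE *f = fopen(file, "rb");
    if (!f) return -1;
    if (fread(step, sizeof *step, 1, f) != 1 ||
        fread(pos, sizeof pos, 1, f) != 1 ||
        fread(vel, sizeof vel, 1, f) != 1) { fclose(f); return -1; }
    return fclose(f);
}

For a large simulation, this is why the application-level checkpoint can be megabytes where a system-level snapshot would be terabytes.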
Our Approach • Automate application-level checkpointing • Minimize programmer effort • Generalize to arbitrary MPI programs w/o barriers • [System diagram: precompiler transforms the original application into an application with state saving; it runs over a thin coordination layer, the MPI implementation, and a reliable communication layer, with a failure detector]
Outline • Saving single-process state • stack, heap, globals • MPI programs: [PPoPP 2003, ICS 2003, SC 2004] • coordination of single-process states into a global snapshot • non-blocking protocol: no barriers needed in program • OpenMP programs: [EWOMP 2004, ASPLOS 2004] • blocking protocol • barriers and locks • Ongoing work • portable checkpoints • program analysis to reduce amount of saved state
System Architecture: Single-Processor Checkpointing • [Same system diagram: original application → precompiler → application + state saving, over a thin coordination layer, MPI implementation, and reliable communication layer, plus a failure detector]
Precompiler • Where to checkpoint • At calls to potentialCheckpoint() function • Mandatory calls in main process (initiator) • Other calls are optional • Process checks if global checkpoint has been requested, and if so, joins in protocol to save state • Inserted by programmer or automated tool • Currently inserted by programmer • Transformed program can save its state only at calls to potentialCheckpoint()
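A minimal sketch (not from the slides) of how the placement looks in application code; solver, do_timestep, and nsteps are illustrative names, and potentialCheckpoint() is the hook described above:

extern void do_timestep(int step);      /* illustrative application kernel */
extern void potentialCheckpoint(void);  /* call inserted by programmer or precompiler */

void solver(int nsteps) {
    for (int step = 0; step < nsteps; step++) {
        do_timestep(step);          /* normal application work */
        potentialCheckpoint();      /* state can be saved only at calls like this;
                                       returns immediately unless a global
                                       checkpoint has been requested */
    }
}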
Saving Application State • Heap • special malloc that tracks memory that is allocated and freed • Stack • Location stack (LS): tracks which function invocations led to the place where the checkpoint was taken • Variable Description Stack (VDS): records local variables in these function invocations that must be saved • On recovery • LS is used to re-execute the sequence of function invocations and re-create stack frames • VDS is used to restore variables into the stack • Similar to approach used in Dome [Beguelin et al.] • Globals • precompiler inserts statements to save them
Example

main() {
  int a;
  VDS.push(&a, sizeof a);
  if (restart) { load LS; copy LS to LS.old; jump dequeue(LS.old); }
  // …
  LS.push(label1);
label1:
  function();
  LS.pop();
  // …
  VDS.pop();
}

function() {
  int b;
  VDS.push(&b, sizeof b);
  if (restart) { jump dequeue(LS.old); }
  // …
  LS.push(label2);
  potentialCheckpoint();
label2:
  if (restart) { load VDS; restore variables; }
  LS.pop();
  // …
  VDS.pop();
}
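The "special malloc" bullet on the state-saving slide can be pictured with a hedged sketch like the following; ckpt_malloc, ckpt_free, and the block list are illustrative names, not the actual runtime:

#include <stdlib.h>

typedef struct block { void *addr; size_t size; struct block *next; } block;
static block *live_blocks = NULL;   /* list of currently allocated heap blocks */

void *ckpt_malloc(size_t size) {
    void *p = malloc(size);
    block *b = p ? malloc(sizeof *b) : NULL;
    if (b) {                        /* remember this allocation for the checkpointer */
        b->addr = p; b->size = size; b->next = live_blocks;
        live_blocks = b;
    }
    return p;
}

void ckpt_free(void *p) {
    block **cur = &live_blocks;
    while (*cur && (*cur)->addr != p)
        cur = &(*cur)->next;
    if (*cur) {                     /* forget the freed block */
        block *dead = *cur;
        *cur = dead->next;
        free(dead);
    }
    free(p);
}

/* At a checkpoint, the state saver walks live_blocks and writes (addr, size,
   contents) for each block; on restart it reallocates and refills them. */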
Sequential Experiments (vs Condor) • Checkpoint sizes are comparable.
Ongoing work: reducing saved state • Automatic placement of potentialCheckpoint calls • Program analysis to determine live data at checkpoint [with Radu Rugina] • Incremental state-saving • Recomputation vs saving state • eg. protein folding, A*B = C (see the sketch below) • Prior work: • CATCH (Fuchs, Urbana): uses runtime learning rather than static analysis • Beck, Plank and Kingsley (UTK): memory exclusion analysis of static data
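For the A*B = C bullet above, a hedged sketch of the recomputation-vs-saving trade-off: if A and B are already in the checkpoint, the restart code can rebuild C instead of writing it to stable storage (rebuild_C is an illustrative name):

/* Recompute C = A*B on restart rather than checkpointing C itself. */
void rebuild_C(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}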
Outline • Saving single-process state • Stack, heap, globals, locals, … • MPI programs: • coordination of single-process states into a global snapshot • non-blocking protocol • OpenMP programs: • blocking protocol • barriers and locks • Ongoing work
System Architecture: Distributed Checkpointing • [Same system diagram: original application → precompiler → application + state saving, over a thin coordination layer, MPI implementation, and reliable communication layer, plus a failure detector]
Need for Coordination • [Space-time diagram: processes P and Q, each with a checkpoint; messages are classified as past, future, late, and early] • Horizontal lines: events in each process • Recovery line • line connecting checkpoints on each processor • represents global system state on recovery • Problem with communication • messages may cross the recovery line
Late Messages • Record message data at receiver as part of checkpoint • On recovery, re-read recorded message data • [Diagram: a late message, sent before the sender's checkpoint and received after the receiver's checkpoint]
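A hedged sketch of receiver-side recording as a wrapper around MPI_Recv; ckpt_recv, in_recording_mode, log_late_message, and message_is_late (which stands in for the epoch test done with the piggybacked control data described later) are illustrative names, not the actual C3 interfaces:

#include <mpi.h>

/* Hypothetical helpers provided by the coordination layer (illustrative). */
extern int  in_recording_mode(void);
extern int  message_is_late(const MPI_Status *status);
extern void log_late_message(const void *buf, int count, MPI_Datatype type,
                             const MPI_Status *status);

int ckpt_recv(void *buf, int count, MPI_Datatype type, int src, int tag,
              MPI_Comm comm, MPI_Status *status) {
    int err = MPI_Recv(buf, count, type, src, tag, comm, status);
    /* If the sender was still in the previous epoch, this is a late message:
       copy its payload into the checkpoint so it can be re-read on recovery. */
    if (err == MPI_SUCCESS && in_recording_mode() && message_is_late(status))
        log_late_message(buf, count, type, status);
    return err;
}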
Early Messages • Must suppress the resending of the message on recovery • [Diagram: an early message, sent after the sender's checkpoint and received before the receiver's checkpoint]
Early Messages • Must suppress the resending of the message on recovery • What about non-deterministic events before the send? • Must ensure the application generates the same early message on recovery • Record and replay all non-deterministic events between checkpoint and send • In our applications, non-determinism arises from wild-card MPI receives • [Same early-message diagram as the previous slide]
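A hedged sketch of recording that non-determinism: when a wild-card receive completes while recording, the matched source and tag are logged so the same match can be forced on recovery, making execution after the checkpoint, including the early sends, deterministic. recv_any, in_recording_mode, and log_wildcard_match are illustrative names:

#include <mpi.h>

extern int  in_recording_mode(void);            /* illustrative helpers */
extern void log_wildcard_match(int source, int tag);

void recv_any(double *buf, int count, MPI_Comm comm) {
    MPI_Status st;
    MPI_Recv(buf, count, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &st);
    if (in_recording_mode())
        /* Replay this (source, tag) choice on recovery so the process
           regenerates the same early messages. */
        log_wildcard_match(st.MPI_SOURCE, st.MPI_TAG);
}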
Difficulty of Coordination • No communication ⇒ no coordination necessary • [Diagram: P and Q with checkpoints and no messages crossing the recovery line]
Difficulty of Coordination • No communication ⇒ no coordination necessary • BSP-style programs ⇒ checkpoint at barrier • [Diagram: only past and future messages; checkpoints taken at a barrier]
Difficulty of Coordination • No communication ⇒ no coordination necessary • BSP-style programs ⇒ checkpoint at barrier • General MIMD programs • System-level checkpointing (eg. Chandy-Lamport) • Forces checkpoints to avoid early messages • Only consistent cuts • [Diagram: past, future, and late messages, but no early messages]
Difficulty of Coordination • No communication ⇒ no coordination necessary • BSP-style programs ⇒ checkpoint at barrier • General MIMD programs • System-level checkpointing (eg. Chandy-Lamport) • Only late messages: consistent cuts • Application-level checkpointing • Checkpoint locations fixed, may not force • Late and early messages: inconsistent cuts • Requires new protocol • [Diagram: past, future, late, and early messages]
MPI-specific issues • Non-FIFO communication – tags • Non-blocking communication • Collective communication • MPI_Reduce(), MPI_AllGather(), MPI_Bcast()… • Internal MPI library state • Visible: • non-blocking request objects, datatypes, communicators, attributes • Invisible: • buffers, address mappings, etc.
Outline • Saving single-process state • Stack, heap, globals, locals, … • MPI programs: • coordination of single-process states into a global snapshot • non-blocking protocol • OpenMP programs: • blocking protocol • barriers and locks • Ongoing work
The Global View • A program's execution is divided into a series of disjoint epochs • Epochs are separated by recovery lines • A failure in Epoch n means all processes roll back to the prior recovery line • [Timeline diagram: Initiator, Process P, and Process Q advancing through Epoch 0, Epoch 1, Epoch 2, …, Epoch n]
Protocol Outline (I) • Initiator checkpoints, sends a pleaseCheckpoint message to all others • After receiving this message, a process checkpoints at the next available spot • Sends every other process Q the number of messages it sent to Q in the last epoch • [Diagram: the recovery line forms as the Initiator's pleaseCheckpoint messages reach P and Q]
Protocol Outline (II) • After checkpointing, each process keeps a record containing: • data of messages from the last epoch (late messages) • non-deterministic events • In our applications, non-determinism arises mainly from wild-card MPI receives • [Diagram: after pleaseCheckpoint arrives, P checkpoints and starts recording]
Protocol Outline (IIIa) • Globally, ready to stop recording when • all processes have received their late messages • all processes have sent their early messages
Protocol Outline (IIIb) • Locally, when a process has received all its late messages, it sends a readyToStopRecording message to the Initiator • [Diagram: P sends readyToStopRecording to the Initiator]
Protocol Outline (IV) • When the initiator receives readyToStopRecording from everyone, it sends stopRecording to everyone • A process stops recording when it receives • a stopRecording message from the initiator, OR • a message from a process that has itself stopped recording • [Diagram: the Initiator sends stopRecording to P and Q; P, having stopped recording, sends an application message to Q]
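A hedged sketch of the stop-recording rule on a non-initiator process; the two entry points and the "sender has stopped recording" flag are illustrative stand-ins for the real control messages and piggybacked data:

static int recording = 1;

/* Control message from the initiator. */
void on_stop_recording_message(void) {
    recording = 0;
}

/* Every application message carries piggybacked control data; if the sender
   has already stopped recording, this receiver must stop too, before it
   records anything that depends on that message. */
void on_app_message(int sender_stopped_recording) {
    if (recording && sender_stopped_recording)
        recording = 0;
}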
Protocol Discussion • Why can't we just wait to receive the stopRecording message? • Our record would depend on a non-deterministic event, invalidating it • The application message may be different or may not be resent on recovery • [Diagram: Q receives an application message from P while the Initiator's stopRecording is still in flight]
Protocol Details • Piggyback 4 bytes of control data to tell if a message is late, early, etc. • Can be reduced to 2 bits • On recovery • reinitialize MPI library using MPI_Init() • restore individual process states • recreate datatypes and communicators • repost non-blocking receives • ensure that all calls to MPI_Send() and MPI_Recv() are suppressed/fed with data as necessary
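A hedged sketch of the piggybacking idea, packing a small control header in front of each payload; the ctrl_t layout and pb_send are illustrative, not the real C3 wire format:

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

typedef struct { int epoch; } ctrl_t;   /* the "4 bytes of control data" */

int pb_send(const void *buf, int count, MPI_Datatype type, int dest, int tag,
            MPI_Comm comm, int my_epoch) {
    int payload, pos = (int)sizeof(ctrl_t);
    MPI_Pack_size(count, type, comm, &payload);
    char *tmp = malloc(payload + sizeof(ctrl_t));
    ctrl_t hdr = { my_epoch };
    memcpy(tmp, &hdr, sizeof hdr);       /* receiver reads this first to decide
                                            whether the message is late, early,
                                            or ordinary */
    MPI_Pack(buf, count, type, tmp, payload + (int)sizeof(ctrl_t), &pos, comm);
    int err = MPI_Send(tmp, pos, MPI_PACKED, dest, tag, comm);
    free(tmp);
    return err;
}

On the receive side, a matching wrapper would receive the MPI_PACKED buffer, unpack the header, classify the message, and then unpack the payload into the user's buffer.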
MPI Collective Communications • Single communication involving multiple processes • Single-Receiver: multiple senders, one receiver e.g. Gather, Reduce • Single-Sender: one sender, multiple receivers e.g. Bcast, Scatter • AlltoAll: every process in group sends data to every other process e.g. AlltoAll, AllGather, AllReduce, Scan • Barrier: everybody waits for everybody else to reach barrier before going on. (Only collective call with explicit synchronization guarantee)
Possible Solutions • We have a protocol for point-to-point messages. Why not reimplement all collectives as point-to-point messages? • Lots of work and less efficient than native implementation. • Checkpoint collectives directly without breaking them up. • May be complex but requires no reimplementation of MPI internals.
AlltoAll Example • Data flows represent the application-level semantics of how data travels • They do NOT correspond to real messages • Used to reason about the application's view of communications • AlltoAll data flows form a bidirectional clique, since data flows in both directions • Recovery line may be • before the AlltoAll • after the AlltoAll • straddling the AlltoAll • [Diagram: processes P, Q, and R each calling MPI_AlltoAll()]
Straddling AlltoAll: What to do • Straddling the AlltoAll: only a single case to handle • P–Q and P–R: late/early data flows • Record result and replay for P • Suppress P's call to MPI_AlltoAll • Record/replay non-determinism before P's MPI_AlltoAll call • [Diagram: P checkpoints before its MPI_AlltoAll call, Q and R checkpoint after theirs; P records the result while recording]
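A hedged sketch of handling a recovery line that straddles the collective: the process that is recording logs the collective's result, and on recovery its call is suppressed and the logged buffer is replayed. ckpt_alltoall, the recording/replaying flags, and the log functions are illustrative names:

#include <mpi.h>

extern int  replaying(void);                 /* true while recovering */
extern int  recording(void);                 /* true between checkpoint and stopRecording */
extern void log_collective_result(const void *buf, int bytes);
extern void fetch_collective_result(void *buf, int bytes);

int ckpt_alltoall(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype type, MPI_Comm comm) {
    int size, nprocs;
    MPI_Type_size(type, &size);
    MPI_Comm_size(comm, &nprocs);
    int bytes = size * count * nprocs;

    if (replaying()) {                       /* suppress the real call on recovery */
        fetch_collective_result(recvbuf, bytes);
        return MPI_SUCCESS;
    }
    int err = MPI_Alltoall(sendbuf, count, type, recvbuf, count, type, comm);
    if (recording())
        log_collective_result(recvbuf, bytes);  /* saved for replay */
    return err;
}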
Implementation • Two large parallel platforms • Lemieux: Pittsburgh Supercomputing Center • 750 Compaq Alphaserver ES45 nodes • Node: four 1-GHz Alpha processors, 4 GB memory, 38GB local disk • Tru64 Unix operating system. • Quadrics interconnection network. • Velocity 2: Cornell Theory Center • 128 dual-processor Windows cluster • Benchmarks: • NAS suite: CG, LU, SP • SMG2000, HPL
Runtimes on Lemieux • Compare original running time of code with running time using C3 system without taking any checkpoints • Chronic overhead is small (< 10%)
Overheads w/checkpointing on Lemieux • Configuration 1: no checkpoints • Configuration 2: go through motions of taking 1 checkpoint, but nothing written to disk • Configuration 3: write checkpoint data to local disk on each machine • Measurement noise: about 2-3% • Conclusion: relative cost of taking a checkpoint is fairly small
Overhead w/checkpointing on V2 • Overheads on V2 for taking checkpoints are also fairly small.
Outline • Saving single-process state • Stack, heap, globals, locals, … • MPI programs: • coordination of single-process states into a global snapshot • non-blocking protocol • OpenMP programs: • blocking protocol • barriers and locks • Ongoing work
Blocking Protocol • Protocol: • Barrier • Each thread records its own state • Thread 0 records shared state • Barrier • [Diagram: threads A and B both in the state-saving phase between the two barriers]
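A hedged sketch of that blocking sequence as it might appear when called from inside a parallel region; save_private_state and save_shared_state are illustrative names:

#include <omp.h>

extern void save_private_state(int tid);     /* illustrative helpers */
extern void save_shared_state(void);

void blocking_checkpoint(void) {
    #pragma omp barrier                        /* everyone stops computing */
    save_private_state(omp_get_thread_num());  /* each thread saves its own state */
    if (omp_get_thread_num() == 0)
        save_shared_state();                   /* shared heap and globals saved once */
    #pragma omp barrier                        /* no thread resumes until saving is done */
}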
Problems Due to Barriers • Protocol introduces new barriers into the program • May cause errors or deadlock • If a checkpoint crosses a barrier: • Thread A reaches the application barrier, waits for thread B • Thread B reaches the checkpoint, calls the protocol's first barrier • Both threads advance • Thread B recording its checkpoint • Thread A computing, possibly corrupting the checkpoint • [Diagram: threads A and B, with the application barrier and the checkpoint on opposite sides of the recovery line]
Resolving the Barrier Problem • Cannot allow checkpoints to cross barriers • On recovery, some threads would be before the barrier and others after • Illegal situation • Solution: • Whenever a checkpoint would cross a barrier, force a checkpoint before the barrier • [Diagram: the checkpoint that would have crossed the application barrier is moved so both threads checkpoint before it]
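A hedged sketch of how an application barrier could be wrapped so a pending checkpoint is always taken on the same side of it; app_barrier, checkpoint_pending, and the reuse of blocking_checkpoint from the earlier sketch are illustrative, not the actual transformation the precompiler performs:

#include <omp.h>

extern volatile int checkpoint_pending;   /* set when a checkpoint is requested */
extern void blocking_checkpoint(void);    /* the blocking protocol sketched earlier */
static int take_it;                        /* shared decision variable */

void app_barrier(void) {
    #pragma omp single
    take_it = checkpoint_pending;          /* one thread decides for the team; the
                                              implicit barrier after single makes
                                              the decision consistent */
    if (take_it)
        blocking_checkpoint();             /* every thread checkpoints before ... */
    #pragma omp barrier                    /* ... the application's own barrier */
}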
Deadlock due to locks • Suppose the checkpoint crosses a lock dependence • Thread B grabs the lock and will not release it until after its checkpoint • Thread A won't checkpoint before it acquires the lock • Since the checkpoint is a barrier: deadlock • [Diagram: thread B holds the lock across its checkpoint while thread A blocks on the lock before its checkpoint]