
Parallel Application Scaling, Performance, and Efficiency

This talk by David Skinner focuses on parallel scaling of MPI codes, discussing topics such as distribution of work, computation placement, performance costs of communication, and understanding scaling performance terminology.


Presentation Transcript


  1. Parallel Application Scaling, Performance, and Efficiency David Skinner NERSC/LBL

  2. Parallel Scaling of MPI Codes A practical talk on using MPI with focus on: • Distribution of work within a parallel program • Placement of computation within a parallel computer • Performance costs of various types of communication • Understanding scaling performance terminology

  3. Topics • Introduction • Load Balance • Synchronization • Simple stuff • File I/O • Performance profiling

  4. Let’s introduce these topics through a familiar example: Sharks and Fish II • Sharks and Fish II : N² force summation in parallel • E.g. 4 CPUs evaluate forces for a global collection of 125 fish • Domain decomposition: Each CPU is “in charge” of ~31 fish (31 + 31 + 31 + 32), but keeps a fairly recent copy of all the fish positions (replicated data) • It is not possible to uniformly decompose problems in general, especially in many dimensions • Luckily this problem has fine granularity and is 2D; let’s see how it scales
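The decomposition arithmetic above can be sketched in a few lines of C. The function names are illustrative (not from the talk), and the sketch uses the common convention that the low-numbered ranks absorb the remainder, so 125 fish over 4 CPUs split as 32 + 31 + 31 + 31:

```c
/* Block decomposition sketch: split n_fish items across n_cpus ranks.
   Ranks with index < remainder get one extra item, so 125 fish on
   4 CPUs come out as 32 + 31 + 31 + 31 (the slide's "~31 per CPU"). */
static int my_count(int n_fish, int n_cpus, int rank) {
    int base = n_fish / n_cpus;   /* every rank gets at least this many */
    int rem  = n_fish % n_cpus;   /* leftovers go one each to low ranks */
    return base + (rank < rem ? 1 : 0);
}

static int my_offset(int n_fish, int n_cpus, int rank) {
    int base = n_fish / n_cpus;
    int rem  = n_fish % n_cpus;
    /* low ranks hold base+1 items, the rest hold base */
    return rank < rem ? rank * (base + 1)
                      : rem * (base + 1) + (rank - rem) * base;
}
```

The per-rank count and offset are exactly what the replicated-data scheme on the next slide feeds to MPI_Allgatherv as its receive counts and displacements.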

  5. Sharks and Fish II : Program Data: n_fish is global, my_fish is local, fish_i = {x, y, …} Dynamics: MPI_Allgatherv(myfish_buf, len[rank], …) for (i = 0; i < my_fish; ++i) { for (j = 0; j < n_fish; ++j) { // i != j a_i += g * mass_j * (fish_i – fish_j) / r_ij } } Move fish

  6. Sharks and Fish II: How fast? Running on a machine ~seaborg.nersc.gov • 100 fish can move 1000 steps in: 1 task  5.459s, 32 tasks  2.756s (1.98x speedup) • 1000 fish can move 1000 steps in: 1 task  511.14s, 32 tasks  20.815s (24.6x speedup) • What’s the “best” way to run? • How many fish do we really have? • How large a computer do we have? • How much “computer time”, i.e. allocation, do we have? • How quickly, in real wall time, do we need the answer?

  7. Scaling: Good 1st Step: Do runtimes make sense? Running fish_sim for 100-1000 fish on 1-32 CPUs we see 1 Task … 32 Tasks

  8. Scaling: Walltimes Walltime is (all-)important, but let’s define some other scaling metrics

  9. Scaling: definitions • Scaling studies involve changing the degree of parallelism. Will we change the problem size as well? • Strong scaling • Fixed problem size • Weak scaling • Problem size grows with additional resources • Speed up = Ts/Tp(n) • Efficiency = Ts/(n*Tp(n)) Be aware there are multiple definitions for these terms
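The two definitions translate directly into code; a minimal sketch:

```c
/* Scaling metrics as defined on the slide:
   speedup    = Ts / Tp(n)
   efficiency = Ts / (n * Tp(n))
   where Ts is the serial time and Tp(n) the time on n tasks. */
static double speedup(double ts, double tp) {
    return ts / tp;
}

static double efficiency(double ts, double tp, int n) {
    return ts / (n * tp);
}
```

With the slide-6 numbers for 1000 fish (511.14s serial, 20.815s on 32 tasks), speedup(511.14, 20.815) ≈ 24.6 and efficiency(511.14, 20.815, 32) ≈ 0.77.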

  10. Scaling: Speedups

  11. Scaling: Efficiencies Remarkably smooth! Often algorithm and architecture make the efficiency landscape quite complex

  12. Scaling: Analysis • Why does efficiency drop? • Serial code sections  Amdahl’s law • Surface to Volume  Communication bound • Algorithm complexity or switching • Communication protocol switching  Whoa!

  13. Scaling: Analysis • In general, changing problem size and concurrency exposes or removes compute resources. Bottlenecks shift. • In general, the first bottleneck wins. • Scaling brings additional resources too. • More CPUs (of course) • More cache(s) • More memory BW in some cases

  14. Scaling: Superlinear Speedup (plot: speedup vs. # CPUs, OpenMP)

  15. Scaling: Communication Bound • 64 tasks, 52% comm • 192 tasks, 66% comm • 768 tasks, 79% comm • MPI_Allreduce buffer size is 32 bytes. • Q: What resource is being depleted here? • A: Small-message latency • Compute per task is decreasing • Synchronization rate is increasing • Surface : Volume ratio is increasing

  16. Topics • Introduction • Load Balance • Synchronization • Simple stuff • File I/O

  17. Load Balance : cartoon (timelines of the same “universal app”, unbalanced vs. balanced; the difference is the time saved by load balance)

  18. Load Balance : performance data Communication time (MPI ranks sorted by total communication time): 64 tasks show 200s, 960 tasks show 230s

  19. Load Balance: ~code (64-task and 960-task runs) while (1) { do_flops(Ni); MPI_Alltoall(); MPI_Allreduce(); }
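Because MPI_Alltoall and MPI_Allreduce make every task wait for the slowest one, the CPU time wasted per iteration can be estimated from the per-rank compute times. A sketch (illustrative, not the talk's code):

```c
/* Idle time induced by a synchronizing collective: every task waits
   for the slowest one, so the wasted CPU-seconds per iteration are
   the sum of (t_max - t_i) over all n tasks' compute times t_i. */
static double imbalance_waste(const double *t, int n) {
    double tmax = t[0];
    for (int i = 1; i < n; ++i)
        if (t[i] > tmax) tmax = t[i];          /* slowest task sets the pace */
    double waste = 0.0;
    for (int i = 0; i < n; ++i)
        waste += tmax - t[i];                  /* everyone else idles */
    return waste;
}
```

Multiply by the iteration count and the waste compounds: this is the arithmetic behind the CPU-hours figure on the analysis slide below.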

  20. Load Balance: real code (plot: time vs. MPI rank, broken into Flops / Exchange / Sync)

  21. Load Balance : analysis • The 64 slow tasks (with more compute work) cause 30 seconds more “communication” in the 960-task run • This leads to 28800 CPU*seconds (8 CPU*hours) of unproductive computing • All imbalance requires is one slow task and a synchronizing collective! • Match problem size and concurrency well. • Parallel computers allow you to waste time faster!

  22. Load Balance : FFT Q: When is imbalance good? A: When it leads to a faster algorithm.

  23. Dynamical Load Balance: Motivation (plot: time vs. MPI rank, broken into Flops / Exchange / Sync)

  24. Load Balance: Summary • Imbalance most often a byproduct of data decomposition • Must be addressed before further MPI tuning can happen • Good software exists for graph partitioning / remeshing • Dynamical load balance may be required for adaptive codes • For regular grids consider padding or contracting

  25. Topics • Introduction • Load Balance • Synchronization • Simple stuff • File I/O • Performance profiling

  26. Scaling of MPI_Barrier() (plot spanning four orders of magnitude)

  27. Synchronization: definition MPI_Barrier(MPI_COMM_WORLD); T1 = MPI_Wtime(); e.g. MPI_Allreduce(); T2 = MPI_Wtime()-T1; How synchronizing is MPI_Allreduce? • For a code running on N tasks, what is the distribution of the T2’s? • The average and width of this distribution tell us how synchronizing, e.g., MPI_Allreduce is • Completion semantics of MPI functions: • Local : leave based on local logic (MPI_Comm_rank) • Partially synchronizing : leave after messaging M < N tasks (MPI_Bcast, MPI_Reduce) • Fully synchronizing : leave after everyone else enters (MPI_Barrier, MPI_Allreduce)
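A sketch of the bookkeeping: in the instrumented code each rank brackets the collective with MPI_Wtime() as shown on the slide, and the T2 samples are gathered to one rank (e.g. with MPI_Gather). The helpers below (illustrative names) then summarize the resulting distribution:

```c
/* Summarize the distribution of per-rank collective timings T2.
   In the real code each rank runs:
       MPI_Barrier(MPI_COMM_WORLD);
       t1 = MPI_Wtime();
       MPI_Allreduce(...);          // collective under test
       t2 = MPI_Wtime() - t1;
   and the n t2 samples land in one array via MPI_Gather.
   The mean says how expensive the call is; the width (max - min)
   says how unevenly tasks leave it, i.e. how synchronizing it is. */
static double dist_mean(const double *t2, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += t2[i];
    return s / n;
}

static double dist_width(const double *t2, int n) {
    double lo = t2[0], hi = t2[0];
    for (int i = 1; i < n; ++i) {
        if (t2[i] < lo) lo = t2[i];
        if (t2[i] > hi) hi = t2[i];
    }
    return hi - lo;
}
```

A local call (wide spread, near-zero floor) and a fully synchronizing one (narrow spread around the slowest entrant) produce visibly different mean/width pairs.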

  28. seaborg.nersc.gov • It’s very hard to discuss synchronization outside the context of a particular parallel computer • So we will examine parallel application scaling on an IBM SP, which is largely applicable to other clusters

  29. seaborg.nersc.gov basics • IBM SP: 380 x 16-way SMP NHII nodes, Colony switch (CSS0/CSS1 adapters), GPFS, HPSS • 6080 dedicated CPUs, 96 shared login CPUs • Hierarchy of caching, speeds not balanced • Bottleneck determined by first depleted resource

  30. MPI on the IBM SP (16-way SMP NHII node, Colony switch CSS0/CSS1, GPFS, HPSS) • 2-4096 way concurrency • MPI-1 and ~MPI-2 • GPFS-aware MPI-IO • Thread safety • Ranks on the same node bypass the switch

  31. Seaborg : point-to-point messaging (intranode vs. internode across the Colony interconnect) Switch bandwidth is often stated in optimistic terms

  32. MPI: seaborg.nersc.gov (css0 / css1 adapters) • Lower latency  can satisfy more syncs/sec • What is the benefit of two adapters? • Can a single csss

  33. Inter-Node Bandwidth (csss vs. css0) • Tune message size to optimize throughput • Aggregate messages when possible
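One way to aggregate (a sketch with assumed names): pack many scattered small records into a single buffer, then replace many latency-bound sends with one send of the packed bytes, e.g. MPI_Send(buf, total, MPI_BYTE, dest, tag, comm):

```c
#include <string.h>

/* Message aggregation sketch: instead of sending n latency-bound
   small messages, copy the n scattered fixed-size records into one
   contiguous buffer and send it once.  Only the packing is shown;
   the single MPI_Send of the packed buffer replaces n small sends. */
static size_t pack_records(char *buf, const char *const *recs,
                           size_t reclen, int n) {
    size_t off = 0;
    for (int i = 0; i < n; ++i) {          /* copy each scattered record */
        memcpy(buf + off, recs[i], reclen);
        off += reclen;
    }
    return off;   /* total bytes for the single aggregated send */
}
```

For small messages the cost is dominated by per-message latency, so n sends cost roughly n latencies while the packed send costs one latency plus the (cheap) memcpy traffic.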

  34. MPI Performance is often Hierarchical: message size and task placement are key (intranode vs. internode)

  35. MPI: Latency is not always 1 or 2 numbers The set of all possible latencies describes the interconnect from the application’s perspective

  36. Synchronization: measurement MPI_Barrier(MPI_COMM_WORLD); T1 = MPI_Wtime(); e.g. MPI_Allreduce(); T2 = MPI_Wtime()-T1; How synchronizing is MPI_Allreduce? For a code running on N tasks what is the distribution of the T2’s? Let’s measure this…

  37. Synchronization: MPI Collectives Beyond load balance there is a distribution in MPI timings intrinsic to the MPI call (2048 tasks)

  38. Synchronization: Architecture …and noise from the machine itself: kernel process scheduling, Unix daemons (cron et al.)

  39. Intrinsic Synchronization : Alltoall

  40. Intrinsic Synchronization: Alltoall Architecture makes a big difference!

  41. This leads to variability in Execution Time

  42. Synchronization : Summary • As a programmer you can control: • Which MPI calls you use (it’s not required to use them all) • Message sizes, problem size (maybe) • The temporal granularity of synchronization • Language writers and system architects control: • How hard it is to do the last two above • The intrinsic amount of noise in the machine

  43. Topics • Introduction • Load Balance • Synchronization • Simple stuff • File I/O • Performance profiling

  44. Simple Stuff Parallel programs are easier to mess up than serial ones. Here are some common pitfalls.

  45. What’s wrong here?

  46. MPI_Barrier • Is MPI_Barrier time bad? Probably. Is it avoidable? • ~three cases: • The stray / unknown / debug barrier • The barrier which is masking compute balance • Barriers used for I/O ordering Often very easy to fix

  47. Topics • Introduction • Load Balance • Synchronization • Simple stuff • File I/O • Performance profiling

  48. Parallel File I/O : Strategies (diagram: MPI tasks  disk) Some strategies fall down at scale

  49. Parallel File I/O: Metadata • A parallel file system is great, but it is also another place to create contention. • Avoid unneeded disk I/O; know your file system • Often avoid file-per-task I/O strategies when running at scale
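One scale-friendly alternative to file-per-task is a single shared file in which each rank writes a disjoint byte range, e.g. via MPI-IO's collective MPI_File_write_at_all. Each rank's offset is the exclusive prefix sum of the lower ranks' sizes (what MPI_Exscan computes in parallel); a serial sketch with assumed names:

```c
/* Shared-file layout sketch: n tasks each own a disjoint byte range
   of one file, avoiding the metadata storm of n separate files.
   bytes[r] is the number of bytes task r will write; its offset is
   the sum of the sizes of all lower-ranked tasks.  With MPI-IO the
   write itself would be
     MPI_File_write_at_all(fh, off, buf, bytes[rank], MPI_BYTE, &st); */
static long long shared_file_offset(const long long *bytes, int rank) {
    long long off = 0;
    for (int r = 0; r < rank; ++r)
        off += bytes[r];          /* exclusive prefix sum over lower ranks */
    return off;
}
```

Because the ranges are disjoint, no locking between writers is needed, and the collective write lets a GPFS-aware MPI-IO layer coalesce the traffic.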

  50. Topics • Introduction • Load Balance • Synchronization • Simple stuff • File I/O • Performance profiling
