Parallel Application Scaling, Performance, and Efficiency

Parallel Application Scaling, Performance, and Efficiency David Skinner NERSC/LBL

Parallel Scaling of MPI Codes A practical talk on using MPI with focus on: • Distribution of work within a parallel program • Placement of computation within a parallel computer • Performance costs of different types of communication • Understanding scaling performance terminology

Topics • Application Scaling • Load Balance • Synchronization • Simple stuff • File I/O

Scale: Practical Importance Time required to compute the NxN matrix product C=A*B • Assuming you can address 64GB from one task, can you wait a month? • How to balance • computational goal • vs. • compute resources? • Choose the right scale!

31 31 31 32 Let’s jump to an example • Sharks and Fish II : N2 parallel force evalulation • e.g. 4 CPUs evaluate force for 125 fish • Domain decomposition: Each CPU is “in charge” of ~31 fish, but keeps a fairly recent copy of all the fishes positions (replicated data) • Is it not possible to uniformly decompose problems in general, especially in many dimensions • This toy problem is simple, has fine granularity and is 2D • Let’s see how it scales

Sharks and Fish II : Program Data: n_fish  global my_fish  local fishi = {x, y, …} Dynamics: F = ma … V = Σ 1/rij dq/dt = m * p dp/dt = -dV/dq MPI_Allgatherv(myfish_buf, len[rank], MPI_FishType…) for (i = 0; i < my_fish; ++i) { for (j = 0; j < n_fish; ++j) { // i!=j ai += g * massj * ( fishi – fishj ) / rij } } Move fish

Sharks and Fish II: How fast? • 100 fish can move 1000 steps in 1 task  5.459s 32 tasks  2.756s • 1000 fish can move 1000 steps in 1 task  511.14s 32 tasks  20.815s • So what’s the “best” way to run? • How many fish do we really have? • How large a computer (time) do we have? • How quickly do we need the answer? x 1.98 speedup x 24.6 speedup

Scaling: Good 1st Step: Do runtimes make sense? Running fish_sim for 100-1000 fish on 1-32 CPUs we see 1 Task … 32 Tasks time ~ fish2 

Scaling: Walltimes Each line is contour describing computations doable in a given time walltime is (all)important but let’s look at some other scaling metrics

Scaling: terminology • Scaling studies involve changing the degree of parallelism. Will we be changing the problem also? • Strong scaling Fixed problem size • Weak scaling Problem size grows with additional compute resources • How do we measure success in parallel scaling? • Speed up = Ts/Tp(n) • Efficiency = Ts/(n*Tp(n)) Multiple definitions exist!

Scaling: Speedups

Scaling: Efficiencies Remarkably smooth! Often algorithm and architecture make efficiency landscape quite complex

Scaling: Analysis Why does efficiency drop? • Serial code sections  Amdahl’s law • Surface to Volume  Communication bound • Algorithm complexity or switching • Communication protocol switching • Scalability of computer and interconnect  Whoa!

Scaling: Analysis • In general, changing problem size and concurrency expose or remove compute resources. Bottlenecks shift. • In general, first bottleneck wins. • Scaling brings additional resources too. • More CPUs (of course) • More cache(s) • More memory BW in some cases

Scaling: Superlinear Speedup # CPUs (OMP)

Strong Scaling: Communication Bound 64 tasks , 52% comm 192 tasks , 66% comm 768 tasks , 79% comm • MPI_Allreduce buffer size is 32 bytes. • Q: What resource is being depleted here? • A: Small message latency • Compute per task is decreasing • Synchronization rate is increasing • Surface:Volume ratio is increasing

Sharks and Atoms: At HPC centers like NERSC fish are rarely modeled as point masses. The associated algorithms and their scalings are none the less of great practical importance for scientific problems. Particle Mesh Ewald MatSci Computation 600s@256 way or 80s@1024 way

Topics • Load Balance • Synchronization • Simple stuff • File I/O Now instead of looking at scaling of specific applications lets look at general issues in parallel application scalability

Load Balance : Application Cartoon Unbalanced: Universal App Balanced: Time saved by load balance Will define synchronization later

Load Balance : performance data Communication Time: 64 tasks show 200s, 960 tasks show 230s MPI ranks sorted by total communication time

960 x 64 x Load Balance: ~code while(1) { do_flops(Ni); MPI_Alltoall(); MPI_Allreduce(); }

Flops Exchange Sync Load Balance: real code MPI Rank  Time 

Load Balance : analysis • The 64 slow tasks (with more compute work) cause 30 seconds more “communication” in 960 tasks • This leads to 28800 CPU*seconds (8 CPU*hours) of unproductive computing • All load imbalance requires is one slow task and a synchronizing collective! • Pair well problem size and concurrency. • Parallel computers allow you to waste time faster!

Load Balance : FFT Q: When is imbalance good? A: When is leads to a faster Algorithm.

Load Balance: Summary • Imbalance is most often a byproduct of data decomposition • Must be addressed before further MPI tuning can happen • Good software exists for graph partitioning / remeshing • For regular grids consider padding or contracting

Topics • Load Balance • Synchronization • Simple stuff • File I/O

Scaling of MPI_Barrier() four orders of magnitude

Synchronization: terminology MPI_Barrier(MPI_COMM_WORLD); T1 = MPI_Wtime(); e.g. MPI_Allreduce(); T2 = MPI_Wtime()-T1; How synchronizing is MPI_Allreduce? • For a code running on N tasks what is the distribution of the T2’s? • The average and width of this distribution tell us how synchronizing e.g. MPI_Allreduce is relative to some given interconnect. (HW & SW)

Synchronization : MPI Functions Completion semantics of MPI functions • Local : leave based on local logic • MPI_Comm_rank, MPI_Get_count • Probably Local : try to leave w/o messaging other tasks • MPI_Isend/Irecv • Partially synchronizing : leave after messaging M<N tasks • MPI_Bcast, MPI_Reduce • Fully synchronizing : leave after every else enters • MPI_Barrier, MPI_Allreduce

seaborg.nersc.gov • It’s hard to discuss synchronization outside of the context a particular parallel computer • MPI timings depend on HW, SW, and environment • How much of MPI is handled by the switch adapter? • How big are messaging buffers? • How many thread locks per function? • How noisy is the machine (today)? • This is hard to model, so take an empirical approach based on an IBM SP which is largely applicable to other clusters…

16 way SMP NHII Node G P F S Main Memory GPFS seaborg.nersc.gov basics IBM SP 380 x Colony Switch CSS0 CSS1 • 6080 dedicated CPUs, 96 shared login CPUs • Hierarchy of caching, speeds not balanced • Bottleneck determined by first depleted resource HPSS

16 way SMP NHII Node G P F S Main Memory GPFS MPI on the IBM SP • 2-4096 way concurrency • MPI-1 and ~MPI-2 • GPFS aware MPI-IO • Thread safety • Ranks on same node • bypass the switch Colony Switch CSS0 CSS1 HPSS

16 way SMP NHII Node Main Memory GPFS MPI: seaborg.nersc.gov css1 css0 • Lower latency  can satisfy more syncs/sec • What is the benefit of two adapters? • This is for a single pair of tasks csss

16 way SMP NHII Node 16 way SMP NHII Node Main Memory Main Memory GPFS GPFS Seaborg : point to point messaging Switch BW and latency are often stated in optimistic terms. The number and size of concurrent messages changes things. A fat tree / crossbar switch helps hide this. Intranode Internode

Inter-Node Bandwidth • Tune message size to optimize throughput • Aggregate messages when possible csss css0

MPI Performance is often Hierarchical message size and task placement are key to performance Intra Inter

MPI: Latency not always 1 or 2 numbers The set of all possibly latencies describes the interconnect geometry from the application perspective

Synchronization: measurement MPI_Barrier(MPI_COMM_WORLD); T1 = MPI_Wtime(); e.g. MPI_Allreduce(); T2 = MPI_Wtime()-T1; How synchronizing is MPI_Allreduce? For a code running on N tasks what is the distribution of the T2’s? One can derive the level of synchronization from MPI algorithms. Instead let’s just measure …

Synchronization: MPI Collectives Beyond load balance there is a distribution on MPI timings intrinsic to the MPI Call 2048 tasks

Synchronization: Architecture …and from the machine itself t is the frequency kernel process scheduling Unix : cron et al.

Intrinsic Synchronization : Alltoall

Intrinsic Synchronization: Alltoall Architecture makes a big difference!

This leads to variability in Execution Time

Synchronization : Summary • As a programmer you can control • Which MPI calls you use (it’s not required to use them all). • Message sizes, Problem size (maybe) • The temporal granularity of synchronization, i.e., where do synchronization occur. • Language writers and system architects control • How hard is it to do the above • The intrinsic amount of noise in the machine

Simple Stuff Parallel programs are easier to mess up than serial ones. Here are some common pitfalls.

What’s wrong here?

MPI_Barrier • Is MPI_Barrier time bad? Probably. Is it avoidable? • ~three cases: • The stray / unknown / debug barrier • The barrier which is masking compute balance • Barriers used for I/O ordering Often very easy to fix

Parallel File I/O : Strategies MPI Disk Some strategies fall down at scale

Parallel Application Scaling, Performance, and Efficiency

Parallel Application Scaling, Performance, and Efficiency

Presentation Transcript

Parallel and High Performance Computing

Parallel Performance

Data Parallel Application Development and Performance with Windows Azure

Predicting Parallel Performance

Parallel Multidimensional Scaling Performance on Multicore Systems

Parallel Performance

Improving Parallel Performance

Application Scaling

Understanding Application Scaling NAS Parallel Benchmarks 2.2 on NOW and SGI Origin 2000

Parallel Application Scaling, Performance, and Efficiency

Performance: Parallel Each

CMAQ Parallel Performance

Parallel Performance Diagnosis

Parallel Scaling of parsparsecircuit3.c

Scaling and Performance

Parallel Performance

Performance And Scaling Experts Atlas Authority

Principles and Practice of Application Performance Measurement and Analysis on Parallel Systems

Efficiency Performance Contracting

Parallel Performance Diagnosis

Performance budgeting and efficiency

Scaling Parallel Applications