Programming for Performance CS433 Spring 2001

Programming for Performance CS433Spring 2001 Laxmikant Kale

Causes of performance loss • If each processor is rated at k MFLOPS, and there are p processors, why don’t we see k.p MFLOPS performance? • Several causes, • Each must be understood separately • but they interact with each other in complex ways • Solution to one problem may create another • One problem may mask another, which manifests itself under other conditions (e.g. increased p).

Causes • Sequential: cache performance • Communication overhead • Algorithmic overhead (“extra work”) • Speculative work • Load imbalance • (Long) Critical paths • Bottlenecks

Algorithmic overhead • Parallel algorithms may have a higher operation count • Example: parallel prefix (also called “scan”) • How to parallelize this? B[0] = A[0]; for (I=1; I<N; I++) B[I] = B[I-1]+A[I];

Parallel Prefix: continued • How to this operation in parallel? • Seems inherently sequential • Recursive doubling algorithm • Operation count: log(P) . N • A better algorithm: • Take blocking of data into account • Each processor calculate its sum, then participates in a prallel algorithm to get sum to its left, and then adds to all its elements • N + log(P) +N: doubling of op. Count

Bottleneck • Consider the “primes” program (or the “pi”) • What happens when we run it on 1000 pes? • How to eliminate bottlenecks: • Two structures are useful in most such cases: • Spanning trees: organize processors in a tree • Hypercube-based dimensional exchange

Communication overhead • Components: • per message and per byte • sending, receiving and network • capacity constraints • Grainsize analysis: • How much computation per message • Computation-to-communication ratio

Communication overhead examples • Usually, must reorganize data or work to reduce communication • Combining communication also helps • Examples:

Communication overhead Communication delay: time interval between sending on one processor to receipt on another: time = a + b. N Communication overhead: the time a processor is held up (both sender and receiver are held up): again of the form a+ bN Typical values: a = 10 - 100 microseconds, b: 2-10 ns

Grainsize control • A Simple definition of grainsize: • Amount of computation per message • Problem: short message/ long message • More realistic: • Computation to communication ratio • computation time / (a + bN) for one message

Example: matrix multiplication • How to parallelize this? For (I=0; I<N; I++) For (J=0; j<N; J++) // c[I][j] ==0 For(k=0; k<N; k++) C[I][J] += A[I][K] * B[K][J];

A simple algorithm: • Distribute A by rows, B by columns • So,any processor can request a row of A and get it (in two messages). Same for a col of B, • Distribute the work of computing each element of C using some load balancing scheme • So it works even on machines with varying processor capabilities (e.g. timeshared clusters) • What is the computation-to-communication ratio? • For each object: 2.N ops, 2 messages with N bytes • 2N / (2 a + 2N b) = 2N * 0.01 / (2*10 + 2*0.002N)

A better algorithm: • Store A as a collection row-bunches • each bunch stores g rows • Same of B’s columns • Each object now computes a gxg section of C • Comp to commn ratio: • 2*g*g*N ops • 2 messages, gN bytes each • alpha ratio: 2g*g*N/2, beta ratio: 2g

Alpha vs beta • The per message cost is significantly larger than per byte cost • factor of several thousands • So, several optimizations are possible that trade off : • get larger beta cost in return for smaller alpha • I.e. send fewer messages • Applications of this idea: • Examined in the last two lectures

Programming for performance:steps • Select/design Parallel algorithm • Decide on Decomposition • Select Load balancing strategy • Plan Communication structure • Examine synchronization needs • global synchronizations, critical paths

Design Philosophy: • Parallel Algorithm design: • Ensure good performance (total op count) • Generate sufficient parallelism • Avoid/minimize “extra work” • Decomposition: • Break into many small pieces: • Smallest grain that sufficiently amortizes overhead

Design principles: contd. • Load balancing • Select static, dynamic, or quasi-dynamic strategy • Measurement based vs prediction based load estimation • Principle: let a processor idle but avoid overloading one • (think about this) • Reduce communication overhead • Algorithmic reorganization (change mapping) • Message combining • Use efficient communication libraries

Design principles: Synchronization • Eliminate unnecessary global synchronization • If T(i,j) is the time during i’th phase on j’th PE • With synch: sum ( max {T(i,j)}) • Without: max { sum(T (i,j) } • Critical Paths: • Look for long chains of dependences • Draw timeline pictures with dependences

Diagnosing performance problems • Tools: • Back of the envelope (I.e. simple) analysis • Post-mortem analysis, with performance logs • Visualization of performance data • Automatic analysis • Phase-by-phase analysis (prog. may have many phases) • What to measure • load distribution, (commun.) overhead, idle time • Their averages, max/min, and variances • Profiling: time spent in individual modules/subroutines

Diagnostic technniques • Tell-tale signs: • max load >> average, and # PEs > average is >>1 • max load >> average, and # PEs > average is ~ 1 • Profile shows increase in total time in routine f with increase in PEs: • Communication overhead: obvious Load imbalance Possible bottleneck (if there is dependence) • Algorithmic overhead

Communication Optimization • Example problem from earlier lecture: Molecular Dynamics • Each Processor, assumed to house just one cell, needs to send 26 short messages to “neighboring” processors • Assume Send/Receive each: alpha = 10 us, beta: 2ns • Time spent (notice: 26 sends and 26 receives): • 26*2(10 ) = 520 us • If there are more than one cells on each PE, multiply this number! • Can this be improved? How?

Message combining • If there are multiple cells per processor: • Neighbors of a cell may be on the same neighboring processor. • Neighbors of two different cells on the same processor • Combine messages going to the same processor

Communication Optimization I • Take advantage of the structure of communication, and do communication in stages: • If my coordinates are: (x,y,z): • Send to (x+1, y,z), anything that goes to (x+1, *, *) • Send to (x-1, y,z), anything that goes to (x-1, *, *) • Wait for messages from x neighbors, then • Send to y neighbors a combined message • A total of 6 messages instead of 26 • Apparently longer critical path

Communication Optimization II • Send all migrating atoms to processor 0 • Let processor 0 sort them out and send 1 message to each processor • Works ok if the number of processors is small • Otherwise, bottleneck at 0

Communication Optimization 3 • Generalized problem: • Each to all, individualized messages • Apply all previously learned techniques

Intro to Load Balancing • Example: 500 processors, 50000 units of work • What should the objective of load balancing be?