
Analytical Modeling of Parallel Programs


Presentation Transcript


  1. Analytical Modeling of Parallel Programs • Sources of Overhead in Parallel Programs • Performance Metrics for Parallel Systems • Formulating Maximum Speedup: Amdahl’s Law • Scalability of Parallel Systems • Review of Amdahl’s Law: Gustafson-Barsis’ Law

  2. Analytical Modeling - Basics • A sequential algorithm is evaluated by its runtime (in general, asymptotic runtime as a function of input size). • The asymptotic runtime of a sequential program is identical on any serial platform. • On the other hand, the parallel runtime of a program depends on • the input size, • the number of processors, and • the communication parameters of the machine. • An algorithm must therefore be analyzed in the context of the underlying platform. • A parallel system is a combination of a parallel algorithm and an underlying platform.

  3. Sources of Overhead in Parallel Programs • If I use n processors to run my program, will it run n times faster? • Overheads! • Interprocessor Communication & Interactions • Usually the most significant source of overhead • Idling • Load imbalance, Synchronization, Serial components • Excess Computation • Sub-optimal serial algorithm • More aggregate computations • The goal is to minimize these overheads!

  4. Performance Metrics for Parallel Programs • Why analyze the performance of parallel programs? • Determine the best algorithm • Examine the benefit of parallelism • A number of metrics have been used based on the desired outcome of performance analysis: • Execution time • Total parallel overhead • Speedup • Efficiency • Cost

  5. Performance Metrics for Parallel Programs • Parallel Execution Time • Time spent to solve a problem on p processors. • Tp • Total Overhead Function • To = pTp – Ts • Speedup • S = Ts/Tp • Can we have superlinear speedup? • exploratory computations, hardware features • Efficiency • E = S/p • Cost • pTp (processor-time product)
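
The short Python sketch below (not from the slides; the function name and timing values are illustrative assumptions) simply evaluates the metrics defined above for a hypothetical run.

```python
# Minimal sketch of the metrics on slide 5; all timing values are made up.
def parallel_metrics(t_s, t_p, p):
    """t_s: serial time T_S, t_p: parallel time T_P, p: processing elements."""
    overhead = p * t_p - t_s        # total overhead function To = p*T_P - T_S
    speedup = t_s / t_p             # speedup S = T_S / T_P
    efficiency = speedup / p        # efficiency E = S / p
    cost = p * t_p                  # cost (processor-time product)
    return overhead, speedup, efficiency, cost

# Example: T_S = 100 s on one processor, T_P = 15 s on 8 processors.
print(parallel_metrics(100.0, 15.0, 8))   # approximately (20.0, 6.67, 0.83, 120.0)
```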

  6. Performance Metrics: Working Example

  7. Performance Metrics: Example on Speedup • What is the benefit from parallelism? • Consider the problem of adding n numbers by using n processing elements. • If n is a power of two, we can perform this operation in log n steps by propagating partial sums up a logical binary tree of processors. • If an addition takes constant time, say tc, and communicating a single word takes time ts + tw, the parallel time is TP = Θ(log n) • We know that TS = Θ(n) • Speedup S is given by S = Θ(n / log n)
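
As a small illustration (my own, not part of the slides), the sketch below plugs assumed values for tc, ts and tw into this cost model and prints the resulting Θ(n / log n) speedup for a few values of n.

```python
import math

# Cost model from slide 7: adding n numbers on n processing elements via a
# binary-tree reduction. The constants t_c, t_s, t_w are illustrative only.
t_c, t_s, t_w = 1.0, 5.0, 2.0        # add time, message startup, per-word transfer

def serial_time(n):
    return t_c * (n - 1)             # T_S = Theta(n)

def parallel_time(n):
    steps = int(math.log2(n))        # log n levels of the reduction tree
    return steps * (t_c + t_s + t_w) # one add and one word received per level

for n in (16, 256, 4096, 65536):
    print(n, serial_time(n) / parallel_time(n))   # S = Theta(n / log n)
```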

  8. Performance Metrics: Speedup Bounds • For computing speedup, the best sequential program is taken as the baseline. • There may be different sequential algorithms with different asymptotic runtimes for a given problem • Speedup can be as low as 0 (the parallel program never terminates). • Speedup, in theory, should be upper bounded by p. • In practice, a speedup greater than p is possible. • This is known as superlinear speedup • Superlinear speedup can result • when a serial algorithm does more computations than its parallel formulation • due to hardware features that put the serial implementation at a disadvantage • Note that superlinear speedup happens only if each processing element spends less than time TS/p solving the problem.
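
A contrived sketch of the first cause (my own example, not from the slides): an exploratory search in which splitting the input lets one of two workers hit the target almost immediately, so far less total work is done than in the serial scan.

```python
# Contrived illustration of superlinear speedup via exploratory decomposition.
data = list(range(1_000_000))
target = 500_001                     # sits just past the midpoint

def steps_to_find(items, target):
    for i, x in enumerate(items, start=1):
        if x == target:
            return i                 # comparisons performed before success
    return len(items)

serial_steps = steps_to_find(data, target)               # ~500,000 comparisons
half = len(data) // 2
# Two "processing elements" each search one half; the run stops when either finds it.
parallel_steps = min(steps_to_find(data[:half], target),
                     steps_to_find(data[half:], target))  # 2 comparisons
print("speedup on 2 PEs:", serial_steps / parallel_steps)
```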

  9. Performance Metrics: Superlinear Speedups • Superlinearity effect due to exploratory decomposition

  10. Cost of a Parallel System • As shown earlier, cost is the product of parallel runtime and the number of processing elements used (pTP). • Cost reflects the sum of the time that each processing element spends solving the problem. • A parallel system is said to be cost-optimal if the cost of solving a problem on a parallel computer is asymptotically identical to the serial cost. • Since E = TS / (pTP), for cost-optimal systems, E = Θ(1). • Cost is sometimes referred to as work or processor-time product. • The problem of adding n numbers on n processors is not cost-optimal.
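
To make the last bullet concrete, the short sketch below (my own, using only the asymptotic quantities from the slides and ignoring constant factors) shows that the cost of the n-numbers example grows as n log n while the serial cost grows only as n, so efficiency tends to zero.

```python
import math

# Cost-optimality check for adding n numbers on n processing elements.
for n in (2**8, 2**12, 2**16, 2**20):
    T_s = n                          # serial cost: Theta(n)
    T_p = math.log2(n)               # parallel time: Theta(log n)
    cost = n * T_p                   # p * T_P = Theta(n log n)
    print(f"n = {n:8d}  cost / T_s = {cost / T_s:5.1f}  E = {T_s / cost:.3f}")
# cost / T_s keeps growing with n, so the system is not cost-optimal.
```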

  11. Formulating Maximum Speedup • Assume an algorithm has some sequential parts that are only executed on one processor. • Assume the fraction of the computation that cannot be divided into concurrent tasks is f. • Assume no overhead is incurred when the computation is divided into concurrent parts. • The time to perform the computation with p processors is: Tp = fTs + (1 – f)Ts / p • Hence, the speedup factor is (Amdahl’s Law): S(p) = Ts / (fTs + (1 – f)Ts / p) = p / (1 + (p – 1)f)
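
A minimal Python sketch of this formulation (the function names are mine); it checks that the closed-form speedup matches S = Ts / Tp computed directly from the parallel-time expression above.

```python
# Amdahl's law as formulated on slide 11: f = serial fraction, no overheads.
def parallel_time(T_s, f, p):
    return f * T_s + (1 - f) * T_s / p     # T_p = f*T_s + (1 - f)*T_s / p

def amdahl_speedup(f, p):
    return p / (1 + (p - 1) * f)           # S(p) = p / (1 + (p - 1) f)

T_s, f, p = 100.0, 0.1, 8
print(T_s / parallel_time(T_s, f, p))      # ~4.71, directly from S = T_s / T_p
print(amdahl_speedup(f, p))                # ~4.71, from the closed form
```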

  12. Visualizing Amdahl’s law

  13. Speedup Against Number of Processors

  14. Speedup Against Number of Processors • From the preceding formulation, f has to be a small fraction of the overall computation if a significant increase in speedup is to occur • Even with an infinite number of processors, the maximum speedup is limited to 1/f: as p → ∞, S(p) → 1/f • Example: With only 5% of the computation being serial, the maximum speedup is 20, irrespective of the number of processors. • Amdahl used this argument to promote single-processor machines
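
The loop below (an illustrative sketch, not from the slides) reproduces the 5% example: the speedup creeps toward, but never exceeds, 1/f = 20.

```python
def amdahl_speedup(f, p):
    return p / (1 + (p - 1) * f)

f = 0.05                                  # 5% of the computation is serial
for p in (10, 100, 1_000, 10_000, 100_000):
    print(f"p = {p:7d}  S = {amdahl_speedup(f, p):6.2f}")
print("upper bound 1/f =", 1 / f)         # 20.0
```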

  15. Scalability • Speedup and efficiency are relative terms. They depend on • Number of processors • Problem size • The algorithm used • For example, the efficiency of a parallel program often decreases as the number of processors increases • Similarly, a parallel program may be quite efficient for solving large problems, but not for solving small problems • A parallel program is said to scale if its efficiency remains constant over a broad range of processor counts and problem sizes • Finally, speedup and efficiency depend on the algorithm used. • A parallel program might be efficient relative to one sequential algorithm but not relative to a different sequential algorithm
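
The sketch below illustrates these trends with an assumed cost model Tp = n/p + 2*log2(p) (computation plus a hypothetical combining overhead); this model is my own illustration, not from the slides. Efficiency falls as p grows for a fixed n, but stays roughly constant if n is scaled up with p.

```python
import math

# Assumed (illustrative) cost model: T_p = n/p + 2*log2(p).
def efficiency(n, p):
    T_s = n
    T_p = n / p + 2 * math.log2(p)
    return T_s / (p * T_p)

# Fixed problem size: efficiency drops as p grows.
print([round(efficiency(1024, p), 2) for p in (2, 8, 32, 128)])
# Problem size grown with p (here n = 64 * p * log2(p)): efficiency stays ~0.97.
print([round(efficiency(64 * p * math.log2(p), p), 2) for p in (2, 8, 32, 128)])
```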

  16. Gustafson’s Law • Gustafson presented an argument based upon scalability concepts to show that Amdahl’s Law was not as significant as first supposed in limiting the potential speedup. • Observation: In practice, a larger multiprocessor usually allows a larger problem size to be undertaken in a reasonable execution time. • Hence, the problem size is not independent of the number of processors. • Rather than assume the problem size is fixed, we should assume that the parallel execution time is fixed. • Under this constant parallel execution time constraint, the resulting speedup factor is numerically different from Amdahl’s speedup factor and is called the scaled speedup factor

  17. Speedup vs Number of Processors

  18. Speedup vs Number of Processors

  19. Formulating Gustafson’s Law • Assume the parallel execution time, Tp, is normalized to unity, with f the fraction of that time spent in the serial part: Tp = f + (1 – f) = 1 • The serial part gains nothing from extra processors, so the equivalent serial execution time is Ts = f + (1 – f)p • Then the scaled speedup factor (Gustafson’s Law) is: Ss(p) = Ts / Tp = f + (1 – f)p = p – (p – 1)f
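
A minimal sketch of the scaled speedup (function names are mine). Note that f here is the serial fraction measured on the parallel execution, so it is not numerically the same quantity as Amdahl's f; with a 5% serial part the scaled speedup keeps growing with p instead of saturating at 20.

```python
def gustafson_scaled_speedup(f, p):
    return f + (1 - f) * p            # Ss(p) = f + (1 - f) p = p - (p - 1) f

def amdahl_speedup(f, p):
    return p / (1 + (p - 1) * f)      # fixed-size speedup, for comparison

for p in (10, 100, 1_000):
    print(p, round(gustafson_scaled_speedup(0.05, p), 2),
             round(amdahl_speedup(0.05, p), 2))
```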
