Introduction to Parallel Computing, Fall 2009

Introduction to Parallel Computing, Fall 2009 Sinan Kockara Department of Computer Science UCA

Invalidate Protocol Systems • Snoopy cache systems • Directory-based systems • Distributed directory systems • Permit O(p) simultaneous coherence operations, where p is the number of processors • Thus, more scalable than snoopy and directory-based systems

Cost of Communication • Programming model semantics • Network topology • Data handling and routing • Message passing cost • Time to prepare a message for transmission • Routing cost • Time taken by the message to traverse the network to its final destination • Associated software protocols

Principal Parameters for communication latency • Message startup time ($t_{startup}$) • Occurs only once per message • Message per-hop time ($t_{hop}$) • Node latency • Message per-word transfer time ($t_{data}$) • If channel bandwidth is r words per second, then $t_{data} = 1/r$

Message routing schemes • Store-and-forward routing • Each node stores the entire message, then passes it to the next node • Packet routing • Message is broken into smaller parts • Provides better utilization of communication resources • Lower overhead from packet loss or errors • Packets may take different paths • Provides better error correction • Because of all these features, packet routing is the basis of the Internet • Overhead • Each packet must carry routing, error correction, and sequencing information • Cut-through routing • Results from optimizations of the routing, error correction, and sequencing information • Message is partitioned into fixed-size units called flits (flow control digits) • Problem: deadlock may occur
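
The three latency parameters can be combined into a rough cost estimate for each routing scheme. The sketch below uses the cost models commonly given in parallel computing textbooks, which are an assumption here since the slides do not state them: store-and-forward pays the per-word cost on every hop, while cut-through pays it once because flits are pipelined. The function names and the sample values are hypothetical.

```python
def store_and_forward_time(t_startup, t_hop, t_data, words, hops):
    """Assumed model: every hop receives the whole message before forwarding it."""
    return t_startup + hops * (t_hop + words * t_data)


def cut_through_time(t_startup, t_hop, t_data, words, hops):
    """Assumed model: flits are pipelined, so the per-word cost is paid once."""
    return t_startup + hops * t_hop + words * t_data


# Illustrative values in arbitrary time units, not measurements.
params = dict(t_startup=100.0, t_hop=2.0, t_data=0.5, words=1000, hops=4)
print("store-and-forward:", store_and_forward_time(**params))
print("cut-through:      ", cut_through_time(**params))
```

Under these assumed models the gap between the two schemes grows with both the message length and the number of hops.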

Packet Routing

Cut-through routing deadlock example

Granularity for Message passing • Granularity: size of a process, # of processes • Coarse granularity • Each process contains a large number of sequential instructions and takes substantial time to execute • Fine granularity • Each process consists of a few instructions (sometimes one) to execute • Medium granularity • We want to increase granularity (i.e. move toward coarse granularity) to reduce the $t_{startup}$ and interprocess communication costs • This will reduce the amount of parallelism • Thus, a suitable compromise has to be made • Granularity is related to the number of processors being used

Granularity metric • Computation time / Communication time • $t_{comp} / t_{comm}$ • It is very important to maximize the granularity metric while maintaining sufficient parallelism • In general, we would like to design a parallel program in which it is easy to vary the granularity, i.e. a scalable program design
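
The trade-off behind the granularity metric can be seen with a toy calculation. This is a minimal sketch under strong assumptions that are not in the slides: the total work is split evenly among the processes, and each process pays the same fixed communication time. All names and numbers are hypothetical.

```python
def granularity(t_comp, t_comm):
    """Granularity metric: computation time divided by communication time."""
    return t_comp / t_comm


total_work = 1_000_000.0     # assumed total computation time, arbitrary units
t_comm_per_process = 500.0   # assumed fixed communication time per process

for processes in (10, 100, 1000):
    t_comp = total_work / processes  # even split of the work (assumption)
    ratio = granularity(t_comp, t_comm_per_process)
    print(f"{processes:5d} processes: t_comp/t_comm = {ratio}")
```

More processes mean more parallelism but a smaller ratio, which is the compromise the slide refers to.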

Speedup Factor • A measure of relative performance between a multiprocessor system and a single-processor system is the speedup factor • $S(p) = \frac{\text{Execution time using one processor (best sequential algorithm)}}{\text{Execution time using a multiprocessor with } p \text{ processors}} = \frac{t_{single}}{t_p}$ • Use the best sequential algorithm with the single-processor system. The underlying algorithm for the parallel implementation might be (and usually is) different. • Speedup factor can also be cast in terms of computational steps: $S(p) = \frac{\text{Number of computational steps using one processor}}{\text{Number of parallel computational steps with } p \text{ processors}}$ • Can also extend time complexity to parallel computations.

Maximum Achievable Speedup with multiprocessor system • Maximum speedup is usually p with p processors (linear speedup). • E.g. one process mapped onto one processor in a multiprocessor system consisting of n processors. What is S(n)? • Possible to get superlinear speedup (greater than p), but usually for a specific reason such as: • Extra memory in multiprocessor system • Nondeterministic algorithm • $S(p) = \frac{t_{single}}{t_p}$

Superlinear Speedup example - Searching (a) Searching each sub-space sequentially [Figure: timeline of the sequential search. Each of the p sub-space searches takes $t_s/p$; the solution is found a time $\Delta t$ into a sub-space search, after $x$ complete sub-space searches taking $x \times t_s/p$ ($x$ indeterminate)]

(b) Searching each sub-space in parallel [Figure: timeline of the parallel search; the solution is found after time $\Delta t$]

Speed-up then given by $S(p) = \frac{x \times \frac{t_s}{p} + \Delta t}{\Delta t}$

Worst case for sequential search when solution found in last sub-space search. Then parallel version offers greatest benefit, i.e. $S(p) = \frac{\frac{p - 1}{p} \times t_s + \Delta t}{\Delta t} \to \infty$ as $\Delta t$ tends to zero

Least advantage for parallel version when solution found in first sub-space search of the sequential search, i.e. $S(p) = \frac{\Delta t}{\Delta t} = 1$. Actual speed-up depends upon which sub-space holds the solution but could be extremely large.
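
The search example can be checked numerically. The sketch below evaluates the speed-up as (x × t_s/p + Δt) / Δt, where x is the number of sub-spaces the sequential search finishes before reaching the one that holds the solution. The function name and the sample values are hypothetical.

```python
def search_speedup(x, t_s, p, delta_t):
    """Speed-up of the parallel search over the sequential one.

    x        number of sub-spaces searched completely before the solution's
    t_s      time to search the whole space sequentially
    p        number of sub-spaces (one per processor)
    delta_t  time into a sub-space search at which the solution is found
    """
    if not 0 <= x <= p - 1:
        raise ValueError("x must lie between 0 and p - 1")
    sequential_time = x * t_s / p + delta_t
    parallel_time = delta_t
    return sequential_time / parallel_time


# Illustrative values: 4 sub-spaces, 100 time units for the whole space.
print(search_speedup(x=0, t_s=100.0, p=4, delta_t=1.0))  # solution in first sub-space
print(search_speedup(x=3, t_s=100.0, p=4, delta_t=1.0))  # solution in last sub-space
```

With the solution in the last sub-space, the speed-up in this example is far larger than p = 4, which is what makes it superlinear.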

Overheads to Speedup • There are several factors that limit the speedup • Periods when some of the processors are idle or not performing useful work • Extra computations that do not exist in the sequential version, e.g. to recompute constants locally • Communication time for sending messages

What is the maximum speedup for a parallel program? • Amdahl’s law (1967) • Constant problem size scaling • Independent of the number of processors • Gustafson’s law (1988) • Time-constrained scaling • Dependent on the number of processors

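Amdahl’s law • Fraction of the computation that cannot be divided into concurrent tasks is f • [Figure: (a) One processor: a serial section taking $f t_s$ followed by parallelizable sections taking $(1 - f) t_s$, for a total of $t_s$. (b) Multiple processors (p processors): the serial section still takes $f t_s$ and the parallelizable sections take $(1 - f) t_s / p$, for a total of $t_p$]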

Question • According to previous slide what is the maximum speedup? • With Amdahl’s Speedup formulation, what happens if number of processors goes to infinity?

Amdahl’s law: Speedup formulation • Speedup factor is given by: $S(p) = \frac{t_s}{f t_s + (1 - f) t_s / p} = \frac{p}{1 + (p - 1) f}$ • This equation is known as Amdahl’s law
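
The behaviour of Amdahl’s law as the number of processors grows can be explored with a few lines of code. This is a small sketch assuming the usual form S(p) = p / (1 + (p - 1) f), whose limit for large p is 1/f. The function names are hypothetical.

```python
def amdahl_speedup(f, p):
    """Amdahl's law: f is the serial fraction, p the number of processors."""
    return p / (1 + (p - 1) * f)


def amdahl_limit(f):
    """Upper bound on the speedup as p grows without bound (1/f)."""
    return float("inf") if f == 0 else 1 / f


f = 0.05  # illustrative serial fraction
for p in (1, 10, 100, 1000, 1_000_000):
    print(f"p = {p:>9}: S(p) = {amdahl_speedup(f, p):.3f}")
print("limit:", amdahl_limit(f))
```

However many processors are added, the speedup in this example never exceeds 1/f = 20.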

Gustafson’s Law • Rather than assuming that the problem size is fixed, assume that the parallel execution time is fixed • Also assume that increasing the problem size does not increase the serial section of the code • Gustafson’s speedup factor is called the scaled speedup factor

Gustafson’s Scaled Speedup Factor • Let s be the time for executing the serial part of the computation and p the time for executing the parallel part of the computation on a single processor. Suppose we fix the total execution time on a single processor, s+p, as 1, so that s and p are now actual fractions of the total computation and s becomes the same as f in Amdahl’s law (see slide #21 for reference). Then, with n processors, Amdahl’s law becomes: $S(n) = \frac{s + p}{s + p/n} = \frac{1}{s + (1 - s)/n}$

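Gustafson’s Scaled Speedup Factor 2 • For the scaled speedup factor, fix the parallel execution time, s+p, as 1 instead. The execution time on a single computer will then be s+pn, as the n parallel parts must be executed sequentially. Then: $S_s(n) = \frac{s + pn}{s + p} = s + pn = n + (1 - n)s$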

Gustafson’s Speedup Factor Example • Suppose we had a serial section of 5% and 20 processors. • What is the speedup according to Gustafson’s Law? • What is the speedup according to Amdahl’s Law?

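Solution • According to Gustafson: $S_s(n) = n + (1 - n)s = 20 + (1 - 20)(0.05) = 19.05$ • According to Amdahl: $S(n) = \frac{n}{1 + (n - 1)f} = \frac{20}{1 + 19 \times 0.05} \approx 10.26$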
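
The two answers for a 5% serial section and 20 processors can be reproduced directly. The sketch below assumes the usual forms of the two laws, n + (1 - n)s for Gustafson and n / (1 + (n - 1)f) for Amdahl. The function names are hypothetical.

```python
def gustafson_scaled_speedup(s, n):
    """Scaled speedup with serial fraction s of the parallel run, n processors."""
    return n + (1 - n) * s


def amdahl_speedup(f, n):
    """Fixed-size speedup with serial fraction f of the sequential run."""
    return n / (1 + (n - 1) * f)


serial_fraction = 0.05
processors = 20
print(f"Gustafson: {gustafson_scaled_speedup(serial_fraction, processors):.2f}")
print(f"Amdahl:    {amdahl_speedup(serial_fraction, processors):.2f}")
```

The two results differ because the serial fraction refers to different baselines: the scaled parallel run for Gustafson, the fixed-size sequential run for Amdahl.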

Speedup against number of processors [Figure: speedup S(p) plotted against the number of processors p, both axes from 4 to 20, with one curve each for f = 0%, 5%, 10%, and 20%]

Parallel program’s Efficiency • Efficiency is defined as $E = \frac{\text{Execution time using one processor}}{\text{Execution time using multiprocessor} \times \text{Number of processors}} = \frac{t_s}{t_p \times p}$ • Efficiency gives the fraction of the time that processors are being used on the computation • So, what does 100% efficiency mean, and when does it occur?

Parallel Program’s Cost • Cost = (execution time) × (total # of processors) • Cost = $t_p \times p$ • Parallel execution time is given by $t_p = t_s / S(p)$ • From the efficiency equation $E = t_s / (t_p \times p)$, Cost = $t_s / E$
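
Speedup, efficiency, and cost are all computed from the same two timings, so one small example ties them together. This is a minimal sketch with made-up timings; the function names are hypothetical.

```python
def speedup(t_s, t_p):
    """S(p) = t_s / t_p."""
    return t_s / t_p


def efficiency(t_s, t_p, p):
    """E = t_s / (t_p * p), i.e. speedup divided by the number of processors."""
    return t_s / (t_p * p)


def cost(t_p, p):
    """Cost = parallel execution time multiplied by the number of processors."""
    return t_p * p


# Illustrative timings in arbitrary units, not measurements.
t_s, t_p, p = 100.0, 12.5, 10
print("speedup:   ", speedup(t_s, t_p))
print("efficiency:", efficiency(t_s, t_p, p))
print("cost:      ", cost(t_p, p))
print("t_s / E:   ", t_s / efficiency(t_s, t_p, p))
```

The last two printed values agree, matching Cost = t_s / E.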
