## Introduction to Parallel Computing, Fall 2009


Sinan Kockara, Department of Computer Science, UCA

## Invalidate Protocol Systems

- Snoopy cache systems
- Directory-based systems
- Distributed directory systems
  - Permit $O(p)$ simultaneous coherence operations, where $p$ is the number of processors
  - Thus more scalable than snoopy and centralized directory-based systems

## Cost of Communication

The cost of communication depends on:

- Programming model semantics
- Network topology
- Data handling and routing
- Message-passing cost
  - Time to prepare a message for transmission
  - Routing cost: time taken by the message to traverse the network to its final destination
- Associated software protocols

## Principal Parameters for Communication Latency

- Message startup time ($t_{startup}$): incurred only once per message
- Per-hop time ($t_{hop}$): node latency
- Per-word transfer time ($t_{data}$): if the channel bandwidth is $r$ words per second, then $t_{data} = 1/r$

## Message Routing Schemes

- Store-and-forward routing: each node stores the entire message and then passes it to the next node
- Packet routing: the message is broken into smaller parts (packets)
  - Better utilization of communication resources
  - Lower overhead from packet loss or errors
  - Packets may take different paths
  - Better error correction
  - Because of these features, packet routing is the basis of the Internet
  - Overhead: each packet must carry routing, error-correction, and sequencing information
- Cut-through routing: obtained by optimizing away the per-packet routing, error-correction, and sequencing information
  - The message is partitioned into fixed-size units called flits (flow control digits)
  - Problem: deadlock may occur

## Granularity for Message Passing

- Granularity: the size of a process and the number of processes
- Coarse granularity: each process contains a large number of sequential instructions and takes substantial time to execute
- Fine granularity: each process consists of a few instructions (sometimes only one)
- Medium granularity: in between
- We want to increase granularity (i.e., make it coarser) to reduce $t_{startup}$ and interprocess communication costs
- Doing so, however, reduces the amount of parallelism, so a suitable compromise has to be made
- Granularity is related to the number of processors being used

## Granularity Metric

- Computation time / communication time: $t_{comp} / t_{comm}$
- It is very important to maximize this metric while maintaining sufficient parallelism
- In general, we would like to design a parallel program in which it is easy to vary the granularity, i.e., a scalable program design

## Speedup Factor

A measure of the relative performance of a multiprocessor system and a single-processor system:

$$S(p) = \frac{t_s}{t_p}$$

where $t_s$ is the execution time using one processor (best sequential algorithm) and $t_p$ is the execution time using a multiprocessor with $p$ processors.

- Use the best sequential algorithm on the single-processor system; the algorithm underlying the parallel implementation might be (and usually is) different
- The speedup factor can also be cast in terms of computational steps:

$$S(p) = \frac{\text{number of computational steps using one processor}}{\text{number of parallel computational steps with } p \text{ processors}}$$

- Time complexity can likewise be extended to parallel computations

## Maximum Achievable Speedup with a Multiprocessor System

- The maximum speedup is usually $p$ with $p$ processors (linear speedup)
- E.g., one process mapped to each processor of a system with $n$ processors: what is $S(n)$?
- Superlinear speedup (greater than $p$) is possible, but there is usually a specific reason, such as:
  - Extra memory in the multiprocessor system
  - A nondeterministic algorithm

## Superlinear Speedup Example: Searching

The search space is divided into $p$ sub-spaces, and each sub-space takes time $t_s/p$ to search sequentially.

(a) Searching each sub-space sequentially: suppose the solution is found a time $\Delta t$ into the search of sub-space $x + 1$, after the first $x$ sub-spaces have been searched. The sequential time is $x \cdot t_s/p + \Delta t$ (with $x$ indeterminate).

(b) Searching each sub-space in parallel: the solution is found in time $\Delta t$.

The speedup is then given by

$$S(p) = \frac{x \cdot \frac{t_s}{p} + \Delta t}{\Delta t}$$

The worst case for the sequential search is when the solution is found in the last sub-space ($x = p - 1$). The parallel version then offers the greatest benefit:

$$S(p) = \frac{\frac{p-1}{p} \cdot t_s + \Delta t}{\Delta t} \to \infty \quad \text{as } \Delta t \to 0$$

The least advantage for the parallel version is when the solution is found in the first sub-space of the sequential search:

$$S(p) = \frac{\Delta t}{\Delta t} = 1$$

The actual speedup depends on which sub-space holds the solution, but it can be extremely large.

## Overheads to Speedup

Several factors limit the speedup:

- Periods when some of the processors are idle or not performing useful work
- Extra computations that do not exist in the sequential version, e.g., recomputing constants locally
- Communication time for sending messages

## What Is the Maximum Speedup for a Parallel Program?

- Amdahl's law (1967): constant problem-size scaling; the problem size is independent of the number of processors
- Gustafson's law (1988): time-constrained scaling; the problem size grows with the number of processors

## Amdahl's Law

Let $f$ be the fraction of the computation that cannot be divided into concurrent tasks.

(a) On one processor, the serial section takes time $f \cdot t_s$ and the parallelizable sections take $(1 - f) \cdot t_s$.

(b) On $p$ processors, the parallelizable sections take $(1 - f) \cdot t_s / p$, so

$$t_p = f \cdot t_s + \frac{(1 - f) \cdot t_s}{p}$$

## Question

- According to the previous slide, what is the maximum speedup?
- With Amdahl's speedup formulation, what happens as the number of processors goes to infinity?

## Amdahl's Law: Speedup Formulation

The speedup factor is given by

$$S(p) = \frac{t_s}{f \cdot t_s + (1 - f) \cdot t_s / p} = \frac{p}{1 + (p - 1) f}$$

This equation is known as Amdahl's law. As $p \to \infty$, $S(p) \to 1/f$.

## Gustafson's Law

- Rather than assuming that the problem size is fixed, assume that the parallel execution time is fixed
- Also assumes that increasing the problem size does not increase the serial section of the code
- Gustafson's speedup factor is called the scaled speedup factor

## Gustafson's Scaled Speedup Factor

- Let $s$ be the time for executing the serial part of the computation and $p$ the time for executing the parallel part of the computation on a single processor.
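Before working through Gustafson's scaled formulation, Amdahl's bound can be sanity-checked numerically. A minimal sketch (the serial fractions and processor count below are illustrative values, and `amdahl_speedup` is a name chosen here):

```python
# Amdahl's law: S(p) = 1 / (f + (1 - f) / p), where f is the serial fraction.

def amdahl_speedup(f: float, p: int) -> float:
    """Speedup on p processors when a fraction f of the work is serial."""
    return 1.0 / (f + (1.0 - f) / p)

for f in (0.05, 0.10, 0.20):
    bound = 1.0 / f                       # limit of S(p) as p -> infinity
    s_big = amdahl_speedup(f, 1000)       # already close to the bound
    print(f"f = {f:.2f}: S(1000) = {s_big:5.2f}, bound 1/f = {bound:5.1f}")
```

Even with 1000 processors the speedup stays strictly below $1/f$, which is the point of the "number of processors goes to infinity" question above.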
Suppose we fix the total execution time on a single processor, $s + p$, at 1, so that $s$ and $p$ are actual fractions of the total computation and $s$ plays the role of $f$ in Amdahl's law above. With $n$ processors, Amdahl's law then becomes:

$$S(n) = \frac{s + p}{s + p/n} = \frac{1}{s + p/n}$$

## Gustafson's Scaled Speedup Factor 2

- Now let $s$ and $p$ be the serial and parallel times measured on the multiprocessor, with the parallel execution time $s + p = 1$ held fixed. The execution time on a single computer will then be $s + p \cdot n$, as the $n$ parallel parts must be executed sequentially. Then:

$$S_s(n) = \frac{s + p \cdot n}{s + p} = s + p \cdot n = n + (1 - n) \cdot s$$

## Gustafson's Speedup Factor Example

- Suppose we have a serial section of 5% and 20 processors.
- What is the speedup according to Gustafson's law?
- What is the speedup according to Amdahl's law?

## Solution

- According to Gustafson: $S_s(20) = 20 + (1 - 20) \cdot 0.05 = 19.05$
- According to Amdahl: $S(20) = \dfrac{1}{0.05 + 0.95/20} \approx 10.26$

## Speedup Against the Number of Processors

[Figure: speedup $S(p)$ versus number of processors $p$ (4 to 20) for serial fractions $f$ = 0%, 5%, 10%, and 20%; only $f = 0$ gives linear speedup, and the curves flatten rapidly as $f$ grows.]

## Parallel Program's Efficiency

Efficiency is defined as

$$E = \frac{\text{execution time using one processor}}{\text{execution time using a multiprocessor} \times \text{number of processors}} = \frac{t_s}{t_p \times p} = \frac{S(p)}{p}$$

- Efficiency gives the fraction of the time that the processors are being used on the computation
- So, what does 100% efficiency mean, and when does it occur?

## Parallel Program's Cost

- Cost = (execution time) × (total number of processors): $\text{Cost} = t_p \times p$
- The parallel execution time is given by $t_p = t_s / S(p)$
- From the efficiency equation above: $\text{Cost} = \dfrac{t_s \times p}{S(p)} = \dfrac{t_s}{E}$
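The worked example (5% serial section, 20 processors) and the efficiency and cost definitions above fit in one short script. A sketch, assuming the sequential time $t_s$ is normalized to 1 for the cost calculation; the function names are chosen here:

```python
# Worked example from the slides: serial fraction s = 0.05, n = 20 processors.

def amdahl(s: float, n: int) -> float:
    # Fixed problem size: S(n) = 1 / (s + (1 - s) / n)
    return 1.0 / (s + (1.0 - s) / n)

def gustafson(s: float, n: int) -> float:
    # Fixed parallel execution time: S_s(n) = n + (1 - n) * s
    return n + (1 - n) * s

s, n = 0.05, 20
sa = amdahl(s, n)
sg = gustafson(s, n)
efficiency = sa / n          # E = S(p) / p
t_s = 1.0                    # sequential time, normalized (assumption)
cost = t_s / efficiency      # Cost = t_p * p = t_s / E

print(f"Amdahl:    S(20) = {sa:.2f}")    # ~10.26
print(f"Gustafson: S(20) = {sg:.2f}")    # 19.05
print(f"Efficiency under Amdahl: {efficiency:.3f}")
print(f"Cost (with t_s = 1):     {cost:.3f}")
```

The gap between the two speedups (about 10.26 versus 19.05) illustrates why the choice between constant-problem-size and time-constrained scaling matters when reporting parallel performance.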