
Parallel Programming


Presentation Transcript


  1. Parallel Programming Sathish S. Vadhiyar

  2. Motivations of Parallel Computing
  Parallel machine: a computer system with more than one processor
  Motivations
  • Faster execution time due to non-dependencies between regions of code
  • Presents a level of modularity
  • Resource constraints, e.g. large databases
  • Certain classes of algorithms lend themselves naturally to parallelism
  • Aggregate bandwidth to memory/disk; increase in data throughput
  • Clock rate improvement in the past decade – 40%
  • Memory access time improvement in the past decade – 10%

  3. Parallel Programming and Challenges
  • Recall the advantages and motivations of parallelism
  • But parallel programs incur overheads not seen in sequential programs
  • Communication delay
  • Idling
  • Synchronization

  4. Challenges
  [Figure: execution timelines of two processes P0 and P1, showing computation, communication, synchronization, and idle time]

  5. How do we evaluate a parallel program?
  • Execution time, Tp
  • Speedup, S
  • S(p, n) = T(1, n) / T(p, n)
  • Usually, S(p, n) < p
  • Sometimes S(p, n) > p (superlinear speedup)
  • Efficiency, E
  • E(p, n) = S(p, n) / p
  • Usually, E(p, n) < 1; sometimes greater than 1
  • Scalability – how speedup and efficiency behave as n and p grow; the limits of parallel computing
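
  As a quick illustration (hypothetical timings, not from the slides): if T(1, n) = 100 s and T(8, n) = 16 s, then S(8, n) = 100 / 16 = 6.25 and E(8, n) = 6.25 / 8 ≈ 0.78, i.e. each of the 8 processors is doing useful work about 78% of the time.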

  6. Speedups and efficiency
  [Figure: speedup S and efficiency E plotted against the number of processors p, comparing ideal and practical curves]

  7. Limitations on speedup – Amdahl’s law
  • Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
  • The overall speedup is expressed in terms of the fractions of computation time with and without the enhancement, and the speedup of the enhancement itself.
  • Places a limit on the speedup due to parallelism.
  • With serial fraction fs and parallel fraction fp run on P processors: Speedup = 1 / (fs + fp/P)

  8. Amdahl’s law Illustration S = 1 / (s + (1-s)/p) Courtesy: http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm
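
  Worked example (illustrative numbers): with a serial fraction s = 0.1, S = 1 / (0.1 + 0.9/p), so S = 6.4 for p = 16 and S ≈ 9.2 for p = 100; even with arbitrarily many processors the speedup is bounded by 1/s = 10.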

  9. Amdahl’s law analysis
  • For a fixed serial fraction, the achievable speedup falls further and further behind the processor count as P grows.
  • Thus Amdahl’s law is a bit depressing for parallel programming.
  • In practice, the amount of parallel work has to be large enough to keep a given number of processors busy.

  10. Gustafson’s Law
  • Amdahl’s law – keeps the problem size (total work) fixed
  • Gustafson’s law – keeps the computation time on the parallel processors fixed and scales the problem size (the ratio of parallel to sequential work) to fill that time
  • For a particular number of processors, find the problem size for which the parallel time equals the fixed time
  • For that problem size, find the sequential time and the corresponding speedup
  • The resulting speedup is called scaled speedup, also known as weak speedup (weak scaling)
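
  Worked example (illustrative numbers, using the usual formulation of scaled speedup S = s + p(1 - s), where s is the serial fraction of the fixed execution time on the parallel machine): if s = 0.05 and p = 100, then S = 0.05 + 100 × 0.95 = 95.05, far more optimistic than Amdahl’s fixed-size bound for the same serial fraction.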

  11. Metrics (contd.)
  [Table 5.1: Efficiency as a function of n and p – table not reproduced in the transcript]

  12. Scalability
  • Efficiency decreases with increasing P and increases with increasing N
  • Scalability: how effectively the parallel algorithm can use an increasing number of processors
  • How the amount of computation performed must scale with P to keep E constant
  • This function of computation in terms of P is called the isoefficiency function
  • An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable
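
  A standard example (following Grama et al., with the constants stated only roughly): adding n numbers on p processors takes about Tp = n/p + 2 log p, so E = n / (n + 2 p log p); to hold E constant, n must grow as Θ(p log p), i.e. the isoefficiency function is Θ(p log p), which is reasonably scalable.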

  13. Parallel Program Models • Single Program Multiple Data (SPMD) • Multiple Program Multiple Data (MPMD) Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
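
  To make SPMD concrete, a minimal sketch (assuming MPI; the printed messages and the split of roles are purely illustrative): every process runs the same executable and branches on its rank.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */

        if (rank == 0)
            printf("Rank 0 of %d: could read input and distribute work here\n", size);
        else
            printf("Rank %d: could compute on my portion of the data here\n", rank);

        MPI_Finalize();
        return 0;
    }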

  14. Programming Paradigms • Shared memory model – Threads, OpenMP, CUDA • Message passing model – MPI
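
  For the shared memory model, a minimal OpenMP sketch (the array size and loop body are illustrative assumptions): the threads created by the parallel region share the array and divide the loop iterations among themselves.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];
        double sum = 0.0;

        /* Iterations are split among the threads; 'sum' is combined with a
           reduction to avoid a race on the shared variable. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = i * 0.5;
            sum += a[i];
        }

        printf("sum = %f\n", sum);
        return 0;
    }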

  15. Parallelization

  16. Parallelizing a Program
  Given a sequential program/algorithm, how do we go about producing a parallel version?
  Four steps in program parallelization:
  • Decomposition – identifying parallel tasks with a large extent of possible concurrent activity; splitting the problem into tasks
  • Assignment – grouping the tasks into processes with the best load balancing
  • Orchestration – reducing synchronization and communication costs
  • Mapping – mapping of processes to processors (if possible)
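
  To make the first steps concrete, a small sketch (assuming MPI; summing N values is an illustrative problem, not one from the slides): the sum is decomposed into independent per-element tasks, the tasks are assigned to processes in contiguous blocks, and a single collective combines the partial results.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv)
    {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        /* Assignment: block distribution of the N tasks (iterations) */
        int chunk = (N + p - 1) / p;              /* ceiling(N/p) */
        int lo = rank * chunk;
        int hi = (lo + chunk < N) ? lo + chunk : N;

        /* Decomposition: each iteration is an independent task */
        double local = 0.0, total = 0.0;
        for (int i = lo; i < hi; i++)
            local += (double)i;

        /* Orchestration: one collective combines the partial sums */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("total = %f\n", total);

        MPI_Finalize();
        return 0;
    }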

  17. Steps in Creating a Parallel Program

  18. Orchestration
  • Goals
  • Structuring communication
  • Synchronization
  • Challenges
  • Organizing data structures – packing
  • Small or large messages?
  • How to organize communication and synchronization?

  19. Orchestration
  • Maximizing data locality
  • Minimizing the volume of data exchange – e.g. not communicating intermediate results, as in a dot product
  • Minimizing the frequency of interactions – packing
  • Minimizing contention and hot spots – do not use the same communication pattern with the other processes in all the processes
  • Overlapping computations with interactions – split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2); initiate communication for type 1, and perform type 2 while the communication is in progress (see the sketch below)
  • Replicating data or computations – balancing the extra computation or storage cost against the gain from less communication
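
  A sketch of the overlap idea (assuming MPI non-blocking operations; the ring neighbour, buffer sizes and the two "phases" are illustrative): start the transfer, do the work that does not need the incoming data, then wait before doing the part that does.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000

    int main(int argc, char **argv)
    {
        int rank, p;
        double mine[N], incoming[N];
        MPI_Request rreq, sreq;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int partner = (rank + 1) % p;             /* illustrative ring neighbour */
        for (int i = 0; i < N; i++) mine[i] = rank;

        /* Initiate communication for the data that type-1 work will need */
        MPI_Irecv(incoming, N, MPI_DOUBLE, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &rreq);
        MPI_Isend(mine, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &sreq);

        /* Type-2 work: depends only on local data, overlaps the transfer */
        double local = 0.0;
        for (int i = 0; i < N; i++) local += mine[i];

        /* Wait, then do the type-1 work that needs the communicated data */
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);
        double remote = 0.0;
        for (int i = 0; i < N; i++) remote += incoming[i];

        MPI_Wait(&sreq, MPI_STATUS_IGNORE);
        printf("rank %d: local %.1f, remote %.1f\n", rank, local, remote);
        MPI_Finalize();
        return 0;
    }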

  20. Mapping
  • Which process runs on which particular processor?
  • Can depend on the network topology and the communication pattern of the processes
  • Can also depend on processor speeds in the case of heterogeneous systems

  21. Mapping
  • Static mapping
  • Mapping based on data partitioning
  • Applicable to dense matrix computations
  • Block distribution
  • Block-cyclic distribution
  • Graph-partitioning-based mapping
  • Applicable to sparse matrix computations
  • Mapping based on task partitioning
  [Figure: a 9-element array mapped to 3 processes – block distribution assigns elements as 0 0 0 1 1 1 2 2 2, block-cyclic distribution as 0 1 2 0 1 2 0 1 2]
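
  A small sketch of the two data-partitioning rules for a 1-D array (the function names are illustrative; the formulas are the usual definitions of block and block-cyclic ownership, reproducing the figure above for n = 9, p = 3):

    #include <stdio.h>

    /* Block distribution: contiguous chunks of size ceil(n/p) */
    int block_owner(int i, int n, int p)
    {
        int chunk = (n + p - 1) / p;
        return i / chunk;
    }

    /* Block-cyclic distribution: blocks of size b dealt out round-robin */
    int block_cyclic_owner(int i, int b, int p)
    {
        return (i / b) % p;
    }

    int main(void)
    {
        /* For n = 9, p = 3: block gives 0 0 0 1 1 1 2 2 2,
           block-cyclic with b = 1 gives 0 1 2 0 1 2 0 1 2. */
        for (int i = 0; i < 9; i++)
            printf("%d: block %d, cyclic %d\n", i,
                   block_owner(i, 9, 3), block_cyclic_owner(i, 1, 3));
        return 0;
    }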

  22. Based on Task Partitioning
  • Based on the task dependency graph
  • In general the mapping problem is NP-complete
  [Figure: a binary-tree task dependency graph with 8 leaf tasks (0–7); interior tasks are mapped to processes 0, 2, 4, 6, then 0, 4, then 0 at the root]

  23. Mapping
  • Dynamic mapping
  • A process/global memory can hold a set of tasks
  • Distribute some tasks to all processes
  • Once a process completes its tasks, it asks the coordinator process for more tasks
  • Referred to as self-scheduling or work-stealing (see the sketch below)
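
  A compact self-scheduling sketch (assuming MPI; encoding a task as a single integer index and the tag values are illustrative simplifications): rank 0 acts as the coordinator and hands out one task at a time, and a worker asks for the next task as soon as it finishes the previous one.

    #include <mpi.h>
    #include <stdio.h>

    #define NTASKS   100
    #define TAG_WORK 1
    #define TAG_STOP 2

    static double do_task(int t) { return (double)t * t; }  /* illustrative work */

    int main(int argc, char **argv)
    {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        if (rank == 0) {                      /* coordinator */
            int next = 0, done = 0, dummy;
            MPI_Status st;
            while (done < p - 1) {
                /* A worker announces it is idle by sending a request */
                MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD, &st);
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    next++;
                } else {                      /* no tasks left: tell worker to stop */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                    done++;
                }
            }
        } else {                              /* worker */
            int task, req = 0;
            MPI_Status st;
            for (;;) {
                MPI_Send(&req, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                do_task(task);
            }
        }

        MPI_Finalize();
        return 0;
    }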

  24. High-level Goals

  25. Parallel Architecture

  26. Classification of Architectures – Flynn’s classification
  In terms of parallelism in the instruction and data streams:
  • Single Instruction Single Data (SISD): serial computers
  • Single Instruction Multiple Data (SIMD): vector processors and processor arrays – examples: CM-2, Cray C90, Cray Y-MP, Hitachi 3600
  Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  27. Classification of Architectures – Flynn’s classification • Multiple Instruction Single Data (MISD): Not popular • Multiple Instruction Multiple Data (MIMD) - Most popular - IBM SP and most other supercomputers, clusters, computational Grids etc. Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  28. Classification of Architectures – Based on Memory
  • Shared memory
  • Two types – UMA and NUMA
  • NUMA examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q
  [Figure: UMA and NUMA shared-memory organizations]
  Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  29. Classification 2: Shared Memory vs Message Passing
  • Shared memory machine: the n processors share a physical address space
  • Communication can be done through this shared memory
  • The alternative is sometimes referred to as a message passing machine or a distributed memory machine
  [Figure: a distributed memory machine (processor–memory pairs connected by an interconnect) versus a shared memory machine (processors connected through an interconnect to a common main memory)]

  30. Shared Memory Machines The shared memory could itself be distributed among the processor nodes • Each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time • Terms: NUMA vs UMA architecture • Non-Uniform Memory Access • Uniform Memory Access

  31. Classification of Architectures – Based on Memory
  • Distributed memory
  [Figure: distributed memory organization]
  • Recently, multi-cores
  • Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids
  Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  32. Parallel Architecture: Interconnection Networks
  • An interconnection network is defined by its switches, links and interfaces
  • Switches – provide the mapping between input and output ports, buffering, routing etc.
  • Interfaces – connect the nodes to the network

  33. Parallel Architecture: Interconnections
  • Indirect interconnects: nodes are connected to the interconnection medium, not directly to each other
  • Shared bus, multiple bus, crossbar, MIN (multistage interconnection network)
  • Direct interconnects: nodes are connected directly to each other
  • Topologies: linear, ring, star, mesh, torus, hypercube
  • Routing techniques: how the route taken by a message from source to destination is decided
  • Network topologies
  • Static – point-to-point communication links among processing nodes
  • Dynamic – communication links are formed dynamically by switches

  34. Interconnection Networks
  • Static
  • Bus – SGI Challenge
  • Completely connected
  • Star
  • Linear array, ring (1-D torus)
  • Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
  • k-d mesh: d dimensions with k nodes in each dimension
  • Hypercube – a k-d mesh with k = 2 and d = log p (2 nodes in each of log p dimensions) – e.g. many MIMD machines
  • Trees – our campus network
  • Dynamic – communication links are formed dynamically by switches
  • Crossbar – Cray X series – non-blocking network
  • Multistage – SP2 – blocking network
  • For more details and an evaluation of topologies, refer to the book by Grama et al.

  35. Indirect Interconnects
  [Figure: shared bus, multiple bus, 2x2 crossbar, crossbar switch, and a multistage interconnection network]

  36. Direct Interconnect Topologies
  [Figure: linear array, ring, star, 2-D mesh, torus, and hypercubes (binary n-cubes) for n = 2 and n = 3]

  37. Evaluating Interconnection Topologies
  • Diameter – maximum distance between any two processing nodes
  • Fully connected – 1
  • Star – 2
  • Ring – p/2
  • Hypercube – log P
  • Connectivity – multiplicity of paths between two nodes; the minimum number of arcs to be removed from the network to break it into two disconnected networks
  • Linear array – 1
  • Ring – 2
  • 2-D mesh – 2
  • 2-D mesh with wraparound – 4
  • d-dimensional hypercube – d

  38. Evaluating Interconnection Topologies
  • Bisection width – minimum number of links to be removed from the network to partition it into two equal halves
  • Ring – 2
  • P-node 2-D mesh – √P
  • Tree – 1
  • Star – 1
  • Completely connected – P^2/4
  • Hypercube – P/2

  39. Evaluating Interconnection Topologies
  • Channel width – number of bits that can be communicated simultaneously over a link, i.e. the number of physical wires between two nodes
  • Channel rate – performance of a single physical wire
  • Channel bandwidth – channel rate times channel width
  • Bisection bandwidth – maximum volume of communication between the two halves of the network, i.e. bisection width times channel bandwidth
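
  With illustrative numbers (not from the slides): a 32-bit channel width at a 1 GHz channel rate gives a channel bandwidth of 32 Gbit/s; if the bisection width of the network is 8 links, the bisection bandwidth is 8 × 32 = 256 Gbit/s.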

  40. Shared Memory Architecture: Caches
  [Figure: P1 and P2 both read X (initially 0) into their caches; P1 then writes X = 1; a later read of X on P2 hits in its cache and returns the stale value 0 – "Cache hit: wrong data!"]

  41. Cache Coherence Problem
  • If each processor in a shared memory multiprocessor machine has a data cache
  • Potential data consistency problem: the cache coherence problem
  • Modification of a shared variable in a private cache
  • Objective: processes should not read 'stale' data
  • Solutions
  • Hardware: cache coherence mechanisms

  42. Cache Coherence Protocols
  • Write update – on every write, propagate the updated cache line to the other processors holding a copy
  • Write invalidate – on a write, invalidate all other cached copies; a processor fetches the updated cache line the next time it reads the data
  • Which is better?

  43. Invalidation Based Cache Coherence
  [Figure: P1 and P2 both read X (initially 0); when P1 writes X = 1, P2's cached copy is invalidated, so P2's next read of X fetches the new value 1]

  44. Cache Coherence using Invalidate Protocols
  • Three states associated with data items (cache lines)
  • Shared – the item is cached by two or more caches and is consistent with memory
  • Invalid – another processor (say P0) has updated the data item, so this copy must not be used
  • Dirty – the state of the data item in P0's cache: modified, with memory holding a stale copy
  • Implementations
  • Snoopy
  • For bus-based architectures
  • A shared bus interconnect where all cache controllers monitor all bus activity
  • There is only one operation on the bus at a time; cache controllers can be built to take corrective action and enforce coherence in the caches
  • Memory operations are propagated over the bus and snooped
  • Directory-based
  • Instead of broadcasting memory operations to all processors, coherence operations are propagated only to the relevant processors
  • A central directory maintains the state of each cache block and its associated processors
  • Implemented with presence bits
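
  As a rough illustration of the three states, a minimal sketch in C of a simplified (MSI-style) invalidate protocol for a single cache line; the transition function below is an assumption for illustration, not the controller described in the slides (a real controller also issues bus transactions such as invalidations and write-backs).

    #include <stdio.h>

    typedef enum { INVALID, SHARED, DIRTY } line_state;

    /* Events as seen by one cache's controller for a given line */
    typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

    /* Next state of the line after an event (simplified) */
    static line_state next_state(line_state s, event_t e)
    {
        switch (e) {
        case LOCAL_READ:   return (s == INVALID) ? SHARED : s;
        case LOCAL_WRITE:  return DIRTY;                      /* other copies get invalidated */
        case REMOTE_READ:  return (s == DIRTY) ? SHARED : s;  /* supply data, downgrade */
        case REMOTE_WRITE: return INVALID;                    /* another cache now owns it */
        }
        return s;
    }

    int main(void)
    {
        /* The scenario from the figure: P2 reads X, P1 writes X, P2 reads X again */
        line_state p2 = INVALID;
        p2 = next_state(p2, LOCAL_READ);    /* P2 caches X: SHARED */
        p2 = next_state(p2, REMOTE_WRITE);  /* P1 writes X: P2's copy becomes INVALID */
        p2 = next_state(p2, LOCAL_READ);    /* P2 re-reads X, fetching the new value */
        printf("P2 line state: %d (1 = SHARED)\n", p2);
        return 0;
    }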

  45. END
