
Parallel Programming


Presentation Transcript


  1. Parallel Programming Sathish S. Vadhiyar

  2. Motivations of Parallel Computing
  Parallel machine: a computer system with more than one processor
  Motivations
  • Faster execution time due to non-dependencies between regions of code
  • Presents a level of modularity
  • Resource constraints, e.g. large databases
  • Certain classes of algorithms lend themselves naturally to parallelism
  • Aggregate bandwidth to memory/disk; increase in data throughput
  • Clock rate improvement in the past decade – 40%
  • Memory access time improvement in the past decade – 10%

  3. Parallel Programming and Challenges
  • Recall the advantages and motivations of parallelism
  • But parallel programs incur overheads not seen in sequential programs
  • Communication delay
  • Idling
  • Synchronization

  4. Challenges
  [Figure: execution timelines of two processes P0 and P1, showing computation, communication, synchronization, and idle time]

  5. How do we evaluate a parallel program?
  • Execution time, Tp
  • Speedup, S
  • S(p, n) = T(1, n) / T(p, n)
  • Usually, S(p, n) < p
  • Sometimes S(p, n) > p (superlinear speedup)
  • Efficiency, E
  • E(p, n) = S(p, n) / p
  • Usually, E(p, n) < 1; sometimes greater than 1
  • Scalability – how speedup and efficiency behave as n and p grow; the limits of parallel computing
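
  As a quick illustration (hypothetical timings, not from the slides): if T(1, n) = 100 s and T(8, n) = 16 s, then S(8, n) = 100 / 16 = 6.25 and E(8, n) = 6.25 / 8 ≈ 0.78, i.e. each of the 8 processors is doing useful work about 78% of the time.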

  6. Speedups and efficiency
  [Figure: speedup S and efficiency E plotted against the number of processors p, comparing ideal and practical curves]

  7. Limitations on speedup – Amdahl’s law
  • Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
  • The overall speedup is expressed in terms of the fractions of computation time with and without the enhancement, and the speedup of the enhancement itself.
  • Places a limit on the speedup due to parallelism.
  • With serial fraction fs and parallel fraction fp run on P processors: Speedup = 1 / (fs + fp/P)

  8. Amdahl’s law Illustration S = 1 / (s + (1-s)/p) Courtesy: http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm
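
  Worked example (illustrative numbers): with a serial fraction s = 0.1, S = 1 / (0.1 + 0.9/p), so S = 6.4 for p = 16 and S ≈ 9.2 for p = 100; even with arbitrarily many processors the speedup is bounded by 1/s = 10.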

  9. Amdahl’s law analysis
  • For a fixed serial fraction, the achievable speedup falls further and further behind the processor count as P grows.
  • Thus Amdahl’s law is a bit depressing for parallel programming.
  • In practice, the amount of parallel work has to be large enough to keep a given number of processors busy.

  10. Gustafson’s Law
  • Amdahl’s law – keeps the problem size (total work) fixed
  • Gustafson’s law – keeps the computation time on the parallel processors fixed and scales the problem size (the ratio of parallel to sequential work) to fill that time
  • For a particular number of processors, find the problem size for which the parallel time equals the fixed time
  • For that problem size, find the sequential time and the corresponding speedup
  • The resulting speedup is called scaled speedup, also known as weak speedup (weak scaling)
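
  Worked example (illustrative numbers, using the usual formulation of scaled speedup S = s + p(1 - s), where s is the serial fraction of the fixed execution time on the parallel machine): if s = 0.05 and p = 100, then S = 0.05 + 100 × 0.95 = 95.05, far more optimistic than Amdahl’s fixed-size bound for the same serial fraction.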

  11. Metrics (contd.)
  [Table 5.1: Efficiency as a function of n and p – table not reproduced in the transcript]

  12. Scalability
  • Efficiency decreases with increasing P and increases with increasing N
  • Scalability: how effectively the parallel algorithm can use an increasing number of processors
  • How the amount of computation performed must scale with P to keep E constant
  • This function of computation in terms of P is called the isoefficiency function
  • An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable
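
  A standard example (following Grama et al., with the constants stated only roughly): adding n numbers on p processors takes about Tp = n/p + 2 log p, so E = n / (n + 2 p log p); to hold E constant, n must grow as Θ(p log p), i.e. the isoefficiency function is Θ(p log p), which is reasonably scalable.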

  13. Parallel Program Models • Single Program Multiple Data (SPMD) • Multiple Program Multiple Data (MPMD) Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
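
  To make SPMD concrete, a minimal sketch (assuming MPI; the printed messages and the split of roles are purely illustrative): every process runs the same executable and branches on its rank.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */

        if (rank == 0)
            printf("Rank 0 of %d: could read input and distribute work here\n", size);
        else
            printf("Rank %d: could compute on my portion of the data here\n", rank);

        MPI_Finalize();
        return 0;
    }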

  14. Programming Paradigms • Shared memory model – Threads, OpenMP, CUDA • Message passing model – MPI
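
  For the shared memory model, a minimal OpenMP sketch (the array size and loop body are illustrative assumptions): the threads created by the parallel region share the array and divide the loop iterations among themselves.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];
        double sum = 0.0;

        /* Iterations are split among the threads; 'sum' is combined with a
           reduction to avoid a race on the shared variable. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = i * 0.5;
            sum += a[i];
        }

        printf("sum = %f\n", sum);
        return 0;
    }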

  15. Parallelization

  16. Parallelizing a Program
  Given a sequential program/algorithm, how do we go about producing a parallel version?
  Four steps in program parallelization:
  • Decomposition – identifying parallel tasks with a large extent of possible concurrent activity; splitting the problem into tasks
  • Assignment – grouping the tasks into processes with the best load balancing
  • Orchestration – reducing synchronization and communication costs
  • Mapping – mapping of processes to processors (if possible)
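
  To make the first steps concrete, a small sketch (assuming MPI; summing N values is an illustrative problem, not one from the slides): the sum is decomposed into independent per-element tasks, the tasks are assigned to processes in contiguous blocks, and a single collective combines the partial results.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv)
    {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        /* Assignment: block distribution of the N tasks (iterations) */
        int chunk = (N + p - 1) / p;              /* ceiling(N/p) */
        int lo = rank * chunk;
        int hi = (lo + chunk < N) ? lo + chunk : N;

        /* Decomposition: each iteration is an independent task */
        double local = 0.0, total = 0.0;
        for (int i = lo; i < hi; i++)
            local += (double)i;

        /* Orchestration: one collective combines the partial sums */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("total = %f\n", total);

        MPI_Finalize();
        return 0;
    }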

  17. Steps in Creating a Parallel Program

  18. Orchestration
  • Goals
  • Structuring communication
  • Synchronization
  • Challenges
  • Organizing data structures – packing
  • Small or large messages?
  • How to organize communication and synchronization?

  19. Orchestration
  • Maximizing data locality
  • Minimizing the volume of data exchange – e.g. not communicating intermediate results, as in a dot product
  • Minimizing the frequency of interactions – packing
  • Minimizing contention and hot spots – do not use the same communication pattern with the other processes in all the processes
  • Overlapping computations with interactions – split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2); initiate communication for type 1, and perform type 2 while the communication is in progress (see the sketch below)
  • Replicating data or computations – balancing the extra computation or storage cost against the gain from less communication
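
  A sketch of the overlap idea (assuming MPI non-blocking operations; the ring neighbour, buffer sizes and the two "phases" are illustrative): start the transfer, do the work that does not need the incoming data, then wait before doing the part that does.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000

    int main(int argc, char **argv)
    {
        int rank, p;
        double mine[N], incoming[N];
        MPI_Request rreq, sreq;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int partner = (rank + 1) % p;             /* illustrative ring neighbour */
        for (int i = 0; i < N; i++) mine[i] = rank;

        /* Initiate communication for the data that type-1 work will need */
        MPI_Irecv(incoming, N, MPI_DOUBLE, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &rreq);
        MPI_Isend(mine, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &sreq);

        /* Type-2 work: depends only on local data, overlaps the transfer */
        double local = 0.0;
        for (int i = 0; i < N; i++) local += mine[i];

        /* Wait, then do the type-1 work that needs the communicated data */
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);
        double remote = 0.0;
        for (int i = 0; i < N; i++) remote += incoming[i];

        MPI_Wait(&sreq, MPI_STATUS_IGNORE);
        printf("rank %d: local %.1f, remote %.1f\n", rank, local, remote);
        MPI_Finalize();
        return 0;
    }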

  20. Mapping
  • Which process runs on which particular processor?
  • Can depend on the network topology and the communication pattern of the processes
  • Can also depend on processor speeds in the case of heterogeneous systems

  21. Mapping
  • Static mapping
  • Mapping based on data partitioning
  • Applicable to dense matrix computations
  • Block distribution
  • Block-cyclic distribution
  • Graph-partitioning-based mapping
  • Applicable to sparse matrix computations
  • Mapping based on task partitioning
  [Figure: a 9-element array mapped to 3 processes – block distribution assigns elements as 0 0 0 1 1 1 2 2 2, block-cyclic distribution as 0 1 2 0 1 2 0 1 2]
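
  A small sketch of the two data-partitioning rules for a 1-D array (the function names are illustrative; the formulas are the usual definitions of block and block-cyclic ownership, reproducing the figure above for n = 9, p = 3):

    #include <stdio.h>

    /* Block distribution: contiguous chunks of size ceil(n/p) */
    int block_owner(int i, int n, int p)
    {
        int chunk = (n + p - 1) / p;
        return i / chunk;
    }

    /* Block-cyclic distribution: blocks of size b dealt out round-robin */
    int block_cyclic_owner(int i, int b, int p)
    {
        return (i / b) % p;
    }

    int main(void)
    {
        /* For n = 9, p = 3: block gives 0 0 0 1 1 1 2 2 2,
           block-cyclic with b = 1 gives 0 1 2 0 1 2 0 1 2. */
        for (int i = 0; i < 9; i++)
            printf("%d: block %d, cyclic %d\n", i,
                   block_owner(i, 9, 3), block_cyclic_owner(i, 1, 3));
        return 0;
    }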

  22. Based on Task Partitioning
  • Based on the task dependency graph
  • In general the mapping problem is NP-complete
  [Figure: a binary-tree task dependency graph with 8 leaf tasks (0–7); interior tasks are mapped to processes 0, 2, 4, 6, then 0, 4, then 0 at the root]

  23. Mapping
  • Dynamic mapping
  • A process/global memory can hold a set of tasks
  • Distribute some tasks to all processes
  • Once a process completes its tasks, it asks the coordinator process for more tasks
  • Referred to as self-scheduling or work-stealing (see the sketch below)
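
  A compact self-scheduling sketch (assuming MPI; encoding a task as a single integer index and the tag values are illustrative simplifications): rank 0 acts as the coordinator and hands out one task at a time, and a worker asks for the next task as soon as it finishes the previous one.

    #include <mpi.h>
    #include <stdio.h>

    #define NTASKS   100
    #define TAG_WORK 1
    #define TAG_STOP 2

    static double do_task(int t) { return (double)t * t; }  /* illustrative work */

    int main(int argc, char **argv)
    {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        if (rank == 0) {                      /* coordinator */
            int next = 0, done = 0, dummy;
            MPI_Status st;
            while (done < p - 1) {
                /* A worker announces it is idle by sending a request */
                MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD, &st);
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    next++;
                } else {                      /* no tasks left: tell worker to stop */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                    done++;
                }
            }
        } else {                              /* worker */
            int task, req = 0;
            MPI_Status st;
            for (;;) {
                MPI_Send(&req, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                do_task(task);
            }
        }

        MPI_Finalize();
        return 0;
    }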

  24. High-level Goals

  25. Parallel Architecture

  26. Classification of Architectures – Flynn’s classification
  In terms of parallelism in the instruction and data streams:
  • Single Instruction Single Data (SISD): serial computers
  • Single Instruction Multiple Data (SIMD): vector processors and processor arrays – examples: CM-2, Cray C90, Cray Y-MP, Hitachi 3600
  Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  27. Classification of Architectures – Flynn’s classification • Multiple Instruction Single Data (MISD): Not popular • Multiple Instruction Multiple Data (MIMD) - Most popular - IBM SP and most other supercomputers, clusters, computational Grids etc. Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  28. Classification of Architectures – Based on Memory
  • Shared memory
  • Two types – UMA and NUMA
  • NUMA examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q
  [Figure: UMA and NUMA shared-memory organizations]
  Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  29. Classification 2: Shared Memory vs Message Passing
  • Shared memory machine: the n processors share a physical address space
  • Communication can be done through this shared memory
  • The alternative is sometimes referred to as a message passing machine or a distributed memory machine
  [Figure: a distributed memory machine (processor–memory pairs connected by an interconnect) versus a shared memory machine (processors connected through an interconnect to a common main memory)]

  30. Shared Memory Machines The shared memory could itself be distributed among the processor nodes • Each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time • Terms: NUMA vs UMA architecture • Non-Uniform Memory Access • Uniform Memory Access

  31. Classification of Architectures – Based on Memory
  • Distributed memory
  [Figure: distributed memory organization]
  • Recently, multi-cores
  • Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids
  Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  32. Parallel Architecture: Interconnection Networks
  • An interconnection network is defined by its switches, links and interfaces
  • Switches – provide the mapping between input and output ports, buffering, routing etc.
  • Interfaces – connect the nodes to the network

  33. Parallel Architecture: Interconnections
  • Indirect interconnects: nodes are connected to the interconnection medium, not directly to each other
  • Shared bus, multiple bus, crossbar, MIN (multistage interconnection network)
  • Direct interconnects: nodes are connected directly to each other
  • Topologies: linear, ring, star, mesh, torus, hypercube
  • Routing techniques: how the route taken by a message from source to destination is decided
  • Network topologies
  • Static – point-to-point communication links among processing nodes
  • Dynamic – communication links are formed dynamically by switches

  34. Interconnection Networks
  • Static
  • Bus – SGI Challenge
  • Completely connected
  • Star
  • Linear array, ring (1-D torus)
  • Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
  • k-d mesh: d dimensions with k nodes in each dimension
  • Hypercube – a k-d mesh with k = 2 and d = log p (2 nodes in each of log p dimensions) – e.g. many MIMD machines
  • Trees – our campus network
  • Dynamic – communication links are formed dynamically by switches
  • Crossbar – Cray X series – non-blocking network
  • Multistage – SP2 – blocking network
  • For more details and an evaluation of topologies, refer to the book by Grama et al.

  35. Indirect Interconnects
  [Figure: shared bus, multiple bus, 2x2 crossbar, crossbar switch, and a multistage interconnection network]

  36. Direct Interconnect Topologies
  [Figure: linear array, ring, star, 2-D mesh, torus, and hypercubes (binary n-cubes) for n = 2 and n = 3]

  37. Evaluating Interconnection Topologies
  • Diameter – maximum distance between any two processing nodes
  • Fully connected – 1
  • Star – 2
  • Ring – p/2
  • Hypercube – log P
  • Connectivity – multiplicity of paths between two nodes; the minimum number of arcs to be removed from the network to break it into two disconnected networks
  • Linear array – 1
  • Ring – 2
  • 2-D mesh – 2
  • 2-D mesh with wraparound – 4
  • d-dimensional hypercube – d

  38. Evaluating Interconnection Topologies
  • Bisection width – minimum number of links to be removed from the network to partition it into two equal halves
  • Ring – 2
  • P-node 2-D mesh – √P
  • Tree – 1
  • Star – 1
  • Completely connected – P^2/4
  • Hypercube – P/2

  39. Evaluating Interconnection Topologies
  • Channel width – number of bits that can be communicated simultaneously over a link, i.e. the number of physical wires between two nodes
  • Channel rate – performance of a single physical wire
  • Channel bandwidth – channel rate times channel width
  • Bisection bandwidth – maximum volume of communication between the two halves of the network, i.e. bisection width times channel bandwidth
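
  With illustrative numbers (not from the slides): a 32-bit channel width at a 1 GHz channel rate gives a channel bandwidth of 32 Gbit/s; if the bisection width of the network is 8 links, the bisection bandwidth is 8 × 32 = 256 Gbit/s.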

  40. Shared Memory Architecture: Caches
  [Figure: P1 and P2 both read X (initially 0) into their caches; P1 then writes X = 1; a later read of X on P2 hits in its cache and returns the stale value 0 – "Cache hit: wrong data!"]

  41. Cache Coherence Problem
  • If each processor in a shared memory multiprocessor machine has a data cache
  • Potential data consistency problem: the cache coherence problem
  • Modification of a shared variable in a private cache
  • Objective: processes should not read 'stale' data
  • Solutions
  • Hardware: cache coherence mechanisms

  42. Cache Coherence Protocols
  • Write update – on every write, propagate the updated cache line to the other processors holding a copy
  • Write invalidate – on a write, invalidate all other cached copies; a processor fetches the updated cache line the next time it reads the data
  • Which is better?

  43. Invalidation Based Cache Coherence
  [Figure: P1 and P2 both read X (initially 0); when P1 writes X = 1, P2's cached copy is invalidated, so P2's next read of X fetches the new value 1]

  44. Cache Coherence using Invalidate Protocols
  • Three states associated with data items (cache lines)
  • Shared – the item is cached by two or more caches and is consistent with memory
  • Invalid – another processor (say P0) has updated the data item, so this copy must not be used
  • Dirty – the state of the data item in P0's cache: modified, with memory holding a stale copy
  • Implementations
  • Snoopy
  • For bus-based architectures
  • A shared bus interconnect where all cache controllers monitor all bus activity
  • There is only one operation on the bus at a time; cache controllers can be built to take corrective action and enforce coherence in the caches
  • Memory operations are propagated over the bus and snooped
  • Directory-based
  • Instead of broadcasting memory operations to all processors, coherence operations are propagated only to the relevant processors
  • A central directory maintains the state of each cache block and its associated processors
  • Implemented with presence bits
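
  As a rough illustration of the three states, a minimal sketch in C of a simplified (MSI-style) invalidate protocol for a single cache line; the transition function below is an assumption for illustration, not the controller described in the slides (a real controller also issues bus transactions such as invalidations and write-backs).

    #include <stdio.h>

    typedef enum { INVALID, SHARED, DIRTY } line_state;

    /* Events as seen by one cache's controller for a given line */
    typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

    /* Next state of the line after an event (simplified) */
    static line_state next_state(line_state s, event_t e)
    {
        switch (e) {
        case LOCAL_READ:   return (s == INVALID) ? SHARED : s;
        case LOCAL_WRITE:  return DIRTY;                      /* other copies get invalidated */
        case REMOTE_READ:  return (s == DIRTY) ? SHARED : s;  /* supply data, downgrade */
        case REMOTE_WRITE: return INVALID;                    /* another cache now owns it */
        }
        return s;
    }

    int main(void)
    {
        /* The scenario from the figure: P2 reads X, P1 writes X, P2 reads X again */
        line_state p2 = INVALID;
        p2 = next_state(p2, LOCAL_READ);    /* P2 caches X: SHARED */
        p2 = next_state(p2, REMOTE_WRITE);  /* P1 writes X: P2's copy becomes INVALID */
        p2 = next_state(p2, LOCAL_READ);    /* P2 re-reads X, fetching the new value */
        printf("P2 line state: %d (1 = SHARED)\n", p2);
        return 0;
    }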

  45. END
