
Parallel Programming


Presentation Transcript


  1. Parallel Programming Sathish S. Vadhiyar Course Web Page: http://www.serc.iisc.ernet.in/~vss/courses/PPP2009

  2. Motivation for Parallel Programming • Faster execution time, by exploiting the independence between regions of code • Presents a level of modularity • Resource constraints, e.g. large databases • Certain classes of algorithms lend themselves naturally to parallelism • Aggregate bandwidth to memory/disk; increase in data throughput • Clock rate improvement in the past decade – 40% • Memory access time improvement in the past decade – 10% • Grand challenge problems (more later)

  3. Challenges / Problems in Parallel Algorithms • Building efficient algorithms. • Avoiding • Communication delay • Idling • Synchronization

  4. Challenges [Figure: execution timeline of two processors P0 and P1, showing computation, communication, synchronization, and idle time]

  5. How do we evaluate a parallel program? • Execution time, Tp • Speedup, S • S(p, n) = T(1, n) / T(p, n) • Usually, S(p, n) < p • Sometimes S(p, n) > p (superlinear speedup) • Efficiency, E • E(p, n) = S(p, n)/p • Usually, E(p, n) < 1 • Sometimes greater than 1 • Scalability – how speedup and efficiency behave as n and p grow; the limits of parallel computing
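
  A quick worked example with hypothetical timings (not from the course): suppose T(1, n) = 100 s and T(8, n) = 16 s. Then

      S(8, n) = T(1, n) / T(8, n) = 100 / 16 = 6.25
      E(8, n) = S(8, n) / 8 ≈ 0.78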

  6. Speedups and efficiency [Figure: speedup S and efficiency E versus number of processors p – ideal vs. practical curves]

  7. Limitations on speedup – Amdahl’s law • Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. • Overall speedup is expressed in terms of the fractions of computation time with and without the enhancement and the speedup of the enhancement. • Places a limit on the speedup due to parallelism. • Speedup = 1 / (fs + fp/P), where fs and fp are the serial and parallel fractions
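
  A small worked example with an assumed serial fraction (not a number from the slides): take fs = 0.05 and fp = 0.95.

      S(16) = 1 / (0.05 + 0.95/16) ≈ 9.1
      S(∞)  = 1 / fs = 20      (no number of processors can push the speedup past 20)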

  8. Amdahl’s law Illustration S = 1 / (s + (1-s)/p) Courtesy: http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm

  9. Amdahl’s law analysis • For a fixed serial fraction, the achieved speedup falls further and further behind the processor count as the number of processors grows. • Thus Amdahl’s law is a bit depressing for parallel programming. • In practice, the parallel portion of the work has to be large enough to match a given number of processors.

  10. Gustafson’s Law • Amdahl’s law – keep the parallel work fixed • Gustafson’s law – keep the computation time on the parallel processors fixed, and change the problem size (the fraction of parallel/sequential work) to match that time • For a particular number of processors, find the problem size for which the parallel time equals the constant time • For that problem size, find the sequential time and the corresponding speedup • The resulting speedup is called scaled speedup
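
  In formula form (the standard statement of the law; s is the serial fraction of the run time measured on the parallel machine, and the numbers below are only illustrative):

      Scaled speedup S(P) = s + (1 - s) · P
      e.g.  s = 0.05, P = 16:  S(16) = 0.05 + 0.95 × 16 = 15.25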

  11. Metrics (contd.) [Table 5.1: efficiency as a function of n and p]

  12. Scalability • Efficiency decreases with increasing P; increases with increasing N • How effectively the parallel algorithm can use an increasing number of processors • How the amount of computation performed must scale with P to keep E constant • This function of computation in terms of P is called the isoefficiency function. • An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable

  13. Scalability Analysis – Finite Difference algorithm with 1D decomposition For constant efficiency, a function of P, when substituted for N, must satisfy the efficiency relation (shown on the slide) for increasing P and constant E. This can be satisfied with N = P, except for small P. Hence the isoefficiency function is O(P²), since the computation is O(N²).
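
  The slide's cost expressions are in a figure that is not reproduced here; a standard cost model consistent with its conclusion (an N × N grid, P processes each holding N/P rows, with t_c, t_s, t_w the per-point compute, message-startup and per-word transfer costs) is assumed below.

      T(P, N) ≈ N² t_c / P + 2 (t_s + t_w N)
      E = (N² t_c) / (P · T(P, N)) = 1 / (1 + 2 P (t_s + t_w N) / (N² t_c))

  Keeping E constant requires N to grow in proportion to P, so the total computation N² t_c grows as O(P²).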

  14. Scalability Analysis – Finite Difference algorithm with 2D decomposition Can be satisfied with N = √P. Hence the isoefficiency function is O(P). The 2D algorithm is more scalable than the 1D one.
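
  Under the same assumed cost model, each process now holds an (N/√P) × (N/√P) block and exchanges four edges of length N/√P:

      T(P, N) ≈ N² t_c / P + 4 (t_s + t_w N/√P)

  Constant efficiency now needs only N ∝ √P, so the total computation grows as O(P), which is why the 2D decomposition scales better.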

  15. Parallel Algorithm Design

  16. Steps • Decomposition – Splitting the problem into tasks or modules • Mapping – Assigning tasks to processors • Mapping’s contradictory objectives • To minimize idle times • To reduce communications

  17. Mapping • Static mapping • Mapping based on data partitioning • Applicable to dense matrix computations • Block distribution • Block-cyclic distribution • Graph-partitioning-based mapping • Applicable for sparse matrix computations • Mapping based on task partitioning [Figure: nine elements mapped to three processes – block distribution: 0 0 0 1 1 1 2 2 2; block-cyclic distribution: 0 1 2 0 1 2 0 1 2]
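
  A small sketch in C (the function names are mine, not the course's) of how the owner process of element i is computed under the two distributions in the figure – a block distribution and a block-cyclic distribution with block size 1 (i.e. cyclic):

    #include <stdio.h>

    /* Owner of element i under a block distribution: contiguous chunks of
     * ceil(n/p) elements per process (one simple variant; boundary handling
     * differs across libraries). */
    int block_owner(int i, int n, int p)  { int b = (n + p - 1) / p; return i / b; }

    /* Owner of element i under a cyclic distribution: elements dealt out
     * round-robin.  A block-cyclic distribution with block size B would use
     * (i / B) % p instead. */
    int cyclic_owner(int i, int n, int p) { (void)n; return i % p; }

    int main(void) {
        int n = 9, p = 3;
        for (int i = 0; i < n; i++) printf("%d ", block_owner(i, n, p));
        printf("  <- block\n");
        for (int i = 0; i < n; i++) printf("%d ", cyclic_owner(i, n, p));
        printf("  <- cyclic\n");
        return 0;
    }

  With n = 9 and p = 3 this prints exactly the two rows shown in the figure above.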

  18. Based on Task Partitioning • Based on the task dependency graph • In general the mapping problem is NP-complete [Figure: a binary task-dependency tree; leaf tasks on processes 0–7, internal tasks mapped to processes 0, 2, 4, 6, then 0 and 4, with the root on process 0]

  19. Mapping • Dynamic Mapping • A process/global memory can hold a set of tasks • Distribute some tasks to all processes • Once a process completes its tasks, it asks the coordinator process for more tasks • Referred to as self-scheduling, work-stealing
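
  One common realization of self-scheduling (a sketch, not the course's code) is OpenMP's dynamic schedule: the runtime plays the coordinator and hands out chunks of iterations to whichever thread finishes its previous chunk.

    /* Compile with, e.g., cc -fopenmp -lm */
    #include <math.h>
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        enum { NTASKS = 1000 };
        double result[NTASKS];

        /* schedule(dynamic, 10): chunks of 10 tasks go to whichever thread
         * asks next, so uneven task costs do not leave threads idle. */
        #pragma omp parallel for schedule(dynamic, 10)
        for (int t = 0; t < NTASKS; t++) {
            double x = 0.0;
            for (int k = 0; k < 1000 * (t % 7 + 1); k++)   /* deliberately uneven work */
                x += sin((double)k);
            result[t] = x;
        }
        printf("done, result[0] = %f\n", result[0]);
        return 0;
    }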

  20. Interaction Overheads • In spite of the best efforts in mapping, there can be interaction overheads • Caused by frequent communications, exchange of large volumes of data, interaction with the farthest processors, etc. • Some techniques can be used to minimize these interactions

  21. Parallel Algorithm Design - Containing Interaction Overheads • Maximizing data locality • Minimizing volume of data exchange • Using higher dimensional mapping • Not communicating intermediate results • Minimizing frequency of interactions • Minimizing contention and hot spots • Do not have every process use the same communication pattern with the same partner processes; stagger or vary the pattern across processes

  22. Parallel Algorithm Design - Containing Interaction Overheads • Overlapping computations with interactions • Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2) • Initiate communication for type 1; During communication, perform type 2 • Overlapping interactions with interactions • Replicating data or computations • Balancing the extra computation or storage cost with the gain due to less communication
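
  A minimal MPI sketch of this overlap pattern (the ring of neighbors and the buffer names are assumptions made for illustration): post non-blocking halo messages, do the computation that does not need them (type 2), then wait and finish the boundary computation (type 1).

    #include <mpi.h>
    #include <stdio.h>

    #define N 1024

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local[N], halo = 0.0;
        for (int i = 0; i < N; i++) local[i] = rank + i * 1e-3;

        int left  = (rank - 1 + size) % size;   /* assumed ring of processes */
        int right = (rank + 1) % size;

        MPI_Request req[2];
        /* Initiate communication for the type-1 data (boundary value). */
        MPI_Irecv(&halo, 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(&local[N - 1], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

        /* Type-2 work: interior points need no communicated data, so they
         * are computed while the messages are in flight. */
        double interior = 0.0;
        for (int i = 1; i < N; i++) interior += 0.5 * (local[i] + local[i - 1]);

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

        /* Type-1 work: the boundary point that needed the halo value. */
        double boundary = 0.5 * (local[0] + halo);

        printf("rank %d: interior=%g boundary=%g\n", rank, interior, boundary);
        MPI_Finalize();
        return 0;
    }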

  23. Parallel Algorithm Classification – Types - Models

  24. Parallel Algorithm Types • Divide and conquer • Data partitioning / decomposition • Pipelining

  25. Divide-and-Conquer • Recursive in structure • Divide the problem into sub-problems that are similar to the original but smaller in size • Conquer the sub-problems by solving them recursively. If small enough, solve them in a straightforward manner • Combine the solutions to create a solution to the original problem

  26. Divide-and-Conquer Example: Merge Sort • Problem: Sort a sequence of n elements • Divide the sequence into two subsequences of n/2 elements each • Conquer: Sort the two subsequences recursively using merge sort • Combine: Merge the two sorted subsequences to produce the sorted answer
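
  A compact sequential reference implementation in C (a sketch; a parallel version would run the two recursive calls, and possibly the merge, concurrently):

    #include <stdio.h>
    #include <string.h>

    /* Combine: merge the sorted halves a[lo..mid) and a[mid..hi) via tmp. */
    static void merge(int *a, int *tmp, int lo, int mid, int hi) {
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        while (i < mid) tmp[k++] = a[i++];
        while (j < hi)  tmp[k++] = a[j++];
        memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof(int));
    }

    /* Divide and conquer: sort a[lo..hi). */
    static void merge_sort(int *a, int *tmp, int lo, int hi) {
        if (hi - lo < 2) return;          /* small enough: already sorted */
        int mid = lo + (hi - lo) / 2;     /* divide                       */
        merge_sort(a, tmp, lo, mid);      /* conquer left half            */
        merge_sort(a, tmp, mid, hi);      /* conquer right half           */
        merge(a, tmp, lo, mid, hi);       /* combine                      */
    }

    int main(void) {
        int a[] = {5, 2, 9, 1, 7, 3, 8, 6, 4}, tmp[9];
        merge_sort(a, tmp, 0, 9);
        for (int i = 0; i < 9; i++) printf("%d ", a[i]);
        printf("\n");
        return 0;
    }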

  27. Partitioning • Breaking up the given problem into p independent subproblems of almost equal sizes • Solving the p subproblems concurrently • Mostly splitting the input or output into non-overlapping pieces • Example: Matrix multiplication • Either the inputs (A or B) or output (C) can be partitioned.
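
  A sketch of output partitioning with OpenMP (sizes and initial values are arbitrary): the rows of C are split among threads, so each subproblem reads all of A and B but writes a disjoint set of rows of C.

    #include <stdio.h>
    #include <omp.h>

    #define N 256

    static double A[N][N], B[N][N], C[N][N];

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = i + j; B[i][j] = i - j; }

        /* Partition the output: rows of C are divided among the threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++) sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }

        printf("C[0][0] = %f\n", C[0][0]);
        return 0;
    }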

  28. Pipelining Occurs, for example, in image processing applications where a stream of images undergoes a sequence of transformations.

  29. Parallel Algorithm Models • Data parallel model • Processes perform identical tasks on different data • Task parallel model • Different processes perform different tasks on same or different data – based on task dependency graph • Work pool model • Any task can be performed by any process. Tasks are added to a work pool dynamically • Pipeline model • A stream of data passes through a chain of processes – stream parallelism

  30. Parallel Program Classification - Models - Structure - Paradigms

  31. Parallel Program Models • Single Program Multiple Data (SPMD) • Multiple Program Multiple Data (MPMD) Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
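
  A minimal SPMD sketch in MPI: every process runs the same executable and branches on its rank (the coordinator/worker split below is only an illustration).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)                    /* same program, different path */
            printf("coordinator: %d processes started\n", size);
        else
            printf("worker %d: working on my share of the data\n", rank);

        MPI_Finalize();
        return 0;
    }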

  32. Parallel Program Structure Types • Master-Worker / parameter sweep / task farming • Embarrassingly/pleasingly parallel • Pipeline / systolic / wavefront • Tightly coupled • Workflow [Figures: process structures over P0–P4 for the patterns above]

  33. Programming Paradigms • Shared memory model – Threads, OpenMP • Message passing model – MPI • Data parallel model – HPF Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  34. Parallel Architectures Classification - Classification - Cache coherence in shared memory platforms - Interconnection networks

  35. Classification of Architectures – Flynn’s classification • Single Instruction Single Data (SISD): Serial Computers • Single Instruction Multiple Data (SIMD) - Vector processors and processor arrays - Examples: CM-2, Cray-90, Cray YMP, Hitachi 3600 Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  36. Classification of Architectures – Flynn’s classification • Multiple Instruction Single Data (MISD): Not popular • Multiple Instruction Multiple Data (MIMD) - Most popular - IBM SP and most other supercomputers, clusters, computational Grids etc. Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  37. Classification of Architectures – Based on Memory • Shared memory • 2 types – UMA and NUMA • NUMA examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q [Figures: UMA and NUMA shared-memory organizations] Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

  38. Classification of Architectures – Based on Memory • Distributed memory Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/ • Recently multi-cores • Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids

  39. Cache Coherence - for details, read 2.4.6 of book Interconnection networks - for details, read 2.4.2-2.4.5 of book

  40. Cache Coherence in SMPs • All processors read variable ‘x’ residing in cache line ‘a’ • Each processor updates ‘x’ at different points of time [Figure: CPU0–CPU3, each holding a copy of cache line ‘a’ in cache0–cache3, all backed by the same main memory] • Challenge: To maintain a consistent view of the data • Protocols: • Write update • Write invalidate

  41. Cache Coherence Protocols and Implementations • Write update – propagate the updated cache line to the other processors on every write • Write invalidate – other copies are invalidated on a write; each processor gets the updated cache line when it next reads the (now stale) data • Which is better??

  42. Caches – False sharing • Different processors update different parts of the same cache line • Leads to ping-pong of cache lines between processors • Situation better in update protocols than invalidate protocols. Why? • Modify the algorithm to change the stride [Figure: one CPU updates A0, A2, A4, … while the other updates A1, A3, A5, …; the elements A0–A8 and A9–A15 lie in cache lines held by both cache0 and cache1]
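
  A sketch of the effect in C with OpenMP (the 64-byte cache-line size is an assumption, and the measured gap depends on the machine): adjacent per-thread counters share a line and ping-pong between caches, while padded counters do not.

    #include <stdio.h>
    #include <omp.h>

    #define NTHREADS 4
    #define ITERS    100000000L

    /* Counters packed next to each other: several share one cache line,
     * so writes by different threads invalidate each other's copies. */
    volatile long packed[NTHREADS];

    /* Each counter padded to an assumed 64-byte cache line of its own. */
    struct padded_counter { volatile long v; char pad[64 - sizeof(long)]; };
    struct padded_counter padded[NTHREADS];

    int main(void) {
        double t0 = omp_get_wtime();
        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < ITERS; i++) packed[id]++;
        }
        double t1 = omp_get_wtime();

        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < ITERS; i++) padded[id].v++;
        }
        double t2 = omp_get_wtime();

        printf("packed (false sharing): %.2f s   padded: %.2f s\n", t1 - t0, t2 - t1);
        return 0;
    }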

  43. Cache Coherence using invalidate protocols • 3 states associated with data items • Shared – a variable shared by 2 caches • Invalid – another processor (say P0) has updated the data item • Dirty – state of the data item in P0 • Implementations • Snoopy • For bus-based architectures • Memory operations are propagated over the bus and snooped • Directory-based • Instead of broadcasting memory operations to all processors, coherence operations are propagated only to the relevant processors • A central directory maintains the states of cache blocks and the associated processors • Implemented with presence bits

  44. Interconnection Networks • An interconnection network is defined by switches, links and interfaces • Switches – provide mapping between input and output ports, buffering, routing etc. • Interfaces – connect nodes to the network • Network topologies • Static – point-to-point communication links among processing nodes • Dynamic – communication links are formed dynamically by switches

  45. Interconnection Networks • Static • Bus – SGI Challenge • Completely connected • Star • Linear array, Ring (1-D torus) • Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus • k-d mesh: d dimensions with k nodes in each dimension • Hypercubes – a k-d mesh with k = 2 and d = log p – e.g. many MIMD machines • Trees – our campus network • Dynamic – communication links are formed dynamically by switches • Crossbar – Cray X series – non-blocking network • Multistage – SP2 – blocking network • For more details, and evaluation of topologies, refer to the book

  46. Evaluating Interconnection topologies • Diameter – maximum distance between any two processing nodes • Fully connected – 1 • Star – 2 • Ring – p/2 • Hypercube – log P • Connectivity – multiplicity of paths between 2 nodes; minimum number of arcs to be removed from the network to break it into two disconnected networks • Linear array – 1 • Ring – 2 • 2-D mesh – 2 • 2-D mesh with wraparound – 4 • d-dimensional hypercube – d

  47. Evaluating Interconnection topologies • Bisection width – minimum number of links to be removed from the network to partition it into 2 equal halves • Ring – 2 • P-node 2-D mesh – √P • Tree – 1 • Star – 1 • Completely connected – P²/4 • Hypercube – P/2

  48. Evaluating Interconnection topologies • Channel width – number of bits that can be communicated simultaneously over a link, i.e. the number of physical wires between 2 nodes • Channel rate – peak rate at which data can be communicated over a single physical wire • Channel bandwidth – channel rate times channel width • Bisection bandwidth – maximum volume of communication between the two halves of the network, i.e. bisection width times channel bandwidth
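
  A worked example with made-up numbers (purely illustrative): channel width = 32 bits, channel rate = 1 Gbit/s per wire, and a 64-node hypercube, whose bisection width is P/2 = 32 links.

      channel bandwidth   = 1 Gbit/s × 32 wires            = 32 Gbit/s per link
      bisection bandwidth = 32 links × 32 Gbit/s per link  = 1024 Gbit/s ≈ 1 Tbit/s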

  49. END
