
Parallel Analysis of Algorithms: PRAM + CGM


Presentation Transcript


  1. Parallel Analysis of Algorithms: PRAM + CGM

  2. Outline
  • Parallel Performance
  • Parallel Models
  • Shared Memory (PRAM, SMP)
  • Distributed Memory (BSP, CGM)

  3. Question?
  • Professor Speedy says he has a parallel algorithm that sorts n arbitrary items in O(n) time using p > 1 processors.
  • Do you believe him?

  4. Performance of a Parallel Algorithm
  • n : problem size (e.g., sort n numbers)
  • p : number of processors
  • T(p) : parallel time
  • Ts : sequential time (of the optimal sequential algorithm)
  • s(p) = Ts / T(p) : speedup, with 1 ≤ s(p) ≤ p
  [Figure: speedup s plotted against p, showing super-linear, linear (s(p) = p), and sub-linear curves]

  5. Speedup
  • Linear speedup s(p) = p : optimal
  • Super-linear speedup s(p) > p : impossible
  Proof. Assume that parallel algorithm A has a speedup s > p on p processors, i.e. s = Ts / T > p. Hence Ts > T · p. Simulate A on a sequential, single-processor machine. Then T(1) = T · p < Ts. Hence Ts was not optimal. Contradiction.

  6. Amdahl’s Law
  • Let f, 0 < f < 1, be the fraction of a computation that is inherently sequential. Then the maximum obtainable speedup is s(p) ≤ 1 / [f + (1-f)/p].
  Proof: Let Ts be the sequential time. Then T(p) ≥ f·Ts + (1-f)·Ts / p. Hence s(p) ≤ Ts / [f·Ts + (1-f)·Ts / p] = 1 / [f + (1-f)/p].

  7. Amdahl’s Law
  [Figure: (a) one processor: serial section f·ts followed by parallelizable sections (1-f)·ts; (b) p processors: the parallelizable part takes (1-f)·ts / p]

  8. Amdahl’s Law
  [Figure: execution time bars for p = 1, 5, 10, and 1000 processors; only the parallelizable part shrinks as p grows]

  9. Amdahl’s Law
  s(p) ≤ 1 / [f + (1-f)/p]
  • f → 0 : s(p) → p
  • f → 1 : s(p) → 1
  • f = 0.5 : s(p) = 2p / (p+1) ≤ 2
  • f = 1/k : s(p) = k / [1 + (k-1)/p] ≤ k
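To make these bounds concrete, here is a minimal C sketch (the helper name amdahl_bound is ours, not from the slides) that tabulates the speedup ceiling for a few values of f and p:

```c
#include <stdio.h>

/* Upper bound on speedup from Amdahl's Law: s(p) <= 1 / (f + (1-f)/p) */
static double amdahl_bound(double f, double p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    double fractions[] = {0.5, 0.1, 0.01};
    int procs[] = {2, 10, 100, 1000};
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 4; j++)
            printf("f = %-5.2f p = %-5d  s <= %6.2f\n",
                   fractions[i], procs[j],
                   amdahl_bound(fractions[i], (double)procs[j]));
    return 0;   /* e.g. f = 0.5 never exceeds 2; f = 0.01 approaches 100 */
}
```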

  10. [Figure: speedup s as a function of p for f = 1/k, approaching the asymptote k]

  11. Scaled or Relative Speedup
  • Ts may be unknown (in fact, for most real experiments this is the case)
  • Relative speedup: s’(p) = T(1) / T(p)
  • s’(p) ≥ s(p), since T(1) ≥ Ts

  12. Efficiency
  • e(p) = s(p) / p : efficiency, with 0 ≤ e ≤ 1
  • optimal (linear) speedup s(p) = p ⇒ e(p) = 1
  • e’(p) = s’(p) / p : relative efficiency

  13. Outline
  • Parallel Analysis of Algorithms
  • Models
  • Shared Memory (PRAM, SMP)
  • Distributed Memory (BSP, CGM)

  14. Shared Memory (PRAM, SMP): Parallel Random Access Machine (PRAM)
  • Exclusive-Read (ER)
  • Concurrent-Read (CR)
  • Exclusive-Write (EW)
  • Concurrent-Write (CW)
  [Figure: processors proc. 1 … proc. p connected to a shared memory of cells 1 … n]

  15. Parallel Random Access Machine (PRAM)
  Concurrent-Write (CW) resolution rules:
  • Common: all processors must write the same value
  • Arbitrary: an arbitrary value “wins”
  • Smallest: the smallest value “wins”
  • Priority: the processor with the smallest ID number “wins”

  16. Parallel Random Access Machine (PRAM)
  • Default: CREW (Concurrent Read, Exclusive Write)
  • p = O(n): fine grained, massively parallel

  17. Performance of a PRAM Algorithm
  • Optimal: T = O( Ts / p )
  • Efficient: T = O( log^k(n) · Ts / p )
  • NC: T = O( log^k(n) ) for p = polynomial(n)

  18. Example: Multiply n numbers
  • Input: a1, a2, …, an
  • Output: a1 * a2 * a3 * … * an
  • * : an associative operator

  19. Algorithm 1 (p = n/2)
  [Figure: combining tree over a1 … an; each level multiplies adjacent pairs in parallel]

  20. Analysis
  • p = n/2, T = O( log n )
  • Ts = O(n), Ts / p = O(1)
  ⇒ algorithm is efficient & NC, but not optimal
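The slides give Algorithm 1 only as a combining-tree figure; below is one possible shared-memory rendering in C with OpenMP. It assumes n is a power of two, uses ordinary multiplication as the associative operator, and the function name tree_product is ours:

```c
#include <stdio.h>
#include <omp.h>

/* Pairwise combining tree (Algorithm 1): log n rounds, each round
 * multiplies adjacent pairs in parallel. Assumes n is a power of two. */
double tree_product(double *a, int n) {
    for (int stride = 1; stride < n; stride *= 2) {
        #pragma omp parallel for          /* one PRAM step per round */
        for (int i = 0; i < n; i += 2 * stride)
            a[i] = a[i] * a[i + stride];
    }
    return a[0];
}

int main(void) {
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("product = %g\n", tree_product(a, 8));   /* 8! = 40320 */
    return 0;
}
```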

  21. Algorithm 2
  • Make available only p = n / log n processors.
  • Execute Algorithm 1 using “rescheduling”: whenever Algorithm 1 has a parallel step where m > n / log n processors are used, simulate this step by a “phase” consisting of ⌈ m / (n / log n) ⌉ steps on the n / log n processors.

  22. [Figure: the parallel steps of Algorithm 1 rescheduled onto the n / log n processors]

  23. Analysis
  • # steps in phase i : ⌈ (n / 2^i) / (n / log n) ⌉ = ⌈ log n / 2^i ⌉ ≤ log n / 2^i + 1
  • T = O( Σ_{1 ≤ i ≤ log n} (log n / 2^i + 1) ) = O( log n · Σ_{i ≥ 1} 1/2^i + log n ) = O( log n )
  • p = n / log n
  • Ts / p = O( n / (n / log n) ) = O( log n )
  ⇒ algorithm is efficient & NC & optimal
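In OpenMP terms this rescheduling is just a cap on the thread count: each parallel round with m active pairs is then executed in about ⌈ m / (n / log n) ⌉ chunks per thread. A minimal sketch, assuming the hypothetical tree_product routine from the earlier example:

```c
#include <math.h>
#include <omp.h>

double tree_product(double *a, int n);   /* hypothetical helper from the sketch above */

/* Algorithm 2 as a scheduling policy: use only p = n / log2(n) threads;
 * each parallel round of tree_product is then run as a "phase" of
 * roughly ceil(m / p) sequential steps per thread. */
double product_rescheduled(double *a, int n) {
    int p = (int)(n / log2((double)n));
    if (p < 1) p = 1;
    omp_set_num_threads(p);
    return tree_product(a, n);
}
```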

  24. Problem 2: List Ranking
  • Input: a linked list represented by an array.
  • Output: the distance of each node to the last node.

  25. Algorithm: Pointer Jumping
  • Assign processor i to node i.
  • Initialize (all processors i in parallel): D(i) := 0 if P(i) = i, otherwise D(i) := 1.
  • REPEAT log n TIMES (all processors i in parallel):
    D(i) := D(i) + D(P(i))
    P(i) := P(P(i))

  26. Analysis • p = n • T = O( log n ) • efficient & NC but not optimal
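A possible shared-memory sketch of pointer jumping, assuming the list is given as a successor array P with P[i] = i at the last node; temporary copies emulate the synchronous PRAM step so every read sees the values from the previous round (the function name list_rank is ours):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

/* Pointer jumping for list ranking: after ceil(log2 n) rounds,
 * D[i] is the distance from node i to the last node. */
void list_rank(int *P, int *D, int n) {
    int *P2 = malloc(n * sizeof *P2);
    int *D2 = malloc(n * sizeof *D2);
    for (int i = 0; i < n; i++)                /* initialization step */
        D[i] = (P[i] == i) ? 0 : 1;
    for (int rounds = 1; rounds < n; rounds *= 2) {
        #pragma omp parallel for               /* one synchronous PRAM step */
        for (int i = 0; i < n; i++) {
            D2[i] = D[i] + D[P[i]];
            P2[i] = P[P[i]];
        }
        memcpy(D, D2, n * sizeof *D);
        memcpy(P, P2, n * sizeof *P);
    }
    free(P2); free(D2);
}

int main(void) {
    /* list 3 -> 1 -> 4 -> 0 -> 2; node 2 is last (P[2] == 2) */
    int P[5] = {2, 4, 2, 1, 0};
    int D[5];
    list_rank(P, D, 5);
    for (int i = 0; i < 5; i++) printf("D[%d] = %d\n", i, D[i]);  /* 1 3 0 4 2 */
    return 0;
}
```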

  27. Problem 3: Partial Sums
  • Input: a1, a2, …, an
  • Output: a1, a1 + a2, a1 + a2 + a3, ..., a1 + a2 + a3 + … + an

  28. Parallel Recursion
  • Compute (in parallel): a1 + a2, a3 + a4, a5 + a6, ..., an-1 + an
  • Recursively (all processors together) solve the problem for the n/2 numbers a1 + a2, a3 + a4, a5 + a6, ..., an-1 + an
  • The result is: (a1+a2), (a1+a2+a3+a4), (a1+a2+a3+a4+a5+a6), ..., (a1+a2+…+an-3+an-2), (a1+a2+…+an-1+an)
  • Fill each gap (a partial sum ending at an odd position) by adding a single number to its predecessor

  29. Analysis
  • p = n
  • T(n) = T(n/2) + O(1), T(1) = O(1) ⇒ T(n) = O(log n)
  • efficient and NC but not optimal
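One way to render the recursive scheme in C with OpenMP (assumptions: n is a power of two, the operator is ordinary addition, and the name prefix_sums is ours):

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Recursive parallel prefix sums: pair up, recurse on the n/2 pair-sums,
 * then fill the gaps; prefixes of even length come out of the recursion,
 * the odd ones need one extra addition each. */
void prefix_sums(const double *a, double *out, int n) {
    if (n == 1) { out[0] = a[0]; return; }
    double *pair = malloc((n / 2) * sizeof *pair);
    double *pref = malloc((n / 2) * sizeof *pref);
    #pragma omp parallel for
    for (int i = 0; i < n / 2; i++)              /* a1+a2, a3+a4, ... in parallel */
        pair[i] = a[2 * i] + a[2 * i + 1];
    prefix_sums(pair, pref, n / 2);              /* recurse on n/2 numbers */
    #pragma omp parallel for
    for (int i = 0; i < n / 2; i++)
        out[2 * i + 1] = pref[i];                /* even-length prefixes */
    out[0] = a[0];
    #pragma omp parallel for
    for (int i = 1; i < n / 2; i++)
        out[2 * i] = pref[i - 1] + a[2 * i];     /* each gap: predecessor + one element */
    free(pair); free(pref);
}

int main(void) {
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, out[8];
    prefix_sums(a, out, 8);
    for (int i = 0; i < 8; i++) printf("%g ", out[i]);  /* 1 3 6 10 15 21 28 36 */
    printf("\n");
    return 0;
}
```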

  30. Improving through rescheduling
  • set p = n / log n
  • simulate the previous algorithm

  31. [Figure: the parallel steps rescheduled onto the n / log n processors]

  32. Analysis
  • # steps in phase i : ⌈ (n / 2^i) / (n / log n) ⌉ = ⌈ log n / 2^i ⌉ ≤ log n / 2^i + 1
  • T = O( Σ_{1 ≤ i ≤ log n} (log n / 2^i + 1) ) = O( log n · Σ_{i ≥ 1} 1/2^i + log n ) = O( log n )
  • p = n / log n
  • Ts / p = O( n / (n / log n) ) = O( log n )
  • ⇒ algorithm is efficient & NC & optimal

  33. Problem 4: Sorting
  • Input: a1, a2, …, an
  • Output: a1, a2, …, an permuted into sorted order

  34. Bitonic Sorting (Batcher)
  • Unimodal sequence: 9 10 13 17 21 19 16 15
  • Bitonic sequence: a cyclic shift of a unimodal sequence, e.g. 16 15 9 10 13 17 21 19

  35. Properties of bitonic sequences
  • Let X = x1 x2 ... xn xn+1 xn+2 ... x2n be bitonic.
  • Define L(X) = y1 ... yn with yi = min {xi, xn+i}, and U(X) = z1 ... zn with zi = max {xi, xn+i}. Then:
    (1) L(X) and U(X) are bitonic
    (2) every element of L(X) is smaller than every element of U(X)

  36. Bitonic Merge: sorting a bitonic sequence
  • A bitonic sequence of length n can be sorted in time O(log n) using p = n processors.

  37. Sorting an arbitrary sequence a1, a2, …, an
  • Split a1, a2, …, an into two sub-sequences: a1, …, an/2 and a(n/2)+1, a(n/2)+2, …, an
  • Recursively, in parallel, sort each sub-sequence using p/2 processors
  • Merge the two sorted sub-sequences into one sorted sequence using bitonic merge
  Note: if X and Y are sorted sequences (increasing order), then X Y^R (Y reversed) is a bitonic sequence.

  38. Analysis
  • p = n
  • T(n) = T(n/2) + O(log n), T(1) = O(1) ⇒ T(n) = O(log^2 n)
  • efficient and NC but not optimal
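A compact iterative rendering of Batcher's bitonic sort in C with OpenMP (a sketch, assuming n is a power of two; each compare-exchange sweep corresponds to one parallel PRAM step):

```c
#include <stdio.h>
#include <omp.h>

/* Swap a[i], a[j] so that they end up in the requested order. */
static void compare_exchange(int *a, int i, int j, int ascending) {
    if ((a[i] > a[j]) == ascending) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}

/* Batcher's bitonic sort: the j-loop is one bitonic merge of
 * sequences of length k; O(log^2 n) sweeps in total. */
void bitonic_sort(int *a, int n) {
    for (int k = 2; k <= n; k *= 2) {          /* size of bitonic sequences */
        for (int j = k / 2; j > 0; j /= 2) {   /* one merge: O(log n) steps */
            #pragma omp parallel for           /* one parallel PRAM step */
            for (int i = 0; i < n; i++) {
                int partner = i ^ j;
                if (partner > i)
                    compare_exchange(a, i, partner, (i & k) == 0);
            }
        }
    }
}

int main(void) {
    int a[8] = {16, 15, 9, 10, 13, 17, 21, 19};   /* the bitonic example above */
    bitonic_sort(a, 8);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);  /* ascending order */
    printf("\n");
    return 0;
}
```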

  39. So what about an SMP machine?
  • PRAM?
  • EREW?
  • CREW?
  • CRCW?
  • How does OpenMP play into this?

  40. OpenMP/SMP
  • = CREW PRAM, but coarse grained
  • T(p) ≥ f·Ts + (1-f)·Ts / p, for f = sequential fraction
  • T(n,p) = f·Ts + the sum, over all parallel regions, of the maximum thread time in that region
  [Figure: fork-join diagram: a master thread forks parallel regions and joins after each one]
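An illustrative OpenMP skeleton of the fork-join structure this model describes: sequential sections run on the master thread, and each parallel region costs as much as its slowest thread (the particular loop and timing calls are just an example, not from the slides):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    double t0 = omp_get_wtime();

    double sum = 0.0;                       /* sequential section (part of f) */

    #pragma omp parallel for reduction(+:sum)   /* parallel region: fork, then join */
    for (int i = 1; i <= 1000000; i++)
        sum += 1.0 / i;

    printf("harmonic(1e6) = %f\n", sum);    /* another sequential section */

    double total = omp_get_wtime() - t0;
    printf("wall time = %f s on up to %d threads\n", total, omp_get_max_threads());
    return 0;
}
```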

  41. Outline
  • Parallel Analysis of Algorithms
  • Models
  • Shared Memory (PRAM, SMP)
  • Distributed Memory (BSP, CGM)

  42. Distributed Memory Models

  43. Parallel Computing
  • p : # processors
  • n : problem size
  • Ts(n) : sequential time
  • T(p,n) : parallel time
  • speedup: S(p,n) = Ts(n) / T(p,n)
  • Goal: obtain linear speedup S(p,n) = p

  44. Parallel Computers
  [Figure: example machines: Beowulf cluster, Blue Gene/Q, Cray XK7, custom MPP (Tianhe-2), ...]

  45. Parallel Machine Models
  How to abstract the machine into a simplified model such that
  • algorithm/application design is not hampered by too many details
  • calculated time-complexity predictions match the actually observed running times (with sufficient accuracy)

  46. Parallel Machine Models
  • PRAM
  • Fine grained networks (array, ring, mesh, hypercube)
  • Bulk Synchronous Parallelism (BSP), Valiant, 1990
  • Coarse Grained Multicomputer (CGM), Dehne, Rau-Chaplin, 1993
  • Multithread (CILK), Leiserson, 1995
  • many more...

  47. PRAM
  • p = O(n) processors: massively parallel

  48. Example: PRAM Sort (by repeated list merging)
  • Bitonic Sort: O(log n) per merge ⇒ O(log^2 n)
  • Cole: O(1) per merge ⇒ O(log n)

  49. Fine Grained Networks
  • p = O(n) processors: massively parallel
