
Parallel and Distributed Processing CSE 8380





  1. Parallel and Distributed Processing CSE 8380, February 8, 2005, Session 8

  2. Contents • Computing sum on EREW PRAM • Computing all partial sums on EREW PRAM • Matrix Multiplication on CREW • Other Algorithms

  3. Recall (PRAM Model) • Synchronized Read-Compute-Write cycle • Variants: EREW, ERCW, CREW, CRCW • Complexity measures: T(n), P(n), C(n)
     [Figure: processors P1, P2, …, Pp, each with its own private memory, connected to a shared global memory under a common control unit]

  4. Sum on EREW PRAM • Compute the sum of an array A[1..n] • We use n/2 processors • Summation will end up in location A[n] • For simplicity, we assume n is an integral power of 2 • Work is done in log n iterations. In the first iteration, all processors are active. In the second iteration, only half the processors will be active, and so on.

  5. Example: sum of an array of numbers on the EREW model (algorithm sum_EREW with n = 8, initial array A = [5, 2, 10, 1, 8, 12, 7, 3])

     Iteration (active processors)          A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8]
     Initially                                5    2   10    1    8   12    7    3
     After iteration 1 (P1, P2, P3, P4)       5    7   10   11    8   20    7   10
     After iteration 2 (P2, P4)               5    7   10   18    8   20    7   30
     After iteration 3 (P4)                   5    7   10   18    8   20    7   48

  6. Group Work 1- Discuss the algorithm with your neighbor 2- Design the main loops 3- Discuss the Complexity

  7. Algorithm sum_EREW
     for i = 1 to log n do
       forall Pj, where 1 ≤ j ≤ n/2, do in parallel
         if (2j mod 2^i) = 0 then
           A[2j] ← A[2j] + A[2j − 2^(i−1)]
         endif
       endfor
     endfor
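As a sanity check, algorithm sum_EREW can be simulated sequentially: one pass of the outer loop below plays one synchronous PRAM iteration, and the inner loop plays the n/2 processors. This is a minimal sketch (the function name and the 1-indexed padding slot are our choices, not from the slides):

```python
import math

def sum_erew(A):
    """Simulate algorithm sum_EREW; len(A) must be a power of 2.

    a[1..n] mirrors the slides' 1-indexed array; after log n
    iterations the total ends up in a[n].
    """
    n = len(A)
    a = [0] + list(A)                  # dummy slot 0 keeps 1-indexing
    for i in range(1, int(math.log2(n)) + 1):
        # Processors P_j, 1 <= j <= n/2, act in one parallel step;
        # only those with (2j mod 2^i) == 0 update a cell.  Reads and
        # writes touch disjoint cells, so a sequential loop is safe.
        for j in range(1, n // 2 + 1):
            if (2 * j) % (2 ** i) == 0:
                a[2 * j] += a[2 * j - 2 ** (i - 1)]
    return a[n]

print(sum_erew([5, 2, 10, 1, 8, 12, 7, 3]))   # 48, as in the example slide
```

In iteration 1 all n/2 processors fire (2j is always divisible by 2); each later iteration halves the active set, matching the slide's description.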

  8. Complexity • Run time: T(n) = O(log n) • Number of processors: P(n) = n/2 • Cost: c(n) = O(n log n) • Is it cost optimal?

  9. All partial sums - EREW PRAM • Compute all partial sums of an array A[1..n] • These are A[1], A[1]+A[2], A[1]+A[2]+A[3], …, A[1]+A[2]+… + A[n]. • At first glance you might think it is inherently sequential because one must add up the first k elements before adding in element k+1 • We’ll see that it can be parallelized • Let’s extend sum_EREW to do that

  10. All partial sums (cont.) • We noticed that in sum_EREW most processors are idle most of the time • By exploiting these idle processors, we should be able to compute all partial sums in the same amount of time it takes to compute the single sum

  11. All partial sums (cont.) • Compute all partial sums of A[1..n] • We use n−1 processors (P2, P3, …, Pn) • A[k] will be replaced by the sum of all elements preceding and including A[k] • In algorithm sum_EREW, at iteration i, only n/2^i processors were active, while in allsums_EREW nearly all processors will be in use.

  12. Example: all partial sums on the EREW PRAM (algorithm allsums_EREW with n = 8, initial array A = [5, 2, 10, 1, 8, 12, 7, 3])

     Iteration (active processors)          A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8]
     Initially                                5    2   10    1    8   12    7    3
     After iteration 1 (P2, P3, …, P8)        5    7   12   11    9   20   19   10
     After iteration 2 (P3, P4, …, P8)        5    7   17   18   21   31   28   30
     After iteration 3 (P5, P6, P7, P8)       5    7   17   18   26   38   45   48

  13. Group Work 1- Discuss the algorithm with your neighbor 2- Design the main loops 3- Discuss the Complexity

  14. Algorithm allsums_EREW
      for i = 1 to log n do
        forall Pj, where 2^(i−1) + 1 ≤ j ≤ n, do in parallel
          A[j] ← A[j] + A[j − 2^(i−1)]
        endfor
      endfor
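Algorithm allsums_EREW can likewise be simulated sequentially. One subtlety: in a given iteration every active processor must read pre-iteration values, so the simulation snapshots the array before applying the parallel step (a sketch under our naming, not the lecture's code):

```python
import math

def allsums_erew(A):
    """Simulate algorithm allsums_EREW; len(A) must be a power of 2.

    On return, position k (1-based) holds A[1] + ... + A[k], i.e. the
    array is replaced in place by its prefix sums.
    """
    n = len(A)
    a = [0] + list(A)                       # a[1..n], as in the slides
    for i in range(1, int(math.log2(n)) + 1):
        d = 2 ** (i - 1)
        # Snapshot first: on a synchronous PRAM every processor P_j
        # (2^(i-1)+1 <= j <= n) reads values from before this step.
        old = list(a)
        for j in range(d + 1, n + 1):
            a[j] = old[j] + old[j - d]
    return a[1:]

print(allsums_erew([5, 2, 10, 1, 8, 12, 7, 3]))
# [5, 7, 17, 18, 26, 38, 45, 48], as in the example slide
```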

  15. Complexity • Run time: T(n) = O(log n) • Number of processors: P(n) = n-1 • Cost: c(n) = O(n log n)

  16. Matrix Multiplication • Two n × n matrices • For clarity, we assume n is a power of 2 • We use CREW to allow concurrent reads • The two matrices are in shared memory: A[1..n,1..n], B[1..n,1..n] • We will use n^3 processors • We will also show how to reduce the number of processors

  17. Matrix Multiplication (cont) • The n^3 processors are arranged in a three-dimensional array; processor Pi,j,k is the one with index (i,j,k) • We will use the three-dimensional array C[1..n,1..n,1..n] in shared memory as working space • The resulting matrix will be stored in locations C[i,j,n], where 1 ≤ i, j ≤ n

  18. Two steps • All n3 processors operate in parallel to compute n3 multiplications. (For each of the n2 cells in the output matrix, n products are computed) • The n products are summed to produce the final value of each cell

  19. Matrix multiplication using n^3 processors: two steps of the algorithm • Each processor Pi,j,k computes the product A[i,k] · B[k,j] and stores it in C[i,j,k] • The idea of algorithm sum_EREW is then applied along the k dimension, n^2 times in parallel, to compute C[i,j,n], where 1 ≤ i, j ≤ n

  20. Algorithm MatMult_CREW
      /* step 1 */
      forall Pi,j,k, where 1 ≤ i, j, k ≤ n, do in parallel
        C[i,j,k] ← A[i,k] * B[k,j]
      endfor
      /* step 2 */
      for l = 1 to log n do
        forall Pi,j,k, where 1 ≤ i, j ≤ n and 1 ≤ k ≤ n/2, do in parallel
          if (2k mod 2^l) = 0 then
            C[i,j,2k] ← C[i,j,2k] + C[i,j,2k − 2^(l−1)]
          endif
        endfor
      endfor
      /* the output matrix is stored in locations C[i,j,n], where 1 ≤ i, j ≤ n */
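A sequential simulation of MatMult_CREW makes the two steps concrete: step 1 fills the working array C with all n^3 products at once, and step 2 is the sum_EREW reduction run along the k dimension (the function name and 0-indexed i, j are our conventions; a dummy slot keeps k 1-indexed as in the slides):

```python
import math

def matmult_crew(A, B):
    """Simulate MatMult_CREW on two n x n matrices, n a power of 2.

    After step 2, C[i][j][n] holds entry (i, j) of the product A * B.
    """
    n = len(A)
    # Working space C[i][j][k]; slot k = 0 is unused padding.
    C = [[[0] * (n + 1) for _ in range(n)] for _ in range(n)]
    # Step 1: n^3 multiplications, all independent (one parallel step).
    for i in range(n):
        for j in range(n):
            for k in range(1, n + 1):
                C[i][j][k] = A[i][k - 1] * B[k - 1][j]
    # Step 2: tree reduction along k, log n synchronous steps.
    for l in range(1, int(math.log2(n)) + 1):
        for i in range(n):
            for j in range(n):
                for k in range(1, n // 2 + 1):
                    if (2 * k) % (2 ** l) == 0:
                        C[i][j][2 * k] += C[i][j][2 * k - 2 ** (l - 1)]
    return [[C[i][j][n] for j in range(n)] for i in range(n)]

print(matmult_crew([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Only step 1 needs the CREW capability: row i of A and column j of B are read concurrently by n different processors; the writes in both steps go to distinct cells.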

  21. Complexity • Run time: T(n) = O(log n) • Number of processors: P(n) = n3 • Cost: c(n) = O(n3 log n) • Is it cost optimal?

  22. Example: multiplying two 2 × 2 matrices using algorithm MatMult_CREW, after step 1
      k = 1:  P1,1,1: C[1,1,1] ← A[1,1]B[1,1]    P1,2,1: C[1,2,1] ← A[1,1]B[1,2]
              P2,1,1: C[2,1,1] ← A[2,1]B[1,1]    P2,2,1: C[2,2,1] ← A[2,1]B[1,2]
      k = 2:  P1,1,2: C[1,1,2] ← A[1,2]B[2,1]    P1,2,2: C[1,2,2] ← A[1,2]B[2,2]
              P2,1,2: C[2,1,2] ← A[2,2]B[2,1]    P2,2,2: C[2,2,2] ← A[2,2]B[2,2]

  23. Example (cont.): after step 2 (only the k = 2 processors are active)
      P1,1,2: C[1,1,2] ← C[1,1,2] + C[1,1,1]    P1,2,2: C[1,2,2] ← C[1,2,2] + C[1,2,1]
      P2,1,2: C[2,1,2] ← C[2,1,2] + C[2,1,1]    P2,2,2: C[2,2,2] ← C[2,2,2] + C[2,2,1]

  24. Matrix multiplication: reducing the number of processors to n^3/log n • Processors are arranged in an n × n × n/(log n) three-dimensional array • Each processor Pi,j,k, where 1 ≤ k ≤ n/log n, computes the sum of log n products. This step produces n^3/log n partial sums. • The partial sums produced in step 1 are then added to produce the resulting matrix, as discussed previously • Complexity analysis • Run time: T(n) = O(log n) • Number of processors: P(n) = n^3/log n • Cost: c(n) = O(n^3), which is cost optimal
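The reduced-processor scheme can be sketched as follows. Each simulated processor first forms one partial sum of log n products sequentially; the subsequent O(log n)-depth tree reduction is collapsed here to a plain sum for brevity. This is our illustration of the idea, not the lecture's exact procedure:

```python
import math

def matmult_reduced(A, B):
    """Sketch of the n^3/log n processor scheme, simulated sequentially.

    For each output cell (i, j), one processor per group of g = log n
    products sums its group (step 1); summing the resulting ~n/log n
    partials stands in for the log-depth reduction (step 2).
    """
    n = len(A)
    g = max(1, int(math.log2(n)))        # products handled per processor
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Step 1: one partial sum of up to g products per processor.
            partials = [sum(A[i][m] * B[m][j]
                            for m in range(k, min(k + g, n)))
                        for k in range(0, n, g)]
            # Step 2: tree reduction, collapsed to a plain sum here.
            C[i][j] = sum(partials)
    return C
```

Step 1 costs O(log n) sequential additions per processor and step 2 another O(log n) parallel steps, so the run time stays O(log n) while the processor count drops by a log n factor, giving cost O(n^3).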

  25. Searching • Given A = a1, a2, …, ai, …, an and a value x • Determine whether x = ai for some i • Sequential binary search: O(log n) • Simple idea: divide the list among the processors and let each processor conduct its own binary search • EREW PRAM: O(log(n/p)) + O(log p) = O(log n), where the O(log p) term is the time to broadcast x without concurrent reads • CREW: O(log(n/p)), since all processors can read x simultaneously

  26. Parallel Binary Search • Split A into p+1 segments of almost equal length • Compare x with the p elements at the boundaries between successive segments • Either x equals one of these boundary elements, or the search is restricted to exactly one of the p+1 segments • Repeat until x is found or the length of the remaining list is ≤ p
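The steps above can be sketched as a (p+1)-ary search. In each round the p boundary probes would run in parallel on a CREW PRAM; here they are a loop. The function name and probe layout are our illustration under the slide's description, not the lecture's exact procedure:

```python
def parallel_search(A, x, p):
    """(p+1)-ary search on a sorted list A with p simulated processors.

    Each round probes p boundary positions "in parallel"; the search
    then narrows to the one segment that can still contain x.
    Returns an index of x in A, or -1 if x is absent.
    """
    lo, hi = 0, len(A) - 1
    while hi - lo + 1 > p:
        width = (hi - lo + 1) // (p + 1)
        # Boundary positions probed simultaneously by P_1 .. P_p.
        probes = [lo + (s + 1) * width for s in range(p)]
        for pos in probes:
            if A[pos] == x:
                return pos
        # Keep only the segment between the last probe below x
        # and the first probe above x.
        for pos in probes:
            if A[pos] < x:
                lo = pos + 1
        for pos in reversed(probes):
            if A[pos] > x:
                hi = pos - 1
        if lo > hi:
            return -1
    # At most p elements left: check them directly (one per processor).
    for pos in range(lo, hi + 1):
        if A[pos] == x:
            return pos
    return -1
```

Each round shrinks the list by roughly a factor of p+1, giving the O(log n / log p) round count that makes p-processor CREW search faster than a single binary search.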
