
Introduction to Parallel Computing



Presentation Transcript


  1. Introduction to Parallel Computing Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar Chapter 8: Dense Matrix Algorithms

  2. Outline • Matrix-Vector Multiplication • Matrix-Matrix Multiplication • Solving a System of Linear Equations • Gaussian Elimination • Solving a Triangular System: Back-Substitution

  3. Review • Performance Metrics • Work: W (the serial time) • Parallel time: Tp • Total overhead function: To = p·Tp − W • Cost: p·Tp; cost optimal if the cost is Θ(W) • Speedup: S = W/Tp • Efficiency: E = S/p • Analysis: Isoefficiency W = K·To(W, p), where K = E/(1 − E) • Find the relation W(p) so that the equation holds • Communication model – hypercube network

  4. Matrix-Vector Multiplication • Compute: y = Ax, where A is an n × n matrix and x is an n × 1 vector • Serial complexity: W = O(n^2)
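To make the n^2 operation count concrete, here is a minimal serial sketch of y = Ax (plain Python, illustrative names; not the book's pseudocode):

```python
def matvec(A, x):
    """Serial y = A x for an n x n matrix: n rows, n multiply-adds each = n^2 work."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):           # one output element per row
        for j in range(n):       # n multiply-adds per row
            y[i] += A[i][j] * x[j]
    return y
```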

  5. Rowwise 1-D Partitioning • Each process owns n/p rows of A and n/p elements of x • All-to-all broadcast of the vector elements: ts log p + tw n • Local computation: n^2/p • Parallel time: Tp = n^2/p + ts log p + tw n • Cost: p·Tp = n^2 + ts p log p + tw n p • Cost optimal if p = O(n)
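A hedged mpi4py sketch of the rowwise algorithm follows; the use of mpi4py, the variable names, and the assumption that p divides n are mine, not the slides'. The allgather call plays the role of the all-to-all broadcast of the vector pieces:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()
n = 8 * p                                # illustrative size; assume p divides n

A_local = np.random.rand(n // p, n)      # this process's n/p rows of A
x_local = np.random.rand(n // p)         # this process's n/p elements of x

# All-to-all broadcast: afterwards every process holds the full vector x.
x_full = np.concatenate(comm.allgather(x_local))

# Local computation: the (n/p) x n block times x gives this process's n/p elements of y.
y_local = A_local @ x_full
```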

  6. Rowwise 1-D Partitioning • Scalability Analysis: W = K·To(W, p), where To = ts p log p + tw n p • The equation holds if n = Θ(p) (balancing the tw term) • Because W = O(n^2) • Isoefficiency function is Θ(p^2)

  7. 2-D Partitioning • Initial data alignment of the vector along the diagonal: ts + tw n/√p • One-to-all broadcast along the columns: (ts + tw n/√p) log √p • Local computation: n^2/p • All-to-one reduction along the rows: (ts + tw n/√p) log √p • Parallel time: Tp ≈ n^2/p + ts log p + tw (n/√p) log p • Cost: n^2 + ts p log p + tw n √p log p • Cost optimal if p = O(n^2 / log^2 n)
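The communication pattern is easier to see in a serial simulation of the √p × √p block grid; the sketch below (my own, illustrative, assuming q = √p divides n) mimics the column broadcast of the vector blocks and the row-wise reduction:

```python
import numpy as np

n, q = 8, 2                         # q = sqrt(p); assume q divides n
b = n // q                          # block size
A, x = np.random.rand(n, n), np.random.rand(n)
y = np.zeros(n)

for i in range(q):                  # process row i of the grid
    for j in range(q):              # process column j of the grid
        A_blk = A[i*b:(i+1)*b, j*b:(j+1)*b]
        x_blk = x[j*b:(j+1)*b]              # block j of x, broadcast down column j
        y[i*b:(i+1)*b] += A_blk @ x_blk     # summing = all-to-one reduction along row i

assert np.allclose(y, A @ x)        # sanity check against the serial product
```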

  8. 2-D Partitioning • Scalability Analysis: W = K·To(W, p), where To = ts p log p + tw n √p log p • The equation holds if n = Θ(√p log p) (balancing the tw term) • Because W = n^2 • Isoefficiency function is Θ(p log^2 p) • Compared with the 1-D partitioning's isoefficiency of Θ(p^2), the 2-D partitioning is more scalable

  9. Matrix-Matrix Multiplication • Compute: C = A × B, where A and B are n × n matrices • Serial complexity: W = O(n^3)
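For reference, the n^3 count comes from the classic triple loop (an illustrative plain-Python sketch):

```python
def matmul(A, B):
    """Serial C = A * B: n^2 output elements, each needing n multiply-adds."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```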

  10. Simple 2-D Blocking Algorithm • Each process gets an (n/√p) × (n/√p) block of A, B, and C • Steps • Broadcast the A blocks horizontally (along process rows) and the B blocks vertically (along process columns) • Local computation
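The following serial NumPy simulation (my own sketch, assuming q = √p divides n) shows what each block of C computes once the broadcasts have delivered block row i of A and block column j of B:

```python
import numpy as np

n, q = 8, 4
b = n // q                                   # block size n / sqrt(p)
A, B = np.random.rand(n, n), np.random.rand(n, n)
C = np.zeros((n, n))

for i in range(q):
    for j in range(q):
        A_row = A[i*b:(i+1)*b, :]            # block row i of A ("broadcast horizontally")
        B_col = B[:, j*b:(j+1)*b]            # block column j of B ("broadcast vertically")
        C[i*b:(i+1)*b, j*b:(j+1)*b] = A_row @ B_col   # local computation, n^3/p work each

assert np.allclose(C, A @ B)
```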

  11. Simple 2-D Blocking Algorithm • Two all-to-all broadcasts (one along each process row, one along each process column), total ts log p + 2 tw n^2/√p • Local computation: n^3/p • Total parallel time: Tp = n^3/p + ts log p + 2 tw n^2/√p • Cost: n^3 + ts p log p + 2 tw n^2 √p • Isoefficiency: W = K·To(W, p) • It holds if n = Θ(√p) (balancing the tw term) • Because W = O(n^3), the isoefficiency function is Θ(p^1.5) • Problem • Memory consumption: after the broadcasts, each process stores an entire block row of A and block column of B, Θ(n^2/√p) data instead of Θ(n^2/p)

  12. Cannon's Algorithm • Total √p steps • Data shifts between two steps in both rows and columns (each A block moves one step left, each B block one step up) • Main benefit • No need to store entire block rows/columns; memory per process stays Θ(n^2/p) • Parallel time: Tp ≈ n^3/p + 2 ts √p + 2 tw n^2/√p • Isoefficiency: Θ(p^1.5)
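A serial simulation of Cannon's schedule on a q × q block grid is sketched below (illustrative only; the list re-indexing stands in for the row/column shifts, and q = √p is assumed to divide n). Note that each grid position only ever holds one block of A and one block of B at a time:

```python
import numpy as np

n, q = 8, 4
b = n // q                       # block size, assuming q divides n
A, B = np.random.rand(n, n), np.random.rand(n, n)

# Split A and B into q x q grids of b x b blocks.
Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(q)] for i in range(q)]
Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(q)] for i in range(q)]
Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]

# Initial alignment: shift block row i of A left by i, block column j of B up by j.
Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]

for _ in range(q):               # sqrt(p) compute-and-shift steps
    for i in range(q):
        for j in range(q):
            Cb[i][j] += Ab[i][j] @ Bb[i][j]      # local block multiply-add
    # Shift A blocks one step left along rows, B blocks one step up along columns.
    Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
    Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]

C = np.block(Cb)
assert np.allclose(C, A @ B)     # sanity check against the serial product
```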

  13. DNS Algorithm • Processes arranged in a three-dimensional i-j-k grid • Distribute A • Each i-k slice has a full copy of A • Distribute B • Each k-j slice has a full copy of B • Local computation: one multiplication per process • Result reduction along the k direction • Result is in the i-j plane • Parallel time (with p = n^3 processes): Tp = Θ(log n) • Let p < n^3 with q = p^(1/3) blocks per dimension: Tp = n^3/p + ts log p + tw (n^2/p^(2/3)) log p • Isoefficiency: Θ(p log^3 p)
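The essence of DNS is that the k (summation) dimension is also parallelized: slice k forms the rank-1 product of column k of A with row k of B, and the contributions are combined by a reduction along k. A serial sketch of that decomposition (illustrative names, one slice per k index):

```python
import numpy as np

n = 4
A, B = np.random.rand(n, n), np.random.rand(n, n)

# Each k-slice computes one rank-1 contribution A[:, k] * B[k, :] locally ...
partials = [np.outer(A[:, k], B[k, :]) for k in range(n)]

# ... and the contributions are combined by an all-to-one reduction along k.
C = sum(partials)
assert np.allclose(C, A @ B)
```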

  14. Gaussian Elimination • Solve Ax = b, where A is n × n and x and b are n × 1 vectors • Transform it into Ux = y by Gaussian elimination • Three nested loops • Divisions: O(n^2) • Multiply-adds: O(n^3) • Complexity: W = O(n^3)
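A compact serial sketch of the three nested loops (no pivoting, illustrative; it rewrites copies of A and b so that the returned U is unit upper triangular and Ux = y has the same solution as Ax = b):

```python
import numpy as np

def gaussian_elimination(A, b):
    A, y = A.astype(float).copy(), b.astype(float).copy()
    n = len(y)
    for k in range(n):                          # outer loop k
        y[k] /= A[k, k]
        A[k, k+1:] /= A[k, k]                   # division step: O(n) per iteration
        A[k, k] = 1.0
        for i in range(k + 1, n):               # update the trailing submatrix
            y[i] -= A[i, k] * y[k]
            A[i, k+1:] -= A[i, k] * A[k, k+1:]  # multiply-subtract: O(n^2) per iteration
            A[i, k] = 0.0
    return A, y                                 # U (unit diagonal, upper triangular) and y
```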

  15. Gaussian Elimination • Outer loop over k • Update row k by the division step • Update the trailing (n−k−1) × (n−k−1) submatrix • The submatrix update requires row A[k, k+1..n−1] from the division step

  16. 1-D Partitioning (Rowwise) • Outer loop k • Division step on the process that owns row k • One-to-all broadcast of row k • Parallel update of rows k+1 to n−1 • Analysis (with p = n processes) • Communication: Σ (ts + tw(n−k−1)) log n ≈ ts n log n + (tw/2) n^2 log n • Computation: Σ 3(n−k−1) = (3/2) n(n−1) • Cost Θ(n^3 log n): not cost optimal
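A hedged mpi4py sketch of this non-pipelined version with p = n processes (one row per process) is shown below; the use of mpi4py, the random data, and the omission of the right-hand-side update are my simplifications, not the slides':

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
n, rank = comm.Get_size(), comm.Get_rank()   # p = n; process `rank` owns row `rank`
row = np.random.rand(n)                      # illustrative data for this row of A

for k in range(n):
    if rank == k:                            # division step on the owner of row k
        row[k+1:] /= row[k]
        row[k] = 1.0
    pivot_row = comm.bcast(row if rank == k else None, root=k)  # one-to-all broadcast of row k
    if rank > k:                             # parallel update of rows k+1 .. n-1
        row[k+1:] -= row[k] * pivot_row[k+1:]
        row[k] = 0.0
```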

  17. Pipelined GE • Previous algorithm • Outer iteration k+1 starts only after iteration k finishes • Pipelined version • Replace the broadcast with a shift to the neighboring process • Iteration k+1 starts as soon as the process holding row k+1 has received row k and forwarded it down • Analysis • Outer loop: n iterations in total • Each iteration: • O(n) local computation • O(n) communication • Total time Tp = O(n^2) • Total cost O(n^3) with n processes • Cost optimal

  18. 1-D Partitioning – Block Mapping

  19. 1-D Partitioning – Block vs. Cyclic Mapping • Block 1-D mapping: load imbalance (processes owning the first rows go idle as elimination proceeds) • Cyclic mapping: rows assigned round-robin, better load balance (see the sketch below)
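The two row-to-process mappings, written as small illustrative helpers (names are mine):

```python
def block_owner(i, n, p):
    """Block mapping: process r owns the contiguous rows r*n/p .. (r+1)*n/p - 1."""
    return i // (n // p)

def cyclic_owner(i, n, p):
    """Cyclic mapping: row i goes to process i mod p (round-robin)."""
    return i % p

# After iteration k of elimination, only rows > k still need work.  With block
# mapping the processes owning rows <= k fall idle; with cyclic mapping the
# remaining rows stay spread across all p processes until the very end.
```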

  20. 2-D Partitioning • Each process gets a 2-D block of the matrix • Basic Steps • Broadcast of the "active" column k (A[k,k] and the multipliers below it) along the process rows • Division step performed in parallel by the processes that own portions of row k • Broadcast of row k along the process columns • Rank-1 update of the trailing submatrix

  21. 2-D Partitioning, Pipelined • Analysis • Pipelined in both the rows and the columns • O(n) processing time • n^2 processes • O(n^3) cost • Cost optimal

  22. 2-D Partitioning – Block Mapping

  23. 2-D Partitioning – Block vs. Cyclic Mapping • Cyclic mapping: better load balance

  24. Solving a Triangular System: Back-Substitution • Triangular system: Ux = y • U is a unit upper-triangular matrix • W = n^2/2 multiplications and subtractions
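A serial sketch of back-substitution for a unit upper-triangular U (illustrative; roughly n^2/2 multiply-subtract operations, matching the count above):

```python
import numpy as np

def back_substitution(U, y):
    y = y.astype(float).copy()
    n = len(y)
    x = np.zeros(n)
    for k in range(n - 1, -1, -1):       # outer loop, last unknown first
        x[k] = y[k]                      # U[k, k] == 1 for a unit triangular matrix
        for i in range(k):               # subtract x[k]'s contribution from earlier rows
            y[i] -= U[i, k] * x[k]
    return x
```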

  25. Solving a Triangular System: Back-Substitution • 1-D rowwise partitioning onto p processes • Each process holds n/p rows of U and n/p elements of y • Using pipelining, n steps in total (outer loop) • Each step: • Unit-time communication to shift one x[k] • Θ(n/p) time to update its section of y (n/p elements) • Total parallel time: Θ(n^2/p), cost optimal • 2-D partitioning onto a √p × √p process mesh • Using pipelining (in both rows and columns), n steps in total • Each step still requires Θ(n/√p) time for updating y • Total parallel time: Θ(n^2/√p), not cost optimal • But the entire process, Gaussian elimination + solving the triangular system, is dominated by the GE part, W = O(n^3) • The whole process is cost optimal if n^2 √p = O(n^3), or p = O(n^2)
