1. Minimizing Communication in Numerical Linear Algebra: Communication Lower Bounds for Direct Linear Algebra
Jim Demmel, EECS & Math Departments, UC Berkeley, demmel@cs.berkeley.edu
www.cs.berkeley.edu/~demmel

2. Outline
• Recall motivation
• Lower bound proof for matmul, by Irony/Toledo/Tiskin
• How to generalize it to linear algebra without orthogonal transforms
• How to generalize it to linear algebra with orthogonal transforms
• Summary of linear algebra problems for which the lower bound is attainable
• Summary of extensions to Strassen

3. Collaborators
• Grey Ballard, UCB EECS
• Ioana Dumitriu, U. Washington
• Laura Grigori, INRIA
• Ming Gu, UCB Math
• Mark Hoemmen, UCB EECS
• Olga Holtz, UCB Math & TU Berlin
• Julien Langou, U. Colorado Denver
• Marghoob Mohiyuddin, UCB EECS
• Oded Schwartz, TU Berlin
• Hua Xiang, INRIA
• Kathy Yelick, UCB EECS & NERSC
• BeBOP group, bebop.cs.berkeley.edu

4. Motivation (1/2)
• Two kinds of costs:
  • Arithmetic (flops)
  • Communication: moving data
    • between levels of a memory hierarchy (sequential case)
    • over a network connecting processors (parallel case)
[Figure: sequential case, a CPU with cache and DRAM; parallel case, CPUs with local DRAM connected by a network]

5. Motivation (2/2)
• Running time of an algorithm is the sum of 3 terms:
  • #flops · time_per_flop
  • #words_moved / bandwidth
  • #messages · latency
• Exponentially growing gaps between these terms:
  • time_per_flop << 1/network_bandwidth << network_latency, improving 59%/year vs 26%/year vs 15%/year
  • time_per_flop << 1/memory_bandwidth << memory_latency, improving 59%/year vs 23%/year vs 5.5%/year
• Goal: reorganize linear algebra to avoid communication
  • between all memory hierarchy levels: L1, L2, DRAM, network, etc.
  • not just hiding communication (speedup ≤ 2x); arbitrary speedups are possible
• (A sketch of this cost model appears below.)
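
To make the three-term model concrete, here is a minimal Python sketch of it. The machine parameters and the matmul traffic estimates are illustrative assumptions, not measured values.

    # Three-term running-time model; parameter values are illustrative.
    def run_time(n_flops, n_words, n_messages,
                 time_per_flop=1e-11, bandwidth=1e10, latency=1e-6):
        return (n_flops * time_per_flop      # arithmetic term
                + n_words / bandwidth        # bandwidth term
                + n_messages * latency)      # latency term

    # Example: 2n^3-flop matmul, comparing naive traffic (~n^3 words) with
    # blocked traffic (~2n^3 / M^(1/2) words, M = fast memory size).
    n, M = 4096, 10**6
    print(run_time(2 * n**3, n**3, n**3 // M))                       # naive
    print(run_time(2 * n**3, 2 * n**3 / M**0.5, 2 * n**3 / M**1.5))  # blocked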

6. Direct linear algebra: prior work on matmul
• Parallel case on P processors:
  • Let NNZ be the total memory needed; assume load balance
  • Lower bound on #words communicated = Ω(n^3 / (P·NNZ)^(1/2)) [Irony, Tiskin & Toledo (2004)]
  • When NNZ = 3n^2 (the "2D algorithm"), get Ω(n^2 / P^(1/2)); attained by Cannon's algorithm
  • For the "3D algorithm" (NNZ = O(n^2·P^(1/3))), get Ω(n^2 / P^(2/3)); attainable too (details later)
  • Assumes an n^3 algorithm (i.e., not Strassen-like)
• Sequential case, with fast memory of size M:
  • Lower bound on #words moved to/from slow memory = Ω(n^3 / M^(1/2)) [Hong & Kung (1981)]
  • Attained using a blocked or recursive ("cache-oblivious") algorithm

7. Direct linear algebra: generalizing prior work
• In both cases, let M = memory size (fast memory, or memory per processor) and write:
  #words_moved by at least one processor = Ω(#flops / M^(1/2))
• Sequential case: #words_moved = Ω(n^3 / M^(1/2)) = Ω(#flops / fast_memory_size^(1/2))
• Parallel case: #words_moved = Ω(n^3 / (P·NNZ)^(1/2)) = Ω((n^3/P) / (NNZ/P)^(1/2)) = Ω(#flops_per_proc / memory_per_processor^(1/2))
  • (true for at least one processor, assuming "balance" in either flops or memory)

8. Lower bound for all "direct" linear algebra
• #words_moved by at least one processor = Ω(#flops / M^(1/2))
• #messages_sent by at least one processor = Ω(#flops / M^(3/2))
• Holds for:
  • Matmul, BLAS, LU, QR, eig, SVD (need to explain the model of computation)
  • Some whole programs (sequences of these operations, no matter how they are interleaved)
  • Dense and sparse matrices (where #flops << n^3)
  • Sequential and parallel algorithms
  • Some graph-theoretic algorithms

9. Proof of communication lower bound on C = A·B (1/5)
• Proof from Irony/Toledo/Tiskin (2004); original proof, then generalization
• Think of the instruction stream being executed:
  • It looks like "… add, load, multiply, store, load, add, …"
  • Each load/store moves a word between fast and slow memory, or between local memory and remote memory (another processor)
• We want to count the number of loads and stores, given that we are multiplying n-by-n matrices C = A·B using the usual 2n^3 flops, possibly reordered (assuming addition is commutative/associative)
• Assume at most M words can be stored in fast memory
• Outline:
  • Break the instruction stream into segments, each with M loads and stores
  • Somehow bound the maximum number of flops that can be done in each segment; call it F
  • Then F · #segments ≥ T = total flops = 2n^3, so #segments ≥ T / F
  • So #loads & stores = M · #segments ≥ M · T / F

10. Illustrating segments, for M = 3
[Figure: a timeline of the instruction stream (load, load, load, flop, store, flop, load, load, load, flop, flop, flop, store, store, flop, store, load, flop, …) divided into Segment 1, Segment 2, Segment 3, …, each containing M = 3 loads/stores]


12. Proof of communication lower bound on C = A·B (2/5)
• Given a segment of the instruction stream with M loads & stores, how many adds & multiplies F can we do?
  • At most 2M entries of C, 2M entries of A, and/or 2M entries of B can be accessed (M already in fast memory, plus M loaded/stored)
• Use geometry:
  • Represent the n^3 multiplications by an n x n x n cube
  • One n x n face represents A; each 1 x 1 subsquare represents one A(i,k)
  • One n x n face represents B; each 1 x 1 subsquare represents one B(k,j)
  • One n x n face represents C; each 1 x 1 subsquare represents one C(i,j)
  • Each 1 x 1 x 1 subcube represents one C(i,j) += A(i,k) · B(k,j)
    • The product may be added directly to C(i,j), or to a temporary accumulator

13. Proof of communication lower bound on C = A·B (3/5)
[Figure: the n x n x n cube with its "A face", "B face", and "C face"; the highlighted 1 x 1 x 1 subcube represents C(1,1) += A(1,3)·B(3,1)]
• If we have at most 2M "A squares", 2M "B squares", and 2M "C squares" on the faces, how many cubes can we have?

14. Proof of communication lower bound on C = A·B (4/5)
[Figure: a 3D set of unit cubes with its three "shadows" (projections onto the coordinate planes), and a black box with side lengths x, y, z]
• (i,k) is in the "A shadow" if (i,j,k) is in the 3D set
• (j,k) is in the "B shadow" if (i,j,k) is in the 3D set
• (i,j) is in the "C shadow" if (i,j,k) is in the 3D set
• Thm (Loomis & Whitney, 1949): #cubes in 3D set = volume of 3D set ≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2)
• Check on a black box with side lengths x, y, z: #cubes = volume = x·y·z = (xz · zy · yx)^(1/2) = (#A squares · #B squares · #C squares)^(1/2)

15. Proof of communication lower bound on C = A·B (5/5)
• Consider one "segment" of instructions with M loads and stores
• At most 2M entries each of A, B, and C are available in one segment
• Volume of the set of cubes representing the possible multiply/adds in one segment is ≤ (2M · 2M · 2M)^(1/2) = (2M)^(3/2) ≡ F
• #segments ≥ 2n^3 / F
• #loads & stores ≥ M · (#segments - 1) ≥ M · 2n^3 / F - M = n^3 / (2M)^(1/2) - M = Ω(n^3 / M^(1/2))
• Parallel case: apply the same reasoning to one processor out of P
  • #adds and muls ≥ 2n^3 / P (at least one processor does this much)
  • M = n^2 / P (each processor gets an equal fraction of the matrices)
  • #loads & stores = #words moved from/to other processors ≥ M · (2n^3/P) / F = M · (2n^3/P) / (2M)^(3/2) = n^2 / (2P)^(1/2)
• (A numeric comparison with a blocked algorithm appears below.)
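
As a sanity check on attainability, one can compare this bound with the traffic of a blocked algorithm whose block size b is chosen so that three b x b blocks fit in fast memory. A Python sketch; the problem sizes are illustrative:

    from math import sqrt

    def blocked_words_moved(n, M):
        b = int(sqrt(M / 3))              # three b x b blocks fit in fast memory
        return (n // b)**3 * 3 * b * b    # each block matmul touches 3b^2 words

    def lower_bound_words(n, M):
        return n**3 / (2 * M)**0.5 - M

    n, M = 4096, 3 * 64**2
    print(blocked_words_moved(n, M), lower_bound_words(n, M))
    # Both scale as n^3 / M^(1/2): the blocked algorithm attains the
    # lower bound up to a constant factor.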

16. Proof of the Loomis-Whitney inequality
[Figure: the 3D set T, a slice T(x=i), and the projections T(x=i|y), T(x=i|z)]
• Setup:
  • T = 3D set of 1x1x1 cubes on a lattice; N = |T| = #cubes
  • T_x = projection of T onto the x=0 plane; N_x = |T_x| = #squares in T_x; same for T_y, N_y, T_z, N_z
  • T(x=i) = subset of T with x = i; N(x=i) = |T(x=i)|
  • T(x=i|y) = projection of T(x=i) onto the y=0 plane; N(x=i|y) = |T(x=i|y)|, etc.
• Goal: N ≤ (N_x · N_y · N_z)^(1/2)
• Key facts: N(x=i) ≤ N_x (each slice projects into T_x), and N(x=i) ≤ N(x=i|y) · N(x=i|z)
• Proof:
  N = Σ_i N(x=i) = Σ_i (N(x=i))^(1/2) · (N(x=i))^(1/2)
    ≤ Σ_i (N_x)^(1/2) · (N(x=i))^(1/2)
    ≤ (N_x)^(1/2) · Σ_i (N(x=i|y) · N(x=i|z))^(1/2)
    = (N_x)^(1/2) · Σ_i (N(x=i|y))^(1/2) · (N(x=i|z))^(1/2)
    ≤ (N_x)^(1/2) · (Σ_i N(x=i|y))^(1/2) · (Σ_i N(x=i|z))^(1/2)   … by Cauchy-Schwarz
    = (N_x)^(1/2) · (N_y)^(1/2) · (N_z)^(1/2)
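
A quick numeric check of the inequality on a random lattice set; this Python sketch is an illustration, not part of the proof:

    import random

    random.seed(0)
    T = {(random.randrange(8), random.randrange(8), random.randrange(8))
         for _ in range(200)}                 # random 3D set of unit cubes
    N = len(T)
    Nx = len({(y, z) for (x, y, z) in T})     # shadow on the x=0 plane
    Ny = len({(x, z) for (x, y, z) in T})     # shadow on the y=0 plane
    Nz = len({(x, y) for (x, y, z) in T})     # shadow on the z=0 plane
    assert N <= (Nx * Ny * Nz)**0.5
    print(N, (Nx * Ny * Nz)**0.5)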

17. Homework
• Prove the more general statement of Loomis-Whitney:
  • Suppose T is d-dimensional; N = |T| = #d-dimensional cubes in T
  • T(i) is the projection of T onto the hyperplane x(i) = 0
  • N(i) = (d-1)-dimensional volume of T(i)
  • Show N ≤ Π_{i=1..d} (N(i))^(1/(d-1))

18. How to generalize this lower bound (1/4)
• The proof doesn't depend on C and A not overlapping (or C and B, or A and B)
  • Some reorderings may change the answer; we still get a lower bound for all reorderings
• It doesn't depend on doing n^3 multiply/adds
• It doesn't depend on C(i,j) being just a sum of products C(i,j) = Σ_k A(i,k)·B(k,j); the general model (P) is:
  For all (i,j) ∈ S:
  Mem(c(i,j)) = f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )
• It doesn't depend on c(i,j) being a matrix entry, just a unique memory location (same for a(i,k) and b(k,j))
• (A small instantiation of (P) for matmul appears below.)
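
For concreteness, a small Python sketch instantiating model (P) for ordinary matmul, where f_ij is summation, g_ijk is multiplication, and S_ij = {0, …, n-1}; all names here are illustrative:

    # Model (P) specialized to matmul: f_ij = sum, g_ijk = multiply.
    def evaluate_P(mem_a, mem_b, n, f=sum, g=lambda x, y: x * y):
        mem_c = {}
        for i in range(n):
            for j in range(n):
                # Mem(c(i,j)) = f_ij( g_ijk(Mem(a(i,k)), Mem(b(k,j))) for k in S_ij )
                mem_c[i, j] = f(g(mem_a[i, k], mem_b[k, j]) for k in range(n))
        return mem_c

    a = {(i, k): 1.0 for i in range(2) for k in range(2)}
    b = {(k, j): 2.0 for k in range(2) for j in range(2)}
    print(evaluate_P(a, b, 2))   # every c(i,j) = 4.0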

19. How to generalize this lower bound (2/4)
(P) For all (i,j) ∈ S: Mem(c(i,j)) = f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )
• Model (P) does assume that each operand generates a load and/or store; how could this not happen?
• Mem(b(k,j)) could be reused in many more g_ijk than (P) allows
  • Ex: compute C(m) = A * (B.^m) (Matlab notation) for m = 1 to t
  • Then we can move many fewer words than Ω(#flops / M^(1/2))
• We might generate a result during a segment, use it, and discard it, without generating any memory traffic
  • It turns out QR, eig, and SVD all may do this
  • They need a different analysis (later)

20. How to generalize this lower bound (3/4)
(P) For all (i,j) ∈ S: Mem(c(i,j)) = f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )
• Need to distinguish the sources and destinations of each operand in fast memory during a segment:
• Possible sources:
  • S1: already in fast memory at the start of the segment, or read during it; at most 2M
  • S2: created during the segment; no bound without more information
• Possible destinations:
  • D1: left in fast memory at the end of the segment, or written during it; at most 2M
  • D2: discarded; no bound without more information
• Need to assume no S2/D2 arguments; at most 4M of the others

21. How to generalize this lower bound (4/4)
(P) For all (i,j) ∈ S: Mem(c(i,j)) = f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )
• Theorem: evaluating (P) with memory of size M, where
  • f_ij and g_ijk are "nontrivial" functions of their arguments,
  • G is the total number of g_ijk's, and
  • there are no S2/D2 arguments,
  requires at least G / (8M)^(1/2) - M slow memory references
• Simpler: #words_moved = Ω(#flops / M^(1/2))
• Corollary: evaluating (P) requires at least G / (8^(1/2) · M^(3/2)) - 1 messages (simpler: Ω(#flops / M^(3/2)))
  • Proof: the maximum message size is M

22. Some corollaries of this lower bound (1/4)
(P) For all (i,j) ∈ S: Mem(c(i,j)) = f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments ) ⇒ #words_moved = Ω(#flops / M^(1/2))
• The theorem applies to dense or sparse, parallel or sequential:
• Matmul, including A^T·A or A^2
• Triangular solve C = A^(-1)·X:
  • C(i,j) = (X(i,j) - Σ_{k<i} A(i,k)·C(k,j)) / A(i,i) … A lower triangular
  • C plays the double role of b and c in model (P)
• LU factorization (any pivoting, LU or "ILU"):
  • L(i,j) = (A(i,j) - Σ_{k<j} L(i,k)·U(k,j)) / U(j,j); U(i,j) = A(i,j) - Σ_{k<i} L(i,k)·U(k,j)
  • L (and U) plays the double role of c and a (c and b) in model (P)
• Cholesky (any diagonal pivoting, C or "IC"):
  • L(i,j) = (A(i,j) - Σ_{k<j} L(i,k)·L^T(k,j)) / L(j,j); L(j,j) = (A(j,j) - Σ_{k<j} L(j,k)·L^T(k,j))^(1/2)
  • L (and L^T) plays the triple role of a, b, and c (an unblocked transcription appears below)
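
To illustrate how Cholesky fits model (P), here is a direct Python transcription of the recurrence above; being unblocked, it is a sketch for reading, not an algorithm that attains the lower bound:

    import numpy as np

    def cholesky_unblocked(A):
        n = A.shape[0]
        L = np.zeros_like(A, dtype=float)
        for j in range(n):
            L[j, j] = np.sqrt(A[j, j] - sum(L[j, k]**2 for k in range(j)))
            for i in range(j + 1, n):
                # L plays the roles of a, b, and c in model (P)
                L[i, j] = (A[i, j]
                           - sum(L[i, k] * L[j, k] for k in range(j))) / L[j, j]
        return L

    A = np.array([[4., 2.], [2., 3.]])
    L = cholesky_unblocked(A)
    assert np.allclose(L @ L.T, A)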

23. Some corollaries of this lower bound (2/4)
(P) For all (i,j) ∈ S: Mem(c(i,j)) = f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments ) ⇒ #words_moved = Ω(#flops / M^(1/2))
• Applies to "simplified" operations:
  • Ex 1: compute ||A·B||_F where A(i,k) = f(i,k) and B(k,j) = g(k,j), so there are no inputs and one output; assume each f(i,k) and g(k,j) is evaluated once
  • Ex 2: compute the determinant of A(i,j) = f(i,j) using LU
• Apply the lower bound by "imposing" reads and writes:
  • Ex 1: every time a final value of (A·B)(i,j) is computed, write it; every time f(i,k) or g(k,j) is evaluated, insert a read
  • Ex 2: every time a final value of L(i,j) or U(i,j) is computed, write it; every time f(i,j) is evaluated, insert a read
• Still get Ω(#flops / M^(1/2) - 3n^2) words moved, by subtracting the "imposed" reads/writes (the sparse case is analogous)

24. Some corollaries of this lower bound (3/4)
(P) For all (i,j) ∈ S: Mem(c(i,j)) = f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments ) ⇒ #words_moved = Ω(#flops / M^(1/2))
• Applies to compositions of operations
• Ex: compute A^t by repeated multiplication, [A^2, A^3, …, A^t] = [A, A^2, …, A^(t-1)] · A, with only input A and output A^t
• Reorganize the composition to match (P)
• Impose writes of the intermediate results A^2, …, A^(t-1)
• Still get #words_moved = Ω(#flops / M^(1/2) - (t-2)·n^2)
• Holds for any interleaving of the operations
• Homework: apply this to repeated squaring

25. Some corollaries of this lower bound (4/4)
(P) For all (i,j) ∈ S: Mem(c(i,j)) = f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments ) ⇒ #words_moved = Ω(#flops / M^(1/2))
• Applies to some graph algorithms
• Ex: All-Pairs-Shortest-Paths by Floyd-Warshall:
  • Matches (P) with g_ijk = "+" and f_ij = "min"
  • Get #words_moved = Ω(n^3 / M^(1/2)) for dense graphs
  • Homework: state the result for sparse graphs; how does the n^3 change?
Initialize all path(i,j) = length of edge from node i to j (or ∞ if none)
for k := 1 to n
  for i := 1 to n
    for j := 1 to n
      path(i,j) = min( path(i,j), path(i,k) + path(k,j) )
(A runnable version appears below.)
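
A runnable Python version of the pseudocode above; the small graph is an illustrative example:

    import numpy as np

    def floyd_warshall(adj):
        # Matches model (P) with g_ijk = "+" and f_ij = "min".
        path = adj.astype(float).copy()
        n = path.shape[0]
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    path[i, j] = min(path[i, j], path[i, k] + path[k, j])
        return path

    INF = np.inf
    adj = np.array([[0.,  3., INF],
                    [3.,  0., 1.],
                    [INF, 1., 0.]])
    print(floyd_warshall(adj))   # e.g. path(0,2) = 4, via node 1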

26. "3D" parallel matrix multiplication C = C + A·B
[Figure: p processors arranged in a p^(1/3) x p^(1/3) x p^(1/3) grid, with tree networks along the "fibers"]
• Each A(i,k), B(k,j), C(i,j) is an n/p^(1/3) x n/p^(1/3) block
• Initially:
  • Proc (i,j,0) owns C(i,j)
  • Proc (i,0,k) owns A(i,k)
  • Proc (0,j,k) owns B(k,j)
• Algorithm (α = latency, β = reciprocal bandwidth):
  • For all (i,k), broadcast A(i,k) to proc (i,j,k) ∀j … comm. cost = log(p^(1/3))·α + log(p^(1/3))·(n^2/p^(2/3))·β
  • For all (j,k), broadcast B(k,j) to proc (i,j,k) ∀i … same comm. cost
  • For all (i,j,k), Tmp(i,j,k) = A(i,k) · B(k,j) … cost = 2(n/p^(1/3))^3 = 2n^3/p flops
  • For all (i,j), reduce C(i,j) = Σ_k Tmp(i,j,k) … same comm. cost
• Total comm. cost = O(log(p)·(n^2/p^(2/3))·β + log(p)·α)
• Lower bound = Ω((n^3/p)/(n^2/p^(2/3))^(1/2)·β + (n^3/p)/(n^2/p^(2/3))^(3/2)·α) = Ω((n^2/p^(2/3))·β + α)
• Lower bound also ~attainable for all n^2/p^(2/3) ≤ M = n^2/p^x ≤ n^2/p [Toledo et al]
• (A sequential emulation of this algorithm appears below.)
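
A sequential numpy emulation of the 3D algorithm that checks correctness and tallies the words each "processor" receives. This is a sketch under simplifying assumptions: no real message passing, broadcast and reduction costs counted per receiving processor, and tree-internal traffic ignored.

    import numpy as np

    def matmul_3d_emulated(A, B, q):
        # Emulate a q x q x q grid (p = q^3 processors); blocks are (n/q) x (n/q).
        n = A.shape[0]
        b = n // q
        words = 0                              # total words received, all procs
        C = np.zeros((n, n))
        for i in range(q):
            for j in range(q):
                acc = np.zeros((b, b))         # reduction of Tmp(i,j,:) over k
                for k in range(q):             # proc (i,j,k) receives A(i,k), B(k,j)
                    Aik = A[i*b:(i+1)*b, k*b:(k+1)*b]
                    Bkj = B[k*b:(k+1)*b, j*b:(j+1)*b]
                    words += 2 * b * b
                    acc += Aik @ Bkj           # 2(n/q)^3 local flops
                words += b * b                 # words arriving at proc (i,j,0)
                C[i*b:(i+1)*b, j*b:(j+1)*b] = acc
        return C, words

    n, q = 64, 4                               # p = q^3 = 64 "processors"
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    C, words = matmul_3d_emulated(A, B, q)
    assert np.allclose(C, A @ B)
    print(words / q**3, "words/proc vs lower bound ~", n * n / (q**3)**(2/3))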

27. So when doesn't model (P) apply?
(P) For all (i,j) ∈ S: Mem(c(i,j)) = f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments ) ⇒ #words_moved = Ω(#flops / M^(1/2))
• Ex: for r = 1 to t, C(r) = A * (B.^(1/r)) … Matlab notation
• (P) does not apply:
  • With A(i,k) and B(k,j) in memory, we can do t operations, not just one g_ijk
  • Can't apply (P) with arguments b(k,j) indexed in a one-to-one fashion
• Can still analyze using segments and Loomis-Whitney:
  • #flops per segment ≤ 8(t·M^3)^(1/2)
  • #segments ≥ (t·n^3) / (8(t·M^3)^(1/2))
  • #words_moved = Ω(t^(1/2) · n^3 / M^(1/2)), not Ω(t · n^3 / M^(1/2))
• Attainable, using a variation of the usual blocked matmul algorithm
• Homework! (Hint: recompute (B(k,j))^(1/r) when needed)

28. Lower bounds for orthogonal transformations (1/4)
• Needed for QR, eig, SVD, …
• Analyze blocked Householder transformations:
  • Π_{j=1..b} (I - 2·u_j·u_j^T) = I - U·T·U^T, where U = [u_1, …, u_b]
  • Treat Givens rotations as 2x2 Householder transformations
• Model details and assumptions:
  • Write (I - U·T·U^T)·A = A - U·(T·U^T·A) = A - U·Z, where Z = T·(U^T·A)
  • Only count the operations in the A - U·Z updates; "generically" a large fraction of the work
  • Assume "forward progress": each successive Householder transformation leaves previously created zeros zero. OK for:
    • QR decomposition
    • Reductions to condensed forms (Hessenberg, tri/bidiagonal); possible exception: bulge chasing in the banded case
    • One sweep of (block) Hessenberg QR iteration
• (A sketch of the compact representation appears below.)
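
A numpy sketch of the compact form above: accumulate U and T so that Π_j (I - 2·u_j·u_j^T) = I - U·T·U^T, then apply the block as A - U·(T·(U^T·A)). The unit vectors u_j here are random, purely for illustration (each τ_j = 2 since the u_j are normalized), not vectors from an actual QR factorization.

    import numpy as np

    def accumulate_wy(us):
        # Standard compact-WY recurrence (each u_j a unit vector, tau_j = 2).
        U = us[0][:, None]
        T = np.array([[2.0]])
        for u in us[1:]:
            t = -2.0 * (T @ (U.T @ u))         # new last column of T
            T = np.block([[T, t[:, None]],
                          [np.zeros((1, T.shape[1])), np.array([[2.0]])]])
            U = np.hstack([U, u[:, None]])
        return U, T

    rng = np.random.default_rng(0)
    n, b = 6, 3
    us = [(lambda v: v / np.linalg.norm(v))(rng.standard_normal(n))
          for _ in range(b)]
    U, T = accumulate_wy(us)
    A = rng.standard_normal((n, n))
    A_updated = A - U @ (T @ (U.T @ A))        # the A - U·Z update
    P = np.eye(n)
    for u in us:                               # explicit product of reflectors
        P = P @ (np.eye(n) - 2.0 * np.outer(u, u))
    assert np.allclose(np.eye(n) - U @ T @ U.T, P)
    assert np.allclose(A_updated, P @ A)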

29. Lower bounds for orthogonal transformations (2/4)
• We perform many updates A - U·Z, where Z = T·(U^T·A)
• First challenge in applying the theorem: need to collect all the A - U·Z updates into one big set to which model (P) applies
  • Write them all as { A(i,j) = A(i,j) - Σ_k U(i,k)·Z(k,j) }, where k = index of the k-th transformation
  • k is not necessarily the index of the column of A it comes from
• Second challenge: all the Z(k,j) may be S2/D2
  • Recall: S2/D2 means computed on the fly and discarded
  • Ex: A_{n x 2n} = Q_{n x n} · R_{n x 2n}, with 2n^2 = M so A fits in cache
    • Represent Q as n(n-1)/2 2x2 Householder (Givens) transforms
    • There are n^2(n-1)/2 = Θ(M^(3/2)) nonzero Z(k,j), not O(M)
  • We still do only O(M^(3/2)) flops during a segment, but we can't use Loomis-Whitney to prove it!
  • Need a new idea…

30. Lower bounds for orthogonal transformations (3/4)
• Dealing with Z(k,j) being S2/D2: how do we bound the #flops in A(i,j) = A(i,j) - Σ_k U(i,k)·Z(k,j)?
• Neither A nor U is S2/D2:
  • A is eventually turned into R or U, which are output
  • So there are at most 2M entries of each during a segment
• #flops ≤ (#U(i,k)) · (#columns A(:,j) present) ≤ (#U(i,k)) · (#A(i,j) / min #nonzeros per column of A) ≤ h · O(M) / r, where h = #U(i,k) = O(M) … how small can r be?
• To store h ≡ #U(i,k) Householder vector entries in r rows, there can be at most r in the first column, r-1 in the second, etc. (to maintain "forward progress"), so r(r-1)/2 ≥ h, hence r ≥ h^(1/2) (up to a constant)
• #flops ≤ h · O(M) / r ≤ O(M) · h^(1/2) = O(M^(3/2)), as desired

31. Lower bounds for orthogonal transformations (4/4)
• Theorem: #words_moved by QR decomposition using (blocked) Householder transformations = Ω(#flops / M^(1/2))
• Theorem: #words_moved by reduction to Hessenberg, tridiagonal, or bidiagonal form, or by one sweep of QR iteration (or block versions of any of these) = Ω(#flops / M^(1/2))
• Assuming forward progress (created zeros remain zero)
• Model: merge left and right orthogonal transformations:
  A(i,j) = A(i,j) - Σ_{kL} U_L(i,kL)·Z_L(kL,j) - Σ_{kR} Z_R(i,kR)·U_R(j,kR)

32. Robust PCA (Candes, 2009)
• Decompose a matrix into a low-rank component plus a sparse component: M = L + S
• Homework: apply the lower bound result

33. Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds? Mostly not.
• If not, are there other algorithms that do? Yes.
• Several goals for algorithms:
  • Minimize bandwidth (Ω(#flops / M^(1/2))) and latency (Ω(#flops / M^(3/2)))
  • Handle multiple memory hierarchy levels (attain the bound for each level?)
  • Explicit dependence on (multiple) M's vs. "cache oblivious"
  • Fewest flops when the problem fits in the smallest memory
• What about sparse algorithms?

34. Recall: minimizing latency requires new data structures
• To minimize latency, we need to load/store a whole rectangular subblock of a matrix with one "message"
• Incompatible with conventional columnwise (rowwise) storage
  • Ex: rows (columns) are not in contiguous memory locations
• Blocked storage: store the matrix as a matrix of b x b blocks, each block stored contiguously (sketch below)
• OK for one level of the memory hierarchy; what if there are more?
• Recursive blocked storage: store each block using subblocks
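
A small numpy sketch of copying a matrix into blocked storage, so that each b x b block is contiguous and can move in one "message"; function names and sizes are illustrative:

    import numpy as np

    def to_blocked(A, b):
        # Reshape an n x n matrix into an (n/b) x (n/b) array of contiguous
        # b x b blocks (assumes b divides n).
        n = A.shape[0]
        return A.reshape(n // b, b, n // b, b).swapaxes(1, 2).copy()

    A = np.arange(36.0).reshape(6, 6)
    Ab = to_blocked(A, 3)            # Ab[i, j] is the contiguous (i,j) block
    assert np.array_equal(Ab[1, 0], A[3:6, 0:3])
    # Recursive blocked storage would apply to_blocked again within each block.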

35. Recall: blocked vs cache-oblivious algorithms
• Blocked matmul C = A·B explicitly refers to subblocks of A, B, and C of dimensions that depend on the cache size:
  … break A_{n x n}, B_{n x n}, C_{n x n} into b x b blocks labeled A(i,j), etc.
  … b chosen so 3 b x b blocks fit in cache
  for i = 1 to n/b, for j = 1 to n/b, for k = 1 to n/b
    C(i,j) = C(i,j) + A(i,k)·B(k,j) … b x b matmul
  … another level of memory hierarchy would need 3 more loops
• Cache-oblivious matmul C = A·B is independent of the cache size:
  function C = MM(A,B)
    if A and B are 1x1
      C = A · B
    else
      … break A_{n x n}, B_{n x n}, C_{n x n} into (n/2) x (n/2) blocks labeled A(i,j), etc.
      for i = 1 to 2, for j = 1 to 2, for k = 1 to 2
        C(i,j) = C(i,j) + MM( A(i,k), B(k,j) ) … (n/2) x (n/2) matmul
• (Runnable versions of both appear below.)
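
Runnable Python versions of both variants; this is a sketch, with the block size and the recursion cutoff as illustrative stand-ins for "fits in cache":

    import numpy as np

    def blocked_matmul(A, B, b):
        # Explicitly blocked: b chosen so three b x b blocks fit in cache.
        n = A.shape[0]
        C = np.zeros((n, n))
        for i in range(0, n, b):
            for j in range(0, n, b):
                for k in range(0, n, b):
                    C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
        return C

    def recursive_matmul(A, B, cutoff=16):
        # Cache-oblivious: recurse on (n/2) x (n/2) blocks, no cache parameter.
        n = A.shape[0]
        if n <= cutoff:
            return A @ B
        h = n // 2
        C = np.zeros((n, n))
        for i in (0, 1):
            for j in (0, 1):
                for k in (0, 1):
                    C[i*h:(i+1)*h, j*h:(j+1)*h] += recursive_matmul(
                        A[i*h:(i+1)*h, k*h:(k+1)*h],
                        B[k*h:(k+1)*h, j*h:(j+1)*h], cutoff)
        return C

    n = 128
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(blocked_matmul(A, B, 32), A @ B)
    assert np.allclose(recursive_matmul(A, B), A @ B)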

36. Summary of dense sequential algorithms attaining communication lower bounds
[Table: algorithms with their #words_moved and #messages; not recoverable from the transcript]
• Algorithms shown minimizing #messages assume (recursive) block layout
• Many references (see reports); only some shown, plus ours
• Older references may or may not include an analysis
• In the table, cache-oblivious algorithms are underlined, green entries are ours, and "?" marks unknown/future work

37. Summary of dense 2D parallel algorithms attaining communication lower bounds
[Table: algorithms with their #words_moved and #messages; not recoverable from the transcript]
• Assume n x n matrices on P processors, memory per processor = O(n^2 / P)
• Many references (see reports); green entries are ours
• ScaLAPACK assumes the best block size b is chosen
• Recall the lower bounds: #words_moved = Ω(n^2 / P^(1/2)) and #messages = Ω(P^(1/2))

38. Recent communication-optimal algorithms
• QR with column pivoting
• Cholesky with diagonal pivoting
• LU with "complete" pivoting
• LDL^T with "complete" pivoting
• Sparse Cholesky:
  • For matrices with "good separators"
  • For "most" sparse matrices (as hard as the dense case)

39. Homework: extend the lower bound
• To algorithms using (block) Givens rotations instead of (block) Householder transformations
• To QR done with Gram-Schmidt orthogonalization (CGS and MGS)
• To heterogeneous collections of processors:
  • Suppose processor k does φ(k) flops per second, has reciprocal bandwidth β(k), and has latency α(k)
  • What is a lower bound on the solution time, for any fractions of the work assigned to each processor (i.e., not all equal)?
• To a homogeneous parallel shared-memory machine:
  • All data initially resides in a large shared memory
  • Processors communicate via data written to / read from the slow shared memory
  • Each processor has a local fast memory of size M

  40. Extra slides
