Minimizing Communication in Linear Algebra

Minimizing Communication in Linear Algebra James Demmel 9 Nov 2009

Outline Background – 2 Big Events Why avoiding communication is important Communication-optimal direct linear algebra Autotuning of sparse matrix codes Communication-optimal iterative linear algebra

Collaborators Thanks to Intel, Microsoft, UC Discovery, NSF, DOE, … • Grey Ballard, UCB EECS • Ioana Dumitriu, U. Washington • Laura Grigori, INRIA • Ming Gu, UCB Math • Mark Hoemmen, UCB EECS • Olga Holtz, UCB Math & TU Berlin • Julien Langou, U. Colorado Denver • Marghoob Mohiyuddin, UCB EECS • Oded Schwartz , TU Berlin • Hua Xiang, INRIA • Sam Williams, LBL • Vasily Volkov, UCB EECS • Kathy Yelick, UCB EECS & NERSC • BeBOP group, bebop.cs.berkeley.edu • ParLab, parlab.eecs.berkeley.edu

2 Big Events • “Multicore revolution”, requiring all software (where performance matters!) to change • ParLab – Industry/State funded center with 15 faculty, ~50 grad students ( parlab.eecs.berkeley.edu ) • Rare synergy between commercial and scientific computing • Establishment of a new graduate program in Computational Science and Engineering (CSE) • 117 Faculty from 22 Departments • 60 existing courses, more being developed • CS267 – Applications of Parallel Computers • www.cs.berkeley.edu/~demmel/cs267_Spr09 • 3 Day Parallel “Boot Camp” in August • parlab.eecs.berkeley.edu/2009bootcamp

Parallelism Revolution is Happening Now • Chip density continuing to increase ~2x every 2 years • Clock speed is not • Number of processor cores may double instead • There is little or no more hidden parallelism (ILP) to be found • Parallelism must be exposed to and managed by software Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

The “7 Dwarfs” of High Performance Computing • Dense Linear Algebra • Ex: Solve Ax=b or Ax = λx where A is a dense matrix • Sparse Linear Algebra • Ex: Solve Ax=b or Ax = λx where A is a sparse matrix (mostly zero) • Operations on Structured Grids • Ex: Anew(i,j) = 4*A(i,j) – A(i-1,j) – A(i+1,j) – A(i,j-1) – A(i,j+1) • Operations on Unstructured Grids • Ex: Similar, but list of neighbors varies from entry to entry • Spectral Methods • Ex: Fast Fourier Transform (FFT) • N-Body (aka Particle Methods) • Ex: Compute electrostatic forces using Fast Multiple Method • Monte Carlo (aka MapReduce) • Ex: Many independent simulations using different inputs Phil Colella (LBL) identified 7 kernels out of which most large scale simulation and data-analysis programs are composed:

Motif/Dwarf: Common Computational Methods (Red Hot Blue Cool) www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf

2 Big Events • “Multicore revolution”, requiring all software (where performance matters!) to change • ParLab – Industry/State funded center with 15 faculty, ~50 grad students ( parlab.eecs.berkeley.edu ) • Rare synergy between commercial and scientific computing • Establishment of a new graduate program in Computational Science and Engineering (CSE) at UCB • 117 Faculty from 22 Departments • 60 existing courses, more being developed • CS267 – Applications of Parallel Computers • www.cs.berkeley.edu/~demmel/cs267_Spr09 • 3 Day Parallel “Boot Camp” in August • parlab.eecs.berkeley.edu/2009bootcamp

Why avoiding communication is important (1/2) Algorithms have two costs: Arithmetic (FLOPS) Communication: moving data between levels of a memory hierarchy (sequential case) processors over a network (parallel case). CPU Cache CPU DRAM CPU DRAM DRAM CPU DRAM CPU DRAM

Why avoiding communication is important (2/2) communication • Time_per_flop << 1/ bandwidth << latency • Gaps growing exponentially with time 59% • Goal : reorganize linear algebra to avoid communication • Between all memory hierarchy levels • L1 L2 DRAM network, etc • Notjust hiding communication (speedup  2x ) • Arbitrary speedups possible • Running time of an algorithm is sum of 3 terms: • # flops * time_per_flop • # words moved / bandwidth • # messages * latency

Naïve Sequential Matrix Multiply: C = C + A*B for i = 1 to n {read row i of A into fast memory, n2 reads} for j = 1 to n {read C(i,j) into fast memory, n2 reads} {read column j of B into fast memory, n3 reads} for k = 1 to n C(i,j) = C(i,j) + A(i,k) * B(k,j) {write C(i,j) back to slow memory, n2 writes} n3 + O(n2) reads/writes altogether A(i,:) C(i,j) C(i,j) B(:,j) = + *

Less Communication with Blocked Matrix Multiply • … Break Anxn, Bnxn, Cnxn into bxb blocks labeled A(i,j), etc • … b chosen so 3 bxb blocks fit in cache • for i = 1 to n/b, for j=1 to n/b, for k=1 to n/b • C(i,j) = C(i,j) + A(i,k)·B(k,j) … b x b matmul, 4b2 reads/writes • (n/b)3 · 4b2 = 4n3/b reads/writes altogether • Minimized when 3b2 = cache size = M, yielding O(n3/M1/2) reads/writes • What if we had more levels of memory? (L1, L2, cache etc)? • Would need 3 more nested loops per level Blocked Matmul C = A·B explicitly refers to subblocks of A, B and C of dimensions that depend on cache size

Blocked vs Cache-Oblivious Algorithms … Break Anxn, Bnxn, Cnxn into bxb blocks labeled A(i,j), etc … b chosen so 3 bxb blocks fit in cache for i = 1 to n/b, for j=1 to n/b, for k=1 to n/b C(i,j) = C(i,j) + A(i,k)·B(k,j) … b x b matmul … another level of memory would need 3 more loops • Cache-obliviousMatmul C = A·B is independent of cache Function C = MM(A,B) If A and B are 1x1 C = A · B else … Break Anxn, Bnxn, Cnxn into (n/2)x(n/2) blocks labeled A(i,j), etc for i = 1 to 2, for j = 1 to 2, for k = 1 to 2 C(i,j) = C(i,j) + MM( A(i,k), B(k,j) ) … n/2 x n/2 matmul Blocked Matmul C = A·B explicitly refers to subblocks of A, B and C of dimensions that depend on cache size

Direct linear algebra: Prior Work on Matmul • Parallel case on P processors: • Let NNZ be total memory needed; assume load balanced • Lower bound on #words communicated =  (n3 /(P· NNZ )1/2 ) [Irony, Tiskin, Toledo, 04] • Assume n3 algorithm (i.e. not Strassen-like) • Sequential case, with fast memory of size M • Lower bound on #words moved to/from slow memory =  (n3 / M1/2) [Hong, Kung, 81] • Attained using blocked or cache-oblivious algorithms

New lower bound for all “direct” linear algebra • Let M = “fast” memory size per processor • #words_moved by at least one processor = • (#flops / M1/2) • #messages_sent by at least one processor = • (#flops / M3/2) • Holds for • BLAS, LU, QR, eig, SVD, … • Some whole programs (sequences of these operations, no matter how they are interleaved, eg computing Ak) • Dense and sparse matrices (where #flops << n3 ) • Sequential and parallel algorithms • Some graph-theoretic algorithms (eg Floyd-Warshall) • See [BDHS09] – proof sketch if time!

Can we attain these lower bounds? • Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds? • Mostly not • If not, are there other algorithms that do? • Yes • Goals for algorithms: • Minimize #words =  (#flops/ M1/2 ) • Minimize #messages =  (#flops/ M3/2) • Need new data structures • Minimize for multiple memory hierarchy levels • Cache-oblivious algorithms would be simplest • Fewest flops when matrix fits in fastest memory • Cache-oblivious algorithms don’t always attain this • Only a few sparse algorithms so far

Minimizing #messages requires new data structures • Need to load/store whole rectangular subblock of matrix with one “message” • Incompatible with conventional columnwise (rowwise) storage • Ex: Rows (columns) not in contiguous memory locations • Blocked storage: store as matrix of bxb blocks, each block stored contiguously • Ok for one level of memory hierarchy, what if more? • Recursive blocked storage: store each block using subblocks

Summary of dense sequential algorithms attaining communication lower bounds • Algorithms shown minimizing # Messages use (recursive) block layout • Many references (see reports), only some shown, plus ours • Cache-oblivious are underlined, Green are ours, ? is unknown/future work

Summary of dense 2D parallel algorithms attaining communication lower bounds • Assume nxn matrices on P processors, memory per processor = O(n2 / P) • ScaLAPACK assumes best block size b chosen • Many references (see reports), Green are ours • Recall lower bounds: • #words_moved = ( n2 / P1/2 ) and #messages = ( P1/2 )

QR of a tall, skinny matrix is bottleneck;Use TSQR instead: Q is represented implicitly as a product (tree of factors) [DGHL08]

Minimizing Communication in TSQR R00 R10 R20 R30 W0 W1 W2 W3 R01 Parallel: W = R02 R11 R00 W0 W1 W2 W3 Sequential: R01 W = R02 R03 R00 R01 W0 W1 W2 W3 R01 Dual Core: W = R02 R11 R03 R11 Multicore / Multisocket / Multirack / Multisite / Out-of-core: ? Can choose reduction tree dynamically

Performance of TSQR vs Sca/LAPACK • Parallel • Intel Clovertown • Up to 8x speedup (8 core, dual socket, 10M x 10) • Pentium III cluster, Dolphin Interconnect, MPICH • Up to 6.7x speedup (16 procs, 100K x 200) • BlueGene/L • Up to 4x speedup (32 procs, 1M x 50) • Both use Elmroth-Gustavson locally – enabled by TSQR • Sequential • OOC on PowerPC laptop • As little as 2x slowdown vs (predicted) infinite DRAM • LAPACK with virtual memory never finished • See UC Berkeley EECS Tech Report 2008-89 • General CAQR = Communication-Avoiding QR in progress

Modeled Speedups of CAQR vs ScaLAPACK Petascale up to 22.9x IBM Power 5 up to 9.7x “Grid” up to 11x Petascale machine with 8192 procs, each at 500 GFlops/s, a bandwidth of 4 GB/s.

Randomized Rank-Revealing QR • Needed for eigenvalue and SVD algorithms • func [Q,R,V] = RRQR(A) • [V,R1] = qr(randn(n,n)) … V random with Haar measure • [Q,R] = qr(A · VT ) … so A = Q · R · V • Theorem (DDH). Provided that sr+1 << sr ~ s1 , RRQR produces a rank-revealing decomposition with high probability (when r =O(n), the probability is 1 – O(n-c) ). • Can apply to A = A1+/-1 A2+/-1… Ak+/-1 • Do RRQR, followed by a series of (k-1) QR / RQ to propagate unitary/orthogonal matrix to the front • Analogous to [Stewart, 94]

What about o(n3) algorithms, like Strassen? • Thm (DDHK07) • Given any matmul running in O(n) ops for some >2, it can be made stable and still run in O(n+) ops, for any >0 • Current record:   2.38 • Thm (DDH08) • Given any stable matmul running in O(n+) ops, it is possible to do backward stable dense linear algebra in O(n+) ops • Also reduces communication to O(n+) • But can do much better (work in progress…)

Direct Linear Algebra: summary and future work • Communication lower bounds on #words_moved and #messages • BLAS, LU, Cholesky, QR, eig, SVD, … • Some whole programs (compositions of these operations, no matter how they are interleaved, eg computing Ak) • Dense and sparse matrices (where #flops << n3 ) • Sequential and parallel algorithms • Some graph-theoretic algorithms (eg Floyd-Warshall) • Algorithms that attain these lower bounds • Nearly all dense problems (some open problems) • A few sparse problems • Speed ups in theory and practice • Future work • Lots to implement, tune • Next generation of Sca/LAPACK on heterogeneous architectures (MAGMA) • Few algorithms in sparse case • Strassen-like algorithms • 3D Algorithms (only for Matmul so far), may be important for scaling

How hard is hand-tuning matmul, anyway? • Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09 • Students given “blocked” code to start with • Still hard to get close to vendor tuned performance (ACML) • For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/

How hard is hand-tuning matmul, anyway?

What part of the Matmul Search Space Looks Like Number of columns in register block Number of rows in register block A 2-D slice of a 3-D register-tile search space. The dark blue region was pruned. (Platform: Sun Ultra-IIi, 333 MHz, 667 Mflop/s peak, Sun cc v5.0 compiler)

Autotuning Matmul with ATLAS (n = 500) Source: Jack Dongarra • ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor. • ATLAS written by C. Whaley, inspired by PHiPAC, by Asanovic, Bilmes, Chin, D.

Automatic Performance Tuning • Goal: Let machine do hard work of writing fast code • What do tuning of Matmul, dense BLAS, FFTs, signal processing, have in common? • Can do the tuning off-line: once per architecture, algorithm • Can take as much time as necessary (hours, a week…) • At run-time, algorithm choice may depend only on few parameters (matrix dimensions, size of FFT, etc.) • Can’t always do tuning off-line • Algorithm and implementation may strongly depend on data only known at run-time • Ex: Sparse matrix nonzero pattern determines both best data structure and implementation of Sparse-matrix-vector-multiplication (SpMV) • Part of search for best algorithm just be done (very quickly!) at run-time

Source: Accelerator Cavity Design Problem (from Ko via Husbands)

Linear Programming Matrix …

A Sparse Matrix You Use Every Day

SpMV with Compressed Sparse Row (CSR) Storage Matrix-vector multiply kernel: y(i) y(i) + A(i,j)*x(j) for each row i for k=ptr[i] to ptr[i+1] do y[i] = y[i] + val[k]*x[ind[k]] Only 2 flops per 2 mem_refs: Limited by getting data from memory

Example: The Difficulty of Tuning • n = 21200 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem

Example: The Difficulty of Tuning • n = 21200 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem • 8x8 dense substructure: exploit this to limit #mem_refs

Best: 4x2 Reference Speedups on Itanium 2: The Need for Search 8x8 Mflop/s Mflop/s

Register Profile: Itanium 2 1190 Mflop/s 190 Mflop/s

Power3 - 17% 252 Mflop/s Power4 - 16% 820 Mflop/s Register Profiles: IBM and Intel IA-64 122 Mflop/s 459 Mflop/s Itanium 1 - 8% Itanium 2 - 33% 247 Mflop/s 1.2 Gflop/s 107 Mflop/s 190 Mflop/s

Ultra 2i - 11% 72 Mflop/s Ultra 3 - 5% 90 Mflop/s Register Profiles: Sun and Intel x86 35 Mflop/s 50 Mflop/s Pentium III - 21% Pentium III-M - 15% 108 Mflop/s 122 Mflop/s 42 Mflop/s 58 Mflop/s

Another example of tuning challenges • More complicated non-zero structure in general • N = 16614 • NNZ = 1.1M

Zoom in to top corner • More complicated non-zero structure in general • N = 16614 • NNZ = 1.1M

3x3 blocks look natural, but… • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells • But would lead to lots of “fill-in”

Extra Work Can Improve Efficiency! • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells • Fill-in explicit zeros • Unroll 3x3 block multiplies • “Fill ratio” = 1.5 • On Pentium III: 1.5x speedup! • Actual mflop rate 1.52 = 2.25 higher

Extra Work Can Improve Efficiency! • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells • Fill-in explicit zeros • Unroll 3x3 block multiplies • “Fill ratio” = 1.5 • On Pentium III: 1.5x speedup! • Actual mflop rate 1.52 = 2.25 higher • In general: sample matrix to choose rxc block structure using heuristic

Source: Accelerator Cavity Design Problem (from Ko via Husbands)

Post-RCM Reordering

100x100 Submatrix Along Diagonal

“Microscopic” Effect of RCM Reordering Before: Green+ Red After: Green+Blue

Minimizing Communication in Linear Algebra