Download
minimizing communication in linear algebra n.
Skip this Video
Loading SlideShow in 5 Seconds..
Minimizing Communication in Linear Algebra PowerPoint Presentation
Download Presentation
Minimizing Communication in Linear Algebra

Minimizing Communication in Linear Algebra

218 Vues Download Presentation
Télécharger la présentation

Minimizing Communication in Linear Algebra

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Minimizing Communication in Linear Algebra James Demmel 9 Nov 2009

  2. Outline Background – 2 Big Events Why avoiding communication is important Communication-optimal direct linear algebra Autotuning of sparse matrix codes Communication-optimal iterative linear algebra

  3. Collaborators Thanks to Intel, Microsoft, UC Discovery, NSF, DOE, … • Grey Ballard, UCB EECS • Ioana Dumitriu, U. Washington • Laura Grigori, INRIA • Ming Gu, UCB Math • Mark Hoemmen, UCB EECS • Olga Holtz, UCB Math & TU Berlin • Julien Langou, U. Colorado Denver • Marghoob Mohiyuddin, UCB EECS • Oded Schwartz , TU Berlin • Hua Xiang, INRIA • Sam Williams, LBL • Vasily Volkov, UCB EECS • Kathy Yelick, UCB EECS & NERSC • BeBOP group, bebop.cs.berkeley.edu • ParLab, parlab.eecs.berkeley.edu

  4. 2 Big Events • “Multicore revolution”, requiring all software (where performance matters!) to change • ParLab – Industry/State funded center with 15 faculty, ~50 grad students ( parlab.eecs.berkeley.edu ) • Rare synergy between commercial and scientific computing • Establishment of a new graduate program in Computational Science and Engineering (CSE) • 117 Faculty from 22 Departments • 60 existing courses, more being developed • CS267 – Applications of Parallel Computers • www.cs.berkeley.edu/~demmel/cs267_Spr09 • 3 Day Parallel “Boot Camp” in August • parlab.eecs.berkeley.edu/2009bootcamp

  5. Parallelism Revolution is Happening Now • Chip density continuing to increase ~2x every 2 years • Clock speed is not • Number of processor cores may double instead • There is little or no more hidden parallelism (ILP) to be found • Parallelism must be exposed to and managed by software Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

  6. The “7 Dwarfs” of High Performance Computing • Dense Linear Algebra • Ex: Solve Ax=b or Ax = λx where A is a dense matrix • Sparse Linear Algebra • Ex: Solve Ax=b or Ax = λx where A is a sparse matrix (mostly zero) • Operations on Structured Grids • Ex: Anew(i,j) = 4*A(i,j) – A(i-1,j) – A(i+1,j) – A(i,j-1) – A(i,j+1) • Operations on Unstructured Grids • Ex: Similar, but list of neighbors varies from entry to entry • Spectral Methods • Ex: Fast Fourier Transform (FFT) • N-Body (aka Particle Methods) • Ex: Compute electrostatic forces using Fast Multiple Method • Monte Carlo (aka MapReduce) • Ex: Many independent simulations using different inputs Phil Colella (LBL) identified 7 kernels out of which most large scale simulation and data-analysis programs are composed:

  7. Motif/Dwarf: Common Computational Methods (Red Hot Blue Cool) www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf

  8. 2 Big Events • “Multicore revolution”, requiring all software (where performance matters!) to change • ParLab – Industry/State funded center with 15 faculty, ~50 grad students ( parlab.eecs.berkeley.edu ) • Rare synergy between commercial and scientific computing • Establishment of a new graduate program in Computational Science and Engineering (CSE) at UCB • 117 Faculty from 22 Departments • 60 existing courses, more being developed • CS267 – Applications of Parallel Computers • www.cs.berkeley.edu/~demmel/cs267_Spr09 • 3 Day Parallel “Boot Camp” in August • parlab.eecs.berkeley.edu/2009bootcamp

  9. Why avoiding communication is important (1/2) Algorithms have two costs: Arithmetic (FLOPS) Communication: moving data between levels of a memory hierarchy (sequential case) processors over a network (parallel case). CPU Cache CPU DRAM CPU DRAM DRAM CPU DRAM CPU DRAM

  10. Why avoiding communication is important (2/2) communication • Time_per_flop << 1/ bandwidth << latency • Gaps growing exponentially with time 59% • Goal : reorganize linear algebra to avoid communication • Between all memory hierarchy levels • L1 L2 DRAM network, etc • Notjust hiding communication (speedup  2x ) • Arbitrary speedups possible • Running time of an algorithm is sum of 3 terms: • # flops * time_per_flop • # words moved / bandwidth • # messages * latency

  11. Naïve Sequential Matrix Multiply: C = C + A*B for i = 1 to n {read row i of A into fast memory, n2 reads} for j = 1 to n {read C(i,j) into fast memory, n2 reads} {read column j of B into fast memory, n3 reads} for k = 1 to n C(i,j) = C(i,j) + A(i,k) * B(k,j) {write C(i,j) back to slow memory, n2 writes} n3 + O(n2) reads/writes altogether A(i,:) C(i,j) C(i,j) B(:,j) = + *

  12. Less Communication with Blocked Matrix Multiply • … Break Anxn, Bnxn, Cnxn into bxb blocks labeled A(i,j), etc • … b chosen so 3 bxb blocks fit in cache • for i = 1 to n/b, for j=1 to n/b, for k=1 to n/b • C(i,j) = C(i,j) + A(i,k)·B(k,j) … b x b matmul, 4b2 reads/writes • (n/b)3 · 4b2 = 4n3/b reads/writes altogether • Minimized when 3b2 = cache size = M, yielding O(n3/M1/2) reads/writes • What if we had more levels of memory? (L1, L2, cache etc)? • Would need 3 more nested loops per level Blocked Matmul C = A·B explicitly refers to subblocks of A, B and C of dimensions that depend on cache size

  13. Blocked vs Cache-Oblivious Algorithms … Break Anxn, Bnxn, Cnxn into bxb blocks labeled A(i,j), etc … b chosen so 3 bxb blocks fit in cache for i = 1 to n/b, for j=1 to n/b, for k=1 to n/b C(i,j) = C(i,j) + A(i,k)·B(k,j) … b x b matmul … another level of memory would need 3 more loops • Cache-obliviousMatmul C = A·B is independent of cache Function C = MM(A,B) If A and B are 1x1 C = A · B else … Break Anxn, Bnxn, Cnxn into (n/2)x(n/2) blocks labeled A(i,j), etc for i = 1 to 2, for j = 1 to 2, for k = 1 to 2 C(i,j) = C(i,j) + MM( A(i,k), B(k,j) ) … n/2 x n/2 matmul Blocked Matmul C = A·B explicitly refers to subblocks of A, B and C of dimensions that depend on cache size

  14. Direct linear algebra: Prior Work on Matmul • Parallel case on P processors: • Let NNZ be total memory needed; assume load balanced • Lower bound on #words communicated =  (n3 /(P· NNZ )1/2 ) [Irony, Tiskin, Toledo, 04] • Assume n3 algorithm (i.e. not Strassen-like) • Sequential case, with fast memory of size M • Lower bound on #words moved to/from slow memory =  (n3 / M1/2) [Hong, Kung, 81] • Attained using blocked or cache-oblivious algorithms

  15. New lower bound for all “direct” linear algebra • Let M = “fast” memory size per processor • #words_moved by at least one processor = • (#flops / M1/2) • #messages_sent by at least one processor = • (#flops / M3/2) • Holds for • BLAS, LU, QR, eig, SVD, … • Some whole programs (sequences of these operations, no matter how they are interleaved, eg computing Ak) • Dense and sparse matrices (where #flops << n3 ) • Sequential and parallel algorithms • Some graph-theoretic algorithms (eg Floyd-Warshall) • See [BDHS09] – proof sketch if time!

  16. Can we attain these lower bounds? • Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds? • Mostly not • If not, are there other algorithms that do? • Yes • Goals for algorithms: • Minimize #words =  (#flops/ M1/2 ) • Minimize #messages =  (#flops/ M3/2) • Need new data structures • Minimize for multiple memory hierarchy levels • Cache-oblivious algorithms would be simplest • Fewest flops when matrix fits in fastest memory • Cache-oblivious algorithms don’t always attain this • Only a few sparse algorithms so far

  17. Minimizing #messages requires new data structures • Need to load/store whole rectangular subblock of matrix with one “message” • Incompatible with conventional columnwise (rowwise) storage • Ex: Rows (columns) not in contiguous memory locations • Blocked storage: store as matrix of bxb blocks, each block stored contiguously • Ok for one level of memory hierarchy, what if more? • Recursive blocked storage: store each block using subblocks

  18. Summary of dense sequential algorithms attaining communication lower bounds • Algorithms shown minimizing # Messages use (recursive) block layout • Many references (see reports), only some shown, plus ours • Cache-oblivious are underlined, Green are ours, ? is unknown/future work

  19. Summary of dense 2D parallel algorithms attaining communication lower bounds • Assume nxn matrices on P processors, memory per processor = O(n2 / P) • ScaLAPACK assumes best block size b chosen • Many references (see reports), Green are ours • Recall lower bounds: • #words_moved = ( n2 / P1/2 ) and #messages = ( P1/2 )

  20. QR of a tall, skinny matrix is bottleneck;Use TSQR instead: Q is represented implicitly as a product (tree of factors) [DGHL08]

  21. Minimizing Communication in TSQR R00 R10 R20 R30 W0 W1 W2 W3 R01 Parallel: W = R02 R11 R00 W0 W1 W2 W3 Sequential: R01 W = R02 R03 R00 R01 W0 W1 W2 W3 R01 Dual Core: W = R02 R11 R03 R11 Multicore / Multisocket / Multirack / Multisite / Out-of-core: ? Can choose reduction tree dynamically

  22. Performance of TSQR vs Sca/LAPACK • Parallel • Intel Clovertown • Up to 8x speedup (8 core, dual socket, 10M x 10) • Pentium III cluster, Dolphin Interconnect, MPICH • Up to 6.7x speedup (16 procs, 100K x 200) • BlueGene/L • Up to 4x speedup (32 procs, 1M x 50) • Both use Elmroth-Gustavson locally – enabled by TSQR • Sequential • OOC on PowerPC laptop • As little as 2x slowdown vs (predicted) infinite DRAM • LAPACK with virtual memory never finished • See UC Berkeley EECS Tech Report 2008-89 • General CAQR = Communication-Avoiding QR in progress

  23. Modeled Speedups of CAQR vs ScaLAPACK Petascale up to 22.9x IBM Power 5 up to 9.7x “Grid” up to 11x Petascale machine with 8192 procs, each at 500 GFlops/s, a bandwidth of 4 GB/s.

  24. Randomized Rank-Revealing QR • Needed for eigenvalue and SVD algorithms • func [Q,R,V] = RRQR(A) • [V,R1] = qr(randn(n,n)) … V random with Haar measure • [Q,R] = qr(A · VT ) … so A = Q · R · V • Theorem (DDH). Provided that sr+1 << sr ~ s1 , RRQR produces a rank-revealing decomposition with high probability (when r =O(n), the probability is 1 – O(n-c) ). • Can apply to A = A1+/-1 A2+/-1… Ak+/-1 • Do RRQR, followed by a series of (k-1) QR / RQ to propagate unitary/orthogonal matrix to the front • Analogous to [Stewart, 94]

  25. What about o(n3) algorithms, like Strassen? • Thm (DDHK07) • Given any matmul running in O(n) ops for some >2, it can be made stable and still run in O(n+) ops, for any >0 • Current record:   2.38 • Thm (DDH08) • Given any stable matmul running in O(n+) ops, it is possible to do backward stable dense linear algebra in O(n+) ops • Also reduces communication to O(n+) • But can do much better (work in progress…)

  26. Direct Linear Algebra: summary and future work • Communication lower bounds on #words_moved and #messages • BLAS, LU, Cholesky, QR, eig, SVD, … • Some whole programs (compositions of these operations, no matter how they are interleaved, eg computing Ak) • Dense and sparse matrices (where #flops << n3 ) • Sequential and parallel algorithms • Some graph-theoretic algorithms (eg Floyd-Warshall) • Algorithms that attain these lower bounds • Nearly all dense problems (some open problems) • A few sparse problems • Speed ups in theory and practice • Future work • Lots to implement, tune • Next generation of Sca/LAPACK on heterogeneous architectures (MAGMA) • Few algorithms in sparse case • Strassen-like algorithms • 3D Algorithms (only for Matmul so far), may be important for scaling

  27. How hard is hand-tuning matmul, anyway? • Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09 • Students given “blocked” code to start with • Still hard to get close to vendor tuned performance (ACML) • For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/

  28. How hard is hand-tuning matmul, anyway?

  29. What part of the Matmul Search Space Looks Like Number of columns in register block Number of rows in register block A 2-D slice of a 3-D register-tile search space. The dark blue region was pruned. (Platform: Sun Ultra-IIi, 333 MHz, 667 Mflop/s peak, Sun cc v5.0 compiler)

  30. Autotuning Matmul with ATLAS (n = 500) Source: Jack Dongarra • ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor. • ATLAS written by C. Whaley, inspired by PHiPAC, by Asanovic, Bilmes, Chin, D.

  31. Automatic Performance Tuning • Goal: Let machine do hard work of writing fast code • What do tuning of Matmul, dense BLAS, FFTs, signal processing, have in common? • Can do the tuning off-line: once per architecture, algorithm • Can take as much time as necessary (hours, a week…) • At run-time, algorithm choice may depend only on few parameters (matrix dimensions, size of FFT, etc.) • Can’t always do tuning off-line • Algorithm and implementation may strongly depend on data only known at run-time • Ex: Sparse matrix nonzero pattern determines both best data structure and implementation of Sparse-matrix-vector-multiplication (SpMV) • Part of search for best algorithm just be done (very quickly!) at run-time

  32. Source: Accelerator Cavity Design Problem (from Ko via Husbands)

  33. Linear Programming Matrix

  34. A Sparse Matrix You Use Every Day

  35. SpMV with Compressed Sparse Row (CSR) Storage Matrix-vector multiply kernel: y(i) y(i) + A(i,j)*x(j) for each row i for k=ptr[i] to ptr[i+1] do y[i] = y[i] + val[k]*x[ind[k]] Only 2 flops per 2 mem_refs: Limited by getting data from memory

  36. Example: The Difficulty of Tuning • n = 21200 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem

  37. Example: The Difficulty of Tuning • n = 21200 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem • 8x8 dense substructure: exploit this to limit #mem_refs

  38. Best: 4x2 Reference Speedups on Itanium 2: The Need for Search 8x8 Mflop/s Mflop/s

  39. Register Profile: Itanium 2 1190 Mflop/s 190 Mflop/s

  40. Power3 - 17% 252 Mflop/s Power4 - 16% 820 Mflop/s Register Profiles: IBM and Intel IA-64 122 Mflop/s 459 Mflop/s Itanium 1 - 8% Itanium 2 - 33% 247 Mflop/s 1.2 Gflop/s 107 Mflop/s 190 Mflop/s

  41. Ultra 2i - 11% 72 Mflop/s Ultra 3 - 5% 90 Mflop/s Register Profiles: Sun and Intel x86 35 Mflop/s 50 Mflop/s Pentium III - 21% Pentium III-M - 15% 108 Mflop/s 122 Mflop/s 42 Mflop/s 58 Mflop/s

  42. Another example of tuning challenges • More complicated non-zero structure in general • N = 16614 • NNZ = 1.1M

  43. Zoom in to top corner • More complicated non-zero structure in general • N = 16614 • NNZ = 1.1M

  44. 3x3 blocks look natural, but… • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells • But would lead to lots of “fill-in”

  45. Extra Work Can Improve Efficiency! • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells • Fill-in explicit zeros • Unroll 3x3 block multiplies • “Fill ratio” = 1.5 • On Pentium III: 1.5x speedup! • Actual mflop rate 1.52 = 2.25 higher

  46. Extra Work Can Improve Efficiency! • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells • Fill-in explicit zeros • Unroll 3x3 block multiplies • “Fill ratio” = 1.5 • On Pentium III: 1.5x speedup! • Actual mflop rate 1.52 = 2.25 higher • In general: sample matrix to choose rxc block structure using heuristic

  47. Source: Accelerator Cavity Design Problem (from Ko via Husbands)

  48. Post-RCM Reordering

  49. 100x100 Submatrix Along Diagonal

  50. “Microscopic” Effect of RCM Reordering Before: Green+ Red After: Green+Blue