The Future of Numerical Linear Algebra: Automatic Performance Tuning of Sparse Matrix Codes; The Next LAPACK and ScaLAPACK www.cs.berkeley.edu/~demmel/Utah_Apr05.ppt James Demmel, UC Berkeley
Outline • Automatic Performance Tuning of Sparse Matrix Kernels • Updating LAPACK and ScaLAPACK
Berkeley Benchmarking and OPtimization (BeBOP) • Prof. Katherine Yelick • Rich Vuduc • Many results in this talk are from Vuduc’s PhD thesis, www.cs.berkeley.edu/~richie • Rajesh Nishtala, Mark Hoemmen, Hormozd Gahvari • Eun-Jin Im and many other earlier contributors • Other papers at bebop.cs.berkeley.edu
Motivation for Automatic Performance Tuning • Writing high performance software is hard • Make programming easier while getting high speed • Ideal: program in your favorite high level language (Matlab, PETSc…) and get a high fraction of peak performance • Reality: Best algorithm (and its implementation) can depend strongly on the problem, computer architecture, compiler,… • Best choice can depend on knowing a lot of applied mathematics and computer science • How much of this can we teach? • How much of this can we automate?
Examples of Automatic Performance Tuning • Dense BLAS • Sequential • PHiPAC (UCB), then ATLAS (UTK) • Now in Matlab, many other releases • math-atlas.sourceforge.net/ • Fast Fourier Transform (FFT) & variations • FFTW (MIT) • Sequential and Parallel • 1999 Wilkinson Software Prize • www.fftw.org • Digital Signal Processing • SPIRAL: www.spiral.net (CMU) • MPI Collectives (UCB, UTK) • More projects, conferences, government reports, …
Tuning Dense BLAS – ATLAS • Extends applicability of PHiPAC • Incorporated in Matlab (with the rest of LAPACK)
How tuning works, so far • What do dense BLAS, FFTs, signal processing, MPI reductions have in common? • Can do the tuning off-line: once per architecture, algorithm • Can take as much time as necessary (hours, a week…) • At run-time, algorithm choice may depend only on few parameters • Matrix dimension, size of FFT, etc.
Register Tile Size Selection in Dense Matrix Multiply [Figure: an m x k by k x n matrix multiply, with A tiled into m0 x k0 blocks, B into k0 x n0 blocks, and each m0 x n0 block of C computed from them.]
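As a rough illustration of what one point in that tile-size space looks like, here is a minimal C sketch of C += A*B with a fixed 2x3 register tile (m0 = 2, n0 = 3, k0 = 1). The function name and the assumption that M and N divide evenly are mine; a real generator such as PHiPAC or ATLAS would emit and time many such variants:

    /* C += A*B with a 2x3 register tile (m0=2, n0=3, k0=1).
       A is MxK, B is KxN, C is MxN, all row-major.
       For brevity, assumes M is a multiple of 2 and N a multiple of 3. */
    void matmul_2x3_tile(int M, int N, int K,
                         const double *A, const double *B, double *C)
    {
        for (int i = 0; i < M; i += 2) {
            for (int j = 0; j < N; j += 3) {
                /* keep the 2x3 block of C in registers */
                double c00 = 0, c01 = 0, c02 = 0, c10 = 0, c11 = 0, c12 = 0;
                for (int k = 0; k < K; k++) {
                    double a0 = A[i*K + k], a1 = A[(i+1)*K + k];
                    double b0 = B[k*N + j], b1 = B[k*N + j+1], b2 = B[k*N + j+2];
                    c00 += a0*b0; c01 += a0*b1; c02 += a0*b2;
                    c10 += a1*b0; c11 += a1*b1; c12 += a1*b2;
                }
                C[i*N + j]     += c00; C[i*N + j+1]     += c01; C[i*N + j+2]     += c02;
                C[(i+1)*N + j] += c10; C[(i+1)*N + j+1] += c11; C[(i+1)*N + j+2] += c12;
            }
        }
    }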
Tuning Register Tile Sizes (Dense Matrix Multiply) • 333 MHz Sun Ultra 2i • 2-D slice of the 3-D tile-size space; implementations color-coded by performance in Mflop/s • 16 registers, but the 2-by-3 tile size is fastest • A needle in a haystack
90% of implementations perform at < 33% of peak, 99% perform at < 66% of peak, and only 0.4% perform at >= 80% of peak
Limits of off-line tuning • Algorithm and its implementation may strongly depend on data known only at run-time • Ex: sparse matrix nonzero pattern determines both the best data structure and the best implementation of sparse matrix-vector multiplication (SpMV) • Can’t afford to generate and test thousands of algorithms and implementations at run-time! • The BeBOP project addresses sparse tuning
Motivation for Automatic Performance Tuning of SpMV • SpMV widely used in practice • Kernel of iterative solvers for • linear systems • eigenvalue problems • singular value problems • Historical trends • Sparse matrix-vector multiply (SpMV): 10% of peak or less • 2x faster than CSR with “hand-tuning” • Tuning becoming more difficult over time
Approach to Automatic Performance Tuning of SpMV • Our approach: empirical modeling and search • Off-line: measure performance of variety of data structures and SpMV algorithms • On-line: sample matrix, use performance model to predict which data structure/algorithm is best • Results • Up to 4x speedups and 31% of peak for SpMV • Using register blocking • Many other optimization techniques for SpMV
SpMV with Compressed Sparse Row (CSR) Storage
Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)*x(j)
for each row i
    for k = ptr[i] to ptr[i+1]-1 do
        y[i] = y[i] + val[k]*x[ind[k]]
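A minimal C rendering of this CSR kernel, assuming 0-based indices and the array names from the slide (the function signature and types are mine):

    #include <stddef.h>

    /* y <- y + A*x for an m-row sparse matrix A stored in CSR.
       val[k] : nonzero values
       ind[k] : column index of val[k]
       ptr[i] : start of row i in val/ind; ptr[m] = nnz */
    void spmv_csr(size_t m, const double *val, const int *ind,
                  const int *ptr, const double *x, double *y)
    {
        for (size_t i = 0; i < m; i++) {
            double yi = y[i];
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                yi += val[k] * x[ind[k]];
            y[i] = yi;
        }
    }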
Example 1: The Difficulty of Tuning • n = 21216 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem • 8x8 dense substructure
Taking advantage of block structure in SpMV • Bottleneck is time to get matrix from memory • Only 2 flops for each nonzero in matrix • Goal: decrease size of data structure • Don’t store each nonzero with index, instead store each nonzero r-by-c block with one index • Storage drops by up to 2x (if rc >> 1, all 32-bit quantities) • Time to fetch matrix from memory decreases • Change both data structure and algorithm • Need to pick r and c • Need to change algorithm accordingly • In example, is r=c=8 best choice? • Minimizes storage, so looks like a good idea…
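To make the blocked idea concrete, here is a sketch of SpMV for a 2x2 block CSR (BCSR) matrix: one column index is stored per block and the 2x2 block multiply is fully unrolled. This is an illustration, not the BeBOP code; the array names (bval, bind, brow_ptr) and the assumption that the row and column counts are even are mine:

    /* y <- y + A*x for A stored in 2x2 block CSR (BCSR).
       Each stored block holds 4 values (row-major), with one column index
       per block (the block's leftmost column).
       mb          : number of block rows (the matrix has 2*mb rows)
       brow_ptr[i] : start of block row i; brow_ptr[mb] = number of blocks
       bind[b]     : leftmost column of block b
       bval        : block values, 4 per block */
    void spmv_bcsr_2x2(int mb, const double *bval, const int *bind,
                       const int *brow_ptr, const double *x, double *y)
    {
        for (int i = 0; i < mb; i++) {
            double y0 = y[2*i], y1 = y[2*i+1];
            for (int b = brow_ptr[i]; b < brow_ptr[i+1]; b++) {
                const double *v = &bval[4*b];
                double x0 = x[bind[b]], x1 = x[bind[b]+1];
                y0 += v[0]*x0 + v[1]*x1;   /* unrolled 2x2 block multiply */
                y1 += v[2]*x0 + v[3]*x1;
            }
            y[2*i]   = y0;
            y[2*i+1] = y1;
        }
    }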
Speedups on Itanium 2: The Need for Search [Figure: SpMV performance in Mflop/s for each r x c register block size; the best is 4x2, compared with the unblocked (reference) CSR code.]
SpMV Performance (Matrix #2), Generation 1 machines [Figure: best register-blocked vs. unblocked reference performance. Power3: 13% of peak, 195 vs. 100 Mflop/s; Power4: 14%, 703 vs. 469 Mflop/s; Itanium 1: 7%, 225 vs. 103 Mflop/s; Itanium 2: 31%, 1.1 Gflop/s vs. 276 Mflop/s.]
Register Profile: Itanium 2 [Figure: SpMV performance across r x c block sizes, ranging from 190 to 1190 Mflop/s.]
Register Profiles: IBM and Intel IA-64 [Figure: Power3: 17% of peak, 122 to 252 Mflop/s; Power4: 16%, 459 to 820 Mflop/s; Itanium 1: 8%, 107 to 247 Mflop/s; Itanium 2: 33%, 190 Mflop/s to 1.2 Gflop/s.]
Register Profiles: Sun and Intel x86 [Figure: Ultra 2i: 11% of peak, 35 to 72 Mflop/s; Ultra 3: 5%, 50 to 90 Mflop/s; Pentium III: 21%, 42 to 108 Mflop/s; Pentium III-M: 15%, 58 to 122 Mflop/s.]
Example 2: The Difficulty of Tuning • n = 21216 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem
Zoom in to top corner • More complicated non-zero structure in general
3x3 blocks look natural, but… • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells • But would lead to lots of “fill-in”: 1.5x
Extra Work Can Improve Efficiency! • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells • Fill in explicit zeros • Unroll 3x3 block multiplies • “Fill ratio” = 1.5 • On Pentium III: 1.5x speedup! • Actual Mflop rate is 1.5^2 = 2.25x higher, since 1.5x as many flops are done in 1/1.5 the time
Automatic Register Block Size Selection • Selecting the r x c block size • Off-line benchmark • Precompute Mflops(r,c) using a dense matrix A stored in sparse r x c blocked format, for each r x c • Once per machine/architecture • Run-time “search” • Sample A to estimate Fill(r,c) for each r x c • Run-time heuristic model • Choose r, c to minimize estimated time ~ Fill(r,c) / Mflops(r,c)
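A sketch of that heuristic in C. The Mflops(r,c) table comes from the off-line benchmark, and estimate_fill() is the run-time sampling step (sketched after the next slide); the names, the 8x8 search space, and the 1% sample fraction are illustrative assumptions, not the BeBOP API:

    /* Choose the r x c register block size minimizing the modeled time
       Fill(r,c) / Mflops(r,c). mflops_profile[][] is the off-line dense
       benchmark; estimate_fill() samples the CSR matrix at run time. */
    #define RMAX 8
    #define CMAX 8

    extern double mflops_profile[RMAX+1][CMAX+1];
    extern double estimate_fill(int m, int n, const int *ptr, const int *ind,
                                int r, int c, double frac);

    void choose_block_size(int m, int n, const int *ptr, const int *ind,
                           int *r_best, int *c_best)
    {
        double best = 1e300;
        *r_best = *c_best = 1;
        for (int r = 1; r <= RMAX; r++)
            for (int c = 1; c <= CMAX; c++) {
                double fill = estimate_fill(m, n, ptr, ind, r, c, 0.01);
                double t = fill / mflops_profile[r][c];   /* relative time model */
                if (t < best) { best = t; *r_best = r; *c_best = c; }
            }
    }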
Accurate and Efficient Adaptive Fill Estimation • Idea: sample the matrix • Fraction of matrix to sample: s ∈ [0,1] • Cost ~ O(s * nnz) • Control cost by controlling s • Search at run-time: the constant matters! • Control s automatically by computing statistical confidence intervals • Idea: monitor variance • Cost of tuning • Heuristic: costs 1 to 11 unblocked SpMVs • Converting the matrix costs 5 to 40 unblocked SpMVs • Tuning is a good idea when doing lots of SpMVs
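One plausible implementation of the sampled fill estimate, written against the CSR arrays used earlier. This simple version uses a fixed sample fraction; the adaptive control of s via confidence intervals described above is omitted:

    #include <stdlib.h>

    /* Estimate Fill(r,c) = (entries stored after r x c blocking) / (true nnz)
       for an m x n CSR matrix, by examining a random fraction 'frac' of its
       block rows. */
    double estimate_fill(int m, int n, const int *ptr, const int *ind,
                         int r, int c, double frac)
    {
        int nbrows = (m + r - 1) / r, nbcols = (n + c - 1) / c;
        int *last = malloc(nbcols * sizeof *last);    /* block-column marker */
        for (int j = 0; j < nbcols; j++) last[j] = -1;

        long long blocks = 0, nnz_seen = 0;
        for (int I = 0; I < nbrows; I++) {
            if ((double)rand() / RAND_MAX > frac) continue;  /* sample block rows */
            int row_end = (I + 1) * r < m ? (I + 1) * r : m;
            for (int i = I * r; i < row_end; i++) {
                for (int k = ptr[i]; k < ptr[i+1]; k++) {
                    int J = ind[k] / c;
                    nnz_seen++;
                    if (last[J] != I) { last[J] = I; blocks++; } /* new r x c block */
                }
            }
        }
        free(last);
        return nnz_seen ? (double)blocks * r * c / nnz_seen : 1.0;
    }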
Test Matrix Collection • Many on-line sources (see Vuduc’s thesis) • Matrix 1 – dense (in sparse format) • Matrices 2-9: FEM with one block size r x c • N from 14K to 62K, NNZ from 1M to 3M • Fluid flow, structural mechanics, materials … • Matrices 10-17: FEM with multiple block sizes • N from 17K to 52K, NNZ from .5M to 2.7M • Fluid flow, buckling, … • Matrices 18 – 37: “Other” • N from 5K to 75K, NNZ from 33K to .5M • Power grid, chem eng, finance, semiconductors, … • Matrices 40 – 44: Linear Programming • (N,M) from (3K,13K) to (15K,77K), NNZ from 50K to 2M
Accuracy of the Tuning Heuristics (1/4) • See p. 375 of Vuduc’s thesis for the list of matrices • NOTE: “fair” flops used (ops on explicit zeros not counted as “work”)
Evaluating algorithms and machines for SpMV • Some speedups look good, but could we do better? • Questions • What is the best speedup possible? • Independent of instruction scheduling, selection • Can SpMV be further improved or not? • What machines are “good” for SpMV? • How can architectures be changed to improve SpMV?
Upper Bounds on Performance for register blocked SpMV • P = (flops) / (time) • Flops = 2 * nnz(A) … don’t count extra work on zeros • Lower bound on time: two main assumptions • 1. Count memory ops only (streaming) • 2. Count only compulsory and capacity misses: ignore conflicts • Account for line sizes • Account for matrix size and nnz • Charge the minimum access “latency” α_i at each cache level L_i and α_mem at memory • e.g., from the Saavedra-Barrera and PMaC MAPS benchmarks
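Spelled out as a formula (my reading of the bullets above; the model in Vuduc's thesis has this shape), the time lower bound charges α_i to the hits at each of the L cache levels and α_mem to the misses at the last level, giving an upper bound on performance:

    T_{\mathrm{lower}} \;=\; \sum_{i=1}^{L} \alpha_i \,\mathrm{Hits}_i \;+\; \alpha_{\mathrm{mem}}\,\mathrm{Misses}_L ,
    \qquad
    P_{\mathrm{upper}} \;=\; \frac{2\,\mathrm{nnz}(A)}{T_{\mathrm{lower}}}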
Example: L2 Misses on Itanium 2 • Misses measured using PAPI [Browne ’00]
Summary of Other Performance Optimizations • Optimizations for SpMV • Register blocking (RB): up to 4x over CSR • Variable block splitting: 2.1x over CSR, 1.8x over RB • Diagonals: 2x over CSR • Reordering to create dense structure + splitting: 2x over CSR • Symmetry: 2.8x over CSR, 2.6x over RB • Cache blocking: 2.8x over CSR • Multiple vectors (SpMM): 7x over CSR • And combinations… • Sparse triangular solve • Hybrid sparse/dense data structure: 1.8x over CSR • Higher-level kernels • AA^T*x, A^TA*x: 4x over CSR, 1.8x over RB • A^2*x: 2x over CSR, 1.5x over RB
Example: Sparse Triangular Factor • Raefsky4 (structural problem) + SuperLU + colmmd • N = 19779, nnz = 12.6 M • Dense trailing triangle: dim = 2268, 20% of total nonzeros (can be as high as 90+%!) • 1.8x over CSR
Cache Optimizations for AA^T*x • Cache-level: interleave multiplication by A and A^T, using AA^T*x = Σ_i a_i (a_i^T x), i.e. a dot product followed by an “axpy” for each column a_i of A • Register-level: treat each a_i^T as an r x c block row, or a diagonal row • Algorithmic-level transformations for A^2*x, A^3*x, …
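A sketch of the cache-level interleaving in C, for A stored by columns (compressed sparse column, CSC): each column a_j is brought into cache once and used for both the dot product t = a_j^T x and the axpy y += t*a_j. The storage-format choice and the array names are assumptions made for the illustration:

    /* y <- y + A*A^T*x for an m x n matrix A stored in CSC, interleaving
       the dot product and the axpy so that each column of A is read from
       memory only once.
       colptr[j] : start of column j in val/rowind; colptr[n] = nnz
       rowind[k] : row index of val[k] */
    void aat_times_x_csc(int n, const double *val, const int *rowind,
                         const int *colptr, const double *x, double *y)
    {
        for (int j = 0; j < n; j++) {
            double t = 0.0;                          /* t = a_j^T x (dot product) */
            for (int k = colptr[j]; k < colptr[j+1]; k++)
                t += val[k] * x[rowind[k]];
            for (int k = colptr[j]; k < colptr[j+1]; k++)
                y[rowind[k]] += t * val[k];          /* y += t * a_j (axpy) */
        }
    }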
Impact on Applications: Omega3P • Application: accelerator cavity design [Ko] • Relevant optimization techniques • Symmetric storage • Register blocking • Reordering rows and columns to improve blocking • Reverse Cuthill-McKee ordering to reduce bandwidth • Traveling Salesman Problem-based ordering to create blocks • [Pinar & Heath ’97] • Make columns adjacent if they have many common nonzero rows • 2.1x speedup on Power 4