Relational Query Processing Approach to Compiling Sparse Matrix Codes


Presentation Transcript


  1. Relational Query Processing Approach to Compiling Sparse Matrix Codes Vladimir Kotlyar Computer Science Department, Cornell University http://www.cs.cornell.edu/Info/Project/Bernoulli

  2. Outline • Problem statement • Sparse matrix computations • Importance of sparse matrix formats • Difficulties in the development of sparse matrix codes • State-of-the-art restructuring compiler technology • Technical approach and experimental results • Ongoing work and conclusions

  3. Sparse Matrices and Their Applications • Number of non-zeroes per row/column << n • Often, less than 0.1% non-zero • Applications: • Numerical simulations, (non)linear optimization, graph theory, information retrieval, ...

  4. Application: numerical simulations • Fracture mechanics Grand Challenge project: • Cornell CS + Civil Eng. + other schools; • supported by NSF, NASA, Boeing • A system of differential equations is solved over a continuous domain • Discretized into an algebraic system in variables x(i) • System of linear equations Ax=b is at the core • Intuition: A is sparse because the physical interactions are local

  5. Application: Authoritative sources on the Web • Hubs and authorities on the Web • Graph G=(V,E) of the documents • A(u,v) = 1 if (u,v) is an edge • A is sparse! • Eigenvectors of AᵀA and AAᵀ identify hubs, authorities and their clusters (“communities”) [Kleinberg, Raghavan ‘97]

  6. Sparse matrix algorithms • Solution of linear systems • Direct methods (Gaussian elimination): A = LU • Impractical for many large-scale problems • For certain problems: O(n) space, O(n) time • Iterative methods • Matrix-vector products: y = Ax • Triangular system solution: Lx=b • Incomplete factorizations: A ≈ LU • Eigenvalue problems: • Mostly matrix-vector products + dense computations

  7. Sparse matrix computations • “DOANY” -- operations in any order • Vector ops (dot product, addition, scaling) • Matrix-vector products • Rarely used: C = A+B • Important: C ← A+B, A ← A + UVᵀ • “DOACROSS” -- dependencies between operations • Triangular system solution: Lx = b • More complex applications are built out of the above + dense kernels • Preprocessing (e.g. storage allocation): “graph theory”

  8. Outline • Problem statement • Sparse matrix computations • Sparse Matrix Storage Formats • Difficulties in the development of sparse matrix codes • State-of-the-art restructuring compiler technology • Technical approach and experiments • Ongoing work and conclusions

  9. Storing Sparse Matrices • Compressed formats are essential • O(nnz) time/space, not O(n²) • Example: matrix-vector product • 10M rows/columns, 50 non-zeroes/row • 5 seconds vs 139 hours on a 200Mflops computer (assuming huge memory) • A variety of formats are used in practice • Application/architecture dependent • Different memory usage • Different performance on RISC processors

  10. Point formats • Coordinate: an explicit <row, column, value> triple per non-zero • Compressed Column Storage (CCS): non-zeroes stored column by column, indexed by an array of column pointers [slide shows an example matrix in both formats]
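As a rough illustration (ours, not the toolkit's actual data structures), the two point formats could be declared in C as follows; the field names are assumptions, chosen to match the colp/rowind/vals names that appear in the generated code on slide 17.

/* Coordinate (COO): one <row, column, value> triple per non-zero. */
struct coo_matrix {
    int     nnz;       /* number of non-zeroes          */
    int    *row;       /* row index of each non-zero    */
    int    *col;       /* column index of each non-zero */
    double *val;       /* value of each non-zero        */
};

/* Compressed Column Storage (CCS): non-zeroes stored column by column.
   colp has n+1 entries; column j occupies positions colp[j] .. colp[j+1]-1
   of rowind (row indices) and vals (values). */
struct ccs_matrix {
    int     n;         /* number of columns           */
    int    *colp;      /* column pointers, length n+1 */
    int    *rowind;    /* row index of each non-zero  */
    double *vals;      /* value of each non-zero      */
};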

  11. Block formats • Block Sparse Column • “Natural” for physical problems with several unknowns at each point in space • Saves storage: 25% for 2-by-2 blocks • Improves performance on modern RISC processors
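(Our back-of-the-envelope arithmetic for the 25% figure, assuming 8-byte values and 4-byte indices: four non-zeroes stored pointwise cost 4×(8+4) = 48 bytes, while a 2-by-2 block stores the same four values plus a single block index, 4×8 + 4 = 36 bytes, i.e. 25% less.)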

  12. Why multiple formats: performance • Sparse matrix-vector product • Formats: CRS, Jagged diagonal, BlockSolve • On IBM RS6000 (66.5 MHz Power2) • Best format depends on the application (20-70% advantage)

  13. Bottom line • Sparse matrices are used in a variety of application areas • Have to be stored in compressed data structures • Many formats are used in practice • Different storage/performance characteristics • Code development is tedious and error-prone • No random access • Different code for each format • Even worse in parallel (many ways to distribute the data)

  14. Libraries • Dense computations: Basic Linear Algebra Subroutines • Implemented by most computer vendors • Few formats, easy to parametrize: row/column-major, symmetric/unsymmetric, etc • Other computations are built on top of BLAS • Can we do the same for sparse matrices?

  15. Sparse Matrix Libraries • Sparse Basic Linear Algebra Subroutine (SPBLAS) library [Pozo,Remington @ NIST] • 13 formats ==> too many combinations of “A op B” • Some important ops are not supported • Not extensible • Coarse-grain solver packages [BlockSolve,Aztec,…] • Particular class of problems/algorithms (e.g. iterative solution) • OO approaches: hooks for basic ops (e.g. matrix-vector product)

  16. Our goal: generate sparse codes automatically • Permit user-defined sparse data structures • Specialize high-level algorithm for sparsity, given the formats

FOR I=1,N
  sum = sum + X(I)*Y(I)

is specialized into executable code of the form:

FOR I=1,N such that X(I)≠0 and Y(I)≠0
  sum = sum + X(I)*Y(I)
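A minimal sketch (ours, not the compiler's output) of what the executable code could look like when X is stored as a compressed sparse vector and Y is a dense array; the layout and names (xind, xval) are assumptions:

/* Dot product specialized for a sparse X and a dense Y: enumerate only the
   non-zeroes of X.  The guard on Y(I) is unnecessary in this case, since a
   zero entry of Y contributes nothing to the sum. */
double sparse_dot(int xnnz, const int *xind, const double *xval,
                  const double *Y)
{
    double sum = 0.0;
    for (int ii = 0; ii < xnnz; ii++)
        sum += xval[ii] * Y[xind[ii]];
    return sum;
}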

  17. Input to the compiler • FOR-loops are sequential • DO-loops can be executed in any order (“DOANY”) • Convert dense DO-loops into sparse code

DO I=1,N; J=1,N
  Y(I)=Y(I)+A(I,J)*X(J)

is compiled, for A compressed by column, into:

for(j=0; j<N; j++)
  for(ii=colp(j); ii < colp(j+1); ii++)
    Y(rowind(ii)) = Y(rowind(ii)) + vals(ii)*X(j);

  18. Outline • Problem statement • State-of-the-art restructuring compiler technology • Technical approach and experiments • Ongoing work and conclusions

  19. An example: locality enhancement • Matrix-vector product, array A stored in column-major order

FOR I=1,N
  FOR J=1,N
    Y(I) = Y(I) + A(I,J)*X(J)     (inner loop accesses A at stride N)

• Would like to execute the code as:

FOR J=1,N
  FOR I=1,N
    Y(I) = Y(I) + A(I,J)*X(J)     (inner loop accesses A at stride 1)

• In general?

  20. An abstraction: polyhedra • Loop nests == polyhedra in integer spaces

FOR I=1,N
  FOR J=1,I
    …

• The iteration space is the set of integer points {(I,J) : 1 ≤ J ≤ I ≤ N} • Transformations • Used in production and research compilers (SGI, HP, IBM)

  21. Caveat • The polyhedral model is not applicable to sparse computations

FOR I=1,N
  FOR J=1,N
    IF (A(I,J) ≠ 0) THEN
      Y(I) = Y(I) + A(I,J)*X(J)

• The set of iterations actually executed is not a polyhedron • What is the right formalism?

  22. Extensions for sparse matrix code generation

FOR I=1,N
  FOR J=1,N
    IF (A(I,J) ≠ 0) THEN
      Y(I)=Y(I)+A(I,J)*X(J)

• A is sparse, compressed by column • Interchange the loops, encapsulate the guard

FOR J=1,N
  FOR I=1,N such that A(I,J) ≠ 0
    ...

• “Control-centric” approach: transform the loops to match the best access to data [Bik, Wijshoff]

  23. Limitations of the control-centric approach • Requires well-defined direction of access

  24. Outline • Problem statement • State-of-the-art restructuring compiler technology • Technical approach and experiments • Ongoing work and conclusions

  25. Data-centric transformations • Main idea: concentrate on the data

DO I=…; J=…
  … A(F(I,J)) …

• Array access function: <row,column> = F(I,J) • Example: coordinate storage format

  26. Data-centric sparse code generation • If only a single sparse array:

FOR <row,column,value> in A
  I=row; J=column
  Y(I)=Y(I)+value*X(J)

• For each data structure provide an enumeration method • What if more than one sparse array? • Need to produce efficient simultaneous enumeration
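In C, a sketch (ours, not the compiler's output) of this data-centric enumeration for y = y + A*x over coordinate storage might read:

/* Enumerate the <row, column, value> triples of A and compute accordingly;
   the array names are assumptions matching the coordinate layout above. */
void coo_matvec(int nnz, const int *row, const int *col, const double *val,
                const double *X, double *Y)
{
    for (int k = 0; k < nnz; k++)
        Y[row[k]] += val[k] * X[col[k]];
}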

  27. Efficient simultaneous enumeration

DO I=1,N
  IF (X(I) ≠ 0 and Y(I) ≠ 0) THEN
    sum = sum + X(I)*Y(I)

• Options: • Enumerate X, search Y: “data-centric on” X • Enumerate Y, search X: “data-centric on” Y • Can speed up searching by scattering into a dense vector • If both sorted: “2-finger” merge (a sketch follows below) • Best choice depends on how X and Y are stored • What is the general picture?
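A minimal sketch of the “2-finger” merge, assuming both X and Y are compressed sparse vectors with indices sorted in increasing order; layout and names are our assumptions:

/* Advance one "finger" per vector; multiply only where the indices match,
   i.e. where both X(I) and Y(I) are non-zero. */
double sparse_dot_merge(int xnnz, const int *xind, const double *xval,
                        int ynnz, const int *yind, const double *yval)
{
    double sum = 0.0;
    int p = 0, q = 0;
    while (p < xnnz && q < ynnz) {
        if      (xind[p] < yind[q]) p++;
        else if (xind[p] > yind[q]) q++;
        else { sum += xval[p] * yval[q]; p++; q++; }
    }
    return sum;
}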

  28. An observation

DO I=1,N
  IF (X(I) ≠ 0 and Y(I) ≠ 0) THEN
    sum = sum + X(I)*Y(I)

• Can view arrays as relations (as in “relational databases”): X(i,x), Y(i,y) • Have to enumerate solutions to the relational query Join(X(i,x), Y(i,y))

  29. Connection to relational queries • Dot product → Join(X,Y) • General case?

  30. From loop nests to relational queries

DO I, J, K, ...
  … A(F(I,J,K,...)) … B(G(I,J,K,...)) …

• Arrays are relations (e.g. A(r,c,a)) • Implicitly store zeros and non-zeros • Integer space of loop variables is a relation, too: Iter(i,j,k,…) • Access predicate S: relates loop variables and array elements • Sparsity predicate P: “interesting” combination of zeros/non-zeros • The loop nest corresponds to the query: Select(P, Select(S ∧ Bounds, Product(Iter, A, B, …)))
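As a concrete (hedged) instance of this formulation, the matrix-vector product DO I,J: Y(I)=Y(I)+A(I,J)*X(J) with a sparse A might be written roughly as:

Select(A.a ≠ 0,                              -- sparsity predicate P
  Select(A.r = i ∧ A.c = j ∧ 1 ≤ i,j ≤ N,    -- access predicate S and bounds
    Product(Iter(i,j), A(r,c,a))))

The exact syntax here is ours; the slide gives only the general Select/Product form.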

  31. Why relational queries? • “[Relational model] provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation and organization of data on the other” (E.F. Codd, CACM, 1970) • Want to separate what is to be computed from how

  32. Bernoulli Sparse Compilation Toolkit • BSCT is about 40K lines of ML + 9K lines of C • Query optimizer at the core • Extensible: new formats (CRS, CCS, BRS, Coordinate, …) can be added • Compilation pipeline: Input program → Front-end → Query Optimizer (driven by abstract properties of the formats) → Plan → Instantiator (macros) → Low-level C code

  33. Query optimization: ordering joins

  34. Query optimization: implementing joins

FOR I ∈ Join(A,Y)
  FOR J ∈ Join(A(I,*), X)
    …

is implemented, for example, as:

H = scatter(X)
FOR I ∈ Merge(A,Y)
  FOR J ∈ enumerate A(I,*), search H
    …

• Output is called a plan
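A small sketch (ours) of the scatter step in this plan: expanding the compressed sparse vector X into a dense workspace H turns “search H” into a direct array lookup. The layout and names are assumptions:

/* H must have length n and be zero-initialized before the call. */
void scatter(int xnnz, const int *xind, const double *xval, double *H)
{
    for (int ii = 0; ii < xnnz; ii++)
        H[xind[ii]] = xval[ii];
}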

  35. Instantiator: executable code generation

H = scatter X
FOR I ∈ Merge(A,Y)
  FOR J ∈ enumerate A(I,*), search H
    …

is instantiated into low-level C code:

for(I=0; I<N; I++)
  for(JJ=ROWP(I); JJ < ROWP(I+1); JJ++)
    …

• Macro expansion • Open system
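A hedged guess (ours, not the toolkit's actual output) at what the fully instantiated loop might look like, with the elided body filled in; rowp, colind and vals are assumed names for the compressed-row arrays, and H is the dense workspace produced by the scatter step:

/* y = y + A*x with A in compressed row storage and x already scattered
   into the dense workspace H, so "search H" is a direct lookup. */
void crs_matvec_scattered(int N, const int *rowp, const int *colind,
                          const double *vals, const double *H, double *Y)
{
    for (int I = 0; I < N; I++)
        for (int JJ = rowp[I]; JJ < rowp[I+1]; JJ++)
            Y[I] += vals[JJ] * H[colind[JJ]];
}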

  36. Summary of the compilation techniques • Data-centric methodology: walk the data, compute accordingly • Implementation for sparse arrays • arrays = relations, loop nests = queries • Compilation path • Main steps are independent of data structure implementations • Parallel code generation • Ownership, communication sets, ... = relations • Difference from traditional relational database query optimization • Selectivity of predicates not an issue; affine joins

  37. Experiments • Sequential • Kernels from SPBLAS library • Iterative solution of linear systems • Parallel • Iterative solution of linear systems • Comparison with the BlockSolve library from Argonne NL • Comparison with the proposed “High-Performance Fortran” standard

  38. Setup • IBM SP-2 at Cornell • 120 MHz P2SC processor at each node • Can issue 2 multiply-add instructions per cycle • Peak performance 480 Mflops • Much lower on sparse problems: < 100 Mflops • Benchmark matrices • From Harwell-Boeing collection • Synthetic problems

  39. Matrix-vector products • BSR = “Block Sparse Row” • VBR = “Variable Block Sparse Row” • BSCT_OPT = Some “Dragon Book” optimizations by hand • Loop invariant removal

  40. Solution of Triangular Systems • Bottom line: • Can compete with the SPBLAS library (need to implement loop invariant removal :-)

  41. Iterative solution of sparse linear systems • Essential for large-scale simulations • Preconditioned Conjugate Gradients (PCG) algorithm • Basic kernels: y=Ax, Lx=b + dense vector ops • Preprocessing step • Find a preconditioner M such that M ≈ A and M is cheap to solve with • Incomplete Cholesky factorization (ICC): A ≈ LLᵀ • Basic kernels: sparse vector updates, sparse vector scaling • Can not be implemented using the SPBLAS library • Used CCS format (“natural” for ICC)

  42. Iterative Solution • ICC: a lot of “sparse overhead” • Ongoing investigation (at MathWorks): • Our compiler-generated ICC is 50-100 times faster than the Matlab implementation!

  43. Iterative solution (cont.) • Preliminary comparison with IBM ESSL DSRIS • DSRIS implements PCG (among other things) • On BCSSTK30; have set values to vary the convergence • BSCT ICC takes 1.28 secs • ESSL DSRIS preprocessing (ILU+??) takes ~5.09 secs • PCG iterations are ~15% faster in ESSL

  44. Parallel experiments • Conjugate Gradient algorithm • vs BlockSolve library (Argonne NL) • “Inspector” phase • Pre-computes what communication needs to occur • Done once, might be expensive • “Executor” phase • “Receive-compute-send-...” • On Boeing-Harwell matrices • On synthetic grid problems to understand scalability

  45. Executor performance • Grid problems: problem size per processor is constant • 135K rows, ~4.6M non-zeroes • Within 2-4% of the library

  46. Inspector overhead • Ratio of the inspector to single iteration of the executor • A problem-independent measure • “HPF-2” -- new data-parallel Fortran standard • Lowest-common denominator, inspectors are not scalable

  47. Experiments: summary • Sequential • Competitive with SPBLAS library • Parallel • Inspector phase should exploit formats (cf. HPF-2)

  48. Outline • Problem statement • State-of-the-art restructuring compiler technology • Technical approach and experiments • Ongoing work and conclusions

  49. Ongoing work • Packaging • “Library-on-demand”; as a Matlab toolbox • Parallel code generation • Extend to handle more kernels • Core of the compiler • Disjunctive queries, fill

  50. Ongoing work • Packaging • “Library-on-demand”; as a Matlab toolbox • Completely automatic tool; data structure selection • Out-of-core computations • Parallel code generation • Extend to handle more kernels • Core of the compiler • Disjunctive queries, fill
