Sparse Matrix Dense Vector Multiplication

Sparse Matrix Dense Vector Multiplication, by Pedro A. Escallon. Parallel Processing Class, Florida Institute of Technology, April 2002

The Problem • Improve the speed of sparse matrix - dense vector multiplication using MPI on a Beowulf parallel computer.

What To Improve • Current algorithms use excessive indirect addressing • Current optimizations depend on the structure of the matrix (distribution of the nonzero elements)

Sparse Matrix Representations • Coordinate format • Compressed Sparse Row (CSR) • Compressed Sparse Column (CSC) • Modified Sparse Row (MSR)

Compressed Sparse Row (CSR) [Figure: the rS, ndx, and val arrays]

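CSR Code
void sparseMul(int m, double *val, int *ndx, int *rS, double *x, double *y)
{
    int i, j;
    /* Products are accumulated into y, so y should be zeroed beforehand to obtain y = A*x. */
    for (i = 0; i < m; i++) {
        /* rS[i] .. rS[i+1]-1 are the positions of row i in val and ndx */
        for (j = rS[i]; j < rS[i+1]; j++) {
            y[i] += (*val++) * x[*ndx++];
        }
    }
}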
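To make the three CSR arrays concrete, here is a minimal sketch on a small matrix. The 3x3 matrix, the vector, and the names csr_spmv and csr_example are illustrative assumptions, not taken from the presentation; the loop uses array indexing rather than the pointer increments of the slide's version.

```c
/* y = A*x for an m-row matrix in CSR form (rS has m+1 entries). */
void csr_spmv(int m, const double *val, const int *ndx,
              const int *rS, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = rS[i]; j < rS[i + 1]; j++)
            sum += val[j] * x[ndx[j]];   /* ndx[j] selects the entry of x */
        y[i] = sum;
    }
}

/* Hypothetical example matrix:
 *     | 1 0 2 |
 * A = | 0 3 0 |
 *     | 4 0 5 |
 */
void csr_example(double y[3])
{
    const double val[] = {1.0, 2.0, 3.0, 4.0, 5.0};  /* nonzeros, row by row */
    const int    ndx[] = {0, 2, 1, 0, 2};            /* column of each nonzero */
    const int    rS[]  = {0, 2, 3, 5};               /* start of each row in val/ndx */
    const double x[]   = {1.0, 2.0, 3.0};
    csr_spmv(3, val, ndx, rS, x, y);
}
```

With x = (1, 2, 3), calling csr_example fills y with (7, 6, 19). Every product needs a lookup through ndx, and every row needs two lookups in rS, which is the indirect addressing the slides set out to reduce.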

Goals • Eliminate indirect addressing • Remove the dependency on the distribution of the nonzero elements • Further compress the matrix storage • Most of all, speed up the operation

Proposed Solution [Figure: example matrix A]

Data Structure
typedef struct {
    int rCol;
    double val;
} dSparS_t;
[Figure: one element, {rCol, val}]
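The transcript does not spell out how rCol is encoded. Judging only from the multiplication code in this deck, a positive rCol appears to be a column index, while a value less than or equal to zero appears to mark the first element of a row, carry the negated row number, and multiply X[0]. The sketch below builds such a stream from CSR arrays under that reading. The function name csr_to_stream, the explicit zero stored when a row has no column-0 entry, and the requirement that column indices be sorted within each row are all assumptions.

```c
typedef struct {
    int rCol;
    double val;
} dSparS_t;

/* Convert CSR to a {rCol, val} stream (assumed encoding, see lead-in).
 * out must have room for nnz + m elements.
 * Returns the number of elements written. */
int csr_to_stream(int m, const double *val, const int *ndx,
                  const int *rS, dSparS_t *out)
{
    int n = 0;
    for (int i = 0; i < m; i++) {
        int j = rS[i];
        /* Row marker: rCol = -i, val = A[i][0] (0.0 if not stored). */
        out[n].rCol = -i;
        if (j < rS[i + 1] && ndx[j] == 0)
            out[n].val = val[j++];
        else
            out[n].val = 0.0;
        n++;
        /* Remaining entries of the row keep their (positive) column index. */
        for (; j < rS[i + 1]; j++) {
            out[n].rCol = ndx[j];
            out[n].val = val[j];
            n++;
        }
    }
    return n;
}
```

For the rows (1 0 2), (0 3 0), (4 0 5) this yields {0,1} {2,2} {-1,0} {1,3} {-2,4} {2,5}: row boundaries travel inside the stream, so no separate row-start array is needed.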

Process [Figure: A, hdr.size elements long, divided into blocks of local_size elements plus a residual] • local_size = hdr.size / p • residual = hdr.size % p (so residual < p)
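The block-size arithmetic can be checked with a few lines of C. This is only a sketch of the two formulas; split_t and split_elements are hypothetical names, and how the residual elements are assigned to processes is not stated in the transcript.

```c
typedef struct {
    int local_size;  /* elements handed to each of the p processes */
    int residual;    /* leftover elements, always < p */
} split_t;

/* Integer division and remainder, as on the slide. */
split_t split_elements(int size, int p)
{
    split_t s;
    s.local_size = size / p;
    s.residual   = size % p;
    return s;
}
```

Taking 1,683,902 purely as an illustrative size with p = 10 gives local_size = 168,390 and residual = 2; in every case p * local_size + residual equals the original size.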

Scatter [Figure: A scattered into local_A blocks of local_size elements]

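Multiplication Code
/* rCol > 0 is a column index; otherwise the element multiplies X[0] */
if ((index = local_A[0].rCol) > 0)
    local_Y[0].val = local_A[0].val * X[index];
else
    local_Y[0].val = local_A[0].val * X[0];
local_Y[0].rCol = -1;
k = 1; h = 0;
while (k < local_size) {
    /* test k first so local_A[k] is never read past the end */
    while ((k < local_size) && (0 < (index = local_A[k].rCol)))
        local_Y[h].val += local_A[k++].val * X[index];
    if (k < local_size) {
        /* rCol <= 0 starts a new row: label the finished row, open the next */
        local_Y[h++].rCol = -index - 1;
        local_Y[h].val = local_A[k++].val * X[0];
    }
}
/* last row number = previous row number + 1; reads local_Y[h-1], so h >= 1 is assumed */
local_Y[h].rCol = local_Y[h-1].rCol + 1;
h++;
/* mark the unused tail of local_Y */
while (h < stride)
    local_Y[h++].rCol = -1;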
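For comparison, here is a simplified single-process sketch of the same idea: it walks a complete {rCol, val} stream and writes the products straight into a dense result vector, with no scatter, stride padding, or split rows. It assumes the encoding inferred from the code above (rCol > 0 is a column index; rCol <= 0 starts row -rCol and multiplies x[0]); the name stream_spmv and the dense output are illustrative, not the presentation's method.

```c
typedef struct {
    int rCol;
    double val;
} dSparS_t;

/* y = A*x for a whole matrix stored as an n-element stream; y has m entries. */
void stream_spmv(int n, const dSparS_t *a, const double *x, int m, double *y)
{
    int row = -1;                       /* no row opened yet */
    for (int i = 0; i < m; i++)
        y[i] = 0.0;
    for (int k = 0; k < n; k++) {
        int c = a[k].rCol;
        if (c <= 0) {                   /* row marker: column 0 of row -c */
            row = -c;
            y[row] += a[k].val * x[0];
        } else if (row >= 0) {          /* ordinary element of the open row */
            y[row] += a[k].val * x[c];
        }
        /* elements before the first marker are skipped in this sketch */
    }
}
```

On the stream {0,1} {2,2} {-1,0} {1,3} {-2,4} {2,5} with x = (1, 2, 3) this produces y = (7, 6, 19). The parallel version differs mainly in that a block may begin in the middle of a row, which is why it records row numbers in local_Y instead of writing to y directly.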

Multiplication [Figure: local_A (local_size) * X = local_Y (stride); labels: domain, range]

Algorithm [Figure: local_A, X, Y.val, Y.rCol]

Gather [Figure: local_Y blocks of stride elements gathered into gatherBuffer; labels: split element, range, residual]

Consolidation of Split Rows [Figure: gatherBuffer += Y; labels: nCols, residual]
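A row whose elements straddle two blocks yields two partial sums, one per process, and these have to be added together after the gather. The sketch below shows one way that step could look. It assumes each gathered element carries a row number in rCol and that padding entries have rCol = -1, as the multiplication code suggests; the function name consolidate and the dense output vector are assumptions.

```c
typedef struct {
    int rCol;
    double val;
} dSparS_t;

/* Add n gathered partial results into the m-entry vector y.
 * y must be initialized (e.g. zeroed) by the caller. */
void consolidate(int n, const dSparS_t *partial, int m, double *y)
{
    for (int i = 0; i < n; i++) {
        int row = partial[i].rCol;
        if (row >= 0 && row < m)        /* skip padding (rCol = -1) */
            y[row] += partial[i].val;   /* split rows simply add up */
    }
}
```

If row 1 is split between two processes that report partial sums 2 and 3, both entries arrive with rCol = 1 and y[1] ends up as 5.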

Results (vavasis3) vavasis3.rua - Total non-zero values: 1,683,902 - p = 10

Results (vavasis3) • vavasis3.rua - Total non-zero values: 1,683,902 - p = 8 • vavasis3.rua - Total non-zero values: 1,683,902 - p = 1

Results (vavasis3) • vavasis3.rua - Total non-zero values: 1,683,902 - p = 4 • vavasis3.rua - Total non-zero values: 1,683,902 - p = 2

Results (vavasis3) vavasis3.rua - Calculated Results

Results (bayer02) bayer02.rua - Total non-zero values: 63,679 - p = 10

Results (bayer02) • bayer02.rua - Total non-zero values: 63,679 - p = 8 • bayer02.rua - Total non-zero values: 63,679 - p = 1

Results (bayer02) • bayer02.rua - Total non-zero values: 63,679 - p = 4 • bayer02.rua - Total non-zero values: 63,679 - p = 2

Results (bayer02) bayer02.rua - Calculated Results

Conclusions • The proposed representation speeds up the matrix-vector multiplication • The handling of the data mismatch before the gather should be improved • There seems to be a communication penalty for moving structured data

Bibliography • “Optimizing the Performance of Sparse Matrix-Vector Multiplication” dissertation by Eun-Jin Im. • “Iterative Methods for Sparse Linear Systems” by Yousef Saad • “Users’ Guide for the Harwell-Boeing Sparse Matrix Collection” by Iain S. Duff
