Enhanced GPU Algorithms for Efficient Matrix Multiplications

CSE 690: GPGPU, Lecture 7: Matrix Multiplications • Klaus Mueller, Computer Science, Stony Brook University

Basic Concept • Triple loop
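For reference, the triple loop can be written out as a minimal CPU-side sketch. The function name and the list-of-rows matrix layout are assumptions made for illustration; they do not come from the slides.

```python
def matmul_triple_loop(A, B):
    """Multiply an m x l matrix A by an l x n matrix B (both lists of rows)."""
    m, l, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):            # row of C
        for j in range(n):        # column of C
            for k in range(l):    # inner (dot-product) index
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Every GPU variant in the following slides computes the same sums; they differ only in how the three loops are mapped onto fragments, texels, and rendering passes.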

GPU Algorithms • First algorithm: • render a rectangle of size NxN • represent the matrices as NxN textures • each (i,j) is then a fragment • each fragment program is a loop or an unrolled loop -> may get too long • must pull in the same data many times -> poor data reuse, needs bandwidth • makes no use of 4-way RGBA parallelism -> wastes speedup
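The first algorithm can be imitated on the CPU to make the data-reuse problem visible. This is only a sketch: `tex` is a hypothetical stand-in for a texture lookup, and the fetch counter is added for illustration. Each fragment (i,j) runs the inner loop on its own, so every element of A and B is read N times over the whole rectangle.

```python
def render_first_algorithm(A, B):
    """Simulate one fragment per output element C[i][j] for N x N inputs."""
    n = len(A)
    fetches = 0

    def tex(texture, row, col):
        # Hypothetical texture lookup; counts every read.
        nonlocal fetches
        fetches += 1
        return texture[row][col]

    def fragment_program(i, j):
        # The per-fragment loop: one dot product of row i of A and column j of B.
        acc = 0.0
        for k in range(n):
            acc += tex(A, i, k) * tex(B, k, j)
        return acc

    # "Rendering the N x N rectangle" runs the program once per (i, j).
    C = [[fragment_program(i, j) for j in range(n)] for i in range(n)]
    return C, fetches
```

In this model the fetch count is 2N^3 for 2N^2 distinct input values, which is the poor reuse the slide refers to.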

GPU Algorithms • Better algorithm: • use RGBA channels, pack a 2x2 submatrix into each texel • use swizzling to facilitate data reuse • swizzling shortens the fragment code by a factor of 2 • may need multiple passes for larger matrices
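A small sketch of how swizzling turns a 2x2 block product into two 4-component multiply-adds. The channel layout (r = top-left, g = top-right, b = bottom-left, a = bottom-right) and the helper names are assumptions chosen for this example; real shaders may lay the block out differently.

```python
def swizzle(texel, pattern):
    """Return the components of an RGBA texel reordered by pattern, e.g. 'rrbb'."""
    index = {"r": 0, "g": 1, "b": 2, "a": 3}
    return [texel[index[c]] for c in pattern]


def mad(x, y, acc):
    """4-component multiply-add: acc + x * y, component-wise."""
    return [a + p * q for a, p, q in zip(acc, x, y)]


def block_product(a, b, acc=(0.0, 0.0, 0.0, 0.0)):
    """Accumulate the product of two 2x2 blocks stored as RGBA texels.

    Assumed layout: r = top-left, g = top-right, b = bottom-left, a = bottom-right.
    """
    acc = mad(swizzle(a, "rrbb"), swizzle(b, "rgrg"), list(acc))
    acc = mad(swizzle(a, "ggaa"), swizzle(b, "baba"), acc)
    return acc
```

For example, `block_product([1, 2, 3, 4], [5, 6, 7, 8])` gives the packed form of [[1,2],[3,4]] times [[5,6],[7,8]]. Two texel fetches feed two 4-component mads, so each fetched value is used twice.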

GPU Algorithms • Using multi-texturing • requires l passes
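The pass count can be illustrated with a sketch in which each rendering pass adds one term of the inner sum to every output element. It assumes that l denotes the shared inner dimension (A is m x l, B is l x n), which the slide does not state explicitly; the function name is hypothetical.

```python
def multipass_matmul(A, B):
    """Accumulate C = A * B with one simulated rendering pass per inner index k."""
    m, l, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    passes = 0
    for k in range(l):
        # One pass: every output element gets A[i][k] * B[k][j] added to it,
        # mimicking a multiply of two textures followed by additive blending.
        C = [[C[i][j] + A[i][k] * B[k][j] for j in range(n)] for i in range(m)]
        passes += 1
    return C, passes
```

Under that assumption the number of passes equals l, independent of the output size.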

GPU Algorithms • Can use RGBA parallelism as well • each texel represents a 2x2 submatrix • use swizzling as usual • needs l/2 passes
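A sketch of the packed multi-pass scheme, assuming again that l is the inner dimension and that all matrix dimensions are even. The packing order (top-left, top-right, bottom-left, bottom-right) and all names are illustrative choices, not taken from the slides.

```python
def pack_2x2(M):
    """Pack an even-sized matrix into a grid of RGBA texels, one per 2x2 block."""
    return [[[M[2 * i][2 * j], M[2 * i][2 * j + 1],
              M[2 * i + 1][2 * j], M[2 * i + 1][2 * j + 1]]
             for j in range(len(M[0]) // 2)]
            for i in range(len(M) // 2)]


def block_mad(a, b, c):
    """Return c + a * b for 2x2 blocks in (tl, tr, bl, br) order."""
    return [c[0] + a[0] * b[0] + a[1] * b[2],
            c[1] + a[0] * b[1] + a[1] * b[3],
            c[2] + a[2] * b[0] + a[3] * b[2],
            c[3] + a[2] * b[1] + a[3] * b[3]]


def packed_multipass(A, B):
    """Multiply packed textures, one simulated pass per texel of the inner dimension."""
    TA, TB = pack_2x2(A), pack_2x2(B)
    rows, inner, cols = len(TA), len(TB), len(TB[0])
    TC = [[[0.0] * 4 for _ in range(cols)] for _ in range(rows)]
    passes = 0
    for k in range(inner):
        TC = [[block_mad(TA[i][k], TB[k][j], TC[i][j]) for j in range(cols)]
              for i in range(rows)]
        passes += 1
    return TC, passes
```

Because each texel covers two steps of the inner index, a matrix pair with l = 4 finishes in 2 passes in this model.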

GPU Algorithms • Instead of a 2x2 submatrix, pack 4x1 column vectors • makes 4-times reuse of texels read from B, but uses texels from A only once • 6 fetches are needed for 4 mad’s (mult-add’s) -> 1.5 times more than before • but fewer rows and columns are accessed per pass -> improves cache hit frequency
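The reuse pattern of the 4x1 packing can be sketched as follows. The sketch assumes that each A texel holds four consecutive rows of one column of A, and that the B texel holds four consecutive rows of one column of B; the function name is hypothetical. It models only the arithmetic, not the fetch counts quoted on the slide.

```python
def column_kernel(a_cols, b_texel, acc):
    """Return acc + sum over t of a_cols[t] * b_texel[t].

    a_cols:  four texels, each a 4x1 column of A (rows i..i+3, column k+t)
    b_texel: one texel holding B[k..k+3][j]
    acc:     the 4x1 column C[i..i+3][j] accumulated so far
    """
    for t in range(4):
        # Component t of the B texel is broadcast across all four lanes,
        # so the single B texel is reused four times.
        acc = [c + a * b_texel[t] for c, a in zip(acc, a_cols[t])]
    return acc
```

Each of the four A texels enters exactly one multiply-add, while the B texel feeds all four, which matches the reuse described above.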

GPU Algorithms • Originally only compute one product per shader • practically can unroll the loop 3-6 times (compute 3-6 products) • maximal fragment program length is the limit • reduces the number of passes required
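The effect of unrolling on the pass count can be estimated with a one-line calculation. This is a rough sketch that assumes a "product" is one step of the inner sum and that every pass is filled completely except possibly the last; the function and parameter names are illustrative.

```python
import math


def passes_required(inner_steps, products_per_pass):
    """Number of rendering passes when each shader handles several inner-sum steps."""
    return math.ceil(inner_steps / products_per_pass)
```

For 1024 inner steps, going from 1 to 6 products per shader reduces the estimate from 1024 passes to 171.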

Reality Check • Would like to compare CPU and GPU efficiencies for GPGPU tasks • Matrix multiplication is an instructive test case here • features much data reuse • graphics programs are generally more stream-like and have less data reuse • this may lead to some limitations

Platforms • Pentium 4 3 GHz CPU, 512 KB L2 cache • 12 GFLOPS peak compute • 44.1 GB/s cache BW • Using sgemm routine from ATLAS package • NVIDIA • GeForce 5900 Ultra • GeForce 6800 Ultra • ATI • Radeon 9800 XT • Radeon X800 XT PE

Performance

Bandwidth vs. Peak FLOPS

Analysis • Currently: • GPUs can fetch 16 floats and perform 16 4-component mad’s per clock • our app fetches 8 floats to perform one 4-component mad -> not enough computations • need more math ops per float fetched (> 8)
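The numbers on this slide can be combined into a short worked calculation. It is a back-of-the-envelope sketch that assumes fetching and arithmetic overlap perfectly and that the fetch unit is the only bottleneck; the variable names are illustrative.

```python
# Hardware figures quoted on the slide (per clock).
hw_floats_fetched_per_clock = 16
hw_mads_per_clock = 16              # 4-component mads

# What the matrix multiplication shader needs.
app_floats_per_mad = 8              # floats fetched per 4-component mad

# The hardware is balanced at one float fetched per mad.
hw_floats_per_mad = hw_floats_fetched_per_clock / hw_mads_per_clock

# The shader asks for 8 times more data per mad than the hardware can deliver,
# so under these assumptions the math units are busy at most 1/8 of the time.
shortfall = app_floats_per_mad / hw_floats_per_mad
mad_utilization_bound = 1 / shortfall
```

Here `shortfall` evaluates to 8.0 and `mad_utilization_bound` to 0.125, which is why the kernel would need roughly 8 times more arithmetic per fetched float to keep the math units busy.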

Analysis • Pentium processors have large L1 caches to boost memory bandwidth (bw) • bw / compute ratio better • main reason for only small performance gain achieved with GPUs • for matrix multiplications

Analysis • Expectations • make sure that there is enough arithmetic per data item fetched • lots of data reuse in the algorithm / task will make the CPU look better • streaming tasks are OK -> they have little reuse, so they don’t “suffer” from it • matrix multiplication is an excellent reality-check example

Analysis • What do GPUs need: • bigger caches to enable larger blocks • currently there are enough registers to store a 6x6 submatrix • but currently shaders can only produce a small number of outputs -> limits the amount of blocking • provide full floating-point accumulator registers • widen the path between texture and register files
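Why larger blocks help can be seen from a simple count. The sketch assumes a b x b block of C is kept in registers and updated by an outer product: each step of the inner index loads b values from A and b values from B and performs b*b scalar multiply-adds. This blocking scheme and the function name are illustrative assumptions, not a description of any particular shader.

```python
def scalar_mads_per_fetch(b):
    """Scalar multiply-adds per scalar value fetched for a b x b register block."""
    fetches = 2 * b      # b values from a column of A, b from a row of B
    mads = b * b         # one update per element of the b x b block of C
    return mads / fetches
```

In this model the ratio grows as b/2, so a 6x6 block yields 3 multiply-adds per fetched value compared with 0.5 for an unblocked kernel. A shader that can write only a few outputs cannot hold such a block, which is the limitation noted above.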

References • E. Larsen and D. McAllister, “Fast matrix multiplies using graphics hardware,” Supercomputing 2001. • J. Hall, N. Carr and J. Hart, “Cache and bandwidth aware matrix multiplication on the GPU,” Tech Report UIUCDCS-R-2003-2328-1 • K. Fatahalian, J. Sugerman, and P. Hanrahan, “Understanding the efficiency of GPU algorithms for matrix-matrix multiplication,” Graphics Hardware Workshop 2004.
