Linear Algebra on GPUs
E N D
Presentation Transcript
Linear Algebra on GPUs VasilyVolkov
GPU Architecture Features • SIMD architecture • Don’t be confused by scalar ISA which is only a program model • We use vector thread program model instead • Similar to SSE/Cell SPE/vector units, not superscalar units • Native vector length is 32; can simulate larger (thread blocks) • Multithreading using register windows • Context switch never flushes registers to memory • If more threads than can fit, some don’t start until others finish • Huge register files, small high-latency caches • Fundamental difference with CPUs, similar to vector processors • Cache is to save memory bandwidth, not reduce latency • Use register blocking, not cache blocking • Large startup costs (≈5s) • Not good for fine-grain computations • Use global barriers instead? (≈1.5 s)
Relation to Other Multicores • All offer both multithreading and SIMD capabilities • Use CUDA to program all of them?
Pointer Chase Benchmark, 8800GTX • run k=A[k] • in a loop • in a scalar thread • latency bound • larger latency at cache miss • Reveals cache sizes 8-word cache line L1: 8 x 5kB, each is 20-way associative L2: 6 x 32kB, each is 4-way associative 512kB memory pages TLB: 16 entries, fully associative 8800GTX/8800GTS/8600GTS/FX5600: Different number of similar caches
Matrix-Matrix multiply, C = C + AB • 64x16 blocks in C, rank-4 updates • Ensures that our code is compute-bound • Each thread computes one block in C • ijk/jikform; other choices produce race condition in C • Keep A’s and C’s block in vector registers • Similarly done on IBM 3090 VF and Cray X1 • Keep B’s block in local memory • Others keep it in scalar registers (n/a on GPU); or caches (slow) • Use compiler options to enforce tight register budget • To hide memory latencies better by multithreading • Use prefetching to hide latencies even better • Now, performance is not bound by latency and bandwidth in reading blocks in A and B! • Bound by instruction issue and local memory ops (230 Gflop/s)
Performance of SGEMM • CUBLAS 1.1: • keeps square blocks in A and B in local memory • uses long vectors (poor instruction mix) • exposing too much of data parallelism may cost you • Our SGEMM is now in CUBLAS 2.0 beta
SGEMM, 8800GTX, k = 1024 Constant work per vector thread (function of k) Optimized version does better load balancing by computing partial sums
Panel Factorization CPU: runtime on Core2 Duo 2.66GHz, Intel MKL 10.0 (includes CPU-GPU transfer!) GPU: estimated for 8800GTX as
Design of Matrix Factorizations • Right-looking scheme = most parallelism = best on 16-core GPUs • Crout scheme = least bandwidth = best on 4-core GPU and if using CUBLAS 1.1 • Left-looking = half of work in triangular solve = limited parallelism = inefficient • 2-level blocking • Both levels are right-looking + premature exit from finer level to keep • Up to 6% speedup only, at large matrices (n≈10,000) • Block size on GPU is 64 (same as in matrix multiply) • Autotuning in QR (up to 7% speedup) • Row-major layout on GPU in LU decomposition • Since gathers with large stride are inefficient • Requires transposition at every CPU-GPU transfer • >2x speedup! • Panel factorization on CPU overlapped with BLAS3 on GPU (use lookahead) • Multiply by inverse (GEMM) instead of triangular solve (TRSM) • TRSM vs. GEMM is 13 Gflop/s vs. 160 Gflop/s if matrix is 64x64 • Parallelism in TRSM is not embarrassing enough
Test Platform • GPU • Core2 Duo 2.67GHz + GeForce 8800 GTX • Core2 Duo 2.67GHz + two GeForce 8800 GTX • CPU • Core2 Duo 2.67GHz • Core2 Quad 2.4GHz
Speedup using 2 GPUs Using column-cyclic layout
Other Work Done • Tridiagonaleigenvalue solver (bisection) • Most work: factorize A–iI = LDLT, count signs in D (compute bound) • Done for many i in parallel — traditionally vectorized • If need more parallelism — do multisection instead of bisection • But it increases total flop count • Rest is difficult to parallelize, does not worth it • Our solution: • Run vectorized loops on the GPU, rest (least work) on the CPU • Autotune to decide optimal redundancy and when involve CPU • Use features of IEEE arithmetic to save another 15-30% of runtime • Up to 156x faster than LAPACK on Pentium 4 • Tridiagonal eigenvector solver (inverse iteration) • Most work: Solve (A–iI)xk+1=xk for fixed i (bandwidth bound) • Factorize A–iI = LDLT once, keep D only. Reconstruct L on need • Reconstruction is overlapped with memory access; still bandwidth bound • Don’t pivot — recompute using safe code if fails (do it on CPU) • Up to 120x faster than LAPACK on Core2 Duo so far • More complicated when eigenvalues are clustered • Stencil computation (7-point on 3D grid) • Blocks in registers and local memory • Bandwidth-bound, runs at up to 66% of pin-bandwidth
Future Work • Analysis of architecture • Find best parallels in past architectures to reuse methods • Catching up with newer GPUs • More micro-benchmarking to get better performance models • More scientific kernels • CUFFT is ≈50Gflop/s, can do better (e.g. by not doing sin/cos in the inner loop) • More LAPACK • Two-sided factorizations used in eigensolvers and SVD • LAPACK does 50% of work is in BLAS1/BLAS2 • Mostly BLAS3 algorithm is known, but has requires more flops if eigenvectors are needed • May use divide-and-conquer instead • MRRR (improved inverse iteration algorithm, also rich in parallelism) • Non-symmetric eigensolvers such as QR iterations • currently fine-grained, can do better? • Iterative refinement for eigenvalue problem? • ScaLAPACK (distributed memory LAPACK) • One-sided factorizations on a GPU cluster