
Parallel Solving of Massive Linear Equations with CUDA


Presentation Transcript


  1. Parallel Solving of Massive Linear Equations with CUDA • Zheng Xia, School of Electrical and Computer Engineering, University of Ottawa, Ottawa, Canada • zxia010@uottawa.ca

  2. Introduction • Specify the main issue of the problem • Introduction of CLAPACK and CULA • CPU solving routine • Design and implement a parallel solving routine • Naïve • Coalesced access • Conclusion • CPU vs. GPU: Mythbusters demo "GPU versus CPU"

  3. FEM • The finite element method (FEM) is a numerical technique for finding approximate solutions to boundary value problems. • It encompasses all the methods for connecting many simple element equations over many small subdomains, named finite elements, to approximate a more complex equation over a larger domain, e.g. approximating a circle with a large number of straight lines. • In this case, the deformable object is decomposed into many tetrahedra with a spring-damper model between every pair of nodes. • [Figure: FEM mesh created by an analyst prior to finding a solution to a magnetic problem using FEM software.]

  4. Main issue: linear equations • The core problem is the linear system K u = f, where: • f is the force on each node • u is the displacement of each node (three-dimensional, so each nodal entry has x, y and z components) • K is the top-level matrix (global matrix) • Once the equations are in this form, the unknown displacements/forces can be obtained simply as u = A f, where A is the inverse of the matrix K.

  5. How to get A = K⁻¹ • Condition: K must be non-singular (full rank, e.g. a square matrix) • General approach: explicit matrix inversion, e.g. via fast matrix multiplication (Coppersmith–Winograd, roughly O(n^2.376)) [1] • Inefficient for high-dimensional matrices • Better to avoid explicit inversion • [1] http://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm

  6. (C)LAPACK • Provides many routines for solving systems of linear equations, as well as: • Linear least-squares problems • Eigenvalue problems • Singular value decomposition • by applying LU, QR, Cholesky and Schur decompositions • LAPACK is a free software library for numerical linear algebra • CLAPACK's goal is to provide LAPACK to anyone who does not have access to a Fortran compiler

  7. LU decomposition: dgesv() • DGESV computes the solution to a real system of linear equations A * X = B, where A is an N-by-N matrix and X and B are N-by-NRHS matrices. • LU decomposition with partial pivoting and row interchanges is used to factor A as A = P * L * U, where P is a permutation matrix, L is unit lower triangular, and U is upper triangular. • Limitations: suited only to small N; constrained by the full-rank requirement; hard to parallelize.
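A minimal sketch of calling dgesv_ from C, assuming CLAPACK is linked and uses plain int/double for its integer and real types; the 3x3 system is made up for illustration and is not from the presentation.

    /* Solve A*x = b with CLAPACK's dgesv_ (LU with partial pivoting).
       A must be stored in column-major order and is overwritten by its LU factors. */
    #include <stdio.h>

    extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                       int *ipiv, double *b, int *ldb, int *info);

    int main(void)
    {
        int n = 3, nrhs = 1, lda = 3, ldb = 3, info;
        int ipiv[3];
        double A[9] = { 4, 2, 1,    /* column 0 */
                        2, 5, 3,    /* column 1 */
                        1, 3, 6 };  /* column 2 */
        double b[3] = { 1, 2, 3 };  /* right-hand side, overwritten with the solution */

        dgesv_(&n, &nrhs, A, &lda, ipiv, b, &ldb, &info);

        if (info == 0)
            printf("x = [%f %f %f]\n", b[0], b[1], b[2]);
        else
            printf("dgesv_ failed, info = %d\n", info);
        return 0;
    }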

  8. QR decomposition: dgels() • Solves overdetermined or underdetermined real linear systems involving an M-by-N matrix A, or its transpose, using a QR or LQ factorization of A. • It is assumed that A has full rank. • If m >= n: finds the least-squares solution of an overdetermined system, minimizing || B - A x ||. • If m < n: finds the minimum-norm solution of an underdetermined system A x = B.
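A minimal sketch of an overdetermined least-squares solve with dgels_, including the customary workspace query; the 4x2 fitting data is made up for illustration.

    /* Least-squares solve of an overdetermined system with CLAPACK's dgels_.
       A is column-major and overwritten by its QR factorization; the first n
       entries of b return the least-squares solution. */
    #include <stdio.h>
    #include <stdlib.h>

    extern void dgels_(const char *trans, const int *m, const int *n, const int *nrhs,
                       double *a, const int *lda, double *b, const int *ldb,
                       double *work, const int *lwork, int *info);

    int main(void)
    {
        int m = 4, n = 2, nrhs = 1, lda = 4, ldb = 4, info;
        double A[8] = { 1, 1, 1, 1,     /* column 0 */
                        1, 2, 3, 4 };   /* column 1: fit y = c0 + c1*t */
        double b[4] = { 6, 5, 7, 10 };  /* observations */

        /* Workspace query: lwork = -1 returns the optimal size in wkopt. */
        double wkopt;
        int lwork = -1;
        dgels_("N", &m, &n, &nrhs, A, &lda, b, &ldb, &wkopt, &lwork, &info);

        lwork = (int)wkopt;
        double *work = (double *)malloc(lwork * sizeof(double));
        dgels_("N", &m, &n, &nrhs, A, &lda, b, &ldb, work, &lwork, &info);

        if (info == 0)
            printf("least-squares solution: c0 = %f, c1 = %f\n", b[0], b[1]);
        free(work);
        return 0;
    }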

  9. CULA • What is CULA • CULA is a GPU-accelerated linear algebra library that utilizes the NVIDIA CUDA parallel computing architecture to dramatically improve the computation speed of sophisticated mathematics.

  10. y = A b • Another question: once A = K⁻¹ is known, how should the matrix-vector product y = A b be computed? • Options: • CUBLAS library • Simple dgemv on the CPU • Parallel dgemv on the GPU (with and without coalesced reads)

  11. CUBLAS library intro • The CUBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA CUDA runtime. It allows the user to access the computational resources of NVIDIA Graphics Processing Units (GPUs). • Basic steps to use the CUBLAS library: • Allocate matrices and vectors in GPU memory • Fill them with data • Call the sequence of desired CUBLAS functions • Retrieve the results from GPU memory back to the host • The CUBLAS library also provides helper functions for writing data to and retrieving data from the GPU.

  12. cublas<t>gemv(): cublasSgemv() • This function performs the matrix-vector multiplication y = α op(A) x + β y, where A is an m × n matrix stored in column-major format, x and y are vectors, and α and β are scalars; op(A) is A, its transpose, or its conjugate transpose, selected by the trans argument. • Limitation: a CUBLAS context attaches to a single GPU and does not auto-parallelize across multiple GPUs.
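A minimal sketch combining the basic steps from slide 11 with the cublasSgemv call above (cublas_v2 API, available in CUDA 5.5); the 4x4 data is a placeholder, not the presentation's test data.

    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <stdio.h>

    int main(void)
    {
        const int m = 4, n = 4;
        float h_A[16], h_x[4], h_y[4];
        for (int i = 0; i < 16; ++i) h_A[i] = 1.0f;   /* dummy column-major matrix */
        for (int i = 0; i < 4;  ++i) h_x[i] = 1.0f;   /* dummy input vector */

        /* 1. Allocate matrices and vectors in GPU memory. */
        float *d_A, *d_x, *d_y;
        cudaMalloc((void **)&d_A, m * n * sizeof(float));
        cudaMalloc((void **)&d_x, n * sizeof(float));
        cudaMalloc((void **)&d_y, m * sizeof(float));

        /* 2. Fill them with data using the CUBLAS helper functions. */
        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSetMatrix(m, n, sizeof(float), h_A, m, d_A, m);
        cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);

        /* 3. Call the desired CUBLAS function: y = alpha*A*x + beta*y. */
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, d_A, m, d_x, 1, &beta, d_y, 1);

        /* 4. Retrieve the result from GPU memory back to the host. */
        cublasGetVector(m, sizeof(float), d_y, 1, h_y, 1);
        printf("y[0] = %f\n", h_y[0]);

        cublasDestroy(handle);
        cudaFree(d_A); cudaFree(d_x); cudaFree(d_y);
        return 0;
    }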

  13. Simple Dgemv on CPU
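The CPU reference routine was shown as an image in the original slide; below is a minimal sketch consistent with the simple_sgemv prototype on slide 17, assuming a square size-by-size matrix A stored row by row, input vector B, and output vector C.

    /* Sketch of the single-precision CPU reference routine (not the original code). */
    void simple_sgemv(float *A, float *B, float *C, unsigned int size)
    {
        for (unsigned int i = 0; i < size; ++i) {
            float sum = 0.0f;
            for (unsigned int k = 0; k < size; ++k)
                sum += A[i * size + k] * B[k];   /* dot product of row i with B */
            C[i] = sum;
        }
    }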

  14. Parallel Sgemv on GPU • Method: each thread in each block runs a loop over one row of A and computes one element of the result vector; in this case the row length Row equals the matrix dimension size. • Kernel implementation: • unsigned int id = blockIdx.x * blockDim.x + threadIdx.x; • for (unsigned int k = 0; k < Row; k++) • { • if (id < size) c[id] += a[id * Row + k] * b[k]; • }
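The slides do not show the host-side launch; a possible configuration is sketched below, assuming the kernel is the CUParaSgemv declared on slide 17, that d_a, d_b and d_c are device buffers already filled with the matrix and vectors, and that the block size of 256 is a free choice.

    /* Hypothetical host-side launch; the block size is an assumption, not from the slides. */
    const unsigned int size = 1200;                 /* matrix dimension N */
    const unsigned int threadsPerBlock = 256;
    const unsigned int blocks = (size + threadsPerBlock - 1) / threadsPerBlock;
    CUParaSgemv<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, size);
    cudaDeviceSynchronize();                        /* wait before timing or reading results */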

  15. Experimental configuration • Test platform: • Intel(R) Core(TM) i7 CPU 920 @ 2.67 GHz • GeForce GTX 295 (only a single GPU is used) • CUDA 5.5 • Data: • generated with rand() in C, single/double precision • Matrix size: N × N (the analysis below uses N = 1200 and N = 3000) • Timing functions: • CPU: clock_t • GPU: CUDA events • *The time comparison excludes transfers between system and GPU memory.
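A sketch of the two timing mechanisms listed above; it assumes the host routine, kernel and device buffers from the earlier sketches already exist, and it times only the kernel, excluding host-device transfers as stated on the slide.

    #include <time.h>
    #include <cuda_runtime.h>

    /* CPU timing with clock_t */
    clock_t t0 = clock();
    simple_sgemv(A, B, C, size);
    double cpu_ms = 1000.0 * (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* GPU timing with CUDA events (kernel only, no host<->device copies) */
    cudaEvent_t evStart, evStop;
    cudaEventCreate(&evStart);
    cudaEventCreate(&evStop);
    cudaEventRecord(evStart, 0);
    CUParaSgemv<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, size);
    cudaEventRecord(evStop, 0);
    cudaEventSynchronize(evStop);
    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, evStart, evStop);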

  16. Cont'd • CLapack • dgesv_ (N, NRHS, A, LDA, IPIV, B, LDB, INFO) • dgels_ (const char *trans, const int *M, const int *N, const int *nrhs, double *A, const int *lda, double *b, const int *ldb, double *work, const int * lwork, int *info) • CUBLAS Library • cublasStatus_t cublasDgemv(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const double *alpha, const double *A, int lda, const double *x, int incx, const double *beta, double *y, int incy)

  17. Cont'd • Simple Sgemv (CPU) • void simple_sgemv(float *A, float *B, float *C, unsigned int size) • Parallel Sgemv (GPU) • __global__ void CUParaSgemv(float *a, float *b, float *c, unsigned int size)

  18. Performance CLapack

  19. Cont'd • CPU vs. GPU (and CUBLAS) • Not coalesced access!

  20. Coalesced access • If threads 0, 1, 2, and 3 read global memory addresses 0x0, 0x4, 0x8, and 0xc, the accesses fall into one contiguous segment, so the read is coalesced. • Matrix A (12 elements): 0 1 2 3 4 5 6 7 8 9 a b • Coalesced pattern (at each loop iteration, consecutive threads read consecutive elements): Thread 0: 0 4 8; Thread 1: 1 5 9; Thread 2: 2 6 a; Thread 3: 3 7 b • Non-coalesced pattern (each thread walks its own contiguous chunk, so the threads' addresses are strided at every iteration): Thread 0: 0 1 2; Thread 1: 3 4 5; Thread 2: 6 7 8; Thread 3: 9 a b
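A sketch of what the coalesced kernel variant could look like, assuming the matrix is laid out (or transposed) so that a[k * size + id] holds the element thread id needs at iteration k; the kernel name and layout choice are assumptions, not the presentation's code.

    /* Coalesced variant: at each iteration k, consecutive threads read consecutive
       addresses a[k*size + id], so each warp's loads fall into contiguous segments. */
    __global__ void CUParaSgemvCoalesced(float *a, float *b, float *c, unsigned int size)
    {
        unsigned int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < size) {
            float sum = 0.0f;
            for (unsigned int k = 0; k < size; ++k)
                sum += a[k * size + id] * b[k];   /* stride-1 across threads */
            c[id] = sum;
        }
    }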

  21. Cont’d Naïve VS. Coalesced Access

  22. Analysis N = 1200

  23. Cont’d N = 3000

  24. Conclusion • The parallel computing method on the GPU has a significant effect when small data is involved in the solving routine. • Data (matrix) pre-processing also accounts for a large proportion of the total time.

  25. Future work • Multi-core computation on the GPU • Reduce the data pre-processing time • LU and QR decomposition in CUDA versions • Decomposition of the very large matrix K

  26. Questions & Comments?

  27. Thank you!

  28. References • LU Matrix Decomposition in parallel with CUDA: • http://www.noctua-blog.com/index.php/2011/04/21/lu-matrix-decomposition-in-parallel-with-cuda/ • QR Decomposition on GPUs • http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-2/Kerr.pdf • The QR Algorithm for Finding Eigenvectors • http://www.cse.buffalo.edu/faculty/miller/Courses/CSE633/Eric-Mikida-Fall-2011.pdf

  29. Appendix A: LU decomposition algorithm • [Algorithm figure: LU factorization, with a pivot selection step at each stage of the elimination.]
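The algorithm itself appeared as a figure in the original slide; the following is a minimal CPU sketch of LU factorization with partial pivoting, not the presentation's own code.

    /* In-place LU decomposition with partial pivoting. A is n x n, row-major; after
       the call its strict lower triangle holds L (unit diagonal) and its upper
       triangle holds U, while piv records the row interchanges. */
    #include <math.h>

    void lu_decompose(double *A, int *piv, int n)
    {
        for (int k = 0; k < n; ++k) {
            /* Pivot: find the row with the largest |A[i][k]| at or below row k. */
            int p = k;
            for (int i = k + 1; i < n; ++i)
                if (fabs(A[i * n + k]) > fabs(A[p * n + k])) p = i;
            piv[k] = p;

            /* Swap rows k and p. */
            for (int j = 0; j < n; ++j) {
                double t = A[k * n + j];
                A[k * n + j] = A[p * n + j];
                A[p * n + j] = t;
            }

            /* Eliminate below the pivot. */
            for (int i = k + 1; i < n; ++i) {
                A[i * n + k] /= A[k * n + k];          /* multiplier, stored as L(i,k) */
                for (int j = k + 1; j < n; ++j)
                    A[i * n + j] -= A[i * n + k] * A[k * n + j];
            }
        }
    }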

  30. QR decomposition methods • Gram-Schmidt algorithm • Givens rotations • Householder reflections
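As an illustration of the first method above, a minimal classical Gram-Schmidt QR sketch for a square row-major matrix; this is illustrative only (Householder reflections are usually preferred for numerical stability), and the caller is assumed to zero-initialize R.

    /* Classical Gram-Schmidt: A = Q * R, with orthonormal columns in Q.
       A, Q, R are n x n, row-major; only the upper triangle of R is written. */
    #include <math.h>

    void gram_schmidt_qr(const double *A, double *Q, double *R, int n)
    {
        for (int j = 0; j < n; ++j) {
            /* Start from column j of A. */
            for (int i = 0; i < n; ++i)
                Q[i * n + j] = A[i * n + j];

            /* Subtract projections onto the previous orthonormal columns. */
            for (int k = 0; k < j; ++k) {
                double r = 0.0;
                for (int i = 0; i < n; ++i)
                    r += Q[i * n + k] * A[i * n + j];
                R[k * n + j] = r;
                for (int i = 0; i < n; ++i)
                    Q[i * n + j] -= r * Q[i * n + k];
            }

            /* Normalize column j. */
            double norm = 0.0;
            for (int i = 0; i < n; ++i)
                norm += Q[i * n + j] * Q[i * n + j];
            norm = sqrt(norm);
            R[j * n + j] = norm;
            for (int i = 0; i < n; ++i)
                Q[i * n + j] /= norm;
        }
    }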

  31. Appendix B: Other methods on GPU • CUFFT • The CUFFT Fast Fourier Transform (FFT) library uses a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets, and it supports the following features: • 1D, 2D, and 3D transforms of complex and real-valued data • Batch execution for doing multiple transforms of any dimension in parallel • 2D and 3D transform sizes in the range [2, 16384] in any dimension • 1D transform sizes up to 8 million elements • In-place and out-of-place transforms for real and complex data • Double-precision transforms on compatible hardware (GT200 and later GPUs) • Support for streamed execution, enabling simultaneous computation together with data movement
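A minimal sketch of using CUFFT for a 1D complex-to-complex forward transform; the signal length and contents are placeholders.

    #include <cuda_runtime.h>
    #include <cufft.h>

    int main(void)
    {
        const int NX = 256;

        /* Allocate and (here, trivially) fill the signal on the device. */
        cufftComplex *d_data;
        cudaMalloc((void **)&d_data, NX * sizeof(cufftComplex));
        cudaMemset(d_data, 0, NX * sizeof(cufftComplex));   /* dummy input */

        /* Create a 1D C2C plan and execute the forward transform in place. */
        cufftHandle plan;
        cufftPlan1d(&plan, NX, CUFFT_C2C, 1);
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

        cufftDestroy(plan);
        cudaFree(d_data);
        return 0;
    }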
