
Parallelization and CUDA libraries



Presentation Transcript


  1. Parallelization and CUDA Libraries • Lei Zhou, Yafeng Yin, Hong Man

  2. Outline • GPU & CUDA • Manual CUDA Coding • CUDA Libraries • FIR Realization • Auto-Parallelizing Tools

  3. GPU & CUDA • GPUs are massively multithreaded many-core chips: hundreds of scalar processors, tens of thousands of concurrent threads • CUDA stands for Compute Unified Device Architecture: a parallel computing architecture developed by NVIDIA, and the computing engine in its GPUs • CUDA is accessible to software developers through industry-standard programming languages • Example devices: GeForce 8800 GTX (128 cores), Tesla C1060 (240 cores)

  4. Processing Flow • Serial code executes on the host (CPU), while parallel code executes on the device (GPU).
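The host/device split on this slide can be sketched with a vector addition. This is a minimal illustration, not code from the presentation: the names `vecAdd` and `addOnDevice` are hypothetical, and the block size of 256 is an arbitrary assumption.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Device code: each thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host code (hypothetical helper): serial setup, copy-in, parallel launch, copy-out.
// Returns true if every CUDA call succeeded.
bool addOnDevice(const std::vector<float>& a, const std::vector<float>& b,
                 std::vector<float>& c) {
    if (a.size() != b.size()) return false;
    c.assign(a.size(), 0.0f);
    if (a.empty()) return true;

    int n = static_cast<int>(a.size());
    size_t bytes = a.size() * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    // 1. Allocate device memory and copy the inputs host -> device.
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess
           && cudaMalloc(&dB, bytes) == cudaSuccess
           && cudaMalloc(&dC, bytes) == cudaSuccess
           && cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        // 2. Launch the parallel part on the device.
        int threads = 256;  // assumed block size
        int blocks = (n + threads - 1) / threads;
        vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

        // 3. Copy the result device -> host (this also waits for the kernel).
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return ok;
}
```

Calling `addOnDevice(a, b, c)` from ordinary serial host code fills `c` with the element-wise sum; only the body of `vecAdd` runs on the GPU.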

  5. Manual CUDA Coding • Find parallel kernels • Improve data reuse inside kernels for better compute intensity • Access memory in a GPU-friendly way • Take advantage of the complex memory hierarchy that makes the GPU fast • Reduce the copy-in and copy-out transfers that pile up on the PCIe bus • Reduce memory usage on the GPU • Limit inter-block synchronizations
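Several of these guidelines can be seen together in a block-wise sum. The sketch below is an assumption-laden illustration rather than the authors' code: `blockSum`, `sumOnDevice`, and `kThreads` are hypothetical names, and error checking is omitted for brevity. It loads global memory with consecutive threads reading consecutive addresses, reuses the data in shared memory, transfers the input once, and leaves the final combination of per-block results to the host so that no inter-block synchronization is needed.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // threads per block; a power of two for the tree reduction

// Each block reduces kThreads input elements to one partial sum.
__global__ void blockSum(const float* in, float* partial, int n) {
    __shared__ float tile[kThreads];  // fast on-chip memory shared by the block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // GPU-friendly access: neighbouring threads read neighbouring addresses.
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Data reuse: the tree reduction works entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();  // synchronizes threads within this block only
    }
    if (tid == 0) partial[blockIdx.x] = tile[0];
}

// Hypothetical host wrapper; error checking omitted for brevity.
float sumOnDevice(const std::vector<float>& data) {
    if (data.empty()) return 0.0f;
    int n = static_cast<int>(data.size());
    int blocks = (n + kThreads - 1) / kThreads;

    float *dIn = nullptr, *dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));

    // One copy-in of the whole input, one small copy-out of the partial sums.
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    blockSum<<<blocks, kThreads>>>(dIn, dPartial, n);
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dPartial);

    // The per-block results are combined serially on the host.
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```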

  6. CUDA Libraries • Basic CUDA computation libraries: CUBLAS, CUFFT, GPULib • Advanced CUDA computation libraries: CULA, MAGMA, VSIPL

  7. Basic libraries • CUBLAS provides a set of functions for basic vector and matrix operations: copy, swap, dot product, Euclidean norm, etc. • CUFFT is the CUDA FFT library: cufftPlan1d(), cufftPlan2d(), cufftPlan3d() • GPULib provides a library of mathematical functions: addition, subtraction, multiplication, and division; unary functions including sin(), cos(), gamma(), and exp(); interpolation, array reshaping, array slicing, and reduction operations
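As an illustration of the CUBLAS operations listed on this slide, the sketch below computes a dot product and a Euclidean norm. It assumes the handle-based `cublas_v2.h` interface, which may be newer than the CUBLAS version the slides were written against; the wrapper name `dotAndNorm` is hypothetical.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: computes dot(x, y) and ||x||_2 on the GPU.
// Returns false if the inputs are unusable or any library call fails.
bool dotAndNorm(const std::vector<float>& x, const std::vector<float>& y,
                float* dot, float* norm) {
    if (x.empty() || x.size() != y.size()) return false;
    int n = static_cast<int>(x.size());

    float *dX = nullptr, *dY = nullptr;
    cublasHandle_t handle = nullptr;

    bool ok = cudaMalloc(&dX, n * sizeof(float)) == cudaSuccess
           && cudaMalloc(&dY, n * sizeof(float)) == cudaSuccess
           && cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS
           // Copy both vectors host -> device with unit stride.
           && cublasSetVector(n, sizeof(float), x.data(), 1, dX, 1) == CUBLAS_STATUS_SUCCESS
           && cublasSetVector(n, sizeof(float), y.data(), 1, dY, 1) == CUBLAS_STATUS_SUCCESS
           // With the default (host) pointer mode, results land in host memory.
           && cublasSdot(handle, n, dX, 1, dY, 1, dot) == CUBLAS_STATUS_SUCCESS
           && cublasSnrm2(handle, n, dX, 1, norm) == CUBLAS_STATUS_SUCCESS;

    if (handle) cublasDestroy(handle);
    cudaFree(dX);
    cudaFree(dY);
    return ok;
}
```

The program has to be linked against the cuBLAS library (for example with `-lcublas` when building with `nvcc`).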

  8. Advanced libraries • CULA (GPU-Accelerated Linear Algebra): provides LAPACK (Linear Algebra PACKage) functions on CUDA GPUs • MAGMA (Matrix Algebra on GPU and Multicore Architectures): aims to develop a dense linear algebra library similar to LAPACK, but for heterogeneous/hybrid "multicore + GPU" systems

  9. Advanced libraries: VSIPL • VSIPL: Vector Signal Image Processing Library • Generalized matrix product • Fast FIR filtering • Correlation • Fast Fourier Transform • QR decomposition • Random number generation • Elementwise arithmetic, logical, and comparison operators • Linear algebra procedures

  10. Example

          // Allocate device memory for filter kernel
          Complex* d_filter_kernel;
          cutilSafeCall(cudaMalloc((void**)&d_filter_kernel, mem_size));

          // Copy host memory to device
          cutilSafeCall(cudaMemcpy(d_filter_kernel, h_padded_filter_kernel, mem_size,
                                   cudaMemcpyHostToDevice));

          // CUFFT plan
          cufftHandle plan;
          cufftSafeCall(cufftPlan1d(&plan, new_size, CUFFT_C2C, 1));

          // Transform signal and kernel
          cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_signal,
                                     (cufftComplex *)d_signal, CUFFT_FORWARD));
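The fragment on this slide relies on variables and `cutil` helper macros defined elsewhere. For a self-contained view of the same CUFFT calls, here is a hedged sketch of a complete plan/execute/destroy cycle: a forward transform followed by an inverse transform. The function name `fftRoundTrip` is hypothetical, and the division by `n` is there because cuFFT transforms are unnormalized.

```cuda
#include <cufft.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: forward FFT followed by inverse FFT, in place on the device.
// On success, `signal` holds (approximately) its original values again.
bool fftRoundTrip(std::vector<cufftComplex>& signal) {
    if (signal.empty()) return false;
    int n = static_cast<int>(signal.size());
    size_t bytes = signal.size() * sizeof(cufftComplex);

    cufftComplex* dSignal = nullptr;
    cufftHandle plan = 0;
    bool planned = false;

    // Allocate device memory and copy the signal host -> device.
    bool ok = cudaMalloc(&dSignal, bytes) == cudaSuccess
           && cudaMemcpy(dSignal, signal.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    // One 1D complex-to-complex plan of length n, batch size 1.
    if (ok) {
        planned = cufftPlan1d(&plan, n, CUFFT_C2C, 1) == CUFFT_SUCCESS;
        ok = planned;
    }

    // Forward then inverse transform, then copy the result device -> host.
    ok = ok
      && cufftExecC2C(plan, dSignal, dSignal, CUFFT_FORWARD) == CUFFT_SUCCESS
      && cufftExecC2C(plan, dSignal, dSignal, CUFFT_INVERSE) == CUFFT_SUCCESS
      && cudaMemcpy(signal.data(), dSignal, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    if (planned) cufftDestroy(plan);
    cudaFree(dSignal);

    // cuFFT does not normalize: forward + inverse scales every element by n.
    if (ok) {
        for (cufftComplex& v : signal) {
            v.x /= n;
            v.y /= n;
        }
    }
    return ok;
}
```

An FFT-based filter would insert a pointwise complex multiplication between the two transforms; that step is left out here to keep the sketch short.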

  11. FIR Realization on CUDA

  12. FIR Realization on CUDA [figure not recovered; surviving labels: Threads, t]
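The transcript does not preserve the diagrams for these two slides, so the authors' actual thread mapping is unknown. One common way to realize a direct-form FIR filter, y[i] = Σ h[k]·x[i−k], on CUDA is to give each thread one output sample. The sketch below assumes that mapping; `firKernel` and `firOnDevice` are hypothetical names, input samples before index 0 are taken as zero, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per output sample: y[i] = sum over k of h[k] * x[i - k].
// Samples before the start of the signal are treated as zero (assumption).
__global__ void firKernel(const float* x, const float* h, float* y, int n, int taps) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < taps && k <= i; ++k) {
        acc += h[k] * x[i - k];
    }
    y[i] = acc;
}

// Hypothetical host wrapper; error checking omitted for brevity.
std::vector<float> firOnDevice(const std::vector<float>& x, const std::vector<float>& h) {
    std::vector<float> y(x.size(), 0.0f);
    if (x.empty() || h.empty()) return y;

    int n = static_cast<int>(x.size());
    int taps = static_cast<int>(h.size());
    float *dX = nullptr, *dH = nullptr, *dY = nullptr;
    cudaMalloc(&dX, n * sizeof(float));
    cudaMalloc(&dH, taps * sizeof(float));
    cudaMalloc(&dY, n * sizeof(float));

    // Copy the signal and the filter coefficients to the device.
    cudaMemcpy(dX, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dH, h.data(), taps * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;  // assumed block size
    int blocks = (n + threads - 1) / threads;
    firKernel<<<blocks, threads>>>(dX, dH, dY, n, taps);

    // Copy the filtered signal back to the host.
    cudaMemcpy(y.data(), dY, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dX);
    cudaFree(dH);
    cudaFree(dY);
    return y;
}
```

With `x = {1, 2, 3, 4}` and a two-tap filter `h = {1, 1}`, each output is the sum of the current and previous sample, giving `{1, 3, 5, 7}`.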

  13. CUDA Demo (FIR) • GPU: NVIDIA GeForce 8600 GT • CPU: Intel Duo CPU, 2.33 GHz • Software: Visual Studio 2005

  14. CUDA Demo (FIR)

  15. Auto-Parallelizing Tools • Par4All (open-source environment): C and Fortran to CUDA C • PGI Accelerator: auto-parallelizing compiler, Fortran and C to CUDA C • CAPS HMPP: auto-parallelizing compiler, C and Fortran to CUDA C • Goose: auto-parallelizing compiler, C to CUDA C • NOAA F2C: Fortran to CUDA C translator
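To make the idea of source-to-CUDA translation concrete, the sketch below places a sequential loop with independent iterations next to a hand-written CUDA counterpart in which the loop index becomes the thread index. It is an illustration of the general loop-to-kernel mapping only, not the output of any tool listed on this slide; all function names are hypothetical and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Sequential input: every iteration is independent of the others.
void saxpyLoop(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Hand-written CUDA counterpart: the loop index becomes the global thread index.
__global__ void saxpyKernel(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical host wrapper around the kernel; returns the updated y.
std::vector<float> saxpyOnDevice(float a, const std::vector<float>& x, std::vector<float> y) {
    if (x.empty() || x.size() != y.size()) return y;
    int n = static_cast<int>(x.size());
    size_t bytes = x.size() * sizeof(float);

    float *dX = nullptr, *dY = nullptr;
    cudaMalloc(&dX, bytes);
    cudaMalloc(&dY, bytes);
    cudaMemcpy(dX, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dY, y.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;  // assumed block size
    int blocks = (n + threads - 1) / threads;
    saxpyKernel<<<blocks, threads>>>(n, a, dX, dY);

    cudaMemcpy(y.data(), dY, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dX);
    cudaFree(dY);
    return y;
}
```

Besides rewriting the loop body, a translator also has to generate the allocation, transfer, and launch code that `saxpyOnDevice` writes out by hand here.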

  16. Par4All (open source environment): C and Fortran to CUDA C
