

  1. MS Thesis Defense “IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA” by Deepthi Gummadi CoE EECS Department April 21, 2014

  2. About Me • Deepthi Gummadi • MS in Computer Networking with Thesis • LaTeX programmer at CAPPLab since Fall 2013 • Publications • “New CPU-to-GPU Memory Mapping Technique,” in IEEE SoutheastCon 2014. • “The Impact of Thread Synchronization and Data Parallelism on Multicore Game Programming,” accepted in IEEE ICIEV-2014. • “Feasibility Study of Spider-Web Multicore/Manycore Network Architectures,” in preparation. • “Investigating Impact of Data Parallelism on Computer Game Engine,” under review, IJCVSP Journal, 2014.

  3. Committee Members • Dr. Abu Asaduzzaman, EECS Dept. • Dr. Ramazan Asmatulu, ME Dept. • Dr. Zheng Chen, EECS Dept.

  4. “IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA” Outline ► • Introduction • Motivation • Problem Statement • Proposal • Evaluation • Experimental Results • Conclusions • Future Work Q U E S T I O N S ? Any time, please.

  5. Introduction Central Processing Unit (CPU) Technology • Interprets and executes program instructions. What is new about CPUs? • Initially, processors executed instructions sequentially. • Around the millennium, single-core speed gains stalled and designs shifted toward parallelism. • Currently, we have multi-core on-chip CPUs. CPU Speed Chart

  6. Why do we use cache memory? Several memory layers: • Lower-level caches (e.g., L1) – smaller and faster, feeding the compute units. • Higher-level caches (e.g., L2/L3) – larger and slower, staging data from main memory. Cache Memory Organization: Intel 4-core processor

  7. NVIDIA Graphics Processing Unit • Parallel processing architecture • Components • Streaming Multiprocessors (SMs) • Warp schedulers • Execution pipelines • Registers • Memory organization • Shared memory (fast, on-chip, per SM) • Global memory (larger, off-chip DRAM) GPU Memory Organization

  8. CPU and GPU The CPU and GPU work together: the CPU runs the serial, control-heavy parts of a program while the GPU accelerates the data-parallel parts.

  9. CPU-GPU Computing Workflow Step 1: CPU allocates GPU memory and copies the input data to it. cudaMalloc() cudaMemcpy()

  10. CPU-GPU Computing Workflow Step 2: CPU sends function parameters and instructions to GPU.

  11. CPU-GPU Computing Workflow Step 3: GPU executes the instructions based on received commands.

  12. CPU-GPU Computing Workflow Step 4: After execution, the results are retrieved from GPU DRAM back to CPU memory.
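These four steps map onto a handful of CUDA runtime calls. The following is a minimal, hypothetical sketch of the round trip, assuming a simple element-wise kernel; the kernel name (scale) and the sizes are illustrative, not taken from the thesis code.

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scale(float *d_data, int n) {
        // Step 3: each GPU thread executes the same instruction on its element
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d_data[i] *= 2.0f;
    }

    int main(void) {
        const int n = 1024;
        float h_data[n];
        for (int i = 0; i < n; i++) h_data[i] = (float)i;

        // Step 1: allocate GPU global memory and copy the input data to it
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

        // Step 2: CPU launches the kernel, passing function parameters
        scale<<<(n + 255) / 256, 256>>>(d_data, n);

        // Step 4: retrieve the results from GPU DRAM back to CPU memory
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_data);

        printf("h_data[2] = %f\n", h_data[2]); // expect 4.0
        return 0;
    }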

  13. Motivation: Two Parallelization Strategies • Data-level parallelism • Spatial data partitioning • Temporal data partitioning • Instruction-level parallelism • Spatial instruction partitioning • Temporal instruction partitioning

  14. Motivation • Parallelism and optimization techniques simplify CUDA programming. • From the developer’s view, the memory is unified.

  15. Problem Statement The traditional CPU-to-GPU global memory mapping technique is not well suited to GPU shared memory.

  16. “IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA” Outline ► • Introduction • Motivation • Problem Statement • Proposal • Evaluation • Experimental Results • Conclusions • Future Work Q U E S T I O N S ? Any time, please.

  17. Proposal Proposed CPU-to-GPU memory mapping to improve GPU shared memory performance

  18. Proposed Technique Major Steps: Step 1: Start Step 2: Analyze problems; determine input parameters. Step 3: Analyze GPU card parameters/characteristics. Step 4: Analyze CPU and GPU memory organizations. Step 5: Determine the number of computations and the number of threads. Step 6: Identify/Partition the data-blocks for each thread. Step 7: Copy/Regroup CPU data-blocks to GPU global memory. Step 8: Stop
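Step 7 is the core of the technique. As a hypothetical illustration of what regrouping could look like for an N×N row-major array processed in TILE×TILE thread blocks, the host-side sketch below rearranges the data so that each thread block’s tile lands in one contiguous span of GPU global memory; the function and parameter names (regroup_tiles, tile) are assumptions for illustration, not the thesis implementation.

    // Regroup an n x n row-major host array so that each tile x tile block
    // is stored contiguously, tile after tile, before the host-to-device copy.
    void regroup_tiles(const float *src, float *dst, int n, int tile) {
        int tilesPerRow = n / tile;              // assume n is a multiple of tile
        for (int ty = 0; ty < tilesPerRow; ty++)
            for (int tx = 0; tx < tilesPerRow; tx++)
                for (int y = 0; y < tile; y++)
                    for (int x = 0; x < tile; x++) {
                        int tileIdx = ty * tilesPerRow + tx;
                        // destination: tiles laid out back to back
                        dst[tileIdx * tile * tile + y * tile + x] =
                            src[(ty * tile + y) * n + (tx * tile + x)];
                    }
    }

    // Usage sketch: regroup on the CPU, then one cudaMemcpy to global memory.
    //   regroup_tiles(h_in, h_regrouped, n, 16);
    //   cudaMemcpy(d_in, h_regrouped, n * n * sizeof(float), cudaMemcpyHostToDevice);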

  19. Proposed Technique: Traditional Mapping vs. Proposed Mapping • Traditional mapping: data is copied directly from CPU memory to GPU global memory. Each thread’s data is retrieved from scattered global memory blocks, so it is difficult to stage the data into GPU shared memory. • Proposed mapping: data is regrouped and then copied from CPU memory to GPU global memory. Each thread’s data is retrieved from consecutive global memory blocks, so it is easy to stage the data into GPU shared memory.
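On the GPU side, the payoff of the proposed layout is that a thread block can stage its tile into shared memory with a single contiguous, coalesced read. A minimal kernel sketch under the same hypothetical tile layout as above (process_tiles is an illustrative name):

    __global__ void process_tiles(const float *g_in, float *g_out, int tile) {
        extern __shared__ float s_tile[];          // one tile per thread block
        int tileIdx = blockIdx.x;                  // one block per regrouped tile
        int t       = threadIdx.x;                 // tile*tile threads per block

        // Consecutive threads read consecutive global addresses (coalesced),
        // because the regrouped layout made each tile contiguous.
        s_tile[t] = g_in[tileIdx * tile * tile + t];
        __syncthreads();

        // ... compute on s_tile in fast shared memory ...
        g_out[tileIdx * tile * tile + t] = s_tile[t];
    }

    // Launch sketch:
    //   process_tiles<<<numTiles, tile * tile, tile * tile * sizeof(float)>>>(d_in, d_out, 16);

With the traditional layout, the same load would require strided or scattered global reads, which is what makes shared-memory staging difficult.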

  20. Evaluation System Parameters: • CPU: dual processor, 2.13 GHz • Fermi card: 14 SMs, 32 CUDA cores per SM • Kepler card: 13 SMs, 192 CUDA cores per SM

  21. Evaluation • Memory sizes of the CPU and the GPU cards are taken into account. • The input parameters are the number of rows and the number of columns; the output parameter is execution time.
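For reference, the GPU parameters used in the evaluation (SM count, shared memory per block, global memory size) can be read at run time through the CUDA runtime API; a small sketch:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        printf("Name: %s\n", prop.name);
        printf("SMs: %d\n", prop.multiProcessorCount);            // e.g., 14 on the Fermi card
        printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("Global memory: %zu bytes\n", prop.totalGlobalMem);
        return 0;
    }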

  22. Evaluation Electric charge distribution by Laplace’s equation for a 2D problem (finite-difference approximation):

  ϵ_x(i,j)·(Φ_{i+1,j} − Φ_{i,j})/dx + ϵ_y(i,j)·(Φ_{i,j+1} − Φ_{i,j})/dy + ϵ_x(i−1,j)·(Φ_{i,j} − Φ_{i−1,j})/dx + ϵ_y(i,j−1)·(Φ_{i,j} − Φ_{i,j−1})/dy = 0

  where Φ = electric potential; ϵ = medium permittivity; dx, dy = spatial grid sizes; Φ_{i,j} = electric potential defined at lattice point (i, j); ϵ_x(i,j), ϵ_y(i,j) = effective x- and y-direction permittivities defined at the edges of the element cell (i, j).

  23. Evaluation For a uniform material the permittivity can be taken as the same everywhere, so the equation reduces to:

  (Φ_{i+1,j} − Φ_{i,j})/dx + (Φ_{i,j+1} − Φ_{i,j})/dy + (Φ_{i,j} − Φ_{i−1,j})/dx + (Φ_{i,j} − Φ_{i,j−1})/dy = 0

  24. “IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA” Outline ► • Introduction • Motivation • Problem Statement • Proposal • Evaluation • Experimental Results • Conclusions • Future Work Q U E S T I O N S ? Any time, please.

  25. Experimental Results • Conducted a study of electric charge distribution using Laplace’s equation. • Implemented three versions: • CPU only. • GPU with shared memory. • GPU without shared memory. • Inputs / Outputs • Problem size (N for an N×N matrix) • Execution time

  26. Experimental Results Validation of our CUDA/C code: N_{n,m} = (1/5)(N_{n,m−1} + N_{n,m+1} + N_{n,m} + N_{n−1,m} + N_{n+1,m}), where 1 ≤ n ≤ 8 and 1 ≤ m ≤ 8. Both the CPU/C and CUDA/C programs produce the same values.

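As a sketch of the update rule this validation uses, a plain-C version might look like the following (the array and function names, and the boundary handling via a one-cell border, are illustrative assumptions):

    // One sweep of the validation stencil on a grid with a one-cell border:
    // each interior point becomes the average of itself and its four neighbors.
    #define DIM 10   // 8 x 8 interior points plus the boundary, so 1 <= n, m <= 8

    void sweep(const float in[DIM][DIM], float out[DIM][DIM]) {
        for (int n = 1; n <= 8; n++)
            for (int m = 1; m <= 8; m++)
                out[n][m] = (in[n][m-1] + in[n][m+1] + in[n][m] +
                             in[n-1][m] + in[n+1][m]) / 5.0f;
    }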

  28. Experimental Results Impact of GPU shared memory • As the number of threads increases, the processing time decreases (up to 8×8 threads). • Beyond 8×8 threads, the GPU with shared memory shows better performance.

  29. Experimental Results Impact of the Number of Threads • For a fixed amount of shared memory, GPU processing time decreases as the number of threads increases (up to 16×16 threads). • Beyond 16×16 threads, the Kepler card shows better performance.

  30. Experimental Results Impact of the amount of shared memory • As the amount of GPU shared memory used increases, the processing time decreases.

  31. Experimental Results Impact of the proposed data regrouping technique • With data regrouping and shared memory, the processing time decreases as the number of threads increases. • Between the GPU versions with and without shared memory, the shared-memory version performs better as the number of threads grows.

  32. Conclusions • Fast, effective analysis of complex systems requires high-performance computation. • NVIDIA CUDA CPU/GPU computing proves its potential on such computations. • Traditional memory mapping follows the CPU’s locality principle, so the data does not fit well into GPU shared memory. • It is more beneficial to keep data in GPU shared memory than in GPU global memory.

  33. Conclusions • To overcome this problem, we proposed a new CPU-to-GPU memory mapping to improve performance. • Implemented and evaluated three different versions. • Results indicate that the proposed CPU-to-GPU memory mapping technique decreases the overall execution time by more than 75%.

  34. Future Extensions • Modeling and simulation of nanocomposites: nanocomposites require a large number of computations at high speed. • Aircraft applications: high-performance computation is required to study mixtures of composite materials.

  35. “IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA” Questions?

  36. “IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA” Thank you Contact: Deepthi Gummadi E-mail: dxgummadi@wichita.edu
