This project aimed to convert a Fortran program, the Peruvian Anchovy Individual-Based Model, to run efficiently on an NVIDIA 8800 GTX GPU using CUDA parallel computing architecture. The conversion process involved updating the original model, implementing CUDA kernels, optimizing memory transfers, addressing errors, and improving efficiency. The results included improved run time but also highlighted challenges such as divergent threads and inefficient instructions. Future directions included rewriting code, reducing memory transfers, and optimizing GPU kernel calls. Acknowledgements were given to collaborators and supporters of the project.
Acceleration of Peruvian Anchovy Individual Based Model on a Single GPU Project By: Kevin Demers Advisors: Steve Cousins, Dr. Huijie Xue, Dr. Fei Chai
Project Goal • The goal of this project was to convert an existing Fortran program to run on an NVIDIA 8800 GTX Graphics Processing Unit (GPU). • NVIDIA's CUDA parallel computing architecture was used. • New tools from The Portland Group allowed for direct execution of Fortran code on a CUDA-capable GPU. • The desired result was a program capable of working with expanded data sets without a substantial increase in run time.
The Original Program • Peruvian Anchovy Individual Based Model • No interaction between Anchovies • Written entirely in Fortran • 14784 Anchovies • Years modeled: 1991-2007
Conversion Process – Updating • Original model – heavy use of global variables, an outdated approach in Fortran. • Program rewritten to use Modules and Derived Types. • Modules are necessary for CUDA use: CUDA kernels (subroutines) can only use data from the modules they are members of.
Conversion Process - CUDA

program main
  use module1
  implicit none
  integer :: a, b                      ! local variables
  call subroutine1(a, b, c, d, e)      ! call subroutine
end program

subroutine subroutine1(a, b, c, d, e)
  use module1, only: comp              ! only the type; c, d, e arrive as arguments
  implicit none
  integer :: a, b, c, d                ! let subroutine know the types of its arguments
  type(comp) :: e
  ! ... some code ...
end subroutine

module module1
  type comp                            ! derived type
    integer :: n
    real :: r
  end type comp
  integer :: c, d
  type(comp) :: e
end module module1
Conversion Process - CUDA

program main
  use cudafor
  use module1
  implicit none
  integer :: a, b
  integer, allocatable, device :: a_d, b_d    ! separate device variables
  allocate(a_d, b_d, c_d, d_d, e_d)           ! allocate variables on the device
  a_d = a                                     ! copy variable data to the device
  b_d = b
  c_d = c
  d_d = d
  e_d = e
  call subroutine1<<<grid,block>>>(a_d, b_d, c_d, d_d, e_d)
  a = a_d                                     ! copy variable data back from the device
  b = b_d
  c = c_d
  d = d_d
  e = e_d
  deallocate(a_d, b_d, c_d, d_d, e_d)         ! deallocate -- VERY IMPORTANT
end program
Conversion Process - CUDA

module module1
  use cudafor
  type comp            ! derived type; sequence is required on the device
    sequence
    integer :: n
    real :: r
  end type comp
  integer :: c, d
  integer, device, allocatable :: c_d, d_d
  type(comp) :: e
  type(comp), device, allocatable :: e_d
contains
  ! Subroutine MUST be in the module now
  attributes(global) subroutine subroutine1(a_d, b_d, c_d, d_d, e_d)
    implicit none
    integer, device :: a_d, b_d, c_d, d_d     ! let subroutine know about device variables
    type(comp), device :: e_d
    integer :: idx
    idx = (blockidx%x-1)*blockdim%x + threadidx%x   ! x coordinate of this thread
    ! ... per-thread work ...
  end subroutine
end module module1
Conversion Process - Identifying • Four subroutines accounted for 60–80% of run time • The four subroutines loop sequentially over each fish in the model • The parallel version creates a thread for each fish • CPU code primarily handles File I/O
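The thread-per-fish launch described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the kernel name fish_step and the block size of 256 are hypothetical.

```fortran
! Host side: launch one thread per fish, rounding the grid up so
! every fish is covered (hypothetical names; sketch only).
integer, parameter :: nfish = 14784
integer, parameter :: threads_per_block = 256
integer :: nblocks
nblocks = (nfish + threads_per_block - 1) / threads_per_block   ! = 58 blocks
call fish_step<<<nblocks, threads_per_block>>>(nfish)

! Device side: each thread advances exactly one fish.
attributes(global) subroutine fish_step(nfish)
  integer, value :: nfish
  integer :: idx
  idx = (blockidx%x - 1)*blockdim%x + threadidx%x
  if (idx > nfish) return   ! surplus threads in the last block do nothing
  ! ... advance fish number idx by one model step ...
end subroutine
```

Because 58 blocks of 256 threads give 14,848 threads for 14,784 fish, the guard at the top of the kernel keeps the 64 surplus threads from touching memory.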
Conversion Process – Problems • CUDA/Fortran tools are relatively new • Some CUDA features are unsupported or broken • Debugging – cryptic error messages • Standard profiling tools would not work • gprof-style profiling is very inaccurate
Conversion Process - Errors • The NVIDIA Visual Profiler had to be used • The Visual Profiler only profiles GPU code • It doesn't provide detailed information • It runs the program 4 times
Efficiency – Memory Transfers • Excessive memory transfers slow down GPU programs • The program performs thousands of memory transfers • Each memory transfer is large (10 MB+) • Memory transfers were restructured to be as efficient as possible
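One way such per-step copies can be reduced is to keep the fish state resident on the device across kernel calls and copy back only what File I/O actually needs. A hedged sketch, with hypothetical variable and kernel names:

```fortran
! Allocate once, copy in once, and leave fish state on the GPU
! for the whole run; copy back only when output is written.
real, device, allocatable :: weight_d(:), length_d(:)
allocate(weight_d(nfish), length_d(nfish))
weight_d = weight             ! one large host-to-device copy, up front
length_d = length
do step = 1, nsteps
  call grow<<<nblocks, nthreads>>>(weight_d, length_d)
  if (mod(step, output_interval) == 0) then
    weight = weight_d         ! device-to-host copy only at output steps
  end if
end do
deallocate(weight_d, length_d)
```

The design trade-off is device memory occupancy versus transfer time: data that stays on the GPU between kernels never pays the PCIe cost twice.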
Why Isn't It Faster? • The NVIDIA 8800 GTX has 128 parallel cores • Each group of 8 cores makes up a multiprocessor • Each multiprocessor runs 32 threads (a warp) simultaneously • Maximum efficiency is obtained when all threads in a warp execute identical instructions
Why Isn't It Faster? – Cont'd • Slow clock speed • Inefficient instructions • Divergent threads • The GPU code contains a large number of branches
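Divergence arises because when threads in the same 32-thread warp take different branches, the hardware serializes both paths. A hypothetical per-fish kernel fragment (names and growth factors are illustrative only) shows the problem:

```fortran
! Sketch: if fish in one warp differ in life stage, both branches
! execute one after the other and half the threads idle in each.
attributes(global) subroutine update(stage_d, w_d, n)
  integer, value :: n
  integer, device :: stage_d(n)
  real, device :: w_d(n)
  integer :: idx
  idx = (blockidx%x - 1)*blockdim%x + threadidx%x
  if (idx > n) return
  if (stage_d(idx) == 1) then      ! juvenile growth path
    w_d(idx) = w_d(idx) * 1.02
  else                             ! adult growth path
    w_d(idx) = w_d(idx) * 1.01
  end if
end subroutine
```

With many such branches per kernel, a warp can spend most of its time executing instructions that only a fraction of its threads need.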
Future Direction • Rewrite code to group non-branching segments • Trade a single GPU kernel call for many small kernels without branches • Reduce data output to reduce memory transfer time • Dynamically decide what memory to transfer for each kernel call
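The "many small kernels" idea above amounts to partitioning the fish by branch outcome on the host, then launching one branch-free kernel per group. A hedged sketch, assuming hypothetical index arrays and kernel names:

```fortran
! Instead of one kernel that branches per fish, launch one
! straight-line kernel per group (sketch; names hypothetical).
! juvenile_idx_d / adult_idx_d hold the fish indices of each group.
call grow_juveniles<<<jblocks, nthreads>>>(juvenile_idx_d, njuv)
call grow_adults<<<ablocks, nthreads>>>(adult_idx_d, nadult)
! Each kernel touches only fish of one kind, so every warp
! executes a single path with no divergence.
```

The cost of this design is extra kernel-launch overhead and the host-side work of maintaining the group index lists, which is why it is traded off against the single-kernel version.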
Acknowledgements • Thanks to Steve Cousins, Yifeng Zhu, Bruce Segee, and all SuperME REU members. • Thanks to Robert England and Janice Gomm for all the food. • This research was supported by NSF grant CCF #0754951.
Questions? Comments? • I’m looking at you Robert…