
GPU Parallelization Strategies for Irregular Grids: Enhancing Computational Efficiency

This presentation, by Jim Rosinski, Mark Govett, Tom Henderson, and Jacques Middlecoff, explores GPU parallelization strategies for irregular grids. Key topics include a comparison of regular (lat/lon) and irregular (icosahedral) grid models, the challenges of threading and blocking over dimensions that carry data dependencies, and performance results from practical implementations in NIM. It covers techniques such as data transposition and chunking that expose enough parallelism for efficient GPU execution, and closes with the current software status and performance-scaling results.



Presentation Transcript


  1. GPU Parallelization Strategy for Irregular Grids September 8, 2011 Presented by Jim Rosinski, with Mark Govett, Tom Henderson, and Jacques Middlecoff

  2. Regular vs. Irregular Grid: Lat/Lon Model vs. Icosahedral Model
     • Near-constant resolution over the globe
     • Efficient high-resolution simulations
     (slide courtesy Dr. Jin Lee)

  3. Fortran Loop Structures in NIM
     • Dynamics: (lev, column)
       • Lots of data independence in both dimensions => thread over levels, block over columns
       • Exceptions: MPI messages, vertical summations
     • Physics (TBD): (column, lev, [chunk])
       • True for WRF, CAM
       • Dependence in the "lev" dimension
     Multi-core Workshop
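The dynamics decomposition can be sketched as follows (a hedged illustration, not NIM source code; the array names, sizes, and the trivial stand-in kernel are invented). Because every (level, column) element is independent, the level loop can map to GPU threads and the column loop to blocks:

```python
import numpy as np

nz, ncols = 32, 1024                  # hypothetical level and column counts
state = np.random.rand(nz, ncols)

def dynamics_update(state):
    """Stand-in dynamics kernel: every (k, ipn) element is independent,
    so on a GPU the k loop can map to threads and the ipn loop to blocks."""
    out = np.empty_like(state)
    for ipn in range(state.shape[1]):     # block over columns
        for k in range(state.shape[0]):   # thread over levels
            out[k, ipn] = 0.5 * state[k, ipn]   # stand-in for the real stencil
    return out

result = dynamics_update(state)
```

The two loops can run in either order with identical results, which is exactly the property that makes the thread/block mapping legal.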

  4. GPU-izing Physics
     • Problem 1: can't thread or block over "k" because most physics has a "k" dependence
       • In NIM this leaves only one dimension for parallelism, and efficient CUDA needs two
     • Problem 2: a data transpose is needed
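The kind of "k" dependence meant here can be illustrated with a downward vertical accumulation (an invented example, not an actual NIM physics routine): level k needs the result at level k-1, so the vertical loop is inherently serial and only the column dimension remains parallel.

```python
import numpy as np

nz, ncols = 32, 1024
flux = np.ones((nz, ncols))

def accumulate_down(flux):
    """Downward accumulation: level k needs level k-1 (a 'k' dependence),
    so the k loop cannot be threaded or blocked; only columns stay parallel."""
    acc = np.empty_like(flux)
    acc[0, :] = flux[0, :]
    for k in range(1, flux.shape[0]):
        acc[k, :] = acc[k - 1, :] + flux[k, :]
    return acc

acc = accumulate_down(flux)
```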

  5. Solution to GPU-izing Physics
     • Transpose and "chunking":

       !ACC$REGION(<chunksize>,<nchunks>,<dynvars,physvars:none>) BEGIN
       do n=1,nvars_dyn2phy            ! Loop over dynvars needed in phy
         do k=1,nz                     ! Vertical
       !ACC$DO PARALLEL (1)
           do c=1,nchunks              ! Chunksize*nchunks >= nip
       !ACC$DO VECTOR (1)
             do i=1,chunksize          ! 128 is a good number to choose for chunksize
               ipn = min (ipe, ips + (c-1)*chunksize + (i-1))
               physvars(i,c,k,n) = dynvars(k,ipn,n)
             end do
           end do
         end do
       end do
       !ACC$REGION END
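The same index arithmetic can be checked in Python (a sketch with invented values for `ips`, `ipe`, and the array shapes, and a single variable rather than the `n` loop, for brevity). It shows that chunking covers every column exactly once, with the `min` clamp simply re-reading the last column `ipe` when chunksize*nchunks overshoots nip:

```python
import numpy as np

nz, ips, ipe = 4, 1, 10242             # invented vertical size and column range
nip = ipe - ips + 1
chunksize = 128
nchunks = -(-nip // chunksize)         # ceiling division: chunksize*nchunks >= nip

dynvars = np.random.rand(nz, nip)      # (k, ipn) layout used by dynamics
physvars = np.empty((chunksize, nchunks, nz))   # (i, c, k) layout for physics

for k in range(nz):
    for c in range(nchunks):
        for i in range(chunksize):
            # same clamp as the Fortran: the final chunk re-reads column ipe
            ipn = min(ipe, ips + c * chunksize + i)
            physvars[i, c, k] = dynvars[k, ipn - ips]
```

Flattening the (i, c) dimensions in chunk-major order recovers the original column ordering, which is what makes the transpose reversible after the physics runs.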

  6. Transpose Performance in NIM

  7. CPU code runs fast
     • Used PAPI to count flops (Intel compiler)
       • Requires -O1 (no vectorization) for the counts to be accurate!
       • A 2nd run with -O3 (vectorization) gives the wallclock time
     • 27% of peak on a 2.8 GHz Westmere
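As a rough arithmetic check on the 27%-of-peak figure (hedged: the slide does not state the peak it measured against; 4 double-precision flops per cycle per core, from one SSE add plus one SSE multiply per cycle, is an assumption here):

```python
clock_ghz = 2.8            # Westmere clock from the slide
flops_per_cycle = 4        # ASSUMED: 2-wide SSE add + 2-wide SSE multiply per cycle
peak_gflops = clock_ghz * flops_per_cycle     # per-core double-precision peak
achieved_gflops = 0.27 * peak_gflops          # the slide's 27%-of-peak figure
print(f"peak ~{peak_gflops:.1f} GF/s per core, achieved ~{achieved_gflops:.2f} GF/s")
```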

  8. NIM scaling

  9. Current Software Status
     • Full dynamics runs on CPU or GPU
       • ~5X socket-to-socket speedup on the GPU
       • GPU solution judged reasonable by comparing output-field diffs vs. applying a rounding-level perturbation
     • Dummy physics can run on CPU or GPU
     • Single-source
       • GPU directives ignored in CPU mode
       • NO constructs that look like:
         #ifdef GPU
           <do this>
         #else
           <do that>
         #endif
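The single-source idea (one code path, with GPU directives that become inert in CPU mode instead of #ifdef branches) can be mimicked in Python with a no-op decorator. This is an illustration only; NIM uses compiler directives such as !ACC$REGION, not decorators, and the names below are invented:

```python
GPU_MODE = False   # CPU build: the "directive" below becomes inert

def acc_region(func):
    """Toy stand-in for a !ACC$REGION directive: in CPU mode the function
    is returned unchanged, so one source serves both targets."""
    if not GPU_MODE:
        return func                  # directive ignored: exactly one code path
    def offloaded(*args, **kwargs):
        # placeholder for a real GPU launch
        return func(*args, **kwargs)
    return offloaded

@acc_region
def saxpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

result = saxpy(2.0, [1.0, 2.0], [3.0, 4.0])   # identical call in CPU or GPU mode
```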
