This presentation, led by experts Jim Rosinski, Mark Govett, Tom Henderson, and Jacques Middlecoff, explores GPU parallelization strategies designed for irregular grids. Key highlights include a comparison of regular and icosahedral grid models, the challenges of threading and blocking over dimensions with dependencies, and performance results from the NIM (Nonhydrostatic Icosahedral Model) implementation. Techniques such as data transposition and chunking are presented for running physics efficiently while maximizing GPU capabilities, along with the current software status and performance scaling.
GPU Parallelization Strategy for Irregular Grids
September 8, 2011
Presented by Jim Rosinski
Mark Govett, Tom Henderson, Jacques Middlecoff
Regular vs. Irregular Grid: Lat/Lon Model vs. Icosahedral Model
The icosahedral grid offers:
• Near-constant resolution over the globe
• Efficient high-resolution simulations
(slide courtesy Dr. Jin Lee)
Fortran Loop Structures in NIM
• Dynamics: (lev,column)
• Lots of data independence in both dimensions => thread over levels, block over columns (see the sketch below)
• Exceptions: MPI messages, vertical summations
• Physics (TBD): (column,lev,[chunk])
• The same layout is used by WRF and CAM
• Dependence in the "lev" dimension
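A minimal sketch of the dynamics loop structure just described, using the !ACC$ directive style that appears on the transpose slide below. The subroutine and the arrays u and tend are illustrative, not actual NIM code; with the (lev,column) layout, blocking over columns and threading over levels keeps the thread loop stride-1:

      subroutine dyn_update (nz, ips, ipe, u, tend)
        integer, intent(in)    :: nz, ips, ipe
        real,    intent(in)    :: u(nz,ips:ipe)
        real,    intent(inout) :: tend(nz,ips:ipe)
        integer :: k, ipn
!ACC$REGION(<nz>,<ipe-ips+1>,<u,tend:none>) BEGIN
!ACC$DO PARALLEL (1)
        do ipn=ips,ipe            ! block over columns
!ACC$DO VECTOR (1)
          do k=1,nz               ! thread over levels (stride-1 access)
            tend(k,ipn) = tend(k,ipn) + u(k,ipn)
          end do
        end do
!ACC$REGION END
      end subroutine dyn_update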
GPU-izing Physics
• Problem 1: Can't thread or block over "k" because most physics has a "k" dependence (see the example below)
• In NIM this leaves only one dimension for parallelism, and efficient CUDA needs two
• Problem 2: A data transpose is needed, since dynamics is stored (lev,column) but physics expects (column,lev)
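A hypothetical illustration of the "k" dependence in question: a vertical accumulation where level k needs the result at level k-1, so the k loop must run serially and only the column dimension remains parallel. The names q and colsum are invented for this example, which uses the physics (column,lev) layout from the previous slide:

      subroutine column_accum (nip, nz, q, colsum)
        integer, intent(in)    :: nip, nz
        real,    intent(in)    :: q(nip,nz)
        real,    intent(inout) :: colsum(nip,nz)
        integer :: i, k
        colsum(:,1) = q(:,1)
        do k=2,nz                 ! serial: level k depends on level k-1
          do i=1,nip              ! the only parallelizable dimension
            colsum(i,k) = colsum(i,k-1) + q(i,k)
          end do
        end do
      end subroutine column_accum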
Solution to GPU-izing Physics
• Transpose and "chunking":

!ACC$REGION(<chunksize>,<nchunks>,<dynvars,physvars:none>) BEGIN
      do n=1,nvars_dyn2phy        ! Loop over dynvars needed in phy
        do k=1,nz                 ! Vertical
!ACC$DO PARALLEL (1)
          do c=1,nchunks          ! Chunksize*nchunks >= nip
!ACC$DO VECTOR (1)
            do i=1,chunksize      ! 128 is a good number to choose for chunksize
              ipn = min (ipe, ips + (c-1)*chunksize + (i-1))
              physvars(i,c,k,n) = dynvars(k,ipn,n)
            end do
          end do
        end do
      end do
!ACC$REGION END
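The min() clamp pads the final chunk by replicating column ipe, which is why chunksize*nchunks may exceed nip. After the physics runs, a corresponding inverse transpose must scatter results back to the dynamics layout. A sketch under the same assumptions (not shown on the original slide; nvars_phy2dyn is a hypothetical name), with a bounds guard replacing the clamp so padded entries are skipped rather than redundantly written:

!ACC$REGION(<chunksize>,<nchunks>,<dynvars,physvars:none>) BEGIN
      do n=1,nvars_phy2dyn        ! Loop over physvars needed back in dyn
        do k=1,nz                 ! Vertical
!ACC$DO PARALLEL (1)
          do c=1,nchunks
!ACC$DO VECTOR (1)
            do i=1,chunksize
              ipn = ips + (c-1)*chunksize + (i-1)
              if (ipn <= ipe) then   ! skip padding beyond the last column
                dynvars(k,ipn,n) = physvars(i,c,k,n)
              end if
            end do
          end do
        end do
      end do
!ACC$REGION END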
Transpose Performance in NIM
(performance figure not reproduced)
CPU code runs fast
• Used PAPI to count flops (Intel compiler)
• Requires -O1 (no vectorization) for the counts to be accurate!
• A 2nd run with -O3 (vectorization) gives the wallclock time
• Result: 27% of peak on a 2.8 GHz Westmere (see the PAPI sketch below)
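A minimal sketch of this kind of flop measurement using PAPIF_flops from PAPI's Fortran interface; the loop and array are illustrative, and the include-file name may vary with the PAPI installation. Per the slide, the counting run should be compiled at -O1 so the hardware counts match the source-level flop count:

      program count_flops
        implicit none
        include 'f90papi.h'
        integer, parameter :: n = 1000000
        real    :: a(n), b(n)
        real(kind=4)    :: rtime, ptime, mflops
        integer(kind=8) :: flpops
        integer :: check, i
        b = 1.0
        call PAPIF_flops (rtime, ptime, flpops, mflops, check)  ! start counters
        do i=1,n
          a(i) = 2.0*b(i) + 1.0   ! 2 flops per iteration
        end do
        call PAPIF_flops (rtime, ptime, flpops, mflops, check)  ! read counters
        print *, 'flops:', flpops, ' MFLOP/s:', mflops, ' check:', a(n)
      end program count_flops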
NIM Scaling
(scaling figure not reproduced)
Current Software Status
• Full dynamics runs on CPU or GPU
• ~5X socket-to-socket speedup on the GPU
• GPU solution judged reasonable: output field diffs vs. the CPU are comparable to those produced by a rounding-level perturbation
• Dummy physics can run on CPU or GPU
• Single-source (see the sketch below)
• GPU directives ignored in CPU mode
• NO constructs that look like:

#ifdef GPU
  <do this>
#else
  <do that>
#endif
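A minimal sketch of the single-source idea; the subroutine and names are illustrative, not NIM code. Because the !ACC$ directives are Fortran comments, a standard CPU compiler ignores them and compiles the plain loops, while the GPU build interprets them, so no #ifdef branching is needed:

      subroutine add_offset (nip, val, fld)
        integer, intent(in)    :: nip
        real,    intent(in)    :: val
        real,    intent(inout) :: fld(nip)
        integer, parameter :: chunksize = 128
        integer :: nchunks, c, i, ipn
        nchunks = (nip + chunksize - 1) / chunksize
!ACC$REGION(<chunksize>,<nchunks>,<fld:none>) BEGIN
!ACC$DO PARALLEL (1)
        do c=1,nchunks            ! chunks -> blocks on GPU; plain loop on CPU
!ACC$DO VECTOR (1)
          do i=1,chunksize        ! within-chunk index -> threads on GPU
            ipn = (c-1)*chunksize + i
            if (ipn <= nip) fld(ipn) = fld(ipn) + val
          end do
        end do
!ACC$REGION END
      end subroutine add_offset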