
A GPU capable version of the COSMO weather prediction model


Presentation Transcript


A GPU capable version of the COSMO weather prediction model

O. Fuhrer, T. Gysi, X. Lapillonne, C. Osuna, T. Diamanti, M. Bianco, T. Schulthess
Contact: xavier.lapillonne@ethz.env.ch

The HP2C OPCODE project
• Part of the Swiss High Performance High Productivity (HP2C) initiative.
• Prototype implementation of the COSMO production suite of MeteoSwiss making aggressive use of GPU technology.
• Goal: the same time-to-solution on substantially cheaper hardware.
[Figure: GPU-based hardware compared with one Cray XE5 cabinet of 144 CPUs with 12 cores each (1728 cores); improvement factors of x3 and x5.]

The COSMO model
• Limited-area model.
• Run operationally by several national weather services within the Consortium for Small-Scale Modelling: Germany, Switzerland, Italy, Poland, Greece, Romania, Russia.
• Used for climate research at several academic institutions.

GPUs for scientific computing
• GPUs are an interesting option as they can provide significantly larger peak performance and memory bandwidth than CPUs for comparable energy consumption and price.

The MeteoSwiss production suite
• IFS @ ECMWF: lateral boundaries 4x/day, 16 km, 91 levels.
• COSMO-7: 72 h forecast 3x/day, 6.6 km, 60 levels.
• COSMO-2: 33 h forecast 8x/day, 2.2 km, 60 levels, 520 x 350 x 60 grid points.

Implementation strategy: full port to GPU
• One key aspect of GPU computing is that the host CPU and the GPU have separate memories. Data transfers between them may significantly reduce performance, and it was therefore decided to port the entire code so that large GPU-CPU data movements are only required for initialization and I/O (a minimal sketch of this approach is given below).
• GPU implementation:
  - Dynamics: rewritten using a Domain Specific Embedded Language (DSEL).
  - Physical parametrizations and data assimilation: OpenACC compiler directives.
  - Communication: the GCL library has been developed, which is able to handle GPU-GPU communication.

The OpenACC directives for accelerators
• OpenACC is an open standard, currently supported by three compiler vendors: PGI, Cray and CAPS.

    !$acc data create(a,b,c)      ! allocate a, b, c on the GPU
    !$acc update device(b,c)      ! copy input data to the GPU
    !$acc parallel
    !$acc loop gang vector
    do i = 1, N
       a(i) = b(i) + c(i)
    end do
    !$acc end parallel
    !$acc update host(a)          ! copy the result back
    !$acc end data

• Directives are used for the physics, the data assimilation and some other parts of the time loop.
• Most loop structures remain unchanged; directives are only added to the existing code.
• For the most time-consuming parts, specific optimizations have been introduced (sketched below):
  - loop restructuring to reduce the overhead of multiple kernel calls,
  - replacement of precomputed quantities with on-the-fly computation,
  - removal of Fortran automatic arrays in frequently called subroutines (to avoid calls to malloc).

COSMO workflow
• Input/Setup: keep on CPU / copy to GPU.
• Initialize timestep: GPU, OpenACC directives.
• Physics: GPU, OpenACC directives.
• Dynamics: GPU, rewrite (uses CUDA).
• Halo-update: GPU-GPU communication library (GCL); a generic halo-exchange sketch is given below.
• Assimilation: GPU, OpenACC directives (part on CPU).
• Diagnostics: OpenACC directives.
• Output: keep on CPU / copy from GPU (not every time step).
• Finalize.

Domain Specific Embedded Language
• The dynamics is rewritten in C++ using a DSEL approach, which makes it possible to:
  - split the loop logic from the stencil definition (a rough Fortran analogue is sketched below),
  - implement stencils in a platform-independent way,
  - hide optimizations from the user.
• Example of a stencil as written in the original Fortran code (a horizontal Laplacian); in the DSEL, only the update function is written by the user, while the loop logic is provided by the library:

    DO k = 1, ke
       DO j = jstart, jend
          DO i = istart, iend
             lap(i,j,k) = -4.0 * data(i,j,k)            &
                        + data(i+1,j,k) + data(i-1,j,k) &
                        + data(i,j+1,k) + data(i,j-1,k)
          ENDDO
       ENDDO
    ENDDO

• Optimized GPU and CPU backends have been developed.
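To illustrate the full-port strategy described above, here is a minimal, self-contained sketch. All field names, sizes and the update itself are purely illustrative (none of this is COSMO code): data is copied to the GPU once, the whole time loop runs on the device, and fields return to the host only when output is needed.

    ! Minimal sketch of the full-port strategy (illustrative, not COSMO
    ! code): fields stay resident on the GPU for the whole run; large
    ! CPU-GPU transfers happen only at initialization and for output.
    program gpu_resident_sketch
      implicit none
      integer, parameter :: nx = 128, ny = 128, nz = 60, nsteps = 100
      real    :: t(nx,ny,nz), u(nx,ny,nz)
      integer :: step, i, j, k

      t = 280.0                           ! initialization on the CPU
      u = 10.0

      !$acc data copyin(t,u)              ! one large transfer at startup
      do step = 1, nsteps
         !$acc parallel loop collapse(3)  ! the time step runs on the GPU
         do k = 1, nz
            do j = 1, ny
               do i = 1, nx
                  t(i,j,k) = t(i,j,k) + 0.01 * u(i,j,k)
               end do
            end do
         end do
         if (mod(step, 50) == 0) then
            !$acc update host(t)          ! copy back only for output
            print *, 'step', step, ' mean t =', sum(t) / size(t)
         end if
      end do
      !$acc end data
    end program gpu_resident_sketch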
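The kernel-level optimizations listed for the OpenACC parts can likewise be sketched. The routine below is hypothetical and only demonstrates two of the techniques: fusing loops to avoid repeated kernel launches, and computing a quantity on the fly in a scalar instead of storing it in an automatic array.

    ! Hypothetical example (not COSMO code) of two OpenACC optimizations:
    ! loop fusion to cut kernel-launch overhead, and replacing a Fortran
    ! automatic array (malloc'ed on every call) with an on-the-fly scalar.
    subroutine fused_kernel(nx, a, b, c)
      implicit none
      integer, intent(in)    :: nx
      real,    intent(in)    :: a(nx), b(nx)
      real,    intent(inout) :: c(nx)
      integer :: i
      real    :: tmp           ! scalar instead of an automatic tmp(nx)

      ! Before: loop 1 filled tmp(:), loop 2 consumed it -> two kernels
      ! plus one allocation per call. After: a single fused kernel.
      !$acc parallel loop present(a,b,c) private(tmp)
      do i = 1, nx
         tmp  = a(i) * b(i)    ! previously precomputed into tmp(i)
         c(i) = c(i) + sqrt(tmp)
      end do
    end subroutine fused_kernel

The present clause assumes the arrays already live on the GPU, consistent with the data-resident strategy sketched above.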
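The GCL library itself is written in C++ and its interface does not appear on the poster. Purely to illustrate what a GPU-GPU halo update does, here is a generic OpenACC/MPI sketch that assumes a GPU-aware MPI implementation and a one-point halo in the second dimension; all names are hypothetical and this is not the GCL API.

    ! Generic halo-update sketch (NOT the GCL API): with a GPU-aware MPI
    ! library, host_data exposes device pointers to MPI, so halo rows
    ! move directly between GPUs without staging through host memory.
    subroutine halo_update_y(field, nx, ny, south, north, comm)
      use mpi
      implicit none
      integer, intent(in)    :: nx, ny, south, north, comm
      real,    intent(inout) :: field(nx, 0:ny+1)   ! one-point halo in y
      integer :: ierr, stat(MPI_STATUS_SIZE)

      !$acc host_data use_device(field)
      ! send northernmost inner row, receive the southern halo row
      call MPI_Sendrecv(field(1,ny), nx, MPI_REAL, north, 0, &
                        field(1,0),  nx, MPI_REAL, south, 0, &
                        comm, stat, ierr)
      ! send southernmost inner row, receive the northern halo row
      call MPI_Sendrecv(field(1,1),    nx, MPI_REAL, south, 1, &
                        field(1,ny+1), nx, MPI_REAL, north, 1, &
                        comm, stat, ierr)
      !$acc end host_data
    end subroutine halo_update_y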
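The DSEL used for the dynamics is written in C++ and is not reproduced on the poster. As a rough, language-shifted illustration of its central idea, separating the loop logic from the stencil definition, here is a Fortran sketch with all names hypothetical.

    ! Rough Fortran analogue of the DSEL idea (the real DSEL is C++ and
    ! is not shown here): the loop logic is written once in
    ! apply_stencil, and each stencil is defined only by its update
    ! function.
    module stencil_sketch
      implicit none
      abstract interface
         real function update_fn(f, i, j, k)
           real,    intent(in) :: f(:,:,:)
           integer, intent(in) :: i, j, k
         end function update_fn
      end interface
    contains
      ! "loop logic": written once; could hide CPU/GPU backends
      subroutine apply_stencil(fn, fin, fout)
        procedure(update_fn) :: fn
        real, intent(in)     :: fin(:,:,:)
        real, intent(out)    :: fout(:,:,:)
        integer :: i, j, k
        do k = 1, size(fin,3)
           do j = 2, size(fin,2) - 1
              do i = 2, size(fin,1) - 1
                 fout(i,j,k) = fn(fin, i, j, k)
              end do
           end do
        end do
      end subroutine apply_stencil

      ! "stencil definition": the Laplacian from the example above
      real function lap_fn(f, i, j, k)
        real,    intent(in) :: f(:,:,:)
        integer, intent(in) :: i, j, k
        lap_fn = -4.0*f(i,j,k) + f(i+1,j,k) + f(i-1,j,k) &
               + f(i,j+1,k) + f(i,j-1,k)
      end function lap_fn
    end module stencil_sketch

A call such as call apply_stencil(lap_fn, data, lap) then replaces the hand-written triple loop of the example above; in the real DSEL this same separation is what lets the optimized CPU and GPU backends be swapped in without changing the stencil definitions.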
Results
• Test domain 128 x 128 x 60 grid points. CPU: 16-core Interlagos; GPU: X2090.

Dynamics
• CPU / OpenMP backend: a factor 1.6x-1.7x faster than the original COSMO dynamics.
• No explicit use of vector instructions (a further 10% up to 30% improvement is possible).
• GPU / CUDA backend: roughly a factor 2.6x faster than the Interlagos CPU (52 GB/s memory bandwidth).

Physics
• The physics is tested using two different compilers.
• Overall speedup: x3.6.

Integrated code
• The GPU-enabled versions of the dynamics and the physics have been integrated with the other parts of the code.
• Performance results for the integrated code, without data assimilation and diagnostics, show an overall speedup factor of x3.
