
A GPU capable version of the COSMO weather prediction model


Presentation Transcript


A GPU capable version of the COSMO weather prediction model

O. Fuhrer, T. Gysi, X. Lapillonne, C. Osuna, T. Diamanti, M. Bianco, T. Schulthess
Contact: xavier.lapillonne@ethz.env.ch

The HP2C OPCODE project
• Part of the Swiss High Performance High Productivity (HP2C) initiative.
• Prototype implementation of the COSMO production suite of MeteoSwiss making aggressive use of GPU technology.
• Goal: the same time-to-solution on substantially cheaper hardware.
[Figure: GPU-based hardware compared with one Cray XE5 cabinet of 144 CPUs with 12 cores each (1728 cores); improvement factors of x3 and x5.]

The COSMO model
• Limited-area model.
• Run operationally by several national weather services within the Consortium for Small-Scale Modelling: Germany, Switzerland, Italy, Poland, Greece, Romania, Russia.
• Used for climate research at several academic institutions.

GPUs for scientific computing
• GPUs are an interesting option as they can provide significantly larger peak performance and memory bandwidth than CPUs for comparable energy consumption and price.

The MeteoSwiss production suite
• IFS @ ECMWF: lateral boundaries 4x/day, 16 km, 91 levels.
• COSMO-7: 72 h forecast 3x/day, 6.6 km, 60 levels.
• COSMO-2: 33 h forecast 8x/day, 2.2 km, 60 levels, 520 x 350 x 60 grid points.

Implementation strategy: full port to GPU
• One key aspect of GPU computing is that the host CPU and the GPU have separate memories. Data transfers between them may significantly reduce performance, and it was therefore decided to port the entire code so that large GPU-CPU data movements are only required for initialization and I/O (a minimal sketch of this approach is given below).
• GPU implementation:
  - Dynamics: rewritten using a Domain Specific Embedded Language (DSEL).
  - Physical parametrizations and data assimilation: OpenACC compiler directives.
  - Communication: the GCL library has been developed, which is able to handle GPU-GPU communication.

The OpenACC directives for accelerators
• OpenACC is an open standard, currently supported by three compiler vendors: PGI, Cray and CAPS.

    !$acc data create(a,b,c)      ! allocate a, b, c on the GPU
    !$acc update device(b,c)      ! copy input data to the GPU
    !$acc parallel
    !$acc loop gang vector
    do i = 1, N
       a(i) = b(i) + c(i)
    end do
    !$acc end parallel
    !$acc update host(a)          ! copy the result back
    !$acc end data

• Directives are used for the physics, the data assimilation and some other parts of the time loop.
• Most loop structures remain unchanged; directives are only added to the existing code.
• For the most time-consuming parts, specific optimizations have been introduced (sketched below):
  - loop restructuring to reduce the overhead of multiple kernel calls,
  - replacement of precomputed quantities with on-the-fly computation,
  - removal of Fortran automatic arrays in frequently called subroutines (to avoid calls to malloc).

COSMO workflow
• Input/Setup: keep on CPU / copy to GPU.
• Initialize timestep: GPU, OpenACC directives.
• Physics: GPU, OpenACC directives.
• Dynamics: GPU, rewrite (uses CUDA).
• Halo-update: GPU-GPU communication library (GCL); a generic halo-exchange sketch is given below.
• Assimilation: GPU, OpenACC directives (part on CPU).
• Diagnostics: OpenACC directives.
• Output: keep on CPU / copy from GPU (not every time step).
• Finalize.

Domain Specific Embedded Language
• The dynamics is rewritten in C++ using a DSEL approach, which makes it possible to:
  - split the loop logic from the stencil definition (a rough Fortran analogue is sketched below),
  - implement stencils in a platform-independent way,
  - hide optimizations from the user.
• Example of a stencil as written in the original Fortran code (a horizontal Laplacian); in the DSEL, only the update function is written by the user, while the loop logic is provided by the library:

    DO k = 1, ke
       DO j = jstart, jend
          DO i = istart, iend
             lap(i,j,k) = -4.0 * data(i,j,k)            &
                        + data(i+1,j,k) + data(i-1,j,k) &
                        + data(i,j+1,k) + data(i,j-1,k)
          ENDDO
       ENDDO
    ENDDO

• Optimized GPU and CPU backends have been developed.
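To illustrate the full-port strategy described above, here is a minimal, self-contained sketch. All field names, sizes and the update itself are purely illustrative (none of this is COSMO code): data is copied to the GPU once, the whole time loop runs on the device, and fields return to the host only when output is needed.

    ! Minimal sketch of the full-port strategy (illustrative, not COSMO
    ! code): fields stay resident on the GPU for the whole run; large
    ! CPU-GPU transfers happen only at initialization and for output.
    program gpu_resident_sketch
      implicit none
      integer, parameter :: nx = 128, ny = 128, nz = 60, nsteps = 100
      real    :: t(nx,ny,nz), u(nx,ny,nz)
      integer :: step, i, j, k

      t = 280.0                           ! initialization on the CPU
      u = 10.0

      !$acc data copyin(t,u)              ! one large transfer at startup
      do step = 1, nsteps
         !$acc parallel loop collapse(3)  ! the time step runs on the GPU
         do k = 1, nz
            do j = 1, ny
               do i = 1, nx
                  t(i,j,k) = t(i,j,k) + 0.01 * u(i,j,k)
               end do
            end do
         end do
         if (mod(step, 50) == 0) then
            !$acc update host(t)          ! copy back only for output
            print *, 'step', step, ' mean t =', sum(t) / size(t)
         end if
      end do
      !$acc end data
    end program gpu_resident_sketch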
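The kernel-level optimizations listed for the OpenACC parts can likewise be sketched. The routine below is hypothetical and only demonstrates two of the techniques: fusing loops to avoid repeated kernel launches, and computing a quantity on the fly in a scalar instead of storing it in an automatic array.

    ! Hypothetical example (not COSMO code) of two OpenACC optimizations:
    ! loop fusion to cut kernel-launch overhead, and replacing a Fortran
    ! automatic array (malloc'ed on every call) with an on-the-fly scalar.
    subroutine fused_kernel(nx, a, b, c)
      implicit none
      integer, intent(in)    :: nx
      real,    intent(in)    :: a(nx), b(nx)
      real,    intent(inout) :: c(nx)
      integer :: i
      real    :: tmp           ! scalar instead of an automatic tmp(nx)

      ! Before: loop 1 filled tmp(:), loop 2 consumed it -> two kernels
      ! plus one allocation per call. After: a single fused kernel.
      !$acc parallel loop present(a,b,c) private(tmp)
      do i = 1, nx
         tmp  = a(i) * b(i)    ! previously precomputed into tmp(i)
         c(i) = c(i) + sqrt(tmp)
      end do
    end subroutine fused_kernel

The present clause assumes the arrays already live on the GPU, consistent with the data-resident strategy sketched above.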
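The GCL library itself is written in C++ and its interface does not appear on the poster. Purely to illustrate what a GPU-GPU halo update does, here is a generic OpenACC/MPI sketch that assumes a GPU-aware MPI implementation and a one-point halo in the second dimension; all names are hypothetical and this is not the GCL API.

    ! Generic halo-update sketch (NOT the GCL API): with a GPU-aware MPI
    ! library, host_data exposes device pointers to MPI, so halo rows
    ! move directly between GPUs without staging through host memory.
    subroutine halo_update_y(field, nx, ny, south, north, comm)
      use mpi
      implicit none
      integer, intent(in)    :: nx, ny, south, north, comm
      real,    intent(inout) :: field(nx, 0:ny+1)   ! one-point halo in y
      integer :: ierr, stat(MPI_STATUS_SIZE)

      !$acc host_data use_device(field)
      ! send northernmost inner row, receive the southern halo row
      call MPI_Sendrecv(field(1,ny), nx, MPI_REAL, north, 0, &
                        field(1,0),  nx, MPI_REAL, south, 0, &
                        comm, stat, ierr)
      ! send southernmost inner row, receive the northern halo row
      call MPI_Sendrecv(field(1,1),    nx, MPI_REAL, south, 1, &
                        field(1,ny+1), nx, MPI_REAL, north, 1, &
                        comm, stat, ierr)
      !$acc end host_data
    end subroutine halo_update_y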
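The DSEL used for the dynamics is written in C++ and is not reproduced on the poster. As a rough, language-shifted illustration of its central idea, separating the loop logic from the stencil definition, here is a Fortran sketch with all names hypothetical.

    ! Rough Fortran analogue of the DSEL idea (the real DSEL is C++ and
    ! is not shown here): the loop logic is written once in
    ! apply_stencil, and each stencil is defined only by its update
    ! function.
    module stencil_sketch
      implicit none
      abstract interface
         real function update_fn(f, i, j, k)
           real,    intent(in) :: f(:,:,:)
           integer, intent(in) :: i, j, k
         end function update_fn
      end interface
    contains
      ! "loop logic": written once; could hide CPU/GPU backends
      subroutine apply_stencil(fn, fin, fout)
        procedure(update_fn) :: fn
        real, intent(in)     :: fin(:,:,:)
        real, intent(out)    :: fout(:,:,:)
        integer :: i, j, k
        do k = 1, size(fin,3)
           do j = 2, size(fin,2) - 1
              do i = 2, size(fin,1) - 1
                 fout(i,j,k) = fn(fin, i, j, k)
              end do
           end do
        end do
      end subroutine apply_stencil

      ! "stencil definition": the Laplacian from the example above
      real function lap_fn(f, i, j, k)
        real,    intent(in) :: f(:,:,:)
        integer, intent(in) :: i, j, k
        lap_fn = -4.0*f(i,j,k) + f(i+1,j,k) + f(i-1,j,k) &
               + f(i,j+1,k) + f(i,j-1,k)
      end function lap_fn
    end module stencil_sketch

A call such as call apply_stencil(lap_fn, data, lap) then replaces the hand-written triple loop of the example above; in the real DSEL this same separation is what lets the optimized CPU and GPU backends be swapped in without changing the stencil definitions.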
Results
• Test domain 128 x 128 x 60 grid points. CPU: 16-core Interlagos; GPU: X2090.

Dynamics
• CPU / OpenMP backend: a factor 1.6x-1.7x faster than the original COSMO dynamics.
• No explicit use of vector instructions (a further 10% up to 30% improvement is possible).
• GPU / CUDA backend: roughly a factor 2.6x faster than the Interlagos CPU (52 GB/s memory bandwidth).

Physics
• The physics is tested using two different compilers.
• Overall speedup: x3.6.

Integrated code
• The GPU-enabled versions of the dynamics and the physics have been integrated with the other parts of the code.
• Performance results for the integrated code, without data assimilation and diagnostics, show an overall speedup factor of x3.
