
Pushing WRF To Its Computational Limits


Presentation Transcript


  1. Pushing WRF To Its Computational Limits Don Morton, Oralee Nudson, Don Bahls, Greg Newby Arctic Region Supercomputing Center

  2. Acknowledgements • Peter Johnsen, Cray, Inc. • John Michalakes, NCAR • National Institute for Computational Sciences (University of Tennessee) for use of kraken • High Performance Computing Modernization Office (DoD) for use of einstein • Arctic Region Supercomputing Center

  3. Motivation • Insatiable need for • Higher resolution • Larger domains • More realistic physics and dynamics

  4. Image obtained from http://pafg.arh.noaa.gov/zones/zoomd/AKZ222.jpg

  5. The computational cost of high resolution • Refining the resolution by a factor of 3 • Increases number of grid points by a factor of 9 • Requires three times as many timesteps • Going from 9km to 3km resolution requires at least 27 times more work, and at least 9 times additional memory and storage

  6. The computational cost of high resolution • Consider moving from a 9km to a 1km domain • A forecast that took one hour will now take at least 729 hours (a month) • A forecast that used 4 Gbytes of memory and 100 Gbytes of disc storage will now require at least 324 Gbytes of memory and 8,100 Gbytes (8.1 Tbytes) of disc storage.
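A quick check of these figures, assuming the scaling rule from the previous slide (horizontal grid points grow with the square of the refinement factor, and the timestep count grows linearly with it):

   # refinement factor 9 (9 km -> 1 km), using bc for the arithmetic
   echo "9^2 * 9"   | bc   # work grows 729x
   echo "4 * 9^2"   | bc   # memory: 324 GBytes
   echo "100 * 9^2" | bc   # disc storage: 8,100 GBytes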

  7. Motivation • Vision – computer models are a primary tool for understanding and experimenting with complex systems • Computer scientists – push the limits to facilitate super high resolution and large-scale runs • Atmospheric scientists – keep up with the issues posed by finer resolutions

  8. Our Aim • Maintain a repository of data and information for • Gauging performance of WRF on wide variety of architectures • Understanding the limits of WRF and methods for getting around some of these • Push WRF and HPC to the limits in preparation for the next generation of simulations

  9. Background • WRF V3 Parallel Benchmark Page (Eldred & Michalakes) • 25km and 12.5km • Provides • wrfrst_d01_2001-10-25_00_00_00 • wrfbdy_d01 • namelist.input • Reference wrfout

  10. Background • Nature Run (Michalakes et al.) • Idealized high resolution rotating fluid on hemisphere • 4486x4486x101 (2 billion) cells • Run on more than 15k cores of Blue Gene • “...worked through issues of parallel I/O and scalability and employed more processors than have ever been used in a WRF run...” [J. Michalakes et al., “WRF nature run,” J. Phys.: Conf. Ser., 2008]

  11. ARSC WRF Benchmarking Suite • Standard 6075x6075km domain centered on Frank Williams’ office • Multiple resolutions to support full range of benchmarking needs

  12. WRF Benchmarking Suite • For a given test case, we provide • WRF restart file • Lateral boundary conditions file • namelist.input set up for 3 forecast hours • A wrfout reference file for comparison
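A minimal sketch of how one of these test cases might be exercised, assuming a prebuilt wrf.exe and the Cray aprun launcher used elsewhere in this talk (the paths and core count are illustrative, not the actual benchmark configuration):

   # stage the provided inputs next to wrf.exe and restart the 3-hour forecast
   cp wrfrst_d01_* wrfbdy_d01 namelist.input $RUNDIR/
   cd $RUNDIR
   aprun -n 256 ./wrf.exe
   # compare the resulting wrfout file against the supplied reference,
   # e.g. with WRF's diffwrf utility or any netCDF comparison tool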

  13. WRF Benchmarking Suite • Some preliminary experiments

  14. Basic Scalability with MPI

  15. Node-loading Analysis

  16. Hybrid WRF MPI/OpenMP • Support for hybrid distributed and shared memory computations • Domain decomposed into patches assigned to MPI processes (message passing) • Patches further decomposed into tiles assigned to OpenMP threads (shared memory) [Diagram: one patch decomposed into 8 tiles, with interprocess communication between patches]

  17. Running Hybrid WRF on the XT5 • With PBSPro, allocate MPI tasks and threads. For example, to run 8 MPI processes, two on each node, with 4 threads assigned to each MPI task:

      #PBS -l mppwidth=8
      #PBS -l mppnppn=2
      #PBS -l mppdepth=4
      export OMP_NUM_THREADS=4
      aprun -n8 -N2 -d4 ./wrf-hybrid.exe

  • This gives us four nodes, each with 2 MPI processes, each process running 4 threads, for a total of 32 threads. [Diagram: MPI processes and OpenMP threads laid out across the four nodes]

  18. Thread Scalability on a Single Node

  19. Hybrid vs MPI Performance • 128 PEs • MPI – 294 seconds • Hybrid – 490 seconds

  20. Task/Thread Decomposition Analysis [Chart comparing runs with 8, 4, and 2 threads per MPI task]

  21. First problems with large-scale runs • Task 0 exhibits larger memory requirements relative to other tasks • Example • 4000x4000x28 = 448 million points • 75 quad-core nodes (300 cores), each with 8 GBytes memory • Problem decomposes into 300 subdomains of approximately 1.9 GBytes each, or 7.6 GB per node – it just might fit, BUT... • Task 0 needs an additional 1.8 GB for global buffer, so Node 0 now needs 9.4 GBytes
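The slide's figures can be reproduced with a quick back-of-envelope check (the 1.9 GB per subdomain and the 1.8 GB global buffer for task 0 are taken from the slide, not derived here):

   echo "4000 * 4000 * 28" | bc         # 448000000 grid points
   echo "scale=1; 4 * 1.9" | bc         # 7.6 GB for the 4 tasks on an 8 GB node
   echo "scale=1; 4 * 1.9 + 1.8" | bc   # 9.4 GB on the node holding task 0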

  22. First problems with large-scale runs - Resolutions • Allocate more nodes • Resolve through clever core/node allocations (don’t use all the cores in a node) • Parallel I/O • Resolves some of these issues • Reduces I/O bottlenecks
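The "clever core/node allocations" above amount to under-subscribing nodes so each MPI task gets a larger share of node memory. A hedged sketch in the same PBSPro/aprun style as the hybrid example earlier (task and core counts are illustrative, echoing the kraken runs described on slide 26):

   # run 1000 MPI tasks but place only 4 per 12-core node, leaving cores idle
   # so each task sees more of the node's 16 GBytes; the scheduler still
   # charges for whole nodes (1000 tasks -> 3000 reserved cores)
   #PBS -l mppwidth=1000
   #PBS -l mppnppn=4
   aprun -n 1000 -N 4 ./wrf.exe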

  23. Tackling the 1km, 1 Gigapoint run – getting started • real.exe and ...would go off top... message • Trying to discern whether numerical or memory issue • Experimenting on huge datasets is time and resource intensive • Finally narrowed down to “size” problem • High-numbered tasks had unrealistic pressures • Traced to MPI_Scatter() call with a 32-bit argument, trying to pass array offsets larger than 2^32

  24. I/O Considerations • NetCDF compiled for large memory • Default I/O – task 0 performs all I/O, so ALL data passes through task 0 • Split NetCDF files – each task accesses its own NetCDF file – perfect parallel I/O • pNetCDF – parallel library that facilitates simultaneous access by multiple tasks to a single, standard NetCDF file • Parallel file systems (e.g. lustre) – single file is partitioned (striped) across numerous discs.
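Which scheme WRF uses is selected in namelist.input. A hedged illustration; the io_form values in the comments follow the usual WRF convention (2 for netCDF, 11 for pNetCDF, and +100 for the split one-file-per-task variants) but should be verified against the Registry of the WRF version in use:

   grep io_form namelist.input
   #   io_form_history = 11,    pNetCDF: all tasks write one shared history file
   #   io_form_restart = 102,   split netCDF: each task writes its own restart file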

  25. Continuing Gigapoint Woes • Finally using parallel I/O (pNetCDF) • Could not generate restart file due to memory issue (2Gbyte message in rsl.error.0000) • We “could” generate these by using WRF’s split NetCDF approach • Don Bahls recognized that pNetCDF needed special compilation for large-memory

  26. Looking Rosier • Finally able to run on 1,500 cores • Nodes (kraken has 16 GBytes per 12-core node) were running out of memory, so we used only 2 cores per 6-core processor • A 1,000-core job required allocation of 3,000 cores • Still need to experiment

  27. Continuing Woes • Somewhere between 1,500 and 2,000 cores – lots of MPI errors • Object Storage Targets (OSTs) were getting overloaded • Default striping was 4 OSTs • Pete Johnsen recommended a number of OSTs between ½·sqrt(numCores) and sqrt(numCores) • 4,000 cores – used 32 OSTs • 10,000 cores – used 100 OSTs
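On a Lustre file system the stripe count can be raised ahead of the run with lfs setstripe; a short sketch following the rule of thumb above (the directory path is illustrative):

   # stripe the WRF output directory across 32 OSTs before a 4,000-core run
   lfs setstripe -c 32 /lustre/scratch/wrf_run
   lfs getstripe /lustre/scratch/wrf_run    # confirm the new striping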

  28. Success • Got up to 10,000 cores running (had to allocate 30,000) • Preliminary timings – time required for the third timestep

  29. Analysis of Various I/O Schemes on Pingo • Oralee Nudson did a series of test runs on ARSC’s Cray XT5 to assess performance of I/O on a medium-scale (3km resolution) benchmark case

  30. Closing Thoughts • File formats • Ultimately want higher resolution – physics problems? • Many other challenges – post-processing of huge and/or decomposed files, hard-coded aspects (rsl.out.9999) • Many heartfelt thanks to those who helped!
