
Pushing WRF To Its Computational Limits


Presentation Transcript


  1. Pushing WRF To Its Computational Limits Don Morton, Oralee Nudson, Don Bahls, Greg Newby Arctic Region Supercomputing Center

  2. Acknowledgements • Peter Johnsen, Cray, Inc. • John Michalakes, NCAR • National Institute for Computational Sciences (University of Tennessee) for use of kraken • High Performance Computing Modernization Office (DoD) for use of einstein • Arctic Region Supercomputing Center

  3. Motivation • Insatiable need for • Higher resolution • Larger domains • More realistic physics and dynamics

  4. Image obtained from http://pafg.arh.noaa.gov/zones/zoomd/AKZ222.jpg

  5. The computational cost of high resolution • Refining the resolution by a factor of 3 • Increases number of grid points by a factor of 9 • Requires three times as many timesteps • Going from 9km to 3km resolution requires at least 27 times more work, and at least 9 times additional memory and storage

  6. The computational cost of high resolution • Consider moving from a 9km to a 1km domain • A forecast that took one hour will now take at least 729 hours (a month) • A forecast that used 4 Gbytes of memory and 100 Gbytes of disc storage will now require at least 324 Gbytes of memory and 8,100 Gbytes (8.1 Tbytes) of disc storage.
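A quick check of these figures, assuming the scaling rule from the previous slide (horizontal grid points grow with the square of the refinement factor, and the timestep count grows linearly with it):

   # refinement factor 9 (9 km -> 1 km), using bc for the arithmetic
   echo "9^2 * 9"   | bc   # work grows 729x
   echo "4 * 9^2"   | bc   # memory: 324 GBytes
   echo "100 * 9^2" | bc   # disc storage: 8,100 GBytes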

  7. Motivation • Vision – computer models are a primary tool for understanding and experimenting with complex systems • Computer scientists – push the limits to facilitate super high resolution and large-scale runs • Atmospheric scientists – keep up with the issues posed by finer resolutions

  8. Our Aim • Maintain a repository of data and information for • Gauging performance of WRF on wide variety of architectures • Understanding the limits of WRF and methods for getting around some of these • Push WRF and HPC to the limits in preparation for the next generation of simulations

  9. Background • WRF V3 Parallel Benchmark Page (Eldred & Michalakes) • 25km and 12.5km • Provides • wrfrst_d01_2001-10-25_00_00_00 • wrfbdy_d01 • namelist.input • Reference wrfout

  10. Background • Nature Run (Michalakes et al.) • Idealized high resolution rotating fluid on hemisphere • 4486x4486x101 (2 billion) cells • Run on more than 15k cores of Blue Gene • “...worked through issues of parallel I/O and scalability and employed more processors than have ever been used in a WRF run...” [J. Michalakes et al., “WRF nature run,” J. Phys.: Conf. Ser., 2008]

  11. ARSC WRF Benchmarking Suite • Standard 6075x6075km domain centered on Frank Williams’ office • Multiple resolutions to support full range of benchmarking needs

  12. WRF Benchmarking Suite • For a given test case, we provide • WRF restart file • Lateral boundary conditions file • namelist.input set up for 3 forecast hours • A wrfout reference file for comparison
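A minimal sketch of how one of these test cases might be exercised, assuming a prebuilt wrf.exe and the Cray aprun launcher used elsewhere in this talk (the paths and core count are illustrative, not the actual benchmark configuration):

   # stage the provided inputs next to wrf.exe and restart the 3-hour forecast
   cp wrfrst_d01_* wrfbdy_d01 namelist.input $RUNDIR/
   cd $RUNDIR
   aprun -n 256 ./wrf.exe
   # compare the resulting wrfout file against the supplied reference,
   # e.g. with WRF's diffwrf utility or any netCDF comparison tool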

  13. WRF Benchmarking Suite • Some preliminary experiments

  14. Basic Scalability with MPI

  15. Node-loading Analysis

  16. Hybrid WRF MPI/OpenMP • Support for hybrid distributed and shared memory computations • Domain decomposed into patches assigned to MPI processes (message passing) • Patches further decomposed into tiles assigned to OpenMP threads (shared memory) [Diagram: one patch decomposed into 8 tiles, with interprocess communication between patches]

  17. Running Hybrid WRF on the XT5 • With PBSPro, allocate MPI tasks and threads. For example, to run 8 MPI processes, two on each node, with 4 threads assigned to each MPI task:

      #PBS -l mppwidth=8
      #PBS -l mppnppn=2
      #PBS -l mppdepth=4
      export OMP_NUM_THREADS=4
      aprun -n8 -N2 -d4 ./wrf-hybrid.exe

  • This gives us four nodes, each with 2 MPI processes, each process running 4 threads, for a total of 32 threads. [Diagram: MPI processes and OpenMP threads laid out across the four nodes]

  18. Thread Scalability on a Single Node

  19. Hybrid vs MPI Performance • 128 PEs • MPI – 294 seconds • Hybrid – 490 seconds

  20. Task/Thread Decomposition Analysis [Chart comparing runs with 8, 4, and 2 threads per MPI task]

  21. First problems with large-scale runs • Task 0 exhibits larger memory requirements relative to other tasks • Example • 4000x4000x28 = 448 million points • 75 quad-core nodes (300 cores), each with 8 GBytes memory • Problem decomposes into 300 subdomains of approximately 1.9 GBytes each, or 7.6 GB per node – it just might fit, BUT... • Task 0 needs an additional 1.8 GB for global buffer, so Node 0 now needs 9.4 GBytes
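The slide's figures can be reproduced with a quick back-of-envelope check (the 1.9 GB per subdomain and the 1.8 GB global buffer for task 0 are taken from the slide, not derived here):

   echo "4000 * 4000 * 28" | bc         # 448000000 grid points
   echo "scale=1; 4 * 1.9" | bc         # 7.6 GB for the 4 tasks on an 8 GB node
   echo "scale=1; 4 * 1.9 + 1.8" | bc   # 9.4 GB on the node holding task 0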

  22. First problems with large-scale runs - Resolutions • Allocate more nodes • Resolve through clever core/node allocations (don’t use all the cores in a node) • Parallel I/O • Resolves some of these issues • Reduces I/O bottlenecks
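The "clever core/node allocations" above amount to under-subscribing nodes so each MPI task gets a larger share of node memory. A hedged sketch in the same PBSPro/aprun style as the hybrid example earlier (task and core counts are illustrative, echoing the kraken runs described on slide 26):

   # run 1000 MPI tasks but place only 4 per 12-core node, leaving cores idle
   # so each task sees more of the node's 16 GBytes; the scheduler still
   # charges for whole nodes (1000 tasks -> 3000 reserved cores)
   #PBS -l mppwidth=1000
   #PBS -l mppnppn=4
   aprun -n 1000 -N 4 ./wrf.exe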

  23. Tackling the 1km, 1 Gigapoint run – getting started • real.exe and ...would go off top... message • Trying to discern whether numerical or memory issue • Experimenting on huge datasets is time and resource intensive • Finally narrowed down to “size” problem • High-numbered tasks had unrealistic pressures • Traced to MPI_Scatter() call with a 32-bit argument, trying to pass array offsets larger than 2^32

  24. I/O Considerations • NetCDF compiled for large memory • Default I/O – task 0 performs all I/O, so ALL data passes through task 0 • Split NetCDF files – each task accesses its own NetCDF file – perfect parallel I/O • pNetCDF – parallel library that facilitates simultaneous access by multiple tasks to a single, standard NetCDF file • Parallel file systems (e.g. lustre) – single file is partitioned (striped) across numerous discs.
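Which scheme WRF uses is selected in namelist.input. A hedged illustration; the io_form values in the comments follow the usual WRF convention (2 for netCDF, 11 for pNetCDF, and +100 for the split one-file-per-task variants) but should be verified against the Registry of the WRF version in use:

   grep io_form namelist.input
   #   io_form_history = 11,    pNetCDF: all tasks write one shared history file
   #   io_form_restart = 102,   split netCDF: each task writes its own restart file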

  25. Continuing Gigapoint Woes • Finally using parallel I/O (pNetCDF) • Could not generate restart file due to memory issue (2Gbyte message in rsl.error.0000) • We “could” generate these by using WRF’s split NetCDF approach • Don Bahls recognized that pNetCDF needed special compilation for large-memory

  26. Looking Rosier • Finally able to run on 1,500 cores • Nodes (kraken has 16 GBytes per 12-core node) were running out of memory, so we used only 2 cores per 6-core processor • A 1,000-core job required allocation of 3,000 cores • Still need to experiment

  27. Continuing Woes • Somewhere between 1,500 and 2,000 cores – lots of MPI errors • Object Storage Targets (OSTs) were getting overloaded • Default striping was 4 OSTs • Pete Johnsen recommended a number of OSTs between ½·sqrt(numCores) and sqrt(numCores) • 4,000 cores – used 32 OSTs • 10,000 cores – used 100 OSTs
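On a Lustre file system the stripe count can be raised ahead of the run with lfs setstripe; a short sketch following the rule of thumb above (the directory path is illustrative):

   # stripe the WRF output directory across 32 OSTs before a 4,000-core run
   lfs setstripe -c 32 /lustre/scratch/wrf_run
   lfs getstripe /lustre/scratch/wrf_run    # confirm the new striping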

  28. Success • Got up to 10,000 cores running (had to allocate 30,000) • Preliminary timings – time required for the third timestep

  29. Analysis of Various I/O Schemes on Pingo • Oralee Nudson did a series of test runs on ARSC’s Cray XT5 to assess performance of I/O on a medium-scale (3km resolution) benchmark case

  30. Closing Thoughts • File formats • Ultimately want higher resolution – physics problems? • Many other challenges – post-processing of huge and/or decomposed files, hard-coded aspects (rsl.out.9999) • Many heartfelt thanks to those who helped!
