This presentation discusses the capabilities of the Parallel I/O (PIO) library in Earth System Models, focusing on its role in optimizing performance and memory usage for high-performance computing. Key features include support for ensemble configurations, scalable input/output processing, and multi-format compatibility. PIO aims to simplify the user experience while ensuring efficient data handling across diverse computational environments. Insights into its integration with postprocessing tools and its application in complex models, such as CESM, are also presented, highlighting the library's adaptability to various storage solutions.
Parallel IO in the Community Earth System Model • Jim Edwards, John Dennis (NCAR), Ray Loy (ANL), Pat Worley (ORNL)
Some CESM 1.1 Capabilities: • Ensemble configurations with multiple instances of each component • Highly scalable capability proven to 100K+ tasks • Regionally refined grids • Data assimilation with DART
Prior to PIO • Each model component was independent, with its own I/O interface • Mix of file formats • NetCDF • Binary (POSIX) • Binary (Fortran) • Gather-scatter method to interface with serial I/O
Steps toward PIO • Converge on a single file format • NetCDF selected • Self describing • Lossless with lossy capability (netcdf4 only) • Works with the current postprocessing tool chain
Extension to parallel • Reduce single-task memory profile • Maintain a single-file, decomposition-independent format • Performance (secondary issue)
Parallel I/O from all compute tasks is not the best strategy • Data rearrangement is complicated, leading to numerous small and inefficient I/O operations • MPI-IO aggregation alone cannot overcome this problem
Parallel I/O library (PIO) • Goals: • Reduce per-MPI-task memory usage • Easy to use • Improve performance • Write/read a single file from a parallel application • Multiple backend libraries: MPI-IO, NetCDF3, NetCDF4, pNetCDF, NetCDF+VDC • Meta-IO library: potential interface to other general libraries
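Because all backends sit behind the same calls, switching formats is a matter of passing a different I/O-type flag when the file is created. A minimal Fortran sketch, assuming an already-initialized I/O system handle; the iotype constants are PIO names, but exact argument lists are paraphrased and vary between PIO versions:

  subroutine create_history_file(ios, file)
    use pio
    type(iosystem_desc_t), intent(inout) :: ios   ! set up earlier by PIO_init
    type(file_desc_t),     intent(out)   :: file
    integer :: ierr

    ! Serial NetCDF3 backend; swap the flag for a different library,
    ! e.g. PIO_iotype_pnetcdf or PIO_iotype_netcdf4p, without touching
    ! any other I/O code.
    ierr = PIO_createfile(ios, file, PIO_iotype_netcdf, 'history.nc')
  end subroutine create_history_file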
CESM component models (CAM atmosphere, CISM land ice, CLM land, CPL7 coupler, CICE sea ice, POP2 ocean) all perform their I/O through PIO, which in turn is layered on the backend libraries: netCDF3, netCDF4, pNetCDF, HDF5, MPI-IO, and VDC.
PIO design principles • Separation of Concerns • Separate computational and I/O decomposition • Flexible user-level rearrangement • Encapsulate expert knowledge
Separation of concerns • What versus How • Concern of the user: • What to write/read to/from disk? • e.g.: “I want to write T, V, PS.” • Concern of the library developer: • How to efficiently access the disk? • e.g.: “How do I construct I/O operations so that write bandwidth is maximized?” • Improves ease of use • Improves robustness • Enables better reuse
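In code, the user's side of this contract is a single generic call per field; aggregation, buffering, and backend details stay inside the library. A hedged sketch of that call, assuming the file, variable, and decomposition handles were set up elsewhere:

  subroutine write_temperature(file, t_desc, iodesc, t_local)
    use pio
    type(file_desc_t), intent(inout) :: file        ! open PIO file
    type(var_desc_t),  intent(inout) :: t_desc      ! handle from PIO_def_var for 'T'
    type(io_desc_t),   intent(inout) :: iodesc      ! handle from PIO_initdecomp
    real(kind=8),      intent(in)    :: t_local(:)  ! this task's portion of T
    integer :: ierr

    ! The "what": write T.  The "how" (rearrangement, flow control,
    ! file format) is handled inside PIO_write_darray.
    call PIO_write_darray(file, t_desc, iodesc, t_local, ierr)
  end subroutine write_temperature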
Separate computational and I/O decompositions • Data is rearranged between the computational decomposition and the I/O decomposition
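The link between the two decompositions is expressed once: each task lists the global offsets ("degrees of freedom") of the elements it owns, and PIO derives the I/O decomposition and the rearrangement from that map. A hedged sketch using an illustrative contiguous block per task; argument lists are paraphrased and may differ between PIO versions:

  subroutine make_decomp(ios, nlon, nlat, my_start, my_count, iodesc)
    use pio
    type(iosystem_desc_t), intent(inout) :: ios
    integer, intent(in)  :: nlon, nlat           ! global grid dimensions
    integer, intent(in)  :: my_start, my_count   ! this task's contiguous global range (illustrative)
    type(io_desc_t), intent(out) :: iodesc
    integer, allocatable :: compdof(:)
    integer :: i

    allocate(compdof(my_count))
    do i = 1, my_count
       compdof(i) = my_start + i - 1             ! global offset of local element i
    end do
    ! PIO builds the I/O decomposition and the rearrangement from this map.
    call PIO_initdecomp(ios, PIO_double, (/ nlon, nlat /), compdof, iodesc)
    deallocate(compdof)
  end subroutine make_decomp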
Flexible user-level rearrangement • A single technical solution is not suitable for the entire user community: • User A: Linux cluster, 32-core job, 200 MB files, NFS file system • User B: Cray XE6, 115,000-core job, 100 GB files, Lustre file system • Different compute environments require different technical solutions!
Writing distributed data (I): rearrangement from the computational decomposition to the I/O decomposition • + Maximizes the size of individual I/O operations to disk • - Non-scalable user-space buffering • - Very large fan-in -> large MPI buffer allocations • Correct solution for User A
Writing distributed data (II) • + Scalable user-space memory • + Relatively large individual I/O operations to disk • - Very large fan-in -> large MPI buffer allocations
Writing distributed data (III) • + Scalable user-space memory • + Smaller fan-in -> modest MPI buffer allocations • - Smaller individual I/O operations to disk • Correct solution for User B
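Which of the layouts above a run gets is chosen at initialization time, through the number of I/O tasks and their stride across the compute tasks, so User A and User B tune the same code to their machines. A minimal sketch assuming the PIO 1.x-style PIO_init argument order; names and defaults may differ in other versions, and the task counts shown are purely illustrative:

  subroutine start_pio(my_rank, comm, ntasks, ios)
    use pio
    integer, intent(in) :: my_rank      ! rank of this task in comm
    integer, intent(in) :: comm         ! MPI communicator of the compute tasks
    integer, intent(in) :: ntasks       ! size of comm
    type(iosystem_desc_t), intent(out) :: ios
    integer :: num_iotasks, stride

    ! User A (32-core cluster, NFS): funnel everything through a few I/O tasks,
    !   e.g. num_iotasks = 4, stride = 8.
    ! User B (115,000-core Cray XE6, Lustre): many I/O tasks, modest fan-in.
    num_iotasks = max(1, ntasks / 64)   ! illustrative choice, not a PIO default
    stride      = ntasks / num_iotasks

    ! Fourth argument is the aggregator count; PIO_rearr_box selects the
    ! box rearranger.
    call PIO_init(my_rank, comm, num_iotasks, 1, stride, PIO_rearr_box, ios)
  end subroutine start_pio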
Encapsulate expert knowledge • Flow-control algorithm • Match size of I/O operations to stripe size • Cray XT5/XE6 + Lustre file system • Minimize message-passing traffic at the MPI-IO layer • Load balance disk traffic over all I/O nodes • IBM Blue Gene/{L,P} + GPFS file system • Utilizes Blue Gene-specific topology information
Experimental setup • Did we achieve our design goals? • Impact of PIO features • Flow control • Vary the number of I/O tasks • Different general I/O backends • Read/write a 3D POP-sized variable [3600 x 2400 x 40] • 10 files, 10 variables per file [max bandwidth] • Using Kraken (Cray XT5) + Lustre file system • Used 16 of 336 OSTs
PIO-VDC: parallel output to a VAPOR Data Collection (VDC) • VDC: • A wavelet-based, gridded data format supporting both progressive access and efficient data subsetting • Data may be progressively accessed (read back) at different levels of detail, permitting the application to trade off speed and accuracy • Think Google Earth: less detail when the viewer is far away, progressively more detail as the viewer zooms in • Enables rapid (interactive) exploration and hypothesis testing that can subsequently be validated with full-fidelity data as needed • Subsetting • Arrays are decomposed into smaller blocks that significantly improve extraction of arbitrarily oriented subarrays • Wavelet transform • Similar to Fourier transforms • Computationally efficient: O(n) • Basis for many multimedia compression technologies (e.g. MPEG-4, JPEG 2000)
Other PIO Users • Earth System Modeling Framework (ESMF) • Model for Prediction Across Scales (MPAS) • Geophysical High Order Suite for Turbulence (GHOST) • Data Assimilation Research Testbed (DART)
Write performance on BG/L Penn State University
Read performance on BG/L Penn State University
100:1 compression with coefficient prioritization • 1024³ Taylor-Green turbulence (enstrophy field) [P. Mininni, 2006] • No compression vs. coefficient prioritization (VDC2)
4096³ homogeneous turbulence simulation • Volume rendering of the original enstrophy field and the 800:1 compressed field • 800:1 compressed: 0.34 GB/field • Original: 275 GB/field • Data provided by P.K. Yeung at Georgia Tech and Diego Donzis at Texas A&M
F90 code generation • Template (tmp.F90.in), processed by genf90.pl:

  interface PIO_write_darray
  ! TYPE real,int
  ! DIMS 1,2,3
     module procedure write_darray_{DIMS}d_{TYPE}
  end interface
Generated code:

  # 1 "tmp.F90.in"
  interface PIO_write_darray
     module procedure dosomething_1d_real
     module procedure dosomething_2d_real
     module procedure dosomething_3d_real
     module procedure dosomething_1d_int
     module procedure dosomething_2d_int
     module procedure dosomething_3d_int
  end interface
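After expansion, calling code uses only the generic name; the compiler selects the specific procedure from the rank and type of the data argument. A hedged usage sketch, with handles assumed to be set up elsewhere and names chosen purely for illustration:

  subroutine write_fields(file, ps_desc, t_desc, iodesc_1d, iodesc_2d, ps, t)
    use pio
    type(file_desc_t), intent(inout) :: file
    type(var_desc_t),  intent(inout) :: ps_desc, t_desc
    type(io_desc_t),   intent(inout) :: iodesc_1d, iodesc_2d
    real, intent(in) :: ps(:)     ! 1-D real field -> a 1d real specific procedure
    real, intent(in) :: t(:,:)    ! 2-D real field -> a 2d real specific procedure
    integer :: ierr

    ! Both calls go through the same generic interface shown above.
    call PIO_write_darray(file, ps_desc, iodesc_1d, ps, ierr)
    call PIO_write_darray(file, t_desc,  iodesc_2d, t,  ierr)
  end subroutine write_fields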
PIO is open source • http://code.google.com/p/parallelio/ • Documentation using doxygen • http://web.ncar.teragrid.org/~dennis/pio_doc/html/
Existing I/O libraries • netCDF3 • Serial • Easy to implement • Limited flexibility • HDF5 • Serial and Parallel • Very flexible • Difficult to implement • Difficult to achieve good performance • netCDF4 • Serial and Parallel • Based on HDF5 • Easy to implement • Limited flexibility • Difficult to achieve good performance
Existing I/O libraries (cont.) • Parallel-netCDF • Parallel • Easy to implement • Limited flexibility • Difficult to achieve good performance • MPI-IO • Parallel • Very difficult to implement • Very flexible • Difficult to achieve good performance • ADIOS • Serial and parallel • Easy to implement • BP file format • Easy to achieve good performance • All other file formats • Difficult to achieve good performance