
Progress in Storage Efficient Access: PnetCDF and MPI-I/O


Presentation Transcript


  1. Progress in Storage Efficient Access: PnetCDF and MPI-I/O
     SciDAC All Hands Meeting, March 2-3, 2005

  2. Outline
     • Parallel netCDF
       – Building blocks
       – Status report
       – Users and applications
       – Future work
     • MPI-I/O file caching sub-system
       – Enhanced client-side file caching for parallel applications
       – Scalable approach for enforcing file consistency and atomicity

  3. Parallel netCDF
     [Architecture diagram: compute nodes run Parallel netCDF over ROMIO/ADIO in user space; requests travel across a switch network to I/O servers in file-system space.]
     • Goals
       – Design parallel APIs
       – Keep the same file format
       – Backward compatible; easy to migrate from serial netCDF
       – Similar API names and argument lists, but with parallel semantics (a sketch of the parallel API follows this slide)
     • Tasks
       – Build on top of MPI for portability and high performance
       – Take advantage of existing MPI-IO optimizations (collective I/O, etc.)
       – Add functionality for sophisticated I/O patterns: a new set of flexible APIs incorporates MPI derived datatypes to describe the mapping between memory and file data layouts
       – Support C and Fortran interfaces
       – Support external data representations across platforms
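To make the "same names, parallel semantics" point concrete, here is a minimal sketch of the high-level C API, assuming the ncmpi_* interface of the v1.0-era library; the file name, array sizes, and decomposition are illustrative, and error checking is omitted. Every process creates the file collectively and writes its own slab of a shared variable with a single collective call, letting ROMIO's collective I/O optimizations apply.

    /* Sketch: each process writes its own block of rows of a shared 2-D
     * variable using PnetCDF's high-level, netCDF-like C API. */
    #include <mpi.h>
    #include <pnetcdf.h>

    #define NY 10   /* rows owned by each process (illustrative sizes) */
    #define NX 10

    int main(int argc, char **argv) {
        int rank, nprocs, ncid, dimids[2], varid, buf[NY][NX];
        MPI_Offset start[2], count[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Unlike serial netCDF, the file is created collectively by all ranks */
        ncmpi_create(MPI_COMM_WORLD, "output.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "y", (MPI_Offset)NY * nprocs, &dimids[0]);
        ncmpi_def_dim(ncid, "x", NX, &dimids[1]);
        ncmpi_def_var(ncid, "var", NC_INT, 2, dimids, &varid);
        ncmpi_enddef(ncid);

        for (int j = 0; j < NY; j++)
            for (int i = 0; i < NX; i++)
                buf[j][i] = rank;

        /* Each rank writes a disjoint block of rows; the _all suffix makes the
         * call collective, so the MPI-IO layer can coalesce the requests. */
        start[0] = (MPI_Offset)rank * NY;  start[1] = 0;
        count[0] = NY;                     count[1] = NX;
        ncmpi_put_vara_int_all(ncid, varid, start, count, &buf[0][0]);

        ncmpi_close(ncid);
        MPI_Finalize();
        return 0;
    }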

  4. PnetCDF Current Status
     • High-level APIs (mimicking the serial netCDF API)
       – Fully supported in both C and Fortran
     • Flexible APIs (extended to utilize MPI derived datatypes); a sketch follows this slide
       – Allow complex memory layouts for mapping between the I/O buffer and file space
       – Support varm routines (strided memory layout) ported from serial netCDF
       – Support array shuffles, e.g. transposition
     • Test suites
       – C and Fortran self-test codes ported from the Unidata netCDF package to validate against single-process results
       – Parallel test codes for both sets of APIs
     • Latest release is v0.9.4
     • Pre-release v1.0
       – Synced with netCDF v3.6.0 (newest release from Unidata)
       – Parallel API user manual
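As an illustration of the flexible APIs above, the following sketch passes an MPI derived datatype to a flexible put call so a strided, non-contiguous memory buffer can be written without an intermediate packing copy. The wrapper function, buffer sizes, and file region are made up for the example; ncmpi_put_vara_all is the flexible counterpart of the high-level vara call and takes (buf, bufcount, buftype) arguments in MPI style.

    /* Sketch: write every other column of a local 4x8 buffer into a 4x4
     * region of variable `varid`, describing the memory layout with an
     * MPI derived datatype instead of packing it first. */
    #include <mpi.h>
    #include <pnetcdf.h>

    void write_strided(int ncid, int varid, const MPI_Offset start[2])
    {
        double buf[4][8];                 /* local buffer; only half is written */
        MPI_Offset count[2] = {4, 4};     /* 4x4 region in the file             */
        MPI_Datatype strided;

        /* 16 blocks of 1 double with stride 2: every other element of the
         * row-major 4x8 buffer, i.e. columns 0,2,4,6 of each row. */
        MPI_Type_vector(16, 1, 2, MPI_DOUBLE, &strided);
        MPI_Type_commit(&strided);

        /* Flexible API: buffer described by (buf, 1, strided) */
        ncmpi_put_vara_all(ncid, varid, start, count, buf, 1, strided);

        MPI_Type_free(&strided);
    }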

  5. PnetCDF Users and Applications
     • FLASH – astrophysical thermonuclear application from the ASCI/Alliances Center at the University of Chicago
     • ACTM – Atmospheric Chemical Transport Model from LLNL
     • ROMS – Regional Ocean Model System from the NCSA HDF group
     • ASPECT – data understanding infrastructure from ORNL
     • pVTK – parallel visualization toolkit from ORNL
     • PETSc – Portable, Extensible Toolkit for Scientific Computation from ANL
     • PRISM – PRogram for Integrated Earth System Modeling

  6. PnetCDF Future Work
     • Data type conversion for external data representation
       – Reduce intermediate memory copy operations
       – int64 → int32, little-endian → big-endian, int → double
       – Data type caching at the PnetCDF level (without repeated decoding)
     • I/O hints
       – Currently, only MPI hints (MPI file info) are supported (see the sketch after this slide)
       – Need netCDF-level hints, e.g. patterns of the access sequence across multiple arrays
     • Non-blocking I/O
     • Large array support (dimension sizes > 2^31 - 1)
     • More flexible and extensible file format
       – Allow adding new objects dynamically
       – Store arrays of structured data types, such as C structures
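Since only MPI file-info hints are supported today, a sketch of how such hints reach PnetCDF might look like the following. The hint keys are standard ROMIO hints that the underlying MPI-IO layer interprets; the wrapper function and the values chosen are illustrative.

    /* Sketch: tuning hints currently reach PnetCDF only through the MPI_Info
     * object passed to ncmpi_create/ncmpi_open, which hands them to MPI-IO. */
    #include <mpi.h>
    #include <pnetcdf.h>

    int create_with_hints(const char *path, int *ncidp)
    {
        MPI_Info info;
        MPI_Info_create(&info);

        /* Collective-buffering hints interpreted by the MPI-IO layer (ROMIO) */
        MPI_Info_set(info, "cb_buffer_size", "8388608");  /* 8 MB per aggregator   */
        MPI_Info_set(info, "romio_cb_write", "enable");   /* force collective writes */

        int err = ncmpi_create(MPI_COMM_WORLD, path, NC_CLOBBER, info, ncidp);
        MPI_Info_free(&info);
        return err;
    }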

  7. Client-side File Caching for MPI I/O
     [Diagram: client processors contribute local cache buffers to a global cache pool in user space, above the client-side file system; an interconnect network links them to the I/O servers and disks.]
     • Traditional client-side file caching
       – Treats each client independently, targeting distributed environments
       – Inadequate for parallel environments, where clients are most likely related to each other (e.g. they read/write shared files)
     • Collective caching
       – Application processes cooperate with each other to perform data caching and coherence control, leaving the I/O servers out of the task

  8. Design of Collective Caching
     [Diagram: the file is logically partitioned into blocks; the status metadata for each block is distributed across the processes, while the cache pages themselves form a global pool built from each process's local memory, all inside the MPI library in user space, between the application and the client-/server-side file systems.]
     • Caching sub-system is implemented in user space
       – Built at the MPI I/O level → portable across different file systems
     • Distributed management
       – For cache metadata and lock control (vs. a centralized manager)
     • Two designs (an RMA sketch follows this slide):
       – Using an I/O thread
       – Using the MPI remote-memory-access (RMA) facility
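The following is a minimal sketch, not the actual implementation, of how the RMA-based design could fetch the distributed metadata entry for one file block using passive-target one-sided operations. All types and names here (cache_meta_t, meta_win, lookup_block_meta) are hypothetical; only the MPI calls themselves are standard.

    /* Sketch: each process exposes its share of the cache metadata in an MPI
     * window; a client locks and fetches the entry for a file block without
     * involving the remote process (passive-target synchronization). */
    #include <mpi.h>

    typedef struct {
        int      owner;    /* rank currently caching this file block, -1 if none */
        int      dirty;    /* nonzero if the cached copy has been modified       */
        MPI_Aint offset;   /* byte offset of the block in the shared file        */
    } cache_meta_t;

    /* Metadata for block `blk` lives on rank (blk % nprocs), at a fixed
     * displacement inside that rank's window `meta_win`. */
    void lookup_block_meta(MPI_Win meta_win, int blk, int nprocs, cache_meta_t *out)
    {
        int      home = blk % nprocs;                       /* metadata home rank */
        MPI_Aint disp = (MPI_Aint)(blk / nprocs) * sizeof(cache_meta_t);

        /* Exclusive lock serializes concurrent metadata updates */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, home, 0, meta_win);
        MPI_Get(out, sizeof(cache_meta_t), MPI_BYTE,
                home, disp, sizeof(cache_meta_t), MPI_BYTE, meta_win);
        MPI_Win_unlock(home, meta_win);   /* completes the get */
    }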

  9. Performance Results 1
     [Charts: I/O bandwidth in MB/s vs. number of nodes (4-32) for read/write amounts of 512 MB, 1 GB, 2 GB, 4 GB, 8 GB, and 16 GB, comparing byte-range locking I/O with collective caching.]
     • IBM SP at SDSC using GPFS
       – System peak performance: 2.1 GB/s for reads, 1 GB/s for writes
     • Sliding-window benchmark
       – I/O requests are overlapped
       – Can cause cache coherence problems

  10. Performance Results 2
     [Charts: I/O bandwidth in MB/s vs. number of nodes for the BTIO benchmark (classes A and B, 4-64 nodes) and the FLASH I/O benchmark (8x8x8 and 16x16x16 blocks, 4-64 nodes), comparing the original MPI-IO path with collective caching.]
     • BTIO benchmark
       – From NASA Ames Research Center, NAS Parallel Benchmarks version 2.4
       – Block tri-diagonal array partitioning pattern
       – Uses MPI collective I/O calls
       – I/O requests are not overlapped
     • FLASH I/O benchmark
       – From the U. of Chicago ASCI Alliances Center
       – Access pattern is non-contiguous both in memory and in file
       – Uses HDF5
       – I/O requests are not overlapped
