
Presentation Transcript


  1. DoE I/O characterizations, infrastructures for wide-area collaborative science, and future opportunities
  Jeffrey S. Vetter, Micah Beck, Philip Roth
  Future Technologies Group @ ORNL / University of Tennessee, Knoxville

  2. NCCS Resources (September 2005 Summary)
  • Supercomputers: 7,622 CPUs, 16 TB memory, 45 TFlops in total
    • Cray XT3 (Jaguar): 5,294 processors @ 2.4 GHz, 11 TB memory
    • Cray X1E (Phoenix): 1,024 processors @ 0.5 GHz, 2 TB memory
    • SGI Altix (Ram): 256 processors @ 1.5 GHz, 2 TB memory
    • IBM SP4 (Cheetah): 864 processors @ 1.3 GHz, 1.1 TB memory
    • IBM Linux (NSTG): 56 processors @ 3 GHz, 76 GB memory
    • Visualization Cluster: 128 processors @ 2.2 GHz, 128 GB memory
  • Storage: many storage devices supported; shared disk of 9 TB, 120 TB, 32 TB, 36 TB, 32 TB, 4.5 TB, and 5 TB (238.5 TB total); 5 PB of IBM HPSS backup storage
  • Networks: 1 GigE control network; 10 GigE network routers linking the 7 systems to the UltraScience network
  • Evaluation platforms: 144-processor Cray XD1 with FPGAs, SRC Mapstation, Clearspeed, BlueGene (at ANL)
  • Test systems: 96-processor Cray XT3, 32-processor Cray X1E, 16-processor SGI Altix
  • Scientific Visualization Lab: 27-projector Power Wall

  3. Implications for Storage, Data, Networking
  • Decommissioning of systems affects very important GPFS storage
    • IBM SP3 decommissioned in June
    • IBM p690 cluster disabled on Oct 4
  • Next-generation systems bring new storage architectures
    • Lustre on the 25 TF XT3, with a lightweight kernel (limited number of OS services)
    • ADIC StorNext on the 18 TF X1E
  • Visualization and analysis systems: SGI Altix, Viz cluster, 35-megapixel Powerwall
  • Networking
    • Connected to ESnet, Internet2, TeraGrid, UltraNet, etc.
    • ORNL internal network being upgraded now
    • End-to-end solutions underway (see Klasky's talk)

  4. Initial Lustre Performance on Cray XT3
  • The acceptance criterion for the XT3 is 5 GB/s of I/O bandwidth
  • A 32 OSS, 64 OST configuration hit 6.7 GB/s write and 6.2 GB/s read
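The slide does not say how these figures were gathered; the following is only a minimal sketch of the kind of collective-write bandwidth test such numbers typically come from, not the actual acceptance benchmark. The file name, block size, and hint values are illustrative assumptions; the striping hints shown (striping_factor, striping_unit) are standard ROMIO hints honored on Lustre.

```c
/* Hypothetical MPI-IO write-bandwidth sketch (not the actual XT3 acceptance test). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_MB 64   /* per-rank block size; illustrative */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    size_t nbytes = (size_t)BLOCK_MB * 1024 * 1024;
    char *buf = malloc(nbytes);
    memset(buf, rank & 0xff, nbytes);

    /* Pass striping hints down to the file system (honored by ROMIO on Lustre). */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "64");     /* e.g. one stripe per OST */
    MPI_Info_set(info, "striping_unit", "1048576");

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "bw_test.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Each rank writes a contiguous, disjoint block of one shared file. */
    MPI_Offset offset = (MPI_Offset)rank * nbytes;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_File_write_at_all(fh, offset, buf, (int)nbytes, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);          /* include the close so data is flushed before timing stops */
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("aggregate write: %.2f GB/s\n",
               (double)nbytes * nprocs / (t1 - t0) / 1e9);

    free(buf);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```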

  5. Preliminary I/O Survey (Spring 2005)
  • Need I/O access patterns

  6. Initial Observations
  • Most users have limited I/O capability because the libraries and runtime systems are inconsistent across platforms
  • Little use of [Parallel] NetCDF or HDF5; seldom direct use of MPI-IO (a Parallel NetCDF sketch follows below)
  • Widely varying file size distribution: 1 MB, 10 MB, 100 MB, 1 GB, 10 GB
  • Comments from the community
    • POP: baseline parallel I/O works; not clear the new decomposition scheme is easily parallelized
    • TSI: would like to use higher-level libraries (Parallel NetCDF, HDF5), but they are not implemented or perform poorly on target architectures
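For reference, a minimal sketch of what Parallel NetCDF (PnetCDF) output from each MPI rank looks like; the variable name, array size, and 1-D decomposition are illustrative assumptions, not taken from any of the surveyed codes.

```c
/* Hypothetical Parallel NetCDF (PnetCDF) collective write sketch. */
#include <mpi.h>
#include <pnetcdf.h>
#include <stdlib.h>

#define NX_GLOBAL 1024   /* illustrative global array size */

int main(int argc, char **argv)
{
    int rank, nprocs, ncid, dimid, varid;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Create one shared, self-describing file from all ranks. */
    ncmpi_create(MPI_COMM_WORLD, "field.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "x", NX_GLOBAL, &dimid);
    ncmpi_def_var(ncid, "density", NC_FLOAT, 1, &dimid, &varid);
    ncmpi_enddef(ncid);

    /* Simple 1-D block decomposition: each rank owns a contiguous slab
       (assumes NX_GLOBAL is divisible by nprocs, for brevity). */
    MPI_Offset count = NX_GLOBAL / nprocs;
    MPI_Offset start = rank * count;
    float *slab = malloc(count * sizeof(float));
    for (MPI_Offset i = 0; i < count; i++) slab[i] = (float)rank;

    /* Collective write: the library handles layout and byte order portably. */
    ncmpi_put_vara_float_all(ncid, varid, &start, &count, slab);

    ncmpi_close(ncid);
    free(slab);
    MPI_Finalize();
    return 0;
}
```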

  7. Preliminary I/O Performance of VH-1 on X1E Using Different I/O Strategies

  8. Learning from Experience with VH-1
  • The fast way to write (today): each rank writes a separate architecture-specific file
    • Native-mode MPI-IO, file per rank: O(1000) MB/s
    • Manipulating a dataset in this form is awkward
  • Two solutions (sketched below)
    • Write one portable file using collective I/O; native-mode MPI-IO to a single file: < 100 MB/s (no external32 on Phoenix)
    • Post-process sequentially, rewriting in a portable format
  • In either case the computing platform does a lot of work to generate a convenient metadata structure in the file system
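A minimal, hypothetical sketch of the two write strategies being compared; file names, sizes, and data layout are assumptions for illustration, not VH-1's actual output code. A fully portable single file would additionally set the "external32" data representation via MPI_File_set_view, which the slide notes was unavailable on Phoenix, so both paths below use the native representation.

```c
/* Hypothetical sketch of the two MPI-IO write strategies compared above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N_LOCAL (1 << 20)   /* doubles per rank; illustrative */

/* Strategy 1: file per rank, native representation (fast, but awkward to reuse). */
static void write_file_per_rank(const double *data, int rank)
{
    char name[64];
    MPI_File fh;
    snprintf(name, sizeof(name), "dump_%05d.dat", rank);
    MPI_File_open(MPI_COMM_SELF, name,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write(fh, data, N_LOCAL, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}

/* Strategy 2: one shared file written collectively (single dataset, slower here). */
static void write_single_file(const double *data, int rank)
{
    MPI_File fh;
    MPI_Offset offset = (MPI_Offset)rank * N_LOCAL * sizeof(double);
    MPI_File_open(MPI_COMM_WORLD, "dump_all.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, offset, data, N_LOCAL, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *data = malloc(N_LOCAL * sizeof(double));
    for (int i = 0; i < N_LOCAL; i++) data[i] = rank + i * 1e-6;

    write_file_per_rank(data, rank);   /* the fast, architecture-specific path */
    write_single_file(data, rank);     /* the single-file, collective path     */

    free(data);
    MPI_Finalize();
    return 0;
}
```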

  9. Future Opportunities

  10. Future Directions
  • Exposing structure to applications
  • High performance and fault tolerance in data-intensive scientific work
  • Analysis and benchmarking of data-intensive scientific codes
  • Advanced object storage devices
  • Adapting parallel I/O and parallel file systems to wide-area collaborative science

  11. Exposing File Structure to Applications
  • VH-1 experiences motivate some issues
    • Expensive compute platforms perform I/O optimized to their own architecture and configuration
    • Metadata describes organization and encoding, using a portable schema
    • Processing is performed between writer and reader
  • File systems are local managers of structure and metadata
  • File-description metadata can be managed outside a file system (a hypothetical sketch follows below)
  • In some cases, it may be possible to bypass file systems and access Object Storage Devices directly from the application
  • Exposing resources enables application autonomy
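As an illustration only, one shape such externally managed file-description metadata might take, expressed as C structures describing the per-rank pieces of a distributed dataset. Every type and field name here is an assumption for illustration, not an existing schema from the project.

```c
/* Hypothetical file-description metadata kept outside the file system.
 * All names and fields are illustrative, not an existing schema. */
#include <stdint.h>

enum byte_order   { ORDER_LITTLE_ENDIAN, ORDER_BIG_ENDIAN };
enum element_type { ELEM_FLOAT32, ELEM_FLOAT64 };

/* One contiguous extent of the logical dataset, as written by one rank. */
struct extent_desc {
    char     location[256];    /* file name, or an OSD object identifier */
    uint64_t offset;           /* byte offset within that file/object    */
    uint64_t length;           /* extent length in bytes                 */
    uint64_t logical_start;    /* starting element index in the dataset  */
};

/* Portable description of a distributed dataset (e.g. one time step). */
struct dataset_desc {
    char                name[64];   /* e.g. "density"                   */
    enum element_type   type;       /* element encoding                 */
    enum byte_order     order;      /* byte order of the stored data    */
    uint32_t            ndims;
    uint64_t            dims[4];    /* global dimensions                */
    uint32_t            nextents;
    struct extent_desc *extents;    /* where each piece actually lives  */
};
```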

  12. High Performance and Fault Tolerance in Data-Intensive Scientific Work
  • In complex workflows, performance and fault tolerance are defined in terms of global results, not locally
  • Logistical Networking (LN) has proved valuable to application and tool builders
  • We will continue collaborating on and investigating the design of SDM tools that incorporate LN functionality
    • Caching and distribution of massive datasets
    • Interoperation with GridFTP-based infrastructure
    • Robustness through the use of widely distributed resources

  13. Science Efforts Currently Leveraging Logistical Networking
  • Transfers over long networks decrease the latency seen by TCP flows by storing and forwarding at intermediate points; producer and consumer are buffered, and the transfer is accelerated
  • Data movement within the Terascale Supernova Initiative and Fusion Energy Simulation projects
  • Storage model for an implementation of the SRM interface for the Open Science Grid project at Vanderbilt ACCRE
  • AmericaView distribution of MODIS satellite data to a national audience

  14. Continue Analysis and Benchmarking of Data-Intensive Scientific Codes
  • Scientists may (and should) have an abstract idea of their I/O characteristics (especially access patterns and access sizes), which are sensitive to systems and libraries
  • Instrumentation at the source level is difficult and may be misleading
  • Standard benchmarks and tools must be developed as a basis for comparison
    • Some tools exist: ORNL's mpiP provides statistics about MPI-IO at runtime (an interposition sketch follows below)
  • I/O performance metrics should be routinely collected
  • Parallel I/O behavior should be accessible to users
  • Non-determinism in performance must be addressed
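Tools like mpiP gather their statistics through the standard PMPI profiling interface. The fragment below is a minimal, hypothetical illustration of that technique, not mpiP's actual code: it times collective MPI-IO writes by interposing on MPI_File_write_at_all and reports per-rank totals at shutdown.

```c
/* Hypothetical PMPI interposition sketch: time collective MPI-IO writes.
 * Link this wrapper ahead of the MPI library; not mpiP's actual implementation. */
#include <mpi.h>
#include <stdio.h>

static double io_time  = 0.0;
static long   io_calls = 0;

/* Intercept the call, then forward to the real routine via its PMPI_ name. */
int MPI_File_write_at_all(MPI_File fh, MPI_Offset offset, const void *buf,
                          int count, MPI_Datatype datatype, MPI_Status *status)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_File_write_at_all(fh, offset, buf, count, datatype, status);
    io_time += MPI_Wtime() - t0;
    io_calls++;
    return rc;
}

/* Report per-rank totals when the application shuts MPI down. */
int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %ld collective writes, %.3f s in MPI_File_write_at_all\n",
           rank, io_calls, io_time);
    return PMPI_Finalize();
}
```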

  15. Move Toward Advanced Object Storage Devices
  • OSD is an emerging trend in file and storage systems (e.g., Lustre, Panasas)
  • The current architecture is evolutionary from SCSI
  • Need to develop advanced forms of OSD that can be accessed directly by multiple users
  • Active OSD technology may enable pre- and post-processing at network storage nodes to offload hosts
  • These nodes must fit into a larger middleware framework and workflow scheme

  16. Adapt Parallel I/O and Parallel File Systems to Wide-Area Collaborative Science
  • The emergence of massive digital libraries and shared workspaces requires common tools that provide more control than file transfer or distributed file system solutions
  • Direct control over tape, wide-area transfer, and local caching is an important element of application optimization
  • New standards are required for expressing file system concepts interoperably in a heterogeneous wide-area environment

  17. Enable Uniform Component Coupling
  • Major components of scientific workflows interact through asynchronous file I/O interfaces
  • The granularity was traditionally the output of a complete run; today, as in Klasky and Bhat's data streaming, it is one time step, due to the increased scale of computation
  • Flexible management of state is required for customization of component interactions
    • Localization (e.g., caching)
    • Fault tolerance (e.g., redundancy)
    • Optimization (e.g., point-to-multipoint)

  18. Questions?

  19. Bonus Slides

  20. Important Attributes of a Common Interface
  • Primitive and generic, in order to serve many purposes
  • Sufficient to implement application requirements
  • Well layered, in order to allow for diversity
    • Does not impose the costs of complex higher layers on users of the lower-layer functionality
  • Easily ported to new platforms and widely acceptable within the developer community
  • Who are the designers? What is the process?

  21. Developer Conversations / POP
  • Parallel Ocean Program
  • Already has a working parallel I/O scheme
  • An initial look at MPI-IO seemed to indicate an impedance mismatch with POP's decomposition scheme
  • The NetCDF option doesn't use parallel I/O
  • I/O is a low priority compared to other performance issues

  22. Developer Conversations / TSI
  • Terascale Supernova Initiative
  • Would like to use Parallel NetCDF or HDF5, but they are unavailable or perform poorly on the platform of choice (Cray X1)
  • Negligible performance impact of each rank writing an individual time-step file, at least up to 140 PEs
  • Closer investigation of VH-1 shows
    • The performance impact of writing a file per rank is not negligible
    • Major costs are imposed by writing architecture-independent files and by forming a single time-step file
    • Parallel file systems address only some of these issues

  23. Interfaces Used in Logistical Networking
  • On the network side
    • Sockets/TCP link clients to servers
    • XML metadata schema
  • On the client side
    • Procedure calls in C/Fortran/Java/…
    • Application-layer I/O libraries
  • End-user tools
    • Command line
    • GUI implemented in Tcl
