High Performance Computing: Concepts, Methods & Means Parallel I/O: File Systems and Libraries
Presentation Transcript

  1. High Performance Computing: Concepts, Methods & Means Parallel I/O: File Systems and Libraries Prof. Thomas Sterling, Department of Computer Science, Louisiana State University, March 29th, 2007

  2. Topics Introduction RAID Distributed File Systems (NFS) Parallel File Systems (PVFS2) Parallel I/O Libraries (MPI-IO) Parallel File Formats (HDF5) Additional Parallel File Systems (GPFS) Summary – Materials for Test

  3. Topics Introduction RAID Distributed File Systems (NFS) Parallel File Systems (PVFS2) Parallel I/O Libraries (MPI-IO) Parallel File Formats (HDF5) Additional Parallel File Systems (GPFS) Summary – Materials for Test

  4. Permanent Storage: Hard Disks Review • Storage capacity: 1 TB per drive • Areal density: 132 Gbit/in² (perpendicular recording) • Rotational speed: 15,000 RPM • Average latency: 2 ms • Seek time • Track-to-track: 0.2 ms • Average: 3.5 ms • Full stroke: 6.7 ms • Sustained transfer rate: up to 125 MB/s • Non-recoverable error rate: 1 in 10¹⁷ • Interface bandwidth: • Fibre Channel: 400 MB/s • Serial Attached SCSI (SAS): 300 MB/s • Ultra320 SCSI: 320 MB/s • Serial ATA (SATA): 300 MB/s

  5. Storage – SATA & Overview - Review PATA vs SATA: Serial ATA is the newest commodity hard disk standard. SATA uses serial buses as opposed to the parallel buses used by ATA and SCSI. The cables attached to SATA drives are smaller and run faster (around 150 MB/s). The basic disk technologies remain the same across the three buses. The platters in a disk spin at a variety of speeds; the faster the platters spin, the faster data can be read off the disk, and data on the far end of the platter becomes available sooner. Rotational speeds range between 5,400 RPM and 15,000 RPM. The faster the platters rotate, the lower the latency and the higher the bandwidth.

  6. I/O Needs on Parallel Computers • High Performance • Take advantage of parallel I/O paths (when available) • Support application-level data access and throughput needs • Data Integrity • Sanely deal with hardware and power failures • Single Namespace • All nodes and users “see” the same file systems • Equal access from anywhere on the resource • Ease of Use • Where possible, a parallel file system should be accessible in a consistent way, in the same way as a traditional UNIX-style file system. Ohio Supercomputer Center

  7. Topics Introduction RAID Distributed File Systems (NFS) Parallel File Systems (PVFS2) Parallel I/O Libraries (MPI-IO) Parallel File Formats (HDF5) Additional Parallel File Systems (GPFS) Summary – Materials for Test

  8. Parallel I/O - RAID RAID stands for Redundant Array of Inexpensive Disks. It provides a mechanism by which the performance and storage properties of individual disks can be aggregated: a group of disks appears to be a single large disk, and the performance of multiple disks is better than that of a single disk. Using multiple disks also allows data to be stored in multiple places, so the system can continue functioning after a disk failure. Both software and hardware RAID solutions are available. Hardware solutions are more expensive, but provide better performance without CPU overhead. Software solutions provide various levels of flexibility, but have associated computational overhead.

  9. RAID : Key Concepts • Variety of RAID allocation schemes : • RAID 0 (disk striping without redundant storage) : • Data is striped across multiple disks. • The result of striping is a logical storage device that has the capacity of each disk times the number of disks present in the RAID array. • Both read and write performance are accelerated. • Interleaving accesses across the disks lets multiple spindles work in parallel, so transfer rates scale with the number of disks. • No fault tolerance • High transfer rates • High request rates http://www.drivesolutions.com/datarecovery/raid.shtml

  10. RAID : Key Concepts • RAID 1 (disk mirroring): • Complete copies of the data are stored in multiple locations. • The usable capacity of such a RAID set is half of its raw capacity. • Read performance is accelerated and is comparable to RAID 0. • Writes are slowed down, as new data needs to be transmitted multiple times. • RAID 5: • Like RAID 0, data is striped across multiple disks, with parity distributed across the disks. • For each stripe of data stored across the drives, a parity checksum is computed and stored on one of the disks, with the parity disk rotating from stripe to stripe. • Read performance of RAID 5 is slightly reduced, as parity blocks are interspersed with data across the drives, and write performance lags behind because of the checksum computation. http://www.drivesolutions.com/datarecovery/raid.shtml

  11. Topics Introduction RAID Distributed File Systems (NFS) Parallel File Systems (PVFS2) Parallel I/O Libraries (MPI-IO) Parallel File Formats (HDF5) Additional Parallel File Systems (GPFS) Summary – Materials for Test

  12. Distributed File Systems • A distributed file system is a file system that is stored locally on one system (server) but is accessible by processes on many systems (clients). • Multiple processes can access multiple files simultaneously. • Other attributes of a DFS may include : • Access control lists (ACLs) • Client-side file replication • Server- and client-side caching • Some examples of DFSes: • NFS (Sun) • AFS (CMU) • DCE/DFS (Transarc / IBM) • CIFS (Microsoft) • Distributed file systems can be used by parallel programs, but they have significant disadvantages : • The network bandwidth of the server system is a limiting factor on performance • To retain UNIX-style file consistency, the DFS software must implement some form of locking, which has significant performance implications Ohio Supercomputer Center

  13. Distributed File System : NFS A popular means for accessing remote file systems in a local area network. Based on the client-server model: remote file systems are “mounted” via NFS and accessed through the Linux virtual file system (VFS) layer. NFS clients cache file data, periodically checking with the original file for any changes. This loosely-synchronous model makes for convenient, low-latency access to shared spaces. NFS avoids the complex locking systems used to implement POSIX semantics.

  14. Why NFS is bad for Parallel I/O Clients can cache data indiscriminately, with caching done at block boundaries. When nearby regions of a file are written by different processes on different clients, the result is undefined due to the lack of consistency control. Furthermore, all file operations are remote operations; communication between client and server typically uses relatively slow communication channels, adding to performance degradation. Extensive file locking is required to implement sequential consistency. The specification is also inefficient (e.g. a read operation involves two RPC operations: one for the look-up of the file handle and a second for the reading of file data).

  15. Topics Introduction RAID Distributed File Systems (NFS) Parallel File Systems (PVFS2) Parallel I/O Libraries (MPI-IO) Parallel File Formats (HDF5) Additional Parallel File Systems (GPFS) Summary – Materials for Test

  16. Parallel File Systems • A Parallel File System is one in which there are multiple servers as well as clients for a given file system; it is the equivalent of RAID across several file systems. • Multiple processes can access the same file simultaneously • Parallel File Systems are usually optimized for high performance rather than general purpose use, common optimization criteria being : • very large block sizes (≥ 64 kB) • relatively slow metadata operations (eg. fstat()) compared to reads and writes • special APIs for direct access • Examples of Parallel file systems include : • GPFS (IBM) • LUSTRE (Cluster File Systems) • PVFS2 (Clemson/ANL) Ohio Supercomputer Center

  17. Characteristics of Parallel File Systems [Diagram: I/O software stack, top to bottom — High-Level I/O Library, Parallel I/O (MPI-IO), Parallel File System, Storage Hardware] • Three Key Characteristics : • Various hardware I/O data storage resources • Multiple connections between these hardware devices and compute resources • High-performance, concurrent access to these I/O resources • Multiple physical I/O devices and paths ensure sufficient bandwidth for the high performance desired • Parallel I/O systems include both the hardware and a number of layers of software

  18. Parallel File Systems: Hardware Layer • I/O hardware usually comprises disks, controllers, and interconnects for data movement. • Hardware determines the maximum raw bandwidth and the minimum latency of the system. • The bisection bandwidth of the underlying transport determines the aggregate bandwidth of the resulting parallel I/O system. • At the hardware level, data is accessed at the granularity of blocks, either physical disk blocks or logical blocks spread across multiple physical devices, such as in a RAID array. • Parallel file systems : • manage data on the storage hardware, • present this data as a directory hierarchy, • coordinate access to files and directories in a consistent manner • File systems usually provide a UNIX-like interface, allowing users to access contiguous regions of files.

  19. Parallel File Systems : Other Layers Lower-level interfaces may be provided by the file system for higher-performance access. Above the parallel file system are the parallel I/O layers, provided in the form of libraries such as MPI-IO. The parallel I/O layer provides a low-level interface and operations such as collective I/O. Scientific applications work with structured data, for which a higher-level API written on top of MPI-IO, such as HDF5 or Parallel netCDF, is used. HDF5 and Parallel netCDF allow scientists to represent their data sets in terms closer to those used in their applications.

  20. PVFS2 • PVFS2 is designed to provide : • modular networking and storage subsystems • a structured data request format modeled after MPI datatypes • flexible and extensible data distribution models • distributed metadata • tunable consistency semantics, and • support for data redundancy. • Supports a variety of network technologies including Myrinet, Quadrics, and InfiniBand. • Also supports a variety of storage devices including locally attached hardware, SANs and iSCSI. • Key abstractions include : • Buffered Message Interface (BMI) : non-blocking network interface • Trove : non-blocking storage interface • Flows : mechanism to specify a flow of data between network and storage

  21. PVFS2 Software Architecture [Diagram: client stack (Client API, Job Sched, BMI, Flows, Dist) and server stack (Request Processing, Job Sched, BMI, Flows, Trove, Dist), connected over the Network, with the server backed by Disk] • Buffered Messaging Interface (BMI) • Non-blocking interface that can be used with many high-performance network fabrics • Currently TCP/IP and Myrinet (GM) networks are supported • Trove : • Non-blocking interface that can be used with a number of underlying storage mechanisms. • Trove storage objects consist of a stream of bytes and a keyword/value pair space. • Keyword/value pairs are convenient for arbitrary metadata storage and directory entries, while the stream of bytes provides ideal storage for file data.

  22. PVFS2 Software Architecture • Flows : • Combine the network and storage subsystems by providing a mechanism to describe the flow of data between network and storage. • Provide a point for optimization, allowing data movement between a particular network and storage pair to exploit fast paths. • The job scheduling layer provides a common interface to interact with BMI, Flows, and Trove and checks on their completion. • The job scheduler is tightly integrated with a state machine that is used to track operations in progress.

  23. The PVFS2 Components • The four major components of a PVFS system are : • Metadata Server (mgr) • I/O Server (iod) • PVFS native API (libpvfs) • PVFS Linux kernel support • Metadata Server (mgr) : • manages all the file metadata for PVFS files, using a daemon which atomically operates on the file metadata. • PVFS avoids the pitfalls of many storage area network approaches, which have to implement complex locking schemes to ensure that metadata stays consistent in the face of multiple accesses.

  24. The PVFS2 Components • I/O daemon (iod) : • handles storing and retrieving file data stored on local disks connected to a node, using traditional read(), write(), etc. for access to these files. • The PVFS native API provides user-space access to the PVFS servers. • The library handles the operations necessary to move data between user buffers and PVFS servers. http://csi.unmsm.edu.pe/paralelo/pvfs/desc.html

  25. Parallel File Systems Comparison

  26. Comparison of NFS vs. GPFS

  27. Topics Introduction RAID Distributed File Systems (NFS) Parallel File Systems (PVFS2) Parallel I/O Libraries (MPI-IO) Parallel File Formats (HDF5) Additional Parallel File Systems (GPFS) Summary – Materials for Test

  28. MPI-IO Overview Initially developed as a research project at the IBM T. J. Watson Research Center in 1994 Voted by the MPI Forum to be included in the MPI-2 standard (Chapter 9) The most widespread open-source implementation is ANL’s ROMIO, written by Rajeev Thakur (http://www-unix.mcs.anl.gov/romio/ ) Integrates file access with the message passing infrastructure, using similarities between send/receive and file write/read operations Allows MPI datatypes to meaningfully describe data layouts in files instead of dealing with unorganized streams of bytes Provides potential for performance optimizations through the mechanism of “hints”, collective operations on file data, or relaxation of data access atomicity Enables better file portability by offering alternative data representations

  29. MPI-IO Features (I) • Basic file manipulation (open/close, delete, space preallocation, resize, storage synchronization, etc.) • File views (define what part of a file each process can see and how it is interpreted) • Processes can view file data independently, with possible overlaps • The users may define patterns to describe data distributions both in file and in memory, including non-contiguous layouts • Permit skipping over fixed header blocks (“displacements”) • Views can be changed by tasks at any time • Data access positioning • Explicitly specified offsets (suffix “_at”) • Independent data access by each task via individual file pointers (no suffix) • Coordinated access through shared file pointer (suffix “_shared”) • Access synchronism • Blocking • Non-blocking (include split-collective operations)

  30. MPI-IO Features (II) • Access coordination • Non-collective (no additional suffix) • Collective (suffix: “_all” for most blocking calls, “_begin” and “_end” for split-collective, or “_ordered” for equivalent of shared pointer access) • File interoperability (ensures portability of data representation) • Native: for purely homogeneous environments • Internal: heterogeneous environments with implementation-defined data representation (subset of “external32”) • External32: heterogeneous environments using data representation defined by the MPI-IO standard • Optimization hints (the “info” interface) • Access style (e.g. read_once, write_once, sequential, random, etc.) • Collective buffering components (buffer and block sizes, number of target nodes) • Striping unit and factor • Chunked I/O specification • Preferred I/O devices • C, C++ and Fortran bindings

  31. MPI-IO Types Etype (elementary datatype): the unit of data access and positioning; all data accesses are performed in etype units and offsets are measured in etypes Filetype: basis for partitioning the file among processes: a template for accessing the file; may be identical to or derived from the etype Source:http://www.mhpcc.edu/training/workshop2/mpi_io/MAIN.html

  32. MPI-IO File Views A view defines the current set of data visible and accessible from an open file as an ordered set of etypes • Each process has its own view of the file, defined by: a displacement, an etype, and a filetype • Displacement: an absolute byte position relative to the beginning of file; defines where a view begins

  33. MPI-IO: File Open #include <mpi.h> ... MPI_File fh; int err; ... /* create a writable file with default parameters */ err = MPI_File_open(MPI_COMM_WORLD, "/mnt/piofs/testfile", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh); if (err != MPI_SUCCESS) { /* handle error here */ } ...

  34. MPI-IO: File Close #include <mpi.h> ... MPI_File fh; int err; ... /* open a file storing the handle in fh */ /* perform file access */ ... err = MPI_File_close(&fh); if (err != MPI_SUCCESS) { /* handle error here */ } ...

  35. MPI-IO: Set File View #include <mpi.h> ... MPI_File fh; int err; ... /* open file storing the handle in fh */ ... /* view the file as a stream of integers with no header, using native data representation */ err = MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL); if (err != MPI_SUCCESS) { /* handle error */ } ...

  36. MPI-IO: Read File with Explicit Offset #include <mpi.h> ... MPI_File fh; MPI_Status stat; int buf[3], err; ... /* open file storing the handle in fh */ ... MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL); /* read the third triad of integers from file (offset is counted in etypes) */ err = MPI_File_read_at(fh, 6, buf, 3, MPI_INT, &stat); ...

  37. MPI-IO: Write to File with Explicit Offset #include <mpi.h> ... MPI_File fh; MPI_Status stat; int err; double dt = 0.0005; ... /* open file storing the handle in fh */ ... MPI_File_set_view(fh, 0, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL); /* store timestep as the first item in file */ err = MPI_File_write_at(fh, 0, &dt, 1, MPI_DOUBLE, &stat); ...

  38. MPI-IO: Read File Collectively with Individual File Pointers #include <mpi.h> ... MPI_File fh; MPI_Status stat; int buf[20], err; ... /* open file storing the handle in fh */ ... MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL); /* read 20 integers at current file offset in every process */ err = MPI_File_read_all(fh, buf, 20, MPI_INT, &stat); ...

  39. MPI-IO: Write to File Collectively with Individual File Pointers #include <mpi.h> ... MPI_File fh; MPI_Status stat; double t; int err, rank; ... /* open file storing the handle in fh; compute t */ ... MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* interleave time values t from each process at the beginning of file */ MPI_File_set_view(fh, rank*sizeof(t), MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL); err = MPI_File_write_all(fh, &t, 1, MPI_DOUBLE, &stat); ...

  40. MPI-IO: File Seek #include <mpi.h> ... MPI_File fh; MPI_Status stat; double t; int rank; ... /* open file storing the handle in fh; compute t */ ... MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* interleave time values t from each process at the beginning of file */ MPI_File_set_view(fh, 0, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL); MPI_File_seek(fh, rank, MPI_SEEK_SET); /* offset (in etypes) first, whence last */ MPI_File_write_all(fh, &t, 1, MPI_DOUBLE, &stat); ...

  41. MPI-IO Data Access Classification Source:http://www.mpi-forum.org/docs/mpi2-report.pdf

  42. Example: Scatter to File Example created by Jean-Pierre Prost from IBM Corp.

  43. Scatter Example Source #include "mpi.h" static int buf_size = 1024; static int blocklen = 256; static char filename[] = "scatter.out"; main(int argc, char **argv) { char *buf, *p; int myrank, commsize; MPI_Datatype filetype, buftype; int length[3]; MPI_Aint disp[3]; MPI_Datatype type[3]; MPI_File fh; int mode, nbytes; MPI_Offset offset; MPI_Status status; /* initialize MPI */ MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &myrank); MPI_Comm_size(MPI_COMM_WORLD, &commsize); /* initialize buffer */ buf = (char *) malloc(buf_size); memset((void *)buf, '0' + myrank, buf_size); /* create and commit buftype */ MPI_Type_contiguous(buf_size, MPI_CHAR, &buftype); MPI_Type_commit(&buftype); /* create and commit filetype */ length[0] = 1; length[1] = blocklen; length[2] = 1; disp[0] = 0; disp[1] = blocklen * myrank; disp[2] = blocklen * commsize; type[0] = MPI_LB; type[1] = MPI_CHAR; type[2] = MPI_UB; MPI_Type_struct(3, length, disp, type, &filetype); MPI_Type_commit(&filetype); /* open file */ mode = MPI_MODE_CREATE | MPI_MODE_WRONLY;

  44. Scatter Example Source (cont.) MPI_File_open(MPI_COMM_WORLD, filename, mode, MPI_INFO_NULL, &fh); /* set file view */ offset = 0; MPI_File_set_view(fh, offset, MPI_CHAR, filetype, "native", MPI_INFO_NULL); /* write buffer to file */ MPI_File_write_at_all(fh, offset, (void *)buf, 1, buftype, &status); /* print out number of bytes written */ MPI_Get_elements(&status, MPI_CHAR, &nbytes); printf( "TASK %d ====== number of bytes written = %d ======\n", myrank, nbytes); /* close file */ MPI_File_close(&fh); /* free datatypes */ MPI_Type_free(&buftype); MPI_Type_free(&filetype); /* free buffer */ free (buf); /* finalize MPI */ MPI_Finalize(); }

  45. Data Access Optimizations Data Sieving 2-phase I/O Collective Read Implementation in ROMIO Source: http://www-unix.mcs.anl.gov/~thakur/papers/romio-coll.pdf

  46. ROMIO Scaling Examples Write Operations Read Operations Bandwidths obtained for 512³ arrays (astrophysics benchmark) on Argonne IBM SP Source: http://www-unix.mcs.anl.gov/~thakur/sio-demo/astro.html

  47. Independent vs. Collective Access Individual I/O on IBM SP Collective I/O on IBM SP Source: http://www-unix.mcs.anl.gov/~thakur/sio-demo/upshot.html

  48. Topics Introduction RAID Distributed File Systems (NFS) Parallel File Systems (PVFS2) Parallel I/O Libraries (MPI-IO) Parallel File Formats (HDF5) Additional Parallel File Systems (GPFS) Summary – Materials for Test

  49. Introduction to HDF5 • Acronym for Hierarchical Data Format: a portable, freely distributable, and well-supported library, file format, and set of utilities to manipulate it • Explicitly designed for use with scientific data and applications • The initial HDF version was created at NCSA/University of Illinois at Urbana-Champaign in 1988 • The first revision in widespread use was HDF4 • Main HDF features include: • Versatility: supports different data models and associated metadata • Self-describing: allows an application to interpret the structure and contents of a file without any extraneous information • Flexibility: permits mixing and grouping various objects together in one file in a user-defined hierarchy • Extensibility: accommodates new data models, added both by the users and developers • Portability: can be shared across different platforms without preprocessing or modifications • HDF5 is the most recent incarnation of the format, adding support for new type and data models, parallel I/O, and streaming, and removing a number of existing restrictions (maximum file size, number of objects per file, flexibility of type use, storage management configurability, etc.), as well as improving performance

  50. HDF5 File Layout Low-level organization User’s view Major object classes: groups and datasets Namespace resembles a file system directory hierarchy (groups ≡ directories, datasets ≡ files) Alias creation is supported through links (both soft and hard) Mounting of sub-hierarchies is possible