Collective Buffering: Improving Parallel I/O Performance


Presentation Transcript


  1. Collective Buffering: Improving Parallel I/O Performance By Bill Nitzberg and Virginia Lo

  2. Outline • Introduction • Concepts • Collective parallel I/O algorithms • Collective buffering experiments • Conclusion • Questions

  3. Introduction • Existing parallel I/O systems evolved directly from I/O systems for serial machines • Serial I/O systems are heavily tuned for: • Sequential, large accesses with limited file sharing between processes • A high degree of both spatial and temporal locality

  4. Introduction (cont.) • This paper presents a set of algorithms known as Collective Buffering algorithms • These algorithms seek to improve I/O performance on distributed memory machines by exploiting global knowledge of the I/O operations

  5. Concepts • Global data structure • Global data structure is the logical view of the data from the application’s point of view • Scientific applications generally use global data structures consisting of arrays distributed in one, two, or three dimensions

  6. Concepts (cont.) • Data distribution • The global data structure is distributed among node memories by cutting it into data chunks • The HPF BLOCK distribution partitions the global data structure into P equally sized pieces • The HPF CYCLIC distribution divides the global data structure into small pieces of a given block size and deals these pieces out to the P nodes in a round-robin fashion
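
A minimal sketch of how HPF BLOCK and CYCLIC(b) assign global array indices to P nodes. This is not code from the paper; the helper names and the example sizes (N = 16, P = 4, block size 2) are illustrative assumptions.

#include <stdio.h>

/* BLOCK: the array is cut into P contiguous pieces of ceil(N/P) elements. */
static int block_owner(long i, long N, int P)
{
    long chunk = (N + P - 1) / P;      /* elements per node, rounded up */
    return (int)(i / chunk);
}

/* CYCLIC(b): blocks of b elements are dealt out round-robin to the P nodes. */
static int cyclic_owner(long i, long b, int P)
{
    return (int)((i / b) % P);
}

int main(void)
{
    long N = 16;
    int  P = 4;
    for (long i = 0; i < N; i++)
        printf("i=%2ld  BLOCK -> node %d   CYCLIC(2) -> node %d\n",
               i, block_owner(i, N, P), cyclic_owner(i, 2, P));
    return 0;
}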

  7. Concepts (cont.)

  8. Concepts (cont.) • File layout • File layout is another form of data distribution • The file represents a linearization of the global data structure, such as the row-major ordering of a three-dimensional array • This linearization is called the canonical file • The file is distributed among the I/O nodes
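
Since the canonical file is the row-major linearization of the global array, the byte offset of element (i,j,k) of an nx x ny x nz array of elem-byte values is ((i*ny + j)*nz + k)*elem. A small illustrative sketch (not code from the paper):

#include <stdio.h>

/* Byte offset of element (i,j,k) in the canonical (row-major) file. */
static long canonical_offset(long i, long j, long k,
                             long ny, long nz, long elem)
{
    return ((i * ny + j) * nz + k) * elem;
}

int main(void)
{
    /* element (2,1,3) of a 4 x 5 x 6 array of 8-byte doubles */
    printf("byte offset = %ld\n", canonical_offset(2, 1, 3, 5, 6, 8));
    return 0;
}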

  9. Concepts (cont.)

  10. Collective parallel I/O algorithms • Naïve algorithm • The Naïve algorithm treats parallel I/O the same as workstation I/O • The order of writes depends on the data layout in each node's memory, which has no relation to the layout of data on disk • The unit of data transferred in each I/O operation is the data block: the smallest unit of local data that is contiguous with respect to the canonical file

  11. Collective parallel I/O algorithms (cont.) • Naïve algorithm (cont.) • The data block is typically very small, and its size is unrelated to the size of a file block because the data distribution and file layout parameters generally do not match • The overall effects are: • The network is flooded with many small messages • Messages arrive at the I/O nodes in an uncoordinated fashion, resulting in highly inefficient disk writes
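
A hedged sketch of the Naïve approach expressed in modern MPI-IO terms (not the authors' code; the file name, array sizes, and the (*, BLOCK) column distribution are assumptions chosen for illustration). Each node issues many small independent writes at strided offsets of the canonical file, which is exactly the behavior the paper criticizes.

#include <mpi.h>
#include <stdlib.h>

#define NX 1024
#define NY 1024

int main(int argc, char **argv)
{
    int rank, P;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* (*, BLOCK) by columns: each node owns cols contiguous columns of
     * every row, so its contiguous data block is only cols doubles long. */
    int cols = NY / P;                    /* assume P divides NY */
    double *local = malloc((size_t)NX * cols * sizeof(double));
    for (int n = 0; n < NX * cols; n++) local[n] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "naive.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* One small independent write per row: NX messages of cols*8 bytes
     * each, arriving at the I/O nodes in no particular order. */
    for (int i = 0; i < NX; i++) {
        MPI_Offset off = ((MPI_Offset)i * NY + rank * cols) * sizeof(double);
        MPI_File_write_at(fh, off, &local[i * cols], cols, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    free(local);
    MPI_Finalize();
    return 0;
}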

  12. Collective parallel I/O algorithms (cont.)

  13. Collective parallel I/O algorithms (cont.) • Collective buffering algorithm • This method rearranges the data on compute nodes prior to issuance of I/O operations to minimize the number of disk operations • The permutation can be performed “in place” where nodes transpose data among them self • It can also be performed “on auxiliary nodes” where the compute nodes transpose the data by sending it to a set of auxiliary buffering nodes
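
For contrast with the Naïve sketch above, the same write expressed as a single collective MPI-IO call (again not the authors' code). Modern implementations such as ROMIO realize such a call with a two-phase scheme closely related to collective buffering: the data is first permuted among (a subset of) the nodes into file order, then written to disk in large contiguous pieces.

#include <mpi.h>
#include <stdlib.h>

#define NX 1024
#define NY 1024

int main(int argc, char **argv)
{
    int rank, P;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    int cols = NY / P;                    /* assume P divides NY */
    double *local = malloc((size_t)NX * cols * sizeof(double));
    for (int n = 0; n < NX * cols; n++) local[n] = rank;

    /* Describe this node's piece of the canonical (row-major) file. */
    int sizes[2]    = { NX, NY };
    int subsizes[2] = { NX, cols };
    int starts[2]   = { 0, rank * cols };
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C,
                             MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "collective.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* One collective call; the permutation and the large contiguous disk
     * writes are handled inside the library. */
    MPI_File_write_all(fh, local, NX * cols, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(local);
    MPI_Finalize();
    return 0;
}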

  14. Collective parallel I/O algorithms (cont.)

  15. Collective parallel I/O algorithms (cont.)

  16. Collective parallel I/O algorithms (cont.) • Four techniques are developed and evaluated: • 1. All compute nodes are used to permute the data into a simple HPF BLOCK intermediate distribution in a single step • 2. The first technique is refined by realistically limiting the amount of buffer space and using an intermediate distribution that matches the file layout

  17. Collective parallel I/O algorithms (cont.) • Four techniques (cont.): • 3. The third technique uses an HPF CYCLIC intermediate distribution • 4. The fourth technique uses scatter/gather hardware to eliminate the latency-dominated overhead of the permutation phase
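
As a rough modern analogue (an assumption about today's tuning knobs, not the paper's mechanism), the amount of collective buffer space and the number of buffering nodes can be suggested to an MPI-IO implementation through the reserved hints cb_buffer_size and cb_nodes; implementations are free to ignore them. The file name and hint values below are illustrative.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_buffer_size", "1048576"); /* ~1 MB of collective buffer */
    MPI_Info_set(info, "cb_nodes", "6");             /* number of buffering nodes  */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "tuned.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    /* ... collective writes as in the previous sketch ... */
    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}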

  18. Collective buffering experiments • Experimental systems: • The Paragon consists of 224 processing nodes connected in a 16x32 mesh • Applications space-share 208 compute nodes with 32 MB of memory each • Nine I/O nodes, each with one SCSI-1 RAID-3 disk array consisting of 5 disks of 2 gigabytes each • The parallel file system, PFS, is configured to use 6 of the 9 I/O nodes

  19. Collective buffering experiments (cont.) • Experimental systems (cont.): • The SP2 consists of 160 nodes; each node is an IBM RS6000/590 with 128 MB of memory and a SCSI-1-attached 2 GB disk • The parallel file system, the IBM AIX Parallel I/O File System (PIOFS), is configured with 8 I/O nodes (semi-dedicated servers) and 150 compute nodes

  20.-30. Collective buffering experiments (cont.) [figure slides; images not included in the transcript]

  31. Conclusion • Collective buffering significantly improves Naïve parallel I/O performance, by up to two orders of magnitude for small data block sizes • Peak performance can be obtained with minimal buffer space (approximately 1 megabyte per I/O node) • Performance depends on the intermediate distribution (by up to a factor of 2)

  32. Conclusion (cont.) • There is no single intermediate distribution which provides the best performance for all cases, but a few come close • Collective buffering with scatter/gather can potentially deliver peak performance for all data block sizes.

  33. Questions • What are the advantages and disadvantages of the Naïve algorithm? • What is collective buffering, and how can this technique improve parallel I/O performance?
