
I/O Strategies for the T3E


Presentation Transcript


  1. I/O Strategies for the T3E Jonathan Carter NERSC User Services

  2. T3E Overview • T3E is a set of Processing Elements (PE) connected by a fast 3D torus. • PEs do not have local disk • All PEs access all filesystems equivalently • Path for I/O generally looks like: • user buffer space • system buffer space • I/O device buffer space

  3. Filesystems • /usr/tmp • fast • subject to 14-day purge, not backed up • check quota with quota -s /usr/tmp (usually 75 GB and 6000 inodes) • $TMPDIR • fast • purged at end of job or session • shares quota with /usr/tmp • $HOME • slower • permanent, backed up • check quota with quota (usually 2 GB and 3500 inodes)

  4. Types of I/O • Language I/O: Fortran or C (ANSI or POSIX) • Cray FFIO library (can be used from Fortran or C) • MPI I/O • Cray extensions to Fortran and C I/O (mostly for compatibility with PVP systems)

  5. I/O Strategies - Exclusive access files • Each PE reads and writes to a separate file • Language I/O • MPI I/O • Increase language I/O performance with FFIO library (C must use POSIX style calls)
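
A minimal sketch of this pattern in Fortran, assuming MPI supplies the PE number; the file name part.NNNN and the unit number are illustrative choices, not from the slides:

  program exclusive_io
    implicit none
    include 'mpif.h'
    integer :: me, ierr
    real :: a(1000)
    character(len=9) :: fname
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
    a = real(me)                           ! PE-specific data
    write(fname, '(a,i4.4)') 'part.', me   ! part.0000, part.0001, ...
    open(10, file=fname, form='unformatted', status='replace')
    write(10) a     ! ordinary Fortran I/O; an FFIO layer such as bufa
    close(10)       ! can be added later via assign, with no code change
    call MPI_FINALIZE(ierr)
  end program exclusive_io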

  6. I/O Strategies - Communication and I/O PE • One PE coordinates reading and writing and communicates data back and forth between other PEs via message passing • Language I/O • MPI I/O • Increase language I/O performance with FFIO library
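
A sketch of the I/O-PE pattern, assuming MPI_GATHER as the message-passing step (the slide does not prescribe a particular mechanism):

  program io_pe
    implicit none
    include 'mpif.h'
    integer, parameter :: m = 1000
    integer :: me, nprocs, ierr
    real :: b(m)
    real, allocatable :: whole(:)
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    b = real(me)
    if (me == 0) then
      allocate(whole(m*nprocs))
    else
      allocate(whole(1))     ! receive buffer only significant on PE 0
    end if
    call MPI_GATHER(b, m, MPI_REAL, whole, m, MPI_REAL, 0, &
                    MPI_COMM_WORLD, ierr)
    if (me == 0) then        ! only the I/O PE touches the filesystem
      open(10, file='vector', form='unformatted', status='replace')
      write(10) whole
      close(10)
    end if
    call MPI_FINALIZE(ierr)
  end program io_pe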

  7. I/O Strategies - Shared files • All PEs read and write the same file simultaneously • Language I/O with FFIO library global layer • MPI I/O • Language I/O with FFIO library global layer and Cray extensions for additional flexibility

  8. Cray FFIO library • FFIO is a set of I/O layers tuned for different I/O characteristics • Buffering of data (configurable size) • Caching of data (configurable size) • Available to regular Fortran I/O without reprogramming • Available for C through POSIX-like calls, e.g. ffopen, ffwrite

  9. The assign command • the assign command controls • which FFIO layer is active • striping across multiple partitions • lots more • scope of assign • File name • Fortran unit number • File type (e.g. all sequential unformatted files)

  10. assign Examples • read and write to file restart.file from all PEs by using the FFIO library global layer assign -F global:128:2 f:restart.file • use the FFIO library bufa layer to improve performance for file opened on Fortran unit 10 assign -F bufa:128:2 u:10 • use the FFIO library bufa layer to improve performance for all unformatted sequential Fortran files assign -F bufa:128:2 g:su

  11. assign Examples • To see all active assigns assign -V • To remove all active assigns assign -R

  12. bufa FFIO layer • bufa is an asynchronous buffering layer • performs read-ahead, write-behind • specify buffer size with -F bufa:bs:nbufs where bs is the buffer size in units of 4Kbyte blocks, and nbufs is the number of buffers • buffer space increases your application's memory requirements

  13. global FFIO layer • global is a caching and buffering layer which enables multiple PEs to read and write to the same file • if one PE has already read the data, an additional read request from another PE will result in a remote memory copy • file open is a synchronizing event • By default, all PEs must open a global file; this can be changed by calling GLIO_GROUP_MPI(comm) • specify buffer size with -F global:bs:nbufs where bs is the buffer size in units of 4Kbyte blocks, and nbufs is the number of buffers per PE

  14. File positioning with the global FFIO layer • Positioning of a read or write is your responsibility • File pointers are private • Fortran • Use a direct access file, and read/write(rec=num) • Use Cray extensions setpos and getpos to position file pointer (not portable) • C • Use ffseek
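
A sketch of direct-access positioning under the global layer; the file name is illustrative, and inquire(iolength=...) is used because the units of recl are processor dependent:

  ! beforehand, in the shell:  assign -F global:128:2 f:shared.dat
  program global_da
    implicit none
    include 'mpif.h'
    integer, parameter :: m = 1000
    integer :: me, ierr, rl
    real :: b(m)
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
    b = real(me)
    inquire(iolength=rl) b    ! record length in processor-dependent units
    ! every PE opens the same file; open is a synchronizing event
    open(10, file='shared.dat', form='unformatted', access='direct', recl=rl)
    write(10, rec=me+1) b     ! the record number, not a shared pointer,
    close(10)                 ! positions each PE's slice
    call MPI_FINALIZE(ierr)
  end program global_da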

  15. FFIO considerations • The examples above use an unblocked file structure; normal Fortran files are blocked. To read the file without the global or bufa layers you must use assign -s unblocked f:filename • bufa and global do not allow backspace or skipping over a partially read record. You can allow this behavior by using the cos layer in addition to bufa or global, but then setpos doesn't work. assign -F cos:128,bufa:128:2 f:filename

  16. More on FFIO • There are many other FFIO layers, some pretty obscure • cache and cachea layers, good for random access files • man intro_ffio for a terse description • Cray Publication - Application Programmer’s I/O Guide

  17. More on assign • Many text processing options • Switch between Fortran 77 and Fortran 90 namelist • File pre-allocation • File striping

  18. Further Information • I/O on the T3E Tutorial by Richard Gerber at http://home.nersc.gov/training/tutorials • Cray Publication - Application Programmer’s I/O Guide • Cray Publication - Cray T3E Fortran Optimization Guide • man assign

  19. MPI I/O • Part of MPI-2 • Interface for High Performance Parallel I/O • data partitioning • collective I/O • asynchronous I/O • portability and interoperability

  20. MPI I/O Definitions • An MPI file is an ordered collection of MPI types. • A file may be opened individually or collectively by a group of processes • The fileview defines a template for accessing the file and is used to partition the file amongst processes

  21. Fileviews • A fileview is composed of three pieces: • a displacement (in bytes) from the beginning of the file • an elementary datatype (etype), which is the unit of data access and positioning within the file • a filetype, which defines a template for accessing the file. A filetype can contain etypes or holes of the same extent as etypes.

  22. Fileviews (cont.) • The filetype pattern is repeated, “tiling” the file • Only the non-empty slots are available to read or write
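
As a concrete check of the tiling arithmetic: with etype MPI_REAL (4 bytes on most systems) and a filetype of nprocs etypes in which process p's only non-empty slot is slot p, the i-th element seen by process p lands at byte disp + (i*nprocs + p)*4 of the file.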

  23. Fileviews (cont.) • Each process can have a different filetype [diagram: Process 0, Process 1, and Process 2 tiling the same file with different filetypes]

  24. MPI_File_set_view • Called after MPI_File_open to set fileview • MPI_File_set_view(fh, disp, etype, filetype, datarep, info) • fh is a file handle • disp, etype, and filetype define the fileview • datarep is one of “native”, “internal”, or “external32” • info is a set of hints to optimize performance

  25. MPI Info object • An info object bundles up a set of parameters
  integer finfo
  call MPI_Info_create(finfo, ierr)
  call MPI_Info_set(finfo, 'access_style', 'write_mostly', ierr)
  • MPI I/O defines a set of parameters used to help optimize I/O performance • MPI_INFO_NULL can be used instead of an info object

  26. Open and Close • MPI_File_open(comm, filename, amode, info, fh) • comm, open is collective over this communicator • filename, string or character variable • file access mode: MPI_MODE_RDONLY, MPI_MODE_RDWR etc. • info object, used to pass hints to open • file handle • MPI_File_close(fh)

  27. Utility routines • MPI_File_delete • MPI_File_set_size • MPI_File_preallocate • MPI_File_set_info

  28. Query routines • MPI_File_get_size • MPI_File_get_group • MPI_File_get_amode • MPI_File_get_info • MPI_File_get_view
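
A fragment combining a few of these calls, in the style of the later examples; the declarations and the file name 'vector' (borrowed from slide 36) are assumptions:

  integer :: fh, amode, ierr
  integer (kind=MPI_OFFSET_KIND) :: fsize
  call MPI_FILE_OPEN(MPI_COMM_WORLD, 'vector', MPI_MODE_RDONLY, &
                     MPI_INFO_NULL, fh, ierr)
  call MPI_FILE_GET_SIZE(fh, fsize, ierr)    ! current size in bytes
  call MPI_FILE_GET_AMODE(fh, amode, ierr)   ! mode the file was opened with
  call MPI_FILE_CLOSE(fh, ierr)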

  29. Data access routines • Positioning • Explicit, each call has an offset • Individual, each PE maintains an individual file pointer • Shared, the file pointer is maintained globally • Synchronism • Blocking, routine returns when complete • Non-blocking, must call a termination routine to ensure completion • Coordination • Non-collective • Collective
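
To illustrate the synchronism axis, a minimal sketch of a non-blocking individual write completed with MPI_WAIT, reusing fhv, b, and m from the later examples (but note slide 32 on the state of non-blocking support in the T3E implementation):

  integer :: request, ierr
  integer :: status(MPI_STATUS_SIZE)
  call MPI_FILE_IWRITE(fhv, b, m, MPI_REAL, request, ierr)
  ! ... overlap computation with the write in flight ...
  call MPI_WAIT(request, status, ierr)   ! termination call ensures completion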

  30. Summary of access routines • Explicit offset: MPI_File_read_at / MPI_File_write_at (blocking), MPI_File_iread_at / MPI_File_iwrite_at (non-blocking), MPI_File_read_at_all / MPI_File_write_at_all (collective) • Individual pointer: MPI_File_read / MPI_File_write, MPI_File_iread / MPI_File_iwrite, MPI_File_read_all / MPI_File_write_all • Shared pointer: MPI_File_read_shared / MPI_File_write_shared, MPI_File_iread_shared / MPI_File_iwrite_shared, MPI_File_read_ordered / MPI_File_write_ordered (collective)

  31. Summary of access routines (cont.) • MPI_File_seek • MPI_File_get_position • MPI_File_get_byte_offset • MPI_File_seek_shared (collective) • MPI_File_get_position_shared

  32. T3E Implementation • No shared file pointers • No non-blocking collective (split collective) • SPR (software problem report) filed on non-blocking read • Work in progress

  33. Examples • All the program fragments are available as working programs on the T3E • Do "module load training", then look in $EXAMPLES/mpi_io • All examples are of a distributed dot product • initialize data with random numbers • compute dot product of whole vector • write out data into a shared file • read back in and check dot product [diagram: the vector distributed across PE 0, PE 1, and PE 2]

  34. Naming convention • First letter is positioning: explicit, individual, or shared • Second letter is synchronism: blocking or non-blocking • Third letter is coordination: non-collective or collective • ebn.f90 is the explicit, blocking non-collective example • There are several “ibn” examples dealing with different fileviews

  35. Filetype Example [diagram: the file partitioned among Process 0, Process 1, and Process 2 by their filetypes]

  36. Filetype Example
  filemode = MPI_MODE_RDWR + MPI_MODE_CREATE
  call MPI_INFO_CREATE(finfo, ierr)
  call MPI_INFO_SET(finfo, 'access_style', 'write_mostly', ierr)
  call MPI_FILE_OPEN(MPI_COMM_WORLD, 'vector', filemode, &
                     finfo, fhv, ierr)
  ! 1-D subarray: m elements starting at m*me, out of m*nprocs in all
  call MPI_TYPE_CREATE_SUBARRAY(1, m*nprocs, m, m*me, &
                                MPI_ORDER_FORTRAN, MPI_REAL, mpi_fileslice, ierr)
  call MPI_TYPE_COMMIT(mpi_fileslice, ierr)   ! datatype must be committed before use
  disp = 0
  call MPI_FILE_SET_VIEW(fhv, disp, MPI_REAL, mpi_fileslice, &
                         'native', MPI_INFO_NULL, ierr)

  37. Individual, blocking, non-collective
  call MPI_FILE_WRITE(fhv, b, m, MPI_REAL, status, ierr)
  lresult = sdot(m, b, 1, b, 1)
  call MPI_REDUCE(lresult, result, 1, MPI_REAL, MPI_SUM, 0, &
                  MPI_COMM_WORLD, ierr)
  if (me.eq.0) then
    write(6,*) 'dot product: ', result
  end if
  ! zero vector and read it back in
  b = 0.0
  disp = 0
  call MPI_FILE_SEEK(fhv, disp, MPI_SEEK_SET, ierr)
  call MPI_FILE_READ(fhv, b, m, MPI_REAL, status, ierr)
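
For comparison, a sketch of the explicit, blocking, non-collective variant (the ebn.f90 case). With the subarray fileview of slide 36 in effect, explicit offsets are counted in etypes relative to each process's view, so offset 0 addresses the start of each PE's own slice; the offset declaration is an assumption about the working programs:

  integer (kind=MPI_OFFSET_KIND) :: offset
  offset = 0
  call MPI_FILE_WRITE_AT(fhv, offset, b, m, MPI_REAL, status, ierr)
  call MPI_FILE_READ_AT(fhv, offset, b, m, MPI_REAL, status, ierr)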

  38. Further Information on MPI I/O • MPI-The Complete Reference • Volume 1, The MPI Core • Volume 2, The MPI Extensions
