
Hiding Periodic I/O Costs in Parallel Applications






  1. Hiding Periodic I/O Costs in Parallel Applications Xiaosong Ma Department of Computer Science University of Illinois at Urbana-Champaign Spring 2003

  2. Roadmap • Introduction • Active buffering: hiding recurrent output cost • Ongoing work: hiding recurrent input cost • Conclusions

  3. Introduction • Fast-growing technology propels high-performance applications • Scientific computation • Parallel data mining • Web data processing • Games, movie graphics • Individual components’ growth uncoordinated • Manual performance tuning needed

  4. We Need Adaptive Optimization • Flexible and automatic performance optimization desired • Efficient high-level buffering and prefetching for parallel I/O in scientific simulations

  5. Scientific Simulations • Important • Detail and flexibility • Save money and lives • Challenging • Multi-disciplinary • High performance crucial

  6. Parallel I/O in Scientific Simulations • Write-intensive • Collective and periodic • “Poor stepchild” • Bottleneck-prone • Existing collective I/O focused on data transfer [Timeline: computation phases alternating with periodic I/O phases]

  7. My Contributions • Idea: I/O optimizations in larger scope • Parallelism between I/O and other tasks • Individual simulation’s I/O need • I/O related self-configuration • Approach: hide the I/O cost • Results • Publications, technology transfer, software

  8. Roadmap • Introduction • Active buffering: hiding recurrent output cost • Ongoing work: hiding recurrent input cost • Conclusions

  9. Latency Hierarchy on Parallel Platforms • Along the path of data transfer: local memory access → inter-processor communication → disk I/O → wide-area transfer • Smaller throughput at each step • Lower parallelism and less scalable

  10. Basic Idea of Active Buffering • Purpose: maximize overlap between computation and I/O • Approach: buffer data as early as possible
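The core idea on this slide can be sketched in a few lines (a hypothetical stand-in, not the Panda implementation): the compute loop hands each snapshot to an in-memory queue, and a background thread drains the queue to disk while computation continues, so the slow write is hidden behind the next computation phase.

```python
import os
import queue
import threading

def run_simulation(num_steps, snapshot_bytes, out_path):
    """Buffer each snapshot in memory and write it out in the background,
    so disk I/O overlaps with the next computation phase."""
    buffers = queue.Queue()

    def writer():
        # Background I/O: drain buffered snapshots to disk.
        with open(out_path, "wb") as f:
            while True:
                block = buffers.get()
                if block is None:        # sentinel: simulation finished
                    break
                f.write(block)

    t = threading.Thread(target=writer)
    t.start()
    for step in range(num_steps):
        # Stand-in for a computation phase producing one snapshot.
        data = bytes([step % 256]) * snapshot_bytes
        buffers.put(data)                # cheap memory copy; I/O is hidden
    buffers.put(None)
    t.join()
    return os.path.getsize(out_path)
```

From the compute loop's point of view, each snapshot costs only a memory copy; the disk write happens concurrently.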

  11. Challenges • Accommodate multiple I/O architectures • No assumption on buffer space • Adaptive • Buffer availability • User request patterns

  12. Roadmap • Introduction • Active buffering: hiding recurrent output cost • With client-server I/O architecture [IPDPS ’02] • With server-less architecture • Ongoing work: hiding recurrent input cost • Related work and future work • Conclusions

  13. Client-Server I/O Architecture [Diagram: compute processors → I/O servers → file system]

  14. Client State Machine [Diagram: on entering the collective write routine the client prepares, then buffers data while buffer space is available; once out of buffer space (overflow) it sends blocks until all data are sent; it exits when buffering completes with no overflow or all overflow data have been sent]
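The client-side behavior on this slide can be sketched as follows (function and callback names are illustrative, not the library's API): buffer blocks while space remains, and fall back to sending overflow blocks synchronously.

```python
def client_collective_write(blocks, buffer_capacity, send):
    """Sketch of the client state machine: buffer while space remains;
    once the buffer would overflow, send the block instead."""
    buffered, used = [], 0
    for block in blocks:
        if used + len(block) <= buffer_capacity:
            buffered.append(block)       # state: buffer data (cheap)
            used += len(block)
        else:
            send(block)                  # state: send a block (overflow)
    return buffered                      # drained later in the background
```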

  15. Server State Machine [Diagram: at initialization the server allocates buffers, then idle-listens; with data to receive and enough buffer space it receives a block; out of buffer space it requests a fetch; with data to write it writes a block; on a fetch request it fetches and writes blocks; while all buffers are busy it busy-listens; it exits once all data are received and written and the exit message arrives]
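A simplified, single-threaded sketch of the server loop (message format and names assumed for illustration): receive blocks while buffer space suffices, spill to disk when full, and drain remaining buffers on exit.

```python
def server_loop(messages, buffer_capacity, disk):
    """Sketch of the server state machine over a message stream."""
    buffered, used = [], 0
    for kind, payload in messages:       # idle-listen: wait for a message
        if kind == "data":
            if used + len(payload) <= buffer_capacity:
                buffered.append(payload) # receive a block into the buffer
                used += len(payload)
            else:
                disk.append(payload)     # out of buffer space: write a block
        elif kind == "exit":
            break
    for block in buffered:               # drain remaining buffers to disk
        disk.append(block)
    return disk
```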

  16. Maximize Apparent Throughput • Ideal apparent throughput per server: T_ideal = D_total / (D_c-buffered / T_mem-copy + D_c-overflow / T_msg-passing + D_s-overflow / T_write) • More expensive data transfer only becomes visible when overflow happens • Efficiently masks the difference in write speeds
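The formula on this slide (reassembled from the fragments shown; symbol names are an assumption) divides the total data by the apparent time each portion costs the compute processors: buffered data at memory-copy speed, client overflow at message-passing speed, server overflow at write speed.

```python
def ideal_apparent_throughput(d_buffered, d_c_overflow, d_s_overflow,
                              t_memcpy, t_msg, t_write):
    """T_ideal = D_total / sum of apparent times per data portion."""
    apparent_time = (d_buffered / t_memcpy       # buffered: memory copy
                     + d_c_overflow / t_msg      # client overflow: messaging
                     + d_s_overflow / t_write)   # server overflow: disk write
    return (d_buffered + d_c_overflow + d_s_overflow) / apparent_time
```

With no overflow, the apparent throughput equals the memory-copy speed; overflow pulls it down toward the slower transfer speeds.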

  17. Write Throughput without Overflow • Panda Parallel I/O library • SGI Origin 2000, SHMEM • Per client: 16MB output data per snapshot, 64MB buffer • Two servers, each with 256MB buffer

  18. Write Throughput with Overflow • Panda Parallel I/O library • SGI Origin 2000, SHMEM, MPI • Per client: 96MB output data per snapshot, 64MB buffer • Two servers, each with 256MB buffer

  19. Give Feedback to Application • “Softer” I/O requirements • Parallel I/O libraries have been passive • Active buffering allows I/O libraries to take a more active role • Find optimal output frequency automatically
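One plausible heuristic for the "find optimal output frequency" point (entirely a sketch; the slide does not give the actual policy): pick the smallest snapshot interval, in timesteps, such that the background write of one snapshot finishes within the computation between snapshots, so buffering never becomes visible.

```python
def max_snapshot_frequency(compute_time_per_step, snapshot_bytes,
                           write_bandwidth, buffer_bytes):
    """Hypothetical heuristic: smallest snapshot interval (in steps) whose
    computation time fully hides one snapshot's background write."""
    if snapshot_bytes > buffer_bytes:
        return None                      # snapshot cannot be buffered at all
    drain_time = snapshot_bytes / write_bandwidth
    steps = max(1, -(-drain_time // compute_time_per_step))  # ceiling division
    return int(steps)
```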

  20. Server-side Active Buffering [Diagram: the same server state machine as slide 15]

  21. Performance with Real Applications • Application overview – GENX • Large-scale, multi-component, detailed rocket simulation • Developed at Center for Simulation of Advanced Rockets (CSAR), UIUC • Multi-disciplinary, complex, and evolving • Providing parallel I/O support for GENX • Identification of parallel I/O requirements [PDSECA ’03] • Motivation and test case for active buffering

  22. Overall Performance of GEN1 • SDSC IBM SP (Blue Horizon) • 64 clients, 2 I/O servers with AB • 160MB output data per snapshot (in HDF4)

  23. Aggregate Write Throughput in GEN2 • LLNL IBM SP (ASCI Frost) • 1 I/O server per 16-way SMP node • Write in HDF4

  24. Scientific Data Migration • Output data need to be moved over the Internet • Online migration • Extend active buffering to migration • Local storage becomes another layer in the buffer hierarchy [Timeline: computation phases alternating with periodic I/O phases]

  25. I/O Architecture with Data Migration [Diagram: compute processors → I/O servers → file system, then across the Internet to a workstation running a visualization tool]

  26. Active Buffering for Data Migration • Avoid unnecessary local I/O • Hybrid migration approach: memory-to-memory transfer or disk staging • Combined with data compression [ICS ’02] • Self-configuration for online visualization
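The hybrid approach can be sketched as a simple decision (the threshold and names are assumptions, not the published policy): migrate straight from memory while the snapshot still fits in buffers, and stage to local disk only when it does not.

```python
def plan_migration(snapshot_bytes, free_buffer_bytes):
    """Sketch of the hybrid migration choice."""
    if snapshot_bytes <= free_buffer_bytes:
        return "memory-to-memory"   # avoid unnecessary local I/O
    return "disk-staging"           # buffers overflow: stage locally first
```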

  27. Roadmap • Introduction • Active buffering: hiding recurrent output cost • With client-server I/O architecture • With server-less architecture [IPDPS ’03] • Ongoing work: hiding recurrent input cost • Conclusions

  28. Server-less I/O Architecture [Diagram: compute processors, each with its own I/O thread, accessing the file system directly]

  29. Making ABT Transparent and Portable • Unchanged interfaces • High-level and file-system independent [Diagram: ABT as an ADIO layer above HFS, NFS, NTFS, PFS, PVFS, UFS, XFS] • Design and evaluation [IPDPS ’03] • Ongoing transfer to ROMIO
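The "unchanged interfaces" point can be illustrated with a tiny wrapper (names are illustrative; this is not ROMIO's actual ADIO interface): callers keep the ordinary blocking `write()` signature, while an I/O thread performs the real writes behind it.

```python
import queue
import threading

class BufferedWriter:
    """Sketch of ABT: same write() interface, background I/O thread."""
    def __init__(self, f):
        self._f = f
        self._q = queue.Queue()
        self._t = threading.Thread(target=self._drain)
        self._t.start()

    def _drain(self):
        while True:
            block = self._q.get()
            if block is None:           # sentinel from close()
                break
            self._f.write(block)        # real file-system write, off-thread

    def write(self, data):              # unchanged, non-blocking interface
        self._q.put(bytes(data))

    def close(self):
        self._q.put(None)
        self._t.join()
```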

  30. Active Buffering vs. Asynchronous I/O

  31. Roadmap • Introduction • Active buffering: hiding recurrent output cost • Ongoing work: hiding recurrent input cost • Conclusions

  32. I/O in Visualization • Periodic reads • Dual modes of operation • Interactive • Batch-mode • Harder to overlap reads with computation [Timeline: computation phases alternating with periodic I/O phases]

  33. Efficient I/O Through Data Management • In-memory database of datasets • Manage buffers or values • Hub for I/O optimization • Prefetching for batch mode • Caching for interactive mode • User-supplied read routine
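The in-memory database of datasets described above might look like the following sketch (the interface is assumed for illustration): an LRU cache serves repeated interactive reads, and a schedule-driven `prefetch` warms the cache for batch mode using the user-supplied read routine.

```python
from collections import OrderedDict

class DatasetManager:
    """Sketch of the dataset hub: LRU caching plus batch prefetching."""
    def __init__(self, read_fn, capacity):
        self._read = read_fn             # user-supplied read routine
        self._cache = OrderedDict()
        self._capacity = capacity

    def get(self, name):
        if name in self._cache:
            self._cache.move_to_end(name)        # cache hit: keep it hot
        else:
            self._cache[name] = self._read(name)
            if len(self._cache) > self._capacity:
                self._cache.popitem(last=False)  # evict least recently used
        return self._cache[name]

    def prefetch(self, schedule):
        for name in schedule:            # batch mode: timesteps known ahead
            self.get(name)
```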

  34. Related Work • Overlapping I/O with computation • Replacing synchronous calls with async calls [Agrawal et al. ICS ’96] • Threads [Dickens et al. IPPS ’99, More et al. IPPS ’97] • Automatic performance optimization • Optimization with performance models [Chen et al. TSE ’00] • Graybox optimization [Arpaci-Dusseau et al. SOSP ’01]

  35. Roadmap • Introduction • Active buffering: hiding recurrent output cost • Ongoing work: hiding recurrent input cost • Conclusions

  36. Conclusions • If we can’t shrink it, hide it! • Performance optimization can be done • more actively • at a higher level • in a larger scope • Make I/O part of data management

  37. References • [IPDPS ’03] Xiaosong Ma, Marianne Winslett, Jonghyun Lee and Shengke Yu, Improving MPI-IO Output Performance with Active Buffering Plus Threads, 2003 International Parallel and Distributed Processing Symposium • [PDSECA ’03] Xiaosong Ma, Xiangmin Jiao, Michael Campbell and Marianne Winslett, Flexible and Efficient Parallel I/O for Large-Scale Multi-component Simulations, The 4th Workshop on Parallel and Distributed Scientific and Engineering Computing with Applications • [ICS ’02] Jonghyun Lee, Xiaosong Ma, Marianne Winslett and Shengke Yu, Active Buffering Plus Compressed Migration: An Integrated Solution to Parallel Simulations’ Data Transport Needs, The 16th ACM International Conference on Supercomputing • [IPDPS ’02] Xiaosong Ma, Marianne Winslett, Jonghyun Lee and Shengke Yu, Faster Collective Output through Active Buffering, 2002 International Parallel and Distributed Processing Symposium
