**Optimizing Parallel I/O for Large Datasets: Strategies and Tools**

The Parallel and Grid I/O Perspective
• MPI, MPI-IO, NetCDF, and HDF5 are in common use
• Multi-TB datasets are also common
• Testbeds are needed for software at scale

Topics for Discussion
• NetCDF
• Other applications (leverage)
• Read (parallel analysis tools)
• PVFS opportunities
• TB in your office
• Scalability Testbed
• Where do you test at scale?
• Application log files
• Real app log files at > 1 GB
• Other
• Can we quantify applications' needs?

PVFS Peak Write Performance
• Using compute nodes for storage in these tests
• Peak at around 25-30 Mbytes/sec per I/O server
• Clients cannot maintain this to disk
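To put the per-server figure in context, the following is a back-of-the-envelope sketch in Python of what 25-30 Mbytes/sec per I/O server would mean in aggregate. It assumes perfectly linear scaling across servers, which the slide does not claim, and the server count of 16 is a hypothetical value chosen only for illustration.

```python
def aggregate_write_rate(num_servers, per_server_mb_s):
    """Ideal aggregate rate in MB/s, assuming perfect linear scaling."""
    return num_servers * per_server_mb_s


def hours_to_write(dataset_tb, num_servers, per_server_mb_s):
    """Hours to write dataset_tb terabytes (1 TB taken as 10**6 MB)."""
    dataset_mb = dataset_tb * 1_000_000
    seconds = dataset_mb / aggregate_write_rate(num_servers, per_server_mb_s)
    return seconds / 3600


if __name__ == "__main__":
    # 16 servers is a hypothetical configuration, not one from the slides.
    for rate in (25, 30):
        print(rate, aggregate_write_rate(16, rate), hours_to_write(1, 16, rate))
```

Real aggregate rates would likely be lower, since the slide notes that clients cannot maintain the peak rate to disk.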

Performance Visualization with Jumpshot
(Diagram labels: Processes, Logfile, Jumpshot, Display)
• For detailed analysis of parallel program behavior, timestamped events are collected into a log file during the run.
• A separate display program (Jumpshot) aids the user in conducting a post-mortem analysis of program behavior.
• Log files can become large (>1 GB), making it impossible to inspect the entire program at once.
• The FLASH Project motivated an indexed file format (SLOG) that uses a preview to select a time of interest and quickly display an interval.
• We collaborated with IBM and LLNL to collect SLOG files directly from AIX trace records and display traces from multithreaded programs.
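The idea of an indexed log with a preview can be sketched in a few lines of Python. This is a toy illustration only: the class `IndexedLog`, its frame size, and its methods are hypothetical and do not reflect the actual SLOG file layout. Events are grouped into fixed-size frames, a small index of frame start times supports the preview, and an interval query touches only the frames that can overlap the requested time range.

```python
import bisect


class IndexedLog:
    """Toy time-indexed event log (hypothetical; not the real SLOG format)."""

    def __init__(self, events, frame_size=4):
        # events: iterable of (timestamp, label) tuples
        ordered = sorted(events, key=lambda e: e[0])
        self.frames = [ordered[i:i + frame_size]
                       for i in range(0, len(ordered), frame_size)]
        # Index: the first timestamp of every frame.
        self.frame_starts = [frame[0][0] for frame in self.frames]

    def preview(self):
        """Return (start_time, end_time, event_count) for each frame."""
        return [(f[0][0], f[-1][0], len(f)) for f in self.frames]

    def interval(self, t0, t1):
        """Return events with t0 <= timestamp <= t1."""
        # Last frame that starts strictly before t0 may still hold events >= t0.
        first = max(bisect.bisect_left(self.frame_starts, t0) - 1, 0)
        hits = []
        for frame in self.frames[first:]:
            if frame[0][0] > t1:
                break  # every later frame starts after the interval ends
            hits.extend(e for e in frame if t0 <= e[0] <= t1)
        return hits
```

A viewer built this way would show `preview()` first, let the user pick a time of interest, and then call `interval()` for only that window.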

Chiba City Scalability Testbed http://www.mcs.anl.gov/chiba/

Notes
• FAQ on Parallel I/O
• Include performance graphs, tutorial links
• Interaction with P2 (Data Mining and Access Pattern Discovery)
• Parallel NetCDF (P2 as an application group)
• Managing datasets of NetCDF files
• Collect log files of application I/O
• Explore use of WAN FTP for Grid I/O
• Remote I/O through MPI-IO interface
• PVFS Clusters for TB dataset experimentation
• Close with John Drake on parallel NetCDF for Climate

SC02 Demo
• Use Parallel NetCDF over MPI-IO over PVFS to access dataset
• Extract time series from collection of files
• Parallel reads as well as writes
• New feature: handle dynamically changing datasets
• Observe progress of running application
• Perform data analysis and visualization
• Contrast with nonparallel approach
• Prototype on Chiba scalability testbed at ANL
• Bonus: collect log files of I/O behavior and show analysis and visualizations of log files
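One way to picture the time-series extraction is as a block decomposition of the file collection across readers. The sketch below is plain Python and simulates the ranks in a loop; it does not use the Parallel NetCDF or MPI-IO APIs. The function names and the dictionary stand-in for a NetCDF file are assumptions made for illustration.

```python
def block_range(n_items, n_ranks, rank):
    """Return (start, count) of the contiguous block owned by `rank`."""
    base, extra = divmod(n_items, n_ranks)
    start = rank * base + min(rank, extra)
    count = base + (1 if rank < extra else 0)
    return start, count


def extract_time_series(files, variable, n_ranks):
    """Simulate n_ranks readers, each pulling `variable` from its block of files.

    `files` is a list of dicts standing in for one NetCDF file per time step.
    """
    per_rank = []
    for rank in range(n_ranks):  # a real run would use one process per rank
        start, count = block_range(len(files), n_ranks, rank)
        per_rank.append([f[variable] for f in files[start:start + count]])
    # Concatenating in rank order restores time order (a gather step in MPI terms).
    return [value for part in per_rank for value in part]
```

The same `(start, count)` pair is the kind of argument a rank would pass to a parallel read of its own slice, which is why the decomposition is kept as a separate function.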

Demo Steps
• Select variable from collection of files, write a new NetCDF file
• Illustrates fast I/O
• (address open performance for collections of files)
• Perform PCA
• Illustrates algorithmically efficient methods
• Visualize at each time step
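The slides do not say which PCA method the demo used. As a minimal sketch of the PCA step, the Python below estimates only the leading principal component by power iteration on the sample covariance matrix; this technique and the function name `first_principal_component` are choices made here for illustration, not the demo's implementation.

```python
import math


def first_principal_component(rows, iterations=200):
    """Estimate the leading principal component of `rows` by power iteration.

    rows: list of equal-length numeric lists (one observation per row).
    Returns (unit_vector, variance_along_it).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Starting guess; assumed not orthogonal to the leading component.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # data has no variance in the direction of v
        v = [x / norm for x in w]
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                   for i in range(d))
    return v, variance
```

For an extracted time series, each row would be one time step and each column one grid point or variable; a full PCA would repeat the process after removing each component found.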

Vision for the Future
• Databases and parallel I/O integration
• Data representations for standard file formats that provide better performance for typical access patterns (post NetCDF/HDF)
• Transparent parallel I/O to/from everywhere (grid transparent, file system hierarchy transparent)
