
Modeling and Acceleration of File-IO Dominated Parallel Workloads

Presentation Transcript


  1. Presented at Analogic Corporation July 11th, 2005 Modeling and Acceleration of File-IO Dominated Parallel Workloads Yijian Wang David Kaeli Department of Electrical and Computer Engineering Northeastern University yiwang@ece.neu.edu

  2. Important File-based I/O Workloads • Many subsurface sensing and imaging workloads involve file-based I/O • Cellular biology – in-vitro fertilization with NU biologists • Medical imaging – cancer therapy with MGH • Underwater mapping – multi-sensor fusion with Woods Hole Oceanographic Institution • Ground-penetrating radar – toxic waste tracking with Idaho National Labs

  3. The Impact of Profile-guided Parallelization on SSI Applications [Figure: air/mine/soil ground-penetrating radar scene] • Reduced the runtime of a single-body Steepest Descent Fast Multipole Method (SDFMM) application by 74% on a 32-node Beowulf cluster • Hot-path parallelization • Data restructuring • Reduced the runtime of a Monte Carlo scattered-light simulation by 98% on a 16-node Silicon Graphics Origin 2000 • Matlab-to-C compilation • Hot-path parallelization • Obtained superlinear speedup of the Ellipsoid Algorithm run on a 16-node IBM SP2 • Matlab-to-C compilation • Hot-path parallelization

  4. Limits of Parallelization • For compute-bound workloads, Beowulf clusters can be used effectively to overcome computational barriers • Middleware (e.g., MPI and MPI-IO) can significantly reduce the programming effort on parallel systems • Multiple clusters can be combined using Grid middleware (the Globus Toolkit) • For file-based I/O-bound workloads, Beowulf clusters and Grid systems are presently ill-suited to exploit the potential parallelism present on these systems

  5. Outline • Introduction • Characterization of Parallel I/O Access Patterns • Profile-Guided I/O Partitioning • Parallel I/O Modeling and Simulation • Work in Progress

  6. Introduction • The I/O bottleneck • The growing gap between the speed of processors, networks, and underlying I/O devices • Many imaging and scientific applications access disks very frequently • I/O-intensive applications • Out-of-core applications • Large datasets that cannot fit into main memory • File-I/O-intensive applications • Database applications • Randomly access small data chunks • Multimedia servers • Sequentially access large data chunks • Parallel scientific applications (our target applications)

  7. Parallel Scientific Applications • Application classes • Sub-surface sensing and imaging • Medical image processing • Seismic processing • Fluid dynamics • Weather forecasting and simulation • High energy physics • Bio-informatics image processing • Aerospace applications • Application characteristics • Access patterns: a large number of non-contiguous data chunks • Multiple processes read/write simultaneously • Data sharing among multiple processes

  8. Cluster Storage • General-purpose shared file storage • Files (e.g., source code, executables, scripts) need to be accessible and available to all nodes • Stored on a centralized storage system (RAID, high capacity, high throughput) • Parallel file system to provide concurrent access • I/O requests are forwarded to an I/O node, which completes the requests and sends the results back to the compute nodes over a message-passing network • Local disk • Hosts the OS • Virtual memory and swap space • Temporary files [Figure: compute nodes with local disks connected over Ethernet to the shared file space]

  9. I/O Models [Figure: an I/O-intensive application with multiple processes (e.g., MPI-IO) striping data across multiple disks (e.g., RAID)]

  10. I/O Models [Figure: the same application under two alternatives – data striping across a shared disk array and data partitioning across per-node disks – with disk/network contention marked in each case]

  11. Outline • Introduction • Characterization of Parallel I/O Access Patterns • Profile-Guided I/O Partitioning • Parallel I/O Modeling and Simulation • Work in Progress

  12. Parallel I/O Access Patterns (Spatial) • Stride: the distance between two contiguous accesses by the same process • Simple strided [Figure: file blocks labeled with the process ID (0–3) that accesses them, repeating at a fixed stride] • Multiple-level strided [Figure: the same labeling with strides nested at two levels]
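For readers less familiar with MPI-IO, a simple strided pattern like the one above is typically expressed with a vector filetype: each rank sees only every nprocs-th block of the shared file. The sketch below is illustrative, not code from the talk; the file name "shared.dat", the 2040-byte block size, and the block count are assumptions.

/* Minimal sketch: rank r reads every nprocs-th BLOCK_SIZE chunk of a
 * shared file, starting at block r (a simple strided pattern). */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK_SIZE       2040   /* bytes per contiguous chunk (assumed)       */
#define BLOCKS_PER_RANK    64   /* strided blocks read by each rank (assumed) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Filetype: one BLOCK_SIZE chunk, then skip the chunks that belong
     * to the other ranks (stride = nprocs blocks). */
    MPI_Datatype filetype;
    MPI_Type_vector(BLOCKS_PER_RANK, BLOCK_SIZE, nprocs * BLOCK_SIZE,
                    MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* Each rank's view starts at its own first block. */
    MPI_File_set_view(fh, (MPI_Offset)rank * BLOCK_SIZE,
                      MPI_BYTE, filetype, "native", MPI_INFO_NULL);

    char *buf = malloc((size_t)BLOCKS_PER_RANK * BLOCK_SIZE);
    MPI_File_read_all(fh, buf, BLOCKS_PER_RANK * BLOCK_SIZE,
                      MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}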

  13. Parallel I/O Access Patterns (Spatial) • Varied extent [Figure: strided accesses of differing sizes, blocks labeled by process ID 0–3] • Segmented [Figure: contiguous per-process segments 0, 1, 2, …, N laid out from start of file to end of file]

  14. Parallel I/O Access Patterns (Spatial) • Tiled access [Figure: a 4 × 4 grid of tiles numbered 0–15, one tile per process] • Overlapped tiled access [Figure: the same grid with adjacent tiles overlapping at their borders]
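A companion sketch (again illustrative, not from the talk) shows the usual way to express tiled access: a subarray filetype lets each rank read one tile of a row-major 2-D matrix stored in a single file. The matrix size, tile size, element type, and the file name "matrix.dat" are assumptions, and the example assumes exactly 16 ranks arranged as a 4 × 4 tile grid.

/* Minimal sketch: each of 16 ranks reads one 1024 x 1024 tile of a
 * 4096 x 4096 matrix of doubles stored row-major in one file. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int tiles = 4;                       /* 4 x 4 tile grid         */
    const int tile  = 1024;                    /* tile edge, in elements  */
    if (nprocs != tiles * tiles)
        MPI_Abort(MPI_COMM_WORLD, 1);          /* sketch assumes 16 ranks */

    int gsizes[2] = { tiles * tile, tiles * tile };
    int lsizes[2] = { tile, tile };
    int starts[2] = { (rank / tiles) * tile, (rank % tiles) * tile };

    /* Filetype describing this rank's tile within the global matrix. */
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "matrix.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    double *buf = malloc((size_t)tile * tile * sizeof(double));
    MPI_File_read_all(fh, buf, tile * tile, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}

Overlapped tiles can be expressed the same way by enlarging the local sizes so that neighboring subarrays share their border elements.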

  15. Parallel I/O Access Patterns (Temporal) [Figure: timelines contrasting I/O and computation phases – read once then compute, compute then write once, reads and writes interleaved with computation, and burst reads/burst writes between computation phases]

  16. Outline • Introduction • Characterization of Parallel I/O Access Patterns • Profile-Guided I/O Partitioning • Parallel I/O Modeling and Simulation • Work in Progress

  17. I/O Partitioning • I/O is parallelized at both the application level (using MPI and MPI-IO) and the disk level (using file partitioning) • Final goal • Integrate these levels into a system-wide approach • Scalability • Ideally, every process will only access files on its local disk (though this is typically not possible due to data sharing) • How do we recognize the access patterns? • Profile-guided approach

  18. Profile Generation • Run the instrumented application • Capture I/O execution profiles • Apply our partitioning algorithm • Rerun the tuned application

  19. I/O traces and partitioning • For every process, for every contiguous file access, we capture the following I/O profile information: • Process ID • File ID • Address • Chunk size • I/O operation (read/write) • Timestamp • Generate a partition for every process • Optimal partitioning is NP-complete, so we develop a greedy algorithm • We have found we can use partial profiles to guide partitioning
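One common way to capture per-access profiles like these without modifying application source is the MPI profiling (PMPI) interface, which lets a wrapper intercept MPI calls and then forward them to the underlying implementation. The sketch below is illustrative rather than the authors' actual instrumentation: it wraps MPI_File_write_at and logs one record (rank, operation, file offset, chunk size, timestamp) per call. The log file name and record format are assumptions; a complete tracer would also wrap the read and collective I/O calls and record a file ID.

/* Illustrative trace capture via the MPI profiling interface: linking
 * this wrapper ahead of the MPI library routes the application's
 * MPI_File_write_at calls through it. */
#include <mpi.h>
#include <stdio.h>

static FILE *trace_log;     /* one log per process, opened lazily */

int MPI_File_write_at(MPI_File fh, MPI_Offset offset, const void *buf,
                      int count, MPI_Datatype datatype, MPI_Status *status)
{
    int rank, type_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Type_size(datatype, &type_size);

    if (trace_log == NULL) {
        char name[64];
        snprintf(name, sizeof(name), "io_trace.%d.log", rank);
        trace_log = fopen(name, "w");
    }

    /* rank, operation, file offset (bytes), chunk size (bytes), timestamp */
    fprintf(trace_log, "%d write %lld %lld %.6f\n",
            rank, (long long)offset,
            (long long)count * type_size, MPI_Wtime());

    /* Forward to the real MPI implementation. */
    return PMPI_File_write_at(fh, offset, buf, count, datatype, status);
}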

  20. Greedy File Partitioning Algorithm
  for each I/O process, create a partition;
  for each contiguous data chunk {
      total up the # of read/write accesses on a process-ID basis;
      if the chunk is accessed by only one process
          assign the chunk to that process's partition;
      else if the chunk is read (but never written) by multiple processes
          duplicate the chunk in every partition where it is read;
      else if the chunk is written by one process but later read by multiple processes
          assign the chunk to every partition where it is read and broadcast the updates on writes;
      else
          assign the chunk to a shared partition;
  }
  for each partition
      sort its chunks by the earliest timestamp of each chunk;
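To make the placement rules concrete, the following sketch (an illustration, not the authors' implementation) encodes the per-chunk decision, assuming at most 64 processes and reader/writer sets accumulated from the trace as bitmasks over process ranks; the "written before later read" ordering check is folded into the one-writer case for brevity.

/* Illustrative per-chunk placement decision for the greedy partitioner. */
#include <stdint.h>

typedef enum {
    PLACE_SINGLE_OWNER,       /* accessed by one process: its local partition */
    PLACE_REPLICATE_READERS,  /* read-only sharing: copy into every reader    */
    PLACE_OWNER_BROADCAST,    /* one writer, many readers: replicate and
                                 broadcast updates on writes                  */
    PLACE_SHARED              /* everything else: the shared partition        */
} placement_t;

static int popcount(uint64_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

placement_t classify_chunk(uint64_t readers, uint64_t writers)
{
    if (popcount(readers | writers) == 1)
        return PLACE_SINGLE_OWNER;
    if (writers == 0)
        return PLACE_REPLICATE_READERS;
    if (popcount(writers) == 1)
        return PLACE_OWNER_BROADCAST;
    return PLACE_SHARED;
}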

  21. Parallel I/O Workloads • NAS Parallel Benchmarks (NPB2.4)/BT • Computational fluid dynamics • Generates a file (~1.6 GB) dynamically and then reads it back • Writes/reads sequentially in chunk sizes of 2040 bytes • SPEChpc96/seismic • Seismic processing • Generates a file (~1.5 GB) dynamically and then reads it back • Writes sequential chunks of 96 KB and reads sequential chunks of 2 KB • Tile-IO • Parallel Benchmarking Consortium • Tiled access to a two-dimensional matrix (~1 GB) with overlap • Writes/reads sequential chunks of 32 KB, with 2 KB of overlap • Perf • Parallel I/O test program within MPICH • Writes a 1 MB chunk at a location determined by rank, no overlap • Mandelbrot • An image processing application that includes visualization • Chunk size is dependent on the number of processes • Jacobi • File-based out-of-core Jacobi application from U. of Georgia • FFT • File-based out-of-core FFT application from MPI-SIM

  22. The Joulian Cluster [Figure: Pentium II 350 MHz compute nodes with local PCI-IDE disks and RAID nodes, connected through a 10/100 Mb Ethernet switch]

  23. Write/Read Bandwidth for NPB2.4/BT

  24. Write/Read Bandwidth for SPEChpc96/seis

  25. Write/Read Bandwidth for Tile-IO

  26. Write/Read Bandwidth for Perf

  27. Write/Read Bandwidth for Mandelbrot

  28. Write/Read Bandwidth for Jacobi

  29. Write/Read Bandwidth for FFT

  30. Overall Execution Time

  31. Profile training sensitivity analysis • We have found that I/O access patterns are independent of file-based data values • When we increase the problem size or reduce the number of processes, either: • the number of I/Os increases, but the access patterns and chunk size remain the same (SPEChpc96, Mandelbrot), or • the number of I/Os and the I/O access patterns remain the same, but the chunk size increases (NPB, Tile-IO, Perf) • Re-profiling can be avoided

  32. Outline • Introduction • Characterization of Parallel I/O Access Patterns • Profile-Guided I/O Partitioning • Parallel I/O Modeling and Simulation • Work in Progress

  33. Parallel I/O Simulation • Explore a larger I/O design space • Study new disk devices and technologies • Efficient implementation of storage architectures can significantly improve system performance • Provide an accurate simulation environment for users to test and evaluate different storage architectures and applications

  34. Storage Architecture • Direct Attached Storage (DAS) • Storage device is directly attached to the computer • Network Attached Storage (NAS) • Storage subsystem is attached to a network of servers and file requests are passed through a parallel file system to the centralized storage device [Figure: DAS with a disk per server vs. NAS with servers on a LAN/WAN sharing a central storage device]

  35. Storage Architecture • Storage Area Network (SAN) • A dedicated network that provides an any-to-any connection between processors and disks • Offloads I/O traffic from the backbone network [Figure: servers on a LAN/WAN connected through a dedicated SAN to the disks]

  36. Execution-driven Parallel I/O Simulation • Use DiskSim as the underlying disk drive simulator • DiskSim 3.0 – Carnegie Mellon University • Direct execution to model CPU and network communication • We execute the real parallel I/O accesses and, at the same time, calculate the simulated I/O response time

  37. Simulation Framework – NAS [Figure: local I/O traces from each node travel over the LAN/WAN to the network file system / RAID controller, where filesystem metadata maps logical file access addresses to I/O requests that are fed into DiskSim]

  38. Simulation Framework – SAN-direct • A variant of SAN in which disks are distributed across the network and each server is directly connected to a single device • File partitioning • Utilize I/O profiling and data partitioning heuristics to distribute portions of files to disks close to the processing nodes [Figure: each node runs its own file system, generates local I/O traces, and drives its own DiskSim instance, with nodes connected over the LAN/WAN]

  39. Experimental Hardware Specifics • DAS configuration: a standalone PC, Western Digital WD800BB (IDE), 80 GB, 7200 RPM • Beowulf cluster (base configuration): Fast Ethernet, 100 Mbits/sec • Network-attached RAID: Morstor TF200 with 6-9GB Seagate SCSI disks, 7200 RPM, RAID-5 • Locally attached IDE disks: IBM UltraATA-350840, 5400 RPM • Fibre Channel disks: Seagate Cheetah X15 ST-336752FC, 15000 RPM

  40. Validation – MicroBenchmarks on DAS

  41. Validation - Overall Execution Time of NPB2.4/BT (NAS)

  42. Validation - Overall Execution Time of NPB2.4/BT (SAN)

  43. I/O Throughput of NPB2.4/BT base configuration

  44. I/O Throughput of NPB2.4/BT Fibre-channel disk

  45. I/O Throughput of SPEC/seis Fibre-channel disk

  46. I/O Throughput of SPEC/seis Fibre-channel disk SAN all-to-all: all nodes have a direct connection to each disk

  47. Simulation of Disk Interfaces and Interconnections • Study the overall system performance as a function of the underlying storage architecture • Interconnections: NAS-RAID and SAN-direct • Disk interfaces: IDE, SCSI, and Fibre Channel

  48. Overall Execution Time of NPB2.4/BT

  49. Overall Execution Time of SPEChpc/seis

  50. Overall Execution Time of Perf
