
Improving I/O with Compiler-Supported Parallelism


Presentation Transcript


Anna Youssefi, Ken Kennedy

Why Should We Care About I/O?

Disk access speeds are much slower than processor and memory access speeds. Disk I/O may be a major bottleneck in applications such as:
• scientific codes related to image processing
• multimedia applications
• out-of-core computations
Computational optimizations alone may not provide any significant improvements to these programs.

Why Should Compilers Be Involved?

Compilers have knowledge of both the application and the computer architecture or operating system. Compilers can reduce the burden on the programmer and increase code portability by requiring little to no change in the user-level program to achieve good performance on different architectures. Compilers can also automatically translate programs written in high-level languages, which may lack robust I/O or operating-system interfaces, into higher-performance languages that provide more control over low-level system activities.

Human Neuroimaging Lab

http://www.hnl.bcm.tmc.edu/
The Human Neuroimaging Laboratory at the Baylor College of Medicine conducts research in the physiology and functional anatomy of the human brain using fMRI technology.

fMRI Technology

Functional Magnetic Resonance Imaging (fMRI) is a technique for determining which parts of the brain are activated when a person responds to stimuli. A high-resolution brain scan is followed by a series of low-resolution scans taken at regular time slices. Brain activity is identified by increased blood flow to specific regions of the brain.

Motivating Application

The HNL wants to optimize a preprocessing application that normalizes brain images of human subjects to a canonical brain in order to make the images comparable and enable data analysis. The program uses calls to the SPM (Statistical Parametric Mapping) library.

Transformation: Loop Distribution & Parallelization

Hand transformation on an I/O-intensive loop in the HNL preprocessing application. The original loop reads a different input file and writes a portion of a single output file on each iteration. The loop is distributed into two separate loops: the first runs in parallel on four different processors; the second runs sequentially across all processors. Standard compiler transformations were implemented by hand to parallelize the loop; dependence analysis can be used to automate the transformation. The loop structure on a single processor and on Processors 1-4 after distribution is shown in the figure at the end of this transcript, followed by an illustrative code sketch.

Performance Results

Performance of the transformed loop was constrained by shortcomings of the MPI (Message Passing Interface) implementation we used. This implementation relies on file I/O to share data and results in excessive communication times, as demonstrated by the broadcast overhead. Even with these performance constraints, we achieved a 30-40% improvement in running time. We expect to achieve even better results from using a different MPI implementation.

Conclusion and Future Work

Through parallelization, we achieved a minimum of 30% improvement in the running time of an I/O-intensive loop. Standard compiler transformations can be extended to reveal the parallelism in such loops. We plan to implement compiler strategies to automate these transformations. We also plan to implement compiler support for other application-level I/O transformations, such as converting synchronous to asynchronous I/O, prefetching, and overlapping I/O with computation; a sketch of such an overlap follows below.
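The asynchronous-I/O and prefetching transformations mentioned above are future work and are not part of the results presented here; the following is only a minimal sketch of what overlapping I/O with computation can look like at the application level. It assumes C with POSIX AIO, a hypothetical input_%03d.img file-naming scheme, a fixed image size, and a placeholder process_image() routine, none of which come from the original application.

    /* Sketch only: overlap of I/O with computation via double buffering and
     * POSIX AIO.  File naming, image size, and process_image() are assumptions. */
    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define N        192               /* number of input files, as in the loop  */
    #define IMG_SIZE (1 << 20)         /* assumed size of one input image        */

    static char bufs[2][IMG_SIZE];     /* double buffer: compute on one half     */
                                       /* while the other is being filled        */

    static void process_image(char *img) { (void)img; /* placeholder computation */ }

    /* Issue an asynchronous read of (hypothetical) input file i into buf. */
    static void prefetch(int i, char *buf, struct aiocb *cb)
    {
        char name[64];
        snprintf(name, sizeof name, "input_%03d.img", i);
        memset(cb, 0, sizeof *cb);
        cb->aio_fildes = open(name, O_RDONLY);
        cb->aio_buf    = buf;
        cb->aio_nbytes = IMG_SIZE;
        aio_read(cb);                  /* returns immediately; the read proceeds */
                                       /* in the background (errors not checked) */
    }

    int main(void)
    {
        struct aiocb cb[2];
        prefetch(0, bufs[0], &cb[0]);  /* prime the pipeline with the first file */

        for (int i = 0; i < N; i++) {
            int cur = i % 2, nxt = (i + 1) % 2;

            if (i + 1 < N)
                prefetch(i + 1, bufs[nxt], &cb[nxt]);   /* start the next read   */

            /* Wait only for the current image, then compute on it while the
             * prefetch of the next image continues in the background. */
            const struct aiocb *list[1] = { &cb[cur] };
            aio_suspend(list, 1, NULL);
            aio_return(&cb[cur]);
            close(cb[cur].aio_fildes);

            process_image(bufs[cur]);
        }
        return 0;
    }

On Linux this may require linking with -lrt; the same double-buffering pattern applies with any asynchronous or threaded read-ahead mechanism.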
Figure (from the "Transformation: Loop Distribution & Parallelization" slide): loop structure before and after distribution.

    Single processor (original loop):
        for i = 1 to 192
            READ; PROCESS; WRITE

    Processors 1-4 (after distribution, one 48-iteration block each):
        Loop 1, run in parallel on all four processors:
            for i = 1 to 48
                READ; PROCESS
        Loop 2, run sequentially across the processors:
            for i = 1 to 48
                WRITE
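The following is a minimal sketch of the distributed loop in the figure, assuming C with MPI and four processes; it is not the authors' implementation. read_input(), process_image(), and write_portion() are hypothetical placeholders for the SPM-based READ / PROCESS / WRITE steps, and the buffer size is illustrative.

    /* Sketch only: Loop 1 is the parallel read/process loop; Loop 2 writes each
     * rank's portion in rank order so the single output file is produced
     * sequentially, as described on the slide. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N      192                 /* iterations of the original loop        */
    #define NPROCS 4                   /* processors used in the experiment      */
    #define CHUNK  (N / NPROCS)        /* 48 iterations per processor            */
    #define ITEM   1024                /* illustrative size of one result        */

    /* Hypothetical placeholders for the SPM-based READ / PROCESS / WRITE steps. */
    static void read_input(int i, double *buf)    { (void)i; (void)buf; }
    static void process_image(double *buf)        { (void)buf; }
    static void write_portion(int i, double *buf) { (void)i; (void)buf; }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int lo = rank * CHUNK;         /* this rank's block of the index space   */
        double *buf = malloc((size_t)CHUNK * ITEM * sizeof *buf);

        /* Loop 1 (parallel): each process reads and processes its own block. */
        for (int i = lo; i < lo + CHUNK; i++) {
            read_input(i, buf + (size_t)(i - lo) * ITEM);
            process_image(buf + (size_t)(i - lo) * ITEM);
        }

        /* Loop 2 (sequential across processes): ranks take turns writing their
         * portions so the shared output file is written in the original order. */
        for (int r = 0; r < NPROCS; r++) {
            if (rank == r)
                for (int i = lo; i < lo + CHUNK; i++)
                    write_portion(i, buf + (size_t)(i - lo) * ITEM);
            MPI_Barrier(MPI_COMM_WORLD);   /* hand the file off to the next rank */
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

Because each iteration reads a different input file and writes a distinct portion of the output file, there are no cross-iteration dependences, which is what lets the read/process loop be distributed from the write loop and run in parallel.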
