10 likes | 146 Vues
This paper introduces a multi-platform Co-Array Fortran (CAF) compiler, designed for high-performance computing. It supports SPMD programming, leveraging both private and shared data models for efficient asynchronous operations. The paper elaborates on advanced constructs like one-sided communication, flexible synchronization, and various optimization techniques to enhance programmer productivity. Implemented on various platforms, including Linux and SGI systems, this compiler is optimized for computational fluid dynamics and matrix operations, demonstrating performance that rivals traditional MPI applications.
E N D
A Multi-platform Co-Array Fortran Compiler for High-Performance Computing Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey {dotsenko, ccristi, johnmc}@cs.rice.edu NSF a(10,20) a(10,20) a(10,20) image 1 image 2 image N image 1 image 2 image N integer :: a(10,20)[*] Copies from left neighbor if (this_image() > 1)a(1:10,1:2)=a(1:10,19:20)[this_image()-1] 3 2 1 2 1 1 2D wave-front parallelism Programming Ultra-scale Parallel Systems Co-Array Fortran Language CAF Model Refinements Rice CAF Compiler • Parallel extension of Fortran 90 • SPMD programming model • fixed number of images during execution • images operate asynchronously • Both private and shared data • real a(20,20)private: a 20x20 array in each image • real a(20,20) [*] shared: a 20x20 array in each image • Simple one-sided communication (PUT & GET) • x(:,j:j+2) = a(r,:) [p:p+2] copy rows from p:p+2 intolocal columns • Flexible explicit synchronization • sync_team(team [,wait]) • team= a vector of process ids to synchronize with • wait=a vector of processes to wait for • Pointers and dynamic allocation • Point-to-point synchronization • sync_notify(p) • sync_wait(p) • Less restrictive memory fences at call site • Collective operations • Source-to-source code generation • Open source compiler • Build on Open64/SL infrastructure • Support for core language features • Code generation: • library-based communication: portable ARMCI and GASNet communication libraries and array descriptor CHASM library • load/store communication: on shared-memory platforms • Operating systems: • Linux IA64/IA32 • Alpha Tru64 • SGI IRIX64 • Interconnects & Platforms: • Quadrics QSNet (Elan 3), QSNet II (Elan 4) • Myrinet 2000 • Ethernet • SGI Altix 3000, SGI Origin 2000 • CHALLENGES • High-performance and good scalability • Programmer productivity • CAF: a promising near-term alternative • As expressive as MPI • Simpler to program than MPI • More amenable to compiler optimizations • User has control over performance-critical factors • MPI: a library-based parallel programming model • Portable and widely used • The developer has explicit control over data and communication placement • Difficult and error prone to program • Most of the burden for communication optimization falls on application developers; compiler support is underutilized Current Optimizations • Procedure Splitting • Hints for non-blocking communication • Library-based and load/store communication • Packing of strided communication Planned Optimizations • Communication vectorization and aggregation • Synchronization strength-reduction • Automatic split-phase communication • Platform-driven communication optimizations • transform communication from one-sided into two- sided and collective, if useful • multi-model code for hierarchical architectures • convert GETs into PUTs • Multi-buffer co-arrays for asynchrony tolerance • Employ virtualization for latency tolerance • Interoperability with other programming models CAF Applications and Benchmarks • Sweep3D – wave-front parallelism • Spark98 – sparse matrix vector multiply • NAS Parallel Benchmarks 2.3: MG, CG, SP, BT, LU • Random Access, STREAM San Fernando Valley Earthquake Simulation – Spark98 Neutron transport problem – Sweep3D Computational Fluid Dynamics: Cluster Platforms NAS BT C on Itanium2+Myrinet 2000 NAS MG C on Itanium2+Myrinet 2000 Spark98 on SGI Altix 3000 Sweep3D 1503on Itanium2+Quadrics Mesh Computational Fluid Dynamics: SGI Altix 3000 • Sparse matrix vector multiply (sf2 traces) • Performance of all CAF versions is comparable to that of MPI and better on large number of CPUs • CAF GETs is simple and more “natural” to code, but up to 13% slower • Without considering locality, applications do not scale on NUMA architectures (Hybrid) • ARMCI library is more efficient than MPI Sweep3D 1503on Itanium2+Myrinet Sweep3D 1503on SGI Altix 3000 NAS BT B on SGI Altix 3000 NAS MG C on SGI Altix 3000 Partitioned Mesh