Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters Nikolaos Drosinos and Nectarios Koziris National Technical University of Athens Computing Systems Laboratory {ndros,nkoziris}@cslab.ece.ntua.gr www.cslab.ece.ntua.gr
Overview • Introduction • Pure Message-passing Model • Hybrid Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work IPDPS 2004
Motivation • Active research interest in • SMP clusters • Hybrid programming models • However: • Mostly fine-grain hybrid paradigms (masteronly model) • Mostly DOALL multi-threaded parallelization IPDPS 2004
Contribution • Comparison of 3 programming models for the parallelization of tiled loop algorithms • pure message-passing • fine-grain hybrid • coarse-grain hybrid • Advanced hyperplane scheduling • minimizes synchronization needs • overlaps computation with communication • preserves data dependencies IPDPS 2004
Algorithmic Model
Tiled nested loops with constant flow data dependencies:
  FORACROSS tile_0 DO
    ...
      FORACROSS tile_{n-2} DO
        FOR tile_{n-1} DO
          Receive(tile);
          Compute(tile);
          Send(tile);
        END FOR
      END FORACROSS
    ...
  END FORACROSS
IPDPS 2004
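A minimal C sketch of this model for n = 3, under the assumption that the two outer (FORACROSS) tile dimensions are distributed, so each process keeps them fixed at its coordinates and sweeps only the last dimension; receive_tile, compute_tile and send_tile are hypothetical helpers, not the paper's code.

  /* Sketch of the algorithmic model for a 3D tile space: the process at
     coordinates (pr0, pr1) owns the FORACROSS dimensions implicitly and
     iterates sequentially over the last tile dimension. */
  void receive_tile(int t0, int t1, int t2);   /* hypothetical helpers */
  void compute_tile(int t0, int t1, int t2);
  void send_tile(int t0, int t1, int t2);

  void tiled_sweep(int pr0, int pr1, int n_tiles)
  {
      for (int t2 = 0; t2 < n_tiles; t2++) {
          receive_tile(pr0, pr1, t2);   /* boundary data this tile depends on   */
          compute_tile(pr0, pr1, t2);
          send_tile(pr0, pr1, t2);      /* boundary data needed by neighbours   */
      }
  }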
Target Architecture SMP clusters IPDPS 2004
Overview • Introduction • Pure Message-passing Model • Hybrid Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work IPDPS 2004
Pure Message-passing Model
  tile_0 = pr_0;
  ...
  tile_{n-2} = pr_{n-2};
  FOR tile_{n-1} = 0 TO ... DO
    Pack(snd_buf, tile_{n-1} - 1, pr);
    MPI_Isend(snd_buf, dest(pr));
    MPI_Irecv(recv_buf, src(pr));
    Compute(tile);
    MPI_Waitall;
    Unpack(recv_buf, tile_{n-1} + 1, pr);
  END FOR
IPDPS 2004
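Below is a hedged C sketch of this per-tile communication/computation overlap with full MPI signatures. The message length buf_len, the neighbour ranks dest/src, and the pack_tiles/unpack_tiles/compute_tiles helpers are illustrative assumptions, not the authors' implementation.

  #include <mpi.h>

  void pack_tiles(double *buf, int tile);     /* hypothetical helpers */
  void unpack_tiles(double *buf, int tile);
  void compute_tiles(int tile);

  void pure_mpi_sweep(double *snd_buf, double *recv_buf, int buf_len,
                      int n_tiles, int dest, int src)
  {
      MPI_Request reqs[2];

      for (int t = 0; t < n_tiles; t++) {
          /* forward the boundary of the previous tile and prefetch data
             for the next one while tile t is being computed */
          pack_tiles(snd_buf, t - 1);
          MPI_Isend(snd_buf, buf_len, MPI_DOUBLE, dest, 0,
                    MPI_COMM_WORLD, &reqs[0]);
          MPI_Irecv(recv_buf, buf_len, MPI_DOUBLE, src, 0,
                    MPI_COMM_WORLD, &reqs[1]);

          compute_tiles(t);               /* overlaps with communication */

          MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
          unpack_tiles(recv_buf, t + 1);
      }
  }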
Pure Message-passing Model IPDPS 2004
Overview • Introduction • Pure Message-passing Model • Hybrid Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work IPDPS 2004
Hyperplane Scheduling • Implements coarse-grain parallelism in the presence of inter-tile data dependencies • Tiles are organized into data-independent subsets (groups) • Tiles of the same group can be executed concurrently by multiple threads • Barrier synchronization between threads IPDPS 2004
Hyperplane Scheduling (figure: tiles identified by (mpi_rank, omp_tid, tile) and organized into groups) IPDPS 2004
Hyperplane Scheduling
  #pragma omp parallel
  {
    group_0 = pr_0;
    ...
    group_{n-2} = pr_{n-2};
    tile_0 = pr_0 * m_0 + th_0;
    ...
    tile_{n-2} = pr_{n-2} * m_{n-2} + th_{n-2};
    FOR (group_{n-1}) {
      tile_{n-1} = group_{n-1} - ...;
      if (0 <= tile_{n-1} <= ...)
        compute(tile);
      #pragma omp barrier
    }
  }
IPDPS 2004
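As a general illustration of the wavefront idea behind this scheduling (tiles whose coordinates sum to the same value carry no flow dependencies among each other, so each group can run concurrently, with a barrier between successive groups), here is a hedged C/OpenMP sketch for a 3D tile space. The tile counts T0, T1, T2 and compute_tile are assumptions; the slide's exact offset formula (elided above) is not reproduced.

  #include <omp.h>

  void compute_tile(int t0, int t1, int t2);   /* hypothetical tile kernel */

  void hyperplane_sweep(int T0, int T1, int T2)
  {
      int max_group = (T0 - 1) + (T1 - 1) + (T2 - 1);

      #pragma omp parallel
      for (int g = 0; g <= max_group; g++) {
          /* tiles on hyperplane g satisfy t0 + t1 + t2 == g and are
             mutually independent, so threads share them */
          #pragma omp for
          for (int t0 = 0; t0 < T0; t0++)
              for (int t1 = 0; t1 < T1; t1++) {
                  int t2 = g - t0 - t1;
                  if (t2 >= 0 && t2 < T2)
                      compute_tile(t0, t1, t2);
              }
          /* the implicit barrier of omp for separates successive groups */
      }
  }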
Overview • Introduction • Pure Message-passing Model • Hybrid Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work IPDPS 2004
Fine-grain Model • Incremental parallelization of the computationally intensive parts • Pure MPI + hyperplane scheduling • Inter-node communication outside of the multi-threaded part (MPI_THREAD_MASTERONLY) • Thread synchronization through the implicit barrier of the omp parallel directive IPDPS 2004
Fine-grain Model
  FOR (group_{n-1}) {
    Pack(snd_buf, tile_{n-1} - 1, pr);
    MPI_Isend(snd_buf, dest(pr));
    MPI_Irecv(recv_buf, src(pr));
    #pragma omp parallel
    {
      thread_id = omp_get_thread_num();
      if (valid(tile, thread_id, group_{n-1}))
        Compute(tile);
    }
    MPI_Waitall;
    Unpack(recv_buf, tile_{n-1} + 1, pr);
  }
IPDPS 2004
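A minimal, self-contained C sketch of this fine-grain hybrid pattern is given below, assuming point-to-point exchange with one upstream and one downstream neighbour; pack_group, unpack_group, tile_is_valid, compute_tile_of and buf_len are hypothetical names and sizes, not the benchmark's actual code.

  #include <mpi.h>
  #include <omp.h>

  void pack_group(double *buf, int group);          /* hypothetical helpers */
  void unpack_group(double *buf, int group);
  int  tile_is_valid(int thread_id, int group);
  void compute_tile_of(int thread_id, int group);

  void fine_grain_sweep(double *snd_buf, double *recv_buf, int buf_len,
                        int n_groups, int dest, int src)
  {
      MPI_Request reqs[2];

      for (int g = 0; g < n_groups; g++) {
          /* MPI calls are issued outside the parallel region (masteronly) */
          pack_group(snd_buf, g - 1);
          MPI_Isend(snd_buf, buf_len, MPI_DOUBLE, dest, 0,
                    MPI_COMM_WORLD, &reqs[0]);
          MPI_Irecv(recv_buf, buf_len, MPI_DOUBLE, src, 0,
                    MPI_COMM_WORLD, &reqs[1]);

          /* a parallel region is (re)opened for every group; its implicit
             barrier provides the thread synchronization */
          #pragma omp parallel
          {
              int tid = omp_get_thread_num();
              if (tile_is_valid(tid, g))
                  compute_tile_of(tid, g);
          }

          MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
          unpack_group(recv_buf, g + 1);
      }
  }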
Overview • Introduction • Pure Message-passing Model • Hybrid Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work IPDPS 2004
Coarse-grain Model • Threads are initialized only once • SPMD paradigm (requires more programming effort) • Inter-node communication inside the multi-threaded part (requires MPI_THREAD_FUNNELED) • Thread synchronization through an explicit barrier (omp barrier directive) IPDPS 2004
Coarse-grain Model
  #pragma omp parallel
  {
    thread_id = omp_get_thread_num();
    FOR (group_{n-1}) {
      #pragma omp master
      {
        Pack(snd_buf, tile_{n-1} - 1, pr);
        MPI_Isend(snd_buf, dest(pr));
        MPI_Irecv(recv_buf, src(pr));
      }
      if (valid(tile, thread_id, group_{n-1}))
        Compute(tile);
      #pragma omp master
      {
        MPI_Waitall;
        Unpack(recv_buf, tile_{n-1} + 1, pr);
      }
      #pragma omp barrier
    }
  }
IPDPS 2004
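For comparison, here is a hedged C sketch of the coarse-grain variant: a single parallel region around the whole group loop, MPI calls funneled through the master thread, and an explicit barrier per group. The same hypothetical helpers as in the fine-grain sketch are redeclared for self-containment, and MPI is assumed to have been initialized with MPI_Init_thread requesting at least MPI_THREAD_FUNNELED support.

  #include <mpi.h>
  #include <omp.h>

  void pack_group(double *buf, int group);          /* hypothetical helpers */
  void unpack_group(double *buf, int group);
  int  tile_is_valid(int thread_id, int group);
  void compute_tile_of(int thread_id, int group);

  void coarse_grain_sweep(double *snd_buf, double *recv_buf, int buf_len,
                          int n_groups, int dest, int src)
  {
      MPI_Request reqs[2];      /* shared; touched only by the master thread */

      #pragma omp parallel
      {
          int tid = omp_get_thread_num();

          for (int g = 0; g < n_groups; g++) {
              #pragma omp master
              {
                  pack_group(snd_buf, g - 1);
                  MPI_Isend(snd_buf, buf_len, MPI_DOUBLE, dest, 0,
                            MPI_COMM_WORLD, &reqs[0]);
                  MPI_Irecv(recv_buf, buf_len, MPI_DOUBLE, src, 0,
                            MPI_COMM_WORLD, &reqs[1]);
              }   /* no implicit barrier: other threads start computing */

              if (tile_is_valid(tid, g))
                  compute_tile_of(tid, g);

              #pragma omp master
              {
                  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
                  unpack_group(recv_buf, g + 1);
              }
              #pragma omp barrier   /* explicit barrier between groups */
          }
      }
  }

Opening the parallel region once avoids the per-group fork/join overhead of the fine-grain version, at the cost of explicit barriers and MPI_THREAD_FUNNELED support.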
Overview • Introduction • Pure Message-passing Model • Hybrid Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work IPDPS 2004
Experimental Results • 8-node SMP Linux Cluster (800 MHz PIII, 128 MB RAM, kernel 2.4.20) • MPICH v.1.2.5 (--with-device=ch_p4, --with-comm=shared) • Intel C++ compiler 7.0 (-O3 -mcpu=pentiumpro -static) • Fast Ethernet interconnection • ADI micro-kernel benchmark (3D) IPDPS 2004
Alternating Direction Implicit (ADI) • Stencil computation used for solving partial differential equations • Unitary (distance-1) data dependencies • 3D iteration space (X × Y × Z) IPDPS 2004
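To make "unitary data dependencies" concrete, here is an illustrative C sketch of a 3D sweep in which each point depends on its distance-1 predecessors along every dimension; it is not the actual ADI update formula, and the array sizes are reduced assumptions rather than the benchmark's X × Y × Z.

  #define NX 64
  #define NY 64
  #define NZ 64   /* assumed, reduced sizes (the benchmark uses up to 512x512x8192) */

  static double A[NX][NY][NZ];

  /* each point depends on its immediate predecessor in every dimension */
  void stencil_sweep(void)
  {
      for (int i = 1; i < NX; i++)
          for (int j = 1; j < NY; j++)
              for (int k = 1; k < NZ; k++)
                  A[i][j][k] = (A[i-1][j][k] + A[i][j-1][k]
                                + A[i][j][k-1]) / 3.0;
  }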
ADI – 2 dual SMP nodes IPDPS 2004
ADI X=128 Y=512 Z=8192 – 2 nodes IPDPS 2004
ADI X=256 Y=512 Z=8192 – 2 nodes IPDPS 2004
ADI X=512 Y=512 Z=8192 – 2 nodes IPDPS 2004
ADI X=512 Y=256 Z=8192 – 2 nodes IPDPS 2004
ADI X=512 Y=128 Z=8192 – 2 nodes IPDPS 2004
ADI X=128 Y=512 Z=8192 – 2 nodes (figure: computation vs. communication breakdown) IPDPS 2004
ADI X=512 Y=128 Z=8192 – 2 nodes (figure: computation vs. communication breakdown) IPDPS 2004
Overview • Introduction • Pure Message-passing Model • Hybrid Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work IPDPS 2004
Conclusions • Tiled loop algorithms with arbitrary data dependencies can be adapted to the hybrid parallel programming paradigm • Hybrid models can be competitive with the pure message-passing paradigm • The coarse-grain hybrid model can be more efficient than the fine-grain one, but is also more complicated • Programming efficiently in OpenMP is not easier than programming efficiently in MPI IPDPS 2004
Future Work • Application of the methodology to real applications and standard benchmarks • Work balancing for the coarse-grain model • Investigation of alternative topologies and irregular communication patterns • Performance evaluation on advanced interconnection networks (SCI, Myrinet) IPDPS 2004
Thank You! Questions? IPDPS 2004