
Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Presentation Transcript


  1. Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters Nikolaos Drosinos and Nectarios Koziris National Technical University of Athens Computing Systems Laboratory {ndros,nkoziris}@cslab.ece.ntua.gr www.cslab.ece.ntua.gr

  2. Overview • Introduction • Pure Message-passing Model • Hybrid Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work IPDPS 2004

  3. Motivation • Active research interest in • SMP clusters • Hybrid programming models • However: • Mostly fine-grain hybrid paradigms (masteronly model) • Mostly DOALL multi-threaded parallelization IPDPS 2004

  4. Contribution • Comparison of 3 programming models for the parallelization of tiled loop algorithms • pure message-passing • fine-grain hybrid • coarse-grain hybrid • Advanced hyperplane scheduling • minimizes synchronization • overlaps computation with communication • preserves data dependencies IPDPS 2004

  5. Algorithmic Model
  Tiled nested loops with constant flow data dependencies:
  FORACROSS tile_0 DO
    …
    FORACROSS tile_{n-2} DO
      FOR tile_{n-1} DO
        Receive(tile);
        Compute(tile);
        Send(tile);
      END FOR
    END FORACROSS
    …
  END FORACROSS
  IPDPS 2004

  6. Target Architecture SMP clusters IPDPS 2004

  7. Overview • Introduction • Pure Message-passing Model • Hybrid Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work IPDPS 2004

  8. Pure Message-passing Model
  tile_0 = pr_0;
  …
  tile_{n-2} = pr_{n-2};
  FOR tile_{n-1} = 0 TO … DO
    Pack(snd_buf, tile_{n-1} - 1, pr);
    MPI_Isend(snd_buf, dest(pr));
    MPI_Irecv(recv_buf, src(pr));
    Compute(tile);
    MPI_Waitall;
    Unpack(recv_buf, tile_{n-1} + 1, pr);
  END FOR
  IPDPS 2004
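
To make the pattern above concrete, here is a minimal, self-contained C/MPI sketch (not the authors' code) of the same pipeline: each process posts a non-blocking send to one neighbour and a non-blocking receive from the other, computes while the transfers are in flight, and only then waits on both requests. The 1-D ring of neighbours, the buffer size and the compute_tile() body are illustrative assumptions.

/* Minimal sketch of the non-blocking send/receive/compute pattern of
 * slide 8. Neighbour ranks, buffer sizes and the compute body are
 * placeholders, not the authors' implementation. */
#include <mpi.h>
#include <stdlib.h>

#define BUF_SIZE  1024
#define NUM_TILES 64

static void compute_tile(double *work, int n) {
    for (int i = 0; i < n; i++)            /* stand-in for Compute(tile) */
        work[i] = work[i] * 0.5 + 1.0;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* simple 1-D pipeline: next/previous rank act as dest/src */
    int dest = (rank + 1) % size;
    int src  = (rank - 1 + size) % size;

    double *snd_buf  = malloc(BUF_SIZE * sizeof(double));
    double *recv_buf = malloc(BUF_SIZE * sizeof(double));
    double *work     = calloc(BUF_SIZE, sizeof(double));

    for (int t = 0; t < NUM_TILES; t++) {
        MPI_Request req[2];
        /* "Pack" the data produced by the previous tile */
        for (int i = 0; i < BUF_SIZE; i++) snd_buf[i] = work[i];

        MPI_Isend(snd_buf, BUF_SIZE, MPI_DOUBLE, dest, 0,
                  MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(recv_buf, BUF_SIZE, MPI_DOUBLE, src, 0,
                  MPI_COMM_WORLD, &req[1]);

        compute_tile(work, BUF_SIZE);      /* overlap with communication */

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        /* "Unpack" the received boundary for the next tile */
        for (int i = 0; i < BUF_SIZE; i++) work[i] += recv_buf[i];
    }

    free(snd_buf); free(recv_buf); free(work);
    MPI_Finalize();
    return 0;
}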

  9. Pure Message-passing Model IPDPS 2004

  10. Overview • Introduction • Pure Message-passing Model • Hybrid Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work IPDPS 2004

  11. Hyperplane Scheduling • Implements coarse-grain parallelism in the presence of inter-tile data dependencies • Tiles are organized into data-independent subsets (groups) • Tiles of the same group can be executed concurrently by multiple threads • Barrier synchronization between threads IPDPS 2004
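
As a point of reference, one common way to form such data-independent groups (an illustrative assumption, not necessarily the exact mapping of the paper) is to place a tile on the hyperplane given by the sum of its coordinates; under unit flow dependencies, tiles on the same hyperplane do not depend on each other. A tiny C sketch:

/* Sketch of a common hyperplane grouping (assumption for illustration):
 * a tile with coordinates (t0, ..., t_{n-1}) is assigned to group
 * t0 + ... + t_{n-1}, so tiles in the same group are mutually
 * independent under unit flow dependencies. */
#include <stdio.h>

#define N 3                        /* 3-D tile space, as in the ADI benchmark */

static int group_of(const int tile[N]) {
    int g = 0;
    for (int d = 0; d < N; d++)
        g += tile[d];              /* diagonal "wavefront" index */
    return g;
}

int main(void) {
    int t1[N] = {0, 1, 2};
    int t2[N] = {2, 1, 0};
    /* both tiles lie on hyperplane 3, hence may execute in parallel */
    printf("group(t1)=%d group(t2)=%d\n", group_of(t1), group_of(t2));
    return 0;
}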

  12. Hyperplane Scheduling [figure: a tile, identified by (mpi_rank, omp_tid, tile), is mapped to its group] IPDPS 2004

  13. Hyperplane Scheduling
  #pragma omp parallel
  {
    group_0 = pr_0;
    …
    group_{n-2} = pr_{n-2};
    tile_0 = pr_0 * m_0 + th_0;
    …
    tile_{n-2} = pr_{n-2} * m_{n-2} + th_{n-2};
    FOR(group_{n-1}) {
      tile_{n-1} = group_{n-1} - …;
      if (0 <= tile_{n-1} <= …)
        compute(tile);
      #pragma omp barrier
    }
  }
  IPDPS 2004
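
Below is a runnable OpenMP sketch of this wavefront structure for a 2-D tile space local to one process, filling in plausible bounds purely for illustration (the expressions elided on the slide are not reproduced): each thread owns one value of tile_0, sweeps the hyperplane groups, computes its tile when it falls inside the bounds, and synchronizes at the barrier.

/* Runnable OpenMP sketch of the wavefront execution of slide 13 for a
 * 2-D tile space. Bounds and the per-tile work are illustrative
 * assumptions, not the authors' code. */
#include <omp.h>
#include <stdio.h>

#define TX 4    /* tiles along dimension 0 (one per thread here) */
#define TY 6    /* tiles along dimension 1 */

static void compute(int t0, int t1) {
    /* stand-in for the per-tile computation */
    printf("thread %d computes tile (%d,%d)\n", omp_get_thread_num(), t0, t1);
}

int main(void) {
    #pragma omp parallel num_threads(TX)
    {
        int t0 = omp_get_thread_num();        /* tile_0 fixed per thread */
        /* groups 0 .. TX+TY-2 sweep the tile space diagonally */
        for (int group = 0; group < TX + TY - 1; group++) {
            int t1 = group - t0;              /* tile_1 on this hyperplane */
            if (t1 >= 0 && t1 < TY)           /* the slide's bounds test */
                compute(t0, t1);
            #pragma omp barrier               /* wait for the whole group */
        }
    }
    return 0;
}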

  14. Overview • Introduction • Pure Message-passing Model • Hybrid Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work IPDPS 2004

  15. Fine-grain Model • Incremental parallelization of computationally intensive parts • Pure MPI + hyperplane scheduling • Inter-node communication outside of multi-threaded part (MPI_THREAD_MASTERONLY) • Thread synchronization through implicit barrier of omp parallel directive IPDPS 2004

  16. Fine-grain Model
  FOR(group_{n-1}) {
    Pack(snd_buf, tile_{n-1} - 1, pr);
    MPI_Isend(snd_buf, dest(pr));
    MPI_Irecv(recv_buf, src(pr));
    #pragma omp parallel
    {
      thread_id = omp_get_thread_num();
      if (valid(tile, thread_id, group_{n-1}))
        Compute(tile);
    }
    MPI_Waitall;
    Unpack(recv_buf, tile_{n-1} + 1, pr);
  }
  IPDPS 2004
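
A self-contained C sketch of this fine-grain skeleton follows (illustrative assumptions: a 1-D ring of neighbours, fixed buffer sizes, a placeholder valid() test). It differs from the pure-MPI sketch earlier only in that the per-group computation is wrapped in an omp parallel region that is forked and implicitly joined every iteration, while all MPI calls stay outside it; MPI_THREAD_FUNNELED is requested since the process is multithreaded but only the master thread communicates.

/* Fine-grain (masteronly-style) hybrid sketch: MPI outside the parallel
 * region, threads forked per group. Neighbours, sizes and valid() are
 * placeholders, not the authors' implementation. */
#include <mpi.h>
#include <omp.h>

#define NG  16     /* number of hyperplane groups */
#define BUF 256

static double work[BUF], snd_buf[BUF], recv_buf[BUF];

static int valid(int tid, int group) {
    (void)tid; (void)group;
    return 1;                          /* placeholder bounds test */
}

int main(int argc, char **argv) {
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int dest = (rank + 1) % size, src = (rank - 1 + size) % size;

    for (int group = 0; group < NG; group++) {
        MPI_Request req[2];
        MPI_Isend(snd_buf, BUF, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(recv_buf, BUF, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &req[1]);

        /* fork/join of the thread team happens once per group */
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            if (valid(tid, group))
                for (int i = tid; i < BUF; i += omp_get_num_threads())
                    work[i] += 1.0;    /* Compute(tile) stand-in */
        }

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }

    MPI_Finalize();
    return 0;
}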

  17. Overview • Introduction • Pure Message-passing Model • Hybrid Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work IPDPS 2004

  18. Coarse-grain Model • Threads are initialized only once • SPMD paradigm (requires more programming effort) • Inter-node communication inside the multi-threaded part (requires MPI_THREAD_FUNNELED) • Thread synchronization through explicit barrier (omp barrier directive) IPDPS 2004

  19. Coarse-grain Model
  #pragma omp parallel
  {
    thread_id = omp_get_thread_num();
    FOR(group_{n-1}) {
      #pragma omp master
      {
        Pack(snd_buf, tile_{n-1} - 1, pr);
        MPI_Isend(snd_buf, dest(pr));
        MPI_Irecv(recv_buf, src(pr));
      }
      if (valid(tile, thread_id, group_{n-1}))
        Compute(tile);
      #pragma omp master
      {
        MPI_Waitall;
        Unpack(recv_buf, tile_{n-1} + 1, pr);
      }
      #pragma omp barrier
    }
  }
  IPDPS 2004
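
The coarse-grain counterpart below keeps the same placeholders but opens the parallel region once: only the master thread issues MPI calls (guarded by omp master, hence MPI_THREAD_FUNNELED), every thread computes its share, and the explicit barrier replaces the implicit one of the fine-grain version. This is a sketch under the stated assumptions, not the authors' code.

/* Coarse-grain (SPMD) hybrid sketch: one parallel region for the whole
 * pipeline, MPI confined to the master thread, explicit barrier per
 * group. Neighbours, sizes and valid() are placeholders. */
#include <mpi.h>
#include <omp.h>

#define NG  16
#define BUF 256

static double work[BUF], snd_buf[BUF], recv_buf[BUF];

static int valid(int tid, int group) {
    (void)tid; (void)group;
    return 1;                              /* placeholder bounds test */
}

int main(int argc, char **argv) {
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int dest = (rank + 1) % size, src = (rank - 1 + size) % size;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for (int group = 0; group < NG; group++) {
            MPI_Request req[2];            /* used only by the master */
            #pragma omp master
            {
                MPI_Isend(snd_buf, BUF, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req[0]);
                MPI_Irecv(recv_buf, BUF, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &req[1]);
            }
            if (valid(tid, group))
                for (int i = tid; i < BUF; i += omp_get_num_threads())
                    work[i] += 1.0;        /* Compute(tile) stand-in */
            #pragma omp master
            {
                MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
            }
            #pragma omp barrier            /* all threads wait for the group */
        }
    }

    MPI_Finalize();
    return 0;
}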

  20. Overview • Introduction • Pure Message-passing Model • Hybrid Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work IPDPS 2004

  21. Experimental Results • 8-node SMP Linux Cluster (800 MHz PIII, 128 MB RAM, kernel 2.4.20) • MPICH v.1.2.5 (--with-device=ch_p4, --with-comm=shared) • Intel C++ compiler 7.0 (-O3 -mcpu=pentiumpro -static) • FastEthernet interconnection • ADI micro-kernel benchmark (3D) IPDPS 2004

  22. Alternating Direction Implicit (ADI) • Stencil computation used for solving partial differential equations • Unitary data dependencies • 3D iteration space (X x Y x Z) IPDPS 2004
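
For readers unfamiliar with this dependence pattern, the tiny C kernel below (an illustration of the unitary flow dependencies only, not the actual ADI computation) shows why the points cannot all be updated independently: every interior point needs the already-updated values of its three predecessors along i, j and k, which is exactly the structure the tiling and hyperplane scheduling exploit.

/* Illustration of unit-distance flow dependencies in a 3-D sweep
 * (placeholder sizes and values; not the real ADI kernel). */
#include <stdio.h>

#define X 4
#define Y 4
#define Z 4

static double a[X][Y][Z];

int main(void) {
    /* initialise the three boundary faces */
    for (int i = 0; i < X; i++)
        for (int j = 0; j < Y; j++)
            for (int k = 0; k < Z; k++)
                a[i][j][k] = (i == 0 || j == 0 || k == 0) ? 1.0 : 0.0;

    /* each point depends on its i-1, j-1 and k-1 predecessors */
    for (int i = 1; i < X; i++)
        for (int j = 1; j < Y; j++)
            for (int k = 1; k < Z; k++)
                a[i][j][k] = (a[i-1][j][k] + a[i][j-1][k] + a[i][j][k-1]) / 3.0;

    printf("a[%d][%d][%d] = %f\n", X-1, Y-1, Z-1, a[X-1][Y-1][Z-1]);
    return 0;
}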

  23. ADI – 2 dual SMP nodes IPDPS 2004

  24. ADI X=128 Y=512 Z=8192 – 2 nodes IPDPS 2004

  25. ADI X=256 Y=512 Z=8192 – 2 nodes IPDPS 2004

  26. ADI X=512 Y=512 Z=8192 – 2 nodes IPDPS 2004

  27. ADI X=512 Y=256 Z=8192 – 2 nodes IPDPS 2004

  28. ADI X=512 Y=128 Z=8192 – 2 nodes IPDPS 2004

  29. ADI X=128 Y=512 Z=8192 – 2 nodes [figure: computation vs. communication time breakdown] IPDPS 2004

  30. ADI X=512 Y=128 Z=8192 – 2 nodes [figure: computation vs. communication time breakdown] IPDPS 2004

  31. Overview • Introduction • Pure Message-passing Model • Hybrid Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work IPDPS 2004

  32. Conclusions • Tiled loop algorithms with arbitrary data dependencies can be adapted to the hybrid parallel programming paradigm • Hybrid models can be competitive with the pure message-passing paradigm • The coarse-grain hybrid model can be more efficient than the fine-grain one, but is also more complicated • Programming efficiently in OpenMP is not easier than programming efficiently in MPI IPDPS 2004

  33. Future Work • Application of methodology to real applications and standard benchmarks • Work balancing for coarse-grain model • Investigation of alternative topologies, irregular communication patterns • Performance evaluation on advanced interconnection networks (SCI, Myrinet) IPDPS 2004

  34. Thank You! Questions? IPDPS 2004
