Optimization of Collective Communication in Intra-Cell MPI


Presentation Transcript


  1. Optimization of Collective Communication in Intra-Cell MPI
  Ashok Srinivasan, Florida State University, asriniva@cs.fsu.edu
  Goals
  • Efficient implementation of collectives for intra-Cell MPI
  • Evaluate the impact of different algorithms on performance
  Collaborators: A. Kumar1, G. Senthilkumar1, M. Krishna1, N. Jayam1, P.K. Baruah1, R. Sarma1, S. Kapoor2
  1 Sri Sathya Sai University, Prashanthi Nilayam, India; 2 IBM, Austin
  Acknowledgment: IBM, for providing access to a Cell blade under the VLP program

  2. Outline
  • Cell Architecture
  • Intra-Cell MPI Design Choices
  • Barrier
  • Broadcast
  • Reduce
  • Conclusions and Future Work

  3. Cell Architecture
  [Figure: DMA put times; memory-to-memory copy using (i) the SPE local store and (ii) memcpy by the PPE]
  • A PowerPC core (PPE) with 8 co-processors (SPEs), each with a 256 KB local store
  • Shared 512 MB to 2 GB main memory, which the SPEs access through DMA (see the sketch below)
  • Peak SPE speeds of 204.8 Gflops in single precision and 14.64 Gflops in double precision
  • 204.8 GB/s EIB bandwidth, 25.6 GB/s bandwidth to main memory
  • Two Cell processors can be combined to form a Cell blade with global shared memory
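
The DMA mechanism mentioned above is the only way an SPE reaches main memory, so every collective ultimately reduces to sequences of get/put transfers. Below is a minimal sketch of such a transfer using the standard spu_mfcio.h intrinsics; the buffer, chunk size, and effective addresses are illustrative and not taken from the implementation discussed here.

```c
/* SPE-side sketch: pull one chunk from main memory into the local store,
 * work on it, and push the result back.  ea_src/ea_dst are effective
 * addresses in main memory; CHUNK and ls_buf are illustrative. */
#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 16384   /* 16 KB: the maximum size of a single DMA transfer */

static uint8_t ls_buf[CHUNK] __attribute__((aligned(128)));

void copy_chunk(uint64_t ea_src, uint64_t ea_dst)
{
    const uint32_t tag = 0;

    mfc_get(ls_buf, ea_src, CHUNK, tag, 0, 0);   /* main memory -> local store */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();                   /* block until the get completes */

    /* ... operate on ls_buf in the local store ... */

    mfc_put(ls_buf, ea_dst, CHUNK, tag, 0, 0);   /* local store -> main memory */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();                   /* block until the put completes */
}
```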

  4. Intra-Cell MPI Design Choices
  • Cell features
    • In-order execution, but DMAs can be out of order
    • Over 100 simultaneous DMAs can be in flight
  • Constraints
    • Unconventional, heterogeneous architecture
    • SPEs have limited functionality, and can act directly only on their local stores
    • SPEs access main memory through DMA
    • Use of the PPE should be limited to get good performance
  • MPI design choices
    • Application data in: (i) local store or (ii) main memory
    • MPI data in: (i) local store or (ii) main memory
    • PPE involvement: (i) active or (ii) only during initialization and finalization
    • Collective calls can: (i) synchronize or (ii) not synchronize

  5. Barrier (1)
  • OTA List: the "root" receives a notification from all others, and then acknowledges them through a single DMA list
  • OTA: like OTA List, but the root notifies the others through individual non-blocking DMAs
  • SIG: like OTA, but the others notify the root through a signal register in OR mode
  • Degree-k TREE (a sketch of this pattern follows below)
    • Each node has up to k-1 children
    • In the first phase, children notify their parents
    • In the second phase, parents acknowledge their children
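
The degree-k TREE barrier above is a gather phase followed by a release phase. The sketch below illustrates that pattern in plain C, with shared flag words standing in for the DMA or signal-register notifications an intra-Cell implementation would use; the per-rank epoch counter avoids resetting flags between barriers. All names and bounds are illustrative.

```c
/* Illustration of the degree-k TREE barrier pattern (not the Cell code itself):
 * children notify their parent, then parents acknowledge their children. */
#include <stdatomic.h>

#define MAX_PROCS 16

static atomic_int arrive[MAX_PROCS];        /* child -> parent notification */
static atomic_int release_flag[MAX_PROCS];  /* parent -> child acknowledgement */

void tree_barrier(int rank, int nprocs, int k)
{
    static _Thread_local int epoch = 0;     /* this rank's barrier counter */
    epoch++;

    int parent      = (rank == 0) ? -1 : (rank - 1) / (k - 1);
    int first_child = rank * (k - 1) + 1;
    int last_child  = first_child + (k - 1);
    if (last_child > nprocs) last_child = nprocs;

    /* Phase 1: gather -- wait for all children, then notify the parent. */
    for (int c = first_child; c < last_child; c++)
        while (atomic_load(&arrive[c]) < epoch)
            ;                                   /* spin */
    if (parent >= 0)
        atomic_store(&arrive[rank], epoch);

    /* Phase 2: release -- wait for the parent, then acknowledge the children. */
    if (parent >= 0)
        while (atomic_load(&release_flag[rank]) < epoch)
            ;
    for (int c = first_child; c < last_child; c++)
        atomic_store(&release_flag[c], epoch);
}
```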

  6. Barrier (2)
  • PE: Consider the SPUs to be a logical hypercube – in each step, each SPU exchanges messages with its neighbor along one dimension
  • DIS (dissemination, sketched below): in step i, SPU j sends to SPU j + 2^i and receives from SPU j – 2^i
  [Figure: Comparison of MPI_Barrier on different hardware]
  • Alternatives
    • Atomic increments in main memory – several microseconds
    • PPE coordinates using its mailbox – tens of microseconds
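
A minimal sketch of the DIS (dissemination) pattern, again with shared flags standing in for the DMA notifications: in round i each rank signals the rank 2^i ahead of it and waits on the rank 2^i behind it, so the barrier completes in ceil(log2 n) rounds. Names and bounds are illustrative.

```c
/* Illustration of the dissemination barrier pattern. */
#include <stdatomic.h>

#define MAX_PROCS  16
#define MAX_ROUNDS 4                 /* ceil(log2(MAX_PROCS)) */

static atomic_int dis_flag[MAX_ROUNDS][MAX_PROCS];

void dissemination_barrier(int rank, int nprocs)
{
    static _Thread_local int epoch = 0;   /* this rank's barrier counter */
    epoch++;

    for (int i = 0, dist = 1; dist < nprocs; i++, dist *= 2) {
        int to = (rank + dist) % nprocs;

        atomic_store(&dis_flag[i][to], epoch);        /* notify rank j + 2^i */
        while (atomic_load(&dis_flag[i][rank]) < epoch)
            ;                                          /* wait on rank j - 2^i */
    }
}
```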

  7. Broadcast (1)
  [Figure: OTA on 4 SPUs]
  • OTA: each SPE copies the data to its own location (a sketch of this pattern follows below)
    • Different shifts are used to avoid hotspots in memory
    • With larger numbers of SPUs, the different shifts yield results that are close to each other
  [Figure: AG on 16 SPUs]
  • AG: each SPE is responsible for a different portion of the data
    • Different minimum sizes are tried
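
The OTA broadcast with shifts can be pictured as below: every SPE copies the whole buffer, but each starts from a different block so the SPEs do not all read the same region of main memory at once. memcpy stands in for the local-store-staged DMA copies, and the block granularity is an assumption rather than the measured implementation.

```c
/* Illustration of the OTA broadcast pattern with rank-dependent shifts. */
#include <string.h>
#include <stddef.h>

void ota_bcast_copy(const char *root_buf, char *my_buf,
                    size_t nbytes, int rank, int nprocs)
{
    /* Divide the buffer into nprocs blocks and start with block "rank",
     * wrapping around until the whole buffer has been copied. */
    size_t block = (nbytes + nprocs - 1) / nprocs;

    for (int s = 0; s < nprocs; s++) {
        size_t start = ((size_t)((rank + s) % nprocs)) * block;
        if (start >= nbytes)
            continue;
        size_t len = (start + block > nbytes) ? nbytes - start : block;
        memcpy(my_buf + start, root_buf + start, len);
    }
}
```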

  8. Broadcast (2)
  [Figure: TREEMM on 12 SPUs]
  • TREEMM: tree-structured Send/Recv-type implementation
    • Data for degrees 2 and 4 are close
    • Degree 3 is best, or close to it, for all SPU counts
  [Figure: TREE on 16 SPUs]
  • TREE: pipelined tree-structured communication based on the local stores (a sketch follows below)
    • Results for other SPU counts are similar to this figure
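
The pipelined TREE broadcast forwards the message through the local stores a chunk at a time, so different chunks are in flight at different levels of the tree. The sketch below shows that structure; recv_chunk_from() and send_chunk_to() are hypothetical stand-ins for the inter-local-store DMAs, the chunk size is illustrative, and a real implementation would also double-buffer the chunks to overlap the transfers.

```c
/* Illustration of the pipelined tree broadcast pattern. */
#include <string.h>

#define CHUNK_BYTES 4096                  /* illustrative chunk size */

void recv_chunk_from(int parent, void *ls_buf, int nbytes);    /* hypothetical */
void send_chunk_to(int child, const void *ls_buf, int nbytes); /* hypothetical */

void tree_bcast(char *msg, int nbytes, int rank, int nprocs, int k)
{
    static char ls_buf[CHUNK_BYTES];      /* stand-in for a local-store buffer */
    int parent      = (rank == 0) ? -1 : (rank - 1) / (k - 1);
    int first_child = rank * (k - 1) + 1;

    for (int off = 0; off < nbytes; off += CHUNK_BYTES) {
        int len = (nbytes - off < CHUNK_BYTES) ? nbytes - off : CHUNK_BYTES;

        if (parent >= 0)
            recv_chunk_from(parent, ls_buf, len);   /* wait for this chunk */
        else
            memcpy(ls_buf, msg + off, len);         /* root stages it from its buffer */

        for (int c = first_child; c < first_child + (k - 1) && c < nprocs; c++)
            send_chunk_to(c, ls_buf, len);          /* forward the chunk down the tree */

        if (parent >= 0)
            memcpy(msg + off, ls_buf, len);         /* deliver the chunk locally */
    }
}
```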

  9. Broadcast (3)
  [Figure: Broadcast on 16 SPEs (2 processors)]
  • TREE: pipelined tree-structured communication based on local stores
  • TREEMM: tree-structured Send/Recv-type implementation
  • AG: each SPE is responsible for a different portion of the data
  • OTA: each SPE copies the data to its own location
  • G: the root copies all the data
  [Figure: Broadcast with a good choice of algorithm for each data size and SPE count; the maximum main memory bandwidth is also shown]
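
The "good choice of algorithm" curve implies a dispatch on message size and SPE count. A sketch of such a dispatch is shown below; the thresholds and enum names are purely illustrative and are not the measured crossover points from these experiments.

```c
/* Illustrative algorithm selection for broadcast. */
#include <stddef.h>

typedef enum { BCAST_G, BCAST_OTA, BCAST_AG, BCAST_TREEMM, BCAST_TREE } bcast_alg_t;

bcast_alg_t choose_bcast(size_t nbytes, int nspes)
{
    if (nbytes <= 2048)          /* tiny message: let the root copy everything */
        return BCAST_G;
    if (nbytes <= 32 * 1024)     /* small message: tree of Send/Recv-style copies */
        return BCAST_TREEMM;
    if (nspes <= 4)              /* few SPEs: one-to-all copies suffice */
        return BCAST_OTA;
    return BCAST_TREE;           /* large messages: pipelined local-store tree */
}
```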

  10. Broadcast (4)
  • Each node of the NEC SX-8 has 8 vector processors capable of 16 Gflop/s, with 64 GB/s of memory bandwidth per processor
    • The total memory bandwidth for a node is 512 GB/s
    • Nodes are connected through a crossbar switch capable of 16 GB/s in each direction
  • The SGI Altix is a CC-NUMA system with a global shared memory
    • Each node contains eight Itanium 2 processors
    • Nodes are connected using NUMALINK4; the bandwidth between processors on a node is 3.2 GB/s, and between nodes 1.6 GB/s
  [Figure: Comparison of MPI_Bcast on different hardware]

  11. Reduce
  [Figure: Reduce of MPI_INT with MPI_SUM on 16 SPUs]
  • Similar trends were observed for other SPU counts
  [Figure: Comparison of MPI_Reduce on different hardware]
  • Each node of the IBM SP was a 16-processor SMP
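
One plausible structure for the MPI_SUM reduction over MPI_INT mirrors the tree broadcast with the data flowing upward: each parent accumulates its children's partial sums into its own buffer. This is only an illustration of the pattern, not necessarily the implementation measured here; recv_ints_from() is a hypothetical stand-in for the DMA transfer of a child's partial result.

```c
/* Illustration of a degree-k tree reduction with MPI_SUM on MPI_INT. */
void recv_ints_from(int child, int *buf, int count);   /* hypothetical */

void tree_reduce_sum(int *data, int count, int rank, int nprocs, int k, int *tmp)
{
    int first_child = rank * (k - 1) + 1;

    for (int c = first_child; c < first_child + (k - 1) && c < nprocs; c++) {
        recv_ints_from(c, tmp, count);      /* child's partial result */
        for (int i = 0; i < count; i++)
            data[i] += tmp[i];              /* MPI_SUM on MPI_INT */
    }
    /* A non-root rank would now send "data" (its partial sum) to its parent. */
}
```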

  12. Conclusions and Future Work
  Conclusions
  • The Cell processor has good potential for MPI implementations
  • The PPE should have a limited role
  • High bandwidth and low latency are achievable even with application data in main memory
    • But the local store should be used effectively, with double buffering to hide DMA latency (sketched below)
    • Main memory bandwidth is then the bottleneck
  Current and future work
  • Implemented: collective communication operations optimized for contiguous data
  • Future work: optimize collectives for derived data types with non-contiguous data
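
The double-buffering point above is sketched below: while the SPE processes one local-store buffer, the DMA for the next chunk is already in flight, so DMA latency is hidden and main memory bandwidth becomes the limit. It uses the standard spu_mfcio.h intrinsics; the chunk size, process_chunk(), and the assumption that the total size is a multiple of the chunk size are all illustrative.

```c
/* SPE-side sketch of streaming data from main memory with double buffering. */
#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 16384                          /* 16 KB = one DMA transfer */

static uint8_t ls[2][CHUNK] __attribute__((aligned(128)));

void process_chunk(uint8_t *buf, uint32_t len);   /* hypothetical work on a chunk */

void stream_from_memory(uint64_t ea, uint32_t nbytes)  /* nbytes: multiple of CHUNK */
{
    uint32_t nchunks = nbytes / CHUNK;

    mfc_get(ls[0], ea, CHUNK, 0, 0, 0);      /* prefetch chunk 0 on tag 0 */

    for (uint32_t i = 0; i < nchunks; i++) {
        uint32_t cur = i & 1, nxt = cur ^ 1;

        if (i + 1 < nchunks)                 /* start chunk i+1 on the other tag */
            mfc_get(ls[nxt], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);

        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();           /* wait only for chunk i */

        process_chunk(ls[cur], CHUNK);       /* compute overlaps the next DMA */
    }
}
```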
