

1. Advances in PUMI for High Core Count Machines
• Dan Ibanez, Micah Corah, Seegyoung Seol, Mark Shephard
• 2/27/2013
• Scientific Computation Research Center, Rensselaer Polytechnic Institute

2. Outline
• Distributed Mesh Data Structure
• Phased Message Passing
• Hybrid (MPI/thread) Programming Model
• Hybrid Phased Message Passing
• Hybrid Partitioning
• Hybrid Mesh Migration

3. Unstructured Mesh Data Structure
(Diagram: a mesh part stored as a hierarchy of regions, faces, edges, and vertices, linked by pointers in the data structure.)

4. Distributed Mesh Representation
• Mesh elements are assigned to parts
• Entities are uniquely identified by a handle or a global ID
• A part is treated as a serial mesh with the addition of part boundaries
• Part boundary: groups of mesh entities on shared links between parts
• Remote copy: a duplicate copy of an entity on a non-local part
• Resident part set: the list of parts on which the entity exists
• A process can hold multiple parts
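To make the bookkeeping above concrete, here is a minimal C++ sketch of what a distributed mesh entity carries; the type and field names are illustrative assumptions, not the actual PUMI/FMDB classes.

```cpp
// A minimal sketch, assuming a part is identified by an integer ID and a
// remote copy is recorded as (part ID, handle on that part).
#include <map>
#include <set>

typedef long EntityHandle;                    // opaque handle on the owning part

struct MeshEntity {
  EntityHandle handle;                        // unique handle on this part
  long global_id;                             // globally unique ID
  std::set<int> resident_parts;               // all parts where the entity exists
  std::map<int, EntityHandle> remote_copies;  // part ID -> handle of the copy there

  // An entity on more than one part lies on a part boundary.
  bool on_part_boundary() const { return resident_parts.size() > 1; }
};
```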

5. Message Passing
• Primitive functional set:
  • Size – number of members in the group
  • Rank – ID of self within the group
  • Send – non-blocking synchronous send
  • Probe – non-blocking probe
  • Receive – blocking receive
• Non-blocking barrier (ibarrier):
  • API call 1: begin the ibarrier
  • API call 2: wait for ibarrier termination
  • Used for phased message passing
  • Will be available in MPI-3; for now a custom implementation is used
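As a rough illustration, the primitive set above maps almost directly onto plain MPI calls; the wrapper names below are illustrative, not the actual PUMI/PCU API.

```cpp
// A minimal sketch of the primitive functional set in terms of MPI.
#include <mpi.h>

int comm_size(MPI_Comm c) { int n; MPI_Comm_size(c, &n); return n; }  // Size
int comm_rank(MPI_Comm c) { int r; MPI_Comm_rank(c, &r); return r; }  // Rank

// Send: non-blocking *synchronous* send, so completion implies receipt.
void send(const void* buf, int count, int to, int tag,
          MPI_Comm c, MPI_Request* req) {
  MPI_Issend(buf, count, MPI_CHAR, to, tag, c, req);
}

// Probe: non-blocking probe for any incoming message with this tag.
bool probe(int tag, MPI_Comm c, MPI_Status* status) {
  int flag;
  MPI_Iprobe(MPI_ANY_SOURCE, tag, c, &flag, status);
  return flag != 0;
}

// Receive: blocking receive of a message already found by probe.
void receive(void* buf, int count, const MPI_Status& st, MPI_Comm c) {
  MPI_Recv(buf, count, MPI_CHAR, st.MPI_SOURCE, st.MPI_TAG, c,
           MPI_STATUS_IGNORE);
}
```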

6. ibarrier Implementation
• Built entirely from non-blocking point-to-point calls
• For N ranks, messages travel lg(N) hops to rank 0 and back (a reduce followed by a broadcast)
• Uses a separate MPI communicator
(Diagram: five ranks performing a reduce to rank 0 followed by a broadcast.)
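Below is a minimal sketch of such a custom non-blocking barrier, assuming a binary tree rooted at rank 0: each rank waits for its children, notifies its parent, waits for the parent's release, and forwards the release to its children, using only non-blocking point-to-point calls on a dedicated communicator. The class and method names are illustrative, not the actual implementation; MPI-3 provides this directly as MPI_Ibarrier.

```cpp
#include <mpi.h>
#include <vector>

// State machine: gather arrivals from children, notify the parent,
// wait for the parent's release, forward the release to the children.
class IBarrier {
  MPI_Comm comm;                      // dedicated communicator for barrier traffic
  int rank, size, parent, kids[2], nkids;
  enum State { GATHER, NOTIFY, RELEASE, FORWARD, DONE } state;
  std::vector<MPI_Request> reqs;
  char arrive[2];                     // receive buffers for children's arrivals
  char token;                         // buffer for exchanges with the parent

  bool reqs_done() {
    int flag = 1;
    if (!reqs.empty())
      MPI_Testall((int)reqs.size(), reqs.data(), &flag, MPI_STATUSES_IGNORE);
    return flag != 0;
  }
public:
  explicit IBarrier(MPI_Comm user) : nkids(0), state(DONE), token(0) {
    MPI_Comm_dup(user, &comm);        // isolate barrier messages from user messages
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    parent = (rank - 1) / 2;
    for (int c = 2 * rank + 1; c <= 2 * rank + 2 && c < size; ++c)
      kids[nkids++] = c;
  }
  ~IBarrier() { MPI_Comm_free(&comm); }
  void begin() {                      // API call 1: begin the ibarrier
    reqs.assign(nkids, MPI_REQUEST_NULL);
    for (int i = 0; i < nkids; ++i)   // wait for each subtree to arrive
      MPI_Irecv(&arrive[i], 1, MPI_CHAR, kids[i], 0, comm, &reqs[i]);
    state = GATHER;
  }
  bool done() {                       // API call 2: poll for termination
    if (state == GATHER && reqs_done()) {
      reqs.clear();
      if (rank != 0) {                // tell the parent this subtree has arrived
        reqs.resize(1);
        MPI_Isend(&token, 1, MPI_CHAR, parent, 0, comm, &reqs[0]);
      }
      state = NOTIFY;
    }
    if (state == NOTIFY && reqs_done()) {
      reqs.clear();
      if (rank != 0) {                // wait for the release from the parent
        reqs.resize(1);
        MPI_Irecv(&token, 1, MPI_CHAR, parent, 1, comm, &reqs[0]);
      }
      state = RELEASE;
    }
    if (state == RELEASE && reqs_done()) {
      reqs.assign(nkids, MPI_REQUEST_NULL);
      for (int i = 0; i < nkids; ++i) // forward the release down the tree
        MPI_Isend(&arrive[i], 1, MPI_CHAR, kids[i], 1, comm, &reqs[i]);
      state = FORWARD;
    }
    if (state == FORWARD && reqs_done()) state = DONE;
    return state == DONE;
  }
};
```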

7. Phased Message Passing
• Similar to Bulk Synchronous Parallel; uses the non-blocking barrier
• Begin phase, send all messages, receive any messages sent this phase, end phase
• Benefits:
  • Efficient termination detection when neighbors are unknown
  • Phases are implicit barriers, which simplifies algorithms
  • Allows buffering all messages per rank per phase

8. Phased Message Passing
• Implementation:
  • Post all sends for this phase
  • While local sends are incomplete: receive any message
  • Local sends are now complete (they are synchronous, so the messages were received)
  • Begin the "stopped sending" ibarrier
  • While that ibarrier is incomplete: receive any message
  • All sends are now complete, so receiving can stop
  • Begin the "stopped receiving" ibarrier
  • While that ibarrier is incomplete: compute
  • All ranks have now stopped receiving, so it is safe to send the next phase
  • Repeat
(Diagram: interleaved send/receive timelines, with the ibarriers acting as signal edges.)
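The following is a minimal sketch of one phase, using MPI-3's MPI_Ibarrier in place of the custom ibarrier above. The synchronous sends are what make "my sends completed" equivalent to "my messages were received". The function signature, buffering, and callback are simplifications, not the PUMI phased API.

```cpp
#include <mpi.h>
#include <utility>
#include <vector>

// outgoing: (destination rank, message bytes) pairs for this phase.
// handle:   callback invoked for every message received this phase.
void run_phase(MPI_Comm comm,
               const std::vector<std::pair<int, std::vector<char> > >& outgoing,
               void (*handle)(int from, const std::vector<char>& data)) {
  std::vector<MPI_Request> sends(outgoing.size());
  for (size_t i = 0; i < outgoing.size(); ++i)   // post all synchronous sends
    MPI_Issend(outgoing[i].second.data(), (int)outgoing[i].second.size(),
               MPI_CHAR, outgoing[i].first, 0, comm, &sends[i]);

  auto receive_any = [&]() {                     // drain one message if present
    int flag; MPI_Status st;
    MPI_Iprobe(MPI_ANY_SOURCE, 0, comm, &flag, &st);
    if (!flag) return;
    int n; MPI_Get_count(&st, MPI_CHAR, &n);
    std::vector<char> buf(n);
    MPI_Recv(buf.data(), n, MPI_CHAR, st.MPI_SOURCE, 0, comm, MPI_STATUS_IGNORE);
    handle(st.MPI_SOURCE, buf);
  };

  int done = 0;
  while (!done) {                                // my sends are not yet matched
    receive_any();
    MPI_Testall((int)sends.size(), sends.data(), &done, MPI_STATUSES_IGNORE);
  }
  MPI_Request bar;                               // announce "I stopped sending"
  MPI_Ibarrier(comm, &bar);
  for (done = 0; !done; MPI_Test(&bar, &done, MPI_STATUS_IGNORE))
    receive_any();                               // others may still be sending
  MPI_Ibarrier(comm, &bar);                      // "everyone stopped receiving"
  MPI_Wait(&bar, MPI_STATUS_IGNORE);             // (computation could overlap here)
  // Safe to start the next phase.
}
```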

9. Hybrid System
(Diagram: mapping a program onto a Blue Gene/Q machine — the program maps to a node, a process to the node's cores, and threads to individual cores; processes per node and threads per core are variable.)

10. Hybrid Programming System
• Model 1: message passing, the de facto standard programming model for distributed-memory architectures
• Model 2: the classic shared-memory programming model — mutexes, atomic operations, lockless structures
• Most massively parallel code currently uses model 1
• The models are very different, and converting code from model 1 to model 2 is hard

11. Hybrid Programming System
• We will try message passing between threads
• Threads can send to other threads in the same process and to threads in a different process
• Same model as MPI, with "process" replaced by "thread"
• Porting is faster: only the message passing API changes
• Shared memory is still exploited; locking is replaced by messages:

  Lock-based version:
    Thread 1: Write(A); Release(lockA)
    Thread 2: Lock(lockA); Write(A)

  becomes the message-based version:
    Thread 1: Write(A); SendTo(2)
    Thread 2: ReceiveFrom(1); Write(A)
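Here is a minimal, self-contained sketch of that transformation using C++11 threads: thread 1 writes A and then "sends" to thread 2, which "receives" before writing A. The one-slot Mailbox stands in for the real thread-level message passing layer and is an illustrative assumption, not PUMI code.

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

struct Mailbox {                 // a one-slot blocking channel between threads
  std::mutex m;
  std::condition_variable cv;
  bool has_message = false;
  void send() {                  // SendTo(2) in the slide's notation
    std::lock_guard<std::mutex> lk(m);
    has_message = true;
    cv.notify_one();
  }
  void receive() {               // ReceiveFrom(1) in the slide's notation
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [this] { return has_message; });
    has_message = false;
  }
};

int main() {
  int A = 0;
  Mailbox box;
  std::thread t1([&] { A = 1;  box.send(); });     // Write(A); SendTo(2)
  std::thread t2([&] { box.receive();  A = 2; });  // ReceiveFrom(1); Write(A)
  t1.join();
  t2.join();
  std::cout << "A = " << A << "\n";                // prints A = 2, no data race
  return 0;
}
```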

12. Parallel Control Utility
• Multi-threading API for hybrid MPI/thread mode:
  • Launch a function pointer on N threads
  • Get the thread ID and the number of threads in the process
  • Uses pthreads directly
• Phased communication API:
  • Send messages in batches per phase, detect the end of the phase
• Hybrid MPI/thread communication API:
  • Uses hybrid ranks and sizes
  • Same phased API; automatically switches to hybrid mode when called within threads
• Future: hardware queries by wrapping hwloc (Portable Hardware Locality, http://www.open-mpi.org/projects/hwloc/)
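A minimal sketch of the "launch a function pointer on N threads" piece using pthreads directly, as the slide describes; the function names and the use of thread-specific data for the thread ID are illustrative assumptions, not the actual utility.

```cpp
#include <pthread.h>
#include <vector>

static pthread_key_t tid_key;          // per-thread slot holding the thread ID
static int nthreads_in_process = 1;

struct LaunchArg { void (*fn)(void*); void* user; long tid; };

static void* trampoline(void* p) {
  LaunchArg* a = static_cast<LaunchArg*>(p);
  pthread_setspecific(tid_key, reinterpret_cast<void*>(a->tid));
  a->fn(a->user);                      // run the user's function on this thread
  return 0;
}

int thread_id()    { return (int)reinterpret_cast<long>(pthread_getspecific(tid_key)); }
int thread_count() { return nthreads_in_process; }

// Launch fn(user) on n threads; the calling thread becomes thread 0.
void launch_threads(int n, void (*fn)(void*), void* user) {
  nthreads_in_process = n;
  pthread_key_create(&tid_key, 0);
  std::vector<pthread_t> threads(n - 1);
  std::vector<LaunchArg> args(n);
  for (int i = 1; i < n; ++i) {
    args[i].fn = fn; args[i].user = user; args[i].tid = i;
    pthread_create(&threads[i - 1], 0, trampoline, &args[i]);
  }
  args[0].fn = fn; args[0].user = user; args[0].tid = 0;
  trampoline(&args[0]);                // thread 0 is the caller itself
  for (int i = 1; i < n; ++i)
    pthread_join(threads[i - 1], 0);
  pthread_key_delete(tid_key);
}
```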

13. Hybrid Message Passing
• Everything is built from the primitives, so hybrid primitives are needed:
  • Size: number of threads on the whole machine
  • Rank: machine-unique ID of the thread
  • Send, Probe, and Receive using hybrid ranks

  Process rank:  0  0  0  0  1  1  1  1
  Thread rank:   0  1  2  3  0  1  2  3
  Hybrid rank:   0  1  2  3  4  5  6  7

14. Hybrid Message Passing
• Initial simple hybrid primitives: just wrap the MPI primitives
• MPI_Init_thread with MPI_THREAD_MULTIPLE
• MPI rank = floor(hybrid rank / threads per process)
• MPI tag bit fields: from thread | to thread | hybrid tag
(Diagram: layered stacks — MPI mode: phased / ibarrier / MPI primitives; hybrid mode: phased / ibarrier / hybrid primitives / MPI primitives.)
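A minimal sketch of the rank arithmetic and tag packing implied by slides 13 and 14; the bit-field widths below are an assumption for illustration, and a real implementation would size them against the communicator's MPI_TAG_UB. Function names are illustrative, not the PCU API.

```cpp
static int threads_per_process = 4;  // configurable (see slide 9)

// Hybrid rank space: process rank * threads per process + thread ID.
int hybrid_size(int mpi_size)          { return mpi_size * threads_per_process; }
int hybrid_rank(int mpi_rank, int tid) { return mpi_rank * threads_per_process + tid; }
int mpi_rank_of(int hybrid_rank)       { return hybrid_rank / threads_per_process; }
int thread_of(int hybrid_rank)         { return hybrid_rank % threads_per_process; }

// Pack (from thread, to thread, hybrid tag) into one MPI tag so that a
// receiving thread can pick out messages addressed to it under
// MPI_THREAD_MULTIPLE. Assumed layout: 8-bit thread fields, 12-bit tag.
int pack_tag(int from_thread, int to_thread, int hybrid_tag) {
  return (from_thread << 20) | (to_thread << 12) | (hybrid_tag & 0xfff);
}
int from_thread_of(int mpi_tag) { return (mpi_tag >> 20) & 0xff; }
int to_thread_of(int mpi_tag)   { return (mpi_tag >> 12) & 0xff; }
int hybrid_tag_of(int mpi_tag)  { return mpi_tag & 0xfff; }
```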

15. Hybrid Partitioning
• Partition the mesh to processes, then partition each piece to threads
• Map parts to threads, 1-to-1
• Share entities on inter-thread part boundaries
(Diagram: four MPI processes communicating via MPI, each holding several parts with one pthread per part.)

16. Hybrid Partitioning
• Entities are shared within a process
• A part boundary entity is created once per process
• A part boundary entity is shared by all local parts
• Only the owning part may modify an entity, which avoids almost all contention
• Remote copy: a duplicate entity copy on another process
• The parallel control utility can provide architecture information to the mesh, which is distributed accordingly
(Diagram: parts P0, P1, P2 spanning meshes on processes i and j, with the inter-process boundary explicit and the intra-process part boundaries implicit.)

17. Mesh Migration
• Moving mesh entities between parts
• Input: local mesh elements to send to other parts
• The other entities to move are determined by adjacencies
• Complex subtasks:
  • Reconstructing mesh adjacencies
  • Restructuring the partition model
  • Recomputing remote copies
• Considerations:
  • Neighborhoods change: try to maintain scalability despite the loss of communication locality
  • How to benefit from shared memory

18. Mesh Migration
• Migration steps:
  • (A) Mark the destination part ID
  • (B) Get the affected entities and compute their post-migration resident parts
  • (C) Exchange entities and update the part boundary
  • (D) Delete the migrated entities
(Diagram: a mesh distributed over parts P0, P1, P2, shown at each of the four steps.)

19. Hybrid Migration
• Shared memory optimizations:
  • Thread-to-part matching: use the partition model for concurrency
  • Threads handle the part boundary entities which they own
  • Other entities are "released"
• Inter-process entity movement: send the entity to one thread per process
• Intra-process entity movement: send a message containing a pointer
(Diagram: four threads releasing and grabbing shared part boundary entities.)

20. Hybrid Migration
• Release shared entities
• Update entity resident part sets
• Move entities between processes
• Move entities between threads
• Grab shared entities
• Two-level temporary ownership: Master and Process Master
  • Master: the smallest resident part ID
  • Process Master: the smallest on-process resident part ID
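A minimal sketch of the two-level ownership rule on this slide, assuming parts are numbered contiguously by process so that a part's host process can be computed arithmetically; the names and the part-to-process mapping are illustrative assumptions.

```cpp
#include <set>

typedef std::set<int> ResidentPartSet;  // sorted, so *begin() is the smallest ID

// Master: the smallest part ID in the entity's resident part set.
int master_part(const ResidentPartSet& residence) {
  return *residence.begin();
}

// Process Master: the smallest resident part ID hosted on this process.
int process_master_part(const ResidentPartSet& residence,
                        int parts_per_process, int my_process) {
  for (ResidentPartSet::const_iterator it = residence.begin();
       it != residence.end(); ++it)     // ascending order of part ID
    if (*it / parts_per_process == my_process)
      return *it;
  return -1;                            // entity has no copy on this process
}
```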

21. Hybrid Migration
• Representative phase:
  • The old Master part sends the entity to the new Process Master parts
  • Receivers bounce back the addresses of the created entities
  • Senders broadcast the union of all addresses
• Example: old resident parts {1,2,3}, new resident parts {5,6,7}
(Diagram: the old Master part sending "data to create copy" to the new Process Master parts, which reply with the address of the local copy; the sender then broadcasts the addresses of all copies.)

22. Hybrid Migration
• Many subtle complexities:
  • Most steps have to be done one dimension at a time
  • Assigning upward adjacencies causes thread contention:
    • Use a separate communication phase to create them
    • Use another phase to remove them when entities are deleted
  • Assigning downward adjacencies requires addresses on the new process:
    • Use a separate phase to gather remote copies

23. Preliminary Results
• Model: bi-unit cube
• Mesh: 260K tets, 16 parts
• Migration: sort by X coordinate

24. Preliminary Results
• First test of the hybrid algorithm, using 1 node of the CCNI Blue Gene/Q
• Case 1: 16 MPI ranks, 1 thread per rank
  • 18.36 seconds for migration
  • 433 MB mesh memory use (sum over all MPI ranks)
• Case 2: 1 MPI rank, 16 threads per rank
  • 9.62 seconds for migration plus thread create/join
  • 157 MB mesh memory use (sum over all threads)

25. Thank You
• Seegyoung Seol – FMDB architect, part boundary sharing
• Micah Corah – SCOREC undergraduate, threaded part loading
