This presentation introduces Adaptive MPI (AMPI), a parallel programming framework designed to tackle the challenges posed by dynamically varying workloads in modern parallel applications. We discuss the motivation behind AMPI, its unique features such as user-level migratable threads and virtualization of resources, and the benefits it offers, including better load balancing and improved cache utilization. The conversion process from standard MPI codes to AMPI is also explained, along with methods for handling global/static variables. Join us to explore how AMPI enhances load balancing and fault tolerance through advanced techniques.
AMPI: Adaptive MPI Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign
Outline • Motivation • AMPI: Overview • Benefits of AMPI • Converting MPI Codes to AMPI • Handling Global/Static Variables • Running AMPI Programs • AMPI Status • AMPI References, Conclusion AMPI: Adaptive MPI
AMPI: Motivation • Challenges • New-generation parallel applications are: • Dynamically varying: load shifts during execution • Adaptively refined • Composed of multi-physics modules • Typical MPI implementations are: • Not naturally suitable for dynamic applications • The available processor set may not match the algorithm • Alternative: Adaptive MPI (AMPI) • MPI & Charm++ virtualization: VP (“Virtual Processors”) AMPI: Adaptive MPI
Outline • Motivation • AMPI: Overview • Benefits of AMPI • Converting MPI Codes to AMPI • Handling Global/Static Variables • Running AMPI Programs • AMPI Status • AMPI References, Conclusion AMPI: Adaptive MPI
AMPI: Overview • Virtualization: MPI ranks → Charm++ threads • MPI “tasks” are implemented as user-level migratable threads (VPs: virtual processors) mapped onto the real processors AMPI: Adaptive MPI
AMPI: Overview (cont.) • AMPI Execution Model: • Multiple user-level threads per process • Typically, one process per physical processor • Charm++ Scheduler coordinates execution • Threads (VPs) can migrate across processors • Virtualization ratio: R = #VP / #P (over-decomposition) • (Figure: Charm++ Scheduler driving the threads of one processor, with P=1, VP=4) AMPI: Adaptive MPI
AMPI: Overview (cont.) • AMPI’s Over-Decomposition in Practice • MPI: P=4, ranks=4 • AMPI: P=4, VP=ranks=16 AMPI: Adaptive MPI
Outline • Motivation • AMPI: Overview • Benefits of AMPI • Converting MPI Codes to AMPI • Handling Global/Static Variables • Running AMPI Programs • AMPI Status • AMPI References, Conclusion AMPI: Adaptive MPI
Benefits of AMPI • Overlap between Computation/Communication • Achieved automatically • When one thread blocks for a message, another thread on the same processor can execute • The Charm++ Scheduler picks the next thread among those that are ready to run AMPI: Adaptive MPI
Benefits of AMPI (cont.) • Potentially Better Cache Utilization • Gains occur when a subdomain is accessed repeatedly (e.g. by multiple functions called in sequence) • (Figure: the smaller AMPI subdomains might fit in cache, while the larger MPI subdomains might not) AMPI: Adaptive MPI
Benefits of AMPI (cont.) • Thread Migration for Load Balancing • Migration of thread 13: AMPI: Adaptive MPI
Benefits of AMPI (cont.) • Load Balancing in AMPI: MPI_Migrate() • Collective operation informing the load balancer that the threads can be migrated, if needed, for balancing load • Easy to insert in the code of iterative applications • Leverages Load-Balancing framework of Charm++ • Balancing decisions can be based on • Measured parameters: computation load, communication pattern • Application-provided information AMPI: Adaptive MPI
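As a rough illustration (not from the original slides), a minimal sketch of how the collective MPI_Migrate() call might be inserted into an iterative solver; compute_step() and the period LB_PERIOD are hypothetical placeholders, and MPI_Migrate() is assumed to be declared by AMPI's mpi.h as described above:
#include "mpi.h"
#define LB_PERIOD 100                      /* hypothetical migration period */
void compute_step(int it, MPI_Comm comm);  /* placeholder application work */
void solve(int nsteps, MPI_Comm comm) {
  for (int it = 0; it < nsteps; it++) {
    compute_step(it, comm);
    if (it > 0 && it % LB_PERIOD == 0)
      MPI_Migrate();    /* collective: lets the load balancer move threads */
  }
}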
Benefits of AMPI (cont.) • Decoupling of Physical/Virtual Processors • Problem setup: 3D stencil calculation of size 240³ run on Lemieux • AMPI runs on any # of PEs (e.g. 19, 33, 105), while native MPI needs P=K³ AMPI: Adaptive MPI
Benefits of AMPI (cont.) • Asynchronous Implementation of Collectives • The collective operation is posted and returns immediately • Test/wait for its completion; meanwhile, do useful work, e.g. MPI_Ialltoall( … , &req); /* other computation */ MPI_Wait(&req, MPI_STATUS_IGNORE); • Other operations available: MPI_Iallreduce, MPI_Iallgather • Example: 2D FFT benchmark (figure: time in ms) AMPI: Adaptive MPI
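A slightly fuller sketch of the overlap pattern above, assuming the (now standard) MPI-3-style nonblocking signature; do_other_work() and the buffer arguments are illustrative placeholders:
#include "mpi.h"
void do_other_work(void);  /* placeholder: computation independent of recvbuf */
void exchange_and_compute(double *sendbuf, double *recvbuf, int count,
                          MPI_Comm comm) {
  MPI_Request req;
  /* Post the all-to-all; the call returns immediately */
  MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                recvbuf, count, MPI_DOUBLE, comm, &req);
  do_other_work();                     /* overlap: useful work meanwhile */
  MPI_Wait(&req, MPI_STATUS_IGNORE);   /* recvbuf is valid after this */
}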
Motivation for Collective Communication Optimization • Time breakdown of an all-to-all operation using the Mesh library • Computation is only a small proportion of the elapsed time • A number of optimization techniques have been developed to improve collective communication performance AMPI: Adaptive MPI
Benefits of AMPI (cont.) • Fault Tolerance via Checkpoint/Restart • The state of the application is checkpointed to disk or memory • Capable of restarting on a different number of physical processors! • Synchronous checkpoint, collective call: • In-disk: MPI_Checkpoint(DIRNAME) • In-memory: MPI_MemCheckpoint(void) • Restart: • In-disk: charmrun +p4 prog +restart DIRNAME • In-memory: automatic restart upon failure detection AMPI: Adaptive MPI
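For instance, a minimal sketch of periodic in-disk checkpointing in an iterative code, assuming MPI_Checkpoint takes the directory name as a string argument as shown above; compute_step(), the period, and the directory name are illustrative:
#include "mpi.h"
void compute_step(int it, MPI_Comm comm);   /* placeholder application work */
void run(int nsteps, MPI_Comm comm) {
  for (int it = 0; it < nsteps; it++) {
    compute_step(it, comm);
    if (it > 0 && it % 500 == 0)
      MPI_Checkpoint("ckpt_dir");  /* synchronous, collective checkpoint to disk */
  }
}
/* After a failure, the run could then be restarted with a command such as:
   charmrun +p4 prog +restart ckpt_dir */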
Outline • Motivation • AMPI: Overview • Benefits of AMPI • Converting MPI Codes to AMPI • Handling Global/Static Variables • Running AMPI Programs • AMPI Status • AMPI References, Conclusion AMPI: Adaptive MPI
Converting MPI Codes to AMPI • AMPI needs its own initialization, before user code • Fortran: the program entry point becomes MPI_Main • Original: program pgm ... end program • AMPI: subroutine MPI_Main ... end subroutine • C: the program entry point is handled automatically via mpi.h (include it in the same file as main() if absent) • If the code has no global/static variables, this is all that is needed to convert! AMPI: Adaptive MPI
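To illustrate the C case, an ordinary MPI program such as the hypothetical sketch below needs no source changes; assuming it has no globals/statics, recompiling it with AMPI (e.g. with the ampicc wrapper) is enough:
#include <stdio.h>
#include "mpi.h"   /* AMPI's mpi.h handles the entry point automatically */
int main(int argc, char **argv) {
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  printf("Hello from VP %d of %d\n", rank, size);
  MPI_Finalize();
  return 0;
}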
Outline • Motivation • AMPI: Overview • Benefits of AMPI • Converting MPI Codes to AMPI • Handling Global/Static Variables • Running AMPI Programs • AMPI Status • AMPI References, Conclusion AMPI: Adaptive MPI
Handling Global/Static Variables • Global and static variables are a problem in multi-threaded programs (a similar problem exists in OpenMP): • Globals/statics have a single instance per process • They become shared by all threads in the process • Example (time flows downward; both threads run in the same process): • Thread 1: var = myid (1); MPI_Recv() (block...); b = var • Thread 2: var = myid (2); MPI_Recv() (block...) • If var is a global/static, Thread 1 reads an incorrect value (2 instead of 1)! AMPI: Adaptive MPI
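In code form, the scenario above might look like this hypothetical fragment (var and myid are the names used in the timeline):
#include "mpi.h"
int var;   /* global: a single instance per process, shared by all AMPI threads */
void work(int myid, MPI_Comm comm) {
  int buf, b;
  var = myid;    /* Thread 1 stores 1 and blocks below; Thread 2 then stores 2 */
  MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, comm, MPI_STATUS_IGNORE);
  b = var;       /* Thread 1 resumes and reads 2, not the 1 it stored */
}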
Handling Global/Static Variables (cont.) • General Solution: Privatize the variables in each thread • Approaches: • Swap global variables • Source-to-source transformation via Photran • Use a TLS scheme (in development) • The specific approach to use must be decided on a case-by-case basis AMPI: Adaptive MPI
Handling Global/Static Variables (cont.) • First Approach: Swap global variables • Leverages ELF – the Executable and Linking Format (e.g. on Linux) • ELF maintains a Global Offset Table (GOT) for globals • Switch GOT contents at thread context-switch • Implemented in AMPI via the build flag -swapglobals • No source code changes needed • Works with any language (C, C++, Fortran, etc.) • Does not handle static variables • Context-switch overhead grows with the number of global variables AMPI: Adaptive MPI
Handling Global/Static Variables (cont.) • Second Approach: Source-to-source transform • Move globals/statics to an object, then pass it around • Automatic solution for Fortran codes: Photran • Similar idea can be applied to C/C++ codes • Totally portable across systems/compilers • May improve locality and cache utilization • No extra overhead at context-switch • Requires new implementation for each language AMPI: Adaptive MPI
Handling Global/Static Variables (cont.) • Example of Transformation: C Program • Original Code → Transformed Code (a sketch of the idea is shown below) AMPI: Adaptive MPI
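A hypothetical sketch of the kind of transformation meant here: a global is moved into a struct that is created once per thread and passed explicitly to the routines that use it (the names are illustrative, not taken from the original slide):
/* Original (hypothetical): a process-wide global shared by all threads */
int iterations;
void step(void) { iterations++; }
/* Transformed: the former global lives in a per-thread object */
typedef struct { int iterations; } Globals;
void step_transformed(Globals *g) { g->iterations++; }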
Handling Global/Static Variables (cont.) • Example of Photran Transformation: Fortran Prog. • Original Code: Transformed Code: AMPI: Adaptive MPI
Handling Global/Static Variables (cont.) • Photran Transformation Tool • Eclipse-based IDE, implemented in Java • Incorporates automatic refactorings for Fortran codes • Operates on “pure” Fortran 90 programs • Code transformation infrastructure: • Construct rewriteable ASTs • ASTs are augmented with binding information • Source: Stas Negara & Ralph Johnson • http://www.eclipse.org/photran/ AMPI: Adaptive MPI
Handling Global/Static Variables (cont.) • Photran-AMPI • GUI: • Source: Stas Negara & Ralph Johnson • http://www.eclipse.org/photran/ AMPI: Adaptive MPI
Handling Global/Static Variables (cont.) • Third Approach: TLS scheme (Thread-Local Storage) • Originally employed for kernel threads • In C code, variables are annotated with __thread • A modified/adapted gfortran compiler is available • Handles both globals and statics uniformly • No extra overhead at context-switch • Although popular, not yet a standard for compilers • Current Charm++ support is for x86 platforms only AMPI: Adaptive MPI
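As an illustration of the C annotation (the variable name is hypothetical; per-user-level-thread behavior relies on the runtime's TLS support described above):
/* Each thread gets its own copy of a __thread variable instead of sharing
   one per-process instance. */
__thread int my_rank_copy;
void record_rank(int rank) {
  my_rank_copy = rank;   /* writes go to this thread's private copy */
}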
Handling Global/Static Variables (cont.) • Summary of Current Privatization Schemes: • Program transformation is very portable • TLS scheme may become supported on Blue Waters, depending on work with IBM AMPI: Adaptive MPI
NAS Benchmark AMPI: Adaptive MPI
FLASH Results • FLASH is a parallel, multi-dimensional code used to study astrophysical fluids. • Many astrophysical environments are highly turbulent and have structure on scales ranging from the large (e.g. galaxy clusters) to the small (e.g. active galactic nuclei) within the same system. AMPI: Adaptive MPI
Outline • Motivation • AMPI: Overview • Benefits of AMPI • Converting MPI Codes to AMPI • Handling Global/Static Variables • Running AMPI Programs • AMPI Status • AMPI References, Conclusion AMPI: Adaptive MPI
Object Migration AMPI: Adaptive MPI
Object Migration • How do we move work between processors? • Application-specific methods • E.g., move rows of sparse matrix, elements of FEM computation • Often very difficult for application • Application-independent methods • E.g., move entire virtual processor • Application’s problem decomposition doesn’t change AMPI: Adaptive MPI
How to Migrate a Virtual Processor? • Move all application state to the new processor • Stack Data: subroutine variables and calls; managed by the compiler • Heap Data: allocated with malloc/free; managed by the user • Global Variables • Open files, environment variables, etc. (not handled yet!) AMPI: Adaptive MPI
Stack Data • The stack is used by the compiler to track function calls and provide temporary storage • Local Variables • Subroutine Parameters • C “alloca” storage • Most of the variables in a typical application are stack data AMPI: Adaptive MPI
Migrate Stack Data • Without compiler support, we cannot change the stack's address • Because we can't change the stack's interior pointers (return frame pointer, function arguments, etc.) • Solution: “isomalloc” addresses • Reserve address space on every processor for every thread's stack • Use mmap to scatter the stacks in virtual memory efficiently • The idea comes from PM2 AMPI: Adaptive MPI
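A minimal sketch of the reservation idea (not AMPI's actual implementation; the fixed address and length are illustrative assumptions, and MAP_FIXED is used only to show the intent of keeping the same virtual addresses everywhere):
#include <stddef.h>
#include <sys/mman.h>
/* Map a thread's stack at the SAME virtual address on every processor so
   that interior pointers remain valid after migration. */
static void *reserve_stack(void *fixed_addr, size_t len) {
  void *p = mmap(fixed_addr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
  return (p == MAP_FAILED) ? NULL : p;
}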
Migrate Stack Data (Figure: Processor A's and Processor B's memory, from 0x00000000 to 0xFFFFFFFF, each containing Code, Globals, Heap, and thread stacks; Thread 3's stack migrates from A to B at the same virtual addresses) AMPI: Adaptive MPI
Migrate Stack Data • Isomalloc is a completely automatic solution • No changes needed in application or compilers • Just like a software shared-memory system, but with proactive paging • But has a few limitations • Depends on having large quantities of virtual address space (best on 64-bit) • 32-bit machines can only have a few gigs of isomalloc stacks across the whole machine • Depends on unportable mmap • Which addresses are safe? (We must guess!) • What about Windows? Blue Gene? AMPI: Adaptive MPI
Heap Data • Heap data is any dynamically allocated data • C “malloc” and “free” • C++ “new” and “delete” • F90 “ALLOCATE” and “DEALLOCATE” • Arrays and linked data structures are almost always heap data AMPI: Adaptive MPI
Migrate Heap Data • Automatic solution: isomalloc all heap data just like stacks! • “-memory isomalloc” link option • Overrides malloc/free • No new application code needed • Same limitations as isomalloc • Manual solution: application moves its heap data • Need to be able to size message buffer, pack data into message, and unpack on other side • “pup” abstraction does all three AMPI: Adaptive MPI
Migrate Heap Data: PUP • Same idea as MPI derived types, but datatype description is code, not data • Basic contract: here is my data • Sizing: counts up data size • Packing: copies data into message • Unpacking: copies data back out • Same call works for network, memory, disk I/O ... • Register “pup routine” with runtime • F90/C Interface: subroutine calls • E.g., pup_int(p,&x); • C++ Interface: operator| overloading • E.g., p|x; AMPI: Adaptive MPI
Migrate Heap Data: PUP Builtins • Supported PUP Datatypes • Basic types (int, float, etc.) • Arrays of basic types • Unformatted bytes • Extra Support in C++ • Can overload user-defined types • Define your own operator| • Support for pointer-to-parent class • PUP::able interface • Supports STL vector, list, map, and string • “pup_stl.h” • Subclass your own PUP::er object AMPI: Adaptive MPI
Migrate Heap Data: PUP C++ Example
#include "pup.h"
#include "pup_stl.h"
#include <vector>
class myMesh {
  std::vector<float> nodes;
  std::vector<int> elts;
public:
  ...
  void pup(PUP::er &p) {   // one routine sizes, packs, and unpacks
    p|nodes;
    p|elts;
  }
};
AMPI: Adaptive MPI
Migrate Heap Data: PUP C Example
#include <stdlib.h>
struct myMesh {
  int nn, ne;
  float *nodes;
  int *elts;
};
void pupMesh(pup_er p, struct myMesh *mesh) {
  pup_int(p, &mesh->nn);
  pup_int(p, &mesh->ne);
  if (pup_isUnpacking(p)) { /* allocate data on arrival */
    mesh->nodes = (float *)malloc(mesh->nn * sizeof(float));
    mesh->elts  = (int *)malloc(mesh->ne * sizeof(int));
  }
  pup_floats(p, mesh->nodes, mesh->nn);
  pup_ints(p, mesh->elts, mesh->ne);
  if (pup_isDeleting(p)) { /* free data on departure */
    deleteMesh(mesh);
  }
}
AMPI: Adaptive MPI
Migrate Heap Data: PUP F90 Example
TYPE myMesh
  INTEGER :: nn, ne
  REAL*4, ALLOCATABLE :: nodes(:)
  INTEGER, ALLOCATABLE :: elts(:)
END TYPE
SUBROUTINE pupMesh(p, mesh)
  USE MODULE ...
  INTEGER :: p
  TYPE(myMesh) :: mesh
  CALL fpup_int(p, mesh%nn)
  CALL fpup_int(p, mesh%ne)
  IF (fpup_isUnpacking(p)) THEN
    ALLOCATE(mesh%nodes(mesh%nn))
    ALLOCATE(mesh%elts(mesh%ne))
  END IF
  CALL fpup_floats(p, mesh%nodes, mesh%nn)
  CALL fpup_ints(p, mesh%elts, mesh%ne)
  IF (fpup_isDeleting(p)) CALL deleteMesh(mesh)
END SUBROUTINE
AMPI: Adaptive MPI
Global Data • Global data is anything stored at a fixed place • C/C++ “extern” or “static” data • F77 “COMMON” blocks • F90 “MODULE” data • Problem if multiple objects/threads try to store different values in the same place (thread safety) • Compilers should make all of these per-thread; but they don’t! • Not a problem if everybody stores the same value (e.g., constants) AMPI: Adaptive MPI