This presentation introduces Adaptive MPI (AMPI), a parallel programming framework designed to tackle the challenges posed by dynamically varying workloads in modern parallel applications. We discuss the motivation behind AMPI, its unique features such as user-level migratable threads and virtualization of resources, and the benefits it offers, including better load balancing and improved cache utilization. The conversion process from standard MPI codes to AMPI is also explained, along with methods for handling global/static variables. Join us to explore how AMPI enhances load balancing and fault tolerance through advanced techniques.
AMPI: Adaptive MPI Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign
Outline • Motivation • AMPI: Overview • Benefits of AMPI • Converting MPI Codes to AMPI • Handling Global/Static Variables • Running AMPI Programs • AMPI Status • AMPI References, Conclusion AMPI: Adaptive MPI
AMPI: Motivation • Challenges • New-generation parallel applications are: • Dynamically varying: load shifts during execution • Adaptively refined • Composed of multi-physics modules • Typical MPI implementations are: • Not naturally suitable for dynamic applications • The available processor set may not match the algorithm • Alternative: Adaptive MPI (AMPI) • MPI & Charm++ virtualization: VP (“Virtual Processors”) AMPI: Adaptive MPI
Outline • Motivation • AMPI: Overview • Benefits of AMPI • Converting MPI Codes to AMPI • Handling Global/Static Variables • Running AMPI Programs • AMPI Status • AMPI References, Conclusion AMPI: Adaptive MPI
AMPI: Overview • Virtualization: MPI ranks → Charm++ threads • MPI “tasks” are implemented as user-level migratable threads (VPs: virtual processors) mapped onto the real processors AMPI: Adaptive MPI
AMPI: Overview (cont.) • AMPI Execution Model: • Multiple user-level threads per process • Typically, one process per physical processor • Charm++ Scheduler coordinates execution • Threads (VPs) can migrate across processors • Virtualization ratio: R = #VP / #P (over-decomposition) • (Figure: Charm++ Scheduler driving the threads of one processor, with P=1, VP=4) AMPI: Adaptive MPI
AMPI: Overview (cont.) • AMPI’s Over-Decomposition in Practice • MPI: P=4, ranks=4 • AMPI: P=4, VP=ranks=16 AMPI: Adaptive MPI
Outline • Motivation • AMPI: Overview • Benefits of AMPI • Converting MPI Codes to AMPI • Handling Global/Static Variables • Running AMPI Programs • AMPI Status • AMPI References, Conclusion AMPI: Adaptive MPI
Benefits of AMPI • Overlap between Computation/Communication • Achieved automatically • When one thread blocks for a message, another thread on the same processor can execute • The Charm++ Scheduler picks the next thread among those that are ready to run AMPI: Adaptive MPI
Benefits of AMPI (cont.) • Potentially Better Cache Utilization • Gains occur when a subdomain is accessed repeatedly (e.g. by multiple functions called in sequence) • (Figure: the smaller AMPI subdomains might fit in cache, while the larger MPI subdomains might not) AMPI: Adaptive MPI
Benefits of AMPI (cont.) • Thread Migration for Load Balancing • Migration of thread 13: AMPI: Adaptive MPI
Benefits of AMPI (cont.) • Load Balancing in AMPI: MPI_Migrate() • Collective operation informing the load balancer that the threads can be migrated, if needed, for balancing load • Easy to insert in the code of iterative applications • Leverages Load-Balancing framework of Charm++ • Balancing decisions can be based on • Measured parameters: computation load, communication pattern • Application-provided information AMPI: Adaptive MPI
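As a rough illustration (not from the original slides), a minimal sketch of how the collective MPI_Migrate() call might be inserted into an iterative solver; compute_step() and the period LB_PERIOD are hypothetical placeholders, and MPI_Migrate() is assumed to be declared by AMPI's mpi.h as described above:
#include "mpi.h"
#define LB_PERIOD 100                      /* hypothetical migration period */
void compute_step(int it, MPI_Comm comm);  /* placeholder application work */
void solve(int nsteps, MPI_Comm comm) {
  for (int it = 0; it < nsteps; it++) {
    compute_step(it, comm);
    if (it > 0 && it % LB_PERIOD == 0)
      MPI_Migrate();    /* collective: lets the load balancer move threads */
  }
}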
Benefits of AMPI (cont.) • Decoupling of Physical/Virtual Processors • Problem setup: 3D stencil calculation of size 240³ run on Lemieux • AMPI runs on any # of PEs (e.g. 19, 33, 105), while native MPI needs P=K³ AMPI: Adaptive MPI
Benefits of AMPI (cont.) • Asynchronous Implementation of Collectives • The collective operation is posted and returns immediately • Test/wait for its completion; meanwhile, do useful work, e.g. MPI_Ialltoall( … , &req); /* other computation */ MPI_Wait(&req, MPI_STATUS_IGNORE); • Other operations available: MPI_Iallreduce, MPI_Iallgather • Example: 2D FFT benchmark (figure: time in ms) AMPI: Adaptive MPI
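A slightly fuller sketch of the overlap pattern above, assuming the (now standard) MPI-3-style nonblocking signature; do_other_work() and the buffer arguments are illustrative placeholders:
#include "mpi.h"
void do_other_work(void);  /* placeholder: computation independent of recvbuf */
void exchange_and_compute(double *sendbuf, double *recvbuf, int count,
                          MPI_Comm comm) {
  MPI_Request req;
  /* Post the all-to-all; the call returns immediately */
  MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                recvbuf, count, MPI_DOUBLE, comm, &req);
  do_other_work();                     /* overlap: useful work meanwhile */
  MPI_Wait(&req, MPI_STATUS_IGNORE);   /* recvbuf is valid after this */
}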
Motivation for Collective Communication Optimization • Time breakdown of an all-to-all operation using the Mesh library • Computation is only a small proportion of the elapsed time • A number of optimization techniques have been developed to improve collective communication performance AMPI: Adaptive MPI
Benefits of AMPI (cont.) • Fault Tolerance via Checkpoint/Restart • The state of the application is checkpointed to disk or memory • Capable of restarting on a different number of physical processors! • Synchronous checkpoint, collective call: • In-disk: MPI_Checkpoint(DIRNAME) • In-memory: MPI_MemCheckpoint(void) • Restart: • In-disk: charmrun +p4 prog +restart DIRNAME • In-memory: automatic restart upon failure detection AMPI: Adaptive MPI
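For instance, a minimal sketch of periodic in-disk checkpointing in an iterative code, assuming MPI_Checkpoint takes the directory name as a string argument as shown above; compute_step(), the period, and the directory name are illustrative:
#include "mpi.h"
void compute_step(int it, MPI_Comm comm);   /* placeholder application work */
void run(int nsteps, MPI_Comm comm) {
  for (int it = 0; it < nsteps; it++) {
    compute_step(it, comm);
    if (it > 0 && it % 500 == 0)
      MPI_Checkpoint("ckpt_dir");  /* synchronous, collective checkpoint to disk */
  }
}
/* After a failure, the run could then be restarted with a command such as:
   charmrun +p4 prog +restart ckpt_dir */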
Outline • Motivation • AMPI: Overview • Benefits of AMPI • Converting MPI Codes to AMPI • Handling Global/Static Variables • Running AMPI Programs • AMPI Status • AMPI References, Conclusion AMPI: Adaptive MPI
Converting MPI Codes to AMPI • AMPI needs its own initialization, before user code • Fortran: the program entry point becomes MPI_Main • Original: program pgm ... end program • AMPI: subroutine MPI_Main ... end subroutine • C: the program entry point is handled automatically via mpi.h (include it in the same file as main() if absent) • If the code has no global/static variables, this is all that is needed to convert! AMPI: Adaptive MPI
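To illustrate the C case, an ordinary MPI program such as the hypothetical sketch below needs no source changes; assuming it has no globals/statics, recompiling it with AMPI (e.g. with the ampicc wrapper) is enough:
#include <stdio.h>
#include "mpi.h"   /* AMPI's mpi.h handles the entry point automatically */
int main(int argc, char **argv) {
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  printf("Hello from VP %d of %d\n", rank, size);
  MPI_Finalize();
  return 0;
}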
Outline • Motivation • AMPI: Overview • Benefits of AMPI • Converting MPI Codes to AMPI • Handling Global/Static Variables • Running AMPI Programs • AMPI Status • AMPI References, Conclusion AMPI: Adaptive MPI
Handling Global/Static Variables • Global and static variables are a problem in multi-threaded programs (a similar problem exists in OpenMP): • Globals/statics have a single instance per process • They become shared by all threads in the process • Example (time flows downward; both threads run in the same process): • Thread 1: var = myid (1); MPI_Recv() (block...); b = var • Thread 2: var = myid (2); MPI_Recv() (block...) • If var is a global/static, Thread 1 reads an incorrect value (2 instead of 1)! AMPI: Adaptive MPI
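In code form, the scenario above might look like this hypothetical fragment (var and myid are the names used in the timeline):
#include "mpi.h"
int var;   /* global: a single instance per process, shared by all AMPI threads */
void work(int myid, MPI_Comm comm) {
  int buf, b;
  var = myid;    /* Thread 1 stores 1 and blocks below; Thread 2 then stores 2 */
  MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, comm, MPI_STATUS_IGNORE);
  b = var;       /* Thread 1 resumes and reads 2, not the 1 it stored */
}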
Handling Global/Static Variables (cont.) • General Solution: Privatize the variables in each thread • Approaches: • Swap global variables • Source-to-source transformation via Photran • Use a TLS scheme (in development) • The specific approach to use must be decided on a case-by-case basis AMPI: Adaptive MPI
Handling Global/Static Variables (cont.) • First Approach: Swap global variables • Leverages ELF – the Executable and Linking Format (e.g. on Linux) • ELF maintains a Global Offset Table (GOT) for globals • Switch GOT contents at thread context-switch • Implemented in AMPI via the build flag -swapglobals • No source code changes needed • Works with any language (C, C++, Fortran, etc.) • Does not handle static variables • Context-switch overhead grows with the number of global variables AMPI: Adaptive MPI
Handling Global/Static Variables (cont.) • Second Approach: Source-to-source transform • Move globals/statics to an object, then pass it around • Automatic solution for Fortran codes: Photran • Similar idea can be applied to C/C++ codes • Totally portable across systems/compilers • May improve locality and cache utilization • No extra overhead at context-switch • Requires new implementation for each language AMPI: Adaptive MPI
Handling Global/Static Variables (cont.) • Example of Transformation: C Program • Original Code → Transformed Code (a sketch of the idea is shown below) AMPI: Adaptive MPI
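A hypothetical sketch of the kind of transformation meant here: a global is moved into a struct that is created once per thread and passed explicitly to the routines that use it (the names are illustrative, not taken from the original slide):
/* Original (hypothetical): a process-wide global shared by all threads */
int iterations;
void step(void) { iterations++; }
/* Transformed: the former global lives in a per-thread object */
typedef struct { int iterations; } Globals;
void step_transformed(Globals *g) { g->iterations++; }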
Handling Global/Static Variables (cont.) • Example of Photran Transformation: Fortran Prog. • Original Code: Transformed Code: AMPI: Adaptive MPI
Handling Global/Static Variables (cont.) • Photran Transformation Tool • Eclipse-based IDE, implemented in Java • Incorporates automatic refactorings for Fortran codes • Operates on “pure” Fortran 90 programs • Code transformation infrastructure: • Construct rewriteable ASTs • ASTs are augmented with binding information • Source: Stas Negara & Ralph Johnson • http://www.eclipse.org/photran/ AMPI: Adaptive MPI
Handling Global/Static Variables (cont.) • Photran-AMPI • GUI: • Source: Stas Negara & Ralph Johnson • http://www.eclipse.org/photran/ AMPI: Adaptive MPI
Handling Global/Static Variables (cont.) • Third Approach: TLS scheme (Thread-Local Storage) • Originally employed for kernel threads • In C code, variables are annotated with __thread • A modified/adapted gfortran compiler is available • Handles both globals and statics uniformly • No extra overhead at context-switch • Although popular, not yet a standard for compilers • Current Charm++ support is for x86 platforms only AMPI: Adaptive MPI
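As an illustration of the C annotation (the variable name is hypothetical; per-user-level-thread behavior relies on the runtime's TLS support described above):
/* Each thread gets its own copy of a __thread variable instead of sharing
   one per-process instance. */
__thread int my_rank_copy;
void record_rank(int rank) {
  my_rank_copy = rank;   /* writes go to this thread's private copy */
}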
Handling Global/Static Variables (cont.) • Summary of Current Privatization Schemes: • Program transformation is very portable • TLS scheme may become supported on Blue Waters, depending on work with IBM AMPI: Adaptive MPI
NAS Benchmark AMPI: Adaptive MPI
FLASH Results • FLASH is a parallel, multi-dimensional code used to study astrophysical fluids. • Many astrophysical environments are highly turbulent and have structure on scales ranging from the large (e.g. galaxy clusters) to the small (e.g. active galactic nuclei) within the same system. AMPI: Adaptive MPI
Outline • Motivation • AMPI: Overview • Benefits of AMPI • Converting MPI Codes to AMPI • Handling Global/Static Variables • Running AMPI Programs • AMPI Status • AMPI References, Conclusion AMPI: Adaptive MPI
Object Migration AMPI: Adaptive MPI
Object Migration • How do we move work between processors? • Application-specific methods • E.g., move rows of sparse matrix, elements of FEM computation • Often very difficult for application • Application-independent methods • E.g., move entire virtual processor • Application’s problem decomposition doesn’t change AMPI: Adaptive MPI
How to Migrate a Virtual Processor? • Move all application state to the new processor • Stack Data: subroutine variables and calls; managed by the compiler • Heap Data: allocated with malloc/free; managed by the user • Global Variables • Open files, environment variables, etc. (not handled yet!) AMPI: Adaptive MPI
Stack Data • The stack is used by the compiler to track function calls and provide temporary storage • Local Variables • Subroutine Parameters • C “alloca” storage • Most of the variables in a typical application are stack data AMPI: Adaptive MPI
Migrate Stack Data • Without compiler support, we cannot change the stack's address • Because we can't change the stack's interior pointers (return frame pointer, function arguments, etc.) • Solution: “isomalloc” addresses • Reserve address space on every processor for every thread's stack • Use mmap to scatter the stacks in virtual memory efficiently • The idea comes from PM2 AMPI: Adaptive MPI
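A minimal sketch of the reservation idea (not AMPI's actual implementation; the fixed address and length are illustrative assumptions, and MAP_FIXED is used only to show the intent of keeping the same virtual addresses everywhere):
#include <stddef.h>
#include <sys/mman.h>
/* Map a thread's stack at the SAME virtual address on every processor so
   that interior pointers remain valid after migration. */
static void *reserve_stack(void *fixed_addr, size_t len) {
  void *p = mmap(fixed_addr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
  return (p == MAP_FAILED) ? NULL : p;
}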
Migrate Stack Data (Figure: Processor A's and Processor B's memory, from 0x00000000 to 0xFFFFFFFF, each containing Code, Globals, Heap, and thread stacks; Thread 3's stack migrates from A to B at the same virtual addresses) AMPI: Adaptive MPI
Migrate Stack Data • Isomalloc is a completely automatic solution • No changes needed in application or compilers • Just like a software shared-memory system, but with proactive paging • But has a few limitations • Depends on having large quantities of virtual address space (best on 64-bit) • 32-bit machines can only have a few gigs of isomalloc stacks across the whole machine • Depends on unportable mmap • Which addresses are safe? (We must guess!) • What about Windows? Blue Gene? AMPI: Adaptive MPI
Heap Data • Heap data is any dynamically allocated data • C “malloc” and “free” • C++ “new” and “delete” • F90 “ALLOCATE” and “DEALLOCATE” • Arrays and linked data structures are almost always heap data AMPI: Adaptive MPI
Migrate Heap Data • Automatic solution: isomalloc all heap data just like stacks! • “-memory isomalloc” link option • Overrides malloc/free • No new application code needed • Same limitations as isomalloc • Manual solution: application moves its heap data • Need to be able to size message buffer, pack data into message, and unpack on other side • “pup” abstraction does all three AMPI: Adaptive MPI
Migrate Heap Data: PUP • Same idea as MPI derived types, but datatype description is code, not data • Basic contract: here is my data • Sizing: counts up data size • Packing: copies data into message • Unpacking: copies data back out • Same call works for network, memory, disk I/O ... • Register “pup routine” with runtime • F90/C Interface: subroutine calls • E.g., pup_int(p,&x); • C++ Interface: operator| overloading • E.g., p|x; AMPI: Adaptive MPI
Migrate Heap Data: PUP Builtins • Supported PUP Datatypes • Basic types (int, float, etc.) • Arrays of basic types • Unformatted bytes • Extra Support in C++ • Can overload user-defined types • Define your own operator| • Support for pointer-to-parent class • PUP::able interface • Supports STL vector, list, map, and string • “pup_stl.h” • Subclass your own PUP::er object AMPI: Adaptive MPI
Migrate Heap Data: PUP C++ Example
#include "pup.h"
#include "pup_stl.h"
#include <vector>
class myMesh {
  std::vector<float> nodes;
  std::vector<int> elts;
public:
  ...
  void pup(PUP::er &p) {   // one routine sizes, packs, and unpacks
    p|nodes;
    p|elts;
  }
};
AMPI: Adaptive MPI
Migrate Heap Data: PUP C Example
#include <stdlib.h>
struct myMesh {
  int nn, ne;
  float *nodes;
  int *elts;
};
void pupMesh(pup_er p, struct myMesh *mesh) {
  pup_int(p, &mesh->nn);
  pup_int(p, &mesh->ne);
  if (pup_isUnpacking(p)) { /* allocate data on arrival */
    mesh->nodes = (float *)malloc(mesh->nn * sizeof(float));
    mesh->elts  = (int *)malloc(mesh->ne * sizeof(int));
  }
  pup_floats(p, mesh->nodes, mesh->nn);
  pup_ints(p, mesh->elts, mesh->ne);
  if (pup_isDeleting(p)) { /* free data on departure */
    deleteMesh(mesh);
  }
}
AMPI: Adaptive MPI
Migrate Heap Data: PUP F90 Example
TYPE myMesh
  INTEGER :: nn, ne
  REAL*4, ALLOCATABLE :: nodes(:)
  INTEGER, ALLOCATABLE :: elts(:)
END TYPE
SUBROUTINE pupMesh(p, mesh)
  USE MODULE ...
  INTEGER :: p
  TYPE(myMesh) :: mesh
  CALL fpup_int(p, mesh%nn)
  CALL fpup_int(p, mesh%ne)
  IF (fpup_isUnpacking(p)) THEN
    ALLOCATE(mesh%nodes(mesh%nn))
    ALLOCATE(mesh%elts(mesh%ne))
  END IF
  CALL fpup_floats(p, mesh%nodes, mesh%nn)
  CALL fpup_ints(p, mesh%elts, mesh%ne)
  IF (fpup_isDeleting(p)) CALL deleteMesh(mesh)
END SUBROUTINE
AMPI: Adaptive MPI
Global Data • Global data is anything stored at a fixed place • C/C++ “extern” or “static” data • F77 “COMMON” blocks • F90 “MODULE” data • Problem if multiple objects/threads try to store different values in the same place (thread safety) • Compilers should make all of these per-thread; but they don’t! • Not a problem if everybody stores the same value (e.g., constants) AMPI: Adaptive MPI