
Parallel Programming using the PGAS Approach

Explore the history, advantages, and disadvantages of programming parallel systems using the PGAS (Partitioned Global Address Space) approach. Learn about UPC (Unified Parallel C) and its implementation, shared scalars and arrays, work-sharing strategies, and the DASH library.


Presentation Transcript


  1. Parallel Programming using the PGAS Approach

  2. Outline • Introduction • Programming parallel systems: threading, message passing • PGAS as a middle ground • UPC (Unified Parallel C) • History of UPC • Shared scalars and arrays • Work-sharing in UPC: parallel loop • DASH – PGAS in the form of a C++ template library • A quick overview of the project • Conclusion

  3. Programming Parallel Machines (Figure: processes/threads with private and shared data on top of physically distributed memory; data is reached either by direct memory access (read/write) or by explicit messages) • The two most widely used approaches for parallel programming: Shared Memory Programming using Threads, and Message Passing

  4. Shared Memory Programming using Threads • Examples: OpenMP, Pthreads, C++ threads, Java threads • Limited to shared memory systems • Shared data can be directly accessed: implicit communication through direct reads and writes • Advantages: typically easier to program, a natural extension of sequential programming • Disadvantages: subtle bugs and race conditions; false sharing as a performance problem

  5. Message Passing • Example: MPI (Message Passing Interface) • Disadvantages: complex programming paradigm; manual data partitioning required; explicit coordination of communication (send/receive pairs); data replication (memory requirement) • Advantages: highly efficient and scalable (to the largest machines in use today); data locality is “automatic”; no false sharing, no race conditions; runs everywhere

  6. Partitioned Global Address Space (Figure: processes/threads keep private data but access a partitioned shared data space through a PGAS layer, via put/get memory accesses or explicit messages) • Best of both worlds: can be used on large-scale distributed memory machines but also on shared memory machines • A PGAS program looks much like a regular threaded program, but sharing of data is declared explicitly and the data partitioning is made explicit (the shared data space is partitioned!) • Both are needed for performance!

  7. Partitioned Global Address Space: Example (Figure: threads 0 … n-1 in a global address space; each thread has a private variable mine, while the single shared variable ours lives in the partitioned shared space) • Let’s call the members of our program threads • Let’s assume we use the SPMD (single program multiple data) paradigm • Let’s assume we have a new keyword “shared” that puts variables in the shared global address space • This is how PGAS is expressed in UPC (Unified Parallel C)! (more later)

  shared int ours;   /* 1 copy of ours, accessible by every thread */
  int mine;          /* n copies of mine (one per thread); each thread can only access its own copy */

  8. Shared Arrays • Example: a shared array in UPC, assuming 4 threads (Figure: ours[i] is placed in the shared partition of thread i; each thread also keeps its private mine)

  shared int ours[4];   /* element ours[i] is placed in the partition of thread i */
  int mine;

  • Affinity: in which partition a data item “lives” • ours (previous slide) lives in partition 0 (by convention) • ours[i] lives in partition i
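  The affinity of a shared object can also be queried at run time with the standard UPC library function upc_threadof(); the following is only a minimal illustrative sketch (the printf loop and the 4-element array are assumptions, not from the slide):

  #include <stdio.h>
  #include <upc.h>

  shared int ours[4];   /* with the default cyclic layout, ours[i] has affinity to thread i % THREADS */

  int main(void) {
      if (MYTHREAD == 0) {
          int i;
          for (i = 0; i < 4; i++)
              printf("ours[%d] lives on thread %d\n", i, (int)upc_threadof(&ours[i]));
      }
      return 0;
  }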

  9. Local-View vs. Global-View • Two ways to organize access to shared data: global-view (e.g., Unified Parallel C) and local-view (e.g., Co-array Fortran) • Global-view (UPC): X is declared in terms of its global size and accessed in terms of global indices; the process is not specified explicitly

  shared int X[100];   /* global size */
  X[i] = 23;           /* global index */

  • Local-view (Co-array Fortran): a and b are declared in terms of their local size and accessed in terms of local indices; the process (image) is specified explicitly via the co-index (the co-dimension / co-index appears in square brackets)

  integer :: a(100)[*], b(100)[*]   ! local size, co-dimension in square brackets
  b(17) = a(17)[2]                  ! local index, co-index selects image 2

  10. UPC History and Status • UPC is an extension to ANSI C • New keywords, library functions • Developed in the late 1990s and early 2000s • Based on previous projects at UCB, IDA, LLNL, … • Status • Berkeley UPC • GCC version • Vendor compilers (Cray, IBM, …) • Most often used on graph problems, irregular parallelism

  11. UPC Execution Model • A number of threads working independently in a SPMD fashion • The number of threads is specified at compile-time or run-time; available as the program variable THREADS • Note: “thread” is the UPC terminology; UPC threads are most often implemented as full OS processes • MYTHREAD specifies the thread index (0...THREADS-1) • upc_barrier is a global synchronization: all threads wait before continuing • There is a form of parallel loop (later) • There are two compilation modes • Static threads mode: THREADS is specified at compile time by the user; the program may use THREADS as a compile-time constant • Dynamic threads mode: compiled code may be run with varying numbers of threads

  12. Hello World in UPC • Any legal C program is also a legal UPC program • SPMD model: if you compile and run it as UPC with N threads, it will run N copies of the program (same model as MPI)

  #include <upc.h>    /* needed for UPC extensions */
  #include <stdio.h>

  int main() {
      printf("Thread %d of %d: hello UPC world\n", MYTHREAD, THREADS);
      return 0;
  }

  Example output with 4 threads (the order is not deterministic):
  Thread 0 of 4: hello UPC world
  Thread 1 of 4: hello UPC world
  Thread 3 of 4: hello UPC world
  Thread 2 of 4: hello UPC world

  13. A Bigger Example in UPC: Estimate π (Figure: quarter circle of radius r = 1 inscribed in the unit square) • Estimate π by throwing darts at a unit square • Calculate the percentage that fall inside the quarter circle • Area of square = r² = 1 • Area of circle quadrant = ¼ π r² = π/4 • Randomly throw darts at (x,y) positions • If x² + y² < 1, the point is inside the circle • Compute the ratio R = (# points inside) / (# points total) • π ≈ 4 R

  14. Pi in UPC, First Version • Each thread gets its own copy of the local variables and can use the input arguments • hit() (defined elsewhere) draws a random point and returns 1 if it lies inside the circle • This program computes N independent estimates of Pi (when run with N threads)

  #include <stdio.h>
  #include <stdlib.h>   /* atoi, srand */
  #include <math.h>
  #include <upc.h>

  int hit(void);        /* defined elsewhere: returns 1 if a random point falls inside the circle */

  int main(int argc, char *argv[]) {
      int i, hits = 0, trials = 0;
      double pi;
      if (argc != 2) trials = 1000000;
      else trials = atoi(argv[1]);
      srand(MYTHREAD*17);               /* seed the random number generator per thread */
      for (i = 0; i < trials; i++)
          hits += hit();
      pi = 4.0*hits/trials;
      printf("PI estimated to %f.", pi);
      return 0;
  }

  15. Pi in UPC, Shared Memory Style • hits is now a single shared variable that records the hits of all threads • The work is divided up evenly among the threads • Problem with this program: race condition! Reading and writing hits is not synchronized

  shared int hits = 0;                            /* shared variable to record hits */
  int main(int argc, char **argv) {
      int i, my_trials = 0;
      int trials = atoi(argv[1]);
      my_trials = (trials + THREADS-1)/THREADS;   /* divide up the work evenly */
      srand(MYTHREAD*17);
      for (i = 0; i < my_trials; i++)
          hits += hit();                          /* accumulate hits: unsynchronized! */
      upc_barrier;
      if (MYTHREAD == 0) {
          printf("PI estimated to %f.", 4.0*hits/trials);
      }
      return 0;
  }

  16. Fixing the Race Condition • A possible fix for the race condition: keep a separate counter per thread (a shared array with one element per thread) • Each thread updates only its own element: no race condition and no remote communication • One thread (thread 0) computes the total sum

  int hits = 0;
  shared int all_hits[THREADS];           /* shared array: 1 element per thread */
  int main(int argc, char **argv) {
      /* declarations and initialization code omitted */
      for (i = 0; i < my_trials; i++)
          all_hits[MYTHREAD] += hit();    /* each thread accesses only its local element */
      upc_barrier;
      if (MYTHREAD == 0) {
          for (i = 0; i < THREADS; i++)
              hits += all_hits[i];        /* thread 0 computes the overall sum */
          printf("PI estimated to %f.", 4.0*hits/trials);
      }
      return 0;
  }

  17. Other UPC Features • Locks: upc_lock_t provides mutual exclusion between threads and can also be used to fix the race condition in the previous example (see the sketch below) • Customizable layout of one- and multi-dimensional arrays: blocked, cyclic, block-cyclic; cyclic is the default • Split-phase barrier: upc_notify and upc_wait instead of upc_barrier • Shared pointers and pointers to shared data • Work-sharing (parallel loop)
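  The slides do not show the lock-based version, so the following is only a minimal sketch of how upc_lock_t could protect the shared hits counter; the hit() routine is the same hypothetical helper as in the earlier Pi examples, and each thread accumulates privately first so the lock is taken only once per thread.

  #include <stdio.h>
  #include <stdlib.h>
  #include <upc.h>

  shared int hits = 0;

  int hit(void) {                      /* hypothetical helper: 1 if a random point falls inside the circle */
      double x = rand() / (double)RAND_MAX;
      double y = rand() / (double)RAND_MAX;
      return (x*x + y*y) < 1.0;
  }

  int main(int argc, char **argv) {
      upc_lock_t *hit_lock = upc_all_lock_alloc();   /* collective: all threads share one lock */
      int i, my_hits = 0;
      int trials = (argc == 2) ? atoi(argv[1]) : 1000000;
      int my_trials = (trials + THREADS - 1) / THREADS;
      srand(MYTHREAD*17);
      for (i = 0; i < my_trials; i++)
          my_hits += hit();            /* accumulate privately, without any contention */
      upc_lock(hit_lock);              /* critical section: one shared update per thread */
      hits += my_hits;
      upc_unlock(hit_lock);
      upc_barrier;
      if (MYTHREAD == 0) {
          printf("PI estimated to %f.\n", 4.0*hits/trials);
          upc_lock_free(hit_lock);
      }
      return 0;
  }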

  18. Work-Sharing: Vector Addition Example (Figure: with the default cyclic layout, the elements of v1, v2, and sum are distributed round-robin, so element i has affinity to thread i % THREADS) • Each thread iterates over the indices that it “owns” • This is a common idiom called “owner computes” • UPC supports this idiom directly with a parallel version of the for loop: upc_forall (next slide)

  #include <upc_relaxed.h>
  #define N 100*THREADS
  shared int v1[N], v2[N], sum[N];     /* default layout: cyclic (round robin) */

  int main() {
      int i;
      for (i = 0; i < N; i++) {
          if (MYTHREAD == i%THREADS) {     /* access local elements only:            */
              sum[i] = v1[i] + v2[i];      /* sum[i] has affinity to thread i%THREADS */
          }
      }
      return 0;
  }

  19. UPC Work-Sharing with upc_forall

  upc_forall(init; test; loop; affinity)
      statement;

  • init, test, loop: same as in a regular C for loop; they define loop start, end, and increment • affinity: defines which iterations a thread is responsible for • Syntactic sugar for the loop on the previous slide: loop over all iterations, but work only on those with affinity to this thread • The programmer guarantees that the iterations are independent; the behavior is undefined if there are dependencies across threads • Affinity expression, two options: integer (the iteration runs on the thread for which affinity % THREADS equals MYTHREAD) or pointer (it runs on the thread for which upc_threadof(affinity) equals MYTHREAD)

  20. Vector Addition with upc_forall • The vector addition example can be rewritten as follows • Equivalent code could use “&sum[i]” as the affinity expression • The code would still be correct (but slow) if the affinity expression were i+1 rather than i

  #define N 100*THREADS
  shared int v1[N], v2[N], sum[N];

  int main() {
      int i;
      upc_forall(i = 0; i < N; i++; i)   /* the last expression is the affinity expression */
          sum[i] = v1[i] + v2[i];
      return 0;
  }

  21. UPC Summary • UPC is an extension to C, implementing the PGAS model • Available as a GCC version, as Berkeley UPC, and from some vendors • Today most often used for graph problems and irregular parallelism • PGAS is a concept realized in UPC and other languages: Co-array Fortran, Titanium; Chapel, X10, Fortress (the HPCS languages) • Not covered: collective operations (reductions, etc., similar to MPI), dynamic memory allocation in shared space, UPC shared pointers

  22. DASH – Overview (Figure: a DASH array spans the memory of several nodes, whereas e.g. an STL vector or array lives within a single node) • DASH – PGAS in the form of a C++ template library • Focus on data structures: array a can be stored in the memory of several nodes, and a[i] transparently refers to local or remote memory via operator overloading (see the sketch below)

  dash::array<int> a(1000);
  a[23] = 412;
  std::cout << a[42] << std::endl;

  • Not a new language to learn • Can be integrated with existing (MPI) applications • Support for hierarchical locality: team hierarchies and locality iterators
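  For context, a self-contained sketch of what a complete program of this kind might look like; the dash::init / dash::finalize calls, dash::myid(), dash::size(), the libdash.h header, and the barrier() call follow common DASH examples and are assumptions here, while the dash::array spelling follows the slide.

  #include <iostream>
  #include <libdash.h>                      // assumed umbrella header

  int main(int argc, char* argv[]) {
      dash::init(&argc, &argv);             // start the runtime; SPMD: every unit runs main()

      dash::array<int> a(1000);             // one array, distributed over all units
      if (dash::myid() == 0) {
          a[23] = 412;                      // global index; may write to remote memory (put)
      }
      a.barrier();                          // assumed: make the write visible to all units
      if (dash::myid() == dash::size() - 1) {
          std::cout << a[23] << std::endl;  // global read; may fetch from remote memory (get)
      }

      dash::finalize();
      return 0;
  }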

  23. Hierarchical Machines • Machines are getting increasingly hierarchical, both within nodes and between nodes • Data locality is the most crucial factor for performance and energy efficiency • Hierarchical locality is not well supported by current approaches; PGAS languages usually only offer a two-level differentiation (local vs. remote) (Figure sources: LRZ SuperMUC system description; Bhatele et al., Avoiding hot-spots in two-level direct networks, SC 2011; Steve Keckler et al., Echelon system sketch)

  24. DASH – Overview and Project Partners (Figure: the DASH software stack; a DASH application sits on the DASH C++ template library, which sits on the DASH runtime and a one-sided communication substrate (MPI, GASNet, ARMCI, GASPI) on top of the hardware: network, processor, memory, storage; tools and interfaces attach to all layers) • Project partners: LMU Munich (K. Fürlinger), HLRS Stuttgart (J. Gracia), TU Dresden (A. Knüpfer), KIT Karlsruhe (J. Tao), CEODE Beijing (L. Wang, associated)

  25. DART: The DASH Runtime Interface (Figure: the same software stack as on the previous slide, with the DART API as the interface between the DASH C++ template library and the DASH runtime (DART)) • The DART API • Plain-C based interface • Follows the SPMD execution model • Defines units and teams • Defines a global memory abstraction and provides a global pointer • Defines one-sided access operations (puts and gets) • Provides collective and pairwise synchronization mechanisms

  26. Units and Teams • Unit: an individual participant in a DASH/DART program • Unit ≈ process (MPI) ≈ thread (UPC) ≈ image (CAF) • The execution model follows the classical SPMD (single program multiple data) paradigm • Each unit has a global ID that remains unchanged during execution • Team: an ordered subset of units, identified by an integer ID • DART_TEAM_ALL represents all units in a program • Units that are members of a team have a local ID with respect to that team

  27. Communication • One-sided puts and gets • Blocking and non-blocking versions • The performance of blocking puts and gets closely matches MPI performance (benchmark figure)

  28. DASH (C++ Template Library) • The 1D array is the basic data type • DASH follows a global-view approach, but local-view programming is supported too • Standard algorithms can be used, but may not yield the best performance • lbegin() and lend() allow iteration over the local elements (see the sketch below)
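  A brief sketch of the difference, assuming a dash::array<int> as on slide 22; only lbegin()/lend() are named on the slide, while a.begin()/a.end() and the specific standard algorithms are illustrative assumptions.

  #include <algorithm>
  #include <numeric>
  #include <libdash.h>                  // assumed umbrella header

  void global_vs_local(dash::array<int>& a) {
      // Global view: standard algorithms can run over the whole distributed range,
      // but every dereference may involve remote communication.
      auto global_min = std::min_element(a.begin(), a.end());

      // Local view: lbegin()/lend() cover only the elements stored on this unit,
      // so the reduction runs at local-memory speed.
      long local_sum = std::accumulate(a.lbegin(), a.lend(), 0L);

      (void)global_min;                 // silence unused-variable warnings in this sketch
      (void)local_sum;
  }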

  29. Data Distribution Patterns • A Pattern controls the mapping of an index space onto units • A team can be specified (the default team is used otherwise) • No datatype is specified for a pattern • Patterns guarantee a similar mapping for different containers • Patterns can be used to specify parallel execution

  30. Accessing and Working with Data in DASH (1) • GlobalPtr<T>: an abstraction that serves as the global iterator • GlobalRef<T>: an abstraction for a “reference to an element in global memory” that is returned by subscripting and by iterator dereferencing (see the sketch below)
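  A sketch of how these abstractions behave from the user's point of view; the type names follow the slide (the actual library may spell them differently), and the a.begin() iterator is an assumption.

  #include <libdash.h>                   // assumed umbrella header

  void access_example(dash::array<int>& a) {
      auto ref = a[42];                  // subscripting yields a GlobalRef<int>-style proxy, not an int
      int  val = ref;                    // reading converts the proxy to int (may trigger a remote get)
      ref = val + 1;                     // assigning writes through the proxy (may trigger a remote put)

      auto it = a.begin();               // GlobalPtr<int>-style global iterator
      ++it;                              // advances through the global index space
      *it = 0;                           // dereferencing again yields a GlobalRef<int>-style proxy
  }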

  31. Accessing and Working with Data in DASH (2) • Range-based for works on the global object by default • A proxy object can be used instead to access the local part of the data (see the sketch below)
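  A sketch of the two iteration styles; the name of the local proxy (a.local here) is an assumption for illustration, since the slide does not name it.

  #include <iostream>
  #include <libdash.h>                   // assumed umbrella header

  void print_elements(dash::array<int>& a) {
      for (int x : a) {                  // global by default: visits every element, possibly remote
          std::cout << x << " ";
      }
      for (int x : a.local) {            // assumed local proxy: visits only this unit's elements
          std::cout << x << " ";
      }
      std::cout << std::endl;
  }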

  32. Summary • Parallel programming is difficult • PGAS is an interesting middle ground between message passing and shared memory programming with threads • It inherits the advantages of both but also shares some of the disadvantages, specifically race conditions • PGAS is today mostly used for applications with irregular parallelism and random data accesses • UPC is the most widely used PGAS approach today; Co-array Fortran and other newer PGAS languages exist, as well as DASH and other C++ libraries Thank you for your attention!
