
Unified Parallel C


Presentation Transcript


  1. Unified Parallel C Kathy Yelick EECS, U.C. Berkeley and NERSC/LBNL NERSC Team: Dan Bonachea, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Christian Bell

  2. Outline • Global Address Space Languages in General • Programming models • Overview of Unified Parallel C (UPC) • Programmability advantage • Performance opportunity • Status • Next step • Related projects

  3. Programming Model 1: Shared Memory • Program is a collection of threads of control. • Many languages allow threads to be created dynamically. • Each thread has a set of private variables, e.g., local variables on the stack, together with a set of shared variables, e.g., static variables, shared common blocks, the global heap. • Threads communicate implicitly by writing/reading shared variables. • Threads coordinate using synchronization operations on shared variables. • [Figure: threads P0..Pn; a shared region holds x = ..., each thread's private region holds y = ..x ...]
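As an illustration of this model (not from the original slides), here is a minimal POSIX-threads sketch in C: shared_sum is a shared variable, each thread's argument is private, and a mutex provides the coordination; all names are illustrative.

    #include <pthread.h>
    #include <stdio.h>

    static int shared_sum = 0;                  /* shared: visible to all threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        int my_contribution = *(int *)arg;      /* private: copied onto this thread's stack */
        pthread_mutex_lock(&lock);              /* coordinate via synchronization */
        shared_sum += my_contribution;          /* communicate by writing shared data */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        int args[4] = {1, 2, 3, 4};
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, &args[i]);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("sum = %d\n", shared_sum);
        return 0;
    }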

  4. Programming Model 2: Message Passing • Program consists of a collection of named processes. • Usually fixed at program startup time. • Each process is a thread of control plus a local address space -- NO shared data. • Logically shared data is partitioned over the local processes. • Processes communicate by explicit send/receive pairs. • Coordination is implicit in every communication event. • MPI is the most common example. • [Figure: processes P0..Pn, each with a private address space; X is sent from one process and received into Y on another (send P0,X / recv Pn,Y)]
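A minimal MPI sketch of the send/receive pattern in the figure (illustrative, not from the slides); it assumes the program is launched with at least two ranks.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, x = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            x = 42;                                          /* X lives only in P0's address space */
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* explicit send to process 1 */
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,   /* matching receive into Y */
                     MPI_STATUS_IGNORE);
            printf("P1 received %d\n", x);
        }
        MPI_Finalize();
        return 0;
    }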

  5. Tradeoffs Between the Models • Shared memory • Programming is easier • Can build large shared data structures • Machines don’t scale • SMPs typically < 16 processors (Sun, DEC, Intel, IBM) • Distributed shared memory < 128 (SGI) • Performance is hard to predict and control • Message passing • Machines easier to build from commodity parts • Can scale (given sufficient network) • Programming is harder • Distributed data structures exist only in the programmer's mind • Tedious packing/unpacking of irregular data structures

  6. Global Address Space Programming • Intermediate point between message passing and shared memory • Program consists of a collection of processes. • Fixed at program startup time, like MPI • Local and shared data, as in shared memory model • But, shared data is partitioned over local processes • Remote data stays remote on distributed memory machines • Processes communicate by reads/writes to shared variables • Examples are UPC, Titanium, CAF, Split-C • Note: These are not data-parallel languages • heroic compilers not required
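A small UPC sketch of the global address space model described above (illustrative only): shared data is partitioned across threads, and a read of another thread's element uses ordinary array syntax even though it may be a remote access on a distributed-memory machine.

    #include <upc.h>
    #include <stdio.h>

    shared int data[THREADS];     /* shared array: element i has affinity to thread i */

    int main(void) {
        data[MYTHREAD] = MYTHREAD * 10;                 /* write the element we own */
        upc_barrier;                                    /* wait for all writes */
        int neighbor = data[(MYTHREAD + 1) % THREADS];  /* possibly remote read, same syntax */
        printf("thread %d read %d\n", MYTHREAD, neighbor);
        return 0;
    }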

  7. GAS Languages on Clusters of SMPs • Clusters of SMPs (CLUMPs) • IBM SP: 16-way SMP nodes • Berkeley Millennium: 2-way and 4-way nodes • What is an appropriate programming model? • Use message passing throughout • Most common model • Unnecessary packing/unpacking overhead • Hybrid models • Write 2 parallel programs (MPI + OpenMP or Threads) • Global address space • Only adds a test (on/off node) before local read/write

  8. Support for GAS Languages • Unified Parallel C (UPC) • Funded by the NSA • Compaq compiler for Alpha/Quadrics • HP, Sun and Cray compilers under development • Gcc-based compiler for SGI (Intrepid) • Gcc-based compiler (SRC) for Cray T3E • MTU and Compaq effort for MPI-based compiler • LBNL compiler based on Open64 • Co-Array Fortran (CAF) • Cray compiler • Rice and UMN effort based on Open64 • SPMD Java (Titanium) • UCB compiler available for most machines

  9. Parallelism Model in UPC • UPC uses an SPMD model of parallelism • A set of THREADS threads working independently • Two compilation models • THREADS may be fixed at compile time or • Dynamically set at program startup time • MYTHREAD specifies the thread index (0..THREADS-1) • Basic synchronization mechanisms • Barriers (normal and split-phase), locks • What UPC does not do automatically: • Determine data layout • Load balance – move computations • Caching – move data • These are intentionally left to the programmer
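A short SPMD sketch (illustrative, not from the slides) using the constructs named above: THREADS and MYTHREAD, a split-phase barrier (upc_notify/upc_wait), and a normal barrier.

    #include <upc.h>
    #include <stdio.h>

    shared int count[THREADS];        /* one counter per thread */

    int main(void) {
        count[MYTHREAD] = 1;          /* THREADS threads work independently */
        upc_notify;                   /* split-phase barrier: signal arrival ... */
        /* ... purely local work could be overlapped here ... */
        upc_wait;                     /* ... then wait for everyone else */

        if (MYTHREAD == 0) {          /* MYTHREAD runs from 0 to THREADS-1 */
            int i, total = 0;
            for (i = 0; i < THREADS; i++)
                total += count[i];
            printf("%d threads checked in\n", total);
        }
        upc_barrier;                  /* normal (blocking) barrier */
        return 0;
    }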

  10. UPC Pointers • Pointers may point to shared or private variables • Same syntax for use, just add a qualifier: shared int *sp; int *lp; • sp is a pointer to an integer residing in the shared memory space; it is called a shared pointer (somewhat sloppily) • lp is a private pointer to local memory • [Figure: on each thread, sp points to the shared variable x (value 3) in the global address space, while lp points into private memory]
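A usage sketch (illustrative) of the two pointer kinds: dereferencing lp is always a local access, while dereferencing sp may touch remote shared memory.

    #include <upc.h>

    shared int x;        /* one shared int, with affinity to thread 0 */
    int y;               /* one private int per thread */

    int main(void) {
        shared int *sp = &x;   /* pointer to shared: target may be remote */
        int *lp = &y;          /* private pointer: local memory only */

        *lp = 3;               /* always a plain local store */
        if (MYTHREAD == 0)
            *sp = 3;           /* same syntax, but this dereference may cost a remote access */
        upc_barrier;
        return 0;
    }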

  11. Shared Arrays in UPC • Shared array elements are spread across the threads: shared int x[THREADS]; /* One element per thread */ shared int y[3][THREADS]; /* 3 elements per thread */ shared int z[3*THREADS]; /* 3 elements per thread, cyclic */ • In the pictures below, assume THREADS = 4; elements with affinity to processor 0 are red • Of course, y is really a 2D array • [Figure: layouts of x and y (blocked) and z (cyclic)]
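An illustrative sketch of these declarations together with upc_forall, whose affinity expression makes each thread touch only the elements it owns; the block-size qualifier shared [3] (shown for zb) is the standard way to obtain a blocked layout.

    #include <upc.h>

    shared int z[3*THREADS];        /* 3 per thread, dealt out cyclically */
    shared [3] int zb[3*THREADS];   /* layout qualifier: blocks of 3 consecutive elements per thread */

    int main(void) {
        int i;
        /* upc_forall: the affinity expression &z[i] makes each thread
           execute only the iterations whose element it owns */
        upc_forall (i = 0; i < 3*THREADS; i++; &z[i])
            z[i] = MYTHREAD;
        upc_barrier;
        return 0;
    }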

  12. Overlapping Communication in UPC • Programs with fine-grained communication require overlap for performance • The UPC compiler does this automatically for “relaxed” accesses • Accesses may be designated as strict, relaxed, or unqualified (the default) • There are several ways of designating the ordering type: • A type qualifier, strict or relaxed, can be used to affect all variables of that type • Labels strict or relaxed can be used to control the accesses within a statement: strict : { x = y; z = y+1; } • A strict or relaxed cast can be used to override the current label or type qualifier
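A sketch of the type-qualifier approach (illustrative; it assumes the usual producer/consumer idiom and UPC's strict/relaxed ordering rules): the relaxed write to data may be overlapped by the compiler, while the strict accesses to flag act as ordering points.

    #include <upc.h>
    #include <upc_relaxed.h>        /* unqualified shared accesses default to "relaxed" */
    #include <stdio.h>

    strict  shared int flag = 0;    /* strict: ordered, usable for signalling */
    relaxed shared int data;        /* relaxed: compiler/runtime may overlap or reorder */

    int main(void) {
        if (MYTHREAD == 0) {
            data = 42;              /* relaxed write: may be overlapped with other work */
            flag = 1;               /* strict write: not reordered ahead of the data write */
        } else if (MYTHREAD == 1) {
            while (flag == 0)       /* strict reads: spin until thread 0 signals */
                ;
            printf("got %d\n", data);  /* data is visible once the strict flag is seen */
        }
        upc_barrier;
        return 0;
    }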

  13. Performance of UPC • Reasons why UPC may be slower than MPI • Shared array indexing is expensive • Small messages are encouraged by the model • Reasons why UPC may be faster than MPI • MPI encourages synchrony • Buffering is required for many MPI calls • A remote read/write of a single word may require very little overhead • Cray T3E, Quadrics interconnect (next version) • Assuming overlapped communication, the real issue is overhead: how much time does it take to issue a remote read/write?

  14. UPC vs. MPI: Sparse MatVec Multiply • Short-term goal: • Evaluate language and compilers using small applications • Longer term, identify a large application • Show advantage of the T3E network model and UPC • Performance on the Compaq machine is worse: • Serial code • Communication performance • New compiler just released

  15. UPC versus MPI for Edge Detection • [Figures: a. Execution time, b. Scalability] • Performance from Cray T3E • Benchmark developed by El Ghazawi’s group at GWU

  16. UPC versus MPI for Matrix Multiplication • [Figures: a. Execution time, b. Scalability] • Performance from Cray T3E • Benchmark developed by El Ghazawi’s group at GWU

  17. Implementing UPC • UPC extensions to C are small • < 1 person-year to implement in existing compiler • Simplest approach • Reads and writes of shared pointers become small message puts/gets • UPC has “relaxed” keyword for nonblocking communication • Small message performance is key • Advanced optimizations include conversions to bulk communication by either • Application programmer • Compiler
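An illustrative sketch of the programmer-level conversion to bulk communication mentioned above: instead of one small get per element, a single upc_memget fetches a whole remote block into a private buffer (the array name and sizes are made up for the example).

    #include <upc.h>

    #define N 1024
    shared [N] int a[N*THREADS];   /* blocked layout: N contiguous ints per thread */
    int local_copy[N];             /* private destination buffer on each thread */

    int main(void) {
        int i, next = (MYTHREAD + 1) % THREADS;

        for (i = 0; i < N; i++)    /* fill the block this thread owns */
            a[MYTHREAD * N + i] = MYTHREAD;
        upc_barrier;

        /* Fine-grained alternative: one small get per element
         *   for (i = 0; i < N; i++) local_copy[i] = a[next * N + i];
         * Bulk version: a single upc_memget moves the whole remote block */
        upc_memget(local_copy, &a[next * N], N * sizeof(int));
        return 0;
    }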

  18. Overview of NERSC Compiler • Compiler • Portable compiler infrastructure (UPC->C) • Explore optimizations: communication, shared pointers • Based on Open64: plan to release sources • Runtime systems for multiple compilers • Allow use by other languages (Titanium and CAF) • And in other UPC compilers, e.g., Intrepid • Performance of small message put/get are key • Designed to be easily ported, then tuned • Also designed for low overhead (macros, inline functions)

  19. Compiler and Runtime Status • Basic parsing and type-checking complete • Generates code for small serial kernels • Still testing and debugging • Needs runtime for complete testing • UPC runtime layer • Initial implementation should be done this month • Based on processes (not threads) on GASNet • GASNet • Initial specification complete • Reference implementation done on MPI • Working on Quadrics and IBM (LAPI…)

  20. Benchmarks for GAS Languages • EEL – end-to-end latency, or time spent sending a short message between two processes • BW – large-message network bandwidth • Parameters of the LogP model: • L – “latency”, or time spent on the network • During this time, the processor can be doing other work • O – “overhead”, or processor busy time on the sending or receiving side • During this time, the processor cannot be doing other work • We distinguish between “send” and “recv” overhead • G – “gap”, the minimum time between consecutive messages (its reciprocal is the rate at which messages can be pushed onto the network) • P – the number of processors

  21. LogP Parameters: Overhead & Latency • Non-overlapping overhead: EEL = osend + L + orecv • Send and recv overhead can overlap: EEL = f(osend, L, orecv) • [Figure: timelines for P0 and P1 showing osend, L, and orecv in each case]

  22. Benchmarks • Designed to measure the network parameters • Also provide: gap as function of queue depth • Measured for “best case” in general • Implemented once in MPI • For portability and comparison to target specific layer • Implemented again in target specific communication layer: • LAPI • ELAN • GM • SHMEM • VIPL
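For reference, a minimal MPI ping-pong sketch of the kind used to estimate EEL (illustrative, not the actual benchmark code): EEL is approximated as half the average round-trip time of a short message between two processes.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, iters = 10000;
        char byte = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {                    /* rank 0: send, then wait for the echo */
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {             /* rank 1: echo the message back */
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)                          /* EEL ~ round-trip / 2, in microseconds */
            printf("EEL approx %g us\n", (t1 - t0) / (2.0 * iters) * 1e6);
        MPI_Finalize();
        return 0;
    }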

  23. Results: EEL and Overhead

  24. Results: Gap and Overhead

  25. Send Overhead Over Time • Overhead has not improved significantly; T3D was best • Lack of integration; lack of attention in software

  26. Summary • Global address space languages offer alternative to MPI for large machines • Easier to use: shared data structures • Recover users left behind on shared memory? • Performance tuning still possible • Implementation • Small compiler effort given lightweight communication • Portable communication layer: GASNet • Difficulty with small message performance on IBM SP platform
