Explore the implementation and performance of the Berkeley UPC Compiler, focusing on portability, high-performance optimization, and shared memory management. Learn about key features like shared pointer arithmetic, global synchronization, and performance results on testbed platforms.
The Berkeley UPC Compiler: Implementation and Performance
Wei Chen, Dan Bonachea, Jason Duell, Parry Husbands, Costin Iancu, Kathy Yelick
The LBNL/Berkeley UPC Group
http://upc.lbl.gov
Outline
• An Overview of UPC
• Design and Implementation of the Berkeley UPC Compiler
• Preliminary Performance Results
• Communication Optimizations
Unified Parallel C (UPC)
• UPC is an explicitly parallel global address space language with SPMD parallelism
• An extension of C
• Shared memory is partitioned among threads
• One-sided (bulk and fine-grained) communication through reads and writes of shared variables
• A collective effort by industry, academia, and government
• http://upc.gwu.edu
(Figure: the global address space; a shared region holding X[0], X[1], ..., X[P] is partitioned across threads, and each thread also has its own private memory.)
UPC Programming Model Features
• Block-cyclically distributed arrays
• Shared and private pointers
• Global synchronization: barriers
• Pair-wise synchronization: locks
• Parallel loops
• Dynamic shared memory allocation
• Bulk shared memory accesses
• Strict vs. relaxed memory consistency models
(A small example illustrating several of these features follows.)
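A minimal sketch, not from the slides, of how several of these features look in UPC source. The array size N, block size, and the trivial work done per element are placeholders chosen for illustration:

    #include <upc.h>

    #define N (64 * THREADS)          /* array size; placeholder for this sketch */

    shared [4] int data[N];           /* block-cyclically distributed: blocks of 4 */
    shared int sum;                   /* shared scalar, has affinity to thread 0 */
    upc_lock_t *lock;                 /* private pointer to a shared lock object */

    int main(void) {
        int i;
        lock = upc_all_lock_alloc();  /* collective: all threads get the same lock */

        /* Parallel loop: iteration i runs on the thread with affinity to data[i]. */
        upc_forall (i = 0; i < N; i++; &data[i])
            data[i] = i;

        upc_barrier;                  /* global synchronization */

        /* Pair-wise synchronization around an update of shared data. */
        upc_lock(lock);
        sum += data[MYTHREAD];
        upc_unlock(lock);

        upc_barrier;
        return 0;
    }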
Overview of the Berkeley UPC Compiler
Two goals: portability and high performance. The compiler lowers UPC code into ANSI C code.
(Figure: the layered compilation stack)
• UPC code
• Translator (platform-independent)
• Translator-generated C code (network-independent)
• Berkeley UPC Runtime System: shared memory management and pointer operations (compiler-independent)
• GASNet Communication System: a uniform get/put interface over the underlying network (language-independent)
• Network hardware
A Layered Design
• Portable:
  • C is our intermediate language
  • Can run on top of MPI (with a performance penalty)
  • GASNet has a layered design with a small core
• High-performance:
  • Native C compiler optimizes serial code
  • Translator can perform high-level communication optimizations
  • GASNet can access network hardware directly
Implementing the UPC-to-C Translator
• Based on the Open64 compiler
• Source-to-source transformation
• Converts shared memory operations into runtime library calls
• Designed to incorporate the existing optimization framework in Open64
• Communicates with the runtime via a standard API
(Figure: translation pipeline: Preprocessed file → C front end → WHIRL with shared types → backend lowering → WHIRL with runtime calls → whirl2c → ANSI-compliant C code. A sketch of the lowering follows.)
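The lowering can be pictured with a small, self-contained mock (not from the slides): the mock_pts_t struct, the mock_upcr_get() helper, and the lowered function below are stand-ins invented for illustration; the real translator emits calls into the Berkeley UPC runtime API instead.

    #include <string.h>

    /* Simplified stand-in for a pointer-to-shared (illustrative only). */
    typedef struct {
        unsigned thread;   /* owning thread */
        unsigned phase;    /* offset within the block */
        char    *addr;     /* address within that thread's shared region */
    } mock_pts_t;

    /* Stand-in for a runtime "get": a real runtime would issue a network
     * get when the owning thread is remote; here we just copy locally. */
    static void mock_upcr_get(void *dst, mock_pts_t src, size_t nbytes) {
        memcpy(dst, src.addr, nbytes);
    }

    /* UPC source:            int y = x + 1;      (x is declared "shared int")
     * Conceptual translator output: the shared read becomes a runtime call. */
    static void lowered_read(mock_pts_t x_pts, int *y) {
        int tmp;
        mock_upcr_get(&tmp, x_pts, sizeof(int));
        *y = tmp + 1;
    }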
Shared Arrays and Pointers in UPC
• Cyclic:       shared int A[n];
• Block-cyclic: shared [2] int B[n];
• Indefinite:   shared [0] int *C = (shared [0] int *) upc_alloc(n);
• Use global pointers (pointers-to-shared) to access shared (possibly remote) data
• The block size is part of the pointer type
• A generic pointer-to-shared contains: address, thread id, and phase
(Figure, for two threads: T0 holds A[0], A[2], A[4], ..., B[0], B[1], B[4], B[5], ..., and all of C[0], C[1], C[2], ...; T1 holds A[1], A[3], A[5], ... and B[2], B[3], B[6], B[7], ...)
Accessing Shared Memory in UPC
(Figure: a pointer-to-shared holds three fields, Address, Thread, and Phase. Shared memory is partitioned among Thread 0 through Thread N-1; the thread field selects a thread's partition, the address locates the start of the block within that partition, and the phase gives the element's offset within its block of "block size" elements.)
Phaseless Pointer-to-Shared
• A pointer needs a "phase" to keep track of where it is within a block
  • A source of overhead for pointer arithmetic
• Special cases of "phaseless" pointers: cyclic and indefinite
  • Cyclic pointers always have phase 0
  • Indefinite pointers only have one block
• No need to maintain the phase in pointer operations on cyclic and indefinite pointers
• No need to update the thread id in indefinite pointer arithmetic
(A sketch of the resulting pointer-increment logic follows.)
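A minimal sketch, with assumed field names and a simplified layout (not the actual runtime code), of why phaseless pointers make arithmetic cheaper: incrementing a generic block-cyclic pointer must maintain the phase and wrap the thread id, while cyclic and indefinite pointers skip most of that work.

    #include <stddef.h>

    typedef struct {
        size_t   addr;    /* offset within the owning thread's shared region */
        unsigned thread;  /* owning thread id */
        unsigned phase;   /* position within the current block */
    } pts_t;              /* simplified pointer-to-shared (illustrative only) */

    /* Generic (block-cyclic) increment by one element of size elemsz,
     * block size B: both phase and thread must be maintained. */
    static pts_t incr_generic(pts_t p, size_t elemsz, unsigned B, unsigned nthreads) {
        p.phase++;
        p.addr += elemsz;
        if (p.phase == B) {                 /* crossed a block boundary */
            p.phase = 0;
            p.thread++;
            p.addr -= B * elemsz;           /* same block slot on the next thread... */
            if (p.thread == nthreads) {     /* ...unless we wrapped around all threads */
                p.thread = 0;
                p.addr += B * elemsz;       /* then move to the next block on thread 0 */
            }
        }
        return p;
    }

    /* Cyclic (block size 1): phase is always 0, only the thread id cycles. */
    static pts_t incr_cyclic(pts_t p, size_t elemsz, unsigned nthreads) {
        p.thread++;
        if (p.thread == nthreads) { p.thread = 0; p.addr += elemsz; }
        return p;
    }

    /* Indefinite (all data on one thread): plain address arithmetic. */
    static pts_t incr_indefinite(pts_t p, size_t elemsz) {
        p.addr += elemsz;
        return p;
    }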
Pointer-to-Shared Representation
• Pointer size
  • Want to allow a pointer to reside in a register
  • But very large machines may require a longer representation
• Datatype
  • Using a scalar type (long) rather than a struct may improve backend code quality
  • Faster pointer manipulation (e.g., ptr + int) as well as dereferencing
• Balancing portability and performance in the UPC compiler
  • 8-byte scalar vs. struct format (a configure-time option)
  • The pointer representation is hidden in the runtime layer
  • The modular design makes it easy to add new representations
(A sketch of a packed 8-byte format follows.)
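A minimal sketch of what a packed 8-byte pointer-to-shared might look like. The field widths and helper names below are assumptions made for illustration, not the actual Berkeley UPC layout; the point is that all three fields fit in one 64-bit scalar that can live in a register.

    #include <stdint.h>

    /* Illustrative packed format: 38-bit address offset, 16-bit thread id,
     * 10-bit phase, packed into a single 64-bit scalar. (Widths are made up.) */
    #define PHASE_BITS   10
    #define THREAD_BITS  16
    #define ADDR_BITS    38

    typedef uint64_t packed_pts_t;

    static inline packed_pts_t pts_pack(uint64_t addr, unsigned thread, unsigned phase) {
        return (addr << (THREAD_BITS + PHASE_BITS))
             | ((uint64_t)thread << PHASE_BITS)
             | phase;
    }

    static inline uint64_t pts_addr(packed_pts_t p)   { return p >> (THREAD_BITS + PHASE_BITS); }
    static inline unsigned pts_thread(packed_pts_t p) { return (p >> PHASE_BITS) & ((1u << THREAD_BITS) - 1); }
    static inline unsigned pts_phase(packed_pts_t p)  { return p & ((1u << PHASE_BITS) - 1); }

    /* Indefinite-pointer increment reduces to plain integer arithmetic on the
     * address field, which the backend C compiler optimizes well. */
    static inline packed_pts_t pts_indefinite_incr(packed_pts_t p, uint64_t elemsz) {
        return p + (elemsz << (THREAD_BITS + PHASE_BITS));
    }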
Performance Evaluation
• Testbed
  • HP AlphaServer (1 GHz processors) with a Quadrics interconnect
  • Compaq C compiler for the translated C code
  • Compared against HP UPC 2.1
• Cost of language features
  • Shared pointer arithmetic, shared memory accesses, parallel loops, etc.
• Application benchmarks
  • EP: no communication
  • IS: large bulk memory operations
  • MG: bulk memget
  • CG: fine-grained vs. bulk memput
• Potential of optimizations
  • Measure the effectiveness of various communication optimizations
Performance of Shared Pointer Arithmetic
(Chart: cost of pointer-to-shared arithmetic in cycles; 1 cycle = 1 ns; the struct representation is 16 bytes.)
• The phaseless pointer is an important optimization
• The packed representation also helps
Cost of Shared Memory Access
• Local shared accesses are somewhat slower than private accesses
• The layered design does not add additional overhead
• Remote accesses are a few orders of magnitude slower than local ones
Parallel Loops in UPC
• UPC has a "forall" construct for distributing computation:

    shared int v1[N], v2[N], v3[N];
    upc_forall (i = 0; i < N; i++; &v3[i])
        v3[i] = v2[i] + v1[i];

• The affinity test is performed on every iteration to decide whether that iteration should execute on this thread
• Two kinds of affinity expressions:
  • Integer (compared with the thread id)
  • Shared address (check the affinity of the address)
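For completeness, a sketch (not from the slides) of the integer-affinity form of the same loop. Because v1, v2, and v3 are cyclically distributed, this distributes iterations the same way as the &v3[i] form above:

    /* Integer affinity: iteration i runs on the thread whose id equals i % THREADS,
     * which for a cyclic array is exactly the thread that owns v3[i]. */
    upc_forall (i = 0; i < N; i++; i)
        v3[i] = v2[i] + v1[i];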
NAS Benchmarks (EP and IS)
• EP shows that the backend C compiler can still successfully optimize the translated C code
• IS shows that the Berkeley UPC compiler is effective for bulk communication operations
NAS Benchmarks (CG and MG)
• The Berkeley UPC compiler scales well on both benchmarks
Performance of Fine-grained Applications
• Does not scale well, due to the nature of the benchmark (many small reads)
• HP UPC's software caching helps its performance
Observations on the Results
• Acceptable worst-case overhead for shared memory access latencies
  • < 10 cycles of overhead for local shared accesses
  • ~1.5 µs of overhead compared to the end-to-end network latency
• Optimizations on the pointer-to-shared representation are effective
  • Both phaseless pointers and the packed 8-byte format
• Good performance compared to HP UPC 2.1
Communication Optimizations
• Hiding communication latencies
  • Use of non-blocking operations
  • Possible placement analysis, separating get() and put() as far as possible from the matching sync()
  • Message pipelining to overlap communication with more communication
• Optimizing shared memory accesses
  • Eliminating the locality test for local shared pointers: flow- and context-sensitive analysis
  • Transforming forall loops into equivalent for loops (see the sketch below)
  • Eliminating redundant pointer arithmetic for pointers with the same thread and phase
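A sketch (not from the slides) of the forall-to-for transformation for the vector-add loop shown earlier: when the affinity expression is &v3[i] and v3 is cyclically distributed, the per-iteration affinity test can be replaced by a strided loop in which each thread visits only its own elements.

    shared int v1[N], v2[N], v3[N];
    int i;

    /* Original loop: the affinity test &v3[i] is evaluated on every iteration. */
    upc_forall (i = 0; i < N; i++; &v3[i])
        v3[i] = v2[i] + v1[i];

    /* Equivalent transformed loop: v3 is cyclic, so thread t owns elements
     * t, t + THREADS, t + 2*THREADS, ...; each thread strides over its own
     * elements with no per-iteration affinity test. */
    for (i = MYTHREAD; i < N; i += THREADS)
        v3[i] = v2[i] + v1[i];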
More Optimizations
• Message vectorization and aggregation
  • Scatter/gather techniques
  • Packing generally pays off for small (< 500 byte) messages
• Software caching and prefetching
  • A prototype implementation:
    • Local knowledge only: no coherence messages
    • Cache remote reads and buffer outgoing writes
    • Based on the weak-ordering consistency model
(A sketch of message vectorization follows.)
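A minimal sketch (not from the slides) of message vectorization done by hand: a loop of fine-grained remote reads is replaced by a single bulk upc_memget into a private buffer. The array layout, block size, and peer choice are assumptions for this example; the key requirement is that the elements being fetched are contiguous on the remote thread.

    #include <upc.h>

    #define BLOCK 256

    /* Thread t owns src[t*BLOCK .. t*BLOCK + BLOCK-1] contiguously. */
    shared [BLOCK] double src[BLOCK * THREADS];

    double sum_remote_block(void) {
        double local_buf[BLOCK];
        double acc = 0.0;
        int i, peer = (MYTHREAD + 1) % THREADS;

        /* Fine-grained version (one small remote read per iteration):
         *     for (i = 0; i < BLOCK; i++) acc += src[peer * BLOCK + i];
         */

        /* Vectorized version: one bulk get of the whole remote block,
         * then purely local computation on the private copy. */
        upc_memget(local_buf, &src[peer * BLOCK], BLOCK * sizeof(double));
        for (i = 0; i < BLOCK; i++)
            acc += local_buf[i];
        return acc;
    }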
Experimental Results
• Computation/communication overlap works better than communication/communication overlap on Quadrics
• Results are likely to differ on other networks
Experimental Results
• Neither compiler performs well on the naïve version
• Culprit: pointer-to-shared operations and affinity tests
• Privatizing local shared accesses improves performance by an order of magnitude (see the sketch below)
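A sketch (not from the slides) of the privatization behind that order-of-magnitude improvement: when a shared element is known to have affinity to the calling thread, its pointer-to-shared can be cast to an ordinary C pointer and dereferenced directly, bypassing the pointer-to-shared machinery. The array declarations and function name are placeholders.

    #include <upc.h>

    #define N (100 * THREADS)
    shared int v1[N], v2[N], v3[N];

    void vadd_privatized(void) {
        int i;
        /* The affinity expression guarantees v3[i] is local to this thread, and
         * since all three arrays are cyclic with the same layout, so are v1[i]
         * and v2[i]; the casts to plain C pointers are therefore legal. */
        upc_forall (i = 0; i < N; i++; &v3[i]) {
            int *p1 = (int *)&v1[i];   /* legal: data has affinity to MYTHREAD */
            int *p2 = (int *)&v2[i];
            int *p3 = (int *)&v3[i];
            *p3 = *p1 + *p2;
        }
    }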
Compiler Status
• A fully UPC 1.1 compliant public release in April
• Supported platforms:
  • HP AlphaServer, IBM SP, Linux x86/Itanium, SGI Origin 2000, Solaris SPARC/x86, Mac OS X PowerPC
• Supported networks:
  • Quadrics/Elan, Myrinet/GM, IBM/LAPI, and MPI
• A release this summer will include:
  • Pthreads / System V shared memory support
  • GASNet InfiniBand support
Conclusion
• The Berkeley UPC Compiler achieves both portability and good performance
  • Layered, modular design
  • Effective pointer-to-shared optimizations
  • Good performance compared to a commercially available UPC compiler
• Still many opportunities for communication optimizations
• Available for download at http://upc.lbl.gov