
PGAS: Principle, Programming and Performance



Presentation Transcript


  1. PGAS: Principle, Programming and Performance EPCC University of Edinburgh

  2. Outline • Petascale → Exascale: Programming just got harder • Why? What can be done • PGAS Principle and Programming • PGAS Performance • CAF benchmarks • comparing traditional MPI with CAF • Distributed hash table benchmark • comparing One-sided MPI with UPC and SHMEM implementations • Conclusions WPSE 2012 RIKEN, Kobe, Japan

  3. The Future ain’t what it used to be Past performance is not a guide to future performance. The value of the investment and the income deriving from it can go down as well as up and can't be guaranteed. You may get back less than you invested. Moore’s Law WPSE 2012 RIKEN, Kobe, Japan

  4. Moore’s Law Herb Sutter “The free lunch is over!” Clock speed fixed due to power constraints More cores per chip, not faster WPSE 2012 RIKEN, Kobe, Japan

  5. The rise of many-core chips [Chart: processor cores per silicon chip, log scale from 1 to 1000, versus time from circa 2003: hyperthreading → multi-core → many-core → GPUs] WPSE 2012 RIKEN, Kobe, Japan

  6. Memory Heterogeneity I: NUMA AMD Magny-Cours (24 cores) WPSE 2012 RIKEN, Kobe, Japan

  7. Memory Heterogeneity II: Accelerators • GPUs – disruptive technology • Complicated logic of CPUs (caches, controllers etc.) traded for computational power • Inherently very parallel • Explicit memory hierarchy within the GPU • Explicit memory transfer between host and device WPSE 2012 RIKEN, Kobe, Japan

  8. Make-up of a supercomputer • Emphasis over the last 15 years on commoditisation • HPC machines assembled from PC-like nodes • Distinct memory spaces • network (interconnect) differentiates clusters from supercomputers • Ethernet → Infiniband → Gemini (Cray) • Each node runs a UNIX-like OS • explicit inter-node data transfers → message passing • majority of programs written in C or Fortran • MPI ubiquitous for distributed data management • Some threaded programming • mostly OpenMP for shared-memory systems WPSE 2012 RIKEN, Kobe, Japan

  9. Three challenges of exascale • Power (HDW problem) • Scaling today's nodes to 2^30 cores → a power station required for an exascale machine • Future nodes must be low power • Fault tolerance (HDW problem?) • Scaling MTTI on today's nodes to 2^30 cores → uptime of minutes for an exascale machine • Future nodes must be fault tolerant • Programming (SFW problem) • Will the current MPI approach scale? • Memory per core will decrease dramatically • LMA/RMA power cost • Heterogeneous memory landscape • NUMA / accelerators WPSE 2012 RIKEN, Kobe, Japan

  10. Scaling MPI • Traditional MPI parallelism is data parallelism • Communication pattern is halo swap • Even weak scaling to 2^30 cores stresses MPI • Speaker is not aware of many apps that do • New algorithms required! • Halo swap requires memory redundancy • As cores per node increase • memory per core decreases • under strong scaling, halo size relative to interior per core increases WPSE 2012 RIKEN, Kobe, Japan

  11. Scaling MPI II • Large numbers of MPI tasks create other problems • MPI has lots of O(Ntask) state data • significant as Ntask → ∞ • Memory is already squeezed • Dominant communication pattern is linked send and receive • load balance and synchronisation issues [image: computing.llnl.gov] WPSE 2012 RIKEN, Kobe, Japan

  12. Hybrid Programming • Mixed-mode • message passing with MPI inter-node • threading with OpenMP intra-node • Relatively straightforward to update codes • Multi-core nodes are not like traditional shared-memory machines • NUMA means naively threading across all cores is not usually effective • Program for data layout and thread/NUMA-region affinity • Limited facility for this in OpenMP • Must maintain both pure-MPI and hybrid versions of the code WPSE 2012 RIKEN, Kobe, Japan

  13. PGAS Principle and Programming

  14. Partitioned Global Address Space (PGAS) • Distributed memory is globally addressable – GAS • Partitioned into shared and local – P • Direct read/write access to remote memory • Asynchronous put/get • remote node is not interrupted whilst the memory access occurs • no explicit buffering • May map directly onto hardware / direct compiler support • Cray XE6 hardware and Cray compiler • Language extensions • Unified Parallel C (UPC) • Co-Array Fortran (CAF) • Libraries • SHMEM (Cray SHMEM) • One-sided MPI How Super is PGAS? WPSE 2012 RIKEN, Kobe, Japan

  15. Distributed memory machine [Diagram: one process per node, each with its own memory and CPU]

  16. MPI • Mature, widely used and very portable • Implemented as calls to an external library • Linked send and receive messages (both ends know about the transfer) • Collective calls: broadcasts, gather/scatter, reductions, synchronisation • One-sided MPI calls in the MPI-2 standard • Remote Memory Access (RMA) • puts, gets and synchronisation • Not widely used WPSE 2012 RIKEN, Kobe, Japan

  17. MPI [Diagram: process 0 and process 1, each with its own memory and CPU; process 0 calls MPI_Send(a,...,1,...) and process 1 calls the matching MPI_Recv(a,...,0,...)]
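A minimal C sketch (not from the slides) of the linked send/receive pictured above; the buffer name a and the message length are illustrative.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        double a[10] = {0.0};
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* process 0 sends the buffer to process 1 (tag 0) */
            MPI_Send(a, 10, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* process 1 posts the matching receive from process 0 */
            MPI_Recv(a, 10, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received the message\n");
        }

        MPI_Finalize();
        return 0;
    }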

  18. WPSE 2012 RIKEN, Kobe, Japan CAF: extension to Fortran • SPMD execution model on multiple images • Declare data to be remotely accessible • Access remote data directly • Explicit synchronisation: global and point-to-point • Supports multiple dimensions and codimensions
    real, dimension(10), codimension[*] :: x
    real x(10)[*]
    x(2) = x(3)[7]    ! remote read from image 7
    x(6)[4] = x(1)    ! remote write to image 4
    sync all; sync images(imagelist)

  19. Fortran coarrays [Diagram: multiple images, each with its own memory and CPU]

  20. UPC • Parallel extension to ISO C99 • multiple threads with shared and private memory • Direct addressing of shared memory • Synchronisation: blocking and non-blocking • work sharing • private and shared pointers to private and shared objects • Data distribution of shared arrays • Cray compiler on XE6 supports UPC • Portable Berkeley UPC compiler • built with GCC
    upc_barrier;
    upc_forall(exp; exp; exp; affinity);
  WPSE 2012 RIKEN, Kobe, Japan
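A minimal UPC sketch (not from the slides) of a work-sharing loop over a default, cyclically distributed shared array; the array name y and its size are illustrative.

    #include <upc.h>
    #include <stdio.h>

    #define N 16

    shared int y[N];     /* distributed cyclically across threads by default */

    int main(void) {
        int i;

        /* each iteration executes on the thread with affinity to y[i] */
        upc_forall (i = 0; i < N; i++; &y[i]) {
            y[i] = MYTHREAD;
        }

        upc_barrier;     /* global synchronisation before reading remote elements */

        if (MYTHREAD == 0) {
            for (i = 0; i < N; i++)
                printf("y[%d] has affinity to thread %d\n", i, y[i]);
        }
        return 0;
    }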

  21. UPC [Diagram: multiple threads, each with its own memory and CPU]

  22. WPSE 2012 RIKEN, Kobe, Japan SHMEM • Symmetric variables for RMA • Same size, type and relative address on all PEs • Non-stack variables, global or local static • Dynamic allocation with shmalloc() • shmem_put() and shmem_get() routines for RMA • Cray SHMEM • OpenSHMEM • Synchronisation • shmem_barrier_all() • or a (user-specified size) array of locks with shmem_lock()
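A minimal C sketch (not from the slides) of symmetric allocation and a remote put, written against the classic SHMEM interface of the period (start_pes, shmalloc); newer OpenSHMEM spells these shmem_init() and shmem_malloc(). The variable names are illustrative.

    #include <stdio.h>
    #include <shmem.h>

    int main(void) {
        start_pes(0);                       /* initialise SHMEM */
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* symmetric allocation: same size and relative address on every PE */
        long *dest = (long *)shmalloc(sizeof(long));
        *dest = -1;
        shmem_barrier_all();

        /* each PE writes its own rank into 'dest' on its right-hand neighbour */
        long src = me;
        shmem_long_put(dest, &src, 1, (me + 1) % npes);

        shmem_barrier_all();                /* barrier also completes outstanding puts */
        printf("PE %d: dest = %ld\n", me, *dest);

        shfree(dest);
        return 0;
    }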

  23. PGAS Performance CAF benchmark Dr David Henty Presented at PARCO 2011

  24. Hardware • HECToR: UK national supercomputer • Cray machine in various stages • Phase 2 XT6: AMD 24-core Magny-Cours and SeaStar2+ interconnect • Phase 2 XE6: AMD 24-core Magny-Cours and Gemini interconnect • Phase 3 XE6: AMD 32-core Interlagos and Gemini interconnect WPSE 2012 RIKEN, Kobe, Japan

  25. Initial benchmark • Single contiguous point-to-point read and write • Multiple contiguous point-to-point read and write • Strided point-to-point read and write • All basic synchronisation operations • Various representative synchronisation patterns • Halo-swapping in a multi-dimensional regular domain decomposition • All communications include synchronisation cost • use double precision (8-byte) values as the basic type • use Fortran array syntax • Synchronisation overhead measured separately as overhead = (delay + sync) – delay WPSE 2012 RIKEN, Kobe, Japan

  26. Platforms • Limited compiler support at present • Cray Compiler Environment CCE 7.4.1 • Cray X2 • parallel vector system whose fat-tree interconnect natively supports Remote Memory Access (RMA) • Cray XT6 • MPP with dual 12-core AMD Opteron Magny-Cours nodes and Cray's SeaStar2+ torus interconnect • coarrays implemented over GASNET • Cray XE6 • same nodes as XT6 but the newer Cray Gemini torus interconnect, which natively supports RMA WPSE 2012 RIKEN, Kobe, Japan

  27. Pingpong WPSE 2012 RIKEN, Kobe, Japan

  28. Global synchronisation WPSE 2012 RIKEN, Kobe, Japan

  29. Strided pingpong WPSE 2012 RIKEN, Kobe, Japan

  30. Point-to-point synchronisation patterns • Should be advantageous for many images • images must mutually synchronise: sync images(imagelist) • user must ensure they all match up • LX • each image pairs with X images in a (periodic) line • e.g. L4 pairs with image-2, image-1, image+1 and image+2 • RX • same as LX but images are randomly permuted • ensures many off-node messages • 3D • six neighbours in a 3D Cartesian grid WPSE 2012 RIKEN, Kobe, Japan

  31. Point-to-point performance WPSE 2012 RIKEN, Kobe, Japan

  32. 3D halo swap • Fixed cubic volume, weak scaling • 3D domain decomposition as for MPI_Cart_dims • Measure time for complete halo swap operation • remote read (get) and write (put) • sync images and sync all WPSE 2012 RIKEN, Kobe, Japan

  33. 3D Halo swap results: XE6 WPSE 2012 RIKEN, Kobe, Japan

  34. Conclusions • Simple benchmarks give good insight into performance characteristics of Fortran coarrays • Cray compiler quite mature • good ability to pattern match (i.e. “vectorises” remote operations) • performance over GASNET (on XT6) has poor latency • performance generally good on XE6 • significant advantages from point-to-point synchronisation • explicit buffering would be faster than direct strided access WPSE 2012 RIKEN, Kobe, Japan

  35. Acknowledgements • People • David Henty and Mark Bull (EPCC), • Harvey Richardson (Cray Inc) for help and advice • Funding • Edinburgh University Exascale Technology Centre • EU-funded PRACE project • Facilities • Cray Inc • HECToR, the UK’s national high-performance computing service, which is provided by UoE HPCx Ltd at the University of Edinburgh, Cray Inc and NAG Ltd, and funded by the Office of Science and Technology through EPSRC’s High End Computing Programme WPSE 2012 RIKEN, Kobe, Japan

  36. PGAS performance Distributed hash table

  37. The hash table code • Simple C hash table code • supplied by Max Neunhoffer, St Andrews • creates integer-like objects • computes the hash of the object • populates the hash table with a pointer to the object • if the entry is already occupied, inserts the pointer at the next free place • revisits the hash table WPSE 2012 RIKEN, Kobe, Japan
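A minimal C sketch (not the St Andrews code) of the insert step described above, assuming a table of object pointers, a hypothetical hash function hashval(), and open addressing (next free place on a collision; the table is assumed never to fill).

    #include <stddef.h>

    #define TABSIZE 1024                    /* illustrative table size */

    typedef struct { long value; } numb;    /* stand-in for the integer-like object */

    static numb *hashtab[TABSIZE];          /* table of pointers to objects */

    /* illustrative hash; the real code computes the hash of the object */
    static size_t hashval(const numb *obj) {
        return (size_t)(obj->value * 2654435761UL) % TABSIZE;
    }

    /* insert a pointer to obj, probing linearly until a free slot is found */
    void insert(numb *obj) {
        size_t pos = hashval(obj);
        while (hashtab[pos] != NULL)        /* slot occupied: try the next one */
            pos = (pos + 1) % TABSIZE;
        hashtab[pos] = obj;
    }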

  38. Distributed hash table • There is almost no computation in the test code • the code is memory bound • In parallel, we cannot use a local pointer to a remote object • the distributed table has to hold a copy of the object itself • Increases memory access cost compared to sequential code • a UPC version with a shared pointer to a shared object is possible • consumes more shared memory, not considered • RMA access cost is much greater than direct memory access • Parallel code is slower than sequential code • More nodes → more communications WPSE 2012 RIKEN, Kobe, Japan

  39. MPI hash table • Declare memory structures on each MPI task • link them together with MPI_Win_create()
    MPI_Win win;
    numb *hashtab;
    numb nobj;

    /* each task allocates its own slice of the table ... */
    hashtab = (numb *)calloc(nobj, sizeof(numb));
    /* ... and exposes it for RMA through a window */
    MPI_Win_create(hashtab, nobj*sizeof(numb), sizeof(numb),
                   MPI_INFO_NULL, comm, &win);
  WPSE 2012 RIKEN, Kobe, Japan

  40. MPI data structure [Diagram: the local hash-table memory on ranks 0–3 joined into a single RMA window (window 0)] WPSE 2012 RIKEN, Kobe, Japan

  41. MPI_Put, MPI_Get • MPI ranks cannot see data on other ranks • Data is accessed by MPI RMA calls to the data structure in the window • Similarly for MPI_Put
    MPI_Get(&localHash, 1, MPI_INT,    /* origin address, count and type */
            destRank, destPos,         /* where: target rank and displacement */
            1, MPI_INT,                /* target count and type */
            win);                      /* window */
  WPSE 2012 RIKEN, Kobe, Japan

  42. Synchronisation • Several mechanisms • MPI_Win_fence – a barrier • MPI_Win_lock, MPI_Win_unlock • ensure no other MPI task can access the data element • avoid race conditions • Locks all the memory of the specified MPI task in the specified window • MPI_Win_unlock to release the lock
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE,   /* lock type  */
                 destRank,             /* which rank */
                 0,                    /* assertion  */
                 win);
  WPSE 2012 RIKEN, Kobe, Japan
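A hedged sketch (not the benchmark source) of how the RMA calls above might combine into single-slot remote reads and writes; the rank/position mapping and the helper names putRemote/getRemote are illustrative, and win is the window created on slide 39.

    #include <mpi.h>

    /* illustrative mapping: slot s lives at position s / nranks on rank s % nranks */

    /* write 'value' into slot 'destPos' of the table owned by 'destRank',
       holding an exclusive lock so no other task touches it meanwhile */
    void putRemote(int value, int destRank, int destPos, MPI_Win win)
    {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, destRank, 0, win);
        MPI_Put(&value, 1, MPI_INT, destRank, destPos, 1, MPI_INT, win);
        MPI_Win_unlock(destRank, win);   /* the put is complete after the unlock */
    }

    /* read back slot 'destPos' on 'destRank' into a local variable */
    int getRemote(int destRank, int destPos, MPI_Win win)
    {
        int localHash;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, destRank, 0, win);
        MPI_Get(&localHash, 1, MPI_INT, destRank, destPos, 1, MPI_INT, win);
        MPI_Win_unlock(destRank, win);   /* the get is complete after the unlock */
        return localHash;
    }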

  43. UPC: Shared arrays • If a shared variable is an array, space is allocated across the shared memory space in cyclic fashion by default
    int x;
    shared int y[16];
  [Diagram: threads 0–3 each hold a private copy of x; y is laid out cyclically in shared memory, so thread 0 has affinity to y[0], y[4], y[8], y[12], thread 1 to y[1], y[5], y[9], y[13], and so on]
  WPSE 2012 RIKEN, Kobe, Japan

  44. UPC: Data Distribution • Possible to "block" shared arrays by defining shared [blocksize] type array[n]
    int x;
    shared [2] int y[16];
  [Diagram: with a block size of 2, thread 0 has affinity to y[0], y[1], y[8], y[9], thread 1 to y[2], y[3], y[10], y[11], and so on]
  WPSE 2012 RIKEN, Kobe, Japan

  45. Collective dynamic allocation • The shared memory allocated is contiguous, similar to an array
    shared [B] int *ptr;
    ptr = (shared [B] int *)upc_all_alloc(THREADS, N*sizeof(int));
  [Diagram: N ints with affinity to each thread in shared memory; every thread holds a private pointer to the shared block]
  WPSE 2012 RIKEN, Kobe, Japan

  46. Memory allocation and affinity
    shared int *p1;
    p1 = (shared int *)upc_all_alloc(THREADS, N*sizeof(int));
  [Diagram: with N = 2 and three threads, shared elements 0 and 3 have affinity to thread 0, 1 and 4 to thread 1, 2 and 5 to thread 2; each thread has a private pointer p1 to the block]
  WPSE 2012 RIKEN, Kobe, Japan

  47. UPC hash table • Use a UPC shared pointer and collective memory allocation • No blocking factor • cyclic distribution is as good as any other • if the hash function is good enough • Shared memory → all threads can see all data elements • Use UPC functions to determine which thread "owns" a data element
    shared numb *hashtab;
    hashtab = (shared numb *)upc_all_alloc(THREADS, nobj*sizeof(numb));
    vthread = upc_threadof(&hashtab[v]);
  WPSE 2012 RIKEN, Kobe, Japan

  48. Synchronisation • As with MPI, use locks to control access to data • In UPC, declare an array of locks • Allocate memory with a non-collective call • so the locks are not all on thread 0 • The array of locks can be any size • one, NTHREADS, NDATA, NDATA/NTHREADS • In this example NTHREADS • Lock the entire thread's local data
    upc_lock_t *shared lock[THREADS];
    lock[MYTHREAD] = upc_global_lock_alloc();
  WPSE 2012 RIKEN, Kobe, Japan
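A hedged sketch (not the benchmark source) of how an insertion might use the lock array above: find the owning thread with upc_threadof, take that thread's lock, write the slot, release. hashtab, lock and THREADS follow the slides; the typedef for numb, the helper name insert_slot and the single-slot write (no probing shown) are illustrative.

    #include <upc.h>

    typedef long numb;                       /* illustrative 8-byte integer-like object */

    extern shared numb *hashtab;             /* allocated as on slide 47 */
    extern upc_lock_t *shared lock[THREADS]; /* allocated as on slide 48 */

    /* write 'obj' into table slot 'v' under the lock of the owning thread */
    void insert_slot(numb obj, size_t v)
    {
        size_t owner = upc_threadof(&hashtab[v]);  /* which thread owns slot v */

        upc_lock(lock[owner]);                     /* exclusive access to that thread's data */
        hashtab[v] = obj;                          /* remote write through the shared pointer */
        /* the real code would first test the slot and probe on a collision */
        upc_unlock(lock[owner]);
    }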

  49. SHMEM • Create symmetric data arrays with shmalloc() • create an array of locks of type long, sized by the number of PEs • initialise the locks with shmem_clear_lock(&lock) • Like MPI • data access via put/get calls • user determines the remote destination and array position • Unlike MPI • data created and allocated symmetrically • no window of grouped data WPSE 2012 RIKEN, Kobe, Japan
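A hedged sketch (not the benchmark source) of a lock-protected remote write in SHMEM, assuming the symmetric table and per-PE lock array described above; hashtab, tablock, destPE, destPos and the helper name put_slot are illustrative.

    #include <shmem.h>

    /* symmetric objects: allocated with shmalloc() after start-up, so they
       have the same relative address on every PE */
    long *hashtab;      /* this PE's slice of the distributed table */
    long *tablock;      /* one lock per PE, initialised with shmem_clear_lock() */

    /* write 'value' into slot 'destPos' of the table slice on PE 'destPE',
       holding that PE's lock for the duration of the update */
    void put_slot(long value, int destPE, int destPos)
    {
        shmem_set_lock(&tablock[destPE]);             /* acquire the remote lock */
        shmem_long_put(&hashtab[destPos], &value, 1, destPE);
        shmem_quiet();                                /* make sure the put has completed */
        shmem_clear_lock(&tablock[destPE]);           /* release the lock */
    }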

  50. Environment • All codes compiled with the Cray C compiler • and Cray SHMEM • benchmarks run on XE6 with the Gemini interconnect • Phase 3: AMD 32-core Interlagos • Phase 2: AMD 24-core Magny-Cours • Integer object of 8 bytes with two passes of the hash table • Weak scaling results show • wall clock time versus number of cores • fixed size of local hash table • different sizes are shown as different curves on the plots WPSE 2012 RIKEN, Kobe, Japan
