Compilation Technology for Computational Science
Kathy Yelick, Lawrence Berkeley National Laboratory and UC Berkeley
Joint work with:
• The Titanium Group: S. Graham, P. Hilfinger, P. Colella, D. Bonachea, K. Datta, E. Givelberg, A. Kamil, N. Mai, A. Solar, J. Su, T. Wen
• The Berkeley UPC Group: C. Bell, D. Bonachea, W. Chen, J. Duell, P. Hargrove, P. Husbands, C. Iancu, R. Nishtala, M. Welcome
Parallelism Everywhere • Single processor Moore’s Law effect is ending • Power density limitations; device physics below 90nm • Multicore is becoming the norm • AMD, IBM, Intel, Sun all offering multicore • Number of cores per chip likely to increase with density • Fundamental software change • Parallelism is exposed to software • Performance is no longer solely a hardware problem • What has the HPC community learned? • Caveat: Scale and applications differ
High-end simulation in the physical sciences = 7 methods: Phillip Colella’s “Seven dwarfs”
1. Structured Grids (including Adaptive Mesh Refinement)
2. Unstructured Grids
3. Spectral Methods (FFTs, etc.)
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo Simulation
Add 4 for embedded; covers all 41 EEMBC benchmarks:
8. Search/Sort
9. Filter
10. Combinational logic
11. Finite State Machine
Note: Data sizes (8 bit to 32 bit) and types (integer, character) differ, but the algorithms are the same
Games/Entertainment close to scientific computing
Slide source: Phillip Colella, 2004 and Dave Patterson, 2006
Parallel Programming Models • Parallel software is still an unsolved problem! • Most parallel programs are written using either: • Message passing with a SPMD model • for scientific applications; scales easily • Shared memory with threads in OpenMP, Threads, or Java • non-scientific applications; easier to program • Partitioned Global Address Space (PGAS) Languages offer 3 features • Productivity: easy to understand and use • Performance: primary requirement in HPC • Portability: must run everywhere
Partitioned Global Address Space
• Global address space: any thread/process may directly read/write data allocated by another
• Partitioned: data is designated as local (near) or global (possibly far); programmer controls layout
• By default:
  • Object heaps are shared
  • Program stacks are private
[Figure: global address space across threads p0, p1, …, pn; each thread holds x (values 1, 5, 7) and y, plus per-thread entries l: and g:]
• 3 current languages: UPC, CAF, and Titanium
• Emphasis in this talk on UPC & Titanium (based on Java)
PGAS Language Overview • Many common concepts, although specifics differ • Consistent with base language • Both private and shared data • int x[10]; and shared int y[10]; • Support for distributed data structures • Distributed arrays; local and global pointers/references • One-sided shared-memory communication • Simple assignment statements: x[i] = y[i]; or t = *p; • Bulk operations: memcpy in UPC, array ops in Titanium and CAF • Synchronization • Global barriers, locks, memory fences • Collective Communication, IO libraries, etc.
Example: Titanium Arrays
• Ti Arrays created using Domains; indexed using Points:
    double [3d] gridA = new double [[0,0,0]:[10,10,10]];
• Eliminates some loop bound errors using foreach:
    foreach (p in gridA.domain())
      gridA[p] = gridA[p]*c + gridB[p];
• Rich domain calculus allows for slicing, subarray, transpose and other operations without data copies
• Array copy operations automatically work on the intersection:
    data[neighborPos].copy(mydata);
[Figure: mydata and data[neighborPos] overlap; the intersection is the copied area, lying between the “restrict”-ed (non-ghost) cells and the ghost cells]
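The foreach idiom can be modeled outside Titanium. The following is a minimal Python sketch, not Titanium code: the `Grid` class, its dictionary storage, and the small 3x3x3 domain are assumptions made for illustration. It shows only why iterating over the array's own domain removes hand-written loop bounds.

```python
from itertools import product


class Grid:
    """Toy stand-in for a Titanium array over a rectangular domain.

    Hypothetical helper: bounds are inclusive, as in [lo]:[hi].
    """

    def __init__(self, lo, hi, fill=0.0):
        self.lo, self.hi = tuple(lo), tuple(hi)
        self.data = {p: fill for p in self.domain()}

    def domain(self):
        # Every integer point p with lo <= p <= hi in each dimension.
        ranges = [range(l, h + 1) for l, h in zip(self.lo, self.hi)]
        return product(*ranges)

    def __getitem__(self, p):
        return self.data[p]

    def __setitem__(self, p, value):
        self.data[p] = value


c = 2.0
grid_a = Grid((0, 0, 0), (2, 2, 2), fill=1.0)
grid_b = Grid((0, 0, 0), (2, 2, 2), fill=3.0)

# Analog of: foreach (p in gridA.domain()) gridA[p] = gridA[p]*c + gridB[p];
for p in grid_a.domain():
    grid_a[p] = grid_a[p] * c + grid_b[p]
```

The loop never mentions 0 or 2, so changing the domain in the constructor cannot leave a stale bound behind.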
Productivity: Line Count Comparison • Comparison of NAS Parallel Benchmarks • UPC version has modest programming effort relative to C • Titanium even more compact, especially for MG, which uses multi-d arrays • Caveat: Titanium FT has user-defined Complex type and cross-language support used to call FFTW for serial 1D FFTs UPC results from Tarek El-Ghazawi et al; CAF from Chamberlain et al; Titanium joint with Kaushik Datta & Dan Bonachea
Case Study 1: Block-Structured AMR • Adaptive Mesh Refinement (AMR) is challenging • Irregular data accesses and control from boundaries • Mixed global/local view is useful • Titanium AMR benchmarks available. AMR Titanium work by Tong Wen and Phillip Colella
AMR in Titanium
C++/Fortran/MPI AMR
• Chombo package from LBNL
• Bulk-synchronous comm:
  • Pack boundary data between procs
Titanium AMR
• Entirely in Titanium
• Finer-grained communication
  • No explicit pack/unpack code
  • Automated in runtime system
10X reduction in lines of code!*
*Somewhat more functionality in PDE part of Chombo code
Work by Tong Wen and Phillip Colella; Communication optimizations joint with Jimmy Su
Performance of Titanium AMR Comparable performance • Serial: Titanium is within a few % of C++/F; sometimes faster! • Parallel: Titanium scaling is comparable with generic optimizations - additional optimizations (namely overlap) not yet implemented
Immersed Boundary Simulation in Titanium • Modeling elastic structures in an incompressible fluid. • Blood flow in the heart, blood clotting, inner ear, embryo growth, and many more • Complicated parallelization • Particle/Mesh method • “Particles” connected into materials Joint work with Ed Givelberg, Armando Solar-Lezama
High Performance Strategy for acceptance of a new language • Within HPC: Make it run faster than anything else Approaches to high performance • Language support for performance: • Allow programmers sufficient control over resources for tuning • Non-blocking data transfers, cross-language calls, etc. • Control over layout, load balancing, and synchronization • Compiler optimizations reduce need for hand tuning • Automate non-blocking memory operations, relaxed memory,… • Productivity gains through parallel analysis and optimizations • Runtime support exposes best possible performance • Berkeley UPC and Titanium use GASNet communication layer • Dynamic optimizations based on runtime information
One-Sided vs Two-Sided • A one-sided put/get message can be handled directly by a network interface with RDMA support • Avoid interrupting the CPU or storing data from CPU (preposts) • A two-sided message needs to be matched with a receive to identify the memory address where the data goes • Offloaded to Network Interface in networks like Quadrics • Need to download match tables to interface (from host) [Figure: a one-sided put message (address, data payload) and a two-sided message (message id, data payload) arriving at a node with network interface, host CPU, and memory]
(up is good) Performance Advantage of One-Sided Communication: GASNet vs MPI • Opteron/InfiniBand (Jacquard at NERSC): • GASNet’s vapi-conduit and OSU MPI 0.9.5 MVAPICH • Half power point (N½) differs by one order of magnitude Joint work with Paul Hargrove and Dan Bonachea
(down is good) GASNet: Portability and High-Performance GASNet better for latency across machines Joint work with UPC Group; GASNet design by Dan Bonachea
(up is good) GASNet: Portability and High-Performance GASNet at least as high (comparable) for large messages Joint work with UPC Group; GASNet design by Dan Bonachea
(up is good) GASNet: Portability and High-Performance GASNet excels at mid-range sizes: important for overlap Joint work with UPC Group; GASNet design by Dan Bonachea
Case Study 2: NAS FT • Performance of Exchange (Alltoall) is critical • 1D FFTs in each dimension, 3 phases • Transpose after first 2 for locality • Bisection bandwidth-limited • Problem as #procs grows • Three approaches: • Exchange: • wait for 2nd dim FFTs to finish, send 1 message per processor pair • Slab: • wait for chunk of rows destined for 1 proc, send when ready • Pencil: • send each row as it completes Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea
Overlapping Communication • Goal: make use of “all the wires all the time” • Schedule communication to avoid network backup • Trade-off: overhead vs. overlap • Exchange has fewest messages, less message overhead • Slabs and pencils have more overlap; pencils the most • Example: Class D problem on 256 Processors Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea
NAS FT Variants Performance Summary .5 Tflops • Slab is always best for MPI; small message cost too high • Pencil is always best for UPC; more overlap Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea
Case Study 3: LU Factorization • Direct methods have complicated dependencies • Especially with pivoting (unpredictable communication) • Especially for sparse matrices (dependence graph with holes) • LU Factorization in UPC • Use overlap ideas and multithreading to mask latency • Multithreaded: UPC threads + user threads + threaded BLAS • Panel factorization: Including pivoting • Update to a block of U • Trailing submatrix updates • Status: • Dense LU done: HPL-compliant • Sparse version underway Joint work with Parry Husbands
UPC HPL Performance • Comparison to ScaLAPACK on an Altix, a 2 x 4 process grid • ScaLAPACK (block size 64) 25.25 GFlop/s (tried several block sizes) • UPC LU (block size 256) - 33.60 GFlop/s, (block size 64) - 26.47 GFlop/s • n = 32000 on a 4x4 process grid • ScaLAPACK - 43.34 GFlop/s (block size = 64) • UPC - 70.26 Gflop/s (block size = 200) • MPI HPL numbers from HPCC database • Large scaling: • 2.2 TFlops on 512p, • 4.4 TFlops on 1024p (Thunder) Joint work with Parry Husbands
Automating Support for Optimizations • The previous examples are hand-optimized • Non-blocking put/get on distributed memory • Relaxed memory consistency on shared memory • What analyses are needed to optimize parallel codes? • Concurrency analysis: determine which blocks of code could run in parallel • Alias analysis: determine which variables could access the same location • Synchronization analysis: align matching barriers, locks… • Locality analysis: determine when a general (global) pointer is used only locally, so it can be converted to a cheaper local pointer Joint work with Amir Kamil and Jimmy Su
Reordering in Parallel Programs
In parallel programs, a reordering can change the semantics even if no local dependencies exist.
Initially, flag = data = 0
Original:
  T1: data = 1; flag = 1
  T2: f = flag; d = data
After reordering the writes in T1:
  T1: flag = 1; data = 1
  T2: f = flag; d = data
{f == 1, d == 0} is possible after reordering; not in original
Compiler, runtime, and hardware can produce such reorderings
Joint work with Amir Kamil and Jimmy Su
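The claim on this slide can be checked by brute force. The sketch below is an illustrative Python model, not part of the Titanium or UPC toolchains: it assumes each statement executes atomically, enumerates every interleaving that preserves each thread's program order, and collects the (f, d) pairs T2 can observe. The helper names `interleavings` and `outcomes` are hypothetical.

```python
from itertools import combinations


def interleavings(t1, t2):
    """Yield every merge of t1 and t2 that keeps each thread's own order."""
    n = len(t1) + len(t2)
    for slots in combinations(range(n), len(t1)):
        it1, it2 = iter(t1), iter(t2)
        yield [next(it1) if i in slots else next(it2) for i in range(n)]


def outcomes(t1, t2):
    """Set of (f, d) pairs that T2 can observe over all interleavings."""
    results = set()
    for schedule in interleavings(t1, t2):
        mem = {"flag": 0, "data": 0}  # initially flag = data = 0
        regs = {}
        for op, *args in schedule:
            if op == "write":
                var, val = args
                mem[var] = val
            else:  # "read": copy a shared variable into a thread-local name
                reg, var = args
                regs[reg] = mem[var]
        results.add((regs["f"], regs["d"]))
    return results


T2 = [("read", "f", "flag"), ("read", "d", "data")]
original = outcomes([("write", "data", 1), ("write", "flag", 1)], T2)
reordered = outcomes([("write", "flag", 1), ("write", "data", 1)], T2)
```

In this model `(1, 0)` appears in `reordered` but not in `original`, matching the slide: swapping two writes that are independent within T1 becomes visible to T2.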
Memory Models • Sequential consistency: a reordering is illegal if it can be observed by another thread • Relaxed consistency: reordering may be observed, but local dependencies and synchronization preserved (roughly) • Titanium, Java, & UPC are not sequentially consistent • Perceived cost of enforcing it is too high • For Titanium and UPC, network latency is the cost • For Java, shared-memory fences and code transformations are the cost Joint work with Amir Kamil and Jimmy Su
Conflicts
• Reordering of an access is observable only if it conflicts with some other access:
  • The accesses can be to the same memory location
  • At least one access is a write
  • The accesses can run concurrently
• Fences (compiler and hardware) need to be inserted around accesses that conflict
  T1: data = 1; flag = 1
  T2: f = flag; d = data
  Conflicts: data = 1 with d = data; flag = 1 with f = flag
Joint work with Amir Kamil and Jimmy Su
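The three conditions translate directly into a predicate. This is a hedged Python sketch: the `Access` record and the default concurrency test (accesses on different threads are assumed able to run concurrently) are assumptions for illustration. In the analysis described in this talk, the last condition would come from concurrency analysis and the first from alias analysis.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Access:
    thread: str
    var: str        # memory location, named symbolically here
    is_write: bool
    text: str       # source text, for readable output


def different_threads(a, b):
    # Placeholder assumption: any two accesses on different threads may overlap.
    return a.thread != b.thread


def conflicts(a, b, may_run_concurrently=different_threads):
    """Same location, at least one write, and possibly concurrent."""
    return (
        a.var == b.var
        and (a.is_write or b.is_write)
        and may_run_concurrently(a, b)
    )


accesses = [
    Access("T1", "data", True, "data = 1"),
    Access("T1", "flag", True, "flag = 1"),
    Access("T2", "flag", False, "f = flag"),
    Access("T2", "data", False, "d = data"),
]
conflict_pairs = [
    (a.text, b.text) for a, b in combinations(accesses, 2) if conflicts(a, b)
]
```

Only the two write/read pairs on `data` and `flag` conflict, so those are the accesses that would need fences; a sharper `may_run_concurrently` can only shrink this set.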
Sequential Consistency in Titanium • Minimize number of fences – allow same optimizations as relaxed model • Concurrency analysis identifies concurrent accesses • Relies on Titanium’s textual barriers and single-valued expressions • Alias analysis identifies accesses to the same location • Relies on SPMD nature of Titanium Joint work with Amir Kamil and Jimmy Su
Barrier Alignment
• Many parallel languages make no attempt to ensure that barriers line up
• Example code that is legal but will deadlock:
    if (Ti.thisProc() % 2 == 0)
      Ti.barrier(); // even ID threads
    else
      ; // odd ID threads
Joint work with Amir Kamil and Jimmy Su
Structural Correctness
• Aiken and Gay introduced structural correctness (POPL’98)
• Ensures that every thread executes the same number of barriers
• Example of structurally correct code:
    if (Ti.thisProc() % 2 == 0)
      Ti.barrier(); // even ID threads
    else
      Ti.barrier(); // odd ID threads
Joint work with Amir Kamil and Jimmy Su
Textual Barrier Alignment
• Titanium has textual barriers: all threads must execute the same textual sequence of barriers
• Stronger guarantee than structural correctness; this example is illegal:
    if (Ti.thisProc() % 2 == 0)
      Ti.barrier(); // even ID threads
    else
      Ti.barrier(); // odd ID threads
• Single-valued expressions used to enforce textual barriers
Joint work with Amir Kamil and Jimmy Su
Single-Valued Expressions
• A single-valued expression has the same value on all threads when evaluated
  • Example: Ti.numProcs() > 1
• All threads guaranteed to take the same branch of a conditional guarded by a single-valued expression
  • Only single-valued conditionals may have barriers
  • Example of legal barrier use:
    if (Ti.numProcs() > 1)
      Ti.barrier(); // multiple threads
    else
      ; // only one thread total
Joint work with Amir Kamil and Jimmy Su
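The difference between the three barrier examples can be illustrated with per-thread traces. The Python sketch below is a dynamic, trace-level model and not the static checks that Titanium or Aiken and Gay perform: each toy program returns the list of textual barrier sites a thread would execute, and all function names are hypothetical.

```python
def structurally_correct(traces):
    """Every thread executes the same number of barriers."""
    return len({len(t) for t in traces}) <= 1


def textually_aligned(traces):
    """Every thread executes the same textual sequence of barriers."""
    return len({tuple(t) for t in traces}) <= 1


def run(program, num_threads):
    # SPMD: every thread runs the same program with its own thread id.
    return [program(tid, num_threads) for tid in range(num_threads)]


def deadlocking(tid, n):
    # Barrier only in the even-ID branch.
    return ["then"] if tid % 2 == 0 else []


def structural_only(tid, n):
    # A barrier in each branch, but at two different textual sites.
    return ["then"] if tid % 2 == 0 else ["else"]


def single_valued(tid, n):
    # Guard depends only on n, which is the same on every thread.
    return ["then"] if n > 1 else []
```

With four threads, `run(deadlocking, 4)` fails both checks, `run(structural_only, 4)` passes only the structural check, and `run(single_valued, 4)` passes both, because a guard that ignores the thread id cannot split the threads across branches.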
Concurrency Analysis
• Graph generated from program as follows:
  • Node added for each code segment between barriers and single-valued conditionals
  • Edges added to represent control flow between segments
    // code segment 1
    if ([single])
      // code segment 2
    else
      // code segment 3
    // code segment 4
    Ti.barrier()
    // code segment 5
[Figure: control-flow graph over nodes 1 to 5; node 1 branches to nodes 2 and 3, both lead to node 4, and the barrier separates node 4 from node 5]
Joint work with Amir Kamil and Jimmy Su
Concurrency Analysis (II)
• Two accesses can run concurrently if:
  • They are in the same node, or
  • One access’s node is reachable from the other access’s node without hitting a barrier
• Algorithm: remove barrier edges, do DFS
[Figure: the same graph over nodes 1, 2, 3, 4, barrier, 5]
Joint work with Amir Kamil and Jimmy Su
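The "remove barrier edges, do DFS" step can be sketched in a few lines. This Python model assumes the five-segment example graph, with the edge from segment 4 to segment 5 being the one that crosses the barrier; the dictionary encoding and function names are illustrative, not the Titanium compiler's data structures.

```python
# Control-flow graph of the five code segments in the example.
graph = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
barrier_edges = {(4, 5)}  # assumption: this edge crosses Ti.barrier()


def reachable_without_barrier(graph, barrier_edges, start):
    """Iterative DFS that ignores edges crossing a barrier."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for succ in graph[node]:
            if (node, succ) in barrier_edges or succ in seen:
                continue
            seen.add(succ)
            stack.append(succ)
    return seen


def may_run_concurrently(graph, barrier_edges, a, b):
    """Same node, or one node reaches the other without hitting a barrier."""
    if a == b:
        return True
    return (
        b in reachable_without_barrier(graph, barrier_edges, a)
        or a in reachable_without_barrier(graph, barrier_edges, b)
    )
```

Segments 1 and 4 may run concurrently, since one thread can still be in segment 1 while another has reached segment 4. Segments 2 and 3 may not, because the conditional is single-valued and all threads take the same branch. Segment 5 is separated from everything else by the barrier.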
Alias Analysis • Allocation sites correspond to abstract locations (a-locs) • All explicit and implicit program variables have points-to sets • A-locs are typed and have points-to sets for each field of the corresponding type • Arrays have a single points-to set for all indices • Analysis is flow- and context-insensitive • Experimental call-site sensitive version: doesn’t seem to help much Joint work with Amir Kamil and Jimmy Su
Thread-Aware Alias Analysis • Two types of abstract locations: local and remote • Local locations reside in local thread’s memory • Remote locations reside on another thread • Exploits SPMD property • Results are a summary over all threads • Independent of the number of threads at runtime Joint work with Amir Kamil and Jimmy Su
Alias Analysis: Allocation
• Creates new local abstract location
• Result of allocation must reside in local memory
    class Foo { Object z; }
    static void bar() {
      L1: Foo a = new Foo();
          Foo b = broadcast a from 0;
          Foo c = a;
      L2: a.z = new Object();
    }
Joint work with Amir Kamil and Jimmy Su
Alias Analysis: Assignment
• Copies source abstract locations into points-to set of target
    class Foo { Object z; }
    static void bar() {
      L1: Foo a = new Foo();
          Foo b = broadcast a from 0;
          Foo c = a;
      L2: a.z = new Object();
    }
Joint work with Amir Kamil and Jimmy Su
Alias Analysis: Broadcast
• Produces both local and remote versions of source abstract location
• Remote a-loc points to remote analog of what local a-loc points to
    class Foo { Object z; }
    static void bar() {
      L1: Foo a = new Foo();
          Foo b = broadcast a from 0;
          Foo c = a;
      L2: a.z = new Object();
    }
Joint work with Amir Kamil and Jimmy Su
Aliasing Results • Two variables A and B may alias if: ∃x ∈ pointsTo(A). x ∈ pointsTo(B) • Two variables A and B may alias across threads if: ∃x ∈ pointsTo(A). R(x) ∈ pointsTo(B) (where R(x) is the remote counterpart of x) Joint work with Amir Kamil and Jimmy Su
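The allocation, assignment, and broadcast rules and the two alias queries can be traced on the running `bar()` example. The following Python sketch is a simplified model under stated assumptions: a-locs are (site, "local"/"remote") pairs, statements are applied once in program order (enough for this straight-line example, whereas the real analysis is flow-insensitive and iterates to a fixed point), and every function name is hypothetical.

```python
LOCAL, REMOTE = "local", "remote"

pts = {}        # variable -> set of abstract locations (a-locs)
field_pts = {}  # (a-loc, field name) -> set of a-locs


def remote_of(aloc):
    """R(x): the remote counterpart of an a-loc."""
    site, _ = aloc
    return (site, REMOTE)


def allocate(var, site):
    # Allocation creates a new local a-loc for the allocation site.
    pts.setdefault(var, set()).add((site, LOCAL))


def assign(dst, src):
    # Assignment copies the source a-locs into the target's points-to set.
    pts.setdefault(dst, set()).update(pts.get(src, set()))


def broadcast(dst, src):
    # Broadcast produces both local and remote versions of each source a-loc.
    for x in pts.get(src, set()):
        pts.setdefault(dst, set()).update({x, remote_of(x)})


def store_new(var, field, site):
    # var.field = new ...; the remote analog mirrors the local a-loc.
    new = (site, LOCAL)
    for x in pts.get(var, set()):
        field_pts.setdefault((x, field), set()).add(new)
        field_pts.setdefault((remote_of(x), field), set()).add(remote_of(new))


def may_alias(a, b):
    return bool(pts[a] & pts[b])


def may_alias_across_threads(a, b):
    return any(remote_of(x) in pts[b] for x in pts[a])


allocate("a", "L1")         # L1: Foo a = new Foo();
broadcast("b", "a")         #     Foo b = broadcast a from 0;
assign("c", "a")            #     Foo c = a;
store_new("a", "z", "L2")   # L2: a.z = new Object();
```

In this model `a` and `c` may alias only within a thread, while `a` and `b` may also alias across threads, because `b` can hold the copy of `a` that lives on thread 0.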
Benchmarks 1 Line counts do not include the reachable portion of the 137,000 line Titanium/Java 1.0 libraries Joint work with Amir Kamil and Jimmy Su
Analysis Levels • We tested analyses of varying levels of precision Joint work with Amir Kamil and Jimmy Su
Static (Logical) Fences GOOD Percentages are for number of static fences reduced over naive Joint work with Amir Kamil and Jimmy Su
Dynamic (Executed) Fences GOOD Percentages are for number of dynamic fences reduced over naive Joint work with Amir Kamil and Jimmy Su
Dynamic Fences: gsrb • gsrb relies on dynamic locality checks • slight modification to remove checks (gsrb*) greatly increases precision of analysis GOOD Joint work with Amir Kamil and Jimmy Su
Two Example Optimizations • Consider two optimizations for GAS languages • Overlap bulk memory copies • Communication aggregation for irregular array accesses (i.e. a[b[i]]) • Both optimizations reorder accesses, so sequential consistency can inhibit them • Both are addressing network performance, so potential payoff is high Joint work with Amir Kamil and Jimmy Su
Array Copies in Titanium • Array copy operations are commonly used: dst.copy(src); • Content in the domain intersection of the two arrays is copied from src to dst • Communication (possibly with packing) required if arrays reside on different threads • Processor blocks until the operation is complete. [Figure: overlapping src and dst arrays] Joint work with Amir Kamil and Jimmy Su
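The intersection semantics of `dst.copy(src)` can be sketched in plain Python. This is an illustrative model, not Titanium's runtime: arrays are dictionaries from points to values, domains are pairs of inclusive corner points, and the helper names `intersect`, `make_array`, and `copy_into` are assumptions.

```python
from itertools import product


def intersect(dom_a, dom_b):
    """Intersection of two rectangular domains, or None if they are disjoint."""
    (alo, ahi), (blo, bhi) = dom_a, dom_b
    lo = tuple(max(x, y) for x, y in zip(alo, blo))
    hi = tuple(min(x, y) for x, y in zip(ahi, bhi))
    return (lo, hi) if all(l <= h for l, h in zip(lo, hi)) else None


def points(dom):
    lo, hi = dom
    return product(*[range(l, h + 1) for l, h in zip(lo, hi)])


def make_array(dom, fill):
    return {p: fill for p in points(dom)}


def copy_into(dst, dst_dom, src, src_dom):
    """Model of dst.copy(src): copy src values into dst on the overlap only."""
    common = intersect(dst_dom, src_dom)
    if common is None:
        return 0
    copied = 0
    for p in points(common):
        dst[p] = src[p]
        copied += 1
    return copied


dst_dom = ((0, 0), (3, 3))
src_dom = ((2, 2), (5, 5))
dst = make_array(dst_dom, 0)
src = make_array(src_dom, 9)
copied = copy_into(dst, dst_dom, src, src_dom)
```

Only the four points in [2,2]:[3,3] change in `dst`. A ghost-cell exchange relies on exactly this: the caller never computes the overlap by hand.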
Non-Blocking Array Copy Optimization • Automatically convert blocking array copies into non-blocking array copies • Push sync as far down the instruction stream as possible to allow overlap with computation • Interprocedural: syncs can be moved across method boundaries • Optimization reorders memory accesses – may be illegal under sequential consistency Joint work with Amir Kamil and Jimmy Su
Communication Aggregation on Irregular Array Accesses (Inspector/Executor)
• A loop containing indirect array accesses is split into phases
  • Inspector examines loop and computes reference targets
  • Required remote data gathered in a bulk operation
  • Executor uses data to perform actual computation
• Can be illegal under sequential consistency
Original loop:
    for (...) {
      a[i] = remote[b[i]];
      // other accesses
    }
Transformed loop:
    schd = inspect(remote, b);
    tmp = get(remote, schd);
    for (...) {
      a[i] = tmp[i];
      // other accesses
    }
Joint work with Amir Kamil and Jimmy Su
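The payoff of the inspector/executor split is fewer network round trips. The Python sketch below is a toy model under stated assumptions: `RemoteArray` merely counts simulated round trips, and `inspect` here returns the sorted set of needed indices, which may differ from the schedule format the Titanium compiler actually builds.

```python
class RemoteArray:
    """Hypothetical remote array that counts simulated network round trips."""

    def __init__(self, values):
        self._values = list(values)
        self.round_trips = 0

    def get(self, i):
        self.round_trips += 1          # one message per element
        return self._values[i]

    def bulk_get(self, indices):
        self.round_trips += 1          # one message for the whole batch
        return [self._values[i] for i in indices]


def naive_loop(remote, b):
    # a[i] = remote[b[i]] with a fine-grained remote read per iteration.
    return [remote.get(j) for j in b]


def inspect(b):
    # Inspector: compute which remote elements the loop will reference.
    return sorted(set(b))


def inspector_executor(remote, b):
    schedule = inspect(b)
    fetched = dict(zip(schedule, remote.bulk_get(schedule)))  # bulk gather
    return [fetched[j] for j in b]                            # executor


b = [4, 0, 4, 2]
remote_naive = RemoteArray([10, 20, 30, 40, 50])
remote_bulk = RemoteArray([10, 20, 30, 40, 50])
a_naive = naive_loop(remote_naive, b)
a_bulk = inspector_executor(remote_bulk, b)
```

Both versions compute the same `a`, but the bulk version issues one round trip instead of four. Because the gather reads `remote` earlier than the original loop would, the transformation is only safe when no conflicting write can intervene, which is why sequential consistency can forbid it.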
Relaxed + SC with 3 Analyses • We tested performance using analyses of varying levels of precision Joint work with Amir Kamil and Jimmy Su