Titanium: A High-Performance Java-Based Language for Effective Parallel Computing
Titanium is a high-performance parallel programming language built on Java for safety and expressiveness; it also serves as a framework for optimized domain-specific language extensions. It adds features such as immutable classes, multidimensional arrays, and semi-automated zone-based memory management. The language targets efficient uniprocessor and parallel performance through programmer-controlled parallelism constructs and an optimizing compiler. Its scalable SPMD execution model with a global address space addresses the challenges of large-scale computations while keeping programs expressive.
Presentation Transcript
Titanium: A High Performance Java-Based Language Katherine Yelick Alex Aiken, Phillip Colella, David Gay, Susan Graham, Paul Hilfinger, Arvind Krishnamurthy, Ben Liblit, Carleton Miyamoto, Geoff Pike, Luigi Semenzato,
Talk Outline • Motivation • Extensions for uniprocessor performance • Extensions for parallelism • A framework for domain-specific languages • Status and performance
Programming Challenges on Millennium • Large-scale computations • Optimized simulation algorithms are complex • Use of hierarchical parallel machine • Cost-conscious programming [Figure labels: minimization algorithms, unstructured meshes, adaptive meshes]
Titanium Approach • Performance is primary goal • High uniprocessor performance • Designed for shared and distributed memory • Parallelism constructs with programmer control • Optimizing compiler for caches, communication scheduling, etc. • Expressiveness secondary goal • Based on safe language: Java • Safety simplifies programming and compiler analysis • Framework for domain-specific language extensions
New Language Features • Immutable classes • Multidimensional arrays • also: points and index sets as first-class values • multidimensional iterators • Memory management • semi-automated zone-based allocation • Scalable parallelism • SPMD model of execution with global address space • Language-level synchronization • Support for grid-based computation
Java Objects • Primitive scalar types: boolean, double, int, etc. • access is fast • Objects: user-defined and from the standard library • have an implicit level of indirection (accessed through a pointer) • arrays are objects • all objects support equality checks and a few other operations [Figure labels: 3, true, r: 7.1, i: 4.3]
Immutable Classes in Titanium • For small objects, one would sometimes prefer • to avoid the level of indirection • to pass by value • this extends the idea of primitive values (1, 4.2, etc.) to user-defined values • Titanium introduces immutable classes • all fields are implicitly final • cannot extend or be extended by other classes • must have a 0-argument constructor, e.g., Complex() • Example: immutable class Complex { ... } Complex c = new Complex(7.1, 4.3);
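The immutable-class idea can be approximated in plain Java as a final class with final fields. The sketch below is standard Java, not Titanium syntax, and the names ComplexValue and ComplexDemo are made up for illustration. Unlike a Titanium immutable class, a Java object like this is still reached through a reference, so the sketch shows only the value semantics, not the removal of indirection.

```java
import java.util.Objects;

// Plain-Java approximation of a Titanium immutable class (assumption: names are illustrative).
final class ComplexValue {              // final: cannot be extended
    private final double re;            // fields are final, as on the slide
    private final double im;

    // Zero-argument constructor, mirroring the requirement stated on the slide.
    ComplexValue() { this(0.0, 0.0); }

    ComplexValue(double re, double im) { this.re = re; this.im = im; }

    double re() { return re; }
    double im() { return im; }

    // Operations return new values instead of mutating the receiver.
    ComplexValue plus(ComplexValue o) {
        return new ComplexValue(re + o.re, im + o.im);
    }

    ComplexValue times(ComplexValue o) {
        return new ComplexValue(re * o.re - im * o.im, re * o.im + im * o.re);
    }

    // Equality compares field values, as it would for a primitive.
    @Override public boolean equals(Object other) {
        if (!(other instanceof ComplexValue)) return false;
        ComplexValue c = (ComplexValue) other;
        return Double.compare(re, c.re) == 0 && Double.compare(im, c.im) == 0;
    }

    @Override public int hashCode() { return Objects.hash(re, im); }

    @Override public String toString() { return re + " + " + im + "i"; }
}

class ComplexDemo {
    public static void main(String[] args) {
        ComplexValue c = new ComplexValue(7.1, 4.3);
        System.out.println(c.plus(new ComplexValue()));
    }
}
```

Because no method mutates an existing ComplexValue, instances can be shared or copied freely, which is the property that lets a compiler treat them like primitive values.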
Arrays in Java • Arrays in Java are objects • Only 1D arrays are directly supported • Array bounds are checked (as in Fortran) • Multidimensional arrays built as arrays of arrays are slow and cannot be transformed into contiguous storage
Titanium Arrays • Fast, expressive arrays • multidimensional • lower bound, upper bound, stride • concise indexing: A[p] instead of A(i, j, k) • Points • tuple of integers as primitive type • Domains • rectangular sets of points (bounds and stride) • arbitrary sets of points • Multidimensional iterators
Example: Point, RectDomain, Array
Point<2> lb = [1, 1];
Point<2> ub = [10, 20];
RectDomain<2> R = [lb : ub : [2, 2]];
double [2d] A = new double[R];
…
foreach (p in A.domain()) {
  A[p] = B[2 * p];
}
• Standard optimizations: • strength reduction • common subexpression elimination • invariant code motion • removing bounds checks from body
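To make the domain and array example concrete, here is a minimal plain-Java sketch of a strided 2D rectangular domain backed by one contiguous array. The classes RectDomain2D, Grid2D, and GridDemo are hypothetical and not part of Titanium or any library, and the sketch assumes that the upper bound of [lb : ub : stride] is inclusive. The loop in copyScaled mirrors foreach (p in A.domain()) { A[p] = B[2 * p]; }.

```java
// Hypothetical plain-Java model of a strided rectangular domain (inclusive upper bound assumed).
final class RectDomain2D {
    final int lo0, lo1, hi0, hi1, s0, s1;

    RectDomain2D(int lo0, int lo1, int hi0, int hi1, int s0, int s1) {
        if (s0 <= 0 || s1 <= 0) throw new IllegalArgumentException("stride must be positive");
        this.lo0 = lo0; this.lo1 = lo1; this.hi0 = hi0; this.hi1 = hi1; this.s0 = s0; this.s1 = s1;
    }

    int extent0() { return hi0 < lo0 ? 0 : (hi0 - lo0) / s0 + 1; }
    int extent1() { return hi1 < lo1 ? 0 : (hi1 - lo1) / s1 + 1; }
    int size()    { return extent0() * extent1(); }

    boolean contains(int i, int j) {
        return i >= lo0 && i <= hi0 && j >= lo1 && j <= hi1
            && (i - lo0) % s0 == 0 && (j - lo1) % s1 == 0;
    }
}

// A 2D array over a domain, stored contiguously in row-major order.
final class Grid2D {
    final RectDomain2D domain;
    private final double[] data;

    Grid2D(RectDomain2D domain) {
        this.domain = domain;
        this.data = new double[domain.size()];
    }

    // Maps a point of the domain to its offset in the flat array, with a bounds check.
    private int offset(int i, int j) {
        if (!domain.contains(i, j)) {
            throw new IndexOutOfBoundsException("(" + i + ", " + j + ")");
        }
        return ((i - domain.lo0) / domain.s0) * domain.extent1() + (j - domain.lo1) / domain.s1;
    }

    double get(int i, int j) { return data[offset(i, j)]; }
    void set(int i, int j, double v) { data[offset(i, j)] = v; }
}

class GridDemo {
    // Equivalent of: foreach (p in A.domain()) { A[p] = B[2 * p]; }
    static void copyScaled(Grid2D a, Grid2D b) {
        RectDomain2D r = a.domain;
        for (int i = r.lo0; i <= r.hi0; i += r.s0) {
            for (int j = r.lo1; j <= r.hi1; j += r.s1) {
                a.set(i, j, b.get(2 * i, 2 * j));
            }
        }
    }

    public static void main(String[] args) {
        Grid2D a = new Grid2D(new RectDomain2D(1, 1, 10, 20, 2, 2));
        Grid2D b = new Grid2D(new RectDomain2D(0, 0, 20, 40, 1, 1));
        b.set(6, 10, 42.0);
        copyScaled(a, b);
        System.out.println(a.get(3, 5));
    }
}
```

The per-access bounds check and index arithmetic in offset are the kind of work that the optimizations listed on the slide (strength reduction, invariant code motion, bounds-check removal) aim to hoist out of the loop body.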
Memory Management • Java implemented with garbage collection • Distributed GC too unpredictable • Compile-time analysis can improve performance • Zone-based memory management • extends existing model • good performance • safe • easy to use
Zone-Based Memory Management • Allocate objects in zones • Release zones manually
Zone Z1 = new Zone();
Zone Z2 = new Zone();
T x = new(Z1) T();
T y = new(Z2) T();
x.field = y;
x = y;
delete Z1;
delete Z2; // error
[Figure: zones Z1 and Z2, with references x and y]
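One way to see why the second delete is an error is to count the references that point into each zone. The plain-Java sketch below is only a model under that assumption: the slide does not say how Titanium detects the error, and the Zone class here, with its retain, release, and pointInto methods, is invented for illustration. Java itself has no manual deallocation, so "deleting" a zone just marks it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reference-counting model of zones; not Titanium's actual mechanism.
final class Zone {
    private final String name;
    private int incoming = 0;                          // references into this zone from outside it
    private boolean deleted = false;
    private final List<Zone> outgoing = new ArrayList<>();  // zones that objects in this zone point into

    Zone(String name) { this.name = name; }

    void retain()  { incoming++; }
    void release() { incoming--; }

    // Records that an object in this zone stores a pointer to an object in 'target'.
    void pointInto(Zone target) {
        if (target != this) {
            outgoing.add(target);
            target.retain();
        }
    }

    // Deleting is only allowed when nothing outside the zone still refers to it.
    void delete() {
        if (deleted) throw new IllegalStateException(name + " already deleted");
        if (incoming > 0) {
            throw new IllegalStateException(name + " still has " + incoming + " incoming reference(s)");
        }
        for (Zone z : outgoing) z.release();           // pointers stored in this zone disappear with it
        outgoing.clear();
        deleted = true;
    }

    boolean isDeleted() { return deleted; }
    int incoming() { return incoming; }
}

class ZoneDemo {
    static boolean tryDelete(Zone z) {
        try { z.delete(); return true; } catch (IllegalStateException e) { return false; }
    }

    public static void main(String[] args) {
        Zone z1 = new Zone("Z1"), z2 = new Zone("Z2");
        z1.retain();                 // T x = new(Z1) T();   variable x points into Z1
        z2.retain();                 // T y = new(Z2) T();   variable y points into Z2
        z1.pointInto(z2);            // x.field = y;         an object in Z1 points into Z2
        z1.release(); z2.retain();   // x = y;               x now points into Z2
        System.out.println(tryDelete(z1));   // true: nothing refers into Z1 any more
        System.out.println(tryDelete(z2));   // false: x and y still refer into Z2
    }
}
```

In this model the bookkeeping calls are written by hand next to the statement they stand for; a real implementation would have the compiler or runtime maintain the counts.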
Sequential Performance Times in seconds (lower is better).
Model of Parallelism • Single Program, Multiple Data • fixed number of processes • each process has own local data • global synchronization (barrier) [Figure: n processes start, pass through a sequence of barriers, and end]
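The SPMD pattern above can be imitated with Java threads and java.util.concurrent.CyclicBarrier. This is a sketch of the execution model only, not of how Titanium is implemented: the class SpmdDemo and its run method are illustrative names, and the Java threads share one heap, whereas each Titanium process has its own local data.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

class SpmdDemo {
    // Starts n threads that all execute the same body; returns the total each one computed.
    static long[] run(int n) {
        final long[] partial = new long[n];    // slot p is written only by "process" p
        final long[] totals = new long[n];
        final CyclicBarrier barrier = new CyclicBarrier(n);
        Thread[] procs = new Thread[n];

        for (int p = 0; p < n; p++) {
            final int me = p;
            procs[p] = new Thread(() -> {
                try {
                    partial[me] = (me + 1) * 10L;       // phase 1: purely local work
                    barrier.await();                    // everyone finishes phase 1 first
                    long sum = 0;
                    for (long v : partial) sum += v;    // phase 2: read the others' results
                    totals[me] = sum;
                    barrier.await();                    // end-of-phase barrier
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            procs[p].start();
        }

        for (Thread t : procs) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        for (long t : run(4)) System.out.println(t);
    }
}
```

Every thread reaches the same two barrier calls in the same order, which is the property the later slides on synchronization analysis are concerned with.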
Global Address Space • Each process has its own heap • References can span process boundaries
class T { … }
T gv;
T lv = null;
if (thisProc() == 0) {
  lv = new T(); // allocate locally
}
gv = broadcast lv from 0; // distribute
… gv.field ...
[Figure: local heaps of process 0 and the other processes, with local references (lv) and global references (gv)]
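The broadcast step can be mimicked in plain Java by having thread 0 publish a reference that every other thread reads after a barrier. This is a rough analogy under stated assumptions: BroadcastDemo, its nested class T, and the value 42 are invented for the sketch, and Java threads already share a single heap, so the sketch cannot show the cost difference between local and global references.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicReference;

class BroadcastDemo {
    static final class T { int field; }

    // Each of n threads ends up holding a reference to the object allocated by thread 0;
    // the returned array records the field value each thread observed through it.
    static int[] run(int n) {
        final AtomicReference<T> slot = new AtomicReference<>();
        final CyclicBarrier barrier = new CyclicBarrier(n);
        final int[] seen = new int[n];
        Thread[] procs = new Thread[n];

        for (int p = 0; p < n; p++) {
            final int me = p;
            procs[p] = new Thread(() -> {
                try {
                    T lv = null;
                    if (me == 0) {              // if (thisProc() == 0)
                        lv = new T();           // allocate "locally"
                        lv.field = 42;
                        slot.set(lv);           // publish for the others
                    }
                    barrier.await();            // stands in for the broadcast's synchronization
                    T gv = slot.get();          // gv = broadcast lv from 0
                    seen[me] = gv.field;        // access through the shared reference
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            procs[p].start();
        }

        for (Thread t : procs) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        for (int v : run(3)) System.out.println(v);
    }
}
```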
Global vs. Local References • Global references may be slow • distributed memory: overhead of a few instructions when using a global reference to access a local object • shared memory: no performance implications • Solution: use local qualifier • statically restrict references to local objects • example: T local lv = null; • use only in critical sections
Global Synchronization Analysis • In Titanium, processes must synchronize at the same textual instances of barrier()
doThis();
barrier();
boolean x = someCondition();
if (x) {
  doThat();
  barrier();
}
doSomeMore();
barrier();
Global Synchronization Analysis • In Titanium, processes must synchronize at the same textual instances of barrier() • Singleness analysis statically guarantees correctness by restricting the values of variables that control program flow
doThis();
barrier();
boolean single x = someCondition();
if (x) {
  doThat();
  barrier();
}
doSomeMore();
barrier();
Support for Grid-Based Computation
Point<2> lb = [0, 0];
Point<2> ub = [6, 4];
RectDomain<2> R = [lb : ub : [2, 2]];
…
Domain<2> red = R + (R + [1, 1]);
foreach (p in red) {
  …
}
Gauss-Seidel relaxation with red-black ordering
[Figure: R spans (0, 0) to (6, 4); R + [1, 1] spans (1, 1) to (7, 5); red spans (0, 0) to (7, 5)]
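The construction of the red domain can be checked by enumerating its points. The plain-Java sketch below models a domain as a list of points; RedBlackDemo and its helper methods are hypothetical names, and an inclusive upper bound is assumed for [lb : ub : stride]. It reads R + [1, 1] as a translation and R + (...) as a union, as the figure suggests.

```java
import java.util.ArrayList;
import java.util.List;

class RedBlackDemo {
    // Points of the strided rectangle [lo : hi : stride], upper bound inclusive (assumed).
    static List<int[]> rect(int lo0, int lo1, int hi0, int hi1, int s0, int s1) {
        List<int[]> pts = new ArrayList<>();
        for (int i = lo0; i <= hi0; i += s0) {
            for (int j = lo1; j <= hi1; j += s1) {
                pts.add(new int[] {i, j});
            }
        }
        return pts;
    }

    // Domain + Point: translate every point by (d0, d1).
    static List<int[]> shift(List<int[]> d, int d0, int d1) {
        List<int[]> out = new ArrayList<>();
        for (int[] p : d) out.add(new int[] {p[0] + d0, p[1] + d1});
        return out;
    }

    static boolean contains(List<int[]> d, int i, int j) {
        for (int[] p : d) {
            if (p[0] == i && p[1] == j) return true;
        }
        return false;
    }

    // Domain + Domain: union without duplicates.
    static List<int[]> union(List<int[]> a, List<int[]> b) {
        List<int[]> out = new ArrayList<>(a);
        for (int[] p : b) {
            if (!contains(out, p[0], p[1])) out.add(p);
        }
        return out;
    }

    // red = R + (R + [1, 1]) with R = [[0, 0] : [6, 4] : [2, 2]]
    static List<int[]> red() {
        List<int[]> r = rect(0, 0, 6, 4, 2, 2);
        return union(r, shift(r, 1, 1));
    }

    public static void main(String[] args) {
        for (int[] p : red()) System.out.println("(" + p[0] + ", " + p[1] + ")");
    }
}
```

Every point in the result has an even coordinate sum, which is the checkerboard pattern that red-black ordering relies on: updating a red point only reads black neighbours.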
Implementation • Strategy • compile Titanium into C (currently C++) • POSIX threads for SMPs (currently Solaris threads) • Lightweight Active Messages for communication • Status • runs on Sun Enterprise 8-way SMP • runs on Berkeley NOW • trivial ports to half a dozen other architectures • tuning for sequential performance
Titanium Status • Titanium language definition complete. • Titanium compiler running. • Compiles for uniprocessors, NOW; others soon. • Application developments ongoing. • Many research opportunities.
Parallel Performance • Numbers from UltraSPARC SMP • Parallel efficiency good • EM3D (unstructured kernel) • 3D AMR limited by algorithm [Plot: speedup versus number of processors]
Future Directions • Use of framework for domain-specific languages • Fluids and AMR done • Unstructured meshes and sparse solvers • Better programming tools • debuggers, performance analysis • Optimizations • analysis of parallel code and synchronization done • optimizations for caches on uniprocessors and SMPs underway • load balancing on clusters of SMPs
Conclusions • Performance • sequential performance consistently close to C/FORTRAN • currently: 80% slower to 25% faster • sequential efficiency very high • Expressiveness • safety of Java with small set of performance features • extensible to new application domains • Portability, compatibility, etc. • no gratuitous departures from Java standard • compilation model easily supports new platforms