Titanium: A High-Performance Java-Based Language for Effective Parallel Computing

Titanium: A High Performance Java-Based Language
Katherine Yelick, Alex Aiken, Phillip Colella, David Gay, Susan Graham, Paul Hilfinger, Arvind Krishnamurthy, Ben Liblit, Carleton Miyamoto, Geoff Pike, Luigi Semenzato,

Talk Outline
• Motivation
• Extensions for uniprocessor performance
• Extensions for parallelism
• A framework for domain-specific languages
• Status and performance

Programming Challenges on Millennium
• Large scale computations
• Optimized simulation algorithms are complex
• Use of hierarchical parallel machine
• Cost-conscious programming
[Figure labels: minimization algorithms, unstructured meshes, adaptive meshes]

Titanium Approach
• Performance is primary goal
  • High uniprocessor performance
  • Designed for shared and distributed memory
  • Parallelism constructs with programmer control
  • Optimizing compiler for caches, communication scheduling, etc.
• Expressiveness secondary goal
  • Based on safe language: Java
  • Safety simplifies programming and compiler analysis
  • Framework for domain-specific language extensions

New Language Features
• Immutable classes
• Multidimensional arrays
  • also: points and index sets as first-class values
  • multidimensional iterators
• Memory management
  • semi-automated zone-based allocation
• Scalable parallelism
  • SPMD model of execution with global address space
• Language-level synchronization
• Support for grid-based computation

Java Objects
• Primitive scalar types: boolean, double, int, etc.
  • access is fast
• Objects: user-defined and from the standard library
  • accessed through an implicit level of indirection (a pointer to the object)
  • arrays are objects
  • all objects can be checked for equality and support a few other operations
[Figure: example values 3, true, and an object with fields r: 7.1, i: 4.3]

Immutable Classes in Titanium
• For small objects, one would sometimes prefer
  • to avoid the level of indirection
  • to pass by value
  • to extend the idea of primitive values (1, 4.2, etc.) to user-defined values
• Titanium introduces immutable classes
  • all fields are implicitly final
  • cannot inherit from (extend) or be inherited by other classes
  • must have a 0-argument constructor, e.g., Complex()

    immutable class Complex { ... }
    Complex c = new Complex(7.1, 4.3);
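The restrictions listed above can be approximated in standard Java. The following is a minimal sketch, not Titanium code: the method names plus and times are assumptions made for illustration, and a plain Java object is still heap-allocated, so this mirrors only the value-like behavior, not the removal of indirection that Titanium aims for.

```java
// Plain-Java approximation of a Titanium immutable class (illustrative only).
final class Complex {                 // final: cannot be extended
    // Declared final explicitly here; Titanium makes fields final implicitly.
    final double re;
    final double im;

    // Zero-argument constructor, mirroring the requirement on the slide.
    Complex() {
        this(0.0, 0.0);
    }

    Complex(double re, double im) {
        this.re = re;
        this.im = im;
    }

    // Operations return new values instead of mutating the receiver.
    // (plus and times are hypothetical helper names.)
    Complex plus(Complex other) {
        return new Complex(re + other.re, im + other.im);
    }

    Complex times(Complex other) {
        return new Complex(re * other.re - im * other.im,
                           re * other.im + im * other.re);
    }

    public static void main(String[] args) {
        Complex c = new Complex(7.1, 4.3);
        Complex d = c.times(new Complex());   // multiply by zero
        System.out.println(d.re + " " + d.im);
    }
}
```

Because no method can change re or im after construction, a compiler is free to copy such objects instead of sharing them, which is the property the slide relies on.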

Arrays in Java
• Arrays in Java are objects
• Only 1D arrays are directly supported
• Array bounds are checked at run time
• Multidimensional arrays built as arrays of arrays are slow and are not guaranteed to occupy contiguous memory

Titanium Arrays
• Fast, expressive arrays
  • multidimensional
  • lower bound, upper bound, stride
  • concise indexing: A[p] instead of A(i, j, k)
• Points
  • tuple of integers as primitive type
• Domains
  • rectangular sets of points (bounds and stride)
  • arbitrary sets of points
• Multidimensional iterators

Example: Point, RectDomain, Array

    Point<2> lb = [1, 1];
    Point<2> ub = [10, 20];
    RectDomain<2> R = [lb : ub : [2, 2]];
    double [2d] A = new double[R];
    …
    foreach (p in A.domain()) {
        A[p] = B[2 * p];
    }

• Standard optimizations:
  • strength reduction
  • common subexpression elimination
  • invariant code motion
  • removing bounds checks from body
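For readers without a Titanium compiler, the loop above can be imitated in plain Java. This is a rough sketch under stated assumptions: Grid2D and copyScaled are hypothetical names, the array stores its full bounding box in one contiguous block, and the stride of 2 is applied only in the loop rather than in the storage layout.

```java
// Hypothetical helper, not part of Titanium or the Java library:
// a 2D array with arbitrary lower bounds, stored contiguously (row-major).
final class Grid2D {
    final int loX, loY, nx, ny;
    final double[] data;

    // Bounds are inclusive, as in [lb : ub].
    Grid2D(int loX, int loY, int hiX, int hiY) {
        this.loX = loX;
        this.loY = loY;
        this.nx = hiX - loX + 1;
        this.ny = hiY - loY + 1;
        this.data = new double[nx * ny];
    }

    // Maps a point to its offset; callers must stay inside the bounds.
    private int offset(int x, int y) {
        return (x - loX) * ny + (y - loY);
    }

    double get(int x, int y) { return data[offset(x, y)]; }

    void set(int x, int y, double v) { data[offset(x, y)] = v; }

    // Emulates: foreach (p in [[1, 1] : [10, 20] : [2, 2]]) { a[p] = b[2 * p]; }
    static void copyScaled(Grid2D a, Grid2D b) {
        for (int x = 1; x <= 10; x += 2) {
            for (int y = 1; y <= 20; y += 2) {
                a.set(x, y, b.get(2 * x, 2 * y));
            }
        }
    }

    public static void main(String[] args) {
        Grid2D a = new Grid2D(1, 1, 10, 20);
        Grid2D b = new Grid2D(2, 2, 20, 40);   // covers every 2 * p
        b.set(6, 10, 42.0);
        copyScaled(a, b);
        System.out.println(a.get(3, 5));       // prints 42.0
    }
}
```

The hand-written index arithmetic in offset is the kind of code the listed optimizations (strength reduction, invariant code motion) target when the compiler generates it from A[p].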

Memory Management
• Java implemented with garbage collection
• Distributed GC too unpredictable
• Compile-time analysis can improve performance
• Zone-based memory management
  • extends existing model
  • good performance
  • safe
  • easy to use

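Zone-Based Memory Management
• Allocate objects in zones
• Release zones manually

    Zone Z1 = new Zone();
    Zone Z2 = new Zone();
    T x = new(Z1) T();
    T y = new(Z2) T();
    x.field = y;
    x = y;
    delete Z1;
    delete Z2;  // error (x and y still refer to the object in Z2)

[Figure: zone Z1 holds the object first assigned to x; zone Z2 holds the object assigned to y]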

Sequential Performance Times in seconds (lower is better).

Model of Parallelism
• Single Program, Multiple Data
  • fixed number of processes
  • each process has own local data
  • global synchronization (barrier)
[Figure: n processes start, pass through a sequence of barriers, and end]
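The SPMD model can be imitated on a single JVM with threads standing in for processes. The sketch below is an analogy only and makes several assumptions: SpmdSketch and run are hypothetical names, java.util.concurrent.CyclicBarrier plays the role of Titanium's barrier, and all "processes" share one heap, which is not true of a distributed-memory run.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Threads emulate a fixed number of SPMD processes running the same body.
final class SpmdSketch {
    // Returns, for each process, the sum it computed in phase 2.
    static long[] run(int n) {
        final CyclicBarrier barrier = new CyclicBarrier(n);
        final long[] partial = new long[n];   // one slot per process
        final long[] total = new long[n];
        Thread[] procs = new Thread[n];
        for (int i = 0; i < n; i++) {
            final int me = i;                 // this process's rank
            procs[i] = new Thread(() -> {
                try {
                    partial[me] = (me + 1) * 10L;   // phase 1: local work
                    barrier.await();                // everyone finishes phase 1
                    long sum = 0;
                    for (long v : partial) {        // phase 2: read others' data
                        sum += v;
                    }
                    total[me] = sum;
                    barrier.await();                // everyone finishes phase 2
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            procs[i].start();
        }
        for (Thread t : procs) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(run(4)[0]);   // prints 100
    }
}
```

Every thread executes the same two barrier calls in the same order, which is the discipline the SPMD model expects of all processes.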

Global Address Space
• Each process has its own heap
• References can span process boundaries

    class T { … }
    T gv;
    T lv = null;
    if (thisProc() == 0) {
        lv = new T();  // allocate locally
    }
    gv = broadcast lv from 0;  // distribute
    …
    gv.field ...

[Figure: process 0 and the other processes each have a local heap; every process holds a local reference lv and a global reference gv]
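The broadcast step can be mimicked with Java threads sharing one slot. This is only a sketch of the idea, with assumed names (BroadcastSketch, slot, run): process 0 allocates the object and publishes it, a barrier separates publishing from reading, and afterwards every process holds a reference to the same object. On one JVM every reference is effectively local, so the cost difference between global and local references does not appear here.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicReference;

// Emulates "gv = broadcast lv from 0" with threads standing in for processes.
final class BroadcastSketch {
    static final class T {
        int field;
    }

    // Returns the value of gv.field as seen by each process.
    static int[] run(int n) {
        final CyclicBarrier barrier = new CyclicBarrier(n);
        final AtomicReference<T> slot = new AtomicReference<>();
        final int[] seen = new int[n];
        Thread[] procs = new Thread[n];
        for (int i = 0; i < n; i++) {
            final int me = i;
            procs[i] = new Thread(() -> {
                try {
                    T lv = null;
                    if (me == 0) {
                        lv = new T();         // allocate on process 0 only
                        lv.field = 17;
                        slot.set(lv);         // publish the reference
                    }
                    barrier.await();          // wait until it is published
                    T gv = slot.get();        // every process gets the same object
                    seen[me] = gv.field;
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            procs[i].start();
        }
        for (Thread t : procs) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run(3)[2]);   // prints 17
    }
}
```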

Global vs. Local References
• Global references may be slow
  • distributed memory: overhead of a few instructions when using a global reference to access a local object
  • shared memory: no performance implications
• Solution: use local qualifier
  • statically restrict references to local objects
  • example: T local lv = null;
  • use only in performance-critical sections of code

Global Synchronization Analysis
• In Titanium, processes must synchronize at the same textual instances of barrier()

    doThis();
    barrier();
    boolean x = someCondition();
    if (x) {
        doThat();
        barrier();
    }
    doSomeMore();
    barrier();

Global Synchronization Analysis
• In Titanium, processes must synchronize at the same textual instances of barrier()
• Singleness analysis statically guarantees correctness by restricting the values of variables that control program flow

    doThis();
    barrier();
    boolean single x = someCondition();
    if (x) {
        doThat();
        barrier();
    }
    doSomeMore();
    barrier();
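Java has no counterpart to the single qualifier, so the sketch below swaps the static analysis for a run-time agreement check. It is an illustration of the property being protected, not of how the Titanium compiler works; SingleCheckSketch, votes, and run are assumed names. Each thread records its value of x, all threads meet at an unconditional barrier, and the conditional barrier is entered only if every thread computed the same x.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Run-time stand-in for what "single" guarantees at compile time.
final class SingleCheckSketch {
    // conditions[i] is the value of x computed by process i.
    // Returns how many processes executed the conditional barrier.
    static int run(final boolean[] conditions) {
        final int n = conditions.length;
        final CyclicBarrier barrier = new CyclicBarrier(n);
        final boolean[] votes = new boolean[n];
        final AtomicInteger entered = new AtomicInteger();
        Thread[] procs = new Thread[n];
        for (int i = 0; i < n; i++) {
            final int me = i;
            procs[i] = new Thread(() -> {
                try {
                    boolean x = conditions[me];
                    votes[me] = x;
                    barrier.await();               // unconditional: all arrive
                    boolean agreed = true;
                    for (boolean v : votes) {
                        agreed &= (v == x);        // same verdict on every thread
                    }
                    if (agreed && x) {             // all threads branch together
                        entered.incrementAndGet();
                        barrier.await();           // conditional barrier is now safe
                    }
                    barrier.await();               // unconditional
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            procs[i].start();
        }
        for (Thread t : procs) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return entered.get();
    }

    public static void main(String[] args) {
        System.out.println(run(new boolean[] {true, true, true}));    // prints 3
        System.out.println(run(new boolean[] {true, false, true}));   // prints 0
    }
}
```

Without the agreement check, the mixed case would leave some threads waiting at the conditional barrier and others at the final one. Titanium rules that situation out before the program runs; this sketch only detects it while running.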

Support for Grid-Based Computation

    Point<2> lb = [0, 0];
    Point<2> ub = [6, 4];
    RectDomain<2> R = [lb : ub : [2, 2]];
    …
    Domain<2> red = R + (R + [1, 1]);
    foreach (p in red) {
        …
    }

[Figure: Gauss-Seidel relaxation with red-black ordering. R spans (0, 0) to (6, 4); R + [1, 1] spans (1, 1) to (7, 5); red spans (0, 0) to (7, 5)]
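To see which points the red domain contains, the domain arithmetic can be spelled out in plain Java. This is a sketch with assumed names (RedBlackSketch, rect, red); points are represented as two-element lists so that set union removes duplicates by value. It reads R + [1, 1] as a shift of every point and the outer + as set union, which matches the bounds given for the figure.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Enumerates red = R + (R + [1, 1]) for R = [[0, 0] : [6, 4] : [2, 2]].
final class RedBlackSketch {
    // Points of the strided rectangle [lo : hi : stride], shifted by (dx, dy).
    static Set<List<Integer>> rect(int loX, int loY, int hiX, int hiY,
                                   int stride, int dx, int dy) {
        Set<List<Integer>> pts = new LinkedHashSet<>();
        for (int x = loX; x <= hiX; x += stride) {
            for (int y = loY; y <= hiY; y += stride) {
                pts.add(Arrays.asList(x + dx, y + dy));
            }
        }
        return pts;
    }

    // Union of R and R shifted by [1, 1].
    static Set<List<Integer>> red() {
        Set<List<Integer>> result = rect(0, 0, 6, 4, 2, 0, 0);
        result.addAll(rect(0, 0, 6, 4, 2, 1, 1));
        return result;
    }

    public static void main(String[] args) {
        Set<List<Integer>> red = red();
        System.out.println(red.size());   // prints 24
        for (List<Integer> p : red) {
            // Every red point has an even coordinate sum.
            System.out.println(p + " parity " + ((p.get(0) + p.get(1)) % 2));
        }
    }
}
```

All 24 points have an even coordinate sum, so they form one color of a checkerboard; the remaining points of the bounding box would form the black set.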

Implementation
• Strategy
  • compile Titanium into C (currently C++)
  • Posix threads for SMPs (currently Solaris threads)
  • Lightweight Active Messages for communication
• Status
  • runs on Sun Enterprise 8-way SMP
  • runs on Berkeley NOW
  • trivial ports to half a dozen other architectures
  • tuning for sequential performance

Titanium Status
• Titanium language definition complete.
• Titanium compiler running.
• Compiles for uniprocessors, NOW; others soon.
• Application developments ongoing.
• Many research opportunities.

Parallel Performance
• Numbers from UltraSPARC SMP
• Parallel efficiency good
• EM3D (unstructured kernel)
• 3D AMR limited by algorithm
[Figure: speedup versus number of processors]

Future Directions
• Use of framework for domain-specific languages
  • Fluids and AMR done
  • Unstructured meshes and sparse solvers
• Better programming tools
  • debuggers, performance analysis
• Optimizations
  • analysis of parallel code and synchronization done
  • optimizations for caches on uniprocessors and SMPs underway
  • load balancing on clusters of SMPs

Conclusions
• Performance
  • sequential performance consistently close to C/FORTRAN
    • currently: 80% slower to 25% faster
  • parallel efficiency very high
• Expressiveness
  • safety of Java with small set of performance features
  • extensible to new application domains
• Portability, compatibility, etc.
  • no gratuitous departures from Java standard
  • compilation model easily supports new platforms
