
Cluster Computing with Java Threads



Presentation Transcript


  1. Cluster Computing with Java Threads Philip J. Hatcher University of New Hampshire Philip.Hatcher@unh.edu

  2. Collaborators • UNH/Hyperion • Mark MacBeth and Keith McGuigan • ENS-Lyon/DSM-PM2 • Gabriel Antoniu, Luc Bougé and Raymond Namyst

  3. Focus • Use Java “as is” for high-performance computing • support computationally intensive applications • utilize parallel computing hardware

  4. Outline • Our Vision • Java Threads • The PM2 Run-time Environment • Hyperion: Java Threads on Clusters • Evaluation • Related Work • Conclusions

  5. Why Java? • Soon to be ubiquitous! • use of Java is growing very rapidly • Designed for portability: • develop programs on your desktop • run programs on a distant cluster

  6. Why Java? • Explicitly parallel! • includes a threaded programming model • Relaxed memory model • consistency model aids an implementation on distributed-memory parallel computers

  7. Unique Opportunity • Use Java to bring parallelism to the “masses” • Let’s not miss it! • But, programmers will not accept syntax or model changes

  8. Open Question • Parallelism via Java access to distributed-computing techniques? • e.g. RMI (remote method invocation) • Or, parallelism via Java threads?

  9. That is, ... • Does a user prefer to view a cluster as a collection of distinct machines? • Or, does a user prefer to view a cluster as a “black box” that will simply run Java code faster?

  10. Are you “in a box”?

  11. Or, are you “thinking outside of the box”?

  12. Climb out of the box! • Use Java threads “as is” to program clusters of computers. • Program for the threaded Java virtual machine. • Allow the implementation to handle the details of executing in a cluster.

  13. Java Threads • Threads are objects. • The class java/lang/Thread contains all of the methods for initializing, running, suspending, querying and destroying threads.

  14. java/lang/Thread methods • Thread() - constructor for thread object. • start() - start the thread executing. • run() - method invoked by ‘start’. • stop(), suspend(), resume(), join(), yield(). • setPriority().
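For illustration, a minimal sketch of this lifecycle, assuming a Worker class invented for the example:

    // Worker overrides run(), the method the new thread executes
    // once start() is called.
    class Worker extends Thread {
        private final int id;
        Worker(int id) { this.id = id; }
        public void run() {
            System.out.println("worker " + id + " running");
        }
    }

    public class ThreadDemo {
        public static void main(String[] args) throws InterruptedException {
            Worker w = new Worker(0);  // Thread() constructor
            w.start();                 // begin concurrent execution of run()
            w.join();                  // wait for the thread to terminate
        }
    }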

  15. Java Synchronization • Java uses monitors, which protect a region of code by allowing only one thread at a time to execute it. • Monitors utilize locks. • There is a lock associated with each object.

  16. synchronized keyword • Statement form: synchronized ( Exp ) Block • Method form: public class Q { synchronized void put(…) { … } }
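A minimal sketch showing both forms (the Counter class is invented for illustration):

    public class Counter {
        private int count = 0;

        // Method form: the monitor of 'this' is held for the whole body.
        synchronized void increment() {
            count++;
        }

        // Statement form: the monitor of an explicit expression is held
        // only for the duration of the block.
        int read() {
            synchronized (this) {
                return count;
            }
        }
    }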

  17. java/lang/Object methods • wait() - the calling thread, which must hold the lock for the object, is placed in a wait set associated with the object. The lock is then released. • notify() - an arbitrary thread in the wait set of this object is awakened and then competes again for the object’s lock. • notifyAll() - all waiting threads are awakened.
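A hedged sketch of the classic bounded-buffer use of these methods (the Buffer class is invented; waiting inside a loop guards against waking while the condition still fails):

    import java.util.LinkedList;
    import java.util.Queue;

    public class Buffer {
        private final Queue<Integer> items = new LinkedList<>();
        private final int capacity = 4;

        public synchronized void put(int x) throws InterruptedException {
            while (items.size() == capacity) {
                wait();          // release the lock and enter the wait set
            }
            items.add(x);
            notifyAll();         // wake any threads waiting in take()
        }

        public synchronized int take() throws InterruptedException {
            while (items.isEmpty()) {
                wait();
            }
            int x = items.remove();
            notifyAll();         // wake any threads waiting in put()
            return x;
        }
    }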

  18. Shared-Memory Model • Java threads execute in a virtual shared memory. • All threads are able to access all objects. • But threads may not access each other’s stacks.

  19. Java Memory Consistency • A variant of release consistency. • Threads can keep locally cached copies of objects. • Consistency is provided by requiring that: • a thread's object cache be flushed upon entry to a monitor. • local modifications made to cached objects be transmitted to the central memory when a thread exits a monitor.
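As a hedged illustration of what this model implies (class and field names invented): without monitors, a reader may see a stale cached copy of an object; entering and exiting monitors forces the flush and write-back described above.

    public class Shared {
        private int data = 0;

        // Exiting the monitor transmits the local modification
        // to the central memory.
        synchronized void writer() {
            data = 42;
        }

        // Entering the monitor flushes this thread's object cache,
        // so the read observes the committed value.
        synchronized int reader() {
            return data;
        }
    }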

  20. PM2: A Distributed, Multithreaded Runtime Environment • Thread library: Marcel • user-level • supports SMP • POSIX-like • preemptive thread migration • Communication library: Madeleine • portable: BIP, SISCI/SCI, MPI, TCP, PVM • efficient

  21. DSM-PM2: Architecture • Layers, top to bottom: DSM Protocol Policy / DSM Protocol lib / DSM Page Manager + DSM Comm, over PM2 (Madeleine Comms, Marcel Threads). • DSM comm: • send page request • send page • send invalidate request • … • DSM page manager: • set/get page owner • set/get page access • add/remove to/from copyset • ...

  22. DSM-PM2: Performance (graphs omitted) • SCI cluster: 450 MHz Pentium II nodes • Myrinet cluster: 200 MHz Pentium Pro nodes

  23. Hyperion • Executes threaded Java programs on clusters. • Built on top of PM2 and DSM-PM2. • Provides both portability and efficiency.

  24. Reversing the Bytecode Stream • Conventionally, users “pull” bytecode to their machines for local execution. • Our vision: • users develop their high-performance Java programs using the Java toolset on their desktop. • they then “push” the resulting bytecode to a Hyperion server for high-performance cycles.

  25. Supporting High Performance • Utilizes a bytecode-to-C translator. • Parallel execution via spreading of Java threads across nodes of the cluster. • Java threads implemented as lightweight threads using PM2 library.

  26. Compiling Java • Hyperion designed for computationally intensive applications, so small overhead of translating bytecode is not important. • Translating to C allows us to leverage the native C compiler and optimizer.

  27. General Hyperion Overview • prog.java → prog.class (bytecode), via javac (Sun's Java compiler) • prog.class → prog.[ch], via java2c (instruction-wise translation) • prog.[ch] → prog, via gcc -O6, linked with the runtime libraries

  28. The Hyperion Run-Time System • Collection of modules to allow “plug-and-play” implementations: • inter-node communication • threads • memory and synchronization • etc.

  29. Hyperion Internal Structure • Hyperion layer: load balancer, native Java API, thread subsystem, memory subsystem, communication subsystem • PM2 API: pm2_rpc, pm2_thread_create, etc. • PM2 layer: DSM subsystem, thread subsystem, communication subsystem

  30. Thread and Object Allocation • Currently, threads are allocated to processors in round-robin fashion. • Currently, an object is allocated to the processor that holds the thread that is creating the object. • Currently, DSM-PM2 is used to implement the Java memory model.
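A hedged sketch of such a round-robin policy (the placer class is invented; Hyperion's actual allocator is internal to the runtime):

    import java.util.concurrent.atomic.AtomicInteger;

    // Thread i is placed on node i mod N.
    public class RoundRobinPlacer {
        private final int numNodes;
        private final AtomicInteger next = new AtomicInteger(0);

        public RoundRobinPlacer(int numNodes) {
            this.numNodes = numNodes;
        }

        public int placeThread() {
            return next.getAndIncrement() % numNodes;
        }
    }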

  31. Hyperion’s DSM API • loadIntoCache • invalidateCache • updateMainMemory • get • put
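These primitives belong to the C run-time system; purely as an illustration of where they fit, here is a hedged Java-flavored sketch (all signatures invented) of the roles a translated program would exercise:

    // Invented signatures, for illustration only; the real primitives
    // are C functions inside the Hyperion runtime.
    interface DsmApi {
        void loadIntoCache(Object ref);      // make the object locally accessible
        void invalidateCache();              // on monitor entry: drop cached copies
        void updateMainMemory();             // on monitor exit: send logged mods home
        int  get(Object ref, int offset);    // read a field, checking locality
        void put(Object ref, int offset, int value);  // write, logging if remote
    }

    // A translated read of obj.f might conceptually become:
    //   dsm.loadIntoCache(obj);
    //   int x = dsm.get(obj, OFFSET_OF_F);  // OFFSET_OF_F is hypothetical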

  32. DSM Implementation • Node-level caches. • Page-based and home-based protocol. • Log mods made to remote objects. • Use explicit in-line checks in get/put. • Each node allocates objects from a different range of the virtual address space.

  33. Details • Objects are aligned on 64-byte boundaries. • An object reference is the address of the base of the object. • The bottom 6 bits of the ref can be used to store the node number of the object’s home.
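A hedged sketch of this encoding, treating references as raw long addresses (which real Java code cannot do; the class and names are invented):

    public class RefEncoding {
        static final int HOME_BITS = 6;                      // freed by 64-byte alignment
        static final long HOME_MASK = (1L << HOME_BITS) - 1; // 0x3F

        // Pack a home-node number into the low 6 bits of an aligned base.
        static long encode(long alignedBase, int homeNode) {
            return alignedBase | (homeNode & HOME_MASK);
        }

        // Recover the home node and the true object address.
        static int homeNode(long ref) {
            return (int) (ref & HOME_MASK);
        }

        static long baseAddress(long ref) {
            return ref & ~HOME_MASK;
        }
    }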

  34. More details • loadIntoCache checks the 6 bits to see if an object is remote. • If so, and if not already locally cached, DSM-PM2 is used to load the page(s) containing the object. • When a remote object is cached, a bit is turned on in its header.

  35. Yet more details • The put primitive checks the header bit to see if a modification should be logged. • updateMainMemory sends the logged changes to the home node.

  36. Evaluation • Minimal-cost map-coloring application. • Branch-and-bound algorithm. • 64 threads, each with its own priority queue. • Current best solution is shared (see the sketch below). • Problem size: the 29 easternmost states of the USA, with 4 colors of differing costs.
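A minimal sketch of the shared best-solution bound under the Java monitor model (class and method names invented; the per-thread priority queues and bounding logic are not shown):

    // Threads publish improved solutions through a monitor; readers
    // use the bound to prune branches that cannot beat it.
    public class BestSolution {
        private int bestCost = Integer.MAX_VALUE;

        public synchronized boolean tryUpdate(int cost) {
            if (cost < bestCost) {
                bestCost = cost;   // this thread improved the global bound
                return true;
            }
            return false;
        }

        public synchronized int currentBound() {
            return bestCost;
        }
    }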

  37. Experimental Setting • Two Linux 2.2 clusters: • eight 200 MHz Pentium Pro processors connected by a Myrinet switch, using MPI over BIP. • four 450 MHz Pentium II processors connected by an SCI network, using SISCI. • gcc 2.7.2.3 with -O6

  38. Performance Results (graph omitted)

  39. Parallelizability (graph omitted)

  40. Baseline Performance • Compared serial Java to serial C for the map-coloring application. • Each program has a single queue and a single thread.

  41. Serial Java versus Serial C (graph omitted) • Java v2: DSM checks disabled • Java v3: DSM and array-bound checks disabled • Executing on a single 450 MHz Pentium II

  42. Inline checks are expensive! • The genericity of DSM-PM2 allows an alternative implementation. • Use page-fault detection rather than an inline check to detect a non-local object.

  43. Using Page Faults: details • An object reference is the address of the base of the object. • loadIntoCache does nothing. • DSM-PM2 is used to handle page faults generated by the get/put primitives.

  44. More details • When an object is allocated, its address is appended to a list attached to the page that contains its header. • When a page is loaded on a remote node, the list is used to turn on the header bit for all object headers on the page. • The put primitive uses the header bit in the same manner as the inline-check version.

  45. Inline Check versus Page Fault • IC has higher overhead for accessing objects (either local or locally cached). • PF has higher overhead (signal handling and memory protection) for loading a page into the cache.

  46. IC versus PF: serial map-coloring (graph omitted) • Java XX v2: DSM checks disabled • Java XX v3: DSM and array-bound checks disabled • Executing on a single 450 MHz Pentium II

  47. IC versus PF: parallel map-coloring (graph omitted) • Executing on the 450 MHz/SCI cluster.

  48. Related Work • Java/MPI: cluster nodes are explicit • Java/RMI: ditto • Remote objects via RMI: nearly transparent • e.g. JavaParty, Do! • Distributed interpreters • e.g. Java/DSM, MultiJav, cJVM

  49. Conclusions • Approach is clean: Java “as is” • Approach is promising • good parallelizability for map-coloring • need better scalar compilation • e.g. array bound-check removal • need further parallel application studies • are thread/object placement heuristics sufficient for programmers to write efficient programs?
