
Cluster Computing with Java Threads



Presentation Transcript


  1. Cluster Computing with Java Threads Philip J. Hatcher University of New Hampshire Philip.Hatcher@unh.edu

  2. Collaborators • UNH/Hyperion • Mark MacBeth and Keith McGuigan • ENS-Lyon/DSM-PM2 • Gabriel Antoniu, Luc Bougé and Raymond Namyst

  3. Focus • Use Java “as is” for high-performance computing • support computationally intensive applications • utilize parallel computing hardware

  4. Outline • Our Vision • Java Threads • The PM2 Run-time Environment • Hyperion: Java Threads on Clusters • Evaluation • Related Work • Conclusions

  5. Why Java? • Soon to be ubiquitous! • use of Java is growing very rapidly • Designed for portability: • develop programs on your desktop • run programs on a distant cluster

  6. Why Java? • Explicitly parallel! • includes a threaded programming model • Relaxed memory model • consistency model aids an implementation on distributed-memory parallel computers

  7. Unique Opportunity • Use Java to bring parallelism to the “masses” • Let’s not miss it! • But, programmers will not accept syntax or model changes

  8. Open Question • Parallelism via Java access to distributed-computing techniques? • e.g. RMI (remote method invocation) • Or, parallelism via Java threads?

  9. That is, ... • Does a user prefer to view a cluster as a collection of distinct machines? • Or, does a user prefer to view a cluster as a “black box” that will simply run Java code faster?

  10. Are you “in a box”?

  11. Or, are you “thinking outside of the box”?

  12. Climb out of the box! • Use Java threads “as is” to program clusters of computers. • Program for the threaded Java virtual machine. • Allow the implementation to handle the details of executing in a cluster.

  13. Java Threads • Threads are objects. • The class java/lang/Thread contains all of the methods for initializing, running, suspending, querying and destroying threads.

  14. java/lang/Thread methods • Thread() - constructor for thread object. • start() - start the thread executing. • run() - method invoked by ‘start’. • stop(), suspend(), resume(), join(), yield(). • setPriority().
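For illustration, a minimal sketch of this lifecycle, assuming a Worker class invented for the example:

    // Worker overrides run(), the method the new thread executes
    // once start() is called.
    class Worker extends Thread {
        private final int id;
        Worker(int id) { this.id = id; }
        public void run() {
            System.out.println("worker " + id + " running");
        }
    }

    public class ThreadDemo {
        public static void main(String[] args) throws InterruptedException {
            Worker w = new Worker(0);  // Thread() constructor
            w.start();                 // begin concurrent execution of run()
            w.join();                  // wait for the thread to terminate
        }
    }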

  15. Java Synchronization • Java uses monitors, which protect a region of code by allowing only one thread at a time to execute it. • Monitors utilize locks. • There is a lock associated with each object.

  16. synchronized keyword • Statement form: synchronized ( Exp ) Block • Method form: public class Q { synchronized void put(…) { … } }
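A minimal sketch showing both forms (the Counter class is invented for illustration):

    public class Counter {
        private int count = 0;

        // Method form: the monitor of 'this' is held for the whole body.
        synchronized void increment() {
            count++;
        }

        // Statement form: the monitor of an explicit expression is held
        // only for the duration of the block.
        int read() {
            synchronized (this) {
                return count;
            }
        }
    }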

  17. java/lang/Object methods • wait() - the calling thread, which must hold the lock for the object, is placed in a wait set associated with the object. The lock is then released. • notify() - an arbitrary thread in the wait set of this object is awakened and then competes again for the object’s lock. • notifyAll() - all waiting threads are awakened.
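A hedged sketch of the classic bounded-buffer use of these methods (the Buffer class is invented; waiting inside a loop guards against waking while the condition still fails):

    import java.util.LinkedList;
    import java.util.Queue;

    public class Buffer {
        private final Queue<Integer> items = new LinkedList<>();
        private final int capacity = 4;

        public synchronized void put(int x) throws InterruptedException {
            while (items.size() == capacity) {
                wait();          // release the lock and enter the wait set
            }
            items.add(x);
            notifyAll();         // wake any threads waiting in take()
        }

        public synchronized int take() throws InterruptedException {
            while (items.isEmpty()) {
                wait();
            }
            int x = items.remove();
            notifyAll();         // wake any threads waiting in put()
            return x;
        }
    }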

  18. Shared-Memory Model • Java threads execute in a virtual shared memory. • All threads are able to access all objects. • But threads may not access each other’s stacks.

  19. Java Memory Consistency • A variant of release consistency. • Threads can keep locally cached copies of objects. • Consistency is provided by requiring that: • a thread's object cache be flushed upon entry to a monitor. • local modifications made to cached objects be transmitted to the central memory when a thread exits a monitor.
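As a hedged illustration of what this model implies (class and field names invented): without monitors, a reader may see a stale cached copy of an object; entering and exiting monitors forces the flush and write-back described above.

    public class Shared {
        private int data = 0;

        // Exiting the monitor transmits the local modification
        // to the central memory.
        synchronized void writer() {
            data = 42;
        }

        // Entering the monitor flushes this thread's object cache,
        // so the read observes the committed value.
        synchronized int reader() {
            return data;
        }
    }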

  20. PM2: A Distributed, Multithreaded Runtime Environment • Thread library: Marcel • user-level • supports SMP • POSIX-like • preemptive thread migration • Communication library: Madeleine • portable: BIP, SISCI/SCI, MPI, TCP, PVM • efficient

  21. DSM-PM2: Architecture • Layers, top to bottom: DSM Protocol Policy / DSM Protocol lib / DSM Page Manager + DSM Comm, over PM2 (Madeleine Comms, Marcel Threads). • DSM comm: • send page request • send page • send invalidate request • … • DSM page manager: • set/get page owner • set/get page access • add/remove to/from copyset • ...

  22. DSM-PM2: Performance (graphs omitted) • SCI cluster: 450 MHz Pentium II nodes • Myrinet cluster: 200 MHz Pentium Pro nodes

  23. Hyperion • Executes threaded Java programs on clusters. • Built on top of PM2 and DSM-PM2. • Provides both portability and efficiency.

  24. Reversing the Bytecode Stream • Conventionally, users “pull” bytecode to their machines for local execution. • Our vision: • users develop their high-performance Java programs using the Java toolset on their desktop. • they then “push” the resulting bytecode to a Hyperion server for high-performance cycles.

  25. Supporting High Performance • Utilizes a bytecode-to-C translator. • Parallel execution via spreading of Java threads across nodes of the cluster. • Java threads implemented as lightweight threads using PM2 library.

  26. Compiling Java • Hyperion designed for computationally intensive applications, so small overhead of translating bytecode is not important. • Translating to C allows us to leverage the native C compiler and optimizer.

  27. General Hyperion Overview • prog.java → prog.class (bytecode), via javac (Sun's Java compiler) • prog.class → prog.[ch], via java2c (instruction-wise translation) • prog.[ch] → prog, via gcc -O6, linked with the runtime libraries

  28. The Hyperion Run-Time System • Collection of modules to allow “plug-and-play” implementations: • inter-node communication • threads • memory and synchronization • etc.

  29. Hyperion Internal Structure • Hyperion layer: load balancer, native Java API, thread subsystem, memory subsystem, communication subsystem • PM2 API: pm2_rpc, pm2_thread_create, etc. • PM2 layer: DSM subsystem, thread subsystem, communication subsystem

  30. Thread and Object Allocation • Currently, threads are allocated to processors in round-robin fashion. • Currently, an object is allocated to the processor that holds the thread that is creating the object. • Currently, DSM-PM2 is used to implement the Java memory model.
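A hedged sketch of such a round-robin policy (the placer class is invented; Hyperion's actual allocator is internal to the runtime):

    import java.util.concurrent.atomic.AtomicInteger;

    // Thread i is placed on node i mod N.
    public class RoundRobinPlacer {
        private final int numNodes;
        private final AtomicInteger next = new AtomicInteger(0);

        public RoundRobinPlacer(int numNodes) {
            this.numNodes = numNodes;
        }

        public int placeThread() {
            return next.getAndIncrement() % numNodes;
        }
    }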

  31. Hyperion’s DSM API • loadIntoCache • invalidateCache • updateMainMemory • get • put
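These primitives belong to the C run-time system; purely as an illustration of where they fit, here is a hedged Java-flavored sketch (all signatures invented) of the roles a translated program would exercise:

    // Invented signatures, for illustration only; the real primitives
    // are C functions inside the Hyperion runtime.
    interface DsmApi {
        void loadIntoCache(Object ref);      // make the object locally accessible
        void invalidateCache();              // on monitor entry: drop cached copies
        void updateMainMemory();             // on monitor exit: send logged mods home
        int  get(Object ref, int offset);    // read a field, checking locality
        void put(Object ref, int offset, int value);  // write, logging if remote
    }

    // A translated read of obj.f might conceptually become:
    //   dsm.loadIntoCache(obj);
    //   int x = dsm.get(obj, OFFSET_OF_F);  // OFFSET_OF_F is hypothetical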

  32. DSM Implementation • Node-level caches. • Page-based and home-based protocol. • Log mods made to remote objects. • Use explicit in-line checks in get/put. • Each node allocates objects from a different range of the virtual address space.

  33. Details • Objects are aligned on 64-byte boundaries. • An object reference is the address of the base of the object. • The bottom 6 bits of the ref can be used to store the node number of the object’s home.
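A hedged sketch of this encoding, treating references as raw long addresses (which real Java code cannot do; the class and names are invented):

    public class RefEncoding {
        static final int HOME_BITS = 6;                      // freed by 64-byte alignment
        static final long HOME_MASK = (1L << HOME_BITS) - 1; // 0x3F

        // Pack a home-node number into the low 6 bits of an aligned base.
        static long encode(long alignedBase, int homeNode) {
            return alignedBase | (homeNode & HOME_MASK);
        }

        // Recover the home node and the true object address.
        static int homeNode(long ref) {
            return (int) (ref & HOME_MASK);
        }

        static long baseAddress(long ref) {
            return ref & ~HOME_MASK;
        }
    }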

  34. More details • loadIntoCache checks the 6 bits to see if an object is remote. • If so, and if not already locally cached, DSM-PM2 is used to load the page(s) containing the object. • When a remote object is cached, a bit is turned on in its header.

  35. Yet more details • The put primitive checks the header bit to see if a modification should be logged. • updateMainMemory sends the logged changes to the home node.

  36. Evaluation • Minimal-cost map-coloring application. • Branch-and-bound algorithm. • 64 threads, each with its own priority queue. • Current best solution is shared (see the sketch below). • Problem size: the 29 easternmost states of the USA, with 4 colors of differing costs.
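A minimal sketch of the shared best-solution bound under the Java monitor model (class and method names invented; the per-thread priority queues and bounding logic are not shown):

    // Threads publish improved solutions through a monitor; readers
    // use the bound to prune branches that cannot beat it.
    public class BestSolution {
        private int bestCost = Integer.MAX_VALUE;

        public synchronized boolean tryUpdate(int cost) {
            if (cost < bestCost) {
                bestCost = cost;   // this thread improved the global bound
                return true;
            }
            return false;
        }

        public synchronized int currentBound() {
            return bestCost;
        }
    }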

  37. Experimental Setting • Two Linux 2.2 clusters: • eight 200 MHz Pentium Pro processors connected by a Myrinet switch, using MPI over BIP. • four 450 MHz Pentium II processors connected by an SCI network, using SISCI. • gcc 2.7.2.3 with -O6

  38. Performance Results (graph omitted)

  39. Parallelizability (graph omitted)

  40. Baseline Performance • Compared serial Java to serial C for the map-coloring application. • Each program has a single queue and a single thread.

  41. Serial Java versus Serial C (graph omitted) • Java v2: DSM checks disabled • Java v3: DSM and array-bound checks disabled • Executing on a single 450 MHz Pentium II

  42. Inline checks are expensive! • The genericity of DSM-PM2 allows an alternative implementation. • Use page-fault detection rather than an inline check to detect a non-local object.

  43. Using Page Faults: details • An object reference is the address of the base of the object. • loadIntoCache does nothing. • DSM-PM2 is used to handle page faults generated by the get/put primitives.

  44. More details • When an object is allocated, its address is appended to a list attached to the page that contains its header. • When a page is loaded on a remote node, the list is used to turn on the header bit for all object headers on the page. • The put primitive uses the header bit in the same manner as the inline-check version.

  45. Inline Check versus Page Fault • IC has higher overhead for accessing objects (either local or locally cached). • PF has higher overhead (signal handling and memory protection) for loading a page into the cache.

  46. IC versus PF: serial map-coloring (graph omitted) • Java XX v2: DSM checks disabled • Java XX v3: DSM and array-bound checks disabled • Executing on a single 450 MHz Pentium II

  47. IC versus PF: parallel map-coloring (graph omitted) • Executing on the 450 MHz/SCI cluster.

  48. Related Work • Java/MPI: cluster nodes are explicit • Java/RMI: ditto • Remote objects via RMI: nearly transparent • e.g. JavaParty, Do! • Distributed interpreters • e.g. Java/DSM, MultiJav, cJVM

  49. Conclusions • Approach is clean: Java “as is” • Approach is promising • good parallelizability for map-coloring • need better scalar compilation • e.g. array bound-check removal • need further parallel application studies • are thread/object placement heuristics sufficient for programmers to write efficient programs?
