1 / 48

MPJ Express

MPJ Express. and Lessons from Java Gadget. Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk. Contents. MPJ Express overview The Gadget 2 code Making a Java version of Gadget. Acknowledgements.

dillon
Télécharger la présentation

MPJ Express

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. MPJ Express and Lessons from JavaGadget Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk

  2. Contents • MPJ Express overview • The Gadget 2 code • Making a Java version of Gadget

  3. Acknowledgements • Work done in collaboration with Mark Baker and Aamir Shafir. • Most work done by latter!

  4. MPJ Express

  5. Java for High Performance Computing? • Various people have argued that Java provides a good programming platform for large scale computation (e.g. Java Grande, 1997 - ~2003). • Parallel computing has a long history of novel language concepts to support parallelism, multithreading etc. • Java seems to incorporate at least some of these ideas, and has the benefit that it is a “mainstream” language. • Java encourages better software engineering and is more portable than, say, Fortran or C. • There is a large body of supporting software for Java.

  6. Performance • Modern JVMs perform compilation from byte code to native machine code, as the Java program is executed. • Typical JVMs implement many of the optimizations used by compilers for “traditional” programming languages. • Adaptive compilation may even allow optimizations that are impractical for static compilers, because run-time information is available? • Quality of compilation comparable to many C or Fortran compilers. • But Java has safety features that may limit performance. • Difficulty of managing memory explicitly, exact exception processing, …

  7. MPI Background • MPI was introduced in 1994 as a standard message passing API for parallel scientific computing: • Language bindings for C, C++, and Fortran. • Original programming model was SPMD (Single Program Multiple Data) ― P copies of a single program run on a fixed number, P, of processors. • Only interaction is by message exchange ― no sharing of memory between processors (c.f. CSP). • Introduced idea of communicator as basic communication abstraction (c.f. channels). A communicator is shared by a group of processors. • Supports library development and collective operations.

  8. Communicators Communicator C Proc 0 Proc 1 C.Send(…, 2) Proc 2 C.Recv(…, 0)

  9. Collective Communications Communicator C Proc 0 Proc 1 C.Bcast(…, 0) C.Bcast(…, 0) Proc 2 C.Bcast(…, 0)

  10. MPJ Background • Java Grande Message Passing Workgroup defined Java bindings for MPI in 1998. • Previous efforts follow two approaches: • Pure Java approaches: • Remote Method Invocation (RMI), • Java Sockets, • Java Native Interface (JNI) approaches: • mpiJava – Java wrapper to platform native MPI.

  11. Minimal mpiJava Program import mpi.* class Hello { static public void main(String[] args) { MPI.Init(args) ; int myrank = MPI.COMM_WORLD.Rank() ; if(myrank == 0) { char[] message = “Hello, there”.toCharArray() ; MPI.COMM_WORLD.Send(message, 0, message.length, MPI.CHAR, 1, 99) ; } else { char[] message = new char [20] ; MPI.COMM_WORLD.Recv(message, 0, 20, MPI.CHAR, 0, 99) ; System.out.println(“received:” + new String(message) + “:”) ; } MPI.Finalize() ; } }

  12. MPJ Express Goals • Modular design: unify earlier approaches by pluggable communication devices. • Support high-level functionality of MPI-1.2 in pure Java code, as far as possible: • Point to point communications, • Collective communications, • Groups, communicators, and contexts, • Derived datatypes – buffering API. • Thread safety ― support Java threads for parallelism within multicore nodes. • Open source distribution: http://mpj-express.org

  13. MPJ Design

  14. MPJ Demos • There were a few existing demos for MPJ, inherited from the earlier mpiJava project, e.g. • A 2d Fluid Flow example. • Some Monte Carlo simulations of 2d spin systems from condensed matter physics. • These have nice GUIs, but aren’t very representative of real codes used by computational scientists.

  15. MPJ demo: CFD – inviscid flow

  16. MPJ demo: Q-state Potts model

  17. Gadget

  18. Gadget-2 • Gadget-2 is a free-software, production code for cosmological N-body (and hydrodynamic) computations. http://www.mpa-garching.mpg.de/gadget • Written by Volker Springel, of the Max Plank Institute for Astrophysics, Garching. • It is written in the C language – already parallelized using MPI. • Versions have been used in various research papers in astrophysics literature, including the Millennium Simulation.

  19. Millennium Simulation • See the paper Simulating the Joint Evolution of Quasars, Galaxies and their Large-Scale Distribution by Springel et al on Gadget home page. • Follows evolution of 1010 dark matter “particles” from early Universe (z = 127) to current day. • Performed on 512 nodes of IBM p690. • Used 1TB of distributed memory. • 350,000 CPU hours – 28 days elapsed time. • Floating point performance around 0.2 TFLOPS. • Around 20Tb total data generated.

  20. Dynamics in Gadget • Gadget is “just” simulating the movement of (a lot of) representative particles under the influence of Newton’s law of gravity. • Plus some hydrodynamic forces, but these don’t affect dominant Dark Matter. • Co-moving coordinates take account of expansion of Universe according to General Relativity, but otherwise basically classical mechanics. • Classical N-body problem.

  21. Gadget Main Loop • This is a slightly simplified view of the Gadget code: … Initialize … while (not done) { move_particles() ; // update positions domain_Decomposition() ; compute_accelerations() ; advance_and_find_timesteps() ; // update velocities } • Most of the interesting work happens in compute_accelerations() (and domain_Decomposition() – see later).

  22. Computing Forces • The function compute_accelerations() must compute forces experienced by all particles. • In particular must compute gravitational force. • Because gravity is a long range force, every particle affects every other. Naively, total cost is O(N2). • with N ≈ 1010, this is quite infeasible. • Need some kind of approximation. • Intuitive approach: when computing force on particle i, group together particles in selected regions distant from i, and treat groups as single particles, located at their centres of mass.

  23. Barnes Hut Tree • First divide cubical region of 3d space into 23 = 8 regions, halving each dimension. • For every sub-region that contains any particles, divide again recursively to make an oct-tree, until “leaf” regions have at most one particle.

  24. Two Dimensional Example • Picture borrowed from www.cs.berkeley.edu/~demmel/cs267/lecture26/lecture26.html

  25. Barnes Hut Force Computation • To compute the force on a particle i, traverse tree starting from root: • if a node n is “distant from” i, just add contribution to force on i from centre of mass of n– no need to visit children of n; • if node n is “close to” i, visit children of n and recurse. • Hinges on definition of distant from/close to. • Basic idea is that a node representing some region of space is distant from a particle i if the angle it subtends is smaller than a threshold opening angle: n θ i

  26. Complexity • On average, number of nodes “opened” to compute force on i is O(log N), as opposed to visiting O(N) particles in naïve algorithm. • A huge win when N ≈ 1010.

  27. Domain Decomposition • Need to divide space and/or particle set into domains, where each domain is handled by a single processor. • Problem is that we can’t just divide space evenly, because some regions will have many more particle than others – poor load balancing. • Conversely, can’t just divide particles evenly, because particles move throughout space, and want to maintain physically close particles on the same processor, as far as practical – communication problem.

  28. Peano-Hilbert Curve • Warren and Salmon originally suggested using a space-filling curve: • Picture borrowed from http://www.mpa-garching.mpg.de/gadget/gadget2-paper.pdf

  29. Peano-Hilbert Key • Gadget applies the recursion 20 times, logically dividing space into up to 220 × 220 × 220 cells on the Peano-Hilbert curve. • Then can label each cell by its location along the Peano-Hilbert curve – 260 possible locations comfortably fit into a 64-bit word.

  30. Decomposition based on P-H Curve • Because number of cells> > number of particles, segments of linear Peano-Hilbert curve sparsely populated. • Sort particles by their Peano-Hilbert key, then divide evenly into P domains. • Intuitively – stretch out the P-H curve with particles dotted along it; segment it into P parts where each part has the same number of particles. • Characteristics of this decomposition: • Good load balancing. • Domains simply connected and quite “compact” in real space, because particles that are close along P-H curve are close in real space (converse often but not always true). • Domains have relatively simple mapping to BH octtree nodes.

  31. Distribution of BH Tree in Gadget • Ibid.

  32. Distributed Representation of Tree • Every processor holds a copy of the root nodes, and a copy of all child nodes down to the point where all particles in of a node are held on a single remote processor. Remotely held nodes are called pseudo-particles. • To compute the force on a single local target particle, traverse tree from root as usual, and accumulate force contributions from locally held particles. • Build an export list containing target particle and hosts of pseudo-particles encountered in walk.

  33. Communication • After computation for all locally held target particles, process export list and send list of local target particles to all hosts that own pseudo-particle nodes needed for those particles. • All processors do another tree walk to compute their contributions to remotely owned (from their point of view) target particles. • These contributions are returned to the original processor, and added into the accelerations for the target particles.

  34. Other Communications • Notably: • The Domain Decomposition itself requires (in principle) a distributed sort of the particle list. • Gadget only approximates this sort, but still fairly intricate communications. • In Gadget all communication is using the standard MPI library. It makes fairly extensive use of collective communication routines.

  35. Java Gadget

  36. Timeline • MPJ Express released 2005. • Realized we needed a realistic exemplar code: both as a demo, and to drive further improvements of our software. • First thoughts that an N-body simulation code would make sense – Summer 2005. • Learned about Gadget in December that year. • Effort to make a Java version of Gadget started seriously early February 2006 – expecting many months work. • Non-trivial simulations running in March.

  37. Gadget Code Statistics • Public distribution is around 17,000 lines of C, distributed over about 30 source files. • Dependencies on: • MPI library for parallelization. • GNU scientific library (but only a handful of functions) • FFTW – library for parallel Fourier transforms. • Comes with a set of initial conditions files that we can use to test our code, including “colliding galaxies” and “cluster formation”.

  38. Translation of Gadget Code • Manually translated, but in first cut, deliberately keep the Java code as similar to C as possible. • e.g. nearly all of Java Gadget is currently implemented as static methods on one enormous class called Main. • Small supporting classes corresponding to structs of C code. • Currently not supporting a number of features, including: • PMTree algorithm (an optimization using Fourier Transforms on a mesh to compute long-range part of gravitational force), • Multiple Initial Conditions files, • COMPUTE_POTENTIAL_ENERGY, ISOTHERM_EQS, FLEXSTEPS, OUTPUTPOTENTIAL, FORCETEST, MAKEGLASS, PSEUDOSYMMETRIC, …

  39. Handling Dependencies • Replace MPI calls with MPJ calls (!) • Use of FFTW avoided by disabling the PMTree algorithm (but may slow down some simulations a lot). • Few GNU Scientific Library functions that were needed (numeric quadrature, etc) were hand translated to Java.

  40. Test Cases • Restrictions aside, we successfully ran Colliding Galaxies and Cluster Formation example simulations varying numbers of processors. • These use pure Dark Matter – hydrodynamics code not yet tested. • Results are indistinguishable from running the original Gadget.

  41. Colliding Galaxies (Java Gadget)

  42. Cluster Formation (Java Gadget)

  43. Optimizing Java Gadget • Initial Java version of Gadget was about 3 times slower than C. • Optimizations: • Replace communication of Object arrays (corresponding to C structs) with communication of component primitive elements (floats, ints, etc). • Still wasn’t enough – final version actually expunges large Object arrays from program – flattens all struct arrays into arrays of primitive types (see later comments). • Extended the MPJ Express API to manipulate Java ByteBuffers directly.

  44. Flattening Object Arrays • To preserve memory locality, we decided to “flatten” large arrays of small objects (ParticleData) – replace an array of objects with an array of doubles containing object component fields, for example. • This makes the code less readable and maintainable than the original C code – clearly not acceptable. But we wanted speed. • Need a much better approach here. One possibility is some kind of macro pre-processing (it would be better if the Java language had structs…).

  45. Benchmarks • Graphs taken from Aamir Shafi’s thesis – you will find many more results there. • Some benchmarks are on up to 8 processors of Starbug cluster in Portsmouth Distributed Systems Group. • Other benchmarks are on up to 32 processors from NW-GRID cluster at Daresbury Laboratory.

  46. Starbug: Colliding Galaxies with 60,000 Particles

  47. NW-Grid: Structure Formation with two million Particles

  48. Conclusions • Possible to do large scale numerical simulations in Java • Speed is OK. • Memory consumption still an issue (Java Gadget needs perhaps three times more memory than C Gadget). • Made various improvements to MPJ Express implementation and API along the way. • As things stand, the Java Gadget code is not very maintainable – C-like, and then optimized some. • Needs more work to learn how to write maintainable code in Java that is also fast – e.g. how to preserve memory locality in arrays of objects.

More Related