Presentation Transcript


  1. Comparing The Performance Of Distributed Shared Memory And Message Passing Programs Using The Hyperion Java Virtual Machine On Clusters Mathew Reno M.S. Thesis Defense

  2. Overview • For this thesis we wanted to evaluate the performance of the Hyperion distributed virtual machine, designed at UNH, when compared to a preexisting parallel computing API. • The results would indicate where Hyperion’s strengths and weaknesses were and possibly validate Hyperion as a high-performance computing alternative. M.S. Thesis Defense

  3. What Is A Cluster? • A cluster is a group of low-cost computers connected with an “off-the-shelf” network. • The cluster’s network is isolated from WAN data traffic and the computers on the cluster are presented to the user as a single resource. M.S. Thesis Defense

  4. Why Use Clusters? • Clusters are cost effective when compared to traditional parallel systems. • Clusters can be grown as needed. • Software components are based on standards allowing portable software to be designed for the cluster. M.S. Thesis Defense

  5. Cluster Computing • Cluster computing takes advantage of the cluster by distributing computational workload among nodes of the cluster, thereby reducing total computation time. • There are many programming models for distributing data throughout the cluster. M.S. Thesis Defense

  6. Distributed Shared Memory • Distributed Shared Memory (DSM) allows the user to view the whole cluster as one resource. • Memory is shared among the nodes. Each node has access to every other node’s memory as if it were its own. • Data coordination among nodes is generally hidden from the user. M.S. Thesis Defense

  7. Message Passing • Message Passing (MP) requires explicit messages to be employed to distribute data throughout the cluster. • The programmer must coordinate all data exchanges when designing the application, using a language-level MP API. M.S. Thesis Defense

  8. Related: Treadmarks Vs. PVM • Treadmarks (Rice, 1995) implements a DSM model while PVM implements an MP model. The two approaches were compared with benchmarks. • On average, PVM was found to perform two times better than Treadmarks. • Treadmarks suffered from the excessive messages required by the request-response communication model its DSM employed. • Treadmarks was found to be more natural to program with, saving development time. M.S. Thesis Defense

  9. Hyperion • Hyperion is a distributed Java Virtual Machine (JVM), designed at UNH. • The Java language provides parallelism through its threading model. Hyperion extends this model by distributing the threads among the cluster. • Hyperion implements the DSM model via DSM-PM2, which allows for lightweight thread creation and data distribution. M.S. Thesis Defense
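
For context, the threading model Hyperion distributes is ordinary Java: the program starts java.lang.Thread objects, and Hyperion places them on cluster nodes while the DSM layer backs their shared data. A minimal single-machine sketch of that model follows; the worker computation and array sizes are illustrative, not taken from the thesis.

```java
// Minimal sketch of the Java threading model Hyperion distributes.
// Under plain Java these threads share one address space; under
// Hyperion each thread may be started on a different cluster node,
// with the DSM layer making remote field/array accesses transparent.
public class WorkerDemo implements Runnable {
    private final double[] data;   // logically shared array
    private final int lo, hi;      // this worker's slice

    WorkerDemo(double[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    public void run() {
        for (int i = lo; i < hi; i++) {
            data[i] = Math.sqrt(i);   // placeholder computation
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int nThreads = 4;
        double[] data = new double[1000];
        Thread[] workers = new Thread[nThreads];
        int chunk = data.length / nThreads;
        for (int t = 0; t < nThreads; t++) {
            int lo = t * chunk;
            int hi = (t == nThreads - 1) ? data.length : lo + chunk;
            workers[t] = new Thread(new WorkerDemo(data, lo, hi));
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();   // wait for all portions to finish
        }
    }
}
```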

  10. Hyperion, Continued • Hyperion has a fixed memory size that it shares with all threads executing across the cluster. • Hyperion uses page-based data distribution; if a thread accesses memory it does not have locally, a page fault occurs and the memory is transmitted from the node that owns the memory to the requesting node a page at a time. M.S. Thesis Defense

  11. Hyperion, Continued • Hyperion translates Java bytecodes into native C code. • A native executable is generated by a native C compiler. • The belief is that native executables are optimized by the C compiler and will benefit the application by executing faster than interpreted code. M.S. Thesis Defense

  12. Hyperion’s Threads • Threads are created in a round-robin fashion among the nodes of the cluster. • Data is transmitted between threads via a request/response mechanism. This approach requires two messages. • In order to respond to a request message, a response thread must be scheduled. This thread handles the request by sending back the requested data in a response message. M.S. Thesis Defense

  13. mpiJava • mpiJava is a Java wrapper for the Message Passing Interface (MPI). • The Java Native Interface (JNI) is used to translate between Java and native code. • We used MPICH for the native MPI implementation. M.S. Thesis Defense
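
As a rough illustration of the mpiJava style the benchmarks use, here is a minimal rank/size/send/receive sketch. The method names (MPI.Init, MPI.COMM_WORLD.Rank/Size/Ssend/Recv) follow the mpiJava 1.2 binding as commonly documented; the payload and tag are invented.

```java
// Minimal mpiJava sketch: every non-root process sends one value to the
// root, which receives them in rank order.
import mpi.MPI;

public class HelloMPI {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();

        int[] buf = new int[1];
        if (rank != 0) {
            buf[0] = rank * rank;                  // placeholder payload
            MPI.COMM_WORLD.Ssend(buf, 0, 1, MPI.INT, 0, 99);
        } else {
            for (int src = 1; src < size; src++) {
                MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, src, 99);
                System.out.println("root got " + buf[0] + " from " + src);
            }
        }
        MPI.Finalize();
    }
}
```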

  14. Clusters • The “Star” cluster (UNH) consists of 16 PIII 667MHz Linux PCs on a 100Mb Fast Ethernet network. TCP is the communication protocol. • The “Paraski” cluster (France) consists of 16 PIII 1GHz Linux PCs on a 2Gb Myrinet network. BIP (DSM) and GM (MPI) are the communication protocols. M.S. Thesis Defense

  15. Clusters, Continued • The implementation of MPICH on BIP was not stable in time for this thesis, so GM had to be used in place of BIP for MPICH. GM has not been ported to Hyperion, and a port would have been unreasonable at this time. • BIP performs better than GM as the message size increases. M.S. Thesis Defense

  16. BIP vs. GM Latency (Paraski) M.S. Thesis Defense

  17. DSM & MPI In Hyperion • For consistency, mpiJava was ported into Hyperion. • Both DSM and MPI versions of the benchmarks could be compiled by Hyperion. • The executables produced by Hyperion are then executed by the respective native launchers (PM2 and MPICH). M.S. Thesis Defense

  18. Benchmarks • The Java Grande Forum (JGF) developed a suite of benchmarks to test Java implementations. • We used two of the JGF benchmark suites, multithreaded and javaMPI. M.S. Thesis Defense

  19. Benchmarks, Continued • Benchmarks used: • Fourier coefficient analysis • Lower/upper matrix factorization • Successive over-relaxation • IDEA encryption • Sparse matrix multiplication • Molecular dynamics simulation • 3D Ray Tracer • Monte Carlo simulation (only with MPI) M.S. Thesis Defense

  20. Benchmarks And Hyperion • The multi-threaded JGF benchmarks had unacceptable performance when run “out of the box”. • Each benchmark creates all of its data objects on the root node, causing all remote object access to occur through this one node. • This type of access causes a performance bottleneck on the root node, as it must service all requests while computing its own part of the algorithm. • The solution was to modify the benchmarks to be cluster aware. M.S. Thesis Defense

  21. Hyperion Extensions • Hyperion compensates for Java’s limited thread data management by providing efficient reduce and broadcast mechanisms. • Hyperion also provides a cluster-aware implementation of arraycopy. M.S. Thesis Defense

  22. Hyperion Extension: Reduce • Reduce blocks all enrolled threads until each thread has the final result of the reduce. • This is done by neighboring threads exchanging their data for computation, then their neighbors’ neighbors, and so on until every thread has the same answer. • This operation is faster than performing the calculation serially and scales well: it completes in O(log P) communication steps. M.S. Thesis Defense
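
A minimal sketch of the recursive-doubling pattern behind such an O(log P) reduce, simulated in a single process; this illustrates the communication structure only and is not Hyperion's actual API.

```java
// Recursive-doubling sum reduce: in round k, "thread" i exchanges with
// thread i XOR 2^k, so after log2(P) rounds every thread holds the same
// combined value. Single-process illustration, not Hyperion's API.
public class LogPReduce {
    public static void main(String[] args) {
        int p = 8;                         // number of threads (power of two)
        double[] value = new double[p];
        for (int i = 0; i < p; i++) value[i] = i + 1;   // each thread's local sum

        for (int step = 1; step < p; step <<= 1) {
            double[] next = new double[p];
            for (int i = 0; i < p; i++) {
                int partner = i ^ step;    // exchange partner for this round
                next[i] = value[i] + value[partner];
            }
            value = next;                  // one communication round done
        }
        // Every "thread" now holds the global sum 1+2+...+8 = 36.
        System.out.println(java.util.Arrays.toString(value));
    }
}
```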

  23. Hyperion Extension: Broadcast • The broadcast mechanism transmits the same data to all enrolled threads. • Like reduce, data is distributed to the threads in O(log P) steps, which scales better than serial distribution of the data. M.S. Thesis Defense
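
The round structure of a tree-style broadcast can be simulated the same way; the sketch below shows that all P threads are reached in ceil(log2 P) rounds rather than P-1 serial sends from the root. Again, this is an illustration of the pattern, not Hyperion's API.

```java
// Binomial-tree broadcast: in each round, every thread that already
// holds the data forwards it to a thread a fixed stride away.
public class LogPBroadcast {
    public static void main(String[] args) {
        int p = 8;
        boolean[] hasData = new boolean[p];
        hasData[0] = true;                      // the root owns the data
        int rounds = 0;

        for (int step = 1; step < p; step <<= 1) {
            boolean[] before = hasData.clone(); // holders at the start of the round
            for (int i = 0; i < p; i++) {
                if (before[i] && i + step < p) {
                    hasData[i + step] = true;   // "send" to the partner thread
                }
            }
            rounds++;
        }
        System.out.println("All " + p + " threads reached in " + rounds + " rounds");
    }
}
```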

  24. Hyperion Extension: Arraycopy • The arraycopy method is part of the Java System class. The Hyperion version was extended to be cluster aware. • If data is copied across threads, this version will send all data as one message instead of relying on paging mechanisms to access remote array data. M.S. Thesis Defense
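
A small example of the kind of System.arraycopy call the modified benchmarks rely on; the array sizes and offset are invented. Per the slide above, when source and destination live on different nodes, Hyperion's cluster-aware version ships the copy as a single message rather than paging the remote array in piecewise.

```java
// Standard System.arraycopy usage: copy this thread's block of results
// into the array owned by the root thread in one bulk operation.
public class ArraycopyDemo {
    public static void main(String[] args) {
        double[] localBlock = new double[250];   // this thread's results
        double[] rootArray  = new double[1000];  // array owned by the root thread
        int offset = 500;                        // where this block belongs

        System.arraycopy(localBlock, 0, rootArray, offset, localBlock.length);
    }
}
```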

  25. Benchmark Modifications • The multithreaded benchmarks had unacceptable performance. • The benchmarks were modified in order to reduce remote object access and root-node bottlenecks. • Techniques such as arraycopy, broadcast, and reduce were employed to improve performance. M.S. Thesis Defense

  26. Experiment • Each benchmark was executed 50 times at each node size to provide a sample mean. • Node sizes were 1, 2, 4, 8, and 16. • Confidence intervals (95% level) were used to determine which version, MPI or DSM, performed better. M.S. Thesis Defense
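
The thesis's exact statistical procedure is not reproduced here, but a 95% confidence interval for the mean of 50 timed runs is conventionally formed as sketched below, using a t critical value of about 2.01 for 49 degrees of freedom; the sample data is synthetic.

```java
// Textbook 95% confidence interval for a sample mean: mean +/- t * s/sqrt(n).
public class ConfidenceInterval {
    static double[] interval(double[] samples) {
        int n = samples.length;
        double mean = 0;
        for (double x : samples) mean += x;
        mean /= n;

        double var = 0;
        for (double x : samples) var += (x - mean) * (x - mean);
        var /= (n - 1);                       // sample variance

        double t = 2.01;                      // ~t_{0.975, 49} for n = 50
        double half = t * Math.sqrt(var / n); // half-width of the interval
        return new double[] { mean - half, mean + half };
    }

    public static void main(String[] args) {
        double[] times = new double[50];
        for (int i = 0; i < 50; i++) times[i] = 10.0 + 0.1 * Math.random();
        double[] ci = interval(times);
        System.out.printf("95%% CI: [%.3f, %.3f]%n", ci[0], ci[1]);
    }
}
```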

  27. Results On The Star Cluster M.S. Thesis Defense

  28. Results On The Paraski Cluster M.S. Thesis Defense

  29. Fourier Coefficient Analysis • Calculates the first 10,000 pairs of Fourier coefficients. • Each node is responsible for calculating its portion of the coefficient array. • Each node sends back its array portion to the root node, which accumulates the final array. M.S. Thesis Defense

  30. Fourier: DSM Modifications • The original multithreaded version required all threads to update arrays located on the root node, causing the root node to be flooded with requests. • The modified version used arraycopy to copy the local arrays back into the root thread’s arrays. M.S. Thesis Defense

  31. Fourier: mpiJava • The mpiJava version is similar to the DSM version. • Each process is responsible for its portion of the arrays. • MPI_Ssend and MPI_Recv were called to distribute the array portions to the root process. M.S. Thesis Defense
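
A hedged sketch of this gather pattern in mpiJava: each process fills its slice of the coefficient array and the root collects the slices with Ssend/Recv. The compute() helper, chunk arithmetic, and tag are placeholders, not the benchmark's actual code.

```java
import mpi.MPI;

public class FourierGatherSketch {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();

        int n = 10000;                    // number of coefficient pairs
        int chunk = n / size;
        double[] local = new double[chunk];
        for (int i = 0; i < chunk; i++) {
            local[i] = compute(rank * chunk + i);   // this process's slice
        }

        if (rank != 0) {
            MPI.COMM_WORLD.Ssend(local, 0, chunk, MPI.DOUBLE, 0, 1);
        } else {
            double[] all = new double[n];
            System.arraycopy(local, 0, all, 0, chunk);   // root keeps its own slice
            for (int src = 1; src < size; src++) {
                MPI.COMM_WORLD.Recv(all, src * chunk, chunk, MPI.DOUBLE, src, 1);
            }
        }
        MPI.Finalize();
    }

    static double compute(int i) { return Math.sin(i); }  // placeholder coefficient
}
```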

  32. Fourier: Results M.S. Thesis Defense

  33. Fourier: Conclusions • Most of the time in this benchmark is spent in the computation. • Network communication does not play a significant role in the overall time. • Both MPI and DSM perform similarly on each cluster, scaling well when more nodes are added. M.S. Thesis Defense

  34. Lower/Upper Factorization • Solves a 500 x 500 linear system with LU factorization followed by a triangular solve. • The factorization is parallelized while the triangular solve is computed serially. M.S. Thesis Defense

  35. LU: DSM Modifications • The original version created the matrix on the root thread and all access was through this thread, causing performance bottlenecks. • The benchmark was modified to use Hyperion’s Broadcast facility to distribute the pivot information and arraycopy was used to coordinate the final data for the solve. M.S. Thesis Defense

  36. LU: mpiJava • MPI_Bcast is used to distribute the pivot information. • MPI_Send and MPI_Recv are used so the root process can acquire the final matrix. M.S. Thesis Defense
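
A hedged sketch of the pivot broadcast in mpiJava: at each factorization step, the process owning the pivot row broadcasts it with Bcast before the others update their rows. The row-ownership rule and the elided elimination step are placeholders, not the benchmark's actual code.

```java
import mpi.MPI;

public class LUPivotBroadcastSketch {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();

        int n = 500;                          // 500 x 500 system
        double[] pivotRow = new double[n];

        for (int k = 0; k < n; k++) {
            int owner = k % size;             // placeholder row ownership
            if (rank == owner) {
                // ... fill pivotRow from the locally stored row k ...
            }
            // Everyone receives row k before updating their own rows.
            MPI.COMM_WORLD.Bcast(pivotRow, 0, n, MPI.DOUBLE, owner);
            // ... eliminate below row k using pivotRow ...
        }
        MPI.Finalize();
    }
}
```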

  37. LU: Results M.S. Thesis Defense

  38. LU: Conclusions • While the DSM version uses a data distribution mechanism similar to the MPI version’s, there is significant overhead that is exposed when these methods execute in large loops. • This overhead is minimized on the Paraski cluster due to the nature of Myrinet and BIP. M.S. Thesis Defense

  39. Successive Over-Relaxation • Performs 100 iterations of SOR on a 1000 x 1000 grid. • A “red-black” ordering mechanism allows array rows to be distributed to nodes in blocks. • After initial data distribution, only neighbor rows need be communicated during the SOR. M.S. Thesis Defense

  40. SOR: DSM Modifications • Excessive remote thread object access made it necessary to modify the benchmark. • The modified version uses arraycopy to update neighbor rows during the SOR. • When the SOR completes, arraycopy is used to assemble the final matrix on the root thread. M.S. Thesis Defense

  41. SOR: mpiJava • MPI_Sendrecv is used to exchange neighbor rows. • MPI_Ssend and MPI_Recv are used to build the final matrix on the root process. M.S. Thesis Defense
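
A hedged sketch of the neighbor-row exchange with Sendrecv: each process swaps its boundary rows with the processes above and below it. Buffer names, tags, and the surrounding iteration are invented; MPI.PROC_NULL is used so the exchange drops out at the grid edges.

```java
import mpi.MPI;

public class SorHaloExchangeSketch {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();

        int width = 1000;
        double[] topRow    = new double[width];   // my first interior row
        double[] bottomRow = new double[width];   // my last interior row
        double[] haloAbove = new double[width];   // received from rank - 1
        double[] haloBelow = new double[width];   // received from rank + 1

        int up   = (rank > 0)        ? rank - 1 : MPI.PROC_NULL;
        int down = (rank < size - 1) ? rank + 1 : MPI.PROC_NULL;

        // Send my top row up while receiving the halo from below, and vice versa.
        MPI.COMM_WORLD.Sendrecv(topRow, 0, width, MPI.DOUBLE, up, 0,
                                haloBelow, 0, width, MPI.DOUBLE, down, 0);
        MPI.COMM_WORLD.Sendrecv(bottomRow, 0, width, MPI.DOUBLE, down, 1,
                                haloAbove, 0, width, MPI.DOUBLE, up, 1);
        MPI.Finalize();
    }
}
```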

  42. SOR: Results M.S. Thesis Defense

  43. SOR: Conclusions • The DSM version requires an extra barrier after neighbor rows are exchanged due to the “network reactivity” problem. • A thread must be able to service all requests in a timely fashion; if the thread is busy computing, it cannot react quickly enough to schedule the thread that services the request. • The barrier blocks all threads until each reaches it, which guarantees that all nodes have their requested data and that it is safe to continue with the computation. M.S. Thesis Defense

  44. IDEA Crypt • Performs IDEA encryption and decryption on a 3,000,000 byte array. • The array is divided among nodes in a block manner. • Each node encrypts and decrypts its portion. • When complete, the root node collects the decrypted array for validation. M.S. Thesis Defense

  45. Crypt: DSM Modifications • The original version created the whole array on the root thread and required each remote thread to page in its portion. • The modified version used arraycopy to distribute each thread’s portion from the root thread. • When decryption finishes, arraycopy copies the decrypted portion back to the root thread. M.S. Thesis Defense

  46. Crypt: mpiJava • The mpiJava version uses MPI_Ssend to send the array portions to the remote processes and MPI_Recv to receive them. • When complete, each process uses MPI_Ssend to send back its portion and the root uses MPI_Recv to receive each one. M.S. Thesis Defense
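
A hedged sketch of this scatter/gather pattern in mpiJava: the root sends each process its slice of the byte array and later collects the processed slices. The tags are invented and the encrypt/decrypt step itself is omitted; the 3,000,000-byte size comes from the slide above.

```java
import mpi.MPI;

public class CryptScatterGatherSketch {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();

        int total = 3000000;
        int chunk = total / size;
        byte[] block = new byte[chunk];

        if (rank == 0) {
            byte[] data = new byte[total];
            for (int dst = 1; dst < size; dst++) {
                MPI.COMM_WORLD.Ssend(data, dst * chunk, chunk, MPI.BYTE, dst, 0);
            }
            System.arraycopy(data, 0, block, 0, chunk);   // root keeps block 0
        } else {
            MPI.COMM_WORLD.Recv(block, 0, chunk, MPI.BYTE, 0, 0);
        }

        // ... each process encrypts and decrypts its block here ...

        if (rank == 0) {
            byte[] result = new byte[total];
            System.arraycopy(block, 0, result, 0, chunk);
            for (int src = 1; src < size; src++) {
                MPI.COMM_WORLD.Recv(result, src * chunk, chunk, MPI.BYTE, src, 1);
            }
        } else {
            MPI.COMM_WORLD.Ssend(block, 0, chunk, MPI.BYTE, 0, 1);
        }
        MPI.Finalize();
    }
}
```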

  47. Crypt: Results M.S. Thesis Defense

  48. Crypt: Conclusions • Results are similar on both clusters. • There is a slight performance problem with 4 and 8 nodes in the DSM version. • This can be attributed to a barrier in the DSM version that causes all threads to block before computing, while the MPI version does not block. M.S. Thesis Defense

  49. Sparse Matrix Multiplication • A 50,000 x 50,000 unstructured matrix stored in compressed-row format is multiplied over 200 iterations. • Only the final result is communicated as each node has its own portion of data and initial distribution is not timed. M.S. Thesis Defense

  50. Sparse: DSM Modifications • This benchmark originally produced excessive network traffic through remote object access. • The modifications involved removing the object access during the multiplication loop and using arraycopy to distribute the final result to the root thread. M.S. Thesis Defense
