VolpexMPI: Performance Evaluation of VolpexMPI over InfiniBand


Presentation Transcript


  1. VolpexMPI: Performance Evaluation of VolpexMPI over InfiniBand Stephen Herbein Mentors: Jaspal Subhlok & Edgar Gabriel

  2. Volpex: Parallel Execution on Volatile Nodes • Fault tolerance: why? • Node failures on machines with thousands of processors (large clusters) • Node and communication failures in distributed environments (volunteer environments) • Volpex Project Goals: • Execution on failure-prone platforms • Key problem: high failure rates AND communicating parallel programs

  3. VolpexMPI • MPI library for execution of parallel applications on volatile nodes • Key features: • Controlled redundancy: each MPI process can have multiple replicas • Receiver-based direct communication between processes • Distributed sender logging to support slow processes (sketched below)
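The slide gives these features only at a high level; a minimal C sketch of receiver-driven messaging backed by a sender-side log might look like the following. All names in it (send_log_entry, log_append, serve_request, LOG_CAPACITY) are hypothetical, not taken from the VolpexMPI sources.

/* Hypothetical sketch of receiver-driven messaging with a bounded
 * sender-side log; illustrative only, not the VolpexMPI internals. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int    dest;     /* logical MPI rank of the receiver        */
    int    tag;      /* MPI tag                                  */
    long   seq;      /* per-(dest, tag) sequence number          */
    size_t len;      /* message size in bytes                    */
    void  *payload;  /* copy of the message body                 */
} send_log_entry;

#define LOG_CAPACITY 1024
static send_log_entry log_buf[LOG_CAPACITY];
static long log_head = 0;

/* Sender side: append an outgoing message to the local log so that any
 * replica of the receiver can pull it later by (dest, tag, seq). */
static void log_append(int dest, int tag, long seq, const void *buf, size_t len)
{
    send_log_entry *e = &log_buf[log_head++ % LOG_CAPACITY];
    free(e->payload);                       /* recycle the oldest slot */
    e->dest = dest; e->tag = tag; e->seq = seq; e->len = len;
    e->payload = malloc(len);
    memcpy(e->payload, buf, len);
}

/* Sender side: answer a pull request from a slow or restarted receiver
 * replica; returns NULL if the entry has already been overwritten. */
static const send_log_entry *serve_request(int dest, int tag, long seq)
{
    for (long i = 0; i < LOG_CAPACITY; i++)
        if (log_buf[i].payload && log_buf[i].dest == dest &&
            log_buf[i].tag == tag && log_buf[i].seq == seq)
            return &log_buf[i];
    return NULL;
}

int main(void)
{
    int msg = 42;
    log_append(1, 0, 7, &msg, sizeof msg);
    const send_log_entry *e = serve_request(1, 0, 7);
    printf("replay for seq 7: %s\n", e ? "found" : "already overwritten");
    return 0;
}

The point of the log is that a slow or restarted receiver replica can re-request an older message directly from the sender instead of forcing senders to block on the slowest replica.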

  4. Managing Replicated MPI processes • Only one replica of each process needs to stay alive for the program to execute successfully (see the fallback sketch below)
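The mechanism behind that claim is not shown on the slide; a minimal sketch of replica fallback on the receive path, with try_pull_from() standing in for the real transport call, could look like this:

/* Hypothetical sketch: the receiver only needs one live replica of each
 * logical sender, so it tries the replicas in turn until one answers. */
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the real transport call: pull a message from one physical
 * replica; return 0 on success, nonzero if it is unreachable.  Here it
 * simply pretends that replica 0 has failed. */
static int try_pull_from(int replica_id, void *buf, size_t len)
{
    (void)buf; (void)len;
    return (replica_id == 0) ? -1 : 0;
}

/* Succeeds as long as at least one replica of the logical rank is alive. */
static int recv_from_logical_rank(const int replicas[], int nreplicas,
                                  void *buf, size_t len)
{
    for (int i = 0; i < nreplicas; i++)
        if (try_pull_from(replicas[i], buf, len) == 0)
            return 0;
    return -1;   /* every replica of this rank has failed */
}

int main(void)
{
    int replicas[] = { 0, 1 };   /* physical ids of one logical rank */
    char buf[4];
    int rc = recv_from_logical_rank(replicas, 2, buf, sizeof buf);
    printf("receive %s\n", rc == 0 ? "succeeded" : "failed");
    return 0;
}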

  5. Bandwidth comparison • 4-byte latency over Gigabit Ethernet: • Open MPI v1.4.1: ~50 µs • VolpexMPI: ~1.8 ms
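The slide reports only the results; a 4-byte ping-pong microbenchmark of the kind typically used to measure such latencies looks roughly like the following (illustrative, not necessarily the exact benchmark behind these numbers):

/* Standard 4-byte ping-pong latency microbenchmark for two MPI processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 10000;
    char buf[4] = { 0 };
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, 4, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 4, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 4, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 4, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency is half of the round-trip time */
        printf("4-byte latency: %.2f us\n", (t1 - t0) / iters / 2 * 1e6);

    MPI_Finalize();
    return 0;
}

The same source can be compiled against Open MPI or VolpexMPI and run over the interconnect of interest, which is the usual way to compare two MPI libraries on the same hardware.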

  6. NAS Parallel Benchmarks • VolpexMPI execution times are comparable to the reference Open MPI execution times

  7. Overhead of redundancy and processor failures • Performance impact of executing with replicas (left side) • Performance impact of processor failures (right side) • Both experiments run with 16 processes

  8. Use in High Performance Clusters • Not limited to volunteer computing • Tested on a small cluster using Ethernet • Not yet tested on a large-scale cluster with high-performance communication such as InfiniBand • Goal: evaluate and validate the use of VolpexMPI on high-performance clusters • Specifically, clusters that use InfiniBand

  9. What is InfiniBand? • High-speed fiber interconnect • Associated protocols designed to remove the overhead associated with Ethernet and IP • Leads to higher bandwidth, lower latency, and lower CPU usage • Widespread use in HPC • Most-used interconnect in the TOP500 (42%)

  10. How to Run VolpexMPI over InfiniBand • Two ways to use InfiniBand from the existing socket layer: • IPoIB • Sockets Direct Protocol (SDP) • IPoIB: high bandwidth, but high latency • SDP: higher bandwidth and low latency; bypasses the TCP stack (see the sketch below)
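Since VolpexMPI communicates through plain sockets (the summary slide mentions its underlying socket library), moving from TCP to SDP mainly means changing the address family when a socket is created. Below is a minimal sketch, assuming OFED's AF_INET_SDP constant (commonly 27, worth checking against the local headers) and an example peer address and port; unmodified binaries can alternatively be redirected to SDP transparently with LD_PRELOAD=libsdp.so.

/* Minimal sketch: open an SDP socket instead of a TCP socket.  Only the
 * address family in socket() changes; connect(), send() and recv() are
 * otherwise used exactly as with TCP over IPoIB. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#ifndef AF_INET_SDP
#define AF_INET_SDP 27   /* assumption: OFED's SDP address family */
#endif

int main(void)
{
    int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket(AF_INET_SDP)"); return 1; }

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof peer);
    peer.sin_family = AF_INET;        /* IPv4 addressing; some SDP stacks
                                         also accept AF_INET_SDP here     */
    peer.sin_port   = htons(5000);                        /* example port */
    inet_pton(AF_INET, "192.168.1.10", &peer.sin_addr);   /* example peer */

    if (connect(fd, (struct sockaddr *)&peer, sizeof peer) < 0)
        perror("connect over SDP");
    else
        printf("connected over SDP\n");

    close(fd);
    return 0;
}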

  11. Current Measurements

  12. Summary • Status • Currently implementing SDP in the underlying socket library of VolpexMPI • Challenges • Parallel programs are notoriously hard to debug • Limited prior experience with network and socket programming • Goals • Re-run bandwidth and latency tests using SDP • Re-run the NAS benchmarks using SDP • Evaluate and validate the use of VolpexMPI on high-performance clusters
