GridMPI: Grid Enabled MPI
Yutaka Ishikawa, University of Tokyo and AIST
Motivation
• MPI has been widely used to program parallel applications
• Users want to run such applications over the Grid environment without any modification of the programs
• However, the performance of existing MPI implementations does not scale on the Grid environment
[Figure: a single (monolithic) MPI application running over the Grid environment, spanning computing resource sites A and B connected by a wide-area network]
Motivation
• Focus on a metropolitan-area, high-bandwidth environment: 10 Gbps, 500 miles (less than 10 ms one-way latency)
• Internet bandwidth in the Grid vs. interconnect bandwidth within a cluster: 10 Gbps vs. 1 Gbps, or 100 Gbps vs. 10 Gbps
Motivation
• Focus on a metropolitan-area, high-bandwidth environment: 10 Gbps, 500 miles (less than 10 ms one-way latency)
• We have already demonstrated, using an emulated WAN environment, that the NAS Parallel Benchmark programs scale when the one-way latency is smaller than 10 ms
Motohiko Matsuda, Yutaka Ishikawa, and Tomohiro Kudoh, "Evaluation of MPI Implementations on Grid-connected Clusters using an Emulated WAN Environment," CCGRID2003, 2003.
High Performance Communication Facilities for MPI on Long and Fat Networks
Issues:
• TCP vs. MPI communication patterns
• Network topology: latency and bandwidth
• Interoperability: most MPI library implementations use their own network protocol
• Fault tolerance and migration: to survive a site failure
• Security
[Figure: repeating a 10 MB data transfer at two-second intervals; during one 10 MB transfer, the TCP slow-start phase (window size set to 1) is visible, and the observed silences result from bursty traffic]
High Performance Communication Facilities for MPI on Long and Fat Networks (continued; same issues as above)
• [Figure: one-to-one communication started at time 0, after all-to-all communication]
• [Figure: clusters connected over the Internet, illustrating the network topology, latency, and bandwidth issue]
• [Figure: four clusters, each using a different vendor's MPI library (Vendor A, B, C, D), connected over the Internet, illustrating the interoperability issue]
GridMPI Features
[Architecture diagram: the MPI API sits on IMPI and the LAC (Collectives) layer; a Request Layer provides the P2P and RPIM interfaces (ssh, rsh, SCore, Globus, vendor MPI); transports include TCP/IP, PMv2, MX, O2G, and IMPI/TCP over the Internet, with YAMPII and vendor MPIs underneath]
• MPI-2 implementation
• YAMPII, developed at the University of Tokyo, is used as the core implementation
• Intra-cluster communication by YAMPII (TCP/IP, SCore)
• Inter-cluster communication by the IMPI (Interoperable MPI) protocol, extended for the Grid: MPI-2 and new collective protocols
• Integration of vendor MPIs: IBM Regatta MPI, MPICH2, Solaris MPI, Fujitsu MPI, (NEC SX MPI)
• Incremental checkpointing
• High-performance TCP/IP implementation
• LAC: Latency Aware Collectives; bcast/allreduce algorithms have been developed (to appear at the Cluster 2006 conference)
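Because GridMPI exposes the standard MPI API, application code needs no Grid-specific changes. Below is a minimal sketch of an ordinary MPI program in C (not taken from the GridMPI distribution); under GridMPI the same unmodified code would span both clusters, with intra-cluster traffic handled by YAMPII and inter-cluster traffic by IMPI.

```c
/* Minimal standard MPI program. Nothing here is GridMPI-specific: whether
 * the peer processes are in the same cluster or across the WAN is hidden
 * from the application by the MPI library. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process contributes one value; MPI_Allreduce combines them. */
    local = (double)rank;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %f\n", size, global);

    MPI_Finalize();
    return 0;
}
```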
High-performance Communication Mechanisms in the Long and Fat Network
• Modifications of TCP behavior
M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, and Y. Ishikawa, "TCP Adaptation for MPI on Long-and-Fat Networks," IEEE Cluster 2005, 2005.
• Precise software pacing
R. Takano, T. Kudoh, Y. Kodama, M. Matsuda, H. Tezuka, and Y. Ishikawa, "Design and Evaluation of Precise Software Pacing Mechanisms for Fast Long-Distance Networks," PFLDnet 2005, 2005.
• Collective communication algorithms with respect to network latency and bandwidth (see the sketch below)
M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, and Y. Ishikawa, "Efficient MPI Collective Operations for Clusters in Long-and-Fast Networks," to appear at IEEE Cluster 2006.
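The two-level structure behind such latency-aware collectives can be illustrated with standard MPI calls. The sketch below only illustrates the general idea, assuming exactly two sites; it is not the bcast/allreduce algorithms published by the authors, and cluster_id_of() and two_level_bcast() are hypothetical names.

```c
/* Illustrative two-level broadcast from global rank 0: one transfer across
 * the WAN among cluster leaders, then a broadcast inside each cluster over
 * the fast local network. NOT the GridMPI/LAC algorithm. */
#include <mpi.h>

/* Hypothetical mapping for illustration: assume the first half of the ranks
 * runs at site A and the second half at site B. */
static int cluster_id_of(int world_rank, int world_size)
{
    return (world_rank < world_size / 2) ? 0 : 1;
}

int two_level_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, size, cluster, local_rank;
    MPI_Comm local_comm, leader_comm;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    cluster = cluster_id_of(rank, size);

    /* Processes in the same cluster share a communicator. */
    MPI_Comm_split(comm, cluster, rank, &local_comm);
    MPI_Comm_rank(local_comm, &local_rank);

    /* The lowest-ranked process of each cluster becomes its leader. */
    MPI_Comm_split(comm, local_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leader_comm);

    /* Step 1: one broadcast among cluster leaders across the WAN. */
    if (leader_comm != MPI_COMM_NULL)
        MPI_Bcast(buf, count, type, 0, leader_comm);

    /* Step 2: broadcast inside each cluster. */
    MPI_Bcast(buf, count, type, 0, local_comm);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&local_comm);
    return MPI_SUCCESS;
}
```

The point of such a structure is that the number of messages crossing the high-latency link grows with the number of sites rather than the number of processes.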
Evaluation
• It is almost impossible to reproduce the communication behavior of a wide-area network
• A WAN emulator, GtrcNET-1, is used to examine implementations, protocols, communication algorithms, etc. under controlled conditions
GtrcNET-1
• Developed at AIST (http://www.gtrc.aist.go.jp/gnet/)
• FPGA (XC2V6000), four 1000Base-SX ports, one USB port for the host PC
• Injection of delay, jitter, errors, ...; traffic monitoring and frame capture
Experimental Environment
[Figure: two clusters of 8 PCs each (Node0 to Node7 and Node8 to Node15), each connected to a Catalyst 3750 switch; the two switches are linked through the GtrcNET-1 WAN emulator]
• Bandwidth: 1 Gbps, delay: 0 ms to 10 ms
• CPU: Pentium 4/2.4 GHz, Memory: DDR400 512 MB
• NIC: Intel PRO/1000 (82547EI)
• OS: Linux-2.6.9-1.6 (Fedora Core 2)
• Socket buffer size: 20 MB (see the sketch below)
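Large socket buffers such as the 20 MB used here are normally requested per connection with setsockopt(); the snippet below is a minimal sketch of how an experiment might configure a TCP socket, not code taken from GridMPI. On Linux, the effective sizes are additionally capped by the net.core.rmem_max and net.core.wmem_max limits, which must also be raised.

```c
/* Sketch: request 20 MB send/receive buffers on a TCP socket, matching the
 * socket buffer size used in the experiment. Not GridMPI source code. */
#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>

int make_fat_pipe_socket(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int size = 20 * 1024 * 1024;   /* 20 MB */

    if (fd < 0)
        return -1;
    /* The kernel may silently clamp these requests to its configured maxima. */
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0)
        perror("SO_SNDBUF");
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0)
        perror("SO_RCVBUF");
    return fd;
}
```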
GridMPI vs. MPICH-G2 (1/4): FT (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes [Figure: relative performance vs. one-way delay (msec)]
GridMPI vs. MPICH-G2 (2/4): IS (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes [Figure: relative performance vs. one-way delay (msec)]
GridMPI vs. MPICH-G2 (3/4): LU (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes [Figure: relative performance vs. one-way delay (msec)]
GridMPI vs. MPICH-G2 (4/4): NAS Parallel Benchmarks 3.2, Class B, on 8 x 8 processes; no parameters tuned in GridMPI [Figure: relative performance vs. one-way delay (msec)]
GridMPI on Actual Network
[Figure: an 8-node Pentium 4/2.4 GHz cluster connected by 1G Ethernet at Tsukuba and an 8-node Pentium 4/2.8 GHz cluster connected by 1G Ethernet at Akihabara, linked by the JGN2 network: 10 Gbps bandwidth, 1.5 msec RTT, 60 km (40 mi.); relative performance shown per benchmark]
• NAS Parallel Benchmarks run on the 8-node (2.4 GHz) cluster at Tsukuba and the 8-node (2.8 GHz) cluster at Akihabara, 16 nodes in total
• The performance is compared with the results on a single 16-node (2.4 GHz) cluster and a single 16-node (2.8 GHz) cluster
GridMPI Now and Future
• GridMPI version 1.0 has been released
• Conformance tests: MPICH Test Suite 0/142 (fails/tests), Intel Test Suite 0/493 (fails/tests)
• GridMPI is integrated into the NaReGI package
• Extension of the IMPI specification
• Refine the current extensions
• Collective communication and checkpoint algorithms cannot be fixed in the specification; the current idea is to specify the mechanism of dynamic algorithm selection, dynamic algorithm shipment and loading, and a virtual machine to implement the algorithms
Dynamic Algorithm Shipment
• A collective communication algorithm is implemented on the virtual machine
• The code is shipped to all MPI processes
• The MPI runtime library interprets the algorithm to perform the inter-cluster collective communication (see the sketch below)
[Figure: the shipped algorithm distributed to clusters connected over the Internet]
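To make the idea concrete, here is a hedged sketch of how such a runtime interpreter might look: the collective is described as a small list of instructions that every process executes with ordinary point-to-point calls. The instruction set, encoding, and names (step_t, run_shipped_collective) are invented for illustration; they are not the IMPI extension being proposed.

```c
/* Hypothetical illustration of algorithm shipment: each process receives a
 * small "program" and interprets it with ordinary MPI point-to-point calls.
 * The instruction set below is invented for illustration only. */
#include <mpi.h>

typedef enum { OP_SEND, OP_RECV, OP_DONE } op_t;

typedef struct {
    op_t op;    /* what to do in this step */
    int  peer;  /* rank to send to or receive from */
} step_t;

/* Every process runs the same interpreter over its own shipped program. */
static void run_shipped_collective(const step_t *prog, void *buf, int count,
                                   MPI_Datatype type, MPI_Comm comm)
{
    for (const step_t *s = prog; s->op != OP_DONE; s++) {
        if (s->op == OP_SEND)
            MPI_Send(buf, count, type, s->peer, 0, comm);
        else
            MPI_Recv(buf, count, type, s->peer, 0, comm, MPI_STATUS_IGNORE);
    }
}
```

For example, a two-process broadcast from rank 0 could ship the program {SEND 1, DONE} to rank 0 and {RECV 0, DONE} to rank 1; a real design would also have to define how programs are encoded, verified, selected dynamically, and loaded into the runtime.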
Concluding Remarks
• Our main concern is the metropolitan-area network: a high-bandwidth environment, 10 Gbps, 500 miles (less than 10 ms one-way latency)
• Overseas (around 100 milliseconds): applications must be aware of the communication latency; data movement using MPI-IO?
• Collaborations: we would like to ask people who are interested in this work for collaboration