Integrating New Capabilities into NetPIPE

Dave Turner, Adam Oline, Xuehua Chen, and Troy Benjegerdes
Scalable Computing Laboratory of Ames Laboratory
This work was funded by the MICS office of the US Department of Energy

NetPIPE: Network Protocol Independent Performance Evaluator
[Diagram of the communication layers NetPIPE can test; labels recovered from the figure]
2-sided protocols
• MPI: MPICH, LAM/MPI, MPI/Pro, MP_Lite
• PVM
• TCGMSG: runs on ARMCI or MPI
Native software layers
• TCP: workstations, PCs, clusters
• GM: Myrinet cards
• InfiniBand: Mellanox VAPI
• ARMCI: TCP, GM, VIA, Quadrics, LAPI
• LAPI: IBM SP
• SHMEM: Cray T3E, SGI systems
1-sided protocols
• MPI-2 1-sided: MPI_Put or MPI_Get
• SHMEM 1-sided puts and gets: SHMEM & GPSHMEM
• ARMCI
Internal systems
• memcpy
+ Basic send/recv with options to guarantee pre-posting or use MPI_ANY_SOURCE.
+ Option to measure performance without cache effects.
+ One-sided communications using either Get or Put, with or without fence calls.
+ Measure performance or do an integrity test.
http://www.scl.ameslab.gov/Projects/NetPIPE/

The NetPIPE utility
• NetPIPE does a series of ping-pong tests between two nodes.
• Message sizes are chosen at regular intervals, and with slight perturbations, to fully test the communication system for idiosyncrasies.
• Latencies reported represent half the ping-pong time for messages smaller than 64 Bytes.
Some typical uses
• Measuring the overhead of message-passing protocols.
• Help in tuning the optimization parameters of message-passing libraries.
• Optimizing driver and OS parameters (socket buffer sizes, etc.).
• Identifying dropouts in networking hardware and drivers.
What is not measured
• NetPIPE cannot measure the load on the CPU yet.
• The effects from the different methods for maintaining message progress.
• Scalability with system size.
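The ping-pong measurement described above can be sketched in a few lines of Python. This is an illustrative sketch, not NetPIPE's own code: the doubling size schedule, the perturbation of 3 bytes, the repeat count, and the use of a local socket pair with an echo thread in place of two networked nodes are all assumptions made for the example. The names `message_sizes`, `recv_exact`, `echo_peer`, and `ping_pong` are hypothetical.

```python
import socket
import threading
import time


def message_sizes(max_bytes, perturbation=3):
    """Sizes at regular (here: doubling) intervals, each with +/- perturbation neighbors."""
    sizes = []
    size = 1
    while size <= max_bytes:
        for candidate in (size - perturbation, size, size + perturbation):
            if candidate > 0 and candidate not in sizes:
                sizes.append(candidate)
        size *= 2
    return sorted(sizes)


def recv_exact(sock, nbytes):
    """Receive exactly nbytes from a stream socket."""
    buf = bytearray(nbytes)
    view = memoryview(buf)
    received = 0
    while received < nbytes:
        n = sock.recv_into(view[received:])
        if n == 0:
            raise ConnectionError("peer closed the connection")
        received += n
    return bytes(buf)


def echo_peer(sock, sizes, repeats):
    """The 'pong' side: bounce every message straight back."""
    for size in sizes:
        for _ in range(repeats):
            sock.sendall(recv_exact(sock, size))


def ping_pong(sizes, repeats=50):
    """Return (size, one_way_seconds, mbps) for each message size."""
    a, b = socket.socketpair()  # assumption: local stand-in for two nodes
    peer = threading.Thread(target=echo_peer, args=(b, sizes, repeats))
    peer.start()
    results = []
    try:
        for size in sizes:
            payload = bytes(size)
            start = time.perf_counter()
            for _ in range(repeats):
                a.sendall(payload)
                recv_exact(a, size)
            round_trip = (time.perf_counter() - start) / repeats
            one_way = round_trip / 2  # half the ping-pong time
            results.append((size, one_way, size * 8 / one_way / 1e6))
    finally:
        a.close()
        peer.join()
        b.close()
    return results


if __name__ == "__main__":
    results = ping_pong(message_sizes(4096))
    small = [one_way for size, one_way, _ in results if size < 64]
    print(f"latency ~ {min(small) * 1e6:.1f} us")
    for size, one_way, mbps in results:
        print(f"{size:6d} bytes  {mbps:10.1f} Mbps")
```

The perturbed sizes are what let a test of this kind expose behavior that only appears slightly below or above a buffer or packet boundary.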

Recent additions to NetPIPE
• Can do an integrity test instead of measuring performance.
• Streaming mode measures performance in 1 direction only.
  • Must reset sockets to avoid effects from a collapsing window size.
• A bi-directional ping-pong mode has been added (-2).
• One-sided Get and Put calls can be measured (MPI or SHMEM).
  • Can choose whether to use an intervening fence call (MPI_Win_fence in MPI-2) to synchronize.
• Messages can be bounced between the same buffers (default mode), or they can be started from a different area of memory each time.
  • There are lots of cache effects in SMP message-passing.
  • InfiniBand can show similar effects since memory must be registered with the card.
[Figure: message buffers on Process 0 and Process 1, numbered 0 to 3]
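The difference between bouncing messages in the same buffer and starting each message from a different area of memory can be sketched as a choice of buffer offsets, combined here with a simple integrity check. This is a minimal sketch under stated assumptions: the function names, the wrap-around walk through a memory pool, and the 0..255 test pattern are hypothetical, and a local copy stands in for the network transfer.

```python
def make_pattern(nbytes):
    """Build a repeating 0..255 byte pattern so that corruption is detectable."""
    return bytes(i % 256 for i in range(nbytes))


def buffer_offsets(pool_bytes, msg_bytes, count, reuse=True):
    """Start offset of each message buffer inside a memory pool.

    reuse=True  -> always the same buffer (cache-friendly default mode)
    reuse=False -> walk through the pool so each message starts elsewhere
    """
    if msg_bytes > pool_bytes:
        raise ValueError("message does not fit in the pool")
    if reuse:
        return [0] * count
    offsets = []
    offset = 0
    for _ in range(count):
        if offset + msg_bytes > pool_bytes:
            offset = 0  # wrap around when the pool is exhausted
        offsets.append(offset)
        offset += msg_bytes
    return offsets


def run_integrity_pass(pool_bytes, msg_bytes, count, reuse):
    """Fill each buffer with the pattern, 'transfer' it, and verify the result."""
    pool = bytearray(pool_bytes)
    pattern = make_pattern(msg_bytes)
    for offset in buffer_offsets(pool_bytes, msg_bytes, count, reuse):
        pool[offset:offset + msg_bytes] = pattern
        received = bytes(pool[offset:offset + msg_bytes])  # stand-in for send/recv
        if received != pattern:
            return False
    return True


if __name__ == "__main__":
    print(buffer_offsets(16, 4, 5, reuse=False))
    print(run_integrity_pass(1 << 20, 4096, 1000, reuse=False))
```

To defeat caching in a real measurement, the pool would need to be much larger than the cache so that a buffer is evicted before it is reused.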

Current projects
• Overlapping pair-wise ping-pong tests.
  • Must consider synchronization if not using bi-directional communications.
[Figure: nodes n0, n1, n2, n3 connected through an Ethernet switch; line speed vs end-point limited]
• Investigate other methods for testing the global network.
  • Evaluate the full range from simultaneous nearest neighbor communications to all-to-all.

Performance on Mellanox InfiniBand cards
A new NetPIPE module allows us to measure the raw performance across InfiniBand hardware (RDMA and Send/Recv).
Burst mode preposts all receives to duplicate the Mellanox test.
The no-cache performance is much lower when the memory has to be registered with the card.
An MP_Lite InfiniBand module will be incorporated into LAM/MPI.
[Graph label: MVAPICH 0.9.1]

10 Gigabit Ethernet
Intel 10 Gigabit Ethernet cards
133 MHz PCI-X bus
Single mode fiber
Intel ixgb driver
Can only achieve 2 Gbps now.
Latency is 75 us.
Streaming mode delivers up to 3 Gbps.
Much more development work is needed.

Channel-bonding Gigabit Ethernet for better communications between nodes
Channel-bonding uses 2 or more Gigabit Ethernet cards per PC to increase the communication rate between nodes in a cluster.
GigE cards cost ~$40 each. 24-port switches cost ~$1400. → $100 / computer
This is much more cost effective for PC clusters than using more expensive networking hardware, and may deliver similar performance.
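The $100 per computer figure follows from the two prices quoted on the slide. The short calculation below is a sketch that assumes each added channel needs one GigE card plus one port on a fully populated 24-port switch; the function name is hypothetical.

```python
def added_cost_per_node(card_cost, switch_cost, switch_ports):
    """Cost of one extra channel per node: one card plus a share of the switch."""
    return card_cost + switch_cost / switch_ports


cost = added_cost_per_node(card_cost=40, switch_cost=1400, switch_ports=24)
print(f"${cost:.2f} per computer")  # prints "$98.33 per computer", i.e. roughly $100
```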

Performance for channel-bonded Gigabit Ethernet
GigE can deliver 900 Mbps with latencies of 25-62 us for PCs with 64-bit / 66 MHz PCI slots.
Channel-bonding 2 GigE cards / PC using MP_Lite doubles the performance for large messages. Adding a 3rd card does not help much.
Channel-bonding 2 GigE cards / PC using Linux kernel level bonding actually results in poorer performance.
The same tricks that make channel-bonding successful in MP_Lite should make Linux kernel bonding work even better. Any message-passing system could then make use of channel-bonding on Linux systems.
[Graph: Channel-bonding multiple GigE cards using MP_Lite and Linux kernel bonding]

Channel-bonding in MP_Lite
[Diagram labels: user space: application on node 0, MP_Lite, large socket buffers (a, b); kernel space: TCP/IP stack, dev_q_xmit, device queue; device driver: DMA, GigE card; one such path for each of the two streams]
Flow control may stop a given stream at several places.
With MP_Lite channel-bonding, each stream is independent of the others.
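The idea of keeping each stream independent can be illustrated by striping one message across separate sockets, each with its own sender and receiver. This is a sketch of the general technique only, not MP_Lite's implementation: local socket pairs stand in for TCP connections over separate GigE cards, and the names `split_message` and `bonded_transfer` are hypothetical.

```python
import socket
import threading


def recv_exact(sock, nbytes):
    """Receive exactly nbytes from a stream socket."""
    buf = bytearray(nbytes)
    view = memoryview(buf)
    received = 0
    while received < nbytes:
        n = sock.recv_into(view[received:])
        if n == 0:
            raise ConnectionError("peer closed the connection")
        received += n
    return bytes(buf)


def split_message(message, channels):
    """Split a message into `channels` nearly equal contiguous parts."""
    base, extra = divmod(len(message), channels)
    parts, start = [], 0
    for i in range(channels):
        end = start + base + (1 if i < extra else 0)
        parts.append(message[start:end])
        start = end
    return parts


def bonded_transfer(message, channels=2):
    """Send each part over its own socket and reassemble on the receiving side."""
    pairs = [socket.socketpair() for _ in range(channels)]  # one stream per "card"
    parts = split_message(message, channels)
    received = [b""] * channels

    def send(i):
        pairs[i][0].sendall(parts[i])

    def receive(i):
        received[i] = recv_exact(pairs[i][1], len(parts[i]))

    # Every stream has its own sender and receiver thread, so a stall on one
    # stream does not block progress on the others.
    threads = [threading.Thread(target=fn, args=(i,))
               for i in range(channels) for fn in (send, receive)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for tx, rx in pairs:
        tx.close()
        rx.close()
    return b"".join(received)


if __name__ == "__main__":
    data = bytes(range(256)) * 4096
    print(bonded_transfer(data, channels=2) == data)
```

Because each part has its own socket buffer, a stream that is throttled by flow control holds back only its own share of the message.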

Linux kernel channel-bonding
[Diagram labels: user space: application on node 0, one large socket buffer; kernel space: TCP/IP stack, bonding.c, dqx calls, two device queues; device driver: DMA to each of the two GigE cards]
A full device queue will stop the flow at bonding.c to both device queues.
Flow control on the destination node may stop the flow out of the socket buffer.
In both of these cases, problems with one stream can affect both streams.

Comparison of high-speed interconnects
InfiniBand can deliver 4500-6500 Mbps at a 7.5 us latency.
Atoll delivers 1890 Mbps with a 4.7 us latency.
SCI delivers 1840 Mbps with only a 4.2 us latency.
Myrinet performance reaches 1820 Mbps with an 8 us latency.
Channel-bonded GigE offers 1800 Mbps for very large messages.
Gigabit Ethernet delivers 900 Mbps with a 25-62 us latency.
10 GigE only delivers 2 Gbps with a 75 us latency.
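To see how latency and peak rate combine for a given message size, a rough estimate can be made with a simple model in which transfer time is latency plus size divided by peak rate. The model is an assumption added for illustration and is not part of NetPIPE's measurements. The figures are taken from the slide, using the lower end of each quoted range for throughput and the best-case latency for Gigabit Ethernet; channel-bonded GigE is omitted because no latency is quoted for it.

```python
# (name, latency in microseconds, peak rate in Mbps), figures from the slide
INTERCONNECTS = [
    ("InfiniBand", 7.5, 4500),
    ("Atoll", 4.7, 1890),
    ("SCI", 4.2, 1840),
    ("Myrinet", 8.0, 1820),
    ("Gigabit Ethernet", 25.0, 900),
    ("10 GigE", 75.0, 2000),
]


def effective_mbps(msg_bytes, latency_us, peak_mbps):
    """Estimated throughput under the assumed model: time = latency + size / peak."""
    bits = msg_bytes * 8
    transfer_us = bits / peak_mbps  # 1 Mbps equals 1 bit per microsecond
    return bits / (latency_us + transfer_us)


if __name__ == "__main__":
    for name, latency_us, peak_mbps in INTERCONNECTS:
        small = effective_mbps(1024, latency_us, peak_mbps)
        large = effective_mbps(1024 * 1024, latency_us, peak_mbps)
        print(f"{name:18s} 1 KB: {small:7.1f} Mbps   1 MB: {large:7.1f} Mbps")
```

Under this model the low-latency interconnects keep much more of their peak rate for small messages, while for large messages the peak rate dominates.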

Conclusions
• NetPIPE provides a consistent set of analytical tools in the same flexible framework to many message-passing and native communication layers.
• New modules have been developed.
  • 1-sided MPI and SHMEM
  • GM, InfiniBand using the Mellanox VAPI, ARMCI, LAPI
  • Internal tests like memcpy
• New modes have been incorporated into NetPIPE.
  • Streaming and bi-directional modes.
  • Testing without cache effects.
  • The ability to test integrity instead of performance.

Current projects
• Developing new modules.
  • ATOLL
  • IBM Blue Gene/L
  • I/O performance
• Need to be able to measure CPU load during communications.
• Expanding NetPIPE to do multiple pair-wise communications.
  • Can measure the backplane performance on switches.
  • Compare the line speed to end-point limited performance.
• Working toward measuring more of the global properties of a network.
  • The network topology will need to be considered.

Contact information
Dave Turner - turner@ameslab.gov
http://www.scl.ameslab.gov/Projects/MP_Lite/
http://www.scl.ameslab.gov/Projects/NetPIPE/

One-sided Puts between two Linux PCs
• MP_Lite is SIGIO based, so MPI_Put() and MPI_Get() finish without a fence.
• LAM/MPI has no message progress, so a fence is required.
• ARMCI uses a polling method, and therefore does not require a fence.
• An MPI-2 implementation of MPICH is under development.
• An MPI-2 implementation of MPI/Pro is under development.
Test hardware: Netgear GA620 fiber GigE, 32/64-bit 33/66 MHz, AceNIC driver

The MP_Lite message-passing library
• A light-weight MPI implementation
• Highly efficient for the architectures supported
• Designed to be very user-friendly
• Ideal for performing message-passing research
http://www.scl.ameslab.gov/Projects/MP_Lite/

A NetPIPE example: Performance on a Cray T3E
Raw SHMEM delivers:
• 2600 Mbps
• 2-3 us latency
Cray MPI originally delivered:
• 1300 Mbps
• 20 us latency
MP_Lite delivers:
• 2600 Mbps
• 9-10 us latency
New Cray MPI delivers:
• 2400 Mbps
• 20 us latency
The tops of the spikes are where the message size is divisible by 8 Bytes.
