
Low Overhead Fault Tolerant Networking (in Myrinet)


Presentation Transcript


  1. Low Overhead Fault Tolerant Networking (in Myrinet) Architecture and Real-Time Systems (ARTS) Lab. Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003

  2. Motivation • An increasing use of COTS components in systems has been motivated by the need to • Reduce design and maintenance costs • Reduce software complexity • The emergence of low-cost, high-performance COTS networking solutions • e.g., Myrinet, SCI, Fibre Channel • The increasing complexity of network interfaces has renewed concerns about their reliability • The amount of silicon used in them has increased tremendously

  3. The Basic Question How can we incorporate fault tolerance into a COTS network technology without greatly compromising its performance?

  4. Microprocessor-based Networks • Most modern network technologies have processors in their interface cards that help achieve superior network performance • Many of these technologies allow changes to the program running on the network processor • Such programmable interfaces offer numerous benefits: • Developing different fault tolerance techniques • Validating fault recovery using fault injection • Experimenting with different communication protocols • We use Myrinet as the platform for our study

  5. Myrinet • Myrinet is a cost-effective, high-performance (2.2 Gb/s) packet-switching technology • At its core is a powerful RISC processor • It is scalable to thousands of nodes • Low-latency communication (8 µs) is achieved through direct interaction with the network interface (“OS bypass”) • Flow control, error control and simple “heartbeat” mechanisms are incorporated in hardware • Link and routing specifications are public and standard • Myrinet support software is supplied open source

  6. Myrinet Configuration [Block diagram: the host node (host processor, system memory, system bridge) connects over the I/O bus to the LANai 9 card, which contains the RISC processor, SRAM, interval timers 0-2, the PCI bridge and PCIDMA, a DMA engine, the host interface, the packet interface, and SAN/LAN conversion logic]

  7. Hardware & Software [Layered diagram: the application, middleware (e.g., MPI), TCP/IP interface, and OS driver run on the host processor and system memory; across the I/O bus sits the Myrinet card, whose network processor and local memory run the Myrinet Control Program, the programmable interface]

  8. Susceptibility to Failures • Dependability evaluation was carried out using software-implemented fault injection (see the sketch below) • Faults were injected into the Myrinet Control Program (MCP) • A wide range of failures was observed • Unexpected latencies and reduced bandwidth • The network processor can hang and stop responding • A host system can crash/hang • A remote network interface can be affected • Similar types of failures can be expected from other high-speed networks • Such failures can greatly impact the reliability/availability of the system
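
The injector itself is not shown in the deck. As a rough illustration, software-implemented fault injection of this kind usually amounts to flipping bits in the control program's memory image; the C sketch below is a minimal, hypothetical example (the buffer, seed, and image size are assumptions, not the actual ARTS Lab tool):

/* Minimal SWIFI sketch: flip one random bit in a memory image.
 * The 64 KB stand-in buffer and fixed seed are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static void inject_bit_flip(uint8_t *image, size_t len)
{
    size_t byte = (size_t)rand() % len;   /* random byte offset  */
    int    bit  = rand() % 8;             /* random bit position */

    image[byte] ^= (uint8_t)(1u << bit);  /* single-bit fault    */
    printf("injected bit flip at offset %zu, bit %d\n", byte, bit);
}

int main(void)
{
    static uint8_t mcp_image[64 * 1024];  /* stand-in for the MCP image in SRAM */

    srand(1234);                          /* fixed seed: repeatable experiments */
    inject_bit_flip(mcp_image, sizeof mcp_image);
    return 0;
}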

  9. Summary of Experiments
  Failure Category              Count    % of Injections
  Host Interface Hang             514    24.6
  Messages Dropped/Corrupted      264    12.7
  MCP Restart                      65     3.1
  Host Computer Crash               9     0.43
  Other Errors                     23     1.15
  No Impact                      1205    57.9
  Total                          2080   100
  • More than 50% of the observed failures were host interface hangs

  10. Design Considerations • The faults must be detected and diagnosed as quickly as possible • The network interface must be up and running as soon as possible • The recovery process must ensure that no messages are lost or improperly received/sent • Complete correctness should be achieved • The overhead on the normal running of the system must be minimal • The fault tolerance should be made as transparent to the user as possible

  11. Fault Detection • Continuously polling the card can be very costly • Instead, we use a spare interval timer on the card to implement watchdog-timer functionality for fault detection • The LANai is set to raise an interrupt when this timer expires • A routine (L_timer) that the LANai is supposed to execute periodically resets the interval timer • If the interface hangs, L_timer is no longer executed, the interval timer expires, and a FATAL interrupt is raised (see the sketch below)
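
A minimal sketch of the watchdog arrangement is shown below. The register name, tick count, and host-side hook are assumptions for illustration; the real implementation programs one of the LANai 9's spare interval timers:

/* Watchdog sketch. lanai_it2, the tick value, and
 * notify_recovery_daemon() are hypothetical. */
#include <stdint.h>

#define WATCHDOG_PERIOD_TICKS 50000u       /* assumed: roughly 50 ms */

extern volatile uint32_t *lanai_it2;       /* spare interval timer (counts down) */
extern void notify_recovery_daemon(void);  /* host-side hook, hypothetical */

/* Executed from the MCP's periodic L_timer routine: reload the timer
 * before it reaches zero ("petting" the watchdog). */
void l_timer_pet_watchdog(void)
{
    *lanai_it2 = WATCHDOG_PERIOD_TICKS;
}

/* Host-side handler for the FATAL interrupt raised when the timer
 * expires: a hung MCP stopped running L_timer, so recovery begins. */
void fatal_interrupt_handler(void)
{
    notify_recovery_daemon();
}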

  12. Fault Recovery Summary • The FATAL interrupt signal is picked up by the fault recovery daemon on the host • The failure is verified through numerous probing messages • The control program is reloaded into the LANai SRAM • Any process that was accessing the board prior to the failure is also restored to its original state • Simply reloading the MCP will not ensure correctness (see the sketch and the failure scenarios below)
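
As a rough outline, the daemon's main loop might look like the C sketch below; every helper is a hypothetical stand-in for driver calls, and only the sequence of steps comes from the slide:

/* Recovery-daemon sketch; all helpers are hypothetical stand-ins. */
#include <stdbool.h>

extern bool wait_for_fatal_interrupt(void); /* blocks on the driver        */
extern bool probe_interface(int attempts);  /* true if the card responds   */
extern void reload_mcp(void);               /* rewrite MCP into LANai SRAM */
extern void restore_process_state(void);    /* replay tokens, seq. numbers */

void recovery_daemon(void)
{
    for (;;) {
        if (!wait_for_fatal_interrupt())
            continue;

        /* Verify the failure: a slow card is not a dead card. */
        if (probe_interface(3))
            continue;              /* false alarm, interface answered */

        reload_mcp();              /* fresh control program...        */
        restore_process_state();   /* ...plus per-process state, since
                                      reloading alone is not enough   */
    }
}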

  13. Myrinet Programming Model • Flow control is achieved through send and receive tokens • The Myrinet software (GM) provides reliable in-order delivery of messages • A modified form of the “Go-Back-N” protocol is used (sketched below) • Sequence numbers for the protocol are provided by the MCP • One stream of sequence numbers exists per destination
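
To make the sequencing concrete, here is a minimal receiver-side sketch of a Go-Back-N style in-order check, with one expected sequence number per sender; the types and names are illustrative, not GM's actual structures:

/* Go-Back-N receiver sketch: deliver only the next in-order message.
 * Out-of-order arrivals are dropped; the sender times out on the
 * missing ACK and resends from the first unacknowledged message. */
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint16_t expected_seq;   /* next in-order sequence number from this peer */
} peer_state_t;

bool accept_message(peer_state_t *peer, uint16_t seq)
{
    if (seq != peer->expected_seq)
        return false;        /* duplicate or gap: drop silently */
    peer->expected_seq++;    /* commit and advance the window   */
    return true;
}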

  14. Typical Control Flow
  • Sender: user process prepares message and sets send token → LANai sdmas message → LANai sends message → LANai receives ACK → LANai sends event to process → user process handles notification event and reuses buffer
  • Receiver: user process provides receive buffer and sets recv token → LANai recvs message → LANai sends ACK → LANai rdmas message → LANai sends event to process → user process handles notification event and reuses buffer

  15. Duplicate Messages
  • Sender: user process prepares message and sets send token → LANai sdmas message → LANai sends message → sender LANai goes down, so the ACK is lost
  • Receiver: user process provides receive buffer and sets recv token → LANai recvs message → LANai sends ACK → LANai rdmas message → LANai sends event to process → user process handles notification event and reuses buffer
  • Recovery: driver reloads MCP into the board and resends all unacked messages → LANai sdmas and sends the message again → receiving LANai recvs the message a second time → duplicate message. ERROR!
  • Lack of redundant state information is the cause of this problem

  16. Lost Messages
  • Sender: user process prepares message and sets send token → LANai sdmas message → LANai sends message → LANai receives ACK → LANai sends event to process → user process handles notification event and reuses buffer
  • Receiver: user process provides receive buffer and sets recv token → LANai recvs message → LANai sends ACK → LANai goes down before the message is rdmaed to host memory
  • Recovery: driver reloads MCP into the board and sets all recv tokens again → LANai waits for a message that will never be resent → lost message. ERROR!
  • Incorrect commit point (the ACK is sent before the message is committed to host memory) is the cause of this problem

  17. Fault Recovery • We need to keep a redundant copy of the state information • Full checkpointing would impose a large overhead • Logging critical message information is enough • The GM functions are modified so that • A copy of the send and receive tokens is made with every send and receive call (see the sketch below) • The host processes provide the sequence numbers, one per (destination node, local port) pair • The copy of a send/receive token is removed when the send/receive completes successfully • The MCP is modified so that • An ACK is sent out only after the message has been DMAed to host memory
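
A minimal sketch of the host-side logging is given below; the token layout, sizes, and function names are simplified assumptions rather than GM's actual structures:

/* Host-side redundant state sketch; all names are illustrative. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    void    *buffer;       /* message or receive buffer               */
    size_t   length;
    uint16_t dest_node;    /* (destination node, local port) selects  */
    uint8_t  local_port;   /* the sequence-number stream              */
    uint16_t seq;          /* sequence number assigned by the host    */
} token_copy_t;

#define MAX_PENDING 64

static token_copy_t pending[MAX_PENDING]; /* lives in host memory, so
                                             it survives an MCP reload */
static int          n_pending;

/* Log a copy before handing the real token to the LANai. */
void log_token(const token_copy_t *t)
{
    if (n_pending < MAX_PENDING)
        pending[n_pending++] = *t;
}

/* Drop the copy once the send/receive completes successfully. */
void retire_token(uint16_t seq)
{
    for (int i = 0; i < n_pending; i++) {
        if (pending[i].seq == seq) {
            pending[i] = pending[--n_pending];  /* swap-remove */
            return;
        }
    }
}

On recovery, the driver replays every entry still in the pending log; combined with the MCP change that withholds the ACK until after the DMA into host memory, this eliminates both the duplicate-message and lost-message scenarios shown earlier.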

  18. Performance Impact • The scheme has been integrated successfully into GM • Over one person-year of effort for the complete implementation • How much of the system's performance has been compromised? • After all, one can't get a free lunch these days! • Performance is measured using two key parameters • Bandwidth obtained with large messages • Latency of small messages

  19. Latency

  20. Bandwidth

  21. Summary of Results
  Performance Metric                 GM          FTGM
  Bandwidth                          92.4 MB/s   92 MB/s
  Latency                            11.5 µs     13.0 µs
  Host-CPU utilization for send      0.3 µs      0.55 µs
  Host-CPU utilization for receive   0.75 µs     1.15 µs
  LANai-CPU utilization              6.0 µs      6.8 µs
  Host platform: Pentium III with 256 MB RAM, RedHat Linux 7.2

  22. Summary of Results
  Fault Detection Latency = 50 ms
  Fault Recovery Latency = 0.765 s
  Per-Process Latency = 0.50 s

  23. Our Contributions • We have devised efficient techniques to detect and recover from network interface failures • Our fault detection technique for “network processor hangs” uses software-implemented watchdog timers • Fault recovery time (including reloading of the network control program) is ~2 seconds • The performance impact is under 1% for messages over 1 KB • Complete user transparency was achieved
