

1. Scaleable (in)Coherent Interface
Work done at Sarnoff Corp, 1999
Funded by a large consumer electronics company
Note: we didn’t use MPI on our 161-node cluster …
Ron Minnich, Advanced Computing Lab, LANL

2. Scaleable Coherent Interface
• aka IEEE 1596
• Provides a direct memory read/write model for up to 64K hosts
• Memory reads and writes are converted to network packets by the interface cards
• Message retry is handled automatically
• CRC errors result in retransmissions
• In theory, all data is “tagged cache lines”
• Nobody in their right mind does that anymore

3. Dolphin SCI Cards
[Diagram: CPU and memory on a PCI bus; the SCI card exposes two 16 MB remote memory windows (Window 1 for one-word I/O, Window 2 for fast I/O), a DMA engine, and support for other cards, in front of optical network hardware at 4 Gbit/s]
A remote memory window is a memory-mapped region in which PCI writes are translated to remote memory writes over the network. The mapping is from physical address to remote node/offset. Granularity is 512 KB per mapping.
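
As a rough sketch of how a program might set up such a window, consider the following C fragment. The ioctl number and request structure are hypothetical, invented for illustration; the real Dolphin driver interface differs.

    /* Hypothetical sketch: map a 512 KB remote memory window into user
     * space. The ioctl number and request struct are assumptions, not
     * the real Dolphin driver API. */
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    #define SCI_WINDOW_SIZE (512 * 1024)   /* card mapping granularity */

    struct sci_map_req {        /* hypothetical request structure */
        uint16_t node;          /* 16-bit SCI node # of the remote host */
        uint64_t offset;        /* 48-bit offset in that node's memory */
    };
    #define SCI_MAP_WINDOW _IOW('s', 1, struct sci_map_req) /* hypothetical */

    /* fd is an open descriptor on the (hypothetical) SCI device node. */
    volatile uint32_t *map_remote(int fd, uint16_t node, uint64_t offset)
    {
        struct sci_map_req req = { .node = node, .offset = offset };
        if (ioctl(fd, SCI_MAP_WINDOW, &req) < 0)
            return NULL;
        /* CPU stores into this mapping become PCI writes, which the card
         * turns into SCI write packets aimed at (node, offset). */
        void *p = mmap(NULL, SCI_WINDOW_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
    }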

4. Mapping for Dolphin
• 32-bit address mapped to an SCI packet with a 16-bit node # and a 48-bit offset at that node
• Translation is done on the NIC
• Granularity on the card is 512 KB, with 32 mappings
• One other parameter: the I/O type at the remote end
• Read/write or fetch-and-add
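
A minimal C sketch of the translation the card performs, assuming a simple direct-mapped table; the entry layout is illustrative, not the actual card registers.

    #include <stdint.h>

    #define MAP_GRANULE  (512u * 1024u)  /* 512 KB per mapping */
    #define NUM_MAPPINGS 32

    enum io_type { IO_READ_WRITE, IO_FETCH_AND_ADD };

    struct sci_mapping {
        uint16_t node;      /* 16-bit remote node # */
        uint64_t offset;    /* 48-bit base offset at that node */
        enum io_type type;  /* I/O type applied at the remote end */
    };

    static struct sci_mapping map_table[NUM_MAPPINGS];

    /* Translate a 32-bit PCI address (relative to the card's window
     * base) into the (node, offset) carried in the outgoing SCI packet. */
    int translate(uint32_t pci_addr, uint16_t *node, uint64_t *offset)
    {
        uint32_t idx = pci_addr / MAP_GRANULE;
        if (idx >= NUM_MAPPINGS)
            return -1;      /* address falls outside every mapping */
        *node   = map_table[idx].node;
        *offset = map_table[idx].offset + (pci_addr % MAP_GRANULE);
        return 0;
    }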

5. SCI Network (not shown: upper-level switches)
[Diagram: five CPU+SCI nodes connected to a switch; two of its four ports are used]
Note: all switches and cards can be remotely configured

6. SCI Network
• Can have up to 16 nodes per switch port at present
• Can have as few as 0 nodes per switch port
• Switch adds 250 nanoseconds latency (compare to 5 microseconds for the best Ethernet switch)

7. Sarnoff SCI Cluster
• 16 Linux/Alpha nodes
• 1 FreeBSD/dual PII/450 node
• FreeBSD did video capture
• Alphas did computation
• Two code versions: socket-based and SCI-based
• The SCI version was much easier …

8. Software
• We developed 64-bit-clean, endian-independent software, including drivers, for Linux and FreeBSD
• Harder: convincing Dolphin to let us release it
• Available at www.acl.lanl.gov/~rminnich
• Open source is the best documentation
• Many engineers at Dolphin don’t understand their own card

9. SCI performance
• All times derived from:
• 10^6 ops / wall-clock time (i.e., no one-way times)
• 10^10 bytes / wall-clock time
• Alpha, LX164 motherboard, Linux 2.0.34
• Remote write: 5 microseconds (RTT)
• Fetch-and-add: 6 microseconds (RTT)
• Bulk data: 700 Mbits/second
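
A C sketch of that measurement style: time a large batch of operations and divide by the count, rather than trying to report one-way times. The remote pointer is assumed to point into a mapped SCI window; the read-back is what makes each iteration a full round trip.

    #include <stdint.h>
    #include <sys/time.h>

    static double seconds_now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    /* Average round-trip time in microseconds over nops operations,
     * e.g. nops = 1000000 as in the slide. */
    double rtt_microseconds(volatile uint32_t *remote, long nops)
    {
        double t0 = seconds_now();
        for (long i = 0; i < nops; i++) {
            *remote = (uint32_t)i;  /* remote write ... */
            (void)*remote;          /* ... then read back: one round trip */
        }
        return (seconds_now() - t0) / nops * 1e6;
    }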

10. SCI programming
• Bulk data movement: bcopy(local, remote, nbytes);
• Fetch: a = *remote;
• Fetch and add: a = *remote; (the identical load, issued through a mapping whose remote I/O type is set to fetch-and-add; see slide 4)
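
A C sketch of the three primitives, where rw and faa are hypothetical pointers into two windows mapped with the read/write and fetch-and-add I/O types respectively.

    #include <stdint.h>
    #include <string.h>

    void primitives(volatile uint32_t *rw, volatile uint32_t *faa,
                    const void *local, size_t nbytes)
    {
        uint32_t a;

        /* Bulk data movement: an ordinary memory copy into the window;
         * the stores become SCI write packets. */
        memcpy((void *)rw, local, nbytes);

        /* Fetch: a plain load becomes a remote read. */
        a = *rw;

        /* Fetch-and-add: the identical load, but through a mapping whose
         * remote I/O type is fetch-and-add, so the remote word is
         * atomically incremented as it is read. */
        a = *faa;
        (void)a;
    }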

11. The Fundamental SCI Problem
• printf(“%d %d %d\n”, *a, *b, *c);
• There was a failure in this line of code, somewhere in the C library, due to an SCI error
• This printf was at the bottom of a 5-level-deep call chain
• What will you do?
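
To make the dilemma concrete: with transparent remote loads, roughly the best a user program can do is install a process-wide SIGBUS handler and bail out, with no way to know what state the C library was left in. The names and recovery scheme in this C sketch are illustrative assumptions, not what we actually did.

    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>

    static sigjmp_buf recover;

    static void on_bus_error(int sig)
    {
        (void)sig;
        siglongjmp(recover, 1);  /* abandon whatever libc was doing */
    }

    /* a, b, c point into mapped SCI windows. */
    int print_remote(volatile int *a, volatile int *b, volatile int *c)
    {
        signal(SIGBUS, on_bus_error);
        if (sigsetjmp(recover, 1))
            return -1;           /* a remote read faulted somewhere below */
        printf("%d %d %d\n", *a, *b, *c);
        return 0;
    }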

12. Transparency is a Bad Idea
• CS people like to argue for transparency
• Transparency means you don’t know what’s going on behind the scenes, but you don’t need to care
• In practice, you only need to care when there are errors -- but then, you really care
• Would you let such a system be used for your next open-heart surgery?

13. SCI programming style
• For best error handling and fault tolerance, segregate the SCI memory and use remote memory writes
• In general, for PCI, best performance is achieved by writing from the CPU anyway
• Ensures cache coherency of the data at each end, without surprises
• Quite a change from the SCI “vision”

14. SCI Programming (wrong)
[Diagram: many programs all sharing one common transparent shared memory area]

15. SCI Programming (Right)
[Diagram: each program has its own private SCI area]
Programs can write to other programs’ private SCI areas (sketched below)
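
A minimal C sketch of this structure, with hypothetical names and sizes: each program exports one private area per peer, and a send is a CPU push into the receiver’s area followed by a flag write.

    #include <stdint.h>
    #include <string.h>

    #define NPEERS    16
    #define AREA_SIZE 4096

    struct private_area {         /* one per (sender, receiver) pair */
        volatile uint32_t ready;  /* written last, after the payload */
        uint8_t data[AREA_SIZE];
    };

    /* tx[i] points into peer i's private area via a mapped SCI window. */
    void send_to_peer(struct private_area *tx[NPEERS], int peer,
                      const void *msg, size_t len)
    {
        memcpy(tx[peer]->data, msg, len);  /* push with CPU writes */
        /* A real implementation needs a write barrier or read-back here
         * so the flag cannot reach the remote node before the data. */
        tx[peer]->ready = 1;
    }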

16. A Fundamental Problem
• You’re going to have errors and lose packets
• So you have to handle lost packets
• So you need some sort of sequence generation and checking
• Which means headers
• Which in turn requires actions at the receive end other than computation
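
A C sketch of the kind of header and receive-side check this chain of reasoning forces on you; the layout is illustrative.

    #include <stdint.h>

    struct msg_header {
        uint32_t seq;  /* detects lost or duplicated messages */
        uint32_t len;  /* detects truncation */
    };

    /* Returns 0 iff this is the expected next message; on success the
     * expected sequence number advances. */
    int check_header(const struct msg_header *h, uint32_t *expected_seq,
                     uint32_t max_len)
    {
        if (h->seq != *expected_seq)
            return -1;  /* a message was lost or replayed */
        if (h->len > max_len)
            return -1;  /* length field itself is suspect */
        (*expected_seq)++;
        return 0;
    }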

17. An Even More Fundamental Problem
• You’re going to get trashed packets
• For some applications, a previous iteration of the data may be preferable to trash
• You may want to validate the data before having it appear in user space, and not have it appear if it is trashed
• Which implies a buffer, which in turn eliminates a major advantage of bypass

18. Trashed Data
[Diagram, steps 1-3: message 2 arrives directly on top of the data area holding message 1; the incoming data is truncated, leaving the area holding neither message intact (“message ?”)]

19. Fixing Trashed Data
[Diagram, steps 1-3: message 2 arrives into a separate buffer while the data area keeps message 1; the incoming data is truncated, so the buffer is discarded and the data area still holds message 1]
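
A C sketch of the fix shown in slides 18 and 19: incoming data lands in a staging buffer and is validated before being copied to the user-visible area, so a trashed message leaves the previous iteration intact. The checksum and the trailer convention are assumptions for illustration.

    #include <stdint.h>
    #include <string.h>

    #define MSG_SIZE 4096

    static uint8_t staging[MSG_SIZE];  /* where incoming writes land */
    static uint8_t visible[MSG_SIZE];  /* what the application reads */

    static uint32_t checksum(const uint8_t *p, size_t n)
    {
        uint32_t sum = 0;
        while (n--)
            sum = sum * 31 + *p++;
        return sum;
    }

    /* Called once a message is signalled complete; expected comes from
     * a trailer the sender writes after the payload. */
    void deliver(uint32_t expected)
    {
        if (checksum(staging, MSG_SIZE) != expected)
            return;  /* trashed: keep the previous iteration's data */
        memcpy(visible, staging, MSG_SIZE);
    }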

20. SCI Conclusions
• Due to errors, caches, and efficiency, we still use copy-in/copy-out
• Well, I was surprised too
• But these interfaces are designed for direct I/O into user space
• Implication: we may want to rethink how to build OS-bypass interfaces
• Well, actually, we already are ...

21. (My) Final Conclusions
• Zero copy is broken
• There’s more potential if the interfaces can exploit an efficient OS virtual memory system
• But changes must be made to the virtual memory system, the context-switch code, and the network interface architecture in a unified way
