
Clustering SMP Nodes with the ATOLL Network: A Look into the Future of System Area Networks



Presentation Transcript


  1. Clustering SMP Nodes with the ATOLL Network: A Look into the Future of System Area Networks Lars Rzymianowicz larsrzy@atoll-net.de Dept. of Computer Engineering, Prof. Dr. Ulrich Brüning University of Mannheim, Germany

  2. Motivation for SMP-specific SAN architectures • The ATOLL SAN: Basic Architecture • Project status • Performance Prediction: • simulation method & environment • how to measure accurate and realistic numbers? • single-/multiple-sender performance • Comparison with other solutions

  3. Cluster Computing evolves as a new way of High Performance Computing as a result of its superior price/performance ratio • the key to Cluster Computing is a SAN delivering the communication performance normally found in Supercomputers • several SANs have been developed in recent years: ServerNet, QsNet, Memory Channel

  4. Comparison: A traditional Supercomputer (ASCI Blue Pacific) versus A Cluster (Avalon) • based on their TOP500 ranking and their price • www.top500.org

  5. ASCI Blue Pacific • located at Lawrence Livermore National Laboratory • IBM SP2 with 5856 PowerPC 604 CPUs • LINPACK Rmax = 468,200 for 1344 CPUs; scaled linearly, this results in 2,040,000 • cost of $93 million → $64 per 1 Rmax

  6. Avalon • located at Los Alamos National Laboratory • Cluster of 140 Alpha CPUs, connected by Fast Ethernet • LINPACK Rmax = 48,600 • cost of $300,000 → $6 per 1 Rmax

  7. What do the experts say? • Gordon Bell (Microsoft Bay Area Research Center)*: • “Clusters are the only structure that scales!” • “SmP (m<16) will be the component for clusters. Most cost-effective systems are made from best nodes.” • Horst D. Simon (NERSC)*: • “Large clusters show promise of replacing MPPs such as the NERSC T3E for certain applications.” * all comments taken from presentations at the Supercomputer’99 conference in Mannheim, Germany

  8. What does a typical cluster look like? ...if you have the money? Compaq Alpha SC Cluster Series

  9. What does a typical cluster look like? ...if you don’t have the money? Self-made cluster: Parnass2

  10. A quick look at www.top500clusters.org reveals (18 clusters submitted): • CPU: 2x SPARC, 4x Alpha, 12x x86 (PPro, PII, PIII) • OS: 2x Solaris, 1x Compaq Tru64 UNIX, 13x Linux • 15x MPI, 0x PVM! The war is over! • Network: 1x SCI, 3x Myrinet, 1x Gigabit Ethernet, 10x Fast Ethernet • Nodes: 10x Single-CPU, 7x Dual-CPU, 1x 8-CPU → SMP nodes are getting popular!

  11. Software starts to address Clusters of SMPs: • BIP-SMP • Multi-Protocol AM • SCore 3.0 (PM) • Hardware does not! Up to now, you can only add more NICs to each node ... • ... but nobody does, for cost reasons! (except some Beowulf clusters, which use two Fast Ethernet NICs: “channel bonding”)

  12. So Clusters based on COTS SMP nodes are an attractive alternative to Supercomputers... • ... if they have a high performance network! • Most Clusters are still equipped with traditional network technology like (Fast) Ethernet: “Yeah, we decided to use Ethernet for our Cluster, the other stuff is just too expensive.”

  13. Conclusion: for a new generation of Clusters we need SANs, which are: • less expensive! • comparable to Supercomputer networks in performance! • suitable for clustering SMP nodes!

  14. ATOLL addresses all problems: • single-chip implementation avoiding expensive on-board SRAM and multi-chip solutions • on-chip 4x4 bi-directional switch to eliminate costly external switching hardware • 4 replicated independent NIs to serve the needs of 2/4/8-way SMP nodes • high-end ASIC and transmission technology

  15. ATOLL Architecture: a true ‘Network on a Chip!’ • [Block diagram: a 64bit/66MHz PCI Bus Interface (66 MHz, 528 MB/s) connects four replicated network interfaces (NI 0–NI 3) to a 4x4 XBar running at 250 MHz, which feeds four links (Link 0–Link 3) at 250 MB/s each]

  16. 64bit/66MHz PCI Bus Interface • 64bit/66MHz, PCI v2.1 compliant interface: • master (DMA) & slave (PIO) functionality • also runs as a 32bit/33MHz interface • split-phase transaction mode for serving master cycles from all four NIs • capable of combining several transactions into one burst if applicable • 64bit/66MHz slots will appear in the PC mass market soon

  17. 4x4 XBar • 4x4 bi-directional Crossbar: • running at 250MHz • ATOLL uses wormhole (source-path) routing • fall-through latency of 6 clock cycles (24ns) • 2 GB/s bisection bandwidth

  18. Link n • Link interface: • running at 250MHz, byte-wide links • reverse flow control signals are exchanged to prevent buffer overflow • messages are broken down into 64 byte link packets, which are protected by a CRC and(!) retransmitted by the sending link in case of transmission errors → every injected message is delivered!
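As a rough illustration of this link-packet scheme, the C sketch below breaks a message into 64-byte packets and attaches a checksum to each so the sender could retransmit a corrupted one. The structure layout, function names and CRC polynomial are assumptions for illustration only, not the actual ATOLL link protocol.

```c
#include <stdint.h>
#include <string.h>

#define LINK_PKT_PAYLOAD 64   /* payload bytes per link packet (from the slide) */

/* Illustrative CRC-16 (CCITT polynomial); the real ATOLL link CRC is not specified here. */
static uint16_t crc16(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)buf[i] << 8;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021) : (uint16_t)(crc << 1);
    }
    return crc;
}

/* One link packet: kept by the sender until acknowledged, so it can be
 * retransmitted if the receiver reports a CRC error. */
typedef struct {
    uint8_t  payload[LINK_PKT_PAYLOAD];
    uint16_t crc;
    uint8_t  len;
} link_pkt_t;

/* Split a message into 64-byte link packets, each protected by its CRC. */
size_t packetize(const uint8_t *msg, size_t msg_len, link_pkt_t *out, size_t max_pkts)
{
    size_t n = 0;
    for (size_t off = 0; off < msg_len && n < max_pkts; off += LINK_PKT_PAYLOAD, n++) {
        size_t chunk = msg_len - off < LINK_PKT_PAYLOAD ? msg_len - off : LINK_PKT_PAYLOAD;
        memcpy(out[n].payload, msg + off, chunk);
        out[n].len = (uint8_t)chunk;
        out[n].crc = crc16(out[n].payload, chunk);
    }
    return n;   /* number of link packets generated */
}
```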

  19. NI n • Network Interface: • running at 250MHz • PIO interface for efficient send/receive of small messages • PIO interface utilizes advanced I/O chipset features like write-combining and read-prefetching to boost performance • DMA engine for autonomous transfer of larger messages • NI context fully loadable (virtual NI)

  20. Why data transfer per Programmed I/O? • a lot of applications have a fine-grain communication pattern; DMA has an inherent start-up cost • single-cycle accesses to NI regs result in poor performance • ... we can benefit from developments to enhance graphics performance: write-buffers in CPUs and I/O bridges • major problem: “How do I make my network FIFO look like memory?” “Thank God, PCI only knows linear bursts.”

  21. PIO transfer • one address area for each msg frame (routing, header, data), laid out in a linear address space (addresses i, i+1, i+2, i+3, ...) • consecutive writes are collected in the write-combining (WC) buffer; the buffer is then transferred in a single burst cycle • the read interface has to deal with lost prefetched data; this is accomplished by a small prefetch buffer, which keeps the most recently accessed data
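A minimal C sketch of such a PIO send, assuming a hypothetical write-combining mapping of the NI frame window; the frame offsets and names are illustrative, not the actual ATOLL register map:

```c
#include <stdint.h>
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_sfence(): drain the CPU's write-combining buffers */

/* Hypothetical layout of one PIO message frame window (routing, header, data),
 * mapped write-combining; offsets chosen only for illustration. */
enum { PIO_ROUTE = 0x000, PIO_HEADER = 0x040, PIO_DATA = 0x080 };

static void pio_send(volatile uint64_t *frame,            /* WC-mapped NI frame window */
                     const uint64_t *route,  size_t route_qw,
                     const uint64_t *header, size_t header_qw,
                     const uint64_t *data,   size_t data_qw)
{
    size_t i;
    /* Consecutive 64-bit stores into one linear address area: the write-combining
     * buffer collapses them so they reach the NI as a few PCI burst cycles
     * instead of many single-cycle accesses. */
    for (i = 0; i < route_qw;  i++) frame[PIO_ROUTE  / 8 + i] = route[i];
    for (i = 0; i < header_qw; i++) frame[PIO_HEADER / 8 + i] = header[i];
    for (i = 0; i < data_qw;   i++) frame[PIO_DATA   / 8 + i] = data[i];
    _mm_sfence();   /* ensure the WC buffer is actually flushed to the device */
}
```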

  22. DMA transfer • send descriptors are 64-bit-wide entries (bits 63..0) in a descriptor table, holding a tag plus length/offset pairs for the routing, header and data frames; the offsets point into a separate data region • all data structures reside in pinned-down main memory • NI context can be switched • for efficient multicast messages, several descriptors can reference the same data regions ...
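The descriptor layout suggested by the slide could be sketched in C roughly as follows; field widths, ordering and the type name are assumptions for illustration:

```c
#include <stdint.h>

/* Sketch of one DMA send descriptor: a tag plus (length, offset) pairs for the
 * routing, header and data frames. The offsets point into a pinned-down data
 * region in main memory, so the NI can fetch everything autonomously. */
typedef struct {
    uint64_t tag;             /* message tag / control word                      */
    uint32_t route_len;       /* length of the routing frame in bytes            */
    uint32_t route_offset;    /* offset of the routing frame in the data region  */
    uint32_t header_len;
    uint32_t header_offset;
    uint32_t data_len;
    uint32_t data_offset;
} atoll_dma_desc_t;

/* For a multicast, several descriptors can simply reference the same data
 * offsets while carrying different routing frames. */
```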

  23. Control transfer: • problem: frequent polling on the device wastes bus bandwidth; interrupts trigger a costly context switch • ATOLL mirrors all relevant data into main memory → the CPU polls on cache-coherent memory • a configurable watchdog timer can trigger an interrupt if a message is not served by the processor within a specific time frame
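A minimal sketch of this polling scheme, assuming a hypothetical status structure that the NIC mirrors into cacheable main memory (names and layout are not from the ATOLL documentation):

```c
#include <stdint.h>

/* The NIC mirrors its write pointer into normal, cacheable main memory via DMA,
 * so the CPU spins on a cache-coherent location instead of issuing PCI reads. */
typedef struct {
    volatile uint64_t wr_ptr;   /* advanced by the NIC when a message arrives */
    uint64_t          rd_ptr;   /* advanced by the CPU as it consumes messages */
} mirrored_status_t;

static int poll_for_message(mirrored_status_t *st)
{
    /* The loads hit the cache until the NIC's DMA write invalidates the line,
     * so no PCI bus bandwidth is wasted while waiting. */
    while (st->wr_ptr == st->rd_ptr)
        ;   /* the configurable watchdog interrupt covers the case where the
               CPU has stopped polling and a message would otherwise starve */
    return 1;   /* at least one message is pending */
}
```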

  24. Project status: • a cooperation with an industrial partner: Fujitsu/Siemens PC Systems, which has implemented the PCI interface and is responsible for the physical design flow (synthesis, place & route, floorplanning) • started as a research project, now aimed at the commercial market (price per NIC: ≈ $600) • first prototypes were expected for Q1 2000

  25. We still have no chip :-(( • We needed more time than expected to catch all design bugs in simulation (chip complexity) • our partner had to favour an in-house chip project and could not work on ATOLL full-time • we jumped in, but simply have neither the manpower nor the know-how for the back-end (synthesis, DfT)

  26. Even if we don’t have a chip yet, we tried to estimate the performance of an ATOLL-powered cluster: • first versions of the software (ATOLL API) have been implemented on top of a process simulating an ATOLL NIC (most CPU-NIC interaction is memory mapped) • a cycle-accurate simulation of the ATOLL chip and its environment (Host-PCI bridge, memory, CPU) enabled the extraction of ‘close to reality’ performance numbers

  27. Benchmarking the ATOLL API send/receive calls: • their tasks: various checks, copy data into/from DMA buffer, assemble descriptor, trigger message transfer • used a CPU-cycle counter register to measure accurate numbers on a Dual PIII 500MHz, Intel 440BX chipset • without the data copy, an API call needs 0.5 µs • on the sending side, make sure data is written to main memory: we use the write-combining mode • on the receiving side, all reads trigger cache misses and cache line fills
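The cycle-counter measurement could look roughly like this in C (GCC inline assembly for RDTSC); atoll_send() is a hypothetical stand-in for the ATOLL API send call, not its documented name:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Read the Pentium time-stamp counter; on the Dual PIII 500 MHz test machine
 * one tick corresponds to 2 ns. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical stand-in for the ATOLL API send call being benchmarked. */
extern int atoll_send(const void *buf, size_t len);

void bench_send(const void *buf, size_t len, int iters)
{
    uint64_t start = rdtsc();
    for (int i = 0; i < iters; i++)
        atoll_send(buf, len);
    uint64_t cycles = (rdtsc() - start) / (uint64_t)iters;
    printf("avg: %llu cycles = %.2f us at 500 MHz\n",
           (unsigned long long)cycles, (double)cycles / 500.0);
}
```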

  28. Hardware simulation bench • [Diagram: CPU and memory attached over the system bus (not modeled, replaced by an access penalty) to a PCI bridge; on the PCI bus, PCI master and PCI target models stand in for CPU and memory, and the ATOLL chip (four NIs, XBar, four Links) is an RTL-HDL model] • the API copies data into memory, the NI fetches the data from memory, and the message leaves the node via one link

  29. Testbench: two connected ATOLL chips • PCI Master/Slave BFM to model CPU/memory • high-end PCI bus: 64bit + 66MHz • two components not accurately modeled: • system bus: left out here • chipset (Host-PCI bridge): access penalty of 200ns (~12 PCI cycles) for each bus access

  30. Formulas used:
  Latency = API_snd + (N_snd × T_bridge) + T_ATOLL + (N_rcv × T_bridge) + API_rcv
  Bandwidth = Size_msg / ((T_start(N+1) − T_start(N)) + (N_snd × T_bridge))
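Transcribed into C (with the bandwidth denominator grouped as dimensional analysis of the slide's formula suggests), the model reads as follows; all inputs are parameters, so no numbers beyond those on the slides are assumed:

```c
/* Latency model: API overhead on both sides, plus one bridge penalty per
 * host-bridge access (N_snd on the sender, N_rcv on the receiver), plus the
 * time spent inside the ATOLL hardware itself. */
double model_latency(double api_snd, double api_rcv, double t_atoll,
                     double t_bridge, int n_snd, int n_rcv)
{
    return api_snd + n_snd * t_bridge + t_atoll + n_rcv * t_bridge + api_rcv;
}

/* Bandwidth model: message size divided by the spacing between the start
 * times of message N and N+1, plus the sender-side bridge penalties. */
double model_bandwidth(double size_msg, double t_start_next, double t_start,
                       double t_bridge, int n_snd)
{
    return size_msg / ((t_start_next - t_start) + n_snd * t_bridge);
}
```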

  31. Comparison with other SANs: • Myrinet: $995, 160 MB/s peak, lots of software • QsNet: $????, 400 MB/s peak

  32. • we don't want the best performance, but the best performance/price ratio • [Chart: bandwidth versus per-node cost (NIC + switch port); vertical axis 100–400 MB/s, horizontal axis $500–$3000; ATOLL positioned at high bandwidth and low cost, with QsNet (?), Synfinity (?), Myrinet, Gigabit Ethernet and Giganet (?) at other points]

  33. info@atoll-net.de www.atoll-net.de Thank you for your attention! Join the mailing list for announcements!
