This paper presents the Tightly Coupled Cluster (TCCluster) architecture, which uses the processor's host interface as the network interconnect. Motivated by future trends toward rapidly growing core and node counts, the architecture aims to exploit fine-grained parallelism, improving serialization and synchronization through low-latency communication. By effectively moving network functionality into the CPU, TCCluster achieves an order-of-magnitude latency improvement. Its emphasis on locality and memory efficiency addresses the scalability and performance limitations of classical cluster architectures.
TCCluster: A Cluster Architecture Utilizing the Processor Host Interface as a Network Interconnect Heiner Litz University of Heidelberg
Motivation • Future Trends • More cores: 2-fold increase per year [Asanovic 2006] • More nodes: 200,000+ nodes for exascale [Exascale Rep.] • Consequence • Exploit fine-grained parallelism • Improve serialization/synchronization • Requirement • Low-latency communication
Motivation • Latency lags bandwidth [Patterson, 2004] • Memory vs. network • Memory BW: 10 GB/s • Network BW: 5 GB/s • Memory latency: 50 ns • Network latency: 1 µs • Gap: 2x in bandwidth vs. 20x in latency
State of the Art • [Figure: design space of interconnect approaches, scalability vs. latency — Ethernet, InfiniBand and software DSM clusters on the high-scalability side; SMPs, Tilera, Larrabee, QuickPath and HyperTransport on the low-latency side; TCCluster targets both high scalability and low latency]
Observation • Today’s CPUs represent complete Cluster nodes • Processor cores • Switch • Links
Approach • Use host interface as interconnect • Tightly Coupled Cluster (TCCluster)
Background • Coherent HyperTransport (cHT) • Shared-memory SMPs • Cache-coherency overhead • Max. 8 endpoints • Table-based routing (node ID) • Non-coherent HyperTransport (ncHT) • Subset of cHT • I/O devices, Southbridge, ... • PCI-like protocol • "Unlimited" number of devices • Interval routing (memory address) — see the sketch below
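The slide contrasts table-based (node-ID) routing in coherent HT with interval (address-range) routing in non-coherent HT. The following is a conceptual sketch of that difference only; the structures and function names are invented for illustration and do not reflect actual HyperTransport register layouts.

```c
#include <stdint.h>
#include <stddef.h>

/* Coherent HT: table-based routing keyed by destination node ID. */
typedef struct {
    uint8_t out_link[8];        /* at most 8 coherent endpoints */
} node_route_table_t;

static inline uint8_t route_by_node_id(const node_route_table_t *t,
                                       uint8_t dst_node)
{
    return t->out_link[dst_node & 0x7];
}

/* Non-coherent HT: interval routing keyed by physical address ranges,
 * similar to PCI-style base/limit windows. */
typedef struct {
    uint64_t base;              /* start of address window   */
    uint64_t limit;             /* inclusive end of window   */
    uint8_t  out_link;          /* link the window points to */
} addr_window_t;

static inline int route_by_address(const addr_window_t *win, size_t n,
                                   uint64_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= win[i].base && addr <= win[i].limit)
            return win[i].out_link;
    return -1;                  /* no window claims this address */
}
```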
Approach • Processors pretend to be I/O devices • Partitioned global address space • Communicate via PIO writes to MMIO
Programming Model • Remote Store PM • Each process has local private memory • Each process exposes remotely writable regions • Sending by storing to remote locations • Receiving by reading from local memory • Synchronization through serializing instructions • No support for bulk transfers (DMA) • No support for remote reads • Emphasis on locality and low-latency local reads (sketched below)
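The slides do not show application code, so the following is a minimal user-space sketch of how the remote-store model could look, assuming x86 and assuming the driver has already mapped the peer's remotely writable window (the slot layout, names and fence choice are assumptions, not the authors' code). Sending is a plain store through the write-combining mapping made visible with a fencing instruction; receiving is polling local memory.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical layout of one remotely writable message slot (64 bytes):
 * the payload is written first, then the flag is raised to signal arrival. */
typedef struct {
    volatile uint64_t flag;
    char              payload[56];
} msg_slot_t;

/* Send: store directly into the peer's memory through the MMIO/PGAS mapping
 * (remote_slot points into the remote write-combining window). */
static void rsm_send(msg_slot_t *remote_slot, const char *msg, size_t len)
{
    memcpy(remote_slot->payload, msg, len);
    __asm__ __volatile__("sfence" ::: "memory"); /* drain WC buffers, order stores */
    remote_slot->flag = 1;                       /* signal the receiver */
    __asm__ __volatile__("sfence" ::: "memory");
}

/* Receive: no remote reads in the model -- just poll local memory that the
 * peer writes into (local_slot lives in this node's remotely writable region). */
static void rsm_recv(msg_slot_t *local_slot, char *out, size_t len)
{
    while (local_slot->flag == 0)
        ;                                        /* spin on local DRAM, cheap reads */
    memcpy(out, local_slot->payload, len);
    local_slot->flag = 0;                        /* re-arm the slot */
}
```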
Implementation • [Figure: test setup — two two-socket quad-core Shanghai Tyan boxes (BOX 0, BOX 1), each with nodes 0/1, a Southbridge (SB) and shared Reset/PWR wiring, connected through their HTX slots by a non-coherent HT link, 16 bit @ 3.6 Gbit]
Implementation • Software based approach • Firmware • Coreboot (LinuxBIOS) • Link de-enumeration • Force non-coherent • Link frequency & electrical parameters • Driver • Linux based • Topology & Routing • Manages remotely writable regions
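The actual coreboot patches and Linux driver are not included in the slides; the fragment below only sketches the driver's core job as described here, i.e. handing a remotely writable window out to user space with a write-combining mapping. The window base/size constants and the device name are assumptions derived from the memory-layout slide.

```c
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/miscdevice.h>

/* Assumed location of the peer box's window (per the memory-layout slide):
 * the 5 GB..6 GB range of the local physical address space. */
#define TCC_REMOTE_WINDOW_BASE 0x140000000ULL   /* 5 GB */
#define TCC_REMOTE_WINDOW_SIZE 0x40000000UL     /* 1 GB */

/* Map the remote window into a process, write-combined so that PIO stores
 * to the peer can be streamed efficiently. */
static int tcc_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > TCC_REMOTE_WINDOW_SIZE)
        return -EINVAL;

    vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
    return remap_pfn_range(vma, vma->vm_start,
                           TCC_REMOTE_WINDOW_BASE >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}

static const struct file_operations tcc_fops = {
    .owner = THIS_MODULE,
    .mmap  = tcc_mmap,
};

static struct miscdevice tcc_dev = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = "tccluster",
    .fops  = &tcc_fops,
};

static int __init tcc_init(void)  { return misc_register(&tcc_dev); }
static void __exit tcc_exit(void) { misc_deregister(&tcc_dev); }

module_init(tcc_init);
module_exit(tcc_exit);
MODULE_LICENSE("GPL");
```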
Memory Layout • [Figure: mirrored physical address maps of BOX 0 and BOX 1 — 0–4 GB local DRAM of node 0 (WB), 4–5 GB DRAM of node 1 (WB), 5–6 GB the remotely writable region, uncacheable (UC) RW memory on one box and a write-combining (WC) MMIO window on the peer, DRAM hole above 6 GB]
Bandwidth – HT800 (16 bit) • [Figure: bandwidth/message-rate plot] • Single-threaded message rate: 142 million messages/s
Latency – HT800 (16 bit) • [Figure: latency plot] • Software-to-software half-roundtrip latency: 227 ns
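The slides report only the result (227 ns half round trip), not how it was obtained. Below is an assumed ping-pong microbenchmark over the remote-store primitives, not the authors' actual benchmark code; the flag pointers would come from mmap()ing the driver's windows, which is omitted here.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000

/* Assumed ping-pong: 'remote_flag' points into the peer's remotely writable
 * window, 'local_flag' into our own window that the peer writes back to.
 * Half round trip = total time / (2 * ITERS). */
static double measure_half_roundtrip(volatile uint64_t *remote_flag,
                                     volatile uint64_t *local_flag)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 1; i <= ITERS; i++) {
        *remote_flag = i;                           /* ping: PIO store to peer */
        __asm__ __volatile__("sfence" ::: "memory");
        while (*local_flag != i)                    /* pong: poll local DRAM   */
            ;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (2.0 * ITERS);                      /* half-roundtrip in ns */
}

int main(void)
{
    /* Mapping setup via the driver is implementation-specific and omitted. */
    (void)measure_half_roundtrip;
    printf("see measure_half_roundtrip()\n");
    return 0;
}
```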
Conclusion • Introduced a novel tightly coupled interconnect • "Virtually" moved the NIC into the CPU • Order-of-magnitude latency improvement • Scalable • Next steps: • MPI over RSM support • Custom mainboard with multiple links
References • [Asanovic, 2006] Asanovic K., Bodik R., Catanzaro B., Gebis J., et al. The Landscape of Parallel Computing Research: A View from Berkeley. UC Berkeley Technical Report, 2006. • [Exascale Rep] Kogge P. (ed.). ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. DARPA IPTO, 2008. • [Patterson, 2004] Patterson D. Latency Lags Bandwidth. Communications of the ACM, 47(10):71–75, October 2004.