This paper presents solutions for low-latency, high-throughput crossbar arbitration in FPGAs, developed as part of the OSMOSIS project. We describe a new approach that pipelines parallel iterative matching algorithms to enhance performance. Simulation experiments characterize latency and throughput as functions of pipeline depth, establishing the viability of the proposed solutions for massively parallel computing applications. We also outline the system architecture, which combines wavelength- and space-division multiplexing, and discuss how our techniques address the technical challenges of all-optical packet switching.
Low-Latency Pipelined Crossbar Arbitration Cyriel Minkenberg, Ilias Iliadis, François Abel IBM Research, Zurich Research Laboratory
Outline • Context • OSMOSIS project • Problem • Low-latency, high-throughput crossbar arbitration in FPGAs • Approach • A new way to pipeline parallel iterative matching algorithms • Simulation results • Latency-throughput as a function of pipeline depth • Conclusions
OSMOSIS • Optical Shared MemOry Supercomputer Interconnect System • Sponsored by DoE & NNSA as part of ASCI • Joint 2½-year project • Corning: Optics and packaging • IBM: Electronics (arbiter, input and output adapters) and system integration • High-Performance Computing (HPC) • Massively parallel computers (e.g. Earth Simulator, Blue Gene) • Low-latency, high-bandwidth, scalable interconnection networks • Main sponsor objective • Solving the technical challenges and accelerating the cost reduction of all-optical packet switches for HPCS interconnects by • building a full-function all-optical packet switch demonstrator system • showing the scalability, performance and cost paths for a potential commercial system • Key requirements:
OSMOSIS System Architecture: All-optical Switch • Broadcast-and-select architecture (crossbar) • Combination of wavelength- and space-division multiplexing • Fast switching based on SOAs • Electronic input and output adapters • Electronic arbitration • [Figure: 64 ingress adapters with VOQs and combiners feed 8 broadcast units (8x1 WDM mux, star coupler, optical amplifier, 1x128 split); 128 select units, each with fast SOA 1x8 fiber selector gates and 1x8 wavelength selector gates, feed 64 egress adapters; 64 control links connect the adapters to the central arbiter running a bipartite graph matching algorithm]
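As an illustration of how the wavelength and space dimensions jointly address the 64 ingress adapters, the following minimal sketch shows which fiber-selector and wavelength-selector gates a select unit would open to receive from a given source port. The port-to-gate mapping (eight adapters grouped per broadcast unit) is an assumption made for illustration; the actual OSMOSIS mapping may differ.

```python
# Minimal sketch, assuming ingress adapters are grouped eight per broadcast unit;
# the concrete OSMOSIS port-to-gate mapping is an assumption made for illustration.

NUM_PORTS = 64            # ingress adapters
NUM_BROADCAST_UNITS = 8   # one fiber per broadcast unit (space division)
NUM_WAVELENGTHS = 8       # wavelengths multiplexed per fiber (wavelength division)

def select_gates(src_port: int) -> tuple[int, int]:
    """Return the (fiber, wavelength) gates a select unit opens to receive the
    cell broadcast by ingress adapter `src_port`."""
    assert 0 <= src_port < NUM_PORTS
    fiber = src_port // NUM_WAVELENGTHS      # which 1x8 fiber-selector SOA gate to open
    wavelength = src_port % NUM_WAVELENGTHS  # which 1x8 wavelength-selector SOA gate to open
    return fiber, wavelength

# Example: a cell from ingress adapter 42 is received by opening fiber gate 5 and
# wavelength gate 2 (8 x 8 = 64 source ports in total).
print(select_gates(42))  # -> (5, 2)
```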
OSMOSIS Arbitration • Crossbar arbitration • Heuristic parallel iterative matching algorithms • RRM, PIM, i-SLIP, FIRM, DRRM, etc. • These require I = log2(N) iterations to achieve good performance • Mean latency decreases as the number of iterations increases • OSMOSIS • N = 64, hence I = 6 iterations • Problem • A single iteration takes too long (Ti) to complete all I iterations within one time slot Tc • VHDL experiments indicate that Ti ≤ Tc < 2·Ti, so only about one iteration fits per time slot • Poor performance… • Solution • Pipelining • however, in general this incurs a latency penalty
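For reference, a compact sketch of one request-grant-accept iteration of a PIM-style matcher is shown below (Python, simplified: PIM chooses randomly, whereas i-SLIP and FIRM use round-robin pointers; the function names and data layout are illustrative assumptions, not the OSMOSIS implementation).

```python
import random

def pim_iteration(requests, in_matched, out_matched, match):
    """One request-grant-accept iteration of a PIM-style matcher.
    requests[i][j] is True if input i has a cell queued (in VOQ ij) for output j."""
    N = len(requests)
    # Grant phase: every unmatched output grants one of the unmatched inputs requesting it.
    grants = {}                      # output j -> granted input i
    for j in range(N):
        if j in out_matched:
            continue
        candidates = [i for i in range(N) if i not in in_matched and requests[i][j]]
        if candidates:
            grants[j] = random.choice(candidates)
    # Accept phase: every input accepts at most one of the grants it received.
    offers = {}                      # input i -> outputs that granted it
    for j, i in grants.items():
        offers.setdefault(i, []).append(j)
    for i, outputs in offers.items():
        j = random.choice(outputs)
        match[i] = j
        in_matched.add(i)
        out_matched.add(j)

def iterative_match(requests, iterations):
    """Run several iterations; edges found in later iterations extend the matching."""
    match, in_matched, out_matched = {}, set(), set()
    for _ in range(iterations):
        pim_iteration(requests, in_matched, out_matched, match)
    return match
```

For OSMOSIS, N = 64 calls for I = 6 such iterations per matching, which a single time slot cannot accommodate when Tc is on the order of Ti; this is what motivates pipelining.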
Parallel Matching: PMM • K parallel matching units (allocators) • Every allocator now has K time slots to compute a matching • K = I·⌈Ti/Tc⌉ • Requests/grants issued in round-robin TDM fashion • In every time slot, one allocator receives a set of requests, and • one allocator issues a set of grants (and is reset) • Drawbacks • Minimum arbitration latency equals K time slots • Allocators cannot take most recent arrivals into account in subsequent iterations • [Figure: timing diagram of allocators A0 … AK-1: each allocator receives requests, spends Tarbitration computing its matching iterations M_k[1] … M_k[I] across K time slots Tc, then issues grants]
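A minimal model of PMM's round-robin TDM scheduling is sketched below (Python; the callables get_requests, compute_matching, and apply_grants are placeholders for the surrounding system, and the bookkeeping is an illustrative assumption). It makes the drawbacks explicit: the snapshot an allocator works on is K slots old by the time its grants are issued, so later arrivals cannot be considered.

```python
# Hedged sketch of PMM's TDM schedule: in slot t, allocator t % K issues the grants
# for the request snapshot it captured K slots earlier, is reset, and captures a new
# snapshot. The matching computation itself is abstracted into compute_matching().

def pmm_schedule(num_slots, K, get_requests, compute_matching, apply_grants):
    pending = {}  # allocator index -> (start_slot, request snapshot)
    for t in range(num_slots):
        k = t % K
        if k in pending:
            start_slot, snapshot = pending.pop(k)
            # Arbitration latency: t - start_slot == K time slots.
            apply_grants(t, compute_matching(snapshot))
        # The same allocator is reset and loaded with a fresh snapshot of the requests.
        pending[k] = (t, get_requests(t))
```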
FLPPR: Fast Low-latency Parallel Pipelined aRbitration • [Figure: pipelined allocators A0–A3 all connected to the shared VOQ state; in every time slot each allocator is at a different iteration M_k[1] … M_k[4] of its matching, and requests can be inserted at any stage of the pipeline]
Request and Grant Filtering • PMM = parallel allocators, TDM requests; FLPPR = pipelined allocators, parallel requests • FLPPR allows requests to be issued to any allocator in any time slot • A request filtering function determines the subset of allocators for every VOQ • Opportunity for performance optimization by selectively submitting requests to and accepting grants from specific allocators • Request and grant filtering • General class of algorithms • Request filter Rk determines the mapping between allocators and requests • Selective requests depending on Lij, Mk, k • Grant filter Fk can remove excess grants • [Figure: line-card requests pass through request filters R0 … RK-1 to allocators A0 … AK-1; the resulting matchings pass through grant filters F0 … FK-1; request and grant filters observe the VOQ state]
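One way to picture the per-slot data path implied by this structure is the sketch below (Python; the allocator object with add_request/iterate/matching/reset methods and the exact ordering of steps are assumptions for illustration, not the paper's specification).

```python
# Hedged sketch of one FLPPR time slot with K pipelined allocators sharing the VOQ state.

def flppr_time_slot(t, K, allocators, voq_state, request_filter, grant_filter):
    # 1. Request filtering: each VOQ (i, j) submits a request to a subset of allocators,
    #    chosen by the request filter Rk from the VOQ occupancy and pipeline state.
    for (i, j), occupancy in voq_state.items():
        for k in request_filter(i, j, occupancy, t):
            allocators[k].add_request(i, j)
    # 2. Every allocator advances its matching by one iteration in this time slot.
    for allocator in allocators:
        allocator.iterate()
    # 3. The allocator completing its last iteration emits a matching, which the grant
    #    filter Fk may trim (e.g. to cancel excess grants) before grants are issued.
    finishing = allocators[t % K]
    grants = grant_filter(finishing.matching(), voq_state)
    finishing.reset()
    return grants
```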
Example: N = 4, K = 2, without filtering • [Figure: worked example showing the VOQ counters Lij, the requests Rk issued to both allocators, the resulting matches Mk, and the grants Gij]
Example: N = 4, K = 2, with request filtering • [Figure: the same worked example with request filtering applied, showing the VOQ counters Lij, the filtered requests Rk, the matches Mk, and the grants Gij]
FLPPR Methods • We define three FLPPR variants • Method 1: Broadcast requests, selective post-filtering • Requests sent to all allocators; excess grants are cancelled • Method 2: Broadcast requests, no post-filtering • Requests sent to all allocators; no check for excess grants • May lead to “wasted” grants • Method 3: Selective requests, no post-filtering • Requests sent selectively (no more requests than current VOQ occupancy); no check for excess grants
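A possible per-VOQ rendering of these three variants as request and grant filters is sketched below (Python; the names occupancy and pending, and the choice of which allocators receive selective requests, are illustrative assumptions rather than the paper's exact definitions).

```python
def request_filter(method, occupancy, pending, K):
    """Allocators that receive a request for one VOQ in the current time slot.
    occupancy = current VOQ length L_ij, pending = requests for this VOQ already
    in flight in the pipeline, K = number of pipelined allocators."""
    if method in (1, 2):
        # Methods 1 and 2 broadcast: request at every allocator while the VOQ is non-empty.
        return set(range(K)) if occupancy > 0 else set()
    if method == 3:
        # Method 3 is selective: never keep more requests in flight than cells queued.
        # (Which specific allocators are chosen is a policy detail; here simply the
        # lowest-indexed ones.)
        budget = max(0, occupancy - pending)
        return set(range(min(K, budget)))
    raise ValueError("unknown FLPPR method")

def grant_filter(method, num_grants, occupancy):
    """Number of grants accepted for this VOQ in the current time slot."""
    if method == 1:
        # Method 1: the post-filter cancels grants in excess of the current occupancy.
        return min(num_grants, occupancy)
    # Methods 2 and 3: no post-filtering; method 2 may waste a grant on an empty VOQ,
    # method 3 avoids that by construction of its request filter.
    return num_grants
```

The trade-off is between matching opportunities (broadcasting requests maximizes them) and grant efficiency (selective requests or post-filtering avoid grants that no queued cell can use).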
Arbiter Implementation • [Figure: arbiter block diagram with 64 control-channel interfaces CC[00] … CC[63] (Rx/Tx), per-port request filters Rj and grant filters Fj operating on the VOQ state, allocators A[0] … A[K-1] interconnected via SCI, Tx links to SCC[00] … SCC[15], and SYSCLK & CTRL]
Conclusions • Problem: Short packet duration makes it hard to complete enough iterations per time slot • Pipelining achieves high matching rates with a highly distributed implementation • FLPPR pipelining with parallel requests has performance advantages • Eliminates the pipelining latency penalty at low load • Achieves 100% throughput with uniform traffic • Also reduces latency with respect to PMM at high load • Can improve throughput with nonuniform traffic • Request pre- and post-filtering allows performance optimization • Different traffic types may require different filtering rules • Future work: Find filtering functions that optimize both uniform and non-uniform performance • Highly amenable to distributed implementation in FPGAs • Can be applied to any existing iterative matching algorithm