
Low-Latency Pipelined Crossbar Arbitration



  1. Low-Latency Pipelined Crossbar Arbitration
  Cyriel Minkenberg, Ilias Iliadis, François Abel
  IBM Research, Zurich Research Laboratory

  2. Outline
  • Context: the OSMOSIS project
  • Problem: low-latency, high-throughput crossbar arbitration in FPGAs
  • Approach: a new way to pipeline parallel iterative matching algorithms
  • Simulation results: latency-throughput as a function of pipeline depth
  • Conclusions

  3. OSMOSIS
  • Optical Shared MemOry Supercomputer Interconnect System
  • Sponsored by DoE & NNSA as part of ASCI
  • Joint 2½-year project
    • Corning: optics and packaging
    • IBM: electronics (arbiter, input and output adapters) and system integration
  • High-Performance Computing (HPC)
    • Massively parallel computers (e.g. Earth Simulator, Blue Gene)
    • Low-latency, high-bandwidth, scalable interconnection networks
  • Main sponsor objective: solving the technical challenges and accelerating the cost reduction of all-optical packet switches for HPCS interconnects by
    • building a full-function all-optical packet switch demonstrator system
    • showing the scalability, performance, and cost paths for a potential commercial system
  • Key requirements:

  4. OSMOSIS System Architecture
  • All-optical switch with a broadcast-and-select architecture (crossbar)
  • Combination of wavelength- and space-division multiplexing
  • Fast switching based on SOAs
  • Electronic input and output adapters
  • Electronic arbitration
  [Figure: 64 ingress adapters (VOQs, combiner, Tx) feed 8 broadcast units (8x1 WDM mux, optical amplifier, 1x128 star coupler); 128 select units with fast SOA 1x8 fiber selector gates and 1x8 wavelength selector gates feed 64 egress adapters (2 Rx, EQ); a central arbiter running a bipartite graph matching algorithm is connected to the adapters over 64 control links]

  5. OSMOSIS Arbitration
  • Crossbar arbitration
    • Heuristic parallel iterative matching algorithms: RRM, PIM, i-SLIP, FIRM, DRRM, etc.
    • These require I = log2(N) iterations to achieve good performance
    • Mean latency decreases as the number of iterations increases
  • OSMOSIS: N = 64, hence I = 6 iterations
  • Problem
    • A single iteration takes too long (Ti) to complete I iterations within one time slot Tc
    • VHDL experiments indicate that Ti ≤ Tc ≤ 2Ti, so only one or two iterations fit in a time slot
    • Poor performance…
  • Solution: pipelining; however, in general this incurs a latency penalty
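
For illustration, the sketch below shows one request-grant-accept iteration in the style of i-SLIP, the kind of step these parallel iterative matching algorithms repeat I times to build a matching. It is a simplified software model, not the arbiter's actual VHDL; the function and variable names are illustrative, and the pointer-update rule is simplified (true i-SLIP updates pointers only for grants accepted in the first iteration).

    # One i-SLIP-style request-grant-accept iteration (software sketch).
    # requests[i][j] is True if input i has a cell queued for output j.
    # grant_ptr / accept_ptr are per-output / per-input round-robin pointers.
    def islip_iteration(N, requests, grant_ptr, accept_ptr, matched_in, matched_out):
        # Grant phase: each unmatched output grants the first requesting,
        # unmatched input at or after its round-robin pointer.
        grants = {}  # output -> input
        for out in range(N):
            if matched_out[out]:
                continue
            for k in range(N):
                inp = (grant_ptr[out] + k) % N
                if not matched_in[inp] and requests[inp][out]:
                    grants[out] = inp
                    break

        # Accept phase: each input accepts at most one grant, chosen round-robin.
        accepts = {}  # input -> output
        for inp in range(N):
            offered = [out for out, g in grants.items() if g == inp]
            if not offered:
                continue
            out = min(offered, key=lambda o: (o - accept_ptr[inp]) % N)
            accepts[inp] = out
            matched_in[inp] = True
            matched_out[out] = True
            # Simplification: pointers are advanced on every accept here.
            grant_ptr[out] = (inp + 1) % N
            accept_ptr[inp] = (out + 1) % N
        return accepts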

  6. Parallel Matching: PMM
  • K parallel matching units (allocators)
  • Every allocator now has K time slots to compute a matching
  • K = I·⌈Ti/Tc⌉
  • Requests/grants are issued in round-robin TDM fashion
    • In every time slot, one allocator receives a set of requests and one allocator issues a set of grants (and is then reset)
  • Drawbacks
    • Minimum arbitration latency equals K time slots
    • Allocators cannot take the most recent arrivals into account in subsequent iterations
  [Figure: pipeline timing diagram with allocators A0 … AK-1, each refining its matching Mk[i] iteration by iteration over K slots of duration Tc; requests enter and grants leave in round-robin order, and the arbitration latency Tarbitration spans the pipeline]
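
A minimal software sketch of the PMM schedule described above, assuming allocator objects with a load/iterate/result/reset interface (an assumed interface, not from the slides; each allocator could wrap the i-SLIP sketch shown earlier):

    # Sketch of the PMM round-robin TDM schedule over K allocators.
    # In slot t, one allocator is loaded with a snapshot of the requests and the
    # allocator loaded K slots earlier issues its grants and is reset, so each
    # allocator gets K time slots to run its I iterations.
    class PMM:
        def __init__(self, allocators):
            self.allocators = allocators
            self.K = len(allocators)           # pipeline depth

        def slot(self, t, requests):
            # Load a snapshot of the current requests into this slot's allocator;
            # cells arriving later are invisible to the matching it computes.
            self.allocators[t % self.K].load(requests)
            for alloc in self.allocators:
                alloc.iterate()                # advance every in-flight matching

            done = (t + 1) % self.K            # the oldest allocator in the pipeline
            grants = self.allocators[done].result()
            self.allocators[done].reset()
            return grants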

  7. FLPPR: Fast Low-latency Parallel Pipelined aRbitration
  [Figure: FLPPR pipeline timing with allocators A0–A3; all allocators are fed requests from the VOQ state in parallel, and each allocator k refines its matching Mk[i] by one iteration per time slot]

  8. Request and Grant Filtering
  • PMM = parallel allocators, TDM requests; FLPPR = pipelined allocators, parallel requests
  • FLPPR allows requests to be issued to any allocator in any time slot
  • A request filtering function determines, for every VOQ, the subset of allocators to which a request is issued
  • Opportunity for performance optimization by selectively submitting requests to, and accepting grants from, specific allocators
  • Request and grant filtering define a general class of algorithms
    • Request filter Rk determines the mapping between allocators and requests; requests can be issued selectively depending on Lij, Mk, and k
    • Grant filter Fk can remove excess grants
  [Figure: per-line-card datapath in which VOQ-state requests pass through request filters R0 … RK-1 into allocators A0 … AK-1, whose matchings pass through grant filters F0 … FK-1 back to the line card]
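
The per-slot flow with filtering hooks could be sketched as follows. The filter signatures and the allocator interface (add_requests, iterate, matching, reset) are assumptions made for illustration; the slides only state what Rk and Fk may depend on.

    # Sketch of one FLPPR time slot with pluggable request and grant filters.
    # Unlike PMM, every allocator may receive (filtered) requests in every slot;
    # one allocator still completes its matching per slot, round-robin as in PMM.
    def flppr_slot(t, allocators, voq_len, request_filter, grant_filter):
        K = len(allocators)

        # Parallel requests: feed a (possibly filtered) request set to every stage.
        for k, alloc in enumerate(allocators):
            alloc.add_requests(request_filter(k, voq_len))
            alloc.iterate()                    # one more matching iteration per stage

        # The allocator whose turn it is issues its matching, post-filtered.
        done = t % K
        grants = grant_filter(done, allocators[done].matching, voq_len)
        allocators[done].reset()
        return grants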

  9. Example, N = 4, K = 2, without filtering
  [Figure: worked example showing the 4x4 VOQ counters Lij, the requests Rk issued to the two allocators, the resulting matches Mk, and the grants Gij]

  10. Example, N = 4, K = 2, with request filtering
  [Figure: the same 4x4 example with request filtering applied: VOQ counters Lij, filtered requests Rk, matches Mk, and grants Gij]

  11. FLPPR Methods
  • We define three FLPPR variants
  • Method 1: broadcast requests, selective post-filtering
    • Requests are sent to all allocators; excess grants are cancelled
  • Method 2: broadcast requests, no post-filtering
    • Requests are sent to all allocators; there is no check for excess grants
    • May lead to "wasted" grants
  • Method 3: selective requests, no post-filtering
    • Requests are sent selectively (no more requests than the current VOQ occupancy); there is no check for excess grants
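
To make the three variants concrete, the hypothetical filter functions below could plug into the per-slot sketch above. The rules are simplified stand-ins (the slides describe the methods only at a high level, and finding good filtering functions is listed as future work); the request filters also take a pending-request count, a small extension over the earlier sketch.

    # Illustrative filter pairs for the three FLPPR methods. voq_len[i][j] is the
    # VOQ occupancy Lij; pending[i][j] counts requests for (i, j) already in the
    # pipeline.
    def broadcast_requests(k, voq_len, pending):
        # Methods 1 and 2: request (i, j) from every allocator whenever Lij > 0.
        n = len(voq_len)
        return [[1 if voq_len[i][j] > 0 else 0 for j in range(n)] for i in range(n)]

    def selective_requests(k, voq_len, pending):
        # Method 3: never keep more requests in flight than cells actually queued.
        n = len(voq_len)
        return [[1 if pending[i][j] < voq_len[i][j] else 0 for j in range(n)]
                for i in range(n)]

    def cancel_excess_grants(k, matching, voq_len):
        # Method 1: post-filter that drops grants for VOQs with no unserved cell
        # (assumes voq_len already reflects grants accepted in earlier slots).
        return {(i, j) for (i, j) in matching if voq_len[i][j] > 0}

    def keep_all_grants(k, matching, voq_len):
        # Methods 2 and 3: no post-filtering; some grants may be "wasted".
        return set(matching)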

  12. FLPPR performance – Uniform Bernoulli traffic

  13. FLPPR performance – Nonuniform Bernoulli traffic

  14. Arbiter Implementation
  [Figure: arbiter block diagram with 64 control-channel interfaces CC[00]–CC[63] (Rx/Tx, VOQ state, request filters Rj, grant filters Fj), allocators A[0]–A[K-1] interconnected via SCI, transmitters SCC[00]–SCC[15], and SYSCLK & CTRL]

  15. Conclusions
  • Problem: the short packet duration makes it hard to complete enough iterations per time slot
  • Pipelining achieves high matching rates with a highly distributed implementation
  • FLPPR pipelining with parallel requests has performance advantages
    • Eliminates the pipelining latency at low load
    • Achieves 100% throughput under uniform traffic
    • Also reduces latency with respect to PMM at high load
    • Can improve throughput under nonuniform traffic
  • Request pre- and post-filtering allows performance optimization
    • Different traffic types may require different filtering rules
    • Future work: find filtering functions that optimize both uniform and nonuniform performance
  • Highly amenable to distributed implementation in FPGAs
  • Can be applied to any existing iterative matching algorithm
