This paper presents solutions for low-latency, high-throughput crossbar arbitration in FPGAs, developed as part of the OSMOSIS project. We describe a new approach that pipelines parallel iterative matching algorithms to enhance performance. Simulation experiments characterize latency and throughput as functions of pipeline depth, establishing the viability of the proposed solutions for massively parallel computing applications. We also outline the system architecture, which combines wavelength- and space-division multiplexing, and discuss how our techniques address the technical challenges of all-optical packet switching.
Low-Latency Pipelined Crossbar Arbitration Cyriel Minkenberg, Ilias Iliadis, François Abel IBM Research, Zurich Research Laboratory
Outline • Context • OSMOSIS project • Problem • Low-latency, high-throughput crossbar arbitration in FPGAs • Approach • A new way to pipeline parallel iterative matching algorithms • Simulation results • Latency-throughput as a function of pipeline depth • Conclusions
OSMOSIS • Optical Shared MemOry Supercomputer Interconnect System • Sponsored by DoE & NNSA as part of ASCI • Joint 2½-year project • Corning: Optics and packaging • IBM: Electronics (arbiter, input and output adapters) and system integration • High-Performance Computing (HPC) • Massively parallel computers (e.g. Earth Simulator, Blue Gene) • Low-latency, high-bandwidth, scalable interconnection networks • Main sponsor objective • Solving the technical challenges and accelerating the cost reduction of all-optical packet switches for HPCS interconnects by • building a full-function all-optical packet switch demonstrator system • showing the scalability, performance and cost paths for a potential commercial system • Key requirements:
OSMOSIS System Architecture: All-optical Switch • Broadcast-and-select architecture (crossbar) • Combination of wavelength- and space-division multiplexing • Fast switching based on SOAs • Electronic input and output adapters • Electronic arbitration • [Figure: 64 ingress adapters with VOQs and combiners feed 8 broadcast units (8x1 WDM mux, star coupler, optical amplifier, 1x128 split); 128 select units, each with fast SOA 1x8 fiber selector gates and 1x8 wavelength selector gates, feed 64 egress adapters; 64 control links connect the adapters to the central arbiter running a bipartite graph matching algorithm]
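As an illustration of how the wavelength and space dimensions jointly address the 64 ingress adapters, the following minimal sketch shows which fiber-selector and wavelength-selector gates a select unit would open to receive from a given source port. The port-to-gate mapping (eight adapters grouped per broadcast unit) is an assumption made for illustration; the actual OSMOSIS mapping may differ.

```python
# Minimal sketch, assuming ingress adapters are grouped eight per broadcast unit;
# the concrete OSMOSIS port-to-gate mapping is an assumption made for illustration.

NUM_PORTS = 64            # ingress adapters
NUM_BROADCAST_UNITS = 8   # one fiber per broadcast unit (space division)
NUM_WAVELENGTHS = 8       # wavelengths multiplexed per fiber (wavelength division)

def select_gates(src_port: int) -> tuple[int, int]:
    """Return the (fiber, wavelength) gates a select unit opens to receive the
    cell broadcast by ingress adapter `src_port`."""
    assert 0 <= src_port < NUM_PORTS
    fiber = src_port // NUM_WAVELENGTHS      # which 1x8 fiber-selector SOA gate to open
    wavelength = src_port % NUM_WAVELENGTHS  # which 1x8 wavelength-selector SOA gate to open
    return fiber, wavelength

# Example: a cell from ingress adapter 42 is received by opening fiber gate 5 and
# wavelength gate 2 (8 x 8 = 64 source ports in total).
print(select_gates(42))  # -> (5, 2)
```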
OSMOSIS Arbitration • Crossbar arbitration • Heuristic parallel iterative matching algorithms • RRM, PIM, i-SLIP, FIRM, DRRM, etc. • These require I = log2(N) iterations to achieve good performance • Mean latency decreases as the number of iterations increases • OSMOSIS • N = 64, hence I = 6 iterations • Problem • A single iteration takes too long (Ti) to complete all I iterations within one time slot Tc • VHDL experiments indicate that Ti ≤ Tc < 2·Ti, so only about one iteration fits per time slot • Poor performance… • Solution • Pipelining • however, in general this incurs a latency penalty
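For reference, a compact sketch of one request-grant-accept iteration of a PIM-style matcher is shown below (Python, simplified: PIM chooses randomly, whereas i-SLIP and FIRM use round-robin pointers; the function names and data layout are illustrative assumptions, not the OSMOSIS implementation).

```python
import random

def pim_iteration(requests, in_matched, out_matched, match):
    """One request-grant-accept iteration of a PIM-style matcher.
    requests[i][j] is True if input i has a cell queued (in VOQ ij) for output j."""
    N = len(requests)
    # Grant phase: every unmatched output grants one of the unmatched inputs requesting it.
    grants = {}                      # output j -> granted input i
    for j in range(N):
        if j in out_matched:
            continue
        candidates = [i for i in range(N) if i not in in_matched and requests[i][j]]
        if candidates:
            grants[j] = random.choice(candidates)
    # Accept phase: every input accepts at most one of the grants it received.
    offers = {}                      # input i -> outputs that granted it
    for j, i in grants.items():
        offers.setdefault(i, []).append(j)
    for i, outputs in offers.items():
        j = random.choice(outputs)
        match[i] = j
        in_matched.add(i)
        out_matched.add(j)

def iterative_match(requests, iterations):
    """Run several iterations; edges found in later iterations extend the matching."""
    match, in_matched, out_matched = {}, set(), set()
    for _ in range(iterations):
        pim_iteration(requests, in_matched, out_matched, match)
    return match
```

For OSMOSIS, N = 64 calls for I = 6 such iterations per matching, which a single time slot cannot accommodate when Tc is on the order of Ti; this is what motivates pipelining.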
Parallel Matching: PMM • K parallel matching units (allocators) • Every allocator now has K time slots to compute a matching • K = I·⌈Ti/Tc⌉ • Requests/grants issued in round-robin TDM fashion • In every time slot, one allocator receives a set of requests, and • one allocator issues a set of grants (and is reset) • Drawbacks • Minimum arbitration latency equals K time slots • Allocators cannot take most recent arrivals into account in subsequent iterations • [Figure: timing diagram of allocators A0 … AK-1: each allocator receives requests, spends Tarbitration computing its matching iterations M_k[1] … M_k[I] across K time slots Tc, then issues grants]
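A minimal model of PMM's round-robin TDM scheduling is sketched below (Python; the callables get_requests, compute_matching, and apply_grants are placeholders for the surrounding system, and the bookkeeping is an illustrative assumption). It makes the drawbacks explicit: the snapshot an allocator works on is K slots old by the time its grants are issued, so later arrivals cannot be considered.

```python
# Hedged sketch of PMM's TDM schedule: in slot t, allocator t % K issues the grants
# for the request snapshot it captured K slots earlier, is reset, and captures a new
# snapshot. The matching computation itself is abstracted into compute_matching().

def pmm_schedule(num_slots, K, get_requests, compute_matching, apply_grants):
    pending = {}  # allocator index -> (start_slot, request snapshot)
    for t in range(num_slots):
        k = t % K
        if k in pending:
            start_slot, snapshot = pending.pop(k)
            # Arbitration latency: t - start_slot == K time slots.
            apply_grants(t, compute_matching(snapshot))
        # The same allocator is reset and loaded with a fresh snapshot of the requests.
        pending[k] = (t, get_requests(t))
```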
FLPPR: Fast Low-latency Parallel Pipelined aRbitration • [Figure: pipelined allocators A0–A3 all connected to the shared VOQ state; in every time slot each allocator is at a different iteration M_k[1] … M_k[4] of its matching, and requests can be inserted at any stage of the pipeline]
Request and Grant Filtering • PMM = parallel allocators, TDM requests; FLPPR = pipelined allocators, parallel requests • FLPPR allows requests to be issued to any allocator in any time slot • A request filtering function determines the subset of allocators for every VOQ • Opportunity for performance optimization by selectively submitting requests to and accepting grants from specific allocators • Request and grant filtering • General class of algorithms • Request filter Rk determines the mapping between allocators and requests • Selective requests depending on Lij, Mk, k • Grant filter Fk can remove excess grants • [Figure: line-card requests pass through request filters R0 … RK-1 to allocators A0 … AK-1; the resulting matchings pass through grant filters F0 … FK-1; request and grant filters observe the VOQ state]
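One way to picture the per-slot data path implied by this structure is the sketch below (Python; the allocator object with add_request/iterate/matching/reset methods and the exact ordering of steps are assumptions for illustration, not the paper's specification).

```python
# Hedged sketch of one FLPPR time slot with K pipelined allocators sharing the VOQ state.

def flppr_time_slot(t, K, allocators, voq_state, request_filter, grant_filter):
    # 1. Request filtering: each VOQ (i, j) submits a request to a subset of allocators,
    #    chosen by the request filter Rk from the VOQ occupancy and pipeline state.
    for (i, j), occupancy in voq_state.items():
        for k in request_filter(i, j, occupancy, t):
            allocators[k].add_request(i, j)
    # 2. Every allocator advances its matching by one iteration in this time slot.
    for allocator in allocators:
        allocator.iterate()
    # 3. The allocator completing its last iteration emits a matching, which the grant
    #    filter Fk may trim (e.g. to cancel excess grants) before grants are issued.
    finishing = allocators[t % K]
    grants = grant_filter(finishing.matching(), voq_state)
    finishing.reset()
    return grants
```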
Example: N = 4, K = 2, without filtering • [Figure: worked example showing the VOQ counters Lij, the requests Rk issued to both allocators, the resulting matches Mk, and the grants Gij]
Example: N = 4, K = 2, with request filtering • [Figure: the same worked example with request filtering applied, showing the VOQ counters Lij, the filtered requests Rk, the matches Mk, and the grants Gij]
FLPPR Methods • We define three FLPPR variants • Method 1: Broadcast requests, selective post-filtering • Requests sent to all allocators; excess grants are cancelled • Method 2: Broadcast requests, no post-filtering • Requests sent to all allocators; no check for excess grants • May lead to “wasted” grants • Method 3: Selective requests, no post-filtering • Requests sent selectively (no more requests than current VOQ occupancy); no check for excess grants
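A possible per-VOQ rendering of these three variants as request and grant filters is sketched below (Python; the names occupancy and pending, and the choice of which allocators receive selective requests, are illustrative assumptions rather than the paper's exact definitions).

```python
def request_filter(method, occupancy, pending, K):
    """Allocators that receive a request for one VOQ in the current time slot.
    occupancy = current VOQ length L_ij, pending = requests for this VOQ already
    in flight in the pipeline, K = number of pipelined allocators."""
    if method in (1, 2):
        # Methods 1 and 2 broadcast: request at every allocator while the VOQ is non-empty.
        return set(range(K)) if occupancy > 0 else set()
    if method == 3:
        # Method 3 is selective: never keep more requests in flight than cells queued.
        # (Which specific allocators are chosen is a policy detail; here simply the
        # lowest-indexed ones.)
        budget = max(0, occupancy - pending)
        return set(range(min(K, budget)))
    raise ValueError("unknown FLPPR method")

def grant_filter(method, num_grants, occupancy):
    """Number of grants accepted for this VOQ in the current time slot."""
    if method == 1:
        # Method 1: the post-filter cancels grants in excess of the current occupancy.
        return min(num_grants, occupancy)
    # Methods 2 and 3: no post-filtering; method 2 may waste a grant on an empty VOQ,
    # method 3 avoids that by construction of its request filter.
    return num_grants
```

The trade-off is between matching opportunities (broadcasting requests maximizes them) and grant efficiency (selective requests or post-filtering avoid grants that no queued cell can use).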
Arbiter Implementation • [Figure: arbiter block diagram with 64 control-channel interfaces CC[00] … CC[63] (Rx/Tx), per-port request filters Rj and grant filters Fj operating on the VOQ state, allocators A[0] … A[K-1] interconnected via SCI, Tx links to SCC[00] … SCC[15], and SYSCLK & CTRL]
Conclusions • Problem: Short packet duration makes it hard to complete enough iterations per time slot • Pipelining achieves high matching rates with a highly distributed implementation • FLPPR pipelining with parallel requests has performance advantages • Eliminates the pipelining latency penalty at low load • Achieves 100% throughput with uniform traffic • Also reduces latency with respect to PMM at high load • Can improve throughput with nonuniform traffic • Request pre- and post-filtering allows performance optimization • Different traffic types may require different filtering rules • Future work: Find filtering functions that optimize both uniform and non-uniform performance • Highly amenable to distributed implementation in FPGAs • Can be applied to any existing iterative matching algorithm