
NoC Switch: Basic Design Principles & Intra-Switch Performance Optimization


Presentation Transcript


  1. Tunis, December 2015 NoC Switch: Basic Design Principles & Intra-Switch Performance Optimization Instructor: Davide Bertozzi Email: davide.bertozzi@unife.it

  2. Acknowledgement • Many slides have been taken or adapted from Prof. Giorgos Dimitrakopoulos, Electrical and Computer Engineering, Democritus University of Thrace (DUTH), Greece • NoCS 2012 Tutorial: Switch design: A unified view of microarchitecture and circuits

  3. Switch Building Blocks

  4. Wormhole switch operation • The operations can fit in the same cycle or they can be pipelined • Extra registers are needed in the control path • Body/tail flits inherit the decisions taken by the head flit, yet they cannot bypass the RC and SA stages • There is simply nothing for them to do in those stages • Operation latency is an issue! • For single-cycle switches, the head flit is in any case the one that determines the critical timing path; body/tail flits would have slack!
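
A minimal behavioral sketch in Python (my illustration, not the lecture's code) of which stages carry real work for each flit type; the stage names RC/SA/ST follow the slides:

```python
def stages_for(flit_type):
    """Stages in which a flit of the given type has actual work to do."""
    if flit_type == "head":
        return ["RC", "SA", "ST"]  # the head flit takes every decision
    # Body/tail flits inherit the route and the allocated output port
    # from the head flit; in a pipelined switch they still occupy the
    # RC/SA stage slots, but those stages are no-ops for them.
    return ["ST"]

for t in ("head", "body", "tail"):
    print(f"{t}: {' -> '.join(stages_for(t))}")
```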

  5. Look-ahead routing • Routing computation is based only on the packet’s destination • It can therefore be performed in switch A and used in switch B • Look-ahead routing computation (LRC) • Does it really need to be a separate pipeline stage?

  6. Look-ahead routing optimization • The LRC can be performed in parallel with SA • LRC must complete before the ST stage of the same switch • The head flit needs to embed the output port requests for the next switch before leaving

  7. Look-ahead routing details • The head flit of each packet carries the output port requests for the next switch together with the destination address
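
As a concrete illustration, here is a Python sketch of look-ahead routing under XY routing on a 2D mesh (the function and port names are my own, not from the slides): the current switch computes the output port the packet will need at the next switch and embeds it in the head flit.

```python
def xy_port(cur, dst):
    """Classic XY routing: exhaust the X offset first, then the Y offset."""
    (cx, cy), (dx, dy) = cur, dst
    if dx != cx:
        return "E" if dx > cx else "W"
    if dy != cy:
        return "N" if dy > cy else "S"
    return "LOCAL"

STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1), "LOCAL": (0, 0)}

def lookahead(cur, dst):
    """Return the port used at this switch (already known from the
    previous hop) plus the port to embed in the header for the next
    switch: this is the LRC result."""
    here = xy_port(cur, dst)
    nxt = (cur[0] + STEP[here][0], cur[1] + STEP[here][1])
    return here, xy_port(nxt, dst)

print(lookahead((1, 1), (3, 2)))  # ('E', 'E')
print(lookahead((0, 0), (1, 0)))  # ('E', 'LOCAL'): next hop ejects
```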

  8. Low-latency organizations • Baseline: SA precedes ST (no speculation); SA is decoupled from ST • Predict or speculate the arbiter's decisions: the trick is that crossbar control does not come from the switch allocator; when the prediction is wrong, all the tasks are replayed (same as the baseline) • Do the tasks in different phases: circuit switching, with arbitration and routing in the setup phase and contention-free ST in the transmit phase • Bypass switches: reduce latency under certain criteria; when the bypass is not enabled, same as the baseline [stage diagrams: baseline LRC/SA, ST, LT in series; predictive/speculative SA overlapped with ST; circuit-switched Setup (LRC, SA, LT) followed by Transmit (ST, LT)]

  9. ST in parallel with SA: prediction • Target: overlap ST with SA (LRC in place) • Prediction criterion: a packet coming in from the east (if any) is likely to head west, because of XY routing in a 2D mesh • During idle time the predictor pre-sets the I/O connection of the West output port through the crossbar multiplexer • At runtime the prediction accuracy is verified; a mis-prediction is detected by comparing against the arbiter's outcome [figure: per-output arbiters (East, West, ...) with the predictor pre-setting the West output multiplexer]

  10. ST in parallel with SA: speculation • At the beginning of the cycle, requests are fed both to the allocator's arbiters and to fast speculation logic • The crossbar multiplexer control signals are set on the fly by the speculation logic • At the end of the cycle, the arbiters' results are compared with the outcome of the speculation logic; on a mis-speculation the wrongly forwarded flit is masked • Next step: switch traversal from the Local input [figure: East-output arbiter running in parallel with the speculation logic that drives the East multiplexer]

  11. Prediction-based ST: hit • Target: overlap ST with SA • Assumption: RC is a pipeline stage (no LRC) • Idle state: output port X+ is selected and reserved for input X+, and the crossbar is pre-set accordingly • 1st cycle: the incoming flit is transferred to X+ without waiting for RC and SA; RC is performed in the same cycle, and the prediction turns out to be correct! • Outcome: SA, ST and RC were actually performed in parallel! [figure: predictor and arbiter driving a crossbar with X+/X-/Y+/Y- ports and buffers]

  12. Prediction-based ST: miss • Idle state: output port X+ is selected and reserved • 1st cycle: the incoming flit is transferred to X+ without RC and SA; RC is performed in the same cycle, and the prediction turns out to be wrong! (X- is correct) • The kill signal to X+ is asserted and the dead flit is removed • 2nd/3rd cycle: move on with SA • On a miss, the tasks are replayed as in the baseline case (at least RC was done, so proceed to SA) [figure: crossbar with the dead flit killed at the X+ output]
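
A small Python sketch of the hit/miss handshake just described (my own naming, and a behavioral simplification of the slides' datapath): the flit is forwarded to the predicted port while RC runs in parallel, and a mismatch asserts the kill signal and triggers a replay.

```python
def traverse_with_prediction(predicted_port, real_port):
    """Cycle 1: the flit is already on its way to predicted_port (ST),
    while RC completes in the same cycle and yields real_port."""
    if real_port == predicted_port:
        # Hit: SA, ST and RC effectively overlapped in one cycle.
        return {"hit": True, "kill": None}
    # Miss: assert kill towards the wrongly chosen output so the dead
    # flit is dropped there, then replay the remaining tasks (SA, ST).
    return {"hit": False, "kill": predicted_port, "replay": ["SA", "ST"]}

print(traverse_with_prediction("X+", "X+"))  # hit: zero added latency
print(traverse_with_prediction("X+", "X-"))  # miss: kill X+, replay SA/ST
```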

  13. Speculative ST • Assumption: RC is already done (LRC) • Speculation criterion: assume contention doesn't happen! • If correct, the flit is transferred directly to the output port without waiting for SA • In case of contention, SA was performed anyway, so move on with ST accordingly; one cycle is wasted • Generation and management of abort events is needed [timing diagram: flits A and B compete for port p0; A wins, B is replayed with only SA done]

  14. Efficient recovery from mis-speculation: XOR-based recovery • Assume contention never happens • If correct, the flit is transferred directly to the output port • If not, bitwise-XOR all the competing flits and send the encoded result on the link • At the same time, arbitrate and mask (set to 0) the winning input • Repeat on the next cycle • In the case of contention, the encoded outputs are resolved at the receiver • This can also be done at the output port of the switch [timing diagram: with contention, A^B leaves in the first cycle and the masked loser follows alone]

  15. XOR-based recovery • Works thanks to a simple XOR property: (A^B^C) ^ (B^C) = A • The receiver is always able to decode by XOR-ing two sequential values • Performs similarly to speculative switches • Only head-flit collisions matter • Maintains the previous router's arbitration order [figure: the coded stream A^B^C, B^C, C is decoded into A, B, C at the receiver's flit buffer]
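
The XOR identity the scheme relies on can be checked directly; below is a short Python sketch with small integers standing in for flit payloads (an illustration of the property, not the original hardware):

```python
A, B, C = 0xA5, 0x3C, 0x0F

# Three flits collide: the switch sends the XOR of all competitors,
# then masks the arbitration winner and repeats on the next cycle.
stream = [A ^ B ^ C, B ^ C, C]

# The receiver decodes each word by XOR-ing two sequential values:
# (A^B^C) ^ (B^C) = A, (B^C) ^ C = B, and C arrives uncoded.
decoded = [stream[0] ^ stream[1], stream[1] ^ stream[2], stream[2]]
assert decoded == [A, B, C]
print([hex(x) for x in decoded])  # ['0xa5', '0x3c', '0xf']
```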

  16. Virtual bypassing paths • Switch bypassing criteria: frequently used paths, packets continually moving along the same dimension • Bypassed intermediate switches are traversed in 1 cycle instead of 3 • Most techniques can bypass some pipeline stages only for specific packet transfers and traffic patterns, so they are not generic enough [figure: SRC-to-DST path of 3-cycle switches with two intermediate nodes bypassed at 1 cycle]

  17. Speculation-free low-latency switches • Prediction and speculation drawbacks: on a mis-prediction (mis-speculation) the tasks must be replayed, so latency is not always saved; it depends on network conditions • Merged Switch Allocation and Traversal (SAT) • Latency is always saved (no speculation) • The delay of SAT is smaller than that of SA and ST in series

  18. How can we increase throughput? • The green flow is blocked until the red one passes the switch, while the physical channel is left idle • The solution is to have separate buffers for each flow

  19. Virtual Channels • Decouple output port allocation from next-hop buffer allocation • Contention is present on the output links (crossbar output ports) and on the input ports of the crossbar • Contention is resolved by time-sharing the resources • Words of two packets are mixed on the same channel, but on different virtual channels • A virtual channel identifier must therefore accompany each word • Separate buffers at the end of the link guarantee no blocking between the packets

  20. Virtual Channels • Virtual-channel support does not mean extra links • They act as extra street lanes • Traffic on each lane is time shared on a common channel • Provide dedicated buffer space for each virtual channel • Decouple channels from buffers • Interleave flits from different packets • “The Swiss Army Knife for Interconnection Networks” • Reduce head-of-line blocking • Prevent deadlocks • Provide QoS, fault-tolerance, …

  21. Datapath of a VC-based switch • Separate buffer for each VC • Separate flow control signals (credits/stalls) for each VC • The radix of the crossbar may stay the same (or may increase) • A higher number of input ports increases propagation delay through the crossbar • Input VCs may share a common input port of the crossbar • Alternatively, crossbars can be replicated • On each cycle at most one VC will receive a new word
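
A minimal data-structure sketch of this datapath in Python (class and field names are my assumptions): one FIFO per VC at the input port, all VCs of the port sharing a single crossbar input, and per-VC credit counters at the output side tracking downstream buffer space.

```python
from collections import deque

NUM_VCS = 2

class InputPort:
    """Per-VC buffering; all VCs share one crossbar input port."""
    def __init__(self):
        self.fifo = [deque() for _ in range(NUM_VCS)]

class OutputPort:
    """Credits count free slots in the downstream input buffers, per VC."""
    def __init__(self, credits_per_vc=4):
        self.credits = [credits_per_vc] * NUM_VCS

inp, out = InputPort(), OutputPort()
inp.fifo[0].append("head")  # at most one VC receives a new word per cycle
# A flit may leave only if the downstream VC buffer has a free slot:
ready = bool(inp.fifo[0]) and out.credits[0] > 0
print(ready)  # True
```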

  22. Per-packet operation of a VC-based switch • A switch connects input VCs to output VCs • Routing computation (RC) determines the output port; it can be shared among the VCs of an input port and may restrict the usable output VCs (e.g., based on message type or destination ID) • An input VC must first be allocated an output VC • Allocation is performed by a VC allocator (VA) • RC and VA are done per packet on the head flits and inherited by the remaining flits of the packet

  23. Per-flit operation of a VC-based switch • Flits with an allocated output VC fight for an output port • The output port is allocated by the switch allocator • This entails 2 levels of arbitration: at the input port and at the output port • The VCs of the same input share a common input port of the crossbar, so each input can present multiple requests (up to the number of input VCs) • A flit leaves the switch only if credits are available downstream • Credits are counted per output VC • Unfortunate case: VC & port are allocated to an input VC, but no credits are available

  24. Switch allocation • All VCs at a given input port share one crossbar input port • The switch allocator matches ready-to-go flits with crossbar time slots • Allocation is performed on a cycle-by-cycle basis • N×V requests (input VCs), N resources (output ports) • At most one flit at each input port can be granted • At most one flit at each output port can be sampled • Other options need more crossbar ports (input-output speedup)

  25. Switch allocation example • One request (arc) for each input VC • Example with 2 VCs per input: at most 2 arcs leave each input, i.e., at most 2 requests per row of the request matrix • The allocation is a matching problem: each grant must satisfy a request, each requester gets at most one grant, and each resource is granted at most once [figure: bipartite request graph between inputs 0-2 and outputs 0-2, with the equivalent request matrix]

  26. Separable allocation • Matchings have at most one grant per row and per column • Two phases of arbitration, column-wise and row-wise, performed in either order (input-first or output-first) • The arbiters in each stage are independent, but the outcome of each one affects the quality of the overall match • Fast and cheap • Bad choices in the first phase can prevent the second stage from generating a good matching • Multiple iterations are required for a good match: iterative scheduling converges to a maximal matching, but is unaffordable for high-speed networks • A sketch of the input-first scheme follows below
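
Below is a plain-Python sketch of the input-first variant (the round-robin arbiters and the data layout are my assumptions, not the lecture's RTL): each input first picks one of its own requests, then each output arbitrates among the inputs that picked it.

```python
def rr_arbiter(requests, last_grant):
    """Round-robin: grant the first asserted request after last_grant."""
    n = len(requests)
    for k in range(1, n + 1):
        idx = (last_grant + k) % n
        if requests[idx]:
            return idx
    return None

def input_first_allocate(requests, n_outputs, in_ptr, out_ptr):
    """requests[i]: set of output ports wanted by ready flits at input i."""
    n_inputs = len(requests)
    # Phase 1 (row-wise): each input locally picks one of its requests.
    picked = [rr_arbiter([o in requests[i] for o in range(n_outputs)],
                         in_ptr[i]) for i in range(n_inputs)]
    # Phase 2 (column-wise): each output arbitrates among competing inputs.
    grants = {}
    for o in range(n_outputs):
        winner = rr_arbiter([picked[i] == o for i in range(n_inputs)],
                            out_ptr[o])
        if winner is not None:
            grants[winner] = o  # input 'winner' gets output 'o'
    return grants

# Inputs 0 and 1 both want output 0; input 1 can also use output 1.
print(input_first_allocate([{0}, {0, 1}], 2, [0, 0], [0, 0]))  # {0: 0, 1: 1}
# A different phase-1 pointer makes input 1 pick output 0 as well, so
# output 1 stays idle: the "bad first-phase choice" of the slide.
print(input_first_allocate([{0}, {0, 1}], 2, [0, 1], [0, 0]))  # {1: 0}
```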

  27. Implementation: input-first allocation (row-wise)

  28. Implementation: output-first allocation (column-wise)

  29. Centralized allocator • Wavefront allocation (sketched below) • Pick an initial diagonal • Grant all requests on each diagonal; they never conflict! • For each grant, delete the requests in the same row and column • Repeat for the next diagonal
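
A compact Python rendering of the algorithm as stated on the slide (the matrix encoding and the starting-diagonal choice are mine): cells on one diagonal never share a row or column, so all of them can be granted at once.

```python
def wavefront(req, start=0):
    """req: n x n 0/1 request matrix. Returns (input, output) grants."""
    n = len(req)
    rows, cols, grants = set(), set(), []
    for d in range(n):                  # sweep the n diagonals in order
        diag = (start + d) % n
        for i in range(n):
            j = (i + diag) % n          # diagonal cells (i, (i+diag) mod n)
            if req[i][j] and i not in rows and j not in cols:
                grants.append((i, j))
                rows.add(i)             # "delete" same-row requests
                cols.add(j)             # "delete" same-column requests
    return grants

req = [[1, 1, 0],
       [1, 0, 0],
       [0, 0, 1]]
print(wavefront(req, start=0))  # [(0, 0), (2, 2)]
print(wavefront(req, start=1))  # [(0, 1), (1, 0), (2, 2)]: a better match
```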

  30. Switch allocation for adaptive routing • Input VCs can request more than one output port, called the set of Admissible Output Ports (AOP) • This adds an extra selection step (not arbitration); the selection mostly tries to load-balance the traffic • Input-first allocation: for each input VC, select one request of the AOP; arbitrate locally per input and select one input VC; arbitrate globally per output and select one VC from all the competing inputs • Output-first allocation: send all requests of the AOP of each input VC to the outputs; arbitrate globally per output and grant one request; arbitrate locally per input and grant an input VC; for this input VC, select one of the possibly multiple grants of the AOP set

  31. VC allocation • Virtual channels (VCs) allow multiple packet flows to share physical resources (buffers, channels) • Before a packet can proceed through the router, it needs to claim ownership of a VC buffer at the next router • The VC is acquired by the head flit and inherited by the body & tail flits • The VC allocator assigns waiting packets at the inputs to output VC buffers that are not currently in use • N×V inputs (input VCs), N×V outputs (output VCs) • Once assigned, the VC is used for the packet's entire duration in the switch

  32. VC allocation example • An input VC may request any of the VCs of a given output port • In case of adaptive routing, an input VC may request VCs from different output ports • There are no port constraints as in switch allocators • Allocation can be either separable (2 arbitration steps) or centralized • At most one grant per input VC • At most one grant per output VC [figure: request and grant bipartite graphs between 6 input VCs (2 per input, In#0-In#2) and 6 output VCs (2 per output, Out#0-Out#2)]

  33. Input-output VC mapping • Any-to-any flexibility in the VC allocator is unnecessary • Partition the set of VCs to restrict the legal requests • Different use cases for VCs restrict the possible transitions: the message class never changes, and VCs within a packet class are functionally equivalent • These properties can be exploited to reduce VC allocator complexity, as in the sketch below!
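
A tiny Python sketch of such a partition (the two message classes and the VC-to-class table are invented for illustration): because the message class never changes in flight, an input VC only ever requests same-class output VCs, which shrinks the allocator's request wiring.

```python
VC_CLASS = {0: "req", 1: "req", 2: "resp", 3: "resp"}  # 4 VCs, 2 classes

def legal_output_vcs(input_vc):
    """Same-class VCs only: the message class never changes in flight."""
    cls = VC_CLASS[input_vc]
    return [v for v, c in VC_CLASS.items() if c == cls]

print(legal_output_vcs(0))  # [0, 1]
print(legal_output_vcs(3))  # [2, 3]
```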

  34. Single-cycle VA or pipelined organization • Head flits see a longer latency than body/tail flits • RC and VA decisions are taken for head flits and inherited by the rest of the packet • Every flit fights for SA • Can we parallelize SA and VA?

  35. The order of VC and switch allocation • VA first, SA follows: only packets with an allocated output VC fight for SA • Alternatively, VA and SA can be performed concurrently: speculate that waiting packets will successfully acquire a VC, and prioritize non-speculative requests over speculative ones for SA (see the sketch below) • Speculation holds only for the head flits (body/tail flits always know their output VC)
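
A minimal Python sketch of the priority rule (fixed priority instead of a real arbiter, purely for brevity): a speculative request, i.e., a head flit still waiting on VA, wins an output only when no non-speculative request wants it.

```python
def grant_output(requests):
    """requests: (input_id, is_speculative) pairs targeting one output."""
    non_spec = [r for r in requests if not r[1]]
    pool = non_spec if non_spec else requests  # non-speculative first
    return pool[0][0] if pool else None        # fixed priority, for brevity

print(grant_output([(0, True), (2, False)]))  # 2: non-speculative wins
print(grant_output([(1, True)]))              # 1: speculation may proceed
```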

  36. Free list of VCs per output • A VC can be assigned non-speculatively after SA • A free list of output VCs exists at each output • The flit that was granted access to this output receives the first free VC before leaving the switch • If no VC is available, the output-port allocation slot is missed and the flit retries switch allocation • VCs are not unnecessarily occupied by flits that don't win SA • Further optimizations are feasible: flits are allowed to compete in SA for a target port only if there are free VCs at that output port • A sketch of the free-list mechanism follows
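
The mechanism fits in a few lines of Python (the port names and the VC count are assumptions of mine): each output keeps a free list, the SA winner pops the head of that list on its way out, and the VC returns to the list once the packet's tail flit frees it downstream.

```python
free_vcs = {port: list(range(4)) for port in ("N", "E", "S", "W", "L")}

def assign_vc_after_sa(output_port):
    """Called for the flit that just won SA for this output."""
    if free_vcs[output_port]:
        return free_vcs[output_port].pop(0)  # first free output VC
    return None  # no VC free: the slot is missed and the flit retries SA

def release_vc(output_port, vc):
    """Called when the packet's tail flit frees the downstream VC."""
    free_vcs[output_port].append(vc)

print(assign_vc_after_sa("E"))  # 0
print(assign_vc_after_sa("E"))  # 1
```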
