
A Low-Latency Adaptive Asynchronous Interconnection Network Using Bi-Modal Router Nodes

A Low-Latency Adaptive Asynchronous Interconnection Network Using Bi-Modal Router Nodes. Gennette Gill, Sumedh S. Attarde, Geoffray Lacourba and Steven M. Nowick. Department of Computer Science, Columbia University New York, NY, USA. ACM/IEEE Int. Symp. on Networks-on-Chip (NOCS-11).


Presentation Transcript


  1. A Low-Latency Adaptive Asynchronous Interconnection Network Using Bi-Modal Router Nodes Gennette Gill, Sumedh S. Attarde, Geoffray Lacourba and Steven M. Nowick Department of Computer Science, Columbia University New York, NY, USA ACM/IEEE Int. Symp. on Networks-on-Chip (NOCS-11)

  2. Motivation for Networks-on-Chip Future of computing is multi-core 2 to 4 cores are common, 8 to 16 widely available e.g. Niagara 16-core, Intel 10-core Xeon, AMD 12-core Opteron Expected progression: hundreds or thousands of cores Trend towards complex systems-on-chip (SoC) Communication complexity: new limiting factor NoC design enables orthogonalization of concerns: Improves scalability Buses and crossbars unable to deliver desired bandwidth Global ad-hoc wiring does not scale to large systems Provides flexibility Handle pre-scheduled and dynamic traffic Route around faulty network nodes Facilitates design reuse Standard interfaces increase modularity, decrease design time

  3. Key Active Research Challenges for NoCs Power consumption Will exceed future power budgets by a factor of 10x [Owens IEEE Micro-07] Global clocks: consume large fraction of overall power Complex clock-gating techniques [Benini TVLSI-02] Chips partitioned into multiple timing domains Difficult to integrate heterogeneous modules Dynamic frequency/voltage scaling (DFVS) for lower power [Ogras/Marculescu DAC-08] Performance bottleneck: latency Latency critical for on-chip memory access Important for shared-memory chip multiprocessors (CMP)

  4. Potential Advantages of Asynchronous Design Lower power No clock power consumed Idle components consume no dynamic power Low-power microcontroller from Philips [van Gageldonk/Berkel ASYNC-98] Greater flexibility/modularity Easier integration between multiple timing domains Supports reusable components [Bainbridge/Furber IEEE MICRO-02], [Dobkin/Ginosar ASYNC-04] Lower system latency No per-router clock synchronization → no waiting for clock [Sheibanyrad/Greiner et al. D&T-08], [Horak, Nowick et al. NOCS-10]

  5. Motivations for Our Research • Target = interconnection network for CMPs • Network between processors and memory • Mixed timing network: sync/async interfaces + async network • Motivating example: XMT parallel architecture [Naishlos/Vishkin SPAA-01] • Requires high performance • Low system-level latency: lightweight routers for low latency • High sustained throughput: maximize steady-state throughput • Our two main contributions: • High-performance async. network with fine-grained dynamic adaptivity • Simulation framework and detailed results for 8 benchmarks

  6. Contributions (1) • Introduced a new bi-modal arbitration node • Default mode: normal arbitration between incoming flits • Biased mode: creates “fast forward” path for one incoming channel • Mode change: based on local recent traffic history • Enter biased mode when only one channel is active • Net benefit: entirely avoid arbitration in biased mode • Created a hybrid packet-/circuit-switched network • Default = packet-switching • Biased = localized circuit-switching whenever possible • Very fine-grained adaptability: across both space and time • Each bi-modal node can reconfigure on a per-packet basis • Net benefit: lower system-level latency, higher throughput

  7. Contributions (2) • Asynchronous network simulation environment • Configure for many benchmarks: mix deterministic and random traffic • Launches flits asynchronously, following a Poisson distribution • Debugging instrumentation on a per-node and global basis • Detailed experimentation and analysis • New “adaptive” network vs. “baseline” [Horak/Nowick NOCS-10] • 8 diverse benchmarks: • Random and deterministic traffic • Static and dynamic traffic patterns • Significant improvements in throughput and latency • System latency (up to 19.8%) + throughput (up to 27.8%) • End-to-end system latency • 1.8-2.8 ns (at 25% load) through 6 router nodes + 5 hops

  8. Previous Work: GALS GALS: Globally-Asynchronous Locally-Synchronous Researchers turning to GALS for NoC solutions Recent GALS research targets Low power: Dynamic voltage scaling [Iyer/Marculescu ISCA-02] Voltage management, multiple clock domains [Zhu/Albonesi ISCA-05] Scalability/modularity: Tile-based design [Yu/Baas ICCD-06] High performance: Fulcrum’s FocalPoint Ethernet Switch [Lines MICRO-04] Interconnection network for CMPs[Horak, Nowick et al. NOCS-10] Quality of service: Guaranteed service + best effort [Ginosar ASYNC-05], [Sparsø DATE-05]

  9. Previous Work (continued) Dynamic adaptability: Power regulation for asynchronous interconnection [Thonnart ASYNC-08] Multi-threshold gates [Imai/Nanya ASYNC-09] Dynamic voltage/frequency scaling [Garg/Marculescu DAC-09] Approaches to lower latency: Asynchronous bypass channels [Jain/Choi NOCS-10] Express virtual channels [Kumar, Peh et al. ISCA-07] Aethereal: reconfigurable packet-/circuit-switched routers Guaranteed service: time-multiplexed circuit switching Best effort: packet switched, scavenge unused bandwidth [Goossens et al. D&T-05] Our work: dynamic adaptability for latency improvement Complementary to many of these approaches Distinguishing features: Very fine-grain adaptability, at each node on a per-packet basis Adapts locally at a network node, no global control decisions

  10. Outline Introduction Previous Work Background: Async Network for Parallel Processors New Dynamically-Adaptive Network Overview Design of Bi-Modal Arbitration Node Monitoring Network Experimental Results Simulation Setup Cell-Level Results Network-Level: Results and Analysis Conclusions and Future Work

  11. Background: Async Interconnection Mesh-of-Trees (MoT) variant topology Fan-out + fan-in network Unique path for each source/sink pair Shown to perform well for CMPs MoT network: 2 node types Routing: 1 input channel, 2 output channels Arbitration: 2 input channels, 1 output channel Node performance, compared to sync [Balkan/Vishkin TVLSI-09] 64-84% less area, 82-91% less energy Maximum throughput = ~2 Gflits/sec Async network comparison with 800 MHz sync network: Same throughput for all input rates; lower latency at <73% saturation [Figure: Mesh-of-Trees with routing (fan-out) and arbitration (fan-in) nodes connecting terminals 0-3] Michael N. Horak, Steven M. Nowick, Matthew Carlberg, and Uzi Vishkin. A low-overhead asynchronous interconnection network for GALS chip multiprocessors. NOCS-2010
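
To make the unique-path property concrete, here is a minimal Python sketch (the node indexing scheme is our own assumption, not taken from the paper) that enumerates the route a flit takes through an 8-terminal MoT: log2(8) = 3 routing nodes in the source's fan-out tree, steered by the destination address bits, followed by 3 arbitration nodes in the destination's fan-in tree, matching the "6 router nodes + 5 hops" end-to-end path quoted on slide 7.

```python
import math

def mot_path(src, dst, n=8):
    """Enumerate the unique MoT route from source `src` to sink `dst`:
    log2(n) routing nodes in src's fan-out tree (each steered by one
    destination address bit), then log2(n) arbitration nodes in dst's
    fan-in tree. Node labels (kind, tree owner, level, position) are
    illustrative only."""
    levels = int(math.log2(n))
    path = []
    pos = 0
    for level in range(levels):                  # fan-out (routing) tree of src
        path.append(("route", src, level, pos))
        bit = (dst >> (levels - 1 - level)) & 1  # source routing: MSB first
        pos = (pos << 1) | bit
    pos = src
    for level in range(levels):                  # fan-in (arbitration) tree of dst
        pos >>= 1
        path.append(("arb", dst, level, pos))
    return path

route = mot_path(src=2, dst=5)
print(len(route))      # 6 router nodes -> 5 hops between them
for node in route:
    print(node)
```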

  12. Background: Async Routing Primitive Source routing 1 incoming handshaking channel, 2 outgoing handshaking channels [Figure: routing primitive with incoming channel (Req, Ack, Data_In) plus route bit B(oolean), and outgoing channels 0 (Req0, Ack0, Data0) and 1 (Req1, Ack1, Data1)] Michael N. Horak, Steven M. Nowick, Matthew Carlberg, and Uzi Vishkin. A low-overhead asynchronous interconnection network for GALS chip multiprocessors. NOCS-2010
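
A behavioral sketch of the routing primitive follows (handshaking is abstracted away, and the flit format with route bits prepended is our assumption, consistent with the source-routing scheme above):

```python
def routing_node(flit):
    """Fan-out (routing) primitive: one incoming channel, two outgoing.
    The leading source-routing bit B selects the output channel; the
    remaining route bits are forwarded for the downstream routing nodes."""
    route_bits, payload = flit
    b, rest = int(route_bits[0]), route_bits[1:]
    outputs = {0: [], 1: []}            # outgoing channels 0 and 1
    outputs[b].append((rest, payload))
    return outputs

# A flit headed for destination 0b101, seen at the first routing node:
print(routing_node(("101", "flit-A")))
# {0: [], 1: [('01', 'flit-A')]}  -> B = 1 steers it onto output channel 1
```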

  13. Background: Async Arbitration Primitive 2 incoming handshaking channels, 1 outgoing handshaking channel [Figure: arbitration primitive with incoming channels 0 (Req0, Ack0, Data0) and 1 (Req1, Ack1, Data1) and outgoing channel (Req_Out, Ack_In, Data_Out)] Michael N. Horak, Steven M. Nowick, Matthew Carlberg, and Uzi Vishkin. A low-overhead asynchronous interconnection network for GALS chip multiprocessors. NOCS-2010

  14. Background: Protocol Decisions Michael N. Horak, Steven M. Nowick, Matthew Carlberg, and Uzi Vishkin. A low-overhead asynchronous interconnection network for GALS chip multiprocessors. NOCS-2010 • Handshaking: transition signaling (two-phase) • Benefits over level signaling (four-phase) • 1 roundtrip link communication per transaction • High throughput / low power • Challenge of two-phase: designing lightweight implementations • Data encoding: single-rail bundled data • “Single-Rail Bundled Data” benefits: • Excellent coding efficiency and low power • Re-use synchronous datapaths: 1 wire/bit + added “request” • Challenge: requires matched delay for “request” signal • 1-sided timing constraint: “request” must arrive after data stable
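
The two-phase, single-rail bundled-data protocol can be summarized in a minimal behavioral sketch (a Python toy model, not the paper's circuit; class and method names are illustrative). The sender drives the data wires, then toggles req (a transition, not a level); the receiver captures the data and toggles ack, so one req/ack round trip completes one transaction:

```python
class BundledDataChannel:
    """Toy model of a single-rail bundled-data link with two-phase
    (transition) signaling: req and ack carry meaning in their
    transitions, so each toggle pair is one transaction."""
    def __init__(self):
        self.req = 0
        self.ack = 0
        self.data = None

    def send(self, flit, receiver):
        self.data = flit            # data must be stable before req toggles
        self.req ^= 1               # on chip, a matched delay enforces this order
        receiver.on_request(self)   # downstream observes the request transition

class Receiver:
    def __init__(self):
        self.received = []
    def on_request(self, ch):
        self.received.append(ch.data)   # capture the bundled data
        ch.ack ^= 1                     # acknowledge with a transition

rx, tx = Receiver(), BundledDataChannel()
for flit in ("flit0", "flit1"):
    tx.send(flit, rx)
print(rx.received)       # ['flit0', 'flit1']
print(tx.req, tx.ack)    # toggled twice each -> back to (0, 0)
```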

  15. Outline Introduction Previous Work Background: Async Network for Parallel Processors New Dynamically-Adaptive Network Overview Design of Bi-Modal Arbitration Node Monitoring Network Experimental Results Simulation Setup Cell-Level Results Network-Level: Results and Analysis Conclusions and Future Work

  16. Overview: Dynamically-Adaptive Network Identified critical network bottleneck: Arbitration logic in fan-in node Limits throughput; adds latency Basic strategy: Observe local traffic behavior If only one channel active: No arbitration needed Dynamically bypass arbitration logic Network-level result: Hybrid packet-/circuit-switched network Circuit-switched = lower latency [Figure: Mesh of Trees with routing (fan-out) nodes unchanged and new bi-modal arbitration (fan-in) nodes]

  17. Overview: Bi-Modal Arbitration Node Bi-modal arbitration node Modes = “default”, “single-channel-bias” All mode changes: decided locally Based on recently observed traffic Enter “bias-to-0” mode Only one channel active Arbitration is bypassed Return to “default” mode Inactive channel becomes active Uses channel arbitration Very fine-grain adaptation Adjacent nodes can be in different states Mode changes on a per-packet basis [Figure: Mesh of Trees highlighting the new bi-modal arbitration (fan-in) nodes]

  18. Default and Biased Mode Operation Default mode: arbitrated "packet-switched" connection Wait for arbitration to complete Asynchronous: also no waiting for clock [Figure: input channels 0 and 1 pass through the arbiter (arb) to the output channel] Biased-to-0 mode: dedicated "circuit-switched" connection Biased channel: no waiting Non-biased channel: blocked [Figure: input channel 0 connects directly to the output channel; input channel 1 is blocked]

  19. Behavior in “Default” Mode Arbitration between two input channels A flit arrives on channel 0 Opaque latch blocks flit progression Latch on “winning” channel is opened

  20. Behavior in “Biased to 0” Mode A flit arrives on channel 0 Arbitration logic: entirely bypassed; operates in the background, in parallel Latch on biased channel is transparent

  21. Mode Change Operation Goals: ensure low-latency and safe mode changes Basic mode change policy: two types of mode changes Default to biased: two packets in a row on one channel Biased to default: one packet arrives on inactive channel Default-to-biased mode change Easy transition from packet-switched to circuit-switched Safe: mode change in parallel with flit processing Biased-to-default mode change Challenging transition from circuit-switched to packet-switched Risky: severing dedicated connection just as a flit arrives Solution strategy: Find a "safe time window" for biased-to-default mode change Check network activity for risky incoming packets
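
As a rough illustration of the policy in isolation (a Python sketch; the state names and the exact counting are our reading of this slide, not code from the paper):

```python
class BiModalPolicy:
    """Mode-change policy only (safety logic and latch controllers omitted):
    enter biased mode after two consecutive packets on the same channel,
    return to default as soon as a packet arrives on the other channel."""
    def __init__(self):
        self.mode = "default"        # "default", "bias-to-0", or "bias-to-1"
        self.last_channel = None

    def on_packet(self, channel):
        if self.mode == "default":
            if channel == self.last_channel:
                self.mode = f"bias-to-{channel}"   # two in a row on one channel
        elif self.mode != f"bias-to-{channel}":
            self.mode = "default"                  # packet on the inactive channel
        self.last_channel = channel
        return self.mode

policy = BiModalPolicy()
for ch in (0, 0, 0, 1, 1, 1):
    print(ch, policy.on_packet(ch))
# 0 default, 0 bias-to-0, 0 bias-to-0, 1 default, 1 bias-to-1, 1 bias-to-1
```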

  22. Biased-to-Default Mode Change Safety Each input channel: monitoring signal “something coming” Before changing from biased-to-0 to default, check “something coming 0” Two possible scenarios: No flit is approaching: perform mode change now Flit is approaching: wait for it, and “piggyback” a mode change [Figure: biased-to-0 mode, a dedicated “circuit-switched” connection with “something coming 0/1” monitor inputs on the two channels; a flit on channel 1 triggers the mode change]
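
A minimal sketch of that decision (Python; field names such as something_coming and pending_mode_change are assumptions for illustration, not the paper's signal names):

```python
class ArbNodeState:
    def __init__(self):
        self.mode = "bias-to-0"
        self.something_coming = {0: False, 1: False}  # monitoring-network warnings
        self.pending_mode_change = False

def request_biased_to_default(node, biased_channel=0):
    """Safety check before tearing down the dedicated connection: if the
    monitor warns that a flit is already approaching on the biased channel,
    defer the change and 'piggyback' it on that flit; otherwise switch now."""
    if node.something_coming[biased_channel]:
        node.pending_mode_change = True       # wait for the in-flight flit
    else:
        node.mode = "default"                 # safe window: reconfigure immediately

node = ArbNodeState()
node.something_coming[0] = True
request_biased_to_default(node)
print(node.mode, node.pending_mode_change)    # bias-to-0 True  (change deferred)
```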

  23. Bi-Modal Node Implementation • Bi-modal node implementation: three steps • Policy: • Initiates mode change based on local flit arrival • Default to biased: two packets in a row on one channel • Biased to default: one packet arrives on the inactive channel • Safety: • Finds a precise time window for mode change • Decides on “Type A” or “Type B” mode change • Reconfigurable controllers: • Implements operation of each mode • Controls latches that regulate flow of flits through node

  24. Mode Implementation: (1) Policy Policy: initiates a mode change based on local traffic conditions

  25. Mode Implementation: (2) Safety Safety: finds a safe time window to enact a mode change

  26. Mode Implementation: (3) Latch Controllers Reconfigurable latch controllers: Implement the mode change within the node

  27. MoT with Monitoring Network Purpose: helps to find a “safe window” for mode change Provides advance warning of flit arrival Fast and lightweight, reuses network channels [Figures: the original MoT network, and the path of a flit through the MoT network]

  28. MoT with Monitoring Network Purpose: helps to find a “safe window” for mode change Provides advance warning of flit arrival Fast and lightweight, reuses network channels Add Monitor Control to every network node [Figure: new network with monitoring, a Monitor Control block at every node]

  29. MoT with Monitoring Network Purpose: helps to find a “safe window” for mode change Provides advance warning of flit arrival Fast and lightweight, reuses network channels All bi-modal arbitration nodes use monitoring network for safety “Advance guard” signal, traces same path as flits “Trip wire” at entrance to the network [Figure: network with Monitor Control blocks at every node]

  30. Summary Introduced a new bi-modal arbitration node Monitors local traffic arrival When possible, bypasses arbitration Created a dynamically-adaptive network Incorporates new bi-modal node Based on past work [Horak/Nowick NOCS-10] Added lightweight monitoring Finds a “safe time window” for mode change Fast and low area overhead Result: Hybrid packet-/circuit-switched network Fine-grain adaptation: Adjacent nodes can be in different states Changes modes on a per-packet basis Target: lower latency and higher throughput [Figure: Mesh of Trees with routing (fan-out) and arbitration (fan-in) nodes]

  31. Outline Introduction Previous Work Background: Async Network for Parallel Processors New Dynamically-Adaptive Network Overview Design of Bi-Modal Arbitration Node Monitoring Network Experimental Results Simulation Setup Cell-Level Results Network-Level: Results and Analysis Conclusions and Future Work

  32. Experimental Evaluation: Overview • Two levels of evaluation: • New cell designs in isolation • 8-terminal adaptive network • Cell-level evaluation: • Comparison of bi-modal arbitration node to [Horak/Nowick NOCS-10] • Latency, throughput and area • Cell-level mode change overheads • 90nm ARM standard cells, gate-level Spice simulation • Network-level evaluation: • Built two 8-terminal MoT networks: 112 nodes each • “Baseline” [Horak/Nowick NOCS-10] and new “dynamically adaptive” • Modeled in structural technology-mapped Verilog • Developed a new asynchronous simulation framework • Applied 8 diverse benchmarks: mix of random and deterministic traffic

  33. Cell Level: Bi-Modal Arbitration Node Latency comparison: bi-modal vs. baseline [Horak/Nowick, NOCS-10] Biased mode: 41% faster than baseline Default mode: overhead of 8% compared to baseline Throughput comparison Biased mode: 20% higher than baseline Default mode: overhead of <2% for the “single” traffic pattern Area comparison: 32-bit data-path bi-modal vs. baseline Arbitration nodes: 43% area overhead Routing nodes: 2% area overhead [Charts: arbitration-node latency and throughput under “single” and “alternating” traffic patterns, comparing bi-modal default, bi-modal biased, and baseline [Horak, NOCS-10]]

  34. Cell Level: Mode-Change Latency • Time to locally reconfigure a node • Default to biased: • Reconfiguration in parallel with processing a flit • No overhead added to flit latency • Biased to default: • Reconfiguration must complete before processing flit • Cell-level overhead to flit latency: variable latency penalty

  35. Async. Network Simulation Framework W. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers, Inc., 2003. • Goal: compare two 8-terminal networks • “Baseline” [Horak, NOCS-10] and new “dynamically adaptive” • Developed an asynchronous simulation framework • Flexible: configuration file for specifying input traffic patterns • Can provide both deterministic and random traffic • Launches flits asynchronously as a Poisson process • Built on custom trace generator [S. Sethumadhavan, Columbia University] • Includes correctness checking • Debugging instrumentation on a per-node and global basis • Testing for dropped and mis-routed packets • Checks for per-flit data corruption • Implements asynchronous equivalent to standard [Dally-03] • Warm-up, measurement, and drain phases
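
For reference, Poisson injection simply means exponentially distributed inter-arrival gaps; a minimal Python sketch of such a flit launcher (the function name and units are ours, not the framework's API):

```python
import random

def poisson_flit_times(rate_flits_per_ns, n_flits, seed=0):
    """Launch times (in ns) for flits injected as a Poisson process:
    inter-arrival gaps are exponential with mean 1/rate."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n_flits):
        t += rng.expovariate(rate_flits_per_ns)   # mean gap = 1/rate
        times.append(t)
    return times

# e.g. 25% of the ~2 Gflits/s node throughput quoted earlier = 0.5 flits/ns
print(poisson_flit_times(0.5, 5))
```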

  36. Benchmarks Benchmarks represent a wide variety of network conditions Level of contention affects performance of new bi-modal node Biased mode lowers latency in contention-free scenarios e.g. Standard bit- and digit-permutation benchmarks Benchmarks with the most contention are the most adversarial e.g. Hotspot benchmark Benchmarks mix random and deterministic traffic e.g. “Partial streaming” benchmark: bit-permutation + random interruptions 1) Bit-permutation benchmark: “shuffle” [Dally-03] 2) Digit-permutation benchmark: “tornado” [Dally-03] 3) Uniform random traffic [Dally-03] 4) Simple alteration with overlap 5) Random restricted broadcast with partial overlap 6) Hotspot8 benchmark [Dally-03] 7) Random single source broadcast 8) Partial streaming with random interruption
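
The standard patterns cited from [Dally-03] are simple address transformations; a Python sketch using the textbook definitions (applied here to 8 terminals, which is our choice of example size, not a claim about the paper's exact setup):

```python
import random

N = 8                                   # terminals (matches the 8-terminal MoT)
B = N.bit_length() - 1                  # address bits

def shuffle_dest(src):                  # bit-permutation: rotate address left by 1
    return ((src << 1) | (src >> (B - 1))) & (N - 1)

def tornado_dest(src):                  # digit-permutation: offset by ceil(N/2) - 1
    return (src + (N + 1) // 2 - 1) % N

def uniform_dest(src, rng=random):      # uniform random destination
    return rng.randrange(N)

print([shuffle_dest(s) for s in range(N)])   # [0, 2, 4, 6, 1, 3, 5, 7]
print([tornado_dest(s) for s in range(N)])   # [3, 4, 5, 6, 7, 0, 1, 2]
```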

  37. Network-Level Latency Evaluation [Chart: latency comparison at 25% network load, baseline [Horak, NOCS-10] vs. dynamically adaptive, shown as % improvement per benchmark] • Dynamically-adaptive network: • Latency improvements up to 19.81% • 6 of 8 benchmarks show some improvement • Benchmark #6 “Hotspot” adversarial: 13.2% higher latency

  38. Network-Level Throughput Evaluation [Chart: saturation throughput, baseline [Horak, NOCS-10] vs. dynamically adaptive, shown as % improvement per benchmark] • Dynamically-adaptive network • Throughput improvements up to 27.84% • 7 out of 8 benchmarks show improvement • Benchmark #3 “Random” adversarial: 5.88% lower

  39. Monitoring Network Latency • Monitoring signal must arrive before flit arrives • Monitor provides advance notification of flit arrival • Timing margin: long enough for a node to safely reconfigure • Leaf: monitor arrives ~400 ps before the flit • Root: monitor arrives >750 ps before the flit

  40. Discussion • Latency benefits in 6 of the 8 benchmarks • Ranging up to 19.81% • Throughput improvement in 7 of the 8 • Ranging up to 27.84% • Adversarial benchmark evaluation: • Hotspot and uniform random show overheads • Detailed simulation indicates that thrashing occurs • Enter and then immediately leave biased mode • Plan to address in future work

  41. Outline Introduction Previous Work Background: Async Network for Parallel Processors New Dynamically-Adaptive Network Overview Design of Bi-Modal Arbitration Node Monitoring Network Experimental Results Simulation Setup Cell-Level Results Network-Level: Results and Analysis Conclusions and Future Work

  42. Conclusions • Introduced a new bi-modal arbitration node • Includes a low-latency “single-channel-bias” mode • Rapid reconfiguration using local traffic information • Created a hybrid packet-/circuit-switched network • Targets improved latency and throughput • Fine-grain adaptability in both spatial and temporal domains • Infrastructure for asynchronous network simulation • Easily configured to generate a range of complex benchmarks • Custom-designed for providing asynchronous inputs • Detailed experimentation and analysis • Two 8x8 MoT asynchronous networks: “baseline” and “adaptive” • Significant improvements in throughput and latency • System latency (up to 19.8%) + throughput (up to 27.8%)

  43. Future Research Directions • Explore new mode-change policies • Increase the number of flits processed in biased mode • Reduce “thrashing” between modes • Lower mode-change overhead • Reduce the “penalty” for changing from biased to default mode • Target different network topologies • Use dynamic adaptability to improve latency/throughput • End goal: complete layouts, fabricate chip
