Fully Asynchronous framework for GALS network on chip

EE Department, Technion, Haifa, Israel. Friedman Harel. Seminar in VLSI Architectures (048879), Electronic Engineering, Technion. Mentor: Prof. Ran Ginosar

Asynchronous Network On Chip (A-NOC) The demand for scalable, low-latency and power-efficient System-on-Chip interconnect drives the development of the network on chip (NOC). This presentation reviews architectural and practical aspects of an asynchronous network-on-chip solution.

Agenda • Why NOC In SOC • The evolution of networks – brief • The advantages of A – NOC VS NOC • The GALS solution • The A-NOC major blocks • A NOC protocol • The routing algorithm • The LETI Faust SOC Approach (Dynamic Reconfiguration) • The LETI Alpin SOC Approach (DVFS) • The MAGALI SOC Approach (Heterogeneous architecture) • Summary

System On Chip (SOC) • With many tens of millions of transistors available on a single chip, the System-on-Chip (SOC) has become a reality. • Design with IP reuse is mandatory. • Integrated processor cores, DSPs, on-chip memories, IP blocks, etc. are commonly in use. SOC implementation using A-NOC in a telecommunication chip: ISSCC 2010 / SESSION 15 / LOW-POWER PROCESSORS & COMMUNICATION / 15.3

From buses to networks

Why NOC In SOC • Globally shared buses cannot meet the increasing demands of System-on-Chip interconnects. • The following handicaps have become serious obstacles: • Long-wire loads and resistances result in slow signal propagation. • Difficulties in timing validation. • Connecting blocks running at different speeds. • Connecting blocks using different voltage levels. • Power efficiency drops.

The Synchronous network - brief

Generic On-Chip Router

Frequency Distribution • Clock skew may force the system to be partitioned into multiple clock domains. • With a single clock source, only the phase of each router's clock differs, so a simple, error-free clock-domain crossing is possible.

The need for Asynchronous hand shake

Synchronous Routers with Asynchronous Links • Synchronization: • Time Safe: e.g. Traditional 2 FF synchronizers • Value Safe: Clock Pausing/Data-driven clocks

The advantages of A – NOC VS NOC • Clock management of a NOC-based chip is still an issue when multi-clock synchronization is required. • Asynchronous and self-timed circuits are known to be feasible. • Delay-insensitive (DI) asynchronous communication provides robust chip-level communication, ensuring functionality over a large voltage and process range. • With DI encoding, delay variations due to physical constraints such as crosstalk are no longer an issue. • Wire pipelining is easy to achieve at chip level by adding asynchronous latches, which re-power the signals while cutting the wire cycle time.

The GALS solution Because most IP cores and logic blocks are synchronous, an SOC design that uses the A-NOC is based on the Globally Asynchronous Locally Synchronous (GALS) topology. The GALS topology is based on an asynchronous network of synchronous blocks. Each block on the network has its own clock domain.

The A-NOC major blocks • Router (node) • IP core • Links The network topology consists of point-to-point connections between routers (nodes), which are arranged as a 2D matrix and function as a mesh network.

The A-NOC Router Every router (except the routers on the edges of the network) contains five sets of inputs and outputs. Four input/output sets are directed to the four possible directions (North, South, East and West). The fifth input/output set is directed to the specific core in the node (locally synchronous). Every input is connected to all five outputs and vice versa.

The A-NOC Link • The link is based on a QDI (quasi-delay-insensitive) 4-phase asynchronous handshake protocol. • On long traces, asynchronous pipelining is added to the NoC links. • A typical 4-rail QDI interconnect and the associated pipelining are presented in figure (b).
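The 4-phase (return-to-zero) handshake named on this slide can be sketched as an ordered list of events for one transfer. This is only an illustration: the explicit `req` signal and the function name `four_phase_transfer` are assumptions made for readability. On a 4-rail QDI link the request is normally implicit in the data rails becoming valid, and the real circuit is not software.

```python
def four_phase_transfer(value):
    """Model one 4-phase handshake and return (latched value, event trace).

    Each trace entry is (event, req, ack). The names are illustrative only.
    """
    trace = []
    req, ack, wires = 0, 0, None

    # Phase 1: the sender drives the data and raises the request.
    wires, req = value, 1
    trace.append(("req_up", req, ack))

    # Phase 2: the receiver latches the data and raises the acknowledge.
    latched, ack = wires, 1
    trace.append(("ack_up", req, ack))

    # Phase 3: the sender returns the data wires and the request to zero.
    wires, req = None, 0
    trace.append(("req_down", req, ack))

    # Phase 4: the receiver lowers the acknowledge; the link is idle again.
    ack = 0
    trace.append(("ack_down", req, ack))

    return latched, trace
```

Every transfer ends with both signals low, so the next transfer starts from the same idle state regardless of how long each phase took.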

One of four data encoding
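A one-of-four (4-rail) code carries two bits on four wires, with exactly one wire high for a valid digit and all wires low for the spacer between digits. The sketch below assumes that rail `i` represents the value `i`; the actual rail assignment in the presented design is not given in the slides, and the function names are hypothetical.

```python
SPACER = (0, 0, 0, 0)  # all rails low: no data on the link


def encode_1of4(two_bits):
    """Encode a 2-bit value (0..3) as four rails with exactly one rail high."""
    if not 0 <= two_bits <= 3:
        raise ValueError("a 1-of-4 digit carries exactly two bits")
    return tuple(1 if rail == two_bits else 0 for rail in range(4))


def decode_1of4(rails):
    """Return the 2-bit value, or None while the rails hold the spacer."""
    rails = tuple(rails)
    if len(rails) != 4:
        raise ValueError("expected four rails")
    if rails == SPACER:
        return None
    if sum(rails) != 1:
        raise ValueError("invalid code word: more than one rail is high")
    return rails.index(1)
```

Because a valid digit is recognisable from the rails themselves, the receiver needs no clock or matched delay to know when the data has arrived.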

The GALS adapter • The synchronous IP core is connected to the A-NOC via the GALS adapter. • The adapter has two objectives: • To resynchronize the asynchronous NoC protocol with the synchronous domain. • To generate a local clock with configurable frequency. The synchronization is based on two FIFOs. For each FIFO, ordinary synchronizers adapt the FIFO read and write signals to the synchronous and asynchronous domains. Johnson encoding is used to obtain a small and efficient FIFO. The local clock is generated with a pausable clocking scheme and a programmable IP clock frequency.

GALS synchronization Pausable clock Metastability filter

GALS synchronization Pausable clock • Simple GALS interface (receiver) • Note: Req/Ack uses 2-phase handshaking protocol

GALS synchronization Pausable clock

Data-Driven Clock Waveform

Data-Driven Clock Waveform • Imagine data from two packets arriving at a single router node at different rates • An aperiodic clock may be generated to minimise latency and power • Minimum clock period set by delay line • Value safe synchronization (no chance data is ever lost)

The FIFO approach • Synchronization issue: the read and write pointers cross timing domains and must be synchronized with the opposite clock. • An ad-hoc encoding is needed to ensure proper detection of the full and empty states.

The Johnson counter

Johnson Encoding for FIFO design
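A Johnson (twisted-ring) counter changes exactly one bit per step, which is why it suits pointers that are sampled by another clock domain. The sketch below shows the sequence and one possible full/empty test for a FIFO of depth `n` with `n`-bit Johnson pointers. The full/empty rule (equal pointers mean empty, complementary pointers mean full) is an assumption for illustration and is not necessarily the scheme used in the presented design.

```python
def johnson_next(state, n):
    """Shift left by one bit and feed the inverted MSB into the LSB."""
    msb = (state >> (n - 1)) & 1
    return ((state << 1) & ((1 << n) - 1)) | (msb ^ 1)


def johnson_sequence(n):
    """Return the 2*n states of an n-bit Johnson counter, starting at 0."""
    state, seq = 0, []
    for _ in range(2 * n):
        seq.append(state)
        state = johnson_next(state, n)
    return seq


def is_empty(write_ptr, read_ptr):
    """Assumed rule: the pointers coincide when nothing is stored."""
    return write_ptr == read_ptr


def is_full(write_ptr, read_ptr, n):
    """Assumed rule: after n writes the write pointer is the bitwise
    complement of the read pointer, because n Johnson steps invert the state."""
    return write_ptr == read_ptr ^ ((1 << n) - 1)
```

For `n = 3` the counter visits 000, 001, 011, 111, 110, 100 and then wraps to 000, so every transition (including the wrap) flips a single bit.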

Johnson Encoding FIFO architecture

A to S interface

Local clock generation

A NOC protocol • The NOC communication protocol is defined at the following levels: • The physical layer corresponds to the signal level of data exchange. It is implemented by a 4-phase handshake protocol. • The flit level corresponds to an atomic 32-bit data transfer. This level describes the signalling mechanism used to exchange flits for a given priority. The flit level removes any dependency on a clock cycle within the whole network. • The packet level corresponds to packet transmission through the network. Packets are coherent messages built of successive data flits. This level defines all the information required for proper message routing within the network, and it is where network arbitration is performed. A virtual channel mechanism is used to improve efficiency and guarantee low latency for priority packets. • The last level is the message level. It does not concern the network itself, only the source unit and the destination units that communicate with each other.

A NOC protocol

The Physical layer • Quasi-Delay-Insensitive (QDI) circuit design, using a 4-phase handshake protocol for the asynchronous channels. • The full data path (32 bits + BoP + EoP) is designed entirely with 4-rail encoded data and requires 17 4-rail vectors. • A pipeline stage every millimeter is supposed to be enough for 65 nm CMOS.
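The figure of 17 vectors follows from the flit width: 32 data bits plus BoP and EoP give 34 bits, and each 4-rail vector carries 2 bits, so 34 / 2 = 17 vectors, or 68 data rails. A small sketch of that split is shown below; the least-significant-digit-first ordering and the function names are assumptions, since the slides do not state how bits map to vectors.

```python
FLIT_BITS = 34               # 32 data bits + BoP + EoP
BITS_PER_VECTOR = 2          # one 4-rail (1-of-4) vector carries two bits
NUM_VECTORS = FLIT_BITS // BITS_PER_VECTOR   # 17
NUM_DATA_RAILS = NUM_VECTORS * 4             # 68


def flit_to_digits(flit):
    """Split a 34-bit flit into 17 two-bit digits, least significant first."""
    if not 0 <= flit < (1 << FLIT_BITS):
        raise ValueError("flit does not fit in 34 bits")
    return [(flit >> (BITS_PER_VECTOR * i)) & 0b11 for i in range(NUM_VECTORS)]


def digits_to_flit(digits):
    """Inverse of flit_to_digits."""
    flit = 0
    for i, digit in enumerate(digits):
        flit |= (digit & 0b11) << (BITS_PER_VECTOR * i)
    return flit
```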

The “flit” level The whole network synchronization mechanism is based on a basic handshake between nodes to exchange a data flit (OP: output port, IP: input port). Each flit is composed of 32 data bits and 2 control bits, where the 34th bit encodes the begin-of-packet (BoP) and the 33rd bit encodes the end-of-packet (EoP).
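The flit layout described above can be sketched as bit packing. The sketch assumes that bits are counted from 1 starting at the least significant bit, so the 33rd bit is bit index 32 (EoP) and the 34th bit is bit index 33 (BoP); `pack_flit` and `unpack_flit` are hypothetical helper names.

```python
DATA_MASK = 0xFFFFFFFF  # 32 data bits
EOP_SHIFT = 32          # 33rd bit: end of packet
BOP_SHIFT = 33          # 34th bit: begin of packet


def pack_flit(data, bop, eop):
    """Build a 34-bit flit from 32-bit data and the two control bits."""
    return (int(bop) << BOP_SHIFT) | (int(eop) << EOP_SHIFT) | (data & DATA_MASK)


def unpack_flit(flit):
    """Return (data, bop, eop) from a 34-bit flit."""
    data = flit & DATA_MASK
    eop = bool((flit >> EOP_SHIFT) & 1)
    bop = bool((flit >> BOP_SHIFT) & 1)
    return data, bop, eop
```

A single-flit packet would set both control bits, while the flits in the middle of a longer packet would set neither.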

The packet level • Data is transmitted in packets made of several flits. • Every packet contains a header flit and several payload flits. • The header flit comprises the following fields: • Path-to-target field: • The encoding is: 00 for north, 01 for east, 10 for south, 11 for west. • The field is 18 bits wide, which allows a packet to cross at most 9 nodes in the network topology. If more nodes must be addressed, a specific programmable resource can be used to extend the path value. • Message control field: • It encodes the message-level type of the packet: whether it is a read packet, a write packet, an interrupt packet, etc.
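The path-to-target field can be sketched as a list of 2-bit direction codes, which gives 18 / 2 = 9 hops. The direction codes come from the slide. The placement of the first hop in the two least significant bits, and the right shift by two bits at each node, are assumptions made for this illustration.

```python
DIRECTION_CODE = {"N": 0b00, "E": 0b01, "S": 0b10, "W": 0b11}
CODE_DIRECTION = {code: name for name, code in DIRECTION_CODE.items()}
PATH_BITS = 18
MAX_HOPS = PATH_BITS // 2  # 9 hops


def encode_path(hops):
    """Pack a sequence of directions into the 18-bit path-to-target field.

    Assumption: the first hop occupies the two least significant bits.
    """
    if len(hops) > MAX_HOPS:
        raise ValueError("an 18-bit path field holds at most 9 hops")
    field = 0
    for i, hop in enumerate(hops):
        field |= DIRECTION_CODE[hop] << (2 * i)
    return field


def next_hop(field):
    """Return (direction for this node, shifted field for the next node)."""
    return CODE_DIRECTION[field & 0b11], field >> 2
```

For example, the path east, east, south is encoded as `0b100101`; each node reads the lowest two bits and forwards the remaining bits shifted down.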

Basic Routing

The router structure • To improve efficiency and guarantee low latency, two virtual channels are implemented in each node. • The first one is dedicated to real-time, low-latency packets, and the other one to best-effort traffic. The first channel, VC0, has the highest priority and can suspend the path of the second channel, VC1. • A given packet cannot be suspended by a packet of the same or lower priority. It may only be suspended when a packet with a higher priority requests the same network link (which is actually a given node output). In that case, the suspended packet is stalled and stored in the previous nodes. [Figure: node-to-node link signals: send, data, acc 0, acc 1]
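The priority rule between the two virtual channels can be sketched as a small scheduler for one output link: whenever VC0 has a flit waiting it goes first, and VC1 resumes once VC0 is idle. The discrete time steps and the function name `schedule_link` are assumptions used only to make the ordering visible; the real network is clockless and arbitrates on handshakes.

```python
from collections import deque


def schedule_link(arrivals_vc0, arrivals_vc1, steps):
    """Return [(step, vc, flit), ...] for one output link.

    arrivals_vcN maps a step number to the list of flits that arrive on
    that virtual channel at that step (illustrative abstraction).
    """
    queue0, queue1 = deque(), deque()
    sent = []
    for step in range(steps):
        queue0.extend(arrivals_vc0.get(step, []))
        queue1.extend(arrivals_vc1.get(step, []))
        if queue0:                      # VC0 always wins the link
            sent.append((step, 0, queue0.popleft()))
        elif queue1:                    # VC1 uses the link only when VC0 is idle
            sent.append((step, 1, queue1.popleft()))
    return sent
```

With a three-flit best-effort packet arriving at step 0 and a two-flit real-time packet arriving at step 1, the best-effort packet sends one flit, is suspended for two steps, and then finishes.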

The routing algorithm • The static paths between initiator and target resources are programmed and stored in the initiator resources. • Although the routing is normally deterministic, when a path is blocked the route between the resources is determined using a dynamic routing algorithm. • One efficient algorithm that is proven deadlock-free is the “odd-even turn model”. • In a two-dimensional mesh of size m × n, every node is identified by a two-element vector (x, y), 0 ≤ x ≤ m-1 and 0 ≤ y ≤ n-1, where x and y are the coordinates in the two dimensions. • Rule 1: East-north turns are not allowed at any node located in an even column, and north-west turns are not allowed at any node located in an odd column. • Rule 2: East-south turns are not allowed at any node located in an even column, and south-west turns are not allowed at any node located in an odd column.
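The two rules of the odd-even turn model can be written as a small predicate on the column index and the turn being attempted. This sketch checks only the two rules stated on the slide. It is not a complete routing function, and `turn_allowed` is a hypothetical helper name.

```python
def turn_allowed(column, heading, new_heading):
    """Check one turn against the two odd-even turn model rules.

    heading is the direction the packet is travelling ("N", "E", "S", "W"),
    new_heading is the direction it wants to take next, and column is the
    x coordinate of the node where the turn happens.
    """
    turn = heading + new_heading
    even_column = column % 2 == 0
    # Rule 1 and Rule 2, even columns: no east-north and no east-south turns.
    if even_column and turn in ("EN", "ES"):
        return False
    # Rule 1 and Rule 2, odd columns: no north-west and no south-west turns.
    if not even_column and turn in ("NW", "SW"):
        return False
    return True
```

A dynamic router could use such a check to filter the candidate output directions before choosing among the ones that are not blocked.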

The routing algorithm An example of faulty path recovery – block relief

Blocking Issue

QNoC-based SoC design flow An example of faulty path recovery – block relief

The router structure – input ports • The DEMUX stage routes flits to their corresponding VC queues. • The Shifter stage modifies the routing information as needed (the path-to-target field in the flit header), and the flit is stored in a buffer stage, waiting for the appropriate output port to become available.

The router structure – output ports • The output port first performs arbitration between directions within each VC, and only then between VCs. • The Direction Arbiter performs fair arbitration of the possible new packet requests from the input ports. • It generates a single command token for the Direction Switch, which keeps the selected direction until the end of the packet. • Finally, the VC Arbiter arbitrates at flit level between the two VCs and commands the VC Switch.

GALS adapter unit – input port • The input port (IP) decodes the packet routing bits and shifts the path-to-target bits for the following nodes. • The IP transfers data, priority, BoP and EoP information to the selected output controller. • A first process (get_priority_bit) decodes the incoming flit priority level from the IP_send signal. • If the flit is a begin of packet: • The path to target is shifted and the flit is stored in the corresponding priority-level channel. • The token to the appropriate channel and the path information are maintained using the loop processes. • If the flit is not a begin of packet: • The incoming flit is stored in the corresponding priority channel. • The process get_new_flit is responsible for shifting the path-to-target bits and for transmitting the received 32-bit data and EoP bit toward the proper register, according to the virtual channel number.

GALS adapter unit – output ports • Arbitration between virtual channels and arbitration within a virtual channel. • A "first arrived, first served" (FAFS) policy is used together with prioritized virtual channels. • VC0 is made simpler, with only static arbitration (N/E/S/W). • VC1 arbitration follows a priority list using the FAFS mechanism. • 34-bit data switch.
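The two arbitration styles named above can be contrasted in a few lines. The sketch assumes that static arbitration on VC0 means a fixed order N, E, S, W, and that FAFS on VC1 means serving requests in their order of arrival. Both readings, and the function names, are assumptions about a design whose details the slides do not give.

```python
from collections import deque

STATIC_ORDER = ("N", "E", "S", "W")  # assumed fixed priority order for VC0


def pick_vc0(requests):
    """Static arbitration: grant the first requesting input in a fixed order."""
    for direction in STATIC_ORDER:
        if direction in requests:
            return direction
    return None


def pick_vc1(request_queue):
    """First arrived, first served: grant the oldest pending request."""
    return request_queue.popleft() if request_queue else None
```

Static arbitration needs no state, which fits the remark that VC0 is made simpler, while FAFS must remember the arrival order of pending requests.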

The Low-power processor & communication chip design A local processing core connected to the A-NOC

The Low-power processor & communication chip design Receiver block diagram on the NOC

The Low-power processor & communication chip design The core structure on the NOC

The Low-power processor & communication chip design LETI - FAUST approach FAUST: Flexible Architecture of Unified System for Telecom
