NoC for Cache Coherence

NoC for Cache Coherence NoC Seminar Technion Vainbaum Yuri Mentor I.Keidar

Cache coherence problem in NUCA Update other L2$ T1 T2 P1 A A B P3 B L2$ L2$ P2 P4 L2$ L2$ • The cache coherency problem appears when tasks running on different processors in the SoC share data stored in the system memory. • When a task T1 running on a processor P1 modifies a data shared with task T2 , which runs on the processor P2, that data’s copy on P2 processor’s cache must be either updated or invalidated, before a new access to it.

MESI -maintain the coherence in cached systems Modified: Actually, it is an exclusive-modified state. It means that the cache has the only copy that is correct in the whole system. The data which are in the main memory are wrong. Exclusive: Exclusive without having been modified. That is, this cache is the only one that has the correct value of the block. Data blocks are according to the existing ones in the main memory. Invalid: It is a non-valid state. The data you are looking for are not in the cache, or the local copy of these data is not correct because another processor has updated the corresponding memory position. Shared: Shared without having been modified. Another processor can have the data into the cache memory and both copies are in their current version.

In-Network Cache Coherence • Propose : • Implementation of the coherence protocol and directories within the network at each router node. • This opens up the possibility of optimizing a protocol with in-transit actions In-Network Cache Coherence, Noel Eisley, The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06)

In-Network Read optimization H A B Read request Directory based MSI To sheerer data • Three end-to-end messages In-Network Cache Coherence, Noel Eisley, The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06)

In-Network Read optimization H A B Read request In-Network MSI data • Node B “bumps” into node A • While message in-transit to the home node H obtain the data directly from A In-Network Cache Coherence, Noel Eisley, The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06)

In-Network Write optimization H A B C write request Inv Ack Inv Directory based MSI Ack data In-Network Cache Coherence, Noel Eisley, The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06)

In-Network Write optimization H A B C write request Inv+Ack In-network MSI Ack +inv data • This in-transit optimization can reduce write communication from two round-trips to a single round-trip from C to H and back In-Network Cache Coherence, Noel Eisley, The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06)

In-Network cache coherence protocol H R • Idea: move coherence directories from the nodes into the network fabric • Virtual trees, one for each cache line, are maintained within the network in place of coherence directories to keep track of sharers • The virtual tree consists of one root node R which is the node that first loads a cache line from off-chip memory, all nodes that sharing this line and intermediate nodes between root and sharers • Nodes of the tree are connected by virtual links • Virtual trees are stored in virtual tree caches at each router within the network • Reads and writes are routed towards the home node, if they encounter a virtual tree in-transit, the virtual tree takes over as the routing function and steers read requests and write invalidates appropriately towards the sharers instead. In-Network Cache Coherence, Noel Eisley, The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06)

In-Network-Read access example 2. Load line from off-chip 3. Constructs virtual tree 4. Hits virtual tree on the way to home node H H read2 R2 5. Steered to nearest copy 1.Towards home node 6. Returns data and constructs new virtual tree links R1 R1 read1 Second read request to the same line –Read2 New read request –Read1

In-Network Router micro-architecture Flit • Virtual tree cache serves to steer head flits towards the appropriate output ports. • Virtual tree cache points them towards caches housing the most up-to-date data requested • Memory address contained in each packet’s header is first parsed into < tag, index,o f f set > if the tag matches, there is a hit in the tree cache, and its prescribed direction is used as the desired output port

In-Network Results & Summary • Proposed an approach of cache coherence for chip multiprocessors where the coherence protocol and directories are all embedded within network routers. • This approach has a low hardware overhead which quickly leads to hardware savings, compared to the standard directory protocol, as the number of cores per chip increases In-Network Cache Coherence, Noel Eisley, The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06)

DCOS-Directory Cache On a Switch • To reduce cache-to-cache data transfer time proposed architecture implemented inside each switch • 4x2 2D mesh topology MIPS R10000 core model ,Directory based cache coherence MSI protocol DCOS: Cache Embedded Switch Architecture for Distributed Shared Memory Multiprocessor SoCs , Daewook Kim 2006 IEEE

DCOS-Directory Cache On a Switch • State entry assigned to a memory block holds current state of block : empty, shared, modified /invalid • No data items are copied to caches or memories :Marked as “E” DCOS: Cache Embedded Switch Architecture for Distributed Shared Memory Multiprocessor SoCs , Daewook Kim 2006 IEEE

DCOS-Directory Cache On a Switch • Data is shared with other caches and memories DCOS: Cache Embedded Switch Architecture for Distributed Shared Memory Multiprocessor SoCs , Daewook Kim 2006 IEEE

DCOS-Switch architecture • All directory caches are embedded within crossbar switch

DCOS-Results DCOS: Cache Embedded Switch Architecture for Distributed Shared Memory Multiprocessor SoCs , Daewook Kim 2006 IEEE

Cache Coherency Communication Cost • How costly is cache coherency in interconnection terms? • This paper focuses on bringing light onto this question. • Directory based mechanism to maintain coherence among all caches in the system Cache Coherency Communication Cost in a NoC-based MPSoC Platform , Gustavo Girão, SBCCI’07, September 3–6, 2007, Rio de Janeiro, Brazil.

Cache Coherency Communication Cost • The amount of data on the NoC for regular operations is much larger than the amount of data for cache coherence maintenance for almost all the cache sizes • The increase in cache size decreases the amount of data for regular operations, and so the amount of data for cache coherence becomes more significant Cache Coherency Communication Cost in a NoC-based MPSoC Platform , Gustavo Girão, SBCCI’07, September 3–6, 2007, Rio de Janeiro, Brazil.

Cache Coherency Communication Cost • Graph shows that the amount of page replacement requests is the • most responsible for the cache coherence injected load for small • cache sizes. This happens because the amount of replacements • increases as cache size decreases 8 CPUs, 1 directory Cache Coherency Communication Cost in a NoC-based MPSoC Platform , Gustavo Girão, SBCCI’07, September 3–6, 2007, Rio de Janeiro, Brazil.

BeNOC –Bus enhanced Network on Chip • Low latency, low bandwidth specialized bus, optimized for system-wide distribution of control signals (ack,invl) • High performance distributed network that handles high throughput • data communication between pairs of modules BENoC: A Bus-Enhanced Network on-Chip for a Power Efficient CMP Isask'har Walter, Israel Cidon, and Avinoam Kolodny

BeNOC –Bus enhanced Network on Chip • β -reflects the network-to-bus broadcastlatency ratio • n- The number of modules in the system • When broadcast operations are compared,the bus is considerably more energy efficient than thenetwork BENoC: A Bus-Enhanced Network on-Chip for a Power Efficient CMP Isask'har Walter, Israel Cidon, and Avinoam Kolodny

Network topology awareness Invl. P1 L2$ P2 L2$ Invl. P3 L2$ • Wait for furthest invalidation acknowledgment therefore send Invl to P3 first • Cache coherence protocol should be aware of the network topology • Send invalidation messages according to distances from the directory

Network topology awareness • Calculate at each transaction furthest sharing node and send invalidation • The total delay will be roundtrip time of invalidation/acknowledge to furthest node • Send long delay roundtrip messages first to mask short delay messages.

References In-Network Cache Coherence, Noel Eisley, The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06) DCOS: Cache Embedded Switch Architecture for Distributed Shared Memory Multiprocessor SoCs , Daewook Kim 2006 IEEE BENoC: A Bus-Enhanced Network on-Chip for a Power Efficient CMP Isask'har Walter, Israel Cidon, and Avinoam Kolodny Cache Coherency Communication Cost in a NoC-based MPSoC Platform , Gustavo Girão, SBCCI’07, September 3–6, 2007, Rio de Janeiro, Brazil. TEACHING THE CACHE MEMORY COHERENCE WITH THE MESI PROTOCOL SIMULATOR F. J. JIMÉNEZ1, J. GÓMEZ1, A. MESONES1, E. HERRUZO1, J. I. BENAVIDES1 Y F. J. SÁNCHEZ2 1Dpto. Electrotecnia y Electrónica. Escuela Politécnica Superior. Universidad de Córdoba. Av. Menéndez Pidal s/n. 14081. Córdoba. Spain. On cache coherency and memory consistency issues in NoC based shared memory multiprocessor SoC architectures Proceedings of the 9th EUROMICRO Conference on Digital System Design (DSD'06) Exploration of distributed shared memory architectures for NoC-based multiprocessors Matteo Monchiero, Gianluca Palermo, Cristina Silvano *, Oreste Villa

NoC for Cache Coherence

NoC for Cache Coherence

Presentation Transcript

Extra Cache Coherence Examples

Directory-Based Cache Coherence

Cache Coherence Schemes for Multiprocessors

Cache Coherence for GPU Architectures

Cache coherence

Cache Coherence

Cache coherence for CMPs

The Cache-Coherence Problem

Cache Coherence

Cache coherence, etc… - MIMD –

Cache Coherence Protocols

The Cache-Coherence Problem

Power Efficient Cache Coherence

Directory-based Cache Coherence

Detailed Cache Coherence Characterization for OpenMP Benchmarks

Cache Coherence

The Cache-Coherence Problem

The Cache-Coherence Problem

Example Cache Coherence Problem

Cache Coherence Techniques for Multicore Processors

The Cache-Coherence Problem