A Domain Specific On-Chip Network Design for Large Scale Cache Systems

Yuho Jin, Eun Jung Kim, Ki Hwan Yum
HPCA 2007

Motivation
• Large caches are becoming the norm
• They are not optimized and have larger access times
• Network resources are overprovisioned and underutilized
• Can the same performance be achieved with fewer resources?

Contributions
• Single-stage multicast router
  • Not really new! (Not really feasible??)
• New network topology for large caches (banks)
  • Minimizes the number of links in the system
• A new routing algorithm
  • For the new topology
• A new replacement policy in NUCA caches
  • Fast-LRU, exploits multicast
Overall aim: reduce network overhead for minimal performance loss.

Router Architecture
• 4 virtual channels (VCs) per physical channel (PC)
• Near single-stage router
• Look-ahead routing
• Buffer bypass
• Speculative switch allocation
• Arbitration precomputation
[Figure: coordinate system]

Multicast Support
• Multicast: synchronous vs. asynchronous
• Asynchronous multicast => flit replication => extra buffer space
• Can it be done without extra hardware?
  • Use the existing VC buffers
  • Copy the flit to a different PC's buffer
  • Prefer the less-used PCs
  • Get a free VC
  • Send the flit copies to the different destinations
Figure courtesy: Chita R Das, OCIN ’06
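The replication steps above can be sketched roughly as follows. This is only an illustration, not the router's actual logic: the `PhysicalChannel` class, the `replicate_flit` function, the port names, and the "fewest occupied VCs" selection rule are all assumptions made for the example. Only the figure of 4 VCs per PC comes from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

NUM_VCS = 4  # the slides state 4 VCs per physical channel (PC)


@dataclass
class PhysicalChannel:
    # Each slot holds a buffered flit, or None when that VC is free.
    vcs: List[Optional[str]] = field(default_factory=lambda: [None] * NUM_VCS)

    def used(self) -> int:
        """Number of occupied VCs on this PC."""
        return sum(vc is not None for vc in self.vcs)

    def free_vc(self) -> Optional[int]:
        """Index of the first free VC, or None if all are occupied."""
        for i, vc in enumerate(self.vcs):
            if vc is None:
                return i
        return None


def replicate_flit(ports: Dict[str, PhysicalChannel], in_port: str,
                   flit: str) -> Optional[Tuple[str, int]]:
    """Copy `flit` into a free VC of the least-used PC other than `in_port`.

    Hypothetical policy: "least used" means fewest occupied VCs, with ties
    broken by port name. Returns (port_name, vc_index), or None when no
    other PC has a free VC.
    """
    candidates = [(pc.used(), name) for name, pc in ports.items()
                  if name != in_port and pc.free_vc() is not None]
    if not candidates:
        return None
    _, name = min(candidates)
    vc = ports[name].free_vc()
    ports[name].vcs[vc] = flit
    return name, vc
```

The point of the sketch is that the copy lands in a buffer the router already has, so no dedicated replication storage is modeled.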

Set Associative Cache

Fast-LRU Replacement
• Bank set arrangement: sets distributed among banks
[Figure: a bank column with entries labelled by set and way, e.g. S0,W1; S1,W0; S0,W2; S0,W3; S0,W4]
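As a rough illustration of LRU ordering across a bank set, the sketch below models a single set whose ways sit in consecutive banks, with index 0 as the MRU bank and the last index as the LRU bank. It shows plain LRU promotion and eviction only. It does not model the multicast or timing aspects of Fast-LRU, and the `access` function and the one-way-per-bank layout are assumptions for the example.

```python
from typing import List, Optional, Tuple


def access(bank_set: List[Optional[str]], tag: str) -> Tuple[bool, Optional[str]]:
    """Access `tag` in one bank set, updating the list in place.

    Index 0 is the MRU bank and the last index is the LRU bank (assumed
    layout). Returns (hit, evicted_tag); evicted_tag is None on a hit or
    when the LRU bank was empty.
    """
    if tag in bank_set:
        pos = bank_set.index(tag)      # hit: block found in bank `pos`
        hit, evicted = True, None
    else:
        pos = len(bank_set) - 1        # miss: the LRU bank's block leaves
        hit, evicted = False, bank_set[pos]
    # Blocks in banks 0..pos-1 each move one bank toward the LRU end,
    # then the accessed block is placed in the MRU bank.
    bank_set[1:pos + 1] = bank_set[0:pos]
    bank_set[0] = tag
    return hit, evicted


banks = [None] * 4
for t in ("a", "b", "c", "d"):
    access(banks, t)
print(banks)               # ['d', 'c', 'b', 'a']
print(access(banks, "b"))  # (True, None)
print(access(banks, "e"))  # (False, 'a')
```

Every hit or miss moves blocks between neighbouring banks of the same set, which is why traffic in this kind of arrangement runs mostly along one bank column.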

Multicast Fast-LRU

Network Topology: Access Patterns
• A: Data request
• B: Check for data
• B/C: Move data
• D/E: Data to core
• A’/B’: Hit or miss
• F: Data delivery from memory to the MRU bank
• G: Dirty block to memory

Network Topology
=> Horizontal links are mostly not required, except where marked in the slide's figure (not reproduced here)
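To get a feel for how many links such pruning could save, here is a small counting sketch. It assumes a plain rows x cols mesh of banks and assumes that horizontal links survive in exactly one row. Both the grid sizes and the one-row assumption are hypothetical and are not taken from the slides.

```python
def mesh_links(rows: int, cols: int) -> int:
    """Bidirectional links in a full rows x cols 2D mesh."""
    horizontal = rows * (cols - 1)
    vertical = (rows - 1) * cols
    return horizontal + vertical


def pruned_links(rows: int, cols: int, rows_with_horizontal: int = 1) -> int:
    """Links left when only `rows_with_horizontal` rows keep horizontal links."""
    horizontal = rows_with_horizontal * (cols - 1)
    vertical = (rows - 1) * cols
    return horizontal + vertical


print(mesh_links(4, 4), pruned_links(4, 4))      # 24 15
print(mesh_links(16, 16), pruned_links(16, 16))  # 480 255
```

Under these assumptions the share of links removed grows with the grid size, approaching half of the mesh for large square grids.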

The HALO Network

Implications
• Underutilized links removed => simplified network, area savings(?)
• Power savings come for free (fewer buffers)
• Links removed => constrained routing
• XYX routing proposed
  • XY is dimension-order routing
• Simplified algorithm:
  • Going from core/memory to a cache bank: travel horizontally first
  • Going from a cache bank to core/memory: travel vertically first

Example: Yoff negative, Xoff positive -> channel = Y-

Example: Yoff positive, Xoff positive -> channel = X+, then Y+
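The simplified routing rule and the two examples can be sketched as a per-hop channel selection. This is a minimal sketch under stated assumptions: the names `next_channel` and `path`, the `to_bank` flag, and the convention that a positive offset selects the "+" channel are chosen for illustration, and the first example is assumed to be a bank-to-core transfer. It is not the paper's XYX algorithm in full.

```python
from typing import List


def next_channel(x_off: int, y_off: int, to_bank: bool) -> str:
    """Pick the next output channel from the remaining offsets.

    to_bank=True  : core/memory -> cache bank, horizontal (X) first.
    to_bank=False : cache bank -> core/memory, vertical (Y) first.
    """
    if x_off == 0 and y_off == 0:
        return "EJECT"  # already at the destination
    x_ch = "X+" if x_off > 0 else "X-"
    y_ch = "Y+" if y_off > 0 else "Y-"
    if to_bank:
        return x_ch if x_off != 0 else y_ch
    return y_ch if y_off != 0 else x_ch


def path(x_off: int, y_off: int, to_bank: bool) -> List[str]:
    """List the channels taken hop by hop until both offsets reach zero."""
    hops = []
    while (x_off, y_off) != (0, 0):
        ch = next_channel(x_off, y_off, to_bank)
        hops.append(ch)
        if ch == "X+":
            x_off -= 1
        elif ch == "X-":
            x_off += 1
        elif ch == "Y+":
            y_off -= 1
        else:  # "Y-"
            y_off += 1
    return hops


print(next_channel(1, -1, to_bank=False))  # Y-
print(path(1, 1, to_bank=True))            # ['X+', 'Y+']
```

With these conventions the two printed results match the two examples on the slides: Y- for a negative Yoff with positive Xoff, and X+ followed by Y+ when both offsets are positive.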

Results

Results: Average Access Latency
• The design targets network latency, which makes up 50-60% of total latency

IPC Comparisons
