Optimizing Off-Chip Memory Interconnects for Many-Core Processors

L2 to Off-Chip Memory Interconnects for CMPs Presented by Allen Lee CS258 Spring 2008 May 14, 2008

Motivation • In modern many-core systems, there is significant asymmetry between the number of cores and the number of memory access points • Tilera’s multiprocessor has 64 cores and only 4 memory controllers • PARSEC benchmarks suggest that off-chip memory traffic increases with the number of cores for CMPs • We explore mechanisms to lower the latency and power consumption of the processor-memory interconnect

Tilera Tile64 x5

Tilera Tile64 • Five physical mesh networks • UDN, IDN, SDN, TDN, MDN • TDN and MDN are used for handling memory traffic • Memory requests transit TDN • Large store requests, small load requests • Memory responses transit MDN • Large load responses, small store responses • Includes cache-to-cache transfers and off-chip transfers

Tapered Fat-Tree • Good for many-to-few connectivity • Fewer hops → Shorter latency • Fewer routers → Less power, less area • Root nodes directly connect to memory controller • Replace MDN mesh network with two tapered fat-tree networks • One for routing requests up • One for routing responses down
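The "fewer hops" argument can be illustrated with a rough hop count. The sketch below is not taken from the presentation: it assumes an 8x8 mesh with four controller attach points at hypothetical edge positions, and a tree in which each root serves 16 tiles through 4-ary routers. The function names, the placement, and the arity are all illustrative assumptions. Mesh hops count links traversed and tree levels count routers traversed, so the two numbers are only loosely comparable.

```python
from itertools import product


def mesh_avg_hops(n, controllers):
    """Average Manhattan distance from each tile of an n x n mesh
    to its nearest controller attach point."""
    total = 0
    for x, y in product(range(n), repeat=2):
        total += min(abs(x - cx) + abs(y - cy) for cx, cy in controllers)
    return total / (n * n)


def tree_levels(leaves_per_root, arity):
    """Number of router levels between a leaf tile and its root
    for a tree whose routers each have `arity` children."""
    levels = 0
    span = 1
    while span < leaves_per_root:
        span *= arity
        levels += 1
    return levels


# Assumed (hypothetical) attach points on the top and bottom mesh edges.
N = 8
CONTROLLERS = [(1, 0), (6, 0), (1, 7), (6, 7)]

avg_mesh = mesh_avg_hops(N, CONTROLLERS)
levels = tree_levels(leaves_per_root=16, arity=4)

print(f"mesh: {avg_mesh} average hops to nearest controller")
print(f"tree: {levels} router levels from any tile to its root")
```

Under these assumptions every tile in the tree reaches its root through the same number of router levels, while the mesh distance varies with the tile's position. A different placement or arity would change both figures.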

Tile64 with Tapered Fat Tree

Memory Model • Directory-based cache coherence • Directory cache at every node • Off-chip directory controller • Tile-to-tile requests and responses transit the TDN • Off-chip memory requests and responses transit the MDN

TDN and MDN Traffic for L2 Read Misses

Synthetic Benchmarks • Statistical simulation • Model benchmarks from PARSEC suite • Based on off-chip traffic for a 64-byte cache line and 64 cores • [Figure: benchmarks arranged by working set size (small to large) and sharing (more to less)]

Breakdown of Average Latency • Latency of memory intensive applications dominated by queuing delay. • Benchmarks with little off-chip traffic save on transit time.
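Why queuing delay dominates under heavy memory traffic can be seen with a first-order queueing estimate. The sketch below uses the textbook M/M/1 mean waiting time, which is an assumption made here for illustration and not the model used in the presentation. The function names, the rates (in requests per cycle), and the fixed transit time are hypothetical.

```python
def mm1_queue_wait(arrival_rate, service_rate):
    """Mean time spent waiting in an M/M/1 queue (excluding service).

    W_q = rho / (mu - lambda), where rho = lambda / mu.
    """
    if not 0 <= arrival_rate < service_rate:
        raise ValueError("need 0 <= arrival_rate < service_rate")
    rho = arrival_rate / service_rate
    return rho / (service_rate - arrival_rate)


def total_latency(transit_cycles, arrival_rate, service_rate):
    """Transit time plus queuing delay, both in cycles."""
    return transit_cycles + mm1_queue_wait(arrival_rate, service_rate)


# Hypothetical operating points for one memory access point.
TRANSIT = 5.0  # assumed fixed transit time in cycles
light = total_latency(TRANSIT, arrival_rate=0.1, service_rate=1.0)
heavy = total_latency(TRANSIT, arrival_rate=0.9, service_rate=1.0)

print(f"light load: {light:.2f} cycles")
print(f"heavy load: {heavy:.2f} cycles")
```

In this toy model the queuing term is small at light load, so shortening the transit path gives most of the benefit, whereas near saturation the queuing term exceeds the transit time.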

Power Modeling • Orion power simulator for on-chip routers from Princeton University • Models switching power as sum of • Buffer power • Crossbar power • Arbitration power • Specify parameters • Activity factor, number of input and output ports, virtual channels, size of input buffer, etc.
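The "sum of components" structure can be sketched as follows. This is not Orion's API or its internal capacitance model. It applies the standard CMOS dynamic-power relation P = α·C·V²·f to each component and adds the results. The supply voltage and clock frequency are the values listed on the Parameters slide. The capacitance figures and the activity factor are hypothetical placeholders chosen only to make the example run.

```python
def dynamic_power(activity, cap_farads, vdd, freq_hz):
    """Standard CMOS dynamic power estimate: alpha * C * Vdd^2 * f (watts)."""
    return activity * cap_farads * vdd ** 2 * freq_hz


def router_switching_power(activity, caps, vdd, freq_hz):
    """Switching power modeled as buffer + crossbar + arbitration power."""
    return sum(dynamic_power(activity, c, vdd, freq_hz) for c in caps.values())


VDD = 1.0         # volts (from the Parameters slide)
FREQ_HZ = 750e6   # 750 MHz (from the Parameters slide)
ACTIVITY = 0.5    # hypothetical activity factor

# Hypothetical effective switched capacitance per component, in farads.
CAPS = {
    "buffer": 2.0e-12,
    "crossbar": 3.0e-12,
    "arbiter": 0.5e-12,
}

total_watts = router_switching_power(ACTIVITY, CAPS, VDD, FREQ_HZ)
print(f"estimated switching power: {total_watts * 1e3:.4f} mW per router")
```

In a real power simulator the per-component capacitances would be derived from the number of ports, virtual channels, buffer depth, and flit width, which is why those appear as input parameters.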

Tilera MDN Routers

Tree Routers

Parameters • 100 nm CMOS process • VDD = 1.0V • Clock Frequency = 750 MHz • 32-bit flit width

Conclusion • Physical design of the tapered fat-tree is more difficult • The TFT topology can reduce memory latency and power dissipation for many-core systems
