Throughput-Effective On-Chip Networks for Manycore Accelerators

Ali Bakhoda, John Kim¹ and Tor M. Aamodt (¹KAIST, Korea)

Manycore Accelerators and NoC • Manycore accelerators • Prevalent example: high-end GPUs • 10s of thousands of threads running at the same time • Bulk Synchronous Parallel programming style • Used in 3 of the top 5 supercomputers • Based on the Nov. 2010 Top500 list • Primary goal: Higher application level throughput • NoC in accelerators • Needs a different perspective from CPUs • Not very well studied in this context

The Need for Throughput-Effective NoCs Throughput-Effective design: Improves application level performance per unit chip area

Contributions • Study impact of NoC on application level performance • Traditional improvements (router latency reduction): minimal impact on application level performance • Increasing channel width: High performance gain + high area cost • Consider application level throughput per unit area of NoC • Throughput correlated with injection rate of few nodes • Many-to-few-to-many traffic pattern • Propose Throughput-Effective NoC design • Checkerboard network • Multi-port router structure

Outline • Introduction • Baseline architecture • NoC properties in accelerators • Throughput-Effective NoC design • Experimental results • Conclusion

Accelerator Overview

Baseline Network • Mesh with MCs at periphery of the chip • Similar to Tilera’s TILE64 or Intel’s 80-core Teraflops chip • Simple and Scalable • Dimension Order Routing • Virtual Channel Flow Control • 4-cycle routers

Finding a Balanced Design Bisection bandwidth of baseline mesh

Gap between Balanced Mesh and Ideal NoC

NoC properties in Manycore Accelerators • Router latency has minimal impact on application level throughput • Aggressive 1-cycle routers instead of 4-cycle routers • Only 2.3% application level speedup • Channel bandwidth is very important • 27% speedup by doubling BW • But quadratic area increase

2x Channel Bandwidth

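Many-to-Few-to-Many Traffic Pattern [Figure: cores C0, C1, C2, ..., Cn send over the request network to memory controllers MC0, MC1, ..., MCm, which send back over the reply network to the same cores; label: MC injection bandwidth]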
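To make the many-to-few-to-many pattern concrete, the following is a minimal sketch of how much reply traffic each MC has to inject when many cores funnel requests into a few MCs. The function name `per_mc_injection_load`, the even spread of requests over MCs, the one-reply-per-request rule, and the numbers in the usage lines are all assumptions for illustration, not values from the talk.

```python
def per_mc_injection_load(n_cores, n_mcs, requests_per_core_per_cycle):
    """Average replies per cycle that one MC must inject into the reply network.

    Assumptions (not from the talk): every request produces exactly one
    reply, and cores spread their requests evenly over all MCs.
    """
    return n_cores * requests_per_core_per_cycle / n_mcs


# Hypothetical configuration: 32 cores, 8 MCs, 0.1 requests/core/cycle.
load = per_mc_injection_load(n_cores=32, n_mcs=8, requests_per_core_per_cycle=0.1)
print(f"each MC injects about {load:.2f} replies per cycle")
```

Under these assumptions each MC injects n_cores / n_mcs times as many packets as a single core, which is why the injection bandwidth of the few MC nodes becomes the limiting resource.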

Throughput-Effective Network design

Checkerboard Routing: Half-Routers • Half-Routers • No turns allowed at half-routers • Limited connectivity • Saves ~50% of router crossbar area • Full-Routers: • Normal routers w/ complete connectivity • Use Half-Routers every other node [Figure: Half-Router Connectivity]

Solution: Routing Restriction (1) • Routing from a full-router to a half-router that is: • An odd number of columns away • Not in the same row • Solution: Use YX routing instead of XY routing in this case

Solution: Routing Restriction (2) • Routing from a half-router to a half-router that is: • An even number of columns away • Not in the same row • Solution: needs two turns (1) To intermediate full-router using YX (2) To the destination using XY • Requires an extra VC to avoid deadlock

Routing Restriction (3) • Routing between full-routers that are an odd number of columns away • We avoid this case by using a different MC placement (next 2 slides)
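The three routing restrictions can be summarised as one small decision function. This is an illustrative sketch, not the authors' implementation: the (x, y) coordinate convention, the rule that a node is a full-router when x + y is even, and the names `is_full_router` and `choose_route` are assumptions introduced here. The only rule taken from the slides is that a packet may not turn at a half-router.

```python
def is_full_router(x, y):
    # Assumed checkerboard layout: full-routers where x + y is even,
    # half-routers everywhere else.
    return (x + y) % 2 == 0


def choose_route(src, dst):
    """Pick a routing mode for a packet from src to dst, both (x, y) tuples."""
    sx, sy = src
    dx, dy = dst
    if sx == dx or sy == dy:
        return "straight"        # same row or column: no turn is needed
    if is_full_router(dx, sy):   # XY routing turns at (dx, sy)
        return "XY"
    if is_full_router(sx, dy):   # YX routing turns at (sx, dy)
        return "YX"              # restriction (1)
    if not is_full_router(sx, sy) and not is_full_router(dx, dy):
        return "YX-then-XY"      # restriction (2): two turns via a full-router
    return "unsupported"         # restriction (3): avoided by MC placement


print(choose_route((0, 0), (1, 2)))  # full -> half, odd number of columns away
print(choose_route((1, 0), (3, 2)))  # half -> half, even number of columns away
print(choose_route((0, 0), (1, 1)))  # full -> full, odd number of columns away
```

The sketch only classifies a source and destination pair. It does not pick the intermediate full-router for the two-turn case or model the extra virtual channel needed to avoid deadlock.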

Placement of MCs • Exploit Many-to-Few • Place the MCs at Half-Router nodes • Half-Routers can communicate with all nodes with no penalty • Common case for BSP: compute cores communicate with MCs, not each other • [CMP-MSI’08] “Extending the Scalability of Single Chip Stream Processors with On-chip Caches”, Bakhoda et al. • [ISCA’09] “Achieving Predictable Performance Through Better Memory Controller Placement in Many-Core CMPs”, Abts et al.

Multi-port routers at MCs • Reduce the bottleneck at the few nodes • Increase terminal BW of the few nodes • Increase the injection ports of MC routers • Minimal area overhead (~1% in total NoC area) • Speedups of up to 25%

Methodology • Compute simulation: GPGPU-Sim (2.2.1b) • NoC simulation: Booksim-2 • Integrated into GPGPU-Sim as network simulator • Area estimations: Orion 2.0 • Benchmarks: 24 CUDA applications including the Rodinia benchmarks

Results • Combination of • Checkerboard routing and placement • Channel Slicing • Multi-port routers at MCs • Overall HM speedup 17% across 24 benchmarks over balanced baseline • Total NoC area reduction of 43% [Figure: benchmarks grouped as High Speedup / High Traffic, Low Speedup / High Traffic, Low Speedup / Low Traffic]
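Assuming "HM" stands for harmonic mean, the following sketch shows how a single summary speedup is obtained from per-benchmark speedups. The three values are hypothetical and chosen only to keep the arithmetic easy to follow; they are not results from the talk.

```python
from statistics import harmonic_mean

# Hypothetical per-benchmark speedups over the baseline (NOT measured data).
speedups = [1.0, 1.2, 1.5]

# Harmonic mean: n / sum(1 / s_i). It weights slow benchmarks more heavily
# than an arithmetic mean would.
hm = harmonic_mean(speedups)
print(f"HM speedup: {hm:.2f}x")
```

For comparison, the arithmetic mean of the same three values is about 1.23, so the harmonic mean is the more conservative summary.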

Throughput-Effective NoC

Summary • Throughput-Effective design: Consider system level performance impact + area impact of NoC • Observations • NoC BW is more important than latency in accelerators • Many-to-Few-to-Many traffic pattern • Throughput-Effective NoC for accelerators • Checkerboard • Multi-port MC routers • Channel-slicing

Thank you

Backups…

Channel Slicing – Double networks • Divide the single network into two physical networks • Each new network: half the bisection BW of the original network • Overall bisection BW: constant • Saves area • Quadratic dependency of crossbar area on channel BW • Increases serialization latency • But compute accelerators are not sensitive to latency
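The area saving from channel slicing follows directly from the quadratic dependency of crossbar area on channel width. The sketch below works through that arithmetic. The function names, the proportionality constant `k`, and the 16-byte width are assumptions for illustration, and the model covers only the crossbar term, so it is not expected to reproduce total NoC area figures.

```python
def crossbar_area(channel_width, k=1.0):
    # Assumed model: crossbar area grows with the square of channel width.
    return k * channel_width ** 2


def sliced_crossbar_area(channel_width, n_slices=2, k=1.0):
    # n_slices physical networks, each with 1/n_slices of the channel width,
    # so the total bisection bandwidth stays constant.
    return n_slices * crossbar_area(channel_width / n_slices, k)


single = crossbar_area(16)          # one network, hypothetical 16-byte channels
double = sliced_crossbar_area(16)   # two networks with 8-byte channels each
print(f"sliced / single crossbar area = {double / single:.2f}")
```

With two slices the ratio is 2 * (w/2)^2 / w^2 = 1/2, so the crossbar term halves while the total bisection bandwidth is unchanged.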

Results • Memory Controller placement • HM speedup of 13% over balanced baseline design

Results • Checkerboard routing • Less than 1% performance loss compared to DOR with same resources • Reduces total router area by 14.2%

Results • Channel slicing • Average change in performance < 1% • NoC area reduction of 37%

Top 5 systems • TOP 5 Systems - 11/2010 • 1 Tianhe-1A - NUDT TH MPP, X5670 2.93 GHz 6C, Nvidia GPU, FT-1000 8C • 2 Jaguar - Cray XT5-HE Opteron 6-core 2.6 GHz • 3 Nebulae - Dawning TC3600 Blade, Intel X5650, Nvidia Tesla C2050 GPU • 4 TSUBAME 2.0 - HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia GPU, Linux/Windows • 5 Hopper - Cray XE6 12-core 2.1 GHz

Alternative MC placement example

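Many-to-Few-to-Many Traffic Pattern [Figure: cores C0, C1, C2, ..., Cn send over the request network to memory controllers MC0, MC1, ..., MCm, which send back over the reply network to the same cores; labels: core output bandwidth, MC input bandwidth, MC output bandwidth, core input bandwidth]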
