
Throughput-Effective On-Chip Networks for Manycore Accelerators




Presentation Transcript


  1. Throughput-Effective On-Chip Networks for Manycore Accelerators Ali Bakhoda, John Kim¹ and Tor M. Aamodt ¹KAIST, Korea

  2. Manycore Accelerators and NoC • Manycore accelerators • Prevalent example: high-end GPUs • 10s of thousands of threads running at the same time • Bulk Synchronous Parallel programming style • Used in 3 of the top 5 supercomputers • Based on the Nov. 2010 Top500 list • Primary goal: Higher application level throughput • NoC in accelerators • Needs a different perspective from CPUs • Not very well studied in this context

  3. The Need for Throughput-Effective NoCs Throughput-Effective design: Improves application level performance per unit chip area

  4. Contributions • Study impact of NoC on application level performance • Traditional improvements (router latency reduction): minimal impact on application level performance • Increasing channel width: High performance gain + high area cost • Consider application level throughput per unit area of NoC • Throughput correlated with injection rate of few nodes • Many-to-few-to-many traffic pattern • Propose Throughput-Effective NoC design • Checkerboard network • Multi-port router structure

  5. Outline • Introduction • Baseline architecture • NoC properties in accelerators • Throughput-Effective NoC design • Experimental results • Conclusion

  6. Accelerator Overview

  7. Baseline Network • Mesh with MCs at periphery of the chip • Similar to Tilera’s TILE64 or Intel’s 80-core Teraflops chip • Simple and Scalable • Dimension Order Routing • Virtual Channel Flow Control • 4-cycle routers
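
The dimension order routing used by the baseline mesh can be sketched as follows. This is a minimal illustration, assuming nodes are addressed as (x, y) pairs and that the X dimension is traversed first; the function name `xy_route` is hypothetical and not taken from the slides or the simulators.

```python
def xy_route(src, dst):
    """Return the nodes visited from src to dst under XY dimension order routing.

    Assumption: nodes are (x, y) tuples on a 2D mesh; X is resolved before Y.
    """
    x, y = src
    path = [(x, y)]
    # First travel along the X dimension until the destination column is reached.
    step = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step
        path.append((x, y))
    # Then travel along the Y dimension; at most one turn is ever taken.
    step = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step
        path.append((x, y))
    return path


if __name__ == "__main__":
    route = xy_route((0, 0), (2, 1))
    print(route, "hops:", len(route) - 1)
```

Because every packet resolves X fully before Y, each route has at most one turn, which is the property the checkerboard scheme later relies on.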

  8. Finding a Balanced Design Bisection bandwidth of baseline mesh

  9. Gap between Balanced Mesh and Ideal NoC

  10. Outline • Introduction • Baseline architecture • NoC properties in accelerators • Throughput-Effective NoC design • Experimental results • Conclusion

  11. NoC properties in ManyCore Accelerators • Router latency has minimal impact on application level throughput • Aggressive 1-cycle routers instead of 4-cycle routers • Only 2.3% application level speedup • Channel Bandwidth is very important • 27% speedup by doubling BW • But quadratic area increase

  12. 2x Channel Bandwidth

  13. Many-to-Few-to-Many Traffic Pattern • [Figure: cores C0…Cn send over the request network to memory controllers MC0…MCm, which send over the reply network back to cores C0…Cn; highlighted label: MC injection bandwidth]

  14. Outline • Introduction • Baseline architecture • NoC properties in accelerators • Throughput-Effective NoC design • Experimental results • Conclusion

  15. Throughput-Effective Network design

  16. Checkerboard Routing: Half-Routers • Half-Routers: • No turns allowed at half-routers • Limited connectivity • Saves ~50% of router crossbar area • Full-Routers: • Normal routers w/ complete connectivity • Use Half-Routers every other node • [Figure: Half-Router Connectivity]

  17. Solution: Routing Restriction (1) • Routing from a full-router to a half-router that is: • An odd number of columns away • Not in the same row • Solution: Use YX routing instead of XY routing in this case

  18. Solution: Routing Restriction (2) • Routing from a half-router to a half-router that is: • An even number of columns away • Not in the same row • Solution: needs two turns (1) To intermediate full-router using YX (2) To the destination using XY • Requires an extra VC to avoid deadlock

  19. Routing Restriction (3) • Routing between full-routers that are an odd number of columns away • We avoid this case by using a different MC placement (next 2 slides)
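
The three routing restrictions can be checked with a short sketch. It assumes half-routers sit at nodes where row + column is odd (the slides do not state the parity convention), that a packet may not turn at an intermediate half-router, and that straight-line routes need no turn. The function names and the returned labels are hypothetical. Cases the slides do not list, such as half-router to full-router, fall out of the same turn-node test.

```python
def is_half_router(row, col):
    # Assumption: checkerboard layout with half-routers where row + col is odd.
    return (row + col) % 2 == 1


def checkerboard_route(src, dst):
    """Pick a routing mode for a (row, col) source and destination.

    Returns one of the hypothetical labels:
      "XY"       - route along the row first, then the column
      "YX"       - route along the column first, then the row
      "TWO_TURN" - YX to an intermediate full-router, then XY (needs an extra VC)
      "AVOIDED"  - full-router to full-router case sidestepped by MC placement
    """
    (src_row, src_col), (dst_row, dst_col) = src, dst
    if src_row == dst_row or src_col == dst_col:
        return "XY"  # straight line, no turn needed
    # XY routing turns at the node in the source row and destination column.
    if not is_half_router(src_row, dst_col):
        return "XY"
    # YX routing turns at the node in the destination row and source column.
    if not is_half_router(dst_row, src_col):
        return "YX"
    # Both single-turn routes would turn at a half-router, so source and
    # destination have the same parity: both half-routers or both full-routers.
    if is_half_router(src_row, src_col):
        return "TWO_TURN"
    return "AVOIDED"


if __name__ == "__main__":
    print(checkerboard_route((0, 0), (2, 1)))  # full -> half, odd columns away
    print(checkerboard_route((0, 1), (2, 3)))  # half -> half, even columns away
    print(checkerboard_route((0, 0), (1, 1)))  # full -> full, odd columns away
```

Testing the turn node directly, rather than enumerating the cases by hand, keeps the sketch independent of which slide a given source and destination pair belongs to.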

  20. Throughput-Effective Network design

  21. Placement of MCs • Exploit Many-to-Few • Place the MCs at Half-Router nodes • Half-Routers can communicate with all nodes with no penalty • Common case for BSP: compute cores communicate with MCs, not each other • [CMP-MSI’08] “Extending the Scalability of Single Chip Stream Processors with On-chip Caches”, Bakhoda et al. • [ISCA’09] “Achieving Predictable Performance Through Better Memory Controller Placement in Many-Core CMPs”, Abts et al.

  22. Throughput-Effective Network design

  23. Multi-port routers at MCs • Reduce the bottleneck at the few nodes • Increase terminal BW of the few nodes • Increase the injection ports of MC routers • Minimal area overhead (~1% in total NoC area) • Speedups of up to 25%
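
A back-of-the-envelope sketch of why injection ports at the few MC nodes matter. It assumes each injection port accepts at most one flit per cycle and that reply traffic is spread evenly over cores and MCs; the function name and the example counts of 28 cores and 8 MCs are hypothetical, not figures from the slides.

```python
def max_reply_rate_per_core(num_cores, num_mcs, ports_per_mc=1,
                            flits_per_port_cycle=1.0):
    """Upper bound on reply flits per cycle each core can receive.

    Assumption: all reply traffic is injected by the MCs, evenly balanced,
    and each MC injection port accepts flits_per_port_cycle flits per cycle.
    """
    total_injection = num_mcs * ports_per_mc * flits_per_port_cycle
    return total_injection / num_cores


if __name__ == "__main__":
    # Hypothetical configuration: 28 compute cores served by 8 MCs.
    single = max_reply_rate_per_core(28, 8, ports_per_mc=1)
    dual = max_reply_rate_per_core(28, 8, ports_per_mc=2)
    print(f"1 port/MC: {single:.3f} flits/cycle/core")
    print(f"2 ports/MC: {dual:.3f} flits/cycle/core")
```

Under these assumptions the bound scales linearly with the number of injection ports per MC, while only the few MC routers pay the extra area.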

  24. Throughput-Effective Network design

  25. Outline • Introduction • Baseline architecture • NoC properties in accelerators • Throughput-Effective NoC design • Experimental results • Conclusion

  26. Methodology • Compute simulation: GPGPU-Sim (2.2.1b) • NoC simulation: Booksim-2 • Integrated into GPGPU-Sim as network simulator • Area estimations: Orion 2.0 • Benchmarks: 24 CUDA applications including the Rodinia benchmarks

  27. Results • Combination of • Checkerboard routing and placement • Channel Slicing • Multi-port routers at MCs • Overall harmonic mean (HM) speedup of 17% across 24 benchmarks over balanced baseline • Total NoC area reduction of 43% • [Figure: benchmarks grouped as High Speedup / High Traffic, Low Speedup / High Traffic, Low Speedup / Low Traffic]
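
The two headline numbers can be combined into a throughput-per-NoC-area ratio with simple arithmetic. The sketch below is a derived illustration, not a figure reported in the slides, and it normalizes by NoC area only rather than by whole-chip area; the function name is hypothetical.

```python
def throughput_per_area_gain(speedup, area_reduction):
    """Relative change in throughput per unit area versus the baseline.

    speedup: fractional speedup (0.17 means 17% faster).
    area_reduction: fractional area saving (0.43 means 43% smaller).
    """
    return (1.0 + speedup) / (1.0 - area_reduction)


if __name__ == "__main__":
    # Slide values: 17% HM speedup, 43% total NoC area reduction.
    gain = throughput_per_area_gain(0.17, 0.43)
    print(f"throughput per unit NoC area: {gain:.2f}x baseline")
```

With the slide values this works out to roughly 2.05 times the baseline throughput per unit of NoC area. The gain per unit of chip area would be smaller, since the NoC is only one part of the die.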

  28. Throughput-Effective NoC

  29. Summary • Throughput-Effective design: Consider system level performance impact + area impact of NoC • Observations • NoC BW is more important than latency in accelerators • Many-to-Few-to-Many traffic pattern • Throughput-Effective NoC for accelerators • Checkerboard • Multi-port MC routers • Channel-slicing

  30. Thank you

  31. Backups…

  32. Channel Slicing – Double networks • Divide the single network into two physical networks • Each new network: half the bisection BW of the original network • Overall bisection BW: constant • Saves area • Quadratic dependency of crossbar area on channel BW • Increases serialization latency • But compute accelerators are not sensitive to latency
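
The area and latency trade-off of channel slicing can be sketched numerically. The model below assumes crossbar area grows with the square of (ports × channel width), which is one common way to express the quadratic dependency named on the slide; the 5-port router, the 16-byte channel and the 64-byte packet are hypothetical values, and buffers and other router components are ignored.

```python
import math


def crossbar_area(channel_width, ports=5):
    # Assumption: area is proportional to (ports * width) squared, arbitrary units.
    return (ports * channel_width) ** 2


def sliced_crossbar_area(channel_width, slices=2, ports=5):
    # Each slice is a separate network with channel_width / slices wide channels.
    return slices * crossbar_area(channel_width / slices, ports)


def flits_per_packet(packet_bytes, channel_width):
    # Serialization: a narrower channel needs more flits for the same packet.
    return math.ceil(packet_bytes / channel_width)


if __name__ == "__main__":
    width = 16  # hypothetical channel width in bytes
    print("2x bandwidth area ratio:", crossbar_area(2 * width) / crossbar_area(width))
    print("sliced area ratio:", sliced_crossbar_area(width) / crossbar_area(width))
    print("flits, single network:", flits_per_packet(64, width))
    print("flits, sliced network:", flits_per_packet(64, width // 2))
```

Under this model doubling channel width quadruples crossbar area, while splitting one network into two half-width networks halves it at the cost of twice as many flits per packet.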

  33. Results • Memory Controller placement • HM speedup of 13% over balanced baseline design

  34. Results • Checkerboard routing • Less than 1% performance loss compared to DOR with same resources • Reduces total router area by 14.2%

  35. Results • Channel slicing • Average change in performance < 1% • NoC area reduction of 37%

  36. Top 5 systems • TOP 5 Systems - 11/2010 • 1 Tianhe-1A - NUDT TH MPP, X5670 2.93 GHz 6C, Nvidia GPU, FT-1000 8C • 2 Jaguar - Cray XT5-HE Opteron 6-core 2.6 GHz • 3 Nebulae - Dawning TC3600 Blade, Intel X5650, Nvidia Tesla C2050 GPU • 4 TSUBAME 2.0 - HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia GPU, Linux/Windows • 5 Hopper - Cray XE6 12-core 2.1 GHz

  37. Alternative MC placement example

  38. Many-to-Few-to-Many Traffic Pattern • [Figure: cores C0…Cn send over the request network to memory controllers MC0…MCm, which send over the reply network back to cores C0…Cn; labels: core output bandwidth, MC input bandwidth, MC output bandwidth, core input bandwidth]
