
Diamonds are a Memory Controller’s Best Friend*

Dennis Abts (Google), Natalie Enright Jerger (University of Toronto), John Kim (KAIST), Dan Gibson (Univ. of Wisconsin), Mikko Lipasti (Univ. of Wisconsin).


Presentation Transcript


  1. Diamonds are a Memory Controller’s Best Friend* Dennis Abts (Google), Natalie Enright Jerger (University of Toronto), John Kim (KAIST), Dan Gibson (Univ. of Wisconsin), Mikko Lipasti (Univ. of Wisconsin). *Also known as: Achieving Predictable Performance through Better Memory Controller Placement in Many-Core CMPs, from ISCA ’09. Those responsible for the original title have been sacked.

  2. Executive Summary: On what tiles should memory controllers reside? Three-tiered simulation approach: heuristic-guided search, detailed network simulation, full-system simulation. Diamond MC placement works well for on-chip meshes and tori: diamonds minimize maximum channel load, and diamonds deliver lower and more predictable runtimes.

  3. Background: Diverse on-chip communication: cache-to-cache, LD/ST to memory, off-chip traffic (e.g., I/O). Processors per chip are on the rise, but pins available for memory are not rising as fast, so memory bandwidth becomes more precious. Reality: many cores, few memory controllers. Tiled architectures are gaining popularity and commonly employ on-chip meshes or tori.

  4. The Problem: What memory controller placement is best overall? Flip-chip packaging allows flexible escape routes. With n tiles and m ports, don’t worry, there are only (n choose m) configurations! What are the characteristics of the best configuration? Performance: low runtime for a set of objective workloads. Throughput: low latency as a function of offered load. Fairness: similar (low) average memory latency across all nodes. Predictability: low latency and runtime variance. Slight simplification: assume n = k^2 and m = 2k.
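To get a feel for the size of the search space under the slide's simplification (n = k^2 tiles, m = 2k ports), the count of ways to pick m of n tiles can be tabulated directly. This is a minimal sketch; the function name `num_placements` is made up for illustration and is not from the talk.

```python
from math import comb

def num_placements(k: int) -> int:
    """Count the ways to choose m = 2k controller tiles out of n = k*k tiles."""
    n, m = k * k, 2 * k
    return comb(n, m)

# Tabulate the search-space size for a few mesh radices.
for k in range(4, 9):
    print(f"k={k}: {num_placements(k):,} placements")
```

For k = 5 this gives the 3,268,760 figure quoted on the exhaustive-search slide.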

  5. Baseline Placement: row0_7. X-dimension traffic encounters congestion on rows with memory controllers. • Ports to MCs located at top and bottom of chip • Conceptually similar to real parts: • Tilera’s Tile64: 64 cores, 4 MCs (4 ports each, top/bottom of chip) • Intel TeraFLOPs: 80 cores, 2 MCs (8 ports each, top/bottom of chip)

  6. Three-Tiered Approach: link contention simulation, detailed network simulation, full-system simulation. The earlier tiers allow more runs with shorter runtimes; the later tiers provide more detail.

  7. Tier 0.5: Exhaustive Search. It turns out that searching all (n choose m) configurations is tractable for k < 7 (at least on the link contention simulator: only 3,268,760 possibilities for k = 5). Patterns emerge! [Figure label: Another Contender]

  8. Tier 1: Heuristic-Guided Search. For k > 6 it is intractable to search all configurations, so use search heuristics and random search. Genetic algorithm: represent designs as a population of strings (bit vectors). Generate new designs by combining members of the population via genetic crossover (bit selection). Occasionally, mutate new population members (swap adjacent bits). Reduce population size by removing the least-fit members: survival of the fittest.
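The genetic operators named on this slide can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names, the 64-bit width, the mutation rate, and the "lower fitness is better" convention are all assumptions. Note that plain bit selection does not guarantee a child keeps exactly 2k controllers; the slides do not say how that is handled, so this sketch leaves it to the fitness function.

```python
import random

def crossover(a: int, b: int, rng: random.Random, bits: int = 64) -> int:
    """Bit selection: each child bit is copied from parent a or parent b."""
    mask = rng.getrandbits(bits)
    full = (1 << bits) - 1
    return (a & mask) | (b & ~mask & full)

def mutate(x: int, pos: int) -> int:
    """Swap bit `pos` with bit `pos + 1`; the number of set bits is unchanged."""
    lo = (x >> pos) & 1
    hi = (x >> (pos + 1)) & 1
    if lo != hi:
        x ^= (1 << pos) | (1 << (pos + 1))
    return x

def next_generation(pop, fitness, rng, size, mutation_rate=0.1, bits=64):
    """Breed `size` children, occasionally mutate them, keep the `size` fittest.

    `fitness` is a caller-supplied cost function (lower is better), for
    example a maximum-channel-load estimate for the placement.
    """
    children = []
    for _ in range(size):
        a, b = rng.sample(pop, 2)
        child = crossover(a, b, rng, bits)
        if rng.random() < mutation_rate:
            child = mutate(child, rng.randrange(bits - 1))
        children.append(child)
    return sorted(set(pop) | set(children), key=fitness)[:size]
```

Swapping adjacent bits keeps the controller count fixed, which is presumably why the slide chooses it over a plain bit flip.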

  9. Genetic MC Placement. Example: 0x00AA550000AA5500 and 0x0000FF0000FF0000 combine into 0x00AAF00000F25100. Mutate: 0x00AAF00000F25080.
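The 64-bit strings on this slide are easier to read as an 8x8 grid. The sketch below assumes a row-major encoding with the most significant byte as row 0; the slides do not state the bit order, so treat the mapping (and the helper name `placement_grid`) as hypothetical.

```python
def placement_grid(bits: int, k: int = 8) -> list[str]:
    """Render a k*k-bit placement as k text rows, most significant bits first.

    'M' marks a tile with a memory-controller port, '.' marks a plain tile.
    The row-major, MSB-first ordering is an assumption, not from the slides.
    """
    rows = []
    for r in range(k):
        shift = (k - 1 - r) * k
        row_bits = (bits >> shift) & ((1 << k) - 1)
        text = format(row_bits, f"0{k}b")
        rows.append(text.replace("1", "M").replace("0", "."))
    return rows

for line in placement_grid(0x0000FF0000FF0000):
    print(line)
```

Under that assumed ordering, the second string on the slide puts controllers on rows 2 and 5, which lines up with the row2_5 label used in the open-loop results.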

  10. Link Contention Results (k = 8): The GA selected Diamond as the most fit solution for 8x8. Diamond minimizes the number of MCs in a single row/column and spreads DOR load. Sanity check: the GA also prefers Diamond for 4x4, 5x5, and 6x6.
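A rough stand-in for a link contention estimate is to route traffic with dimension-order (X-then-Y) routing and count how many routes cross each directed link. The sketch below is not the authors' simulator: it assumes a mesh without wraparound, one request from every tile to every controller plus one reply back, and coordinates as (x, y) tuples. All names are illustrative.

```python
from collections import Counter

def xy_route(src, dst):
    """Yield directed links (from_tile, to_tile) along an X-then-Y mesh route."""
    (x, y), (dx, dy) = src, dst
    while x != dx:
        nx = x + (1 if dx > x else -1)
        yield (x, y), (nx, y)
        x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        yield (x, y), (x, ny)
        y = ny

def max_channel_load(k, mcs):
    """Largest per-link route count when each tile exchanges one
    request and one reply with every controller in `mcs`."""
    load = Counter()
    tiles = [(x, y) for x in range(k) for y in range(k)]
    for t in tiles:
        for mc in mcs:
            for link in xy_route(t, mc):   # request path
                load[link] += 1
            for link in xy_route(mc, t):   # reply path
                load[link] += 1
    return max(load.values())

k = 8
row0_7 = [(x, y) for y in (0, k - 1) for x in range(k)]
row2_5 = [(x, y) for y in (2, 5) for x in range(k)]
print("row0_7:", max_channel_load(k, row0_7))
print("row2_5:", max_channel_load(k, row2_5))
```

Any candidate placement, including a diamond-shaped one, can be passed as a list of 2k coordinates to compare its maximum load against these baselines.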

  11. Network Simulation: Open-Loop Evaluation. Detailed simulation of all network events (buffers, links, etc.). Cores are Bernoulli injection processes with uniform random traffic. Measure latency vs. offered load.
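A Bernoulli injection process means each source independently decides, every cycle, whether to inject with a fixed probability equal to the offered load. A minimal sketch of such a traffic generator, with hypothetical function names and node numbering, might look like this:

```python
import random

def bernoulli_injections(rate, cycles, rng):
    """Return the cycles in which a source injects, each with probability `rate`."""
    return [c for c in range(cycles) if rng.random() < rate]

def uniform_random_dest(src, num_nodes, rng):
    """Pick a destination uniformly among all nodes other than `src`."""
    dst = rng.randrange(num_nodes - 1)
    return dst if dst < src else dst + 1

rng = random.Random(1)
for cycle in bernoulli_injections(0.2, 20, rng):
    print(cycle, "->", uniform_random_dest(src=0, num_nodes=64, rng=rng))
```

Sweeping `rate` from 0 toward 1 and recording delivered-packet latency in a network model is what produces a latency vs. offered load curve.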

  12. Open-Loop Results [Figure: latency (cycles) versus offered load (flits/cycle) for the row0_7, row2_5, Diamond, and X placements]

  13. Closed-Loop Evaluation: Each processor executes N memory operations, with up to r operations outstanding at a time (this models MSHRs). Uses uniform random requests and real request streams with ‘hot spot’ behavior.
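The closed-loop rule (N operations, at most r in flight) can be sketched for a single processor as below. This is a simplification under stated assumptions: issuing takes zero cycles, and `latency` is a hypothetical callback that returns a round-trip time per request, standing in for the network and memory model that the real evaluation simulates.

```python
import heapq

def completion_time(n_ops, r, latency):
    """Cycle at which one processor finishes n_ops requests with at most r in flight.

    `latency(i)` is a hypothetical callback giving the round-trip cycles
    of request i; issue cost and network contention are not modeled.
    """
    in_flight = []  # min-heap of completion cycles for outstanding requests
    now = 0
    for i in range(n_ops):
        if len(in_flight) == r:
            # Every MSHR-like slot is busy: stall until the earliest reply returns.
            now = max(now, heapq.heappop(in_flight))
        heapq.heappush(in_flight, now + latency(i))
    return max(in_flight) if in_flight else now

print(completion_time(n_ops=1000, r=4, latency=lambda i: 20))
```

Because a processor stalls whenever its r slots are full, higher memory latency feeds back into a lower injection rate, which is the difference from the open-loop setup.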

  14. Closed-Loop Results [Figure: number of processors versus completion time for the Diamond and row0_7 placements]

  15. Full System Results [Figures: average network latency (cycles) for requests to the memory controller, and its standard deviation, for the JBB, WEB, TPC-H, TPC-W, and TPC-W+H workloads] Diamond placement yields lower latency and lower latency variance.

  16. Conclusion: MC placement matters! Diamond reduces contention, improves latency, and reduces latency/runtime variance. X does fairly well.
