Diamonds: Optimal Memory Controller Placement in Many-Core Systems

Diamonds are a Memory Controller’s Best Friend*
Dennis Abts (Google), Natalie Enright Jerger (University of Toronto), John Kim (KAIST), Dan Gibson (Univ. of Wisconsin), Mikko Lipasti (Univ. of Wisconsin)
*Also known as: Achieving Predictable Performance through Better Memory Controller Placement in Many-Core CMPs, from ISCA ’09. Those responsible for the original title have been sacked.

Executive Summary
• On what tiles should memory controllers reside?
• Three-tiered simulation approach: heuristic-guided search, detailed network simulation, full-system simulation
• Diamond MC placement works well for on-chip meshes and tori
• Diamonds minimize maximum channel load
• Diamonds deliver lower and more predictable runtimes

Background
• Diverse on-chip communication: cache-to-cache, LD/ST to memory, off-chip traffic (e.g., I/O)
• Processors per chip are on the rise
• Pins available for memory are not rising as fast, so memory bandwidth becomes more precious
• Reality: many cores, few memory controllers
• Tiled architectures are gaining popularity and commonly employ on-chip meshes or tori

The Problem
• What memory controller placement is best overall?
• Flip-chip packaging allows flexible escape routes
• n tiles and m ports: don’t worry, there are only C(n, m) configurations!
• What are the characteristics of the best configuration?
  • Performance: low runtime for a set of objective workloads
  • Throughput: low latency as a function of offered load
  • Fairness: similar (low) average memory latency across all nodes
  • Predictability: low latency and runtime variance
• Slight simplification: assume n = k^2 and m = 2k
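To get a feel for how fast the search space grows under the simplification n = k^2 and m = 2k, the count of placements can be computed directly. This is a minimal sketch; the function name `num_placements` is made up for illustration.

```python
from math import comb

def num_placements(k: int) -> int:
    """Ways to choose m = 2k memory-controller tiles out of n = k*k tiles."""
    return comb(k * k, 2 * k)

# Print the size of the search space for a few mesh sizes.
for k in range(3, 9):
    print(k, num_placements(k))
```

For k = 5 this evaluates to 3,268,760, which matches the number of possibilities quoted for the exhaustive search.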

Baseline Placement: row0_7
• X-dimension traffic encounters congestion on rows with memory controllers
• Ports to MCs located at top and bottom of chip
• Conceptually similar to real parts:
  • Tilera’s Tile64: 64 cores, 4 MCs (4 ports each, top/bottom of chip)
  • Intel TeraFLOPs: 80 cores, 2 MCs (8 ports each, top/bottom of chip)

Three-Tiered Approach
• Link contention simulation: more runs, shorter runtimes
• Detailed network simulation
• Full-system simulation: more detail

Tier 0.5: Exhaustive Search
• It turns out that searching all C(k^2, 2k) configurations is tractable for k < 7 (at least on the link contention simulator: only 3,268,760 possibilities for k = 5)
• Another contender
• Patterns emerge!

Tier 1: Heuristic-Guided Search
• k > 6: intractable to search all configurations
• Use search heuristics and random search
• Genetic algorithm:
  • Represent designs as a population of strings (bit vectors)
  • Generate new designs by combining members of the population via genetic crossover (bit selection)
  • Occasionally mutate new population members (swap adjacent bits)
  • Reduce population size by removing the least-fit members: survival of the fittest

Genetic MC Placement
• Parents: 0x00AA550000AA5500 and 0x0000FF0000FF0000
• Crossover: 0x00AAF00000F25100
• Mutate: 0x00AAF00000F25080
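The two genetic operators named in the slides can be sketched on 64-bit placement vectors, one bit per tile. This is an illustrative sketch, not the authors' implementation: the slides do not say how the number of set bits is held at m = 2k after crossover, so the `repair` step below is a hypothetical addition, and all function names are made up.

```python
import random

def crossover(a, b, nbits, rng):
    """Bit selection: each child bit is copied from parent a or parent b."""
    mask = rng.getrandbits(nbits)
    full = (1 << nbits) - 1
    return (a & mask) | (b & ~mask & full)

def swap_adjacent(x, i):
    """Mutation: swap bit i with bit i+1 (no change if the two bits are equal)."""
    lo, hi = (x >> i) & 1, (x >> (i + 1)) & 1
    if lo != hi:
        x ^= (1 << i) | (1 << (i + 1))
    return x

def repair(x, nbits, m, rng):
    """Hypothetical step (assumption): flip random bits until exactly m are set."""
    while bin(x).count("1") != m:
        i = rng.randrange(nbits)
        if bin(x).count("1") > m:
            x &= ~(1 << i)   # too many MCs: clear a bit
        else:
            x |= 1 << i      # too few MCs: set a bit
    return x

rng = random.Random(1)
parent_a = 0x00AA550000AA5500
parent_b = 0x0000FF0000FF0000
child = repair(crossover(parent_a, parent_b, 64, rng), 64, 16, rng)
print(hex(child))

# The mutation shown on the slide is a swap of bits 7 and 8.
print(hex(swap_adjacent(0x00AAF00000F25100, 7)))
```

Swapping adjacent bits moves one memory controller to a neighboring tile in the bit ordering, so the mutation never changes the number of controllers.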

Link Contention Results (k = 8)
• GA selected Diamond as the most fit solution for 8x8
  • Minimizes MCs in a single row/column
  • Spreads dimension-order routing (DOR) load
• Sanity check: GA also prefers Diamond for 4x4, 5x5, and 6x6
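The effect of spreading DOR load can be illustrated with a small link-contention count. The sketch below is a simplified stand-in for the link contention simulator, under several assumptions: every tile sends one request to every MC and receives one reply, both routed X-then-Y on a mesh, and load is the number of routes crossing each directed link. The `diamond_like` layout (two MCs per row and per column) is a hypothetical approximation of the Diamond placement, not the exact layout from the talk.

```python
from collections import Counter

def xy_route(src, dst):
    """Yield the directed links of an X-then-Y dimension-order route on a mesh."""
    (r, c), (dr, dc) = src, dst
    while c != dc:                       # travel along the row first
        nc = c + (1 if dc > c else -1)
        yield ((r, c), (r, nc))
        c = nc
    while r != dr:                       # then along the column
        nr = r + (1 if dr > r else -1)
        yield ((r, c), (nr, c))
        r = nr

def max_channel_load(k, mcs):
    """Maximum number of request/reply routes sharing any one directed link."""
    load = Counter()
    for tile in ((r, c) for r in range(k) for c in range(k)):
        for mc in mcs:
            for link in xy_route(tile, mc):   # request: tile -> MC
                load[link] += 1
            for link in xy_route(mc, tile):   # reply: MC -> tile
                load[link] += 1
    return max(load.values())

def row0_7(k):
    """Baseline: MCs on every tile of the top and bottom rows."""
    return [(r, c) for r in (0, k - 1) for c in range(k)]

def diamond_like(k):
    """Hypothetical diamond (assumption): two MCs per row and per column, k even."""
    h = k // 2
    cells = []
    for r in range(k):
        d = r if r < h else k - 1 - r    # distance to the nearest top/bottom edge
        cells += [(r, h - 1 - d), (r, h + d)]
    return cells

print("row0_7 :", max_channel_load(8, row0_7(8)))
print("diamond:", max_channel_load(8, diamond_like(8)))
```

Under these assumptions the diamond-like layout has a lower maximum channel load than row0_7, because reply traffic in the X dimension is no longer concentrated in two rows.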

Network Simulation: Open-Loop Evaluation
• Detailed simulation of all network events (buffers, links, etc.)
• Cores are Bernoulli injection processes with uniform random traffic
• Measure latency vs. offered load
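A Bernoulli injection process means that each core independently decides, every cycle, whether to inject with probability equal to the offered load. The sketch below only generates such a trace and does not model the network itself. It assumes single-flit packets and that destinations are drawn uniformly from a list of MC tiles; the function and parameter names are made up for illustration.

```python
import random

def bernoulli_traffic(sources, dests, rate, cycles, seed=0):
    """Return (cycle, src, dst) tuples; each source injects with probability
    `rate` per cycle and picks its destination uniformly at random."""
    rng = random.Random(seed)
    packets = []
    for cycle in range(cycles):
        for src in sources:
            if rng.random() < rate:
                packets.append((cycle, src, rng.choice(dests)))
    return packets

tiles = [(r, c) for r in range(8) for c in range(8)]
mcs = [(r, c) for r in (0, 7) for c in range(8)]   # row0_7 baseline as an example
trace = bernoulli_traffic(tiles, mcs, rate=0.3, cycles=200)

# Measured injection rate per core per cycle; should be close to the offered load.
print(len(trace) / (len(tiles) * 200))
```

Sweeping `rate` from 0 toward 1 and feeding each trace to a network model is what produces a latency versus offered load curve.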

Open-Loop Results
[Figure: latency (cycles) vs. offered load (flits/cycle) for the row0_7, row2_5, Diamond, and X placements]

Closed-Loop Evaluation
• Each processor executes N memory operations
• Up to r operations outstanding at a time (models MSHRs)
• Uniform random requests, and real request streams with ‘hot spot’ behavior
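The closed-loop constraint can be sketched for a single processor: a new request may issue only while fewer than r are outstanding, so network latency feeds back into completion time. This is a simplified illustration, not the simulator used in the talk. It assumes at most one issue per cycle and takes a caller-supplied `latency` function as a stand-in for the network round trip; both are assumptions.

```python
import heapq

def completion_time(n_ops, r, latency):
    """Cycle at which a processor finishes n_ops requests with at most r in
    flight. latency(i) is the round-trip time of request i (assumed input)."""
    in_flight = []        # min-heap of cycles at which outstanding requests finish
    now = 0
    for i in range(n_ops):
        if len(in_flight) == r:
            # All r slots (MSHRs) are busy: stall until the earliest one returns.
            now = max(now, heapq.heappop(in_flight))
        heapq.heappush(in_flight, now + latency(i))
        now += 1          # assumption: at most one request issued per cycle
    return max(in_flight, default=0)

# With one outstanding request, operations are fully serialized.
print(completion_time(3, 1, lambda i: 10))
# With enough slots, requests overlap and only the issue rate limits progress.
print(completion_time(3, 4, lambda i: 10))
```

Running this per processor with placement-dependent latencies would give a distribution of completion times across processors, which is the kind of data a closed-loop evaluation reports.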

Closed-Loop Results
[Figure: histogram of number of processors vs. completion time for the Diamond and row0_7 placements]

Full System Results
[Figure: average network latency (cycles) for requests to a memory controller, and its standard deviation, for the JBB, WEB, TPC-H, TPC-W, and TPC-W+H workloads]
• Diamond placement yields lower latency and lower latency variance.

Conclusion
• MC placement matters!
• Diamond reduces contention, improves latency, and reduces latency/runtime variance
• X does fairly well
