Multi-Core Systems

Multi-Core Systems Multi-Core Chip CORE 0 CORE 1 CORE 2 CORE 3 L2 CACHE L2 CACHE L2 CACHE L2 CACHE unfairness Shared DRAM Memory System DRAM MEMORY CONTROLLER . . . DRAM Bank 0 DRAM Bank 1 DRAM Bank 2 DRAM Bank 7

DRAM Bank Operation Access Address (Row 0, Column 0) Access Address (Row 1, Column 0) Access Address (Row 0, Column 9) Access Address (Row 0, Column 1) Columns Row address 0 Row address 1 Row decoder Rows Row Buffer CONFLICT ! HIT HIT Row 1 Row 0 Empty Column address 0 Column address 9 Column address 1 Column decoder Column address 0 Data

DRAM Controllers • A row-conflict memory access takes much longer than a row-hit access (50ns vs. 100ns) • Current controllers take advantage of the row buffer • Commonly used scheduling policy (FR-FCFS): • Row-hit first: Service row-hit memory accesses first • Oldest-first: Then service older accesses first • This scheduling policy aims to • Maximize DRAM throughput • But, yields itself to a new form of denial of service with multiple threads

Memory Performance Attacks:Denial of Memory Service in Multi-Core Systems Thomas Moscibroda and Onur Mutlu Computer Architecture Group Microsoft Research

Outline • The Problem • Memory Performance Hogs • Preventing Memory Performance Hogs • Our Solution • Fair Memory Scheduling • Experimental Evaluation • Conclusions

The Problem • Multiple threads share the DRAM controller • DRAM controllers are designed to maximize DRAM throughput • DRAM scheduling policies are thread-unaware and unfair • Row-hit first: unfairly prioritizes threads with high row buffer locality • Streaming threads • Threads that keep on accessing the same row • Oldest-first: unfairly prioritizes memory-intensive threads • Vulnerable to denial of service attacks

Memory Performance Hog (MPH) • A thread that exploits unfairness in the DRAM controller • to deny memory service (for long time periods) to threads co-scheduled on the same chip • Can severely degrade other threads’ performance • While maintaining most of its own performance! • Easy to write an MPH • An MPH can be designed intentionally (malicious) • Or, unintentionally! • Implications in commodity multi-core systems: • widespread performance degradation and unpredictability • user discomfort and loss of productivity • vulnerable hardware could lose market share

An Example Memory Performance Hog STREAM • Sequential memory access • Very high row buffer locality (96% hit rate) • Memory intensive

A Co-Scheduled Application RDARRAY • Random memory access • Very low row buffer locality (3% hit rate) • Similarly memory intensive

What Does the MPH Do? Row decoder T0: Row 0 T0: Row 0 T0: Row 0 T0: Row 0 T0: Row 0 T0: Row 0 T0: Row 0 T1: Row 5 Row size: 8KB, data size: 64B 128 requests of stream serviced before rdarray T1: Row 111 T0: Row 0 T1: Row 16 T0: Row 0 Request Buffer Row Buffer Row 0 Row 0 Column decoder Data

Already a Problem in Real Dual-Core Systems 2.82X slowdown 1.18X slowdown Results on Intel Pentium D running Windows XP (Similar results for Intel Core Duo and AMD Turion, and on Fedora)

More Severe Problem with 4 and 8 Cores • Simulated config: • 512KB private L2s • 3-wide processors • 128-entry instruction • window

Preventing Memory Performance Hogs • MPHs cannot be prevented in software • Fundamentally hard to distinguish between malicious and unintentional MPHs • Software has no direct control over DRAM controller’s scheduling • OS/VMM can adjust applications’ virtualphysical page mappings • No control over DRAM controller’s hardwired pagebank mappings • MPHs should be prevented in hardware • Hardware also cannot distinguish malicious vs unintentional • Goal: Contain and limit MPHs by providingfair memory scheduling

Fairness in Shared DRAM Systems • A thread’s DRAM performance dependent on its inherent • Row-buffer locality • Bank parallelism • Interference between threads can destroy either or both • Well-studied notions of fairness are inadequate • Fair scheduling cannot simply • Equalize memory bandwidth among threads • Unfairly penalizes threads with good bandwidth utilization • DRAM scheduling is not a pure bandwidth allocation problem • Unlike a network wire, DRAM is a stateful system • Equalize the latency or number of requests among threads • Unfairly penalizes threads with good row-buffer locality • Row-hit requests are less costly than row-conflict requests Bandwidth, Latency or Request Fairness Performance Fairness (slowdown)

Our Definition of DRAM Fairness • A DRAM system is fair if it slows down each thread equally • Compared to when the thread is run alone • Experienced Latency = EL(i) = Cumulative DRAM latency of thread i running with other threads • Ideal Latency= IL(i) = Cumulative DRAM latency of thread i if it were run alone on the same hardware resources • DRAM-Slowdown(i) = EL(i)/IL(i) • The goal of a fair scheduling policy is to equalize DRAM-Slowdown(i) for all Threads i • Considers inherent DRAM performance of each thread

Fair Memory Scheduling Algorithm (1) • During each time interval of cycles, for each thread, DRAM controller • Tracks EL (Experienced Latency) • Estimates IL (Ideal Latency) • At the beginning of a scheduling cycle, DRAM controller • Computes Slowdown = EL/IL for each thread • Computes unfairness index = MAX Slowdown / MIN Slowdown • If unfairness <  • Use baseline scheduling policy to maximize DRAM throughput • (1) row-hit first • (2) oldest-first

Fair Memory Scheduling Algorithm (2) • If unfairness ≥  • Use fairness-oriented scheduling policy • (1) requests from thread with MAX Slowdown first (if any) • (2) row-hit first • (3) oldest-first • Maximizes DRAM throughput if cannot improve fairness • : Maximum tolerable unfairness • : Lever between short-term versus long-term fairness • System software determines  and  • Conveyed to the hardware via privileged instructions

How Does FairMem Prevent an MPH? T0: Row 0 T1: Row 5 T0: Row 0 T0: Row 0 T1: Row 111 T0: Row 0 T0: Row 0 T0: Row 0 T0: Row 0 T1: Row 16 Row Buffer Row 16 Row 111 Row 0 Row 0 T0 Slowdown 1.04 1.07 1.04 1.00 1.10 1.03 T1 Slowdown 1.14 1.08 1.03 1.06 1.11 1.06 1.00 Unfairness 1.06 1.03 1.03 1.04 1.06 1.04 1.03 1.00 Data  1.05

Hardware Implementation • Tracking Experienced Latency • Relatively easy • Update a per-thread latency counter as long as there is an outstanding memory request for a thread • Estimating Ideal Latency • More involved because thread is not running alone • Need to estimate inherent row buffer locality • Keep track of the row that would have been in the row buffer if thread were running alone • For each memory request, update a per-thread latency counter based on “would-have-been” row buffer state • Efficient implementation possible – details in paper

Evaluation Methodology • Instruction-level performance simulation • Detailed x86 processor model based on Intel Pentium M • 4 GHz processor, 128-entry instruction window • 512 Kbyte per core private L2 caches • Detailed DRAM model • 8 banks, 2Kbyte row buffer • Row-hit latency: 50ns (200 cycles) • Row-conflict latency: 100ns (400 cycles) • Benchmarks • Stream (MPH) and rdarray • SPEC CPU 2000 and Olden benchmark suites

Dual-core Results MPH with non-intensive VPR Fairness improvement 7.9X to 1.11 System throughput improvement 4.5X MPH with intensive MCF Fairness improvement 4.8X to 1.08 System throughput improvement 1.9X

4-Core Results • Non-memory intensive applications suffer the most (VPR) • Low row-buffer locality hurts (MCF) • Fairness improvement: 6.6X • System throughput improvement: 2.2X

8-Core Results • Fairness improvement: 10.1X • System throughput improvement: 1.97X

Effect of Row Buffer Size • Vulnerability increases with row buffer size • Also with memory latency (in paper)

Conclusions • Introduced the notion of memory performance attacks in shared DRAM memory systems • Unfair DRAM scheduling is the cause of the vulnerability • More severe problem in future many-core systems • We provide a novel definition of DRAM fairness • Threads should experience equal slowdowns • New DRAM scheduling algorithm enforces this definition • Effectively prevents memory performance attacks • Preventing attacks also improves system throughput!

Questions?

Multi-Core Systems

Multi-Core Systems

Presentation Transcript

Multi-Core Systems

Multi-core architectures

Programming Multi-Core Systems

マルチコア /Multi-Core

Multi-core Programming

Multi-core Programming

Multi-core Systems and Coherence Hierarchies

Multi-core systems System Architecture COMP25212

Multi-core systems System Architecture COMP25212

Multi-core Programming

Multi-Core Computing

Multi-core programming frameworks for embedded systems

Multi-Core Development

Multi-Core Architectures

Multi-core systems System Architecture COMP25212

Multi-core CPU’s

FPGA Multi-core

Why multi-threading/multi-core?

Multi-core systems COMP25212 System Architecture

Multi-core systems

Multi-core programming frameworks for embedded systems