Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems

Jiang Lin¹, Qingda Lu², Xiaoning Ding², Zhao Zhang¹, Xiaodong Zhang², and P. Sadayappan²
¹ Department of ECE, Iowa State University
² Department of CSE, The Ohio State University

Shared Caches Can be a Critical Bottleneck in Multi-Core Processors
• L2/L3 caches are shared by multiple cores
  • Intel Xeon 51xx (2 cores/L2)
  • AMD Barcelona (4 cores/L3)
  • Sun T2, ... (8 cores/L2)
• Effective cache partitioning is critical to address the bottleneck caused by conflicting accesses in shared caches.
• Several hardware cache partitioning methods have been proposed with different optimization objectives
  • Performance: [HPCA’02], [HPCA’04], [Micro’06]
  • Fairness: [PACT’04], [ICS’07], [SIGMETRICS’07]
  • QoS: [ICS’04], [ISCA’07]
[Figure: several cores connected to a shared L2/L3 cache]

Limitations of Simulation-Based Studies
• Excessive simulation time
  • Whole programs cannot be evaluated: it would take several weeks or months to complete a single SPEC CPU2006 benchmark
  • As the number of cores continues to increase, simulation ability becomes even more limited
• Absence of long-term OS activities
  • Interactions between the processor and the OS affect performance significantly
• Proneness to simulation inaccuracy
  • Bugs in the simulator
  • Impossible to model many dynamics and details of the system

Our Approach to Address the Issues
• Design and implement OS-based cache partitioning
  • Embed the cache partitioning mechanism in the OS
    • By enhancing the page coloring technique
    • To support both static and dynamic cache partitioning
• Evaluate cache partitioning policies on commodity processors
  • Execution- and measurement-based
    • Run applications to completion
    • Measure performance with hardware counters

Four Questions to Answer
• Can we confirm the conclusions made by the simulation-based studies?
• Can we provide new insights and findings that simulation cannot?
• Can we make a case for our OS-based approach as an effective option to evaluate multicore cache partitioning designs?
• What are the advantages and disadvantages of OS-based cache partitioning?

Outline
• Introduction
• Design and implementation of OS-based cache partitioning mechanisms
• Evaluation environment and workload construction
• Cache partitioning policies and their results
• Conclusion

OS-Based Cache Partitioning Mechanisms
• Static cache partitioning
  • Predetermines the number of cache blocks allocated to each program at the beginning of its execution
  • Page coloring enhancement
    • Divides the shared cache into multiple regions and partitions those regions among programs through OS page address mapping
• Dynamic cache partitioning
  • Adjusts cache quota among processes dynamically
  • Page re-coloring
    • Dynamically changes processes’ cache usage through OS page address re-mapping

Page Coloring
• Physically indexed caches are divided into multiple regions (colors).
• All cache lines in a physical page are cached in one of those regions (colors).
• OS can control the page color of a virtual page through address mapping (by selecting a physical page with a specific value in its page color bits).
[Figure: a virtual address (virtual page number, page offset) is translated, under OS control, into a physical address (physical page number, page offset). The physically indexed cache reads the same address as cache tag, set index, and block offset. The page color bits are the bits that belong to both the physical page number and the set index.]
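The color-bit arithmetic described on this slide can be sketched in a few lines of Python. The cache size and associativity are taken from the experimental platform described later in the deck; the 4 KiB page size and the choice of the low-order frame-number bits as the color are assumptions made only for illustration, and the function names are hypothetical.

```python
# A minimal sketch of page-color arithmetic, not the authors' kernel code.
PAGE_SIZE = 4096                  # assumption: 4 KiB pages
CACHE_SIZE = 4 * 1024 * 1024      # 4MB shared L2 (from the Experimental Environment slide)
ASSOCIATIVITY = 16                # 16-way (from the Experimental Environment slide)


def max_page_colors(cache_size=CACHE_SIZE, ways=ASSOCIATIVITY, page_size=PAGE_SIZE):
    """Upper bound on the number of page colors a physically indexed cache offers.

    One way of the cache covers cache_size / ways bytes of consecutive
    physical addresses; each page-sized slice of that span is one color.
    """
    return cache_size // (ways * page_size)


def page_color(phys_addr, num_colors, page_size=PAGE_SIZE):
    """Color of the physical page holding phys_addr.

    Assumption: the color is the low-order bits of the page frame number.
    """
    return (phys_addr // page_size) % num_colors
```

Under these assumptions the cache offers up to 64 colors, so a configuration with 16 colors would use only some of the available color bits.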

Enhancement for Static Cache Partitioning
• Physical pages are grouped into page bins according to their page color
• Shared cache is partitioned between two processes through address mapping.
• Cost: Main memory space needs to be partitioned too (co-partitioning).
[Figure: page bins 1, 2, 3, 4, …, i, i+1, i+2, … are mapped by OS address mapping onto regions of the physically indexed cache, with separate sets of bins assigned to Process 1 and Process 2.]
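A toy model may help make the page-bin idea and the co-partitioning cost concrete. The sketch below is an illustration under stated assumptions, not the kernel implementation: `ColoredAllocator`, its method names, and the round-robin choice among a process's colors are all hypothetical.

```python
from collections import defaultdict


class ColoredAllocator:
    """Toy physical-page allocator with one free list (bin) per page color."""

    def __init__(self, num_frames, num_colors):
        self.num_colors = num_colors
        self.bins = defaultdict(list)
        # Assumption: a frame's color is its frame number modulo num_colors.
        for pfn in range(num_frames):
            self.bins[pfn % num_colors].append(pfn)
        self.allowed = {}   # pid -> list of colors the process may use
        self._next = {}     # pid -> round-robin cursor over its colors

    def set_colors(self, pid, colors):
        """Restrict a process to the given colors (its share of the cache)."""
        self.allowed[pid] = list(colors)
        self._next[pid] = 0

    def alloc_page(self, pid):
        """Return a free frame from one of the process's colors."""
        colors = self.allowed[pid]
        for _ in range(len(colors)):
            color = colors[self._next[pid] % len(colors)]
            self._next[pid] += 1
            if self.bins[color]:
                return self.bins[color].pop()
        # Co-partitioning: frames of other colors may be free but are off limits.
        raise MemoryError("no free frames left in the colors assigned to this process")
```

With 64 frames and 16 colors, a process limited to 4 colors can obtain at most 16 frames even while the other 48 are free, which is the memory-side cost the slide calls co-partitioning.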

Dynamic Cache Partitioning
• Why?
  • Programs have dynamic behaviors
  • Most proposed schemes are dynamic
• How?
  • Page re-coloring
• How to handle overhead?
  • Measure overhead with performance counters
  • Remove overhead from the results (emulating hardware schemes)

Dynamic Cache Partitioning through Page Re-Coloring
• Page re-coloring:
  • Allocate a page in the new color
  • Copy memory contents
  • Free the old page
• Pages of a process are organized into linked lists by their colors.
• Memory allocation guarantees that pages are evenly distributed across all the lists (colors) to avoid hot spots.
[Figure: per-process page links table indexed by color 0, 1, 2, 3, …, N - 1, with the allocated colors marked]

Control the Page Migration Overhead
• Control the frequency of page migration
  • Frequent enough to capture application phase changes
  • Not so frequent that it introduces large page migration overhead
• Lazy migration: avoid unnecessary page migration
  • Observation: Not all pages are accessed between two consecutive migrations.
  • Optimization: do not migrate a page until it is accessed

Lazy Page Migration
• After the optimization
  • On average, 2% page migration overhead
  • Up to 7%.
[Figure: process page links indexed by color 0, 1, 2, 3, …, N - 1, with the allocated colors marked; pages outside the allocated colors that are never accessed are annotated "Avoid unnecessary page migration for these pages!"]
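The lazy-migration idea (re-color a page only when it is next touched) can be modeled with a small sketch. Everything here is a hypothetical illustration: the class and method names are invented, the first access after a repartition is modeled as an explicit `access()` call, and the allocate/copy/free sequence is reduced to a color reassignment.

```python
class LazyRecolorProcess:
    """Toy model of lazy page re-coloring for one process."""

    def __init__(self, num_colors):
        self.num_colors = num_colors
        self.pages = {}                        # virtual page -> current color
        self.allowed = set(range(num_colors))  # colors currently assigned
        self.pending = set()                   # pages in a color no longer allowed
        self.migrations = 0
        self._rr = 0

    def _pick_color(self):
        # Spread pages evenly over the allowed colors (round-robin assumption).
        colors = sorted(self.allowed)
        color = colors[self._rr % len(colors)]
        self._rr += 1
        return color

    def map_page(self, vpage):
        self.pages[vpage] = self._pick_color()

    def repartition(self, allowed):
        """Change the color quota; mark misplaced pages instead of moving them now."""
        self.allowed = set(allowed)
        self.pending = {vp for vp, c in self.pages.items() if c not in self.allowed}

    def access(self, vpage):
        """Migrate a pending page only when it is actually accessed."""
        if vpage in self.pending:
            # Stands in for: allocate page in new color, copy contents, free old page.
            self.pages[vpage] = self._pick_color()
            self.pending.discard(vpage)
            self.migrations += 1
        return self.pages[vpage]
```

Pages that are never accessed between two repartitions stay in `pending` and never incur a copy, which is the saving the slide describes.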


Experimental Environment
• Dell PowerEdge 1950
  • Two-way SMP, Intel dual-core Xeon 5160
  • Shared 4MB L2 cache, 16-way
  • 8GB Fully Buffered DIMM
• Red Hat Enterprise Linux 4.0
  • 2.6.20.3 kernel
  • Performance counter tools from HP (Pfmon)
  • Divide L2 cache into 16 colors

Benchmark Classification
• 29 benchmarks from SPEC CPU2006 (group sizes shown in the figure: 6, 9, 6, 8)
• Is it sensitive to L2 cache capacity?
  • Red group: IPC(1MB L2 cache) / IPC(4MB L2 cache) < 80%
    • Give red benchmarks more cache: big performance gain
  • Yellow group: 80% < IPC(1MB L2 cache) / IPC(4MB L2 cache) < 95%
    • Give yellow benchmarks more cache: moderate performance gain
• Else: Does it extensively access L2 cache?
  • Green group: ≥ 14 accesses / 1K cycles
    • Give it small cache
  • Black group: < 14 accesses / 1K cycles
    • Cache insensitive
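The classification rule above is a small decision procedure, sketched here in Python. The function name and argument names are hypothetical, and the slide does not say which group a ratio of exactly 80% or 95% falls into; this sketch assigns 80% to yellow and 95% to the access-rate test as an assumption.

```python
def classify(ipc_1mb, ipc_4mb, l2_accesses_per_kcycle):
    """Assign a benchmark to the red, yellow, green, or black group.

    ipc_1mb / ipc_4mb: measured IPC with a 1MB and a 4MB L2 cache.
    l2_accesses_per_kcycle: L2 accesses per 1K cycles.
    """
    ratio = ipc_1mb / ipc_4mb
    if ratio < 0.80:
        return "red"       # highly sensitive to L2 capacity
    if ratio < 0.95:
        return "yellow"    # moderately sensitive (boundary handling is an assumption)
    if l2_accesses_per_kcycle >= 14:
        return "green"     # insensitive to capacity but accesses L2 heavily
    return "black"         # cache insensitive
```

For example, a benchmark that keeps 90% of its IPC when the cache shrinks from 4MB to 1MB lands in the yellow group regardless of its access rate.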

Workload Construction (2-core)
          R (6)           Y (9)           G (6)
R (6)     RR (3 pairs)
Y (9)     RY (6 pairs)    YY (3 pairs)
G (6)     RG (6 pairs)    YG (6 pairs)    GG (3 pairs)
• 27 workloads: representative benchmark combinations

Outline
• Introduction
• OS-based cache partitioning mechanism
• Evaluation environment and workload construction
• Cache partitioning policies and their results
  • Performance
  • Fairness
• Conclusion

Performance – Metrics
• Divide metrics into evaluation metrics and policy metrics [PACT’06]
• Evaluation metrics:
  • Optimization objectives, not always available during run-time
• Policy metrics:
  • Used to drive dynamic partitioning policies: available during run-time
  • Sum of IPC, Combined cache miss rate, Combined cache misses

Static Partitioning
• Total number of cache colors: 16
• Give at least two colors to each program
  • Make sure that each program gets 1GB of memory to avoid swapping (because of co-partitioning)
• Try all possible partitionings for all workloads
  • (2:14), (3:13), (4:12), ……, (8:8), ……, (13:3), (14:2)
• Get the value of the evaluation metrics
• Compare the performance of all partitionings with the performance of the shared cache
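The exhaustive search over static partitionings is small enough to write out. The sketch below enumerates the splits listed on the slide and picks the best one under a caller-supplied measurement; `measure` is a hypothetical callback standing in for a full run of the workload, and higher values are assumed to be better.

```python
def static_partitionings(total_colors=16, min_colors=2):
    """All two-program splits (a:b) with a + b == total_colors and each share >= min_colors."""
    return [(a, total_colors - a)
            for a in range(min_colors, total_colors - min_colors + 1)]


def best_partitioning(measure, total_colors=16, min_colors=2):
    """Return the split with the highest evaluation-metric value.

    measure(split) is assumed to run the workload under that split and
    return an evaluation metric where larger is better.
    """
    return max(static_partitionings(total_colors, min_colors), key=measure)
```

With 16 colors and a minimum of two per program this gives 13 candidate splits, from (2:14) to (14:2), each of which requires one complete run of the workload.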

Performance – Optimal Static Partitioning
• Confirm that cache partitioning has significant performance impact
• Different evaluation metrics have different performance gains
• RG-type workloads have the largest performance gains (up to 47%)
• Other types of workloads also have performance gains (2% to 10%)

A New Finding
• Workload RG1: 401.bzip2 (Red) + 410.bwaves (Green)
• Intuitively, giving more cache space to 401.bzip2 (Red)
  • Increases the performance of 401.bzip2 (Red) substantially
  • Decreases the performance of 410.bwaves (Green) slightly
• However, we observe that
[Figure: measured results for RG1, not recoverable from the extracted text]


Insight into Our Finding
• We have the same observation in RG4, RG5 and YG5
• This is not observed by simulation
  • Did not model main memory sub-system in detail
  • Assumed fixed memory access latency
• Shows the advantages of our execution- and measurement-based study

Performance - Dynamic Partition Policy
• A simple greedy policy.
• Emulate policy of [HPCA’02]
Flowchart:
1. Init: partition the cache as (8:8)
2. Finished? If yes, exit
3. If no, run the current partition (P0:P1) for one epoch
4. Try one epoch for each of the two neighboring partitions: (P0 - 1 : P1 + 1) and (P0 + 1 : P1 - 1)
5. Choose the next partitioning with the best policy metrics measurement, then return to step 2
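The flowchart can be sketched as a short loop. This is an illustration under several assumptions rather than the authors' implementation: `run_epoch` is a hypothetical callback that runs one epoch under a given split and returns a policy metric where lower is better (for example a combined miss rate), the "finished" test is modeled as a fixed number of rounds, and the two-color minimum from the static experiments is assumed to apply here as well.

```python
def greedy_dynamic_partition(run_epoch, num_rounds, total_colors=16, min_colors=2):
    """Greedy neighbor search over cache splits, starting from an even split.

    run_epoch(p0, p1) -> policy metric for one epoch (assumed: lower is better).
    Returns the split chosen at the end of each round.
    """
    p0 = total_colors // 2                      # Init: (8:8) for 16 colors
    history = []
    for _ in range(num_rounds):
        # Current partition first, so ties keep the current split.
        candidates = [p0]
        if p0 - 1 >= min_colors:
            candidates.append(p0 - 1)           # neighbor (P0 - 1 : P1 + 1)
        if total_colors - (p0 + 1) >= min_colors:
            candidates.append(p0 + 1)           # neighbor (P0 + 1 : P1 - 1)
        # One epoch per candidate: the current split and its two neighbors.
        scores = {c: run_epoch(c, total_colors - c) for c in candidates}
        p0 = min(candidates, key=scores.get)
        history.append((p0, total_colors - p0))
    return history
```

Because each round moves by at most one color, the policy needs several epochs to reach a split far from (8:8), which is one reason the epoch length matters for capturing phase changes.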

Performance – Static & Dynamic
• Use combined miss rates as policy metrics
• For RG-type, and some RY-type:
  • Static partitioning outperforms dynamic partitioning
• For RR- and RY-type, and some RY-type:
  • Dynamic partitioning outperforms static partitioning

Fairness – Metrics and Policy [PACT’04]
• Metrics
  • Evaluation metrics FM0
    • Difference in slowdown, smaller is better
  • Policy metrics
• Policy
  • Repartitioning and rollback
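The slide describes FM0 only as a difference in slowdown. A minimal sketch of one way to compute such a metric for a two-program workload follows; defining slowdown as co-scheduled execution time divided by stand-alone execution time is an assumption, and the function names are hypothetical.

```python
def slowdown(time_shared, time_alone):
    """Slowdown of one program when co-scheduled (assumed baseline: running alone)."""
    return time_shared / time_alone


def fm0(time_shared_a, time_alone_a, time_shared_b, time_alone_b):
    """Absolute difference between the two programs' slowdowns; smaller is fairer."""
    return abs(slowdown(time_shared_a, time_alone_a)
               - slowdown(time_shared_b, time_alone_b))
```

A value of zero means both programs are slowed by the same factor. Computing it requires stand-alone run times, which is why a metric of this kind is not directly available to a run-time policy.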

Fairness - Result
• Dynamic partitioning can achieve better fairness
  • If we use FM0 as both evaluation metrics and policy metrics
• None of the policy metrics (FM1 to FM5) is good enough to drive the partitioning policy to fairness comparable with static partitioning
  • Strong correlation was reported in a simulation-based study [PACT’04]
  • None of the policy metrics has a consistently strong correlation with FM0
    • SPEC CPU2006 (ref input) vs. SPEC CPU2000 (test input)
    • Complete trillions of instructions vs. less than one billion instructions
    • 4MB L2 cache vs. 512KB L2 cache

Conclusion
• Confirmed some conclusions made by simulations
• Provided new insights and findings
  • Moving cache space from one program to another can increase the performance of both
  • Poor correlation between evaluation and policy metrics for fairness
• Made a case for our OS-based approach as an effective option for evaluation of multicore cache partitioning
• Advantages of OS-based cache partitioning
  • Works on commodity processors, enabling an execution- and measurement-based study
• Disadvantages of OS-based cache partitioning
  • Co-partitioning (may underutilize memory), migration overhead

Ongoing Work
• Reduce migration overhead on commodity processors
• Cache partitioning at the compiler level
  • Partition cache at object level
• Hybrid cache partitioning method
  • Remove the cost of co-partitioning
  • Avoid page migration overhead

Thanks!

Backup Slides

Fairness - Correlation between Evaluation Metrics and Policy Metrics (Reported by [PACT’04])
• Strong correlation was reported in the simulation study [PACT’04]

Fairness - Correlation between Evaluation Metrics and Policy Metrics (Our result)
• None of the policy metrics has a consistently strong correlation with FM0
  • SPEC CPU2006 (ref input) vs. SPEC CPU2000 (test input)
  • Complete trillions of instructions vs. less than one billion instructions
  • 4MB L2 cache vs. 512KB L2 cache
