
ACCESS: Smart Scheduling for Asymmetric Cache CMPs



Presentation Transcript


  1. ACCESS: Smart Scheduling for Asymmetric Cache CMPs Xiaowei Jiang†, Asit Mishra‡, Li Zhao†, Ravi Iyer†, Zhen Fang†, Sadagopan Srinivasan†, Srihari Makineni†, Paul Brett†, Chita Das‡ Intel Labs (Oregon)† Penn State‡

  2. Agenda • Motivation • Related Work • ACCESS Architecture • ACS Scheduler • Evaluation Results • Conclusions

  3. Motivation [Figure: two dual-core CMPs, one where Core0 and Core1 unevenly partition a single shared cache (virtual asymmetry), the other where Core0 has a small cache and Core1 a large cache (physical asymmetry)] • Applications tend to have non-uniform cache capacity requirements • Symmetric caches are energy inefficient for such applications • Asymmetry can be virtual (uneven partitions of one shared cache) or physical (differently sized caches)

  4. Benefit of Physically Asymmetric Caches • Fits the asymmetry in the working set sizes (WSS) of apps • Apps with small working sets or streaming behavior run on the small cache • Apps with large working sets run on the large cache • Helps improve energy per instruction • The large cache can be power-gated when not in use • The smaller cache enables a lower operating voltage (e.g., 512KB at 0.8V vs. 4MB at 1.0V) • Fits the needs of heterogeneous-core architectures • Asymmetric cores naturally need asymmetric caches

  5. Challenges in Asymmetric Caches • What H/W support is needed? • H/W must expose certain cache stats to the OS • What OS scheduler changes are needed? • The scheduler must be aware of the underlying cache asymmetry • A new scheduling policy is needed to exploit cache asymmetry

  6. Contribution of ACCESS • ACCESS Architecture • Enables asymmetric caches • ACCESS Prediction Engine (APE) • Runtime online measurement of cache stats • Stats exposed to the OS • Asymmetric Cache Scheduler (ACS) • Finds the best-performing schedule with one-time training • Handles both private and shared caches • Estimates shared-cache contention effects with simple heuristics • O(1) complexity • Real-machine measurements show >20% performance improvement over the default Linux scheduler

  7. Agenda • Motivation • Related Work • ACCESS Architecture • ACS Scheduler • Evaluation Results • Conclusions

  8. Related Work • OS schedulers for heterogeneous-core architectures • Li et al. HPCA'10 • Kumar et al. MICRO'03, ISCA'04 • OS-scheduler or H/W approaches for mitigating cache contention effects • Chandra et al. HPCA'05 • Kim et al. PACT'05

  9. Agenda • Motivation • Related Work • ACCESS Architecture • ACS Scheduler • Evaluation Results • Conclusions

  10. ACCESS Architecture • Runs tasks with one-time training • The APE measures/predicts each task's cache stats on the big and small caches • The OS makes scheduling decisions based on the cache stats reported by the APE

  11. Access Prediction Engine [Figure: App 1 and App 2 share a 4MB, 16-way LLC (sets 0-4095); per-app shadow tag arrays model a 4MB and a 512KB cache, each keeping only a sampled subset of sets] • Provides cache stats for each app on each cache as if it were running alone • A shadow tag is a cache without a data array (tag array only) • Set sampling reduces shadow-tag size and the number of accesses • With multiple shadow tags, even though App 1 and App 2 share the cache, we can still measure the stats of each app running alone on the 4MB (L) and 512KB (S) caches (a software sketch follows)
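The transcript has no code, but the mechanism is simple enough to model in software. Below is a minimal Python sketch of a set-sampled, tag-only shadow structure; the class name, 64B line size, LRU policy, one-in-256 sampling ratio, and the small cache's 8-way geometry are illustrative assumptions, not the APE hardware design.

```python
class ShadowTag:
    """Tag array with no data array: counts hits/misses for a hypothetical
    cache geometry, tracking only a sampled subset of sets (set sampling)."""

    def __init__(self, total_sets, ways, sample_every=256, line_size=64):
        self.total_sets = total_sets
        self.ways = ways
        self.sample_every = sample_every      # model 1 of every N sets
        self.line_size = line_size
        self.sets = {}                        # sampled set index -> tags, MRU first
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        set_idx = (addr // self.line_size) % self.total_sets
        if set_idx % self.sample_every != 0:  # not a sampled set: ignore
            return
        tag = addr // (self.line_size * self.total_sets)
        tags = self.sets.setdefault(set_idx, [])
        if tag in tags:
            self.hits += 1
            tags.remove(tag)                  # will re-insert at MRU below
        else:
            self.misses += 1
            if len(tags) >= self.ways:
                tags.pop()                    # evict the LRU tag
        tags.insert(0, tag)                   # insert/promote to MRU

# One shadow tag per (app, cache size). Feeding each app's accesses into a
# 4MB/16-way and a 512KB/8-way shadow tag yields that app's "running alone"
# miss counts on each cache.
large = ShadowTag(total_sets=4096, ways=16)
small = ShadowTag(total_sets=1024, ways=8)
```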

  12. Agenda • Motivation • Related Work • ACCESS Architecture • ACS Scheduler • Evaluation Results • Conclusions

  13. Asymmetric Cache Scheduler • Goal of the scheduler: improve overall thread performance • Perform minimal training to keep training overhead low • Thread stats available to the scheduler • Instruction count, etc. • Cache misses of each thread running alone on each cache • In practice, we find that the schedule with the minimal overall MPI (misses per instruction) yields the best overall performance

  14. ACS Examples [Figure: a 2-core CMP with a private small cache and a private large cache (threads T1, T2), and a 4-core CMP where one core pair shares a small cache and the other pair shares a large cache (threads T1-T4)]
  • Private caches, e.g. the 2T case (see the sketch below)
  • calculate <MPI(T1,L), MPI(T1,S)> and <MPI(T2,L), MPI(T2,S)>
  • compute the MPIsum of all possible schedules: MPIsum1 = MPI(T1,L) + MPI(T2,S); MPIsum2 = MPI(T1,S) + MPI(T2,L)
  • pick min(MPIsum1, MPIsum2)
  • Shared caches, e.g. the 4T case
  • calculate <MPI(Ti,L), MPI(Ti,S)> for each thread
  • compute the MPIsum of all possible schedules: MPIsum = MPI(TiTj,L) + MPI(TxTy,S)
  • MPI(TiTj,L) and MPI(TxTy,S) are estimated (next slides)
  • pick the schedule with the minimum MPIsum
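As a concrete rendering of the two cases above, here is a short Python sketch. The names `mpi`, `best_2t_private`, `best_4t_shared`, and `estimate_shared` are hypothetical; the 4T variant simply enumerates which pair gets the large cache and leans on the contention estimate described on the next slides.

```python
from itertools import combinations

def best_2t_private(mpi, t1, t2):
    """2T private-cache case; mpi[t] = (MPI on L, MPI on S), measured alone."""
    mpisum1 = mpi[t1][0] + mpi[t2][1]        # t1 on L, t2 on S
    mpisum2 = mpi[t1][1] + mpi[t2][0]        # t1 on S, t2 on L
    return (t1, t2) if mpisum1 <= mpisum2 else (t2, t1)   # (on L, on S)

def best_4t_shared(mpi, threads, estimate_shared):
    """4T shared-cache case: pick the pair for the large cache.
    estimate_shared(ti, tj, cache) predicts the pair's combined MPI
    when they share that cache."""
    best = None
    for pair in combinations(threads, 2):
        rest = tuple(t for t in threads if t not in pair)
        mpisum = estimate_shared(*pair, 'L') + estimate_shared(*rest, 'S')
        if best is None or mpisum < best[0]:
            best = (mpisum, pair, rest)
    return best                              # (MPIsum, pair on L, pair on S)
```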

  15. Estimating Cache Contention Effect • Task: given MPI(Ti,L/S) and MPI(Tj,L/S), estimate MPI(TiTj,L/S) • Cache power law (Hartstein et al.):
  MR_new = MR_old * (C_new / C_old)^(-α)
  MPI_new = MPI_old * (C_new / C_old)^(-α)
  • We can compute α for each thread: α = -log_{C_L/C_S}(MPI(Ti,L) / MPI(Ti,S))
  • α measures how sensitive the app is to cache capacity (sketch below)
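The power law above translates directly into a few lines of Python; the sample numbers are T1's MPIs from slide 19, and the function names are illustrative.

```python
import math

def alpha(mpi_L, mpi_S, C_L=4096, C_S=512):
    """Sensitivity exponent: MPI_L = MPI_S * (C_L/C_S)^(-alpha), so
    alpha = -log base (C_L/C_S) of (MPI_L/MPI_S). Capacities in KB."""
    return -math.log(mpi_L / mpi_S, C_L / C_S)

def predict_mpi(mpi_old, C_old, C_new, a):
    """Power-law prediction: MPI_new = MPI_old * (C_new/C_old)^(-a)."""
    return mpi_old * (C_new / C_old) ** (-a)

a = alpha(0.40, 0.50)                    # ~0.107: mildly cache sensitive
print(predict_mpi(0.50, 512, 2048, a))   # estimated MPI at 2MB of capacity
```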

  16. Estimating Cache Contention Effect (cont.) • Estimating the cache occupancy of Ti when Ti and Tj share a cache (a sketch of one plausible heuristic follows)
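The slide's occupancy equations did not survive in this transcript, so the following is only a plausible reconstruction consistent with the power-law model: assume each thread occupies the shared cache roughly in proportion to its miss (fill) rate, then feed that occupancy into the power law as the thread's effective capacity. The function name and the proportional-occupancy heuristic are assumptions, not necessarily the paper's exact formulation.

```python
def estimate_shared_mpi(mpi_i, mpi_j, alpha_i, alpha_j, C):
    """Hypothetical contention estimate for threads Ti, Tj sharing a cache
    of capacity C; mpi_i/mpi_j are the running-alone MPIs on that cache."""
    occ_i = C * mpi_i / (mpi_i + mpi_j)   # Ti's share, proportional to fill rate
    occ_j = C - occ_i
    # Power law with occupancy as effective capacity: less space -> higher MPI
    mpi_i_shared = mpi_i * (occ_i / C) ** (-alpha_i)
    mpi_j_shared = mpi_j * (occ_j / C) ** (-alpha_j)
    return mpi_i_shared + mpi_j_shared    # combined MPI of the pair
```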

  17. Scheduler Compute Overhead • Computing and sorting all possible schedules has O(n²) complexity • To arrive at the best schedule, the number of thread migrations might be unbounded

  18. O(1) ACS • Goals of O(1) ACS • O(1) complexity • A limited number of thread migrations to arrive at the best schedule • O(1) ACS algorithm: on each thread (Ti) arrival, compare the MPIsum of 6 cases
  • 1. Ti on L
  • 2. Ti on L, migration candidate on L -> S
  • 3. Ti on L, migration candidate on L <-> migration candidate on S
  • 4. Ti on S
  • 5. Ti on S, migration candidate on S -> L
  • 6. Ti on S, migration candidate on S <-> migration candidate on L
  • Pick the best schedule among 1-6 • Update the migration candidates based on the 2nd-best schedule (a sketch follows the worked example on the next slide)

  19. O(1) ACS Example
  Per-thread MPIs (running alone):
  Thread | MPI on L | MPI on S
  T1     | 0.40     | 0.50
  T2     | 0.45     | 0.90
  T3     | 0.60     | 0.75
  State at t0 (T1 and T2 running):
  Threads on L | Threads on S | Candidate on L | Candidate on S | MPI on L | MPI on S | MPIsum
  T2           | T1           | T2             | T1             | 0.45     | 0.50     | 0.95
  ACS computation at t1 (T3 arrives):
  Case | Schedule               | MPI_L | MPI_S | MPIsum
  1    | T3 on L                | 1.05  | 0.50  | 1.55
  2    | T3 on L, T2->S         | 0.60  | 1.40  | 2.00
  3    | T3 on L, T2->S, T1->L  | 1.00  | 0.90  | 1.90
  4    | T3 on S                | 0.45  | 1.25  | 1.70
  5    | T3 on S, T1->L         | 0.85  | 0.75  | 1.60
  6    | T3 on S, T1->L, T2->S  | 0.40  | 1.65  | 2.05
  Case 1 wins (MPIsum 1.55); case 5 (1.60) is second best.
  State after t1:
  Threads on L | Threads on S | Candidate on L | Candidate on S | MPI on L | MPI on S | MPIsum
  T2, T3       | T1           | T3             | T1             | 1.05     | 0.50     | 1.55
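Putting slides 18 and 19 together, here is a Python sketch of the six-case arrival check; the state layout and function name are assumptions, and shared-cache contention is ignored (per-thread MPIs are simply summed, as in the slide 19 arithmetic). The usage example reproduces the numbers above.

```python
def o1_acs_arrival(mpi, new, state):
    """Evaluate the 6 candidate schedules for an arriving thread `new`.
    state: threads currently on L/S plus one migration candidate per cache;
    mpi[t] = (MPI on L, MPI on S)."""
    L, S = state['L'], state['S']
    cL, cS = state['cand_L'], state['cand_S']

    def mpisum(on_L, on_S):
        return sum(mpi[t][0] for t in on_L) + sum(mpi[t][1] for t in on_S)

    cases = [
        ('1: new on L',          L | {new},              S),
        ('2: new on L, cL->S',   (L - {cL}) | {new},     S | {cL}),
        ('3: new on L, cL<->cS', (L - {cL}) | {new, cS}, (S - {cS}) | {cL}),
        ('4: new on S',          L,                      S | {new}),
        ('5: new on S, cS->L',   L | {cS},               (S - {cS}) | {new}),
        ('6: new on S, cS<->cL', (L - {cL}) | {cS},      (S - {cS}) | {new, cL}),
    ]
    ranked = sorted(cases, key=lambda c: mpisum(c[1], c[2]))
    # apply ranked[0]; per slide 18, the migration candidates are then
    # refreshed from the 2nd-best schedule so the next arrival is also O(1)
    return ranked[0], ranked[1]

mpi = {'T1': (0.40, 0.50), 'T2': (0.45, 0.90), 'T3': (0.60, 0.75)}
state = {'L': {'T2'}, 'S': {'T1'}, 'cand_L': 'T2', 'cand_S': 'T1'}
best, second = o1_acs_arrival(mpi, 'T3', state)
print(best[0], second[0])   # case 1 (MPIsum 1.55) wins; case 5 (1.60) is 2nd
```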

  20. O(1) ACS Efficacy • Constant computation overhead • Always achieves at least 97% of the best schedule's performance

  21. Agenda • Motivation • Related Work • ACCESS Architecture • ACS Scheduler • Evaluation Results • Conclusions

  22. Evaluation Setup • Real-machine measurement on a Xeon 5160 • 4 cores at 3GHz • 32KB split L1 caches • Each pair of cores shares an L2: one 4MB and one 512KB • ACS scheduler • Implemented in Linux 2.6.32 • Enables fast thread migration • Since no APE h/w is available, MPIs are profiled offline with a 2% error applied (to account for the effects of set sampling) • Benchmarks • 17 C/C++ SPEC2006 benchmarks • 2T and 4T workloads covering both cache-sensitive (S) and insensitive (I) benchmarks • Run until the first thread exits

  23. Evaluation Results of ACS (2T) • Performance improvement in all 70 cases • 20% average speedup • Demonstrates the efficacy of ACS

  24. Evaluation Results of ACS (4T) • Performance improvement in all 30 cases • 31% average speedup • Demonstrates the efficacy of ACS and of the cache contention estimation heuristic

  25. Conclusions • We have proposed the ACCESS architecture • Enables physically asymmetric caches • ACCESS Prediction Engine • Uses shadow tags to conduct online cache simulation • We have also proposed the ACS scheduler • One-time training, using the MPIsum metric to derive the best-performing schedule • A practical approach to estimate shared-cache contention effects • O(1) ACS scheduler • Minimizes scheduler computation overhead • Limits thread migrations • Real-platform measurements show >20% speedup over the Linux scheduler

  26. Thanks!
