This presentation covers the management of distributed, shared L2 caches through OS-level page allocation in multicore processors. It reviews private and shared cache designs and their trade-offs, describes an OS-controlled page allocation scheme for improving cache utilization, and details the page spreading, page spilling, and home allocation policies, along with an evaluation showing efficient cache management without additional hardware support.
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation Jason Bosko March 5th, 2008 Based on “Managing Distributed, Shared L2 Caches through OS-Level Page Allocation” by Sangyeun Cho and Lei Jin, appearing in IEEE/ACM Int'l Symposium on Microarchitecture (MICRO), December 2006.
Outline • Background and Motivation • Page Allocation • Specifics of Page Allocation • Evaluation of Page Allocation • Conclusion
Motivation • With multicore processors, on-chip memory design and management become crucial • Increasing L2 cache sizes result in non-uniform cache access latencies, which complicate the management of these caches
Private Caches • A cache slice is associated with a specific processor core • Data must be replicated across processors as it is accessed • Advantages? • Data is always close to the processor, reducing hit latency • Disadvantages? • Limits overall cache space, resulting in more capacity misses [Figure: blocks in memory]
Shared Caches S = A mod N • Each memory block A uniquely maps to one (and only one) of the N cache slices, which all processors access • Advantages? • Increases effective L2 cache size • Easier to implement coherence protocols (data exists in only one place) • Disadvantages? • Requested data is not always close, so hit latency increases • Increased network traffic due to movement of data that is far from the requesting processor [Figure: blocks in memory]
Page Allocation S = PPN mod N • Add another level of indirection – pages! • Built on top of a shared cache architecture • Use the physical page number (PPN) to map physical pages to cache slices • The OS controls the mapping of virtual pages to physical pages – if the OS knows which slice a physical page maps to, it can assign virtual pages to whichever cache slice it desires! [Figure: pages in memory, pages in VM]
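The mapping itself is trivial. Below is a minimal sketch (constant and helper names are illustrative, not from the paper) of how a slice is derived from a physical page number; it is the same mod-N idea the pure shared design applies to individual blocks.

```c
#include <stdio.h>

#define N_SLICES 16   /* 4x4 tiled CMP, as in the paper's evaluation */

/* Cache slice that homes a given physical page: S = PPN mod N. */
static unsigned slice_of(unsigned long ppn) {
    return (unsigned)(ppn % N_SLICES);
}

int main(void) {
    unsigned long ppn = 0x12345;   /* arbitrary example page number */
    printf("PPN 0x%lx -> cache slice %u\n", ppn, slice_of(ppn));
    return 0;
}
```

Because the OS picks which free physical page backs each virtual page, choosing a page whose PPN has the right remainder is all it takes to steer data toward a particular slice.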
How does Page Allocation work? • A Congruence Group (CGi) is the partition of physical pages that map to processor core i's cache slice • Each congruence group maintains a “free list” of available pages • To implement private caching, when a page is requested by processor i, allocate a free page from CGi • To implement shared caching, when any page is requested, allocate a page from any CG • To implement hybrid caching, split the CGs into K groups, keeping track of which CG belongs to which group – when a page is requested, allocate a page from any CG in the correct group All of this is controlled by the OS without any additional hardware support!
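A sketch of what the OS-side bookkeeping might look like, assuming one free list per congruence group; all names, types, and the round-robin/grouping choices are hypothetical, but the three policies differ only in which CG the allocator draws from.

```c
#include <stddef.h>

#define N_SLICES 16

/* One free list of physical page numbers per congruence group. */
struct free_list { unsigned long *ppns; size_t count; };
static struct free_list cg[N_SLICES];

/* Take a free page from congruence group g; -1 means the group is empty. */
static long pop_page(unsigned g) {
    if (cg[g].count == 0) return -1;
    return (long)cg[g].ppns[--cg[g].count];
}

/* Private-style: always allocate from the requesting core's own CG. */
long alloc_private(unsigned core) { return pop_page(core % N_SLICES); }

/* Shared-style: allocate from any CG (simple round-robin here). */
long alloc_shared(void) {
    static unsigned next = 0;
    for (unsigned i = 0; i < N_SLICES; i++) {
        unsigned g = (next + i) % N_SLICES;
        long p = pop_page(g);
        if (p >= 0) { next = (g + 1) % N_SLICES; return p; }
    }
    return -1;
}

/* Hybrid: split the N CGs into k groups (k assumed to divide N) and
 * allocate from any CG in the requesting core's group. */
long alloc_hybrid(unsigned core, unsigned k) {
    unsigned size = N_SLICES / k;                  /* CGs per group */
    unsigned base = ((core % N_SLICES) / size) * size;
    for (unsigned i = 0; i < size; i++) {
        long p = pop_page(base + i);
        if (p >= 0) return p;
    }
    return -1;
}
```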
Page Spreading & Page Spilling • If the OS always allocates pages from the CG corresponding to the requesting processor, then it acts like a private cache. • The OS can choose to direct allocations to cache slices in other cores in order to increase the effective cache size. This is page spreading. • When the available pages in a CG drop below some threshold, the OS may be forced to allocate pages from another group. This is page spilling. • Each tile is on a specific tier that corresponds to how close it is to the target tile. [Figure: tier-1 tiles]
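A rough sketch of the spreading/spilling decision; the thresholds and the neighbour selection are invented for illustration, and a real policy would consult the per-tile pressure measurement described on the next slide.

```c
#define N_SLICES        16
#define SPILL_WATERMARK 64     /* free pages left before spilling (made up) */
#define PRESSURE_LIMIT  1024   /* unique pages per epoch (made up) */

/* Per-CG free-page counts and per-tile cache pressure, updated elsewhere. */
static unsigned long free_pages[N_SLICES];
static unsigned long pressure[N_SLICES];

/* Stand-in for a real tier table on the 4x4 mesh; a full implementation
 * would enumerate tier-1 (closest) tiles, then tier-2, and so on. */
static unsigned nearby_tile(unsigned tile) { return (tile + 1) % N_SLICES; }

/* Decide which CG a new page requested by `tile` should come from. */
unsigned choose_cg(unsigned tile) {
    if (free_pages[tile] < SPILL_WATERMARK)    /* page spilling */
        return nearby_tile(tile);
    if (pressure[tile] > PRESSURE_LIMIT)       /* page spreading */
        return nearby_tile(tile);
    return tile;                               /* private-like default */
}
```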
Cache Pressure • Add hardware support for counting “unique” page accesses in a cache • But weren't we supposed to avoid extra hardware support? This support is optional, and it doesn't hurt! • When cache pressure is measured to be high, pages are allocated to other tiles on the same tier, or to tiles on the next tier
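A small software model of such a pressure counter; the bitmap size, the aliasing it introduces, and the epoch handling are illustrative rather than taken from the paper, which proposes a hardware counter of unique page accesses.

```c
#include <stdint.h>
#include <string.h>

#define TRACKED_PAGES 4096          /* pages tracked per epoch (made up) */

static uint8_t  seen[TRACKED_PAGES / 8];   /* 1 bit per tracked page */
static unsigned unique_pages;              /* the "pressure" reading */

/* Called on each L2 access; counts the first touch of each page. */
void record_access(unsigned long ppn) {
    unsigned idx = (unsigned)(ppn % TRACKED_PAGES);   /* may alias */
    if (!(seen[idx / 8] & (1u << (idx % 8)))) {
        seen[idx / 8] |= (uint8_t)(1u << (idx % 8));
        unique_pages++;
    }
}

/* Reset at epoch boundaries so pressure tracks recent behaviour only. */
void new_epoch(void) {
    memset(seen, 0, sizeof seen);
    unique_pages = 0;
}
```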
Home allocation policy • Profitability of choosing a home cache slice depends on different factors: • Recent miss rates of L2 caches • Recent network contention levels • Current page allocation • QoS requirements • Processor configuration (# of processors, etc.) • The OS can easily find the cache slice with the highest profitability
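One way the OS might turn those factors into a score; the struct fields and weights below are invented for illustration, since the slide only identifies the factors themselves.

```c
#define N_SLICES 16

/* Per-slice observations the OS could consult (fields are illustrative). */
struct slice_stats {
    double   l2_miss_rate;     /* recent miss rate of this slice */
    double   link_contention;  /* recent network contention near this slice */
    unsigned hop_distance;     /* hops from the requesting core */
    unsigned pages_mapped;     /* pages already homed here (pressure) */
};

/* Higher is better; the weights are invented for illustration. */
static double profitability(const struct slice_stats *s) {
    return -(2.0   * s->l2_miss_rate)
           -(1.0   * s->link_contention)
           -(0.5   * (double)s->hop_distance)
           -(0.001 * (double)s->pages_mapped);
}

/* Pick the home slice with the highest profitability. */
unsigned pick_home(const struct slice_stats stats[N_SLICES]) {
    unsigned best = 0;
    for (unsigned i = 1; i < N_SLICES; i++)
        if (profitability(&stats[i]) > profitability(&stats[best]))
            best = i;
    return best;
}
```

The specific weights would have to be tuned; the point is only that the OS already has, or can cheaply collect, all of these inputs when it allocates a page.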
Virtual Multicore (VM) • For parallel applications, the OS should try to coordinate page allocation to minimize latency and traffic – schedule a parallel application onto a set of cores in close proximity • When cache pressure increases, pages can still be allocated outside of the VM
Hardware Support • The best feature of OS-level page allocation is that it can be built on a simple shared cache organization with no hardware support • But additional hardware support can still be leveraged! • Data replication • Data migration • Bloom filter
Evaluation • Used the SimpleScalar tool set to model a 4x4 mesh multicore processor chip • Demand paging – every memory access is checked against allocated pages; when a memory access is the first access to an unallocated page, a physical page is allocated based on the desired policy • No page spilling was observed in any experiment • Used single-threaded, multiprogrammed, and parallel workloads • Single-threaded = a variety of SPEC2k benchmarks, both integer and floating-point programs • Multiprogrammed = one core (core 5 in the experiments) runs a target benchmark, while the other processors run a synthetic benchmark that continuously generates memory accesses • Parallel = SPLASH-2 benchmarks
Performance on single-threaded workloads • PRV: private • PRV8: 8MB cache size (instead of 512k) • SL: shared • SP: OS-based page allocation • SP-RR: round-robin allocation • SP-80: 80% allocated locally, 20% spread across tier-1 cores
Performance on single-threaded workloads [Figure: decreased sharing = higher miss rate; decreased sharing = less on-chip traffic]
Performance on multiprogrammed workloads • SP40-CS: uses controlled spreading to limit spreading of unrelated pages onto cores that hold the target application's data • Synthetic benchmarks produce low, mid, or high traffic • SP40 usually performs better under high traffic, but its performance is similar to SL under low traffic • Not shown here, but SP40 reduces on-chip network traffic by 50% (compared to SL)
Performance on parallel workloads • VM: virtual multicore with round-robin page allocation on the participating cores • lu and ocean have higher L1 miss rates, so the L2 cache policy had a greater effect on their performance • For most benchmarks there is no real difference between the policies; on lu and ocean, VM outperforms the rest
Related Issues • Remember NUMA? NUMA systems used a page scanner that maintained reference counters and generated page faults so the OS could take some control • In CC-NUMA, hardware-based counters affected OS decisions • Big difference: NUMA deals with main memory, while the OS-level page allocation presented here deals with distributed L2 caches
Conclusion • Page allocation allows for a very simple shared cache architecture, but how can we use advances in architecture for our benefit? • Architecture can provide more detailed information about current state of the cores • CMP-NuRAPID, victim replication, cooperative caching • Can we apply OS-level modifications also? • Page coloring and page recoloring • We are trading hardware complexity for software complexity – where is the right balance?