
SOS: A Software-Oriented Distributed Shared Cache Management Approach for Chip Multiprocessors


Presentation Transcript


  1. SOS: A Software-Oriented Distributed Shared Cache Management Approach for Chip Multiprocessors Lei Jin and Sangyeun Cho Dept. of Computer Science University of Pittsburgh

  2. Chip Multiprocessor Development • The end of performance scaling in uniprocessors has turned researchers toward chip multiprocessor architectures • The number of cores is increasing at a fast pace (Source: Wikipedia)

  3. The CMP Cache • A CMP = N cores + one (coherent) cache system [Diagram: 16 cores backed by cache]

  4. The CMP Cache • A CMP = N cores + one (coherent) cache system • How can one cache system sustain the growth of N cores? [Diagram: 16 cores backed by a single cache]

  5. The CMP Cache • A CMP = N cores + one (coherent) cache system • How can one cache system sustain the growth of N cores? • Non-Uniform Cache Architecture (NUCA) • Shared cache scheme vs. private cache scheme [Tile diagram: core, L1 I/D cache, L2 cache slice, directory, router]

  6. Hybrid Cache Schemes • Victim Replication [Zhang and Asanovic ISCA `05] • Adaptive Selective Replication [Beckmann et al. MICRO `06] • CMP-NuRAPID [Chishti et al. ISCA `05] • Cooperative Caching [Chang and Sohi ISCA `06] • R-NUCA [Hardavellas et al. ISCA `09] • Problems with hardware-based schemes: • Hardware complexity • Limited scalability

  7. The Challenge • CMPs make the core count scalable • A cache system with scalable performance is critical in CMPs • Existing hardware-based schemes have failed to provide it • We propose a Software-Oriented Shared (SOS) cache management approach: • Minimal hardware support • Good scalability

  8. Our Contributions • We studied access patterns in multithreaded workloads and found that they can be exploited to improve locality • We proposed the SOS scheme, which offloads work from hardware to offline software analysis • We evaluated our scheme and showed that it is a promising approach

  9. Outline • Motivation • Observation in access patterns • SOS scheme • Evaluation results • Conclusions

  10. Observation • L2 cache access distribution of Cholesky [Plot: cumulative percentage of accesses vs. sharer count, with one curve counting accesses to blocks by how many threads share them over the whole execution and another by how many threads share them simultaneously]

  11. Observation • L2 cache accesses are skewed toward the two extremes: roughly 50% of accesses go to highly shared data and roughly 30% go to private data [Plot: cumulative percentage of accesses vs. sharer count]

  12. Access Patterns • Static data vs. dynamic data • Static data: location and size are known prior to execution (e.g., global data) • Dynamic data: location and size vary across executions, but access patterns may persist (e.g., data allocated by malloc(), stack data) • Dynamic data is more important than static data • Common access patterns for dynamic data: • Even partition • Scattered • Dominant owner • Shared

  13. Even Partition Pattern • A contiguous memory space is partitioned evenly among threads • Main thread: Array = malloc(sizeof(int) * NumProc * N); • Thread[ProcNo]: for (i = 0; i < N; i++) Array[ProcNo * N + i] = x; [Diagram: the array divided into four contiguous regions owned by T0, T1, T2, T3]
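The slide's snippet, expanded into a minimal runnable sketch using POSIX threads; NUMPROC, N, and the stored value are illustrative placeholders, not values from the paper.

    /* Even-partition pattern: one contiguous array; each thread writes only
     * its own contiguous N-element slice. Minimal sketch for illustration. */
    #include <pthread.h>
    #include <stdlib.h>

    #define NUMPROC 4
    #define N       1024

    static int *Array;

    static void *worker(void *arg)
    {
        long ProcNo = (long)arg;
        for (int i = 0; i < N; i++)
            Array[ProcNo * N + i] = (int)ProcNo;   /* touch only this thread's slice */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NUMPROC];
        Array = malloc(sizeof(int) * NUMPROC * N); /* allocated once by the main thread */
        for (long p = 0; p < NUMPROC; p++)
            pthread_create(&t[p], NULL, worker, (void *)p);
        for (long p = 0; p < NUMPROC; p++)
            pthread_join(t[p], NULL);
        free(Array);
        return 0;
    }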

  14. Scattered Pattern • Memory regions are not contiguous, but each is owned by one thread • Main thread: ArrayPtr = malloc(sizeof(int *) * NumProc); for (i = 0; i < NumProc; i++) ArrayPtr[i] = malloc(sizeof(int) * Size[i]); • Thread[ProcNo]: for (i = 0; i < Size[ProcNo]; i++) ArrayPtr[ProcNo][i] = i; [Diagram: per-thread regions T0-T3 separated by gaps in the address space]
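A corresponding runnable sketch of the scattered pattern, again with POSIX threads; the per-thread region sizes are illustrative placeholders.

    /* Scattered pattern: per-thread arrays allocated separately, so the owned
     * regions are not contiguous in memory. Minimal sketch for illustration. */
    #include <pthread.h>
    #include <stdlib.h>

    #define NUMPROC 4

    static int **ArrayPtr;
    static const int Size[NUMPROC] = { 256, 512, 1024, 2048 };  /* made-up sizes */

    static void *worker(void *arg)
    {
        long ProcNo = (long)arg;
        for (int i = 0; i < Size[ProcNo]; i++)     /* each thread owns one region */
            ArrayPtr[ProcNo][i] = i;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NUMPROC];
        ArrayPtr = malloc(sizeof(int *) * NUMPROC);       /* array of per-thread pointers */
        for (int p = 0; p < NUMPROC; p++)
            ArrayPtr[p] = malloc(sizeof(int) * Size[p]);  /* separate allocations leave gaps */
        for (long p = 0; p < NUMPROC; p++)
            pthread_create(&t[p], NULL, worker, (void *)p);
        for (long p = 0; p < NUMPROC; p++)
            pthread_join(t[p], NULL);
        for (int p = 0; p < NUMPROC; p++)
            free(ArrayPtr[p]);
        free(ArrayPtr);
        return 0;
    }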

  15. Other Patterns • Dominant owner: data are accessed by multiple threads, but one thread contributes significantly more accesses than the others • Shared: data are widely shared

  16. Outline • Motivation • Observation in access patterns • SOS scheme • Evaluation results • Conclusions

  17. SOS Scheme • The SOS scheme consists of 3 components: • One-time offline analysis: L2 cache access profiling, followed by page clustering & pattern recognition • Run-time: hint-guided page coloring and replication

  18. Page Clustering • We take a machine-learning based approach: per-thread L2 cache access traces are turned into a per-page access histogram for each dynamic area, K-means clustering groups the pages, and pattern recognition turns the clusters into a hint (e.g., Even Partition for the area allocated at main.c:123) • Initial centroids: C0 (1, 0, 0, 0), C1 (0, 1, 0, 0), C2 (0, 0, 1, 0), C3 (0, 0, 0, 1), C4 (1, 1, 1, 1)
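For concreteness, a compact sketch of this clustering step, assuming each page's histogram records its per-thread access intensity normalized so the most active thread scores 1.0 (the normalization is an assumption; the example histograms are made up, and this is not the authors' implementation).

    /* K-means over per-page access histograms, seeded with the five centroids
     * from the slide: one per owner thread plus an all-shared centroid. */
    #include <stdio.h>

    #define NTHREAD 4
    #define NCLUST  5
    #define NPAGE   6
    #define NITER   10

    static double centroid[NCLUST][NTHREAD] = {
        {1,0,0,0}, {0,1,0,0}, {0,0,1,0}, {0,0,0,1}, {1,1,1,1}
    };

    /* Example per-page histograms (normalized per-thread access intensity). */
    static double hist[NPAGE][NTHREAD] = {
        {1.00, 0.05, 0.00, 0.00},   /* mostly thread 0 */
        {1.00, 0.10, 0.05, 0.00},   /* mostly thread 0 */
        {0.05, 1.00, 0.00, 0.10},   /* mostly thread 1 */
        {0.00, 0.05, 1.00, 0.10},   /* mostly thread 2 */
        {0.10, 0.00, 0.05, 1.00},   /* mostly thread 3 */
        {1.00, 0.90, 0.95, 0.85},   /* highly shared    */
    };

    static int nearest(const double *h)
    {
        int best = 0; double bestd = 1e30;
        for (int c = 0; c < NCLUST; c++) {
            double d = 0;
            for (int t = 0; t < NTHREAD; t++) {
                double diff = h[t] - centroid[c][t];
                d += diff * diff;                  /* squared Euclidean distance */
            }
            if (d < bestd) { bestd = d; best = c; }
        }
        return best;
    }

    int main(void)
    {
        int assign[NPAGE];
        for (int it = 0; it < NITER; it++) {
            for (int p = 0; p < NPAGE; p++)        /* assignment step */
                assign[p] = nearest(hist[p]);
            for (int c = 0; c < NCLUST; c++) {     /* update step: centroid = cluster mean */
                double sum[NTHREAD] = {0}; int n = 0;
                for (int p = 0; p < NPAGE; p++)
                    if (assign[p] == c) {
                        for (int t = 0; t < NTHREAD; t++) sum[t] += hist[p][t];
                        n++;
                    }
                if (n > 0)
                    for (int t = 0; t < NTHREAD; t++) centroid[c][t] = sum[t] / n;
            }
        }
        for (int p = 0; p < NPAGE; p++)            /* final assignment */
            printf("page %d -> cluster %d\n", p, nearest(hist[p]));
        return 0;
    }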

  19. Pattern Recognition • Assume a dynamic area consists of 8 pages: some pages are accessed mostly by thread 0, some mostly by thread 3, and some are highly shared • Initial centroids for K-means clustering: C0 (1, 0, 0, 0), C1 (0, 1, 0, 0), C2 (0, 0, 1, 0), C3 (0, 0, 0, 1), C4 (1, 1, 1, 1)

  20. Pattern Recognition • The clustering result for the 8-page area is compared against the ideal even partition (P0–P1, P2–P3, P4–P5, P6–P7, one pair per thread) to decide whether the area follows the even partition pattern
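A minimal sketch of this comparison step; the 75% match threshold and the helper function are illustrative assumptions, since the slide only states that the clustering result is compared with the ideal partition.

    /* Compare an observed page-to-cluster assignment against the ideal even
     * partition of an 8-page dynamic area (P0-P1 -> thread 0, P2-P3 -> thread 1,
     * and so on). */
    #include <stdio.h>

    #define NPAGE   8
    #define NTHREAD 4

    static int recognize_even_partition(const int assign[NPAGE])
    {
        int match = 0;
        for (int p = 0; p < NPAGE; p++) {
            int ideal = p / (NPAGE / NTHREAD);   /* ideal owner under even partition */
            if (assign[p] == ideal) match++;
        }
        return match >= (int)(0.75 * NPAGE);     /* "close enough" to even partition */
    }

    int main(void)
    {
        /* Observed assignment: one highly shared page (cluster 4) breaks the pattern. */
        int assign[NPAGE] = { 0, 0, 1, 1, 2, 4, 3, 3 };
        printf("even partition recognized: %s\n",
               recognize_even_partition(assign) ? "yes" : "no");
        return 0;
    }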

  21. Hint Representation & Utilization • For dynamic data, a pattern type is associated with every dynamic allocation call site: [FileName, Line#, Pattern Type] • For static data, the page location is given explicitly: [Virtual Page Num, Tile ID] • SOS data management policy: • The pattern type is translated into an actual partition once the dynamic area's location and size are known to the OS • A page's location is assigned on demand if partition information (a hint) is available • Data without a corresponding hint are treated as highly shared and distributed at block level • Data replication is enabled for shared data
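One possible in-memory encoding of these hints, shown as a sketch; the struct names, field types, and example values (other than main.c:123, which appears on slide 18) are assumptions rather than details from the paper.

    #include <stdint.h>
    #include <stdio.h>

    enum pattern_type {            /* the four dynamic-data patterns from slide 12 */
        PAT_EVEN_PARTITION,
        PAT_SCATTERED,
        PAT_DOMINANT_OWNER,
        PAT_SHARED
    };

    /* Hint for dynamic data: keyed by the allocation call site. The pattern is
     * translated into a partition at run time, once the OS knows the area's
     * location and size. */
    struct dynamic_hint {
        const char       *file_name;   /* e.g. "main.c" */
        int               line_no;     /* line of the allocation call */
        enum pattern_type pattern;
    };

    /* Hint for static data: the page location is given explicitly. */
    struct static_hint {
        uint64_t virtual_page_num;
        uint16_t tile_id;              /* home tile for this page */
    };

    int main(void)
    {
        struct dynamic_hint h = { "main.c", 123, PAT_EVEN_PARTITION };  /* slide 18 example */
        struct static_hint  s = { 0x12345, 3 };                         /* made-up values   */
        printf("%s:%d pattern=%d  vpn=0x%llx tile=%u\n",
               h.file_name, h.line_no, (int)h.pattern,
               (unsigned long long)s.virtual_page_num, s.tile_id);
        return 0;
    }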

  22. Architectural Support • To allow flexible data placement in the L2 cache, we add two fields, TID and BIN, to each page table entry and TLB entry [Jin and Cho CMP-MSI `07, Cho and Jin MICRO `06] • The OS is responsible for providing TID and BIN • Main memory access is unchanged and uses the translated physical page address • The L2 cache addressing mode depends on the values of TID and BIN • [TLB entry: P bit, Virtual Page Number, Physical Page Number, TID, BIN — the physical page number forms the physical address for main memory access; TID and BIN locate the page in the L2 cache]
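A sketch of how the extended entry might be declared and consulted, assuming 4 KB pages, 16 tiles, and a sentinel TID meaning "no hint" (so that such pages are interleaved at block granularity, matching the policy on slide 21); the field widths, the sentinel value, and the interleaving hash are illustrative assumptions, not details from the paper.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_TILES   16
    #define TID_NO_HINT 0xFF    /* assumed sentinel: page has no placement hint */

    /* Extended TLB / page-table entry: the usual translation fields plus the
     * TID and BIN fields that the OS fills in. */
    struct tlb_entry {
        uint64_t vpn;       /* virtual page number */
        uint64_t ppn;       /* physical page number -> main memory address (unchanged) */
        uint8_t  tid;       /* home tile ID for L2 placement, or TID_NO_HINT */
        uint8_t  bin;       /* bin (set group) inside the home tile's L2 slice */
        uint8_t  present;   /* P bit */
    };

    /* Main memory addressing is the same as before; assumes 4 KB pages. */
    static uint64_t physical_address(const struct tlb_entry *e, uint64_t offset)
    {
        return (e->ppn << 12) | offset;
    }

    /* L2 addressing depends on TID/BIN: hinted pages go to the tile named by TID;
     * unhinted pages are treated as shared and spread at block granularity. */
    static int l2_home_tile(const struct tlb_entry *e, uint64_t block_addr)
    {
        if (e->tid != TID_NO_HINT)
            return e->tid;
        return (int)(block_addr % NUM_TILES);   /* block-level interleaving */
    }

    int main(void)
    {
        struct tlb_entry e = { .vpn = 0x42, .ppn = 0x1234, .tid = 5, .bin = 2, .present = 1 };
        printf("PA = 0x%llx, home tile = %d\n",
               (unsigned long long)physical_address(&e, 0x80),
               l2_home_tile(&e, physical_address(&e, 0x80) >> 6));
        return 0;
    }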

  23. Outline • Motivation • Observations in access patterns • SOS scheme • Evaluation results • Conclusions

  24. Experiment Setup • We use a Simics-based memory simulator modeling a 16-tile CMP with a 4x4 2D mesh on-chip network • Each core has a 2-issue in-order pipeline with private L1 I/D caches and an L2 cache slice • Programs from the SPLASH-2 and PARSEC suites are used as benchmarks, each with 3 input sizes • The small input set is used for profiling and hint generation, while the medium and large input sets are used to evaluate SOS performance • For brevity, we present results for 4 representative programs (barnes, lu, cholesky, swaptions) and the overall average of 14 programs

  25. Hint Accuracy • Accuracy is measured as the percentage of pages that are placed in the tile with the most accesses [Charts: small input and medium input]

  26. Breakdown of L2 Cache Accesses • Patterns vary among programs • A large percentage of L2 accesses can be handled by page placement • Shared data are evenly distributed and handled by replication

  27. Remote Access Comparison • Hint-guided data placement significantly reduces the number of remote cache accesses • Our SOS scheme removes nearly 87% of remote accesses!

  28. Execution Time • Hint-guided data placement tracks private cache performance closely • SOS performs nearly 20% better than the shared cache scheme

  29. Related Work • Lu et al. PACT `09 • Analyzes array accesses and performs data layout transformation to improve data affinity • Marathe and Mueller PPoPP `06 • Profiles a truncated program run before each execution • Derives optimal page locations from the sampled access trace • Optimizes data locality for cc-NUMA • Hardavellas et al. ISCA `09 • Dynamically identifies private and shared pages • Private mapping for private pages and fine-grained broadcast-mapping of shared pages • Focuses on server workloads

  30. Conclusions • We propose a software-oriented approach to shared cache management: controlling data placement and replication • This is the first work on a software-managed distributed shared cache scheme for CMPs • We show that multithreaded programs exhibit data access patterns that can be exploited to improve data affinity • We demonstrate through experiments that software-oriented shared cache management is a promising approach • 19% performance improvement over the shared cache scheme

  31. Thank you and Questions?

  32. Future Work • Further study of more complex access patterns could expose additional benefits of our software-oriented cache management scheme • Extend the current scheme to server workloads, which exhibit very different cache behavior from scientific workloads

  33. Hint Coverage • Hint coverage measures the percentage of L2 cache accesses that go to pages guided by SOS hints [Charts: small input and medium input]
