To Include or Not to Include?

Natalie Enright, Dana Vantrease

Presentation Transcript


  1. To Include or Not to Include? Natalie Enright, Dana Vantrease

  2. Motivation
     • CMP technology affects coherence protocols differently than previously studied MP systems
       • New shared on-chip resources (e.g. L2)
       • Low latency between on-chip caches
       • Need for scalability in design
     • Industry Examples
       • IBM Power 4 – Inclusion
       • Piranha – Exclusion
     • Our goal: Determine the point at which each inclusion protocol (strict inclusion, non-inclusion and exclusion) is the best choice for CMP performance.

  3. SMP vs CMP Opportunities [Figure: diagrams of L1 and L2 cache arrangements comparing SMP and CMP organizations]

  4. Multilevel Inclusion
     • Protocol given to us with the simulator
     • L1 has Modified, Shared and Invalid states
     • L2 has Modified, Owned, Shared and Invalid states
     • When an L2 line is replaced, any copies present on the chip must be invalidated (the sharers are given in the directory entry)
     • A single-processor chip has only 2 L1 caches (instruction and data) connected to a single L2 cache
     • A chip multiprocessor adds 2 more L1 caches per processor, which could make this forced inclusion harmful
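The replacement rule on slide 4 can be sketched in a few lines of Python. This is only an illustration, not the simulator's protocol: the class names `L2Line` and `InclusiveChip`, the dictionary-based caches, and the choice to fill blocks in Shared state are all assumptions made for the example, and writeback of Modified data is left out.

```python
from dataclasses import dataclass, field


@dataclass
class L2Line:
    """One L2 line plus its directory entry (hypothetical layout)."""
    state: str = "I"                            # one of M, O, S, I (slide 4)
    sharers: set = field(default_factory=set)   # ids of L1s holding a copy


class InclusiveChip:
    """Toy model of strict inclusion; not the protocol shipped with the simulator."""

    def __init__(self, num_l1s):
        # Each L1 maps block address -> state (M, S, or I).
        self.l1 = [dict() for _ in range(num_l1s)]
        self.l2 = {}  # block address -> L2Line

    def fill(self, l1_id, addr):
        """Bring a block into an L1 in Shared state; inclusion forces an L2 copy."""
        line = self.l2.setdefault(addr, L2Line())
        if line.state == "I":
            line.state = "S"
        line.sharers.add(l1_id)
        self.l1[l1_id][addr] = "S"

    def evict_l2(self, addr):
        """Replace an L2 line: every on-chip copy listed in the directory is invalidated."""
        line = self.l2.pop(addr)
        for l1_id in sorted(line.sharers):
            self.l1[l1_id].pop(addr, None)
        return sorted(line.sharers)  # the L1s that lost their copy
```

With 4 processors and split instruction and data caches there are 8 L1s, so a single L2 replacement can remove a block from up to 8 caches that were still using it.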

  5. Non-Inclusion
     • Protocol courtesy of Mike
     • L1 now has Owned and Exclusive states
     • Complexity of the on-chip directory has increased significantly
       • States added to indicate local L1 sharers or a local L1 owner
       • L1 directory state also needs to be visible to external requests from other chips
     • Increases effective on-chip cache storage

  6. Directory Exclusion
     • No replication of data between a single L1 and the L2
     • L2 acts as a large victim cache
     • Uses cache space more effectively, lowering required off-chip bandwidth
     • L2 is the centralized coherence point (tag lookup)
     • L1 states: M, E, I, SC, SM
     • L2 states: M, E, I
     • No ownership: a data request simply goes to the first sharer found in the tag lookup
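A rough Python sketch of the victim-cache behaviour described on slide 6 follows. It tracks only where blocks live, not the M/E/I/SC/SM states. The class `ExclusiveChip` and its methods are hypothetical, and two points are assumptions rather than statements from the slides: fills from memory bypass the L2, and an L1 victim enters the L2 only when no other L1 still holds the block.

```python
class ExclusiveChip:
    """Toy placement model for an exclusive L2; coherence states are omitted."""

    def __init__(self, num_l1s):
        self.l1 = [set() for _ in range(num_l1s)]  # blocks held by each L1
        self.l2 = set()                            # blocks held only by the L2

    def sharers(self, addr):
        """Tag lookup: which L1s hold the block, in id order."""
        return [i for i, blocks in enumerate(self.l1) if addr in blocks]

    def read(self, l1_id, addr):
        """Return the data source: 'l1-hit', 'l1:<id>', 'l2', or 'memory'."""
        if addr in self.l1[l1_id]:
            return "l1-hit"
        holders = self.sharers(addr)
        if holders:
            source = f"l1:{holders[0]}"   # no owner: ask the first sharer
        elif addr in self.l2:
            self.l2.remove(addr)          # the block moves; it is not copied
            source = "l2"
        else:
            source = "memory"             # assumption: the fill bypasses the L2
        self.l1[l1_id].add(addr)
        return source

    def evict_l1(self, l1_id, addr):
        """Assumption: a victim enters the L2 only if no other L1 still holds it."""
        self.l1[l1_id].remove(addr)
        if not self.sharers(addr):
            self.l2.add(addr)
```

Because a block is never in both an L1 and the L2 in this model, the on-chip capacity is roughly the sum of the L1 and L2 sizes rather than the L2 size alone.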

  7. Directory Exclusion [Figure: groups of L1 caches, each group attached to an L2 that holds the L1 tags]

  8. Tag Lookup Cache
     • Aids off-chip coherence and directs on-chip requests
     • Associativity = L1 associativity × number of L1s
     • Number of sets = number of sets in a single L1
     • Number of data entries = number of L1s
     • Data entry = one bit per L1: 1 if that L1 has the data, 0 if not
     • Scalability?
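The sizing rules on slide 8 can be turned into a small calculation. The function name `tag_lookup_geometry` and the example L1 parameters (64 KiB, 2-way, 64-byte blocks) are assumptions chosen for illustration; the slides do not give the L1 geometry.

```python
def tag_lookup_geometry(l1_size_bytes, l1_assoc, block_bytes, num_l1s):
    """Apply the slide 8 sizing rules to a hypothetical L1 configuration."""
    l1_sets = l1_size_bytes // (block_bytes * l1_assoc)
    return {
        "sets": l1_sets,                        # same as a single L1
        "associativity": l1_assoc * num_l1s,    # grows with the number of L1s
        "presence_bits_per_entry": num_l1s,     # one has-data bit per L1
        "total_tag_entries": l1_sets * l1_assoc * num_l1s,
    }


# Assumed example: 4 cores with split I/D caches (8 L1s), 64 KiB 2-way L1s.
four_core = tag_lookup_geometry(64 * 1024, 2, 64, num_l1s=8)
# Same L1s on a 16-core chip (32 L1s).
sixteen_core = tag_lookup_geometry(64 * 1024, 2, 64, num_l1s=32)
```

Under these assumed numbers the lookup associativity is 16 for the 4-core chip and 64 for the 16-core chip, which is one way to read the "Scalability?" bullet: the structure widens linearly with the number of L1 caches.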

  9. Methodology
     • Vary the L1 cache size to find the design point at which an inclusive protocol hurts performance
     • As the number of cores increases, so does the aggregate L1 cache size

  10. Simulation Configuration
     • 4 processors per chip and 1 chip
     • 2 MB of L2 cache
       • Small, but we wanted to see the effect of changing the ratio of L1 size to L2 size
     • 16 processors per chip as future work
     • Only one chip was simulated, to isolate the effects of intra-chip coherence from inter-chip coherence
     • Future work: see how extending the life of a block on chip through non-inclusion or exclusion affects other chips

  11. Results: Inclusion vs. Non-Inclusion

  12. Results (cont.): Inclusion vs. Pseudo-Exclusion

  13. Conclusion/Future Work
     • An inclusive protocol is less complex
       • Especially considering inter-chip communication
     • Non-inclusion performs consistently better than inclusion
       • Additional complexity only warranted once the total L1 cache size is greater than 25% of the L2 cache size
     • Longer runs and more benchmarks would provide more conclusive evidence
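To make the 25% figure concrete, the sketch below applies it to the simulated configuration (4 processors, 2 MB L2). The helper names are hypothetical, and counting both the instruction and the data cache of each processor toward the "total L1 cache size" is an assumption; the slides do not say how the total was computed.

```python
def aggregate_l1_ratio(cores, l1s_per_core, l1_kib, l2_kib):
    """Total L1 capacity as a fraction of L2 capacity."""
    return cores * l1s_per_core * l1_kib / l2_kib


def break_even_l1_kib(cores, l1s_per_core, l2_kib, threshold=0.25):
    """Per-L1 size at which the aggregate L1 capacity reaches the threshold."""
    return threshold * l2_kib / (cores * l1s_per_core)


# Simulated configuration: 4 cores, 2 MB (2048 KiB) L2.
# Assumption: 2 L1s per core (instruction and data) both count toward the total.
ratio_32k = aggregate_l1_ratio(cores=4, l1s_per_core=2, l1_kib=32, l2_kib=2048)
cutoff_4 = break_even_l1_kib(cores=4, l1s_per_core=2, l2_kib=2048)
cutoff_16 = break_even_l1_kib(cores=16, l1s_per_core=2, l2_kib=2048)
```

Under these assumptions, 32 KiB L1s give a ratio of 0.125, and the 25% mark is reached at 64 KiB per L1 on 4 cores. With 16 cores and the same L2 it would already be reached at 16 KiB per L1.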

  14. Future Work
     • Ongoing: get a working exclusion protocol in the Ruby tester and Simics
       • Current status: runs 500 memory transactions in the Ruby tester
     • Run tests comparable to those run for non-inclusion
     • Analyze benefits of exclusion over inclusion
     • Expand to 16 cores and study scalability issues
