Predictor Virtualization

Presentation Transcript


  1. Predictor Virtualization Ioana Burcea*, Stephen Somogyi§, Andreas Moshovos*, Babak Falsafi§# *University of Toronto, Canada §Carnegie Mellon University #École Polytechnique Fédérale de Lausanne ASPLOS XIII, March 4, 2008

  2. Why Predictors? History Repeats Itself [Figure: a CPU surrounded by the predictors it relies on: branch prediction, prefetching, value prediction, pointer caching, cache replacement] • Application footprints grow • Predictors need to scale to remain effective

  3. Extra Resources: CMPs With Large On-Chip Caches [Figure: a four-core CMP, each core with private I$ and D$, sharing an L2 cache of 10s–100s of MB, backed by main memory]

  4. Predictor Virtualization [Figure: the same four-core CMP with its shared L2 cache, now backed by the physical memory address space]

  5. Predictor Virtualization (PV) • Emulate large predictor tables • Reduce the resources dedicated to predictor tables

  6. Research Contributions • PV – predictor metadata stored in the conventional cache hierarchy • Benefits • Emulate larger tables → increased accuracy • Fewer dedicated resources • Why now? • Large caches / CMPs / need for larger predictors • Will this work? • Metadata locality → intrinsically exploited by caches • First step – a virtualized data prefetcher • Performance: within 1% of the original on average • Space: 60KB down to < 1KB • Advantages of virtualization

  7. Talk Road Map • PV architecture • PV in action • Virtualized “Spatial Memory Streaming” [ISCA 06]* • Conclusions *[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. “Spatial Memory Streaming”

  9. PV Architecture [Figure: the CPU's optimization engine sends a request to, and receives a prediction from, a dedicated predictor table; PV virtualizes this table into the L2 cache and main memory]

  10. PV Architecture [Figure: the dedicated table is replaced by a small PVCache; on a miss, the PVProxy uses the PVStart base address and the index to locate the PVTable, which lives in the physical memory address space behind the L2 cache]

  11. PV: Variable Prediction Latency [Figure: the common case hits in the PVCache; infrequently a request is served by the L2 cache; rarely it must reach the PVTable in physical memory]
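A minimal runnable sketch of this lookup path, in C. The sizes (1K-set virtualized table, 8-set PVCache) come from slide 16; everything else, including the direct-mapped PVCache, the array standing in for the L2/memory backing store, and all names, is illustrative rather than the paper's design:

    #include <stdint.h>
    #include <stdio.h>

    #define VT_SETS  1024   /* virtualized table: 1K sets (slide 16) */
    #define PVC_SETS 8      /* PVCache: 8 sets (slide 16) */

    typedef struct { uint32_t set_no; int valid; uint32_t ways[11]; } pv_set_t;

    static pv_set_t pv_table[VT_SETS];  /* lives in physical memory, behind the L2 */
    static pv_set_t pv_cache[PVC_SETS]; /* the small dedicated structure */

    /* Return the predictor-table set for a given index. The common case
     * hits in the PVCache; a miss fetches the whole set with one
     * block-sized request to the L2 (and, rarely, to main memory). */
    static pv_set_t *pv_lookup(uint32_t set_no) {
        pv_set_t *c = &pv_cache[set_no % PVC_SETS];
        if (c->valid && c->set_no == set_no)
            return c;                  /* common case: low latency */
        *c = pv_table[set_no];         /* infrequent/rare: one memory request */
        c->set_no = set_no;
        c->valid = 1;
        return c;
    }

    int main(void) {
        for (uint32_t s = 0; s < VT_SETS; s++)
            pv_table[s] = (pv_set_t){ .set_no = s, .valid = 1 };
        pv_lookup(42);                                    /* miss: set fetched */
        printf("hit: %d\n", pv_lookup(42)->set_no == 42); /* PVCache hit */
        return 0;
    }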

  12. Metadata Locality • Entry reuse • Temporal – one entry serves multiple predictions • Spatial – can be engineered: one miss is amortized by several subsequent hits • Metadata access pattern predictability → enables predictor metadata prefetching

  13. Talk Road Map • PV architecture • PV in action • Virtualized “Spatial Memory Streaming” [ISCA 06]* • Conclusions *[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. “Spatial Memory Streaming”

  14. Spatial Memory Streaming [ISCA 06] [Figure: bit vectors such as 1100001010001… record which blocks of a memory region are accessed; these spatial patterns are stored in a pattern history table (PHT)] *[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. “Spatial Memory Streaming”
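For intuition, a short C sketch of how such a spatial pattern can be recorded, assuming 64-byte blocks and 32-block spatial groups as on slide 32; the function is a hypothetical simplification, not SMS's actual detector logic:

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SHIFT      6   /* 64-byte cache blocks */
    #define BLOCKS_PER_GROUP 32  /* one 32-bit spatial pattern per group */

    /* Fold the accesses falling in one spatial group into a bit vector:
     * bit i is set if block i of the group was touched. SMS later replays
     * the stored pattern to prefetch the same blocks. */
    static uint32_t record_pattern(const uint64_t *addrs, int n, uint64_t base) {
        uint32_t pattern = 0;
        for (int i = 0; i < n; i++) {
            uint64_t blk = (addrs[i] - base) >> BLOCK_SHIFT;
            if (blk < BLOCKS_PER_GROUP)
                pattern |= 1u << blk;
        }
        return pattern;
    }

    int main(void) {
        uint64_t accesses[] = { 0x10000, 0x10040, 0x10180 }; /* blocks 0, 1, 6 */
        printf("pattern = 0x%x\n", record_pattern(accesses, 3, 0x10000)); /* 0x43 */
        return 0;
    }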

  15. Virtualizing “Spatial Memory Streaming” (SMS) [Figure: the detector (~1KB) observes the data access stream and records spatial patterns; the predictor table (~60KB), the structure PV virtualizes, is consulted on a trigger access and issues prefetches]

  16. Virtualizing SMS Each entry holds an 11-bit tag and a 32-bit pattern; eleven entries fit in one 64-byte cache block, leaving 39 bits unused. The virtual table has 1K sets, 11 ways; the PVCache has 8 sets, 11 ways. One table set maps onto one 64-byte cache block.
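A C sketch of that packing, assuming the eleven 43-bit (tag, pattern) entries are laid out back-to-back from bit 0 of the block, as slide 32's layout suggests; the helper names are illustrative:

    #include <stdint.h>
    #include <stdio.h>

    #define ENTRY_BITS        43  /* 11-bit tag + 32-bit pattern */
    #define ENTRIES_PER_BLOCK 11  /* 11 * 43 = 473 bits; 39 of 512 unused */

    /* Write entry `way` (0..10) into a 64-byte block, bit by bit. */
    static void pack_entry(uint8_t blk[64], int way, uint16_t tag, uint32_t pat) {
        uint64_t e = ((uint64_t)(tag & 0x7FF) << 32) | pat;  /* 43-bit entry */
        for (int i = 0, b = way * ENTRY_BITS; i < ENTRY_BITS; i++, b++)
            if ((e >> i) & 1) blk[b >> 3] |= (uint8_t)(1u << (b & 7));
            else              blk[b >> 3] &= (uint8_t)~(1u << (b & 7));
    }

    /* Read entry `way` back out of the block. */
    static uint64_t unpack_entry(const uint8_t blk[64], int way) {
        uint64_t e = 0;
        for (int i = 0, b = way * ENTRY_BITS; i < ENTRY_BITS; i++, b++)
            e |= (uint64_t)((blk[b >> 3] >> (b & 7)) & 1) << i;
        return e;
    }

    int main(void) {
        uint8_t blk[64] = {0};
        pack_entry(blk, 10, 0x5A5, 0xDEADBEEF);  /* last way of the block */
        uint64_t e = unpack_entry(blk, 10);
        printf("tag=0x%x pattern=0x%x\n",
               (unsigned)(e >> 32), (unsigned)(e & 0xFFFFFFFF));
        return 0;
    }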

  17. Current Implementation • Non-Intrusive • Virtual table stored in reserved physical address space • One table per core • Caches oblivious to metadata • Options • Predictor tables stored in virtual memory • Single, shared table per application • Caches aware of metadata

  18. Simulation Infrastructure • SimFlex • Full-system simulator based on Simics • Base processor configuration • 4-core CMP • 8-wide OoO • 256-entry ROB • L1D/L1I 64KB 4-way set-associative • UL2 8MB 16-way set-associative • Commercial workloads • TPC-C: DB2 and Oracle • TPC-H: Query 1, Query 2, Query 16, Query 17 • SpecWeb: Apache and Zeus

  19–22. Original Prefetcher – Accuracy vs. Predictor Size [Figure, built up over four animation steps: L1 read misses (lower is better) as the pattern history table shrinks] Small Tables Diminish Prefetching Accuracy

  23. Virtualized Prefetcher – Performance [Figure: speedup (higher is better); hardware cost: original prefetcher ~60KB vs. virtualized prefetcher < 1KB]

  24. Impact on L2 Memory Requests [Figure: increase in L2 memory requests (lower is better)] Dark Side: Increased L2 Memory Requests

  25. Impact of Virtualization on Off-Chip Bandwidth [Figure: off-chip bandwidth increase (lower is better); extra L2 requests affect performance indirectly, extra off-chip traffic affects it directly] Minimal Impact on Off-Chip Bandwidth

  26. Conclusions • Predictor Virtualization • Metadata stored in the conventional cache hierarchy • Benefits • Emulate larger tables → increased accuracy • Fewer dedicated resources • First step – a virtualized data prefetcher • Performance: within 1% of the original on average • Space: 60KB down to < 1KB • Opportunities • Metadata sharing and persistence • Application-directed prediction • Predictor adaptation

  27. Predictor Virtualization Ioana Burcea* ioana@eecg.toronto.edu Stephen Somogyi§, Andreas Moshovos*, Babak Falsafi§# *University of Toronto, Canada §Carnegie Mellon University #École Polytechnique Fédérale de Lausanne ASPLOS XIII, March 4, 2008


  30. PV – Motivating Trends • Dedicating resources to predictors is hard to justify • Larger predictor tables → increased performance • Chip multiprocessors: space dedicated to predictors scales with the number of processors • Memory hierarchies offer the opportunity • Increased capacity • Diminishing returns → Use conventional memory hierarchies to store predictor metadata

  31. Virtualizing the Predictor Table [Figure: the pattern history table holds (tag, pattern) pairs; the trigger access's address and PC form the index, and the matching pattern (e.g. …00111010) drives the prefetches] • PHT stored in the physical address space • Multiple PHT entries packed into one memory block → one memory request brings in an entire table set

  32. Packing Entries in One Cache Block • Index: PC + offset within the spatial group • PC → 16 bits • 32 blocks in a spatial group → 5-bit offset → 32-bit spatial pattern • 21-bit index in total • Pattern table: 1K sets • 10 bits index the table → 11-bit tag • Cache block: 64 bytes → 11 entries per cache block → pattern table: 1K sets, 11-way set-associative [Figure: block layout with entries packed back-to-back – tag at bits 0–10, pattern at bits 11–42, the next tag at 43–53, its pattern at 54–85, and so on, with the final bits unused]

  33. Memory Address Calculation [Figure: the 16-bit PC and the 5-bit block offset concatenate into a 21-bit index; the low 10 bits select a set, are shifted left by 6 (64-byte blocks, hence the six trailing zeros), and are added to the PV start address to form the memory address; the upper 11 bits are the tag]
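A C sketch of this calculation under the bit widths above (the PVStart value and the names are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    /* index = {PC[15:0], block_offset[4:0]} (21 bits); the low 10 bits pick
     * one of 1K sets, each set occupies one 64-byte block starting at the
     * PV start address; the remaining 11 bits form the tag. */
    static uint64_t pv_set_address(uint64_t pv_start, uint32_t pc,
                                   uint32_t blk_off, uint16_t *tag) {
        uint32_t index = ((pc & 0xFFFF) << 5) | (blk_off & 0x1F);
        *tag = (uint16_t)(index >> 10);                      /* high 11 bits */
        return pv_start + ((uint64_t)(index & 0x3FF) << 6);  /* set * 64 B */
    }

    int main(void) {
        uint16_t tag;
        uint64_t a = pv_set_address(0x80000000ULL, 0x1234, 7, &tag);
        printf("addr=0x%llx tag=0x%x\n", (unsigned long long)a, tag);
        /* prints addr=0x8000a1c0 tag=0x91 */
        return 0;
    }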

  34. Increase in Off-Chip Bandwidth – Different L2 Sizes [Figure: off-chip bandwidth increase across L2 cache sizes]

  35. Increased L2 Latency [Figure: speedup sensitivity to increased L2 latency]

  36. Conclusions • PV – metadata stored in the conventional cache hierarchy • Benefits • Fewer dedicated resources • Emulate larger tables → increased accuracy • Example – a virtualized data prefetcher • Performance: within 1% of the original on average • Space: 60KB down to < 1KB • Why now? • Large caches / CMPs / need for larger predictors • Will this work? • Metadata locality → intrinsically exploited by caches • Metadata access pattern predictability • Opportunities • Metadata sharing and persistence • Application-directed prediction • Predictor adaptation
