
Teaching Old Caches New Tricks: Predictor Virtualization


Presentation Transcript


  1. Teaching Old Caches New Tricks: Predictor Virtualization. Andreas Moshovos, Univ. of Toronto. Ioana Burcea's thesis work; some parts joint with Stephen Somogyi (CMU) and Babak Falsafi (EPFL)

  2. Prediction: The Way Forward • Prediction has proven useful in many forms (prefetching, branch target and direction, cache replacement, cache hit prediction) – which to choose? • Application footprints grow • Predictors need to scale to remain effective • Ideally, fast, accurate predictions • Can't have this with conventional technology

  3. The Problem with Conventional Predictors • Design trade-off: accuracy vs. latency vs. hardware cost • What we have: small, fast, not-so-accurate • What we want: small, fast, accurate • Predictor Virtualization → approximates large, accurate, fast predictors

  4. Why Now? Extra resources: CMPs with large caches [diagram: four CPUs, each with its own D$ and I$, sharing an L2 cache of 10-100MB, backed by physical memory]

  5. Predictor Virtualization (PV) [diagram: four-core CMP with per-core D$/I$, shared L2 cache, and physical memory] • Use the on-chip cache to store predictor metadata • Reduce the cost of dedicated predictors

  6. Predictor Virtualization (PV) [diagram: as before] • Use the on-chip cache to store predictor metadata • Implement otherwise impractical predictors

  7. Research Overview • PV breaks the conventional predictor design trade-offs • Lowers the cost of adoption • Facilitates implementation of otherwise impractical predictors • Freeloads on existing resources • Adaptive resource demand • Key design challenge: how to compensate for the longer latency to metadata • PV in action • Virtualized "Spatial Memory Streaming" • Virtualized Branch Target Buffers

  8. Talk Roadmap • PV Architecture • PV in Action • Virtualizing “Spatial Memory Streaming” • Virtualizing Branch Target Buffers • Conclusions

  9. PV Architecture [diagram: an optimization engine next to the CPU sends requests to and receives predictions from its predictor table; the table is what gets virtualized into the L2 cache and physical memory]

  10. PV Architecture [diagram: the dedicated predictor table is replaced by a PVProxy holding a small PVCache; the full PVTable lives in the L2 cache and physical memory] • The proxy sits on the back-side of the L1 and requires access to the L2 • That path is not as performance-critical
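
A minimal sketch of this lookup path in C++ (PVProxy, PVCache, and PVTable are the structures on the slide; the sizes, latencies, and direct-mapped organization are illustrative assumptions, not the talk's design):

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

// Sketch of the PV lookup path: a small dedicated PVCache in front of a
// large PVTable kept in the L2/memory hierarchy. Structure names follow
// the slide; sizes and latencies are assumptions.
struct PVEntry {
  uint64_t tag = 0;
  uint32_t metadata = 0;
  bool     valid = false;
};

class PVProxy {
  static constexpr size_t kPVCacheSets = 64;   // small and fast (assumed size)
  static constexpr int    kL2Latency   = 15;   // cycles; "12-18" per slide 11
  static constexpr int    kMemLatency  = 400;  // cycles; rare case per slide 11

  std::vector<PVEntry> pv_cache{kPVCacheSets};       // direct-mapped for simplicity
  std::unordered_map<uint64_t, uint32_t>& l2_table;  // stand-in for metadata in L2

public:
  explicit PVProxy(std::unordered_map<uint64_t, uint32_t>& backing)
      : l2_table(backing) {}

  // Look up predictor metadata for `key`; charge the latency to `cycles`.
  bool lookup(uint64_t key, uint32_t& metadata, long& cycles) {
    PVEntry& e = pv_cache[key % kPVCacheSets];
    if (e.valid && e.tag == key) {       // common case: PVCache hit
      cycles += 1;
      metadata = e.metadata;
      return true;
    }
    // PVCache miss: fetch the L2 line that holds this entry.
    auto it = l2_table.find(key);
    cycles += (it != l2_table.end()) ? kL2Latency : kMemLatency;
    if (it == l2_table.end()) return false;  // metadata evicted: no prediction
    e = {key, it->second, true};             // install for later temporal reuse
    metadata = it->second;
    return true;
  }
};

int main() {
  std::unordered_map<uint64_t, uint32_t> l2_metadata{{0x40, 0xBEEF}};
  PVProxy proxy(l2_metadata);
  long cycles = 0;
  uint32_t m = 0;
  proxy.lookup(0x40, m, cycles);   // miss: fetched from L2 (~15 cycles)
  proxy.lookup(0x40, m, cycles);   // hit in the PVCache (1 cycle)
  std::printf("metadata=%x, total cycles=%ld\n", m, cycles);
}
```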

  11. PV Challenge: Prediction Latency [diagram: the common case hits in the PVProxy's PVCache; infrequent misses go to the PVTable in the L2 cache, latency 12-18 cycles; rare misses go to physical memory, latency 400 cycles] • Key: how to pack metadata in L2 cache blocks to amortize costs
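
To see why packing matters, a back-of-the-envelope calculation (the hit rate here is assumed for illustration, not from the talk): if packing several entries per L2 line lets the PVCache satisfy 95% of lookups at 1 cycle and the remaining 5% go to the L2 at ~15 cycles, the average metadata latency is about 0.95 × 1 + 0.05 × 15 ≈ 1.7 cycles, close to a dedicated table. Without such reuse, every lookup would pay the full 12-18 cycle L2 latency.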

  12. To Virtualize or Not To Virtualize • Predictors redesigned with PV in mind • Overcoming the latency challenge • Metadata reuse • Intrinsic: one entry used for multiple predictions • Temporal: one entry reused in the near future • Spatial: one miss amortized by several subsequent hits • Metadata access pattern predictability • Predictor metadata prefetching • Looks similar to designing caches, BUT: • Does not have to be correct all the time • Time limit on usefulness

  13. PV in Action • Data prefetching • Virtualize "Spatial Memory Streaming" [ISCA06] • Within 1% of original performance • Hardware cost from 60KB down to < 1KB • Branch prediction • Virtualize branch target buffers • Increase the perceived BTB capacity • Up to 12.75% IPC improvement with 8% hardware overhead

  14. Spatial Memory Streaming [ISCA06] [diagram: a pattern history table maps memory regions to spatial bit patterns such as 1100001010001… and 1101100000001…] [ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos, "Spatial Memory Streaming"

  15. Spatial Memory Streaming (SMS) [diagram: a ~1KB detector observes the data access stream and extracts spatial patterns; a ~60KB predictor matches a trigger access against stored patterns and issues prefetches; the predictor is the part to virtualize]
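
The predictor side of SMS reduces to a simple rule: a stored spatial pattern is a bit vector over a memory region, and a trigger access replays it as prefetches. A sketch, with the region and block sizes assumed for illustration:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Sketch of how an SMS-style spatial pattern turns a trigger access into
// prefetches: each bit in the pattern marks one cache block within a
// fixed-size region around the trigger. Sizes here are assumptions.
constexpr uint64_t kBlockBytes   = 64;
constexpr int      kRegionBlocks = 32;   // 2KB region -> 32-bit pattern

std::vector<uint64_t> prefetch_addresses(uint64_t trigger_addr, uint32_t pattern) {
  uint64_t region_base = trigger_addr & ~(kBlockBytes * kRegionBlocks - 1);
  std::vector<uint64_t> out;
  for (int i = 0; i < kRegionBlocks; ++i)
    if (pattern & (1u << i))                 // bit i set: block i was touched before
      out.push_back(region_base + i * kBlockBytes);
  return out;
}

int main() {
  // Trigger at 0x10040 with a pattern marking blocks 0, 1, and 6.
  for (uint64_t a : prefetch_addresses(0x10040, 0b1000011))
    std::printf("prefetch 0x%llx\n", (unsigned long long)a);
}
```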

  16. Virtualizing SMS [diagram: the PVCache has 8 sets × 11 ways; the virtual table has 1K sets × 11 ways, with the 11 tag+pattern entries of one set packed into a single L2 cache line, leaving a little unused space] • Region-level prefetching is naturally tolerant of longer prediction latencies • Simply pack predictor entries spatially
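
A sketch of the packing itself (the entry field widths and the table's base address are assumptions; the point is that an 11-way set fits in one 64-byte L2 line, so a single fetch amortizes over the whole set):

```cpp
#include <cstdint>

// Sketch of the spatial packing on this slide: one virtual-table set
// (11 ways) is laid out in a single 64-byte L2 line, so one L2 fetch
// brings in the entire set. Field widths and base address are assumed.
struct PackedEntry {
  uint8_t  tag;        // short partial tag (width assumed)
  uint32_t pattern;    // spatial bit pattern (width assumed)
} __attribute__((packed));   // 5 bytes each; 11 ways -> 55 bytes, 9 unused

constexpr uint64_t kTableBase = 0x10000000;  // hypothetical reserved region
constexpr uint64_t kLineBytes = 64;

// One set per L2 line: the set index picks the line directly.
uint64_t set_to_line_address(uint32_t set_index) {
  return kTableBase + uint64_t(set_index) * kLineBytes;
}
```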

  17. Experimental Methodology • SimFlex: full-system, cycle-accurate simulator • Baseline processor configuration • 4-core CMP, OoO • L1D/L1I 64KB 4-way set-associative • UL2 8MB 16-way set-associative • Commercial workloads • Web servers: Apache and Zeus • TPC-C: DB2 and Oracle • TPC-H: several queries • Workloads developed by the Impetus group at CMU • Anastasia Ailamaki & Babak Falsafi, PIs

  18. SMS Performance Potential [figure: percentage of L1 read misses (%) per workload] • A conventional predictor degrades with limited storage

  19. Virtualized SMS [figure: speedup per workload; higher is better] • Hardware cost: original prefetcher ~60KB, virtualized prefetcher < 1KB

  20. Impact of Virtualization on L2 Requests [figure: percentage increase in L2 requests (%) per workload]

  21. Impact of Virtualization on Off-Chip Bandwidth [figure: off-chip bandwidth increase per workload]

  22. PV in Action • Data prefetching • Virtualize “Spatial Memory Streaming” [ISCA06] • Same performance • Hardware cost from 60KB down to < 1KB • Branch prediction • Virtualize branch target buffers • Increase the perceived BTB capacity • Up to 12.75% IPC improvement with 8% hardware overhead

  23. The Need for Larger BTBs [figure: branch MPKI vs. number of BTB entries; lower is better] • Commercial applications benefit from large BTBs

  24. Virtualizing BTBs: Phantom-BTB [diagram: a small, fast dedicated BTB indexed by PC, backed by a large, slow virtual table in the L2 cache] • Latency challenge: branch prediction does not tolerate longer prediction latencies • Solution: predictor metadata prefetching • Virtual table decoupled from the BTB • Virtual table entry: a temporal group

  25. Facilitating Metadata Prefetching • Intuition: programs mostly follow similar paths [diagram: branches found along a detection path recur along a subsequent path]

  26. Temporal Groups • Past misses → a good indicator of future misses • The dedicated predictor acts as a filter

  27. Fetch Trigger • A preceding miss triggers the temporal group fetch • Not precise → fetch the region around the miss

  28-29. Temporal Group Prefetching [figure-only animation slides]

  30. Phantom-BTB Architecture • Temporal Group Generator: generates and installs temporal groups in the L2 cache • Prefetch Engine: prefetches temporal groups [diagram: the BTB, indexed by PC, with the temporal group generator and prefetch engine between it and the L2 cache]

  31. Temporal Group Generation [diagram: the branch stream looks up the BTB; a miss goes to the temporal group generator, which installs groups into the L2 cache; a hit does nothing extra] • BTB misses generate temporal groups • BTB hits do not generate any PBTB activity

  32. Branch Metadata Prefetching [diagram: on a BTB miss, the prefetch engine fetches the matching temporal group from the virtual table in the L2 into a small prefetch buffer] • BTB misses trigger metadata prefetches • The BTB and the prefetch buffer are looked up in parallel • Prefetch buffer hits supply predictions the dedicated BTB missed
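
Putting slides 31-32 together, a sketch of the PBTB miss path (the group and buffer sizes follow slide 34; the region-based group key and the simplified BTB are assumptions):

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

// Sketch of the Phantom-BTB flow: BTB hits cause no PBTB activity; BTB
// misses both trigger a prefetch of the matching temporal group and feed
// the generator that builds new groups. Details beyond the slides are
// assumptions for illustration.
constexpr size_t   kGroupSize   = 6;          // 6-entry temporal group (slide 34)
constexpr size_t   kPrefetchBuf = 64;         // 64-entry prefetch buffer (slide 34)
constexpr uint64_t kRegionMask  = ~0xFFull;   // "region around miss" (assumed 256B)

struct BranchInfo { uint64_t pc, target; };

std::unordered_map<uint64_t, uint64_t> btb;                          // dedicated BTB (simplified)
std::deque<BranchInfo> prefetch_buffer;                              // holds prefetched groups
std::unordered_map<uint64_t, std::vector<BranchInfo>> virtual_table; // temporal groups "in L2"
std::vector<BranchInfo> group_in_progress;                           // group under construction
uint64_t group_trigger = 0;

void on_branch(uint64_t pc, uint64_t resolved_target) {
  if (btb.count(pc)) return;     // BTB hit: no PBTB activity at all
  // Miss, part 1: prefetch the temporal group keyed near this miss; the
  // prefetch buffer is then looked up in parallel with the BTB.
  auto it = virtual_table.find(pc & kRegionMask);
  if (it != virtual_table.end())
    for (const BranchInfo& b : it->second) {
      prefetch_buffer.push_back(b);
      if (prefetch_buffer.size() > kPrefetchBuf) prefetch_buffer.pop_front();
    }
  // Miss, part 2: append this miss to the group under construction and
  // install the group in the L2-resident virtual table once it fills.
  if (group_in_progress.empty()) group_trigger = pc & kRegionMask;
  group_in_progress.push_back({pc, resolved_target});
  if (group_in_progress.size() == kGroupSize) {
    virtual_table[group_trigger] = group_in_progress;
    group_in_progress.clear();
  }
  btb[pc] = resolved_target;     // finally, fill the dedicated BTB
}
```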

  33. Phantom-BTB Advantages • “Pay-as-you-go” approach • Practical design • Increases the perceived BTB capacity • Dynamic allocation of resources • Branch metadata allocated on demand • On-the-fly adaptation to application demands • Branch metadata generation and retrieval performed on BTB misses • Only if the application sees misses • Metadata survives in the L2 as long as there is sufficient capacity and demand

  34. Experimental Methodology • Flexus cycle-accurate, full-system simulator • Uniprocessor - OoO • 1K-entry conventional BTB • 64KB 2-way ICache/DCache • 4MB 16-way L2 Cache • Phantom-BTB • 64-entry prefetch buffer • 6-entry temporal group • 4K-entry virtual table • Commercial Workloads

  35. PBTB vs. Conventional BTBs [figure: speedup; higher is better] • Performance within 1% of a 4K-entry BTB with 3.6x less storage

  36. Phantom-BTB with Larger Dedicated BTBs [figure: speedup; higher is better] • PBTB remains effective with larger dedicated BTBs

  37. Increase in L2 MPKI [figure: L2 MPKI; lower is better] • Marginal increase in L2 misses

  38. Increase in L2 Accesses [figure: L2 accesses per kilo-instruction; lower is better] • PBTB follows application demand for BTB capacity

  39. Summary • Predictor metadata stored in the memory hierarchy • Benefits • Reduces dedicated predictor resources • Emulates large predictor tables for increased prediction accuracy • Why now? Large on-chip caches / CMPs / need for large predictors • Predictor virtualization advantages • Predictor adaptation • Metadata sharing • Moving forward • Virtualize other predictors • Expose the predictor interface to software
