Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture

Aanjhan Ranganathan (ETH Zurich), Ali Galip Bayrak (EPFL), Theo Kluter (BFH), Philip Brisk (UC Riverside), Edoardo Charbon (TU Delft), Paolo Ienne (EPFL).

Presentation Transcript


  1. Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture. Aanjhan Ranganathan (ETH Zurich), Ali Galip Bayrak (EPFL), Theo Kluter (BFH), Philip Brisk (UC Riverside), Edoardo Charbon (TU Delft), Paolo Ienne (EPFL)

  2. Multicore Embedded Systems • Increasing number of multiprocessor-based embedded systems. • Low energy requirement with little compromise on performance. • Significant energy consumption in the memory subsystem (caches, shared bus, main memory).

  3. Symmetric Multiprocessor System [Figure: CPU 1, CPU 2 through CPU n, each with its own data cache (D$) and instruction cache (I$), above a shared memory.]

  4. Cache Coherency Problem [Figure: CPU 1, CPU 2 through CPU n, each with its own D$ and I$, above a shared memory.]

  5. Snoopy Hardware Coherence Protocols • Snoop misses consume excessive energy. [Figure: CPU 1, CPU 2 through CPU n, each with its own D$ and I$, above a shared memory.]

  6. Snoop Filters • A snoop filter lookup costs less energy than a cache lookup. [Figure: CPU 1, CPU 2 through CPU n, each with its own D$ and I$ and one snoop filter (SF) per CPU, above a shared memory.]

  7. Snoop Filters in Prior Art • Include, Exclude and Hybrid JETTY: expensive for an embedded system in terms of area; the energy consumed by the JETTYs themselves is significant. • Stream Registers: present in IBM's BlueGene supercomputer; an inclusive filter; uses a base and mask register pair to track the cache lines.

  8. Stream Registers • No general mechanism to remove an address from an SR without compromising correctness. [Figure: register fields Base, Valid, Mask. Initially Valid = 0 and Base and Mask are unset. After inserting 0b1001: Base = 1001, Valid = 1, Mask = 1111. After inserting 0b1010: Base = 1001, Valid = 1, Mask = 1100.] • Addresses matching 10XX result in a snoop filter hit.
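
The base/mask update on slide 8 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' hardware: the class name, method names and the 4-bit width are assumptions chosen to match the slide's example values, and a mask bit of 1 is taken to mean "this bit is the same in every inserted address".

```python
class StreamRegister:
    """Base/mask pair that over-approximates a set of inserted addresses.

    Hypothetical sketch: names and the 4-bit default width are assumptions.
    """

    def __init__(self, width=4):
        self.full_mask = (1 << width) - 1
        self.valid = False
        self.base = 0
        self.mask = 0

    def insert(self, addr):
        if not self.valid:
            # First address: remember it exactly, all bits significant.
            self.base, self.mask, self.valid = addr, self.full_mask, True
        else:
            # Clear every mask bit in which the new address differs from base.
            self.mask &= ~(self.base ^ addr) & self.full_mask

    def may_contain(self, addr):
        # A hit means "possibly cached"; a miss means "definitely not cached".
        return self.valid and (addr & self.mask) == (self.base & self.mask)


sr = StreamRegister()
sr.insert(0b1001)
sr.insert(0b1010)
```

After the two inserts the mask is 0b1100, so every address of the form 10XX is reported as a possible hit, including 0b1000 and 0b1011, which were never inserted. The mask can only lose bits, which is why removing an address is not possible without extra machinery.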

  9. Drawbacks of Stream Register Based Snoop Filters • No efficient way to update the registers when a line is removed from the cache • Filtering performance degrades over time • Additional logic units were introduced but are not efficient (e.g., cache wrap detection)

  10. Our Contribution • Counting Stream Registers: eliminate the cache wrap detection logic; use a counter to track cache lines; more robust to workload variability; better or similar energy savings compared to SRs.

  11. Counting Stream Registers • Removes the need for extra logic such as cache wrap detection, active register history, etc. [Figure: register fields Base, Counter, Mask. Initially Counter = 0 and Base and Mask are unset. After inserting 0b1001: Base = 1001, Counter = 0x01, Mask = 1111. After inserting 0b1010: Base = 1001, Counter = 0x02, Mask = 1100.] • Invalidated cache lines can be tracked by decrementing the counter.
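
The counter on slide 11 can be illustrated with a small Python sketch. The names and the 4-bit width are assumptions, and so is the policy of treating the register as empty once the counter returns to zero; the slides only state that invalidated lines are tracked by decrementing the counter.

```python
class CountingStreamRegister:
    """Base/mask pair plus a counter of the lines currently covered.

    Hypothetical sketch: names, width and the reset-at-zero policy
    are assumptions, not taken from the slides.
    """

    def __init__(self, width=4):
        self.full_mask = (1 << width) - 1
        self.count = 0
        self.base = 0
        self.mask = 0

    def insert(self, addr):
        if self.count == 0:
            # Empty register: start over with an exact match on addr.
            self.base, self.mask = addr, self.full_mask
        else:
            # Widen the mask to cover the new address as well.
            self.mask &= ~(self.base ^ addr) & self.full_mask
        self.count += 1

    def remove(self):
        # Called once per invalidated or evicted line covered by this register.
        if self.count == 0:
            raise ValueError("no lines are tracked by this register")
        self.count -= 1

    def may_contain(self, addr):
        # With count == 0 nothing is cached, so every snoop is filtered.
        return self.count > 0 and (addr & self.mask) == (self.base & self.mask)


csr = CountingStreamRegister()
csr.insert(0b1001)
csr.insert(0b1010)
hit_while_full = csr.may_contain(0b1011)
csr.remove()
csr.remove()
hit_when_empty = csr.may_contain(0b1001)
```

Once both lines are removed the counter is back at zero, so the next insert starts again from a full mask. Under these assumptions, that is how the register recovers its precision without separate logic such as cache wrap detection.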

  12. Snoop Filter Architecture • Set of cache lines grouped into a page • Index into the direct-mapped snoop filter table • Used for comparison with the base register
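
One possible reading of slide 12 is that a line address is split into three fields: low bits that select a line within a page, middle bits that index the direct-mapped snoop filter table, and upper bits that are compared with the base register of the selected entry. The sketch below follows that reading. The field widths, the names and the per-entry counter policy are all assumptions made for illustration.

```python
PAGE_BITS = 2    # hypothetical: 4 cache lines per page
INDEX_BITS = 2   # hypothetical: 4-entry direct-mapped table
TAG_BITS = 4     # hypothetical: width of the field compared with base
TAG_MASK = (1 << TAG_BITS) - 1


def split(line_addr):
    """Return (table index, tag) for a line address; page offset is dropped."""
    page = line_addr >> PAGE_BITS
    index = page & ((1 << INDEX_BITS) - 1)
    tag = (page >> INDEX_BITS) & TAG_MASK
    return index, tag


class SnoopFilterTable:
    """Direct-mapped table of counting stream registers (assumed layout)."""

    def __init__(self):
        self.entries = [{"base": 0, "mask": 0, "count": 0}
                        for _ in range(1 << INDEX_BITS)]

    def line_filled(self, line_addr):
        index, tag = split(line_addr)
        e = self.entries[index]
        if e["count"] == 0:
            e["base"], e["mask"] = tag, TAG_MASK
        else:
            e["mask"] &= ~(e["base"] ^ tag) & TAG_MASK
        e["count"] += 1

    def line_removed(self, line_addr):
        index, _ = split(line_addr)
        e = self.entries[index]
        if e["count"] == 0:
            raise ValueError("entry tracks no lines")
        e["count"] -= 1

    def must_snoop(self, line_addr):
        # False means the snoop can be filtered without a cache lookup.
        index, tag = split(line_addr)
        e = self.entries[index]
        return e["count"] > 0 and (tag & e["mask"]) == (e["base"] & e["mask"])


sf = SnoopFilterTable()
sf.line_filled(0b10010110)
```

Because the page-offset bits are dropped before the lookup, all lines of one page share a table entry and a tag, so a snoop to 0b10010111 is forwarded to the cache while a snoop that maps to an empty entry is filtered.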

  13. Experimental Analysis • Virtex 2 FPGA running OpenRISC soft cores • Configurable number of processors, associativity and size of the data and instruction caches, cache type and coherence protocol • EEMBC MultiBench benchmarks • CACTI 5.3 energy model • Total memory subsystem energy accounts for main memory read/write energy, data and instruction cache read/write energy, leakage and snoop energy

  14. Cache Design Space Exploration

  15. Results: Filtering Percentage • CSR achieves a higher filtering percentage with a smaller number of registers.

  16. Analysis: RGB2CMYK Benchmark

  17. Discussion: Energy Consumption • For most benchmarks, snoop energy was around 8-10% of the total memory subsystem energy without snoop filters • CSR filters are more effective for certain benchmarks (H.264, image rotation) • Better filtering performance with a smaller number of stream registers • Small reduction in overall energy • Platform limited to 32 MB of off-chip SDRAM • No complex data sharing and a limited number of multiple producers of the same data
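
The small overall reduction on slide 17 follows from the snoop share: if snoops account for 8-10% of memory subsystem energy, no filter can save more than that. The sketch below makes the bound explicit. The function name and its parameters are assumptions, and it simplifies by treating the filter rate as the fraction of snoop energy eliminated.

```python
def max_saving_fraction(snoop_share, filter_rate, filter_cost=0.0):
    """Upper bound on the fraction of memory subsystem energy saved.

    snoop_share: snoop energy as a fraction of the total without filters
    filter_rate: fraction of snoop energy the filter eliminates (assumed input)
    filter_cost: energy spent in the filter itself, as a fraction of the total
    """
    if not (0.0 <= snoop_share <= 1.0 and 0.0 <= filter_rate <= 1.0):
        raise ValueError("shares and rates must lie in [0, 1]")
    return snoop_share * filter_rate - filter_cost


# Best case: a perfect filter that costs nothing, at the slide's 8% and 10%.
best_case_low = max_saving_fraction(0.08, 1.0)
best_case_high = max_saving_fraction(0.10, 1.0)
```

Even in this idealized best case the saving is capped at 8-10% of the memory subsystem energy, and any real filter sits below that once its own lookup energy is subtracted.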

  18. Summary • Introduced a snoop filter architecture based on counting stream registers • Lower hardware complexity and the ability to track cache line invalidations • Experimental evaluation shows a better filtering percentage than stream registers, with less performance variation across workloads.
