
Miss Reduction in Embedded Processors Through Dynamic, Power-Friendly Cache Design


Presentation Transcript


  1. Miss Reduction in Embedded Processors Through Dynamic, Power-Friendly Cache Design • Garo Bournoutian and Alex Orailoglu • Proceedings of the 45th ACM/IEEE Design Automation Conference (DAC’08), June 2008

  2. Abstract • Today, embedded processors are expected to run complex, algorithm-heavy applications that were originally designed and coded for general-purpose processors. As a result, traditional methods for addressing performance and determinism become inadequate. • This paper explores a new data cache design for use in modern high-performance embedded processors that dynamically improves execution time, power efficiency, and determinism within the system. The simulation results show significant improvements in cache miss ratio and power consumption of approximately 30% and 15%, respectively.

  3. What’s the Problem • Primary (L1) caches in embedded processors are direct-mapped for power efficiency • However, direct-mapped caches are predisposed to thrashing • Hence, we require a cache design that will • Improve performance, power efficiency, and determinism • Minimize the area cost (see the index sketch below for why conflicts arise)
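To see why a direct-mapped cache thrashes, recall that each address maps to exactly one set, so any two hot addresses that share an index keep evicting each other. A minimal sketch of that index computation, assuming the 256-set, 32-byte-line configuration used later in the evaluation:

```c
#include <stdint.h>

#define LINE_SIZE  32u    /* bytes per cache line           */
#define NUM_SETS   256u   /* sets in the direct-mapped L1   */

/* Set index of a byte address: drop the 5 offset bits (log2 32),
 * keep the next 8 index bits (log2 256).  Any two hot addresses
 * whose index bits match compete for the same single cache line. */
static inline uint32_t set_index(uint32_t addr)
{
    return (addr / LINE_SIZE) % NUM_SETS;
}
```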

  4. Related Works • Cache optimization techniques for embedded processors aim to reduce cache conflicts and cache pollution, increase power efficiency, provide extended associativity, and improve cache utilization • Retain data evicted from the cache in a small associative victim cache [2] • Pseudo-associative caches: place blocks in a second associated line [5] • Dual data cache scheme that distinguishes spatial, temporal, and single-use memory references [3] • Application-specific cache partitioning [4] • Filter caches [6] • Shutting down cache ways adaptively per application [7] • This paper: dynamically detect thrashing behavior and expand the selected sets of the data cache, performing the expandable cache lookup only when necessary

  5. Motivating Example • Illustrates why we need to expand the selected sets dynamically, and the insufficiency of the victim cache • Example thrashing code: B and E map to Set-S, C and F map to Set-Q, A and D map to Set-R (a hypothetical loop of this kind is sketched below) • [Figure: successive cache thrashing across Set-R, Set-S, and Set-Q]
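The slides name only the conflicting arrays, not the code. As a hedged illustration, a loop of the kind alluded to might look as follows, assuming the six arrays happen to be placed a multiple of the 8 KB cache size apart so that A/D, B/E, and C/F share set indices:

```c
/* Hypothetical thrashing kernel.  If A/D, B/E and C/F each share a set
 * index, every statement evicts the line its partner just loaded, so
 * the loop misses on nearly every reference. */
void thrash(int *A, int *B, int *C, int *D, int *E, int *F, int n)
{
    for (int i = 0; i < n; i++) {
        A[i] = B[i] + C[i];   /* touches Set-R, Set-S, Set-Q        */
        D[i] = E[i] + F[i];   /* evicts them from those same sets   */
    }
}
```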

  6. Motivating Example (cont.) • Cache trace of the example thrashing code, where B and E map to Set-S, C and F map to Set-Q, A and D map to Set-R • [Figure: main-cache trace showing B/E thrashing in Set-S, C/F in Set-Q, and A/D in Set-R, while the 2-entry victim cache fills with uncorrelated evicted data, polluting it]

  7. The Dynamically Expandable L1 Cache Architecture • Two cooperating mechanisms: • 1st: circular recently-evicted-set list • 2nd: expandable cache lookup

  8. (1) Circular Recently-Evicted-Set List • A small circular list keeps track of the indices of the most recently evicted sets • Goal: detect a probable thrashing set • Operation • The circular list is looked up only on a cache miss • If the missed set is present in the list, conclude that the set is in a thrashing state and should dynamically be expanded: enable the expand bit for that set • The access and update of the circular list occur only during a cache miss, so hit timing is not affected (a minimal sketch follows)
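A minimal sketch of that miss-time bookkeeping, assuming a 5-entry list (one of the configurations evaluated later); the function and field names are illustrative, not taken from the paper:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS        256u
#define EVICT_LIST_LEN  5u    /* assumed; the evaluation uses 5- and 8-entry lists */

static uint16_t evicted_set[EVICT_LIST_LEN];  /* indices of recently evicted sets  */
static unsigned evict_head;                   /* next slot to overwrite (circular) */
static bool     expand_bit[NUM_SETS];         /* per-set expand flag               */

/* Called only on a cache miss, so the hit-path timing is untouched.
 * If the missing set already appears in the list it is probably
 * thrashing: mark it expandable.  Then record this eviction.
 * (Initializing the list to an invalid index is omitted for brevity.) */
void on_cache_miss(uint16_t set)
{
    for (unsigned i = 0; i < EVICT_LIST_LEN; i++) {
        if (evicted_set[i] == set) {
            expand_bit[set] = true;           /* probable thrashing detected */
            break;
        }
    }
    evicted_set[evict_head] = set;
    evict_head = (evict_head + 1) % EVICT_LIST_LEN;
}
```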

  9. (2) Expandable Cache Lookup • Goal: allow a set to re-lookup into a predefined secondary set, virtually doubling the associativity of a given set once the first mechanism has flagged it as a probable thrashing set • The secondary set is determined by a fixed mapping function: flip the most significant bit of the set index (e.g., index “00” pairs with “10” and “01” with “11”) • Besides the expand bit, each cache set has a toggle bit that selects whether the lookup starts on the primary or the secondary set • The toggle bit is enabled when a cache hit occurs on the secondary set and disabled when a cache hit occurs on the primary set • If the 1st lookup misses and the expand bit is 1, a 2nd lookup probes the predefined secondary set on the next cycle • Found: cache hit with a one-cycle penalty • Not found: full cache miss (a minimal sketch follows)
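A minimal sketch of the two-probe lookup under the slide's MSB-flip mapping; `secondary_set`, `lookup_in_set`, and the field names are illustrative assumptions, not the paper's interface:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS   256u
#define INDEX_MSB  (NUM_SETS >> 1)        /* most significant index bit (0x80 for 256 sets) */

static bool expand_bit[NUM_SETS];         /* filled in by the recently-evicted-set list above */
static bool toggle_bit[NUM_SETS];         /* true: start the lookup on the secondary set      */

/* Ordinary tag compare within one set; assumed to exist elsewhere. */
bool lookup_in_set(uint16_t set, uint32_t tag);

/* Fixed mapping to the partner set: flip the most significant index bit. */
static inline uint16_t secondary_set(uint16_t set) { return set ^ INDEX_MSB; }

/* Returns true on a hit.  A hit on the secondary set enables the toggle
 * bit (and, on the second-probe path, costs one extra cycle); a hit on
 * the primary set disables it.  Without the expand bit, a first-probe
 * miss is a full miss, exactly as in the baseline cache. */
bool expandable_lookup(uint16_t set, uint32_t tag)
{
    uint16_t first = toggle_bit[set] ? secondary_set(set) : set;

    if (lookup_in_set(first, tag)) {
        toggle_bit[set] = (first != set);
        return true;
    }
    if (!expand_bit[set])
        return false;                     /* ordinary miss, no second probe */

    uint16_t second = secondary_set(first);   /* the other set of the pair   */
    if (lookup_in_set(second, tag)) {         /* one-cycle penalty on this path */
        toggle_bit[set] = (second != set);
        return true;
    }
    return false;                             /* full cache miss */
}
```

Because the second probe is gated on the expand bit, sets that never thrash keep the single-lookup, single-cycle behavior of the original direct-mapped cache.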

  10. A Demonstrative Example • Cache trace of the proposed cache architecture on the same thrashing code • [Figure: the circular list is updated with Set-S, Set-Q, and Set-R on successive misses; once their expand bits are set to 1, the conflicting arrays spread across the primary sets and their secondary partners Set-S’, Set-Q’, and Set-R’, eliminating the thrashing]

  11. Experimental Setup • Use the SimpleScalar toolset [8] for performance evaluation • Two baseline configurations • 256-set, direct-mapped L1 data cache with a 32-byte line size • 256-set, 4-way set-associative L1 data cache with a 32-byte line size • Use CACTI [10] to evaluate power efficiency • Assume an L1/L2 power ratio of 20: accessing data in L2 costs 20 times as much power as in L1 (a bookkeeping sketch follows) • Benchmarks: 7 representative programs from the SPEC CPU2000 suite [9]
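A minimal sketch of how hit/miss counts could be turned into a relative energy figure under the assumed 20:1 L2-to-L1 access-cost ratio; the function and its accounting of the extra second-probe lookups are illustrative, not the paper's exact power model:

```c
/* Relative dynamic energy of the data side of the hierarchy, using the
 * slide's assumption that an L2 access costs 20x an L1 access.  Every
 * reference pays one L1 access, each miss additionally pays one L2
 * access, and each second-probe lookup of the expandable cache pays
 * one extra L1 access. */
#define L2_OVER_L1_POWER 20.0

double relative_energy(unsigned long l1_accesses,
                       unsigned long l1_misses,
                       unsigned long second_probes)
{
    return (double)l1_accesses
         + (double)second_probes                  /* extra L1 lookups on expansion */
         + (double)l1_misses * L2_OVER_L1_POWER;  /* misses serviced by L2         */
}
```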

  12. Performance Improvement: Direct-Mapped Cache • Criterion: miss rate reduction (improvement over the baseline) • The miss rate improvement of the proposed implementation has an arithmetic mean of 30.75% • [Chart: per-benchmark miss rate improvement; configurations labeled 5-entry recently-evicted-set list and 8-entry victim cache]

  13. Performance Improvement: 4-Way Set-Associative Cache • Criterion: miss rate reduction (improvement over the baseline) • The miss rate improvement of the proposed implementation has an arithmetic mean of 26.74% • [Chart: per-benchmark miss rate improvement; configurations labeled 8-entry recently-evicted-set list and 64-entry victim cache] • Takeaway: significant miss rate reduction for both direct-mapped and set-associative caches

  14. Power Improvement: Direct-Mapped Cache • The power reduction of the proposed implementation averages 15.73% • The technique consistently provides a power reduction across the benchmarks

  15. Power Improvement: 4-Way Set-Associative Cache • The power reduction is less consistent across the benchmarks, with some exceptions showing higher power costs • However, the average is still an improvement of 4.19%

  16. Conclusions • This paper proposed a dynamically expandable data cache architecture • Composed of two main mechanisms • Circular recently-evicted-set list: detects a probable thrashing set • Expandable cache lookup: virtually increases the associativity of a given set • Experimental results show that the proposed technique achieves a significant reduction in cache misses and power consumption • For both direct-mapped and set-associative caches

  17. Comments on This Paper • The related works are not strongly connected to the proposed technique • The results for power usage improvement are too coarse: they do not show the extra power consumption of the supporting circuitry • Results for different lengths of the circular list are not shown
