
Highly-Associative Caches for Low-Power Processors

Presentation Transcript


  1. Highly-Associative Caches for Low-Power Processors Michael Zhang Krste Asanovic {rzhang|krste}@lcs.mit.edu

  2. Motivation • Caches consume 30-60% of processor energy in embedded systems • Example: 43% for the StrongARM-1 • Many academic studies of caches: • [Albera, Bahar, '98] – Power and performance trade-offs • [Amrutur, Horowitz, '98, '00] – Speed and power scaling • [Bellas, Hajj, Polychronopoulos, '99] – Dynamic cache management • [Ghose, Kamble, '99] – Power reduction through sub-banking, etc. • [Inoue, Ishihara, Murakami, '99] – Way-predicting set-associative cache • [Kin, Gupta, Mangione-Smith, '97] – Filter cache • [Ko, Balsara, Nanda, '98] – Multilevel caches for RISC and CISC • [Wilton, Jouppi, '94] – CACTI cache model • Many industrial low-power processors use CAM (content-addressable memory) tags • ARM3 – 64-way set-associative – [Furber et al. '89] • StrongARM – 32-way set-associative – [Santhanam et al. '98] • Intel XScale – 32-way set-associative – '01 • CAM: fast and energy-efficient

  3. Talk Outline • Structural Comparison • Area and Delay Comparison • Energy Comparison • Related work • Conclusion

  4. Set-Associative RAM-tag Cache [Figure: RAM-tag organization – per-way Tag | Status | Data arrays with tag comparators (=?); address split into Tag | Index | Offset] • Conventional parallel lookup is not energy-efficient: all ways are read out on every access • Two-phase approach (check tags first, then read only the hitting data way) is more energy-efficient but has roughly 2X latency
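
The following is a minimal behavioral sketch, in Python, of the two lookup styles summarized above; the cache geometry (32-byte lines, 64 sets, 4 ways) and all function names are illustrative assumptions, not parameters taken from the slides.

```python
# Minimal behavioral sketch of a set-associative RAM-tag lookup.
# Geometry and names are illustrative assumptions, not from the slides.

LINE_BYTES = 32        # cache line size -> 5 offset bits
NUM_SETS = 64          # -> 6 index bits
NUM_WAYS = 4

def split_address(addr):
    """Split an address into (tag, index, offset)."""
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // (LINE_BYTES * NUM_SETS)
    return tag, index, offset

# tags[index][way] and data[index][way] model the tag and data SRAM arrays.
tags = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
data = [[None] * NUM_WAYS for _ in range(NUM_SETS)]

def lookup_parallel(addr):
    """Conventional lookup: every way's tag and data are read out,
    then the matching way (if any) is selected -- fast but wasteful."""
    tag, index, _offset = split_address(addr)
    data_ways_read = NUM_WAYS                 # all data ways burn read energy
    for way in range(NUM_WAYS):
        if tags[index][way] == tag:
            return data[index][way], data_ways_read
    return None, data_ways_read               # miss

def lookup_two_phase(addr):
    """Two-phase lookup: read and compare the tags first, then read only
    the hitting data way -- less energy, but roughly doubled latency."""
    tag, index, _offset = split_address(addr)
    for way in range(NUM_WAYS):
        if tags[index][way] == tag:           # phase 1: tag check
            return data[index][way], 1        # phase 2: one data way read
    return None, 0                            # miss: no data way read
```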

  5. Set-Associative RAM-tag Sub-bank [Figure: RAM-tag sub-bank – address decoder drives global and local wordlines (gwl/lwl); tag and data SRAM cell arrays with offset decoders, tag comparator, and sense amps; 32-bit I/O onto a 128-bit bus] • Not energy-efficient: all ways are read out (the two-phase approach is more energy-efficient but has 2X latency) • Sub-banking: 1 sub-bank = 1 way • Low-swing bitlines: only for reads; writes are performed full-swing • Wordline gating

  6. CAM-tag Cache [Figure: CAM-tag organization – per-bank Tag | Status | Data arrays each producing a HIT? signal; address split into Tag | Bank | Offset fields to select a word] • Only one sub-bank is activated • Associativity is provided within the sub-bank • Easy to implement high associativity

  7. CAM-tag Cache Sub-bank [Figure: CAM-tag sub-bank – CAM-tag array drives local wordlines (lwl) into the data SRAM cell arrays; offset decoders and sense amps feed 32-bit I/O onto a 128-bit bus] • Only one sub-bank is activated • Associativity is provided within the sub-bank • Easy to implement high associativity
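
Below is a comparable Python sketch of a CAM-tag lookup; the bank count, lines per sub-bank, and names are again illustrative assumptions. The point it illustrates is that there is no index field: the bank bits activate exactly one sub-bank, and the tag is compared associatively against every line stored in that sub-bank.

```python
# Minimal behavioral sketch of a CAM-tag cache lookup.
# Bank and line counts are illustrative assumptions, not from the slides.

LINE_BYTES = 32        # -> offset bits
NUM_BANKS = 8          # bank field selects exactly one sub-bank
LINES_PER_BANK = 32    # associativity = number of lines in a sub-bank

def split_address(addr):
    """Split an address into (tag, bank, offset) -- note there is no index
    field; any line in the selected sub-bank may hold the data."""
    offset = addr % LINE_BYTES
    bank = (addr // LINE_BYTES) % NUM_BANKS
    tag = addr // (LINE_BYTES * NUM_BANKS)
    return tag, bank, offset

# cam[bank] holds the stored tags; ram[bank] holds the matching data lines.
cam = [[None] * LINES_PER_BANK for _ in range(NUM_BANKS)]
ram = [[None] * LINES_PER_BANK for _ in range(NUM_BANKS)]

def lookup(addr):
    """Only the addressed sub-bank is activated; its CAM compares the tag
    against every stored tag at once, and a match enables one wordline."""
    tag, bank, _offset = split_address(addr)
    for line, stored_tag in enumerate(cam[bank]):  # models the parallel CAM search
        if stored_tag == tag:
            return ram[bank][line]                 # single data line read on a hit
    return None                                    # miss
```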

  8. CAM Functionality and Energy Usage [Figure: 10-T CAM cell with separate write lines (Bit/Bit_b) and search lines (SBit/SBit_b), a wordline (WL), and a low-swing match line; the XOR truth table shows that any mismatch between a stored bit and a search bit discharges the match line] • CAM energy dissipation: • Search lines • Match lines • Drivers
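
As a small illustration of the XOR behavior in the figure, the sketch below (Python, with a hypothetical helper name) models a match line that stays asserted only when every stored bit equals the corresponding search bit; any mismatching cell discharges it.

```python
# Hedged sketch of the per-bit match behavior summarized by the XOR table:
# a stored bit that differs from the search bit (XOR = 1) discharges the
# shared match line, so the line stays asserted only on a full tag match.

def match_line(stored_bits, search_bits):
    """Return True if the (low-swing) match line would remain asserted."""
    assert len(stored_bits) == len(search_bits)
    for stored, search in zip(stored_bits, search_bits):
        if stored ^ search:        # a mismatch in any cell pulls the line down
            return False
    return True

# Example: searching for tag 1010 against stored tags 1010 and 1011.
print(match_line([1, 0, 1, 0], [1, 0, 1, 0]))   # True  (hit)
print(match_line([1, 0, 1, 1], [1, 0, 1, 0]))   # False (no match)
```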

  9. CAM-tag Cache Sub-bank Layout [Figure: layout of a 1-KB cache sub-bank implemented in 0.25 µm CMOS technology – 32x64 RAM array plus 2x12x32 CAM array] • 10% area overhead over a RAM-tag cache

  10. Delay Comparison [Figure: critical paths – RAM-tag: decoded index → global wordline (gwl) decoding → local wordline (lwl) decoding → tag and data readout → tag comparison → data out; CAM-tag: tag bit broadcast → tag comparison (CAM match) → local wordline decoding → data readout → data out] • RAM-tag cache critical path starts from the index bits • CAM-tag cache critical path starts from the tag bits • The two access delays are within 3% of each other

  11. Hit Energy Comparison [Chart: hit energy per access for an 8KB cache, in pJ, plotted against associativity and implementation (RAM-tag vs. CAM-tag)]

  12. Miss Rate Results [Chart: miss rates for the LZW, pegwit, ijpeg, perl, gcc, and m88ksim benchmarks]

  13. Total Access Energy (pegwit) [Chart: total energy per access for an 8KB cache, in pJ, with miss energy expressed in multiples of the 32-bit read access energy] • pegwit retains a high miss rate even at high associativity
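
The sketch below shows, under assumed placeholder numbers (not measurements from the slides), how a total-energy-per-access figure of this kind can be computed: hit energy plus miss rate times a miss penalty expressed in multiples of the 32-bit read access energy.

```python
# Hedged sketch of how the total-energy bars can be read. All numbers below
# are illustrative placeholders, not measurements from the slides.

READ_32BIT_PJ = 10.0     # hypothetical energy of one 32-bit read access (pJ)

def total_energy_per_access(hit_energy_pj, miss_rate, miss_penalty_reads):
    """E_total = E_hit + miss_rate * (miss_penalty_reads * E_read32)."""
    miss_energy_pj = miss_penalty_reads * READ_32BIT_PJ
    return hit_energy_pj + miss_rate * miss_energy_pj

# A high-miss-rate benchmark (pegwit-like) is dominated by miss energy, so
# extra associativity that does not reduce the miss rate buys little.
print(total_energy_per_access(hit_energy_pj=150.0, miss_rate=0.05,
                              miss_penalty_reads=64))   # 182.0 pJ
```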

  14. Total Access Energy (perl) [Chart: total energy per access for an 8KB cache, in pJ, with miss energy expressed in multiples of the 32-bit read access energy] • perl has a very low miss rate at high associativity

  15. Other Advantages of CAM-tag • Hit signal is generated earlier, which simplifies pipelines • Simplified store operation: the wordline is only enabled during a hit, so stores can happen in a single cycle and no write buffer is necessary

  16. Related Work • CACTI and CACTI2 – [Wilton and Jouppi '94], [Reinman and Jouppi '99] • Accurate delay and energy estimates – results within 10% • Energy estimates are not suited for low-power designs: typical low-power features are not modeled in CACTI • Sub-banking • Low-swing bitlines • Wordline gating • Separate CAM search lines • Low-swing match lines • CACTI's energy estimate is 10X greater than our model for one CAM-tag cache sub-bank • Our results closely agree with [Amrutur and Horowitz '98]

  17. Conclusion • CAM tags offer high performance and low power • Energy consumption of a 32-way CAM-tag cache is less than that of a 2-way RAM-tag cache • Easy to implement highly-associative tags • Low area overhead (10%) • Comparable access delay • Better CPI through a reduced miss rate

  18. Thank You! http://www.cag.lcs.mit.edu/scale/
