Cache Pipelining with Partial Operand Knowledge
Erika Gunadi and Mikko H. Lipasti
Department of Electrical and Computer Engineering
University of Wisconsin—Madison
http://www.ece.wisc.edu/~pharm
Cache Power Consumption
• Increasing on-chip cache size
  • Increasing cache power consumption
• Increasing clock frequency
  • Increasing dynamic power
• Lots of prior work to reduce cache power consumption
Prior Work
• Cache subbanking, bitline segmentation [Su et al. 1995, Ghose et al. 2001]
• Cache decomposition [Huang et al. 2001]
• Block buffering [Su et al. 1995]
• Reducing leakage power
  • Drowsy caches [Flautner et al. 2002, Kim et al. 2002]
  • Cache decay [Kaxiras et al. 2001]
  • Gated Vdd [Powell et al. 2000]
Cache Subbanking
• Proposed by Su et al. 1995
  • Fetches only the requested subline
  • Partitions the data array vertically into several subbanks
• Further studied by Ghose et al. 2001
  • Partitions the data array both vertically and horizontally
  • Activates only the requested subbanks (see the sketch below)
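The slides do not give concrete array dimensions, so the following is a minimal sketch, assuming an illustrative geometry (64 B blocks, 128 sets, four 16 B sublines per block, none of which are taken from the paper), of how an address is decomposed so that only the subbank holding the requested subline needs to be activated.

```python
# Sketch: address decomposition for a subbanked cache (illustrative only).
# Assumed geometry (not from the paper): 64 B blocks, 128 sets,
# data array split into 4 subbanks of 16 B sublines each.

BLOCK_BYTES   = 64
NUM_SETS      = 128
SUBLINE_BYTES = 16
NUM_SUBBANKS  = BLOCK_BYTES // SUBLINE_BYTES   # 4

def decode(addr: int):
    """Return (set index, subbank select) for a byte address."""
    block_offset = addr % BLOCK_BYTES
    set_index    = (addr // BLOCK_BYTES) % NUM_SETS
    subbank      = block_offset // SUBLINE_BYTES  # only this subbank is activated
    return set_index, subbank

if __name__ == "__main__":
    for addr in (0x0000, 0x0010, 0x1234):
        print(hex(addr), decode(addr))
```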
Bit-sliced ALU
• Originally proposed by Hsu et al. 1985
• Slices the addition operation
  • e.g., one 32-bit addition becomes four 8-bit additions (see the sketch below)
• Avoids waiting for the full-width addition
  • Bypasses partial operand results
• Successfully implemented in the Pentium 4 staggered adder
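As a minimal sketch of the slicing idea (not the Pentium 4 circuit), the function below splits a 32-bit addition into four 8-bit slice additions; each slice's partial sum becomes available before the full-width result, which is what allows partial results to be bypassed early.

```python
# Sketch: bit-sliced 32-bit addition as four 8-bit slice additions.
# Each slice produces its 8-bit partial sum plus a carry-out; the low
# slice's result can be bypassed before the full 32-bit sum is ready.

def bitsliced_add(a: int, b: int, slices: int = 4, width: int = 8):
    mask, carry, partials = (1 << width) - 1, 0, []
    for i in range(slices):
        a_sl = (a >> (i * width)) & mask
        b_sl = (b >> (i * width)) & mask
        s = a_sl + b_sl + carry
        partials.append(s & mask)       # partial result, available after slice i
        carry = s >> width              # carry into the next slice
    total = sum(p << (i * width) for i, p in enumerate(partials))
    return partials, total

if __name__ == "__main__":
    parts, total = bitsliced_add(0x12345678, 0x0000FFFF)
    assert total == (0x12345678 + 0x0000FFFF) & 0xFFFFFFFF
    print([hex(p) for p in parts], hex(total))
```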
Outline
• Motivation
• Prior Work
• Bit-sliced Cache
• Experimental Results
• Conclusion
Power Consumption in Cache
• Row decoding consumes up to 40% of active power
Bit-sliced Cache
• Extends the cache subbanking technique
• Saves decoding power
  • Enables only the row decoders that are actually accessed
  • Serializes subarray decoding with row decoding
  • Uses low-order index bits to select the row decoder (see the sketch below)
• Minimal changes to the subbanking technique
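A minimal sketch of the serialized decode, assuming four subarrays of 32 rows each (numbers chosen for illustration, not taken from the paper): the low-order index bits are decoded first to pick the subarray, and only that subarray's row decoder is then enabled to decode the remaining index bits.

```python
# Sketch: serialized subarray/row decoding (illustrative assumption, not the
# exact circuit). Low-order index bits pick which subarray's row decoder is
# enabled; only that decoder then decodes the remaining index bits.

NUM_SUBARRAYS     = 4            # assumed
ROWS_PER_SUBARRAY = 32           # assumed
SUBARRAY_BITS     = NUM_SUBARRAYS.bit_length() - 1   # 2

def serialized_decode(set_index: int):
    subarray = set_index & (NUM_SUBARRAYS - 1)                    # step 1: subarray decode
    row      = (set_index >> SUBARRAY_BITS) % ROWS_PER_SUBARRAY   # step 2: row decode
    enabled  = [i == subarray for i in range(NUM_SUBARRAYS)]      # other decoders stay idle
    return subarray, row, enabled

if __name__ == "__main__":
    print(serialized_decode(0b0010110))   # -> subarray 2, row 5
```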
Pipelining the Cache Access
• Cache access time increases because subarray decoding is serialized with row decoding
• Pipeline the access to hide the extra delay
  • Need to balance the latency of each stage
  • Choose the operations for each stage carefully
• Provides more throughput
  • Same throughput as a conventional cache with n ports
Pipelined Cache Access Steps
• Cycle 1: Start subarray decoding for data and tag
• Cycle 2: Activate the necessary row decoders; read the tag array while waiting
• Cycle 3: Read the data array; concurrently, do a partial tag comparison
• Cycle 4: Compare the remaining tag bits; use the tag comparison result to select the data
(see the schedule sketch below)
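The following sketch simply encodes the four-cycle sequence above as pipeline stages and prints which stage each in-flight access occupies per clock, to illustrate how back-to-back accesses overlap; the stage names are shorthand, not taken from the paper.

```python
# Sketch of the four-cycle access pipeline described above, expressed as a
# stage list so overlapped accesses can be visualized. Names are illustrative.

STAGES = [
    ("C1", "subarray decode (data + tag)"),
    ("C2", "enable selected row decoders; read tag array"),
    ("C3", "read data array; partial tag compare"),
    ("C4", "compare remaining tag bits; select data"),
]

def schedule(num_accesses: int):
    """Print which stage each in-flight access occupies in every clock cycle."""
    for clock in range(num_accesses + len(STAGES) - 1):
        active = [f"acc{a}:{STAGES[clock - a][0]}"
                  for a in range(num_accesses)
                  if 0 <= clock - a < len(STAGES)]
        print(f"clock {clock}: " + ", ".join(active))

if __name__ == "__main__":
    schedule(3)   # three back-to-back accesses, one new access starting per cycle
```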
Bit-sliced Cache + Bit-sliced ALU
• Optimal performance benefit
  • Cache access starts sooner: as soon as the first address slice is available
• Limits the number of subarrays
  • Bounded by the number of bits per slice (see the worked example below)
  • When the bit slice is too small, the optimal power saving cannot be achieved
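A worked example of the subarray limit, assuming 8-bit ALU slices and 64 B blocks (both numbers chosen for illustration): the first slice carries only 8 - 6 = 2 usable index bits, so at most 2^2 = 4 subarrays can be distinguished from the earliest-arriving slice.

```python
# Worked example of the subarray-count limit mentioned above (assumed numbers).
SLICE_BITS  = 8     # assumed bits per ALU slice
BLOCK_BYTES = 64    # 6 block-offset bits

offset_bits = BLOCK_BYTES.bit_length() - 1
index_bits_in_first_slice = max(0, SLICE_BITS - offset_bits)
max_subarrays = 2 ** index_bits_in_first_slice
print(index_bits_in_first_slice, max_subarrays)   # -> 2, 4
```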
Pipelining with Bit-sliced Cache
[Figure: execution timing of an example instruction sequence (lw R1, 0(R3); lw R4, 4(R3); addi R3, R3, 4; add R3, R2, R1) under three configurations: a pipelined execution stage with a pipelined cache, a bit-sliced execution stage with a pipelined cache, and a bit-sliced execution stage with a bit-sliced cache]
Cache Model Simulation
• Estimates energy consumption and cache latency
• Uses a modified version of CACTI 3.0
  • Parameters: Ntbl, Ndbl, Ntwl, Ndwl
  • Enumerates all possible configurations
  • Chooses the one with the best weighted value of cycle time and energy consumption (see the sketch below)
• Simulates:
  • Various cache sizes (8K-512K), 64 B blocks
  • DM, 2-way, 4-way, and 8-way
• Uses 0.18 µm technology
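A minimal sketch of the configuration search, with a placeholder cost function standing in for the modified CACTI model (the weights and formulas below are assumptions, not the paper's): enumerate the Ndwl/Ndbl/Ntwl/Ntbl splits and keep the configuration with the best weighted combination of cycle time and energy.

```python
# Sketch of the configuration search described above: enumerate array-split
# parameters and keep the configuration with the best weighted combination of
# cycle time and energy. The cost model below is a placeholder, not CACTI.
import itertools

def placeholder_cost(ndwl, ndbl, ntwl, ntbl):
    # Stand-in numbers; a real flow would call a CACTI-like model here.
    cycle_time = 1.0 / (ndwl * ndbl) + 0.05 * (ntwl + ntbl)
    energy     = 0.2 * (ndwl + ndbl) + 0.1 * (ntwl * ntbl)
    return cycle_time, energy

def best_config(weight_time=0.5, weight_energy=0.5):
    best = None
    for ndwl, ndbl, ntwl, ntbl in itertools.product([1, 2, 4, 8], repeat=4):
        t, e = placeholder_cost(ndwl, ndbl, ntwl, ntbl)
        score = weight_time * t + weight_energy * e
        if best is None or score < best[0]:
            best = (score, (ndwl, ndbl, ntwl, ntbl), t, e)
    return best

if __name__ == "__main__":
    print(best_config())
```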
Processor Simulation
• Estimates the performance benefit
• Uses a heavily modified SimpleScalar 3.0
  • Supports a bit-sliced execution stage
  • Supports speculative slice execution
• Benchmarks
  • Eight SPEC2000 integer benchmarks
  • Full reference input set
  • Fast-forward 500M instructions, simulate 100M
Machine Configuration
• 4-wide fetch, issue, commit
• 128-entry ROB
• 32-entry scheduler
• 20-stage pipeline
• 64K-entry gshare branch predictor
• L1 I-Cache: 32KB, 2-way, 64B blocks
• L1 D-Cache: 8KB, 4-way, 64B blocks
• L2 Cache: 512KB, 8-way, 128B blocks
Conclusion
• Bit-sliced cache
  • Achieves significant power reduction without adding much complexity
  • Adds some delay to the access latency
• Pipelined bit-sliced cache
  • Reduces cycle time
  • Provides more bandwidth
  • Yields measurable speedup (with a bit-sliced ALU)
Questions? Thank you.