This research focuses on developing efficient partitioned data caches managed by compilers to reduce memory power consumption in computer systems. The study explores the advantages and disadvantages of using partitioned cache architectures, including trade-offs and optimizations for handling data accesses. The approach combines hardware assistance with compiler control to achieve proactive management and optimize cache performance. Experimental setups involve cache configurations and software simulations to analyze the impact on energy consumption and cache access efficiency.
Compiler Managed Partitioned Data Caches for Low Power
Rajiv Ravindran*, Michael Chu, and Scott Mahlke
Advanced Computer Architecture Lab
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor
* Currently with the Java, Compilers, and Tools Lab, Hewlett-Packard, Cupertino, California
Introduction: Memory Power
• On-chip memories are a major contributor to system energy
• Data caches: ~16% of power in the StrongARM [Unsal et al., '01]
Hardware techniques (banking, dynamic voltage/frequency scaling, dynamic resizing):
• + Transparent to the user
• + Handle arbitrary instruction/data accesses
• – Limited program information
• – Reactive
Software techniques (software-controlled scratch-pads, data/code reorganization):
• + Whole-program information
• + Proactive
• – No dynamic adaptability
• – Conservative
Reducing Data Memory Power: Compiler Managed, Hardware Assisted
Combining the two approaches keeps the strengths of each:
• Global program knowledge
• Proactive optimizations
• Dynamic adaptability
• Efficient execution
• Aggressive software optimizations
Data Caches: Tradeoffs
Advantages:
• + Capture spatial/temporal locality
• + Transparent to the programmer
• + More general than software scratch-pads
• + Efficient lookups
Disadvantages:
• – Fixed replacement policy
• – Set index ignores program locality
• – Set-associativity has high overhead: multiple data/tag arrays are activated per access
Traditional Cache Architecture
[Figure: 4-way set-associative cache — the address is split into tag/set/offset fields; all four tag/data/LRU arrays are probed in parallel (=? comparators) and the hit is selected through a 4:1 mux]
• Lookup: activate all ways on every access
• Replacement: choose among all the ways
Partitioned Cache Architecture
[Figure: 4-partition cache (P0–P3) — each load/store carries an address, a k-bit partition vector, and an R/U bit in addition to the usual tag/set/offset lookup and 4:1 mux]
• Lookup: restricted to the partitions specified in the bit-vector if the R bit is set, else defaults to all partitions
• Replacement: restricted to the partitions specified in the bit-vector
• Advantages (see the lookup sketch below):
  • Improve performance by controlling replacement
  • Reduce cache access power by restricting the number of partitions probed
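To make the lookup rule concrete, here is a minimal C sketch of a partitioned lookup. It is an illustration of the idea, not the paper's hardware: the layout constants and the names part_mask and r_bit are assumptions.

/* Minimal sketch of a partitioned-cache lookup, assuming a 4-partition
 * cache where each partition behaves like one direct-mapped way.
 * part_mask, r_bit, and the cache layout are illustrative only. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_PARTITIONS 4
#define NUM_SETS       64
#define BLOCK_BITS     5   /* 32-byte blocks */

typedef struct {
    bool     valid;
    uint32_t tag;
} Line;

static Line cache[NUM_PARTITIONS][NUM_SETS];

/* Returns true on a hit; probes only the partitions named in part_mask
 * when r_bit is set ('R'), otherwise all partitions (the 'U' case). */
bool lookup(uint32_t addr, uint8_t part_mask, bool r_bit, int *tag_checks)
{
    uint32_t set  = (addr >> BLOCK_BITS) % NUM_SETS;
    uint32_t tag  = addr >> BLOCK_BITS;
    uint8_t  mask = r_bit ? part_mask : (uint8_t)((1u << NUM_PARTITIONS) - 1);

    for (int p = 0; p < NUM_PARTITIONS; p++) {
        if (!(mask & (1u << p)))
            continue;               /* partition not probed: no tag check */
        (*tag_checks)++;            /* energy cost tracked per tag check */
        if (cache[p][set].valid && cache[p][set].tag == tag)
            return true;
    }
    return false;
}

The energy saving falls out directly: with the R bit set and a one-hot part_mask, each access performs one tag check instead of NUM_PARTITIONS.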
Partitioned Caches: Example

for (i = 0; i < N1; i++) {
  ...
  for (j = 0; j < N2; j++)
    y[i + j] += *w1++ + x[i + j]    /* ld1/st1 (y), ld5 (w1), ld3 (x) */
  for (k = 0; k < N3; k++)
    y[i + k] += *w2++ + x[i + k]    /* ld2/st2 (y), ld6 (w2), ld4 (x) */
}

[Figure: three-way partitioned cache — y's accesses (ld1, st1, ld2, st2) are pinned to way-0 with bit-vector [100], R; w1/w2's accesses (ld5, ld6) to way-1 with [010], R; x's accesses (ld3, ld4) to way-2 with [001], R]
• Reduces the number of tag checks per iteration from 12 to 4!
Compiler Controlled Data Partitioning
• Goal: place loads/stores into cache partitions
• Analyze the application's memory characteristics
  • Cache requirements: number of partitions per load/store
  • Predict conflicts
• Place loads/stores into different partitions
  • Satisfy each instruction's caching needs
  • Avoid conflicts; overlap instructions where possible
Cache Analysis: Estimating the Number of Partitions
• Find the minimal number of partitions that avoids conflict/capacity misses
• Probabilistic hit-rate estimate
• Use the working-set size to compute the number of partitions
[Figure: access traces of the j-loop and k-loop — the repeating patterns X W1 Y Y and X W2 Y Y show instruction M touching a single block B1 between its reuses, so M has working-set size = 1]
Cache Analysis: Estimating the Number of Partitions
• Avoid conflict/capacity misses for an instruction
• Estimate its hit rate from the reuse distance (D), the total number of cache blocks (B), and the associativity (A) [Brehob et al., '99]
[Figure: estimated hit rate for reuse distances D = 0, 1, 2 across cache sizes of 8, 16, 24, and 32 blocks and associativities 1–4]
• In reality, compute energy matrices and pick the most energy-efficient configuration per instruction (a sketch of the hit-rate estimate follows)
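One simple way to realize such an estimate — a sketch in the spirit of Brehob et al., not their exact model — is a binomial approximation: each of the D distinct blocks touched between two uses of a line lands in that line's set with probability A/B, and the access hits if fewer than A of them do.

/* Sketch of a probabilistic hit-rate estimate from reuse distance D,
 * cache blocks B, and associativity A. The binomial model below is an
 * assumption, not the exact formulation of Brehob et al. '99: each of
 * the D intervening blocks falls in the reused line's set with
 * probability A/B (there are B/A sets), and the access hits if fewer
 * than A of them do (true LRU within the set). */
#include <math.h>
#include <stdio.h>

static double binom(int n, int k) {
    double c = 1.0;
    for (int i = 0; i < k; i++)
        c = c * (n - i) / (i + 1);
    return c;
}

/* P(hit | reuse distance D) for a B-block, A-way set-associative cache. */
double hit_rate(int D, int B, int A) {
    double p   = (double)A / B;   /* chance an intervening block shares the set */
    double hit = 0.0;
    for (int k = 0; k < A; k++)   /* hit iff < A conflicting blocks arrived */
        hit += binom(D, k) * pow(p, k) * pow(1.0 - p, D - k);
    return hit;
}

int main(void) {
    /* e.g., estimate hit rates for D = 0..2 in a 32-block, 4-way cache */
    for (int D = 0; D <= 2; D++)
        printf("D=%d -> P(hit) ~ %.3f\n", D, hit_rate(D, 32, 4));
    return 0;
}

Sweeping B and A with this function per instruction gives exactly the kind of per-configuration table the slide describes, which can then be weighted by per-access energy to pick a configuration.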
Cache Analysis: Computing Interferences
• Avoid conflicts among temporally co-located references
• Model conflicts using an interference graph (a construction sketch follows)
[Figure: the same traces with accesses labeled by instruction — M1 (D = 1), M2 (D = 1), M3 (D = 1), and M4 (D = 1); edges connect references that are active over the same trace interval]
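A minimal C sketch of building such an interference graph, assuming each reference's activity is summarized as a trace interval; the windows for M1–M4 below are made up to reproduce the slide's example (M2 and M3 never overlap).

/* Sketch of interference-graph construction over memory instructions.
 * Two references interfere when their access windows (the trace interval
 * in which each is active) overlap. The window representation and the
 * M1..M4 values are illustrative, not the paper's exact analysis. */
#include <stdbool.h>
#include <stdio.h>

#define N_REFS 4

typedef struct { int start, end; } Window;  /* active interval in the trace */

static bool overlaps(Window a, Window b) {
    return a.start <= b.end && b.start <= a.end;
}

int main(void) {
    /* hypothetical windows for M1..M4 inside one outer-loop iteration */
    Window w[N_REFS] = { {0, 3}, {0, 1}, {2, 3}, {0, 3} };
    bool interferes[N_REFS][N_REFS] = { false };

    for (int i = 0; i < N_REFS; i++)
        for (int j = i + 1; j < N_REFS; j++)
            if (overlaps(w[i], w[j]))
                interferes[i][j] = interferes[j][i] = true;

    /* prints the slide's edges: M1-M2, M1-M3, M1-M4, M2-M4, M3-M4 */
    for (int i = 0; i < N_REFS; i++)
        for (int j = i + 1; j < N_REFS; j++)
            if (interferes[i][j])
                printf("M%d -- M%d\n", i + 1, j + 1);
    return 0;
}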
Partition Assignment
• The placement phase can overlap references in the same partitions
• Compute their combined working-set using the graph-theoretic notion of a clique
• For each clique, the new reuse distance D = Σ D over its nodes
• The combined D over all overlaps = max over all cliques
[Figure: interference graph over M1–M4, each with D = 1 — Clique 1: M1, M2, M4, new reuse distance D = 3; Clique 2: M1, M3, M4, new D = 3; combined reuse distance = max(3, 3) = 3]
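The combined reuse-distance computation itself is small; the C sketch below applies the sum-then-max rule to the slide's two cliques (clique enumeration is omitted, and the clique lists are taken from the example, not computed).

/* Sketch of the combined reuse-distance computation over cliques of the
 * interference graph. The two cliques are the slide's example (M1,M2,M4
 * and M1,M3,M4, each node with D = 1). */
#include <stdio.h>

int main(void) {
    int D[5] = { 0, 1, 1, 1, 1 };           /* D[i] = reuse distance of Mi */
    int cliques[2][3] = { {1, 2, 4}, {1, 3, 4} };

    int combined = 0;
    for (int c = 0; c < 2; c++) {
        int sum = 0;                        /* new D for this clique: sum  */
        for (int k = 0; k < 3; k++)         /* of the member distances     */
            sum += D[cliques[c][k]];
        if (sum > combined)                 /* combined D: max over cliques */
            combined = sum;
        printf("clique %d: new D = %d\n", c + 1, sum);
    }
    printf("combined reuse distance = %d\n", combined);  /* prints 3 */
    return 0;
}

The combined D then feeds back into the hit-rate estimate above to check that the overlapped references still meet their caching needs.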
Experimental Setup
• Trimaran compiler and simulator infrastructure
• ARM9 processor model
• Cache configurations:
  • 1 KB to 32 KB, 32-byte block size
  • 2, 4, 8 partitions vs. 2-, 4-, 8-way set-associative caches
• Mediabench suite
• CACTI for cache energy modeling
Reduction in Tag & Data-Array Checks
[Figure: average way accesses per cache access vs. cache size (1 KB–32 KB) for 2-, 4-, and 8-partition caches]
• 36% reduction on an 8-partition cache
Improvement in Fetch Energy
[Figure: percentage fetch-energy improvement on a 16 KB cache — 2-part vs. 2-way, 4-part vs. 4-way, and 8-part vs. 8-way — across the Mediabench programs (epic, cjpeg, djpeg, unepic, pegwitenc, pegwitdec, rawcaudio, rawdaudio, mpeg2dec, mpeg2enc, pgpencode, pgpdecode, gsmencode, gsmdecode, g721encode, g721decode) and their average]
Summary
• Maintains the advantages of a hardware cache
• Exposes placement and lookup decisions to the compiler
  • Avoids conflicts, eliminates redundant tag checks
• 24% energy savings for a 4 KB cache with 4 partitions
• Extensions: hybrid scratch-pads and caches
  • Disable selected tags to convert those partitions into scratch-pads
  • 35% additional savings on a 4 KB cache with 1 partition as a scratch-pad
Thank You & Questions
Cache Analysis Step 1: Instruction Fusioning
• Combine loads/stores that access the same set of objects
  • Avoids coherence problems and duplication
  • Uses points-to analysis (a fusion sketch follows)

for (i = 0; i < N1; i++) {
  ...
  for (j = 0; j < readInput1(); j++)
    y[i + j] += *w1++ + x[i + j]    /* ld1/st1, ld5, ld3 */
  for (k = 0; k < readInput2(); k++)
    y[i + k] += *w2++ + x[i + k]    /* ld2/st2, ld6, ld4 */
}

[Figure: fused groups — e.g., ld1/st1 fused with ld2/st2 into M1 (both access y), with the matching x and w accesses fused into further groups such as M2]
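A minimal sketch of the fusioning step, assuming the points-to sets are already available as bitmasks; the object encoding, instruction order, and group labels are illustrative, not the output of a real points-to analysis.

/* Sketch of instruction fusioning: loads/stores whose points-to sets
 * intersect are merged into one group via union-find, so every object
 * is cached through a single group of instructions. */
#include <stdio.h>

#define N_INSTRS 6
enum { OBJ_Y = 1, OBJ_W = 2, OBJ_X = 4 };   /* one bit per pointed-to object */

static int parent[N_INSTRS];

static int find(int i) { return parent[i] == i ? i : (parent[i] = find(parent[i])); }
static void unite(int a, int b) { parent[find(a)] = find(b); }

int main(void) {
    /* ld1/st1 and ld2/st2 point to y; ld3, ld4 to x; ld5, ld6 to w1/w2 */
    int pts[N_INSTRS] = { OBJ_Y, OBJ_Y, OBJ_X, OBJ_X, OBJ_W, OBJ_W };

    for (int i = 0; i < N_INSTRS; i++) parent[i] = i;
    for (int i = 0; i < N_INSTRS; i++)
        for (int j = i + 1; j < N_INSTRS; j++)
            if (pts[i] & pts[j])            /* shared object => must fuse */
                unite(i, j);

    /* assign dense group labels M1, M2, ... to the representatives */
    int label[N_INSTRS], next = 0;
    for (int i = 0; i < N_INSTRS; i++) label[i] = -1;
    for (int i = 0; i < N_INSTRS; i++) {
        int r = find(i);
        if (label[r] < 0) label[r] = ++next;
        printf("instr %d -> group M%d\n", i + 1, label[r]);
    }
    return 0;
}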
Partition Assignment
• Greedily place instructions based on their cache estimates
• Overlap instructions if required
  • Compute the number of partitions for the overlapped instructions
  • Enumerate cliques within the interference graph
  • Compute the combined working-set over all cliques
• Assign the R/U bit to control lookup (a greedy-placement sketch follows)
[Figure: the interference graph from the running example — M1–M4, each with D = 1, forming Clique 1 (M1, M2, M4) and Clique 2 (M1, M3, M4)]
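One way to picture the greedy placement is as graph coloring over fused groups: each group takes the first partition not used by an interfering neighbor, overlapping on the last partition when none is free. This C sketch uses that coloring view under made-up conflict data; it is not the paper's exact algorithm.

/* Sketch of greedy partition assignment. Groups are visited in order;
 * each avoids partitions held by already-placed interfering groups and
 * falls back to overlapping when all partitions are taken. The conflict
 * matrix (y, w, x all pairwise live together) is illustrative. */
#include <stdbool.h>
#include <stdio.h>

#define N_GROUPS 3
#define N_PARTS  4

int main(void) {
    bool conflict[N_GROUPS][N_GROUPS] = {
        { false, true,  true  },
        { true,  false, true  },
        { true,  true,  false },
    };
    int assigned[N_GROUPS];

    for (int g = 0; g < N_GROUPS; g++) {
        bool used[N_PARTS] = { false };
        for (int h = 0; h < g; h++)
            if (conflict[g][h])
                used[assigned[h]] = true;        /* avoid neighbors' partitions */
        int p = 0;
        while (p < N_PARTS - 1 && used[p]) p++;  /* overlap on the last one
                                                    if everything is taken */
        assigned[g] = p;
        printf("group %d -> partition %d (bit-vector 0x%x, R)\n",
               g + 1, p, 1 << p);
    }
    return 0;
}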
Related Work
• Direct-addressed and cool caches [Unsal '01, Asanovic '01]
  • Tags maintained in registers that are addressed within loads/stores
• Split temporal/spatial cache [Rivers '96]
  • Hardware-managed, two partitions
• Column partitioning [Devadas '00]
  • Individual ways can be configured as a scratch-pad, but there is no load/store-based partitioning
• Region-based caching [Tyson '02]
  • Partitions by region (heap, stack, globals); our approach offers finer-grained control and management
• Pseudo set-associative caches [Calder '96, Inoue '99, Albonesi '99]
  • Reduce tag-check power but compromise cycle time; orthogonal to our technique
Code Size Overhead
[Figure: percentage of added instructions (annotated loads/stores and extra MOV instructions) per Mediabench program and on average; the largest overheads are around 15–16%]