
High-level transformations for efficient data cache utilization

Cedric Ghez & Tycho van Meeuwen. ACTMA seminar, 22/05/2002.


Presentation Transcript


  1. High-level transformations for efficient data cache utilization. Cedric Ghez & Tycho van Meeuwen. ACTMA seminar, 22/05/2002.

  2. Situation in the DTSE methodology. Input: from MHLA (Memory Hierarchy Layer Assignment):
  • in-place optimisation for memory size
  • data layout for cache
  DTSE flow: C-in → Analysis/Pruning → Dataflow transformations → Loop/control-flow transformations → Data reuse → Storage cycle budget distribution → Memory allocation and assignment → Data layout optimisation → C-out

  3. Frameworks for the measurement of the different cache miss types
  • A miss type:
    • is related to a specific cache non-ideality
    • is cumulative with the previous types
  • Simulations with an ideal cache for compulsory misses:
    • infinite size
    • optimal replacement policy
    • fully associative
    • line size = 1 data element, and fetch-on-write disabled
  • Simulations done using the Trimaran environment

  4. Different simulations for different miss types
  • Ideal cache → compulsory misses
  • + finite size → minimal capacity misses
  • + block misses for block-level transfer (*)
  • + non-fully-associative + optimal layout → associativity misses
  • + LRU → replacement misses
  • + non-optimal layout → data-layout conflict misses
  (*) The block misses are temporarily ignored by setting the cache line size to 1 data element and by disabling the simulator's fetch-on-write policy (to avoid unnecessary write misses).

  5. Minimal capacity miss reduction: Memory Hierarchy Layer Assignment (MHLA)
  [Figure: main memory holding top-level array A, with copy candidates A' and A'' for the data cache (which copies? which size? by-pass?), feeding the data path.]
  Trade-off for the copy (reuse) size: too small = many misses; too big = high energy per access. Power vs. area vs. time.

  6. The cost function needs both the size of and the number of accesses to the intermediate array.
  Estimate the number of misses from the different levels for one iteration of i:
      for (i=0; i<10; i++)
        for (j=0; j<2; j++)
          for (k=0; k<3; k++)
            for (l=0; l<3; l++)
              for (m=0; m<5; m++)
                ... = A[i*15+k*5+m];
  Without a copy: 2*3*3*5 = 90 misses; copy A' refilled per j-iteration: 2*3*5 = 30 misses; copy A' kept for the whole i-iteration: 3*5 = 15 misses. Reserved space in the cache for this array: 15 elements.
  [Plot: estimated number of misses vs. number of elements reserved for A', for the accesses R1(A).]
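The miss counts on this slide can be reproduced with a small brute-force sketch (a hypothetical helper, not part of the MHLA tool itself): replay the loop nest and count first-touches of A's elements, resetting the "already in the copy" state at the loop level where the copy candidate A' is refilled.

```c
#include <assert.h>
#include <string.h>

#define N_ADDR 150  /* A holds 10*15 elements in the slide's example */

/* Misses for ONE iteration of i. refill_depth selects where the copy
 * candidate A' is (re)loaded: 0 = once per i-iteration, 1 = once per
 * j-iteration, -1 = no copy at all (every access goes to main memory). */
static int misses_one_i(int refill_depth) {
    char in_copy[N_ADDR];
    int misses = 0, i = 0;
    memset(in_copy, 0, sizeof in_copy);
    for (int j = 0; j < 2; j++) {
        if (refill_depth == 1) memset(in_copy, 0, sizeof in_copy);
        for (int k = 0; k < 3; k++)
            for (int l = 0; l < 3; l++)
                for (int m = 0; m < 5; m++) {
                    int a = i * 15 + k * 5 + m;  /* address of A[i*15+k*5+m] */
                    if (refill_depth < 0 || !in_copy[a]) {
                        misses++;       /* first touch since the last refill */
                        in_copy[a] = 1;
                    }
                }
    }
    return misses;
}
```

This reproduces the slide's numbers: 90 misses with no copy, 30 when the 15-element A' is refilled for every j-iteration, and 15 when A' lives across the whole i-iteration.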

  7. Intra-array in-place optimization: local analysis of lifetimes
  Declared size = maximum number of elements ALIVE AT THE SAME TIME.
  [Figure: index vs. time occupation charts, before and after in-place optimization.]
  This storage-size estimation can be done on:
  • copy candidates
  • top-level arrays
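A minimal sketch of this local lifetime analysis, assuming each element's lifetime is given as a discrete [birth, death] interval (DTSE itself uses a more powerful symbolic analysis; the interval representation here is a simplifying assumption):

```c
#include <assert.h>

/* Declared size = maximum number of elements alive at the same time.
 * Element e is alive during the time steps [birth[e], death[e]]. */
static int declared_size(const int *birth, const int *death, int n) {
    int best = 0, t_max = 0;
    for (int e = 0; e < n; e++)
        if (death[e] > t_max) t_max = death[e];
    for (int t = 0; t <= t_max; t++) {   /* sweep over all time steps */
        int alive = 0;
        for (int e = 0; e < n; e++)
            if (birth[e] <= t && t <= death[e]) alive++;
        if (alive > best) best = alive;
    }
    return best;
}
```

With a sliding window of lifetimes (element e alive during [e, e+1]), only two elements ever overlap, so two storage locations suffice even for many elements.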

  8. Very simplistic power and area estimation for different data-reuse versions
  [Figure: four data-reuse versions of top-level array A (150 elements, 90 accesses R1(A)), with copy candidates A' (5 or 15 elements) and A'' (5 elements) and the resulting per-layer access counts; estimated energy plotted against the number of elements assigned to Layer 1/Layer 2, and the estimated area compared across the versions.]

  9. Even more freedom in the assignment to layers
  [Figure: arrays A and B with copies A', A'' and B', B'', B''' assigned across Layers 1-3, with the accesses R1(A) and R1(B) served from different layers in the two alternatives.]

  10. Trading off the individual capacity misses to reach the global capacity-miss minimum
  [Figure: misses vs. copy size for arrays A and B sharing the layer (cache) size; example splits give 3, 2.75 and 5.25 misses.]
  • Usually the number of misses for all cache allocations is then "equalized"
  • The factor "misses/copy-size" is important (a good steering heuristic)
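The global trade-off can be sketched as follows, assuming the per-array miss curves (misses as a function of the copy size given to that array) have already been computed; the curves below are hypothetical example data, not taken from the drivers.

```c
#include <assert.h>

/* Given two per-array miss curves indexed by copy size, find the split
 * of the shared layer (cache) size that minimizes the GLOBAL miss
 * count, exactly the trade-off pictured on the slide. */
static int best_split(const int *missA, const int *missB,
                      int cache_size, int *size_for_A) {
    int best = -1;
    for (int a = 0; a <= cache_size; a++) {        /* a elements to A */
        int total = missA[a] + missB[cache_size - a];
        if (best < 0 || total < best) {
            best = total;
            *size_for_A = a;
        }
    }
    return best;
}
```

On monotonically decreasing curves the optimum tends to land where the marginal misses/copy-size ratios of the two arrays are balanced, which is why that ratio is a good steering heuristic.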

  11. Inter-array in-place data mapping: global analysis of lifetimes
  [Figure: occupied memory locations over time, before and after the mapping; arrays with non-overlapping lifetimes share the same memory: A || C and B || D.]
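The compatibility test behind this global analysis reduces to an interval-overlap check; a minimal sketch, assuming whole-array lifetimes are given as [birth, death] intervals:

```c
#include <assert.h>

/* Inter-array in-place: two arrays may share the same memory range
 * iff their lifetimes do not overlap (A || C and B || D on the slide). */
static int can_share(int birth_a, int death_a,
                     int birth_b, int death_b) {
    return death_a < birth_b || death_b < birth_a;
}
```

A sharing pair like A || C has A dead before C is first written (or vice versa); arrays whose lifetimes overlap must keep disjoint memory ranges.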

  12. Experimental framework
  • The combined approach MHLA + data-layout optimisation has been applied on 2 drivers:
    • Digital Audio Broadcasting, for an implementation on a TriMedia platform; result: a miss-rate reduction of a factor 2
    • QSDPCM driver (video compression)

  13. There is a trade-off for the layer size for QSDPCM. Main memory size is 1 MB, E/acc = 0.107.

  14. QSDPCM has many possible layer assignments. Which solutions span the Pareto trade-off for power, area and timing?
  [Figure: two candidate assignments of the top-level arrays (prev_frame, frame, prev_frame_sub2, prev_frame_sub4, frame_sub2, frame_sub4, m2, n2, t2, m4, n4, t4, QC_t2, QC_t4, QC_m2, mean2) to main memory, with copy candidates in L1 (size = 1024, E = 47.28) and registers next to the µP data path / functional units.]

  15. MHLA doesn't take the layout into account
  • MHLA assumes that the problems related to the layout can be solved
  • The measured minimal capacity misses are very close to the MHLA estimation
  • But once the layout is considered, conflict misses are present
  • Removing those misses is the focus of the next part (Tycho)
  [Plot: energy and miss rate vs. cache size (64-16384), measured with a direct-mapped cache; split into cache hits, minimal capacity misses (estimated by MHLA and obtained by simulation) and layout-related conflict misses.]

  16. Cache fundamentals: the mapping function
  • Fully associative: cache location = flexible (any memory location can be placed anywhere in the cache)
  • Direct mapped: cache location = fixed
  Mapping function: cache_location = memory_location MOD cache_size

  17. Data-layout conflict misses
  [Figure: elements in main memory selected to be stored in the cache; with the fixed mapping, the cache is partly overfull and partly empty.]
  Conflict situation: enough space, but not all elements get their own location. Solvable by arranging the placement in memory (= data-layout conflict-miss removal).

  18. Concept of tiling: interleaving arrays in memory
      a[i] = b[i-1] + c[i];
  becomes
      mem[i%3 + i/3*8 + 0] = mem[(i-1)%2 + (i-1)/2*8 + 3] + mem[i%3 + i/3*8 + 5];
  Inter-array conflicts are removed because elements from different arrays are mapped to different cache locations: each array gets a cache "tile" where only its elements are mapped, and a corresponding memory "tile" where only its elements are stored.
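The three rewritten address expressions are instances of one tiling formula; a sketch that generalizes them (the function name is illustrative, not from the source):

```c
#include <assert.h>

/* Tiled (interleaved) address of element i of an array whose tile of
 * `tile` elements starts at offset `base` inside every memory period
 * of `period` (= cache size) elements. The slide's a[i] uses
 * (tile=3, base=0), b[i] uses (tile=2, base=3), c[i] uses (tile=3,
 * base=5), all with period 8. */
static int tiled_addr(int i, int tile, int base, int period) {
    return i % tile + (i / tile) * period + base;
}
```

Because every address satisfies tiled_addr(...) % period in [base, base + tile), each array stays inside its own cache tile: with period 8, array a occupies cache locations 0-2, b occupies 3-4 and c occupies 5-7, so they can never evict each other.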

  19. Data-layout optimization approach
  • Initial tile size determined by MHLA (fill the cache by selecting a combination of copy candidates = trade off the tile sizes of the different arrays)
  • Fine-tuning to optimize the data-layout conflict misses
  • Intra-array-conflict optimization: study the arrays one by one; one array mapped to a part of the cache with a flexible size (the tile size)

  20. Intra-array conflict misses: when are they present?
  • Signal size equals tile size: no conflicts
  • Signal size bigger than tile size AND the 'working set' scattered around the array: possibly conflicts
  • 'Working set' is contiguous: no conflicts
  [Figure: array in memory mapped through the mapping function onto its cache tile, for each of the three cases.]

  21. An analyzable situation: blocks at equal distances in memory mapping to the cache
  • Typical image-processing situation: 2-D signal, with a smaller block frequently used
  • Storage order of the image is row/column major → repeated groups at equal distances M in (linear) memory, which can conflict in a tile of size TS

  22. How to influence the mapping (1): slight increase of the image dimension (padding)
  [Figure: image of width M padded to M'; the rows of the working set now map to different cache locations within the tile. Is there still a conflict?]
  Cache-address = Memory-address MOD TS

  23. How to influence the mapping (2): slight increase/decrease of the tile size (TS → TS')
  [Figure: array in memory mapped onto its cache tile for tile sizes TS and TS'; the conflict disappears for a suitable TS'.]
  Cache-address = Memory-address MOD TS

  24. Predict a good/bad mapping: elements in memory at distances M mapped to a tile of size t
  • Jumps in memory of M → ensure that 0*M mod t, 1*M mod t, 2*M mod t, ... covers the cache tile completely
  • Example: tile size t = 4, distance M = 3, GCD(3, 4) = 1 → the tile is covered
  • Coverage is ensured iff M and t are co-prime → GCD(M, t) = 1
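The co-primality criterion can be checked in a few lines, and verified against a brute-force walk through the tile (the helper names are illustrative):

```c
#include <assert.h>

static int gcd(int a, int b) {
    while (b) { int t = a % b; a = b; b = t; }
    return a;
}

/* Jumps of M through memory cover all t locations of the cache tile
 * iff M and t are co-prime (the slide's criterion). */
static int covers_tile(int M, int t) {
    return gcd(M, t) == 1;
}

/* Brute-force check of the same claim: visit k*M mod t for
 * k = 0..t-1 and test whether every location was hit (t <= 31). */
static int covers_tile_brute(int M, int t) {
    int hit = 0;
    for (int k = 0; k < t; k++)
        hit |= 1 << ((k * M) % t);
    return hit == (1 << t) - 1;
}
```

For the slide's example, M = 3 and t = 4 give the visiting order 0, 3, 2, 1 and full coverage, while M = 4 and t = 8 keep bouncing between locations 0 and 4.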

  25. Optimization
  • Change M' or TS' to find an optimal situation with GCD(M, t) = 1
  • Pareto trade-off: extra memory usage (padding) vs. extra cache space (tiling)
  [Plot: memory overhead (padding, 0-12) vs. cache overhead (Δtile-size, 5-12), with the optimal Pareto points marked.]
  • Maximal padding: when the optimal situation is found with the minimum tile size
  • Maximal ΔTS: when the optimal situation is found with minimum padding
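One end of this Pareto trade-off (all overhead taken as padding, none as extra tile size) can be sketched as a simple search; the symmetric search over ΔTS would look the same with the roles of M and TS swapped. This is an illustrative reconstruction of the slide's search, not the actual tool:

```c
#include <assert.h>

static int gcd_u(int a, int b) {
    while (b) { int t = a % b; a = b; b = t; }
    return a;
}

/* Smallest padding p such that the padded distance M + p is co-prime
 * with the tile size TS, so that jumps of M + p cover the whole tile. */
static int min_padding(int M, int TS) {
    int p = 0;
    while (gcd_u(M + p, TS) != 1)
        p++;
    return p;
}
```

For a power-of-two tile size, any odd padded distance already works, so the required padding is at most 1; awkward tile sizes can need more, which is exactly what the Pareto curve between padding and Δtile-size captures.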

  26. Consequences of changes in the tile size
  • The added capacity misses must stay below the removed conflict misses: Δcapacity miss < Δdata-layout conflict miss
  • Σ tile-sizes of all simultaneously alive arrays <= cache size
  [Plot: Δdata-layout conflict misses and Δcapacity misses vs. ΔTS around the selected TS.]

  27. Influences of padding / tiling on the memory overhead if more than one array is considered
  • Padding: padding array B leaves empty memory locations, but this is not necessarily overhead (another array can fill them)
  • Tiling: the repetitions in memory, one per cache-size period, leave empty memory locations; the memory overhead follows the proportion declared_size / tile_size and becomes extreme when that ratio is much larger than for the other arrays

  28. QSDPCM video compression contains a hierarchical motion estimation
  [Block diagram: the input image and the previous image are each sub-sampled twice; V4 is computed on the 4x-sub-sampled images, V2 on the 2x-sub-sampled ones, and V1 on the full-resolution difference; quad decision/construction, reconstruction and up-sampling produce the output bitstream.]

  29. QSDPCM: a possible layer assignment for L1 = 1K (optimal for power), on the Pareto trade-off for power, area and timing
  [Figure: top-level arrays in main memory (prev_frame and frame at 176x144, frame_sub2 and prev_frame_sub2 at 88x72, frame_sub4 and prev_frame_sub4 at 44x36, plus m2, n2, t2, m4, n4, t4, QC_t2, QC_t4, QC_m2, mean2) with their copy candidates (blocks such as 18x18, 16x18, 16x16, 12x12, 12x8, 12x4, 8x8, 4x4) in the 1K L1 cache and registers next to the µP; the signals with potential data-layout conflict misses are marked.]

  30. Optimization per array, illustrated on prev_frame: find the tile size / padding
  • Array size: 176 x 144; copy size: 18 x 16
  [Plot: number of intra-array conflict misses vs. tile size (250-330) for prev_frame in QSDPCM, comparing compulsory + minimal capacity misses against compulsory + minimal capacity + conflict misses for padding 0 (M = 176), padding 4 (M = 180) and padding 22 (M = 198).]

  31. Results: miss rates per miss type (direct-mapped cache) for the optimized QSDPCM driver
  Total number of accesses: 1,400,000
  [Bar chart: compulsory, minimal capacity, associativity conflict and data-layout conflict misses for four versions: initial (no data layout), only padding, directly using the MHLA tiles, and detailed tiling + padding ("optimal data-layout"); total misses drop from 194,910 via 64,112 and 50,688 to 35,340.]
  • Memory overhead: 43%
  • Miss-rate reduction: 56%
  • The remaining associativity conflicts are a challenge to remove, because of bypassing signals

  32. Further extensions: cache in-place / memory in-place related to the data layout
  Cache size = 20. In memory: A[100], B[100], C[100]. In cache: A'[10], B'[10], C'[10].
  [Figure: in-placeable pairs A'-B' and B'-C' in the cache, and the corresponding memory overhead when the tiled arrays are laid out in main memory.]

  33. Questions?
