
Software Methods to Increase Data Cache Performance



  1. Software Methods to Increase Data Cache Performance Presented by Philip Marshall

  2. Outline • Introduction • Example: Multiple Vector Additions • Example: Linked List • Example: Binary Tree • Conclusion

  3. Introduction • Cache hit time is critical to system performance • It often determines a processor’s clock period • Cache controllers must therefore be as simple as possible • The miss rate of a cache can be decreased if we know something about the access patterns • If software restructures code for better access patterns, or hints at how the cache can best be used, we can improve performance

  4. Introduction • Various methods can be used: • Loop Fusion – combine multiple loops that access the same elements • Array Merge – combine multiple arrays to increase spatial locality • Cache Prefetch – ask for values to be loaded into cache in advance • Cache Bypass – prevent certain accesses from allocating in the cache

  5. Vector Addition – Base Code

      #define SIZE_N 1024
      int a[SIZE_N], b[SIZE_N], c[SIZE_N];
      int s1[SIZE_N], s2[SIZE_N];

      for (int i = 0; i < SIZE_N; i++)
          s1[i] = a[i] + b[i];
      for (int i = 0; i < SIZE_N; i++)
          s2[i] = a[i] + c[i];

  6. Vector Addition – Base Code • Assume a perfect instruction cache • Ignore conflict data misses • Assume a cache line size of 4 words • Assume write miss penalties can be hidden • First loop: • a, b: 256 misses each (every 4th access misses) • Second loop: • c: 256 misses • a: another 256 misses, unless the cache is large enough to still hold all of a after the first loop • 1024 total misses

  7. Vector Addition – Loop Fusion

      #define SIZE_N 1024
      int a[SIZE_N], b[SIZE_N], c[SIZE_N], s1[SIZE_N], s2[SIZE_N];

      for (int i = 0; i < SIZE_N; i++) {
          s1[i] = a[i] + b[i];
          s2[i] = a[i] + c[i];
      }

  8. Vector Addition – Loop Fusion • a, b, c: 256 misses each • 768 total misses • Are there always loops that can be combined? (a counterexample is sketched below)
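The answer is not always. A minimal counterexample (an illustration, not from the slides), reusing the declarations from slide 7: if the second loop reads values that later iterations of the first loop produce, fusing the two changes the result.

      /* Fusion is illegal here: at iteration i the fused body would
       * read s1[i + 1], which the first loop has not computed yet. */
      for (int i = 0; i < SIZE_N; i++)
          s1[i] = a[i] + b[i];
      for (int i = 0; i < SIZE_N - 1; i++)
          s2[i] = s1[i + 1] + c[i];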

  9. Vector Addition – Array Merge

      #define SIZE_N 1024
      struct vectors_type {
          int a;
          int b;
          int c;
      };

      int s1[SIZE_N], s2[SIZE_N];
      vectors_type vectors[SIZE_N];

      for (int i = 0; i < SIZE_N; i++) {
          s1[i] = vectors[i].a + vectors[i].b;
          s2[i] = vectors[i].a + vectors[i].c;
      }

  10. Vector Addition – Array Merge • 3072 accesses, every 4th one misses • 768 misses • May not be a viable optimization in all cases: • If we have a large set of vectors and want to be able to add any two • Dynamic memory allocation • What if we only want to traverse one vector? (see the sketch below)
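A sketch of the last point (an illustration, not from the slides): with the merged layout, a loop that touches only component a still drags b and c through the cache.

      /* Assuming 4-byte ints and no struct padding, vectors[] spans
       * 768 four-word lines, so this loop misses about 768 times
       * (versus 256 misses for a standalone int a[SIZE_N]). */
      long sum = 0;
      for (int i = 0; i < SIZE_N; i++)
          sum += vectors[i].a;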

  11. Vector Addition – Prefetch • Speculatively load data into cache before we need it • Useful if we know which data we need far enough in advance • Assume prefetch is useful if we know the address 10 iterations in advance • Assume prefetch past end of array is non-faulting
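The slides treat prefetch() as a primitive. One concrete realization (an assumption here; the author does not name one) is the GCC/Clang builtin, which takes an address and is documented as non-faulting, matching the assumption above:

      /* Sketch: prefetch() expressed with the GCC/Clang builtin.
       * Arguments: address, read(0)/write(1), temporal locality (0..3). */
      #define prefetch(addr) __builtin_prefetch((addr), 0, 3)

Because the builtin takes an address, the loop on the next slide passes &a[i + 10] rather than the value a[i + 10].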

  12. Vector Addition – Prefetch

      #define SIZE_N 1024
      int a[SIZE_N], b[SIZE_N], c[SIZE_N], s1[SIZE_N], s2[SIZE_N];

      for (int i = 0; i < SIZE_N; i++) {
          s1[i] = a[i] + b[i];
          s2[i] = a[i] + c[i];
          prefetch(&a[i + 10]);
          prefetch(&b[i + 10]);
          prefetch(&c[i + 10]);
      }

  13. Vector Addition – Prefetch • Only 30 misses • 3072 prefetch instructions issued • Does the cost outweigh the benefit? • 768 − 30 = 738 fewer misses • At one cycle per prefetch instruction, the miss cost only needs to exceed 3072 ÷ 738 ≈ 4.2 cycles for prefetch to be worthwhile • Multiple-issue processors can help hide the cost of issuing prefetches • Improves performance even if we’re only adding 2 vectors

  14. Vector Addition – Prefetch • Do we want a special load instruction that prefetches several blocks ahead? • Reduces instruction count • Works in the case of sequential access, but what if we want to prefetch from non-contiguous locations?

  15. Vector Addition – Cache Bypass

      Assume a fully associative cache with 2 lines of 4 words each, and write non-allocate:

      for (int i = 0; i < SIZE_N; i++) {
          s1[i] = a[i] + b[i];
          s2[i] = a[i] + c[i];
      }

      • Very worst case: every read misses (4 reads × 1024 iterations = 4096 misses)
      • If we use LRU and write our assembly so that a always stays in cache: 2048 misses for b[i] and c[i], plus 256 for a[i] = 2304
      • If we instead use non-caching reads for c[i]: 1024 misses for c[i], plus 256 each for a[i] and b[i] = 1536 total
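How a non-caching read is spelled varies by ISA and compiler (some offer non-temporal load instructions); the sketch below uses a hypothetical load_bypass() helper, not a real API, to show the intended access pattern:

      /* load_bypass() is hypothetical: a read that returns the value
       * without allocating a cache line. Only c bypasses the cache;
       * a and b still share the 2-line cache as analyzed above. */
      for (int i = 0; i < SIZE_N; i++) {
          s1[i] = a[i] + b[i];
          s2[i] = a[i] + load_bypass(&c[i]);
      }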

  16. Linked List • Suppose we are sequentially traversing a linked list • We can prefetch the next several items • Recomputing the prefetch address each time would be expensive (reaching a node several links ahead requires multiple dependent memory accesses) • Instead, use 2 pointers: one for the traversal and one that runs ahead for prefetching

  17. Linked List – Base Code

      struct linked {
          int data;
          linked *next;
      };

      linked *start;
      linked *temp = start;
      int a[SIZE], index = 0;

      while (temp) {
          a[index++] = temp->data;
          temp = temp->next;
      }

  18. Linked List – Prefetch

      struct linked {
          int data;
          linked *next;
      };

      linked *start;
      linked *temp = start;
      linked *temp2 = start;
      int a[SIZE], index = 0;

      /* Advance the prefetch pointer 10 nodes ahead of the traversal */
      for (int i = 0; i < 10 && temp2; i++)
          temp2 = temp2->next;

      while (temp) {
          a[index++] = temp->data;
          temp = temp->next;
          if (temp2 && temp2->next) {
              temp2 = temp2->next;
              prefetch(temp2->next);   /* fetch the node ~10 links ahead */
          }
      }

  19. Linked List – Prefetch • Instead of every element potentially missing the cache, only the first 10 do • If prefetches take longer to complete, the prefetch pointer must run farther ahead, and more cache space is needed to hold the outstanding lines

  20. Binary Tree • Suppose we are traversing a binary tree where we can’t easily predict which branch we’ll take next. Is prefetch useful? • We can speculatively prefetch all possible next nodes • How far down the tree do we prefetch? • Cache pollution becomes a concern • It may be worthwhile to speculatively fetch the next two possible nodes if we can do useful work until the prefetch completes (i.e., if it takes enough cycles to determine which branch to take)

  21. Binary Tree

      struct node {
          int data;
          node *left, *right;
      };

      node *top;
      node *temp = top;
      int search_value, found = 0;

      do {
          if (temp->left) prefetch(temp->left);
          if (temp->right) prefetch(temp->right);
          temp = next_node(temp, search_value, &found);
      } while (!found);
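The slide leaves next_node undefined. A minimal sketch of what it might do, assuming an ordered binary search tree (an assumption; the original does not specify):

      /* Hypothetical next_node(): one step of a binary search.
       * Sets *found when the value is located; otherwise returns the
       * child to descend into (handling of absent values omitted). */
      node *next_node(node *current, int search_value, int *found)
      {
          if (current->data == search_value) {
              *found = 1;
              return current;
          }
          return (search_value < current->data) ? current->left
                                                : current->right;
      }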

  22. Conclusion • Some methods improve contrived cases, but are they always useful? • Loop fusion • Array merge • Prefetch works well for predictable access patterns • Dynamic memory and pointers? • Is prefetch worthwhile for large block size and random access of small elements?

  23. Conclusion • Cache miss time measured in clock cycles is increasing • This requires prefetching farther ahead, and hence larger caches • Software methods are static • Low cost of implementation • Potentially pipeline independent
