
A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting


Presentation Transcript


  1. A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting Arne Maus and Stein Gjessing Dept. of Informatics, University of Oslo, Norway OMS 2007

  2. Overview • Motivation: CPU versus memory speed • Caches • A cache test • A simple model for the execution times of algorithms • Do theoretical cache tests carry over to real programs? • A real example: three radix sorting algorithms compared. The number of instructions executed is no longer a good measure for the performance of an algorithm.

  3. The need for caches: the CPU-memory performance gap. [Chart from John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers Inc., San Francisco, CA, 2003]

  4. A cache test: random vs. sequential access in large arrays • Both a and b are int arrays of length n (n = 100, 200, 400, ..., 97M) • Two test runs, with the same number of instructions performed: • Random access: set b[i] = random(0..n-1); each iteration then makes 15 random accesses in b, 1 random access in a, and 1 sequential access in b (the innermost) • Sequential access: set b[i] = i; then b[b[...b[i]...]] = i, and each iteration makes 16 sequential accesses in b and 1 in a • The timed loop: for (int i = 0; i < n; i++) a[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[i]]]]]]]]]]]]]]]]] = i;
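The test above can be written out as a small, self-contained Java program. This is our sketch, not the authors' benchmark code: the class and method names, the fixed seed, and the n range are assumptions, and the slide's 16 nested lookups are unrolled as a short inner loop.

import java.util.Random;

public class CacheTest {

    // One timed pass: 16 chained lookups in b and one store in a per
    // iteration, the loop form of the slide's a[b[b[...b[i]...]]] = i.
    static long timedPass(int[] a, int[] b) {
        long t0 = System.nanoTime();
        for (int i = 0; i < a.length; i++) {
            int j = b[i];                           // innermost lookup, sequential in b
            for (int k = 0; k < 15; k++) j = b[j];  // 15 further lookups in b
            a[j] = i;                               // one access in a
        }
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);                // fixed seed (our choice)
        for (int n = 100; n <= 25_000_000; n *= 4) {
            int[] a = new int[n], b = new int[n];

            for (int i = 0; i < n; i++) b[i] = i;   // sequential run: b[b[..b[i]..]] == i
            long seq = timedPass(a, b);

            for (int i = 0; i < n; i++) b[i] = rnd.nextInt(n);  // random run
            long ran = timedPass(a, b);

            System.out.printf("n = %9d   random/sequential = %.1f%n",
                              n, (double) ran / seq);
        }
    }
}

Both runs execute exactly the same instructions; only the contents of b, and hence the access pattern, differ. In the random run every lookup after the innermost one jumps to an unpredictable position, so once n is far beyond the L2 capacity almost every such access misses the caches.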

  5. Random vs. sequential access times, with the same number of instructions performed. Cache misses slow random access down by a factor of 50-60 (4 CPUs). [Chart; annotations mark the start of cache misses from L1 to L2 and from L2 to memory]

  6. Why a slowdown of 50-60 and not a factor of 400? • Patterson and Hennessy suggest a slowdown factor of 400; the test shows 50 to 60. Why? • Answer: every array access in Java is checked against the lower and upper array limits, roughly: • load array index • compare with zero (lower limit) • load upper limit • compare index and upper limit • load array base address • load/store array element (= possible cache miss) • That is 5 cache-hit operations + one cache miss, so the average is (5 + 400)/6 ≈ 67

  7. A simple model for the execution time of a program [diagram: a structure of n elements spanning the L1 and L2 caches] • For every loop in the program: • count the number of sequential references • count the number of random accesses, and the number of places n in the randomly accessed object (array) • From the figure for random access, we see an asymptotic slowdown factor of: 1 if n < L1, 4 if L1 < n < L2, 50 if L2 < n • The access time TR for one random read or write is then: TR = 1*Pr(access in L1) + 4*Pr(access in L2) + 50*Pr(access in memory) (= 1*L1/n + 4*L2/n + 50*(n - L2)/n when n > L2) • Sequential reads and writes are given cost 1, and we can then estimate the total execution time as the weighted sum over all loop accesses
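As a sketch, the model's random-access cost can be written down directly. This is our transcription, not code from the paper: tR is the TR above, L1 and L2 are the number of elements of the accessed structure that fit in each cache, and the middle case (L1 < n <= L2) is our analogous extension of the formula the slide gives for n > L2.

// Expected cost of one random read or write into a structure of n
// elements; 1, 4 and 50 are the slowdown factors read off the
// measurements above.
static double tR(double n, double L1, double L2) {
    if (n <= L1) return 1.0;                          // everything stays in L1
    if (n <= L2) return (L1 + 4.0 * (n - L1)) / n;    // our analogous middle case
    return (L1 + 4.0 * L2 + 50.0 * (n - L2)) / n;     // the slide's formula, n > L2
}

The estimated execution time of a loop is then the number of sequential accesses (at cost 1) plus, for each randomly accessed structure, the number of random accesses times tR of that structure's size.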

  8. Applying the model to radix sorting: the test • Three radix algorithms: • radix1 sorts the array in one pass with one 'large' digit • radix2 sorts the array in two passes with two half-sized digits • radix3 sorts the array in three passes with three 'small' digits • radix3 performs almost three times as many instructions as radix1; should it then be almost 3 times as slow? • radix2 performs almost twice as many instructions as radix1; should it then be almost 2 times as slow?

  9. Base: right-radix sorting algorithm, one pass over array a with one sorting digit of width maskLen, shifted shift bits up

  static void radixSort (int[] a, int[] b, int left, int right, int maskLen, int shift) {
      int acumVal = 0, j, n = right - left + 1;
      int mask = (1 << maskLen) - 1;
      int[] count = new int[mask + 1];
      // a) count = the frequency of each radix value in a
      for (int i = left; i <= right; i++)
          count[(a[i] >> shift) & mask]++;
      // b) add up in 'count': accumulated values
      for (int i = 0; i <= mask; i++) {
          j = count[i];
          count[i] = acumVal;
          acumVal += j;
      }
      // c) move numbers in sorted order from a to b
      for (int i = 0; i < n; i++)
          b[count[(a[i + left] >> shift) & mask]++] = a[i + left];
      // d) copy b back to a
      for (int i = 0; i < n; i++)
          a[i + left] = b[i];
  }

  10. Radix sort with 1, 2 and 3 digits = 1, 2 and 3 passes

  static void radix1 (int[] a, int left, int right) {
      // 1-digit radixSort: a[left..right]
      int max = 0, numBit = 1, n = right - left + 1;
      for (int i = left; i <= right; i++)
          if (a[i] > max) max = a[i];
      while (max >= (1 << numBit)) numBit++;
      int[] b = new int[n];
      radixSort(a, b, left, right, numBit, 0);
  }

  static void radix3 (int[] a, int left, int right) {
      // 3-digit radixSort: a[left..right]
      int max = 0, numBit = 3, n = right - left + 1;
      for (int i = left; i <= right; i++)
          if (a[i] > max) max = a[i];
      while (max >= (1 << numBit)) numBit++;
      int bit1 = numBit / 3, bit2 = bit1, bit3 = numBit - (bit1 + bit2);
      int[] b = new int[n];
      radixSort(a, b, left, right, bit1, 0);
      radixSort(a, b, left, right, bit2, bit1);
      radixSort(a, b, left, right, bit3, bit1 + bit2);
  }
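The slide shows radix1 and radix3 only; radix2 is not in the transcript. A minimal sketch following the same pattern, our reconstruction rather than the authors' code:

  static void radix2 (int[] a, int left, int right) {
      // 2-digit radixSort: a[left..right], mirroring radix1/radix3 above
      int max = 0, numBit = 2, n = right - left + 1;
      for (int i = left; i <= right; i++)
          if (a[i] > max) max = a[i];
      while (max >= (1 << numBit)) numBit++;
      int bit1 = numBit / 2, bit2 = numBit - bit1;   // two half-sized digits
      int[] b = new int[n];
      radixSort(a, b, left, right, bit1, 0);
      radixSort(a, b, left, right, bit2, bit1);
  }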

  11. Random/sequential test (AMD Opteron): radix1, radix2 and radix3 compared with Quicksort and Flashsort. [Chart; annotations: radix1 slowed down by a factor of 7; radix2, slowdown started; radix3, no slowdown]

  12. Random/sequential test (Intel Xeon): radix1, radix2 and radix3 compared with Quicksort and Flashsort. [Chart]

  13. The model: careful counting of the loops in radix1, radix2 and radix3 • Let Ek denote the number of the different operations for a k-pass radix algorithm (k = 1, 2, ...), let S denote a sequential read or write, and let Rk denote a random read or write in m different places in an array. [The slide's equations for Ek, their first simplification, and the final simplified form are images and are not reproduced in the transcript]
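Since the equation images are missing, here is our reconstruction of the counting, reading the four loops of the radixSort code on slide 9 and pricing each access with the tR sketch given after slide 7. Treating every scattered access as a random access into a structure of 2^r elements is our simplifying assumption; the authors' exact equations may differ.

// Rough model estimate for a k-pass radix sort of n keys (keys assumed
// to lie in 0..n-1, so about log2(n) significant bits).
// Per pass with an r-bit digit:
//   loops a + c: 2n sequential reads of a, 2n random accesses in count,
//                and n writes in b scattered over 2^r active regions
//   loop b:      about 2^r sequential accesses in count
//   loop d:      2n sequential accesses (copy b back to a)
static double estimatedTime(int k, double n, double L1, double L2) {
    double numBit = Math.ceil(Math.log(n) / Math.log(2.0));
    double digitRange = Math.pow(2.0, Math.ceil(numBit / k));  // 2^r
    double perPass = 4.0 * n                           // sequential accesses, cost 1
                   + 3.0 * n * tR(digitRange, L1, L2)  // scattered accesses
                   + 2.0 * digitRange;                 // loop b over count
    return k * perPass;
}

With L2 on the order of a million ints this reproduces the qualitative result of the measurement slides: for radix1, digitRange is about n, so 3n accesses per pass pay nearly the full random-access penalty, while radix3 keeps its count arrays cache-resident, and its three cheap passes beat radix1's one expensive pass.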

  14. Model vs. test results (Opteron and Xeon). [Charts]

  15. Conclusions • The effects of cache misses are real and show up in ordinary user algorithms that do random access in large arrays. • We have demonstrated that radix3, which performs almost 3 times as many instructions as radix1, is 4-5 times as fast as radix1 for large n; i.e. radix1 experiences a slowdown factor of 7-10 because of cache misses. • The number of instructions executed is no longer a good measure for the performance of an algorithm. • Algorithms should be rewritten so that random access in large data structures is removed.
