
Memory Hierarchy and Cache Performance

This lecture discusses memory hierarchy and cache performance in parallel scientific computing. It covers topics such as memory size, memory performance measures, memory hierarchy, cache, cache lines, and cache classification.


Presentation Transcript


1. Parallel Scientific Computing: Algorithms and Tools, Lecture #2
APMA 2821A, Spring 2008
Instructors: George Em Karniadakis, Leopold Grinberg

2. Memory
• Bits: 0 or 1; a byte is 8 bits.
• Memory size units: KB – 10^3 bytes; MB – 10^6 bytes; GB – 10^9 bytes; TB – 10^12 bytes; PB – 10^15 bytes.
• Memory performance measures:
  • Access time (also response time or latency): the interval between the time a memory request is issued and the time the request is satisfied.
  • Cycle time: the minimum time between two successive memory requests (DRAM only).
• Timing diagram: a request issued at t0 is satisfied at t1 (access time = t1 - t0); the memory stays busy until t2 (cycle time = t2 - t0). If another request arrives while t0 < t < t2, the memory is busy and will not respond; the request must wait until t > t2.
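One rough way to observe access time in practice is a pointer-chasing microbenchmark, where each load depends on the previous one. The sketch below is not from the lecture; the array size, step count, and use of a random cycle are assumptions chosen so the working set is larger than cache and prefetching cannot help.

    /* Minimal pointer-chasing sketch to estimate average memory access latency. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N     (1 << 24)      /* 16M pointers (~128 MB on 64-bit), larger than cache */
    #define STEPS (1 << 24)

    int main(void) {
        size_t *next = malloc((size_t)N * sizeof *next);
        if (!next) return 1;

        /* Sattolo's algorithm: build one random cycle so every load depends on
           the previous one and hardware prefetchers cannot predict the walk. */
        for (size_t i = 0; i < N; i++) next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;      /* j < i */
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t p = 0;
        for (long s = 0; s < STEPS; s++) p = next[p];   /* dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
        printf("average latency per dependent load: %.1f ns (p = %zu)\n", ns / STEPS, p);
        free(next);
        return 0;
    }

Printing p keeps the compiler from optimizing the loop away; the reported number approximates the latency of a cache-missing load on the machine where it runs.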

3. Memory Hierarchy
• Memory can be fast (costly) or slow (cheaper).
• To increase overall performance, exploit locality of reference:
  • faster (and smaller) memory closer to the CPU;
  • slower (and larger) memory farther from the CPU.
• Keep often-used data in fast memory; leave less-often-used data in slow memory.
• Key: when a lower level of the hierarchy sends the value at location x to a higher level, it also sends the contents of x+1, x+2, etc., i.e. a block of data, called a cache line.

4. Memory Hierarchy
• Levels, from fastest/smallest/most expensive to slowest/largest/cheapest: registers, Level-1 cache, Level-2 cache, main memory, secondary memory (hard disk), network storage.
• Cache: a piece of fast memory (expensive: CA$H?).
• Performance of different levels can be very different: e.g. the access time for L1 cache can be 1 cycle, L2 can be 5 or 6 cycles, while main memory can be dozens of cycles and secondary memory can be orders of magnitude slower.
• Moving down the hierarchy: speed decreases, cost decreases, size increases.

5. How Memory Hierarchy Works
• (RISC processor) The CPU works only on data in registers.
• If the data is not in a register, the CPU requests it from memory and loads it into a register.
• Data in registers come only from, and go only to, the L1 cache.
• When the CPU requests data from memory, the L1 cache takes over:
  • If the data is in the L1 cache (cache hit), it is returned to the CPU immediately; the memory access ends.
  • If the data is not in the L1 cache (cache miss) …

6. How Memory Hierarchy Works
• If the data is not in the L1 cache, the L1 cache forwards the memory request down to the L2 cache.
  • If the L2 cache has the data (cache hit), it returns the data to the L1 cache, which in turn returns it to the CPU; the memory access ends.
  • If the L2 cache does not have the data (cache miss) …
• If the data is not in the L2 cache, the L2 cache forwards the memory request down to main memory.
  • If the data is in main memory, main memory passes it to the L2 cache, which passes it to the L1 cache, which passes it to the CPU.
  • If the data is not in memory …
• Then the request is passed to the OS to read the data from secondary storage (disk); the data is then passed to memory, the L2 cache, the L1 cache, and a register.
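The practical effect of this hit/miss cascade is often summarized as the average memory access time (AMAT). The sketch below works through the arithmetic; all latencies and hit rates are illustrative assumptions, not values from the lecture.

    /* Average memory access time for a two-level cache (illustrative numbers). */
    #include <stdio.h>

    int main(void) {
        double l1_hit = 1.0;     /* cycles for an L1 hit */
        double l2_hit = 6.0;     /* extra cycles when an L1 miss hits in L2 */
        double mem    = 60.0;    /* extra cycles when an L2 miss goes to main memory */
        double l1_hit_rate = 0.95;
        double l2_hit_rate = 0.80;   /* of the accesses that miss in L1 */

        /* AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * memory time) */
        double amat = l1_hit + (1.0 - l1_hit_rate) * (l2_hit + (1.0 - l2_hit_rate) * mem);
        printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.05 * (6 + 0.2 * 60) = 1.90 cycles */
        return 0;
    }

Even a 5% L1 miss rate nearly doubles the average access cost in this example, which is why the later slides focus on keeping the hit rate high.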

7. Cache Line
• A cache line is the smallest unit of data that can be transferred to or from memory (and the L2 cache).
• Usually between 32 and 128 bytes; may contain several data items.
• When the L2 cache passes data to the L1 cache, or when main memory passes data to the L2 cache, a whole cache line is transferred, rather than a single piece of data.
• When the data in variable X is requested from memory, the cache line containing X (and adjacent data) is transferred to cache.
• Example (figure): assume a 32-byte cache line and that X[11] is requested by the CPU; the result is that X[10] through X[13] are brought into cache from memory.
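The grouping of neighbouring elements into lines is just address arithmetic. A small sketch, assuming 8-byte doubles, a 32-byte line, and (hypothetically) that X[0] happens to sit on a line boundary:

    /* Which array elements share a cache line, relative to the start of the array? */
    #include <stdio.h>

    #define LINE_SIZE 32    /* bytes per cache line, as on the slide */

    int main(void) {
        double X[16];
        for (int i = 0; i < 16; i++) X[i] = 0.0;

        for (int i = 0; i < 16; i++) {
            /* byte offset of X[i] from X[0], divided by the line size */
            size_t line = (size_t)((char *)&X[i] - (char *)&X[0]) / LINE_SIZE;
            printf("X[%2d] -> line %zu relative to X[0]\n", i, line);
        }
        /* With 8-byte doubles and 32-byte lines, X[0]-X[3] share a line, X[4]-X[7]
           the next, and so on: requesting one element brings in its neighbours. */
        return 0;
    }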

8. Cache Effect on Performance
• Cache miss → degraded performance: when there is a cache miss, the CPU sits idle waiting for a cache line to be brought in from a lower level of the memory hierarchy.
• Increasing the cache hit rate → higher performance; efficiency is directly related to reuse of data in cache.
• To increase the cache hit rate, access memory sequentially; avoid strides, random access, and indirect addressing in programming.

  Sequential access:    for(i=0;i<100;i++)   y[i] = 2*x[i];
  Strided access:       for(i=0;i<100;i=i+4) y[i] = 2*x[i];
  Indirect addressing:  for(i=0;i<100;i++)   y[i] = 2*x[index[i]];
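The difference between these three patterns can be measured directly. The following sketch is not the lecture's code; the array size, the stride of 4, and the random index array are assumptions, and absolute timings will vary by machine.

    /* Rough timing comparison of sequential, strided, and indirect access. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 10000000

    static double elapsed_ms(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
    }

    int main(void) {
        double *x = malloc(N * sizeof *x), *y = malloc(N * sizeof *y);
        long   *index = malloc(N * sizeof *index);
        if (!x || !y || !index) return 1;
        for (long i = 0; i < N; i++) { x[i] = i; y[i] = 0; index[i] = rand() % N; }

        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++) y[i] = 2 * x[i];         /* sequential: full line reuse */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("sequential: %.1f ms\n", elapsed_ms(t0, t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i += 4) y[i] = 2 * x[i];      /* stride 4: uses 1 of 4 doubles per line */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("stride 4:   %.1f ms\n", elapsed_ms(t0, t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++) y[i] = 2 * x[index[i]];  /* indirect: essentially random access */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("indirect:   %.1f ms\n", elapsed_ms(t0, t1));

        free(x); free(y); free(index);
        return 0;
    }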

9. Where in Cache to Put Data from Memory
• Cache is organized into cache lines; memory is also logically organized into cache lines.
• Memory size >> cache size, so the number of cache lines in memory >> the number of cache lines in cache; many cache lines in memory correspond to one cache line in cache.
• Example (figure): a 1 MB cache with 32-byte lines holds 32,768 cache lines, while a 2 GB main memory contains 67,108,864 cache-line-sized blocks.

10. Cache Classification
• Direct-mapped cache: a given memory cache line is always placed in one specific cache line in cache.
• Fully associative cache: a given memory cache line can be placed in any of the cache lines in cache.
• N-way set associative cache: a given memory cache line can be placed in any of a set of N cache lines in cache.

11. Direct-Mapped Cache
• A set of memory cache lines always corresponds to exactly the same cache line in cache:
  Line-Index = Mod(mem-cache-line-index, tot-cache-lines-in-cache)
• Cheap to implement in hardware.
• May cause cache thrashing: repeatedly displacing and loading cache lines.
• Figure: with an 8 KB direct-mapped cache, memory blocks at addresses 0, 8 KB, 16 KB, … (up to 2 GB) all map to the same cache line.
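The placement rule is plain modular arithmetic on addresses. A sketch using the running example's sizes (1 MB cache, 32-byte lines); the sample addresses are arbitrary:

    /* Direct-mapped placement: which cache line does an address land in? */
    #include <stdio.h>

    #define LINE_SIZE   32                        /* bytes per cache line */
    #define CACHE_SIZE  (1 << 20)                 /* 1 MB cache */
    #define NUM_LINES   (CACHE_SIZE / LINE_SIZE)  /* 32,768 lines in cache */

    int main(void) {
        unsigned long addrs[] = { 0, 32, 1UL << 20, (1UL << 20) + 32, 3UL << 20 };
        for (int i = 0; i < 5; i++) {
            unsigned long mem_line   = addrs[i] / LINE_SIZE;    /* line index in memory */
            unsigned long cache_line = mem_line % NUM_LINES;    /* Line-Index = Mod(...) */
            printf("address %8lu -> memory line %8lu -> cache line %5lu\n",
                   addrs[i], mem_line, cache_line);
        }
        return 0;   /* addresses that are a multiple of 1 MB apart map to the same cache line */
    }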

12. Cache Thrashing: Example
• Assumptions: direct-mapped cache; cache size 1 MB; cache line 32 bytes.
• 1 double value = 8 bytes; 131,072 double values = 1 MB; 1 cache line = 32 bytes = 4 double values.
• X[131072] occupies 1 MB of memory; Y[131072] occupies 1 MB of memory.

  double X[131072], Y[131072];
  long i, j;
  // initialization of X, Y
  …
  for(i=0;i<131072;i++)
      Y[i] = X[i] + Y[i];
  …

13. Cache Thrashing: Example (continued)
• Figure: the cache line holding X[i]–X[i+3] and the line holding Y[i]–Y[i+3] map to the same line of the 1 MB, 32,768-line direct-mapped cache.
• i=0: load line X[0]–X[3] into cache; load X[0] from cache to a register; load line Y[0]–Y[3] into cache, displacing line X[0]–X[3]; load Y[0] from cache into a register; add, and update Y[0] in cache.
• i=1: load X[0]–X[3] into cache, displacing Y[0]–Y[3] and writing line Y[0]–Y[3] back to memory; load X[1] from cache to a register; load Y[0]–Y[3] into cache, displacing X[0]–X[3]; load Y[1] from cache to a register; add, and update Y[1] in cache.
• i=2: load X[0]–X[3] into cache, displacing Y[0]–Y[3] and writing line Y[0]–Y[3] back to memory; load X[2] from cache to a register; load Y[0]–Y[3] into cache, displacing X[0]–X[3]; load Y[2] from cache to a register; add, and update Y[2] in cache.
• i=3: …
• No cache reuse! Poor performance! Avoid cache thrashing!

  double X[131072], Y[131072];
  long i, j;
  // initialization of X, Y
  …
  for(i=0;i<131072;i++)
      Y[i] = X[i] + Y[i];
  …

14. Fully Associative Cache
• A cache line from memory can be placed anywhere in cache.
• No cache thrashing, but costly.
• Direct-mapped cache sits at one extreme of the spectrum; fully associative cache at the other extreme.
• Disadvantage: the entire cache must be searched to determine whether a specific cache line is present.

15. N-Way Set Associative Cache
• A compromise between direct-mapped and fully associative caches.
• The cache lines in cache are divided into a number of sets; each set contains N cache lines.
• Given a cache line from memory, the index of the set it belongs to is calculated first; the line is then placed in one of the N cache lines of that set.
• Figure (2-way set associative cache): a 1 MB cache with 32,768 cache lines has 16,384 sets, each holding 2 lines; main memory is 2 GB (67,108,864 cache lines).
• A direct-mapped cache is a 1-way set associative cache; a fully associative cache is an N_c-way set associative cache, where N_c is the total number of cache lines in cache.
• Less likely to cause cache thrashing than direct-mapped; less costly than fully associative.
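To see why associativity helps the thrashing example, consider the set index of the colliding lines. A sketch using the same 1 MB cache and 32-byte lines; the addresses are illustrative assumptions (X taken to start at address 0, with Y exactly 1 MB later):

    /* Set-index calculation for a 2-way set associative cache. */
    #include <stdio.h>

    #define LINE_SIZE  32
    #define CACHE_SIZE (1 << 20)
    #define WAYS       2
    #define NUM_SETS   (CACHE_SIZE / LINE_SIZE / WAYS)   /* 16,384 sets */

    int main(void) {
        unsigned long addr_x0 = 0;          /* assumed address of X[0] */
        unsigned long addr_y0 = 1UL << 20;  /* Y[0], 1 MB later in memory */

        unsigned long set_x = (addr_x0 / LINE_SIZE) % NUM_SETS;
        unsigned long set_y = (addr_y0 / LINE_SIZE) % NUM_SETS;

        printf("X[0] -> set %lu, Y[0] -> set %lu\n", set_x, set_y);
        /* Both map to the same set, but the set holds 2 lines, so the X line and
           the Y line can stay in cache together: no thrashing for this pattern. */
        return 0;
    }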

16. Instruction/Data Cache
• A CPU may have separate instruction and data caches (split cache).
• A CPU may instead have a single cache for both instructions and data from memory (unified cache).

17. Remember …
• Efficiency is directly related to cache reuse.
• Cache thrashing can be eliminated by padding arrays (array dimensions should not be a multiple of the cache line size; avoid powers of 2), as sketched below.
• To improve cache reuse:
  • access memory sequentially as much as possible;
  • avoid strides, random access, and indirect addressing;
  • avoid cache thrashing.
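A padded version of the earlier thrashing example might look like the sketch below. The pad size (one 32-byte line, i.e. 4 doubles) is an assumption, and the argument relies on the compiler laying the two arrays out back to back, which is typical but not guaranteed by C.

    /* Padding sketch: shift Y relative to X so their lines no longer collide
       in the 1 MB direct-mapped cache of the thrashing example.            */
    #define PAD 4    /* one extra 32-byte cache line of doubles (assumed pad size) */

    double X[131072 + PAD];   /* 1 MB + 32 bytes */
    double Y[131072 + PAD];   /* if placed right after X, Y[i] is now 1 MB + 32 bytes
                                 past X[i], so the two lines map to different slots */

    void add(void) {
        for (long i = 0; i < 131072; i++)
            Y[i] = X[i] + Y[i];   /* X[i]'s line and Y[i]'s line can now coexist in cache */
    }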

18. Example
• A large stride in the memory access pattern results not only in cache misses and poor reuse, but also in TLB misses.
• In the loop below, consecutive iterations of the inner loop access X[i][j] and X[i+1][j], which in C's row-major layout are 1024 doubles apart: a stride of 8 KB.

  double X[1024][1024], Y[1024][1024];
  int i,j;
  …
  for(j=0;j<1024;j++)
      for(i=0;i<1024;i++)
          X[i][j] = Y[i][j];
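The usual fix, consistent with the "access memory sequentially" advice above, is to swap the loop order so the inner loop walks along a row, which is contiguous in C:

  double X[1024][1024], Y[1024][1024];
  int i, j;
  …
  for(i=0;i<1024;i++)          /* rows outer    */
      for(j=0;j<1024;j++)      /* columns inner */
          X[i][j] = Y[i][j];   /* consecutive j are adjacent in memory: stride 1 */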

19. Virtual Memory, Memory Paging
• Modern computers use virtual memory: the memory address seen in a program (the virtual address) is not the actual address in physical memory.
• Memory is divided into pages (e.g. 4 KB); a memory page in a program's address space corresponds to a page in physical memory.
• To access memory, the program's virtual address must be translated to the actual address in physical memory; this is done using a page table.
• Figure: the virtual address spaces of Program #1 and Program #2 (each 2 GB, in 4 KB pages starting at 0) map onto pages of a 4 GB physical memory.
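The translation itself is a split of the address into a page number and an offset. A sketch assuming the 4 KB pages mentioned above; the sample address and the physical frame number stand in for a page-table lookup and are made up for illustration:

    /* Splitting a virtual address into page number and offset. */
    #include <stdio.h>

    #define PAGE_SIZE 4096

    int main(void) {
        unsigned long vaddr = 1050000;              /* arbitrary virtual address */
        unsigned long page  = vaddr / PAGE_SIZE;    /* index into the page table */
        unsigned long off   = vaddr % PAGE_SIZE;    /* unchanged by translation */

        unsigned long frame = 77;                   /* made-up result of a page-table lookup */
        unsigned long paddr = frame * PAGE_SIZE + off;

        printf("virtual %lu = page %lu + offset %lu -> physical %lu\n",
               vaddr, page, off, paddr);
        return 0;
    }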

20. Translation Look-aside Buffer (TLB)
• The TLB is a special cache for the page tables, giving faster access to virtual-to-physical translations.
• When a program accesses a memory location, the translation between the virtual and physical pages is loaded into the TLB (if it is not already there).
• If the program exhibits locality of reference, entries in the TLB can be reused: TLB hits → better performance.
• Otherwise → TLB misses → performance degrades.
• A large stride in the memory access pattern → TLB misses (and cache misses).
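A useful back-of-the-envelope number is the TLB "reach": how much memory the cached translations cover. The entry count below is an assumed figure, not from the lecture.

    /* TLB reach = number of TLB entries * page size. */
    #include <stdio.h>

    int main(void) {
        long entries   = 64;      /* assumed TLB size */
        long page_size = 4096;    /* 4 KB pages, as on the previous slide */
        printf("TLB reach = %ld KB\n", entries * page_size / 1024);   /* 256 KB */
        /* With an 8 KB stride (as in the 2-D array example), every access touches a
           different page, so translations stop being reused almost immediately. */
        return 0;
    }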

21. Remedies
• Use a large memory page size.
  • On some systems the memory page size can be modified by user programs, e.g. IBM SP and HP machines.
• Avoid large strides in memory access; access memory sequentially as much as possible.

22. Interleaved Memory
• Memory interleaving alleviates the impact of memory cycle time.
• Total memory is divided into a set of memory banks; contiguous memory addresses reside on different banks.
• When accessing memory sequentially, the effect of memory cycle time is minimized: while the current bank is busy, the next bank is idle and can be accessed immediately.
• Strided memory access is unfavorable → it may access the same bank repeatedly and must wait out the cycle time → poor performance.
• Example (figure): 2 GB of total memory divided into 4 banks of 512 MB each, interleaved by 32-byte cache lines; bytes 0-31 go to bank 1, 32-63 to bank 2, 64-95 to bank 3, 96-127 to bank 4, 128-159 back to bank 1, and so on.
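The bank assignment in the figure is again simple modular arithmetic. A sketch of that mapping only; real hardware interleaving schemes vary, so treat the formula as an assumption matching the slide's picture.

    /* Which bank serves a given address in a 4-bank, cache-line-interleaved layout? */
    #include <stdio.h>

    #define LINE_SIZE 32
    #define NUM_BANKS 4

    int main(void) {
        for (unsigned long addr = 0; addr < 512; addr += LINE_SIZE) {
            unsigned long bank = (addr / LINE_SIZE) % NUM_BANKS;   /* lines rotate through banks */
            printf("bytes %3lu-%3lu -> bank %lu\n", addr, addr + LINE_SIZE - 1, bank + 1);
        }
        /* Sequential access cycles through banks 1, 2, 3, 4, 1, ... so each bank's cycle
           time is hidden; a stride of 4 lines (128 bytes) would hit the same bank every time. */
        return 0;
    }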
