
Memory Hierarchies



Presentation Transcript


  1. Memory Hierarchies Sonish Shrestha October 3, 2013

  2. Memory Hierarchies • Buses • The wires that move data around in a computer, from memory to the CPU or to a disk controller or screen, are called buses. • Front-Side Bus (FSB): connects the processor to memory. • Typically slower than the processor, which is one reason caches are needed.

  3. Latency and Bandwidth • Latency: • The delay between the processor issuing a request for a memory item and the item actually arriving (whether from memory to cache, from cache to register, or the whole path taken together). • Measured in nanoseconds or clock periods. • Bandwidth: • The rate at which data arrives at its destination once the transfer has started. • Measured in kilobytes, megabytes, or gigabytes per second, or in bytes per clock cycle. • The time a message of n bytes takes from start to finish is T(n) = α + βn, where α is the latency, β is the inverse of the bandwidth (time per byte), and n is the number of bytes.
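To make the formula concrete, here is a minimal C sketch that evaluates T(n) = α + βn for a few message sizes. The numbers (100 ns latency, 10 GB/s bandwidth) are illustrative assumptions, not figures from the slides; the point is that latency dominates small transfers while bandwidth dominates large ones.

    #include <stdio.h>

    int main(void) {
        double alpha = 100e-9;        /* assumed latency: 100 ns */
        double beta  = 1.0 / 10e9;    /* assumed inverse bandwidth: 10 GB/s */
        long sizes[] = {8, 8192, 8388608};
        for (int k = 0; k < 3; k++) {
            long n = sizes[k];
            double t = alpha + beta * (double)n;   /* T(n) = alpha + beta*n */
            printf("n = %8ld bytes  T(n) = %10.1f ns\n", n, t * 1e9);
        }
        return 0;
    }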

  4. Registers • The registers are what the processor actually operates on. • High bandwidth and low latency, because they are part of the processor. • Example: A = B + C • Load the value of B from memory into a register • Load the value of C from memory into another register • Compute the sum and write it into a third register • Write the sum back to memory location A
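As a hedged sketch (the variable names are mine, not from the slides), the four steps for A = B + C can be written out in C; a compiler would keep the temporaries in registers:

    double A, B = 1.0, C = 2.0;

    void add_example(void) {
        double rb = B;        /* 1. load B from memory into a register  */
        double rc = C;        /* 2. load C from memory into a register  */
        double rs = rb + rc;  /* 3. compute the sum in a third register */
        A = rs;               /* 4. store the sum back to location A    */
    }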

  5. Cache • Cache is a small, high-speed memory that contains the most recently accessed pieces of main memory. • Example: a library • Let's give the librarian a backpack into which he can store 10 books (in computer terms, the librarian now has a 10-book cache). In this backpack he will put the books the clients return to him, up to a maximum of 10. Let's use the prior example, but now with our new-and-improved caching librarian. • The day starts. The librarian's backpack is empty. Our first client arrives and asks for Moby Dick. No magic here: the librarian has to go to the storeroom to get the book. He gives it to the client. Later, the client returns and gives the book back to the librarian. Instead of taking the book back to the storeroom, the librarian puts it in his backpack (after first checking whether the bag is full; more on that later). Another client arrives and asks for Moby Dick. Before going to the storeroom, the librarian checks whether the title is in his backpack. He finds it! All he has to do is take the book from the backpack and give it to the client. There is no journey to the storeroom, so the client is served more efficiently. • What if the client asks for a title not in the cache (the backpack)? In that case the librarian is less efficient with a cache than without one, because he takes the time to look for the book in his backpack first. One of the challenges of cache design is to minimize the impact of cache searches, and modern hardware has reduced this delay to practically zero. Even in our simple librarian example, the latency of searching the cache is so small compared with the time to walk to the storeroom that it is irrelevant: the cache is small (10 books), and the time it takes to notice a miss is only a tiny fraction of the time a journey to the storeroom takes.

  6. Memory hierarchy latencies (diagram in the original slide): • Core • L1 (in a multicore, private per core and split into L1 D & L1 I): ~1-2 cycles • L2 (shared): ~10-15 cycles • L3: ~30 cycles (say) • Main memory: ~300 cycles

  7. Locality • When instructions or data are accessed, they usually show locality. • Temporal locality: • The property that the same instructions or data are accessed multiple times over time: • X at time t • X at time t+i • In a loop, the same instructions and data are accessed repeatedly. • Spatial locality: • The property that instructions or data at consecutive addresses are accessed close together in time: • X at time t • X+1 at time t+i
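A minimal sketch showing both kinds of locality in a single loop (the array name and size are assumptions for illustration):

    #define N 1024

    double sum_array(const double a[N]) {
        double sum = 0.0;              /* temporal locality: sum (and i)  */
        for (int i = 0; i < N; i++)    /* are reused on every iteration   */
            sum += a[i];               /* spatial locality: a[i], a[i+1], */
        return sum;                    /* ... share cache lines           */
    }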

  8. Cache misses • Compulsory miss: • The first time a piece of data is accessed, it cannot yet be in the cache. This unavoidable miss is called a compulsory miss. • Conflict miss: • Suppose a piece of data is loaded and later evicted because another address maps to the same cache line. When it is accessed again, it is no longer in the cache, even though other lines may still be free. This is called a conflict miss. • Capacity miss: • In a fully associative cache, we eliminate conflict misses because a block can use any available cache line. But if all the lines are in use, one of them must still be replaced, and a miss occurs when the replaced data is accessed again. This is called a capacity miss.
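The sketch below is an assumed illustration (not from the slides) of how conflict misses arise in a direct-mapped cache: two addresses exactly one cache size apart map to the same line, so alternating between them evicts on every access, even though the rest of the cache stays empty.

    enum { CACHE_SIZE = 32 * 1024 };      /* assumed 32 KB direct-mapped cache */

    long ping_pong(const char *buf, int reps) {
        long sum = 0;
        for (int r = 0; r < reps; r++) {
            sum += buf[0];                /* maps to line 0                  */
            sum += buf[CACHE_SIZE];       /* same line: evicts buf[0], so    */
        }                                 /* every access after the first    */
        return sum;                       /* pair is a conflict miss         */
    }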

  9. Cache line and TLB • Cache line: • Data is moved from memory to cache in consecutive chunks called cache lines. • TLB (Translation Look-aside Buffer): • The TLB is a cache of frequently used page table entries: it provides fast address translation for a number of pages. If a program needs a memory location, the TLB is consulted to see whether that location is on a page the TLB remembers. • The case where the page is not remembered in the TLB is called a TLB miss.
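A hedged sketch of an access pattern that stresses the TLB (the page size and page count are assumptions): touching one byte per page means nearly every access needs a fresh address translation.

    #include <stdlib.h>

    enum { PAGE = 4096, NPAGES = 16384 };     /* assumed 4 KB pages, 64 MB total */

    long touch_pages(void) {
        char *buf = calloc(NPAGES, PAGE);     /* one buffer spanning many pages */
        if (!buf) return -1;
        long sum = 0;
        for (long i = 0; i < NPAGES; i++)
            sum += buf[i * PAGE];             /* one access per page: likely a  */
        free(buf);                            /* TLB miss almost every time     */
        return sum;
    }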

  10. Replacement policies • LRU (Least Recently Used) • FIFO (First In, First Out) • MRU (Most Recently Used)
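As an illustration of the first policy, here is a minimal LRU sketch for a single 4-way set (the structure and names are assumptions for illustration, not hardware described in the slides):

    #define WAYS 4

    typedef struct { unsigned tag; int valid; int age; } Line;

    /* Returns 1 on a hit, 0 on a miss (installing the tag, evicting the LRU line). */
    int access_set(Line set[WAYS], unsigned tag) {
        for (int i = 0; i < WAYS; i++) {
            if (set[i].valid && set[i].tag == tag) {
                set[i].age = 0;                       /* mark most recently used */
                for (int j = 0; j < WAYS; j++)
                    if (j != i && set[j].valid) set[j].age++;
                return 1;                             /* hit */
            }
        }
        int victim = 0;                               /* miss: pick the victim:  */
        for (int i = 1; i < WAYS; i++) {              /* prefer an invalid line, */
            if (!set[i].valid && set[victim].valid) victim = i;
            else if (set[i].valid == set[victim].valid &&
                     set[i].age > set[victim].age) victim = i;   /* else oldest */
        }
        set[victim] = (Line){ tag, 1, 0 };            /* install the new tag     */
        for (int j = 0; j < WAYS; j++)
            if (j != victim && set[j].valid) set[j].age++;
        return 0;
    }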

  11. Cache mapping: Direct Mapping • Example (diagram in the original slide): assume a 64-byte cache holding 8 8-byte words, one word per line. • The byte address is split into a byte offset within the line, an index that selects the set (cache line), and a tag; the index picks an entry in the tag array, the stored tag is compared with the address's tag bits, and the data array supplies the line. • Note: all addresses whose index bits and byte-offset bits are the same are mapped to the same cache line. To tell the different addresses apart, we keep the remaining upper address bits as a tag.
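A small sketch of the address split in this example (64-byte direct-mapped cache with 8-byte lines gives 3 offset bits and 3 index bits; the example address is an assumption of mine):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t addr   = 0x2AEAu;             /* assumed example address   */
        uint32_t offset = addr & 0x7u;         /* low 3 bits: byte in line  */
        uint32_t index  = (addr >> 3) & 0x7u;  /* next 3 bits: which line   */
        uint32_t tag    = addr >> 6;           /* remaining bits: the tag   */
        printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
        return 0;
    }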

  12. Set Associativity • (Diagram in the original slide: a two-way cache, Way 1 and Way 2.) The index bits of the byte address select a set, and the address's tag is compared in parallel against the stored tag of each way in that set.

  13. Full Associativity • Assume you have a parking lot where more parking permits have been handed out than there are parking spots; this is not uncommon at a college. When a lot fills up, students park in an overflow lot. Suppose there are 1000 parking spots but 5000 students. With a fully associative scheme, a student can park in any of the 1000 parking spots. • A lookup must search the entire cache for an address. • If a main-memory block can be placed in any of the cache slots, the cache is said to be fully associative.

  14. Calculation Metrics • Cache size = number of cache lines × cache line size (or block size) • Number of cache lines = 2^(index bits) • Cache line size = 2^(offset bits) bytes • Tag bits = address bits − index bits − offset bits (32-bit addresses in the exercises below)

  15. Exercise • 1. Suppose we have 16 KB of data in a direct-mapped cache with 4-word blocks. • 2. Suppose we have 16 KB of data in a 2-way set-associative cache with 4-word blocks. • Find the size of the index, offset, and tag bits in each case.

  16. Answers • 1) • Cache size = 16 KB = 16 × 2^10 bytes • Cache line size = 4 words = 4 × 4 bytes = 16 bytes • Number of cache lines = 16 × 2^10 bytes / 16 bytes = 2^10 • Index bits = 10 • Offset bits = 4 • Tag bits = 32 − 10 − 4 = 18 • 2) • Cache size = 16 × 2^10 bytes • Cache line size = 16 bytes • Set size = cache line size × set associativity = 16 bytes × 2 = 32 bytes • Number of sets = 16 × 2^10 bytes / 32 bytes = 2^9 • Index bits = 9 • Offset bits = 4 • Tag bits = 32 − 9 − 4 = 19
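A small C helper that reproduces the arithmetic above for both exercises (it assumes 32-bit addresses and power-of-two sizes, as the answers do):

    #include <stdio.h>

    static int log2i(unsigned x) { int b = 0; while (x >>= 1) b++; return b; }

    static void split_bits(unsigned cache_bytes, unsigned line_bytes, unsigned ways) {
        unsigned sets = cache_bytes / (line_bytes * ways);  /* lines per way */
        int offset = log2i(line_bytes);
        int index  = log2i(sets);
        printf("%u-way: index=%d offset=%d tag=%d\n",
               ways, index, offset, 32 - index - offset);
    }

    int main(void) {
        split_bits(16 << 10, 16, 1);  /* exercise 1: prints index=10 offset=4 tag=18 */
        split_bits(16 << 10, 16, 2);  /* exercise 2: prints index=9  offset=4 tag=19 */
        return 0;
    }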

  17. Data Reuse • Exercise:
  for (i = 0; i <= 10000; i = i + 1000) {
      for (j = 0; j < 1000; j++) {
          C[i+j] = A[j];
      }
  }
  When and which data is being reused in the above example?
