
Finish Memory (then to Cache)

Explore different types of computer memory, from asynchronous DRAM to DDR-SDRAM and Rambus DRAM. Learn about the benefits of caching in improving data access speed.


Presentation Transcript


  1. Finish Memory (then to Cache)

  2. Asynchronous DRAM • Asynchronous DRAM was common until the mid-to-late 1990s but is now outdated. • Fast Page Mode (FPM) • What made FPM fast was that different columns within the same row could be accessed without having to reassert the row address strobe. • Extended Data Out (EDO) • What was “extended” about EDO was that the data output stayed valid longer, so the next access could begin while the current data was still being read out. • Burst Extended Data Out (BEDO) • Consecutive data was fetched in “bursts,” saving on the addressing part of the access time.

  3. Synchronous DRAM • Since the mid-to-late 1990s, SDRAM has taken over as the standard for use in main memory. • Ordinary SDRAM (also called JEDEC, for Joint Electron Device Engineering Council, or PC66 SDRAM) operates at bus speeds up to 66 MHz and is now outdated. • PC100 SDRAM works at the higher bus speed of 100 MHz.

  4. Synchronous DRAM (Cont.) • PC133 SDRAM operates at bus speeds of 133 MHz and slower. This is the standard memory these days. • There are two versions of PC133 SDRAM that differ in “latency.” • Latency is the time you spend waiting until conditions are right to proceed with some action.

  5. CAS Latency • CAS: Column Address Strobe • Recall that memory is laid out in rows and columns. • The row address is readied, then there is some delay (known as the RAS-to-CAS delay). Next the column address is readied, then there is another delay, and finally one can read or write. This second wait is known as the CAS latency. • For CAS-2 the wait is 2 clock cycles and for CAS-3 the wait is 3 clock cycles. • But you need a chipset that can take advantage of the lower latency.
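To make those cycle counts concrete, here is a minimal C sketch of the arithmetic; the 100 MHz (PC100) bus speed is an assumption chosen for illustration:

    /* Rough cost of CAS latency in time, assuming a 100 MHz (PC100) bus. */
    #include <stdio.h>

    int main(void)
    {
        double cycle_ns = 1000.0 / 100.0;          /* 100 MHz -> 10 ns per cycle */
        for (int cas = 2; cas <= 3; cas++)
            printf("CAS-%d waits %.0f ns\n", cas, cas * cycle_ns);
        return 0;
    }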

  6. DDR-SDRAM • Double Data Rate Synchronous DRAM • DDR allows data to be accessed on both the positive and negative edges of the clock (double pumping). This effectively doubles the throughput. • The associated chips go by PC200 (double PC100) or PC266 (double PC133). • But the memory modules are designated by throughput. With a 64-bit bus (8 bytes) operating at PC200 (a double-pumped 100 MHz bus), the DDR module goes by PC1600 • 1600 = 200 × 8
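A minimal sketch of that naming arithmetic (the PC100/PC200/PC1600 figures are the ones from the slide):

    /* DDR module naming: throughput (MB/s) = effective clock (MHz) x bus width (bytes). */
    #include <stdio.h>

    int main(void)
    {
        int bus_mhz   = 100;                    /* underlying bus clock (PC100)   */
        int effective = bus_mhz * 2;            /* double pumped -> chips: PC200  */
        int bus_bytes = 8;                      /* 64-bit module data path        */
        int mb_per_s  = effective * bus_bytes;  /* peak throughput in MB/s        */

        printf("PC%d bus -> PC%d chips -> PC%d module\n",
               bus_mhz, effective, mb_per_s);   /* PC100 -> PC200 -> PC1600       */
        return 0;
    }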

  7. Rambus DRAM • Rambus DRAM or RDRAM • A proprietary memory technology produced by Rambus. • It is a competitor of DDR-SDRAM, but it seems to have lost out. • Which is better depends on the situation. • For very high memory usage in Pentium 4 systems, RDRAM might be better, but DDR has since been improved to DDR2.

  8. Serial Presence Detect • With all of the different types of memory, the computer needs a way to determine the specs of the memory installed so that it can be used efficiently. • This is the job of Serial Presence Detect (SPD). • A set of data, usually stored on the memory module itself, that informs the BIOS of the module's size, data width, speed, and voltage. • This was previously done by parallel presence detect (PPD).

  9. Memory Packaging • How memory is packaged • DIP: Dual In-line Package (old) • Rectangular, with pins on both sides • 386 • SIPP: Single In-line Pin Package (old) • A little circuit board with memory chips • Has pins, hard to install

  10. Memory Modules (Cont.) • SIMM: Single In-line Memory Module • Like the SIPP but with no pins, easier to install • A 32-bit path to the memory • (Images: a 30-pin SIMM and a 72-pin SIMM)

  11. Memory Modules (Cont.) • DIMM: Dual In-line Memory Module • A 64-bit path • RIMM: the Rambus version of the DIMM (not really an acronym) • SO-DIMMs (small outline DIMMs) are memory modules for laptops • Initially laptop memory was proprietary and not easily changed or upgraded. Furthermore, there was no easy access to its location.

  12. Row versus Bank • Row: the physical unit of memory • Bank: the logical unit • The bank is based on the bus width. Pentiums use a 64-bit width. • Two (32-bit) SIMMs make a bank • One (64-bit) DIMM makes a bank • One RIMM makes a bank
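The bank arithmetic as a tiny sketch (module widths taken from the earlier packaging slides):

    /* Modules per bank = bus width / module data width. */
    #include <stdio.h>

    int main(void)
    {
        int bank_bits = 64;                                      /* Pentium data bus  */
        int simm_bits = 32;                                      /* 72-pin SIMM path  */
        int dimm_bits = 64;                                      /* DIMM path         */

        printf("SIMMs per bank: %d\n", bank_bits / simm_bits);   /* 2                 */
        printf("DIMMs per bank: %d\n", bank_bits / dimm_bits);   /* 1                 */
        return 0;
    }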

  13. Installing Memory

  14. Cache • Based in part on Chapter 9 of Computer Architecture (Nicholas Carter)

  15. Pentium 4 Blurb (L1) • Some cache terms to understand: • Data cache • Execution Trace Cache

  16. Pentium 4 Blurb (L2) • Some cache terms to understand: • Non-Blocking • 8-way set associativity • on-die

  17. Caching Analogy: Why grading late homework is a pain • To grade a student’s homework problem, a professor must • Solve the problem • Compare the answer with the student’s • When grading the homework of a whole class of students, the professor can • Solve the problem • Compare the answer with Student 1’s answer • Compare the answer with Student 2’s answer • …

  18. Caching Analogy (Cont.) • In other words, the professor “caches” the solution so that all students after the first can be graded much more quickly than the first. • Even if the professor “stores” the solution (that is, files it away), it is not handy when it comes time to grade the late student’s homework.

  19. Caching Analogy (Cont.) • You might think the benefits of caching in the previous example are contrived, since the professor instructed all of the students to solve the same problem and submit it at the same time. • Suppose students (of their own volition) looked at the problems at the end of the chapter being discussed. • It’s hard to imagine, I know.

  20. Caching Analogy (Cont.) • Then a student might come to the professor’s office for help on a difficult problem. • The professor should keep the solution handy because a problem that was difficult for one student is likely to be difficult for other students who are likely to turn up soon. • This is the notion of “locality of reference” • What was needed/used recently is likely to be needed/used again soon.

  21. Locality Of Reference • The memory assigned to an executing program will have both data and instructions. At a given time, the probability that the processor will need to access a given memory location is not equally distributed among all of the memory locations. • The program may be more likely to need the same location that it has accessed in the recent past – this is known as temporal locality. • The program may be more likely to need a location that is near the one just accessed – this is known as spatial locality.

  22. Loops and Arrays • Consider that the tasks best suited for automation (to be done by a machine, including a computer) are repetitive. • Any program with loops and arrays is a good candidate to display locality of reference (see the sketch below). • Waiting for some user event is also very repetitive. This repetition may be hidden from the programmer working with a high-level language.
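A minimal C sketch of the point; the array name and size are made up for the example:

    /* Loops over arrays show both kinds of locality of reference. */
    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        int data[N];
        long sum = 0;

        for (int i = 0; i < N; i++)
            data[i] = i;

        /* Spatial locality: data[i] and data[i+1] are adjacent in memory,
           so one cache-line fetch serves several consecutive iterations.
           Temporal locality: i, sum, and the loop's own instructions are
           reused on every iteration. */
        for (int i = 0; i < N; i++)
            sum += data[i];

        printf("sum = %ld\n", sum);
        return 0;
    }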

  23. Locality of reference • Locality of reference is the principle behind caching. • Locality of reference is what allows 256-512 KB of cache to stand in for 256-512 MB of memory. • The cache is a factor of 1000 times smaller, yet the processor finds what it needs in cache ninety-some percent of the time.

  24. Caching • The term cache can be used in different ways. • Sometimes “cache” is used to refer generally to placing something where it can be retrieved more quickly. In this sense of the term, there is an entire hierarchy of caching: SRAM is faster than DRAM, which is faster than the hard drive, which is faster than the Internet. • Sometimes “cache” is used to refer specifically to the top layer of the above hierarchy (the SRAM). • For the rest of the presentation, we will be using the latter meaning.

  25. What are we caching? • We have to look one level down in the memory/storage hierarchy to realize what it is we are caching. • One level down is main memory. • Recall how one interacts with memory (DRAM) – one supplies an address to obtain the value located at that address.

  26. What are we caching? • We must cache the address and the value. • Recall our analogy – if the professor writes down the answer (analogous to the value) but does not recall what problem it is the answer to (analogous to the address), it is useless. • Ultimately we want the value, but it is the (memory) address we will be given and that is what we will search for in our cache. • The student does not ask if 43 is the answer (the answer to what?); the student asks what is the answer to problem 5-15.

  27. Some terminology • Think of cache as parallel arrays (addresses and values). • The array of addresses is called the tag array. • The array of values is called the data array. • Don’t confuse the terms “data array” and “data cache.” • A memory address is supplied: • If the memory address is found in the tag array, one is said to have a cache hit and the corresponding value from the data array is sent out. • If the memory address is not found, one has a cache miss, and the processor must go to memory to obtain the desired value. • The percentage of cache hits is known as the hit rate (one usually looks for 90% or better).
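A minimal software sketch of the tag-array/data-array idea and of counting the hit rate. A real cache does this search in hardware; the sizes, the tiny stand-in for main memory, and the placement policy here are assumptions for illustration only:

    #include <stdio.h>
    #include <stdbool.h>

    #define LINES 8                       /* cache size (in entries)           */
    #define MEM   64                      /* stand-in "main memory" size       */

    unsigned memory[MEM];                 /* stand-in for main memory          */
    unsigned tag_array[LINES];            /* which addresses are cached        */
    unsigned data_array[LINES];           /* the cached values                 */
    bool     valid[LINES];
    long     hits, misses;

    unsigned cache_read(unsigned addr)
    {
        for (unsigned i = 0; i < LINES; i++)
            if (valid[i] && tag_array[i] == addr) {
                hits++;                   /* cache hit                         */
                return data_array[i];
            }
        misses++;                         /* cache miss: go to memory          */
        unsigned slot = addr % LINES;     /* naive placement policy            */
        tag_array[slot]  = addr;
        data_array[slot] = memory[addr];
        valid[slot] = true;
        return memory[addr];
    }

    int main(void)
    {
        for (unsigned a = 0; a < MEM; a++) memory[a] = a * a;
        for (int pass = 0; pass < 10; pass++)     /* repeated (local) accesses */
            for (unsigned a = 0; a < LINES; a++)
                cache_read(a);
        printf("hit rate = %.0f%%\n", 100.0 * hits / (hits + misses));  /* 90% */
        return 0;
    }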

  28. Cache Controller • In addition to the tag and data arrays is the cache controller which runs the show. • When L2 cache was separate from the processor, the cache controller was part of the system chipset. • When L2 cache moved onto the microprocessor so too did the controller. • Now it is the L3 cache controller which is part of the system chipset. • Now even L3 is moving onto the microprocessor.

  29. One caches addresses (tags) and values

  30. Data Array versus Data Cache • The term data array refers to the set of values that are placed in cache. (It doesn’t matter what the values correspond to.) • The term data cache refers to the caching of data, as opposed to the instruction cache, where instructions are cached. • In a modern adaptation of the Harvard architecture, called the Harvard cache, data and instructions are sent to separate caches. • Unlike data, an instruction is unlikely to be updated (overwritten yes, updated no). Therefore the data cache and instruction cache can have different write policies.

  31. Capacity • The usual specification (spec) one is given for cache is called the capacity. • E.g. Norwood-core Pentium 4s have a 512 KB L2 cache. • The capacity refers only to the amount of information in the data array (values). • The spec does not include the tag array (addresses), the dirty bits, and so on – though they must of course be there.

  32. Lines and Line Lengths • The basic unit of memory is a byte; the basic unit of cache is a line. • Be careful not to use the word “block” in place of “line.” In cache, blocking means that upon a cache miss, one must write the new values to cache before proceeding. • A line consists of many bytes (typically a power of 2, such as 32, 64 or 128). The number of bytes in a line is called the line length.

  33. Memory → Cache • Because cache lines are bigger than memory locations, one does not store the full memory address in the tag array. • (Figure: consecutive memory locations 0, 1, 2, 3, … mapped into cache lines.)

  34. Example • Assume a capacity of 512 KB. • Don’t think of an array with 524,288 (512 K) elements with each element a byte long as you would if it were main memory. • Instead think of an array with 16,384 (16 K) elements with each element 32 bytes long.
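The arithmetic from this slide as a sketch (512 KB capacity and 32-byte lines, as stated above):

    /* Number of cache lines = capacity / line length. */
    #include <stdio.h>

    int main(void)
    {
        unsigned capacity_bytes = 512 * 1024;   /* 512 KB L2 cache       */
        unsigned line_length    = 32;           /* bytes per cache line  */
        unsigned lines          = capacity_bytes / line_length;

        printf("%u lines of %u bytes\n", lines, line_length);  /* 16384 lines */
        return 0;
    }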

  35. Line Length Benefits • The concept of cache lines has a few benefits • It directly builds in the notion of spatial locality: cache is physically designed to hold the contents of several consecutive memory locations. • Eventually we must perform a search on the tags to see if a particular memory address has been cached. The line length shortens the tag, i.e. the item one must search for. • In the example on the earlier slide, one would search for FFA instead of FFA3. That is, the tag is four bits smaller than the address.

  36. Line Length Benefits • The cached value must have been read from memory. Recall that one can significantly improve the efficiency of reading memory locations if they are consecutive locations (especially if they are all in the same row). • So the paging/bursting improvements of reading memory are particularly important because of the way cache is structured.

  37. Hardware Searching • The cache is handed a memory address; it strips off the least significant bits to form the corresponding search tag; it then must search the tag array for that value. • The most efficient search algorithm you know is useless at this level; we need to perform the search in a couple of clock cycles. We need to search using hardware.

  38. Variations • The hardware search can be executed in a number of ways, and this is where the terms direct-mapped, fully associative and set-associative come in. • The Pentium 4’s Advanced Transfer Cache has 8-way set associativity. • The variations determine how many comparators (circuitry that determines whether we have a hit or miss) are necessary.

  39. XNOR: Bit Equality Comparator

  40. ANDed XNORs: Word Equality Comparator
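Slides 39 and 40 show the comparator circuits as figures. A small software analogue of the same idea (the 16-bit word width is an assumption): the XNOR of two bits is 1 exactly when they are equal, and ANDing the XNOR of every bit pair gives word equality.

    #include <stdio.h>
    #include <stdint.h>

    /* Returns 1 when the two words are equal, 0 otherwise. */
    int word_equal(uint16_t a, uint16_t b)
    {
        uint16_t xnor = (uint16_t)~(a ^ b);   /* bit i is 1 iff a_i == b_i      */
        return xnor == 0xFFFF;                /* AND of all bits: all must be 1 */
    }

    int main(void)
    {
        printf("%d\n", word_equal(0x0FFA, 0x0FFA));   /* 1: tags match (hit)   */
        printf("%d\n", word_equal(0x0FFA, 0x0FFB));   /* 0: tags differ (miss) */
        return 0;
    }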

  41. Direct Mapping • Direct Mapping simplifies tag-array searching (i.e. minimizes the number of comparators) by saying that a given memory location can be cached in one and only one line of cache. • The mapping is not one-to-one. Since memory is about a thousand times bigger than cache, many memory locations share a cache line, and only one section of memory can be in there at a time.

  42. Direct Mapping • A given memory location is mapped to one and only one line of cache. But each line of cache corresponds to several (sets of) memory locations. Only one of these can be cached at a given time. • (Figure: memory locations mapped onto cache lines.)

  43. A Direct Mapping Scenario • (Figure: a memory address divided into three fields. The upper bits are the part of the address actually stored in the tag array; the middle bits determine the cache address that will be used; the lower bits determine the position within the line of cache.)

  44. A Direct Mapping Scenario (Cont.) • A memory address is handed to cache. • The middle portion is used to select the cache address. • The tag stored at that cache address and the upper portion of the original memory address are sent to a comparator. • Note there’s one comparator! • If they are equal (a cache hit), then the lower portion of the original memory address is used to select out the byte from within the line.
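A sketch of that sequence in C: splitting an address into tag, line index, and byte offset. The field widths (16 K lines of 32 bytes each, as in the earlier 512 KB example) and the sample address are assumptions for illustration:

    #include <stdio.h>

    #define OFFSET_BITS 5                  /* 32-byte line  */
    #define INDEX_BITS  14                 /* 16 K lines    */

    int main(void)
    {
        unsigned addr   = 0x1234ABCDu;                             /* example address */
        unsigned offset =  addr & ((1u << OFFSET_BITS) - 1);       /* low 5 bits      */
        unsigned line   = (addr >> OFFSET_BITS)
                            & ((1u << INDEX_BITS) - 1);            /* next 14 bits    */
        unsigned tag    =  addr >> (OFFSET_BITS + INDEX_BITS);     /* remaining bits  */

        printf("tag=%#x line=%#x offset=%#x\n", tag, line, offset);
        /* Hit test: one comparator checks tag against tag_array[line];
           on a hit, offset selects the byte within data_array[line].  */
        return 0;
    }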

  45. A Potential Problem with Direct Mapping • Recall that locality of reference (the notion behind caching) is particularly effective during repetitive tasks. • Imagine that a loop involves two memory locations that share the same cache address (perhaps it processes a large array). Then each time the processor wanted one of the locations, the other would be in the cache. Thus, there would be two cache misses for each iteration of the loop. But loops are when caching is supposed to be at its most effective. • TOO MANY CACHE MISSES!

  46. Fully Associative Cache: The Other Extreme • In Direct Mapping, a given memory location is mapped onto one and only one cache location. • In a Fully Associative Cache, a given memory location can be mapped to any cache location. • This solves the previous problem. There is no conflict; one caches whatever is needed for the loop. • But with a fully associative cache, searching becomes more difficult: one has to examine the entire tag array, whereas with direct mapping there was only one place to look.

  47. Associativity = Many Comparators • Looping through the tag array would be prohibitively slow. We must compare the memory address (or the appropriate portion thereof) to all of the values in the tag array simultaneously.
