
CMPE 421 Parallel Computer Architecture


Presentation Transcript


  1. CMPE 421 Parallel Computer Architecture PART 2 CACHING

  2. Caching Principles • The idea is to use a small amount of fast memory near the processor (in a cache) • The cache holds frequently needed memory locations • When an instruction references a memory location, we want that value to be in the cache • For the time being, we will focus on a 2-level hierarchy • Cache (small, fast memory directly connected to the processor) at the upper level • Main memory (large, slow memory) at level 2 in the hierarchy

  3. Caching Principles • Transfer of data is done between adjacent levels in the hierarchy only • All access by the processor is to the topmost level • Performance depends on hit rates • A block of data is the unit copied between levels

  4. Caching Examples • Principle: Results of operations that are expensive should be kept around for reuse • Examples: • CPU caching • Forwarding table caching • File caching • Web caching • Query caching • Computation caching

  5. Cache Levels • Registers, a cache on variables • First-level cache, a cache on the second-level cache • Second-level cache, a cache on memory • Memory, a cache on disk (virtual memory) • TLB, a cache on the page table • Branch prediction, a cache on prediction information?

  6. Terminology • Block: The minimum unit of information transferred between the cache and main memory. Typically measured in bytes or words • Block addressing varies by technology at each level • Blocks are moved one level at a time • HIT: The data appears in a block in the upper level. When a program needs a particular data object d from the lower level, it first looks for d in one of the blocks currently stored at the upper level. If d happens to be cached at the upper level, we have what is called a cache hit • For example, a program with good temporal locality might read a data object from a block already in the upper level, resulting in a cache hit • Analogies: a browser's local copies of remote HTML files stored on web servers; finding the information in one of the books on your desk

  7. Terminology • MISS: The data was not in the upper level and had to be fetched from a lower level. When there is a miss, the cache at the upper level fetches the block containing the requested data, possibly overwriting an existing block if the upper level is already full • HIT RATE: The ratio of hits to memory accesses at the upper level • Used as a measure of the performance of the memory hierarchy

  8. MISS EXAMPLE • Reading the data object from block 12 would result in a cache miss, because block 12 is not currently stored in the upper-level cache. Once it has been copied from the lower level to the upper level, block 12 will remain there in expectation of later accesses

  9. Terminology • MISS RATE: The ratio of misses to memory accesses at the upper level • Miss rate = 1 – Hit rate • HIT TIME: Time to access the upper level (cache) • Hit time = tc = Access time (cache to processor) + Time to determine hit/miss (time to find out whether it is in the cache) • Ex: The time needed to look through the books on the desk

  10. Terminology • MISS PENALTY: The time to replace a block in the cache with a block from main memory and to deliver the element to the processor • Miss penalty = tc + tm = Lower-level access time + Replacement time + Time to deliver to the upper level • Ex: The time to get another book from the shelves and place it on the desk • The miss penalty is usually much larger than the hit time, because the upper level is smaller and built using faster memory parts • The time to examine the books on the desk is much smaller than the time to get up and fetch a new book from the shelves • Ex: Hit ratio = 0.9, so miss ratio = 1.0 – 0.9 = 0.1 • Ideally hit ratio = 1.0 and miss ratio = 0.0; in practice the hit ratio is below 1.0, typically 0.95 or better

  11. Handling a Cache Miss • A cache hit, if it happens in 1 cycle, has no effect on our pipeline, but a cache miss does • The action required depends on whether we have an instruction miss or a data miss • For an instruction miss: • Send the original PC value to the memory • Instruct main memory to perform a read and wait for the memory to complete the access • Write the result into the appropriate cache entry • Restart the instruction • For a data miss: • Stall the pipeline • Instruct main memory to perform a read and wait for the memory to complete the access • Return the result from the memory unit and allow the pipeline to continue

  12. Exploiting Locality • We need to update the contents of the cache with useful data – leverage locality • Spatial locality – rather than fetching just the word that missed, fetch a block of data around the word that missed • If you need these nearby words (and you often do), they will now hit • This is also good because memory systems can deliver large blocks of data cheaply once they have been accessed (disk/DRAM) • Temporal locality – keep more recently accessed data items closer to the processor, so when we need space in the cache, evict the old ones
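
To make the two kinds of locality concrete, here is a small, illustrative C sketch (not from the slides): the stride-1 inner loop touches consecutive words of each fetched block, so one miss pays for several later hits (spatial locality), while the accumulator sum is reused on every iteration (temporal locality).

    #include <stdio.h>
    #define N 512

    static double a[N][N];   /* rows are contiguous in memory */

    int main(void) {
        double sum = 0.0;                 /* reused every iteration: temporal locality */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)   /* stride-1 walk through each row: spatial locality */
                sum += a[i][j];
        /* Swapping the loops (j outer, i inner) would stride N doubles per
         * access and miss on nearly every reference for large N. */
        printf("sum = %f\n", sum);
        return 0;
    }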

  13. Access Times • Average access time = (hit time × hit rate) + (miss penalty × miss rate) • The hope is that the hit time will be low and the hit rate high, since the miss penalty is so much larger than the hit time • This is the Average Memory Access Time (AMAT) • The formula can be applied to any level of the hierarchy • It can be generalized for the entire hierarchy
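
As a worked example of the formula above, here is a short C sketch; the cycle counts are assumed values for illustration, not numbers from the lecture.

    #include <stdio.h>

    int main(void) {
        /* Average access time = (hit time x hit rate) + (miss penalty x miss rate) */
        double hit_time     = 1.0;          /* tc: cache access, in cycles (assumed) */
        double miss_penalty = 1.0 + 100.0;  /* tc + tm: cache plus main memory (assumed) */
        double hit_rate     = 0.95;
        double miss_rate    = 1.0 - hit_rate;

        double amat = hit_time * hit_rate + miss_penalty * miss_rate;
        printf("AMAT = %.2f cycles\n", amat);   /* 0.95 + 5.05 = 6.00 cycles */
        return 0;
    }

Note how heavily the slow memory weighs in: even a 5% miss rate contributes 5.05 of the 6.00 cycles, which is why raising the hit rate matters so much.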

  14. Simple Cache Model • Assume that the processor accesses memory one word at a time. • A block consists of one word. • When a word is referenced and is not in the cache, it is put in the cache (copied from main memory).

  15. Cache Usage • At some point in time the cache holds memory items X1, X2, …, Xn-1 • The processor next accesses memory item Xn, which is not in the cache • How do we know if an item is in the cache? • If it is in the cache, how do we know where it is?

  16. Cache Arrangement • How should the data in the cache be arranged? • There are several different approaches • Direct Mapped – Memory addresses map to a particular location in the cache • Fully Associative – Data can be placed anywhere in the cache • N-way Set Associative – Data can be placed in a limited number of places in the cache, depending upon the memory address

  17. Direct Mapped Cache Organization • Each memory location is mapped to a single location in the cache – there is only one place it can be! • Remember that the cache is smaller than memory, so many memory locations will be mapped to the same location in the cache.

  18. Mapping Function • The simplest mapping is based on the least-significant (LS) bits of the address. • For example, all memory locations whose address ends in 001 will be mapped to the same location in the cache. • This requires a cache size of 2^n locations (a power of 2).

  19. A Direct Mapped Cache • Memory addresses are mapped to a cache index • The index is given by (block address) modulo (number of blocks in the cache) • If the cache size is a power of 2, the modulo operation simply throws away some high-order bits of the address • Ex: A direct-mapped cache of 8 words, with a memory size of 32 words • Use log2 8 = 3 bits for the cache address = XXX • Since the cache size is a power of 2, throw away the higher bits • Ex: 00 001 => cache addr = 001 • 11 101 => cache addr = 101
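
The slide's example can be checked with a short C sketch (illustrative): because the number of blocks is a power of 2, the modulo operation and a bit mask give the same index.

    #include <stdio.h>

    int main(void) {
        unsigned num_blocks = 8;                /* log2(8) = 3 index bits */
        unsigned addrs[]    = { 0x01, 0x1D };   /* 00001 and 11101 from the slide */

        for (int i = 0; i < 2; i++) {
            unsigned idx_mod  = addrs[i] % num_blocks;        /* (block address) modulo (num blocks) */
            unsigned idx_mask = addrs[i] & (num_blocks - 1);  /* same result: the high bits are dropped */
            printf("addr %2u -> index %u (mod), %u (mask)\n",
                   addrs[i], idx_mod, idx_mask);              /* 1 -> 1 (001), 29 -> 5 (101) */
        }
        return 0;
    }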

  20. Problem With Direct Mapped Cache • We still need a way to find out which of the many possible memory elements is currently in a cache slot – slot: a location in the cache that can hold a block. • We need to store the address of the item currently using cache slot 001. • We therefore add a tag to each cache entry that identifies which address it currently contains, by storing the MSBs that uniquely identify that memory address (the LSBs select a particular cache slot). • The tag associated with a cache slot tells who is currently using the slot. • We don't need to store the entire memory address, just those bits that are not used to determine the slot number (the mapping).

  21. Solution: The Tag • A field in a table used for a memory hierarchy that contains the address information required to identify whether the associated block in the hierarchy corresponds to a requested word
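
Continuing the 8-slot example above (word addresses, 3 index bits), here is a minimal sketch of how the tag and index fall out of splitting an address; the code is illustrative.

    #include <stdio.h>

    #define INDEX_BITS 3   /* log2 of the 8 cache slots in the earlier example */

    int main(void) {
        unsigned addr  = 0x1D;                             /* 11101: tag 11, index 101 */
        unsigned index = addr & ((1u << INDEX_BITS) - 1);  /* LSBs select the slot */
        unsigned tag   = addr >> INDEX_BITS;               /* remaining MSBs are stored as the tag */
        printf("addr %u -> tag %u, index %u\n", addr, tag, index);   /* tag 3, index 5 */
        return 0;
    }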

  22. Initialization Problem • Initially the cache is empty – all the bits in the cache (including the tags) will have random values. • After some number of accesses, some of the tags are real and some are still just random junk. • How do we know which cache slots are junk and which really mean something?

  23. Answer: Introduce Valid Bits • Include one more bit with each cache slot that indicates whether the tag is valid or not. • Provide hardware to initialize these bits to 0 (one bit per cache slot). • When checking a cache slot for a specific memory location, ignore the tag if the valid bit is 0. • Change a slot's valid bit to 1 when putting something in the slot (from main memory).
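
Putting slides 17–23 together, here is a minimal C sketch of a direct-mapped, one-word-per-block cache with tags and valid bits; the structure and names are illustrative, not from the lecture.

    #include <stdio.h>
    #include <stdbool.h>

    #define NUM_SLOTS  8
    #define INDEX_BITS 3   /* log2(NUM_SLOTS) */

    struct slot { bool valid; unsigned tag; unsigned data; };

    static struct slot cache[NUM_SLOTS];   /* statics start zeroed, so every valid bit is 0 */
    static unsigned memory[32];            /* tiny 32-word main memory */

    /* Returns true on a hit; on a miss, copies the word from memory
     * into the slot and sets its valid bit. */
    static bool access_word(unsigned addr, unsigned *out) {
        unsigned index = addr & (NUM_SLOTS - 1);
        unsigned tag   = addr >> INDEX_BITS;
        struct slot *s = &cache[index];

        if (s->valid && s->tag == tag) {   /* ignore the tag when valid == 0 */
            *out = s->data;
            return true;
        }
        s->valid = true;                   /* fill the slot on a miss */
        s->tag   = tag;
        s->data  = memory[addr];
        *out = s->data;
        return false;
    }

    int main(void) {
        for (unsigned i = 0; i < 32; i++) memory[i] = i * 10;
        unsigned v;
        printf("addr 29: %s\n", access_word(29, &v) ? "hit" : "miss");  /* miss (cold cache) */
        printf("addr 29: %s\n", access_word(29, &v) ? "hit" : "miss");  /* hit */
        printf("addr 13: %s\n", access_word(13, &v) ? "hit" : "miss");  /* miss: 01101 shares index 101 */
        printf("addr 29: %s\n", access_word(29, &v) ? "hit" : "miss");  /* miss again: 13 evicted 29 */
        return 0;
    }

Addresses 29 (11101) and 13 (01101) share index 101 but differ in tag, so they evict each other; this conflict behavior is what the set-associative arrangements listed on slide 16 relax.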

  24. Direct Mapped Cache with Valid Bit
