Memory Hierarchy Overview: Architecture & Optimization Strategies

University of Palestine
Faculty of Engineering and Urban Planning
Software Engineering Department
Computer System Architecture ESGD2204
Chapter 7, Lecture 13
Eng. Mohammed Timraz, Electronics & Communication Engineer
Saturday, 16th April 2010

Chapter 7 Memory Level

4 Questions for Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)

Q1: Where can a block be placed in the upper level?
Example: memory block 12 placed in an 8-block cache that is fully associative, direct mapped, or 2-way set associative.
Set-associative mapping: Set = Block Number modulo (Number of Sets)
• Fully associative: block 12 can go in any of the 8 cache blocks (0-7).
• Direct mapped: (12 mod 8) = 4, so only cache block 4.
• 2-way set associative: (12 mod 4) = 0, so either block of set 0.
(Figure: an 8-block cache, blocks 0-7, drawn once per organization above a 32-block memory, blocks 0-31. The allowed cache blocks for block 12 are shown in blue.)
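The modulo rule can be checked with a short Python sketch. The function name `allowed_frames`, its parameters, and the assumption that each set occupies consecutive cache frames are illustrative choices, not part of the lecture.

```python
def allowed_frames(block_number, num_frames, ways):
    """Return the cache frames in which a memory block may be placed.

    ways == 1 is direct mapped; ways == num_frames is fully associative.
    Assumption (for illustration only): set s occupies frames
    s*ways .. s*ways + ways - 1.
    """
    num_sets = num_frames // ways
    set_index = block_number % num_sets  # Block Number modulo Number of Sets
    first = set_index * ways
    return list(range(first, first + ways))


direct = allowed_frames(12, num_frames=8, ways=1)   # 12 mod 8 = 4
two_way = allowed_frames(12, num_frames=8, ways=2)  # 12 mod 4 = 0, i.e. set 0
full = allowed_frames(12, num_frames=8, ways=8)     # a single set: any frame
```

With these inputs, `direct` holds only frame 4, `two_way` holds the two frames of set 0, and `full` holds all eight frames.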

Q2: How is a block found if it is in the upper level (the cache)?
Bit fields in the memory address used to access a cache word:
• Tag: 18 bits
• Index: 8 bits (256 entries per cache)
• Block offset: 6 bits (64 bytes/block), i.e. 4 bits (16 words/block) + 2 bits (4 bytes/word)
Example: (one-way) direct-mapped cache, data capacity 16 KB = 256 x 512 / 8 (256 entries of 512 bits each, 8 bits per byte).
• The index selects the cache set: the location of all possible blocks for that address.
• The tag is compared for each block in the set; there is no need to check the index or offset bits.
• Increasing associativity shrinks the index and expands the tag.
(Figure: bit fields in the memory address. The block address holds the tag and index, followed by the block offset. In virtual memory the "cache block" is a.k.a. a page, and the offset is the offset bits in the page.)
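The field split can be sketched in Python. This assumes the 18-bit tag, 8-bit index, and 6-bit offset form one 32-bit address (they sum to 32); the names `split_address` and `capacity_bytes` and the sample address are illustrative only.

```python
OFFSET_BITS = 6   # 64 bytes per block
INDEX_BITS = 8    # 256 entries in the direct-mapped cache
TAG_BITS = 18     # remaining high-order bits of an assumed 32-bit address


def split_address(addr):
    """Split an address into (tag, index, offset) bit fields."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


# Data capacity: 256 entries x 64 bytes per block = 16 KB
capacity_bytes = (1 << INDEX_BITS) * (1 << OFFSET_BITS)

# Arbitrary sample address, chosen only for illustration
tag, index, offset = split_address(0x12345678)
```

On a lookup, only `index` is used to select the entry and only `tag` is compared against the stored tag; `offset` picks the byte within the block after a hit.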

Q3: Which block should be replaced on a miss?
(After start-up, the cache is nearly always full.)
Easy if direct mapped (only 1 block, "1 way", per index).
If set associative or fully associative, must choose:
• Random ("Ran"): easy to implement, but not best if only 2-way (1 bit/way).
• LRU (Least Recently Used): best, but hard to implement if > 8-way.
• Other LRU approximations are also better than Random.

Miss rates for 3 cache sizes and associativities

Data size    2-way           4-way           8-way
             LRU     Ran     LRU     Ran     LRU     Ran
16 KB        5.2%    5.7%    4.7%    5.3%    4.4%    5.0%
64 KB        1.9%    2.0%    1.5%    1.7%    1.4%    1.5%
256 KB       1.15%   1.17%   1.13%   1.13%   1.12%   1.12%

Random replacement gives about the same low miss rate as LRU for large caches.
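As a minimal sketch of LRU replacement within one set, the Python below keeps blocks ordered from least to most recently used. The class name `LRUSet` and the tag labels are hypothetical; real hardware tracks recency with a few status bits rather than an ordered dictionary.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with `ways` block frames and LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag):
        """Return True on a hit, False on a miss (the block is then loaded)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # now the most recently used
            return True
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict the least recently used
        self.blocks[tag] = None
        return False


lru_set = LRUSet(ways=2)
hits = [lru_set.access(t) for t in ["A", "B", "A", "C", "B"]]
```

In this trace the third access (to A) is the only hit: C then evicts B, the least recently used block, and the final access to B evicts A.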

Q4: Write policy: What happens on a write?
Additional option: let writes to an un-cached address allocate a new cache line ("write-allocate"); otherwise just write through to the lower level without allocating a line.

Write Buffers for Write-Through Caches
(Figure: Processor, Cache, Write Buffer, Lower Level Memory.)
The write buffer holds data (and addresses) awaiting write-through to lower levels.
Q. Why a write buffer?
A. So the CPU does not stall for writes.
Q. Why a buffer, why not just one register?
A. Bursts of writes are common.
Q. Are Read After Write (RAW) hazards an issue for the write buffer?
A. Yes! Drain the buffer before the next read, or check the buffer addresses before a read miss.
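The second RAW remedy, checking buffer addresses before a read miss, can be sketched as follows. This is a simplified software model, not a hardware description: the class `WriteBuffer`, its default capacity of 4, and the dictionary standing in for lower-level memory are all assumptions made for illustration.

```python
from collections import deque


class WriteBuffer:
    """FIFO of pending (address, data) writes to the lower level."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = deque()  # oldest write first

    def write(self, address, data, memory):
        if len(self.entries) == self.capacity:
            self.drain_one(memory)  # buffer full: the CPU would stall here
        self.entries.append((address, data))

    def drain_one(self, memory):
        address, data = self.entries.popleft()
        memory[address] = data

    def read_miss(self, address, memory):
        # RAW check: the newest buffered write to this address wins.
        for buffered_address, data in reversed(self.entries):
            if buffered_address == address:
                return data
        return memory.get(address)


memory = {0x10: 1}             # stand-in for lower-level memory
buf = WriteBuffer()
buf.write(0x10, 99, memory)    # write is buffered, memory still holds 1
value = buf.read_miss(0x10, memory)
```

Here `value` is the buffered 99 rather than the stale 1 still in `memory`, which is exactly the hazard the address check avoids.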

5 Basic Cache Optimizations
Reducing Miss Rate
1. Larger block size (reduces Compulsory, "cold", misses)
2. Larger cache size (reduces Capacity misses)
3. Higher associativity (reduces Conflict misses)
(... and multiprocessors have cache Coherence misses, giving the 4 Cs)
Reducing Miss Penalty
4. Multilevel caches: total miss rate $= \prod_{k=1}^{\max} (\text{local miss rate}_k)$, the product of the local miss rates of all levels $k$
Reducing Hit Time (minimal cache latency)
5. Giving reads priority over writes, since the CPU is waiting: a read completes before earlier writes in the write buffer
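The product rule for multilevel caches is easy to evaluate directly. The sketch below uses hypothetical local miss rates (5% at L1, 20% at L2) chosen only for illustration, and relies on `math.prod`, available in Python 3.8 and later.

```python
from math import prod


def total_miss_rate(local_miss_rates):
    """Total (global) miss rate = product of the local miss rates of all levels."""
    return prod(local_miss_rates)


# Hypothetical two-level hierarchy: 5% of accesses miss in L1,
# and 20% of those L1 misses also miss in L2.
rate = total_miss_rate([0.05, 0.20])
```

With these assumed numbers about 1% of all accesses go to main memory, even though the second-level cache on its own misses 20% of the time.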

Definition: Performance
• $\text{performance}(x) = \dfrac{1}{\text{execution\_time}(x)}$
• Performance is in units of things done per second
• Bigger is better
• If we are primarily concerned with response time, "X is N times faster than Y" means
  $N = \dfrac{\text{Performance}(X)}{\text{Performance}(Y)} = \dfrac{\text{Execution\_time}(Y)}{\text{Execution\_time}(X)}$
• The speedup = N = the BIG time / the little time
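A small Python sketch of the definition, using made-up execution times of 10 s for X and 15 s for Y (both the function names and the timings are illustrative assumptions):

```python
def performance(execution_time):
    """Performance is the reciprocal of execution time."""
    return 1.0 / execution_time


def speedup(time_x, time_y):
    """N such that 'X is N times faster than Y'."""
    return time_y / time_x


# Hypothetical timings: X takes 10 s, Y takes 15 s
n = speedup(time_x=10.0, time_y=15.0)
ratio = performance(10.0) / performance(15.0)
```

Both `n` and `ratio` express the same quantity: the big time (15 s) divided by the little time (10 s), so X is 1.5 times faster than Y.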

Performance: What to measure
• Usually rely on benchmarks vs. real workloads
• To increase predictability, collections of benchmark applications, called benchmark suites, are popular
• SPECCPU: popular desktop benchmark suite
  CPU only, split between integer and floating-point programs
  SPECint2000 had 12 integer codes, SPECfp2000 had 14 floating-point codes
  SPEC CPU2006 has 12 integer benchmarks (CINT2006) and 17 floating-point benchmarks (CFP2006)
• SPECSFS (NFS file server) and SPECWeb (web server) have been added as server benchmarks

Performance: What to measure
• The Transaction Processing Performance Council (TPC) measures server performance and cost-performance for databases
• TPC-C: complex query for Online Transaction Processing
• TPC-H: models ad hoc decision support
• TPC-W: a transactional web benchmark
• TPC-App: application server and web services benchmark

Define and quantify dependability
How do we decide when a system is operating properly?
Infrastructure providers now offer Service Level Agreements (SLAs), which are guarantees of how dependable their networking or power service will be.
Systems alternate between two states of service:
1. Service accomplishment (working), where the service is delivered as specified in the SLA
2. Service interruption (not working), where the delivered service is different from the SLA
Failure = transition from state 1 (working) to state 2
Restoration = transition from state 2 (not working) to state 1
Fault, Error, Failure

Define and quantify dependability
Module reliability = a measure of continuous service accomplishment (or time to failure).
• Mean Time To Failure (MTTF) measures reliability
• Failures In Time (FIT) = 1/MTTF, the failure rate, usually reported as failures per billion hours of operation
• Mean Time To Repair (MTTR) measures service interruption
• Mean Time Between Failures (MTBF) = MTTF + MTTR
Module availability measures service as the module alternates between the two states of accomplishment and interruption (a number between 0 and 1, e.g. 0.9).
Module availability = MTTF / (MTTF + MTTR)
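The availability and FIT formulas can be tried out in a few lines of Python. The 1,000,000-hour MTTF and 24-hour MTTR below are hypothetical inputs, and the function names are illustrative.

```python
def availability(mttf_hours, mttr_hours):
    """Module availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def fit(mttf_hours):
    """Failure rate 1/MTTF, expressed as failures per billion (1e9) hours."""
    return 1e9 / mttf_hours


# Hypothetical module: 1,000,000-hour MTTF, 24-hour MTTR
a = availability(1_000_000, 24)
f = fit(1_000_000)
mtbf = 1_000_000 + 24  # MTBF = MTTF + MTTR
```

For this assumed module the availability is just under 1 and the failure rate is 1000 FIT.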

Example calculating reliability
If modules have exponentially distributed lifetimes (the age of a module does not affect its probability of failure), the overall failure rate (FIT) is the sum of the failure rates of the modules.
Calculate the FIT (rate) and MTTF (1/rate) for 10 disks (1M-hour MTTF per disk), 1 disk controller (0.5M-hour MTTF), and 1 power supply (0.2M-hour MTTF). The failure rate per hour is multiplied by 10^9 to express it in FIT.
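The arithmetic for this example can be carried out as below. The sketch uses Python's `fractions.Fraction` to keep the sums exact; the variable names are illustrative, and the component values are the ones stated in the exercise.

```python
from fractions import Fraction

# (count, MTTF in hours) for each kind of module in the exercise
components = [
    (10, 1_000_000),  # 10 disks, 1M-hour MTTF each
    (1, 500_000),     # 1 disk controller, 0.5M-hour MTTF
    (1, 200_000),     # 1 power supply, 0.2M-hour MTTF
]

# Failure rates add when lifetimes are exponentially distributed.
failure_rate = sum(Fraction(count, mttf) for count, mttf in components)

system_fit = failure_rate * 10**9  # failures per billion hours
system_mttf = 1 / failure_rate     # hours
```

The rates sum to 17 failures per million hours, i.e. 17,000 FIT, which corresponds to a system MTTF of roughly 59,000 hours.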

The Cache Design Space
Several interacting dimensions:
• cache size
• block size
• associativity
• replacement policy
• write-through vs. write-back
• write allocation
The optimal choice is a compromise:
• depends on access characteristics: workload, and use (I-cache, D-cache, TLB)
• depends on technology / cost
Simplicity often wins.
(Figure: the design space along the cache size, associativity, and block size axes, with curves for Factor A and Factor B running between Good and Bad as the parameter goes from Less to More.)

The Cache Design
The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
• Temporal locality: locality in time
• Spatial locality: locality in space
Three major uniprocessor categories of cache misses:
• Compulsory misses: sad facts of life. Example: cold-start misses.
• Capacity misses: increase cache size.
• Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping-pong effect!
Write policy: write-through vs. write-back.
Today CPU time is a function of (ops, cache misses) vs. just f(ops): increasing performance affects compilers, data structures, and algorithms.
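The ping-pong effect can be reproduced with a toy miss counter. This is a sketch under simplifying assumptions: the trace lists block numbers rather than byte addresses, replacement within a set is LRU, and the function name `count_misses` and the trace itself are made up for illustration.

```python
def count_misses(block_trace, num_frames, ways):
    """Count misses for a trace of block numbers, using LRU within each set."""
    num_sets = num_frames // ways
    sets = [[] for _ in range(num_sets)]  # each list is ordered LRU -> MRU
    misses = 0
    for block in block_trace:
        lines = sets[block % num_sets]
        if block in lines:
            lines.remove(block)   # hit: re-appended below as most recent
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)      # evict the least recently used block
        lines.append(block)
    return misses


# Blocks 0 and 8 collide in an 8-frame direct-mapped cache (both map to frame 0).
trace = [0, 8] * 4
direct_misses = count_misses(trace, num_frames=8, ways=1)
two_way_misses = count_misses(trace, num_frames=8, ways=2)
```

In the direct-mapped case every one of the eight accesses misses, because the two blocks keep evicting each other. With 2-way associativity only the two compulsory misses remain.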
