
Lecture 17: Case Studies


Presentation Transcript


  1. Lecture 17: Case Studies
  • Topics: case studies for virtual memory and cache hierarchies (Sections 5.10-5.17)

  2. Alpha Paged Virtual Memory
  • Each process has the following virtual memory space: seg0 (reserved for user text and data), kseg (reserved for the kernel), seg1 (reserved for page tables)
  • The Alpha uses separate instruction and data TLBs
  • The TLB entries can be used to map pages of different sizes

  3. Example Look-Up
  • [Figure: the page table (PTEs) lives in virtual memory; the TLB translates its pages into physical memory]
  • Virtual page abc → physical page xyz
  • If each PTE is 8 bytes, the PTE for abc sits at virtual address abc × 8 = lmn (relative to the page-table base)
  • Virtual addr lmn → physical addr pqr
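
As a hypothetical numeric instance of the bullet above: if the virtual page number abc were 0x12345 (a made-up value), the 8-byte PTEs would place its PTE at byte offset 0x12345 × 8 = 0x91A28 from the page-table base; that virtual address (lmn) is then translated like any other, yielding the physical address pqr at which the PTE actually resides.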

  4. Alpha Address Mapping
  • [Figure: the 64-bit virtual address is split into unused bits (21), Level 1 index (10 bits), Level 2 index (10 bits), Level 3 index (10 bits), and page offset (13 bits)]
  • The page table base register plus the Level 1 index locates a PTE in the L1 page table; that PTE plus the Level 2 index locates a PTE in the L2 page table; that PTE plus the Level 3 index locates the final PTE in the L3 page table
  • The final PTE provides a 32-bit physical page number, which is combined with the 13-bit page offset to form the 45-bit physical address
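
A minimal C sketch of this walk, under stated assumptions: the field widths follow the figure, but phys_read64(), page_table_base, and the PTE layout (physical page number in the low 32 bits) are hypothetical stand-ins for illustration, not the actual Alpha PTE format.

```c
#include <stdint.h>

extern uint64_t phys_read64(uint64_t phys_addr);  /* hypothetical memory read */
extern uint64_t page_table_base;                  /* hypothetical base register */

uint64_t translate(uint64_t vaddr)
{
    uint64_t off = vaddr & 0x1FFF;            /* 13-bit page offset   */
    uint64_t l3  = (vaddr >> 13) & 0x3FF;     /* 10-bit Level 3 index */
    uint64_t l2  = (vaddr >> 23) & 0x3FF;     /* 10-bit Level 2 index */
    uint64_t l1  = (vaddr >> 33) & 0x3FF;     /* 10-bit Level 1 index */

    /* Each level: base of that table + (index * 8-byte PTE). */
    uint64_t pte1 = phys_read64(page_table_base + l1 * 8);
    uint64_t pte2 = phys_read64((pte1 & 0xFFFFFFFFu) * 8192 + l2 * 8);
    uint64_t pte3 = phys_read64((pte2 & 0xFFFFFFFFu) * 8192 + l3 * 8);

    uint64_t ppn = pte3 & 0xFFFFFFFFu;        /* 32-bit physical page number */
    return (ppn << 13) | off;                 /* 45-bit physical address     */
}
```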

  5. Alpha Address Mapping
  • Each PTE is 8 bytes – if the page size is 8 KB, a page can contain 1024 PTEs – 10 bits to index into each level
  • If the page size doubles, we need 47 bits of virtual address
  • Since a PTE only stores 32 bits of physical page number, the addressable physical memory is at most 2^(32 + offset bits) = 2^45 bytes
  • The first two levels are in physical memory; the third is in virtual memory
  • Why the three-level structure? Even a flat structure would need PTEs for the PTEs, and those would have to be stored in physical memory – more levels of indirection make it easier to dynamically allocate pages
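
To spell out the arithmetic behind these bullets (a sanity check, not from the slide): with 8 KB pages, 8 KB / 8 B per PTE = 1024 = 2^10 PTEs per page-table page, so each level needs 10 index bits and the mapped virtual address uses 3 × 10 + 13 = 43 bits. Doubling the page size to 16 KB gives 2048 = 2^11 PTEs per level and a 14-bit offset, so 3 × 11 + 14 = 47 bits. On the physical side, the 32-bit physical page number plus the 13-bit offset yields the 45-bit physical address of the previous slide, i.e. at most 2^45 bytes of addressable memory.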

  6. Bandwidth
  • Out-of-order superscalar processors can issue 4+ instructions per cycle → 2+ loads/stores per cycle → caches must provide low latency and high bandwidth
  • With effective caches, memory bandwidth requirements are usually low; unfortunately, memory bandwidth is easier to improve than memory latency
  • RDRAM improved memory bandwidth by a factor of eight, but improved performance by less than 2% for most applications and by 15% for some graphics apps
  • Bandwidth can help if you prefetch aggressively

  7. Cache Bandwidth
  • Interleaved cache: two 1-ported L1 D banks, one holding odd words and one holding even words (see the bank-selection sketch below)
  • Similar area to a 1-ported cache
  • More complexity in routing addresses/data
  • Slight penalty when both accesses conflict for the same bank
  • Multi-ported cell: a 2-ported L1 D built from multi-ported cells has a 3-cycle access time, versus a 2-cycle access time for a 1-ported L1 D
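
A minimal sketch of how the interleaved design steers an access, assuming 8-byte words (the word size is not stated on the slide): one low-order address bit picks the odd or even bank, and two accesses in the same cycle conflict only when they pick the same bank.

```c
#include <stdint.h>

/* Hypothetical bank selection for the odd/even interleaved L1 D.
 * Assumes 8-byte words; both banks can be accessed in the same cycle
 * as long as the two addresses map to different banks. */
static inline int dcache_bank(uint64_t addr)
{
    return (addr >> 3) & 1;   /* bit 3 picks odd vs. even 8-byte word */
}

/* Two simultaneous accesses pay the "slight penalty" from the slide
 * only when they target the same bank. */
static inline int bank_conflict(uint64_t addr_a, uint64_t addr_b)
{
    return dcache_bank(addr_a) == dcache_bank(addr_b);
}
```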

  8. Prefetching
  • High memory latency and cache misses are unavoidable
  • Prefetching is one of the most effective ways to hide memory latency
  • Some programs are hard to prefetch for – unpredictable branches, irregular traversal of arrays, hash tables, pointer-based data structures
  • Aggressive prefetching can pollute the cache and can compete for memory bandwidth
  • Prefetch design for: (i) array accesses, (ii) pointers

  9. Stride Prefetching
  • Constant strides are relatively easy to detect
  • Keep track of the last address fetched by a PC – compare with the current address to confirm a constant stride (see the sketch below)
  • Every access triggers a fetch of the next word – in fact, the prefetcher tries to stay far enough ahead to entirely hide memory latency
  • Prefetched words are stored in a buffer to reduce cache pollution
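
A minimal PC-indexed stride prefetcher sketched in C under stated assumptions: the table size, confidence threshold, prefetch distance, and the issue_prefetch() hook are illustrative choices, not details from the slide.

```c
#include <stdint.h>

/* Hypothetical memory-system hook: ask the hierarchy to start fetching addr. */
extern void issue_prefetch(uint64_t addr);

#define TABLE_SIZE  256   /* illustrative table size          */
#define CONF_THRESH 2     /* prefetch after 2 confirmations   */

struct stride_entry {
    uint64_t last_addr;   /* last address fetched by this PC      */
    int64_t  stride;      /* last observed stride                 */
    int      confidence;  /* how many times the stride repeated   */
};

static struct stride_entry table[TABLE_SIZE];

/* Called on every load; pc identifies the load instruction. */
void stride_prefetch(uint64_t pc, uint64_t addr)
{
    struct stride_entry *e = &table[(pc >> 2) % TABLE_SIZE];
    int64_t stride = (int64_t)(addr - e->last_addr);

    if (stride != 0 && stride == e->stride) {
        /* Constant stride confirmed: grow confidence and, once confident,
         * prefetch ahead to hide latency (distance 4 is an illustrative pick). */
        if (e->confidence < CONF_THRESH)
            e->confidence++;
        if (e->confidence >= CONF_THRESH)
            issue_prefetch(addr + 4 * stride);
    } else {
        /* Stride changed: retrain. */
        e->stride = stride;
        e->confidence = 0;
    }
    e->last_addr = addr;
}
```

In keeping with the last bullet, issue_prefetch() would place the fetched words in a separate prefetch buffer rather than directly in the cache, to limit pollution.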

  10. Cache Power Consumption
  • Instruction caches can save on decode time and power by storing instructions in decoded form (trace caches)
  • Memory accesses are power hungry – caches can also help reduce power consumption

  11. Alpha 21264 Instruction Hierarchy
  • When powered on, initialization code is read from an external PROM and executed in privileged architecture library (PAL) mode with no virtual memory
  • The I-cache is virtually indexed and virtually tagged – this avoids an I-TLB look-up for every access – correctness is not compromised because instructions are not modified
  • Each I-cache block saves 11 bits to predict the index of the next set that is going to be accessed and 1 bit to predict the way – line and way prediction
  • An I-cache miss looks up a prefetch buffer and a 128-entry fully-associative TLB before accessing L2
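
One way to picture the line-and-way-prediction bullet is as two small fields attached to every I-cache block; the struct below is a sketch under that reading (the field names, type layout, and 64 B block size are made up here, not the 21264's actual structures).

```c
#include <stdint.h>

/* Hypothetical per-block state for line and way prediction. */
struct icache_block {
    uint64_t     vtag;            /* virtual tag (the I-cache is virtually tagged) */
    uint8_t      data[64];        /* instruction bytes (block size assumed)        */
    unsigned int next_index : 11; /* predicted set index of the next fetch         */
    unsigned int next_way   : 1;  /* predicted way of the next fetch               */
};

/* The next fetch starts speculatively from (next_index, next_way); the
 * prediction is verified afterwards and the fields are retrained on a
 * mispredict. */
```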

  12. 21264 Cache Hierarchy
  • The L2 cache is off-chip and direct-mapped (the 21364 moves L2 onto the chip)
  • Every L2 fetch also fetches the next four physical blocks, without exceeding the page boundary (see the sketch below)
  • L2 is write-back
  • The processor has a 128-bit data path to L2 and a 64-bit data path to memory
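
The "next four physical blocks" bullet boils down to simple address arithmetic; the sketch below assumes a 64 B block and an 8 KB page (assumptions for illustration) and uses a hypothetical issue_l2_fetch() hook.

```c
#include <stdint.h>

#define BLOCK_SIZE 64u     /* assumed L2 block size      */
#define PAGE_SIZE  8192u   /* assumed physical page size */

extern void issue_l2_fetch(uint64_t phys_addr);   /* hypothetical hook */

/* On an L2 fetch of miss_addr, also fetch the next four physical blocks,
 * stopping at the page boundary. */
void l2_fetch_with_prefetch(uint64_t miss_addr)
{
    uint64_t block    = miss_addr & ~(uint64_t)(BLOCK_SIZE - 1);
    uint64_t page_end = (miss_addr & ~(uint64_t)(PAGE_SIZE - 1)) + PAGE_SIZE;

    issue_l2_fetch(block);                        /* the demand fetch */
    for (int i = 1; i <= 4; i++) {
        uint64_t next = block + i * BLOCK_SIZE;
        if (next >= page_end)                     /* don't cross the page */
            break;
        issue_l2_fetch(next);
    }
}
```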

  13. 21264 Data Cache
  • The L1 data cache is write-back, virtually indexed, physically tagged, and backed up by a victim buffer
  • On a miss, the processor checks the other possible L1 cache locations for a synonym in parallel with the L2 look-up (recall the two alternative techniques to deal with the synonym problem; the arithmetic below makes this concrete)
  • No prefetching for data cache misses
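
To make the synonym check concrete (a back-of-the-envelope calculation; the 2-way associativity and 64 B block size are assumptions here, with the 64 KB L1 size taken from the next slide and 8 KB pages): each way holds 64 KB / 2 = 32 KB, so index plus block offset spans 15 bits while the page offset supplies only 13, leaving 2 index bits that come from the virtual page number. A given physical block can therefore map to any of 2^2 = 4 different indices, and those are the other L1 locations probed for a synonym in parallel with the L2 look-up.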

  14. 21264 Performance
  • 21164: 8 KB L1s and 96 KB L2; 21264: 64 KB L1 and off-chip 1 MB L2
  • The 21264 is out-of-order and can tolerate L1 misses → speedup is a function of 21164 L2 misses that are captured by the 21264's L2
  • Commercial database/server applications stress the memory system much more than SPEC/desktop applications

  15. Sun Fire 6800 Server
  • Intended for commercial applications → aggressive memory hierarchy design:
  • 8 MB off-chip L2
  • Wide buses going to L2 and memory for bandwidth
  • On-chip memory controller to reduce latency
  • On-chip L2 tags to save latency on a miss
  • ECC and parity bits for all external traffic to provide high reliability
  • Large store buffers (write caches) between L1 and L2
  • Data prefetch engine that detects strides
  • Instruction prefetch that stays one block ahead of decode
  • Two parallel TLBs: 128-entry 4-way and 16-entry fully-associative

