
Lecture 17: Case Studies


Presentation Transcript


  1. Lecture 17: Case Studies
  • Topics: case studies for virtual memory and cache hierarchies (Sections 5.10-5.17)

  2. Alpha Paged Virtual Memory
  • Each process has the following virtual memory space: seg0 (reserved for user text and data), kseg (reserved for the kernel), seg1 (reserved for page tables)
  • The Alpha uses separate instruction and data TLBs
  • The TLB entries can be used to map pages of different sizes

  3. Example Look-Up
  • [Figure: the page table (PTEs) lives in virtual memory; the TLB translates its pages into physical memory]
  • Virtual page abc → physical page xyz
  • If each PTE is 8 bytes, the PTE for abc sits at virtual address abc × 8 = lmn (relative to the page-table base)
  • Virtual addr lmn → physical addr pqr
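
As a hypothetical numeric instance of the bullet above: if the virtual page number abc were 0x12345 (a made-up value), the 8-byte PTEs would place its PTE at byte offset 0x12345 × 8 = 0x91A28 from the page-table base; that virtual address (lmn) is then translated like any other, yielding the physical address pqr at which the PTE actually resides.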

  4. Alpha Address Mapping
  • [Figure: the 64-bit virtual address is split into unused bits (21), Level 1 index (10 bits), Level 2 index (10 bits), Level 3 index (10 bits), and page offset (13 bits)]
  • The page table base register plus the Level 1 index locates a PTE in the L1 page table; that PTE plus the Level 2 index locates a PTE in the L2 page table; that PTE plus the Level 3 index locates the final PTE in the L3 page table
  • The final PTE provides a 32-bit physical page number, which is combined with the 13-bit page offset to form the 45-bit physical address
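
A minimal C sketch of this walk, under stated assumptions: the field widths follow the figure, but phys_read64(), page_table_base, and the PTE layout (physical page number in the low 32 bits) are hypothetical stand-ins for illustration, not the actual Alpha PTE format.

```c
#include <stdint.h>

extern uint64_t phys_read64(uint64_t phys_addr);  /* hypothetical memory read */
extern uint64_t page_table_base;                  /* hypothetical base register */

uint64_t translate(uint64_t vaddr)
{
    uint64_t off = vaddr & 0x1FFF;            /* 13-bit page offset   */
    uint64_t l3  = (vaddr >> 13) & 0x3FF;     /* 10-bit Level 3 index */
    uint64_t l2  = (vaddr >> 23) & 0x3FF;     /* 10-bit Level 2 index */
    uint64_t l1  = (vaddr >> 33) & 0x3FF;     /* 10-bit Level 1 index */

    /* Each level: base of that table + (index * 8-byte PTE). */
    uint64_t pte1 = phys_read64(page_table_base + l1 * 8);
    uint64_t pte2 = phys_read64((pte1 & 0xFFFFFFFFu) * 8192 + l2 * 8);
    uint64_t pte3 = phys_read64((pte2 & 0xFFFFFFFFu) * 8192 + l3 * 8);

    uint64_t ppn = pte3 & 0xFFFFFFFFu;        /* 32-bit physical page number */
    return (ppn << 13) | off;                 /* 45-bit physical address     */
}
```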

  5. Alpha Address Mapping
  • Each PTE is 8 bytes – if the page size is 8 KB, a page can contain 1024 PTEs – 10 bits to index into each level
  • If the page size doubles, we need 47 bits of virtual address
  • Since a PTE only stores 32 bits of physical page number, the addressable physical memory is at most 2^(32 + offset bits) = 2^45 bytes
  • The first two levels are in physical memory; the third is in virtual memory
  • Why the three-level structure? Even a flat structure would need PTEs for the PTEs, and those would have to be stored in physical memory – more levels of indirection make it easier to dynamically allocate pages
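
To spell out the arithmetic behind these bullets (a sanity check, not from the slide): with 8 KB pages, 8 KB / 8 B per PTE = 1024 = 2^10 PTEs per page-table page, so each level needs 10 index bits and the mapped virtual address uses 3 × 10 + 13 = 43 bits. Doubling the page size to 16 KB gives 2048 = 2^11 PTEs per level and a 14-bit offset, so 3 × 11 + 14 = 47 bits. On the physical side, the 32-bit physical page number plus the 13-bit offset yields the 45-bit physical address of the previous slide, i.e. at most 2^45 bytes of addressable memory.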

  6. Bandwidth
  • Out-of-order superscalar processors can issue 4+ instructions per cycle → 2+ loads/stores per cycle → caches must provide low latency and high bandwidth
  • With effective caches, memory bandwidth requirements are usually low; unfortunately, memory bandwidth is easier to improve than memory latency
  • RDRAM improved memory bandwidth by a factor of eight, but improved performance by less than 2% for most applications and by 15% for some graphics apps
  • Bandwidth can help if you prefetch aggressively

  7. Cache Bandwidth
  • Interleaved cache: two 1-ported L1 D banks, one holding odd words and one holding even words (see the bank-selection sketch below)
  • Similar area to a 1-ported cache
  • More complexity in routing addresses/data
  • Slight penalty when both accesses conflict for the same bank
  • Multi-ported cell: a 2-ported L1 D built from multi-ported cells has a 3-cycle access time, versus a 2-cycle access time for a 1-ported L1 D
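
A minimal sketch of how the interleaved design steers an access, assuming 8-byte words (the word size is not stated on the slide): one low-order address bit picks the odd or even bank, and two accesses in the same cycle conflict only when they pick the same bank.

```c
#include <stdint.h>

/* Hypothetical bank selection for the odd/even interleaved L1 D.
 * Assumes 8-byte words; both banks can be accessed in the same cycle
 * as long as the two addresses map to different banks. */
static inline int dcache_bank(uint64_t addr)
{
    return (addr >> 3) & 1;   /* bit 3 picks odd vs. even 8-byte word */
}

/* Two simultaneous accesses pay the "slight penalty" from the slide
 * only when they target the same bank. */
static inline int bank_conflict(uint64_t addr_a, uint64_t addr_b)
{
    return dcache_bank(addr_a) == dcache_bank(addr_b);
}
```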

  8. Prefetching
  • High memory latency and cache misses are unavoidable
  • Prefetching is one of the most effective ways to hide memory latency
  • Some programs are hard to prefetch for – unpredictable branches, irregular traversal of arrays, hash tables, pointer-based data structures
  • Aggressive prefetching can pollute the cache and can compete for memory bandwidth
  • Prefetch design for: (i) array accesses, (ii) pointers

  9. Stride Prefetching
  • Constant strides are relatively easy to detect
  • Keep track of the last address fetched by a PC – compare with the current address to confirm a constant stride (see the sketch below)
  • Every access triggers a fetch of the next word – in fact, the prefetcher tries to stay far enough ahead to entirely hide memory latency
  • Prefetched words are stored in a buffer to reduce cache pollution
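
A minimal PC-indexed stride prefetcher sketched in C under stated assumptions: the table size, confidence threshold, prefetch distance, and the issue_prefetch() hook are illustrative choices, not details from the slide.

```c
#include <stdint.h>

/* Hypothetical memory-system hook: ask the hierarchy to start fetching addr. */
extern void issue_prefetch(uint64_t addr);

#define TABLE_SIZE  256   /* illustrative table size          */
#define CONF_THRESH 2     /* prefetch after 2 confirmations   */

struct stride_entry {
    uint64_t last_addr;   /* last address fetched by this PC      */
    int64_t  stride;      /* last observed stride                 */
    int      confidence;  /* how many times the stride repeated   */
};

static struct stride_entry table[TABLE_SIZE];

/* Called on every load; pc identifies the load instruction. */
void stride_prefetch(uint64_t pc, uint64_t addr)
{
    struct stride_entry *e = &table[(pc >> 2) % TABLE_SIZE];
    int64_t stride = (int64_t)(addr - e->last_addr);

    if (stride != 0 && stride == e->stride) {
        /* Constant stride confirmed: grow confidence and, once confident,
         * prefetch ahead to hide latency (distance 4 is an illustrative pick). */
        if (e->confidence < CONF_THRESH)
            e->confidence++;
        if (e->confidence >= CONF_THRESH)
            issue_prefetch(addr + 4 * stride);
    } else {
        /* Stride changed: retrain. */
        e->stride = stride;
        e->confidence = 0;
    }
    e->last_addr = addr;
}
```

In keeping with the last bullet, issue_prefetch() would place the fetched words in a separate prefetch buffer rather than directly in the cache, to limit pollution.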

  10. Cache Power Consumption
  • Instruction caches can save on decode time and power by storing instructions in decoded form (trace caches)
  • Memory accesses are power hungry – caches can also help reduce power consumption

  11. Alpha 21264 Instruction Hierarchy
  • When powered on, initialization code is read from an external PROM and executed in privileged architecture library (PAL) mode with no virtual memory
  • The I-cache is virtually indexed and virtually tagged – this avoids an I-TLB look-up for every access – correctness is not compromised because instructions are not modified
  • Each I-cache block saves 11 bits to predict the index of the next set that is going to be accessed and 1 bit to predict the way – line and way prediction
  • An I-cache miss looks up a prefetch buffer and a 128-entry fully-associative TLB before accessing L2
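
One way to picture the line-and-way-prediction bullet is as two small fields attached to every I-cache block; the struct below is a sketch under that reading (the field names, type layout, and 64 B block size are made up here, not the 21264's actual structures).

```c
#include <stdint.h>

/* Hypothetical per-block state for line and way prediction. */
struct icache_block {
    uint64_t     vtag;            /* virtual tag (the I-cache is virtually tagged) */
    uint8_t      data[64];        /* instruction bytes (block size assumed)        */
    unsigned int next_index : 11; /* predicted set index of the next fetch         */
    unsigned int next_way   : 1;  /* predicted way of the next fetch               */
};

/* The next fetch starts speculatively from (next_index, next_way); the
 * prediction is verified afterwards and the fields are retrained on a
 * mispredict. */
```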

  12. 21264 Cache Hierarchy
  • The L2 cache is off-chip and direct-mapped (the 21364 moves L2 onto the chip)
  • Every L2 fetch also fetches the next four physical blocks, without exceeding the page boundary (see the sketch below)
  • L2 is write-back
  • The processor has a 128-bit data path to L2 and a 64-bit data path to memory
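
The "next four physical blocks" bullet boils down to simple address arithmetic; the sketch below assumes a 64 B block and an 8 KB page (assumptions for illustration) and uses a hypothetical issue_l2_fetch() hook.

```c
#include <stdint.h>

#define BLOCK_SIZE 64u     /* assumed L2 block size      */
#define PAGE_SIZE  8192u   /* assumed physical page size */

extern void issue_l2_fetch(uint64_t phys_addr);   /* hypothetical hook */

/* On an L2 fetch of miss_addr, also fetch the next four physical blocks,
 * stopping at the page boundary. */
void l2_fetch_with_prefetch(uint64_t miss_addr)
{
    uint64_t block    = miss_addr & ~(uint64_t)(BLOCK_SIZE - 1);
    uint64_t page_end = (miss_addr & ~(uint64_t)(PAGE_SIZE - 1)) + PAGE_SIZE;

    issue_l2_fetch(block);                        /* the demand fetch */
    for (int i = 1; i <= 4; i++) {
        uint64_t next = block + i * BLOCK_SIZE;
        if (next >= page_end)                     /* don't cross the page */
            break;
        issue_l2_fetch(next);
    }
}
```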

  13. 21264 Data Cache
  • The L1 data cache is write-back, virtually indexed, physically tagged, and backed up by a victim buffer
  • On a miss, the processor checks the other possible L1 cache locations for a synonym in parallel with the L2 look-up (recall the two alternative techniques to deal with the synonym problem; the arithmetic below makes this concrete)
  • No prefetching for data cache misses
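
To make the synonym check concrete (a back-of-the-envelope calculation; the 2-way associativity and 64 B block size are assumptions here, with the 64 KB L1 size taken from the next slide and 8 KB pages): each way holds 64 KB / 2 = 32 KB, so index plus block offset spans 15 bits while the page offset supplies only 13, leaving 2 index bits that come from the virtual page number. A given physical block can therefore map to any of 2^2 = 4 different indices, and those are the other L1 locations probed for a synonym in parallel with the L2 look-up.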

  14. 21264 Performance
  • 21164: 8 KB L1s and 96 KB L2; 21264: 64 KB L1 and off-chip 1 MB L2
  • The 21264 is out-of-order and can tolerate L1 misses → speedup is a function of 21164 L2 misses that are captured by the 21264's L2
  • Commercial database/server applications stress the memory system much more than SPEC/desktop applications

  15. Sun Fire 6800 Server
  • Intended for commercial applications → aggressive memory hierarchy design:
  • 8 MB off-chip L2
  • Wide buses going to L2 and memory for bandwidth
  • On-chip memory controller to reduce latency
  • On-chip L2 tags to save latency on a miss
  • ECC and parity bits for all external traffic to provide high reliability
  • Large store buffers (write caches) between L1 and L2
  • Data prefetch engine that detects strides
  • Instruction prefetch that stays one block ahead of decode
  • Two parallel TLBs: 128-entry 4-way and 16-entry fully-associative

