
Lecture 26: Case Studies



  1. Lecture 26: Case Studies • Topics: processor case studies, Flash memory • Final exam stats: • Highest 83, median 67 • 70+: 16 students, 60-69: 20 students • 1st 3 problems and 7th problem: gimmes • 4th problem (LSQ): half the students got full points • 5th problem (cache hierarchy): 1 correct solution • 6th problem (coherence): most got more than 15 points • 8th problem (TM): very few mentioned frequent aborts, starvation, and livelock • 9th problem (TM): no one got close to full points • 10th problem (LRU): 1 “correct” solution with the tree structure

  2. Finals Discussion: LSQ, Caches, TM, LRU

  3. Case Study I: Sun’s Niagara • Commercial servers require high thread-level throughput and suffer from cache misses • Sun’s Niagara focuses on: • simple cores (low power, low design complexity, can accommodate more cores) • fine-grain multi-threading (to tolerate long memory latencies)
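To make the multi-threading bullet concrete, here is a toy sketch (not Niagara's actual thread-select logic; the thread count, miss latency, and miss pattern are invented for illustration) of fine-grain multithreading: each cycle the core issues from the next ready thread in round-robin order, so a thread stalled on a cache miss does not leave the pipeline idle.

/* Toy model of fine-grain multithreading: 4 hardware threads share one
 * pipeline; a thread that misses in the cache stalls for MISS_LATENCY
 * cycles while the other threads keep the pipeline busy. */
#include <stdio.h>

#define THREADS      4
#define MISS_LATENCY 100   /* assumed memory latency in cycles */

int stall_until[THREADS];  /* cycle at which each thread becomes ready again */

int main(void) {
    int issued = 0, last = -1;
    for (int cycle = 0; cycle < 1000; cycle++) {
        /* round-robin: start from the thread after the last one that issued */
        for (int i = 1; i <= THREADS; i++) {
            int t = (last + i) % THREADS;
            if (stall_until[t] <= cycle) {      /* thread is ready */
                issued++;
                last = t;
                if (cycle % 17 == 0)            /* pretend this access misses */
                    stall_until[t] = cycle + MISS_LATENCY;
                break;
            }
        }
    }
    printf("issued %d instructions in 1000 cycles\n", issued);
    return 0;
}

With a single thread, every miss would stall the pipeline for 100 cycles; with four threads, the other contexts absorb most of that latency.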

  4. Niagara Overview

  5. SPARC Pipe • No branch predictor • Low clock speed (1.2 GHz) • One FP unit shared by all cores

  6. Case Study II: Sun’s Rock • 16 cores, each with 2 thread contexts • 10 W per core (14 mm2), each core is in-order and runs at 2.3 GHz (10-12 FO4) (65 nm), for a total of 240 W and 396 mm2 ! • New features: scout threads that prefetch while the main thread is stalled on a memory access, and support for HTM (lazy versioning and eager conflict detection) • Each cluster of 4 cores shares a 32KB I-cache, two 32KB D-caches (one D-cache for two cores), and 2 FP units. Caches are 4-way p-LRU. • L2 cache is 4-banked, 8-way p-LRU, and 2 MB. • Clusters are connected with a crossbar switch • Good read: http://www.opensparc.net/pubs/preszo/08/RockISSCC08.pdf
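A highly simplified software sketch of the two HTM policies named above (not Rock's hardware mechanism, and it tracks write sets only): lazy versioning buffers a transaction's stores and applies them to memory only at commit, while eager conflict detection rejects an access as soon as it touches a word that another live transaction owns.

/* Illustrative sketch of lazy versioning + eager conflict detection. */
#include <stdio.h>
#include <string.h>

#define MEM_WORDS 16
#define WSET_MAX  8

int memory[MEM_WORDS];
int owner[MEM_WORDS];           /* -1 = free, else id of the txn that wrote it */

typedef struct {
    int id;
    int addrs[WSET_MAX], vals[WSET_MAX], n;   /* lazy write buffer */
    int aborted;
} txn_t;

void tm_begin(txn_t *t, int id) { t->id = id; t->n = 0; t->aborted = 0; }

/* Eager conflict detection: fail immediately if another txn owns the word. */
int tm_write(txn_t *t, int addr, int val) {
    if (owner[addr] != -1 && owner[addr] != t->id) { t->aborted = 1; return 0; }
    owner[addr] = t->id;
    t->addrs[t->n] = addr; t->vals[t->n] = val; t->n++;   /* buffer, don't write */
    return 1;
}

/* Lazy versioning: memory is only updated here, at commit time. */
void tm_commit(txn_t *t) {
    if (!t->aborted)
        for (int i = 0; i < t->n; i++) memory[t->addrs[i]] = t->vals[i];
    for (int i = 0; i < t->n; i++) owner[t->addrs[i]] = -1;   /* release */
}

int main(void) {
    memset(owner, -1, sizeof owner);
    txn_t a, b;
    tm_begin(&a, 1); tm_begin(&b, 2);
    tm_write(&a, 3, 42);           /* txn 1 claims word 3 */
    tm_write(&b, 3, 99);           /* conflict detected eagerly: txn 2 aborts */
    tm_commit(&a); tm_commit(&b);
    printf("memory[3]=%d, txn2 aborted=%d\n", memory[3], b.aborted);
    return 0;
}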

  7. Rock Overview

  8. Case Study III: Intel Pentium 4 • Pursues high clock speed, ILP, and TLP • CISC instrs are translated into micro-ops and stored in a trace cache to avoid re-translation every time • Uses register renaming with 128 physical registers • Supports up to 48 loads and 32 stores • Rename/commit width of 3; up to 6 instructions can be dispatched to functional units every cycle • A simple instruction has to traverse a 31-stage pipeline • Combining branch predictor with local and global histories • 16KB 8-way L1; 4-cyc for ints, 12-cyc for FPs; 2MB 8-way L2, 18-cyc
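A minimal sketch of register renaming with a map table and a free list of 128 physical registers, as mentioned above (illustrative only: the architectural register count is assumed, and old mappings are never reclaimed, which real hardware does at retirement).

/* Toy register renamer: architectural registers are mapped onto a pool of
 * 128 physical registers via a map table and a free list. */
#include <stdio.h>

#define ARCH_REGS 16
#define PHYS_REGS 128

int map[ARCH_REGS];              /* arch reg -> current physical reg */
int free_list[PHYS_REGS], free_top;

void rename_init(void) {
    for (int i = 0; i < ARCH_REGS; i++) map[i] = i;           /* initial mapping */
    free_top = 0;
    for (int p = PHYS_REGS - 1; p >= ARCH_REGS; p--)
        free_list[free_top++] = p;                            /* rest are free */
}

/* Rename one instruction "rd = rs1 op rs2": sources read the current map,
 * the destination gets a fresh physical register (removes WAW/WAR hazards). */
void rename_instr(int rd, int rs1, int rs2) {
    int ps1 = map[rs1], ps2 = map[rs2];
    int pd  = free_list[--free_top];          /* assume a free register exists */
    map[rd] = pd;
    printf("p%d = p%d op p%d\n", pd, ps1, ps2);
}

int main(void) {
    rename_init();
    rename_instr(1, 2, 3);    /* r1 = r2 op r3 */
    rename_instr(1, 1, 4);    /* r1 = r1 op r4: reads the new mapping for r1 */
    return 0;
}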

  9. Clock Rate vs. CPI: AMD Opteron vs. P4 • 2.8 GHz AMD Opteron vs. 3.8 GHz Intel P4: the Opteron provides a speedup of 1.08
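One way to read that number, assuming both machines execute the same dynamic instruction count: performance = frequency / CPI, so 1.08 = (2.8 GHz / CPI_Opteron) / (3.8 GHz / CPI_P4), which implies CPI_P4 / CPI_Opteron = 1.08 × 3.8 / 2.8 ≈ 1.47; the Opteron needs roughly a third fewer cycles per instruction, more than making up for its lower clock.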

  10. Case Study IV: Intel Core Architecture • Single-thread execution is still considered important → out-of-order execution and speculation very much alive • initial processors will have few heavy-weight cores • To reduce power consumption, the Core architecture (14 pipeline stages) is closer to the Pentium M (12 stages) than the P4 (30 stages) • Many transistors invested in a large branch predictor to reduce wasted work (power) • Similarly, SMT is also not guaranteed for all incarnations of the Core architecture (SMT makes a hotspot hotter)

  11. Case Study V: Intel Nehalem • Quad core, each with 2 SMT threads • ROB of 96 in Core 2 has been increased to 128 in Nehalem; ROB dynamically allocated across threads • Lots of power modes; in-built power control unit • 32KB I&D L1 caches, 10-cycle 256KB private L2 cache per core, 8MB shared L3 cache (~40 cycles) • L1 dTLB 64/32 entries (page sizes of 4KB or 4MB), 512-entry L2 TLB (small pages only)
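A quick reach calculation from the TLB numbers above (4KB pages): the 64-entry L1 dTLB maps 64 × 4KB = 256KB, the same size as the per-core L2 cache, and the 512-entry L2 TLB maps 512 × 4KB = 2MB; accesses beyond that reach require a page-table walk.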

  12. Nehalem Memory Controller Organization • [Diagram: four sockets connected by QPI links; each socket contains four cores and three memory controllers (MC1-MC3), with each memory controller driving its own set of DIMMs]

  13. Flash Memory • The technology is now cost-effective enough that flash memory can replace magnetic disks in laptops (also known as solid-state disks) • Non-volatile; fast read times (15 MB/sec) (slower than DRAM); a write requires the entire block to be erased first (about 100K erases are possible) (block sizes can be 16-512KB)
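A back-of-the-envelope example of what the ~100K erase limit means (the drive and block sizes are hypothetical, chosen for round numbers): a 64GB drive with 256KB erase blocks has 64GB / 256KB = 262,144 blocks; with perfect wear leveling and no write amplification it could absorb 262,144 blocks × 100,000 erases × 256KB ≈ 6.9 PB of writes before wearing out. Real drives do worse because a small in-place update still forces an entire block to be erased and rewritten (write amplification).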

  14. Advanced Course • Spr’09: CS 7810: Advanced Computer Architecture • co-taught by Al Davis and me • lots of multi-core topics: cache coherence, TM, networks • memory technologies: DRAM layouts, new technologies, memory controller design • Major course project on evaluating original ideas with simulators (can lead to publications) • One programming assignment, take-home final

  15. Case Studies: More Processors • AMD Barcelona: 4 cores, issue width of 3, each core has private L1 (64 KB) and L2 (512 KB), shared L3 (2 MB), 95 W (AMD has also announced 3-core chips) • Sun Niagara2: 8 threads per core, up to 8 cores, 60-123 W, 0.9-1.4 GHz, 4 MB L2 (8 banks), 8 FP units • IBM Power6: 2 cores, 4.7 GHz, each core has a private 4 MB L2

  16. Alpha Address Mapping • [Diagram: the 64-bit virtual address is split into 21 unused bits, three 10-bit level indices, and a 13-bit page offset; the page table base register plus the level-1 index selects a PTE in the L1 page table, which points to the L2 page table, which points to the L3 page table; the final PTE supplies a 32-bit physical page number that is concatenated with the page offset to form a 45-bit physical address]

  17. Alpha Address Mapping • Each PTE is 8 bytes – if the page size is 8KB, a page can contain 1024 PTEs – 10 bits to index into each level • If the page size doubles, we need 47 bits of virtual address • Since a PTE only stores 32 bits of physical page number, physical memory can be addressed by at most 32 + offset bits • First two levels are in physical memory; the third is in virtual memory • Why the three-level structure? Even a flat structure would need PTEs for the PTEs that would have to be stored in physical memory – more levels of indirection make it easier to dynamically allocate pages
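The two slides above translate into a short walk routine; the sketch below assumes 8KB pages (13-bit offset), 10-bit indices at each level, and 8-byte PTEs whose low 32 bits hold the physical page number. The simulated memory, PTE layout, and example mapping are illustrative, not the real Alpha format.

/* Sketch of the three-level walk:
 * VA = | unused 21 | L1 idx 10 | L2 idx 10 | L3 idx 10 | offset 13 |
 * Each level is one 8KB page holding 1024 eight-byte PTEs. */
#include <stdint.h>
#include <stdio.h>

#define PTES_PER_PAGE 1024

/* Tiny simulated physical memory, just large enough for the demo tables. */
static uint64_t phys_mem[4 * PTES_PER_PAGE];

/* Assumed PTE layout: physical page number in the low 32 bits. */
static uint64_t read_pte(uint64_t paddr) { return phys_mem[paddr / 8]; }

uint64_t translate(uint64_t va, uint64_t ptbr) {
    uint64_t offset = va & 0x1FFF;            /* 13-bit page offset   */
    uint64_t l3 = (va >> 13) & 0x3FF;         /* 10-bit level-3 index */
    uint64_t l2 = (va >> 23) & 0x3FF;         /* 10-bit level-2 index */
    uint64_t l1 = (va >> 33) & 0x3FF;         /* 10-bit level-1 index */

    uint64_t pte1 = read_pte(ptbr + l1 * 8);
    uint64_t pte2 = read_pte((pte1 << 13) + l2 * 8);   /* PPN * page size */
    uint64_t pte3 = read_pte((pte2 << 13) + l3 * 8);

    return (pte3 << 13) | offset;             /* 45-bit physical address */
}

int main(void) {
    /* Physical pages 0, 1, 2 hold the L1/L2/L3 tables; the virtual page with
     * indices (L1=0, L2=0, L3=5) maps to physical page 3. */
    phys_mem[0] = 1;                                    /* L1[0] -> page 1 */
    phys_mem[1 * PTES_PER_PAGE + 0] = 2;                /* L2[0] -> page 2 */
    phys_mem[2 * PTES_PER_PAGE + 5] = 3;                /* L3[5] -> page 3 */
    uint64_t va = ((uint64_t)5 << 13) | 0x123;
    printf("VA 0x%llx -> PA 0x%llx\n",
           (unsigned long long)va, (unsigned long long)translate(va, 0));
    return 0;
}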

