
Montecito and POWER4



Presentation Transcript


  1. Montecito and POWER4
     Chris Thomas, Chris Chaney
     9/7/2005

  2. Outline
     • Background
     • Design Summary
        • Montecito
        • POWER4
     • Comparisons
        • Memory hierarchy
        • Threading

  3. Background
     • Additional transistors provide diminishing returns for exploiting ILP (instruction-level parallelism)
     • Lots of TLP (thread-level parallelism) is available in commercial workloads
     • The power envelope is a major design constraint
     • Memory latency is an increasing factor in performance

  4. Montecito
     • 1.72 billion transistors
     • 100 W
     • ~27 MB of total cache (L1-L3)
     • 1.8 GHz
     • Dual in-order cores, each core dual-threaded
     • 6-issue

  5. Montecito
     • Cache (per core)
        • 16 KB L1 (I & D), write-through L1D
        • 1 MB L2I (parity), 256 KB L2D
        • 12 MB L3 (unified, Pellston)
        • L3 is asynchronous
        • Other arrays are parity/ECC protected
     • Off-chip bandwidth 10.66 GB/s
        • Almost double that of the previous Itanium 2
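As a rough check on the "~27 MB of total cache" figure from slide 4, the per-core sizes above can be summed across both cores. This is a minimal sketch that assumes "16 KB L1 (I & D)" means 16 KB each for instructions and data, and that the 12 MB L3 is per core, as the "per core" heading suggests; the variable and function names are illustrative.

```python
KB = 1024
MB = 1024 * KB

# Per-core cache sizes as listed on the slide (assumed: L1 is 16 KB each for I and D).
per_core_bytes = {
    "L1I": 16 * KB,
    "L1D": 16 * KB,
    "L2I": 1 * MB,
    "L2D": 256 * KB,
    "L3": 12 * MB,
}
CORES = 2


def total_cache_mb(per_core, cores):
    """Sum every cache level for one core, then scale by the core count."""
    return cores * sum(per_core.values()) / MB


print(f"{total_cache_mb(per_core_bytes, CORES):.2f} MB")  # prints "26.56 MB"
```

Under these assumptions the total is about 26.6 MB, consistent with the rounded "~27 MB" on the summary slide.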

  6. Montecito
     • TLP
        • TMT (temporal multithreading) in the core
        • SMT (simultaneous multithreading) in the memory system

  7. Montecito
     • Power
        • Would be 300 W without power management
     • Foxton
        • Dynamically scales voltage and frequency
     • Removed the clock from L3 accesses (saves 10 W)
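To see why scaling voltage and frequency together is such an effective power lever, the first-order CMOS dynamic power relation P ≈ C·V²·f is a useful guide. The sketch below is a textbook model only, not Foxton's actual control algorithm; the function names and the scale factors in the example are hypothetical and are not Montecito measurements.

```python
def dynamic_power(c_eff_farads, voltage_v, freq_hz):
    """First-order CMOS dynamic power: P = C * V^2 * f (ignores leakage)."""
    return c_eff_farads * voltage_v ** 2 * freq_hz


def scaled_power(base_power_w, v_scale, f_scale):
    """Power after scaling voltage and frequency by the given factors.

    Because power is quadratic in voltage and linear in frequency,
    the result is base * v_scale^2 * f_scale.
    """
    return base_power_w * v_scale ** 2 * f_scale


# Hypothetical example: lowering both voltage and frequency by 10%
# cuts dynamic power by roughly 27% in this simplified model.
print(f"{scaled_power(100.0, 0.9, 0.9):.1f} W")  # prints "72.9 W"
```

The cubic-like payoff (a 10% reduction in both knobs removes over a quarter of the dynamic power) is the general reason dynamic voltage and frequency scaling helps a chip stay inside a fixed power envelope.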

  8. POWER4
     • 174 million transistors
     • Up to 128 MB total cache per module
     • 1.1-1.3 GHz, deeply pipelined
     • 4 dual-core chips per module, each core single-threaded
     • 8-issue (peak)
     • Support for glueless SMPs of up to 4 chips

  9. POWER4
     • Memory Hierarchy
        • Each processor has a dedicated 64 KB L1I and 32 KB L1D
        • Write-through L1s, parity protected
        • Each chip shares a 1.5 MB L2, ECC protected
           • Split into 3 banks, with separate cache controllers
        • L3 is off chip, up to 32 MB per chip (eDRAM)
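The per-chip sizes above can be combined with the four-chips-per-module figure from slide 8 to sanity-check the "up to 128 MB total cache per module" number. This is a minimal sketch with illustrative names; it assumes two cores per chip and the maximum 32 MB L3 on every chip.

```python
KB = 1024
MB = 1024 * KB

CORES_PER_CHIP = 2
CHIPS_PER_MODULE = 4

L1_PER_CORE = 64 * KB + 32 * KB   # L1I + L1D
L2_PER_CHIP = int(1.5 * MB)       # shared by both cores on a chip
L3_PER_CHIP = 32 * MB             # off-chip eDRAM, maximum configuration


def module_cache_mb(include_on_chip=True):
    """Total cache per module in MB; optionally count only the off-chip L3."""
    per_chip = L3_PER_CHIP
    if include_on_chip:
        per_chip += CORES_PER_CHIP * L1_PER_CORE + L2_PER_CHIP
    return CHIPS_PER_MODULE * per_chip / MB


print(module_cache_mb(include_on_chip=False))  # 128.0 (L3 only)
print(module_cache_mb(include_on_chip=True))   # 134.75 (L1 + L2 + L3)
```

Under these assumptions the L3 alone accounts for 128 MB, which suggests the slide's per-module figure counts only the L3; adding the on-chip L1 and L2 raises the total to about 134.75 MB.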

  10. POWER4
     • Memory Hierarchy Continued
        • Coherency is maintained at the L2, using an enhanced MESI protocol
        • I/O is handled in a separate chip, connected via the GX bus
        • Bus frequencies scale with core frequency
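For readers unfamiliar with MESI, the state transitions of a single cache line can be sketched as a small state machine. This shows only the textbook four-state protocol; POWER4's enhanced variant adds states and behavior that are not described in the slides and are not modeled here. The `State`, `Event`, and `next_state` names are illustrative.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class Event(Enum):
    LOCAL_READ = 1    # this cache's processor reads the line
    LOCAL_WRITE = 2   # this cache's processor writes the line
    REMOTE_READ = 3   # another cache's read is snooped
    REMOTE_WRITE = 4  # another cache's write (or invalidate) is snooped


def next_state(state, event, other_sharers=False):
    """Return the next textbook-MESI state for one line in one cache.

    other_sharers matters only for a read miss: the line is loaded as
    SHARED if another cache holds it, otherwise as EXCLUSIVE.
    """
    if event is Event.LOCAL_WRITE:
        # Gaining write permission invalidates all other copies.
        return State.MODIFIED
    if event is Event.REMOTE_WRITE:
        # Another cache takes ownership (a MODIFIED line is written back first).
        return State.INVALID
    if event is Event.LOCAL_READ:
        if state is State.INVALID:
            return State.SHARED if other_sharers else State.EXCLUSIVE
        return state  # read hit: no state change
    # REMOTE_READ: an owned line is downgraded so both caches can share it.
    if state in (State.MODIFIED, State.EXCLUSIVE):
        return State.SHARED
    return state


# A line is read privately, written, then read by another cache.
s = next_state(State.INVALID, Event.LOCAL_READ)   # EXCLUSIVE
s = next_state(s, Event.LOCAL_WRITE)              # MODIFIED
s = next_state(s, Event.REMOTE_READ)              # SHARED
print(s.name)  # prints "SHARED"
```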

  11. POWER4
     • Instruction Grouping
        • Helps to simplify tracking for precise interrupts
        • Groups of up to five instructions
        • Groups execute in order
        • Many cases force instructions to issue one by one
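The grouping idea can be illustrated with a toy group former that packs instructions, in program order, into groups of at most five. As a simplifying assumption not stated in the slides, a branch also closes the current group; the real hardware has further rules (including the cases that force one-by-one issue) that are not modeled. The function name and the branch mnemonic set are illustrative.

```python
GROUP_SIZE = 5  # from the slide: groups of up to five instructions


def form_groups(instructions, branches=frozenset({"b", "bc", "bl"})):
    """Pack instructions, in program order, into groups of up to GROUP_SIZE.

    Assumption for this sketch: a branch ends the group it belongs to.
    """
    groups, current = [], []
    for op in instructions:
        current.append(op)
        if len(current) == GROUP_SIZE or op in branches:
            groups.append(current)
            current = []
    if current:  # flush a final partial group
        groups.append(current)
    return groups


program = ["add", "ld", "bc", "mul", "st", "add", "sub", "ld", "or"]
for group in form_groups(program):
    print(group)
# ['add', 'ld', 'bc']
# ['mul', 'st', 'add', 'sub', 'ld']
# ['or']
```

Tracking completion per group rather than per instruction is what reduces the bookkeeping needed for precise interrupts: the machine only has to know which groups have finished.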

  12. Conclusions
     • No benchmarks were presented
     • High-ILP processors are now also exploiting TLP
     • No large instruction windows
