
Class Presentation for Advanced VLSI Course



Presentation Transcript


  1. Class Presentation for Advanced VLSI Course • Instructor: Dr. S. M. Fakhraie • Presented by: Naser Sedaghati • Major reference: "Design and Implementation of the POWER5™ Microprocessor," J. Clabes¹, J. Friedrich¹, M. Sweet¹, J. DiLullo¹, S. Chu¹, D. Plass², J. Dawson², P. Muench², L. Powell¹, M. Floyd¹, B. Sinharoy², M. Lee¹, M. Goulet¹, J. Wagoner¹, N. Schwarz¹, S. Runyon¹, G. Gorman¹, P. Restle³, R. Kalla¹, J. McGill¹, S. Dodson¹ (¹IBM System Group, Austin, TX; ²IBM System Group, Poughkeepsie, NY; ³IBM Research, Yorktown Heights, NY), IEEE International Solid-State Circuits Conference 2004 • Winter 2004

  2. Outline • Motivation • Background • Threading Fundamentals • Enhancement SMT Implementation in POWER5 • Memory Subsystem Enhancements • Power Efficiency • Additional SMT Considerations • Summary

  3. Microprocessor Design Optimization Focus Areas • Motivation … • Memory latency • Increased processor speeds make memory appear further away • Longer stalls possible • Branch processing • Mispredict more costly as pipeline depth increases resulting in stalls and wasted power • Predication drives increased power and larger chip area • Execution Unit Utilization • Currently 20-25% execution unit utilization common • Simultaneous multi-threading (SMT) and POWER architecture address these areas

  4. POWER4 --- Shipped in Systems December 2001 • Background … • Technology: 180 nm lithography, Cu, SOI • POWER4+ shipping in 130 nm today • 267 mm², 185M transistors • Dual processor core • 8-way superscalar • Out-of-order execution • Load/Store units • 2 Fixed Point units • 2 Floating Point units • Logical operations on Condition Register • Branch Execution unit • > 200 instructions in flight • Hardware instruction and data prefetch

  5. POWER5 --- The Next Step • Background … • Technology: 130 nm lithography, Cu, SOI • 389 mm², 276M transistors • Dual processor core • 8-way superscalar • Simultaneous multithreaded (SMT) core • Up to 2 virtual processors per real processor • Natural extension to POWER4 design

  6. System-level view of POWER5 • Background …

  7. Multi-threading Evolution • Threading …

  8. Changes Going From ST to SMT Core • Enhancement … • SMT easily added to Superscalar Micro-architecture • Second Program Counter (PC) added to share I-fetch bandwidth • GPR/FPR rename mapper expanded to map second set of registers (High order address bit indicates thread) • Completion logic replicated to track two threads • Thread bit added to most address/tag buses
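
The thread-bit idea on slide 8 can be illustrated with a minimal sketch. It assumes 32 architected registers per thread and two threads; the function names and the exact bit position are hypothetical and only show how a high-order bit can select the thread's register set, not how the POWER5 mapper is actually wired.

```python
# Hypothetical sketch: extend an architected register number with a thread bit.
NUM_ARCH_REGS = 32   # architected GPRs (or FPRs) per thread
THREAD_SHIFT = 5     # log2(NUM_ARCH_REGS); assumed position of the thread bit


def mapper_index(thread_id: int, arch_reg: int) -> int:
    """Combine thread id and register number into one mapper lookup index."""
    if thread_id not in (0, 1):
        raise ValueError("this sketch models exactly two threads")
    if not 0 <= arch_reg < NUM_ARCH_REGS:
        raise ValueError("architected register number out of range")
    # The high-order bit selects the thread's half of the mapper.
    return (thread_id << THREAD_SHIFT) | arch_reg


def split_mapper_index(index: int) -> tuple[int, int]:
    """Recover (thread_id, arch_reg) from a mapper index."""
    return index >> THREAD_SHIFT, index & (NUM_ARCH_REGS - 1)
```

With this layout, register 3 of thread 0 and register 3 of thread 1 land on different mapper entries (3 and 35), so both threads can share one physical rename pool without aliasing.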

  9. POWER5 Resources Size Enhancements • Enhancement … • Enhanced caches and translation resources • I-cache: 64 KB, 2-way set associative, LRU • D-cache: 32 KB, 4-way set associative, LRU • First level Data Translation: 128 entries, fully associative, LRU • L2 Cache: 1.92 MB, 10-way set associative, LRU • Larger resource pools • Rename registers: GPRs, FPRs increased to 120 each • L2 cache coherency engines: increased by 100% • Enhanced data stream prefetching • Memory controller moved on chip
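
As a worked example of the L1 cache geometries on slide 9, the sketch below derives the number of sets from capacity and associativity. The 128-byte line size is an assumption that the slide does not state, and `num_sets` is a hypothetical helper name.

```python
# Sketch: number of sets = capacity / (ways * line size).
LINE_BYTES = 128  # ASSUMPTION: line size is not given on the slide


def num_sets(size_bytes: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Return the number of sets in a set-associative cache."""
    if size_bytes % (ways * line_bytes) != 0:
        raise ValueError("capacity must be a multiple of ways * line size")
    return size_bytes // (ways * line_bytes)


icache_sets = num_sets(64 * 1024, ways=2)  # I-cache: 64 KB, 2-way
dcache_sets = num_sets(32 * 1024, ways=4)  # D-cache: 32 KB, 4-way
```

Under the assumed line size, the I-cache has 256 sets and the D-cache has 64 sets; a different line size scales both counts proportionally.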

  10. Thread Priority • Enhancement … • Instances when unbalanced execution desirable • No work for opposite thread • Thread waiting on lock • Software determined non uniform balance • Power management • … • Solution: Control instruction decode rate • Software/hardware controls 8 priority levels for each thread
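
The decode-rate control on slide 10 can be modeled with a small sketch. The sharing rule used here (the lower-priority thread receives 1 of every 2^(|X-Y|+1) decode cycles) is an assumption brought in for illustration and is not stated on the slide; the numbering of levels as 0 to 7 and the function name are likewise hypothetical, and any special handling of the lowest levels is ignored.

```python
from fractions import Fraction


def decode_share(prio_a: int, prio_b: int) -> tuple[Fraction, Fraction]:
    """Return the fraction of decode cycles given to thread A and thread B.

    ASSUMED rule: the lower-priority thread gets 1 of every
    2 ** (|prio_a - prio_b| + 1) cycles; the other thread gets the rest.
    """
    for prio in (prio_a, prio_b):
        if not 0 <= prio <= 7:
            raise ValueError("this sketch uses 8 priority levels, 0..7")
    period = 2 ** (abs(prio_a - prio_b) + 1)
    low = Fraction(1, period)
    high = 1 - low
    if prio_a == prio_b:
        return low, low  # period is 2, so each thread gets half
    return (high, low) if prio_a > prio_b else (low, high)
```

For example, equal priorities split decode cycles evenly, while priorities 6 and 4 give the favored thread 7/8 of the cycles under this assumed rule.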

  11. Modifications to POWER4 System Structure • Memory …

  12. Power Efficient Design Implementation • Power … • DC power mitigation • Leverage triple Vt technology • Decrease low Vt usage by 90% • Increase high Vt usage by 30% • Leverage triple Tox technology • Thick Tox usage for decoupling capacitors • AC power mitigation • Minimal usage of dynamic circuits • Reduce loading on clock mesh • Incorporation of dynamic clock gating
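
To see why reduced clock loading and clock gating lower AC power, the sketch below uses the standard first-order switching-power model P = a·C·V²·f. The numeric values are illustrative placeholders, not POWER5 measurements, and `dynamic_power` is a hypothetical helper name.

```python
def dynamic_power(activity: float, cap_farads: float,
                  vdd_volts: float, freq_hz: float) -> float:
    """First-order switching power: activity * C * Vdd^2 * f (in watts)."""
    return activity * cap_farads * vdd_volts ** 2 * freq_hz


# Illustrative placeholder values only (not taken from the presentation).
ungated = dynamic_power(activity=1.0, cap_farads=1e-9, vdd_volts=1.2, freq_hz=1.5e9)
# Clock gating is modeled as the gated block toggling in half of the cycles.
gated = dynamic_power(activity=0.5, cap_farads=1e-9, vdd_volts=1.2, freq_hz=1.5e9)
```

In this model, halving the effective activity of a gated block halves its switching power, and lowering the capacitance on the clock mesh reduces power in the same linear way.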

  13. Thermal control logic and sample thermal response. • Power …

  14. 16-way Building Block • Additional …

  15. POWER5 Multi-Chip Module • Additional … • 95 mm × 95 mm • Four POWER5 chips • Four cache chips • 4,491 signal I/Os • 89 layers of metal

  16. 64-way SMP Interconnection • Additional … • Interconnection exploits enhanced distributed switch • All chip interconnections operate at half processor frequency and scale with processor frequency

  17. POWER4 and POWER5 Storage Hierarchy • Additional …

  18. POWER Server Roadmap • Additional …

  19. Summary • Summary … • First dual-core SMT microprocessor • Extended SMP to 64-way • Operating in laboratory • Power dynamically managed with no performance penalty • Implementation permits future technology scalability from circuit and power perspective • Innovative approach leveraging technology with system focus for high performance in a power-efficient design

  20. Other References • [1] R. Kalla, B. Sinharoy, J. Tendler, "IBM POWER5 Chip: A Dual-Core Multithreaded Processor," IEEE Computer Society, March-April 2004 • [2] R. Kalla, IBM System Group, "IBM's POWER5 Design and Methodology," IBM Corporation, 2003
