
CS 7810 Lecture 16




Presentation Transcript


  1. CS 7810 Lecture 16 Simultaneous Multithreading: Maximizing On-Chip Parallelism D.M. Tullsen, S.J. Eggers, H.M. Levy Proceedings of ISCA-22 June 1995

  2. Processor Under-Utilization • Wide gap between average processor utilization and peak processor utilization • Caused by dependences, long latency instrs, branch mispredicts • Results in many idle cycles for many structures

  3. Superscalar Utilization [Figure: issue slots over time for one thread; x-axis: resources (e.g. FUs), y-axis: time; empty slots marked as vertical (V) and horizontal (H) waste] • Suffers from horizontal waste (can’t find enough work in a cycle) and vertical waste (because of dependences, there is nothing to do for many cycles) • Utilization = 19% • vertical:horizontal waste = 61:39
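As a concrete illustration of how the two kinds of waste are counted, the sketch below tallies empty issue slots from a toy per-cycle issue trace. The trace numbers are invented for illustration, not taken from the paper:

```python
# Hypothetical issue trace on a 4-wide machine: each entry is the number
# of instructions issued in that cycle (numbers are made up).
ISSUE_WIDTH = 4
issued_per_cycle = [0, 0, 3, 0, 1, 4, 0, 2, 0, 0]

total_slots = ISSUE_WIDTH * len(issued_per_cycle)
used = sum(issued_per_cycle)
# Vertical waste: cycles in which nothing issues at all.
vertical = sum(ISSUE_WIDTH for n in issued_per_cycle if n == 0)
# Horizontal waste: empty slots in cycles that issued at least one instr.
horizontal = sum(ISSUE_WIDTH - n for n in issued_per_cycle if n > 0)

print(f"utilization = {used / total_slots:.0%}")        # 25%
print(f"vertical:horizontal waste = {vertical}:{horizontal}")  # 24:6
```

With the real workloads in the paper the split comes out 61:39 in favor of vertical waste, which is what motivates multithreading.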

  4. Chip Multiprocessors [Figure: Thread-1 and Thread-2 on separate cores; each core’s slots still show V and H waste] • Single-thread performance goes down • Horizontal waste is reduced

  5. Fine-Grain Multithreading [Figure: cycles alternate between Thread-1 and Thread-2 across the same resources (e.g. FUs)] • Low-cost context-switch at a fine grain • Reduces vertical waste

  6. Simultaneous Multithreading [Figure: instructions from Thread-1 and Thread-2 share issue slots within the same cycle] • Reduces vertical and horizontal waste
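The contrast between fine-grain multithreading and SMT can be sketched with a toy slot-filling model. All per-cycle ready-instruction counts below are invented for illustration:

```python
# Toy model: fine-grain multithreading gives each cycle to one thread,
# while SMT lets any mix of threads fill a cycle's issue slots.
WIDTH = 4
# ready[t][c] = instructions thread t could issue in cycle c (made up)
ready = [
    [4, 0, 1, 0, 2],   # thread 0
    [1, 3, 2, 3, 2],   # thread 1
]

def fgmt_issue(ready, width):
    """Round-robin: each cycle, only one thread may use the issue slots."""
    total = 0
    for c in range(len(ready[0])):
        t = c % len(ready)              # cycle c belongs to thread t
        total += min(ready[t][c], width)
    return total

def smt_issue(ready, width):
    """SMT: the slots in a cycle are shared among all threads."""
    total = 0
    for c in range(len(ready[0])):
        avail = width
        for t in range(len(ready)):
            take = min(ready[t][c], avail)
            total += take
            avail -= take
    return total

print(fgmt_issue(ready, WIDTH))  # 13: vertical waste gone, horizontal remains
print(smt_issue(ready, WIDTH))   # 17: both kinds of waste attacked
```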

  7. Pipeline Structure [Figure: SMT pipeline — per-thread private front ends (I-Cache, Bpred, Rename, ROB) feeding a shared execution engine (Regs, IQ, FUs, DCache)] What about RAS, LSQ?

  8. Chip Multi-Processor [Figure: CMP pipeline — each core has a private front end (I-Cache, Bpred, Rename, ROB) and a private execution engine (Regs, IQ, FUs, DCache)]

  9. Clustered SMT [Figure: multiple front ends feeding a set of execution clusters]

  10. Evaluated Models • Fine-Grained Multithreading • Unrestricted SMT • Restricted SMT • X-issue: A thread can only issue up to X instrs in a cycle • Limited connection: each thread is tied to a fixed FU

  11. Results • SMT nearly eliminates horizontal waste • In spite of priorities, single-thread performance degrades (cache contention) • Not much difference between private and shared caches – however, with few threads, the private caches go under-utilized

  12. Comparison of Models

  13. CMP vs. SMT

  14. CS 7810 Lecture 16 Exploiting Choice: Instruction Fetch and Issue on an Implementable SMT Processor D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm Proceedings of ISCA-23 June 1996

  15. New Bottlenecks • Instruction fetch has a strong influence on total throughput • if the execution engine is executing at top speed, it is often hungry for new instrs • some threads are more likely to have ready instrs than others – selection becomes important

  16. SMT Processor [Figure: SMT hardware additions — multiple PCs, multiple RAS, multiple renamers and ROBs, more registers]

  17. SMT Overheads • Large register file – need at least 256 physical registers to support eight threads • increases cycle time/pipeline depth • increases mispredict penalty • increases bypass complexity • increases register lifetime • Results in 2% performance loss
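The 256-register figure follows from simple arithmetic, assuming a 32-register architected file per hardware context; the rename headroom below is illustrative, not necessarily the paper's exact number:

```python
# Minimum physical register file for 8 SMT contexts: every context needs
# its own copy of the architected registers before any rename headroom
# is added (32 architected registers assumed for a RISC ISA).
THREADS = 8
ARCH_REGS = 32
RENAME_HEADROOM = 100   # extra regs for in-flight renamed values (illustrative)

min_regs = THREADS * ARCH_REGS
print(min_regs)                    # 256: the floor quoted on the slide
print(min_regs + RENAME_HEADROOM)  # 356 with the illustrative headroom
```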

  18. Base Design • Front-end is fine-grain multithreaded, rest is SMT • Bottlenecks: • Low fetch rate (4.2 instrs/cycle) • IQ is often full, but only half the issue bandwidth is being used

  19. Fetch Efficiency • Base case uses RoundRobin.1.8 • RR.2.4: fetches four instrs each from two threads • requires a banked organization • requires additional multiplexing logic • Increases the chances of finding eight instrs without a taken branch • Yields instrs in spite of an I-cache miss • RR.2.8: extends RR.2.4 by reading out a larger line
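A minimal sketch of the RR.1.8 vs RR.2.4 fetch partitioning. The per-thread fetchable counts are invented; in reality they are limited by taken branches and I-cache misses:

```python
# RR.<threads>.<instrs>: fetch from `n_threads` threads per cycle,
# up to `per_thread` contiguous instructions from each.
def fetch(threads_ready, order, n_threads, per_thread):
    """Total instructions fetched this cycle under a round-robin order."""
    fetched = 0
    for t in order[:n_threads]:
        fetched += min(threads_ready[t], per_thread)
    return fetched

# Contiguous instrs each thread can supply this cycle (made-up values;
# thread 0 hits a taken branch after 3 instructions).
threads_ready = [3, 6, 8, 1]
order = [0, 1, 2, 3]           # this cycle's round-robin priority

print("RR.1.8:", fetch(threads_ready, order, 1, 8))  # 3
print("RR.2.4:", fetch(threads_ready, order, 2, 4))  # 7
```

Fetching from two threads lets the second thread fill the slots the first thread's taken branch left empty.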

  20. Results

  21. Fetch Effectiveness • Are we picking the best instructions? • IQ-clog: instrs that sit in the issue queue for ages; does it make sense to fetch their dependents? • Wrong-path instructions waste issue slots • Ideally, we want useful instructions that have short issue queue lifetimes

  22. Fetch Effectiveness • Useful instructions: throttle fetch if branch mpred probability is high  confidence, num-branches (BRCOUNT), in-flight window size • Short lifetimes: throttle fetch if you encounter a cache miss (MISSCOUNT), give priority to threads that have young instrs (IQPOSN)

  23. ICOUNT • ICOUNT: priority is based on number of unissued instrs  everyone gets a share of the issue queue • Long-latency instructions will not dominate the IQ • Threads that have high issue rate will also have high fetch rate • In-flight windows are short and wrong-path instrs are minimized • Increased fairness  more ready instrs per cycle
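A minimal sketch of the ICOUNT selection step, e.g. for ICOUNT.2.8: each cycle, fetch goes to the threads with the fewest unissued instructions in the front end. The occupancy counts are invented:

```python
# ICOUNT.2.8 thread selection: pick the 2 threads with the fewest
# instructions sitting in the pre-issue stages (decode/rename/IQ),
# then fetch up to 8 instructions total from them.
def icount_pick(unissued, n=2):
    """Return the ids of the n threads with the fewest unissued instrs."""
    return sorted(range(len(unissued)), key=lambda t: unissued[t])[:n]

unissued = [12, 3, 7, 0, 15, 9, 2, 5]   # per-thread occupancy (made up)
print(icount_pick(unissued))            # [3, 6]: these two fetch this cycle
```

Threads that issue quickly drain their counts and win fetch priority again, which is why high issue rate implies high fetch rate.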

  24. Results Throughput has gone from 2.2 (single-thread) to 3.9 (base SMT) to 5.4 (ICOUNT.2.8)

  25. Reducing IQ-clog • IQBUF: a buffer before the issue queue • ITAG: pre-examine the tags to detect I-cache misses and not waste fetch bandwidth • OPT_last and SPEC_last: lower issue priority for speculative instrs • These techniques entail overheads and result in minor improvements

  26. Bottleneck Analysis • The following are not bottlenecks: issue bandwidth, issue queue size, memory thruput • Doubling fetch bandwidth improves thruput by 8% -- there is still room for improvement • SMT is more tolerant of branch mpreds: perfect prediction improves 1-thread by 25% and 8-thread by 9% -- no speculation has a similar effect • Register file can be a huge bottleneck

  27. IPC vs. Threads vs. Registers

  28. Power and Energy • Energy is heavily influenced by “work done” and by execution time  compared to a single-thread machine, SMT does not reduce “work done”, but reduces execution time  reduced energy • Same work, less time  higher power!
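A back-of-the-envelope model of this energy/power argument, with invented numbers: total energy is split into a work term (same for both machines) and a time-proportional term (clocking, leakage), so shrinking execution time cuts energy while raising average power:

```python
# E_total = E_work + P_fixed * t: SMT keeps the work term but finishes
# sooner, so the time-proportional term shrinks. All numbers invented.
E_WORK = 100.0   # J, proportional to instructions executed (same either way)
P_FIXED = 5.0    # W, burned every second the machine is running

def energy_power(t_exec):
    """Return (total energy in J, average power in W) for a run of t_exec s."""
    energy = E_WORK + P_FIXED * t_exec
    return energy, energy / t_exec

e1, p1 = energy_power(10.0)  # single-thread run
e2, p2 = energy_power(6.0)   # SMT run: same work, less time
print(e1, p1)                # 150.0 J at 15.0 W
print(e2, round(p2, 1))      # 130.0 J at ~21.7 W: less energy, more power
```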

