

1. Hyper-Threading, Chip Multiprocessors, and Both
Zoran Jovanovic

2. To Be Tackled in Multithreading
• Review of threading algorithms
• Hyper-Threading concepts
• Hyper-Threading architecture
• Advantages/disadvantages

3. Threading Algorithms
• Time-slicing (fine-grain multithreading)
  • The processor switches between threads at fixed time intervals.
  • High overhead, especially when one of the threads is in a wait state.
• Switch-on-event (coarse-grain multithreading)
  • The thread is switched out only on long pauses.
  • While a thread waits for data from a relatively slow source, CPU resources are given to other threads.

4. Threading Algorithms (cont.)
• Multiprocessing
  • Distribute the load over many processors.
  • Adds extra cost.
• Simultaneous multithreading
  • Multiple threads execute on a single processor without switching.
  • The basis of Intel's Hyper-Threading technology.

5. Hyper-Threading Concept
• At any point in time, only part of the processor's resources is used to execute a thread's code.
• The unused resources can be loaded as well, for example by executing a second thread or application in parallel.
• Especially useful in desktop and server applications where many threads are active.

6. Quick Recall: Many Resources IDLE! For an 8-way superscalar. From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism,” ISCA 1995. Slide source: John Kubiatowicz

7. [Figure: issue-slot utilization in four pipeline styles]
• (a) A superscalar processor with no multithreading
• (b) A superscalar processor with coarse-grain multithreading
• (c) A superscalar processor with fine-grain multithreading
• (d) A superscalar processor with simultaneous multithreading (SMT)

8. Simultaneous Multithreading (SMT)
Example: Intel's Pentium 4 with “Hyper-Threading”
Key idea: exploit ILP across multiple threads (see the sketch after this list)!
• i.e., convert thread-level parallelism into more ILP
• Exploit these features of modern processors:
  • Multiple functional units: modern processors typically have more functional units available than a single thread can utilize.
  • Register renaming and dynamic scheduling: multiple instructions from independent threads can coexist and co-execute!
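
A minimal pthreads sketch of the key idea (the names and the integer/FP split are illustrative assumptions, not from the slides): two independent threads, one integer-heavy and one floating-point-heavy, give an SMT core two instruction streams whose operations can fill different functional units in the same cycle.

```c
/* build: cc -O2 -pthread smt_demo.c   (file name hypothetical) */
#include <pthread.h>
#include <stdio.h>

/* Integer-heavy work: mostly occupies the integer ALUs. */
static void *int_work(void *arg) {
    (void)arg;
    unsigned long s = 0;
    for (unsigned long i = 1; i < 100000000UL; i++)
        s += i ^ (i >> 3);           /* integer ops only */
    return (void *)s;
}

/* FP-heavy work: mostly occupies the floating-point units. */
static void *fp_work(void *arg) {
    double s = 0.0;
    for (long i = 1; i < 100000000L; i++)
        s += 1.0 / (double)i;        /* FP ops only */
    *(double *)arg = s;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    double fp_result = 0.0;

    /* Two independent threads: on an SMT core their instructions can
       be issued in the same cycle to different functional units. */
    pthread_create(&t1, NULL, int_work, NULL);
    pthread_create(&t2, NULL, fp_work, &fp_result);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("fp result: %f\n", fp_result);
    return 0;
}
```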

9. Hyper-Threading Architecture
• First used in the Intel Xeon MP processor.
• Makes a single physical processor appear as multiple logical processors.
• Each logical processor has its own copy of the architecture state.
• The logical processors share a single set of physical execution resources.

10. Hyper-Threading Architecture
• Operating systems and user programs can schedule processes or threads onto the logical processors just as they would onto the physical processors of a multiprocessor system (see the sketch below).
• From an architecture perspective, we have to worry about the logical processors contending for shared resources:
  • caches, execution units, branch predictors, control logic, and buses.
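
A small sketch of what the OS reports (Linux/glibc-specific, an assumption beyond the slides): the scheduler simply sees logical processors, so on a Hyper-Threaded system the count is twice the number of physical cores.

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Number of logical processors currently online. On a
       Hyper-Threaded system each logical processor is counted,
       so this is typically 2x the number of physical cores. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processors: %ld\n", n);
    return 0;
}
```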

11. Power5 dataflow
• Why only two threads?
  • With four, one of the shared resources (physical registers, cache, memory bandwidth) would be likely to become a bottleneck.
• Cost:
  • The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support.

12. Advantages
• The extra architecture state adds only about 5% to the total die area.
• No performance loss if only one thread is active; increased performance with multiple threads.
• Better resource utilization.

13. Disadvantages
• To benefit from Hyper-Threading, code cannot execute serially; it must be threaded.
• Threads are non-deterministic and require extra design effort (see the race sketch below).
• Threads have increased overhead.
• Shared-resource conflicts.
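
To make the non-determinism point concrete, here is the classic unsynchronized-counter race in C (a generic illustration, not from the slides): the final count varies from run to run because the two threads' read-modify-write sequences interleave.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;             /* shared, unprotected */

static void *inc(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                   /* racy read-modify-write */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, inc, NULL);
    pthread_create(&t2, NULL, inc, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Expected 2000000, but lost updates make the result vary
       between runs: the non-determinism the slide warns about. */
    printf("counter = %ld\n", counter);
    return 0;
}
```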

14. Multicore: Multiprocessors on a Single Chip

15. Basic Shared Memory Architecture
• Processors are all connected to a large shared memory.
• Where are the caches?
[Diagram: processors P1, P2, …, Pn attached to memory through an interconnect]
• Now take a closer look at structure, costs, limits, and programming.
(Slide source: CS267 Lecture 6)

16. What About Caching?
[Diagram: processors P1 … Pn, each with a cache ($), connected by a bus to memory and I/O devices]
• We want high performance for shared memory: use caches!
  • Each processor has its own cache (or multiple caches).
  • Place data from memory into the cache.
  • Write-back cache: don't send all writes over the bus to memory.
• Caches reduce average latency.
  • Automatic replication closer to the processor.
  • More important for a multiprocessor than a uniprocessor: latencies are longer.
• Normal uniprocessor mechanisms are used to access data.
  • Loads and stores form a very low-overhead communication primitive.
• Problem: cache coherence!
Slide source: John Kubiatowicz

17. Example Cache Coherence Problem
[Diagram: processors P1, P2, P3, each with a cache ($), on a bus with memory and I/O devices; memory initially holds u = 5]
• Event sequence: (1) P1 reads u and caches u = 5; (2) P3 reads u and caches u = 5; (3) P3 writes u = 7; (4) P1 reads u again; (5) P2 reads u.
• Things to note:
  • Processors can see different values for u after event 3.
  • With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value, and when.
• How to fix it with a bus: a coherence protocol.
  • Use the bus to broadcast writes or invalidations.
  • Simple protocols rely on the presence of a broadcast medium.
  • A bus does not scale beyond about 64 processors (max) due to capacity and bandwidth limitations.
Slide source: John Kubiatowicz
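
The invalidation traffic such a protocol generates is also visible to software as a performance effect. Below is a minimal C sketch of false sharing (the 64-byte line size and all names are illustrative assumptions): two threads update different counters that happen to share one cache line, so every write invalidates the other core's cached copy even though no datum is truly shared.

```c
#include <pthread.h>
#include <stdio.h>

/* Two counters that land in the same cache line (64 bytes is an
   assumption; the real line size is machine-dependent). */
struct { long a, b; } shared;

/* Alternative layout: padding pushes b into its own line, so writes
   by one thread no longer invalidate the other thread's copy. */
struct { long a; char pad[56]; long b; } padded;

static void *bump_a(void *arg) {
    (void)arg;
    for (int i = 0; i < 50000000; i++) shared.a++;
    return NULL;
}
static void *bump_b(void *arg) {
    (void)arg;
    for (int i = 0; i < 50000000; i++) shared.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    /* Each increment of shared.a invalidates the line holding
       shared.b in the other cache, and vice versa: pure coherence
       traffic. Switching to the padded struct removes it. */
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld %ld\n", shared.a, shared.b);
    return 0;
}
```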

18. Limits of Bus-Based Shared Memory
[Diagram: processors with private caches on a shared bus to memory and I/O; each processor demands 5.2 GB/s from its cache but puts only 140 MB/s on the bus]
• Assume a 1 GHz processor without a cache:
  => 4 GB/s instruction bandwidth per processor (32-bit instructions)
  => 1.2 GB/s data bandwidth at a 30% load-store fraction
• Suppose a 98% instruction hit rate and a 95% data hit rate:
  => 80 MB/s instruction miss bandwidth per processor
  => 60 MB/s data miss bandwidth per processor
  => 140 MB/s combined bus bandwidth per processor
• Assuming 1 GB/s of bus bandwidth, 8 processors will saturate the bus.
(Slide source: CS267 Lecture 6)
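
Restating the slide's arithmetic as one explicit derivation (all numbers are from the slide):

\[
B_{\mathrm{proc}}
  = \underbrace{0.02 \times 4~\mathrm{GB/s}}_{\text{instruction misses}}
  + \underbrace{0.05 \times 1.2~\mathrm{GB/s}}_{\text{data misses}}
  = 80 + 60 = 140~\mathrm{MB/s}
\]
\[
N_{\mathrm{saturate}} \approx \frac{1~\mathrm{GB/s}}{140~\mathrm{MB/s}} \approx 7.1
\quad\Rightarrow\quad \text{about 8 processors fill the bus}
\]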

19. Cache Organizations for Multi-cores
• L1 caches are always private to a core.
• L2 caches can be private or shared.
• Advantages of a shared L2 cache:
  • efficient dynamic allocation of space to each core
  • data shared by multiple cores is not replicated
  • every block has a fixed “home”, hence it is easy to find the latest copy
• Advantages of a private L2 cache:
  • quick access to the private L2: good for small working sets
  • a private bus to the private L2 → less contention

20. A Reminder: SMT (Simultaneous Multithreading) vs. CMP [figure]

21. A Single-Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer 1997
[Figure: superscalar (SS), SMT, and CMP floorplans for the same die area]
• For the same area (a billion-transistor, DRAM-process die):
• Superscalar and SMT are very complex:
  • wide issue
  • advanced branch prediction
  • register renaming
  • out-of-order (OOO) instruction issue
  • non-blocking data caches

22. SS and SMT vs. CMP
• CPU cores: three main hardware design problems of SS and SMT:
  • Area increases quadratically with core complexity (see the approximation below):
    • number of registers is O(instruction window size)
    • register ports are O(issue width)
    • CMP solves this problem (area roughly linear in total issue width)
  • Longer cycle times:
    • long wires, many MUXes and crossbars
    • large buffers, queues, and register files
    • the fixes hurt: clustering (decreases ILP) or deep pipelining (branch misprediction penalties)
    • CMP allows a small cycle time with little effort
  • Complex design and verification
• CMP's own trade-offs: cores are small and fast, but it relies on software to schedule the work, and each core has poor ILP.
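
The quadratic-area claim can be made concrete with a commonly used approximation (an assumption added here, not from the slide): each extra read or write port adds a wordline and a bitline to every register cell, so cell width and height both grow linearly in the port count:

\[
A_{\mathrm{regfile}} \;\propto\; R \,(p_r + p_w)^2,
\qquad p_r + p_w \;\propto\; \text{issue width},
\qquad R \;\propto\; \text{instruction window size}
\]

So register-file area grows at least quadratically in issue width, and faster still once the register count R scales up with the window.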

23. SS and SMT vs. CMP
[Figure: SMT and CMP memory hierarchies]
• Memory:
  • A 12-issue SS or SMT requires a multiported data cache (4-6 ports)
    • 2 x 128 KB (2-cycle latency)
  • CMP: 16 x 16 KB (single-cycle latency), but the secondary cache is slower (multiported)
  • Shared memory: write-through caches

24. Performance Comparison
• Compress (integer app): low ILP and no TLP
• Mpeg-2 (multimedia app): high ILP and TLP, moderate memory requirements (parallelized by hand)
  + SMT utilizes core resources better
  + but CMP has 16 issue slots instead of 12
• Tomcatv (FP app): large loop-level parallelism and large memory bandwidth (TLP extracted by the compiler)
  + CMP has large memory bandwidth to the primary caches
  - SMT's fundamental problem: a unified and slow cache
• Multiprogram: integer multiprogramming workload, all computation-intensive (low ILP, high PLP)

25. CMP Motivation
• How to utilize the available silicon?
  • speculation (aggressive superscalar)
  • simultaneous multithreading (SMT, Hyper-Threading)
  • several processors on a single chip
• What is a CMP (Chip MultiProcessor)?
  • several processors (several masters)
  • both shared- and distributed-memory architectures
  • both homogeneous and heterogeneous processor types
• Why?
  • wire delays
  • diminishing returns from uniprocessors
  • very long design and verification times for modern processors

26. A Single-Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer 1997
• TLP and PLP will become widespread in future applications:
  • various multimedia applications
  • compilers and OS
  • this favours the CMP
• CMP:
  • better performance with simple hardware
  • higher clock rates, better memory bandwidth
  • shorter pipelines
• SMT has better utilization, but CMP has more resources (no wide-issue logic).
• Although CMP is bad when there is no TLP and little ILP (compress), SMT and SS are not much better.

27. A Reminder: SMT (Simultaneous Multithreading) vs. CMP
• SMT:
  • a pool of execution units (a wide machine)
  • several logical processors, with a copy of the architecture state for each
  • multiple threads run concurrently on one core
  • better utilization and latency tolerance
• CMP:
  • simple cores
  • a moderate amount of parallelism per core
  • threads run concurrently on different cores

28. SMT Dual-Core: All Four Threads Can Run Concurrently
[Diagram: two SMT cores on one bus; each core has its own L1 D-cache and D-TLB, integer and floating-point schedulers, uop queues, rename/alloc logic, trace cache, uCode ROM, decoder, BTB and I-TLB, and L2 cache and control; threads 1 and 2 share one core, threads 3 and 4 the other]
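
A Linux-specific sketch of running four threads on such a chip (the 0-3 CPU numbering and the GNU pthread_setaffinity_np call are platform assumptions, not part of the slide): each thread is pinned to one of the four logical processors.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *work(void *arg) {
    (void)arg;
    printf("thread on logical CPU %d\n", sched_getcpu());
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) {
        pthread_create(&t[i], NULL, work, NULL);
        /* Pin thread i to logical CPU i. Whether CPUs 0-3 map to
           {core0: threads 0,1} and {core1: threads 2,3} is
           platform-dependent; check /proc/cpuinfo. The thread may
           run briefly before the pin takes effect: fine for a demo. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(i, &set);
        pthread_setaffinity_np(t[i], sizeof(set), &set);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```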
