On-chip Parallelism



Presentation Transcript


  1. On-chip Parallelism
     Alvin R. Lebeck
     CPS 220/ECE 252

  2. Administrivia
     • Projects
       • Presentations Dec 5 & 7
       • Documents ~10 pages
       • Good writing is important
       • Progress is important
     • Final is Dec 11 (7pm to 10pm)

  3. Multithreaded Processors
     • Exploit thread-level parallelism to improve performance
     • Multiple Program Counters
     • Thread
       • independent programs (multiprogramming)
       • threads from same program

  4. Denelcor HEP
     • General purpose scientific computer
     • Organized as MP
       • up to 16 processors
       • each processor multithreaded
       • up to 128 memory modules
       • up to 4 I/O cache modules
     • Three-input switches and chaotic routing

  5. HEP Processor Organization
     • Multiple contexts (threads)
       • each has own Program Status Word (PSW)
     • PSWs circulate in control loop
       • control and data loops pipelined 8 deep
     • PSW in control loop can circulate no faster than data in data loop
     • PSW at queue head fetches and starts execution of next instruction
     • Clock period: 100 ns
     • 8 PSWs in control loop => 10 MIPS
     • Each thread gets 1/8 of the processor
     • Maximum performance per thread => 1.25 MIPS (And they tried to sell it as a supercomputer)
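
The MIPS figures on this slide follow directly from the clock period and the loop depth. A minimal sketch of the arithmetic, assuming one instruction completes per clock when all 8 pipeline slots are occupied by different threads (the function names are illustrative, not from the slides):

```python
# Values taken from the slide.
CLOCK_PERIOD_NS = 100   # one pipeline step every 100 ns
PSWS_IN_LOOP = 8        # control loop is pipelined 8 deep


def peak_mips(clock_period_ns: float) -> float:
    """Aggregate rate in MIPS, assuming one instruction completes per clock."""
    instructions_per_second = 1e9 / clock_period_ns
    return instructions_per_second / 1e6


def per_thread_mips(clock_period_ns: float, psws_in_loop: int) -> float:
    """A single thread owns one of the circulating PSWs, so it can issue
    only once per trip around the loop."""
    return peak_mips(clock_period_ns) / psws_in_loop


aggregate = peak_mips(CLOCK_PERIOD_NS)
single = per_thread_mips(CLOCK_PERIOD_NS, PSWS_IN_LOOP)
print(f"aggregate: {aggregate} MIPS, per thread: {single} MIPS")
```

This reproduces the slide's 10 MIPS aggregate and 1.25 MIPS per-thread ceiling: a lone thread cannot run faster no matter how idle the rest of the machine is.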

  6. Simultaneous Multithreading
     • Goal: use hardware resources more efficiently
       • especially for superscalar processors
     • Assume 4-issue superscalar
     • Alpha 21464
     [Figure: issue-slot diagram; labels: "Horizontal Waste", "Vertical Waste", "Thread", "Instruction"]

  7. Operation of Simultaneous Multithreading
     • Standard multithreading can reduce vertical waste
     • Issue from multiple threads in same clock cycle
     • Eliminate both horizontal and vertical waste
     • Larger Register Files
     [Figure: two issue-slot diagrams, "Standard Multithreading" and "Simultaneous Multithreading"; labels: "Thread", "Instructions"]
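
The two kinds of waste can be counted from a per-cycle issue trace. The sketch below assumes the usual definitions: a cycle that issues nothing wastes every slot (vertical waste), and a cycle that issues something wastes its unused slots (horizontal waste). The traces are hypothetical, and the SMT model is deliberately crude: it just sums what two threads could issue each cycle and caps the total at the issue width, dropping any excess rather than deferring it.

```python
ISSUE_WIDTH = 4  # the slides assume a 4-issue superscalar


def slot_waste(issued_per_cycle, issue_width=ISSUE_WIDTH):
    """Return (vertical_waste, horizontal_waste) in issue slots."""
    vertical = 0
    horizontal = 0
    for issued in issued_per_cycle:
        if not 0 <= issued <= issue_width:
            raise ValueError(f"cannot issue {issued} on a {issue_width}-wide machine")
        if issued == 0:
            vertical += issue_width              # whole cycle lost
        else:
            horizontal += issue_width - issued   # partially filled cycle
    return vertical, horizontal


def smt_issue(thread_a, thread_b, issue_width=ISSUE_WIDTH):
    """Crude SMT model: both threads share the slots of each cycle."""
    return [min(issue_width, a + b) for a, b in zip(thread_a, thread_b)]


# Hypothetical traces: instructions each thread could issue per cycle.
thread_a = [2, 0, 1, 0, 3, 0]
thread_b = [1, 2, 0, 3, 2, 0]

single_waste = slot_waste(thread_a)
smt_waste = slot_waste(smt_issue(thread_a, thread_b))
print("single thread (vertical, horizontal):", single_waste)
print("SMT           (vertical, horizontal):", smt_waste)
```

With these made-up traces, the single thread leaves 18 of 24 slots empty while the shared machine leaves 11. The result depends entirely on the traces chosen, so it illustrates the bookkeeping and is not a performance claim.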

  8. Limitations of SuperScalar Architectures
     • Instruction Fetch
       • branch prediction
       • alignment of packet of instructions
     • Dynamic Instruction Issue
       • Need to identify ready instructions
       • Rename Table
         • No compares
         • Large number of ports (Operands x Width)
       • Issue Queue Size
         • n x Q x O x W 1-bit comparators (src and dest)
         • Quadratic increase in queue size with issue width
       • PA-8000: 20% of die area to issue queue (56 instruction window)
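
The comparator count n x Q x O x W can be tabulated to see where the quadratic growth comes from. The slide does not define the symbols; this sketch assumes n is the register-tag width in bits, Q the number of issue-queue entries, O the source operands per instruction, and W the issue width. It further assumes, hypothetically, that the queue is sized in proportion to the issue width. All numeric values are made up for illustration.

```python
def comparator_bits(tag_bits, queue_entries, operands, issue_width):
    """1-bit comparators needed if every queued source operand is checked
    against every result tag broadcast in a cycle (n x Q x O x W)."""
    return tag_bits * queue_entries * operands * issue_width


def comparators_when_queue_scales(issue_width, entries_per_slot=8,
                                  tag_bits=7, operands=2):
    """Assumption: Q = entries_per_slot * W, so W appears twice in the
    product and the count grows with W squared."""
    queue_entries = entries_per_slot * issue_width
    return comparator_bits(tag_bits, queue_entries, operands, issue_width)


for width in (2, 4, 8):
    print(width, comparators_when_queue_scales(width))
```

Under these assumptions, doubling the issue width quadruples the comparator count, which is one reading of the slide's "quadratic increase" bullet.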

  9. SuperScalar Limitations (Continued)
     • Instruction Execute
       • Register File
         • more rename registers
         • more access ports
         • complexity quadratic with issue width
       • Bypass logic
         • complexity quadratic with issue width
         • wire delays
       • Functional Units
         • replicate
         • add ports to data cache (complexity adds to access time)

  10. Why Single Chip MP?
      • Technology Push
        • Benefits of wide issue are limited
        • Decentralized microarchitecture: easier to build several simple fast processors than one complex processor
      • Application Pull
        • Applications exhibit parallelism at different grains
        • < 10 instructions per cycle (Integer codes)
        • > 40 instructions per cycle (FP loops)

  11. A 6-Way SuperScalar Processor
      [Figure: floorplan of a 21 mm x 21 mm die. Labeled blocks: I-Cache (32 KB); D-Cache (32 KB); L2 Cache (256 KB); External Interface; Instruction Fetch; TLB; Instruction Decode & Rename; Reorder Buffer, Instruction Queues, and Out-of-Order Logic; Integer Unit; Floating Point Unit; Clocking & Pads]

  12. A 4 x 2 Single Chip Multiprocessor
      [Figure: floorplan of a 21 mm x 21 mm die. Labeled blocks: Processor #1 to #4; Icache 1 to 4; Dcache 1 to 4; L2 Cache (256 KB); L2 Communication Crossbar; External Interface; Clocking & Pads]

  13. Performance Comparison

  14. Summary of Performance
      • 4 x 2 MP works well for coarse grain apps
      • How well would Message Passing Architecture do?
      • Can SUIF handle pointer intensive codes?
      • For “tough” codes 6-way does slightly better, but neither is > 60% better than 2-issue
