140 likes | 290 Vues
This presentation discusses the latest findings in on-chip parallelism, particularly focusing on multithreaded processors and their organization. We explore techniques to exploit thread-level parallelism for improved performance, detailing architectures capable of handling multiple program counters and context management. Additionally, it covers simultaneous multithreading and its implications on resource efficiency, as well as challenges within superscalar architectures, including instruction fetch, dynamic instruction issue, and execution limitations. The seminar emphasizes critical writing skills, progress updates, and deadlines for final projects, emphasizing the importance of clear communication in technical presentations.
E N D
On-chip Parallelism Alvin R. Lebeck CPS 220/ECE 252
Administrivia Projects • Presentations Dec 5 & 7 • Documents ~10 pages • Good writing is important • Progress is important • Final is Dec 11 (7pm to 10pm) CPS 220
Multithreaded Processors • Exploit thread-level parallelism to improve performance • Multiple Program Counters • Thread • independent programs (multiprogramming) • threads from same program CPS 220
Deneclor HEP • General purpose scientific computer • Organized as MP • up to 16 processors • each processor multithreaded • up to 128 memory modules • up to 4 I/O cache modules • Three-input switches and chaotic routing CPS 220
HEP Processor Organization • Multiple contexts (threads) • each has own Program Status Word (PSW) • PSWs circulate in control loop • control and data loops pipelined 8 deep • PSW in control can circulate no faster than data in data loop • PSW at queue head fetches and starts execution of next instruction • Clock period: 100ns • 8 PSWs in control loop => 10MIPS • Each thread gets 1/8 the processor • Maximum performance per thread => 1.25 MIPS (And they tried to sell as supercomputer) CPS 220
Horizontal Waste Verticle Waste Simultaneous Multithreading • Goal: use hardware resources more efficiently • especially for superscalar processors • Assume 4-issue superscalar • Alpha 21464 Thread Instruction CPS 220
Operation of Simultaneous Multithreading • Standard multithreading can reduce verticle waste • Issue from multiple threads in same cock cycle • Eliminate both horizontal and verticle waste • Larger Register Files Thread Instructions Thread Instructions Standard Multithreading Simultaneous Multithreading CPS 220
Limitations of SuperScalar Architectures Instruction Fetch • branch prediction • alignment of packet of instructions Dynamic Instruction Issue • Need to identify ready instructions • Rename Table • No compares • Large number of ports (Operands x Width) • Issue Queue Size • n x Q x O x W 1 bit comparators (src and dest) • Quadratic increase in queue size with issue width • PA-8000 20% of die area to issue queue (56 instruction window) CPS 220
SuperScalar Limitations (Continued) Instruction Execute • Register File • more rename registers • more access ports • complexity quadratic with issue width • Bypass logic • complexity quadratic with issue width • wire delays • Functional Units • replicate • add ports to data cache (complexity adds to access time) CPS 220
Why Single Chip MP? • Technology Push • Benefits of wide issue are limited • Decentralized microarchitecture: easier to build several simple fast processors than one complex processor • Application Pull • Applications exhibit parallelism at different grains • < 10 instructions per cycle (Integer codes) • > 40 instructions per cycle (FP loops) CPS 220
I-Cache (32 KB) External Interface Instruction Fetch TLB Instruction Decode & Rename D-Cache (32 KB) L2 Cache (256 KB) 21 mm Clocking & Pads Reorder Buffer, Instruction Queues, and Out-of-Order Logic Integer Unit Floating Point Unit A 6-Way SuperScalar Processor 21 mm CPS 220
A 4 x 2 Single Chip Multiprocessor 21 mm Icache 1 Icache 2 External Interface Processor #1 Processor #2 L2 Cache (256 KB) Dcache 1 Dcache 2 21 mm Clocking & Pads Dcache 3 Dcache 4 L2 Communication Crossbar Processor #3 Processor #4 Icache 3 Icache 4 CPS 220
Performance Comparison CPS 220
Summary of Performance • 4 x 2 MP works well for coarse grain apps • How well would Message Passing Architecture do? • Can SUIF handle pointer intensive codes? • For “tough” codes 6-way does slightly better, but neither is > 60% better than 2-issue CPS 220