
On-chip Parallelism






Presentation Transcript


  1. On-chip Parallelism Alvin R. Lebeck CPS 221 Week 13, Lecture 2

  2. Administrivia • Today: simultaneous multithreading, MP on a chip • project presentations (10-15 minutes) • midterm II, Wed April 29, in class • project write-up due Friday May 1, Noon • approximately 8 pages CPS 221

  3. Review: Software Coherence Protocols Requires • Access Control • Messaging System • small control messages • large bulk transfer • Programmable Processor • Support for Protocol operations Questions • Kernel-based vs. User-Level? • Integration of processor with other requirements? CPS 221
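
A minimal sketch of the messaging requirement listed above: small control messages on a low-latency path plus a bulk-transfer path for block data. The interface (ctrl_msg, send_ctrl, send_bulk) and the message fields are hypothetical, for illustration only, not the Tempest interface.

    /* Hypothetical user-level messaging interface for a software coherence
       protocol: small control messages for protocol events, bulk transfers
       for cache-block data.  Names, sizes, and fields are assumptions. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        uint16_t type;          /* e.g., READ_REQ, INVALIDATE, ACK */
        uint16_t src_node;
        uint64_t block_addr;
    } ctrl_msg;                 /* a few words: the "small control message" case */

    static void send_ctrl(int dest, const ctrl_msg *m) {              /* stub */
        printf("ctrl -> node %d: type=%u addr=0x%llx\n",
               dest, m->type, (unsigned long long)m->block_addr);
    }

    static void send_bulk(int dest, uint64_t addr, const void *data, size_t len) {  /* stub */
        (void)data;
        printf("bulk -> node %d: addr=0x%llx len=%zu\n",
               dest, (unsigned long long)addr, len);
    }

    /* A protocol handler combines the two paths: request with a control
       message, answer with a bulk transfer of the block. */
    int main(void) {
        uint8_t block[64];
        memset(block, 0, sizeof block);
        ctrl_msg req = { .type = 1, .src_node = 0, .block_addr = 0x1000 };
        send_ctrl(2, &req);                          /* requester asks the home node */
        send_bulk(0, 0x1000, block, sizeof block);   /* home node returns the data */
        return 0;
    }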

  4. Review: Typhoon • Fully Integrated (processor, access control, NI) [Figure: node diagram with processors (P), caches ($), Mem, RTLB, and NI] CPS 221

  5. Software Fine-Grain Access Control • Low cost, can run on network of workstations • Flexibility of Software protocol processing • Like SW Dirty Bits, but more general • For each load/store, check access bits • if access fault, invoke fault handler • Lookup Options • table lookup (Blizzard-S) • magic cookie (Shasta, Blizzard-COW) • Instrumentation Options • compiler • executable editing CPS 221

  6. Blizzard-S • Supports Tempest Interface • Executable Editing (EEL) • Fast Table Lookup • mask, shift, add CPS 221
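
A minimal sketch of the mask/shift/add lookup above, written in C rather than as the inline instructions EEL would insert: the address is shifted down to a block number, masked into the table's index range, and added to the table base by the array indexing. Block size, table size, and state encoding are assumptions for illustration.

    /* Illustrative Blizzard-S-style fine-grain access check: a software
       state table indexed with a mask, shift, and add before each load. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SHIFT   5                      /* assumed 32-byte coherence blocks */
    #define TABLE_ENTRIES (1u << 20)             /* assumed table size (power of two) */
    enum { STATE_INVALID = 0, STATE_READONLY = 1, STATE_READWRITE = 2 };

    static uint8_t state_table[TABLE_ENTRIES];   /* one state byte per block */

    static inline size_t block_index(uintptr_t addr) {
        return (addr >> BLOCK_SHIFT) & (TABLE_ENTRIES - 1);   /* shift, then mask */
    }

    static void access_fault_handler(uintptr_t addr, int is_write) {
        /* A real handler would run the coherence protocol (fetch or upgrade). */
        printf("fault: %s to 0x%lx\n", is_write ? "store" : "load", (unsigned long)addr);
        state_table[block_index(addr)] = is_write ? STATE_READWRITE : STATE_READONLY;
    }

    /* The check inserted before each load by executable editing. */
    static inline uint32_t checked_load(uint32_t *p) {
        uintptr_t addr = (uintptr_t)p;
        if (state_table[block_index(addr)] == STATE_INVALID)
            access_fault_handler(addr, 0);
        return *p;
    }

    int main(void) {
        uint32_t x = 42;
        printf("%u\n", (unsigned)checked_load(&x));  /* first access faults, then loads 42 */
        return 0;
    }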

  7. Shasta • Executable Editing (variant of ATOM) • Magic Cookie: ld r1, r2[300]; if (r1 == magic_cookie) do_out_of_line_check(x); add r3, r1, r4 • Incorporates several optimizations • code scheduling • batching checks (refs to same cache lines) • 3% overhead on uniprocessor code • Multiple coherence granularity • Supports Release Consistency CPS 221
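
For contrast with the table lookup, here is a sketch of the magic-cookie check above in C: invalid data is overwritten with a distinguished flag value, so the common-case check is a compare against the value just loaded, and the rare false match is resolved out of line. The flag value and helper names are assumptions for illustration.

    /* Illustrative Shasta-style magic-cookie check on a load. */
    #include <stdint.h>
    #include <stdio.h>

    #define MAGIC_COOKIE 0xDEADBEEFu        /* assumed flag value stored in invalid blocks */

    static uint32_t do_out_of_line_check(uint32_t *p) {
        /* Consult the real coherence state; either the block is invalid and
           must be fetched, or valid data merely collided with the flag value. */
        printf("out-of-line check for %p\n", (void *)p);
        *p = 7;                             /* stand-in for fetching the real data */
        return *p;
    }

    static inline uint32_t checked_load(uint32_t *p) {
        uint32_t v = *p;                    /* ld   r1, r2[300]                */
        if (v == MAGIC_COOKIE)              /* if r1 == magic_cookie           */
            v = do_out_of_line_check(p);    /*     do_out_of_line_check(...)   */
        return v;                           /* add  r3, r1, r4 uses the result */
    }

    int main(void) {
        uint32_t invalid_block = MAGIC_COOKIE;   /* block currently invalid */
        printf("%u\n", (unsigned)checked_load(&invalid_block));
        return 0;
    }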

  8. Future Directions • Simultaneous Multithreading • Single-Chip MP • MultiScalar Processors (Wednesday) CPS 221

  9. Multithreaded Processors • Exploit thread-level parallelism to improve performance • Multiple Program Counters • Thread • independent programs (multiprogramming) • threads from same program CPS 221

  10. Denelcor HEP • General purpose scientific computer • Organized as MP • up to 16 processors • each processor multithreaded • up to 128 memory modules • up to 4 I/O cache modules • Three-input switches and chaotic routing CPS 221

  11. HEP Processor Organization • Multiple contexts (threads) • each has own Program Status Word (PSW) • PSWs circulate in control loop • control and data loops pipelined 8 deep • PSW in control loop can circulate no faster than data in data loop • PSW at queue head fetches and starts execution of next instruction • Clock period: 100 ns • 8 PSWs in control loop => 10 MIPS • Each thread gets 1/8 of the processor • Maximum performance per thread => 1.25 MIPS (And they tried to sell it as a supercomputer) CPS 221
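
A toy model of the PSW control loop just described (interleaved, or "barrel", multithreading): the PSW at the head of the queue issues one instruction each cycle and rotates to the back, so with 8 PSWs each thread gets every eighth cycle. At 100 ns per cycle that is 10 MIPS total and 10/8 = 1.25 MIPS per thread. The structure below is a simplified illustration, not HEP's actual implementation.

    /* Toy model of HEP-style interleaved multithreading: PSWs issue
       round-robin, one instruction per machine cycle. */
    #include <stdio.h>

    #define NUM_PSWS 8

    typedef struct { int thread_id; unsigned pc; } psw;   /* drastically simplified PSW */

    int main(void) {
        psw queue[NUM_PSWS];
        for (int i = 0; i < NUM_PSWS; i++)
            queue[i] = (psw){ .thread_id = i, .pc = 0 };

        for (int cycle = 0; cycle < 16; cycle++) {
            psw *head = &queue[cycle % NUM_PSWS];          /* PSW at queue head issues */
            printf("cycle %2d: thread %d issues instruction at pc=%u\n",
                   cycle, head->thread_id, head->pc);
            head->pc += 4;                                 /* advance that thread only */
        }
        /* After 16 cycles each thread has issued 2 instructions: 1/8 of the machine. */
        return 0;
    }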

  12. Simultaneous Multithreading • Goal: use hardware resources more efficiently • especially for superscalar processors • Assume 4-issue superscalar [Figure: per-cycle issue-slot diagram of thread instructions, showing horizontal waste (unused slots within a cycle) and vertical waste (completely idle cycles)] CPS 221

  13. Operation of Simultaneous Multithreading • Standard multithreading can reduce vertical waste • Issue from multiple threads in same clock cycle • Eliminate both horizontal and vertical waste [Figure: issue-slot diagrams of thread instructions contrasting Standard Multithreading with Simultaneous Multithreading] CPS 221
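
A small model of the issue-slot picture above, assuming a 4-issue machine: conventional multithreading gives each cycle to a single thread, so idle cycles from a stalled thread can be covered but unused slots within a cycle (horizontal waste) remain; SMT fills leftover slots from other threads in the same cycle. Thread behavior here is random and purely illustrative.

    /* Toy count of wasted issue slots: conventional multithreading vs SMT. */
    #include <stdio.h>
    #include <stdlib.h>

    #define ISSUE_WIDTH 4
    #define NUM_THREADS 4
    #define CYCLES      1000

    int main(void) {
        srand(1);
        int wasted_mt = 0, wasted_smt = 0;
        for (int cycle = 0; cycle < CYCLES; cycle++) {
            int ready[NUM_THREADS];
            for (int t = 0; t < NUM_THREADS; t++)
                ready[t] = rand() % (ISSUE_WIDTH + 1);   /* 0..4 issuable instructions */

            /* Conventional multithreading: one thread owns the whole cycle. */
            int owner = cycle % NUM_THREADS;
            wasted_mt += ISSUE_WIDTH - ready[owner];

            /* SMT: fill slots from any ready thread until the width is used. */
            int slots = ISSUE_WIDTH;
            for (int t = 0; t < NUM_THREADS && slots > 0; t++) {
                int take = ready[t] < slots ? ready[t] : slots;
                slots -= take;
            }
            wasted_smt += slots;
        }
        printf("wasted slots out of %d: conventional MT = %d, SMT = %d\n",
               ISSUE_WIDTH * CYCLES, wasted_mt, wasted_smt);
        return 0;
    }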

  14. Limitations of SuperScalar Architectures Instruction Fetch • branch prediction • alignment of packet of instructions Dynamic Instruction Issue • Need to identify ready instructions • Rename Table • No compares • Large number of ports (Operands x Width) • Reorder Buffer • n x Q x O x W 1 bit comparators (src and dest) • Quadratic increase in queue size with issue width • PA-8000 20% of die area to issue queue (56 instruction window) CPS 221
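
A back-of-the-envelope illustration of the comparator count noted above: each window entry compares each of its source-operand tags against every destination tag broadcast per cycle, so comparators grow roughly as window size x operands x issue width, and the window itself is usually grown with issue width. The parameter values below are assumptions, not figures for any particular chip.

    /* Rough scaling of issue-queue wakeup comparators with issue width. */
    #include <stdio.h>

    int main(void) {
        const int operands = 2;                      /* source operands per instruction */
        for (int width = 2; width <= 8; width *= 2) {
            int window = 16 * width;                 /* assume the window scales with width */
            long comparators = (long)window * operands * width;
            printf("issue width %d: window %3d entries -> %ld tag comparators\n",
                   width, window, comparators);
        }
        return 0;
    }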

  15. SuperScalar Limitations (Continued) Instruction Execute • Register File • more rename registers • more access ports • complexity quadratic with issue width • Bypass logic • complexity quadratic with issue width • wire delays • Functional Units • replicate • add ports to data cache (complexity adds to access time) CPS 221
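
The bypass-logic point above can be illustrated the same way: with full forwarding, every functional-unit result must reach every functional-unit input, so path count grows as producers x consumer inputs, i.e., quadratically with issue width. The per-unit numbers below are assumptions for illustration.

    /* Rough scaling of bypass paths with issue width under full forwarding. */
    #include <stdio.h>

    int main(void) {
        for (int width = 2; width <= 8; width *= 2) {
            int producers = width;            /* one result per functional unit per cycle (assumed) */
            int consumer_inputs = 2 * width;  /* two source operands per functional unit (assumed) */
            printf("issue width %d: %d forwarding paths\n",
                   width, producers * consumer_inputs);
        }
        return 0;
    }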

  16. Why Single Chip MP? • Technology Push • Benefits of wide issue are limited • Decentralized microarchitecture: easier to build several simple fast processors than one complex processor • Application Pull • Applications exhibit parallelism at different grains • < 10 instructions per cycle (Integer codes) • > 40 instructions per cycle (FP loops) CPS 221

  17. A 6-Way SuperScalar Processor [Figure: 21 mm x 21 mm die floorplan: instruction fetch, TLB, instruction decode & rename, reorder buffer, instruction queues, and out-of-order logic, integer unit, floating point unit, I-Cache (32 KB), D-Cache (32 KB), L2 Cache (256 KB), external interface, clocking & pads] CPS 221

  18. A 4 x 2 Single Chip Multiprocessor [Figure: 21 mm x 21 mm die floorplan: Processors #1-#4, each with its own I-cache and D-cache, an L2 communication crossbar, a shared L2 Cache (256 KB), external interface, and clocking & pads] CPS 221

  19. Performance Comparison CPS 221

  20. Summary of Performance • 4 x 2 MP works well for coarse grain apps • How well would Message Passing Architecture do? • Can SUIF handle pointer intensive codes? • For “tough” codes 6-way does slightly better, but neither is > 60% better than 2-issue CPS 221
