
Multithreading


Presentation Transcript


  1. Multithreading

  2. Little's Law
  [Figure: one operation occupying "throughput per cycle" × "latency in cycles"]
  Parallelism = Throughput * Latency
  • To maintain throughput T/cycle when each operation has latency L cycles, need T*L independent operations in flight
  • For fixed parallelism:
    • decreased latency allows increased throughput
    • decreased throughput allows increased latency tolerance
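  A quick numeric check of the formula above, as a minimal C sketch; the throughput and latency values (2 ops/cycle, 5 cycles) are assumed example numbers, not figures from the slides.

```c
/* Minimal sketch of Little's Law for one pipelined functional unit.
 * The throughput and latency values are assumed example numbers. */
#include <stdio.h>

int main(void) {
    double throughput = 2.0;  /* operations accepted per cycle (T) */
    double latency    = 5.0;  /* cycles from issue to result (L)   */

    /* Parallelism = Throughput * Latency */
    double in_flight = throughput * latency;

    printf("Need %.0f independent operations in flight "
           "to sustain %.0f ops/cycle at %.0f-cycle latency\n",
           in_flight, throughput, latency);
    return 0;
}
```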

  3. Types of Parallelism
  [Figure: four time/space diagrams contrasting Pipelining, Data-Level Parallelism (DLP), Thread-Level Parallelism (TLP), and Instruction-Level Parallelism (ILP)]

  4. Issues in Parallel Machine Design
  • Communication: how do parallel operations communicate data results?
  • Synchronization: how are parallel operations coordinated?
  • Resource Management: how are a large number of parallel tasks scheduled onto finite hardware?
  • Scalability: how large a machine can be built?

  5. Flynn’s Classification (1966)
  • Broad classification of parallel computing systems based on the number of instruction and data streams
  • SISD: Single Instruction, Single Data
    • conventional uniprocessor
  • SIMD: Single Instruction, Multiple Data
    • one instruction stream, multiple data paths
    • distributed-memory SIMD (MPP, DAP, CM-1&2, Maspar)
    • shared-memory SIMD (STARAN, vector computers)
  • MIMD: Multiple Instruction, Multiple Data
    • message passing (Transputers, nCube, CM-5)
    • non-cache-coherent shared memory (BBN Butterfly, T3D)
    • cache-coherent shared memory (Sequent, Sun Starfire, SGI Origin)
  • MISD: Multiple Instruction, Single Data
    • no commercial examples

  6. SIMD Architecture
  [Figure: array controller sending control and data to processing elements (PEs), each with its own memory, linked by an inter-PE connection network]
  • Central controller broadcasts instructions to multiple processing elements (PEs)
  • Only requires one controller for the whole array
  • Only requires storage for one copy of the program
  • All computations fully synchronized

  7. Vector Register Machine
  [Figure: scalar registers r0–r15; vector registers v0–v15, each holding elements [0] … [VLRMAX-1]; vector length register VLR; vector arithmetic instruction VADD v3, v1, v2 adds elements [0] … [VLR-1]; vector load/store instruction VLD v1, r1, r2 streams from memory with base r1 and stride r2]
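  To make the instruction semantics above concrete, here is a minimal C model of a strided vector load and an element-wise vector add, assuming 64-bit integer elements and a VLRMAX of 64; the function names and register layout are illustrative, not a real ISA definition.

```c
/* Sketch of the slide's vector-instruction semantics modeled in C.
 * VLRMAX, the element type, and the function names are assumptions. */
#include <stdint.h>

#define VLRMAX 64
typedef int64_t vreg_t[VLRMAX];   /* one vector register */

/* VLD v1, r1, r2: load vlr elements starting at base (r1),
 * advancing by stride (r2) elements between consecutive loads. */
void vld(vreg_t dst, const int64_t *base, int64_t stride, int vlr) {
    for (int i = 0; i < vlr; i++)
        dst[i] = base[i * stride];
}

/* VADD v3, v1, v2: element-wise add of the first vlr elements. */
void vadd(vreg_t dst, const vreg_t a, const vreg_t b, int vlr) {
    for (int i = 0; i < vlr; i++)
        dst[i] = a[i] + b[i];
}
```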

  8. Cray-1 (1976)
  [Figure: Cray-1 block diagram — eight 64-element vector registers V0–V7 with vector length and vector mask registers, scalar registers S0–S7, address registers A0–A7, 64 T registers, functional units for Int Add, Int Logic, Int Shift, Pop Cnt, FP Add, FP Mul, FP Recip, Addr Add, and Addr Mul, and 4 instruction buffers (16 × 64-bit each)]
  • Single-port memory: 16 banks of 64-bit words + 8-bit SECDED
  • 80 MW/sec data load/store; 320 MW/sec instruction buffer refill
  • Memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz)

  9. MIMD Machines
  • Message passing
    • Thinking Machines CM-5
    • Intel Paragon
    • Meiko CS-2
    • many cluster systems (e.g., IBM SP-2, Linux Beowulfs)
  • Shared memory
    • no hardware cache coherence
      • IBM RP3
      • BBN Butterfly
      • Cray T3D/T3E
      • parallel vector supercomputers (Cray T90, NEC SX-5)
    • hardware cache coherence
      • many small-scale SMPs (e.g., quad Pentium Xeon systems)
      • large-scale bus/crossbar-based SMPs (Sun Starfire)
      • large-scale directory-based SMPs (SGI Origin)

  10. MIMD Architectures
  [Figure: processors P1 … P10 on a shared bus versus processors P1 … P16 in a distributed arrangement, shown over time]
  • Memory contention
  • Communication contention
  • Communication latency

  11. SMT Architectural Abstraction
  [Figure: four Thread Processing Units (TPU 0–3) sharing the Icache, TLB, L1 D-cache, and L2 cache]
  • Alpha 21464: 1 CPU with four Thread Processing Units (TPUs)
  • Shared hardware resources

  12. Instruction Issue
  [Figure: function unit utilization over time]
  Reduced function unit utilization due to dependencies

  13. Superscalar Issue
  [Figure: function unit utilization over time]
  Superscalar issue leads to more performance, but lower utilization

  14. Predicated Issue
  [Figure: function unit utilization over time]
  Adds to function unit utilization, but results are thrown away

  15. Chip Multiprocessor
  [Figure: function unit utilization over time]
  Limited utilization when only running one thread

  16. Fine-Grained Multithreading
  [Figure: function unit utilization over time]
  Intra-thread dependencies still limit performance

  17. Simultaneous Multithreading
  [Figure: function unit utilization over time]
  Maximum utilization of function units by independent operations

  18. SMT Microarchitecture Changes
  • Multiple PCs
    • control logic to decide which thread to fetch from each cycle
  • Separate return stacks per thread
  • Per-thread reorder/commit/flush/trap
  • Thread id stored with each BTB entry
  • Larger register file
    • more things outstanding
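  As a rough illustration of how little front-end state these changes amount to, the C sketch below groups the per-thread structures named on this slide into a struct; the field names, sizes, and thread count are hypothetical rather than taken from any real design.

```c
/* Hypothetical sketch of the per-thread front-end state an SMT core
 * replicates; everything else (caches, physical registers, BTB
 * storage) remains shared, with BTB entries tagged by thread id. */
#include <stdint.h>

#define NUM_THREADS  4
#define RAS_DEPTH   16              /* return-stack entries per thread */

typedef struct {
    uint64_t pc;                     /* one PC per thread                 */
    uint64_t ret_stack[RAS_DEPTH];   /* separate return stack             */
    int      ret_top;
    uint32_t pending_instrs;         /* feeds fetch policies (slide 25)   */
    int      needs_flush;            /* per-thread flush/trap bookkeeping */
} thread_ctx_t;

typedef struct {
    thread_ctx_t threads[NUM_THREADS];
    int          fetch_thread;       /* which thread fetches this cycle   */
} smt_frontend_t;
```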

  19. Basic Out-of-Order Pipeline
  [Figure: pipeline stages Fetch (PC, Icache) → Decode/Map (register map) → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire; every stage is thread-blind]

  20. SMT Pipeline
  [Figure: the same Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire pipeline, with the per-thread changes from slide 18 applied]

  21. Performance [Tullsen et al., ISCA ’96]

  22. Multiprogrammed Workload

  23. Multithreaded Applications

  24. Optimizing: Fetch Freedom [Tullsen et al., ISCA ’96]
  • RR = Round Robin
  • RR.X.Y: X = number of threads that fetch in a cycle, Y = instructions fetched per thread
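  A minimal sketch of what an RR.X.Y partition might look like in the fetch stage, assuming an 8-thread machine and a hypothetical fetch() helper standing in for the real fetch hardware:

```c
/* RR.X.Y sketch: each cycle, X threads fetch, each up to Y instructions,
 * with the starting thread rotated round-robin from cycle to cycle. */
#include <stdio.h>

#define NUM_THREADS 8

/* Hypothetical stand-in for the pipeline's fetch stage. */
void fetch(int tid, int max_instrs) {
    printf("fetch: thread %d, up to %d instructions\n", tid, max_instrs);
}

void rr_fetch_cycle(int cycle, int X, int Y) {
    for (int k = 0; k < X; k++) {
        int tid = (cycle + k) % NUM_THREADS;   /* rotate starting thread */
        fetch(tid, Y);
    }
}
```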

  25. Optimizing: Fetch Algorithm [Tullsen et al., ISCA ’96]
  • ICOUNT: priority to the thread with the fewest pending instructions
  • BRCOUNT: priority to the thread with the fewest unresolved branches
  • MISSCOUNT: priority to the thread with the fewest outstanding D-cache misses
  • IQPOSN: penalize threads with old instructions (at the front of the queues)
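  As an illustration, here is a minimal sketch of the ICOUNT selection step, assuming the pipeline maintains a per-thread count of pending (fetched but not yet issued) instructions:

```c
/* ICOUNT sketch: fetch from the thread with the fewest instructions in
 * the decode/rename/queue stages, so no thread hogs the queues.
 * The icount[] array is assumed to be maintained by the pipeline. */
#define NUM_THREADS 8

int icount_pick(const int icount[NUM_THREADS]) {
    int best = 0;
    for (int t = 1; t < NUM_THREADS; t++)
        if (icount[t] < icount[best])
            best = t;
    return best;   /* thread id granted fetch this cycle */
}
```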

  26. Effects of Thread Interference in Shared Structures
  • Inter-thread cache interference
  • Increased memory requirements
  • Interference in branch prediction hardware

  27. Inter-thread Cache Interference
  • Threads share the cache, so more threads means a lower hit rate
  • Two reasons why this is not a significant problem:
    • The additional L1 cache misses can be almost entirely covered by the 4-way set-associative L2 cache
    • Out-of-order execution, write buffering, and the use of multiple threads allow SMT to hide the small increase in memory latency
  • Removing inter-thread cache misses yields only a 0.1% speedup

  28. Increased Memory Requirements
  • The more threads are used, the more memory references per cycle
  • Bank conflicts in the L1 cache account for most of the additional memory-access cost
  • This effect is negligible:
    • With longer cache lines, the gains from better spatial locality outweigh the cost of L1 bank contention
    • Removing inter-thread contention yields only a 3.4% speedup

  29. Interference in Branch Prediction Hardware
  • Since all threads share the prediction hardware, it experiences inter-thread interference
  • This effect is negligible because:
    • The speedup from multithreading outweighs the additional misprediction latency
    • Going from one to eight threads, misprediction rates rise only from 2.0% to 2.8% for branches and from 0.0% to 0.1% for jumps

  30. Quiescing Idle Threads
  • Problem: a spin-looping thread consumes resources
  • Solution: provide a quiescing operation that allows a TPU to sleep until a memory location changes
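  A sketch of the contrast, assuming a hypothetical quiesce_until_change() primitive standing in for the quiescing operation; the spinning version issues instructions every cycle, while the quiescing version puts the TPU to sleep:

```c
/* Spin lock vs. quiesced lock on an SMT core.
 * quiesce_until_change() is a hypothetical stand-in for a quiescing
 * instruction, not a real API. */
#include <stdatomic.h>

void quiesce_until_change(volatile atomic_int *addr, int observed);  /* hypothetical */

void lock_acquire_spin(volatile atomic_int *lock) {
    while (atomic_exchange(lock, 1) != 0) {
        /* busy-wait: issues instructions every cycle, stealing
         * resources from the other threads */
    }
}

void lock_acquire_quiesce(volatile atomic_int *lock) {
    while (atomic_exchange(lock, 1) != 0) {
        /* put this TPU to sleep until *lock is written again */
        quiesce_until_change(lock, 1);
    }
}
```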

  31. Discussion Points
  • Does SMT reduce the demands of the ILP uArch?
  • Would it be possible to tolerate a branch misprediction with SMT parallelism? How?
  • Would it be possible for an SMT machine to deadlock? Livelock?
  • What is the cost of adding SMT?
  • What is the effect on single-thread performance?
