
The Stanford Hydra CMP


Presentation Transcript


  1. The Stanford Hydra CMP Lance Hammond, Benedict A. Hubbert, Michael Siu, Manohar K. Prabhu, Michael Chen, Kunle Olukotun. Presented by Jason Davis

  2. Introduction • Hydra: a CMP with 4 MIPS processors, an L1 cache for each CPU, and an L2 cache that holds the permanent state • Why? • Moore's law is reaching its end • Finite amount of ILP • TLP (thread-level parallelism) vs. ILP in pipelined architectures • A CMP can exploit ILP as well (TLP and ILP are orthogonal) • Wire delay • Design time (the CPU core doesn't need to be redesigned, just replicated) • Problems • Integration densities are only now giving reasons to consider new models • Difficult to convert uniprocessor code • Multiprogramming is hard

  3. Base Design • 4 MIPS cores (250 MHz) • Each core: • L1 data cache • L1 instruction cache • All cores share a single L2 cache • Virtual buses (pipelined with repeaters) • Read bus (256 bits) • Acts as a general-purpose system bus for moving data between the CPUs, L2, and external memory • Wide enough to handle an entire cache line (an explicit gain for a CMP; a multiprocessor system would require too many pins) • Write bus (64 bits) • Carries writes directly from the 4 CPUs to L2 • Pipelined to allow single-cycle occupancy (not a bottleneck) • Uses simple invalidation for cache coherence (writes broadcast on the bus invalidate all other L1 copies) • L2 cache • Point of communication between CPUs (10-20 cycles) • These buses are sufficient for 4-8 MIPS cores; more cores would need larger system buses
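As a rough illustration of the base design's numbers, the sketch below (plain C written for this transcript, not taken from the paper) encodes the bus widths from the slide and checks that the 256-bit read bus carries an entire cache line in one transfer; the 32-byte line size is an assumption made for the sketch, not something stated on the slide.

```c
/* Illustrative constants only: bus widths are from the slide, but the
 * 32-byte cache line size is an assumption made for this sketch. */
#include <assert.h>
#include <stdio.h>

enum {
    NUM_CPUS         = 4,    /* four MIPS cores at 250 MHz                 */
    READ_BUS_BITS    = 256,  /* moves data between CPUs, L2, and memory    */
    WRITE_BUS_BITS   = 64,   /* carries CPU writes through to L2           */
    CACHE_LINE_BYTES = 32    /* assumed line size, not stated on the slide */
};

int main(void)
{
    /* The read bus is exactly one cache line wide, so a line refill
     * occupies the bus for only a single transfer. */
    assert(READ_BUS_BITS / 8 == CACHE_LINE_BYTES);
    printf("%d CPUs, %d-bit read bus, %d-bit write bus\n",
           NUM_CPUS, READ_BUS_BITS, WRITE_BUS_BITS);
    return 0;
}
```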

  4. Base Design

  5. Parallel Software Performance

  6. Thread Speculation • Takes the sequence of instructions in a normal program and arbitrarily breaks it into a sequenced group of threads • Hardware must track all interthread dependencies to ensure the program behaves the same way • Code that follows a data violation of a true dependency must be re-executed • Advantages: • Does not require synchronization (unlike enforcing dependencies on multiprocessor systems) • Dynamic (done at run time), so the programmer only needs to think about it for maximum performance • Conventional parallelizing compilers miss a lot of TLP because synchronization points must be inserted wherever dependencies can happen, not just where they do happen • Five issues to address (an example loop follows the list):

  7. Thread Speculation 1. Forward data between parallel threads 2. Detect when reads occur too early (RAW hazards) 3. Safely discard speculative state after violations

  8. Thread Speculation 4. Retire speculative writes in the correct order (WAW hazards) 5. Provide memory renaming (WAR hazards)
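To make the idea concrete, here is a small, purely sequential C example of the kind of loop thread speculation targets; the program, its names, and its sizes are invented for illustration and are not from the paper. The comments indicate how a TLS system like Hydra would split the iterations across the four CPUs and why a static parallelizing compiler would have to synchronize conservatively.

```c
#include <stdio.h>

#define N_ITEMS   1000
#define N_BUCKETS 256

int main(void)
{
    static unsigned data[N_ITEMS];
    static int hist[N_BUCKETS];

    for (int i = 0; i < N_ITEMS; i++)        /* fabricate some input */
        data[i] = (unsigned)i * 2654435761u;

    /* Candidate region for speculation: iterations i, i+1, i+2, i+3
     * would run as speculative threads on the four CPUs.  hist[bucket]
     * is a true cross-iteration dependency only when two in-flight
     * iterations hit the same bucket, which is rare here but which a
     * compiler cannot rule out statically, so it would synchronize
     * every iteration.  TLS instead detects an actual conflict at run
     * time and re-executes only the violating later iteration. */
    for (int i = 0; i < N_ITEMS; i++) {
        unsigned bucket = data[i] % N_BUCKETS;
        hist[bucket] += 1;
    }

    printf("first bucket count: %d\n", hist[0]);
    return 0;
}
```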

  9. Hydra Speculation Implementation • How Hydra handles the five issues: • Forward data between parallel threads: • When a thread writes to the bus, newer threads that need the data have their cached copies of that line invalidated • On an L1 miss, the L2 is accessed; data from the write buffers of the current or older threads replaces the data returned from L2, byte by byte • Detect when a read occurs too early: • Primary-cache bits are set to mark possible violations; if a write to that address arrives from an earlier thread, a violation is detected and the thread is restarted • Safely discard speculative state after a violation: • The permanent state is kept in L2; any L1 lines holding speculative data are invalidated, and the thread's L2 buffer is discarded (the permanent state is not affected)

  10. Hydra Speculation Implementation • Place speculative writes in memory in the correct order: • Separate speculative-data L2 buffers are kept for each thread • They must be drained into L2 in the original program sequence • The thread-sequencing system also sequences the buffer draining • Memory renaming: • Each CPU can only read data written by itself or by earlier threads • Writes from later threads don't cause immediate invalidations (those writes should not be visible yet) • Ignored invalidations are recorded with a pre-invalidate bit • If a thread accesses the L2, it must only see data from itself or from earlier threads' L2 buffers • When the current thread completes, all currently pre-invalidated lines are checked against later threads for violations
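The sketch below is a minimal software model of the per-line speculation state the last two slides describe. The struct fields and function names are invented for illustration; in Hydra this state lives in L1 tag bits and bus-snooping hardware, not in software.

```c
#include <stdbool.h>
#include <stdio.h>

struct spec_line {
    unsigned tag;
    bool valid;
    bool modified;       /* holds speculative data written by this CPU    */
    bool spec_read;      /* read speculatively; watch for earlier writes  */
    bool pre_invalidate; /* a later thread wrote it; invalidate at commit */
};

/* A write from an *earlier* thread appears on the write bus.  If this
 * (later) thread already read the line speculatively, that read was too
 * early: a RAW violation, so the thread must be squashed and restarted. */
bool earlier_write_snooped(struct spec_line *l, unsigned tag)
{
    if (!l->valid || l->tag != tag)
        return false;             /* not cached here, nothing to do   */
    if (l->spec_read)
        return true;              /* violation: restart this thread   */
    l->valid = false;             /* plain invalidation, no violation */
    return false;
}

/* A write from a *later* thread must not invalidate the line yet (its
 * data should not be visible to us); just record it for commit time. */
void later_write_snooped(struct spec_line *l, unsigned tag)
{
    if (l->valid && l->tag == tag)
        l->pre_invalidate = true;
}

int main(void)
{
    struct spec_line l = { .tag = 0x40, .valid = true, .spec_read = true };
    if (earlier_write_snooped(&l, 0x40))
        puts("RAW violation detected: squash and restart the thread");
    later_write_snooped(&l, 0x40);   /* recorded, applied only at commit */
    return 0;
}
```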

  11. Hydra Speculation Implementation

  12. Hydra Speculation Implementation

  13. Speculation Performance

  14. Prototype • MIPS-based RC32364 core • SRAM macro cells • 8-Kbyte L1 data and instruction caches • 128-Kbyte L2 • Die is 90 mm^2 in a 0.25-micron process • A Verilog model exists; moving to physical design using synthesis • Central arbitration for the buses will be the most difficult part: it is hard to pipeline, must accept many requests, and must reply with grant signals
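Since the slide singles out central bus arbitration as the hardest piece, here is a toy round-robin request/grant arbiter, modeled in C rather than Verilog purely to show the behaviour being described; it is an illustrative sketch, not the prototype's actual arbitration logic.

```c
#include <stdio.h>

#define N_REQ 4   /* one requester per CPU in this sketch */

/* req: bit i is set if requester i wants the bus this cycle.
 * last: index of the most recently granted requester.
 * Returns the granted index, or -1 if nobody is requesting. */
int arbitrate(unsigned req, int last)
{
    for (int off = 1; off <= N_REQ; off++) {
        int i = (last + off) % N_REQ;    /* rotate priority for fairness */
        if (req & (1u << i))
            return i;
    }
    return -1;
}

int main(void)
{
    int last = N_REQ - 1;
    unsigned pattern[] = { 0xF, 0xF, 0x5, 0x0, 0x2 };  /* sample cycles */

    for (int c = 0; c < 5; c++) {
        int g = arbitrate(pattern[c], last);
        printf("cycle %d: req=0x%X grant=%d\n", c, pattern[c], g);
        if (g >= 0)
            last = g;
    }
    return 0;
}
```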

  15. Prototype

  16. Prototype

  17. Conclusion • The Hydra CMP is a high-performance, cost-effective alternative to large single-processor chips • In a similar die area it can achieve performance similar to a uniprocessor on integer programs by using thread speculation • Multiprogrammed or highly parallel workloads can do better than on a single processor • Hardware thread speculation is not cost-intensive and can give large performance gains

  18. Questions
