A Heterogeneous Lightweight Multithreaded Architecture

Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block
University of Notre Dame
MTAAP 2007, CA

Outline
• Heterogeneous Lightweight Multithreaded Architecture
• Simulation environments, benchmarks and results
• Conclusions and future work

Architecture Highlights
• Processing-In-Memory (PIM) based
  - Effectively attacks the memory wall problem
• Highly multithreaded
  - Hides large latencies and contention
• Heterogeneous, supports Extended Memory Semantics (EMS)
  - Extremely low overhead on context switch and synchronization

Multithreaded Processors
• Multithreading reduces processor idle time
• Thread context is part of the processor
Multithreading machines:
  1960s  CDC 6600
  1970s  I/O Processor for the Space Shuttle
  1980s  Denelcor HEP
  1990s  Cray/Tera MTA
  2000+  Cray Eldorado, Intel Xeon, Sun Niagara
[Figure: single-threaded vs. multithreaded execution]

Lightweight Threads
• Thread context (frame) is 32 double words (256 bytes)
• Two double words are reserved for the thread status; the other 30 are general-purpose registers
• No other per-thread state, which makes multithreading easy
• Frames are stored in memory (no register file)
• Registers are aliases for memory locations
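The frame arithmetic above can be checked with a small sketch. The slide gives only the sizes; placing the two status double words at the start of the frame, and the helper name `register_address`, are assumptions made for illustration.

```python
DWORD_BYTES = 8       # one double word is 64 bits of data
FRAME_DWORDS = 32     # frame size stated on the slide
STATUS_DWORDS = 2     # double words reserved for thread status

FRAME_BYTES = FRAME_DWORDS * DWORD_BYTES        # 32 * 8 = 256 bytes
GP_REGISTERS = FRAME_DWORDS - STATUS_DWORDS     # 30 general-purpose registers


def register_address(frame_base: int, reg: int) -> int:
    """Memory address aliased by register `reg` of the frame at `frame_base`.

    Assumption (not from the slides): the status double words come first,
    followed by the general-purpose registers in order.
    """
    if not 0 <= reg < GP_REGISTERS:
        raise ValueError(f"register index must be in 0..{GP_REGISTERS - 1}")
    return frame_base + (STATUS_DWORDS + reg) * DWORD_BYTES
```

Because a register is just an address inside the frame, switching threads under this model amounts to switching the frame base; there is no register file to save or restore.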

Lightweight Multithreading
• Thread creation is fast and inexpensive: a single instruction
  - Contrast with pthread creation: kernel intervention and as many as tens of thousands of instructions
• Unbounded multithreading
  - Threads are part of the memory system rather than the processor state
  - “Unlimited number” of threads per processor
  - Many opportunities for issuing an instruction
• Ultra-lightweight processing
  - Unbounded multithreading requires low-overhead thread management and synchronization
  - At the memory bank: greater data bandwidth, low overhead

Heterogeneous Architecture
• Issues an instruction from a ready thread on each clock cycle
• Architectural support for low-overhead thread management
[Figure: heterogeneous architecture, Lightweight Processor Chip (LPC)]

Extended Memory Semantics (EMS)
• Memory subsystem is constructed of 65-bit dwords
  - 64 bits of data
  - 1 extension bit (1: dword is full, 0: dword is empty)
• Extends Cray MTA E/F bits
  - Full/Empty: contains data or not
  - Extra states: metadata can contain a frame pointer
• Same semantics apply to thread registers
[Figure: dword layout, 64 bits of data/metadata plus extension bit]

Single Producer/Consumer on EMS
• LWP behavior for load_fe with A empty:
  - Location A changes state to “FVE: forward value, leave empty”
  - Content of A is the target address of the forward operation (all registers also have a memory address)

Completing the Load
• How does the LWP complete the load_fe?
  - store_ef arrives at A
  - Data associated with the store is returned to T2:R2; this completes the load_fe
  - Location A changes to the empty state
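The load_fe/store_ef handshake described on the last two slides can be modeled as a small state machine. This is a toy sketch, not the hardware design: the class `EMSMemory`, its method signatures, the behavior of load_fe on a full word, the behavior of store_ef on an empty word, and marking the target register full on delivery are all assumptions inferred from the operation names.

```python
from enum import Enum


class State(Enum):
    EMPTY = "empty"
    FULL = "full"
    FVE = "forward value, leave empty"


class EMSMemory:
    """Toy model of EMS words; addresses are plain integers."""

    def __init__(self):
        self.state = {}   # address -> State (a missing entry means EMPTY)
        self.value = {}   # address -> data, or the forward target when FVE

    def state_of(self, addr):
        return self.state.get(addr, State.EMPTY)

    def load_fe(self, addr, target):
        """Request the value at addr for the register aliased at target."""
        current = self.state_of(addr)
        if current is State.FULL:
            # Assumed: data already present, deliver it and leave addr empty.
            self._deliver(target, self.value.pop(addr))
            self.state[addr] = State.EMPTY
        elif current is State.EMPTY:
            # From the slides: remember where to forward the value.
            self.state[addr] = State.FVE
            self.value[addr] = target
        else:
            raise NotImplementedError("second consumer: needs an EMS handler")

    def store_ef(self, addr, data):
        """Store data at addr, forwarding it if a load_fe is waiting."""
        current = self.state_of(addr)
        if current is State.FVE:
            # From the slides: forward to the waiting register, leave empty.
            target = self.value.pop(addr)
            self.state[addr] = State.EMPTY
            self._deliver(target, data)
        elif current is State.EMPTY:
            # Assumed: no consumer waiting, so the word becomes full.
            self.value[addr] = data
            self.state[addr] = State.FULL
        else:
            raise NotImplementedError("second producer: needs an EMS handler")

    def _deliver(self, target, data):
        # Registers are memory locations with the same semantics.
        self.value[target] = data
        self.state[target] = State.FULL
```

With `A = 0x100` and a hypothetical register address `T2_R2 = 0x2010`, calling `load_fe(A, T2_R2)` leaves A in the FVE state holding the target address, and a later `store_ef(A, 42)` places 42 at `T2_R2` and returns A to empty. The two `NotImplementedError` branches mark where the single producer/consumer case ends.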

A More Complex Situation
• Consider a multiple producer/consumer problem such as locks
• Multiple threads (more than 3) all attempt to acquire the lock
• Memory requests will be queued up at the target location
• An EMS handler thread is needed to handle the bookkeeping

EMS Handler Overhead
• Invoking an EMS handler
  - Synchronized memory operations beyond the hardware-supported single producer/consumer scenario
• Overhead
  - Creating the handler threads
  - To queue up memory requests, handlers need to spin on the target memory address to get exclusive access
  - Significant overhead on LWP CPU time, NoC traffic and memory bandwidth
• How to alleviate the overhead?

Ultra-Lightweight Processor
• Alleviates the burden on the LWP
  - For thread synchronization and management, and complex atomic memory operations
• Simple design, minimal circuitry
• At the memory bank: greatest data bandwidth (wide-word), no NoC traffic when accessing memory
• Multithreaded

Large-scale system

Outline
• Heterogeneous Lightweight Multithreaded Architecture
• Simulation environments, benchmarks and results
• Conclusions and future work

Simulation Environment
• DimC (Diminished C)
  - An extension of ANSI C
  - Exposes low-level architectural features
  - Supports lightweight multithreading
• SALT (Simulator for the Analysis of LWP Timings)
  - Contains LWPs, ULWPs, NoC and memory subsystems

Benchmark Suite
• Two categories of irregular problems:
• Complicated control structures such as recursion
  - Such programs can achieve decent performance on conventional architectures, but only with great effort
  - Do not necessarily invoke the EMS handler or ULWP
  - N-Queens, Fibonacci
• Complicated control structures and dynamic data structures
  - Very hard to parallelize effectively on conventional SMPs
  - EMS handler or ULWP support is necessary
  - Competing agents, SAT solver kernel

N-Queens
• Find all solutions to the problem of placing N queens on an N×N chessboard such that no queen can attack another
• Irregular problem with dynamic parallel recursion
• Thread behavior is hard to predict
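The recursion in N-Queens can be sketched as a plain backtracking counter. This is a sequential illustration only: the function name `count_queens` is hypothetical, and the slides do not show the DimC benchmark code. On the architecture described here each recursive call would be a natural candidate for a new lightweight thread, which is where the dynamic, hard-to-predict parallelism comes from.

```python
def count_queens(n, row=0, cols=frozenset(), diag1=frozenset(), diag2=frozenset()):
    """Count placements of n non-attacking queens, one row at a time."""
    if row == n:
        return 1  # every row holds a queen: one complete solution
    total = 0
    for col in range(n):
        # Skip squares attacked along a column or either diagonal.
        if col in cols or (row - col) in diag1 or (row + col) in diag2:
            continue
        # Each branch is independent of its siblings, so it could run
        # as its own thread; here the branches run one after another.
        total += count_queens(
            n,
            row + 1,
            cols | {col},
            diag1 | {row - col},
            diag2 | {row + col},
        )
    return total
```

The number of live branches depends on how early each partial placement fails, so the amount of work per branch varies widely.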

Competing Agents
• Multiple agents attempt to update a shared memory location simultaneously
• Each agent is implemented by a single thread; all threads are evenly distributed over four LWPs inside a single LPC
• Complicated control structures and dynamic data structures
• Uses separate synchronized loads/stores
• Characterizes the effectiveness of the ULWP in reducing the cost of synchronization
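The update pattern in this benchmark can be imitated on a conventional machine. The sketch below is an analogy, not the benchmark itself: a one-slot `queue.Queue` stands in for a full/empty memory word, with `get()` playing the role of a synchronized load that leaves the word empty and `put()` the role of a synchronized store that leaves it full. The function name `run_agents` and the counter workload are assumptions.

```python
import queue
import threading


def run_agents(num_agents, updates_per_agent):
    """Have several threads increment one shared cell; return its final value."""
    cell = queue.Queue(maxsize=1)  # one slot: either empty or full
    cell.put(0)                    # the shared location starts full, holding 0

    def agent():
        for _ in range(updates_per_agent):
            value = cell.get()     # wait until full, take the value, leave empty
            cell.put(value + 1)    # wait until empty, store, leave full

    threads = [threading.Thread(target=agent) for _ in range(num_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.get()
```

Only the thread that emptied the cell can refill it, so no update is lost: `run_agents(8, 100)` returns 800. Every other agent blocks while the cell is empty, which is the contention the benchmark is designed to stress.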

SAT Solver/zChaff
• SAT: the Boolean satisfiability problem (from propositional logic)
• Fundamental to many problems in automated reasoning, CAD, CAM, machine vision, databases, robotics, IC design, computer architecture, and network design
• Given a Boolean formula (usually in CNF), check whether an assignment of truth values to the variables exists such that the formula evaluates to true
• Example: for the three-clause CNF formula on the slide [formula not preserved in the transcript], if x1 is true and x3 is false, then all three clauses are satisfied regardless of the value of x2
• zChaff, a modern variant of the DPLL algorithm, is used to implement the SAT solver
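The satisfiability check can be made concrete with a small sketch. The formula below is hypothetical, chosen only so that it behaves like the slide's example; it is not the formula from the original slide, which did not survive in the transcript. Clauses are written as lists of signed integers (3 means x3, -3 means NOT x3), and the function name `satisfied_under` is an assumption.

```python
def satisfied_under(clauses, partial):
    """True if every clause already has a literal made true by `partial`.

    `partial` maps variable numbers to booleans; unassigned variables are
    simply absent, so a True result holds whatever values they take later.
    """
    def literal_true(lit):
        var = abs(lit)
        return var in partial and partial[var] == (lit > 0)

    return all(any(literal_true(lit) for lit in clause) for clause in clauses)


# Hypothetical formula: (x1 OR x2) AND (NOT x3 OR x2) AND (x1 OR NOT x3)
EXAMPLE_CNF = [[1, 2], [-3, 2], [1, -3]]
```

Setting x1 to true and x3 to false, `satisfied_under(EXAMPLE_CNF, {1: True, 3: False})` returns `True` without x2 being assigned. A DPLL-style solver such as zChaff searches for an assignment like this by picking a variable, propagating the consequences, and backtracking on conflict.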

N-Queens
• Successfully exploits all the parallelism
• Completely dynamic, ideal speedup
• Saturation is due only to the small data set
• Good performance can be achieved on conventional SMPs, but only with great extra effort

Competing Agents
• The EMS handler is the bottleneck in high-contention situations
• The heterogeneous architecture can achieve unbounded scalability
• High contention is no longer a problem in the heterogeneous architecture

SAT Solver/zChaff on Conventional SMPs
• Parallel implementation leads to performance degradation
• The more processors, the worse the performance
• Very hard to achieve good performance on conventional SMPs
Data from “Parallel Multithreaded Satisfiability Solver: Design and Implementation” by Yulik Feldman et al. (Intel)

SAT Solver/zChaff on Heterogeneous Architecture
• Ideal speedup
• Saturation is due only to the small data set
• Successfully exploits all the parallelism
[Figure: speedup over the serial version]

Outline
• Heterogeneous Lightweight Multithreaded Architecture
• Simulation environments, benchmarks and results
• Conclusions and future work

Conclusions
• The Heterogeneous Lightweight Multithreaded Architecture
  - Is a good solution for irregular problems that are hard or impossible to parallelize on conventional SMPs
  - Has very low overhead on context switching and synchronization
  - Can hide latencies and contention
  - Can provide unbounded multithreading and scalability
  - Can exploit all available parallelism inside an irregular problem

Future Work
• Provide standard language support
• Benchmark suites
• Large-scale system performance
• Comparison with conventional large-scale systems

Acknowledgments
• DARPA
  - This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under its Contract No. NBCH3039003.
• University of Notre Dame
• Caltech/JPL
• Cray

Thank you!
