The ESW (Expandable Split Window) paradigm highlights advancements in instruction-level parallelism (ILP) with a focus on decentralized resources and minimized communication costs. The approach identifies the limitations of centralized resources and seeks to optimize execution through a hybrid of the dataflow and sequential models. It relies on speculative memory disambiguation and replaces one large dynamic window with many small ones to improve scalability and efficiency. Using a custom simulator based on the MIPS architecture, the framework shows promising results under both in-order and out-of-order execution, aiming for more effective use of registers and instruction handling.
The ESW Paradigm Manoj Franklin & Gurindar S. Sohi 05/10/2002
Observations • Theoretically, large exploitable ILP • Nearby instructions tend to be dependent; parallelism is found further downstream • Centralized resources are bad • Minimizing communication cost is important
What about others? • Dataflow model + most general − unconventional programming paradigm − communication cost can be high • Superscalar, VLIW (sequential) + temporal locality − large centralized HW − compiler too dumb − not scalable • ESW = dataflow + sequential
Design Goals • Decentralized resources • Minimize wasted execution • Speculative memory address disambiguation • Realizability: replace one large dynamic window with many small ones
How it works • Basic window • Single-entry, loop-free, call-free block • Equal to, a superset of, or a subset of a basic block • Execute basic windows in parallel • Multiple independent stages • Each complete with branch prediction, L1 cache, register file, etc.
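The window-forming rules above can be sketched in code. This is a hedged illustration, not the paper's exact mechanism: the instruction encoding, the backward-branch test for loop freedom, and the 32-instruction cap (taken from the simulator slide later in the deck) are all assumptions for demonstration.

```python
# Hypothetical sketch: split an instruction stream into basic windows,
# each a single-entry, loop-free, call-free block with a size cap.

MAX_WINDOW = 32  # the paper's simulator caps windows at 32 instructions

def split_into_windows(instructions):
    """instructions: list of (opcode, target) tuples; 'target' is the
    branch-target index for control instructions, else None."""
    windows, current = [], []
    for i, (op, target) in enumerate(instructions):
        current.append(i)
        ends_window = (
            op in ("call", "ret")                       # call-free
            or (op == "branch" and target is not None
                and target <= i)                        # backward branch => loop
            or len(current) == MAX_WINDOW               # size cap
        )
        if ends_window:
            windows.append(current)
            current = []
    if current:
        windows.append(current)
    return windows

prog = [("add", None), ("load", None), ("branch", 0),   # backward branch closes window
        ("mul", None), ("call", None),                  # call closes window
        ("store", None)]
print(split_into_windows(prog))  # [[0, 1, 2], [3, 4], [5]]
```

Each resulting window can then be dispatched to its own stage for parallel execution.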
Dist Inst Supply Optimization: Snooping on L2-L1 Cache traffic
Dist Inter-Inst Comm • Architecture: • distributed future file • create/use masks for dependency checking • Observation: • register use is mostly within the basic block • the rest is in subsequent blocks
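The create/use masks can be modeled as register bit-vectors: a window advertises which registers it will write (create mask) and which it reads from outside itself (use mask); a later window depends on an earlier one when its use mask overlaps the earlier window's create mask. A minimal sketch, assuming an illustrative instruction encoding and helper name:

```python
# Hedged model of create/use masks for inter-window register dependence.

def make_masks(window):
    """window: list of (dest_reg, src_regs) per instruction.
    Returns (create_mask, use_mask) as integers with one bit per register."""
    create, use = 0, 0
    for dest, srcs in window:
        for s in srcs:
            if not (create >> s) & 1:   # only reads not satisfied locally
                use |= 1 << s
        if dest is not None:
            create |= 1 << dest
    return create, use

# window 0 writes r1 and r2; window 1 reads r2 (produced upstream) and r5
w0 = [(1, [3]), (2, [1])]
w1 = [(4, [2, 5])]
c0, u0 = make_masks(w0)
c1, u1 = make_masks(w1)
print(bool(c0 & u1))  # True: window 1 must wait for r2 from window 0
```

A single AND of two small bit-vectors suffices for the check, which is why this works for registers but not for the full memory address space (next slide).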
Dist DMem System • Problem: • Address space is too large to build create/use masks • Need to maintain consistency between multiple copies • Solution: ARB (Address Resolution Buffer)
ARB • Bits cleared upon commit • Restart stages when a dependency is violated • On a load, forward the value from the ARB if one already exists Q. What happens when the ARB is full?
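The ARB's behavior on these slides can be sketched as a small table keyed by address. This is an assumption-laden software model, not the paper's hardware: it tracks which stages loaded or stored each address, forwards stored values to later-stage loads, flags later-stage loads that raced ahead of an earlier-stage store, and clears entries on commit. The capacity check hints at one answer to the slide's question: a full ARB must stall until older stages commit.

```python
# Hedged software model of an Address Resolution Buffer.

class ARB:
    def __init__(self, entries=8):
        self.entries = entries
        self.table = {}  # addr -> {"loads": set, "value": ..., "store_stage": ...}

    def load(self, stage, addr):
        e = self.table.setdefault(addr, {"loads": set(), "value": None,
                                         "store_stage": None})
        e["loads"].add(stage)
        # forward from the ARB if an earlier stage already stored here
        if e["store_stage"] is not None and e["store_stage"] <= stage:
            return e["value"]
        return None  # otherwise go to memory

    def store(self, stage, addr, value):
        if len(self.table) >= self.entries and addr not in self.table:
            raise RuntimeError("ARB full: stall until the oldest stage commits")
        e = self.table.setdefault(addr, {"loads": set(), "value": None,
                                         "store_stage": None})
        e["value"], e["store_stage"] = value, stage
        # any later stage that already loaded this address saw a stale value
        return {s for s in e["loads"] if s > stage}  # stages to restart

    def commit(self, addr):
        self.table.pop(addr, None)  # bits cleared upon commit

arb = ARB()
arb.load(stage=2, addr=0x100)                   # stage 2 loads early
print(arb.store(stage=1, addr=0x100, value=7))  # {2}: stage 2 must restart
```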
Simulation Environment • Custom simulator using the MIPS R2000 pipeline • Up to 2 instructions fetched/decoded/issued per IE • Up to 32 instructions per basic window • 4K-word L1 cache, 64KB L2 DM cache (100% hit rate, what??) • 3-bit counter branch prediction
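The 3-bit counter predictor mentioned above works like the familiar 2-bit saturating counter with an extra bit: count 0..7, predict taken when the counter is in the upper half. A minimal sketch; the table size, PC indexing, and initial counter value are assumptions for illustration.

```python
# Hedged sketch of a 3-bit saturating-counter branch predictor.

class Predictor3Bit:
    def __init__(self, entries=1024):
        self.counters = [4] * entries  # start weakly taken (assumed init)
        self.mask = entries - 1        # entries must be a power of two

    def predict(self, pc):
        return self.counters[pc & self.mask] >= 4   # upper half => taken

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.counters[i] = min(7, self.counters[i] + 1)  # saturate high
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate low

bp = Predictor3Bit()
pc = 0x40
for _ in range(3):
    bp.update(pc, taken=False)   # three not-taken outcomes: 4 -> 1
print(bp.predict(pc))  # False
```

The extra bit makes the predictor slower to flip direction than a 2-bit counter, which helps on loop-heavy codes at the cost of slower adaptation.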
Results • Optimizations: • moving instructions up • expanding the basic window (in eqntott and espresso) • Basic window <= basic block But is a 100% cache hit rate reasonable?
Discussion • Compare this to CMP? RAW? • Does the trade-off strike a balance?
New Results (1) In order execution
New Results (2) Out of order execution