Simple Processor Architectures Modeling Using YAPI
Sam Williams
EE249 Project
December 3, 2002
Mentors: Trevor Meyerowitz & Kees Vissers
Introduction
• YAPI is a Kahn process network library from Philips
• Ported to the Metropolis meta model and extended
• Java-like interface
• Allows for determinism, flexibility, and concurrent processing
• This project uses the library to model various processor architectures
Mapping YAPI (KPN) to Processor Architectures

YAPI Process
• Parameterizable
• Thread-like behavior
• Sequential coding
• Unidirectional ports
• Read/write ordering defined by the user
• Internal state
• Skeleton: // constructor ... // execute while(true){ ... }

Pipeline Stage
• Combinational logic and arrays can be represented with process code and arrays
• Pipeline registers are just one-element-deep FIFOs
• Fanout is accomplished via multiple copies of the FIFO
• How can we guarantee bounded length (typical for processors) in an unbounded model?
[Figure: a pipeline stage, with inputs from the previous stage feeding combinational logic (C/L) and the RF through pipeline registers, and outputs to the next stage]

YAPI Channel
• Unbounded FIFO
• Elements of type Object
• Point-to-point mapping (no broadcast)
• Nonblocking writes
• Blocking reads

YAPI Netlist (Pipelined Processor)
• Create processes
• Create channels
• Map processes through channels
• Can be algorithmic
[Figure: a pipelined processor with F, D, E, M, W stages connected by channels]
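The process/channel/netlist pieces above can be emulated in plain Java for illustration: a LinkedBlockingQueue stands in for an unbounded YAPI FIFO (blocking reads, effectively nonblocking writes), and a Runnable with a while(true) body stands in for a process. This is only a sketch; the class and method names are hypothetical and are not the actual YAPI/Metropolis API.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Emulation of a YAPI-style channel/process pair using only the Java
// standard library: unbounded queue = KPN channel, thread = process.
public class StageDemo {

    // A pipeline stage as a sequential process with unidirectional ports.
    static class AddOneStage implements Runnable {
        private final BlockingQueue<Integer> in;   // port from previous stage
        private final BlockingQueue<Integer> out;  // port to next stage

        AddOneStage(BlockingQueue<Integer> in, BlockingQueue<Integer> out) {
            this.in = in;     // "constructor": bind the ports
            this.out = out;
        }

        @Override
        public void run() {   // "execute": while(true) { read, compute, write }
            try {
                while (true) {
                    int operand = in.take();     // blocking read
                    out.put(operand + 1);        // effectively nonblocking write (unbounded FIFO)
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // A tiny "netlist": create channels, create a process, connect them.
    public static void main(String[] args) throws Exception {
        BlockingQueue<Integer> c0 = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> c1 = new LinkedBlockingQueue<>();
        Thread stage = new Thread(new AddOneStage(c0, c1));
        stage.setDaemon(true);   // let the JVM exit when main finishes
        stage.start();
        c0.put(41);
        System.out.println(c1.take()); // prints 42
    }
}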
Instruction Set Architecture (used in all examples)

• RISC load/store architecture
• 32b instruction word
• 256 x 32b general-purpose register file
• Physical memory model (Harvard-style split cache)
• Scheduling (dynamic/static) is dependent on the microarchitecture
• Examples were either scalar or superscalar (VLIW falls between those extremes)
• Branch delay slot and prediction are dependent on the microarchitecture
• No kernel mode support

Core Instructions
0x00 - nop
0x01 - bz rt,ofs16
0x02 - bnz rt,ofs16
0x0a - msync
0x0b - msyncr // wait for all memory reads to finish
0x0c - msyncw // wait for all memory writes to finish
0x0d - poison
0x0e - flush
0x0f - halt

Memory Instructions
0x10 - lw rd,ldOfs8(rs)
0x11 - sw rt,stOfs8(rs)
0x12 - ldw rd,ldOfs8(rs) // loads 2 words
0x13 - sdw rt,stOfs8(rs) // stores 2 words
0x14 - lqw rd,ldOfs8(rs) // loads 4 words
0x15 - sqw rt,stOfs8(rs) // stores 4 words

Scalar Arithmetic
0x20 - li rd,imm16
0x21 - add rd,rs,rt
0x22 - addi rd,rs,imm8
0x23 - sub rd,rs,rt
0x24 - subi rd,rs,imm8
0x25 - sr rd,rs,rt
0x26 - sri rd,rs,imm8
0x27 - sl rd,rs,rt
0x28 - sli rd,rs,imm8
0x29 - mul rd,rs,rt
0x2a - madd rd,rs,rt (rd += rs*rt)

Packed Double Word Arithmetic (take the upper 7 bits of the register address; operate on reg,reg+1 from the RF)
0x30 - dwli vrd,imm16
0x31 - dwadd vrd,vrs,vrt
0x32 - dwaddi vrd,vrs,imm8
0x33 - dwsub vrd,vrs,vrt
0x34 - dwsubi vrd,vrs,imm8
0x35 - dwsr vrd,vrs,vrt
0x36 - dwsri vrd,vrs,imm8
0x37 - dwsl vrd,vrs,vrt
0x38 - dwsli vrd,vrs,imm8
0x39 - dwmul vrd,vrs,vrt
0x3a - dwmadd vrd,vrs,vrt (vrd += vrs*vrt)

Packed Quad Word Arithmetic (take the upper 6 bits of the register address; operate on reg,reg+1,reg+2,reg+3 from the RF)
0x40 - qwli vrd,imm16
0x41 - qwadd vrd,vrs,vrt
0x42 - qwaddi vrd,vrs,imm8
0x43 - qwsub vrd,vrs,vrt
0x44 - qwsubi vrd,vrs,imm8
0x45 - qwsr vrd,vrs,vrt
0x46 - qwsri vrd,vrs,imm8
0x47 - qwsl vrd,vrs,vrt
0x48 - qwsli vrd,vrs,imm8
0x49 - qwmul vrd,vrs,vrt
0x4a - qwmadd vrd,vrs,vrt (vrd += vrs*vrt)
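The slide fixes the opcode values, the 32b instruction word, and the 256-entry register file (so 8-bit register specifiers), but not the exact field layout. The sketch below assumes a simple opcode/rd/rs/rt packing (with imm16 in the low half for li and branches) purely for illustration; the layout is not taken from the project.

// Field extraction for the 32b instruction word, under an assumed layout:
// bits 31..24 opcode, 23..16 rd, 15..8 rs, 7..0 rt (or 15..0 imm16).
public final class Decode {
    public static final int OP_LI  = 0x20;
    public static final int OP_ADD = 0x21;

    public static int opcode(int iw) { return (iw >>> 24) & 0xFF; }
    public static int rd(int iw)     { return (iw >>> 16) & 0xFF; }
    public static int rs(int iw)     { return (iw >>> 8)  & 0xFF; }
    public static int rt(int iw)     { return iw & 0xFF; }
    public static int imm16(int iw)  { return (short) (iw & 0xFFFF); } // sign-extended

    public static void main(String[] args) {
        // add r1,r2,r3 under the assumed encoding
        int iw = (OP_ADD << 24) | (1 << 16) | (2 << 8) | 3;
        System.out.printf("op=%#x rd=%d rs=%d rt=%d%n",
                opcode(iw), rd(iw), rs(iw), rt(iw));
    }
}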
In-order Statically-Scheduled Scalar Processor (DLX-like)

• Learning example
• Each channel contains a single 32b integer
• The instruction word is passed down the pipeline in addition to operands and intermediate results
• Processes sleep a number of cycles equal to their depth (no prefilling of channels) and don't read until data is guaranteed to be present
• Feedback channels stabilize the system, ensuring finite FIFO length
• Reads and writes were placed ad hoc inside each process (this resulted in several unexpected deadlock cases requiring fixes)
• Future examples use a coding style of {write, read, compute, iterate} to avoid deadlock
• This example was more headache than benefit vs. RTL

[Figure: the 5-stage architecture (Fetch, Decode, Execute, Memory, Writeback) with IC, RF, and DC, and the corresponding 4-process YAPI netlist]
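A minimal sketch of the {write, read, compute, iterate} discipline, reusing the hypothetical BlockingQueue channels from the earlier sketch: each iteration writes last cycle's result before blocking on the next read, so every feedback loop in the netlist always holds a token. The stage split and the placeholder operation are assumptions.

import java.util.concurrent.BlockingQueue;

// One pipeline-stage process coded as {write, read, compute, iterate}.
class ExecuteStage implements Runnable {
    private final BlockingQueue<Integer> fromDecode;
    private final BlockingQueue<Integer> toMemory;
    private int result = 0;                  // starts out as a bubble/NOP result

    ExecuteStage(BlockingQueue<Integer> fromDecode, BlockingQueue<Integer> toMemory) {
        this.fromDecode = fromDecode;
        this.toMemory = toMemory;
    }

    @Override
    public void run() {
        try {
            while (true) {
                toMemory.put(result);            // 1. write last iteration's result
                int operand = fromDecode.take(); // 2. read this iteration's input
                result = operand * 2;            // 3. compute (placeholder ALU op)
            }                                    // 4. iterate
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}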
Simplify, Abstract, and Add Features

• Single process (FRxW)
• Read operands and pass them, along with the instruction word, down the pipeline (multiple channels)
• Execute upon receipt
• Write to the register file

• Add hazard detection and bubble insertion (stalls)
• Parameterize the pipeline depth
• Prefill the pipeline to the desired depth

• Add a branch predictor
• Pass the prediction and PC down the pipeline (new channels)
• Resolve the branch when it commits
• If mispredicted, nullify pipeline-depth instructions and reset the PC

[Figure: the single FRxW process with IC, RF, DC, hazard detection, and a branch predictor]
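Prefilling can be expressed directly in the netlist: seed each inter-stage or feedback channel with as many bubble (nop) tokens as the parameterized pipeline depth before the processes start. The sketch below reuses the hypothetical queue-based channels from the earlier sketches; the NOP encoding just follows the ISA slide's opcode 0x00 placed in an assumed opcode field.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Seeding a channel with bubbles so no process has to sleep for its own depth.
final class PipelinePrefill {
    static final int NOP = 0x00 << 24;   // nop instruction word (opcode 0x00, assumed layout)

    static BlockingQueue<Integer> prefilled(int depth) {
        BlockingQueue<Integer> channel = new LinkedBlockingQueue<>();
        for (int i = 0; i < depth; i++) {
            channel.add(NOP);            // one bubble per pipeline register
        }
        return channel;
    }

    public static void main(String[] args) {
        int pipelineDepth = 5;           // parameterized depth
        BlockingQueue<Integer> feedback = prefilled(pipelineDepth);
        System.out.println("prefilled tokens: " + feedback.size()); // 5
    }
}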
Out-of-Order Architectures

• An N-way superscalar processor was realized by changing the depth of the instruction channel from Fetch (it became an instruction buffer)
• The depth of a channel can be viewed as width if multiple reads are performed per cycle
• Feedback control included the number of instructions issued
• The number of functional units is independent of the fetch width
• An arbitrator could have been added to minimize the number of point-to-point channels or common data busses

• Tomasulo-style register renaming
• Use virtual register allocation
• Arbitrary number of integer units
• Broadcast nature handled with multiple copies of each channel
• Use arrays of ports, and algorithms in the netlist, to connect them
• Use the Object nature of channels to encapsulate an in-flight instruction into an object – ReservationStationEntry
• Functional units are highly parameterizable (entries, depth, etc.)
• Stations stabilize the system by replying with the number of free entries and a memory of the instructions issued on the last iteration
• ReadWriteIssue must also maintain knowledge of which instructions can be issued to the heterogeneous functional units
• The reorder buffer wasn't implemented, in order to move on to more complex architectures
• Stations are almost identical, so a Station class was created, from which integer units or a memory unit could be extended

[Figure: Fetch, Rename, ReadWriteIssue (RWI), the RF, and the IC feeding reservation stations (RS) for integer units IU0 through IUn and memory unit MU0, with an arbiter (Arb) and the DC]
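The slide names a ReservationStationEntry object and a shared Station class specialized into integer and memory units; a possible Java shape is sketched below. The field names, the freeEntries() feedback, and the execute() hook are assumptions for illustration, not the project's actual classes.

// In-flight instruction object plus a parameterizable Station base class.
class ReservationStationEntry {
    int opcode;
    int destTag;            // renamed (virtual) destination register
    int[] operandTags;      // tags still being waited on (-1 once the value is ready)
    int[] operandValues;

    boolean ready() {
        for (int tag : operandTags) {
            if (tag != -1) return false;
        }
        return true;
    }
}

abstract class Station {
    protected final java.util.Deque<ReservationStationEntry> entries =
            new java.util.ArrayDeque<>();
    protected final int capacity;

    Station(int capacity) { this.capacity = capacity; }      // parameterizable size

    int freeEntries() { return capacity - entries.size(); }  // reported upstream as feedback

    void insert(ReservationStationEntry e) {
        if (entries.size() < capacity) entries.add(e);
    }

    abstract int execute(ReservationStationEntry e);         // specialized per unit
}

class IntegerUnit extends Station {
    IntegerUnit(int capacity) { super(capacity); }

    @Override
    int execute(ReservationStationEntry e) {
        return e.operandValues[0] + e.operandValues[1];       // placeholder integer op
    }
}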
Single-Process Superscalar Out-of-Order (unimplemented variants)

• One feedback channel per pipeline
• All reservation stations are internal
• Perhaps more complicated
• Simulations should be faster

• C++-based example
• Pipelines are modeled internally with queues
• All reservation stations are internal
• Even more complicated
• Simulations should be even faster

[Figure: each variant as a single SSOOO process containing the IC, Rename, internal RS's, DC, and RF]
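For the second variant, the internal pipelines can be modeled as ordinary queues advanced once per simulated cycle inside one process. The rough sketch below stays in Java for consistency with the other sketches (the slide mentions a C++-based example); the two-stage split and the integer payload are placeholders.

import java.util.ArrayDeque;
import java.util.Deque;

// Internal pipelines as plain queues shifted once per simulated cycle.
final class SingleProcessPipeline {
    private final Deque<Integer> executePipe = new ArrayDeque<>();
    private final Deque<Integer> writebackPipe = new ArrayDeque<>();

    // One simulated cycle: shift every internal queue, then accept new work.
    void cycle(Integer fetched) {
        Integer retiring = writebackPipe.poll();     // oldest instruction leaves
        Integer executed = executePipe.poll();
        if (executed != null) writebackPipe.add(executed);
        if (fetched != null) executePipe.add(fetched);
        if (retiring != null) System.out.println("retired " + retiring);
    }

    public static void main(String[] args) {
        SingleProcessPipeline p = new SingleProcessPipeline();
        for (int i = 0; i < 6; i++) {
            p.cycle(i < 3 ? Integer.valueOf(i) : null); // feed 3 instructions, then drain
        }
    }
}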
Ring Architectures

• The basic architecture has two rings (issue and commit) that spin in opposite directions
• Each processor can issue to the ring if there is space
• Each functional unit takes a token off the ring if it can execute that instruction, and commits results to the ring if there is space
• The ReservationStationEntry object is further abstracted to an arbitrary number of operands and an arbitrary number of results
• As issue and commit tokens counter-rotate through a node, matched renames initiate the transfer of values
• The rings never stall; the functional units can only stall themselves
• The best analogy is a "Lazy Susan" table (the center of the table rotates)
• Superscalar and vector processors can be built by using multiple tokens per ring (depth is width) and decomposing vector instructions into scalar instructions; an unimplemented alternative would be to nibble on tokens as they pass
• A multiprocessor system can be realized by adding another processor to the ring, more functional units, and tagging instructions with a CPUID
• An additional ring, which spins in the same direction as the issue ring, can be added to further minimize latencies. Any number of rings can be added (any order is valid)

Motivation
• Global signals should be avoided (they can easily become critical paths and limit frequency)
• Global knowledge should also be avoided (it can result in significant hardware)
• In some cases, additional latency can be tolerated or neutralized with higher frequency

• The interface is key: RingInterface is extended into the Station and Processor classes, and Stations are extended into Integer, Float, and Memory units

[Figure: two processors (P0, P1), each with an IC, RF, and Rename stage, on counter-rotating issue and commit rings serving IU0 through IU3, FP1, and MU0, each with its own RS, plus the DC]
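One way to picture a node on the issue ring: it takes the incoming token, claims it only if the unit can execute that class of opcode and has a free reservation-station entry, and otherwise forwards it unchanged, so the ring itself never stalls. The sketch below is speculative; the Token type, the opcode-class test (loosely reusing the ISA's opcode groups), and the queue-based ring links are all assumptions rather than the project's implementation.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// One functional-unit node on the issue ring.
final class RingNode implements Runnable {
    record Token(int opcode, int payload) {}

    private final BlockingQueue<Token> fromPrev = new LinkedBlockingQueue<>();
    private final BlockingQueue<Token> toNext;
    private final int opcodeClass;          // which opcode group this unit accepts
    private int freeEntries;

    RingNode(BlockingQueue<Token> toNext, int opcodeClass, int entries) {
        this.toNext = toNext;
        this.opcodeClass = opcodeClass;
        this.freeEntries = entries;
    }

    BlockingQueue<Token> input() { return fromPrev; }   // wire the upstream node's output here

    @Override
    public void run() {
        try {
            while (true) {
                Token t = fromPrev.take();
                if ((t.opcode() >>> 4) == opcodeClass && freeEntries > 0) {
                    freeEntries--;          // claim the token for local execution
                } else {
                    toNext.put(t);          // pass it along; the ring keeps spinning
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}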
Future Work

Processing Networks
• Rings can be extended to an arbitrary network, or a hierarchy of heterogeneous network architectures
• In general, a set of dispatch units communicates with a set of functional units through overlapping issue and commit networks

Estimation
• Currently, time is based on iterations of processes
• Extensive parameterization could allow multivalued/multivariable functions, derived from a few points of HDL coding or experience, to estimate frequency, area, or power
• AST manipulation could back-annotate all processes in the design with the relative numbers
• The common interface (FIFOs/Objects) should allow co-simulation of RTL and YAPI behavioral models
• As each behavioral model is completed, it could be handed off for high-level synthesis or RTL design
• Synthesis to gates would allow refinement of the design (add or delete pipeline stages, reservation station entries, or any other parameter)

Compilers and Benchmarks
• True performance analysis requires benchmarks and a compiler to map them to the ISA used by the behavioral model
• Abstraction from the ISA to an intermediate representation
• Given the parameters passed to each process, the capabilities of the functional units, and the network interconnecting them, performance numbers could perhaps be estimated from an intermediate representation instead of having to map/compile to a specific ISA
Conclusions
• YAPI (KPN) is very useful for constructing complex architectures
• Flops/registers are mapped into YAPI channels
• Arrays and combinational logic are mapped into Metropolis code within a YAPI process
• Feedback can stabilize the system and thus ensure fixed-length channels
• Internal pipelines can be implemented as either feedback channels or internal queues
• For a given architecture, parameterization of processes allows easy exploration of the design space
• Encapsulating an in-flight instruction can minimize the number of channels required and improve flexibility
• Consistent interfaces across all processes allow module reuse and the potential for HDL/behavioral co-simulation
• Complex architectures (OOO, rings, etc.) can easily be constructed and simulated