High-level Specification and Efficient Implementation of Pipelined Circuits

Maria-Cristina Marinescu, Martin Rinard
Laboratory for Computer Science, Massachusetts Institute of Technology

Overall Goal
• Input: Modular, Asynchronous, Sequential Specification
• Output: Efficient, Synchronous, Parallel Implementation in Synthesizable Verilog

Specification Language Concepts
• State (Registers, Memory)
• Queues (Conceptually Unbounded Length)
• Modules
  • Read inputs from queues and state
  • Write outputs to queues and state

Module Example: Register Operand Fetch Module (step 1)
• Register File: r0 = 0, r1 = 43, r2 = 100, r3 = 84
• Input Queue: <jz r0>, <inc r1>
• Output Queue: <inc r2 100>, <inc r3 84>

Module Example: Register Operand Fetch Module (step 2)
• Register File: r0 = 0, r1 = 43, r2 = 100, r3 = 84
• Input Queue: <jz r0>
• Instruction in module: <inc r1> (reading r1)
• Output Queue: <inc r2 100>, <inc r3 84>

Module Example: Register Operand Fetch Module (step 3)
• Register File: r0 = 0, r1 = 43, r2 = 100, r3 = 84
• Input Queue: <jz r0>
• Instruction in module: <inc r1 43> (r1 = 43 fetched)
• Output Queue: <inc r2 100>, <inc r3 84>

Module Example: Register Operand Fetch Module (step 4)
• Register File: r0 = 0, r1 = 43, r2 = 100, r3 = 84
• Input Queue: <jz r0>
• Output Queue: <inc r1 43>, <inc r2 100>, <inc r3 84>

Module Behavior
• Each module has a set of update rules
• Each update rule consists of
  • Precondition
  • Action (set of updates)
• Rule is enabled (and can execute) if precondition is true in current state
• When rule executes, it atomically applies the updates in its action to produce a new state

Update Rules in Example
• “If an increment instruction is at the head of the input queue and there is no RAW hazard, then atomically remove the instruction from the queue, fetch the value from the register file, and append the instruction with the register value into the output queue”
  <INC r> = head(iq) and notin(oq, <INC r _>) -> iq = tail(iq), oq = append(oq, <INC r rf[r]>);
• “If a jump on zero instruction is at the head of the input queue and there is no RAW hazard, then atomically remove the instruction from the queue, fetch the value from the register file, and append the instruction with the register value into the output queue”
  <JZ r l> = head(iq) and notin(oq, <INC r _>) -> iq = tail(iq), oq = append(oq, <JZ rf[r] l>);
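The increment rule above can be read as a pair of functions over the module state: a precondition and an action that returns the whole new state at once. The following is a minimal sketch under assumptions that are not in the slides: the names `State`, `raw_hazard`, `inc_fetch_enabled` and `inc_fetch_apply` are hypothetical, instructions are Python tuples, and queues are lists with the head at index 0.

```python
from dataclasses import dataclass

@dataclass
class State:
    rf: dict   # register file: register name -> value
    iq: list   # input queue, head at index 0
    oq: list   # output queue, new items appended at the end

def raw_hazard(oq, r):
    # notin(oq, <INC r _>) is false when an increment of r is still pending
    return any(ins[0] == "INC" and ins[1] == r for ins in oq)

def inc_fetch_enabled(s):
    # Precondition: <INC r> = head(iq) and notin(oq, <INC r _>)
    return bool(s.iq) and s.iq[0][0] == "INC" and not raw_hazard(s.oq, s.iq[0][1])

def inc_fetch_apply(s):
    # Action: all updates are computed from the old state and returned
    # together, which models the atomic application of the rule.
    _, r = s.iq[0]
    return State(rf=s.rf, iq=s.iq[1:], oq=s.oq + [("INC", r, s.rf[r])])
```

Because the action builds a new `State` from the old one instead of mutating it, no other rule can observe a half-applied update.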

From Modules to Systems
• System is a set of Modules
  • Access same Registers and Memories
  • Also communicate via Queues
• Behavior of System
  • Update rules from all Modules
• Queues Provide Modularity
  • Decouple Modules
  • Enable Independent Development
  • Promote Reusable Modular Designs

Example System Specification
• Instruction Fetch Module
  TRUE -> iq = append(iq, im[pc]), pc = pc + 1;
• Register Operand Fetch Module
  <INC r> = head(iq) and notin(rq, <INC r _>) -> iq = tail(iq), rq = append(rq, <INC r rf[r]>);
  <JZ r l> = head(iq) and notin(rq, <INC r _>) -> iq = tail(iq), rq = append(rq, <JZ rf[r] l>);
• Compute and Writeback Module
  <INC r v> = head(rq) -> rf = rf[r = v+1], rq = tail(rq);
  <JZ v l> = head(rq) and (v == 0) -> pc = l, iq = nil, rq = nil;
  <JZ v l> = head(rq) and (v != 0) -> rq = tail(rq);

Abstract Model of Execution
• Conceptually, system execution is a sequence of rule executions
• while TRUE
    choose an enabled rule
    execute rule
    obtain new state
• Concepts in Abstract Execution Model
  • Rules execute atomically
  • Rules execute asynchronously
  • Rules execute sequentially
  • Unbounded Queues
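The execution loop above can be tried out on the six rules of the example system. This is only a sketch, not the authors' tool: `make_rules`, `run` and `PROGRAM` are hypothetical names, instructions are tuples, and the fetch rule is guarded by the program length (the slide uses TRUE) so that the loop terminates once no rule is enabled.

```python
import random

def make_rules(im):
    """Return (precondition, action) pairs for the example system."""
    def head_is(q, op):
        return bool(q) and q[0][0] == op

    def no_raw(rq, r):
        # notin(rq, <INC r _>): no pending increment of register r
        return not any(i[0] == "INC" and i[1] == r for i in rq)

    def fetch(s):
        s["iq"].append(im[s["pc"]])
        s["pc"] += 1

    def inc_operand(s):
        _, r = s["iq"].pop(0)
        s["rq"].append(("INC", r, s["rf"][r]))

    def jz_operand(s):
        _, r, label = s["iq"].pop(0)
        s["rq"].append(("JZ", s["rf"][r], label))

    def inc_compute(s):
        _, r, v = s["rq"].pop(0)
        s["rf"][r] = v + 1

    def jz_taken(s):
        s["pc"] = s["rq"][0][2]   # read the target before flushing
        s["iq"].clear()
        s["rq"].clear()

    def jz_not_taken(s):
        s["rq"].pop(0)

    return [
        # Assumption: fetch stops at the end of the program.
        (lambda s: s["pc"] < len(im), fetch),
        (lambda s: head_is(s["iq"], "INC") and no_raw(s["rq"], s["iq"][0][1]), inc_operand),
        (lambda s: head_is(s["iq"], "JZ") and no_raw(s["rq"], s["iq"][0][1]), jz_operand),
        (lambda s: head_is(s["rq"], "INC"), inc_compute),
        (lambda s: head_is(s["rq"], "JZ") and s["rq"][0][1] == 0, jz_taken),
        (lambda s: head_is(s["rq"], "JZ") and s["rq"][0][1] != 0, jz_not_taken),
    ]

def run(im, rf, seed=0):
    rules = make_rules(im)
    rng = random.Random(seed)
    s = {"pc": 0, "iq": [], "rq": [], "rf": dict(rf)}
    while True:
        enabled = [act for pre, act in rules if pre(s)]
        if not enabled:
            return s
        rng.choice(enabled)(s)   # one rule executes atomically per step

# Hypothetical program: the jump at index 1 is taken (r0 == 0) and skips index 2.
PROGRAM = [("INC", "r1"), ("JZ", "r0", 3), ("INC", "r2"), ("INC", "r3")]
```

Running `run(PROGRAM, {"r0": 0, "r1": 43, "r2": 100, "r3": 84}, seed)` with different seeds picks different interleavings, including ones where instructions after the jump are fetched speculatively and then flushed. The final register file is the same in every case.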

Synthesis Algorithm

Key Challenge
• Specification Language
  • Sequential, atomic, asynchronous semantics
  • Conceptually unbounded queues
• Implemented Circuit
  • Coordinated parallel execution
  • Finite length queues

Initial Synthesis Algorithm
• Symbolically Execute Rules in Order
  • Each rule starts with result from previous rule
• Obtain Expressions for New Values of Registers, Memories, and Queues
• Generate Combinational Circuit that Produces New Values
• Each clock cycle circuit computes new values, writes new values back
• Every rule gets a chance to execute, every clock cycle!
[Figure: symbolic states SE0, SE1, SE2, SE3 linked by Rule 1, Rule 2, Rule 3]

Properties of Initial Algorithm
• Preserves Semantics of Specification
• Independent Rules Execute Concurrently
• But May Have Long Clock Cycle
  • Output of each preceding rule fed in as input to next rule
  • Data traverses ALL rules (and pipeline stages) in a single cycle!
• Solution: Relaxation

Relaxation
• for each rule Ri with precondition Pi
    for each variable instance vi in precondition Pi
      replace vi with its earliest safe version
  ...
  Rk-1: Pk-1 -> vk = ...
  ...
  Ri : Pi(vi,...) -> ...
  ...
• vk safe for vi if either
  • Pi[vk/vi] implies Pi
  • (Pi, Pk-1) mutually exclusive
[Figure: rule chain 0, 1, 2, 3 before and after relaxation]

Relaxation Result
• Relaxation exposes additional parallelism
  • Queues separate pipeline stages
  • Items traverse one stage per clock cycle
• Safety: If a rule executes in new system
  • Then it also executes in old system
  • And it generates same result
• Liveness: After relaxation, all rules test initial state
  • If rule enabled in old system but not in new system, then
  • Some rule executes in new system
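The difference between the initial algorithm and the relaxed one can be seen on a toy two-stage pipeline. This is an illustrative sketch, not the synthesis algorithm itself: the queues `q0`, `q1`, `q2` and the functions `move`, `cycle_chained` and `cycle_relaxed` are hypothetical. Testing "q1 is non-empty" on the start-of-cycle version of `q1` is safe here because an earlier append cannot make a non-empty queue empty or change its head.

```python
def move(src, dst):
    # Rule: if src is non-empty, move its head to the end of dst.
    pre = lambda s: len(s[src]) > 0
    def act(s):
        s[dst].append(s[src].pop(0))
    return pre, act

RULES = [move("q0", "q1"), move("q1", "q2")]

def cycle_chained(state, rules):
    # Initial algorithm: each rule tests the result of the previous rule.
    s = {k: list(v) for k, v in state.items()}
    for pre, act in rules:
        if pre(s):
            act(s)
    return s

def cycle_relaxed(state, rules):
    # Relaxed version: every precondition tests the start-of-cycle state.
    enabled = [pre(state) for pre, _ in rules]
    s = {k: list(v) for k, v in state.items()}
    for (_, act), en in zip(rules, enabled):
        if en:
            act(s)
    return s
```

Starting from `{"q0": ["a"], "q1": [], "q2": []}`, one chained cycle carries `a` all the way to `q2`, while one relaxed cycle moves it only to `q1`. A second relaxed cycle then delivers it to `q2`, which matches "items traverse one stage per clock cycle".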

Global Scheduling
• Issue:
  • Conceptually unbounded queues
  • Finite hardware buffers
• Solution: Modify append rules s.t. no queue exceeds its specified length
• Challenge:
  • Schedule maximum number of rules
  • Rules can insert into full queues if the queue is within length at the end of the clock cycle

Global Scheduling
• Assumption: queues start within length at beginning of cycle
• Goal: generate circuit that makes queues remain within length at end of cycle
• Basic Approach:
  • Before an enabled rule executes
  • Ensure there will be room for its result in the output queues at the end of the clock cycle
• Key Idea: a rule can insert into a queue as long as enough following rules remove from it

GS: Basic Concepts
• Rule-Queue Graph
  • Nodes of 2 types: rules and queues
  • Edge from rule node to queue node if rule inserts into queue
  • Edge from queue node to rule node if rule removes from queue
• In Example:
[Figure: rule-queue graph with rule nodes 1 to 6 and queue nodes iq and rq]

Acyclic Rule-Queue Graphs
• Process Rules in Topological Sort Order
• Augment execution precondition
  • If rule inserts into a queue, require that either
    • there is room in queue when rule executes or
    • future rules will execute and remove items to make room in queue
• Each queue has counter of number of elements in queue at start of cycle
• Combinational logic tracks queue insertions and deletions
• GS algorithm generates the control signals for the combinational logic
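The rule-queue graph and its topological order can be sketched for the example system with the standard `graphlib` module (Python 3.9 or later). The rule names R0 to R5 follow the numbering of the Global Scheduling example and are an assumption about how the figure labels its nodes. The insert and remove sets are read off the example specification, with R4 treated as removing from both queues because it sets them to nil.

```python
from graphlib import TopologicalSorter

# Which queues each rule inserts into or removes from (from the example system).
INSERTS = {"R0": ["iq"], "R1": ["rq"], "R2": ["rq"]}
REMOVES = {"R1": ["iq"], "R2": ["iq"], "R3": ["rq"], "R4": ["iq", "rq"], "R5": ["rq"]}

def rule_queue_graph(inserts, removes):
    # Map every node to the set of its predecessors.
    preds = {}
    for rule, queues in inserts.items():
        preds.setdefault(rule, set())
        for q in queues:
            preds.setdefault(q, set()).add(rule)   # edge rule -> queue
    for rule, queues in removes.items():
        preds.setdefault(rule, set()).update(queues)  # edge queue -> rule
    return preds

def topological_order(preds):
    # static_order() raises graphlib.CycleError if the graph is cyclic.
    return list(TopologicalSorter(preds).static_order())
```

For this graph every valid order places R0 before iq, iq before R1 and R2, and rq before R3, R4 and R5. A cyclic graph makes `static_order()` raise `CycleError`, which is the case the cyclic algorithm has to handle.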

Pipeline Implications
• Counter becomes presence bit for single element queues
• Additional preconditions can be viewed as pipeline stall logic
• Design can be written to generate pipeline forwarding/bypassing instead of stall

Global Scheduling: Example
• For length(iq) = 1, length(rq) = 1
• R0 executes and appends to iq if:
  • P1’ || P2’ || P4’ OR
  • iq0 = nil
• R4 doesn’t insert into queues
  • => P4’ = P4
• Apply same rationale for R1 & R2:
• R1 executes and appends to rq if:
  • P4 || P3’ || P5’ OR
  • rq0 = nil
• R3 and R5 don’t insert into queues
  • => P3’ = P3, P5’ = P5
[Figure: versions IQ0 to IQ5 of iq, selected by preconditions P0, P1[IQ0/IQ1], P2[IQ0/IQ2], and P4]
• GS1(rq) = GS2(rq) = (rq0 = nil) || P4 || P3 || P5
• GS0(iq) = (iq0 = nil) || P4 || (P1 || P2) ∧ [(rq0 = nil) || P3 || P5]
          = (iq0 = nil) || P4 || P1 || P2
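The last simplification of GS0(iq) can be checked by brute force. The sketch below rests on one assumption that the slide does not state: whenever rq is non-empty, its head is an INC or a JZ, so one of P3, P4, P5 holds, i.e. (rq0 = nil) || P3 || P4 || P5 is always true. The function names are hypothetical.

```python
from itertools import product

def gs0_full(iq_empty, rq_empty, p1, p2, p3, p4, p5):
    # (iq0 = nil) || P4 || (P1 || P2) AND [(rq0 = nil) || P3 || P5]
    return iq_empty or p4 or ((p1 or p2) and (rq_empty or p3 or p5))

def gs0_simplified(iq_empty, rq_empty, p1, p2, p3, p4, p5):
    # (iq0 = nil) || P4 || P1 || P2
    return iq_empty or p4 or p1 or p2

def mismatches(assume_rq_head_handled):
    # Enumerate all truth assignments; optionally keep only those where a
    # non-empty rq always has an enabled Compute and Writeback rule.
    bad = []
    for v in product([False, True], repeat=7):
        iq_empty, rq_empty, p1, p2, p3, p4, p5 = v
        if assume_rq_head_handled and not (rq_empty or p3 or p4 or p5):
            continue
        if gs0_full(*v) != gs0_simplified(*v):
            bad.append(v)
    return bad
```

With the assumption, `mismatches(True)` is empty, so the two forms agree. Without it, `mismatches(False)` is non-empty, so the simplification does depend on a property of the rules and not on Boolean algebra alone.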

Cyclic Rule-Queue Graphs
• Cyclic Graphs lead to Cyclic Dependences
  • Rule 1 depends on rule 2 to remove an item from a queue
  • But rule 2 depends on rule 1 to remove an item from another queue
• Algorithm from acyclic case would generate recursive preconditions
[Figure: rule 1 and rule 2 connected in a cycle through Queue x and Queue y]

Cyclic R-Q Graphs: Example
• Let P1’ = P1 ∧ GS1
• Assumption: R1 executes (P1’ = TRUE)
• Find group of rules that must fire together
• P1’ = P1 ∧ [(x=nil) || P2’]
      = P1 ∧ [(x=nil) || P2 ∧ [(y=nil) || P1’]]
• No need to explore P1’ further (P1’ = TRUE)
  => P1’ = P1 ∧ [(x=nil) || P2]
[Figure: rule 1 and rule 2 connected in a cycle through Queue x and Queue y]

Solution to Cyclic Dependence Problem
• Key Idea: no deadlock if we can coordinate removals and insertions from/to all queues in cycle s.t. removals make room for insertions
• Groups of rules must execute together
• Use depth-first search on rule-queue graph to find cyclic groups
• Augment preconditions to allow all rules in cycle to execute together
• Extensions include paths into and out of cyclic group

Cyclic R-Q Graphs: Algorithm
SymbolicExecution(Ri, CrtPath)
  for each queue q that Ri inserts into
    for each rule Rj that inserts/removes in/from q
      newRj = if Rj ∈ CrtPath then TRUE (rule already examined)
              else SymbolicExecution(Rj)
      newCrtPath = if Rj ∈ CrtPath then CrtPath
                   else CrtPath ∪ Rj
      replace Rj’ with newRj in GSi(q)
  GSi = ∧q GSi(q)
  Ri’ = Ri ∧ GSi
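The recursion with a cut-off on the current path can be sketched for the two-rule cycle of the earlier example. This is a simplified reading of the algorithm, not a transcription: it only follows rules that remove from a queue, it evaluates the augmented precondition for concrete truth values instead of building an expression, and the names `INSERTS`, `REMOVERS`, `augmented` and the `env` layout are hypothetical.

```python
# Two-rule cycle: R1 inserts into x, R2 removes from x;
# R2 inserts into y, R1 removes from y.
INSERTS = {"R1": ["x"], "R2": ["y"]}
REMOVERS = {"x": ["R2"], "y": ["R1"]}

def augmented(rule, env, path=()):
    """Evaluate the augmented precondition of `rule`.

    env maps each rule name to the value of its original precondition and
    "<queue>_empty" to whether that queue is empty at the start of the cycle.
    """
    path = path + (rule,)
    ok = env[rule]
    for q in INSERTS.get(rule, []):
        room = env[q + "_empty"]
        for rj in REMOVERS.get(q, []):
            # A rule already on the current path is assumed to execute (TRUE),
            # which stops the recursion around the cycle.
            room = room or (True if rj in path else augmented(rj, env, path))
        ok = ok and room
    return ok
```

For every assignment of P1, P2 and the two emptiness flags, `augmented("R1", env)` equals `P1 and (x_empty or P2)`, which is the closed form derived in the example.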

Symbolic Execution
• Substitute out all intermediate versions of variables
• Obtain expression for last version of each variable
• Each expression defines new value of corresponding variable
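A small sketch of this substitution, using strings as expressions. The two rules and the variables `x` and `y` are made up for illustration and do not come from the example system; `ite` and `symbolic_execute` are hypothetical helpers. Each rule turns into a conditional expression, and later rules are expressed over the results of earlier ones, so the final expressions mention only the initial versions `x0` and `y0`.

```python
def ite(cond, then, other):
    # Conditional expression in Verilog-like ternary syntax.
    return f"({cond} ? {then} : {other})"

def symbolic_execute(rules, variables):
    # env maps each variable to the expression for its current version.
    env = {v: f"{v}0" for v in variables}
    for pre, updates in rules:
        cond = pre(env)
        new_env = dict(env)
        for var, rhs in updates.items():
            # New version: rhs if the rule fires, otherwise the old version.
            new_env[var] = ite(cond, rhs(env), env[var])
        env = new_env
    return env

# Hypothetical rules:  x > 0 -> x = x - 1;   x == 0 -> y = y + 1;
RULES = [
    (lambda e: f"({e['x']} > 0)", {"x": lambda e: f"({e['x']} - 1)"}),
    (lambda e: f"({e['x']} == 0)", {"y": lambda e: f"({e['y']} + 1)"}),
]
```

`symbolic_execute(RULES, ["x", "y"])` yields an expression for `y` that contains the whole expression for the new `x` inside its condition. That nesting is the source of the long combinational path that relaxation shortens.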

Optimizations
• Optimize expressions from symbolic execution
• CSE: avoid unnecessary replication of HW
• Mutual Exclusion Testing:
  • Eliminate computation of values that never occur in practice as result of mutually exclusive preconditions

Verilog Generation
• Synthesize HW directly from expressions:
  • Each queue as one or more registers
  • Each memory variable as library block
  • Each state variable as one or more registers, depending on type
  • Each expression as combinational logic that feeds back into corresponding registers

Experimental Results
• We have implemented the synthesis system
• Used the system to generate synthesizable Verilog for several specifications (map effort medium, area effort low)

Architecture              Cycle (MHz)  Area
RISC Pipelined Processor  88.89        23195.25
SCU RTL 98 DSP            90.91        22999.50

Benchmark   Cycle (MHz)  Area
Bubblesort  107.06       5434
Butterfly   104.42       5411
Filter      105.01       3757

Conclusion
• Starting Point: (Good for Designer)
  Modular, Asynchronous, Sequential Specification with Conceptually Infinite Queues
• Ending Point: (Good for Implementation)
  Efficient, Synchronous, Globally Scheduled, Parallel Implementation with Finite Queues in Synthesizable Verilog
• Variety of Techniques:
  • Symbolic Execution
  • Global Scheduling
