1 / 35

A Synthesis Algorithm for Modular Design of Pipelined Circuits

A Synthesis Algorithm for Modular Design of Pipelined Circuits. Maria-Cristina Marinescu Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology. Overall Goal. Efficient, Synchronous, Parallel Implementation in Synthesizable Verilog. Modular,

ringo
Télécharger la présentation

A Synthesis Algorithm for Modular Design of Pipelined Circuits

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Synthesis Algorithm for Modular Design of Pipelined Circuits Maria-Cristina Marinescu Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology

  2. Overall Goal Efficient, Synchronous, Parallel Implementation in Synthesizable Verilog Modular, Asynchronous, Sequential Specification

  3. ... ... Example: Specification PC inc inc RF wen IM br RESET RESET Register Operand Fetch Compute and Writeback Instruction Fetch

  4. Specification Properties • Basic Concepts • State • Registers and Memories • Conceptually Infinite Queues • Modules (state transformers) • Queues Provide Modularity • Decouple Modules • Enable Independent Development • Promote Reusable Modular Designs

  5. Example: Implementation PC inc inc RF wen IM br RESET RESET Register Operand Fetch Compute and Writeback Instruction Fetch

  6. Implementation Issues • Synthesizing Efficient Combinational Logic • Queue Finitization • Synchronous Global Scheduling

  7. Specification Language

  8. Type Declarations type reg = int(3), val = int(8), loc = int(8); type ins = <INC reg> | <JZ reg loc>; type irf = <INC reg val> | <JZ val loc>; State Declarations var PC: loc, IM: ins[N], RF: val[8]; var IQ: queue(ins), RQ: queue(irf);

  9. Modules • Each module is set of update rules • Each Update Rule Consists of • Precondition • Action (set of updates) • Rule is enabled (and can execute) if precondition is true in current state • When rule executes, atomically applies updates in action to produce new state

  10. Update Rules And Modules • Instruction Fetch Module TRUE iq = append(iq,im[pc]), pc = pc + 1; • Register Operand Fetch Module <INC r> = head(iq) and notin(rq, <INC r _>) iq = tail(iq), rq = append(rq, <INC r rf[r]>); <JZ r l> = head(iq) and notin(rq, <INC r _>) iq = tail(iq), rq = append(rq, <JZ rf[r] l>);

  11. Update Rules And Modules • Compute and Writeback Module <INC r v> = head(rq) rf = rf[r = v+1], rq = tail(rq); <JZ v l> = head(rq) and (v == 0) pc = l, iq = nil, rq = nil; <JZ v l> = head(rq) and (v !=0) rq = tail(rq);

  12. Abstract Model of Execution • Conceptually, system execution is a sequence of rule executions • while TRUE choose an enabled rule execute rule obtain new state • Concepts in Abstract Execution Model • Rules execute atomically • Rules execute asynchronously • Rules execute sequentially

  13. Synthesis Algorithm

  14. Synthesis Algorithm • Starting Point • Asynchronous, sequential abstract execution • Conceptually infinite queues • Goal: efficient synchronous global schedule • In each clock cycle, multiple rules execute • Synchronously and Concurrently • (pipeline stages move together) • Implement each queue with a finite hardware buffer • Can read and write buffer in same cycle

  15. Basic Idea • At Each Clock Cycle • Check each rule to see if enabled • If so, atomically update state to reflect execution • For each variable, generate expression that specifies new value at end of cycle • Challenge: sequential, atomic semantics for rules • Solution: symbolic rule execution

  16. Final Result for PC if (<JZ v location> = rq and v == 0) new pc = location else if ((iq == nil) or (<INC r> = iq and <INC r _> != rq) or (<JZ r l> = iq and <INC r _> != rq)) new pc = pc+1 else new pc = pc

  17. Algorithm Outline • Rule Numbering: for symbolic execution • Relaxation: shorten critical path by testing intial state • Queue Finitization: ensure rules execute only if will be room for the result in output queues • Symbolic Execution • Optimizations • Synthesizable Verilog Generation: from optimized expressions

  18. Rule Numbering • Goal: • Resolve Conflicts Between Parallel Rule Executions • Approach: • For each state variable, number versions according to order • Feed results of previous rule into next rule

  19. TRUE  iq1 = append(iq0, im[pc0]), pc1 = pc0+1; • <INC r> = head(iq1) and notin(rq1,<INC r _>)  iq2 = tail(iq1), rq2 = append(rq1, <INC r rf1[r]>); • <JZ r l> = head(iq2) and notin(rq2, <INC r _>)  iq3 = tail(iq2), rq3 = append(rq2, <JZ rf2[r] l>); • <INC r v> = head(rq3)  rf4 = rf3[r  v+1], rq4 = tail(rq3); • <JZ v l> = head(rq4) and v != 0  rq5 = tail(rq4); • <JZ v l> = head(rq5) and v == 0  pc6 = l, rq6 = nil, iq6 = nil;

  20. Relaxation • Issue: rule numbering may produce long clock cycle • Solution:for each rule Ri with precondition Pi for each variable instance vi in precondition Pi replace vi with its earliest safe version ... Rk-1: Pk-1 -> vk = ... ... Ri : Pi(vi,...) -> ... ... • vk safe for viif either • Pi[vk/vi] implies Pi • (Pi,Pk-1) mutually exclusive 0 1 2 => 0 1 2 3 3

  21. Relaxation Result • Queues separate pipeline stages • Items traverse one stage per clock cycle • Safety: If a rule executes in new system • Then it also executes in old system • And it generates same result • Liveness: After relaxation, all rules test initial state • If rule enabled in old system but not in new system, then • Some rule executes in new system

  22. TRUE  iq1 = append(iq0, im[pc0]), pc1 = pc0+1; • <INC r> = head(iq0) and notin(rq0,<INC r _>)  iq2 = tail(iq1), rq2 = append(rq1, <INC r rf1[r]>); • <JZ r l> = head(iq0) and notin(rq0, <INC r _>)  iq3 = tail(iq2), rq3 = append(rq2, <JZ rf2[r] l>); • <INC r v> = head(rq0)  rf4 = rf3[r  v+1], rq4 = tail(rq3); • <JZ v l> = head(rq0) and v != 0  rq5 = tail(rq4); • <JZ v l> = head(rq0) and v == 0  pc6 = l, rq6 = nil, iq6 = nil;

  23. Queue Finitization • Issue: • Conceptually unbounded queues • Finite hardware buffers • Assumption: queues start within length at beginning of cycle • Goal: generate circuit that makes queues remainwithin length at end of cycle • Basic Approach: • Before enabled rule executes • Be sure will be room for result in output queues at end of clock cycle

  24. 2 4 5 3 6 Queue Finitization Algorithm • Build Producer-Consumer Graph • Nodes are rules • Edge between rules if first inserts into queue and second removes from queue • In Example: 1

  25. Acyclic Graphs • Process Rules in Topological Sort Order • Augment execution precondition • If rule inserts into a queue, require that either • there is room in queue when rule executes or • future rules will execute and remove items to make room in queue • Each queue has counter of number of elements in queue at start of cycle • Combinational logic tracks queue insertions and deletions

  26. Example Instruction Fetch Rule After Queue Finitization Empty(iq0) or (<JZ v l> = head(rq0) and v == 0) or <INC r> = head(iq0) and notin(rq0,<INC r _>) or <JZ r l> = head(iq0) and notin(rq0, <INC r _>) iq1 = append(iq0, im[pc0]); pc1 = pc0+1;

  27. Pipeline Implications • Counter becomes presence bit for single element queues • Additional preconditions can be viewed as pipeline stall logic • Design can be written to generate pipeline forwarding/bypassing instead of stall

  28. Cyclic Graphs • Cyclic Graphs lead to Cyclic Dependences • Rule 1 depends on rule 2 to remove an item from a queue • But rule 2 depends on rule 1 to remove an item from another queue • Algorithm from acyclic case would generate recursive preconditions rule 1 rule 2

  29. Solution to Cyclic Dependence Problem • Groups of rules must execute together • Use depth-first search on producer-consumer graph to find cyclic groups • Augment preconditions to allow all rules in cycle to execute together • Extensions include paths into and out of cyclic group

  30. Symbolic Execution • Substitute out all intermediate versions of variables • Obtain expression for last version of each variable • Each expression defines new value of corresponding variable

  31. Optimizations • Optimize expressions from symbolic execution • CSE: avoid unnecessary replication of HW • Mutual Exclusion Testing: • Eliminate computation of values that never occur in practice as result of mutually exclusive preconditions

  32. Symbolic Execution with Optimization Final result for rq, assuming single item queues if <JZ v l> = head(rq) and v==0 new rq = nil else if <INC r>=head(iq) and notin(rq,<INC r _>) new rq = <INC r rf[r]> else if <JZ r l>=head(iq) and notin(rq,<INC r _>) new rq = <JZ rf[r] l> else new rq = nil

  33. Verilog Generation • Synthesize HW directly from expressions: • Each queue as one or more registers • Each memory variable as library block • Each state variable as one or more registers, depending on type • Each expression as combinational logic that feeds back into corresponding registers

  34. Experimental Results • We have implemented synthesis system • Used system to generate synthesizable Verilog for several specifications (map effort medium, area effort low, constraints 10ns) Benchmark Cycle Time Area (cells) Bubblesort 9.34ns ~370 Butterfly 9.57ns ~412 Processor 11.28ns ~387 Filter 9.51ns ~252

  35. Conclusion • Starting Point: (Good for Designer) Modular, Asynchronous, Sequential Specification with Conceptually Infinite Queues • Ending Point: (Good for Implementation) Efficient, Synchronous, Globally Scheduled, Parallel Implementation with Finite Queues in Synthesizable Verilog • Variety of Techniques: • Symbolic Execution • Queue Finitization

More Related