
Architecture-Level Synthesis for Automatic Interconnect Pipelining

Jason Cong, Yiping Fan, Zhiru Zhang. VLSI CAD Lab, Computer Science Department, University of California, Los Angeles. {cong, fanyp, zhiruz}@cs.ucla.edu. Funded by GSRC, NSF, and Altera Corp.



  1. Architecture-Level Synthesis for Automatic Interconnect Pipelining Jason Cong, Yiping Fan, Zhiru Zhang VLSI CAD Lab Computer Science Department University of California, Los Angeles {cong, fanyp, zhiruz}@cs.ucla.edu Funded by GSRC, NSF, and Altera Corp.

  2. Outline • Motivation • Our contributions • RDR-Pipe micro-architecture • Regular Distributed Register micro-architecture with interconnect pipelining • Synthesis flow and algorithms • MCAS-Pipe: automatic interconnect pipelining and sharing • Experimental results • Conclusions

  3. Interconnect Bottleneck in Nanometer Designs • Challenge: single-cycle full-chip communication will no longer be possible • Not supported by the current CAD toolset • ITRS'01 0.07um Tech • 5.63 GHz across-chip clock • 800 mm2 (28.3mm x 28.3mm) • IPEM BIWS estimations • Buffer size: 100x • Driver/receiver size: 100x • Semi-global layer (Tier 3) • Can travel up to 11.4mm in one cycle • Need 5 clock cycles from corner to corner • [Figure: 28.3mm x 28.3mm die with regions reachable in 1, 2, 3, 4, and 5 cycles; axis ticks at 0, 11.4, 22.8, and 28.3 mm]
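The 5-cycle figure on this slide can be reproduced with a short calculation. The sketch below assumes that a corner-to-corner route is Manhattan (one horizontal plus one vertical die side), which the slide does not state explicitly; the function name `cycles_to_cross` is hypothetical.

```python
import math


def cycles_to_cross(distance_mm: float, reach_mm_per_cycle: float) -> int:
    """Clock cycles needed to cover a wire of the given length.

    Assumes a signal advances at most reach_mm_per_cycle in one cycle,
    so any partial final segment still costs a full cycle.
    """
    return math.ceil(distance_mm / reach_mm_per_cycle)


DIE_SIDE_MM = 28.3   # from the slide: 800 mm2 die, 28.3mm x 28.3mm
REACH_MM = 11.4      # from the slide: Tier 3 wire reach in one cycle

# Assumption: corner-to-corner routing is Manhattan, i.e. two die sides.
corner_to_corner_mm = 2 * DIE_SIDE_MM

print(cycles_to_cross(corner_to_corner_mm, REACH_MM))  # 5
print(cycles_to_cross(DIE_SIDE_MM, REACH_MM))          # 3 (one die side)
```

Under this assumption, 56.6 mm at 11.4 mm per cycle gives 4.96, which rounds up to the 5 cycles quoted on the slide.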

  4. Related Work • Retiming with placement or floorplanning • Retiming + multilevel partitioning [Cong et al., ICCAD'00] and coarse placement [Cong et al., DAC'03] • Retiming + floorplanning [Chong & Brayton, IWLS'01] • Retiming + placement for FPGAs [Singh & Brown, FPGA'02] • Global wire pipelining in the Itanium™ processor • [McInerney et al., ISPD'00] • Buffer and flip-flop insertion in RTL • [Lu et al., DATE'02] • [Cocchini, ICCAD'02]

  5. Limitation during Logic/Physical Level to Explore Multicycle Communication • Minimum clock period achievable by logic optimization is bounded by the maximum delay-to-register (DR) ratio of the loops in the circuit [Papaefthymiou, MST'94] • Example: in a loop with 4 logic cells and 2 registers • Cell delay = 1ns • Interconnect delay = 1ns • DR ratio = (Dlogic+Dint)/#Registers = (4+4)/2 = 4ns • Clock period ≥ 4ns • Interconnect pipelining by flip-flop insertion? • Requires a considerable amount of manual rework on the original RTL descriptions
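As an illustration of the DR-ratio bound on this slide, the following minimal sketch computes the bound for a set of loops. The function name `min_clock_period_ns` and the tuple layout are hypothetical; only the formula (Dlogic+Dint)/#Registers and the numbers of the example loop come from the slide.

```python
def min_clock_period_ns(loops):
    """Lower bound on the clock period: the maximum DR ratio over all loops.

    Each loop is a tuple (logic_delay_ns, interconnect_delay_ns, num_registers).
    """
    return max(
        (d_logic + d_int) / num_regs
        for d_logic, d_int, num_regs in loops
    )


# The loop from the slide: 4 cells at 1ns each, 4ns of interconnect delay,
# and 2 registers.
slide_loop = (4.0, 4.0, 2)
print(min_clock_period_ns([slide_loop]))  # 4.0
```

Because the bound is a property of the loop itself, retiming alone cannot push the clock period below it; lowering it means changing the number of registers in the loop, which is the RTL rework the slide refers to.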

  6. Our Approach • Consideration of multicycle communication during architectural (or behavioral) synthesis • [Cong et al, ISPD’03] [Cong et al. ICCAD’03] • Regular Distributed Register (RDR) micro-architecture • Highly regular • Direct support of multicycle on-chip communication • MCAS: Architectural Synthesis for Multi-cycle Communication • Efficiently maps the behavioral descriptions to RDR uArch • Integrates architectural synthesis (e.g. resource binding, scheduling) with physical planning • This work • Extension of RDR and MCAS for interconnect pipelining

  7. Outline • Motivation • Our contributions • RDR-Pipe micro-architecture • Regular Distributed Register micro-architecture with interconnect pipelining • Synthesis flow and algorithms • MCAS-Pipe: automatic interconnect pipelining and sharing • Experimental results • Conclusions

  8. Regular Distributed Register Micro-Architecture • Distribute registers to each "island" • Choose the island size such that local computation and communication in each island can be done in a single cycle • Use register banks: registers in each island are partitioned into k banks for 1-cycle, 2-cycle, …, k-cycle interconnect communication in each island • [Figure: grid of islands of width Wi and height Hi connected by the global interconnect; each island holds a Local Computational Cluster (LCC, e.g. MUL, ALU, MUX), an FSM, and a register file with 1-cycle, 2-cycle, …, k-cycle banks]

  9. Wiring Overhead in RDR Designs • Data transfers r1→r3 and r2→r4 are overlapped • Two dedicated global wires are needed • [Figure: schedule over cycles 1 to 5 for ALU1 (+) and MUL1 (*), with sender and receiver registers r1 to r4 connected by interconnects with a delay of 2 cycles]

  10. Architectural Solution: RDR-Pipe • Keep the intra-island structures • Inter-island pipeline register station (PRS) for global communications • PRS performs autonomous store-and-forward • Synchronous design • No global control signal needed for PRS • [Figure: grid of islands (LCC, FSM, Reg. File) with Pipeline Register Stations (PRS) on the horizontal (H) and vertical (V) channels between islands]

  11. Reducing Wiring Overhead in RDR-Pipe • Data transfers are pipelined • One wire with a pipeline register is enough • [Figure: schedule over cycles 1 to 5 for ALU1 (+) and MUL1 (*), with a 2-cycle communication passing through a pipeline register between the sender and receiver registers]

  12. Synthesis Flow: MCAS-Pipe System • Global interconnect sharing • After scheduling and functional unit binding • Before register and port binding • Enables multiple data communications to share a physical link (a wire with pipeline registers) • Advantages over MCAS • Expected to reduce global wiring demand • No multicycle path constraint needed • [Flow: C / VHDL → CDFG generation → CDFG → Resource allocation & functional unit binding → ICG → Scheduling-driven placement → Locations → Placement-driven rescheduling & rebinding → Global interconnect sharing → Register and port binding → Datapath & FSM generation → RTL VHDL & floorplan constraints]

  13. Global Interconnect Sharing • Conflicted data transfers: two physical links are needed to support the concurrent data transfers • Compatible data transfers: now, two producer registers can be merged, since their lifetimes become compatible • Only one physical link is required to support the scheduled data transfers • [Figure: schedules over cycles 1 to 7 for two transfers (pe to ce, pg to cg) from island A to island B with D = 2, showing sender, pipeline, and receiver registers]

  14. Global Pipelined Interconnect Minimization • Definitions • Data links: pipelined global interconnects • Channel: set of data links between two islands • Width of a channel: number of its data links • Data transfer: movement of data from a producer to a consumer • Architectural assumption • Channels cannot share interconnects • Theorem • Global pipelined interconnects are minimized if and only if the width of every channel is minimized

  15. Transfer Scheduling for a Single Channel • A decision problem formulation • Given: • A channel (A, B) containing m data links • A data transfer set {e | pe ∈ A and ce ∈ B}, where each transfer e is associated with an arrival time T(pe)+1, a deadline T(ce)-D(A, B), and unit effective occupancy time • Fact: for every time slot, at most one transfer can be issued on a data link • Objective: to find a feasible transfer schedule on these data links • Transfer scheduling is solvable in polynomial time • A special real-time scheduling problem [J. Blazewicz, 1979] • Binary search for the minimum feasible channel width m • For each width, apply Earliest-Deadline-First (EDF) scheduling: O(n log n) • Overall time complexity: O(n log² n)
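The procedure on this slide (EDF feasibility check inside a binary search over the channel width) can be sketched as follows. This is an illustrative reconstruction and not the authors' implementation: the function names, the integer time-slot model, and the sample data are assumptions. Only the arrival time T(pe)+1, the deadline T(ce)-D(A, B), and the unit occupancy come from the slide.

```python
import heapq


def transfer_window(t_producer, t_consumer, delay):
    """Issue window of a transfer: arrival T(pe)+1, deadline T(ce)-D(A, B)."""
    return (t_producer + 1, t_consumer - delay)


def edf_feasible(transfers, m):
    """Try to issue every (arrival, deadline) transfer on m data links.

    Each link issues at most one transfer per time slot. Returns a list of
    ((arrival, deadline), slot, link) on success, or None if infeasible.
    """
    if not transfers:
        return []
    if m <= 0:
        return None
    pending = sorted(transfers)          # ordered by arrival time
    ready = []                           # min-heap keyed by deadline
    schedule = []
    i, n = 0, len(pending)
    t = pending[0][0]
    while i < n or ready:
        if not ready:
            t = max(t, pending[i][0])    # skip idle slots
        while i < n and pending[i][0] <= t:
            arrival, deadline = pending[i]
            heapq.heappush(ready, (deadline, arrival))
            i += 1
        for link in range(m):            # issue up to m transfers in slot t
            if not ready:
                break
            deadline, arrival = heapq.heappop(ready)
            if deadline < t:
                return None              # most urgent transfer is already late
            schedule.append(((arrival, deadline), t, link))
        t += 1
    return schedule


def min_channel_width(transfers):
    """Binary search for the smallest m that admits a feasible EDF schedule."""
    if not transfers:
        return 0
    if any(deadline < arrival for arrival, deadline in transfers):
        raise ValueError("a transfer has an empty issue window")
    lo, hi = 1, len(transfers)           # one link per transfer always works
    while lo < hi:
        mid = (lo + hi) // 2
        if edf_feasible(transfers, mid) is not None:
            hi = mid
        else:
            lo = mid + 1
    return lo


# Hypothetical transfer windows (not taken from the slides).
example = [(1, 2), (1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
print(min_channel_width(example))  # 2
```

Each feasibility check costs one sort plus heap operations, which matches the O(n log n) quoted on the slide, and the binary search repeats it O(log n) times.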

  16. EDF-Based Transfer Scheduling Example • Ordered by left edge: failed for 2 data links! • Ordered by Earliest-Deadline-First: successfully scheduled onto 2 data links • [Figure: transfers 1 to 6 placed on Data Link 1 and Data Link 2 over time slots under each ordering]

  17. Outline • Motivation • Our contributions • RDR-Pipe micro-architecture • Regular Distributed Register micro-architecture with interconnect pipelining • Synthesis flow and algorithms • MCAS-Pipe: automatic interconnect pipelining and sharing • Experimental results • Conclusions

  18. Experiment Settings • Inputs: C / VHDL, uArch. spec., target clock period • Common front end: CDFG generation → Functional unit allocation & binding • Conventional flow: conventional scheduling • MCAS flow: scheduling-driven placement → placement-driven rebinding & rescheduling • MCAS-Pipe flow: MCAS flow followed by global interconnect sharing • Common back end: register and port binding → datapath & control generation • Outputs: RTL VHDL files (for all flows); floorplan constraints (for MCAS and MCAS-Pipe); multicycle path constraints (for MCAS only) • Implementation: Altera QuartusII + Stratix

  19. Experimental Results: Register and LE Usage • Design environment: Altera QuartusII, Stratix EP1S40 • MCAS vs. Conventional flow: • Uses more registers and logic elements (LE) • MCAS-Pipe vs. MCAS: • Slightly more registers, and comparable logic element cost

  20. Experimental Results: Performance • Design environment: Altera QuartusII, Stratix EP1S40 • MCAS vs. Conventional flow: • 36% reduction in clock period and 30% in total latency • MCAS-Pipe vs. MCAS: • Comparable design performance (4% better) • [Charts: total latency; clock period]

  21. Interconnect Structure of Altera's Stratix • Global (horizontal): H24, H8, H4 • Global (vertical): V16, V8, V4 • Local: LL, LO

  22. Experimental Results: Wirelength • Wire types • LL, LO: local wires; H4, V4, H8, V8: short global wires • V16, H24: long global wires • MCAS-Pipe vs. MCAS: • 28.8% reduction in long global wires, 19.3% reduction in total wirelength

  23. Conclusions • High-level automatic on-chip interconnect pipelining • RDR-Pipe: extension of the RDR micro-architecture • Micro-architecture supporting interconnect pipelining • MCAS-Pipe: enhancement of the MCAS synthesis system • Adds a novel global interconnect sharing algorithm to effectively reduce the global wiring • Experimental results • Matches or exceeds the RDR-based approach in performance (4% better than MCAS) • Reduces wiring demand: 28.8% fewer long global wires and 19.3% less total wirelength than MCAS

  24. Thank you
