
Wei Zhang † , Li Shang ‡ and Niraj K. Jha †

NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture. Wei Zhang † , Li Shang ‡ and Niraj K. Jha † Dept. of Electrical Engineering Princeton University †





Presentation Transcript


  1. NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture • Wei Zhang†, Li Shang‡ and Niraj K. Jha† • †Dept. of Electrical Engineering, Princeton University • ‡Dept. of Electrical and Computer Engineering, Queen’s University

  2. Outline • Temporal Logic Folding • Background on NRAMs • Overview for hybrid NAnoTUbe/CMOS REconfigurable architecture (NATURE) (DAC 2006) • NanoMap: Design Optimization Flow • Experimental Results • Conclusions

  3. Temporal Logic Folding • Basic idea: use run-time reconfiguration to realize different functions in the same resource every few cycles • [Figure: LUT 1, LUT 2 and LUT 3 folded onto a single LUT whose configurations are held in memory (MEM)] • Example functions: i = abc’, l = (i’ + e’ + f’)h’, OUT = d’g’ + l
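The folding idea on this slide can be illustrated with a short Python sketch. It assumes the three expressions are evaluated one per folding cycle on a single physical LUT, with flip-flops holding the inputs and intermediate values; the names `folded_out`, `flat_out` and `regs` are illustrative and do not come from the slides, and the term written "I’" in the transcript is read as i’.

```python
from itertools import product

def folded_out(a, b, c, d, e, f, g, h):
    """Evaluate OUT on one LUT that is reconfigured in each of three cycles."""
    # Flip-flops hold primary inputs and intermediate results between cycles.
    regs = dict(a=a, b=b, c=c, d=d, e=e, f=f, g=g, h=h)
    # One (output, inputs, function) configuration is loaded per folding cycle.
    schedule = [
        ("i", ("a", "b", "c"), lambda a, b, c: a and b and not c),
        ("l", ("i", "e", "f", "h"),
         lambda i, e, f, h: (not i or not e or not f) and not h),
        ("OUT", ("d", "g", "l"), lambda d, g, l: (not d and not g) or l),
    ]
    for out, ins, fn in schedule:
        regs[out] = bool(fn(*(regs[name] for name in ins)))
    return regs["OUT"]

def flat_out(a, b, c, d, e, f, g, h):
    """Reference: the same logic spread over three separate LUTs."""
    i = a and b and not c
    l = (not i or not e or not f) and not h
    return bool((not d and not g) or l)

def equivalent():
    """Check the folded and unfolded versions on all 256 input vectors."""
    return all(folded_out(*v) == flat_out(*v)
               for v in product([False, True], repeat=8))
```

The folded version uses one LUT for three cycles instead of three LUTs for one pass, which is the area-delay trade-off the later slides explore.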

  4. Overview of NATURE • Distributed non-volatile nanotube RAMs (NRAMs): main storage for reconfiguration bits • Fine-grain reconfiguration (even cycle-by-cycle) and logic folding • Area-delay trade-off flexibility • More than an order of magnitude increase in logic density • More than an order of magnitude reduction in area-time product • Comparisons assume NRAMs and CMOS logic implemented in the same technology • Non-volatility: useful in low-power and secure processing • [Figure: NATURE at the center, linked to CMOS-compatible fabrication, NRAM-based run-time reconfiguration, temporal logic folding, logic density and design flexibility]

  5. Overview of NATURE (Contd.) • Challenges in nano-circuits/architectures • Many programmable nanofabrics proposed: Nanowire PLA (DeHon, 2004), CMOL (Strukov, 2005), etc. • Lack of a mature fabrication process • Fabrication defects and run-time failures (between 1% and 10%) • Regular, reconfigurable architectures, such as an FPGA, favored • Facilitate fabrication • Fault tolerance through reconfiguration • NATURE: can be fabricated using a CMOS-compatible fabrication process

  6. NRAM™ by Nantero • Non-volatile nanotube random-access memory (NRAM) • Whether the nanotube is mechanically bent or not determines the bistable on/off state • Same/opposite voltage applied to change the state • CMOS-compatible fabrication process • 10 Gbit NRAMs already fabricated: ready to be commercialized in the near future • Source: http://www.nantero.com/nram.html

  7. NRAMs • Properties of NRAMs • Non-volatile • Similar speed to SRAM • Similar density to DRAM • Chemically and mechanically stable • NATURE not tied to NRAMs • Phase change RAM • Magnetoresistive RAM • Ferroelectric RAM

  8. Architecture of NATURE • Island-style logic blocks (LBs) connected by various levels of interconnects • An LB contains a super macroblock (SMB) and a local switch matrix

  9. Architecture of a Super Macroblock (SMB) • n1 macroblocks (MBs) make up an SMB: here n1 = 4

  10. Architecture of a Macroblock (MB) • n2 logic elements (LEs) make up an MB: here n2 = 4

  11. Logic Element (Basic Configuration) • An LE implements a computation and contains: • An m-input look-up table (LUT) • l flip-flops • Input to flip-flop selected between LUT output and a primary input

  12. Folding Levels • Logic folding at different levels of granularity, providing flexibility to perform area-delay trade-offs • Level-p folding: LE reconfiguration after the execution of p LUT computations • Reconfiguration time: 160 ps • A larger folding level typically decreases delay and increases area • [Figure: (a) level-1 folding, (b) level-2 folding]
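A rough way to see why a larger folding level tends to reduce delay is to count reconfigurations. The sketch below assumes a simplified delay model that is not stated on the slides: a folding cycle costs p LUT delays plus one 160 ps reconfiguration, and interconnect delay is ignored. The parameter `lut_delay_ps` is a hypothetical per-LUT delay chosen only for illustration.

```python
import math

RECONFIG_PS = 160  # reconfiguration time quoted on the slide

def folding_stages(logic_depth, level):
    """Number of folding stages needed to cover a given logic depth."""
    return math.ceil(logic_depth / level)

def folded_delay_ps(logic_depth, level, lut_delay_ps):
    """Total delay under the assumed model: stages * (p LUT delays + reconfig)."""
    cycle_ps = level * lut_delay_ps + RECONFIG_PS
    return folding_stages(logic_depth, level) * cycle_ps
```

With a depth-8 circuit and an assumed 400 ps LUT delay, level-1 folding pays for eight reconfigurations and level-2 folding for four, so the second is faster under this model, at the cost of keeping more LUTs active per cycle.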

  13. Design Optimization Flow: NanoMap • Optimize and implement design on NATURE • Integrate temporal logic folding • Choose a proper folding level • Use force-directed scheduling (FDS) technique to balance resource usage across folding cycles • Input design specified in register-transfer level (RTL) and/or gate-level VHDL

  14. Motivational Example • Different planes should have the same number of folding stages to guarantee global synchronization • Key issue: how to achieve the optimization objective • Choose an appropriate folding level • Assign the logic to folding stages • [Figure labels: plane, logic in plane, level-1 register, level-2 register, folding stage, folding cycle, plane cycle]

  15. Motivational Example (Contd.) • Example optimization objective • Minimize circuit delay under an area constraint of 32 LEs • Assume each LE contains one LUT and two flip-flops: 32 LEs provide 32 LUTs and 64 flip-flops • [Figure annotations, in order: 8 LUTs, logic depth 4, 50 LUTs, 14 flip-flops, plane depth 9, 38 LUTs, logic depth 7]

  16. Iterative Design Flow • Start with an initial guess for the folding level and iteratively refine it • A larger folding level gives better circuit delay, but at a larger area cost • Initial #folding stages: • Initial folding levels: • Partition RTL modules into a series of connected LUT clusters • Logic depth of each cluster at most equal to the folding level • Significantly speeds up the mapping procedure

  17. Iterative Design Flow (Contd.) • Cluster size must not exceed the area constraint • [Figure: with level-5 folding a cluster needs 34 LUTs > 32 LUTs, so the flow moves to level-4 folding]

  18. Solution for the Example • Three folding stages using level-4 folding • 32 LEs required for mapping the RTL circuit; area constraint satisfied • Circuit delay = 3 * folding cycle delay
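The numbers in this example can be retraced with a small sketch. The formulas for the initial guess did not survive in the transcript, so the ones used here (stages = ceil(LUTs / available LUTs), level = ceil(plane depth / stages)) are an assumption, chosen because they reproduce the level-5 starting point shown on slide 17. The sketch also reads the figure annotations as one plane with 50 LUTs and depth 9, and `largest_cluster` is a hypothetical callback standing in for the RTL module partitioning step.

```python
import math

def initial_guess(max_plane_luts, max_plane_depth, available_luts):
    """Assumed initial guess: fewest stages that fit the area, then the level."""
    stages = math.ceil(max_plane_luts / available_luts)
    level = math.ceil(max_plane_depth / stages)
    return stages, level

def refine(level, plane_depth, available_luts, largest_cluster):
    """Lower the folding level until the largest LUT cluster fits the area.

    largest_cluster(level) is a hypothetical stand-in that returns the LUT
    count of the largest cluster produced when partitioning at that level.
    """
    while level > 1 and largest_cluster(level) > available_luts:
        level -= 1
    return level, math.ceil(plane_depth / level)

# Cluster sizes taken from slides 17 and 18: 34 LUTs at level 5, 32 at level 4.
cluster_sizes = {5: 34, 4: 32}
stages, level = initial_guess(50, 9, 32)
final_level, final_stages = refine(level, 9, 32, cluster_sizes.__getitem__)
```

Under these assumptions the guess starts at two stages with level-5 folding and settles on level-4 folding with three stages, matching the circuit delay of 3 * folding cycle delay stated above.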

  19. NanoMap: Flow Diagram • [Figure: 16-step flow diagram. Stage labels: logic mapping, temporal clustering, temporal placement, routing. Inputs: input network, optimization objective, user constraint, circuit parameter search, module library. Steps: folding level computation; RTL module partition; schedule each LUT/LUT cluster using FDS; map each LUT/LUT cluster to SMBs; fast placement using modified VPR placer; delay estimation; final placement using modified VPR placer; final routing using VPR router. Decisions: perform logic folding? satisfy area constraints? refine placement? placement routable? satisfy delay constraints? Output: reconfiguration bits]

  20. Force-Directed Scheduling • Perform FDS on RTL modules partitioned into LUTs/LUT clusters • Iteratively schedule each LUT/LUT cluster to minimize overall resource usage • Model resource usage as a force: F = Kx • K: distribution graphs (DGs) that describe the probability of resource usage • Aim of FDS: minimize force, indicating minimum increase in resource usage • LE usage depends on LUT computations and register storage operations: two DGs needed
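A minimal sketch of the force computation may help here. It follows the classic force-directed scheduling formulation, in which x is the change in a node's probability of occupying a stage, and it simplifies heavily: only the LUT-computation DG is modeled (not the register-storage DG), nodes are treated as independent so no predecessor or successor forces are included, and each node's time frame is given directly as a `(earliest, latest)` stage pair. All function names are illustrative, not taken from NanoMap.

```python
def distribution_graph(frames, n_stages):
    """Expected LUT usage per folding stage, assuming uniform probabilities."""
    dg = [0.0] * n_stages
    for lo, hi in frames.values():
        p = 1.0 / (hi - lo + 1)
        for s in range(lo, hi + 1):
            dg[s] += p
    return dg

def self_force(dg, frame, stage):
    """F = sum over the frame of DG(s) * x(s), x = change in probability."""
    lo, hi = frame
    p = 1.0 / (hi - lo + 1)
    return sum(dg[s] * ((1.0 if s == stage else 0.0) - p)
               for s in range(lo, hi + 1))

def schedule(frames, n_stages):
    """Repeatedly fix the (node, stage) pair with the smallest self-force."""
    frames = dict(frames)
    while any(lo != hi for lo, hi in frames.values()):
        dg = distribution_graph(frames, n_stages)
        _, name, stage = min(
            (self_force(dg, fr, s), name, s)
            for name, fr in frames.items() if fr[0] != fr[1]
            for s in range(fr[0], fr[1] + 1))
        frames[name] = (stage, stage)
    return {name: lo for name, (lo, _) in frames.items()}
```

For example, with two nodes fixed in stage 0 and one node free to go in stage 0 or 1, the free node has a negative force toward stage 1 and is placed there, which evens out LUT usage across the two folding stages.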

  21. Temporal Clustering • For each folding stage, a constructive algorithm used to assign LUTs to LEs and pack LEs into MBs and SMBs • Unpacked LUT with a maximal number of inputs selected as initial seed • New LUTs with high attractions to the seed selected and assigned to the SMB • Attractions depend on timing criticality and input pin sharing • Considers attractions across all the folding cycles
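The seed-and-attract packing described above can be sketched as a greedy loop. This is only an illustration for a single folding stage: the slide says attractions are considered across all folding cycles, which is not modeled here, and the attraction formula (a weighted sum of a criticality score and the number of shared input nets, with weight `alpha`) is an assumption rather than NanoMap's actual cost function.

```python
def cluster_stage(luts, capacity, criticality, alpha=0.5):
    """Greedily pack LUTs into clusters of at most `capacity` LUTs.

    luts:        dict mapping LUT name -> set of input net names
    criticality: dict mapping LUT name -> timing criticality (assumed 0..1)
    """
    unpacked = set(luts)
    clusters = []
    while unpacked:
        # Seed: the unpacked LUT with the most inputs (ties broken by name).
        seed = max(sorted(unpacked), key=lambda n: len(luts[n]))
        unpacked.remove(seed)
        cluster, nets = [seed], set(luts[seed])
        while unpacked and len(cluster) < capacity:
            # Assumed attraction: criticality plus input pins shared with cluster.
            def attraction(n):
                shared = len(luts[n] & nets)
                return alpha * criticality.get(n, 0.0) + (1 - alpha) * shared
            best = max(sorted(unpacked), key=attraction)
            unpacked.remove(best)
            cluster.append(best)
            nets |= luts[best]
        clusters.append(cluster)
    return clusters
```

In a real flow the capacity would correspond to the LEs available in an SMB (16 in the experimental setup), and the result would feed the placement step.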

  22. Placement and Routing • VPR (U. Toronto) modified to perform placement and support temporal logic folding • Simulated annealing approach • Cost function computed across the folding stages • Routing using VPR router performed hierarchically, considering direct link, length-1, length-4 and global interconnects

  23. Experimental Setup • Instance of architecture: • 4 MBs in an SMB • 4 LEs in an MB • LEs contain a 4-input LUT and 2 flip-flops • Impact of fixing k at 16 vs. allowing a high enough k to show design trade-offs • Results based on 100 nm technology parameters to implement CMOS logic and NRAMs

  24. Experimental Results (Contd.) • [Figure: #LE * delay advantage under AT optimization, normalized to no-folding, for benchmarks ex1, ex2, FIR, c5315, Paulin, ASPP4 and Biquad; series: no folding, k enough, k = 16]

  25. Experimental Results (Contd.) • [Table: improvement under AT optimization for RTL benchmarks] • LE utilization around 100% • 50% reduced need for a deep interconnect hierarchy for level-1 folding vs. no-folding: indicates that trading interconnect area for NRAM area is advantageous

  26. Experimental Results (Contd.) • Flexibility in choosing the best folding level and performing area-delay trade-offs • Mapping results for typical optimizations using the Paulin benchmark as an example • [Table: typical optimizations]

  27. Conclusions • NATURE: A new high-performance run-time reconfigurable architecture • NanoMap: an integrated optimization design flow for NATURE • Introduction of NRAMs into the architecture enables cycle-by-cycle reconfiguration and logic folding: leading to significant logic density and area-time product advantages • Can be very useful for cost-conscious embedded systems and improvement of future FPGAs • Non-volatility: helpful in secure and low power processing
