Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

Presentation Transcript


  1. Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip Mohamed ABDELFATTAH Vaughn BETZ

  2. Outline 1 Why NoCs on FPGAs? 2 Hard/soft efficiency gap 3 Integrating hard NoCs with FPGA

  3. Outline 1 Why NoCs on FPGAs? Motivation Previous Work 2 Hard/soft efficiency gap 3 Integrating hard NoCs with FPGA

  4. 1. Why NoCs on FPGAs? Motivation • The FPGA fabric: logic blocks, switch blocks, and wires make up the interconnect

  5. 1. Why NoCs on FPGAs? Motivation • The FPGA fabric: logic blocks, switch blocks, and wires • Hard blocks: memory, multiplier, processor

  6. 1. Why NoCs on FPGAs? Motivation • Hard interfaces (DDR/PCIe ..): 1600 MHz • Hard blocks (memory, multiplier, processor): 800 MHz • Wires: 200 MHz • The interconnect is still the same

  7. 1. Why NoCs on FPGAs? Motivation • Bandwidth requirements for hard logic/interfaces (DDR3 PHY and controller, PCIe controller, Gigabit Ethernet) • Timing closure

  8. 1. Why NoCs on FPGAs? Motivation • Bandwidth requirements for hard logic/interfaces • Timing closure • High interconnect utilization: • Huge CAD problem • Slow compilation • Power/area utilization

  9. 1. Why NoCs on FPGAs? Motivation • Bandwidth requirements for hard logic/interfaces • Timing closure • High interconnect utilization: • Huge CAD problem • Slow compilation • Power/area utilization • Wire speed not scaling: • Delay is interconnect-dominated

  10. 1. Why NoCs on FPGAs? Motivation • Bandwidth requirements for hard logic/interfaces • Timing closure • High interconnect utilization: • Huge CAD problem • Slow compilation • Power/area utilization • Wire speed not scaling: • Delay is interconnect-dominated • Low-level interconnect hinders modularity: • Parallel compilation • Partial reconfiguration • Multi-chip interconnect

  11. Keep the "roads", but add "freeways" • Analogy: road grids of Los Angeles and Barcelona (source: Google Earth); the roads are the existing interconnect among logic clusters and hard blocks, the freeways are the NoC

  12. 1. Why NoCs on FPGAs? FPGA with NoC • NoC = routers + links overlaid on the fabric: a router forwards data packets over the links and moves data to the local interconnect at its endpoint • Bandwidth requirements for hard logic/interfaces • Timing closure • High interconnect utilization: • Huge CAD problem • Slow compilation • Power/area utilization • Wire speed not scaling: • Delay is interconnect-dominated • Low-level interconnect hinders modularity: • Parallel compilation • Partial reconfiguration • Multi-chip interconnect

  13. 1. Why NoCs on FPGAs? FPGA with NoC • Bandwidth requirements for hard logic/interfaces • Timing closure • High interconnect utilization: • Huge CAD problem • Slow compilation • Power/area utilization • Wire speed not scaling: • Delay is interconnect-dominated • Low-level interconnect hinders modularity: • Parallel compilation • Partial reconfiguration • Multi-chip interconnect • How the NoC helps: • High-bandwidth endpoints are known, so the NoC can be pre-designed to requirements • NoC links are "re-usable" • Latency-tolerant communication • The NoC abstraction favors modularity

  14. 1. Why NoCs on FPGAs? FPGA with NoC • Latency-tolerant communication • The NoC abstraction favors modularity

  15. 1. Why NoCs on FPGAs? Hard vs. Soft • Implementation options: • Soft logic (LUTs, ..) • Hard logic (unchangeable) • Mixed soft/hard • Tradeoff: a soft NoC maximizes configurability, a hard NoC maximizes efficiency → Investigate the hard vs. soft tradeoff for NoCs (area/delay)

  16. 1. Why NoCs on FPGAs? Previous Work • FPGA-tuned Soft NoCs: • LiPar (2005), NoCeM (2008), Connect (2012) • Hard NoCs: • Francis and Moore (2008): Exploring Hard and Soft Networks-on-Chip for FPGAs • Applications that leverage NoCs: • Chung et al. (2011): CoRAM: An In-Fabric Memory Architecture for FPGA-based Computing • Our Contributions: • Quantify area/performance gap of hard and soft NoCs • Investigate how this impacts NoC design (hard/soft) • Integrate hard NoC with FPGA fabric

  17. Outline 1 Why NoCs on FPGAs? 2 Hard/soft efficiency gap NoC Architecture Methodology Soft NoC Design Results: Area/Speed Efficiency Gap 3 Integrating hard NoCs with FPGA

  18. 2. Hard/Soft Efficiency Router Microarchitecture • NoC = routers + links • State-of-the-art router architecture from Stanford • The NoC community has excelled at building routers: we just use one • To meet FPGA bandwidth requirements: a high-performance router • A complex router includes a superset of the NoC components that may be used: more complete analysis • Split the router into 5 components

  19. 2. Hard/Soft Efficiency Router – 5 Components

  20. 2. Hard/Soft Efficiency Router – 5 Components • Input modules: multi-queue buffer = memory + control logic • Parameters: port width, buffer depth, number of VCs
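The input module above can be sketched behaviorally: one FIFO per virtual channel, each bounded by the buffer depth. This is an illustrative Python model only (class and method names are mine, not from the paper); the real component is hardware memory plus control logic.

```python
from collections import deque

class MultiQueueBuffer:
    """Behavioral sketch of a router input module: one FIFO per
    virtual channel (VC), each bounded by the buffer depth.
    Names and interface are illustrative, not the paper's."""

    def __init__(self, num_vcs, buffer_depth):
        self.depth = buffer_depth
        self.queues = [deque() for _ in range(num_vcs)]

    def enqueue(self, vc, flit):
        """Accept a flit into the given VC's queue; refuse (i.e.
        apply backpressure) when that queue is full."""
        if len(self.queues[vc]) >= self.depth:
            return False
        self.queues[vc].append(flit)
        return True

    def dequeue(self, vc):
        """Hand the oldest flit of the given VC to the crossbar."""
        return self.queues[vc].popleft() if self.queues[vc] else None
```

The three slide parameters map directly: port width is the flit size, buffer depth bounds each queue, and the VC count sets the number of queues.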

  21. 2. Hard/Soft Efficiency Router – 5 Components • Crossbar: multiplexers = logic + crowded interconnect • Parameters: port width, number of ports
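Functionally, the crossbar is just one multiplexer per output port; a minimal Python sketch (illustrative only) makes this concrete:

```python
def crossbar(inputs, select):
    """Functional sketch of a P-port crossbar: each output port is
    a multiplexer selecting one input port (None = output idle).
    In hardware each mux is port-width bits wide, which is what
    crowds the soft interconnect."""
    return [inputs[src] if src is not None else None for src in select]
```

For example, `crossbar(["a", "b", "c"], [2, None, 0])` routes input 2 to output 0 and input 0 to output 2, leaving output 1 idle.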

  22. 2. Hard/Soft Efficiency Router – 5 Components • Output modules: retiming registers + a little control logic • Parameters: port width, number of VCs

  23. 2. Hard/Soft Efficiency Router – 5 Components • Allocators: arbiters = logic + registers • Parameters: number of ports, number of VCs
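The arbiter at the heart of the VC and switch allocators can be sketched as a round-robin grant circuit (a common choice, sketched here in Python for illustration; the paper's exact allocator microarchitecture is not specified on this slide):

```python
class RoundRobinArbiter:
    """Sketch of an allocator building block: grant one of N
    competing requests, rotating priority so that every requester
    is eventually served. Illustrative only."""

    def __init__(self, num_requesters):
        self.n = num_requesters
        self.last_grant = self.n - 1  # so requester 0 wins first

    def grant(self, requests):
        """requests: list of bools, one per requester.
        Returns the granted index, or None if nobody requested."""
        for offset in range(1, self.n + 1):
            idx = (self.last_grant + offset) % self.n
            if requests[idx]:
                self.last_grant = idx
                return idx
        return None
```

The grant state (`last_grant`) is the register cost, and the priority rotation is the logic cost that the slide's "logic + registers" summary refers to.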

  24. 2. Hard/Soft Efficiency Design Space • 4 parameters: port width, number of ports, number of VCs, buffer depth • 5 components: input module, crossbar, output module, VC allocator, switch allocator
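The design space is simply the cross product of the four parameters, evaluated per component. A quick sketch (the example sweep values are illustrative, not the paper's exact sweep points):

```python
from itertools import product

# The four swept router parameters; example values only.
port_widths   = [16, 32, 64]
port_counts   = [3, 5, 7]
vc_counts     = [1, 2, 4]
buffer_depths = [5, 10, 20]

# Every combination is one router configuration; each is then
# measured per component (input module, crossbar, output module,
# VC allocator, switch allocator).
design_points = list(product(port_widths, port_counts,
                             vc_counts, buffer_depths))
print(len(design_points), "router configurations")
```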

  25. 2. Hard/Soft Efficiency Methodology • Post-routing FPGA (soft) area and delay • Post-synthesis ASIC (hard) area and delay • Both in TSMC 65 nm technology (Stratix III) • Results compared per router component • Verified against the previous FPGA:ASIC comparison by Kuon and Rose

  26. 2. Hard/Soft Efficiency 3 Options for Buffer on FPGA • Relatively small memories • Critical component in router design • 3 options on the FPGA: registers (one per LUT), LUTRAM (640-bit), block RAM (9 Kbit) • Compare the area of each implementation option

  27. 2. Hard/Soft Efficiency 3 Options for Buffer on FPGA [Chart: buffer area vs. depth at a width of 32 bits; register area jumps whenever another logic cluster is used]

  28. 2. Hard/Soft Efficiency 3 Options for Buffer on FPGA • Relatively small memories • 3 options on the FPGA: registers (one per LUT) at 0.77 Kbit/mm², LUTRAM (640-bit) at 23 Kbit/mm², block RAM (9 Kbit) at 142 Kbit/mm² • A BRAM that is only 16% utilized is still more area-efficient than fully used LUTRAM (valid for Stratix III) • LUTRAM could win for some points in other FPGAs → Use BRAM for the FPGA (soft) implementation
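The 16% break-even figure follows directly from the slide's densities, since a partially filled BRAM still occupies the whole block. A quick arithmetic check (densities taken from the slide; the helper function is mine):

```python
# Densities from the slide (Stratix III, 65 nm), in Kbit/mm^2,
# assuming the storage is fully used:
REGISTER_DENSITY = 0.77
LUTRAM_DENSITY   = 23.0
BRAM_DENSITY     = 142.0   # 9-Kbit block RAM

def effective_bram_density(utilization):
    """A partially filled BRAM occupies the whole block, so its
    effective density scales with utilization."""
    return BRAM_DENSITY * utilization

# Utilization above which a BRAM beats fully used LUTRAM:
break_even = LUTRAM_DENSITY / BRAM_DENSITY
print(f"BRAM beats LUTRAM above {break_even:.0%} utilization")
```

23 / 142 ≈ 0.16, matching the slide's claim that a 16%-utilized BRAM already matches fully used LUTRAM.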

  29. 2. Hard/Soft Efficiency Results – High Port Count • Hard/soft gaps of 60X – 170X and 24X – 94X at high port counts → High port count is inefficient in soft

  30. 2. Hard/Soft Efficiency Results – Width • The hard/soft gap shrinks as width grows (72X down to 26X – 17X) → Width scales better in soft than port count

  31. 2. Hard/Soft Efficiency Results – Deep Buffers • Deeper buffers simply fill up the BRAM → Buffer depth is free on FPGAs when using BRAM

  32. 2. Hard/Soft Efficiency Soft Router Design • Design recommendations based on FPGA silicon area, supported by delay measurements: • Use BRAM for the FPGA (soft) implementation • Buffer depth is free on FPGAs when using BRAM • High port count is inefficient in soft; width scales better

  33. 2. Hard/Soft Efficiency Results – Area [Chart: hard vs. soft area per router component, broken down into memory, logic, and registers]

  34. 2. Hard/Soft Efficiency Results – Delay

  35. Outline 1 Why NoCs on FPGAs? 2 Hard/soft efficiency gap 3 Integrating hard NoCs with FPGA Hard NoC + FPGA Wiring Conclusion Future Work

  36. 3. Hard NoC with FPGA What to harden? [Chart: per-component share of total area and of the critical path, annotated 40%, 50%, and 10%] • Results suggest hardening the crossbar and allocators → Mixed hard/soft implementation

  37. 3. Hard NoC with FPGA Mixed Implementation • For a typical router (5 ports, 32 bits wide, 2 VCs, 10 buffer words), mixed is not worth hardening • How to connect hard and soft? • How efficient is mixed/hard after doing that?

  38. 3. Hard NoC with FPGA Integrating a Hard Router • The hard router sits among the logic clusters • Same I/O mux structure as a logic block – 9X the area • Conventional FPGA interconnect between routers

  39. 3. Hard NoC with FPGA Integrating a Hard Router • The hard router runs at 730 MHz • Same I/O mux structure as a logic block – 9X the area • Conventional FPGA interconnect between routers

  40. 3. Hard NoC with FPGA Integrating a Hard Router • A mesh was assumed, but the soft interconnect can form any topology

  41. 3. Hard NoC with FPGA Integrating a Hard Router • A 64-node NoC on Stratix V provides 47 GB/s of peak bisection bandwidth • Very cheap: less than the cost of 3 soft nodes → Hard NoC + soft interconnect is very compelling
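The 47 GB/s figure is consistent with an 8×8 mesh of 32-bit links clocked at the 730 MHz router speed. The reconstruction below is an assumption on my part — the bisection link count and bidirectionality are inferred, not stated on this slide:

```python
# Reconstructed parameters for the 64-node mesh (assumed, see text):
mesh_side   = 8            # 8 x 8 = 64 routers
links_cut   = mesh_side    # bisection cuts one row/column of links
directions  = 2            # each cut link is bidirectional
width_bytes = 32 // 8      # 32-bit links
freq_ghz    = 0.730        # hard-router clock from slide 39

bisection_gb_per_s = links_cut * directions * width_bytes * freq_ghz
print(f"peak bisection bandwidth ~ {bisection_gb_per_s:.1f} GB/s")
```

8 × 2 × 4 B × 0.73 GHz ≈ 46.7 GB/s, which rounds to the quoted 47 GB/s.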

  42. 1 Why NoCs on FPGAs? • A big city needs freeways to handle traffic • Solve communication problems for a large/heterogeneous FPGA: • Timing closure – Interconnect scaling – Modular design 2 Hard/soft efficiency gap • A hard NoC is on average 30X smaller and 3.6X faster than soft • Crossbars and allocators are the worst – the input buffer is the best • An efficient soft NoC: • Uses BRAMs – Large width, low port count – Deep buffers 3 Integrating hard NoCs with FPGA • A mixed implementation does not make sense • Integrated a fully hard NoC with the FPGA fabric (FPGA interconnect used for NoC links) • 22X area improvement over soft • Reaches the maximum FPGA frequency (4.7X faster than soft) • 64-node NoC = 0.6% of total FPGA area (Stratix V)

  43. 3. Hard NoC with FPGA Future Work • Power analysis • More hardening: • Dedicated inter-router links (hard wires) • Clock domain crossing hardware • How do traffic hotspots (DDR/PCIe) influence NoC design? • Latency insensitive design methodology that uses NoC • CAD tool changes for a NoC-based FPGA

  44. Thank You! mohamed@eecg.utoronto.ca
