
Design for Performance Optimization of Virtex Devices

Mayo Clinic's SPPDG specializes in risk reduction, proof-of-concept work, prototypes, and pushing technology to its limits. The group's experience with Virtex FPGAs includes high-speed designs up to 80 Gbps. This presentation details challenges encountered with Virtex-II (V2) and Virtex-4 (V4) devices and emphasizes the value of expert consultation during the design phase. Power integrity, signal integrity analysis, and clocking receive particular attention, along with power delivery, clock recovery, core processing, RocketIO operation pitfalls, and power consumption.


Presentation Transcript


1. Design for Performance Optimization of Virtex Devices
   Steve Currie – 6/26/2006
   Mayo Clinic SPPDG
   507-538-5460
   currie.steven@mayo.edu

2. About the Mayo SPPDG (Special Purpose Processor Development Group)
   • Not generally a "product delivering" organization
     • Risk reduction efforts, proof-of-concept, prototypes, test vehicles
     • Evaluate emerging technologies
     • Push existing technology to the limits
   • Commonly-known strengths
     • Power integrity analysis (power delivery design/analysis)
     • Signal integrity analysis
     • High-speed design/test
       • e.g., DC to 80 Gbps logic

3. Experience with Virtex FPGAs
   • Done: 10 Gbps in V2 (XC2V6000 -6, BF957)
     • 16-bit LVDS busses @ 690 Mbps
     • Soft-SERDES implementation
     • Multiple clock domains
   • Doing: 50 Gbps in V4 (FX100 & FX140 in FF1517 package) – aggregate rates tallied in the sketch below
     • SPI-5 (16+1 RocketIO @ 3.125 Gbps)
     • Dual 50 Gbps "lite" interfaces: 8+8 RocketIO @ 6.25 Gbps
     • 400+ bit, ~200 MHz SRAM interface
       • DDR2, QDR2, 84+ Gbps
       • Using nearly all IO in all banks
       • Phase-shifted IO reaching the different memory modules
     • Heavy internal resource utilization (% TBD)
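The aggregate bandwidth figures above follow from simple lane-count arithmetic; a small Python sketch tallies them (lane counts and per-lane rates are taken from the slide, and the totals are raw line rates before any encoding overhead such as 8b/10b):

```python
# Raw aggregate throughput of the interfaces listed above.
# Lane counts and per-lane rates come from the slide; the totals
# ignore protocol/encoding overhead (e.g., 8b/10b).

interfaces = {
    "V2 LVDS bus (16 bits @ 690 Mbps)":           16 * 0.690,  # Gbps
    "SPI-5 (16+1 RocketIO @ 3.125 Gbps)":         17 * 3.125,
    "Dual 50G 'lite' (8+8 RocketIO @ 6.25 Gbps)": 16 * 6.25,
}

for name, gbps in interfaces.items():
    print(f"{name}: {gbps:.2f} Gbps raw")
# 16 x 690 Mbps = 11.04 Gbps raw, consistent with the "10 Gbps in V2"
# payload figure once framing/overhead is taken out.
```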

4. 19837 POE Prototype Two – front panel and top-down view [photo slide]

5. General Concerns – 1
   • % utilization of IO (SSO), core (timing/placement)
     • Higher-speed processing and throughput could require intense IO operation (either "more" or "each faster")
     • Complex core processing at high speed requires extensive pipelining, or perhaps duplicate processing functions – internal timing becomes challenging!
   • Jitter/clocking
     • High-speed clock sources, fanout (routing, buffering), multiplication, clean-up
     • Clock recovery circuits
   • SSO, power delivery
     • Aggressive decoupling competes with power supply stability
     • Rules of thumb break down as utilization % increases (see the sketch below)
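To see why the rules of thumb collapse, consider the usual starting point for decoupling design: a target PDN impedance Z_target = (Vdd × ripple) / ΔI, where the transient current ΔI scales with utilization. A minimal sketch, using illustrative numbers that are assumptions rather than figures from the talk:

```python
# Target-impedance estimate for a power delivery network (PDN).
# Classic rule: Z_target = (Vdd * allowed_ripple) / delta_I.
# All numbers below are illustrative assumptions for a heavily
# utilized FPGA core rail, not values from the presentation.

vdd     = 1.2    # V, core supply (V4-era core voltage)
ripple  = 0.05   # 5% allowed ripple
delta_i = 10.0   # A, assumed worst-case transient current step

z_target = (vdd * ripple) / delta_i
print(f"Z_target = {z_target * 1000:.1f} mOhm")   # -> 6.0 mOhm

# Doubling utilization roughly doubles delta_I and halves Z_target,
# which is why simple per-pin decoupling rules stop working.
```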

6. General Concerns – 2
   • Power-on reset, initial conditions
     • Large current spikes coming out of configuration/"reset"
     • Defining initial conditions of "analog" elements
   • Large/wide bus termination
     • Discrete termination wastes less power than active termination, but at the cost of a large footprint (see the estimate below)
     • Competes with power delivery system components
     • Could move to buried resistors, but there lies another set of problems
   • "Secret" how-tos and inconsistent documentation
     • Many details of RocketIO operation were [mis]documented in the various documents available
     • We utilized an existing Titanium Support contract to get the "truth"
   • 3rd-party IP often needed to push basic capability to acceptable performance
     • Attempting to saturate gigabit Ethernet with the Xilinx "included" TCP/IP stack vs. the pay-for option
   • Appnotes and their boundaries (assumptions/limitations) should be thoroughly understood before being used – don't expect "cut and paste" simplicity
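A rough estimate shows that on a wide bus it is footprint, more than power, that makes discrete termination painful. The voltage and resistor values below are illustrative assumptions; the 400-line count echoes the SRAM interface on slide 3:

```python
# Rough per-line dissipation for parallel (discrete) termination,
# then scaled to a wide bus. Voltage/resistance values are
# illustrative assumptions, not figures from the presentation.

v_avg   = 0.5      # V, assumed average DC level across the resistor
r_term  = 50.0     # ohms, parallel termination
n_lines = 400      # slide 3 mentions a 400+ bit SRAM interface

p_line = v_avg**2 / r_term    # W per line
print(f"{p_line*1000:.1f} mW/line, "
      f"{p_line*n_lines:.1f} W for {n_lines} lines")
# -> 5.0 mW/line, 2.0 W total: modest power, but 400 resistor
#    footprints contend with the decoupling caps for space near the BGA.
```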

7. Our V2-Specific Challenges
   • Multiple clock domains inside the part
     • Location of global clock pins vs. DCMs, etc.
   • Unusable jitter with clock multipliers
     • Clean-up PLLs off-chip
   • LVDS busses near their speed limits
     • Needed soft-SERDES macro and precise clock-to-data alignment
   • Using a large % of the chip resources complicates timing
     • Hand-placement often required to make timing
   • Xilinx Titanium Support provided very valuable in-depth knowledge and, hence, solutions to some problems
     • Having consultation during the design phase is better than having them debug/patch after the problems exist

8. Our V4-Specific Challenges – 1
   • Core speed didn't scale up from V2 as other capabilities did
     • We were hoping for 400 MHz, which appears unlikely
     • Requires a "dual-parallel" data path at ½-rate inside, which increases the core usage
   • Package design is good, but power delivery recommendations don't suit complex designs
     • Evaluation boards don't follow these recommendations
   • SSO is still a problem, and the somewhat cumbersome SSO calculator is critical to making this work (a simplified model of that budget check follows below)
   • Thorough power-delivery system analysis (HFSS, SIwave) requires knowledge of the package construction (and on-chip/package decoupling), which is difficult to acquire (NDA, etc.)
   • Crosstalk analysis shows the need for painful routing of memory IO
   • RocketIO require a significant power filtering network for each transceiver (whether each transceiver is used or not), further complicating an already dense layout
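In spirit, the SSO check is per-bank bookkeeping: each I/O standard is allowed some number of simultaneously switching drivers per VCCO/GND pin pair, and the weighted usage across standards must stay within budget. A toy model of that check, assuming that structure (the limits and pin counts below are made-up placeholders, not published Xilinx numbers):

```python
# Simplified per-bank SSO budget check, in the spirit of the Xilinx
# SSO calculator. The per-standard limits and the driver counts
# below are placeholder assumptions, NOT published Xilinx numbers.

sso_limit_per_pwr_gnd_pair = {   # drivers allowed per VCCO/GND pair
    "LVDCI_25": 8,
    "SSTL18_I": 6,
}

bank = {
    "pwr_gnd_pairs": 4,
    "drivers": {"LVDCI_25": 20, "SSTL18_I": 10},
}

# Each standard consumes a fraction of the bank's budget; the
# fractions must sum to <= 1.0 for the bank to pass.
utilization = sum(
    count / (sso_limit_per_pwr_gnd_pair[std] * bank["pwr_gnd_pairs"])
    for std, count in bank["drivers"].items()
)
print(f"SSO budget used: {utilization:.0%}",
      "PASS" if utilization <= 1.0 else "FAIL")   # -> 104% FAIL
```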

9. Our V4-Specific Challenges – 2
   • Power consumption
     • RocketIO were planned to be 10+ Gbps, hence they consume more power than if they had been designed for the current-errata maximum: 3.125 Gbps (and the "Step 1" maximum: 6.25 Gbps)
     • Initial estimates showed 35 Watts per FPGA for our desired capability – now a cooling challenge as well
   • Power delivery system
     • No room for discrete termination AND decoupling, thus active termination (even with the power cost) is preferred over the problems with buried resistors (cost/debug)
   • RocketIO usage requires 8b/10b per the latest errata
     • Effectively reduces throughput capacity by 20% (see the arithmetic below)
     • Eliminates SPI-5 and 8-bit, 50 Gbps interfaces
     • Run-length problem, but 8b/10b is also DC-free: overkill
     • Could consider a custom encoding scheme, but the 8b/10b is a "free" hard macro in the RocketIO (fast, no extra resources used)
   • Limited channel-bonding capabilities
     • Must do channel bonding in the core for unsupported interface protocols (increased power, core usage)
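The 8b/10b penalty, and why it rules out the 50 Gbps targets at the errata-limited lane rates, is straightforward arithmetic:

```python
# Effective payload after 8b/10b encoding (8 data bits per 10 line bits).

def payload_gbps(lanes, lane_rate_gbps, encoded_8b10b=True):
    raw = lanes * lane_rate_gbps
    return raw * 0.8 if encoded_8b10b else raw

# 16 data lanes at the current-errata maximum rate:
print(payload_gbps(16, 3.125))   # 40.0 Gbps -- short of the 50 Gbps goal
# 8 lanes, even at the "Step 1" maximum rate:
print(payload_gbps(8, 6.25))     # 40.0 Gbps -- still short
```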

10. Our V5-Specific Concerns
   • NDA-protected conversations have made us fond of the V5 roadmap, but there are concerns
     • Schedule and feature-set reliability
       • V4 slipped/changed repeatedly… what to expect from V5?
     • Implied SEE sensitivity with the addition of configuration frame ECC (post-configuration checking) – a 65 nm problem?

11. Problem Summary
   • Signal integrity analysis
     • Lots of SSO, dense routing, crosstalk (non-LVDS data paths)
     • RocketIO link analysis
     • All require IO SPICE models from Xilinx, which must first be validated against hardware
     • Also require interconnect models (transmission lines)
   • Power analysis/integrity
     • Power supply selection must tie in with decoupling design (a toy impedance sweep follows this slide)
     • Very low-impedance power delivery helps with SSO, but is problematic for power supplies (extensive analysis of package, board, decoupling, and supply required)
   • Internal timing constraints and problems
     • Need for "hands-on" place/route inside FPGAs to get peak performance
     • Design consultation might be appropriate (we used Xilinx Titanium Service)
   • Architecture design for lowest clock jitter
     • Clock circuitry differs across V2, V2P, V4, and V5
     • Need "inside" knowledge: design consultation, again
   • ChipScope is a good internal debugging tool
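As a flavor of why decoupling design and supply selection must be analyzed together, here is a toy sweep of PDN impedance versus frequency for N identical capacitors, each modeled as a series R-L-C; all component values are illustrative assumptions:

```python
import math

# |Z(f)| of a toy decoupling network: n identical capacitors in
# parallel, each modeled as a series R-L-C (ESR/ESL). All values
# are illustrative assumptions, not design recommendations.

def cap_z(f, c, esr, esl):
    w = 2 * math.pi * f
    return complex(esr, w * esl - 1 / (w * c))

def pdn_z(f, n, c=0.1e-6, esr=0.01, esl=0.5e-9):
    return abs(cap_z(f, c, esr, esl)) / n   # n equal caps in parallel

for f in (1e6, 10e6, 100e6):
    print(f"{f/1e6:>5.0f} MHz: {pdn_z(f, n=100)*1000:6.2f} mOhm")

# Impedance dips near the capacitors' series resonance and rises
# again above it, which is why no single capacitor value can hold a
# milliohm-class target across the whole band -- and why the package,
# board, and supply have to be analyzed as one system.
```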

12. One Specific Problem: High-Speed Bus Clock/Data Alignment
   • Problem: A multi-bit data bus and clock are captured at the target FPGA with imperfect alignment
   • A V2 solution: XAPP268
     • Assumes all clock and data signals that make up a bus arrive "close" in phase, and uses DCM delay to sample the clock with itself to find the "middle" of the clock for capture alignment
     • Clever, but isn't finding the center of the data window
     • Requires a global clock input and DCM for the xapp to work as intended
       • A global clock input is needed per bus – not so easy
   • A more data-centric solution (sketched below)
     • Measure goodness of the DLL setting by checking bit error rates on the data bits
     • Identifies the best clock-to-data alignment based on an "averaged" data window
     • Uses upstream data generation and local data compare
     • More core resources used, but supports very high speeds, large/small busses, worse-matched routing, etc.
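Behaviorally, the data-centric approach is a window search: step the clock-to-data delay across its range, count bit errors against known upstream data at each setting, then park in the middle of the widest error-free run. A Python model of that search (the tap count and the error-measurement hook are illustrative; in hardware this logic runs in the fabric against DCM/DLL phase steps):

```python
# Behavioral model of BER-driven clock/data alignment: sweep the
# delay taps, record error counts against known data, and pick the
# center of the widest error-free run. Tap count and the
# error-measurement hook are illustrative assumptions.

def best_tap(measure_errors, n_taps=64):
    """measure_errors(tap) -> bit errors observed at that delay tap."""
    errors = [measure_errors(t) for t in range(n_taps)]

    best_start, best_len = 0, 0
    start = None
    for t, e in enumerate(errors + [1]):      # sentinel closes last run
        if e == 0 and start is None:
            start = t                         # error-free run begins
        elif e != 0 and start is not None:
            if t - start > best_len:          # run ends; keep if widest
                best_start, best_len = start, t - start
            start = None

    if best_len == 0:
        raise RuntimeError("no error-free window found")
    return best_start + best_len // 2         # center of the data eye

# Example: pretend taps 20..44 sample cleanly, everything else errors.
print(best_tap(lambda t: 0 if 20 <= t <= 44 else 5))   # -> 32
```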

13. High-Speed Bus Clock-to-Data Alignment
   • Problem: A multi-bit data bus and clock are captured at the target FPGA with imperfect alignment
   • V4 solutions
     • IDELAY and ISERDES
       • Per-bit clock-to-data alignment capability and hard SERDES macro
     • New clock resources compared to V2 (PMCD)
   • V5 solutions
     • IDELAY + ODELAY
     • New/changed clock resources compared to V4

14. Summary
   • Rules of thumb don't cut it
     • Analysis and design are required to provide the kind/quantity of clean power needed for large, heavily utilized devices at high speed
     • Signal integrity analysis is required for dense routing and fast signals
   • Devices change significantly from family to family
     • Unless you want to be an expert with each, hire design consultation
     • What was once hard may become easy, but it also means that what once worked might not any longer (design reuse)
   • Data paths get more complicated with speed
     • Must manage clock/data alignment
     • Framing is required to properly align busses
     • Proper signal-integrity methodology becomes mandatory
   • Power consumption is significant, but power must be clean as well
     • Requires simultaneous analysis of the package, board, and other power-delivery system components
     • RocketIO require an extensive power filter network
   • Clock architecture
     • RocketIO: recovered clocks, dedicated MGTCLK inputs (what frequency is best for the PLLs?) and the problems (e.g., jitter) with each require intimate knowledge of the FPGA architecture
     • General: must understand on-chip clock resources and their limitations ("geographic" restrictions, jitter requirements OR jitter generated)
   • Communications protocol implementations are somewhat limited
     • Hard macros cater to a select set of protocols
     • Intrinsic performance limitations make some implementations improbable (e.g., SPI-5)
