The first fully programmable system solution designed specifically for intellectual property.

Presentation Transcript

  1. Redefining the FPGA: The first fully programmable system solution designed specifically for intellectual property.

  2. Agenda • Technology Roadmap • Redefining the FPGA • Architecture Overview • The CLB Tile, Vector Based Interconnect, Internal Bus Support, SelectRAM+, Clocking & DLLs, SelectI/O, Thermal Management & The SelectMap Interface • Software & Cores Support • Summary - A System Level Solution

  3. Technology Roadmap (Density/Performance vs. Year: 1995, 1996, 1997, 1998, 1999) • XC4000E: 2LM - 0.5µm (XC4025E) • XC4000EX: 2LM - 0.5µm (XC4036EX) • XC4000XL: 3LM - 0.35µm (XC4085XL) • XC4000XV: 3LM - 0.25µm (XC40250XV) • Virtex: 5LM - 0.25µm (7LM - 0.18µm), 1 Million+ System Gates with High Performance System Solution

  4. Redefining the FPGA • "Virtex moves FPGAs from glue to system component" • [Figure labels: SRAM Cache (Mbytes), 133MHz SDRAM, Low Voltage CPU, Chip 1, Chip 2, 1x CLK, 2x CLK, High Speed System Backplane; I/O standards LVCMOS, SSTL3, LVTTL, GTL+; callouts 1 to 4]

  5. Redefining the FPGA: Value Extends Beyond the Socket • 1: System Integration • 2: System Memory • 3: System Timing • 4: System Interfaces

  6. Redefining the FPGA: System Integration (1) • Extremely Dense • 50,000 to 1,000,000 System Gates • 1,728 to 27,648 Logic Cells • Advanced Process Technology Allows for Almost 10x the Density of Today’s FPGAs • High Performance Routing • Predictable Routing Delays Produce a Core Friendly Architecture With Fast Place & Route Times • [Figure: Vector Based Interconnect with 2ns delay labels]

  7. Redefining the FPGA: System Memory (2) • 200 MHz Distributed SelectRAM (Bytes) • 200 MHz Block SelectRAM (Kilobytes) • 200 MHz Access to External Memory (Megabytes) • [Figure: RAMB4_S4_S16. Port A pins: WEA, ENA, RSTA, CLKA, ADDRA[9:0], DIA[3:0], DOA[3:0]. Port B pins: WEB, ENB, RSTB, CLKB, ADDRB[7:0], DIB[15:0], DOB[15:0]]

  8. Redefining the FPGA: System Timing (3) • Multiplication, Division & Phase Generation • Zero-Delay Clock Distribution & System Synchronization • Route to Other Devices • [Figure: a 90 MHz CLK feeds DLLs that produce 45 MHz (Divide by 2) and 180 MHz (Multiply by 2). CLKDLL pins: CLKIN, CLKFB, RST, CLK0, CLK90, CLK180, CLK270, CLK2X, CLKDV, LOCKED]

  9. Redefining the FPGA: System Interfaces (4) • SelectI/O Allows Connection Directly to External Signals of Varied Voltages & Thresholds • Future Standards Can be Supported Without Having to Make Silicon Changes • [Figure labels: External Devices and Backplanes; voltages 5.0V, 3.3V, 2.5V, 1.8V; standards PCI, GTL, GTL+, SSTL, AGP, HSTL]

  10. Redefining the FPGA • 1: System Integration • Intellectual Property is Critical for High Density Design & Must Drop in Easily Without Penalty Across an Entire Family • 2: System Memory • Memory Bandwidth is Always Key • Size & Depth Requirements Vary Depending on the Application • 3: System Timing • Chip to Chip Performance Typically Limits System Speeds • Clock Skew is an Important Factor in High Performance Systems • 4: System Interfaces • Process Technology Leads to Mixed Voltage Systems • High Performance, Lower Power Signal Standards Have Emerged

  11. Redefining the FPGA: Design Reuse • [Figure labels: Designer #1, Designer #2, VHDL Design Environment, Verilog Design Environment, New Modules, CoreGen IP Modules, AllianceCore, LogiCore; cores DSP, FIFO, CPU, 133MHz SDRAM, 66MHz PCI, Giga-bit Ethernet; Virtex: 160 MHz I/O, 133 MHz Memory, 1 Million+ System Gates]

  12. Redefining the FPGA: The World’s First Fully Programmable System-Level Architecture • Extremely Dense • 50,000 to 1,000,000 System Gates • 1,728 to 27,648 Logic Cells • System Performance & Features • 160 MHz+ System Performance • Multiple DLLs & Block SelectRAM • Supports Multiple I/O Standards • Internal Performance & Features • 100 MHz+ at 3 to 4 Logic Levels • TBUFs & Distributed SelectRAM • Superior Intellectual Property Infrastructure - CoreGen & Web • Proven Software Flows for High Density & Performance - M1.5 • [Figure layers: IP, Software, System Building Blocks, Fast, Flexible I/Os, Segmented Routing, 4-Input LUT Architecture, Leading Edge Process Technology]

  13. Architecture Overview • [Figure: overview diagram with callouts 1 to 4, labeled The CLB Tile, Vector Based Interconnect (2ns), Block SelectRAM (RAMB4_S4_S16), Distributed SelectRAM, DLL (CLKDLL), SelectI/O (PCI, GTL, GTL+, AGP, SSTL, HSTL at 5.0V, 3.3V, 2.5V, 1.8V), SelectMAP Configuration, Thermal Management]

  14. The CLB Tile: System Integration (1) • Extremely Dense • 50,000 to 1,000,000 System Gates • 1,728 to 27,648 Logic Cells • Advanced Process Technology Allows for Almost 10x the Density of Today’s FPGAs • High Performance Routing • Predictable Routing Delays Produce a Core Friendly Architecture With Much Faster Place & Route Times • [Figure: Vector Based Interconnect with 2ns delay labels]

  15. The CLB Tile • CLB Tile is Composed of a Switch Matrix, Configurable Logic Block, and Associated General Routing Resources • All CLB Inputs Have Access to Interconnect on All 4 Sides • CLB is Divided into Two Identical Slices • Wide Single CLB Functions • Slices Have a Bit Pitch of 2 • Fast Local Feedback Within the CLB & Direct Connects to Adjacent Horizontal Neighbors • [Figure labels: Internal Busses, Direct Connect]

  16. Simplified CLB Structure • 2 Slices in Each CLB • Virtex Slice is Similar in Contents to the Current XC4000 CLB • 2 BUFTs Associated with Each CLB, Accessible by All 8 CLB Outputs • [Figure: a CLB with two slices; each slice shows two LUTs, two Carry blocks, and two flip-flops with D, CE, PRE, CLR and Q pins]

  17. Detailed Slice Structure • [Figure: slice schematic. Labels: two LUT/RAM/ROM/SHIFT function generators (inputs G1 to G4 and F1 to F4 on A1 to A4, output O, with WS and DI); carry CIN and COUT; inputs BX and BY; outputs X, XB, XQ, Y, YB, YQ; flip-flop pins D, Q, CE, CLK, S, R; controls CE, SR, GSR; Data In, Write Strobe, Multiplex Logic; F5 from other slice; position of F5 tap on other slice] • * Controlled by the same pair of memory cells • ** Implemented as extra inputs on the BX input mux • *** CLK and SR inputs are common to both slices

  18. Wide Single CLB Functions • Implement 13-Input Functions in a Single CLB • Builds on XC4000 Architecture 9-Input Function • 2 Logic Levels and 1 Local Interconnect Yield a 2.5ns Max Delay • [Figure: 1.1ns LUT + 0.3ns local interconnect + 1.1ns LUT = 2.5ns across the two slices of a CLB]

  19. Slice Features • Two 4-Input LUTs in Each Slice • Includes 2 Highly Flexible Sequential Elements • Dedicated Logic for 4x1 & 8x1 Muxes • Fast Look Ahead Carry Logic • Dedicated Multiplier Fabric • New SelectShift Feature • Create Shift Registers up to 16 Cycles Deep in a Single 4-Input LUT • 4-Input LUTs can be used as Distributed SelectRAM • Same as XC4000 Synchronous Modes - Single & Dual Port

  20. Flexible Sequential Elements • Sequential Elements Can be Flip-flops or Latches • 2 in Each Slice, 4 in Each CLB • Can be Sourced from LUTs or an Independent CLB Input • Separate Set & Reset Controls • Controls Can be Synchronous or Asynchronous • GSR Can be Used for Power On Set/Reset • All Controls Can be Inverted • Controls are Shared Within Each Slice • [Figure: FDRSE (D, S, CE, R, Q), FDCPE (D, PRE, CE, CLR, Q), LDCPE (D, PRE, CE, G, CLR, Q)]

  21. Fast Efficient Muxes • Primary Use of XC4000 HMAP was to Implement a 2x1 Mux • Dedicated Muxes are Faster & More Space Efficient • Space Freed Up is Used for Muxes & Other Special Logic • MUXF5 Can be Used to Combine the Two LUTs in a Slice to Create a 4x1 Mux or Any Function of 5 Inputs • MUXF6 Can be Used to Combine the Two Slices in a CLB to Create an 8x1 Mux or Any Function of 6 Inputs • [Figure: a CLB with two slices, each with two LUTs feeding a MUXF5, and a MUXF6 combining the slices]
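
The MUXF5 idea in slide 21 can be sketched in software. This is a behavioral illustration only, not the device netlist: `lut4`, `mux2`, `function_of_5` and `mux4` are hypothetical helper names, and the choice of which input drives the mux select is an assumption.

```python
def lut4(truth_table):
    """Model a 4-input LUT: 16 stored bits indexed by the inputs a3 a2 a1 a0."""
    assert len(truth_table) == 16

    def lookup(a0, a1, a2, a3):
        return truth_table[a0 | (a1 << 1) | (a2 << 2) | (a3 << 3)]

    return lookup


def mux2(i0, i1, sel):
    """2x1 mux standing in for MUXF5 (or MUXF6 one level up)."""
    return i1 if sel else i0


def function_of_5(target):
    """Split a 5-input function across two LUTs and select with the fifth input."""
    def table(a4):
        return [target(a & 1, (a >> 1) & 1, (a >> 2) & 1, (a >> 3) & 1, a4)
                for a in range(16)]

    low, high = lut4(table(0)), lut4(table(1))
    return lambda a0, a1, a2, a3, a4: mux2(low(a0, a1, a2, a3),
                                           high(a0, a1, a2, a3), a4)


def mux4(d0, d1, d2, d3, s0, s1):
    """4x1 mux: each LUT is a 2x1 mux on s0, and the MUXF5 stand-in selects on s1."""
    lut_mux = lut4([((a >> 1) & 1) if ((a >> 2) & 1) else (a & 1)
                    for a in range(16)])
    return mux2(lut_mux(d0, d1, s0, 0), lut_mux(d2, d3, s0, 0), s1)
```

Applying the same split one level higher, with two such 5-input halves and one more mux, gives the 6-input case the slide attributes to MUXF6.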

  22. Fast Look Ahead Carry Logic Simple, Fast & Complete Arithmetic Logic • Vertical, Up Only Carry Direction • Look Ahead Carry Implementation Yields 32-Bit Counters & Arithmetic Functions that Perform at 100MHz+ • Discrete XOR Component for Single Level Sum Completion • 2 Separate Carry Chains in CLB Allow for 3 Operand Functions
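
A rough software model of a carry chain with a discrete XOR for sum completion is given below. The per-bit wiring (LUT output as the mux select, one operand bit as the mux data input) is an assumption inferred from the S, DI, CI and CO labels on the multiplier slide, and `carry_chain_add` is a hypothetical name.

```python
def carry_chain_add(a, b, width=32, carry_in=0):
    """Add two unsigned integers bit by bit, the way a carry chain would.

    Per bit (assumed wiring):
      s   = a_i XOR b_i          computed in the LUT
      sum = s XOR ci             the discrete XOR component
      co  = ci if s else a_i     the carry mux
    Returns (sum modulo 2**width, carry_out).
    """
    ci = carry_in & 1
    total = 0
    for i in range(width):            # carry travels in one direction only
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        s = ai ^ bi
        total |= (s ^ ci) << i
        ci = ci if s else ai
    return total, ci
```

Because the LUT only has to produce the propagate bit, the sum needs a single logic level per bit, which is the point of the "single level sum completion" bullet.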

  23. Dedicated Multiplier Fabric • Highly Efficient ‘Shift & Add’ Implementation • Logic Added for Implementation of Binary Tree Style Multipliers • 30% Reduction in Area for a 16x16 Multiply & 1 Less Logic Level • [Figure labels: LUT, A, B, A x B, MULT_AND, CY_MUX, CY_XOR, S, DI, CI, CO]

  24. SelectShift • Dynamically Addressable Shift Registers - DASRs • Ultra-Efficient Programmable Clock Cycle Delay • Serial In, Serial Out, Clock, Clock Enable, and Shift Depth Address • Single LUT Maximum Cycle Delay of 16 • Cascade DASRs for Cycle Delays Greater than 16 • CLB Flip-Flops Can be Used for Other Functions or to Add to DASR Depth • [Figure: a LUT drawn as a chain of flip-flops with IN, CE, CLK, DEPTH[3:0] and OUT; a CLB with two slices of two LUTs each]
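
A minimal behavioral sketch of one such shift register follows. The class name `DASR` and its methods are hypothetical, and the mapping of depth address N to a delay of N + 1 cycles is an assumption; the slide only states a maximum delay of 16 cycles per LUT.

```python
class DASR:
    """Behavioral model of a 16-stage dynamically addressable shift register."""

    STAGES = 16

    def __init__(self):
        self.bits = [0] * self.STAGES

    def clock(self, data_in, ce=1):
        """Rising clock edge: shift one position when the clock enable is high."""
        if ce:
            self.bits = [data_in & 1] + self.bits[:-1]

    def out(self, depth):
        """Serial output for a 4-bit depth address (assumed: 0 taps the newest bit)."""
        return self.bits[depth & 0xF]
```

Changing `depth` between clock edges changes the delay without disturbing the stored bits, which is what makes the register "dynamically addressable".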

  25. SelectShift • Register Rich FPGAs Allow for the Addition of Pipeline Stages to Increase Throughput • Data Paths Must be Balanced to Maintain Desired Functionality • [Figure: two 64-bit data paths. Operation A (4 Cycles) followed by Operation B (8 Cycles) takes 12 Cycles; Operation C takes 3 Cycles, leaving a 9-Cycle Imbalance]

  26. SelectShift • SelectShift Feature of the 4-Input LUT Can be Used to Create NOPs • Above Example Uses 64 LUTs to Replace 576 Flip-flops (64*9) • [Figure: two 64-bit data paths. Operation A (4 Cycles) followed by Operation B (8 Cycles) takes 12 Cycles; Operation C (3 Cycles) followed by Operation D - NOP (9 Cycles) also takes 12 Cycles; Paths Statically Balanced]
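
The resource count in slide 26 can be reproduced with a short calculation. `balance_cost` is a hypothetical helper; it assumes one LUT per bit for every 16 cycles of delay, following the "Single LUT Maximum Cycle Delay of 16" bullet in slide 24.

```python
import math

LUT_MAX_DELAY = 16  # maximum cycle delay of a single LUT (slide 24)


def balance_cost(bus_width, long_path_cycles, short_path_cycles):
    """Return (imbalance, flip_flops, luts) for padding the shorter path."""
    imbalance = long_path_cycles - short_path_cycles
    flip_flops = bus_width * imbalance                          # one flip-flop per bit per cycle
    luts = bus_width * math.ceil(imbalance / LUT_MAX_DELAY)     # one LUT per bit per 16 cycles
    return imbalance, flip_flops, luts


# Slide example: A (4 cycles) + B (8 cycles) against C (3 cycles) on a 64-bit path.
print(balance_cost(64, 4 + 8, 3))  # (9, 576, 64)
```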

  27. SelectShift (continued) • SelectShift Depth Can be Dynamically Changed • Above uses 64 LUTs to Replace 704 Flip-flops & 64 2x1 Muxes • [Figure: two 64-bit data paths. Operation A (4 Cycles) and Operation B (8 Cycles), 12 Cycles; Operation C (3 Cycles) and Operation D - NOP with a # NOP Cycles input selecting 1/10 Cycles; Paths Statically Balanced, Paths Dynamically Balanced]

  28. Internal Bus Support • One Pair of BUFTs Associated with Each CLB • Same ‘Pitch’ as Slice Carry Logic - 2 Bits/Slice • Each BUFT has an Independent Control Input • All CLB Outputs can Source Either BUFT Data Input • Combine BUFTs to Create Wide Muxes • Replace LUT Based Mux Logic to Increase Density • Much Faster than Previous Architectures • Approximately 10ns to Span Entire XCV1000 - 96 Columns • Ties Groups of 4 BUFTs with Bi-directional Look Ahead Scheme Similar to Slice Carry Logic

  29. Internal Bus Support • And-Or Implementation Replaces Three-State Drivers • Simultaneously Driving BUFTs will not Cause Contention • Capacitance of Entire Load Reduced Dramatically • Slow, Power Hungry Pullups & Weak Keepers Unnecessary • Output Flexibility • Removal of Pullups Allows for Outputs to Span Rows • Segments of 4 Columns Allow for Many Outputs Per Row
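
The AND-OR scheme described in slide 29 can be illustrated with a small model. `and_or_bus` is a hypothetical name, and the idle value of 0 when no driver is enabled is an assumption that follows from the OR structure rather than a figure from the slides.

```python
def and_or_bus(drivers, width=8):
    """Resolve a bus as the OR of (data AND enable) over all drivers.

    drivers: iterable of (enable, data) pairs.
    Unlike true three-state drivers, two enabled drivers never fight:
    the result is simply the bitwise OR of their data.
    """
    mask = (1 << width) - 1
    bus = 0
    for enable, data in drivers:
        if enable:
            bus |= data & mask
    return bus
```

With exactly one driver enabled this behaves like a wide mux, which matches the "Combine BUFTs to Create Wide Muxes" bullet in slide 28.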

  30. High Performance Routing: General Purpose Routing • Routing Delay Depends on Radial Distance • Routing Structure Designed to Handle High Fanout Nets • 1000+ Loads - Sub 10ns • Much More Predictable • Predictability is Critical for Core Integration & Reuse • Optimized for 5 Layer Metal • [Figure: Vector Based Interconnect over the CLB Array with 2ns delay labels]

  31. High Performance Routing • Segmented Routing Architecture Allows For Optimal Connection Delay, Power, Capacitance & Resource Utilization • Combined With Timing Driven Place & Route Yields Superior Path Delays • Increasing Device Utilization Does Not Decrease Design Performance • Resource Mix Optimized for Large Devices - Optimized for 5 LM • Algorithmically Friendly Structure • Significant Compile Time Reduction Without Performance Penalty • [Figure labels: Internal Busses, Direct Connect]

  32. High Performance Routing • Advanced Local CLB Routing • Massive Hierarchical General Routing Resources Designed For Speed • 24 Singles, 72 Hexes, 12 Longs per Tile (4KXL: 8 Singles, 4 Doubles, 12 Quads, 12 Longs per Tile) • Selective Connectivity Between Resource Types to Limit Loading • Longs and Hexes Can be Used as Secondary Global Resources for Clocks and Controls With Sub 10ns Delays • Special Backbone Routing in Top and Bottom I/O Edges to Connect Vertical Longs to Create Low Skew Resources • Increased Switch Matrix Connectivity • Higher Connectivity Eliminates Congestion

  33. Advanced Local CLB Routing • Each LUT Output Can Connect to the Three Other LUTs • 100ps to 300ps Maximum Delay • Create 13-Input Functions Within the Same CLB - 2.5ns Total Delay • Synthesis Tools Use FastConnects on Critical Paths • IMUX Receives 96 Connections from General Routing Matrix (GRM) • Highly Exhaustive Connection Matrix • OMUX Equivalent to 8-bit 13x1 Mux • All 8 Outputs Connect to the GRM • 2 Outputs Can be Used to Connect Directly to the Horizontal Neighbors • All Outputs Can Feed the 2 BUFTs • [Figure: a CLB with two slices of two LUTs each]

  34. Massive Hierarchical Resources • Routing Needs Based On XCV1000 • Loading of Resources Minimized While Connectivity Increased • Both Long Lines & Hexes are Buffered To Reduce RC Delays • Longs Have Access Every 6 Tiles • Hexes Have Access at Ends & Middle • Special Hexes Added to Top and Bottom to Create High Fanout Resources with Vertical Long Lines • Horizontal Singles Connect Directly to Vertical Long Lines for Fast Control Signal Distribution

  35. Increased Matrix Connectivity • Previous Families Use Planar Pipulation • Allows for Routing Along Same Channel • Restricts Connectivity of Dissimilar Resources • Virtex Devices Use Non-Planar Pipulation • Allows for Routing Across Resource Types • Longs Drive Hexes, Hexes Drive Hexes and Singles, Singles Drive Singles and CLB IMUXs - Vertical Hexes Drive CLB Control Inputs As Well • CLB OMUXs Drive All Types • Switch Matrix Connectivity Determines Design Routability • Increased Switch Matrix Connectivity Alleviates Congestion • [Figure: Planar Pipulation versus Non-Planar Pipulation]

  36. SelectRAM+: System Memory (2) • 200 MHz Distributed SelectRAM (Bytes) • 200 MHz Block SelectRAM (Kilobytes) • 200 MHz Access to External Memory (Megabytes) • [Figure: RAMB4_S4_S16. Port A pins: WEA, ENA, RSTA, CLKA, ADDRA[9:0], DIA[3:0], DOA[3:0]. Port B pins: WEB, ENB, RSTB, CLKB, ADDRB[7:0], DIB[15:0], DOB[15:0]]

  37. SelectRAM+ Hierarchy • Distributed SelectRAM • Proven Synchronous RAM of the XC4000 Families • 16x1 Implemented in a LUT - 4 in Each CLB • 32x1 Implemented in a Slice - 2 in Each CLB • Ideal for DSP Applications • Block SelectRAM • True Dual Port, Fully Synchronous RAM • 4096-Bit Block Configurable in Widths From 1 to 16 • Ideal for Data Buffers & FIFOs • Fast Access to External RAM • 133MHz Direct Interface to SSTL3, 3.3V Synchronous DRAM

  38. Distributed SelectRAM • Builds on XC4000 Tradition • Synchronous Write • Asynchronous Read • No Asynchronous Write • Use a Single LUT to Create a RAM16X1S • Use a Pair of LUTs to Create a RAM32X1S or RAM16X1D • RAM16X1D Comes With One R/W Address & One Read Only Address • Accompanying Flip-Flops Can Be Used to Register Read • [Figure: RAM16X1S in one LUT (D, WE, WCLK, A0 to A3, O); RAM32X1S in one slice (D, WE, WCLK, A0 to A4, O); RAM16X1D in one slice (D, WE, WCLK, A0 to A3, DPRA0 to DPRA3, SPO, DPO)]
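
The synchronous-write, asynchronous-read behavior of the dual-port variant can be sketched as follows. The class mirrors the RAM16X1D pin names from the slide (D, WE, A, DPRA, SPO, DPO), but the Python interface itself is a hypothetical illustration.

```python
class RAM16X1D:
    """Behavioral model of a 16x1 dual-port distributed RAM."""

    def __init__(self):
        self.mem = [0] * 16

    def clock(self, d, we, a):
        """Rising edge of WCLK: write D at the R/W address only if WE is high."""
        if we:
            self.mem[a & 0xF] = d & 1

    def spo(self, a):
        """Asynchronous read at the R/W address; no clock edge is needed."""
        return self.mem[a & 0xF]

    def dpo(self, dpra):
        """Asynchronous read at the read-only address."""
        return self.mem[dpra & 0xF]
```

Registering `spo` or `dpo` in the accompanying flip-flop, as the last bullet suggests, would turn the read into a synchronous one.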

  39. Block SelectRAM • True Dual Port Synchronous RAM • 2 R/W Ports with Independent Controls • Synchronous Read & Write • Block Count Increases With FPGA Size • 8 Blocks in the XCV50 - 32Kb • 32 Blocks in the XCV1000 - 128Kb • Located on Left & Right Sides with 1 Block Every 4 Rows • Flexible 4096-Bit Block • Variable Aspect Ratio • Each Port can be a Different Width • Synchronous Reset & INIT Values • State Machines, Decodes, Etc • Sub-10ns Cycle Time For All Widths • [Figure: RAMB4_S#_S# with pins WEA, ENA, RSTA, CLKA, ADDRA[#:0], DIA[#:0], DOA[#:0], WEB, ENB, RSTB, CLKB, ADDRB[#:0], DIB[#:0], DOB[#:0]; Allowed Widths]

  40. Block SelectRAM • Library Name Specifies Port Configuration • Each Dual Port can be configured with a different width • [Figure: RAMB4_S4_S16. Port A: 1K depth, 4-bit width, pins WEA, ENA, RSTA, CLKA, ADDRA[9:0], DIA[3:0], DOA[3:0]. Port B: 256 depth, 16-bit width, pins WEB, ENB, RSTB, CLKB, ADDRB[7:0], DIB[15:0], DOB[15:0]]

  41. Block SelectRAM • The Dual Ports Access the Same 4096 Bits • Combine Blocks For Additional Depth & Width • The Depth/Width Ratio Determines How the Bits are Accessed • For Example: A RAMB4_S4_S16 Has a 1kx4 Port & a 256x16 Port • Provides Easy Data Width Conversion Without Any Additional Logic • [Figure: the 4096-bit storage as viewed by a port configured as 1kx4 and by a port configured as 256x16]
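
The width conversion in slide 41 can be modeled with a shared bit array. `port_shape` and `BlockRam` are hypothetical names, and the packing order (the lowest-addressed narrow word lands in the least significant bits of the wide word) is an assumption; the slides do not state the bit ordering.

```python
BLOCK_BITS = 4096
ALLOWED_WIDTHS = (1, 2, 4, 8, 16)


def port_shape(width):
    """Return (depth, address_bits) for a port of the given width."""
    assert width in ALLOWED_WIDTHS
    depth = BLOCK_BITS // width
    return depth, depth.bit_length() - 1


class BlockRam:
    """One 4096-bit block seen through ports of different widths."""

    def __init__(self):
        self.bits = [0] * BLOCK_BITS

    def write(self, width, addr, value):
        base = addr * width                      # assumed packing: word N starts at bit N*width
        for i in range(width):
            self.bits[base + i] = (value >> i) & 1

    def read(self, width, addr):
        base = addr * width
        return sum(self.bits[base + i] << i for i in range(width))


ram = BlockRam()
for addr, nibble in enumerate([0x1, 0x2, 0x3, 0x4]):
    ram.write(4, addr, nibble)                   # 1kx4 port
print(hex(ram.read(16, 0)))                      # 256x16 port, 0x4321 under the assumed packing
```

`port_shape(4)` gives 1024 words and 10 address bits, and `port_shape(16)` gives 256 words and 8 address bits, consistent with ADDRA[9:0] and ADDRB[7:0] on the RAMB4_S4_S16.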

  42. Block SelectRAM: Build State Machines & PROM Based Address Decodes • Subdivide 32-Bit Address Space into 4096 1MB Blocks • Using a DLL, the Enable is Available Only 5.1ns After the Rising Edge of the External System Clock • [Figure: RAMB4_S1 with A[31:20] on ADDR[11:0], Clock on CLK, WE tied to 0, EN tied to 1, RST tied to 0, DI[7:0] not connected, and DO used as the Enable; locations 0000 (000XXXXX), 0001 (001XXXXX), 0002 (002XXXXX) through 4093 (FFDXXXXX), 4094 (FFEXXXXX), 4095 (FFFXXXXX)]
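
The PROM-style decode in slide 42 amounts to a 4096x1 lookup indexed by the top 12 address bits. The sketch below uses hypothetical names (`build_decode_rom`, `chip_enable`), and the set of enabled blocks is made up for illustration.

```python
def build_decode_rom(enabled_blocks):
    """Create the 4096x1 contents: 1 for each 1MB block that should assert Enable."""
    rom = [0] * 4096
    for block in enabled_blocks:
        rom[block] = 1
    return rom


def chip_enable(rom, address):
    """Look up Enable for a 32-bit address using A[31:20] as the ROM address."""
    return rom[(address >> 20) & 0xFFF]


# Hypothetical map: enable the device for blocks 0x001 and 0xFFF only.
rom = build_decode_rom({0x001, 0xFFF})
print(chip_enable(rom, 0x001ABCDE), chip_enable(rom, 0x00200000))  # 1 0
```

Any combination of the 4096 blocks can be enabled by changing the initialization contents, with no extra decode logic.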

  43. Clocking & DLLs: System Timing (3) • Multiplication, Division & Phase Generation • Zero-Delay Clock Distribution & System Synchronization • Route to Other Devices • [Figure: a 90 MHz CLK feeds DLLs that produce 45 MHz (Divide by 2) and 180 MHz (Multiply by 2). CLKDLL pins: CLKIN, CLKFB, RST, CLK0, CLK90, CLK180, CLK270, CLK2X, CLKDV, LOCKED]
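
The frequencies and phase offsets on the CLKDLL outputs follow from simple arithmetic, sketched below. `dll_outputs` is a hypothetical helper; the `divide` parameter is an assumption, since the slide only shows divide by 2.

```python
def dll_outputs(clkin_mhz, divide=2):
    """Return {output: (frequency in MHz, phase delay in ns)} for an ideal DLL."""
    period_ns = 1000.0 / clkin_mhz
    outputs = {
        f"CLK{deg}": (clkin_mhz, period_ns * deg / 360)   # phase-shifted copies of CLKIN
        for deg in (0, 90, 180, 270)
    }
    outputs["CLK2X"] = (clkin_mhz * 2, 0.0)               # multiply by 2
    outputs["CLKDV"] = (clkin_mhz / divide, 0.0)          # divide (2 in the slide)
    return outputs


clocks = dll_outputs(90)
print(clocks["CLK2X"][0], clocks["CLKDV"][0])  # 180 45.0
```

For the 90 MHz input in the figure this gives the 180 MHz and 45 MHz outputs shown, and each 90° step corresponds to a quarter of the input period.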

  44. General Clock Support • 4 Dedicated Global Low Skew Buffers • Dedicated Input Pin - Intended to Distribute Clocks Only • 66 MHz PCI Performance With 500ps Maximum Skew • 3ns TSetup /0ns THold - Input IOB Flip-flop with No Data Delay • 6ns TClock2Out - Output IOB Flip-flop • 24 Additional Shared Resources • Intended to Distribute Low Skew/High Fanout Signals • Distribute Control Signals Across the Device under 10ns • additional clocks, clock enables, three-state controls & resets • 4 Delay Lock Loops on Each Device • 100% Digital Implementation • 2 Global Buffers Associated with Each DLL Pair

  45. DLLs Versus PLLs • DLLs use Programmable Delay Line in Conjunction with Control Logic that Selects the Delay to Match the Distribution • PLLs use Programmable Oscillators in Conjunction with Phase Detectors & Filters to Phase Adjust the Clock • Both types are used to remove clock delay & provide additional clocking functionality • Frequency synthesis, Phase adjustment & clock conditioning • Both can be implemented using either analog or digital logic • [Figure: DLL block diagram (CLKIN, Programmable Delay Line, Clock Distribution, CLKOUT, Control Logic, CLKFB) beside PLL block diagram (CLKIN, Programmable Oscillator, Clock Distribution, CLKOUT, Control Logic, CLKFB)]

  46. DLLs Versus PLLs • The Oscillator Used in a PLL Inherently Introduces Instability & Phase Error • The DLL Architecture is Unconditionally Stable and Does Not Accumulate Phase Error • It is Generally Accepted that DLLs are Better for Delay Compensation and Clock Conditioning • PLLs Typically Have an Advantage When Performing Frequency Synthesis and Can Operate Over a Larger Input Clock Frequency

  47. DLL Functions • Speedup Tc2o • Zero-Delay Internal Clock Buffer • Clock Phase Synthesis For Use Internally Or Externally • Clock Multiplication & Division For Use Internally Or Externally • Clock Mirror • Zero-Delay Board Clock Buffer • [Figure: Virtex devices illustrating each function]

  48. DLL Functions • Speedup Tc2o by Eliminating Clock Distribution Delay • Generate Phase Shifted Clocks • Perform Clock Multiplication & Division • Cleanup Clocks with 50/50 Duty Cycle Correction • Generate Clock Lock for Internal & External Use • Can Require Configuration to Synchronize with DLL Lock • DLL Feedback can be Connected Internally or Externally • Can be Used to Create Clock Mirrors & Perform System Synchronization

  49. DLL Tc2o Speedup • Nullify Clock Delay - Fast Tc2o on XCV1000 • External CLKext pin and Internal CLKint pin are Aligned • 2.5ns Setup/0.0ns Hold & 3.5ns Tc2o on All Devices • Optional Duty Cycle Correction • 50/50 Duty Cycle Correction Applied when Specified • Not sensitive to clock input noise - use standard cans • [Figure: CLKext passes through the DLL to CLKint and clocks an output flip-flop (D, Q) driving OUT; Tclock = 0ns; Tc2q + Tout = Tc2o]

  50. DLL Phase Shift • Coarse Phase Shifts Available • 0°, 90°, 180°, and 270° • Available for Internal & External Use • 50/50 Duty Cycle Correction Available • [Figure: 100MHz - 180° Phase Shift; the DLL outputs 100 MHz (0° Phase) and 100 MHz (180° Shift)]