
CPRE 583 Reconfigurable Computing Lecture 3: Wed 9/1/2010 (Reconfigurable Computing Hardware)

Instructor: Dr. Phillip Jones (phjones@iastate.edu), Reconfigurable Computing Laboratory, Iowa State University, Ames, Iowa, USA. http://class.ece.iastate.edu/cpre583/

Presentation Transcript


  1. CPRE 583 Reconfigurable Computing • Lecture 3: Wed 9/1/2010 (Reconfigurable Computing Hardware) • Instructor: Dr. Phillip Jones (phjones@iastate.edu) • Reconfigurable Computing Laboratory, Iowa State University, Ames, Iowa, USA • http://class.ece.iastate.edu/cpre583/

  2. Announcements/Reminders • Send me your top 3 topic choices for the mini literature survey • PowerPoint tree due: Fri 9/17 by class, so try to get it to me by the night of 9/16. My current plan is to summarize some of the class's findings during class. • Final 5-10 page write-up on your tree due: Fri 9/24 midnight.

  3. Overview • Logic • Interconnect/Routing • Optimized resources • Adders, Multipliers • Memory • System-on-chip building blocks • Example Commercial FPGA structure

  4. What you should learn • Basic understanding of the major components that make up an FPGA device.

  5. Basic FPGA Architectural Components • FPGA: Field Programmable Gate Array • Sea of general purpose logic gates • [Figure: array of 16 Configurable Logic Blocks (CLBs)]

  6. Computational Fabric - LUT • LUT = Look-up Table; a 4-LUT maps inputs A, B, C, D to one output Z • Example truth tables (selected rows, ABCD -> Z) • AND: 0000 -> 0, 0001 -> 0, 1110 -> 0, 1111 -> 1 • OR: 0000 -> 0, 0001 -> 1, 1110 -> 1, 1111 -> 1 • 2:1 Mux (B selects; C on input 1, D on input 0; A is a don't-care, X): X000 -> 0, X001 -> 1, X010 -> 0, X101 -> 0, X110 -> 1, X111 -> 1

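  7. Computational Fabric - LUT • LUT = Look-up Table • How many 4-LUTs are needed to OR 32 bits? Draw the circuit. • [Figure: 4-LUT with inputs A, B, C, D and output Z; 32 input bits to be reduced to 1 output bit]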

  8. Computational Fabric - LUT • LUT = Look-up Table • How many 4-LUTs are needed to OR 32 bits? Draw the circuit. • [Figure: tree of eleven 4-LUTs reducing 32 input bits to 1 output bit]
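
The count of eleven 4-LUTs drawn on this slide can be checked with a short calculation. The sketch below is not from the lecture; it assumes that each K-input LUT merges up to K signals into one, so it removes at most K-1 signals from the problem. The function name `luts_for_reduction` is hypothetical.

```python
def luts_for_reduction(num_inputs: int, lut_size: int = 4) -> int:
    """Minimum number of lut_size-input LUTs needed to reduce num_inputs bits
    to a single bit with an associative operation such as OR.

    Each LUT merges up to lut_size signals into one, so it removes at most
    (lut_size - 1) signals. Going from num_inputs signals down to 1 therefore
    needs ceil((num_inputs - 1) / (lut_size - 1)) LUTs.
    """
    if num_inputs < 1 or lut_size < 2:
        raise ValueError("need at least 1 input and a LUT with 2 or more inputs")
    # Integer ceiling division without importing math.
    return -(-(num_inputs - 1) // (lut_size - 1))


if __name__ == "__main__":
    print(luts_for_reduction(32, 4))  # 32-bit OR built from 4-LUTs
```

For 32 inputs this corresponds to a tree of 8 LUTs, then 2, then 1. The final LUT uses only 2 of its 4 inputs in that arrangement.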

  9. Computational Fabric - LUT • LUT = Look-up Table • How many 4-LUTs are needed to AND 2 bits with the 32-bit OR? Draw the circuit. • [Figure: tree of eleven 4-LUTs reducing 32 input bits to 1 output bit]

  11. Computational Fabric - LUT • LUT = Look-up Table • How many 4-LUTs are needed to AND 2 bits with the 32-bit OR? Draw the circuit. • Write out the truth table (ABCD -> Z, rows 0000 through 1111; Z column still empty) • [Figure: tree of eleven 4-LUTs reducing 32 input bits to 1 output bit]

  12. Computational Fabric - LUT • LUT = Look-up Table • How many 4-LUTs are needed to AND 2 bits with the 32-bit OR? Draw the circuit. • Write out the truth table (ABCD -> Z, rows 0000 through 1111; partially filled in, with twelve Z entries shown, all 0) • [Figure: tree of eleven 4-LUTs reducing 32 input bits to 1 output bit]

  13. Computational Fabric - LUT • LUT = Look-up Table • How many 4-LUTs are needed to AND 2 bits with the 32-bit OR? Draw the circuit. • Write out the truth table (ABCD -> Z): Z = 1 for rows 0111, 1110, and 1111; Z = 0 for the other thirteen rows • [Figure: tree of eleven 4-LUTs reducing 32 input bits to 1 output bit]
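
The completed truth table is consistent with a final 4-LUT that computes Z = (A OR D) AND B AND C. The pin assignment used below (A and D carry the two partial ORs, B and C are the two extra AND bits) is inferred from the table rows, not stated on the slide, so treat it as an assumption. The sketch regenerates the Z column from that expression.

```python
from itertools import product


def final_lut(a: int, b: int, c: int, d: int) -> int:
    """Assumed function of the last 4-LUT in the tree.

    A and D are taken to be the two partial OR results; B and C are taken to
    be the two bits ANDed with the 32-bit OR. This assignment is inferred from
    the slide's truth table.
    """
    return (a | d) & b & c


# Rows in the same order as the slide: ABCD = 0000, 0001, ..., 1111.
rows = ["".join(map(str, bits)) for bits in product((0, 1), repeat=4)]
z_column = "".join(str(final_lut(*map(int, row))) for row in rows)
rows_with_one = [row for row, z in zip(rows, z_column) if z == "1"]

if __name__ == "__main__":
    for row, z in zip(rows, z_column):
        print(row, z)
```

Because the last LUT in the 32-bit OR tree has two unused inputs, the two AND bits fit into it and the total stays at eleven 4-LUTs.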

  14. Computational Fabric - LUT • LUT = Look-up Table • How could one build a 4-LUT? • [Figure: a 1x16 memory holding the configuration bits (for example 0 0 0 0 0 1) feeds a 16:1 mux; the 4 inputs ABCD select which bit drives Z]
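
A minimal software model of the memory-plus-mux construction, written as a sketch and not as a description of any vendor's circuit. The class name `LUT` and the convention that the first input is the most significant address bit are assumptions made for illustration.

```python
class LUT:
    """N-input look-up table modeled as a 1 x 2^N memory read through a mux."""

    def __init__(self, n_inputs: int, config_bits):
        config_bits = list(config_bits)
        if len(config_bits) != 2 ** n_inputs:
            raise ValueError("an N-LUT needs exactly 2**N configuration bits")
        self.n_inputs = n_inputs
        self.memory = config_bits  # memory[address] is the output for that input row

    def __call__(self, *inputs: int) -> int:
        if len(inputs) != self.n_inputs:
            raise ValueError("wrong number of inputs")
        # The inputs act as the mux select lines (first input = MSB, an assumption).
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return self.memory[address]


# "Programming" the LUT just means choosing the 16 memory bits.
and4 = LUT(4, [0] * 15 + [1])   # only row 1111 outputs 1
or4 = LUT(4, [0] + [1] * 15)    # every row except 0000 outputs 1

if __name__ == "__main__":
    print(and4(1, 1, 1, 1), and4(1, 1, 1, 0), or4(0, 0, 0, 0), or4(0, 0, 0, 1))
```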

  15. Computational Fabric - LUT • LUT = Look-up Table • How many different 4-input functions can a 4-LUT implement? • $2^{16} = 65536$ • [Figure: 1x16 memory feeding a 16:1 mux; the 4 inputs ABCD select which bit drives Z]

  16. Computational Fabric - LUT • LUT = Look-up Table • How many different N-input functions can an N-LUT implement? • [Figure: 1x16 memory feeding a 16:1 mux; the 4 inputs ABCD select which bit drives Z]

  17. Computational Fabric - LUT • LUT = Look-up Table • How many different N-input functions can an N-LUT implement? • [Figure: 1x16 memory feeding a 16:1 mux, with the input bus now labeled N instead of 4]

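  18. Computational Fabric - LUT • LUT = Look-up Table • How many different N-input functions can an N-LUT implement? • $2^{2^N}$, since the LUT is a $1 \times 2^N$ memory • For N = 4: $2^{16} = 2^{2^4} = 65536$ • [Figure: $1 \times 2^N$ memory feeding a mux (16:1 for N = 4); the N inputs select which bit drives Z]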
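
The count follows because an N-LUT has $2^N$ memory bits and every distinct bit pattern is a distinct function. As a sanity check, the sketch below evaluates the formula and brute-force enumerates all truth tables for small N; the function names are hypothetical.

```python
from itertools import product


def num_lut_functions(n: int) -> int:
    """Number of distinct n-input Boolean functions: 2 ** (2 ** n)."""
    return 2 ** (2 ** n)


def enumerate_truth_tables(n: int):
    """All possible contents of a 1 x 2^n LUT memory (only practical for small n)."""
    return set(product((0, 1), repeat=2 ** n))


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, num_lut_functions(n))
```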

  19. Granularity of Computation • Trade-offs associated with LUT size • Example: 2-LUT (4 = 2x2 bits) vs. 10-LUT (1024 = 32x32 bits) • [Figure: 1024 bits of 2-LUTs compared with a 1024-bit 10-LUT; application: Microprocessor]

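  20. Granularity of Computation • Trade-offs associated with LUT size • Example: 2-LUT (4 = 2x2 bits) vs. 10-LUT (1024 = 32x32 bits) • [Figure: 1024 bits of 2-LUTs compared with a 1024-bit 10-LUT; application: Microprocessor, drawn as one block with a 4-bit op input, 3-bit inputs A and B, and a 3-bit output]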

  21. Granularity of Computation • Trade-offs associated with LUT size • Example: 2-LUT (4 = 2x2 bits) vs. 10-LUT (1024 = 32x32 bits) • [Figure: 1024 bits of 2-LUTs compared with a 1024-bit 10-LUT; application: Microprocessor, drawn as three blocks, each with a 4-bit op input, 3-bit inputs A and B, and a 3-bit output]

  22. Granularity of Computation • Trade-offs associated with LUT size • Example: 2-LUT (4 = 2x2 bits) vs. 10-LUT (1024 = 32x32 bits) • [Figure: 1024 bits of 2-LUTs compared with a 1024-bit 10-LUT; application: Microprocessor, drawn as three blocks with op, A, and B inputs (animation step; block positions changed)]

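  23. Granularity of Computation • Trade-offs associated with LUT size • Example: 2-LUT (4 = 2x2 bits) vs. 10-LUT (1024 = 32x32 bits) • [Figure: 1024 bits of 2-LUTs compared with a 1024-bit 10-LUT; application: Microprocessor, drawn as three blocks, each with a 4-bit op input, 3-bit inputs A and B, and a 3-bit output (final animation step)]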

  24. Granularity of Computation • Trade-offs associated with LUT size • Example: 2-LUT (4 = 2x2 bits) vs. 10-LUT (1024 = 32x32 bits) • [Figure: 1024 bits of 2-LUTs compared with a 1024-bit 10-LUT; application: Bit logic and constants]

  25. Granularity of Computation • Trade-offs associated with LUT size • Example: 2-LUT (4 = 2x2 bits) vs. 10-LUT (1024 = 32x32 bits) • Bit logic and constants: (A and "1100") or (B or "1000") • [Figure: 1024 bits of 2-LUTs compared with a 1024-bit 10-LUT]

  26. Granularity of Computation • Trade-offs associated with LUT size • Example: 2-LUT (4 = 2x2 bits) vs. 10-LUT (1024 = 32x32 bits) • Bit logic and constants: (A and "1100") or (B or "1000") • [Figure: 1024 bits of 2-LUTs compared with a 1024-bit 10-LUT, with inputs A and B drawn in]

  27. Granularity of Computation • Trade-offs associated with LUT size • Example: 2-LUT (4 = 2x2 bits) vs. 10-LUT (1024 = 32x32 bits) • Bit logic and constants: (A and "1100") or (B or "1000"), with 4-bit A and B • With 10-LUTs it's much worse: each 10-LUT has only one output • [Figure: AND and OR operations on 4-bit A and B with constant bits 1 and 0; the 10-LUT implementation is compared against the area that was required using 2-LUTs]
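
One way to see why fine-grained LUTs suit bit logic with constants is to simplify the expression one bit at a time. The sketch below assumes that "1100" and "1000" are 4-bit constants written most significant bit first and that A and B are 4-bit vectors; neither the helper `fold_bit` nor this per-bit breakdown comes from the slide.

```python
def fold_bit(ca: int, cb: int):
    """Simplify (a AND ca) OR (b OR cb) for one bit with constant bits ca, cb.

    Returns a label for the simplified logic and a function that computes it.
    """
    if cb == 1:
        return "1", lambda a, b: 1          # OR with constant 1 is always 1
    if ca == 1:
        return "A OR B", lambda a, b: a | b  # (a AND 1) OR b
    return "B", lambda a, b: b               # (a AND 0) OR b


CONST_A = "1100"  # constant ANDed with A, MSB first (assumed encoding)
CONST_B = "1000"  # constant ORed with B, MSB first (assumed encoding)

# One (label, function) pair per bit, MSB first.
folded = [fold_bit(int(ca), int(cb)) for ca, cb in zip(CONST_A, CONST_B)]
labels = [label for label, _ in folded]


def reference(a_bit: int, b_bit: int, ca: int, cb: int) -> int:
    """The unsimplified per-bit expression."""
    return (a_bit & ca) | (b_bit | cb)


if __name__ == "__main__":
    print(labels)
```

Under these assumptions only one output bit needs an actual 2-input gate; the other three reduce to a constant or a wire.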

  28. Computational Fabric - DFF • LUTs are fine for implementing any arbitrary combinational logic function (output is ONLY a function of its inputs). But what about sequential logic (output is a function of the inputs AND previous state information)? • Need memory!! • [Figure: 4-LUT with inputs A, B, C, D and output Z]

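  29. Computational Fabric - DFF • DFF = D Flip-Flop • Example: detect the pattern "1101" • [Figure: 4-LUT (inputs A, B, C, D) paired with a DFF, with signals labeled Z(t) and Z(t+1); state diagram with states Start, 1, 11, 110, 1101 and transitions labeled input/output (0/0, 1/0, 1/1)]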
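
A software sketch of a "1101" detector as a Mealy state machine. The exact transitions of the slide's diagram are not recoverable from the transcript, so the table below is an assumed four-state version that allows overlapping matches; the state names and `detect_1101` are hypothetical. In hardware, the transition table would be the combinational part (LUTs) and the `state` variable would be held in DFFs.

```python
# (current state, input bit) -> (next state, output bit)
# State names record the longest prefix of "1101" seen so far.
TRANSITIONS = {
    ("START", 0): ("START", 0),
    ("START", 1): ("S1", 0),
    ("S1", 0): ("START", 0),
    ("S1", 1): ("S11", 0),
    ("S11", 0): ("S110", 0),
    ("S11", 1): ("S11", 0),     # "111" still ends in "11"
    ("S110", 0): ("START", 0),
    ("S110", 1): ("S1", 1),     # "1101" seen; the trailing 1 can start a new match
}


def detect_1101(bits):
    """Return one output bit per input bit; 1 marks the end of a "1101" match."""
    state = "START"              # plays the role of the DFF contents
    outputs = []
    for bit in bits:
        state, out = TRANSITIONS[(state, bit)]  # plays the role of the LUT logic
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    print(detect_1101([1, 1, 0, 1, 1, 0, 1]))
```

With four states, two state bits plus the one input bit give three inputs, so each next-state bit and the output bit would fit in a single 4-LUT under this encoding.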

  30. Computational Fabric - DFF • DFF = D Flip-Flop • Increase circuit performance (pipelining) • Without pipelining: 4 LUT delays per output • With pipelining: 1 DFF delay per output • [Figure: two chains of four 4-LUT/DFF pairs with inputs A, B, C, D, one for each case]
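
A rough timing sketch of why pipelining helps. The delay values and the simple model (clock period = LUT delays between registers + one DFF delay) are assumptions chosen for illustration; they are not figures from the lecture or from any device data sheet.

```python
def clock_period_ns(luts_between_registers: int, t_lut_ns: float, t_dff_ns: float) -> float:
    """Simplified model: the clock must cover the LUT chain plus one DFF delay."""
    return luts_between_registers * t_lut_ns + t_dff_ns


T_LUT_NS = 1.0   # hypothetical delay of one 4-LUT
T_DFF_NS = 0.5   # hypothetical DFF delay (clock-to-output plus setup)
STAGES = 4       # four 4-LUTs in the chain, as on the slide

unpipelined = clock_period_ns(STAGES, T_LUT_NS, T_DFF_NS)  # all 4 LUTs in one cycle
pipelined = clock_period_ns(1, T_LUT_NS, T_DFF_NS)         # one LUT per cycle
throughput_gain = unpipelined / pipelined
pipelined_latency = STAGES * pipelined                     # first result takes 4 cycles

if __name__ == "__main__":
    print(unpipelined, pipelined, throughput_gain, pipelined_latency)
```

In this model the pipelined circuit delivers a result every short cycle, at the cost of a somewhat longer latency for any single result.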

  31. Communication: Interconnect & Routing • Need a mechanism to move results of computation around. • [Figure: array of CLBs]

  32. Communication: Interconnect & Routing • Need a mechanism to move results of computation around. • Nearest Neighbor • [Figure: array of CLBs illustrating nearest-neighbor connections]

  33. Communication: Interconnect & Routing • Need a mechanism to move results of computation around. • Nearest Neighbor • Segmented • [Figure: one array of CLBs for each routing style]

  34. Communication: Interconnect & Routing • Need a mechanism to move results of computation around. • Nearest Neighbor • Segmented • Hierarchical • [Figure: one array of CLBs for each routing style]

  35. Optimized Resources: Dedicated Logic • LUTs + DFFs can implement any arbitrary digital logic, but not optimally (ASICs make much better use of silicon area for power, speed, and routing resources) • Arithmetic • Add, Multiply • On-chip memory • System-on-chip building blocks • Processor, PCI-express, Gigabit Ethernet, ADC, etc.

  36. Optimized Resources: Dedicated Logic • Fast Addition • Generate/propagate logic, carry look-ahead • Two-output LUTs • Dedicated routing resources for the carry chain • [Figure: adder built from 6-LUTs in CLBs with inputs A1..A3 and B1..B3, generate/propagate signals G1, G2, P1, P2, carries Carry 1, Carry 2, and c4, outputs Sum 1..Sum 3, plus Carry in and Carry out]
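
A bit-level sketch of the generate/propagate formulation that fast adders are built on. It uses the standard definitions g = a AND b, p = a XOR b, sum = p XOR carry, and next carry = g OR (p AND carry). The loop below evaluates the carries one after another in software; dedicated carry logic evaluates the same equations much faster in hardware. The function name `gp_add` and the default 3-bit width are assumptions.

```python
def gp_add(a: int, b: int, width: int = 3, carry_in: int = 0):
    """Add two width-bit numbers using per-bit generate/propagate signals.

    Returns (sum_bits, carry_out), where sum_bits holds the low `width` bits.
    """
    carry = carry_in & 1
    total = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        generate = a_i & b_i       # this bit produces a carry by itself
        propagate = a_i ^ b_i      # this bit passes an incoming carry along
        total |= (propagate ^ carry) << i
        carry = generate | (propagate & carry)
    return total, carry


if __name__ == "__main__":
    print(gp_add(0b101, 0b011))  # 5 + 3 with 3-bit operands
```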

  37. Optimized Resources: Dedicated Logic • Embedded Memory • 96 bits, 300 MHz • [Figure: memory with ports labeled 8 and 12]

  38. Optimized Resources: Dedicated Logic • Embedded Memory • Dedicated memory block: 18 Kbits, 550 MHz • [Figure: memory with ports labeled 8 and 12]

  39. Optimized Resources: Dedicated Logic • Multiplication: 18x18 multiply on Virtex-5 (6-LUTs) • Very rough estimate of silicon area comparison (assuming SX95 and LX110 have about the same die size) • In other words, you can replace one LUT-based 18x18 multiplier with 100 dedicated 18x18 multipliers!!! • [Figure: an 18x18 multiplier built from 6-LUTs next to a dedicated 18x18 multiplier]

  40. Optimized Resources: Dedicated Logic • Processor • PowerPC hard-core: 500 MHz, superscalar, high-speed 2x5 switch fabric • MicroBlaze soft-core: 250 MHz, simple scalar

  41. Optimized Resources: Dedicated Logic • System on Chip • [Figure: chip divided into dedicated logic and reconfigurable logic; labeled blocks: RAM, ADC, Matrix Multiplier Coprocessor, Data Buffer, PID Controller, Ethernet MAC; external Sensors and Motor] • Also see Actel Fusion: http://www.actel.com/products/fusion/default.aspx

  42. Xilinx CLB Architecture • Virtex 5 FPGA User Guide

  43. Questions/Comments/Concerns

  44. Computational Fabric - LUT • N-LUT: 3, 4, … 6, … 8-LUT • AND, XOR, NOT • Exercises • How many 4-LUTs to OR 32 bits (draw) • How many 4-LUTs to AND 2 bits with the OR of these 32 bits (draw) • Draw the truth table for the 4-LUT that gives the final output • How could one implement a LUT (Memory + MUX) • How many ways can a 4-LUT be programmed • How many ways can an N-LUT be programmed • Granularity trade-off: functionality vs. propagation delay (2-LUT -> CPU), bit-level vs. datapath

  45. Computational Fabric - DFF • Enables building circuits that can store information (sequential circuits, state machines) • Enables pipelining to increase operating frequency/throughput

  46. Communication: Interconnect & Routing • Need a mechanism to move the results of a LUT to other LUTs. • Island style (Array of CB) • Nearest neighbor (paper on reconfigurable arch that uses this) • Not scalable (large delays, and uses logic elements for routing?) • Segmented (different lengths for latency trade-off) • Multi-hop scales < O(N)? • Avoids using logic • Hierarchical (good for apps with lots of local communication and little remote communication) • Typically, an FPGA's silicon area will be 10% logic and 90% interconnect!!

  47. Optimized Resources: Hard Cores • LUTs + DFFs can implement any arbitrary digital logic, but not optimally (ASICs make much better use of silicon area for power, speed, and routing resources) • Arithmetic • Add, Mult • On-chip memory • System-on-chip building blocks • Processor, PCI-express, Gigabit Ethernet, A/D
