CPRE 583 Reconfigurable Computing Lecture 3: Wed 9/1/2010 (Reconfigurable Computing Hardware). Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA. http://class.ece.iastate.edu/cpre583/. Announcements/Reminders.
Announcements/Reminders
• Send me your top 3 topic choices for the mini literature survey.
• PowerPoint tree due: Fri 9/17 by class, so try to get it to me by the night of 9/16. My current plan is to summarize some of the class's findings during class.
• Final 5-10 page write-up on your tree due: Fri 9/24 at midnight.
Overview
• Logic
• Interconnect/Routing
• Optimized resources
  • Adders, Multipliers
  • Memory
  • System-on-chip building blocks
• Example Commercial FPGA structure
What you should learn • Basic understanding of the major components that make up an FPGA device.
Basic FPGA Architectural Components
• FPGA: Field Programmable Gate Array
• Sea of general purpose logic gates
[Figure: grid of Configurable Logic Blocks (CLBs)]
Computational Fabric - LUT
LUT = Look up Table
[Figure: 4-LUT with inputs A, B, C, D and output Z]
AND (Z = A and B and C and D):
ABCD  Z
0000  0
0001  0
...
1110  0
1111  1
OR (Z = A or B or C or D):
ABCD  Z
0000  0
0001  1
...
1110  1
1111  1
2:1 Mux (B is the select: Z = C when B = 1, Z = D when B = 0; A is a don't-care, X):
ABCD  Z
X000  0
X001  1
X010  0
X101  0
X110  1
X111  1
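The idea that a LUT is nothing more than a stored truth table can be sketched in a few lines of Python. This is an illustrative model, not vendor code: the helper names `make_lut` and `lut_eval` are made up for this example, and input A is assumed to be the most significant address bit.

```python
def make_lut(func, n_inputs=4):
    """Build the truth table (the LUT's configuration bits) for func."""
    table = []
    for index in range(2 ** n_inputs):
        # Unpack the table index into input bits, most significant (A) first.
        bits = [(index >> (n_inputs - 1 - i)) & 1 for i in range(n_inputs)]
        table.append(func(*bits) & 1)
    return table


def lut_eval(table, *bits):
    """Evaluate a configured LUT: the input bits form the table address."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return table[index]


# The three functions shown on the slide, each as 16 configuration bits.
and4 = make_lut(lambda a, b, c, d: a & b & c & d)
or4 = make_lut(lambda a, b, c, d: a | b | c | d)
# 2:1 mux: A is unused (X); B selects C when B = 1 and D when B = 0.
mux = make_lut(lambda a, b, c, d: c if b else d)
```

The same 16-bit memory implements all three circuits; only the stored bits differ.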
Computational Fabric - LUT
LUT = Look up Table
• How many 4-LUTs are needed to OR 32 bits? (Draw)
  • 11: eight 4-LUTs reduce the 32 inputs to 8 partial ORs, two more reduce those to 2, and one final 4-LUT produces the 1-bit result.
[Figure: tree of eleven 4-LUTs reducing 32 inputs to 1 output]
• How many 4-LUTs are needed to AND 2 bits with the 32-bit OR? (Draw)
  • Still 11: the final 4-LUT uses only two of its four inputs for the OR, so the 2 extra bits fit on its spare inputs.
• Write out the truth table for the final 4-LUT.
  • Taking A and B as the 2 AND bits and C and D as the two partial ORs, Z = A and B and (C or D):
ABCD  Z
0000  0
0001  0
0010  0
0011  0
0100  0
0101  0
0110  0
0111  0
1000  0
1001  0
1010  0
1011  0
1100  0
1101  1
1110  1
1111  1
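The LUT count for the 32-bit OR exercise can be checked with a short sketch. It assumes a plain reduction tree in which every 4-LUT ORs up to four signals, and it assumes A and B are the two AND bits while C and D carry the last two partial ORs; the function name `reduction_levels` is made up for this example.

```python
def reduction_levels(n_bits, lut_inputs=4):
    """Return the number of LUTs on each level of a reduction tree."""
    levels = []
    signals = n_bits
    while signals > 1:
        luts = -(-signals // lut_inputs)  # ceiling division
        levels.append(luts)
        signals = luts
    return levels


levels = reduction_levels(32)          # LUTs per level, inputs to output
total_luts = sum(levels)

# The last LUT only ORs two partial results, so two inputs are spare.
spare_inputs = 4 - levels[-2]

# Truth table of that last LUT once the 2 AND bits use the spare inputs:
# Z = A and B and (C or D), with A as the most significant address bit.
final_table = [
    ((i >> 3) & 1) & ((i >> 2) & 1) & (((i >> 1) & 1) | (i & 1))
    for i in range(16)
]
rows_with_one = [format(i, "04b") for i, z in enumerate(final_table) if z]
```

Under these assumptions the tree has levels of 8, 2, and 1 LUTs, and the AND costs no extra LUT.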
Computational Fabric - LUT
LUT = Look up Table
• How could one build a 4-LUT?
  • A 1x16 memory holds the truth table, and a 16:1 mux driven by the 4 inputs ABCD selects the stored bit that becomes Z.
[Figure: 1x16 memory feeding a 16:1 mux; ABCD drive the select lines, output Z]
• How many different 4-input functions can a 4-LUT implement?
  • $2^{16} = 65536$
• How many different N-input functions can an N-LUT implement?
  • $2^{2^N}$, since the memory is $1 \times 2^N$ bits
  • For N = 4: $2^{2^4} = 2^{16} = 65536$
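The memory-plus-mux construction and the function count can both be sketched directly. This is a behavioral model only: the 16:1 mux is written as a tree of 2:1 muxes, and the names `mux2`, `lut_read`, and `num_functions` are invented for illustration.

```python
def mux2(sel, in1, in0):
    """2:1 multiplexer: pass in1 when sel is 1, otherwise in0."""
    return in1 if sel else in0


def lut_read(memory, inputs):
    """Read an N-LUT built from a 1 x 2**N memory and a mux tree.

    memory: list of 2**N stored bits (the configuration).
    inputs: N input bits, most significant first (A, B, C, D for N = 4).
    """
    layer = list(memory)
    for sel in reversed(inputs):  # least significant input picks first
        layer = [mux2(sel, layer[i + 1], layer[i])
                 for i in range(0, len(layer), 2)]
    return layer[0]


def num_functions(n):
    """Number of distinct N-input functions: one per memory content."""
    return 2 ** (2 ** n)
```

Each of the `2 ** n` memory bits can be 0 or 1 independently, which is where the double exponent comes from.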
Granularity of Computation
• Trade-offs associated with LUT size
• Example: 2-LUT (4 = 2x2 bits) vs. 10-LUT (1024 = 32x32 bits)
[Figure: a 1024-bit area of 2-LUTs beside a 1024-bit 10-LUT]
• Microprocessor example
[Figure: block with a 4-bit op input, 3-bit inputs A and B, and a 3-bit output, drawn once per build step]
• Bit logic and constants example: (A and "1100") or (B or "1000"), with 4-bit A and B
[Figure: AND and OR stages on A and B, comparing the area required using 2-LUTs with the area required using 10-LUTs]
• With 10-LUTs it is much worse: each 10-LUT has only one output.
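To see why fine-grained LUTs suit the bit-logic example, the sketch below enumerates which input bits each output bit of (A and "1100") or (B or "1000") really depends on. It assumes 4-bit A and B and assumes constants are folded away before mapping, which the slide's figure may not do; the names `expr` and `support` are invented for this example.

```python
from itertools import product

MASK_A = 0b1100   # the constant "1100"
CONST_B = 0b1000  # the constant "1000"


def expr(a, b):
    """(A and "1100") or (B or "1000") on 4-bit values."""
    return (a & MASK_A) | (b | CONST_B)


def support(out_bit):
    """Return the input bits that output bit out_bit depends on."""
    deps = set()
    for a, b in product(range(16), repeat=2):
        base = (expr(a, b) >> out_bit) & 1
        for i in range(4):
            # Flip one input bit; if the output changes, it is a dependency.
            if ((expr(a ^ (1 << i), b) >> out_bit) & 1) != base:
                deps.add(f"A{i}")
            if ((expr(a, b ^ (1 << i)) >> out_bit) & 1) != base:
                deps.add(f"B{i}")
    return sorted(deps)


supports = {bit: support(bit) for bit in range(4)}

# Only outputs that combine two or more inputs need a LUT at all.
luts_needed = sum(1 for deps in supports.values() if len(deps) >= 2)
config_bits_2lut = luts_needed * 2 ** 2    # 4 bits per 2-LUT
config_bits_10lut = luts_needed * 2 ** 10  # 1024 bits per 10-LUT
```

No output bit depends on more than two inputs, so a 10-LUT spends 1024 configuration bits on a function that a 4-bit 2-LUT already covers.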
Computational Fabric - DFF
• LUTs can implement any arbitrary combinational function (output is ONLY a function of the current inputs).
• But what about sequential logic (output is a function of the inputs AND previous state information)?
• Need memory!
[Figure: 4-LUT with inputs A, B, C, D and output Z]
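Computational Fabric - DFF
DFF = D Flip Flop
[Figure: 4-LUT (inputs A, B, C, D) followed by a DFF, with signals labeled Z(t+1) and Z(t)]
• Example: detect the pattern "1101"
[Figure: state diagram with states Start, 1, 11, 110, 1101; edges labeled input/output (0/0, 1/0, 1/1)]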
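A "1101" detector can be modeled as a small state machine, with the state held in DFFs and the next-state and output logic held in LUTs. The sketch below is one possible encoding, not necessarily the slide's: it assumes a four-state Mealy machine that allows overlapping matches, whereas the slide's diagram also draws a "1101" state.

```python
# TRANSITIONS[state][input_bit] -> (next_state, output_bit)
# State names record the longest matched prefix of "1101".
TRANSITIONS = {
    "Start": {0: ("Start", 0), 1: ("1", 0)},
    "1":     {0: ("Start", 0), 1: ("11", 0)},
    "11":    {0: ("110", 0),   1: ("11", 0)},
    # Seeing 1 after "110" completes the pattern; the trailing 1 is
    # kept as the first bit of a possible next match.
    "110":   {0: ("Start", 0), 1: ("1", 1)},
}


def detect_1101(bits):
    """Return one output bit per input bit; 1 marks the end of "1101"."""
    state = "Start"
    outputs = []
    for bit in bits:
        state, out = TRANSITIONS[state][bit]
        outputs.append(out)
    return outputs
```

For example, `detect_1101([1, 1, 0, 1, 1, 0, 1])` flags the fourth and seventh bits, because the two matches share a 1.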
Computational Fabric - DFF
DFF = D Flip Flop
[Figure: 4-LUT (inputs A, B, C, D) followed by a DFF, with signals labeled Z(t+1) and Z(t)]
• Increase circuit performance (pipelining)
  • Without pipelining: 4 LUT delays per output
[Figure: chain of four 4-LUTs]
  • With pipelining: 1 DFF delay per output
[Figure: chain of four 4-LUTs with a DFF after each LUT]
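The effect of pipelining on clock period can be estimated with a simple timing model. The delay values below are hypothetical placeholders chosen for easy arithmetic, not device data, and the model ignores routing delay; the function name `timing` is invented for this example.

```python
def timing(n_stages, lut_delay_ns, dff_delay_ns, pipelined):
    """Estimate clock period and latency for a chain of LUTs.

    Unpipelined: all LUTs sit between one pair of registers.
    Pipelined: a DFF follows every LUT, so only one LUT delay
    (plus the DFF overhead) must fit in a clock period.
    """
    if pipelined:
        clock_period_ns = lut_delay_ns + dff_delay_ns
        latency_cycles = n_stages
    else:
        clock_period_ns = n_stages * lut_delay_ns + dff_delay_ns
        latency_cycles = 1
    return {
        "clock_period_ns": clock_period_ns,
        "latency_ns": latency_cycles * clock_period_ns,
        "outputs_per_us": 1000.0 / clock_period_ns,
    }


# Hypothetical delays: 1.0 ns per LUT, 0.25 ns per DFF.
flat = timing(4, 1.0, 0.25, pipelined=False)
piped = timing(4, 1.0, 0.25, pipelined=True)
```

In this model the pipelined chain produces outputs more often, at the cost of a longer total latency per result.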
Communication: Interconnect & Routing
• Need a mechanism to move results of computation around.
[Figure: grid of CLBs]
• Nearest Neighbor
[Figure: row of CLBs, each wired to its adjacent CLBs]
• Segmented
[Figure: row of CLBs connected by wire segments]
• Hierarchical
[Figure: CLBs grouped and connected through levels of routing]
Optimized Resources: Dedicated Logic
• LUTs + DFFs can implement any arbitrary digital logic, but not optimally (ASICs make much better use of silicon area for power, speed, and routing resources).
• Arithmetic
  • Add, Multiply
• On-chip memory
• System-on-chip building blocks
  • Processor, PCI-express, Gigabit Ethernet, ADC, etc.
Optimized Resources: Dedicated Logic
Fast Addition
• Generate/propagate logic
• Carry look-ahead
• Two-output LUT
• Dedicated routing resources for carry in and carry out
[Figure: CLBs of 6-LUTs computing Sum 1, Sum 2, Sum 3 from A1..A3 and B1..B3, with generate (G1, G2) and propagate (P1, P2) signals producing Carry 1, Carry 2, and c4]
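The generate/propagate formulation behind fast addition can be written out as follows. This is a bit-level software sketch under the usual textbook definitions (generate = A and B, propagate = A xor B), not a description of any specific FPGA carry chain; the function names are invented for illustration.

```python
def lookahead_carry(g, p, carry_in, i):
    """Carry into bit i, expressed directly from g, p and carry_in.

    A carry reaches bit i if some lower bit j generates one and every
    bit between j and i propagates it, or if carry_in is propagated
    by all lower bits.
    """
    carry = 0
    for j in range(i):
        term = g[j]
        for k in range(j + 1, i):
            term &= p[k]
        carry |= term
    term = carry_in
    for k in range(i):
        term &= p[k]
    return carry | term


def cla_add(a, b, carry_in=0, width=4):
    """Add two width-bit numbers; return (sum, carry_out)."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]  # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # propagate
    total = 0
    for i in range(width):
        total |= (p[i] ^ lookahead_carry(g, p, carry_in, i)) << i
    return total, lookahead_carry(g, p, carry_in, width)
```

Because every carry is a flat function of the inputs, no carry has to wait for the one below it, which is what the dedicated carry logic exploits in hardware.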
Optimized Resources: Dedicated Logic
Embedded Memory
• 96 bits, 300 MHz
[Figure: region labeled 8 by 12]
• Dedicated memory block: 18 Kbits, 550 MHz
[Figure: dedicated memory block in a region labeled 8 by 12]
Optimized Resources: Dedicated Logic
Multiplication
• 18x18 multiply on Virtex-5 (6-LUTs)
• Very rough estimate of the silicon area comparison (assuming the SX95 and LX110 have about the same die size)
[Figure: an 18x18 multiplier built from 6-LUTs beside a dedicated 18x18 multiplier]
• In other words, you can replace one LUT-based 18x18 multiplier with 100 dedicated 18x18 multipliers!
Optimized Resources: Dedicated Logic
Processor
• PowerPC hard-core
  • 500 MHz
  • Superscalar
  • High-speed 2x5 switch fabric
• MicroBlaze soft-core
  • 250 MHz
  • Simple scalar
Optimized Resources: Dedicated Logic
System on Chip
[Figure: system on chip combining dedicated logic and reconfigurable logic; labeled blocks: RAM, ADC, Sensor, Matrix Multiplier Coprocessor, Sensor, Motor, Data Buffer, PID Controller, Ethernet MAC]
• Also see Actel Fusion: http://www.actel.com/products/fusion/default.aspx
Xilinx CLB Architecture • Virtex 5 FPGA User Guide
Computational Fabric - LUT
• N-LUT: 3-, 4-, ... 6-, ... 8-LUT
• AND, XOR, NOT
• Exercises
  • How many 4-LUTs to OR 32 bits (draw)
  • How many 4-LUTs to AND 2 bits with the OR of these 32 bits (draw)
  • Draw the truth table for the 4-LUT that gives the final output
  • How could one implement a LUT (Memory + MUX)
  • How many ways can a 4-LUT be programmed
  • How many ways can an N-LUT be programmed
• Granularity trade-off: functionality vs. propagation delay (2-LUT -> CPU), bit-level vs. datapath
Computational Fabric - DFF
• Enables building circuits that can store information (sequential circuits, state machines)
• Enables pipelining to increase operating frequency/throughput
Communication: Interconnect & Routing
• Need a mechanism to move the results of a LUT to other LUTs.
• Island style (Array of CB)
• Nearest neighbor (paper on a reconfigurable architecture that uses this)
  • Not scalable (large delays, and uses logic elements for routing?)
• Segmented (different lengths for latency trade-off)
  • Multi-hop scales < O(N)?
  • Avoids using logic
• Hierarchical (good for apps with lots of local communication and little remote communication)
• Typically, an FPGA's silicon area will be 10% logic and 90% interconnect!
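The scaling difference between nearest-neighbor and segmented routing can be illustrated with a toy hop-count model. It assumes CLBs sit on an integer grid, that a nearest-neighbor route advances one CLB per hop, and that segmented routing offers wires of lengths 1 and 4 (hypothetical values); real routing fabrics are far more constrained.

```python
def nearest_neighbor_hops(src, dst):
    """Hops when each wire only reaches an adjacent CLB."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def segmented_hops(src, dst, segment_lengths=(1, 4)):
    """Greedy count of wire segments used, longest segments first.

    segment_lengths must include 1 so any distance can be covered.
    """
    hops = 0
    for s, d in zip(src, dst):
        remaining = abs(s - d)
        for length in sorted(segment_lengths, reverse=True):
            hops += remaining // length
            remaining %= length
    return hops


near = nearest_neighbor_hops((0, 0), (7, 5))
seg = segmented_hops((0, 0), (7, 5))
```

In this model the nearest-neighbor hop count grows linearly with distance, while longer segments cover most of the distance in a few hops.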
Optimized Resources: Hard Cores
• LUTs + DFFs can implement any arbitrary digital logic, but not optimally (ASICs make much better use of silicon area for power, speed, and routing resources).
• Arithmetic
  • Add, Mult
• On-chip memory
• System-on-chip building blocks
  • Processor, PCI-express, Gigabit Ethernet, A/D