1 / 38

CprE / ComS 583 Reconfigurable Computing

CprE / ComS 583 Reconfigurable Computing. Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #7 – Applications I. Recap – Fixed-Point Arithmetic. Addition, subtraction the same (Q4.4 example): Multiplication requires realignment:.

aitana
Télécharger la présentation

CprE / ComS 583 Reconfigurable Computing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CprE / ComS 583Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #7 – Applications I

  2. Recap – Fixed-Point Arithmetic • Addition, subtraction the same (Q4.4 example): • Multiplication requires realignment: 3.6250 0011.1010 + 2.8125 0010.1101 6.4375 0110.0111 3.6250 0011.1010 x 2.8125 0010.1101 00111010 00111010 00111010 00111010 10.1953125 1010.00110010 CprE 583 – Reconfigurable Computing

  3. Recap – Capacity Trends Virtex-5 550 MHz 24M gates* Virtex-II Pro 450 MHz 8M gates* Virtex-4 500 MHz 16M gates* Virtex-II 450 MHz 8M gates Spartan-3 326 MHz 5M gates Virtex-E 240 MHz 4M gates Xilinx Device Complexity Virtex 200 MHz 1M gates XC4000 100 MHz 250K gates Spartan-II 200 MHz 200K gates Spartan 80 MHz 40K gates XC3000 85 MHz 7.5K gates XC5200 50 MHz 23K gates XC2000 50 MHz 1K gates 1985 1987 1991 1995 1998 1999 2000 2002 2003 2004 2006 Year CprE 583 – Reconfigurable Computing

  4. Recap – FPGA Memory Resources Block Select+RAM Distributed LUT-based RAM CprE 583 – Reconfigurable Computing

  5. Recap – Block Multipliers • Block multipliers (18b x 18b) arranged in columns near RAM CprE 583 – Reconfigurable Computing

  6. Recap – Block Multipliers (cont.) • Synthesis tools can take larger multipliers and break them down into 18x18 multipliers CprE 583 – Reconfigurable Computing

  7. Recap – FPGA CPU Resources • Built-in PowerPC 405 processor blocks CprE 583 – Reconfigurable Computing

  8. Recap – Xilinx Virtex-4 FPGAs • Comes in three varieties: • Virtex-4 LX: most amount of LUTs • Virtex-4 FX: has PowerPCs like V2P • Virtex-4 SX: contains most amount of XtremeDSP slices • CLB structure similar to Virtex-II • Largest LX device – 89,088 slices = 178,176 4-LUTs! • FX devices limited to 2 PPC 405s like Virtex-II Pro • XTremeDSP Slices: • Same 18x18 block multiplier, now with optional pipelining • Includes built-in 48-bit accumulator for MAC operations CprE 583 – Reconfigurable Computing

  9. Recap – Xilinx Virtex-5 FPGAs • CLB slices uses 6-input LUTs • Block RAMs now 36Kbits per block • DSP slices now support 25x18 MAC • Diagonal routing • LX, LXT, SXT, FXT varieties CprE 583 – Reconfigurable Computing

  10. Outline • Recap • Splash / Splash 2 System Architecture • Splash 2 Programming Models • Splash 2 Applications • Text searching • Genetic pattern matching • Image processing CprE 583 – Reconfigurable Computing

  11. Overview • An early well-known reconfigurable computer was Splash / Splash 2 • Implemented as linear, systolic array • Developed at Supercomputing Research Center (1990-1994) • Memory tightly coupled with each FPGA • Multiple Splash boards can be combined to form larger system CprE 583 – Reconfigurable Computing

  12. Splash 1 • Born to solve DNA string matching • Operational in 1989 • Features • 32 Xilinx XC3090 FPGAs (420 Mλ2) • 32 128KB SRAMs (600 Mλ2) • VMEbus interface (FIFO 1 MHz clock) • 33 Gλ2 total (not counting interconnect) • Linear interconnect only CprE 583 – Reconfigurable Computing

  13. Splash 1 Architecture VME Bus VSB Bus Interface Interface FIFO IN Control FIFO OUT F3 F2 F1 F0 F31 F30 F29 F28 M3 M2 M1 M0 M31 M30 M29 M28 M4 M5 M6 M7 M24 M25 M26 M27 F4 F5 F6 F7 F24 F25 F26 F27 F11 F10 F9 F8 F23 F22 F21 F20 M11 M10 M9 M8 M23 M22 M21 M20 M12 M13 M14 M15 M16 M17 M18 M19 F12 F13 F14 F15 F16 F17 F18 F19 CprE 583 – Reconfigurable Computing

  14. Splash 2 • Operational in 1992 • Attached processor system • Features • 17 Xilinx XC4010 FPGAs (500 Mλ2) • 17 512KB SRAMs (2 Gλ2) • 9 TI SN74ACT8841s (16 port, 4b crossbar) • Sbus interface (< 100MB/s) • 43 Gλ2 total (not counting interconnect) • Supported 2 programming models: • Systolic • SIMD CprE 583 – Reconfigurable Computing

  15. Splash 2 Architecture From Prev Board M5 M6 M7 M8 M0 M1 M2 M3 M4 F5 F6 F7 F8 F1 F2 F3 F4 Crossbar Switch F0 To Next Board F12 F11 F10 F9 F16 F15 F14 F13 M12 M11 M10 M9 M16 M15 M14 M13 CprE 583 – Reconfigurable Computing

  16. Splash 2 Architecture (cont.) External Input Array Boards R-Bus S-Bus Interface Board Sparc-based Host SIMD S-Bus External Output CprE 583 – Reconfigurable Computing

  17. Comparing Splash 1 and 2 Splash 1 Splash 2 Year 1989 1992 FPGA XC3090 XC4010 4-LUTs / Board 10,240 13,600 Max Input Rate 1 MB/s 100 MB/s Interconnect Linear Array Linear Array Broadcast Crossbar Memory / FPGA 128 KB 512 KB Area 33 Gλ2 43 Gλ2 CprE 583 – Reconfigurable Computing

  18. Splash 2 Programming Models • Linear (systolic) array • All near-neighbor communication, pipelined • Very fast (at the time) speed of 20-30MHz achieved • All FPGAs run the same program • SIMD array • Instructions fanned out to all processing elements • Data across all elements collected at the end . . in … out F1 F2 F3 F16 F3 in F0 out … F1 F2 F3 F3 F16 CprE 583 – Reconfigurable Computing

  19. Programming Splash 2 • Languages • VHDL • dbC – a C-like language capable of SIMD instructions • How many programs? • Host • Interface board (XL and XR) • Splash array board • X0 • X1 – X16 • TI crossbar chips • Five programs! • Requires intimate knowledge of the architecture CprE 583 – Reconfigurable Computing

  20. Application #1 – Dictionary Search • Search through dictionary of words for data hit • Applicable to internet search engines / databases • Opportunities for search parallelism • Splash implementation uses systolic communication CprE 583 – Reconfigurable Computing

  21. Dictionary Search (cont.) • Each FPGA used to look into local memory • Longer data words “hashed” into 18 bit address • Valid bit in memory indicates if data value is currently stored • Could be stored in several locations Mi-1 Mi Mi+1 Fi-1 Fi Fi+1 CprE 583 – Reconfigurable Computing

  22. Example Hash Function Shift amount: 7 bits Hash function: 1100 1000 1010 0011 00 0000 0000 0000 0000 0000 Clear hash register 01 1010 0001 1101 00 Input the letters “th” --------------------------------- 10 1000 0011 0101 1100 0000 Temporary Result 10 0000 0101 0000 0110 1011 Result for “th” 00 0000 0001 1001 01 Input for letters “e_” ----------------------------------------- 01 0010 0110 0001 1110 1011 Temporary result 10 0101 1010 0100 1100 0011 Result for “the_” • XOR two character value with temp result and hash function • Rotate result • Different hash function for each FPGA CprE 583 – Reconfigurable Computing

  23. Dictionary Search (cont.) • Distribute dictionary in parallel to all memories • Collect word values in FIFOs • Distribute words two characters at a time across all devices • Perform local hashing and lookup in parallel • Collect “hit” result at end • Splash 2 implementation results • 25 MHz • Three phases needed • Fetch 2 bit-sliced characters • Perform hash • Table look-up • Takes advantage of both systolic and SIMD modes CprE 583 – Reconfigurable Computing

  24. Genetic Pattern Matching • Comparing strings by edit distance • Motivation: The Human Genome Project • Do two genetic strings match? • How are they related? • When biologists characterize a new sequence, they want to compare it to the (growing) database of known sequences • Abstraction: • What is the cost of transforming s into t • Given – costs for insertion, deletion, substitution CprE 583 – Reconfigurable Computing

  25. Alphabet and Costs • Alphabet • Letters in the string. For DNA, there are four: • A (Adenine) • C (Cytosine) • T (Thymine) • G (Guanine) • Transformation Costs • Insert:1, Delete:1, Substitute:2, match:0 • Type of comparison • One target to many sources • One target to one source CprE 583 – Reconfigurable Computing

  26. Substitution Example Word Move Cost baboon Delete ‘o’ 1 bab|on Substitute ‘o’ 2 bobon Insert ‘u’ 1 boubon Insert ‘r’ 1 bourbon Match? 0 bourbon Total cost: 5 CprE 583 – Reconfigurable Computing

  27. Dynamic Programming Solution • Source sequence: s1, s2, … sm • Target sequence: t1, t2, … tn • di,j = distance between subsequence s1, s2, …si and subsequence t1, t2, …tj, where d0,0= 0 di,0 = di-1,0 + Delete(si) d0,j = d0,j-1 + Insert(tj) di-1,j + Delete(si) di,j = min di,j-1 + Insert(tj) di-1,j-1 + Substitute(si, tj) • Distance(Source, Target) = dm,n CprE 583 – Reconfigurable Computing

  28. Dynamic Programming Example b o u r b o n 0 1 2 3 4 5 6 7 b 1 0 1 2 3 3 4 5 a 2 1 2 3 4 4 5 6 b 3 2 2 3 4 4 5 6 o 4 2 2 3 3 4 4 5 o 5 3 2 3 4 5 5 6 n 6 4 3 4 5 5 6 5 CprE 583 – Reconfigurable Computing

  29. Parallelism on the Anti-Diagonal b o u r b o n 0 1 2 3 4 5 6 7 b 1 0 1 2 3 3 4 5 a 2 1 2 3 4 4 5 6 di-1,j-1 di-1,j b 3 2 2 3 4 4 5 6 o 4 2 2 3 3 4 4 5 o 5 3 2 3 4 5 5 6 n 6 4 3 4 5 5 6 5 di,j-1 di,j CprE 583 – Reconfigurable Computing

  30. Bidirectional PE Example If (SCin != 0) and (TCin != 0) PEDist + Substitute(SCin, TCin) PEDist  min TDin + Delete(SCin) SDin + Insert(TCin) else-if (SCin != 0) PEDist  SDin else-if (TCin != 0) PEDist  TDin endif SCout  SCin TCout  TCin SDout  PEDist TDout  PEDist SCin SCout Bidir PE Bidir PE Bidir PE SDin SDout TCout TCin TDout TDin PEDist PEDist PEDist CprE 583 – Reconfigurable Computing

  31. Bidirectional Summary • 16 CLBs/PE • 384 PEs/Board • 2,100 Million Cells/sec • Requires 2*(m+n) PEs • Uses only half the processors at any one time • Must stream both source and target for each comparison • Makes comparison against large DB impractical CprE 583 – Reconfigurable Computing

  32. Genetic Search Performance • Nearly linear scaling in cell updates per second (CUPS) • Need to reuse array for large patterns Hardware CUPS λ Area CUP/λ2s Splash 2 x16 43,000M 0.60μ 500Mλ2x17x16 0.32 Splash 2 3,000M 0.60μ 500Mλ2x16 0.38 Splash 1 370M 0.60μ 420Mλ2x32 0.028 P-NAC (34) 500M 2.0μ 7.8Mλ2x34 1.9 CM-2 (64K) 150M ? CM-5 (32) 33M ? SPARC 10 1.2M 0.40μ 1.6GMλ2 0.00075 SPARC 1 0.87M 0.75μ 273Mλ2 0.0032 CprE 583 – Reconfigurable Computing

  33. Application #3 – Image Processing • Reconfigurable computers well suited to image processing due to high parallelism and specialization (filtering) • Algorithms change sufficiently fast such that ASIC implementations become outdated • Examine two issues with Splash • Image compression • Image error estimation • Parallelize across array in SIMD and systolic fashion CprE 583 – Reconfigurable Computing

  34. Gaussian Pyramid Operations • Gaussian Pyramid • Down sample image to compress image size for communication • Average over a set of points to create new point • Laplacian Pyramid • Determine error found from Gaussian Pyramid • Expand contracted picture and compare with original CprE 583 – Reconfigurable Computing

  35. Gaussian Pyramids • Systolic array in which each device performs a separate function • Limited by clock rate of slowest device CprE 583 – Reconfigurable Computing

  36. Pyramid Flow • Generates both Gaussian and Laplacian pyramid for 512 x 480 image in 22.7 ms at 15.7 MHz CprE 583 – Reconfigurable Computing

  37. Other Image Processing • Target recognition • Break image into “chips” • Each chip passed through linear array in attempt to match with stored image • Images can be rotated, mirrored • Zoom in if suspicious object found CprE 583 – Reconfigurable Computing

  38. Summary • Splash 2 effective due to scalability and programming model • Parameterizable applications benefit that are regular and distributed • High bandwidth effective for searching/signal processing • Challenges remain in software development CprE 583 – Reconfigurable Computing

More Related