CprE / ComS 583 Reconfigurable Computing

CprE / ComS 583Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #7 – Applications I

Recap – Fixed-Point Arithmetic • Addition, subtraction the same (Q4.4 example): • Multiplication requires realignment: 3.6250 0011.1010 + 2.8125 0010.1101 6.4375 0110.0111 3.6250 0011.1010 x 2.8125 0010.1101 00111010 00111010 00111010 00111010 10.1953125 1010.00110010 CprE 583 – Reconfigurable Computing

Recap – Capacity Trends Virtex-5 550 MHz 24M gates* Virtex-II Pro 450 MHz 8M gates* Virtex-4 500 MHz 16M gates* Virtex-II 450 MHz 8M gates Spartan-3 326 MHz 5M gates Virtex-E 240 MHz 4M gates Xilinx Device Complexity Virtex 200 MHz 1M gates XC4000 100 MHz 250K gates Spartan-II 200 MHz 200K gates Spartan 80 MHz 40K gates XC3000 85 MHz 7.5K gates XC5200 50 MHz 23K gates XC2000 50 MHz 1K gates 1985 1987 1991 1995 1998 1999 2000 2002 2003 2004 2006 Year CprE 583 – Reconfigurable Computing

Recap – FPGA Memory Resources Block Select+RAM Distributed LUT-based RAM CprE 583 – Reconfigurable Computing

Recap – Block Multipliers • Block multipliers (18b x 18b) arranged in columns near RAM CprE 583 – Reconfigurable Computing

Recap – Block Multipliers (cont.) • Synthesis tools can take larger multipliers and break them down into 18x18 multipliers CprE 583 – Reconfigurable Computing

Recap – FPGA CPU Resources • Built-in PowerPC 405 processor blocks CprE 583 – Reconfigurable Computing

Recap – Xilinx Virtex-4 FPGAs • Comes in three varieties: • Virtex-4 LX: most amount of LUTs • Virtex-4 FX: has PowerPCs like V2P • Virtex-4 SX: contains most amount of XtremeDSP slices • CLB structure similar to Virtex-II • Largest LX device – 89,088 slices = 178,176 4-LUTs! • FX devices limited to 2 PPC 405s like Virtex-II Pro • XTremeDSP Slices: • Same 18x18 block multiplier, now with optional pipelining • Includes built-in 48-bit accumulator for MAC operations CprE 583 – Reconfigurable Computing

Recap – Xilinx Virtex-5 FPGAs • CLB slices uses 6-input LUTs • Block RAMs now 36Kbits per block • DSP slices now support 25x18 MAC • Diagonal routing • LX, LXT, SXT, FXT varieties CprE 583 – Reconfigurable Computing

Outline • Recap • Splash / Splash 2 System Architecture • Splash 2 Programming Models • Splash 2 Applications • Text searching • Genetic pattern matching • Image processing CprE 583 – Reconfigurable Computing

Overview • An early well-known reconfigurable computer was Splash / Splash 2 • Implemented as linear, systolic array • Developed at Supercomputing Research Center (1990-1994) • Memory tightly coupled with each FPGA • Multiple Splash boards can be combined to form larger system CprE 583 – Reconfigurable Computing

Splash 1 • Born to solve DNA string matching • Operational in 1989 • Features • 32 Xilinx XC3090 FPGAs (420 Mλ2) • 32 128KB SRAMs (600 Mλ2) • VMEbus interface (FIFO 1 MHz clock) • 33 Gλ2 total (not counting interconnect) • Linear interconnect only CprE 583 – Reconfigurable Computing

Splash 1 Architecture VME Bus VSB Bus Interface Interface FIFO IN Control FIFO OUT F3 F2 F1 F0 F31 F30 F29 F28 M3 M2 M1 M0 M31 M30 M29 M28 M4 M5 M6 M7 M24 M25 M26 M27 F4 F5 F6 F7 F24 F25 F26 F27 F11 F10 F9 F8 F23 F22 F21 F20 M11 M10 M9 M8 M23 M22 M21 M20 M12 M13 M14 M15 M16 M17 M18 M19 F12 F13 F14 F15 F16 F17 F18 F19 CprE 583 – Reconfigurable Computing

Splash 2 • Operational in 1992 • Attached processor system • Features • 17 Xilinx XC4010 FPGAs (500 Mλ2) • 17 512KB SRAMs (2 Gλ2) • 9 TI SN74ACT8841s (16 port, 4b crossbar) • Sbus interface (< 100MB/s) • 43 Gλ2 total (not counting interconnect) • Supported 2 programming models: • Systolic • SIMD CprE 583 – Reconfigurable Computing

Splash 2 Architecture From Prev Board M5 M6 M7 M8 M0 M1 M2 M3 M4 F5 F6 F7 F8 F1 F2 F3 F4 Crossbar Switch F0 To Next Board F12 F11 F10 F9 F16 F15 F14 F13 M12 M11 M10 M9 M16 M15 M14 M13 CprE 583 – Reconfigurable Computing

Splash 2 Architecture (cont.) External Input Array Boards R-Bus S-Bus Interface Board Sparc-based Host SIMD S-Bus External Output CprE 583 – Reconfigurable Computing

Comparing Splash 1 and 2 Splash 1 Splash 2 Year 1989 1992 FPGA XC3090 XC4010 4-LUTs / Board 10,240 13,600 Max Input Rate 1 MB/s 100 MB/s Interconnect Linear Array Linear Array Broadcast Crossbar Memory / FPGA 128 KB 512 KB Area 33 Gλ2 43 Gλ2 CprE 583 – Reconfigurable Computing

Splash 2 Programming Models • Linear (systolic) array • All near-neighbor communication, pipelined • Very fast (at the time) speed of 20-30MHz achieved • All FPGAs run the same program • SIMD array • Instructions fanned out to all processing elements • Data across all elements collected at the end . . in … out F1 F2 F3 F16 F3 in F0 out … F1 F2 F3 F3 F16 CprE 583 – Reconfigurable Computing

Programming Splash 2 • Languages • VHDL • dbC – a C-like language capable of SIMD instructions • How many programs? • Host • Interface board (XL and XR) • Splash array board • X0 • X1 – X16 • TI crossbar chips • Five programs! • Requires intimate knowledge of the architecture CprE 583 – Reconfigurable Computing

Application #1 – Dictionary Search • Search through dictionary of words for data hit • Applicable to internet search engines / databases • Opportunities for search parallelism • Splash implementation uses systolic communication CprE 583 – Reconfigurable Computing

Dictionary Search (cont.) • Each FPGA used to look into local memory • Longer data words “hashed” into 18 bit address • Valid bit in memory indicates if data value is currently stored • Could be stored in several locations Mi-1 Mi Mi+1 Fi-1 Fi Fi+1 CprE 583 – Reconfigurable Computing

Example Hash Function Shift amount: 7 bits Hash function: 1100 1000 1010 0011 00 0000 0000 0000 0000 0000 Clear hash register 01 1010 0001 1101 00 Input the letters “th” --------------------------------- 10 1000 0011 0101 1100 0000 Temporary Result 10 0000 0101 0000 0110 1011 Result for “th” 00 0000 0001 1001 01 Input for letters “e_” ----------------------------------------- 01 0010 0110 0001 1110 1011 Temporary result 10 0101 1010 0100 1100 0011 Result for “the_” • XOR two character value with temp result and hash function • Rotate result • Different hash function for each FPGA CprE 583 – Reconfigurable Computing

Dictionary Search (cont.) • Distribute dictionary in parallel to all memories • Collect word values in FIFOs • Distribute words two characters at a time across all devices • Perform local hashing and lookup in parallel • Collect “hit” result at end • Splash 2 implementation results • 25 MHz • Three phases needed • Fetch 2 bit-sliced characters • Perform hash • Table look-up • Takes advantage of both systolic and SIMD modes CprE 583 – Reconfigurable Computing

Genetic Pattern Matching • Comparing strings by edit distance • Motivation: The Human Genome Project • Do two genetic strings match? • How are they related? • When biologists characterize a new sequence, they want to compare it to the (growing) database of known sequences • Abstraction: • What is the cost of transforming s into t • Given – costs for insertion, deletion, substitution CprE 583 – Reconfigurable Computing

Alphabet and Costs • Alphabet • Letters in the string. For DNA, there are four: • A (Adenine) • C (Cytosine) • T (Thymine) • G (Guanine) • Transformation Costs • Insert:1, Delete:1, Substitute:2, match:0 • Type of comparison • One target to many sources • One target to one source CprE 583 – Reconfigurable Computing

Substitution Example Word Move Cost baboon Delete ‘o’ 1 bab|on Substitute ‘o’ 2 bobon Insert ‘u’ 1 boubon Insert ‘r’ 1 bourbon Match? 0 bourbon Total cost: 5 CprE 583 – Reconfigurable Computing

Dynamic Programming Solution • Source sequence: s1, s2, … sm • Target sequence: t1, t2, … tn • di,j = distance between subsequence s1, s2, …si and subsequence t1, t2, …tj, where d0,0= 0 di,0 = di-1,0 + Delete(si) d0,j = d0,j-1 + Insert(tj) di-1,j + Delete(si) di,j = min di,j-1 + Insert(tj) di-1,j-1 + Substitute(si, tj) • Distance(Source, Target) = dm,n CprE 583 – Reconfigurable Computing

Dynamic Programming Example b o u r b o n 0 1 2 3 4 5 6 7 b 1 0 1 2 3 3 4 5 a 2 1 2 3 4 4 5 6 b 3 2 2 3 4 4 5 6 o 4 2 2 3 3 4 4 5 o 5 3 2 3 4 5 5 6 n 6 4 3 4 5 5 6 5 CprE 583 – Reconfigurable Computing

Parallelism on the Anti-Diagonal b o u r b o n 0 1 2 3 4 5 6 7 b 1 0 1 2 3 3 4 5 a 2 1 2 3 4 4 5 6 di-1,j-1 di-1,j b 3 2 2 3 4 4 5 6 o 4 2 2 3 3 4 4 5 o 5 3 2 3 4 5 5 6 n 6 4 3 4 5 5 6 5 di,j-1 di,j CprE 583 – Reconfigurable Computing

Bidirectional PE Example If (SCin != 0) and (TCin != 0) PEDist + Substitute(SCin, TCin) PEDist  min TDin + Delete(SCin) SDin + Insert(TCin) else-if (SCin != 0) PEDist  SDin else-if (TCin != 0) PEDist  TDin endif SCout  SCin TCout  TCin SDout  PEDist TDout  PEDist SCin SCout Bidir PE Bidir PE Bidir PE SDin SDout TCout TCin TDout TDin PEDist PEDist PEDist CprE 583 – Reconfigurable Computing

Bidirectional Summary • 16 CLBs/PE • 384 PEs/Board • 2,100 Million Cells/sec • Requires 2*(m+n) PEs • Uses only half the processors at any one time • Must stream both source and target for each comparison • Makes comparison against large DB impractical CprE 583 – Reconfigurable Computing

Genetic Search Performance • Nearly linear scaling in cell updates per second (CUPS) • Need to reuse array for large patterns Hardware CUPS λ Area CUP/λ2s Splash 2 x16 43,000M 0.60μ 500Mλ2x17x16 0.32 Splash 2 3,000M 0.60μ 500Mλ2x16 0.38 Splash 1 370M 0.60μ 420Mλ2x32 0.028 P-NAC (34) 500M 2.0μ 7.8Mλ2x34 1.9 CM-2 (64K) 150M ? CM-5 (32) 33M ? SPARC 10 1.2M 0.40μ 1.6GMλ2 0.00075 SPARC 1 0.87M 0.75μ 273Mλ2 0.0032 CprE 583 – Reconfigurable Computing

Application #3 – Image Processing • Reconfigurable computers well suited to image processing due to high parallelism and specialization (filtering) • Algorithms change sufficiently fast such that ASIC implementations become outdated • Examine two issues with Splash • Image compression • Image error estimation • Parallelize across array in SIMD and systolic fashion CprE 583 – Reconfigurable Computing

Gaussian Pyramid Operations • Gaussian Pyramid • Down sample image to compress image size for communication • Average over a set of points to create new point • Laplacian Pyramid • Determine error found from Gaussian Pyramid • Expand contracted picture and compare with original CprE 583 – Reconfigurable Computing

Gaussian Pyramids • Systolic array in which each device performs a separate function • Limited by clock rate of slowest device CprE 583 – Reconfigurable Computing

Pyramid Flow • Generates both Gaussian and Laplacian pyramid for 512 x 480 image in 22.7 ms at 15.7 MHz CprE 583 – Reconfigurable Computing

Other Image Processing • Target recognition • Break image into “chips” • Each chip passed through linear array in attempt to match with stored image • Images can be rotated, mirrored • Zoom in if suspicious object found CprE 583 – Reconfigurable Computing

Summary • Splash 2 effective due to scalability and programming model • Parameterizable applications benefit that are regular and distributed • High bandwidth effective for searching/signal processing • Challenges remain in software development CprE 583 – Reconfigurable Computing

CprE / ComS 583 Reconfigurable Computing

CprE / ComS 583 Reconfigurable Computing

Presentation Transcript

Computing at UF

Optical Encoder for a Game Steering Wheel May05-26

Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More

Component-based Computing implications for Application Architectures

Pervasive and Mobile Computing: A 3-tier Architecture

Course web page:

Reconfigurable Computing - Memory in FPGAs

Cellular Networks and Mobile Computing COMS 6998-7, Spring 2014

Distributed Cluster Computing Platforms

Cellular Networks and Mobile Computing COMS 6998-10, Spring 2013

Cellular Networks and Mobile Computing COMS 6998- 10, Spring 2013

Optical Computing

Introduction to Supercomputers, Architectures and High Performance Computing

Soft Computing

Reiner Hartenstein University of Kaiserslautern

Reconfigurable Computing: a New Business Model – and its Impact on SoC Design

Synthesis of Digital Microfluidic Biochips with Reconfigurable Operation Execution

Software Defined Networking COMS 6998-10, Fall 2014

Introduction to Parallel Computing

ENG6530 Reconfigurable Computing Systems