170 likes | 403 Vues
Hardware Acceleration of Applications Using FPGAs. Andy Nisbet Andy.Nisbet@cs.tcd.ie http://www.cs.tcd.ie/Andy.Nisbet Phone:+353-(0)1-608-3682 FAX:+353-(0)1-677-2204. Content. FPGAs? High-level language hardware compilation. High Performance Computing Using FPGAs??? HandelC.
E N D
Hardware Acceleration of Applications Using FPGAs Andy Nisbet Andy.Nisbet@cs.tcd.ie http://www.cs.tcd.ie/Andy.Nisbet Phone:+353-(0)1-608-3682 FAX:+353-(0)1-677-2204
Content • FPGAs? • High-level language hardware compilation. • High Performance Computing Using FPGAs??? • HandelC. • Research Directions at Trinity College using FPGAs.
FPGAs • FPGAs can be configured after the time of manufacture. • Configurable logic blocks, input/output blocks for connecting to external microchip pins and programmable interconnect. • Logic blocks can be configured/interconnected to form simple combinatorial logic structures, & complex functional units. • FPGAs can be used to provide a standalone solution to a task, or they can be used in tandem with a conventional microprocessor.
How are FPGAs Programmed • Conventional techniques use VHDL or Verilog. Require many low-level hardware details. • Synthesis tools can then convert the HDL into an EDIF netlist which can be placed and routed onto an FPGA device. A bit/configuration file is produced. • High-level languages such as HandelC and SystemC. A hardware compiler translates the specification into VHDL/Verilog, or an EDIF netlist.
Why use an FPGA? • Conventional microprocessors have a fixed architecture. • FPGAs can generate application specific logic where the balance and mix of functional units can be altered dynamically. • The number and type of functional units that can be instantiated on an FPGA are only limited by the silicon real estate available. • Potential to generate orders of magnitude speedup for computationally intensive algorithms, such as in signal/image-processing. • Maximum clock speed of FPGA is <= 400MHz, designs often ~50MHz.
High Performance Computing using FPGAs • Current FPGAs can instantiate multiple floating-point units. Applications work has focussed on using integer and fixed-point arithmetic. • Logarithmic Number System (LNS) ALU has single 20/32 bit precision with very small area in comparison to standard floating-point units. • Performance benefits for this system have already been demonstrated over Texas Instruments TMS320C3/4x 50MHz DSP processors on a 2million gate Xilinx FPGA device.
LNS on XC2V8000 • 14 independent 32bit ALUs (2-3GFlop Peak Estimate). • 336 independent 20bit ALUs (20-40GFlop Peak Estimate). • Clock 60 MHz (predicted), ADD and SUB latency 6 cycles, pipelined. MUL,DIV,SQRT =>depends just on the number of parallel 32bit adders, subtractors or bit shifters.
HandelC • Provided by Celoxica http://www.celoxica.com/ • Based on CSP/OCCAM (we’re not showing CSP aspects in this talk!) • C with hardware extensions. • Enables the compilation of programs into synchronous/clocked hardware. • Many other similar systems SystemC, Forge, JHDL, SA-C …
HandelC • Fork-join model of parallel computation, parallel statements are placed in a par{} block. • CSP aspect uses channels, useful for multiple clock domains. • Each statement takes ONE clock cycle to execute. • Clock cycle is determined after place and route and is 1/(longest logic gate + routing delay).
HandelC Simple Example // Original code takes one LONG clock cycle. unsigned int 32 x,a,b,c,d,e,f,g,h; x = a +b + c +d + e + f +g +h; // Parallelise and pipeline into something taking 3 SHORT cycles par{ // all statements inside the par block are executed in parallel temp1 = a+b; temp2 = c+d; temp3 = e+f; temp4 = g+h; // position 1 SYNCHRONISATION????? sum1 = temp1+temp2; sum2 = temp3 + temp4; // position 2 SYNCHRONISATION????? x = sum1+sum2; }
Porting/Optimising Applications for HandelC • Define variable storage & bit width, off-chip SRAM, on-chip FPGA synthesised registers/RAM/Block RAM. • Replace floating-point with LNS or fixed-point arithmetic. • Iterative optimisation process, apply high-level restructuring transformations=>see file.HTML
Efficient HandelC • Replace parallel for loops with parallel while loops. The loop increment can then execute in parallel. • Avoid the use of n-bit (n >> 1) comparators <,<=,>,>= and single-cycle multipliers. • Parallelise and pipeline code as far as possible. • Use dedicated on-chip resources such as multipliers (interface command/VHDL/Verilog). • Sequential statements not on the critical path can share functional units in order to reduce area requirements. • Optimise variable storage:-- registers, distributed RAM, block RAM, or off-chip in SRAM.
For -> While unsigned int 8 i; for(i = 0; i < 255; i++) { par{ … } } // becomes unsigned int 1 terminate = 0; while(!terminate) { par { terminate = (i == 254); i++; } }
Variable Storage unsigned int 32 i; // REGISTER unsigned int 23 j[40]; // ARRAY of REGISTERs (fully associative) ram unsigned 8 myRAM[16]; // single port DISTRIBUTED RAM mpram { // dual port DISTRIBUTED RAM ram unsigned int 8 readWrite[16]; // R/W rom unsigned int 8 readOnly[16]; // could be ram as well } myMPRAM; //to minimise logic for access to RAMs/ROMs USE registers myRam[aRegister] = aRegisterDataValue; // Adding with {block = 1}; Makes BLOCK RAM ram unsigned int 8 myBlockRAM[16] with {block = 1}; ram unsigned in 21 twoDim[12][8]; par { twoDim[0][aReg] = 0; twoDim[1][aReg] = 1; }
FPGA Research at TCD. • Interactive/Automatic iterative conversion from C to HandelC/SystemC/FORGE, prototype using SUIF/NCI (with David Gregg). • Application studies using lattice QCD, image segmentation, image processing (with Jim Sexton, Simon Wilson & Fergal Shevlin). Collision detection and telecommunication applications. • FPGA/SCI work, (Michael Manzke). • Exploitation of “striped CPU” FPGAs. • Numerical stability? Floating to 20/32 bit LNS and fixed-point. • New work, no results (yet!) focussed on compute bound applications, as PCI has poor IO.