Instructor: Dr. Phillip Jones (phjones@iastate) Reconfigurable Computing Laboratory

CPRE 583 Reconfigurable Computing Lecture 9: Wed 9/21/2011 (Reconfigurable Computing Architectures) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory, Iowa State University, Ames, Iowa, USA http://class.ece.iastate.edu/cpre583/

Announcements/Reminders • MP1: Due Friday (9/23); MP2 will be released on Friday as well. • Mini literary survey assigned • PowerPoint tree due: Fri 9/23 by class, so try to get it to me by the night of 9/22. My current plan is to summarize some of the class’s findings during class. • Final 5-10 page write-up on your tree due: Fri 9/30 midnight.

Literary Survey • Start by searching for papers from 2008-2011 on IEEE Xplore: http://ieeexplore.ieee.org/ (Advanced Search, Full Text & Metadata) • Find popular cross references for each area • For each area, try to identify 1 good survey paper • For each area, identify 2-3 core problems/issues • For each problem, identify 2-3 approaches for addressing it • For each approach, identify 1-2 papers that implement the approach

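Literary Survey: Example Structure [Figure: survey tree rooted at the area “Network Intrusion Detection”, branching into problems (P1-P3), each problem into approaches (A1-A3), and each approach into implementations (I1-I2)] • 5-10 page write-up on your survey tree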

Fall 2010 Student Example

Overview • Chapter 2 (Reconfigurable Architectures)

Common Questions

What you should learn • Basic trade-offs associated with different aspects of a Reconfigurable Architecture. (Chapter 2)

Reconfigurable Architectures • Main idea Chapter 2’s author wants to convey • Applications often have one or more small, computationally intense regions of code (kernels) • Can these kernels be sped up using dedicated hardware? • Different kernels have different needs. How do a kernel’s requirements guide design decisions when implementing a Reconfigurable Architecture?

Reconfigurable Architectures • Forces that drive a Reconfigurable Architecture • Price • Mass production: 100K to millions • Experimental: 1 to 10’s • Granularity of reconfiguration • Fine grain • Coarse grain • Degree of system integration/coupling • Tightly • Loosely • All are a function of the application that will run on the Architecture

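Example Points in (Price, Granularity, Coupling) Space [Figure: three axes, Price ($100’s to $1M’s), Granularity (Fine to Coarse), and Coupling (Loose to Tight). Example points: an Intel / AMD processor with an RFU placed among its pipeline units (Decode, Exec, Int, float, Store), and a PC connected over Ethernet to an ML507]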

What’s the point of a Reconfigurable Architecture? • Increase application performance! • Performance metrics • Computational • Throughput • Latency • Power • Total power dissipation • Thermal • Reliability • Recovery from faults

Typical Approach for Increasing Performance • Application/algorithm implemented in software • Often easier to write an application in software • Profile application (e.g. gprof) • Determine where the application is spending its time • Identify kernels of interest • e.g. application spends 90% of its time in function matrix_multiply() • Design custom hardware/instruction to accelerate kernel(s) • Analyze the kernel to determine how to extract fine/coarse grain parallelism (does any parallelism even exist?) • Amdahl’s Law!
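The profiling step above can be sketched in a few lines. The slide names gprof, which targets compiled programs; this sketch swaps in Python's standard cProfile module as a stand-in. The functions my_app(), load_input(), and matrix_multiply() are hypothetical placeholders, not code from the course.

```python
import cProfile
import io
import pstats


def matrix_multiply(a, b):
    """Naive O(n^3) product of two square matrices (lists of lists)."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def load_input(n):
    """Hypothetical cheap setup work: build an n x n test matrix."""
    return [[(i + j) % 7 for j in range(n)] for i in range(n)]


def my_app(n=40):
    """Hypothetical application: a little setup, then one hot kernel."""
    a = load_input(n)
    b = load_input(n)
    return matrix_multiply(a, b)


# Profile one run of the application.
profiler = cProfile.Profile()
profiler.enable()
my_app()
profiler.disable()

# Report the five entries with the largest cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

In the printed report, matrix_multiply should dominate the cumulative time, which marks it as the kernel worth considering for hardware acceleration.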

Amdahl’s Law: Example • Application My_app • Running time: 100 seconds • Spends 90 seconds in matrix_mul() • What is the maximum possible speedup of My_app if I place matrix_mul() in hardware? • Answer: even if matrix_mul() took zero time, the other 10 seconds remain, so at most 100/10 = 10x faster • What if the original My_app spends 99 seconds in matrix_mul()? • Answer: 1 second remains, so at most 100/1 = 100x faster • Good FPGA paper that illustrates increasing an algorithm’s performance with hardware: “NOVEL FPGA BASED HAAR CLASSIFIER FACE DETECTION ALGORITHM ACCELERATION”, FPL 2008 http://class.ece.iastate.edu/cpre583/papers/Shih-Lien_Lu_FPL2008.pdf
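The arithmetic in this example can be checked with a small sketch of Amdahl's Law. The function name amdahl_speedup and its parameters are illustrative choices, not from the slides; an infinite kernel speedup models the "maximum possible" case the slide asks about.

```python
def amdahl_speedup(total_time, kernel_time, kernel_speedup=float("inf")):
    """Overall speedup when only the kernel runs kernel_speedup times faster."""
    untouched_time = total_time - kernel_time          # part left in software
    new_time = untouched_time + kernel_time / kernel_speedup
    return total_time / new_time


# Upper bounds from the slide: the kernel takes (effectively) zero time.
best_case_90 = amdahl_speedup(100, 90)   # 100 / 10
best_case_99 = amdahl_speedup(100, 99)   # 100 / 1

# A finite, assumed 10x hardware kernel gives much less than the bound.
realistic_90 = amdahl_speedup(100, 90, kernel_speedup=10)  # 100 / (10 + 9)

print(best_case_90, best_case_99, round(realistic_90, 2))
```

The last line is a reminder that the 10x and 100x figures are ceilings: with a 10x faster matrix_mul(), the 90-second case improves by only about 5.3x overall.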

Granularity

Granularity: Coarse Grain • rDPA: reconfigurable Data Path Array • Functional units with programmable interconnects [Figure: example array of ALUs]

Granularity: Fine Grain • FPGA: Field Programmable Gate Array • Sea of general purpose logic gates [Figure: array of CLBs (Configurable Logic Blocks)]


Granularity: Trade-offs • Trade-offs associated with LUT size • Example: 2-LUT (4 = 2x2 bits) vs. 10-LUT (1024 = 32x32 bits) • Microprocessor [Figure, built up over several slides: 1024 bits of 2-LUTs compared with 1024 bits of 10-LUT for a microprocessor-style function with inputs op (4 bits), A (3 bits), and B (3 bits); successive builds add further copies of the op/A/B block]

Granularity: Trade-offs • Trade-offs associated with LUT size • Example: 2-LUT (4 = 2x2 bits) vs. 10-LUT (1024 = 32x32 bits) • Bit logic and constants: (A and “1100”) or (B or “1000”), with 4-bit A and B [Figure, built up over several slides: 1024 bits of 2-LUTs compared with 10-LUTs implementing the AND and OR operations, with the area that was required using 2-LUTs marked for comparison] • It’s much worse: each 10-LUT has only one output
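A minimal sketch of why small LUTs suit bit logic with constants. It models a k-input LUT as a truth table of 2**k bits and maps (A and “1100”) or (B or “1000”) onto one 2-LUT per output bit, with the constants folded into the tables. This folded mapping and the helper names (make_lut, lut_read, evaluate) are assumptions for illustration; the slide's figure may map the expression differently.

```python
from itertools import product


def make_lut(k, func):
    """Truth table (2**k bits) of a k-input, 1-output Boolean function."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=k)]


def lut_read(table, *inputs):
    """Look up one output bit; the first input is the most significant index bit."""
    index = 0
    for bit_value in inputs:
        index = (index << 1) | bit_value
    return table[index]


def bit(value, i):
    """Bit i (0 = least significant) of an integer."""
    return (value >> i) & 1


WIDTH = 4
C1, C2 = 0b1100, 0b1000  # the constants from the slide

# One 2-LUT per output bit: output bit i depends only on A[i] and B[i].
luts = [
    make_lut(2, lambda a, b, i=i: (a & bit(C1, i)) | (b | bit(C2, i)))
    for i in range(WIDTH)
]


def evaluate(a_word, b_word):
    """Evaluate the 4-bit expression using only the four 2-LUTs."""
    result = 0
    for i, table in enumerate(luts):
        result |= lut_read(table, bit(a_word, i), bit(b_word, i)) << i
    return result


# Exhaustive check against the direct bitwise expression.
matches = all(evaluate(a, b) == ((a & C1) | (b | C2))
              for a in range(16) for b in range(16))

bits_with_2luts = WIDTH * 2 ** 2    # four 2-LUTs of 4 bits each
bits_with_10luts = WIDTH * 2 ** 10  # one single-output 10-LUT per output bit
print(matches, bits_with_2luts, bits_with_10luts)
```

Under this mapping the 2-LUT version needs 16 configuration bits, while single-output 10-LUTs would spend 4096 bits on the same four output bits, most of them unused.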

Granularity: Example Architectures • Fine grain: GARP • Coarse grain: PipeRench

Granularity: GARP [Figure, built up over several slides: Garp chip block diagram with Memory, D-cache, I-cache, CPU, RFU, and Config cache. The RFU is drawn as N rows of PEs (Processing Elements), labeled Execution (16, 2-bit) and control (1)] • Example computations in one cycle: A<<10 | (b&c), (A-2*b+c)

Granularity: GARP • Impact of configuration size • 1 GHz bus frequency • 128-bit memory bus • 512 Kbits of configuration • On an RFU context switch, how long does it take to load a new full configuration? • Answer: about 4 microseconds (512 Kbits / 128 bits per bus cycle ≈ 4K bus cycles at 1 ns each) • An estimate of the time for the CPU to perform a context switch is ~5 microseconds • ~2x increase in context switch latency!! [Figure: Garp chip block diagram with Memory, D-cache, I-cache, CPU, RFU, and Config cache]
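The load-time estimate can be reproduced with a short calculation. This is a sketch under two assumptions not stated on the slide: 1 Kbit is taken as 1024 bits, and the bus delivers one full 128-bit word per cycle with no other overhead. The function name config_load_time is illustrative.

```python
def config_load_time(config_bits, bus_width_bits, bus_hz):
    """Seconds to stream a configuration over the bus, one word per cycle."""
    transfers = -(-config_bits // bus_width_bits)  # ceiling division
    return transfers / bus_hz


KBIT = 1024  # assumption: binary kilobits

load_seconds = config_load_time(512 * KBIT, 128, 1e9)  # 4096 bus cycles
cpu_switch_seconds = 5e-6                               # estimate from the slide

# Context switch latency with the RFU reload, relative to the CPU alone.
ratio = (cpu_switch_seconds + load_seconds) / cpu_switch_seconds

print(f"load: {load_seconds * 1e6:.3f} us, ratio: {ratio:.2f}x")
```

The ratio comes out a little under 2, which matches the slide's "~2x" once the roughly 4 microsecond reload is added to the ~5 microsecond CPU context switch.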

Granularity: GARP • Reference: “The Garp Architecture and C Compiler” • http://www.cs.cmu.edu/~tcal/IEEE-Computer-Garp.pdf

Granularity: PipeRench • Coarse granularity • Higher (higher) level programming • Reference papers • PipeRench: A Coprocessor for Streaming Multimedia Acceleration (ISCA 1999): http://www.cs.cmu.edu/~mihaib/research/isca99.pdf • PipeRench Implementation of the Instruction Path Coprocessor (Micro 2000): http://class.ee.iastate.edu/cpre583/papers/piperench_Micro_2000.pdf

Granularity: PipeRench [Figure: array of PEs, each containing an 8-bit ALU and a Reg file, joined by Interconnect layers and a Global bus]

Granularity: PipeRench [Figure, built up over several slides: a table of Cycle (1-6) versus Pipeline stage (0-4), filled in one entry at a time, drawn beside three rows of four PEs; the final builds begin a second table of Cycle (1-6) versus Pipeline stage (0-2)]
