
Presentation Transcript


  1. Re-configurable Parallel Stream Processor with self-assembling and self-restorable micro-architecture. Lev Kirischian, Irina Terterian, Pil Woo Chun and Vadim Geurkov. Embedded and Re-configurable Systems Lab, Ryerson University, Canada.

  2. Example of a multi-task data-flow workload where each task can run in different modes. [Timeline figure: over time, Task 1 runs in Mode 1, then Mode 2, then Mode 3; Task 2 runs in Mode 1, then Mode 2; Task 3 runs in a single mode; Task 4 runs in Modes 1, 3, 4 and 7.]
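Such a workload can be written down as an explicit schedule. The sketch below is an illustrative model (not from the slides; the task names, times, and `ModeRun`/`Task` types are invented) of tasks that move through modes over time:

```python
from dataclasses import dataclass

@dataclass
class ModeRun:
    mode: int
    start: float   # activation time
    end: float     # mode-switch or termination time

@dataclass
class Task:
    name: str
    runs: list     # ModeRun entries, ordered by start time

def active_tasks(workload, t):
    """Return (task name, mode) pairs that are active at time t."""
    return [(task.name, run.mode)
            for task in workload
            for run in task.runs
            if run.start <= t < run.end]

workload = [
    Task("Task 1", [ModeRun(1, 0, 4), ModeRun(2, 4, 6), ModeRun(3, 6, 9)]),
    Task("Task 2", [ModeRun(1, 1, 5), ModeRun(2, 5, 8)]),
]
print(active_tasks(workload, 4.5))  # Task 1 in mode 2, Task 2 in mode 1
```

A schedule like this is the input to the architecture discussed in the following slides: each (task, mode) pair needs its own hardware configuration for the interval during which it is active.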

  3. Usual Approach: Conventional Processors with Software-to-Task Optimization (Compilers + OS). Software-to-task optimization allows the use of conventional computing platforms with a fixed architecture (superscalar, VLIW, etc.) coupled with software compilers and an OS. Limitations of conventional processors: • If tasks are executed on a sequential computing system, the processing time often cannot meet specification requirements. • If tasks are executed on a parallel computing system with a fixed architecture, the cost-effectiveness of such parallel computers strongly depends on the task algorithms and data structures.

  4. Alternative Approach: Application Specific Processors (ASP) with Static Hardware-to-Task Optimization. An ASP can reach the required cost-performance parameters because the ASP architecture is optimized for the data-flow graph of the task and the task's data structure. Limitations of Application Specific Processors: • Performance decreases if the task algorithm or data structure changes • Limited possibility for further modernization • High cost for multi-task or multi-mode custom computing systems

  5. Proposed Approach: Reconfigurable Processor with Dynamic Architecture-to-Task Optimization. A high-performance computing system for multi-task data-flow applications should contain two major components: 1. A dynamically re-configurable computing platform based on partially-configurable FPGA devices, providing the maximum possible hardware flexibility. 2. A library of Application Specific Virtual Processors (ASVP): configuration bit-streams that program the on-chip application-specific processor circuitry for the period of time while the application (task) is active.
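The second component, the ASVP library, can be pictured as a lookup from (task, mode) to a configuration bit-stream. A minimal sketch, assuming hypothetical task names and file paths (none of these identifiers come from the slides):

```python
# Illustrative ASVP library: one configuration bit-stream per (task, mode).
asvp_library = {
    ("fft", 1): "bitstreams/fft_mode1.bit",
    ("fft", 2): "bitstreams/fft_mode2.bit",
    ("filter", 1): "bitstreams/filter_mode1.bit",
}

def configure(task, mode):
    """Select the bit-stream that programs the on-chip circuitry for this task/mode."""
    try:
        return asvp_library[(task, mode)]
    except KeyError:
        raise ValueError(f"no ASVP configured for {task} mode {mode}")

print(configure("fft", 2))  # bitstreams/fft_mode2.bit
```

The point of the design is that adding a task or a mode means adding one more bit-stream to the library, not redesigning the platform.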

  6. Architecture of Partially Reconfigurable FPGA Devices (Xilinx "Virtex" Family). [Figure: configuration data files are loaded into the internal configuration SRAM; the device is organized as I/O frames and CLB frames (Frame #1 … Frame #i … Frame #N) with Block RAM, interconnected by an internal (virtual) bus.] CLB (Configurable Logic Block): the uniform logic element of a frame and the smallest individually configurable component in the FPGA.

  7. Concept of the Application Specific Virtual Processor (ASVP). • An ASVP is a group of logic resources dedicated and optimally configured to reflect the algorithm and data structure of a task. • An ASVP is represented as a configuration data file (configuration bit-stream) that is downloaded into the FPGA when the task is to be activated.

  8. Life-cycle of an Application Specific Virtual Processor. 1. The ASVP core is downloaded to the reconfigurable platform before task activation. 2. The ASVP performs the task's data processing as long as necessary, without interruption or time-sharing of its dedicated logic resources with any other task. 3. After task completion, all resources included in the ASVP can be re-configured for any other task.
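The three steps above can be sketched as slot allocation on the reconfigurable platform. This is a hypothetical model (the `Platform` class and slot counts are invented for illustration), but it captures the life-cycle: resources are claimed at activation, held exclusively while the task runs, and released at completion:

```python
class Platform:
    """Reconfigurable platform with a fixed number of slots (CLB columns)."""
    def __init__(self, n_slots):
        self.free = set(range(n_slots))
        self.allocated = {}            # task -> set of slots

    def activate(self, task, slots_needed):
        # Step 1: download the ASVP core before task activation.
        if slots_needed > len(self.free):
            raise RuntimeError("not enough free slots")
        slots = {self.free.pop() for _ in range(slots_needed)}
        self.allocated[task] = slots
        return slots

    def complete(self, task):
        # Step 3: after completion, the resources can serve any other task.
        self.free |= self.allocated.pop(task)

p = Platform(8)
p.activate("Task 1", 3)
p.activate("Task 2", 2)   # Step 2: runs in parallel, no time-sharing of slots
p.complete("Task 1")
print(len(p.free))        # 6 slots free again
```

Note that `activate` never touches slots held by another task, which is exactly the "no interruption, no time-sharing" property of step 2.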

  9. ASVP Architecture-to-Task Optimization in a Partially Reconfigurable FPGA. [Figure: a data-flow graph (two XOR nodes feeding an adder) is mapped onto FPGA slots 1, 2, 3, … as virtual hardware components; data flows from Data In through the components to Data Out over the internal (virtual) bus.]
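The mapping in the figure can be sketched as a simple placement function. This is an illustrative sketch (node names and the one-node-per-slot assumption are mine, not from the slides): data-flow graph nodes become virtual hardware components assigned to consecutive FPGA slots, with the virtual bus carrying data between them:

```python
def place(dataflow_nodes, n_slots):
    """Assign each data-flow graph node to one FPGA slot, in graph order."""
    if len(dataflow_nodes) > n_slots:
        raise ValueError("graph does not fit the device")
    return {node: slot for slot, node in enumerate(dataflow_nodes, start=1)}

# The XOR/XOR/+ graph from the figure on a 4-slot region:
print(place(["XOR1", "XOR2", "ADD"], n_slots=4))
```

A real placer would also account for component widths and bus routing; the one-slot-per-node rule here only illustrates the idea that the ASVP's shape follows the task's data-flow graph.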

  10. Micro-architecture of a Virtual Hardware Component

  11. Virtual Hardware Component & Virtual Bus Interconnection. [Figure: the boundary of a virtual hardware component and its connections to the virtual bus lines.]

  12. Micro-architecture of the Application Specific Virtual Processor (ASVP). The micro-architecture of an ASVP is based on virtual hardware components interconnected via virtual bus lines.

  13. Parallel Task Processing on the Dynamically Re-configurable Stream Processor (DRSP). [Figure: three ASVPs (ASVP 1 for Task 1, ASVP 2, ASVP 3) run in parallel on functional units FU 1–FU 4 with reconfigurable interface modules RIM 1–RIM 4, connected through ports I/O 1–I/O 4 and the virtual bus; independent streams enter (Data in #1, #2) and leave (Data out #1, #2, #3).]

  14. DRSP: System-Level Architecture. [Figure: a host PC holds the task memory (Task 1: {Afix + Amodes} … Task h: {Afix + Amodes}) and connects over the PCI bus, through a PCI interface module, to the PRCP-based reconfigurable functional unit (Afix i + …); a cache memory holds the mode configurations {Amodes i}; the real-time hardware operating system (RT-HOS) drives the configuration & data bus; the data stream source feeds the unit and processed data leaves at Data Out.]

  15. Architecture of the Reconfigurable Computing Module. • Input LVDS ports: SPI, 2 × 3.43 Gbit/s (12 bit × 300 MHz) • Real-Time Hardware Operating System based on an XCV50E Virtex-E FPGA • LVTTL bus: 8.12 Gbit/s (64 bit × 133 MHz) • PCI interface: 800 Mbit/s • Reconfigurable Functional Unit [RFM 0111-002] • Configuration files / data cache (4 × 512 KB) • Output LVDS ports: SPI, 2 × 3.43 Gbit/s (12 bit × 300 MHz)

  16. Reconfigurable Computing Module based on the Xilinx "Virtex-E" family of FPGA devices.

  17. Restoration of an ASVP Using a Spare CLB Column. If a hardware fault occurs, the damaged virtual hardware component can be relocated to the reserved CLB column. [Figure: columns 1, 2, 3, …; the XOR and adder components of ASVP i are re-mapped around the faulty column through the communication field, preserving the input-to-output data path.]
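The restoration step can be sketched as a re-placement: every component mapped to the faulty column is re-downloaded into the spare column, while all healthy components keep their positions. This is a hypothetical sketch (the component names and single-spare-column assumption are mine):

```python
def restore(placement, spare, faulty):
    """placement: component -> CLB column; relocate components off the faulty column."""
    new_placement = dict(placement)
    for comp, col in placement.items():
        if col == faulty:
            new_placement[comp] = spare   # re-download the component's bit-stream
    return new_placement

placement = {"XOR_a": 1, "XOR_b": 2, "ADD": 3}
print(restore(placement, spare=4, faulty=2))  # XOR_b moves to column 4
```

Because only the relocated component's column is rewritten, the other components keep running, which is what makes the restoration compatible with the uninterrupted parallel execution described on slide 18.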

  18. When is the proposed technology most beneficial? • The workload consists of many tasks, and each task can run in different modes. • Each task requires high-speed data-stream processing. • Task algorithms may be modified within the life cycle of the system. • Active tasks must run in parallel and must not be interrupted when one of the tasks switches its mode or terminates. • The system can be remotely restored or self-restored even if a hardware fault occurs.

  19. DRSP Application for Networked Intelligent Manufacturing Systems. High-performance parallel data-stream processing (up to thousands of billions of operations per second) of large volumes of data (up to hundreds of gigabits) for: a) complex image processing and image recognition; b) spectrum analysis and digital signal processing; c) data transmission via LAN with data compression/decompression and encryption/decryption; d) control of high-performance manufacturing equipment and robotic systems.

  20. Acceleration of Task/Mode Switching. [Chart: acceleration (0–25×) versus the number of CLB slots in the virtual component (1–10).] Compared with an entire-FPGA-based system, task or mode switching accelerates as the number of CLB columns in the ASVP decreases, and can be more than 20 times faster.
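The shape of the chart follows from a simple assumption (mine, not stated on the slide as a formula): partial reconfiguration time scales with the number of CLB columns written, so switching speedup relative to full-device reconfiguration is roughly the ratio of total columns to ASVP columns:

```python
def switch_speedup(total_columns, asvp_columns):
    """Approximate task/mode switching speedup vs reconfiguring the entire FPGA,
    assuming reconfiguration time is proportional to the columns written."""
    return total_columns / asvp_columns

# e.g. a hypothetical 48-column device and a 2-column ASVP:
print(switch_speedup(48, 2))  # 24.0, consistent with the >20x on the slide
```

This first-order model ignores fixed per-configuration overheads, so it overestimates the speedup for very small ASVPs, but it explains why the curve on the slide falls off as the virtual component grows.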

  21. Minimization of Hardware Resources. Compared with entire-FPGA-based systems, the DRSP approach minimizes logic resources: as the number of tasks and task modes in a workload increases, the cost-effectiveness of the DRSP increases accordingly.

  22. SUMMARY: DRSP Compared with Conventional CPU, DSP and ASP Platforms.

                 Conv. CPU              DSP                    ASP
  Performance    Much lower than DRSP   Much lower than DRSP   Somewhat higher
  Flexibility    Lower than DRSP        Much lower than DRSP   None, or very little
  Reliability    Much lower than DRSP   Lower than DRSP        Lower than DRSP

  23. Thank you
