A Framework for Effective Exploitation of Partial Reconfiguration in Dataflow Computing

International Workshop on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC13)
Riccardo Cattaneo∗, Xinyu Niu†, Christian Pilato∗, Tobias Becker†, Wayne Luk†, Marco D. Santambrogio∗
∗ Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano
† Department of Computing, Imperial College London

Motivations
• The design of heterogeneous, reconfigurable systems is a complex task
  • Adequate computer-aided design (CAD) tools are required
• One of the foreseen predominant platforms of the future is the multiprocessor system-on-chip (MPSoC)
  • Many heterogeneous cores on a single chip
• Typically, we want to accelerate an application or a class of applications on the MPSoC
  • The starting point should be the application, not the architecture alone
• Decisions in the frontend phase may strongly affect the backend implementation
  • Iterative exploration is a practical requirement
This is an ongoing project at Politecnico di Milano to assist in the design of such complex systems.

Contents
• Framework Overview
• Preliminary Results – Test Case
• Conclusions and Future Work

Framework Overview
• Inputs (single XML file):
  • Information about the target device
  • Application source files (.c) plus custom pragmas for additional information (e.g., task-level parallelism/kernels)
  • Architectural template to use
• Application Analysis:
  • Task graph generation
  • Dataflow graph (DFG) generation (per function)
• High Level Analysis:
  • Estimates of resource consumption for each node (DFG-based)
• Mapping and Scheduling:
  • Mapping, scheduling
  • Refinement of the architectural template
• Output:
  • Project files ready for synthesis with back-end tools

XML Exchange Format
• The entire project is contained in a single XML file:
  • Architecture: components’ characteristics (e.g., reconfigurable regions), …
  • Applications: source code files and profiling information
  • Library: task implementations with their characterization (time, resources, ...)
  • Partitions: task graph, mapping and scheduling, …
• It allows a modular organization of the framework, as well as the sharing of information among the different phases
• Specific details of the target platform are taken into account only in the final phase (interaction with backend tools)

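Task Graph Generation
• Application source code files can be analyzed to extract the task graphs
• Profiling information can drive the generation of such solutions
• The task graph is then specified in the XML file as processing nodes connected by data transfers

    /* BPP (bytes per output pixel) is not defined on the slide; 3 (RGB) is assumed here. */
    #define BPP 3

    /* Pragma marking this function as a task for the framework. */
    #pragma omp task
    void threshold(unsigned char *original1, unsigned char *result,
                   unsigned char thresh, int *p) {
        int DIMH  = p[0];   /* row width of the input image, in pixels */
        int minH1 = p[1];   /* horizontal bounds of the processed window */
        int maxH1 = p[2];
        int minV1 = p[3];   /* vertical bounds of the processed window */
        int maxV1 = p[4];
        int v, h;

        for (v = minV1; v < maxV1; v++)
            for (h = minH1; h < maxH1; h++) {
                if (original1[v*DIMH + h] > thresh) {
                    /* above threshold: white output pixel */
                    result[v*DIMH*BPP + h*BPP]     = 255;
                    result[v*DIMH*BPP + h*BPP + 1] = 255;
                    result[v*DIMH*BPP + h*BPP + 2] = 255;
                } else {
                    /* at or below threshold: black output pixel */
                    result[v*DIMH*BPP + h*BPP]     = 0;
                    result[v*DIMH*BPP + h*BPP + 1] = 0;
                    result[v*DIMH*BPP + h*BPP + 2] = 0;
                }
            }
    }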
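The idea of a task graph as "processing nodes connected by data transfers" can be sketched in C as follows. This is only an illustration: the `Transfer` and `TaskGraph` types, the `MAX_TASKS` limit and the `topo_order` function are hypothetical names, not part of the framework or of its XML schema. The function uses Kahn's algorithm to derive an execution order that respects every data transfer, which is the kind of information a scheduler would need.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 16   /* hypothetical upper bound for this sketch */

/* A data transfer (edge) from producer task `src` to consumer task `dst`. */
typedef struct { int src; int dst; } Transfer;

/* Hypothetical in-memory task graph; tasks are numbered 0..n_tasks-1. */
typedef struct {
    int n_tasks;
    size_t n_transfers;
    const Transfer *transfers;
} TaskGraph;

/* Kahn's algorithm: writes an order that respects all transfers into `order`.
 * Returns the number of tasks ordered; this equals n_tasks only if the
 * graph is acyclic. */
int topo_order(const TaskGraph *g, int order[MAX_TASKS]) {
    int indeg[MAX_TASKS] = {0};
    int done = 0, head = 0;

    assert(g->n_tasks >= 0 && g->n_tasks <= MAX_TASKS);

    /* Count incoming transfers of every task. */
    for (size_t e = 0; e < g->n_transfers; e++)
        indeg[g->transfers[e].dst]++;

    /* Tasks without inputs can start immediately. */
    for (int t = 0; t < g->n_tasks; t++)
        if (indeg[t] == 0) order[done++] = t;

    /* Release consumers once all their producers have been ordered. */
    while (head < done) {
        int t = order[head++];
        for (size_t e = 0; e < g->n_transfers; e++)
            if (g->transfers[e].src == t && --indeg[g->transfers[e].dst] == 0)
                order[done++] = g->transfers[e].dst;
    }
    return done;
}
```

For a four-task graph with transfers 0→1, 0→2, 1→3 and 2→3, `topo_order` fills `order` with 0, 1, 2, 3; a cyclic graph yields a return value smaller than `n_tasks`.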

Library Generation: a collection of different implementations
• LLVM-based compiler to extract the dataflow graph of each task
• Estimation of required resources (including bit-width analysis)
• Possibility to interact with HLS tools to obtain more accurate results (trading off design time against estimation accuracy)
• Generated implementations are then stored in the XML file to offer opportunities to the mapper and floorplacer
This is a joint effort of Politecnico di Milano and Imperial College London to integrate High Level Analysis techniques into the toolchain.
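A DFG-based resource estimate of the kind described above might look like the following C sketch. Everything here is an assumption made for illustration: the `DfgNode` type, the `lut_per_bit` cost table and the linear scaling with the parallelism level are hypothetical and do not reproduce the actual High Level Analysis model. The point is only that an estimate can be obtained by walking the graph once, which is why it takes seconds rather than a full synthesis run.

```c
#include <assert.h>
#include <stddef.h>

/* Operation kinds that may appear in a dataflow graph (illustrative subset). */
typedef enum { OP_ADD, OP_MUL, OP_CMP, OP_COUNT } OpKind;

/* One DFG node: an operation and the bit-width found by bit-width analysis. */
typedef struct { OpKind kind; unsigned bits; } DfgNode;

/* Hypothetical LUT cost per bit of operand width; real values would come
 * from characterising the target device. */
static const unsigned lut_per_bit[OP_COUNT] = { 1, 8, 1 };

/* Estimate LUT usage of one task implementation.
 * Assumption: processing `parallelism` pixels at once replicates the
 * datapath, so the cost scales linearly with the parallelism level. */
unsigned estimate_luts(const DfgNode *nodes, size_t n, unsigned parallelism) {
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) {
        assert(nodes[i].kind < OP_COUNT);
        total += lut_per_bit[nodes[i].kind] * nodes[i].bits;
    }
    return total * parallelism;
}
```

With one 8-bit adder, one 8-bit multiplier and one 8-bit comparator, this toy model gives 80 LUTs at parallelism 1 and 320 LUTs at parallelism 4.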

Mapping, Scheduling and Floorplacing
• We generate one or more configurations in which each task of the application is analyzed and assigned (via Mapping, Scheduling and Floorplacing, M/S/FP) to:
  • An available and admissible implementation
  • A component of the architecture (GPP, IP or reconfigurable region)
• This allows us to:
  • “share” implementations across different tasks (hardware sharing)
  • move a task implementation to another processing element at run-time (task relocation)
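The notion of an "admissible implementation" can be illustrated with a small C sketch. It assumes, purely for illustration, that each library entry is described by a parallelism level, a LUT count and a latency, and that admissibility means fitting in the LUT budget of a region; the `Impl` type and `pick_impl` function are hypothetical, and the real mapper explores many more dimensions (scheduling, floorplacing, sharing).

```c
#include <assert.h>

/* Hypothetical library entry for one task implementation. */
typedef struct {
    unsigned parallelism;  /* pixels processed at once */
    unsigned luts;         /* estimated area */
    unsigned cycles;       /* estimated latency */
} Impl;

/* Return the index of the lowest-latency implementation that fits in
 * `region_luts`, or -1 if none is admissible. */
int pick_impl(const Impl *impls, int n, unsigned region_luts) {
    int best = -1;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (impls[i].luts > region_luts)
            continue;                       /* does not fit: not admissible */
        if (best < 0 || impls[i].cycles < impls[best].cycles)
            best = i;
    }
    return best;
}
```

Given four made-up implementations with 1000, 2000, 4000 and 8000 LUTs and latencies of 8000, 4000, 2000 and 1000 cycles, a 5000-LUT region selects the third entry, while a 500-LUT region has no admissible implementation.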

Architecture Exploration
• During exploration, the target architecture can be refined by:
  • Adding/removing processing elements (reconfigurable regions)
  • Modifying their parameters
  • Determining the proper interconnection topology
• Each refinement can iteratively affect:
  • mapping and scheduling: the computational resources change (especially the number of reconfigurable regions)
  • floorplacing: resources may become scarcer or more plentiful because more or fewer components have to be floorplaced
• It allows a progressive and iterative refinement of the solution and a concurrent customization of both architecture and application
  • E.g.: mapping and floorplacing can suggest which resources should be added

Supported Platforms
• XUPV5 board with Virtex-5 XC5VLX110T (embedded)
  • Two XCF32P Platform Flash PROMs (32 Mbit each)
  • SystemACE™ Compact Flash configuration controller
  • 64-bit wide 256 MB DDR2 small outline DIMM (SODIMM)
• Maxeler MaxWorkstation (HPC system)
  • Intel Core i7-2600S @ 2.8 GHz, 16 GB RAM, 500 GB HDD
  • MAX3 dataflow engine (DFE)
  • Virtex-6 SX475T FPGA, 24 GB memory
  • DFE connected to CPU via PCI Express
[Figure: block diagrams of the two platforms. XUPV5: CPU0, CPU1, reconfigurable area, DDR2 (256 MB). MAX3 DFE: CPUs with DRAM (16 GB), interface FPGA, compute FPGA, DRAM (24 GB).]

Backend Toolchains
• Inputs:
  • DFGs for HW tasks
  • Mapping configurations (.xml)
  • Source code for CPU (.c)
• The code can always be further optimized by hand; e.g., glue code for data transfers
[Figure: backend flows for the two targets. MaxWorkstation: manual MaxJ implementations, DFG-MaxJ, HLS (MaxJ-VHDL), MaxIDE, bitstream generation (bit); CPU compiler (exec bin). FPGA-based embedded system: manual VHDL implementations, DFG-C, HLS (C-VHDL), bitstream generation (bit).]

Helper Graphical User Interface
• A practical GUI supports the designer, limits errors when interacting with the XML, and allows custom design methodologies

Preliminary Results: Edge Detection
• Edge detection application: 4 stages of computation
• Description based on C + custom #pragmas
• Extracted task graph and corresponding DFG of the first stage (Scale, 1x parallelism)
• We generate 4 implementations with different levels of parallelism and resource consumption for each of the 4 tasks of the application
  • “parallelism X”: X pixels processed at once
• Maxeler backend

Experimental Results / 1
• Static vs. reconfigurable design (both extracted using the framework)
• We limit the available area to 10k LUTs and implement the best-performing design
  • Reconfigurable (parallelism 8): R0: S,T; R1: B,E
  • Static (parallelism 4): IP0: S; IP1: B; IP2: E; IP3: T

Experimental Results / 2
• Reconfiguration time is automatically masked (when possible)
• Partial reconfiguration improves the performance of the application via automatic resource multiplexing
  • Automatic because different schedules are explored
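How reconfiguration time can be masked may be easier to see with a small timing model. The C sketch below is not the framework's scheduler; it rests on several assumptions that are stated in the comments: a chain of tasks executed one after another, two reconfigurable regions used alternately, both regions configured at start-up, and a single reconfiguration port. The function name `makespan_two_regions` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Total execution time of a chain of tasks on two reconfigurable regions.
 * Assumptions (illustrative only):
 *  - task i runs in region i % 2 and starts after task i-1 has finished;
 *  - the configurations of tasks 0 and 1 are loaded at start-up;
 *  - the region of task i (i >= 2) is reconfigured once task i-2 has
 *    finished and the single reconfiguration port is free, i.e. while
 *    task i-1 executes in the other region;
 *  - times are in arbitrary but consistent units. */
unsigned long makespan_two_regions(const unsigned *exec, size_t n,
                                   unsigned reconf) {
    unsigned long end_prev2 = 0;  /* finish time of task i-2 */
    unsigned long end_prev  = 0;  /* finish time of task i-1 */
    unsigned long port_free = 0;  /* time at which the port is idle again */

    for (size_t i = 0; i < n; i++) {
        unsigned long start = end_prev;
        if (i >= 2) {
            unsigned long rs = end_prev2 > port_free ? end_prev2 : port_free;
            port_free = rs + reconf;           /* reconfiguration completes */
            if (port_free > start)
                start = port_free;             /* part of it is not masked */
        }
        end_prev2 = end_prev;
        end_prev  = start + exec[i];
    }
    return end_prev;
}
```

With four tasks of 10 time units each, a reconfiguration time of 4 units is fully hidden behind execution (total 40 units, the same as with no reconfiguration cost), whereas a reconfiguration time of 15 units can only be partly hidden and stretches the total to 50 units.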

Experimental Results / 3
• High Level Analysis (HLA) estimates are fairly accurate, given that they are extracted in a matter of seconds on a commodity desktop machine
  • Values are averaged over the set of tasks
  • Average accuracy is > 85%
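The slide does not say how accuracy is computed. One plausible definition, used here only as an assumption, is one minus the relative error between the estimated and the measured value, averaged over the tasks. The C sketch below implements that definition; `mean_accuracy` is a hypothetical helper and the numbers in the usage note are made up, not measured results.

```c
#include <assert.h>
#include <stddef.h>

/* Mean accuracy of estimates against measured values.
 * Assumed definition: accuracy_i = 1 - |est_i - act_i| / act_i,
 * averaged over all n tasks. Measured values must be positive. */
double mean_accuracy(const double *est, const double *act, size_t n) {
    double sum = 0.0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        double err;
        assert(act[i] > 0.0);
        err = est[i] > act[i] ? est[i] - act[i] : act[i] - est[i];
        sum += 1.0 - err / act[i];
    }
    return sum / (double)n;
}
```

For example, estimates of 80 and 100 against measured values of 100 and 100 give accuracies of 0.8 and 1.0, and therefore a mean accuracy of about 0.9.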

Conclusions and Future Work
• We presented a modular framework to design heterogeneous, reconfigurable systems
  • Easy to plug in alternative methods for each phase
  • Possibility to perform progressive refinement of both application and architecture
• Critical part: multi-objective optimization strategy
  • Different experiments with different heuristics or possibly different algorithms
  • Easy to plug in different components
• This is becoming part of a larger project (ASAP – Advanced Synthesis of Applications and Platforms)
  • SystemC TLM backend for (co-)simulation and early validation
  • More architectural templates
  • Closer interaction with actual synthesis (e.g., high-level synthesis)
  • Automated methodologies to accelerate the design

Thank you!
Riccardo Cattaneo
rcattaneo@elet.polimi.it
Research partially funded by the European Community’s Seventh Framework Programme, FASTER project.
