IEEE International Symposium on Rapid System Prototyping – Montreal, Canada – October 4, 2013

SMASH: A Heuristic Methodology for Designing Partially Reconfigurable MPSoCs
Riccardo Cattaneo, Christian Pilato, Gianluca C. Durelli, Marco D. Santambrogio and Donatella Sciuto
Politecnico di Milano, Italy

What is an FPGA? • Hardware device that can be customized after fabrication to execute a specific functionality • Distinct hardware blocks run “intrinsically” in parallel on the device • Heterogeneous grid of interconnected components • Look-up tables (LUTs), block RAMs (BRAMs), digital signal processors (DSPs), switch matrices, input/output blocks (IOBs), etc. • Possibility to reuse resources by reconfiguring part of the logic at run time (partial reconfiguration)

Heterogeneous SoCs with FPGAs • Highly coupled heterogeneous systems • Zynq platform: dual-core ARM Cortex-A9 processor tightly coupled with a Xilinx Artix-7 FPGA • High-speed, low-latency reconfigurable interconnect [Figures: Avnet ZedBoard (Zynq-7000-based development board); coarse-grained overview of the Zynq-7000 All Programmable SoC]

Design Challenges and Motivation • The hardware engineer needs to: • partition the application into blocks (partitioning) • determine which parts are better executed in hardware (mapping and scheduling) • generate the system (architecture refinement) • The steps are strictly interdependent! • Partial reconfiguration allows reusing the same logic across different tasks • More tasks can be ported to hardware • Significant overhead must be taken into account [Figure labels: INPUT, SMASH]

SMASH: Proposed Methodology • Design Space Exploration • determines the proper mapping and scheduling • Architecture Refinement • customizes the architectural template to derive the corresponding platform

Mapping and Scheduling • Input: • Task graph (DAG) • Architectural template • Identifies resource constraints • Implementations • List of different trade-offs in terms of performance and resources • Output: • Implementation and component for each task • Order of execution

Implementation vs. Component • Each task can have multiple alternative implementations on the same component • Faster tasks usually require more resources • Some tasks can share implementations to execute the same functionality multiple times • Hardware reuse: no reconfiguration is required • Implementation is more related to functionality and resources • Component is more related to where the task is actually executed • Processor or hardware module
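To make the distinction concrete, here is a minimal Python sketch of a task with several alternative implementations that trade execution time against resources. The class, the function `fastest_fitting`, the implementation names and all numbers are hypothetical illustrations, not part of SMASH; picking the fastest implementation that fits a budget is only one simple policy, assumed here for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Implementation:
    name: str        # hypothetical label for one way of realizing a task
    exec_time: int   # estimated execution time (arbitrary units)
    resources: int   # abstract resource cost, e.g. LUTs (0 for software)

def fastest_fitting(impls, budget):
    """Return the fastest implementation whose cost fits the budget, or None."""
    feasible = [i for i in impls if i.resources <= budget]
    return min(feasible, key=lambda i: i.exec_time, default=None)

# Three alternatives for the same (made-up) task: faster ones cost more.
fir_impls = [
    Implementation("fir_sw", exec_time=100, resources=0),
    Implementation("fir_hw_small", exec_time=40, resources=300),
    Implementation("fir_hw_fast", exec_time=15, resources=900),
]

choice = fastest_fitting(fir_impls, budget=500)
```

With a budget of 500 resource units the sketch selects `fir_hw_small`; the component (processor or hardware module) on which the task runs is a separate decision.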

SMASH: Execution Overview • SMASH: Simultaneous MApping and Scheduling Heuristic [Flowchart: each SMASH iteration performs Generate trace, Schedule trace, Evaluate metrics and Store solution; at "Termination?", No starts another iteration and Yes returns the best solution]

Exploring Mapping and Scheduling • Exploration based on the Serial Generation Scheme (SGS) • Constructive approach to better handle design constraints • Decision is not taken if it would lead to a constraint violation • Different combinations of mapping and scheduling • Each decision represents a mapping of a task with respect to an implementation and a processing element • The order of selection represents the priority values for resolving scheduling conflicts on the resources

Ant Colony Optimization • Our proposed approach is based on Ant Colony Optimization (ACO) to limit unfeasible solutions • Cooperative behavior of the ants while searching • The ant has different possibilities at each step and takes stochastic decisions, composing a trace • Stochastic principles guarantee exploration (a probability is generated for each admissible decision at each step) • Feedback guarantees the exploitation of good parts of the solutions

Algorithm Overview • Pseudo-code of the proposed ACO-based exploration [Figure: pseudo-code listing, annotated with "Exploration: generating trace", "Mapping decision" and "Exploitation: updating global information"]

Stochastic Selection Process • At each decision point d, the probability of assigning a candidate j (task/communication) to a proper implementation point i (implementation + processing element) is: [Equation figure: two factors labeled "global heuristic" and "local heuristic"] • Global information G: feedback information • Probability that the decision leads to a good solution • Local heuristic L: problem-specific hint • “Adjusted” by the global heuristic if wrong • Roulette wheel and extraction of a combination i, j • A probability is generated iff the resources required by the resulting PEs can be satisfied by the architecture • There is always the possibility of adding a new PE or reusing an existing one (platform customization)
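The exact probability expression is not available in this text, so the following Python sketch assumes the form commonly used in ACO: the weight of each admissible (candidate, implementation point) pair is G^alpha * L^beta, normalized over all admissible pairs, followed by a roulette-wheel draw. The exponents `alpha` and `beta`, the function names and the example values are assumptions for illustration, not the formula from the slide.

```python
import random

def selection_probabilities(candidates, G, L, alpha=1.0, beta=1.0):
    """Normalize G^alpha * L^beta over the admissible candidates.

    `candidates` holds only admissible (candidate, implementation point)
    pairs, i.e. pairs whose resource needs the architecture can satisfy.
    """
    weights = {c: (G[c] ** alpha) * (L[c] ** beta) for c in candidates}
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("no admissible candidate with positive weight")
    return {c: w / total for c, w in weights.items()}

def roulette_wheel(probs, rng=random):
    """Draw one candidate with probability equal to its share of the wheel."""
    r = rng.random()
    cumulative = 0.0
    for cand, p in probs.items():
        cumulative += p
        if r < cumulative:
            return cand
    return cand  # guard against floating-point round-off

# Hypothetical decision point: task t1 on a processor or on a hardware module.
pairs = [("t1", "cpu"), ("t1", "hw")]
G = {("t1", "cpu"): 1.0, ("t1", "hw"): 3.0}   # global (feedback) information
L = {("t1", "cpu"): 1.0, ("t1", "hw"): 1.0}   # local heuristic
probs = selection_probabilities(pairs, G, L)
picked = roulette_wheel(probs, random.Random(0))
```

In this made-up setting the hardware option receives three quarters of the wheel because its global information is three times larger while the local heuristic is neutral.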

More about SMASH • Simultaneous MApping and Scheduling Heuristic [Flowchart: SMASH iteration with Generate trace, Schedule trace, Evaluate metrics, Store solution and the "Termination?" check; Yes returns the best solution]

Trace Generation and Evaluation • Evaluation is performed only on the complete trace • Updated version of the original task graph (TG), augmented with communications and reconfigurations • Reconfiguration is taken into account from the early stages of the design process • Possibility to include different evaluation methods • Analytical estimations vs. TLM simulations • Decisions composing the best solution are reinforced • Over time, the best trace is identified
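The reinforcement of the best solution's decisions can be sketched as a typical ACO update. The slides only state that those decisions are reinforced; the evaporation step, the parameter values and the function name `update_global_info` below are assumptions borrowed from standard ACO practice.

```python
def update_global_info(G, best_trace, evaporation=0.1, deposit=1.0):
    """Evaporate every entry, then reinforce the decisions of the best trace.

    G maps a decision (any hashable value) to its global information.
    A new dictionary is returned; the input is left unchanged.
    """
    updated = {d: (1.0 - evaporation) * g for d, g in G.items()}
    for decision in best_trace:
        updated[decision] = updated.get(decision, 0.0) + deposit
    return updated

# Hypothetical decisions "a" and "b"; only "a" is part of the best trace.
G0 = {"a": 1.0, "b": 1.0}
G1 = update_global_info(G0, best_trace=["a"])
```

After one update the decision in the best trace has more weight than the one that was left out, so later iterations are more likely to select it again.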

Scheduling Definition • Input: • Task graph (DAG) • Trace: ordered list of mapping decisions (task-component-implementation) • Output: • Start/end time estimates for each task • Goal: • Reduce total execution time
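A minimal Python sketch of this input/output contract follows. It is a plain list scheduler that uses the trace order as priority and ignores communication and reconfiguration tasks; it is an illustration under those assumptions, not the SMASH scheduler itself. The function name, the data layout and the example graph are hypothetical.

```python
def schedule(preds, trace, exec_time):
    """preds: task -> list of predecessor tasks (the DAG);
    trace: ordered (task, component) pairs; exec_time: task -> duration.
    Returns task -> (start, end)."""
    times = {}
    free_at = {}              # component -> time at which it becomes idle
    pending = list(trace)
    while pending:
        # Trace order acts as priority: take the first task that is ready.
        for idx, (task, comp) in enumerate(pending):
            if all(p in times for p in preds.get(task, [])):
                break
        else:
            raise ValueError("cyclic dependencies or task missing from trace")
        pending.pop(idx)
        ready = max((times[p][1] for p in preds.get(task, [])), default=0)
        start = max(ready, free_at.get(comp, 0))
        times[task] = (start, start + exec_time[task])
        free_at[comp] = times[task][1]
    return times

# Hypothetical diamond-shaped task graph mapped on a CPU and one hardware module.
preds = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
trace = [("A", "cpu"), ("B", "hw0"), ("C", "cpu"), ("D", "hw0")]
exec_time = {"A": 2, "B": 3, "C": 4, "D": 1}
times = schedule(preds, trace, exec_time)
makespan = max(end for _, end in times.values())
```

In this example B and C overlap on different components, D waits for the slower of the two, and the total execution time is 7 time units.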

Scheduling: Methodology Overview [Flowchart: the SMASH scheduler takes the task graph and the trace as input and runs three steps: Create extended task graph, Actual scheduling (assign times), Evaluate metrics; its outputs are the extended task graph and the metrics]

Extended TG: Communications • Explicit communication tasks are added based on the communication topology

Extended TG: Reconfigurations • A reconfiguration task is introduced iff: • Two processing tasks are mapped on the same component and • Their implementations are different, i.e., module cannot be reused • Insertion of a reconfiguration task: • New edges are introduced from all WRITEs exiting the source processing task to the reconfiguration • New edges are introduced from the reconfiguration to all the READs entering the target processing task
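The insertion rule can be sketched in Python as follows. The sketch assumes that the trace passed in contains only tasks mapped on hardware modules and that "two tasks on the same component" means consecutive tasks in trace order; the function name, the naming of reconfiguration tasks and the example data are hypothetical.

```python
def insert_reconfigurations(trace, writes_of, reads_of):
    """trace: ordered (task, component, implementation) triples
    (assumed to contain hardware-mapped tasks only).
    writes_of[t]: WRITE tasks exiting t; reads_of[t]: READ tasks entering t.
    Returns (reconfiguration task names, new edges)."""
    last_on = {}       # component -> (task, implementation) most recently mapped
    reconfs, edges = [], []
    for task, comp, impl in trace:
        prev = last_on.get(comp)
        # Same component, different implementation: the module cannot be reused.
        if prev is not None and prev[1] != impl:
            r = f"reconf_{prev[0]}_{task}"
            reconfs.append(r)
            edges += [(w, r) for w in writes_of.get(prev[0], [])]
            edges += [(r, rd) for rd in reads_of.get(task, [])]
        last_on[comp] = (task, impl)
    return reconfs, edges

# A and B share the "fir" implementation (hardware reuse); C needs "fft".
trace = [("A", "hw0", "fir"), ("B", "hw0", "fir"), ("C", "hw0", "fft")]
reconfs, edges = insert_reconfigurations(
    trace, writes_of={"B": ["wB"]}, reads_of={"C": ["rC"]}
)
```

Only the transition from B to C produces a reconfiguration task, placed between B's WRITE and C's READ; A and B reuse the same module without reconfiguration.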

Extended TG: Reconfigurations

Trace Evaluation Possibility to integrate different policies to generate the corresponding scheduling

Architecture Refinement • Actual platform instance is derived based on the resulting decisions • Hardware modules with only one task assigned are converted into static IP blocks • Hardware modules with more tasks assigned are represented as reconfigurable regions • Integration with the generation of the run time manager to manage reconfigurations • Still work in progress and manually performed
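The classification rule for hardware modules can be sketched in a few lines of Python. The sketch follows the rule as stated on the slide (one task gives a static IP block, more than one gives a reconfigurable region); the function name, the labels and the example mapping are hypothetical, and the generation of the run-time manager is not covered.

```python
from collections import defaultdict

def refine(trace):
    """trace: (task, hardware module) pairs taken from the final solution.
    One task on a module -> static IP block; several -> reconfigurable region."""
    tasks_on = defaultdict(list)
    for task, module in trace:
        tasks_on[module].append(task)
    return {
        module: "static_ip" if len(tasks) == 1 else "reconfigurable_region"
        for module, tasks in tasks_on.items()
    }

# Hypothetical solution: hw0 hosts one task, hw1 hosts two.
platform = refine([("A", "hw0"), ("B", "hw1"), ("C", "hw1")])
```

Here `hw0` becomes a static IP block and `hw1` a reconfigurable region.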

Experimental Evaluation • Synthetic benchmarks (TGFF) • Focus on scalability of the approach • Possibility to evaluate different task graph patterns • Resulting systems (platform instance and extended task graph with mapping/scheduling decisions) converted into virtual platforms • Validation of the different solutions assuming correctness of the execution • Simulations performed with Synopsys Platform Architect • VPU performance annotations extracted from tasks’ implementations

Experimental Setup • Three different classes of experiments: • Static: FPGA area is divided into a set of up to KS static IP cores (no partial reconfiguration) • Mixed: both IP cores and reconfigurable regions can be used, with an upper bound of KM IPs and RM reconfigurable regions • Reconfigurable: architectures with no more than KR regions • Reconfigurable regions can also be deployed as static cores in the final architecture if only one task is assigned to them

Experimental Results • Small task graphs cannot benefit from reconfiguration • Large task graphs are affected by communication overhead

Conclusions and Future Work • SMASH is an automated methodology to design reconfigurable systems • It determines the mapping and scheduling of the different tasks • It allows customizing the architectural template • Future work • Integration of floorplanning procedures to compute and validate physical constraints of the blocks • Automatic generation of the platform specification

End… http://www.fp7-faster.eu/
