

An Integrated Temporal Partitioning and Mapping Framework for Handling Custom Instructions on a Reconfigurable Functional Unit. Farhad Mehdipour†, Hamid Noori††, Morteza Saheb Zamani†, Kazuaki Murakami††, Mehdi Sedighi†, Koji Inoue††


Presentation Transcript


  1. An Integrated Temporal Partitioning and Mapping Framework for Handling Custom Instructions on a Reconfigurable Functional Unit
     Farhad Mehdipour†, Hamid Noori††, Morteza Saheb Zamani†, Kazuaki Murakami††, Mehdi Sedighi†, Koji Inoue††
     † Computer and IT Engineering Department, Amirkabir University of Technology, {mehdipur,szamani,msedighi}@aut.ac.ir
     †† Department of Informatics, Graduate School of Information Science and Electrical Engineering, Kyushu University, noori@c.csce.kyushu-u.ac.jp, {murakami,inoue}@i.kyushu-u.ac.jp

  2. Agenda
     • Introduction
     • General overview of the architecture
     • Generating Custom Instructions
     • Reconfigurable Functional Unit (RFU)
     • Tool Chain used for our quantitative approach
     • Integrated Temporal Partitioning and Mapping
     • The Integrated Framework
     • Incremental Temporal Partitioning Algorithm
     • Mapping Procedure
     • Experimental Results

  3. Introduction
     • Approaches for designing embedded SoCs
        • Application Specific Integrated Circuits (ASICs)
           • Higher performance
           • Lower power consumption
           • Not flexible
           • Expensive and time-consuming design process
        • General Purpose Processors (GPPs)
           • Availability of tools
           • Programmability
           • Low performance
           • High power consumption
        • Application Specific Instruction-set Processors (ASIPs)
           • More flexible than ASICs
           • Higher performance than GPPs
           • Long and costly design and verification
        • Extensible Processors
           • More flexibility
           • Significant non-recurring engineering costs

  4. General overview of the architecture: Adaptive Dynamic Extensible Processor
     • Base processor: N-way in-order general RISC (Fetch, Decode, Execute, Memory, Write; Reg File)
     • Augmented hardware
        • Profiler: detects start addresses of Hot Basic Blocks (HBBs)
        • Sequencer: switches between main processor and RFU
        • RFU: executes custom instructions

  5. Operation modes
     [Figure: three panels (Training Mode, Training Mode, Normal Mode), showing Applications, Processor, Profiler, RFU and Sequencer]
     • Training mode: binary-level profiling; detecting start addresses of HBBs
     • Training mode: running tools for generating custom instructions, generating configuration data for RFU and initializing sequencer table; binary rewriting
     • Normal mode: monitors PC and switches between main processor and RFU; executing CIs

  6. Integrating base processor with other components

  7. Generation of Custom Instructions
     • Custom instructions
        • Limited to one Hot Basic Block (HBB)
        • Exclude floating point, multiply, divide and load instructions
        • Include at most one STORE, at most one BRANCH/JUMP and all other fixed point instructions
     • Simple algorithm for generating custom instructions
     • For MiBench, HBBs usually contain 10 to 40 instructions
     • The custom instruction generator is intended to run on the base processor (in online training mode)

  8. Generating Custom Instructions
     Example HBB:
        4052c0 addiu $29,$29,-32
        4052c8 mov.d $f0,$f12
        4052d0 sw $18,24($29)
        4052d8 addu $18,$0,$6
        4052e0 sw $31,28($29)
        4052e8 sw $16,16($29)
        4052f0 mfc1 $16,$f0
        4052f8 mfc1 $17,$f1
        405300 srl $6,$17,0x14
        405308 andi $6,$6,2047
        405310 sltiu $2,$6,2047
        405318 addu $6,$6,$18
        405320 sltiu $2,$6,2047
        405328 lui $2,32783
        405330 and $17,$17,$2
        405338 andi $2,$6,2047
        405340 sll $2,$2,0x14
        405348 or $17,$17,$2
        405350 mtc1 $16,$f0
        405358 mtc1 $17,$f1
        405360 lw $31,28($29)
        405370 lw $16,16($29)
        405378 addiu $29,$29,32
        405380 jr $31
     • Find the biggest sequence of instructions in the HBB that can be executed on the ACC
     • Move instructions and append supportable instructions to the head of the detected instruction sequence, after checking flow-dependency and anti-dependency
     • Move instructions and append supportable instructions to the tail of the detected instruction sequence, after checking flow-dependency and anti-dependency
     • Rewrite the object code if instructions have been moved
     • Moving instructions should not modify the logic of the application
     • Custom instruction generation is done without considering any other constraints
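The first step on this slide (finding the biggest executable sequence in the HBB) can be illustrated with a minimal Python sketch. It is not the authors' tool. It assumes that "ACC" refers to the reconfigurable unit, that `mfc1`/`mtc1` count as unsupported floating-point instructions, and that the limits from slide 7 apply (no loads, multiplies, divides or floating point; at most one store and one branch/jump). The names `OPS`, `is_supported` and `longest_candidate` are hypothetical, and the head/tail instruction-moving steps are not modelled.

```python
# Opcode classes; the exact sets are an assumption made for this sketch.
LOADS = {"lw", "lh", "lb", "lhu", "lbu"}
STORES = {"sw", "sh", "sb"}
BRANCHES = {"jr", "j", "jal", "beq", "bne"}
MUL_DIV = {"mult", "multu", "div", "divu"}
FP_OR_COP1 = {"mov.d", "mfc1", "mtc1"}  # assumed unsupported

# Opcodes of the example HBB, in listing order.
OPS = ["addiu", "mov.d", "sw", "addu", "sw", "sw", "mfc1", "mfc1",
       "srl", "andi", "sltiu", "addu", "sltiu", "lui", "and", "andi",
       "sll", "or", "mtc1", "mtc1", "lw", "lw", "addiu", "jr"]


def is_supported(op):
    """True if the opcode may appear inside a custom instruction."""
    return op not in LOADS | MUL_DIV | FP_OR_COP1


def longest_candidate(ops):
    """Return (start, length) of the longest contiguous run of supported
    opcodes containing at most one store and at most one branch/jump."""
    best = (0, 0)
    start = stores = branches = 0
    for i, op in enumerate(ops):
        if not is_supported(op):
            # An unsupported opcode ends the current window.
            start, stores, branches = i + 1, 0, 0
            continue
        stores += op in STORES
        branches += op in BRANCHES
        # Shrink the window from the left until both limits hold again.
        while stores > 1 or branches > 1:
            stores -= ops[start] in STORES
            branches -= ops[start] in BRANCHES
            start += 1
        length = i - start + 1
        if length > best[1]:
            best = (start, length)
    return best


start, length = longest_candidate(OPS)
print(OPS[start:start + length])
```

Under these assumptions the selected run is the ten instructions from `srl` (405300) to `or` (405348).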

  9. Reconfigurable Functional Unit (RFU)
     • RFU is a matrix of Functional Units (FUs)
     • RFU has configuration memory
     • FUs support only logical operations, add/subtract, shifts and compare
     • RFU updates the PC after executing each CI
     • RFU has a variable delay which depends on the depth of the DFG of the custom instruction

  10. RFU Architecture: A Quantitative Approach
     • 22 programs of MiBench were chosen
     • SimpleScalar toolset was utilized for simulation
     • RFU is a matrix of FUs
        • No. of inputs
        • No. of outputs
        • No. of FUs
        • Width
        • Depth
        • Connections
        • Location of inputs and outputs
     • Coverage (mapping) rate: percentage of generated CIs that can be mapped on the RFU considering constraints
     • Considering frequency and weight in measurement
        • CI execution frequency
        • Weight (equal to the number of executed instructions)
        • Average = Σ (Freq × Weight) over all CIs
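One way to read the frequency-and-weight measurement is as a weighted mapping rate: each CI contributes Freq × Weight, and the rate is the share of that total contributed by mappable CIs. The Python sketch below is written under that assumption. The slide gives only the sum Σ Freq × Weight, so the normalisation, the function name `weighted_mapping_rate` and the sample numbers are all hypothetical.

```python
def weighted_mapping_rate(cis):
    """Percentage of executed instructions covered by mappable CIs.

    Each CI is a dict with 'freq' (execution frequency), 'weight'
    (number of instructions in the CI) and 'mappable' (bool).
    """
    total = sum(ci["freq"] * ci["weight"] for ci in cis)
    mapped = sum(ci["freq"] * ci["weight"] for ci in cis if ci["mappable"])
    return 100.0 * mapped / total if total else 0.0


# Made-up sample data, for illustration only.
sample = [
    {"freq": 100, "weight": 10, "mappable": True},
    {"freq": 50, "weight": 20, "mappable": False},
    {"freq": 10, "weight": 5, "mappable": True},
]
print(round(weighted_mapping_rate(sample), 2))
```

In this sample two of the three CIs are mappable, but the weighted rate is lower than two thirds because the rejected CI accounts for a large share of the executed instructions.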

  11. Tool Chain

  12. RFU Inputs (no constraint) [chart; surviving data labels: 96.37, 89.37, 98.48, 8]

  13. RFU Outputs (no constraint) [chart; surviving data labels: 96.58, 6]

  14. RFU Architecture
     • Distributing inputs in different rows
        • Row 1 = 7
        • Row 2 = 2
        • Row 3 = 2
        • Row 4 = 2
        • Row 5 = 1
     • Connections with variable length
        • row1 → row3 = 1
        • row1 → row4 = 1
        • row1 → row5 = 1
        • row2 → row4 = 1
     Synthesis results using Hitachi 0.18 μm
        Area: 1.1534 mm²
        Delay: 9.66 ns

  15. Generating Custom Instructions for the Target RFU
     • Our initial CI generator did not consider any constraints on the generated CIs and tried to generate CIs that were as large as possible.
     • Therefore, once the architecture was fixed, some of the generated CIs could not be mapped on the proposed RFU because of its constraints.

  16. Customizing CI generator for the Target RFU – First Approach (CIGen)
     • Some primary constraints of the RFU (number of inputs, number of outputs and number of nodes) were added to our CI generator tool so that it generates mappable CIs.
     • In this approach the CI generator is unaware of the results of the mapping process.
     • Some CIs may still fail to map to the RFU because of routing and connection constraints.

  17. Customizing CI generator for the Target RFU – Second Approach
     • Integrated Framework
        • Performs an integrated temporal partitioning and mapping process
        • Takes rejected CIs as input
        • Partitions them into appropriate mappable CIs
     • Advantages
        • All generated CIs are mappable
        • Uses a mapping-aware temporal partitioning process

  18. Integrated Framework – Incremental Temporal Partitioning Algorithm
     • Incremental Temporal Partitioning
        • The node with the highest ASAP level is selected and moved to the subsequent partition.
        • Node selection and moving order: 15, 13, 11, 9, 14, 12, 10, 8, 3 and 7.
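The incremental step on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes the DFG is given as a map from each node to its predecessors, and it replaces the real mapping attempt with a caller-supplied `is_mappable` predicate. The names `asap_levels` and `partition`, and the tie-break by node id, are hypothetical.

```python
def asap_levels(preds):
    """Return node -> ASAP level; nodes without predecessors are level 0."""
    levels = {}

    def level(n):
        if n not in levels:
            levels[n] = 1 + max((level(p) for p in preds[n]), default=-1)
        return levels[n]

    for n in preds:
        level(n)
    return levels


def partition(preds, is_mappable):
    """Split a DFG into an ordered list of partitions (sets of nodes).

    While the current partition is not mappable, the node with the highest
    ASAP level is moved to the subsequent partition. Such a node has no
    successor left in the current partition, so dependencies stay ordered.
    """
    levels = asap_levels(preds)
    remaining = set(preds)
    parts = []
    while remaining:
        current, moved = set(remaining), set()
        # A single-node partition is accepted as-is so the loop terminates.
        while len(current) > 1 and not is_mappable(current):
            victim = max(current, key=lambda n: (levels[n], n))  # tie-break by id (assumed)
            current.remove(victim)
            moved.add(victim)
        parts.append(current)
        remaining = moved
    return parts


# Toy DFG: nodes 1 and 2 feed node 3, which feeds nodes 4 and 5.
dfg = {1: [], 2: [], 3: [1, 2], 4: [3], 5: [3]}
print(partition(dfg, lambda nodes: len(nodes) <= 2))
```

Here the stand-in predicate accepts any partition of at most two nodes. In the integrated framework described above, that decision would come from the mapping process itself.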

  19. Mapping Custom Instructions
     • Mapping is the same as the well-known placement problem:
        • Determining the appropriate positions for DFG nodes on the RFU.
        • Assigning CI instructions to FUs is done based on the priority of the nodes.
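A greedy placement of this kind might look like the Python sketch below. The slide does not define the node priority, so the sketch assumes one: nodes are placed once all their predecessors are placed, in sorted order of their ids. It also assumes each node must sit in a row strictly below all of its predecessors, and it ignores routing and connection constraints. The function `place` and the row widths are hypothetical.

```python
def place(preds, row_widths):
    """Greedy placement of DFG nodes on rows of FUs.

    preds maps node -> list of predecessors; row_widths[r] is the number of
    FUs in row r. Returns node -> (row, column), or None if the DFG does not
    fit under this simple model.
    """
    position = {}
    free = list(row_widths)  # FUs still unused in each row
    pending = dict(preds)
    while pending:
        # Nodes whose predecessors are all placed, in assumed priority order.
        ready = sorted(n for n, ps in pending.items()
                       if all(p in position for p in ps))
        if not ready:
            return None  # cyclic input, not a valid DFG
        for n in ready:
            # First row below every predecessor that still has a free FU.
            min_row = 1 + max((position[p][0] for p in pending[n]), default=-1)
            row = next((r for r in range(min_row, len(free)) if free[r] > 0), None)
            if row is None:
                return None
            position[n] = (row, row_widths[row] - free[row])
            free[row] -= 1
            del pending[n]
    return position


# Toy DFG: a and b feed c, which feeds d.
toy = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
print(place(toy, [2, 1, 1]))
print(place(toy, [1, 1, 1]))
```

With one FU per row the toy DFG no longer fits, which is the kind of rejection that the partitioning step would respond to by splitting the CI.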

  20. An Example: Mapping of a CI on the RFU

  21. Customizing Mapping Tool
     • Spiral-shaped mapping is possible thanks to the horizontal connections in the third and fourth rows of the RFU

  22. CI lengths for MiBench applications

  23. Percentage of rejected CIs for CIGen

  24. Initial and final number of partitions

  25. Maximum critical path length for CIs

  26. Performance Evaluation
     • SimpleScalar was configured to behave as a MIPS32 4K processor. The base processor supports the MIPS instruction set.
     • 22 applications of MiBench

  27. Delay of RFU according to CI length • Synopsys Tools + Hitachi 0.18μm

  28. Speedup

  29. Conclusions
     • Proposing a reconfigurable functional unit for an Adaptive Dynamic Extensible Processor using a quantitative approach.
     • Developing an integrated framework for partitioning and mapping custom instructions for the proposed RFU.

  30. Thank you for your attention.
