Lin, Hai Fei, Yunsi

Exploring Custom Instruction Synthesis for Application-Specific Instruction Set Processors with Multiple Design Objectives
Lin, Hai; Fei, Yunsi
ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED), 2010
Date: 2010/05/20
吳俊雄

OUTLINE
• INTRODUCTION
• MULTI-OBJECTIVE ASIP DESIGN
• Two Algorithms for Custom Instruction Synthesis
  • Mixed Integer Linear Programming
  • Simulated Annealing Method
• EXPERIMENTAL RESULTS

INTRODUCTION
• Traditional custom instruction synthesis flows for ASIPs mainly target performance improvement.
• We show that the existing custom instruction exploration algorithms
  • Mixed Integer Linear Programming (MILP)
  • Simulated Annealing Method
• And cost estimation methods
  • Performance improvement
  • Energy efficiency
  • Area overhead

INTRODUCTION
• The work presented in this paper has three major contributions:
  • We address the importance of energy and resource efficiency in ASIP design.
  • We discuss a set of key factors in custom instruction selection.
  • We show that traditional design space exploration algorithms are either infeasible or inefficient at estimating all the necessary factors.
    • Since the theoretical complexity of exploring the design space thoroughly is O(2^n), most practical techniques adopt heuristics to prune the design space during the search.
  • We present a holistic ASIP synthesis and simulation flow that allows the optimization goal to be adjusted between energy efficiency, area overhead, and performance.

MULTI-OBJECTIVE ASIP DESIGN
• There are two major energy factors:
  • Instruction fetch consumes a considerable portion of the total energy within a processor.
  • The data communication between operations is originally implemented through register file accesses within the base processor.
• The dynamic energy consumption is therefore affected by reducing the number of instructions and the number of data register file accesses.

MULTI-OBJECTIVE ASIP DESIGN
Custom processor 1 with CFU1 achieves better performance improvement, because it utilizes operation parallelism in the DFG to reduce the total execution cycles. Custom processor 2 with CFU2 achieves larger energy saving, because it realizes a sub-graph covering more operations and data transfer edges.

MULTI-OBJECTIVE ASIP DESIGN
We show that generating custom instructions from a DFG can be viewed as solving an operation scheduling problem. The scheduling scheme should preserve data dependencies and ensure that the input/output edges of each software stage satisfy the I/O constraint set by the register file ports. For a given scheduling scheme, the number of software stages that contain operations represents the number of instructions for the customized processor. The edges that cross different software stages represent register file accesses.
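The scheduling view above can be made concrete with a small sketch. It is illustrative only and not the authors' implementation: the function name `evaluate_schedule`, the dictionary-based DFG encoding, and the default port limits (4 inputs, 2 outputs) are assumptions, and primary inputs and outputs of the DFG are ignored for brevity.

```python
from collections import defaultdict

def evaluate_schedule(stage, edges, max_inputs=4, max_outputs=2):
    """Score one schedule of a DFG (hypothetical helper).

    stage: dict mapping each operation to its software stage index.
    edges: list of (producer, consumer) pairs between operations.
    """
    inputs = defaultdict(set)   # stage -> producers in other stages that feed it
    outputs = defaultdict(set)  # stage -> its operations whose results leave it
    crossing = 0
    for src, dst in edges:
        # Data dependency: a consumer may share a stage with its producer,
        # but must never be scheduled in an earlier stage.
        if stage[src] > stage[dst]:
            raise ValueError(f"dependency violated: {src} -> {dst}")
        if stage[src] != stage[dst]:
            crossing += 1  # edge across stages = register file access
            inputs[stage[dst]].add(src)
            outputs[stage[src]].add(src)
    io_ok = (all(len(v) <= max_inputs for v in inputs.values())
             and all(len(v) <= max_outputs for v in outputs.values()))
    return {
        "instructions": len(set(stage.values())),  # non-empty software stages
        "register_accesses": crossing,
        "io_ok": io_ok,
    }

edges = [("a", "c"), ("b", "c"), ("c", "d")]
merged = evaluate_schedule({"a": 0, "b": 0, "c": 0, "d": 1}, edges)
split = evaluate_schedule({"a": 0, "b": 1, "c": 2, "d": 3}, edges)
```

In this toy DFG, merging a, b, and c into one stage leaves two instructions and one cross-stage edge, while placing every operation in its own stage gives four instructions and three cross-stage edges.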

Two Algorithms for Custom Instruction Synthesis
Mixed Integer Linear Programming (MILP)
• Primary variable definition: i is the index of the operations; l is the index of software stages.
• Parameter definition: hardware execution delay; k is the index of operation types.
• Figure annotation: S3,4 = 1

Two Algorithms for Custom Instruction Synthesis
• Assistant variable definition: execution cycle delay
• Constraints:
  • Data dependency constraint
  • I/O constraint
• Figure annotation: Sd6 = 0.8

Two Algorithms for Custom Instruction Synthesis
• SN: the number of instructions.
• SE: the total number of data accesses.
• For multi-issue, out-of-order processors, the execution cycle delay equals the longest execution path delay of the DFG.
• The largest number of operations of type k among the different software stages.
• The number of functional modules (operators) of type k needed in the final custom hardware extension.

Two Algorithms for Custom Instruction Synthesis
• The unit hardware area of functional module type k.
• Objective terms: energy consumption, area overhead, execution cycle.
• The advantage of applying MILP to the scheduling problem is that, theoretically, it can find the optimum solution given sufficient search time.
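As a rough illustration of how the area term could be evaluated for one schedule, the sketch below counts, for each operation type k, the largest number of type-k operations in any single software stage and multiplies it by the unit area of that module type. The names `area_overhead` and `weighted_cost` are hypothetical, every stage is assumed to be mapped to custom hardware, and the linear weighted sum is an assumed form of the objective, since the transcript does not preserve the exact formula.

```python
from collections import Counter

def area_overhead(stage, op_type, unit_area):
    """Estimate custom hardware area for a schedule (hypothetical helper).

    stage: dict op -> software stage index.
    op_type: dict op -> operation type k.
    unit_area: dict k -> unit hardware area of functional module type k.
    """
    per_stage = {}
    for op, s in stage.items():
        per_stage.setdefault(s, Counter())[op_type[op]] += 1
    # Modules needed of type k = largest count of type-k ops in any one stage.
    modules = Counter()
    for counts in per_stage.values():
        for k, c in counts.items():
            modules[k] = max(modules[k], c)
    area = sum(unit_area[k] * n for k, n in modules.items())
    return area, dict(modules)

def weighted_cost(energy, area, cycles, w_energy, w_area, w_cycles):
    """Assumed linear combination of the three objective terms."""
    return w_energy * energy + w_area * area + w_cycles * cycles

area, modules = area_overhead(
    stage={"a": 0, "b": 0, "c": 1},
    op_type={"a": "add", "b": "add", "c": "mul"},
    unit_area={"add": 1.0, "mul": 4.0},
)
energy_only = weighted_cost(10.0, area, 3.0, w_energy=1.0, w_area=0.0, w_cycles=0.0)
```

Setting two of the three weights to zero reduces the cost to a single objective, which mirrors how the experiments isolate energy reduction or performance improvement.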

Two Algorithms for Custom Instruction Synthesis
Simulated Annealing Method
• Solution vector definition: OPv = {op1, op2, op3, ..., opn}
• Solution variation mechanism: in each iteration, we randomly select n operations and move them to a different software stage to generate a new solution.
• n represents the maximum distance between the current solution and the one it evolves to. t is the current temperature, T is the starting temperature, and N is the total number of operations.

Two Algorithms for Custom Instruction Synthesis
• Figure annotation: R = [3~8]
• The allowable range within which an operation can move is determined by the location of its parent and child nodes.
• In our algorithm, the actual moving range for an operation is further tightened by the current temperature: range = R * sqr(t/T).
• We randomly move the operation to a software stage within this range.
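A minimal sketch of this move step follows, under several labeled assumptions: `sqr` is read as a square root, R is taken as the width of the window allowed by the parent and child stages, and the tightened window is centered on the operation's current stage. The function names and data layout are hypothetical.

```python
import math
import random

def move_window(op, stage, parents, children, num_stages, t, T):
    """Return the (lo, hi) stage window an operation may move within."""
    # Allowable range: no earlier than any parent, no later than any child.
    lo = max((stage[p] for p in parents.get(op, [])), default=0)
    hi = min((stage[c] for c in children.get(op, [])), default=num_stages - 1)
    R = hi - lo  # assumed meaning of R: size of the allowable window
    # Assumption: sqr(t/T) is a square root, so the reach shrinks as t cools.
    reach = math.floor(R * math.sqrt(t / T))
    return max(lo, stage[op] - reach), min(hi, stage[op] + reach)

def pick_stage(op, stage, parents, children, num_stages, t, T, rng=random):
    """Randomly choose a new stage for op inside the tightened window."""
    lo, hi = move_window(op, stage, parents, children, num_stages, t, T)
    return rng.randint(lo, hi)

stage = {"a": 0, "b": 3, "c": 9}
parents = {"b": ["a"]}
children = {"b": ["c"]}
hot = move_window("b", stage, parents, children, 10, t=100.0, T=100.0)
cool = move_window("b", stage, parents, children, 10, t=25.0, T=100.0)
```

At the starting temperature the operation can reach its whole allowable window; as the temperature drops, candidate moves stay closer to the current stage.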

Two Algorithms for Custom Instruction Synthesis
• Solution acceptance mechanism: a new solution is accepted when its cost is smaller than that of the current solution; when its cost is larger, it can still be accepted with a probability p.
• The Simulated Annealing algorithm balances the trade-off between solution quality and search time.
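The acceptance rule can be sketched as below. The transcript does not preserve the formula for p, so this example substitutes the standard Metropolis criterion, p = exp(-(new_cost - current_cost) / t), which is an assumption and may differ from the expression used in the paper. The function name `accept` is hypothetical.

```python
import math
import random

def accept(current_cost, new_cost, t, rng=random):
    """Decide whether to move to a new solution at temperature t."""
    if new_cost < current_cost:
        return True  # improvements are always accepted
    if t <= 0:
        return False  # no uphill moves once the system is frozen
    # Assumed Metropolis probability for a non-improving move.
    p = math.exp(-(new_cost - current_cost) / t)
    return rng.random() < p

rng = random.Random(0)
better = accept(10.0, 9.0, t=1.0, rng=rng)
frozen = accept(10.0, 11.0, t=0.0, rng=rng)
```

With this criterion, worse solutions are accepted often at high temperature and rarely near the end of the schedule, which is what lets the search escape local minima early on.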

Two Algorithms for Custom Instruction Synthesis

MULTI-OBJECTIVE ASIP SYNTHESIS FLOW

EXPERIMENTAL RESULTS
• CPLEX is used to solve the MILP problem for design space exploration.
• The baseline processor is an out-of-order MIPS-style processor.
• The ratio between the weight variables g1 and g2 is set to 12.2 : 1.
• The register file I/O constraints are set to 4/2.
• We perform experiments for energy reduction by setting the variables å2 and å3 to zero, and for performance improvement by setting å1 and å2 to zero.

EXPERIMENTAL RESULTS
• The average speedup is 1.42 for Binary Tree, 1.64 for MILP (p.), and 1.56 for MILP (e.).
• The average energy consumption reductions are 18.1%, 22.7% and 29.8%.

EXPERIMENTAL RESULTS
The custom instruction templates presented in (b) and (c) target performance and energy efficiency, respectively. There are more operations in the templates identified for energy efficiency, shown in (c), and they include longer critical paths than the sub-graphs shown in (b).

EXPERIMENTAL RESULTS
• Figure settings: å3 = 0, å1 = 1, å2 = 0; and å1 = å2 = 0.5.
• For different designs, the ratio between å1 and å2 can be varied to find the best trade-off between them.

EXPERIMENTAL RESULTS
The SA algorithm achieves an average performance speedup of 1.46, which is slightly lower than the 1.64 achieved by the MILP algorithm.
