Fast, Quasi-Optimal, and Pipelined Instruction-Set Extensions

Ajay K. Verma, Philip Brisk and Paolo Ienne
Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA)
Ecole Polytechnique Fédérale de Lausanne (EPFL)

Custom ISE Identification
[Figure: register file, AFU, ALU, MUL, LD/ST unit, and data memory; the AFU computes out1 = F (in1, in2, in3, in4) and out2 = G (in1, in2, in3, in4).]
• Limited number of I/O ports

Outline
• Problem formulation
• ISE selection
• I/O serialisation
• Related work
• Non-optimality of earlier work
• Integer Linear Programming (ILP) formulation
• Results
• Conclusions

Problem Formulation
[Figure: example dataflow graph with nodes a to h and x1 to x3.]
• Given
  • a dataflow graph
  • a set of forbidden nodes
• Find a subgraph S which
  • is convex
  • is free of forbidden nodes
  • has the largest gain M (S) = Nexec * (SW (S) – HW (S))

Convex Subgraph
[Figure: a non-convex AFU in a graph with nodes a, b, c, d.]
• In order to execute the AFU, we need the output of node b
• Computing node b requires an output of the AFU
• A non-convex AFU cannot be scheduled without creating a deadlock
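The deadlock argument above can be checked mechanically. The following is a minimal Python sketch, not the authors' method: the diamond-shaped graph (a feeds b and c, which both feed d) and the function names `reachable` and `is_convex` are assumptions made for illustration. A subgraph is reported as non-convex when some node outside it lies on a path between two of its nodes.

```python
def reachable(edges, sources):
    """Nodes reachable from `sources` by following at least one edge."""
    seen = set()
    stack = list(sources)
    while stack:
        u = stack.pop()
        for v in edges.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def is_convex(succ, subgraph):
    """True iff no outside node is both fed by and feeding the subgraph."""
    subgraph = set(subgraph)
    # Build the reversed graph so ancestors can be found the same way.
    pred = {}
    for u, vs in succ.items():
        for v in vs:
            pred.setdefault(v, []).append(u)
    descendants = reachable(succ, subgraph)
    ancestors = reachable(pred, subgraph)
    return not ((descendants & ancestors) - subgraph)


# Hypothetical graph: a -> b, a -> c, b -> d, c -> d
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(is_convex(succ, {"a", "c", "d"}))       # b is left out: False
print(is_convex(succ, {"a", "b", "c", "d"}))  # True
```

With b left out, the AFU would need b's output while b waits for the AFU's output, which is the deadlock described on the slide.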

I/O Serialisation
[Figure: an ISE over nodes b, c, d, e, f, shown twice.]
• 2 inputs, 4 outputs
• Available I/O ports: (1, 2)

ISE Merit Estimation
[Figure: dataflow graph with nodes a to h and x1 to x3, and a subgraph over nodes b, c, d, e, f.]
• M (S) = Nexec * (SW (S) – HW (S))

Related Work
• ISE identification under I/O constraints
  • Search space pruning using I/O and convexity constraints [Atasu03, Clark03, Yu04, Pozzi06, Yu07, Chen07]
  • ILP-based approach [Atasu05]
  • Pseudo-polynomial time algorithm [Bonzini07]
• ISE identification under relaxed I/O constraints
  • Restricted search space exploration [Pozzi05]
  • Generation of a semi-compact set of connected ISEs [Pothineni07]
• I/O serialisation
  • Exponential time algorithms [Pozzi05, Pothineni07]
• Algorithms for specific processor models
  • Single-issue RISC processor model [Verma07]

Earlier Work
[Figure: earlier work arranged under ISE selection and I/O serialisation: Atasu03, Pozzi05, Chen07, Pothineni07, Bonzini07, Yu07.]
• Optimal ISE selection under various I/O constraints
• Exponential time I/O serialisation algorithm

Non-Optimality of Earlier Work
[Figure: example graphs with fractional node delays (.6, .5, .2, .3), each labelled with the cycles saved.]

Our Contributions
• Optimal ILP formulation for a large class of processor models
  • Earlier work considers the RISC processor model only
• Single run
  • In earlier work, ISE selection was done for various I/O constraints
• ISE selection and I/O scheduling together
  • Handling them separately is another source of non-optimality in earlier work

Integer Linear Programming
• Objective function
• Linear constraints

ILP Formulation
• Linear constraints
  • No forbidden nodes
  • Convexity constraints
  • I/O serialisation based constraints
  • I/O access per cycle based constraints
• Objective function
  • The saving in cycles should be maximised

ISE Selection Constraints (1 of 2)
• Variable: for each node ni, a Boolean variable xi
  • xi is true iff node ni is in the selected ISE
• Constraint: no forbidden node may be in the ISE
  • If ni is a forbidden node, then xi = 0
• Variable: for each node ni, two Boolean variables pi and si
  • pi (si) is true iff at least one predecessor (successor) of ni is in the selected ISE
• Constraint: the subgraph corresponding to the selected ISE must be convex
  • If pi and si are both true, then xi must be true (i.e., pi + si – xi ≤ 1)

ISE Selection Constraints (2 of 2)
• Relationship between pi, si and xi
  • pi = 0 if ni has no children; otherwise pi = ∪ (xj ∪ pj), where the nj are the children of ni
  • si = 0 if ni has no parents; otherwise si = ∪ (xj ∪ sj), where the nj are the parents of ni
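The recursion for pi and si, together with the convexity inequality pi + si – xi ≤ 1, can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' code: "children" is read here as the operands feeding a node (so that pi tracks predecessors) and "parents" as the nodes consuming its result, the successor flag recurses on the successor flags of the consumers, and the graph and the names `flags` and `satisfies_convexity` are hypothetical.

```python
from functools import lru_cache


def flags(operands, x):
    """Return {node: (p, s)} for a DAG.

    operands: node -> list of nodes feeding it (assumed meaning of "children")
    x:        node -> 0/1, 1 iff the node is in the selected ISE
    """
    consumers = {n: [] for n in operands}
    for n, ops in operands.items():
        for o in ops:
            consumers[o].append(n)

    @lru_cache(maxsize=None)
    def p(n):
        # 1 iff some (transitive) predecessor of n is selected
        return int(any(x[j] or p(j) for j in operands[n]))

    @lru_cache(maxsize=None)
    def s(n):
        # 1 iff some (transitive) successor of n is selected
        return int(any(x[j] or s(j) for j in consumers[n]))

    return {n: (p(n), s(n)) for n in operands}


def satisfies_convexity(operands, x):
    """Check p_i + s_i - x_i <= 1 for every node."""
    return all(p + s - x[n] <= 1 for n, (p, s) in flags(operands, x).items())


# Hypothetical graph: a feeds b and c; b and c feed d.
operands = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(satisfies_convexity(operands, {"a": 1, "b": 0, "c": 1, "d": 1}))  # False
print(satisfies_convexity(operands, {"a": 1, "b": 1, "c": 1, "d": 1}))  # True
```

In the first selection node b has a selected predecessor (a) and a selected successor (d) but is itself excluded, so the inequality is violated for b.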

I/O Serialisation Based Constraints (1 of 3)
[Figure: example AFU with nodes n1 to n5.]
• Variable: an integer variable intDelayi
  • Denotes the cycle in which node ni is executed, e.g.,
    • intDelay1 = 0
    • intDelay4 = 1
    • intDelay5 = 2
• Variable: a real variable fractionalDelayi
  • Denotes the earliest time within cycle intDelayi at which the output of ni is available, e.g.,
    • fractionalDelay3 = HW (n3)
    • fractionalDelay4 = HW (n3) + HW (n4)
• Variable: an integer variable ρij
  • Denotes the number of pipeline stages crossed by the edge between nodes ni and nj, e.g.,
    • ρ13 = 1
    • ρ34 = 0
    • ρ25 = 2

I/O Serialisation Based Constraints (2 of 3)
[Figure: example AFU with nodes n1 to n7.]
• Constraint: the difference between the cycles of a predecessor node and its successor equals the number of latches on the edge connecting them, e.g.,
  • intDelay4 = intDelay3 + ρ34
  • intDelay5 = intDelay2 + ρ25
• Constraint: the total number of stages R equals the last cycle in which an output node is computed, e.g.,
  • R = intDelay5 + ρ57
  • R = intDelay2 + ρ26
• Extra latches are placed on output edges in order to realize an imaginary sink node
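A small Python checker makes the two stage constraints concrete. It is only a sketch: the function name `total_stages` is made up, the internal edges are the three whose ρ values appear on the slides, intDelay2 = 0 and intDelay3 = 1 follow from the slide's equations, and the output-edge values ρ57 = 0 and ρ26 = 2 are assumed for illustration.

```python
def total_stages(int_delay, rho, out_rho):
    """Verify the stage constraints and return the total stage count R.

    int_delay: node -> cycle in which the node executes
    rho:       (i, j) -> number of latches on internal edge i -> j
    out_rho:   i -> number of latches on the output edge leaving node i
    """
    for (i, j), latches in rho.items():
        if int_delay[j] != int_delay[i] + latches:
            raise ValueError(f"edge {i}->{j} violates intDelay_j = intDelay_i + rho_ij")
    totals = {int_delay[i] + latches for i, latches in out_rho.items()}
    if len(totals) != 1:
        raise ValueError("output edges disagree on the total number of stages")
    return totals.pop()


int_delay = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
rho = {(1, 3): 1, (3, 4): 0, (2, 5): 2}
out_rho = {5: 0, 2: 2}  # assumed values for rho_57 and rho_26
print(total_stages(int_delay, rho, out_rho))  # 2
```

In the ILP these equalities are constraints on unknowns rather than checks on given values; the checker only shows what a feasible assignment has to satisfy.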

I/O Serialisation Based Constraints (3 of 3)
[Figure: example AFU with nodes n1 to n7.]
• Constraint: the fractionalDelay of a node depends on the fractionalDelay of its predecessor nodes, e.g.,
  • Case 1: the node is the first node in its cycle
    • fractionalDelay3 = HW (n3)
  • Case 2: the node is not the first node in its cycle
    • fractionalDelay4 = fractionalDelay3 + HW (n4)
• Constraint: the fractionalDelay of a node must never exceed the cycle time λ, e.g.,
  • fractionalDelay3 ≤ λ
  • fractionalDelay4 ≤ λ
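The two cases can be folded into one rule: a node's fractionalDelay is its own hardware delay plus the delay inherited from predecessors reached without crossing a latch. The Python sketch below assumes that, with several such predecessors, the latest one dominates (a `max`), which the slide does not state; the hardware delays and the cycle time are made-up numbers and `fractional_delays` is a hypothetical name.

```python
def fractional_delays(order, preds, rho, hw):
    """Compute fractionalDelay for each node, visiting nodes in topological order.

    preds: node -> list of predecessor nodes
    rho:   (i, j) -> latches on edge i -> j (0 means same cycle)
    hw:    node -> hardware delay, as a fraction of the cycle time
    """
    frac = {}
    for n in order:
        # Only predecessors in the same cycle (no latch on the edge) contribute.
        same_cycle = [frac[i] for i in preds.get(n, []) if rho[(i, n)] == 0]
        frac[n] = hw[n] + max(same_cycle, default=0.0)
    return frac


# Chain n1 -> n3 -> n4 with rho_13 = 1 and rho_34 = 0, as on the slides.
hw = {1: 0.3, 3: 0.4, 4: 0.5}   # assumed delays
lam = 1.0                        # assumed cycle time
frac = fractional_delays([1, 3, 4], {3: [1], 4: [3]}, {(1, 3): 1, (3, 4): 0}, hw)
print(frac[3] == hw[3])                      # Case 1: first node in its cycle
print(all(d <= lam for d in frac.values()))  # cycle-time constraint holds
```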

I/O Access Per Cycle Based Constraints
• Variable: Boolean variables cikIN and cikOUT
  • cikIN is true iff ni is an input of the ISE and is accessed in the kth stage of execution (similarly for cikOUT)
• Constraint: in each stage, no more than m inputs may be accessed and no more than n outputs may be written back, i.e., for each k
  • ∑ cikIN ≤ m
  • ∑ cikOUT ≤ n
• cikIN and cikOUT can be computed from the intDelay and fractionalDelay of the nodes and the ρ values of the incoming and outgoing edges of the AFU
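As an illustration of the per-stage port limits, the Python sketch below counts accesses per stage and compares each count with the limit, which is what ∑ cikIN ≤ m expresses once cikIN is 1 exactly when input ni is read in stage k. The input names, stages, and the function name `ports_ok` are assumptions.

```python
from collections import Counter


def ports_ok(access_stage, limit):
    """True iff no stage has more than `limit` accesses.

    access_stage: operand -> stage k in which it is read (or written back)
    """
    per_stage = Counter(access_stage.values())
    return all(count <= limit for count in per_stage.values())


reads = {"in1": 0, "in2": 0, "in3": 1}  # hypothetical input schedule
print(ports_ok(reads, 2))  # True: at most two reads in any stage
print(ports_ok(reads, 1))  # False: stage 0 needs two read ports
```

The same function applies to outputs with the limit n.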

Objective Function
• The saving in cycles should be maximised
  • SW (S) – HW (S) should be maximum
  • SW (S) = ∑ xi SW (ni)
  • HW (S) = R
• Any processor model in which SW (S) and HW (S) can be computed using linear inequalities can be handled using ILP
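For a fixed selection the objective is a one-line computation. The Python sketch below evaluates SW (S) – HW (S) with SW (S) = ∑ xi SW (ni) and HW (S) = R; the node names, software latencies, and stage count are made-up numbers, and `saving` is a hypothetical helper rather than part of any solver.

```python
def saving(x, sw, stages):
    """Cycles saved by an ISE: sum of x_i * SW(n_i) minus the stage count R."""
    return sum(x[n] * sw[n] for n in x) - stages


x = {"n1": 1, "n2": 1, "n3": 0}   # n3 is not selected
sw = {"n1": 1, "n2": 2, "n3": 1}  # assumed software latencies in cycles
print(saving(x, sw, stages=1))    # 2
```

An ILP solver would search over x and R jointly, subject to the constraints on the earlier slides, instead of evaluating a given selection.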

Experimental Setup
[Figure: three flows starting from the input dataflow graph.]
• ISE selection (Atasu03), no serialisation
• ISE selection (Atasu03), then I/O serialisation (Pozzi05): exp / subopt
• ILP method: exp / opt

Results (1 of 3)
[Figure: results for adpcmcoder, adpcmdecoder, and viterbi, comparing no pipelining, Pozzi’s algorithm, and the ILP method.]

Results (2 of 3)
• Benchmark: aes
• Biggest dataflow graph: 703
[Figure: results after 3 minutes and after an hour.]
• Pozzi’s algorithm takes several hours on this benchmark and produces inferior results

Results (3 of 3) The best AFU with 22 inputs and 22 outputs

Conclusions
[Figure: earlier work arranged under ISE selection and I/O serialisation: Atasu03, Pozzi05, Chen07, Pothineni07, Bonzini07, Yu07.]
• Optimal, single-run algorithm
• The methodology can be generalized to a large class of processor models
