
Fast, Quasi-Optimal, and Pipelined Instruction-Set Extensions

Ajay K. Verma, Philip Brisk and Paolo Ienne. Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA), Ecole Polytechnique Fédérale de Lausanne (EPFL).


Presentation Transcript


  1. Fast, Quasi-Optimal, and Pipelined Instruction-Set Extensions
     Ajay K. Verma, Philip Brisk and Paolo Ienne
     Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA), Ecole Polytechnique Fédérale de Lausanne (EPFL)

  2. Custom ISE Identification
     [Figure: processor datapath with a register file, ALU, MUL, LD/ST and AFU units, and a data memory. The AFU computes out1 = F(in1, in2, in3, in4) and out2 = G(in1, in2, in3, in4).]
     Limited number of I/O ports

  3. Outline
     • Problem formulation
     • ISE selection
     • I/O serialisation
     • Related work
     • Non-optimality of earlier work
     • Integer Linear Programming (ILP) formulation
     • Results
     • Conclusions

  4. Problem Formulation
     [Figure: dataflow graph with nodes a to h and x1 to x3.]
     • Given
       • a dataflow graph
       • a set of forbidden nodes
     • Find a subgraph S which is
       • convex
       • free of forbidden nodes
     • and has the largest gain
       M(S) = Nexec * (SW(S) − HW(S))
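The gain formula can be sketched in a few lines. This reads Nexec as the execution count of the code containing S, SW(S) as its software latency in cycles and HW(S) as its latency as a custom instruction; the slide does not define these symbols, so that reading and all numbers below are assumptions.

```python
def merit(n_exec: int, sw_cycles: float, hw_cycles: float) -> float:
    """Gain M(S) = Nexec * (SW(S) - HW(S)) for a candidate subgraph S."""
    return n_exec * (sw_cycles - hw_cycles)


# Hypothetical numbers: a block executed 1000 times whose subgraph takes
# 5 cycles in software and 2 cycles as a custom instruction.
gain = merit(1000, 5, 2)
```

A subgraph whose hardware latency is no better than its software latency has zero or negative gain and is never worth selecting.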

  5. Convex Subgraph
     [Figure: dataflow graph with nodes a, b, c and d, showing a non-convex AFU.]
     • To execute the AFU, we need the output of node b
     • Computing node b requires the output of the AFU
     A non-convex AFU cannot be scheduled without creating a deadlock
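A minimal sketch of a convexity test, assuming the dataflow graph is a DAG stored as a successor map. A subgraph is convex when no outside node is both a descendant and an ancestor of selected nodes. The edge list is hypothetical, chosen to reproduce the situation described on the slide (node b sits between two AFU nodes).

```python
def reachable(adjacency, sources):
    """All nodes reachable from `sources` by following at least one edge."""
    seen = set()
    stack = [nxt for src in sources for nxt in adjacency.get(src, [])]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adjacency.get(node, []))
    return seen


def is_convex(successors, subgraph):
    """True iff no path between two selected nodes leaves the subgraph."""
    subgraph = set(subgraph)
    predecessors = {}
    for node, succs in successors.items():
        for succ in succs:
            predecessors.setdefault(succ, []).append(node)
    descendants = reachable(successors, subgraph)
    ancestors = reachable(predecessors, subgraph)
    # An outside node that is both fed by and feeds the subgraph breaks convexity.
    return not ((descendants & ancestors) - subgraph)


# Hypothetical graph: a -> b -> c -> d, plus a -> c.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["d"], "d": []}
non_convex = is_convex(graph, {"a", "c"})      # b lies between a and c
convex = is_convex(graph, {"a", "b", "c"})
```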

  6. I/O Serialisation
     [Figure: two drawings of a subgraph with nodes b, c, d, e and f.]
     2 inputs, 4 outputs
     Available I/O ports: (1, 2)

  7. ISE Merit Estimation
     [Figure: dataflow graph with nodes a to h and x1 to x3, alongside the subgraph with nodes b, c, d, e and f.]
     M(S) = Nexec * (SW(S) − HW(S))

  8. Related Work
     • ISE identification under I/O constraints
       • Search space pruning using I/O and convexity constraints [Atasu03, Clark03, Yu04, Pozzi06, Yu07, Chen07]
       • ILP based approach [Atasu05]
       • Pseudo-polynomial time algorithm [Bonzini07]
     • ISE identification under relaxed I/O constraints
       • Restricted search space exploration [Pozzi05]
       • Generation of a semi compact set of connected ISEs [Pothineni07]
     • I/O serialisation
       • Exponential time algorithms [Pozzi05, Pothineni07]
     • Algorithms for specific processor models
       • Single-issue RISC processor model [Verma07]

  9. Earlier Work
     [Figure: earlier work split into ISE selection and I/O serialisation: Atasu03, Pozzi05, Chen07, Pothineni07, Bonzini07, Yu07.]
     • Exponential time I/O serialisation algorithm
     • Optimal ISE selection under various I/O constraints

  10. Non-Optimality of Earlier Work
      [Figure: example dataflow graphs annotated with the values .6, .5, .2 and .3, each labelled with the cycles saved.]

  11. Our Contributions
      • Optimal ILP formulation for a large class of processor models
        • Earlier work considers only the RISC processor model
      • Single run
        • In earlier work, ISE selection was performed for various I/O constraints
      • ISE selection and I/O scheduling together
        • Another source of non-optimality in earlier work

  12. Integer Linear Programming
      [Figure: an integer linear program, with its objective function and linear constraints labelled.]
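For reference, a generic integer linear program in its usual textbook form is shown below. The symbols c, A, b and x are generic placeholders and are not taken from the slides; the formulation in this deck also uses a real-valued variable, which makes it a mixed-integer program.

```latex
\begin{aligned}
\text{maximise}   \quad & c^{\mathsf{T}} x        && \text{(objective function)} \\
\text{subject to} \quad & A x \le b               && \text{(linear constraints)} \\
                        & x \in \mathbb{Z}^{n}
\end{aligned}
```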

  13. ILP Formulation
      • Linear constraints
        • No forbidden nodes
        • Convexity constraints
        • I/O serialisation based constraints
        • I/O access per cycle based constraints
      • Objective function
        • Saving in cycles should be maximum

  14. ISE Selection Constraints (1 of 2)
      • Variable: for each node n_i, a Boolean variable x_i
        • x_i is true iff node n_i is in the selected ISE
      • Constraint: no forbidden node may be in the ISE
        • If n_i is a forbidden node, then x_i = 0
      • Variable: for each node n_i, two Boolean variables p_i and s_i
        • p_i (s_i) is true iff at least one predecessor (successor) of n_i is in the selected ISE
      • Constraint: the subgraph corresponding to the selected ISE must be convex
        • If p_i and s_i are both true, then x_i must be true (i.e., p_i + s_i − x_i ≤ 1)
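The convexity inequality can be checked directly on a small graph. The sketch below computes the p and s flags for a given selection and then tests p + s − x ≤ 1 at every node. It reads "predecessor" and "successor" transitively (any ancestor or descendant), which is an assumption; the graph and function names are hypothetical.

```python
def convexity_flags(parents, selected):
    """Return (p, s): p[n] is True iff some ancestor of n is selected,
    s[n] is True iff some descendant of n is selected."""
    children = {node: [] for node in parents}
    for node, node_parents in parents.items():
        for parent in node_parents:
            children[parent].append(node)

    def propagate(neighbours):
        flags = {}

        def flag(node):
            if node not in flags:
                flags[node] = any(
                    nb in selected or flag(nb) for nb in neighbours[node]
                )
            return flags[node]

        for node in neighbours:
            flag(node)
        return flags

    return propagate(parents), propagate(children)


def satisfies_convexity(parents, selected):
    """Check p + s - x <= 1 for every node (Booleans count as 0/1)."""
    p, s = convexity_flags(parents, selected)
    return all(p[n] + s[n] - (n in selected) <= 1 for n in parents)


# Hypothetical graph: a -> b -> c -> d, plus a -> c.
parents = {"a": [], "b": ["a"], "c": ["a", "b"], "d": ["c"]}
violated = satisfies_convexity(parents, {"a", "c"})     # b has p = s = 1, x = 0
satisfied = satisfies_convexity(parents, {"a", "b", "c"})
```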

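  15. ISE Selection Constraints (2 of 2)
      • Relationship between p_i, s_i and x_i

      $$p_i = \begin{cases} 0 & \text{if } n_i \text{ has no parents} \\ \bigvee_{j} \,(x_j \lor p_j) & \text{where the } n_j \text{ are the parents of } n_i \end{cases}$$

      $$s_i = \begin{cases} 0 & \text{if } n_i \text{ has no children} \\ \bigvee_{j} \,(x_j \lor s_j) & \text{where the } n_j \text{ are the children of } n_i \end{cases}$$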
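A Boolean OR is not itself a linear constraint. One common way to linearise y = a_1 ∨ … ∨ a_k over 0/1 variables is shown below; this is a standard technique and not necessarily the encoding the authors used. Here y stands for a p_i or s_i variable and the a_j for the x and p (or s) variables of the neighbouring nodes.

```latex
\begin{aligned}
y &\ge a_j \qquad \text{for each } j = 1, \dots, k \\
y &\le \sum_{j=1}^{k} a_j \\
y,\; a_j &\in \{0, 1\}
\end{aligned}
```

The first family of inequalities forces y to 1 as soon as any a_j is 1; the second forces y to 0 when all a_j are 0.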

  16. I/O Serialisation Based Constraints (1 of 3)
      [Figure: example dataflow graph with nodes n1 to n5.]
      • Variable: an integer variable intDelay_i
        • Denotes the cycle in which node n_i is executed, e.g.,
          • intDelay_1 = 0
          • intDelay_4 = 1
          • intDelay_5 = 2
      • Variable: a real variable fractionalDelay_i
        • Denotes the earliest time after the start of cycle intDelay_i at which the output of n_i is available, e.g.,
          • fractionalDelay_3 = HW(n_3)
          • fractionalDelay_4 = HW(n_3) + HW(n_4)
      • Variable: an integer variable ρ_ij
        • Denotes the number of stages crossed by the edge between nodes n_i and n_j, e.g.,
          • ρ_13 = 1
          • ρ_34 = 0
          • ρ_25 = 2

  17. I/O Serialisation Based Constraints (2 of 3)
      [Figure: example dataflow graph with nodes n1 to n7.]
      • Constraint: the difference between the cycles of a predecessor and its successor equals the number of latches on the edge connecting them, e.g.,
        • intDelay_4 = intDelay_3 + ρ_34
        • intDelay_5 = intDelay_2 + ρ_25
      • Constraint: the total number of stages equals the last cycle in which an output node is computed, e.g.,
        • R = intDelay_5 + ρ_57
        • R = intDelay_2 + ρ_26
      Extra latches are created on the output edges in order to realize an imaginary sink node

  18. I/O Serialisation Based Constraints (3 of 3)
      [Figure: example dataflow graph with nodes n1 to n7.]
      • Constraint: the fractionalDelay of a node depends on the fractionalDelay of its predecessor nodes, e.g.,
        • Case 1: the node is the first node in the cycle
          • fractionalDelay_3 = HW(n_3)
        • Case 2: the node is not the first node in the cycle
          • fractionalDelay_4 = fractionalDelay_3 + HW(n_4)
      • Constraint: the fractionalDelay of a node should never exceed the cycle time, e.g.,
        • fractionalDelay_3 ≤ λ
        • fractionalDelay_4 ≤ λ
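To make the roles of intDelay, fractionalDelay and ρ concrete, the sketch below computes one assignment that satisfies the serialisation constraints, using a greedy as-soon-as-possible schedule. This is not the ILP from the slides: it ignores I/O port limits, assumes every node's hardware delay fits within one cycle, and uses a hypothetical graph and hypothetical delays.

```python
from fractions import Fraction


def asap_pipeline(preds, hw, cycle_time):
    """Greedy ASAP pipelining sketch.

    `preds` maps each node to its predecessors and must list the nodes in
    topological order. Returns (int_delay, frac_delay, rho).
    """
    int_delay, frac_delay, rho = {}, {}, {}
    for node, node_preds in preds.items():
        cycle = max((int_delay[p] for p in node_preds), default=0)
        # Only predecessors in the same cycle add combinational delay;
        # values from earlier cycles arrive through a latch at time 0.
        start = max(
            (frac_delay[p] for p in node_preds if int_delay[p] == cycle),
            default=0,
        )
        if start + hw[node] > cycle_time:
            # The node does not fit: latch its inputs and start a new cycle.
            cycle, start = cycle + 1, 0
        int_delay[node] = cycle
        frac_delay[node] = start + hw[node]
        for p in node_preds:
            rho[(p, node)] = cycle - int_delay[p]
    return int_delay, frac_delay, rho


# Hypothetical graph and delays, expressed as fractions of the cycle time.
preds = {"n1": [], "n2": [], "n3": ["n1", "n2"], "n4": ["n3"], "n5": ["n4"]}
hw = {
    "n1": Fraction(6, 10), "n2": Fraction(5, 10), "n3": Fraction(3, 10),
    "n4": Fraction(5, 10), "n5": Fraction(6, 10),
}
int_delay, frac_delay, rho = asap_pipeline(preds, hw, cycle_time=1)
```

By construction, every edge satisfies intDelay_j = intDelay_i + ρ_ij and every fractionalDelay stays within the cycle time, which are the constraints listed on the three serialisation slides.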

  19. I/O Access Per Cycle Based Constraints
      • Variable: Boolean variables c_ik^IN and c_ik^OUT
        • c_ik^IN is true iff n_i is an input of the ISE and is accessed in the k-th stage of execution (similarly for c_ik^OUT)
      • Constraint: in each stage, no more than m inputs should be accessed and no more than n outputs should be written back, i.e., for each k
        • ∑_i c_ik^IN ≤ m
        • ∑_i c_ik^OUT ≤ n
      • c_ik^IN and c_ik^OUT can be computed from the intDelay and fractionalDelay of the nodes and the ρ values of the incoming and outgoing edges of the AFU
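The per-stage port constraint amounts to counting accesses per stage. A minimal sketch, assuming the stage of each input read and output write is already known; the names and stage assignments are hypothetical.

```python
from collections import Counter


def ports_respected(read_stage, write_stage, m, n):
    """True iff no stage reads more than m inputs or writes more than n outputs.

    `read_stage` maps each ISE input to the stage in which it is read;
    `write_stage` maps each ISE output to the stage in which it is written.
    """
    reads = Counter(read_stage.values())
    writes = Counter(write_stage.values())
    return (all(count <= m for count in reads.values())
            and all(count <= n for count in writes.values()))


# Four inputs and two outputs on a register file with 2 read ports, 1 write port.
serialised = ports_respected(
    {"in1": 0, "in2": 0, "in3": 1, "in4": 1}, {"out1": 2, "out2": 3}, m=2, n=1
)
unserialised = ports_respected(
    {"in1": 0, "in2": 0, "in3": 0, "in4": 0}, {"out1": 2, "out2": 2}, m=2, n=1
)
```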

  20. Objective Function
      • Saving in cycles should be maximized
        • SW(S) − HW(S) should be maximum
          SW(S) = ∑_i x_i · SW(n_i)
          HW(S) = R
      Any processor model in which SW(S) and HW(S) can be computed using linear inequalities can be handled using ILP
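On a toy graph, the objective can be explored by exhaustive search instead of an ILP solver. The sketch below enumerates every forbidden-free, convex selection and keeps the one with the largest SW(S) − HW(S). It is a brute-force stand-in, exponential in the number of nodes, and it ignores I/O port limits. It also assumes HW(S) is the critical-path delay of S rounded up to whole cycles. The graph, latencies and forbidden set are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import ceil


def best_ise(parents, sw, hw, forbidden):
    """Exhaustively maximise sum(sw) - ceil(critical path) over convex selections."""
    nodes = list(parents)  # assumed to be listed in topological order
    children = {n: [] for n in nodes}
    for node, node_parents in parents.items():
        for parent in node_parents:
            children[parent].append(node)

    def closure(start, neighbours):
        seen, stack = set(), list(start)
        while stack:
            for nb in neighbours[stack.pop()]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        return seen

    def is_convex(sel):
        return not ((closure(sel, children) & closure(sel, parents)) - sel)

    def hw_cycles(sel):
        finish = {}
        for n in nodes:
            if n in sel:
                inputs = (finish[p] for p in parents[n] if p in sel)
                finish[n] = hw[n] + max(inputs, default=0)
        return ceil(max(finish.values()))

    allowed = [n for n in nodes if n not in forbidden]
    best, best_saving = set(), 0
    for size in range(1, len(allowed) + 1):
        for combo in combinations(allowed, size):
            sel = set(combo)
            if is_convex(sel):
                saving = sum(sw[n] for n in sel) - hw_cycles(sel)
                if saving > best_saving:
                    best, best_saving = sel, saving
    return best, best_saving


# Hypothetical graph: a -> b -> c -> d, plus a -> c; d is forbidden.
parents = {"a": [], "b": ["a"], "c": ["a", "b"], "d": ["c"]}
sw = {"a": 1, "b": 1, "c": 1, "d": 1}
hw = {"a": Fraction(4, 10), "b": Fraction(3, 10),
      "c": Fraction(2, 10), "d": Fraction(5, 10)}
selection, saving = best_ise(parents, sw, hw, forbidden={"d"})
```

Here the three allowed nodes take 3 cycles in software and fit in a single hardware cycle, so the search returns them with a saving of 2 cycles.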

  21. Experimental Setup
      [Figure: the input dataflow graph is processed by three flows.]
      • ISE selection [Atasu03] with no serialisation
      • ISE selection [Atasu03] followed by I/O serialisation [Pozzi05] (exp / subopt)
      • ILP method (exp / opt)

  22. Results (1 of 3)
      [Figure: results for the adpcmcoder, adpcmdecoder and viterbi benchmarks, comparing no pipelining, Pozzi's algorithm and the ILP method.]

  23. Results (2 of 3)
      Benchmark: aes
      Biggest dataflow graph: 703
      [Figure: results after 3 minutes and after an hour.]
      Pozzi's algorithm takes several hours on this benchmark and produces inferior results

  24. Results (3 of 3) The best AFU with 22 inputs and 22 outputs

  25. Conclusions
      [Figure: earlier work split into ISE selection and I/O serialisation: Atasu03, Pozzi05, Chen07, Pothineni07, Bonzini07, Yu07.]
      • Optimal, single run algorithm
      • The methodology can be generalized for a large class of processor models
