CMPUT680 - Winter 2006

CMPUT680 - Winter 2006. Topic I: Superblock and Hyperblock Formation. José Nelson Amaral, http://www.cs.ualberta.ca/~amaral/courses/680

Presentation Transcript


  1. CMPUT680 - Winter 2006 Topic I: Superblock and Hyperblock Formation José Nelson Amaral http://www.cs.ualberta.ca/~amaral/courses/680 CMPUT 329 - Computer Organization and Architecture II

  2. Instruction Level Parallelism Optimizations
     The objective of an optimizer is to reduce the number and complexity of the instructions executed by the processor.
     Superscalar or Very Long Instruction Word (VLIW) processors can reduce the execution time even when the number of instructions executed increases moderately, as long as the dependence height is reduced.

  3. Speculative and Predicated Execution
     Speculative Execution: execution of an instruction before knowing that its execution is required.
     Superblock: structure used to implement compiler-controlled speculative execution.
     Predicated Execution: architecture-supported conditional execution of an instruction based on the value of a Boolean source operand, referred to as the predicate of the instruction.
     If-conversion: compiler algorithm that converts conditional branches into predicate-defining instructions to allow the use of predication.

  4. Trace Scheduling (Fisher, 1981)
     Some optimization and scheduling decisions may decrease the execution time for one control path while increasing the execution time for another path. Thus decisions should favor the more frequently executed paths to improve overall performance.
     Trace scheduling divides a procedure into a set of frequently executed traces (paths).

  5. Trace Scheduling
     There may be conditional branches from the middle of the trace (side exits) and transitions from other traces into the middle of the trace (side entrances).
     These control-flow transitions are ignored during trace scheduling. After scheduling, bookkeeping is required to ensure the correct execution of off-trace code.

  6. Bookkeeping for Trace Scheduling
     [Figure: a trace before scheduling (Instr 1, Instr 2, Instr 3, Instr 4, Instr 5) and after scheduling (Instr 2, Instr 3, Instr 4, Instr 1, Instr 5), with a side entrance into the middle of the trace]
     What bookkeeping is required when Instr 1 is moved below the side entrance in the trace?

  7. Bookkeeping for Trace Scheduling
     [Figure: the original trace (Instr 1 to Instr 5) and the scheduled trace (Instr 2, Instr 3, Instr 4, Instr 1, Instr 5), with copies of Instr 3 and Instr 4 added on the side-entrance path]

  8. Bookkeeping for Trace Scheduling
     [Figure: a trace before scheduling (Instr 1, Instr 2, Instr 3, Instr 4, Instr 5) and after scheduling (Instr 1, Instr 5, Instr 2, Instr 3, Instr 4), with a side entrance into the middle of the trace]
     What bookkeeping is required when Instr 5 moves above the side entrance in the trace?

  9. Bookkeeping for Trace Scheduling
     [Figure: the original trace (Instr 1 to Instr 5) and the scheduled trace (Instr 1, Instr 5, Instr 2, Instr 3, Instr 4), with a copy of Instr 5 added on the side-entrance path]

  10. Superblocks
      A superblock is a trace without side entrances, i.e., control can only enter from the top, but it can leave at one or more exit points.
      The formation of superblocks creates additional optimization opportunities because constraints associated with infrequently executed paths of control are ignored (thus these constraints do not inhibit optimizations that favor frequently executed paths).

  11. Superblock Formation (Example)
      [Figure: weighted control-flow graph from entry Y to exit Z, shown twice; block and edge weights range from 0 to 100, and the second copy marks the selected trace]

  12. Superblock Formation (Example)
      [Figure: the weighted control-flow graph with the selected trace marked]
      Is this a superblock? No, a superblock cannot have side entrances, and this set of nodes has two side entrances into node F.
      How do we convert it into a superblock?
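The check described on this slide can be sketched in a few lines. This is only an illustration: the block names A to F, the edge list, and the function name `side_entrances` are assumptions made for the sketch, chosen so that the trace ends in a node F with two side entrances as the slide states.

```python
def side_entrances(trace, succs):
    """Return the (pred, block) edges that enter `trace` anywhere but its head."""
    entrances = []
    for pos, block in enumerate(trace):
        if pos == 0:
            continue  # entering at the top of the trace is always allowed
        for pred, targets in succs.items():
            if block in targets and pred != trace[pos - 1]:
                entrances.append((pred, block))
    return sorted(entrances)


# Assumed control-flow graph (successor lists); not taken from the slide.
succs = {
    "Y": ["A"], "A": ["B", "C"], "B": ["D", "E"], "C": ["F"],
    "D": ["F"], "E": ["F"], "F": ["A", "Z"], "Z": [],
}
trace = ["A", "B", "E", "F"]

print(side_entrances(trace, succs))  # edges C->F and D->F enter the trace at F
```

A trace is a superblock exactly when this list is empty. The back edge from F to A does not count, because it enters at the head of the trace.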

  13. Superblock Formation (Example)
      Tail duplication is the duplication of basic blocks that appear after a side entrance to eliminate side entrances and transform a trace into a superblock.
      [Figure: the control-flow graph after tail duplication; block F (weight 90) remains on the trace and its copy F' (weight 10) receives the former side entrances]
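A minimal sketch of tail duplication on a successor-list graph follows. The graph, the block names A to F, and the function name `tail_duplicate` are assumptions for illustration. The sketch duplicates every trace block from the first side entrance onward and redirects the side entrances to the copies; it does not update profile weights.

```python
def tail_duplicate(trace, succs):
    """Return a new successor map in which `trace` has no side entrances."""
    pos = {b: i for i, b in enumerate(trace)}

    def is_side_entrance(pred, block):
        i = pos.get(block, 0)  # blocks off the trace behave like the head
        return i > 0 and pred != trace[i - 1]

    entered = [pos[s] for p, targets in succs.items()
               for s in targets if is_side_entrance(p, s)]
    if not entered:
        return {b: list(t) for b, t in succs.items()}

    # Every block from the first side entrance to the end of the trace is copied.
    tail = set(trace[min(entered):])
    copy = {b: b + "'" for b in tail}

    result = {}
    for p, targets in succs.items():
        # Side entrances are redirected to the copies; trace edges are kept.
        result[p] = [copy[s] if s in tail and is_side_entrance(p, s) else s
                     for s in targets]
    for b in tail:
        # Inside the duplicated region, edges into the tail go to the copies.
        result[copy[b]] = [copy.get(s, s) for s in succs[b]]
    return result


# Assumed control-flow graph (successor lists); not taken from the slide.
succs = {
    "Y": ["A"], "A": ["B", "C"], "B": ["D", "E"], "C": ["F"],
    "D": ["F"], "E": ["F"], "F": ["A", "Z"], "Z": [],
}
trace = ["A", "B", "E", "F"]
result = tail_duplicate(trace, succs)
print(result["C"], result["D"], result["F'"])
```

After the transformation F is reached only from E, so the trace A, B, E, F can be entered only at A.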

  14. Common Subexpression Elimination in Superblocks
      Original Code (the branch after opA reaches opB with weight 1 and bypasses it with weight 99):
          opA:  mul r1,r2,3
          opB:  add r2,r2,1
          opC:  mul r3,r2,3
      Code After Superblock Formation:
          opA:  mul r1,r2,3
          opC:  mul r3,r2,3        (on the superblock, weight 99)
          opB:  add r2,r2,1        (off the superblock, weight 1)
          opC': mul r3,r2,3
      Code After Common Subexpression Elimination:
          opA:  mul r1,r2,3
          opC:  mov r3,r1          (on the superblock, weight 99)
          opB:  add r2,r2,1        (off the superblock, weight 1)
          opC': mul r3,r2,3
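The effect of this transformation can be checked with a small simulation. The two functions below are a sketch that models the registers as Python integers; the function names and the `side_path` flag (true when the weight-1 path through opB is taken) are assumptions for illustration.

```python
def original(r2, side_path):
    r1 = r2 * 3          # opA
    if side_path:
        r2 = r2 + 1      # opB, infrequent path
    r3 = r2 * 3          # opC, reached from both paths
    return r1, r2, r3


def after_cse(r2, side_path):
    r1 = r2 * 3          # opA
    if side_path:
        r2 = r2 + 1      # opB
        r3 = r2 * 3      # opC', the duplicated tail still multiplies
    else:
        r3 = r1          # opC becomes mov r3, r1 inside the superblock
    return r1, r2, r3


for value in range(-3, 4):
    for taken in (False, True):
        assert original(value, taken) == after_cse(value, taken)
print("both versions agree")
```

The multiply can be replaced only on the superblock because, once the side entrance from opB is gone, r2 is known to be unchanged between opA and opC.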

  15. Operation Migration in Superblocks
      Original Code (superblock with side exits to X, Y and Z):
          …
          mov r0,r1
          …
          mov r0,r2
          …
          mov r0,r3
          …
          add r1,r1,4
          add r2,r2,4
          add r3,r3,4
      After Operation Migration:
          superblock: … (exit to X) … (exit to Y) … add r1,r1,4; add r2,r2,4; add r3,r3,4 (exit to Z)
          X: mov r0,r1
          Y: mov r0,r2
          Z: mov r0,r3

  16. Global Variable Migration in Superblock Loops (slides 16 to 27)
      Original Program Segment:
          OpA: ld_i r4, x, r0
          OpB: add  r4, r4, r1
          OpC: st_i x, r0, r4
               (branch weights: 0 toward OpE, 100 toward OpD)
          OpE: add  r0, r0, 1
          OpD: add  r1, r1, 1
      Slides 16 to 27 repeat this segment and show the machine state at successive steps:
          Slide   Memory      MEM[r0+x]   r0   r1   r4
          16      10 20 30    10          1    1
          17      10 20 30    10          1    1    10
          18      10 20 30    10          1    1    11
          19      11 20 30    11          1    1    11
          20      11 20 30    11          1    2    11
          21      11 20 30    11          1    2    11
          22      11 20 30    11          1    2    12
          23      12 20 30    12          1    2    12
          24      12 20 30    20          2    2    12
          25      12 20 30    20          2    2    20
          26      12 20 30    20          2    2    21
          27      12 21 30    21          2    2    21

  28. Global Variable Migration in Superblock Loops
      Original Program Segment:
          OpA:  ld_i r4, x, r0
          OpB:  add  r4, r4, r1
          OpC:  st_i x, r0, r4
                (branch weights: 0 toward OpE, 100 toward OpD)
          OpE:  add  r0, r0, 1
          OpD:  add  r1, r1, 1
      After Variable Migration:
          OpA:  ld_i r4, x, r0
          OpB:  add  r4, r4, r1
                (branch weights: 0 toward OpC', 100 toward OpD)
          OpC': st_i x, r0, r4
          OpE:  add  r0, r0, 1
          OpD:  add  r1, r1, 1
          OpC:  st_i x, r0, r4
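The following sketch compares the two program segments on a small memory. It rests on assumptions that the slide does not state: memory is modelled as a Python list indexed by r0, the infrequent (weight 0) path is treated as a loop exit, and the `exits` list (one flag per iteration) decides when that path is taken. The function names are hypothetical.

```python
def original(mem, r0, r1, exits):
    mem = list(mem)
    for leave in exits:
        r4 = mem[r0]        # OpA: load on every iteration
        r4 = r4 + r1        # OpB
        mem[r0] = r4        # OpC: store on every iteration
        if leave:
            r0 = r0 + 1     # OpE, infrequent path (assumed to leave the loop)
            break
        r1 = r1 + 1         # OpD, frequent path
    return mem, r0, r1


def migrated(mem, r0, r1, exits):
    mem = list(mem)
    r4 = mem[r0]            # OpA hoisted in front of the loop
    for leave in exits:
        r4 = r4 + r1        # OpB works on the register copy
        if leave:
            mem[r0] = r4    # OpC': store placed on the infrequent exit
            r0 = r0 + 1     # OpE
            return mem, r0, r1
        r1 = r1 + 1         # OpD
    mem[r0] = r4            # OpC: store placed after the loop
    return mem, r0, r1


start = [10, 20, 30]
for exits in ([], [False] * 4, [False, False, True], [True]):
    assert original(start, 0, 1, exits) == migrated(start, 0, 1, exits)
print(migrated(start, 0, 1, [False, False, True]))
```

On the frequent path the migrated loop performs no memory access at all, which is the point of keeping the variable in r4 for the duration of the loop.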

  29. Superblock Enlarging Optimizations
      By enlarging a superblock, we can provide the scheduler with more independent instructions to choose from for each cycle.
      Superblock enlarging optimizations:
          Branch target expansion
          Loop unrolling
          Loop peeling

  30. Branch Target Expansion
      Idea: To expand the superblock with the target of a likely taken branch.
      Before:
          L1: blt r1, r2, L3
          L2: jump L4
          L3: beq r3, r4, L5
      After:
          L1: blt r1, r2, L3
              beq r3, r4, L5
          L2: jump L4
      [Figure: edge weights shown in the diagram are 20, 100 and 20]

  31. Superblock Loops
      A superblock loop is a superblock that has a frequently taken backedge from its last node to its first node.
      We will study the extension of some common loop optimizations to superblocks.

  32. Dependence Removing Optimizations
      The goal is to eliminate data dependences between instructions within frequently executed superblocks.
      Dependence removing optimizations include:
          Register renaming
          Accumulator variable expansion
          Induction variable expansion
          Search variable expansion
          Operation combining
          Strength reduction
          Tree height reduction

  33. Instruction Latencies for Examples

  34. Register Renaming Example
      Original Loop:
          For (j=0; j<n; j++) { C(j) = A(j)+B(j) }
      Assembly Code:
          L1: ld_f  f2, A, r1      (a)
              ld_f  f3, B, r1      (b)
              add_f f4, f2, f3     (c)
              st_f  C, r1, f4      (d)
              add   r1, r1, 4      (e)
              blt   r1, r5, L1     (f)
      For all the examples we assume a superscalar processor with infinite resources and no register renaming hardware. Thus for the code above, we obtain the following schedule.

  35. Register Renaming Example
      [Figure: code schedule of instructions (a) to (f) of the original loop, plotted against cycles]
      7 cycles / 1 iteration

  36. Register Renaming Example
      After Loop Unrolling:
          L1: ld_f  f2, A, r1      (a)
              ld_f  f3, B, r1      (b)
              add_f f4, f2, f3     (c)
              st_f  C, r1, f4      (d)
              add   r1, r1, 4      (e)
              ld_f  f2, A, r1      (f)
              ld_f  f3, B, r1      (g)
              add_f f4, f2, f3     (h)
              st_f  C, r1, f4      (i)
              add   r1, r1, 4      (j)
              ld_f  f2, A, r1      (k)
              ld_f  f3, B, r1      (l)
              add_f f4, f2, f3     (m)
              st_f  C, r1, f4      (n)
              add   r1, r1, 4      (o)
              blt   r1, r5, L1     (p)

  37. Loop Unrolling
      [Figure: code schedule of instructions (a) to (p) of the unrolled loop, plotted against cycles 0 to 15 and beyond]
      19 cycles / 3 iterations = 6.3 cycles / iteration

  38. Register Renaming
      After Register Renaming (applied to the unrolled loop):
          L1: ld_f  f21, A, r11     (a)
              ld_f  f31, B, r11     (b)
              add_f f41, f21, f31   (c)
              st_f  C, r11, f41     (d)
              add   r12, r11, 4     (e)
              ld_f  f22, A, r12     (f)
              ld_f  f32, B, r12     (g)
              add_f f42, f22, f32   (h)
              st_f  C, r12, f42     (i)
              add   r13, r12, 4     (j)
              ld_f  f23, A, r13     (k)
              ld_f  f33, B, r13     (l)
              add_f f43, f23, f33   (m)
              st_f  C, r13, f43     (n)
              add   r11, r13, 4     (o)
              blt   r11, r5, L1     (p)

  39. Loop Unrolling and Register Renaming
      [Figure: code schedule of instructions (a) to (p) after register renaming, plotted against cycles 0 to 15]
      8 cycles / 3 iterations = 2.7 cycles / iteration
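Renaming of a straight-line unrolled body can be sketched as follows. This is an illustration under assumptions: instructions are modelled as `(opcode, destination, sources)` tuples, new names take the form `r1_2` rather than the `r12` style used on the slides, and the final definition of r1 is not mapped back to the loop-entry name as the slide does with r11.

```python
def rename(instrs):
    """Give every definition a fresh name; each use reads the latest version."""
    version = {}
    out = []
    for op, dest, srcs in instrs:
        # Sources are looked up first, so `add r1, r1, 4` reads the old r1.
        new_srcs = [f"{s}_{version[s]}" if s in version else s for s in srcs]
        if dest is not None:
            version[dest] = version.get(dest, 0) + 1
            dest = f"{dest}_{version[dest]}"
        out.append((op, dest, new_srcs))
    return out


# One iteration of the loop body from the slides (the branch is left out).
body = [
    ("ld_f", "f2", ["A", "r1"]),
    ("ld_f", "f3", ["B", "r1"]),
    ("add_f", "f4", ["f2", "f3"]),
    ("st_f", None, ["C", "r1", "f4"]),
    ("add", "r1", ["r1", "4"]),
]
renamed = rename(body * 3)  # unroll three times, then rename
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

Once no register is defined twice, the only dependences left between the three copies are the true dependences through r1, which is why the schedule shortens.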

  40. Accumulator Variable Expansion
      An accumulator variable accumulates a sum or product in each iteration of a loop.
      Accumulator variable expansion eliminates redefinitions of an accumulator variable within an unrolled loop by creating k temporary accumulators (k is the number of accumulation instructions).
      The values of all temporary accumulators must be summed at the exit points of the loop where the accumulator is live.

  41. Accumulator Expansion Example
      Original Loop:
          For (k=0; k<n; k++) { C(i,j) = C(i,j) + A(i,k) * B(k,j) }
      Assembly Code:
              ld_f  f1, C, r2      (-)
          L1: ld_f  f3, A, r4      (a)
              ld_f  f5, B, r6      (b)
              mul_f f7, f3, f5     (c)
              add_f f1, f1, f7     (d)
              add   r4, r4, 4      (e)
              add   r6, r6, r8     (f)
              blt   r4, r9, L1     (g)
              st_f  C, r2, f1      (-)
      For all examples we assume a superscalar processor with infinite resources and no register renaming hardware. Thus for the code above, we obtain the following schedule.

  42. Accumulator Expansion Example
      [Figure: code schedule of instructions (a) to (g) of the original loop, plotted against cycles]
      8 cycles / 1 iteration

  43. Loop Unrolling and Register Renaming
      After Unrolling and Renaming:
              ld_f  f1, C, r2       (-)
          L1: ld_f  f31, A, r41     (a)
              ld_f  f51, B, r61     (b)
              mul_f f71, f31, f51   (c)
              add_f f1, f1, f71     (d)
              add   r42, r41, 4     (e)
              add   r62, r61, r8    (f)
              ld_f  f32, A, r42     (g)
              ld_f  f52, B, r62     (h)
              mul_f f72, f32, f52   (i)
              add_f f1, f1, f72     (j)
              add   r43, r42, 4     (k)
              add   r63, r62, r8    (l)
              ld_f  f33, A, r43     (m)
              ld_f  f53, B, r63     (n)
              mul_f f73, f33, f53   (o)
              add_f f1, f1, f73     (p)
              add   r41, r43, 4     (q)
              add   r61, r63, r8    (r)
              blt   r41, r9, L1     (s)
              st_f  C, r2, f1       (-)

  44. Loop Unrolling and Register Renaming
      [Figure: code schedule of instructions (a) to (s) after unrolling and renaming, plotted against cycles 0 to 15]
      14 cycles / 3 iterations = 4.7 cycles / iteration

  45. Accumulator Expansion
              ld_f  f11, C, r2      (-)
              mov_f f12, 0          (-)
              mov_f f13, 0          (-)
          L1: ld_f  f31, A, r41     (a)
              ld_f  f51, B, r61     (b)
              mul_f f71, f31, f51   (c)
              add_f f11, f11, f71   (d)
              add   r42, r41, 4     (e)
              add   r62, r61, r8    (f)
              ld_f  f32, A, r42     (g)
              ld_f  f52, B, r62     (h)
              mul_f f72, f32, f52   (i)
              add_f f12, f12, f72   (j)
              add   r43, r42, 4     (k)
              add   r63, r62, r8    (l)
              ld_f  f33, A, r43     (m)
              ld_f  f53, B, r63     (n)
              mul_f f73, f33, f53   (o)
              add_f f13, f13, f73   (p)
              add   r41, r43, 4     (q)
              add   r61, r63, r8    (r)
              blt   r41, r9, L1     (s)
              add_f f11, f11, f12   (-)
              add_f f11, f11, f13   (-)
              st_f  C, r2, f11      (-)
      [Figure: code schedule of instructions (a) to (s) after accumulator expansion, plotted against cycles 0 to 15]
      10 cycles / 3 iterations = 3.3 cycles / iteration
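The same transformation at source level can be sketched as below. The function names and the remainder loop are assumptions added for the sketch; the slide code assumes a trip count that is a multiple of three. Integers are used so that the two versions agree exactly; with floating-point values the reassociated sum may round differently.

```python
def dot_original(a, b, c0):
    acc = c0
    for x, y in zip(a, b):
        acc = acc + x * y          # each add depends on the previous one
    return acc


def dot_expanded(a, b, c0):
    # Three temporary accumulators, one per accumulation in the unrolled body.
    acc1, acc2, acc3 = c0, 0, 0
    n = len(a) - len(a) % 3
    for k in range(0, n, 3):
        acc1 += a[k] * b[k]            # the three chains are independent
        acc2 += a[k + 1] * b[k + 1]
        acc3 += a[k + 2] * b[k + 2]
    total = acc1 + acc2 + acc3         # summed at the loop exit
    for k in range(n, len(a)):         # leftover iterations
        total += a[k] * b[k]
    return total


a = list(range(1, 8))
b = list(range(2, 9))
print(dot_original(a, b, 5), dot_expanded(a, b, 5))
```

Only the first accumulator starts from the original value, matching the `mov_f f12, 0` and `mov_f f13, 0` instructions in the listing.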

  46. Induction Variable Expansion
      An induction variable is used to index through loop iterations and through regular data structures, such as arrays.
      Induction variable expansion eliminates dependences between definitions of induction variables and their uses in unrolled loops.

  47. Induction Variable Expansion Example
      Original Loop:
          For (i=0; i<n; i++) { C(j) = A(j) * B(j); j = j + K }
      Assembly Code:
          L1: ld_f  f3, A, r2      (a)
              ld_f  f4, B, r2      (b)
              mul_f f5, f3, f4     (c)
              st_f  C, r2, f5      (d)
              add   r2, r2, r7     (e)
              add   r1, r1, 1      (f)
              blt   r1, r6, L1     (g)
      [Figure: code schedule of instructions (a) to (g), plotted against cycles]
      6 cycles / 1 iteration

  48. Loop Unrolling and Register Renaming
      After Unrolling and Renaming:
          L1: ld_f  f31, A, r21     (a)
              ld_f  f41, B, r21     (b)
              mul_f f51, f31, f41   (c)
              st_f  C, r21, f51     (d)
              add   r22, r21, r7    (e)
              ld_f  f32, A, r22     (f)
              ld_f  f42, B, r22     (g)
              mul_f f52, f32, f42   (h)
              st_f  C, r22, f52     (i)
              add   r23, r22, r7    (j)
              ld_f  f33, A, r23     (k)
              ld_f  f43, B, r23     (l)
              mul_f f53, f33, f43   (m)
              st_f  C, r23, f53     (n)
              add   r21, r23, r7    (o)
              add   r1, r1, 3       (p)
              blt   r1, r6, L1      (q)

  49. Loop Unrolling and Register Renaming
      [Figure: code schedule of instructions (a) to (q) after unrolling and renaming, plotted against cycles 0 to 15]
      8 cycles / 3 iterations = 2.7 cycles / iteration

  50. Induction Variable Expansion
              mov   r21, r2         (-)
              add   r22, r21, r7    (-)
              add   r23, r22, r7    (-)
              mul   r71, r7, 3      (-)
          L1: ld_f  f31, A, r21     (a)
              ld_f  f41, B, r21     (b)
              mul_f f51, f31, f41   (c)
              st_f  C, r21, f51     (d)
              ld_f  f32, A, r22     (f)
              ld_f  f42, B, r22     (g)
              mul_f f52, f32, f42   (h)
              st_f  C, r22, f52     (i)
              ld_f  f33, A, r23     (k)
              ld_f  f43, B, r23     (l)
              mul_f f53, f33, f43   (m)
              st_f  C, r23, f53     (n)
              add   r21, r21, r71   (e)
              add   r22, r22, r71   (j)
              add   r23, r23, r71   (o)
              add   r1, r1, 3       (p)
              blt   r1, r6, L1      (q)
      [Figure: code schedule of instructions (a) to (q) after induction variable expansion, plotted against cycles 0 to 15]
      6 cycles / 3 iterations = 2 cycles / iteration
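A source-level sketch of the same idea follows. The function names are hypothetical, arrays are modelled as Python lists with results collected in a dictionary, and the trip count n is assumed to be a multiple of three so that no remainder loop is needed.

```python
def strided_original(a, b, j, k, n):
    c = {}
    for _ in range(n):
        c[j] = a[j] * b[j]
        j = j + k                  # every iteration waits for this update
    return c


def strided_expanded(a, b, j, k, n):
    # Assumes n is a multiple of 3.
    c = {}
    j1, j2, j3 = j, j + k, j + 2 * k   # one induction variable per copy
    step = 3 * k                       # corresponds to mul r71, r7, 3
    for _ in range(n // 3):
        c[j1] = a[j1] * b[j1]
        c[j2] = a[j2] * b[j2]
        c[j3] = a[j3] * b[j3]
        j1 += step                     # the three updates are independent
        j2 += step
        j3 += step
    return c


a = list(range(20))
b = list(range(20, 40))
print(strided_expanded(a, b, 1, 2, 6) == strided_original(a, b, 1, 2, 6))
```

Because each copy of the body owns its index, the loads of the second and third copies no longer wait for an add from the preceding copy.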
