CMPUT680

CMPUT680 - Winter 2006 Topic I: Superblock and Hyperblock Formation José Nelson Amaral http://www.cs.ualberta.ca/~amaral/courses/680 CMPUT 329 - Computer Organization and Architecture II

Instruction Level Parallelism Optimizations
The objective of an optimizer is to reduce the number and complexity of the instructions executed by the processor.
Superscalar or Very Long Instruction Word (VLIW) processors can reduce the execution time even when the number of instructions executed increases moderately, as long as the dependence height is reduced.

Speculative and Predicated Execution
Speculative Execution: execution of an instruction before knowing that its execution is required.
Superblock: structure used to implement compiler-controlled speculative execution.
Predicated Execution: architecture-supported conditional execution of an instruction based on the value of a Boolean source operand, referred to as the predicate of the instruction.
If-conversion: compiler algorithm that converts conditional branches into predicate-defining instructions to allow the use of predication.

Trace Scheduling (Fisher, 1981)
Some optimization and scheduling decisions may decrease the execution time for one control path while increasing the execution time for another path. Thus, decisions should favor the more frequently executed paths to improve overall performance.
Trace scheduling divides a procedure into a set of frequently executed traces (paths).

Trace Scheduling
There may be conditional branches out of the middle of the trace (side exits) and transitions from other traces into the middle of the trace (side entrances).
These control-flow transitions are ignored during trace scheduling.
After scheduling, bookkeeping is required to ensure the correct execution of off-trace code.

Bookkeeping for Trace Scheduling
Original trace: Instr 1, Instr 2, Instr 3, Instr 4, Instr 5
After scheduling: Instr 2, Instr 3, Instr 4, Instr 1, Instr 5
What bookkeeping is required when Instr 1 is moved below the side entrance in the trace?

Bookkeeping for Trace Scheduling
Original trace: Instr 1, Instr 2, Instr 3, Instr 4, Instr 5
After scheduling: Instr 2, Instr 3, Instr 4, Instr 1, Instr 5
Bookkeeping code: copies of Instr 3 and Instr 4 are added off-trace.

Bookkeeping for Trace Scheduling
Original trace: Instr 1, Instr 2, Instr 3, Instr 4, Instr 5
After scheduling: Instr 1, Instr 5, Instr 2, Instr 3, Instr 4
What bookkeeping is required when Instr 5 moves above the side entrance in the trace?

Bookkeeping for Trace Scheduling
Original trace: Instr 1, Instr 2, Instr 3, Instr 4, Instr 5
After scheduling: Instr 1, Instr 5, Instr 2, Instr 3, Instr 4
Bookkeeping code: a copy of Instr 5 is added off-trace.

Superblocks
A superblock is a trace without side entrances, i.e., control can only enter from the top, but it can leave at one or more exit points.
The formation of superblocks creates additional optimization opportunities because constraints associated with infrequently executed paths of control are ignored (thus these constraints do not inhibit optimizations that favor frequently executed paths).

Superblock Formation (Example)
[Figure: weighted control-flow graph from entry Y to exit Z, with blocks including B (count 90), C (count 10), D (count 0), E (count 90), and F (count 100), annotated with edge execution counts.]

Superblock Formation (Example)
[Figure: the same weighted control-flow graph, with a set of nodes ending in F selected.]
Is this a superblock? No, a superblock cannot have side entrances, and this set of nodes has two side entrances into node F.
How do we convert it into a superblock?

Superblock Formation (Example)
Tail duplication is the duplication of the basic blocks that appear after a side entrance, in order to eliminate side entrances and transform a trace into a superblock.
[Figure: the control-flow graph after tail duplication. Block F (count 90) remains in the trace and a copy F' (count 10) is added.]
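Tail duplication can be sketched on a small control-flow graph. The following is a minimal illustration, not the algorithm used in any particular compiler. The graph shape (A branching to B and C, B branching to D and E, and C, D, and E all reaching F) is an assumption chosen to match the two side entrances into F described above; the function name `tail_duplicate` and the prime suffix for copies are hypothetical. The sketch also assumes that no block inside the trace branches back into the middle of the trace.

```python
def tail_duplicate(succs, trace):
    """Return a copy of the CFG in which `trace` has no side entrances.

    succs maps each block name to the list of its successor names;
    trace lists the blocks of the trace in order, starting at its head.
    """
    succs = {block: list(targets) for block, targets in succs.items()}
    for i in range(1, len(trace)):
        block, on_trace_pred = trace[i], trace[i - 1]
        side_preds = [p for p, targets in succs.items()
                      if block in targets and p != on_trace_pred]
        if not side_preds:
            continue
        # First side entrance found: duplicate this block and everything
        # after it in the trace (the "tail").
        tail = trace[i:]
        copy = {t: t + "'" for t in tail}
        for t in tail:
            succs[copy[t]] = [copy.get(s, s) for s in succs[t]]
        # Redirect every off-trace edge into the tail to the copies.
        for j, t in enumerate(tail):
            keep = trace[i + j - 1]  # the only predecessor allowed to stay
            for p in list(succs):
                if p != keep and p not in copy.values():
                    succs[p] = [copy[t] if s == t else s for s in succs[p]]
        break
    return succs


# Assumed example graph: the trace A, B, E, F has side entrances into F
# from C and D.
cfg = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": ["F"],
       "E": ["F"], "F": ["Z"], "Z": []}
superblock_cfg = tail_duplicate(cfg, ["A", "B", "E", "F"])
```

After the call, C and D branch to the copy F', while E remains the only predecessor of F, so the trace A, B, E, F can only be entered from the top.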

Common Subexpression Elimination in Superblocks
Original Code:
    opA: mul r1,r2,3
    opB: add r2,r2,1
    opC: mul r3,r2,3
Code After Superblock Formation:
    opA: mul r1,r2,3
    opB: add r2,r2,1
    opC': mul r3,r2,3
    opC: mul r3,r2,3
Code After Common Subexpression Elimination:
    opA: mul r1,r2,3
    opB: add r2,r2,1
    opC': mul r3,r2,3
    opC: mov r3,r1
[Figure: control-flow graphs with edge counts 1 and 99. opC' is the tail-duplicated copy of opC; within the superblock, opC reuses the value that opA left in r1.]

Operation Migration in Superblocks
Original Code:
    …
    mov r0,r1
    …
    mov r0,r2
    …
    mov r0,r3
    …
    add r1,r1,4
    add r2,r2,4
    add r3,r3,4
After Operation Migration:
    …
    add r1,r1,4
    add r2,r2,4
    add r3,r3,4
    X: mov r0,r1
    Y: mov r0,r2
    Z: mov r0,r3
[Figure: the superblock with exits to blocks X, Y, and Z; after migration each mov appears in one of these blocks.]

Global Variable Migration in Superblock Loops
Original Program Segment:
    OpA: ld_I r4, x, r0
    OpB: add r4, r4, r1
    OpC: st_I x, r0, r4
    OpE: add r0, r0, 1
    OpD: add r1, r1, 1
[Figure: control-flow graph of the loop with edge counts 0 and 100 after OpC, animated alongside the contents of MEM[r0+x], r0, r1, and r4 as the loop executes.]

Global Variable Migration in Superblock Loops
After Variable Migration:
    OpA: ld_I r4, x, r0
    OpB: add r4, r4, r1
    OpC': st_I x, r0, r4
    OpE: add r0, r0, 1
    OpD: add r1, r1, 1
    OpC: st_I x, r0, r4
[Figure: control-flow graphs of the original and migrated loops, with edge counts 0 and 100; OpC' is a copy of the store OpC.]

Superblock Enlarging Optimizations
By enlarging a superblock, we can provide the scheduler with more independent instructions to choose from in each cycle.
Superblock enlarging optimizations:
    Branch target expansion
    Loop unrolling
    Loop peeling

Branch Target Expansion
Idea: expand the superblock with the target of a likely taken branch.
Before:
    L1: blt r1, r2, L3
    L2: jump L4
    L3: beq r3, r4, L5
After:
    L1: blt r1, r2, L3
        beq r3, r4, L5
    L2: jump L4
[Figure: control-flow graphs before and after expansion, with edge counts 20, 100, and 20.]

Superblock Loops
A superblock loop is a superblock that has a frequently taken backedge from its last node to its first node.
We will study the extension of some common loop optimizations to superblocks.

Dependence Removing Optimizations
The goal is to eliminate data dependences between instructions within frequently executed superblocks.
Dependence removing optimizations include:
    Register renaming
    Accumulator variable expansion
    Induction variable expansion
    Search variable expansion
    Operation combining
    Strength reduction
    Tree height reduction

Instruction Latencies for Examples
[Table: instruction latencies assumed in the examples that follow.]

Register Renaming Example
Original Loop:
    for (j=0; j<n; j++) { C(j) = A(j)+B(j) }
Assembly Code:
    L1: ld_f f2, A, r1 (a)
        ld_f f3, B, r1 (b)
        add_f f4, f2, f3 (c)
        st_f C, r1, f4 (d)
        add r1, r1, 4 (e)
        blt r1, r5, L1 (f)
For all the examples we assume a superscalar processor with infinite resources and no register renaming hardware. Thus, for the code above, we obtain the following schedule.

Register Renaming Example
Code Schedule for the original loop: 7 cycles / 1 iteration
[Figure: cycle-by-cycle schedule chart for instructions (a) through (f).]

Register Renaming Example
After Loop Unrolling:
    L1: ld_f f2, A, r1 (a)
        ld_f f3, B, r1 (b)
        add_f f4, f2, f3 (c)
        st_f C, r1, f4 (d)
        add r1, r1, 4 (e)
        ld_f f2, A, r1 (f)
        ld_f f3, B, r1 (g)
        add_f f4, f2, f3 (h)
        st_f C, r1, f4 (i)
        add r1, r1, 4 (j)
        ld_f f2, A, r1 (k)
        ld_f f3, B, r1 (l)
        add_f f4, f2, f3 (m)
        st_f C, r1, f4 (n)
        add r1, r1, 4 (o)
        blt r1, r5, L1 (p)

Loop Unrolling
Code Schedule after loop unrolling: 19 cycles / 3 iterations = 6.3 cycles / iteration
[Figure: cycle-by-cycle schedule chart for instructions (a) through (p).]

Register Renaming
After Register Renaming:
    L1: ld_f f21, A, r11 (a)
        ld_f f31, B, r11 (b)
        add_f f41, f21, f31 (c)
        st_f C, r11, f41 (d)
        add r12, r11, 4 (e)
        ld_f f22, A, r12 (f)
        ld_f f32, B, r12 (g)
        add_f f42, f22, f32 (h)
        st_f C, r12, f42 (i)
        add r13, r12, 4 (j)
        ld_f f23, A, r13 (k)
        ld_f f33, B, r13 (l)
        add_f f43, f23, f33 (m)
        st_f C, r13, f43 (n)
        add r11, r13, 4 (o)
        blt r11, r5, L1 (p)

Loop Unrolling and Register Renaming
Code Schedule after register renaming: 8 cycles / 3 iterations = 2.7 cycles / iteration
[Figure: cycle-by-cycle schedule chart for instructions (a) through (p).]
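The effect of renaming on an unrolled loop body can be sketched in a few lines. This is an illustration under stated assumptions, not the compiler's actual pass: instructions are modelled as hypothetical `(op, dest, srcs)` tuples, every definition simply receives a fresh name of the form `reg_n`, and the write-back of the last definition into the first name (as in `add r11, r13, 4` above) is not modelled. The helper `count_false_dependences` counts pairs of instructions linked by an anti or output dependence, which are the dependences renaming removes.

```python
def rename_registers(instrs):
    """Give every register definition a fresh name; uses follow the latest one."""
    version = {}   # how many times each original register has been defined
    current = {}   # original register name -> most recent fresh name
    renamed = []
    for op, dest, srcs in instrs:
        new_srcs = tuple(current.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            version[dest] = version.get(dest, 0) + 1
            new_dest = f"{dest}_{version[dest]}"
            current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


def count_false_dependences(instrs):
    """Count instruction pairs with an output (WAW) or anti (WAR) dependence."""
    count = 0
    for i, (_, dest_i, srcs_i) in enumerate(instrs):
        for _, dest_j, _ in instrs[i + 1:]:
            if dest_j is not None and (dest_j == dest_i or dest_j in srcs_i):
                count += 1
    return count


# One iteration of the C(j) = A(j)+B(j) loop body, without the branch.
body = [
    ("ld_f", "f2", ("r1",)),
    ("ld_f", "f3", ("r1",)),
    ("add_f", "f4", ("f2", "f3")),
    ("st_f", None, ("r1", "f4")),
    ("add", "r1", ("r1",)),
]
unrolled = body * 3
renamed = rename_registers(unrolled)
before = count_false_dependences(unrolled)
after = count_false_dependences(renamed)
```

In this model `after` is zero: only true (flow) dependences remain, which is why the scheduler can overlap the three unrolled iterations.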

Accumulator Variable Expansion
An accumulator variable accumulates a sum or product in each iteration of a loop.
Accumulator variable expansion eliminates redefinitions of an accumulator variable within an unrolled loop by creating k temporary accumulators (k is the number of accumulation instructions).
The values of all temporary accumulators must be summed at the exit points of the loop where the accumulator is live.

Accumulator Expansion Example
Original Loop:
    for (k=0; k<n; k++) { C(i,j) = C(i,j) + A(i,k) * B(k,j) }
Assembly Code:
        ld_f f1, C, r2 (-)
    L1: ld_f f3, A, r4 (a)
        ld_f f5, B, r6 (b)
        mul_f f7, f3, f5 (c)
        add_f f1, f1, f7 (d)
        add r4, r4, 4 (e)
        add r6, r6, r8 (f)
        blt r4, r9, L1 (g)
        st_f C, r2, f1 (-)
For all examples we assume a superscalar processor with infinite resources and no register renaming hardware. Thus, for the code above, we obtain the following schedule.

Accumulator Expansion Example
Code Schedule for the original loop: 8 cycles / 1 iteration
[Figure: cycle-by-cycle schedule chart for instructions (a) through (g).]

Loop Unrolling and Register Renaming
After Unrolling and Renaming:
        ld_f f1, C, r2 (-)
    L1: ld_f f31, A, r41 (a)
        ld_f f51, B, r61 (b)
        mul_f f71, f31, f51 (c)
        add_f f1, f1, f71 (d)
        add r42, r41, 4 (e)
        add r62, r61, r8 (f)
        ld_f f32, A, r42 (g)
        ld_f f52, B, r62 (h)
        mul_f f72, f32, f52 (i)
        add_f f1, f1, f72 (j)
        add r43, r42, 4 (k)
        add r63, r62, r8 (l)
        ld_f f33, A, r43 (m)
        ld_f f53, B, r63 (n)
        mul_f f73, f33, f53 (o)
        add_f f1, f1, f73 (p)
        add r41, r43, 4 (q)
        add r61, r63, r8 (r)
        blt r41, r9, L1 (s)
        st_f C, r2, f1 (-)

Loop Unrolling and Register Renaming
Code Schedule after unrolling and renaming: 14 cycles / 3 iterations = 4.7 cycles / iteration
[Figure: cycle-by-cycle schedule chart for instructions (a) through (s). The three additions into f1, instructions (d), (j), and (p), still execute one after another.]

Accumulator Expansion
        ld_f f11, C, r2 (-)
        mov_f f12, 0 (-)
        mov_f f13, 0 (-)
    L1: ld_f f31, A, r41 (a)
        ld_f f51, B, r61 (b)
        mul_f f71, f31, f51 (c)
        add_f f11, f11, f71 (d)
        add r42, r41, 4 (e)
        add r62, r61, r8 (f)
        ld_f f32, A, r42 (g)
        ld_f f52, B, r62 (h)
        mul_f f72, f32, f52 (i)
        add_f f12, f12, f72 (j)
        add r43, r42, 4 (k)
        add r63, r62, r8 (l)
        ld_f f33, A, r43 (m)
        ld_f f53, B, r63 (n)
        mul_f f73, f33, f53 (o)
        add_f f13, f13, f73 (p)
        add r41, r43, 4 (q)
        add r61, r63, r8 (r)
        blt r41, r9, L1 (s)
        add_f f11, f11, f12 (-)
        add_f f11, f11, f13 (-)
        st_f C, r2, f11 (-)
Code Schedule: 10 cycles / 3 iterations = 3.3 cycles / iteration
[Figure: cycle-by-cycle schedule chart for instructions (a) through (s).]
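The same transformation can be sketched at source level. This is a minimal illustration, assuming an unroll factor of three to mirror the three accumulators f11, f12, and f13; the function names and the remainder loop for trip counts that are not a multiple of three are additions for the sketch, not part of the slides. Floating-point addition is not associative, so in general the expanded version may round differently from the original; the sample data below is chosen so that every intermediate value is exactly representable.

```python
def dot_single_accumulator(a, b, init):
    """Original loop: every addition depends on the previous one."""
    acc = init
    for x, y in zip(a, b):
        acc += x * y
    return acc


def dot_expanded_accumulators(a, b, init, k=3):
    """Unrolled by k with k independent temporary accumulators."""
    acc = [init] + [0.0] * (k - 1)       # acc[0] starts from the live-in value
    main = len(a) - len(a) % k           # iterations covered by the unrolled loop
    for i in range(0, main, k):
        for u in range(k):               # the k additions are independent
            acc[u] += a[i + u] * b[i + u]
    for i in range(main, len(a)):        # leftover iterations
        acc[0] += a[i] * b[i]
    total = acc[0]
    for partial in acc[1:]:              # sum the temporaries at the loop exit
        total += partial
    return total


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
b = [2.0, 1.0, 0.5, 4.0, 3.0, 2.0, 1.0]
single = dot_single_accumulator(a, b, 5.0)
expanded = dot_expanded_accumulators(a, b, 5.0)
```

Both calls return 60.5 here. The final summation of the temporaries corresponds to the two `add_f` instructions placed after the loop in the listing above.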

Induction Variable Expansion
An induction variable is used to index through loop iterations and through regular data structures, such as arrays.
Induction variable expansion eliminates dependences between definitions of induction variables and their uses in unrolled loops.

Induction Variable Expansion Example
Original Loop:
    for (i=0; i<n; i++) { C(j) = A(j) * B(j); j = j + K }
Assembly Code:
    L1: ld_f f3, A, r2 (a)
        ld_f f4, B, r2 (b)
        mul_f f5, f3, f4 (c)
        st_f C, r2, f5 (d)
        add r2, r2, r7 (e)
        add r1, r1, 1 (f)
        blt r1, r6, L1 (g)
Code Schedule: 6 cycles / 1 iteration
[Figure: cycle-by-cycle schedule chart for instructions (a) through (g).]

Loop Unrolling and Register Renaming
After Unrolling and Renaming:
    L1: ld_f f31, A, r21 (a)
        ld_f f41, B, r21 (b)
        mul_f f51, f31, f41 (c)
        st_f C, r21, f51 (d)
        add r22, r21, r7 (e)
        ld_f f32, A, r22 (f)
        ld_f f42, B, r22 (g)
        mul_f f52, f32, f42 (h)
        st_f C, r22, f52 (i)
        add r23, r22, r7 (j)
        ld_f f33, A, r23 (k)
        ld_f f43, B, r23 (l)
        mul_f f53, f33, f43 (m)
        st_f C, r23, f53 (n)
        add r21, r23, r7 (o)
        add r1, r1, 3 (p)
        blt r1, r6, L1 (q)

Loop Unrolling and Register Renaming
Code Schedule after unrolling and renaming: 8 cycles / 3 iterations = 2.7 cycles / iteration
[Figure: cycle-by-cycle schedule chart for instructions (a) through (q).]

Induction Variable Expansion
        mov r21, r2 (-)
        add r22, r21, r7 (-)
        add r23, r22, r7 (-)
        mul r71, r7, 3 (-)
    L1: ld_f f31, A, r21 (a)
        ld_f f41, B, r21 (b)
        mul_f f51, f31, f41 (c)
        st_f C, r21, f51 (d)
        ld_f f32, A, r22 (f)
        ld_f f42, B, r22 (g)
        mul_f f52, f32, f42 (h)
        st_f C, r22, f52 (i)
        ld_f f33, A, r23 (k)
        ld_f f43, B, r23 (l)
        mul_f f53, f33, f43 (m)
        st_f C, r23, f53 (n)
        add r21, r21, r71 (e)
        add r22, r22, r71 (j)
        add r23, r23, r71 (o)
        add r1, r1, 3 (p)
        blt r1, r6, L1 (q)
Code Schedule: 6 cycles / 3 iterations = 2 cycles / iteration
[Figure: cycle-by-cycle schedule chart for instructions (a) through (q).]
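The index arithmetic behind the expanded loop can be checked with a short sketch. It is illustrative only: the function names are hypothetical, the unroll factor of three mirrors r21, r22, and r23, and the trip count is assumed to be a multiple of the unroll factor. Each copy of the induction variable starts at a different offset and advances by three strides per trip, matching `mul r71, r7, 3` in the preheader above.

```python
def indices_original(j0, stride, n):
    """Indices visited by the original loop: each j depends on the previous j."""
    visited = []
    j = j0
    for _ in range(n):
        visited.append(j)
        j += stride
    return visited


def indices_expanded(j0, stride, n, unroll=3):
    """Indices visited after induction variable expansion."""
    if n % unroll != 0:
        raise ValueError("sketch assumes n is a multiple of the unroll factor")
    # Preheader: one independent copy of the induction variable per unrolled body.
    copies = [j0 + u * stride for u in range(unroll)]
    step = unroll * stride
    visited = []
    for _ in range(n // unroll):
        visited.extend(copies)                  # the bodies use the copies directly
        copies = [j + step for j in copies]     # independent updates, no chain
    return visited


original = indices_original(4, 8, 6)
expanded = indices_expanded(4, 8, 6)
```

Both lists contain the same indices in the same order, while the updates in the expanded version no longer form a chain through a single register.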
