CS433: Computer System Organization

CS433: Computer System Organization Luddy Harrison Compiling for VLIWs part 2:Predication

Exposing adequate ILP Unrolling Unroll and Jam Software pipelining Register renaming Allocating Register Banks and Functional Units(last time) Instruction Scheduling and Register Allocation Scheduling around interlocks Trace Scheduling Predication (this time) Dimensions of the Problem

Branch Delay Penalty on VLIW BC L 6 cycle penalty =24 lost instructionissue opportunities L:

Predication X = a[i];if (x > 0) { b[i] = x – c[i];} else { e[i] = x + d[i];} If there is a 50% probability of a branch, then this code will suffer aP/2 cycle penalty (or a PW/2 instruction issue opportunity penalty)on the average, where P is the penalty for a conditional branch (or mispredicted branch) W is the issue width of the machine

Predication X = a[i];if (x > 0) { b[i] = x – c[i];} else { e[i] = x + d[i];} The condition guarding a section(s) of code will become the predicate. It will be used in both positive and negative (complemented) form.

Predication X = a[i];if (x > 0) { b[i] = x – c[i];} else { e[i] = x + d[i];} If we are compiling: reads must be either conditionalized on predicate (to prevent illegal access) or else proved to be in-bounds and aligned. If programming by hand, reads may be executed unconditionally but at a cost of additional bandwidth consumption.

Predication X = a[i];if (x > 0) {b[i] = x – c[i];} else {e[i] = x + d[i];} Stores must be conditionalized (predicated). These change the visible state of the machine after the conditional section has finished.When programming by hand, we may occasionally discover that a conditional write can be done unconditionally, but this is an oddity.

Compiled Using Conditional Branch P1 = CMP R1, 0 // compare X to 0BLT P1 LR1 = R2 + R3 // compute c[i]R4 = LOAD R1 // load c[i]R5 = R0 – R4R6 = R7 + R3 // compute b[i]STORE R5, R6JMP ML:R11 = R12 + R3 // compute d[i]R14 = LOAD R11 // load d[i]R15 = R0 – R14R16 = R17 + R3STORE R15, R15M: X = a[i];if (x > 0) { b[i] = x – c[i];} else { e[i] = x + d[i];}

Converting to Predicated Form P1 = CMP R1, 0 // compare X to 0IF P1 R1 = R2 + R3 // compute c[i]IF P1 R4 = LOAD R1 // load c[i]IF P1 R5 = R0 – R4IF P1 R6 = R7 + R3 // compute b[i]IF P1 STORE R5, R6IF !P1R11 = R12 + R3 // compute d[i]IF !P1 R14 = LOAD R11 // load d[i]IF !P1 R15 = R0 – R14IF !P1 R16 = R17 + R3IF !P1 STORE R15, R16 If every instruction type can be predicated, it is relatively simple to convert into a naïve predicated form.

Scheduling Naively P1 = CMP R1, 0IF P1 R1 = R2 + R3 || IF !P1R11 = R12 + R3IF P1 R4 = LOAD R1 || IF !P1 R14 = LOAD R11IF P1 R5 = R0 – R4 || IF !P1 R15 = R0 – R14 || IF P1 R6 = R7 + R3 || IF !P1 R16 = R17 + R3IF P1 STORE R5, R6 || IF !P1 STORE R15, R16

Some Difficulties P1 = CMP R1, 0IF P1 R1 = R2 + R3 || IF !P1R11 = R12 + R3IF P1 R4 = LOAD R1 || IF !P1 R14 = LOAD R11IF P1 R5 = R0 – R4 || IF !P1 R15 = R0 – R14 ||IF P1 R6 = R7 + R3 || IF !P1 R16 = R17 + R3IF P1 STORE R5, R6 || IF !P1 STORE R15, R16 • If we do this in virtual register form, it appears that the predicated assignments do not kill their destinations R1 = 19 … IF P1 R1 = x+y IF P1 R2 = R1+7it looks as though the first assignment to R1 reaches the use of R1 • Nothing can be hoisted above the comparison P1 = CMP … IF P1 R1 = R2 + R3

More Sophisticated Conversion and Scheduling A // A-D are unrelated instructions prior to the CMPBCDP1 = CMP R1, 0 // compare X to 0R1 = R2 + R3R4 = LOAD R1// only OK if we are sure it can’t trapR5 = R0 – R4R6 = R7 + R3IF P1 STORE R5, R6 // must be predicated if we want the same resultR11 = R12 + R3R14 = LOAD R11 // only OK if we are sure it can’t trapR15 = R0 – R14R16 = R17 + R3IF !P1 STORE R15, R16

Scheduling R1 = R2 + R3 || R11 = R12 + R3AR4 = LOAD R1 || R14 = LOAD R11 || P1 = CMP R1, 0BR5 = R0 – R4 || R15 = R0 – R14 || R6 = R7 + R3 || R16 = R17 + R3CIF P1 STORE R5, R6 || IF !P1 STORE R15, R16D • Predication • Expands the basic blocks (straight-line segments) of the code • Creates additional scheduling opportunities • Comes at a cost: useless work is performed unconditionally

Efficiency and Utilization R1 = R2 + R3 ||R11 = R12 + R3R4 = LOAD R1||R14 = LOAD R11||P1 = CMP R1, 0R5 = R0 – R4 ||R15 = R0 – R14 ||R6 = R7 + R3 ||R16 = R17 + R3IF P1 STORE R5, R6||IF !P1 STORE R15, R16 • If this is a 4-wide machine, then • Utilization is 11 / 16 • Efficiencyis 6 / 16 true case: 6 useful instructions in 16 slots false case: 6 useful instructions in 16 slots average is (6 + 6)/2 instructions in 16 slots

Converting Predicated Assignments if (x > 0)y = a + b;elsey = c + d; y is assigned onboth sides of the “if” • CMP R8, 0 // x > 0R1 = R2 + R3 // a + bR4 = R5 + R6 // c + dCMOV GT R4, R1 // if (x > 0) R4 = c + d • at this point, R4 holds the value of y • The first 3 instructions can be done in parallel • It is common for machines to have conditional move as their only support for predication • (not so common in the case of VLIWs however) CMP R8, 0 || R1 = R2 + R3 || R4 = R5 + R6 …CMOV GT R4, R1

If we have more than CMOV if (x > 0)y = a + b;elsey = c + d; y is assigned onboth sides of the “if” CMP R8, 0…IFGT R4 = R2 + R3 || IFLE R4 = R5 + R6 The TigerSHARC does this, expressed as an IF .. ELSE form.

Bandwidth and Latency “Saturating” the bandwidth of one dimension of the machine Conditions Condition codes Conditional branching Data types Integer, fractional, saturation, etc. Pipelining Instruction Sets MIPS, ARM, Thumb, TigerSHARC, C6X Static ILP Exploitation Vector processing VLIW processing Compiler techniques Unroll / Jam Scheduling Predication This isn’t an exhaustive list The homeworks are a good guide also Mid-Term Review

CS433: Computer System Organization