CS433: Computer System Organization

CS433: Computer System Organization. Luddy Harrison. Compiling for VLIWs, part 2: Predication. Topics: exposing adequate ILP (unrolling, unroll and jam, software pipelining, register renaming); allocating register banks and functional units (last time); instruction scheduling and register allocation.

Presentation Transcript


  1. CS433: Computer System Organization. Luddy Harrison. Compiling for VLIWs, part 2: Predication

  2. Dimensions of the Problem
     • Exposing adequate ILP
         • Unrolling
         • Unroll and Jam
         • Software pipelining
         • Register renaming
     • Allocating Register Banks and Functional Units (last time)
     • Instruction Scheduling and Register Allocation
         • Scheduling around interlocks
         • Trace Scheduling
         • Predication (this time)

  3. Branch Delay Penalty on VLIW: a conditional branch (BC L) to label L carries a 6-cycle penalty = 24 lost instruction issue opportunities (6 cycles × 4 issue slots per cycle).

  4. Predication
         x = a[i];
         if (x > 0) {
             b[i] = x - c[i];
         } else {
             e[i] = x + d[i];
         }
     If there is a 50% probability of a branch, then this code will suffer a P/2 cycle penalty (or a PW/2 instruction issue opportunity penalty) on average, where P is the penalty for a conditional branch (or mispredicted branch) and W is the issue width of the machine.
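
The P/2 and PW/2 figures are just an expected value over the branch outcome. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the 4-wide issue width is an assumption inferred from the 6-cycle / 24-slot figure on the branch delay slide.

```python
def expected_branch_penalty(branch_prob, penalty_cycles, issue_width):
    """Return (expected lost cycles, expected lost issue slots) per branch.

    branch_prob    -- probability that the branch pays the penalty
    penalty_cycles -- P, the penalty in cycles
    issue_width    -- W, the number of instructions issued per cycle
    """
    lost_cycles = branch_prob * penalty_cycles
    lost_slots = lost_cycles * issue_width
    return lost_cycles, lost_slots


# P = 6 comes from the branch delay slide; W = 4 is an assumption (24 / 6).
cycles, slots = expected_branch_penalty(0.5, 6, 4)
print(cycles, slots)  # 3.0 12.0
```

With a 50% branch probability this gives P/2 = 3 cycles and PW/2 = 12 issue slots lost per execution of the `if`.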

  5. Predication
         x = a[i];
         if (x > 0) {
             b[i] = x - c[i];
         } else {
             e[i] = x + d[i];
         }
     The condition guarding a section (or sections) of code will become the predicate. It will be used in both positive and negative (complemented) form.

  6. Predication
         x = a[i];
         if (x > 0) {
             b[i] = x - c[i];
         } else {
             e[i] = x + d[i];
         }
     If we are compiling: reads must either be conditionalized on the predicate (to prevent illegal accesses) or else be proved to be in-bounds and aligned. If programming by hand, reads may be executed unconditionally, but at a cost of additional bandwidth consumption.

  7. Predication
         x = a[i];
         if (x > 0) {
             b[i] = x - c[i];
         } else {
             e[i] = x + d[i];
         }
     Stores must be conditionalized (predicated). They change the state of the machine that is visible after the conditional section has finished. When programming by hand, we may occasionally discover that a conditional write can be done unconditionally, but this is an oddity.
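
The rules for reads and stores can be sketched at the source level. The following is an illustrative Python model, not a compiler transformation; `store_if`, `branching`, and `predicated` are hypothetical names. It assumes the reads of c[i] and d[i] are known to be in bounds, which is what allows them to run unconditionally.

```python
def store_if(pred, array, index, value):
    """Model of a predicated STORE: the write happens only when pred is true."""
    if pred:
        array[index] = value


def branching(a, b, c, d, e, i):
    """The original code, with a conditional branch."""
    x = a[i]
    if x > 0:
        b[i] = x - c[i]
    else:
        e[i] = x + d[i]


def predicated(a, b, c, d, e, i):
    """The same computation with no branch in the body."""
    x = a[i]
    p = x > 0                  # the predicate, used in positive and negated form
    then_value = x - c[i]      # unconditional read: assumes c[i] is in bounds
    else_value = x + d[i]      # unconditional read: assumes d[i] is in bounds
    store_if(p, b, i, then_value)        # stores stay guarded
    store_if(not p, e, i, else_value)
```

Both versions leave `b` and `e` in the same state; the predicated one does the work of both sides and lets the guards decide which store takes effect.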

  8. Compiled Using Conditional Branch
         P1 = CMP R0, 0          // compare x to 0
         BLE P1 L                // branch to the else part if x <= 0
         R1 = R2 + R3            // compute address of c[i]
         R4 = LOAD R1            // load c[i]
         R5 = R0 - R4            // x - c[i]
         R6 = R7 + R3            // compute address of b[i]
         STORE R5, R6
         JMP M
      L: R11 = R12 + R3          // compute address of d[i]
         R14 = LOAD R11          // load d[i]
         R15 = R0 + R14          // x + d[i]
         R16 = R17 + R3          // compute address of e[i]
         STORE R15, R16
      M:
     Source:
         x = a[i];
         if (x > 0) {
             b[i] = x - c[i];
         } else {
             e[i] = x + d[i];
         }

  9. Converting to Predicated Form
         P1 = CMP R0, 0              // compare x to 0; P1 is true when x > 0
         IF P1  R1 = R2 + R3         // compute address of c[i]
         IF P1  R4 = LOAD R1         // load c[i]
         IF P1  R5 = R0 - R4
         IF P1  R6 = R7 + R3         // compute address of b[i]
         IF P1  STORE R5, R6
         IF !P1 R11 = R12 + R3       // compute address of d[i]
         IF !P1 R14 = LOAD R11       // load d[i]
         IF !P1 R15 = R0 + R14
         IF !P1 R16 = R17 + R3       // compute address of e[i]
         IF !P1 STORE R15, R16
     If every instruction type can be predicated, it is relatively simple to convert into a naïve predicated form.
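
To make the meaning of the IF guards concrete, here is a small sketch of an interpreter for the naïve predicated form. The tuple encoding and the `run` function are hypothetical, and three things are assumptions of the sketch: x lives in R0, the comparison sets P1 to (x > 0), and the else side adds (matching e[i] = x + d[i]). An instruction whose guard is false is simply skipped.

```python
def run(program, regs, mem):
    """Execute a list of (guard, op, *operands) tuples in order."""
    for guard, op, *args in program:
        if guard is not None:
            negated = guard.startswith("!")
            if regs[guard.lstrip("!")] == negated:
                continue  # guard is false: the instruction has no effect
        if op == "cmp_gt":
            dst, src, imm = args
            regs[dst] = regs[src] > imm
        elif op == "add":
            dst, lhs, rhs = args
            regs[dst] = regs[lhs] + regs[rhs]
        elif op == "sub":
            dst, lhs, rhs = args
            regs[dst] = regs[lhs] - regs[rhs]
        elif op == "load":
            dst, addr = args
            regs[dst] = mem[regs[addr]]
        elif op == "store":
            val, addr = args
            mem[regs[addr]] = regs[val]
    return regs, mem


NAIVE = [
    (None,  "cmp_gt", "P1", "R0", 0),     # P1 = (x > 0)
    ("P1",  "add",   "R1", "R2", "R3"),   # address of c[i]
    ("P1",  "load",  "R4", "R1"),
    ("P1",  "sub",   "R5", "R0", "R4"),
    ("P1",  "add",   "R6", "R7", "R3"),   # address of b[i]
    ("P1",  "store", "R5", "R6"),
    ("!P1", "add",   "R11", "R12", "R3"), # address of d[i]
    ("!P1", "load",  "R14", "R11"),
    ("!P1", "add",   "R15", "R0", "R14"),
    ("!P1", "add",   "R16", "R17", "R3"), # address of e[i]
    ("!P1", "store", "R15", "R16"),
]
```

Because every instruction on the untaken side is squashed, its loads are never issued, which is why this naïve form is safe even when one of the addresses would be illegal.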

  10. Scheduling Naively
          P1 = CMP R0, 0
          IF P1 R1 = R2 + R3 || IF !P1 R11 = R12 + R3
          IF P1 R4 = LOAD R1 || IF !P1 R14 = LOAD R11
          IF P1 R5 = R0 - R4 || IF !P1 R15 = R0 + R14 || IF P1 R6 = R7 + R3 || IF !P1 R16 = R17 + R3
          IF P1 STORE R5, R6 || IF !P1 STORE R15, R16

  11. Some Difficulties
          P1 = CMP R0, 0
          IF P1 R1 = R2 + R3 || IF !P1 R11 = R12 + R3
          IF P1 R4 = LOAD R1 || IF !P1 R14 = LOAD R11
          IF P1 R5 = R0 - R4 || IF !P1 R15 = R0 + R14 || IF P1 R6 = R7 + R3 || IF !P1 R16 = R17 + R3
          IF P1 STORE R5, R6 || IF !P1 STORE R15, R16
      • If we do this in virtual register form, it appears that the predicated assignments do not kill their destinations:
          R1 = 19
          …
          IF P1 R1 = x + y
          IF P1 R2 = R1 + 7
        It looks as though the first assignment to R1 reaches the use of R1.
      • Nothing can be hoisted above the comparison:
          P1 = CMP …
          IF P1 R1 = R2 + R3
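
The first difficulty can be illustrated with a toy reaching-definitions pass over straight-line code. This is a sketch under simplifying assumptions, not a description of any particular compiler: the instruction encoding and the `reaching_defs` function are hypothetical, and a predicate is assumed not to be redefined between a guarded definition and a guarded use. In the naïve mode a guarded assignment never kills earlier definitions; in the predicate-aware mode a definition under the same guard as the use does.

```python
def reaching_defs(instrs, predicate_aware):
    """instrs is a list of (guard, dest, sources).

    Returns a dict mapping (use index, register) to the sorted list of
    instruction indices whose definitions may reach that use.
    """
    defs = {}     # register -> list of (instruction index, guard), oldest first
    result = {}
    for idx, (guard, dest, sources) in enumerate(instrs):
        for reg in sources:
            reaching = []
            # Walk definitions from newest to oldest.
            for def_idx, def_guard in reversed(defs.get(reg, [])):
                reaching.append(def_idx)
                # A definition kills older ones if it is certain to have
                # executed whenever this use executes.
                must_execute = def_guard is None or (
                    predicate_aware and def_guard == guard
                )
                if must_execute:
                    break
            result[(idx, reg)] = sorted(reaching)
        if dest is not None:
            defs.setdefault(dest, []).append((idx, guard))
    return result


EXAMPLE = [
    (None, "R1", []),           # 0: R1 = 19
    ("P1", "R1", ["x", "y"]),   # 1: IF P1 R1 = x + y
    ("P1", "R2", ["R1"]),       # 2: IF P1 R2 = R1 + 7
]

print(reaching_defs(EXAMPLE, predicate_aware=False)[(2, "R1")])  # [0, 1]
print(reaching_defs(EXAMPLE, predicate_aware=True)[(2, "R1")])   # [1]
```

The naïve analysis reports that `R1 = 19` reaches the use in instruction 2, which is exactly the spurious dependence described in the first bullet.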

  12. More Sophisticated Conversion and Scheduling
          A                        // A-D are unrelated instructions prior to the CMP
          B
          C
          D
          P1 = CMP R0, 0           // compare x to 0
          R1 = R2 + R3
          R4 = LOAD R1             // only OK if we are sure it can’t trap
          R5 = R0 - R4
          R6 = R7 + R3
          IF P1 STORE R5, R6       // must be predicated if we want the same result
          R11 = R12 + R3
          R14 = LOAD R11           // only OK if we are sure it can’t trap
          R15 = R0 + R14
          R16 = R17 + R3
          IF !P1 STORE R15, R16

  13. Scheduling
          R1 = R2 + R3 || R11 = R12 + R3                                        A
          R4 = LOAD R1 || R14 = LOAD R11 || P1 = CMP R0, 0                      B
          R5 = R0 - R4 || R15 = R0 + R14 || R6 = R7 + R3 || R16 = R17 + R3      C
          IF P1 STORE R5, R6 || IF !P1 STORE R15, R16                           D
      • Predication
          • Expands the basic blocks (straight-line segments) of the code
          • Creates additional scheduling opportunities
          • Comes at a cost: useless work is performed unconditionally

  14. Efficiency and Utilization
          R1 = R2 + R3 || R11 = R12 + R3
          R4 = LOAD R1 || R14 = LOAD R11 || P1 = CMP R0, 0
          R5 = R0 - R4 || R15 = R0 + R14 || R6 = R7 + R3 || R16 = R17 + R3
          IF P1 STORE R5, R6 || IF !P1 STORE R15, R16
      • If this is a 4-wide machine, then
          • Utilization is 11 / 16 (11 instructions issued in 4 cycles × 4 slots)
          • Efficiency is 6 / 16
              • true case: 6 useful instructions in 16 slots
              • false case: 6 useful instructions in 16 slots
              • average is (6 + 6)/2 = 6 instructions in 16 slots
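
The two ratios can be recomputed mechanically from the schedule. In this sketch each instruction is reduced to a tag saying which outcome needs it; the tags, the `SCHEDULE` table, and the function names are illustrative assumptions. Utilization counts every issued instruction, while efficiency counts only the instructions that are useful for a given outcome of the predicate.

```python
from fractions import Fraction

# One list per cycle. "T" = needed only when the predicate is true,
# "F" = needed only when it is false, "both" = needed either way.
SCHEDULE = [
    ["T", "F"],            # address of c[i] || address of d[i]
    ["T", "F", "both"],    # load c[i] || load d[i] || compare
    ["T", "F", "T", "F"],  # x - c[i] || x + d[i] || address of b[i] || address of e[i]
    ["T", "F"],            # guarded store to b[i] || guarded store to e[i]
]


def utilization(schedule, width):
    """Issued instructions divided by available issue slots."""
    issued = sum(len(cycle) for cycle in schedule)
    return Fraction(issued, len(schedule) * width)


def efficiency(schedule, width, outcome):
    """Useful instructions for one outcome ("T" or "F") divided by issue slots."""
    useful = sum(
        1 for cycle in schedule for side in cycle if side in (outcome, "both")
    )
    return Fraction(useful, len(schedule) * width)


print(utilization(SCHEDULE, 4))      # 11/16
print(efficiency(SCHEDULE, 4, "T"))  # 3/8, i.e. 6/16
print(efficiency(SCHEDULE, 4, "F"))  # 3/8, i.e. 6/16
```

`Fraction` reduces 6/16 to 3/8 when printing; the value is the same as the figure on the slide.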

  15. Converting Predicated Assignments
          if (x > 0)
              y = a + b;
          else
              y = c + d;
      y is assigned on both sides of the “if”.
          CMP R8, 0          // x > 0
          R1 = R2 + R3       // a + b
          R4 = R5 + R6       // c + d
          CMOV GT R4, R1     // if (x > 0) R4 = R1, i.e. a + b
      • At this point, R4 holds the value of y
      • The first 3 instructions can be done in parallel
      • It is common for machines to have conditional move as their only support for predication
      • (not so common in the case of VLIWs, however)
          CMP R8, 0 || R1 = R2 + R3 || R4 = R5 + R6
          …
          CMOV GT R4, R1
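
The conditional-move pattern is small enough to model directly. In this sketch `cmov` and `select_y` are hypothetical names, and the variables r1 and r4 stand in for the registers in the sequence above: both sums are computed unconditionally, and the conditional move then overwrites the else value with the then value only when x > 0.

```python
def cmov(cond, dest_value, src_value):
    """Conditional move: yields src_value if cond holds, else dest keeps its value."""
    return src_value if cond else dest_value


def select_y(x, a, b, c, d):
    """Branch-free form of: if (x > 0) y = a + b; else y = c + d;"""
    r1 = a + b                 # then value, computed unconditionally
    r4 = c + d                 # else value, computed unconditionally
    r4 = cmov(x > 0, r4, r1)   # corresponds to CMOV GT R4, R1
    return r4                  # r4 now holds y


print(select_y(1, 2, 3, 4, 5))   # 5
print(select_y(0, 2, 3, 4, 5))   # 9
```

This only works because both sides are side-effect-free arithmetic; a side containing a store or a possibly trapping load would still need a guard.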

  16. If we have more than CMOV
          if (x > 0)
              y = a + b;
          else
              y = c + d;
      y is assigned on both sides of the “if”.
          CMP R8, 0
          …
          IFGT R4 = R2 + R3 || IFLE R4 = R5 + R6
      The TigerSHARC does this, expressed as an IF .. ELSE form.

  17. Mid-Term Review
      • Bandwidth and Latency
          • “Saturating” the bandwidth of one dimension of the machine
      • Conditions
          • Condition codes
          • Conditional branching
      • Data types
          • Integer, fractional, saturation, etc.
      • Pipelining
      • Instruction Sets
          • MIPS, ARM, Thumb, TigerSHARC, C6X
      • Static ILP Exploitation
          • Vector processing
          • VLIW processing
      • Compiler techniques
          • Unroll / Jam
          • Scheduling
          • Predication
      • This isn’t an exhaustive list
      • The homeworks are a good guide also
