Static Code Scheduling

Static Code Scheduling CS 671 April 1, 2008

Code Scheduling • Scheduling or reordering instructions to improve performance and/or guarantee correctness • Important for dynamically-scheduled architectures • Crucial (assumed!) for statically-scheduled architectures, e.g. VLIW or EPIC • Takes into account anticipated latencies • Machine-specific, performed late in the sequence of optimization passes • How does this contrast with our earlier exploration of code motion?

Why Must the Compiler Schedule? • Many machines are pipelined and expose some aspects of pipelining to the user (compiler) • Examples: • Branch delay slots! • Memory-access delays • Multi-cycle operations • Some machines don’t have scheduling hardware

Example • Assume loads take 2 cycles and branches have a delay slot. • ____cycles

Code Scheduling Strategy • Get resources operating in parallel • Integer data path • Integer multiply / divide hardware • FP adder, multiplier, divider • Method • Fill the cycles between starting an operation and using its result with computations that do not require that result or the same hardware resources • Drawbacks • Highly hardware dependent
[Figure: timeline labeled Start Op, Try to fill, Use Op]

Scheduling Approaches • Local • Branch scheduling • Basic-block scheduling • Global • Cross-block scheduling • Software pipelining • Trace scheduling • Percolation scheduling

Branch Scheduling • Two problems: • Branches often take some number of cycles to complete • There can be a delay between a compare and its associated branch • A compiler will try to fill these slots with valid instructions (rather than nop) • Delay slots – present in PA-RISC, SPARC, MIPS • Condition delay – PowerPC, Pentium

Recall from Architecture… • IF – Instruction Fetch • ID – Instruction Decode • EX – Execute • MA – Memory access • WB – Write back
  IF ID EX MA WB
     IF ID EX MA WB
        IF ID EX MA WB

Control Hazards
  Taken Branch:       IF ID  EX  MA  WB
  Instr + 1:          IF --- --- --- ---
  Branch Target:      IF ID  EX  MA  WB
  Branch Target + 1:  IF ID  EX  MA  WB

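Data Dependences • If two operations access the same register and at least one of them writes it, they are dependent • Types of data dependences
  Flow (read after write):    r1 = r2 + r3 ; r4 = r1 * 6
  Anti (write after read):    r1 = r2 + r3 ; r2 = r5 * 6
  Output (write after write): r1 = r2 + r3 ; r1 = r4 * 6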
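The three dependence types can be checked mechanically from the registers each instruction reads and writes. The following is a minimal sketch; the `Instr` tuple and the `dependences` helper are hypothetical names introduced only for illustration, and the constant operand 6 is ignored because it is not a register.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    dest: str      # register written
    srcs: tuple    # registers read

def dependences(first: Instr, second: Instr) -> set:
    """Return the dependence types that force `second` to stay after `first`."""
    kinds = set()
    if first.dest in second.srcs:
        kinds.add("flow")      # second reads what first wrote
    if second.dest in first.srcs:
        kinds.add("anti")      # second overwrites what first read
    if first.dest == second.dest:
        kinds.add("output")    # both write the same register
    return kinds

i1 = Instr("r1", ("r2", "r3"))                  # r1 = r2 + r3
print(dependences(i1, Instr("r4", ("r1",))))    # {'flow'}
print(dependences(i1, Instr("r2", ("r5",))))    # {'anti'}
print(dependences(i1, Instr("r1", ("r4",))))    # {'output'}
```

An empty set means the pair shares no register dependence and may be reordered as far as registers are concerned.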

Data Hazards • Memory latency: data not ready
  lw R1,0(R2)    IF ID EX MA WB
  add R3,R1,R4      IF ID EX stall MA WB

Data Hazards • Instruction latency: execute takes > 1 cycle
  addf R3,R1,R2    IF ID EX EX MA WB
  addf R3,R3,R4       IF ID stall EX EX MA WB
Assumes floating point ops take 2 execute cycles

Multi-cycle Instructions • Scheduling is particularly important for multi-cycle operations • Alpha instructions with > 1 cycle latency (partial list)
  mull (32-bit integer multiply)         8
  mulq (64-bit integer multiply)        16
  addt (fp add)                          4
  mult (fp multiply)                     4
  divs (fp single-precision divide)     10
  divt (fp double-precision divide)     23

Avoiding data hazards • Move loads earlier and stores later (assuming this does not violate correctness) • Other stalls may require more sophisticated re-ordering, e.g. ((a+b)+c)+d becomes (a+b)+(c+d) • How can we do this in a systematic way?
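The benefit of rewriting ((a+b)+c)+d as (a+b)+(c+d) is a shorter chain of dependent adds. A minimal sketch, assuming expressions are modelled as nested tuples (a hypothetical encoding chosen only for this illustration):

```python
def depth(expr):
    """Levels of dependent adds: a leaf (str) counts 0, a pair (l, r) adds one level."""
    if isinstance(expr, str):
        return 0
    left, right = expr
    return 1 + max(depth(left), depth(right))

chain = ((("a", "b"), "c"), "d")       # ((a+b)+c)+d
balanced = (("a", "b"), ("c", "d"))    # (a+b)+(c+d)
print(depth(chain), depth(balanced))   # 3 2
```

In the balanced form, a+b and c+d do not depend on each other, so they can overlap. Reassociation of this kind can change floating-point results, which is one case of the correctness caveat above.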

Example: Without Scheduling • Assume: • memory instrs take 3 cycles • mult takes 2 cycles (to have result in register) • rest take 1 cycle • ____cycles

Basic Block Dependence DAGs • Nodes - instructions • Edges - dependence between I1 and I2 • When we cannot determine whether there is a dependence, we must assume there is one
  a) lw R2, (R1)
  b) lw R3, (R1) 4
  c) R4 ← R2 + R3
  d) R5 ← R2 - 1
[Figure: dependence DAG with edges a → c, b → c, and a → d, each labeled 2]
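A dependence DAG for a basic block can be built by comparing each instruction with every earlier one. The sketch below handles register dependences only and ignores memory dependences. The load latency of 2 follows the edge labels in the figure, while the ALU latency of 1 and the zero-latency ordering edges for anti and output dependences are assumptions made for illustration.

```python
# Each instruction: (name, dest register, source registers, latency).
block = [
    ("a", "R2", ["R1"], 2),          # lw R2, (R1)
    ("b", "R3", ["R1"], 2),          # lw R3, (R1) 4
    ("c", "R4", ["R2", "R3"], 1),    # R4 <- R2 + R3   (latency 1 assumed)
    ("d", "R5", ["R2"], 1),          # R5 <- R2 - 1    (latency 1 assumed)
]

def build_dag(block):
    """Return edges (earlier, later, latency) for register dependences."""
    edges = []
    for j, (name_j, dest_j, srcs_j, _) in enumerate(block):
        for name_i, dest_i, srcs_i, lat_i in block[:j]:
            if dest_i in srcs_j:
                # Flow dependence: the consumer waits for the producer's result.
                edges.append((name_i, name_j, lat_i))
            elif dest_j in srcs_i or dest_i == dest_j:
                # Anti or output dependence: ordering only (assumed latency 0).
                edges.append((name_i, name_j, 0))
    return edges

print(build_dag(block))   # [('a', 'c', 2), ('b', 'c', 2), ('a', 'd', 2)]
```

The two loads read the same base register but neither writes it, so no edge connects a and b.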

Example – Build the DAG Assume: memory instrs = 3 mult = 2 (to have result in register) rest = 1 cycle

Creating a schedule • Create a DAG of dependences • Determine priority • Schedule instructions with • Ready operands • Highest priority • Heuristics: If multiple possibilities, fall back on other priority functions

Operation Priority • Priority – Need a mechanism to decide which ops to schedule first (when you have choices) • Common priority functions • Height – Distance from exit node • Give priority to amount of work left to do • Slackness – inversely proportional to slack • Give priority to ops on the critical path • Register use – priority to nodes with more source operands and fewer destination operands • Reduces number of live registers • Uncover – high priority to nodes with many children • Frees up more nodes • Original order – when all else fails

Computing Priorities
• Height(n) =
    exec(n), if n is a leaf
    max(Height(m)) + exec(n) over all successors m of n, otherwise
• Critical path(s) = path(s) through the dependence DAG with the longest latency
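The height definition translates directly into a memoized recursion, and a critical path follows by always stepping to the tallest successor. This is a minimal sketch on a hypothetical four-node DAG (a and b feed c, a also feeds d); the exec times are illustrative assumptions, not the values of the slide's exercise.

```python
from functools import lru_cache

exec_time = {"a": 2, "b": 2, "c": 1, "d": 1}              # assumed latencies
succs = {"a": ["c", "d"], "b": ["c"], "c": [], "d": []}   # successors per node

@lru_cache(maxsize=None)
def height(n):
    """Height(n) = exec(n) for a leaf, else max over successors + exec(n)."""
    if not succs[n]:
        return exec_time[n]
    return max(height(m) for m in succs[n]) + exec_time[n]

def critical_path(start):
    """Follow the tallest successor from `start` down to a leaf."""
    path = [start]
    while succs[path[-1]]:
        path.append(max(succs[path[-1]], key=height))
    return path

heights = {n: height(n) for n in succs}
print(heights)                                    # {'a': 3, 'b': 3, 'c': 1, 'd': 1}
print(critical_path(max(succs, key=height)))      # ['a', 'c']
```

Here a and b tie at height 3, so the graph has more than one critical path; the sketch reports the first one it finds.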

Example – Determine Height and CP Assume: memory instrs = 3 mult = 2 (to have result in register) rest = 1 cycle Critical path: _______

Example – List Scheduling _____cycles

Scheduling vs. Register Allocation

Register Renaming

VLIW • Very Long Instruction Word • Compiler determines exactly what is issued every cycle (before the program is run) • Schedules also account for latencies • All hardware changes result in a compiler change • Usually embedded systems (hence simple HW) • Itanium is actually an EPIC-style machine (accounts for most parallelism, not latencies)

Sample VLIW code
VLIW processor: 5 issue
• 2 Add/Sub units (1 cycle)
• 1 Mul/Div unit (2 cycle, unpipelined)
• 1 LD/ST unit (2 cycle, pipelined)
• 1 Branch unit (no delay slots)

  Add/Sub      Add/Sub      Mul/Div      Ld/St         Branch
  c = a + b    d = a - b    e = a * b    ld j = [x]    nop
  g = c + d    h = c - d    nop          ld k = [y]    nop
  nop          nop          i = j * c    ld f = [z]    br g

Multi-Issue Scheduling Example • Machine: 2 issue, 1 memory port, 1 ALU • Memory port = 2 cycles, non-pipelined • ALU = 1 cycle
[Figure: dependence DAG over ops 1 to 10, with memory ops marked m (2m, 3m, 5m, 7m); blank RU_map table (time, ALU, MEM) and blank Schedule table (time, Ready, Placed) for times 0 to 9]

Earliest Latest Sets • Machine: 2 issue, 1 memory port, 1 ALU • Memory port = 2 cycles, pipelined • ALU = 1 cycle
[Figure: dependence DAG over ops 1 to 10, with memory ops marked m (1m, 2m, 4m, 9m)]

List Scheduling Algorithm
• Build dependence graph, calculate priority
• Add all ops to UNSCHEDULED set
• time = 0
• while (UNSCHEDULED is not empty)
    • time++
    • READY = UNSCHEDULED ops whose incoming deps have been satisfied
    • Sort READY using priority function
    • For each op in READY (highest to lowest priority)
        • op can be scheduled at current time? (resources free?)
            • Yes: schedule it, op.issue_time = time
                • Mark resources busy in RU_map relative to issue time
                • Remove op from UNSCHEDULED/READY sets
            • No: continue
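The loop above can be sketched in a few lines. This is an illustrative version, not a production scheduler: it assumes one non-pipelined unit per resource class (so RU_map reduces to a "busy until" time per unit), a dependence is treated as satisfied once the predecessor's full latency has elapsed, and the four-op example with its priorities is hypothetical.

```python
def list_schedule(ops, deps, priority):
    """ops: name -> (unit, latency); deps: name -> predecessor names.
    Returns name -> issue time. The dependence graph must be acyclic."""
    unscheduled = set(ops)
    issue = {}
    busy_until = {}    # simplified RU_map: unit -> first cycle it is free again
    time = 0
    while unscheduled:
        time += 1
        # READY: every predecessor has issued and its result is available.
        ready = [o for o in unscheduled
                 if all(p in issue and issue[p] + ops[p][1] <= time
                        for p in deps[o])]
        # Highest priority first; the name breaks ties deterministically.
        for op in sorted(ready, key=lambda o: (-priority[o], o)):
            unit, latency = ops[op]
            if busy_until.get(unit, 0) <= time:       # resource free?
                issue[op] = time
                busy_until[unit] = time + latency     # mark resource busy
                unscheduled.remove(op)
    return issue

ops = {"a": ("MEM", 2), "b": ("MEM", 2), "c": ("ALU", 1), "d": ("ALU", 1)}
deps = {"a": [], "b": [], "c": ["a", "b"], "d": ["a"]}
priority = {"a": 3, "b": 3, "c": 1, "d": 1}           # e.g. node heights
print(list_schedule(ops, deps, priority))
```

With a single memory port, the second load cannot issue until cycle 3, which in turn delays c to cycle 5; d fills the ALU in cycle 3 while the second load is in flight.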

Improving Basic Block Scheduling • Loop unrolling – creates longer basic blocks • Register renaming – can change register usage in blocks to remove immediate reuse of registers • Summary • Static scheduling complements (or replaces) dynamic scheduling by the hardware
