IA-64 Register Model: Stack & Rotation

IA-64 Register Model: Stack & Rotation Dale Morris Architect Hewlett Packard Co.

Philosophy • Large register files • Most processors have lots of registers • Explicit control over register renaming • Most processors have register renaming • IA-64 makes the register names SW-visible & makes the renaming explicit

Outline • Register Stack • Register Stack Engine • Register Rotation • Loop Branches • Modulo-Scheduling of Loops • Summary

Register Stack • Motivation: • Automatic save/restore of GRs on procedure call/return • Cache traffic reduction • Latency hiding of register spill/fill

General Registers • r0-r31: static • r32-r127: stacked

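GR Stack Frame • r0-r31: static • Frame starts at r32: locals (including inputs) first, then outputs • Stacked registers above the frame are illegal to access • Current Frame Marker (CFM) holds size of frame (sof) and size of locals (sol) • sof covers locals plus outputs; sol covers the locals only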

GR Stack Frame - Example • CFM: sol = 14, sof = 21 • Locals: r32-r45 • Outputs: r46-r52

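GR Stack Frame - Call • Before the call: CFM sol = 14, sof = 21; PFM is don't-care • After the call: CFM sol = 0, sof = 7; PFM sol = 14, sof = 21 • The caller's outputs r46-r52 become the callee's r32-r38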

GR Stack Frame - Allocate • alloc resizes the callee's frame: CFM sol = 16, sof = 19 • Locals (including inputs): r32-r47 • Outputs: r48-r50 • PFM unchanged: sol = 14, sof = 21

GR Stack Frame - Return • return restores CFM from PFM: sol = 14, sof = 21 • The caller again sees locals r32-r45 and outputs r46-r52

Instructions • br.call • Copies CFM to PFM • Creates new frame with only output regs • Saves local regs from previous frame • alloc • Resizes current frame • Saves PFM to a GR

Instructions (cont.) • mov to PFS • Restores PFM from a GR • br.ret • Restores CFM from PFM • Restores local regs for previous frame
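The frame-marker bookkeeping on the preceding slides can be sketched as a small Python model. This is only an illustration under stated assumptions: the class and method names (`RegisterStack`, `br_call`, `alloc`, `br_ret`) are hypothetical, register contents are not modeled, and `base` is just a counter showing which stacked register the current frame sees as r32.

```python
from dataclasses import dataclass

FIRST_STACKED = 32  # r32 is the first stacked general register


@dataclass(frozen=True)
class FrameMarker:
    sol: int  # size of locals (inputs + locals)
    sof: int  # size of frame (locals + outputs)


class RegisterStack:
    """Toy model: tracks CFM, PFM and the frame base, not register values."""

    def __init__(self, sol, sof):
        self.cfm = FrameMarker(sol, sof)
        self.pfm = None   # don't-care until the first call
        self.base = 0     # offset of the register the current frame calls r32

    def br_call(self):
        # Copy CFM to PFM; the new frame holds only the caller's outputs.
        self.pfm = self.cfm
        self.base += self.cfm.sol
        self.cfm = FrameMarker(0, self.pfm.sof - self.pfm.sol)

    def alloc(self, sol, sof):
        # Resize the current frame; return PFM so the caller can keep it in a GR.
        self.cfm = FrameMarker(sol, sof)
        return self.pfm

    def br_ret(self):
        # Restore CFM from PFM; the caller's locals become visible again.
        self.cfm = self.pfm
        self.base -= self.pfm.sol

    def outputs(self):
        # Register numbers of the current frame's output registers.
        return range(FIRST_STACKED + self.cfm.sol, FIRST_STACKED + self.cfm.sof)


stack = RegisterStack(sol=14, sof=21)   # the example frame
stack.br_call()                         # CFM becomes (sol=0, sof=7)
stack.alloc(sol=16, sof=19)             # CFM becomes (sol=16, sof=19)
stack.br_ret()                          # CFM is (sol=14, sof=21) again
```

The model leaves out the case where the callee itself makes calls; there the PFM value returned by `alloc` would have to be written back (mov to PFS) before `br_ret`.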

Leaf Procedure Optimization • No need to save/restore PFM • Can always use scratch static GRs • Can omit alloc if: • Not many registers needed • Register rotation not needed

Register Stack Engine • Automatically spills/fills registers to/from memory as needed • Registers saved on a Backing Store Stack • Spills/fills NaT bits as well

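Reg Stack & Backing Store • Example: A calls B calls C • [Figure: physical stacked registers shown beside the backing store in memory. The register file holds procC's current frame (sofc) above the locals of procB (solb) and procA (sola), with unallocated registers beyond it; the backing store holds procA's ancestors, with unallocated memory beyond them. RSE loads/stores move registers between the two as calls and returns grow and shrink the stack.]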
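A rough way to see when the RSE has to touch memory is to count registers. The sketch below is an assumption-laden toy, not the hardware algorithm: it assumes 96 physical stacked registers, spills only when a new frame does not fit, fills only when a returning caller's locals are missing, and ignores NaT bits and any eager spilling. All names are hypothetical.

```python
class RegisterStackEngine:
    """Counts spills and fills for a chain of calls and returns (toy model)."""

    def __init__(self, physical=96):
        self.physical = physical  # assumed number of physical stacked registers
        self.resident = 0         # callers' locals still held in registers
        self.spilled = 0          # callers' locals sitting in the backing store
        self.sols = []            # sol of every suspended caller
        self.stores = 0           # registers written to the backing store
        self.loads = 0            # registers read back from the backing store

    def call(self, caller_sol, callee_sof):
        # The caller's locals stay live below the callee's frame.
        self.sols.append(caller_sol)
        self.resident += caller_sol
        overflow = self.resident + callee_sof - self.physical
        if overflow > 0:
            # Not enough room: spill the oldest registers.
            self.resident -= overflow
            self.spilled += overflow
            self.stores += overflow

    def ret(self):
        sol = self.sols.pop()
        missing = sol - self.resident
        if missing > 0:
            # Part of the caller's locals was spilled: fill it.
            self.spilled -= missing
            self.resident += missing
            self.loads += missing
        self.resident -= sol


rse = RegisterStackEngine()
for _ in range(3):          # shallow call chain with small frames
    rse.call(caller_sol=20, callee_sof=30)
print(rse.stores)           # 0: everything still fits in the register file
```

With a deeper chain the same model starts spilling, and every spilled register is filled again on the way back up.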

Register Stack: Summary • Exposes register renaming to SW • Avoids register spill when few needed • Hides register spill/fill • Programmable sizes • only use as many registers as you need

Register Rotation • Motivation: • pipeline-schedule loops onto HW • remove extraneous work from loop • minimize start-up overhead • small code footprint • maximum computational throughput with few instructions

GR Stack Frame w/ Rotation • r0-r31: static • Frame from r32: locals (sol), then outputs, up to sof • Rotating region of size sor starts at r32, inside the frame • Current Frame Marker (CFM) fields: rrb.pr, rrb.fr, rrb.gr, sor, sol, sof

GR Rotation • Size of rotating region multiple of 8 • Rotating region overlays current frame • Starts at r32 • Overlay allows rotation & stack renaming in a single level of adders • Must copy input registers before loop

FR Rotation • f0-f31: static • f32-f127: rotating • Upper 3/4 of register file rotates

Predicate Rotation • p0-p15: static • p16-p63: rotating • Upper 3/4 of register file rotates

Register Rotation & RRB • Separate Rotating Register Base for each: GRs, FRs, PRs • Loop branches decrement all register rotating bases (RRB) • Instructions contain a “virtual” register number • RRB + virtual register number = physical register number
• Animation: a loop copies the words "Palm", "Springs", "is", "Sunny" to an output that starts as "IA-64"
• RRB=0: ld1 R35 writes physical r35 = "Palm"
• RRB=0: st1 R35 reads r35 ("Palm"); ld2 R34 writes r34 = "Springs"; output: IA-64 Palm
• RRB=-1: st2 R35 reads r34 ("Springs"); ld3 R34 writes r33 = "is"; output: IA-64 Palm Springs
• RRB=-2: st3 R35 reads r33 ("is"); ld4 R34 writes r32 = "Sunny"; output: IA-64 Palm Springs is
• RRB=-3: st4 R35 reads r32 ("Sunny"); output: IA-64 Palm Springs is Sunny
• The instructions always name R34 and R35; below r32 the physical numbering wraps around to r127
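The rule "RRB + virtual register number = physical register number" can be tried out with a few lines of Python. The sketch assumes the whole r32-r127 region rotates (96 registers) and that the sum wraps around inside that region; the function names are hypothetical, and the word list follows the animation on the slide.

```python
ROT_BASE = 32   # first rotating register
ROT_SIZE = 96   # assumption: all of r32-r127 rotates


def physical(virtual, rrb):
    """Map a virtual register number to a physical one for a given RRB."""
    if virtual < ROT_BASE:
        return virtual  # static registers are never renamed
    return ROT_BASE + (virtual - ROT_BASE + rrb) % ROT_SIZE


def copy_words(words):
    """Replay the animation: loads write R34, stores read R35."""
    regs, out, rrb = {}, [], 0
    regs[physical(35, rrb)] = words[0]        # ld1 R35
    for word in words[1:]:
        out.append(regs[physical(35, rrb)])   # st R35: value loaded one pass ago
        regs[physical(34, rrb)] = word        # ld R34: value for the next pass
        rrb -= 1                              # loop branch decrements RRB
    out.append(regs[physical(35, rrb)])       # final st R35
    return out


print(copy_words(["Palm", "Springs", "is", "Sunny"]))
# ['Palm', 'Springs', 'is', 'Sunny']
```

Because a value written to R34 shows up as R35 after the branch, the loop body needs no register-to-register copy to hand data from one pass to the next.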

Loop Branches • br.cloop uses LC for simple, non-pipelined loops • decrements LC and loops until LC is 0 • br.ctop uses LC and EC for pipelined counted loops • br.wtop uses branch predicate and EC for pipelined “while” loops • br.cexit, br.wexit used for unrolled, pipelined loops

br.ctop • Function (simplified):
    if (LC > 0)      { LC--; pr[63] = 1; rrb--; loop; }
    else if (EC > 1) { EC--; pr[63] = 0; rrb--; loop; }
    else             { EC--; pr[63] = 0; rrb--; fall_through; }
• LC counts main loop iterations • EC counts pipeline stages for drain
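The simplified br.ctop rule translates directly into Python, which makes it easy to count how many times a loop body runs. This sketch follows the slide's simplification (like the slide, it ignores the case where EC is already 0); the function names are hypothetical.

```python
def br_ctop(lc, ec, rrb):
    """One br.ctop per the simplified rule: returns (lc, ec, rrb, p63, taken)."""
    if lc > 0:
        return lc - 1, ec, rrb - 1, 1, True    # main iterations: keep starting new ones
    if ec > 1:
        return lc, ec - 1, rrb - 1, 0, True    # drain: no new iteration starts
    return lc, ec - 1, rrb - 1, 0, False       # last stage done: fall through


def trace(lc, ec):
    """State after each branch until the loop falls through."""
    rrb, taken, rows = 0, True, []
    while taken:
        lc, ec, rrb, p63, taken = br_ctop(lc, ec, rrb)
        rows.append((lc, ec, rrb, p63, taken))
    return rows


for row in trace(lc=3, ec=4):
    print(row)
```

For LC = 3 and EC = 4 the trace has seven rows: the body runs (iterations + stages - 1) = 4 + 4 - 1 times, and RRB ends at -7.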

Software Pipelining • Overlaps execution of different loop iterations, instead of running each iteration to completion before starting the next • More iterations in same amount of time

Software Pipelining • Traditional architectures use loop unrolling • High overhead: extra code for loop body, prologue, and epilogue • Synergistic use of IA-64 features: • Full Predication • Special branches • Register rotation: removes loop copy overhead • Predicate rotation: removes prologue & epilogue • Especially useful for integer code with small number of loop iterations

Pipelined Loop Example • DAXPY inner loop • dy[i] = dy[i] + (da * dx[i]) • 2 loads, 1 fma, 1 store / iteration • Machine assumptions • can do 2 loads, 1 store, 1 fma, 1 br / cycle • load latency of 2 clocks • fma latency of 1 clock

Example: Pipeline • Each column represents 1 source iteration • Stages of one iteration, in order: load dx, dy; tmp = dy + da * dx; store dy

Example Code
            .rotf dx[3], dy[3], tmp[2]
            mov ar.lc = 3                  // #iterations-1
            mov ar.ec = 4                  // #stages
            mov pr.rot = 0x10000 ;;
    looptop:
      (p16) ldfd dx[0] = [dxsp],8
      (p16) ldfd dy[0] = [dysp],8
      (p18) fma.d tmp[0] = da, dx[2], dy[2]
      (p19) stfd [dydp] = tmp[1],8
            br.ctop looptop ;;

Loop Execution • Execution sequence of the example loop, starting from LC=3, EC=4, RRB=0 • Every pass issues (p16) ldx, (p16) ldy, (p18) fma, (p19) st, then br.ctop • Each row shows the state after that branch; predicates are listed by the names the next pass uses

    Step             LC  EC  RRB   p16  p17  p18  p19
    Initialization    3   4    0     1    0    0    0
    Branch 1          2   4   -1     1    1    0    0
    Branch 2          1   4   -2     1    1    1    0
    Branch 3          0   4   -3     1    1    1    1
    Branch 4          0   3   -4     0    1    1    1
    Branch 5          0   2   -5     0    0    1    1
    Branch 6          0   1   -6     0    0    0    1
    Branch 7          0   0   -7     0    0    0    0   (fall through)

• Each branch writes p63 (1 while LC > 0, else 0); after rotation that value becomes the new p16 • Passes 1-3 fill the pipeline, pass 4 runs every stage, passes 5-7 drain it
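To check that the four-instruction loop really computes DAXPY, the whole mechanism can be simulated in Python. This is a sketch under several assumptions that are not in the slides: `.rotf` is assumed to place dx, dy and tmp in consecutive rotating registers starting at f32, one RRB value is shared by the FR and PR files (they are decremented together anyway), all instructions of a pass read their inputs before the branch rotates, and the input lists are non-empty. Function names are hypothetical.

```python
def phys(virtual, rrb, base, size):
    """Map a virtual rotating register number to a physical one."""
    return base + (virtual - base + rrb) % size


def pipelined_daxpy(da, dx_mem, dy_mem):
    """Run the 4-stage example loop; returns (new dy, number of passes)."""
    dy = list(dy_mem)
    fr = [0.0] * 128            # floating-point registers
    pr = [0] * 64               # predicate registers
    DX, DY, TMP = 32, 35, 38    # assumed layout of .rotf dx[3], dy[3], tmp[2]
    lc, ec, rrb = len(dx_mem) - 1, 4, 0
    pr[16] = 1                  # mov pr.rot = 0x10000 sets only p16
    src = dst = passes = 0

    def f(v):                   # FR file: f32-f127 rotate
        return phys(v, rrb, 32, 96)

    def p(v):                   # PR file: p16-p63 rotate
        return phys(v, rrb, 16, 48)

    while True:
        passes += 1
        if pr[p(19)]:           # (p19) stfd [dydp] = tmp[1],8
            dy[dst] = fr[f(TMP + 1)]
            dst += 1
        if pr[p(18)]:           # (p18) fma.d tmp[0] = da, dx[2], dy[2]
            fr[f(TMP)] = da * fr[f(DX + 2)] + fr[f(DY + 2)]
        if pr[p(16)]:           # (p16) ldfd dx[0] / ldfd dy[0]
            fr[f(DX)] = dx_mem[src]
            fr[f(DY)] = dy[src]
            src += 1
        # br.ctop, simplified as on the br.ctop slide
        if lc > 0:
            lc, p63, taken = lc - 1, 1, True
        elif ec > 1:
            ec, p63, taken = ec - 1, 0, True
        else:
            ec, p63, taken = ec - 1, 0, False
        pr[p(63)] = p63         # becomes p16 once RRB is decremented
        rrb -= 1
        if not taken:
            return dy, passes


result, passes = pipelined_daxpy(2.0, [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
print(result, passes)           # [12.0, 24.0, 36.0, 48.0] 7
```

The rotating predicates switch the loads off after the fourth pass and keep the fma and store running until the pipeline is empty, so no separate prologue or epilogue code is needed.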

Pipelining & Latency • Suppose we change the latencies • load latency of 6 clocks • fma latency of 4 clocks

Example: New Pipeline • Each column represents 1 source iteration • Same stages as before (load dx, dy; tmp = dy + da * dx; store dy), now spaced by the longer latencies

Updated Loop
            .rotf dx[7], dy[7], tmp[5]
            mov ar.lc = 3                  // #iterations-1
            mov ar.ec = 11                 // #stages
            mov pr.rot = 0x10000 ;;
    looptop:
      (p16) ldfd dx[0] = [dxsp],8
      (p16) ldfd dy[0] = [dysp],8
      (p22) fma.d tmp[0] = da, dx[6], dy[6]
      (p26) stfd [dydp] = tmp[4],8
            br.ctop looptop ;;
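The numbers that change between the two versions of the loop follow from the latencies. The sketch below derives them, assuming one pass of the loop per clock, so that a result produced by a stage is consumed exactly `latency` passes later; the function name and dictionary keys are hypothetical.

```python
FIRST_ROTATING_PREDICATE = 16


def pipeline_parameters(load_latency, fma_latency):
    """Derive stage predicates, EC and rotating register counts from latencies."""
    load_pred = FIRST_ROTATING_PREDICATE
    fma_pred = load_pred + load_latency       # fma waits for the loads
    store_pred = fma_pred + fma_latency       # store waits for the fma
    return {
        "fma_pred": fma_pred,
        "store_pred": store_pred,
        "ec": store_pred - load_pred + 1,     # number of pipeline stages
        "load_regs": load_latency + 1,        # dx[], dy[]: written as [0], read as [latency]
        "tmp_regs": fma_latency + 1,          # tmp[]: written as [0], read as [latency]
    }


print(pipeline_parameters(2, 1))   # original loop: p18, p19, EC=4, dx[3], tmp[2]
print(pipeline_parameters(6, 4))   # updated loop:  p22, p26, EC=11, dx[7], tmp[5]
```

Only these constants change; the loop body keeps the same five instructions at either latency.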

Rotation: Summary • Loop pipelining maximizes performance; minimizes overhead • Avoids code expansion of unrolling and code explosion of prologue and epilogue • Smaller code means fewer cache misses • Greater performance improvements in higher latency conditions • Reduced overhead allows S/W pipelining of small loops with unknown trip counts • Typical of integer scalar codes

Register Model Summary • GR Stack • Overlap call/ret operations with real work • RSE hides spills/fills • GR, FR, PR Rotation • General acceleration for all types of loops • SW-visible resources • Large named register files & renaming • HW simplicity and explicit control
