IA-64 Register Model: Stack & Rotation Dale Morris Architect Hewlett Packard Co.
Philosophy • Large register files • Most processors have lots of registers • Explicit control over register renaming • Most processors have register renaming • IA-64 makes the register names SW-visible & makes the renaming explicit
Outline • Register Stack • Register Stack Engine • Register Rotation • Loop Branches • Modulo-Scheduling of Loops • Summary
Register Stack • Motivation: • Automatic save/restore of GRs on procedure call/return • Cache traffic reduction • Latency hiding of register spill/fill
General Registers • r0-r31: static • r32-r127: stacked
GR Stack Frame • r0-r31: static • Frame starts at r32: locals (including inputs) first, then outputs • Registers above the frame are illegal to access • Current Frame Marker (CFM) holds size of frame (sof) and size of locals (sol)
GR Stack Frame - Example • CFM: sol=14, sof=21 • Locals: r32-r45 • Outputs: r46-r52
GR Stack Frame - Call • Before the call: CFM sol=14, sof=21; PFM undefined (x) • After the call: the caller's outputs r46-r52 are renamed to r32-r38 in the new frame • CFM: sol=0, sof=7 • PFM: sol=14, sof=21
GR Stack Frame - Allocate • alloc resizes the new frame: CFM sol=16, sof=19 • Locals (including the 7 inputs): r32-r47 • Outputs: r48-r50 • PFM unchanged: sol=14, sof=21
GR Stack Frame - Return • return restores CFM from PFM: sol=14, sof=21 • The caller again sees locals r32-r45 and outputs r46-r52 • PFM unchanged: sol=14, sof=21
Instructions • br.call • Copies CFM to PFM • Creates new frame with only output regs • Saves local regs from previous frame • alloc • Resizes current frame • Saves PFM to a GR
Instructions (cont.) • mov to PFS • Restores PFM from a GR • br.ret • Restores CFM from PFM • Restores local regs for previous frame
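The call, alloc, and return slides can be replayed with a small model. This is a minimal sketch in Python, not IA-64 code: `RegisterStack`, `FrameMarker`, `base`, and `demo_call_sequence` are hypothetical names chosen for illustration, and the model tracks only CFM, PFM, and where r32 currently points. It ignores register contents and the RSE.

```python
from typing import NamedTuple, Optional


class FrameMarker(NamedTuple):
    sol: int  # size of locals (inputs + locals)
    sof: int  # size of frame (locals + outputs)


class RegisterStack:
    """Toy model of CFM/PFM updates; register contents are not modelled."""

    def __init__(self, sol: int, sof: int) -> None:
        self.cfm = FrameMarker(sol, sof)
        self.pfm: Optional[FrameMarker] = None
        self.base = 0  # offset of the stacked register currently named r32

    def br_call(self) -> None:
        # Copy CFM to PFM; the new frame contains only the caller's outputs.
        self.pfm = self.cfm
        outputs = self.cfm.sof - self.cfm.sol
        self.base += self.cfm.sol  # caller's first output becomes r32
        self.cfm = FrameMarker(sol=0, sof=outputs)

    def alloc(self, sol: int, sof: int) -> FrameMarker:
        # Resize the current frame and return PFM so it can be kept in a GR.
        if not 0 <= sol <= sof <= 96:
            raise ValueError("need 0 <= sol <= sof <= 96 stacked registers")
        self.cfm = FrameMarker(sol, sof)
        return self.pfm

    def mov_to_pfs(self, saved: FrameMarker) -> None:
        # Restore PFM from the value saved by alloc.
        self.pfm = saved

    def br_ret(self) -> None:
        # Restore CFM from PFM; r32 points at the caller's locals again.
        self.cfm = self.pfm
        self.base -= self.cfm.sol


def demo_call_sequence():
    """Replay the example: frame (14, 21), call, alloc (16, 19), return."""
    rs = RegisterStack(sol=14, sof=21)
    states = []
    rs.br_call()
    states.append((rs.cfm, rs.pfm, rs.base))
    saved_pfm = rs.alloc(sol=16, sof=19)
    states.append((rs.cfm, rs.pfm, rs.base))
    rs.mov_to_pfs(saved_pfm)
    rs.br_ret()
    states.append((rs.cfm, rs.pfm, rs.base))
    return states
```

Calling `demo_call_sequence()` yields the CFM values from the slides in order: (0, 7) after the call, (16, 19) after alloc, and (14, 21) after the return.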
Leaf Procedure Optimization • No need to save/restore PFM • Can always use scratch static GRs • Can omit alloc if: • Not many registers needed • Register rotation not needed
Register Stack Engine • Automatically spills registers to memory and fills them from memory as needed • Registers saved on a Backing Store Stack • Spills/fills NaT bits as well
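Reg Stack & Backing Store • Example: A calls B calls C • Physical stacked registers hold, from oldest to newest: procA's ancestors, procA (sol_a), procB (sol_b), and procC's current frame (sof_c); the rest are unallocated • Backing Store holds the spilled frames of procA's ancestors, procA, and procB; the rest is unallocated • On call the RSE stores older frames to the Backing Store; on return it loads them back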
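The interaction between the physical stacked registers and the backing store can be illustrated with a counting model. This is a rough sketch under stated assumptions: `ToyRSE` and `run_abc` are hypothetical names, the frame sizes for A, B, and C are made up, and the policy shown (spill only on overflow, fill only when a returning frame needs registers that were spilled) is one simple choice, not a description of any real implementation. NaT bits are not modelled.

```python
class ToyRSE:
    """Counts RSE stores and loads for a chain of calls and returns."""

    def __init__(self, physical: int = 96) -> None:
        self.physical = physical  # number of physical stacked registers
        self.suspended = []       # (sol, sof) of each caller, oldest first
        self.cur_sof = 0          # size of the current frame
        self.spilled = 0          # caller registers now in the backing store
        self.stores = 0           # registers written to the backing store
        self.loads = 0            # registers read back from the backing store

    def _resident_callers(self) -> int:
        return sum(sol for sol, _ in self.suspended) - self.spilled

    def _make_room(self) -> None:
        # Spill the oldest caller registers if the file would overflow.
        overflow = self._resident_callers() + self.cur_sof - self.physical
        if overflow > 0:
            self.spilled += overflow
            self.stores += overflow

    def call(self, caller_sol: int, callee_sof: int) -> None:
        # The caller keeps only its locals; its outputs overlap the callee.
        self.suspended.append((caller_sol, self.cur_sof))
        self.cur_sof = callee_sof
        self._make_room()

    def ret(self) -> None:
        _, sof = self.suspended.pop()
        older = sum(s for s, _ in self.suspended)
        if self.spilled > older:
            # Part of the returning caller's locals was spilled: fill it.
            missing = self.spilled - older
            self.loads += missing
            self.spilled = older
        self.cur_sof = sof
        self._make_room()


def run_abc():
    """A calls B calls C, then both return (frame sizes are illustrative)."""
    rse = ToyRSE(physical=96)
    rse.cur_sof = 40                        # procA: sof=40
    rse.call(caller_sol=30, callee_sof=50)  # A (sol=30) calls B (sof=50)
    rse.call(caller_sol=40, callee_sof=40)  # B (sol=40) calls C (sof=40)
    after_calls = (rse.stores, rse.loads)
    rse.ret()                               # C returns to B
    rse.ret()                               # B returns to A
    return after_calls, (rse.stores, rse.loads)
```

With these sizes the second call needs 110 registers, so 14 of procA's registers are stored; they are loaded back only when control returns to procA.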
Register Stack: Summary • Exposes register renaming to SW • Avoids register spill when few needed • Hides register spill/fill • Programmable sizes • only use as many registers as you need
Register Rotation • Motivation: • pipeline-schedule loops onto HW • remove extraneous work from loop • minimize start-up overhead • small code footprint • maximum computational throughput with few instructions
GR Stack Frame w/ Rotation • r0-r31: static • Frame starts at r32: locals up to sol, then outputs up to sof • Rotating region of size sor overlays the frame, starting at r32 • Current Frame Marker (CFM) fields: rrb.pr, rrb.fr, rrb.gr, sor, sol, sof
GR Rotation • Size of rotating region multiple of 8 • Rotating region overlays current frame • Starts at r32 • Overlay allows rotation & stack renaming in a single level of adders • Must copy input registers before loop
FR Rotation • f0-f31: static • f32-f127: rotating • Upper 3/4 of register file rotates
Predicate Rotation • p0-p15: static • p16-p63: rotating • Upper 3/4 of register file rotates
Register Rotation & RRB • Separate Rotating Register Base for each: GRs, FRs, PRs • Loop branches decrement all rotating register bases (RRB) • Instructions contain a "virtual" register number • RRB + virtual register number = physical register number (wrapping within the rotating region, so r127 follows below r32)
Example (animation): a loop copies the words "Palm Springs is Sunny" one per iteration, loading into R34 and storing from R35
RRB=0: ld1 R35 (Palm)
RRB=0: st1 R35 (Palm), ld2 R34 (Springs)
RRB=-1: st2 R35 (Springs), ld3 R34 (is)
RRB=-2: st3 R35 (is), ld4 R34 (Sunny)
RRB=-3: st4 R35 (Sunny)
• The value loaded into R34 in one iteration is read as R35 in the next, with no copy instruction
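The renaming rule "RRB + virtual register number = physical register number" can be checked with a few lines of Python. This is a sketch under assumptions: `physical_gr` and `copy_words` are hypothetical helper names, and the rotating region is assumed to cover all of r32-r127 (sor=96), which matches the wrap from r32 to r127 in the animation.

```python
def physical_gr(virtual: int, rrb: int, sor: int = 96, base: int = 32) -> int:
    """Map a virtual GR number to a physical one for a given RRB."""
    if virtual < base or virtual >= base + sor:
        return virtual  # outside the rotating region: no renaming
    # Add RRB and wrap around inside the rotating region.
    return base + (virtual - base + rrb) % sor


def copy_words(words):
    """Replay the animation: load into R34, store from R35, rotate."""
    regs = {}
    stored = []
    rrb = 0
    regs[physical_gr(35, rrb)] = words[0]          # ld1 R35, before the loop
    for next_word in list(words[1:]) + [None]:
        stored.append(regs[physical_gr(35, rrb)])  # st R35
        if next_word is not None:
            regs[physical_gr(34, rrb)] = next_word  # ld R34
        rrb -= 1                                   # loop branch decrements RRB
    return stored
```

For example, `physical_gr(32, -1)` gives 127, and `copy_words(["Palm", "Springs", "is", "Sunny"])` returns the four words in their original order even though every store names the same register, R35.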
Loop Branches • br.cloop uses LC for simple, non-pipelined loops • decrements LC and loops until LC is 0 • br.ctop uses LC and EC for pipelined counted loops • br.wtop uses branch predicate and EC for pipelined “while” loops • br.cexit, br.wexit used for unrolled, pipelined loops
br.ctop • Function (simplified): • if (LC>0) { LC--; pr[63]=1; rrb--; loop; } else if (EC>1) { EC--; pr[63]=0; rrb--; loop; } else { EC--; pr[63]=0; rrb--; fall_through; } • LC counts main loop iterations • EC counts pipeline stages for drain
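The simplified br.ctop function above translates directly into Python. This is a sketch of the slide's simplified rule only, not of the full architectural definition; `LoopState`, `br_ctop`, and `count_passes` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LoopState:
    lc: int        # loop count: remaining main-loop iterations
    ec: int        # epilogue count: pipeline stages left to drain
    rrb: int = 0   # rotating register base (one shared value in this sketch)
    pr63: int = 0  # value written to pr[63]; it is seen as p16 after rotation


def br_ctop(s: LoopState) -> bool:
    """Apply the simplified br.ctop rule; return True if the branch is taken."""
    if s.lc > 0:
        s.lc -= 1
        s.pr63 = 1   # start another source iteration
        s.rrb -= 1
        return True
    if s.ec > 1:
        s.ec -= 1
        s.pr63 = 0   # no new iteration: the pipeline is draining
        s.rrb -= 1
        return True
    s.ec -= 1
    s.pr63 = 0
    s.rrb -= 1
    return False     # fall through


def count_passes(lc: int, ec: int):
    """Return how many times the loop body runs, plus the final state."""
    s = LoopState(lc, ec)
    passes = 1
    while br_ctop(s):
        passes += 1
    return passes, s
```

With LC=3 and EC=4 the body runs 7 times: 4 passes start new iterations and 3 more drain the pipeline.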
Software Pipelining • Overlapping execution of different loop iterations (figure: sequential vs. overlapped iterations) • More iterations in same amount of time
Software Pipelining • Traditional architectures use loop unrolling • High overhead: extra code for loop body, prologue, and epilogue • Synergistic use of IA-64 features: • Full Predication • Special branches • Register rotation: removes loop copy overhead • Predicate rotation: removes prologue & epilogue • Especially useful for integer code with a small number of loop iterations
Pipelined Loop Example • DAXPY inner loop • dy[i] = dy[i] + (da * dx[i]) • 2 loads, 1 fma, 1 store / iteration • Machine assumptions • can do 2 loads, 1 store, 1 fma, 1 br / cycle • load latency of 2 clocks • fma latency of 1 clock
Example: Pipeline • Each column represents 1 source iteration • Stage 1: load dx, dy • Stage 2: (waiting on load latency) • Stage 3: tmp = dy + da * dx • Stage 4: store dy
Example Code
        .rotf dx[3], dy[3], tmp[2]
        mov ar.lc = 3            // #iterations-1
        mov ar.ec = 4            // #stages
        mov pr.rot = 0x10000 ;;
looptop:
  (p16) ldfd dx[0] = [dxsp],8
  (p16) ldfd dy[0] = [dysp],8
  (p18) fma.d tmp[0] = da, dx[2], dy[2]
  (p19) stfd [dydp] = tmp[1],8
        br.ctop looptop ;;
Loop Execution • Trace of the example loop (4 source iterations) • Each pass executes the instructions whose predicate is 1; br.ctop then updates LC, EC, and RRB, writes p63, and rotates the predicates • Predicate values are shown as they stand after each branch
Initialization: LC=3 EC=4 RRB=0 • p16=1 p17=0 p18=0 p19=0
Pass 1: ldx, ldy • Branch 1: LC=2 EC=4 RRB=-1 • p16=1 p17=1 p18=0 p19=0
Pass 2: ldx, ldy • Branch 2: LC=1 EC=4 RRB=-2 • p16=1 p17=1 p18=1 p19=0
Pass 3: ldx, ldy, fma • Branch 3: LC=0 EC=4 RRB=-3 • p16=1 p17=1 p18=1 p19=1
Pass 4: ldx, ldy, fma, st • Branch 4: LC=0 EC=3 RRB=-4 • p16=0 p17=1 p18=1 p19=1
Pass 5: fma, st • Branch 5: LC=0 EC=2 RRB=-5 • p16=0 p17=0 p18=1 p19=1
Pass 6: fma, st • Branch 6: LC=0 EC=1 RRB=-6 • p16=0 p17=0 p18=0 p19=1
Pass 7: st • Branch 7: LC=0 EC=0 RRB=-7 • fall through • p16=0 p17=0 p18=0 p19=0
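The whole trace can be reproduced by simulating the example loop with rotating predicates and rotating FRs. This is a minimal Python sketch, not an IA-64 emulator: `Rotating` and `pipelined_daxpy` are hypothetical names, the register offsets follow the `.rotf dx[3], dy[3], tmp[2]` layout, and br.ctop uses the simplified rule from the earlier slide.

```python
class Rotating:
    """A rotating register region addressed by virtual index."""

    def __init__(self, size, fill=0):
        self.regs = [fill] * size
        self.rrb = 0

    def __getitem__(self, virtual):
        return self.regs[(virtual + self.rrb) % len(self.regs)]

    def __setitem__(self, virtual, value):
        self.regs[(virtual + self.rrb) % len(self.regs)] = value

    def rotate(self):
        self.rrb -= 1


def pipelined_daxpy(da, dx, dy):
    """Run the 4-stage loop; return (result, per-pass (p16, p17, p18, p19))."""
    n = len(dx)
    out = list(dy)
    fr = Rotating(96, 0.0)   # f32-f127
    pr = Rotating(48, 0)     # p16-p63
    DX, DY, TMP = 0, 3, 6    # offsets of dx[0], dy[0], tmp[0] in the FR region
    P16, P17, P18, P19, P63 = 0, 1, 2, 3, 47
    lc, ec = n - 1, 4        # mov ar.lc = n-1 ; mov ar.ec = 4
    pr[P16] = 1              # mov pr.rot = 0x10000
    src = dst = 0
    trace = []
    while True:
        trace.append((pr[P16], pr[P17], pr[P18], pr[P19]))
        if pr[P16]:          # (p16) ldfd dx[0], dy[0]
            fr[DX] = dx[src]
            fr[DY] = dy[src]
            src += 1
        if pr[P18]:          # (p18) fma.d tmp[0] = da, dx[2], dy[2]
            fr[TMP] = da * fr[DX + 2] + fr[DY + 2]
        if pr[P19]:          # (p19) stfd [dydp] = tmp[1]
            out[dst] = fr[TMP + 1]
            dst += 1
        # br.ctop (simplified)
        if lc > 0:
            lc -= 1
            new_p63, taken = 1, True
        elif ec > 1:
            ec -= 1
            new_p63, taken = 0, True
        else:
            ec -= 1
            new_p63, taken = 0, False
        pr[P63] = new_p63
        pr.rotate()
        fr.rotate()
        if not taken:
            return out, trace
```

Running it on four elements gives seven passes whose predicate patterns match the trace above, and the result equals `dy[i] + da * dx[i]` for every element.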
Pipelining & Latency • Suppose we change the latencies • load latency of 6 clocks • fma latency of 4 clocks
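Example: New Pipeline • Each column represents 1 source iteration • Stage 1: load dx, dy • Stages 2-6: (waiting on load latency) • Stage 7: tmp = dy + da * dx • Stages 8-10: (waiting on fma latency) • Stage 11: store dy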
Updated Loop
        .rotf dx[7], dy[7], tmp[5]
        mov ar.lc = 3            // #iterations-1
        mov ar.ec = 11           // #stages
        mov pr.rot = 0x10000 ;;
looptop:
  (p16) ldfd dx[0] = [dxsp],8
  (p16) ldfd dy[0] = [dysp],8
  (p22) fma.d tmp[0] = da, dx[6], dy[6]
  (p26) stfd [dydp] = tmp[4],8
        br.ctop looptop ;;
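Only the predicate numbers, register indices, EC, and `.rotf` sizes change between the two versions of the loop, and all of them follow from the two latencies. The sketch below derives them in Python; `schedule` is a hypothetical helper, and it assumes one new iteration starts every cycle, as in both examples.

```python
def schedule(load_latency: int, fma_latency: int, first_pred: int = 16):
    """Derive stage predicates, EC, and .rotf sizes for the DAXPY loop."""
    load_stage = 0
    fma_stage = load_stage + load_latency   # fma waits for the loads
    store_stage = fma_stage + fma_latency   # store waits for the fma
    return {
        "load_pred": first_pred + load_stage,
        "fma_pred": first_pred + fma_stage,
        "store_pred": first_pred + store_stage,
        "ec": store_stage + 1,              # number of pipeline stages
        "fma_reads": load_latency,          # fma reads dx[k], dy[k]
        "store_reads": fma_latency,         # store reads tmp[k]
        "rotf": {
            "dx": load_latency + 1,
            "dy": load_latency + 1,
            "tmp": fma_latency + 1,
        },
    }
```

`schedule(2, 1)` reproduces the first loop (p16, p18, p19, EC=4, `dx[3], dy[3], tmp[2]`) and `schedule(6, 4)` reproduces the updated one (p16, p22, p26, EC=11, `dx[7], dy[7], tmp[5]`).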
Rotation: Summary • Loop pipelining maximizes performance; minimizes overhead • Avoids code expansion of unrolling and code explosion of prologue and epilogue • Smaller code means fewer cache misses • Greater performance improvements in higher latency conditions • Reduced overhead allows S/W pipelining of small loops with unknown trip counts • Typical of integer scalar codes
Register Model Summary • GR Stack • Overlap call/ret operations with real work • RSE hides spills/fills • GR, FR, PR Rotation • General acceleration for all types of loops • SW-visible resources • Large named register files & renaming • HW simplicity and explicit control