
IA-64 Register Model: Stack & Rotation

Dale Morris, Architect, Hewlett Packard Co.


Presentation Transcript


  1. IA-64 Register Model: Stack & Rotation • Dale Morris, Architect, Hewlett Packard Co.

  2. Philosophy • Large register files: most processors have lots of registers • Explicit control over register renaming: most processors have register renaming; IA-64 makes the register names SW-visible & makes the renaming explicit

  3. Outline • Register Stack • Register Stack Engine • Register Rotation • Loop Branches • Modulo-Scheduling of Loops • Summary

  4. Register Stack • Motivation: automatic save/restore of GRs on procedure call/return; cache traffic reduction; latency hiding of register spill/fill

  5. General Registers [Figure: r0-r31 static; r32-r127 stacked]

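  6. GR Stack Frame [Figure: r0-r31 static; the frame starts at r32 with the locals (which include the inputs), followed by the outputs; registers above the frame, up to r127, are illegal. The Current Frame Marker (CFM) holds size of frame (sof) and size of locals (sol)]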

  7. GR Stack Frame - Example [Figure: CFM has sol=14, sof=21; locals start at r32, outputs run from r46 to r52]

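  8. GR Stack Frame - Call [Figure: before the call, CFM has sol=14, sof=21 and PFM is undefined (x, x); after the call, CFM has sol=0, sof=7 and PFM has sol=14, sof=21. The caller's outputs r46-r52 become the callee's r32-r38]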

  9. GR Stack Frame - Allocate [Figure: alloc changes CFM from sol=0, sof=7 to sol=16, sof=19; PFM stays at sol=14, sof=21. The callee's locals, which include its inputs, start at r32; its outputs run from r48 to r50]

  10. GR Stack Frame - Return [Figure: return restores CFM to sol=14, sof=21 from PFM; the caller again sees locals from r32 and outputs r46-r52]
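The call, alloc and return frames above can be replayed with a small Python model. This is a sketch under stated assumptions, not the hardware definition: `ToyRegisterStack`, `FrameMarker` and the `base` offset are hypothetical names, and the model tracks only the frame markers and the renaming base, not register contents.

```python
from dataclasses import dataclass

FIRST_STACKED = 32  # r32 is the first stacked general register


@dataclass(frozen=True)
class FrameMarker:
    sol: int  # size of locals (inputs + locals)
    sof: int  # size of frame (locals + outputs)

    def output_regs(self):
        """Virtual register numbers of this frame's output registers."""
        return list(range(FIRST_STACKED + self.sol, FIRST_STACKED + self.sof))


class ToyRegisterStack:
    """Hypothetical model: tracks only CFM, PFM and the renaming base."""

    def __init__(self, sol, sof):
        self.cfm = FrameMarker(sol, sof)
        self.pfm = None   # "x x" on the slides: undefined before the first call
        self.base = 0     # stacked-register offset that virtual r32 maps to

    def br_call(self):
        self.pfm = self.cfm                      # copies CFM to PFM
        self.base += self.cfm.sol                # caller's locals stay untouched
        outputs = self.cfm.sof - self.cfm.sol
        self.cfm = FrameMarker(0, outputs)       # new frame: outputs only

    def alloc(self, sol, sof):
        self.cfm = FrameMarker(sol, sof)         # resizes the current frame
        return self.pfm                          # saves PFM to a GR

    def mov_to_pfs(self, saved):
        self.pfm = saved                         # restores PFM from a GR

    def br_ret(self):
        self.cfm = self.pfm                      # restores CFM from PFM
        self.base -= self.cfm.sol                # caller's r32 is visible again


stack = ToyRegisterStack(sol=14, sof=21)
stack.br_call()
after_call = stack.cfm
saved_pfm = stack.alloc(sol=16, sof=19)
after_alloc = stack.cfm
stack.mov_to_pfs(saved_pfm)
stack.br_ret()
print(after_call, after_alloc, stack.cfm)
```

The three printed markers correspond to the CFM values shown on the Call, Allocate and Return slides.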

  11. Instructions • br.call: copies CFM to PFM; creates new frame with only output regs; saves local regs from previous frame • alloc: resizes current frame; saves PFM to a GR

  12. Instructions (cont.) • mov to PFS: restores PFM from a GR • br.ret: restores CFM from PFM; restores local regs for previous frame

  13. Leaf Procedure Optimization • No need to save/restore PFM • Can always use scratch static GRs • Can omit alloc if not many registers are needed and register rotation is not needed

  14. Register Stack Engine (RSE) • Automatically spills registers to memory and fills them from memory as needed • Registers saved on a Backing Store Stack • Spills/fills NaT bits as well

  15. Reg Stack & Backing Store • A calls B calls C [Figure: two columns, physical stacked registers and backing store. Labels: procA's ancestors, procA (sola), procB (solb), procC as the current frame (sofc), unallocated regions, RSE loads/stores between the two columns, and call/return direction arrows]
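The A calls B calls C picture can be approximated with a counting model. The sketch below is a toy under loud assumptions: `ToyRSE` is a hypothetical name, it counts registers rather than storing values, it spills only when a new frame does not fit, and it fills only what a returning caller needs. The real engine is free to spill and fill at other times, and NaT bits are ignored here.

```python
class ToyRSE:
    """Hypothetical counting model of spills/fills; not the real RSE policy."""

    def __init__(self, physical=96):
        self.physical = physical  # number of physical stacked registers
        self.callers = []         # (sol, sof) of suspended frames, oldest first
        self.resident = 0         # suspended locals still held in registers
        self.spilled = 0          # suspended locals held in the backing store
        self.sof = 0              # size of the current frame

    def alloc(self, sof):
        """Resize the current frame, spilling oldest registers if needed."""
        self.sof = sof
        overflow = self.resident + self.sof - self.physical
        if overflow > 0:
            self.resident -= overflow
            self.spilled += overflow

    def call(self, sol):
        """Suspend the current frame; the callee starts with the outputs."""
        assert 0 <= sol <= self.sof
        self.callers.append((sol, self.sof))
        self.resident += sol
        self.sof -= sol

    def ret(self):
        """Resume the caller, filling any of its locals that were spilled."""
        sol, sof = self.callers.pop()
        missing = sol - self.resident
        if missing > 0:
            self.spilled -= missing
            self.resident += missing
        self.resident -= sol
        self.sof = sof


rse = ToyRSE(physical=96)
rse.alloc(40)        # procA's frame
rse.call(sol=30)     # A calls B
rse.alloc(50)        # procB's frame still fits
rse.call(sol=40)     # B calls C
rse.alloc(60)        # procC's frame forces a spill
spilled_in_c = rse.spilled
rse.ret()            # back in procB
spilled_in_b = rse.spilled
rse.ret()            # back in procA
print(spilled_in_c, spilled_in_b, rse.spilled)
```

In this run procC's allocation pushes the oldest registers to the backing store, and each return fills back only what the resumed caller needs.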

  16. Register Stack: Summary • Exposes register renaming to SW • Avoids register spill when few needed • Hides register spill/fill • Programmable sizes: only use as many registers as you need

  17. Outline • Register Stack • Register Stack Engine • Register Rotation • Loop Branches • Modulo-Scheduling of Loops • Summary

  18. Register Rotation • Motivation: pipeline-schedule loops onto HW; remove extraneous work from loop; minimize start-up overhead; small code footprint; maximum computational throughput with few instructions

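  19. GR Stack Frame w/ Rotation [Figure: r0-r31 static; the stacked frame starts at r32 and holds locals (sol) followed by outputs, up to sof, within r32-r127; the rotating region of size sor starts at r32. Current Frame Marker (CFM) fields: rrb.pr, rrb.fr, rrb.gr, sor, sol, sof]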

  20. GR Rotation • Size of rotating region multiple of 8 • Rotating region overlays current frame • Starts at r32 • Overlay allows rotation & stack renaming in a single level of adders • Must copy input registers before loop

  21. FR Rotation • Upper 3/4 of register file rotates [Figure: f0-f31 static; f32-f127 rotating]

  22. Predicate Rotation • Upper 3/4 of register file rotates [Figure: p0-p15 static; p16-p63 rotating]

  23. Register Rotation & RRB • Separate Rotating Register Base for each: GRs, FRs, PRs • Loop branches decrement all rotating register bases (RRB) • Instructions contain a “virtual” register number • RRB + virtual register number = physical register number, wrapping around within the rotating region [Animation: ld1 R35 executes; RRB=0]

  24. Register Rotation & RRB (cont.) [Animation: st1 R35 and ld2 R34 execute; RRB=0]

  25. Register Rotation & RRB (cont.) [Animation: st2 R35 and ld3 R34 execute; RRB=-1; the register labels now run 35, 34, 33, 32, 127, so the names have wrapped around]

  26. Register Rotation & RRB (cont.) [Animation: st3 R35 and ld4 R34 execute; RRB=-2; the register labels now run 34, 33, 32, 127, 126]

  27. Register Rotation & RRB (cont.) [Animation: st4 R35 executes; RRB=-3; the register labels now run 33, 32, 127, 126, 125. Across the frames, each word loaded through R34 is stored through R35 one rotation later]
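The renaming rule in these frames can be written out as a short Python sketch. The assumptions are mine: `physical_gr` and `stream` are hypothetical helper names, the rotating region is taken to be the full r32-r127 range (`sor=96`) as the wrapped labels suggest, and every load targets R34 while every store reads R35, as in frames 24 to 26.

```python
def physical_gr(virtual, rrb, sor=96):
    """Map a virtual GR number to a physical one.

    Registers r32 .. r32+sor-1 rotate; everything else maps to itself.
    """
    if 32 <= virtual < 32 + sor:
        return 32 + (virtual - 32 + rrb) % sor
    return virtual


def stream(words, sor=96):
    """Copy words through the rotating registers: ld into R34, st from R35."""
    regs = {}   # physical register number -> value
    rrb = 0
    stored = []
    for i in range(len(words) + 1):
        if i >= 1:
            stored.append(regs[physical_gr(35, rrb, sor)])   # st R35
        if i < len(words):
            regs[physical_gr(34, rrb, sor)] = words[i]       # ld R34
        rrb -= 1                                             # loop branch
    return stored


print(physical_gr(34, 0), physical_gr(35, -1), physical_gr(32, -1))
print(stream(["Palm", "Springs", "is", "Sunny"]))
```

The value written through R34 before a rotation is the one read through R35 after it, which is why the loop body needs no register-to-register copies.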

  28. Loop Branches • br.cloop uses LC for simple, non-pipelined loops: decrements LC and loops until LC is 0 • br.ctop uses LC and EC for pipelined counted loops • br.wtop uses branch predicate and EC for pipelined “while” loops • br.cexit, br.wexit used for unrolled, pipelined loops

  29. br.ctop • Function (simplified):
        if (LC>0)      {LC--; pr[63]=1; rrb--; loop;}
        else if (EC>1) {EC--; pr[63]=0; rrb--; loop;}
        else           {EC--; pr[63]=0; rrb--; fall_through;}
      • LC counts main loop iterations • EC counts pipeline stages for drain
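The simplified br.ctop function above can be exercised in Python to see how the stage predicates fill and drain. This is a sketch of the slide's simplified rule only: `LoopState`, `br_ctop` and `run` are hypothetical names, and the starting values LC=3, EC=4, p16=1 are chosen to match a four-iteration, four-stage loop.

```python
class LoopState:
    """LC, EC, a rotating register base and the 64 predicate registers."""

    def __init__(self, lc, ec):
        self.lc = lc
        self.ec = ec
        self.rrb = 0
        self.pr = [0] * 64
        self.pr[16] = 1           # like "mov pr.rot = 0x10000": p16 = 1

    def _phys(self, p):
        # p16..p63 rotate; p0..p15 are static
        return 16 + (p - 16 + self.rrb) % 48 if p >= 16 else p

    def get(self, p):
        return self.pr[self._phys(p)]

    def set(self, p, value):
        self.pr[self._phys(p)] = value


def br_ctop(s):
    """Return True if the branch loops, False on fall-through."""
    if s.lc > 0:
        s.lc -= 1
        s.set(63, 1)
        s.rrb -= 1
        return True
    taken = s.ec > 1
    s.ec -= 1
    s.set(63, 0)
    s.rrb -= 1
    return taken


def run(lc, ec, stages=(16, 17, 18, 19)):
    """Record the stage predicates seen by each pass through the loop body."""
    s = LoopState(lc, ec)
    trace = []
    while True:
        trace.append(tuple(s.get(p) for p in stages))
        if not br_ctop(s):
            return trace, s


trace, final = run(lc=3, ec=4)
for row in trace:
    print(row)
print(final.lc, final.ec, final.rrb)
```

Each pass writes p63 and then rotates, so the written value shows up as p16 on the next pass; the ones march through p16..p19 while LC counts down, and the zeros follow while EC drains the pipeline.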

  30. Software Pipelining • Overlapping execution of different loop iterations • More iterations in same amount of time [Figure: sequential vs. overlapped execution of loop iterations]

  31. Software Pipelining • Traditional architectures use loop unrolling: high overhead from extra code for loop body, prologue, and epilogue • Synergistic use of IA-64 features: full predication; special branches; register rotation (removes loop copy overhead); predicate rotation (removes prologue & epilogue) • Especially useful for integer code with a small number of loop iterations

  32. Pipelined Loop Example • DAXPY inner loop: dy[i] = dy[i] + (da * dx[i]); 2 loads, 1 fma, 1 store / iteration • Machine assumptions: can do 2 loads, 1 store, 1 fma, 1 br / cycle; load latency of 2 clocks; fma latency of 1 clock

  33. Example: Pipeline • Each column represents 1 source iteration [Figure: per-iteration stages in order: load dx,dy; tmp = dy + da * dx; store dy]

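  34. Example Code
        .rotf dx[3], dy[3], tmp[2]
        mov ar.lc = 3                  // #iterations-1
        mov ar.ec = 4                  // #stages
        mov pr.rot = 0x10000 ;;        // p16 = 1, p17-p63 = 0
      looptop:
        (p16) ldfd dx[0] = [dxsp],8    // load dx[i], post-increment pointer
        (p16) ldfd dy[0] = [dysp],8    // load dy[i], post-increment pointer
        (p18) fma.d tmp[0] = da, dx[2], dy[2]   // tmp = da * dx + dy
        (p19) stfd [dydp] = tmp[1],8   // store result to dy[i]
        br.ctop looptop ;;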

  35. Loop Execution - Initialization [Animation: loop body is (p16) ldx, (p16) ldy, (p18) fma, (p19) st. Start state: p16=1, p17=p18=p19=0, p63=0; LC=3 EC=4 RRB=0]

  36. Loop Execution - Branch 1 [Animation: before the branch LC=3 EC=4 RRB=0; after it LC=2 EC=4 RRB=-1]

  37. Loop Execution - Branch 2 [Animation: before the branch LC=2 EC=4 RRB=-1; after it LC=1 EC=4 RRB=-2]

  38. Loop Execution - Branch 3 [Animation: before the branch LC=1 EC=4 RRB=-2; after it LC=0 EC=4 RRB=-3]

  39. Loop Execution - Branch 4 [Animation: before the branch LC=0 EC=4 RRB=-3; after it LC=0 EC=3 RRB=-4]

  40. Loop Execution - Branch 5 [Animation: before the branch LC=0 EC=3 RRB=-4; after it LC=0 EC=2 RRB=-5]

  41. Loop Execution - Branch 6 [Animation: before the branch LC=0 EC=2 RRB=-5; after it LC=0 EC=1 RRB=-6]

  42. Loop Execution - Branch 7 [Animation: before the branch LC=0 EC=1 RRB=-6; after it LC=0 EC=0 RRB=-7, and the branch falls through]
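The seven passes through the loop can also be checked numerically. The following Python sketch interprets the four-stage DAXPY loop with rotating floating-point and predicate registers. It is a toy, not an IA-64 simulator: `daxpy_pipelined` is a hypothetical name, a single `rrb` variable stands in for the separate bases (the slides decrement them together), the `.rotf` arrays are assumed to be laid out consecutively from f32, and latencies are not modeled beyond the stage assignment.

```python
def daxpy_pipelined(da, x, y):
    """Toy interpreter for the 4-stage loop; returns the updated dy."""
    n = len(x)
    assert n >= 1 and len(y) == n
    y = list(y)
    fr = [0.0] * 128          # floating-point registers f0..f127
    pr = [0] * 64             # predicate registers p0..p63
    rrb = 0                   # shared rotating register base (assumption)

    def f(v):                 # virtual -> physical FR (f32..f127 rotate)
        return 32 + (v - 32 + rrb) % 96

    def p(v):                 # virtual -> physical PR (p16..p63 rotate)
        return 16 + (v - 16 + rrb) % 48

    DX, DY, TMP = 32, 35, 38  # assumed layout for .rotf dx[3], dy[3], tmp[2]
    lc, ec = n - 1, 4         # mov ar.lc / mov ar.ec
    pr[16] = 1                # mov pr.rot = 0x10000
    src = dst = 0             # stand-ins for the post-incremented pointers

    while True:
        if pr[p(16)]:                                   # (p16) ldfd, twice
            fr[f(DX)], fr[f(DY)] = x[src], y[src]
            src += 1
        if pr[p(18)]:                                   # (p18) fma.d
            fr[f(TMP)] = da * fr[f(DX + 2)] + fr[f(DY + 2)]
        if pr[p(19)]:                                   # (p19) stfd
            y[dst] = fr[f(TMP + 1)]
            dst += 1
        # br.ctop, in its simplified form
        if lc > 0:
            lc -= 1
            pr[p(63)] = 1
            rrb -= 1
        else:
            last = ec <= 1
            ec -= 1
            pr[p(63)] = 0
            rrb -= 1
            if last:
                break
    return y


result = daxpy_pipelined(2.0, [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
print(result)
```

With four elements the body runs seven times (four iterations plus three drain passes), matching the seven branches in the animation, and no prologue or epilogue code is needed.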

  43. Pipelining & Latency • Suppose we change the latencies: load latency of 6 clocks; fma latency of 4 clocks

  44. Example: New Pipeline • Each column represents 1 source iteration [Figure: per-iteration stages in order, now spread over more cycles: load dx,dy; tmp = dy + da * dx; store dy]

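  45. Updated Loop
        .rotf dx[7], dy[7], tmp[5]
        mov ar.lc = 3                  // #iterations-1
        mov ar.ec = 11                 // #stages
        mov pr.rot = 0x10000 ;;        // p16 = 1, p17-p63 = 0
      looptop:
        (p16) ldfd dx[0] = [dxsp],8    // load dx[i], post-increment pointer
        (p16) ldfd dy[0] = [dysp],8    // load dy[i], post-increment pointer
        (p22) fma.d tmp[0] = da, dx[6], dy[6]   // 6 stages after the loads
        (p26) stfd [dydp] = tmp[4],8   // 4 stages after the fma
        br.ctop looptop ;;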
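The numbers that change between the two versions of the loop follow directly from the latencies. The Python sketch below derives them, assuming one new iteration starts every cycle as in the example; `pipeline_parameters` and its dictionary keys are hypothetical names introduced for illustration.

```python
def pipeline_parameters(load_latency, fma_latency):
    """Derive stage predicates, EC and .rotf sizes for the DAXPY loop.

    Assumes one new iteration starts every cycle, so a value produced in
    stage s is read `latency` rotations later under a different name.
    """
    fma_stage = load_latency                   # stages after the loads
    store_stage = load_latency + fma_latency   # stages after the loads
    return {
        "fma_predicate": 16 + fma_stage,       # loads use p16
        "store_predicate": 16 + store_stage,
        "ec": store_stage + 1,                 # number of pipeline stages
        "dx_regs": load_latency + 1,           # .rotf dx[...]
        "dy_regs": load_latency + 1,           # .rotf dy[...]
        "tmp_regs": fma_latency + 1,           # .rotf tmp[...]
    }


original = pipeline_parameters(load_latency=2, fma_latency=1)
updated = pipeline_parameters(load_latency=6, fma_latency=4)
print(original)
print(updated)
```

Only the predicate numbers, register indices, EC and the `.rotf` sizes change; the loop body keeps the same five instructions at either latency.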

  46. Rotation: Summary • Loop pipelining maximizes performance; minimizes overhead • Avoids code expansion of unrolling and code explosion of prologue and epilogue • Smaller code means fewer cache misses • Greater performance improvements in higher latency conditions • Reduced overhead allows S/W pipelining of small loops with unknown trip counts (typical of integer scalar codes)

  47. Outline • Register Stack • Register Stack Engine • Register Rotation • Loop Branches • Modulo-Scheduling of Loops • Summary

  48. Register Model Summary • GR Stack: overlap call/ret operations with real work; RSE hides spills/fills • GR, FR, PR Rotation: general acceleration for all types of loops • SW-visible resources: large named register files & renaming; HW simplicity and explicit control

  49. IA-64 Register Model: Stack & Rotation • Dale Morris, Architect, Hewlett Packard Co.
