The IA-64 Architectural Innovations

Hardware Support for Software Pipelining
José Nelson Amaral

Suggested Reading
Intel IA-64 Architecture Software Developer’s Manual, Chapters 8 and 9

Instruction Group
An instruction group is a set of instructions that have no read-after-write (RAW) or write-after-write (WAW) register dependencies among them. Consecutive instruction groups are separated by stops, written as a double semicolon (;;) in the assembly code.
    ld8 r1=[r5]        // First group
    sub r6=r8,r9 ;;    // First group
    add r3=r1,r4       // Second group
    st8 [r6]=r12       // Second group
The stop is needed because the add reads r1, which the ld8 writes, and the st8 reads r6, which the sub writes.

Instruction Bundles
Instructions are organized in 128-bit bundles of three instructions, with the following format:
    Bits      Field                 Width (bits)
    127-87    instruction slot 2    41
    86-46     instruction slot 1    41
    45-5      instruction slot 0    41
    4-0       template              5
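The bit layout above can be sketched as a pair of pack and unpack helpers. This is only an illustration of the field positions: the function names pack_bundle and unpack_bundle are hypothetical, and the slot and template values are treated as opaque integers rather than real instruction encodings.

```python
SLOT_BITS = 41       # width of each instruction slot
TEMPLATE_BITS = 5    # width of the template field


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each slot must fit in 41 bits")
    # template: bits 4-0, slot 0: bits 45-5, slot 1: bits 86-46, slot 2: bits 127-87
    return template | (slot0 << 5) | (slot1 << 46) | (slot2 << 87)


def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, slot0, slot1, slot2)."""
    mask = (1 << SLOT_BITS) - 1
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot0 = (bundle >> 5) & mask
    slot1 = (bundle >> 46) & mask
    slot2 = (bundle >> 87) & mask
    return template, slot0, slot1, slot2


bundle = pack_bundle(0x10, 1, 2, 3)
print(unpack_bundle(bundle))
```

The field widths add up to 5 + 3 × 41 = 128 bits, so a bundle with every field at its maximum value fills the 128-bit word exactly.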

Bundles
In assembly, each 128-bit bundle is enclosed in curly braces and contains a template specification:
    { .mii
        ld4 r28=[r8]    // Load a 4-byte value
        add r9=2,r1     // 2+r1 and put in r9
        add r30=1,r1    // 1+r1 and put in r30
    }
An instruction group can extend over an arbitrary number of bundles.

Templates
There are restrictions on the type of instructions that can be bundled together. The IA-64 has five slot types (M, I, F, B, and L), six instruction types (M, I, A, F, B, and L), and twelve basic template types (MII, MI_I, MLX, MMI, M_MI, MFI, MMF, MIB, MBB, BBB, MMB, and MFB). The underscore in the template acronym indicates a stop inside the bundle. Every basic template type has two versions: one with a stop at the end of the bundle and one without.

Control Dependency Preventing Code Motion
[Figure: block A ends with a branch (br); the load (ld) is in block B.]
In the code below, the ld4 is control dependent on the branch, and thus cannot be safely moved up in conventional processor architectures.
    add r7=r6,1                  // cycle 0
    add r13=r25,r27
    cmp.eq p1,p2=r12,r23
    (p1) br.cond some_label ;;
    ld4 r2=[r3] ;;               // cycle 1
    sub r4=r2,r11                // cycle 3

Control Speculation
In the following code, suppose a load latency of two cycles:
    (p1) br.cond.dptk L1       // cycle 0
    ld8 r3=[r5] ;;             // cycle 1
    shr r7=r3,r87              // cycle 3
However, if we execute the load before we know that we actually have to do it (control speculation), we get:
    ld8.s r3=[r5]              // earlier cycle
    // other, unrelated instructions
    (p1) br.cond.dptk L1 ;;    // cycle 0
    chk.s r3, recovery         // cycle 1
    shr r7=r3,r87              // cycle 1

Control Speculation
In the speculative version above, the ld8.s instruction is a speculative load: an exception raised by the load is deferred instead of being taken immediately. The chk.s instruction is a check that verifies whether the loaded value is valid and, if it is not, branches to the recovery code.

Ambiguous Memory Dependencies
An ambiguous memory dependency is a dependence between a load and a store, or between two stores, where it cannot be determined whether the instructions involved access overlapping memory locations. Two or more memory references are independent if it is known that they access non-overlapping memory locations.

Data Speculation
An advanced load allows a load to be moved above a store even if it is not known whether the load and the store reference overlapping memory locations.
Original code:
    st8 [r55]=r45       // cycle 0
    ld8 r3=[r5] ;;      // cycle 0
    shr r7=r3,r87       // cycle 2
With an advanced load:
    ld8.a r3=[r5] ;;    // Advanced load
    // other, unrelated instructions
    st8 [r55]=r45       // cycle 0
    ld8.c r3=[r5] ;;    // cycle 0 - check
    shr r7=r3,r87       // cycle 0

Moving Up Loads + Uses: Recovery Code
Original code:
    st8 [r4] = r12          // cycle 0: ambiguous store
    ld8 r6 = [r8] ;;        // cycle 0: load to advance
    add r5 = r6,r7          // cycle 2
    st8 [r18] = r5          // cycle 3
Speculative code:
    ld8.a r6 = [r8] ;;      // cycle -3
    // other, unrelated instructions
    add r5 = r6,r7          // cycle -1; add that uses r6
    // other, unrelated instructions
    st8 [r4]=r12            // cycle 0
    chk.a r6, recover       // cycle 0: check
back:                       // Return point from jump to recover
    st8 [r18] = r5          // cycle 0
recover:
    ld8 r6 = [r8] ;;        // Reload r6 from [r8]
    add r5 = r6,r7          // Re-execute the add
    br back                 // Jump back to main code

ld.c, chk.a and the ALAT
The execution of an advanced load, ld.a, creates an entry in a hardware structure, the Advanced Load Address Table (ALAT). This table is indexed by the register number. Each entry records the load address, the load type, and the size of the load. When a check is executed, the table is searched to verify that a valid entry of the specified type exists for the register.

ld.c, chk.a and the ALAT
An entry e is removed from the ALAT when:
(1) A store overlaps with the memory locations specified in e;
(2) Another advanced load to the same register is executed;
(3) There is a context switch caused by the operating system (or hardware);
(4) Capacity limitation of the ALAT implementation requires reuse of the ALAT slot.
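The four removal rules can be sketched as a small software model of the ALAT. This is an illustrative toy, not the hardware design: the class name, the method names, the default capacity of 32 entries, and the oldest-first eviction policy are all assumptions, and the load type field is omitted for brevity.

```python
class ALAT:
    """Toy model of the Advanced Load Address Table, indexed by register."""

    def __init__(self, capacity=32):      # capacity is an assumed value
        self.capacity = capacity
        self.entries = {}                 # register -> (address, size)

    def advanced_load(self, reg, addr, size):
        """ld.a: record the load, replacing any entry for the same register."""
        self.entries.pop(reg, None)       # rule (2)
        if len(self.entries) >= self.capacity:
            oldest = next(iter(self.entries))
            del self.entries[oldest]      # rule (4), oldest-first is assumed
        self.entries[reg] = (addr, size)

    def store(self, addr, size):
        """A store removes every entry whose bytes it overlaps: rule (1)."""
        self.entries = {
            reg: (a, s)
            for reg, (a, s) in self.entries.items()
            if a + s <= addr or addr + size <= a
        }

    def context_switch(self):
        """Rule (3): all entries are invalidated."""
        self.entries.clear()

    def check(self, reg):
        """ld.c / chk.a: True if the advanced load is still valid."""
        return reg in self.entries


alat = ALAT()
alat.advanced_load("r6", 0x1000, 8)   # ld8.a r6 = [r8]
alat.store(0x2000, 8)                 # store to an unrelated address
print(alat.check("r6"))               # entry still present
alat.store(0x1004, 8)                 # overlapping store
print(alat.check("r6"))               # entry removed, recovery needed
```

When check returns False, the ld.c would redo the load and the chk.a would branch to recovery code, as in the recovery example earlier.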

Not a Thing (NaT)
The IA-64 has 128 general purpose registers, each with 64+1 bits, and 128 floating point registers, each with 82 bits. The extra bit in the GPRs is the NaT bit, which indicates that the content of the register is not valid. NaT=1 indicates that an instruction that generated an exception wrote to the register. It is a way to defer exceptions caused by speculative loads. Any operation that uses NaT as an operand results in NaT.
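A minimal sketch of how NaT propagation defers an exception until a check, written as a Python model. Everything here is an assumption made for illustration: the sentinel NAT stands for a register with its NaT bit set, memory is a dictionary in which a missing address plays the role of a faulting load, and the function names are hypothetical.

```python
NAT = object()   # stands for a register whose NaT bit is set


def speculative_load(memory, addr):
    """ld.s: return the value, or NaT instead of raising on a bad address."""
    return memory.get(addr, NAT)


def add(a, b):
    """Any operation that uses NaT as an operand results in NaT."""
    return NAT if a is NAT or b is NAT else a + b


def check(value, recovery):
    """chk.s: keep a valid value, otherwise run the recovery code."""
    return recovery() if value is NAT else value


memory = {0x100: 7}

good = add(speculative_load(memory, 0x100), 1)
bad = add(speculative_load(memory, 0x999), 1)    # the fault is deferred

print(check(good, recovery=lambda: None))        # valid result, no recovery
print(check(bad, recovery=lambda: "recovered"))  # recovery code runs
```

In the model, the faulting load does not interrupt the computation; the NaT simply flows through the add and is only acted upon when the check is reached.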

If-conversion
If-conversion uses predicates to transform conditional code into a single control stream.
First example:
    if (r4) {
        add r1=r2,r3
        ld8 r6=[r5]
    }
becomes
    cmp.ne p1,p0=r4,0 ;;    // Set predicate reg
    (p1) add r1=r2,r3
    (p1) ld8 r6=[r5]
Second example:
    if (r1)
        r2 = r3 + r4
    else
        r7 = r6 - r5
becomes
    cmp.ne p1,p2=r1,0 ;;    // Set predicate reg
    (p1) add r2=r3,r4
    (p2) sub r7=r6,r5

Optimization of Loops
Source code:
    void f(int *p, int *q, int A, int N) {
        int t, c;
        for (c = 0; c < N; c++) {
            t = *p++;
            t = t + A;
            *q++ = t;
        }
    }
Assembly code:
    L1: ld4 r4 = [r5], 4 ;;    // Cycle 0 load postinc 4
        add r7 = r4, r9 ;;     // Cycle 2
        st4 [r6] = r7, 4       // Cycle 3 store postinc 4
        br.cloop L1 ;;         // Cycle 3
Instruction descriptions:
    ld4 r4 = [r5], 4 ;;    r4 ← MEM[r5]; r5 ← r5 + 4
    st4 [r6] = r7, 4       MEM[r6] ← r7; r6 ← r6 + 4
    br.cloop L1            if LC ≠ 0 then LC ← LC - 1; goto L1

Optimization of Loops
    (a) L1: ld4 r4 = [r5], 4 ;;
    (b) add r7 = r4, r9 ;;
    (c) st4 [r6] = r7, 4
    (d) br.cloop L1 ;;
Schedule (cycle in which each instruction issues):
    Iteration    a     b     c/d
    1            0     2     3
    2            4     6     7
    3            8     10    11
    4            12    14
If LC=1000, how long does it take for this loop to execute? It takes 4000 cycles.

Optimization of Loops: Loop Unrolling
    (a) L1: ld4 r4 = [r5], 4 ;;
    (b) ld4 r14 = [r5], 4 ;;
    (c) add r7 = r4, r9 ;;
    (d) add r17 = r14, r9
    (e) st4 [r6] = r7,4 ;;
    (f) st4 [r6] = r17,4
    (g) br.cloop L1 ;;
Schedule (cycle in which each instruction issues):
    Iteration    a     b     c     d/e    f/g
    1            0     1     2     3      4
    2            5     6     7     8      9
    3            10    11    12    13     14
For simplicity we assume that N is a multiple of 2. Because the loads (a) and (b) both update r5, they have to be serialized.

Optimization of Loops: Loop Unrolling
If LC=1000 for the original loop, how long does it take for the loop unrolled twice to execute? Each pass through the unrolled loop takes 5 cycles and covers two source iterations, so it takes 500 × 5 = 2500 cycles. Thus the loop is 4000/2500 = 1.6 times faster.

Optimization of Loops: Expanding the Induction Variable
        add r15 = 4, r5
        add r16 = 4, r6 ;;
    (a) L1: ld4 r4 = [r5], 8
    (b) ld4 r14 = [r15], 8 ;;
    (c) add r7 = r4, r9
    (d) add r17 = r14, r9 ;;
    (e) st4 [r6] = r7,8
    (f) st4 [r16] = r17,8
    (g) br.cloop L1 ;;
Schedule (cycle in which each instruction issues):
    Iteration    a/b    c/d    e/f/g
    1            0      2      3
    2            4      6      7
    3            8      10     11
    4            12     14
We use twice as many functional units as the original code. But no instruction is issued in cycle 1, and the functional units are still under-utilized.

Optimization of Loops: Expanding the Induction Variable
If LC=1000 for the original loop, how long does it take for the loop with expanded induction variables to execute? Each pass takes 4 cycles and covers two source iterations, so it takes 500 × 4 = 2000 cycles. Thus the loop is 4000/2000 = 2.0 times faster.

Optimization of Loops: Further Loop Unrolling
        add r15 = 4, r5
        add r25 = 8, r5
        add r35 = 12, r5
        add r16 = 4, r6
        add r26 = 8, r6
        add r36 = 12, r6 ;;
    (a) L1: ld4 r4 = [r5], 16
    (b) ld4 r14 = [r15], 16 ;;
    (c) ld4 r24 = [r25], 16
    (d) ld4 r34 = [r35], 16 ;;
    (e) add r7 = r4, r9
    (f) add r17 = r14, r9 ;;
    (g) st4 [r6] = r7,16
    (h) st4 [r16] = r17,16
    (i) add r27 = r24, r9
    (j) add r37 = r34, r9 ;;
    (k) st4 [r26] = r27, 16
    (l) st4 [r36] = r37, 16
    (m) br.cloop L1 ;;
Schedule (cycle in which each instruction issues):
    Iteration    a/b    c/d    e/f    g/h/i/j    k/l/m
    1            0      1      2      3          4
    2            5      6      7      8          9
    3            10     11     12     13         14

Optimization of Loops: Further Loop Unrolling
If LC=1000 for the original loop, how long does it take for this loop (unrolled 4 times) to execute? It takes 250*5 = 1250 cycles. Thus the loop is 4000/1250 = 3.2 times faster.
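The cycle counts quoted for the four loop variants follow from one formula: the number of passes through the loop body times the cycles per pass. The short Python sketch below reproduces the arithmetic; the function name total_cycles and the dictionary labels are hypothetical, while the unroll factors and cycles per pass are read off the schedules shown for each variant.

```python
def total_cycles(source_iterations, unroll_factor, cycles_per_pass):
    """Cycles to run the loop, ignoring setup code before the loop."""
    if source_iterations % unroll_factor != 0:
        raise ValueError("iteration count must be a multiple of the unroll factor")
    return (source_iterations // unroll_factor) * cycles_per_pass


# variant name -> (unroll factor, cycles per pass through the loop body)
variants = {
    "original": (1, 4),
    "unrolled x2": (2, 5),
    "unrolled x2, expanded induction variables": (2, 4),
    "unrolled x4": (4, 5),
}

results = {name: total_cycles(1000, u, c) for name, (u, c) in variants.items()}
baseline = results["original"]

for name, cycles in results.items():
    print(f"{name}: {cycles} cycles, speedup {baseline / cycles:.1f}")
```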

Loop Optimization: Loop Unrolling
In the previous example we obtained good utilization of the functional units through loop unrolling, but at the cost of code expansion and higher register pressure. Software pipelining offers an alternative by overlapping the execution of operations from multiple iterations of the loop.

Loop Optimization: Software Pipelining
    (S1) ld4 r4 = [r5], 4
    (S2) - - -
    (S3) add r7 = r4, r9
    (S4) st4 [r6] = r7, 4
    * This is not real code
Schedule (columns are source iterations, rows are cycles):
    Cycle   1    2    3    4    5    6    7    Phase
    0       S1                                 prologue
    1            S1                            prologue
    2       S3        S1                       prologue
    3       S4   S3        S1                  kernel
    4            S4   S3        S1             kernel
    5                 S4   S3        S1        kernel
    6                      S4   S3        S1   kernel
    7                           S4   S3        epilogue
    8                                S4   S3   epilogue
    9                                     S4   epilogue
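The overlapped schedule can be generated mechanically: with one new iteration started per cycle, stage k of iteration i runs in cycle i + k. The sketch below is a minimal Python illustration under that assumption; the function name pipeline_schedule is hypothetical, and the empty stage S2 is represented by None.

```python
def pipeline_schedule(iterations, stages):
    """Map each cycle to the (iteration, stage) pairs that run in it.

    One new iteration starts every cycle; a stage given as None is an
    empty slot (here, the cycle spent waiting for the load).
    """
    schedule = {}
    for it in range(iterations):
        for depth, name in enumerate(stages):
            if name is None:
                continue
            schedule.setdefault(it + depth, []).append((it + 1, name))
    return schedule


stages = ["S1", None, "S3", "S4"]
schedule = pipeline_schedule(7, stages)

for cycle in sorted(schedule):
    ops = " ".join(f"{name}(it{it})" for it, name in schedule[cycle])
    print(f"cycle {cycle}: {ops}")
```

Seven iterations through four stages finish in 7 + 4 - 1 = 10 cycles (cycles 0 to 9), and the kernel is the range of cycles in which S1, S3, and S4 are all active.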

Loop Optimization: Software Pipelining Code
Prologue:
        ld4 r4 = [r5], 4 ;;    // load x[1]
        ld4 r4 = [r5], 4 ;;    // load x[2]
        add r7 = r4, r9        // y[1] = x[1] + k
        ld4 r4 = [r5], 4 ;;    // load x[3]
Kernel:
    L1: ld4 r4 = [r5], 4       // load x[i+3]
        add r7 = r4, r9        // y[i+1] = x[i+1] + k
        st4 [r6] = r7, 4       // store y[i]
        br.cloop L1 ;;
Epilogue:
        st4 [r6] = r7, 4       // store y[n-2]
        add r7 = r4, r9 ;;     // y[n-1] = x[n-1] + k
        st4 [r6] = r7, 4       // store y[n-1]
        add r7 = r4, r9 ;;     // y[n] = x[n] + k
        st4 [r6] = r7, 4       // store y[n]

Software Pipelining and Data Dependencies
Source code:
    void f(int *p, int *q, int N) {
        int t, c;
        for (c = 0; c < N; c++) {
            t = *p++;
            t = t + 1;
            *q++ = t;
        }
    }
Naïve code:
    loop: ldl RA ← [RC]
          incr RC ← RC+1
          add RB ← 1 + RA
          stl [RD] ← RB
          incr RD ← RD+1
          if (loop not done) goto loop
Next step: create an auto-increment addressing mode.

Software Pipelining and Data Dependencies
Naïve code for the same source loop, with auto-increment addressing:
    loop: ldl RA ← [RC]+
          add RB ← 1 + RA
          stl [RD]+ ← RB
          if (loop not done) goto loop
Next step, scalar expansion: write to a different register in each iteration.

Software Pipelining and Data Dependencies
Naïve code for the same source loop, after scalar expansion:
    loop: ldl RAi ← [RC]+
          add RBi ← 1 + RAi
          stl [RD]+ ← RBi
          if (loop not done) goto loop
Still have RAW dependencies! How to create an unbounded number of registers? Rotate the registers!

Software Pipelining and Data Dependencies
The same source loop, with register rotation done by explicit copies:
    loop: ldl R32 ← [RC]+
          add R34 ← 1 + R33
          stl [RD]+ ← R35
          if (loop not done)
              copy temp ← R35
              copy R35 ← R34
              copy R34 ← R33
              copy R33 ← R32
              copy R32 ← temp
              goto loop
Dependencies on the copies! Hardware rotates registers automatically!

Simulating an Infinite Register File
The same source loop, with rotation over r32-r39 and separate prolog and epilog code:
    prolog: ldl r32 ← [r12]+
            (rotate) r33 ← r32
            ldl r32 ← [r12]+
            add r34 ← 1 + r33
            (rotate) r35 ← r34
            (rotate) r34 ← r33
            (rotate) r33 ← r32
    loop:   ldl r32 ← [r12]+
            add r34 ← 1 + r33
            stl [r13]+ ← r35
            if (loop is not done)
                (rotate) temp ← r39
                (rotate) r39 ← r38
                (rotate) r38 ← r37
                (rotate) r37 ← r36
                (rotate) r36 ← r35
                (rotate) r35 ← r34
                (rotate) r34 ← r33
                (rotate) r33 ← r32
                (rotate) r32 ← temp
                goto loop
    epilog: add r34 ← 1 + r33
            stl [r13]+ ← r35
            (rotate) r35 ← r34
            stl [r13]+ ← r35
It would be better not to generate separate code for the prolog and epilog. Use predicate registers.

Simulating an Infinite Register File
The same source loop, with each instruction guarded by a predicate (1 = execute, 0 = skip):
    prolog: (1) ldl r32 ← [r12]+
            (0) add r34 ← 1 + r33
            (0) stl [r13]+ ← r35
            (rotate all)
            (1) ldl r32 ← [r12]+
            (1) add r34 ← 1 + r33
            (0) stl [r13]+ ← r35
            (rotate all)
    loop:   (1) ldl r32 ← [r12]+
            (1) add r34 ← 1 + r33
            (1) stl [r13]+ ← r35
            if (loop is not done) (rotate all) goto loop
    epilog: (0) ldl r32 ← [r12]+
            (1) add r34 ← 1 + r33
            (1) stl [r13]+ ← r35
            (rotate all)
            (0) ldl r32 ← [r12]+
            (0) add r34 ← 1 + r33
            (1) stl [r13]+ ← r35
            (rotate all)
Still need separate code for prolog and epilog. Rotate the predicate registers!

Simulating an Infinite Register File
The same source loop, with rotating predicate registers:
    loop: (p16) ldl r32 ← [r12]+
          (p17) add r34 ← 1 + r33
          (p18) stl [r13]+ ← r35
          if (loop is not done) (rotate all) goto loop
We have been ignoring the loop counter c and the test at the end of the loop. Create a special software pipelining branch.

Support for Software Pipelining in the IA-64
After a loop is converted into a software pipeline, it looks quite different from the original loop. Intel adopts the following terminology:
• source loop and source iteration: refer to the original source code;
• kernel loop and kernel iteration: refer to the code that implements the software pipeline.

Loop Support in the IA-64: Register Rotation
The IA-64 has a rotating register base (rrb) register that is decremented by special software pipelined loop branches. When the rrb is decremented, the value stored in register X appears to move to register X+1, and the value of the highest numbered rotating register appears to move to the lowest numbered rotating register.

Loop Support in the IA-64: Register Rotation
What registers can rotate?
• The predicate registers p16-p63;
• The floating-point registers f32-f127;
• A programmable portion of the general registers:
    • The alloc instruction can allocate 0, 8, 16, 24, …, 96 general registers as rotating registers;
    • The lowest numbered rotating register is r32.
There are three rrb registers: rrb.gr, rrb.fr, and rrb.pr.

How Register Rotation Helps Software Pipeline
The concept of a software pipelining branch:
    L1: ld4 r35 = [r4], 4    // post-increment by 4
        st4 [r5] = r37, 4    // post-increment by 4
        swp_branch L1 ;;
The pseudo-instruction swp_branch in the example rotates the general registers. Therefore the value loaded into r35 is read as r37 two kernel iterations (and two rotations) later. The register rotation eliminates the dependence between the load and the store instructions, and allows each kernel iteration to execute in one cycle.
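The renaming performed by the rotating register base can be sketched as a small Python model: a logical register number is mapped to a physical slot by adding the rrb modulo the size of the rotating region. The class name, the choice of eight rotating registers (r32 to r39), and the absence of predication are assumptions made to keep the example short.

```python
class RotatingRegisters:
    """Toy model of a rotating region of the general register file."""

    def __init__(self, first=32, count=8):
        self.first = first
        self.count = count
        self.physical = [None] * count
        self.rrb = 0

    def _index(self, logical):
        return (logical - self.first + self.rrb) % self.count

    def read(self, logical):
        return self.physical[self._index(logical)]

    def write(self, logical, value):
        self.physical[self._index(logical)] = value

    def rotate(self):
        # Decrementing rrb makes the value in rX appear in rX+1.
        self.rrb -= 1


regs = RotatingRegisters()
x = [10, 20, 30, 40, 50]
stored = []

# Two extra passes drain the pipeline. Without predication the first two
# stores read registers that were never loaded (None in this model).
for value in x + [None, None]:
    regs.write(35, value)           # ld4 r35 = [r4], 4
    stored.append(regs.read(37))    # st4 [r5] = r37, 4
    regs.rotate()                   # swp_branch L1

print(stored)
```

Each value written as r35 is read back as r37 two rotations later, so the load of one source iteration and the store of an earlier one can share a cycle without conflicting over a register.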

How Register Rotation Helps Software Pipeline
[Figure: mapping between the logical and physical registers r32-r39 for RRB = 0, -1, and -2, for the ld4/st4 loop with swp_branch. The value loaded through logical r35 in one kernel iteration is read through logical r37 two rotations later.]

The stage predicate
When assembling a software pipeline, the programmer can assign a stage predicate to each stage of the pipeline to control the execution of the instructions in that stage. p16 is architecturally defined as the predicate for the first stage, p17 for the second, and so on. The software pipeline branch rotates the predicate registers and injects a 1 in p16, thus enabling one additional stage of the pipeline at a time during the execution of the prolog.
    (S1): (p16) ld4 r4 = [r5], 4
    (S2): (p17) - - -
    (S3): (p18) add r7 = r4, r9
    (S4): (p19) st4 [r6] = r7, 4

The stage predicate
    (S1): (p16) ld4 r4 = [r5], 4
    (S2): (p17) - - -
    (S3): (p18) add r7 = r4, r9
    (S4): (p19) st4 [r6] = r7, 4
When the loop counter (LC) reaches zero, the software pipeline branch starts to decrement the epilog counter (EC) and injects a 0 in p16 at every rotation, which disables one stage at a time and executes the epilogue of the software pipelined loop.

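Anatomy of a Software Pipelining Branch
[Flowchart: the branch first tests LC, then EC.]
    Condition                    LC      EC      PR[16]   RRB      Outcome
    LC ≠ 0 (prolog/kernel)       LC--    EC      1        RRB--    branch
    LC = 0, EC > 1 (epilog)      LC      EC--    0        RRB--    branch
    LC = 0, EC = 1 (epilog)      LC      EC--    0        RRB--    fall-thru
    LC = 0, EC = 0               LC      EC      0        RRB      fall-thru (special unrolled loops)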
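The decision logic of the software pipelining branch can be sketched as a pure function of LC and EC. The Python below is a model under stated assumptions: it follows the counted-loop form of the branch (the br.ctop used in the example that follows), and the function names br_ctop and trace are hypothetical.

```python
def br_ctop(lc, ec):
    """One execution of the branch.

    Returns (lc, ec, p16, rotate, taken): the updated counters, the value
    injected into p16, whether the registers rotate, and whether the
    branch is taken.
    """
    if lc != 0:                  # prolog / kernel
        return lc - 1, ec, 1, True, True
    if ec > 1:                   # epilog, stages still draining
        return lc, ec - 1, 0, True, True
    if ec == 1:                  # last epilog pass
        return lc, ec - 1, 0, True, False
    return lc, ec, 0, False, False   # EC == 0: special unrolled loops


def trace(lc, ec):
    """Run the branch until it falls through, recording each step."""
    steps = []
    taken = True
    while taken:
        lc, ec, p16, rotate, taken = br_ctop(lc, ec)
        steps.append((lc, ec, p16, taken))
    return steps


for step in trace(lc=4, ec=3):
    print(step)
```

With LC = 4 and EC = 3 the loop body runs seven times: five source iterations plus two extra passes to drain a three-stage pipeline.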

Software Pipelining Example in the IA-64
        mov pr.rot = 0                // Clear all rotating predicate registers
        cmp.eq p16,p0 = r0,r0         // Set p16=1
        mov ar.lc = 4                 // Set loop counter to n-1
        mov ar.ec = 3                 // Set epilog counter to 3
        …
    loop:
        (p16) ldl r32 = [r12], 1      // Stage 1: load x
        (p17) add r34 = 1, r33        // Stage 2: y=x+1
        (p18) stl [r13] = r35,1       // Stage 3: store y
        br.ctop loop                  // Branch back
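The behaviour of this loop can be checked with a small simulation that combines rotating general registers, rotating predicates, and the LC/EC logic of br.ctop. This is a sketch under several assumptions: memory is modelled as Python lists, ldl and stl are treated as one-element load and store with post-increment, only eight general registers rotate, and the function name run_pipelined_loop is hypothetical.

```python
def run_pipelined_loop(x, stages=3):
    """Simulate the three-stage pipelined loop; return (y, kernel_iterations)."""
    if not x:
        raise ValueError("the loop needs at least one element")
    gr = [0] * 8                 # physical r32..r39
    pr = [0] * 48                # physical p16..p63, cleared (mov pr.rot = 0)
    rrb_gr = rrb_pr = 0
    lc, ec = len(x) - 1, stages  # mov ar.lc = n-1 ; mov ar.ec = 3
    src, y, passes = 0, [], 0

    def g(r):                    # logical general register -> physical slot
        return (r - 32 + rrb_gr) % 8

    def p(q):                    # logical predicate -> physical slot
        return (q - 16 + rrb_pr) % 48

    pr[p(16)] = 1                # cmp.eq p16,p0 = r0,r0
    while True:
        passes += 1
        if pr[p(16)]:            # (p16) ldl r32 = [r12], 1
            gr[g(32)] = x[src]
            src += 1
        if pr[p(17)]:            # (p17) add r34 = 1, r33
            gr[g(34)] = 1 + gr[g(33)]
        if pr[p(18)]:            # (p18) stl [r13] = r35, 1
            y.append(gr[g(35)])
        # br.ctop loop
        if lc != 0:
            lc, inject, rotate, taken = lc - 1, 1, True, True
        elif ec > 1:
            ec, inject, rotate, taken = ec - 1, 0, True, True
        elif ec == 1:
            ec, inject, rotate, taken = ec - 1, 0, True, False
        else:
            inject, rotate, taken = 0, False, False
        pr[p(63)] = inject       # becomes p16 once the predicates rotate
        if rotate:
            rrb_gr -= 1
            rrb_pr -= 1
        if not taken:
            return y, passes


print(run_pipelined_loop([1, 2, 3, 4, 5]))
```

For five input elements the model performs seven passes through the loop body and stores x + 1 for every element, matching LC = 4 and EC = 3 in the setup code.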

Software Pipelining Example in the IA-64
[Figure sequence: step-by-step execution of the loop above, showing the physical and logical general registers r32-r39, the predicate registers p16-p18, LC, EC, RRB, and the memory contents x1 to x5.]
    Initial state: LC = 4, EC = 3, RRB = 0, p16 = 1, p17 = 0, p18 = 0.
    First kernel iteration: only the load executes and writes x1 to r32. The br.ctop then rotates the registers (RRB = -1, so x1 is now visible as r33), decrements LC to 3, and injects a 1 in p16, giving p16 = 1, p17 = 1, p18 = 0.
    Second kernel iteration: the load writes x2 to r32 and the add computes y1 = x1 + 1 into r34.
