CS 3214 Computer Systems

This announcement provides important information regarding Exercise 3 and Project 1 deadlines for CS 3214. Please read the instructions carefully and ensure that your work is submitted on time.

Presentation Transcript


  1. CS 3214 Computer Systems
     Godmar Back
     Lecture 6

  2. Announcements
     • Exercise 3 due today
       • Not on Scholar, use submit.pl or submission script
     • Stay tuned for exercise 4
     • Project 1 due Wed, Feb 10
       • Please read instructions first
       • Must be done on McB 124 machines or on rlogin cluster
     • Auto-fail rule 1:
       • Need at least phase_4 defused to pass class.
     CS 3214 Spring 2010

  3. Nested Array Example
     #define PCOUNT 4
     zip_dig pgh[PCOUNT] =
       {{1, 5, 2, 0, 6},
        {1, 5, 2, 1, 3},
        {1, 5, 2, 1, 7},
        {1, 5, 2, 2, 1}};
     [Figure: memory layout of zip_dig pgh[4]; the four 5-int rows start at addresses 76, 96, 116, and 136, and the array ends at address 156]
     • Declaration “zip_dig pgh[4]” equivalent to “int pgh[4][5]”
       • Variable pgh denotes array of 4 elements
         • Allocated contiguously
       • Each element is an array of 5 int’s
         • Allocated contiguously
     • “Row-Major” ordering of all elements guaranteed
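The row-major guarantee can be checked by computing an element address by hand. The following is a minimal sketch, assuming `zip_dig` is declared as `typedef int zip_dig[5];` (implied by the slide's equivalence with `int pgh[4][5]`, but not shown on it); the helper name `pgh_elem_addr` is hypothetical.

```c
#include <stddef.h>

#define PCOUNT 4
typedef int zip_dig[5];          /* assumption: not shown on the slide */

zip_dig pgh[PCOUNT] = {{1, 5, 2, 0, 6},
                       {1, 5, 2, 1, 3},
                       {1, 5, 2, 1, 7},
                       {1, 5, 2, 2, 1}};

/* Hypothetical helper: address of pgh[i][j] computed with the
   row-major rule  base + (i * 5 + j) * sizeof(int). */
int *pgh_elem_addr(size_t i, size_t j)
{
    char *base = (char *)pgh;
    return (int *)(base + (i * 5 + j) * sizeof(int));
}
```

With 4-byte ints this is base + 20*i + 4*j, which matches the 20-byte row spacing (76, 96, 116, 136) in the slide's figure.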

  4. Multi-Level Array Example
     zip_dig a = { 1, 5, 2, 1, 3 };
     zip_dig b = { 0, 2, 1, 3, 9 };
     zip_dig c = { 9, 4, 7, 2, 0 };
     #define UCOUNT 3
     int *univ[UCOUNT] = {a, b, c};
     [Figure: three 5-int arrays starting at addresses 16, 36, and 56; univ occupies addresses 160, 164, and 168 and holds the pointer values 36, 16, and 56]
     • Variable univ denotes array of 3 elements
     • Each element is a pointer
       • 4 bytes
     • Each pointer points to array of int’s
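Unlike the nested array, an element access through `univ` needs two memory reads: one for the row pointer and one for the int. A small sketch of that access pattern follows; the function name `get_univ_digit` and the `zip_dig` typedef are assumptions for illustration.

```c
#define UCOUNT 3
typedef int zip_dig[5];          /* assumption: not shown on the slide */

zip_dig a = {1, 5, 2, 1, 3};
zip_dig b = {0, 2, 1, 3, 9};
zip_dig c = {9, 4, 7, 2, 0};
int *univ[UCOUNT] = {a, b, c};

/* Hypothetical accessor showing the two dependent loads. */
int get_univ_digit(int index, int dig)
{
    int *row = univ[index];      /* read 1: load the row pointer */
    return row[dig];             /* read 2: load the element     */
}
```

The source syntax `univ[index][dig]` looks the same as for `pgh[index][dig]`, but the rows need not be adjacent in memory.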

  5. Structures
     • Concept
       • Contiguously-allocated region of memory
       • Refer to members within structure by names
       • Members may be of different types
     struct rec {
       int i;
       int a[3];
       int *p;
     };
     [Figure: memory layout; i at offset 0, a at offset 4, p at offset 16, end of structure at offset 20]
     • Accessing Structure Member
     void set_i(struct rec *r, int val)
     {
       r->i = val;
     }
     Assembly:
     # %eax = val
     # %edx = r
     movl %eax,(%edx)   # Mem[r] = val

  6. Generating Pointer to Struct. Member
     struct rec {
       int i;
       int a[3];
       int *p;
     };
     [Figure: memory layout starting at r; i at offset 0, a at offset 4, p at offset 16; the element a[idx] is at r + 4 + 4*idx]
     • Generating Pointer to Array Element
       • Offset of each structure member determined at compile time
     int *find_a(struct rec *r, int idx)
     {
       return &r->a[idx];
     }
     Assembly:
     # %ecx = idx
     # %edx = r
     leal 0(,%ecx,4),%eax     # 4*idx
     leal 4(%eax,%edx),%eax   # r+4*idx+4
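The compile-time offsets can be inspected with the standard `offsetof` macro. This sketch recomputes the address that `find_a` returns by hand; `find_a_by_offset` is a hypothetical helper, not part of the slides.

```c
#include <stddef.h>

struct rec {
    int i;
    int a[3];
    int *p;
};

/* The function from the slide. */
int *find_a(struct rec *r, int idx)
{
    return &r->a[idx];
}

/* Hypothetical helper: same address as r + offsetof(a) + sizeof(int)*idx,
   which is r + 4 + 4*idx when int is 4 bytes. */
int *find_a_by_offset(struct rec *r, int idx)
{
    return (int *)((char *)r + offsetof(struct rec, a)
                   + sizeof(int) * (size_t)idx);
}
```

On IA32 `sizeof(struct rec)` is 20 as in the figure; on x86_64 the pointer grows to 8 bytes, so the size is typically 24 while the member offsets stay the same.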

  7. Union Allocation
     • Principles
       • Overlay union elements
       • Allocate according to largest element
       • Can only use one field at a time
     union U1 {
       char c;
       int i[2];
       double v;
     } *up;
     [Figure: union layout; c, i[0], and v all start at up+0, i[1] is at up+4, and the union ends at up+8]
     struct S1 {
       char c;
       int i[2];
       double v;
     } *sp;
     [Figure: struct layout (Windows alignment); c at sp+0, i[0] at sp+4, i[1] at sp+8, v at sp+16, end at sp+24]
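The overlay principle can be observed directly: every union member starts at the same address, while struct members get distinct offsets. A minimal sketch, with the hypothetical helper `union_members_overlap`:

```c
#include <stddef.h>

union U1 {
    char c;
    int i[2];
    double v;
};

struct S1 {
    char c;
    int i[2];
    double v;
};

/* Hypothetical helper: returns 1 if all members of the union
   begin at the same address (they always should). */
int union_members_overlap(union U1 *u)
{
    return (void *)&u->c == (void *)u->i
        && (void *)u->i == (void *)&u->v;
}
```

The exact struct offsets depend on the platform's alignment rules. The figure shows Windows alignment, where `v` sits at offset 16; GCC on IA32 Linux aligns `double` inside structs to 4 bytes by default, which would place `v` at offset 12 instead.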

  8. The following slides are taken with permission from
     Complete Powerpoint Lecture Notes for Computer Systems: A Programmer's Perspective (CS:APP)
     Randal E. Bryant and David R. O'Hallaron
     http://csapp.cs.cmu.edu/public/lectures.html
     Part 4 Programs and Data

  9. Today
     • x86_64
     • Advanced compiler use
     • Extended/inline asm
     • Vectorization
     • SIMD Intrinsics
     • Floating point
     • Buffer Overflows (Part 1)

  10. x86_64
     • 64-bit extension of IA32
       • aka EM64T (Intel)
     • Please read x86_64 supplemental material
       • http://csapp.cs.cmu.edu/public/docs/asm64-handout.pdf
     • Don’t confuse with IA64 “Itanium”

  11. x86_64 Highlights
     • Extends the 8 general purpose registers to 64 bits
       • And adds 8 more 64-bit registers
     • C Binding: sizeof(int) still 4!; sizeof(anything *), sizeof(long), sizeof(long int) now 8.
       • NB: sizeof(long long) is 8 both on IA32 and x86_64
     • Passing arguments in registers by default
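The C binding on this slide corresponds to what is commonly called the LP64 data model (x86_64 Linux), as opposed to ILP32 (IA32). A small sketch that classifies the current build; the helper names `is_lp64` and `is_ilp32` are hypothetical, and other platforms (for example 64-bit Windows) use yet another model.

```c
#include <stddef.h>

/* Hypothetical helper: 1 if int is 4 bytes while long and pointers
   are 8 bytes (the x86_64 binding described on the slide). */
int is_lp64(void)
{
    return sizeof(int) == 4 && sizeof(long) == 8 && sizeof(void *) == 8;
}

/* Hypothetical helper: 1 if int, long, and pointers are all 4 bytes
   (the IA32 binding). */
int is_ilp32(void)
{
    return sizeof(int) == 4 && sizeof(long) == 4 && sizeof(void *) == 4;
}
```

Compiling the same file with `gcc -m32` and `gcc -m64` on a Linux machine that supports both should flip which helper returns 1.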

  12. x86_64
     See http://www.x86-64.org/documentation.html

  13. Inlined Assembly
     • asm("…" : <output> : <input> : <clobber>)
     • Means to inject assembly into code and link it with the remainder in a controlled manner
     • Compiler doesn’t “know” what instructions do – thus must describe
       • a) state compiler must create upon entry: which values must be in which registers, etc.
       • b) state produced by inline instructions: which registers contain which values, etc. – also: any registers that may be clobbered

  14. Inlined Assembly Example
     bool imul32x32_64(uint32_t leftop, uint32_t rightop, uint64_t *presult)
     {
       uint64_t result;
       bool overflow;
       asm("imull %2" "\n\t"
           "seto %%bl" "\n\t"
           : "=A" (result), "=b" (overflow)   // output constraint
           : "r" (leftop), "a" (rightop)      // input constraint
       );
       *presult = result;
       return overflow;
     }
     Goal: exploit imull’s property to compute 32x32 bit product:
     imull %ecx means (%edx, %eax) := %ecx * %eax
     Magic instructions:
     • "r" (leftop) – pick any 32bit register and put leftop in it
     • "a" (rightop) – make sure %eax contains rightop
     • "%2" – substitute whichever register was picked for ‘leftop’
     • "=A" – result is in (%edx, %eax)
     • "=b" – overflow is in %ebx
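For comparison, the same computation can be written in portable C, which is useful as a reference when testing the inline-assembly version. This is a sketch rather than the slide's code: `imul32x32_64_ref` is a hypothetical name, and it relies on the usual two's-complement behavior when converting `uint32_t` to `int32_t` (implementation-defined in C, but what GCC does).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical portable reference: signed 32x32 -> 64 bit product.
   Returns true when the product does not fit in 32 bits, which is the
   condition under which one-operand imull sets the overflow flag. */
bool imul32x32_64_ref(uint32_t leftop, uint32_t rightop, uint64_t *presult)
{
    int64_t product = (int64_t)(int32_t)leftop * (int64_t)(int32_t)rightop;

    *presult = (uint64_t)product;
    return product < INT32_MIN || product > INT32_MAX;
}
```

Because `imull` is a signed multiply, the operands are interpreted as signed here even though the function signature uses `uint32_t`.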

  15. Inlined Assembly (2)
     Code generated for imul32x32_64 (C source as on slide 14):
     imul32x32_64:
       pushl %ebp
       movl %esp, %ebp
       subl $12, %esp
       movl %ebx, (%esp)
       movl %esi, 4(%esp)
       movl %edi, 8(%esp)
       movl 8(%ebp), %ecx
       movl 12(%ebp), %eax
     #APP
       imull %ecx
       seto %bl
     #NO_APP
       movl %eax, %esi
       movl 16(%ebp), %eax
       movl %esi, (%eax)
       movl %edx, 4(%eax)
       movzbl %bl, %eax
       movl (%esp), %ebx
       movl 4(%esp), %esi
       movl 8(%esp), %edi
       movl %ebp, %esp
       popl %ebp
       ret

  16. Floating Point on IA32
     • History:
       • First implemented in 8087 coprocessor
       • “stack based” – FPU has 8 registers that form a stack %st(0), %st(1), …
       • Known as ‘x87’ floating point
     • Weirdness: internal accuracy 80bit (rather than IEEE 754 64bit) – thus storing involves rounding
       • Results depend on how often values are moved out of the FPU registers into memory (which depends on compiler’s code generation strategy/optimization level) – not good!
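Whether intermediate results are kept in the wider x87 format can be queried from C through the standard C99 macro `FLT_EVAL_METHOD` in `<float.h>`. The sketch below only reports the setting; the helper name `fp_eval_description` is hypothetical.

```c
#include <float.h>

/* Hypothetical helper: describes how floating-point expressions are
   evaluated, based on the C99 macro FLT_EVAL_METHOD. */
const char *fp_eval_description(void)
{
    switch (FLT_EVAL_METHOD) {
    case 0:
        return "operations evaluated in the precision of their type";
    case 1:
        return "float and double operations evaluated as double";
    case 2:
        return "all operations evaluated as long double";
    default:
        return "indeterminable or implementation-defined";
    }
}
```

With GCC, an IA32 build that uses x87 arithmetic typically reports method 2 (the 80-bit behavior described above), while an SSE-based build typically reports method 0.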

  17. Floating Point Code Example
     • Compute Inner Product of Two Vectors
       • Single precision arithmetic
       • Common computation
     float ipf (float x[], float y[], int n)
     {
       int i;
       float result = 0.0;
       for (i = 0; i < n; i++) {
         result += x[i] * y[i];
       }
       return result;
     }
     Assembly:
       pushl %ebp              # setup
       movl %esp,%ebp
       pushl %ebx
       movl 8(%ebp),%ebx       # %ebx=&x
       movl 12(%ebp),%ecx      # %ecx=&y
       movl 16(%ebp),%edx      # %edx=n
       fldz                    # push +0.0
       xorl %eax,%eax          # i=0
       cmpl %edx,%eax          # if i>=n done
       jge .L3
     .L5:
       flds (%ebx,%eax,4)      # push x[i]
       fmuls (%ecx,%eax,4)     # st(0)*=y[i]
       faddp                   # st(1)+=st(0); pop
       incl %eax               # i++
       cmpl %edx,%eax          # if i<n repeat
       jl .L5
     .L3:
       movl -4(%ebp),%ebx      # finish
       movl %ebp, %esp
       popl %ebp
       ret                     # st(0) = result

  18. Floating Point: SSE(*)
     • Various extensions to x87 were introduced:
       • SSE, SSE2, SSE3, SSE4, SSE5
     • Use 16 128bit %xmm registers
       • Can be used as 16x8bit, 4x32bit, 2x64bit, etc. for both integer and floating point operations
     • Use the -mfpmath=sse -msse switches to enable (or -msse2, -msse3, -msse4)
     • All doubles are 64bits internally – gives reproducible results independent of load/stores
       • Aside: if 80bit is ok, can combine -mfpmath=sse,387 for 24 registers

  19. Floating Point SSE
     • Same code (ipf from slide 17) compiled with: -msse2 -mfpmath=sse
     ipf:
       pushl %ebp
       movl %esp, %ebp
       pushl %ebx
       subl $4, %esp
       movl 8(%ebp), %ebx
       movl 12(%ebp), %ecx
       movl 16(%ebp), %edx
       xorps %xmm1, %xmm1
       testl %edx, %edx
       jle .L4
       movl $0, %eax                 # i = 0
       xorps %xmm1, %xmm1            # result = 0.0
     .L5:
       movss (%ebx,%eax,4), %xmm0    # t = x[i]
       mulss (%ecx,%eax,4), %xmm0    # t *= y[i]
       addss %xmm0, %xmm1            # result += t
       addl $1, %eax                 # i = i+1
       cmpl %edx, %eax
       jne .L5
     .L4:
       movss %xmm1, -8(%ebp)
       flds -8(%ebp)                 # %st(0) = result
       addl $4, %esp
       popl %ebx
       popl %ebp
       ret

  20. Vectorization
     • SSE* instruction sets can operate on ‘vectors’
     • For instance, if 128bit register is treated as (d1, d0) and (e1, e0), can compute (d1+e1, d0+e0) using single instruction – executes in parallel
     • Also known as “SIMD”
       • Single instruction, multiple data

  21. Floating Point SSE - Vectorized
     • Trying to make compiler achieve the transformation from the first function to the second
     float ipf (float x[], float y[], int n)
     {
       int i;
       float result = 0.0;
       for (i = 0; i < n; i++) {
         result += x[i] * y[i];
       }
       return result;
     }
     Logical transformation, not actual code:
     float ipf_vector (float x[], float y[], int n)
     {
       int i;
       float p[4];
       float result = 0.0;
       for (i = 0; i < n; i+=4) {
         p[0] = x[i] * y[i];
         p[1] = x[i+1] * y[i+1];
         p[2] = x[i+2] * y[i+2];
         p[3] = x[i+3] * y[i+3];
         result += p[0]+p[1]+p[2]+p[3];
       }
       return result;
     }
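The logical transformation above only works as written when n is a multiple of 4. A runnable sketch of the same idea with a scalar cleanup loop for the leftover elements is shown below; `ipf_unrolled` is a hypothetical name, and this is plain C, not the code a vectorizing compiler would emit.

```c
/* Scalar inner product, as on the slide. */
float ipf(const float x[], const float y[], int n)
{
    int i;
    float result = 0.0f;

    for (i = 0; i < n; i++)
        result += x[i] * y[i];
    return result;
}

/* Hypothetical unrolled variant: four products per iteration,
   then a cleanup loop for the remaining n % 4 elements. */
float ipf_unrolled(const float x[], const float y[], int n)
{
    float p[4];
    float result = 0.0f;
    int i;

    for (i = 0; i + 3 < n; i += 4) {
        p[0] = x[i]     * y[i];
        p[1] = x[i + 1] * y[i + 1];
        p[2] = x[i + 2] * y[i + 2];
        p[3] = x[i + 3] * y[i + 3];
        result += p[0] + p[1] + p[2] + p[3];
    }
    for (; i < n; i++)               /* cleanup: at most 3 iterations */
        result += x[i] * y[i];
    return result;
}
```

Note that the unrolled version adds the products in a different order, so for general inputs the two functions may differ in the last bits; floating-point addition is not associative.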

  22. Example: GCC Vector Extension
     typedef float v4sf __attribute__ ((vector_size (16)));
     (magic attribute that tells gcc that v4sf is a type denoting vectors of 4 floats)
     float ipf (v4sf x[], v4sf y[], int n)
     {
       int i;
       float partialsum, result = 0.0;
       for (i = 0; i < n; i++) {
         v4sf p = x[i] * y[i];
         float * v = (float *)&p;   // treat vector as float *
         partialsum = v[0] + v[1] + v[2] + v[3];
         result += partialsum;
       }
       return result;
     }

  23. Example: GCC Vector Extensions
     Code generated for ipf (C source as on slide 22):
     ipf:
       pushl %ebp
       movl %esp, %ebp
       pushl %ebx
       subl $36, %esp
       movl 16(%ebp), %ebx
       movl 8(%ebp), %edx
       movl 12(%ebp), %eax
       movl $0, %ecx
       xorps %xmm1, %xmm1
     .L5:
       movaps (%eax), %xmm0
       mulps (%edx), %xmm0
       movaps %xmm0, -24(%ebp)
       movss -24(%ebp), %xmm0
       addss -20(%ebp), %xmm0
       addss -16(%ebp), %xmm0
       addss -12(%ebp), %xmm0
       addss %xmm0, %xmm1
       addl $1, %ecx
       addl $16, %edx
       addl $16, %eax
       cmpl %ebx, %ecx
       jne .L5
       movss %xmm1, -28(%ebp)
       flds -28(%ebp)
       addl $36, %esp
       popl %ebx
       popl %ebp
       ret
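To try the vector-extension version, note that `n` counts 4-float vectors rather than individual floats. The sketch below is a variant of the slide's function that uses GCC's vector subscripting (`p[0]` and so on, available in reasonably recent GCC and in Clang) instead of the pointer cast; it assumes a compiler that supports the `vector_size` attribute.

```c
typedef float v4sf __attribute__ ((vector_size (16)));

/* Variant of the slide's ipf: n is the number of v4sf vectors.
   Uses vector subscripting instead of casting &p to float *. */
float ipf(v4sf x[], v4sf y[], int n)
{
    int i;
    float result = 0.0f;

    for (i = 0; i < n; i++) {
        v4sf p = x[i] * y[i];            /* four multiplies at once */
        result += p[0] + p[1] + p[2] + p[3];
    }
    return result;
}
```

Arrays declared with type `v4sf` are 16-byte aligned by the compiler, which the `movaps` loads in the generated code rely on; passing a plain `float` array through a cast would not guarantee that alignment.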

  24. Comments
     • Assembly code on previous slide is slightly simplified (omits first i < n check in case n == 0)
     • Two problems with it
       • Problem 1: the partial result (the product vector ‘p’) is allocated on the stack
         • value is said to be “spilled” to the stack
       • Problem 2: does not use vector unit for computing sum

  25. SSE3: hadd_ps
     • Treats 128bit as 4 floats (“packed single”)
     • Inputs are 2x128bit (A3, A2, A1, A0) and (B3, B2, B1, B0)
     • Computes (B3 + B2, B1 + B0, A3 + A2, A1 + A0) – “horizontal” operation “hadd”
     • Apply twice to compute sum of all 4 elements in lowest element
     • Use “intrinsics” – look like function calls, but are instructions for the compiler to use certain instructions
       • Unlike ‘asm’, compiler knows their meaning: no need to specify input, output constraints, or what’s clobbered
       • Compiler performs register allocation
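Why two applications leave the full sum in the lowest element can be traced with a scalar model of the operation. This is a portable sketch of the semantics only, not the intrinsic itself; the `vec4` type and the functions `hadd_model` and `sum4_model` are hypothetical, with element 0 standing for the lowest lane (A0).

```c
/* Hypothetical stand-in for a 128-bit register holding 4 floats;
   e[0] is the lowest element. */
typedef struct {
    float e[4];
} vec4;

/* Scalar model of the horizontal add:
   result = (B3+B2, B1+B0, A3+A2, A1+A0), listed high to low. */
vec4 hadd_model(vec4 a, vec4 b)
{
    vec4 r;

    r.e[0] = a.e[0] + a.e[1];
    r.e[1] = a.e[2] + a.e[3];
    r.e[2] = b.e[0] + b.e[1];
    r.e[3] = b.e[2] + b.e[3];
    return r;
}

/* Applying the model twice with a zero vector sums all 4 elements. */
float sum4_model(vec4 p)
{
    vec4 zero = {{0.0f, 0.0f, 0.0f, 0.0f}};
    vec4 once = hadd_model(p, zero);      /* (0, 0, p3+p2, p1+p0) */
    vec4 twice = hadd_model(once, zero);  /* (0, 0, 0, p3+p2+p1+p0) */

    return twice.e[0];
}
```

After the first step the two pair sums sit in the two lowest elements, so the second step adds exactly those two.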

  26. GCC Vector Extensions + XMM Intrinsics
     #include <pmmintrin.h>
     typedef float v4sf __attribute__ ((vector_size (16)));
     float ipf (v4sf x[], v4sf y[], int n)
     {
       int i;
       float partialsum, result = 0.0;
       v4sf zero = _mm_setzero_ps();   // intrinsic, produces vector of 4 0.0f
       for (i = 0; i < n; i++) {
         v4sf p = x[i] * y[i];
         _mm_store_ss(&partialsum, _mm_hadd_ps(_mm_hadd_ps(p, zero), zero));
         result += partialsum;
       }
       return result;
     }

  27. Example: GCC Vector Extensions + XMM Intrinsics
     Code generated for ipf (C source as on slide 26):
     ipf:
       pushl %ebp
       movl %esp, %ebp
       pushl %ebx
       subl $4, %esp
       movl 16(%ebp), %ebx
       movl 8(%ebp), %edx
       movl 12(%ebp), %eax
       movl $0, %ecx
       xorps %xmm2, %xmm2
       xorps %xmm1, %xmm1
     .L5:
       movaps (%eax), %xmm0
       mulps (%edx), %xmm0
       haddps %xmm1, %xmm0
       haddps %xmm1, %xmm0
       addss %xmm0, %xmm2
       addl $1, %ecx
       addl $16, %edx
       addl $16, %eax
       cmpl %ebx, %ecx
       jne .L5
       movss %xmm2, -8(%ebp)
       flds -8(%ebp)
       addl $4, %esp
       popl %ebx
       popl %ebp
       ret
