EECS 583 – Class 22 Research Topic 4: Automatic SIMDization - Superword Level Parallelism

University of Michigan, December 10, 2012

Announcements
• Last class today!
  • No more reading
• Dec 12-18: Project presentations
  • Each group signs up for a 30-minute slot
  • See me after class if you have not signed up
• Course evaluations reminder
  • Please fill one out; it will only take 5 minutes
  • I do read them
  • Improve the experience for future 583 students

Notes on Project Demos
• Demo format
  • Each group gets 30 minutes
  • Strict deadlines enforced because many groups are back to back
  • Don’t be late!
  • Figure out your room number ahead of time (see schedule on my door)
  • Plan for 20 minutes of presentation (no more!) and 10 minutes of questions
  • Some slides are helpful; try to have all group members say something
  • Talk about what you did (basic idea, previous work), how you did it (approach + implementation), and results
  • Demo or real code examples are good
• Report
  • 5 pages, double-spaced, including figures: what you did and why, implementation, and results
  • Due either when you do your demo or Dec 18 at 6pm

SIMD Processors: Larrabee (now called Knights Corner) Block Diagram

Vector Unit Block Diagram

Processor Core Block Diagram

Larrabee vs Conventional GPUs
• Each Larrabee core is a complete Intel processor
  • Context switching & pre-emptive multi-tasking
  • Virtual memory and page swapping, even in texture logic
  • Fully coherent caches at all levels of the hierarchy
• Efficient inter-block communication
  • Ring bus for full inter-processor communication
  • Low latency high bandwidth L1 and L2 caches
  • Fast synchronization between cores and caches
• Larrabee: the programmability of IA with the parallelism of graphics processors

Exploiting Superword Level Parallelism with Multimedia Instruction Sets

Multimedia Extensions
• Additions to all major ISAs
• SIMD operations

Using Multimedia Extensions
• Library calls and inline assembly
  • Difficult to program
  • Not portable
• Different extensions to the same ISA
  • MMX and SSE
  • SSE vs. 3DNow!
• Need automatic compilation

Vector Compilation
• Pros:
  • Successful for vector computers
  • Large body of research
• Cons:
  • Involved transformations
  • Targets loop nests

Superword Level Parallelism (SLP)
• Small amount of parallelism
  • Typically 2 to 8-way
• Exists within basic blocks
• Uncovered with a simple analysis
• Independent isomorphic operations
  • New paradigm

1. Independent ALU Ops
    R = R + XR * 1.08327
    G = G + XG * 1.89234
    B = B + XB * 1.29835
Packed into one SIMD operation:
    [R G B] = [R G B] + [XR XG XB] * [1.08327 1.89234 1.29835]
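The three statements above can be sketched in C as follows. This is only an illustration: the `vec3f` type, `vec3f_madd`, and `rgb_scalar` are hypothetical names, and the "superword" is modeled as a plain array rather than a real SIMD register.

```c
#include <stddef.h>

/* A 3-lane "superword", modeled as a plain array (an assumption for
   illustration; real hardware would hold this in a SIMD register). */
typedef struct { float lane[3]; } vec3f;

/* One lane-wise multiply-add: acc[k] + x[k] * coeff[k] for every lane k.
   On a SIMD machine this loop would be a single packed operation. */
vec3f vec3f_madd(vec3f acc, vec3f x, vec3f coeff) {
    vec3f out;
    for (size_t k = 0; k < 3; ++k)
        out.lane[k] = acc.lane[k] + x.lane[k] * coeff.lane[k];
    return out;
}

/* Scalar reference: the three independent, isomorphic statements. */
void rgb_scalar(float *r, float *g, float *b, float xr, float xg, float xb) {
    *r = *r + xr * 1.08327f;
    *g = *g + xg * 1.89234f;
    *b = *b + xb * 1.29835f;
}
```

The statements qualify because none of them reads a value another one writes, and all three apply the same operations in the same order.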

2. Adjacent Memory References
    R = R + X[i+0]
    G = G + X[i+1]
    B = B + X[i+2]
Packed into one SIMD operation with one wide load:
    [R G B] = [R G B] + X[i:i+2]

3. Vectorizable Loops
Original loop:
    for (i=0; i<100; i+=1)
        A[i+0] = A[i+0] + B[i+0]
Unrolled by 4:
    for (i=0; i<100; i+=4)
        A[i+0] = A[i+0] + B[i+0]
        A[i+1] = A[i+1] + B[i+1]
        A[i+2] = A[i+2] + B[i+2]
        A[i+3] = A[i+3] + B[i+3]
Packed:
    for (i=0; i<100; i+=4)
        A[i:i+3] = A[i:i+3] + B[i:i+3]
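A minimal C sketch of the unrolling step for this loop. The function names are hypothetical, and the remainder loop is an addition for trip counts that are not a multiple of 4 (the slide's count of 100 does not need it).

```c
#include <stddef.h>

/* Original scalar loop: one element per iteration. */
void add_scalar(float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; ++i)
        a[i] = a[i] + b[i];
}

/* Unrolled by 4: the loop body now holds four independent, isomorphic
   statements on adjacent memory, which is the pattern SLP packs into
   one 4-wide SIMD add. */
void add_unrolled4(float *a, const float *b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a[i + 0] = a[i + 0] + b[i + 0];
        a[i + 1] = a[i + 1] + b[i + 1];
        a[i + 2] = a[i + 2] + b[i + 2];
        a[i + 3] = a[i + 3] + b[i + 3];
    }
    /* Remainder iterations when n is not a multiple of 4. */
    for (; i < n; ++i)
        a[i] = a[i] + b[i];
}
```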

4. Partially Vectorizable Loops
Original loop:
    for (i=0; i<16; i+=1)
        L = A[i+0] - B[i+0]
        D = D + abs(L)
Unrolled by 2:
    for (i=0; i<16; i+=2)
        L = A[i+0] - B[i+0]
        D = D + abs(L)
        L = A[i+1] - B[i+1]
        D = D + abs(L)
Subtractions packed, accumulation left scalar:
    for (i=0; i<16; i+=2)
        [L0 L1] = A[i:i+1] - B[i:i+1]
        D = D + abs(L0)
        D = D + abs(L1)
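The same loop as a C sketch, with hypothetical function names. The two subtractions in the unrolled body are independent and read adjacent memory, so they can be packed; the two updates of `d` form a dependence chain and stay scalar.

```c
#include <stddef.h>
#include <stdlib.h>

/* Original loop: sum of absolute differences, one element at a time. */
int sad_scalar(const int *a, const int *b, size_t n) {
    int d = 0;
    for (size_t i = 0; i < n; ++i) {
        int l = a[i] - b[i];
        d = d + abs(l);
    }
    return d;
}

/* Unrolled by 2 with the temporary renamed to l0/l1. */
int sad_unrolled2(const int *a, const int *b, size_t n) {
    int d = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        int l0 = a[i + 0] - b[i + 0];  /* packable with the next line */
        int l1 = a[i + 1] - b[i + 1];
        d = d + abs(l0);               /* sequential: each reads d */
        d = d + abs(l1);
    }
    for (; i < n; ++i)                 /* remainder when n is odd */
        d = d + abs(a[i] - b[i]);
    return d;
}
```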

Exploiting SLP with SIMD Execution
• Benefit:
  • Multiple ALU ops → One SIMD op
  • Multiple ld/st ops → One wide mem op
• Cost:
  • Packing and unpacking
  • Reshuffling within a register

Packing/Unpacking Costs
    A = f()
    B = g()
    C = A + 2
    D = B + 3
    E = C / 5
    F = D * 7
The two additions pack into one SIMD operation:
    [C D] = [A B] + [2 3]
• Packing source operands: A and B are produced separately, so they must be packed into [A B] first
• Unpacking destination operands: E and F use C and D in different operations, so [C D] must be unpacked
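A C sketch of where the packing and unpacking work lands in this example. The `vec2i` type and the helper names are hypothetical; on real hardware each pack or unpack would cost extra move or shuffle instructions, which is the overhead the slide refers to.

```c
/* A 2-lane integer "superword" (an assumption for illustration). */
typedef struct { int lane[2]; } vec2i;

/* Packing: gather two scalars into one superword. */
static vec2i pack2(int x, int y) {
    vec2i v = {{x, y}};
    return v;
}

/* One lane-wise add, standing in for a single SIMD instruction. */
static vec2i add2(vec2i p, vec2i q) {
    vec2i v = {{p.lane[0] + q.lane[0], p.lane[1] + q.lane[1]}};
    return v;
}

/* a and b stand in for the results of f() and g(). */
void pack_add_unpack(int a, int b, int *e, int *f) {
    vec2i ab = pack2(a, b);            /* cost: pack source operands      */
    vec2i cd = add2(ab, pack2(2, 3));  /* benefit: one add instead of two */
    int c = cd.lane[0];                /* cost: unpack destination,       */
    int d = cd.lane[1];                /* because / and * are not isomorphic */
    *e = c / 5;
    *f = d * 7;
}
```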

Optimizing Program Performance
• To achieve the best speedup:
  • Maximize parallelization
  • Minimize packing/unpacking
• Many packing possibilities
  • Worst case: n ops → n! configurations
  • Different cost/benefit for each choice

Observation 1: Packing Costs can be Amortized
• Use packed result operands
    A = B + C
    D = E + F
    G = A - H
    I = D - J
• Share packed source operands
    A = B + C
    D = E + F
    G = B + H
    I = E + J

Observation 2: Adjacent Memory is Key
• Large potential performance gains
  • Eliminate ld/str instructions
  • Reduce memory bandwidth
• Few packing possibilities
  • Only one ordering exploits pre-packing
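One way to picture the adjacency test is the C sketch below. The `mem_ref` representation (a base array id plus an element offset) is an assumption made for illustration, not something the slides specify.

```c
#include <stdbool.h>

/* Hypothetical summary of a memory reference such as X[i+1]:
   which array it touches and its constant element offset. */
typedef struct {
    int  base_id;  /* identifies the array, e.g. X */
    long offset;   /* constant offset from the common index, in elements */
} mem_ref;

/* True when `second` is the element immediately after `first`.
   The test is ordered on purpose: only (X[i+0], X[i+1]) matches the
   layout already in memory, so the reverse order is rejected. */
bool is_adjacent(mem_ref first, mem_ref second) {
    return first.base_id == second.base_id &&
           second.offset == first.offset + 1;
}
```

The asymmetry reflects the last bullet: data in memory is already packed in one order, so only that ordering lets a single wide load replace the scalar loads without reshuffling.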

SLP Extraction Algorithm
Example basic block:
    A = X[i+0]
    C = E * 3
    B = X[i+1]
    H = C - A
    D = F * 5
    J = D - B
• Identify adjacent memory references
    [A B] = X[i:i+1]
• Follow def-use chains
    [H J] = [C D] - [A B]
• Follow use-def chains
    [C D] = [E F] * [3 5]
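The three steps can be sketched in C on a toy representation of the basic block. Everything here is an assumption made for illustration: the `stmt` and `pack` types, single-character variable names, 2-wide packs only, and no cost model or scheduling. It is not the algorithm as published, only the seed-and-extend idea.

```c
#include <stdbool.h>

typedef enum { OP_LOAD, OP_MUL, OP_SUB } opcode;

/* Toy statement "dst = src[0] op src[1]". For OP_LOAD, src[0] names the
   array and offset is the element offset. A src of 0 means a constant. */
typedef struct { opcode op; char dst; char src[2]; int offset; } stmt;
typedef struct { int left, right; } pack;  /* indices into the block */

static bool uses(const stmt *s, char var) {
    return s->op != OP_LOAD && (s->src[0] == var || s->src[1] == var);
}

/* Isomorphic, independent, and (for loads) adjacent in memory order. */
static bool can_pack(const stmt *a, const stmt *b) {
    if (a == b || a->op != b->op) return false;
    if (uses(a, b->dst) || uses(b, a->dst)) return false;
    if (a->op == OP_LOAD)
        return a->src[0] == b->src[0] && b->offset == a->offset + 1;
    return true;
}

static bool in_pack(const pack *p, int count, int idx) {
    for (int k = 0; k < count; ++k)
        if (p[k].left == idx || p[k].right == idx) return true;
    return false;
}

static int find_def(const stmt *s, int n, char var) {
    for (int k = 0; var != 0 && k < n; ++k)
        if (s[k].dst == var) return k;
    return -1;  /* constant, or defined outside the block */
}

/* Append (l, r) if both are valid, unpacked, and packable. */
static int try_add(const stmt *s, pack *out, int count, int max, int l, int r) {
    if (count < max && l >= 0 && r >= 0 && !in_pack(out, count, l) &&
        !in_pack(out, count, r) && can_pack(&s[l], &s[r])) {
        out[count].left = l;
        out[count].right = r;
        return count + 1;
    }
    return count;
}

int extract_packs(const stmt *s, int n, pack *out, int max) {
    int count = 0;
    /* Step 1: seed with adjacent memory references. */
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (s[i].op == OP_LOAD)
                count = try_add(s, out, count, max, i, j);
    /* Steps 2 and 3: extend every pack found so far. */
    for (int p = 0; p < count; ++p) {
        const stmt *l = &s[out[p].left], *r = &s[out[p].right];
        /* Def-use: statements using the two results in the same position. */
        for (int u = 0; u < n; ++u)
            for (int v = 0; v < n; ++v)
                for (int k = 0; k < 2; ++k)
                    if (s[u].op != OP_LOAD && s[u].src[k] == l->dst &&
                        s[v].src[k] == r->dst)
                        count = try_add(s, out, count, max, u, v);
        /* Use-def: statements defining the operands of the pack. */
        if (l->op != OP_LOAD)
            for (int k = 0; k < 2; ++k)
                count = try_add(s, out, count, max,
                                find_def(s, n, l->src[k]),
                                find_def(s, n, r->src[k]));
    }
    return count;
}
```

Encoding the six statements of the example block in source order (indices 0 to 5), this sketch groups the two loads first, then the two subtractions, then the two multiplications, matching the order of the three steps.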

SLP Availability

SLP vs. Vector Parallelism

Conclusions
• Multimedia architectures abundant
  • Need automatic compilation
• SLP is the right paradigm
  • 20% non-vectorizable in SPEC95fp
• SLP extraction successful
  • Simple, local analysis
  • Provides speedups from 1.24 to 6.70
• Found SLP in general-purpose codes
