The end of programming PetaQCD collaboration
End of Parallelism
• The era when we were in control of what gets executed where and how is pretty much over
• No modern compiler can generate efficient code for modern architectures
• Because they are SPMD machines of various widths
• So companies have to do something else
Warps
• To abstract and hide the execution of programs, NVIDIA came up with the Thread Warp: a bunch of threads executed at the same time on some part of the hardware
ISPC
• Intel noticed that people don’t quite like CUDA
• Yet figured out that no compiler can work out on its own where and how to parallelize
• So it abandoned the idea of writing its own full compiler
• And conceived a syntax which would appear user-friendly
• Wrote an LLVM (Low Level Virtual Machine) frontend and a few backends
• Just like NVIDIA did.
Gangs
• So what does the brave new code look like?

export void simple(uniform float vin[], uniform float vout[],
                   uniform int count) {
    foreach (index = 0 ... count) {
        float v = vin[index];
        if (v < 3.)
            v = v * v;
        else
            v = sqrt(v);
        vout[index] = v;
    }
}

float vin[16], vout[16];
for (int i = 0; i < 16; ++i)
    vin[i] = i;
simple(vin, vout, 16);
Gangs
• So instead of Thread Warps we have Gangs
• A bunch of program « instances » mapped onto SIMD lanes
• About 2-4x the width of the SIMD unit
• Variables can be shared (uniform) or unique (varying) across a gang
• Atomic operations should be used inside a gang
• No threads, no context switching, just an army of marching ants
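To see what the gang model actually does with the divergent branch in simple() above, here is a scalar emulation sketch: both sides of the if execute across all program instances, and a per-lane mask selects each lane's result. The gang width of 4 and the function name are illustrative, not ISPC's.

```cpp
#include <cmath>

constexpr int GANG = 4;  // hypothetical gang width (e.g. 4-wide SSE)

// Emulates how a gang runs the body of simple(): every instance
// evaluates the condition, both branches are computed for all lanes,
// and the mask blends the results per lane.
void simple_emulated(const float vin[GANG], float vout[GANG]) {
    bool mask[GANG];
    for (int lane = 0; lane < GANG; ++lane)      // per-instance condition
        mask[lane] = vin[lane] < 3.0f;
    for (int lane = 0; lane < GANG; ++lane) {
        float sq = vin[lane] * vin[lane];        // "then" side, all lanes
        float rt = std::sqrt(vin[lane]);         // "else" side, all lanes
        vout[lane] = mask[lane] ? sq : rt;       // blend by mask
    }
}
```

No lane ever skips an instruction; divergence only changes which result each lane keeps, which is exactly why uniform control flow is cheaper.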
So what?
• Intel’s ISPC is a poor man’s CUDA
• Flow control is much simpler (things go uniformly)
• Obviously at the cost of performance
• And it is designed to deceive users into thinking it is simpler
• While all it does is map code to instances
• Which then get mapped to low-level functions
• So in principle ISPC can be used for any architecture
So we care?
• Yes we do.
• First, Intel confirmed that classic compilers are dead
• Second, it has shown us what a proper backend for AVX and SSE should look like (open-source code)

define <16 x double> @__gather_base_offsets64_double(i8 * %ptr, i32 %scale,
        <16 x i64> %offsets, <16 x i32> %mask32)
        nounwind readonly alwaysinline {
    …
    %v1 = call <4 x double> @llvm.x86.avx2.gather.q.pd.256(
        <4 x double> undef, i8 * %ptr, <4 x i64> %offsets_1,
        <4 x double> %vecmask_1, i8 %scale8)
    assemble_4s(double, v, v1, v2, v3, v4)
    ret <16 x double> %v
}
So what?
• Third, we have a glimpse at how Intel optimises. It doesn’t.
• It just parses the commands via AST (abstract syntax tree, a basic notion of LLVM) walking

ASTNode *
WalkAST(ASTNode *node, ASTPreCallBackFunc preFunc,
        ASTPostCallBackFunc postFunc, void *data) {
    if (node == NULL)
        return node;

    // Call the callback function
    if (preFunc != NULL) {
        if (preFunc(node, data) == false)
            // The function asked us to not continue recursively, so stop.
            return node;
    }
    … // (recursion into child nodes and the postFunc call elided)
}
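The pattern is an ordinary pre/post-order tree walk with callbacks. A toy analogue (not ISPC's actual node types; Node, Walk and CountPre are made-up names for illustration):

```cpp
#include <cstddef>

// Minimal binary-tree stand-in for ISPC's ASTNode hierarchy.
struct Node {
    Node *left = nullptr;
    Node *right = nullptr;
};

// Callback returns false to stop recursing below the current node.
using Callback = bool (*)(Node *, void *);

Node *Walk(Node *node, Callback pre, Callback post, void *data) {
    if (node == nullptr)
        return node;
    if (pre != nullptr && pre(node, data) == false)
        return node;                     // pre-callback vetoed the descent
    Walk(node->left, pre, post, data);   // recurse into children
    Walk(node->right, pre, post, data);
    if (post != nullptr)
        post(node, data);                // post-order visit on the way back up
    return node;
}

// Example pre-callback: count the nodes it visits.
static bool CountPre(Node *, void *data) {
    ++*static_cast<int *>(data);
    return true;
}
```

Usage follows the same shape as the ISPC call: Walk(root, CountPre, nullptr, &count).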
And...
• Generates IR
• And maps it to SIMD backends

void
AST::GenerateIR() {
    for (unsigned int i = 0; i < functions.size(); ++i)
        functions[i]->GenerateIR();
}
So we have …
• An unstable, quickly developing utility
• Which may or may not be actually useful beyond AVX (no MIC support yet)
• Comparably obscure to CUDA (but with source!)
• And not GPL.
Short term
• ISPC may be a valuable tool to investigate the efficiency of LLVM-based compilers for the Intel architectures
• One may try to generate gang-compatible code
• Because it obviously will be superior to icc
• Which fails to vectorize properly
Medium term
• And we finally have access to the L1 prefetcher!
• NT means data will be discarded after use

uniform int32 array[...];
for (uniform int i = 0; i < count; ++i) {
    // do computation with array[i]
    prefetch_l1(&array[i + 32]);
}

void prefetch_{l1,l2,l3,nt}(void * uniform ptr)
void prefetch_{l1,l2,l3,nt}(void * varying ptr)
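The same hint is available from plain C++ via the GCC/Clang builtin __builtin_prefetch; a sketch of the loop above in that form, keeping the slide's illustrative 32-element prefetch distance (the function name is made up):

```cpp
#include <cstdint>

// Walk the array while asking the prefetcher to pull array[i+32]
// toward L1 (locality hint 3 = keep in all cache levels, rw = 0 = read).
// __builtin_prefetch is a GCC/Clang extension, not standard C++.
int64_t sum_with_prefetch(const int32_t *array, int count) {
    int64_t sum = 0;
    for (int i = 0; i < count; ++i) {
        if (i + 32 < count)
            __builtin_prefetch(&array[i + 32], /*rw=*/0, /*locality=*/3);
        sum += array[i];
    }
    return sum;
}
```

For an NT-style hint (discard after use), locality 0 is the closest equivalent.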
Long term
• It is likely that moving from QIRAL IR to ISPC to Intel IR we are losing information
• And certainly adding overhead
• So having our own WalkAST mapping to Intel LLVM backends should be way better
• And we can also modify them for IBM SIMD (remember, IBM also wants the same model, but with different images instead of instances within one image)
• And maybe CUDA just as well
Conclusions
• All modern architectures are damn GPUs
• Divide variables into unique and uniform
• Give up control of what is executed when and how, to a varying level
• Have either implicit or explicit barriers/sync
• Have gangs/thread warps as a sort of uniform threads
• And LLVM seems to be the choice to deal with them.