
The end of programming


Presentation Transcript


  1. The end of programming • PetaQCD collaboration

  2. End of Parallelism • The era when we were in control of what gets executed where and how is pretty much over • No modern compiler can generate efficient code for modern architectures • Because they are all SPMD machines of various widths • So companies have to do something else

  3. Warps • To abstract and hide how programs actually execute, NVIDIA came up with the thread warp: a bunch of threads executed at the same time on some part of the hardware

  4. ISPC • Intel noticed that people don't quite like CUDA • Yet realized that no compiler can figure out on its own where and how to parallelize • So it abandoned the idea of having its classic compiler do the job • And conceived a syntax that would appear user-friendly • Then wrote an LLVM (Low Level Virtual Machine) frontend and a few backends • Just like NVIDIA did.

  5. Gangs • So what does brave new code look like?

     // ISPC side: an exported kernel over the whole array
     export void simple(uniform float vin[], uniform float vout[],
                        uniform int count) {
         foreach (index = 0 ... count) {
             float v = vin[index];
             if (v < 3.)
                 v = v * v;
             else
                 v = sqrt(v);
             vout[index] = v;
         }
     }

     // C/C++ side: called like an ordinary function
     float vin[16], vout[16];
     for (int i = 0; i < 16; ++i)
         vin[i] = i;
     simple(vin, vout, 16);

  6. Gangs • So instead of thread warps we have gangs • A bunch of program "instances" mapped onto SIMD lanes • About 2-4x the width of the SIMD unit • Variables can be shared (uniform) or unique (varying) across the gang, as the sketch below shows • Atomic operations should be used for data shared inside a gang • No threads, no context switching, just an army of marching ants
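  To make the uniform/varying split concrete, a minimal ISPC sketch (not from the slides; the function name is illustrative):

     // a is uniform: one copy shared by the whole gang.
     // xi is varying: every program instance holds its own value, so one
     // foreach iteration processes a full SIMD vector's worth of elements.
     export void saxpy(uniform float a, uniform float x[],
                       uniform float y[], uniform int n) {
         foreach (i = 0 ... n) {
             float xi = x[i];            // varying load: one element per lane
             y[i] = a * xi + y[i];
         }
     }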

  7. So what? • Intel's ISPC is a poor man's CUDA • Flow control is much simpler (things proceed uniformly) • Obviously at the cost of performance • And it is designed to deceive users into thinking it is simpler • While all it does is map code onto program instances (see the sketch below) • Which then get mapped to low-level functions • So in principle ISPC can be used for any architecture
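  ISPC in fact exposes that instance mapping directly through the built-in programIndex and programCount values; a foreach loop is essentially sugar over this explicit form (a hedged sketch, with an illustrative function name):

     // Each program instance sees its own programIndex (0 .. programCount-1);
     // the uniform loop steps by the gang width, so each outer iteration
     // executes one SIMD-wide chunk.
     export void scale(uniform float a[], uniform int n) {
         for (uniform int base = 0; base < n; base += programCount) {
             int i = base + programIndex;   // varying: one index per lane
             if (i < n)                     // mask off the ragged tail
                 a[i] *= 2.;
         }
     }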

  8. So we care? • Yes we do. • First, Intel confirmed that classic compilers are dead • Second, it has shown us what a proper backend for AVX and SSE should look like (the code is open source)

     define <16 x double> @__gather_base_offsets64_double(i8 * %ptr, i32 %scale,
                                                          <16 x i64> %offsets,
                                                          <16 x i32> %mask32)
             nounwind readonly alwaysinline {
         …
         %v1 = call <4 x double> @llvm.x86.avx2.gather.q.pd.256(
                 <4 x double> undef, i8 * %ptr, <4 x i64> %offsets_1,
                 <4 x double> %vecmask_1, i8 %scale8)
         assemble_4s(double, v, v1, v2, v3, v4)
         ret <16 x double> %v
     }

  9. So what? • Third, we have a glimpse at how Intel optimises. It doesn't. • It just walks the parsed commands as an AST (abstract syntax tree, the basic notion of a compiler frontend)

     ASTNode *
     WalkAST(ASTNode *node, ASTPreCallBackFunc preFunc,
             ASTPostCallBackFunc postFunc, void *data) {
         if (node == NULL)
             return node;
         // Call the pre-visit callback function
         if (preFunc != NULL) {
             if (preFunc(node, data) == false)
                 // The function asked us to not continue recursively, so stop.
                 return node;
         }
         // ... (recursion into the node's children and the postFunc call
         // are elided in the slide)
     }

  10. And... • Generates IR • And maps it to SIMD backends

     void AST::GenerateIR() {
         for (unsigned int i = 0; i < functions.size(); ++i)
             functions[i]->GenerateIR();
     }

  11. So we have … • An unstable, quickly developing utility • Which may or may not be actually useful beyond AVX (no MIC support yet) • Comparably obscure to CUDA (but with source!) • And not GPL.

  12. Short term • ISPC may be a valuable tool for investigating the efficiency of the LLVM-based compiler for Intel architectures • One may try to generate gang-compatible code (a sketch follows below) • Because it will obviously be superior to icc • Which fails to vectorize properly
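  As one hedged example of gang-compatible code: a reduction with data-dependent control flow, the kind of loop an auto-vectorizer tends to give up on but ISPC handles with masks (the function name is illustrative; reduce_add() is from the ISPC standard library):

     // Sum only the positive elements. Each instance accumulates its own
     // varying partial sum under the mask; reduce_add() then collapses
     // the gang's partials into a single uniform value.
     export uniform float positive_sum(uniform float x[], uniform int n) {
         float partial = 0.;
         foreach (i = 0 ... n) {
             if (x[i] > 0.)
                 partial += x[i];
         }
         return reduce_add(partial);
     }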

  13. Medium term • And we finally have access to the L1 prefetcher! • NT means the data will be discarded after use

     uniform int32 array[...];
     for (uniform int i = 0; i < count; ++i) {
         // do computation with array[i]
         prefetch_l1(&array[i + 32]);
     }

     // the available variants:
     void prefetch_{l1,l2,l3,nt}(void * uniform ptr)
     void prefetch_{l1,l2,l3,nt}(void * varying ptr)

  14. Long term • It is likely that moving from Qiral IR to ISPC to Intel IR we are losing information • And certainly adding overhead • So having our own WalkAST mapping onto the Intel LLVM backends should be way better • And we can also modify them for IBM SIMD (remember, IBM also wants the same model, but with different images instead of instances within one image) • And maybe CUDA just as well

  15. Conclusions • All modern architectures are damn GPUs • Divide variables into unique and uniform • Give up control of what is executed when and how, to a varying degree • Have either implicit or explicit barriers/sync • Have gangs/thread warps as a sort of uniform threads • And LLVM seems to be the choice for dealing with them.
