Enhancing Auto-Vectorization in C++ Code for Efficient Computation

Auto-vectorization1/
• ::Mask cannot be translated, hence some source code modifications in the declarations/definitions…:
    template<TranslationIdType tid, RotationIdType rid, class TubeType=TubeTraits::HollowTubeWithPhi, class T=double>
    …
    template<typename VectorType = Vc::Vector<T> >
    inline __attribute__((always_inline))
    typename VectorType::Mask determineRHit( VectorType const & /*x-vec*/, VectorType const & /*y-vec*/, …
    inline __attribute__((always_inline))
    int determineRHitav( T const & /*x-vec*/, T const & /*y-vec*/, T const & /*z-vec*/,

Auto-vectorization2/
• Source code is still too closely tied to Vc
• Ex.: the PointIsInPhiSector method:
    void PointIsInPhiSector(T const & phi1normalx, T const & phi1normaly,
                            T const & phi2normalx, T const & phi2normaly,
                            Vc::Vector<T> const & xcoord, Vc::Vector<T> const & ycoord,
                            typename Vc::Vector<T>::Mask & isinphi)
    {
      // method based on calculating the scalar product of position vectors with the normals of the (empty) phi sector
      // avoids taking the atan2
      // this method could be template specialized in case DeltaPhi = 180°
      Vc::Vector<T> scalarproduct1 = phi1normalx*xcoord + phi1normaly*ycoord;
      Vc::Vector<T> scalarproduct2 = phi2normalx*xcoord + phi2normaly*ycoord;
      isinphi = (scalarproduct1 > Vc::Zero && scalarproduct2 > Vc::Zero);
    }
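One way to loosen the tie to Vc is a scalar variant of the same test, called from a plain loop the compiler can try to vectorize. The sketch below is only an illustration: the names PointIsInPhiSectorScalar and PointIsInPhiSectorLoop, the int mask convention (1 = inside, 0 = outside) and the array-based driver are assumptions, not code from the slides.

```cpp
#include <cstddef>

// Hypothetical scalar counterpart of PointIsInPhiSector.
// The Vc mask is replaced by an int: 1 = inside the phi sector, 0 = outside.
template <typename T>
inline int PointIsInPhiSectorScalar(T phi1normalx, T phi1normaly,
                                    T phi2normalx, T phi2normaly,
                                    T xcoord, T ycoord)
{
    // Same scalar products with the sector normals as in the Vc version.
    const T scalarproduct1 = phi1normalx * xcoord + phi1normaly * ycoord;
    const T scalarproduct2 = phi2normalx * xcoord + phi2normaly * ycoord;
    // Bitwise & on 0/1 values avoids the short-circuit branch of &&.
    return static_cast<int>(scalarproduct1 > T(0)) &
           static_cast<int>(scalarproduct2 > T(0));
}

// Hypothetical driver: one point per iteration, no loop-carried dependency.
// __restrict__ is a GCC/Clang/ICC extension that promises non-aliasing arrays.
template <typename T>
void PointIsInPhiSectorLoop(T phi1normalx, T phi1normaly,
                            T phi2normalx, T phi2normaly,
                            const T* __restrict__ xcoord,
                            const T* __restrict__ ycoord,
                            int* __restrict__ isinphi, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        isinphi[i] = PointIsInPhiSectorScalar(phi1normalx, phi1normaly,
                                              phi2normalx, phi2normaly,
                                              xcoord[i], ycoord[i]);
    }
}
```

Whether a given compiler actually vectorizes this loop still has to be checked in its vectorization report or in the generated asm.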

Auto-vectorization3/
• Vc master branch + VecGeom prototype v1 compiled with gcc-4.8.1 + CHEP13Benchmark
• For a vector length of 1024 and 1000 repetitions, the best vecnav.DistToNextBoundary() timing is ~94ms
• Machine is an E5-2697 v2 @ 2.70GHz
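A "best timing" figure of this kind is typically the minimum over several trials of a fixed number of repetitions. The sketch below shows that measurement pattern only; BestTimeMs and ScaleKernel are hypothetical names, and the stand-in kernel has nothing to do with vecnav.DistToNextBoundary() or the CHEP13Benchmark code.

```cpp
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in kernel working on a fixed-length vector of doubles.
inline void ScaleKernel(const std::vector<double>& in, std::vector<double>& out)
{
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = 2.0 * in[i] + 1.0;
    }
}

// Runs `kernel` `repetitions` times per trial and returns the smallest
// per-trial wall-clock time in milliseconds over `trials` trials.
template <typename Kernel>
double BestTimeMs(Kernel&& kernel, int repetitions, int trials)
{
    using Clock = std::chrono::steady_clock;
    double best = std::numeric_limits<double>::infinity();
    for (int t = 0; t < trials; ++t) {
        const auto start = Clock::now();
        for (int r = 0; r < repetitions; ++r) {
            kernel();
        }
        const auto stop = Clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        if (ms < best) {
            best = ms;
        }
    }
    return best;
}
```

A call such as BestTimeMs([&]{ ScaleKernel(in, out); }, 1000, 5) with 1024-element vectors mirrors the slide's configuration. An optimizer may hoist or drop repeated identical work in a toy kernel like this one, so the results should be consumed after timing.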

Auto-vectorization4/
• Intel VTune gives the following profile (sampling overhead pushed the 94ms to ~130ms)
• Decision to "port" the PlacedUSolidsTube<tid,rid,TubeType,ValueType>::DistanceToIn method (hotspot #2) to auto-vectorization. Baseline timing: 30ms.

Auto-vectorization5/ • Vc generates very nice asm…

Auto-vectorization6/
• Could not instantiate a PlacedUSolidsTube<tid,rid,TubeType,ValueType>::DistanceToIn method for template parameter = double
• The choice was to re-implement another overloaded method for that type:

Auto-vectorization7/
• Not a small kernel, but the code is quite easily portable to auto-vectorization. Each Vc construct is followed by its scalar counterpart:
• Vc: MaskType inz_m = safez > Utils::fgToleranceVc;
• Vc: done_m = !inz_m && ( z*dirz >= Vc::Zero ); // particle outside the z-range and moving away
• …
• Scalar: int inz_m = (safez > Utils::fgTolerance) ? 1 : 0;
• Scalar: int done_m = (!inz_m && (z*dirz >= 0.0)) ? 1 : 0; // particle outside the z-range and moving away
• Vc: VectorType distanceRmax( Utils::kInfinityVc );
• Vc: distanceRmax( canhitrmax ) = (-b - Vc::sqrt( discriminant ))*inverse2a;
• Scalar: ValueType distanceRmax = (canhitrmax) ? ((-b - std::sqrt(discriminant))*inverse2a) : Utils::kInfinity;
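The scalar forms above (an int in place of a mask, a ternary in place of a masked assignment) can be put in a self-contained loop as follows. This is a minimal sketch under stated assumptions: the namespace sketch, the constants kTolerance and kInfinity (stand-ins for Utils::fgTolerance and Utils::kInfinity, with an arbitrary tolerance value), and the functions IsDone, DistanceRmax and DistanceRmaxLoop are all hypothetical, not the actual DistanceToIn code.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>

namespace sketch {

// Hypothetical stand-ins for Utils::fgTolerance and Utils::kInfinity.
constexpr double kTolerance = 1e-9;
constexpr double kInfinity = std::numeric_limits<double>::infinity();

// Mask logic expressed with ints (1 = true, 0 = false).
// Returns 1 when the particle is outside the z-range and moving away.
inline int IsDone(double safez, double z, double dirz)
{
    const int inz_m = (safez > kTolerance) ? 1 : 0;
    const int done_m = (!inz_m && (z * dirz >= 0.0)) ? 1 : 0;
    return done_m;
}

// Masked assignment expressed as a select: the root is taken where
// canhitrmax is set, otherwise the distance stays at infinity.
inline double DistanceRmax(int canhitrmax, double b, double discriminant,
                           double inverse2a)
{
    return canhitrmax ? ((-b - std::sqrt(discriminant)) * inverse2a)
                      : kInfinity;
}

// Plain loop over arrays; a vectorizing compiler may turn the ternary
// into a blend, computing the sqrt for every lane and discarding it
// where canhitrmax is 0.
inline void DistanceRmaxLoop(const int* canhitrmax, const double* b,
                             const double* discriminant,
                             const double* inverse2a, double* distance,
                             std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        distance[i] = DistanceRmax(canhitrmax[i], b[i], discriminant[i],
                                   inverse2a[i]);
    }
}

} // namespace sketch
```

With g++, vectorizing the std::sqrt call may additionally require relaxing errno handling (for example -fno-math-errno); that is worth confirming in the vectorization report rather than assuming.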

Auto-vectorization8/
• Auto-vectorization timings are not too catastrophic: ~40ms vs 30ms for Vc (a slowdown of ~33%)
• No arithmetic perturbation visible!?
• Same behavior for x10 repetitions
• Asm inspection shows only partial vectorization with g++; need to dig in this direction, as g++ does a pretty good job of handling recursive template inlining out of the box…

Auto-vectorization9/
• With the Intel compiler, vectorization seems complete (see next slide for asm inspection), although for the rest of the code, which embeds Vc templates, our compiler seems to be lost!

Auto-vectorization10/

Auto-vectorization11/
• I would recommend always keeping the auto-vectorization branch in mind when writing templates.
• Auto-vectorization can be a pain, depending on how the compiler handles recursive template inlining. Ours seems to do a very poor job; I will initiate an internal review of this code with compiler engineering.
• Need also to review the reasons for partial vectorization with g++
• Need to test other, more complicated kernels
• Try this on Haswell (int will be promoted to 256-bit vector length)
• Try this on MIC
• …
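One possible way to keep an auto-vectorization branch in mind when writing templates is to route mask and blend operations through a small backend type, so the same kernel text serves both a Vc instantiation and a scalar one. The sketch below shows only the scalar side. ScalarBackend, Greater, Blend, ClampBelow and ClampBelowLoop are hypothetical names introduced for illustration; a Vc backend would presumably map Value to Vc::Vector<T> and Mask to Vc::Vector<T>::Mask, which is not shown here.

```cpp
#include <cstddef>

// Hypothetical scalar backend: plain value type, int as mask type.
template <typename T>
struct ScalarBackend {
    using Value = T;
    using Mask = int;

    // Comparison returning a 0/1 mask.
    static Mask Greater(Value a, Value b) { return (a > b) ? 1 : 0; }

    // Select: a where the mask is set, b elsewhere.
    static Value Blend(Mask m, Value a, Value b) { return m ? a : b; }
};

// Kernel written once against the backend interface only.
// Returns x where x > floor, and floor otherwise.
template <typename Backend>
inline typename Backend::Value ClampBelow(typename Backend::Value x,
                                          typename Backend::Value floor)
{
    const typename Backend::Mask above = Backend::Greater(x, floor);
    return Backend::Blend(above, x, floor);
}

// Scalar instantiation driven by a plain loop, left to the auto-vectorizer.
template <typename T>
void ClampBelowLoop(const T* in, T* out, std::size_t n, T floor)
{
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = ClampBelow<ScalarBackend<T>>(in[i], floor);
    }
}
```

The design choice is that the kernel never names Vc or int directly, so the overload duplication described on the earlier slides could be reduced to supplying a second backend.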
