Track fitting and vectorization
Thijs Cornelissen (Wuppertal)
Updates in GlobalChi2Fitter
• Rewritten calculation of Jacobians
• Fewer temporary matrices calculated
• Matrix multiplication now uses SSE instructions, factor two faster (with doubles)
• Reorganized internal storage of matrices in fitter, mainly to make them properly aligned in memory (crucial for vector instructions)
• In covariance matrix, inserted empty column between perigee and scatter entries, to make total number of entries even
• Jacobians stored as a 5x4 matrix (by default they are 5x5, which is very bad for vectorization)
• Optimized calculation of track errors at each measurement
• Also taking advantage of SSE instructions
• After these optimizations, main bottleneck is allocating and deleting (new/delete) Tracking EDM objects
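The alignment and padding points above can be illustrated with a minimal sketch. It assumes 8-byte doubles and 16-byte SSE registers, so that an even number of entries per row keeps every row start on a 16-byte boundary. The names `paddedCount`, `isAligned16` and `rowPtr` are hypothetical and are not taken from GlobalChi2Fitter; the fitter's actual storage layout is not shown in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Round an entry count up to the next even number. With 8-byte doubles,
// an even count makes each row a multiple of 16 bytes long.
constexpr std::size_t paddedCount(std::size_t n) {
    return (n + 1) & ~static_cast<std::size_t>(1);
}

static_assert(paddedCount(5) == 6, "5 entries are padded to 6");
static_assert(paddedCount(6) == 6, "even counts are left unchanged");

// True if the pointer sits on a 16-byte boundary (required by aligned SSE loads).
inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Pointer to row r of a matrix whose rows are padded to an even length.
// The alignment check holds only if `base` itself is 16-byte aligned.
inline double* rowPtr(double* base, std::size_t r, std::size_t cols) {
    double* p = base + r * paddedCount(cols);
    assert(isAligned16(p));
    return p;
}
```

With an unpadded row length of 5 doubles (40 bytes), every second row would start 8 bytes off a 16-byte boundary, which is why odd dimensions are awkward for aligned vector loads.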
Matrix multiplication: scalar
• Simple 4x4 matrix multiplication routine
• With gcc 4.7.2, auto-vectorization makes this routine 30% slower!
• gcc 4.8.1 shows small (~10%) improvement, still nowhere near theoretical speed-up (factor 4)
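The routine itself did not survive in this transcript. As a stand-in, a simple scalar 4x4 multiplication of the kind described might look like the sketch below. The function name `mul4x4Scalar` and the row-major layout are assumptions, and single precision is chosen so that the quoted theoretical factor of 4 corresponds to four floats per 128-bit SSE register.

```cpp
#include <cassert>

// Hypothetical scalar reference: C = A * B for row-major 4x4 matrices,
// each stored as 16 consecutive floats. The output must not alias an input.
void mul4x4Scalar(const float* a, const float* b, float* c) {
    assert(c != a && c != b);
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                sum += a[4 * i + k] * b[4 * k + j];  // row i of A dot column j of B
            }
            c[4 * i + j] = sum;
        }
    }
}
```

The inner loop walks down a column of B with stride 4, an access pattern that compilers of that era often vectorized poorly.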
Matrix multiplication: vectorized
• Vectorized 4x4 matrix multiplication routine, calculates four dot products in parallel
• Tested to be four times faster than scalar version
• Runs in development version of fitter, gives correct results
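The vectorized routine is also missing from the transcript. One common way to compute four dot products in parallel with SSE intrinsics is sketched below; it is not necessarily the code used in the fitter. It assumes single-precision, row-major matrices that are 16-byte aligned, and the name `mul4x4Sse` is hypothetical. Each loop iteration produces one full row of the result, i.e. four dot products at once.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Hypothetical SSE kernel: C = A * B for row-major 4x4 float matrices.
// All three pointers must be 16-byte aligned; C must not alias A or B.
void mul4x4Sse(const float* a, const float* b, float* c) {
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(c) % 16 == 0);

    // Load the four rows of B once; each register holds one row.
    const __m128 b0 = _mm_load_ps(b);
    const __m128 b1 = _mm_load_ps(b + 4);
    const __m128 b2 = _mm_load_ps(b + 8);
    const __m128 b3 = _mm_load_ps(b + 12);

    for (int i = 0; i < 4; ++i) {
        // Row i of C = a[i][0]*B row 0 + a[i][1]*B row 1 + ...
        // Each lane accumulates one of the four dot products.
        __m128 row = _mm_mul_ps(_mm_set1_ps(a[4 * i]), b0);
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 1]), b1));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 2]), b2));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 3]), b3));
        _mm_store_ps(c + 4 * i, row);
    }
}
```

The aligned load and store instructions fault on misaligned addresses, which is one reason the memory layout of the matrices matters so much for this approach.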
Profiling runIteration()
[Profile plots: devval vs. new]
• Large reduction in the cost of the derivative calculation thanks to matrix optimizations
Profiling calculateTrackErrors()
[Profile plots: devval vs. new]
• Overall factor 2 improvement after all optimizations (not just vectorization)
• Performance of errors2() function does not look optimal yet, still investigating additional techniques like loop unrolling, blocking, …
SSE and portability
• Intel MIC and ARM processors don’t support SSE, code would crash immediately
• Can implement runtime check at initialization using the assembler instruction ‘cpuid’, as explained here
• Could then use result from cpuid to set function pointer to scalar or vectorized functions
• In the end, using a higher level library like Eigen would be more elegant
• But performance will have to match the low level code
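The runtime dispatch idea can be sketched as follows. This is an illustration under stated assumptions, not the fitter's implementation: it uses the `__get_cpuid` helper from the GCC/Clang header `<cpuid.h>` rather than hand-written assembler, checks the SSE feature bit (CPUID leaf 1, EDX bit 25), and falls back to the scalar routine on any other compiler or architecture. The names `mul4x4Scalar`, `mul4x4Sse`, `cpuHasSse`, `selectMul4x4` and the macro `SKETCH_HAVE_SSE` are hypothetical.

```cpp
#include <cassert>

#if defined(__GNUC__) && defined(__SSE__) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>      // __get_cpuid (GCC/Clang, x86 only)
#include <xmmintrin.h>  // SSE intrinsics
#define SKETCH_HAVE_SSE 1
#endif

// Common signature for both kernels: C = A * B, row-major 4x4 floats.
using Mul4x4 = void (*)(const float* a, const float* b, float* c);

void mul4x4Scalar(const float* a, const float* b, float* c) {
    assert(c != a && c != b);
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) sum += a[4 * i + k] * b[4 * k + j];
            c[4 * i + j] = sum;
        }
}

#ifdef SKETCH_HAVE_SSE
void mul4x4Sse(const float* a, const float* b, float* c) {
    assert(c != a && c != b);
    // Unaligned loads keep this sketch independent of the caller's layout.
    const __m128 b0 = _mm_loadu_ps(b);
    const __m128 b1 = _mm_loadu_ps(b + 4);
    const __m128 b2 = _mm_loadu_ps(b + 8);
    const __m128 b3 = _mm_loadu_ps(b + 12);
    for (int i = 0; i < 4; ++i) {
        __m128 row = _mm_mul_ps(_mm_set1_ps(a[4 * i]), b0);
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 1]), b1));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 2]), b2));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 3]), b3));
        _mm_storeu_ps(c + 4 * i, row);
    }
}
#endif

// Ask the CPU whether it supports SSE; always false on non-x86 builds.
bool cpuHasSse() {
#ifdef SKETCH_HAVE_SSE
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) == 0) return false;
    return (edx & (1u << 25)) != 0;  // CPUID leaf 1, EDX bit 25 = SSE
#else
    return false;
#endif
}

// Called once at initialization; the result is stored as a function pointer.
Mul4x4 selectMul4x4() {
#ifdef SKETCH_HAVE_SSE
    if (cpuHasSse()) return &mul4x4Sse;
#endif
    return &mul4x4Scalar;
}
```

In a real build the SSE kernel would normally live in its own translation unit compiled with SSE flags, so that the compiler cannot emit SSE instructions into the scalar fallback path.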