Platform stability and track-fit problems
M. Moulson, T. Spadaro, P. Valente
Tracking Meeting, 18 Jul 2001
Warning sign: platform dependence
• Test of DBV-10 on SunOS and AIX:
  • Input: 1000 raw events
  • Output in ksl stream:
    • AIX: 11 events, incl. 3 not found on SunOS
    • SunOS: 10 events, incl. 2 not found on AIX
    • AIX or SunOS: 13 events
    • Mostly KSTAG (one INTERTAG)
    • 2 events found on both platforms with different length
• General caveat (?’s):
  • Parameter space is huge; this is a quick survey
  • Most tests done on very small (single-event) samples
  • Tests should be done methodically once direction is clear
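The event counts quoted above are mutually consistent, which can be checked with simple set arithmetic. The sketch below uses made-up event IDs (the actual event numbers are not given in the slides) chosen only to reproduce the quoted counts.

```python
# Hypothetical event IDs; only the counts come from the slide.
aix = set(range(1, 12))                    # 11 events selected on AIX
sunos = (aix - {9, 10, 11}) | {101, 102}   # drop 3 AIX-only events, add 2 SunOS-only

only_aix = aix - sunos     # found on AIX but not on SunOS
only_sunos = sunos - aix   # found on SunOS but not on AIX
both = aix & sunos         # found on both platforms
either = aix | sunos       # found on at least one platform

print(len(aix), len(sunos), len(only_aix), len(only_sunos), len(both), len(either))
# prints: 11 10 3 2 8 13
```

Under these counts, 8 events are common to both platforms, so 5 of the 13 selected events (roughly 40%) depend on the platform.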
Where do the differences arise?
• Differences appear at track-fit level
  • Reconstruction through PR identical
    • including DTCE(1), DHRE(1) banks
• Differences appear in DBV-9
  • Test DBV’s 7, 8, 9, 10 on a single event (?)
  • Reconstruction in general different in each version
  • Same on AIX and SunOS platforms for DBV-7, 8
• CVS history: changes in DBV-9
  • dconvr: Spatial resolutions from data
  • vtxfin: Various small bug fixes
• Suspect effect from new parameterization of hit resolution
First-crack diagnostics
• Cannot eliminate effect just by switching off algorithms
  • Kink finding, track joining, M.S., hit add./rej., etc.
  • Hit flipping not switchable
  • Fine t-s relations a possible exception
• Known changes correspond to onset of differences (?)
  • Provide a plausible mechanism for effect
• Cannot eliminate effect by disabling code-optimization
Summary of first-crack diagnostics
Input: 1000 raw events from run 18805
Table summarizes differences in ksl stream
DFITER: Fundamental track-fit routine
[Flowchart of DFITER; boxes and labels in order of appearance:]
• Get space points from track pars. (q) [DFTRAC]
• Time-space conversion; i < 1 [DFBCOR]
• [DFDRV] Get residuals, χ², V = dχ²/dq
• Branch: χ² > χ²(old)
• [LEQU64] V dq = q; q = q + dq
• Δχ² < cut: CONVERGED
• Max iter?: FAILED
Issues with DFITER and call limits
• DFITER called at various points
  • at start of event
  • after each hit flipped in DFLIP
  • after DFMUSC, DFDEDX, etc.
  • at end of event
• Max iterations in a single call: 8
• On failure, convergence criterion relaxed and called again
  • up to 15 times per track from most places
  • up to 15 times more for dE/dx and at end of event
• Most tracks reconstructed differently show convergence problems
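The calling pattern described above (at most 8 iterations per call, then a relaxed convergence criterion and another call, up to 15 calls) can be sketched as follows. This is an illustration only, not the DFITER code: the toy χ² function, the damped update step, the starting cut, and the relaxation factor of 2 are all assumptions; only the limits 8 and 15 come from the slide.

```python
MAX_ITER = 8     # max iterations in a single call (from the slide)
MAX_CALLS = 15   # max calls per track from most places (from the slide)

def fit_once(q, chi2, step, cut):
    """One DFITER-like call: iterate until the chi2 change drops below cut.

    chi2 and step are hypothetical callables standing in for the real fit.
    Returns (q, converged)."""
    old = chi2(q)
    for _ in range(MAX_ITER):
        q = q + step(q)
        new = chi2(q)
        if abs(old - new) < cut:
            return q, True
        old = new
    return q, False

def fit_with_retries(q, chi2, step, cut, relax=2.0):
    """Repeat fit_once, loosening the cut after each failed call.

    The relaxation factor is an assumption. Returns (q, calls used, final cut)."""
    for call in range(1, MAX_CALLS + 1):
        q, converged = fit_once(q, chi2, step, cut)
        if converged:
            return q, call, cut
        cut *= relax
    return q, MAX_CALLS, cut

# Toy problem: one parameter with minimum at q = 3 and a slowly converging step.
chi2 = lambda q: (q - 3.0) ** 2
step = lambda q: 0.2 * (3.0 - q)

q_fit, calls, final_cut = fit_with_retries(0.0, chi2, step, cut=1e-6)
print(q_fit, calls, final_cut)
```

In this toy case the fit does not converge within the first call, so the result that is finally accepted satisfies a looser criterion than the one originally requested. This is the sense in which slowly converging tracks end up with less tightly determined parameters.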
Beginnings of an explanation
• At first call to DFITER, parameters different by 10⁻⁵ to 10⁻⁶
• Inside DFITER, after LEQU64, difference increases to 10⁻⁵ to 10⁻⁴
• Differences accumulate with each call to DFITER
  • Eventually jump bins in fine t-s
  • Differences in %
• Problem exacerbated when convergence difficult
• Most critical track parameter: z
  • Can diverge by tens of cm, esp. in DFLIP
[Figure: plots with labels Track, Tries, Hits, DFITER, DFLIP, Other Alg, End]
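The "jump bins" mechanism can be illustrated with a binned lookup: a relative difference of 10⁻⁵ is harmless for a value in the middle of a bin but selects a different bin when the value sits on an edge. The bin width and the input values below are hypothetical; the actual binning of the fine t-s relations is not given in the slides.

```python
import math

def bin_index(t, width):
    """Index of the uniform bin of the given width that contains t."""
    return math.floor(t / width)

WIDTH = 1.0   # hypothetical bin width (arbitrary units)
REL = 1e-5    # relative platform difference, order of magnitude from the slide

for t in (250.5, 250.0):   # mid-bin value, then a value on a bin edge
    low = bin_index(t * (1 - REL), WIDTH)
    high = bin_index(t * (1 + REL), WIDTH)
    print(t, low, high)
# prints:
# 250.5 250 250
# 250.0 249 250
```

Once two platforms read different bins, the looked-up quantity differs by a full bin-to-bin step instead of by 10⁻⁵, which is one way a tiny numerical difference can grow to the percent level.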
Notes on machine precision
• Why do we see differences at 10⁻⁵ level at input to DFITER?
  • In principle possible to have exact agreement between platforms for single calculation
  • In practice depends on optimization, autopromotion, rounding modes
    • E.g., AIX: our standard compiler flags do not round in single precision but autopromote single to double
  • Fair amount of code before this point; numerical errors accumulate rapidly
• Part of solution will involve:
  • Tuning compilation parameters
  • Promoting key parts of track fit to double precision
• NB: Matrix inversion already in double, looks OK
  • Worst case: V⁻¹V = 1 to within < 10⁻¹² (diag), 10⁻⁹ (off diag)
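The difference between rounding every intermediate result to single precision and autopromoting to double can be reproduced in a few lines. The sketch below is not the track-fit code and does not model any particular compiler. It sums the same single-precision inputs twice, once rounding to single after every addition and once accumulating in double and rounding only at the end. The input value and the number of terms are arbitrary choices.

```python
import struct

def to_single(x):
    """Round a Python float (a double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(values, round_each_step):
    """Sum values, optionally rounding the running total to single each step."""
    total = 0.0
    for v in values:
        total += v
        if round_each_step:
            total = to_single(total)
    return total

values = [to_single(0.1)] * 10000   # identical single-precision inputs

rounded = accumulate(values, round_each_step=True)                # strict single
promoted = to_single(accumulate(values, round_each_step=False))   # promoted to double

rel_diff = abs(rounded - promoted) / promoted
print(rounded, promoted, rel_diff)
```

Both results are legitimate single-precision answers from the same inputs, yet they disagree well above single-precision epsilon (about 10⁻⁷) because the rounding errors accumulate over many operations.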
Problems with DFLIP
1. Residuals and drift distances not updated if hit not flipped
  • Basis for choice of next hit to flip
  • Always assume previous hit was flipped
2. Failed DFITER calls count against max. retries
  • Looser convergence if lots of hits to flip (lots of failures)
  • Fewer calls to DFITER allowed
  • If a hit won’t flip, do we want to retry?
3. Criterion for keeping flip: χ² < input χ²
  • Flip is kept even if χ² worse than best so far
4. More subjective issues
  • No use of information on z-progression?
[Flowchart of DFLIP, boxes in order of appearance: Get hits to flip | Sort; pick worst | Store track pars. | Flip hit | DFITER (15 times) | χ² < input | Restore track pars. | Return]
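Point 3 can be made concrete by comparing the two acceptance rules on the same sequence of flip attempts. This is a simplified sketch, not the DFLIP code: the χ² values are invented, and each trial χ² is assumed not to depend on which earlier flips were kept.

```python
def run_flips(chi2_input, trial_chi2, compare_to_best):
    """Return the chi2 left on the track after each flip attempt.

    trial_chi2[i] is a hypothetical chi2 obtained after trying flip i.
    compare_to_best=False mimics the rule on the slide (chi2 < input chi2);
    compare_to_best=True is the alternative (chi2 < best so far)."""
    current = chi2_input
    best = chi2_input
    history = []
    for chi2 in trial_chi2:
        reference = best if compare_to_best else chi2_input
        if chi2 < reference:
            current = chi2            # keep the flip
            best = min(best, chi2)
        # otherwise the stored track parameters are restored: current unchanged
        history.append(current)
    return history

trials = [80.0, 95.0, 70.0]
vs_input = run_flips(100.0, trials, compare_to_best=False)
vs_best = run_flips(100.0, trials, compare_to_best=True)
print(vs_input)   # [80.0, 95.0, 70.0]
print(vs_best)    # [80.0, 80.0, 70.0]
```

With the comparison against the input χ², the second flip is kept although it makes the fit worse than the result already achieved by the first flip.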
A possible wish list
• More study of the problem
• More sensible calling strategy for DFITER
• Small fixes to DFLIP (for now)
• Use l/2 instead of sampling for small drift distances
• Double precision at key points
• Compiler flags to consistently handle numerical inaccuracy
• Smoothing of t-s relations, resolution curves
• Evaluate efficacy of changes based on:
  • parameter resolution, track χ², L/R resolution accuracy, track splitting, machine stability
• While monitoring traditional quantities:
  • efficiency, hit efficiency, purity, CPU time
A final note: Time is critical!
August downtime will be the only point in the near future when large-scale reprocessing is possible