A report to the Offline weekly meeting of July 7, 2012, on corrupted data chunks in the LHC11b10a MC production. Numerous sub-jobs have no global tracks, which lowers the matching efficiency and gives incorrect normalization factors; the effect in this production is 3.5% (+1.2%). The problem arises when the Trigger.root file is not created during simulation, so that all ITS layers are skipped and no ITS tracks are produced. The underlying AliRoot aborts and silent crashes were not caught by the job validation scripts. The report describes how past productions can be re-validated and which code fixes should make job validation more robust in future productions.
Corrupted MC data chunks Offline weekly July 7, 2012
The issue • As reported by PWG-LF, numerous sub-jobs from LHC11b10a MC have no global tracks (back-propagated ITS tracks) • Matching efficiency drop and incorrect normalization factors • In the above production, the effect is 3.5%(+1.2%) • Full report in Savannah • The effect is only in MC
Forensics • If a file (Trigger.root) is not created during the simulation phase, the string of detectors in the trigger cluster is left empty and all ITS layers are skipped (no ITS tracks) • The error generates only a warning in the reconstruction • W-AliReconstruction::GetEventInfo: No trigger can be loaded! The trigger information will not be used! • The conditions for this always occur in the late part of the simulation, usually, but not always, during digitisation
Forensics (2) • Two ‘events’ have been discovered so far • AliRoot aborts during a failed access to OCDB (biggest contributor) • Silent crash, no specific error • The AliRoot abort generates an ‘Abort’ signal, which should have been printed in sim.log (redirect from the standard error stream) • However, in some of the cases it does not appear… • … and subsequently it is not caught by the job validation script • The silent crash is not caught by any of the ‘per job’ validations
Forensics (3) • The defective jobs are not caught by • The validation script – parses only *.log, not stderr/stdout • The per job CheckESD macro – succeeds also in the ‘corrupted’ case • The per run QA – there is a ‘hint’, but it is diluted, as the error is at the ~4% level • … In addition, the mean vertex cut eliminates the events
Re-validation of the productions • Fast and indirect method – size of the sim.log • [Figure: sim.log size for LHC11b10a; labels: good production; bad chunks, 4.9%]
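The sim.log size check can be sketched as follows. This is only an illustration, not the procedure used for the slide: the directory layout (one sub-directory per chunk, each holding a sim.log), the function names and the 50% tolerance are assumptions, and the slide does not state in which direction the log size of a bad chunk deviates, so the sketch flags deviations in either direction from the median.

```python
from pathlib import Path
from statistics import median


def collect_sizes(production_dir):
    """Map chunk name -> sim.log size in bytes.

    Assumes a hypothetical layout <production_dir>/<chunk>/sim.log.
    """
    return {
        log.parent.name: log.stat().st_size
        for log in Path(production_dir).glob("*/sim.log")
    }


def flag_by_log_size(sizes, tolerance=0.5):
    """Return the chunks whose sim.log size differs from the median size
    by more than `tolerance` (a fraction of the median)."""
    if not sizes:
        return []
    reference = median(sizes.values())
    return sorted(
        chunk
        for chunk, size in sizes.items()
        if abs(size - reference) > tolerance * reference
    )
```

Flagged chunks are only candidates; the method is indirect, so each one still needs a positive identification, for example through the warning in rec.log.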
Re-validation of the productions (2) • Other cases and Pb+Pb – not straightforward • [Figure labels: LHC11b10c; PbPb, OK period]
‘Suspicious’ cycles • Tested all 2010 (149 cycles), 2011 (104 cycles), 2012 (62 cycles)
Past productions remedy • From the above table, scan rec.log for • ‘W-AliReconstruction::GetEventInfo: No trigger…’ • to positively identify affected chunks • Ongoing… • Rename the ESDs and AODs in the catalogue to ‘something else’, which will not show up in the standard analysis searches • Mild danger for analyses that use ‘prepared’ collections – jobs will fail… • Merged AODs (deltas) will have to be re-merged • For Pb+Pb, a cut on ‘zero ITS tracks’ will eliminate the bad chunks
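A minimal sketch of the rec.log scan, assuming a hypothetical layout with one sub-directory per chunk that contains its rec.log; the function names are illustrative. Only the warning prefix quoted on the slides is matched.

```python
from pathlib import Path

# Warning prefix as quoted on the slides.
TRIGGER_WARNING = "W-AliReconstruction::GetEventInfo: No trigger"


def has_trigger_warning(rec_log):
    """True if the reconstruction log contains the 'No trigger' warning."""
    with open(rec_log, errors="replace") as handle:
        return any(TRIGGER_WARNING in line for line in handle)


def affected_chunks(production_dir):
    """Names of the chunks whose rec.log carries the warning.

    Assumes a hypothetical layout <production_dir>/<chunk>/rec.log.
    """
    return sorted(
        log.parent.name
        for log in Path(production_dir).glob("*/rec.log")
        if has_trigger_warning(log)
    )
```

The resulting list is what the renaming step in the catalogue would then operate on.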
Code fixes • job validation – scan all files (implemented) • per job ‘checkESD’ macro – strengthen the script, positive feedback to validate the job • QA – to be discussed • reconstruction logic – abort in case the Trigger.root file is not found • Follow-up by Offline, discussion in the weekly meetings
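The first and fourth fixes can be illustrated with a short sketch. It is not the implemented validation script: the function name, the marker list, the location of Trigger.root in the job directory and the choice to skip *.root files when scanning text are all assumptions made for the example.

```python
from pathlib import Path

# Markers taken from the slides; a real script would likely need more.
ERROR_MARKERS = (
    "Abort",
    "W-AliReconstruction::GetEventInfo: No trigger",
)


def validate_job(job_dir, markers=ERROR_MARKERS):
    """Return a list of problems found in one job directory.

    An empty list means the job passed. Every regular file is scanned,
    including stdout/stderr, not only *.log. Binary *.root files are
    skipped (assumption).
    """
    job = Path(job_dir)
    problems = []
    # Treat a missing Trigger.root as an error, not as a warning.
    if not (job / "Trigger.root").is_file():
        problems.append("Trigger.root missing")
    for path in sorted(job.iterdir()):
        if not path.is_file() or path.suffix == ".root":
            continue
        text = path.read_text(errors="replace")
        for marker in markers:
            if marker in text:
                problems.append(f"{path.name}: found '{marker}'")
    return problems
```

Plain substring matching on ‘Abort’ can produce false positives, and it cannot detect the silent crash, which leaves no message; that case still depends on the missing-file check and on a stricter per job checkESD macro.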