Notes on offline data handling

M. Moulson, Frascati, 29 March 2006

Data flow in reprocessing
• Jobs: 1 per raw file and run (datarec_reproc_ibm.csh)
• Prefetch: All raw files for run
  • Script: start_recall_files.csh, recall_one_dr.tcl
  • Method: “recall raw (PROD areas)”
  • List of PROD areas specified explicitly (from DB2 query)
  • Files recalled one at a time
  • Jobs started when files on disk
  • Explore option of recalling all files for run at once?
• Input: raw files (1 per job)
  • Script: runXreproc_ibm.csh
  • Method: KID URL dbraw: GROUP_ID=PROD
• Output: datarec files (1 per stream and job)
  • Written to /datarec and archived
  • Advance recall to DSTPROD area: “recall datarec dstprod”
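The prefetch behaviour described above (files recalled one at a time, one job started per raw file once it is on disk) can be sketched as follows. This is only an illustration of the control flow, not the logic of start_recall_files.csh or recall_one_dr.tcl: the function name `reprocess_run` and the `recall` and `start_job` callables are hypothetical stand-ins.

```python
from typing import Callable, Iterable, List


def reprocess_run(raw_files: Iterable[str],
                  recall: Callable[[str], bool],
                  start_job: Callable[[str], None]) -> List[str]:
    """Recall raw files one at a time and start one job per recalled file.

    `recall` is a hypothetical callable that returns True once the file is
    on disk in a PROD area; `start_job` is a hypothetical job launcher.
    Returns the files whose recall failed (no job is started for them).
    """
    failed = []
    for raw in raw_files:
        if recall(raw):        # assumed to block until the file is on disk
            start_job(raw)     # 1 job per raw file
        else:
            failed.append(raw)
    return failed
```

Recalling all files for the run at once, the option raised in the last prefetch bullet, would replace the per-file `recall` call with a single bulk request made before the loop.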

Data flow in DST production
• Jobs: 1 per stream and run
• Disk mode (i.e. for output from reprocessing: datarec_dstfd_ibm.csh)
  • Prefetch: None
  • Input: All datarec files for stream and run
    • Script: dstXprocfd_ibm.csh
    • Method: KID URL dbdatarec GROUP_ID DSTPROD
• Tape mode (datarec_dst_ibm.csh)
  • Prefetch: All datarec files for stream and run
    • Script: start_recall_files.csh, recall_one_dr.tcl
    • Method: “recall datarec (PROD areas)”
    • Change to DSTPROD area to avoid inconsistency?
  • Input: datarec files
    • Script: dstXprod_ibm.csh
    • Method: KID URL dbdatarec GROUP_ID DSTPROD
• Output: DST files (1 per stream and run)
  • Written to /datarec and archived
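The difference between the two DST modes can be summarised in a small lookup. The script names are copied from the slide as written; the `DstJobPlan` type and the `plan_dst_job` function are hypothetical and exist only to make the disk/tape distinction explicit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DstJobPlan:
    driver: str      # top-level job script named on the slide
    prefetch: bool   # whether datarec files are recalled before the job
    script: str      # processing script named on the slide


def plan_dst_job(input_on_disk: bool) -> DstJobPlan:
    """Pick disk mode when the datarec input is fresh reprocessing output."""
    if input_on_disk:
        # Disk mode: no prefetch, input is already on disk
        return DstJobPlan("datarec_dstfd_ibm.csh", False, "dstXprocfd_ibm.csh")
    # Tape mode: all datarec files for the stream and run are recalled first
    return DstJobPlan("datarec_dst_ibm.csh", True, "dstXprod_ibm.csh")
```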

Data flow in MC production (1/2)
• Jobs: 1 per run and card type (mcprod.pl)
• Processes:
  • 1 or more GEANFI processes, each followed by a datarec process
  • 1 DST job per requested DST stream at end
• GEANFI output: 1 mco file per GEANFI process
  • Written to /datarec, not archived
• Reconstruction prefetch: All bgg/lsb (datarec) files for run
  • Method: “recall datarec all”
  • Currently, prefetch all files at start of each reconstruction job
  • Instead, prefetch once before first GEANFI process
• Reconstruction input: mco file
  • Method: KID URL “ybos:” (files on /datarec)
• Reconstruction input: Subset of bgg/lsb files for run
  • Method: KID URL “dbdatarec:”

Data flow in MC production (2/2)
• Reconstruction output: 1 mcr file per mco file (GEANFI process)
  • Written to /datarec and archived
  • Advance recall to DSTPROD area: “recall mc dstprod”
    • Is this really a good idea?
    • DSTs start right away from same directory
    • See notes below
• DST input: All mcr files for job, for each requested DST stream (process)
  • Method: KID URL dbmc DSTPROD
• DST output: 1 MC DST file per process
  • Written to /datarec and archived
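The process chain of one MC job, as described on the two slides above, can be written out as an ordered list: each GEANFI process is followed by a datarec process, and one DST process per requested stream runs at the end. The function `mc_job_processes` and the output labels (`mco_0`, `mcr_0`, `mcdst_<stream>`) are hypothetical placeholders, not real KLOE file names.

```python
from typing import List, Tuple


def mc_job_processes(n_geanfi: int, dst_streams: List[str]) -> List[Tuple[str, str]]:
    """Return (process type, output label) pairs in execution order."""
    if n_geanfi < 1:
        raise ValueError("an MC job has 1 or more GEANFI processes")
    steps: List[Tuple[str, str]] = []
    for i in range(n_geanfi):
        steps.append(("geanfi", f"mco_{i}"))    # 1 mco file per GEANFI process
        steps.append(("datarec", f"mcr_{i}"))   # 1 mcr file per mco file
    for stream in dst_streams:
        steps.append(("dst", f"mcdst_{stream}"))  # 1 MC DST file per DST process
    return steps
```

A job with n GEANFI processes and k requested DST streams therefore runs 2n + k processes and produces n mco, n mcr, and k MC DST files.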

Data flow for standalone MC DSTs
• Jobs: 1 per run and card type (mcprod_dst.pl)
• Processes: 1 per requested DST stream
• Prefetch: None
• Input: All mcr files for run and card type
  • Method: KID URL dbmc DSTPROD
• Output: 1 MC DST per process
  • Written to /datarec and archived

Standard offline file types
MC files (mcr) are technically datarec streams of type ALL (stream_id = 0)
DESCRIPT.STREAM_OFFLINE contains separate DSRV groups for data and MC
DSRV groups shown are for data, except for mcr files, for which the MC DSRV group is shown

DSRV groups for background files
• All datarec types already have DSRV group dir = USER
• All DST types (data and MC) already have DSRV group dir = DST
  • By default these are recalled to “DST cache”
• Exception is background files (bgg, lsb):
  • Change to DST group?
  • Leave as USER and recall to PROD?
  • Leave as USER and recall with kcp?

Summary of recall areas
• /datarec currently 470 GB SSA disk
• Must add more disks to string:
  • Access bandwidth:
    • All MC output to /datarec
    • 84 MB/s with 300 B80 for MC
    • 138 MB/s if input to MCDST also from /datarec
    • Adding disks to string helps parallelism
  • Size:
    • Archiver maintains /datarec at 40% full
    • MC requires <90% full to start
    • /datarec filled to 50% within 1 hour
• Archiving bandwidth:
  • Saturated with 150 B80 for MC
  • Must increase by system tuning:
    • Any amount of /datarec space “immediately” filled if archiving bandwidth insufficient
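The disk figures above lend themselves to a short back-of-envelope calculation. The sketch below rests on assumptions that are not stated on the slide: that access bandwidth scales linearly with the number of B80 CPUs, that 1 GB = 1000 MB, and that “filled to 50% within 1 hour” means a rise from the archiver's 40% level. The function names are hypothetical.

```python
def scaled_bandwidth(mb_per_s_at_ref: float, ref_cpus: float, cpus: float) -> float:
    """Scale a quoted bandwidth linearly with CPU count (an assumption)."""
    return mb_per_s_at_ref * cpus / ref_cpus


def net_fill_rate(capacity_gb: float, start_frac: float,
                  end_frac: float, hours: float) -> float:
    """Net rate in MB/s (writes minus archiving) that moves the disk from
    start_frac to end_frac full in `hours`, taking 1 GB = 1000 MB."""
    filled_mb = capacity_gb * (end_frac - start_frac) * 1000.0
    return filled_mb / (hours * 3600.0)


# 84 MB/s at 300 B80 would correspond to 42 MB/s at 150 B80 under linear scaling.
half_farm = scaled_bandwidth(84.0, 300.0, 150.0)

# Going from 40% to 50% of 470 GB in 1 hour is 47 GB, about 13 MB/s net.
net_rate = net_fill_rate(470.0, 0.40, 0.50, 1.0)
```

On these assumptions, a net excess of only about 13 MB/s of writes over archiving is enough to produce the observed fill, which is small compared with the quoted 84 MB/s access bandwidth.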

Transfers to and from /datarec
Assumes:
0.5 B80 s to fully produce 1 event, including DSTs
4 DST processes per job, zero overlap in DSTs
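The first assumption fixes the event throughput for a given number of CPUs. The sketch below works this out. Combining it with the 84 MB/s figure quoted for 300 B80 to get a mean /datarec traffic per event is an extra step that presumes both numbers come from the same estimate, which the slides do not state explicitly. The function names are hypothetical.

```python
def events_per_second(n_b80: float, b80_seconds_per_event: float = 0.5) -> float:
    """Event throughput for n_b80 CPUs at 0.5 B80 s per fully produced event."""
    return n_b80 / b80_seconds_per_event


def mb_per_event(bandwidth_mb_per_s: float, n_b80: float,
                 b80_seconds_per_event: float = 0.5) -> float:
    """Mean /datarec traffic per event implied by a bandwidth figure."""
    return bandwidth_mb_per_s / events_per_second(n_b80, b80_seconds_per_event)


rate_300 = events_per_second(300.0)        # 600 events/s for 300 B80
traffic = mb_per_event(84.0, 300.0)        # about 0.14 MB per event
```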

Recommended tape space allocation
• Allocations include currently occupied space
• MC DSTs probably appear as datarec files to archiver
• Current library system capacity ~720 GB
  New cassettes will have to be ordered in future
• Temporary allocation based on 720 GB library
  Assumes MC production slow
• Final allocation assumes completion of KLOE offline program
