
Data Management after LS1


Presentation Transcript


1. Data Management after LS1

2. Brief overview of current DM
• Replica catalog: LFC
  • LFN -> list of SEs
• SEs are defined in the DIRAC Configuration System
  • For each protocol: end-point, SAPath, [space token, WSUrl]
  • Currently only used: SRM and rfio
• File placement according to Computing Model
  • FTS transfers from original SE (asynchronous)
• Disk replicas and archives completely split:
  • Only T0D1 and T1D0, no T1D1 SE any longer
• Production jobs:
  • Input file download to WN (max 10 GB) using gsiftp
• User jobs:
  • Protocol access from SE (on LAN)
• Output upload:
  • From WN to (local) SE (gsiftp). Upload policy defined in the job
• Job splitting and brokering:
  • According to LFC information
  • If a file is unavailable, the job is rescheduled
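
The resolution chain on this slide (LFN -> list of SEs in the replica catalog, then per-protocol end-point and SAPath from the Configuration System) can be pictured with a minimal Python sketch. This is not the actual DIRAC code: the SE names, endpoints and paths are invented for illustration, and the real Configuration System entries carry more fields.

```python
# Illustrative model of the replica resolution described above.
# All SE names, endpoints and paths are invented placeholders.

# SE definitions: per-protocol end-point, SAPath and optional space token / WSUrl.
STORAGE_ELEMENTS = {
    "CERN-DST": {
        "srm": {
            "EndPoint": "srm://srm-example.cern.ch:8443",
            "WSUrl": "/srm/managerv2?SFN=",
            "SAPath": "/castor/cern.ch/grid",
            "SpaceToken": "LHCb-Disk",
        },
        "rfio": {
            "EndPoint": "rfio://castor-example.cern.ch",
            "SAPath": "/castor/cern.ch/grid",
        },
    },
}

# Replica catalog content: LFN -> list of SEs (as returned by the LFC).
REPLICAS = {
    "/lhcb/data/2012/RAW/FULL/12345/file.raw": ["CERN-DST"],
}


def storage_url(lfn, se, protocol="srm"):
    """Compose the storage URL for one replica from the SE definition."""
    cfg = STORAGE_ELEMENTS[se][protocol]
    return cfg["EndPoint"] + cfg.get("WSUrl", "") + cfg["SAPath"] + lfn


if __name__ == "__main__":
    lfn = "/lhcb/data/2012/RAW/FULL/12345/file.raw"
    for se in REPLICAS[lfn]:
        print(storage_url(lfn, se))   # URL used for download to the WN or for FTS
```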

3. Caveats with current system
• Inconsistencies between FC, SE catalog and actual storage
  • Some files are temporarily unavailable (server down)
  • Some files are lost (unrecoverable disk, tape)
• Consequences:
  • Wrong brokering of jobs: cannot access files
    • Except for download policy if another replica is on disk/cache
  • SE overload
    • Busy, or not enough movers
    • As if files are unavailable
    • Jobs are rescheduled
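
One way to picture the inconsistency problem is a cross-check between what the catalogue claims and what the storage can actually serve. The sketch below is purely illustrative: the storage view is just an in-memory dictionary, whereas a real check would go through SRM or gsiftp listings.

```python
# Toy cross-check between the file catalogue view and the storage view.

def check_replicas(catalogue, storage):
    """
    catalogue: {lfn: [se, ...]} as recorded in the replica catalogue.
    storage:   {se: set of LFNs actually readable on that SE};
               an SE absent from this dict is treated as unreachable (server down).
    Returns (lost, unavailable) lists of (lfn, se) pairs.
    """
    lost, unavailable = [], []
    for lfn, ses in catalogue.items():
        for se in ses:
            if se not in storage:
                unavailable.append((lfn, se))   # temporarily unavailable
            elif lfn not in storage[se]:
                lost.append((lfn, se))          # catalogued but gone from storage
    return lost, unavailable


lost, unavailable = check_replicas(
    {"/lhcb/data/file1.dst": ["CERN-DST", "CNAF-DST"]},
    {"CERN-DST": set()},                        # CNAF-DST does not answer at all
)
print("lost:", lost)
print("unavailable:", unavailable)
```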

4. Future of replica catalog
• We probably still need one
  • Job brokering:
    • Don't want to transfer files all over the place (even with caches)
  • DM accounting:
    • Want to know what/how much data is where
• But…
  • Should not need to be as highly accurate as now
  • Allow files to be unavailable without the job failing
• Considering the DIRAC File Catalog
  • Mostly replica location (as used in LFC)
  • Built-in space usage accounting per directory and SE
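
A toy model of what the DIRAC File Catalog is expected to add over the LFC: replica location plus per-directory, per-SE space accounting maintained at registration time. Class and method names below are invented for illustration and do not reflect the actual DFC interface.

```python
# Toy catalogue keeping replica location and per-directory / per-SE usage.
import os
from collections import defaultdict


class ToyFileCatalog(object):
    def __init__(self):
        self.replicas = {}                                   # lfn -> {se: size in bytes}
        self.usage = defaultdict(lambda: defaultdict(int))   # dir -> se -> bytes

    def add_replica(self, lfn, se, size):
        self.replicas.setdefault(lfn, {})[se] = size
        self.usage[os.path.dirname(lfn)][se] += size         # accounting comes for free

    def get_replicas(self, lfn):
        return sorted(self.replicas.get(lfn, {}))

    def directory_usage(self, directory):
        return dict(self.usage[directory])


fc = ToyFileCatalog()
fc.add_replica("/lhcb/MC/2012/DST/0001/file1.dst", "CERN-DST", 2 * 10**9)
fc.add_replica("/lhcb/MC/2012/DST/0001/file1.dst", "CNAF-ARCHIVE", 2 * 10**9)
print(fc.get_replicas("/lhcb/MC/2012/DST/0001/file1.dst"))   # replica location
print(fc.directory_usage("/lhcb/MC/2012/DST/0001"))          # usage per SE for that directory
```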

5. Access and transfer protocols
• Welcome gfal2 and FTS3!
  • Hopefully transparent protocol usage for transfers
  • However, transfer requests should be expressed with compatible URLs
• Access to T1D0 data
  • 99% for reconstruction or re-stripping, i.e. download
  • Read once, therefore still require a sizeable staging pool
  • Unnecessary to copy to T0D1 before copying to WN
• xroot vs http/webdav
  • No strong feelings
  • What is important is a unique URL, redirection and WAN access
  • However, why not use (almost) standard protocols?
    • CVMFS experience is very positive, why not http for data?
  • Of course better if all SEs provide the same protocol
    • http/webdav for EOS and Castor?
  • We are willing to look at the http ecosystem
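
On the client side, "transparent protocol usage" would look roughly like the sketch below, assuming the gfal2 Python bindings (creat_context, stat, filecopy) are installed. The URLs are placeholders, and the transfer parameters shown are only a small subset of what gfal2 supports.

```python
# Protocol-agnostic access through the gfal2 Python bindings (assumed available).
# The same calls work for srm://, root:// or https:// replicas, provided the
# corresponding gfal2 plugin is installed.
import gfal2

SRC = "https://eos-example.cern.ch//eos/lhcb/grid/prod/lhcb/data/file.dst"
DST = "file:///tmp/file.dst"

ctx = gfal2.creat_context()

# Metadata lookup, identical whatever the protocol of SRC.
print("size:", ctx.stat(SRC).st_size)

# Copy to the worker node (download policy), again protocol independent.
params = ctx.transfer_parameters()
params.overwrite = True
params.checksum_check = True
ctx.filecopy(params, SRC, DST)
```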

6. Other DM functionality
• File staging from tape
  • Currently provided by SRM
  • Keep SRM for T1D0 handling
  • Limited usage for bringOnline
  • Not used for getting tURL
• Space tokens
  • Can easily be replaced by different endpoints
  • Preferred to using namespace!
• Storage usage
  • Also provided by SRM
  • Is there a replacement?
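
For T1D0 handling, the bringOnline step could in principle also be driven through gfal2 rather than direct SRM calls. The sketch below assumes gfal2's bring_online / bring_online_poll bindings; the SURL and the pin/timeout values are invented, and the return-code semantics should be checked against the installed gfal2 version.

```python
# Tape recall via gfal2: asynchronous bring_online plus polling (schematic).
import time
import gfal2

SURL = ("srm://srm-example.cern.ch:8443/srm/managerv2?SFN="
        "/castor/cern.ch/grid/lhcb/data/file.raw")

ctx = gfal2.creat_context()

# Ask for the file to be staged: pin for 24 h, 1 h for the request itself,
# asynchronous mode returns a request token to poll with.
status, token = ctx.bring_online(SURL, 86400, 3600, True)

while status == 0:                    # 0 is assumed to mean "not on disk yet"
    time.sleep(60)
    status = ctx.bring_online_poll(SURL, token)

print("file staged, token:", token)
```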

7. Next steps
• Re-implement DIRAC DM functionality with gfal2
• Exploit new features of FTS3
• Migrate to DIRAC File Catalog
  • In parallel with LFC
• Investigate http/webdav for file location and access
  • First, use it for healing
  • Still brokering using a replica catalog
  • Usage for job brokering (replacing replica catalog)?
  • Scalability?
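
As an illustration of what driving FTS3 from the DM layer could look like, here is a minimal submission sketch assuming the FTS3 REST "easy" Python bindings; the endpoint and SURLs are placeholders.

```python
# Minimal FTS3 job submission (assumed FTS3 REST "easy" bindings).
import fts3.rest.client.easy as fts3

ENDPOINT = "https://fts3-example.cern.ch:8446"

context = fts3.Context(ENDPOINT)
transfer = fts3.new_transfer(
    "srm://source-example.cern.ch/lhcb/data/file.dst",   # source SURL
    "srm://dest-example.cern.ch/lhcb/data/file.dst",     # destination SURL
)
job = fts3.new_job([transfer])
job_id = fts3.submit(context, job)
print("submitted FTS3 job", job_id)
```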

8. What else?
• Dynamic data caching
  • Not clear yet how best to use this without replicating everything everywhere
  • When do caches expire?
  • Job brokering?
    • Don't want to hold jobs while a dataset is replicated
• Data popularity
  • Information collection is in place
  • Can it be used for automatic replication/deletion?
    • Or better as a hint for Data Managers?
  • What is the metric to be used?
    • What if 10 files out of a 100 TB dataset are used for tests, but no one is interested in the rest?
    • Fraction of dataset used or absolute number of accesses?
  • Very few analysis passes on the full dataset
    • Many iterative uses of the same subset
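
The metric question can be made concrete with a toy computation: for the case above (a handful of test files read many times inside a large dataset), absolute access counts and fraction-of-dataset-used give opposite answers. All numbers below are made up for illustration.

```python
# Toy comparison of the two candidate popularity metrics:
# 10 test files read 50 times each inside a 100000-file dataset.

def popularity(n_files_in_dataset, accesses_per_file):
    """accesses_per_file: {lfn: number of accesses in the period}."""
    files_touched = sum(1 for n in accesses_per_file.values() if n > 0)
    fraction_used = float(files_touched) / n_files_in_dataset
    total_accesses = sum(accesses_per_file.values())
    return fraction_used, total_accesses


frac, total = popularity(100000, {"file%05d" % i: 50 for i in range(10)})
print("fraction of dataset used: %.3f%%" % (100 * frac))   # 0.010% -> looks cold
print("absolute number of accesses: %d" % total)           # 500    -> looks hot
```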
