Explore concrete plans for federated data access, transfer, fault tolerance, caching, and brokerage in a federated environment for PanDA. Detailed overview including managed production, distributed analysis, and speculative ideas.
PanDA in a Federated Environment
Kaushik De
Univ. of Texas at Arlington
CC-IN2P3, Lyon
September 13, 2012
Outline
• Overview
• Concrete plans
  • Federated data access/stageout for fault tolerance
  • Federated data transfer for managed production
  • Federated data access for distributed analysis
• Speculative ideas
  • Data caching
  • Event caching
  • Cache aware brokerage
PanDA FAX Status
• Last year, I talked about local federations
  • Direct access through local redirectors is in use by PanDA at SLAC and SouthWest Tier 2 – working well for many years
• This year, the emphasis has been on global federations
  • Global redirectors have been set up and tested in ATLAS
  • Changes were implemented in the PanDA pilot to enable these global redirectors in the default workflow
• But progress has been somewhat slow
  • PanDA under continuous use in ATLAS
  • Development activities not related to LHC data have been minimal
FAX for Fault Tolerance
• Phase 1 goal
  • If input file cannot be transferred/accessed from local SE, PanDA pilot currently fails the job after a few retries
  • We plan to use federated storage for these (rare) cases
  • Start with file staging/transfers using FAX
  • Implemented in recent release of pilot, works fine at two test sites
  • Next step – wider scale testing at production/DA sites
• Phase 2
  • Once file transfers work well, try FAX Direct Access
• Phase 3
  • Try FAX for transfer of output files, if default destination fails
• Next few slides from Tadashi/Paul
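The Phase 1 behaviour described above can be sketched as a small stage-in routine. This is a minimal illustration, not the pilot's actual code: the function name `stage_in`, the two copy callables, and the default of three local retries are all assumptions made for the example.

```python
from typing import Callable, Optional


def stage_in(lfn: str,
             local_copy: Callable[[str], bool],
             fax_copy: Callable[[str], bool],
             max_local_retries: int = 3) -> Optional[str]:
    """Try the local SE a few times, then fall back to the federation.

    All names here are hypothetical. Returns "local" or "fax" to say
    which source delivered the file, or None if every attempt failed
    (the case in which the job would be failed).
    """
    for _ in range(max_local_retries):
        if local_copy(lfn):
            return "local"
    # Phase 1 fallback: stage the file through FAX instead of failing.
    if fax_copy(lfn):
        return "fax"
    return None


if __name__ == "__main__":
    # Local SE is broken, the federation can still serve the file.
    print(stage_in("data.root", lambda f: False, lambda f: True))
```

Passing the two copy operations in as callables keeps the sketch independent of any particular copy tool, and the same shape would apply to Phase 3 with output destinations in place of input sources.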
FAX for Managed Production
• Managed production has well defined workflow
  • PanDA schedules all input/output file transfers through DQ2
  • DQ2 provides dataset level callback when transfers are completed
• FAX can provide alternate transport mechanism
  • Transfers handled by FAX
  • Dataset level callback provided by FAX
  • Dataset discovery/registration handled by DQ2
• File level callback
  • Recent development – use ActiveMQ for file level callbacks
  • On best effort basis for scalability – dataset callbacks still used
  • FAX can use same mechanism
• Work in progress
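One way to picture how best-effort file level callbacks and the dataset level callback fit together is a tracker that accepts both. This is a sketch under stated assumptions: `TransferTracker` and its methods are hypothetical names, and the callbacks are plain method calls rather than real ActiveMQ or DQ2 messages.

```python
from typing import Iterable, Set


class TransferTracker:
    """Tracks the transfer state of one dataset (hypothetical helper).

    File level callbacks may be lost because they are best effort, so
    the dataset level callback remains the authoritative completion
    signal.
    """

    def __init__(self, dataset: str, files: Iterable[str]) -> None:
        self.dataset = dataset
        self.pending: Set[str] = set(files)
        self.done: Set[str] = set()

    def on_file_callback(self, lfn: str) -> None:
        # Best effort: unknown or repeated file names are ignored.
        if lfn in self.pending:
            self.pending.remove(lfn)
            self.done.add(lfn)

    def on_dataset_callback(self) -> None:
        # The whole dataset has arrived, including any files whose
        # file level callback never reached us.
        self.done |= self.pending
        self.pending.clear()

    @property
    def complete(self) -> bool:
        return not self.pending


if __name__ == "__main__":
    tracker = TransferTracker("some.dataset", ["a", "b", "c"])
    tracker.on_file_callback("a")   # file "a" is usable early
    tracker.on_dataset_callback()   # covers "b" and "c"
    print(sorted(tracker.done), tracker.complete)
```

The design point is that file level callbacks only ever make files available earlier; correctness never depends on them arriving.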
FAX for Distributed Analysis
• Most challenging and most rewarding
• Currently, DA jobs are brokered to sites which have input datasets
  • This may limit and slow the execution of DA jobs
• Use FAX to relax constraint on locality of data
• Use cost metric generated with HammerCloud tests
  • Provides ‘typical cost’ of data transfer between two sites
• Brokerage will use ‘nearby’ sites
  • Calculate weight based on usual brokerage criteria (availability of CPU…) plus transfer cost
  • Jobs will be dispatched to site with best weight – not necessarily the site with local data or available CPUs
• Cost metric already available (see Ilija/Rob talks)
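The weight calculation outlined above might look like the following. It is a minimal sketch, not PanDA's brokerage formula: the `Site` fields, the linear weight (free CPU slots minus a penalty times the transfer cost), the `cost_penalty` value, and the rule that a site with no measured cost is not 'nearby' are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Site:
    name: str
    free_cpu_slots: int  # stand-in for the usual brokerage criteria


def broker(sites: List[Site],
           data_site: str,
           cost: Dict[Tuple[str, str], float],
           cost_penalty: float = 10.0) -> Optional[Site]:
    """Pick the site with the best weight (hypothetical formula).

    `cost` maps (source, destination) to a 'typical cost' of moving
    data between the two sites; a missing entry means the pair was
    never measured, so the destination is not treated as 'nearby'.
    """
    best: Optional[Site] = None
    best_weight = float("-inf")
    for site in sites:
        if site.name == data_site:
            transfer_cost = 0.0  # data is already local
        elif (data_site, site.name) in cost:
            transfer_cost = cost[(data_site, site.name)]
        else:
            continue  # no measurement, skip this site
        weight = site.free_cpu_slots - cost_penalty * transfer_cost
        if weight > best_weight:
            best, best_weight = site, weight
    return best


if __name__ == "__main__":
    sites = [Site("A", 5), Site("B", 100), Site("C", 200)]
    cost = {("A", "B"): 2.0, ("A", "C"): 30.0}
    # Weights: A = 5, B = 100 - 20 = 80, C = 200 - 300 = -100.
    print(broker(sites, "A", cost).name)  # prints "B"
```

In this example the job goes to B, which neither holds the data (A) nor has the most free CPUs (C), matching the last brokerage bullet above.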
Implementation Schedule
• FAX for fault tolerance
  • Phase 1 (FAX transfers) – done, test for a few months
  • Phase 2 (FAX Direct Access) – before year end
  • Phase 3 (FAX output) – before year end
• FAX for central production
  • Within 6 months
  • Maybe sooner – ActiveMQ is already under testing
• FAX in brokerage
  • Cost metric already available
  • A few months to set up and test in PanDA database
  • Next year – enable a few sites for high throughput tests
Data Caching
• Local data caching for WAN access
  • Maybe not for PanDA – can federation do it transparently?
  • Various alternatives were discussed in WAN meeting at CERN
• PanDA could keep site level cache
  • Not guaranteed file catalog – best effort list
  • Use FAX to fetch again if file is no longer available
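The site level cache idea can be sketched as a best-effort list that is checked against reality on every use. Everything named here (`SiteCacheList`, `get_file`, and the `exists_locally` and `fax_fetch` callables) is hypothetical; the slide proposes only the concept.

```python
from typing import Callable, Dict, Optional, Set


class SiteCacheList:
    """Best-effort list of files believed to be cached at each site.

    This is deliberately not a guaranteed catalog: entries can be
    stale, so callers must verify before trusting them.
    """

    def __init__(self) -> None:
        self._files: Dict[str, Set[str]] = {}

    def record(self, site: str, lfn: str) -> None:
        self._files.setdefault(site, set()).add(lfn)

    def forget(self, site: str, lfn: str) -> None:
        self._files.get(site, set()).discard(lfn)

    def believed_cached(self, site: str, lfn: str) -> bool:
        return lfn in self._files.get(site, set())


def get_file(cache: SiteCacheList, site: str, lfn: str,
             exists_locally: Callable[[str], bool],
             fax_fetch: Callable[[str], bool]) -> Optional[str]:
    """Return "cache" or "fax" for the source used, or None on failure."""
    if cache.believed_cached(site, lfn):
        if exists_locally(lfn):
            return "cache"
        cache.forget(site, lfn)  # stale entry: the file was removed
    if fax_fetch(lfn):           # fetch again through the federation
        cache.record(site, lfn)
        return "fax"
    return None
```

Because a stale entry costs only one failed local check followed by a FAX fetch, the list never has to be kept strictly consistent with site storage.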
Event Cache
• Long term PanDA goal – event service
  • Granularity of data processing in PanDA – datasets and files
  • But events are really the atomic unit for HEP
  • PanDA event service will change current processing model
• Challenges of event service
  • Scalability – keeping track of hundreds of billions of events
  • Fault tolerance – processing all events without data loss
  • Chaining of data processing
  • Efficient use of WAN vs storage
Conclusion
• Wide array of FAX plans for PanDA
• Schedule depends on availability of effort during LHC run
• Do not foresee technical challenges for short/medium term
• Long term – many open ideas, some quite challenging