
PanDA in a Federated Environment

Concrete plans for federated data access, transfer, fault tolerance, caching, and brokerage in PanDA, covering managed production, distributed analysis, and more speculative ideas.


Presentation Transcript


  1. PanDA in a Federated Environment Kaushik De Univ. of Texas at Arlington CC-IN2P3, Lyon September 13, 2012

  2. Outline • Overview • Concrete plans • Federated data access/stageout for fault tolerance • Federated data transfer for managed production • Federated data access for distributed analysis • Speculative ideas • Data caching • Event caching • Cache aware brokerage Kaushik De

  3. PanDA FAX Status • Last year, I talked about local federations • Direct access through local redirectors is in use by PanDA at SLAC and SouthWest Tier 2 – working well for many years • This year, the emphasis has been on global federations • Global redirectors have been set up and tested in ATLAS • Changes were implemented in the PanDA pilot to enable these global redirectors in the default workflow • But progress has been somewhat slow • PanDA is under continuous use in ATLAS • Development activities not related to LHC data have been minimal Kaushik De

  4. FAX for Fault Tolerance • Phase 1 goal • If an input file cannot be transferred/accessed from the local SE, the PanDA pilot currently fails the job after a few retries • We plan to use federated storage for these (rare) cases • Start with file staging/transfers using FAX • Implemented in a recent release of the pilot, works fine at two test sites • Next step – wider scale testing at production/DA sites • Phase 2 • Once file transfers work well, try FAX Direct Access • Phase 3 • Try FAX for transfer of output files, if the default destination fails • Next few slides from Tadashi/Paul Kaushik De
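To illustrate the Phase 1 idea, the sketch below shows a stage-in that retries the local SE and then falls back to a FAX copy. This is a minimal Python illustration, not the actual pilot code; the redirector URL, paths, and function names are assumptions made for the example.

    # Minimal sketch of the Phase 1 fallback, assuming the xrootd client
    # (xrdcp) is available on the worker node. The redirector URL and
    # paths are examples only, not the pilot's real configuration.
    import subprocess

    MAX_RETRIES = 3
    FAX_REDIRECTOR = "root://glrd.usatlas.org/"   # example global redirector

    def copy(src, dest):
        """Return True if the xrdcp transfer succeeded."""
        return subprocess.call(["xrdcp", "-f", src, dest]) == 0

    def stage_in(local_url, fax_path, dest):
        # Normal path: retry the local SE a few times, as the pilot does today.
        for _ in range(MAX_RETRIES):
            if copy(local_url, dest):
                return True
        # Phase 1 fallback: fetch the same file through the FAX federation
        # instead of failing the job.
        return copy(FAX_REDIRECTOR + fax_path, dest)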

  5. Kaushik De

  6. FAX for Managed Production • Managed production has a well defined workflow • PanDA schedules all input/output file transfers through DQ2 • DQ2 provides a dataset level callback when transfers are completed • FAX can provide an alternate transport mechanism • Transfers handled by FAX • Dataset level callback provided by FAX • Dataset discovery/registration handled by DQ2 • File level callback • Recent development – use ActiveMQ for file level callbacks • On a best-effort basis for scalability – dataset callbacks still used • FAX can use the same mechanism • Work in progress Kaushik De
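A file-level callback consumer along these lines could look like the sketch below, assuming an ActiveMQ broker reached over STOMP with the stomp.py (5.x) client; the broker host, queue name, and message format are made up for illustration and are not the actual ATLAS schema.

    # Sketch of a file-level callback listener over STOMP/ActiveMQ.
    # Broker address, queue name and JSON fields are illustrative only.
    import json
    import stomp   # stomp.py 5.x

    class FileCallbackListener(stomp.ConnectionListener):
        def on_message(self, frame):
            msg = json.loads(frame.body)
            # e.g. {"dataset": "...", "file": "...", "state": "transferred"}
            print("file callback:", msg["file"], msg["state"])

    conn = stomp.Connection([("mq.example.org", 61613)])   # example broker
    conn.set_listener("", FileCallbackListener())
    conn.connect(wait=True)
    conn.subscribe(destination="/queue/fax.file.callbacks", id=1, ack="auto")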

  7. FAX for Distributed Analysis • Most challenging and most rewarding • Currently, DA jobs are brokered to sites which have the input datasets • This may limit and slow the execution of DA jobs • Use FAX to relax the constraint on locality of data • Use a cost metric generated with HammerCloud tests • Provides a ‘typical cost’ of data transfer between two sites • Brokerage will use ‘nearby’ sites • Calculate weight based on the usual brokerage criteria (availability of CPUs…) plus the transfer cost • Jobs will be dispatched to the site with the best weight – not necessarily the site with local data or available CPUs • Cost metric already available (see Ilija/Rob talks) Kaushik De
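As a toy illustration of how the cost metric could enter the brokerage weight, the sketch below combines available CPUs with the cheapest transfer cost from any site holding the data. The names and the exact weighting are assumptions; the real PanDA brokerage formula is more involved.

    # Toy brokerage weight: usual criteria (here just free CPUs) divided by
    # the cheapest HammerCloud-style transfer cost to reach the input data.
    # This is illustrative only, not the actual PanDA brokerage algorithm.

    def broker_weight(site, data_sites, cost, free_cpus):
        """Higher weight = better candidate site for the job."""
        transfer_cost = 0.0 if site in data_sites else min(
            cost[(src, site)] for src in data_sites)
        return free_cpus[site] / (1.0 + transfer_cost)

    def pick_site(candidates, data_sites, cost, free_cpus):
        # The winner is not necessarily a site that already holds the data.
        return max(candidates,
                   key=lambda s: broker_weight(s, data_sites, cost, free_cpus))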

  8. Implementation Schedule • FAX for fault tolerance • Phase 1 (FAX transfers) – done, test for a few months • Phase 2 (FAX Direct Access) – before year end • Phase 3 (FAX output) – before year end • FAX for central production • Within 6 months • Maybe sooner – ActiveMQ is already under testing • FAX in brokerage • Cost metric already available • A few months to set up and test in the PanDA database • Next year – enable a few sites for high throughput tests Kaushik De

  9. Data Caching • Local data caching for WAN access • Maybe not for PanDA – can the federation do it transparently? • Various alternatives were discussed in the WAN meeting at CERN • PanDA could keep a site level cache • Not a guaranteed file catalog – a best-effort list • Use FAX to fetch a file again if it is no longer available Kaushik De
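A best-effort site cache of this kind could be as simple as the sketch below: check a local cache area first and refetch through FAX on a miss. The cache path and redirector URL are placeholders, not a real site configuration.

    # Best-effort site cache: the cache listing is not a guaranteed catalog,
    # so a miss simply triggers a new FAX fetch. Paths are placeholders.
    import os
    import subprocess

    CACHE_DIR = "/localdisk/fax_cache"             # example cache area
    FAX_REDIRECTOR = "root://glrd.usatlas.org/"    # example redirector

    def get_file(fax_path):
        local = os.path.join(CACHE_DIR, os.path.basename(fax_path))
        if os.path.exists(local):                  # best-effort cache hit
            return local
        # Miss (file evicted or list stale): fetch again via the federation.
        if subprocess.call(["xrdcp", "-f", FAX_REDIRECTOR + fax_path, local]) == 0:
            return local
        raise IOError("could not fetch %s via FAX" % fax_path)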

  10. Event Cache • Long term PanDA goal – an event service • Granularity of data processing in PanDA – datasets and files • But events are really the atomic unit for HEP • A PanDA event service will change the current processing model • Challenges of an event service • Scalability – keeping track of hundreds of billions of events • Fault tolerance – processing all events without data loss • Chaining of data processing • Efficient use of WAN vs storage Kaushik De
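One speculative way to keep the bookkeeping tractable at that scale is to track contiguous event ranges per file rather than individual events. The toy sketch below only illustrates that idea; it is not the PanDA event service design.

    # Toy range-based bookkeeping: merging contiguous processed ranges keeps
    # the state tiny even for very large event counts. Illustration only.

    def mark_done(ranges, first, last):
        """Record events [first, last] as processed; return merged ranges."""
        ranges = sorted(ranges + [(first, last)])
        merged = [ranges[0]]
        for lo, hi in ranges[1:]:
            if lo <= merged[-1][1] + 1:            # overlaps or touches
                merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
            else:
                merged.append((lo, hi))
        return merged

    # Example: a million events processed in two chunks collapse to one range.
    state = mark_done([], 1, 500000)
    state = mark_done(state, 500001, 1000000)
    assert state == [(1, 1000000)]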

  11. Kaushik De

  12. Kaushik De

  13. Conclusion • Wide array of FAX plans for PanDA • Schedule depends on availability of effort during LHC run • Do not foresee technical challenges for short/medium term • Long term – many open ideas, some quite challenging Kaushik De
