Distributed Computing: A Status Report
Kaushik De
University of Texas at Arlington
Tier 2/Tier 3 Meeting, SLAC, November 28, 2007
Introduction
• For this talk: distributed computing == Production and Distributed Analysis (DA)
• Question: are we ready for data in ~6 months?
  • What are the biggest challenges to distributed computing?
  • Lessons learned from 2 years of continuous experience with production, distributed analysis, and data management
• In this talk, I will concentrate on the path to readiness
  • No details on future production plans (FDR etc.) or the computing model: covered in Jim's talk
  • See also Rob's and Michael's talks on facilities organization
• I will concentrate on open issues, leading to discussion
  • Discuss process and functional requirements
Some High-Level Issues
• Can we handle MC production and data flow / data processing from ATLAS simultaneously?
  • Past experience from SC and M* raised many issues – work needed
• How much do we have to scale up?
  • Expect the number of users to increase by a factor of ~5?
  • Tier 2 resources to rise by a factor of ~4?
  • Software releases/patches by a factor of 10? Validation?
• Missing functionalities?
  • Integrating Tier 3's into the computing model?
  • User analysis at Tier 2's? Interactive analysis – Proof?
• Recall Jim's point – in the U.S., Tier boundaries are not rigid
  • The BNL Tier 1 is also a Tier 2 (MC production, DA) and a Tier 3 (Proof)
  • Tier 2's do reprocessing (a Tier 1 task) and provide Tier 3 functionalities
Site Hierarchy for Production
(Slide content was a site-hierarchy diagram, not captured in this transcript.)
Capacity Projections
(Slide content was a capacity-projection table/chart, not captured in this transcript.)
Production Facilities Issues
• All Tier 2's are running in steady production mode
  • But Tier 2's need to scale up by a factor of four in ~6 months
  • The Tier 1 needs to scale up by a factor of three
• Support user analysis at Tier 2's
  • Queues need to be set up – no road blocks anticipated
  • Need AOD replication – urgently
  • Interactive analysis – beyond Proof of concept!
• Tier 3 contributions to production
  • Issues emerging with the data transfer model (uberftp)
  • Working on solutions (pending SRM v2.2) – see the sketch after this list
• Tier 3 data distribution for end-user analysis
• We should not forget networking
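As a rough illustration of the Tier 3 pull model under discussion, the sketch below shells out to a GridFTP client to fetch one file from a Tier 2 storage element. The host names and paths are invented, and globus-url-copy is used here in place of uberftp (the client named on the slide) simply because both speak GridFTP; this is a sketch, not the project's transfer tooling.

```python
import subprocess

# Hypothetical endpoints -- real site names, ports, and paths will differ.
SRC = "gsiftp://se.swt2.example.edu:2811/atlas/aod/mc12.005145.AOD.pool.root"
DST = "file:///data/tier3/mc12.005145.AOD.pool.root"

def fetch(src, dst):
    """Pull one file over GridFTP with 4 parallel streams.

    Assumes a valid grid proxy (grid-proxy-init) and the Globus
    clients on PATH; uberftp offers an equivalent client.
    """
    subprocess.check_call(["globus-url-copy", "-p", "4", src, dst])

if __name__ == "__main__":
    fetch(SRC, DST)
```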
Production Software
• PanDA – now a mature product with a stable team
  • But the software is changing daily to support non-U.S. sites: the development team is stretched to its limit
  • Need to rapidly expand/integrate the production/support team
  • The new architecture supports multiple Panda servers
• Pathena – working very well (a minimal invocation is sketched below)
  • Users love it – we had to increase CPUs by a factor of three recently
  • User support is becoming an urgent issue – need to form a new team
  • So far, the shift team and developers provide support – this will not scale
• Challenge – scaling up from one to ten clouds
  • Scaling from one to two (adding Canada) was easy
  • Scaling from two to four (adding UK, France) is going very slowly
  • It took 2 years to achieve smooth U.S. operations, ~6 months for the rest
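For readers who have not used it, a typical pathena submission looks like the sketch below. The dataset names and job-options file are placeholders; the flags shown are the common ones, but consult the pathena documentation for the authoritative list.

```python
import subprocess

# Placeholder dataset names -- substitute your own.
IN_DS  = "mc12.005145.PythiaZmumu.recon.AOD.v12000601"
OUT_DS = "user.KaushikDe.zmumu.test01"

# pathena takes standard Athena job options plus Panda-specific flags;
# it packages the local Athena setup and submits the job to Panda.
subprocess.check_call([
    "pathena", "AnalysisSkeleton_topOptions.py",
    "--inDS", IN_DS,          # input dataset registered in DQ2
    "--outDS", OUT_DS,        # output dataset to be created
    "--nFilesPerJob", "20",   # split: 20 input files per sub-job
])
```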
Data Production/Processing
• ATLAS-managed production (MC, reprocessing)
  • Historically, the U.S. has contributed ~25% of MC production
  • The Tier 1 and Tier 2's provide dedicated queues and storage for this
  • Physics groups directly manage task requests (we will have quotas/allocations per group, arbitrated by the RAC)
  • Detector, calibration, particle ID, test beam, commissioning… groups will also have allocations
• Regional U.S. production
  • Same as ATLAS-managed production – physics groups define tasks needed by U.S. physicists, under a special group name (e.g. ushiggs)
  • Panda manages the quota (currently 20–25% for U.S. production)
  • So far, U.S. physicists have been slow to take advantage of this (less than 25% of the quota allocated by the RAC is being used)
Panda Production Statistics
(Slide content was production-statistics plots, not captured in this transcript; CSC = Computing System Commissioning.)
Panda Central Production
(Slide content was two production plots, not captured in this transcript: one since 1/1/06, one since 10/1/07.)
Data Location Model
• Tier 1 – main repository of data (MC & primary)
  • Stores the complete set of ESD, AOD, AANtuples & TAGs on disk
  • A fraction of RAW, and all U.S.-generated RDO data
• Tier 2 – repository of analysis data
  • Stores the complete set of AOD, AANtuples & TAGs on disk
  • The complete set of ESD data is divided among the 5 Tier 2's (see the sketch after this list)
  • Data distribution to the Tier 1 & Tier 2's is managed
• Tier 3 – unmanaged data matching local interest
  • Data arrives through locally initiated subscriptions
  • Mostly AANtuples, some AODs
  • Tier 3's will be associated with Tier 2 sites?
  • The Tier 3 model is still not fully developed – evolving
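A toy sketch of the placement rule described above: every Tier 2 holds the full AOD set, while ESD datasets are partitioned across the five U.S. Tier 2 centers. The site list and the round-robin assignment are illustrative assumptions, not the actual DDM placement code.

```python
# Toy placement sketch -- illustrative only, not the real DDM logic.
TIER2_SITES = ["AGLT2", "MWT2", "NET2", "SWT2", "WT2"]  # the 5 U.S. Tier 2's

def place(datasets):
    """Return {site: [datasets]} following the slide's rule:
    every Tier 2 holds all AOD; ESD is split among the five sites."""
    placement = {site: [] for site in TIER2_SITES}
    esd = [d for d in datasets if ".ESD." in d]
    aod = [d for d in datasets if ".AOD." in d]
    for site in TIER2_SITES:
        placement[site].extend(aod)               # complete AOD set everywhere
    for i, ds in enumerate(sorted(esd)):
        placement[TIER2_SITES[i % 5]].append(ds)  # ESD divided round-robin
    return placement

if __name__ == "__main__":
    demo = ["mc12.005145.recon.ESD.v1", "mc12.005145.recon.AOD.v1",
            "mc12.005200.recon.ESD.v1"]
    for site, held in place(demo).items():
        print(site, held)
```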
Storage Management
• Tier 1 storage systems
  • Disk storage projected to grow by a factor of three in ~6 months
  • Additional funding also expected from the management reserve
• During the past few months many new issues have emerged
• Disk/tape dCache pools – the default at BNL
  • Allow unlimited space – write pools automatically push data to tape
  • Do not work well for small files (log files) or volatile user output
    • Solution: a new disk-only pool was set up recently
    • Remaining issue: we need tools to manage space (it is no longer infinite)
  • Do not work well for the computing model – AOD, RDO, Evgen, DPD etc. need to stay on disk (a large fraction of these got pushed to tape)
    • Solution: software to manage 'pinning' will be rolled out soon (a toy policy is sketched below)
• We have not tackled Tier 2 storage issues yet!
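The pinning software mentioned above was not shown in the talk; the fragment below is only a toy illustration of the policy it would implement – keep analysis formats resident on disk, let logs and volatile user output migrate to tape. The type lists and function names are assumptions.

```python
# Toy pin policy -- an illustrative assumption, not the actual BNL tool.
DISK_RESIDENT = {"AOD", "RDO", "Evgen", "DPD"}  # must stay on dCache disk
TAPE_OK = {"log", "user"}                       # may migrate to tape

def should_pin(data_type):
    """True if files of this type should be pinned on the disk pool."""
    return data_type in DISK_RESIDENT

def classify(files):
    """Split a {filename: data_type} map into pin and migrate lists."""
    pin = [f for f, t in files.items() if should_pin(t)]
    migrate = [f for f, t in files.items() if t in TAPE_OK]
    return pin, migrate

if __name__ == "__main__":
    sample = {"mc12.AOD.root": "AOD", "job123.log.tgz": "log",
              "mc12.RDO.root": "RDO", "user.out.root": "user"}
    print(classify(sample))
```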
Data Management
• DQ2 is on the critical path
• Many performance issues have been identified through operations
  • Central server load issues – still a problem after the Oracle migration
  • Fetcher performance issues – incomplete datasets, QoS
  • Essential features needed soon: hierarchical (container) datasets, a lost-file flag, tape handling… (the container idea is sketched after this list)
• We expect rapid improvements via the new ADC organization
  • Expect higher priority for production and DA needs – since Panda was chosen for ATLAS-wide use
• Panda does not use DQ2 for input file transfers – it uses PandaMover
  • Need to integrate PandaMover with DQ2
• Test and implement LFC in the U.S.
  • Support problems with the LRC used in the U.S. – it is diverging from DQ2
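The "hierarchical (container) dataset" feature requested above can be pictured as a dataset whose members are other datasets rather than files. The sketch below is a minimal data-structure illustration under that assumption; it is not DQ2 code, and the class and dataset names are invented.

```python
# Minimal illustration of a container dataset -- not DQ2 itself.
class Dataset:
    def __init__(self, name, files):
        self.name = name
        self.files = list(files)

class ContainerDataset:
    """A named collection of constituent datasets (one level deep)."""
    def __init__(self, name, datasets):
        self.name = name          # by convention container names end in "/"
        self.datasets = list(datasets)

    def all_files(self):
        """Resolve the container to the union of its constituents' files."""
        return [f for ds in self.datasets for f in ds.files]

if __name__ == "__main__":
    d1 = Dataset("mc12.005145.AOD.v1_tid001", ["a.root", "b.root"])
    d2 = Dataset("mc12.005145.AOD.v1_tid002", ["c.root"])
    cont = ContainerDataset("mc12.005145.AOD.v1/", [d1, d2])
    print(cont.all_files())       # ['a.root', 'b.root', 'c.root']
```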
Distributed Analysis Challenges
• DA usage is rising rapidly
  • It works very well – except for data availability issues
  • But it is only available at BNL
  • We increased the CPU allocation from ~200 to ~700 recently
  • Still not sufficient – e.g. ~30k jobs are waiting to run right now!
• We need to bring the Tier 2's rapidly into DA activities
  • Show stopper: availability of AOD files
  • Also need: dedicated analysis queues – moving along well
• Interactive analysis
  • Primarily expected at Tier 3's
  • BNL and Wisconsin tests with Proof are encouraging (a minimal session is sketched after this list)
  • Need to scale up and deploy rapidly
  • Many issues to understand: scaling, multi-user behavior, data movement
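For orientation, a minimal Proof session looks like the PyROOT sketch below. The master host, tree name, file URLs, and selector are placeholders; this assumes a working ROOT/PyROOT installation and a running PROOF cluster.

```python
import ROOT  # PyROOT; assumes ROOT built with PROOF support

# Placeholder master URL -- substitute your cluster's master node.
proof = ROOT.TProof.Open("proof-master.example.edu")

chain = ROOT.TChain("CollectionTree")  # hypothetical tree name
chain.Add("root://se.example.edu//atlas/aod/file1.root")
chain.Add("root://se.example.edu//atlas/aod/file2.root")

chain.SetProof()                 # route Process() through the PROOF cluster
chain.Process("MySelector.C+")   # compile and run a TSelector on the workers

proof.Print()                    # summarize the session
```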
Process and Requirements
• Well organized in the U.S.: both development and operations
  • Tier 1 and Tier 2 requirements are well understood
  • Functional requirements are evolving through production operations and the facilities integration program
• Need to adapt quickly as we scale up rapidly
  • Some reorganization of operations will be needed
  • Need a user support team
• Often, issues are beyond U.S. control – software, trf, DQ2 etc.: we need help from the new ADC organization
Summary
• Distributed computing is working well in the U.S.
  • But many challenges remain to be overcome in a short time
• Expanding Panda ATLAS-wide is a big task – but it will help ATLAS in the long run
• DDM and storage issues are on the critical path
• Tier 2's need to expand their roles beyond MC production
• Will everything be ready in ~6 months? Still an open question
• As always, new people and new ideas are welcome
Production – Live!
http://panda.atlascomp.org/?dash=prod