Distributed Computing: A Status Report
Kaushik De
University of Texas at Arlington
Tier 2/Tier 3 Meeting, SLAC, November 28, 2007
Introduction
• For this talk: distributed computing == Production and Distributed Analysis (DA)
• Question: are we ready for data in ~6 months?
  • What are the biggest challenges to distributed computing?
  • Lessons learned from 2 years of continuous experience with production, distributed analysis, and data management
• In this talk, I will concentrate on the path to readiness
  • No details on future production plans (FDR etc.) or the computing model: covered in Jim's talk
  • See also Rob's and Michael's talks on facilities organization
• I will concentrate on open issues, leading to discussion
  • Discuss process and functional requirements
Some High-Level Issues
• Can we handle MC production and data flow / data processing from ATLAS simultaneously?
  • Past experience from SC and M* raised many issues – work needed
• How much do we have to scale up?
  • Expect the number of users to increase by a factor of ~5?
  • Tier 2 resources to rise by a factor of ~4?
  • Software releases/patches by a factor of 10? Validation?
• Missing functionalities?
  • Integrating Tier 3's into the computing model?
  • User analysis at Tier 2's? Interactive analysis – Proof?
• Recall Jim's point – in the U.S., Tier boundaries are not rigid
  • The BNL Tier 1 is also a Tier 2 (MC production, DA) and a Tier 3 (Proof)
  • Tier 2's do reprocessing (a Tier 1 task) and provide Tier 3 functionalities
Site Hierarchy for Production
(Slide content was a site-hierarchy diagram, not captured in this transcript.)
Capacity Projections
(Slide content was a capacity-projection table/chart, not captured in this transcript.)
Production Facilities Issues
• All Tier 2's are running in steady production mode
  • But Tier 2's need to scale up by a factor of four in ~6 months
  • The Tier 1 needs to scale up by a factor of three
• Support user analysis at Tier 2's
  • Queues need to be set up – no road blocks anticipated
  • Need AOD replication – urgently
  • Interactive analysis – beyond Proof of concept!
• Tier 3 contributions to production
  • Issues emerging with the data transfer model (uberftp)
  • Working on solutions (pending SRM v2.2) – see the sketch after this list
• Tier 3 data distribution for end-user analysis
• We should not forget networking
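As a rough illustration of the Tier 3 pull model under discussion, the sketch below shells out to a GridFTP client to fetch one file from a Tier 2 storage element. The host names and paths are invented, and globus-url-copy is used here in place of uberftp (the client named on the slide) simply because both speak GridFTP; this is a sketch, not the project's transfer tooling.

```python
import subprocess

# Hypothetical endpoints -- real site names, ports, and paths will differ.
SRC = "gsiftp://se.swt2.example.edu:2811/atlas/aod/mc12.005145.AOD.pool.root"
DST = "file:///data/tier3/mc12.005145.AOD.pool.root"

def fetch(src, dst):
    """Pull one file over GridFTP with 4 parallel streams.

    Assumes a valid grid proxy (grid-proxy-init) and the Globus
    clients on PATH; uberftp offers an equivalent client.
    """
    subprocess.check_call(["globus-url-copy", "-p", "4", src, dst])

if __name__ == "__main__":
    fetch(SRC, DST)
```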
Production Software
• PanDA – now a mature product with a stable team
  • But the software is changing daily to support non-U.S. sites: the development team is stretched to its limit
  • Need to rapidly expand/integrate the production/support team
  • The new architecture supports multiple Panda servers
• Pathena – working very well (a minimal invocation is sketched below)
  • Users love it – we had to increase CPUs by a factor of three recently
  • User support is becoming an urgent issue – need to form a new team
  • So far, the shift team and developers provide support – this will not scale
• Challenge – scaling up from one to ten clouds
  • Scaling from one to two (adding Canada) was easy
  • Scaling from two to four (adding UK, France) is going very slowly
  • It took 2 years to achieve smooth U.S. operations, ~6 months for the rest
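For readers who have not used it, a typical pathena submission looks like the sketch below. The dataset names and job-options file are placeholders; the flags shown are the common ones, but consult the pathena documentation for the authoritative list.

```python
import subprocess

# Placeholder dataset names -- substitute your own.
IN_DS  = "mc12.005145.PythiaZmumu.recon.AOD.v12000601"
OUT_DS = "user.KaushikDe.zmumu.test01"

# pathena takes standard Athena job options plus Panda-specific flags;
# it packages the local Athena setup and submits the job to Panda.
subprocess.check_call([
    "pathena", "AnalysisSkeleton_topOptions.py",
    "--inDS", IN_DS,          # input dataset registered in DQ2
    "--outDS", OUT_DS,        # output dataset to be created
    "--nFilesPerJob", "20",   # split: 20 input files per sub-job
])
```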
Data Production/Processing
• ATLAS-managed production (MC, reprocessing)
  • Historically, the U.S. has contributed ~25% of MC production
  • The Tier 1 and Tier 2's provide dedicated queues and storage for this
  • Physics groups directly manage task requests (we will have quotas/allocations per group, arbitrated by the RAC)
  • Detector, calibration, particle ID, test beam, commissioning… groups will also have allocations
• Regional U.S. production
  • Same as ATLAS-managed production – physics groups define tasks needed by U.S. physicists, under a special group name (e.g. ushiggs)
  • Panda manages the quota (currently 20–25% for U.S. production)
  • So far, U.S. physicists have been slow to take advantage of this (less than 25% of the quota allocated by the RAC is being used)
Panda Production Statistics
(Slide content was production-statistics plots, not captured in this transcript; CSC = Computing System Commissioning.)
Panda Central Production
(Slide content was two production plots, not captured in this transcript: one since 1/1/06, one since 10/1/07.)
Data Location Model
• Tier 1 – main repository of data (MC & primary)
  • Stores the complete set of ESD, AOD, AANtuples & TAGs on disk
  • A fraction of RAW, and all U.S.-generated RDO data
• Tier 2 – repository of analysis data
  • Stores the complete set of AOD, AANtuples & TAGs on disk
  • The complete set of ESD data is divided among the 5 Tier 2's (see the sketch after this list)
  • Data distribution to the Tier 1 & Tier 2's is managed
• Tier 3 – unmanaged data matching local interest
  • Data arrives through locally initiated subscriptions
  • Mostly AANtuples, some AODs
  • Tier 3's will be associated with Tier 2 sites?
  • The Tier 3 model is still not fully developed – evolving
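A toy sketch of the placement rule described above: every Tier 2 holds the full AOD set, while ESD datasets are partitioned across the five U.S. Tier 2 centers. The site list and the round-robin assignment are illustrative assumptions, not the actual DDM placement code.

```python
# Toy placement sketch -- illustrative only, not the real DDM logic.
TIER2_SITES = ["AGLT2", "MWT2", "NET2", "SWT2", "WT2"]  # the 5 U.S. Tier 2's

def place(datasets):
    """Return {site: [datasets]} following the slide's rule:
    every Tier 2 holds all AOD; ESD is split among the five sites."""
    placement = {site: [] for site in TIER2_SITES}
    esd = [d for d in datasets if ".ESD." in d]
    aod = [d for d in datasets if ".AOD." in d]
    for site in TIER2_SITES:
        placement[site].extend(aod)               # complete AOD set everywhere
    for i, ds in enumerate(sorted(esd)):
        placement[TIER2_SITES[i % 5]].append(ds)  # ESD divided round-robin
    return placement

if __name__ == "__main__":
    demo = ["mc12.005145.recon.ESD.v1", "mc12.005145.recon.AOD.v1",
            "mc12.005200.recon.ESD.v1"]
    for site, held in place(demo).items():
        print(site, held)
```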
Storage Management
• Tier 1 storage systems
  • Disk storage projected to grow by a factor of three in ~6 months
  • Additional funding also expected from the management reserve
• During the past few months many new issues have emerged
• Disk/tape dCache pools – the default at BNL
  • Allow unlimited space – write pools automatically push data to tape
  • Do not work well for small files (log files) or volatile user output
    • Solution: a new disk-only pool was set up recently
    • Remaining issue: we need tools to manage space (it is no longer infinite)
  • Do not work well for the computing model – AOD, RDO, Evgen, DPD etc. need to stay on disk (a large fraction of these got pushed to tape)
    • Solution: software to manage 'pinning' will be rolled out soon (a toy policy is sketched below)
• We have not tackled Tier 2 storage issues yet!
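The pinning software mentioned above was not shown in the talk; the fragment below is only a toy illustration of the policy it would implement – keep analysis formats resident on disk, let logs and volatile user output migrate to tape. The type lists and function names are assumptions.

```python
# Toy pin policy -- an illustrative assumption, not the actual BNL tool.
DISK_RESIDENT = {"AOD", "RDO", "Evgen", "DPD"}  # must stay on dCache disk
TAPE_OK = {"log", "user"}                       # may migrate to tape

def should_pin(data_type):
    """True if files of this type should be pinned on the disk pool."""
    return data_type in DISK_RESIDENT

def classify(files):
    """Split a {filename: data_type} map into pin and migrate lists."""
    pin = [f for f, t in files.items() if should_pin(t)]
    migrate = [f for f, t in files.items() if t in TAPE_OK]
    return pin, migrate

if __name__ == "__main__":
    sample = {"mc12.AOD.root": "AOD", "job123.log.tgz": "log",
              "mc12.RDO.root": "RDO", "user.out.root": "user"}
    print(classify(sample))
```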
Data Management
• DQ2 is on the critical path
• Many performance issues have been identified through operations
  • Central server load issues – still a problem after the Oracle migration
  • Fetcher performance issues – incomplete datasets, QoS
  • Essential features needed soon: hierarchical (container) datasets, a lost-file flag, tape handling… (the container idea is sketched after this list)
• We expect rapid improvements via the new ADC organization
  • Expect higher priority for production and DA needs – since Panda was chosen for ATLAS-wide use
• Panda does not use DQ2 for input file transfers – it uses PandaMover
  • Need to integrate PandaMover with DQ2
• Test and implement LFC in the U.S.
  • Support problems with the LRC used in the U.S. – it is diverging from DQ2
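The "hierarchical (container) dataset" feature requested above can be pictured as a dataset whose members are other datasets rather than files. The sketch below is a minimal data-structure illustration under that assumption; it is not DQ2 code, and the class and dataset names are invented.

```python
# Minimal illustration of a container dataset -- not DQ2 itself.
class Dataset:
    def __init__(self, name, files):
        self.name = name
        self.files = list(files)

class ContainerDataset:
    """A named collection of constituent datasets (one level deep)."""
    def __init__(self, name, datasets):
        self.name = name          # by convention container names end in "/"
        self.datasets = list(datasets)

    def all_files(self):
        """Resolve the container to the union of its constituents' files."""
        return [f for ds in self.datasets for f in ds.files]

if __name__ == "__main__":
    d1 = Dataset("mc12.005145.AOD.v1_tid001", ["a.root", "b.root"])
    d2 = Dataset("mc12.005145.AOD.v1_tid002", ["c.root"])
    cont = ContainerDataset("mc12.005145.AOD.v1/", [d1, d2])
    print(cont.all_files())       # ['a.root', 'b.root', 'c.root']
```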
Distributed Analysis Challenges
• DA usage is rising rapidly
  • It works very well – except for data availability issues
  • But it is only available at BNL
  • We increased the CPU allocation from ~200 to ~700 recently
  • Still not sufficient – e.g. ~30k jobs are waiting to run right now!
• We need to bring the Tier 2's rapidly into DA activities
  • Show stopper: availability of AOD files
  • Also need: dedicated analysis queues – moving along well
• Interactive analysis
  • Primarily expected at Tier 3's
  • BNL and Wisconsin tests with Proof are encouraging (a minimal session is sketched after this list)
  • Need to scale up and deploy rapidly
  • Many issues to understand: scaling, multi-user behavior, data movement
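For orientation, a minimal Proof session looks like the PyROOT sketch below. The master host, tree name, file URLs, and selector are placeholders; this assumes a working ROOT/PyROOT installation and a running PROOF cluster.

```python
import ROOT  # PyROOT; assumes ROOT built with PROOF support

# Placeholder master URL -- substitute your cluster's master node.
proof = ROOT.TProof.Open("proof-master.example.edu")

chain = ROOT.TChain("CollectionTree")  # hypothetical tree name
chain.Add("root://se.example.edu//atlas/aod/file1.root")
chain.Add("root://se.example.edu//atlas/aod/file2.root")

chain.SetProof()                 # route Process() through the PROOF cluster
chain.Process("MySelector.C+")   # compile and run a TSelector on the workers

proof.Print()                    # summarize the session
```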
Process and Requirements
• Well organized in the U.S.: both development and operations
  • Tier 1 and Tier 2 requirements are well understood
  • Functional requirements are evolving through production operations and the facilities integration program
• Need to adapt quickly as we scale up rapidly
  • Some reorganization of operations will be needed
  • Need a user support team
• Often, issues are beyond U.S. control – software, trf, DQ2 etc.: we need help from the new ADC organization
Summary
• Distributed computing is working well in the U.S.
  • But many challenges remain to be overcome in a short time
• Expanding Panda ATLAS-wide is a big task – but it will help ATLAS in the long run
• DDM and storage issues are on the critical path
• Tier 2's need to expand their roles beyond MC production
• Will everything be ready in ~6 months? Still an open question
• As always, new people and new ideas are welcome
Production – Live!
http://panda.atlascomp.org/?dash=prod