ATLAS DC2 Production …on Grid3

  1. ATLAS DC2 Production …on Grid3
     M. Mambelli, University of Chicago, for the US ATLAS DC2 team
     September 28, 2004, CHEP04

  2. ATLAS Data Challenges
     • Purpose
       • Validate the LHC computing model
       • Develop distributed production & analysis tools
       • Provide large datasets for physics working groups
     • Schedule
       • DC1 (2002-2003): full software chain
       • DC2 (2004): automatic grid production system
       • DC3 (2006): drive final deployments for startup

  3. ATLAS DC2 Production
     • Phase I: Simulation (Jul-Sep 04)
       • generation, simulation & pileup
       • produced datasets stored on Tier1 centers, then CERN (Tier0)
       • scale: ~10M events, 30 TB
     • Phase II: “Tier0 Test” @ CERN (1/10 scale)
       • produce ESD, AOD (reconstruction)
       • stream to Tier1 centers
     • Phase III: Distributed analysis (Oct-Dec 04)
       • access to event and non-event data from anywhere in the world, both in organized and chaotic ways (cf. D. Adams, #115)

  4. ATLAS Production System
     • Components
       • Production database: ATLAS job definition and status
       • Supervisor (all Grids): Windmill (L. Goossens, #501)
         • job distribution and verification system
       • Data Management: Don Quijote (M. Branco, #142)
         • provides an ATLAS layer above the Grid replica systems
       • Grid executors
         • LCG: Lexor (D. Rebatto, #364)
         • NorduGrid: Dulcinea (O. Smirnova, #499)
         • Grid3: Capone (this talk)
     (the supervisor/executor exchange is sketched below)
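
To make the supervisor/executor split concrete, here is a minimal Python sketch of one supervisor cycle. The class, method and message names (ToyExecutor, num_jobs_wanted, execute_jobs, get_status) are illustrative assumptions; the real Windmill/Capone exchange goes over SOAP and Jabber with its own message schema.

```python
# Hypothetical sketch of the supervisor/executor exchange described above.
# The verb names and payloads are assumptions, not the actual Windmill schema.

class ToyExecutor:
    """Minimal stand-in for a grid executor such as Capone."""

    def __init__(self, free_slots):
        self.free_slots = free_slots
        self.jobs = {}                      # job_id -> state

    def num_jobs_wanted(self):
        # Supervisor asks how many new jobs this executor can accept.
        return self.free_slots - len(self.jobs)

    def execute_jobs(self, job_definitions):
        # Supervisor hands over job definitions pulled from the production DB.
        for job in job_definitions:
            self.jobs[job["id"]] = "submitted"
        return [job["id"] for job in job_definitions]

    def get_status(self):
        # Supervisor polls job states and records them in the production DB.
        return dict(self.jobs)


# One supervisor cycle (illustrative only):
executor = ToyExecutor(free_slots=2)
wanted = executor.num_jobs_wanted()
executor.execute_jobs([{"id": f"dc2.{i}"} for i in range(wanted)])
print(executor.get_status())   # {'dc2.0': 'submitted', 'dc2.1': 'submitted'}
```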

  5. ATLAS Global Architecture
     [diagram: the Windmill supervisor instances talk over SOAP/Jabber to the Grid executors (Lexor on LCG, Dulcinea on NorduGrid, Capone on Grid3 (this talk), and a legacy LSF executor); the production database prodDB at CERN and the AMI metadata catalog feed the supervisors; Don Quijote ("DQ") data management sits above the per-grid RLS catalogs]

  6. Capone and Grid3 Requirements
     • Interface to Grid3 (GriPhyN VDT based)
     • Manage all steps in the job life cycle: prepare, submit, monitor, output & register (see the sketch below)
     • Manage workload and data placement
     • Process messages from the Windmill supervisor
     • Provide useful logging information to the user
     • Communicate executor and job state information to Windmill (ProdDB)
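
A minimal sketch of the life-cycle bookkeeping this implies, as a simple state machine. The state names and transitions are assumptions chosen to mirror the steps listed above; Capone's real state machine is richer.

```python
# Toy job life cycle: prepare, submit, monitor, stage output, register.
TRANSITIONS = {
    "received": "prepared",        # translate the ATLAS job definition
    "prepared": "submitted",       # hand the DAG to Condor-G / DAGMan
    "submitted": "running",        # remote job started on a Grid3 site
    "running": "output_staged",    # stage-out to the destination SE
    "output_staged": "registered", # record LFN/PFN + metadata in RLS
    "registered": "done",          # report success back to Windmill
}

def advance(state, ok=True):
    """Move a job one step forward, or to 'failed' on any error."""
    if not ok:
        return "failed"
    return TRANSITIONS.get(state, "done")

# Walk one job through the whole chain:
state = "received"
while state not in ("done", "failed"):
    state = advance(state)
print(state)   # done
```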

  7. Capone Execution Environment
     • GCE Server side
       • ATLAS releases and transformations
       • Pacman installation, done dynamically by grid-based jobs (see the sketch below)
       • Execution sandbox
       • Chimera kickstart executable
       • Transformation wrapper scripts
       • MDS info providers (required site-specific attributes)
     • GCE Client side (web service)
       • Capone
       • Chimera/Pegasus, Condor-G (from VDT)
       • Globus RLS and DQ clients
     “GCE” = Grid Component Environment
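
A hedged sketch of the "install the release dynamically" idea: a grid job checks whether the needed ATLAS release is present in the site's shared area and, if not, pulls it with Pacman before running the transformation. The cache name, package name, directory convention and environment variable are hypothetical; the real DC2 installation jobs differ in detail.

```python
import os
import subprocess

RELEASE = "8.0.5"                                               # hypothetical release tag
INSTALL_AREA = os.environ.get("ATLAS_SW_AREA", "/share/atlas")  # assumed convention

def ensure_release(release=RELEASE, area=INSTALL_AREA):
    """Install the ATLAS release with Pacman if it is not already there."""
    release_dir = os.path.join(area, release)
    if os.path.isdir(release_dir):
        return release_dir
    os.makedirs(release_dir, exist_ok=True)
    # Illustrative pacman invocation; the real cache/package names differ.
    subprocess.run(["pacman", "-get", f"ATLAS:{release}"],
                   cwd=release_dir, check=True)
    return release_dir
```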

  8. Capone Architecture
     [diagram: Windmill and ADA talk to Capone over the message protocols (Web Service, Jabber); a translation layer feeds the CPE, which drives the Grid, Stub and DonQuijote processes]
     • Message interface
       • Web Service
       • Jabber
     • Translation layer
       • Windmill schema
     • CPE (Process Engine)
     • Processes
       • Grid3: GCE interface
       • Stub: local shell testing
       • DonQuijote (future)
     (the message dispatch is sketched below)
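
An illustrative sketch of the translation-layer idea: a message arriving on the web-service or Jabber interface is translated from the Windmill schema into a call on the process engine, which routes it to a concrete process (Grid3 or the local Stub). The class, verb and method names here are assumptions, not the real Windmill schema.

```python
class StubProcess:
    """Local stand-in corresponding to Capone's 'Stub' process for testing."""
    def __init__(self):
        self.jobs = {}
    def submit(self, job):
        self.jobs[job["id"]] = "submitted"
        return job["id"]
    def status(self, job_id):
        return self.jobs.get(job_id, "unknown")


class ProcessEngine:
    """Stand-in for the CPE: routes work to a concrete process (Grid3, Stub...)."""
    def __init__(self, process):
        self.process = process
    def submit(self, job):
        return self.process.submit(job)
    def status(self, job_id):
        return self.process.status(job_id)


def handle_message(engine, message):
    """Translate one Windmill-style message into a CPE call."""
    verb = message["verb"]
    if verb == "executeJobs":
        return [engine.submit(j) for j in message["jobs"]]
    if verb == "getStatus":
        return {j: engine.status(j) for j in message["job_ids"]}
    raise ValueError(f"unknown verb: {verb}")


engine = ProcessEngine(StubProcess())
print(handle_message(engine, {"verb": "executeJobs", "jobs": [{"id": "dc2.1"}]}))
```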

  9. Capone System Elements
     • GriPhyN Virtual Data System (VDS)
       • Transformation: a workflow accepting input data (datasets) and parameters and producing output data (datasets); simple (a single executable) or complex (a DAG)
       • Derivation: a transformation whose formal parameters have been bound to actual parameters
       • Directed Acyclic Graph (DAG)
         • Abstract DAG (DAX), created by Chimera, with no reference to concrete elements of the Grid
         • Concrete DAG (cDAG), created by Pegasus, where CE, SE and PFN have been assigned
     • Globus, RLS, Condor
     (the abstract vs. concrete DAG distinction is illustrated below)
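
A toy illustration (not VDS code) of the abstract-vs-concrete DAG distinction: the abstract DAG names transformations and logical files only, and "planning" binds each node to a compute element (CE), storage element (SE) and physical file names, which is what Pegasus does for real. The site names, file names and lookup tables are made up for the example.

```python
abstract_dag = [
    # (node, transformation, logical inputs, logical outputs)
    ("n1", "atlas.simul",  ["evgen.root"], ["hits.root"]),
    ("n2", "atlas.pileup", ["hits.root"],  ["digits.root"]),
]

replica_catalog = {   # logical file name -> physical file name (toy RLS)
    "evgen.root": "gsiftp://se.example.edu/dc2/evgen.root",
}

def plan(dag, ce="grid3-ce.example.edu", se="gsiftp://se.example.edu/dc2"):
    """Bind abstract nodes to a CE/SE: a caricature of Pegasus planning."""
    concrete = []
    for node, transform, ins, outs in dag:
        pfns_in = [replica_catalog.get(f, f"{se}/{f}") for f in ins]
        pfns_out = [f"{se}/{f}" for f in outs]
        concrete.append({"node": node, "ce": ce, "exe": transform,
                         "inputs": pfns_in, "outputs": pfns_out})
        # outputs of this step become resolvable inputs for later steps
        for lfn, pfn in zip(outs, pfns_out):
            replica_catalog[lfn] = pfn
    return concrete

for step in plan(abstract_dag):
    print(step["node"], "->", step["ce"])
```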

  10. Capone Grid Interactions
      [diagram: Capone on the submit host drives Chimera (VDC), Pegasus and Condor-G (schedd, GridManager), exchanges information with Windmill, the ProdDB, DonQuijote and RLS, and reaches the site gatekeeper, CE, SE and worker nodes via gsiftp; monitoring is provided by MDS, GridCat and MonALISA]

  11. A job in Capone (1, submission)
      [same Capone grid-interaction diagram as the previous slide]
      • Reception
        • job received from Windmill
      • Translation
        • un-marshalling, ATLAS transformation
      • DAX generation
        • Chimera generates the abstract DAG
      • Input file retrieval from the RLS catalog
        • check RLS for input LFNs (retrieval of GUID, PFN)
      • Scheduling: CE and SE are chosen
      • Concrete DAG generation and submission
        • Pegasus creates the Condor submit files
        • DAGMan is invoked to manage the remote steps
      (the overall submission sequence is sketched below)
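
A hedged end-to-end sketch of the submission steps listed above. Each helper stands in for a real component (Chimera, the RLS client, Pegasus, Condor-G/DAGMan); the function names, message fields and return values are illustrative assumptions, not Capone's actual interfaces.

```python
def translate(windmill_message):
    """Un-marshal the Windmill job and pick the ATLAS transformation."""
    return {"transform": windmill_message["transformation"],
            "inputs": windmill_message["input_lfns"]}

def generate_dax(job):
    """Chimera step: build the abstract DAG (no grid-specific bindings)."""
    return {"nodes": [job["transform"]], "inputs": job["inputs"]}

def lookup_rls(lfns):
    """RLS step: map each logical file name to (GUID, PFN); toy catalogue."""
    return {lfn: ("guid-" + lfn, "gsiftp://some.se/" + lfn) for lfn in lfns}

def schedule(dax):
    """Choose a compute element and storage element for this DAG."""
    return {"ce": "grid3-ce.example.edu", "se": "gsiftp://se.example.edu/dc2"}

def plan_and_submit(dax, placement, replicas):
    """Pegasus + DAGMan step: write submit files, hand the DAG to Condor-G."""
    cdag = {"dax": dax, "site": placement, "replicas": replicas}
    return "condor-dag-%d" % (abs(hash(str(cdag))) % 10000)   # fake cluster id

message = {"transformation": "dc2.simul", "input_lfns": ["evgen.root"]}
job = translate(message)
dax = generate_dax(job)
dag_id = plan_and_submit(dax, schedule(dax), lookup_rls(job["inputs"]))
print("submitted", dag_id)
```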

  12. A job in Capone (2, execution)
      [same Capone grid-interaction diagram as slide 10]
      • Remote job running / status checking
        • stage-in of input files, creation of the POOL FileCatalog
        • Athena (ATLAS code) execution
      • Remote execution check
        • verification of output files and exit codes
        • recovery of metadata (GUID, MD5 sum, exe attributes)
      • Stage out: transfer from the CE site to the destination SE
      • Output registration
        • registration of the output LFN/PFN and metadata in RLS
      • Finish
        • once the job has completed successfully, Capone tells Windmill that the job is ready for validation
      • The job status is sent to Windmill throughout the execution
      • Windmill/DQ validate & register the output in the ProdDB
      (an output-verification sketch follows)
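
A sketch of the post-job check described above: confirm the exit code, confirm that each expected output file exists, and recover metadata (size, MD5 sum) for later registration in RLS. The function name and metadata layout are assumptions for illustration.

```python
import hashlib
import os

def verify_outputs(exit_code, expected_files):
    """Return per-file metadata if the job looks good, else raise."""
    if exit_code != 0:
        raise RuntimeError(f"transformation exited with code {exit_code}")
    metadata = {}
    for path in expected_files:
        if not os.path.exists(path):
            raise RuntimeError(f"missing output file: {path}")
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        metadata[path] = {"size": os.path.getsize(path), "md5sum": digest}
    return metadata   # next: stage out, then register LFN/PFN + metadata in RLS
```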

  13. Performance Summary (9/20/04)
      • Several physics and calibration samples produced
      • 56K job attempts at the Windmill level
        • 9K of these aborted before grid submission, mostly because RLS or the selected CE was down
      • “Full” success rate: 66%
      • Average success rate after submission: 70%
        • includes subsequent problems at the submit host
        • includes errors from development
      • 60 CPU-years consumed since July
      • 8 TB produced
      (the 66% figure is cross-checked against the raw counts below)
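
A minimal cross-check of the quoted "full" success rate, assuming it is defined as validated jobs over validated-plus-failed jobs and using the raw counts from the failure-statistics slide.

```python
validated = 37713   # from slide 18
failed = 19303      # from slide 18
print(f"{validated / (validated + failed):.0%}")   # 66%, matching the quoted figure
```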

  14. ATLAS DC2 CPU usage (G. Poulard, 9/21/04)
      • Total ATLAS DC2: ~1470 kSI2k·months
      • ~100,000 jobs
      • ~7.94 million events
      • ~30 TB

  15. ATLAS DC2 ramp-up
      [plot: ATLAS DC2 CPU-days per day, ramping up from mid July to September 10]

  16. Job Distribution on Grid3 (J. Shank, 9/21/04)
      [chart: distribution of DC2 jobs across Grid3 sites]

  17. Site Statistics (9/20/04)
      • Average success rate by site: 70%

  18. Capone & Grid3 Failure Statistics 9/20/04 • Total jobs (validated) 37713 • Jobs failed 19303 • Submission 472 • Execution 392 • Post-job check 1147 • Stage out 8037 • RLS registration 989 • Capone host interruptions 2725 • Capone succeed, Windmill fail 57 • Other 5139

  19. Production lessons
      • Single points of failure
        • production database
        • RLS, DQ, VDC and Jabber servers
        • one local network domain
        • distributed RLS
        • system expertise (people)
      • Fragmented production software
      • Fragmented operations (defining/fixing jobs in the production database)
      • Client (Capone submit) hosts
        • load and memory requirements for job management
        • load caused by job state checking (interaction with Condor-G)
        • many processes
        • no client-host persistency
        • need a local database for job recovery → next phase of development (sketched below)
      • DOEGrids certificate or certificate revocation list expiration
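
The "local database for job recovery" item, sketched with SQLite: if the submit host is interrupted, job state survives the restart. The table and column names are hypothetical; the actual follow-up design may differ.

```python
import sqlite3

def open_state_db(path="capone_state.db"):
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS jobs (
                        job_id TEXT PRIMARY KEY,
                        state  TEXT NOT NULL,
                        dag_id TEXT)""")
    return conn

def record_state(conn, job_id, state, dag_id=None):
    conn.execute("INSERT OR REPLACE INTO jobs VALUES (?, ?, ?)",
                 (job_id, state, dag_id))
    conn.commit()

def recover_jobs(conn):
    """After a restart, return jobs that were still in flight."""
    rows = conn.execute("SELECT job_id, state, dag_id FROM jobs "
                        "WHERE state NOT IN ('done', 'failed')")
    return rows.fetchall()

conn = open_state_db(":memory:")          # in-memory DB for the example
record_state(conn, "dc2.123", "submitted", "dag-42")
print(recover_jobs(conn))                 # [('dc2.123', 'submitted', 'dag-42')]
```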

  20. Production lessons (II)
      • Site infrastructure problems
        • hardware problems
        • software distribution, transformation upgrades
        • file systems (NFS the major culprit); various solutions by site administrators
      • Errors in stage-out caused by poor network connections and gatekeeper load
        • fixed by adding I/O throttling and checking the number of TCP connections (see the sketch below)
      • Lack of storage management (e.g. SRM) on sites means submitters do some cleanup remotely; not a major problem so far, but we have not had much competition
      • Load on gatekeepers
        • improved by moving md5sum off the gatekeeper
      • Post-job processing
        • remote execution (mostly in pre/post job) is error-prone
        • the reason for a failure is difficult to understand
        • no automated tools for validation
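
A sketch of the stage-out throttling fix mentioned above: before starting another transfer, count how many established TCP connections the submit host already has and back off while a threshold is exceeded. The threshold, the netstat parsing and the example transfer command are assumptions for illustration, not the actual DC2 implementation.

```python
import subprocess
import time

MAX_CONNECTIONS = 20        # hypothetical cap on concurrent transfers

def established_connections():
    """Count ESTABLISHED TCP connections reported by netstat."""
    out = subprocess.run(["netstat", "-tn"], capture_output=True, text=True)
    return sum(1 for line in out.stdout.splitlines() if "ESTABLISHED" in line)

def throttled_transfer(transfer_cmd):
    """Run one stage-out command only when the host is below the cap."""
    while established_connections() >= MAX_CONNECTIONS:
        time.sleep(30)      # back off before retrying
    subprocess.run(transfer_cmd, check=True)

# e.g. throttled_transfer(["globus-url-copy", src_url, dst_url])
```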

  21. Operations Lessons
      • Grid3 iGOC and the US Tier1 developed an operations response model
      • Tier1 center
        • core services
        • an “on-call” person always available
        • a response protocol was developed
      • iGOC
        • coordinates problem resolution for Tier1 “off hours”
        • trouble handling for non-ATLAS Grid3 sites; problems resolved at the weekly iVDGL operations meetings
      • Shift schedule (8-midnight since July 23)
        • 7 trained DC2 submitters
        • keeps queues saturated, reports site and system problems, cleans working directories
      • Extensive use of email lists; partial use of alternatives such as Web portals and IM

  22. Conclusions
      • A completely new system
        • Grid3's simplicity requires more functionality and state management on the executor submit host
        • all functions of job planning, job state tracking and data management (stage-in, stage-out) are managed by Capone rather than by grid systems
        • clients are exposed to all manner of grid failures: good for experience, but a client-heavy system
      • Major areas for upgrades to the Capone system
        • job state management and controls, state persistency
        • generic transformation handling for user-level production

  23. Authors
      • GIERALTOWSKI, Gerald; MAY, Edward; VANIACHINE, Alexandre (Argonne National Laboratory)
      • SHANK, Jim; YOUSSEF, Saul (Boston University)
      • BAKER, Richard; DENG, Wensheng; NEVSKI, Pavel (Brookhaven National Laboratory)
      • MAMBELLI, Marco; GARDNER, Robert; SMIRNOV, Yuri; ZHAO, Xin (University of Chicago)
      • LUEHRING, Frederick (Indiana University)
      • SEVERINI, Horst (Oklahoma University)
      • DE, Kaushik; MCGUIGAN, Patrick; OZTURK, Nurcan; SOSEBEE, Mark (University of Texas at Arlington)

  24. Acknowledgements
      • Windmill team (Kaushik De)
      • Don Quijote team (Miguel Branco)
      • ATLAS production group, Luc Goossens, CERN IT (prodDB)
      • ATLAS software distribution team (Alessandro de Salvo, Fred Luehring)
      • US ATLAS testbed sites and Grid3 site administrators
      • iGOC operations group
      • ATLAS Database group (ProdDB Capone-view displays)
      • Physics Validation group: UC Berkeley, Brookhaven Lab
      • More info
        • US ATLAS Grid: http://www.usatlas.bnl.gov/computing/grid/
        • DC2 shift procedures: http://grid.uchicago.edu/dc2shift
        • US ATLAS Grid Tools & Services: http://grid.uchicago.edu/gts/
