
Data Challenge DC04 Underway (Mar 1 - Apr 30)



  1. Data Challenge DC04 Underway (Mar 1 - Apr 30)
  • 70M MC events (20M with G4) produced in the pre-challenge
  • Classic production centers, LCG, and US/GRID3 heavily used
  • The challenge (not a "CPU" challenge, but a full-chain demonstration):
    • Reach a sustained 25 Hz reconstruction rate in the Tier-0 farm (25% of the target conditions for LHC startup); this is nevertheless a lot of CPU, ~500 nodes
    • Use CMS and LCG software to record the DST and catalogue the data and metadata
    • Distribute the reconstructed data to six Tier-1 centers using available Grid and other tools
    • Reprocess that data close to real time at some of the Tier-1 centers
    • Produce new data sets at the Tier-1s, with subsequent distribution to Tier-2 centers for analysis purposes
    • Monitor and archive performance criteria of the ensemble of activities for debugging and post-mortem analysis
  • Detailed current status on the DC04 page: http://www.uscms.org/s&c/dc04/ and in yesterday's talks: http://agenda.cern.ch/fullAgenda.php?ida=a04921#s1

  2. Keywords
  • Automatic processes
  • Failover systems
  • Automatic recovery
  • Transaction security
  • Detailed monitoring
  • Use of tools written/maintained by other organizations!
  • Detailed logging of status and decisions for post-mortem

  3. Status Today
  • Reconstruction and DST code getting into good shape quickly
    • will be runnable at 25 Hz; currently at 7-10 Hz
  • T0->T1 database and distribution systems will be able to run at 25 Hz
  • Standard set of problems being tackled; this weekend it was disk servers…
  • RLS currently a bottleneck
    • performance and design questions
    • working with all teams to see if our usage can be modified
    • workarounds are possible ("it's only software…"), but only some are desirable
  • Bringing up T1 "local" analysis now
    • this ensures that the infrastructure to use data at the T1 exists
  • LCG-based analysis will use this infrastructure

  4. Fake on-line operations
  • Pre-conditions:
    • empty Digi and Hits COBRA metadata available
    • RefDB has POOL metadata for the Digis
  • Post-conditions:
    • Input Buffer filled with Digi files and consistent COBRA metadata
    • POOL catalogue correctly filled
    • entry in RefDB specifies the new job to be run
    • entry in the Transfer Management DB (TMDB) for the Digi files
  • Workflow of the 25 Hz fake on-line process (from the slide diagram; sketched in code below): 1. get the dataset file names from the dataset priority list (PRS) / RefDB; 2. stage the Digi files from Castor into the Input Buffer; 3. attachRun; 4. get the POOL catalogue fragment; 5. register PFNs & metadata in the POOL RLS catalogue; 6. insert a new "request" in RefDB; 7. insert the Digi files into the TMDB
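The numbered steps above read naturally as one agent loop. Below is a minimal Python sketch of such a loop, included only to make the sequence concrete: the `priority_list`, `refdb`, `castor`, `pool_rls` and `tmdb` objects and all of their methods are hypothetical stand-ins for the real DC04 interfaces, not actual CMS code.

```python
import time

def fake_online_cycle(priority_list, refdb, castor, pool_rls, tmdb):
    """One pass of a hypothetical 25 Hz fake on-line process (steps 1-7 above)."""
    dataset = priority_list.next()                   # 1. next dataset from the PRS priority list
    digi_files = refdb.dataset_file_names(dataset)   # 1. Digi file names known to RefDB
    for lfn in digi_files:
        castor.stage(lfn, dest="input_buffer")       # 2. stage Digi files from Castor
    run = refdb.attach_run(dataset)                  # 3. attachRun
    fragment = refdb.get_pool_fragment(run)          # 4. POOL catalogue fragment for the Digis
    pool_rls.register(fragment)                      # 5. register PFNs and metadata in POOL RLS
    refdb.insert_request(run)                        # 6. new reconstruction "request" in RefDB
    tmdb.insert(digi_files)                          # 7. Digi files entered in the TMDB

def run_fake_online(priority_list, refdb, castor, pool_rls, tmdb, events_per_run=1000):
    while True:
        fake_online_cycle(priority_list, refdb, castor, pool_rls, tmdb)
        time.sleep(events_per_run / 25.0)            # pace the injections to ~25 Hz on average
```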

  5. Tier-0 job preparation operations
  • Pre-condition:
    • empty Reco COBRA metadata file is available and registered in POOL
  • Post-conditions:
    • XML catalogue to be used by the job is ready
    • execution script (ORCA RECO script & .orcarc) and accessory files are ready
    • job is submitted to LSF
  • Workflow of the job preparation agent (from the slide diagram; sketched in code below): 1. discover the new request in RefDB; 2a. read the POOL RLS catalogue; 2b. publish the Digi file entries into a local XML catalogue; 3. McRunJob create (build the ORCA RECO script, .orcarc and XML catalogue); 4. McRunJob run (submit the job to LSF)
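A rough Python sketch of what one cycle of such an agent does. The `request` and `pool_rls` objects, their methods, and the schematic catalogue format are invented for illustration; `bsub` is the standard LSF submission command, and the "cmsdc04" queue name comes from the Tier-0 overview slide later in this talk.

```python
import subprocess
from pathlib import Path

def prepare_and_submit(request, pool_rls, workdir="t0_jobs"):
    """Hypothetical Tier-0 job preparation (steps 1-4 of the diagram above)."""
    jobdir = Path(workdir) / str(request["run"])
    jobdir.mkdir(parents=True, exist_ok=True)

    # 2a/2b. read the Digi entries from the POOL RLS catalogue and publish them
    # into a local XML catalogue for the job (the format here is only schematic)
    entries = pool_rls.read_entries(request["digi_guids"])
    body = "\n".join(f'<File ID="{e["guid"]}"><pfn name="{e["pfn"]}"/></File>' for e in entries)
    (jobdir / "PoolFileCatalog.xml").write_text(f"<POOLFILECATALOG>\n{body}\n</POOLFILECATALOG>\n")

    # 3. McRunJob would generate the ORCA RECO script and .orcarc; a stand-in script here
    script = jobdir / "reco.sh"
    script.write_text("#!/bin/sh\n# run ORCA RECO against the catalogue prepared above\n")
    script.chmod(0o755)

    # 4. submit to LSF (the dedicated "cmsdc04" queue)
    subprocess.run(["bsub", "-q", "cmsdc04", str(script)], check=True)
```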

  6. Tier-0 reconstruction
  • Post-conditions:
    • Reco files are on the General Distribution Buffer and on tape
    • POOL catalogue correctly updated with the Reco files
    • Reco file entries are inserted in the Transfer Management Database
  • Workflow (from the slide diagram; sketched in code below): 1. LSF executes the ORCA RECO job; 2. the job reads the XML catalogue; 3. rfcp download of the Digi files from the Input Buffer; 4. the XML catalogue is updated with a local copy of the Reco COBRA metadata; 5. rfcp upload of the Reco files to the General Distribution Buffer (and Castor); 6. the updated XML catalogue is written and uploaded; 7. the XML Publication Agent discovers the updated catalogue; 8. registers the files & metadata in the POOL RLS catalogue; 9. inserts the Reco files into the TMDB
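Seen from inside the batch job, this is essentially a staging wrapper around ORCA. A hedged sketch follows: `rfcp` is the real Castor remote-copy command, while the buffer paths, the `orca_reco` executable name, the output-file pattern and the drop-box convention for the XML Publication Agent are illustrative assumptions.

```python
import subprocess
from pathlib import Path

def run_reco_job(digi_pfns, input_buffer, gdb, dropbox, workdir="/tmp/reco"):
    """Hypothetical ORCA RECO job wrapper (steps 2-7 of the diagram above)."""
    work = Path(workdir)
    work.mkdir(parents=True, exist_ok=True)

    # 3. rfcp download of the Digi files from the Input Buffer to local scratch
    for pfn in digi_pfns:
        subprocess.run(["rfcp", f"{input_buffer}/{pfn}", str(work / Path(pfn).name)], check=True)

    # 2./4. run the reconstruction; ORCA reads the XML catalogue updated with the local copies
    subprocess.run(["./orca_reco", "-c", "PoolFileCatalog.xml"], check=True, cwd=work)

    # 5. rfcp upload of the Reco files to the General Distribution Buffer
    for reco in work.glob("*.reco.root"):
        subprocess.run(["rfcp", str(reco), f"{gdb}/{reco.name}"], check=True)

    # 6./7. drop the updated XML catalogue where the XML Publication Agent will find it
    subprocess.run(["rfcp", str(work / "PoolFileCatalog.xml"), f"{dropbox}/{work.name}.xml"], check=True)
```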

  7. Data distribution at the Tier-0
  • Pre-conditions:
    • Digi and Reco files are registered in the Transfer Management DB
  • Post-conditions:
    • Input and General Distribution Buffers are cleared of any files already at the Tier-1s
    • all data files assigned (copied) to a Tier-1 export buffer (LCG SE, SRB vault, or dCache/SRM) as decided by the Configuration agent logic
    • Transfer Management DB and POOL RLS catalogue kept up to date with file locations
  • Workflow (from the slide diagram; sketched in code below): 1. new-file discovery in the TMDB; 2. get the metadata from the POOL RLS catalogue; 3. the Configuration agent assigns the file to a Tier-1; 4. the RM/SRM/SRB Tier-0 agent discovers the assignment; 5a/5b. copies the Digi/Reco files from the Input and General Distribution Buffers into the export buffer; 6. adds the new PFN to the RLS; 7. updates the TMDB; 8. the Tier-1 discovers the file; 9. updates the TMDB; 10. a check confirms which files are safely at their Tier-1s; 11. those files are purged from the buffers
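The heart of the Configuration agent is a mapping from dataset (or stream) to Tier-1 destinations plus a purge rule. A minimal sketch; the assignment table, the dataset names and all TMDB/RLS method names here are assumptions, not the actual DC04 schema.

```python
# Hypothetical Configuration-agent logic (steps 1-3 and 10-11 above).
ASSIGNMENT = {                   # dataset -> Tier-1 destinations (purely illustrative)
    "muon_stream_DST": ["CNAF", "PIC", "FNAL"],
    "bjet_stream_DST": ["RAL", "IN2P3", "FZK"],
}
ALL_T1S = {"CNAF", "PIC", "FNAL", "RAL", "IN2P3", "FZK"}

def configuration_cycle(tmdb, rls):
    for f in tmdb.new_files():                        # 1. new-file discovery
        meta = rls.get_metadata(f.guid)               # 2. dataset/stream metadata from the RLS
        for t1 in ASSIGNMENT.get(meta["dataset"], ALL_T1S):
            tmdb.assign(f.guid, t1)                   # 3. assign the file to one or more Tier-1s

    for f in tmdb.files_in_buffers():                 # 10. check: is the file safely copied to
        if tmdb.completed_destinations(f.guid) >= tmdb.assigned_destinations(f.guid):
            tmdb.mark_for_purge(f.guid)               # 11. purge it from the IB / GDB
```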

  8. Tier-1 RM data import/export
  • Pre-conditions:
    • POOL RLS catalogue is either the CERN one or a local mirror
    • Transfer Management DB at CERN is accessible from the Tier-1s
  • Post-conditions:
    • data copied at the Tier-1 onto MSS and available to Tier-2s
    • CERN POOL RLS catalogue and local POOL catalogue updated
    • Transfer Management DB updated
  • Workflow (from the slide diagram; sketched in code below): 1. the Tier-1 agent discovers files assigned to it in the TMDB; 2. looks them up in the POOL RLS catalogue; 3. replicates them with the Replica Manager; 4a-4d. in the SRM/SRB variants, lookup & update, copy, and add the SFN to the GMCAT; 5. updates the TMDB and migrates the data to MSS; 6. FCpublish into the local POOL catalogue (if it is not an RLS mirror); 7-10. an SRB2LCG agent discovers new SRB files, Sgets them, moves them to an LCG SE via gridftp, and adds the SFN
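From the Tier-1 side, the TMDB effectively drives a small per-file state machine. A sketch of one polling pass, under the assumption of "pending"/"done" states and the method names shown; `grid_copy` stands in for whichever of the Replica Manager, SRM or SRB tools the site uses, and the destination path is invented.

```python
def tier1_import_cycle(tmdb, rls, local_catalogue, mss, grid_copy, site="CNAF"):
    """Hypothetical Tier-1 import-agent pass (steps 1-6 of the diagram above)."""
    for f in tmdb.assigned_to(site, state="pending"):     # 1. files assigned to this Tier-1
        src = rls.lookup_pfn(f.guid, location="EB")       # 2. source replica in the Tier-0 export buffer
        dst = f"/castor/{site.lower()}/cms/dc04/{f.guid}" # illustrative destination path
        grid_copy(src, dst)                               # 3./4b. RM / SRM / SRB copy (placeholder)
        rls.add_pfn(f.guid, dst)                          # 4. register the new replica
        mss.migrate(dst)                                  # 5. migrate to tape / MSS
        tmdb.set_state(f.guid, site, "done")              # 5. TMDB updated
        if not local_catalogue.is_rls_mirror:
            local_catalogue.fc_publish(f.guid, dst)       # 6. FCpublish into the local POOL catalogue
```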

  9. Tier-1 analysis job preparation
  • Pre-conditions:
    • a local POOL catalogue is populated with at least the local files (may be an RLS)
    • the list of files of a given run is provided by the global POOL catalogue (i.e. the RLS)
  • Post-conditions:
    • XML catalogue to be used by the job is ready
    • execution script (ORCA script & .orcarc) and accessory files are ready
    • job is submitted to a local or Grid resource manager
  • Workflow (from the slide diagram, mirroring the Tier-0 case of slide 5): 1. discover; 2a. read the local POOL catalogue; 2b. POOL publish of the Reco file entries into an XML catalogue; 3. McRunJob create; 4. McRunJob run on the local or Grid resource manager

  10. Tier-1 analysis
  • Post-conditions:
    • ROOT or ntuple files are on the local storage or on a Storage Element
    • RLS updated if on the Grid
  • Note: if the Tier-1 uses SRB, the local storage may be an SRB vault and the RLS catalogue is replaced by the GMCAT
  • Workflow (from the slide diagram; the output-registration step is sketched in code below): 1. the resource manager executes the ORCA analysis job; 2. the job reads the XML catalogue; 3. downloads the Reco files; 4. updates the XML catalogue with the local copies (only if downloaded); 5. attachRun on the local copy of the COBRA metadata; 6a. uploads the ROOT/ntuple files; 6b. registers the new files in the RLS catalogue (if on the Grid)
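Only the last step depends on where the job runs and on the site's storage flavour. A small hedged sketch of that decision; the storage and catalogue objects and their methods are placeholders, not real APIs.

```python
def store_analysis_output(output_files, uses_srb, on_grid, srb_vault, storage, rls, gmcat):
    """Hypothetical steps 6a/6b: upload ROOT/ntuple files and register them."""
    for f in output_files:
        if uses_srb:
            pfn = srb_vault.put(f)           # 6a. upload into the SRB vault
            gmcat.add_sfn(f.guid, pfn)       # per the note above, GMCAT takes the role of the RLS
        else:
            pfn = storage.put(f)             # 6a. upload to local storage or an LCG SE
            if on_grid:
                rls.add_pfn(f.guid, pfn)     # 6b. register only when running on the Grid
```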

  11. ORCA Software
  • DST with ~50 physics reconstructors
    • tracks, vertices, jets (4 algorithms), MET, muons, b-jets, electrons, photons, HLT results, …
  • Streaming operational
    • streams can be selections of events and have contents that depend on the stream
    • ECAL calibration stream: special information used only in calibration
    • DT calibration stream likewise
    • L1 trigger streams (HLT streams possible but not actually done yet)
  • 20 s full reconstruction time
    • just fast enough to run at 25 Hz on 500 CPUs! (see the check below)
  • Working on the ORCA_8_0_x release
    • 8_0_0 released Tuesday, a few bugs to fix
  • Tremendous effort by the CCS and PRS code writers
    • very rapid and substantial progress in the last few months
    • very busy on a daily basis now
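The "25 Hz on 500 CPUs" statement is just the ratio of farm size to per-event processing time; a one-line check (the 20 s figure is from this slide, the 500 dedicated nodes from the Tier-0 overview slide):

```python
cpus = 500            # dedicated Tier-0 worker nodes
t_per_event_s = 20.0  # full reconstruction time per event
print(cpus / t_per_event_s)   # 25.0 events/s -> exactly the DC04 25 Hz target
```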

  12. DST Runs
  • DST event model
    • DST events link back to the Digis, but the Digis are not in the DST files
    • can be analyzed without the Digis
    • events can be in the full DST and in selected streams
    • object model in streams can differ (from each other and from the full stream)
  • DST writing about 20-30 s/event, with full tracker reconstruction dominating
    • 500 MB memory footprint
  • DST reading ~1/25 s/event
  • First analysis jobs using the DST are starting to work now
  • (Diagram: Trigger + Digis feed an L1 DiMuon stream with tracks and partial muon reconstruction, and a full DST including tracks, muons, clusters and jets)

  13. Tier-0 Overview
  • LCG-2
    • "official" BDII and RB for CMS-DC04
    • RLS "rlscms.cern.ch" with suitable end-points
      • POOL-compatible version now deployed
      • RLS replication with CNAF still pending
      • single point of failure!
      • (several) problems encountered/discovered
      • need to better understand usage by DC04
      • (still) no monitoring/logging information available to us
    • TMDB (Transfer Management DB): so far so good (actually excellent!)
  • Monitoring
    • LEMON
    • MonALISA
    • GridICE on LCG-2 service systems (server in INFN)
  • Ancillary systems
    • "private" LCG-2 BDII
    • MonALISA backup system
  • Login
    • lxgate04.cern.ch
    • login user "cmsdc04", other users on request
    • daemons, agents, drop boxes, edg RM
  • Batch
    • LSF batch
    • worker nodes "lxbNNNN": 3 racks, 88 nodes each; > 250 dedicated nodes, fully installed (including grid)
    • dual P-IV 2.4 GHz, 1 GB memory, 100baseT
    • "cmsdc04" LSF queue, 2 jobs / node
  • Disk servers available
    • CASTOR stager "stagecmsdc04", with 2 pools:
      • "detector": IB InputBuffer, 5.9 TB, 2.3 TB free today (39%)
      • "gdb": GDB GlobalDistributionBuffer, 6.8 TB, 5.6 TB free today (83%)
    • EB-SRB SRB server, 4.2 TB, 3.0 TB free today (75%)
    • EB-SRM SRM server, 4.2 TB, 3.0 TB free today (75%)
    • EB-SE LCG-2 "Classic SE", 1.05 TB, 0.5 TB free today (47%)
    • 2 more systems being installed today

  14. Transfer Tools, TMDB, Status
  • Prototype in place and tested
  • Agents all in place, and have successfully transferred the first sets of files produced at CERN
    • ~20000 files, replicated to each T1, O(10^5) transfers (see the check below)
    • a small fraction of the total required for DC04…
    • but enormous compared to the number you'd want to trigger by hand
  • Have experienced problems (a new one every day…)
    • lots of transactions on the LRC cause problems
    • finding bugs in various bits of software
    • lots of support from the EDG, POOL, SRB developers… thanks!
    • low transfer rates because of small file size (most are ~40 MB)
    • and so on…
  • So: ongoing development, but a very basic prototype now exists
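Both numbers quoted above are easy to sanity-check. In the sketch below, the six Tier-1 destinations come from slide 1; the per-file overhead and per-stream rate are purely illustrative assumptions, used only to show why ~40 MB files keep the aggregate rate low.

```python
# Transfer count: ~20000 files replicated to each of six Tier-1s
files_per_t1 = 20_000
n_tier1 = 6
print(f"total transfers ~ {files_per_t1 * n_tier1:.0e}")    # ~1e+05, matching O(10^5)

# Why small files hurt: assume (hypothetically) ~5 s of fixed per-file cost
# (catalogue lookups, connection setup) on top of a 10 MB/s sustained stream.
file_size_mb = 40
stream_mb_per_s = 10.0   # assumed, not from the slides
overhead_s = 5.0         # assumed, not from the slides
effective = file_size_mb / (file_size_mb / stream_mb_per_s + overhead_s)
print(f"effective per-stream rate ~ {effective:.1f} MB/s")  # ~4.4 MB/s instead of 10 MB/s
```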

  15. LCG-2 components in DC04
  • RLS (Replica Location Service)
    • many clients:
      • RLS Publishing Agent: converts the XML catalogue of a Tier-0 job into RLS entries
      • Configuration Agent: queries the RLS metadata to assign files to Tier-1s
      • Export Buffer Agents: insert/delete the PFN for the location in the EB
      • Tier-1 Agents: insert the PFN for the destination location; in some cases dump the RLS into a local MySQL POOL catalogue
    • scalability problems…
      • understand bottlenecks, e.g. use the C++ API instead of the (Java) command line
      • reduce the load on the RLS
      • mirror the RLS: the mirror at CNAF has been ready since last week, but is not yet in use
  • Data transfer between LCG-2 Storage Elements
    • Export Buffer at the Tier-0 with a disk-based SE
      • production system delivered by IT at the end of last week with ~1 TB of disk; previously we used a system provided by the EIS team
      • added CPU and 2 TB of disk space today
    • serving transfers to the CASTOR SEs at PIC and CNAF via the Replica Manager
    • also replicating files from CNAF to Legnaro for the muon streams

  16. Services and SW installation
  • Dedicated information indexes at CERN, supported by LCG
    • CMS may add its own resources and remove problematic sites
  • Dedicated Resource Broker at CERN, supported by LCG
  • Virtual Organization tools are the official LCG-2 ones
  • Dedicated GridICE monitoring server at CNAF:
    • monitors resources registered in the CMS-LCG information index
    • active on all service machines (CE, SE, RB, etc.)
    • WN monitoring is on at CNAF/PIC/Legnaro
  • CMS software installation:
    • with the new LCG-2 tools the CMS software manager can:
      • install the software at an LCG site (with a shared area between CE and WNs)
      • advertise in the Information System what has been installed
    • working on two kinds of CMS software distribution:
      • DAR (for production activities)
      • a CMSI-based tool to install RPMs (for analysis activities)

  17. Analysis on LCG-2
  • Job preparation on the User Interface, submitting jobs to a Resource Broker
  • Tier-1 fake analysis
    • demonstrate the ability to process incoming data in close to real time
  • Tier-2 end-user analysis
  • GROSS (developed at Imperial College)
    • basic functionality is there (see the splitting sketch below):
      • user submits a metadata query and an ORCA executable to GROSS
      • GROSS queries the RLS and performs job splitting
      • GROSS provides job preparation and submission (via BOSS)
      • GROSS provides basic job monitoring and information retrieval for the user
      • GROSS provides output data retrieval through the Grid sandbox (i.e. output files are not registered in the RLS, as yet)
    • needs testing with real use cases
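The GROSS workflow above reduces to "resolve a metadata query into a file list, then split it into Grid jobs". A minimal sketch of the splitting step only; the query interface, the ten-files-per-job granularity and the job-description format are assumptions, not GROSS's actual behaviour.

```python
def split_into_jobs(rls, dataset_query, orca_executable, files_per_job=10):
    """Hypothetical GROSS-style splitting: RLS metadata query -> list of job descriptions."""
    files = rls.query(dataset_query)                 # resolve the user's query to GUIDs/PFNs
    jobs = []
    for i in range(0, len(files), files_per_job):
        jobs.append({
            "input_files": files[i:i + files_per_job],
            "executable": orca_executable,           # the user-supplied ORCA executable
        })
    return jobs   # each entry would then be prepared and submitted via BOSS to the Resource Broker
```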

  18. DC04 INFN Tier-1 status report
  (Example; similar reports from FNAL, IN2P3, CNAF, RAL, PIC)
  D. Bonacorsi (CNAF/INFN Bologna), CMS Week, DC04 TF meeting, Mar 17, 2004

  19. INFN Tier-1 transfer agent(s)
  Very busy and fruitful first DC04 days:
  • First data arrived at the INFN Tier-1 on Saturday, March 6th, at 17:23 CERN time
  • INFN is using basically the same T0 -> T1 agent (written in C) as PIC, with a few differences
    • in general: very active collaboration between PIC and INFN (sharing the same EB)
  • The agent is now quite well tested and stable: it can stay up & running 24/7 (it was exercised over the whole of last weekend)
  • All files available so far in the EB-SE (> 18000) have been transferred to INFN and migrated to tape
  • First test files were delivered to the Legnaro T2 with a prototype T1 -> T2 agent
    • more coding/testing is needed; possibly ready by the end of this week
  Nevertheless:
  • Castor filename length: ORCA filenames turn out to be too long for Castor@CNAF
    • workaround: store files with their GUIDs as filenames (short and unique by definition), as sketched below
    • the Castor installation at CNAF is aligned with the other T1s', but its configuration happened to hit Castor's internal limit on the maximum full path length
    • reconfiguration in production needs support and instructions from Castor IT
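The GUID workaround amounts to building the Castor path from the file's GUID and keeping the long ORCA name only in the catalogue. A minimal sketch; the path prefix and the example filename are invented for illustration, and in DC04 the GUID would come from the POOL catalogue rather than being generated on the fly.

```python
import uuid

CASTOR_PREFIX = "/castor/cnaf.infn.it/cms/dc04"   # illustrative prefix, not the real one

def castor_path_for(guid=None):
    """Store a file under its GUID so the full Castor path stays short."""
    guid = guid or str(uuid.uuid4())   # placeholder; real GUIDs come from the POOL catalogue
    return f"{CASTOR_PREFIX}/{guid}"

# The long ORCA logical filename stays only in the POOL/RLS catalogue entry:
lfn = "a.deliberately.long.placeholder.ORCA.dataset.owner.Digis.root"
print(lfn, "->", castor_path_for())
```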

  20. Computing TDR
  • Current milestones call for a Computing and Software TDR with final submission in Autumn 2004
    • to be followed by an LCG TDR in July 2005, just in time for purchasing to get underway
  • It is not possible to meet this schedule; we have slipped 9 months in one year
  • We propose to concentrate on the key computing parameters and model
    • good chance to have a solid draft in October, final submission in early 2005
  • Will address the issues required for the LCG TDR
  • Will address the computing architecture and requirements for initial running
    • interaction of core components like COBRA and POOL with Computing
    • but not the Application Software areas or many of the Distributed Analysis issues
