Condor at BNL

Presentation Transcript


  1. Condor at BNL Antonio Chan Jason Smith Tomasz Włodek Brookhaven National Laboratory

  2. Brookhaven Linux Farms • Brookhaven National Laboratory (BNL) hosts 4 big experiments (Brahms, Phenix, Phobos, Star) which operate at the RHIC heavy-ion accelerator • BNL also participates in the ATLAS experiment, aimed at building the world's largest elementary-particle detector at the CERN laboratory in Geneva, Switzerland. • BNL will host the main US computing center serving the ATLAS experiment, one of a handful of such centers in the world.

  3. Brookhaven Linux Farms – cont'd • Brookhaven manages a Linux farm which serves these experiments. • At present: 1145 machines, P3 and P4, 450 MHz–3.06 GHz • HPSS tape storage: 1.5 PB, 7.5M files (as of now) • 180 TB of centralized disk storage served via NFS (distributed storage alternatives are being considered)

  4. Brookhaven Linux Farms – growth • The BNL farm will grow. • For the RHIC experiments: 224 P4 3.06 GHz machines ordered, currently being delivered and installed • Another 200+ will be ordered in autumn • 200+ will be added every year for several years to come • ATLAS farm: 95 machines now, will grow exponentially (doubling in size every year) until the ATLAS experiment starts taking data (1000 machines by 2007)

  5. RHIC computing – now • The machines belonging to the RHIC experiments fall into two categories: data analysis machines (CAS) and data reconstruction machines (CRS) • Data analysis: hundreds of users, each running their own jobs which read data files and produce histograms. CAS machines use the LSF batch system. • Data reconstruction: data arriving from the detectors must be reprocessed and stored in HPSS. One user constantly runs thousands of CPU- and I/O-heavy jobs for several months. • CRS machines use a home-grown batch system, written at BNL and designed specifically for RHIC data reconstruction jobs.

  6. The LSF problem • "You need three things in politics: money, money and money" – Louis XIV • You need the same three things to run LSF … • We have 2200 LSF licenses now, and we expect to have 5000 processors in the future • Due to the projected future cost of LSF licenses, we are considering alternatives.

  7. Condor at BNL (reason #1) • In order to escape the cost of LSF licensing, we are considering using Condor and phasing out LSF • No LSF licenses will be installed on newly acquired machines • It is expected that users will gradually migrate from LSF to Condor. • At present we are encouraging users to try Condor, in order to see whether it can serve as a viable alternative to LSF

  8. Condor at BNL (reason #2) • The old data reconstruction system (CRS) is not scaling well with the growth of the farm. • We need to replace it with a new one. • The new CRS system is Condor-based, i.e. it uses Condor as its job scheduler, submitter and monitor

  9. Anatomy of a CRS job (diagram: input files are imported from HPSS and Unix disk, the data is reconstructed, and the output is exported)

  10. CRS job • A CRS job is a black box which reads some file(s) from HPSS or NFS disk(s), does 6-12 hours of number crunching, and finally writes output file(s) to disk(s) or tape(s). • This procedure needs to be repeated for millions of jobs, over long periods of time, with high reliability, as little user maintenance as possible, and perfect bookkeeping.

  11. New CRS • The new CRS batch system is based on Condor/DAGMan. • CRS builds each job from a user-provided job description file. • Condor/DAGMan runs the user jobs (a sketch of the DAG structure follows below). • The parent jobs in the DAG locate the input data on disk and stage it from tape to the disk cache if needed. • The child job in the DAG runs the user executables and, once the job is completed, exports the output files to their final destinations
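  The slides do not include the actual CRS code or DAG files, so the following is only a minimal sketch of the parent/child structure described above: a Python script that writes a two-node DAGMan DAG (a "stage" parent that fetches the input, a "reco" child that reconstructs the data and exports the output) and submits it with condor_submit_dag. The script names stage_in.sh and reconstruct_and_export.sh, the file paths, and the submit-file details are illustrative assumptions, not BNL's real implementation.

    #!/usr/bin/env python
    # Minimal sketch (not the real CRS code) of the parent/child DAG
    # described on this slide.  The executables stage_in.sh and
    # reconstruct_and_export.sh are hypothetical placeholders.
    import subprocess

    def write_submit(name, executable, args):
        """Write a basic vanilla-universe Condor submit file for one DAG node."""
        with open("%s.sub" % name, "w") as f:
            f.write("universe   = vanilla\n"
                    "executable = %s\n"
                    "arguments  = %s\n"
                    "output     = %s.out\n"
                    "error      = %s.err\n"
                    "log        = crs_job.log\n"
                    "queue\n" % (executable, args, name, name))

    def submit_crs_job(input_file):
        # Parent node: stage the input file from HPSS/tape into the disk cache.
        write_submit("stage", "stage_in.sh", input_file)
        # Child node: reconstruct the data, then export the output files.
        write_submit("reco", "reconstruct_and_export.sh", input_file)
        # Two-node DAG: 'reco' starts only after 'stage' succeeds.
        with open("crs_job.dag", "w") as f:
            f.write("JOB stage stage.sub\n"
                    "JOB reco  reco.sub\n"
                    "PARENT stage CHILD reco\n")
        # Hand the DAG to DAGMan, which schedules and monitors both nodes.
        subprocess.check_call(["condor_submit_dag", "crs_job.dag"])

    if __name__ == "__main__":
        submit_crs_job("/hpss/raw/run1234.dat")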

  12. New CRS • The CRS system provides bookkeeping and troubleshooting software. • CRS interfaces Condor to MySQL databases to keep track of production (a bookkeeping sketch follows below) • CRS provides a user interface – both a GUI and command-line mode are supported.
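  The CRS database schema is not shown in the slides; as a rough illustration of the bookkeeping idea only, a job-tracking layer might record state changes in a MySQL table along the following lines. The table crs_jobs, its columns, and the connection details are assumptions made purely for this sketch.

    # Sketch of Condor-to-MySQL bookkeeping.  The table
    # crs_jobs(job_id, input_file, state, updated) is hypothetical.
    import MySQLdb

    def record_state(job_id, input_file, state):
        """Insert or update the bookkeeping row for one CRS job."""
        db = MySQLdb.connect(host="dbhost", user="crs",
                             passwd="secret", db="crs_prod")
        try:
            cur = db.cursor()
            cur.execute(
                "REPLACE INTO crs_jobs (job_id, input_file, state, updated) "
                "VALUES (%s, %s, %s, NOW())",
                (job_id, input_file, state))
            db.commit()
        finally:
            db.close()

    # Example: record the staging and completion of one job.
    record_state(42, "/hpss/raw/run1234.dat", "STAGED")
    record_state(42, "/hpss/raw/run1234.dat", "DONE")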

  13. New CRS – Condor setup • The new CRS machines are divided into 5 groups, which serve as "queues" (a submit-file sketch follows below) • CRS1 – 450 MHz • CRS2 – 800 MHz • CRS3 – 1 GHz • CRS4 – 1.4 GHz • CRS5 – 2.4 GHz
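  The Condor configuration behind these groups is not given in the slides. A common way to implement such "queues" is to have each machine advertise a custom ClassAd attribute and to select it from the job's Requirements expression; the attribute name CRS_Group below is purely an assumption for illustration.

    # Sketch: route a job to one of the CRS machine groups ("queues")
    # by matching a hypothetical custom machine attribute CRS_Group.
    def submit_description(group, executable):
        return "\n".join([
            "universe     = vanilla",
            "executable   = %s" % executable,
            # Match only machines that advertise the requested group.
            'requirements = (CRS_Group == "%s")' % group,
            "log          = crs.log",
            "queue",
        ]) + "\n"

    # Example: send a reconstruction job to the 2.4 GHz group.
    print(submit_description("CRS5", "reconstruct.sh"))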

  14. The New CRS User Interface

  15. New CRS – status • Project started in January 2003 • Design and coding phase: until August 2003 • August 2003 – now: testing • At present we are moving to the production phase; right now users run O(1000) jobs/week using the new CRS. • We are still far from our objective: uninterrupted production at a sustained rate of O(10k) jobs/day over several months in a row. • No full "Condor stress test" has been done – yet.

  16. CAS farm – configuration and priority • CAS1 = 400-800 MHz • CAS2 = 1-1.4 GHz • CAS3 = 2.4-3.06 GHz (a rank-expression sketch follows below)
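  The slide only lists the CAS groups by clock speed; how the priority between them is enforced is not shown. Purely as an illustration, a submit-side Rank expression can make Condor prefer faster machines among those that match, using the standard KFlops benchmark attribute (the CAS_Group attribute below is hypothetical).

    # Sketch: accept any CAS machine but prefer the fastest ones.
    # CAS_Group is a hypothetical custom attribute; KFlops is a
    # standard machine ClassAd benchmark figure.
    cas_submit = "\n".join([
        "universe     = vanilla",
        "executable   = analysis.sh",
        "requirements = (CAS_Group =!= UNDEFINED)",
        "rank         = KFlops",   # higher benchmark = more preferred
        "log          = cas.log",
        "queue",
    ])
    print(cas_submit)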

  17. ATLAS farm • The ATLAS farm follows a different usage model than the RHIC farm • The ATLAS farm forms part of a US-wide grid of farms serving, among others, the ATLAS experiment. • The ATLAS farm is part of the iVDGL collaboration, aimed at building a computing grid for several different scientific experiments.

  18. Grid3 sites

  19. ATLAS farm at BNL • 95 machines, a handful of gatekeepers • Runs Condor only (there is a legacy LSF installation on the ATLAS farm, but there are no plans to expand its use) • Condor queues: • ATLAS production – 95 machines • CAS1, 2, 3 – 95 machines • Grid3 (non-ATLAS, external jobs) – 15 machines

  20. ATLAS farm status • Used for developing and testing ATLAS grid software • Production is done in "data challenges" (DC) – few-month-long production runs performing Monte Carlo simulations • Last fall: the "Pre-DC2" run, in preparation for the present production run

  21. ATLAS farm status • DC2 will start around May 1st and run until fall 2004. • More data challenge runs will come in the future • The ATLAS farm will grow and serve as the main ATLAS computing facility in the US. • The ATLAS experiment will collect data until 2020 (at least?) → the ATLAS farm will continue to exist in some form for a very long time.

  22. Conclusion • BNL manages one of the largest (?) Linux farms in the US. • The laboratory is considering replacing LSF with Condor as the batch system for farm operation. • Due to users' inertia and a natural reluctance to switch to an untested system, this process is slow. • The BNL farms are likely to grow to around 5000 processors in 2007. If Condor is adopted as the official farm system, this will (?) make them the largest Condor installation in the world.
