
Roadmap to AliEn v2-20

Roadmap to AliEn v2-20. A. Abramyan, L. Betev, D. Goyal, A. Grigoras, C. Grigoras, M. Litmaath, N. Manukyan, M. Martinez, J. Porter, P. Saiz, S. Sankar, S. Schreiner. What’s new: plenty of new improvements, catalogue simplification, client UI, extreme job brokering.


Presentation Transcript


  1. Roadmap to AliEn v2-20 A. Abramyan, L. Betev, D. Goyal, A. Grigoras, C. Grigoras, M. Litmaath, N. Manukyan, M. Martinez, J. Porter, P. Saiz, S. Sankar, S. Schreiner

  2. What’s new
     • Plenty of new improvements
     • Catalogue simplification
     • Client UI
     • Extreme Job Brokering
     • Removal of PackMan
     • New JDL fields
     • Proxy renewal
     • Job Memory checkup
     • And baseline for new development

  3. Catalogue Simplification
     • Up to now, catalogue divided in multiple DBs:
        • Simplifies scalability
        • Logic slightly more complicated
     • Changing username/userid
     • Smaller tables
     Thanks Dushyant, Subho

  4. PackMan
     • Removing the PackMan/PackManMaster services
     • Functionality stays in client UI/JA
     • JA can install packages directly
     • Very powerful if combined with torrent
     • Speeds up most of the PackMan operations
     Thanks Narine, Armenuhi

  5. New JDL fields
     • MaxWaitingTime: amount of time that a job can stay in ‘WAITING’
        • If the time is exceeded, the job ends up in error
        • New state: ERROR_EW (Expired Waiting)
     • Retrial:
        • Number of times that a single job can be resubmitted
        • Resubmission done by central services
        • Reusing JobId in resubmission
     • Direct removal of KILLED jobs
     Thanks Miguel
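The waiting-time expiry and the resubmission limit described on this slide can be sketched roughly as follows. This is an illustration only, not AliEn code: the `Job` class, the `retries_left` counter and both function names are hypothetical, time is kept in plain seconds, and the assumption that any ERROR state (including ERROR_EW) is eligible for resubmission is not stated in the slides.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    status: str              # e.g. "WAITING", "RUNNING", "ERROR_EW"
    waiting_since: float     # time (seconds) at which the job entered WAITING
    max_waiting_time: float  # seconds; stands in for the MaxWaitingTime JDL field
    retries_left: int = 0    # hypothetical counter for the resubmission limit


def expire_if_waiting_too_long(job: Job, now: float) -> Job:
    """Move a WAITING job to ERROR_EW once it has waited longer than allowed."""
    if job.status == "WAITING" and now - job.waiting_since > job.max_waiting_time:
        job.status = "ERROR_EW"
    return job


def resubmit(job: Job, now: float) -> bool:
    """Resubmit a failed job, keeping its job_id, if retries remain.

    Assumption: every status starting with "ERROR" qualifies.
    """
    if not job.status.startswith("ERROR") or job.retries_left <= 0:
        return False
    job.retries_left -= 1
    job.status = "WAITING"
    job.waiting_since = now
    return True
```

In this sketch the job identifier never changes across resubmissions, which mirrors the "Reusing JobId" bullet; the central service would simply call `resubmit` until it returns `False`.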

  6. Extreme Brokering
     • Postpone splitting of job until last moment
     • Decide data to be analyzed based on current location of JA & files not analyzed yet
     • Can define Max/Min number of files to be analyzed
        • Even if the files are not local
     • Fewer subjobs:
        • Easier merging
     Thanks Pablo
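A minimal sketch of the late-binding idea above, assuming a job agent asks for work only once it is running at a site. The function name `pick_files`, its parameters and the SE names in the usage note are hypothetical; the real broker may weigh locality and limits differently.

```python
def pick_files(remaining, replicas, site_se, min_files, max_files):
    """Choose files for a job agent that has just started at a site.

    remaining: set of file names not analyzed yet (modified in place)
    replicas:  dict mapping file name -> set of SE names holding a replica
    site_se:   name of the SE close to the job agent
    """
    local = sorted(f for f in remaining if site_se in replicas.get(f, set()))
    remote = sorted(f for f in remaining if site_se not in replicas.get(f, set()))

    # Prefer files with a local replica, up to the maximum.
    chosen = local[:max_files]

    # Top up with non-local files only when needed to reach the minimum.
    if len(chosen) < min_files:
        chosen += remote[: min_files - len(chosen)]

    # Mark the chosen files as taken so later agents do not repeat them.
    for f in chosen:
        remaining.discard(f)
    return chosen
```

With five files where File1, File2, File4 and File5 have a replica on a hypothetical `SE_A` and File3 on `SE_B`, an agent near `SE_A` takes the four local files, an agent near `SE_B` takes File3, and a third agent receives an empty list and can exit immediately.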

  7. Current situation
     • Works nicely if one replica per file (diagram: one Job Manager feeding a set of JOBs)
     • A bit more complex with 3 SE and 2 replicas
     • And a lot more with 50 SE and 3 replicas (diagram: several Job Managers, many more JOBs)
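One way to see why the replica counts above matter, under the assumption (not stated on the slide) that jobs are pre-split by the set of SEs holding each file: the number of distinct replica placements is at most the number of ways to choose the SEs, which grows quickly. The helper name `replica_placements` is hypothetical.

```python
from math import comb


def replica_placements(n_se: int, n_replicas: int) -> int:
    """Upper bound on distinct sets of SEs that can hold one file's replicas."""
    return comb(n_se, n_replicas)


# The three cases named on the slide (the first uses 3 SEs as an example).
for n_se, n_replicas in [(3, 1), (3, 2), (50, 3)]:
    print(n_se, "SE,", n_replicas, "replica(s):", replica_placements(n_se, n_replicas))
```

For 3 SEs and 2 replicas the bound is 3; for 50 SEs and 3 replicas it is 19600, so grouping files by location ahead of time can produce far more groups than is practical.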

  8. Example
     • Current schema: submit 4 jobs (diagram: File1, File 4, File2, File3, File 5)
     • Broker per file: submit 3 empty subjobs
        • When a job starts, analyze as much as possible (diagram: File1,2,4,5 and File 3)
        • If nothing left, just exit

  9. Proxy renewal system
     • Replaces vobox-proxy-renewal service
     • Can receive ‘validity’ or proxies
     • Simplifies CREAM-CE job submission
     • No corruption of proxies
     • Can be started by non-root user
     • Already deployed at CERN
        • And for some CMS sites…
     • Can already be deployed
     Thanks Maarten

  10. New development
     • More than 1 year since last major update
     • Some backward incompatible changes
        • Change of catalogue schema
     • What to do with new requests, bugs:
        • Debug current system?
        • Debug in new version?
        • Both!

  11. AliEn deployment for ALICE (diagram)
     • 80 sites: AliEn v2-19.(80-163)
     • Central Services, 8 machines: AliEn v2-19**
     • 12 machines: AliEn v2-19**, v2-17
     • 3 machines (+1 slave, backups): AliEn v2-17
     • 40.000 wn: AliEn v2-19.(80-163)
     • Components shown: vobox, catalogue, aliensh, Api, TaskQueue, Transfers, ROOT, LDAP, BACKUP, JA

  12. How to test new versions…
     • Build system:
        • Multiple platforms
        • Integration & basic functionality tests
        • No API/access from ROOT tests
        • Similar to the AliROOT, ROOT build systems
     • Running the whole system on a single machine
     • http://alienbuild.cern.ch:8888

  13. Already deployed for PANDA
     • Running since September
        • 12th PANDA Grid Workshop and 2nd AliEn Developers Week
     • Multiple sites, smaller load than ALICE
     • No API services
     • ‘Old’ v2.20 version
     Thanks PANDA

  14. Previous major update
     • Stopping the whole system
     • 1 week to redeploy
     • 1 month ironing out details
     Not an option!

  15. Second set of services (diagram): two parallel sets of Central Services, each with CE, catalogue, aliensh, Api, TaskQueue, Transfers, ROOT, LDAP and JA

  16. Second set of services
     • Copy of the catalogue
     • 3 different central machines, 3 voboxes, same SE
     • What to do with output
        • Throw away (easiest)
        • Incorporate back (easy if output in a different directory)

  17. Timeline (Mar, Apr, May)
     • Now:
        • 1 week: Investigate test system
        • 1 week: Test catalogue migration
        • 1 week: Define new VO
        • 1 week: Verify quotas
        • 1 month: New hardware for CS
        • 2 days: Central deployment from backup
        • 3 days: First site working (CERN)
        • 2 weeks: At least 2 external sites (CCIN2P3, ?)
        • After that works, keep adding sites
     • 2 months:
        • 1 day: Switch VO
        • 1 day: Overall site upgrade

  18. Summary
     • AliEn v2.20 ready for deployment
        • With plenty of new features and bug fixes
     • Minimize upgrade downtime
        • Create testing setup with several sites, and with all the SE
        • More effort on testing (also from site admins)
     • Deploy Test VO with ALICE sites
     • And say goodbye to v2-19 in two months
     Thank you!!

  19. Job execution (diagram)
     • TASKQUEUE: Job Manager holding JOBs, and Job Broker
     • Site A, Site B, Site C: each with CE, JA, MonALISA and xrootd
     • File catalogue: LFN, GUID, Meta data
