WP1 WMS rel. 2.0 Some issues


  1. WP1 WMS rel. 2.0: Some issues • Massimo Sgaravatto, INFN Padova

  2. Outline • Some issues to discuss (and let’s try to decide) • LB server choice • New CondorG • Proxy renewal • RLS integration • WP2 Optor integration • Output data upload and registration • LB issues • Gangmatching • Security of files on the WM node • Disk quota management on the WM node • VOMS integration • Job exit code • ISB/OSB transfer errors • Accounting integration • User vs host proxies • … ?

  3. LB server choice • Allow multiple LB servers for a single WM, for increased reliability and performance • Approach • UI responsible for choosing the LB server (e.g. via round robin) ? • List of available LB servers kept in the UI conf file until this VO-specific info is published in a “VO repository” (R-GMA/IS/VOMS) ? • Move the list of available NSs to this VO repository as well, when available • Not too clear yet what this VO repository could be (discussions within ATF)
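
The round-robin option mentioned on this slide could look roughly like the sketch below. It is only an illustration: the server names, the port and the `make_lb_chooser` helper are hypothetical, and the list stands in for whatever the UI conf file (or, later, the VO repository) would provide.

```python
from itertools import cycle

# Hypothetical list, standing in for the LB servers listed in the UI conf file.
LB_SERVERS = ["lb1.example.org:9000", "lb2.example.org:9000", "lb3.example.org:9000"]


def make_lb_chooser(servers):
    """Return a function that hands out LB servers in round-robin order."""
    if not servers:
        raise ValueError("no LB servers configured")
    rotation = cycle(servers)
    return lambda: next(rotation)


choose_lb = make_lb_chooser(LB_SERVERS)
picks = [choose_lb() for _ in range(4)]
```

Note that the rotation only lives inside one process. If the UI is a short-lived command, the position would have to be persisted or replaced by a random starting point.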

  4. New CondorG • New CondorG negotiated with Condor people (more details by Francesco P.) • Released by end of March, included in VDT, and to be used in rel. 2.0 • Two proxies • X509UserProxy • One per job • X509ManagementProxy • One per user’s DN or one “serving” n jobs for that user’s DN • A CondorG <gridmanager, gahp-servers> pair for a given X509ManagementProxy • Details on the whole machinery to be discussed • Where is the mapping between user’s DN and X509ManagementProxy kept and managed ? • Proxy renewal ? • …
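
To make the “one X509ManagementProxy per user’s DN, serving n jobs” option concrete, here is a minimal sketch of the bookkeeping it implies. Where this mapping is really kept is exactly the open question on the slide, so the in-memory dictionary, the `group_jobs_by_dn` helper and the example DNs are all assumptions.

```python
from collections import defaultdict


def group_jobs_by_dn(jobs):
    """Map each user's DN to the job ids that would share one management proxy."""
    by_dn = defaultdict(list)
    for job_id, user_dn in jobs:
        by_dn[user_dn].append(job_id)
    return dict(by_dn)


# Hypothetical jobs: (job id, user's DN).
jobs = [
    ("job-1", "/O=Grid/CN=Alice"),
    ("job-2", "/O=Grid/CN=Bob"),
    ("job-3", "/O=Grid/CN=Alice"),
]
mapping = group_jobs_by_dn(jobs)
```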

  5. Proxy renewal • Necessary to have a “persistent” proxy renewal daemon (i.e. if it is restarted it shouldn’t lose control of the “managed” jobs, as happens now) • Necessary to discuss and decide on various issues • Renewal of X509UserProxy • Done only if requested by the user (if MyProxyServer specified in the JDL ?) ? • No MyProxyServer in WM conf file anymore ? • And what about renewal of X509ManagementProxy ? • If a new proxy “arrives” from the UI and extends the validity of the existing one, does the new one replace the old one ? • Not enough: what if at least one job of that user asked for proxy renewal ? • Necessary to renew the X509ManagementProxy as well • Who does registration ? NS ? • Who does un-registration ?? • …
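
One way to read the “persistent” requirement is that the set of managed jobs is written to disk on every registration and un-registration, so that a restarted daemon can reload it. The sketch below shows only that idea. The `RenewalRegistry` class, the JSON file layout and the example values are assumptions, not the actual daemon design.

```python
import json
import os
import tempfile


class RenewalRegistry:
    """Job -> proxy-renewal bookkeeping that survives a daemon restart."""

    def __init__(self, path):
        self.path = path
        self.jobs = {}
        if os.path.exists(path):
            with open(path) as fh:
                self.jobs = json.load(fh)

    def _save(self):
        # Write to a temporary file first, then rename, so that a crash
        # in the middle of a write does not corrupt the registry.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump(self.jobs, fh)
        os.replace(tmp, self.path)

    def register(self, job_id, user_dn, myproxy_server):
        self.jobs[job_id] = {"dn": user_dn, "myproxy": myproxy_server}
        self._save()

    def unregister(self, job_id):
        self.jobs.pop(job_id, None)
        self._save()


state_file = os.path.join(tempfile.mkdtemp(), "renewal-registry.json")
daemon = RenewalRegistry(state_file)
daemon.register("job-1", "/O=Grid/CN=Alice", "myproxy.example.org")
daemon.register("job-2", "/O=Grid/CN=Bob", "myproxy.example.org")
daemon.unregister("job-2")

# Simulate a restart: a fresh object reloads the managed jobs from disk.
restarted = RenewalRegistry(state_file)
```

Who calls `register` and `unregister` (NS or another service) is left open here, as on the slide.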

  6. RLS integration • At J+27 RB/MM will have to query the WP2 RLS instead of the WP2 RC to get the SFNs given an LFN (or LCN, or GUID) • Ongoing negotiation of this WP1-WP2 interface • New JDL attribute (VirtualOrganization) to make it possible to refer to the “official” VO’s RLS (needed by WP2 services) • Not needed anymore once VOMS is integrated, since it will then be possible to get the VO from the user’s proxy • Optional JDL attribute to make it possible to specify a “non-official” RLS ? • edgReplicaManager::listReplicas to get the SFNs • New BrokerInfo content (under negotiation)

  7. Integration with WP2 Optor • Completely different approach from querying the RLS to get the PFNs (the two are mutually exclusive) … • RB calls getAccessCost for all the suitable CEs (the ones where the user is authorized to submit jobs and which match the JDL “Requirements” expression) and for all the specified input data (LFNs, LCNs, GUIDs) • A “cost” is returned for each CE • The RB chooses the CE, taking into account this cost and also the other Ranks (how is to be decided) • In some cases the WM also has to trigger the replication of files to the closeSE • Not too difficult, but very high impact on scheduling/planning performed by RB/MM • Integration WMS-Optor • Planned after J+27 • However, according to WP2, this will be ready and tested well before J+27 • To discuss details of integration • How ? A binary flag in the WM conf file to enable/disable Optor ? • When ?
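
Since how to combine the Optor cost with the other Ranks is still to be decided, the sketch below shows just one possible policy: lowest access cost first, ties broken by the higher rank. The `choose_ce` helper, the `fake_get_access_cost` stand-in and all values are hypothetical. The real getAccessCost interface belongs to WP2 and is not reproduced here.

```python
def choose_ce(candidates, get_access_cost, input_data):
    """Pick the CE with the lowest access cost; break ties by higher rank.

    candidates: list of (ce_id, rank) pairs that already passed matchmaking.
    get_access_cost: callable standing in for the WP2 getAccessCost call.
    """
    if not candidates:
        return None
    scored = [
        (get_access_cost(ce_id, input_data), -rank, ce_id)
        for ce_id, rank in candidates
    ]
    return min(scored)[2]


# Hypothetical costs, in place of a real Optor answer.
FAKE_COSTS = {"ce-a": 120.0, "ce-b": 45.0, "ce-c": 45.0}


def fake_get_access_cost(ce_id, input_data):
    return FAKE_COSTS[ce_id]


best = choose_ce(
    [("ce-a", 10), ("ce-b", 3), ("ce-c", 7)],
    fake_get_access_cost,
    ["lfn:some-file"],
)
```

Here `ce-b` and `ce-c` have the same cost, so the higher rank of `ce-c` decides. Other policies (for example a weighted sum of cost and rank) would fit the same function shape.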

  8. Output data upload and registration • Problem discussed and solution agreed in the ATF • Approach (details by Fabrizio P.): • OutputData JDL attribute (optional) to specify output file names, output LFNs and output SEs • At the end, the JobWrapper has to call the WP2 function copyAndRegister • Issues • Some details about copyAndRegister to be sorted out • Release date of this stuff not decided yet

  9. LB • What happens exactly at J+27 wrt: • “Advanced query to LB” ? • “LB – RGMA integration” ? • How ? • Interfaces (e.g. for advanced queries) ? • Issues ? • Ales ??

  10. Gangmatching • Problem: take into account both CE and SE information in the matchmaking • For example to require a job to run on a CE close to a SE with “enough space” • Salvo has been working on this for a while, also after some negotiations with Condor team (A. Roy) • Salvo’s talk for details (e.g. JDL) and discussions • When can this stuff be released ? J+27 ?
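
The example on this slide (a CE close to an SE with “enough space”) can be illustrated in plain Python as a match over CE/SE pairs. This is not ClassAd gangmatching syntax and not Salvo’s implementation. The `gangmatch` helper, the data layout and the numbers are assumptions made for the illustration.

```python
def gangmatch(ces, ses, min_free_gb):
    """Return (ce, se) pairs where the SE is close to the CE and has enough space.

    ces: dict mapping ce_id -> list of close SE ids
    ses: dict mapping se_id -> free space in GB
    """
    matches = []
    for ce_id, close_ses in ces.items():
        for se_id in close_ses:
            if ses.get(se_id, 0) >= min_free_gb:
                matches.append((ce_id, se_id))
    return matches


# Hypothetical information, in place of what the CEs and SEs would publish.
ces = {"ce-a": ["se-1"], "ce-b": ["se-2", "se-3"], "ce-c": []}
ses = {"se-1": 5, "se-2": 200, "se-3": 50}
pairs = gangmatch(ces, ses, min_free_gb=50)
```

The point is that the match is evaluated on the pair, not on the CE alone: `ce-a` is dropped only because its close SE is too full.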

  11. Security of files on the WM node • Approach • WP1 services (NS, …) running as edguser.edguser on the WM node • Different users’ subjects mapped to different local users in the grid-mapfile: user1.user, user2.user, … • Patched gridftp server (by Massimo M.) running on the NS node, so that the InputSandbox files are transferred to the NS node with edguser as group and rwxrwx--- as mask • So a user cannot access files belonging to another user anymore • Issues • When ? J+27 ? • How ? Gridftp server RPM released by WP1 ?
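
For reference, the rwxrwx--- mask corresponds to mode 0770: owner and group have full access, everyone else has none. The sketch below only shows that mode being applied and read back on a temporary file on a POSIX system. It does not reproduce the patched gridftp server, and the group change to edguser is omitted because it needs that group to exist. The `secure_sandbox_file` helper is hypothetical.

```python
import os
import stat
import tempfile

SANDBOX_MODE = 0o770  # rwxrwx---


def secure_sandbox_file(path):
    """Apply the rwxrwx--- mask to a transferred sandbox file."""
    os.chmod(path, SANDBOX_MODE)
    return stat.filemode(os.stat(path).st_mode)


fd, sandbox_file = tempfile.mkstemp()
os.close(fd)
mode_string = secure_sandbox_file(sandbox_file)
os.remove(sandbox_file)
```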

  12. Disk quota management on the WM node • Having different user DNs mapped to different local users in the grid-mapfile of the WM node allows setting disk quotas for the various users • NS to be modified (for J+27) so that it rejects a job if there is not enough disk quota available to store the input sandbox files • Issues ? Marco ??
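
The check the NS would have to make can be stated in a few lines. This is a sketch of the decision only: the `accept_job` helper and the byte figures are hypothetical, and how the NS would actually read the quota and current usage of the local user is not covered.

```python
def accept_job(sandbox_sizes, quota_bytes, used_bytes):
    """Return True if the input sandbox fits in the user's remaining quota."""
    needed = sum(sandbox_sizes)
    return used_bytes + needed <= quota_bytes


# Hypothetical sizes in bytes for two input sandbox files.
ok = accept_job([10_000, 25_000], quota_bytes=100_000, used_bytes=50_000)
rejected = accept_job([10_000, 25_000], quota_bytes=100_000, used_bytes=80_000)
```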

  13. VOMS integration • E.g.: voms-proxy-init –vo CMS • VO info in the generated proxy • Impact on WP1 software • Retrieve VO from user’s proxy • So it is no longer necessary to provide it in the JDL for querying the RLS • Check for authorization not done anymore with a matchmaking considering the User Cert Subject, but according to the VO • Proxy used by the various services (NS, LB, etc.) generated by VOMS ? • Issues • VOMS deployed at J+37, but not too clear which integration will take place and when • Not clear yet which VOMS APIs will be available
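
During the transition, the VO could come either from the proxy (once VOMS is integrated) or from the VirtualOrganization JDL attribute. A minimal sketch of that fallback follows. Since the VOMS APIs are not known yet, the proxy VO is simply passed in as a string, and the `resolve_vo` helper and the dictionary view of the JDL are assumptions.

```python
def resolve_vo(jdl, proxy_vo=None):
    """Prefer the VO carried by the proxy; fall back to the JDL attribute."""
    if proxy_vo:
        return proxy_vo
    vo = jdl.get("VirtualOrganization")
    if vo is None:
        raise ValueError("no VO in proxy and no VirtualOrganization in JDL")
    return vo


# Hypothetical JDL content, represented here as a plain dictionary.
from_proxy = resolve_vo({"VirtualOrganization": "atlas"}, proxy_vo="CMS")
from_jdl = resolve_vo({"VirtualOrganization": "atlas"})
```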

  14. Job exit code • For release 2.0 we agreed to return the job exit code to the user with dg-job-status • What if exit code <> 0 ? • Done-ok in any case ? • Done-failed (and therefore resubmission) ?
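
The two alternatives on this slide differ only in how a non-zero exit code is mapped to the final status. The sketch below puts them side by side. The `final_status` helper and the `nonzero_is_failure` switch are hypothetical names, and no choice between the two policies is implied.

```python
def final_status(exit_code, nonzero_is_failure):
    """Map a job exit code to a final status under one of the two policies."""
    if exit_code == 0 or not nonzero_is_failure:
        return "Done-ok"
    return "Done-failed"  # this outcome would trigger resubmission


statuses = {
    "always_ok": final_status(2, nonzero_is_failure=False),
    "strict": final_status(2, nonzero_is_failure=True),
    "clean": final_status(0, nonzero_is_failure=True),
}
```

In both cases the exit code itself would still be reported to the user. Only the status, and with it the resubmission, changes.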

  15. ISB/OSB transfer errors • In release 1.x a job is considered failed (and therefore resubmission is attempted) if the JobWrapper detects errors when transferring an ISB/OSB file between the RB node and the WN • But the failure could simply be due to a user’s error when writing ISB/OSB expressions in the JDL … • And what if the job crashed because of “internal” problems and therefore some OSB files were not produced ? • Is it ok to mark the job as failed and re-attempt the submission, or is it better to consider the job as done-ok ? • Approach in release 2.0 • JobAdapter should check and issue globus-url-copy only for ISB/OSB files which exist (simple for OSB, a bit more complex for ISB) and/or globus-url-copy errors ignored ?
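
For the OSB side, which the slide calls the simple case, the “copy only files which exist” idea could be sketched as follows. The code only builds the globus-url-copy command lines (source URL, then destination URL) and does not run them. The `osb_copy_commands` helper, the file names and the destination URL are hypothetical.

```python
import os
import tempfile


def osb_copy_commands(workdir, osb_files, dest_url):
    """Build globus-url-copy commands only for OSB files that exist."""
    commands, missing = [], []
    for name in osb_files:
        local = os.path.join(workdir, name)
        if os.path.isfile(local):
            commands.append(
                ["globus-url-copy", "file://" + local, dest_url + "/" + name]
            )
        else:
            missing.append(name)
    return commands, missing


# Simulate a job working directory in which only one of two OSB files exists.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "out.log"), "w") as fh:
    fh.write("done\n")

commands, missing = osb_copy_commands(
    workdir, ["out.log", "histo.root"], "gsiftp://rb.example.org/sandbox/job-1"
)
```

Returning the missing names separately leaves the policy question open: they can be logged and ignored, or used to mark the job as failed.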

  16. Accounting integration • What exactly happens at J+27 (“Accounting infrastructure”) ? • And later, after release 2.0 (“Full integration of cost estimation/accounting into scheduling policies”) ? • Dependencies and interfaces with other components and other WPs at J+27 and later ?

  17. Host vs user proxies • Can we rely on user’s proxies instead of host proxies for authentication when possible, as recommended ? • E.g. in LB logging • Other cases ?
