CREAM Massimo Sgaravatto – INFN Padova On behalf of the CREAM cluster of competence
Current status • CREAM CE released for production in EGEE in Oct 2008 • Since then, regular updates with bug fixes and improvements • As of May 4: 22 CREAM CEs (~200 CEIds) published in the EGEE production BDII • Used in particular by ALICE • They report good results in terms of reliability and performance • ICE (which enables submission to CREAM through the WMS) has also been released, more recently than CREAM, although some scalability issues remain • A version of ICE that is much better than the one in production is in certification • It is the one tested in the PPS pilot testbed • It still has some other scalability problems (bug #47911) • Being addressed • Several problems with testing • CMS is starting some tests submitting to CREAM via the WMS EGEE JRA1-SA3 All-hands meeting - Cyprus, May 6-8, 2009
Current status: more details • Some patches recently released in production • The ones tested in the CREAM PPS pilot • Patch #2748: CREAM, CEMon, BLAH (glite-CREAM) • Bug fixes • Patch #2845: CREAM & CEMon client for SL4 (UI, WMS) • Bug fixes & IPv6 compliance • Patch #2750: yaim-cream-ce (glite-CREAM) • Bug fixes • In certification • Patch #2875: UI for sl5_x86_64 • Includes CREAM and CEMon client • Same software tag as patch #2845 • Patch #2966: CREAM & CEMon client for VOBox • Same content as patch #2845 • Patch #2597: WMS • Includes new ICE (the one tested in the CREAM PPS pilot) • It had been agreed that the new ICE would be released quickly in production together with the other CREAM patches, but this did not happen
LCG-CE to CREAM-CE • “Sites are encouraged to deploy a CREAM CE in parallel to their LCG CE” • Criteria that must be met to start the transition from LCG-CE to CREAM have been defined • http://twiki.cern.ch/twiki/bin/view/LCG/LCGCEtoCREAMCETransition • Functionality and performance criteria • Details of how, when and where to run (some of) these formal tests are being finalized • Activity b of Phase 3 of CREAM PPS pilot • Joint SA1/SA3 effort • https://twiki.cern.ch/twiki/bin/view/LCG/PpsPilotCream#ActivityB • First tests (“at least 5K simultaneous jobs per CE node”) are being started
Submission to CREAM from CondorG • One requirement to be fulfilled is submission to CREAM via CondorG • At CHEP Sanjay Padhi (CMS US) reported that they have done it, but they see a high failure rate in their tests • Not reported before • Problems with proxy delegation • We are not aware of such problems • We installed a CREAM CE and made it available to them for debugging these problems • Still waiting for Sanjay’s feedback
Workplan • Premise • CREAM and related software components are pretty new • Most of the time will very likely have to be spent on support • Also support for OSG, as far as CEMon is concerned • Not much feedback so far • We expect more in the future • The current plan (specified in Savannah) may change significantly in the future • Note • We are still assuming the current model (the one proposed for EGEE III year II is not fully clear and will be discussed tomorrow) • E.g. as “expected date” we are considering the date on which the patch is released for certification
Release 1.5 • Task #9732 • Expected date: May 2009 • Patch #2666 (Fourth update of CREAM CE for slc4/i386 platform) • Release notes • Several bug fixes • Porting to voms-api-java (task #7744) • This also means that VOMS server certificates will no longer be needed on the CREAM CE node (.lsc files will be enough) • voms-api-java (patch #2771) must be released in production first • First release of new BLAH parsers for LSF and PBS • Use of the batch system status/history commands instead of parsing the log files • Choice between old and new parser made at configuration (yaim) time • IPv6 compliance for BLAH (task #8825)
Release 1.6 • Task #9734 • Expected date: July 2009 • Release notes • Bug fixes • Proper management of error codes and error messages (task #9295) • Task recently added as requested by the management • Proper project-wide guidelines must be defined first • glexec sudo (task #9557) • Replace glexec calls with sudo calls • Long discussions on whether this is acceptable from a security point of view • Eventually discussed and approved by MWSG and SCG • glexec will be used only once per job submission, just to get the local user to be used in the sudo calls • Eventually this local user will be given by the new AuthZ service • Besides improving performance and reducing dependencies, this should facilitate the migration to the new AuthZ service • The same needs to be done in BLAH
Release 1.7 • Task #9735 • Expected date: October 2009 • Release notes • Bug fixes • Move to new AuthZ Service (task #7746) • Depends on: • Availability of new AuthZ service (task #7718) and its “maturity” (to be verified for the glexec on WN use case) • Integration of gridftpd with new AuthZ Service (Chad) • glexec sudo (task #9557) • BES and JSDL v. 1.0 support (task #7739) • Since BES and JSDL are not really usable for production activities, we are not very active on this task • Much more effort goes into following PGI activities (task #9290) • Goal: definition of appropriate profiles needed for production use
Release 1.8 • Task #9736 • Expected date: January 2010 • Release notes • Bug fixes • Support for bulk job submissions (task #7740) • Submission of multiple jobs to a CREAM CE via a single call • Also (in particular) for submission through WMS/ICE
Release 1.9 • Task #9738 • Expected date: April 2010 • Release notes • Bug fixes • CEMon backend refactoring (task #7747) • Problems with JNDI based backend • Performance problems • Difficult to maintain • Already discussed and agreed with OSG people • RDBMS (MySQL) for CEMon in CREAM CE • Light embedded DB (e.g. Derby) for CEMon in OSG • Some support for high availability/scalability CE (task #7742) • Requested in particular by CERN people • To support a pool of CREAM CE machines seen as a single CREAM • Preparing a proposal describing different possible options
IPv6 compliance • Bugs opened by Mario Reale • CREAM and CEMon clients: fixed and already released in production (task #7801) • CREAM and CEMon server: no bugs opened • BLAH: fixed in CVS. Will be released with release 1.5 (patch #2666) (task #8825) • As agreed, we have not done any tests on IPv6 • To be done by SA2 • As far as I can understand, support for IPv6 is still missing in several packages that we depend on (e.g. gridsite, gsoap-plugin, voms) • They cannot test much right now (for both CREAM client and CREAM server)
Porting to new platforms • Our understanding of requirements • CREAM and CEMon client on sl5_x86_64 (task #9289) • Hopefully done (patch #2875 in certification) • CREAM CE on sl5_x86_64 (task #9288) • org.glite.ce.* ~ already builds (at least if you just build org.glite.ce) • No tests performed yet, also because not all the software components needed for the CREAM-CE node build for SL5_x86_64 • CREAM and CEMon client on MacOS X (task #9293) • CREAM and CEMon client on sl5_ia32 (task #9292) • CREAM and CEMon client on deb4_x86_64 (task #9291) • WMS (and therefore ICE) on sl5_x86_64 (task #9429) • Issues • Not clear by when this is required • No clear deadlines given, apart from the UI on SL5_x86_64 • Not completely up to us (in the CREAM CE there isn’t only our software) • Are these all (and the only) platforms we will have to support? • Can other platforms be supported if asked by some customers?
Documentation • Everything available on the CREAM web site • Doc for users • CREAM CLI documentation • CREAM JDL documentation • CREAM C++ API documentation and tutorial • … • Doc for admins • Installation and configuration guides • Description of CREAM control mechanisms • Info for troubleshooting • … • Trying to keep it updated • Recently added: howto on forwarding requirements to the batch system
Other tasks • Task #7743: Better integration between CREAM and LB (LB events logged also by CREAM) • Depends on task #7638 ([LB] Support native CREAM jobs) • But its expected date is 30/04/2010 • Not by the end of EGEE-III … • Better support for MPI jobs • See MPI WG activities • Still to be checked whether the mechanism to forward requirements to the batch system (via BLAH) is sufficient • Recent requests to use CEMon for the ALICE dashboard • They would like CEMon to notify the dashboard about CREAM job status changes • Still discussing with the relevant people
Current modus operandi • We are responsible for development and maintenance of • CREAM: INFN Padova • CEMon: INFN Padova • BLAH: INFN Milano • ICE: INFN Padova • yaim-cream-ce: INFN Padova • Software is usually released in the form of: • Patches for CREAM and CEMon client • To be installed on the UI, WMS and VOBOX nodes • Patches for BLAH, CREAM and CEMon server • To be installed on the glite-CREAM node • Patches for yaim-cream-ce • To be installed on the glite-CREAM node • Patches for WMS (or only the ICE component) • To be installed on the WMS node
Used procedures: precertification • When it’s time to finalize a patch • After developers’ tests • Software is tagged and ETICS confs are locked • A specific script (which increments the version numbers in the ini files, performs the CVS tags and creates the ETICS confs) is used • RPMs (taken from the ETICS permanent repository) are installed for testing in the testbed • Small testbed (testbedA) • Larger testbed (testbedB) • 7 CREAM CEs with Torque @ INFN-Padova, 7 CREAM CEs with LSF @ INFN-Padova, 7 CREAM CEs with LSF @ INFN-CNAF • Used in particular for testing submission via the WMS • Performed tests • Functionality tests • Implemented a testsuite for CREAM and CREAM-CLI • Tests to check whether the bugs specified in the patch are really fixed • Precertification report attached to patch • Still missing • Real regression tests • Performance tests (GRNET is working on that)
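The version-increment step of such a script could look roughly like the Python sketch below. It is only an illustration: the actual script is not shown in these slides, and the ini layout, the section name `component` and the key `version` are hypothetical placeholders.

```python
import configparser
import io


def bump_last_field(version: str) -> str:
    """Increment the last dot-separated numeric field, e.g. '1.12.3' -> '1.12.4'."""
    parts = version.split(".")
    parts[-1] = str(int(parts[-1]) + 1)
    return ".".join(parts)


def bump_ini_version(ini_text: str, section: str = "component",
                     key: str = "version") -> str:
    """Return ini_text with the version stored under [section] incremented.

    The section and key names are assumptions made for this sketch.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    parser[section][key] = bump_last_field(parser[section][key])
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

Tagging in CVS and creating the ETICS configurations would follow as separate steps and are left out here.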
Feedback on tools and procedures • CVS • No major problems with it • We do not like the CVS notification mechanism at all • Simba mailing lists for the existing (but not all) subsystems • Not flexible • You have to ask someone to create a new mailing list for the subsystem you are interested in • It is not possible to be notified about commits for just a specific component or a specific directory tree • E.g. I am interested in just org.glite.yaim.cream-ce and not the whole org.glite.yaim subsystem • The approach used for the CVS @ IN2P3 for EDG was much better • .cvsnotify files containing e-mail addresses • All listed people received notifications for commits done under that directory
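The per-directory notification idea can be sketched in a few lines of Python. This is not the EDG implementation: the function name, the in-memory mapping that stands in for the .cvsnotify files, and the example paths and addresses are all hypothetical, and it assumes that a .cvsnotify file applies to the whole directory tree below it.

```python
from pathlib import PurePosixPath


def recipients_for_commit(committed_path, notify_files):
    """Collect the addresses to notify for a commit touching committed_path.

    notify_files maps a directory path to the list of e-mail addresses
    found in that directory's .cvsnotify file (hypothetical structure).
    Every directory that contains the committed file is consulted.
    """
    recipients = set()
    for ancestor in PurePosixPath(committed_path).parents:
        recipients.update(notify_files.get(str(ancestor), []))
    return sorted(recipients)


# Hypothetical repository layout and addresses, for illustration only.
notify_files = {
    "org.glite.yaim.cream-ce": ["cream-dev@example.org"],
    "org.glite.yaim.core": ["yaim-core@example.org"],
}
print(recipients_for_commit("org.glite.yaim.cream-ce/config/services.conf",
                            notify_files))
```

With such a scheme, someone interested only in org.glite.yaim.cream-ce adds an address to that one directory and receives nothing for the rest of org.glite.yaim.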
Feedback on tools and procedures • ETICS • Powerful tool but too complex • The average user does not have (and doesn't want to have) a deep knowledge of the system • He just wants to be able to manage his use cases • Very often only a few people are able to understand the reasons for some problems/behaviors • E.g. GGUS #45622 (ETICS client problem) • E.g. why does configuration xyz build against a certain project config, and no longer build after locking? • E.g. why does org.glite.ce.common-java build if I build just the org.glite.ce subsystem, while it doesn’t build if I try to build the whole org.glite? • If even the release manager complains about that, there is a problem …
Feedback on tools and procedures • ETICS • Testing of new ETICS versions should probably be improved • It happened more than once that major problems were introduced with new versions • Support is not always very effective • E.g. GGUS #45622 • Opened on Jan 27, 2009 (high priority) • Solution found on March 3, 2009 (several “pings” were needed) • Still open (i.e. while waiting for the new client we have to do some hacks by hand if we need to run the client on a CREAM CE machine) • Not very clear if/when some of the requested features will be provided • E.g. https://twiki.cern.ch/twiki/bin/view/ETICS/SA25InterviewResults • Is there an ETICS workplan available somewhere?
Feedback on tools and procedures • ETICS • The main problem was that no clear directives were given about how ETICS should be used in gLite • E.g. specification of dependencies • Static dependencies or properties • Clear, well documented and “bomb-proof” guidelines, recipes and tools should be given to developers and “internal” integrators to manage their use cases • Should someone check whether the configurations released for certification are compliant with these guidelines? • Also via some automatic tool?
Feedback on tools and procedures • Savannah • It is not used in a “homogeneous” way in the project • Some track everything via Savannah • This is what we do • Basically each commit refers to a Savannah “bug” • For some other components it is used only for bugs submitted by users • Procedure for closing bugs should be improved, otherwise many bugs stay open even if they have been fixed • When a patch goes to production, the bugs should go to status “Ready for Review” • Foreseen but this doesn’t always happen • When a bug goes to “Ready for Review”, it should be assigned to the person who submitted it • Otherwise it is difficult to understand which bugs you are supposed to verify • Not foreseen (bugs stay assigned to “egeetest”) • Not even technically possible, if that person is not part of the JRA1 MW Savannah group
Feedback on tools and procedures • GGUS • Saying that it is a “best effort” or a “voluntary basis” activity doesn’t make much sense • Right now there is just a “workload management” support unit which includes only the WMS developers • Single “job management” support unit or multiple support units (one for WMS, one for CREAM, one for BLAH, etc.)? • Some problems with the procedure explained by Diana at the last AH meeting • E.g. interactions between GGUS and Savannah • We should just put the Savannah bug number in the GGUS ticket • As far as I can see GGUS is not taking care of the rest, as it is supposed to • E.g. filling of GGUS field in the Savannah bug • E.g. updates of Savannah bug logged in GGUS ticket
Feedback on tools and procedures • Current certification process • A patch is released for certification • The ETICS conf. has been built and locked against the glite_branch_3_1_0 project config • The RPMs are available in the permanent ETICS repository • The new RPMs are installed on the relevant node types where certification tests are performed • When the patch is released for production, glite_branch_3_1_0 is updated with the new ETICS conf. of that patch • Not suitable for all scenarios • Just testing the new RPMs doesn’t always mean testing the new software • E.g. consider the recent trustmanager and util-java patch • Some of the jars in use are the ones installed via the RPMs • But other jars in use are included in the webapp wars (e.g. CREAM, CEMon, FTS), so they are fixed at build time • The new trustmanager and util-java are really used everywhere only when the involved RPMs are deployed AND the relevant components are built against the new versions
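The trustmanager/util-java case can be made concrete with a small Python model. It is a sketch under stated assumptions, not a description of the real packaging: the class, the function names, the component names and the version strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    bundles_jar: bool    # True if the library jar is packed inside the component's war
    built_against: str   # library version that was present at build time


def effective_version(component, deployed_rpm_version):
    """Library version the component actually uses at run time."""
    if component.bundles_jar:
        # A jar inside the war was frozen when the component was built.
        return component.built_against
    # Otherwise the jar installed by the library RPM is used.
    return deployed_rpm_version


def still_on_old_version(components, deployed_rpm_version):
    """Names of components that do not yet use the newly deployed version."""
    return [c.name for c in components
            if effective_version(c, deployed_rpm_version) != deployed_rpm_version]


components = [
    Component("webapp-with-bundled-jar", bundles_jar=True, built_against="1.0"),
    Component("service-using-rpm-jar", bundles_jar=False, built_against="1.0"),
]
print(still_on_old_version(components, deployed_rpm_version="1.1"))
```

In this model, deploying the new library RPM changes nothing for the first component until it is rebuilt against the new version, which is exactly the case that installing and testing only the new RPMs does not exercise.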
Feedback on tools and procedures • org.glite.ce subsystem • Includes CREAM, CEMon and BLAH • Includes both server (to be installed on the CREAM CE) and client (to be installed on the UI and on the WMS) • Doesn’t fit well with the current software organization • Specifying a whole org.glite.ce subsystem configuration for e.g. a CREAM server patch doesn’t make much sense • Which confs should be specified for the CREAM client components? • Willing to consider node type (metapackage) configurations instead • How can these metapackage configurations be kept synchronized with the versions of the software used in production?
Feedback on tools and procedures • Time for a patch to go in production is very long • 1-3 months • Most of the time was spent not in the certification itself, but waiting for the patch to start certification and waiting for it to be deployed in the PPS after having been certified • Wasn’t precertification supposed to address this issue? • Not a very flexible procedure • E.g. I simply forgot to add “VOBOX” in the “affected metapackage” field of a CREAM client patch, and a new patch had to be created! • And it has to follow the usual (long!) procedure • At any rate it should be better with the new organization
Other feedback • Coordination and communication should definitely improve • E.g. management of non-backward compatible changes • E.g. new jobid in gLite 3.2 • E.g. porting to SL5/glite 3.2 • It feels like JRA1 and SA3 management have different views (and priorities) • E.g. “dependency challenge” done some time ago • Different opinions and different outcomes by different reviewers • We heavily modified all our dependencies based on that review, and now it turns out that we have to modify them again • E.g. release in production of the CREAM and ICE software used in the PPS pilot • The agreement was to release it in production in a short time, after a quick certification, but this didn’t happen
Other feedback • Rules and guidelines • The few defined rules and guidelines are not always enforced • E.g. update of RPMs of a patch during its certification process • Wasn’t it decided that this should not happen and that instead the patch has to be rejected/obsoleted and a new one created (via the cloning Savannah tool)? • Not always done • Up to developers and/or certifiers • Dependencies on other gLite components • We have several dependencies on other gLite components • It feels like in some cases people don’t feel committed to supporting these components, if the raised issues are not relevant for them • Afraid that it will be even worse in the future
New organization for EGEE-III year II • To be discussed tomorrow, but it is not clear how the proposed “one product team per node” model can fit with our model • In a CREAM CE there are a lot of other software components which we don’t implement and maintain • E.g. yaim-core, voms, lcas, lcmaps, glexec, etc. • Saying that everything in the CREAM CE node will be under our full control is not really true • What about CREAM related software components not installed in the CREAM node? • CREAM client (installed in the UI, VOBOX and WMS) • ICE (installed in the WMS)