Learn about the experiences and services provided by CNAF, PIC, CESGA, UPATRAS, and IFIC in deploying gLite in the PPS. Discover the challenges, improvements, and documentation available to support the deployment process.
Experience on Deploying gLite in the PPS
Contributions from CNAF (Italy), PIC (SWE), CESGA (SWE), UPATRAS (SEE), IFIC (SWE)
CNAF Experiences on the PPS
M. Selmi (INFN-CNAF), D. Cesini (INFN-CNAF)
CNAF Participation in the PPS CNAF provides the following services to the PPS: • 1 gLite site (1 CE, 2 WN, 1 I/O Server, 1 R-GMA Server) • 1 gLite CE that uses the LCG T1 farm (150 job slots) • 1 gLite WMS+LB (pull mode) • 1 gLite VOMS Server (DTEAM, LHC VOs, CDF, TROI, SCOTT, CHECOV, DILIGENT) • 1 gLite UI • 1 APT repository for SA1-certified gLite 1.3 and gLite 1.4
Notes: • Supported VOs: DTEAM, ATLAS, CMS, LHCB, ALICE, CDF, PICARD, RIKER, CRUSHER, TROI, SCOTT, CHECOV, MCCOY, SEVENOF9, DILIGENT • Data Management at CNAF is still not working; it will be fixed during the next upgrade.
[Diagram labels: cert-ce-03 (gLite 1.3 CE) with the T1 production farm (uses LSF, 150 job slots); cert-rb-01 (gLite 1.2 WMS+LB), all PPS CEs in pull mode; cert-voms-01 (gLite 1.3 VOMS Server); gLite 1.2 site; pre-ui-01 (gLite 1.4 UI); APT repository for certified 1.3 and 1.4; cert-mon (gLite 1.2 R-GMA Server)]
CNAF Experiences on PPS (1/3) Notes on installation The two-script (install/configure) and apt+script methods are both good, but some improvements are needed to speed up installation: • An upgrade procedure should be made available • An easy, semi-automatic way to add VOs to an existing configuration is needed • The large XML configuration files should be split into “changeme” files and “Advanced configuration, change if you know what you are doing” files. • Many common parameters for the PPS (e.g. VOMS server hosts/ports, catalogue endpoints, advanced configuration) could be decided centrally and deployed to all sites (e.g. through customized templates or a common public PPS XML file).
CNAF Experiences on PPS (2/3) Notes on Support and Documentation • Very good support is available through the PPS and glite-discuss mailing lists • Quick feedback is also available through the Savannah bug tracking portal • The PPS wiki pages are useful for finding other sites' configuration files • The documentation is good, but some XML parameters should be explained in greater detail in the configuration files or in the installation guides
CNAF Experiences on PPS (3/3) General Remarks • Having certification carried out before deploying middleware to the PPS (started with gLite 1.3) is very useful. • The PPS has reached a number of sites that gives it a kind of inertia: with a new release about every month (or less), and considering also the certification time, the PPS inevitably becomes obsolete. Experiments want “the latest feature” or shift their tests elsewhere! How can we avoid this? Furthermore, continuous upgrades concentrate a lot of effort on the installation process and take time away from testing what the PPS has installed. • Testing (or making available for testing) gLite services connected to the LCG production infrastructure (e.g. WMS + BDII, CE + Production Farms) should be one of the PPS activities.
PIC Experiences on the PPS
Carlos Borrego Iglesias, Gonzalo Merino
PIC Experiences on PPS (1/3) • PIC's pre-production cluster includes (since May ’04): • gLite IO + SRM Server - se02.pic.es • gLite CE (+ ~150 LCG WN) - ce02.pic.es • gLite FireMan server - fs01.pic.es • gLite WMS - rb02.pic.es • gLite R-GMA server - rgma01.pic.es • gLite UI - ui03.pic.es • At the moment all services are running gLite 1.3
PIC Experiences on PPS (2/3) • In the initial phase (gLite 1.0), it was difficult to deploy services: • Lack of documentation • Poor support channel • Error messages were not useful • often too cryptic • many reasons could cause the same error message • At the beginning, updating the services was a complete disaster • Machines were forced to be re-installed from scratch
PIC Experiences on PPS (3/3) • Why things got better: • CESGA's wiki page • real example configuration files • end-points for core services • glite-discuss mailing list • "apt" was (finally) chosen as the default installation tool • The update to version 1.3 was straightforward! • We have great expectations for gLite 1.4!
CESGA Experiences on the PPS
Francisco Jose Bernabe Pellicer, Javier Lopez Cacheiro
gLite Services @ CESGA • gLite: • CE – pps-ce.egee.cesga.es (1.4) • WN – pps-wn001.egee.cesga.es (1.4) • R-GMA Registry/Schema – pps-rgma-server.egee.cesga.es (1.3, Core Service) • R-GMA Server – pps-mon.egee.cesga.es (1.3) • UI – ui.egee.cesga.es (1.4) • Wiki: • Public Area: • http://pps-public-wiki.egee.cesga.es/cgi-bin/moin.cgi • Private Area: • https://pps-private-wiki.egee.cesga.es/cgi-bin/moin.cgi
CESGA bad experiences • Too many configuration parameters and some of them are not well documented. • Difficult to know if a failure is due to an existing bug or a misconfiguration. • New services and changes to existing ones have forced rewrites of the configuration files each release. • More testing is needed before releasing a new version. • Suggestions: • Periodically update the list of bugs in the Release Notes so it is easier to find them (Savannah’s list is too long)
CESGA good experiences • The documentation is improving a lot (the sections of the Release Notes about the new parameters and the open bugs are very useful!). • A lot of new functionality. • JRA1 people are doing a good job. Thanks!!
UPATRAS Experiences on the PPS
Georgios Goulas, Andreas Alexopoulos
UPATRAS: Site & Overall Experience • UPATRAS PPS Site • 1 CE, 3 WNs, 1 R-GMA Server • Core Service: WMS (push mode) • gLite Installation • Stable enough, improving over time • Small site installation: ~ 1-2 days (expect a week the first time) • Job Submission • Stable, no significant problems • We expect VOMS integration for the WMS • Data Management • Promising and interesting features • Ongoing development; problems sometimes arise (bugs, changes, misconfigurations due to evolution)
UPATRAS: Some Remarks • XML configuration files are lengthy • “Advanced”, “System” and “User” parameters are mixed together in the XML configuration files • Proposal: split them into “Site”, “EGEE” and “System” • No “uninstall” feature • Substantial changes can cause reconfiguration to fail • Proposal: can the configuration remove all generated files? • Very helpful comments in the XML files, a whole manual by themselves • Documentation is good and getting better • Great support and quick reactions (“quick fixes”) from JRA1. Many thanks!
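One way to picture the proposal that the configuration should be able to remove all generated files is a manifest that records each file as it is written. The sketch below is hypothetical: the class name, the manifest format, and the file names are invented for illustration and say nothing about how the gLite configuration scripts actually work.

```python
import json
import os
import tempfile


class GeneratedFileManifest:
    """Records every file a (hypothetical) configure step writes, so it can be undone."""

    def __init__(self, manifest_path):
        self.manifest_path = manifest_path
        self.paths = []

    def write_file(self, path, content):
        """Write a generated file and remember its path in the manifest."""
        with open(path, "w") as handle:
            handle.write(content)
        if path not in self.paths:
            self.paths.append(path)
        # Persist the list so a later run can still find what was generated.
        with open(self.manifest_path, "w") as handle:
            json.dump(self.paths, handle)

    def remove_generated(self):
        """Delete every recorded file that still exists; return the removed paths."""
        removed = []
        for path in self.paths:
            if os.path.exists(path):
                os.remove(path)
                removed.append(path)
        self.paths = []
        if os.path.exists(self.manifest_path):
            os.remove(self.manifest_path)
        return removed


# Illustrative run in a scratch directory (file names are made up).
workdir = tempfile.mkdtemp()
manifest = GeneratedFileManifest(os.path.join(workdir, "generated.json"))
manifest.write_file(os.path.join(workdir, "service.conf"), "port=8443\n")
manifest.write_file(os.path.join(workdir, "env.sh"), "export EXAMPLE_HOME=/opt/example\n")
removed = manifest.remove_generated()
print("removed:", removed)
```

Starting a reconfiguration from a clean state in this way would avoid leftovers from an earlier release interfering with the new configuration, which is the failure mode the slide describes.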
IFIC Experiences on the PPS
Alvaro Fernandez, Javier Sanchez
IFIC Experiences on PPS (1/2) • gLite on the PPS • Configuration: • Currently OK with the provided procedures (APT + config. scripts) • It can take some time to understand all the configuration parameters and to locate the service names (the PPS mailing list and PPS wiki are useful) • Documentation: • Fine, but it should state the differences between the current release and the architecture documents • Usage: • It seems that “standard” usage is working for everyone • (From IFIC) we can’t successfully submit jobs with WMS+Fireman+InputData • FTS on the PPS needs more testing • Some WMS machines are unstable • We don’t receive many jobs: only people doing tests? No stress testing? Where are the real users? (CMS has started using it.) Do ATLAS, BIOMED, ... have their own resources?
IFIC Experiences on PPS (2/2) • MIDDLEWARE SERVICES: • Concerns about the data security model: • We have a production CASTOR SRM interfaced through the I/O Server (this involved minor changes: service auth). FTS also needs user auth to work (possible inconsistencies when accessing data). • Our SRM is not publishing a specific gLite/PPS configuration, but it seems fine • We would like to see one service (controlled by a site) to access data. Is gLite I/O enough? • WMS/CE: will the WMS be the (only) “enforced” way to submit jobs? • PREPRODUCTION SERVICE • Need more real users to stress test (also, some functionality seems not to be used; we believe everything is running fine, but some errors are detected late) • Will apps be ready for merging gLite into production? • Need SFT, accounting, and a more coordinated upgrade procedure (some sites fail during upgrade without notice)