CIC Portal/COD Activities. Hélène Cordier IN2P3/CNRS Computing Centre, Lyon, France. Contents. CIC Portal Usage : who/how Latest Release Portal Characteristics On-going developments CIC portal overview for COD Statistics and results Working groups Zoom on Failover. Use tools.
 
                
CIC Portal/COD Activities Hélène Cordier IN2P3/CNRS Computing Centre, Lyon, France
Contents • CIC Portal Usage : who/how • Latest Release Portal Characteristics • On-going developments • CIC portal overview for COD • Statistics and results • Working groups • Zoom on Failover
Use tools
Each actor can use a set of operational tools (provided, integrated or interfaced).
[Diagram: the actors (site, user, VO manager, operator, regional center) reach their tools through the CIC Portal. Activities shown: communicate; report on site activity, submit tests, configure; manage static information about my VO; track, report, diagnose and follow up problems.]
The 8th IEEE/ACM International Conference on Grid Computing (Grid 2007) 22/08/2014
[Chart: average connections, Dec 2004 to Dec 2007] What do people connect to the CIC portal for?
Tasks handled by the CIC portal development team between February 2007 and January 2008
Latest changes in 6 months
• Latest technical changes
  - Authentication is now based on the full certificate DN instead of the CN
• Work on VO ID cards
  - Changes in the database schema for VO/VOMS information
  - VO ID card interface improved
  - Integration of the YAIM VO Configurator into the CIC portal
  - Downloadable XML dump of VO ID card information
• Scheduled downtimes procedure
• Integration of the regional first-line support dashboard: prototype with CE
On-going developments
What is left for the next release in March
• 2159 Adapt to new components released into production, cf. the YAIM tool.
• 1559 Develop a new version of the report, taking the feedback received into account.
• 1920 Follow the SAM migration to GridView on the CIC portal side (status: IDLE)
• Internal tasks include quick fixes and bug fixes, documentation, background clean-up work, and code optimization/prospective work for EGEE-III.
COD activity
ARM Meeting, EGEE’07, Budapest
A tool for Grid Operators: COD dashboard
[Diagram: at the start of EGEE, the operator dealt with MANY ENTRY POINTS (monitoring tools #1 to #n, sites info, mail client, ticketing system). Now the dashboard is a SINGLE ENTRY POINT in front of monitoring tools #1 to #n, sites info, mail sender and ticketing system.]
Interaction with EGEE services
[Diagram: the OPERATIONS PORTAL and the EGEE services it talks to. Locations shown: IN2P3-CC, Lyon, France; FZK, Karlsruhe, Germany; CERN, Geneva, Switzerland; ASGC, Taipei, Taiwan.]
• GGUS (SOAP): view ticket, create ticket, update ticket
• SAM (XSQL-based service): test results on nodes
• GOC-DB (SQL queries): site info, scheduled downtimes
• Gstat (http): GIIS status per site
• Example portal view: Site1 status, status, ticket #28; Site2 status, status, ticket #32; Site3 status, status, no ticket; Site4 status, status, ticket #14
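The per-site view on this slide (two status columns plus an optional ticket) can be sketched as a simple merge of what each service reports. This is only an illustration, not the portal's actual code: the function names, the status strings and the assumption that the two status columns come from SAM and Gstat are hypothetical; only the site names and ticket numbers are taken from the slide.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SiteRow:
    site: str
    sam_status: str        # hypothetical: test results on nodes
    gstat_status: str      # hypothetical: GIIS status per site
    ticket: Optional[int]  # open ticket number, if any


def build_dashboard(sam: Dict[str, str],
                    gstat: Dict[str, str],
                    tickets: Dict[str, int]) -> List[SiteRow]:
    """Merge per-service results into one row per site."""
    rows = []
    for site in sorted(set(sam) | set(gstat)):
        rows.append(SiteRow(site,
                            sam.get(site, "unknown"),
                            gstat.get(site, "unknown"),
                            tickets.get(site)))
    return rows


def format_row(row: SiteRow) -> str:
    """Render one row in the 'status, status, ticket' layout of the slide."""
    ticket = f"ticket #{row.ticket}" if row.ticket is not None else "No ticket"
    return f"{row.site}: {row.sam_status} / {row.gstat_status} / {ticket}"


# Status values are made up; ticket numbers follow the slide.
sam = {"Site1": "error", "Site2": "error", "Site3": "ok", "Site4": "warning"}
gstat = {"Site1": "ok", "Site2": "error", "Site3": "ok", "Site4": "ok"}
tickets = {"Site1": 28, "Site2": 32, "Site4": 14}

rows = build_dashboard(sam, gstat, tickets)
for row in rows:
    print(format_row(row))
```

In a real deployment each dictionary would be filled by the corresponding service call (SOAP, SQL, XSQL or http); here they are hard-coded so the sketch runs on its own.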
COD Duties
• Rotation among 10 federations/teams: one week on duty out of five.
• Quarterly face-to-face meetings to update tools and procedures and to harmonize working habits.
• 10 federations over 18 months in EGEE-I
• Working groups for over 18 months now
There is more to it...
• Working groups with a straightforward mandate: GSTAT (TW), SAM (CERN), SAMAP (CE)
• Topped by Tools for Improvement for COD, TIC (CE), since EGEE’07
Working groups mandate
• Integration of the existing tools, CIC (FR): integration platform for all COD tools, to ease the daily operational job
• Improvement of BEST PRACTICES (DE-CH): identify, raise and analyse with COD how to achieve homogeneous operations
• Release of updated documentation, OPM (SE): documentation under constant evolution
• Set-up of failover mechanisms for GRID CORE SERVICES (SWE): what is done at the federation level, what is done at the project level (help needed from J. Shiers's group), what could be done (operational point of view) and what is needed at the ROC/site level (from a m/w point of view)
• Set-up of a high-availability strategy for the operational tools used by CODs, FAILOVER (IT)
Failover working group: Zoom on Failover for Operational Tools
EGEE Failover: purpose
• Propose, implement and document failover procedures for the collaboration, management and monitoring tools used in the EGEE/WLCG Grid.
• The solution is based on DNS and consists of:
  - mapping the service name to one or more destinations
  - updating this mapping whenever a failure is detected
• Reference: Geographical failover for the EGEE-WLCG Grid collaboration tools, CHEP 2007, Victoria BC, Canada (September 2007)
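The two DNS steps named on this slide (map a service name to one or more destinations, update the mapping when a failure is detected) can be sketched as follows. This is a toy model under stated assumptions, not the working group's implementation: the `FailoverMap` class, the host names under example.org and the health-check callback are all hypothetical, and a dictionary stands in for the DNS zone.

```python
from typing import Callable, Dict, List


class FailoverMap:
    """Toy stand-in for a DNS zone: one service name, ordered destinations."""

    def __init__(self, records: Dict[str, List[str]]) -> None:
        self.records = records
        # Initially every service name points at its first destination.
        self.active = {name: hosts[0] for name, hosts in records.items()}

    def resolve(self, name: str) -> str:
        """Return the destination the service name currently maps to."""
        return self.active[name]

    def check_and_update(self, name: str,
                         is_healthy: Callable[[str], bool]) -> str:
        """Re-point the name at the first healthy alternative on failure."""
        current = self.active[name]
        if is_healthy(current):
            return current
        for host in self.records[name]:
            if host != current and is_healthy(host):
                self.active[name] = host
                return host
        # No healthy destination: leave the mapping unchanged.
        return current


# Hypothetical service with a primary instance and a geographical replica.
down = set()
zone = FailoverMap({"portal.example.org": ["primary.example.org",
                                           "replica.example.org"]})
print(zone.resolve("portal.example.org"))

down.add("primary.example.org")  # simulate a detected failure
zone.check_and_update("portal.example.org", lambda h: h not in down)
print(zone.resolve("portal.example.org"))
```

With real DNS the update step would be a record change on the name server, and clients would only follow it once cached answers expire, so the record TTL bounds how quickly a switch takes effect.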
COD work aspects to keep in EGEE-III
• Dedication: working groups are recognized within federations as a source of expertise, and by federations as the channel for bringing their needs to central operations.
• Collaboration: so far, each federation has found a way to contribute actively to improving the COD work environment, when not leading a working group itself. Every person, tool developer or expert recognized as of "global interest", even though outside the COD scope, has been readily integrated into this "closed community", e.g. SAMAP. The TIC scope is to monitor this aspect, for example with the Nagios prototype.
• Flexibility: the purpose of the groups and their mandate evolve over time as needs arise, e.g. core grid services HA, EGI.
• Anticipation: e.g. the strategy of the Operational Failover Working Group.
• Experiment: e.g. regionalisation of tools and the future modular "NGI dashboards", to widen the CE first-line support experience.
COD work aspects to evolve in EGEE-III
• Mandate and assessment of the COD activity
  - Integration of NDGF/NE as a COD team; other teams?
  - Catch-all and global operations center: which core services are to be monitored centrally, how to monitor them and how to switch properly to backup; how to aggregate local data, and which local data are concerned
  - Define metrics to identify the most problematic m/w components and the recurrently unreliable sites
  - Reliability assessment of the operational tools: ENOC test as a starting point?
  - Strengthen the case for HA/failover of operational tools and grid core services
• Vision of the long-term evolution of the COD tools: one set of tools per federation plus aggregation? Which tools are to be regionalized: SAM, GOC DB, COD, what else? How are they going to interact? => A global schema is needed NOW.
COD work aspects to evolve in EGEE-III
• Leverage on "project labeled" tools so that operational use-cases do not remain "pending":
  - Development strategy and priorities are coherent.
    -- data workflow: synchronization of GOCDB/BDII/SAM/COD
    -- development strategy: depends on the strategy for the long-term evolution of the COD tools
    -- priority decision workflow: who drives the priority of requests on "project labeled" tools, and how, so that operational use-cases do not remain "pending"? Examples: critical tests monitoring/accounting or ARC CE, CA update procedure, need for SAM failover...
  - Staffing is adequate for proper reactivity, not only for bug fixes.
• Interoperability/interoperations (item to be followed up)
  - OSG: rather informal for the moment, BUT NOW users do have problems and sites relay their users' problems, cf. GGUS ticket 31037.
  - NDGF: is there existing critical test monitoring? What are the consequences for operational procedures?
Conclusions and References
• Where, how and when do we address these topics?
• Some can be addressed here or thought through at COD meetings; some are relevant to OCC/ROC first, and the COD working groups can then make suggestions/recommendations.
• References:
  - CIC portal: a Collaborative and Scalable Integration Platform for High Availability Grid Operations, Grid 2007 (IEEE), Austin TX, United States (September 2007)
  - Geographical failover for the EGEE-WLCG Grid collaboration tools, CHEP 2007, Victoria BC, Canada (September 2007)