Operation team at CCIN2P3

  1. Operation team at CCIN2P3
     Suzanne Poulat – suzanne@in2p3.fr

  2. Overview
     • Operation team
     • Organisation
     • Operation’s role
     • Services outside working hours
     • Tools
     • Monitored services
     • Examples

  3. Operation team
     • Two groups: Support and Operation
     • Support (9 people):
        • general user support
        • dedicated staff for the LHC experiments
        • help desk (Xhelp)
        • opening the CC to collaborations and other sciences
     • Operation: details follow

  4. Organisation
     • Ten people in the group:
        • two for Grid coordination
        • four for Operation
        • four operators working in shifts to cover 8 AM to 9 PM, 7 days a week
     • On a weekly basis:
        • one person is assigned to operation (often 1.5)
        • the others work on development, monitoring or administrative tasks

  5. Operation’s role
     • Check the availability of all services (storage, CPU, …)
     • Optimize service usage
     • Ensure that the commitments of CCIN2P3 to the experiments and Grid VOs are respected
     • Organize scheduled shutdowns
     • Coordinate actions during unscheduled downtimes
     • Monitor and manage the tape libraries
     • Create and manage accounts and AFS space
     • Organize the « on duty » service

  6. Services outside working hours
     • On-site night security guard from 6 PM to 8 AM and on weekends
        • no computing actions: alerting and messaging only
     • 1 on-duty engineer (evenings, weekends)
        • corrective actions if possible (documentation, training)
        • otherwise calls an expert … if available
     • Weekends: 1 operator on site (10 AM – 5 PM)
        • first low-level action
        • otherwise calls the on-duty engineer
     • The result is « best effort » coverage
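
The escalation path on this slide can be sketched as a small lookup. This is only an illustration, not a tool used at the centre: the function name `escalation_chain` and the role labels are hypothetical, and the sketch assumes the on-duty engineer covers the same weekday hours as the security guard (6 PM to 8 AM), which the slide does not state precisely.

```python
from datetime import datetime


def escalation_chain(when: datetime) -> list[str]:
    """Return who is involved, in order, for an alert raised at `when`.

    An empty list means normal working hours, when regular staff handle it.
    """
    weekend = when.weekday() >= 5           # Saturday = 5, Sunday = 6
    night = when.hour >= 18 or when.hour < 8
    out_of_hours = weekend or night

    chain: list[str] = []
    if out_of_hours:
        # The guard takes no computing action, only alerting and messaging.
        chain.append("security guard")
    if weekend and 10 <= when.hour < 17:
        # Weekend operator on site tries a first low-level action.
        chain.append("weekend operator")
    if out_of_hours:
        # Assumption: the engineer's window matches the guard's on weekdays.
        chain.append("on-duty engineer")
        chain.append("expert (if available)")
    return chain


# Saturday at 11:00: all four roles can be involved.
print(escalation_chain(datetime(2024, 1, 6, 11, 0)))
# Wednesday at 12:00: working hours, so the chain is empty.
print(escalation_chain(datetime(2024, 1, 3, 12, 0)))
```

Because the last step depends on an expert being reachable, the chain models the « best effort » nature of the coverage rather than a guaranteed response.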

  7. Tools
     • Monitoring tool: NGOP -> Nagios
     • Remote Logging Service: RLS
     • Mails
     • Tickets from local and grid users: Xhelp, interfaced with GGUS at CC
     • Web pages on the current state of services
     • Wiki for documentation, recipes, shutdowns, post-mortem analysis
     • Log of the daily production: ELog
     • Tickets web page for tape and drive incidents (~50 incidents per month: 10 drives, 40 tapes, including 2 with loss of data)
     • Scripts to analyse faulty tapes

  8. Monitored services
     • BQS
     • Storage: HPSS, dCache, AFS
     • Grid: CE, SRM, TOP BDII
     • Databases
     • Others: tape libraries, Saphir (privileges and location of services)
     • Workers and all servers

  9. Nagios

  10. SMURF

  11. Anastasie – Running jobs

  12. Xhelp

  13. Xhelp (2)
     • About 320 tickets per month, i.e. 10 to 20 tickets per day
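
The ticket volume quoted on this slide can be checked with a quick calculation. This is a minimal sketch: the 30-day and 21-working-day month lengths are assumptions, since the slide does not say which kind of day it counts.

```python
TICKETS_PER_MONTH = 320  # figure quoted on the slide


def tickets_per_day(days_in_month: int) -> float:
    """Average daily ticket count for a month of the given length."""
    return TICKETS_PER_MONTH / days_in_month


calendar_rate = tickets_per_day(30)  # assumption: 30 calendar days
working_rate = tickets_per_day(21)   # assumption: about 21 working days

print(f"{calendar_rate:.1f} per calendar day, {working_rate:.1f} per working day")
# prints "10.7 per calendar day, 15.2 per working day"
```

Both averages fall inside the 10 to 20 range given on the slide.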

  14. Xhelp (3)

  15. Implementations
     • Operation wiki
     • Nagios monitoring
     • Ovax
     • Users database interface
     • Robotics incidents
     • On-duty tools

  16. Questions?
