Improving ENOC’s support for CODs, COD-18, Abingdon, UK

Guillaume Cessieux (CNRS, IN2P3-CC / EGEE SA2), 2008-12-03

Presentation Transcript


  1. Improving ENOC’s support for CODs
     COD-18, Abingdon, UK
     Guillaume Cessieux (CNRS, IN2P3-CC / EGEE SA2), 2008-12-03

  2. Outline
     ENOC and COD interactions
     Status of work around network trouble tickets
     DownCollector
       • Assessment
       • Review of last 12 months
     Proposal to handle DownCollector’s troubles
       • Processes
       • Tools’ improvements
     COD18 2008-12-03

  3. EGEE Network Operating Centre
     ENOC
       • Aiming to provide support for: sites, ROCs, CODs
       • Hard to get feedback and requirements from SA1
       • “Two different worlds”...
       • Now real-life background with better vision
     0.5 FTE in EGI, main changes MUST happen before
       • Drop unnecessary things, focus on the useful ones
       • Network support has a wider role than the ENOC in EGI

  4. Current status with COD
     Only DownCollector now seems to be used by CODs [ https://ccenoc.in2p3.fr/DownCollector/ ]
       • Very efficient integration in COD’s dashboard
     SA2 wants to know how to better serve CODs around network support
       • Regarding processes: balance between wait-and-see and over-engineered things
       • Regarding tools and integration: DownCollector, other tools, CIC dashboard, alarms…
     Use background to sketch wise, realistic and useful processes and tools

  5. Around network trouble tickets (1/2)
     Currently TTdrawlight [ https://ccenoc.in2p3.fr/TTdrawlight/ ]
       • Repository of network trouble tickets
       • Not accurate enough and hard to use efficiently
     Network trouble tickets are not a panacea
       • « Главным образом сеть вниз. Будет вверх скоро » (~ « Main router is down. Will be up soon. »)
       • Targeted at a local community
       • But often the only operational information available…
     Strong privacy issues in sharing network trouble tickets
       • No filtering of sensitive information delivered (school, military…)
       • Fear of comparison and competition
       • Knowledge database of network trouble tickets compromised?

  6. Around network trouble tickets (2/2)
     19 NRENs currently sending their tickets to the ENOC
       • EGEE relies on networks from ~50 NRENs + GÉANT2
       • We cover ~80% of European Grid sites
       • 2800 e-mails for 900 tickets/month
       • Really hard to deal with the meaning of tickets (location, duration...)
     Standardisation of network TT?
       • Can enable painless, accurate and automatic management of TT
       • Strong advances in this domain but hard to promote to NRENs
     Situation to be sorted out between NRENs & SA2
       • Solve centralisation, accuracy and exposure of TT
       • Then tools will easily follow

  7. Around network monitoring
     Connectivity addressed with DownCollector
       • But not performance
     Hard to get information on end-to-end performance
       • Requires going into the details of network paths and devices
       • 300 certified sites, 50 NRENs... Inhomogeneous domains
       • Network is shared, should be monitored once and not at project level
       • Slowly converging toward perfSONAR – not yet mature
     EGEE Network troubleshooting tool upcoming
       • Lightweight package from SA2
       • Prototype around January 2009

  8. DownCollector (1/3)
     Now a key tool reporting TCP listening of Grid nodes
     2 minutes accuracy
       • 2600 nodes polled
       • Often first to detect some failures
     GOCDB scheduled downtimes are managed
       • Troubles not reported for sites in scheduled downtimes
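A minimal Python sketch of the kind of check slide 8 describes, assuming that "TCP listening" means a plain TCP connect succeeds. The function names, the 5-second timeout and the representation of scheduled downtimes as (start, end) pairs are illustrative assumptions; the slides do not show DownCollector's actual implementation or how it reads GOCDB.

```python
import socket
from datetime import datetime
from typing import Iterable, Tuple

# Hypothetical representation of a scheduled downtime: (start, end).
Downtime = Tuple[datetime, datetime]


def tcp_listening(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or name resolution failed.
        return False


def in_scheduled_downtime(now: datetime, downtimes: Iterable[Downtime]) -> bool:
    """Return True if 'now' falls inside any scheduled downtime window."""
    return any(start <= now < end for start, end in downtimes)


def should_report(node_listening: bool, now: datetime,
                  downtimes: Iterable[Downtime]) -> bool:
    """Report only unreachable nodes that are not in a scheduled downtime."""
    return not node_listening and not in_scheduled_downtime(now, downtimes)
```

In this sketch a poller would call `tcp_listening` for each node every two minutes and pass the result to `should_report` together with the site's downtime windows.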

  9. DownCollector (2/3)
     [Diagram labels: GÉANT2, NREN X, checkpoint; OFF-SITE and ON-SITE sides]
     A trouble = all Grid hosts of a site unreached
       • To avoid measuring host availability
     Network checkpoint = border router
       • Demarcation point for ENOC’s responsibility
       • Checked during trouble
     Three kinds of troubles
       • OFF-SITE: network checkpoint NOT reached (fault in: WAN, MAN, NREN, GÉANT2, ISP...)
       • ON-SITE: network checkpoint reached (LAN, power, software...)
       • UNKNOWN: no clear and reliable checkpoint, but site in trouble
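The three kinds of troubles on slide 9 can be read as a small decision rule. The sketch below is one way to write it in Python; the names `Trouble` and `classify_site` are hypothetical, and a missing or unreliable checkpoint is modelled as `None`, which is an assumption of this example, not something the slides specify.

```python
from enum import Enum
from typing import Optional, Sequence


class Trouble(Enum):
    OFF_SITE = "OFF-SITE"
    ON_SITE = "ON-SITE"
    UNKNOWN = "UNKNOWN"


def classify_site(hosts_reached: Sequence[bool],
                  checkpoint_reached: Optional[bool]) -> Optional[Trouble]:
    """Classify a site following the rule on the slide.

    hosts_reached: one flag per Grid host of the site.
    checkpoint_reached: True/False if the site has a usable network
    checkpoint (border router), None if it has no reliable one.
    Returns None when the site is not in trouble.
    """
    # A trouble exists only when ALL Grid hosts of the site are unreached;
    # a site with no known hosts cannot be declared in trouble.
    if not hosts_reached or any(hosts_reached):
        return None
    if checkpoint_reached is None:
        return Trouble.UNKNOWN
    # Checkpoint reached: the fault is behind the border router (ON-SITE).
    # Checkpoint not reached: the fault is on the network path (OFF-SITE).
    return Trouble.ON_SITE if checkpoint_reached else Trouble.OFF_SITE
```

Requiring every host to be unreached keeps the rule from measuring the availability of individual hosts, which is the motivation given on the slide.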

  10. DownCollector (3/3)
      [Diagram labels: ENOC, RENATER, Router A, Router B, GÉANT2, NREN X, French site, Foreign site 1, Foreign site 2, checkpoint for site 1]
      Is it trustworthy or biased?
        • Is a failure reported from the ENOC a failure for the entire infrastructure?
        • For ON-SITE troubles: ~YES
        • What about French sites reached without using GÉANT2? Remote probes?
        • 2 instances of DownCollector? ~NO

  11. Troubles detected by DownCollector
      [Chart, axis label: number of troubles]
      Troubles are not concentrated on a few sites!
        • Scope: 300 certified sites, last 12 months
      [Chart: number of troubles per month]
      54% of detected problems are ON-SITE

  12. Troubles’ durations
      [Chart: distribution of troubles over the last 12 months]
      80% solved within 30 min
        • Pareto’s law
      The others
        • OFF-SITE: avg 45 troubles/month
        • ON-SITE: avg 85 troubles/month

  13. Yearly sum of downtimes per site
      164 sites have less than 1d of downtime during the last 12 months
      46 sites
      Last 12 months total downtime for site 46: 4d OFF-SITE, 17d ON-SITE
      85% of sites: <4d of downtime/year = 98.90% reachability/year
      N.B.: unscheduled downtime
      Best: 4 minutes down
      Worst: 64 days (PPS…)
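The reachability figure on slide 13 follows from simple arithmetic: 4 days of downtime out of a year leaves about 98.90% of the time reachable. A short Python check, assuming a 365-day year (the slide only says "last 12 months") and a hypothetical helper name:

```python
def reachability_percent(downtime_days: float, period_days: float = 365.0) -> float:
    """Share of the period, in percent, during which a site was reachable."""
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    return 100.0 * (1.0 - downtime_days / period_days)


# 4 days of unscheduled downtime in a 365-day year.
print(f"{reachability_percent(4):.2f}%")  # prints 98.90%
```

The same helper can be applied to the other figures on the slide, for example the 4d + 17d total quoted for one site or the 64-day worst case.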

  14. First assessment
      Networks are quite reliable
        • Few long outages on resilient transit networks
        • ON-SITE troubles are important things
        • 30 minutes seems a wise threshold
        • DownCollector seems reliable and trustworthy enough
      Automatic management of network TT currently not reliable
      Currently few interactions SA2 / CODs
      This was discussed with pole1 for improvements
        • Thanks to them for feedback, results follow

  15. Proposal for troubles handling
      Map troubles handling around the three kinds of problem from DownCollector

  16. OFF-SITE troubles handling
      ENOC please follow that
      ENOC’s responsibility – devolving trouble resolution to NRENs/GÉANT2
      Targeted key information: expected end date
        • Hard to get…
      Enable marking of particular outages
        • Maybe then automatically create a ticket in ENOC’s helpdesk (GGUS) to exchange information with COD

  17. Proposal for tools (1/2)
      ENOC’s bar in COD dashboard
      [Figure labels: ON-SITE, UNKNOWN, Trouble, OFF-SITE; time axis from -5h to Now]
      ENOC to work with sites to improve some network checkpoints
        • Reduce number of unknown troubles (~12%, ~106/month)
        • 351 sites in database: 32 (9%) without usable checkpoint
        • [ https://ccenoc.in2p3.fr/DownCollector/?v=list_headnodes ]

  18. Proposal for tools (2/2)
      [Screenshot label: 1.5 - select threshold]
      Notification from DownCollector to site admins for long-standing outages (15 or 30 minutes?)
        • Integration with Nagios not sufficient?
        • Existing DownCollector feature: subscribe to troubles
        • [ https://ccenoc.in2p3.fr/DownCollector/?v=subscription ]
        • Released with EGEE broadcast on 2008-07-16
        • 34 sites, 26 distinct e-mail addresses currently registered
        • Noticed problem: e-mails not reaching disconnected sites…
        • No threshold implemented yet
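Slide 18 notes that no notification threshold is implemented yet. As an illustration only, the sketch below shows how such a threshold could be applied to a set of open troubles. The function name, the data shapes and the site names used in the usage note are hypothetical, and the 30-minute default is simply one of the two values the slide asks about.

```python
from datetime import datetime, timedelta
from typing import AbstractSet, Dict, List


def outages_to_notify(open_troubles: Dict[str, datetime],
                      now: datetime,
                      threshold: timedelta = timedelta(minutes=30),
                      already_notified: AbstractSet[str] = frozenset()) -> List[str]:
    """Return the sites whose trouble has lasted at least 'threshold'.

    open_troubles maps a site name to the time its trouble started.
    Sites listed in already_notified are skipped so that each outage
    triggers at most one notification.
    """
    return sorted(
        site
        for site, started in open_troubles.items()
        if now - started >= threshold and site not in already_notified
    )
```

For example, with troubles opened 45 and 10 minutes ago at two made-up sites `SITE-A` and `SITE-B`, only `SITE-A` is returned with the 30-minute threshold. The sketch does not address the delivery problem the slide mentions, namely e-mails not reaching sites that are themselves disconnected.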

  19. Actions list for tools
      ENOC
        • DownCollector
        • Improve checkpoints
        • Add threshold to subscribe feature?
        • Allow flagging important network outages and study a scheme to exchange information about them (GGUS ENOC’s helpdesk...)
        • Provide ENOC’s bar
      CIC portal
        • Manage network alarms & alarm masking
        • Integrate ENOC’s bar

  20. Conclusion
      It’s really going ahead
      Some implementation details to sort out
        • Scalability, regionalisation
        • Right now, or wait for your next model (alarm DB, R-COD etc.)?
        • CIC portal & ENOC: priorities, manpower and roadmap
      Other ideas, feedback etc. always welcome
        • Help design the network support you need

  21. Questions?
