LHCOPN: Operations report

Guillaume.Cessieux @ cc.in2p3.fr, Network team, FR-CCIN2P3. LHCOPN meeting, CERN, 2010-10-08.

  1. LHCOPN: Operations report • Guillaume.Cessieux @ cc.in2p3.fr • Network team, FR-CCIN2P3 • LHCOPN meeting, CERN, 2010-10-08

  2. From last LHCOPN meeting, 2010-06-29, Barcelona • Conclusion on Operations • Sites follow the processes unevenly, because they lack a clear sense of usefulness and evidence of network failures • WLCG relationships are weak • Monitoring and SLD required to really assess Operations • Items not solved • LHCOPN representatives • How to push efficiently for proper resolution of some issues/administrative tasks • In plain words: put pressure on sites and escalate frozen issues • Merging LHCOPN helpdesk with standard GGUS

  3. Outline • Operations status • TTS stats • Long-standing issues & Ops phoneconf report • Operational exchanges with WLCG • Post-mortem analysis of some issues • Easing exchanges with WLCG • AOB

  4. Number of tickets put in the LHCOPN TTS per month • AVG: 23 tickets/month

  5. Kinds of tickets per month

  6. KPI-1: Infrastructure vs operations behavior

  7. Ticket ownership during [2010-07-01, 2010-09-30] • Joy of terminating 6 LHCOPN links

  8. Ownership of tickets per month per site

  9. Conclusion from TTS stats • Workflow stable, but unclear whether this is good • SLD & monitoring are missing to correlate and focus on service-impacting events • Lots of L2 events (80%), well handled • Often clear-cut, easy to spot • Not used to complex issues • These often turn into a long story • Packet loss, MTU...

  10. Long-standing issues • Only administrative! • Validate prefix acceptance etc. • Waiting for the GGUS feature “clone this ticket and assign it to all impacted sitename” to follow this on a per-site basis • Followed during the LHCOPN Ops phoneconf, every 3 months • Recurrent issue: hard to get administrative issues solved

  11. Issues highlighted by WLCG (1/4) • Painful to spot, and many are not related to the LHCOPN at all • #GGUS-54473: transfer error from PIC_DATADISK to SARA-MATRIX_DATADISK • Child issues: #GGUS-54416, #GGUS-54474, #GGUS-54500 • “The two LHCOPN routers at CERN were connected via a VLAN, and VLAN tagging adds 4 bytes to a packet. The MTU between these routers has been increased” • Opened 2010-01-05 12:17, closed 2010-01-08 16:16 • No related LHCOPN tickets
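The fix quoted on slide 11 rests on simple frame arithmetic: an IEEE 802.1Q VLAN tag adds 4 bytes to every Ethernet frame, so a link sized exactly for untagged full-size frames will drop tagged ones. Below is a minimal sketch of that arithmetic. The 9000-byte IP MTU is an assumption for illustration (the slides do not give the values configured at CERN), and the helper names are hypothetical.

```python
ETH_HEADER = 14  # destination MAC + source MAC + EtherType, in bytes
ETH_FCS = 4      # frame check sequence, in bytes
VLAN_TAG = 4     # IEEE 802.1Q tag, in bytes


def frame_size(ip_mtu: int, tagged: bool) -> int:
    """Size in bytes of a full-size Ethernet frame carrying ip_mtu bytes."""
    return ETH_HEADER + (VLAN_TAG if tagged else 0) + ip_mtu + ETH_FCS


def fits(ip_mtu: int, tagged: bool, max_frame: int) -> bool:
    """True if a full-size frame passes a link limited to max_frame bytes."""
    return frame_size(ip_mtu, tagged) <= max_frame


ASSUMED_IP_MTU = 9000  # assumption, not taken from the slides

# A link sized exactly for untagged full-size frames...
untagged_limit = frame_size(ASSUMED_IP_MTU, tagged=False)

# ...is 4 bytes too small once the frames are VLAN-tagged,
print(fits(ASSUMED_IP_MTU, tagged=True, max_frame=untagged_limit))

# and raising the limit by the tag size restores the fit.
print(fits(ASSUMED_IP_MTU, tagged=True, max_frame=untagged_limit + VLAN_TAG))
```

Only full-size packets are affected by such a mismatch, which is one reason this kind of problem tends to show up as failing bulk transfers while small probes still pass.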

  12. Issues highlighted by WLCG (2/4) • #LHCOPN-58197: Poor performance between CERN and ASGC • Opened 2010-05-12, closed 2010-05-17 • Never updated, only opened/closed for the record • Only a communication problem, the issue itself was managed • Network staff movement at TW-ASGC, solved • SIR filed: https://twiki.cern.ch/twiki/bin/view/LCG/SIRCernAsgcLinkMay2010 • #GGUS-59791: Transfer problem from INFN-T1_DATADISK to PIC_DATADISK • Child issue: #GGUS-59697 T0 export to INFN-T1_DATADISK failures: No valid space tokens • Opened 2010-07-07 00:06, closed 2010-07-14 18:05 • “Network issue of MTU black hole + route asymetry at CNAF/GARR” • No LHCOPN tickets

  13. Issues highlighted by WLCG (3/4) • #GGUS-61306: Functional test transfer errors to RAL-LCG2_DATADISK • Related to • #GGUS-61942 “NDGF-T1 transfer error from RAL-LCG2 and to BNL-OSG2” • #GGUS-61835 “Transfer errors from NDGF-T1_DATADISK to RAL-LCG2_DATADISK” • #GGUS-62287 “Transfer errors at NDGF-T1_SCRATCHDISK” • Opened 2010-08-19 17:41, closed 2010-09-17 15:09 • #LHCOPN-62228, opened/closed 2010-09-17 • Symbolic, for the record, contains no information • “The linecard terminating the RAL primary link on the CERN router was replaced and the issue was definitely solved”
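The open and close timestamps quoted on slides 11 to 13 give the time each GGUS ticket stayed open. A minimal sketch of that arithmetic follows, using only the timestamps listed on the slides; the table layout and function name are illustrative, not part of any GGUS tooling.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# (ticket, opened, closed) exactly as quoted on slides 11 to 13
TICKETS = [
    ("GGUS-54473", "2010-01-05 12:17", "2010-01-08 16:16"),
    ("GGUS-59791", "2010-07-07 00:06", "2010-07-14 18:05"),
    ("GGUS-61306", "2010-08-19 17:41", "2010-09-17 15:09"),
]


def time_to_close(opened: str, closed: str):
    """Elapsed time between opening and closing a ticket, as a timedelta."""
    return datetime.strptime(closed, FMT) - datetime.strptime(opened, FMT)


durations = {name: time_to_close(o, c) for name, o, c in TICKETS}

for name, delta in durations.items():
    # timedelta keeps whole days and the leftover seconds separately
    print(f"{name}: {delta.days} d {delta.seconds // 3600} h")
```

The three tickets were open for roughly 3, 8 and 29 days respectively, which fits the earlier remark that complex issues such as MTU problems often turn into a long story.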

  14. Issues highlighted by WLCG (4/4) • 4 LHCOPN issues this year • Nothing particularly wrong • Problem is mainly around communication • Main mistake is forgetting to create a ticket in the LHCOPN helpdesk • This was the agreed process • Not aware of any other LHCOPN-related issue from WLCG • But there were other network issues (LAN, generic IP...)

  15. Separate LHCOPN helpdesk in GGUS, why? (1/3) • Key requirement (2008-03) • Not doing user support, but coordinating network teams • Match the operational model, particularly the responsibility and notification scheme • Network issue ≠ Grid issue, lots of non-service-impacting events to be registered in it • Avoid disturbing or misleading people • Network teams have no access to standard GGUS • And did not want any • Centralize anything related to LHCOPN Ops • Clear desire to be isolated/protected • “If we use standard GGUS this will be a mess” • Real fear of enquiries about anything • Did not want to be considered a catch-all networking support; we should accept only selected LHCOPN-related enquiries going through storage teams • So we ended up with the LHCOPN helpdesk

  16. Separate LHCOPN helpdesk in GGUS, why? (2/3) • Now • General workflow is agreed, discussion is about how to implement it • Lots of things have evolved • GGUS support scheme, experience in applying processes etc. • Several problems/concerns experienced • Problems cannot be solved independently by the network team? • Lots of interaction with storage, system etc. • Aren’t iperf tests or monitoring sufficient? • We lack a clear bridge with WLCG Ops • Hope was put in the awaited parent/child relationship feature for GGUS tickets • Cross-helpdesk access and exchanges required? • Enquiries often already have a standard GGUS ticket • “Why creating a LHCOPN TT if there is still a GGUS one ?” • Competition between LHCOPN helpdesk and standard GGUS • Tickets turn out to be network related only after some time and investigation • LHCOPN tickets: overhead or true advantage? • Notification, responsibility, tracking etc.

  17. Separate LHCOPN helpdesk in GGUS, why? (3/3) • So create 12 related support units in the standard GGUS? • LHCOPN_CA-TRIUMF etc. • Will this add happy interactions with everybody? • Can we keep the set of particular features we have and be smartly integrated in the current GGUS workflow? • Particular view, non-service-impacting events hidden, categories, tickets for maintenances, notification and assignment scheme? • Transparent for us? Can a standard ticket be turned into an LHCOPN one? • Aren’t we doing more than user support?

  18. AOB (1/3) • Routing policies • To be documented accurately through a routing matrix • https://twiki.cern.ch/twiki/bin/view/LHCOPN/RoutingPolicies • Escalation process • Exists, but never used • https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#Escalated_incident_management_pr • Give this privilege to WLCG people on LHCOPN tickets? • Scheme of responsibilities to be improved? • Set on a per-link basis, so who’s responsible for an IT-INFN-CNAF ↔ US-T1-BNL issue? • Can this really happen without problems between IT-INFN-CNAF ↔ CERN or US-T1-BNL ↔ CERN?
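The responsibility question on slide 18 can be made concrete with a small lookup. This is a sketch under two assumptions that the slides do not confirm: the ownership table below is hypothetical, and the IT-INFN-CNAF ↔ US-T1-BNL traffic is assumed to transit CERN, as the last bullet suggests.

```python
# Hypothetical ownership table: link (unordered site pair) -> responsible site.
# The real assignments are defined by the LHCOPN operational model, not here.
LINK_OWNER = {
    frozenset({"CERN", "IT-INFN-CNAF"}): "IT-INFN-CNAF",
    frozenset({"CERN", "US-T1-BNL"}): "US-T1-BNL",
}


def responsible_sites(path):
    """Return the distinct owners of each link along a path of site names."""
    owners = []
    for a, b in zip(path, path[1:]):
        link = frozenset({a, b})
        if link not in LINK_OWNER:
            raise KeyError(f"no owner recorded for link {a} <-> {b}")
        owner = LINK_OWNER[link]
        if owner not in owners:
            owners.append(owner)
    return owners


# Assumed path: the two Tier-1s reach each other through CERN.
path = ["IT-INFN-CNAF", "CERN", "US-T1-BNL"]
print(responsible_sites(path))
```

With responsibilities set per link, an end-to-end problem on this path maps to two owners rather than one, which is the ambiguity the slide points at.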

  19. AOB (2/3) • Issues/requests related to MDM • Must be visible, tracked and centralised like any other LHCOPN issue • Must be in the LHCOPN TTS • Maybe new problem categories etc. to support this • How far? Track software bugs or only site implementations? • DANTE/GN3 could have a login/password to GGUS if no certificate • Any concerns about this? • Documentation about MDM boxes available? • Should be on the LHCOPN twiki, even if very brief • List and IP addresses of boxes enough? • Hard to solve problems knowing only the local boxes • DANTE/GN3 should have R/W access to the LHCOPN twiki

  20. AOB (3/3) • Too many off-the-record e-mail exchanges about LHCOPN issues • MUST be in the LHCOPN TTS • Visible, followed, timestamped etc. • Tickets in the LHCOPN TTS have a clear scheme of responsibilities… unlike an e-mail sleeping in an inbox • If no LHCOPN ticket, no LHCOPN issue

  21. Conclusion • Awaiting monitoring to revitalise Ops • And SLD to really know what matters • Main weakness of LHCOPN Ops: relationship with WLCG • GGUS merging: to be investigated/discussed further • Why not, if this solves issues • Be careful with the scope of our model • LHCOPN only • Key reason for having this so specific? • But be careful before changing something that works • Also wait for EGI networking support and Tier-2 networking to converge
