Operational Requirements and Proposals

Presentation Transcript


  1. Operational Requirements and Proposals
     Dan Nae, California Institute of Technology

  2. The Network (Overview)

  3. A Circuit (Engineering View)
     [Diagram labels: US LHCNet, CERN, ESnet, FNAL]
     • Almost all circuits span multiple domains
     • An end-to-end circuit is made of equipment interconnected by links
     • Each component (link or equipment) falls under the responsibility of a single entity, but responsibility for the end-to-end circuit is shared
     • Incidentally, this is also the view of the monitoring systems

  4. The Network (Contractual View)
     • There is a mutual agreement between the two end nodes to have a link between them
     • There is also an agreement between the service providers to supply an end-to-end service
     • The two end points control, directly or indirectly, all the components of the network by means of contracts and agreements
     [Diagram labels: Service Beneficiaries (CERN, FNAL); Service Providers (US LHCNet, ESnet); 1st Level Contractors (Qwest, Ciena, Force10); 2nd Level Contractors (VSNL, Ciena, Interoute)]

  5. Relationships Outside the LHC OPN
     [Diagram labels: Govt. Funding Agency, CMS, Office of Science, CERN, FNAL, USLHC Network Working Group, US LHCNet, ESnet, LHC OPN]
     • These obligations didn't change with the establishment of the LHCOPN
     • Each LHCOPN service provider is bound by multiple agreements to provide service to the Tier1s

  6. Operations Today: Problem Location is Known
     • Problem location is known (easy problem)
     • The contractual path is taken to solve the problem
     • End sites prefer to talk to the local service provider
     • LHCOPN service providers tend to be proactive
     • TTR (time to repair) differs for various components (e.g. circuit repair times are different from equipment RMA times)
     • Redundancy is essential to ensure that total service availability is greater than the availability of individual components
     [Diagram: End Site, Service Provider, 1st Level Contractor and 2nd Level Contractor layers, with a "!" marker]
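
The redundancy point on this slide can be illustrated with a short calculation. This is a minimal sketch, not part of the presentation: it assumes component failures are independent, and the availability figures and function names are hypothetical values chosen only for illustration.

```python
from math import prod


def series_availability(components):
    """Availability of a chain in which every component must work."""
    return prod(components)


def parallel_availability(paths):
    """Availability of redundant paths of which at least one must work."""
    return 1.0 - prod(1.0 - a for a in paths)


# Hypothetical availabilities, for illustration only.
link = 0.995
equipment = 0.999

# One end-to-end circuit: equipment, link, equipment in series.
single_path = series_availability([equipment, link, equipment])

# Two such circuits, assumed to fail independently of each other.
redundant = parallel_availability([single_path, single_path])

print(f"single path:           {single_path:.5f}")
print(f"two independent paths: {redundant:.7f}")
```

A single circuit is less available than its weakest component, while two independent circuits together are more available than either one. Shared failure modes (for example, two links in the same conduit) break the independence assumption and reduce the benefit.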

  7. Operations Today: Problem Location is Unknown
     • Problem location is unknown (hard problem)
     • Service providers collaborate according to the agreements in place to isolate the problem
     • Once the problem is located, contracts are enforced to fix it
     • TTR == Time to Locate + Time to Fix (SLAs)
     • Time to Locate decreases as the experience and knowledge of the engineers increase
     • An external entity with no knowledge of and/or access to the equipment will not make the Time to Locate shorter
     [Diagram: End Site, Service Provider, 1st Level Contractor and 2nd Level Contractor layers, with "?" and "Eureka!" markers]
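
The TTR decomposition above can be written out as a small record type. This is a sketch under stated assumptions: the `Incident` class, its field names, and the timestamps are hypothetical and do not come from any LHCOPN tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    reported: datetime  # problem first signaled
    located: datetime   # faulty component identified
    fixed: datetime     # fix confirmed

    @property
    def time_to_locate(self) -> timedelta:
        return self.located - self.reported

    @property
    def time_to_fix(self) -> timedelta:
        return self.fixed - self.located

    @property
    def ttr(self) -> timedelta:
        # TTR == Time to Locate + Time to Fix
        return self.time_to_locate + self.time_to_fix


# Hypothetical timestamps, for illustration only.
incident = Incident(
    reported=datetime(2008, 1, 10, 8, 0),
    located=datetime(2008, 1, 10, 14, 30),
    fixed=datetime(2008, 1, 10, 18, 30),
)
print("time to locate:", incident.time_to_locate)
print("time to fix:   ", incident.time_to_fix)
print("TTR:           ", incident.ttr)
```

Keeping the two terms separate makes it visible that only Time to Fix is covered by contractual SLAs, while Time to Locate depends on the engineers involved.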

  8. Current Operational Model
     [Diagram labels: CERN, FNAL]
     • Problem is detected by the monitoring systems (easy)
       • The responsible domain opens an internal ticket and notifies the relevant entities (end nodes, peers, centralized information repository)
       • The responsible domain knows there is a local problem before the other domains do (if not, this is a "hard" problem)
       • The responsible domain fixes the problem and notifies the relevant entities (end nodes, peers, centralized information repository)
     • Problem is not detected by the monitoring systems (hard)
       • The problem is signaled by one of the end nodes (directly or by a data manager)
       • All domains open internal tickets and look for the problem (problem isolation)
       • End domains are the only entities that can confirm the problem is fixed
       • Coordination == keeping the ticket open
       • Intermediate domains must keep the ticket open until the fix is confirmed
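
The closing rule for the hard case ("Coordination == keeping the ticket open") can be sketched as a small state check. The `CrossDomainTicket` class and its method names are hypothetical; this is one possible reading of the slide, not a description of an existing ticketing system.

```python
from dataclasses import dataclass, field


@dataclass
class CrossDomainTicket:
    end_domains: frozenset
    intermediate_domains: frozenset
    confirmations: set = field(default_factory=set)

    def confirm_fixed(self, domain: str) -> None:
        """Record a fix confirmation; only end domains may confirm."""
        if domain not in self.end_domains:
            raise ValueError(f"{domain} is not an end domain and cannot confirm the fix")
        self.confirmations.add(domain)

    def may_close(self) -> bool:
        """Tickets stay open until every end domain has confirmed the fix."""
        return self.end_domains <= self.confirmations


ticket = CrossDomainTicket(
    end_domains=frozenset({"CERN", "FNAL"}),
    intermediate_domains=frozenset({"US LHCNet", "ESnet"}),
)
ticket.confirm_fixed("CERN")
print(ticket.may_close())  # one end domain has not confirmed yet
ticket.confirm_fixed("FNAL")
print(ticket.may_close())  # both end domains have confirmed
```

Requiring confirmation from every end domain is an assumption of this sketch; the slide only states that end domains are the ones able to confirm.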

  9. Centralization in the Commercial World
     • This is the simplest model
     • CERN could tender the LHCOPN as a whole
     • If CERN owns all contracts, then CERN has to pay for all links
     • Not possible with the current funding model (each Tier1 is responsible for its link to CERN)
     • If the remote sites own the contracts, then there won't be a single service provider
     • Unless all LHCOPN partners agree to hand over all or part of the operations to a third party
     • The propagation of information doesn't always work that well in this model
     [Diagram: End Sites and a single Service Provider, with the 1st Level Contractor and 2nd Level Contractor layers marked "Hidden layers"]

  10. Current Operations
     • Web page centralizing all the information: https://twiki.cern.ch/twiki/bin/view/LHCOPN/LNOC
     • Each NOC should at least have listed:
       • A working-hours contact (and what those hours are)
       • An emergency contact (outside working hours)
     • Each NOC should have and publish the information needed to distinguish between the two types (i.e. what is urgent enough to warrant a call at night)
     • Dissemination channels are not always well defined
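
The two contact types listed above suggest a simple lookup. The sketch below is illustrative only: the `NocContacts` record, its fields, and the contact details are hypothetical, and it assumes working hours fall within a single day and ignores weekends and time zones.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional


@dataclass
class NocContacts:
    name: str
    working_hours_contact: str
    emergency_contact: str
    opens: time   # start of working hours (local time)
    closes: time  # end of working hours (local time)

    def contact_for(self, at: time, urgent: bool) -> Optional[str]:
        """Return whom to contact at a given local time, or None if it can wait."""
        if self.opens <= at < self.closes:
            return self.working_hours_contact
        if urgent:
            return self.emergency_contact
        return None  # not urgent: wait for working hours


# Hypothetical NOC entry, for illustration only.
noc = NocContacts(
    name="Example NOC",
    working_hours_contact="noc@example.org",
    emergency_contact="+1-555-0100",
    opens=time(9, 0),
    closes=time(17, 0),
)
print(noc.contact_for(time(10, 0), urgent=False))
print(noc.contact_for(time(23, 0), urgent=True))
```

What counts as `urgent` is exactly the information the slide asks each NOC to publish; the sketch takes it as an input instead of deciding it.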

  11. Scheduled Maintenance
     • Announced to end sites, peers and the central information repository
     • If we are to have change management, it has to be done by a party aware of all changes inside the LHCOPN and of the changes or activities outside the LHCOPN
     • One simple way to do this is a dashboard where all scheduled maintenances are posted, along with information about them:
       • who performs the maintenance (one of the LHCOPN members or an outside entity)
       • whether this is an emergency maintenance (cannot be rescheduled)
       • 5 days' advance notice is not always possible for emergency maintenance
       • possibly a priority associated with the maintenance:
         • 1 == cannot be rescheduled (emergency)
         • 2 == can be rescheduled with considerable difficulty
         • 3 == can be rescheduled with some difficulty
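
The dashboard entry described above maps naturally onto a small data structure. This is a minimal sketch: the three priority levels and the 5-day notice period come from the slide, while the class names, field names, and sample entries are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import IntEnum


class Priority(IntEnum):
    EMERGENCY = 1           # cannot be rescheduled
    HARD_TO_RESCHEDULE = 2  # can be rescheduled with considerable difficulty
    RESCHEDULABLE = 3       # can be rescheduled with some difficulty


@dataclass
class Maintenance:
    performed_by: str  # an LHCOPN member or an outside entity
    announced: date
    starts: date
    priority: Priority

    @property
    def is_emergency(self) -> bool:
        return self.priority is Priority.EMERGENCY

    def has_advance_notice(self, min_days: int = 5) -> bool:
        """True if the announcement came at least min_days before the start."""
        return self.starts - self.announced >= timedelta(days=min_days)


def dashboard(entries):
    """Order entries by start date, most constrained priority first within a day."""
    return sorted(entries, key=lambda m: (m.starts, m.priority))


# Hypothetical entries, for illustration only.
entries = [
    Maintenance("ESnet", date(2008, 3, 1), date(2008, 3, 10), Priority.RESCHEDULABLE),
    Maintenance("Outside carrier", date(2008, 3, 9), date(2008, 3, 10), Priority.EMERGENCY),
]
for m in dashboard(entries):
    print(m.starts, m.performed_by, m.priority.name, m.has_advance_notice())
```

Sorting emergencies first within a day is a design choice of this sketch; a real dashboard might instead flag overlapping maintenances on redundant paths.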

  12. What Is Missing
     • An end-to-end monitoring system that can reliably pinpoint where most of the problems are
     • An effective way to integrate this monitoring system into the procedures of the various local NOCs to help them take action
     • A centralized ticketing system to keep track of all the problems
     • A way to extract performance numbers from the centralized information (easy)
     • Clear dissemination channels to announce problems, maintenance, changes, important data transfers, etc.
     • Someone to take care of all the above
     • A data repository that engineers can use, and a set of procedures that can help solve the hard problems faster (detailed circuit data, ticket history, known problems and solutions)
     • A group of people (data and network managers) who can evaluate the performance of the LHCOPN based on experience and gathered numbers, and who can set goals (target SLAs for the next set of tenders, responsiveness, better dissemination channels, etc.)
