Fault Management *

* Mani Subramanian, “Network Management: Principles and Practice”, Addison-Wesley, 2000.

Fault Management
• The process of locating and correcting network problems and faults
• A fault is a failure of a network component that results in loss of connectivity
• It is the most important functional management area
• Process, 5 steps:
  • Identify faults
    • Gather information via traps (linkDown, egpNeighborLoss) and polling
    • Traps may not be sufficient
    • Is a received trap an important one?
  • Locate the fault
    • Detect all failed components and trace down the tree topology to the source (e.g., when an interface card on a router fails, all connected components will indicate a failure)
    • Fault isolation by network and SNMP tools
    • Use artificial intelligence / correlation techniques
  • Restore service (high priority)
  • Identify the root cause of the problem (trouble ticket)
  • Resolve the problem
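The "trace down the tree topology" step can be sketched as a walk from the root of the topology that stops at the first failed component on each branch. This is a minimal illustration, not a tool named in the slides: the function `locate_root_faults`, the topology and the failure reports are all hypothetical.

```python
from typing import Dict, List, Set


def locate_root_faults(children: Dict[str, List[str]], root: str,
                       failed: Set[str]) -> List[str]:
    """Return the topmost failed components in a tree topology.

    Anything below a failed component also reports a failure, so the walk
    stops at the first failed node on each branch and treats it as the source.
    """
    sources = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node in failed:
            sources.append(node)  # failures further down are consequences
            continue
        stack.extend(children.get(node, []))
    return sorted(sources)


# Hypothetical topology: one router with two interface cards.
topology = {
    "router": ["if_card_1", "if_card_2"],
    "if_card_1": ["host_a", "host_b"],
    "if_card_2": ["host_c"],
}
# Hypothetical failure reports: the card and everything behind it look down.
failed = {"if_card_1", "host_a", "host_b"}

root_faults = locate_root_faults(topology, "router", failed)
print(root_faults)
```

Only the interface card is reported, although three components indicate a failure.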

Network Restoration- example
[Figure: collapsed hierarchy, improved efficiency. IP data layer (virtual router topology, IP/MPLS, DiffServ packet QoS) over an intelligent transport backbone (routing/protection switching).]
Steps labeled in the figure:
• Failure detected
• Source notified
• Message received and resources configured; send ACK
• Resources successfully set up; restore traffic
• Traffic is successfully restored only after failure notification and a round-trip configuration/confirmation.

Preliminaries
• An event is an exceptional condition in the operation of the network
  • Software failure
  • Performance bottleneck
  • Configuration inconsistencies
  • Intrusion attempts
• Network management operations
  • Monitoring events
  • Interpreting events
  • Handling events
• A single problem event may cause many symptom events
• Correlating symptom events serves to identify and localize the underlying problems

Illustrative scenario
• A client application exchanges data over a TCP connection with a DB server
• Distinct domains, each administered by a different organization

Illustrative scenario
• Problem scenario: a clock at an interface in WAN2 that supports a T3 link loses sync 4 times a second for 0.25 ms
  → intermittent noise causing loss of 0.1% of T3 capacity (4 × 0.25 ms = 1 ms lost per second)
  → this small noise causes bit errors in a large number of packets routed over C-D
• Bit errors cause packet losses, either at routers (if the IP header is corrupted) or at destinations

Illustrative scenario
  → performance of the TCP connection degrades due to packet loss
  → the TCP sender interprets the loss as congestion and hence reduces its window
• TCP increases its window gradually until a new packet loss occurs
• However, because the noise keeps causing losses, the TCP window will not grow
• DB transactions by the client will last longer
• DB server performance will degrade due to record lock-outs, causing frequent aborts of remote transactions
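The window behaviour described above can be illustrated with a toy additive-increase/multiplicative-decrease loop. This is a deliberately simplified sketch, not a model of a real TCP stack: the function `simulate_aimd`, the loss pattern (one loss every fifth round) and the window cap are assumptions chosen only for illustration.

```python
def simulate_aimd(rounds: int, loss_every: int, max_window: int = 64) -> list:
    """Toy AIMD loop: +1 segment per round, halve the window on a loss.

    loss_every=0 means no losses; otherwise every loss_every-th round
    experiences a loss (a hypothetical, perfectly periodic pattern).
    """
    window = 1.0
    history = []
    for r in range(1, rounds + 1):
        if loss_every and r % loss_every == 0:
            window = max(1.0, window / 2)                # multiplicative decrease
        else:
            window = min(float(max_window), window + 1)  # additive increase
        history.append(window)
    return history


clean = simulate_aimd(rounds=200, loss_every=0)
noisy = simulate_aimd(rounds=200, loss_every=5)

avg_clean = sum(clean) / len(clean)
avg_noisy = sum(noisy) / len(noisy)
print(f"no loss:        peak={max(clean):.1f} avg={avg_clean:.1f}")
print(f"periodic loss:  peak={max(noisy):.1f} avg={avg_noisy:.1f}")
```

With recurring losses the window keeps being cut back before it can grow, which is why a small amount of noise translates into a large drop in throughput.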

Illustrative scenario
Three important points:
• Problems propagate among related objects, and may be amplified by various protocol mechanisms
• A single problem can cause numerous observable events in multiple domains
• Some problems are not observable where they originate: the WAN2 domain may observe minor error events at the T3 interface, but these events may be indistinguishable from normal operating noise → WAN2 may be unaware that there is a problem

Challenges
• Determine which events to monitor and how to analyze them
• Operations staff must know the operational parameters of managed objects and the significance of their events
• Correlation of events and coordination among different domains
• Automating the management activities (manual processing does not scale)

Modeling the Scenario
• Partition the system into multiple management domains (e.g., enterprise domain, ED, and router domain, RD)
• Each domain has a domain manager (DM) to monitor, correlate and handle its events
• A DM may subscribe to receive notifications from other domains
• The ED sees the RD as a single entity connecting LAN1 and LAN2

Modeling the Scenario
• Any problem in the connection is seen as an RD problem
• Inside each domain, finer-grained correlation can determine the particular problem using symptoms from other domains
• Example: packet loss, seen as degraded TCP performance, is detected by the ED, not by the RD. This symptom is received by the RD and can be correlated with other observable symptoms to isolate the “clock problem”.
[Figure callout: detects only IP header corruption]

Automating Event Management
• An automated event management system (AEMS) must accurately model and store knowledge of the underlying system and its associated events.
  • Static information: associated with managed objects, such as SNMP traps, thresholds for MIB variables, etc.
  • Dynamic information: reflects addition, removal, upgrades of network devices, etc.
• Automation consists of developing correlation algorithms to analyze observable events
• Correlation algorithms must be:
  • Scalable to large networks involving complex systems
  • Able to handle a large number of symptoms caused by a single problem
  • Fast: real-time correlation
  • Robust: the loss of a single alarm or the generation of a spurious event should not affect the decision → insensitive (resilient) to noise

Problems and Symptoms
• A problem is an event that can be handled directly; e.g., a faulty interface
• Some problems are directly observable; others can be observed only indirectly, through their symptoms
• Symptoms are observable events
  • Degraded application performance is a symptom of a faulty interface
• Symptoms cannot be handled directly; they persist until the problem is resolved
• Problems and symptoms propagate from one object to another
  • Noise in WAN → bit errors in link C-D → loss of packets at routers → poor TCP performance → frequent transaction aborts in the DB server

Event Correlation System
• Monitors typically collect managed data at network elements and detect out-of-tolerance conditions, generating appropriate alarms.
• The correlator uses an event model to analyze these alarms.
• The event model represents knowledge of various events and their causal relationships
  • The event model depends on knowledge supplied by human experts
• The correlator determines the common problems that caused the observed alarms.

Event Knowledge
The Modeler’s event knowledge contains the following information for each class of managed objects:
• The data attributes of objects of this class (e.g., MIB variables).
• The set of events that are observable within instances of this class (e.g., a particular MIB variable is above threshold), or by asynchronous event notifications.
• The set of events caused by each problem. This set can include events within the object, as well as events in other objects to which the object is related.
• The problems that can originate within instances of this class.
• The relationships in which an instance of the class can be involved.
• The events and/or problems that are exported by instances of the class.

Coding Approach for Event Correlation
• Treat the complete set of events caused by a problem as a “code” that identifies the problem
• Correlation is the process of decoding the set of observed symptoms
  • Determine which problem has these symptoms as its code
• Note: traditionally, alarms are correlated through searches over the event model knowledge base
  • Complexity of the search limits scalability
  • The event model is a large database, and the set of received alarms or symptoms may also be quite large

Coding Approach for Event Correlation
Two phases:
• Codebook selection phase
  • Select a subset of events for monitoring: the codebook
  • The codebook is an optimal subset of events that must be monitored to distinguish the problems of interest from one another
  • Ensure a desired level of noise tolerance
  • Algorithms must decode (infer) the problem in the presence of lost alarms or spurious alarms
• Decoding phase
  • Find the problem whose associated symptoms (i.e., code) match the observed symptoms most closely
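The codebook selection phase can be sketched as a search for a small set of symptoms that keeps every pair of problems at least a chosen number of differing symptoms apart. The slides do not specify a selection algorithm; the greedy heuristic below is an assumption and is not guaranteed to find the optimal codebook. The function names, the problem names and the symptom sets are hypothetical.

```python
from itertools import combinations


def distance(a: set, b: set, symptoms: list) -> int:
    """Number of symptoms in `symptoms` on which two problems differ."""
    return sum((s in a) != (s in b) for s in symptoms)


def select_codebook(codes: dict, min_distance: int = 1) -> list:
    """Greedy heuristic: keep adding the symptom that separates the most
    problem pairs that are still closer than min_distance."""
    all_symptoms = sorted(set().union(*codes.values()))
    pairs = list(combinations(sorted(codes), 2))
    codebook = []

    def too_close():
        return [(p, q) for p, q in pairs
                if distance(codes[p], codes[q], codebook) < min_distance]

    while too_close():
        pending = too_close()

        def gain(s):
            return sum((s in codes[p]) != (s in codes[q]) for p, q in pending)

        candidates = [s for s in all_symptoms if s not in codebook]
        best = max(candidates, key=gain, default=None)
        if best is None or gain(best) == 0:
            raise ValueError("requested distance cannot be reached")
        codebook.append(best)
    return codebook


# Hypothetical problems and the symptoms each one causes.
codes = {
    "P1": {"S1", "S2"},
    "P2": {"S2", "S3"},
    "P3": {"S1", "S3", "S4"},
}

book_d1 = select_codebook(codes, min_distance=1)  # distinguish only
book_d2 = select_codebook(codes, min_distance=2)  # larger separation between codes
print(book_d1, book_d2)
```

Asking for a larger minimum distance yields a larger codebook, which is the trade-off between monitoring cost and noise tolerance described above.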

Causality Graph Models
• Correlation is concerned with the analysis of causality relations among events
• e → f denotes that event e causes event f
• Causality is a partial order relation between events
• The relation → can be described by a graph whose nodes represent the events and whose edges represent causality

Causality Graph Models
Callouts on the causality graph:
• A symptom caused by another symptom: does not contribute any information about the problem
• An event that is neither a symptom nor a problem
• Causal equivalence
• All these indirect symptoms can be eliminated without loss of information
[Figure: correlation graph]
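One way to read the reduction from causality graph to correlation graph is to link each problem directly to the symptoms it eventually causes, collapsing the intermediate events along the way. The sketch below follows that reading; it is an interpretation, not the procedure from the slides, and the function `correlation_graph` and the example graph are hypothetical.

```python
def correlation_graph(edges: dict, problems: list, symptoms: set) -> dict:
    """Map each problem to the symptoms reachable from it in the causality graph."""
    result = {}
    for p in problems:
        seen, stack = set(), [p]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, []):
                if nxt not in seen:        # also protects against cycles
                    seen.add(nxt)
                    stack.append(nxt)
        result[p] = sorted(seen & symptoms)  # drop intermediate events
    return result


# Hypothetical causality graph: E1 and E2 are neither problems nor symptoms.
edges = {
    "P1": ["E1"],
    "E1": ["S1", "S2"],
    "P2": ["E2", "S3"],
    "E2": ["S2"],
}
reduced = correlation_graph(edges, ["P1", "P2"], {"S1", "S2", "S3"})
print(reduced)
```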

Correlation
• Information contained in the correlation graph must be converted into codes, one for each problem in the graph. A code for a problem p is a vector of 0s and 1s. Each bit corresponds to a symptom in the graph
• Example: the code is of length 3 (3 symptoms) after ordering the symptoms (e.g., <S3, S6, S9>):
  • The code for p1 is p1 = (1, 0, 1); this means p1 causes symptoms S3 and S9
  • p2 = (1, 1, 0) and p11 = (1, 0, 1)
• Event correlation is finding problems whose codes optimally match an observed symptom vector
[Figure: correlation graph]
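The conversion from correlation graph to codes can be sketched directly from the example above. The codes for p1, p2 and p11 and the ordering <S3, S6, S9> come from the slide; the dictionary layout and the helper names `code_for` and `exact_matches` are assumptions made for illustration.

```python
SYMPTOM_ORDER = ["S3", "S6", "S9"]

# Symptoms caused by each problem, as implied by the codes on the slide.
correlation = {
    "p1": {"S3", "S9"},
    "p2": {"S3", "S6"},
    "p11": {"S3", "S9"},
}


def code_for(problem: str) -> tuple:
    """One bit per symptom, in SYMPTOM_ORDER: 1 if the problem causes it."""
    return tuple(int(s in correlation[problem]) for s in SYMPTOM_ORDER)


codes = {p: code_for(p) for p in correlation}


def exact_matches(observed: tuple) -> list:
    """Problems whose code equals the observed symptom vector."""
    return sorted(p for p, c in codes.items() if c == tuple(observed))


print(codes)
print(exact_matches((1, 0, 1)))
```

Observing S3 and S9 matches both p1 and p11, because the two problems have identical codes.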

Correlation
• What happens when we observe symptoms S3 and S9? Both P1 and P11 match the observed vector. We know there is a problem but cannot identify it, since both problems have identical codes.
• What happens when we observe the symptom vector (0, 1, 0)? Two possibilities: (1) a false event, or (2) P3 occurred but one symptom was lost.
  • The interpretation depends on whether loss is more likely than false alarm generation
• If spurious or lost symptoms are unlikely, the information provided by S9 is redundant → the codes (1, 0) and (1, 1) are sufficient to correlate event vectors.
• The subset of symptoms required to provide the desired level of distinction between problems is called the codebook
[Figure: correlation graph]

Correlation- example
• The codebook contains only three symptoms
• The codebook distinguishes among all problems; however, it guarantees distinction by only a single symptom
  • A loss or spurious generation of S4 will result in a decoding error
  • The codebook is not resilient to noise
• Distinction between problems is measured by the Hamming distance between their codes
• The radius is ½ the Hamming distance

Correlation- example
• Event vectors {011100, 101100, 110100, 111000} will be decoded as P1 with a single symptom loss, and {111110, 111101} will be decoded as P1 with a single spurious symptom
• When two symptom errors occur, the decoder will detect the error but cannot uniquely decode the event (e.g., it cannot choose between P1 and P4)
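Decoding by closest match can be sketched as a minimum Hamming distance search over the codebook. The code 111100 for P1 is inferred from the event vectors listed above; the code for P4 is not given in the slides, so the value used here is a hypothetical one chosen to lie at distance 4 from P1. The function names are illustrative.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def decode(observed: str, codebook: dict) -> tuple:
    """Return (closest problems, distance); more than one problem means a tie."""
    distances = {p: hamming_distance(observed, c) for p, c in codebook.items()}
    best = min(distances.values())
    return sorted(p for p, d in distances.items() if d == best), best


CODEBOOK = {
    "P1": "111100",  # inferred from the single-loss vectors on the slide
    "P4": "110011",  # hypothetical code, distance 4 from P1
}

lost_one = ["011100", "101100", "110100", "111000"]
spurious_one = ["111110", "111101"]
for vector in lost_one + spurious_one:
    print(vector, decode(vector, CODEBOOK))

# Two symptom errors relative to P1: the decoder sees an error but gets a tie.
print("110000", decode("110000", CODEBOOK))
```

Every single-error vector decodes to P1 at distance 1, while 110000 is equally close to P1 and P4, which matches the "detect but not uniquely decode" case.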

Correlation- Advantages
