North Carolina State University Columbia University Florida Institute of Technology

A Data Mining Approach for Building Cost-Sensitive and Light Intrusion Detection Models • PI Meeting - July, 2000

Overview • Project description • Progress report: • correlation • cost-sensitive modeling • anomaly detection • collaboration with industry • Plan of work for 2000-2001

New Ideas/Hypotheses • High-volume automated attacks can overwhelm an IDS and its staff. • Use cost-sensitive data mining algorithms to construct ID models that consider cost factors: • damage cost, response cost, operational cost, etc. • Multiple specialized and light ID models can be dynamically activated/configured at run time. • Cost-effectiveness as the guiding principle and multi-model correlation as the architectural approach.

Impact • A better understanding of the cost factors, cost models, and cost metrics related to intrusion detection. • Modeling techniques and deployment strategies for cost-effective IDSs. • “Clustering” techniques for grouping intrusions and building specialized and light models. • An architecture for dynamically activating, configuring, and correlating ID models.

Correlation: Model and Issues • “Good” base models: data sources and modeling techniques. • The combined model: the correlation algorithms and network topology. • (Figure: correlation across sources and across time/sources.)

Correlation: Approaches • Extend previous work in JAM • A sequence of time-stamped records • each is composed of signals from multiple sensors (network topology information embedded); • Apply data mining techniques to learn how to correlate the signals to generate a combined sensor: • link analysis, sequence analysis, machine learning (classification), etc.

Correlation: Integrating NM and ID Signals • A stream of measures (anomaly reports) on MIB variables of network elements and a stream of ID signals: • Better coverage; • Early sensing of attacks. • Normal measures of network traffic and parameter values of ID signatures: • S = f(N, A); if A is invariant, then S = g(N). • Automatic parameter adjustment: S1 = g(N1).

Cost Factors of IDSs • Attack taxonomy: result/target/technique • Development cost • Damage cost (DCost) • The amount of damage when ID is not available or ineffective. • Response cost (RCost) • The cost of acting upon an alarm of potential intrusion. • Operational cost (OpCost) • The cost of processing and analyzing audit data; • Mainly the computational costs of the features.

Cost Models of IDSs • The total cost of an IDS over a set of events E: • $\mathrm{CumulativeCost}(E) = \sum_{e \in E} \big(\mathrm{CCost}(e) + \mathrm{OpCost}(e)\big)$ • CCost(e), the consequential cost, depends on the prediction on event e
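The cumulative cost is a plain sum over events, which a few lines of Python can illustrate. This is a minimal sketch: the `EventCost` record and the numbers are hypothetical, and CCost is taken as a given input per event because its value depends on the prediction made for that event.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EventCost:
    """Costs attributed to one event e (hypothetical record layout)."""
    ccost: float   # consequential cost of the prediction made on e
    opcost: float  # operational cost of the features computed for e


def cumulative_cost(events: Iterable[EventCost]) -> float:
    """CumulativeCost(E) = sum over e in E of (CCost(e) + OpCost(e))."""
    return sum(e.ccost + e.opcost for e in events)


# Illustrative values only; they are not taken from the presentation.
events = [
    EventCost(ccost=0.0, opcost=1.0),
    EventCost(ccost=30.0, opcost=5.0),
    EventCost(ccost=10.0, opcost=1.0),
]
total = cumulative_cost(events)
```

Keeping CCost and OpCost as separate fields makes it possible to report the two reductions separately, as the results slides do.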

Consequential Cost (CCost) • For event e :

Cost-sensitive Modeling: Objectives • Reducing operational costs: • Use cheap features in ID models. • Reducing consequential costs: • Do not respond to an intrusion if RCost > DCost.

Cost-sensitive Modeling: Approaches • Reducing operational costs: • A multiple-model approach: • Build multiple rule-sets, each with features of different cost levels; • Use cheaper rule-sets first, costlier ones later only for required accuracy. • Feature-Cost-Sensitive Rule Induction: • Search heuristic considers information gain AND feature cost.
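The multiple-model idea can be sketched as a cascade that consults rule-sets from cheapest to costliest and stops once a prediction is confident enough. Everything named here is an assumption for illustration: the two toy rule-sets, the feature names `dst_port` and `failed_logins`, the confidence values, and the convention that the OpCost of every consulted rule-set is added up.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, float]
# A rule-set maps an event to (predicted label, confidence in [0, 1]).
RuleSet = Callable[[Event], Tuple[str, float]]


def cascade_predict(event: Event,
                    rule_sets: List[Tuple[RuleSet, float]],
                    min_confidence: float) -> Tuple[str, float]:
    """Try (rule_set, opcost) pairs, ordered cheap to costly.

    Returns the first sufficiently confident label, or the last
    rule-set's label, together with the operational cost spent.
    """
    if not rule_sets:
        raise ValueError("at least one rule-set is required")
    spent = 0.0
    for rule_set, opcost in rule_sets:
        spent += opcost
        label, confidence = rule_set(event)
        if confidence >= min_confidence:
            break
    return label, spent


def cheap_rules(event: Event) -> Tuple[str, float]:
    # Hypothetical rule-set using only a feature known from the first packet.
    if event["dst_port"] == 23:
        return "normal", 0.6   # low confidence: defer to a costlier model
    return "normal", 0.99


def costly_rules(event: Event) -> Tuple[str, float]:
    # Hypothetical rule-set using a feature that needs connection content.
    if event["failed_logins"] >= 3:
        return "guess_passwd", 0.95
    return "normal", 0.9


MODELS = [(cheap_rules, 1.0), (costly_rules, 10.0)]
quick = cascade_predict({"dst_port": 80.0, "failed_logins": 0.0}, MODELS, 0.9)
slow = cascade_predict({"dst_port": 23.0, "failed_logins": 5.0}, MODELS, 0.9)
```

In this toy run the first event is settled by the cheap rule-set alone, while the second pays for both, which is where the operational savings of the approach would come from if most traffic resembles the first case.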

Cost-sensitive Modeling: Approaches (continued) • Reducing consequential costs: • MetaCost: • Purposely re-label intrusions with RCost > DCost as normal. • Post-Detection decision: • Action depends on comparison of RCost and DCost.
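Both ways of reducing consequential cost reduce to the same comparison of RCost and DCost, applied either to training labels or to predictions. The sketch below shows only that comparison; it is not the full MetaCost procedure, and the cost table and its values are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical per-class costs: label -> (DCost, RCost).
COSTS: Dict[str, Tuple[float, float]] = {
    "buffer_overflow": (100.0, 40.0),
    "ipsweep": (2.0, 20.0),
}


def post_detection_action(predicted: str,
                          costs: Dict[str, Tuple[float, float]]) -> str:
    """Decide after detection: respond only if RCost does not exceed DCost."""
    if predicted == "normal":
        return "ignore"
    dcost, rcost = costs[predicted]
    return "ignore" if rcost > dcost else "respond"


def relabel_for_training(labels: List[str],
                         costs: Dict[str, Tuple[float, float]]) -> List[str]:
    """Re-label intrusions with RCost > DCost as normal before training."""
    relabeled = []
    for label in labels:
        if label != "normal":
            dcost, rcost = costs[label]
            if rcost > dcost:
                label = "normal"
        relabeled.append(label)
    return relabeled


actions = [post_detection_action(p, COSTS)
           for p in ["normal", "ipsweep", "buffer_overflow"]]
train_labels = relabel_for_training(
    ["normal", "ipsweep", "buffer_overflow"], COSTS)
```

The post-detection variant keeps the detector itself cost-insensitive, so the cost table can change without retraining; the relabeling variant bakes the costs into the model.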

Latest Results • OpCost • Compare the multiple-model approach with single-model approach; • rdc%: (single - multiple)/single; • range: 57% to 79%.

Latest Results (continued) • CCost using a post-detection cost-sensitive decision module • rdc% range: 75% to 95%; • Compared with single model: slightly better rdc%; • Compared with cost-insensitive models: 25% higher rdc%.

Anomaly Detection • Unsupervised Training Methods • Build models over noisy (not clean) data • Artificial Anomalies • Improves performance of anomaly detection methods. • Combining misuse and anomaly detection.

AD over Noisy Data • Builds normal models over data containing some anomalies. • Motivating Assumptions: • Intrusions are extremely rare compared to normal data. • Intrusions are quantitatively different.

Approach Overview • Mixture Model • Normal Component • Anomalous Component • Build Probabilistic Model of Data • Max Likelihood test for detection.

Mixture Model of Anomalies • Assume a generative model: the data is generated with a probability distribution D. • Each element originates from one of two components. • M, the Majority Distribution ($x \in M$). • A, the Anomalous Distribution ($x \in A$). • Thus: $D = (1-\lambda)M + \lambda A$

Modeling Probability Distributions • Train Probability Distributions over current sets of M and A. • $P_M(X)$ = probability distribution for Majority • $P_A(X)$ = probability distribution for Anomaly • Any probability modeling method can be used: Naïve Bayes, Max Entropy, etc.

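Detecting Anomalies • Likelihood of a partition of the set of all elements D into M and A: $L(D) = \prod_{x_i \in D} P_D(x_i) = \Big((1-\lambda)^{|M|} \prod_{x_i \in M} P_M(x_i)\Big)\Big(\lambda^{|A|} \prod_{x_j \in A} P_A(x_j)\Big)$ • Log Likelihood (for computational reasons): $LL(D) = \log L(D)$ • (Figure: D partitioned into M and A.)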

Algorithm for Detection • Assume all elements are normal ($M_0 = D$, $A_0 = \emptyset$). • Compute $P_D(X)$. • Using $P_D(X)$ compute $LL(D)$. • For each element compute difference in $LL(D)$ if removed from M and inserted into A. • If the difference is large enough, then declare the element an anomaly.

Evaluating $x_t$ • $M_{t+1} = M_t \setminus \{x_t\}$ • $A_{t+1} = A_t \cup \{x_t\}$ • Recompute $P_M$ and $P_A$ over the updated sets (efficiently). • If $LL_{t+1} - LL_t >$ threshold, $x_t$ is an anomaly. • Otherwise $x_t$ is normal.
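The detection loop above can be sketched for categorical data as follows. Several choices are assumptions made for illustration and are not stated in the slides: P_M is an add-one-smoothed frequency estimate, P_A is uniform over the observed values, the mixing proportion `lam` and the `threshold` are arbitrary, an element that is not flagged is put back into M, and the log likelihood is recomputed from scratch rather than updated efficiently.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def log_likelihood(m_counts: Counter, n_m: int, n_a: int,
                   lam: float, vocab_size: int) -> float:
    """LL of a partition: |M| log(1-lam) + sum log P_M + |A| log lam + sum log P_A."""
    ll = n_m * math.log(1.0 - lam) + n_a * math.log(lam)
    # P_M: add-one-smoothed frequency over the elements currently in M.
    for count in m_counts.values():
        if count:
            ll += count * math.log((count + 1.0) / (n_m + vocab_size))
    # P_A: uniform over all observed values (an assumption of this sketch).
    ll -= n_a * math.log(vocab_size)
    return ll


def detect_anomalies(data: Sequence[Hashable], lam: float = 0.05,
                     threshold: float = 1.0) -> List[int]:
    """Return indices of elements whose move from M to A raises LL enough."""
    vocab_size = len(set(data))
    m_counts = Counter(data)          # M_0 = D, A_0 = empty set
    n_m, n_a = len(data), 0
    ll = log_likelihood(m_counts, n_m, n_a, lam, vocab_size)
    anomalies = []
    for i, x in enumerate(data):
        m_counts[x] -= 1              # tentatively move x from M to A
        new_ll = log_likelihood(m_counts, n_m - 1, n_a + 1, lam, vocab_size)
        if new_ll - ll > threshold:
            n_m, n_a, ll = n_m - 1, n_a + 1, new_ll
            anomalies.append(i)
        else:
            m_counts[x] += 1          # not an anomaly: return x to M
    return anomalies


# Toy data: two frequent values and one value seen exactly once.
sample = ["a"] * 600 + ["z"] + ["b"] * 399
flagged = detect_anomalies(sample)
```

In this toy data only the single rare value gains enough likelihood by being explained by the anomalous component; moving a frequent value costs more than it gains because of the `log(lam)` penalty.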

Experiments • Two Sets of experiments: • Measured Performance against comparison methods over noisy data. • Measured Performance trained over noisy data against comparison methods trained over clean data.

AD Using Artificial Anomalies • Generate abnormal behavior artificially • assume the given normal data are representative • “near misses” of normal behavior are considered abnormal • change the value of only one feature in an instance of normal behavior • sparsely represented values are sampled more frequently • “near misses” help define a tight boundary enclosing the normal behavior
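One way to read the bullets above is the following sketch, which mutates a single feature per generated record and draws more often from sparsely represented values. The exact sampling rule (top each value up to the count of the most frequent value of that feature) and the filter that drops mutations identical to a known normal record are assumptions of this sketch, as are the toy records.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

Record = Tuple[str, ...]


def generate_artificial_anomalies(normal: Sequence[Record],
                                  seed: int = 0) -> List[Record]:
    """Create "near misses" by changing one feature of a normal record."""
    rng = random.Random(seed)
    known = set(normal)
    anomalies: List[Record] = []
    for f in range(len(normal[0])):
        counts = Counter(rec[f] for rec in normal)
        values = sorted(counts)
        if len(values) < 2:
            continue  # a constant feature has no alternative value
        max_count = max(counts.values())
        for v in values:
            candidates = [rec for rec in normal if rec[f] == v]
            others = [u for u in values if u != v]
            # Rarer values get more draws: max_count - counts[v] of them.
            for _ in range(max_count - counts[v]):
                base = rng.choice(candidates)
                mutated = base[:f] + (rng.choice(others),) + base[f + 1:]
                if mutated not in known:  # keep only true near misses
                    anomalies.append(mutated)
    return anomalies


# Toy "normal" connection records: (protocol, service, flag).
normal = ([("tcp", "http", "SF")] * 5
          + [("tcp", "smtp", "SF")] * 2
          + [("udp", "dns", "SF")])
anomalies = generate_artificial_anomalies(normal)
```

The generated records would then be labeled as anomalies and mixed with the normal data before training a classifier such as a rule learner.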

Experimental Results • Learning algorithm: RIPPER rule learner. • Data: 1998/99 DARPA evaluation • U2R, R2L, DOS, PRB: 22 “clusters” • Training data: normal and artificial anomalies • Results • Overall hit rate: 94.26% (correctly classified as normal or intrusion) • Overall false alarm rate: 2.02% • 100% detection: buffer_overflow, guess_passwd, phf, back • 0% detection: perl, spy, teardrop, ipsweep, nmap • 50+% detection: 13 out of 22 intrusion subclasses

Combining Anomaly And Misuse Detection • Training data: normal, artificially generated anomalies, known intrusions • The learned model can predict normal, anomaly, or known intrusion subclass • Experiments were performed on increasing subsets of known intrusion subclasses in the training data (simulates identified intrusions over time).

Combining Anomaly And Misuse Detection (continued) • Treat phf, pod, teardrop, spy, and smurf as unknown (absent from the training data) • Anomaly detection rate: phf=25%, pod=100%, teardrop=93.91%, spy=50%, smurf=100% • Overall false alarm rate: 0.20% • The false alarm rate dropped from 2.02% to 0.20% when some known attacks were included in training

Collaboration with Industry • RST Inc. • Anomaly detection on NT systems • NFR Inc. • real-time IDS • SAS Institute • off-line ID (funded by SAS) • Aprisma (Cabletron) • Integrating ID with NM (funded by Aprisma) • HRL Labs • ID in wireless networks (funded by HRL)

Plan for 2000-2001 • Dynamic cost-sensitive modeling and deployment • work with industry for realistic cost analysis and real-time testing • Anomaly detection • improve existing algorithms using feedback from evaluation • Correlation • develop/evaluate algorithms for integrating data/evidence from multiple sources
