Data Mining for Network Intrusion Detection: Experience with KDDCup’99 Data set


Presentation Transcript


  1. Data Mining for Network Intrusion Detection: Experience with KDDCup’99 Data set Vipin Kumar, AHPCRC, University of Minnesota Group members: L. Ertoz, M. Joshi, A. Lazarevic, H. Ramnani, P. Tan, J. Srivastava

  2. Introduction • Key challenge • Maintain a high detection rate while keeping the false alarm rate low • Misuse Detection • Two-phase learning – PNrule • Classification Based on Associations (CBA) approach • Anomaly Detection • Unsupervised (e.g. clustering) and supervised methods to detect novel attacks

  3. DARPA 1998 - KDDCup’99 Data Set • Modification of the DARPA 1998 data set prepared and managed by MIT Lincoln Lab • DARPA 1998 data includes a wide variety of intrusions simulated in a military network environment • 9 weeks of raw TCP dump data simulating a typical U.S. Air Force LAN • 7 weeks for training (5 million connection records) • 2 weeks for testing (2 million connection records)

  4. KDDCup’99 Data Set • Connections are labeled as normal or attacks • Attacks fall into 4 main categories (38 attack types): • DOS - denial of service • Probe - e.g. port scanning • U2R - unauthorized access to root privileges • R2L - unauthorized remote login to a machine • U2R and R2L are extremely small classes • 3 groups of features • Basic, content-based and time-based features (see slide 29 for details)

  5. KDDCup’99 Data Set • Training set - ~5 million connections • 10% training set - 494,021 connections • Test set - 311,029 connections • Test data has attack types that are not present in the training data => the problem is more realistic • Training set contains 22 attack types • Test data contains an additional 17 new attack types that belong to one of the four main categories

  6. Performance of Winning Strategy • Cost-sensitive bagged boosting (B. Pfahringer)

  7. Simple RIPPER classification • RIPPER trained on 10% of data (494,021 connections) • Test on entire test set (311,029 connections)

  8. Simple RIPPER on modified data • Remove duplicates and merge the train and test data sets • Sample 69,980 examples from the merged data set • Sample from the neptune and normal subclasses; other subclasses remain intact • Divide in equal proportions into new training and test sets • Apply the RIPPER algorithm to the new data set (a sketch of this procedure follows)
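
As a concrete illustration, here is a minimal Python sketch of this resampling procedure. The record format, the per-subclass sample sizes, and the assumption that the class label is the last field are hypothetical stand-ins, not taken from the original experiments.

```python
# Minimal sketch of the resampling procedure (sample sizes for the
# neptune/normal subclasses are hypothetical stand-ins).
import random

def build_modified_dataset(train_records, test_records,
                           neptune_sample=10000, normal_sample=10000,
                           seed=42):
    """Deduplicate, merge, down-sample dominant subclasses, split 50/50."""
    random.seed(seed)

    # 1. Merge train and test, then remove exact duplicate connections.
    merged = list({tuple(r) for r in train_records + test_records})

    # 2. Down-sample only the dominant subclasses (neptune, normal);
    #    all other subclasses are kept intact.
    def label(rec):
        return rec[-1]          # assume the class label is the last field

    neptune = [r for r in merged if label(r) == "neptune"]
    normal  = [r for r in merged if label(r) == "normal"]
    rest    = [r for r in merged if label(r) not in ("neptune", "normal")]

    sampled = (random.sample(neptune, min(neptune_sample, len(neptune))) +
               random.sample(normal,  min(normal_sample,  len(normal))) +
               rest)

    # 3. Shuffle and split in equal proportion into new train/test sets.
    random.shuffle(sampled)
    half = len(sampled) // 2
    return sampled[:half], sampled[half:]
```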

  9. Building Predictive Models in NID • Models should handle skewed class distributions • Accuracy alone is not a sufficient evaluation metric • Focus on both recall and precision • Recall (R) = TP/(TP + FN) • Precision (P) = TP/(TP + FP) • F-measure = 2*R*P/(R + P) • (C denotes the rare class, NC the large class; see the short example below)
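
These three metrics reduce to a few lines of Python; the confusion-matrix counts in the example call are made up for illustration.

```python
# Recall, precision and F-measure as defined on this slide,
# computed from raw confusion-matrix counts.
def recall(tp, fn):
    return tp / (tp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def f_measure(tp, fp, fn):
    r, p = recall(tp, fn), precision(tp, fp)
    return 2 * r * p / (r + p)

# Example: a rare class C with 80 true positives, 20 false negatives
# and 40 false positives.
print(recall(80, 20), precision(80, 40), f_measure(80, 40, 20))
# -> 0.8  0.666...  0.727...
```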

  10. Predictive Models for Rare Classes • Over-sampling the small class [Ling, Li, KDD 1998] • Down-sizing the large class [Kubat, ICML 1997] • Internally bias discrimination process to compensate for class imbalance [Fawcett, DMKDD 1997] • PNrule and related work [Joshi, Agarwal, Kumar, SIAM, SIGMOD 2001] • RIPPER with stratification • SMOTE algorithm [Chawla, JAIR 2002] • RareBoost [Joshi, Agarwal, Kumar, ICDM 2001]

  11. PNrule Learning • P-phase: • cover most of the positive examples with high support • seek good recall • N-phase: • remove false positives from the examples covered in the P-phase • N-rules give high accuracy and significant support • Existing techniques can learn erroneous small signatures for the absence of C; PNrule instead learns strong signatures for the presence of NC in the N-phase (a structural sketch follows)
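
The two phases can be expressed structurally as follows. This is only a sketch of the control flow described above, not the published algorithm; `induce_rules` is a hypothetical stand-in for any sequential-covering rule learner.

```python
# A structural sketch of PNrule's two-phase learning. `induce_rules`
# is a hypothetical helper: given examples and a target predicate, it
# returns a list of rules (predicates over examples).
def pnrule(examples, is_positive, induce_rules, min_support=0.05):
    # P-phase: learn rules that together cover most of the positives,
    # favoring high recall; stop when remaining rules have low support.
    p_rules = induce_rules(examples,
                           target=is_positive,
                           min_support=min_support)

    # Collect everything the P-rules accept: true positives plus the
    # false positives we now need to weed out.
    covered = [x for x in examples if any(r(x) for r in p_rules)]

    # N-phase: on the covered set only, learn rules for the *negative*
    # class, i.e. strong signatures for the presence of NC.
    n_rules = induce_rules(covered,
                           target=lambda x: not is_positive(x),
                           min_support=min_support)

    def classify(x):
        # Accept if some P-rule fires and no N-rule vetoes it.
        return any(r(x) for r in p_rules) and not any(r(x) for r in n_rules)

    return classify
```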

  12. RIPPER vs. PNrule Classification • 5% sample from normal, smurf (DOS), neptune (DOS) from 10% of training data (494,021 connections) • Test on entire test set (311,029 connections)

  13. Classification Based on Associations (CBA) • What are association patterns? • Frequent itemset: captures a set of “items” that co-occur together frequently in a transaction database • Association rule X => y: predicts the occurrence of a set of items in a transaction given the presence of other items • Support s = fraction of transactions that contain both X and y • Confidence c = fraction of transactions containing X that also contain y (a worked example follows)
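
To make the two measures concrete, here is a small self-contained Python example over a toy transaction database (the transactions themselves are invented).

```python
# Support and confidence of an association rule X => y, computed by
# simple counting over a toy transaction database.
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(transactions, lhs | rhs) / support(transactions, lhs)

db = [{"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A", "C"}, {"A", "B", "C"}]
print(support(db, {"A", "B"}))          # s({A,B}) = 3/5 = 0.6
print(confidence(db, {"A"}, {"B"}))     # c(A => B) = 3/4 = 0.75
```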

  14. Classification Based on Associations (CBA) • Previous work: • Use association patterns to improve the overall performance of traditional classifiers. • Integrating Classification and Association Rule Mining [Liu, Li, KDD 1998] • CMAR: Accurate Classification Based on Multiple Class-Association Rules [Han, ICDM 2001] • Associations in Network Intrusion Detection • Use classification based on associations for anomaly detection and misuse detection [Lee, Stolfo, Mok 1999] • Look for abnormal associations [Barbara, Wu, Jajodia, 2001]

  15. Methodology • Pipeline: overall data set -> stratification by class -> frequent itemset generation per class -> feature selection -> feed to classifier • Example itemsets mined per class: • DOS: {A,B,C} => dos, {B,D} => dos, … • Probe: {B,F} => probe, {B,C,H} => probe, … • U2R: {A,C,D} => u2r, {E,F,H} => u2r, … • R2L: {C,K,L} => r2l, {F,G,H} => r2l, … • Normal: {A,B} => normal, {E,G} => normal, …

  16. Methodology • Current approaches use confidence-like measures to select the best rules to be added as features into the classifiers • This may work well only if each class is well represented in the data set • For rare class problems, some of the high-recall itemsets could be useful, as long as their precision is not too low • Our approach (sketched below): • Apply a frequent itemset generation algorithm to each class • Select itemsets to be added as features based on precision, recall and F-measure • Apply a classification algorithm, i.e., RIPPER, to the new data set
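
A minimal sketch of the selection step, assuming the per-class frequent itemsets have already been mined; the function names and the top-k cutoff are illustrative, not from the original implementation.

```python
# Score each frequent itemset against its class by F-measure and keep
# the top-scoring ones; each kept itemset then becomes a binary
# feature for the downstream classifier (e.g. RIPPER).
def score_itemset(records, labels, itemset, target_class):
    fires = [itemset <= set(r) for r in records]
    tp = sum(1 for f, y in zip(fires, labels) if f and y == target_class)
    fp = sum(1 for f, y in zip(fires, labels) if f and y != target_class)
    fn = sum(1 for f, y in zip(fires, labels) if not f and y == target_class)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if recall + precision == 0.0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

def select_features(records, labels, frequent_itemsets, target_class, top_k=10):
    ranked = sorted(frequent_itemsets,
                    key=lambda s: score_itemset(records, labels, s, target_class),
                    reverse=True)
    return ranked[:top_k]    # each kept itemset becomes one binary feature
```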

  17. Experimental Results (on modified data) • Four configurations compared: original RIPPER, RIPPER with high-precision rules, RIPPER with high-recall rules, RIPPER with high-F-measure rules [result tables omitted]

  18. Experimental Results (on modified data) • Same four configurations [result tables omitted] • For rare classes, rules ordered according to F-measure produce the best results

  19. CBA Summary • Association rules can improve the overall performance of classifiers • Measure used to select rules for feature addition can affect the performance of classifiers • The proposed F-measure rule selection approach leads to better overall performance

  20. Anomaly Detection – Related Work • Detect novel intrusions using pseudo-Bayesian estimators to estimate prior and posterior probabilities of new attacks [Barbara, Wu, SIAM 2001] • Generate artificial anomalies (intrusions) and then use RIPPER to learn intrusions [Fan et al, ICDM 2001] • Detect intrusions by computing changes in estimated probability distributions [Eskin, ICML 2000] • Clustering based approaches [Portnoy et al, 2001]

  21. SNN Clustering on KDDCup’99 data • SNN clustering is suited for finding clusters of varying sizes, shapes and densities in the presence of noise • Dataset • 10,000 examples were sampled from the neptune, smurf and normal classes, from both the training and test sets • Other sub-classes remain intact • Total number of instances: 97,000 • Applied shared nearest neighbor (SNN) based clustering and k-means clustering (the underlying similarity is sketched below)
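
For reference, a minimal NumPy sketch of the shared-nearest-neighbor similarity that this kind of clustering builds on; the choice of k and the Jarvis-Patrick-style sparsification are illustrative assumptions, not the exact configuration used in these experiments.

```python
# Shared-nearest-neighbor (SNN) similarity: two points are similar in
# proportion to how many of their k nearest neighbors they share.
import numpy as np

def knn_lists(X, k):
    # Pairwise Euclidean distances; exclude each point from its own list.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def snn_similarity(X, k=10):
    nbrs = [set(row) for row in knn_lists(X, k)]
    n = len(X)
    sim = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            # Count shared neighbors only if i and j appear in each
            # other's k-NN lists (the usual SNN sparsification).
            if j in nbrs[i] and i in nbrs[j]:
                sim[i, j] = sim[j, i] = len(nbrs[i] & nbrs[j])
    return sim
```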

  22. Clustering Results • SNN finds clusters composed purely of new attack types

  23. Clustering Results • K-means performance (all k-means clusters vs. tightest k-means clusters) • SNN clustering performance [charts omitted]

  24. Nearest Neighbor (NN) based Outlier Detection • For each point in the training set, calculate the distance to the closest other point • Build a histogram of these distances • Choose a threshold such that a small percentage (e.g., 2%) of the training set is classified as outliers (a sketch follows)
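
This scheme fits in a few lines of NumPy. A quantile-based threshold is assumed here as a stand-in for reading the cutoff off the histogram; the 2% outlier fraction matches the slide's example.

```python
# Nearest-neighbor outlier score: distance to the closest training
# point, thresholded so that ~2% of the training set is flagged.
import numpy as np

def nn_distances(train):
    d = np.linalg.norm(train[:, None, :] - train[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)       # ignore self-distance
    return d.min(axis=1)              # distance to closest other point

def fit_threshold(train, outlier_frac=0.02):
    scores = nn_distances(train)
    # Threshold at the (1 - outlier_frac) quantile of the score histogram.
    return np.quantile(scores, 1.0 - outlier_frac)

def is_anomaly(x, train, threshold):
    return np.linalg.norm(train - x, axis=1).min() > threshold
```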

  25. Anomaly Detection using NN Scheme [figure omitted; region labeled "attack"]

  26. Novel Attack Detection Using NN Scheme Detection Rate for Novel Attacks = 68.50% False Positive Rate for Normal connections = 2.82%

  27. Novel Attack Detection Using NN Scheme [figure omitted; novel attacks highlighted; details on slides 30-31]

  28. Conclusions • Predictive models specifically designed for rare class can help in improving the detection of small attack types • SNN clustering based approach shows promise in identifying novel attack types • Simple nearest neighbor based approaches appear capable of detecting anomalies

  29. KDDCup’99 Data Set • KDDCup’99 contains derived high-level features • 3 groups of features • basic features of individual TCP connections (duration, protocol type, service, src & dest bytes, …) • content features within a connection suggested by domain knowledge (e.g. # of failed login attempts) • time-based traffic features of the connection records • "same host" features examine only the connections that have the same destination host as the current connection • "same service" features examine only the connections that have the same service as the current connection

  30. 1-NN on Anomalies

  31. 1-NN on Known Attacks
