Data Mining Based Intrusion Detection System

Data Mining Based Intrusion Detection System Krishna C Surendra Babu

Papers: • A Data Mining Framework for Building Intrusion Detection Models (Wenke Lee, Salvatore J. Stolfo) - Research supported in part by grants from DARPA • Creation and Deployment of Data Mining-Based Intrusion Detection Systems in Oracle Database 10g

Intrusion Detection System: • Intrusion Detection Techniques: • Anomaly Detection • Misuse Detection • Attack categories: • Denial of service (DOS) • Probing • Unauthorized access to local super user (U2R) • Unauthorized access from a remote machine (R2L)

Requirements: • Reliable • Extensible • Easy to manage • Low maintenance cost

A Data Mining Framework for Building Intrusion Detection Models • Data Mining: extracting or mining knowledge from large amounts of data. • Data Warehouse: a repository of information collected from multiple sources.

A Data Mining Framework for Building Intrusion Detection Models • Why Data Mining? • The dataset is large. • Constructing an IDS manually is expensive and slow. • Updates are frequent because new intrusions appear frequently.

Challenges for Data Mining in building IDS • Develop techniques to automate the processing of knowledge-intensive feature selection. • Customize the general algorithm to incorporate domain knowledge so only relevant patterns are reported • Compute detection models that are accurate and efficient in run-time

Mining the data • Dataset Types: • Network based dataset • Host based dataset • Build the IDS by mining the audit records. • When an attack is detected, raise an alarm to the administration system.

Framework of Building IDS • Preprocessing. Summarize the raw data. • Association Rule Mining. • Find sequence patterns (Frequent Episodes) based on the association rules. • Construct new features based on the sequence patterns. • Construct classifiers on different sets of features.

Preprocessing • Summarize raw data into high-level events, • e.g., network connections described by time, duration, service, host, destination • Packet filtering tools such as Bro and NFR can be used.
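As a rough illustration of the preprocessing step, the sketch below collapses packet-level tuples into connection-level records. The tuple layout, the field names, and the rule that all packets sharing (source, destination, service) form one connection are assumptions made for this example; real tools such as Bro track protocol state instead.

```python
from collections import OrderedDict


def summarize_connections(packets):
    """Collapse (timestamp, src, dst, service, nbytes) tuples into connection records.

    Simplifying assumption: every packet sharing (src, dst, service)
    belongs to the same connection.
    """
    conns = OrderedDict()
    for ts, src, dst, service, nbytes in packets:
        key = (src, dst, service)
        if key not in conns:
            conns[key] = {"time": ts, "end": ts, "bytes": 0}
        rec = conns[key]
        rec["time"] = min(rec["time"], ts)   # first packet seen
        rec["end"] = max(rec["end"], ts)     # last packet seen
        rec["bytes"] += nbytes

    records = []
    for (src, dst, service), rec in conns.items():
        records.append({
            "time": rec["time"],
            "duration": rec["end"] - rec["time"],
            "service": service,
            "host": src,
            "destination": dst,
            "bytes": rec["bytes"],
        })
    return records


# Made-up packets, for illustration only.
packets = [
    (0.0, "10.0.0.1", "10.0.0.9", "http", 100),
    (1.5, "10.0.0.1", "10.0.0.9", "http", 50),
    (2.0, "10.0.0.2", "10.0.0.9", "telnet", 10),
]
records = summarize_connections(packets)
```

Each resulting record carries the high-level attributes listed on the slide and can be fed to the mining steps that follow.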

Classification • Classify each audit record into one of a discrete set of possible categories, normal or a particular kind of intrusion.

Association rule mining • Searches for interesting relationships among attributes in a given data set, i.e., derives multi-feature (attribute) correlations from a database table.
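A minimal sketch of how one candidate association rule could be scored over a table of connection records, using the standard support and confidence measures. The attribute names (`service`, `flag`) and the toy records are assumptions for illustration, not data from the papers.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    records    : list of dicts, one per audit record
    antecedent : dict of attribute -> required value (left-hand side)
    consequent : dict of attribute -> required value (right-hand side)
    """
    def matches(record, items):
        return all(record.get(k) == v for k, v in items.items())

    n_total = len(records)
    n_ante = sum(1 for r in records if matches(r, antecedent))
    n_both = sum(
        1 for r in records if matches(r, antecedent) and matches(r, consequent)
    )
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Made-up records, for illustration only.
toy_records = [
    {"service": "http", "flag": "SF"},
    {"service": "http", "flag": "SF"},
    {"service": "http", "flag": "REJ"},
    {"service": "smtp", "flag": "SF"},
]
support, confidence = rule_stats(
    toy_records, {"service": "http"}, {"flag": "SF"}
)
```

A mining algorithm would enumerate many such candidate rules and keep only those whose support and confidence exceed chosen thresholds.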

Sequence Pattern Mining • Frequent Episodes. • X, Y -> Z, [c, s, w] • When itemsets X and Y occur, Z also occurs within time window w, with confidence c and support s.
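To make the episode rule notation concrete, here is a simplified sketch that estimates the confidence and support of a rule X, Y -> Z over windows of width w, reading c as confidence and s as support. It slides a window from each event and ignores the order of items inside the window, which is an assumption of this example and not the exact definition used in the paper.

```python
def episode_rule_stats(events, x, y, z, w):
    """Estimate (confidence, support) of the episode rule x, y -> z.

    events : list of (timestamp, item) tuples sorted by timestamp
    w      : window width, in the same units as the timestamps

    Simplification: one window starts at every event, and a window
    "contains" an episode if all of its items appear, in any order.
    """
    windows = []
    for i, (start, _) in enumerate(events):
        items = {item for ts, item in events[i:] if ts <= start + w}
        windows.append(items)

    n_xy = sum(1 for items in windows if {x, y} <= items)
    n_xyz = sum(1 for items in windows if {x, y, z} <= items)
    support = n_xyz / len(windows) if windows else 0.0
    confidence = n_xyz / n_xy if n_xy else 0.0
    return confidence, support


# Made-up event stream, for illustration only.
toy_events = [(0, "A"), (1, "B"), (2, "C"), (10, "A"), (11, "B"), (20, "C")]
c, s = episode_rule_stats(toy_events, "A", "B", "C", w=2)
```

In the toy stream, A and B appear together in two windows but C follows within the window only once, so the rule holds half of the time.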

Feature Construction • Feature extraction is the process of determining which evidence taken from raw audit data is most useful for analysis. • Construct new features according to the frequent episodes. • Some features are closely related to each other; combine them. • Some frequent episodes may indicate interesting new features.
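As one hedged example of a constructed feature, the sketch below adds to each connection record a count of connections to the same destination host within the preceding time window, the kind of temporal statistic a frequent episode might suggest. The 2-second window, the field names, and the feature name `same_host_count` are assumptions for this illustration.

```python
from collections import defaultdict, deque


def add_same_host_count(connections, window=2.0):
    """Annotate each connection with the number of connections to the
    same destination host in the last `window` seconds (current one included).

    connections : list of dicts with "time" and "dst_host" keys,
                  assumed to be sorted by "time".
    """
    recent = defaultdict(deque)  # dst_host -> timestamps still inside the window
    annotated = []
    for conn in connections:
        times = recent[conn["dst_host"]]
        # Drop timestamps that have fallen out of the window.
        while times and times[0] < conn["time"] - window:
            times.popleft()
        times.append(conn["time"])
        annotated.append({**conn, "same_host_count": len(times)})
    return annotated


# Made-up connections, for illustration only.
toy_connections = [
    {"time": 0.0, "dst_host": "victim"},
    {"time": 0.5, "dst_host": "victim"},
    {"time": 1.0, "dst_host": "victim"},
    {"time": 1.2, "dst_host": "other"},
    {"time": 5.0, "dst_host": "victim"},
]
counts = [c["same_host_count"] for c in add_same_host_count(toy_connections)]
```

A burst of connections to one host produces a rising count, which a classifier can then use as evidence of flooding or scanning behavior.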

Build Model (classifier) • Build different classifiers for different attacks.

Experiments • The DARPA data • 4 GB of compressed tcpdump data from 7 weeks of network traffic. • Contains 4 main categories of attacks • DOS: denial of service, e.g., ping-of-death, SYN flood • R2L: unauthorized access from a remote machine, e.g., password guessing • U2R: unauthorized access to local super user privileges by a local unprivileged user, e.g., buffer overflow • PROBING: e.g., port scan, ping sweep

Results • Training on the 7 weeks of labeled data and testing on 2 weeks of unlabeled data. • The test data contains 14 attack types that do not exist in the training data. • Comparing 4 methods: • Columbia: the IDS developed according to the framework introduced above • Group 1-3: three systems developed by knowledge engineering approaches.

Results • Detection rate on new and old attacks. • Old attacks: attack types that occur in both the training and testing data. • New attacks: attack types that occur in the testing data only.

Creation and Deployment of Data Mining-Based Intrusion Detection Systems in Oracle Database 10g • DAID: a database-centric architecture that leverages data mining within the Oracle RDBMS to address the challenges. • Scheduling capabilities • Alert infrastructure • Data analysis tools • Security • Scalability • Reliability

Requirements for a production quality IDS • Centralized view of the data • Data transformation capabilities • Analytic and data mining methods • Flexible detector deployment, including scheduling that enables periodic model creation and distribution • Real-time detection and alert infrastructure • Reporting capabilities • Distributed processing • High system availability • Scalability with system load

• Sensors • Extraction, transformation and load (ETL) • Centralized data warehousing • Automated model generation • Automated model distribution • Real-time and offline detection • Report and analysis • Automated alerts

Sensors • Collect audit information • Network traffic data • System logs on individual hosts • System calls made by processes

ETL • Used for preprocessing audit streams and feature extraction • Uses SQL and user-defined functions to extract key pieces of information. Example: a windowing analytic function computes the number of HTTP connections to a given host.
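The HTTP connection count can be sketched in SQL. The example below runs against an in-memory SQLite database from Python as a stand-in for Oracle, and uses a correlated subquery in place of a windowing analytic function so that it works on older SQLite builds. The column names (`ts`, `dst_host`, `service`) and the 2-second window are assumptions; only the table name `audit_data` appears in the slides.

```python
import sqlite3


def http_connection_counts(rows, window_seconds=2):
    """For every audit row, count the HTTP connections to the same
    destination host in the preceding `window_seconds` (inclusive).

    rows : iterable of (ts, dst_host, service) tuples
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE audit_data (ts REAL, dst_host TEXT, service TEXT)")
    con.executemany("INSERT INTO audit_data VALUES (?, ?, ?)", rows)
    query = """
        SELECT a.ts,
               a.dst_host,
               (SELECT COUNT(*)
                  FROM audit_data b
                 WHERE b.service = 'http'
                   AND b.dst_host = a.dst_host
                   AND b.ts BETWEEN a.ts - ? AND a.ts) AS http_count
          FROM audit_data a
         ORDER BY a.ts
    """
    result = con.execute(query, (window_seconds,)).fetchall()
    con.close()
    return result


# Made-up audit rows, for illustration only.
toy_rows = [
    (0.0, "h1", "http"),
    (1.0, "h1", "http"),
    (1.5, "h1", "ftp"),
    (4.0, "h1", "http"),
    (4.5, "h2", "http"),
]
http_counts = [row[2] for row in http_connection_counts(toy_rows)]
```

Keeping the computation in SQL means the derived feature is produced where the audit data already lives, which matches the database-centric design described here.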

Model Generation Popular Techniques for misuse and anomaly detection: • Association Rules • Clustering • Support Vector Machines • Supervised learning methods for Classification • Decision Trees

Model build functionality: • DBMS_DATA_MINING PL/SQL package • Used to train linear SVM anomaly and misuse detection models. • Test dataset attack categories: • Probing • Denial of service • Unauthorized access to a local superuser (U2R) • Unauthorized access from a remote machine (R2L) • (37 subclasses of attacks under the 4 generic categories)

Misuse Detection Problem • Anomaly Detection Problem • Accuracy of the system 92.1%

Periodic Model Updates as new data is accumulated • Model rebuild when the performance falls below a predefined level
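The rebuild rule can be sketched as a simple check that a scheduler would run periodically. The function names, the use of accuracy as the performance measure, and the 0.9 threshold are all assumptions for this illustration; the slides do not specify them.

```python
def accuracy(model, labeled_cases):
    """Fraction of (features, label) pairs the model classifies correctly."""
    if not labeled_cases:
        return 0.0
    correct = sum(1 for features, label in labeled_cases if model(features) == label)
    return correct / len(labeled_cases)


def maybe_rebuild(model, train, recent_labeled, min_accuracy=0.9):
    """Return (model, rebuilt). Retrain only when performance on recent
    labeled data falls below the predefined level `min_accuracy`."""
    if accuracy(model, recent_labeled) < min_accuracy:
        return train(recent_labeled), True
    return model, False


# Hypothetical stand-ins for a deployed model and a training routine.
def stale_model(features):
    return "normal"  # never flags anything


def train_threshold_model(labeled_cases):
    # Toy "training": flag any case with more than 10 connections.
    return lambda features: "attack" if features["count"] > 10 else "normal"


# Made-up labeled data, for illustration only.
recent = [({"count": 50}, "attack"), ({"count": 1}, "normal")]
new_model, rebuilt = maybe_rebuild(stale_model, train_threshold_model, recent)
same_model, rebuilt_again = maybe_rebuild(new_model, train_threshold_model, recent)
```

The second call leaves the model untouched because the retrained model already meets the threshold on the recent data.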

Model Distribution Real Application Clusters (RAC)

Detection • Real time / offline • Audit data are classified as attack or not by the misuse detection SVM model.

Functional index on the probability of a case being an attack or not • returns all cases in audit_data with probability greater than 0.5 of being an attack

Combination of multiple models • The query returns all cases where either model1 or model2 indicates an attack with probability higher than 0.4. • In this case, when the anomaly_model classifies a case as an attack with probability greater than 0.5, the misuse_model will attempt to identify the type of attack.
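The two ways of combining models described here can be sketched in plain Python. This is not the SQL from the presentation: the models are hypothetical callables, where the probability models return a number between 0 and 1 and the misuse model returns an attack type, and the toy cases carry precomputed scores.

```python
def flagged_by_either(cases, model1, model2, threshold=0.4):
    """Return the cases that either model scores above `threshold`."""
    return [c for c in cases if model1(c) > threshold or model2(c) > threshold]


def classify_cascade(case, anomaly_model, misuse_model, threshold=0.5):
    """Ask the misuse model for an attack type only when the anomaly
    model scores the case above `threshold`; otherwise return None."""
    if anomaly_model(case) > threshold:
        return misuse_model(case)
    return None


# Hypothetical stand-in models that read precomputed values from each case.
def model1(case):
    return case["p1"]


def model2(case):
    return case["p2"]


def misuse_model(case):
    return case["kind"]


# Made-up cases, for illustration only.
toy_cases = [
    {"id": 1, "p1": 0.9, "p2": 0.1, "kind": "dos"},
    {"id": 2, "p1": 0.2, "p2": 0.45, "kind": "probing"},
    {"id": 3, "p1": 0.1, "p2": 0.3, "kind": "normal"},
]
flagged_ids = [c["id"] for c in flagged_by_either(toy_cases, model1, model2)]
attack_types = [classify_cascade(c, model1, misuse_model) for c in toy_cases]
```

The first function widens coverage by taking the union of two detectors, while the cascade spends the more specific misuse model only on cases the anomaly model already considers suspicious.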

Reports and Analysis

Conclusion • Data mining techniques are very useful in intrusion detection • Manual interpretation/advice is still needed in some processing steps • More efficient on known attacks than on unknown attacks only if the training data contains all normal behavior
