MINDS – A High Performance Data Mining Based Intrusion Detection System

Vipin Kumar, University of Minnesota • http://www.cs.umn.edu/research/minds/ • Team Members: Varun Chandola, Eric Eilertson, Benjamin Mayer, Gyorgy Simon, Mark Shaneck, Michael Steinbach, Vipin Kumar

Objectives • Develop innovative high-performance techniques for detecting sophisticated attacks in an on-line and real-time manner • Detect stealth attacks by sophisticated adversaries that are specifically designed to evade detection by known intrusion detection system (IDS) tools • Track down the source of attacks and the scope of the compromise after the break-in is detected Goals of Current Research • Development of new, scalable algorithms for analyzing large amounts of network data: scan detection, summarization of network traffic, profiling, sequential pattern analysis, and context extraction • Incorporate these algorithms into MINDS and the ARL-CIMP’s Interrogator

Relevance to Army and Research Portfolio • As the Army, and the DoD as a whole, shifts to network-centric warfare, protecting the network means more than just protecting sensitive information. It now also means protecting the lives of war fighters and innocent civilians. • In a wartime situation, a breach of the DoD’s computer networks puts the lives and operations of all allies in grave danger, as the enemy may know about operations before the soldier on the ground does. • This makes intrusion detection not only an important technology for ensuring cyber security, but also critical for ensuring military superiority.

Background of Problem • Traditional intrusion detection system (IDS) tools are based on signatures of known attacks and have well-known limitations • Signature database has to be manually revised for each new type of discovered intrusion • Substantial latency in deployment of newly created signatures across the computer system • Cannot detect emerging cyber threats • Not suitable for detecting policy violations and insider abuse • Do not provide understanding of network traffic • Generate too many false alarms • Not suited for detecting multi-step attacks • Data mining based techniques offer great promise for addressing these limitations • [Figure: Spread of SQL Slammer worm 10 minutes after its deployment] • Example of SNORT rule (MS-SQL “Slammer” worm), from www.snort.org: any -> udp port 1434 (content:"|81 F1 03 01 04 9B 81 F1 01|"; content:"sock"; content:"send")

Relevance to HPC • Network traffic monitoring generates a large amount of data • HPC is critical for on-line analysis and scalability • Parallel versions of anomaly detection algorithms are required for on-line and distributed anomaly detection • Scalable, parallel algorithms for clustering, association analysis, summarization and 2nd level analysis will enable the analysis of data over months/years to detect long-term patterns and trends in network traffic

Work Done in the Past Year • Protocol Anomaly Detection • Clustering Long Term Patterns • Summarization of Network Traffic • Scan Detection • Data Mining Approach • Automatic Labeling of Training Data • 2nd Level Analysis Tools • Improving the Netflows Database Schema • Profiling of Long-term Patterns • On-demand Profiling • Privacy Preserving Data Mining • Distributed Outlier Detection, Clustering, Classification

Scan Detection - Introduction • Scans are reconnaissance operations to map services and find vulnerabilities • Administrators can take preventive measures to protect network assets targeted by scans • Scanners hide their activity: slow or distributed scans touch very few hosts during a time interval • Current scan detection tools: for each source IP, count the number of destination IPs it connects to on each destination port; if this count exceeds a threshold, the source is declared to be scanning • Improvement: distinguish whether the service was offered or not (TRW – Jung et al., 2004) • Improvement: make use of the frequency of service offered on the (destination IP, port) combination (Ertoz, Eilertson, Kumar et al., 2004) • Low thresholds have a high false alarm rate and/or low coverage
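The basic threshold scheme described on this slide can be sketched in a few lines of Python. This is only an illustration of the counting rule, not the code of any particular tool: the function name `detect_scanners`, the flow tuple layout `(src_ip, dst_ip, dst_port)`, and the sample addresses are all assumptions made for the example.

```python
from collections import defaultdict


def detect_scanners(flows, threshold):
    """Flag source IPs that contact more than `threshold` distinct
    destination IPs on any single destination port.

    `flows` is an iterable of (src_ip, dst_ip, dst_port) tuples; this
    layout is a hypothetical simplification of real flow records.
    """
    # Distinct destination IPs seen per (source IP, destination port).
    targets = defaultdict(set)
    for src_ip, dst_ip, dst_port in flows:
        targets[(src_ip, dst_port)].add(dst_ip)
    # A source is flagged if any of its per-port counts exceeds the threshold.
    return {src for (src, _port), dsts in targets.items() if len(dsts) > threshold}


# Made-up sample traffic: 10.0.0.1 touches three hosts on port 22,
# 10.0.0.2 repeatedly contacts a single web server.
flows = [
    ("10.0.0.1", "192.168.1.1", 22),
    ("10.0.0.1", "192.168.1.2", 22),
    ("10.0.0.1", "192.168.1.3", 22),
    ("10.0.0.2", "192.168.1.1", 80),
    ("10.0.0.2", "192.168.1.1", 80),
]
scanners = detect_scanners(flows, threshold=2)
```

With a threshold of 2 only `10.0.0.1` is flagged. The sketch also shows why slow or distributed scans slip through: a source that touches few hosts per interval never reaches the count.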

Data Mining for Scan Detection • Scanning behavior follows certain patterns that are difficult to capture manually • Numerous features -> exponentially many combinations • Too many potential patterns for a human to systematically explore all of them • If we observe sources for a sufficiently long time, they can be labeled as scanner or normal with high confidence • Long observation is not useful for real-time scan detection • Long observation requires too much memory due to state explosion • Data mining can help build models for these patterns if labeled data is available • Key issues: (1) feature selection, (2) labeling, and (3) building a classifier

Evaluation • University of Minnesota traffic • 13 observation periods between 03/21/2005.00:00 and 03/22/2005.12:00 • Each observation period 20 minutes (approximately 4M flows)

Comparison • Model built on ID #1 and tested on the remaining 12 periods • TRW (threshold of 2) • Ripper shows outstanding and consistent performance

Claim 1 • Our data mining approach enables early detection of scanners. • In some cases, as early as the first connection attempt on a specific port • Out of 59,860 source IP, destination port pairs (SIDPs) encountered in data set ID #5, 37,475 made connection attempts to only one destination IP on each destination port. • Performance is reported on the portion of the data that contains source IPs making at most one connection attempt on each destination port • Model built on ID #1, tested on ID #5 • TRW-1: TRW at a threshold of 1 • TRW at a threshold of 2 or higher will not find such scanners • RIPPER: our proposed method

Claim 2 • Our data mining approach is capable of filtering out scanning-like benign traffic such as P2P or backscatter • Performance is reported on the portion of the data set (ID #5) that contains P2P and scanning traffic only • Model built on ID #1 • TRW-P: an SIDP making connection attempts to a P2P host is declared non-scanning • Compared against TRW-1, TRW-2, and RIPPER • Note: the experimental results for this table in the paper have similar qualitative behavior but are incorrect due to a bug in one of the scripts for producing the output

Claim 3 • Our data mining approach successfully extracted the characteristics of scanning behavior from the long-term observation • Rules (model built on ID #1) make sense • Rule #2 is the workhorse rule

Summarization of Large Data Sets • Summarization is a technique to find a compact and meaningful representation for analyzing large data sets for which manual monitoring is not possible • Clustering can be used to summarize large data sets but cannot handle categorical attributes • In domains like network intrusion detection, the data is huge and has a mix of categorical and continuous attributes • [Figure: A sample network data set with 17 records. Each record has 8 different features which are categorical or continuous.] • [Figure: A representative summary of the above data set.]

Our Contributions • Formulated the problem of summarizing transactions that contain categorical data as a dual optimization problem, and characterized a good summary using two metrics • Compaction gain: size of data / size of summary • Information loss: weighted sum of missing features • Developed two approaches to solve this problem • Clustering based approach: generate clusters from the data set and replace the members of each cluster with a feature-wise intersection of all members in that cluster. • Association rule based multi-step approach: generate frequent itemsets from the data in the first step, and then select a subset of these frequent itemsets as the summary of the data in the second step. • The selection of the subset is done heuristically to optimize the information loss for a given compaction gain. We propose a suite of heuristic-based algorithms that can generate an approximately good summary • For the data set and summary shown on the previous slide, Compaction Gain = 17/3 and Information Loss (if all features have weight = 1) = 19 • Example of rules discovered: {Milk} --> {Coke}, {Diaper, Milk} --> {Beer} • Ranked among the best 5 papers (with a student as main author) at the 5th ICDM Conference, 2005 (total submissions = 630)
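To make the two metrics concrete, here is a minimal Python sketch of the clustering based approach: each cluster is replaced by the feature-wise intersection of its members, and the two metrics are then computed. The slide does not spell out how information loss is aggregated, so this sketch assumes it is the sum, over all records, of the weights of the features missing from that record's summary. The function names, the feature names, and the four toy records are all made up for illustration.

```python
def summarize_cluster(records):
    """Feature-wise intersection: keep a feature only when every member
    of the cluster has the same value for it."""
    first = records[0]
    return {f: v for f, v in first.items()
            if all(r.get(f) == v for r in records)}


def compaction_gain(n_records, n_summaries):
    """Size of data divided by size of summary."""
    return n_records / n_summaries


def information_loss(clusters, weights):
    """Assumed aggregation: for every record, add the weights of the
    features that its cluster summary no longer carries."""
    loss = 0.0
    for records in clusters:
        summary = summarize_cluster(records)
        for record in records:
            loss += sum(weights[f] for f in record if f not in summary)
    return loss


# Hypothetical toy data: two clusters of two records each.
cluster_a = [
    {"src_ip": "10.0.0.1", "dst_port": 80, "protocol": "tcp"},
    {"src_ip": "10.0.0.2", "dst_port": 80, "protocol": "tcp"},
]
cluster_b = [
    {"src_ip": "10.0.0.3", "dst_port": 53, "protocol": "udp"},
    {"src_ip": "10.0.0.3", "dst_port": 123, "protocol": "udp"},
]
clusters = [cluster_a, cluster_b]
weights = {"src_ip": 1, "dst_port": 1, "protocol": 1}

gain = compaction_gain(sum(len(c) for c in clusters), len(clusters))
loss = information_loss(clusters, weights)
```

Here four records collapse into two summaries, so the compaction gain is 2.0, and each record loses exactly one feature, so the information loss is 4.0 with unit weights. Merging more records into fewer clusters raises the gain but shrinks the intersections, which is the trade-off the dual optimization formulation captures.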

Privacy Preserving Data Mining • Goal: provide a comprehensive, cryptographic, privacy preserving solution for nearest neighbor search and its major applications in data mining • [Diagram; recoverable labels: LOF, kNN, SNN, Basic K-distance, Vertical Partitioning, Implementation, PP NN Search Primitive, Pre-clustering, Low Dimension, Approx. NN, Multiplication Primitive, Dot Product, Division, Euclidean Distance, More Efficient Primitives, General Solution, Homomorphic Encryption, Oblivious Transfer, Comprehensive Distance Measure, More Secure Primitives]

2nd Level Analysis • Detecting an attack and cleaning up the affected computers is not enough, and can even be harmful • Allows the attacker to determine what our detection capabilities are • Attacker can reorganize his attack to try and go undetected • Attacker may go after a different organization that is easier to break into • Security analysts need to quickly determine • WHEN a compromise happened, HOW a compromise occurred, WHAT the attacker is after, WHERE the attacker came from, WHO the attacker is, HOW many computers are compromised • The above questions are answered by 2nd level analysts • Currently done almost entirely manually • Takes days to months to answer some questions, if they are ever answered

Continuing Work on 2nd Level Analysis • Developing algorithms and tools for automating much of the 2nd level analysis process • Algorithms for creating and operating on communication graphs • On-demand profiling used in pruning communication graphs • Some building blocks for performing 2nd level analysis are already in use at the ARL-CIMP • High-speed data collection • Massive data storage • Quick information retrieval

On Demand Context Extraction • Starting from a suspected bad computer, search for other computers communicating with it. • Currently in use at the ARL-CIMP in the form of Flowinator, part of the Interrogator architecture. • Flowinator contains billions of network connections • Answers in seconds questions that used to take hours • Future work • Incorporate profiling to automate iterative extraction • Allow looking for multiple IPs at once
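The extraction step described on this slide amounts to a bounded breadth-first search over a communication graph. The Python sketch below illustrates that idea only; it is not Flowinator's implementation, and the function name `extract_context`, the `(src, dst)` pair layout, and the host labels are assumptions made for the example.

```python
from collections import defaultdict, deque


def extract_context(flows, seed, max_hops=1):
    """Return {host: hop_count} for every host within `max_hops`
    communication steps of the suspected host `seed`.

    `flows` is an iterable of (src, dst) pairs; direction is ignored
    because either side of a connection may be of interest.
    """
    neighbors = defaultdict(set)
    for src, dst in flows:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        host = queue.popleft()
        if hops[host] == max_hops:
            continue  # do not expand beyond the requested radius
        for peer in neighbors[host]:
            if peer not in hops:
                hops[peer] = hops[host] + 1
                queue.append(peer)

    del hops[seed]  # report only the other computers
    return hops


# Made-up connections between hosts labeled A to E.
flows = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "A")]
direct = extract_context(flows, "A", max_hops=1)
two_hops = extract_context(flows, "A", max_hops=2)
```

With one hop the search returns B and E; with two hops it also reaches C. Repeating the search from each returned host is the iterative extraction that the future-work items propose to automate and prune with profiling.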

On Demand Host Profiling • Developing techniques for profiling hosts on the fly to rank the computers returned based upon how anomalous their activity was. • Uses the Flowinator database • Preliminary versions of this have worked well at the CIMP, but have not been incorporated into Interrogator • Future work • Determine if additional data needs to be captured for profiling • Portions of the payload, histograms of the payload, packet arrival times within a session • Develop a voting mechanism to increase accuracy • Potential voters are profiles of host, network, and computer class (e.g. workstation, server, programmer, secretary)

Publications • Journals • Varun Chandola and Vipin Kumar, “Summarization - Compressing Data into an Informative Representation”. To appear in Knowledge and Information Systems (KAIS), Springer, 2006. • Hui Xiong, Pang-Ning Tan, and Vipin Kumar, “Hyperclique Pattern Discovery”. Accepted for publication in Data Mining and Knowledge Discovery (DMKD), 2006. • Hui Xiong, Gaurav Pandey, Michael Steinbach, and Vipin Kumar, “Enhancing Data Analysis with Noise Removal”, IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 18, no. 3, pp. 304-319, March 2006. • Michael Steinbach and Vipin Kumar, “Generalizing the Notion of Confidence”. Accepted for publication in Knowledge and Information Systems (KAIS), Springer, 2006. • Hui Xiong, Shashi Shekhar, Pang-Ning Tan, and Vipin Kumar, “TAPER: A Two-Step Approach for All-strong-pairs Correlation Query in Large Databases”, IEEE Transactions on Knowledge and Data Engineering (TKDE), accepted for publication as a regular paper, 2006. • Jieping Ye, Qi Li, Hui Xiong, Haesun Park, Ravi Janardan, and Vipin Kumar, “IDR/QR: An Incremental Dimension Reduction Algorithm via QR Decomposition”, IEEE Transactions on Knowledge and Data Engineering (TKDE), 17(9), pp. 1208-1222, Sept 2005.

Publications • Books • Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, “Introduction to Data Mining”, Addison-Wesley (May 2005) • Vipin Kumar, Jaideep Srivastava, and Aleksander Lazarevic, Eds., “Managing Cyber Threats: Issues, Approaches and Challenges”, Kluwer, 2005 • Book Chapters • Varun Chandola, Eric Eilertson, Levent Ertoz, Gyorgy Simon, and Vipin Kumar, “Data Mining for Cyber Security”. To appear in Data Warehousing and Data Mining Techniques for Computer Security, editor Anoop Singhal, Springer, 2006 • Conference Proceedings • Gyorgy Simon, Hui Xiong, Eric Eilertson, and Vipin Kumar, “Scan Detection: A Data Mining Approach”. Proceedings of 6th SIAM International Conference on Data Mining (SDM), 2006. • Varun Chandola and Vipin Kumar, “Summarization - Compressing Data into an Informative Representation”. Proceedings of 5th International Conference on Data Mining (ICDM) 2005, TR-2005-037. • Michael Steinbach and Vipin Kumar, “Extending the Notion of Confidence”. Proceedings of 5th International Conference on Data Mining (ICDM) 2005, TR-2005-039. • Technical Reports • Mark Shaneck, Varun Chandola, Haiyang Liu, Changho Choi, Gyorgy Simon, Eric Eilertson, Yongdae Kim, Zhi-li Zhang, Jaideep Srivastava, and Vipin Kumar, “A Multi-Step Framework for Detecting Attack Scenarios”, Technical Report 06-004, 2006, Computer Science Department, University of Minnesota • Mark Shaneck, Yongdae Kim, and Vipin Kumar, “Privacy Preserving Nearest Neighbor Search”, CS Technical Report 06-014, 2006, Computer Science Department, University of Minnesota

Participation in Government and DoD Forums & Army Interactions • Vipin Kumar attended and gave a talk at Workshop on Edge Computing Using New Commodity Architectures (EDGE), organized by various funding agencies including ARO, DTO and NSF at UNC, May 23 - 24, 2006 • Eric Eilertson met with ARL personnel Jan 30th - Feb 4th, in Adelphi MD to incorporate updates to the MINDS software • Eric Eilertson visited the ARL-CIMP July 12th – 16th, 2005, September 26th – 30th, 2005 and November 28th – December 3rd, 2005 to help with the continued design of 2nd level analysis tools and a framework for 2nd level analysis. • Vipin Kumar served as the co-organizer for the AHPCRC PGAS workshop in September 2005 • Benjamin Mayer attended the AHPCRC PGAS Workshop in Sept 2005 • Benjamin Mayer visited the ARL-CIMP as a summer intern during May 17th to July 30th 2005 • Benjamin Mayer and Eric Eilertson attended the DREN Networkers Conference, October 2005

Invited Talks and Presentations • Vipin Kumar, “Scalable Benchmarks and Kernels for Data Mining and Analytics”, Invited Talk, Workshop on Edge Computing Using New Commodity Architectures (EDGE), UNC, May 23 - 24, 2006 • Vipin Kumar, “High-Performance Data Mining for Cyber Security”. Invited Talk, Distinguished Speaker Series, University of California, Davis, Feb 16, 2006. • Vipin Kumar, “High Performance Data Mining for Cyber Security”, invited talk at IIT Roorkee (Dec 27th, 2005). • Vipin Kumar, “High Performance Data Mining for Cyber Security”, invited talk at IIT Delhi (December 22nd, 2005). • Benjamin Mayer, Eric Eilertson and Vipin Kumar, “Analyzing Long Term Network Data for Cyber Attacks Using HPC. A Comparison of MPI and UPC Implementations.” Presented at DREN Networkers Conference, October 2005. • Benjamin Mayer, Eric Eilertson, Kerry Long, Tony Pressley and Vipin Kumar, “NPADS – Network Protocol Anomaly Detection System”. Presented at DREN Networkers Conference, October 2005. • Benjamin Mayer and A. Karl Keller, “High Productivity Parallel Programming with Objective C.” AHPCRC PGAS Workshop, September 2005.

Significant Professional Activities & Awards • Vipin Kumar, “Scaling Data Analytics”, Tutorial at Supercomputing-2005, Seattle, Nov 14, 2005 • Vipin Kumar, Elected ACM Fellow, Dec 2005 • Vipin Kumar, Technical Accomplishment Award, IEEE Computer Society, 2005 • Varun Chandola, IBM Research Student Travel Award for the paper titled “Summarization - Compressing Data into an Informative Representation” at the 5th ICDM Conference, 2005. Ranked as one of the top 5 student papers.
