On The Effectiveness of Kolmogorov Complexity Estimation to Discriminate Semantic Types

Stephen Bush and Todd Hughes • SFI Workshop on Adaptive and Resilient Computing Security • Presenters: Enkh-Amgalan Baatarjav, Kalyan Pathapati Subbu, Satyajeet Nimgaonkar

Overview • Introduction • Innovation and security • Challenges • Detecting variation in the complexity landscape • Semantic Type Classification • Framework and Experimental Test Set • Discrimination Results • Conclusion • References

Introduction • A problem in information systems is information assurance • Main idea: complexity-based vulnerability analysis • Applying Kolmogorov complexity to estimate and predict previously unknown vulnerabilities • Progress on experimental validation of the vulnerability analysis framework • Kolmogorov Complexity Video

Introduction • The salient point of complexity-based vulnerability analysis • The better one understands a phenomenon, the more concisely the phenomenon can be described. • Goal of science: to develop theories that require minimum size to be fully described • The objective of this paper • To find whether estimates of complexity can be used to differentiate known types of data based on their complexity

Intro: Benefit • Motivating early work: the active network Java complexity probe toolkit. • Tools based on Kolmogorov complexity do not require detailed a priori information about known attacks; instead they compute vulnerability from an inherent, underlying property of information itself, namely its Kolmogorov-Chaitin complexity.

Intro: Innovation and Security • A method for vulnerability identification • Waiting for an information system to be attacked • Surviving the attack • Detecting the attack • Analyzing the attack • Adding the result to a knowledge base • Both attackers and defenders of information systems are capable of innovation

Intro: Challenges • Length of time required to obtain an accurate sample (performing the analysis in real time) • A stream of data on a network link can be sampled at multiple protocol layers. • OSI model: physical, data link, network, transport, session, presentation, application • Potential attackers target areas of both low and high complexity • Low complexity: easier to observe and understand • High complexity: potentially a good place to hide activities


Detecting variation in the complexity landscape • For complexity map generation • The complexity landscape must have sufficient variation • Smallest descriptive lengths of different semantic types • Equal or vastly different • Approximation of smallest descriptive length • Best descriptor • No redundant information • Unique essence of the entity remains • Goal: maximize discrimination • Smallest representation of a sequence
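The "smallest descriptive length" above is the standard Kolmogorov complexity. As a reminder, and using notation that does not appear in the slides ($U$ a fixed universal machine, $p$ a program, $C$ any lossless compressor, $c_C$ a constant that depends only on $C$), it can be written as below. Because $K_U$ is not computable, practical estimators such as compressors only give upper bounds.

```latex
K_U(x) = \min \{\, |p| : U(p) = x \,\}
\qquad\text{and}\qquad
K_U(x) \le |C(x)| + c_C
```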

Semantic Type Classification • An input stream • Carries different kinds of information • Arrives at the complexity probe classifier • The classifier • Uses a Kolmogorov complexity estimate of the input stream to categorize incoming data into different semantic types • Audio, MS Word Document, Executable, Image, ASCII Text, or Video

Framework and Experimental Test Set • Ten randomly chosen samples of each type of data • Data filtered to extract header • The complexity estimator • returns an estimate of its complexity. • Mapper determines a semantic type • based upon the complexity estimate.
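The mapper step can be sketched as a nearest-reference lookup. This is only an illustration under assumptions: the slides do not say how the mapper decides, the function names `build_reference` and `map_to_type` are hypothetical, and the numbers are placeholders rather than results from the paper.

```python
from statistics import mean


def build_reference(labeled_estimates: dict) -> dict:
    """Average the complexity estimates observed for each semantic type."""
    return {label: mean(values) for label, values in labeled_estimates.items()}


def map_to_type(estimate: float, reference: dict) -> str:
    """Return the semantic type whose reference value is closest to the estimate."""
    return min(reference, key=lambda label: abs(reference[label] - estimate))


# Placeholder numbers for illustration only; they are NOT results from the paper.
training = {
    "ASCII Text": [0.30, 0.34, 0.32],
    "Video": [0.97, 0.99, 0.98],
}
reference = build_reference(training)
print(map_to_type(0.35, reference))
```

A single scalar per type is the simplest possible mapper; types whose estimates overlap would be confused by it, which is consistent with the combined types reported later in the deck.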

Complexity Estimator Module • Estimation using bit streams • Simple entropy estimator (H) • Lempel-Ziv (LZ) compression, Zip (Zip) compression, bZip (bZip) compression, and a frequency-based FFT estimator technique (Psi).
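A minimal sketch of three of these estimators, using only the Python standard library. It assumes that `zlib` (DEFLATE) is an acceptable stand-in for the Zip estimator and `bz2` for the bZip estimator, and that the entropy estimator works on the byte histogram; the slides do not specify these details. The LZ and Psi estimators are not shown, and the function names are hypothetical.

```python
import bz2
import math
import zlib
from collections import Counter


def entropy_bits_per_byte(data: bytes) -> float:
    """Empirical Shannon entropy of the byte histogram, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def compression_ratio(data: bytes, compress) -> float:
    """Compressed size divided by original size; lower means more compressible."""
    if not data:
        return 0.0
    return len(compress(data)) / len(data)


def estimate_all(data: bytes) -> dict:
    """Return one complexity estimate per enabled estimator."""
    return {
        "H": entropy_bits_per_byte(data) / 8.0,  # normalized to the range [0, 1]
        "Zip": compression_ratio(data, zlib.compress),  # assumption: DEFLATE stands in for Zip
        "bZip": compression_ratio(data, bz2.compress),
    }
```

For very short inputs the compression ratios can exceed 1 because of container overhead, so estimates are only comparable across inputs of similar length.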

Tunable Parameters of the Complexity Probe • Parameters: specification of filters, sampling rate, window size, and the set of estimator algorithms enabled. • The output • a single semantic type to identify a file • a vector of semantic types, one for each window
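The per-window output can be illustrated with a sliding window over the input. This sketch assumes a `zlib` compression ratio as the estimator and treats `window_size` and `step` as the tunable parameters; the function name and defaults are hypothetical, not taken from the probe itself.

```python
import zlib


def windowed_complexity(data: bytes, window_size: int = 1024, step: int = 0) -> list:
    """Return one compression-ratio estimate per window of the input.

    A step of 0 means non-overlapping windows (step equals window_size).
    A trailing partial window is ignored.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    step = step or window_size
    estimates = []
    for start in range(0, len(data) - window_size + 1, step):
        window = data[start:start + window_size]
        estimates.append(len(zlib.compress(window)) / window_size)
    return estimates
```

Feeding each element of the returned list to a mapper yields the vector of semantic types, one per window, while a single estimate over the whole input yields the single-type output.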

Discrimination Results • Discriminant analysis • Zip estimator • Squared distances between semantic types are relatively large, except for the distances circled in red. • The circled types lie very close to one another • and yield a high error rate when discriminating among them.
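The slides do not state which squared distance is used. A common choice in discriminant analysis is the squared Mahalanobis distance between group means with a pooled variance; the sketch below assumes that choice, restricted to a single estimator (one dimension). The function names are hypothetical.

```python
from statistics import mean, variance


def pooled_variance(groups: list) -> float:
    """Pooled sample variance across groups (each group needs at least 2 values)."""
    numerator = sum((len(g) - 1) * variance(g) for g in groups)
    denominator = sum(len(g) - 1 for g in groups)
    return numerator / denominator


def squared_distance(a: list, b: list, pooled_var: float) -> float:
    """Squared distance between two group means, scaled by the pooled variance."""
    return (mean(a) - mean(b)) ** 2 / pooled_var
```

Under this assumption, a small squared distance means the two types' estimates overlap heavily relative to their spread, which is why such pairs produce high error rates.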

Accuracy of the complexity-based system • The histogram columns represent the percentage of data from the experimental test set that was correctly classified • Combination of entropy types • Audio and executables as a combined type • MS Word and text as a combined type • Images and video as a combined type

Timing Profile • For a complexity estimator, the actual complexity of the data and the window size will have greatest effects on timing. • Fig. shows the mean complexity for each estimator for the entire experimental test set.

Time (ms) vs. Window Size (bytes) • The fig. shows the expected amount of time for each semantic type as a function of window size. • In every case, a larger window size requires more time to estimate complexity.

Time (ms) vs. Complexity (10 Video files) • The fig. shows the expected amount of time for each semantic type as a function of the complexity of the sequence in the window. • Time to estimate decreases as complexity increases.

Time vs. % Correct Discrimination • Discrimination vs. Compression Ratio • Accuracy vs. Time

Throughput (b/ms) per Semantic Type • Throughput for Z & H per Semantic Type • Throughput for Psi, LZ & BZ per Semantic Type
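Throughput figures of this kind can be reproduced in spirit by timing an estimator over a buffer. The sketch below is an assumption-laden illustration: it interprets "b" as bytes (the slides do not say whether bytes or bits are meant), takes the best of several runs to reduce timer noise, and uses a hypothetical function name.

```python
import time
import zlib


def throughput_bytes_per_ms(estimator, data: bytes, repeats: int = 5) -> float:
    """Bytes processed per millisecond, using the fastest of several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        estimator(data)
        best = min(best, time.perf_counter() - start)
    if best <= 0:
        return float("inf")  # timer resolution too coarse to measure this call
    return len(data) / (best * 1000.0)


# Example: throughput of a zlib-based estimator on a 1 MiB buffer of zeros.
rate = throughput_bytes_per_ms(zlib.compress, bytes(1 << 20))
```

Measured values depend on the machine, the estimator, and the data itself, so they are only meaningful as relative comparisons within one run.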

Conclusion • The results in this paper analyze whether estimates of complexity have the resolution required to differentiate known types of data based upon their complexity. • The results indicate that data types can be identified by estimates of their complexity • A map of complexity can identify suspicious types • For example, executable data embedded within passive data types

References • Stephen F. Bush (Senior Member, IEEE), On The Effectiveness of Kolmogorov Complexity Estimation to Discriminate Semantic Types. • Complexity as a Framework for Prediction, Optimization, and Assurance, Proceedings of the 2002 DARPA Active Networks Conference and Exposition (DANCE 2002), IEEE Computer Society Press, pp. 534-553, ISBN 0-7695-1564-9, May 29-30, 2002, San Francisco, California, USA. • Bush, Stephen F., Extended Abstract: Complexity and Vulnerability Analysis, Complexity and Inference, June 2-5, 2003, DIMACS Center, Rutgers University, Piscataway, NJ, Organizers: Mark Hansen, Paul Vitányi, Bin Yu. • Kirchherr W., Li M., and Vitányi P., The Miraculous Universal Distribution, The Mathematical Intelligencer, Springer-Verlag, New York, Vol. 19, No. 4, 1997. • Ming Li and Paul Vitányi, Introduction to Kolmogorov Complexity and Its Applications, Springer-Verlag, 1993, ISBN 0-387-94053-7.
