Automatic Malware Behavior Analysis using Machine Learning

Presented by: Satyajeet, Dept. of Computer & Information Sciences, University of Delaware
Paper: Automatic Analysis of Malware Behavior using Machine Learning
Authors: Konrad Rieck, Philipp Trinius, Carsten Willems, and Thorsten Holz

Abstract & Introduction
• Malware
  • Poses a major threat to the security of computer systems.
  • Very diverse: viruses, Internet worms, Trojan horses.
  • Large in volume: millions of hosts are infected.
• Obfuscation and polymorphism impede detection at the file level.
• Dynamic analysis helps characterize malware behavior and defend against it.

Abstract & Introduction Contd..
• A framework for the automatic analysis of malware behavior using machine learning.
• Clustering: automatically identifies novel classes of malware with similar behavior.
• Classification: assigns unknown malware to these discovered classes.
• An incremental approach combines both for behavior-based analysis.

Automatic Analysis of Malware Behavior
• Framework steps and procedure
  • Malware binaries are executed and monitored in a sandbox environment; a report of the system calls and their arguments is generated for each binary.
  • The sequential reports are embedded in a vector space where each dimension is associated with a behavioral pattern.
  • ML techniques are then applied to the embedded reports to identify and classify malware.
  • Incremental analysis progresses by alternating between clustering and classification.

Report representation
• Reports can be textual or XML.
• These formats are human-readable and suitable for computing general statistics.
• However, they are not efficient for automatic analysis.
• Hence MIST (Malware Instruction Set).
• MIST is inspired by the instruction sets used in processor design.

MIST
• Category: the category of the system call.
• Operation: reflects a particular system call.
• Arguments: encoded as argument blocks (argblocks).
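As a rough illustration of this structure, the Python sketch below splits a MIST-style instruction into its category, operation, and argblocks. The textual layout (`CATEGORY OPERATION | ARGBLOCK | ...`), the sample values, and the `at_level` helper are assumptions made for illustration; the slides do not give the exact encoding. The helper assumes that "MIST 1" keeps only category and operation and that "MIST 2" additionally keeps the first argblock.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MistInstruction:
    category: str               # category of the system call
    operation: str              # the particular system call in that category
    argblocks: Tuple[str, ...]  # arguments, grouped into blocks


def parse_instruction(line: str) -> MistInstruction:
    """Parse 'CATEGORY OPERATION | ARGBLOCK | ...' (assumed layout)."""
    head, *blocks = [part.strip() for part in line.split("|")]
    category, operation = head.split()
    return MistInstruction(category, operation, tuple(blocks))


def at_level(instr: MistInstruction, level: int) -> MistInstruction:
    """Keep the first (level - 1) argblocks (assumed meaning of a MIST level)."""
    if level < 1:
        raise ValueError("level must be >= 1")
    return MistInstruction(instr.category, instr.operation,
                           instr.argblocks[:level - 1])


# Hypothetical instruction, for illustration only.
example = parse_instruction("03 05 | 01 000000 | 00006d5f")
print(at_level(example, 1))
print(at_level(example, 2))
```

Dropping later argblocks makes instructions from different variants more likely to match, at the cost of detail.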

Sandbox and MIST representation

Representation
• These sequential reports capture typical malware behavior, such as changing registry keys or modifying system files.
• They are still not suitable for efficient analysis techniques, hence the need to embed behavior reports in a vector space using instruction q-grams.
• This embedding expresses the similarity of behavior geometrically, by calculating the distance between embedded reports.
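The embedding can be sketched in a few lines of Python. Each distinct q-gram (a window of q consecutive instructions) becomes one dimension of a sparse vector, and behavior is compared by Euclidean distance. The binary presence weighting, the unit-length normalization, and the function names `embed` and `distance` are assumptions of this sketch, not details taken from the slides.

```python
import math
from typing import Dict, Sequence, Tuple

QGram = Tuple[str, ...]


def embed(report: Sequence[str], q: int = 2) -> Dict[QGram, float]:
    """Map an instruction sequence to a unit-length sparse q-gram vector."""
    # Each distinct window of q consecutive instructions is one dimension.
    grams = {tuple(report[i:i + q]) for i in range(len(report) - q + 1)}
    if not grams:
        return {}
    weight = 1.0 / math.sqrt(len(grams))  # normalize to unit length
    return {gram: weight for gram in grams}


def distance(x: Dict[QGram, float], y: Dict[QGram, float]) -> float:
    """Euclidean distance between two sparse vectors."""
    keys = x.keys() | y.keys()
    return math.sqrt(sum((x.get(k, 0.0) - y.get(k, 0.0)) ** 2 for k in keys))


# Hypothetical instruction names, for illustration only.
a = embed(["open_file", "read_file", "close_file"])
b = embed(["open_file", "read_file", "set_registry"])
print(distance(a, a), distance(a, b))
```

With unit-length vectors the distance ranges from 0 for identical q-gram sets to the square root of 2 for reports that share no q-gram.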

Clustering and Classification
• Once reports are embedded in the vector space, they are ready for ML techniques.
• Clustering of behavior: identifies classes of malware with similar behavior.
• Classification of behavior: assigns malware to known classes of behavior.
• What allows us to do this?
• Malware binaries typically come in families of similar variants with similar behavior patterns!

Contd..

Algorithms
• Prototype extraction
  • Iterative algorithm.
  • Extracts a small set of prototypes from the set of reports; the first one is chosen at random.
• Clustering using prototypes
  • At the beginning, each prototype is an individual cluster.
  • The algorithm repeatedly determines and merges the nearest pair of clusters.
• Classification using prototypes
  • Learns to discriminate between classes of malware.
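A minimal Python sketch of prototype extraction and prototype-based clustering follows. The slides only state that extraction is iterative and starts from a random report, so the farthest-first selection rule, the `max_dist` and `min_dist` thresholds, and the complete-linkage merge criterion are assumptions of this sketch. Reports are represented as plain coordinate tuples for brevity.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def extract_prototypes(reports: List[Vector], max_dist: float,
                       seed: int = 0) -> List[int]:
    """Return prototype indices so each report is within max_dist of one."""
    if max_dist < 0:
        raise ValueError("max_dist must be non-negative")
    first = random.Random(seed).randrange(len(reports))  # random start
    prototypes = [first]
    # nearest[i]: distance from report i to its closest prototype so far
    nearest = [math.dist(r, reports[first]) for r in reports]
    while True:
        far = max(range(len(reports)), key=nearest.__getitem__)
        if nearest[far] <= max_dist:
            return prototypes
        prototypes.append(far)  # the worst-covered report becomes a prototype
        nearest = [min(d, math.dist(r, reports[far]))
                   for d, r in zip(nearest, reports)]


def cluster_prototypes(reports: List[Vector], prototypes: List[int],
                       min_dist: float) -> List[List[int]]:
    """Merge the nearest clusters until the closest pair exceeds min_dist."""
    clusters = [[p] for p in prototypes]  # each prototype starts alone

    def linkage(a: List[int], b: List[int]) -> float:
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(reports[i], reports[j]) for i in a for j in b)

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > min_dist:
            break
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


reports = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
protos = extract_prototypes(reports, max_dist=1.0)
print(protos, cluster_prototypes(reports, protos, min_dist=1.0))
```

Clustering only the prototypes, rather than every report, is what keeps the pairwise distance computations manageable on large corpora.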

Algorithms Contd..
• Classification using prototypes (continued)
  • For each report, the algorithm determines the nearest prototype of the clusters in the training data; if the report lies within the radius, it is assigned to that cluster.
  • Otherwise, the report is rejected and held back for later incremental analysis.
• Incremental analysis
  • Reports to be analyzed are received from a source.
  • They are first classified using the prototypes of known clusters.
  • Variants of known malware are thereby identified for further analysis.
  • Prototypes are then extracted from the remaining reports, which are clustered again.
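The classification-with-rejection step can be sketched as follows. The function names, the single global `radius`, and the dictionary layout mapping cluster labels to prototype vectors are assumptions made for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def classify(report: Vector, prototypes: Dict[str, List[Vector]],
             radius: float) -> Optional[str]:
    """Return the label of the nearest prototype, or None if out of radius."""
    best_label, best_dist = None, math.inf
    for label, vectors in prototypes.items():
        for proto in vectors:
            d = math.dist(report, proto)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label if best_dist <= radius else None


def split_known_unknown(
    reports: List[Vector], prototypes: Dict[str, List[Vector]], radius: float
) -> Tuple[Dict[int, str], List[int]]:
    """Assign reports to known clusters; collect the rest for re-clustering."""
    assigned: Dict[int, str] = {}
    rejected: List[int] = []
    for idx, report in enumerate(reports):
        label = classify(report, prototypes, radius)
        if label is None:
            rejected.append(idx)  # held back for the next clustering round
        else:
            assigned[idx] = label
    return assigned, rejected


# Hypothetical cluster labels and coordinates, for illustration only.
known = {"family_a": [(0.0, 0.0)], "family_b": [(5.0, 5.0)]}
batch = [(0.2, 0.0), (5.0, 5.1), (9.0, 9.0)]
print(split_known_unknown(batch, known, radius=1.0))
```

In an incremental run, the rejected reports would then go through prototype extraction and clustering, and the prototypes of any new clusters would be added to the known set before the next batch arrives.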

Experiments and Results

Evaluating components
• Prototype extraction
  • Evaluated using precision, recall, and compression.
  • Precision of 0.99 with the corpus compressed by 2.9% and 7%.
• Clustering
  • Evaluated using F-measure.
  • F-measure: 0.93 (MIST 1) and 0.95 (MIST 2), better than the 0.881 reported in previous related work.
• Classification
  • F-measure: 0.96 (MIST 1) and 0.99 (MIST 2).
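For readers unfamiliar with these scores, the Python sketch below shows one common way to compute precision, recall, and F-measure for a clustering against known labels. The slides do not spell out the formulas, so the definitions used here (majority-label precision, its mirror image as recall, and their harmonic mean) are an assumption, and the sample data is made up.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def precision(clusters: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Fraction of reports that carry the majority label of their cluster."""
    counts = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        counts[cluster][label] += 1
    return sum(max(c.values()) for c in counts.values()) / len(labels)


def recall(clusters: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Fraction of reports that sit in the largest cluster of their label."""
    # Same computation with the roles of clusters and labels swapped.
    return precision(labels, clusters)


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Made-up cluster assignments and labels, for illustration only.
clusters = [0, 0, 1, 1, 1, 2]
labels = ["a", "a", "a", "b", "b", "b"]
p, r = precision(clusters, labels), recall(clusters, labels)
print(round(p, 3), round(r, 3), round(f_measure(p, r), 3))
```

Precision penalizes clusters that mix several classes, while recall penalizes classes that are scattered over several clusters, so the F-measure is high only when both are avoided.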

Experiments and Results Contd..

Conclusion
• A new framework is introduced that overcomes several deficiencies of previous approaches.
• The framework is learning-based.
• The framework can be implemented in practice.
• Steps: collect malware, study it in a sandbox environment, embed the observed behavior in a vector space, and apply learning algorithms (clustering and classification).
• The process is efficient and learns automatically after the initial setup and run.

Thank you!
