
A Machine Learning Approach to Android Malware Detection






Presentation Transcript


  1. A Machine Learning Approach to Android Malware Detection Justin Sahs and Prof. Latifur Khan

  2. The Problem • Smartphones represent a significant and growing proportion of computing devices • Android in particular is the fastest growing smartphone platform, and has 52.5% market share* • The power of the Android platform allows for applications providing a variety of services, including sensitive services like banking • This power can also be leveraged by malware *http://www.gartner.com/it/page.jsp?id=1848514

  3. The Problem (cont.) • Tremendous growth in the Android Market, from 2,300 applications in March 2009 to 400,000 applications by January 2012, has also attracted significant growth in malware for Android. • TrendMicro, a global leader in antivirus, has predicted that Android malware will grow to 129,000 samples by December 2012. • Anyone can develop Android applications and host them on the Android Market, and online markets do not have a process to check Android applications for malware. • On February 2 of this year, Google added a new security feature to the Android Market to fight malware, which scans every new submission and all current apps for anomalous behavior.* • This new system does not apply to alternative Android markets. *Google Android Bouncer

  4. The Problem (cont.) • Smartphones are becoming increasingly ubiquitous. • A report from Gartner shows that over 100 million smartphones were sold in the first quarter of 2011, an increase of 85% over the first quarter of 2010*. • Malware often disguises itself as a normal application • Malware can cause financial loss and theft of private information • Users need robust malware detection software *http://www.gartner.com/it/page.jsp?id=1689814

  5. Static Analysis: Data Mining Approach [Pipeline diagram: training apps → One-Class SVM training → model; testing apps → prediction against the model → malware detection]

  6. Feature Extraction • We use an open source library called Androguard to extract features from applications: • Permissions • Control Flow Graphs • One for every method in the application • We use these extracted features to train a machine learning-based classifier • Feature set is not homogeneous (bit-vector, string and graph representations)

  7. Feature Extraction: Acquiring Applications • APK files are Android package files; applications packaged as APKs can be installed on any compatible Android device. • Benign APK files were harvested from the official Android Market using the android-market-api,* and were combined with a collection of known malware • We used 2,172 APK files in our analysis. *http://code.google.com/p/android-market-api/

  8. Background: Structure of an Android application • AndroidManifest.xml: contains permissions and other metadata • META-INF/: contains application signing information • assets/: contains auxiliary files; the Android framework does not generate IDs for assets, which are accessed through the AssetManager API • classes.dex: the compiled program code • res/: contains auxiliary files (resources) with IDs generated by the Android framework • resources.arsc: contains compiled XML files and resources

  9. Classification

  10. Classification: Permissions • Built-in permissions • Access to hardware and certain parts of the Android API • Based on a list of 121 standard built-in permissions, we construct a 121-bit vector, with a 1 for each requested permission, and a 0 otherwise • Non-standard permissions • Mainly access to other applications’ APIs • We split the strings into three sections: a prefix (usually “com” or “org”), a section of organization and product identifiers, and the permission name, ignoring instances of the strings “android” and “permission,” which are ubiquitous

  11. Classification: Permissions (example)
  Requested permissions, built-in: android.permission.WRITE_EXTERNAL_STORAGE, android.permission.CALL_PHONE, android.permission.EXPAND_STATUS_BAR, android.permission.GET_TASKS, android.permission.READ_CONTACTS, android.permission.SET_WALLPAPER, android.permission.SET_WALLPAPER_HINTS, android.permission.VIBRATE, android.permission.WRITE_SETTINGS, android.permission.READ_PHONE_STATE, android.permission.ACCESS_NETWORK_STATE, android.permission.WRITE_APN_SETTINGS, android.permission.RECEIVE_SMS, android.permission.RECEIVE_MMS, android.permission.RECEIVE_WAP_PUSH, android.permission.INTERNET, android.permission.SEND_SMS, android.permission.READ_SMS, android.permission.WRITE_SMS
  Non-standard: com.android.launcher.permission.INSTALL_SHORTCUT, com.android.launcher.permission.UNINSTALL_SHORTCUT, com.android.launcher.permission.READ_SETTINGS, com.android.launcher.permission.WRITE_SETTINGS, android.permission.GLOBAL_SEARCH_CONTROL
  The built-in permissions are represented as a 121-bit vector: 00000100 00000000 00000000 00100000 00000000 00010000 01000000 10000000 00000100 00101000 01110001 00000000 00011000 00000101 00100001 1
  The non-standard permissions become three sets of strings: [“com”], [“launcher”], [“CONTROL”, “GLOBAL”, “INSTALL”, “READ”, “SEARCH”, “SETTINGS”, “SHORTCUT”]
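The two permission features above can be sketched as follows. This is a hypothetical helper, not the authors' code, and the 121-entry built-in permission list is abbreviated to three entries for illustration:

```python
# Sketch of the permission features: a bit vector over built-in
# permissions, and split string sets for non-standard permissions.
# BUILT_IN is abbreviated; the real list has 121 entries.
BUILT_IN = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
]

def permission_bit_vector(requested, built_in=BUILT_IN):
    """One bit per built-in permission: 1 if requested, 0 otherwise."""
    requested = set(requested)
    return [1 if p in requested else 0 for p in built_in]

def split_non_standard(permission):
    """Split a non-standard permission into (prefix, organization/product
    identifiers, name tokens), dropping the ubiquitous tokens 'android'
    and 'permission'."""
    parts = [p for p in permission.split(".")
             if p not in ("android", "permission")]
    name = sorted(set(parts[-1].split("_"))) if parts else []
    prefix = parts[:1] if len(parts) > 1 else []
    middle = parts[1:-1]
    return prefix, middle, name
```

For the example above, `split_non_standard("com.android.launcher.permission.INSTALL_SHORTCUT")` yields `(["com"], ["launcher"], ["INSTALL", "SHORTCUT"])`.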

  12. Classification: Control Flow Graphs (CFGs) • Constructed from the compiled bytecode of the application • Each method can be represented as a graph • Nodes represent contiguous sequences of non-jump instructions • Edges represent jumps (goto, if, loops, etc.) • CFGs encode the behavior of the methods they represent, and are therefore a potential source of discriminating information • The actual bytecode is often obfuscated, either by the compiler for optimization or deliberately to prevent reverse engineering or detection • We perform reduction on the extracted CFGs to counteract obfuscation

  13. Classification: CFG Reduction • We reduce graphs according to three rules: (1) contiguous instruction blocks are merged; (2) unconditional jumps are merged with their targets; (3) contiguous conditional jumps that share a destination are merged.
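Rule (2), merging an unconditional jump with its target, can be illustrated on adjacency-dict graphs. This is a simplified sketch, not the authors' implementation, and it assumes every node appears as a key in the dict:

```python
def merge_linear_chains(graph):
    """Repeatedly merge u -> v when v is u's only successor and u is v's
    only predecessor, i.e. an unconditional fall-through edge.
    graph: dict mapping node -> list of successor nodes."""
    graph = {u: list(vs) for u, vs in graph.items()}
    changed = True
    while changed:
        changed = False
        # Recompute predecessors each pass (fine for a small sketch).
        preds = {v: [] for v in graph}
        for u, succs in graph.items():
            for v in succs:
                preds[v].append(u)
        for u in list(graph):
            succs = graph[u]
            if len(succs) == 1:
                v = succs[0]
                if v != u and preds[v] == [u]:
                    # Absorb v into u: the merged node keeps v's edges.
                    graph[u] = graph.pop(v)
                    changed = True
                    break
    return graph
```

A linear chain a → b → c collapses into a single node, while a diamond (two conditional branches rejoining) is left untouched because the join node has two predecessors.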

  14. Classification: Training and Testing • Once we have extracted our four feature representations, we use them to train a One-Class Support Vector Machine (1C-SVM) • The 1C-SVM is designed to detect test examples that are significantly different from the training data • We have far more examples of benign applications than malware to train on

  15. One Class SVM (1C-SVM) • A Support Vector Machine (SVM) finds the maximum-margin separating hyperplane between the positive and negative training examples in some feature space • i.e. it maximizes the distance between the hyperplane and the closest examples from each class • The SVM uses comparison functions called kernels to map each extracted feature into a high-dimensional feature space • The linear separation of the data in the feature space may correspond to a very non-linear separation of the original data • Each kernel takes two feature representations as input and outputs a number that measures similarity

  16. Features and Kernels [Diagram mapping each feature representation to its kernel, e.g. strings to the String Kernel and graphs to the Graph Kernel] • The Set Kernel applies some other kernel to each pair of elements from the two input sets, e.g. the String Kernel if the elements are strings

  17. Classification: Training and Testing (cont.) • We use a data mining library, scikit-learn (http://scikit-learn.org/), which implements a convenient wrapper around the popular LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) • The use of an SVM requires specialized functions called kernels that are used to compare features between applications • We implement these kernels ourselves

  18. Kernels • We have three feature representations (bit-vectors, strings, and graphs) • We have three kernels, each of which takes two feature representations as input, and outputs a measure of similarity • A bit-vector kernel that counts the number of equivalent bits • Example: let our two bit-vectors be <0 0 1 1 1 0 1> <1 0 1 0 1 0 1> • Then, we have 5 matching bits, so a kernel value of 5
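As a minimal sketch, the bit-vector kernel from the example is just a count of agreeing positions:

```python
def bit_vector_kernel(a, b):
    """Count the number of positions where two equal-length
    bit vectors agree."""
    return sum(1 for x, y in zip(a, b) if x == y)
```

`bit_vector_kernel([0, 0, 1, 1, 1, 0, 1], [1, 0, 1, 0, 1, 0, 1])` returns 5, matching the example above.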

  19. Kernels (cont.) • A kernel over strings that counts the number of common subsequences between two strings, weighted by length • Length is measured by the distance between the first and last elements in both strings • For example, the strings “abc” and “bxc” have as common subsequences “b”, “c”, and “bc”, which have lengths 1, 1 and 2+3=5, respectively.
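The subsequence weighting can be sketched by brute-force enumeration (exponential in string length, so illustrative only; a practical string kernel would use dynamic programming). Spans are summed across both strings here, so single-character subsequences come out as 1 + 1 = 2 rather than the slide's 1; the exact weighting in the paper's kernel may differ:

```python
from itertools import combinations

def subsequence_spans(s):
    """Map each non-empty subsequence of s to the smallest span
    (distance from first to last matched character, inclusive)."""
    spans = {}
    for r in range(1, len(s) + 1):
        for idx in combinations(range(len(s)), r):
            sub = "".join(s[i] for i in idx)
            span = idx[-1] - idx[0] + 1
            if sub not in spans or span < spans[sub]:
                spans[sub] = span
    return spans

def common_subsequence_weights(s, t):
    """Common subsequences of s and t, weighted by combined span."""
    a, b = subsequence_spans(s), subsequence_spans(t)
    return {sub: a[sub] + b[sub] for sub in a.keys() & b.keys()}
```

For “abc” and “bxc” the common subsequences are “b”, “c” and “bc”, with “bc” weighted 2 + 3 = 5 as in the example.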

  20. Kernels (cont.) • A graph kernel that iteratively relabels the nodes of each graph based on that node’s children’s labels • Original labels are based on the instructions present in each node • The generated labels are counted to generate a vector • The kernel returns the dot product of these two vectors [Example figure: two graphs, A and B, with initial node labels 1, 2 and 4]

  21. Kernels (cont.) [Figure: Graphs A and B over three relabeling iterations; e.g. in iteration 1 a node labeled 2 with children labeled 1 and 4 receives the new label 214, and a node labeled 1 with a child labeled 4 becomes 14]

  22. Kernels (cont.) • Labels and count vectors: • The dot product of these two vectors is 1*1 + 1*1 + 4*8 + 1*1 + 1*1 + 1*1 + 1*1 + 1*1 + 1*1 = 40
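The relabel-and-count scheme (similar in spirit to Weisfeiler-Lehman subtree kernels) can be sketched as below. The toy graphs in the usage are mine, not the slide's. Note that naive label concatenation can collide (e.g. "1"+"44" vs. "14"+"4"); a robust implementation would hash or delimit labels:

```python
from collections import Counter

def relabel(graph, labels):
    """One pass: append each node's sorted child labels to its own label."""
    return {n: labels[n] + "".join(sorted(labels[c] for c in graph[n]))
            for n in graph}

def graph_kernel(g1, l1, g2, l2, iterations=1):
    """Count every label seen over the iterations for each graph,
    then return the dot product of the two count vectors."""
    counts = []
    for g, labels in ((g1, dict(l1)), (g2, dict(l2))):
        c = Counter()
        for _ in range(iterations + 1):
            c.update(labels.values())
            labels = relabel(g, labels)
        counts.append(c)
    return sum(counts[0][k] * counts[1][k] for k in counts[0])
```

For a two-node chain (labels "1" → "4") compared with itself over one iteration, the count vector is {"1": 1, "4": 2, "14": 1}, giving a self-kernel of 1 + 4 + 1 = 6.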

  23. Kernels (cont.) • Additionally, we have a kernel over sets, which applies some other kernel, k0, over the elements of each set. It applies the element kernel to every pair of elements in the two sets and exponentiates these values, so that the better matches (higher values) are emphasized • We then feed the sets of strings from the non-standard permissions feature and the sets of graphs from the CFG feature into this set kernel, using the string kernel and graph kernel as element kernels, respectively
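The set kernel might look like the following; the precise form of the exponentiation (e.g. any scaling constant) is an assumption, since the slide only describes it qualitatively:

```python
from math import exp

def set_kernel(A, B, k0):
    """Apply the element kernel k0 to every cross pair of elements and
    sum exp(k0(a, b)), so that stronger matches dominate the total."""
    return sum(exp(k0(a, b)) for a in A for b in B)
```

With a simple 0/1 equality kernel as k0, the sets {"x", "y"} and {"x"} score exp(1) + exp(0).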

  24. Kernels (cont.) • Each of these kernel values is normalized, then summed to form the final kernel value • One such value is calculated for every pair of training examples, generating a kernel matrix • The kernel matrix is used to train the 1C-SVM • During testing, one value is calculated for each pair of training and testing examples.
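A sketch of the kernel-matrix step, assuming the normalization is the usual cosine normalization K[i][j] / sqrt(K[i][i] · K[j][j]) (the slide does not specify the form):

```python
from math import sqrt

def normalized_kernel_matrix(items, kernel):
    """Gram matrix with cosine-style normalization, so every diagonal
    entry is 1 and off-diagonal entries lie in a comparable range."""
    n = len(items)
    K = [[kernel(items[i], items[j]) for j in range(n)] for i in range(n)]
    return [[K[i][j] / sqrt(K[i][i] * K[j][j]) for j in range(n)]
            for i in range(n)]
```

Each per-feature kernel would be normalized this way, and the resulting matrices summed entry-wise to form the final kernel matrix passed to the 1C-SVM.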

  25. Experimental Results • We tested our system with 2081 benign applications and 91 malicious applications • The system correctly classifies approximately 90% of malware, but only correctly classifies approximately 50% of benign applications • We also tested against each of the individual features alone

  26. Background: Measures of Quality • We examine several measures of quality: • True Positive Rate (aka Recall): the proportion of actual malware that our model classifies as malware • False Negative Rate: the proportion of actual malware that our model classifies as benign; “miss” rate • True Negative Rate: the proportion of actual benign applications that our model classifies as benign • False Positive Rate: the proportion of actual benign applications that our model classifies as malware; “false alarm” rate • Precision: The proportion of malware-classified applications that are actually malware • F1: The harmonic mean of precision and recall; this gives a measure of quality between precision and recall, closer to the worse of the two • F2: Like F1, but with recall weighted twice as much as precision • F½: Like F1, but with precision weighted twice as much as recall
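These measures can be computed directly from confusion-matrix counts. The counts used in the usage below are hypothetical, chosen only to roughly match the reported ~90% true positive and ~50% true negative rates, not the paper's exact confusion matrix:

```python
def quality_measures(tp, fn, tn, fp):
    """The measures listed above, with malware as the positive class."""
    recall = tp / (tp + fn)        # true positive rate
    precision = tp / (tp + fp)

    def f_beta(beta):
        # Weighted harmonic mean: beta > 1 favors recall, beta < 1 precision.
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {
        "tpr": recall,
        "fnr": fn / (tp + fn),     # miss rate
        "tnr": tn / (tn + fp),
        "fpr": fp / (tn + fp),     # false alarm rate
        "precision": precision,
        "f1": f_beta(1.0),
        "f2": f_beta(2.0),
        "f0.5": f_beta(0.5),
    }
```

For example, `quality_measures(tp=82, fn=9, tn=1040, fp=1041)` gives a recall near 0.90 and a true negative rate near 0.50.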

  27. Experimental Results (cont.)

  28. Experimental Results (cont.) Note: The downward trend in precision and F-measures is due to the increasing benign sample size and fixed malware sample size

  29. Conclusions and Future Work • The high true positive rate is promising, but the low true negative rate shows much room for improvement • There are a number of areas ripe for future investigation: • Additional features from static analysis or even dynamic analysis • New and better kernels and feature representations • Alternative models such as the Semi-Supervised SVM, Kernel PCA or probabilistic models
