
URLDoc: Learning to Detect Malicious URLs using Online Logistic Regression


Presentation Transcript


  1. URLDoc: Learning to Detect Malicious URLs using Online Logistic Regression Presented by: Mohammed Nazim Feroz, 11/26/2013

  2. Motivation • While web services create new opportunities for people to interact, they also create new opportunities for criminals • Google detects about 300,000 malicious websites per month, a clear indication that criminals are exploiting these opportunities • Almost all online threats have one thing in common: they require the user to click on a hyperlink or type in a website address

  3. Motivation • The user needs to perform sanity checks and assess the risk of visiting a URL • Performing such an evaluation may be impossible for a novice user • As a result, users often end up clicking links without paying close attention to the URLs, leaving them vulnerable to malicious websites that exploit them

  4. Introduction • The openness of the web creates opportunities for criminals to upload malicious content • Do techniques exist to prevent malicious content from entering the web?

  5. Current Techniques • Security practitioners have developed techniques such as blacklisting to protect users from malicious websites • Although this approach has minimal overhead, it does not provide complete protection: only about 55% of malicious URLs appear in blacklists • Another drawback is that malicious websites are absent from the blacklist during the window before their detection

  6. Current Techniques • Security researchers have done extensive work on detecting accounts on social networks that are used to spread malicious messages • This approach still does not provide thorough protection where interaction happens in real time, such as on social networks, because building a profile of malicious activity can take a considerable amount of time

  7. Current Techniques • The authors of TokDoc used a method that decides on a per-token basis whether a token requires automatic healing • Their work uses n-grams and length as features for detecting malicious URLs • This research builds on their idea by supplementing a subset of their features with host-based features, which have exhibited a wealth of usable information

  8. Approach • URLDoc classifies URLs automatically based on lexical (textual) and host-based features • Scalable machine learning algorithms from Mahout are used to develop and test the classifier • Online learning is chosen over batch learning • The classifier achieves 93-97% accuracy, detecting a large number of malicious hosts with a modest false positive rate

  9. Approach • If the predictor variables are correctly identified and the URLs' metadata is carefully derived, the machine learning algorithms can sift through tens of thousands of features • Online algorithms are preferred over batch-learning algorithms • Batch learning algorithms look at every example in the training dataset on every step before updating the classifier's weights, a costly operation when the number of training examples is large

  10. Approach • Online algorithms update the weights according to the gradient of the error with respect to a single training example, as in the sketch below • Online algorithms can therefore process large datasets far more efficiently than batch algorithms
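
As a concrete illustration, here is a minimal sketch of one such per-example update for logistic regression, assuming plain Java arrays; the names (w, x, y, learningRate) are illustrative, not from the paper:

```java
// Minimal sketch of a per-example SGD update for logistic regression.
// Names (w, x, y, learningRate) are illustrative, not from the paper.
public final class SgdUpdate {
    // Sigmoid link function.
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // One online step: update the weights w using the gradient of the
    // log-loss with respect to a single training example (x, y).
    static void step(double[] w, double[] x, int y, double learningRate) {
        double z = 0.0;
        for (int j = 0; j < w.length; j++) z += w[j] * x[j];
        double error = sigmoid(z) - y;           // prediction minus label (0 or 1)
        for (int j = 0; j < w.length; j++) {
            w[j] -= learningRate * error * x[j]; // gradient step on one example
        }
    }
}
```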

  11. Problem Formulation • URL classification lends itself naturally to a binary classification problem • The target variable y(i) can take one of two possible values: malicious or benign • With k predictor variables over all feature categories, each URL is characterized by a k-dimensional feature vector x(i) = (x1(i), …, xk(i)) • The goal is to learn a function h(x) = y that maps the space of input values to the space of output values, so that h(x) is a good predictor for the corresponding value of y (the standard logistic formulation is shown below)
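
The transcript does not spell out the model, but for logistic regression the hypothesis h(x) is the standard sigmoid of a weighted sum, with w the learned weight vector:

```latex
h(x) \;=\; \sigma\!\left(w^{\top}x\right) \;=\; \frac{1}{1 + e^{-w^{\top}x}},
\qquad
\hat{y} \;=\;
\begin{cases}
\text{malicious} & \text{if } h(x) \ge 0.5,\\
\text{benign} & \text{otherwise.}
\end{cases}
```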

  12. Problem Formulation • Building a classification system involves two main phases • The first phase creates the model (i.e. the function h(x)) produced by the learning algorithm • The second phase uses that model to assign new data from the test dataset to its predicted target class • Selection of the training dataset and its predictor variables, the target classes, and the learning algorithm through which the system will learn are vital in the first phase • Predicted labels are compared with known answers to evaluate the classifier

  13. Overview of Features • Lexical features • These features have values of both types, binary and continuous • They include: • Length of the URL • Number of dots in the URL • Tokens present in the hostname, primary domain, and path parts of a URL • Tokens in the hostname are further combined into bigrams • Bigrams can capture patterns in character strings that look randomly permuted but recur in certain combinations • Example: www.depts.ttu.edu → bigrams: deptsttu, ttuedu (see the sketch below)
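
A minimal sketch of extracting these lexical features in plain Java; class and variable names are illustrative, since the paper's actual extraction code is not shown in the transcript:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative lexical feature extraction for one URL.
public final class LexicalFeatures {
    public static void main(String[] args) throws Exception {
        String url = "http://www.depts.ttu.edu/path/page.html"; // slide's example host
        URI uri = new URI(url);
        String host = uri.getHost();                            // www.depts.ttu.edu

        int length = url.length();                              // length of the URL
        long dots = url.chars().filter(c -> c == '.').count();  // number of dots

        // Tokens from the hostname; path tokens can be produced the same way.
        String[] hostTokens = host.split("\\.");

        // Bigrams over adjacent hostname tokens, starting after the leading
        // token to match the slide's example, which omits "www".
        List<String> bigrams = new ArrayList<>();
        for (int i = 1; i < hostTokens.length - 1; i++) {
            bigrams.add(hostTokens[i] + hostTokens[i + 1]);     // deptsttu, ttuedu
        }
        System.out.println(length + " " + dots + " "
                + Arrays.toString(hostTokens) + " " + bigrams);
    }
}
```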

  14. Overview of Features • Host-based features (one way to look these up is sketched below) • IP address of the URL's host – A record • IP address of the mail exchanger – MX record • IP address of the name server – NS record • PTR record • AS number • IP prefix
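
The transcript does not say how these records were resolved; below is one plausible sketch using the JDK's InetAddress and the JNDI DNS provider. AS number and IP prefix require an external BGP/whois data source, as noted in the comments:

```java
import java.net.InetAddress;
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

// Sketch of host-based feature lookups; the paper's actual resolver is unknown.
public final class HostFeatures {
    public static void main(String[] args) throws Exception {
        String host = "example.com";                 // hypothetical host

        // A record: the IP address of the URL's host.
        InetAddress a = InetAddress.getByName(host);
        System.out.println("A: " + a.getHostAddress());

        // MX and NS records via JNDI's DNS context.
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
        DirContext dns = new InitialDirContext(env);
        Attributes attrs = dns.getAttributes(host, new String[] {"MX", "NS"});
        System.out.println("MX: " + attrs.get("MX"));
        System.out.println("NS: " + attrs.get("NS"));

        // PTR records come from a reverse lookup on the A-record address;
        // AS number and IP prefix need an external BGP/whois source
        // (e.g., a routing table dump), which is outside the JDK.
    }
}
```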

  15. Overview of Features • Malicious websites have exhibited a pattern of being hosted in particular “bad” portions of the Internet • Example: McColo provided hosting for major botnets, which were responsible for sending 41% of the world’s spam just before McColo’s takedown in November 2008; McColo’s AS number was 26780 • These portions of the Internet can be characterized on a regular basis by retraining on the predictor variables • This allows the classifier to keep track of concept drift

  16. Online Logistic Regression with SGD • Logistic regression is a very flexible algorithm, as it allows the predictor variables to be of both types, continuous and binary • Mahout greatly helps in the learning process by choosing an appropriate learning rate, allowing the classification system to converge to the global minimum (a minimal training sketch follows)
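
A minimal training sketch against Mahout's org.apache.mahout.classifier.sgd.OnlineLogisticRegression (the 0.x API); the prior and hyperparameter values here are assumptions, not the paper's settings:

```java
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// Sketch of online logistic regression with Mahout 0.x.
// Hyperparameters below are illustrative assumptions.
public final class UrlClassifierSketch {
    static final int FEATURES = 100_000;   // feature-vector dimension from the paper

    public static void main(String[] args) {
        // 2 target categories (malicious / benign), L1 prior (assumed).
        OnlineLogisticRegression learner =
                new OnlineLogisticRegression(2, FEATURES, new L1())
                        .lambda(1e-4)       // regularization weight (assumed)
                        .learningRate(50);  // initial rate; Mahout anneals it

        Vector v = new RandomAccessSparseVector(FEATURES);
        // ... encode one URL's features into v (see the hashing sketch below) ...
        int label = 1;                      // 1 = malicious, 0 = benign
        learner.train(label, v);            // one online SGD step

        double p = learner.classifyScalar(v); // P(malicious) for a new URL
        System.out.println(p);
    }
}
```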

  17. Online Logistic Regression with SGD • Compared to batch learning, online learning is usually much faster, adapts to changes continuously, and copes much better when the training and test datasets are large • Support Vector Machines were considered but not chosen, since they take longer to train than online logistic regression • Online logistic regression converges more quickly if malicious and benign URLs from the training dataset are presented in a random order

  18. Feature Vector • Feature hashing is used to encode the raw feature data into feature vectors • In this approach, a reasonable size (i.e. dimension) is picked for the feature vector and the data is hashed into feature vectors of the chosen size • After carefully considering the datasets, this research uses feature vectors in a 100,000-dimension space (see the hashing sketch below)
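
A minimal sketch of the hashing trick in plain Java, assuming Java's built-in String.hashCode() (the paper's actual hash function is not stated):

```java
// Map an arbitrary feature string to one of D fixed slots so every URL
// becomes a D-dimensional vector (D = 100,000 in the paper).
public final class FeatureHashing {
    static final int D = 100_000;

    // Stable slot index for a named feature.
    static int slot(String feature) {
        int h = feature.hashCode() % D;   // hash function choice is an assumption
        return h < 0 ? h + D : h;         // fold negatives into [0, D)
    }

    public static void main(String[] args) {
        double[] vector = new double[D];
        for (String f : new String[] {"token=depts", "bigram=deptsttu", "dots=4"}) {
            vector[slot(f)] += 1.0;       // colliding features simply add together
        }
        System.out.println(slot("bigram=deptsttu"));
    }
}
```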

  19. Feature Vector Example • The data is encoded into the feature vector as continuous, categorical, word-like, and text-like features using the Mahout API (an encoding sketch follows)
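
A sketch of how such mixed feature types can be encoded with Mahout's hashed feature encoders (org.apache.mahout.vectorizer.encoders); the feature names are illustrative, and the exact encoder set the paper used is an assumption:

```java
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

// Illustrative encoding of one URL's features into a hashed vector.
public final class EncodeUrl {
    public static void main(String[] args) {
        Vector v = new RandomAccessSparseVector(100_000);

        // Word-like / categorical features: each distinct value hashes to slots.
        FeatureVectorEncoder words = new StaticWordValueEncoder("token");
        words.addToVector("deptsttu", v);
        words.addToVector("ttuedu", v);

        // Continuous features encode a numeric value at a hashed location.
        FeatureVectorEncoder length = new ContinuousValueEncoder("urlLength");
        length.addToVector("38", v);   // the string form is parsed as a double

        System.out.println(v.getNumNondefaultElements());
    }
}
```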

  20. Results • [Charts: accuracy under 90/10 and 80/20 training/test dataset splits; not preserved in this transcript]

  21. Results • [Charts: 50/50 training/test dataset split and benign:malicious ratio; not preserved in this transcript]

  22. Other Approaches Attempted • Term frequency–inverse document frequency (TF-IDF) • A bag-of-words approach was used, and a term (lexical feature) by document (URL) matrix was created • Online logistic regression is not affected by good word weighting • Clustering • The URLs are viewed as a set of vectors in a vector space • Cosine similarity was used as the similarity measure between URLs (sketched below) • This research focused on classification over clustering since the target classes of the URLs were known; clustering is known to be useful when the target classes are unknown
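
For reference, a minimal sketch of cosine similarity between two URL feature vectors, using plain arrays in place of whatever vector type was actually used:

```java
// Cosine similarity: the angle-based measure the clustering attempt used.
public final class Cosine {
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];   // dot product
            na += a[i] * a[i];    // squared norm of a
            nb += b[i] * b[i];    // squared norm of b
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] u1 = {1, 0, 2, 0};
        double[] u2 = {1, 1, 2, 0};
        System.out.println(cosine(u1, u2)); // 1.0 would mean identical direction
    }
}
```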

  23. Future Work • Study the various features extensively and use only those with the highest contributions; also add new features that would help in better classification • Try algorithms that can benefit from parallelization

  24. Summary • A reliable framework for the classification of URLs is built • A supervised learning method is used to learn the characteristics of both malicious and benign URLs and classify them in real time • The applicability and usefulness of Mahout for the URL classification task is demonstrated, and the benefits of an online setting over a batch setting are illustrated: the online setting enabled learning new trends in URL characteristics over time

  25. Questions?
