
Intelligent Detection of Malicious Script Code


Presentation Transcript


  1. Intelligent Detection of Malicious Script Code
     CS194, 2007-08
     Benson Luk, Eyal Reuveni, Kamron Farrokh
     Advisor: Adnan Darwiche

  2. Introduction
     - A three-quarter project sponsored by Symantec
     - Main focuses:
       - Web programming
       - Database development
       - Data mining
       - Artificial intelligence

  3. Overview
     - Current security software catches known malicious attacks by matching them against a list of signatures
     - The problem:
       - New attacks are created every day
       - Developers must write new signatures for each of these attacks
       - Until those signatures exist, users remain vulnerable to the new attacks
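  To make that limitation concrete, here is a minimal sketch of signature-based scanning. It is not taken from the slides; the byte patterns and function name are purely illustrative, not real signatures.

  # Illustrative-only sketch of signature-based detection; the patterns and
  # function name below are hypothetical stand-ins for real signatures.
  SIGNATURES = [
      b"eval(unescape(",
      b"document.write(String.fromCharCode(",
  ]

  def matches_known_signature(script_bytes: bytes) -> bool:
      """Return True only if the script contains an already-known pattern."""
      return any(sig in script_bytes for sig in SIGNATURES)

  # A brand-new attack matches nothing in SIGNATURES and is not flagged,
  # which is exactly the window of vulnerability described above.
  print(matches_known_signature(b"someNewObfuscatedAttack();"))   # False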

  4. Overview (cont.)
     - Our objective is to build a system that can effectively detect malicious activity without relying on signature lists
     - The goal of our research is to see whether, and how, artificial intelligence can discern malicious code from non-malicious code

  5. Data Gathering
     - Gather data using a web crawler (probably a modified crawler based on the Heritrix software)
     - The crawler scours a list of known "safe" websites
     - It will also branch out into websites linked from these sites for additional data, if necessary
     - While crawling, we will gather key information on the scripts (function calls, parameter values, return values, etc.)
     - This will be done in Internet Explorer
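  As a rough illustration of this crawl-and-extract loop, here is a minimal Python sketch using only the standard library. The real project intends to use a Heritrix-based crawler and to instrument script execution inside Internet Explorer, neither of which is shown; the seed URL and regular expressions are placeholders.

  # Minimal sketch of the crawl-and-extract loop described above.
  import re
  import urllib.request
  from collections import deque

  WHITELIST = ["https://example.com/"]                    # known "safe" starting points (placeholder)
  CALL_PATTERN = re.compile(r"([A-Za-z_$][\w$]*)\s*\(")   # crude matcher for function-call names
  LINK_PATTERN = re.compile(r'href="(https?://[^"]+)"')   # links to branch out to

  def crawl(seeds, max_pages=100):
      seen, frontier, visited = set(seeds), deque(seeds), 0
      while frontier and visited < max_pages:
          url = frontier.popleft()
          visited += 1
          try:
              html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
          except OSError:
              continue
          calls = CALL_PATTERN.findall(html)              # names of functions invoked in inline scripts
          yield url, calls
          for link in LINK_PATTERN.findall(html):         # branch out into linked sites
              if link not in seen:
                  seen.add(link)
                  frontier.append(link)

  for url, calls in crawl(WHITELIST):
      print(url, calls[:10])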

  6. Data Storage
     - As data is gathered it will need to be stored for the analysis that takes place later
     - We need to develop a database that can efficiently store the script activity of tens of thousands (possibly millions) of websites
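  One possible shape for that database, sketched with SQLite purely for illustration (the actual engine and schema are open design questions): each observed call gets an explicit sequence number so that call order, parameter values, and return values are all preserved.

  # Sketch of a storage layout for the gathered script activity (SQLite for illustration only).
  import sqlite3

  conn = sqlite3.connect("script_activity.db")
  conn.executescript("""
  CREATE TABLE IF NOT EXISTS sites (
      site_id  INTEGER PRIMARY KEY,
      url      TEXT UNIQUE NOT NULL
  );
  CREATE TABLE IF NOT EXISTS script_calls (
      site_id   INTEGER NOT NULL REFERENCES sites(site_id),
      seq       INTEGER NOT NULL,        -- position in the call sequence
      function  TEXT NOT NULL,           -- e.g. "document.write"
      arguments TEXT,                    -- serialized parameter values
      retval    TEXT,                    -- serialized return value
      PRIMARY KEY (site_id, seq)
  );
  """)

  def record_visit(url, calls):
      """Store one site's ordered list of (function, arguments, retval) tuples."""
      conn.execute("INSERT OR IGNORE INTO sites(url) VALUES (?)", (url,))
      site_id = conn.execute("SELECT site_id FROM sites WHERE url = ?", (url,)).fetchone()[0]
      conn.executemany(
          "INSERT INTO script_calls VALUES (?, ?, ?, ?, ?)",
          [(site_id, i, fn, args, ret) for i, (fn, args, ret) in enumerate(calls)],
      )
      conn.commit()

  The crawler could call record_visit once per page, handing over the calls in the order they were observed, so the sequence information needed later for analysis is never thrown away.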

  7. Data Analysis
     - Using the information in the database, deduce what normal behavior looks like
     - Find a robust algorithm for generating a heuristic for acceptable behavior
     - The goal is to later weigh scripts against this heuristic to identify abnormal (and thus potentially malicious) behavior
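  One candidate for such a heuristic, sketched here only as a baseline and not as the project's chosen algorithm: treat the function-call bigrams observed on whitelisted sites as "normal" and score a script by the fraction of its bigrams that were never seen during the crawl. The call names below are illustrative.

  # Baseline sketch of a normality heuristic over function-call bigrams.
  from collections import Counter

  def bigrams(calls):
      return list(zip(calls, calls[1:]))

  def train(call_sequences):
      """Count call bigrams across the whitelist crawl."""
      model = Counter()
      for seq in call_sequences:
          model.update(bigrams(seq))
      return model

  def anomaly_score(model, calls):
      """Fraction of the script's bigrams never seen in normal traffic (0 = normal, 1 = all unseen)."""
      bgs = bigrams(calls)
      if not bgs:
          return 0.0
      return sum(1 for bg in bgs if bg not in model) / len(bgs)

  model = train([["getElementById", "setAttribute", "appendChild"]])
  print(anomaly_score(model, ["eval", "unescape", "document.write"]))   # 1.0 - no familiar bigrams

  Choosing the threshold on this score, and a representation richer than bigrams, is exactly where the false-positive/true-positive trade-off discussed on the next slide comes in.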

  8. Challenges
     - Gathering
       - How do we grab the relevant information from scripts?
       - How deep do we search? Good websites may inadvertently link to malicious ones, and the traversal graph is effectively unbounded
     - Storage
       - In what form should the data be stored?
       - We need an efficient way to store the data without oversimplifying it; for example, a simple laundry list of function calls does not take call sequence into account (illustrated below)
     - Analysis
       - What analysis algorithm can handle this volume of data?
       - How can we ensure that the normality heuristic it generates minimizes false positives and maximizes true positives?
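  A tiny, hypothetical illustration of the storage concern: two scripts invoke the same functions, so a bag of call names cannot separate them, while an order-preserving representation can.

  # Hypothetical example: same multiset of calls, different order.
  from collections import Counter

  benign     = ["getCookie", "encodeURIComponent", "setAttribute"]
  suspicious = ["encodeURIComponent", "setAttribute", "getCookie"]   # reordered, illustrative only

  print(Counter(benign) == Counter(suspicious))    # True  - a "laundry list" sees no difference
  print(benign == suspicious)                      # False - the call sequences differ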

  9. Milestones
     - Phase I (Setup): Set up equipment for research and ensure the whitelist is clean
     - Phase II (Crawler): Modify the crawler to grab and output the necessary data so that it can later be stored, and begin crawling for sample information
     - Phase III (Database): Research and develop an effective structure for storing the data, and link it to the web crawler
     - Phase IV (Analysis): Research and develop an effective algorithm for learning from massive amounts of data
     - Phase V (Verification): Using the web crawler, visit a large volume of websites to verify that the heuristic generated in Phase IV is accurate
     - Certain milestones may need to be revisited depending on the results of each phase
