
Design and Evaluation of a Real-Time URL Spam Filtering Service



Presentation Transcript


  1. Design and Evaluation of a Real-Time URL Spam Filtering Service Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, and Dawn Song IEEE Symposium on Security and Privacy 2011

  2. OUTLINE • Introduction - Monarch • Related Work • System Design • Implementation • Evaluation • Discussion and Conclusion

  3. Spam URL • Advertisement • Harmful content • Phishing, malware, and scams • Use of compromised and fraudulent accounts • Email, web services

  4. Monarch • Spam URL Filtering as a Service • Tens of millions of features

  5. Related Work • “Detecting spammers on Twitter” (2010) • Post frequency, URLs, friends… • “Behind phishing: an examination of phisher modi operandi” (2008) • Lexical characteristics of phishing URLs • “Cantina: a content-based approach to detecting phishing web sites” (2007) • Parse HTML content

  6. System Design Monarch’s cloud infrastructure • URL Aggregation • Email providers and Twitter’s streaming API • Feature Collection • Visits each URL with a web browser to collect page content

  7. System Design (cont.) Monarch’s cloud infrastructure • Feature Extraction • Transforms the raw data into a sparse feature vector • Classification • Training and testing via distributed logistic regression

  8. Collect Raw Features – Web Browser “A taxonomy of JavaScript redirection spam” (2007) • A lightweight browser is not enough • Poor HTML parsing, lack of JavaScript and plugins • Instrumented version of Firefox • JavaScript enabled • Flash and Java installed • Visits a URL and monitors a number of details

  9. Raw Features • Web Browser • Initial URL and Landing URL, Redirects, Sources and Frames • HTML Content, Page Links • JavaScript Events, Pop-up Windows, Plugins • HTTP Headers • DNS Resolver • Initial, final, and redirect URLs • IP Address Analysis • City, country, ASN • Proxy and Whitelist (200 domains)

  10. Feature Vector • Raw features => sparse feature vector • Canonicalize URLs • Remove obfuscation • Tokenize the text corpus • Split on non-alphanumeric characters • Example: http://adl.tw/~dada/dada2.php?a=1&b=3 => domain features [adl, tw], path features [dada, dada2, php], query parameter features [a, 1, b, 3] => sorted token list (1, 3, a, adl, b, dada, dada2, php, tw) => sparse boolean vector (…, adl:true, adm:false, …, dada:true, …, tw:true, …) over 49,960,691 total features (dimensions)
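The URL tokenization described above can be sketched in Python (a minimal illustration, not Monarch's actual code; the helper name `tokenize_url` is made up):

```python
import re
from urllib.parse import urlparse, parse_qsl

def tokenize_url(url):
    """Split a URL's domain, path, and query string into tokens by
    breaking on non-alphanumeric characters (hypothetical helper)."""
    parsed = urlparse(url)

    def split(text):
        return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

    domain_tokens = split(parsed.netloc)
    path_tokens = split(parsed.path)
    query_tokens = [tok for k, v in parse_qsl(parsed.query) for tok in (k, v)]
    # Deduplicate and sort, yielding the token list used as feature names
    return sorted(set(domain_tokens + path_tokens + query_tokens))

print(tokenize_url("http://adl.tw/~dada/dada2.php?a=1&b=3"))
# -> ['1', '3', 'a', 'adl', 'b', 'dada', 'dada2', 'php', 'tw']
```

Each resulting token then becomes one boolean dimension of the sparse feature vector.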

  11. Distributed Classifier Design • Linear classification • x: feature vector • Determine a weight vector w • A parallel online learner • With L1 regularization to yield a sparse weight vector • Labeled data (xi, yi), yi ∈ {−1, +1} • Testing: sign(w · x) => −1 => non-spam site, +1 => spam site

  12. Training the weight vector • Logistic regression • With subgradient L1 regularization: f(w) = Σi log(1 + exp(−yi (xi · w))) + λ‖w‖1 • Larger classification margin yi (xi · w) => smaller f(w) (the example lies further on the correct side of the separating hyperplane)
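The training rule on this slide can be sketched as a toy, single-machine online learner (an illustration under assumptions: the learning rate, regularization strength, function names, and dense weight list are invented for the example; the real system runs a parallel, sharded version):

```python
import math
import random

def train_lr_l1(data, dim, epochs=20, eta=0.1, lam=0.01):
    """Online logistic regression with a subgradient step for the L1 penalty.
    data: list of (x, y) pairs, where x is a sparse dict {feature_index: value}
    and y is a label in {-1, +1}."""
    w = [0.0] * dim
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            margin = y * sum(w[i] * v for i, v in x.items())
            # Gradient of log(1 + exp(-margin)) with respect to the margin
            g = -y / (1.0 + math.exp(margin))
            for i, v in x.items():
                # L1 subgradient is the sign of the weight (0 at zero)
                sub = 1.0 if w[i] > 0 else -1.0 if w[i] < 0 else 0.0
                w[i] -= eta * (g * v + lam * sub)
    return w

def predict(w, x):
    """sign(w . x): +1 => spam site, -1 => non-spam site."""
    return 1 if sum(w[i] * v for i, v in x.items()) > 0 else -1
```

The L1 term drives most weights to zero, which is what yields the sparse weight vector mentioned on the previous slide.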

  13. Distributed Classifier Algorithm

  14. Data Set and assumptions • 1.25 million spam email URLs • 567,784 spam Twitter URLs • 9 million non-spam Twitter URLs • All Twitter URLs checked against: • Google Safebrowsing, SURBL, URIBL, APWG, Phishtank • A URL is labeled spam if it or any of its source URLs becomes blacklisted

  15. Data Set and assumptions (cont.) • On Twitter: • 36% scams, 60% phishing, 4% malware

  16. After regularization

  17. Implementation • Amazon Web Services (AWS) infrastructure • URL Aggregation • A queue that keeps 300,000 URLs • Feature Collection • 20 × 6 Firefox (4.0b4) instances on Ubuntu 10.04 • With a custom extension • Firefox’s NPAPI, Linux’s “host” command, MaxMind GeoIP library, and Route Views • Classifier • Hadoop Distributed File System • On a 50-node cluster

  18. Evaluation – Overall Accuracy • 5-fold cross-validation • 500,000 examples each of spam and non-spam • Training set size: 400,000 examples • spam:non-spam ratios of 1:1, 4:1, 10:1 • Testing set size: 200,000 examples • ratio 1:1
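The 5-fold protocol above can be sketched as an index-splitting helper (illustrative only; `kfold_indices` is not from the paper):

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation:
    every example lands in the test fold of exactly one split."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices round-robin into k folds
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test
```

Each of the 5 splits trains on 4 folds and tests on the held-out fold, so every labeled URL is used for testing exactly once.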

  19. Evaluation – Single Feature

  20. Evaluation – Accuracy Over Time Training once only <-> Retraining every four days

  21. Evaluation – Comparing Email and Tweet Spam • Log odds ratio: ln[(p1 (1 − p2)) / (p2 (1 − p1))], comparing a feature’s occurrence probabilities p1 and p2 in the two corpora
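Assuming the log odds ratio here compares a feature's occurrence probabilities p1 and p2 in the email and tweet spam corpora, the statistic can be computed as (a sketch; the function name is made up):

```python
import math

def log_odds_ratio(p1, p2):
    """ln[(p1 / (1 - p1)) / (p2 / (1 - p2))]: near 0 means the feature is
    about equally prevalent in both corpora; a large magnitude means it
    is strongly specific to one of them."""
    return math.log((p1 / (1 - p1)) / (p2 / (1 - p2)))

print(log_odds_ratio(0.5, 0.5))  # -> 0.0
```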

  22. Evaluation – The Cost • For Twitter, $22,751 per month

  23. Discussion and Conclusion • Evasion • Feature Evasion • Time-based Evasion • Crawler Evasion • Monarch • Real-time system • Spam URL Filtering as a Service • $22,751 a month
