Large-Scale Automatic Classification of Phishing Pages

Reporter: Li, Fong Ruei National Taiwan University of Science and Technology Large-Scale Automatic Classification of Phishing Pages Slide 1 (of 32)

Reference • Large-Scale Automatic Classification of Phishing Pages, Colin Whittaker, Brian Ryner, Marria Nazif, NDSS '10, 2010 Slide 2 (of 32)

Outline • Introduction • Phishing Classifier Infrastructure • Evaluation • Conclusion Slide 3 (of 32)

INTRODUCTION • Phishing is a form of identity theft that combines • social engineering techniques • sophisticated attack vectors • to harvest financial information from unsuspecting consumers. • Often a phisher tries to lure her victim into clicking a URL pointing to a rogue page. Slide 4 (of 32)

Phishing Classifier Infrastructure • Overall System Design • Our system classifies web pages submitted by end users and collected from Gmail’s spam filters. • The classifier's features describe • the composition of the web page’s URL • the hosting of the page • the page’s HTML content as collected by a crawler Slide 5 (of 32)

Phishing Classifier Infrastructure • Classification Workflow • The first process extracts features about the URL of the page. • The second process obtains domain information about the page and crawls it • The final process assigns the page a score based on the collected features representing the probability that the page is phishing Slide 6 (of 32)

Phishing Classifier Infrastructure • Candidate URL Collection • We receive new potential phishing URLs in reports • from users of our blacklist • from spam messages collected by Gmail Slide 7 (of 32)

Phishing Classifier Infrastructure • URL Feature Extraction • The first process in the workflow, the URL Feature Extractor, looks only at the URL of the page to determine features. • If the URL matches a whitelist of high-profile, safe sites, then the URL Feature Extractor drops the URL from the workflow entirely. • We manually compile this whitelist of 2778 sites. Slide 8 (of 32)
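
The whitelist check on this slide could be sketched as follows. This is a minimal illustration, not the system's actual code: the slide does not say how matching is done, so the suffix-matching rule, the function name `is_whitelisted`, and the two placeholder domains are all assumptions.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the manually compiled whitelist of 2778 sites.
WHITELIST = {"example.com", "example.org"}


def is_whitelisted(url, whitelist=WHITELIST):
    """Return True if the URL's host, or any parent domain of it, is whitelisted."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Check "www.example.com", then "example.com", then "com".
    return any(".".join(labels[i:]) in whitelist for i in range(len(labels)))


# A whitelisted URL would be dropped from the workflow before any further work.
print(is_whitelisted("http://www.example.com/login"))   # True
print(is_whitelisted("http://example.com.evil.test/"))  # False
```

Matching on domain suffixes rather than substrings means a host that merely starts with a whitelisted name is not treated as safe.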

Phishing Classifier Infrastructure • URL Feature Extraction • One feature this process extracts is whether the URL contains an IP address for its hostname. Slide 9 (of 32)
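
A rough sketch of the IP-address feature, using only the Python standard library. The function name `hostname_is_ip` is hypothetical, and this version only recognizes standard dotted-quad IPv4 and bracketed IPv6 literals; obfuscated forms such as hexadecimal or single-integer hosts would need extra handling.

```python
import ipaddress
from urllib.parse import urlsplit


def hostname_is_ip(url):
    """Boolean feature: is the URL's hostname an IP address literal?"""
    host = urlsplit(url).hostname  # brackets around IPv6 literals are stripped
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        # Not parseable as IPv4 or IPv6, so treat it as a DNS name.
        return False


print(hostname_is_ip("http://192.0.2.7/login.php"))  # True
print(hostname_is_ip("http://example.com/"))         # False
```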

Phishing Classifier Infrastructure • URL Feature Extraction • Another feature this process extracts is whether the page has many host components • Phishers commonly use a long hostname, prepending an authentic-sounding host to their fixed domain name, to confuse viewers into believing that the page is legitimate. Slide 10 (of 32)
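
The host-component feature could look roughly like this. The slide does not state what counts as "many", so the threshold of 5 is an assumed value for illustration only, and the example URL is made up.

```python
from urllib.parse import urlsplit


def host_component_count(url):
    """Count the dot-separated labels in the URL's hostname."""
    host = urlsplit(url).hostname or ""
    return len([label for label in host.split(".") if label])


def has_many_host_components(url, threshold=5):
    """Boolean feature; the threshold is an assumption, not from the source."""
    return host_component_count(url) >= threshold


# An authentic-sounding host prepended to the phisher's own domain.
url = "http://www.bank.example.com.secure-login.test/a"
print(host_component_count(url))      # 6
print(has_many_host_components(url))  # True
```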

Phishing Classifier Infrastructure • URL Feature Extraction • Phishers often include characteristic strings in their URLs to mislead viewers. • These can include the trademarks of the phishing target, like “abbeynational” in the example above, or more general phrases associated with phishing targets, like “login”. • The feature extractor transforms each of these tokens into a boolean feature, such as “The path contains the token ‘login.’” Slide 11 (of 32)
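
One way to turn URL tokens into boolean features is sketched below. The tokenization rule (splitting on non-alphanumeric characters), the feature-name format, and the example URL are assumptions; the slide only specifies that each token becomes a boolean feature such as "The path contains the token 'login.'"

```python
import re
from urllib.parse import urlsplit


def url_token_features(url):
    """Return the set of boolean token features that are true for this URL."""
    parts = urlsplit(url)
    features = set()
    for section, text in (("host", parts.hostname or ""), ("path", parts.path)):
        # Split on anything that is not a lowercase letter or digit.
        for token in re.split(r"[^a-z0-9]+", text.lower()):
            if token:
                features.add(f"{section} contains token '{token}'")
    return features


features = url_token_features("http://abbeynational.example.test/secure/login.php")
print("path contains token 'login'" in features)          # True
print("host contains token 'abbeynational'" in features)  # True
```

Representing features as a set of true statements keeps the vector sparse: any feature absent from the set is implicitly false.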

Phishing Classifier Infrastructure • Fetching Page Content • The URL Feature Extractor also collects URL metadata, including PageRank, from Google proprietary infrastructure. • We also use a domain reputation score computed by the Gmail anti-spam system as a feature. • This score is roughly the percentage of emails from a domain which are not spam. Slide 12 (of 32)
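
As a worked example of the reputation score described on this slide: the slide calls it "roughly" the percentage of non-spam email, so the exact formula below is an assumption, as are the function name and the email counts.

```python
def domain_reputation(total_emails, spam_emails):
    """Percentage of emails from a domain that are not spam (assumed formula)."""
    if total_emails <= 0:
        return None  # no mail observed, so no reputation can be computed
    return 100.0 * (total_emails - spam_emails) / total_emails


# 200 emails seen from a domain, 50 of them flagged as spam.
print(domain_reputation(200, 50))  # 75.0
```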

Phishing Classifier Infrastructure • Hosting and Page Feature Extraction • The Content Fetcher process crawls the page and gathers its hosting information. • It records the returned IPs, name servers, and name server IPs. • It also geolocates these IPs, recording the city, region, and country. Slide 13 (of 32)

Phishing Classifier Infrastructure • Hosting and Page Feature Extraction • The Content Fetcher sends the URL to a pool of headless web browsers to render the page content. • After the browser renders the page, the Content Fetcher receives and records the page HTML, as well as all iframe, image, and JavaScript content embedded in the page. Slide 14 (of 32)

Phishing Classifier Infrastructure • Page Classification • To compute the score for the page in log odds, the classifier combines these values using a logistic regression • The score translates to the computed probability that the page is phishing Slide 15 (of 32)
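
The relationship between the log-odds score and the probability can be sketched as below. The feature names, weights, and bias are invented for illustration and are not the trained model's values; only the logistic form comes from the slide.

```python
import math


def log_odds(features, weights, bias=0.0):
    """Linear score: bias plus the weighted sum of feature values."""
    return bias + sum(weights.get(name, 0.0) * value
                      for name, value in features.items())


def probability(score):
    """Logistic function: converts a log-odds score into a probability."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical weights and features, for illustration only.
weights = {"hostname_is_ip": 2.0, "high_domain_reputation": -0.5}
features = {"hostname_is_ip": 1.0, "high_domain_reputation": 1.0}

score = log_odds(features, weights, bias=-1.0)
print(score)             # 0.5
print(probability(0.0))  # 0.5, since log odds of zero mean even odds
```

A positive score corresponds to a probability above 0.5, and a negative score to one below it.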

Phishing Classifier Infrastructure • Page Classification • Before the classifier automatically blacklists the page, it checks to make sure that the page does not have a high PageRank Slide 16 (of 32)
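
The PageRank safeguard might be expressed as a simple guard in front of the blacklisting step. The slide gives no cutoffs, so both thresholds and the 0-to-1 PageRank scale below are assumptions.

```python
def should_blacklist(phishing_probability, pagerank,
                     probability_threshold=0.5, pagerank_limit=0.3):
    """Blacklist only pages that score as phishing and lack a high PageRank.

    Both thresholds are hypothetical values chosen for this sketch.
    """
    if phishing_probability < probability_threshold:
        return False
    # A high PageRank suggests an established page, so skip automatic listing.
    return pagerank < pagerank_limit


print(should_blacklist(0.97, pagerank=0.05))  # True
print(should_blacklist(0.97, pagerank=0.90))  # False
```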

Evaluation • Evaluation Dataset • First dataset • contains data collected between April 16, 2009 and July 14, 2009, with labels from July 15, 2009. • used to examine our selected features and train our evaluation models • Second dataset • collected during the first two weeks of August 2009, as a validation dataset. Slide 17 (of 32)

Evaluation Slide 18 (of 32)

Evaluation of Features Slide 19 (of 32)

Classifier Performance Slide 28 (of 32)

Conclusion • We describe our large-scale system for automatically classifying phishing pages, which maintains a false positive rate below 0.1%. • Our classification system examines millions of potential phishing pages daily in a fraction of the time of a manual review process. Slide 31 (of 32)

Q&A Slide 32 (of 32)
