PhishDef: Innovative Phishing Protection Model

PhishDef: URL Names Say It All
Michalis Faloutsos, University of California, Riverside, USA
Anh Le, Athina Markopoulou, University of California, Irvine, USA

What is Phishing?
• Social engineering and technical means used to steal consumers’ personal identity, data, etc.
• Causes billions of dollars in losses annually
Anh Le - UC Irvine - PhishDef

Antiphishing.org

Example of a Phishing Site

Current Protection
• Google Safe Browsing
• Microsoft SmartScreen
• Third-Party

Current Protection Model: Google Safe Browsing
• Motivation: blacklist-based protection is reactive and cannot protect against zero-day phishing

Outline
• Phishing Background
• Motivation
• Our proposal
• New Protection Model
• Learning Algorithms
• Dataset
• Feature Selection
• Evaluation Results
• Concluding Remarks

Our Proposed Protection Model
• Main challenges: accuracy and classification latency
• Which classification algorithm works best?
• Which set of features works best?

Prior Work
• Whittaker et al. [NDSS ’10]: Google Safe Browsing
• Ma et al. [SIGKDD ’09]: Batch-based Classification
• Ma et al. [ICML ’09]: Batch-based vs. Online Learning
• Server-Side Classification

Main Contributions
• New Protection Model: client-side classification
• Propose using Adaptive Regularization of Weights (AROW): high accuracy, resilient to noise
• Set of Lexical Features: fast to extract at client side, obfuscation resistant

Machine Learning Algorithms
• Batch-based Support Vector Machine
• Online Perceptron
• Confidence-Weighted (CW) [Dredze et al., ICML 2008]
• Adaptive Regularization of Weights (AROW) [Crammer et al., NIPS 2009]

Online Classification
• Maintain a weight vector and use it for classification
• Online Perceptron
[Diagram labels: Client Side; Trained Beforehand; Extract In Real Time; Server Side]
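A minimal sketch of the weight-vector idea on this slide, using the classic online perceptron. The feature names, the toy training pairs, and the helper names (`score`, `predict`, `perceptron_update`) are illustrative assumptions, not taken from PhishDef itself; labels are +1 for phishing and -1 for benign.

```python
from typing import Dict, List, Tuple

Features = Dict[str, float]


def score(weights: Features, features: Features) -> float:
    """Dot product between the weight vector and a sparse feature vector."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def predict(weights: Features, features: Features) -> int:
    """Classify as +1 (phishing) if the score is positive, else -1 (benign)."""
    return 1 if score(weights, features) > 0 else -1


def perceptron_update(weights: Features, features: Features, label: int,
                      rate: float = 1.0) -> None:
    """Adjust the weights only when the current prediction is wrong."""
    if predict(weights, features) != label:
        for name, value in features.items():
            weights[name] = weights.get(name, 0.0) + rate * label * value


# Toy training data (hypothetical feature values, for illustration only).
training: List[Tuple[Features, int]] = [
    ({"token:paypal": 1.0, "dots": 4.0}, 1),
    ({"token:news": 1.0, "dots": 1.0}, -1),
]

weights: Features = {}
for _ in range(5):  # a few passes over the toy data
    for features, label in training:
        perceptron_update(weights, features, label)
```

Classification is a single sparse dot product, which is the property that makes an online linear model attractive when latency matters.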

Online Classification
• Confidence-Weighted (CW): minimum change; enough to correct last mistake
• Adaptive Regularization of Weights (AROW): minimum change; increasing confidence; penalty for mistake
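A rough sketch of one AROW update step, following the closed-form update published by Crammer et al. as best understood here. It keeps a mean weight vector `mu` and a covariance matrix `sigma`; the dense-list representation, the function name `arow_update`, and the default `r = 1.0` are assumptions made for illustration, not details from the slides.

```python
from typing import List


def arow_update(mu: List[float], sigma: List[List[float]],
                x: List[float], y: int, r: float = 1.0) -> None:
    """Apply one AROW update in place for example x with label y in {-1, +1}."""
    n = len(x)
    # sigma_x = Sigma * x (Sigma is symmetric, so Sigma x x^T Sigma = sigma_x sigma_x^T)
    sigma_x = [sum(sigma[i][j] * x[j] for j in range(n)) for i in range(n)]
    margin = sum(mu[i] * x[i] for i in range(n))          # mu . x
    confidence = sum(x[i] * sigma_x[i] for i in range(n))  # x^T Sigma x

    # Update only when the squared-hinge loss is non-zero.
    if margin * y < 1.0:
        beta = 1.0 / (confidence + r)
        alpha = max(0.0, 1.0 - y * margin) * beta
        # Move the mean toward correcting the mistake.
        for i in range(n):
            mu[i] += alpha * y * sigma_x[i]
        # Shrink the covariance along x, i.e. become more confident.
        for i in range(n):
            for j in range(n):
                sigma[i][j] -= beta * sigma_x[i] * sigma_x[j]


# Usage on a two-feature toy example (hypothetical values).
mu = [0.0, 0.0]
sigma = [[1.0, 0.0], [0.0, 1.0]]
arow_update(mu, sigma, x=[1.0, 0.0], y=1)
```

Because the step size is damped by `r` instead of forcing the last example to be classified correctly, a single mislabeled example moves the model less than an aggressive update would.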

Dataset
• Phishing URLs: PhishTank (4,082), MalwarePatrol (2,001)
• Benign URLs: Open Directory (4,012), Yahoo Directory (4,143)
• Time period: June 2010

Feature Selection
• Lexical Features
• External Features: country, AS number, registration date, registrant, registrar, etc.
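To illustrate what "lexical" means here, the sketch below derives features from the URL string alone, with no network lookups. The specific features (URL length, dot count in the hostname, path depth, hostname and path tokens) and the function name `lexical_features` are assumptions chosen for illustration; the slides do not list the exact feature set.

```python
import re
from typing import Dict
from urllib.parse import urlsplit


def lexical_features(url: str) -> Dict[str, float]:
    """Build a sparse feature dictionary from the URL text only."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    features: Dict[str, float] = {
        "url_length": float(len(url)),
        "host_dots": float(host.count(".")),
        "path_depth": float(parts.path.count("/")),
    }
    # Bag-of-words tokens, kept separate for hostname and path.
    for token in re.split(r"[^a-z0-9]+", host.lower()):
        if token:
            features["host:" + token] = 1.0
    for token in re.split(r"[^a-z0-9]+", parts.path.lower()):
        if token:
            features["path:" + token] = 1.0
    return features


example = lexical_features(
    "http://pilety.ru/c548c205d7660ed0628b467d7d5aa54c9c3a7124/image/taxrefund.htm"
)
```

External features such as registration date or AS number would need a WHOIS or DNS query, which is the remote dependency that lexical features avoid.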


Evaluation Results: Lexical vs. Full Features
• (+) ~1%
• (-) Dependency on remote server
• (-) Avg. latency: 1.64 s
Lexical features alone are better suited than full features for client-side phishing classification

Evaluation Results: CW vs. AROW
AROW is more resilient to noise than CW

Conclusion: PhishDef
• Client-side phishing classification system
• Proactive, on-the-fly classification of zero-day phishing URLs
• Low delay at the client side (ms), high accuracy (97%)
• Resilient to noisy data
• Future Work: develop an add-on for Firefox

Questions


Example of a Phishing Site
http://pilety.ru/c548c205d7660ed0628b467d7d5aa54c9c3a7124/image/taxrefund.htm
http://www.hmrc.gov.uk/intro-income-tax.htm

Evaluation Results: Batch-Based vs. Online Learning
Online learning outperforms batch-based learning for phishing classification

Chrome 11 > Firefox 4
