Populated IP Addresses — Classification and Applications

Chi-Yao Hong, UIUC
Fang Yu, MSR Silicon Valley
Yinglian Xie, MSR Silicon Valley
ACM CCS (October 2012)

Outline
• Introduction
• System Design
• Implementation
• Evaluation
• Application
A Seminar at Advanced Defense Lab

Introduction
• While online services have become everyday essentials for billions of users, they are also heavily abused by attackers.
  • e.g., web-based email
• Online service providers often rely on IP addresses to perform blacklisting and service throttling.
• IP addresses that are associated with a large number of user requests must be treated differently.

Populated IP Addresses
• We define IP addresses that are associated with a large number of user requests as Populated IP (PIP) addresses.
• PIPs are not equivalent to the traditional concept of proxies, NATs, gateways, or other middleboxes.

Goal
• In this paper, we introduce PIPMiner, a fully automated method to extract and classify PIPs.

System Design
• We take a data-driven approach using service logs that are readily available to all service providers.
• We train a non-linear support vector machine (SVM) classifier that is highly tolerant of noise in the input data.

System Flow
• PIP Selection
  • Phase 1: select IP addresses with rL requests, where rL = 1,000.
  • Phase 2: the IP address has been used by at least uM accounts, together accounting for at least rM requests, where uM = 10 and rM = 300.
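The two selection phases can be sketched as a simple filter over a request log. This is a minimal illustration, not PIPMiner's actual code: it assumes the phases are applied in sequence, that both thresholds are inclusive, and that "together accounting for at least rM requests" means the summed requests of the accounts seen on that IP. The function name `select_pips` and the `(ip, account)` log format are hypothetical.

```python
from collections import defaultdict

# Thresholds taken from the slide.
R_L = 1000  # Phase 1: minimum requests per IP (assumed inclusive)
U_M = 10    # Phase 2: minimum distinct accounts per IP
R_M = 300   # Phase 2: minimum requests made by those accounts


def select_pips(requests):
    """Return the sorted IPs that pass both selection phases.

    `requests` is an iterable of (ip, account) pairs; `account` may be
    None for a request that is not tied to any account (an assumption).
    """
    total = defaultdict(int)                             # ip -> all requests
    per_account = defaultdict(lambda: defaultdict(int))  # ip -> account -> requests
    for ip, account in requests:
        total[ip] += 1
        if account is not None:
            per_account[ip][account] += 1

    selected = []
    for ip, count in total.items():
        if count < R_L:                  # Phase 1: too few requests
            continue
        accounts = per_account[ip]
        account_requests = sum(accounts.values())
        if len(accounts) >= U_M and account_requests >= R_M:  # Phase 2
            selected.append(ip)
    return sorted(selected)
```

An IP with 1,000 requests from a single account passes Phase 1 but fails Phase 2, which is the case the second phase appears designed to exclude.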

Features
• Population Features capture aggregated user characteristics.
• Time Series Features model the detailed request patterns.
• IP Block Level Features aggregate IP block level activities and help recognize proxy farms.

Population Features

Time Series Features

IP Block Level Features
• Large proxy farms often redirect traffic to different outgoing network interfaces for load-balancing purposes.
• Determining neighboring IP addresses:
  • Neighboring IPs must be announced by the same AS.
  • Neighboring IPs are contiguous in the IP address space, and each neighboring IP is itself a PIP.
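The two neighbor rules can be illustrated by grouping PIPs into runs of consecutive addresses that share an AS. The sketch below is one possible reading of the slide, restricted to IPv4; the function name `neighbor_blocks` and the `asn_of` lookup (a callable mapping an IP string to an AS number) are hypothetical stand-ins for whatever routing data is available.

```python
import ipaddress


def neighbor_blocks(pips, asn_of):
    """Group PIPs into blocks of neighboring addresses.

    Two PIPs land in the same block when their addresses are consecutive
    and `asn_of` reports the same AS for both. Only addresses present in
    `pips` are considered, so every block member is itself a PIP.
    """
    ordered = sorted({ipaddress.IPv4Address(ip) for ip in pips})
    blocks = []
    for addr in ordered:
        if blocks:
            last = blocks[-1][-1]
            consecutive = int(addr) - int(last) == 1
            same_as = asn_of(str(addr)) == asn_of(str(last))
            if consecutive and same_as:
                blocks[-1].append(addr)   # extend the current block
                continue
        blocks.append([addr])             # start a new block
    return [[str(a) for a in block] for block in blocks]
```

Block-level features could then be computed by aggregating the per-IP activity of each returned block.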

Example: Block Level Time Series

Training and Classification
• Non-linear SVM

Kernel Function k(x_i, x)
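The slide's own formula did not survive in this transcript. For orientation, the textbook decision function of a kernel SVM is shown below. The symbols α_i (learned weights), y_i (training labels), b (bias) and n (number of training points) are standard notation, not taken from the slides, and the specific kernel used by PIPMiner is not stated here.

```latex
f(x) = \sum_{i=1}^{n} \alpha_i \, y_i \, k(x_i, x) + b,
\qquad
\hat{y} = \operatorname{sign}\bigl(f(x)\bigr)
```

Only training points with non-zero α_i (the support vectors) contribute to the sum.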

Implementation
• Data Parsing and Feature Extraction (Stage 1)
  • We implement PIPMiner on top of DryadLINQ, a distributed programming model for large-scale computing.
  • Using a 240-machine cluster
• Training and Testing (Stage 2)
  • Quad-core CPU with 8 GB RAM
  • LIBSVM and LIBLINEAR toolkits

Evaluation
• We apply PIPMiner to a month-long Hotmail login log from August 2010 and identify 1.7 million PIP addresses (200 MB).
  • 0.5% of the observed IP addresses
  • The source of more than 20.1% of the total requests
  • Associated with 13.7% of the total accounts in our dataset
• At Stage 1, PIPMiner processes a 296 GB dataset in only 1.5 hours.
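As a rough back-of-the-envelope check on these figures, the snippet below derives the Stage 1 throughput and the implied number of observed IP addresses. It assumes the load is spread evenly over all 240 machines and that 0.5% is exact. Neither assumption is stated on the slides, so the results are order-of-magnitude estimates only.

```python
# Figures quoted on the slides.
DATASET_GB = 296
STAGE1_HOURS = 1.5
MACHINES = 240
PIP_COUNT = 1_700_000
PIP_FRACTION = 0.005  # 0.5% of observed IP addresses


def stage1_throughput():
    """Return (cluster GB/hour, per-machine GB/hour), assuming an even spread."""
    cluster = DATASET_GB / STAGE1_HOURS
    return cluster, cluster / MACHINES


def implied_observed_ips():
    """Estimate the total observed IPs if the PIPs are exactly 0.5% of them."""
    return round(PIP_COUNT / PIP_FRACTION)


cluster_rate, machine_rate = stage1_throughput()
print(f"cluster: {cluster_rate:.1f} GB/h, per machine: {machine_rate:.2f} GB/h")
print(f"implied observed IPs: {implied_observed_ips():,}")
```

Under these assumptions the cluster handles roughly 197 GB per hour, and the log covers on the order of 340 million IP addresses.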

PIP Score Distribution

PIP Address Distribution (figure; regions labeled "Dynamic IP")

Accuracy Evaluation
• Among the 1.7 million PIP addresses, 973K can be labeled based on the account reputation data.

Accuracy of Individual Components

Accuracy against Data Length

Validation of Unlabeled Cases
• Future reputation
  • The reputation score from July 2011 (11 months later)

Application
• Windows Live ID sign-up abuse problem
  • We focus on the sign-ups related to Hotmail and use the Hotmail reputation trace from July 2011 (11 months later) to determine whether a particular sign-up account was malicious.
• We study the sign-up behavior on two types of PIP addresses.
  • The first is the 1.7 million derived PIPs.
  • The second is the set of IP addresses that have more than 20 sign-ups from the Windows Live ID system but are not included in the 1.7 million PIPs.

Using PIPs to Predict User Reputation
• Precision = 97%
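For reference, precision is the fraction of flagged items that turn out to be truly positive. The sketch below computes it from (flagged, malicious) pairs, assuming the positive class is "malicious sign-up". The function name and the input format are hypothetical, and the counts in the usage line are made up to illustrate the formula, not taken from the paper.

```python
def precision(predictions):
    """Return TP / (TP + FP) for an iterable of (flagged, malicious) booleans."""
    pairs = list(predictions)  # materialize so generators can be scanned twice
    true_pos = sum(1 for flagged, malicious in pairs if flagged and malicious)
    false_pos = sum(1 for flagged, malicious in pairs if flagged and not malicious)
    if true_pos + false_pos == 0:
        raise ValueError("no positive predictions, precision is undefined")
    return true_pos / (true_pos + false_pos)


# Illustrative counts only: 97 correct flags and 3 incorrect flags.
sample = [(True, True)] * 97 + [(True, False)] * 3 + [(False, False)] * 50
print(f"precision = {precision(sample):.2f}")
```

Unflagged items do not affect precision; they matter only for recall, which the slide does not report.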

Q & A
Thank you for listening
