Internet Traffic Classification: On the Discriminative Power of Traffic Flow Features

AAF Workshop, Cairo, Egypt, 2009.5.15 (Fri). Hyun-chul Kim, hkim@mmlab.snu.ac.kr. Joint work with Yeon-sup Lim and Jiwoong Jeong, Seoul National Univ.

Why Internet Traffic Classification? MOTIVATION

Big struggles over the Internet • File sharing tussle • File sharing community vs. intellectual property representatives (e.g., RIAA and MPAA). • Cyber-security battle • Malicious hackers vs. security management companies. • Network neutrality debate • ISPs vs. content/service providers.

All the tussles boil down to... Internet (application) traffic classification (broadly speaking, Internet system measurement). [Figure: the emergence of Napster traffic in 1999-2000 [Plonka 01]]

Internet Traffic Classification: Approaches so far • Port number-based. • Payload-based. • Host-behavior-based. • Flow-features-based. O(100) papers over the last 5-6 years.
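As a rough illustration of the port number-based approach listed above, the sketch below maps a flow's ports to an application label. `WELL_KNOWN_PORTS` and `classify_by_port` are hypothetical names made up for this example, and the handful of entries use standard IANA port assignments; real tools rely on much larger rule sets.

```python
# Minimal sketch of port number-based classification (illustrative only).
# WELL_KNOWN_PORTS and classify_by_port are hypothetical names,
# not the interface of any tool mentioned in the slides.
WELL_KNOWN_PORTS = {
    ("tcp", 22): "ssh",
    ("tcp", 25): "mail",   # SMTP
    ("tcp", 53): "dns",
    ("udp", 53): "dns",
    ("tcp", 80): "web",    # HTTP
    ("tcp", 443): "web",   # HTTPS
}


def classify_by_port(protocol, src_port, dst_port):
    """Return an application label if either port is known, else 'unknown'."""
    proto = protocol.lower()
    # Check the destination port first, then the source port, so that
    # both directions of a client/server exchange get the same label.
    for port in (dst_port, src_port):
        label = WELL_KNOWN_PORTS.get((proto, port))
        if label is not None:
            return label
    return "unknown"
```

A flow between two ephemeral ports falls through to `"unknown"`, which is the typical failure mode of this approach for applications that do not use fixed ports.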

What problem have we addressed? PROBLEM DEFINITION

Key Questions • The best traffic classification approach? • Under what conditions? (backbone? edge? Bandwidth? …) • For what applications? (p2p? Web? Games? Streaming? …) • Why? • What are marginal contributions and limitations of each approach?

State of the Art Answer (2007) “Who knows ?????” :-x “One of the foremost challenges of traffic classification currently is effectively comparing between the many proposed approaches.” [Erman 07]

Why? [Erman 07, Moore 07] The blind men and the elephant • Every traffic classification approach/technique • is evaluated using different (local) traces, often without payload. • tracks different features and tunes different parameters, even with • different definitions of traffic unit and/or application category.

METHODOLOGIES

Shared code, data, & expertise

We conducted • Comprehensive evaluation of • Port-based approach. • Host-behavior-based approach. • Flow-features-based approach. • Using 7 traces with payload • 3 backbone and 4 edge traces. • From Japan, Korea, Trans-pacific, and US.

Datasets

Tools evaluated • CoralReef: port number-based classification • Version 3.8 (or later). • BLINC: host behavior-based classification • WEKA: a collection of machine learning algorithms • 7 most often used / well-known algorithms. • Flow attribute selection. • Training set size vs. performance (accuracy / F-measure).
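The two performance metrics named above can be sketched as follows, assuming their standard definitions (overall accuracy as the fraction of correctly labeled flows, and per-application F-measure as the harmonic mean of precision and recall). The function names are hypothetical and the slides do not state whether flows or bytes are counted, so this example simply counts instances.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of instances whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def f_measure(true_labels, predicted_labels, app):
    """Harmonic mean of precision and recall for one application class."""
    tp = fp = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if p == app and t == app:
            tp += 1          # correctly labeled as app
        elif p == app:
            fp += 1          # labeled as app, but is something else
        elif t == app:
            fn += 1          # is app, but labeled as something else
    if tp == 0:
        return 0.0           # no true positives: precision and recall are 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Reporting F-measure per application alongside overall accuracy helps expose classes that a classifier handles poorly even when the dominant classes keep the overall figure high.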

Machine learning algorithms
Supervised machine learning algorithms, by category (Bayesian, Decision Trees, Rules, Functions, Lazy):
• Bayesian: Naïve Bayesian [Moore 05, Williams 05]; Naïve Bayes Kernel Estimation [Moore 05, Williams 05]; Bayesian Network [Williams 06A, Williams 06B]
• Decision Trees: C4.5 [Williams 06B]
• Functions: Support Vector Machine [Williams 06B, Li 07, Bennett 00]; Neural Net. [Auld 07, Nogueira 06]
• Lazy: k-Nearest Neighbors [Roughan 04]

RESULTS and LESSONS LEARNED (in brief)

Key Lessons Learned (~2008) • Port numbers as key features • Still useful in identifying many conventional applications. • Very powerful when used with (the first few) packet size info. • The first work to show that a uni-directional traffic flow feature set is good enough for accurate traffic classification. • Support Vector Machine algorithm worked the best • Requires the smallest training set to achieve high accuracy. • Scientifically grounded (reproducible) traffic classification research requires that researchers share tools, algorithms, and data sets from a wide range of Internet links to reproduce results.

More (Fundamental) Questions Q) If an algorithm A performs very well (with >90/95/99% classification accuracy)… • Why does the algorithm work that well? • i.e., where does the good performance really come from? A1) Is it because the algorithm itself is smart enough? A2) Or is it because traffic classification itself is a rather easy (not that complicated) pattern classification problem? → How much performance is gained from each of (A1) and (A2)? How do we quantify them?

Selected key flow features * CFS (Correlation-based Feature Selection [Williams 06]) was used
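The slide does not spell out how CFS scores a feature subset. The sketch below assumes the standard CFS merit heuristic, merit = k·r_cf / sqrt(k + k(k-1)·r_ff), where k is the number of features, r_cf the mean feature-class correlation, and r_ff the mean feature-feature correlation. The function name `cfs_merit` and its argument layout are hypothetical.

```python
import math


def cfs_merit(feature_class_corrs, mean_feature_feature_corr):
    """CFS merit of a feature subset (assumed standard heuristic).

    feature_class_corrs: one feature-class correlation per feature in the subset.
    mean_feature_feature_corr: average pairwise correlation among those features.
    """
    k = len(feature_class_corrs)
    mean_cf = sum(feature_class_corrs) / k
    # Numerator rewards features that predict the class;
    # denominator penalizes redundancy among the features themselves.
    return k * mean_cf / math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
```

With two equally informative features, the merit is higher when they are uncorrelated with each other than when they are fully redundant, which is why the selected subsets tend to be small.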

Accuracy with each traffic flow feature Using the K-Nearest Neighbors method.
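A minimal sketch of k-nearest-neighbors classification with Euclidean distance in the feature space is given here for readers unfamiliar with the method. The names `euclidean` and `knn_classify`, the default k = 3, and the toy feature vectors in the usage note are assumptions for illustration; the experiments in the talk were run with WEKA, not with this code.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(training, query, k=3):
    """Label `query` by majority vote among its k nearest training instances.

    training: list of (feature_vector, label) pairs.
    """
    nearest = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

For example, with training instances such as `((60, 1400), "web")` and `((120, 90), "dns")`, where the two numbers stand for the sizes of the first two packets, a query flow is assigned the label held by most of its closest neighbors. No model is fitted; the method only computes distances.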

Accuracy with the size of the first n packets in traffic flows • Size of the first 4-5 packets only → ~85% accuracy • Showing the feasibility of accurate real-time traffic classification.
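One way to build the "size of the first n packets" feature vector from a packet stream is sketched below. The packet tuple layout, the function name `flow_features`, and the zero padding for flows shorter than n packets are all assumptions for this example; the slides do not describe the extraction code.

```python
from collections import defaultdict


def flow_features(packets, n=5):
    """Collect the sizes of the first n packets of each uni-directional flow.

    packets: iterable of (src, sport, dst, dport, proto, size) in arrival order.
    Returns {five_tuple: [size_1, ..., size_n]}; short flows are padded
    with 0 (an assumption made for this sketch).
    """
    sizes = defaultdict(list)
    for src, sport, dst, dport, proto, size in packets:
        # The key keeps direction, so A->B and B->A are separate flows.
        key = (src, sport, dst, dport, proto)
        if len(sizes[key]) < n:
            sizes[key].append(size)
    return {key: s + [0] * (n - len(s)) for key, s in sizes.items()}
```

Because only the first few packets of one direction are needed, a feature vector of this kind is available almost as soon as a flow starts, which is what makes real-time classification plausible.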

How many packets do we have to go through to identify specific TCP apps? • Size of the first 4-5 packets only → ~80-85% accuracy

How many packets do we have to go through to identify specific UDP apps? • Size of the first 1-2 packets only → ~80-100% accuracy!

Concluding Remarks
• The measured discriminative power of features
1. The first 4-5 packets: ~85% accuracy.
• Doesn’t have to be bi-directional TCP connection flows [Bernaille ‘06].
2. (1) + Protocol + Ports: ~88.6% accuracy.
• Even without any algorithmic intelligence/model (we did nothing but distance calculation in the Euclidean feature space).
3. Real-time traffic classification with one-directional flows.
4. OK then, now we’ve got 11-17% left to achieve.
• Which algorithm(s), based on what basic theory, is best for obtaining/maximizing the additional performance gain?

Developing an open-source traffic classification benchmark Open-source, Plug & Play Framework

References
[Kim ‘08] Kim et al., “Internet Traffic Classification Demystified: Myths, Caveats, and the Best Practices,” ACM CoNEXT, Madrid, Spain, December 2008.
[CoralReef ‘07] CoralReef. http://www.caida.org/tools/measurement/coralreef
[Erman ‘06] Erman et al., “Traffic Classification Using Clustering Algorithms,” ACM SIGCOMM Workshop on Mining Network Data (MineNet), Pisa, Italy, September 2006.
[Karagiannis ‘05] Karagiannis et al., “BLINC: Multi-level Traffic Classification in the Dark,” ACM SIGCOMM 2005, Philadelphia, PA, August 2005.
[Won ‘06] Won et al., “A Hybrid Approach for Accurate Application Traffic Identification,” IEEE/IFIP E2EMON, April 2006.
[Bernaille ‘06] Bernaille et al., “Early Application Identification,” ACM CoNEXT, Lisboa, Portugal, December 2006.

k-Nearest Neighbors [Figure: two-dimensional feature space with axes Feature X (e.g., 1st packet size, …) and Feature Y, showing training instances for class A, training instances for class B, and testing instances to classify.]
