On the Validation of Traffic Classification Algorithms

Géza Szabó, Dániel Orincsay, Szabolcs Malomsoky, István Szabó
Traffic Lab, Ericsson Research Hungary

Aim & Contents
• Aim:
  • Introduce our novel validation method, which makes it possible to measure the accuracy of traffic classification methods
• Contents:
  • Requirements – How should validation be done?
  • Related work – How is it currently done?
  • Our proposal – What have we proposed?
  • Working mechanism – How does our proposal work?
  • Validation of a state-of-the-art traffic classification method – What have we learnt from the validation?
  • Future work – What else can be done with the proposed method?

Requirements – How should validation be done?
• Objective of traffic classification:
  • Identify applications in passively observed traffic
• Validation of the classification method by active test

Related work – How is it currently done?
• Currently:
  • Weak and ad hoc validation
  • No reliable and widely accepted validation technique
  • No reference packet trace with well-defined content is available
• Dynamically allocated ports
• Non-realistic environment
• Proprietary protocols
• Encryption
• Be up to date
S. Sen and J. Wang: Analyzing Peer-to-peer Traffic Across Large Networks
• Header traces → port based method
• Many flows
• Simultaneous applications
• Previously well-classified traces
J. Erman, M. Arlitt and A. Mahanti: Traffic Classification Using Clustering Algorithms
• Impossible to validate by others
• Just a hint
• Impossible to repeat under the same conditions
T. Karagiannis, K. Papagiannaki and M. Faloutsos: BLINC: Multilevel Traffic Classification in the Dark
L. Bernaille et al.: Traffic Classification On The Fly

Our proposal

The proposed method for validation
• Principle:
  • Packets are collected into flows at the traffic generating terminal
  • Flows are marked with the identifier of the application that generated the packets of the flow
• The main requirements on the realization of the method:
  • It should not deteriorate the performance of the terminal
  • The byte overhead of marking should be negligible
• The preferred realization is a driver that can be easily installed on terminals
[Figure: The position of the proposed driver within the terminal]

Working mechanism
• Each packet is examined to determine whether it is incoming or outgoing
• For an outgoing packet, the size of the packet is examined
  • Processing continues only for packets smaller than the MTU minus the size of the marking
• Processing continues only for TCP or UDP packets
• Based on the five-tuple identifier of the packet, the driver checks whether it already knows which application the flow belongs to
  • If not, it queries the operating system
• Deciding whether the packet needs marking:
  • Randomly
  • Only the first packet
  • Leave the first packet
  • No mark
[Figure: The working mechanism of the introduced driver]
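The decision steps above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the driver itself: the `Packet` and `Marker` names, the `lookup_app` callback standing in for the operating system query, the 1500-byte default MTU, the 0.5 marking probability, and the reading of "leave the first" as "mark every packet except the first" are all assumptions made for the example.

```python
import random
from typing import Callable, Dict, NamedTuple, Optional, Tuple

MARK_SIZE = 4  # bytes added by the marking, as stated on the "Place of marking" slide

# (source IP, source port, destination IP, destination port, protocol)
FiveTuple = Tuple[str, int, str, int, str]


class Packet(NamedTuple):
    outgoing: bool
    size: int        # packet size in bytes
    protocol: str    # e.g. "TCP", "UDP", "ICMP"
    flow: FiveTuple


class Marker:
    """Hypothetical model of the per-packet marking decision."""

    def __init__(self, lookup_app: Callable[[FiveTuple], Optional[str]],
                 policy: str = "only_first", mtu: int = 1500,
                 probability: float = 0.5,
                 rng: Optional[random.Random] = None) -> None:
        self.lookup_app = lookup_app      # stand-in for the OS query
        self.policy = policy              # random | only_first | leave_first | no_mark
        self.mtu = mtu
        self.probability = probability
        self.rng = rng or random.Random()
        self.flows: Dict[FiveTuple, Optional[str]] = {}  # flow -> application
        self.seen: Dict[FiveTuple, int] = {}             # flow -> eligible packets so far

    def decide(self, pkt: Packet) -> Optional[str]:
        """Return the application name to mark with, or None for no marking."""
        if not pkt.outgoing:
            return None
        if pkt.size >= self.mtu - MARK_SIZE:   # must stay below MTU after marking
            return None
        if pkt.protocol not in ("TCP", "UDP"):
            return None
        if pkt.flow not in self.flows:         # unknown flow: ask the OS once
            self.flows[pkt.flow] = self.lookup_app(pkt.flow)
        app = self.flows[pkt.flow]
        index = self.seen.get(pkt.flow, 0)
        self.seen[pkt.flow] = index + 1
        if app is None:
            return None
        if self.policy == "random":
            mark = self.rng.random() < self.probability
        elif self.policy == "only_first":
            mark = index == 0
        elif self.policy == "leave_first":
            mark = index > 0
        else:                                  # "no_mark"
            mark = False
        return app if mark else None
```

Caching the application per five-tuple means the (presumably more expensive) operating system query runs once per flow rather than once per packet, which fits the requirement that the terminal's performance should not deteriorate.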

Place of marking
• The original IP packet is extended with one option field
  • Router Alert option field
  • Transparent both to the routers on the path and to the receiving host (according to RFC 2113 [3])
• The first two characters of the corresponding executable file name are added
• The size of the packet increases by 4 bytes
  • The packet size field in the IP header is also increased by 4 bytes
  • The header checksum is recalculated
[Figure: A marked packet of the BitTorrent protocol]
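As a rough illustration of the marking step, the sketch below appends a 4-byte Router Alert option to a raw IPv4 packet and updates the header fields listed above. RFC 2113 defines the option as type 148 (0x94), length 4, followed by a 2-octet value; here the value carries the first two characters of the executable name, as on the slide. The function names, the ASCII encoding of the tag, and the choice to append the option after any existing options are assumptions of this example, not details taken from the driver.

```python
import struct

ROUTER_ALERT_TYPE = 0x94  # RFC 2113: copied flag 1, class 0, option number 20
OPTION_LEN = 4            # type (1) + length (1) + value (2)
MAX_IPV4_HEADER = 60      # IHL is a 4-bit count of 32-bit words


def ipv4_checksum(header: bytes) -> int:
    """One's complement sum of the 16-bit words of an IPv4 header."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def mark_packet(packet: bytes, exe_name: str) -> bytes:
    """Return a copy of `packet` carrying a 2-character application tag."""
    ihl = (packet[0] & 0x0F) * 4
    if ihl + OPTION_LEN > MAX_IPV4_HEADER:
        raise ValueError("no room left for another IP option")
    header, payload = bytearray(packet[:ihl]), packet[ihl:]

    # Assumption: the tag is the first two characters, ASCII, NUL-padded.
    tag = exe_name[:2].encode("ascii", "replace").ljust(2, b"\x00")
    header += bytes([ROUTER_ALERT_TYPE, OPTION_LEN]) + tag

    # Header grew by one 32-bit word: update IHL and the total length field.
    header[0] = (header[0] & 0xF0) | (len(header) // 4)
    total_len = struct.unpack_from("!H", header, 2)[0] + OPTION_LEN
    struct.pack_into("!H", header, 2, total_len)

    # Recalculate the header checksum with the checksum field zeroed first.
    struct.pack_into("!H", header, 10, 0)
    struct.pack_into("!H", header, 10, ipv4_checksum(bytes(header)))
    return bytes(header) + payload
```

For a BitTorrent client whose executable is called, say, `bittorrent.exe` (a made-up name for this example), the appended option bytes would be `94 04 62 69`, that is, the option type, the length, and the characters "bi".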

Proof-of-concept

Reference measurement
• Available at http://pics.etl.hu/~szabog/measurement.tar
• Taken in a separate access network
• Our driver has been installed on all computers in this network
• Duration of the measurement: 43 hours
• Captured data volume: 6 Gbytes, containing 12 million packets
• The measurement contains the traffic of the most popular
  • P2P protocols:
    • BitTorrent
    • eDonkey
    • Gnutella
    • DirectConnect
  • VoIP and chat applications:
    • Skype
    • MSN Live
  • FTP sessions
  • Download manager
  • E-mail sending and receiving sessions
  • Web based e-mail (e.g., Gmail)
  • SSH sessions
  • SCP sessions
  • FPS and MMORPG gaming sessions
  • Streaming:
    • Radio
    • Video
    • Web based
[Figure: The traffic mix of the measurement]

Validation results (1) – Success
• Validated classifier: the combined traffic classification method described in [1], with the classification of VoIP applications extended with ideas from [2]
• Accurately identified:
  • E-mail
  • File transfer
  • Streaming
  • Secure channel
  • Gaming traffic
• Success due to protocols that:
  • Are well documented
  • Are open standards
  • Do not constantly change
• Difficulties in case of…?
  • Encryption:
    • But: the session initiation phase is critical, and this phase can be identified accurately
    • Success: SSH or SCP
[Figure: The results of the classification [1] compared to the reference measurement]
[1] G. Szabo, I. Szabo and D. Orincsay: Accurate Traffic Classification
[2] M. Perenyi and S. Molnar: Enhanced Skype Traffic Identification

Validation results (2) – P2P
Difficulties:
• Many TCP flows contain only 1-2 SYN packets, probably sent to disconnected peers
  • No payload in these packets => the signature based methods cannot work
  • Dynamically allocated source ports towards not well-known destination ports => the port based methods fail
  • The server search and P2P communication heuristic methods of [1] also fail => there are no other successful flows to such IPs
• Some small non-P2P flows were also misclassified into the P2P class
  • The content of the port-application database is not fully correct
  • Creating too many port-application associations easily raises the misclassification ratio
• The constant change of P2P protocols
  • New features are added to P2P clients day by day
  • A working mechanism can be typical of a particular client rather than of the whole protocol
[Figure: The results of the classification [1] compared to the reference measurement]

Validation results (3) – Philosophy
• Traffic that is derived from other traffic:
  • E.g., DNS traffic
  • MSN: uses the HTTP protocol for transmitting chat messages
  • The MSN client transmits advertisements over HTTP, but this cannot be regarded as deliberate web browsing
• Hit := the classification outcome and the generating application type (the validation outcome) agree
  • E.g., the chat on DirectConnect hubs was classified as chat, which could be considered correct, but in this comparison it was counted as a misclassification
[Figure: The results of the classification [1] compared to the reference measurement]
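The hit definition above can be turned into a per-class hit ratio with a few lines of Python. This is a minimal sketch: the `hit_ratios` name and the choice to count hits per flow (rather than per packet or per byte) are assumptions of the example, since the slides do not state the unit of comparison.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def hit_ratios(flows: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Compute the hit ratio for each validation class.

    `flows` yields (validation_class, classified_class) pairs, one per flow:
    the first element comes from the marking, the second from the classifier.
    """
    totals: Counter = Counter()
    hits: Counter = Counter()
    for truth, predicted in flows:
        totals[truth] += 1
        if truth == predicted:   # hit: both outcomes agree
            hits[truth] += 1
    return {cls: hits[cls] / totals[cls] for cls in totals}


# A DirectConnect hub chat flow is marked as P2P by the driver but
# classified as chat, so under this strict definition it is a miss.
example = [("p2p", "p2p"), ("p2p", "chat"), ("voip", "voip")]
ratios = hit_ratios(example)
```

With this strict equality test, traffic that one application generates on behalf of another purpose (DNS lookups, MSN advertisements over HTTP) lowers the hit ratio even when the classifier's answer is arguably reasonable, which is the point the slide makes.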

Validation results (4) – VoIP: MSN, Skype
• The high VoIP hit ratio is due to the successful identification of:
  • MSN Messenger
  • Skype
• Skype is difficult to identify
  • Same problem as in the case of P2P
  • Proprietary protocol designed to ensure secure communication
• Characteristic feature from [2]: the application sends packets at an exact 20 sec interval even when there is no ongoing call
• In [1]: a P2P identification heuristic designed to track any message that is sent periodically
  • The extension of [1] was straightforward
• The validation showed:
  • The deficiency of the classification of Skype
  • That a simple extension of the algorithm was sufficient
• The idea of [1] has been validated, as it proved robust when extended to recognize a new application
• The validation mechanism also proved to be useful
[Figure: The results of the classification [1] compared to the reference measurement]
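A periodicity check of the kind described above might look like the following sketch. It is not the heuristic of [1] or [2], whose details are not given on the slides; the `is_periodic` name, the 0.5 s tolerance, the minimum of three intervals, and the 80% matching threshold are all assumptions chosen for illustration.

```python
from typing import Sequence


def is_periodic(timestamps: Sequence[float], period: float = 20.0,
                tolerance: float = 0.5, min_intervals: int = 3,
                min_fraction: float = 0.8) -> bool:
    """Return True if most gaps between consecutive packets match `period`.

    `timestamps` are packet send times in seconds, in ascending order,
    for packets exchanged between one pair of hosts.
    """
    gaps = [later - earlier
            for earlier, later in zip(timestamps, timestamps[1:])]
    if len(gaps) < min_intervals:
        return False   # too few observations to call it periodic
    matching = sum(1 for gap in gaps if abs(gap - period) <= tolerance)
    return matching / len(gaps) >= min_fraction
```

Because `period` is a parameter, the same check could be pointed at other applications with periodic keep-alive messages, which is in the spirit of the slide's remark that the extension of [1] was straightforward.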

Summary
• We introduced a new active measurement method which can help in the validation of traffic classification methods
• The introduced method is a network driver
  • It marks the outgoing packets of the clients with an application specific marking
• With the introduced method we created a measurement and used it to validate the method presented in [1]
• Benefits:
  • The method has been proved to be working accurately
  • Some deficiencies in the classification were revealed:
    • P2P applications
    • Skype

Further work
• Use the marking method at the measurement side for online traffic classification
  • Assumptions:
    • The terminals accessing an operator’s network all have the proposed driver installed
    • The driver is made tamper-proof to prevent users from forging the marking
• Online clustering of the traffic into QoS classes based on the resource requirements of the generating application
• Could be used by operators to charge on the basis of the application used by the user
• Extension of the marking with other information about the traffic generating application
  • E.g., version number
  • The operator could track the security risks of an old application
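For the measurement side, reading the marking back out of a captured packet could be sketched as below. The parser walks the IPv4 options (End of Option List is type 0, No Operation is type 1, every other option carries a length octet) and returns the 2-character tag from a Router Alert option (type 0x94, length 4, per RFC 2113). The `read_mark` name and the ASCII decoding are assumptions of this example.

```python
from typing import Optional

ROUTER_ALERT_TYPE = 0x94  # RFC 2113
END_OF_OPTIONS = 0
NO_OPERATION = 1
BASE_HEADER = 20          # IPv4 header length without options


def read_mark(packet: bytes) -> Optional[str]:
    """Return the 2-character application tag of a marked IPv4 packet,
    or None if the packet carries no Router Alert option."""
    ihl = (packet[0] & 0x0F) * 4
    options = packet[BASE_HEADER:ihl]
    i = 0
    while i < len(options):
        opt_type = options[i]
        if opt_type == END_OF_OPTIONS:
            break
        if opt_type == NO_OPERATION:      # single-octet padding option
            i += 1
            continue
        if i + 1 >= len(options):
            break                         # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                         # malformed option
        if opt_type == ROUTER_ALERT_TYPE and length == 4:
            return options[i + 2:i + 4].decode("ascii", "replace")
        i += length
    return None
```

A genuine Router Alert option, as used for example by IGMP, carries the value zero, so an online classifier built on this would need to treat a tag of two NUL characters as "no application marking".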

Questions, discussion…
• Thank you very much for your kind attention!
• Contact:
  • E-mail: geza.szabo@ericsson.com
