An Internet Traffic Analysis Method with MapReduce
Youngseok Lee, Wonchul Kang and Hyeongu Son, Chungnam National University
Presented by Venkata Patlolla, Old Dominion University
CS 775 Distributed Systems, Dr. Mukkamala, April 18, 2011
Agenda
• Introduction
• Related Work
• MapReduce-Based Flow Analysis
• Overview
• Flow Analysis Method with MapReduce
• Performance Evaluation
• Experimental Environment
• Flow Statistics Computation Time
• Recovery from a Single Node Failure
• Conclusion
Introduction
• Flow-based traffic monitoring methods are used by ISPs.
• E.g., Cisco NetFlow easily monitors flows passing through routers and switches without observing each packet.
• NetFlow-compatible flow generators such as "nProbe" monitor the packet stream in flow units.
• As networks grow, more routers and switches must be monitored for security, quality-of-service, and accounting reasons.
• Typically, ISPs use high-performance servers with large storage systems to collect and analyze flow data from many routers.
• It is not easy to compute traffic statistics from many large flow files in a short time. Packet sampling and aggregation techniques are used to lessen the continuous stream of flow data.
Cont..
• A single-server approach is not efficient when
• analyzing flow data for large networks (tera- and peta-bytes)
• global Internet worms or DDoS (distributed denial-of-service) attacks happen.
• Cluster filesystems and cloud computing platforms provide
• distributed parallel computing
• fault tolerance.
• Google, Yahoo, Facebook, and Amazon are all actively developing cluster filesystems and cloud computing platforms.
Introduction to MapReduce
• MapReduce: a software framework that supports distributed computing on large datasets across clusters using two functions, Map and Reduce.
• Google first developed the MapReduce programming model for page ranking and web-log analysis.
• Yahoo released an open-source cloud computing platform called "Hadoop".
• Amazon provides Hadoop-based cloud computing services such as Elastic Compute Cloud (EC2) and Simple Storage Service (S3).
• Facebook also uses Hadoop to analyze web logs in its network.
• All of these companies run cloud computing on cluster filesystems because it provides fault tolerance and makes huge data easy to manage.
Related Work
• Flow analysis tools such as flow-tools, CoralReef, and FlowScan are used to generate flow statistics such as port breakdowns.
• These tools run on a single server with a large storage system such as RAID or network-attached storage (NAS).
• They are not efficient at processing tera- or peta-byte flow data.
• Analyzing traffic by parallel processing has been done in several ways.
• One of them, DIPStorage, uses a P2P platform of "storage tanks"; however, assigning a flow-processing rule to each tank increases computation overhead.
• MapReduce programs are widely used to reduce computation time and to analyze huge amounts of data.
MapReduce-Based Flow Analysis
Architecture of the flow measurement and analysis system
Architecture Description
• Cloud platform: provides the cluster filesystem and cloud computing functions.
• Flow data from routers is delivered to the cluster through unicasting or anycasting.
• The master node directs cluster nodes to save and process the flow data.
• The cluster configuration is also handled by the master node.
• Once flow data is archived on the cluster filesystem, the MapReduce flow analysis program is run on the cloud platform.
• Each cluster node's architecture is as shown below.
Functionality
• Flow collector: stores received flow packets into files and periodically moves them from local disk to the cluster filesystem.
• NetFlow packets from routers are sent to cluster nodes by unicast over UDP, which is not reliable; SCTP can be used for reliability.
• Anycast can be used to provide load balancing across cluster nodes when receiving NetFlow packets.
• The flow collector uses flow-tools for NetFlow collection and processing.
• The Mapper and Reducer analyze flow data with the Hadoop MapReduce library.
• To manage huge data and provide a fault-tolerant service, the authors used HDFS.
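The collector's role above can be sketched in Python. This is a minimal, hypothetical sketch, not the paper's implementation (the paper uses flow-tools): the UDP port, the file-name pattern, and the 5-minute rollover interval are all assumptions for illustration.

```python
import socket
import time

ROLL_INTERVAL = 300  # assume flow-tools-style 5-minute capture files

def file_for(ts, interval=ROLL_INTERVAL):
    """Capture-file name for a timestamp: one file per 5-minute window,
    mirroring the periodic move of closed files to the cluster filesystem."""
    window = int(ts) - int(ts) % interval
    return f"ft-{window}.raw"

def collect(port=9995):
    """Append each received NetFlow export datagram to the current
    capture file. UDP delivery is unreliable, as the slide notes;
    SCTP could be substituted for reliability."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    while True:
        datagram, _addr = sock.recvfrom(65535)
        with open(file_for(time.time()), "ab") as f:
            f.write(datagram)
```

Files from closed windows would then be moved to HDFS periodically, which is where the Mapper and Reducer pick them up.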
Cont..
• HDFS follows a write-once, read-many-times pattern.
• HDFS has
• a NameNode, which manages filesystem metadata and provides management and control services;
• the NameNode at the master performs recovery and automatic backup of the namespace;
• DataNodes, which supply block storage and retrieval services.
Flow Analysis Method with MapReduce
• A MapReduce computation consists of Map and Reduce functions.
• Map takes an input key/value pair and produces intermediate key/value pairs.
• The Hadoop MapReduce library groups the intermediate values that share the same key.
• Reduce merges the intermediate values into a smaller set of values.
• To implement various flow analysis programs with MapReduce, we have to choose appropriate input key/value pairs.
• E.g., analyzing traffic by port breakdown, which sums up the octet count per port number: the key/value pair is (port, octets).
• This is shown in the following figure.
Explanation of the Example in Detail
• Input flow files: after flow data is stored on local disk, we move the raw NetFlow v5 files to the cluster filesystem, HDFS.
• Since the Hadoop Mapper supports only text files, we convert the NetFlow files to text. Because text files are large, binary input to the Mapper should be supported; otherwise, gzip files cannot be used as input.
• Mapper: reads flow records split by newline. Each record has a timestamp, ports, IP addresses, flags, an octet count, and a packet count.
• After reading, the Mapper filters out the flow attributes needed for a given flow analysis job.
• For a job that sums up octet counts per destination port number, the key/value pair is set to (dst port, octets).
• The flow map task writes its temporary results to local disk.
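A Mapper along these lines can be sketched in Python, Hadoop Streaming style. This is an illustration, not the authors' code; in particular, the text-record column layout below is a hypothetical simplification (real NetFlow v5 text dumps carry more fields).

```python
import sys

def map_flow(line):
    """Extract the (dst_port, octets) pair from one text flow record.

    Assumed whitespace-separated layout (hypothetical):
    timestamp src_ip dst_ip src_port dst_port octets packets
    """
    fields = line.split()
    return fields[4], int(fields[5])

def mapper(stream=sys.stdin):
    """Hadoop Streaming entry point: read one record per line and
    print 'dst_port<TAB>octets' as the intermediate key/value pair."""
    for line in stream:
        line = line.strip()
        if line:
            port, octets = map_flow(line)
            print(f"{port}\t{octets}")
```

Hadoop then sorts and groups these intermediate pairs by destination port before handing them to the reduce tasks.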
Cont..
• Reducer: its input is fed from the temporary files, i.e., the intermediate values generated by the flow mappers.
• The list of octet counts belonging to the same destination port number is summed up.
• After merging the octet values associated with each destination port, the flow reducer writes out the total octet value per port number.
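The matching Reducer can be sketched the same way; again a hedged illustration rather than the authors' code. It relies on the fact that Hadoop delivers the mapper output to each reduce task sorted by key, which is what `groupby` needs.

```python
import sys
from itertools import groupby

def reduce_flows(pairs):
    """Sum the octet counts that share a destination port.

    `pairs` is an iterable of (dst_port, octets) tuples already
    sorted by port, as Hadoop hands them to a reduce task."""
    totals = {}
    for port, group in groupby(pairs, key=lambda kv: kv[0]):
        totals[port] = sum(octets for _, octets in group)
    return totals

def reducer(stream=sys.stdin):
    """Hadoop Streaming entry point: read 'port<TAB>octets' lines and
    print one 'port<TAB>total_octets' line per destination port."""
    pairs = ((p, int(o)) for p, o in
             (line.rstrip("\n").split("\t") for line in stream if line.strip()))
    for port, total in reduce_flows(pairs).items():
        print(f"{port}\t{total}")
```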
Performance Evaluation
• Testbed consisting of a master node and four data nodes. Each node has a quad-core 2.83 GHz CPU, 4 GB of memory, and a 1.5 TB hard disk. HDFS is used for the cluster filesystem. All Hadoop nodes are connected with 1-Gigabit Ethernet cards.
• A 5-minute flow file is not enough to assess the performance of MapReduce.
Cont..
• Thus, to evaluate the flow-statistics computation time for large data sets, we used input flow files collected for one day, one week, and one month.
• The binary flow files are used as inputs to flow-tools, whereas the text flow files are fed to our MapReduce program.
Flow Statistics Computation Time
• Comparison between flow-tools on a single server and the MapReduce program.
• The aim is to compute the octet count for each destination port number.
• We executed the flow-tools command "flow-cat /flowdirectory/ | flow-stat -f 5 > result" to concatenate the binary flow files stored in a directory and to calculate the flow statistics per destination port.
• The MapReduce program reads the text flow files and produces the octet count for each destination port.
Cont..
• Under a large data set of 108.2 million flows, MapReduce with four data nodes spent only 1.5 times as long to complete the job while recovering from Map/Reduce failures.
• The experiments show that the flow computation job can finish successfully despite a single node failure, thanks to Hadoop's fault-tolerant service.
Conclusion
• A MapReduce-based flow analysis method for large-scale networks that can analyze big flow data efficiently and quickly, even in the presence of failures.
• Flow computation time was dramatically improved, by 72%, compared with typical flow analysis tools.
• A fault-tolerant service against a single machine failure is easily provided by the MapReduce-based flow analysis.
• Future work: improve a few drawbacks of the current MapReduce-based approach, such as batch-only processing and text-only input file formats, and develop convenient MapReduce-based flow analysis tools.