Netflow and Botnets Steven M. Bellovin Columbia University smb
Hypothesis
• Most hosts are either clients or servers
• P2P traffic is an exception
• Bots talk to other bots and thus to a command-and-control node
• By looking for unusual traffic flows (client-to-client traffic that isn’t P2P) we can find bots
Methodology
• Use Netflow data to identify clients and servers
• Classify nodes as clients or servers
• Build a traffic matrix from the data to see which clients talk to which other clients
• Exclude P2P traffic, which is generally identifiable based on flow size
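The traffic-matrix step above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the tuple layout, and the `is_p2p` predicate are hypothetical, and the slides do not say which flow sizes mark a flow as P2P, so that decision is left to the caller.

```python
from collections import Counter


def client_traffic_matrix(flows, clients, is_p2p):
    """Count flows between pairs of hosts that are both classified as clients.

    flows   -- iterable of (src_addr, dst_addr, octets) tuples (assumed layout)
    clients -- set of addresses already classified as clients
    is_p2p  -- caller-supplied predicate on flow size in octets; hypothetical,
               since the source only says P2P is "generally identifiable
               based on flow size"
    """
    matrix = Counter()
    for src, dst, octets in flows:
        # Keep only client-to-client flows that do not look like P2P.
        if src in clients and dst in clients and not is_p2p(octets):
            matrix[(src, dst)] += 1
    return matrix
```

Any non-zero entry in the resulting matrix is a client-to-client conversation that the size filter did not explain away, which is the kind of flow the hypothesis treats as worth a closer look.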
Netflow
• Originally from Cisco; now implemented by most router vendors
• Also the basis of an IETF “Proposed Standard” (IPFIX)
• Records “flow information” (src/dst pairs of addresses and port numbers, length, timing, etc.) for “connections” through a given router
• Intended for accounting and for traffic engineering
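As a rough picture of what one such record carries, here is a small Python sketch. The class and field names are hypothetical and do not follow the export format of any particular Netflow version; they only mirror the items listed above (src/dst pairs, length, timing).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """One unidirectional flow as reported by a router (hypothetical fields)."""
    src_addr: str
    dst_addr: str
    src_port: int
    dst_port: int
    protocol: int   # IP protocol number, e.g. 6 for TCP, 17 for UDP
    packets: int    # packets counted for this flow
    octets: int     # flow length in bytes
    start_ms: int   # time the first packet of the flow was seen
    end_ms: int     # time the last packet of the flow was seen

    def duration_ms(self) -> int:
        """Elapsed time between the first and last packet seen."""
        return self.end_ms - self.start_ms

    def key(self):
        """The src/dst pair (addresses and ports) plus protocol."""
        return (self.src_addr, self.src_port,
                self.dst_addr, self.dst_port, self.protocol)
```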
Problems with Netflow
• Flows are unidirectional; need two records for a complete picture
• This is a consequence of Internet topology; most inter-ISP connections follow asymmetric paths
• Routers often deliver sampled data; can miss flow start/end packets
• Does not give an unambiguous indication of client versus server
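To illustrate the first problem, the sketch below pairs unidirectional records into two-way conversations and reports those for which only one direction was observed, as can happen with asymmetric paths or sampling. The function names and the tuple layout are assumptions made for this example, not part of the tools described in the talk.

```python
from collections import defaultdict


def pair_flows(flows):
    """Group unidirectional flows into direction-independent conversations.

    Each flow is assumed to be a tuple
    (src_addr, src_port, dst_addr, dst_port, octets).
    """
    conversations = defaultdict(list)
    for flow in flows:
        src_addr, src_port, dst_addr, dst_port, _octets = flow
        a = (src_addr, src_port)
        b = (dst_addr, dst_port)
        # Order the two endpoints so both directions share one key.
        key = (min(a, b), max(a, b))
        conversations[key].append(flow)
    return dict(conversations)


def one_sided(conversations):
    """Return keys of conversations where only one direction was recorded."""
    result = []
    for key, flows in conversations.items():
        senders = {(f[0], f[1]) for f in flows}
        if len(senders) == 1:
            result.append(key)
    return result
```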
Strategy
• Build tools at Columbia
• Easy access to machines and data
• Use existing archive of CU netflow data
• Unclear if there are botnets present; get classification right first
• Get other netflow archives (e.g., from predict.org)
• Bring nominally-working code to AT&T to experiment with large-scale datasets
• Compare with previous results from AT&T as a check on correctness
Node Classification
• Must use heuristics
• Flag field in netflow data doesn’t show client vs. server
• Timestamp not useful because of sampling
• Current strategy: look at port number distribution
• Clients usually use ports 48K-64K (49152-65535)
• Considering using node degree
• But: problems with low-activity hosts?
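A minimal sketch of the port-distribution heuristic might look like the following. The 48K-64K range comes from the slide; the function name and the two fraction thresholds are hypothetical placeholders chosen for illustration, not values from the talk.

```python
EPHEMERAL_LOW = 48 * 1024        # 49152, lower end of the 48K-64K range
EPHEMERAL_HIGH = 64 * 1024 - 1   # 65535, highest valid port number


def classify_host(local_ports, client_fraction=0.8, server_fraction=0.2):
    """Guess 'client', 'server', or 'ambiguous' from a host's own ports.

    local_ports     -- port numbers the host used on its side of each flow
    client_fraction -- hypothetical threshold: at or above this share of
                       high ports, call the host a client
    server_fraction -- hypothetical threshold: at or below this share,
                       call the host a server
    """
    if not local_ports:
        # No observed flows: nothing to base a decision on.
        return "ambiguous"
    high = sum(1 for p in local_ports if EPHEMERAL_LOW <= p <= EPHEMERAL_HIGH)
    fraction = high / len(local_ports)
    if fraction >= client_fraction:
        return "client"
    if fraction <= server_fraction:
        return "server"
    return "ambiguous"
```

A host with only a handful of flows can swing between labels on a single record, which is one way the low-activity concern raised above shows up in practice.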
Classification is Hard
• Simple heuristics have not been satisfactory
• Building visualization tools to help us understand the data
Ambiguous Host
[Figure]
Ambiguous Host Scatter Plot
[Figure]
Is this the sort of host we’re looking for?
Current Status
• Have basic tools built
• Working with visualization tools to understand the data
• Next steps:
• Refine classification algorithms
• Confirm analysis of bots in sample data
• Try tools on larger dataset