This document outlines a simple clustering algorithm designed for spam detection, presented at the MIT Spam Conference in 2008 by Phil and Tom. The algorithm focuses on expanding clusters based on similar messages that share identical originating IP addresses, subject lines, or message bodies. It utilizes a dimensional model for organizing and classifying spam messages, processing a dataset of 1.7 million messages. The results highlight cluster sizes, top sender IPs, and subject lines, offering insights into spam clustering effectiveness and strategies for identifying similar message patterns.
Clustering Spam MIT Spam Conference 2008 Phil Tom
Simple Clustering Algorithm

Expand clusters to include similar messages:
• Identical originating IP addresses.
• Identical subject lines.
• Identical message bodies.

Clustering pseudocode:

    for each cluster in clusters
        expand cluster
    for each message in unclustered messages
        create a new cluster
        add message to cluster
        expand cluster
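The loop above can be sketched in memory as follows. This is a minimal illustration, not the authors' implementation (the slides drive the expansion with SQL updates). The `Message` class, the function names, and the choice to repeat the expansion until nothing changes are all assumptions; the slide only says "expand cluster".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    # Hypothetical in-memory stand-in for a row of the message table.
    ip: str
    subject: str
    body: str
    cluster_id: Optional[int] = None


def expand_cluster(messages: List[Message], cluster_id: int) -> None:
    """Pull in every message sharing an IP, subject, or body with the cluster.

    Assumption: expansion repeats until no message moves.
    """
    changed = True
    while changed:
        changed = False
        members = [m for m in messages if m.cluster_id == cluster_id]
        ips = {m.ip for m in members}
        subjects = {m.subject for m in members}
        bodies = {m.body for m in members}
        for m in messages:
            if m.cluster_id == cluster_id:
                continue
            if m.ip in ips or m.subject in subjects or m.body in bodies:
                m.cluster_id = cluster_id  # also absorbs members of other clusters
                changed = True


def cluster_messages(messages: List[Message]) -> None:
    # First pass: expand clusters that already exist.
    existing = sorted({m.cluster_id for m in messages if m.cluster_id is not None})
    for cid in existing:
        expand_cluster(messages, cid)
    # Second pass: seed a new cluster from each message that is still unclustered.
    next_id = max(existing, default=0) + 1
    for m in messages:
        if m.cluster_id is None:
            m.cluster_id = next_id
            expand_cluster(messages, next_id)
            next_id += 1


msgs = [
    Message("192.0.2.1", "s1", "b1"),
    Message("192.0.2.1", "s2", "b2"),     # shares an IP with the first message
    Message("192.0.2.2", "s2", "b3"),     # shares a subject with the second
    Message("198.51.100.9", "s4", "b4"),  # shares nothing
]
cluster_messages(msgs)
```

Because membership is transitive, the third message lands in the same cluster as the first even though the two share no attribute directly. That chaining is one plausible reason a single very large cluster can form.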
Expand Cluster By IP

    update sdbf_message
       set cluster_id = ?
     where (cluster_id <> ? or cluster_id is null)
       and sender_ip_id in (select sender_ip_id
                              from sdbf_message
                             where cluster_id = ?)
Expand Cluster By Body

    update sdbf_message m
       set cluster_id = ?
      from sdbd_body b
     where (m.cluster_id <> ? or m.cluster_id is null)
       and m.body_id in (select body_id
                           from sdbf_message
                          where cluster_id = ?)
       and m.body_id = b.body_id
       and b.size_in_bytes > 25
Expand Cluster By Subject

    update sdbf_message m
       set cluster_id = ?
      from sdbd_subject s
     where (m.cluster_id <> ? or m.cluster_id is null)
       and m.subject_id in (select subject_id
                              from sdbf_message
                             where cluster_id = ?)
       and m.subject_id = s.subject_id
       and (s.word_count > 1 or length(s.subject) > 10)
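The body and subject queries only link messages whose shared value is long enough to be meaningful: bodies over 25 bytes, and subjects with more than one word or more than 10 characters. A small sketch of those two guards as predicates follows. The function names are hypothetical, and two details are assumptions: that `word_count` means whitespace-separated tokens, and that `size_in_bytes` is measured on the UTF-8 encoding.

```python
def body_is_linkable(body: str, min_bytes: int = 25) -> bool:
    # Mirrors "b.size_in_bytes > 25"; UTF-8 byte length is an assumption.
    return len(body.encode("utf-8")) > min_bytes


def subject_is_linkable(subject: str) -> bool:
    # Mirrors "s.word_count > 1 or length(s.subject) > 10";
    # whitespace tokenization is an assumption.
    return len(subject.split()) > 1 or len(subject) > 10


print(subject_is_linkable("Hi"))                # short single word: not used for linking
print(subject_is_linkable("Delivery failure"))  # two words: used for linking
print(body_is_linkable("x" * 25))               # exactly 25 bytes: not used for linking
```

Without guards of this kind, very short subjects or near-empty bodies would tie unrelated messages into the same cluster.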
Test Data Set
• Dec 22, 2007 - Dec 29, 2007
• Single “Received:” header tag only
• No multi-part messages
• 1.7 million messages
• Roughly 20%
Messages per Cluster Size
[Chart: messages per cluster size. *Not including the big cluster.]
Top Clusters by IPs

 cluster_id | messages | subject | bodies |  ips   | networks | countries
------------+----------+---------+--------+--------+----------+-----------
          1 |  1436206 |   99836 | 330852 | 325660 |     8940 |       177
         62 |    26623 |     451 |  25992 |   1313 |       57 |         2
         59 |    11322 |      19 |     15 |    962 |        4 |         1
         68 |     1065 |       2 |   1065 |    609 |       12 |         4
         69 |     4476 |      59 |     85 |    514 |       17 |         1
      10477 |     5521 |       5 |      9 |    283 |        4 |         1
        953 |      722 |     149 |    333 |    275 |       16 |         1
        175 |      307 |       2 |    306 |    208 |      179 |        26
        379 |      240 |       7 |      9 |    184 |        4 |         1
      18219 |     5581 |      15 |   5212 |    153 |      119 |        26
       3924 |     2934 |      20 |   2934 |    150 |        1 |         1
        144 |      377 |      22 |    377 |    125 |        3 |         1
        242 |      307 |       4 |      3 |    124 |        5 |         1
        134 |     3399 |      48 |    169 |    114 |       17 |         1
        209 |      156 |       4 |    155 |    105 |       96 |        19
        198 |     1117 |     174 |   1100 |    101 |        4 |         1
The Big One

Cluster 1 summary

 messages | subject | bodies |  ips   | networks | countries
----------+---------+--------+--------+----------+-----------
  1436206 |   99836 | 330852 | 325660 |     8940 |       177

Top 10 countries by IP count

 messages | subjects | bodies |  ips  | networks | country_name
----------+----------+--------+-------+----------+---------------------
   254948 |    30854 |  62772 | 27464 |     1453 | United States
    75969 |     5110 |  27366 | 27446 |      170 | Germany
   114328 |     6558 |  39312 | 26758 |      147 | Spain
    78378 |     4705 |  29291 | 25263 |       48 | Turkey
    91527 |     4624 |  29926 | 20930 |      209 | United Kingdom
    51708 |     3194 |  19983 | 16842 |       42 | Peru
    52652 |     2848 |  19644 | 15533 |      148 | Colombia
    39475 |     3059 |  13344 | 10129 |      152 | Chile
    34827 |     5063 |  12790 |  9664 |       12 | Brazil
    40144 |     4381 |  13368 |  9372 |      126 | Italy
Clustering the Big One
• Create clusters on subject and body

 messages | cluster_id |  ips   | subjects | bodies
----------+------------+--------+----------+--------
   740447 |      34641 | 131024 |       34 |    136   fake watches
   111122 |      34643 |  79419 |      330 |  59166   penis enlargement
    76521 |      34642 |  59112 |       27 |  55129   online casino
    55421 |      34644 |  44772 |       55 |  25023   fake name brand goods
    27789 |      34653 |   7190 |       81 |  16225   viagra
    26815 |      34646 |  11099 |       20 |  19680   valium
    25679 |      34656 |   5990 |    14846 |  25644   online pharmacy
    12953 |      34649 |   3391 |       45 |      5   stock investment
    12924 |      34645 |   4149 |        3 |      5   porn
    12919 |      34648 |   3483 |        9 |  12332   software
    10071 |      34650 |   9240 |       17 |   9273   russian dating

1099737 messages
284493 unique IPs
Clustering the Big One (cont)
[Figure: number of overlapping IPs between clusters]
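One way to compute the overlap counts named on this slide is a pairwise set intersection over each cluster's sender IPs. The sketch below assumes the IPs have already been grouped per cluster. The function name and the sample addresses (documentation ranges) are illustrative and are not taken from the data set.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def ip_overlap(cluster_ips: Dict[int, Set[str]]) -> Dict[Tuple[int, int], int]:
    """Count sender IPs shared by each pair of clusters."""
    return {
        (a, b): len(cluster_ips[a] & cluster_ips[b])
        for a, b in combinations(sorted(cluster_ips), 2)
    }


# Illustrative input only: cluster ids borrowed from the slides, IPs made up.
sample = {
    34641: {"192.0.2.1", "192.0.2.2"},
    34643: {"192.0.2.2", "192.0.2.3"},
    34642: {"192.0.2.9"},
}
overlaps = ip_overlap(sample)
```

A high count for a pair suggests the same sending hosts carry both campaigns, which is useful when the clusters were built on subject and body alone.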
Am I Bot or Not?

 cluster_id | messages | subjects | bodies |  ips  | networks | countries
------------+----------+----------+--------+-------+----------+-----------
         62 |    26623 |      451 |  25992 |  1313 |       57 |         2

 messages | subjects | bodies |  ips  | networks | country_name
----------+----------+--------+-------+----------+---------------
     1246 |       87 |   1246 |     5 |        3 | Canada
    25377 |      443 |  24746 |  1308 |       54 | United States

• Subject content varied widely
• Many blocks of consecutive IPs
• Some blocks cover all or most of a /24
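The observation about /24 blocks can be checked by counting how many distinct sender addresses fall in each /24. The following is a rough sketch using Python's standard `ipaddress` module. The function name and sample addresses are assumptions, and the slides do not say how the authors measured block coverage.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, Iterable


def hosts_per_slash24(ips: Iterable[str]) -> Dict[str, int]:
    """Map each /24 network to the number of distinct sender IPs seen in it."""
    blocks = defaultdict(set)
    for ip in ips:
        # strict=False lets a host address stand in for its enclosing network.
        net = ipaddress.ip_network(f"{ip}/24", strict=False)
        blocks[net].add(ipaddress.ip_address(ip))
    return {str(net): len(addrs) for net, addrs in blocks.items()}


# Made-up addresses from documentation ranges, including one duplicate.
counts = hosts_per_slash24(["192.0.2.1", "192.0.2.2", "192.0.2.2", "198.51.100.7"])
```

A /24 holds 256 addresses, so a count close to that figure means nearly the whole block is sending, a pattern that looks more like a dedicated allocation than scattered infected hosts.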
Failure is Success

Delivery Notification cluster:

 cluster_id | messages | subject | bodies |  ips   | networks | countries
------------+----------+---------+--------+--------+----------+-----------
         68 |     1065 |       2 |   1065 |    609 |       12 |         4

Subject Detail

 messages | subject
----------+------------------
      613 | Delivery failure
      452 | failure delivery

• Delivery notification from legitimate mail servers
• Not clustered with spam or sources of spam
Chinese Spam

Top 10 Chinese Clusters

 cluster_id | messages | subject | bodies |  ips   | networks | countries
------------+----------+---------+--------+--------+----------+-----------
         59 |    11322 |      19 |     15 |    962 |        4 |         1
       3534 |     9987 |    1803 |      8 |     19 |        3 |         1
         12 |     8054 |       9 |      8 |     26 |        1 |         1
      10477 |     5521 |       5 |      9 |    283 |        4 |         1
         69 |     4476 |      59 |     85 |    514 |       17 |         1
        134 |     3399 |      48 |    169 |    114 |       17 |         1
        121 |     2347 |      10 |     10 |      1 |        1 |         1
        456 |     2187 |      21 |     73 |     41 |        6 |         1
         56 |     2047 |      29 |     45 |     61 |       14 |         1
       4621 |     1944 |       3 |      4 |      5 |        1 |         1

All Chinese messages

 messages | ips  | networks | clusters | country_name
----------+------+----------+----------+---------------
    92235 | 5179 |      197 |      922 | China
      139 |    2 |        1 |        2 | Thailand
       78 |   12 |        3 |        4 | United States
        5 |    4 |        1 |        2 | Germany
Small Clusters
• Varied subjects and bodies.
• Manual clustering of “online pharmacy” spam

Example subjects:

    Buy sugar pills online cheap!!!!11one
    Buy sugar pills online cheap!!!1cos(0)
    Buy sugar pills online cheap!111pi^0

Coalesced clusters:

 messages | ips  | subjects | bodies | clusters
----------+------+----------+--------+----------
    30333 | 9685 |    19453 |  30298 |     3651
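The example subjects differ only in a randomized tail, so a normalization step could in principle group them automatically. The sketch below is purely illustrative and is not the manual procedure used on the slide. The rule of cutting each subject at its first exclamation mark is an assumption chosen to fit these three examples, and it would likely need tuning on real data.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List


def normalize_subject(subject: str) -> str:
    # Assumed heuristic: drop everything from the first "!" onward,
    # then trim whitespace and lowercase.
    return re.sub(r"!.*$", "", subject).strip().lower()


def group_by_normalized_subject(subjects: Iterable[str]) -> Dict[str, List[str]]:
    groups: Dict[str, List[str]] = defaultdict(list)
    for s in subjects:
        groups[normalize_subject(s)].append(s)
    return dict(groups)


examples = [
    "Buy sugar pills online cheap!!!!11one",
    "Buy sugar pills online cheap!!!1cos(0)",
    "Buy sugar pills online cheap!111pi^0",
]
groups = group_by_normalized_subject(examples)
```

All three example subjects collapse to the single key "buy sugar pills online cheap", whereas exact subject matching treats them as three different subjects.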
What’s Next?
• Improve the similarity metrics
• Cluster a population or random sample
• Add time to the analysis