
Density-Based Spam Detector

RESEARCH TRACK. Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data/text mining. INDUSTRIAL TRACK. Density-Based Spam Detector. (Acceptance rate = 40/337 = 12%). Hiromitsu FUJIKAWA Katsuyuki YAMAZAKI KDDI R&D Laboratories Inc.





Presentation Transcript


  1. RESEARCH TRACK Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data/text mining INDUSTRIAL TRACK Density-Based Spam Detector (Acceptance rate = 40/337 = 12%) Hiromitsu FUJIKAWA Katsuyuki YAMAZAKI KDDI R&D Laboratories Inc. 2-1-15 Ohara, Kami-fukuoka, Saitama 356-8502, Japan {fujikawa,yamazaki}@kddilabs.jp Fuminori ADACHI Takashi WASHIO Hiroshi MOTODA ISIR, Osaka University 8-1, Mihogaoka, Ibarakishi, Osaka 567-0047, Japan {adachi,washio,motoda}@ar.sanken.osaka-u.ac.jp

  2. Density-Based Spam Detector • A new spam detection method that uses document space density information • The use of document space density • An efficient implementation through the use of a direct-mapped cache • Purpose • A spam filter deployed on a mail server needs: • High processing speed • Easy maintenance • High accuracy • Privacy protection

  3. System Architecture • Unsupervised learning: solves the privacy problem and the maintenance problem • Hash table: solves the processing speed problem • Similarity, threshold: solve the spam filter accuracy problem • [Figure: system architecture. Labels: an incoming email; find feature by N-gram; hash feature; hash table; read features; write/update features; hash DB; mail corpus; calculate similarity; similar email; similarity > SPAM threshold; SPAM]

  4. Related work • Bayesian-like approach • Rule-based approach • Checksum database • http://www.dcc-servers.net/dcc/ • Vector representation • Hash-based text representation • Text retrieval, text compression, spam filtering • A direct-mapped cache is used as a replacement for LRU

  5. Density-based spam detector • Document space density • Count the number of similar emails • Once the number of similar emails has been counted, a simple threshold is enough to distinguish spam from other emails.
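The counting rule on this slide can be written down compactly. The notation below is an assumption introduced only for illustration (the slides do not define symbols): $D$ is the set of previously seen emails, $\mathrm{sim}$ is some similarity measure, $s$ is the similarity level at which two emails count as similar, and $\theta$ is the spam threshold.

```latex
% Density of an incoming email x: the number of earlier emails similar to it.
\[
  \mathrm{density}(x) \;=\; \bigl|\{\, y \in D \;:\; \mathrm{sim}(x, y) \ge s \,\}\bigr|
\]
% Decision rule: a simple threshold on the density.
\[
  x \text{ is flagged as spam} \iff \mathrm{density}(x) > \theta
\]
```

Under this reading, mass-mailed spam sits in a dense region of document space (many near-duplicates), while ordinary mail has few neighbours.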

  6. Mail System Design • Mass Mail Detector • Monitors network packets • Analyzes SMTP traffic between mail servers and reconstructs the text of emails • Converts the text into a vector representation

  7. Hash design • Hash-based representations • Hash values of every length-L substring are calculated, and the first N of them are used as the vector representation of the email
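The hash-based representation described on this slide can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the slide names neither the hash function nor the values of L and N, so `zlib.crc32` and the default values below are placeholders, and "the first N" is read here as the first N substrings in order of appearance.

```python
import zlib


def hash_features(text: str, L: int = 9, N: int = 100) -> list[int]:
    """Hash every length-L substring and keep the first N hash values.

    Assumptions (not from the slides): CRC-32 as the hash function,
    UTF-8 bytes as the unit of a "substring", and first N by position.
    """
    data = text.encode("utf-8")
    features: list[int] = []
    # Slide a window of L bytes over the email text.
    for start in range(max(len(data) - L + 1, 0)):
        features.append(zlib.crc32(data[start:start + L]))
        if len(features) == N:
            break  # only the first N hash values are kept
    return features


if __name__ == "__main__":
    vector = hash_features("Buy cheap watches now!", L=4, N=5)
    print(len(vector))  # number of hash values kept for this email
```

Texts shorter than L produce an empty vector in this sketch; how the original system handles such emails is not stated on the slide.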

  8. Caching Architecture • Direct-mapped cache architecture • The hash database stores the hash values of each email and its number of similar emails. • The direct-mapped cache is a simple hash table, and the algorithm uses it to find the entries of emails that have the same hash value.
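A direct-mapped cache in the general sense maps every key to exactly one slot, and a new entry simply overwrites whatever occupied that slot. The Python sketch below illustrates only that general idea; the class name, the `Entry` fields, and the slot count are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    hash_value: int  # full hash value, kept to detect slot collisions
    email_id: int    # hypothetical identifier of the email that produced it


class DirectMappedCache:
    """Fixed-size table where each hash value maps to exactly one slot."""

    def __init__(self, num_slots: int) -> None:
        self.slots: list[Optional[Entry]] = [None] * num_slots

    def _index(self, hash_value: int) -> int:
        # The slot is fully determined by the key: no probing, no chaining.
        return hash_value % len(self.slots)

    def lookup(self, hash_value: int) -> Optional[int]:
        entry = self.slots[self._index(hash_value)]
        if entry is not None and entry.hash_value == hash_value:
            return entry.email_id
        return None  # empty slot, or the slot holds a different hash value

    def store(self, hash_value: int, email_id: int) -> None:
        # Overwrite unconditionally; the previous occupant is evicted.
        self.slots[self._index(hash_value)] = Entry(hash_value, email_id)
```

Because eviction needs no bookkeeping (unlike LRU, which must track recency), both lookup and store cost a single index computation and one slot access.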

  9. Similar emails • For each incoming email, find the previous emails that share S% of their hash values with it • Algorithm
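The algorithm figure for this slide is not reproduced in the transcript, so the following Python sketch is only one plausible way to count similar emails. Assumptions: an ordinary dictionary stands in for the hash database, "share S%" is measured against the number of distinct hash values in the incoming email, and the class and method names are hypothetical.

```python
from collections import Counter, defaultdict


class SimilarEmailCounter:
    """Counts earlier emails sharing at least S% of an email's hash values."""

    def __init__(self, share_percent: float = 50.0) -> None:
        self.share_percent = share_percent
        # hash value -> ids of earlier emails that contained it
        self.index: dict[int, set[int]] = defaultdict(set)
        self.next_id = 0

    def add_and_count(self, hashes: list[int]) -> int:
        unique = set(hashes)
        shared: Counter[int] = Counter()
        # Tally, per earlier email, how many hash values it has in common.
        for h in unique:
            for email_id in self.index.get(h, ()):
                shared[email_id] += 1
        needed = len(unique) * self.share_percent / 100.0
        similar = sum(1 for count in shared.values() if count >= needed)
        # Register the new email so later emails can match against it.
        for h in unique:
            self.index[h].add(self.next_id)
        self.next_id += 1
        return similar


if __name__ == "__main__":
    counter = SimilarEmailCounter(share_percent=50.0)
    for features in ([1, 2, 3, 4], [1, 2, 3, 9], [1, 7, 8, 9]):
        print(counter.add_and_count(features))
```

The returned count is the quantity a density threshold would be applied to; unlike the bounded cache described in the slides, this sketch never evicts entries.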

  10. Experimental results • The distribution of similar emails follows Zipf's law • Summary of experimental results

  11. Experimental results • Recall rate • Cache usage • Log

  12. Experimental results • Comparison with other methods • Testing method

  13. Experimental results • Effect of topic change • bsfilter, tested on two groups of mail from a mailing list • S: 528 emails • H: 1538 emails • Training: H + 1/2 S • Testing: 1/2 S • Result: after some period, bsfilter misclassified most of the emails • Reason: the topics changed

  14. Experimental results • Effect of on-line learning

  15. Maintenance and privacy • Supervised learning methods require positive and negative examples of spam • Someone has to check the contents of the mail manually, which can violate privacy. • Although our method requires a white list, maintaining a white list is relatively easy, especially compared with maintaining a black list.

  16. Conclusion • High processing speed • 13000 emails per second (1.25 billion emails per day) • Maintenance free • 98% recall rate and 100% precision • Privacy protection
