
Developing an Effective Spam Filtering System Using Nearest Neighbor and Bayesian Techniques

This presentation by team members Arron La, Joey Lei, and David Cortez addresses how to distinguish spam from non-spam email. It proposes several techniques for building an unbiased spam filter: a hash table used with a nearest neighbor approach, the same approach augmented with extra data (email size, subject line content, punctuation-to-word ratios, and IP addresses), and a Bayesian approach that computes the probability that an email is spam. The implementations are to be evaluated on training and testing data sets so that their effectiveness can be compared.


Presentation Transcript


  1. Spam Filtering • Team: Arron La, Joey Lei, David Cortez

  2. Problem • How to differentiate emails • Decide if an email is spam or non-spam • Gather a diverse knowledge base to develop an unbiased spam filter

  3. Techniques for Implementation • A hash table with “nearest neighbor approach” • Nearest neighbor approach with extra data • Bayesian or Neural Networks

  4. “Nearest Neighbor Approach” • The hash table will contain important and common words that may indicate whether an email is spam • (Diagram labels: Non-Spam, E-Mail, Spam)
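
One possible reading of this slide, sketched in Python: a hash table (a `dict`) of indicator words turns each email into a vector of word counts, and a new email takes the label of the closest labeled email. The word list, the Euclidean distance, and the function names are assumptions made for illustration; the slides do not specify any of them.

```python
import math
import re

# Hypothetical indicator words mapped to vector positions.
# The slides do not list the contents of the actual hash table.
INDICATOR_WORDS = {"free": 0, "winner": 1, "click": 2, "meeting": 3, "report": 4}

def to_vector(text):
    """Count how often each indicator word appears in the email text."""
    vec = [0] * len(INDICATOR_WORDS)
    for word in re.findall(r"[a-z']+", text.lower()):
        idx = INDICATOR_WORDS.get(word)
        if idx is not None:
            vec[idx] += 1
    return vec

def nearest_neighbor(text, labeled_emails):
    """Return the label of the labeled email closest to `text`.

    Euclidean distance is an assumed choice of metric.
    """
    target = to_vector(text)
    best_label, best_dist = None, math.inf
    for sample_text, label in labeled_emails:
        dist = math.dist(target, to_vector(sample_text))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

training = [
    ("Click now, you are a winner of a free prize", "spam"),
    ("Please send the report before the meeting", "non-spam"),
]
print(nearest_neighbor("Free free free, click here", training))
```

With a realistic knowledge base, the table would hold far more words and the labeled set far more emails; the lookup itself would stay the same.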

  5. Nearest Neighbor Approach with Extra Data • The extra data consists of the following: • Size of the email • Content/subject line • Punctuation-to-word ratio • IP addresses
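
As a rough illustration of how such extra data might be extracted, the sketch below computes three of the listed metrics. The function name, the feature names, and the exact definitions (size in UTF-8 bytes, subject length in words) are assumptions; IP address handling is left out because the slides do not say how it would be used.

```python
import re
import string

def extra_features(subject, body):
    """Return numeric features that could be appended to a word-count vector."""
    words = re.findall(r"[A-Za-z0-9']+", body)
    punctuation = sum(1 for ch in body if ch in string.punctuation)
    return {
        # Size of the email body in bytes (assumed definition of "size").
        "size_bytes": len(body.encode("utf-8")),
        # Number of words in the subject line (one simple subject-line metric).
        "subject_words": len(subject.split()),
        # Punctuation characters per word; 0.0 for an empty body.
        "punct_per_word": punctuation / len(words) if words else 0.0,
    }

print(extra_features("Hi", "Buy now!!!"))
```

Because these values sit on very different scales (bytes versus ratios), they would probably need normalizing before being combined in a single distance computation.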

  6. Bayesian Network Approach • Create two hash tables that tally the number of occurrences of each word in spam and non-spam emails • Create a third hash table that stores the spam probability of each word, computed as follows:

     probability(word) {
       let g = 2 * hashNonSpam(word)   // doubled count of the word in non-spam emails
       let b = hashSpam(word)          // count of the word in spam emails
       if (g + b) > 5 then
         max(0.01, min(0.99, min(1, b / numOfSpam) / (min(1, g / numOfNonSpam) + min(1, b / numOfSpam))))
     }

     where numOfSpam = number of spam emails and numOfNonSpam = number of non-spam emails
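
A minimal Python sketch of the three hash tables follows. It uses the per-word formula as given in Paul Graham's “A Plan for Spam” (listed in the references), which clamps the result to the range 0.01 to 0.99, together with this slide's g + b > 5 threshold. The function names and the tokenizing regular expression are assumptions, and both email counts are assumed to be greater than zero.

```python
import re
from collections import Counter

def tally(emails):
    """Hash table of word -> number of occurrences across the given emails."""
    counts = Counter()
    for text in emails:
        counts.update(re.findall(r"[a-z$']+", text.lower()))
    return counts

def word_probability(word, spam_counts, nonspam_counts, num_spam, num_nonspam):
    """Spam probability of one word, or None if the word is too rare to score."""
    g = 2 * nonspam_counts.get(word, 0)  # non-spam count, doubled to bias against false positives
    b = spam_counts.get(word, 0)         # spam count
    if g + b <= 5:
        return None
    spam_freq = min(1.0, b / num_spam)
    nonspam_freq = min(1.0, g / num_nonspam)
    return max(0.01, min(0.99, spam_freq / (nonspam_freq + spam_freq)))

spam_emails = ["free money"] * 6
nonspam_emails = ["money report"] * 2
spam_counts, nonspam_counts = tally(spam_emails), tally(nonspam_emails)

# Third hash table: word -> probability, skipping words that are too rare.
probabilities = {}
for w in set(spam_counts) | set(nonspam_counts):
    p = word_probability(w, spam_counts, nonspam_counts,
                         len(spam_emails), len(nonspam_emails))
    if p is not None:
        probabilities[w] = p
print(probabilities)
```

In this toy data set, "free" appears only in spam and lands at the upper clamp, "money" appears in both classes and comes out neutral, and "report" is too rare to receive a probability.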

  7. Bayesian Network Approach (continued) • To check an email: take the 20 words whose probabilities are farthest from 0.5 (0.5 being neutral) • Combine the probabilities a, b, ..., v of those words using Bayes' rule:

     $$\text{prob}(\text{email}) = \frac{ab \cdots v}{ab \cdots v + (1 - a)(1 - b) \cdots (1 - v)}$$

     • If prob(email) > 0.9, classify the email as spam
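
The combining step can be sketched in Python as below. The function names are illustrative, and the input is assumed to be a list of per-word probabilities already clamped strictly between 0 and 1 (otherwise the denominator could be zero).

```python
import math

def spam_score(word_probs, n=20):
    """Combine the n word probabilities farthest from 0.5 using Bayes' rule."""
    chosen = sorted(word_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    prod_spam = math.prod(chosen)                      # a * b * ... * v
    prod_nonspam = math.prod(1.0 - p for p in chosen)  # (1-a)(1-b)...(1-v)
    return prod_spam / (prod_spam + prod_nonspam)

def is_spam(word_probs, threshold=0.9):
    """Apply the slide's decision rule: spam if the combined score exceeds 0.9."""
    return spam_score(word_probs) > threshold

print(is_spam([0.99, 0.99, 0.5]))  # two strong spam words and one neutral word
print(is_spam([0.2, 0.8]))         # evidence that cancels out
```

An email with no scored words yields a score of 0.5 in this sketch, since both empty products equal 1.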

  8. Methods of Evaluation • Create training and testing data sets to determine effectiveness • Use the results to compare the implementations with one another • The implementations can also be compared with other well-known techniques
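
A small evaluation harness along these lines might look as follows in Python. The slides do not name specific metrics, so accuracy and false-positive rate are assumed choices, as are the 75/25 split, the function names, and the `classify(text)` interface returning "spam" or "non-spam".

```python
import random

def split_dataset(labeled_emails, test_fraction=0.25, seed=0):
    """Shuffle (text, label) pairs and split them into training and testing sets."""
    items = list(labeled_emails)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * (1 - test_fraction))
    return items[:cut], items[cut:]

def evaluate(classify, test_set):
    """Score a classifier on labeled test emails."""
    correct = false_pos = nonspam_total = 0
    for text, label in test_set:
        predicted = classify(text)
        correct += predicted == label
        if label == "non-spam":
            nonspam_total += 1
            false_pos += predicted == "spam"  # legitimate mail flagged as spam
    return {
        "accuracy": correct / len(test_set) if test_set else 0.0,
        "false_positive_rate": false_pos / nonspam_total if nonspam_total else 0.0,
    }

# Toy stand-in for one of the implementations being compared.
def keyword_filter(text):
    return "spam" if "free" in text.lower() else "non-spam"

test_set = [
    ("free stuff", "spam"),
    ("free lunch at work", "non-spam"),
    ("hello", "non-spam"),
    ("meeting", "non-spam"),
]
print(evaluate(keyword_filter, test_set))
```

Running every implementation through the same `evaluate` call on the same testing set gives directly comparable numbers.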

  9. Other Techniques of Implementation • Blacklist Domains/Emails • “White list” Domains • Authenticity Checking • Header/Context Analysis • Checksum Technology • User Input Learning (Spam/Non-Spam Button) • Classifying Non-Spam

  10. References • Paul Graham, “A Plan for Spam,” August 2002, www.paulgraham.com/spam • Paul Graham, “Better Bayesian Filtering,” 2003 Spam Conference, www.paulgraham.com/better
