This document outlines the collaborative work of team members Arron La, Joey Lei, and David Cortez on differentiating spam from non-spam emails. We propose several techniques for an unbiased spam filter, including a hash table used with a nearest neighbor approach, optionally augmented by extra data such as email size, subject line content, punctuation-to-word ratios, and IP addresses. A Bayesian approach is also explored for calculating the probability that an email is spam. The filters are evaluated on training and testing data sets so that their effectiveness can be compared.
Spam Filtering Team: Arron La, Joey Lei, David Cortez
Problem • How to differentiate emails • Decide if an email is spam or non-spam • Gather a diverse knowledge base to develop an unbiased spam filter
Techniques for Implementation • A hash table with “nearest neighbor approach” • Nearest neighbor approach with extra data • Bayesian or Neural Networks
“Nearest Neighbor Approach” • The hash table will contain important and common words that may indicate whether an email is spam • [Figure labels: Non-Spam, E-Mail, Spam]
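One possible reading of this slide, sketched in Python: the hash table maps indicator words to positions in a count vector, and a new email takes the label of the closest labeled email. The word list, the Euclidean distance, and the names `KEYWORDS`, `to_vector`, `nearest_label`, and `TRAINING` are assumptions made for illustration; the slides do not specify them.

```python
import math
import re

# Hypothetical keyword table (word -> vector index); the words are illustrative only.
KEYWORDS = {"free": 0, "winner": 1, "money": 2, "meeting": 3, "report": 4}


def to_vector(text):
    """Count how often each table word occurs in the email text."""
    vec = [0] * len(KEYWORDS)
    for token in re.findall(r"[a-z']+", text.lower()):
        idx = KEYWORDS.get(token)
        if idx is not None:
            vec[idx] += 1
    return vec


def nearest_label(text, labeled_emails):
    """Return the label of the labeled email closest in Euclidean distance."""
    query = to_vector(text)
    best_label, best_dist = None, math.inf
    for body, label in labeled_emails:
        dist = math.dist(query, to_vector(body))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Tiny made-up training set, one example per class.
TRAINING = [
    ("Free money! You are a winner, claim your free money", "spam"),
    ("Agenda for the meeting and the quarterly report", "non-spam"),
]

print(nearest_label("winner winner free money", TRAINING))
print(nearest_label("please send the report before the meeting", TRAINING))
```

A single nearest neighbor keeps the sketch short; voting over the k closest emails would be a natural variation.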
Nearest Neighbor Approach with Extra Data • The extra data consists of the following: • Size of the email • Content/Subject Line • Punctuation-to-word ratios • IP addresses
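As a rough illustration, the extra data could be collected into a small feature record per email. The function name `extract_features`, the choice of an all-caps check as the subject line signal, and the whitespace-based word split are all assumptions; the slides only name the categories of data.

```python
import string


def extract_features(subject, body, sender_ip):
    """Collect the extra data named on the slide (hypothetical feature choices)."""
    words = body.split()
    punctuation = sum(1 for ch in body if ch in string.punctuation)
    return {
        # Size of the email body in bytes.
        "size": len(body.encode("utf-8")),
        # One simple subject line signal: is it written entirely in capitals?
        "subject_all_caps": subject.isupper(),
        # Punctuation marks per whitespace-separated word.
        "punct_per_word": punctuation / len(words) if words else 0.0,
        # Kept as-is so it can be compared against known senders.
        "sender_ip": sender_ip,
    }


features = extract_features("WIN NOW", "Buy now!!! Act fast!!!", "192.0.2.1")
print(features)
```

The numeric entries could be appended to the keyword count vector before the nearest neighbor distance is computed, ideally after scaling so that the email size does not dominate the other values.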
Bayesian Network Approach • Create two hash tables that tally the number of occurrences of each word in spam and non-spam emails • Create a third hash table that stores the spam probability of each word, calculated as follows:
probability(word) {
    let g = 2 * hashNonSpam(word)    // occurrences in non-spam emails, doubled
    let b = hashSpam(word)           // occurrences in spam emails
    if (g + b) > 5 then
        max(0.1, min(0.99, min(1, b / numOfSpam) / (min(1, g / numOfNonSpam) + min(1, b / numOfSpam))))
}
numOfSpam = # of spam emails
numOfNonSpam = # of non-spam emails
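The per-word calculation can be sketched in Python as below. This is a minimal sketch, not the team's implementation: the function names, the use of `collections.Counter` as the hash tables, and the whitespace tokenizer are assumptions. The numerator uses b / numOfSpam so that it matches the corresponding term in the denominator. The clamp bounds 0.1 and 0.99 follow the slide (Paul Graham's "A Plan for Spam" uses 0.01 as the lower bound), so they are exposed as parameters.

```python
from collections import Counter


def count_words(emails):
    """Tally word occurrences across a list of email bodies (one hash table)."""
    counts = Counter()
    for body in emails:
        counts.update(body.lower().split())
    return counts


def word_probability(word, spam_counts, nonspam_counts, num_spam, num_nonspam,
                     min_occurrences=5, lower=0.1, upper=0.99):
    """Spam probability of one word, or None if the word is too rare to judge.

    Assumes num_spam and num_nonspam are both greater than zero.
    """
    g = 2 * nonspam_counts[word]  # non-spam occurrences, doubled
    b = spam_counts[word]         # spam occurrences
    if g + b <= min_occurrences:
        return None
    spam_freq = min(1.0, b / num_spam)
    nonspam_freq = min(1.0, g / num_nonspam)
    return max(lower, min(upper, spam_freq / (nonspam_freq + spam_freq)))


# Made-up counts: 10 spam and 10 non-spam emails.
spam_counts = Counter({"free": 8, "hello": 2})
nonspam_counts = Counter({"hello": 4})
print(word_probability("free", spam_counts, nonspam_counts, 10, 10))
print(word_probability("hello", spam_counts, nonspam_counts, 10, 10))
print(word_probability("rare", spam_counts, nonspam_counts, 10, 10))
```

Doubling the non-spam count biases the estimate toward non-spam, which makes false positives less likely; the third hash table would simply store the result of `word_probability` for every word seen in training.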
Bayesian Network Approach (Continued) • To check an email: take the 20 words whose probabilities are farthest from 0.5 (0.5 being neutral) • Combine the selected words' probabilities a, b, ..., v using Bayes' rule:
\[ \text{prob}(\text{email}) = \frac{ab \cdots v}{ab \cdots v + (1 - a)(1 - b) \cdots (1 - v)} \]
• If prob(email) > 0.9, classify the email as SPAM
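The combination step could look like the following sketch. The function names and the handling of emails with fewer than 20 known words are assumptions; the 20-word limit and the 0.9 threshold come from the slide. It also assumes every input probability lies strictly between 0 and 1, which holds when the per-word values are clamped beforehand.

```python
import math


def spam_probability(word_probs, top_n=20):
    """Combine the top_n probabilities farthest from 0.5 using Bayes' rule."""
    chosen = sorted(word_probs, key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    spam_product = math.prod(chosen)                     # a * b * ... * v
    nonspam_product = math.prod(1.0 - p for p in chosen)  # (1-a)(1-b)...(1-v)
    return spam_product / (spam_product + nonspam_product)


def is_spam(word_probs, threshold=0.9):
    """Flag the email as spam when the combined probability exceeds the threshold."""
    return spam_probability(word_probs) > threshold


print(is_spam([0.99, 0.99, 0.2]))
print(is_spam([0.1, 0.1, 0.6]))
```

With no usable words the function returns 0.5, so such an email is treated as neutral and is not flagged.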
Methods of Evaluation • Create training and testing data sets to determine effectiveness • Use the results to compare the implementations to one another • Compare the implementations to other well-known techniques
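A minimal evaluation harness, sketched under assumptions: the 80/20 split, the fixed shuffle seed, and the three reported metrics are illustrative choices, and `split_dataset` and `evaluate` are hypothetical names. Any filter that can be called as `classify(text) -> bool` can be passed in, which allows the implementations to be compared on the same test set.

```python
import random


def split_dataset(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and split them into (training, testing) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


def evaluate(classify, test_set):
    """Score classify(text) -> bool against (text, is_spam) pairs."""
    tp = fp = tn = fn = 0
    for text, actual in test_set:
        predicted = classify(text)
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1  # non-spam wrongly flagged as spam
        elif actual:
            fn += 1  # spam that slipped through
        else:
            tn += 1
    total = len(test_set)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Made-up test data and a deliberately naive filter.
data = [("free money", True), ("win free", True),
        ("meeting notes", False), ("free lunch meeting", False)]
print(evaluate(lambda text: "free" in text, data))
```

The false positive rate is reported separately because flagging a legitimate email is usually more costly than letting a spam message through.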
Other Techniques of Implementation • Blacklist Domains/Emails • “White list” Domains • Authenticity Checking • Header/Context Analysis • Checksum Technology • User Input Learning (Spam/Non-Spam Button) • Classifying Non-Spam
References • “A Plan for Spam,” Paul Graham, August 2002, www.paulgraham.com/spam • “Better Bayesian Filtering,” Paul Graham, 2003 Spam Conference, www.paulgraham.com/better