Bayesian Spam Filter By Joshua Spaulding
Statement of Problem: “Spam email now accounts for more than half of all messages sent and imposes huge productivity costs… By 2007, Spam-stopping should grow to a $2.4 Billion Business.” (Technology Review, 8/03)
Objective: Using Bayes’ rule, I will attempt to classify an email message as spam or non-spam (ham). I will use a corpus of spam and ham to determine the probability that a new email is spam given the tokens in the message.
Definition of Spam: Unsolicited automated email.
Bayes’ Rule:
P(A|B) = P(B|A) P(A) / P(B)
• P(A|B) is the conditional probability that event A occurs given that event B has occurred.
• P(B|A) is the conditional probability that event B occurs given that event A has occurred.
• P(A) is the probability of event A occurring.
• P(B) is the probability of event B occurring.
Bayes’ Rule (applied to spam):
P(spam|token) = P(token|spam) P(spam) / P(token)
• P(spam|token): probability that an email is spam given that it contains the token.
• P(token|spam): probability that the token appears in an email given that the email is spam.
• P(spam): probability that an email is spam.
• P(token): probability that the token appears in an email.
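To make the formula concrete, here is a minimal Java sketch. It assumes P(token) is expanded with the law of total probability over the two classes (spam and ham), which the slides do not state explicitly. The class name `TokenBayes` and the method `spamGivenToken` are hypothetical, and returning 0.5 for a token with no evidence is an assumption that mirrors the default used later in the project design.

```java
// Hypothetical helper for illustration; not taken from the original project.
final class TokenBayes {

    /**
     * Computes P(spam|token) = P(token|spam) P(spam) / P(token), where
     * P(token) = P(token|spam) P(spam) + P(token|ham) P(ham)  (assumption).
     */
    static double spamGivenToken(double pTokenGivenSpam, double pSpam, double pTokenGivenHam) {
        double pHam = 1.0 - pSpam;
        double pToken = pTokenGivenSpam * pSpam + pTokenGivenHam * pHam;
        if (pToken == 0.0) {
            return 0.5; // assumption: no evidence either way for this token
        }
        return pTokenGivenSpam * pSpam / pToken;
    }

    public static void main(String[] args) {
        // Token seen in 40% of spam and 10% of ham, with half of all mail being spam.
        System.out.println(spamGivenToken(0.4, 0.5, 0.1));
    }
}
```

With these made-up inputs the result is approximately 0.8: a token that is four times as common in spam as in ham pushes the estimate well above the 0.5 prior.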
Project Design (original):
• Read in a large text file containing 1000 spam messages.
• Read in a large text file containing 1000 ham messages.
• Create a file for each corpus consisting of each token and its number of occurrences in the corpus.
• Create another file containing each token and the probability that an email containing it is spam, computed using Bayes’ rule.
• When an email arrives, parse it and look up, for each token, the probability that the email is spam given that token. Then combine all the probabilities to determine the probability that the email is spam.
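The steps above can be sketched in Java roughly as follows. This is an illustration under stated assumptions, not the project's actual code: tokens are counted once per message, messages are split on whitespace with `StringTokenizer`, and the per-token probabilities are combined with the naive Bayes product formula popularized by Paul Graham's "A Plan for Spam". The slides do not say which combination rule was intended. The class name `CorpusStats` and all method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical sketch of the original design; names are illustrative only.
final class CorpusStats {

    /** For each token, counts how many messages in the corpus contain it at least once. */
    static Map<String, Integer> messageCounts(List<String> messages) {
        Map<String, Integer> counts = new HashMap<>();
        for (String message : messages) {
            Set<String> seen = new HashSet<>();
            StringTokenizer st = new StringTokenizer(message.toLowerCase());
            while (st.hasMoreTokens()) {
                String token = st.nextToken();
                if (seen.add(token)) {          // count each token once per message
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** P(spam|token) estimated from the two corpora via Bayes' rule. */
    static double spamProbability(String token,
                                  Map<String, Integer> spamCounts, int nSpam,
                                  Map<String, Integer> hamCounts, int nHam) {
        double pTokenGivenSpam = spamCounts.getOrDefault(token, 0) / (double) nSpam;
        double pTokenGivenHam = hamCounts.getOrDefault(token, 0) / (double) nHam;
        double pSpam = nSpam / (double) (nSpam + nHam);
        double pToken = pTokenGivenSpam * pSpam + pTokenGivenHam * (1.0 - pSpam);
        if (pToken == 0.0) {
            return 0.5; // assumption: token absent from both corpora is uninformative
        }
        return pTokenGivenSpam * pSpam / pToken;
    }

    /** Combines per-token probabilities: prod(p) / (prod(p) + prod(1 - p)). */
    static double combine(List<Double> probabilities) {
        double spam = 1.0;
        double ham = 1.0;
        for (double p : probabilities) {
            spam *= p;
            ham *= 1.0 - p;
        }
        double total = spam + ham;
        return total == 0.0 ? 0.5 : spam / total;
    }
}
```

For long messages the raw products can underflow to zero; summing logarithms instead is a common remedy, omitted here to keep the sketch short.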
Project Design:
• Create a Narl model from 100 spam and 100 ham messages contained in two separate CSV files, using Narl’s built-in Excel Model function. (emailCorpus.narl)
• Parse the body slot from emailCorpus.narl, create word nodes, and calculate the probability for each. (kb.narl)
• Examine the incoming text body, tokenize it, and create nodeNames. If a nodeName is already in the kb, look up its probability; otherwise assign a probability value of “0.5”.
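The lookup step for an incoming message might look like the sketch below. Narl's API is not shown in the slides, so a plain `Map<String, Double>` stands in for the knowledge base (kb.narl); that substitution, the delimiter set, the lowercasing, and the names `IncomingMessage` and `tokenProbabilities` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical sketch; a Map stands in for the Narl knowledge base.
final class IncomingMessage {

    /** Probability assigned to tokens that are not in the knowledge base. */
    static final double UNKNOWN_TOKEN_PROBABILITY = 0.5;

    /** Tokenizes the body and returns one spam probability per token. */
    static List<Double> tokenProbabilities(String body, Map<String, Double> kb) {
        List<Double> probabilities = new ArrayList<>();
        // Split on whitespace and common punctuation (assumed delimiter set).
        StringTokenizer st = new StringTokenizer(body.toLowerCase(), " \t\n\r\f.,;:!?\"()");
        while (st.hasMoreTokens()) {
            String nodeName = st.nextToken();
            probabilities.add(kb.getOrDefault(nodeName, UNKNOWN_TOKEN_PROBABILITY));
        }
        return probabilities;
    }
}
```

Using 0.5 for unknown tokens keeps them neutral: they neither raise nor lower the spam estimate when the per-token values are later combined.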
Issues:
• Text is unknown and often incomplete.
• Java data structures
• Vector, StringTokenizer, floating-point operations
• Unfamiliar with Narl
Enhancements:
• Read slots other than body.
• Read data in from another format. Gain more knowledge about the email.
• Better error handling.
• Read emails as they enter the mail server.
• Regular expression matching of StringTokenizer.
• Performance tuning with more data.
• Take advantage of Narl functionality??
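As one possible reading of the regular-expression enhancement, tokenization could be done with `java.util.regex` rather than `StringTokenizer`. The sketch below is only an illustration: the pattern (runs of letters, digits, and apostrophes), the lowercasing, and the names `RegexTokenizer` and `tokenize` are assumptions, not part of the original project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of regex-based tokenization.
final class RegexTokenizer {

    // Assumed token definition: one or more letters, digits, or apostrophes.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z0-9']+");

    /** Returns the lowercased tokens found in the body, in order of appearance. */
    static List<String> tokenize(String body) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(body);
        while (matcher.find()) {
            tokens.add(matcher.group().toLowerCase());
        }
        return tokens;
    }
}
```

Defining what a token is, instead of listing every delimiter, makes it easier to handle messages padded with unusual punctuation.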