Bayesian Filtering

Bayesian Filtering Team Glyph Debbie Bridygham Pravesvuth Uparanukraw Ronald Ko Rihui Luo Thuong Luu

Background • Strong need exists to identify “bad” items in a population and remove them -- Examples: SPAM, Unsolicited IMs, Etc. • Filtering often results in “Arm’s Race” requiring rapid response • “Arm’s Race” favors inherently adaptive methods over others

Benefits of Filters • Less unwanted traffic, thus less wasted space on clients & servers • Greater use of internet services due to reduced customer frustration • Provide some protection against dangerous traffic: scams, phishing attacks, viruses, etc.

Downsides of Filtering • Exclusion of even one legitimate item (i.e., False Positives) less desirable than letting 10 or more illegitimate items pass. • Reducing the percentage of undesirable traffic often causes legitimate traffic to be excluded as well.

Cost of Filtering • Manual filtering has become prohibitive • Maintenance of static filters costs time & money • Time spent maintaining keywords or updating software delays response • “Arm’s Race” often results in ever escalating costs

Methodologies • Manual filtering prohibitive in terms of time • Static filtering based on heuristics and keywords does not adapt except via manual updates • Bayesian filtering is dynamic, adapting with each new item scanned and/or marked

What is Bayesian Filtering? • Uses Naïve Bayes Classifier, which uses Bayes Theorem • Classifier allows items to be adaptively categorized using probabilities & has low rate of False Positives • Most well-known use in SPAM filtering; often credited to initial work by Paul Graham (“A Plan for Spam”) in 2002

Naïve Bayes Classifier • Uses Bayes Theorem with assumptions that probabilities are independent (rarely true), thus “naïve” • Classifier can start with initial assumptions, i.e., probabilities that words occur in legitimate or illegitimate messages • Is trained over time and adapts. If final probability reaches some threshold, an item is rejected. Superior to keyword filtering.

Bayes Theorem • First presented in 1763 based on work by mathematician Thomas Bayes • Pr(A|B) = Pr(B|A)· Pr(A) / Pr(B) • Specifies relationships between conditional probabilities • Currently has practical use in many fields

Bayesian Filtering Usage • Uses user input to develop individual statistics • Probability matrix changes over time based on scanned messages and user decisions • Matrix is used to calculate probability a message is unwanted • Matrix adapts quickly to new input, resulting in surprisingly good results

Example Matrix

Example • Suppose the word “guarantee” occurs in 500 of 2000 Spam emails, but only in 5 of 1000 Non-Spam emails • The probability of Spam for this word is then (500 / 2000) / ((500 / 2000) + (5 / 1000)) = 0.98 • This probability is combined with that of others obtained from message to compute a probability for the entire message being Spam.

Bayesian Poisoning • Attempts to fool BF systems by adding irrelevant words (often hidden) • Type I attacks attempt to get messages through filter -- could be active or passive, with active producing feedback to sender via a “Web Bug” or other means • Type II attacks attempt to cause “False Positives”, i.e., force desirable messages to be rejected

Poisoning Effectiveness • Passive attacks are rarely effective as filters are individual and sender gets no feedback • Active attacks can be initially highly effective, if systems access “Web Bugs” • All attacks lose effectiveness as the filter adjusts to incoming traffic

Products that use Bayesian Filtering

Summary • BF adapts to individual needs • BF is highly effective • BF adapts more quickly than other solutions • BF is resistant to “poisoning”

References • [1] Sahami, M., et. al. “A Bayesian Approach to Filtering Junk E-Mail”, 1998 • [2] Graham, Paul. “A Plan for SPAM”, 2002 • [3] Graham-Cumming, John. “Does Bayesian poisoning exist?”, 2006

References, cont. • [4] Naive Bayes Classifier, Wikipedia, 2007 • [5] Bayes Theorem, Wikipedia, 2007

Bayesian Filtering