A Comparison of Feature-Based and Feature-Free Case-Based Reasoning for Spam Filtering

Derek Bridge, University College Cork. Work done with Sarah Jane Delany, Dublin Institute of Technology.

Presentation Transcript


  1. A Comparison of Feature-Based and Feature-Free Case-Based Reasoning for Spam Filtering • Derek Bridge, University College Cork • Work done with Sarah Jane Delany, Dublin Institute of Technology

  2. Overview • Introduction • Case-Based Spam Filtering • Feature-Based • Feature-Free • Experiments I • Case Base Maintenance • Competence-Based Editing • Experiments II • Concept Drift • Incremental & periodic solutions • Experiments III • Conclusions

  3. Introduction • From the Spamhaus project (www.spamhaus.org) • “An electronic message is ‘spam’ IF: • the recipient's personal identity and context are irrelevant because the message is equally applicable to many other potential recipients; AND • the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for it to be sent.” • “[It’s] about consent, not content” • We focus on email spam

  4. Spam Filtering • Spam filtering is classification: • is an incoming email ham or spam? • Spam filters • procedural • whitelists, blacklists, challenge-response systems,… • collaborative • sharing signatures • content-based • rules, decision trees, probabilities, case bases,… • hybrid.

  5. Challenges of Spam Filtering • Spam is subjective and personal; • It is heterogeneous; • There is a high cost to false positives (where ham is classified as spam); and • It is constantly changing (‘concept drift’).

  6. Overview • Introduction • Case-Based Spam Filtering • Feature-Based • Feature-Free • Experiments I • Case Base Maintenance • Competence-Based Editing • Experiments II • Concept Drift • Incremental & periodic solutions • Experiments III • Conclusions

  7. Case-Based Reasoning • The CBR cycle: New problem → RETRIEVE → Retrieved Case → REUSE → Adapted Case → REVISE → Tested/Repaired Case → RETAIN → Learned Case • At the centre of the cycle: General knowledge and Previous Cases [Aamodt & Plaza 1994]

  8. Case-Based Reasoning • The cycle diagram, now with a MAINTAIN step • At the centre: General knowledge and Previous Cases

  9. Is Case-Based Reasoning (CBR) the answer? • Spam is subjective and personal → users can have individual case bases created from their own emails • It is heterogeneous → it is known that CBR handles disjunctive concepts well • There is a high cost to false positives (where ham is classified as spam) → we can bias CBR away from false positives • It is constantly changing (‘concept drift’) → case bases can be updated incrementally

  10. Overview • Introduction • Case-Based Spam Filtering • Feature-Based • Feature-Free • Experiments I • Case Base Maintenance • Competence-Based Editing • Experiments II • Concept Drift • Incremental & periodic solutions • Experiments III • Conclusions

  11. Email Classification Using Examples (ECUE) • ECUE uses Case-Based Reasoning (CBR) to classify emails • A case base contains a user’s email (both ham and spam) • ECUE classifies an incoming email using the k-nearest neighbour algorithm: • It retrieves from the case base the k nearest neighbours (the k that are closest or most similar) • The cases it retrieves then vote to decide the class of the new email • To bias away from false positives, ECUE uses unanimous voting.
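The classification step described on this slide can be sketched as follows. This is a minimal illustration rather than ECUE's actual code: the function name `classify_unanimous`, the `(item, label)` case representation and the pluggable `distance` function are all assumptions made for the example. An email is labelled spam only when every one of its k nearest neighbours is spam; anything else defaults to ham.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def classify_unanimous(
    query: T,
    case_base: List[Tuple[T, str]],
    distance: Callable[[T, T], float],
    k: int = 3,
) -> str:
    """Return 'spam' only if all k nearest neighbours are spam, else 'ham'."""
    # Sort cases by distance to the query and keep the k closest.
    neighbours = sorted(case_base, key=lambda case: distance(query, case[0]))[:k]
    # Unanimous voting: a single ham neighbour is enough to veto 'spam'.
    if neighbours and all(label == "spam" for _, label in neighbours):
        return "spam"
    return "ham"


# Toy usage: numbers stand in for emails (hypothetical data).
toy_cases = [(1.0, "spam"), (1.2, "spam"), (1.4, "spam"), (5.0, "ham"), (5.5, "ham")]


def toy_distance(a: float, b: float) -> float:
    return abs(a - b)


print(classify_unanimous(1.1, toy_cases, toy_distance))  # three spam neighbours
print(classify_unanimous(3.3, toy_cases, toy_distance))  # mixed neighbours
```

Defaulting to ham on any disagreement is what biases the filter away from false positives, at the price of letting some spam through.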

  12. Feature-Based ECUE • Pipeline: Emails → Feature Extraction → Casebase • Features extracted (fij) • words, characters, structural features • Binary representation: fi1 = 1 or fi1 = 0

  13. Feature-Based ECUE • Pipeline: Emails → Feature Extraction → Casebase → Feature Selection → Casebase • Information Gain used to select the 700 most predictive features
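As a rough illustration of Information Gain feature selection over binary features, the sketch below scores each feature by how much it reduces the entropy of the class labels and keeps the top n. The function names, the list-of-lists data layout and the toy data are assumptions for the example; the slides do not show how ECUE computes this.

```python
import math
from typing import List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = sum(1 for lab in labels if lab == cls) / total
        result -= p * math.log2(p)
    return result


def information_gain(feature_values: Sequence[int], labels: Sequence[str]) -> float:
    """Entropy reduction obtained by splitting the labels on one binary feature."""
    total = len(labels)
    gain = entropy(labels)
    for value in (0, 1):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain


def select_top_features(rows: List[List[int]], labels: List[str], n: int) -> List[int]:
    """Indices of the n features with the highest information gain."""
    n_features = len(rows[0])
    gains = [
        information_gain([row[j] for row in rows], labels) for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: gains[j], reverse=True)[:n]


# Toy data: feature 0 separates the classes perfectly, feature 1 is uninformative.
toy_rows = [[1, 0], [1, 1], [0, 0], [0, 1]]
toy_labels = ["spam", "spam", "ham", "ham"]
print(select_top_features(toy_rows, toy_labels, n=1))
```

With real data, n would be 700 to match the setting on the slide.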

  14. Feature-Based ECUE • Pipeline: Emails → Feature Extraction → Casebase → Feature Selection → Casebase → Case Selection → Casebase • Competence-Based Editing used to edit case base

  15. Feature-Based ECUE • Pipeline: Emails → Feature Extraction → Casebase → Feature Selection → Casebase → Case Selection → Casebase • Runtime System: New Case → Classification → spam!

  16. Feature-Based ECUE • The distance between cases is a count of the number of features that they do not share • Naïve Bayes classifier thought to be among the best for spam filtering • Feature-Based ECUE has comparable, and sometimes slightly better, accuracy than Naïve Bayes
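For binary feature vectors, counting the features two cases do not share amounts to a Hamming distance. A minimal sketch, assuming both cases are encoded over the same selected feature set (the function name is hypothetical):

```python
from typing import Sequence


def feature_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the binary features on which two cases differ."""
    if len(a) != len(b):
        raise ValueError("cases must use the same feature set")
    return sum(1 for x, y in zip(a, b) if x != y)


# Two toy cases over four features; they differ on the 2nd and 3rd.
print(feature_distance([1, 0, 1, 1], [1, 1, 0, 1]))
```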

  17. Overview • Introduction • Case-Based Spam Filtering • Feature-Based • Feature-Free • Experiments I • Case Base Maintenance • Competence-Based Editing • Experiments II • Concept Drift • Incremental & periodic solutions • Experiments III • Conclusions

  18. Feature-Free ECUE • Alternative to Feature-Based ECUE • Inspired by theory of Kolmogorov Complexity • K(x) = size of smallest Turing machine that can output x to its tape • K(x|y) = size of smallest Turing machine that can output x when given y • Basis for distance measure if K(x|y) < K(x|z) then y is more similar to x than z [Li et al. 2003]

  19. Feature-Free ECUE • Approximate K(x) by C(x) • C(x) = size of x after compression • Text compression exploits intra-document redundancy • Example: “Case based reasoning” → “Case b•d reasoning” (the repeated “ase” is replaced by a back-reference)

  20. Using Compression • Consider length of two documents allowing for inter-document redundancy • C(xy) = len(gzip(docX + docY))

  21. Using Compression • Consider length of two documents not allowing for inter-document redundancy • C(x) + C(y) = len(gzip(docX)) + len(gzip(docY))

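  22. Compression-Based Dissimilarity (CDM) • CDM(x,y) = C(xy) / (C(x) + C(y)) • Max value ≤ 1 (furthest) • Min value > 0.5 (nearest) • However, CDM is not a metric: CDM(x,x) ≠ 0; CDM(x,y) ≠ CDM(y,x) in general; and the triangle inequality CDM(x,y) + CDM(y,z) ≥ CDM(x,z) is not guaranteed [Keogh et al. 2004]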
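Taking the CDM definition from Keogh et al., CDM(x,y) = C(xy) / (C(x) + C(y)), and GZip as the compressor, the measure can be sketched in a few lines with Python's standard `gzip` module. The function names and sample texts are assumptions for illustration; this is not the ECUE implementation.

```python
import gzip


def compressed_size(data: bytes) -> int:
    """C(x): length in bytes of x after gzip compression."""
    return len(gzip.compress(data))


def cdm(x: bytes, y: bytes) -> float:
    """Compression-based dissimilarity: C(xy) / (C(x) + C(y))."""
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


# Hypothetical sample texts: one spam-like, one ham-like.
spam_like = b"cheap pills buy now limited offer " * 20
ham_like = b"meeting agenda for the project review on tuesday " * 20

print(cdm(spam_like, spam_like))  # near the low end: the copy adds little
print(cdm(spam_like, ham_like))   # larger: little shared redundancy
```

Because gzip adds a fixed header to every output, values on very short inputs are skewed; the measure is more meaningful on full email texts.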

  23. Feature-Based ECUE • Pipeline: Emails → Feature Extraction → Casebase → Feature Selection → Casebase → Case Base Edit → Casebase • Runtime System: New Email → Classification → spam!

  24. Feature-Free ECUE • Pipeline: Emails → Casebase → Case Base Edit → Casebase • Runtime System: New Email → Classification → spam!

  25. Experiments I • Created 4 datasets of 1000 emails from two years of email from two people • each dataset has 500 consecutive ham, 500 consecutive spam • 10-fold cross-validation • Settings: • k = 3 • Feature-based: 700 features • Feature-free: GZip as text compressor • Measures: • FPRate = #false positives/#ham • FNRate = #false negatives/#spam • Err = (FPRate + FNRate) / 2
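The three measures on this slide translate directly into code. The sketch below assumes labels are the strings "ham" and "spam" and that spam is the positive class; the function name is hypothetical.

```python
from typing import Sequence, Tuple


def error_rates(actual: Sequence[str], predicted: Sequence[str]) -> Tuple[float, float, float]:
    """Return (FPRate, FNRate, Err) as defined on the slide."""
    ham_total = sum(1 for a in actual if a == "ham")
    spam_total = sum(1 for a in actual if a == "spam")
    if ham_total == 0 or spam_total == 0:
        raise ValueError("need at least one ham and one spam email")
    # False positive: ham classified as spam.
    false_pos = sum(1 for a, p in zip(actual, predicted) if a == "ham" and p == "spam")
    # False negative: spam classified as ham.
    false_neg = sum(1 for a, p in zip(actual, predicted) if a == "spam" and p == "ham")
    fp_rate = false_pos / ham_total
    fn_rate = false_neg / spam_total
    return fp_rate, fn_rate, (fp_rate + fn_rate) / 2


# Toy example: 4 ham (1 misfiled as spam) and 4 spam (2 missed).
toy_actual = ["ham"] * 4 + ["spam"] * 4
toy_predicted = ["ham", "ham", "ham", "spam", "spam", "spam", "ham", "ham"]
print(error_rates(toy_actual, toy_predicted))
```

Averaging the two rates, rather than dividing total errors by total emails, keeps Err from being dominated by whichever class is more numerous.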

  26. Results - % Error

  27. Results - % False Positives

  28. Overview • Case-Based Spam Filtering • Feature-Based & Feature-Free • Experiments I • Case Base Maintenance • Competence-Based Editing • Experiments II • Concept Drift • Incremental & periodic solutions • Experiments III • Conclusions

  29. Case Base Maintenance • Case base editing algorithms • remove redundant cases, and • remove noisy cases. • Their goal is to • reduce retrieval time but • maintain or even improve accuracy.

  30. Competence Model • For each case c, compute • coverage set of c • cases that have c as one of their k-NN and which have same class as c • liability set of c • cases that have c as one of their k-NN and which have different class from c • Diagram: x is in the coverage set of c; y is in the liability set of c
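Following the definitions as worded on this slide, coverage and liability sets can be built in one pass: for every case, find its k nearest neighbours and record the case in each neighbour's coverage set (same class) or liability set (different class). The names, the index-based representation and the toy data are assumptions for the sketch.

```python
from typing import Callable, Dict, List, Set, Tuple, TypeVar

T = TypeVar("T")


def competence_sets(
    cases: List[Tuple[T, str]],
    distance: Callable[[T, T], float],
    k: int = 3,
) -> Tuple[Dict[int, Set[int]], Dict[int, Set[int]]]:
    """Return (coverage, liability), each mapping a case index to a set of indices."""
    coverage: Dict[int, Set[int]] = {i: set() for i in range(len(cases))}
    liability: Dict[int, Set[int]] = {i: set() for i in range(len(cases))}
    for i, (item, label) in enumerate(cases):
        # k nearest neighbours of case i, excluding itself.
        others = [j for j in range(len(cases)) if j != i]
        nearest = sorted(others, key=lambda j: distance(item, cases[j][0]))[:k]
        for j in nearest:
            if cases[j][1] == label:
                coverage[j].add(i)   # neighbour j agrees with case i
            else:
                liability[j].add(i)  # neighbour j disagrees with case i
    return coverage, liability


# Toy data: numbers stand in for emails (hypothetical), with k = 1.
toy = [(0.0, "ham"), (1.0, "ham"), (2.5, "spam"), (10.0, "spam")]
cov, lia = competence_sets(toy, lambda a, b: abs(a - b), k=1)
print(cov)
print(lia)
```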

  31. Competence-Based Editing • Blame-Based Noise Reduction • For each case c with non-empty liability set (taken in descending order of size of liability set), • if the cases in c’s coverage set can still be correctly classified without c, then c can be deleted. • This emphasises removal of cases that cause misclassifications. • Conservative Redundancy Reduction • For each remaining case c (taken in ascending order of size of coverage set) • retain c but delete the cases in c’s coverage set • This retains cases close to class boundaries
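The second step, Conservative Redundancy Reduction, can be sketched from the wording above, given coverage sets that have already been computed. This covers only the redundancy step, not Blame-Based Noise Reduction, and the function name and index-based input are assumptions.

```python
from typing import Dict, List, Set


def conservative_redundancy_reduction(coverage: Dict[int, Set[int]]) -> List[int]:
    """Return the indices of the cases retained after redundancy reduction."""
    remaining = set(coverage)
    kept: List[int] = []
    # Visit cases with the smallest coverage sets first (ties broken by index).
    for c in sorted(coverage, key=lambda i: (len(coverage[i]), i)):
        if c not in remaining:
            continue  # already deleted because a retained case covers it
        kept.append(c)
        remaining.discard(c)
        remaining -= coverage[c]  # delete the cases that c covers
    return kept


# Toy coverage sets for four cases (hypothetical).
toy_coverage = {0: {1, 2}, 1: {0}, 2: set(), 3: {2}}
print(conservative_redundancy_reduction(toy_coverage))
```

Cases with small coverage sets tend to lie near class boundaries, so visiting them first is what makes the reduction conservative.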

  32. Results - % Error • Feature-based edited size = 75% and 65% • Feature-free edited size = 59% and 57%

  33. Results - % False Positives

  34. Overview • Case-Based Spam Filtering • Feature-Based & Feature-Free • Experiments I • Case Base Maintenance • Competence-Based Editing • Experiments II • Concept Drift • Incremental & periodic solutions • Experiments III • Conclusions

  35. Concept Drift • The target concept is not static • it changes according to season • it changes according to world events • people’s interests and tolerances change • there is an arms race: • ever more devious spamouflage! • We need to investigate behaviour over time

  36. Experiments III • Took ~10000 emails from two years of email from two people in date-order • Created a case base for each person from earliest 500 consecutive ham & earliest 500 consecutive spam • Remaining ~9000 emails presented chronologically as test cases • Same settings and measures as before • k = 3 • Feature-based: 700 features • Feature-free: GZip as text compressor

  37. Retention policies • CBR (and other lazy learners) can easily incorporate the most recent examples • retain-all: store all new emails in the case base • retain-misclassifieds: store a new email if our prediction is wrong
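A chronological evaluation loop with either retention policy might look like the sketch below. The `classify` callback, the `(text, label)` tuples and the toy classifier are stand-ins invented for the example; only the two policy names come from the slide.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str]  # (email text, true label)


def run_with_retention(
    case_base: List[Case],
    stream: List[Case],
    classify: Callable[[str, List[Case]], str],
    policy: str = "retain-misclassifieds",
) -> int:
    """Classify emails in date order, updating the case base; return the error count."""
    if policy not in ("retain-all", "retain-misclassifieds"):
        raise ValueError(f"unknown retention policy: {policy}")
    errors = 0
    for text, true_label in stream:
        wrong = classify(text, case_base) != true_label
        if wrong:
            errors += 1
        # retain-all stores every email; retain-misclassifieds only the mistakes.
        if policy == "retain-all" or wrong:
            case_base.append((text, true_label))
    return errors


def toy_classify(text: str, cases: List[Case]) -> str:
    """Hypothetical classifier: spam only if the exact text was stored as spam."""
    return "spam" if (text, "spam") in cases else "ham"


toy_stream = [("a", "spam"), ("a", "spam"), ("b", "ham")]
base_mis: List[Case] = []
base_all: List[Case] = []
print(run_with_retention(base_mis, toy_stream, toy_classify), len(base_mis))
print(run_with_retention(base_all, toy_stream, toy_classify, "retain-all"), len(base_all))
```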

  38. Results - % Error • When we retain misclassified cases, case bases increase in size by ~30%

  39. Results - % False Positives

  40. Retention • Bigger case base reduces efficiency • Obsolete cases may reduce accuracy • Obsolete features may reduce accuracy • Need a deletion policy

  41. Incremental Solutions • Consider add-1-delete-1 • Case base size remains constant • retention policy • retain-all • retain-misclassified • forgetting policy • forget-oldest • forget-least-accurate • Slide callouts: instance selection, instance weighting

  42. Incremental Solutions • Consider add-1-delete-1 • Case base size remains constant • retention policy • retain-all • retain-misclassified • forgetting policy • forget-oldest • forget-least-accurate • Accuracy = #successes / #retrievals
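One way to picture add-1-delete-1 is a case record that tracks how often it was retrieved and how often that retrieval was a success, so that accuracy = #successes / #retrievals can drive forgetting. The `StoredCase` class, its field names and the choice to treat never-retrieved cases as fully accurate are assumptions of this sketch, not details from the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoredCase:
    text: str
    label: str
    added_at: int        # arrival order; smaller means older
    retrievals: int = 0  # times this case was among the k nearest neighbours
    successes: int = 0   # retrievals where it agreed with the true class

    @property
    def accuracy(self) -> float:
        # Assumption: a case that was never retrieved counts as fully accurate.
        return self.successes / self.retrievals if self.retrievals else 1.0


def add_one_delete_one(
    case_base: List[StoredCase],
    new_case: StoredCase,
    forgetting: str = "forget-oldest",
) -> StoredCase:
    """Add new_case and remove one existing case, keeping the size constant."""
    indices = range(len(case_base))
    if forgetting == "forget-oldest":
        victim = min(indices, key=lambda i: case_base[i].added_at)
    elif forgetting == "forget-least-accurate":
        # Ties on accuracy are broken by age (older cases go first).
        victim = min(indices, key=lambda i: (case_base[i].accuracy, case_base[i].added_at))
    else:
        raise ValueError(f"unknown forgetting policy: {forgetting}")
    removed = case_base.pop(victim)
    case_base.append(new_case)
    return removed


toy_base = [
    StoredCase("a", "spam", added_at=0, retrievals=4, successes=4),
    StoredCase("b", "ham", added_at=1, retrievals=4, successes=1),
    StoredCase("c", "spam", added_at=2),
]
incoming = StoredCase("d", "ham", added_at=3)
print(add_one_delete_one(list(toy_base), incoming, "forget-oldest").text)
print(add_one_delete_one(list(toy_base), incoming, "forget-least-accurate").text)
```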

  43. Results - % Error

  44. Results - % False Positives Negative effect on FPs?

  45. Periodic Solutions • Periodic • Feature-based: • retain-misclassified; • monthly, feature re-extraction, feature re-selection, case base rebuild and case base edit • Feature-free • retain-misclassified; • monthly, case base edit

  46. Feature-Based ECUE • Pipeline: Emails → Feature Extraction → Casebase → Feature Selection → Casebase → Case Base Edit → Casebase

  47. Results - % Error

  48. Results - % False Positives

  49. Overview • Case-Based Spam Filtering • Feature-Based & Feature-Free • Experiments I • Case Base Maintenance • Competence-Based Editing • Experiments II • Concept Drift • Incremental & periodic solutions • Experiments III • Conclusions

  50. Feature-Free ECUE: Advantages • Accuracy • lower error rate than traditional feature-based methods • often lower false positive rate • Costs • it uses the raw text • no need to extract, select or weight features • no need to update features as spam changes • Concept drift • simple retention/forgetting policies can be effective
