
Machine Learning Basics with Applications to Email Spam Detection

UGR Project by Haoyu Li, Brittany Edwards, and Wei Zhang, under Xiaoxiao Xu and Arye Nehorai. General background information about the process of machine learning. The process of email detection. Motivation of this project.



Presentation Transcript


  1. UGR Project - Haoyu Li, Brittany Edwards, Wei Zhang, under Xiaoxiao Xu and Arye Nehorai • Machine Learning Basics with Applications to Email Spam Detection

  2. General background information about the process of machine learning

  3. The process of email detection • Motivation of this project • Pre-processing of data • Classifier Models • Evaluation of classifiers

  4. Motivation of this project • Spam email annoys every personal email account • 60% of January 2004 emails were spam • Fraud & Phishing • Spam vs. Ham email

  5. Our Goal

  6. Spam Email example

  7. Ham Email example

  8. The process of email detection • Motivation of this project • Pre-processing of data • Classifier Models • Evaluation of classifiers

  9. Pre-processing of data • Convert capital letters to lowercase • Remove numbers and extra white space • Remove punctuation • Remove stop-words • Delete terms with length greater than 20
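The pre-processing steps listed on this slide could be sketched roughly as follows. This is only an illustration, not the project's actual code: the function name `preprocess`, the tiny `STOP_WORDS` set, and the choice to keep terms of exactly 20 characters are all assumptions, since the slides do not say which stop-word list or tooling was used.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list (the slides do not name one).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for", "you"}

# Length threshold from the slide: terms longer than this are deleted.
MAX_TERM_LENGTH = 20

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Return the list of cleaned terms for one email body."""
    text = text.lower()                    # convert capital letters to lowercase
    text = re.sub(r"\d+", " ", text)       # remove numbers
    text = text.translate(PUNCT_TO_SPACE)  # remove punctuation
    terms = text.split()                   # split() also collapses extra white space
    return [
        term
        for term in terms
        if term not in STOP_WORDS and len(term) <= MAX_TERM_LENGTH
    ]


if __name__ == "__main__":
    print(preprocess("WIN $1000 now!!! Click the link to claim"))
```

Replacing punctuation with spaces rather than deleting it keeps words such as "now!!!click" from being fused into a single term.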

  10. Pre-processing of data • Original Email

  11. Pre-processing of data • After pre-processing

  12. Pre-processing of data • Extract Terms

  13. Pre-processing of data • Reduce Terms • Keep word length <20

  14. The process of email detection • Motivation of this project • Pre-processing of data • Classifier Models • Evaluation of classifiers

  15. Different classification methods • K Nearest Neighbor (KNN) • Naive Bayes Classifier • Logistic Regression • Decision Tree Analysis

  16. What is K Nearest Neighbor • Use the k "closest" samples (nearest neighbors) to perform classification
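As a rough illustration of the idea on this slide, the sketch below classifies a query vector by majority vote among its k closest training vectors. It assumes each email has already been turned into a numeric term-count vector and that closeness is measured with Euclidean distance; the slides do not state which distance or value of k the project used, and the names `euclidean` and `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training samples."""
    # Pair every training sample's distance to the query with its label,
    # then sort so the closest samples come first.
    distances = sorted(
        (euclidean(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy term-count vectors: [count of "free", count of "meeting"].
    vectors = [[3, 0], [2, 1], [0, 3], [1, 2]]
    labels = ["spam", "spam", "ham", "ham"]
    print(knn_predict(vectors, labels, [3, 1], k=3))
```

Because every prediction compares the query against all training samples across the full term list, a very large term list makes both the cost and the noise in the distance grow.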

  17. What is K Nearest Neighbor

  18. Initial outcome and strategies for improvement • KNN accuracy was ~64% - very low • KNN classifier does not fit our project • Term-list is still too large • Try a different classification method and see whether its evaluation results are better than the KNN results • Continue to reduce the size of the term list by removing terms that are not meaningful

  19. Steps for improvement • Remove sparsity • Reduced length threshold • Created hashtable • Used alternative classifier • Naive Bayes classifier

  20. Hashtable • Calculate a hash key for each term in the term-list • When a collision occurs, use separate chaining
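A minimal sketch of a hash table with separate chaining is shown below, storing a count per term. The class name `ChainedHashTable`, the polynomial hash function, and the bucket count are assumptions chosen for illustration; the slides do not describe the project's actual hash function or table size.

```python
class ChainedHashTable:
    """Term -> count table that resolves collisions with separate chaining."""

    def __init__(self, num_buckets=8):
        # Each bucket is a chain (list) of [term, count] entries.
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, term):
        # Simple polynomial hash over the characters (an illustrative choice).
        h = 0
        for ch in term:
            h = (h * 31 + ord(ch)) % len(self.buckets)
        return h

    def add(self, term, count=1):
        """Add `count` occurrences of `term`, chaining on collision."""
        bucket = self.buckets[self._index(term)]
        for entry in bucket:
            if entry[0] == term:
                entry[1] += count
                return
        # Term not seen yet: append it to this bucket's chain.
        bucket.append([term, count])

    def get(self, term):
        """Return the stored count for `term`, or 0 if it is absent."""
        for key, value in self.buckets[self._index(term)]:
            if key == term:
                return value
        return 0


if __name__ == "__main__":
    table = ChainedHashTable(num_buckets=1)  # one bucket forces collisions
    for term in ["free", "offer", "free"]:
        table.add(term)
    print(table.get("free"), table.get("offer"))
```

With a single bucket every term collides, so both terms end up in the same chain and are still retrieved correctly.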

  21. Naive Bayes classifier
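The slide gives no detail on the classifier itself, so the following is only a generic sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, working on lists of pre-processed terms. Whether the project used this variant is not stated; the function names `train_naive_bayes` and `predict_naive_bayes` and the toy data are hypothetical.

```python
import math
from collections import Counter


def train_naive_bayes(documents, labels):
    """Count documents per class and term occurrences per class."""
    class_counts = Counter(labels)
    term_counts = {label: Counter() for label in class_counts}
    for terms, label in zip(documents, labels):
        term_counts[label].update(terms)
    vocabulary = {term for counts in term_counts.values() for term in counts}
    return class_counts, term_counts, vocabulary


def predict_naive_bayes(model, terms):
    """Return the class with the highest log-posterior for `terms`."""
    class_counts, term_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for term in terms:
            if term not in vocabulary:
                continue  # ignore terms never seen during training
            # Add-one smoothing keeps unseen (term, class) pairs from zeroing the product.
            likelihood = (term_counts[label][term] + 1) / (total_terms + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    docs = [
        ["free", "money", "now"],
        ["win", "free", "prize"],
        ["meeting", "tomorrow", "morning"],
        ["project", "meeting", "notes"],
    ]
    labels = ["spam", "spam", "ham", "ham"]
    model = train_naive_bayes(docs, labels)
    print(predict_naive_bayes(model, ["free", "prize"]))
```

Working in log space avoids numeric underflow when many small term probabilities are multiplied together.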

  22. Secondary Results • Accuracy increased from 62% to 82.36%

  23. Suggestions for further improvement • Revise pre-processing • Apply additional classifiers

  24. Thank you • Questions?
