Machine Learning Basics with Applications to Email Spam Detection

Brittany Edwards, Haoyu Li, and Wei Zhang, under Xiaoxiao Xu and Dr. Nehorai
Department of Electrical and Systems Engineering

Abstract

Machine learning was used to create a spam detection algorithm that distinguishes between spam and ham emails. Several classification methods were applied, and their results were analyzed to find the best outcome.

Results

With k nearest neighbor (k = 3), the accuracy was ~64%, so the initial conclusion was that this classifier does not fit the data well. The term list was also still too large, so more pre-processing is needed to remove meaningless terms. The naive Bayes classifier yielded a higher accuracy, ~82%. These secondary results were much better than the first, but still not good enough for an accurate algorithm. They show that the revisions to pre-processing and the new classification method produced progress.

Summary

The initial results were not strong enough to build on, so further steps were taken to improve them and increase the chance of producing a correct algorithm.

Introduction

The goal is to use machine learning and various classification techniques to create an algorithm that can distinguish between spam and ham emails. Spam detection requires many pre-processing steps before the data can be classified accurately. The methods explored were k nearest neighbor, naive Bayes classification, logistic regression, and decision tree classification. After classification, the best method was chosen by evaluating each classifier on its precision, recall, and F1 score, and by examining its ROC curve.

Conclusions

After the preliminary results and the subsequent revisions, the final results showed a 32.38% increase in accuracy. More pre-processing is needed to reach the most accurate outcomes and therefore the best possible algorithm. The initial classifier, k nearest neighbor, did not yield the best results, and there are many possible contributing factors. The naive Bayes classifier was much more accurate and led to better results than k nearest neighbor. The other two methods, logistic regression and decision tree classification, did not work correctly with our model; the hope is that these methods can be used after future pre-processing revisions. Progress was made, and more is hoped for with future efforts.

[Figure: Ham email example]
[Figure: Spam email example]

Methods

The project was carried out in the programming language R. Initial pre-processing removed stop-words using an R package. The words were then converted to lowercase, punctuation was removed, and words longer than 20 letters were removed. A hash table was then created to map similar words to a single general form of the word. After pre-processing, the different classification methods were applied and the results were analyzed.

[Figure: Naive Bayes Classifier]
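
The pre-processing steps listed under Methods can be sketched roughly as follows. This is a minimal illustration in Python, not the project's R code: the stop-word list, the word-mapping table, and the function name preprocess are hypothetical stand-ins, and lowercasing is done first here so that the stop-word lookup matches regardless of case.

```python
import string

# Hypothetical, abbreviated stop-word list (the project used an R package instead).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "is", "in", "for", "you"}

# Hypothetical hash table mapping similar words to one general form.
GENERAL_FORM = {"winner": "win", "winning": "win", "wins": "win", "offers": "offer"}

# Words longer than this many letters are dropped, as stated in Methods.
MAX_WORD_LENGTH = 20


def preprocess(text):
    """Return the list of cleaned terms for one email body."""
    # Convert to lowercase, then strip ASCII punctuation characters.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))

    terms = []
    for word in text.split():
        if word in STOP_WORDS:
            continue  # drop stop-words
        if len(word) > MAX_WORD_LENGTH:
            continue  # drop overly long tokens
        # Map the word to its general form if the table has an entry for it.
        terms.append(GENERAL_FORM.get(word, word))
    return terms


if __name__ == "__main__":
    print(preprocess("You are the WINNER of a prize!!!"))
```

The resulting term lists would then be turned into a term-count table before any classifier is applied; that step is not shown here.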
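
The Introduction names precision, recall, and F1 score as the criteria for comparing classifiers. As a small illustration of how those three numbers are computed from predicted and true labels, here is a Python sketch; the function name and the toy labels are hypothetical and are not taken from the project's data.

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    # True positives: spam correctly flagged as spam.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: ham wrongly flagged as spam.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: spam that was missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy labels for illustration only.
    truth = ["spam", "spam", "spam", "ham", "ham"]
    guess = ["spam", "spam", "ham", "spam", "ham"]
    print(precision_recall_f1(truth, guess))
```

Accuracy alone can hide a classifier that rarely flags spam, which is one reason to look at precision and recall together.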
