Machine Learning Basics with Applications to Email Spam Detection
Brittany Edwards, Haoyu Li, and Wei Zhang, under Xiaoxiao Xu and Dr. Nehorai
Department of Electrical and Systems Engineering
Abstract
We use machine learning to build a spam detection algorithm that distinguishes spam from ham (legitimate) emails. Several classification methods were applied, and their results were analyzed to find the best-performing one.

Results
Using k-nearest neighbor with k = 3, the accuracy was ~64%, so the initial conclusion was that this classifier is a poor fit for the problem. The term list was also still too large, so more pre-processing is needed to remove meaningless terms. The naive Bayes classifier yielded a higher accuracy, ~82%. These secondary results were much better than the first, but still not good enough for an accurate algorithm. They do show that the revisions to pre-processing and the new classification method produced progress.

Summary
The initial results were not strong enough to lead to a successful outcome, so further steps were taken to improve them and increase the chance of producing a correct algorithm.

Introduction
The goal is to use machine learning and various classification techniques to create an algorithm that can distinguish spam from ham emails. Spam detection requires many pre-processing steps before the data can be classified accurately. The methods explored were k-nearest neighbor, naive Bayes classification, logistic regression, and decision tree classification. After classification, the best method was chosen by evaluating each classifier on its precision, recall, and F1 score, and by examining its ROC curve.

Conclusions
The final results, after the preliminary results and revisions, yielded a 32.38% increase in accuracy.
More pre-processing is needed to get the most accurate outcomes and therefore the best possible algorithm. The initial classifier, k-nearest neighbor, did not yield the best results, and many factors may have contributed to this. The naive Bayes classifier was much more accurate and led to better results than k-nearest neighbor. The other two methods, logistic regression and decision tree classification, did not work correctly with our model; we hope these methods can be used after future pre-processing revisions. Progress was made, but more is hoped for with future efforts.

[Figures: Ham email example; Spam email example]

Methods
Initial pre-processing removed stop words using a package in R, the programming language used for this project. The words were then converted to lowercase, punctuation was removed, and words longer than 20 letters were removed. A hash table was then created to map similar words to a single general form of the word. After pre-processing, the different classification methods were applied and the results were analyzed.

[Figures: Naive-Bayes Classifier]
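
The pre-processing steps listed under Methods can be illustrated with a minimal sketch. The project itself used R; the Python below is not the authors' code. The stop-word list, the word-form table, and the function name `preprocess` are hypothetical stand-ins chosen for illustration. The sketch also lowercases before removing stop words so that matching is case-insensitive, which may differ from the order used in the project.

```python
import string

# Hypothetical, deliberately tiny stop-word list (assumption);
# the project used a stop-word list from an R package instead.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "is", "in", "for", "you"}

# Hypothetical hash table mapping similar words to one general form (assumption).
WORD_FORMS = {"winner": "win", "winning": "win", "wins": "win", "offers": "offer"}

# Words longer than this many letters are dropped, as stated in Methods.
MAX_WORD_LENGTH = 20


def preprocess(text):
    """Return the list of cleaned terms for one email body."""
    # Lowercase the text and delete all ASCII punctuation characters.
    strip_punctuation = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(strip_punctuation).split()

    terms = []
    for word in words:
        # Skip stop words and over-long tokens.
        if word in STOP_WORDS or len(word) > MAX_WORD_LENGTH:
            continue
        # Map the word to its general form if the table knows it.
        terms.append(WORD_FORMS.get(word, word))
    return terms


if __name__ == "__main__":
    print(preprocess("You are the WINNER of a free prize!!!"))
```

The resulting term lists would then be turned into term counts and passed to a classifier such as k-nearest neighbor or naive Bayes. A smaller, cleaner term list is what the Results section identifies as the main lever for improving accuracy.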