Introduction to Automatic Email Classification. Shih-Wen (George) Ke, 7th Dec 2005.
 
                
Overview
• Introduction to Enron Corpus
• Traditional Text Classification vs Email Classification
• Recent Work on Enron Corpus
• Our Work on Enron Corpus
• Summary
• Future Research Directions in Information Retrieval
• Further Discussion
Overview
• The nature of email classification is very different from that of traditional text classification tasks.
• Email is time-dependent, poorly structured and informally written.
• No standard way of preparing and evaluating email datasets has been proposed.
Introduction
• Automatic email classification dates back to the mid-1990s
• Email classification received little attention until recently because no standard email dataset was available
• The Enron Email Corpus became available in March 2004
Introduction - Enron Corpus
• Distributed by William Cohen at Carnegie Mellon University
• Consists of 517,431 messages belonging to 150 users of Enron Corporation
• Most users use folders to categorise their emails
• The upper bound for the number of folders a user has appears to be the log of the number of messages (Klimt & Yang, 2004)
Email Classification: Assumptions
• The task is to categorise email into folders (a.k.a. email foldering)
• Only personal and professional emails are considered here
• Users are assumed to organise their emails with folders
• Other methods of organising emails, e.g. flags or labels, are not considered here, although they may provide more information for email classification
Our Work on Enron Corpus - Introduction
• Users sometimes forget which folders they have created or which folder an email should be filed under
• So users tend to create new (duplicate) folders
• Newly created folders adversely affect classification performance (Bekkerman et al., 2004)
• Aim: reduce the likelihood of users creating duplicate folders by improving the accuracy of assigning incoming emails to the folders that already exist
• Compare state-of-the-art classifiers (kNN, SVM) with our own classifier, PERC, in a simulation of a real-time situation using various parameter settings
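One common way to approximate a "real-time situation" is to evaluate in time order: train on the messages seen so far and test on the batch that arrives next. The sketch below illustrates that idea only; the slides do not specify the split that was used, and the function name, the `timestamp` key and the `initial` and `step` parameters are hypothetical.

```python
def chronological_batches(messages, initial=100, step=100):
    """Yield (train, test) pairs in time order.

    Assumption: each message is a dict with a sortable "timestamp" key.
    The first `initial` messages form the first training set; each later
    batch of `step` messages is tested on, then absorbed into training.
    """
    ordered = sorted(messages, key=lambda m: m["timestamp"])
    for start in range(initial, len(ordered), step):
        yield ordered[:start], ordered[start:start + step]


# Toy usage with six messages arriving out of order.
messages = [{"timestamp": t, "folder": "a"} for t in [5, 3, 1, 4, 2, 6]]
splits = list(chronological_batches(messages, initial=2, step=2))
```

Because every test batch is strictly later than its training set, a classifier is never evaluated on mail older than what it was trained on.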
Our Work on Enron Corpus - The PERC
• The PERC Classifier (PERsonal email Classifier)
• Find a centroid c_i for each category C_i
• For each test document x:
  • Find the k nearest neighbouring training documents to x
  • For each neighbour d_j, add the similarity between x and d_j to the similarity between x and c_i
  • Sort the similarity scores sim(x, C_i) in descending order
  • The decision to assign x to C_i can then be made using various thresholding strategies
Our Work on Enron Corpus - The PERC
• The PERC Classifier (PERsonal email Classifier)
In the PERC score sim(x, C_i): y(d_j, C_i) ∈ {0, 1} is the classification of training document d_j with respect to category C_i; sim(x, d_j) is the similarity between test document x and training document d_j; and sim(x, c_i) is the similarity between test document x and the centroid c_i of the category that d_j belongs to.
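The steps described for PERC can be sketched as follows. This is one reading of the slide text, not the authors' implementation: it assumes cosine similarity over sparse term-weight vectors, centroids computed as the mean of each folder's vectors, and a score in which each of the k nearest neighbours d_j contributes sim(x, d_j) + sim(x, c_i) to its own folder C_i. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroids(train):
    """Mean term vector per folder; train is a list of (vector, folder)."""
    sums, counts = defaultdict(Counter), Counter()
    for vec, folder in train:
        sums[folder].update(vec)  # adds term weights element-wise
        counts[folder] += 1
    return {f: {t: w / counts[f] for t, w in s.items()} for f, s in sums.items()}


def perc_scores(x, train, k=3):
    """Rank folders for test vector x (assumed scoring, see lead-in)."""
    cents = centroids(train)
    # k nearest training documents to x
    neighbours = sorted(train, key=lambda d: cosine(x, d[0]), reverse=True)[:k]
    scores = defaultdict(float)
    for vec, folder in neighbours:
        # neighbour similarity plus similarity to that neighbour's centroid
        scores[folder] += cosine(x, vec) + cosine(x, cents[folder])
    # descending order of sim(x, C_i)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


train = [
    ({"meeting": 1.0, "agenda": 1.0}, "work"),
    ({"meeting": 1.0, "budget": 1.0}, "work"),
    ({"party": 1.0, "cake": 1.0}, "personal"),
]
ranked = perc_scores({"meeting": 1.0, "agenda": 1.0}, train, k=3)
```

Assigning x to `ranked[0][0]` corresponds to the simplest thresholding choice (take the top-ranked folder); a score threshold or a top-n cut could be applied to `ranked` instead.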
Rationale for the Hybrid Approach
• The centroid method overcomes data sparseness: emails tend to be short.
• kNN allows the topic of a folder to drift over time. Considering the vector space locally allows matching against the features that are currently dominant.
Our Work on Enron Corpus - Results
SVM settings: SVM1 (c=1, j=1), SVM2 (c=0.01, j=1)
Micro-averaged and macro-averaged F1 over all users, with standard deviation, for kNN, SVM and PERC
For the macro-averaged evaluation, PERC significantly outperformed kNN (t=2.786, p=0.032), SVM1 (t=2.533, p=0.044) and SVM2 (t=5.926, p=0.001).
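The difference between the two averages matters for foldering because folder sizes are very uneven. The sketch below computes both for a single user, assuming each email has exactly one true folder and one predicted folder; the function names and the toy data are hypothetical, and how the per-user figures were combined "over all users" is not specified in the slides.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1) for single-label folder assignments."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted folder gains a false positive
            fn[g] += 1  # true folder gains a false negative
    folders = set(gold) | set(pred)
    # Macro: average the per-folder F1, so small folders count equally.
    macro = sum(f1(tp[f], fp[f], fn[f]) for f in folders) / len(folders)
    # Micro: pool the counts first, so large folders dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


gold = ["inbox"] * 8 + ["travel"] * 2
pred = ["inbox"] * 8 + ["inbox", "travel"]
micro, macro = micro_macro_f1(gold, pred)
```

In this toy case one error on the two-message "travel" folder lowers macro-F1 more than micro-F1, which is why macro-averaging is the more sensitive measure of accuracy on small folders.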
Our Work on Enron Corpus - Conclusions
• PERC has the highest accuracy in assigning test documents to small folders
• kNN and PERC performed better with smaller k
• The parameters of SVM can be sensitive to the number of training documents available
• Various parameter settings and training/test set splits remain to be investigated
• The use of time information will be investigated
• A questionnaire-based study is being conducted to establish how real users manage their email
Future Research Directions in IR
• Use of time information
• Training/test set splits
• Feature extraction and selection
• Document representation
• Qualitative evaluation
• Thread detection; topic detection and tracking (TDT) for email
• Mining sequential patterns
• Bursts of activity (Kleinberg, 2002)
References
• Bekkerman, R., McCallum, A. & Huang, G. (2004) Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora. Technical Report IR-418, CIIR, University of Massachusetts.
• Kleinberg, J. (2002) Bursty and Hierarchical Structure in Streams. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
• Klimt, B. & Yang, Y. (2004) The Enron Corpus: A New Dataset for Email Classification Research. European Conference on Machine Learning.