Enron Corpus: A New Dataset for Email Classification

By Bryan Klimt and Yiming Yang, CEAS 2004. Presented by Will Lee.


Presentation Transcript


  1. Enron Corpus: A New Dataset for Email Classification • By Bryan Klimt and Yiming Yang • CEAS 2004 • Presented by Will Lee

  2. Introduction • Motivation • Related Works • The Enron Corpus • Methods • Evaluation • Thread Information • Conclusion

  3. Motivation • Other corpora focus on newsgroups or personal email data • No common dataset exists for evaluating the performance of email classification • Previous research uses different personal datasets • Real email use within a company is difficult to obtain • Obviously, companies do not like to share their internal emails • Privacy concerns for people working for the company

  4. Related Works • Other corpora • 20 Newsgroups • http://people.csail.mit.edu/people/jrennie/20Newsgroups/ • Related papers • Y. Diao, H. Lu, and D. Wu, A Comparative Study of Classification Based Personal E-mail Filtering (PAKDD ’00) • I. Androutsopoulos et al., An Experimental Comparison of Naïve Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages (SIGIR ’00) • T. Payne, Learning Email Filtering Rules with Magi (Thesis, 1994)

  5. 20 Newsgroups • Collection of approximately 20,000 newsgroup documents, spread evenly across 20 different newsgroups • Sample newsgroups: • comp.graphics, rec.motorcycles, rec.sport.baseball, sci.electronics, talk.politics.misc, talk.religion.misc, etc. • Originally used in Ken Lang’s paper “NewsWeeder: Learning to Filter Netnews” (ICML 1995) • Consists of newsgroup data, so probably not very useful for research in personal information management

  6. Enron Dataset • 619,446 messages (200,399 after cleaning) by 158 users • Average 757 messages per user • Shows that most users do use folders to organize their email • Folder information can be used to evaluate the effectiveness of folder classification

  7. Enron Corpus’ Characteristics • The number of messages per user varies from a few messages to 10K+ messages • The upper bound on the number of folders seems to correlate with log(# of messages) • The number of messages does not correlate with the lower bound (a user can have many messages but only a few folders) • Question: how can we use this kind of information?

  8. Email Classification Features • Constructive text • Bag-of-words (BOW) approach, the most widely used feature • Some fields are more important than others • Stemming and stop word removal are used, but their effectiveness is not proven • Categorical text • “to” and “from” fields • BOW, useful for classification, but not as useful as constructive text • Numeric data • Size of message, number of replies, number of words, etc. • Not very useful • Thread information • Indicates how messages relate to each other • Not fully exploited

  9. Email Features (Example) • Callout labels on the slide: Numeric data, Categorical text, Thread information, Contextual text • Example message:

     From: Mark Hills <mhills@cs.uiuc.edu>
     Subject: Re: When is the first lecture? When will the course page be updated?
     Date: Thu, 26 Aug 2004 13:41:09 -0500
     Lines: 11
     Message-ID: <cglafa$f3o$1@dcs-news1.cs.uiuc.edu>
     References: <cgl09c$bll$1@dcs-news1.cs.uiuc.edu>
     In-Reply-To: <cgl09c$bll$1@dcs-news1.cs.uiuc.edu>

     Joshua Blatt wrote:
     > When is the first lecture? When will the course page be updated?
     >
     > Thanks
     >
     > Josh

     The first lecture was today, during the normally scheduled time.

     Mark
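The four feature types can be pulled out of a message like the example above with Python's standard `email` package. This is only a minimal sketch: the function names `bag_of_words` and `extract_features`, the tokenizer regex, and the choice of which headers count as "numeric" are illustrative assumptions, not the feature extraction used in the paper.

```python
import re
from collections import Counter
from email import message_from_string

# The example message from the slide, used here as test input.
RAW = """\
From: Mark Hills <mhills@cs.uiuc.edu>
Subject: Re: When is the first lecture? When will the course page be updated?
Date: Thu, 26 Aug 2004 13:41:09 -0500
Lines: 11
Message-ID: <cglafa$f3o$1@dcs-news1.cs.uiuc.edu>
References: <cgl09c$bll$1@dcs-news1.cs.uiuc.edu>
In-Reply-To: <cgl09c$bll$1@dcs-news1.cs.uiuc.edu>

Joshua Blatt wrote:
> When is the first lecture? When will the course page be updated?
>
> Thanks
>
> Josh

The first lecture was today, during the normally scheduled time.

Mark
"""


def bag_of_words(text):
    """Lowercase the text and count word tokens (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def extract_features(raw):
    """Split one raw message into the four feature groups from the slides."""
    msg = message_from_string(raw)
    body = msg.get_payload()
    addresses = " ".join(msg.get(h, "") for h in ("From", "To"))
    return {
        # Free text: subject plus body, as a bag of words.
        "text": bag_of_words(msg.get("Subject", "") + " " + body),
        # Categorical text: one token per email address.
        "categorical": Counter(re.findall(r"[\w.+-]+@[\w.-]+", addresses)),
        # Numeric data: header line count and a simple word count.
        "numeric": {"lines": int(msg.get("Lines", 0)), "words": len(body.split())},
        # Thread information: links to earlier messages.
        "thread": {
            "in_reply_to": msg.get("In-Reply-To"),
            "references": (msg.get("References") or "").split(),
        },
    }


features = extract_features(RAW)
```

Keeping the groups separate makes it possible to train on each field on its own, which is how the later slides compare From, Subject, Body, and To/CC.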

  10. Classification Method • Vector space model with SVM • Each vector weight w_i is computed with the “ltc” scheme (http://people.csail.mit.edu/people/jrennie/ecoc-svm/smart.html), which means: • l: new-tf = ln(tf) + 1.0 • t: new-wt = new-tf * log(num-docs / coll-freq-of-term) • c: divide each new-wt by sqrt(sum of (new-wts squared))
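The three "ltc" steps can be written out directly. The sketch below is an illustration under two assumptions: `coll-freq-of-term` is read as the number of documents containing the term, and the logarithm in the "t" step is the natural log. The function name `ltc_vectors` is hypothetical.

```python
import math
from collections import Counter


def ltc_vectors(docs):
    """Compute ltc-weighted vectors for a list of tokenized documents.

    docs: list of token lists. Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # Assumed reading of coll-freq-of-term: number of documents with the term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        weights = {}
        for term, tf in term_freq.items():
            new_tf = math.log(tf) + 1.0                          # l step
            weights[term] = new_tf * math.log(n_docs / doc_freq[term])  # t step
        # c step: cosine normalization to unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in weights.items()})
    return vectors


docs = [["meeting", "budget", "budget"], ["meeting", "lunch"]]
vectors = ltc_vectors(docs)
```

A term that appears in every document ("meeting" here) gets weight 0, because log(num-docs / coll-freq-of-term) is log(1).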

  11. Classification Method (Cont.) • Sort messages in chronological order, then split them into training and test sets • Run SVM on term-weighted vectors of • From • Subject • Body • To, CC • All fields • Linear regression on all fields seems to have the best performance
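The chronological split can be sketched as follows. The slides do not give the split ratio, so `train_fraction` is a hypothetical parameter, and the message dictionaries with `id` and `date` keys are an assumed representation.

```python
from email.utils import parsedate_to_datetime


def chronological_split(messages, train_fraction=0.5):
    """Sort messages by their Date header, then cut into train and test.

    Earlier messages go to training, so the classifier is always tested
    on mail that arrived after everything it was trained on.
    """
    ordered = sorted(messages, key=lambda m: parsedate_to_datetime(m["date"]))
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


messages = [
    {"id": "b", "date": "Thu, 26 Aug 2004 13:41:09 -0500"},
    {"id": "a", "date": "Thu, 26 Aug 2004 09:00:00 -0500"},
    {"id": "d", "date": "Sat, 28 Aug 2004 08:00:00 -0500"},
    {"id": "c", "date": "Fri, 27 Aug 2004 08:00:00 -0500"},
]
train, test = chronological_split(messages)
```

Splitting by time rather than at random mirrors how a mail client would be used: folders are predicted for new mail from what the user has already filed.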

  12. Clustering Effectiveness

  13. Number of Messages vs. F1 • The number of messages does not directly correlate with accuracy • Question: What about the case where the user has only one folder, which makes classification trivial?
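For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it for a single folder; the function name `f1_for_folder` and the folder labels are illustrative, and the slides do not say how per-folder scores were averaged per user.

```python
def f1_for_folder(true_labels, predicted_labels, folder):
    """F1 score for one folder, treating it as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == folder and p == folder)
    fp = sum(1 for t, p in pairs if t != folder and p == folder)
    fn = sum(1 for t, p in pairs if t == folder and p != folder)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


true_labels = ["inbox", "inbox", "work", "work"]
predicted = ["inbox", "work", "work", "work"]
score = f1_for_folder(true_labels, predicted, "work")

# The trivial case raised in the question: a user with a single folder.
single = f1_for_folder(["inbox"] * 3, ["inbox"] * 3, "inbox")
```

With only one folder, every prediction is correct by construction, so F1 is 1.0 regardless of how many messages the user has.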

  14. Number of Folders vs. F1 • There is a correlation between the number of folders and the F1 score. • Question: Is this trivial as well? • Some elements of the messages are not modeled, since the SVM has more messages to train on.

  15. Thread Information • 200,399 messages, 101,786 threads, 71,696 threads with only one message • 61.63% of the messages in the corpus are in a thread • Average thread size is 4.1 messages • Average number of folders per thread is 1.37 (meaning most messages of a thread stay in one folder) • Question: Not clear how threads are detected. How can we use this information?

  16. More Thread • D. Lewis et al., Threading Electronic Mail: A Preliminary Study (1997) • Lewis studied finding a message's parent using a BOW, TF/IDF-weighted vector space approach on constructive text • Formulas shown on the slide: Document weight, Query weight, Similarity
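The idea of finding a parent by text similarity can be sketched with a generic TF-IDF weighting and cosine similarity. The slide's own weight and similarity formulas are not in the transcript, so the weighting below is a common textbook variant and not necessarily the one Lewis used. The names `tfidf`, `cosine`, and `most_likely_parent` are hypothetical.

```python
import math
from collections import Counter


def tfidf(tokens, doc_freq, n_docs):
    """Generic TF-IDF weights: raw term count times log(N / df)."""
    return {t: c * math.log(n_docs / doc_freq[t]) for t, c in Counter(tokens).items()}


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_likely_parent(reply_tokens, candidates):
    """Return the id of the candidate message most similar to the reply.

    candidates: {message_id: token list} for earlier messages.
    """
    all_docs = list(candidates.values()) + [reply_tokens]
    n_docs = len(all_docs)
    doc_freq = Counter(t for doc in all_docs for t in set(doc))
    query = tfidf(reply_tokens, doc_freq, n_docs)
    scores = {
        mid: cosine(query, tfidf(tokens, doc_freq, n_docs))
        for mid, tokens in candidates.items()
    }
    return max(scores, key=scores.get)


candidates = {
    "m1": "when is the first lecture".split(),
    "m2": "lunch menu for friday".split(),
}
parent = most_likely_parent("the first lecture was today".split(), candidates)
```

Here the reply is treated as the query and each earlier message as a document, which matches the query/document framing on the slide.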

  17. More Thread (Cont.) • Lewis’ work assumes that the thread information in the message header is incomplete • This may not be the case • The algorithm by Jamie Zawinski, used in the original Netscape 4.x (maybe in recent Mozilla as well?), can group threaded messages effectively • http://www.jwz.org/doc/threading.htm • Questions • How can we leverage the thread information in email messages more effectively? • Does this model extend to more recent forms of conversation, such as blogs and web forums, as well?
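When the headers are complete, messages can be grouped into threads from the Message-ID, In-Reply-To, and References fields alone. The sketch below is a much-simplified union-find grouping. It is not Zawinski's full algorithm (which also builds the reply tree and handles subject-based fallback), and it is not claimed to be how the paper detected threads. The function name `group_threads` and the message dictionaries are assumptions.

```python
def group_threads(messages):
    """Group messages that are linked through their reference headers.

    messages: list of dicts with an 'id' and an optional 'references'
    list (Message-IDs from the References and In-Reply-To headers).
    Returns a list of threads, each a list of message ids.
    """
    parent = {}

    def find(x):
        # Find the representative of x, registering unseen ids on the fly.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for msg in messages:
        find(msg["id"])
        for ref in msg.get("references", []):
            # Referenced ids may be missing from the mailbox; they still
            # link together the messages that mention them.
            union(msg["id"], ref)

    threads = {}
    for msg in messages:
        threads.setdefault(find(msg["id"]), []).append(msg["id"])
    return list(threads.values())


messages = [
    {"id": "a"},
    {"id": "b", "references": ["a"]},
    {"id": "c", "references": ["a", "b"]},
    {"id": "d"},
    {"id": "e", "references": ["x"]},  # parent "x" is not in the mailbox
    {"id": "f", "references": ["x"]},
]
threads = group_threads(messages)
```

Messages `e` and `f` end up in one thread even though their shared parent is missing, which is one of the cases the header-based approach handles without looking at the text.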

  18. Conclusion • Pros • Introduces a new corpus that can be useful for evaluating classification performance on a large collection of personal mail • Unlike small collections of personal mail, the corpus can also be used to analyze behavior within a company • Cons • Details on how the SVM was run and on the linear weights for the various fields are missing • Not clear how threads are detected
