Text Mining in Data: Extracting Insights from Textual Information

E-Mail Filtering
Soonyeon Kim

Good Sites for Data Mining
• http://liinwww.ira.uka.de/bibliography/ - The Collection of Computer Science Bibliographies
• Major conferences in data mining
  - KDD 2000 of ACM SIGKDD: http://www.acm.org/sigs/sigkdd/kdd2000/
  - SIGMOD 2000 of ACM SIGMOD
• Other conferences
  - VLDB, IEEE ICDE, PAKDD

Text Mining: Finding Nuggets in Mountains of Textual Data
• Authors
  - Jochen Dörre, Peter Gerstl, Roland Seiffert
  - {doerre,gerstl,seiffert}@de.ibm.com
• How this paper was found
  - Search in The Collection of Computer Science Bibliographies
  - Keywords used: data mining, text classification

Brief Description
• What is text mining?
  - It applies the same analytical functions as data mining to the domain of textual information
• How does text mining differ from data mining?
  - Data mining: addresses only a very limited part of the data (structured information available in databases)
  - Text mining: helps dig out the hidden gold from textual information, and requires a very complex feature extraction function
• The paper describes in more detail the unique technologies that are key to successful text mining

Ifile: An Application of Machine Learning to E-mail Filtering
• Author
  - Jason D. M. Rennie, Artificial Intelligence Lab, MIT
  - jrennie@ai.mit.edu
• How this paper was found
  - KDD 2000 of ACM SIGKDD

Outline of Paper
• Introduction
  - Need for automated e-mail filtering
  - Ishmail
  - Important issues regarding mail filtering
• Mail Filtering
  - Classification efficiency
  - Features
  - Naïve Bayes algorithm
• IFILE
• Experiment
• Conclusion

Introduction
• Popular e-mail clients let users organize their mail into folders by meaningful topic
  - Popular e-mail clients: Netscape Messenger, Pine, Microsoft Outlook, Eudora, and EXMH
• Ishmail
  - Serves as a prioritization system
  - Alerts the user when high-priority mail has arrived or when a large number of messages have accumulated in a lower-priority folder
• Barriers
  - Implementation of mail filters (speed efficiency, database size, collection of supervised training data)
  - Integration into e-mail clients

Classification Efficiency
• Traditional classification methods
  - kNN, C4.5, Naïve Bayes
• Recent developments
  - SVM (Support Vector Machine), Maximum Entropy Discrimination (MEM)
• Efficiency problems
  - SVM and MEM significantly improve accuracy, but at the cost of simplicity and time efficiency
  - kNN: classification time is the bottleneck

Classification Efficiency (2)
• Naïve Bayes
  - Efficient training, quick classification, and extensibility to iterative learning
  - Training: updating word counts
  - Classification: a normalized sum of the counts corresponding to the words in the document in question

Personal E-mail Filtering
• Every user has a unique collection of e-mail
• Each user organizes their e-mail in a unique way
• That organization reflects the user's own preferences
• Key to effective personal e-mail filtering
  - Using the information generated through the user interface of the mail client

Learning Architecture
• A label is assigned to each newly filtered e-mail message
• The message is added to the classification model
• Updating the classification model: every filtered e-mail is a training example
  - The label is assumed to be correct if the user does not move the message to another folder
  - The model is updated if the user moves misclassified mail into the appropriate folder
• Update for Naïve Bayes
  - Shift word counts from one folder to another
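The update step above can be sketched in a few lines. This is only an illustration, not ifile's actual code: the class name `FolderCounts` and its methods are hypothetical, and the model is reduced to one word-count table per folder.

```python
from collections import Counter, defaultdict


class FolderCounts:
    """Hypothetical per-folder word-count model (not ifile's real data structure)."""

    def __init__(self):
        # folder name -> Counter mapping word -> number of occurrences
        self.counts = defaultdict(Counter)

    def add(self, folder, words):
        """Treat a filtered message as a training example for `folder`."""
        self.counts[folder].update(words)

    def move(self, src, dst, words):
        """The user moved a message: shift its word counts from src to dst."""
        self.counts[src].subtract(words)
        # Unary plus drops entries whose count fell to zero or below.
        self.counts[src] = +self.counts[src]
        self.counts[dst].update(words)


model = FolderCounts()
model.add("A", ["party", "party", "belch"])
model.add("B", ["party", "kick"])      # filter's guess
model.move("B", "A", ["party", "kick"])  # user corrects the guess
```

Because a correction only adds and subtracts counts, no retraining pass over old mail is needed.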

Features
• A classification model acts as a function f : F → C
  - F: features, C: classes
• A mail filter is a special classifier f : F → C
  - F: characteristics of the e-mail message, C: mail folders
  - Treating each e-mail message as a bag of words, f maps an unordered set of words to a folder name

Features (2)
• Naïve Bayes keeps track of word frequency statistics
• Reducing the number of features used for classification makes filtering more efficient
• Feature selection cutoff
  - Old, infrequent words are dropped
  - Words that occur fewer than log2(age) - 1 times are discarded from the model
  - age: number of e-mail messages added to the model since statistics have been kept for that word
  - e.g. if "baseball" occurred in the 1st document and occurred 5 or fewer times in the next 63 documents, the word and its statistics would be eliminated from the database

Maintaining Dictionary
• Cutoff algorithm
  - Words that occur fewer than log2(age) - 1 times are discarded from the model
  - e.g. "datamining" occurred in the 1st document, and 63 more documents arrive after it
    age = 1 + 63 = 64
    log2(age) - 1 = 6 - 1 = 5
    If "datamining" appears 5 or fewer times, the word and its statistics are eliminated from the database
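A minimal sketch of the cutoff rule, assuming a base-2 logarithm (the base that makes the age = 64 example come out to 5). The function names and the `(age, occurrences)` layout are hypothetical. The code follows the "fewer than" wording of the rule, so a word seen exactly 5 times at age 64 is kept; the worked example's "5 or fewer" wording would drop it.

```python
import math


def should_discard(occurrences: int, age: int) -> bool:
    """True if a word falls below the log2(age) - 1 cutoff."""
    if age < 1:
        raise ValueError("age must be at least 1")
    return occurrences < math.log2(age) - 1


def prune(stats: dict) -> dict:
    """Drop old, infrequent words; stats maps word -> (age, occurrences)."""
    return {
        word: (age, n)
        for word, (age, n) in stats.items()
        if not should_discard(n, age)
    }


stats = {"datamining": (64, 4), "meeting": (64, 12), "new": (1, 1)}
kept = prune(stats)
```

A freshly seen word (age 1) has a cutoff of -1 and so is never discarded immediately.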

Maintaining Dictionary (2)
----------- .idata -----------
A B C                 list of folders (A:0 B:1 C:2)
5 2 6                 total word instances
2 1 1                 # of messages
party 4 0:2 1:1       word age folder:frequency
belch 3 0:1
yellow 4 0:2 2:3
kick 2 1:1
peep 1 2:2
• Two messages in A: "party party belch yellow yellow"
• One message in B: "party kick"
• One message in C: "peep peep yellow yellow yellow"

Word Selection
• Header trimming
• An e-mail consists of
  - Body: the content
  - Header: a list of fields pertaining to the message
    From:, To:, Subject: - keep
    Received:, Date:, Message-id: - remove
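Header trimming as described above might look like the following sketch, which uses Python's standard `email` package. It assumes a single-part plain-text message; the function name `trim_headers` and the sample message are made up for illustration.

```python
from email import message_from_string

# Header fields kept for classification; every other field is dropped.
KEPT_HEADERS = ("From", "To", "Subject")


def trim_headers(raw_message: str) -> str:
    """Return the message with only From/To/Subject headers plus the body."""
    msg = message_from_string(raw_message)
    kept = [
        f"{name}: {msg[name]}"
        for name in KEPT_HEADERS
        if msg[name] is not None
    ]
    # For a single-part message, get_payload() returns the body as a string.
    return "\n".join(kept) + "\n\n" + msg.get_payload()


raw = (
    "Received: from relay.example.com\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: party\n"
    "Date: Mon, 1 Jan 2000 00:00:00 +0000\n"
    "Message-id: <1@example.com>\n"
    "\n"
    "party party belch\n"
)
trimmed = trim_headers(raw)
```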

Tokenizing Text
• Two techniques
• Using a stop list
  - Decreases the amount of noise in the data by eliminating uninformative words, e.g. pronouns, modifiers, adverbs
• Stemming
  - Links together words that share the same root, e.g. serve, service, serves, served => same root "serv"
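Both techniques can be illustrated with a toy example. The stop list and the suffix table below are invented for this sketch, and the suffix stripper is a deliberately crude stand-in for a real stemming algorithm such as Porter's; it is only tuned to reproduce the serve/service/serves/served example.

```python
# Illustrative stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "it", "he", "she", "very", "quickly"}

# Toy suffix table, checked in this order (not a real stemming algorithm).
SUFFIXES = ("ice", "es", "ed", "e", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a root of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text: str) -> list:
    """Lowercase, split on whitespace, drop stop words, stem the rest."""
    return [toy_stem(w) for w in text.lower().split() if w not in STOP_WORDS]


stems = [toy_stem(w) for w in ("serve", "service", "serves", "served")]
tokens = tokenize("He serves the service")
```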

Naïve Bayes
• What is Naïve Bayes?
  - A simple yet effective classifier of text documents
  - A statistical machine learning algorithm
• Assumptions
  - Each document is treated as a set of words
  - Each word is independent of the others, given the class

Naïve Bayes - 1st step
• Probability of d having been generated by ci
  - With the assumption that attribute values are independent: $P(d \mid c_i) = \prod_j P(w_j \mid c_i)$, where the $w_j$ are the words of d

Naïve Bayes - 2nd step
• Compute P(ci|d) for all classes: $P(c_i \mid d) = \frac{P(c_i)\, P(d \mid c_i)}{P(d)}$
• Find the class into which d is to be classified
• Maximum likelihood
  - Probability values are used only for comparison, so P(d) can be dropped

Naïve Bayes
• M-estimate
  - Purpose: to give a reasonable probability estimate when data is sparse
  - $P(w_j \mid c_i) = \frac{n_j + 1}{n + |Vocabulary|}$
  - $n_j$: number of instances of $w_j$ in the documents of class $c_i$
  - $n$: total number of words in the documents of class $c_i$
  - |Vocabulary|: number of distinct words
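A small sketch of classification with smoothed word probabilities. It assumes the common form of the m-estimate, (n_j + 1) / (n + |Vocabulary|), and estimates the prior P(ci) as each folder's share of messages; both choices, the function names, and the toy counts are assumptions of this example. Logarithms are summed instead of multiplying probabilities to avoid underflow, and P(d) is omitted because it is the same for every class.

```python
import math
from collections import Counter


def log_scores(doc_words, folder_counts, folder_msgs):
    """Return log P(ci) + sum_j log P(wj|ci) for every folder ci."""
    vocabulary = {w for counts in folder_counts.values() for w in counts}
    total_msgs = sum(folder_msgs.values())
    scores = {}
    for folder, counts in folder_counts.items():
        n = sum(counts.values())  # total words in this folder
        score = math.log(folder_msgs[folder] / total_msgs)  # assumed prior
        for w in doc_words:
            # m-estimate: unseen words get a small non-zero probability
            score += math.log((counts[w] + 1) / (n + len(vocabulary)))
        scores[folder] = score
    return scores


def classify(doc_words, folder_counts, folder_msgs):
    scores = log_scores(doc_words, folder_counts, folder_msgs)
    return max(scores, key=scores.get)


# Toy counts, for illustration only.
folder_counts = {
    "A": Counter(party=2, belch=1, yellow=2),
    "B": Counter(party=1, kick=1),
    "C": Counter(peep=2, yellow=3),
}
folder_msgs = {"A": 2, "B": 1, "C": 1}
```

Without the "+1" in the numerator, a single word never seen in a folder would drive that folder's probability to zero.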

Experimental Result
• Information about the e-mail corpora on which the classification experiments were conducted
• Four volunteers, including the author

Experimental Result
• Individual experiments with different settings
  1. Alpha lexer, stoplist used, header trimming, feature selection, no stemming
  2. Alpha-only lexer replaces alpha lexer
  3. White lexer replaces alpha lexer
  4. No stoplist is used
  5. Stemming is used
  6. No feature selection is used
  7. All headers are used for classification purposes

Experimental Result
• Three lexers
• Alpha lexer
  - Default lexer
  - Tokenizes strings of alphabetic characters
• Alpha-only lexer
  - Tokenizes only strings of alphabetic characters
  - Does not lex e-mail addresses into tokens
• White lexer
  - Tokenizes strings separated by whitespace
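The three lexers could be approximated with regular expressions as below. This is a guess at their behavior, not ifile's implementation: in particular, that the default alpha lexer keeps an e-mail address as a single token is inferred from the alpha-only lexer's description, and the `EMAIL` pattern is a simplification.

```python
import re

# Simplified e-mail address pattern (an assumption of this sketch).
EMAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+"


def alpha_lexer(text: str) -> list:
    """Alphabetic runs, with e-mail addresses kept whole (assumed behavior)."""
    return re.findall(EMAIL + r"|[A-Za-z]+", text)


def alpha_only_lexer(text: str) -> list:
    """Alphabetic runs only; e-mail addresses are split into their parts."""
    return re.findall(r"[A-Za-z]+", text)


def white_lexer(text: str) -> list:
    """Anything separated by whitespace."""
    return text.split()


sample = "Mail bob@example.com by 5pm!"
```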

Experimental Result
• Result
  - No single experimental setting provides the best results across all users
• Experiment with the highest average accuracy
  - Experiment #1 shows the best average result (89% accuracy)
  - Accuracy ranges from 86% to 91%

Experimental Result
• Time efficiency
  - Naïve Bayes: "fast enough"
  - 27 seconds to build a model from 7000+ e-mail messages (average 259 msg/second)
  - (tar-gzip of the same messages requires 17 seconds)
• Space efficiency
  - The classification model built on 7000+ messages across 49 folders requires only 447,090 bytes

Filtering Junk E-Mail
Soonyeon Kim

A Bayesian Approach to Filtering Junk E-Mail
• Authors
  - Mehran Sahami, Susan Dumais, David Heckerman, Eric Horvitz
  - Stanford University & Microsoft Research
• From
  - AAAI 98 (American Association for Artificial Intelligence)

Problems of Junk Mail
• Wasted time
  - Many users must now spend a non-trivial portion of their time dealing with unwanted messages
• Content
  - Some of these messages contain offensive material such as graphic pornography
• Space
  - Junk mail also quickly fills up file server storage space

Machine Learning Approach
• Learning
  - A system S learns from experience E with respect to a class of tasks T and a performance measure P
• Learning in junk-mail filtering
  - S: e-mail classifier
  - T: classify an e-mail message as junk/legitimate
  - P: fraction of correct predictions
  - E: a set of pre-classified e-mail messages
• Vector space model
  - Represents mail messages as feature vectors
  - Each e-mail message has a single fixed-length feature vector
  - An individual message can be represented as a binary vector denoting which words are present or absent (1 for present, 0 for absent)
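The binary vector representation can be sketched as follows. The three-word vocabulary and the whitespace tokenization are simplifications chosen for this example.

```python
def to_binary_vector(message: str, vocabulary: list) -> list:
    """1 if the vocabulary word occurs in the message, else 0."""
    present = set(message.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


# Illustrative vocabulary; position i of every vector refers to vocabulary[i].
vocabulary = ["money", "free", "meeting"]
vector = to_binary_vector("Free money now", vocabulary)
```

Every message maps to a vector of the same length, regardless of how long the message is.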

Bayesian Classifier
• An e-mail message is a vector of N features, X = X1, X2, X3, ..., XN
  - For example, X42 might be 'the e-mail contains "money"'
  - x42 = 0 means "the message described by x does not contain the word 'money'"
• Messages are classified into K classes: C = {c1, c2} = {junk, legit} (K = 2)
• Now suppose we see a new e-mail message with encoding x. We seek the probability that the class C is junk, Pr[C=junk | X=x], shorthand for Pr[C=junk | X1=x1 & X2=x2 & … & XN=xN]

Bayesian Networks
• (a) A Naïve Bayesian classifier
• (b) A more complex Bayesian classifier with limited dependencies between the features

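Bayes' Rule
• Bayes' theorem: $\Pr[C=c_k \mid X=x] = \frac{\Pr[X=x \mid C=c_k]\,\Pr[C=c_k]}{\Pr[X=x]}$
• Assume that each Xi is independent given the class: $\Pr[X=x \mid C=c_k] = \prod_i \Pr[X_i=x_i \mid C=c_k]$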
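Under the independence assumption, the junk posterior for a binary feature vector can be computed as in this sketch. The function name and the probability tables are invented for illustration, and every per-feature probability is assumed to lie strictly between 0 and 1.

```python
import math


def posterior_junk(x, prior_junk, p_on_junk, p_on_legit):
    """Pr[C=junk | X=x] for binary features, assuming independent Xi.

    p_on_junk[i]  = Pr[Xi=1 | junk]
    p_on_legit[i] = Pr[Xi=1 | legit]
    """
    def log_likelihood(p_on):
        return sum(
            math.log(p if xi else 1.0 - p) for xi, p in zip(x, p_on)
        )

    log_junk = math.log(prior_junk) + log_likelihood(p_on_junk)
    log_legit = math.log(1.0 - prior_junk) + log_likelihood(p_on_legit)
    # Normalize over the two classes; subtracting the max avoids underflow.
    top = max(log_junk, log_legit)
    junk, legit = math.exp(log_junk - top), math.exp(log_legit - top)
    return junk / (junk + legit)


# Two features, e.g. contains "money" and contains "meeting" (made-up numbers).
p = posterior_junk([1, 0], 0.5, [0.8, 0.1], [0.2, 0.5])
```

The denominator Pr[X=x] never has to be modeled directly: it is the sum of the two class numerators.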

Features
• Words
  - Fixed-width vector X = <X1, X2, …, Xn>
• Hand-crafted phrasal features
  - "FREE!", "only $" (as in "only $4.95"), and "be over 21"
  - 35 such hand-crafted phrases are included
• Domain-specific features
  - Domain type of the sender (.edu or .com)
  - Junk mail is usually not from the .edu domain
• Resolving familiar e-mail addresses
  - e.g. replace sdumais@microsoft.com with Susan Dumais
• Time
  - Most junk e-mail is sent at night

Features (2)
• Peculiar punctuation
  - Percentage of non-alphanumeric characters in the subject of a message
  - e.g. "$$$$$ MONEY $$$$$"
• Chart axes
  - X: subject has peculiar punctuation
  - Y: percentage of total messages
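One possible way to compute the punctuation feature is shown below. Whether whitespace counts as a non-alphanumeric character is not stated on the slide; this sketch assumes it is ignored, and the function name is hypothetical.

```python
def non_alnum_percentage(subject: str) -> float:
    """Percentage of non-alphanumeric characters, ignoring whitespace."""
    chars = [c for c in subject if not c.isspace()]
    if not chars:
        return 0.0
    peculiar = sum(1 for c in chars if not c.isalnum())
    return 100.0 * peculiar / len(chars)


junk_like = non_alnum_percentage("$$$$$ MONEY $$$$$")  # 10 of 15 characters
ordinary = non_alnum_percentage("Meeting at 10")
```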

Feature Selection
• Mutual information
  - Mutual information MI(A, B) is a numeric measure of what we can conclude about A if we know B, and vice versa
  - Example: if A and B are independent, we can't conclude anything: MI(A, B) = 0
  - Select the 500 features with the greatest mutual information
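A sketch of mutual-information feature selection, using the standard definition MI(X; C) = Σ P(x, c) log2 [P(x, c) / (P(x) P(c))] with probabilities estimated from counts. The function names, the base-2 logarithm, and the tiny data set are assumptions of this example.

```python
import math
from collections import Counter


def mutual_information(feature_values, labels) -> float:
    """MI in bits between one feature column and the class labels."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    p_x = Counter(feature_values)
    p_c = Counter(labels)
    mi = 0.0
    for (x, c), count in joint.items():
        p_xc = count / n
        mi += p_xc * math.log2(p_xc / ((p_x[x] / n) * (p_c[c] / n)))
    return mi


def top_k_features(columns: dict, labels, k: int) -> list:
    """Names of the k feature columns with the greatest MI with the class."""
    ranked = sorted(
        columns,
        key=lambda name: mutual_information(columns[name], labels),
        reverse=True,
    )
    return ranked[:k]


labels = ["junk", "junk", "legit", "legit"]
columns = {
    "money": [1, 1, 0, 0],  # perfectly predicts the class
    "the": [1, 0, 1, 0],    # independent of the class
}
best = top_k_features(columns, labels, 1)
```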

Evaluation
• Three ways
• Using domain-specific features
  - Words only
  - Words + phrases
  - Words + phrases + extra features
• Three-way categorization
  - 3 categories {porn-junk, other-junk, legit} instead of 2 categories {junk, legit}
• "Real" scenario

Using different features
• The cost of missing legitimate e-mail is much higher than the cost of inadvertently reading junk
• The authors wanted their system to be very "optimistic", so that it predicts "junk" only if it is very certain: it uses a threshold of 99.9%
• 1789 hand-tagged e-mail messages
  - 1578 junk
  - 211 legit
• Split into
  - 1538 training messages (86%)
  - 251 testing messages (14%)
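The 99.9% threshold can be read as a cost-sensitive decision rule. As a sketch: if wrongly discarding a legitimate message costs r times as much as letting a junk message through, expected cost is minimized by predicting "junk" only when the junk probability p exceeds r / (1 + r). The function names are hypothetical, and the 999:1 ratio is simply the value of r that yields 0.999 under this rule.

```python
JUNK_THRESHOLD = 0.999  # threshold stated on the slide


def threshold_from_cost_ratio(cost_ratio: float) -> float:
    """Probability above which predicting junk has lower expected cost.

    Predict junk when (1 - p) * cost_ratio < p, i.e. p > r / (1 + r).
    """
    return cost_ratio / (1.0 + cost_ratio)


def label(junk_probability: float, threshold: float = JUNK_THRESHOLD) -> str:
    """Predict junk only when the classifier is very certain."""
    return "junk" if junk_probability > threshold else "legit"
```

With this rule a message scored at 0.99 junk probability is still delivered as legitimate.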

Using different features
• Results of the experiment
  - Words only
  - Words + 35 phrasal features
  - Words + phrasal features + 20 non-textual domain-specific features

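Using different features
• Junk precision = A / (A + C)
• Junk recall = A / (A + B)
• Legit precision = D / (D + B)
• Legit recall = D / (D + C)
• As the formulas imply: A = junk classified as junk, B = junk classified as legit, C = legit classified as junk, D = legit classified as legit
• Junk precision is of greatest concern to most users, because they would not want their legitimate mail discarded as junk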
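The four measures can be computed directly from the confusion-matrix cells. The cell meanings used here are inferred from the formulas (A: junk classified as junk, B: junk classified as legit, C: legit classified as junk, D: legit classified as legit), and the counts in the example are made up, not results from the paper.

```python
def junk_legit_metrics(a: int, b: int, c: int, d: int) -> dict:
    """Precision and recall for both classes from the four cells A-D."""
    return {
        "junk_precision": a / (a + c),
        "junk_recall": a / (a + b),
        "legit_precision": d / (d + b),
        "legit_recall": d / (d + c),
    }


# Hypothetical counts for illustration only.
metrics = junk_legit_metrics(a=90, b=10, c=5, d=95)
```

Junk precision drops only when legitimate mail lands in the junk pile (cell C), which is why it is the measure users care about most.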

Using different features
• Precision/recall curves for junk mail

Sub-classes of junk E-mail
• Three-way categorization
  - 3 categories {porn-junk, other-junk, legit} instead of 2 categories {junk, legit}
• A classification counts as correct if a "junk" message is classified as either "porn-junk" or "other-junk"
• Unfortunately, it didn't work!
  - Probably because more parameters require (exponentially!) more data to estimate them accurately
  - Some features may be very clearly indicative of junk versus legitimate mail, but not powerful across three categories (they do not distinguish well between the sub-classes of junk)

Sub-classes of junk E-mail
• Precision/recall curves considering sub-groups of junk mail

Real Usage Scenario
• Three kinds of messages
  1. Read and keep
  2. Read and discard (e.g. a joke from a friend)
  3. Junk
• Result
  - Misclassified mail: news stories from an e-mail news service that the user subscribes to (no loss of significant information)

Real Usage Scenario
• Precision/recall curves in the real usage scenario
