Document Classification Comparison
Document Classification Comparison
Evangel Sarwar, Josh Woolever, Rebecca Zimmerman
Overview
• What we did
• How we did it
• Results
• Why this matters
• Conclusions
• Questions?
What did we do?
• Compared the document classification accuracy of three pieces of software on data from 20 newsgroups:
• Rainbow (Naïve Bayes)
• C4.5 (decision tree)
• Neural network (back-propagation)
• Initially planned to take a single document and locate other documents similar to it
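The Naïve Bayes approach Rainbow uses can be sketched in a few lines: a multinomial Naïve Bayes classifier over word counts with Laplace smoothing. This is a generic illustration of the technique, not Rainbow's actual implementation; the function names are our own.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (label, list-of-words) pairs.
    Returns class priors, per-class word counts, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in docs:
        class_docs[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    total = sum(class_docs.values())
    priors = {c: n / total for c, n in class_docs.items()}
    return priors, word_counts, vocab

def classify_nb(words, priors, word_counts, vocab):
    """Pick the class maximizing log P(c) + sum over words of log P(w|c),
    with add-one (Laplace) smoothing to avoid zero probabilities."""
    best, best_score = None, float("-inf")
    v = len(vocab)
    for c, prior in priors.items():
        total_c = sum(word_counts[c].values())
        score = math.log(prior)
        for w in words:
            score += math.log((word_counts[c][w] + 1) / (total_c + v))
        if score > best_score:
            best, best_score = c, score
    return best
```

The same word-count features can then be fed to all three classifiers, which is what makes the comparison fair.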
How did we do it?
• Used Rainbow as the benchmark
• Used it to create a model of the data
• Trained and tested it with a common set of data
• Used Perl scripts to separate the data into training/testing sets and create input files for C4.5 and the neural network software
• Rainbow's ability to output word counts for the top N words was used to create the input files
• Initially wanted to use word probabilities, but Rainbow can only compute these for classes, not single documents
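The preprocessing pipeline above can be sketched in Python (a hypothetical reimplementation for illustration, not the original Perl scripts): split the labeled documents into training/testing sets, pick the top N words from the data, and turn each document into a count vector suitable as an input row for C4.5 or a neural network.

```python
import random
from collections import Counter

def split_train_test(docs, test_frac=0.2, seed=0):
    """Shuffle (label, words) pairs and split them into train/test sets."""
    rng = random.Random(seed)
    docs = docs[:]
    rng.shuffle(docs)
    cut = int(len(docs) * (1 - test_frac))
    return docs[:cut], docs[cut:]

def top_n_features(docs, n):
    """Pick the n most frequent words overall (the 'top N words')."""
    totals = Counter()
    for _, words in docs:
        totals.update(words)
    return [w for w, _ in totals.most_common(n)]

def to_count_vector(words, features):
    """One input row: the count of each feature word in the document."""
    counts = Counter(words)
    return [counts[w] for w in features]
```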
How did we do it? (continued)
• Modified the image neural network from a previous assignment so that it would look at documents instead of images
• Needed 20 output nodes, one for each newsgroup
• Took in 1000 words (initially, at least)
• Started with the default number of hidden nodes (4) and went up to approximately 2000 (2x the number of inputs)
• http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-10.html
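The kind of network described above can be sketched as a one-hidden-layer back-propagation net with sigmoid units. This is a generic illustration with toy dimensions, not the actual assignment code; the real net had roughly 1000 inputs, 20 outputs (one per newsgroup), and 4 to ~2000 hidden nodes.

```python
import math
import random

class BackpropNet:
    """One hidden layer of sigmoid units, trained with plain back-propagation
    on squared error. Layer sizes: n_in -> n_hidden -> n_out."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Each row includes one extra weight for the bias input.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                   for _ in range(n_out)]

    @staticmethod
    def _sig(x):
        return 1.0 / (1.0 + math.exp(-x))

    def forward(self, x):
        xb = x + [1.0]  # append bias input
        h = [self._sig(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        hb = h + [1.0]
        o = [self._sig(sum(w * v for w, v in zip(row, hb))) for row in self.w2]
        return h, o

    def train_step(self, x, target, lr=0.5):
        h, o = self.forward(x)
        xb, hb = x + [1.0], h + [1.0]
        # Output deltas: error times sigmoid derivative o*(1-o).
        d_out = [(t - ok) * ok * (1 - ok) for t, ok in zip(target, o)]
        # Hidden deltas: back-propagate the output deltas through w2.
        d_hid = [hj * (1 - hj) * sum(d_out[k] * self.w2[k][j]
                                     for k in range(len(d_out)))
                 for j, hj in enumerate(h)]
        for k, dk in enumerate(d_out):
            for j in range(len(hb)):
                self.w2[k][j] += lr * dk * hb[j]
        for j, dj in enumerate(d_hid):
            for i in range(len(xb)):
                self.w1[j][i] += lr * dj * xb[i]
```

With 1000 inputs and up to 2000 hidden units, each epoch does on the order of millions of weight updates per document, which is consistent with the very long training times reported in the results.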
Results
• The decision tree software achieved between 15% and 40% accuracy (depending on whether the tree was pruned and whether training or test data was used)
• Training set was about 17% after pruning
• Test set was about 40% after pruning
• The neural network proved much more difficult than we first thought
• Very slow (on the full training data, approximately 1 hour per epoch on a 1.2 GHz Linux machine)
• Accuracy did not increase over many trials
• Spent a great amount of time experimenting with the various parameters: learning rate, momentum, hidden units
• Never got better than about 5% accuracy
Results (continued)
• Rainbow: approximately 80% accuracy
• C4.5 and Rainbow made similar errors:
• Misclassified documents within similar groups:
• alt.atheism, talk.religion.misc, talk.politics.misc
• comp.*
Why is text classification important?
• Spam detection
• General mail filtering into folders
• Automatically placing documents at the proper location in a file system
Conclusions
• Naïve Bayes empirically seems to be the best for classifying documents, at least for newsgroup data
• It still made errors similar to C4.5, which used only word counts
• If we had pre-processed the data better, perhaps removing outliers and normalizing the information, we could have gotten better results with the neural network
• Word counts are not enough to "specify" a document; C4.5 seemed to create a tree that did not generalize well to the test data
• Neural networks are definitely not "plug and chug": every application is specific and needs specific parameters
• Hard to know how much data to use, or how many features
• Most people don't have 10,000 emails to "train" with
• Should investigate a minimum threshold for getting accurate results
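The normalization step suggested in the conclusions could be as simple as rescaling each word-count vector to relative frequencies, so that long documents do not dominate the network's inputs. This is a hypothetical preprocessing helper illustrating the idea, not something the project actually ran:

```python
def normalize_counts(vec):
    """Scale a word-count vector to relative frequencies (sums to 1);
    an all-zero vector is returned unchanged."""
    total = sum(vec)
    return [v / total for v in vec] if total else vec
```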
Fin.
• Questions?