Evaluation of Decision Forests on Text Categorization

Text Categorization • Text Collection • Feature Extraction • Classification • Evaluation

Text Collection • Reuters • Newswires from Reuters in 1987 • Training set: 9603 • Test set: 3299 • Categories: 95 • OHSUMED • Abstracts from medical journals • Training set: 12327 • Test set: 3616 • Categories: 75 (within Heart Disease subtree)

Feature Extraction • Stop Word Removal • 430 stop words • Stemming • Porter’s stemmer • Term Selection • by Document Frequency • Category independent selection • Category dependent selection • Feature Extraction • TF × IDF
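The term selection and weighting steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the experiments: the function names, the toy documents, and the exact weighting formula tf * log(N / df) are assumptions, and stop word removal and stemming are taken to have been applied already.

```python
import math
from collections import Counter

def select_terms_by_df(docs, top_n):
    """Keep the top_n terms with the highest document frequency (DF)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    # Sort by descending DF, then alphabetically so the order is deterministic.
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return ranked[:top_n], df

def tfidf_vectors(docs, vocab, df):
    """Weight each selected term by tf * log(N / df) (assumed variant)."""
    n_docs = len(docs)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vectors

# Hypothetical, already stemmed and stop-word-filtered documents.
docs = [
    ["oil", "price", "rise"],
    ["oil", "export", "fall"],
    ["heart", "disease", "oil"],
]
vocab, df = select_terms_by_df(docs, top_n=3)
vectors = tfidf_vectors(docs, vocab, df)
```

Note that a term appearing in every document ("oil" here) gets weight zero under this formula. Category dependent selection would run the same DF ranking over the documents of one category only, rather than over the whole collection.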

Classification • Method • Each document may belong to multiple categories • Treating each category as a separate classification problem • Binary classification • Classifiers • kNN (k Nearest Neighbor) • C4.5 (Quinlan) • Decision Forest
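Treating each category as its own binary problem amounts to building one 0/1 target vector per category from the multi-label assignments. A small sketch, with a hypothetical helper name and made-up category labels:

```python
def binary_targets(doc_categories, categories):
    """Build one binary target list per category from multi-label assignments."""
    return {
        category: [1 if category in assigned else 0 for assigned in doc_categories]
        for category in categories
    }

# Each entry is the set of categories assigned to one document (illustrative).
doc_categories = [{"earn"}, {"grain", "wheat"}, set()]
targets = binary_targets(doc_categories, ["earn", "grain", "wheat"])
```

A separate classifier is then trained on each target list, so a document can be accepted by several categories or by none.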

C4.5 • A method to build decision trees • Training • Grow the tree by splitting the data set • Prune the tree back to prevent over-fitting • Testing • Test vector goes down the tree and arrives at a leaf. • Probability that the vector belongs to each category is estimated.

Decision Forest • Consists of many decision trees combined by averaging the class probability estimates at the leaves. • Each tree is constructed in a randomly chosen (coordinate) subspace of the feature space. • An oblique hyperplane is used as a discriminator at each internal node of the trees.
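The two ensemble ideas on this slide, random coordinate subspaces and averaging of leaf probability estimates, can be illustrated with a deliberately simplified sketch. It is not the decision forest evaluated here: each "tree" is a single axis-parallel split (a stump) rather than a full tree with oblique hyperplanes, and all names and data are hypothetical.

```python
import random

def train_stump(X, y, features):
    """Find the best single threshold split using only the given features."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in X}):
            left = [label for row, label in zip(X, y) if row[f] <= threshold]
            right = [label for row, label in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue
            # Leaf estimate: fraction of positive training examples in the leaf.
            p_left, p_right = sum(left) / len(left), sum(right) / len(right)
            # Training errors if each leaf predicts its majority class.
            errors = (min(sum(left), len(left) - sum(left))
                      + min(sum(right), len(right) - sum(right)))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, p_left, p_right)
    if best is None:  # no usable split: fall back to the class prior
        prior = sum(y) / len(y)
        return (features[0], float("inf"), prior, prior)
    return best[1:]

def train_forest(X, y, n_trees, subspace_dim, seed=0):
    """Train each stump in its own randomly chosen coordinate subspace."""
    rng = random.Random(seed)
    n_features = len(X[0])
    return [train_stump(X, y, rng.sample(range(n_features), subspace_dim))
            for _ in range(n_trees)]

def predict_proba(forest, row):
    """Average the per-tree positive-class probability estimates."""
    estimates = [p_left if row[f] <= threshold else p_right
                 for f, threshold, p_left, p_right in forest]
    return sum(estimates) / len(estimates)

# Toy binary problem in which every feature separates the two classes.
X = [[0, 1, 0, 1], [1, 0, 1, 0], [2, 3, 2, 3], [3, 2, 3, 2]]
y = [0, 0, 1, 1]
forest = train_forest(X, y, n_trees=5, subspace_dim=2)
```

Replacing `train_stump` with a full tree learner that splits on oblique hyperplanes would bring the sketch closer to the method described above; the subspace sampling and the averaging step would stay the same.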

Why choose these 3 classifiers? • We do not have a parametric model for the problem (we cannot assume Gaussian distributions etc.) • kNN and decision trees (C4.5) are the most popular nonparametric classifiers; we use them as baselines for comparison. • We expect the decision forest to do well because this is a high-dimensional problem, a setting in which it has performed well in previous studies.

Evaluation • Measurements • Precision p = a / (a+b) • Recall r = a / (a+c) • F1 value F1 = 2rp / (r+p) • Tradeoff between Precision and Recall • kNN tends to have higher precision than recall, especially when k becomes larger.
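The three measurements can be computed directly from the counts. The slide does not define a, b and c; the sketch below assumes the usual contingency-table reading (a: documents correctly assigned to the category, b: documents incorrectly assigned, c: documents incorrectly rejected). The zero-denominator handling is also an assumption.

```python
def precision_recall_f1(a, b, c):
    """Return (precision, recall, F1) from contingency counts.

    Assumed meaning: a = true positives, b = false positives,
    c = false negatives. Undefined ratios are reported as 0.0.
    """
    p = a / (a + b) if a + b else 0.0
    r = a / (a + c) if a + c else 0.0
    f1 = 2 * r * p / (r + p) if r + p else 0.0
    return p, r, f1

# Illustrative counts: 8 correct assignments, 2 false alarms, 8 misses.
p, r, f1 = precision_recall_f1(8, 2, 8)
```

With these counts precision is 0.8 and recall is 0.5, the pattern of higher precision than recall that the slide attributes to kNN; F1 lies between the two and closer to the lower value.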

Averaging scores • Macro-averaging • Calculate precision/recall for each category • Average all the precision/recall values • Assigns equal weight to each category • Micro-averaging • Sum the classification decisions over all documents • Calculate precision/recall from the summations • Assigns equal weight to each document • Micro-averaging was used in the experiments because the number of documents in each category varies considerably.
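The difference between the two averaging schemes shows up clearly when one category is much larger than another. A minimal sketch, assuming each category is summarized by counts (a, b, c) read as true positives, false positives and false negatives; the counts are invented for illustration.

```python
def precision(a, b):
    return a / (a + b) if a + b else 0.0

def recall(a, c):
    return a / (a + c) if a + c else 0.0

def macro_average(counts):
    """Average the per-category scores: every category weighs the same."""
    ps = [precision(a, b) for a, b, c in counts]
    rs = [recall(a, c) for a, b, c in counts]
    return sum(ps) / len(ps), sum(rs) / len(rs)

def micro_average(counts):
    """Pool the counts first: every document decision weighs the same."""
    a = sum(x[0] for x in counts)
    b = sum(x[1] for x in counts)
    c = sum(x[2] for x in counts)
    return precision(a, b), recall(a, c)

# One large, easy category and one small, hard category (made-up numbers).
counts = [(90, 10, 10), (1, 1, 3)]
macro_p, macro_r = macro_average(counts)
micro_p, micro_r = micro_average(counts)
```

Here the macro-averaged scores are pulled down by the small category, while the micro-averaged scores stay close to those of the large one.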

Performance in F1 Value

Comparison between Classifiers • Decision Forest better than C4.5 and kNN • In category dependent case, C4.5 better than kNN • In category independent case, kNN better than C4.5

Category Dependent vs. Independent method • For Decision Forest and C4.5, category dependent better than independent. • But for kNN, category independent better than dependent. • No obvious explanation found.

Reuters vs. OHSUMED • All classifiers degrade from Reuters to OHSUMED • kNN degrades faster (26%) than C4.5 (12%) and DF (12%)

Reuters vs. OHSUMED • OHSUMED is a harder problem because: • Documents are more evenly distributed across categories • This even distribution hurts kNN's recall more than that of the other classifiers, because a fixed-size neighborhood then contains more competing classes.

Conclusion • Decision Forest is substantially better than C4.5 and kNN in text categorization • Difficult to compare with results of other classifiers outside this experiment, because of • Different ways of splitting the training/test sets • Different term selection methods
