Continuous Attributes: Computing GINI Index / 2. Measure of Impurity: Entropy. Computing Entropy of a Single Node. Computing Information Gain After Splitting. Problems with Information Gain. Gain Ratio. Measure of Impurity: Classification Error. Computing Error of a Single Node.
Comparison among Impurity Measures • For binary (2-class) classification problems
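The comparison on this slide can be reproduced numerically: for a binary problem with p = P(class 1), each impurity measure is a simple function of p, and all three peak at p = 0.5 and vanish at p = 0 and p = 1. A minimal Python sketch (the function names and sample p values are illustrative, not from the slides):

```python
import math

def gini(p):
    # Gini = 1 - sum_i p_i^2
    return 1.0 - (p**2 + (1.0 - p)**2)

def entropy(p):
    # Entropy = -sum_i p_i * log2(p_i), with 0*log2(0) taken as 0
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def classification_error(p):
    # Error = 1 - max_i p_i
    return 1.0 - max(p, 1.0 - p)

for p in [0.0, 0.1, 0.3, 0.5]:
    print(f"p={p:.1f}  gini={gini(p):.3f}  "
          f"entropy={entropy(p):.3f}  error={classification_error(p):.3f}")
```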
Example: C4.5 • Simple depth-first construction. • Uses Information Gain. • Sorts continuous attributes at each node. • Needs the entire data set to fit in memory. • Unsuitable for large data sets; would need out-of-core sorting. • You can download the software from: http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
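To illustrate the "sorts continuous attributes at each node" point above, here is a minimal sketch (in Python rather than Quinlan's C, and not his actual implementation) of how a C4.5-style learner scores candidate thresholds on one continuous attribute by information gain; the toy data is hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy of a label multiset: -sum_i p_i * log2(p_i)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, information gain) of the best binary split."""
    pairs = sorted(zip(values, labels))   # sort records by attribute value
    parent = entropy(labels)
    n = len(pairs)
    best = (None, 0.0)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = (parent
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Hypothetical toy data: a continuous attribute vs. a binary class.
print(best_split([60, 70, 75, 85, 90, 95, 100],
                 ["no", "no", "no", "yes", "yes", "yes", "no"]))
```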
Scalable Decision Tree Induction / 1 • How scalable is decision tree induction? • Classic algorithms are only practical for small data sets. • SLIQ (EDBT'96, Mehta et al.) • Builds an index (attribute list) for each attribute; only the class list and the current attribute list reside in memory.
Scalable Decision Tree Induction / 2 • SLIQ • [Figure: sample data for the class buys_computer, showing the disk-resident attribute lists and the memory-resident class list for record ids 0–6]
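A minimal sketch of the SLIQ layout the figure shows: one attribute list per attribute, pre-sorted by value and pairing each value with its record id (rid), plus a single memory-resident class list. The toy records are hypothetical, and in the real system the attribute lists are disk-resident:

```python
records = [  # rid: (age, income, buys_computer)
    (25, 40_000, "no"), (35, 60_000, "yes"), (45, 95_000, "yes"),
    (30, 45_000, "no"), (50, 30_000, "yes"), (28, 70_000, "yes"),
    (40, 20_000, "no"),
]

# Attribute lists: (attribute value, rid), each sorted once by value.
age_list = sorted((age, rid) for rid, (age, _, _) in enumerate(records))
income_list = sorted((inc, rid) for rid, (_, inc, _) in enumerate(records))

# Memory-resident class list: rid -> [class label, current leaf node].
class_list = {rid: [cls, "root"] for rid, (_, _, cls) in enumerate(records)}

# Scanning an attribute list in sorted order lets split evaluation look up
# each record's class and current node via the class list, without ever
# re-sorting the data.
for age, rid in age_list:
    cls, leaf = class_list[rid]
    print(f"age={age:2d}  rid={rid}  class={cls}  leaf={leaf}")
```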
Decision Tree Based Classification • Advantages • Inexpensive to construct • Extremely fast at classifying unknown records • Easy to interpret for small-sized trees • Accuracy is comparable to other classification techniques for many data sets • Practical Issues of Classification • Underfitting and Overfitting • Missing Values • Costs of Classification