Continuous Attributes: Computing GINI Index / 2. Measure of Impurity: Entropy. Computing Entropy of a Single Node. Computing Information Gain After Splitting. Problems with Information Gain. Gain Ratio. Measure of Impurity: Classification Error. Computing Error of a Single Node.
Comparison among Impurity Measures • For binary (2-class) classification problems
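The comparison on this slide can be reproduced numerically: for a binary problem with p = P(class 1), each impurity measure is a simple function of p, and all three peak at p = 0.5 and vanish at p = 0 and p = 1. A minimal Python sketch (the function names and sample p values are illustrative, not from the slides):

```python
import math

def gini(p):
    # Gini = 1 - sum_i p_i^2
    return 1.0 - (p**2 + (1.0 - p)**2)

def entropy(p):
    # Entropy = -sum_i p_i * log2(p_i), with 0*log2(0) taken as 0
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def classification_error(p):
    # Error = 1 - max_i p_i
    return 1.0 - max(p, 1.0 - p)

for p in [0.0, 0.1, 0.3, 0.5]:
    print(f"p={p:.1f}  gini={gini(p):.3f}  "
          f"entropy={entropy(p):.3f}  error={classification_error(p):.3f}")
```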
Example: C4.5 • Simple depth-first construction. • Uses Information Gain. • Sorts continuous attributes at each node. • Needs the entire data set to fit in memory. • Unsuitable for large data sets; would need out-of-core sorting. • You can download the software from: http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
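To illustrate the "sorts continuous attributes at each node" point above, here is a minimal sketch (in Python rather than Quinlan's C, and not his actual implementation) of how a C4.5-style learner scores candidate thresholds on one continuous attribute by information gain; the toy data is hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy of a label multiset: -sum_i p_i * log2(p_i)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, information gain) of the best binary split."""
    pairs = sorted(zip(values, labels))   # sort records by attribute value
    parent = entropy(labels)
    n = len(pairs)
    best = (None, 0.0)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = (parent
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Hypothetical toy data: a continuous attribute vs. a binary class.
print(best_split([60, 70, 75, 85, 90, 95, 100],
                 ["no", "no", "no", "yes", "yes", "yes", "no"]))
```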
Scalable Decision Tree Induction / 1 • How scalable is decision tree induction? • Classic algorithms are only practical for small data sets. • SLIQ (EDBT'96, Mehta et al.) • Builds an index (attribute list) for each attribute; only the class list and the current attribute list reside in memory.
Scalable Decision Tree Induction / 2 • SLIQ • [Figure: sample data for the class buys_computer, showing the disk-resident attribute lists and the memory-resident class list for record ids 0–6]
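A minimal sketch of the SLIQ layout the figure shows: one attribute list per attribute, pre-sorted by value and pairing each value with its record id (rid), plus a single memory-resident class list. The toy records are hypothetical, and in the real system the attribute lists are disk-resident:

```python
records = [  # rid: (age, income, buys_computer)
    (25, 40_000, "no"), (35, 60_000, "yes"), (45, 95_000, "yes"),
    (30, 45_000, "no"), (50, 30_000, "yes"), (28, 70_000, "yes"),
    (40, 20_000, "no"),
]

# Attribute lists: (attribute value, rid), each sorted once by value.
age_list = sorted((age, rid) for rid, (age, _, _) in enumerate(records))
income_list = sorted((inc, rid) for rid, (_, inc, _) in enumerate(records))

# Memory-resident class list: rid -> [class label, current leaf node].
class_list = {rid: [cls, "root"] for rid, (_, _, cls) in enumerate(records)}

# Scanning an attribute list in sorted order lets split evaluation look up
# each record's class and current node via the class list, without ever
# re-sorting the data.
for age, rid in age_list:
    cls, leaf = class_list[rid]
    print(f"age={age:2d}  rid={rid}  class={cls}  leaf={leaf}")
```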
Decision Tree Based Classification • Advantages • Inexpensive to construct • Extremely fast at classifying unknown records • Easy to interpret for small-sized trees • Accuracy is comparable to other classification techniques for many data sets • Practical Issues of Classification • Underfitting and Overfitting • Missing Values • Costs of Classification