
Computational Intelligence: Methods and Applications


Presentation Transcript


1. Computational Intelligence: Methods and Applications
Lecture 19: Pruning of decision trees
Source: Włodzisław Duch; Dept. of Informatics, UMK; Google: W Duch

2. Pruning
How to avoid overfitting and deal with noise in the data?
• Stop splitting a node if the number of samples is too small to make reliable decisions.
• Stop if the proportion of samples from a single class (node purity) exceeds a given threshold: forward pruning.
• Create a tree that fits all the data and then simplify it: backward pruning.
• Prune to improve the results on a validation set or on a cross-validation test partition, not on the training set!
• Use the MDL (Minimum Description Length) principle: minimize Size(Tree) + Size(Errors(Tree)), i.e. the cost of encoding the tree plus the cost of encoding its errors.
• Evaluate splits by looking not at the effects on the next level but a few levels deeper, using beam search instead of best-first search; for small trees exhaustive search is possible.
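For illustration, a minimal sketch of backward (reduced-error) pruning against a separate validation set. The Node structure and helper names are invented for this example, not taken from the lecture or any library:

    # Toy tree node: a leaf stores only its majority class; an internal node
    # stores a test (a function mapping a sample to a child index) and children.
    class Node:
        def __init__(self, test=None, children=None, majority=None):
            self.test = test
            self.children = children or []
            self.majority = majority

    def predict(node, x):
        while node.test is not None:
            node = node.children[node.test(x)]
        return node.majority

    def accuracy(node, data):
        if not data:
            return 1.0
        return sum(predict(node, x) == y for x, y in data) / len(data)

    def prune(node, val_data):
        # Bottom-up reduced-error pruning on the validation samples reaching each node.
        if node.test is None:
            return node
        buckets = [[] for _ in node.children]
        for x, y in val_data:
            buckets[node.test(x)].append((x, y))
        node.children = [prune(child, b) for child, b in zip(node.children, buckets)]
        leaf = Node(majority=node.majority)
        # Replace the subtree by a leaf if validation accuracy does not drop.
        return leaf if accuracy(leaf, val_data) >= accuracy(node, val_data) else node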

3. DT for breast cancer
Keep only the most important nodes, even if they are not pure, and prune the rest.

  4. DT logical rules Each path from the root to a leaf in DT is a rule: the number of rules is equal to the number of rules, but some conditions and rules may be spurious and should be deleted.

5. General tree algorithm
TDIDT - Top-Down Induction of Decision Trees

function DT(D: training set) returns Tree;
    Tree' := construct_tree(D);
    Tree  := prune_tree(Tree');
    return Tree;

function construct_tree(D: training set) returns Tree;
    T := generate_test_results(D);
    t := select_best_test(T, D);
    P := partition of D induced by the test t;
    if stop_condition(D, P)
        then return Tree = leaf(info(D))
        else
            for all Dj in P: tj := construct_tree(Dj);
            return node(t, {(j, tj)});
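A rough Python transcription of this skeleton. The helper functions (generate_tests, select_best_test, partition, stop_condition, class_info, prune_tree) stand for the abstract steps named on the slide; they are assumptions and have to be supplied by the caller:

    # Sketch of the TDIDT skeleton above, not the lecture's actual implementation.
    def construct_tree(D, generate_tests, select_best_test, partition,
                       stop_condition, class_info):
        tests = generate_tests(D)                  # candidate tests T
        t = select_best_test(tests, D)             # best test t
        P = partition(D, t)                        # {outcome j: subset Dj}
        if stop_condition(D, P):
            return ('leaf', class_info(D))         # leaf stores the class distribution
        children = {j: construct_tree(Dj, generate_tests, select_best_test,
                                      partition, stop_condition, class_info)
                    for j, Dj in P.items()}
        return ('node', t, children)

    def decision_tree(D, prune_tree, **helpers):
        # Build the full tree, then simplify it (backward pruning).
        return prune_tree(construct_tree(D, **helpers))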

6. ID3
ID3: Iterative Dichotomiser, version 3, a descendant of CLS (Concept Learning System); R. Quinlan (1986).
Works only with nominal attributes; for numerical attributes a separate discretization step is needed.
Splits are selected using the information gain criterion Gain(D,X).
A node is divided into as many branches as the number of unique values of attribute X.
ID3 places attributes with high information gain near the root, leading to locally optimal trees that may be globally sub-optimal. No pruning is used.
The ID3 algorithm evolved into the very popular C4.5 tree (and C5, its commercial version).

7. C4.5 algorithm
One of the most popular machine learning algorithms (Quinlan 1993).
• TDIDT tree construction algorithm; several variants of the algorithm are in use, but textbooks do not describe it well.
• Tests: X = ? for nominal values; X < t for numerical values, with t = (Xi + Xi+1)/2 (only those pairs of consecutive X values need to be checked where the class changes).
• Evaluation criterion: the information gain ratio GR(D, X).
• I(D) is the information (entropy) contained in the class distribution of D: I(D) = -Σc P(ωc|D) log2 P(ωc|D).

8. C4.5 criterion
Information gain: compute the information in the parent node and in the children nodes created by the split, and subtract the children's information weighted by the fraction of data that falls into each of the k children:
Gain(D, X) = I(D) - Σk (|Dk|/|D|) I(Dk)
The information gain ratio GR(D, X) is the information gain divided by the amount of information in the split-generated data distribution:
GR(D, X) = Gain(D, X) / IS(D, X), where IS(D, X) = -Σk (|Dk|/|D|) log2(|Dk|/|D|)
Why a ratio? To avoid a preference for attributes with many values: IS grows for splits with many children, which lowers the gain ratio of such attributes.
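A small Python sketch of these formulas; here D is represented simply as a list of class labels and a split as one list of labels per child (the function names are illustrative):

    import math

    def entropy(labels):
        n = len(labels)
        counts = {c: labels.count(c) for c in set(labels)}
        return -sum((k / n) * math.log2(k / n) for k in counts.values())

    def info_gain(labels, splits):
        n = len(labels)
        return entropy(labels) - sum(len(s) / n * entropy(s) for s in splits)

    def split_info(labels, splits):
        n = len(labels)
        return -sum(len(s) / n * math.log2(len(s) / n) for s in splits if s)

    def gain_ratio(labels, splits):
        si = split_info(labels, splits)
        return info_gain(labels, splits) / si if si > 0 else 0.0

    # Example: a binary split of 10 samples from two classes.
    labels = list("AAAAABBBBB")
    splits = [list("AAAAB"), list("ABBBB")]
    print(gain_ratio(labels, splits))   # about 0.278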

9. CHAID
CHi-squared Automatic Interaction Detection, available in SPSS (but not in WEKA); one of the most popular trees in data mining packages.
The split criterion for attribute X is based on the χ² test, which measures the correlation between two distributions.
For a test, e.g. selecting a threshold X < X0 (or a value X = X0) for each attribute, a distribution of classes N(ωc | Test=True) is obtained; it forms a contingency table: classes vs. tests.
If there is no correlation with the class distribution then
P(ωc, Test=True) = P(ωc) P(Test=True), so the expected counts are eij = N(ωc) N(Test=True) / N.
Compare the actual counts nij obtained for each test with these expectations eij; if they match well, the test is not worth much. The χ² statistic measures the probability that they match by chance:
χ² = Σij (nij - eij)² / eij
Select the tests with the largest χ² value.

10. CHAID example
Hypothesis: the test result X < X0 (or X = X0) is correlated with the class distribution; then the probability returned by the χ² test, that the observed disagreement with independence arose by chance, is small (see Numerical Recipes).
Expected counts: eij = Ni0 × Ngj / N.
The χ² distribution has k = (number of test outcomes - 1) × (number of classes - 1) degrees of freedom.
Example: class = species, X = tail length; the contingency table counts each species in each tail-length bin.
The probability P() that the disagreement is not accidental is given by erf, the error function (integrated Gaussian).
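For reference, the χ² statistic, p-value, degrees of freedom and the expected counts eij for a class-vs-test contingency table can be obtained with scipy; the counts below are invented for illustration:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows: test outcome (e.g. X < X0 true/false); columns: classes.
    table = np.array([[30,  5,  2],
                      [ 4, 25, 20]])
    chi2, p_value, dof, expected = chi2_contingency(table)
    # expected[i, j] = row_total_i * column_total_j / N, as on the slide.
    print(chi2, p_value, dof)   # a small p-value => the split is correlated with the class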

11. CART
Classification and Regression Trees (Breiman et al., 1984).
Split criterion: instead of information, uses the change in the Gini index. If in a given node pc is the fraction of samples from class ωc, the node impurity is:
Gini(node) = 1 - Σc pc²
Another possibility is the misclassification rate: Mi(node) = 1 - maxc pc.
Stop criterion: MDL with a parameter α, a trade-off between complexity and accuracy: minimize α × (tree complexity) + (sum of leaf impurities), where the impurity may be taken as 2×Gini, 2×Mi, or the entropy.
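A small sketch of the impurity measures mentioned above, computed from the class counts in a node (function names are illustrative):

    import math

    def fractions(counts):
        n = sum(counts)
        return [c / n for c in counts]

    def gini(counts):
        return 1.0 - sum(p * p for p in fractions(counts))

    def misclassification(counts):
        return 1.0 - max(fractions(counts))

    def entropy(counts):
        return -sum(p * math.log2(p) for p in fractions(counts) if p > 0)

    # A node with 40 samples of one class and 10 of another:
    print(gini([40, 10]), misclassification([40, 10]), entropy([40, 10]))

For comparison, scikit-learn's DecisionTreeClassifier exposes a similar trade-off through criterion='gini' or 'entropy' and cost-complexity pruning via its ccp_alpha parameter.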

12. Other trees
See the Wikipedia entry on decision trees. Several interesting trees for classification and regression are on the page of Wei-Yin Loh. YaDT: Yet another Decision Tree builder. RuleQuest offers See5, a new version of C4.5, with some comparisons; only a demo version is available. Look also at their Magnum Opus software to discover interesting patterns – association rules and k-optimal rule discovery. Occam Tree from Visionary Tools.

13. Computational Intelligence: Methods and Applications
Lecture 20: SSV & other trees
Source: Włodzisław Duch; Dept. of Informatics, UMK; Google: W Duch

14. GhostMiner Philosophy
• GhostMiner, data mining tools from our lab: http://www.fqspl.com.pl/ghostminer/ (or type "GhostMiner" into Google).
• Separate the process of model building and knowledge discovery from model use => GhostMiner Developer & GhostMiner Analyzer.
• There is no free lunch – provide different types of tools for knowledge discovery: decision tree, neural, neurofuzzy, similarity-based, committees.
• Provide tools for visualization of data.
• Support the process of knowledge discovery, model building and evaluation, organizing it into projects.

15. SSV in GM
SSV = Separability Split Value, a simple criterion measuring how many pairs of samples from different classes are correctly separated.
Define subsets of the data D using a binary test f(X, s) that splits the data into a left and a right subset, D = LS ∪ RS.
Tests are defined on a vector X, usually on a single attribute Xi: for continuous values Xi is compared with a threshold s, f(X, s) = T ⇔ Xi < s; for discrete attributes the test checks membership in a subset of values.
Another type of test, giving quite different shapes of decision borders, is based on distances from prototypes.

16. SSV criterion
Separability = the number of samples in the LS subset that are from class ωc, times the number of elements in RS from all the remaining classes, summed over all classes.
If several tests/thresholds separate the same number of pairs (this may happen for discrete attributes), select the one that separates the lowest number of pairs from the same class.
SSV is maximized; the first part should dominate, hence the factor 2.
Simple to compute; creates the full tree using a top-down algorithm with best-first search or beam-search procedures to find better trees.
Uses cross-validation training to select nodes for backward pruning.
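A rough sketch of this criterion following the description above: pairs of samples from different classes separated by the split count positively (doubled), while samples of the same class split apart enter as a penalty/tie-breaker; the exact published SSV formula should be checked in the SSV papers:

    from collections import Counter

    def ssv(left_labels, right_labels):
        nl, nr = Counter(left_labels), Counter(right_labels)
        n_right = sum(nr.values())
        classes = set(nl) | set(nr)
        # Pairs from different classes separated by the split, summed over classes.
        separated = sum(nl[c] * (n_right - nr[c]) for c in classes)
        # Penalty for splitting samples of the same class apart (tie-breaking term).
        same_class = sum(min(nl[c], nr[c]) for c in classes)
        return 2 * separated - same_class

    def best_threshold(values, labels):
        """Pick the threshold s for the test value < s that maximizes SSV."""
        candidates = sorted(set(values))[1:]
        return max(candidates,
                   key=lambda s: ssv([c for v, c in zip(values, labels) if v < s],
                                     [c for v, c in zip(values, labels) if v >= s]))

    print(best_threshold([1.0, 2.0, 3.0, 8.0, 9.0], ['A', 'A', 'A', 'B', 'B']))  # 8.0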

17. SSV parameters
Generalization strategy: defines how the pruning is done.
First, try to find optimal parameters for pruning: the final number of leaf nodes, or the pruning degree k (remove all nodes that increase accuracy by only k samples).
With a given pruning degree or a given node count these parameters are set by hand; the "Optimal" setting uses cross-validation training to determine them, so the number of CV folds and their type has to be selected. The optimal values minimize the sum of errors on the test parts of the CV training.
Search strategy: using the nodes of the tree created so far, either
• use best-first search, i.e. select the next best node for splitting, or
• use beam search: keep ~10 best trees and expand them; this avoids local maxima of the SSV criterion function but rarely gives better results.

18. SSV example
Hypothyroid disease: screening data with 3772 training (first year) and 3428 test (second year) examples; the majority, 92.5%, are normal, the rest are primary hypothyroid or compensated hypothyroid cases.
For the TT4 attribute: red shows the number of errors; green the number of correctly separated pairs; blue the number of pairs separated from the same class (here always zero).

19. Wine data example
Chemical analysis of wine from grapes grown in the same region in Italy, but derived from three different cultivars. Task: recognize the source of a wine sample. 13 continuous features are measured:
• alcohol content
• malic acid content
• ash content
• alkalinity of ash
• magnesium content
• total phenols content
• flavanoids content
• nonflavanoid phenols content
• proanthocyanins content
• color intensity
• hue
• OD280/D315 of diluted wines
• proline
(figure: wine robot)

20. C4.5 tree for Wine
J48 pruned tree, using reduced-error (3x cross-validation) pruning:
------------------
OD280/D315 <= 2.15
|   alkalinity <= 18.1: 2 (5.0/1.0)
|   alkalinity > 18.1: 3 (31.0/1.0)
OD280/D315 > 2.15
|   proline <= 750: 2 (43.0/2.0)
|   proline > 750: 1 (40.0/2.0)

Number of Leaves: 4
Size of the tree: 7
Correctly Classified Instances: 161 (90.44 %)
Incorrectly Classified Instances: 17 (9.55 %)
Total Number of Instances: 178

21. WEKA/RM output
WEKA output contains the confusion matrix; RM shows the transposed matrix, P(true|predicted):

  a   b   c   <-- classified as
 56   3   0 |  a = 1  (class number)
  4  65   2 |  b = 2
  3   1  44 |  c = 3

The 2x2 case with marginals (P+ and P- are the true class fractions, P+(M) and P-(M) the fractions predicted by model M):

 TP      FN      P+
 FP      TN      P-
 P+(M)   P-(M)   1

WEKA per-class statistics:

 TP Rate  FP Rate  Precision  Recall  F-Measure  Class
 0.949    0.059    0.889      0.949   0.918      1
 0.915    0.037    0.942      0.915   0.929      2
 0.917    0.015    0.957      0.917   0.936      3

Pr = Precision = TP/(TP+FP)
R  = Recall    = TP/(TP+FN)
F-Measure = 2·Pr·R/(Pr+R)
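The per-class precision, recall and F-measure above can be reproduced directly from the confusion matrix, for example with this small numpy sketch:

    import numpy as np

    # Confusion matrix from the slide: rows = true class, columns = predicted class.
    cm = np.array([[56,  3,  0],
                   [ 4, 65,  2],
                   [ 3,  1, 44]])

    for c in range(cm.shape[0]):
        tp = cm[c, c]
        fp = cm[:, c].sum() - tp    # predicted as c but belonging to other classes
        fn = cm[c, :].sum() - tp    # true class c predicted as something else
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f_measure = 2 * precision * recall / (precision + recall)
        print(c + 1, round(precision, 3), round(recall, 3), round(f_measure, 3))
    # Prints 0.889/0.949/0.918, 0.942/0.915/0.929, 0.957/0.917/0.936 for classes 1-3.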

22. WEKA/RM output info
Other output information:
Kappa statistic 0.8891 (corrects for chance agreement)
Mean absolute error 0.0628
Root mean squared error 0.2206
Relative absolute error 14.2992 %
Root relative squared error 47.0883 %

23. Simplest SSV rules
Decision trees provide rules of different complexity, depending on the pruning parameters. Simpler trees make more errors but help to understand the data better.
In SSV the pruning degree or the number of pruned nodes may be selected by hand. Start from a small number of nodes and see how the number of errors in CV changes.
The simplest tree: 5 nodes, corresponding to 3 rules; 25 errors, mostly Class 2/3 wines mixed up.

24. Wine – SSV 5 rules
Lower pruning leads to a more complex but more accurate tree: 7 nodes, corresponding to 5 rules; 10 errors, mostly Class 2/3 wines mixed up.
Try to lower the pruning degree or increase the node number and observe the influence on the error.
av. 3 nodes, train 10x: 87.0±2.1%, test 80.6±5.4±2.1%
av. 7 nodes, train 10x: 98.1±1.0%, test 92.1±4.9±2.0%
av. 13 nodes, train 10x: 99.7±0.4%, test 90.6±5.1±1.6%

25. Wine – SSV optimal rules
What is the optimal complexity of the rules? Use cross-validation to estimate the optimal pruning for best generalization.
Various solutions may be found, depending on the search parameters:
• 5 rules with 12 premises, making 6 errors,
• 6 rules with 16 premises and 3 errors,
• 8 rules, 25 premises, and 1 error.

if OD280/D315 > 2.505 ∧ proline > 726.5 ∧ color > 3.435 then class 1
if OD280/D315 > 2.505 ∧ proline > 726.5 ∧ color < 3.435 then class 2
if OD280/D315 < 2.505 ∧ hue > 0.875 ∧ malic-acid < 2.82 then class 2
if OD280/D315 > 2.505 ∧ proline < 726.5 then class 2
if OD280/D315 < 2.505 ∧ hue < 0.875 then class 3
if OD280/D315 < 2.505 ∧ hue > 0.875 ∧ malic-acid > 2.82 then class 3

26. DT summary
DT: fast and easy, recursive partitioning.
Advantages: easy to use, very few parameters to set, no data preprocessing; frequently gives very good results; easy to interpret and convert to logical rules; works with nominal and numerical data.
Applications: classification and regression. Almost all data mining software packages include decision trees.
Some problems with DTs: few data, large numbers of continuous attributes, instability; the lower parts of trees are less reliable, since they split small subsets of the data; the knowledge-expression abilities of a DT are rather limited, for example it is hard to create a concept like "the majority are for it", which is easy for M-of-N rules.

27. Computational Intelligence: Methods and Applications
Lecture 21: Linear discrimination, linear machines
Source: Włodzisław Duch; Dept. of Informatics, UMK; Google: W Duch

28. Regression and model trees
Regression: numeric, continuous classes C(X), predict a number.
Select the split to minimize the variance in the node (make the data piecewise constant).
Leaf nodes predict the average value of the training samples that reach them, so the approximation is piecewise constant.
Stop criterion: do not split the node if σ(Dk) < k·σ(E), i.e. the standard deviation in the node is already a small fraction of that of the whole training set E.
Model trees: use linear regression in each node; only a subset of attributes is used at each node. This is similar in spirit to approximation by spline functions.
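A sketch of the variance-minimizing split selection described above, on made-up one-dimensional data:

    import numpy as np

    def variance_split(x, y):
        # Choose the threshold minimizing the weighted variance of the two subsets;
        # each leaf would then predict the mean of the targets reaching it.
        best_s, best_score = None, np.inf
        for s in np.unique(x)[1:]:                 # candidate thresholds
            left, right = y[x < s], y[x >= s]
            score = (len(left) * left.var() + len(right) * right.var()) / len(y)
            if score < best_score:
                best_s, best_score = s, score
        return best_s

    x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
    y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
    print(variance_split(x, y))    # prints 10.0, separating the low cluster from the high one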

29. Some DT ideas
Many improvements have been proposed. General idea: divide and conquer.
Multivariate trees provide more complex decision borders: trees using Fisher or Linear Discriminant Analysis, perceptron trees, neural trees.
Split criteria: information gain near the root, accuracy near the leaves; pruning based on logical rules, which also works near the root.
Committees of trees: learning many trees on randomized data (boosting) or CV folds, or learning with different pruning parameters.
Fuzzy trees, probability-evaluating trees, forests of trees ...
http://www.stat.wisc.edu/~loh/ – Quest, Cruise, Guide, Lotus trees.

30. DT tests and expressive power
DT: fast and easy, recursive partitioning of data – a powerful idea.
A typical DT with tests on the values of a single attribute has rather limited knowledge-expression abilities.
For example, if N=10 people vote Yes/No and the decision is taken when the number of Yes votes exceeds the number of No votes (the concept "the majority are for it"), the data looks as follows:
1 0 0 0 1 1 1 0 1 0 No
1 1 0 0 1 1 1 0 1 0 Yes
0 1 0 0 1 1 1 0 1 0 No
A univariate DT will not learn from such data unless a new test is introduced: ||X-W|| > 5, or W·X > 5, with W = [1 1 1 1 1 1 1 1 1 1].
Another way to express it is with an M-of-N rule: IF at least 6-of-10 (Vi=1) THEN Yes.
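The same point in code: the majority concept becomes a single linear test once a weight vector W of ones is allowed, something a univariate tree cannot express directly (a sketch using the second row of the table above):

    import numpy as np

    W = np.ones(10)                                      # all weights equal to 1
    votes = np.array([1, 1, 0, 0, 1, 1, 1, 0, 1, 0])     # 6 "Yes" votes
    print("Yes" if W @ votes > 5 else "No")              # prints "Yes"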

31. Linear discrimination
A linear combination W·X > θ, with fixed W, defines a half-space:
W·X = 0 defines a hyperplane orthogonal to W, passing through 0;
W·X > 0 is the half-space in the direction of the W vector;
W·X > θ is that half-space shifted by θ in the direction of the W vector.
Linear discrimination: separate different classes of data using hyperplanes, learning the best W parameters from the data.
It is a special case of the Bayesian approach (identical covariance matrices), and a special kind of test for decision trees.
Frequently a single hyperplane is sufficient to separate the data, especially in high-dimensional spaces!

32. Linear discriminant functions
Linear discriminant function: gW(X) = WᵀX + W0.
Terminology: W is the weight vector, W0 is the bias term (why?).
IF gW(X) > 0 THEN class ω1, otherwise class ω2.
W = [W0, W1, ..., Wd] usually includes W0, and then X = [1, X1, ..., Xd].
The discriminant function used for classification may in addition include a step function Θ(WᵀX) = ±1.
Graphical representation of the discriminant function: gW(X) = Θ(WᵀX).
One LD function can separate a pair of classes; for more classes, or if strongly non-linear decision borders are needed, many LD functions may be used.
If a smooth sigmoidal output is used, the LD is called a "perceptron".

33. Distance from the plane
gW(X) = 0 for two vectors X(1), X(2) on the d-dimensional decision hyperplane means:
WᵀX(1) = -W0 = WᵀX(2), so Wᵀ(X(1) - X(2)) = 0, i.e. W is normal (⊥) to the plane.
How far is an arbitrary X from the decision hyperplane?
Let V = W/||W|| be the unit vector normal to the plane and V0 = W0/||W||.
Write X = Xp + DW(X)·V, where Xp is the projection of X on the plane; since WᵀXp = -W0, we get WᵀX = -W0 + DW(X)·||W||.
Hence the signed distance is DW(X) = gW(X)/||W||.
Distance = scaled value of the discriminant function; it measures the confidence in the classification; smaller ||W|| => greater confidence.
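A one-line numerical check of these formulas; the weights and the sample point are made up:

    import numpy as np

    W = np.array([3.0, 4.0])      # weight vector (normal to the hyperplane)
    W0 = -5.0                     # bias term
    X = np.array([2.0, 1.0])

    g = W @ X + W0                # discriminant value: 3*2 + 4*1 - 5 = 5
    distance = g / np.linalg.norm(W)
    print(g, distance)            # 5.0, 1.0 -> class w1 (g > 0), one unit from the plane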

34. K-class problems
For K classes: separate each class from the rest using K hyperplanes – but then ambiguous regions remain (Fig. 5.3, Duda, Hart & Stork).
Perhaps separate each pair of classes using K(K-1)/2 planes? Ambiguous regions still persist.

35. Linear machine
Define K discriminant functions: gi(X) = W(i)ᵀX + W0i, i = 1..K.
IF gi(X) > gj(X) for all j ≠ i THEN select ωi.
A linear machine creates K convex decision regions Ri where gi(X) is largest.
The Hij hyperplane is defined by gi(X) = gj(X), i.e. (W(i) - W(j))ᵀX + (W0i - W0j) = 0.
W = W(i) - W(j) is orthogonal to the Hij plane; the distance to this plane is DW(X) = (gi(X) - gj(X))/||W||.
Linear machines for 3 and 5 classes give the same regions as one prototype per class plus a distance measure (Fig. 5.4, Duda, Hart & Stork).

36. LDA is general!
Suppose that strongly non-linear borders are needed. Is LDA still useful? Yes, but not directly in the input space!
Add to the input X = {Xi} also the squares Xi² and products XiXj as new features.
Example: LDA in 2D => LDA in 5D after adding {X1, X2, X1², X2², X1X2};
g(X1, X2) = W1X1 + ... + W5X1X2 + W0 is now non-linear in the original inputs! (Hastie et al., Fig. 4.1)
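A tiny sketch of this feature expansion; the helper name expand and the sample values are just for illustration:

    import numpy as np

    def expand(X):
        # X has shape (n_samples, 2); map {X1, X2} -> {X1, X2, X1^2, X2^2, X1*X2}.
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

    X = np.array([[1.0, 2.0], [0.5, -1.0]])
    print(expand(X))   # a linear discriminant in this 5D space is quadratic in (X1, X2)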

37. LDA – how?
How to find W? There are many methods; the whole Chapter 5 of Duda, Hart & Stork is devoted to linear discrimination methods.
LDA methods differ in:
• the formulation of the criteria defining W;
• on-line versions for incoming data vs. off-line versions for fixed data;
• the numerical methods used: least-mean squares, relaxation, pseudoinverse, iterative corrections, Ho-Kashyap algorithms, stochastic approximations, linear programming algorithms ...
"Far more papers have been written about linear discriminants than the subject deserves" (Duda, Hart & Stork, 2000). Interesting papers on this subject are still being written ...

38. LDA – regression approach
Linear regression model (implemented in WEKA): Y = gW(X) = WᵀX + W0.
Fit the data to the known (X, Y) values, even if Y is just a class label (e.g. ±1).
Common statistical approach: use the LMS (Least Mean Squares) method, i.e. minimize the Residual Sum of Squares (RSS):
RSS(W) = Σn (Y(n) - WᵀX(n) - W0)²

39. LDA – regression formulation
In matrix form, with X0 = 1 appended to each vector and W0 included in W:
Y = XᵀW
If X were square and non-singular then W = (Xᵀ)⁻¹Y, but usually n ≠ d+1, so Xᵀ is not square.

40. LDA – regression solution
To find the minimum of ||Y - XᵀW||², set the derivatives with respect to W to zero:
-2X(Y - XᵀW) = 0  =>  XXᵀW = XY
XXᵀ is a d×d matrix, and it should be positive definite at the minimum.
A solution W = (XXᵀ)⁻¹XY exists if XXᵀ is non-singular, i.e. the feature vectors are linearly independent; if n < d+1 this is impossible, so a sufficient number of samples is needed (there are special methods for the n < d+1 case).
X† = (XXᵀ)⁻¹X is the pseudoinverse matrix; it has many interesting properties, see Numerical Recipes, http://www.nr.com
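A small numerical sketch of this least-squares solution, using the more common rows-as-samples layout, Y coded as ±1, and numpy's pinv standing in for the pseudoinverse above; the data is made up:

    import numpy as np

    X = np.array([[1.0, 2.0], [2.0, 1.0], [6.0, 5.0], [7.0, 8.0]])
    Y = np.array([-1.0, -1.0, 1.0, 1.0])

    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias input X0 = 1
    W = np.linalg.pinv(Xb) @ Y                      # minimizes ||Y - Xb W||^2
    print(W)
    print(np.sign(Xb @ W))                          # recovers the class labels for this toy data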

41. LMS evaluation
The solution using the pseudoinverse matrix is one of many possible approaches to LDA (for about ten others see, for example, Duda & Hart).
Is it the best result? Not always. For a singular X, caused by linearly dependent features, the method is corrected by removing the redundant features.
Good news: Least Mean Squares estimates have the lowest variance among all linear (unbiased) estimates.
Bad news: hyperplanes found in this way may not separate even linearly separable data! Why? LMS minimizes the squares of distances, not the classification margin. Wait for SVMs to do that ...
