Machine Learning in Real World: CART

Outline • CART Overview and Gymtutor Tutorial Example • Splitting Criteria • Handling Missing Values • Pruning • Finding Optimal Tree

CART – Classification And Regression Tree • Developed 1974-1984 by 4 statistics professors • Leo Breiman (Berkeley), Jerry Friedman (Stanford), Charles Stone (Berkeley), Richard Olshen (Stanford) • Focused on accurate assessment when data is noisy • Currently distributed by Salford Systems

CART Tutorial Data: Gymtutor CART HELP, Sec 3 in CARTManual.pdf • ANYRAQT Racquet ball usage (binary indicator coded 0, 1) • ONAER Number of on-peak aerobics classes attended • NSUPPS Number of supplements purchased • OFFAER Number of off-peak aerobics classes attended • NFAMMEM Number of family members • TANNING Number of visits to tanning salon • ANYPOOL Pool usage (binary indicator coded 0, 1) • SMALLBUS Small business discount (binary indicator coded 0, 1) • FIT Fitness score • HOME Home ownership (binary indicator coded 0, 1) • PERSTRN Personal trainer (binary indicator coded 0, 1) • CLASSES Number of classes taken. • SEGMENT Member’s market segment (1, 2, 3) – target

View data • CART Menu: View -> Data Info …

CART Example: Gymtutor

CART Model Setup • Target -- required • Predictors (default – all) • Categorical • ANYRAQT, ANYPOOL, SMALLBUS, HOME • Categorical: if field name ends in “$”, or from values • Testing • default – 10-fold cross-validation • …

Sample Tree

Color-coding using class

Decision Tree: Splitters

Tree Details

Tree Summary Reports

Pruning the tree

Keeping only important variables

Revised Tree

Automating CART: Command Log

Key CART features • Automated field selection • handles any number of fields • automatically selects relevant fields • No data preprocessing needed • Does not require any kind of variable transforms • Impervious to outliers • Missing value tolerant • Moderate loss of accuracy due to missing values

CART: Key Parts of Tree Structured Data Analysis • Tree growing • Splitting rules to generate tree • Stopping criteria: how far to grow? • Missing values: using surrogates • Tree pruning • Trimming off parts of the tree that don’t work • Ordering the nodes of a large tree by contribution to tree accuracy … which nodes come off first? • Optimal tree selection • Deciding on the best tree after growing and pruning • Balancing simplicity against accuracy

CART is a form of Binary Recursive Partitioning • Data is split into two partitions • Q: Does C4.5 always have binary partitions? • Partitions can also be split into sub-partitions • hence procedure is recursive • CART tree is generated by repeated partitioning of data set • parent gets two children • each child produces two grandchildren • four grandchildren produce 8 great grandchildren

Splits always determined by questions with YES/NO answers • Is continuous variable X ≤ c? • Does categorical variable D take on levels i, j, or k? • is GENDER M or F? • Standard split: • if answer to question is YES a case goes left; otherwise it goes right • this is the form of all primary splits • example: Is AGE ≤ 62.5? • More complex conditions possible: • Boolean combinations: AGE<=62 OR BP<=91 • Linear combinations: .66*AGE - .75*BP < -40
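The three kinds of split question on this slide can be written as simple YES/NO predicates. This is only an illustrative sketch: the function names and the sample case are hypothetical, and the simple split is assumed to read "AGE <= 62.5".

```python
def simple_split(case):
    """Standard split: Is AGE <= 62.5? (YES goes left)."""
    return case["AGE"] <= 62.5

def boolean_split(case):
    """Boolean combination: AGE <= 62 OR BP <= 91."""
    return case["AGE"] <= 62 or case["BP"] <= 91

def linear_split(case):
    """Linear combination: .66*AGE - .75*BP < -40."""
    return 0.66 * case["AGE"] - 0.75 * case["BP"] < -40

def direction(goes_left):
    # YES sends the case to the left child, NO to the right child.
    return "left" if goes_left else "right"

# Hypothetical case used only to exercise the three questions.
patient = {"AGE": 70, "BP": 120}
answers = {
    "simple": direction(simple_split(patient)),
    "boolean": direction(boolean_split(patient)),
    "linear": direction(linear_split(patient)),
}
```

For this made-up case the first two questions are answered NO and the linear combination is answered YES, so the same record can be routed differently depending on which question the node asks.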

Searching all Possible Splits • For any node CART will examine ALL possible splits. • CART allows search over a random sample if desired • Look at first variable in our data set AGE with minimum value 40 • Test split Is AGE ≤ 40? • Will separate out the youngest persons to the left • Could be many cases if many people have the same AGE • Next increase the AGE threshold to the next youngest person • Is AGE ≤ 43? • This will direct additional cases to the left • Continue increasing the splitting threshold value by value • each value is tested for how good the split is . . . how effective it is in separating the classes from each other • Q: Consider splits between values of the same class?
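The value-by-value search described on this slide can be sketched as follows. This is a minimal illustration, not CART's actual implementation: it assumes the Gini index as the measure of split quality, tests thresholds at the observed values themselves, and uses made-up AGE data. The names `gini` and `best_threshold` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Test 'value <= c' for each distinct value c; return the best (c, score).

    The score is the size-weighted Gini index of the two children (lower is better).
    """
    n = len(values)
    best_c, best_score = None, float("inf")
    # Skip the largest value: that split would send every case left.
    for c in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= c]
        right = [y for x, y in zip(values, labels) if x > c]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_c, best_score = c, score
    return best_c, best_score

# Made-up data: AGE values and a two-class target.
ages = [40, 40, 43, 50, 55, 62, 70]
classes = ["no", "no", "no", "no", "yes", "yes", "yes"]
threshold, score = best_threshold(ages, classes)
```

Here the search starts at AGE <= 40, moves to AGE <= 43, and so on; the split AGE <= 50 separates the two classes completely, so it receives the lowest score.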

Split Tables • Q: Where do splits need to be evaluated? • [Tables: cases sorted by Age; cases sorted by Blood Pressure]

CART Splitting Criteria: Gini Index • If a data set T contains examples from n classes, the Gini index gini(T) is defined as $\mathrm{gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$, where $p_j$ is the relative frequency of class j in T. • gini(T) is minimized if the classes in T are skewed (it is 0 when all examples belong to one class). • Advanced: CART also has other splitting criteria • Twoing is recommended for multi-class
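A minimal sketch of the Gini index in its usual form, 1 minus the sum of squared class frequencies, together with the size-weighted index of a two-way split. The function names are illustrative and not part of CART.

```python
from collections import Counter

def gini(labels):
    """Gini index of a collection of class labels: 1 - sum_j p_j^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted Gini index of the two children produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

pure = gini(["a", "a", "a", "a"])    # a single class: fully skewed
mixed = gini(["a", "a", "b", "b"])   # two classes, evenly mixed
```

The pure node scores 0 and the evenly mixed two-class node scores 0.5, which matches the remark that the index is smallest when the class distribution is skewed.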

Missing as a distinct splitter value • CHAID treats missing as a distinct categorical value • e.g., AGE is 25-44, 45-64, 65-95 or missing • method also adopted by C4.5 • If missing is a distinct value then all cases with missing go the same way in the tree • Assumption: whatever the unknown value, it is the same for all cases with a missing value • Problem: there can be more than one reason for a database field to be missing: • E.g. Income as a splitter wants to separate high from low • Levels most likely to be missing? High Income AND Low Income! • Don’t want to send both groups to the same side of the tree

CART Treatment of Missing Primary Splitters: Surrogates • CART uses a more refined method: a surrogate is used as a stand-in for a missing primary field • surrogate should be a valid replacement for primary • Consider our example of INCOME • Other variables like Education or Occupation might work as good surrogates • People with higher education usually have higher incomes • People in high income occupations will usually (though not always) have higher incomes • Using a surrogate means that cases missing the primary are not all treated the same way • Whether a case goes left or right depends on its surrogate value • thus record specific . . . some cases go left, others go right

Surrogates Mimicking Alternatives to Primary Splitters • A primary splitter is the best splitter of a node • A surrogate is a splitter that splits in a fashion similar to the primary • Surrogate — variable with near equivalent information • Why Useful? • If the primary is expensive or difficult to gather and the surrogate is not • Then consider using the surrogate instead • Loss in predictive accuracy might be slight • If primary splitter is MISSING then CART will use a surrogate • if top surrogate missing CART uses 2nd best surrogate etc • If all surrogates missing also CART uses majority rule
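The fallback order described on this slide (primary, then surrogates in rank order, then majority rule) can be sketched as below. This is a simplified illustration under stated assumptions: every splitter is a "field <= threshold goes left" question, missing values are represented as `None`, and the field names, thresholds, and the `route` function are hypothetical.

```python
def route(case, primary, surrogates, majority_direction):
    """Return 'left' or 'right' for one case at a node.

    primary and each surrogate are (field, threshold) pairs meaning
    'field <= threshold goes left'. A missing value is None or an absent key.
    """
    # Try the primary first, then each surrogate in order of quality.
    for field, threshold in [primary, *surrogates]:
        value = case.get(field)
        if value is not None:
            return "left" if value <= threshold else "right"
    # Primary and all surrogates missing: send the case with the majority.
    return majority_direction

# Hypothetical splitters for the INCOME example.
primary = ("INCOME", 50000)
surrogates = [("EDUCATION_YEARS", 14), ("OCCUPATION_RANK", 3)]

r1 = route({"INCOME": 30000}, primary, surrogates, "right")
r2 = route({"INCOME": None, "EDUCATION_YEARS": 16}, primary, surrogates, "right")
r3 = route({"OCCUPATION_RANK": 2}, primary, surrogates, "right")
r4 = route({}, primary, surrogates, "right")
```

Two cases that are both missing INCOME can end up on different sides of the node, because each is routed by its own surrogate value.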

CART Pruning Method: Grow Full Tree, Then Prune • You will never know when to stop . . . so don’t! • Instead . . . grow trees that are obviously too big • Largest tree grown is called “maximal” tree • Maximal tree could have hundreds or thousands of nodes • usually instruct CART to grow only moderately too big • rule of thumb: should grow trees about twice the size of the truly best tree • This becomes first stage in finding the best tree • Next we will have to get rid of the parts of the overgrown tree that don’t work (not supported by test data)

Maximal Tree Example

Tree Pruning • Take a very large tree (“maximal” tree) • Tree may be radically over-fit • Tracks all the idiosyncrasies of THIS data set • Tracks patterns that may not be found in other data sets • At bottom of tree splits based on very few cases • Analogous to a regression with very large number of variables • PRUNE away branches from this large tree • But which branch to cut first? • CART determines a pruning sequence: • the exact order in which each node should be removed • pruning sequence determined for EVERY node • sequence determined all the way back to root node

Pruning: Which nodes come off next?

Order of Pruning: Weakest Link Goes First • Prune away "weakest link" — the nodes that add least to overall accuracy of the tree • contribution to overall tree a function of both increase in accuracy and size of node • accuracy gain is weighted by share of sample • small nodes tend to get removed before large ones • If several nodes have same contribution they all prune away simultaneously • Hence more than two terminal nodes could be cut off in one pruning • Sequence determined all the way back to root node • need to allow for possibility that entire tree is bad • if target variable is unpredictable we will want to prune back to root . . . the no model solution
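One common way to formalize "weakest link" is the cost-complexity measure: for each internal node, the increase in error from collapsing its subtree divided by the number of terminal nodes removed. The sketch below assumes that formulation; the node names and error figures are made up, and errors are taken to be already weighted by each node's share of the sample.

```python
def link_strength(node_error, subtree_error, subtree_leaves):
    """Error increase per terminal node removed if the subtree is collapsed."""
    return (node_error - subtree_error) / (subtree_leaves - 1)

# Hypothetical internal nodes:
#   (error if the node is collapsed to a leaf,
#    error of the subtree below it,
#    number of terminal nodes in that subtree)
nodes = {
    "A": (0.20, 0.10, 5),
    "B": (0.06, 0.04, 2),
    "C": (0.03, 0.02, 3),
}

strengths = {name: link_strength(*stats) for name, stats in nodes.items()}
weakest = min(strengths, key=strengths.get)
```

Node C buys the least accuracy per terminal node, so its subtree is pruned first; nodes with equal strength would be pruned in the same step.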

Pruning Sequence Example • [Figure: successive trees in the pruning sequence with 24, 21, 20, and 18 terminal nodes]

Now we test every tree in the pruning sequence • Take a test data set and drop it down the largest tree in the sequence and measure its predictive accuracy • how many cases right and how many wrong • measure accuracy overall and by class • Do same for 2nd largest tree, 3rd largest tree, etc • Performance of every tree in sequence is measured • Results reported in table and graph formats • Note that this critical stage is impossible to complete without test data • CART procedure requires test data to guide tree evaluation

Training Data Vs. Test Data Error Rates • Compare error rates measured by • learn data • large test set • Learn R(T) always decreases as tree grows (Q: Why?) • Test R(T) first declines then increases (Q: Why?) • Overfitting is the result of too much reliance on learn R(T) • Can lead to disasters when applied to new data
No. Terminal Nodes    R(T)    Rts(T)
71                    .00     .42
63                    .00     .40
58                    .03     .39
40                    .10     .32
34                    .12     .32
19                    .20     .31
**10                  .29     .30
9                     .32     .34
7                     .41     .47
6                     .46     .54
5                     .53     .61
2                     .75     .82
1                     .86     .91
(** marks the tree with the lowest test error)
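The figures on this slide can be checked with a few lines of Python. The sketch below copies the (terminal nodes, R(T), Rts(T)) triples from the slide and picks the tree with the lowest test error; the variable names are illustrative.

```python
# (terminal nodes, learn error R(T), test error Rts(T)), largest tree first
sequence = [
    (71, .00, .42), (63, .00, .40), (58, .03, .39), (40, .10, .32),
    (34, .12, .32), (19, .20, .31), (10, .29, .30), (9, .32, .34),
    (7, .41, .47), (6, .46, .54), (5, .53, .61), (2, .75, .82), (1, .86, .91),
]

# Tree with the smallest test error.
best = min(sequence, key=lambda row: row[2])

# Learn error never increases as the tree grows, i.e. it never
# decreases as we walk from the largest tree to the smallest.
learn_errors = [row[1] for row in sequence]
learn_is_monotone = all(a <= b for a, b in zip(learn_errors, learn_errors[1:]))

# Gap between test and learn error for the largest tree and the best tree.
gap_largest = sequence[0][2] - sequence[0][1]
gap_best = best[2] - best[1]
```

The 10-node tree has the lowest test error, and its learn and test errors nearly agree, while the 71-node tree shows zero learn error but a test error of .42.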

Why look at training data error rates (or cost) at all? • First, provides a rough guide of how you are doing • Truth will typically be WORSE than training data measure • If the tree performs poorly on training data, you may not want to pursue it further • Training data error rate more accurate for smaller trees • So reasonable guide for smaller trees • Poor guide for larger trees • At optimal tree training and test error rates should be similar • if not, something is wrong • useful to compare not just overall error rate but also within-node performance between training and test data

CART: Optimal Tree • Within a single CART run which tree is best? • Process of pruning the maximal tree can yield many sub-trees • Test data set or cross-validation measures the error rate of each tree • Current wisdom: select the tree with smallest error rate • Only drawback: minimum may not be precisely estimated • Typical error rate as a function of tree size has flat region • Minimum could be anywhere in this region

One SE Rule -- One Standard Error Rule • Original monograph recommends NOT choosing minimum error tree because of possible instability of results from run to run • Instead suggests SMALLEST TREE within 1 SE of the minimum error tree • Tends to provide very stable results from run to run • Is possibly as accurate as minimum cost tree yet simpler • Current learning: the one SE rule is good for small data sets • For large data sets one should pick most accurate tree • known as the zero SE rule
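The one SE rule can be sketched as below. This is an illustration under stated assumptions: the standard error of the minimum test error is estimated with the binomial formula sqrt(e(1-e)/n), the test set sizes are hypothetical, and the (terminal nodes, test error) pairs are a subset of the error-rate table shown earlier in these slides.

```python
import math

def one_se_tree(sequence, n_test):
    """Return the smallest tree whose test error is within one standard
    error of the minimum. sequence holds (terminal_nodes, test_error) pairs."""
    min_error = min(err for _, err in sequence)
    # Assumed estimate: binomial standard error of the minimum error rate.
    se = math.sqrt(min_error * (1 - min_error) / n_test)
    eligible = [(size, err) for size, err in sequence if err <= min_error + se]
    return min(eligible, key=lambda row: row[0])

sequence = [(40, .32), (34, .32), (19, .31), (10, .30), (9, .34), (7, .47)]

small_sample_choice = one_se_tree(sequence, n_test=100)    # hypothetical size
large_sample_choice = one_se_tree(sequence, n_test=1000)   # hypothetical size
```

With 100 test cases the standard error is wide enough that the 9-node tree qualifies, so the rule prefers it over the minimum-error 10-node tree. With 1000 test cases the band narrows and the rule returns the 10-node tree, which agrees with picking the most accurate tree on large data sets.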

In what sense is the optimal tree “best”? • Optimal tree has lowest or near lowest cost as determined by a test procedure • Tree should exhibit very similar accuracy when applied to new data • BUT Best Tree is NOT necessarily the one that happens to be most accurate on a single test database • trees somewhat larger or smaller than “optimal” may be preferred • Room for user judgment • judgment not about split variable or values • judgment as to how much of tree to keep • determined by story tree is telling • willingness to sacrifice a small amount of accuracy for simplicity

CART Summary • CART Key Features • binary splits • gini index as splitting criterion • grow, then prune • surrogates for missing values • optimal tree – 1 SE rule • lots of nice graphics

Decision Tree Summary • Decision Trees • splits – binary, multi-way • split criteria – entropy, gini, … • missing value treatment • pruning • rule extraction from trees • Both C4.5 and CART are robust tools • No method is always superior – experiment! witten & eibe
