
# Decision Trees


##### Presentation Transcript

1. Decision Trees
Advanced Statistical Methods in NLP, Ling572, January 10, 2012

5. Information Gain
• InfoGain(S, A): the expected reduction in entropy due to splitting on attribute A
• Select the attribute A with maximum InfoGain
• Equivalently, the attribute resulting in the lowest average entropy

6. Computing Average Entropy
Splitting the |S| instances on a test A sends each instance down one branch (Branch 1, Branch 2, ...). The average entropy of the split is

AvgEntropy(S, A) = sum_i (|S_i| / |S|) * H(S_i)

where |S_i| / |S| is the fraction of samples down branch i and H(S_i) is the disorder of the class distribution on branch i, so that InfoGain(S, A) = H(S) - AvgEntropy(S, A).
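
The average-entropy and InfoGain computations can be sketched in Python as follows (a minimal sketch; the function names are mine, not from the lecture):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Disorder H(S) of a class distribution, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def avg_entropy(branches):
    """Weighted average entropy over the branches of a split."""
    total = sum(len(b) for b in branches)
    return sum(len(b) / total * entropy(b) for b in branches)

def info_gain(labels, branches):
    """InfoGain(S, A) = H(S) minus the average entropy after splitting on A."""
    return entropy(labels) - avg_entropy(branches)
```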

7. Sunburn Example

8. Picking a Test
Candidate splits of the eight sunburn instances (B = sunburned, N = none):
• Hair Color: Blonde → Sarah:B, Dana:N, Annie:B, Katie:N; Red → Emily:B; Brown → Alex:N, Pete:N, John:N
• Height: Short → Alex:N, Annie:B, Katie:N; Average → Sarah:B, Emily:B, John:N; Tall → Dana:N, Pete:N
• Weight: Light → Sarah:B, Katie:N; Average → Dana:N, Alex:N, Annie:B; Heavy → Emily:B, Pete:N, John:N
• Lotion: Yes → Dana:N, Alex:N, Katie:N; No → Sarah:B, Annie:B, Emily:B, Pete:N, John:N

12. Entropy in Sunburn Example
S = [3B, 5N], so H(S) = -(3/8)log2(3/8) - (5/8)log2(5/8) = 0.954
• Hair color = 0.954 - [(4/8)(-(2/4)log2(2/4) - (2/4)log2(2/4)) + (1/8)*0 + (3/8)*0] = 0.954 - 0.5 = 0.454
• Height = 0.954 - 0.69 = 0.264
• Weight = 0.954 - 0.94 = 0.014
• Lotion = 0.954 - 0.61 = 0.344
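
A sketch that re-derives these InfoGain values from the sunburn data (the tuple layout and the names `DATA`/`FEATURES` are my own reconstruction; note that the slide rounds intermediate entropies, so height, weight, and lotion come out a few thousandths higher when computed exactly):

```python
from collections import Counter, defaultdict
from math import log2

# The sunburn data as reconstructed from the slides:
# (name, hair, height, weight, lotion, class) where B = burned, N = none
DATA = [
    ("Sarah", "blonde", "average", "light",   "no",  "B"),
    ("Dana",  "blonde", "tall",    "average", "yes", "N"),
    ("Alex",  "brown",  "short",   "average", "yes", "N"),
    ("Annie", "blonde", "short",   "average", "no",  "B"),
    ("Emily", "red",    "average", "heavy",   "no",  "B"),
    ("Pete",  "brown",  "tall",    "heavy",   "no",  "N"),
    ("John",  "brown",  "average", "heavy",   "no",  "N"),
    ("Katie", "blonde", "short",   "light",   "yes", "N"),
]
FEATURES = {"hair": 1, "height": 2, "weight": 3, "lotion": 4}

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, col):
    """H(S) minus the average entropy after splitting on column `col`."""
    branches = defaultdict(list)
    for row in rows:
        branches[row[col]].append(row[-1])
    labels = [row[-1] for row in rows]
    avg = sum(len(b) / len(rows) * entropy(b) for b in branches.values())
    return entropy(labels) - avg

for name, col in FEATURES.items():
    print(f"{name}: {info_gain(DATA, col):.3f}")
```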

13. Picking a Test (Blonde branch)
Remaining instances after splitting on Hair Color = Blonde: Sarah:B, Dana:N, Annie:B, Katie:N
• Height: Short → Annie:B, Katie:N; Average → Sarah:B; Tall → Dana:N
• Weight: Light → Sarah:B, Katie:N; Average → Dana:N, Annie:B; Heavy → (none)
• Lotion: Yes → Dana:N, Katie:N; No → Sarah:B, Annie:B

14. Entropy in Sunburn Example (Blonde branch)
S = [2B, 2N], so H(S) = 1
• Height = 1 - [(2/4)(-(1/2)log2(1/2) - (1/2)log2(1/2)) + (1/4)*0 + (1/4)*0] = 1 - 0.5 = 0.5
• Weight = 1 - [(2/4)(-(1/2)log2(1/2) - (1/2)log2(1/2)) + (2/4)(-(1/2)log2(1/2) - (1/2)log2(1/2))] = 1 - 1 = 0
• Lotion = 1 - 0 = 1

18. Building Decision Trees with Information Gain
• Until there are no inhomogeneous leaves:
  • Select an inhomogeneous leaf node
  • Replace that leaf node by a test node creating the subsets that yield the highest information gain
• Effectively creates a set of rectangular regions
  • Repeatedly draws lines in different axes
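
The loop above can be sketched as a recursive ID3-style builder (a minimal sketch under the assumption that instances are tuples whose last element is the class label; the helper names are mine):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def build_tree(rows, features):
    """Recursively replace inhomogeneous leaves by the highest-gain test."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1 or not features:   # homogeneous leaf (or no tests left)
        return Counter(labels).most_common(1)[0][0]
    def gain(col):
        branches = defaultdict(list)
        for r in rows:
            branches[r[col]].append(r[-1])
        return entropy(labels) - sum(len(b) / len(rows) * entropy(b)
                                     for b in branches.values())
    best = max(features, key=gain)              # test with the highest information gain
    subsets = defaultdict(list)
    for r in rows:
        subsets[r[best]].append(r)
    rest = [f for f in features if f != best]
    # Internal nodes are (feature_index, {value: subtree}); leaves are labels.
    return (best, {v: build_tree(sub, rest) for v, sub in subsets.items()})
```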

22. Alternate Measures
• Issue with Information Gain: it favors features with more values
• Option: Gain Ratio
  • Sa: elements of S with value A = a
  • GainRatio(S, A) = InfoGain(S, A) / SplitInfo(S, A), where SplitInfo(S, A) = -sum_a (|Sa| / |S|) log2(|Sa| / |S|)
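
A sketch of Gain Ratio under the standard C4.5-style definition, which divides InfoGain by the entropy of the partition itself so that many-valued features are penalized (function names are mine):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def split_info(branches):
    """Entropy of the partition itself: -sum_a |Sa|/|S| log2(|Sa|/|S|).
    Grows with the number of values, penalizing many-valued features."""
    total = sum(len(b) for b in branches)
    return -sum(len(b) / total * log2(len(b) / total) for b in branches)

def gain_ratio(labels, branches):
    info_gain = entropy(labels) - sum(len(b) / len(labels) * entropy(b)
                                      for b in branches)
    return info_gain / split_info(branches)
```

For the hair-color split of the sunburn data, this divides the 0.454 gain by the split information of the 4/1/3 partition.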

27. Overfitting
• Overfitting: the model fits the training data TOO well
  • It fits noise and irrelevant details
• Why is this bad?
  • It harms generalization: the model fits the training data too well but fits new data badly
• For a model m, compare training_error(m) and D_error(m), where D = all data
• If m is overfit, then for some other model m':
  • training_error(m) < training_error(m'), but
  • D_error(m) > D_error(m')

32. Avoiding Overfitting
• Strategies to avoid overfitting:
  • Early stopping:
    • Stop when InfoGain < threshold
    • Stop when the number of instances < threshold
    • Stop when tree depth > threshold
  • Post-pruning: grow the full tree, then remove branches
• Which is better? Unclear; both are used. For some applications, post-pruning is better.

36. Post-Pruning
• Divide the data into:
  • Training set: used to build the original tree
  • Validation set: used to perform pruning
• Build the decision tree on the training data
• While pruning does not reduce validation set performance:
  • Compute the validation performance of pruning each node (and its children)
  • Greedily remove the nodes that do not reduce validation set performance
• Yields a smaller tree with the best validation performance
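
The procedure can be sketched as greedy bottom-up reduced-error pruning over a `(feature_index, {value: subtree})` tree representation (a simplification: each subtree is scored directly on the validation rows rather than in the context of the whole tree; all names are mine):

```python
from collections import Counter

def accuracy(tree, rows):
    def predict(t, row):
        while isinstance(t, tuple):
            t = t[1][row[t[0]]]          # follow the branch for this feature value
        return t
    return sum(predict(tree, r) == r[-1] for r in rows) / len(rows)

def leaf_labels(tree):
    """All labels reachable under a (sub)tree, for majority voting."""
    if not isinstance(tree, tuple):
        return [tree]
    return [lab for sub in tree[1].values() for lab in leaf_labels(sub)]

def prune(tree, val_rows):
    """Greedily collapse nodes while validation accuracy does not drop."""
    if not isinstance(tree, tuple):
        return tree
    col, branches = tree
    pruned = (col, {v: prune(sub, val_rows) for v, sub in branches.items()})
    majority = Counter(leaf_labels(pruned)).most_common(1)[0][0]
    # Replace this subtree by its majority leaf if validation accuracy does not suffer.
    if accuracy(majority, val_rows) >= accuracy(pruned, val_rows):
        return majority
    return pruned
```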

41. Performance Measures
• Compute accuracy on:
  • A validation set
  • k-fold cross-validation
• Weighted classification error cost: weight some types of errors more heavily
• Minimum description length: favor good accuracy on compact models
  • MDL = error(tree) + model_size(tree)
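
The MDL formula can be sketched as a scoring function over the same `(feature_index, {value: subtree})` representation (a minimal sketch; the `size_weight` knob is my assumption, not from the slides):

```python
def predict(tree, row):
    # A tree is either a class label or (feature_index, {value: subtree}).
    while isinstance(tree, tuple):
        tree = tree[1][row[tree[0]]]
    return tree

def tree_size(tree):
    """Count nodes: one per leaf, one per internal test node."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(tree_size(sub) for sub in tree[1].values())

def mdl_score(tree, rows, size_weight=1.0):
    """MDL-style score = error(tree) + model_size(tree); lower is better.
    size_weight trades accuracy against compactness (an assumption here)."""
    errors = sum(predict(tree, r) != r[-1] for r in rows)
    return errors + size_weight * tree_size(tree)
```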

45. Rule Post-Pruning
• Convert the tree to rules
• Prune the rules independently
• Sort the final rule set
• Probably the most widely used method (in toolkits)

50. Modeling Features
• Different types of features need different tests:
  • Binary: test branches on true/false
  • Discrete: a branch for each discrete value
  • Continuous? Need to discretize
    • Enumerate all values
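
One common way to discretize a continuous feature, in the spirit of "enumerate all values": sort the values, enumerate candidate thresholds at midpoints between adjacent distinct values, and keep the binary split with the highest information gain (a sketch; the function names are mine):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Enumerate candidate thresholds (midpoints between adjacent distinct
    sorted values) and return the one whose binary split <= t / > t
    maximizes information gain, along with that gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue                      # no threshold between equal values
        t = (lo + hi) / 2
        left = [lab for v, lab in pairs[:i]]
        right = [lab for v, lab in pairs[i:]]
        avg = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - avg > best_gain:
            best_gain, best_t = base - avg, t
    return best_t, best_gain
```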