

  1. Data Mining • Decision Trees • dr Iwona Schab

  2. Decision Trees • Method of classification • Recursive procedure which (progressively) divides a set of n units into groups according to a division rule • Designed for supervised prediction problems (i.e. a set of input variables is used to predict the value of a target variable) • The primary goal is prediction • The fitted tree model is used for target variable prediction for new cases (i.e. to score new cases/data) • Result: a final partition of the observations and the Boolean rules needed to score new data
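A minimal sketch of how such a fitted tree model can be used to score new cases, here with scikit-learn's DecisionTreeClassifier; the synthetic data, the number of inputs and the hyperparameter values are illustrative assumptions, not part of the slides:

```python
# Minimal sketch: fit a classification tree and use it to score new cases.
# The data set and hyperparameter values are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                    # three input variables
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)     # binary target

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50)
tree.fit(X, y)                                    # recursive partitioning of the training sample

X_new = rng.normal(size=(5, 3))                   # "new" cases to be scored
print(tree.predict(X_new))                        # predicted class for each case
print(tree.predict_proba(X_new)[:, 1])            # probability of class membership
```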

  3. Decision Tree • A predictive model represented in a tree-like structure • Root node • A split based on the values of the input • Internal node • Terminal node – the leaf

  4. Decision tree • Nonparametric method • Allows for modelling nonlinear relationships • Sound concept • Easy to interpret • Robust against outliers • Detects and takes into account potential interactions between input variables • Additional uses: categorisation of continuous variables, grouping of nominal values

  5. Decision Trees • Types: • Classification trees (categorical response variable) • the leaves give the predicted class and the probability of class membership • Regression trees (continuous response variable) • the leaves give the predicted value of the target • Exemplary applications: • Handwriting recognition • Medical research • Financial and capital markets

  6. Decision Tree • The path to each leaf is expressed as a Boolean rule: if … then … • The ’regions’ of the input space are determined by the split values • Intersections of subspaces defined by a single splitting variable • A regression tree model is a multivariate step function • Leaves represent the predicted target • All cases in a particular leaf are given the same predicted target • Splits: • Binary • Multiway splits (inputs partitioned into disjoint ranges)
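The Boolean if-then rules behind a fitted tree can be listed directly; a small sketch using scikit-learn's export_text on an illustrative two-input model (data and feature names are assumed for the example):

```python
# Sketch: print the Boolean if-then rules of a fitted tree (illustrative data).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0.3).astype(int)

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Each root-to-leaf path is a conjunction of "input <= split value" conditions;
# all cases falling into the same leaf receive the same predicted target.
print(export_text(tree, feature_names=["x1", "x2"]))
```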

  7. Analytical decisions • Recursive partitioning rule / splitting criterion • Pruning criterion / stopping criterion • Assignment of the predicted target variable

  8. Recursive partitioning rule • Method used to fit the tree • Top-down, greedy algorithm • Starts at the root node • Splits involving each single input are examined • Disjoint subsets of nominal inputs • Disjoint ranges of ordinal / interval inputs • The splitting criterion • measures the reduction in variability of the target distribution in the child nodes • is used to choose the split • The chosen split determines the partitioning of the observations • The partition is repeated in each child node as if it were the root node of a new tree • The partitioning continues deeper in the tree – the process is repeated recursively until it is stopped by the stopping rule
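A plain-Python sketch of the top-down, greedy recursion described on this slide; the Gini impurity and the depth/size stopping rule are placeholder assumptions chosen to keep the example short:

```python
# Sketch of top-down, greedy recursive partitioning (binary splits on numeric inputs).
# The impurity measure and the stopping rule are simple placeholder choices.
import numpy as np

def gini(y):
    p = np.bincount(y, minlength=2) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Examine candidate splits on each single input and return the one with the
    largest reduction in impurity of the target distribution in the child nodes."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:         # candidate cut points on input j
            left = X[:, j] <= t
            drop = gini(y) - (left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left]))
            if best is None or drop > best[0]:
                best = (drop, j, t)
    return best

def grow(X, y, depth=0, max_depth=3, min_size=20):
    # Stopping rule: maximum depth reached, node too small, or node already pure.
    if depth == max_depth or len(y) < min_size or gini(y) == 0.0:
        return {"leaf": True, "prediction": int(np.bincount(y).argmax())}
    split = best_split(X, y)
    if split is None or split[0] <= 0.0:          # no split reduces the impurity
        return {"leaf": True, "prediction": int(np.bincount(y).argmax())}
    drop, j, t = split
    left = X[:, j] <= t
    # The partition is repeated in each child node as if it were the root of a new tree.
    return {"leaf": False, "input": j, "cut": t,
            "l": grow(X[left], y[left], depth + 1, max_depth, min_size),
            "r": grow(X[~left], y[~left], depth + 1, max_depth, min_size)}
```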

  9. Splits on (at least) ordinal input • Restrictions in order to preserve the ordering • Only adjacent values are grouped • Problem: to partition an input with L distinct values (levels) into B groups • → (L − 1 choose B − 1) partitions • → 2^(L−1) − 1 possible splits on a single ordinal input (summing over B = 2, …, L) • Any monotonic transformation of the levels of the input (with at least an ordinal measurement scale) gives the same split
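A quick numerical check of these counts, assuming the (L − 1 choose B − 1) form reconstructed above; the function names are ad hoc:

```python
# Count the partitions of L ordered levels into B groups of adjacent values
# and cross-check the closed form by enumerating the cut points.
from math import comb
from itertools import combinations

def ordinal_partitions(L, B):
    # Choose B - 1 cut points among the L - 1 gaps between adjacent levels.
    return comb(L - 1, B - 1)

L, B = 6, 3
cut_sets = list(combinations(range(1, L), B - 1))    # brute-force enumeration
assert len(cut_sets) == ordinal_partitions(L, B) == 10

# Total number of non-trivial splits of one ordinal input (B = 2, ..., L):
print(sum(ordinal_partitions(L, B) for B in range(2, L + 1)))   # 2**(L-1) - 1 = 31
```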

  10. Splits on nominal input • No restrictions regarding ordering • Problem: to partition an input with L distinct values (levels) into B groups • Number of partitions: S(L, B) – the Stirling number of the second kind • counts the number of ways to partition a set of L labelled objects into B nonempty unlabelled subsets • The total number of partitions (over all B): the Bell number, Σ S(L, B) for B = 1, …, L
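A small sketch computing the Stirling numbers of the second kind via their standard recurrence, and summing them over B to get the total number of partitions:

```python
# Stirling numbers of the second kind S(L, B): ways to partition L labelled
# objects (levels of a nominal input) into B nonempty unlabelled subsets.
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(L, B):
    if B == 0:
        return 1 if L == 0 else 0
    if L == 0 or B > L:
        return 0
    # The L-th object either starts a new subset or joins one of the B existing ones.
    return stirling2(L - 1, B - 1) + B * stirling2(L - 1, B)

def total_partitions(L):
    # Total number of partitions over all group counts B (the Bell number).
    return sum(stirling2(L, B) for B in range(1, L + 1))

print(stirling2(4, 2))       # 7
print(total_partitions(4))   # 15
```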

  11. Binary splits • Ordinal input: L − 1 possible splits (one per cut point between adjacent levels) • Nominal input: S(L, 2) = 2^(L−1) − 1 possible splits
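The two binary-split counts side by side, as a small illustration of how quickly the nominal case grows; the function names are ad hoc:

```python
# Number of binary splits of a single input with L distinct levels.
def binary_splits_ordinal(L):
    # Only adjacent levels may be grouped: one split per cut point.
    return L - 1

def binary_splits_nominal(L):
    # Any split into two nonempty unlabelled subsets: S(L, 2) = 2**(L-1) - 1.
    return 2 ** (L - 1) - 1

for L in (3, 5, 10):
    print(L, binary_splits_ordinal(L), binary_splits_nominal(L))
# L = 10: 9 ordinal splits versus 511 nominal splits
```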

  12. Partitioning rule – possible variations • Incorporating some type of look-ahead or backup • often produces inferior trees • has not been shown to be an improvement (Murthy and Salzberg, 1995) • Oblique splits • splits on linear combinations of inputs (as opposed to the standard coordinate-axis splits, i.e. boundaries parallel to the input coordinates)

  13. Recursive partitioning algorithm • Start with the L-way split • Collapse the two levels that are closest (based on a splitting criterion) • Repeat the process on the set of L − 1 consolidated levels • … • → one candidate split of each size • Choose the best split for the given input • Repeat the process for each input and choose the best input • CHAID algorithm • additional backward elimination step • Number of splits to consider greatly reduced: • for an ordinal input: Σ (k − 1) over k = 2, …, L = L(L − 1)/2 pairwise comparisons • for a nominal input: Σ k(k − 1)/2 over k = 2, …, L = (L + 1)L(L − 1)/6 pairwise comparisons
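A plain-Python sketch of the level-collapsing idea for an ordinal input; measuring "closest" by the difference in event rates is a simplifying assumption (CHAID proper uses chi-square based tests), and the input counts are hypothetical:

```python
# Sketch: greedy collapsing of adjacent levels of an ordinal input (CHAID-like idea).
# "Closest" is measured here by the difference in event rates, a simplifying
# assumption; CHAID itself relies on chi-square based tests.

def collapse_levels(counts):
    """counts: list of (events, total) per ordinal level, in order.
    Returns one consolidated grouping per size, from L groups down to 2."""
    groups = [[i] for i in range(len(counts))]
    stats = [list(c) for c in counts]
    candidates = [[list(g) for g in groups]]
    while len(groups) > 2:
        rates = [e / n for e, n in stats]
        # Merge the pair of adjacent groups with the most similar event rate.
        k = min(range(len(groups) - 1), key=lambda i: abs(rates[i] - rates[i + 1]))
        groups[k] = groups[k] + groups.pop(k + 1)
        merged = stats.pop(k + 1)
        stats[k] = [stats[k][0] + merged[0], stats[k][1] + merged[1]]
        candidates.append([list(g) for g in groups])
    return candidates

# Hypothetical input with 5 levels: (events, total) per level.
for grouping in collapse_levels([(10, 100), (12, 100), (30, 100), (33, 100), (60, 100)]):
    print(grouping)
```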

  14. Stopping criterion • Governs the depth and complexity of the tree • Right balance between depth and complexity • When the tree is too complex: • perfect discrimination in the training sample • lost stability • lost ability to generalise discovered patterns and relations • overfitting to the training sample • difficulties with interpretation of predictive rules • Trade-off between the adjustment to the training sample and the ability to generalise

  15. Splitting criterion • Impurity reduction • Chi-square test • An exhaustive tree algorithm considers: • all possible partitions • of all inputs • at every node • → combinatorial explosion
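A minimal sketch of the chi-square criterion for one candidate binary split, using scipy.stats.chi2_contingency on a child-node-by-class table with hypothetical counts:

```python
# Sketch: Pearson chi-square statistic for one candidate split (hypothetical counts).
# Rows = child nodes (left, right), columns = target classes (good, bad).
from scipy.stats import chi2_contingency

table = [[500, 100],      # left child:  500 good, 100 bad
         [1100, 300]]     # right child: 1100 good, 300 bad

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)      # larger chi2 / smaller p-value = stronger split
```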

  16. Splitting criterion • Minimise impurity within child nodes / maximise differences between the newly split child nodes • → choose the split into child nodes which: • maximises the drop in impurity resulting from the parent node partition • maximises the difference between nodes • Measures of impurity: • basic ratio • Gini impurity index • entropy • Measures of difference: • based on relative frequencies (classification tree) • based on target variance (regression tree)
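The three impurity measures named above, written out as small Python functions for a binary target with class proportions p and 1 − p:

```python
# The three impurity measures for a node with class proportions p and 1 - p.
import numpy as np

def basic_ratio(p):
    # Basic impurity index: share of the minority class in the node.
    return min(p, 1 - p)

def gini(p):
    # Gini impurity index: 1 minus the sum of squared class proportions.
    return 1 - (p ** 2 + (1 - p) ** 2)

def entropy(p):
    # Entropy in bits; 0 * log(0) is taken as 0.
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

for p in (0.5, 0.2, 0.05):
    print(p, basic_ratio(p), gini(p), round(entropy(p), 3))
```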

  17. Binary Decision Trees • Nonparametric model → no assumptions regarding distribution needed • Classifies observations into pre-defined groups → target variable predicted for the whole leaf • Supervised segmentation • In the basic case: recursive partition into two separate categories in order to maximise similarities of observations within the leaf and maximise differences between leaves • Tree model = rules of segmentation • No prior selection of input variables

  18. Trees vs hierarchical segmentation • Trees • predictive approach • supervised classification • segmentation based on the target variable • each partitioning based on one variable at a time (usually) • Hierarchical segmentation • descriptive approach • unsupervised classification • segmentation based on all variables • each partitioning based on all variables at a time – based on a distance measure

  19. Requirements • Large data sample • In case of classification trees: sufficient number of cases falling into each class of the target (suggested: min. 500 cases per class)

  20. Stopping criterion • The node reaches a pre-defined size (e.g. 10 or fewer cases) • The algorithm has run the predefined number of generations • The split results in a (too) small drop of impurity • Expected losses in the testing sample • Stability of results in the testing sample • Probabilistic assumptions regarding the variables (e.g. CHAID algorithm)
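Several of these stopping rules correspond directly to hyperparameters of common tree implementations; a hedged example with scikit-learn, where the specific values are arbitrary illustrations:

```python
# Illustrative mapping of stopping rules onto scikit-learn hyperparameters
# (parameter values are arbitrary examples, not recommendations).
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_leaf=10,         # node reaches a pre-defined minimum size
    max_depth=5,                 # tree has grown a pre-defined number of "generations"
    min_impurity_decrease=0.001, # split rejected if the drop of impurity is too small
)
```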

  21. Target assignment to the leaf • Frequency based • threshold needed • Cost of misclassification based • α – cost of the type I error – e.g. average cost incurred due to acceptance of a „bad” credit • β – cost of the type II error – e.g. average income lost due to rejection of a „good” credit
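With these two costs, a leaf is labelled „bad” as soon as the expected cost of accepting its cases exceeds the expected cost of rejecting them, i.e. when the leaf's bad rate exceeds β / (α + β); a small sketch with hypothetical cost values:

```python
# Cost-based target assignment for a leaf (credit-scoring setting from the slide).
def assign_label(p_bad, alpha, beta):
    """alpha: cost of accepting a 'bad' credit (type I error),
    beta: cost of rejecting a 'good' credit (type II error)."""
    expected_cost_accept = alpha * p_bad          # bad cases slip through
    expected_cost_reject = beta * (1 - p_bad)     # good cases are turned away
    return "bad" if expected_cost_accept > expected_cost_reject else "good"

# Equivalent threshold on the leaf's bad rate: p_bad > beta / (alpha + beta).
print(assign_label(p_bad=0.15, alpha=1000, beta=200))   # threshold ~0.167 -> 'good'
print(assign_label(p_bad=0.20, alpha=1000, beta=200))   # -> 'bad'
```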

  22. Disadvantages • Lack of stability (often) • Stability assessment on the basis of a testing sample, without formal statistical inference • In case of classification trees: target value calculated in a separate step with a „simplistic” method (dominating-frequency assignment) • Target value calculated on the leaf level, not on the individual observation level

  23. Splitting Example • Drop of impurity: ΔI = i(v) − [p(l)·i(l) + p(r)·i(r)] • Basic impurity index (binary target): i(v) = min{p(class 1 | v); p(class 2 | v)} • Average impurity of child nodes: p(l)·i(l) + p(r)·i(r)

  24. Splitting Example • Gini impurity index: i(v) = 1 − Σ p_j² • Entropy: i(v) = −Σ p_j·log p_j • Pearson's χ² test for relative frequencies: χ² = Σ (observed − expected)² / expected

  25. Splitting Example • How to split the (in this case ordinal) variable „age”? • → (young + older) vs. medium? • → (young + medium) vs. older?

  26. Splitting Example • 1. Young + Older = r versus Medium = l • i(v) = min{400/2000; 1600/2000} = 0.2 • p(r) = 1400/2000 = 0.7, p(l) = 600/2000 = 0.3 • i(r) = 300/1400 ≈ 0.214, i(l) = 100/600 ≈ 0.167 • Average impurity of child nodes: 0.7·(300/1400) + 0.3·(100/600) = 0.15 + 0.05 = 0.2 • ΔI = 0.2 − 0.2 = 0

  27. Splitting Example • 2. Young + Medium = r versus Older = l • i(v) = min{400/2000; 1600/2000} = 0.2 • p(r) = 1600/2000 = 0.8, p(l) = 400/2000 = 0.2 • i(r) = 300/1600 ≈ 0.188, i(l) = 100/400 = 0.25 • Average impurity of child nodes: 0.8·(300/1600) + 0.2·(100/400) = 0.15 + 0.05 = 0.2 • ΔI = 0 – under the basic index neither grouping reduces the impurity
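A short script reproducing the two calculations above, confirming that under the basic impurity index neither grouping yields any drop of impurity:

```python
# Reproduce slides 26-27: drop of impurity under the basic (minority-share) index.
def basic(bad, total):
    return min(bad, total - bad) / total

def impurity_drop(parent, left, right):
    """parent/left/right: (bad_count, total_count) tuples."""
    p_l = left[1] / parent[1]
    p_r = right[1] / parent[1]
    return basic(*parent) - (p_l * basic(*left) + p_r * basic(*right))

parent = (400, 2000)
print(round(impurity_drop(parent, left=(100, 600), right=(300, 1400)), 6))   # split 1: 0.0
print(round(impurity_drop(parent, left=(100, 400), right=(300, 1600)), 6))   # split 2: 0.0
```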

  28. Splitting Example • 1. Young + Older = r versus Medium = l • p(r) = 1400/2000 = 0.7, p(l) = 600/2000 = 0.3

  29. Splitting Example • 2. Young + Medium = r versus Older = l • p(r) = 1600/2000 = 0.8, p(l) = 400/2000 = 0.2
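The measure used on these last two slides is not recoverable from the transcript; repeating the comparison with the Gini index (an assumed choice) does separate the two groupings slightly, favouring the second split:

```python
# The same two splits evaluated with the Gini index instead of the basic index.
def gini(bad, total):
    p = bad / total
    return 2 * p * (1 - p)

def gini_drop(parent, left, right):
    p_l, p_r = left[1] / parent[1], right[1] / parent[1]
    return gini(*parent) - (p_l * gini(*left) + p_r * gini(*right))

parent = (400, 2000)
print(round(gini_drop(parent, (100, 600), (300, 1400)), 5))   # split 1: ~0.00095
print(round(gini_drop(parent, (100, 400), (300, 1600)), 5))   # split 2: ~0.00125
```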
