
Decision Tree Learning





  1. Decision Tree Learning Kelby Lee

  2. Overview • What is a Decision Tree • ID3 • REP • IREP • RIPPER • Application

  3. What is a Decision Tree

  4. What is a Decision Tree • Select the best attribute for classifying the examples • Top-down: start with a concept that covers all examples • Greedy algorithm: at each step, pick the attribute that classifies the most examples • Does not backtrack • ID3 is the classic example

  5. ID3 Algorithm • ID3(Examples, Target_attribute, Attributes) • Create a Root node for the tree • If all Examples are positive: return the single-node tree Root with label = + • If all Examples are negative: return the single-node tree Root with label = - • If Attributes is empty: return the single-node tree Root with label = the most common value of Target_attribute in Examples

  6. ID3 Algorithm • Otherwise: • A ← Best_Attribute(Attributes, Examples) • Root ← A (A becomes the decision attribute for Root) • For each value vi of A: • Add a new tree branch for vi • Let Examples_vi be the subset of Examples with value vi for A • If Examples_vi is empty: add a leaf node with label = the most common value of Target_attribute in Examples • Otherwise: add the subtree ID3(Examples_vi, Target_attribute, Attributes – {A})
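
For readers who prefer code to pseudocode, here is a minimal Python sketch of the ID3 procedure on slides 5-6. It assumes each example is a dict mapping attribute names to values (including the target attribute) and uses information gain (introduced on the next slides) as Best_Attribute; the function and variable names are illustrative, not from the slides.

import math
from collections import Counter

def entropy(examples, target):
    # Shannon entropy of the target-label distribution in `examples`.
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attr, target):
    # Entropy reduction obtained by splitting `examples` on `attr`.
    total = len(examples)
    remainder = 0.0
    for v in set(e[attr] for e in examples):
        subset = [e for e in examples if e[attr] == v]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    # All examples share one label: return a leaf with that label.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left: return the most common label.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise pick the attribute with the highest information gain and
    # recurse on each observed value (iterating over observed values only,
    # so the empty-branch case on slide 6 never arises in this sketch).
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for v in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == v]
        tree[best][v] = id3(subset, target, [a for a in attributes if a != best])
    return tree

On the four toy examples of slide 8, this sketch would split on att1 first, since att1 yields the higher information gain.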

  7. Selecting the Best Attribute • Attributes are compared using a statistical property: Information Gain • Information Gain measures how well a given attribute separates the training examples according to their target classification
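
The slide does not spell out the formula; the standard definitions, in LaTeX, are:

Entropy(S) = - \sum_{c} p_c \log_2 p_c

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)

where p_c is the proportion of examples in S belonging to class c and S_v is the subset of S for which attribute A takes value v. ID3 chooses the attribute A with the largest Gain(S, A).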

  8. Example: four training examples {E1+, E2+, E3-, E4-} • Splitting on att1 gives the subsets {E1+, E2+} and {E3-, E4-} • Splitting on att2 gives the subsets {E1+, E3-} and {E2+, E4-} • Information Gain: att1 = 1, att2 = 0.5, so att1 is the better split

  9. Tree Pruning • A full tree tends to overfit the training data • Prune (simplify) the tree after learning • In most cases pruning improves accuracy on unseen data

  10. REP • Reduced Error Pruning • Deletes single conditions or single rules • Improves accuracy on noisy data • O(n^4) on large data sets
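
As a concrete (if simplified) picture of what "deletes single conditions or single rules" means, here is a Python sketch of the greedy pruning loop, assuming rules are (label, conditions) pairs applied in order and evaluated against a held-out pruning set; the representation and helper names are assumptions made for illustration, not part of REP itself.

def matches(rule, example):
    # A rule is (label, [(attr, value), ...]); it fires when all conditions hold.
    label, conds = rule
    return all(example.get(a) == v for a, v in conds)

def classify(ruleset, example, default):
    # First matching rule wins; otherwise fall back to the default class.
    for rule in ruleset:
        if matches(rule, example):
            return rule[0]
    return default

def error(ruleset, prune_set, target, default):
    wrong = sum(1 for e in prune_set if classify(ruleset, e, default) != e[target])
    return wrong / max(len(prune_set), 1)

def rep_prune(ruleset, prune_set, target, default):
    best = list(ruleset)
    best_err = error(best, prune_set, target, default)
    while True:
        candidates = []
        # Delete a whole rule ...
        for i in range(len(best)):
            candidates.append(best[:i] + best[i + 1:])
        # ... or delete a single condition from one rule.
        for i, (label, conds) in enumerate(best):
            for j in range(len(conds)):
                candidates.append(best[:i] + [(label, conds[:j] + conds[j + 1:])] + best[i + 1:])
        if not candidates:
            return best
        # Greedily apply the single deletion that works best on the pruning set;
        # stop when no deletion is at least as good as the current rule set.
        cand = min(candidates, key=lambda c: error(c, prune_set, target, default))
        cand_err = error(cand, prune_set, target, default)
        if cand_err > best_err:
            return best
        best, best_err = cand, cand_err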

  11. IREP • Incremental Reduced Error Pruning • Produces one rule at a time and eliminates all examples covered by that rule • Stops when no positive examples remain or pruning produces an unacceptable error

  12. IREP Algorithm
      PROCEDURE IREP(Pos, Neg)
      BEGIN
        Ruleset := ∅
        WHILE Pos ≠ ∅ DO
          /* Grow and prune a new rule */
          split (Pos, Neg) into (GrowPos, GrowNeg) and (PrunePos, PruneNeg)
          Rule := GrowRule(GrowPos, GrowNeg)
          Rule := PruneRule(Rule, PrunePos, PruneNeg)

  13. IREP Algorithm (continued)
          IF the error rate of Rule on (PrunePos, PruneNeg) exceeds 50% THEN
            RETURN Ruleset
          ELSE
            Add Rule to Ruleset
            Remove examples covered by Rule from (Pos, Neg)
          ENDIF
        ENDWHILE
        RETURN Ruleset
      END
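
A Python rendering of the loop on slides 12-13 may make the control flow clearer. GrowRule and PruneRule are passed in as functions because the slides treat them as black boxes; the 2/3 grow / 1/3 prune split and the rule representation (a list of attribute-value conditions) are assumptions made for this sketch, not stated on the slides.

import random

def covers(rule, example):
    # A rule here is a list of (attribute, value) conditions.
    return all(example.get(a) == v for a, v in rule)

def rule_error(rule, prune_pos, prune_neg):
    # Error on the pruning set: negatives the rule covers plus positives it misses.
    wrong = sum(1 for e in prune_neg if covers(rule, e)) + \
            sum(1 for e in prune_pos if not covers(rule, e))
    return wrong / max(len(prune_pos) + len(prune_neg), 1)

def irep(pos, neg, grow_rule, prune_rule):
    ruleset = []
    pos, neg = list(pos), list(neg)
    while pos:
        # Split the remaining data into growing and pruning sets.
        random.shuffle(pos)
        random.shuffle(neg)
        gp, pp = pos[:2 * len(pos) // 3], pos[2 * len(pos) // 3:]
        gn, pn = neg[:2 * len(neg) // 3], neg[2 * len(neg) // 3:]
        # Grow a rule on the growing set, then prune it on the pruning set.
        rule = grow_rule(gp, gn)
        rule = prune_rule(rule, pp, pn)
        # Stop when the pruned rule is no better than chance on the pruning set.
        if rule_error(rule, pp, pn) > 0.5:
            return ruleset
        # Guard not on the slides: stop if the rule covers no remaining positives,
        # otherwise the loop could make no progress.
        if not any(covers(rule, e) for e in pos):
            return ruleset
        ruleset.append(rule)
        # Remove the examples covered by the new rule from Pos and Neg.
        pos = [e for e in pos if not covers(rule, e)]
        neg = [e for e in neg if not covers(rule, e)]
    return ruleset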

  14. RIPPER • Repeated grow-and-simplify produces quite different results from REP • Repeatedly prunes the rule set to minimize its error • Repeated Incremental Pruning to Produce Error Reduction (RIPPER)

  15. RIPPER Algorithm
      PROCEDURE RIPPERk(Pos, Neg)
      BEGIN
        Ruleset := IREP(Pos, Neg)
        REPEAT k TIMES
          Ruleset := Optimize(Ruleset, Pos, Neg)
          UncovPos := Pos \ {data covered by Ruleset}
          UncovNeg := Neg \ {data covered by Ruleset}
          Ruleset := Ruleset ∪ IREP(UncovPos, UncovNeg)
        ENDREPEAT
        RETURN Ruleset
      END
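
Slide 15 translates almost line for line into the sketch below, reusing the irep and covers helpers from the IREP sketch above and assuming an optimize function shaped like the one on slides 16-17; grow_rule and prune_rule are again passed in as black boxes.

def ripper_k(pos, neg, k, grow_rule, prune_rule, optimize):
    # Build an initial rule set with IREP, then alternate k rounds of
    # optimization with re-learning rules for the examples still uncovered.
    ruleset = irep(pos, neg, grow_rule, prune_rule)
    for _ in range(k):
        ruleset = optimize(ruleset, pos, neg)
        uncov_pos = [e for e in pos if not any(covers(r, e) for r in ruleset)]
        uncov_neg = [e for e in neg if not any(covers(r, e) for r in ruleset)]
        ruleset = ruleset + irep(uncov_pos, uncov_neg, grow_rule, prune_rule)
    return ruleset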

  16. Optimization Function
      FUNCTION Optimize(Ruleset, Pos, Neg)
      BEGIN
        FOR each rule r ∈ Ruleset DO
          split (Pos, Neg) into (GrowPos, GrowNeg) and (PrunePos, PruneNeg)
          /* Compute a replacement for r */
          r' := GrowRule(GrowPos, GrowNeg)
          r' := PruneRule(r', PrunePos, PruneNeg),
            guided by the error of Ruleset \ {r} ∪ {r'}

  17. Optimization Function (continued)
          /* Compute a revision of r */
          r'' := GrowRule(GrowPos, GrowNeg)
          r'' := PruneRule(r'', PrunePos, PruneNeg),
            guided by the error of Ruleset \ {r} ∪ {r''}
          Replace r in Ruleset with the best x of r, r', r'',
            guided by the description length of Compress(Ruleset \ {r} ∪ {x})
        ENDFOR
        RETURN Ruleset
      END
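
The double pass on slides 16-17 can be condensed into the following sketch. make_replacement (grow a fresh rule and prune it), make_revision (grow by extending the existing rule and prune it) and description_length are assumed helpers standing in for GrowRule, PruneRule and Compress, which the slides do not define.

def optimize(ruleset, pos, neg, make_replacement, make_revision, description_length):
    result = list(ruleset)
    for i, r in enumerate(result):
        # Three candidates: the rule as it stands, a replacement grown from
        # scratch, and a revision grown from the rule itself.
        replacement = make_replacement(result, i, pos, neg)
        revision = make_revision(result, i, pos, neg)
        # Keep whichever candidate gives the smallest description length for
        # the rule set with that candidate in r's place.
        result[i] = min(
            (r, replacement, revision),
            key=lambda cand: description_length(result[:i] + [cand] + result[i + 1:], pos, neg),
        )
    return result

In the ripper_k sketch above, this function would be passed in with its helper arguments bound, for example via functools.partial.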

  18. RIPPER Data
      3,6.0E+00,6.0E+00,4.0E+00,none,35,empl_contr,7.444444444444445E+00,14,false,9,gnr,true,full,true,full,good.
      2,4.5E+00,4.0E+00,3.913333333333334E+00,none,40,empl_contr,7.444444444444445E+00,4,false,10,gnr,true,half,true,full,good.
      3,5.0E+00,5.0E+00,5.0E+00,none,40,empl_contr,7.444444444444445E+00,4.870967741935484E+00,false,12,avg,true,half,true,half,good.
      2,4.6E+00,4.6E+00,3.913333333333334E+00,tcf,38,empl_contr,7.444444444444445E+00,4.870967741935484E+00,false,1.109433962264151E+01,ba,true,half,true,half,good.

  19. RIPPER Names file
      good,bad.
      dur: continuous.
      wage1: continuous.
      wage2: continuous.
      wage3: continuous.
      cola: none, tcf, tc.
      hours: continuous.
      pension: none, ret_allw, empl_contr.
      stby_pay: continuous.
      shift_diff: continuous.
      educ_allw: false, true.
      holidays: continuous.
      vacation: ba, avg, gnr.
      lngtrm_disabil: false, true.
      dntl_ins: none, half, full.
      bereavement: false, true.
      empl_hplan: none, half, full.

  20. RIPPER Output
      Final hypothesis is:
      bad :- wage1<=2.8 (14/3).
      bad :- lngtrm_disabil=false (5/0).
      default good (34/1).
      ===================== summary ==================
      Train error rate: 7.02% +/- 3.41% (57 datapoints) <<
      Hypothesis size: 2 rules, 4 conditions
      Learning time: 0.01 sec

  21. RIPPER Hypothesis
      bad 14 3 IF wage1 <= 2.8 .
      bad 5 0 IF lngtrm_disabil = false .
      good 34 1 IF . .

  22. IDS • Intrusion Detection System

  23. IDS • Use data mining to detect anomalies • Better than pattern matching, since it may be possible to detect previously unknown attacks

  24. RIPPER IDS data
      86,543520084,192168000120,2698,192168000190,22,6,17,40,2096,158723779,14054,normal.
      87,543520084,192168000190,22,192168000120,2698,6,16,40,58387,39130843,46725,normal.
      ...........................
      11,543520084,192168000190,80,192168000120,2703,6,16,40,58400,39162494,46738,anomaly.
      12,543520084,192168000190,80,192168000120,2703,6,16,1500,58400,39162494,45277,anomaly.

  25. RIPPER IDS names
      normal,anomaly.
      recID: ignore.
      timestamp: symbolic.
      sourceIP: set.
      sourcePORT: symbolic.
      destIP: set.
      destPORT: symbolic.
      protocol: symbolic.
      flags: symbolic.
      length: symbolic.
      winsize: symbolic.
      ack: symbolic.
      checksum: symbolic.

  26. RIPPER Output
      Final hypothesis is:
      anomaly :- sourcePORT='80' (33/0).
      anomaly :- destPORT='80' (35/0).
      anomaly :- ack='7.01238e+07' (3/0).
      anomaly :- ack='7.03859e+07' (2/0).
      default normal (87/0).
      ================= summary =====================
      Train error rate: 0.00% +/- 0.00% (160 datapoints) <<
      Hypothesis size: 4 rules, 8 conditions
      Learning time: 0.01 sec

  27. RIPPER Output
      anomaly 33 0 IF sourcePORT = 80 .
      anomaly 35 0 IF destPORT = 80 .
      anomaly 3 0 IF ack = 7.01238e+07 .
      anomaly 2 0 IF ack = 7.03859e+07 .
      normal 87 0 IF . .

  28. IDS Output

  29. IDS Output

  30. Conclusion • What is a Decision Tree • ID3 • REP • IREP • RIPPER • Application
