
Implementing Hoeffding Decision Trees in DB2






  1. Implementing Hoeffding Decision Trees in DB2 CS240A, Win 2003 Carlo Zaniolo

  2. Decision Tree Learning Algorithms • Decision Tree Example: a one-level tree that tests Temp at the root, with branches Mild, Cool, and Hot leading to the class labels Yes, No, and Yes
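
A minimal sketch of how this example tree might be stored relationally. The dtree schema and every name in it are illustrative assumptions, not something the slides prescribe; one row per node, with leaf rows carrying the predicted class.

```sql
-- Illustrative schema (assumed, not from the slides): one row per tree node.
CREATE TABLE dtree (
  node_id   INT NOT NULL,   -- unique node identifier
  parent_id INT,            -- NULL for the root
  attr      VARCHAR(20),    -- attribute tested at this node (NULL at leaves)
  attr_val  VARCHAR(20),    -- branch label on the edge from the parent
  class     VARCHAR(10)     -- predicted class (non-NULL only at leaves)
);

-- The one-level Temp tree from the slide:
INSERT INTO dtree VALUES (0, NULL, 'Temp', NULL,   NULL);
INSERT INTO dtree VALUES (1, 0,    NULL,   'Mild', 'Yes');
INSERT INTO dtree VALUES (2, 0,    NULL,   'Cool', 'No');
INSERT INTO dtree VALUES (3, 0,    NULL,   'Hot',  'Yes');
```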

  3. New Application Challenges • Classical Learning: In-Memory Data, All Data Available at the Beginning • New Scenario: Very Large Data, Streaming In • Solution: Incremental Learning

  4. Incremental Decision Tree Construction • [diagram: two candidate splits, a node V1 branching on A and B and a node V2 branching on X and Y, each leading to classes C1 and C2] • Intuitively, a small number of samples is sufficient to choose the best attribute to test at each node.

  5. Hoeffding Decision Tree • Hoeffding Bound: Given a random variable r with range R, we make n independent observations to estimate its mean and get the sample mean r̄. The Hoeffding bound states that with probability 1 − δ, the true mean of the variable is at least r̄ − ε, where ε = √(R² ln(1/δ) / (2n)). • Hoeffding Tree: • random variable r: the difference between the information gain given by the best attribute and by the second-best attribute. • observations: the training samples that have fallen into the node so far. • goal: the best attribute is chosen with confidence 1 − δ. • mechanics: maintain a distribution table: (attr, attr_val, class, # samples)
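
A minimal sketch of the distribution table named on this slide, and of the bound itself computed in SQL. All table and column names, and the constants δ = 0.01 and R = 1, are illustrative assumptions.

```sql
-- Per-node sufficient statistics (attr, attr_val, class, # samples):
CREATE TABLE dist (
  node_id  INT NOT NULL,          -- leaf the training sample reached
  attr     VARCHAR(20) NOT NULL,  -- candidate attribute
  attr_val VARCHAR(20) NOT NULL,  -- observed value of that attribute
  class    VARCHAR(10) NOT NULL,  -- class label of the sample
  cnt      INT NOT NULL           -- # samples with this combination
);

-- Each arriving sample bumps one counter per candidate attribute, e.g.:
UPDATE dist SET cnt = cnt + 1
 WHERE node_id = 0 AND attr = 'Temp'
   AND attr_val = 'Mild' AND class = 'Yes';

-- ε = √(R² ln(1/δ) / (2n)) at node 0, assuming δ = 0.01 and R = 1
-- (information gain on a two-class problem ranges over [0, 1]).
-- Summing cnt over a single attribute counts each sample exactly once.
SELECT SQRT(1.0 * 1.0 * LN(1.0 / 0.01) / (2 * SUM(cnt))) AS epsilon
  FROM dist
 WHERE node_id = 0 AND attr = 'Temp';
```

Once ε falls below the observed gap between the best and second-best information gain, the node can be split with confidence 1 − δ.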

  6. Make the Best Out of DB2 • Input: training and testing data, both in DB2 tables. • Output: a decision tree in a DB2 table. • DB2 utilities you may consider: • UDFs for learning, • Recursive SQL for prediction (a sketch follows below). Have Fun!
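
One way the "Recursive SQL for prediction" hint might be realized, reusing the illustrative dtree table sketched earlier. The test_data shape (one row per sample and attribute) is also an assumption, not something the slides prescribe.

```sql
-- Assumed test-data layout: one (sample, attribute, value) row per fact.
CREATE TABLE test_data (
  sample_id INT NOT NULL,
  attr      VARCHAR(20) NOT NULL,  -- attribute name, e.g. 'Temp'
  attr_val  VARCHAR(20) NOT NULL   -- attribute value, e.g. 'Mild'
);

-- Route each sample from the root to a leaf with a recursive common
-- table expression (DB2 spells this plain WITH, no RECURSIVE keyword).
WITH walk (sample_id, node_id, attr, class) AS (
  -- seed: every sample starts at the root
  SELECT s.sample_id, d.node_id, d.attr, d.class
    FROM (SELECT DISTINCT sample_id FROM test_data) s, dtree d
   WHERE d.parent_id IS NULL
  UNION ALL
  -- step: follow the child whose branch label matches the sample's
  -- value for the attribute tested at the current node
  SELECT w.sample_id, c.node_id, c.attr, c.class
    FROM walk w, dtree c, test_data t
   WHERE c.parent_id = w.node_id
     AND t.sample_id = w.sample_id
     AND t.attr      = w.attr
     AND c.attr_val  = t.attr_val
)
SELECT sample_id, class AS prediction
  FROM walk
 WHERE class IS NOT NULL;   -- leaf rows carry the prediction
```

For the one-level Temp example the recursion takes a single step, but the same query walks a tree of any depth.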

  7. References: • Pedro Domingos and Geoff Hulten. Mining High-Speed Data Streams. ACM SIGKDD, 2000. • Don Chamberlin. A Complete Guide to DB2 Universal Database. Morgan Kaufmann, 1998.
