
RainForest – A Framework for Fast Decision Tree Construction of Large Datasets

RainForest is a framework for decision tree classifiers that addresses scalability and quality issues in tree construction algorithms for large databases.


Presentation Transcript


  1. RainForest – A Framework for Fast Decision Tree Construction of Large Datasets Authors: Johannes Gehrke, Raghu Ramakrishnan, Venkatesh Ganti Presented by: Xin Li & Omid Rouhani

  2. Outline • The very first impression of the RainForest • Decision tree classifiers • Formal problem definition • Dealing with large databases: Sprint • The RainForest framework • Top-Down Decision Tree Induction Schema • RainForest refinement • Main steps of RainForest algorithms • The RainForest family of algorithms • RF-Write, RF-Read, RF-Hybrid, RF-Vertical • Experimental results • Conclusions

  3. The Very First Impression of RainForest Q: What’s in a rainforest? (Monteverde Costa Rica rainforest)

  4. The Very First Impression of RainForest Q: What’s in a rainforest? A: Trees! All kinds of trees, and they all grow fast in the rainforest! (Monteverde Costa Rica rainforest)

  5. The Very First Impression of RainForest Q: What’s in a rainforest? A: Trees! All kinds of trees, and they all grow fast in the rainforest! Similarly, RainForest is a unifying framework for decision tree classifiers, under which the scalability issue can be dealt with independently of the quality issue in tree construction algorithms. (Monteverde Costa Rica rainforest)

  6. Outline • The very first impression of the RainForest • Decision tree classifiers • Formal problem definition • Dealing with large databases: Sprint • The RainForest framework • Top-Down Decision Tree Induction Schema • RainForest refinement • Main steps of RainForest algorithms • The RainForest family of algorithms • RF-Write, RF-Read, RF-Hybrid, RF-Vertical • Experimental results • Conclusions

  7. Decision Tree Classifiers (1)-- Formal problem definition • Family = F(n): all records in the database that correspond to node n

  8. Decision Tree Classifiers (1)-- Formal problem definition • Family = F(n). Example 1: after a split into Age > 20 and Age <= 20, F(n) for the first branch = all entries in the database corresponding to people over 20 years old

  9. Decision Tree Classifiers (1)-- Formal problem definition • Family = F(n). Example 2: at the root, F(n) = the entire database

  10. Decision Tree Classifiers (1)-- Formal problem definition • Splitting criterion = crit(n) • Which attribute to split on • Which values each branch corresponds to. Example: Age > 20 / Age <= 20

  11. Decision Tree Classifiers (2)-- Dealing with large databases: Sprint • Basic features of Sprint: • Creates binary trees • Removes any dependence between dataset size and main-memory size • Requires sorted access to F(n)

  13. Decision Tree Classifiers (2)-- Dealing with large databases: Sprint • Basic features of Sprint: • Creates binary trees • Removes any dependence between dataset size and main-memory size • Requires sorted access to F(n) [Diagram: database (large) vs. what is stored in memory (small)]

  14. Decision Tree Classifiers (2)-- Dealing with large databases: Sprint • Basic features of Sprint: • Creates binary trees • Removes any dependence between dataset size and main-memory size • Requires sorted access to F(n) [Diagram: sorted database (large) vs. what is stored in memory (small)]

  15. Decision Tree Classifiers (2)-- Dealing with large databases: Sprint • Basic features of Sprint: • Creates binary trees • Removes any dependence between dataset size and main-memory size • Requires sorted access to F(n) [Diagram: database (large), still sorted after the split on age > 20]

  16. Decision Tree Classifiers (2)-- Dealing with large databases: Sprint • Basic features of Sprint: • Creates binary trees • Removes any dependence between dataset size and main-memory size • Requires sorted access to F(n) [Diagram: attribute lists for the database (large), split on age > 20]
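
To make the attribute-list idea on this slide concrete, here is a minimal sketch in Python, assuming a simple in-memory record layout; the real Sprint keeps one such list per attribute on disk:

```python
# Sketch of a Sprint-style attribute list (illustrative record layout).
# Sprint materializes, for each attribute, a list of
# (attribute value, class label, record id) entries sorted by value,
# so candidate binary splits can be evaluated in one sequential pass.

records = [
    {"rid": 0, "age": 23, "label": "yes"},
    {"rid": 1, "age": 17, "label": "no"},
    {"rid": 2, "age": 43, "label": "yes"},
]

def build_attribute_list(records, attribute):
    """Build the sorted attribute list for one attribute."""
    entries = [(r[attribute], r["label"], r["rid"]) for r in records]
    entries.sort(key=lambda e: e[0])
    return entries

print(build_attribute_list(records, "age"))
# [(17, 'no', 1), (23, 'yes', 0), (43, 'yes', 2)]
```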

  17. Decision Tree Classifiers (2)-- Dealing with large databases: Sprint • Summary of the Sprint algorithm • Binary cuts • Scales well to large databases • With the RainForest framework • We can use other algorithms (C4.5, ID3, FACT, etc.) • And also scale them up to large databases

  18. Outline • The very first impression of the RainForest • Decision tree classifiers • Formal problem definition • Dealing with large databases: Sprint • The RainForest framework • Top-Down decision tree induction schema • RainForest refinement • Main steps of RainForest algorithms • The RainForest family of algorithms • RF-Write, RF-Read, RF-Hybrid, RF-Vertical • Experimental results • Conclusions

  19. The RainForest Framework (1)-- Top-down decision tree induction schema • At the root r, examine the database and compute the best crit(r). [Diagram: database feeding root r, labeled crit(r), with child n]

  20. The RainForest Framework (1)-- Top-down decision tree induction schema • At the root r, examine the database and compute the best crit(r). • Recursively, at a non-root node n, examine F(n) and compute crit(n) until the class label of F(n) can be determined. [Diagram: root r labeled crit(r); child n labeled crit(n)]

  21. The RainForest Framework (1)-- Top-down decision tree induction schema • Finding 1: All existing decision tree algorithms (including C4.5, CART, CHAID, FACT, ID3 and extensions, SLIQ, Sprint and QUEST) proceed according to this generic schema. • Top-Down Decision Tree Induction Schema • Input: node n, partition D, classification algorithm CL • Output: decision tree for D rooted at n • BuildTree(Node n, dataPartition D, algorithm CL) • apply CL to D to find crit(n) • let k be the number of children of n • if (k > 0) • Create k children c1, …, ck of n • Use best split to partition D into D1, …, Dk • for (i = 1; i ≤ k; i++) • BuildTree(ci, Di, CL) • endfor • endif

  22. The RainForest Framework (1)-- Top-down decision tree induction schema • Finding 1: All existing decision tree algorithms (including C4.5, CART, CHAID, FACT, ID3 and extensions, SLIQ, Sprint and QUEST) proceed according to this generic schema. • Top-Down Decision Tree Induction Schema • Input: node n, partition D, classification algorithm CL • Output: decision tree for D rooted at n • BuildTree(Node n, dataPartition D, algorithm CL) • apply CL to D to find crit(n) • let k be the number of children of n • if (k > 0) • Create k children c1, …, ck of n • Use best split to partition D into D1, …, Dk • for (i = 1; i ≤ k; i++) • BuildTree(ci, Di, CL) • endfor • endif Step 1: decide the splitting criterion

  23. The RainForest Framework (1)-- Top-down decision tree induction schema • Finding 1: All existing decision tree algorithms (including C4.5, CART, CHAID, FACT, ID3 and extensions, SLIQ, Sprint and QUEST) proceed according to this generic schema. • Top-Down Decision Tree Induction Schema • Input: node n, partition D, classification algorithm CL • Output: decision tree for D rooted at n • BuildTree(Node n, dataPartition D, algorithm CL) • apply CL to D to find crit(n) • let k be the number of children of n • if (k > 0) • Create k children c1, …, ck of n • Use best split to partition D into D1, …, Dk • for (i = 1; i ≤ k; i++) • BuildTree(ci, Di, CL) • endfor • endif Step 2: create the child partitions

  24. The RainForest Framework (1)-- Top-down decision tree induction schema • Finding 1: All existing decision tree algorithms (including C4.5, CART, CHAID, FACT, ID3 and extensions, SLIQ, Sprint and QUEST) proceed according to this generic schema. • Top-Down Decision Tree Induction Schema • Input: node n, partition D, classification algorithm CL • Output: decision tree for D rooted at n • BuildTree(Node n, dataPartition D, algorithm CL) • apply CL to D to find crit(n) • let k be the number of children of n • if (k > 0) • Create k children c1, …, ck of n • Use best split to partition D into D1, …, Dk • for (i = 1; i ≤ k; i++) • BuildTree(ci, Di, CL) • endfor • endif Step 3: recursively build sub-trees

  25. The RainForest Framework (1)-- Top-down decision tree induction schema • Finding 1: All existing decision tree algorithms (including C4.5, CART, CHAID, FACT, ID3 and extensions, SLIQ, Sprint and QUEST) proceed according to this generic schema. • Top-Down Decision Tree Induction Schema • Input: node n, partition D, classification algorithm CL • Output: decision tree for D rooted at n • BuildTree(Node n, dataPartition D, algorithm CL) • apply CL to D to find crit(n) • let k be the number of children of n • if (k > 0) • Create k children c1, …, ck of n • Use best split to partition D into D1, …, Dk • for (i = 1; i ≤ k; i++) • BuildTree(ci, Di, CL) • endfor • endif generalization

  26. The RainForest Framework (1)-- Top-down decision tree induction schema • Finding 1: All existing decision tree algorithms (including C4.5, CART, CHAID, FACT, ID3 and extensions, SLIQ, Sprint and QUEST) proceed according to this generic schema. • Top-Down Decision Tree Induction Schema • Input: node n, partition D, classification algorithm CL • Output: decision tree for D rooted at n • BuildTree(Node n, dataPartition D, algorithm CL) • apply CL to D to find crit(n) • let k be the number of children of n • if (k > 0) • Create k children c1, …, ck of n • Use best split to partition D into D1, …, Dk • for (i = 1; i ≤ k; i++) • BuildTree(ci, Di, CL) • endfor • endif RainForest Refinement BuildTree(Node n, dataPartition D, algorithm CL) (1a) for each predictor attribute p (1b) call CL.find_best_partitioning(AVC-set of p) (1c) endfor (2a) k = CL.decide_splitting_criterion();
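
As a minimal sketch, the refined schema above might be rendered in Python as follows. The cl object stands for the plugged-in algorithm CL; its method names mirror the pseudocode, and the assumption that decide_splitting_criterion returns a child count plus a routing function is ours, not the paper's exact interface:

```python
from collections import Counter

class Node:
    """Minimal tree node; crit(n) and class labels are left to CL."""
    def __init__(self):
        self.children = []

    def add_child(self):
        child = Node()
        self.children.append(child)
        return child

def build_avc_set(partition, attribute):
    # AVC-set of 'attribute': aggregated counts of
    # (attribute value, class label) pairs over the node's partition F(n).
    return Counter((rec[attribute], rec["label"]) for rec in partition)

def build_tree(node, partition, cl, predictor_attributes):
    # (1a)-(1c): CL examines each predictor attribute's AVC-set; this is
    # all the access to the data CL needs to choose crit(n).
    for p in predictor_attributes:
        cl.find_best_partitioning(p, build_avc_set(partition, p))
    # (2a): CL decides the splitting criterion; assumed to return the
    # number of children k and a function routing records to children.
    k, route = cl.decide_splitting_criterion(node)
    if k > 0:
        # Use the best split to partition D into D1, ..., Dk.
        child_partitions = [[] for _ in range(k)]
        for rec in partition:
            child_partitions[route(rec)].append(rec)
        # Recursively build the sub-trees.
        for Di in child_partitions:
            build_tree(node.add_child(), Di, cl, predictor_attributes)
```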

  27. The RainForest Framework (1)-- Top-down decision tree induction schema • Finding 2: At each node n, the utility of a predictor attribute A as a possible splitting attribute is examined independently of the other attributes. How well can the data be separated with A? Utility(Age) = 2, Utility($$) = 1

  28. The RainForest Framework (1)-- Top-down decision tree induction schema • Finding 2: At each node n, the utility of a predictor attribute A as a possible splitting attribute is examined independently of the other attributes. How well can the data be separated with A? Utility(Age) = 2, Utility($$) = 1 crit(n) = bestPartition(Age)

  29. The RainForest Framework (1)-- Top-down decision tree induction schema • Finding 3: At each node n, to compute the utility of a predictor attribute A as a possible splitting attribute, the information about the class label distribution for each distinct attribute value of A is sufficient. Class distribution of Age Utility(Age)

  30. The RainForest Framework (1)-- Top-down decision tree induction schema • Finding 3: At each node n, to compute the utility of a predictor attribute A as a possible splitting attribute, the information about the class label distribution for each distinct attribute value of A is sufficient. Class distribution of Age AVC-set Utility(Age)
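
As one concrete instance of Finding 3, an entropy-based utility such as ID3's information gain can be computed from the AVC-set alone. A sketch, with the AVC-set represented as a mapping from (attribute value, class label) to count; the dict layout is our illustration, not the paper's data structure:

```python
import math
from collections import Counter, defaultdict

def information_gain(avc_set):
    """Information gain of splitting on this attribute, computed purely
    from its AVC-set: a mapping (attribute value, class label) -> count."""
    def entropy(class_counts):
        total = sum(class_counts.values())
        return -sum(c / total * math.log2(c / total)
                    for c in class_counts.values() if c > 0)

    node_counts = Counter()            # class distribution at the node
    per_value = defaultdict(Counter)   # class distribution per attr value
    for (value, label), count in avc_set.items():
        node_counts[label] += count
        per_value[value][label] += count

    n = sum(node_counts.values())
    split_entropy = sum(sum(cc.values()) / n * entropy(cc)
                        for cc in per_value.values())
    return entropy(node_counts) - split_entropy

# Example AVC-set for Age with two class labels:
avc = {("<=20", "no"): 3, ("<=20", "yes"): 1, (">20", "yes"): 4}
print(round(information_gain(avc), 3))  # ~0.549: Age separates well
```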

  31. The RainForest Framework (2)-- RainForest refinement • AVC-set of a predictor attribute A at node n: • The projection of F(n) onto A and the class label, whereby the counts of the individual class labels are aggregated. • AVC = Attribute-Value, Class label [Diagram: F(n) with its AVC-set($$) and AVC-set(Age)]

  32. The RainForest Framework (2)-- RainForest refinement • AVC-group of a node n: the set of all AVC-sets at node n [Diagram: root r with children n1, n2; AVC-group(r), AVC-group(n1) and AVC-group(n2) each consist of AVC-set(age) and AVC-set($$)]
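
A minimal sketch of building the whole AVC-group of a node in a single scan of F(n), assuming records are dicts with a "label" field (the layout is illustrative):

```python
from collections import defaultdict

def build_avc_group(partition, predictor_attributes):
    """AVC-group of a node: one AVC-set per predictor attribute,
    all filled in a single scan of the node's partition F(n)."""
    avc_group = {a: defaultdict(int) for a in predictor_attributes}
    for record in partition:
        for a in predictor_attributes:
            # Aggregate the count of this (attribute value, class label).
            avc_group[a][(record[a], record["label"])] += 1
    return avc_group

partition = [
    {"age": 25, "salary": 60, "label": "yes"},
    {"age": 25, "salary": 30, "label": "no"},
    {"age": 40, "salary": 60, "label": "yes"},
]
group = build_avc_group(partition, ["age", "salary"])
# group["age"] == {(25, "yes"): 1, (25, "no"): 1, (40, "yes"): 1}
```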

  33. The RainForest Framework (2)-- RainForest refinement • Refined RainForest Schema RainForest Refinement BuildTree(Node n, dataPartition D, algorithm CL) (1a) for each predictor attribute p (1b) call CL.find_best_partitioning(AVC-set of p) (1c) endfor (2a) k = CL.decide_splitting_criterion(); Compute the utility for p Decide the splitting criterion

  34. The RainForest Framework (2)-- RainForest refinement • Refined RainForest Schema RainForest Refinement BuildTree(Node n, dataPartition D, algorithm CL) (1a) for each predictor attribute p (1b) call CL.find_best_partitioning(AVC-set of p) (1c) endfor (2a) k = CL.decide_splitting_criterion(); Separated!

  35. The RainForest Framework (2)-- RainForest refinement • Main memory requirement of the generic RainForest tree induction schema is determined by the AVC-sets [Diagram: with the RainForest schema, C4.5 keeps only the AVC-group (AVC-sets 1, 2, 3) of the database in RAM; plain C4.5 keeps the data records themselves in RAM; both produce the decision tree]

  36. The RainForest Framework (2)-- RainForest refinement • Main memory requirement of the generic RainForest tree induction schema is determined by the AVC-sets [Diagram as above] • C4.5 with the RainForest schema: RAM usage proportional to the # of distinct attribute values and the # of class labels in F(n) • Plain C4.5: RAM usage proportional to the # of data records in F(n)
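
For example, under the RainForest schema an Age attribute with 100 distinct values in a dataset with 2 class labels needs at most 100 × 2 = 200 counters per node, whether F(n) holds a thousand records or a billion; plain C4.5's working set grows with the record count itself.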

  37. The RainForest Framework (2)-- RainForest refinement • Fit the AVC-group into main memory • For most real-life datasets, AVC-group(r) is expected to fit entirely in main memory • Or, at least, each single AVC-set of the root node fits in main memory • The size of the AVC-sets of non-root nodes is bounded by that of the root node • Different algorithms are proposed depending on the amount of available main memory

  38. The RainForest Framework (3)-- Main steps of RainForest algorithms • For each tree node n • Step 1: AVC-group construction • Needs one scan of the data partition • Step 2: Choose the splitting attribute and the partitioning criterion on that attribute • Computation is based on the AVC-sets • Step 3: Partition D across the child nodes • Read and write once, partitioning D into child “buckets” (see the sketch below) • If memory is sufficient, build AVC-groups for one or more children at the same time as an optimization
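
A minimal sketch of Step 3, assuming child “buckets” are plain in-memory lists (the real algorithms write them back to the database); branch_of is a hypothetical routing function derived from crit(n):

```python
def partition_into_buckets(partition, split_attribute, branch_of):
    """Step 3: one read of D and one write per record, routing each
    record into the bucket of the child it belongs to."""
    buckets = {}
    for record in partition:
        branch = branch_of(record[split_attribute])
        buckets.setdefault(branch, []).append(record)
    return buckets

# Example: a binary split on age at 20.
data = [{"age": 17, "label": "no"}, {"age": 35, "label": "yes"}]
buckets = partition_into_buckets(data, "age",
                                 lambda v: 0 if v <= 20 else 1)
# buckets[0]: records with age <= 20; buckets[1]: the rest
```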

  41. The RainForest Framework (3)-- Compare to Sprint

  42. Outline • The very first impression of the RainForest • Decision tree classifiers • Formal problem definition • Dealing with large databases: Sprint • The RainForest framework • Top-Down decision tree induction schema • RainForest refinement • Main steps of RainForest algorithms • The RainForest family of algorithms • RF-Write, RF-Read, RF-Hybrid, RF-Vertical • Experimental results • Conclusions

  43. Algorithms in RainForest (1)-- Overview • RF-Write, RF-Read and RF-Hybrid • Require that the AVC-group of the root node fits in memory • RF-Vertical • Requires that each single AVC-set of the root node fits in memory [Diagram: RAM holding the whole AVC-group (sets #1, #2, #3) vs. RAM holding one AVC-set at a time]
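
These memory preconditions can be summarized as a dispatch rule; a sketch under the assumption that AVC sizes for the root can be estimated up front (function and variable names are illustrative, not from the paper):

```python
def choose_rainforest_variant(avc_group_bytes, largest_avc_set_bytes,
                              memory_bytes):
    """Pick a family member from the memory preconditions above.
    All sizes are hypothetical up-front estimates for the root node."""
    if avc_group_bytes <= memory_bytes:
        # RF-Write, RF-Read and RF-Hybrid all apply when the whole
        # AVC-group of the root fits in memory.
        return "RF-Write / RF-Read / RF-Hybrid"
    if largest_avc_set_bytes <= memory_bytes:
        # Only one AVC-set at a time fits: process attributes vertically.
        return "RF-Vertical"
    raise MemoryError("no RainForest variant applies at this memory size")
```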

  44. Algorithms in RainForest (1)-- RF-Write • Scan the database and construct the AVC-group for the root. • Use the underlying algorithm to create k partitions. • Scan the database and assign records to each partition. • Recurse on each child node. [Diagram: database]

  45. Algorithms in RainForest (1)-- RF-Write • Scan the database and construct the AVC-group for the root. • Use the underlying algorithm to create k partitions. • Scan the database and assign records to each partition. • Recurse on each child node. [Diagram: scan 1 of the database builds the AVC-group (AVC-sets 1, 2, 3)]

  46. Algorithms in RainForest (1)-- RF-Write • Scan the database and construct the AVC-group for the root. • Use the underlying algorithm (for example C4.5 or ID3) to create k partitions (k = 2 for a binary tree). • Scan the database and assign records to each partition. • Recurse on each child node. [Diagram: scan 1 of the database builds the AVC-group (AVC-sets 1, 2, 3)]

  47. Algorithms in RainForest (1)-- RF-Write • Scan the database and construct the AVC-group for the root. • Use the underlying algorithm to create k partitions. • Scan the database and assign records to each partition; here we write to the database. • Recurse on each child node. [Diagram: scan 1 of the database builds the AVC-group (AVC-sets 1, 2, 3)]

  49. Algorithms in RainForest (1)-- RF-Write • In total, the database is read 2 times and written once. • Scan the database and construct the AVC-group for the root. • Use the underlying algorithm to create k partitions. • Scan the database and assign records to each partition. • Recurse on each child node. [Diagram: scan 1 of the database builds the AVC-group (AVC-sets 1, 2, 3)]
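
Putting the four steps and the I/O count together, a sketch of RF-Write with in-memory lists standing in for the database partitions kept on disk; cl.decide_split is a hypothetical plug-in point, and build_avc_group is the earlier sketch from slide 32:

```python
def rf_write(partition, cl, predictor_attributes):
    # Read #1: one scan builds the AVC-group of this node.
    avc_group = build_avc_group(partition, predictor_attributes)
    # The underlying algorithm picks crit(n): k children plus a router.
    k, route = cl.decide_split(avc_group)
    if k == 0:
        return                          # leaf reached: nothing to write
    # Read #2 plus the single write: each record lands in its child
    # partition (a partition written back to disk in the real algorithm).
    children = [[] for _ in range(k)]
    for record in partition:
        children[route(record)].append(record)
    for child_partition in children:
        rf_write(child_partition, cl, predictor_attributes)
```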

  50. Algorithms in RainForest (2)-- RF-Read • Does not write to the database. • Only reads from the database.
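
A sketch of the resulting control flow, with route_to_leaf and cl.split_leaf as hypothetical plug-in points; the original database is rescanned once per tree level to build the AVC-groups of the nodes at that level:

```python
from collections import defaultdict

def rf_read(database, cl, predictor_attributes, route_to_leaf):
    """RF-Read sketch: no partitions are ever written back; each level
    of the tree costs one scan of the original database."""
    frontier = {"root"}                    # leaves still being grown
    while frontier:
        avc = {leaf: defaultdict(int) for leaf in frontier}
        for record in database:            # one scan per level
            leaf = route_to_leaf(record)   # follow the splits chosen so far
            if leaf in frontier:           # update this leaf's AVC-group
                for a in predictor_attributes:
                    avc[leaf][(a, record[a], record["label"])] += 1
        # cl turns each AVC-group into a split (or declares a class
        # label); the children it creates form the next frontier.
        frontier = {child for leaf in frontier
                    for child in cl.split_leaf(leaf, avc[leaf])}
```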
