## BOAT


**BOAT: Bootstrapped Optimistic Algorithm for Tree Construction**
CIS 595, Fall 2000. Presentation by Prashanth Saka.

BOAT is a new algorithm for decision tree construction that improves on earlier algorithms in both functionality and performance, yielding a performance gain of around 300%.
• The reason: only two scans over the entire training data set are needed.
• It is the first scalable algorithm that can incrementally update the tree with respect to both insertions and deletions over the dataset.

Take a sample D' ⊆ D from the training database and construct a sample tree with a coarse splitting criterion at each node using bootstrapping.
• Make one scan over the database D and process each tuple t by streaming it down the tree.
• At the root node n, update the bucket counts for each numerical predictor attribute.
• If t falls inside the confidence interval at node n, t is written into a temporary file Sn at n; otherwise it is sent further down the tree.

Then the tree is processed top-down.
• At each node, a lower-bounding technique is used to check whether the global minimum value of the impurity function could be lower than i', the minimum impurity value found so far.
• If the check succeeds, we are done with node n. Otherwise, we discard n and its subtree during the current construction.

Each node of a decision tree has exactly one incoming edge (except the root, which has none) and zero or two outgoing edges.
• Each leaf is labeled with one class label.
• Each internal node n is labeled with one predictor attribute Xn, called the splitting attribute.
• Each internal node has a splitting predicate qn associated with it.
• If Xn is numerical, then qn has the form Xn ≤ xn, where xn ∈ dom(Xn); xn is the split point at node n.

The combined information of splitting attribute and splitting predicate at a node n is called the splitting criterion at n.

With each node n ∈ T we associate a predicate fn : dom(X1) × … × dom(Xm) → {true, false}, called its node predicate, as follows:
• For the root node n, fn = true.
• Let n be a non-root node with parent p, whose splitting predicate is qp. If n is the left child of p, then fn = fp ∧ qp; if n is the right child of p, then fn = fp ∧ ¬qp.

Since each leaf node n ∈ T is labeled with a class label, it encodes a classification rule fn → c, where c is the label of n. The tree T therefore defines a function dom(X1) × … × dom(Xm) → dom(C) and is a classifier, called a decision tree classifier. For a node n ∈ T with parent p, Fn is the set of records in D that follow the path from the root to n when being processed by the tree; formally, Fn = { t ∈ D : fn(t) is true }.

Here, impurity-based split selection methods are considered, which produce binary splits.
• An impurity-based split selection method calculates the splitting criterion by trying to minimize a concave impurity function imp.
• At each node, every predictor attribute X is examined, the impurity of the best split on X is calculated, and the final split is chosen such that the value of imp is minimized.

Let T be the final tree constructed using the split selection method CL on the training database D.
• Since D does not fit in memory, consider a sample D' ⊆ D such that D' fits in memory.
• Compute a sample tree T' from D'.
• Each node n ∈ T' has a sample splitting criterion consisting of a sample splitting attribute and a sample split point.
• We can use this knowledge of T' to guide the construction of T, our final goal (a sketch of split selection on the in-memory sample follows below).
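The slides give no code, but the split selection step at a single node can be sketched briefly. The Python below is a minimal illustration, not taken from the presentation: it assumes the Gini index as the concave impurity function imp, tuples represented as dicts, and illustrative field names (attr, label). It finds the binary split Xn ≤ xn minimizing the weighted impurity of the two partitions, which is essentially how a sample splitting criterion could be computed on the in-memory sample D'.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_numeric_split(tuples, attr, label):
    """Find the split point x minimizing the weighted Gini impurity of the
    binary split  attr <= x  over an in-memory list of dict-shaped tuples."""
    tuples = sorted(tuples, key=lambda t: t[attr])
    n = len(tuples)
    best_x, best_imp = None, float("inf")
    for i in range(1, n):
        # Candidate split points lie between consecutive distinct values.
        if tuples[i - 1][attr] == tuples[i][attr]:
            continue
        left = [t[label] for t in tuples[:i]]
        right = [t[label] for t in tuples[i:]]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / n
        if imp < best_imp:
            best_x, best_imp = tuples[i - 1][attr], imp
    return best_x, best_imp
```

For example, best_numeric_split([{"age": 23, "label": "no"}, {"age": 40, "label": "yes"}, {"age": 45, "label": "yes"}], "age", "label") returns (23, 0.0), since the split age ≤ 23 separates the two classes perfectly.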
Consider a node n in the sample tree T' with numerical sample splitting attribute Xn and sample splitting predicate Xn ≤ x.
• By "T' is close to T" we mean that the final splitting attribute at node n is Xn and that the final split point lies inside a confidence interval around x.
• For categorical attributes, both the splitting attribute and the splitting subset have to match.

Bootstrapping: the bootstrapping method can be applied to the in-memory sample D' to obtain a tree T' that is close to T with high probability.
• In addition to T', we also obtain the confidence intervals that contain the final split points for nodes with numerical splitting attributes.
• We call the information at node n obtained through bootstrapping the coarse splitting criterion at node n.

Suppose we have found the final splitting attribute at each node n, and also the confidence interval of attribute values that contains the final split point.
• To decide on the final split point, we need to examine the value of the impurity function only at the attribute values inside the confidence interval.
• If we had all the tuples that fall inside the confidence interval of n in memory, we could calculate the final split point exactly by evaluating the impurity function at these values only.

To bring these tuples into memory, we make one scan over D and keep in memory all tuples that fall inside the confidence interval at any node.
• We then post-process each node with a numerical splitting attribute to find the exact value of the split point, using the tuples collected during the database scan.
• This phase is called the clean-up phase (sketched below).
• The coarse splitting criterion at node n obtained from the sample D' through bootstrapping is only correct with high probability.

Whenever the coarse splitting criterion at n is not correct, we detect it during the clean-up phase and can take the necessary corrective action.
• Hence the method is guaranteed to find exactly the same tree as if a traditional main-memory algorithm had been run on the complete training set.
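As a rough illustration of the clean-up phase, the sketch below (again Python, reusing best_numeric_split from the previous sketch) streams each tuple down a coarse tree, buffers tuples whose attribute value falls inside a node's confidence interval, and then derives the exact split point from those buffered tuples. The Node fields (attr, split, lo, hi, S) are invented for the illustration, and the sketch leaves out the per-node class-histogram and bucket counts that the full algorithm also maintains during the scan.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """Hypothetical node of the coarse tree produced by the bootstrapping phase."""
    attr: str                                    # numerical splitting attribute Xn
    split: float                                 # coarse split point from T'
    lo: float                                    # confidence interval [lo, hi]
    hi: float                                    #   around the sample split point
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    S: List[dict] = field(default_factory=list)  # buffered tuples (the file Sn)
    is_leaf: bool = False

def cleanup_scan(D, root):
    """Single scan over D: stream each tuple down the coarse tree.  A tuple that
    falls inside a node's confidence interval is written to Sn at that node;
    otherwise it is routed further down using the coarse split point."""
    for t in D:
        node = root
        while node is not None and not node.is_leaf:
            v = t[node.attr]
            if node.lo <= v <= node.hi:
                node.S.append(t)   # exact routing depends on the final split point
                break
            node = node.left if v <= node.split else node.right

def fix_split_points(node, label="label"):
    """Post-processing: evaluate the impurity only at attribute values inside the
    confidence interval (here, over the buffered tuples Sn) to obtain the exact
    split point at each node with a numerical splitting attribute."""
    if node is None or node.is_leaf:
        return
    x, _ = best_numeric_split(node.S, node.attr, label)
    if x is not None:
        node.split = x
    fix_split_points(node.left, label)
    fix_split_points(node.right, label)
```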