
Mining High-Speed Data Streams


Presentation Transcript


  1. Mining High-Speed Data Streams Presented by: Yumou Wang Dongyun Zhang Hao Zhou

  2. Introduction • The world’s information is doubling every two years. • From 2006 to 2011, the amount of information grew by a factor of 9 in just five years.

  3. Introduction • By 2020 the world will generate 50 times the amount of information and 75 times the number of "information containers". • However, the IT staff available to manage it will grow less than 1.5 times. • Current algorithms can only deal with small amounts of data, often less than a single day's worth for many applications. • For example, banks and telecommunication companies.

  4. Introduction • Problem: when new examples arrive at a higher rate than they can be mined, the amount of unused data grows without bound as time progresses. • Dealing with these huge amounts of data in a responsible way is very important today. • Mining these continuous data streams brings unique opportunities, but also new challenges.

  5. Background • Design Criteria for mining High Speed Data Streams • It must be able to build a model using at most one scan of the data. • It must use only a fixed amount of main memory. • It must require small constant time per record.

  6. Background • Existing KDD systems can be applied to process examples incrementally as they arrive. • Shortcoming: the resulting models are highly sensitive to example ordering compared with models learned in batch mode. • Other systems can produce the same model as the batch version, but much more slowly.

  7. Classification Method • Input: examples of the form (x, y), where y is the class label and x is the vector of attributes. • Output: a model y = f(x) that predicts the classes y of future examples x with high accuracy.
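
A minimal sketch of this input/output contract (hypothetical names; Python is used for illustration only):

```python
from typing import Callable, Dict, Hashable, List, Tuple

# An example is a pair (x, y): x maps attribute names to values, y is the class label.
Attributes = Dict[str, Hashable]
Example = Tuple[Attributes, Hashable]

# A learned model is simply a function y = f(x).
Model = Callable[[Attributes], Hashable]

def model_accuracy(model: Model, test_set: List[Example]) -> float:
    """Fraction of test examples whose class label the model predicts correctly."""
    correct = sum(1 for x, y in test_set if model(x) == y)
    return correct / len(test_set)
```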

  8. Decision Tree • One of the most effective and widely-used classification methods. • A decision tree is a decision support tool that uses a tree-like graph or model.  • Decision trees are commonly used in machine learning.

  9. Building a Decision Tree • 1. Start at the root. • 2. Test all the attributes and choose the best one according to some heuristic measure. • 3. Split the node into branches and leaves. • 4. Recursively replace leaves by test nodes (see the sketch below).
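
A minimal batch-learning sketch of these four steps, using information gain as the heuristic measure (illustrative code, not the paper's):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attr):
    """Reduction in class entropy obtained by splitting on attr."""
    gain = entropy([y for _, y in examples])
    for value in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

def build_tree(examples, attributes):
    """Recursively grow a decision tree from an in-memory set of (x, y) examples."""
    labels = [y for _, y in examples]
    # Leaf case: the node is pure, or there is nothing left to split on.
    if len(set(labels)) == 1 or not attributes:
        return ("leaf", Counter(labels).most_common(1)[0][0])
    # Step 2: test all attributes and keep the best one by the heuristic measure.
    best = max(attributes, key=lambda a: information_gain(examples, a))
    # Steps 3-4: split on the best attribute and recurse into each branch.
    branches = {}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        branches[value] = build_tree(subset, [a for a in attributes if a != best])
    return ("node", best, branches)
```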

  10. Example of Decision Tree

  11. Example of Decision Tree

  12. Problems • Traditional decision tree learners have some problems. • Some of them assume that all training examples can be stored simultaneously in main memory. • Disadvantage: this limits the number of examples they can learn from. • Disk-based decision tree learners keep examples on disk and read them repeatedly. • Disadvantage: this is expensive when learning complex trees.

  13. Hoeffding Trees • Designed for extremely large datasets. • Main idea: find the best attribute at a given node by considering only a small subset of the training examples that pass through that node. • The key question is how many examples are sufficient.

  14. Hoeffding Bound • Definition: the statistical result used to decide how many examples n each node needs is called the Hoeffding bound. • Assume: R is the range of a real-valued random variable r, and we have n independent observations with observed mean r'. • With probability 1-δ, the true mean of r is at least r'-є, where є = √(R² ln(1/δ) / (2n)).

  15. Hoeffding Bound • є is a decreasing function of n: the larger n is, the smaller є becomes. • є bounds the difference between the true mean and the observed mean of r (see the sketch below).
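
A direct reading of the bound (R, δ and n as defined on the previous slide):

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)): with probability 1 - delta,
    the true mean of r lies within epsilon of the observed mean of n
    independent observations whose values span a range of size R."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# epsilon shrinks as n grows, so a node becomes more certain with more examples.
for n in (100, 1_000, 10_000):
    print(n, hoeffding_bound(value_range=1.0, delta=1e-7, n=n))
```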

  16. Hoeffding Tree Algorithm

  17. Hoeffding Tree Algorithm • Inputs: S -> is a sequence of examples, X -> is a set of discrete attributes, G(.) -> is a split evaluation function, δ -> is one minus the desired probability of choosing the correct attribute at any given node. • Outputs: HT -> is a decision tree.

  18. Hoeffding Tree Algorithm • Goal: • Ensure that, with high probability, the attribute chosen using n examples is the same as the one that would be chosen using infinitely many examples. • Let Xa be the attribute with the highest observed G' and Xb the attribute with the second highest, after seeing n examples. • Let ΔG' = G'(Xa) - G'(Xb). • Split when ΔG' > ϵ. • Thus a node needs to accumulate examples from the stream until ϵ becomes smaller than ΔG'.

  19. Hoeffding Tree Algorithm • The algorithm constructs the tree using the same procedure as ID3: it calculates the information gain of the attributes and determines the best one. • At each node it checks the condition ΔG' > ϵ. If the condition is satisfied, it creates child nodes based on the test at the node. • If not, it streams in more training examples and repeats the calculation until the condition is satisfied (a condensed sketch follows).
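
Putting slides 14-19 together, a condensed sketch of the induction loop (the `Leaf` class holding per-leaf sufficient statistics and the `sort_to_leaf` / `split_on` helpers are hypothetical; `hoeffding_bound` is the function sketched earlier; this is not the paper's exact pseudocode):

```python
def grow_hoeffding_tree(stream, attributes, G, delta, R):
    """Grow a Hoeffding tree HT from a stream of (x, y) examples.

    G     -- split evaluation function, e.g. information gain
    delta -- one minus the desired probability of choosing the correct attribute
    R     -- range of G (log2 of the number of classes for information gain)
    """
    root = Leaf(attributes)                     # HT starts as a single leaf
    for x, y in stream:
        leaf = sort_to_leaf(root, x)            # route the example down the tree
        leaf.update_statistics(x, y)            # per-leaf counts, O(dvc) memory
        if leaf.is_pure():
            continue                            # all examples seen here share one class
        scored = sorted(((G(leaf, a), a) for a in leaf.attributes), reverse=True)
        (g_a, best), (g_b, _) = scored[0], scored[1]
        epsilon = hoeffding_bound(R, delta, leaf.n_seen)
        if g_a - g_b > epsilon:                 # best attribute wins w.p. at least 1 - delta
            leaf.split_on(best)                 # replace the leaf with a test node
    return root
```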

  20. Hoeffding Tree Algorithm • Memory cost • d—number of attributes • c—number of classes • v—number of values per attribute • l—number of leaves in the tree • The memory cost for each leaf is O(dvc) • The memory cost for whole tree is O(ldvc)
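
A quick back-of-the-envelope instance of that O(ldvc) cost (illustrative numbers, not from the paper's experiments):

```python
d, v, c, l = 100, 10, 2, 10_000   # attributes, values per attribute, classes, leaves
per_leaf = d * v * c              # O(dvc) counters kept at each leaf
whole_tree = l * per_leaf         # O(ldvc) counters for the whole tree
print(per_leaf, whole_tree)       # 2,000 counters per leaf; 20,000,000 in total
```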

  21. Advantages of Hoeffding Tree • 1. Can deal with extremely large datasets. • 2. Each example is read at most once and processed in small constant time, which makes it possible to mine online data sources. • 3. Builds very complex trees with acceptable computational cost.

  22. VFDT—Very Fast Decision Tree • Breaking ties • Reduces wasted examples when the two best attributes have nearly identical G • Uses a tie threshold τ: split as soon as ϵ < τ • Periodic recomputation of G • A split decision is unlikely to change because of a single new example • Recomputing G only after a batch of new examples significantly reduces re-computation time • Memory cleanup • Measures how promising each leaf is • Deactivates the least promising leaves when memory runs out • Option of enabling reactivation later • (The split test with the tie threshold is sketched below.)
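
The tie-breaking rule can be written as one extra condition in the split test of the earlier sketch (τ is the user-specified tie threshold; a sketch, not the paper's exact code):

```python
def should_split(g_a, g_b, epsilon, tau):
    """Split when the best attribute clearly beats the runner-up, or when the two
    are so close (a tie) that waiting for more examples would be wasted effort."""
    return (g_a - g_b) > epsilon or epsilon < tau
```

With this rule a near-tie is resolved as soon as ϵ itself drops below τ, instead of waiting indefinitely for one attribute to pull ahead.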

  23. VFDT—Very Fast Decision Tree • Filtering out poor attributes • Clearly inferior attributes are dropped early • Reduces memory consumption • Initialization • Can be initialized with an existing tree built by another learner • Gives VFDT a head start • Rescans • Previously seen examples can be rescanned if the data rate allows it

  24. Tests—Configuration • 14 concepts • Generated by random decision trees • Number of leaves: 2.2k to 61k • Noise level: 0 to 30% • 50k examples for testing • Available memory: 40MB • Legacy processors

  25. Tests—Synthetic data

  26. Tests—Synthetic data

  27. Tests—Synthetic data

  28. Tests—Synthetic data

  29. Tests—Synthetic data

  30. Tests—Synthetic data • Time consumption • 20m examples • VFDT takes 5752s to read, 625s to process • 100k examples • C4.5 takes 36s • VFDT takes 47s

  31. Tests—parameters • W/ & w/o over-pruning

  32. Tests—parameters • W/ ties vs. w/o ties • 65 nodes vs. 8k nodes for VFDT • 805 nodes vs. 8k nodes for VFDT-boot • 72.9% vs. 86.9% for VFDT • 83.3% vs. 88.5% for VFDT-boot • vs. • VFDT: +1.1% accuracy, +3.8x time • VFDT-boot: -0.9% accuracy, +3.7x time • 5% more nodes

  33. Tests—parameters • 40MB vs. 80MB memory • 7.8k more nodes • VFDT: +3.0% accuracy • VFDT-boot: +3.2% accuracy • vs. • 30% less nodes • VFDT: +2.3% accuracy • VFDT-boot: +1.0% accuracy

  34. Tests—web data • Task: predicting web page accesses • 1.89m examples • 61.1% accuracy from always predicting the most common class • 276,230 examples held out for testing

  35. Tests—web data • Decision stump • 64.2% accuracy • 1277s to learn • C4.5 with 40MB memory • 74.5k examples • 2975s to learn • 73.3% accuracy • VFDT bootstrapped with C4.5 • 1.61m examples • 1450s to learn after initialization (983s to read)

  36. Tests—web data

  37. Mining Time-Changing Data Streams

  38. Why is VFDT not Enough? • VFDT assumes the training data is a sample drawn from a stationary distribution. • Most large databases or data streams violate this assumption. • Concept drift: data is generated by a time-changing concept function, e.g. • Seasonal effects • Economic cycles • Goal: • Mine continuously changing data streams • Scale well

  39. Why is VFDT not Enough? • Common approach: when a new example arrives, reapply a traditional learner to a sliding window of the w most recent examples. • Sensitive to window size: • If w is small relative to the concept shift rate, the model is assured to reflect the current concept. • But too small a w may leave too few examples to learn the concept. • If examples arrive at a rapid rate or the concept changes quickly, the computational cost of reapplying the learner may be prohibitively high (see the sketch below).
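
A naive rendering of that sliding-window approach, just to make the per-example cost explicit (a sketch; `retrain` stands in for any batch learner such as C4.5):

```python
from collections import deque

def sliding_window_learner(stream, w, retrain):
    """Keep only the w most recent examples and re-learn a batch model on every
    arrival: O(w) learning work per example, which is what becomes prohibitively
    expensive on fast streams or rapidly changing concepts."""
    window = deque(maxlen=w)
    for example in stream:
        window.append(example)           # the oldest example falls out automatically
        yield retrain(list(window))      # full batch re-learning on each new example
```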

  40. CVFDT • CVFDT (Concept-adapting Very Fast Decision Tree learner) • –Extend VFDT • –Maintain VFDT’s speed and accuracy • –Detect and respond to changes in the example-generating process

  41. CVFDT (contd.) • With a time-changing concept, the current splitting attribute of some nodes may no longer be the best. • An outdated subtree may still be better than the best single leaf, particularly if it is near the root. • Grow an alternate subtree with the new best attribute at its root when the old attribute seems out-of-date. • Periodically use a sample of recent examples to evaluate the quality of both trees (see the sketch after this slide). • Replace the old subtree when the alternate becomes more accurate.
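
In outline (hypothetical helper names, not the paper's exact pseudocode), the periodic evaluation step looks like this:

```python
def review_alternate_subtrees(tree, recent_examples):
    """Periodically compare each questionable subtree with the alternate grown
    alongside it, keeping whichever is more accurate on a sample of recent
    examples from the stream."""
    for node in tree.nodes_with_alternates():
        acc_current = accuracy(node.subtree, recent_examples)
        acc_alternate = accuracy(node.alternate, recent_examples)
        if acc_alternate > acc_current:
            node.replace_with(node.alternate)   # the new concept has taken over
        # otherwise keep growing the alternate and re-check at the next review
```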

  42. How CVFDT Works

  43. Example

  44. Sample Experiment Result

  45. Conclusion and Future Work • CVFDT is able to keep a decision tree up-to-date with a window of examples by using a small constant amount of time for each new example that arrives. • Empirical studies show that CVFDT is effectively able to keep its model up-to-date with a massive data stream even in the face of large and frequent concept shifts. • Future work: CVFDT currently discards subtrees that are out-of-date, but some concepts change periodically and these subtrees may become useful again; identifying these situations and taking advantage of them is an area for further study.

  46. Thank You
