
Data Stream Management Systems Checkpoint




Presentation Transcript


  1. Data Stream Management Systems Checkpoint
  CS240B Notes by Carlo Zaniolo, UCLA CSD
  With slides from a KDD04 tutorial by Haixun Wang, Jian Pei & Philip Yu

  2. Mining Data Streams: Challenges
  • On-line response (NB), limited memory, most recent windows only
  • Fast & light algorithms needed:
    • Must minimize usage of memory and CPU
    • Require only one (or a few) passes through the data
  • Concept shift/drift: the statistics of the mined data change over time
    • Previously learned models become inaccurate or invalid
    • Robustness and adaptability: quickly recover/adjust after concept changes
  • Popular machine learning algorithms are no longer effective:
    • Neural nets: slow learners requiring many passes
    • Support Vector Machines (SVM): computationally expensive
    • Apriori: many passes and expensive (association rule mining is difficult on data streams)

  3. The Decision Tree Classifier
  • Learning (training):
    • Input: a data set of pairs (a, b), where a is an attribute vector and b a class label
    • Output: a model (decision tree)
  • Testing:
    • Input: a test sample (x, ?)
    • Output: a class label prediction for x
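
To make this train/test interface concrete, here is a minimal sketch (not from the slides) using scikit-learn's DecisionTreeClassifier; the toy data is purely illustrative:

```python
# Minimal sketch of the train/test interface described above, using
# scikit-learn (an assumption: the slides do not prescribe a library).
from sklearn.tree import DecisionTreeClassifier

# Training: pairs (a, b) -- attribute vectors and class labels.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_train = ["neg", "pos", "pos", "neg"]

model = DecisionTreeClassifier()   # the learned model is a decision tree
model.fit(X_train, y_train)

# Testing: a sample (x, ?) -- the label is unknown and gets predicted.
x_test = [[1, 1]]
print(model.predict(x_test))       # -> ['neg']
```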

  4. Decision Tree Classifiers
  • A divide-and-conquer approach
    • Simple algorithm, intuitive model
  • Typically a decision tree grows one level for each scan of the data
    • Multiple scans are required
    • But if we can use small samples, this problem disappears
  • But the data structure is not ‘stable’
    • Subtle changes in the data can cause global changes in the tree structure

  5. Stable Trees Using Samples
  How many samples do we need to build, in constant time, a tree that is nearly identical to the one a batch learner (C4.5, Sprint, ...) would build?
  Nearly identical?
  • Categorical attributes:
    • with high probability, the attribute we choose for the split is the same attribute that would be chosen by a batch learner
    • hence an identical decision tree
  • Continuous attributes:
    • discretize them into categorical ones
  ...Forget concept changes for now

  6. Hoeffding Trees
  • The Hoeffding bound is applied to the information gain
    • The error decreases as n (# of samples) increases
    • At each node, we accumulate enough samples (n) before we make a split (see the sketch below)
  • Scales better than traditional DT algorithms
    • Incremental: nodes are created incrementally as new samples stream in
    • Sub-linear with sampling
    • Small memory requirement
  • Cons:
    • Only considers the top two attributes
    • Tie breaking takes time
    • Growing a deep tree takes time
    • Discrete attributes only
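
The split test can be sketched as follows, assuming the standard Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)) for a statistic with range R observed over n samples; the function names and the δ value are illustrative, not from the slides:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean
    observed over n samples."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(gain_best, gain_second, n, value_range=1.0, delta=1e-7):
    """Split when the observed gap between the best and second-best
    attributes' information gains exceeds epsilon: then, with high
    probability, the chosen attribute matches a batch learner's choice."""
    epsilon = hoeffding_bound(value_range, delta, n)
    return (gain_best - gain_second) > epsilon

# Example: after n = 5000 samples at a node, gains of 0.30 vs. 0.25.
print(should_split(0.30, 0.25, 5000))  # True: the 0.05 gap beats epsilon ~ 0.04
```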

  7. VFDT
  • Very Fast Decision Tree [Domingos, Hulten 2000]
  • Several improvements: faster and less memory
  • Concept changes? A naïve approach (sketched below):
    • Place a sliding window on the stream
    • Reapply C4.5 or VFDT whenever the window moves
    • Time consuming!
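
The naïve window-based approach can be sketched as follows; this illustrates why it is time consuming, and is not VFDT itself. The window size and the choice of batch learner are assumptions:

```python
from collections import deque
from sklearn.tree import DecisionTreeClassifier

# Naive approach: keep the last W examples and retrain a batch
# learner from scratch every time the window slides.
W = 1000
window = deque(maxlen=W)

def on_new_example(x, y):
    window.append((x, y))
    X = [a for a, _ in window]
    Y = [b for _, b in window]
    return DecisionTreeClassifier().fit(X, Y)   # full retrain: expensive!
```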

  8. CVFDT
  • Concept-adapting VFDT [Hulten, Spencer, Domingos, 2001]
  • Goal:
    • Classifying concept-drifting data streams
  • Approach:
    • Make use of the Hoeffding bound
    • Incorporate “windowing”
    • Monitor changes in the information gain of attributes
    • If the change reaches a threshold, generate an alternate subtree with the new “best” attribute, but keep it in the background
    • Replace the old subtree if the new one becomes more accurate
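
A hypothetical sketch of the alternate-subtree bookkeeping; the names and structure are illustrative only, and the actual algorithm is in the CVFDT paper cited above:

```python
class Node:
    """Illustrative tree node that can carry a background alternate."""
    def __init__(self, split_attr):
        self.split_attr = split_attr
        self.alternate = None   # background subtree grown on the new best attribute
        self.errors = 0         # recent mistakes of this subtree
        self.alt_errors = 0     # recent mistakes of the alternate

def monitor(node, new_best_attr, gain_change, threshold):
    # When the gain statistics drift past the threshold, start growing
    # an alternate subtree in the background.
    if gain_change > threshold and node.alternate is None:
        node.alternate = Node(new_best_attr)

def maybe_replace(node):
    # Promote the alternate once it is the more accurate of the two.
    if node.alternate is not None and node.alt_errors < node.errors:
        return node.alternate   # caller splices it in place of `node`
    return node
```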

  9. Classifiers for Data Streams
  • Fast and light classifiers:
    • Naïve Bayesian: one pass to count occurrences (sketched below)
    • Sliding windows, tumbles and slides
    • Adaptive Nearest Neighbor Classification Algorithm (ANNCAD)
  • Ensembles of classifiers (decision trees or others):
    • Bagging ensembles
    • Boosting ensembles
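
A one-pass Naïve Bayesian counter might look as follows; this is an illustrative sketch assuming categorical attributes and Laplace smoothing, not code from the course:

```python
from collections import defaultdict

class StreamingNaiveBayes:
    """One pass over the stream suffices: only counts are stored."""
    def __init__(self):
        self.class_counts = defaultdict(int)   # N(c)
        self.attr_counts = defaultdict(int)    # N(attribute i = v, class c)
        self.total = 0

    def update(self, x, y):
        # Single-pass counting: one increment per attribute value.
        self.total += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.attr_counts[(i, v, y)] += 1

    def predict(self, x):
        def score(c):
            s = self.class_counts[c] / self.total   # prior P(c)
            for i, v in enumerate(x):
                # Laplace-smoothed conditional P(x_i = v | c)
                s *= (self.attr_counts[(i, v, c)] + 1) / (self.class_counts[c] + 2)
            return s
        return max(self.class_counts, key=score)
```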

  10. Basic Ideas
  • The stream is partitioned into sequential chunks
  • A classifier is trained from each chunk
  • The accuracy of voting ensembles is normally better than that of a single classifier
  • Method 1: Bagging
    • Weighted voting: weights are assigned to classifiers based on their recent performance on the current test examples (sketched below)
    • Only the top K classifiers are used
  • Method 2: Boosting
    • Majority voting
    • Classifiers are retired by age
    • Boosting is used in training
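
Weighted voting over the ensemble can be sketched as follows; the classifier interface (sklearn-style `predict`) and the top-K selection are assumptions:

```python
from collections import defaultdict

def weighted_vote(classifiers, weights, x, top_k=10):
    """Each classifier votes for a class; votes are weighted by recent
    accuracy, and only the top_k classifiers by weight participate."""
    ranked = sorted(zip(classifiers, weights), key=lambda cw: -cw[1])[:top_k]
    tally = defaultdict(float)
    for clf, w in ranked:
        tally[clf.predict([x])[0]] += w
    return max(tally, key=tally.get)
```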

  11. Bagging Ensemble Method

  12. Mining Streams with Concept Changes
  • Changes are detected by a drop in accuracy, or by other methods
  • Build new classifiers on new windows
  • Search among the old classifiers for those that have now become accurate again

  13. Boosting Ensembles for Adaptive Mining of Data Streams
  Andrea Fang Chu, Carlo Zaniolo [PAKDD 2004]

  14. Mining Data Streams: Desiderata
  • Fast learning (preferably in one pass over the data)
  • Light requirements (low time complexity, low memory requirements)
  • Adaptation (the model always reflects the time-changing concept)

  15. Adaptive Boosting Ensembles
  • The training stream is split into blocks (i.e., windows)
  • Each individual classifier is learned from one block
  • A boosting ensemble of 7–19 members is maintained over time
  • Decisions are taken by simple majority
  • As the (N+1)-th classifier is built, boost the weights of the tuples misclassified by the first N (see the sketch below)
  • Change detection is used to achieve adaptation
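
The weight-boosting step might be sketched as follows; the ensemble interface and the boost factor are assumptions, not the paper's exact scheme:

```python
from collections import Counter

def majority_vote(ensemble, x):
    # Simple majority vote of the current ensemble members.
    return Counter(clf.predict([x])[0] for clf in ensemble).most_common(1)[0][0]

def boost_weights(ensemble, block, weights, factor=2.0):
    """Boost the weights of tuples in the new block that the existing
    N-member ensemble misclassifies, so that the (N+1)-th classifier
    focuses on them (the boost factor is illustrative)."""
    for i, (x, y) in enumerate(block):
        if majority_vote(ensemble, x) != y:   # misclassified by the first N
            weights[i] *= factor
    total = sum(weights)
    return [w / total for w in weights]       # renormalize
```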

  16. Fast and Light
  • Experiments show that boosting ensembles of “weak learners” provide accurate predictions
  • Weak learners:
    • An aggressively pruned decision tree, e.g., a shallow tree (this means fast!)
    • Trained on a small set of examples (this means light in memory requirements!)

  17. Adaptation
  • Detect changes that cause significant drops in ensemble performance:
    • gradual changes: concept drift
    • abrupt changes: concept shift

  18. Adaptability
  • The error rate is viewed as a random variable
  • When it rises significantly above the recent average, the whole ensemble is dropped (see the detector sketch below)
  • And a new one is quickly re-learned
  • The cost/performance of boosting ensembles is better than that of bagging ensembles [KDD04]
  • BUT ???
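
A minimal sketch of such a detector, treating the error rate as a random variable; the window size, warm-up length, and threshold k are assumptions:

```python
from collections import deque

class ErrorRateMonitor:
    """Flag a change when the recent error rate rises well above its
    running average over a sliding window (parameters illustrative)."""
    def __init__(self, window=100, k=3.0):
        self.errors = deque(maxlen=window)   # 1 = misclassified, 0 = correct
        self.k = k

    def observe(self, mistake):
        self.errors.append(1 if mistake else 0)
        n = len(self.errors)
        if n < 30:
            return False                      # not enough evidence yet
        mean = sum(self.errors) / n
        recent = sum(list(self.errors)[-10:]) / 10
        std = (mean * (1 - mean) / n) ** 0.5  # Bernoulli std error of the mean
        return recent > mean + self.k * std   # significant rise -> re-learn
```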

  19. References
  • Haixun Wang, Wei Fan, Philip S. Yu, Jiawei Han. Mining Concept-Drifting Data Streams Using Ensemble Classifiers. ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2003.
  • Pedro Domingos, Geoff Hulten. Mining High-Speed Data Streams. ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2000.
  • Geoff Hulten, Laurie Spencer, Pedro Domingos. Mining Time-Changing Data Streams. ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2001.
  • Wei Fan, Yi-an Huang, Haixun Wang, Philip S. Yu. Active Mining of Data Streams. SIAM International Conference on Data Mining (SIAM DM), 2004.
  • Fang Chu, Yizhou Wang, Carlo Zaniolo. An Adaptive Learning Approach for Noisy Data Streams. 4th IEEE International Conference on Data Mining (ICDM), 2004.
  • Fang Chu, Carlo Zaniolo. Fast and Light Boosting for Adaptive Mining of Data Streams. PAKDD 2004: 282-292.
  • Yan-Nei Law, Carlo Zaniolo. An Adaptive Nearest Neighbor Classification Algorithm for Data Streams. ECML/PKDD 2005, Porto, Portugal, October 3-7, 2005.
