This work presents a unified framework supporting the interactive exploration of density-based clusters in streaming windows. The concept of density-based clusters, cluster detection in sliding windows, and pattern-specific window templates for density-based clustering queries are discussed. The review of existing algorithms and the introduction of the proposed IWIN solution for efficient cluster structure maintenance are highlighted. The importance of evolution semantics in cluster analysis and the benefits of an integrated maintenance approach are emphasized.
A Unified Framework Supporting Interactive Exploration of Density-Based Clusters In Streaming Windows Di Yang, Zhengyu Guo, Elke A. Rundensteiner and Matthew O. Ward Worcester Polytechnic Institute EDBT 2010, Submitted This work is supported under NSF grants CCF-0811510, IIS-0119276, IIS-0414380.
What are Density-Based Clusters? • Clusters that are defined by individual data points (tuples) and their local “neighborhood”. • How are they different from K-median style clustering? [Figure: example point sets partitioned into clusters labeled Cluster 1 to Cluster 4.]
Formal Definition • Core Object: has more than θ^cnt neighbors within distance θ^range from it. • Edge Object: not a core object but a neighbor of a core object. • Noise: not a core object and not a neighbor of any core object. • A Density-Based Cluster (DB-Cluster) is a maximal group of connected core objects and the edge objects attached to them. [Figure: example data points numbered 1 to 17 illustrating core objects, edge objects, and noise.]
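The definitions above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the algorithm from the talk: the function name `db_clusters`, the Euclidean distance, and the reading of "more than θ^cnt neighbors" as a strict inequality are assumptions made for the example.

```python
from math import dist

def db_clusters(points, theta_range, theta_cnt):
    """Return (core, clusters, noise) as sets of point indices.

    Assumptions for this sketch: Euclidean distance, a point is not its own
    neighbor, and a core object has strictly more than theta_cnt neighbors.
    """
    n = len(points)
    # Neighbor lists: all other points within theta_range (O(n^2) brute force).
    nbrs = [
        [j for j in range(n) if j != i and dist(points[i], points[j]) <= theta_range]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(nbrs[i]) > theta_cnt}

    clusters, seen = [], set()
    for start in sorted(core):
        if start in seen:
            continue
        # Grow one cluster: connected core objects plus their attached edge objects.
        members, stack = set(), [start]
        seen.add(start)
        while stack:
            i = stack.pop()
            members.add(i)
            for j in nbrs[i]:
                if j in core:
                    if j not in seen:
                        seen.add(j)
                        stack.append(j)
                else:
                    members.add(j)  # edge object: neighbor of a core object
        clusters.append(members)

    clustered = set().union(*clusters) if clusters else set()
    noise = set(range(n)) - clustered
    return core, clusters, noise

points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (10, 10)]
core, clusters, noise = db_clusters(points, theta_range=1.0, theta_cnt=1)
```

In this toy data set the four corner points of the unit square are core objects, the point (2, 1) is an edge object attached to them, and (10, 10) is noise.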
Cluster Detection in Sliding Windows [Figure: two consecutive sliding windows, W1 and W2, over a stream of data points numbered 1 to 8.] Density-Based Clustering Query Over Sliding Windows: a query template with a pattern-specific part and a window-specific part.
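Application Examples: • Stock market: are there intensive-transaction areas in the last hour of transactions? Transaction info streams in from the stock market, and the detected clusters are reported to stock analysts. • Battlefield: where are the main clusters formed by enemy warcraft? Position info streams in from the battlefield, and the detected clusters are reported to the commander.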
State of the Art • Existing algorithms for density-based clustering queries over sliding windows include Incremental DBSCAN, Exact-N, Abstract-C and Extra-N [Ester98] [Yang09]. • Extra-N becomes inefficient as the win/slide ratio increases. • No evolution semantics have been defined for how density-based clusters change over time. • No existing system allows interactive exploration of density-based clusters in streaming windows.
Goals • A more efficient density-based clustering algorithm over streams. • An evolution semantics that intuitively explains cluster changes. • A visualized pattern space allowing interactive exploration of clusters.
Review: existing algorithm, Extra-N • Options in highly dynamic streaming environments: • Re-computation. • Incremental cluster maintenance. • Extra-N [Yang09] proposed a hybrid neighbor relationship (neighborship) mechanism to represent cluster structure. • Maintain “Exact Neighborships” (neighbor lists) for non-core objects. • Maintain “Abstract Neighborships” (cluster memberships) for core objects. • A general concept of “Predicted View” is applied to efficiently update the cluster structure. • Key: a compact and easily maintainable cluster representation.
Concept of Predicted Views [Figure: the Current View of W0 and the Predicted Views of W1, W2 and W3, each showing the data points that will still be alive in that window. Window size = 16, slide size = 4, time = 1; the timeline shows points 1 to 16 across windows W0 to W3.]
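Which views a data point participates in follows directly from the window and slide sizes. The sketch below assumes a count-based window in which window W_k covers arrival positions k*slide + 1 through k*slide + win; the function name `windows_containing` is hypothetical and only illustrates the idea.

```python
def windows_containing(pos, win, slide):
    """Indices k of the windows W_k that contain the point at 1-based position pos.

    Assumption for this sketch: W_k covers positions k*slide + 1 .. k*slide + win.
    """
    # Smallest k with k*slide + win >= pos, i.e. ceil((pos - win) / slide), clamped at 0.
    first = max(0, -((win - pos) // slide))
    # Largest k with k*slide + 1 <= pos.
    last = (pos - 1) // slide
    return list(range(first, last + 1))

# Settings taken from the slide: window size 16, slide size 4.
for pos in (1, 5, 16, 17):
    print(pos, windows_containing(pos, win=16, slide=4))
```

With these settings, point 1 appears only in W0, while point 16 appears in W0 through W3, so it must be accounted for in the current view and in every predicted view.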
Update Predicted Views [Figure: after the window slides, the view of W0 expires, the view of W1 becomes the current view, the predicted views of W2 and W3 are updated with the new data points 17 to 20, and a predicted view of W4 is created. Window size = 16, slide size = 4, time = 1; the timeline shows points 5 to 20 across windows W1 to W4.]
Inefficiency of Extra-N • When the win/slide ratio increases (for example, win = 10000 and slide = 10), a large number of predicted views must be maintained independently. • This places a heavy burden on both CPU and memory resources.
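A quick calculation shows the scale of the problem. The sketch assumes that one view is kept per window a live data point can fall into, which gives win/slide views (rounded up); the helper name `num_views` is hypothetical.

```python
def num_views(win, slide):
    """Number of views (current plus predicted) kept at any time.

    Assumption for this sketch: one view per window that a live point can
    belong to, i.e. ceil(win / slide).
    """
    return -(-win // slide)  # integer ceiling division

small = num_views(16, 4)       # the setting used in the predicted-view figures
large = num_views(10000, 10)   # the example from this slide
print(small, large)
```

Moving from the figure's setting to win = 10000 and slide = 10 raises the count from 4 views to 1000, each of which is maintained on its own under independent maintenance.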
Proposed Solution: IWIN • Is there any relationship among the cluster sets identified in the different views?
“Growth Property” among DB-cluster Sets • If every cluster Ci in Clu_Set1 is “contained” in some cluster in Clu_Set2, then Clu_Set2 is a “Growth” of Clu_Set1. [Figure: clusters c4, c5 and c6 growing from one cluster set to the next, stored either as an Independent Cluster Structure Storage or as a Hierarchical Cluster Structure Storage.]
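The growth relation can be checked mechanically once clusters are represented as sets. The sketch below assumes that "contained" means every member of the smaller cluster is also a member of the larger one; the function name `is_growth` is hypothetical.

```python
def is_growth(clu_set1, clu_set2):
    """True if clu_set2 is a "growth" of clu_set1.

    Assumption for this sketch: a cluster is a set of object ids, and cluster A
    is "contained" in cluster B when A is a subset of B.
    """
    return all(any(c1 <= c2 for c2 in clu_set2) for c1 in clu_set1)

clu_set1 = [{1, 2}, {3}]
clu_set2 = [{1, 2, 3, 4}, {5}]

forward = is_growth(clu_set1, clu_set2)   # every small cluster fits inside {1, 2, 3, 4}
backward = is_growth(clu_set2, clu_set1)  # {1, 2, 3, 4} fits in no cluster of clu_set1
```

When the relation holds, the smaller clusters can be stored as children of the clusters that contain them, which is the intuition behind a hierarchical rather than independent storage.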
Integrated vs. Independent Maintenance of Predicted Views • IWIN: integrated maintenance. • Extra-N: independent maintenance.
Benefits of Integrated Maintenance • Benefits for Memory Resources: the memory needed to store the cluster sets identified by the multiple queries in QG is independent of |QG|. • Benefits for Computational Resources: the multiple cluster sets stored in the hierarchical cluster structure (which are usually similar) can be maintained incrementally, rather than independently. • IWIN outperforms Extra-N in both CPU and memory utilization.
Goals • A more efficient density-based clustering algorithm over streams. • An evolution semantics that intuitively explains cluster changes. • A visualized pattern space allowing interactive exploration of clusters.
Why do we need evolution semantics? • Analysts need to know how clusters change over time. • This is hard to observe by looking at the clusters alone (even with visualization). • Example questions from a commander: History: did any clusters merge? Now: are there any new clusters? Future: is any cluster about to break apart?
Proposed Semantics • Single Step Evolutions: • birth • termination • split • merge • preserve/expand/shrink • Multi Step Evolutions: • split-expand • split-merge • shrink-split
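One way to picture the single-step evolutions is to compare the cluster sets of two consecutive windows by their shared members. The sketch below is only an illustration under assumed definitions, not the semantics as formally defined in the work: clusters are sets of object ids, two clusters are related when they share at least one member, and the function name `single_step_evolutions` is hypothetical.

```python
def single_step_evolutions(old, new):
    """Label changes between two lists of clusters (sets of object ids).

    Assumed definitions for this sketch: clusters are related when they share
    a member; an old cluster related to several new ones splits; a new cluster
    related to several old ones is a merge; one-to-one pairs are compared by
    set containment.
    """
    events = []
    for i, o in enumerate(old):
        succ = [j for j, n in enumerate(new) if o & n]
        if not succ:
            events.append(("termination", i, None))
        elif len(succ) > 1:
            events.append(("split", i, tuple(succ)))
    for j, n in enumerate(new):
        pred = [i for i, o in enumerate(old) if o & n]
        if not pred:
            events.append(("birth", None, j))
        elif len(pred) > 1:
            events.append(("merge", tuple(pred), j))
        else:
            i = pred[0]
            o = old[i]
            # Only label one-to-one pairs here; splits were handled above.
            if sum(1 for m in new if o & m) == 1:
                if n == o:
                    events.append(("preserve", i, j))
                elif n > o:
                    events.append(("expand", i, j))
                elif n < o:
                    events.append(("shrink", i, j))
                else:
                    events.append(("other", i, j))  # gained and lost members
    return events

old = [{1, 2, 3}, {4, 5}, {6, 7, 8, 9}, {20}]
new = [{1, 2, 3}, {4, 5, 10}, {6, 7}, {8, 9}, {30}]
events = single_step_evolutions(old, new)
```

Here the first cluster is preserved, the second expands, the third splits in two, the fourth terminates, and the cluster containing object 30 is born. Multi-step evolutions such as split-expand would chain these labels across successive windows.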
How to Compute • Extract Predicted Evolution (before window slide) • Update Evolution (after window slide) [Figure: example clusters whose evolutions are labeled split, preserve and shrink.]
Conclusion for Proposed Semantics • Intuitively describes cluster evolution over time. • Easily maintainable: can be computed on-the-fly during cluster maintenance.
Goals • A more efficient density-based clustering algorithm over streams. • An evolution semantics that intuitively explains cluster changes. • A visualized pattern space allowing interactive exploration of clusters.
Outline • What is Neighbor-Based Pattern Detection • State-of-Art • Potential Solutions & Their Inefficiency • Proposed Solution: Extra-N • Experimental Study • Conclusion
Why needed? • Analysts need to navigate along the time axis to learn the current state, review the history, and predict the near future. • Example: how are the two clusters in the current window related to those detected 30 minutes ago? • Analysts need to study the clusters and their evolution at different abstraction levels. • Example: for routine traffic monitoring, only the positions of major clusters need to be reported; when an accident happens, specific information about cluster members needs to be reported.
Evaluation for IWIN • Alternative Methods: • Incremental DBSCAN [Ester98] • Extra-N [Yang09] • IWIN • Real Streaming Data: • GMTI data recording information about moving vehicles [Mitre08]. • STT data recording stock transactions from NYSE [INETATS08]. • Measurements: • Average processing time for each tuple. • Memory footprint.
Conclusion • Presented the first unified framework supporting interactive exploration of density-based clusters in streaming windows. • Designed a more efficient density-based clustering algorithm, IWIN. • Defined the first evolution semantics for density-based clusters. • Our experimental study confirms both the efficiency and effectiveness of the proposed framework.
Future work • Support multiple queries. • Support other pattern types, such as outliers, association rules… • Support pattern storage and matching. • More?
The End Thanks