Density based Clustering


Presentation Transcript


  1. Density based Clustering Anushree Garg, Krithika Chandramouli

  2. Types of Clustering algorithms • Partitioning based • K-Means, K-Medoids • Hierarchical based • BIRCH, Chameleon • Density based • DBScan, DenCLUE, D-Stream • Grid Based • STING, WaveCluster

  3. DBScan “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise” - Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu • Density-Based • Used to discover clusters of arbitrary shape • Requires only minimal domain knowledge

  4. Definitions • Core Point • A point having at least MinPts points in its Eps-neighborhood • Boundary Point • Not a core point, but lies in the Eps-neighborhood of a core point • Directly Density Reachable • Point p is directly density reachable from q if q is a core point and p is in the Eps-neighborhood of q • Density Reachable • Point p is density reachable from q if there is a chain of points p1, …, pn with p1 = q and pn = p such that p(i+1) is directly density reachable from p(i) • Density Connected • Points p and q are density connected if there is a point o such that both p and q are density reachable from o
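
These definitions translate almost directly into code. The sketch below is a minimal Python illustration; the names region_query and is_core_point are ours, and a brute-force distance scan is assumed in place of the spatial index used in the original paper.

```python
import numpy as np

def region_query(points, p_idx, eps):
    """Return the indices of all points within distance eps of
    points[p_idx] (its Eps-neighborhood, including the point itself)."""
    dists = np.linalg.norm(points - points[p_idx], axis=1)
    return np.where(dists <= eps)[0]

def is_core_point(points, p_idx, eps, min_pts):
    """A core point has at least min_pts points in its Eps-neighborhood."""
    return len(region_query(points, p_idx, eps)) >= min_pts
```

A boundary point is then a non-core point that appears in the Eps-neighborhood of some core point; everything else is noise.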

  5. Algorithm • Start with an arbitrary point p • Retrieve all points density-reachable from p • If p is a core point, this yields a cluster • If p is a border point, no points are density-reachable from p and the next point of the database is visited • Repeat the process until all points have been visited
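
A compact, self-contained sketch of this procedure is shown below. The parameter names eps and min_pts are ours, and the paper's R*-tree region queries are replaced by a brute-force neighborhood search for clarity.

```python
import numpy as np

UNVISITED, NOISE = -2, -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    labels = np.full(n, UNVISITED)

    def neighbors(i):
        # brute-force Eps-neighborhood query (includes the point itself)
        return np.where(np.linalg.norm(points - points[i], axis=1) <= eps)[0]

    cluster_id = 0
    for p in range(n):
        if labels[p] != UNVISITED:
            continue                          # already assigned or noise
        seeds = neighbors(p)
        if len(seeds) < min_pts:              # p is not a core point
            labels[p] = NOISE                 # may later become a border point
            continue
        labels[p] = cluster_id                # p is a core point: new cluster
        queue = list(seeds)
        while queue:                          # expand the cluster
            q = queue.pop()
            if labels[q] == NOISE:
                labels[q] = cluster_id        # border point joins the cluster
            if labels[q] != UNVISITED:
                continue
            labels[q] = cluster_id
            q_neighbors = neighbors(q)
            if len(q_neighbors) >= min_pts:   # q is also a core point: keep expanding
                queue.extend(q_neighbors)
        cluster_id += 1
    return labels
```

For example, dbscan(points, eps=0.1, min_pts=5) on a 2-D point array returns one label per point, with -1 marking noise.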

  6. Performance Evaluation: CLARANS vs. DBScan

  7. Evaluation (Cont.)

  8. Conclusion (DBScan) • A density-based clustering method • Can effectively find arbitrarily shaped clusters • Does not require extensive domain knowledge

  9. Denclue “An Efficient Approach to Clustering in Large Multimedia Databases with Noise” - Alexander Hinneburg, Daniel A. Keim • Density-based clustering • Uses an influence function • Handles large amounts of noise

  10. Idea • Each data point has an influence that extends over a range, modelled by an influence function • Summing the influence functions of all data points yields the overall density function

  11. Influence functions
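
The two influence functions discussed in the DENCLUE paper are a square-wave function and a Gaussian function, both with smoothing parameter σ; the density function is their sum over the data set D. In roughly the paper's notation:

```latex
f_{\mathrm{Square}}(x, y) =
  \begin{cases}
    0 & \text{if } d(x, y) > \sigma,\\
    1 & \text{otherwise,}
  \end{cases}
\qquad
f_{\mathrm{Gauss}}(x, y) = e^{-\frac{d(x, y)^{2}}{2\sigma^{2}}},
\qquad
f^{D}(x) = \sum_{i=1}^{N} f(x, x_{i}).
```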

  12. Definitions • Density Attractor x* • Local maximum of the density function • Density attracted points • Points from which a path to x* exists for which the gradient is continuously positive

  13. Center-Defined Clusters • All points that are density-attracted to a given density attractor x* • The density function at the maximum must exceed the noise threshold ξ • Points that are attracted to smaller maxima are considered outliers
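
Paraphrasing the paper's definition, the center-defined cluster belonging to a density attractor x* (with noise threshold ξ) can be written as

```latex
C(x^{*}) \;=\; \{\, x \in D \mid x \text{ is density-attracted to } x^{*} \ \text{and}\ f^{D}(x^{*}) \ge \xi \,\},
```

while points attracted to a local maximum whose density lies below ξ are treated as outliers.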

  14. Arbitrary-Shape Clusters Merges center-defined clusters if a path exists between their density attractors along which the density function continuously exceeds ξ

  15. Algorithm • Step 1: Construct a map of the data points • Uses hypercubes with edge length 2σ • Only populated cubes are saved • Step 2: Determine the density attractors of all points using hill-climbing • Keeps track of the paths that have been taken and of points close to them

  16. Step 1: Constructing the map • Hypercubes contain • Number of data points • Pointers to data points • Sum of data values (for the mean) • Populated hypercubes are saved in a B+-tree
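
A minimal sketch of this map-construction step, using a plain Python dictionary in place of the B+-tree mentioned on the slide; cube keys are the integer coordinates obtained by dividing each point by the edge length 2σ.

```python
import numpy as np
from collections import defaultdict

def build_cube_map(points, sigma):
    """Assign each point to a hypercube of edge length 2*sigma and store,
    per populated cube: the point count, the point indices, and the
    coordinate sum (from which the cube's mean can be derived)."""
    points = np.asarray(points, dtype=float)
    edge = 2.0 * sigma
    cubes = defaultdict(lambda: {"count": 0, "indices": [], "sum": None})
    for i, x in enumerate(points):
        key = tuple(np.floor(x / edge).astype(int))   # cube coordinates
        cube = cubes[key]
        cube["count"] += 1
        cube["indices"].append(i)
        cube["sum"] = x.copy() if cube["sum"] is None else cube["sum"] + x
    return dict(cubes)
```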

  17. Step 2: Clustering Step • Uses only highly populated cubes and cubes that are connected to them • Hill-climbing based on the local density function and its gradient • Points within σ/2 of a hill-climbing path are attached to the corresponding cluster as well
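
A sketch of the hill-climbing step for the Gaussian influence function; the step size delta, the tolerance tol, and the stopping rule below are simplifications of ours rather than the paper's exact procedure.

```python
import numpy as np

def gaussian_density(x, points, sigma):
    """Density at x: sum of the Gaussian influence of all data points."""
    d2 = np.sum((points - x) ** 2, axis=1)
    return np.sum(np.exp(-d2 / (2 * sigma ** 2)))

def density_gradient(x, points, sigma):
    """Gradient of the Gaussian density function at x."""
    diff = points - x
    w = np.exp(-np.sum(diff ** 2, axis=1) / (2 * sigma ** 2))
    return np.sum(diff * w[:, None], axis=0)

def hill_climb(x, points, sigma, delta=0.1, tol=1e-4, max_steps=200):
    """Follow the density gradient from x to a local maximum
    (x's density attractor)."""
    x = np.asarray(x, dtype=float)
    points = np.asarray(points, dtype=float)
    for _ in range(max_steps):
        grad = density_gradient(x, points, sigma)
        norm = np.linalg.norm(grad)
        if norm < tol:
            break                           # at (or very near) a maximum
        x_next = x + delta * grad / norm    # fixed-length uphill step
        if gaussian_density(x_next, points, sigma) <= gaussian_density(x, points, sigma):
            break                           # no further improvement
        x = x_next
    return x
```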

  18. Time Complexity / Efficiency • Worst case, for N data points • O(N log(N)) • Average case • O(log(N)) • Explanation: Only highly populated areas are considered • Up to 45 times faster than DBSCAN

  19. Experimental Evaluation

  20. Comparison with DBSCAN • Corresponding setup • The square-wave influence function with radius σ models the Eps-neighborhood in DBSCAN • The MinPts threshold for core objects in DBSCAN corresponds to ξ • Density reachable in DBSCAN becomes density attracted in DENCLUE

  21. Conclusion (DenClue) • DenClue is faster than most other algorithms • Efficient data structure • Designed for large multimedia databases • Works well with a large number of outliers

  22. D-STREAM • Chen, Yixin, and Li Tu. "Density-based clustering for real-time stream data." Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2007.

  23. Data stream clustering • Clustering high-dimensional streams in real time is a challenging task • Massive volumes of raw data arrive in real time and can be scanned only once • Applications – stocks, weather monitoring, …

  24. Clustering algorithms – then vs now • Then • Use a single-phase model • Treat data stream clustering as a continuous version of static clustering • Divide and conquer • Weigh outdated and recent data equally • Don't capture the evolving characteristics of the data • CluStream: a two-phase framework • Offline component based on k-means – identifies spherical clusters, not arbitrary ones • Requires multiple scans of the data

  25. Clustering algorithms – then vs now • Now • D-Stream is density based • Doesn't treat the data stream as a long sequence of static data • Captures the dynamism of the stream through a decay factor • Doesn't require the user to specify the number of clusters • Discretizes the data space into grids – new data is mapped onto these grids

  26. The D-stream algorithm • Key features of the algorithm • The timestamp of each data point is labelled by an integer • Online component + offline component • Online component • Reads the incoming data record • Places this multi-dimensional record into the appropriate density grid • Updates the characteristic vector of the grid • Offline component • Dynamically adjusts the clusters during the time gap (the time between arrivals of data) • Clusters are regulated periodically
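
A minimal sketch of the online component under a few assumptions of ours: each dimension is split into equal-width partitions, the grid list is a plain dictionary, and the density update applies the decay rule described on the following slides.

```python
from dataclasses import dataclass

@dataclass
class CharacteristicVector:
    tg: int = 0              # last time the grid was updated
    tm: int = 0              # last time the grid was removed from grid_list
    D: float = 0.0           # grid density
    label: int = -1          # cluster label (-1 = no class)
    status: str = "NORMAL"   # SPORADIC or NORMAL

def grid_key(x, mins, widths):
    """Map a d-dimensional record to its density-grid coordinates
    (assuming each dimension is split into equal-width partitions)."""
    return tuple(int((xi - lo) // w) for xi, lo, w in zip(x, mins, widths))

def online_update(x, t, grid_list, mins, widths, decay):
    """Online component: place the record x arriving at time t into its
    density grid and update that grid's characteristic vector, decaying
    the old density before adding the new record's contribution of 1."""
    g = grid_key(x, mins, widths)
    cv = grid_list.setdefault(g, CharacteristicVector(tg=t))
    cv.D = cv.D * decay ** (t - cv.tg) + 1.0
    cv.tg = t
    return g
```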

  27. D-stream definitions • Input – d dimensions defined in the space S = S1 × S2 × … × Sd • Density grid – each dimension Si is divided into pi partitions • Grid g = S1,j1 × S2,j2 × … × Sd,jd = (j1, j2, …, jd) • Every data record x = (x1, x2, …, xd) is mapped onto a grid g • Timestamp of arrival T(x) • Density coefficient of x at time t: λ^(t − T(x)), where λ ∈ (0, 1) is the decay factor • Grid density – the sum of the density coefficients of all records mapped to the grid • For each grid, the time when the last data was received is recorded so that the density can be updated
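
Written out, the density coefficient of a record, the grid density, and the incremental update applied when a new record arrives at time t_n (with t_l the time of the previous update) take roughly the following form:

```latex
D(x, t) = \lambda^{\,t - T(x)},
\qquad
D(g, t) = \sum_{x \in E(g, t)} D(x, t),
\qquad
D(g, t_{n}) = \lambda^{\,t_{n} - t_{l}}\, D(g, t_{l}) + 1,
```

where E(g, t) denotes the set of records mapped to grid g up to time t.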

  28. D-stream definitions • The characteristic vector of a grid is (tg, tm, D, label, status) • tg – last time the grid g was updated • tm – last time g was removed from grid_list • D – grid density • label – class label • status – SPORADIC or NORMAL, used to remove sporadic grids • Dense grid • Sparse grid • Transitional grid • Sporadic grids – contain very few data points
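
A sketch of how a grid could be classified, assuming dense and sparse thresholds of the form Cm/(N(1 − λ)) and Cl/(N(1 − λ)), with N the total number of grids; the concrete values of c_m and c_l below are illustrative, not taken from the slides.

```python
def classify_grid(density, n_grids, decay, c_m=3.0, c_l=0.8):
    """Classify a grid as DENSE, SPARSE, or TRANSITIONAL by comparing its
    density against the two thresholds; c_m and c_l are illustrative."""
    d_m = c_m / (n_grids * (1.0 - decay))   # dense threshold
    d_l = c_l / (n_grids * (1.0 - decay))   # sparse threshold
    if density >= d_m:
        return "DENSE"
    if density <= d_l:
        return "SPARSE"
    return "TRANSITIONAL"
```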

  29. D-stream process

  30. Components of D-stream • New data x is mapped to a grid g, and the grid density is updated • The scheme gradually reduces the density contribution of old records and grids • Clusters are formed periodically • The time interval for inspecting grids can't be too long or too short • Compute the minimum time for a dense grid to become a sparse grid (sketched below) • Remove sporadic grids • Grids containing very few data points • Removed by density thresholding • grid_list keeps track of all grids under analysis
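
One simple way to obtain that minimum time from the decay model: if a grid stops receiving data, its density shrinks by a factor λ per time step, so a grid that just meets the dense threshold (call it D_m) needs at least

```latex
\delta_{\min} = \left\lceil \log_{\lambda}\!\frac{D_{l}}{D_{m}} \right\rceil
```

time steps before its density can drop to the sparse threshold D_l. This is a sketch of the reasoning behind the inspection interval, not the paper's exact gap formula.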

  31. Initial cluster formation

  32. Updating clusters

  33. Results • Data – network intrusion data stream, synthetic • Data points – 30K – 85K

  34. Conclusion • D-Stream is a clustering technique for fast-changing data streams • Finds clusters of arbitrary shape • Sporadic grids are dynamically removed

  35. Thank you
