
Clustering Methods

Clustering Methods. Professor: Dr. Mansouri. Presented by: Muhammad Abouei & Mohsen Ghahremani Manesh. Density-Based Clustering Methods: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Presentation Transcript


  1. Clustering Methods • Professor: Dr. Mansouri • Presented by: Muhammad Abouei & Mohsen Ghahremani Manesh

  2. Clustering Methods • Density-Based Clustering Methods • DBSCAN (Density-Based Spatial Clustering of Applications with Noise) • OPTICS (Ordering Points To Identify the Clustering Structure) • DENCLUE (DENsity-based CLUstEring) • Grid-Based Clustering

  3. Density Based Clustering

  4. DBSCAN Concepts • ε-neighborhood: the set of points within distance ε (the radius) of a point. • MinPts: the minimum number of points required in the ε-neighborhood of a point. • Both ε and MinPts are user-defined parameters. (Figure: the ε-neighborhoods of p and q, with MinPts = 5.)

  5. DBSCAN Concepts • Density: the number of points within a specified radius (ε). (Figure: Density(p) = 5.)

  6. DBSCAN Concepts • Core point: a point is a core point if its ε-neighborhood contains at least MinPts points. • Core points lie in the interior of a cluster. (Figure: with MinPts = 5, p is a core point and q is not.)

  7. DBSCAN Concepts • Directly density-reachable (DDR): a point p is directly density-reachable from a point q w.r.t. ε, MinPts if • p belongs to the ε-neighborhood of q, and • q is a core point. (Figure: with MinPts = 4, p is DDR from q, but q is not DDR from p.) DDR is an asymmetric relation.

  8. DBSCAN Concepts • Density-reachable (DR): a point p is density-reachable from a point q w.r.t. ε, MinPts if there is a chain of points P1, …, Pn with P1 = q and Pn = p such that Pi+1 is directly density-reachable from Pi. Equivalently, p is density-reachable from q if there is a path from q to p in which every point except possibly p is a core point. (Figure: with MinPts = 4, p is DR from q, but q is not DR from p because p is not a core point.) DR is an asymmetric relation.

  9. DBSCAN Concepts • Density-connectivity (DC): a point p is density-connected to a point q w.r.t. ε, MinPts if there is a point r such that both p and q are density-reachable from r w.r.t. ε and MinPts. (Figure: with MinPts = 4, p and q are density-connected.) DC is a symmetric relation.
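The reachability and connectivity relations defined on slides 7 to 9 can be sketched as a breadth-first search that only expands core points. This is a minimal illustration, not code from the presentation: the function names are hypothetical, the ε-neighborhood is assumed to include the point itself, and a core point is assumed to need at least MinPts neighbors.

```python
import math
from collections import deque

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def density_reachable(points, q, p, eps, min_pts):
    """True if points[p] is density-reachable from points[q]."""
    visited = {q}
    queue = deque([q])
    while queue:
        cur = queue.popleft()
        neighbors = region_query(points, cur, eps)
        if len(neighbors) < min_pts:
            continue  # a non-core point cannot extend the chain
        for j in neighbors:
            if j == p:
                return True
            if j not in visited:
                visited.add(j)
                queue.append(j)
    return False

def density_connected(points, p, q, eps, min_pts):
    """True if some point r density-reaches both points[p] and points[q]."""
    return any(
        density_reachable(points, r, p, eps, min_pts)
        and density_reachable(points, r, q, eps, min_pts)
        for r in range(len(points))
    )

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
print(density_reachable(points, 0, 4, eps=1.0, min_pts=3))  # True
print(density_reachable(points, 4, 0, eps=1.0, min_pts=3))  # False: point 4 is not core
print(density_connected(points, 4, 0, eps=1.0, min_pts=3))  # True
```

The example shows the asymmetry stated on the slides: the border point at index 4 is reachable from the core point at index 0, but not the other way round, while density-connectivity holds in both directions.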

  10. DBSCAN Concepts • Border point: a border point has fewer than MinPts points within ε but lies in the ε-neighborhood of a core point. (Figure: MinPts = 5, ε = circle radius.)

  11. DBSCAN Concepts • Noise (outlier) point: any point that is neither a core point nor a border point. (Figure: MinPts = 5, ε = circle radius.)
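The three point types (core, border, noise) can be illustrated with a brute-force classifier. This is a sketch under stated assumptions: `classify_points` and `region_query` are hypothetical names, the ε-neighborhood counts the point itself, and a core point needs at least MinPts points in its neighborhood.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    neighborhoods = [region_query(points, i, eps) for i in range(len(points))]
    is_core = [len(nb) >= min_pts for nb in neighborhoods]
    labels = []
    for i, nb in enumerate(neighborhoods):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in nb):
            labels.append("border")  # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
print(classify_points(points, eps=1.0, min_pts=3))
# ['core', 'core', 'core', 'core', 'border', 'noise']
```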

  12. DBSCAN Concepts • DBSCAN relies on a density-based notion of cluster. • Cluster: a cluster C is a non-empty set of density-connected points that is maximal w.r.t. density-reachability. • Maximality: for all p, q: if q ∈ C and p is density-reachable from q w.r.t. ε and MinPts, then p ∈ C. (Figure: MinPts = 3, ε = circle radius.)

  13. DBSCAN Algorithm • Arbitrarily select a point p. • Retrieve all points density-reachable from p w.r.t. ε and MinPts. • If p is a core point, a cluster is formed. • If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database. • Continue the process until all of the points have been processed.
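The steps above can be sketched as a short brute-force implementation. It is an illustrative version only, with O(n²) neighborhood queries and no spatial index. The names `dbscan`, `region_query`, and `NOISE` are assumptions, the ε-neighborhood counts the point itself, and points are visited in index order rather than arbitrarily.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster_id += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster_id
        seeds = list(neighbors)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # only core points expand the cluster
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),
          (10, 10), (10, 11), (11, 10), (20, 20)]
print(dbscan(points, eps=1.0, min_pts=3))
# [0, 0, 0, 0, 0, 1, 1, 1, -1]
```

A border point that lies within ε of core points from two clusters is assigned to whichever cluster reaches it first, so its label can depend on the visiting order.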

  14. DBSCAN MinPts = 4

  15. DBSCAN • DBSCAN is sensitive to its parameters. (Figure: MinPts = 4.)

  16. DBSCAN • Core, border, and noise points with MinPts = 4, ε = 10. (Figure panels: original points; point types: core, border, and noise.)

  17. DBSCAN • When DBSCAN works well: • It is resistant to noise. • It can handle clusters of different shapes and sizes. (Figure panels: original points; clusters.)

  18. DBSCAN • When DBSCAN does not work well: • Varying densities • High-dimensional data

  19. DBSCAN Complexity • If a spatial index (e.g., a kd-tree or an R*-tree) is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects. Otherwise, it is O(n²).

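  20. OPTICS • Core distance: the core distance of a point p is the smallest ε′ ≤ ε that makes p a core object. If p is not a core object w.r.t. ε and MinPts, its core distance is undefined. (Figure: with MinPts = 5 and ε = 3 cm, the core distance ε′ of p is the distance between p and its 4th nearest neighbor.)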

  21. OPTICS • Reachability distance: the reachability distance of r w.r.t. p is the greater of the core distance of p and the Euclidean distance between p and r. If p is not a core object, the reachability distance of r w.r.t. p is undefined. (Figure: with MinPts = 5 and ε = 3 cm, reachability-distance_{ε, MinPts}(p, r) = ε′ and reachability-distance_{ε, MinPts}(p, r′) = d(p, r′).)
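The two OPTICS distances can be computed directly from their definitions. The sketch below is illustrative: the function names are hypothetical, `None` stands in for "undefined", and, following the slide's example (MinPts = 5 gives the 4th nearest neighbor), the point itself is assumed to count toward MinPts.

```python
import math

def core_distance(points, i, eps, min_pts):
    """Distance from points[i] to its (min_pts - 1)-th nearest neighbor,
    or None if points[i] is not a core object within eps."""
    if min_pts <= 1:
        return 0.0  # the point alone already satisfies MinPts
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    if len(dists) < min_pts - 1:
        return None
    d = dists[min_pts - 2]
    return d if d <= eps else None

def reachability_distance(points, i, j, eps, min_pts):
    """Reachability distance of points[j] w.r.t. points[i], or None if undefined."""
    cd = core_distance(points, i, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(points[i], points[j]))

points = [(0, 0), (1, 0), (0, 2), (3, 0), (0, 4), (6, 0)]
print(core_distance(points, 0, eps=5.0, min_pts=4))             # 3.0
print(reachability_distance(points, 0, 1, eps=5.0, min_pts=4))  # 3.0 (core distance dominates)
print(reachability_distance(points, 0, 4, eps=5.0, min_pts=4))  # 4.0 (actual distance dominates)
print(core_distance(points, 5, eps=5.0, min_pts=4))             # None (not a core object)
```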

  22. OPTICS

  23. OPTICS

  24. OPTICS

  25. OPTICS

  26. OPTICS

  27. OPTICS

  28. OPTICS

  29. OPTICS

  30. OPTICS

  31. OPTICS

  32. OPTICS • Color image segmentation using density-Based clustering

  33. DENCLUE • DENCLUE (DENsity-based CLUstEring) • Major features: • Solid mathematical foundation • Good for data sets with large amounts of noise • Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets • Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45) • But needs a large number of parameters

  34. DENCLUE • Technical Essence • Uses grid cells, but keeps information only about the grid cells that actually contain data points, and manages these cells in a tree-based access structure.

  35. DENCLUE • Technical Essence • DENCLUE is based on the following concepts: • Influence function • Density function • Density attractors.

  36. DENCLUE • Influence function: the influence function f^y(x) of a data point y at a point x of the data space is a positive function that decays to zero as x “moves away” from y. • A typical example is the Gaussian influence function f_Gauss(x, y) = exp(−d(x, y)² / (2σ²)), where σ is a user-defined parameter.

  37. DENCLUE • Density function: the density function at x based on a data set of N points, D = {x1, …, xN}, is defined as the sum of the influence functions of all data points at x: f^D(x) = Σ_{i=1..N} f^{xi}(x). • The goal of the definition: • Identify all “significant” local maxima xj*, j = 1, …, m, of f^D(x). • Create a cluster Cj for each xj* and assign to Cj all points of D that lie within the “region of attraction” of xj*.

  38. DENCLUE • Example: density computation. D = {x1, x2, x3, x4}; f^D_Gaussian(x) = influence(x1) + influence(x2) + influence(x3) + influence(x4) = 0.04 + 0.06 + 0.08 + 0.6 = 0.78. Remark: the density value of y would be larger than the one for x.
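The density computation can be sketched as a sum of Gaussian influences. This is an illustration under assumptions: it uses the Gaussian influence exp(−d(x, y)² / (2σ²)) from the DENCLUE literature, the function names are hypothetical, and the sample data are made up rather than taken from the slide's figure.

```python
import math

def gaussian_influence(x, y, sigma):
    """Influence of data point y at location x; decays to zero with distance."""
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """Density at x: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, xi, sigma) for xi in data)

# Hypothetical data: three points close together and one isolated point.
data = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (4.0, 4.0)]
sigma = 1.0

near_group = density((1.0, 1.0), data, sigma)
near_single = density((4.0, 4.0), data, sigma)
print(near_group > near_single)  # True: more points contribute near the group
```

A larger σ spreads each point's influence further and smooths the density. A smaller σ produces more local maxima.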

  39. DENCLUE • Density attractors: density attractors are local maxima of the overall density function f^D(x). • Clusters can then be determined mathematically by identifying density attractors. • A hill-climbing algorithm guided by the gradient can be used to determine the density attractor of a set of data points.

  40. DENCLUE • Density-attracted: a point x is density-attracted to a density attractor x* if there exists a sequence of points x0, x1, …, xk such that x0 = x, xk = x*, and the gradient of the density function at xi−1 points in the direction of xi for 0 < i < k.
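The sequence x0, x1, …, xk can be produced by gradient-guided hill climbing. The sketch below is a simplified illustration, not the presenters' implementation: it assumes a Gaussian influence function, a fixed step size, and a stopping rule that halts as soon as a step would lower the density. All function and parameter names are hypothetical.

```python
import math

def density(x, data, sigma):
    """Gaussian density at x (sum of the influences of all data points)."""
    return sum(math.exp(-math.dist(x, xi) ** 2 / (2 * sigma ** 2)) for xi in data)

def gradient(x, data, sigma):
    """Gradient of the Gaussian density at x."""
    grad = [0.0] * len(x)
    for xi in data:
        w = math.exp(-math.dist(x, xi) ** 2 / (2 * sigma ** 2)) / sigma ** 2
        for d in range(len(x)):
            grad[d] += (xi[d] - x[d]) * w
    return grad

def hill_climb(x, data, sigma, step=0.05, max_iter=1000):
    """Follow the normalized gradient from x until the density stops increasing."""
    x = list(x)
    for _ in range(max_iter):
        g = gradient(x, data, sigma)
        norm = math.sqrt(sum(c * c for c in g))
        if norm == 0.0:
            break
        nxt = [a + step * b / norm for a, b in zip(x, g)]
        if density(nxt, data, sigma) < density(x, data, sigma):
            break  # the next step would go downhill: x approximates the attractor
        x = nxt
    return x

# Two hypothetical groups of points centered near (0, 0) and (5, 5).
data = [(0, 0), (0.2, 0), (-0.2, 0), (0, 0.2), (0, -0.2),
        (5, 5), (5.2, 5), (4.8, 5), (5, 5.2), (5, 4.8)]
a = hill_climb((0.6, 0.4), data, sigma=0.5)
b = hill_climb((4.5, 5.5), data, sigma=0.5)
print(a, b)  # each start point climbs to the attractor of its own group
```

With a fixed step the result is only accurate to roughly the step size, so points whose climbs end close to each other would be treated as sharing one attractor.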

  41. DENCLUE • Center-defined cluster: a center-defined cluster (w.r.t. σ, ε) for a density attractor x* is a subset C ⊆ D, with every x ∈ C being density-attracted by x* and f^D(x*) ≥ ε. • Outlier: a point x ∈ D is called an outlier if it is density-attracted by a local maximum xo* with f^D(xo*) < ε.

  42. DENCLUE • Multicenter-defined clusters: a multicenter-defined cluster is a set of center-defined clusters linked by a path of significance.

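  43. DENCLUE • Arbitrary-shape cluster: an arbitrary-shape cluster (w.r.t. σ, ε) for a set of density attractors X is a subset C ⊆ D, where • for every x ∈ C there is an x* ∈ X with f^D(x*) ≥ ε such that x is density-attracted to x*, and • for all x1*, x2* ∈ X there is a path P from x1* to x2* with f^D(p) ≥ ε for all p ∈ P.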

  44. DENCLUE • Note that the number of clusters found by DENCLUE varies depending on σ and the density threshold ε.

  45. DENCLUE • DENCLUE is able to detect arbitrarily shaped clusters. • The algorithm deals with noise very satisfactorily. • The worst-case time complexity of DENCLUE is O(N log2 N). • Experimental results indicate that the average time complexity is O(log2 N). • It works efficiently with high-dimensional data. • DENCLUE needs at least 3 parameters to be determined, including σ and the density threshold.

  46. Grid-based • Using multi-resolution grid data structure • Clustering complexity depends on the number of populated grid cells and not on the number of objects in the dataset • Several interesting methods: • CS Tree (Clustering Statistical Tree) • STING • WaveCluster

  47. Grid-based • Basic Grid-based Algorithm • Define a set of grid-cells. • Assign objects to the appropriate grid cell and compute the density of each cell. • Eliminate cells, whose density is below a certain threshold τ. • Form clusters from contiguous (adjacent) groups of dense cells (usually minimizing a given objective function).
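The four steps of the basic grid-based algorithm can be sketched as follows. This is a minimal illustration under assumptions: cells are axis-aligned hypercubes of side `cell_size`, a cell's density is its point count, cells count as adjacent when they touch (including diagonally), and the optional objective function is left out. The names `grid_cluster`, `cell_size`, and `tau` are hypothetical.

```python
import itertools
import math
from collections import defaultdict, deque

def grid_cluster(points, cell_size, tau):
    """Return one cluster label per point; -1 marks points in sparse cells."""
    # Steps 1-2: assign each point to its grid cell; cell density = point count.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(math.floor(c / cell_size) for c in p)
        cells[key].append(idx)

    # Step 3: eliminate cells whose density is below the threshold tau.
    dense = {key for key, members in cells.items() if len(members) >= tau}

    # Step 4: form clusters from connected groups of adjacent dense cells.
    dim = len(points[0]) if points else 0
    offsets = [o for o in itertools.product((-1, 0, 1), repeat=dim) if any(o)]
    labels = [-1] * len(points)
    seen = set()
    cluster_id = 0
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for idx in cells[cell]:
                labels[idx] = cluster_id
            for off in offsets:
                nb = tuple(c + o for c, o in zip(cell, off))
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        cluster_id += 1
    return labels

points = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.4, 0.4),
          (5.5, 5.5), (5.6, 5.1), (9.0, 0.5)]
print(grid_cluster(points, cell_size=1.0, tau=2))
# [0, 0, 0, 0, 1, 1, -1]
```

After the single pass that bins the points, the remaining work runs over populated cells only, which is where the speed of grid-based methods comes from.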

  48. Grid-based • Fast: • No distance computations. • Clustering is performed on summaries and not on individual objects; complexity is usually O(no_of_populated_grid_cells) and not O(no_of_objects). • Easy to determine which clusters are neighboring.

  49. References • A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988. • A. K. Jain, M. N. Murty, and P. J. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999. • A. L. N. Fred and J. M. N. Leitão, “A New Cluster Isolation Criterion Based on Dissimilarity Increments,” IEEE • “Optimal grid-clustering: Toward breaking the curse of dimensionality in high-dimensional clustering,” in Proc. 25th VLDB Conf., 1999, pp. 506–517.

  50. ?
