DBSCAN

DBSCAN. Data Mining algorithm. School of Electrical Engineering, University of Belgrade, Department of Computer Engineering. Professor Dr Veljko Milutinović. Student Milan Micić 2011/3323 milan.z.micic@gmail.com. Content. Introduction The DBSCAN basic idea Algorithm DBSCAN on R

Presentation Transcript


  1. DBSCAN
     Data Mining algorithm
     School of Electrical Engineering, University of Belgrade
     Department of Computer Engineering
     Professor: Dr Veljko Milutinović
     Student: Milan Micić, 2011/3323, milan.z.micic@gmail.com

  2. Content
     • Introduction
     • The DBSCAN basic idea
     • Algorithm
     • DBSCAN on R
     • Example
     • Advantages
     • Disadvantages
     • References

  3. Introduction
     • Data clustering algorithms
     • Used in machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics
     • Hierarchical, centroid-based, distribution-based, density-based, etc.

  4. DBSCAN basic idea
     • Density-Based Spatial Clustering of Applications with Noise
     • University of Munich, 1996
     • Derived from the way humans naturally recognize clusters
     • Input parameters
        • The size of the epsilon neighborhood: ε
        • Minimum number of points: MinPts
     • The neighborhood of radius ε around a cluster's core point has to contain at least MinPts points
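
The two input parameters can be made concrete with a minimal Python sketch. The function names `region_query` and `is_core`, the sample points, and the choice of Euclidean distance are assumptions made for illustration; they are not taken from the slides.

```python
from math import dist  # Euclidean distance, Python 3.8+

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def is_core(points, i, eps, min_pts):
    """True if the eps-neighborhood of points[i] holds at least min_pts points."""
    return len(region_query(points, i, eps)) >= min_pts

# Hypothetical data: four points on a unit square plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
neighbors = region_query(points, 0, eps=1.5)
print(neighbors)
print(is_core(points, 0, eps=1.5, min_pts=4), is_core(points, 4, eps=1.5, min_pts=4))
```

Here the neighborhood counts the query point itself, which is one common convention; a different convention would shift the effective MinPts by one.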

  5. DBSCAN basic idea
     • Directly density-reachable, p1 from p2
        • p1 belongs to the ε-neighborhood of p2
        • p2's ε-neighborhood contains at least MinPts points
     • Density-reachable, pn from p0
        • There exists a chain of points p0, p1, ..., pn in which each pi+1 is directly density-reachable from pi
     • Core, border and noise points
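
The core, border and noise distinction can be sketched in a few lines of Python. This is an illustrative sketch, not the presenter's code: the names `neighbors` and `classify` and the sample points are hypothetical. A point is treated as core if its ε-neighborhood holds at least MinPts points, as border if it is not core but lies in the ε-neighborhood of a core point (so it is directly density-reachable from it), and as noise otherwise.

```python
from math import dist

def neighbors(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def classify(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    core = {i for i in range(len(points))
            if len(neighbors(points, i, eps)) >= min_pts}
    kinds = []
    for i in range(len(points)):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in neighbors(points, i, eps)):
            # Not dense itself, but directly density-reachable from a core point.
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds

# Hypothetical data: a small square, one point hanging off it, one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (6, 6)]
kinds = classify(points, eps=1.0, min_pts=3)
print(kinds)
```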

  6. Algorithm
     DBSCAN(D, eps, MinPts)
        C = 0
        for each unvisited point P in dataset D
           mark P as visited
           N = regionQuery(P, eps)
           if sizeof(N) < MinPts
              mark P as NOISE
           else
              C = next cluster
              expandCluster(P, N, C, eps, MinPts)

     expandCluster(P, N, C, eps, MinPts)
        add P to cluster C
        for each point P' in N
           if P' is not visited
              mark P' as visited
              N' = regionQuery(P', eps)
              if sizeof(N') >= MinPts
                 N = N joined with N'
           if P' is not yet member of any cluster
              add P' to cluster C

     • Complexity with indexing structure: O(n*log(n))
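
The pseudocode on this slide can be turned into a runnable Python sketch as follows. Several details are assumptions of this sketch rather than part of the slide: clusters are numbered from 0, noise is coded as -1, distance is Euclidean, and `region_query` is a brute-force scan, so the whole thing runs in O(n²) rather than the O(n log n) that an indexing structure would allow.

```python
from math import dist

NOISE = -1  # assumption of this sketch: noise is labelled -1

def region_query(points, i, eps):
    """Brute-force eps-neighborhood of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)      # None = not yet in any cluster
    visited = [False] * len(points)
    cluster = -1
    for p in range(len(points)):
        if visited[p]:
            continue
        visited[p] = True
        seeds = region_query(points, p, eps)
        if len(seeds) < min_pts:
            labels[p] = NOISE          # may later be relabelled as a border point
            continue
        cluster += 1                   # "C = next cluster"
        labels[p] = cluster
        k = 0
        while k < len(seeds):          # expandCluster: seeds grows while we scan it
            q = seeds[k]
            if not visited[q]:
                visited[q] = True
                q_neighbors = region_query(points, q, eps)
                if len(q_neighbors) >= min_pts:
                    for j in q_neighbors:   # "N = N joined with N'"
                        if j not in seeds:
                            seeds.append(j)
            if labels[q] is None or labels[q] == NOISE:
                labels[q] = cluster
            k += 1
    return labels

# Hypothetical data: two well-separated squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The `j not in seeds` test on a list keeps the sketch close to the pseudocode; a set of already queued indices would be the usual choice for larger inputs.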

  7. DBSCAN on R
     • GNU General Public License
     • Various methods for clustering and cluster validation
     • Interface functions for many methods implemented in language R
     • DBSCAN: O(n²)
     • FPC: Flexible Procedures for Clustering
     • dbscan(x, 0.2, showplot=2)

     dbscan Pts=600 MinPts=5 eps=0.2
             0  1  2  3  4  5  6  7  8  9 10 11
     seed    0 50 53 51 52 51 54 54 54 53 51  1
     border 28  4  4  8  5  3  3  4  3  4  6  4
     total  28 54 57 59 57 54 57 58 57 57 57  5
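
The output table on this slide can be sanity-checked with a few lines of Python. The numbers are copied from the slide; the reading that column 0 holds the noise points follows the fpc convention of coding noise as cluster 0 and should be treated as an assumption here.

```python
# Values copied from the dbscan output on the slide (columns are clusters 0..11).
seed   = [0, 50, 53, 51, 52, 51, 54, 54, 54, 53, 51, 1]
border = [28, 4, 4, 8, 5, 3, 3, 4, 3, 4, 6, 4]
total  = [28, 54, 57, 59, 57, 54, 57, 58, 57, 57, 57, 5]

# Each column should satisfy seed + border == total.
rows_consistent = all(s + b == t for s, b, t in zip(seed, border, total))

n_points = sum(total)   # should match "Pts=600" in the header line
n_noise = total[0]      # assumption: column 0 is the noise "cluster"

print(rows_consistent, n_points, n_noise)
```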

  8. Example
     • Astronomy task
     • Identifying celestial objects by capturing the radiation they emit
     • Captured noise (from sensors, diffuse emission from the atmosphere and from space itself)
     • Noise elimination method: constrain the relevant intensity by a known threshold
     • In this case, only pixels whose intensity is less than 50 (and consequently darker) are considered
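
The thresholding step can be sketched as follows, assuming the image is available as a nested list of grey-level intensities. The function name `dark_pixels` and the tiny sample image are hypothetical; only the threshold of 50 comes from the slide.

```python
def dark_pixels(image, threshold=50):
    """Return (row, column) coordinates of pixels with intensity below threshold."""
    return [(r, c)
            for r, row in enumerate(image)
            for c, value in enumerate(row)
            if value < threshold]

# Hypothetical 3x4 image; in the real task this would come from the captured data.
image = [
    [200,  40,  45, 210],
    [220,  30, 220, 230],
    [215, 225, 235,  10],
]
coords = dark_pixels(image)
print(coords)
```

The resulting coordinate list is the kind of point set that a density-based clustering step would then group into emission areas.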

  9. Example
     • DBSCAN algorithm applied to individual pixels
     • Links together a complete emission area
     • Each generated cluster defines a celestial entity
     • ε = 5, MinPts = 5: 64 clusters and 224 outliers found

  10. Disadvantages
     • Requires appropriate values for the parameters ε and MinPts
        • Numerous experiments indicate that MinPts = 4 works best
     • Struggles to cluster datasets with large differences in density
     • “Curse of dimensionality”
        • Affects every algorithm based on Euclidean distance on high-dimensional data sets
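
One heuristic for choosing ε, described in the Ester et al. paper listed in the references, is to sort the points by their distance to the k-th nearest neighbor and look for a bend in that curve. The Python sketch below computes those sorted k-distances; the function name `k_distances` and the sample points are hypothetical, and reading the bend off the curve is left to the user.

```python
from math import dist

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    result = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(d[k - 1])
    return sorted(result, reverse=True)

# Hypothetical data: a tight group of five points and one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
kd = k_distances(points, k=4)
print([round(v, 2) for v in kd])
```

In this toy data the outlier's 4-distance is far larger than the rest, so a value of ε just above the remaining distances would separate the group from the noise point.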

  11. Advantages
     • Does not require the number of clusters in the data a priori
     • Can find arbitrarily shaped clusters
        • Even clusters completely surrounded by a different cluster
     • Mostly insensitive to the ordering of the points in the database
        • Only border points might swap cluster membership
     • Has a notion of noise
     • Requires just two parameters

  12. References
     • Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu: “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, Institute for Computer Science, University of Munich, 1996
     • Mehmed Kantardzic: “Data Mining: Concepts, Models, Methods, and Algorithms”, 2011
     • Wikibooks: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Density-Based_Clustering
     • Wiki: http://en.wikipedia.org/wiki/DBSCAN

  13. Thank you for your attention!
     Questions?
     Milan Micić, milan.z.micic@gmail.com
