Clustering II AMCS/CS 340: Data Mining Xiangliang Zhang King Abdullah University of Science and Technology

Cluster Analysis
• What is Cluster Analysis?
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Methods
• Clustering High-Dimensional Data
• How to decide the number of clusters?
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits

Strengths of Hierarchical Clustering
• Does not require the number of clusters k as an input
  • Any desired number of clusters can be obtained by "cutting" the dendrogram at the proper level
• The clusters may correspond to meaningful taxonomies
  • Examples in the biological sciences: animal kingdom, phylogeny reconstruction, …

Main types of Hierarchical Clustering
• Agglomerative:
  • Start with the points as individual clusters
  • At each step, merge the closest pair of clusters until only one cluster is left
• Divisive:
  • Start with one, all-inclusive cluster
  • At each step, split a cluster until each cluster contains a single point
[Figure: points a, b, c, d, e. Agglomerative clustering runs from Step 0 to Step 4 (a and b merge into ab, d and e into de, c joins de to form cde, and finally abcde); divisive clustering runs the same steps in reverse order.]

Agglomerative Clustering Algorithm
• The more popular hierarchical clustering technique
• Basic algorithm is straightforward
  1. Compute the similarity matrix
  2. Let each data point be a cluster
  3. Repeat
  4.    Merge the two closest clusters
  5.    Update the similarity matrix
  6. Until only a single cluster remains
• Key operation is the computation of the similarity of two clusters
  • Different approaches to defining the distance between clusters distinguish the different algorithms
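
The basic algorithm can be sketched in a few lines of Python. This is a minimal illustration, not an efficient implementation: it works with distances rather than similarities, recomputes the cluster distances on every pass instead of updating a stored matrix, and uses single link (MIN) as the default criterion. The names `agglomerative`, `single_link`, and the parameter `k` are chosen here for illustration only.

```python
from itertools import combinations
from math import dist


def single_link(a, b):
    """MIN linkage: distance between the closest pair of points."""
    return min(dist(p, q) for p in a for q in b)


def agglomerative(points, k=1, linkage=single_link):
    """Merge the two closest clusters until only k clusters remain.

    Returns the final clusters and the list of merges in the order made.
    """
    clusters = [[p] for p in points]      # each point starts as its own cluster
    merges = []
    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # i < j, so delete j first to keep index i valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return clusters, merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = agglomerative(points, k=2)
```

Stopping at `k` clusters gives the same result as running to a single cluster and cutting the dendrogram at the level with `k` branches; the `merges` list is the information a dendrogram would display.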

Starting Situation
• Start with clusters of individual points and a similarity matrix
[Figure: points p1, p2, p3, p4, p5, … and the similarity matrix indexed by them]

Intermediate Situation
• After some merging steps, we have some clusters
[Figure: clusters C1 to C5 and the similarity matrix indexed by them]

Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the similarity matrix
[Figure: clusters C1 to C5 and the similarity matrix indexed by them]

Questions
• How do we decide which two clusters are closest?
• How do we update the similarity matrix?
[Figure: after the merge, the similarity matrix has one row and column for C2 ∪ C5, whose entries against C1, C3, and C4 are still unknown (?)]

How to Define Inter-Cluster Similarity
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
  • Ward's Method uses squared error
[Figure: points p1, p2, p3, p4, p5, … and their similarity matrix]

Cluster Similarity: MIN or Single Link
• Similarity of two clusters is based on the two most similar (closest) points in the different clusters
• Determined by one pair of points, i.e., by one link in the similarity graph

Strength of MIN
• Can handle non-elliptical shapes
[Figure: original points; two clusters]

Limitations of MIN
• Sensitive to noise and outliers
[Figure: original points; two clusters]

Cluster Similarity: MAX or Complete Linkage
• Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
• Determined by one pair of points in the two clusters

Strength of MAX
• Less susceptible to noise and outliers
[Figure: original points; two clusters]

Limitations of MAX
• Tends to break large clusters
• Complete linkage tends to find compact clusters of approximately equal diameters
[Figure: original points; two clusters]

Limitations of MAX
• Has difficulty with clusters of different sizes and with non-convex shapes
[Figure: original points; two clusters]

Cluster Similarity: Group Average
• Similarity of two clusters is the average of the pairwise similarities between points in the two clusters
• Need to use average connectivity for scalability, since total similarity favors large clusters

Hierarchical Clustering: Group Average
• Compromise between single and complete link
• Strengths
  • Less susceptible to noise and outliers
• Limitations
  • Biased towards globular clusters
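
To make the differences between the criteria concrete, the sketch below computes MIN, MAX, group average, and centroid distance for the same pair of clusters. It uses Euclidean distance instead of a similarity score, and the function names are illustrative rather than taken from any library.

```python
from math import dist


def min_link(a, b):
    """Single link: distance between the two closest points."""
    return min(dist(p, q) for p in a for q in b)


def max_link(a, b):
    """Complete link: distance between the two most distant points."""
    return max(dist(p, q) for p in a for q in b)


def group_average(a, b):
    """Average of all pairwise distances between the two clusters."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid_distance(a, b):
    """Distance between the cluster centroids (coordinate-wise means)."""
    def centroid(c):
        return tuple(sum(x) / len(c) for x in zip(*c))
    return dist(centroid(a), centroid(b))


c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(3.0, 0.0), (5.0, 0.0)]

print(min_link(c1, c2))           # 2.0
print(max_link(c1, c2))           # 5.0
print(group_average(c1, c2))      # 3.5
print(centroid_distance(c1, c2))  # 3.5
```

Group average and centroid distance happen to agree here because the points are collinear; in general they differ.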

Hierarchical Clustering in Matlab
• Matlab functions for hierarchical clustering
  • pdist: compute pairwise distances
  • linkage: create a hierarchical cluster tree
    • single (min, shortest distance)
    • complete (max, furthest distance)
    • average (average distance)
    • centroid (centroid distance)
    • median (weighted center of mass distance)
    • ward (inner squared distance, minimum variance algorithm)
  • dendrogram: plot the hierarchical cluster tree
  • cluster: construct clusters from linkages

Other Hierarchical Clustering Methods
• Major weaknesses of agglomerative clustering methods
  • Do not scale well: time complexity of at least O(n^2), where n is the total number of objects
  • Can never undo what was done previously
• Integration of hierarchical with distance-based clustering
  • BIRCH (SIGMOD'96): uses a CF-tree and incrementally adjusts the quality of sub-clusters
  • CURE (SIGMOD'98): each cluster is represented by a fixed number of points
  • ROCK (ICDE'99): clusters categorical data by neighbor and link analysis
  • CHAMELEON (Computer'99): hierarchical clustering using dynamic modeling

Overall Framework of CHAMELEON
• Data Set → Construct a Sparse Graph (k-nearest neighbor graph) → Partition the Graph → Merge Partitions → Final Clusters
• Partitions are merged when the interconnectivity between two clusters is high:
$$\text{Inter-connectivity}(C_i, C_j) \sim \frac{\text{edge-cut connecting } C_i \text{ with } C_j}{\text{edge-cut of breaking } C_i \text{ into two parts} + \text{edge-cut of breaking } C_j \text{ into two parts}}$$

Density-Based Clustering Methods
• Clustering based on density (a local cluster criterion), such as density-connected points
• Major features:
  • Discover clusters of arbitrary shape
  • Handle noise
  • One scan
  • Need density parameters as termination condition
• Several interesting studies:
  • DBSCAN: Ester et al. (SIGKDD'96)
  • OPTICS: Ankerst et al. (SIGMOD'99)
  • DENCLUE: Hinneburg & Keim (SIGKDD'98)
  • CLIQUE: Agrawal et al. (SIGMOD'98) (more grid-based)

Density-Based Clustering: Basic Concepts
• Two parameters:
  • Eps: maximum radius of the neighborhood
  • MinPts: minimum number of points in an Eps-neighborhood of that point
• Eps-neighborhood of q: $N_{Eps}(q) = \{x \in D \mid \mathrm{dist}(q, x) \le Eps\}$
• Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  • p belongs to $N_{Eps}(q)$, and
  • q is a core point: $|N_{Eps}(q)| \ge MinPts$
• Direct density-reachability is asymmetric!
[Figure: p in the Eps-neighborhood of the core point q, with MinPts = 5 and Eps = 1 cm]

Density-Reachable and Density-Connected
• Density-reachable (asymmetric!)
  • A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
• Density-connected (symmetric!)
  • A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
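
The definitions of Eps-neighborhood, core point, and direct density-reachability translate almost directly into code. The sketch below is only an illustration under simple assumptions (Euclidean distance, a small hand-made data set, and function names invented here); it shows why direct density-reachability is asymmetric.

```python
from math import dist


def eps_neighborhood(data, q, eps):
    """N_Eps(q): all points of the data set within distance eps of q (q included)."""
    return [x for x in data if dist(q, x) <= eps]


def is_core(data, q, eps, min_pts):
    """q is a core point if its Eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(data, q, eps)) >= min_pts


def directly_density_reachable(data, p, q, eps, min_pts):
    """True if p lies in the Eps-neighborhood of q and q is a core point."""
    return dist(p, q) <= eps and is_core(data, q, eps, min_pts)


data = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (1.4, 0.5)]
q, p = (0.5, 0.5), (1.4, 0.5)
eps, min_pts = 1.0, 4

print(directly_density_reachable(data, p, q, eps, min_pts))  # True: q is a core point
print(directly_density_reachable(data, q, p, eps, min_pts))  # False: p is not a core point
```

Here q has five points in its neighborhood and is a core point, while p has only two (itself and q), so p is reachable from q but not the other way round.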

DBSCAN (1)
• DBSCAN is a density-based algorithm
  • Density = number of points within a specified radius (Eps)
• A point is a core point if it has at least a specified number of points (MinPts) within Eps
  • These are points in the interior of a cluster
• A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point
• A noise point is any point that is neither a core point nor a border point

DBSCAN (2)
• Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in the presence of noise
[Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5. Core: at least MinPts points within Eps. Border: fewer than MinPts points within Eps, but has a core neighbor.]

DBSCAN: The Algorithm (1)
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts
• If p is a core point, a cluster is formed by expanding
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed

DBSCAN: The Algorithm (2)
Pseudocode

DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      N = getNeighbors(P, eps)
      if sizeof(N) < MinPts
         mark P as NOISE
      else
         C = next cluster
         expandCluster(P, N, C, eps, MinPts)

expandCluster(P, N, C, eps, MinPts)
   add P to cluster C
   for each point P' in N
      if P' is not visited
         mark P' as visited
         N' = getNeighbors(P', eps)
         if sizeof(N') >= MinPts
            N = N joined with N'
      if P' is not yet member of any cluster
         add P' to cluster C
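
The pseudocode can be turned into a short runnable Python version. This is a sketch under stated assumptions, not a reference implementation: it uses Euclidean distance, a brute-force O(n^2) neighbor search, and a single label list in place of separate "visited" and "cluster" bookkeeping. The constant `NOISE` and the function name `dbscan` are chosen here for illustration.

```python
from math import dist

NOISE = -1


def dbscan(data, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    def neighbors(i):
        # Brute-force Eps-neighborhood of point i (includes i itself).
        return [j for j in range(len(data)) if dist(data[i], data[j]) <= eps]

    labels = [None] * len(data)           # None means "not visited yet"
    cluster = -1
    for i in range(len(data)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE             # may later turn out to be a border point
            continue
        cluster += 1                      # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # noise reached from a core point: border
            if labels[j] is not None:
                continue                  # already assigned to a cluster
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:      # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels


data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (5, 50)]
labels = dbscan(data, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two squares form two clusters and the isolated point is labeled as noise.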

DBSCAN: Core, Border and Noise Points
[Figure: original points, and the same points labeled by type (core, border, and noise) for Eps = 10, MinPts = 4]

When DBSCAN Works Well
• Resistant to noise
• Can handle clusters of different shapes and sizes
[Figure: original points; clusters]

When DBSCAN Does NOT Work Well
• Varying densities
• High-dimensional data
[Figure: original points, and DBSCAN results for (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92)]

DBSCAN: Determining EPS and MinPts
• Idea: for points in one cluster, their kth nearest neighbors are at roughly the same distance
• Noise points have their kth nearest neighbor at a farther distance
• So, plot the sorted distance of every point to its kth nearest neighbor
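
The sorted k-distance values behind that plot can be computed as follows. This is a brute-force sketch with Euclidean distance; the function name `sorted_k_distances` is illustrative, and plotting is left out so the example needs only the standard library.

```python
from math import dist


def sorted_k_distances(data, k):
    """Distance from every point to its k-th nearest neighbor, sorted ascending."""
    k_dists = []
    for i, p in enumerate(data):
        # Distances to all other points; index k - 1 is the k-th nearest neighbor.
        others = sorted(dist(p, q) for j, q in enumerate(data) if j != i)
        k_dists.append(others[k - 1])
    return sorted(k_dists)


data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
k_dists = sorted_k_distances(data, k=2)
```

For this data the four clustered points have a 2nd-nearest-neighbor distance of 1.0, while the isolated point's value is above 13. The sharp jump in the sorted curve suggests choosing Eps just above the flat part, with MinPts tied to the chosen k.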

Grid-Based Clustering Method
• Especially useful for clustering spatial data
• Spatial data: geographically referenced data
  • e.g., temperature and salinity of the Red Sea

Grid-Based Clustering Method
• Several interesting methods
  • STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (VLDB'97)
  • WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98)
    • A multi-resolution clustering approach using a wavelet method
  • CLIQUE: Agrawal et al. (SIGMOD'98)
    • On high-dimensional data (thus covered in the section on clustering high-dimensional data)

STING: A Statistical Information Grid Approach
• The spatial area is hierarchically divided into rectangular cells, corresponding to different levels of resolution
• Each cell has attached a number of sufficient statistics (count, maximum, minimum, mean, standard deviation) reflecting the set of data points falling in the cell
• Efficiently processes "region oriented" queries, i.e., queries for the set of regions satisfying a number of conditions, including area and density

The STING Clustering Method
• Statistical info of each cell is calculated and stored beforehand and is used to answer queries
• Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells
  • count, maximum, minimum, mean, standard deviation
  • type of distribution: normal, uniform, etc.
• Use a top-down approach to answer spatial data queries
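
How a higher-level cell can be derived from its children without revisiting the raw data can be sketched as below. The example assumes population standard deviation and non-empty child cells, and the dictionary layout and function names (`cell_stats`, `merge_cells`) are illustrative choices, not part of STING itself.

```python
from math import sqrt
from statistics import fmean, pstdev


def cell_stats(values):
    """Statistics stored for a lowest-level cell (values must be non-empty)."""
    return {"n": len(values), "mean": fmean(values), "std": pstdev(values),
            "min": min(values), "max": max(values)}


def merge_cells(children):
    """Compute a parent cell's statistics from its children's statistics only."""
    n = sum(c["n"] for c in children)
    mean = sum(c["n"] * c["mean"] for c in children) / n
    # E[x^2] of the parent is the count-weighted average of (std^2 + mean^2).
    second_moment = sum(c["n"] * (c["std"] ** 2 + c["mean"] ** 2)
                        for c in children) / n
    std = sqrt(max(second_moment - mean ** 2, 0.0))
    return {"n": n, "mean": mean, "std": std,
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children)}


# Four child cells holding, say, temperature readings (made-up numbers).
child_values = [[20.0, 22.0], [25.0], [21.0, 23.0, 27.0], [30.0, 24.0]]
parent = merge_cells([cell_stats(v) for v in child_values])
```

The parent's statistics match what would be obtained by computing them directly over all eight readings, which is what makes the bottom-up precomputation cheap.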

The STING Clustering Method
• Advantages:
  • Query-independent, easy to parallelize, incremental update
  • O(K) complexity, where K is the number of grid cells at the lowest level
• Disadvantages:
  • All cluster boundaries are either horizontal or vertical; no diagonal boundary is detected
  • The clustering quality depends on the grid granularity: too fine, and the computational cost exponentially increases; too coarse, and the query answering quality is poor

WaveCluster: Clustering by Wavelet Analysis
• Goal: find arbitrarily shaped, densely populated regions in the feature space
• Key idea: apply a wavelet transform on the feature space to find the dense regions, that is, the clusters
• The wavelet transform is a signal processing technique that decomposes a signal into different frequency ranges
  • Boundaries and edges of the clusters correspond to the high-frequency parts of the feature signal
  • Clusters correspond to the low-frequency parts
[Figure: subband emphasizing the horizontal edges; subband emphasizing the vertical edges; subband emphasizing the corners]

WaveCluster: Clustering by Wavelet Analysis
• How to apply the wavelet transform to find clusters
  • Quantize the feature space to form a grid structure and assign points to the grid units Mj
  • Apply a discrete wavelet transform on each unit Mj to get new feature space units Tk
  • Find connected components (dense regions) in the transformed feature space at different levels
  • Each connected component (a set of units Tk) is considered a cluster; assign labels to the units, showing that a unit Tk belongs to a cluster cn
  • Make the lookup table mapping Tk to Mj
  • Map each object in Mj to the cluster to which the corresponding Tk belongs
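
The steps above can be illustrated with a heavily simplified 2-D sketch. It is not the published WaveCluster algorithm: it keeps only the low-frequency (approximation) part of a single-level Haar-style transform, computed as the unnormalized average of each 2x2 block of grid counts, and it introduces a hypothetical density `threshold` to decide which transformed units count as dense. All names and parameters are assumptions made for this example.

```python
def wavecluster_sketch(points, cell=1.0, threshold=1.0):
    """Return one cluster label per 2-D point (-1 for points in sparse regions)."""
    # 1. Quantize: count the points falling in each grid unit M_j.
    counts = {}
    for x, y in points:
        m = (int(x // cell), int(y // cell))
        counts[m] = counts.get(m, 0) + 1

    # 2. Low-pass step: each transformed unit T_k averages a 2x2 block of M_j.
    #    The lookup from M_j = (i, j) to T_k is simply (i // 2, j // 2).
    approx = {}
    for (i, j), c in counts.items():
        t = (i // 2, j // 2)
        approx[t] = approx.get(t, 0.0) + c / 4.0

    # 3. Keep dense units and label connected components (4-connectivity).
    dense = {t for t, v in approx.items() if v >= threshold}
    labels, next_label = {}, 0
    for start in sorted(dense):
        if start in labels:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            a, b = stack.pop()
            for nb in ((a + 1, b), (a - 1, b), (a, b + 1), (a, b - 1)):
                if nb in dense and nb not in labels:
                    labels[nb] = next_label
                    stack.append(nb)
        next_label += 1

    # 4. Map every object back through its grid unit to a cluster label.
    return [labels.get((int(x // cell) // 2, int(y // cell) // 2), -1)
            for x, y in points]


points = [(0.1, 0.1), (0.5, 0.5), (1.2, 0.3), (1.5, 1.5), (0.4, 1.6),
          (10.2, 10.1), (10.6, 10.7), (11.1, 11.3), (11.5, 10.4),
          (20.5, 0.5)]
result = wavecluster_sketch(points)
print(result)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two groups of points end up in two different dense transformed units, and the isolated point falls in a unit below the threshold, so it receives no cluster.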

Model-Based Clustering
• What is model-based clustering?
  • Attempt to optimize the fit (likelihood) between the given data and some mathematical model
  • Based on the assumption that the data are generated by a mixture of underlying probability distributions
  • Each component of the mixture corresponds to a cluster
• E.g., mixture of Gaussians
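
A mixture of Gaussians is commonly fitted with the EM algorithm, which the slide does not name; the sketch below uses it as one possible way to optimize the likelihood. It is restricted to one dimension and two components, starts the means at the smallest and largest data values (an arbitrary choice made here), and floors the variances at `min_var` to avoid degenerate components. All names are illustrative.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)


def em_two_gaussians(xs, iterations=50, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture and assign each x to a component."""
    means = [min(xs), max(xs)]            # crude initial guess
    overall = sum(xs) / len(xs)
    var0 = max(sum((x - overall) ** 2 for x in xs) / len(xs), min_var)
    variances = [var0, var0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in xs:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk,
                min_var)
    # Each point joins the component (cluster) with the highest responsibility.
    labels = [max(range(2), key=lambda k: r[k]) for r in resp]
    return weights, means, variances, labels


xs = [0.9, 1.0, 1.1, 1.2, 9.8, 10.0, 10.1, 10.3]
weights, means, variances, labels = em_two_gaussians(xs)
```

On this well-separated toy data the two estimated means settle near 1.05 and 10.05, and each mixture component ends up owning one group of points, which is the sense in which a component corresponds to a cluster.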