Prepared by: Mahmoud Rafeek Al-Farra

College of Science & Technology Dep. Of Computer Science & IT BCs of Information Technology Data Mining Chapter 6_2: Clustering Methods Prepared by: Mahmoud Rafeek Al-Farra 2013 www.cst.ps/staff/mfarra

Course’s Out Lines • Introduction • Data Preparation and Preprocessing • Data Representation • Classification Methods • Evaluation • Clustering Methods • Mid Exam • Association Rules • Knowledge Representation • Special Case study : Document clustering • Discussion of Case studies by students

Out Lines • Definition of Clustering • Clustering Process • Clustering Algorithms (Methods) • Cluster validation

Definition ? • Clustering is a division of data into groups of similar objects. • Each group, called cluster, consists of objects that are similar between themselves and dissimilar to objects of other groups. • Clustering is an unsupervised classification problem.

Clustering Process (Document Case) Preprocessing step Clustering Algorithm • Document cleaning • Feature selection or extraction. • Similarity Measure • Criterion Clustering Function Documents samples 1 2 Clusters 3 4 Cluster validation Results interpretation • External Indices • Internal Indices • Relative Indices. Knowledge

Clustering algorithm design or selection • This step is usually combined with the selection of a corresponding proximity measure and the construction of a criterion function. • Obviously, the proximity measure directly affects the formation of the resulting clusters. • Almost all clustering algorithms are explicitly or implicitly connected to some definition of proximity measure.

Clustering algorithm design or selection • In order to be able to group similar data objects a proximity metric has to be used to find which objects (or clusters) are similar. • Similarity Measure can be done through measure how much two objects are similar to each other (Similarity) or measure how mach two objects are different (dissimilarity ). • There is a large number of similarity metrics reported in the literature due to the large number of representation models and clustering algorithms.

Clustering algorithm design or selection Document cluster Document cluster Inter-Cluster Sim. Intra-Cluster Sim. Document cluster

Clustering Algorithms • Once a proximity measure is chosen, the construction of a clustering criterion function makes the partition of clusters an optimization problem, which is well defined mathematically, and has rich solutions in the literature.

Clustering Algorithms • K-means • Fuzzy C-means • Bisecting k-means Partitional Clustering NN Clustering Clustering Algorithms Density Clustering Grid Clustering Agglomerative (AHC) Hierarchical Clustering Divisive (DHC)

{a, b,c,d,e} {a}, {b,c,d,e} {a}, {b,c}, {d,e} {a}, {b,c}, {d}, {e} {a}, {b}, {c}, {d}, {e} a b c d e Hierarchical Clustering • Hierarchical techniques produce a nested sequence of partitions, with a single all-inclusive cluster at the top and singleton clusters of individual objects at the bottom. • The result of a hierarchical clustering algorithm can be viewed as a tree, called a dendogram.

Hierarchical Clustering • AHC starts with the set of objects as individual clusters; then, at each step merges the most two similar clusters. • This process is repeated until a minimal number of clusters have been reached, or, if a complete hierarchy is required then the process continues until only one cluster is left.

Hierarchical Clustering • DHC Methods work from top to bottom, starting with the whole data set as one cluster, and at each step split a cluster until only singleton clusters of individual objects remain

Partitional Clustering • Partitional clustering techniques create a one-level (un-nested) partitioning of the data points. • If K is the desired number of clusters, the partitional approaches typically find all K clusters at once. • The most known class of partitional clustering algorithms are the k-means algorithm and its variants. Centroids

Neural Networks-Based Clustering • Neural networks (NNs) are able to learn complex relationships from data samples either in a supervised or unsupervised fashion. • In supervised leaning, a labeled set of data is used to train the network for modeling the input and output functions, prior to testing. Whereas unsupervised networks do not use such a priori knowledge but they can learn the underlying relationships from the data.

Next: • Cluster validation • Examples of Clustering algorithm • Prepare 2 slides for each of the following clustering algorithm: • Density Clustering • Grid Clustering

Thanks

Prepared by: Mahmoud Rafeek Al-Farra