Friday, November 9, 2001 Cecil P. Schmidt Department of Computing and Information Sciences,

KDD - Group Presentation #8 - Fall ‘01 Constructive Induction for Knowledge Discovery: Clustering Techniques for KDD Friday, November 9, 2001 Cecil P. Schmidt Department of Computing and Information Sciences, Kansas State University http://www.cis.ksu.edu/~cps4444 cps4444@cis.ksu.edu

Presentation Outline • Background and definitions • The K-Means Algorithm • Similarity, Association, and Distance • K, Weights, and Scaling • Agglomerative Algorithms • Agglomeration by Single Linkage • Agglomeration By Comparison of Centroids • Two-level Approach • Summary and Discussion • Bibliography

Background and Definitions • What is clustering? • Clustering is a data mining activity that can be described as undirected knowledge discovery or unsupervised learning. • There is no pre-classified data and no distinction between dependent and independent variables. • We search for groups of records that are similar to one another in some way. • The expectation is that similar instances will behave in similar ways. • The output of a clustering algorithm is a statistical description of the cluster centroids (the centers of the clusters) together with the number of records in each cluster. • Why clustering? • It allows us to make sense of complex questions. • It helps eliminate noise from the data.

K-Means Clustering • Most commonly used method in practice • First published by J.B. MacQueen in 1967 • The algorithm works as follows • Step 1: select K data points to be the seeds • e.g. use the first K records of the data set as the seeds • each seed is an embryonic cluster with one element • Step 2: assign each record to the cluster whose centroid is nearest • The distance to each centroid can be calculated with the Euclidean distance metric • Step 3: calculate the centroids of the new clusters • Average the positions of the points in the cluster along each dimension • e.g. If there are 100 points in a cluster and we cluster on three dimensions (fields), then the new centroid has three dimensions, each being the average of that dimension over the 100 points. • Step 4: repeat steps 2 and 3 until the centroids no longer change
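The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function names `euclidean` and `k_means`, the `max_iter` safety limit, and the rule of keeping an empty cluster's previous centroid are all assumptions made for the sketch.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, max_iter=100):
    """Cluster `points` (a list of tuples) into k groups.

    Returns (centroids, assignments), where assignments[i] is the index
    of the cluster that points[i] belongs to.
    """
    # Step 1: use the first k records as seeds.
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iter):  # max_iter is an assumed safety limit
        # Step 2: assign each record to the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once membership (and hence the centroids) is stable.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: recompute each centroid as the per-dimension mean.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignments
```

Calling `k_means(records, 3)` on a list of one-element tuples (for example, ages) follows the same pattern as the one-dimensional example on the next slides.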

The K-Means Algorithm - Example [1] • Suppose we have a set of people that we wish to cluster into three groups. • Each instance is simply defined by one dimension (age) • We initialize the centroids to 1, 20, and 40 respectively • The corresponding table shows the results after steps 1 and 2 of the K-Means algorithm have completed • Note the highlighted cluster membership

The K-Means Algorithm - Example [2] • After steps 1 and 2 are complete, we recalculate the centroid values, which are now 5, 12, and 48 respectively. • We then recalculate the distance metric for each instance (repeat step 2) • P5 is now closer to C2 than to C1, so we must recalculate the means for centroids C1 and C2 • The membership of C3 did not change, so we don’t have to recalculate it

The K-Means Algorithm - Example [3] • After step 2 is complete, we recalculate the centroid values again, which are now 4, 11, and 48 respectively. • We then recalculate the distance metric for each instance (repeat step 2) • P4 is now closer to C2 than to C1, so we must recalculate the means for centroids C1 and C2 • The membership of C3 did not change, so we don’t have to recalculate it

The K-Means Algorithm - Example [4] • After step 2 is complete, we recalculate the centroid values again, which are now 3, 10, and 48 respectively. • We then recalculate the distance metric for each instance. • No cluster's membership changed, so the algorithm terminates • We are left with three clusters centered on the centroids 3, 10, and 48 respectively.

Similarity, Association, and Distance • How do we convert our intuitive notion that members of a cluster share some natural association into a representative metric? • We could use a geometric conversion, but there are problems with this • Many variable types, including categorical variables and some numerical variables such as rankings, cannot be meaningfully converted to geometric coordinates. • In a database the contribution of one dimension may be more important than that of another • To understand these issues we must review measurement theory [3] • nominal - labels only, with no order or magnitude; e.g. sports uniform numbers • ordinal - values can be ordered, but differences between them are not meaningful; e.g. class rank • interval - differences between observations are meaningful, but there is no well understood zero; e.g. temperature • ratio - has a well understood zero, so units can be converted; e.g. feet to meters • absolute - a count for which no conversion is required; e.g. lines of code • Two of the most often used measures are • Euclidean distance - the square root of the sum of the squared differences along each dimension • Number of common features - a count of the degree of overlap, which can be expressed as the ratio of the number of matching fields to the total number of fields
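The two measures named on this slide can be written down directly. The sketch below is illustrative only; the function names `euclidean_distance` and `match_ratio` are assumptions, and records are taken to be equal-length tuples.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared per-dimension differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_ratio(a, b):
    """Number of matching fields divided by the total number of fields.

    Suitable for nominal (categorical) fields, where subtraction has
    no meaning. A value of 1.0 means the records agree on every field.
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of fields")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)
```

Note that `euclidean_distance` grows as records become less alike, while `match_ratio` grows as they become more alike, so a clustering routine that expects a distance would use `1 - match_ratio(a, b)`.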

K, Weights, and Scaling • How do we choose K? • In many cases we have no prior knowledge of how many clusters there should be • K is often chosen at random and the results are then tested for cluster strength; e.g. the average distance between records in a cluster • Subjective evaluation is also required • K could be treated as a hyper-parameter whose fitness is determined by a cluster strength metric • Weighting and Scaling of variables (a data cleansing process) • Scaling deals with the problem that different variables are measured in different units • Convert all measurements to a common scale; e.g. feet, inches, and miles to inches • What about different types of measurements? This is a problem! • We can overcome this somewhat by mapping all variables to a common range so that a relative change is comparable between the variables • Weighting deals with the problem that we care about some variables more than others • Weighting can be used to bias one field over another • The weights can also be used as optimization parameters for genetic algorithms (GAs)
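One common way to map variables to a common range is min-max scaling to [0, 1], and one simple way to weight fields is to multiply each squared difference by a per-field weight. The slide does not prescribe either formula, so both are assumptions here, as are the names `min_max_scale` and `weighted_distance`.

```python
import math


def min_max_scale(column):
    """Map the values of one variable linearly onto the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to 0.0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def weighted_distance(a, b, weights):
    """Euclidean distance with a per-field weight on each squared difference.

    A larger weight makes that field count for more; a weight of 0
    removes the field from the comparison altogether.
    """
    return math.sqrt(
        sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights))
    )
```

Scaling is applied per column before clustering, and the weight vector is then a natural candidate for the parameters that a genetic algorithm would tune.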

Agglomerative Algorithms • Agglomerative Methods • Start out with each data point forming its own cluster and gradually merge clusters until all points have gathered together to form one big cluster • Preserves history of the cluster evolution • Considered hierarchical • The cluster distance metric used for merging can be one of the following: • Single Linkage: Distance between the closest members of each cluster • Complete Linkage: Distance between most distant members of each cluster • Comparison of centroids: Distance between the centroids of each cluster
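The three cluster distance metrics listed above differ only in which pair of points they compare. A minimal sketch, assuming each cluster is a non-empty list of equal-length tuples (the function names are illustrative):

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(c1, c2):
    """Distance between the closest members of the two clusters."""
    return min(dist(a, b) for a in c1 for b in c2)


def complete_linkage(c1, c2):
    """Distance between the most distant members of the two clusters."""
    return max(dist(a, b) for a in c1 for b in c2)


def centroid(cluster):
    """Per-dimension mean of the points in a cluster."""
    return tuple(sum(dim) / len(cluster) for dim in zip(*cluster))


def centroid_linkage(c1, c2):
    """Distance between the centroids of the two clusters."""
    return dist(centroid(c1), centroid(c2))
```

For the same pair of clusters, single linkage never exceeds complete linkage, so single linkage tends to merge sooner and can chain clusters together.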

Agglomeration by Single Linkage • Clustering People by Age • Use single linkage on a one-dimensional vector • Create clusters based on an age difference of one year
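In one dimension, single linkage with a fixed distance threshold reduces to sorting the values and starting a new cluster wherever the gap between neighbors exceeds the threshold. The sketch below relies on that shortcut; the function name and the `max_gap` parameter are assumptions, and the ages in the usage line are made up for illustration.

```python
def single_linkage_by_age(ages, max_gap=1):
    """Group ages so that adjacent members of a cluster differ by at most max_gap."""
    clusters = []
    for age in sorted(ages):
        # Join the current cluster if this age is close enough to its
        # largest member so far; otherwise start a new cluster.
        if clusters and age - clusters[-1][-1] <= max_gap:
            clusters[-1].append(age)
        else:
            clusters.append([age])
    return clusters


# Hypothetical ages, not taken from the presentation.
groups = single_linkage_by_age([21, 5, 40, 6, 20, 7])
```

Raising `max_gap` step by step and recording the clusters at each step reproduces the merge history that makes the method hierarchical.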

Agglomeration By Comparison of Centroids • Minimal Spanning Tree Clustering (MSTC) • Step 1 - Initialize the set of clusters • The set of clusters is initialized to the set of points (i.e. each point is a cluster) • Step 2 - Calculate the cluster centers and merge • The distance from each cluster center to every other cluster center is calculated. • The two clusters with the minimum distance between them are fused to form a single cluster. • Step 3 - Repeat • Repeat Step 2 until all components are grouped into the required final number of clusters.
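The three steps can be sketched as a short loop. This is an illustrative sketch rather than the presenter's implementation: the names are assumptions, and the centroid of a merged cluster is taken to be the mean of all its member points, which matches averaging the two centroids whenever two single-point clusters are merged.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(cluster):
    """Per-dimension mean of the member points (assumed merge rule)."""
    return tuple(sum(dim) / len(cluster) for dim in zip(*cluster))


def agglomerate_by_centroid(points, target_clusters):
    """Merge the two closest clusters until target_clusters remain."""
    if target_clusters < 1:
        raise ValueError("target_clusters must be at least 1")
    # Step 1: every point starts as its own cluster.
    clusters = [[tuple(p)] for p in points]
    # Step 3: repeat until the required number of clusters is reached.
    while len(clusters) > target_clusters:
        # Step 2: find the pair of clusters with the closest centroids.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: euclidean(
                centroid(clusters[pair[0]]), centroid(clusters[pair[1]])
            ),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters
```

Recomputing every pairwise distance on each pass keeps the sketch short; a practical implementation would cache the distance table and update only the rows affected by a merge.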

MSTC - Example • Mess personnel would like to divide a set of seven food items into four groups so that soldiers who select at least one item from each group obtain a certain fat and protein content. • The seven food items will be grouped into four groups based on their fat and protein content. • The following table gives the fat and protein content of the food items.

MSTC - Example - Step 1 • The first step in the Minimal Spanning Tree Algorithm is to treat each of these items as the centroid of its own cluster • The table can be redrawn as below:

MSTC - Example - Step 2 [1] • Step 2: Calculate the distance between every pair of centroids using the Euclidean metric. • For example, the distance between C1 and C2 is calculated.
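The formula shown on the original slide did not survive in this transcript. For two centroids described by protein (P) and fat (F), the Euclidean metric takes the form below. The subscripted symbols are notation introduced here, and the numeric illustration uses the P and F values that a later slide quotes for C3 and C7.

```latex
d(C_i, C_j) = \sqrt{(P_i - P_j)^2 + (F_i - F_j)^2}

d(C_3, C_7) = \sqrt{(4.2 - 3.9)^2 + (35 - 39)^2}
            = \sqrt{0.09 + 16}
            = \sqrt{16.09} \approx 4.01
```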

MSTC - Example - Step 2 [2] • The results are formulated into a table as shown below:

MSTC - Example - Step 2 [3] • The minimum distance between any two points is 4.01 (C3 and C7). • These two points are merged into a single cluster. • We name the centroid of this cluster C37. • The P value of the C37 centroid is (4.2 + 3.9)/2 = 4.05 and its F value is (35 + 39)/2 = 37. • The table showing the values of P and F is as follows:

MSTC - Example - Step 3 [1] • The third step is to repeat the second step until the number of clusters is reduced to 4. • That is, the distance between every pair of current centroids is recalculated as described above. • The recalculated distances are given below:

MSTC - Example - Step 3 [2] • The minimum distance between any two points is now approximately 5.04, between C2 and C5. • These two points are merged into a single point, called the C25 centroid. • The P value of the C25 centroid is (8.2 + 7.6)/2 = 7.90 and its F value is (15 + 20)/2 = 17.5. • The table showing the values of P and F is as follows:

MSTC - Example - Step 3 [3] • Next, we again calculate the distance between every pair of centroids, as in step 2. • The distances are displayed in the table given below:

MSTC - Example - Step 3 [4] • The minimum distance between any two points is 5.08. This distance is between C1 and C6. • These two points are merged into a single point, called the C16 centroid. • The P value of the C16 centroid is (1.1 + 2.0)/2 = 1.55 and its F value is (55 + 60)/2 = 57.50. • Finally, the data is divided into four groups of food items (clusters) with the fat and protein contents as specified (four centroids). • The table showing the values of P and F is as follows:
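As an arithmetic check on the three merges, the short Python sketch below recomputes each merge distance and merged centroid from the P and F values quoted on the slides. Which P value goes with which F value inside each pair is an assumption based on the order in which the slides list them; for these pairs neither the distance nor the midpoint depends on that choice. The full seven-item table is not reproduced in this transcript, so only the merged pairs are checked.

```python
import math

# (protein, fat) values quoted on the slides for each merged pair.
# The pairing of P with F within an item is assumed from listing order.
merges = {
    "C37": ((4.2, 35), (3.9, 39)),
    "C25": ((8.2, 15), (7.6, 20)),
    "C16": ((1.1, 55), (2.0, 60)),
}


def euclidean(a, b):
    """Euclidean distance between two (P, F) points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def midpoint(a, b):
    """Centroid of a two-point cluster: the per-dimension average."""
    return tuple((x + y) / 2 for x, y in zip(a, b))


# Map each merged centroid name to (merge distance, new centroid).
results = {
    name: (euclidean(a, b), midpoint(a, b)) for name, (a, b) in merges.items()
}

for name, (d, c) in results.items():
    print(f"{name}: distance = {d:.3f}, centroid = ({c[0]:.2f}, {c[1]:.2f})")
```

The distances come out at about 4.01, 5.04, and 5.08, and the centroids agree with the P and F values given on the slides.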

Two-level Approach • Self-Organizing Map (SOM) [4] • Two-level approach to clustering • Step 1: From N samples we create M prototypes by training a SOM • The SOM is a two-dimensional grid of map units, and each map unit carries one prototype vector • Step 2: The M prototypes are then clustered with a conventional method such as an agglomerative method or a variation of K-Means • One benefit of the two-level approach is that it can significantly reduce the computational cost, since M is typically much smaller than N

Summary and Discussion • Clustering is a data mining activity which allows us to make sense of the data • Unsupervised learning • We looked at two types of algorithms • Nonhierarchical; e.g. K-Means • Hierarchical; e.g. agglomerative algorithms such as MSTC • We discussed issues, parameters, and optimizations • Similarity, Association, and Distance • Choosing K, scaling, and weighting • Use of genetic algorithms for optimization of the hyper-parameters • Finally we discussed a hybrid approach • Self-Organizing Maps • Use two levels in the clustering process • The next step will be to look at specific algorithms and compare them to K-Means (the benchmark) • Much room for research in this field

Bibliography • [1] Menascé, D. A., Denning, P. J., et al. DAU Stat Refresher Module. Center for the New Engineer, George Mason University, Fairfax, VA. http://cne.gmu.edu/modules/dau/stat/clustgalgs/clust4_frm.html • [2] Berry, M. J. A., and Linoff, G. S. Data Mining Techniques for Marketing, Sales, and Customer Support. John Wiley and Sons, New York, NY, 1997. • [3] Gustafson, D. CIS 740 Software Engineering Course Notes, Fall 2000. • [4] Vesanto, J., and Alhoniemi, E. Clustering of the Self-Organizing Map. IEEE Transactions on Neural Networks, accepted.
