
Machine Learning and Data Mining Clustering


Presentation Transcript


  1. Machine Learning and Data Mining: Clustering (adapted from Prof. Alexander Ihler)

  2. Overview • What is clustering and its applications? • Distance between two clusters. • Hierarchical Agglomerative clustering. • K-Means clustering.

  3. What is Clustering? • Clustering is a form of “unsupervised learning”: there are no labels for the data instances. • Clustering is also called “grouping”. • Clustering algorithms rely on a distance metric between data points. • Data points that are close to each other end up in the same cluster.

  4. Unsupervised learning • Supervised learning • Predict a target value (“y”) given features (“x”) • Unsupervised learning • Understand patterns of the data (just “x”) • Useful for many reasons • Data mining (“explain”) • Missing data values (“impute”) • Representation (feature generation or selection) • One example: clustering (figure: unlabeled data points on axes X1, X2)

  5. Clustering Example Clustering has wide applications in: • Economic science • especially market research • grouping customers with similar preferences for advertising or predicting sales. • Text mining • search engines cluster documents.

  6. Overview • What is clustering and its applications? • Distance between two clusters. • Hierarchical Agglomerative clustering. • K-Means clustering.

  7. Cluster Distances There are four different measures to compute the distance between each pair of clusters: 1- Min distance 2- Max distance 3- Average distance 4- Mean distance
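The four measures named above can be written out explicitly. A sketch in standard notation (the slides only name the measures; the symbols Ci, Cj for clusters and the Euclidean norm are assumptions):

```latex
\begin{align}
d_{\min}(C_i, C_j)  &= \min_{x \in C_i,\, y \in C_j} \lVert x - y \rVert \\
d_{\max}(C_i, C_j)  &= \max_{x \in C_i,\, y \in C_j} \lVert x - y \rVert \\
d_{\mathrm{avg}}(C_i, C_j)  &= \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} \lVert x - y \rVert \\
d_{\mathrm{mean}}(C_i, C_j) &= \lVert \mu_i - \mu_j \rVert, \qquad \mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x
\end{align}
```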

  8. Cluster Distances (Min) (figure: min distance between two clusters, axes X1, X2)

  9. Cluster Distances (Max) (figure: max distance between two clusters, axes X1, X2)

  10. Cluster Distances (Average) Average of the distances between all pairs of points, one in cluster Ci and one in Cj (figure: axes X1, X2)

  11. Cluster Distances (Mean) µi = mean of all data points in cluster i (over all features) (figure: cluster means on axes X1, X2)

  12. Overview • What is clustering and its applications? • Distance between two clusters. • Hierarchical Agglomerative clustering. • K-Means clustering.

  13. Hierarchical Agglomerative Clustering Define a distance between clusters. Initialize: every example is its own cluster. Iterate: compute the distances between all clusters; merge the two closest clusters. Save both the clustering and the sequence of merge operations as a dendrogram. (figure: initially, every datum is its own cluster)
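As a concrete illustration of the loop above, here is a minimal Python sketch (not from the slides; the function names `agglomerative` and `cluster_dist` and the use of NumPy are assumptions). It merges the two closest clusters until k remain, using any of the four cluster distances defined earlier:

```python
import numpy as np

def cluster_dist(points, ci, cj, link):
    """Distance between clusters ci and cj (lists of point indices)."""
    if link == "mean":  # distance between the cluster means
        return np.linalg.norm(points[ci].mean(axis=0) - points[cj].mean(axis=0))
    # All pairwise distances between the two clusters.
    pair = np.linalg.norm(points[ci][:, None, :] - points[cj][None, :, :], axis=-1)
    if link == "min":
        return pair.min()
    if link == "max":
        return pair.max()
    return pair.mean()  # "average": mean over all pairs

def agglomerative(points, k, link="min"):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every datum starts as a cluster
    while len(clusters) > k:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_dist(points, clusters[a], clusters[b], link)
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # merge the two closest clusters
    return clusters
```

This naive version recomputes all pairwise cluster distances at every merge, so it is slower than the bookkeeping a careful implementation uses; it is meant to mirror the slide's pseudocode, not to be efficient.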

  14. Iteration 1 Merge the two closest clusters.

  15. Iteration 2 (figure: the merged groups labeled Cluster 1 and Cluster 2)

  16. Iteration 3 • Builds up a sequence of clusters (“hierarchical”) • Algorithm complexity O(N²) • (Why?) In Matlab: the “linkage” function (Statistics Toolbox)

  17. Dendrogram (figure: cutting the dendrogram at different heights yields two or three clusters) Stop the process once there is a sufficient number of clusters.
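In Python, the analogue of Matlab's "linkage" is scipy.cluster.hierarchy.linkage, and SciPy can also draw the dendrogram directly. A minimal sketch (assumes SciPy and matplotlib are installed; the toy data are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.random.default_rng(0).normal(size=(10, 2))  # toy 2-D data
Z = linkage(X, method="single")  # "single" = min distance; "complete" = max, "average" = average
dendrogram(Z)                    # plot the saved sequence of merge operations
plt.show()
```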

  18. Overview • What is clustering and its applications? • Distance between two clusters. • Hierarchical Agglomerative clustering. • K-Means clustering.

  19. K-Means Clustering • There are always exactly K clusters. • In contrast, in agglomerative clustering the number of clusters shrinks over time. • Iterate (until the clusters remain almost fixed): • For each cluster, compute the cluster mean µk = (1/|Ck|) Σx∈Ck x (computed feature-by-feature; Xi is feature i). • For each data point, find the closest cluster mean and assign the point to that cluster.
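A minimal Python sketch of this loop (not the slides' code; the function name `kmeans`, the random initialization, and the NumPy usage are assumptions):

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Alternate between assigning points to the closest mean and recomputing means."""
    rng = np.random.default_rng(seed)
    means = points[rng.choice(len(points), size=k, replace=False)]  # random initial means
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # Assignment step: each point joins the cluster with the closest mean.
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: recompute each cluster mean (keep the old mean if a cluster empties).
        new_means = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):  # nothing changed: converged
            break
        means = new_means
    return labels, means
```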

  20. Choosing the number of clusters • The cost function for clustering is the sum of squared distances from each point to its assigned cluster mean, J = Σi ‖xi − µz(i)‖², where z(i) is the cluster assigned to point xi. What is the optimal value of K? (Can increasing K ever increase the cost?) • This is a model complexity issue • Much like choosing lots of features: they only (seem to) help • Too many clusters leads to overfitting. • To limit the number of clusters, one solution is to penalize for complexity: • Add (# parameters) · log(N) to the cost • Now more clusters can increase the cost, if they don’t help “enough”
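A sketch of this penalized selection, reusing the `kmeans` function (and NumPy import) from the previous slide. The helper names and the parameter count of k means × d features are assumptions; this is one common BIC-style choice, not the slides' exact recipe:

```python
def clustering_cost(points, labels, means):
    """J = sum of squared distances from each point to its assigned cluster mean."""
    return sum(np.sum((points[labels == j] - means[j]) ** 2) for j in range(len(means)))

def penalized_cost(points, k):
    """Raw cost plus (# parameters) * log(N); here # parameters = k means * d features."""
    labels, means = kmeans(points, k)
    n, d = points.shape
    return clustering_cost(points, labels, means) + k * d * np.log(n)

# Pick the K with the lowest penalized cost over a small range, e.g.:
# best_k = min(range(1, 8), key=lambda k: penalized_cost(points, k))
```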

  21. Choosing the number of clusters (2) • The Cattell scree test: plot the dissimilarity against the number of clusters and look for the “elbow”; here the best K is 3. (figure: scree plot, dissimilarity vs. number of clusters 1–7) “Scree” is a loose accumulation of broken rock at the base of a cliff or mountain.

  22. 2-means Clustering Select two points randomly (as the initial means).

  23. 2-means Clustering Find the distance from every point to these two points.

  24. 2-means Clustering Form two clusters based on those distances.

  25. 2-means Clustering Find the new means; the green points are the mean values of the two clusters.

  26. 2-means Clustering Find the new clusters.

  27. 2-means Clustering Find the new means; find the distance from every point to these two means; form new clusters. Nothing changes, so the algorithm stops.
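The walkthrough above is exactly k-means with K = 2. A toy run of the earlier `kmeans` sketch (the data and seed are made up for illustration; reuses the NumPy import):

```python
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),   # one blob near (0, 0)
                    rng.normal(3.0, 0.5, size=(20, 2))])  # another blob near (3, 3)
labels, means = kmeans(points, k=2)
print(means)  # two mean vectors, one near (0, 0) and one near (3, 3)
```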

  28. Summary • Definition of clustering • The difference between supervised and unsupervised learning. • Finding a label for each datum. • Clustering algorithms • Agglomerative clustering • Every datum starts as its own cluster. • Combine the closest clusters. • Stop when there is a sufficient number of clusters. • K-means • Exactly K clusters exist throughout. • Find the new mean values. • Find the new clusters. • Stop when nothing changes in the clusters (or the change is below a small threshold).
