
How to cluster data Algorithm review






Presentation Transcript


  1. How to cluster data: Algorithm review. Extra material for DAA++ 18.2.2016. Prof. Pasi Fränti, Speech & Image Processing Unit, School of Computing, University of Eastern Finland, Joensuu, FINLAND.

  2. University of Eastern Finland, Joensuu. Joki = a river; Joen = of a river; Suu = mouth; Joensuu = mouth of a river.

  3. Research topics. Voice biometrics: speaker recognition, voice activity detection, applications. Location-based applications: mobile data collection, route reduction and compression, photo collections and social networks, location-aware services & search engine. Clustering methods: clustering algorithms, clustering validity, graph clustering, Gaussian mixture models. Image processing & compression: lossless compression and data reduction, image denoising, ultrasonic, medical and HDR imaging.

  4. Research achievements. Voice biometrics: NIST SRE submission ranked #2 in four categories in NIST SRE 2006; top-1 most downloaded publication in Speech Communication Oct-Dec 2009; results used in forensics. Location-based applications: results used by companies in Finland. Clustering methods: state-of-the-art algorithms; 4 PhD degrees; 5 top publications. Image processing & compression: state-of-the-art algorithms in niche areas; 6 PhD degrees; 8 top publications.

  5. Application example 1: Color reconstruction. [Figures: image with original colors vs. image with compression artifacts]

  6. Application example 2: Speaker modeling for voice biometrics. [Diagram: training data from speakers Tomi, Matti and Mikko passes through feature extraction and clustering to produce speaker models; an unknown sample passes through feature extraction and is matched against the models. Best match: Matti!]

  7. Speaker modeling. [Figures: speech data and the result of clustering]

  8. Application example 3: Image segmentation. [Figures: image with 4 color clusters; normalized color plot of the red and green components]

  9. Application example 4: Quantization. Approximation of continuous-range values (or a very large set of possible discrete values) by a small set of discrete symbols or integer values. [Figure: original signal vs. quantized signal]
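A toy illustration in Python (the signal and the level count are my own arbitrary choices, not from the slides): a continuous sine wave mapped onto 8 equally spaced values.

```python
import numpy as np

# "Original signal": continuous values in [-1, 1].
signal = np.sin(np.linspace(0, 2 * np.pi, 100))

# "Quantized signal": each sample replaced by the nearest of 8 levels.
levels = 8
index = np.clip(np.round((signal + 1) / 2 * (levels - 1)), 0, levels - 1)
quantized = index / (levels - 1) * 2 - 1
```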

  10. Color quantization of images. [Diagram: color image → RGB samples → clustering]

  11. Application example 5: Clustering of spatial data.

  12. Clustered locations of users

  13. Timeline clustering. [Figures: clustering of photos; clustered locations of users]

  14. Clustering GPS trajectories: mobile users, taxi routes, fleet management.

  15. Conclusions from clusters. [Figure: Cluster 1: Office; Cluster 2: Home]

  16. Part I: Clustering problem

  17. Definitions and data. Set of N data points: X = {x1, x2, …, xN}. Partition of the data: P = {p1, p2, …, pN}, where pi ∈ [1, M] gives the cluster index of xi. Set of M cluster prototypes (centroids): C = {c1, c2, …, cM}.

  18. K-means algorithm. X = data set, C = cluster centroids, P = partition.

  K-Means(X, C) → (C, P)
  REPEAT
      Cprev ← C;
      FOR all i ∈ [1, N] DO pi ← FindNearest(xi, C);             (optimal partition)
      FOR all j ∈ [1, M] DO cj ← average of all xi with pi = j;  (optimal centroids)
  UNTIL C = Cprev
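A minimal NumPy sketch of the loop above (the function and variable names are my own; X and C are assumed to be float arrays of shape N×D and M×D):

```python
import numpy as np

def k_means(X, C):
    """Alternate the optimal-partition and optimal-centroid steps
    until the centroids stop changing."""
    while True:
        C_prev = C.copy()
        # Optimal partition: pi <- FindNearest(xi, C).
        dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        P = dists.argmin(axis=1)
        # Optimal centroids: cj <- average of all xi with pi = j.
        for j in range(len(C)):
            members = X[P == j]
            if len(members):
                C[j] = members.mean(axis=0)
        if np.array_equal(C, C_prev):
            return C, P
```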

  19. Distance and cost function. Euclidean distance of data vectors: d(xi, cj) = sqrt( Σk (xik − cjk)² ). Mean square error: MSE(C, P) = (1/N) Σi ‖xi − c_pi‖².
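The cost function in the same NumPy setting (a sketch; the helper name is my own):

```python
import numpy as np

def mse(X, C, P):
    """Mean square error: average squared Euclidean distance
    from each vector to the centroid of its own cluster."""
    return float(((X - C[P]) ** 2).sum(axis=1).mean())
```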

  20. Clustering result as partition. [Figures: partition of data and cluster prototypes, illustrated by a Voronoi diagram and by convex hulls]

  21. Duality of partition and centroids. Partition of data ↔ cluster prototypes: the partition follows from the prototypes by nearest-prototype mapping, and the prototypes follow from the partition as centroids.

  22. Challenges in clustering. Incorrect cluster allocation; incorrect number of clusters (too many clusters, or clusters missing). [Figures annotated with examples of each]

  23. How to solve? Clustering is at once a mathematical, an algorithmic and a computer-science problem. Solve the clustering: given input data (X) of N data vectors and the number of clusters (M), find the clusters; the result is given as a set of prototypes, or a partition. Solve the number of clusters: define an appropriate cluster validity function f, repeat the clustering algorithm for several M, and select the best result according to f (see the sketch below). Solve the problem efficiently.
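A sketch of the number-of-clusters scheme, with the clustering algorithm and the validity function f left abstract, as on the slide (all names are my own, and a lower validity score is assumed to be better):

```python
def select_m(X, cluster, validity, m_range):
    """Run the clustering algorithm for each candidate M and keep
    the result that scores best on the validity function."""
    best = None
    for m in m_range:
        C, P = cluster(X, m)          # any clustering algorithm
        score = validity(X, C, P)     # the validity function f
        if best is None or score < best[0]:
            best = (score, m, C, P)
    return best
```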

  24. Part II: Clustering algorithms

  25. Algorithm 1: Split. P. Fränti, T. Kaukoranta and O. Nevalainen, "On the splitting method for vector quantization codebook generation", Optical Engineering, 36 (11), 3043-3051, November 1997.

  26. Divisive approach. Motivation: efficiency of the divide-and-conquer approach; a hierarchy of clusters as a result; useful when solving the number of clusters. Challenges: design problem 1, which cluster to split; design problem 2, how to split; sub-optimal (local optimization at best).

  27. Split-based (divisive) clustering

  28. Select cluster to be split. Heuristic choices: the cluster with the highest variance (MSE), or the cluster with the most skewed distribution (3rd moment). Locally optimal choice (use this!): tentatively split all clusters and select the one that decreases MSE the most. Complexity of the choice: the heuristics spend time computing their measure anyway, while the optimal choice takes only about twice (2×) as long, because the measures can be stored and only the two new clusters created at each step need to be recalculated; see the sketch below.
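A sketch of the locally optimal selection with cached gains (the names and the cache layout are my own; split stands for any splitting routine, e.g. the principal-axis split of slide 32):

```python
import numpy as np

def sse(points):
    """Sum of squared distances of points to their centroid."""
    return ((points - points.mean(axis=0)) ** 2).sum()

def select_cluster_to_split(clusters, split, gain):
    """Tentatively split every cluster and pick the one whose split
    decreases MSE the most. 'gain' caches the decrease per cluster,
    so after a real split only the two new clusters are re-evaluated."""
    for j, points in enumerate(clusters):
        if j not in gain:
            a, b = split(points)                  # tentative split
            gain[j] = sse(points) - (sse(a) + sse(b))
    return max(gain, key=gain.get)
```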

  29. Selection example. [Figure: clusters with MSE values 11.6, 6.5, 7.5, 4.3, 11.2 and 8.2. The cluster with the biggest MSE (11.6) is not the best choice; dividing another cluster decreases the total MSE more.]

  30. Selection example. [Figure: after the split, the MSE values are 11.6, 6.5, 7.5, 4.3, 6.3, 8.2 and 4.1; only the two new values need to be calculated.]

  31. How to split. Centroid methods: heuristic 1, replace C by C− and C+; heuristic 2, take the two furthest vectors; heuristic 3, take two random vectors. Partition according to the principal axis: calculate the principal axis, select a dividing point along the axis, divide by a hyperplane, and calculate the centroids of the two sub-clusters.

  32. Splitting along principal axis (pseudo code). Step 1: Calculate the principal axis. Step 2: Select a dividing point. Step 3: Divide points by a hyperplane. Step 4: Calculate centroids of the new clusters.
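A NumPy sketch of these four steps (my own formulation; for simplicity it divides at the mean projection rather than the optimal point of slide 34):

```python
import numpy as np

def principal_axis_split(points):
    """Split a cluster by a hyperplane orthogonal to its principal axis."""
    centered = points - points.mean(axis=0)
    # Step 1: principal axis = direction of largest variance (via SVD).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axis = vt[0]
    # Steps 2-3: project on the axis and divide by the hyperplane
    # passing through the mean (projection value 0).
    proj = centered @ axis
    left, right = points[proj <= 0], points[proj > 0]
    # Step 4: centroids of the two new sub-clusters.
    return (left, left.mean(axis=0)), (right, right.mean(axis=0))
```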

  33. Example of dividing. [Figure: principal axis and dividing hyperplane]

  34. Optimal dividing point (pseudo code of Step 2). Step 2.1: Calculate projections on the principal axis. Step 2.2: Sort vectors according to the projection. Step 2.3: FOR each vector xi DO: divide using xi as the dividing point and calculate the distortion of subsets D1 and D2. Step 2.4: Choose the point minimizing D1 + D2.

  35. Finding the dividing point. When the dividing point moves past one vector x, the error for the next dividing point can be calculated in O(1) time by updating the centroids incrementally: c1 ← (n1·c1 + x) / (n1 + 1) and c2 ← (n2·c2 − x) / (n2 − 1).
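A sketch of the full Step 2 sweep. Instead of updating the centroids explicitly, it maintains running sums and uses the equivalent identity SSE = Σ‖x‖² − ‖Σx‖²/n, which likewise gives O(1) work per candidate point (float input assumed; names are my own):

```python
import numpy as np

def optimal_dividing_point(points, axis):
    """Sweep the sorted projections, moving one vector at a time to the
    left subset and tracking sums, from which each subset's squared
    error follows in O(1)."""
    order = np.argsort(points @ axis)            # steps 2.1-2.2
    X = points[order]
    n = len(X)
    sq = (X ** 2).sum(axis=1)
    total_sum, total_sq = X.sum(axis=0), sq.sum()
    left_sum, left_sq = np.zeros_like(total_sum), 0.0
    best_i, best_cost = 1, np.inf
    for i in range(1, n):                        # step 2.3: each split point
        left_sum = left_sum + X[i - 1]
        left_sq += sq[i - 1]
        right_sum, right_sq = total_sum - left_sum, total_sq - left_sq
        cost = (left_sq - left_sum @ left_sum / i
                + right_sq - right_sum @ right_sum / (n - i))
        if cost < best_cost:                     # step 2.4: minimize D1 + D2
            best_i, best_cost = i, cost
    return order[:best_i], order[best_i:]        # indices of the two subsets
```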

  36. Sub-optimality of the split

  37. Example of splitting process. [Figures: 2 clusters and 3 clusters, showing the principal axis and dividing hyperplane]

  38. Example of splitting process 4 clusters 5 clusters

  39. Example of splitting process 6 clusters 7 clusters

  40. Example of splitting process 8 clusters 9 clusters

  41. Example of splitting process 10 clusters 11 clusters

  42. Example of splitting process 12 clusters 13 clusters

  43. Example of splitting process 14 clusters 15 clusters MSE = 1.94

  44. K-means refinement. Result directly after split: MSE = 1.94. Result after re-partition: MSE = 1.39. Result after K-means: MSE = 1.33.

  45. Time complexity. Number of processed vectors, assuming that clusters are always split into two equal halves: N + 2·(N/2) + 4·(N/4) + … = N·log2 M. Assuming an unequal split into sizes nmax and nmin, the sum grows; in the extreme case nmin = 1 at every step it becomes N + (N−1) + (N−2) + …, i.e. O(N·M).

  46. Time complexity. Number of vectors processed as above; at each step, sorting the vectors is the bottleneck: splitting a cluster of n vectors costs O(n·log n) for the sort, so the balanced case totals O(N·log N·log M).

  47. Algorithm 2: Pairwise Nearest Neighbor. P. Fränti, T. Kaukoranta, D-F. Shen and K-S. Chang, "Fast and memory efficient implementation of the exact PNN", IEEE Trans. on Image Processing, 9 (5), 773-777, May 2000.

  48. Agglomerative clustering. Single link: minimize the distance of the nearest vectors. Complete link: minimize the distance of the two furthest vectors. Ward's method: minimize the mean square error; in vector quantization, known as the pairwise nearest neighbor (PNN) method.

  49. PNN algorithm [Ward 1963: Journal of the American Statistical Association]. Merge cost of clusters a and b with sizes na, nb and centroids ca, cb: d(a, b) = (na·nb)/(na + nb) · ‖ca − cb‖². Local optimization strategy: always merge the pair with the smallest merge cost. Nearest neighbor search is needed for finding the cluster pair to be merged and for updating the NN pointers.

  50. Pseudo code
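A minimal Python sketch of the PNN loop under the Ward merge cost above (my own naive version that rescans all pairs at each merge; the cited paper's fast variant instead maintains the NN pointers mentioned on slide 49):

```python
import numpy as np

def pnn(X, M):
    """Start with every vector as its own cluster and repeatedly merge
    the pair with the smallest merge cost until M clusters remain."""
    centroids = [x.astype(float) for x in X]
    sizes = [1] * len(centroids)
    while len(centroids) > M:
        best = (np.inf, 0, 1)
        for a in range(len(centroids)):
            for b in range(a + 1, len(centroids)):
                diff = centroids[a] - centroids[b]
                cost = sizes[a] * sizes[b] / (sizes[a] + sizes[b]) * (diff @ diff)
                if cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        # Merge b into a: size-weighted average of the two centroids.
        centroids[a] = (sizes[a] * centroids[a]
                        + sizes[b] * centroids[b]) / (sizes[a] + sizes[b])
        sizes[a] += sizes[b]
        del centroids[b], sizes[b]
    return np.array(centroids)
```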
