300 likes | 528 Vues
K-Means Clustering. What is Clustering?. Also called unsupervised learning , sometimes called classification by statisticians and sorting by psychologists and segmentation by people in marketing. Mengelompokkan data-data menjadi beberapa cluster berdasarkan kesamaannya.
E N D
What is Clustering? Also called unsupervised learning, sometimes called classification by statisticians and sorting by psychologists and segmentation by people in marketing Mengelompokkan data-data menjadibeberapa cluster berdasarkankesamaannya
What is a natural grouping among these objects? Clustering is subjective Simpson's Family Females Males School Employees
Two Types of Clustering • Partitional algorithms:Membuatbeberapapartisidanmengelompokkanobjekberdasarkankriteriatertentu • Hierarchical algorithms:Membuatdekomposisipengelompokanobjekberdasarkankriteriatertentu. Misal= tua-muda, tua-muda(merokok-tidakmerokok) Partitional Hierarchical
What is Similarity? The quality or state of being similar; likeness; resemblance; as, a similarity of features. Webster's Dictionary Similarity is hard to define, but… “We know it when we see it”.
0 4 8 8 7 7 0 2 0 3 3 0 1 0 4 Distance : Adalahukurankesamaanantarobjek yang dihitungberdasarkanrumusantertentu D( , ) = 8 D( , ) = 1
Partitional Clustering • Nonhierarchical, setiapobjekditempatkandisalahsatu cluster • Nonoverlapping cluster • Jumlahkluster yang akandibentukditentukansejakawal
Algorithmk-means Tentukanberapa cluster k yang maudibuat. Inisialisasicentroiddaritiap cluster (randomly, if necessary). Tentukankeanggotaanobjek-objek yang lain denganmengklasifikasikannyasesuaicentroidterdekat (berdasarkan distance kecentroid) Setelah cluster dananggotanyaterbentuk, hitung mean tiap cluster danjadikansebagaicentroidbaru Jikacentroidbarutidaksamadengancentroid lama, makaperludiupdatelagikeanggotaanobjek-objeknya(balikke -3). Sebaliknyajikacentroidbarusamadengan yang lama makaselesai.
k3 k1 k2 K-means Clustering: Step 1-2 Tentukanberapa cluster k yang maudibuat. Inisialisasicentroiddaritiap cluster (randomly, if necessary) 5 4 3 2 1 0 0 1 2 3 4 5
k3 k1 k2 K-means Clustering: Step 3 Tentukankeanggotaanobjek-objek yang lain dengan mengklasifikasikannyasesuaicentroidterdekat 5 4 3 2 1 0 0 1 2 3 4 5
k3 k1 k2 K-means Clustering: Step 4 Setelah cluster dananggotanyaterbentuk, hitung mean tiap cluster danjadikansebagaicentroidbaru 5 4 3 2 1 0 0 1 2 3 4 5
k3 k1 k2 K-means Clustering: Step 5 Jikacentroidbarutidaksamadengancentroid lama, makaperludiupdatelagikeanggotaanobjek-objeknya 5 4 3 2 1 0 0 1 2 3 4 5
k1 k2 k3 K-means Clustering: Finish Lakukaniterasi step 3-5 sampaitakadalagiperubahancentroid dantakadalagiobjek yang berpindahkelas
Comments on the K-Means Method • Strength • Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n. • Often terminates at a local optimum. The global optimum may be found using techniques such as: deterministic annealing and genetic algorithms • Weakness • Applicable only when mean is defined, then what about categorical data? • Need to specify k, the number of clusters, in advance • Unable to handle noisy data and outliers
Algoritmapengukuran distance • SqEuclidean • Cityblock • Cosine • Correlation • Hamming
MATLAB • [IDX,C] = kmeans(X,k) returns the k cluster centroid locations in the k-by-p matrix C
[...] = kmeans(...,'param1',val1,'param2',val2,...) enables you to specify optional parameter name-value pairs to control the iterative algorithm used by kmeans. • The parameters are : • ‘distance’ • ‘start’ • ‘replicates’ • ‘maxiter’ • ‘emptyaction’ • ‘display’
'distance’ • Distance measure, in p-dimensional space, that kmeans minimizes with respect to. kmeans computes centroid clusters differently for the different supported distance measures:
'start' • Method used to choose the initial cluster centroid positions, sometimes known as "seeds". Valid starting values are:
'replicates' • Number of times to repeat the clustering, each with a new set of initial cluster centroid positions. • kmeans returns the solution with the lowest value for sumd. • You can supply 'replicates' implicitly by supplying a 3-dimensional array as the value for the 'start' parameter.
'maxiter' • Maximum number of iterations. Default is 100.
'emptyaction' • Action to take if a cluster loses all its member observations. Can be one of:
'display' • Controls display of output. • 'off‘ : Display no output. • 'iter‘ : Display information about each iteration during minimization, including the iteration number, the optimization phase, the number of points moved, and the total sum of distances. • 'final‘ : Display a summary of each replication. • 'notify‘ : Display only warning and error messages. (default)
Example dataku =[ 7 26 6 60; 1 29 15 52; ... 11 56 8 20; ... 11 31 8 47; ... 7 52 6 33; ... 11 55 9 22; ... 3 71 17 6; ... 1 31 22 44; ... 2 54 18 22; ... 21 47 4 26; ... 1 40 23 34; ... 11 66 9 12; ... 10 68 8 12]
Using kmeans to build 3 cluster • hasilk = kmeans(dataku,3)
Result hasilk = 1 1 2 1 2 2 2 3 2 2 3 2 2
Meaning of the result • Data at row number 1, 2, and 4 are member of first cluster (cluster number 1). • Data at row number 3,5,6,7,9,10,12 and 13 are member of second cluster (cluster number 2). • Data at row number 8 and 11 are member of third cluster (cluster number 3).