
CSE 634 Data Mining Concepts & Techniques: Cluster Analysis


Presentation Transcript


    1. Cluster Analysis CSE 634 Data Mining Concepts & Techniques. Group 6: Nam, Kyu Han (105953722); Ju, Jae Won (106112650); Chung, Dong Hwan (105275323)

    2. Cluster Analysis References Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 7, Sections 1–4). Morgan Kaufmann, 2005. Jiawei Han, Lecture Notes, University of Illinois at Urbana-Champaign, http://www-faculty.cs.uiuc.edu/~hanj/bk2/07.ppt. Prof. Dr. J. Fürnkranz and Dr. G. Grieser, “Maschinelles Lernen und Data Mining” (3-11), http://www.ke.informatik.tu-darmstadt.de/lehre/ws05/mldm/clustering.pdf. K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl, “Constrained K-means Clustering with Background Knowledge”, Proceedings of the 18th International Conference on Machine Learning (ICML 2001), pp. 577–584. Morgan Kaufmann, San Francisco, CA.

    3. Cluster Analysis Data Mining Concepts and Techniques

    4. Cluster Analysis What is Cluster Analysis? A cluster is a collection of data objects with two defining properties: intraclass similarity (objects are similar to objects in the same cluster) and interclass dissimilarity (objects are dissimilar to objects in other clusters). Cluster analysis is a statistical method for grouping a set of data objects into clusters; a good clustering method produces high-quality clusters with high intraclass similarity and low interclass similarity, and these are exactly the two properties a clustering tries to improve. Clustering is unsupervised classification: it does not rely on predefined classes or labeled training data, so it is learning by observation rather than learning by examples. Examples of clusters: stars in a galaxy, planets in the solar system, kinds of rocks.

    5. Cluster Analysis Examples of Clustering Applications Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs. Insurance: identify groups of motor insurance policy holders with a high average claim cost. City planning: identify groups of houses according to their house type, value, and geographical location. Earthquake studies: observed earthquake epicenters should be clustered along continental faults.

    6. Cluster Analysis Data Representation Data matrix (two-mode): an n-by-p table in which n objects are each described by p attributes. Dissimilarity matrix (one-mode): an n-by-n table whose entry d(i, j) is the dissimilarity between objects i and j; it is symmetric with zeros on the diagonal.
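
To make the two representations concrete, here is a minimal Python sketch (the data values are made up) that builds the one-mode dissimilarity matrix from a two-mode data matrix using Euclidean distance.

```python
import numpy as np

# Made-up data matrix (two-mode): n objects in rows, p attributes in columns.
X = np.array([
    [1.0, 2.0],
    [3.0, 5.0],
    [2.0, 2.5],
])

# Dissimilarity matrix (one-mode): D[i, j] = d(i, j), here the Euclidean distance.
n = len(X)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = np.sqrt(np.sum((X[i] - X[j]) ** 2))

print(D)  # symmetric, with zeros on the diagonal
```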

    7. Cluster Analysis Types of Data in Cluster Analysis Interval-Scaled Variables Binary Variables Nominal, Ordinal, and Ratio-Scaled Variables Variables of Mixed Types

    8. Cluster Analysis Interval-Scaled Variables Continuous measurements of a roughly linear scale E.g. weight, height, temperature, etc.

    9. Cluster Analysis Using Interval-Scaled Values Step 1: standardize the data, to ensure all attributes carry equal weight and to bring different scales onto a uniform, single scale. This is not always needed: sometimes we want unequal weights for an attribute; for example, height may be a more important attribute when describing basketball players. Step 2: compute the dissimilarity between records using the Euclidean, Manhattan, or Minkowski distance.
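
As an illustration of Step 1, the sketch below standardizes a made-up weight/height table; it uses the mean absolute deviation as the spread measure, one common choice (and the one used in the Han and Kamber text), before distances are computed in Step 2.

```python
import numpy as np

# Made-up data: weight (kg) and height (cm) for four people; the scales differ widely.
X = np.array([
    [62.0, 160.0],
    [85.0, 183.0],
    [70.0, 172.0],
    [58.0, 155.0],
])

# Standardize each attribute f: z_if = (x_if - m_f) / s_f,
# where m_f is the mean and s_f the mean absolute deviation of attribute f.
m = X.mean(axis=0)
s = np.mean(np.abs(X - m), axis=0)
Z = (X - m) / s

# After standardization both attributes carry roughly equal weight
# when Euclidean/Manhattan/Minkowski distances are computed on Z.
print(Z)
```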

    10. Cluster Analysis Data Types and Distance Metrics Distances are normally used to measure the similarity or dissimilarity between two data objects. Minkowski distance: d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q), where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and q is a positive integer.

    11. Cluster Analysis Data Types and Distance Metrics (Cont’d) If q = 1, d is Manhattan distance If q = 2, d is Euclidean distance
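
A small sketch of the Minkowski distance for arbitrary q, with the q = 1 (Manhattan) and q = 2 (Euclidean) special cases evaluated on two made-up 3-dimensional objects:

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

i = np.array([1.0, 2.0, 3.0])
j = np.array([4.0, 6.0, 3.0])

print(minkowski(i, j, 1))  # q = 1: Manhattan distance -> 7.0
print(minkowski(i, j, 2))  # q = 2: Euclidean distance -> 5.0
```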

    12. Cluster Analysis Data Types and Distance Metrics (Cont’d) Properties: d(i, j) ≥ 0; d(i, i) = 0; d(i, j) = d(j, i); d(i, j) ≤ d(i, k) + d(k, j). Can also use a weighted distance, or other dissimilarity measures.

    13. Cluster Analysis Binary Attributes A contingency table for binary data: for objects i and j, q counts the attributes equal to 1 in both, r the attributes equal to 1 in i but 0 in j, s the attributes equal to 0 in i but 1 in j, and t the attributes equal to 0 in both. Simple matching coefficient (if the binary attribute is symmetric): d(i, j) = (r + s) / (q + r + s + t). Jaccard coefficient (if the binary attribute is asymmetric): d(i, j) = (r + s) / (q + r + s).
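
The sketch below computes both coefficients from the contingency-table counts; the two patient records are illustrative only, with 1 encoding a positive/"yes" value of an asymmetric attribute.

```python
import numpy as np

def binary_dissimilarity(i, j, asymmetric=False):
    """Dissimilarity between two binary vectors via the contingency table.

    q: attributes that are 1 in both, r: 1 in i only,
    s: 1 in j only,               t: 0 in both.
    """
    i, j = np.asarray(i), np.asarray(j)
    q = np.sum((i == 1) & (j == 1))
    r = np.sum((i == 1) & (j == 0))
    s = np.sum((i == 0) & (j == 1))
    t = np.sum((i == 0) & (j == 0))
    if asymmetric:                       # Jaccard coefficient (ignores 0-0 matches)
        return (r + s) / (q + r + s)
    return (r + s) / (q + r + s + t)     # simple matching coefficient

# Illustrative records: fever, cough, test-1 .. test-4 encoded as 1/0.
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
print(binary_dissimilarity(jack, mary))                   # symmetric: 1/6 ~ 0.17
print(binary_dissimilarity(jack, mary, asymmetric=True))  # Jaccard:   1/3 ~ 0.33
```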

    14. Cluster Analysis Dissimilarity between Binary Attributes Example

    15. Cluster Analysis Nominal Attributes A generalization of the binary attribute in that it can take more than 2 states, e.g., red, yellow, blue, green. Method 1: simple matching, d(i, j) = (p − m) / p, where m is the number of attributes that are the same for both records and p is the total number of attributes. Method 2: rewrite the database and create a new binary attribute for each of the M states; for an object with color yellow, the yellow attribute is set to 1, while the remaining attributes are set to 0.
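
A one-function sketch of Method 1 on two made-up records with the nominal attributes color and shape:

```python
def nominal_dissimilarity(i, j):
    """Simple matching for nominal attributes: d(i, j) = (p - m) / p,
    where m is the number of matching attributes and p the total number."""
    p = len(i)
    m = sum(a == b for a, b in zip(i, j))
    return (p - m) / p

# Illustrative records with two nominal attributes: color and shape.
print(nominal_dissimilarity(["yellow", "round"], ["yellow", "square"]))  # 0.5
```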

    16. Cluster Analysis Ordinal Attributes An ordinal attribute can be discrete or continuous; order is important (e.g., rank). It can be treated like an interval-scaled attribute: replace x_if by its rank r_if in {1, …, M_f}; map the range of each attribute onto [0, 1] by replacing the i-th object in the f-th attribute with z_if = (r_if − 1) / (M_f − 1); then compute the dissimilarity using methods for interval-scaled attributes.
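
A minimal sketch of the rank mapping, assuming a made-up ordinal attribute "medal" with M_f = 3 ordered states:

```python
# Illustrative ordinal attribute: medal, with the order bronze < silver < gold.
order = {"bronze": 1, "silver": 2, "gold": 3}   # ranks r_if in 1..M_f
M = len(order)

def ordinal_to_interval(value):
    """Map a rank r_if onto [0, 1] via z_if = (r_if - 1) / (M_f - 1)."""
    r = order[value]
    return (r - 1) / (M - 1)

# The mapped values can then be compared with interval-scaled distance measures.
print(ordinal_to_interval("bronze"), ordinal_to_interval("gold"))  # 0.0 1.0
```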

    17. Cluster Analysis Ratio-Scaled Attributes Ratio-scaled attribute: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(−Bt). Methods: treat them like interval-scaled attributes (not a good choice, because the scales may be distorted); apply a logarithmic transformation y_if = log(x_if); or treat them as continuous ordinal data and treat their rank as interval-scaled.

    18. Cluster Analysis Attributes of Mixed Types A database may contain all six types of attributes: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio. Use a weighted formula to combine their effects: d(i, j) = Σ_f δ_ij^(f) d_ij^(f) / Σ_f δ_ij^(f), where the indicator δ_ij^(f) is 0 if attribute f is missing for i or j (or is an asymmetric binary attribute with a 0–0 match) and 1 otherwise. If f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, and d_ij^(f) = 1 otherwise. If f is interval-based: use the normalized distance. If f is ordinal or ratio-scaled: compute the ranks r_if, set z_if = (r_if − 1) / (M_f − 1), and treat z_if as interval-scaled.
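
A sketch of the weighted combination; the per-attribute dissimilarities and indicator values below are made up to show how the formula mixes a nominal, an interval-scaled, and an asymmetric binary attribute.

```python
def mixed_dissimilarity(per_attribute):
    """Combine per-attribute dissimilarities with
    d(i, j) = sum_f(delta_f * d_f) / sum_f(delta_f).

    per_attribute: list of (delta, d) pairs, where delta is 0 when the
    attribute is missing or is an irrelevant 0-0 asymmetric binary match."""
    num = sum(delta * d for delta, d in per_attribute)
    den = sum(delta for delta, _ in per_attribute)
    return num / den if den else 0.0

contributions = [
    (1, 1.0),   # nominal attribute: values differ -> d = 1
    (1, 0.25),  # interval attribute: normalized |x_if - x_jf| / range_f
    (0, 0.0),   # asymmetric binary attribute: 0-0 match, excluded (delta = 0)
]
print(mixed_dissimilarity(contributions))  # 0.625
```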

    19. Cluster Analysis Data Mining Concepts and Techniques

    20.–41. Cluster Analysis (slide content not captured in this transcript)

    42. Cluster Analysis Data Mining Concepts and Techniques

    43. Cluster Analysis Introduction Clustering is an unsupervised method of data analysis: data instances are grouped according to some notion of similarity, with access only to the set of features describing each object and no information as to where each instance should be placed within the partition. However, there may be background knowledge about the domain or data set that could be useful to the algorithm. In this paper the authors integrate such background knowledge into clustering algorithms.

    44. Cluster Analysis K-means Clustering Used to partition a data set into k groups: instances are grouped based on their attributes into k clusters with high intra-cluster similarity and low inter-cluster similarity, where cluster similarity is measured with respect to the mean value of the objects in the cluster. How does K-means work? First, select k random instances from the data as the initial cluster centers. Second, assign each instance to its closest (most similar) cluster center. Third, update each cluster center to the mean of its constituent instances. Repeat steps two and three until there is no further change in the assignment of instances to clusters.
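
A compact sketch of those four steps (random initial centers, nearest-center assignment, mean update, repeat until the assignments stop changing), run on made-up 2-D data:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means following the four steps on the slide."""
    rng = np.random.default_rng(seed)
    # Step 1: select k random instances as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignment = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign each instance to its closest (most similar) center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            break  # no further change in the assignment of instances
        assignment = new_assignment
        # Step 3: update each center to the mean of its constituent instances.
        for c in range(k):
            members = X[assignment == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return assignment, centers

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.3, 4.7]])
labels, centers = kmeans(X, k=2)
print(labels, centers)
```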

    45. Cluster Analysis Constrained K-means Clustering Two types of pairwise constraints. Must-link: constraints that specify that two instances have to be in the same cluster. Cannot-link: constraints that specify that two instances must not be placed in the same cluster. When using a set of constraints we have to take their transitive closure. Constraints may be derived from partially labeled data or from background knowledge about the domain or data set.

    46. Cluster Analysis Constrained Algorithm First, select k random instances from the data as the initial cluster centers. Second, assign each instance to its closest (most similar) cluster center such that VIOLATE-CONSTRAINTS(i, K, M, C) is false; if no such cluster exists, fail. Third, update each cluster center to the mean of its constituent instances. Repeat steps two and three until there is no further change in the assignment of instances to clusters. VIOLATE-CONSTRAINTS(instance i, cluster K, must-link constraints M, cannot-link constraints C): for each (i, i_=) in M, if i_= is not in K, return true; for each (i, i_≠) in C, if i_≠ is in K, return true; otherwise return false.
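
A sketch of the constrained assignment step, assuming the must-link and cannot-link constraints are given as lists of index pairs and that assignment[i] == -1 marks an instance not yet assigned (these names and conventions are mine, not the paper's):

```python
import numpy as np

def violates_constraints(i, cluster, assignment, must_link, cannot_link):
    """True if putting instance i into `cluster` breaks any constraint."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        if other is not None and assignment[other] not in (-1, cluster):
            return True   # must-link partner already sits in another cluster
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        if other is not None and assignment[other] == cluster:
            return True   # cannot-link partner already sits in this cluster
    return False

def constrained_assign(i, X, centers, assignment, must_link, cannot_link):
    """Try centers from closest to farthest; fail if none is feasible."""
    for c in np.argsort(np.linalg.norm(centers - X[i], axis=1)):
        if not violates_constraints(i, c, assignment, must_link, cannot_link):
            assignment[i] = c
            return c
    raise RuntimeError(f"no feasible cluster for instance {i}")
```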

    47. Cluster Analysis Experimental Results on GPS Lane Finding Large database of digital road maps available These maps contain only coarse information about the location of the road By refining maps down to the lane level we can enable a host of more sophisticated applications such as lane departure detection Approach Based on the observation that drivers tend to drive within lane boundaries Lanes should correspond to “densely traveled” regions in contrast to the lane boundaries Possible to collect data about the location of cars and then cluster that data to automatically determine where the individual lanes are located

    48. Cluster Analysis GPS Lane Finding (cont’d) Collect data about the location of cars as they drive along a given road: data are collected once per second from several drivers using GPS receivers affixed to the top of their vehicles. Each data instance has two features: 1. distance along the road segment and 2. perpendicular offset from the road centerline. For evaluation purposes, drivers were asked to indicate which lane they occupied and any lane changes.

    49. Cluster Analysis GPS Lane Finding (cont’d) For the problem of automatic lane detection, two domain-specific heuristics are used to generate constraints. Trace contiguity means that, in the absence of lane changes, all of the points generated from the same vehicle in a single pass over a road segment should end up in the same lane. Maximum separation refers to a limit on how far apart two points can be (perpendicular to the centerline) while still being in the same lane: if two points are separated by at least four meters, a constraint is generated that prevents those two points from being placed in the same cluster. To better analyze performance in this domain, the authors also modified the cluster center representation.
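
A minimal sketch of how the two heuristics could generate constraints, assuming each GPS point is a (trace id, distance along road, perpendicular offset) tuple with made-up values; the sketch ignores lane changes, which trace contiguity assumes are absent.

```python
from itertools import combinations

# Made-up GPS points: (trace_id, distance_along_road_m, offset_from_centerline_m).
points = [
    (0, 10.0, 1.2), (0, 11.0, 1.3), (0, 12.0, 1.1),   # one pass by vehicle 0
    (1, 10.5, 5.4), (1, 11.5, 5.6),                   # one pass by vehicle 1
]
MAX_SEPARATION = 4.0  # meters, perpendicular to the road centerline

must_link, cannot_link = [], []
for i, j in combinations(range(len(points)), 2):
    # Trace contiguity: points from the same pass should share a lane.
    if points[i][0] == points[j][0]:
        must_link.append((i, j))
    # Maximum separation: points at least 4 m apart cannot share a lane.
    elif abs(points[i][2] - points[j][2]) >= MAX_SEPARATION:
        cannot_link.append((i, j))

print(len(must_link), len(cannot_link))  # 4 must-link and 6 cannot-link pairs
```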

    50. Cluster Analysis GPS Lane Finding (cont’d)

    51. Cluster Analysis Conclusion Measurable improvement in accuracy The use of constraints while clustering means that, unlike the regular k-means algorithm, the assignment of instances to clusters can be order-sensitive. If a poor decision is made early on, the algorithm may later encounter an instance i that has no possible valid cluster Ideally, the algorithm would be able to backtrack, rearranging some of the instances so that i could then be validly assigned to a cluster. Could be extended to hierarchical algorithms
