Data Mining Lecture 6
Course Syllabus • Case Study 1: Working with and exploring the properties of the Retail Banking Data Mart (Week 4 – Assignment 1) • Data Analysis Techniques (Week 5) • Statistical Background • Trends/Outliers/Normalizations • Principal Component Analysis • Discretization Techniques • Case Study 2: Working with and exploring the properties of the discretization infrastructure of the Retail Banking Data Mart (Week 5 – Assignment 2) • Lecture Talk: Searching/Matching Engine
Course Syllabus • Clustering Techniques (Week 6) • K-Means Clustering • Condorcet Clustering • Other Clustering Techniques • Case Study 3: Working with and exploring the properties of the clustering infrastructure for the Retail Banking Data Mart (Week 6 – Assignment 3) • Lecture Talk: Different Perspectives on Searching/Matching
Discretization • Three types of attributes: • Nominal: values from an unordered set, e.g., color, profession • Ordinal: values from an ordered set, e.g., military or academic rank • Continuous: numeric values, e.g., integer or real numbers • Discretization: • Divide the range of a continuous attribute into intervals • Some classification algorithms only accept categorical attributes • Reduce data size by discretization • Prepare for further analysis
Discretization and Concept Hierarchy • Discretization • Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals • Interval labels can then be used to replace actual data values • Supervised vs. unsupervised • Split (top-down) vs. merge (bottom-up) • Discretization can be performed recursively on an attribute • Concept hierarchy formation • Recursively reduce the data by collecting and replacing low level concepts (such as numeric values for age) by higher level concepts (such as young, middle-aged, or senior)
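The idea of replacing numeric values with higher-level concept labels can be sketched in a few lines. This is a minimal illustration only: the slide names the labels (young, middle-aged, senior) but gives no boundaries, so the cut points `CUTS` and the helper `age_to_concept` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical cut points; the slide gives the labels but not the boundaries.
CUTS = [35, 60]
LABELS = ["young", "middle-aged", "senior"]

def age_to_concept(age):
    """Replace a numeric age with the label of the interval it falls in."""
    # bisect_right returns how many cut points are <= age, i.e. the interval index.
    return LABELS[bisect_right(CUTS, age)]

ages = [23, 35, 47, 61, 78]
concepts = [age_to_concept(a) for a in ages]
print(concepts)
```

With these assumed cut points, an age equal to a boundary (35) falls into the upper interval; a different convention would only change `bisect_right` to `bisect_left`.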
Discretization and Concept Hierarchy Generation for Numeric Data • Typical methods: All the methods can be applied recursively • Binning (covered above) • Top-down split, unsupervised • Histogram analysis (covered above) • Top-down split, unsupervised • Clustering analysis (covered above) • Either top-down split or bottom-up merge, unsupervised • Entropy-based discretization: supervised, top-down split • Interval merging by χ² analysis: supervised (it uses class labels), bottom-up merge • Segmentation by natural partitioning: top-down split, unsupervised
Entropy-Based Discretization • Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the expected information (weighted class entropy) after partitioning is $I(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)$ • Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is $\mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2 p_i$ • where $p_i$ is the probability of class i in S1 • The boundary T that minimizes $I(S,T)$ over all possible boundaries is selected as a binary discretization • The process is recursively applied to the partitions obtained until some stopping criterion is met • Such a boundary may reduce data size and improve classification accuracy
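One binary split of this procedure can be sketched as follows. It is a minimal illustration, not a full recursive discretizer: the function names `entropy` and `best_boundary` are made up here, and taking midpoints between adjacent distinct values as candidate boundaries is an assumption (the slide does not say how candidates are chosen).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Class entropy of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_boundary(values, labels):
    """Return (T, weighted entropy) for the boundary T that minimizes
    |S1|/|S| * Entropy(S1) + |S2|/|S| * Entropy(S2)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no boundary between equal values
        t = (pairs[k - 1][0] + pairs[k][0]) / 2  # assumed: midpoint candidates
        left = [c for _, c in pairs[:k]]
        right = [c for _, c in pairs[k:]]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if best is None or info < best[1]:
            best = (t, info)
    return best

# Toy data: class "a" for small values, class "b" for large ones.
t, info = best_boundary([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
print(t, info)
```

Applying `best_boundary` again to the samples on each side of T, until a stopping criterion holds, gives the recursive version described on the slide.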
Interval Merge by χ² Analysis • Merging-based (bottom-up) vs. splitting-based methods • Merge: Find the best neighboring intervals and merge them to form larger intervals recursively • ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002] • Initially, each distinct value of a numerical attribute A is considered to be one interval • χ² tests are performed for every pair of adjacent intervals • Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions • This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc.)
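The merge loop can be sketched as below. This is a simplified illustration under stated assumptions, not Kerber's exact procedure: the stopping criterion is a maximum number of intervals (`max_intervals`), cells with an expected count of zero are skipped, and the names `chi2_adjacent` and `chimerge` are made up for this sketch.

```python
from collections import Counter

def chi2_adjacent(a, b, classes):
    """Chi-square statistic for the 2 x m table formed by the class
    counts (Counters) of two adjacent intervals."""
    total = sum(a.values()) + sum(b.values())
    chi2 = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            if expected > 0:  # assumption: skip empty classes
                chi2 += (row[c] - expected) ** 2 / expected
    return chi2

def chimerge(values, labels, max_intervals):
    """Return the lower bounds of the intervals left after merging."""
    classes = sorted(set(labels))
    counts = {}
    for v, c in zip(values, labels):
        counts.setdefault(v, Counter())[c] += 1
    # Start with one interval per distinct value: (lower bound, class counts).
    intervals = [(v, counts[v]) for v in sorted(counts)]
    while len(intervals) > max_intervals:
        scores = [chi2_adjacent(intervals[i][1], intervals[i + 1][1], classes)
                  for i in range(len(intervals) - 1)]
        i = scores.index(min(scores))  # most similar adjacent pair
        merged = (intervals[i][0], intervals[i][1] + intervals[i + 1][1])
        intervals[i:i + 2] = [merged]
    return [low for low, _ in intervals]

bounds = chimerge([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"], 2)
print(bounds)
```

On this toy data the pairs with identical class distributions (χ² = 0) are merged first, so the two surviving intervals start at 1 and 10.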
χ² Test • χ² (chi-square) test: $\chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}$ • The larger the χ² value, the more likely the variables are related • The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count • Correlation does not imply causality
Chi-Square Calculation: An Example • χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories) • It shows that like_science_fiction and play_chess are correlated in the group
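A chi-square calculation of this kind can be sketched as follows. The contingency table from the slide is not reproduced here, so the counts in `observed` are illustrative values chosen for this sketch (rows: likes science fiction or not; columns: plays chess or not), and `chi_square` is a hypothetical helper name.

```python
def chi_square(table):
    """Chi-square statistic of a contingency table given as a list of rows.
    Expected count of a cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed_count - expected) ** 2 / expected
    return stat

# Illustrative counts (assumed, not taken from the slide).
observed = [[250, 200],    # likes science fiction: plays chess / does not
            [50, 1000]]    # does not like it:      plays chess / does not
stat = chi_square(observed)
print(round(stat, 2))
```

For these counts the expected value of the top-left cell is 450 × 300 / 1500 = 90, far below the observed 250, which is why the statistic is large.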
Segmentation by Natural Partitioning • A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals. • If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals • If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals • If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
Example of 3-4-5 Rule (attribute: profit) • Step 1: Min = -$351, Low (i.e., 5%-tile) = -$159, High (i.e., 95%-tile) = $1,838, Max = $4,700 • Step 2: msd = 1,000, Low = -$1,000, High = $2,000 • Step 3: (-$1,000 - $2,000) is split into (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000) • Step 4: the range is adjusted to (-$400 - $5,000), with intervals (-$400 - 0), (0 - $1,000), ($1,000 - $2,000), ($2,000 - $5,000), each partitioned further: • (-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0) • (0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000) • ($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000) • ($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)
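A single level of the 3-4-5 rule can be sketched as below, and checked against the intervals of the profit example. The function name `partition_345` is hypothetical, the range is assumed to be already rounded to the most significant digit unit `msd`, and the 2-3-2 grouping for 7 distinct values is an assumption taken from the usual textbook formulation of the rule rather than from the slide.

```python
def partition_345(low, high, msd):
    """Split [low, high] into 3, 4 or 5 intervals according to how many
    distinct values it covers at the most significant digit."""
    distinct = round((high - low) / msd)
    if distinct == 7:
        # Assumed grouping for 7 values: widths of 2, 3 and 2 msd units.
        edges = [low, low + 2 * msd, low + 5 * msd, high]
    else:
        if distinct in (3, 6, 9):
            k = 3
        elif distinct in (2, 4, 8):
            k = 4
        elif distinct in (1, 5, 10):
            k = 5
        else:
            raise ValueError("3-4-5 rule does not cover %d values" % distinct)
        step = (high - low) / k
        edges = [low + i * step for i in range(k + 1)]
    return list(zip(edges[:-1], edges[1:]))

print(partition_345(-1000, 2000, 1000))  # Step 3 of the profit example
print(partition_345(-400, 0, 100))       # first sub-range in Step 4
print(partition_345(2000, 5000, 1000))   # last sub-range in Step 4
```

Rounding Low and High to the msd and adjusting the outer intervals to Min and Max (Steps 2 and 4) are left out of this sketch.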
Concept Hierarchy Generation for Categorical Data • Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts • street < city < state < country • Specification of a hierarchy for a set of values by explicit data grouping • {Urbana, Champaign, Chicago} < Illinois • Specification of only a partial set of attributes • E.g., only street < city, not others • Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values • E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation • Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set • The attribute with the most distinct values is placed at the lowest level of the hierarchy • Exceptions, e.g., weekday, month, quarter, year • Example (top to bottom): country (15 distinct values), province_or_state (365 distinct values), city (3567 distinct values), street (674,339 distinct values)
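The distinct-value heuristic amounts to a sort. The sketch below uses the counts from the example; in practice they would be computed from the data, and the variable names are made up for this illustration.

```python
# Number of distinct values per attribute, as in the example above.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3567,
    "street": 674339,
}

# Most distinct values first, i.e. lowest level of the hierarchy first.
hierarchy = sorted(distinct_counts, key=distinct_counts.get, reverse=True)
print(" < ".join(hierarchy))
```

The heuristic would order weekday (7 values) above month (12 values), which is the kind of exception the slide warns about.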
What is Cluster Analysis? • Cluster: a collection of data objects • Similar to one another within the same cluster • Dissimilar to the objects in other clusters • Cluster analysis • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters • Unsupervised learning: no predefined classes • Typical applications • As a stand-alone tool to get insight into data distribution • As a preprocessing step for other algorithms
Examples of Clustering Applications • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs • Land use: Identification of areas of similar land use in an earth observation database • Insurance: Identifying groups of motor insurance policy holders with a high average claim cost • City planning: Identifying groups of houses according to their house type, value, and geographical location • Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
Quality: What Is Good Clustering? • A good clustering method will produce high quality clusters with • high intra-class similarity • low inter-class similarity • The quality of a clustering result depends on both the similarity measure used by the method and its implementation • The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Measure the Quality of Clustering • Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, typically a metric: d(i, j) • There is a separate “quality” function that measures the “goodness” of a cluster. • The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables. • Weights should be associated with different variables based on applications and data semantics. • It is hard to define “similar enough” or “good enough” • The answer is typically highly subjective.
Requirements of Clustering in Data Mining • Scalability • Ability to deal with different types of attributes • Ability to handle dynamic data • Discovery of clusters with arbitrary shape • Minimal requirements for domain knowledge to determine input parameters • Able to deal with noise and outliers • Insensitive to order of input records • High dimensionality • Incorporation of user-specified constraints • Interpretability and usability
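Data Structures • Data matrix (two modes): n objects × p variables, with entry $x_{if}$ the value of variable f for object i • Dissimilarity matrix (one mode): n × n table of pairwise dissimilarities $d(i, j)$, with $d(i, i) = 0$ and $d(i, j) = d(j, i)$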
Type of data in clustering analysis • Interval-scaled variables • Binary variables • Nominal, ordinal, and ratio variables • Variables of mixed types
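Interval-valued variables • Standardize data • Calculate the mean absolute deviation: $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$ • where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$ • Calculate the standardized measurement (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$ • Using mean absolute deviation is more robust than using standard deviation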
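The standardization step for one variable can be sketched as follows; `standardize` is a hypothetical helper name, and the sketch assumes the values are not all equal (otherwise the mean absolute deviation is zero).

```python
def standardize(xs):
    """Standardize one variable using the mean absolute deviation
    instead of the standard deviation."""
    n = len(xs)
    m = sum(xs) / n                        # mean of the variable
    s = sum(abs(x - m) for x in xs) / n    # mean absolute deviation
    return [(x - m) / s for x in xs]

z = standardize([2, 4, 4, 6])
print(z)
```

Because deviations are not squared, a single outlier inflates the scale less than it would with the standard deviation, so the outlier's own standardized value stays large and easier to detect.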
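Similarity and Dissimilarity Between Objects • Distances are normally used to measure the similarity or dissimilarity between two data objects • Some popular ones include: Minkowski distance: $d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}$ • where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer • If q = 1, d is Manhattan distance: $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$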
Similarity and Dissimilarity Between Objects (Cont.) • If q = 2, d is Euclidean distance: $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$ • Properties • d(i,j) ≥ 0 • d(i,i) = 0 • d(i,j) = d(j,i) • d(i,j) ≤ d(i,k) + d(k,j) • Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
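The Minkowski family of distances fits in one function. This is a minimal sketch with a made-up function name; the two points are arbitrary example values.

```python
def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional objects.
    q = 1 gives Manhattan distance, q = 2 gives Euclidean distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

i = (1, 2)
j = (4, 6)
manhattan = minkowski(i, j, 1)   # |1 - 4| + |2 - 6|
euclidean = minkowski(i, j, 2)   # sqrt(3**2 + 4**2)
print(manhattan, euclidean)
```

A third point k can be used to check the triangle inequality d(i,j) ≤ d(i,k) + d(k,j) numerically for any q ≥ 1.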
Major Clustering Approaches (I) • Partitioning approach: • Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors • Typical methods: k-means, k-medoids, CLARANS • Hierarchical approach: • Create a hierarchical decomposition of the set of data (or objects) using some criterion • Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON • Density-based approach: • Based on connectivity and density functions • Typical methods: DBSCAN, OPTICS, DenClue
Major Clustering Approaches (II) • Grid-based approach: • Based on a multiple-level granularity structure • Typical methods: STING, WaveCluster, CLIQUE • Model-based: • A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model • Typical methods: EM, SOM, COBWEB • Frequent pattern-based: • Based on the analysis of frequent patterns • Typical methods: pCluster • User-guided or constraint-based: • Clustering by considering user-specified or application-specific constraints • Typical methods: COD (obstacles), constrained clustering
Typical Alternatives to Calculate the Distance between Clusters • Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min dis(tip, tjq) • Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max dis(tip, tjq) • Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg dis(tip, tjq) • Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj) • Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj), where a medoid is one chosen, centrally located object in the cluster
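The first four alternatives can be sketched for numeric points as below. Euclidean distance (`math.dist`, Python 3.8+) is assumed as the element-level distance, and the function names and toy clusters are made up for this illustration.

```python
from itertools import product
from math import dist
from statistics import mean

def single_link(ki, kj):
    """Smallest distance between an element of ki and an element of kj."""
    return min(dist(p, q) for p, q in product(ki, kj))

def complete_link(ki, kj):
    """Largest distance between an element of ki and an element of kj."""
    return max(dist(p, q) for p, q in product(ki, kj))

def average_link(ki, kj):
    """Average distance over all cross-cluster pairs."""
    return mean(dist(p, q) for p, q in product(ki, kj))

def centroid(k):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(mean(coords) for coords in zip(*k))

def centroid_link(ki, kj):
    """Distance between the two cluster centroids."""
    return dist(centroid(ki), centroid(kj))

ki = [(0, 0), (1, 0)]
kj = [(3, 0), (5, 0)]
print(single_link(ki, kj), complete_link(ki, kj),
      average_link(ki, kj), centroid_link(ki, kj))
```

The medoid variant is omitted because it first requires choosing a representative object in each cluster.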
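Centroid, Radius and Diameter of a Cluster (for numerical data sets) • Centroid: the “middle” of a cluster: $C_m = \frac{1}{N}\sum_{i=1}^{N} t_i$ • Radius: square root of the average squared distance from any point of the cluster to its centroid: $R_m = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (t_i - C_m)^2}$ • Diameter: square root of the average squared distance between all pairs of points in the cluster: $D_m = \sqrt{\frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{j=1}^{N} (t_i - t_j)^2}$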
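These three quantities can be computed directly for a small cluster. The sketch below assumes Euclidean distance and averages the diameter over all N(N-1) ordered pairs of distinct points; the function names and the four example points are made up for illustration.

```python
from itertools import permutations
from math import dist, sqrt

def centroid(points):
    """Coordinate-wise mean of the points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def radius(points):
    """Square root of the average squared distance to the centroid."""
    c = centroid(points)
    return sqrt(sum(dist(p, c) ** 2 for p in points) / len(points))

def diameter(points):
    """Square root of the average squared distance over all ordered
    pairs of distinct points (assumes at least two points)."""
    n = len(points)
    total = sum(dist(p, q) ** 2 for p, q in permutations(points, 2))
    return sqrt(total / (n * (n - 1)))

cluster = [(0, 0), (2, 0), (0, 2), (2, 2)]
print(centroid(cluster), radius(cluster), diameter(cluster))
```

For the corners of this square the centroid is (1, 1) and every point lies at squared distance 2 from it, so the radius is the square root of 2.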
Week 6-End • Assignment 2 (please share your ideas with your group) • Choose a dataset freely; my suggestion: http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html • Use Weka (http://www.cs.waikato.ac.nz/ml/weka/) to apply the different discretization strategies that you have learned in class (equi-width, equi-depth, entropy-based, merging, splitting, ...)
Week 6-End • Read: course textbook, Chapter 7