

  1. Knowledge Mining using Robust Clustering 13.02.2007 University of Joensuu Sami Äyrämö

  2. From Data Mining to Knowledge Mining

  3. Motivation • The need for scalable and robust KDD/DM methods is caused by the constant growth of digital data storages: • The capability to measure and store data from different objects has increased (industrial processes, human beings, markets, …) • Collecting data is cheap! • The consequence is that essential information may be lost in the oceans of data • The growth of knowledge lags behind the growth of digital data sets • Lots of data, high dimensions, missing values, outliers, … • Scalable and robust data analysis methods are needed • NB! Data owners should bear in mind that they have to collect and digitize the most important data • Not all collected digital data is important • Not all important data is necessarily collected

  4. Knowledge Discovery from Databases (KDD) Modified from: U. Fayyad, G. Piatetsky-Shapiro and P. Smyth, The KDD process for extracting useful knowledge from volumes of data, Communications of the ACM, Volume 39, Issue 11, November 1996.

  5. Knowledge Mining (KM) process • Enhanced understanding of the target data leads to success

  6. Applications

  7. KM motivation: Data-intensive environments (diagram labels: Control, Quality, Sensors, Temporal events) • E-commerce • Web mining • Marketing research • Process industry (e.g., paper machines) • Gene databases • Social sciences • Software quality analysis • Document databases • Image databases • Radar applications • Peace science • Telecommunication • Archeology • and many others…

  8. A sample application (diagram labels: Operator, Laborant, Quality, Process data, Customer, Control data, Manager, Feedback, ?)

  9. A real-world data set: 35% missing values!! Unknown number of errors!!

  10. Cluster analysis

  11. Cluster analysis • Clustering is the process of grouping the data into clusters so that objects within a cluster have high similarity to one another, but are very dissimilar to objects in other clusters • The core method of data mining • Refine, compress, simplify, …

  12. Challenges • Application domain → relevant variables → type of variables → distance measure • Non-informative variables, e.g., completely missing, constant, and duplicate variables • Outliers and noise (robustness) • Missing data (consistency) • Scaling of data (equal weighting of variables) • Number of clusters (may be ambiguous) • Choice of clustering method (underlying assumptions) • Non-uniqueness of results (the solution depends on the initial values) • Large data sets (scalability) • Cluster presentation (dimension reduction) • Interpretation (domain knowledge)

  13. What is the best distance measure? (Figure: objects/persons × features such as height, weight, eye color, eyeglasses, …, blood pressure; an outlier and a missing value are marked)

  14. Uniqueness – the problem of local minima (figure annotations: Initialization, Robust clustering, Bad?, But this is the global minimum?, Better?, Useful?, Role of K?) • Clustering methods do not necessarily inherit the robustness of their robust components!

  15. What is the correct number of clusters??

  16. Missing data

  17. Missing data • Roughly speaking, there are two types of missing values: • The true underlying value exists, but for one reason or another it is not entered into the data set • E.g., a human mistake or a measurement system failure • A true value may not exist in the real world • An unemployed person cannot have an employer • ”Does not exist” or ”do not know” may be the most important information (cf. Finnish parliamentary election 2007) • Hence, missing data may contain the most important piece of information (”nonignorable variables”) • In any case, missing data causes many challenges for cluster analysis • For instance, the dissimilarity matrix may be impossible to compute (what is a suitable distance measure for the pair of vectors (NaN, 3.2, NaN) and (1.7, NaN, 4.0)?)

  18. Missing completely at random (MCAR) • Probability that jth component of vector xi∈ℝp is missing is independent of any other known or missing value of xi • Any missing data strategy can be applied without producing significant bias on the data • The easiest type of missing data from the statistical point-of-view

  19. Missing at random (MAR) (example: x > 5 → y is missing) • The probability that the jth component of vector xi∈ℝp is missing may depend on the observed components of xi • The observed values of (xi)j form a random sample within the subclasses defined by the values of the other components (xi)k, k≠j • Some knowledge about the unavailable data values is retained

  20. Not missing at random (NMAR) • The probability that the jth component of vector xi∈ℝp is missing depends on its own (unobserved) value • E.g., the value of a measurement is out of some permissible range and might therefore be censored or replaced by an empty value (in this case the mechanism of missing data is understood) • Some knowledge about the unavailable data values • No general method exists for this kind of missing data mechanism (example: x exists only if x > 0) • A missing data value is actually an extreme outlier (any value between -∞ and ∞)

  21. Strategies for handling missing data • Fill in the missing values manually: • Time consuming (especially in DM applications) • Replace by a constant • Simple, but not usually recommended • Treat unknown values as a special category • Use only complete cases in the analysis • A lot of information might be lost • It may be more reasonable to discard a variable with a large fraction of missing values than to discard a large number of objects • Available-case strategy • Uses all available data in the computation • The statistical efficiency of basic estimates degrades in proportion to the fraction of missing data • Implemented easily by the projector technique, where all computation is projected to the available vector components (more details later; a small sketch follows below) • Extremely simple for the end user
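
A minimal sketch of the projector idea above, assuming NumPy and NaN-coded missing values; the function name available_case_distance is illustrative, not from the original slides.

```python
import numpy as np

def available_case_distance(x, y):
    """Euclidean distance projected onto the components that are
    available (non-NaN) in both vectors."""
    both = ~np.isnan(x) & ~np.isnan(y)       # components observed in both vectors
    if not both.any():
        return np.nan                        # nothing to compare on
    return np.linalg.norm(x[both] - y[both])

# The vector pair from the earlier missing-data slide has no shared
# available components, so even the available-case distance is undefined:
print(available_case_distance(np.array([np.nan, 3.2, np.nan]),
                              np.array([1.7, np.nan, 4.0])))   # -> nan
```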

  22. Strategies for handling missing data: Imputation • Imputation • Replace missing values with reasonable estimates (mean, median, mode, kNN) based on the available data values (see the sketch below) • Simple approach • Mean imputation works when the sample is drawn from a unimodal normal distribution and the data is missing at random • Reduces the inherent variability in the data • Outliers may seriously distort mean imputation (the median might be better) • Simple imputation methods are not generally suitable for data clustering • The potential class structure is not considered by simple imputation methods • Shrinking the within-cluster variation may disturb cluster validation
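
A minimal sketch of simple column-wise median imputation (NumPy assumed, NaN-coded missing values); as the slide notes, it ignores any cluster structure in the data.

```python
import numpy as np

def median_impute(X):
    """Replace each NaN by the median of the available values in its column."""
    X = X.astype(float).copy()
    col_medians = np.nanmedian(X, axis=0)      # medians over the available values
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_medians[cols]
    return X
```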

  23. Strategies for handling missing data: Imputation • Hot deck imputation • Replaces the missing values with values based on an estimated missing data distribution – the distribution is estimated from the available values • A sample EM-like hot deck algorithm (a sketch follows below): • 1. Cluster the data using the complete cases • 2. Assign incomplete cases to the nearest clusters (using an appropriate distance measure) • 3. Impute missing values using statistical within-cluster summaries • 4. Cluster the whole data (both complete and imputed cases) • 5. If the clustering changed, repeat from Step 3 • Cold deck imputation uses another, distinct data set for the substitution of missing values
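
A hedged sketch of the EM-like hot deck steps listed above. It uses k-means and within-cluster medians purely for illustration (the slide leaves the clustering method and the summary statistic open); NumPy and scikit-learn are assumed, and NaN marks a missing value.

```python
import numpy as np
from sklearn.cluster import KMeans

def hot_deck_impute(X, k, max_rounds=10):
    """EM-like hot deck imputation (illustrative sketch; needs >= k complete rows)."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    complete = ~miss.any(axis=1)
    centers = KMeans(n_clusters=k, n_init=10).fit(X[complete]).cluster_centers_  # Step 1
    prev_labels = None
    for _ in range(max_rounds):
        # Step 2: assign every case to the nearest center on its available components
        labels = np.array([np.linalg.norm(centers[:, ~m] - x[~m], axis=1).argmin()
                           for x, m in zip(X, miss)])
        # Step 3: impute missing entries from within-cluster medians of observed values
        observed = np.where(miss, np.nan, X)
        for c in range(k):
            med = np.nanmedian(observed[labels == c], axis=0)
            med = np.where(np.isnan(med), np.nanmedian(observed, axis=0), med)  # fallback
            for i in np.where(labels == c)[0]:
                X[i, miss[i]] = med[miss[i]]
        # Step 4: re-cluster the whole (imputed) data
        km = KMeans(n_clusters=k, n_init=10).fit(X)
        centers = km.cluster_centers_
        # Step 5: stop when the clustering no longer changes
        if prev_labels is not None and np.array_equal(km.labels_, prev_labels):
            break
        prev_labels = km.labels_
    return X
```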

  24. Strategies for handling missing data: Imputation • Nearest-neighbor imputation (a sketch follows below) • Identifies the candidates that are most similar to the object with missing data (a candidate has to be complete in the required variables) and substitutes the missing values with the candidate's values • An appropriate distance measure is needed • E.g., the l2- or l∞-norm on the available variables (presumes standardization of the variables) • An upper limit d0 may be given for the distance in order to avoid the occurrence of strange objects (if the distance from the incomplete object to its closest complete-case neighbor is greater than d0, the imputation is omitted) • Prediction-based imputation • Uses classification (nominal variables) and regression (continuous variables) models based on the other variables to predict missing values • Multiple imputation • A collection of imputed data sets is produced and analyzed using standard methods • The final result is obtained by combining the characterizations of the imputed data sets (computationally expensive)
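
A sketch of nearest-neighbor imputation with the distance cap d0 described above (assumptions: NumPy, NaN-coded missing values, l2 norm on the shared available components, data already standardized).

```python
import numpy as np

def nn_impute(X, d0=np.inf):
    """Impute each incomplete row from its closest complete-case neighbor,
    measured on the row's available components; skip rows whose nearest
    complete neighbor is farther away than d0."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    donors = X[~miss.any(axis=1)]                     # complete cases only
    for i in np.where(miss.any(axis=1))[0]:
        avail = ~miss[i]
        if len(donors) == 0 or not avail.any():
            continue                                  # nothing to match on
        d = np.linalg.norm(donors[:, avail] - X[i, avail], axis=1)
        j = d.argmin()
        if d[j] <= d0:                                # omit imputation for "strange" objects
            X[i, miss[i]] = donors[j, miss[i]]
    return X
```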

  25. Robust clustering

  26. Elements of Data Clustering • Data presentation • Choice of objects • Choice of variables • What to cluster: objects or variables • Normalization/weighting of variables • Choice of (dis)similarity measures! • Choice of clustering criterion (objective function) • Choice of missing data strategy • Algorithms and computer implementation (and their reliability, e.g., convergence) • Number of clusters • Interpretation of results

  27. Robust clustering (figure: a sample data set, quality attributes vs. time, with a missing value and a possible outlier marked) • Data sets produced by industrial processes are often sparse and contain outliers that are caused, e.g., by human errors and measurement failures (e.g., laboratory measurements) • Traditional clustering methods (e.g., k-means, hierarchical methods) are not appropriate for comprehensive cluster analysis by themselves, because of their sensitivity to such defects • Robust methods complement the traditional methods because of their • tolerance to incomplete, sparse and dirty data • usability (less manual trimming of data by the end user)

  28. Robust K-spatialmedians (α = 2, q = 2 → K-means; α = 1, q = 2 → K-spatialmedians; α = 1, q = 1 → K-coordinatewise-medians) • K-prototype algorithm (a generic sketch follows below) • 1. Initialize the cluster prototypes (HOW?) • 2. Assign each data point to the closest cluster prototype (DISTANCE MEASURE?) • 3. Compute the new estimates for the cluster prototypes (ESTIMATOR?) • 4. Termination: stop if the termination criteria are satisfied (usually no changes in the assignments I) • T. Kärkkäinen and S. Äyrämö, Robust clustering methods for erroneous and incomplete data, In Proc. of 5th Conference on Data Mining, September 2004.
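
A generic sketch of the K-prototype loop above for complete data (NumPy assumed; random initialization and Euclidean assignment are placeholders for the open questions on the slide). Plugging in the coordinatewise median gives K-coordinatewise-medians; plugging in a spatial-median routine such as the Weiszfeld sketch later in this transcript gives K-spatialmedians.

```python
import numpy as np

def k_prototypes(X, k, estimator=lambda pts: np.median(pts, axis=0),
                 max_iter=100, seed=None):
    """Generic K-prototype clustering loop (illustrative, complete data only)."""
    rng = np.random.default_rng(seed)
    prototypes = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # 1. initialize
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # 2. assign each point to its closest prototype (Euclidean distance here)
        dists = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                     # 4. no change in the assignments
        labels = new_labels
        # 3. re-estimate each prototype with the chosen (robust) location estimator
        for c in range(k):
            if (labels == c).any():
                prototypes[c] = estimator(X[labels == c])
    return prototypes, labels

# K-means variant: k_prototypes(X, 3, estimator=lambda pts: pts.mean(axis=0))
```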

  29. What is robustness? • “Robustness signifies insensitivity to small deviations from the assumptions”, P. Huber, Robust Statistics, 1981 • A small deviation means either gross errors (outliers) in a minor part of the data or small errors (noise) in a large part of the data • Statistical methods and tools are mainly based on classical statistics (mean, variance) • Classical statistics are based on the assumption that the data originate from a normal distribution • This often makes the results useless when the normality assumption is not satisfied • Classical statistics are computationally straightforward • Robust statistics are more intractable from a computational point of view

  30. Measures of robustness • Breakdown point: ”the smallest fraction of contamination that can cause the estimator T to take values arbitrarily far from T(X)” • The BP of the sample mean is 1/n, while that of the coordinatewise and spatial medians is 1/2 (a small numeric illustration follows below) • Influence function • Gross-error sensitivity • Local-shift sensitivity
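
A tiny numeric illustration of the breakdown-point claim above (NumPy assumed): contaminating a single observation, i.e. a fraction 1/n of the sample, moves the mean arbitrarily far, while the median barely reacts.

```python
import numpy as np

x = np.array([4.1, 4.3, 3.9, 4.0, 4.2])
x_bad = x.copy()
x_bad[0] = 1e6                           # one gross error (a fraction 1/n of the data)

print(np.mean(x), np.mean(x_bad))        # 4.1 vs ~200003.3 -> the mean breaks down
print(np.median(x), np.median(x_bad))    # 4.1 vs 4.2       -> the median barely moves
```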

  31. Breakdown of estimates 1

  32. Breakdown of estimates 2

  33. Breakdown of estimates 3

  34. Robust location estimators • The trimmed mean • Requires parameters (distance or fraction of data) • Difficult in high dimensions • The (coordinatewise) median • An attractive robust alternative in one or two dimensions • Translation and scale equivariant • The spatial median • An attractive robust alternative in high dimensions • Computationally challenging – no explicit (closed-form) solution exists for the optimization problem • Orthogonally equivariant • Presumes equal weighting of the variables – sensitive to the scaling of the variables • Many others: Oja median, …

  35. Robust estimators - desirable properties • High efficiency: Nearly equal efficiency with maximum likelihood estimators under ideal parametric models. • Qualitative robustness: Estimates are influenced only slightly by small deviations from the assumed model. • Quantitative robustness: Estimates are protected against large amounts of contamination or single gross errors (high breakdown point). • Local-shift sensitivity: Smooth reaction to rounding and grouping. • Rejection point: Separation between outliers and the bulk of the data. • Fisher consistency: Estimation of the right quantity (for parametric models). • Affine equivariance: The solution should be independent of the scales of the variables (multivariate case). • Computational practicality: The solution should be obtainable in a practical amount of computing time, even in high dimensions and/or with large amounts of data.

  36. Problems of the coordinatewise median 1 • Infinite local-shift sensitivity at the median of the distribution (figures: 3D-sample, 4D-sample)

  37. Problems of the coordinatewise median 2 • The coordinatewise median does not always lie in the convex hull of the data in ℝp when p ≥ 3 • Not orthogonally equivariant – sensitive to rotations of the data

  38. The spatial median • In statistics: a robust estimate of location (M-estimate). A.k.a. the Fermat-Weber problem or the single-facility location problem (used, e.g., in transportation cost minimization and networking problems) • Good statistical properties on multivariate samples • More robust than the sample mean (breakdown point 50%) • An inherently multivariate estimator for continuous variables • The coordinatewise sample median is sensitive to so-called ”inliers” and works better with discrete variables • Rotation invariant • In the univariate case the spatial median coincides with the coordinatewise median • Its statistical efficiency approaches that of the sample mean as p approaches infinity • Treated as a nonsmooth convex optimization problem • A unique solution exists when the data are not collinear • SOR-Weiszfeld type of algorithm with treatment for missing data • Kärkkäinen, T. and Äyrämö, S., On Computation of Spatial Median for Robust Data Mining, In Proceedings of EUROGEN 2005, September 2005, Germany. • Valkonen, T., Convergence of a SOR-Weiszfeld type algorithm for incomplete data sets, Numerical Functional Analysis and Optimization, 27 (7–8), Dec. 2006.

  39. The spatial median: Problem of spatial median • Nonsmooth • Convex • Unique solution if data is not collinear • In univariate case spatial median coincides with the coordinatewise median
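
The objective on this slide appears only as an image in the transcript; in the standard formulation the spatial median of a sample x1, …, xn ∈ ℝp is the minimizer of the sum of Euclidean distances:

```latex
\hat{u} = \operatorname*{arg\,min}_{u \in \mathbb{R}^p} \; \sum_{i=1}^{n} \| x_i - u \|_2
```

This objective is convex and nonsmooth at the data points, in line with the properties listed above.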

  40. Smooth problem formulations for incomplete data • Modified gradient • ε-approximating problem (a reconstruction follows below)
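
The formulas on this slide are images in the transcript. A standard ε-smoothed reformulation with missing data handled by projectors (following the Kärkkäinen–Äyrämö approach cited earlier, though the exact form on the slide may differ) is:

```latex
\min_{u \in \mathbb{R}^p} \; \sum_{i=1}^{n} \sqrt{\, \| P_i (x_i - u) \|_2^{2} + \varepsilon^{2} \,}
```

where P_i is the diagonal projector selecting the available components of x_i and ε > 0 is a small smoothing parameter; the modified gradient is obtained by differentiating this smooth sum.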

  41. Computation of the spatial median • Classical direct search optimization methods (e.g., Nelder-Mead) • Accurate, since not based on gradients • Scale poorly w.r.t. the number of dimensions • Require a large amount of computing resources • Gradient-based methods • Presume differentiability • Inaccurate • Require a lot of resources to compute • Iterative methods (e.g., Weiszfeld and its modifications) • Fast • Good scalability to high dimensions

  42. SOR accelerated Weiszfeld algorithm with missing data treatment for the approximated spatial median problem 1 • ε-approximation problem reformulation • The basic iteration is based on the first-order necessary conditions for a stationary point of the cost function:

  43. SOR accelerated Weiszfeld algorithm with missing data treatment for the approximated spatial median problem 2 • Acceleration using a SOR-type step-size factor • On incomplete data sets the operations can be performed correspondingly by projecting all computation onto the available values! (a sketch follows below)
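
A hedged sketch of a SOR-accelerated Weiszfeld-type iteration for the ε-approximated spatial median with projection onto the available components (NumPy assumed, NaN marking missing values; this illustrates the idea and is not the authors' exact algorithm).

```python
import numpy as np

def sor_weiszfeld(X, omega=1.5, eps=1e-6, tol=1e-8, max_iter=500):
    """Weiszfeld-type iteration for the eps-smoothed spatial median of the rows of X.
    Missing values (NaN) are handled by projecting every operation onto the
    available components; omega in (1, 2) is the SOR over-relaxation factor."""
    avail = (~np.isnan(X)).astype(float)          # projector: 1 where a value is observed
    X0 = np.where(np.isnan(X), 0.0, X)            # zero-fill (masked out by the projector)
    u = X0.sum(axis=0) / np.maximum(avail.sum(axis=0), 1.0)   # available-case mean as a start
    for _ in range(max_iter):
        diffs = (X0 - u) * avail                  # projected residuals
        d = np.sqrt((diffs ** 2).sum(axis=1) + eps ** 2)      # smoothed projected distances
        w = avail / d[:, None]                    # per-component Weiszfeld weights
        u_w = (w * X0).sum(axis=0) / np.maximum(w.sum(axis=0), 1e-12)
        u_new = u + omega * (u_w - u)             # SOR-type over-relaxation step
        if np.linalg.norm(u_new - u) < tol:
            return u_new
        u = u_new
    return u
```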

  44. SOR accelerated Weiszfeld algorithm - restriction of nonsmooth data points • In this variant nonsmooth data points are removed from the computations. • Basic iteration is based on the first-order necessary conditions for a stationary point of the cost function:

  45. Computational efficiency • The time complexity of one SOR/ASSOR iteration corresponds to two/three cost function evaluations (O(n)) • The number of iterations of the SOR-based methods is markedly smaller than the number of function evaluations of the classical optimization methods • CG with the ε-approximation problem formulation gives inaccurate solutions

  46. Initialization methods • Needed to avoid poor locally optimal solutions • Random • Practical only for small data sets • Distance optimization • Maximizes between-cluster distances • Sensitive towards outliers • Density estimation • Tries to detect dense regions in the data by clustering several subsamples • Easily generalized to robust clustering • Trimming

  47. Estimation of the number of clusters • Silhouettes • L1-data-depth based robust methods • Trimmed silhouettes • Concentrate on the tight cluster cores and the between-cluster distances • Trim away the most distant 50% of the cluster points (with respect to the spatial median) • Compute the average distance a(Ck) from the remaining most central points to the closest cluster prototype (measures the tightness of the core of the cluster) • Similarly, compute the average distance b(Ck) from the most central points to the second-closest cluster prototype (measures the between-cluster distance)

  48. Trimmed silhouettes • Trimmed silhouette width for cluster k is computed as: • Robust silhouette coefficient is defined as:
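
The formulas on this slide are images in the transcript. By analogy with the ordinary silhouette width, and using the trimmed averages a(Ck) and b(Ck) defined on the previous slide, a plausible reconstruction is:

```latex
s(C_k) = \frac{b(C_k) - a(C_k)}{\max\{\, a(C_k),\ b(C_k) \,\}},
\qquad
S = \frac{1}{K} \sum_{k=1}^{K} s(C_k)
```

with the robust silhouette coefficient S taken as the average of the trimmed silhouette widths over the K clusters.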

  49. Applications

  50. Applications (figure: color quantization example; labels: 24-bit, 3-bit, K=5)
