Robust Space Transformations for Distance-based Operations in KDD

Robust Space Transformations for Distance-based Operations Advisor： Dr. Hsu Graduate：Min-Hong Lin IDSL seminar

Outline • Motivation • Objective • Introduction • Donoho-Stahel Estimator • Key Properties of The DSE • k-D Subsampling Algorithm • k-D Randomized Algorithm • Experimental Evaluation • Conclusions • Personal Opinion IDSL

Motivation • For many KDD operations, there is an underlying k-D data space • Each tuple is represented as a point in the space • We may get unintuitive results if an inappropriate space is used • In the presence of differing scales, variability, correlation, and outliers IDSL

Objective • Propose a robust space transformation • Develop algorithms and evaluate how well they perform empirically IDSL

Introduction • Example 1(Nearest Neighbor Search) • Systolic blood pressure(100-160 mm) • Body temperature(37 degrees Celsius) • Age(20-50 years) • The results are dominated by blood pressure • Consider query point (120,37,35) • Using Euclidean distance ,the point (120,40,35) is nearer to the query point than (130,37,35) is IDSL

Fix to The Problem • Weight • Normalization • Standardization • But these transformation do not take into account • Outliers • Correlation between attributes IDSL

Contributions of This Paper • Propose a robust space transformation called the Donoho-Stahel estimator (DSE) • Propose a version of the DSE for high-dimensional database (subsampling) • Develop a new algorithm(Hybrid-random algorithm) for computing the DES efficiently IDSL

Related Work • Space transformations are from the class of distance-preserving transformations • Principal component analysis(PCA) is useful for data reduction • Many clustering algorithms are distance-based or density-based • Outlier detection has received considerable attention in recent years IDSL

Donoho-Stahel Estimator • The DSE is a robust multivariate estimator of location and scatter • It is an “outlyingness-weighted” mean and covariance • It also possesses statistical properties such as affine equivariance • DSE is easier to computer, scales better, and has much less bias IDSL

The DSE Fixed-angle Algorithm for 2-D IDSL

The DSE Fixed-angle Algorithm for 2-D • Robustness is achieved by first identifying outlying points, and then downweighting their influence • Step 1 computes the degree of outlyingness of the point with respect to  • Step 2 computes the maximum degree of outlyingness over all possible ’s • In step 3, if this maximum degree for a point is to high, the influence of this point is weakened • Finally, the location center and the covariance matrix are computed IDSL

Key Properties of The DSE • Euclidean property • Stability property IDSL

Euclidean Property • It says that while inappropriate in the original space, the Euclidean distance function becomes reasonable in the DSE transformed space IDSL

Stability Property • Which says that in spite of frequent updates, the estimator does not: • Change much • Lose its usefulness • Require re-computation IDSL

Stability Experiment • Outlier detection • Used the old estimator to transform the space for Dnew and then found all the outliers in Dnew • Used the updated estimator to transform the space for Dnew and then found all the outliers in Dnew • Use standard precision and recall to measure the difference between the two sets of detected outliers IDSL

Results • The DSE transformation is stable IDSL

Complexity of the Fixed-angle Algorithm • How to compute efficiently, for k > 2 dimensions? • Complexity of the Fixed-angle Algorithm • O(ak-1kN + k2N)=O(ak-1kN ) • Impractical for large values of a and k • There is an exhaustive enumeration of ’s IDSL

Projects of Points Onto a Line IDSL

Intuition Behind The Subsampling Algorithm • Use orthogonal lines increases chance of detecting outliers • Because many outliers are likely to stand out after the orthogonal projection • In applying subsampling, our goal is to use lines orthogonal to the axes of an ellipse(or ellipsoid in k-D) • However, the parameters of the ellipsoid are unknow • There will be too many points and dimensions IDSL

k-D Subsampling Algorithm IDSL

Details of the Subsampling Algorithm • How to compute m? • A subsample is good if all k-1 points are within the ellipsoid • Let  be the fraction of points outsides the ellipsoid.(typically 0.01~0.5) • m can be the smallest number of subsamples such that there is at least a 95% probability that we get at least one good subsample out of the m subsamples • We can determine a base vale of m by solving the following inequality • 1-(1-(1- )k-1)m>=0.95 IDSL

Complexity of the Subsampling Algorithm • O(mk3 + k2N) • The mk3complexity factor is still costly when the number of subsamples m is large(i.e., for a high quality estimator) • How the k-D DSE estimator can be computed more efficiently? IDSL

k-D Randomized Algorithms • Pure-random Algorithm • Hybrid-random Algorithm • Combines part of the Pure-random algorithm with part of the Subsampling algorithm IDSL

Pure-random Algorithm • Fixed-angle algorithm need ak-1 projection unit vectors examined • It is possible randomly selects r projections to examine • O(rkN + k2N)=O(rkN) IDSL

Hybrid-random Algorithm • The Pure-random algorithm probes the k-D space blindly • The value of r may need to be high for acceptable quality • Whether random draws project vectors can be done more intelligently IDSL

IDSL

Hybrid-random Algorithm • Steps 1 to 3 use the Subsampling algorithm to find some initial projection vectors and keep them in S • In each iteration of step 4, a new random projection vector is generated in such a way that it stays clear of existing projection vectors • O(mk3 + rkN) IDSL

Experimental Evaluation • Experimental Setup • Distance-based outlier detection operation • Use precision and recall to compare the results • An 855-record dataset consisting of 1995-96 National Hockey League(NHL) player performance statistics • We created synthetic datasets containing up to 100,100 tuples-whose distribution mirrored that of the base dataset • Attributes are goals, assists, penalty minutes, shots on goal, and games played IDSL

Usefulness of Donoho-Stahel Transformation • The range for penalty-minutes was [0,335], and the range for goals-scored was [0,69] IDSL

Internal Parameters of the Algorithms IDSL

Comparison IDSL

Scalability in Dimensionality and Dataset Size IDSL

Conclusions • Many type of distance-based KDD operations tend to be less meaningful • When no attention is paid to scale, variability, correlation, and outliers in the underlying data • An appropriate space is one that • Preserves the Euclidean property • Is stable of updates • The Hybrid-random algorithm is an effective and efficient algorithm to be a robust estimator IDSL

Personal Opinion • Data preprocessing can help improve the quality of the data and the mining results • Space transformation may improve the accuracy and efficiency of mining algorithms involving distance measurements IDSL

Robust Space Transformations for Distance-based Operations in KDD

Robust Space Transformations for Distance-based Operations in KDD

Presentation Transcript

Space-Based Network Centric Operations Research

Transformations of Figures through Space!

Haptic -motor transformations for the control of fingertip distance

Logic Programming Based Model Transformations

Phylogenetics - Distance-Based Methods

Distance-based methods

Distance-Based GDPs

Space-based Architecture for Climate

Verbal communication is essential for Space Operations

European Space Operations Centre

Operator-Based Distance for GP

Space Concept of Operations

Cost-Based Transformations

European Space Operations Centre

Cost based transformations

Web Based Education for Distance Learning

Video-Based Distance Learning

Importance of operations management | Distance Management Courses | Distance MBA in Operations | MIT School of Distance

Distance-based phylogeny estimation

Cost-Based Transformations

Cost based transformations