This seminar outlines the motivation, objective, and methodology of proposing a robust space transformation method called the Donoho-Stahel Estimator (DSE) for distance-based operations in Knowledge Discovery in Databases (KDD). The DSE aims to address issues related to scaling, variability, outliers, and correlation in data spaces. Key properties, algorithms, and the experimental evaluation of the DSE are discussed, highlighting its stability and efficiency. The paper also introduces the k-D Subsampling and k-D Randomized Algorithms for high-dimensional databases. Overall, the study aims to enhance the accuracy and reliability of distance-based operations through innovative space transformations.
Robust Space Transformations for Distance-based Operations Advisor: Dr. Hsu Graduate: Min-Hong Lin IDSL seminar
Outline • Motivation • Objective • Introduction • Donoho-Stahel Estimator • Key Properties of The DSE • k-D Subsampling Algorithm • k-D Randomized Algorithm • Experimental Evaluation • Conclusions • Personal Opinion IDSL
Motivation • For many KDD operations, there is an underlying k-D data space • Each tuple is represented as a point in the space • We may get unintuitive results if an inappropriate space is used • In the presence of differing scales, variability, correlation, and outliers IDSL
Objective • Propose a robust space transformation • Develop algorithms and evaluate how well they perform empirically IDSL
Introduction • Example 1 (Nearest Neighbor Search) • Systolic blood pressure (100-160 mm Hg) • Body temperature (around 37 degrees Celsius) • Age (20-50 years) • The results are dominated by blood pressure, the attribute with the largest numeric range • Consider the query point (120,37,35) • Using Euclidean distance, the point (120,40,35) is nearer to the query point than (130,37,35) is (see the numeric check below) IDSL
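A minimal numeric check of this example, using the slide's values (not from the paper):

```python
import numpy as np

# (systolic blood pressure, body temperature, age)
query = np.array([120.0, 37.0, 35.0])
p1 = np.array([120.0, 40.0, 35.0])   # temperature differs by 3 degrees
p2 = np.array([130.0, 37.0, 35.0])   # blood pressure differs by 10 mm Hg

print(np.linalg.norm(query - p1))    # 3.0
print(np.linalg.norm(query - p2))    # 10.0
# p1 appears "nearer" only because blood pressure spans the largest numeric range,
# even though a 3-degree temperature difference is clinically far more significant.
```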
Fixes to the Problem • Weighting • Normalization • Standardization • But these transformations do not take into account (see the sketch below): • Outliers • Correlation between attributes IDSL
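A minimal sketch (not from the paper) of why plain standardization falls short; the data values below are made up for illustration:

```python
import numpy as np

ages = np.array([20, 25, 30, 35, 40, 45, 50, 999.0])   # 999 is a data-entry outlier

# z-score standardization uses the (non-robust) mean and standard deviation
z = (ages - ages.mean()) / ages.std()
print(np.round(z, 2))
# The genuine values collapse into a narrow z-range because the outlier inflates
# both statistics; and because each attribute is scaled independently, any
# correlation between attributes is ignored as well.
```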
Contributions of This Paper • Propose a robust space transformation called the Donoho-Stahel estimator (DSE) • Propose a version of the DSE for high-dimensional databases (the Subsampling algorithm) • Develop a new algorithm (the Hybrid-random algorithm) for computing the DSE efficiently IDSL
Related Work • Space transformations are from the class of distance-preserving transformations • Principal component analysis (PCA) is useful for data reduction • Many clustering algorithms are distance-based or density-based • Outlier detection has received considerable attention in recent years IDSL
Donoho-Stahel Estimator • The DSE is a robust multivariate estimator of location and scatter • It is an "outlyingness-weighted" mean and covariance • It also possesses statistical properties such as affine equivariance • Compared with other robust estimators, the DSE is easier to compute, scales better, and has much less bias IDSL
The DSE Fixed-angle Algorithm for 2-D • Robustness is achieved by first identifying outlying points, and then downweighting their influence • Step 1 computes the degree of outlyingness of each point with respect to a given projection direction • Step 2 computes the maximum degree of outlyingness over all candidate projection directions • In step 3, if this maximum degree for a point is too high, the influence of this point is weakened • Finally, the location center and the covariance matrix are computed from the downweighted points (a sketch follows) IDSL
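A minimal sketch of the 2-D fixed-angle idea in Python. The hard-rejection weight and the cutoff value are illustrative assumptions; the paper's exact weight function may differ:

```python
import numpy as np

def dse_2d(X, n_angles=180, cutoff=3.0):
    """X: (N, 2) data matrix. Returns a robust location vector and scatter matrix."""
    thetas = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    directions = np.column_stack([np.cos(thetas), np.sin(thetas)])  # fixed-angle projections

    max_out = np.zeros(len(X))
    for u in directions:                        # steps 1-2: outlyingness per direction,
        proj = X @ u                            # maximized over all directions
        med = np.median(proj)
        mad = np.median(np.abs(proj - med)) + 1e-12
        max_out = np.maximum(max_out, np.abs(proj - med) / mad)

    w = (max_out <= cutoff).astype(float)       # step 3: downweight highly outlying points
    w /= w.sum()

    loc = w @ X                                 # step 4: outlyingness-weighted mean ...
    centered = X - loc
    scatter = (centered * w[:, None]).T @ centered   # ... and covariance
    return loc, scatter
```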
Key Properties of The DSE • Euclidean property • Stability property IDSL
Euclidean Property • While the Euclidean distance function may be inappropriate in the original space, it becomes reasonable in the DSE-transformed space (see the sketch below) IDSL
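A minimal sketch of how such a transformation is typically applied, assuming the `loc` and `scatter` outputs of the previous sketch: points are centered and whitened with the robust scatter, so plain Euclidean distance in the new space behaves like a robust Mahalanobis-style distance in the original one:

```python
import numpy as np

def dse_transform(X, loc, scatter):
    """Map points into the DSE-transformed space."""
    vals, vecs = np.linalg.eigh(scatter)                       # scatter = V diag(vals) V^T
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T    # scatter^(-1/2)
    return (X - loc) @ inv_sqrt
```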
Stability Property • In spite of frequent updates, the estimator does not: • Change much • Lose its usefulness • Require re-computation IDSL
Stability Experiment • Outlier detection • Used the old estimator to transform the space for Dnew and then found all the outliers in Dnew • Used the updated estimator to transform the space for Dnew and then found all the outliers in Dnew • Used standard precision and recall to measure the difference between the two sets of detected outliers (see the sketch below) IDSL
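A minimal sketch of the comparison step, assuming the two sets of detected outliers are available as Python sets of record identifiers (the names below are hypothetical):

```python
def precision_recall(from_updated, from_old):
    """Precision/recall of the updated-estimator outliers against the old-estimator ones."""
    tp = len(from_updated & from_old)
    precision = tp / len(from_updated) if from_updated else 1.0
    recall = tp / len(from_old) if from_old else 1.0
    return precision, recall

# Identical outlier sets give precision = recall = 1.0, i.e. a perfectly stable estimator.
```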
Results • The DSE transformation is stable IDSL
Complexity of the Fixed-angle Algorithm • How can the DSE be computed efficiently for k > 2 dimensions? • Complexity of the Fixed-angle algorithm: O(a^(k-1)·kN + k^2·N) = O(a^(k-1)·kN), where a is the number of fixed angles examined per dimension • Impractical for large values of a and k • The algorithm exhaustively enumerates all candidate projection directions IDSL
Intuition Behind The Subsampling Algorithm • Using orthogonal lines increases the chance of detecting outliers • Because many outliers are likely to stand out after the orthogonal projection • In applying subsampling, the goal is to use lines orthogonal to the axes of an ellipse (or ellipsoid in k-D) • However, the parameters of the ellipsoid are unknown • And there are too many points and dimensions to estimate them exhaustively IDSL
Details of the Subsampling Algorithm • How to compute the number of subsamples m? • A subsample is good if all k-1 points are within the ellipsoid • Let δ be the fraction of points outside the ellipsoid (typically 0.01~0.5) • m can be the smallest number of subsamples such that there is at least a 95% probability that we get at least one good subsample out of the m subsamples • We can determine a base value of m by solving the following inequality • 1 - (1 - (1-δ)^(k-1))^m >= 0.95 (a sketch of this computation follows) IDSL
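A minimal sketch of solving this inequality for a base value of m (delta is the assumed fraction of points outside the ellipsoid):

```python
import math

def min_subsamples(k, delta, target=0.95):
    p_good = (1.0 - delta) ** (k - 1)      # P(one subsample of k-1 points is "good")
    # smallest m with 1 - (1 - p_good)^m >= target
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_good))

print(min_subsamples(k=5, delta=0.30))     # base m for a 5-D data set with 30% of points outside
```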
Complexity of the Subsampling Algorithm • O(mk^3 + k^2·N) • The mk^3 complexity factor is still costly when the number of subsamples m is large (i.e., for a high-quality estimator) • How can the k-D DSE estimator be computed more efficiently? IDSL
k-D Randomized Algorithms • Pure-random Algorithm • Hybrid-random Algorithm • Combines part of the Pure-random algorithm with part of the Subsampling algorithm IDSL
Pure-random Algorithm • The Fixed-angle algorithm needs a^(k-1) projection unit vectors to be examined • Instead, it is possible to randomly select r projection vectors to examine (see the sketch below) • O(rkN + k^2·N) = O(rkN) IDSL
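A minimal sketch of the random-projection step: draw r unit vectors uniformly on the k-D sphere and reuse the same outlyingness computation as the fixed-angle algorithm:

```python
import numpy as np

def random_directions(r, k, seed=0):
    """Return r random unit projection vectors in k dimensions."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((r, k))                        # Gaussian draws, then normalize,
    return V / np.linalg.norm(V, axis=1, keepdims=True)    # give directions uniform on the sphere
```

Feeding these r directions into the outlyingness loop is what gives the O(rkN) cost quoted above.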
Hybrid-random Algorithm • The Pure-random algorithm probes the k-D space blindly • The value of r may need to be high for acceptable quality • Can the random draws of projection vectors be done more intelligently? IDSL
Hybrid-random Algorithm • Steps 1 to 3 use the Subsampling algorithm to find some initial projection vectors and keep them in S • In each iteration of step 4, a new random projection vector is generated in such a way that it stays clear of the existing projection vectors (a sketch follows) • O(mk^3 + rkN) IDSL
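A minimal sketch of step 4's "stay clear of existing vectors" idea, using a simple cosine-similarity threshold as an illustrative criterion (the paper's exact rule may differ):

```python
import numpy as np

def add_spread_out_directions(S, r, k, max_cos=0.95, seed=0):
    """S: (m, k) unit vectors from the Subsampling phase; append up to r new ones."""
    rng = np.random.default_rng(seed)
    dirs = list(S)
    attempts = 0
    while len(dirs) < len(S) + r and attempts < 100 * r:
        attempts += 1
        v = rng.standard_normal(k)
        v /= np.linalg.norm(v)
        # accept the candidate only if it is not too close to any existing direction
        if all(abs(v @ u) < max_cos for u in dirs):
            dirs.append(v)
    return np.array(dirs)
```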
Experimental Evaluation • Experimental Setup • Distance-based outlier detection operation • Precision and recall are used to compare the results • An 855-record dataset consisting of 1995-96 National Hockey League (NHL) player performance statistics • Synthetic datasets containing up to 100,100 tuples, whose distribution mirrored that of the base dataset, were also created • Attributes are goals, assists, penalty minutes, shots on goal, and games played IDSL
Usefulness of Donoho-Stahel Transformation • The range for penalty-minutes was [0,335], and the range for goals-scored was [0,69] IDSL
Comparison IDSL
Conclusions • Many types of distance-based KDD operations tend to be less meaningful • When no attention is paid to scale, variability, correlation, and outliers in the underlying data • An appropriate space is one that: • Preserves the Euclidean property • Is stable under updates • The Hybrid-random algorithm is an effective and efficient algorithm for computing the robust estimator IDSL
Personal Opinion • Data preprocessing can help improve the quality of the data and the mining results • Space transformation may improve the accuracy and efficiency of mining algorithms involving distance measurements IDSL