
Feature Extraction for Outlier Detection in High-Dimensional Spaces






Presentation Transcript


  1. Feature Extraction for Outlier Detection in High-Dimensional Spaces
  Hoang Vu Nguyen, Vivekanand Gopalkrishnan

  2. Motivation
  • Outlier detection techniques
   • Compute distances between points in the full feature space
   • Suffer from the curse of dimensionality
  • Solution: feature extraction
  • Existing feature extraction techniques
   • Do not consider class imbalance
   • Are therefore not suitable for asymmetric classification (and outlier detection!)

  3. Overview
  • DROUT: Dimensionality Reduction/Feature Extraction for OUTlier Detection
  • Extracts features for the detection process
  • Designed to be integrated with outlier detectors
  [Diagram: the training set feeds DROUT, which produces features; a detector then uses those features on the testing set to report outliers]

  4. Background
  • Training set:
   • Normal class ωm: cardinality Nm, mean vector μm, covariance matrix ∑m
   • Anomaly class ωa: cardinality Na, mean vector μa, covariance matrix ∑a
   • Nm >> Na
   • Total number of points: Nt = Nm + Na
  • Scatter matrices (μt is the overall mean):
   • ∑w = (Nm/Nt)·∑m + (Na/Nt)·∑a
   • ∑b = (Nm/Nt)·(μm − μt)(μm − μt)^T + (Na/Nt)·(μa − μt)(μa − μt)^T
   • ∑t = ∑w + ∑b
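The scatter matrices on slide 4 can be sketched in NumPy as follows (the function name and variable names are mine, not from the slides; biased covariances are used so that the identity ∑t = ∑w + ∑b holds exactly):

```python
import numpy as np

def scatter_matrices(X_normal, X_anomaly):
    """Class-conditional and pooled scatter matrices (slide 4).

    X_normal: (N_m, d) array of normal points; X_anomaly: (N_a, d) anomalies.
    Returns (S_w, S_b, S_t): within-class, between-class, and total scatter.
    """
    N_m, N_a = len(X_normal), len(X_anomaly)
    N_t = N_m + N_a
    mu_m = X_normal.mean(axis=0)
    mu_a = X_anomaly.mean(axis=0)
    mu_t = (N_m * mu_m + N_a * mu_a) / N_t           # overall mean
    S_m = np.cov(X_normal, rowvar=False, bias=True)  # biased class covariances
    S_a = np.cov(X_anomaly, rowvar=False, bias=True)
    S_w = (N_m / N_t) * S_m + (N_a / N_t) * S_a
    S_b = (N_m / N_t) * np.outer(mu_m - mu_t, mu_m - mu_t) \
        + (N_a / N_t) * np.outer(mu_a - mu_t, mu_a - mu_t)
    S_t = S_w + S_b
    return S_w, S_b, S_t
```

With these weights, S_t coincides with the covariance of the combined training set, which is a quick sanity check on the decomposition.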

  5. Background (cont.)
  • Eigenspace of a scatter matrix ∑ (spanned by its eigenvectors)
   • Consists of 3 subspaces: principal (P), noise (N), and null (Ø)
  • Solving the eigenvalue problem yields d eigenvalues v1 ≥ v2 ≥ … ≥ vd
  • The noise and null subspaces are caused by noise and, mainly, by insufficient training data
  • Existing methods discard the noise and null subspaces → loss of information
  • Jiang et al. 2008: regularize all 3 subspaces before performing feature extraction
  [Figure: plot of eigenvalues v1…vd, partitioned into principal (indices 1…m), noise (m+1…r), and null (r+1…d) subspaces]

  6. DROUT Approach
  • Weight-adjusted within-class scatter matrix
   • ∑w = (Nm/Nt)·∑m + (Na/Nt)·∑a
   • Nm >> Na → ∑a is far less reliable than ∑m
   • Weighting ∑m and ∑a by (Nm/Nt) and (Na/Nt) means that, when doing feature extraction on ∑w (using PCA etc.), dimensions (eigenvectors) specified mainly by small eigenvalues of ∑m are unexpectedly removed
   • → the extracted dimensions are not really relevant for the asymmetric classification task
   • Xudong Jiang: Asymmetric principal component and discriminant analyses for pattern classification. IEEE Trans. Pattern Anal. Mach. Intell., 31(5), 2009
  • Solution
   • ∑w = wm·∑m + wa·∑a, with wm < wa and wm + wa = 1
   • More suitable for asymmetric classification
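The weight adjustment itself is a one-liner; a minimal sketch (helper name is mine) that swaps the cardinality-based weights Nm/Nt, Na/Nt for fixed weights wm < wa, so the few anomalies still shape the extracted features:

```python
import numpy as np

def weighted_within_scatter(S_m, S_a, w_m=0.1, w_a=0.9):
    """Weight-adjusted within-class scatter (slide 6): w_m < w_a, w_m + w_a = 1,
    so the unreliable anomaly covariance S_a is deliberately over-weighted."""
    assert abs(w_m + w_a - 1.0) < 1e-12, "weights must sum to 1"
    return w_m * S_m + w_a * S_a
```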

  7. DROUT Approach (cont.)
  • Which matrix to regularize first?
  • Goal: extract features that minimize the within-class and maximize the between-class variances
  • Within-class variances are estimated from limited training data
   • → small estimated variances tend to be unstable and cause overfitting
   • → proceed by regularizing the 3 subspaces of the adjusted within-class scatter matrix

  8. DROUT Approach (cont.)
  • Subspace decomposition
   • Solve the eigenvalue problem on the (weight-adjusted) ∑w to obtain eigenvectors {e1, e2, …, ed} with corresponding eigenvalues v1 ≥ v2 ≥ … ≥ vd
   • Identify m:
    • vmed = median_{i ≤ r} {vi}
    • v_{m+1} = max_{i ≤ r} {vi | vi < 2·vmed − vr}
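The median rule above can be sketched as follows (function name is mine; since the eigenvalues are sorted descending, the largest eigenvalue below the threshold is simply the first one below it):

```python
import numpy as np

def split_point(eigvals, r):
    """Boundary m between principal and noise subspaces (slide 8).

    eigvals: eigenvalues sorted descending, v_1 >= ... >= v_d.
    r: rank of the scatter matrix (index where the null space begins).
    Returns m: the number of eigenvalues kept as the principal subspace.
    """
    v = np.asarray(eigvals, dtype=float)
    v_med = np.median(v[:r])          # median of the non-null eigenvalues
    v_r = v[r - 1]                    # smallest non-null eigenvalue
    # v_{m+1} = max_{i<=r} {v_i | v_i < 2*v_med - v_r}; with a descending
    # sort, that is the first eigenvalue under the threshold.
    below = np.where(v[:r] < 2 * v_med - v_r)[0]
    return int(below[0]) if len(below) else r
```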

  9. DROUT Approach (cont.)
  • Subspace regularization
   • a = v1·vm·(m − 1)/(v1 − vm)
   • b = (m·vm − v1)/(v1 − vm)
   • Regularized eigenvalues:
    • i ≤ m: xi = vi
    • m < i ≤ r: xi = a/(i + b)
    • r < i ≤ d: xi = a/(r + 1 + b)
   • A = [ei · wi]_{1 ≤ i ≤ d} where wi = 1/sqrt(xi)
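A sketch of slide 9's regularization (function names are mine). The constants a and b are chosen so the 1/(i + b) model interpolates the spectrum at i = 1 and i = m, i.e. a/(1 + b) = v1 and a/(m + b) = vm, which makes the regularized curve join the principal subspace smoothly:

```python
import numpy as np

def regularize_eigvals(v, m, r):
    """Eigenspectrum regularization (slide 9).

    v: eigenvalues sorted descending; m, r: principal/noise boundaries (1-based).
    Returns x: principal part kept, noise part replaced by the fitted
    a/(i + b) model, null part set to the constant floor a/(r + 1 + b).
    """
    v = np.asarray(v, dtype=float)
    a = v[0] * v[m - 1] * (m - 1) / (v[0] - v[m - 1])
    b = (m * v[m - 1] - v[0]) / (v[0] - v[m - 1])
    x = v.copy()
    x[m:r] = a / (np.arange(m + 1, r + 1) + b)   # noise subspace
    x[r:] = a / (r + 1 + b)                      # null subspace: constant floor
    return x

def whitening_matrix(eigvecs, x):
    """A = [e_i * w_i] with w_i = 1/sqrt(x_i): scale each eigenvector column."""
    return eigvecs / np.sqrt(x)
```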

  10. DROUT Approach (cont.)
  • Feature extraction
   • Transform each data point p as p′ = A^T·p
   • Form the new (weight-adjusted) total scatter matrix (slide 4) and solve the eigenvalue problem on it
   • B = matrix of the c resulting eigenvectors with the largest eigenvalues
   • → feature extraction is done only after regularization → limits the loss of information
   • Xudong Jiang, Bappaditya Mandal, and Alex ChiChung Kot: Eigenfeature regularization and extraction in face recognition. IEEE Trans. Pattern Anal. Mach. Intell., 30(3):383–394, 2008
  • Transform matrix: M = A·B

  11. DROUT Approach (cont.)
  • Summary:
   • Let ∑w = wm·∑m + wa·∑a
   • Compute A from ∑w
   • Transform the training set using A
   • Compute the new total scatter matrix ∑t
   • Compute B by solving the eigenvalue problem on ∑t
   • M = A·B
   • Use M to transform the testing set
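The summary above can be sketched end to end in NumPy. This is my own illustrative reading of the slides, not the authors' code: for brevity, the principal-subspace size m is a plain parameter instead of being chosen by the median rule of slide 8, and the new total scatter is weight-adjusted with the same wm, wa:

```python
import numpy as np

def drout(X_norm, X_anom, w_m=0.1, w_a=0.9, m=5, c=10):
    """Sketch of the DROUT pipeline (slide 11). Returns the d-by-c
    transform matrix M = A . B; test points are mapped as p' = M^T p."""
    # 1) weight-adjusted within-class scatter (slide 6)
    S_w = w_m * np.cov(X_norm, rowvar=False, bias=True) \
        + w_a * np.cov(X_anom, rowvar=False, bias=True)
    # 2) eigen-decompose S_w, sorted descending
    vals, vecs = np.linalg.eigh(S_w)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    r = int(np.sum(vals > 1e-10 * vals[0]))     # numerical rank
    # 3) regularize the spectrum (slide 9) and form A
    a = vals[0] * vals[m - 1] * (m - 1) / (vals[0] - vals[m - 1])
    b = (m * vals[m - 1] - vals[0]) / (vals[0] - vals[m - 1])
    x = vals.copy()
    x[m:r] = a / (np.arange(m + 1, r + 1) + b)
    x[r:] = a / (r + 1 + b)
    A = vecs / np.sqrt(x)
    # 4) transform the training set and build the new total scatter (slide 10)
    Y = np.vstack([X_norm, X_anom]) @ A
    N_m = len(X_norm)
    mu_m, mu_a = Y[:N_m].mean(axis=0), Y[N_m:].mean(axis=0)
    mu_t = (N_m * mu_m + len(X_anom) * mu_a) / len(Y)
    S_w2 = w_m * np.cov(Y[:N_m], rowvar=False, bias=True) \
         + w_a * np.cov(Y[N_m:], rowvar=False, bias=True)
    S_b2 = w_m * np.outer(mu_m - mu_t, mu_m - mu_t) \
         + w_a * np.outer(mu_a - mu_t, mu_a - mu_t)
    # 5) B = c leading eigenvectors of the new total scatter
    vals2, vecs2 = np.linalg.eigh(S_w2 + S_b2)
    B = vecs2[:, np.argsort(vals2)[::-1][:c]]
    # 6) final transform
    return A @ B
```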

  12. Related Work
  • APCDA
   • Xudong Jiang: Asymmetric principal component and discriminant analyses for pattern classification. IEEE Trans. Pattern Anal. Mach. Intell., 31(5), 2009
   • Uses weight-adjusted scatter matrices for feature extraction
   • Discards the noise and null subspaces → loss of information
  • ERE
   • Xudong Jiang, Bappaditya Mandal, and Alex ChiChung Kot: Eigenfeature regularization and extraction in face recognition. IEEE Trans. Pattern Anal. Mach. Intell., 30(3):383–394, 2008
   • Performs regularization before feature extraction
   • Ignores class imbalance → not suitable for outlier detection
  • ACP
   • David Lindgren and Per Spangeus: A novel feature extraction algorithm for asymmetric classification. IEEE Sensors Journal, 4(5):643–650, 2004
   • Considers neither the noise/null subspaces nor class imbalance

  13. Outlier Detection with DROUT
  • Detectors:
   • ORCA
    • Stephen D. Bay and Mark Schwabacher: Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In KDD, pages 29–38, 2003
   • BSOUT
    • George Kollios, Dimitrios Gunopulos, Nick Koudas, and Stefan Berchtold: Efficient biased sampling for approximate clustering and outlier detection in large data sets. IEEE Trans. Knowl. Data Eng., 15(5):1170–1187, 2003

  14. Outlier Detection with DROUT (cont.)
  • Datasets:
   • KDD Cup 1999
    • Normal class (60593 records) vs. U2R class (246 records)
    • d = 34 (7 categorical attributes are excluded)
    • Training set: 1000 normal records vs. 50 anomalous records
   • Ann-thyroid 1
    • Class 3 vs. class 1, d = 21
    • Training set: 450 normal records vs. 50 anomalous records
   • Ann-thyroid 2
    • Class 3 vs. class 2, d = 21
    • Training set: 450 normal records vs. 50 anomalous records
  • Parameter settings:
   • wm = 0.1 and wa = 0.9
   • Number of extracted features c ≤ d/2

  15. Results

  16. Results (cont.)

  17. Conclusion
  • Summary of contributions
   • Explored the effect of feature extraction on outlier detection
   • Proposed DROUT, a novel feature extraction framework for outlier detection
   • Results on real datasets with two detection methods are promising
  • Future work
   • More experiments on larger datasets
   • Examine other approaches to dimensionality reduction

  18. Last words… Thank you for your attention!
