
RDF: A Density-based Outlier Detection Method Using Vertical Data Representation





Presentation Transcript


  1. RDF: A Density-based Outlier Detection Method Using Vertical Data Representation Dongmei Ren, Baoying Wang, William Perrizo North Dakota State University, U.S.A.

  2. Introduction • Related Work • Breunig et al. [6] first proposed a density-based approach to mining outliers over datasets with different densities. • Papadimitriou & Kitagawa [7] introduced the local correlation integral (LOCI) method, but it is not efficient. • Contributions of this paper • A relative density factor (RDF): RDF expresses the same amount of information as LOF (local outlier factor) [6] and MDEF (multi-granularity deviation factor) [7], but is easier to compute. • An RDF-based outlier detection method: it efficiently prunes the data points that lie deep inside clusters and detects outliers only within the remaining small subset of the data. • A vertical data representation in P-trees, which further improves the efficiency of the method.

  3. Definitions • Definition 1: Disk Neighborhood --- DiskNbr(x, r). Given a point x and a radius r, the disk neighborhood of x is defined as the set DiskNbr(x, r) = {x′ ∈ X | d(x, x′) ≤ r}, where d(x, x′) is the distance between x and x′. Neighbors of x are either direct (within radius r) or indirect (within the ring between r and 2r). • Definition 2: Density of DiskNbr(x, r) --- Dens(x, r) = |DiskNbr(x, r)| / r^dim, where dim is the number of dimensions. [Figure: direct vs. indirect disk neighborhoods of x.]
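
A minimal NumPy sketch of Definitions 1 and 2, assuming Euclidean distance and the density formula as reconstructed above; disk_nbr and dens are illustrative names, not the paper's API:

```python
import numpy as np

def disk_nbr(X, x, r):
    """Definition 1: indices of the points of X within distance r of x."""
    dists = np.linalg.norm(X - x, axis=1)
    return np.nonzero(dists <= r)[0]

def dens(X, x, r):
    """Definition 2: neighbor count scaled by the volume factor r^dim."""
    dim = X.shape[1]
    return len(disk_nbr(X, x, r)) / r ** dim
```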

  4. Definitions (Continued) • Definition 3: Relative Density Factor (RDF) of point x with radius r --- RDF(x, r), the ratio of the densities of two neighborhoods of x. Special case: the RDF between DiskNbr(x, r) and the ring {DiskNbr(x, 2r) − DiskNbr(x, r)}: RDF(x, r) = Dens(DiskNbr(x, 2r) − DiskNbr(x, r)) / Dens(DiskNbr(x, r)). • RDF is used to measure outlierness: outliers are points with high RDF values. [Figure: direct neighbors inside radius r; indirect neighbors in the ring between r and 2r.]
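
Building on the previous snippet, a sketch of the ring/disk ratio as reconstructed above (the exact normalization in the paper may differ):

```python
def rdf(X, x, r):
    """Definition 3 (special case): density of the ring
    DiskNbr(x, 2r) - DiskNbr(x, r) over the density of DiskNbr(x, r)."""
    dim = X.shape[1]
    inner = len(disk_nbr(X, x, r))   # always >= 1: x is its own neighbor
    outer = len(disk_nbr(X, x, 2 * r))
    dens_disk = inner / r ** dim
    dens_ring = (outer - inner) / ((2 * r) ** dim - r ** dim)
    return dens_ring / dens_disk
```

For an isolated point the inner disk is nearly empty while the ring may reach a cluster, so the ratio is large; deep inside a cluster the two densities match and the ratio stays near 1.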

  5. The Proposed Outlier Detection Method • Given a dataset X, the proposed outlier detection method proceeds in two phases: finding outliers and pruning non-outliers. • The method prunes non-outliers (points deep inside clusters) efficiently, then finds outliers over the remaining small subset of the data, which consists of points on cluster boundaries and genuine outliers. [Figure: starting from a point x, the neighborhood expands through radii r, 2r, 4r, 6r; non-outliers are pruned and the outliers remain.]

  6. Finding Outliers Three possible distributions with regard to RDF: • (a) 1/(1+ε) ≤ RDF ≤ (1+ε): prune all neighbors of x and call the “Pruning Non-outliers” procedure. • (b) RDF < 1/(1+ε): prune all direct neighbors of x and calculate RDF for each indirect neighbor. • (c) RDF > (1+ε): x is an outlier; prune the indirect neighbors of x.
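
A sketch of this three-way test; the case labels are the slide's, the function name is illustrative:

```python
def classify(rdf_value, eps):
    """Map an RDF value to the three cases above."""
    if rdf_value > 1 + eps:
        return "c"  # (c): x is an outlier
    if rdf_value < 1 / (1 + eps):
        return "b"  # (b): density drops sharply; inspect indirect neighbors
    return "a"      # (a): roughly constant density; x is deep in a cluster
```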

  7. Finding Outliers Using P-Trees • P-tree based direct neighbors --- PDN(x, r). For a point x, let X = (x1, x2, ..., xn), or X = (x1,m−1 ... x1,0), (x2,m−1 ... x2,0), ..., (xn,m−1 ... xn,0), where xi,j is the jth bit value of the ith attribute. • For the ith attribute, PDN(xi, r) = P(x′ > xi − r) AND P(x′ ≤ xi + r). • For multiple attributes, PDN(x, r) is the AND of PDN(xi, r) over all attributes i. • |DiskNbr(x, r)| = rc(PDN(x, r)), where rc is the root count of a P-tree. • P-tree based indirect neighbors --- PIN(x, r) = (OR over q ∈ DiskNbr(x, r) of PDN(q, r)) AND PDN′(x, r), where ′ denotes the complement. • Pruning is done by P-tree ANDing according to the three distributions above: (a), (c): PU = PU AND PDN′(x, r) AND PIN′(x, r); (b): PU = PU AND PDN′(x, r), where PU is a P-tree representing the unprocessed data points.
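
Plain boolean arrays can stand in for P-trees to illustrate the same AND/OR/complement algebra (without the compression that makes P-trees fast); the names here are illustrative. Note that the per-attribute interval predicates yield a rectangular neighborhood:

```python
import numpy as np

def pdn_mask(X, x, r):
    """Direct-neighbor mask: interval predicate (x_i - r, x_i + r] per
    attribute, ANDed across attributes; rc(PDN) is mask.sum()."""
    mask = np.ones(len(X), dtype=bool)
    for i in range(X.shape[1]):
        mask &= (X[:, i] > x[i] - r) & (X[:, i] <= x[i] + r)
    return mask

def pin_mask(X, x, r):
    """Indirect-neighbor mask: union of the direct-neighbor masks of
    x's direct neighbors, with x's own direct neighbors removed."""
    direct = pdn_mask(X, x, r)
    union = np.zeros(len(X), dtype=bool)
    for q in X[direct]:
        union |= pdn_mask(X, q, r)
    return union & ~direct

# Pruning mirrors the slide:
#   cases (a), (c): PU &= ~pdn_mask(X, x, r) & ~pin_mask(X, x, r)
#   case (b):       PU &= ~pdn_mask(X, x, r)
```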

  8. Pruning Non-outliers The pruning is a neighborhood-expanding process. It calculates the RDF between {DiskNbr(x, 2kr) − DiskNbr(x, kr)} and DiskNbr(x, kr), where k is an integer, and prunes based on the value of the RDF: • 1/(1+ε) ≤ RDF ≤ (1+ε) (density stays constant): continue expanding the neighborhood by doubling the radius. • RDF < 1/(1+ε) (significant decrease of density): stop expanding, prune DiskNbr(x, kr), and call the “Finding Outliers” procedure. • RDF > (1+ε) (significant increase of density): stop expanding and call “Pruning Non-outliers” again. [Figure: the neighborhood of a start point x expands through radii r, 2r, 4r; non-outliers are pruned.]
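
A sketch of the expanding loop, reusing rdf and disk_nbr from the earlier snippets; it returns the indices to prune plus the procedure to hand control to:

```python
def prune_non_outliers(X, x, r, eps):
    """Expand the neighborhood around x by doubling the radius while
    the ring density stays within a factor (1 + eps) of the disk's."""
    k = 1
    while True:
        value = rdf(X, x, k * r)   # ring (kr, 2kr] vs. disk of radius kr
        if value < 1 / (1 + eps):  # sharp drop: the cluster ends here
            return disk_nbr(X, x, k * r), "find_outliers"
        if value > 1 + eps:        # sharp rise: a denser region found
            return None, "prune_non_outliers"
        k *= 2                     # roughly constant: keep expanding
```

The loop always terminates: once the disk covers all points, the ring is empty, so the RDF falls below 1/(1+ε).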

  9. Pruning Non-outliers Using P-Trees • We define ξ-neighbors: the neighbors with ξ bits of dissimilarity from x, e.g., ξ = 1, 2, ..., 8 if x is an 8-bit value. • For a point x, let X = (x1, x2, ..., xn), or X = (x1,m ... x1,0), (x2,m ... x2,0), ..., (xn,m ... xn,0), where xi,j is the jth bit value of the ith attribute. For the ith attribute, the ξ-neighbors of x are calculated by PXi,ξ = AND over j = ξ to m of Pi,j^(xi,j), where Pi,j^(xi,j) = Pi,j if xi,j = 1 and P′i,j otherwise; that is, points that agree with x on all but the ξ low-order bits. • The pruning is accomplished by: PU = PU AND PX′ξ, where PX′ξ is the complement set of PXξ.
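
A sketch of the bit-prefix test just described, under the assumption that ξ bits of dissimilarity means the bits above the ξ least-significant ones must match:

```python
import numpy as np

def xi_nbr_mask(col, x_i, xi):
    """xi-neighbor mask along one attribute: values equal to x_i on
    all bits above the xi low-order ones."""
    col = np.asarray(col, dtype=np.int64)
    return (col >> xi) == (int(x_i) >> xi)

# Full xi-neighborhood: AND the per-attribute masks; pruning then
# uses the complement, PU &= ~mask, as on the slide.
```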

  10. RDF-based Outlier Detection Process • Algorithm: RDF-based outlier detection using P-trees • Input: dataset X, radius r, distribution parameter ε. • Output: an outlier set Ols. • // PU --- unprocessed points, represented by P-trees • // |PU| --- number of points in PU • // Build up the P-trees for dataset X • PU ← createP-Trees(X); • i ← 1; • WHILE |PU| > 0 DO • x ← PU.first; // pick an arbitrary point x • Ols ← Ols ∪ FindOutliers(x, r, ε); • i ← i + 1; • ENDWHILE
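
A deliberately simplified, runnable sketch of this top-level loop, collapsing the mutually recursive FindOutliers / PruneNonOutliers procedures into one pass and reusing rdf and disk_nbr from the earlier snippets (it does not reproduce the paper's exact control flow or its P-tree speed):

```python
import numpy as np

def rdf_outlier_detection(X, r, eps):
    """Pick an arbitrary unprocessed point, test its RDF, and either
    record it as an outlier or prune its disk neighborhood."""
    pu = np.ones(len(X), dtype=bool)   # PU: unprocessed points
    ols = []                           # Ols: detected outliers
    while pu.any():
        idx = int(np.argmax(pu))       # PU.first (first unprocessed)
        value = rdf(X, X[idx], r)
        if value > 1 + eps:            # case (c): outlier
            ols.append(idx)
        else:                          # cases (a)/(b): in/near a cluster
            pu[disk_nbr(X, X[idx], r)] = False
        pu[idx] = False                # mark x itself processed
    return ols
```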

  11. “Find Outliers” and “Prune Non-Outliers” Procedures

  12. Experimental Study • Dataset: NHL data set (1996). • Compared with LOF and aLOCI: • LOF: the local outlier factor method. • aLOCI: the approximate local correlation integral method. • Run-time comparison. • Scalability comparison: starting from 16,384 data points, our method outperforms both in terms of scalability and speed.

  13. References
[1] V. Barnett, T. Lewis, Outliers in Statistical Data, John Wiley & Sons.
[2] E. M. Knorr, R. T. Ng, “A Unified Notion of Outliers: Properties and Computation,” Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD), 1997, pp. 219-222.
[3] E. M. Knorr, R. T. Ng, “Algorithms for Mining Distance-Based Outliers in Large Datasets,” Proc. 24th Int. Conf. on Very Large Data Bases (VLDB), 1998.
[4] E. M. Knorr, R. T. Ng, “Finding Intensional Knowledge of Distance-Based Outliers,” Proc. 25th Int. Conf. on Very Large Data Bases (VLDB), 1999, pp. 211-222.
[5] S. Ramaswamy, R. Rastogi, K. Shim, “Efficient Algorithms for Mining Outliers from Large Data Sets,” Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000.
[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, J. Sander, “LOF: Identifying Density-Based Local Outliers,” Proc. ACM SIGMOD Int. Conf. on Management of Data, Dallas, TX, 2000.
[7] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, C. Faloutsos, “LOCI: Fast Outlier Detection Using the Local Correlation Integral,” Proc. 19th Int. Conf. on Data Engineering (ICDE), Bangalore, India, 2003.
[8] A. K. Jain, M. N. Murty, P. J. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, 31(3):264-323, 1999.
[9] A. Arning, R. Agrawal, P. Raghavan, “A Linear Method for Deviation Detection in Large Databases,” Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD), 1996, pp. 164-169.
[10] S. Sarawagi, R. Agrawal, N. Megiddo, “Discovery-Driven Exploration of OLAP Data Cubes,” Proc. EDBT, 1998.
[11] Q. Ding, M. Khan, A. Roy, W. Perrizo, “The P-tree Algebra,” Proc. ACM Symposium on Applied Computing (SAC), 2002.
[12] W. Perrizo, “Peano Count Tree Technology,” Technical Report NDSU-CSOR-TR-01-1, 2001.
[13] M. Khan, Q. Ding, W. Perrizo, “k-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees,” Proc. PAKDD 2002, Springer-Verlag LNAI, 2002.
[14] B. Wang, F. Pan, Y. Cui, W. Perrizo, “Efficient Quantitative Frequent Pattern Mining Using Predicate Trees,” Proc. CAINE, 2003.
[15] F. Pan, B. Wang, Y. Zhang, D. Ren, X. Hu, W. Perrizo, “Efficient Density Clustering for Spatial Data,” Proc. PKDD, 2003.

  14. Thank you!

  15. Determination of Parameters • Determination of r • Breunig et al. [6] show that choosing MinPts = 10-30 works well in general (the MinPts-neighborhood). • Choosing MinPts = 20, we compute the average radius of the 20-neighborhood, r_average. • In our algorithm, r = r_average = 0.5. • Determination of ε • The selection of ε is a tradeoff between accuracy and speed: the larger ε is, the faster the algorithm runs; the smaller ε is, the more accurate the results are. • We chose ε = 0.8 experimentally and obtained the same result (the same outliers) as Breunig’s method, but much faster. • The results shown in the experimental section are based on ε = 0.8.
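
One way to compute r_average, using scikit-learn's NearestNeighbors (not part of the paper; shown only to make the recipe concrete):

```python
from sklearn.neighbors import NearestNeighbors

def average_knn_radius(X, k=20):
    """r_average: mean distance from each point to its k-th nearest
    neighbor, i.e. the average radius of the k-neighborhood."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each
    dists, _ = nbrs.kneighbors(X)                      # point is its own 1-NN
    return float(dists[:, -1].mean())
```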
