Manish Gupta 1 , Jing Gao 2 , Yizhou Sun 1 , Jiawei Han 1 1 UIUC 2 SUNY , Buffalo

Integrating Community Matching and Outlier Detection for Mining Evolutionary Community Outliers
Manish Gupta (1), Jing Gao (2), Yizhou Sun (1), Jiawei Han (1)
(1) UIUC, (2) SUNY Buffalo
KDD 2012, Beijing, China

Temporal Datasets are Ubiquitous • Time series • Evolving networks • Temporal databases Image Courtesy: http://thefinancebuff.com/living-with-negative-real-interest-rates.html and http://sydney.edu.au/engineering/it/~shhong/temporal.htm

Evolutionary Networks are Omnipresent • Social networks: New users join, new friendships are created. • Bibliographic networks: New authors publish more papers, more collaborations are formed. • Transportation/road networks: New roads are constructed. • Ad hoc networks: Army vehicles change positions very frequently, and new messages are transmitted. • Most objects evolve normally. • Evolutionary outliers: black hats or revolutionaries. Image Courtesy: http://sydney.edu.au/engineering/it/~shhong/temporal.htm

Abstract • ECOutliers: Objects that evolve against cluster trends in temporal datasets. • We perform cluster matching and outlier detection in a tightly integrated manner. • Given two snapshots, we propose a coordinate descent algorithm to find ECOutliers. • We experiment on several real and synthetic datasets.

Clusters Evolve • Contraction • Expansion • Clusters merge • Cluster splits [Figure: clusters in Snapshot 1 and Snapshot 2]

Evolutionary Community Outliers

Real-life Examples: Career Change • Perry Como: from a community of barbers to a community of singers. Image courtesy: http://www.chicagotribune.com/topic/entertainment/music/perry-como-PECLB001096.topic http://www.karaoke-lyrics.net/photos/perry-como-15218/157385 http://canonsburgfriends.blogspot.com/2011/05/perry-como-singing-sensation-and-barber.html

Real-life Examples: Conglomerative Diversification • Walt Disney: Animation, Movies, Theme Parks + Resorts. Image Courtesy: http://en.wikipedia.org/wiki/File:Walt_Disney_Television_Animation_Logo.jpg http://students.cis.uab.edu/archived/monosky/project.html http://www.ultimaterollercoaster.com/coasters/history/designer/disney2.shtml

Existing Work • Single Snapshot • Distance-based outliers [Knorr et al., 2000, Ramaswamy et al., 2000] • Density-based outliers [Kriegel et al., 2009, Breunig et al., 2000] • Individual Object Context • Type I and Type II outliers [Fox, 1972, Peter et al., 1988] • Stream outliers (without modeling communities or their evolution) • Distance-based local outliers [Pokrajac et al., 2007] • Graph outliers [Aggarwal et al., 2011]

Typical Evolutionary Outlier Detection Framework [Figure: snapshots X1 and X2 → Clustering → P and Q → Cluster Matching → Data Analysis to Detect Outliers]

Our Evolutionary Outlier Detection Framework [Figure: snapshots X1 and X2 → Clustering / Evolutionary Clustering → P and Q → Cluster Matching together with Data Analysis to Detect Outliers]

Few Definitions • Community • Probabilistic collection of similar objects • E.g. a research area like data mining

Few Definitions • Belongingness Matrix [Figure: example belongingness matrix over the communities Databases, Information Retrieval, Machine Learning, Data Mining]

Few Definitions • Correspondence Matrix: S (K1 × K2) • Outlierness Matrix: A (N × K2) • Problem: • Given P and Q, estimate S and A • Detect Evolutionary Community Outliers [Figure: P × S compared with Q, with each row of S summing to 1]
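To make the roles of P, S and Q concrete, here is a minimal sketch with made-up numbers. The toy matrices, `matmul` and `is_right_stochastic` are illustrative assumptions, not taken from the slides. It shows that P × S gives the belongingness each object would be expected to have in the second snapshot, and that a large gap between that expectation and Q flags a candidate outlier.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def is_right_stochastic(S, tol=1e-9):
    """True when every entry is non-negative and every row sums to 1."""
    return all(min(row) >= 0 and abs(sum(row) - 1.0) <= tol for row in S)

# Hypothetical toy data: 3 objects, K1 = 2 communities in X1, K2 = 2 in X2.
P = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]       # belongingness in snapshot 1
S = [[0.9, 0.1], [0.2, 0.8]]                   # community i in X1 -> community j in X2
Q = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]       # belongingness in snapshot 2

PS = matmul(P, S)                              # expected belongingness in X2
residual = [[q - e for q, e in zip(q_row, e_row)] for q_row, e_row in zip(Q, PS)]

for o, row in enumerate(residual):
    print(o, [round(r, 2) for r in row])       # the third object has the largest gap
```

The first two objects follow the community trend exactly, while the third moves toward the second community far more than S predicts.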

“Mining needle in a haystack. So much hay and so little time” The problem is Challenging! Outlier Detection Evolutionary Outlier Detection Image Credits: http://www.dawnsong.org/LotRO/housing/yard_small.htm

Community Matching and Outlier Detection Together
[Figure: P × S compared with Q]
Convex Optimization [Equation not preserved; its annotations follow]
• Objective: minimize the element-wise distance between the two matrices (Q and PS) with respect to outlier-weighted elements.
• Negative log in a_oj; hence, convex in a_oj. The log ensures outliers will be within a small range.
• Quadratic in s_ij; hence, convex in s_ij.
• Constraint: the correspondence matrix should be right stochastic.
• Inequality constraint: the maximum amount of outlierness should be controlled.
Notation
• P (N × K1) = belongingness matrix for snapshot 1
• Q (N × K2) = belongingness matrix for snapshot 2
• S (K1 × K2) = correspondence matrix
• A (N × K2) = outlierness matrix
• N = number of objects
• K1 = number of clusters in snapshot 1
• […] = maximum level of outlierness overall
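The slide's equation is not preserved here, but its annotations (negative log in a_oj, quadratic in s_ij, right-stochastic S, bounded total outlierness) are consistent with a formulation like the one below. Treat it as an assumed reconstruction rather than the authors' exact statement; in particular, the symbol \mu for the outlierness bound is a name chosen for this sketch.

```latex
\begin{aligned}
\min_{S,\,A}\quad
  & \sum_{o=1}^{N}\sum_{j=1}^{K_2}
    \log\!\left(\frac{1}{a_{oj}}\right)
    \left(q_{oj}-\sum_{i=1}^{K_1} p_{oi}\,s_{ij}\right)^{2} \\
\text{subject to}\quad
  & s_{ij}\ge 0, \qquad \sum_{j=1}^{K_2} s_{ij}=1 \quad \forall i, \\
  & 0\le a_{oj}\le 1, \qquad \sum_{o=1}^{N}\sum_{j=1}^{K_2} a_{oj}\le \mu .
\end{aligned}
```

Under this form, an entry with high outlierness a_oj gets a small weight log(1/a_oj), so its matching error contributes little to the fit of S.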

Using the Lagrangian Multipliers Method [Equation not preserved] • […] and […] are Lagrangian variables, subject to the following constraints: […]

Derivation of Update Rules
• Differentiate wrt a_oj and set it to 0. [Equation not preserved; annotated terms: outlierness for the (o,j)th entry, community matching error for the (o,j)th entry, total outlierness in the snapshot, total matching error]
• Differentiate wrt s_ij and set it to 0. [Equation not preserved; annotated terms: part of q_o'j that needs to be explained by p_o'i s_ij, average across all objects, normalization to make sure that S stays right stochastic, give outliers lower weight when matching]
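The two differentiation steps can be worked through under an assumed objective that weights each squared matching error by log(1/a_oj). The multiplier names \beta_i and \gamma and the bound \mu are assumptions of this sketch, and the result is a derivation consistent with the annotations, not a copy of the slide's equations.

```latex
\begin{aligned}
e_{oj} &= \Bigl(q_{oj}-\sum_{i=1}^{K_1} p_{oi}\,s_{ij}\Bigr)^{2}, \\
\mathcal{L} &= \sum_{o,j}\log\!\Bigl(\frac{1}{a_{oj}}\Bigr)\,e_{oj}
  +\sum_{i}\beta_i\Bigl(\sum_{j}s_{ij}-1\Bigr)
  +\gamma\Bigl(\sum_{o,j}a_{oj}-\mu\Bigr), \\[4pt]
\frac{\partial\mathcal{L}}{\partial a_{oj}}
  &= -\frac{e_{oj}}{a_{oj}}+\gamma = 0
  \;\Longrightarrow\; a_{oj}=\frac{e_{oj}}{\gamma}
  \;\Longrightarrow\;
  a_{oj}=\frac{e_{oj}\,\mu}{\sum_{o',j'}e_{o'j'}}
  \quad\text{when } \sum_{o,j}a_{oj}=\mu, \\[4pt]
\frac{\partial\mathcal{L}}{\partial s_{ij}}
  &= -2\sum_{o'}\log\!\Bigl(\frac{1}{a_{o'j}}\Bigr)\,p_{o'i}
     \Bigl(q_{o'j}-\sum_{k}p_{o'k}\,s_{kj}\Bigr)+\beta_i = 0 \\
  &\Longrightarrow\;
  s_{ij}=\frac{\displaystyle\sum_{o'}\log\!\Bigl(\frac{1}{a_{o'j}}\Bigr)\,p_{o'i}
      \Bigl(q_{o'j}-\sum_{k\neq i}p_{o'k}\,s_{kj}\Bigr)-\frac{\beta_i}{2}}
     {\displaystyle\sum_{o'}\log\!\Bigl(\frac{1}{a_{o'j}}\Bigr)\,p_{o'i}^{2}} .
\end{aligned}
```

Here the first rule reads as "matching error of the entry, times total outlierness, divided by total matching error", and in the second rule \beta_i is fixed by requiring each row of S to sum to 1.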

Evolutionary Community Outlier Detection Algorithm (OneStage)
• Input: P and Q
• Output: Estimates of S and A
• Initialize […] (the maximum level of outlierness), all s_ij to […] and all a_oj to […]
• While (not converged)
  • Compute A (Outlier Detection step)
  • Compute S (Community Matching step)
• Estimate […] (the maximum level of outlierness)
• While (not converged)
  • Compute A (Outlier Detection step)
  • Compute S (Community Matching step)
Notes
• Two pass algorithm
• Coordinate descent: iterative computation of S and A
• Time complexity: O(N K1² K2 t), where N = #objects, K1 = #clusters in X1, K2 = #clusters in X2, t = #iterations
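The alternating structure of one pass can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the objective is assumed to weight squared matching errors by log(1/a_oj), `mu` is an assumed name for the outlierness bound (fixed at 1 here), the uniform start for S is a guess, and the clip-and-renormalise step stands in for whatever exact normalization the slides use. The re-estimation of the bound between passes is omitted because its formula is not given.

```python
import math

def matching_errors(P, Q, S):
    """Squared error (q_oj - sum_i p_oi * s_ij)^2 for every entry (o, j)."""
    K2 = len(S[0])
    return [[(q_row[j] - sum(p * S[i][j] for i, p in enumerate(p_row))) ** 2
             for j in range(K2)]
            for p_row, q_row in zip(P, Q)]

def update_A(P, Q, S, mu=1.0, eps=1e-12):
    """Outlier detection step: share mu across entries in proportion to error."""
    E = matching_errors(P, Q, S)
    total = sum(map(sum, E)) + eps
    # Clamp into (0, 1) so that log(1 / a_oj) stays finite and positive.
    return [[min(max(e * mu / total, eps), 1.0 - eps) for e in row] for row in E]

def update_S(P, Q, S, A):
    """Community matching step: weighted coordinate updates per entry, then
    clip to >= 0 and renormalise each row (a simplification of this sketch)."""
    K1, K2 = len(S), len(S[0])
    S = [row[:] for row in S]
    for i in range(K1):
        for j in range(K2):
            num = den = 0.0
            for o, (p_row, q_row) in enumerate(zip(P, Q)):
                w = math.log(1.0 / A[o][j])          # outliers get lower weight
                rest = sum(p_row[k] * S[k][j] for k in range(K1) if k != i)
                num += w * p_row[i] * (q_row[j] - rest)
                den += w * p_row[i] ** 2
            if den > 0:
                S[i][j] = max(num / den, 0.0)
        row_sum = sum(S[i])
        S[i] = [s / row_sum for s in S[i]] if row_sum > 0 else [1.0 / K2] * K2
    return S

def one_pass(P, Q, mu=1.0, max_iter=100, tol=1e-8):
    """Alternate the two steps until S stops changing."""
    K1, K2 = len(P[0]), len(Q[0])
    S = [[1.0 / K2] * K2 for _ in range(K1)]         # assumed uniform start
    for _ in range(max_iter):
        A = update_A(P, Q, S, mu)
        S_new = update_S(P, Q, S, A)
        change = max(abs(a - b) for r1, r2 in zip(S, S_new) for a, b in zip(r1, r2))
        S = S_new
        if change < tol:
            break
    return S, update_A(P, Q, S, mu)

# Hypothetical toy data: the two communities swap labels between snapshots,
# and object 3 goes against that trend.
P = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
Q = [[0.0, 1.0]] * 3 + [[1.0, 0.0]] + [[1.0, 0.0]] * 4
S, A = one_pass(P, Q)
top = max(range(len(A)), key=lambda o: sum(A[o]))
print([[round(s, 3) for s in row] for row in S], top)
```

Each call to `update_S` loops over K1 × K2 entries, N objects and K1 other communities, which lines up with the N K1² K2 cost per iteration quoted on the slide.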

Estimation of the Maximum Level of Outlierness
• In the first pass, we assume unit total outlierness in the snapshot.
• After the first pass, estimate the maximum level of outlierness as […]
• In the second pass, compared to the first pass
  • the maximum level of outlierness increases
  • a_oj values increase
• Effect 1
  • Matching error requirement is relaxed for outlier entries
  • Less overfitting of S to outlier entries
  • Increase in accuracy of outlier detection
• Effect 2
  • Increase in overall matching error
• a_oj remain < 1
• Problem remains convex.

Experiments • Evaluation of outlier detection methods is generally quite difficult due to lack of ground truth. • Multiple Synthetic Datasets that simulate real scenarios. • Precision, Area Under Curve • Three real datasets: DBLP, IMDB, Four Area • Case Studies and Analysis

Baselines • Nearest Neighbor • For each object o • Set NN = objects in X1 whose distribution best matches that of o • K=10,100 • TwoStage • Compute S such that Q ≈ PS. • A=Q-PS • OneStage • 1 pass version of our algorithm (OneStage).
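The TwoStage baseline can be illustrated with a short sketch. The slide does not say how S is fitted, so `estimate_S` below uses a simple mass-weighted average as a stand-in; that estimator, the function names, the per-object aggregation of |Q − PS| and the toy data are all assumptions made for illustration.

```python
def estimate_S(P, Q):
    """Stage 1 stand-in: s_ij = sum_o p_oi * q_oj / sum_o p_oi.
    Assumes every community in X1 has non-zero total membership."""
    K1, K2 = len(P[0]), len(Q[0])
    S = []
    for i in range(K1):
        mass = sum(p_row[i] for p_row in P)
        S.append([sum(p_row[i] * q_row[j] for p_row, q_row in zip(P, Q)) / mass
                  for j in range(K2)])
    return S

def two_stage_scores(P, Q, S):
    """Stage 2: per-object outlier score = sum_j |q_oj - (PS)_oj|."""
    scores = []
    for p_row, q_row in zip(P, Q):
        pred = [sum(p * S[i][j] for i, p in enumerate(p_row))
                for j in range(len(q_row))]
        scores.append(sum(abs(q - e) for q, e in zip(q_row, pred)))
    return scores

# Hypothetical toy data: objects 0-2 start in community 0, objects 3-4 in community 1.
P = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
Q = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
S = estimate_S(P, Q)
scores = two_stage_scores(P, Q, S)
print([round(s, 3) for s in scores])
```

Because S is fitted before any outliers are discounted, the deviating object pulls the first row of S toward itself, which is the overfitting that an integrated approach is meant to reduce.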

Synthetic Dataset Generation • We generate five datasets to model contraction/expansion, noEvolution, Merge, Split, Mix of evolutions. • For each dataset • Model clusters as 2D Gaussians • Select mean and stdDev for X1 • Select mean and stdDev for X2 • Each cluster has N/K1 objects. • For each object • Allocate it to a cluster and get poj and qoj • S is computed using P and Q.
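A generation step of this kind could look like the sketch below. The slides do not say how p_oj and q_oj are derived from the Gaussians, so the normalised isotropic Gaussian density used here is an assumption, as are the function names, the cluster centres and the standard deviations.

```python
import math
import random

def sample_points(means, std, per_cluster, rng):
    """Draw per_cluster 2D points around each mean with isotropic std."""
    return [(rng.gauss(mx, std), rng.gauss(my, std))
            for mx, my in means for _ in range(per_cluster)]

def belongingness(points, means, std):
    """Soft membership: Gaussian density to each cluster, normalised per object."""
    rows = []
    for x, y in points:
        logits = [-((x - mx) ** 2 + (y - my) ** 2) / (2 * std ** 2)
                  for mx, my in means]
        top = max(logits)                       # subtract the max for stability
        dens = [math.exp(l - top) for l in logits]
        total = sum(dens)
        rows.append([d / total for d in dens])
    return rows

rng = random.Random(0)
means1 = [(0.0, 0.0), (10.0, 10.0)]             # hypothetical centres in X1
means2 = [(0.0, 0.0), (10.0, 10.0)]             # same centres in X2
X1 = sample_points(means1, 1.0, 50, rng)        # N / K1 = 50 objects per cluster
X2 = sample_points(means2, 2.0, 50, rng)        # larger stdDev models expansion
P = belongingness(X1, means1, 1.0)
Q = belongingness(X2, means2, 2.0)
```

Changing `means2` or the second standard deviation gives the other scenarios: identical parameters for no evolution, coinciding centres for a merge, and extra centres for a split.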

Injecting Outliers in Synthetic Dataset • Let […] be the percentage of outliers. • Randomly select a set R of objects, sized as that percentage of the N objects. • For each object o in R • Introduce two outliers corresponding to the min and max q_oj values by swapping q_o,jmin and q_o,jmax • a_oj values are computed as […]
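The swap-based injection can be sketched as below. The function name, the `fraction` parameter and the returned list of injected (object, community) entries are assumptions for illustration; the ground-truth a_oj formula is not reproduced because it is not given.

```python
import random

def inject_outliers(Q, fraction, rng):
    """Swap the min and max entries of q_o for a random fraction of objects.
    Returns a modified copy of Q and the list of injected (o, j) entries."""
    Q = [row[:] for row in Q]
    n_out = round(fraction * len(Q))
    entries = []
    for o in sorted(rng.sample(range(len(Q)), n_out)):
        cols = range(len(Q[o]))
        j_min = min(cols, key=lambda j: Q[o][j])
        j_max = max(cols, key=lambda j: Q[o][j])
        Q[o][j_min], Q[o][j_max] = Q[o][j_max], Q[o][j_min]
        entries += [(o, j_min), (o, j_max)]     # two outlier entries per object
    return Q, entries

clean = [[0.7, 0.2, 0.1] for _ in range(10)]    # hypothetical belongingness rows
noisy, entries = inject_outliers(clean, 0.2, random.Random(1))
```

Swapping the smallest and largest memberships moves an object from its strongest community to its weakest one, which is about as far against the trend as a single object can go.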

Synthetic Datasets [Figures: Expansion/Contraction; No Evolution]

Synthetic Datasets [Figures: Cluster Merge; Cluster Split]

Synthetic Datasets (Mix) [Figures: Snapshot 1; Snapshot 2]

Synthetic Dataset Results Summary • NN: Comparison with old nearest neighbors, without community matching. • 2S: Outlier detection after community matching. • 1S: Outlier detection with community matching. • 100 experiments per setting. • Average variances are .0012 for NN, .0021 for 2S, .0017 for 1S, and .0005 for 1S [Table, row and column labels not preserved: 2S (15%), NN (33%); 2S (10%), NN (46%); 2S (8%), NN (36%); 2S (25%), NN (21%); 2S (22%), NN (33%); 2S (10%), NN (30%)]

Precision Curves Precision for SynMix dataset 1S performs better than any other algorithm.
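A precision curve of this kind plots, for each cutoff k, the fraction of the k highest-scoring objects that are true outliers. A minimal sketch follows; the function name and the toy scores are illustrative assumptions.

```python
def precision_at_k(scores, true_outliers, k):
    """Fraction of the k highest-scoring objects that are true outliers."""
    ranked = sorted(range(len(scores)), key=lambda o: scores[o], reverse=True)
    hits = sum(1 for o in ranked[:k] if o in true_outliers)
    return hits / k

scores = [0.1, 0.9, 0.3, 0.8, 0.05]     # hypothetical outlier scores per object
truth = {1, 2}                          # hypothetical injected outliers
curve = [precision_at_k(scores, truth, k) for k in range(1, len(scores) + 1)]
print(curve)
```

On synthetic data the set of true outliers is known from the injection step, which is what makes this measure computable there.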

Running Time and Convergence [Figures: running time (linear); algorithm convergence]

Real Datasets • DBLP, IMDB, Four Area (DB, DM, IR, ML) [Sun et al., 2009b, Gao et al., 2009] • We use iTopicModel [Sun et al., 2009a] for clustering • DBLP Author Network (top 1500 conferences, authors with >=10 papers) • X1: 2006-07, 4784 papers • X2: 2008-09, 5633 papers • Data=conferences, links=co-authorship • K1=K2=5, N=2666 • Edge weight=coauthorship frequency • DBLP Conf Network • Data=title words • K1=K2=5, N=920 • Edge weight=#authors publishing at both conferences

Real Datasets • IMDB Actor Network (actors with >=5 movies) • X1: 2006-07, 15337 actors • X2: 2008-09, 13142 actors • Data=genres, links=co-actorship • K1=K2=5, N=6609 • edge weight=costarring frequency • Four Area Authors • X1: 2001-04, 601 authors • X2: 2005-08, 1206 authors • K1=K2=4, N=374 • Four Area Conf • K1=K2=4, N=18

Real Dataset Case Studies • DBLP Conference Network [Tables: Analysis of Top Outlier Conferences using OneStage Approach; Analysis of Top Outlier Conferences using TwoStage Approach]

Real Dataset Case Studies • IMDB Co-starring Network • Kelly Carlson (I) • X1: Many Sport, Thriller, and Action movies. • X2: Many Drama, Music, Reality-TV movies. • Collaborators quite different across snapshots. • X1 neighbors did Sports, Comedy, Documentary, Action • X2 neighbors did Drama, Documentary, Thriller, Music • Changed community from Sports, Thriller, Action to Drama and Music.

Related Work (Community Matching) • Mismatch happens due to label switching or genuine multi-modularity [Jakobsson and Rosenberg, 2007] • Hard community matching [Dimitriadou et al., 2001, Dudoit and Fridlyand, 2003] • Soft community matching [Long et al., 2005] • Computer vision: position, intensity, shape and average gray-scale difference [Kottke and Sun, 1994], and degree of match between surrounding clusters [Miller et al., 1984] • Words-based clusters could be matched using TF-IDF similarity [Cohen and Richman, 2002]

Related Work (Outlier Detection) • Extensive Surveys: [Chandola et al., 2009], [Hodge and Austin, 2004], [Zhang et al., 2008] • Four types: point outliers, contextual outliers, collective outliers and evolutionary outliers. • Techniques: mixture of models [Eskin, 2000], neural networks, stat profiling using histograms, SVMs [Hu et al., 2003], rule based systems, parametric stat modeling, non-parametric stat modeling, Bayesian networks, etc. • Datasets: high-dimensional data [Aggarwal and Yu, 2001], uncertain data [Aggarwal and Yu, 2008], stream data [Aggarwal et al., 2011], network data [Akoglu et al., 2010, Gao et al., 2010] and time series data [Fox, 1972]. • Evolving datasets [Ge et al., 2010, Ghoting et al., 2004].

Conclusions • We introduced the notion of evolutionary outlier detection with respect to latent evolving communities, ECOutliers. • We provided a coordinate descent algorithm, OneStage, that performs community matching across snapshots and ECOutlier detection simultaneously. • Using three real datasets and multiple synthetic datasets, we showed the effectiveness of our approach.

References
[Aggarwal and Yu, 2001] Aggarwal, C. C. and Yu, P. S. (2001). Outlier Detection for High Dimensional Data. SIGMOD Record, 30:37–46.
[Aggarwal and Yu, 2008] Aggarwal, C. C. and Yu, P. S. (2008). Outlier Detection with Uncertain Data. In Proc. of the SIAM Intl. Conf. on Data Mining (SDM), pages 483–493.
[Aggarwal et al., 2011] Aggarwal, C. C., Zhao, Y., and Yu, P. S. (2011). Outlier Detection in Graph Streams. In Proc. of the 27th Intl. Conf. on Data Engineering (ICDE), pages 399–409. IEEE Computer Society.
[Akoglu et al., 2010] Akoglu, L., McGlohon, M., and Faloutsos, C. (2010). Oddball: Spotting anomalies in weighted graphs. In Advances in Knowledge Discovery and Data Mining, 14th Pacific-Asia Conf. (PAKDD), pages 410–421. Springer.
[Bertsekas, 1999] Bertsekas, D. P. (1999). Non-Linear Programming (2nd Edition). Athena Scientific.
[Breunig et al., 2000] Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. (2000). LOF: Identifying Density-Based Local Outliers. In Proc. of the 2000 ACM SIGMOD Intl. Conf. on Management of Data (SIGMOD), pages 93–104. ACM.
[Chakrabarti et al., 2006] Chakrabarti, D., Kumar, R., and Tomkins, A. (2006). Evolutionary Clustering. In Proc. of the 12th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 554–560. ACM.
[Chandola et al., 2009] Chandola, V., Banerjee, A., and Kumar, V. (2009). Anomaly Detection: A Survey. ACM Computing Surveys, 41(3).
[Chi et al., 2007] Chi, Y., Song, X., Zhou, D., Hino, K., and Tseng, B. L. (2007). Evolutionary Spectral Clustering by Incorporating Temporal Smoothness. In Proc. of the 13th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 153–162. ACM.
[Cohen and Richman, 2002] Cohen, W. W. and Richman, J. (2002). Learning to Match and Cluster Large High-Dimensional Data Sets for Data Integration. In Proc. of the 8th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 475–480. ACM.
[Dimitriadou et al., 2001] Dimitriadou, E., Weingessel, A., and Hornik, K. (2001). Voting-Merging: An Ensemble Method for Clustering. In Proc. of the Intl. Conf. on Artificial Neural Networks (ICANN), pages 217–224. Springer.
[Dudoit and Fridlyand, 2003] Dudoit, S. and Fridlyand, J. (2003). Bagging to Improve the Accuracy of a Clustering Procedure. Bioinformatics, 19(9):1090–1099.
[Eskin, 2000] Eskin, E. (2000). Anomaly Detection over Noisy Data using Learned Probability Distributions. In Proc. of the 17th Intl. Conf. on Machine Learning (ICML), pages 255–262. Morgan Kaufmann Publishers Inc.
[Fox, 1972] Fox, A. J. (1972). Outliers in Time Series. Journal of the Royal Statistical Society. Series B (Methodological), 34(3):350–363.

References
[Freund and Schapire, 1995] Freund, Y. and Schapire, R. E. (1995). A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting. In Proc. of the 2nd European Conf. on Computational Learning Theory (EuroCOLT), pages 23–37. Springer-Verlag.
[Gao et al., 2009] Gao, J., Liang, F., Fan, W., Sun, Y., and Han, J. (2009). Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models. In Proc. of the 23rd Annual Conf. on Neural Information Processing Systems (NIPS), pages 585–593. Curran Associates, Inc.
[Gao et al., 2010] Gao, J., Liang, F., Fan, W., Wang, C., Sun, Y., and Han, J. (2010). On Community Outliers and their Efficient Detection in Information Networks. In Proc. of the 16th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 813–822.
[Ge et al., 2010] Ge, Y., Xiong, H., Zhou, Z.-H., Ozdemir, H., Yu, J., and Lee, K. C. (2010). Top-Eye: Top-K Evolving Trajectory Outlier Detection. In Proc. of the 19th ACM Conf. on Information and Knowledge Management (CIKM), pages 1733–1736.
[Ghoting et al., 2004] Ghoting, A., Otey, M. E., and Parthasarathy, S. (2004). LOADED: Link-Based Outlier and Anomaly Detection in Evolving Data Sets. In Proc. of the 4th IEEE Intl. Conf. on Data Mining (ICDM), pages 387–390.
[Google Answers Community, 2006] Google Answers Community (2006). Career changes. http://answers.google.com/answers/threadview/id/729232.html.
[Gupta et al., 2011] Gupta, M., Aggarwal, C. C., Han, J., and Sun, Y. (2011). Evolutionary Clustering and Analysis of Bibliographic Networks. In Advances in Social Networks Analysis and Mining (ASONAM), pages 63–70. IEEE Computer Society.
[Hodge and Austin, 2004] Hodge, V. J. and Austin, J. (2004). A Survey of Outlier Detection Methodologies. Artificial Intelligence Review, 22(2):85–126.
[Hu et al., 2003] Hu, W., Liao, Y., and Vemuri, V. R. (2003). Robust Anomaly Detection Using Support Vector Machines. In Proc. of the Intl. Conf. on Machine Learning (ICML), pages 282–289. Morgan Kaufmann Publishers Inc.
[Jakobsson and Rosenberg, 2007] Jakobsson, M. and Rosenberg, N. A. (2007). CLUMPP: A Cluster Matching and Permutation Program for Dealing with Label Switching and Multimodality in Analysis of Population Structure. Bioinformatics, 23:1801–1806.
[Kim and Han, 2009] Kim, M.-S. and Han, J. (2009). A Particle-and-Density Based Evolutionary Clustering Method for Dynamic Networks. In Proc. of the 35th Intl. Conf. on Very Large Data Bases (VLDB).

References
[Knorr et al., 2000] Knorr, E. M., Ng, R. T., and Tucakov, V. (2000). Distance-Based Outliers: Algorithms and Applications. The VLDB Journal, 8:237–253.
[Kottke and Sun, 1994] Kottke, D. and Sun, Y. (1994). Motion Estimation Via Cluster Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16:1128–1132.
[Kriegel et al., 2009] Kriegel, H.-P., Kröger, P., Schubert, E., and Zimek, A. (2009). LoOP: Local Outlier Probabilities. In Proc. of the 18th ACM Conf. on Information and Knowledge Management (CIKM), pages 1649–1652. ACM.
[Long et al., 2005] Long, B., Zhang, Z. M., and Yu, P. S. (2005). Combining Multiple Clusterings by Soft Correspondence. In Proc. of the 5th IEEE Intl. Conf. on Data Mining (ICDM), pages 282–289. IEEE Computer Society.
[Miller et al., 1984] Miller, M. J., Olson, A. D., and Thorgeirsson, S. S. (1984). Computer Analysis of Two-Dimensional Gels: Automatic Matching. Electrophoresis, 5(5):297–303.
[Perlroth and Schafer, 2010] Perlroth, N. and Schafer, I. (2010). Celebrity career-changers. http://www.forbes.com/2010/03/25/sarah-palin-television-leadership-second-actscelebrities.html.
[Peter et al., 1988] Peter, J., Mark, B., Otto, C., Monsour, N. J., Burman, J. P., and Otto, M. C. (1988). Census Bureau Research Project: Outliers in Time Series.
[Pokrajac et al., 2007] Pokrajac, D., Lazarevic, A., and Latecki, L. J. (2007). Incremental Local Outlier Detection for Data Streams. In IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pages 504–515. IEEE.
[Ramaswamy et al., 2000] Ramaswamy, S., Rastogi, R., and Shim, K. (2000). Efficient Algorithms for Mining Outliers from Large Data Sets. SIGMOD Record, 29:427–438.
[Sun et al., 2009a] Sun, Y., Han, J., Gao, J., and Yu, Y. (2009a). iTopicModel: Information Network-Integrated Topic Modeling. In Proc. of the 9th IEEE Intl. Conf. on Data Mining (ICDM), pages 493–502. IEEE Computer Society.
[Sun et al., 2009b] Sun, Y., Yu, Y., and Han, J. (2009b). Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema. In Proc. of the 15th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 797–806. ACM.
[Zhang et al., 2008] Zhang, Y., Meratnia, N., and Havinga, P. J. M. (2008). Outlier Detection Techniques For Wireless Sensor Networks: A Survey. Technical Report TR-CTIT-08-59, Centre for Telematics and Information Technology, University of Twente.

Thanks!
