OddBall: Spotting Anomalies in Weighted Graphs Leman Akoglu, Mary McGlohon, Christos Faloutsos Carnegie Mellon University School of Computer Science Pittsburgh, Pennsylvania, USA
Motivation • Anomaly detection in networks (graph data) has important applications: • Computer networks: spammers, port scanners • Phone-call networks: telemarketers, misbehaving customers, faulty equipment • Social networks: ‘popularity contests’ • Account networks: scammers, transfer fraud • Terrorist networks: tight groups of people Akoglu, McGlohon, Faloutsos
Problem Q1. Given a weighted and unlabeled graph, how can we spot strange, abnormal, extreme nodes? Q2. Can we explain why the spotted nodes are anomalous?
Preliminaries I – What is an anomaly? • No clear and unique definition! “An observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.” [Hawkins, 80]
Preliminaries II – Weights [Figure: example weighted graphs, one bipartite and one unipartite; edge-weight labels $15K, $5K, $10K, 3, 1]
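Preliminaries III – Power Laws $\Pr[X \ge x] \sim c\,x^{-\alpha}$, i.e. $\ln \Pr[X \ge x] \sim \ln c - \alpha \ln x$, with $c > 0$, $\alpha \ge 0$. [Figure: the same distribution on a lin-lin plot and on a log-log plot, where it is a straight line with slope $-\alpha$]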
‘Power Law’ Example DBLP Keyword-to-Conference Network [Figures: Densification Power Law [Leskovec ‘05] and Weight Power Law [McGlohon ‘08]; axis labels: #Source nodes, #Destination nodes, #Edges, Total weight]
‘Power Law’ Example 2004 US FEC Committees to Candidates network [Figure: Snapshot Power Law [McGlohon et al. ‘08]; In-weights ($) vs. In-degree (# donors); e.g. John Kerry, $10M received, from 1K donors]
Preliminaries IV – how to fit Least Squares fit to medians!
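The slide gives only the phrase "least squares fit to medians", so the following is a minimal sketch of one way to do it, under stated assumptions: points are grouped into equal-width buckets along log x, each bucket is reduced to its median point, and a line is fitted to those medians in log-log space. The function name `fit_power_law_medians` and the `n_bins` parameter are hypothetical and not taken from the slides.

```python
import numpy as np

def fit_power_law_medians(x, y, n_bins=10):
    """Fit y ~ C * x**theta by least squares on per-bucket medians in log-log space.

    Assumption: buckets are equal-width in log(x); the slides do not specify this.
    Needs at least two non-empty buckets.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = (x > 0) & (y > 0)                      # logarithms need positive values
    lx, ly = np.log(x[keep]), np.log(y[keep])

    # Assign every point to one of n_bins buckets along log(x).
    edges = np.linspace(lx.min(), lx.max(), n_bins + 1)
    bucket = np.clip(np.digitize(lx, edges) - 1, 0, n_bins - 1)

    # Reduce each non-empty bucket to its median point.
    med_x, med_y = [], []
    for b in range(n_bins):
        in_bucket = bucket == b
        if in_bucket.any():
            med_x.append(np.median(lx[in_bucket]))
            med_y.append(np.median(ly[in_bucket]))

    # Ordinary least squares line through the medians: slope = theta, intercept = ln C.
    theta, log_c = np.polyfit(med_x, med_y, 1)
    return float(np.exp(log_c)), float(theta)
```

Fitting to medians rather than to all points makes the line less sensitive to the extreme nodes that the method is later supposed to flag.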
Problem revisited Q1. Given a weighted and unlabeled graph, how can we spot strange, abnormal, extreme nodes? Q2. Can we explain why the spotted nodes are anomalous?
Problem sketch
Main idea For each node, P.1) extract ‘ego-net’ (=1-step-away neighbors) P.2) extract features (#edges, total weight, etc.) P.3) extract patterns (norms) P.4) anomaly detection: compare with the rest of the population
Outline 1. Motivation 2. Preliminaries and Problem Definition 3. Proposed Method • Study of ego-nets • Laws and Observations • Anomaly detection 4. Datasets 5. Experiments 6. Discussion & Conclusion
P.1 What is an egonet? [Figure: an ego node and its ego-net]
What is odd?
What is “anomalous”? Near-star: telemarketer, port scanner, people adding friends indiscriminately, etc. Near-clique: tightly connected people, terrorist groups?, discussion group, etc.
What is “anomalous”? Heavy vicinity: too much money wrt number of accounts, high donation wrt number of donors, etc. Dominant heavy link: single-minded, tight company
P.2 What features should we extract in order to project nodes into a low-dimensional space? • Features that could yield “laws” • Features that are easy to compute and interpret
Selected Features • $N_i$: number of neighbors (degree) of ego $i$ • $E_i$: number of edges in egonet $i$ • $W_i$: total weight of egonet $i$ • $\lambda_{w,i}$: principal eigenvalue of the weighted adjacency matrix of egonet $i$
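As an illustration of the four selected features, here is a minimal sketch that computes them for one ego. It assumes an undirected graph given as a dense, symmetric, non-negative weighted adjacency matrix; the function name `egonet_features` is hypothetical and this is not the authors' implementation.

```python
import numpy as np

def egonet_features(A, i):
    """Return (N_i, E_i, W_i, lambda_{w,i}) for ego i.

    Assumes A is a symmetric, non-negative weighted adjacency matrix
    of an undirected graph; self-loops are ignored.
    """
    A = np.asarray(A, dtype=float)
    nbrs = np.flatnonzero(A[i])
    nbrs = nbrs[nbrs != i]                       # drop a possible self-loop
    nodes = np.concatenate(([i], nbrs))          # ego plus its 1-step neighbors

    sub = A[np.ix_(nodes, nodes)].copy()         # weighted adjacency of the egonet
    np.fill_diagonal(sub, 0.0)
    upper = np.triu(sub, k=1)                    # count each undirected edge once

    n_i = len(nbrs)                              # number of neighbors
    e_i = int(np.count_nonzero(upper))           # number of edges in the egonet
    w_i = float(upper.sum())                     # total weight of the egonet
    lam = float(np.linalg.eigvalsh(sub)[-1])     # eigvalsh sorts ascending: last is largest
    return n_i, e_i, w_i, lam
```

For a star-shaped egonet with unit weights, this returns a principal eigenvalue equal to the square root of the number of neighbors.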
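details [Figure: example egonets annotated with how $\lambda_{w,i}$ relates to the other features: $\lambda_{w,i} = \sqrt{N} = \sqrt{E} = \sqrt{W}$; $\lambda_{w,i} > \sqrt{N} \sim \sqrt{E}, \sqrt{W}$; λw,i √W; $\lambda_{w,i} = N \approx \sqrt{W}$; $\lambda_{w,i} = W$; $\lambda_{w,i} \approx W$] N: #neighbors, W: total weight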
Other Features • $S_i$: number of singleton neighbors (neighbors with degree 1) of ego $i$ • $\max(W_i)$: maximum edge weight in egonet $i$ • $\max(W_i, d=1)$: maximum edge weight to/from a degree-1 neighbor of ego $i$ • $\max(d_i)$: maximum degree of the neighbors of ego $i$ • 2-step neighborhood features
Outline 1. Motivation 2. Preliminaries 3. Proposed Method • Study of egonets • Laws and Observations • Anomaly detection 4. Datasets 5. Experiments 6. Discussion & Conclusion
P.3 What patterns? Observation 1: Egonet Density Power Law (EDPL) Q1: How does the number of neighbors N of the egonet relate to the number of edges E?
Observation 1: Egonet Density Power Law (EDPL) $E_i \propto N_i^{\alpha}$, $1 \le \alpha \le 2$
P.3 What patterns? Observation 2: Egonet Weight Power Law (EWPL) Q2: How does the total weight W of the egonet relate to the number of edges E?
Observation 2: Egonet Weight Power Law (EWPL) $W_i \propto E_i^{\beta}$, $\beta \ge 1$
P.3 What patterns? Observation 3: Egonet λw Power Law (ELWPL) Q3: How does the largest eigenvalue λw of the weighted adjacency matrix of the egonet relate to the total weight W?
Observation 3: Egonet λw Power Law (ELWPL) $\lambda_{w,i} \propto W_i^{\gamma}$, $0.5 \le \gamma \le 1$
Outline 1. Motivation 2. Preliminaries 3. Proposed Method • Study of egonets • Laws and Observations • Anomaly detection 4. Datasets 5. Experiments 6. Discussion & Conclusion
P.4 Anomaly detection Anomaly ≈ • violates our “laws” • too far away from the rest of the points
score_dist = distance to fitting line • score_outl = outlierness score • score = func(score_dist, score_outl) • can tell what kind of anomaly a node is • can sort nodes wrt their outlierness scores
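The slides leave both the distance and the combining function unspecified, so the sketch below fills them in with assumptions: the distance to the fitted power law $y = C x^{\theta}$ is taken as the ratio between observed and expected value, scaled by the log of their absolute difference, and the hypothetical `combine_scores` simply adds the two scores after dividing each by its maximum. The outlierness score is taken as an input from any outlier detector of the reader's choice.

```python
import numpy as np

def score_dist(x, y, c, theta):
    """Deviation of each point (x, y) from the fitted law y = c * x**theta.

    Assumed form (not given on the slide): ratio of the larger to the smaller
    of observed and expected value, times log(|difference| + 1).
    Requires positive y and positive expected values.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    expected = c * x ** theta
    ratio = np.maximum(y, expected) / np.minimum(y, expected)
    return ratio * np.log(np.abs(y - expected) + 1.0)

def combine_scores(dist, outl):
    """Hypothetical func(score_dist, score_outl): sum of max-normalized scores."""
    dist = np.asarray(dist, dtype=float)
    outl = np.asarray(outl, dtype=float)

    def normalize(s):
        m = s.max()
        return s / m if m > 0 else s

    return normalize(dist) + normalize(outl)
```

A point lying exactly on the fitted line gets a distance score of zero, and sorting nodes by the combined score in descending order gives a ranking of candidate anomalies.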
Outline 1. Motivation 2. Preliminaries 3. Proposed Method • Study of egonets • Laws and Observations • Anomaly detection 4. Datasets 5. Experiments 6. Discussion & Conclusion
Datasets
Bipartite networks:     |N|     |E|
  1. Don2Com            1.6M    2M
  2. Com2Cand           6K      125K
  3. Auth2Conf          421K    1M
Unipartite networks:    |N|     |E|
  5. BlogNet            27K     126K
  6. PostNet            223K    217K
  7. Enron              36K     183K
  8. Oregon             11K     38K
Outline 1. Motivation 2. Preliminaries 3. Proposed Method • Study of egonets • Laws and Observations • Anomaly detection 4. Datasets 5. Experiments 6. Discussion & Conclusion
Experimental Results
Near-Clique/Star
Near-Clique/Star
Experimental Results
Heavy Vicinity
Heavy Vicinity
Experimental Results
Dominant Heavy Link [Figure labels: $87M - DNC, $25M - RNC]
Dominant Heavy Link
Experimental Results
Outline 1. Motivation 2. Preliminaries 3. Proposed Method • Study of egonets • Laws and Observations • Anomaly detection 4. Datasets 5. Experiments 6. Discussion & Conclusion
Scalability • Counting the number of edges in egonets for ALL nodes is expensive: it requires scanning connections for all pairs of neighbors! • Can be reworded as counting local triangles • A fast method [Tsourakakis, 08] exists! IDEA: • #triangles through node $i$ = (# closed paths of length 3 through $i$) / 2 • # closed paths of length 3 for node $i$ = $(A^3)_{ii}$ • Computing $A^3$ is still expensive! • Low-rank approximation!
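The triangle identity on this slide can be checked directly on a small graph. The sketch below is an exact, dense-matrix illustration and not the fast method the slide cites; it assumes an undirected graph with a symmetric adjacency matrix and ignores weights. It also uses the fact that every edge between two neighbors of an ego closes a triangle with the ego, so the egonet edge count is the degree plus the local triangle count. The function name `egonet_edge_counts` is hypothetical.

```python
import numpy as np

def egonet_edge_counts(A):
    """Return (degree, triangles, egonet_edges) per node of an undirected graph.

    Assumes A is symmetric; weights and self-loops are dropped.
    """
    B = (np.asarray(A) != 0).astype(np.int64)    # unweighted adjacency
    np.fill_diagonal(B, 0)

    degree = B.sum(axis=1)
    # diag(B^3)_i = sum_j (B^2)_ij * B_ji, computed without forming B^3.
    closed_paths3 = ((B @ B) * B.T).sum(axis=1)
    triangles = closed_paths3 // 2               # each triangle is walked in 2 directions

    # Egonet edges: ego-to-neighbor edges plus neighbor-to-neighbor edges.
    egonet_edges = degree + triangles
    return degree, triangles, egonet_edges
```

On a triangle with one extra pendant node attached, the node carrying the pendant has degree 3, one triangle, and therefore 4 egonet edges.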
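details Low-rank approximation: $A \approx U S U^T$ and $A^3 \approx U S^3 U^T$, with $A$ of size $n \times n$, $U$ of size $n \times k$, $S$ of size $k \times k$, and $U^T$ of size $k \times n$. Cost of $A^3$: $O(n^3)$, reduced to about $O(nk^2)$. • Prune d=1 nodes • Prune d=2 as well as d=1 nodes: smaller & sparser A matrix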
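To make the low-rank idea concrete, here is a minimal sketch that approximates the diagonal of $A^3$ from the $k$ eigenpairs of largest magnitude, using $(A^3)_{ii} \approx \sum_{j \le k} \lambda_j^3 u_{ij}^2$. For brevity it computes a full dense eigendecomposition with NumPy and then keeps the top $k$ pairs, which defeats the purpose at scale; a real implementation would compute only the top $k$ pairs with a sparse eigensolver. The function name `approx_triangles` is hypothetical.

```python
import numpy as np

def approx_triangles(A, k):
    """Approximate per-node triangle counts from the top-k eigenpairs.

    Assumes A is the symmetric adjacency matrix of an undirected graph;
    weights and self-loops are dropped. With k = n the result is exact.
    """
    B = (np.asarray(A) != 0).astype(float)
    np.fill_diagonal(B, 0.0)

    vals, vecs = np.linalg.eigh(B)               # full decomposition (demo only)
    top = np.argsort(-np.abs(vals))[:k]          # k eigenvalues of largest magnitude
    lam, U = vals[top], vecs[:, top]

    diag_a3 = (U ** 2) @ (lam ** 3)              # approximates diag(B^3)
    return diag_a3 / 2.0                         # closed 3-paths -> triangles
```

Smaller values of `k` trade accuracy for speed, which is the trade-off the accuracy-versus-time slides examine.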
Scalability – time vs. size • Time vs. number of edges. • Effect of pruning on computation time. Solid (–): no pruning, Dashed (−−): pruning nodes w/ d ≤ 1, Dotted (…): pruning nodes w/ d ≤ 2 • Computation time increases linearly with the number of edges and decreases with pruning.
Scalability – accuracy vs time • Time vs. accuracy. • Effect of pruning on the accuracy of recovering the top anomalies of the original (unpruned) ranking. • New rankings are scored using Normalized Cumulative Discounted Gain. • Pruning reduces time for both Node-Iterator and Eigen-Triangle while keeping accuracy as high as ~1 and ~.9, respectively.
Conclusion • OddBall, a fast, unsupervised method to detect abnormal nodes in weighted graphs. • Study of egonets; list of numerical features • Discovery of new patterns in density (Obs.1: EDPL), weights (Obs.2: EWPL), and principal eigenvalues (Obs.3: ELWPL). • Speed-up in feature extraction, with accuracy ~.9 • Experiments on real graphs of over 1M nodes, which reveal strange/extreme nodes from many different domains Software available online! http://www.cs.cmu.edu/~lakoglu/#tools