On Triangulation-based Dense Neighbourhood Graph Discovery

School of Computing, National University of Singapore

Outline • Motivation • Related Work • Terms Definition • Triangulation based DN-graph mining • Semi-streaming DN-graph model • Experimental Study • Future Work and Conclusion

Motivation • Define a dense graph pattern that accounts for both the size of the substructure and the minimum level of interaction between its vertices. • Locate dense patterns in large-scale graphs under restricted resources, where the problem is otherwise unsolvable.

Related Work • Other dense patterns: clique/quasi-clique, high-degree patterns, dense bipartite patterns, heavy patterns • Triangle counting • CSV: density-based closed clique discovery with a linear visualization.

Terms Definition

Terms Definition (cont’d)

DN-graph • [Figure: graph G with labelled vertices a and b] • Proof

DN-graph and Other Dense Patterns • Quasi-clique • Closed clique (a maximal clique) • DN-graph

DN-graph and Closed Clique Proof

Computation Bottleneck in DN-graph Mining • Most subgraphs are not DN-graphs. • Most of these operations are redundant.

How to tackle the bottleneck? • Reduce the number of joins • Local maximal feature: two DN-graphs share no edge. • A DN-graph consists of all edges that share common vertices and have locally maximal λ values. • Locating a DN-graph using λ(e) values • All edges within a DN-graph have the same λ(e), denoted λmax. • All edges connecting to neighbouring vertices have a smaller λ value: λ(e) = λ(u,v) < λmax, where u is not in G' and v is in G'. • Use approximation methods to compute λ(e) efficiently.
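The two λ conditions listed on this slide can be checked mechanically for a candidate vertex set. The sketch below is only an illustration of those two conditions, not the mining algorithm itself; the function name `looks_like_dn_graph`, the dictionary layout for λ values, and the example values are assumptions made for this example.

```python
def looks_like_dn_graph(candidate, lam):
    """Check the two slide conditions for a candidate vertex set G'.

    candidate: set of vertices (the hypothetical G').
    lam: dict mapping frozenset({u, v}) -> lambda value of edge (u, v).
    """
    # Edges with both endpoints inside the candidate set.
    inside = [val for e, val in lam.items() if e <= candidate]
    # Edges with exactly one endpoint inside (u not in G', v in G').
    boundary = [val for e, val in lam.items() if len(e & candidate) == 1]
    if not inside:
        return False
    lam_max = inside[0]
    # Condition 1: every internal edge has the same value, lambda_max.
    # Condition 2: every boundary edge has a strictly smaller value.
    return (all(val == lam_max for val in inside)
            and all(val < lam_max for val in boundary))


# Hypothetical lambda values: a 4-clique {0, 1, 2, 3} plus vertex 4 attached to 0 and 1.
lam = {frozenset(e): 2 for e in [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]}
lam[frozenset((0, 4))] = 1
lam[frozenset((1, 4))] = 1
print(looks_like_dn_graph({0, 1, 2, 3}, lam))  # True
print(looks_like_dn_graph({0, 1, 4}, lam))     # False
```

The check is necessary-condition style only: it assumes the λ values are already known, which is exactly the part the remaining slides try to approximate efficiently.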

Graph Triangulation • Given a triangle in the graph, the upper bounds of two of its edges can be used to tighten the density estimate of the third edge. • Example (figure): triangle (u, v, w) with λ(u,v) = 5, λ(u,w) = 3, λ(w,v) = 3.
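A small worked check of the triangle example, under an assumed reading of the tightening rule: the third vertex w can count towards edge (u, v) at its current bound only if both other edges of the triangle have bounds at least as large. The slide does not spell out the exact rule, so the function `supports` and its comparison are assumptions for illustration.

```python
def supports(bound_uv, bound_uw, bound_wv):
    """Assumed rule: w supports edge (u, v) at its current bound only if
    neither of the other two triangle edges has a smaller bound."""
    return min(bound_uw, bound_wv) >= bound_uv


# Values from the slide: lambda(u,v) = 5, lambda(u,w) = 3, lambda(w,v) = 3.
print(supports(5, 3, 3))  # False: w cannot support (u, v) at bound 5
print(supports(3, 5, 3))  # True: v can support (u, w) at bound 3
```

Under this reading, the bound of 5 on (u, v) is too optimistic as far as this triangle is concerned, which is the sense in which the other two edges tighten the estimate.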

Triangulation Based DN-graph Mining • DN-graph Mining Algorithm • Step One: Sort vertices according to their degrees. • Step Two: Generate triangles in a streaming fashion. • Step Three: Obtain the local density information gradually along the triangle stream. • Initial Upper Bound: TC(e), the number of triangles in which edge e participates.
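The initial upper bound TC(e) can be sketched with a standard degree-ordered triangle enumeration. This is a minimal in-memory illustration of Steps One and Two and of TC(e), not the authors' implementation; the function name `triangle_counts` and the edge-list input format are assumptions, and vertex labels are assumed to be mutually comparable (for example, all integers).

```python
from collections import defaultdict


def triangle_counts(edges):
    """Return TC(e) for every edge: the number of triangles containing e."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Step One: rank vertices by degree (ties broken by label).
    ordered = sorted(adj, key=lambda x: (len(adj[x]), x))
    rank = {v: i for i, v in enumerate(ordered)}

    # Orient every edge from the lower-ranked to the higher-ranked endpoint.
    out = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}

    tc = {frozenset(e): 0 for e in edges}
    # Step Two: each triangle is generated exactly once, from its lowest-ranked vertex.
    for u in adj:
        for v in out[u]:
            for w in out[u] & out[v]:
                for a, b in ((u, v), (u, w), (v, w)):
                    tc[frozenset((a, b))] += 1
    return tc


# A 4-clique on {0, 1, 2, 3} with a pendant edge (3, 4).
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
tc = triangle_counts(edges)
print(tc[frozenset((0, 1))])  # 2
print(tc[frozenset((3, 4))])  # 0
```

TC(e) equals the number of common neighbours of the edge's endpoints, so it can only over-estimate the density value that later steps refine.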

Counting of Supporting Nodes [Figure: vertices a and b with candidate supporting nodes n1 to n4 and their edge values; one node is labelled "Not Supporting Node"; the panel ends with the result "= 4".]

Convergence [Figure: λ values on the edges among vertices V1 to V6, shown at Initialization, First Iteration, Second Iteration and Converge; panel annotations: "Two Support Vertices", "One Support Vertex".] • λ(V2V3) decreases by one • λ(V3V6) decreases by one • λ(V2V6) decreases by one • The local maximal neighborhood size λ = 2
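The iterate-until-stable behaviour shown on this slide can be sketched as follows. This is an assumed reading of the refinement, not the published algorithm: each edge starts at TC(e), a common neighbour w counts as a supporting vertex of (u, v) only if both (u, w) and (v, w) currently have bounds at least as large, and an edge with fewer supporting vertices than its bound is decreased by one per iteration. The name `refine_bounds` and the simple in-memory graph are assumptions; the input is assumed to be a simple graph without self-loops.

```python
from collections import defaultdict


def refine_bounds(edges):
    """Iteratively lower per-edge upper bounds until no edge changes."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def key(a, b):
        return frozenset((a, b))

    # Initial upper bound: TC(e), the number of common neighbours.
    bound = {key(u, v): len(adj[u] & adj[v]) for u, v in edges}

    iterations = 0
    changed = True
    while changed:
        changed = False
        new_bound = dict(bound)  # update from the previous iteration's values
        for e, b in bound.items():
            u, v = tuple(e)
            support = sum(
                1 for w in adj[u] & adj[v]
                if bound[key(u, w)] >= b and bound[key(v, w)] >= b
            )
            if support < b:
                new_bound[e] = b - 1  # decreases by one
                changed = True
        bound = new_bound
        iterations += 1
    return bound, iterations


# A 4-clique on {0, 1, 2, 3} plus vertex 4 attached to 0 and 1.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4), (1, 4)]
bound, iterations = refine_bounds(edges)
print(bound[frozenset((0, 1))])  # 2 (started at 3)
print(iterations)                # 2
```

Bounds are non-negative integers that never increase, so the loop always terminates; in the example, edge (0, 1) starts at 3 and settles at 2, matching the value of the other clique edges.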

Semi-Streaming Graph Model • Graph vertices fit into main memory, while edges reside in secondary storage in the form of adjacency lists. • Primary storage (i.e. memory) allows random access; secondary storage allows only sequential access. • A feasible solution for a streaming graph G(V,E) should not exceed log |V| scans of G's adjacency list.
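The access pattern of this model can be sketched as a driver loop: per-vertex state lives in memory, the adjacency list is only read front to back, and the number of passes is capped. The function `semi_streaming_passes`, its callback parameters, and the choice of base-2 logarithm for the pass budget are assumptions for illustration; the slide only says log |V|.

```python
import math


def semi_streaming_passes(num_vertices, scan_adjacency, update, converged):
    """Run sequential passes over the adjacency list under a pass budget.

    scan_adjacency: callable returning a fresh iterator of (vertex, neighbours),
                    standing in for one sequential read of secondary storage.
    update:         callable(state, vertex, neighbours) that mutates in-memory state.
    converged:      callable(state) -> bool, checked after each full pass.
    """
    state = {}  # per-vertex state, small enough to stay in main memory
    # Assumed budget: ceil(log2 |V|) passes, at least one.
    budget = max(1, math.ceil(math.log2(max(num_vertices, 2))))
    passes = 0
    while passes < budget:
        for vertex, neighbours in scan_adjacency():  # sequential access only
            update(state, vertex, neighbours)
        passes += 1
        if converged(state):
            break
    return state, passes


# Usage: one pass is enough to record vertex degrees.
adjacency = {0: [1, 2], 1: [0], 2: [0]}
state, passes = semi_streaming_passes(
    3,
    lambda: iter(adjacency.items()),
    lambda s, v, nbrs: s.__setitem__(v, len(nbrs)),
    lambda s: True,
)
print(state, passes)  # {0: 2, 1: 1, 2: 1} 1
```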

DN-graph Mining in the Semi-streaming Model • Estimate the shared-neighbour size using the min-wise independent set property. • Min-wise independent set property: let A and B be two sets over a totally ordered universe X, and let π be a permutation of X chosen uniformly at random. Then the probability that min(π(A)) = min(π(B)) equals the Jaccard coefficient J(A, B) = |A ∩ B| / |A ∪ B|. • With A and B taken as the neighbour sets of two vertices, this yields an estimate of the shared-neighbour size |A ∩ B|.
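The estimate can be sketched with min-hash signatures. Since |A ∪ B| = |A| + |B| − |A ∩ B|, the Jaccard coefficient rearranges to |A ∩ B| = J / (1 + J) · (|A| + |B|), so an estimate of J plus the two set sizes gives the shared-neighbour size. In the sketch below the random permutations are approximated by salted BLAKE2b hashes, which is a substitution made for this example rather than something the slides specify; the names `minhash_signature`, `estimate_jaccard` and `estimate_shared`, and the signature length of 256, are likewise assumptions.

```python
import hashlib


def _hash(seed, item):
    # Salted hash standing in for one random permutation of the universe.
    return hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()


def minhash_signature(items, k):
    """For each of k hash functions, keep the minimum hash over the set.
    Each minimum can be maintained in one sequential pass over the items."""
    return [min(_hash(seed, x) for x in items) for seed in range(k)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two minima agree, an estimate of J(A, B)."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def estimate_shared(size_a, size_b, jaccard):
    """|A ∩ B| = J / (1 + J) * (|A| + |B|)."""
    return jaccard / (1 + jaccard) * (size_a + size_b)


# Two neighbour sets with 50 shared vertices (true J = 50 / 150).
A = set(range(0, 100))
B = set(range(50, 150))
j = estimate_jaccard(minhash_signature(A, 256), minhash_signature(B, 256))
print(round(estimate_shared(len(A), len(B), j), 1))  # roughly 50; varies with k
```

A longer signature reduces the variance of the estimate at the cost of more memory per vertex, which matters when all vertex state must fit in main memory.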

Experimental Setting • Quad-Core AMD Opteron(tm) processor 8356 • 128GB memory • 700 GB hard disk • OS: Windows Server 2003

Experimental Study • Comparison with CSV on Stock Market Dataset

Convergence • Dataset: Flickr graph (1.7 million vertices and 22.6 million edges) • Running time per iteration is between 55 minutes and 1 hour.

StreamDN Performance on Flickr Dataset • StreamDN over-estimates the BiTriDN algorithm's results by 72% during the first 66 scans. • StreamDN can handle the streaming setting with reasonable accuracy.

DN-graph Semantics in Various Domains

Future Work and Conclusion • DN-graph • DN-graph Mining Problem • Semi-streaming Approach • Future Work

Thank You & Questions

Reference
• [WSTT08] N. Wang, P. Srinivasan, K.-L. Tan, and A.K.H. Tung. CSV: visualizing and mining cohesive subgraphs. In SIGMOD'08, pages 445–458, 2008.
• [WZTT11] N. Wang, J. Zhang, K.-L. Tan, and A.K.H. Tung. On triangulation-based dense neighbourhood graph discovery. In VLDB'11, volume 4, 2011.
• [ABC+04] P. Aloy, B. Böttcher, H. Ceulemans, C. Leutwein, C. Mellwig, S. Fischer, and A.C. Gavin. Structure-based assembly of protein complexes in yeast. Volume 303, pages 2026–2029, 2004.
• [ATH03] I. Akihiro, W. Takashi, and M. Hiroshi. Complete mining of frequent patterns from graphs: Mining graph data. Volume 50, pages 321–354, Hingham, MA, USA, 2003. Kluwer Academic Publishers.
• [BBP06] V. Boginski, S. Butenko, and P.M. Pardalos. Mining market data: a network approach. Computers and Operations Research, 33(11):3171–3184, 2006.
• [GRT05] D. Gibson, K. Ravi, and A. Tomkins. Discovering large dense subgraphs in massive graphs. In VLDB'05, pages 721–732, Trondheim, Norway, 2005.
• [Bla94] R.E. Blake. Partitioning graph matching with constraints. Volume 27, pages 439–446, 1994.

Reference (cont.)
• [DT99] L. Dehaspe and H. Toivonen. Discovery of frequent datalog patterns. Data Mining and Knowledge Discovery, 3:7–36, 1999.
• [HCD94] L. Holder, D. Cook, and S. Djoko. Substructure discovery in the SUBDUE system. In Proceedings of the Workshop on Knowledge Discovery in Databases, pages 169–180, 1994.
• [MARW90] E.M. Mitchell, P.J. Artymiuk, D.W. Rice, and P. Willett. Use of techniques derived from graph theory to compare secondary structure motifs in proteins. Journal of Molecular Biology, 212:151–166, 1990.
• [MK01] K. Michihiro and G. Karypis. Frequent subgraph discovery. In ICDM'01, pages 313–320, 2001.
• [RRRT99] K. Ravi, Prabhakar R., Sridhar R., and A. Tomkins. Trawling the web for emerging cyber-communities. In Computer Networks, pages 1481–1493, 1999.
• [SK98] A. Srivastav and W. Katja. Finding dense subgraphs with semidefinite programming. In APPROX'98, pages 181–191, London, UK, 1998. Springer-Verlag.
• [ZWZK] Z. Zeng, J. Wang, L. Zhou, and G. Karypis. Coherent closed quasi-clique discovery from large dense graph databases. In KDD'06, Philadelphia, USA.

Proof: A DN-graph is a local maximum graph

Proof: DN-graph and Closed Clique

