
TopK Interesting Subgraph Discovery in Information Networks



Presentation Transcript


  1. TopK Interesting Subgraph Discovery in Information Networks. Manish Gupta, Jing Gao, Xifeng Yan, Hasan Cam, Jiawei Han. gmanish@microsoft.com

  2. Real World Problems
     • Network Bottlenecks Discovery (Computer Networks): Interestingness = Lowest Bandwidth
     • Team Selection (Organization Networks): Interestingness = Highest Historical Compatibility
     • Suspicious Relationships Discovery (Social Networks): Interestingness = Highest Negative Association Strength of Attribute Values
     • Resource Allocation (Battlefield Networks): Interestingness = Lowest Distance between Entities

  3. The Basic Underlying Problem
     • Applications
       • Team Selection: Interestingness = Highest Historical Compatibility
       • Network Bottlenecks Discovery: Interestingness = Lowest Bandwidth
       • Suspicious Relationships Discovery: Interestingness = Highest Negative Association Strength
       • Resource Allocation: Interestingness = Lowest Distance
     • Given
       • Edge-weighted Typed Network G
       • Typed Subgraph Query Q
       • Edge Interestingness measure
     • Find
       • TopK matching subgraphs

  4. Naïve Solution: Ranking After Matching
     [Figure: Network G and Query Q; the Matching step enumerates all matching subgraphs, and the Ranking step sorts them by score.]
     • Why compute all matches? We need only top-2!
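The ranking-after-matching baseline can be sketched in a few lines. This is only an illustration under stated assumptions: the toy network, its weights, and the typed path query below are hypothetical and are not the graph from the slide, and the score of a match is assumed to be the sum of its edge weights.

```python
from itertools import permutations

# Hypothetical toy network: node -> type, plus undirected weighted edges.
node_type = {1: "B", 2: "A", 3: "A", 4: "A", 5: "B"}
weight = {
    frozenset({1, 2}): 0.8,
    frozenset({2, 3}): 0.7,
    frozenset({3, 4}): 0.5,
    frozenset({3, 5}): 0.9,
    frozenset({2, 4}): 0.2,
}

# Hypothetical query: a typed path B - A - A, given as node types
# and edges between query positions.
query_types = ["B", "A", "A"]
query_edges = [(0, 1), (1, 2)]

def rank_after_matching(k):
    """Enumerate every match, score it, and only then keep the best k."""
    matches = []
    for nodes in permutations(node_type, len(query_types)):
        # The node types must agree with the query position by position.
        if any(node_type[n] != t for n, t in zip(nodes, query_types)):
            continue
        edges = [frozenset({nodes[i], nodes[j]}) for i, j in query_edges]
        # Every query edge must exist in the network.
        if all(e in weight for e in edges):
            matches.append((sum(weight[e] for e in edges), nodes))
    matches.sort(reverse=True)
    return matches[:k]

top2 = rank_after_matching(2)
```

All four matches are scored here even though only two are returned, which is the cost the slide objects to.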

  5. Our Contributions
     • New notion: TopK interesting subgraph detection in information networks
     • Three new low-cost indexes
       • Graph topology index
       • Sorted edge lists
       • Graph maximum metapath weight index
     • Novel top-K algorithm to answer interestingness queries on large graphs
     • Detailed effectiveness and efficiency validation on several synthetic and real datasets

  6. Relationship with Previous Work
     • Subgraph matching
       • Approximate: fuzzy node/edge similarity
       • Exact: matching without ranking
       • RDF graphs, probabilistic graphs, temporal graphs
     • TopK querying on graphs
       • H-hop aggregate queries
       • Keyword queries on RDF graphs
       • K most frequent patterns
       • Twig queries

  7. System Overview
     • Offline Index Construction (inputs: Network G, Distance D)
       • Sort Edges: produces the Sorted Edge Lists
       • Breadth First Traversal from each Node up to Distance D: produces the Graph Topology Index
       • Graph Maximum MetaPath Weight Index
     • Online Query Processing (input: Query Q)
       • Find Candidate Nodes: produces the Candidate Nodes
       • Top-K Computation: produces the Top-K Subgraphs

  8. Index Structures
     • Notation: G=(V,E), B = average number of neighbors, T = number of types
     [Figure: example Network G with typed nodes (A, B, C) and weighted edges]
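Two of the offline indexes can be sketched as follows. The transcript does not spell out their contents, so this assumes that the sorted edge lists group edges by the pair of endpoint types in descending weight order, and that the graph topology index stores, for every node, how many nodes of each type lie at each BFS distance up to D. The toy network and all function names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical toy network (not the graph from the slides).
node_type = {1: "B", 2: "A", 3: "A", 4: "A", 5: "B"}
edges = [(1, 2, 0.8), (2, 3, 0.7), (3, 4, 0.5), (3, 5, 0.9), (2, 4, 0.2)]

adj = defaultdict(list)
for u, v, w in edges:
    adj[u].append(v)
    adj[v].append(u)

def build_sorted_edge_lists():
    """Group edges by the sorted pair of endpoint types, heaviest first."""
    lists = defaultdict(list)
    for u, v, w in edges:
        key = tuple(sorted((node_type[u], node_type[v])))
        lists[key].append((w, u, v))
    for key in lists:
        lists[key].sort(reverse=True)
    return dict(lists)

def build_topology_index(max_dist):
    """For each node, count nodes of each type at BFS distance 1..max_dist."""
    index = {}
    for src in node_type:
        dist = {src: 0}
        queue = deque([src])
        counts = defaultdict(int)
        while queue:
            u = queue.popleft()
            if dist[u] == max_dist:
                continue  # do not expand beyond distance D
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    counts[(dist[v], node_type[v])] += 1
                    queue.append(v)
        index[src] = dict(counts)
    return index

sorted_edge_lists = build_sorted_edge_lists()
topology_index = build_topology_index(max_dist=2)
```

Under this assumption, a node can be filtered out as a candidate for a query node when its type counts at some distance are smaller than what the query requires around that query node.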

  9. Find Candidate Nodes
     [Figure: Query Q (nodes 1: B, 2: A, 3: A, 4: A), its Query Topology, and the Graph Topology Index]

  10. Finding and Scoring Matches: Key Idea
     [Flowchart: Top-K Computation for Query Q. Boxes: Start; Generate a Size-1 Candidate; Compute Actual and UB Score; TopK Quit?; Grow Candidates; Candidate Size == |Q|?; Update Heap (Top-K Heap); Compute Max UB Score; More valid edges?; Done!]

  11. Finding and Scoring Matches: Generating Size-1 Candidates
     [Figure: size-1 candidates generated for Query Q from edge (5,9), followed by candidate growth]
     • Special cases
       • Query edge with both endpoints of same type
       • Multiple query edges of the same type
     • Order in which edges are considered: (5,9), (3,4), (4,5), (2,3), (2,7), …
     • Decisions during candidate growth: Prune? Grow? Heapify? Discard?

  12. Finding and Scoring Matches: Actual Score and Upper Bound Score
     [Figure: candidate growth from edge (5,9), using the Useful Edge Lists]
     • Actual Score = 0.9
     • UB Score = 0.9 + UB(Non-Considered Edges) = 0.9 + (0.6 + 0.6) = 2.1
     • Partially grown candidate
       • Prune if UBScore < min(heap)
       • Grow otherwise
     • Fully grown candidate
       • Discard if UBScore < min(heap)
       • Update heap otherwise
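The prune/grow/discard/update rules can be sketched as a small decision function. This assumes candidate size and query size are counted in edges, and that the min(heap) threshold only applies once the heap already holds K entries; that second condition and the function names are assumptions for illustration, not taken from the slides.

```python
def upper_bound_score(actual_score, remaining_edge_bounds):
    """Actual score of the instantiated edges plus a bound per remaining edge."""
    return actual_score + sum(remaining_edge_bounds)

def decide(candidate_size, query_size, ub_score, heap, k):
    """Return the action for a candidate, given a min-heap of top-k scores.

    heap is a list in heapq order, so heap[0] is the smallest score kept.
    """
    # Assumption: until the heap holds k scores, nothing can be ruled out.
    threshold = heap[0] if len(heap) >= k else float("-inf")
    fully_grown = candidate_size == query_size
    if ub_score < threshold:
        return "discard" if fully_grown else "prune"
    return "update heap" if fully_grown else "grow"

# Numbers from the slide: one instantiated edge with weight 0.9 and two
# remaining query edges, each bounded by 0.6.
ub = upper_bound_score(0.9, [0.6, 0.6])

# Hypothetical heap contents for K = 2.
action = decide(candidate_size=1, query_size=3, ub_score=ub, heap=[2.0, 2.2], k=2)
```

With these hypothetical heap contents the candidate is grown, because its upper bound of 2.1 is not below the smallest score in the heap.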

  13. Finding and Scoring Matches: Global Top-K Quit
     [Figure: Network G and Query Q]
     • K = 2
     • TopK Heap: (4,3,2,7): 2.2, (3,4,5,6): 2.2
     • Stop: 0.7 + 0.6 + 0.7 = 2.0 < 2.2
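The stopping test on this slide amounts to one comparison. The sketch below assumes that each query edge contributes the weight of the next unprocessed edge in its sorted edge list, so their sum bounds the score of any candidate not yet generated; how those weights are selected is an assumption, and `can_quit` is a hypothetical name.

```python
def can_quit(next_weights, heap, k):
    """True if no unseen candidate can beat the current k-th best score."""
    if len(heap) < k:
        return False  # the heap is not full yet, so keep searching
    return sum(next_weights) < min(heap)

# Numbers from the slide: the bound is 0.7 + 0.6 + 0.7 = 2.0, and both
# heap entries have score 2.2, so the search stops.
stop = can_quit([0.7, 0.6, 0.7], heap=[2.2, 2.2], k=2)
```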

  14. Faster Query Processing using Graph Maximum MetaPath Weight Index
     [Figure: a query, its partial instantiation, and a partial candidate. Labels: Paths to cover Non-Considered Edges; Edges to Consider Separately; Slight complication]
     • Edge-based bound: UB Score = Actual Score(1-2) + UB(1-3) + UB(2-3) + UB(3-4) + UB(4-5)
     • Path-based bound: UB Score = Actual Score(1-2) + UB(1-3-4-5) + UB(2-3)
     • Path-based bound for a larger query, using the MMW index: UB Score = Actual Score(1-2) + UB(1-3-4-5-7) + UB(2-3) + UB(4-6) + UB(6-7)

  15. Faster Query Processing using Graph Maximum MetaPath Weight Index
     [Figure: partially grown candidate on edge (5,9), checked against the MMW Index]
     • K = 2
     • TopK Heap: (8,9,5,6): 2.1, (5,9,8,7): 2.0
     • Edge-based UBScore = 0.9 + 0.8 + 0.7 = 2.4 > 2.0: Grow
     • Path-based UBScore = 0.9 + UB(5-A-B) = 0.9 + 0.9 = 1.8 < 2.0: Prune
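The difference between the two bounds can be illustrated with a small sketch. It assumes that the MMW index stores, for each node and each metapath (sequence of node types) of up to D edges, the largest total weight of a simple path from that node following the metapath. The toy network and the names `max_metapath_weights` and `edge_based_bound` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy network (not the graph from the slides).
node_type = {1: "B", 2: "A", 3: "A", 4: "A", 5: "B"}
edges = [(1, 2, 0.8), (2, 3, 0.7), (3, 4, 0.5), (3, 5, 0.9), (2, 4, 0.2)]

adj = defaultdict(list)
for u, v, w in edges:
    adj[u].append((v, w))
    adj[v].append((u, w))

def max_metapath_weights(src, max_len):
    """Map each metapath starting at src to its heaviest simple path weight."""
    best = {}

    def extend(node, types, total, visited):
        if len(types) > 1:
            best[types] = max(best.get(types, 0.0), total)
        if len(types) - 1 == max_len:
            return  # the path already has max_len edges
        for nxt, w in adj[node]:
            if nxt not in visited:
                extend(nxt, types + (node_type[nxt],), total + w, visited | {nxt})

    extend(src, (node_type[src],), 0.0, {src})
    return best

def edge_based_bound(metapath):
    """Sum, per metapath edge, of the heaviest edge with that type pair anywhere."""
    heaviest = defaultdict(float)
    for u, v, w in edges:
        key = frozenset((node_type[u], node_type[v]))
        heaviest[key] = max(heaviest[key], w)
    return sum(heaviest[frozenset(pair)] for pair in zip(metapath, metapath[1:]))

mmw_node1 = max_metapath_weights(1, max_len=2)
path_bound = mmw_node1[("B", "A", "A")]         # heaviest B-A-A path from node 1
edge_bound = edge_based_bound(("B", "A", "A"))  # ignores where the edges are
```

On this toy graph the path-based bound from node 1 is 1.5 while the edge-based bound is 1.6, since the latter combines heavy edges that are not actually reachable from node 1. A tighter bound of this kind is what lets the slide's example prune a candidate that the edge-based bound would still grow.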

  16. Discussions
     • Queries with multiple edge semantics
     • Directed graphs
     • Homogeneous networks
     • Weighted query edges
       • Weights signify expected amount of interestingness
       • Weights signify importance of query edge
     • Faster computations versus index size

  17. Low-cost Index Structures

  18. Faster Query Execution
     [Figures: Query Execution Time (msec) for Path Queries, Clique Queries, and Subgraph Queries (Graph G2 and indexes with D=2)]
     • RAM: Ranking After Matching baseline
     • RWM0: without using the candidate node filtering
     • RWM1: without using the MMW index
     • RWM2: same as RWM1, without pruning any partially grown candidates
     • RWM3: same as RWM1, without the global top-K quit check
     • RWM4: same as RWM1, with the MMW index

  19. Good Scalability thanks to Effective Pruning
     [Figures: Running Time (msec) for Different Query Sizes and Graph Sizes (D=2); Number of Candidates as Percentage of Total Matches for Different Query Sizes and Candidate Sizes; Query Execution Time for Different Values of K]

  20. Real Dataset Case Studies
     [Figure: four query graphs]
     • Q1: 1: Author, 2: Conf, 3: Author
     • Q2: 1: Author, 2: Conf, 3: Author, 4: Keyword
     • Q3: 1: Person, 2: Film, 3: Person
     • Q4: 1: Person, 2: Company, 3: Person, 4: Settlement

  21. Real Dataset Case Studies
     • DBLP
       • 1: Rohit Gupta, 2: BICoB, 3: Vipin Kumar
         • Rohit Gupta -- computer networking
         • Vipin Kumar -- Data and Information Systems
         • BICoB -- International Conference on Bioinformatics and Computational Biology
       • 1: Jimeng Sun, 2: Operating Systems Review (SIGOPS), 3: Christos Faloutsos, 4: mining
         • Jimeng Sun and Christos Faloutsos -- Data and Information Systems, Artificial intelligence, and Computational biology
         • "mining" -- Data and Information Systems
         • "Operating Systems Review (SIGOPS)" -- Operating systems, Computer architecture, Computer networking

  22. Real Dataset Case Studies
     • Wikipedia
       • 1: Stacy Keach, 2: The Biggest Battle, 3: John Huston
         • Stacy Keach and John Huston starred in the movie "The Biggest Battle"
         • Stacy Keach (American), John Huston (American), movie is Italian
         • Stacy (narration, comedy, music), John (drama, documentary, adventure), movie (war)
       • 1: Medha Patkar, 2: BBC, 3: Felix D'Alviella, 4: Mogilino
         • Medha Patkar -- Indian social activist -- won Best International Political Campaigner by BBC
         • Felix D'Alviella -- Belgian actor in the BBC soap opera Doctors
         • Mogilino -- village in Bulgaria -- BBC showed the popular film "Bulgaria's Abandoned Children" in 2007
         • A British company rewarding an Indian woman, covering a place in Bulgaria, or being linked to a person from Belgium is rare

  23. Related Work (1)
     • Theory literature on subgraph isomorphism [Cordella et al., 2004; McKay, 1981; Ullmann, 1976]
     • Exact subgraph matching [Cheng et al., 2008; He and Singh, 2008; Sun et al., 2012; Zhang et al., 2007; Zhang et al., 2009; Zhao and Han, 2010; Zou et al., 2009]
     • Approximate subgraph matching [Zou et al., 2007; Zeng et al., 2012; Tian et al., 2007; Zhang et al., 2010]

  24. Related Work (2)
     • Matching in graph databases [Ranu and Singh, 2009; Yan et al., 2005; Zhu et al., 2012]
     • Matching for RDF graphs [Liu et al., 2012], probabilistic graphs [Yuan et al., 2012] and temporal graphs [Bogdanov et al., 2011]
     • Top-K queries
       • h-hop aggregate queries [Yan et al., 2010]
       • K most frequent patterns [Yang et al., 2012; Zhu et al., 2011]
       • Top-K keyword queries on RDF graphs [Tran et al., 2009]
       • Top-K similarity queries [Zou et al., 2007]
       • Twig queries [Gou and Chirkova, 2008]

  25. Conclusion
     • Given
       • Typed unweighted query
       • A heterogeneous edge-weighted information network
       • Edge interestingness measure
     • Find
       • Top-K interesting subgraphs
     • Investigated ranking after matching baseline
     • Proposed three new graph indexes and exploited them for building a top-K solution
     • Showed efficiency, scalability and effectiveness on multiple synthetic and real datasets

  26. Thanks! gmanish@microsoft.com
