TopK Interesting Subgraph Discovery in Information Networks
Manish Gupta, Jing Gao, Xifeng Yan, Hasan Cam, Jiawei Han
gmanish@microsoft.com

Real World Problems
• Network Bottlenecks Discovery (Computer Networks)
  • Interestingness = Lowest Bandwidth
• Team Selection (Organization Networks)
  • Interestingness = Highest Historical Compatibility
• Suspicious Relationships Discovery (Social Networks)
  • Interestingness = Highest Negative Association Strength of Attribute Values
• Resource Allocation (Battlefield Networks)
  • Interestingness = Lowest Distance between Entities

The Basic Underlying Problem
• Applications
  • Network Bottlenecks Discovery: Interestingness = Lowest Bandwidth
  • Team Selection: Interestingness = Highest Historical Compatibility
  • Suspicious Relationships Discovery: Interestingness = Highest Negative Association Strength
  • Resource Allocation: Interestingness = Lowest Distance
• Given
  • Edge-weighted Typed Network G
  • Typed Subgraph Query Q
  • Edge Interestingness measure
• Find
  • TopK matching subgraphs

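The problem inputs above can be sketched in Python as follows. This is a minimal illustration, not the authors' data model: the `TypedGraph` class, the `interestingness` helper, and the toy nodes and weights are all hypothetical. The score is assumed to be the sum of matched edge weights, which is how the worked examples later in the deck add them up.

```python
from dataclasses import dataclass, field


@dataclass
class TypedGraph:
    """Undirected, edge-weighted network whose nodes carry a type label."""
    node_type: dict = field(default_factory=dict)  # node id -> type label
    weight: dict = field(default_factory=dict)     # frozenset({u, v}) -> edge weight

    def add_node(self, node, type_label):
        self.node_type[node] = type_label

    def add_edge(self, u, v, w):
        self.weight[frozenset((u, v))] = w


def interestingness(graph, query_edges, mapping):
    """Assumed score: sum of the weights of the network edges a match uses."""
    return sum(graph.weight[frozenset((mapping[a], mapping[b]))]
               for a, b in query_edges)


# Hypothetical toy data: a B-A-A path in the network and the same shape as query.
g = TypedGraph()
for node, type_label in [(1, "B"), (2, "A"), (3, "A")]:
    g.add_node(node, type_label)
g.add_edge(1, 2, 0.8)
g.add_edge(2, 3, 0.7)

query_edges = [("q1", "q2"), ("q2", "q3")]
match = {"q1": 1, "q2": 2, "q3": 3}
print(round(interestingness(g, query_edges, match), 2))  # 1.5
```

A measure such as "lowest bandwidth" would need the weights transformed so that larger still means more interesting; the slides do not say how that is done.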
Naïve Solution: Ranking After Matching
[Figure: Network G and Query Q; Matching enumerates every matching subgraph, then Ranking orders them by score]
• Why compute all matches? We need only top-2!

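The baseline above can be sketched as a two-phase procedure: enumerate every match, then rank. The network, the query, and all helper names below are hypothetical stand-ins (this is not the graph drawn on the slide), and the score is assumed to be the sum of matched edge weights.

```python
import heapq

# Hypothetical toy network: node id -> type, and undirected weighted edges.
NODE_TYPE = {1: "B", 2: "A", 3: "A", 4: "A", 5: "A", 6: "B"}
WEIGHT = {(1, 2): 0.8, (2, 3): 0.7, (3, 4): 0.2, (2, 4): 0.6,
          (4, 5): 0.9, (3, 5): 0.5, (6, 5): 0.4}

# Hypothetical query: path q1(B) - q2(A) - q3(A).
QUERY_TYPE = {"q1": "B", "q2": "A", "q3": "A"}
QUERY_EDGES = [("q1", "q2"), ("q2", "q3")]


def edge_weight(u, v):
    """Weight of the undirected edge u-v, or None if it does not exist."""
    return WEIGHT.get((u, v), WEIGHT.get((v, u)))


def all_matches():
    """Backtracking over type-preserving, injective node assignments."""
    order = list(QUERY_TYPE)

    def extend(mapping):
        if len(mapping) == len(order):
            yield dict(mapping)
            return
        q = order[len(mapping)]
        for node, type_label in NODE_TYPE.items():
            if type_label != QUERY_TYPE[q] or node in mapping.values():
                continue
            mapping[q] = node
            # Every query edge whose endpoints are both assigned must exist in G.
            if all(edge_weight(mapping[a], mapping[b]) is not None
                   for a, b in QUERY_EDGES if a in mapping and b in mapping):
                yield from extend(mapping)
            del mapping[q]

    yield from extend({})


def score(mapping):
    return sum(edge_weight(mapping[a], mapping[b]) for a, b in QUERY_EDGES)


def ranking_after_matching(k):
    """Phase 1: materialise all matches. Phase 2: keep the k best."""
    matches = list(all_matches())
    return heapq.nlargest(k, ((score(m), m) for m in matches),
                          key=lambda pair: pair[0])


for s, m in ranking_after_matching(2):
    print(round(s, 2), m)
```

On this toy input four matches are enumerated even though only two are returned, which is the waste the slide points at.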
Our Contributions
• New notion: TopK interesting subgraph detection in information networks
• Three new low-cost indexes
  • Graph topology index
  • Sorted edge lists
  • Graph maximum metapath weight index
• Novel top-K algorithm to answer interestingness queries on large graphs
• Detailed effectiveness and efficiency validation on several synthetic and real datasets

Relationship with Previous Work
• Subgraph matching
  • Approximate: fuzzy node/edge similarity
  • Exact: Matching without ranking
  • RDF graphs, probabilistic graphs, temporal graphs
• TopK querying on graphs
  • H-hop aggregate queries
  • Keyword queries on RDF graphs
  • K most frequent patterns
  • Twig queries

System Overview
• Offline Index Construction (inputs: Network G, Distance D)
  • 1: Sort Edges → Sorted Edge Lists
  • 2: Breadth First Traversal from each Node up to Distance D → Graph Topology Index
  • 3: Graph Maximum MetaPath Weight Index
• Online Query Processing
  • Query Q → Find Candidate Nodes → Candidate Nodes → Top-K Computation → Top-K Subgraphs

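The "Sort Edges" step of the offline phase could look like the sketch below. The slide names the index but not its layout, so the grouping is an assumption: one list per unordered pair of endpoint types, sorted by non-increasing weight. The function name and the toy edges are hypothetical.

```python
from collections import defaultdict


def build_sorted_edge_lists(node_type, weighted_edges):
    """Group edges by (assumed) edge type and sort each group by weight, best first."""
    lists = defaultdict(list)
    for u, v, w in weighted_edges:
        # Assumed edge type: the unordered pair of endpoint node types.
        key = tuple(sorted((node_type[u], node_type[v])))
        lists[key].append((w, u, v))
    for edges in lists.values():
        edges.sort(key=lambda e: e[0], reverse=True)
    return dict(lists)


# Hypothetical toy input.
node_type = {1: "B", 2: "A", 3: "A", 4: "A", 5: "A", 9: "A"}
edges = [(5, 9, 0.9), (3, 4, 0.8), (1, 2, 0.7), (4, 5, 0.8)]

sorted_lists = build_sorted_edge_lists(node_type, edges)
print(sorted_lists[("A", "A")][0])  # (0.9, 5, 9)
```

Keeping the lists in this order lets the online phase read the most promising edges first and use the next unread weight in each list as a bound on anything not yet seen.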
Index Structures
• Notation: G = (V, E), B = average number of neighbors, T = number of types
[Figure: example Network G with nodes of types A, B, and C and weighted edges]

Find Candidate Nodes
• Query Q → Query Topology
• The Query Topology is checked against the Graph Topology Index
[Figure: Query Q with node 1 of type B and nodes 2, 3, 4 of type A]

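One way to realise this filtering step is sketched below. The slides only say that the topology index comes from a breadth-first traversal up to distance D, so the exact contents are an assumption: here the index stores, for each node and each distance d, how many nodes of each type lie within d hops. A graph node can stand in for a query node only if it dominates the query node's counts. All names and the toy graph are hypothetical.

```python
from collections import Counter, deque


def topology(adj, node_type, start, max_dist):
    """For d = 1..max_dist, count the nodes of each type within d hops of start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue  # do not expand beyond the index radius
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return {d: Counter(node_type[v] for v, dv in dist.items() if 0 < dv <= d)
            for d in range(1, max_dist + 1)}


def may_match(graph_topo, query_topo):
    """Necessary condition: the graph node sees at least as many nodes per type."""
    return all(graph_topo[d][t] >= n
               for d, counts in query_topo.items()
               for t, n in counts.items())


# Hypothetical network: b1 - a1 - a2 - a3 and b2 - a4.
g_type = {"b1": "B", "a1": "A", "a2": "A", "a3": "A", "b2": "B", "a4": "A"}
g_adj = {"b1": ["a1"], "a1": ["b1", "a2"], "a2": ["a1", "a3"], "a3": ["a2"],
         "b2": ["a4"], "a4": ["b2"]}

# Hypothetical query: q1(B) - q2(A) - q3(A).
q_type = {"q1": "B", "q2": "A", "q3": "A"}
q_adj = {"q1": ["q2"], "q2": ["q1", "q3"], "q3": ["q2"]}

D = 2
q1_topo = topology(q_adj, q_type, "q1", D)
candidates = [n for n in g_type
              if g_type[n] == "B" and may_match(topology(g_adj, g_type, n, D), q1_topo)]
print(candidates)  # ['b1']
```

Counting within d hops rather than at exactly d hops keeps the filter safe: extra edges in the network can only bring matched nodes closer, never push them further away.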
Finding and Scoring Matches: Key Idea
[Flowchart: Query Q → Top-K Computation → Top-K Heap]
• Start: generate a size-1 candidate while more valid edges remain
• Compute Actual and UB Score
• Candidate Size == |Q|? If yes, update the heap; if no, grow candidates
• Compute Max UB Score
• A "TopK Quit?" check follows each stage; when it succeeds: Done!

Finding and Scoring Matches: Generating Size-1 Candidates
[Figure: size-1 candidates for Query Q and their growth]
• Edges are considered in sorted order: (5,9), (3,4), (4,5), (2,3), (2,7), …
• Special cases
  • Query edge with both endpoints of same type
  • Multiple query edges of the same type
• Candidate Growth: each candidate is tested with Prune? / Grow? or Heapify? / Discard?

Finding and Scoring Matches: Actual Score and Upper Bound Score
[Figure: Candidate Growth, with Useful Edge Lists]
• Actual Score = 0.9
• UB Score = 0.9 + UB(Non-Considered Edges) = 0.9 + (0.6 + 0.6) = 2.1
• Partially grown candidate
  • Prune if UBScore < min(heap)
  • Grow otherwise
• Fully grown candidate
  • Discard if UBScore < min(heap)
  • Update heap otherwise

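The scoring rule and the four outcomes above can be sketched as two small functions. The numbers 0.9, 0.6, and 0.6 come from the slide; the function names and the heap minimums used in the demonstration (2.2 and 2.0) are hypothetical.

```python
def upper_bound(actual_score, remaining_edge_bounds):
    """Actual score so far plus the best possible weight for each non-considered edge."""
    return actual_score + sum(remaining_edge_bounds)


def decide(ub_score, heap_min, fully_grown):
    """Apply the slide's rules for partially and fully grown candidates."""
    if ub_score < heap_min:
        return "discard" if fully_grown else "prune"
    return "update heap" if fully_grown else "grow"


# Slide example: one matched edge (0.9) and two query edges still uncovered.
ub = upper_bound(0.9, [0.6, 0.6])
print(round(ub, 2))  # 2.1

# Hypothetical heap minimums to show both outcomes for a partial candidate.
print(decide(ub, heap_min=2.2, fully_grown=False))  # prune
print(decide(ub, heap_min=2.0, fully_grown=False))  # grow
```

For a fully grown candidate no edges remain, so its upper bound equals its actual score and the same comparison decides between discarding it and updating the heap.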
Finding and Scoring Matches: Global Top-K Quit
[Figure: Network G and Query Q]
• K=2
• TopK Heap: (4,3,2,7): 2.2; (3,4,5,6): 2.2
• Stop: 0.7 + 0.6 + 0.7 = 2.0 < 2.2

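The stopping test above can be sketched with Python's `heapq`. The heap contents and the three weights are the slide's numbers; reading 0.7, 0.6, and 0.7 as the best not-yet-processed weight for each of the three query edges is an assumption, and `should_quit` is a hypothetical helper name.

```python
import heapq


def should_quit(heap, k, max_remaining_ub):
    """Stop once k matches are held and no unseen candidate can beat the worst of them."""
    return len(heap) == k and max_remaining_ub < heap[0][0]


# Min-heap of (score, match); heap[0] is the weakest match currently kept.
heap = []
for entry in [(2.2, (4, 3, 2, 7)), (2.2, (3, 4, 5, 6))]:
    heapq.heappush(heap, entry)

# Assumed meaning: best remaining weight for each of the three query edges.
max_remaining_ub = 0.7 + 0.6 + 0.7

print(should_quit(heap, 2, max_remaining_ub))  # True
```

Because the bound covers every candidate that has not been generated yet, the search can end here without touching the rest of the sorted edge lists.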
Faster Query Processing using Graph Maximum MetaPath Weight Index
[Figure: Query, Partial Instantiation, and Partial Candidate over node types A, B, and C]
• Edge-based bound
  • UB Score = Actual Score(1-2) + UB(1-3) + UB(2-3) + UB(3-4) + UB(4-5)
• Path-based bound: Paths to cover Non-Considered Edges
  • UB Score = Actual Score(1-2) + UB(1-3-4-5) + UB(2-3)
• Slight complication: some edges are not covered by the paths (Edges to Consider Separately)
  • UB Score = Actual Score(1-2) + UB(1-3-4-5-7) + UB(2-3) + UB(4-6) + UB(6-7)
• Path bounds are obtained using the MMW Index!

Faster Query Processing using Graph Maximum MetaPath Weight Index
[Figure: partially grown candidate containing nodes 5 and 9; Prune? Grow?]
• K=2
• TopK Heap: (8,9,5,6): 2.1; (5,9,8,7): 2.0
• Edge-based UBScore = 0.9 + 0.8 + 0.7 = 2.4 > 2.0: Grow
• Path-based UBScore (MMW Index) = 0.9 + UB(5-A-B) = 0.9 + 0.9 = 1.8 < 2.0: Prune

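The comparison above can be replayed in a few lines. The weights and the heap minimum of 2.0 are the slide's numbers; the layout of the MMW lookup (a dictionary keyed by a start node and the sequence of node types along the path) is an assumption, as are all names.

```python
# Assumed MMW index layout: (start node, metapath types) -> maximum path weight.
mmw_index = {(5, ("A", "B")): 0.9}


def edge_based_ub(actual, per_edge_bounds):
    """Bound each non-considered query edge independently."""
    return actual + sum(per_edge_bounds)


def path_based_ub(actual, paths, index):
    """Bound whole paths of non-considered edges with one index lookup each."""
    return actual + sum(index[p] for p in paths)


heap_min = 2.0
actual = 0.9

edge_ub = edge_based_ub(actual, [0.8, 0.7])
path_ub = path_based_ub(actual, [(5, ("A", "B"))], mmw_index)

print(round(edge_ub, 2), "grow" if edge_ub >= heap_min else "prune")  # 2.4 grow
print(round(path_ub, 2), "grow" if path_ub >= heap_min else "prune")  # 1.8 prune
```

Both values are valid upper bounds, so taking the smaller one is safe; here the path-based bound is tighter because the best individual edges do not chain into a single path out of node 5.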
Discussions
• Queries with multiple edge semantics
• Directed graphs
• Homogeneous networks
• Weighted query edges
  • Weights signify expected amount of interestingness
  • Weights signify importance of query edge
• Faster computations versus index size

Low-cost Index Structures

Faster Query Execution
[Figures: Query Execution Time (msec) for Path Queries, Clique Queries, and Subgraph Queries (Graph G2 and indexes with D=2)]
• RAM: Ranking After Matching baseline
• RWM0: without candidate node filtering
• RWM1: without the MMW index
• RWM2: same as RWM1, without pruning any partially grown candidates
• RWM3: same as RWM1, without the global top-K quit check
• RWM4: same as RWM1, with the MMW index

Good Scalability thanks to Effective Pruning
[Figures]
• Running time (msec) for different Query Sizes and Graph Sizes (D=2)
• Number of Candidates as Percentage of Total Matches for Different Query Sizes and Candidate Sizes
• Query Execution Time for Different Values of K

Real Dataset Case Studies
• Q1: 1: Author, 2: Conf, 3: Author
• Q2: 1: Author, 2: Conf, 3: Author, 4: Keyword
• Q3: 1: Person, 2: Film, 3: Person
• Q4: 1: Person, 2: Company, 3: Person, 4: Settlement

Real Dataset Case Studies
• DBLP
  • 1: Rohit Gupta, 2: BICoB, 3: Vipin Kumar
    • Rohit Gupta -- computer networking
    • Vipin Kumar -- Data and Information Systems
    • BICoB -- International Conference on Bioinformatics and Computational Biology
  • 1: Jimeng Sun, 2: Operating Systems Review (SIGOPS), 3: Christos Faloutsos, 4: mining
    • Jimeng Sun and Christos Faloutsos -- Data and Information Systems, Artificial intelligence, and Computational biology
    • "mining" -- Data and Information Systems
    • "Operating Systems Review (SIGOPS)" -- Operating systems, Computer architecture, Computer networking

Real Dataset Case Studies
• Wikipedia
  • 1: Stacy Keach, 2: The Biggest Battle, 3: John Huston
    • Stacy Keach and John Huston starred in the movie “The Biggest Battle”
    • Stacy Keach (American), John Huston (American), movie is Italian
    • Stacy (narration, comedy, music), John (drama, documentary, adventure), movie (war)
  • 1: Medha Patkar, 2: BBC, 3: Felix D’Alviella, 4: Mogilino
    • Medha Patkar -- Indian social activist -- won Best International Political Campaigner by BBC
    • Felix D’Alviella -- Belgian actor in the BBC soap opera Doctors
    • Mogilino -- village in Bulgaria -- BBC showed the popular film "Bulgaria’s Abandoned Children" in 2007
    • A British company rewarding an Indian woman, covering a place in Bulgaria, or being linked to a person from Belgium is rare

Related Work (1)
• Theory literature on subgraph isomorphism [Cordella et al., 2004; McKay, 1981; Ullmann, 1976]
• Exact subgraph matching [Cheng et al., 2008; He and Singh, 2008; Sun et al., 2012; Zhang et al., 2007; Zhang et al., 2009; Zhao and Han, 2010; Zou et al., 2009]
• Approximate subgraph matching [Zou et al., 2007; Zeng et al., 2012; Tian et al., 2007; Zhang et al., 2010]

Related Work (2)
• Matching in graph databases [Ranu and Singh, 2009; Yan et al., 2005; Zhu et al., 2012]
• Matching for RDF graphs [Liu et al., 2012], probabilistic graphs [Yuan et al., 2012] and temporal graphs [Bogdanov et al., 2011]
• Top-K queries
  • h-hop aggregate queries [Yan et al., 2010]
  • K most frequent patterns [Yang et al., 2012; Zhu et al., 2011]
  • Top-K keyword queries on RDF graphs [Tran et al., 2009]
  • Top-K similarity queries [Zou et al., 2007]
  • Twig queries [Gou and Chirkova, 2008]

Conclusion
• Given
  • Typed unweighted query
  • A heterogeneous edge-weighted information network
  • Edge interestingness measure
• Find
  • Top-K interesting subgraphs
• Investigated ranking after matching baseline
• Proposed three new graph indexes and exploited them for building a top-K solution
• Showed efficiency, scalability and effectiveness on multiple synthetic and real datasets

Thanks!