This doctoral thesis explores the concepts and applications of frequent neighborhood patterns in various types of graphs, including academic networks, social networks, and knowledge bases. The mining algorithm and its applications, such as knowledge discovery and within-network classification, are discussed in detail.
Frequent Neighborhood Patterns: Mining Algorithms and Applications Jialong Han Doctoral thesis work, supervised by Prof. Ji-Rong Wen
Outline • Background • Frequent Neighborhood Patterns: Definitions • Mining Algorithm • Applications • Knowledge Discovery in Graphs • Within-Network Classification • Reverse Top-k Queries • Conclusions
Graphs • Molecule structure databases [1] • Social networks [2] • Web graphs [3] • Academic networks [4] • Knowledge bases [5]
Graph Databases: Two Settings [KK05] • Graph-transaction setting • Core concept: transactions • Molecule structure databases • Properties of a transaction depend on its structure. • Frequent subgraph mining • Applications • Single-graph setting • Social networks, web graphs, academic networks, knowledge bases, … • Core concept: nodes • Persons, web pages, papers, general entities, …
Frequent Patterns for Nodes (in the Single-Graph Setting)? • Properties of a node depend on its surrounding structure. • Academic networks: an author citing their own paper • Social networks: a person with a son and a daughter • Within a molecule structure: a carbon atom appearing on a cycle of length 6 • Problems to be answered in this thesis • Is there a class of frequent patterns characterizing the common surrounding structures of many nodes? • If yes, can these frequent patterns support any node-related applications? “Mining Frequent Neighborhood Patterns in a Large Labeled Graph”, CIKM’13
Problem Formulation • A neighborhood pattern is a tuple $P = (G_P, v)$, where • $G_P$ is a connected graph, and • $v$ is the pivot of $P$. • Given a database $G$, the nodes that match $P$ are the nodes of $G$ residing in a surrounding structure like that of $v$ in $G_P$. • Support of $P$: the number of nodes in $G$ that match $P$ • $P$ is a frequent neighborhood pattern if its support exceeds τ. • The mining problem: given $G$ and τ, find all frequent neighborhood patterns. • Figure: a single-graph database and a neighborhood pattern (NP) with its pivot marked; the NP shown matches authors who once cited their own papers.
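The support definition above can be sketched with a brute-force matcher. This is a minimal illustration, not the thesis's algorithm: the toy graph, the names `node_matches` and `matching_nodes`, and the choice of injective matching (distinct pattern nodes map to distinct data nodes) are all assumptions made for the example.

```python
# Toy single-graph database (hypothetical data): node labels and labeled, directed edges.
NODE_LABELS = {
    "alice": "Author", "bob": "Author", "carol": "Author",
    "p1": "Paper", "p2": "Paper", "p3": "Paper", "p4": "Paper",
}
EDGES = {
    ("alice", "writes", "p1"), ("alice", "writes", "p2"),
    ("bob", "writes", "p3"), ("carol", "writes", "p4"),
    ("p2", "cites", "p1"),  # alice cites her own paper
    ("p3", "cites", "p1"),  # bob cites alice's paper
    ("p4", "cites", "p3"),  # carol cites bob's paper
}

# Neighborhood pattern "authors who cite their own papers"; the pivot is "a".
PATTERN_LABELS = {"a": "Author", "x": "Paper", "y": "Paper"}  # None would mean "any label"
PATTERN_EDGES = [("a", "writes", "x"), ("a", "writes", "y"), ("x", "cites", "y")]
PIVOT = "a"


def node_matches(node, node_labels, edges, pat_labels, pat_edges, pivot):
    """Return True if `node` can play the pivot in some injective embedding (assumed semantics)."""
    variables = [pivot] + [v for v in pat_labels if v != pivot]

    def label_ok(var, cand):
        return pat_labels[var] is None or pat_labels[var] == node_labels[cand]

    def edges_ok(assignment):
        # Every pattern edge with both endpoints assigned must exist in the data graph.
        return all(
            (assignment[s], lab, assignment[t]) in edges
            for s, lab, t in pat_edges
            if s in assignment and t in assignment
        )

    def extend(i, assignment):
        # Backtracking search over the remaining pattern nodes.
        if i == len(variables):
            return True
        var = variables[i]
        for cand in node_labels:
            if cand in assignment.values() or not label_ok(var, cand):
                continue
            assignment[var] = cand
            if edges_ok(assignment) and extend(i + 1, assignment):
                return True
            del assignment[var]
        return False

    start = {pivot: node}
    return label_ok(pivot, node) and edges_ok(start) and extend(1, start)


def matching_nodes(node_labels, edges, pat_labels, pat_edges, pivot):
    """The set of data nodes that match the pattern; its size is the pattern's support."""
    return {
        n for n in node_labels
        if node_matches(n, node_labels, edges, pat_labels, pat_edges, pivot)
    }


matched = matching_nodes(NODE_LABELS, EDGES, PATTERN_LABELS, PATTERN_EDGES, PIVOT)
support = len(matched)
print(matched, support)
```

In this toy graph only `alice` wrote two papers where one cites the other, so she is the single matching node. Support counts nodes, not embeddings: a node that matches in several ways still counts once.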
Mining Algorithm: FNM (Frequent Neighborhood Mining) • The Apriori framework: initialize the result with the frequent building blocks; while the current level is non-empty, join its patterns into larger candidates and verify each candidate's support; return all frequent patterns found. • Apriori property / anti-monotonicity • The support of a pattern does not exceed that of its sub-patterns. • Enables an Apriori mining framework [AIS93]: Join-Verify • Challenge: non-trivial building blocks (BBs) • BBs: patterns that CANNOT be obtained by joining smaller ones • Traditional frequent pattern mining: BBs = all size-1 patterns • However, in FNM: • BBs appear at level 2 and above. • What do BBs look like in general cases?
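The Join-Verify loop is easiest to see in the traditional case the slide contrasts with FNM. The sketch below mines frequent itemsets, where the building blocks are exactly the size-1 patterns; it is a stand-in for illustration, not FNM itself. The function name `apriori` and the `min_support` threshold (a pattern is kept when its support is at least this value) are assumptions of the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise Join-Verify mining of frequent itemsets (classical setting, not FNM)."""

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Building blocks: in itemset mining these are exactly the size-1 patterns.
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    frequent = {}
    k = 1
    while level:
        for pattern in level:
            frequent[pattern] = support(pattern)
        # Join: combine frequent size-k patterns into size-(k+1) candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Anti-monotonicity: drop candidates that have an infrequent size-k sub-pattern.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k))
        }
        # Verify: count each surviving candidate's support against the database.
        level = {c for c in candidates if support(c) >= min_support}
        k += 1
    return frequent


transactions = [
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"},
]
frequent = apriori(transactions, min_support=3)
for pattern, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(pattern), count)
```

Here every single item and every pair is frequent, while the triple is generated by the join step and then rejected by verification. In FNM the same loop applies, but the initialization step must also supply building blocks that first appear at higher levels.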
Building Block Theorem of FNM • Call $P$ a path pattern if it • is a path, with the pivot at one end, and • contains at most one vertex label, which (if present) appears at the other end. • Theorem: $P$ is a BB iff it is a path pattern. • Figure: search spaces of frequent subgraph mining and of FNM, starting from the empty pattern φ at level 0, with BBs and non-BBs marked; in FNM the path patterns are obtained by extension.
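A path pattern, as defined above, is fully described by a sequence of edge steps from the pivot plus an optional label on the far end. The sketch below enumerates the path patterns each node matches by walking simple paths, then keeps the frequent ones. It is an illustrative brute-force enumeration, not the generation procedure of the thesis; the encoding of a step as `(edge_label, direction)` and all function names are assumptions.

```python
from collections import defaultdict


def build_adjacency(edges):
    """node -> list of (step, neighbor), where step = (edge_label, direction)."""
    adj = defaultdict(list)
    for src, label, dst in edges:
        adj[src].append(((label, "out"), dst))
        adj[dst].append(((label, "in"), src))
    return adj


def path_patterns_of(node, node_labels, adj, max_len):
    """All path patterns with 1..max_len edges that `node` matches as the pivot."""
    found = set()

    def walk(current, steps, visited):
        if steps:
            found.add((steps, None))                  # no vertex label at all
            found.add((steps, node_labels[current]))  # one label, on the far end
        if len(steps) == max_len:
            return
        for step, nxt in adj[current]:
            if nxt not in visited:                    # keep the path simple
                walk(nxt, steps + (step,), visited | {nxt})

    walk(node, (), {node})
    return found


def frequent_path_patterns(node_labels, edges, max_len, min_support):
    """Map each frequent path pattern to its support (number of matching pivot nodes)."""
    adj = build_adjacency(edges)
    support = defaultdict(int)
    for node in node_labels:
        for pattern in path_patterns_of(node, node_labels, adj, max_len):
            support[pattern] += 1  # each node counts once per pattern
    return {p: s for p, s in support.items() if s >= min_support}


# Hypothetical toy academic network.
node_labels = {"alice": "Author", "bob": "Author", "p1": "Paper", "p2": "Paper", "p3": "Paper"}
edges = [
    ("alice", "writes", "p1"), ("alice", "writes", "p2"), ("bob", "writes", "p3"),
    ("p2", "cites", "p1"), ("p3", "cites", "p1"),
]
result = frequent_path_patterns(node_labels, edges, max_len=2, min_support=2)
writes_a_paper = ((("writes", "out"),), "Paper")
is_cited = ((("cites", "in"),), None)
print(result[writes_a_paper], is_cited in result)
```

The pivot itself carries no label in any of these patterns, which mirrors the "at most one vertex label, on the other end" condition.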
Application 1: Knowledge Discovery in Single-Graphs • Frequent neighborhood patterns • have easy-to-interpret semantics, and • help discover hidden knowledge in single-graphs.
Application 2: Within-Network Classification • Task: molecule structure completion [DK09] • Input: a single-graph database $G$ and a set $U$ of unlabeled nodes • Output: the labels of the nodes in $U$ • Neighborhood patterns as node features • Mine frequent neighborhood patterns $P_1, \dots, P_n$ on $G$; • Vectorize every node $v$ as $x_v = (x_{v,1}, \dots, x_{v,n})$, where $x_{v,i}$ indicates whether $v$ matches $P_i$; • Train a model $M$ using the labeled nodes, and (iteratively) classify the nodes in $U$ with $M$. “Within-Network Classification Using Radius-Constrained Neighborhood Patterns”, CIKM’14
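The feature pipeline on this slide can be sketched end to end on a toy molecule. Everything concrete here is an assumption for illustration: the patterns are hand-picked path patterns rather than mined ones, the features are binary match indicators, the model is a 1-nearest-neighbour rule with Hamming distance, and classification runs in a single pass rather than iteratively.

```python
def build_adjacency(bonds):
    """Undirected adjacency: atom -> list of (edge_label, neighbor)."""
    adj = {}
    for a, label, b in bonds:
        adj.setdefault(a, []).append((label, b))
        adj.setdefault(b, []).append((label, a))
    return adj


def matches_path(node, steps, end_label, adj, labels):
    """True if a simple path from `node` follows `steps` and ends on `end_label` (None = any)."""
    def walk(current, i, visited):
        if i == len(steps):
            return end_label is None or labels.get(current) == end_label
        return any(
            walk(nxt, i + 1, visited | {nxt})
            for label, nxt in adj.get(current, [])
            if label == steps[i] and nxt not in visited
        )
    return walk(node, 0, {node})


def vectorize(node, patterns, adj, labels):
    """Binary feature vector: entry i is 1 iff `node` matches pattern i."""
    return [int(matches_path(node, steps, end, adj, labels)) for steps, end in patterns]


def classify(node, patterns, adj, labels):
    """1-nearest-neighbour over the labeled nodes, using Hamming distance (a stand-in model)."""
    x = vectorize(node, patterns, adj, labels)
    train = [(vectorize(n, patterns, adj, labels), y) for n, y in labels.items()]
    return min(train, key=lambda ex: sum(a != b for a, b in zip(ex[0], x)))[1]


# Hypothetical ethanol-like molecule; atom h6 has no label yet.
BONDS = [
    ("o1", "bond", "c1"), ("o1", "bond", "h1"), ("c1", "bond", "h2"), ("c1", "bond", "h3"),
    ("c1", "bond", "c2"), ("c2", "bond", "h4"), ("c2", "bond", "h5"), ("c2", "bond", "h6"),
]
LABELS = {"o1": "O", "c1": "C", "c2": "C", "h1": "H", "h2": "H", "h3": "H", "h4": "H", "h5": "H"}
# Hand-picked path patterns: (edge-label sequence from the pivot, label on the far end).
PATTERNS = [(("bond",), "H"), (("bond",), "C"), (("bond",), "O"), (("bond", "bond"), "H")]

ADJ = build_adjacency(BONDS)
features = vectorize("h6", PATTERNS, ADJ, LABELS)
predicted = classify("h6", PATTERNS, ADJ, LABELS)
print(features, predicted)
```

The unlabeled atom is bonded to a carbon and sits two bonds away from a hydrogen, which is the same feature vector as the labeled hydrogens on the carbons, so it is classified as `H`. Unlabeled neighbors simply fail label tests here; an iterative scheme would re-vectorize after each round of predictions.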
Preliminary Results and Problems • Outperforms the baseline by 11.7% in terms of F1 • Baseline: RL-RW-Deg [DK09] • Problem: are all features useful? • Definition: the radius $r(P)$ of a pattern $P$ is the maximum distance from the pivot to any other node of $P$. • Larger radius, less (conditional) contribution • Label ratio = 50% = 2
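A pattern's radius can be computed with a breadth-first search from the pivot. The sketch assumes the radius is the largest shortest-path distance from the pivot to any node of the pattern, with edge directions ignored; the function names and the `(edges, pivot)` pattern encoding are hypothetical. `filter_by_radius` shows the simplest way to apply a radius bound, which is to discard patterns after they have been generated.

```python
from collections import deque


def pattern_radius(pattern_edges, pivot):
    """Largest BFS distance from the pivot to any pattern node (edge direction ignored)."""
    adj = {}
    for src, _label, dst in pattern_edges:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set()).add(src)
    dist = {pivot: 0}
    queue = deque([pivot])
    while queue:
        current = queue.popleft()
        for nxt in adj.get(current, ()):
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return max(dist.values())


def filter_by_radius(patterns, max_radius):
    """Keep the (edges, pivot) patterns whose radius is at most `max_radius`."""
    return [(edges, pivot) for edges, pivot in patterns
            if pattern_radius(edges, pivot) <= max_radius]


# "Author cites own paper": both papers are one hop from the pivot "a".
self_citation = [("a", "writes", "x"), ("a", "writes", "y"), ("x", "cites", "y")]
# A chain of three edges starting at the pivot "a".
chain = [("a", "writes", "x"), ("x", "cites", "y"), ("y", "cites", "z")]

print(pattern_radius(self_citation, "a"), pattern_radius(chain, "a"))
kept = filter_by_radius([(self_citation, "a"), (chain, "a")], max_radius=2)
```

The same graph has a different radius for a different pivot: moving the pivot of `chain` to `x` shortens the farthest distance, which is why the radius belongs to the pattern (graph plus pivot) and not to the graph alone.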
Markov Assumption for WNC [MP07] • Distant structures (nodes/edges) have little impact on the classification of a node. • An efficiency-effectiveness tradeoff • A pattern $P$ with a large radius falls under the Markov assumption. • Can we do FNM without generating any $P$ whose radius $r(P)$ exceeds a bound $r$? • FNM cannot control $r(P)$ directly. • Late filtration with $r(P) \le r$: wasted computation. • Early filtration with $r(P) \le r$ (e.g., from the path-pattern-generating stage): BBs missed again!
BB Theorem of Radius-Constrained FNM (r-FNM) • After introducing the radius constraint, some non-path patterns become BBs. • Theorem: under radius constraints, a pattern is a BB iff it is a path pattern or a zipper pattern. • FNM with radius constraints: r-FNM = FNM + zipper pattern handling • Figure: zipper patterns under a radius constraint of 3.
Advantages of r-FNM • Saves feature extraction time when the Markov assumption is applied • The K problem • Provides more choices on the efficiency-effectiveness tradeoff
Application 3: Reverse Top-k Queries • Knowledge bases • A single-graph database • Access interface: structural query languages • Hard for ordinary users to formulate queries • Can we find the query using representative partial answers? • “Representative” • Persons born in Europe • Example question: Which chess player was born and died in the same place? [6] • Query: SELECT ?uri WHERE { ?uri :type :ChessPlayer . ?uri :birthPlace ?place . ?uri :deathPlace ?place } • Complete answers: M. Botvinnik, P. Morphy, … • Representative partial answers: M. Botvinnik “Discovering Neighborhood Pattern Queries by Sample Answers in Knowledge Base”, ICDE’16
Reverse Top-k Neighborhood Pattern Queries • Natural language questions -> node queries -> neighborhood pattern queries • Example: the query SELECT ?uri WHERE { ?uri :type :ChessPlayer . ?uri :birthPlace ?place . ?uri :deathPlace ?place } corresponds to a neighborhood pattern whose pivot is ?uri. • Problem statement: given a database $G$ and an order on its nodes, for input nodes $E$, find all neighborhood pattern queries $Q$ such that • $E \subseteq Q(G)$, and • when ranking $Q(G)$, the nodes in $E$ all appear in the top-k results. • A filter-refine approach • Reduce the filter sub-problem to FNM • Example input: $E$ = { M. Botvinnik }
Refine Stage: Observations and Optimizations • To verify a query $Q$ against the input nodes $E$, $Q(G)$ need not be completely evaluated -> Indicator answers $I(Q)$. • Only the answers ranked no lower than the lowest-ranked node of $E$ affect the top-k condition; call this set $I(Q)$. • $Q$ meets the top-k condition iff $|I(Q)| \le k$. • The $I(Q)$ of different $Q$ overlap with each other -> Shared evaluation. • For $Q_1$, $Q_2$ where $Q_1$ is a sub-query of $Q_2$, we have $I(Q_2) \subseteq I(Q_1)$. • Even $I(Q)$ need not be completely obtained to reject $Q$ -> Partial evaluation. • Only a lower bound of $|I(Q)|$ is needed. • The number of “match” checks can be reduced.
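The indicator-answer and partial-evaluation ideas can be sketched as a single verification routine. The sketch assumes a strict total order on nodes (no ties), given as a list sorted best-first, and a hypothetical callback `matches(node)` that tests whether a node answers the candidate query; neither is taken from the thesis. Only nodes ranked above the worst-ranked example are ever tested, and the loop stops as soon as too many of them match.

```python
def meets_top_k(matches, ranked_nodes, examples, k):
    """True iff every example answers the query and ranks within its top-k answers."""
    examples = set(examples)
    if len(examples) > k or not all(matches(e) for e in examples):
        return False
    position = {node: i for i, node in enumerate(ranked_nodes)}
    worst = max(position[e] for e in examples)
    # Non-example answers allowed above the worst-ranked example.
    budget = k - len(examples)
    for node in ranked_nodes[:worst]:      # nodes ranked below `worst` are never checked
        if node in examples:
            continue
        if matches(node):
            budget -= 1
            if budget < 0:                 # a lower bound on the count already exceeds k
                return False
    return True


# Hypothetical ranking (best first) and answer set of one candidate query.
ranked = ["n1", "n2", "n3", "n4", "n5", "n6"]
answers = {"n1", "n3", "n4", "n6"}


def is_answer(node):
    return node in answers


print(meets_top_k(is_answer, ranked, {"n3"}, k=2))  # n3 is the 2nd-ranked answer
print(meets_top_k(is_answer, ranked, {"n4"}, k=2))  # n4 is the 3rd-ranked answer
```

Shared evaluation is not shown: it would cache match results across candidate queries, relying on the fact that a more restrictive query can only have fewer answers above the worst-ranked example.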
Experiments • DBpedia 3.9 knowledge base; 52 questions from the QALD-4-Task-1 dataset, divided into 5 groups by the shape of their ground-truth query. • Efficiency evaluation • Three optimizations: each yields a speedup of up to 1 to 2 orders of magnitude. • Effectiveness evaluation • Two examples are enough to narrow down the set of returned queries.
Related Work • Frequent subgraph mining • Graph-transaction setting [IWM00, KK04, YH02] • Single-graph setting [KK05, VGS02, FB07, BN08] • Within-network classification • Homophily-based [MP03] • Neighborhood-structure-based [DK09, NGK13] • Reverse queries • Reverse engineering SQL queries [TCP09, ZEPS13, SCC+14] • Reverse nearest neighbor queries [KM00], reverse top-k queries [VDKN10], reverse skyline queries [DS07]
Conclusions • We proposed a new class of node patterns in the single-graph setting: Frequent Neighborhood Patterns. • Algorithmic challenge: non-trivial building blocks • We discussed three applications of frequent neighborhood patterns. • Knowledge discovery, within-network classification, and reverse top-k queries • Future work: other node-centric applications in single-graph databases
Thank you! Q&A
References [1] Picture is from http://icep.wikispaces.com/2D+chemical+database+searching+systems [2] Picture is from http://7.mshcdn.com/wp-content/uploads/2012/09/social-graph-640.jpeg [3] Picture is from http://www.analiticaweb.es/wp-content/uploads/2009/09/google.page.rank.explained.jpg [4] Picture is from http://pages.cs.wisc.edu/~lixiujun/samples/social/dblp [5] Picture is from http://resources.mpi-inf.mpg.de/yago-naga/yago/img/yago-graph.png [6] Picture is from http://upload.chinaz.com/upimg/allimg/091020/1718320.gif [AIS93] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In SIGMOD Conference, pages 207–216, 1993. [BN08] Björn Bringmann and Siegfried Nijssen. What is frequent in a single graph? In PAKDD, pages 858–863, 2008. [DK09] Christian Desrosiers and George Karypis. Within-network classification using local structure similarity. In ECML/PKDD (1), pages 260–275, 2009. [DKK03] Mukund Deshpande, Michihiro Kuramochi, and George Karypis. Frequent sub-structure-based approaches for classifying chemical compounds. In ICDM, pages 35–42, 2003.
References (cont.) [DS07] Evangelos Dellis and Bernhard Seeger. Efficient computation of reverse skyline queries. In Proceedings of the 33rd International Conference on Very Large Data Bases, pages 291–302. VLDB Endowment, 2007. [FB07] Mathias Fiedler and Christian Borgelt. Subgraph support in a single large graph. In Data Mining Workshops, 2007. ICDM Workshops 2007, pages 399–404. IEEE, 2007. [IWM00] Akihiro Inokuchi, Takashi Washio, and Hiroshi Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In PKDD, pages 13–23, 2000. [KK04] Michihiro Kuramochi and George Karypis. An efficient algorithm for discovering frequent subgraphs. Knowledge and Data Engineering, 16(9):1038–1051, 2004. [KK05] Michihiro Kuramochi and George Karypis. Finding frequent patterns in a large sparse graph. Data Min. Knowl. Discov., 11(3):243–271, 2005. [KM00] Flip Korn and S. Muthukrishnan. Influence sets based on reverse nearest neighbor queries. In ACM SIGMOD Record, volume 29, pages 201–212. ACM, 2000. [MP03] Sofus A. Macskassy and Foster Provost. A simple relational classifier. In Proc. of the 2nd Workshop on Multi-Relational Data Mining (MRDM) at KDD, pages 64–76, 2003. [MP07] Sofus A. Macskassy and Foster J. Provost. Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research, 8:935–983, 2007.
References (cont.) [NGK13] Marion Neumann, Roman Garnett, and Kristian Kersting. Coinciding walk kernels: Parallel absorbing random walks for learning with graphs and few labels. In Asian Conference on Machine Learning, pages 357–372, 2013. [SCC+14] Yanyan Shen, Kaushik Chakrabarti, Surajit Chaudhuri, Bolin Ding, and Lev Novik. Discovering queries based on example tuples. In SIGMOD, 2014. [TCP09] Quoc Trung Tran, Chee-Yong Chan, and Srinivasan Parthasarathy. Query by output. In SIGMOD, 2009. [VDKN10] Akrivi Vlachou, Christos Doulkeridis, Yannis Kotidis, and Kjetil Norvag. Reverse top-k queries. In ICDE, 2010. [VGS02] Natalia Vanetik, Ehud Gudes, and Solomon Eyal Shimony. Computing frequent graph patterns from semistructured data. In ICDM, pages 458–465, 2002. [YH02] Xifeng Yan and Jiawei Han. gSpan: Graph-based substructure pattern mining. In ICDM, pages 721–724, 2002. [YYH04] Xifeng Yan, Philip S. Yu, and Jiawei Han. Graph indexing: A frequent structure-based approach. In SIGMOD Conference, pages 335–346, 2004. [ZEPS13] Meihui Zhang, Hazem Elmeleegy, Cecilia M. Procopiuc, and Divesh Srivastava. Reverse engineering complex join queries. In SIGMOD, 2013.