This doctoral thesis explores the concepts and applications of frequent neighborhood patterns in various types of graphs, including academic networks, social networks, and knowledge bases. The mining algorithm and its applications, such as knowledge discovery and within-network classification, are discussed in detail.
Frequent Neighborhood Patterns: Mining Algorithms and Applications Jialong Han Doctoral thesis work, supervised by Prof. Ji-Rong Wen
Outline • Background • Frequent Neighborhood Patterns: Definitions • Mining Algorithm • Applications • Knowledge Discovery in Graphs • Within-Network Classification • Reverse Top-k Queries • Conclusions
Graphs • Molecule Structure Databases • Social Networks • Web Graphs • Academic Networks • Knowledge Bases
Graph Databases: Two Settings [KK05] • Graph-transaction setting • Core concept: transactions • Molecule structure databases • Properties of a transaction depend on its structure. • Frequent subgraph mining • Applications • Single-graph setting • Social networks, web graphs, academic networks, knowledge bases, … • Core concept: nodes • Persons, web pages, papers, general entities, …
Frequent Patterns for Nodes (in the Single-Graph Setting)? • Properties of a node depend on its surrounding structure. • Academic networks: an author citing his own paper • Social networks: a person with a son and a daughter • Within a molecule structure: a carbon atom appearing on a cycle of length 6 • Problems to be answered in this thesis • Is there a class of frequent patterns characterizing the common surrounding structures of many nodes? • If yes, can these frequent patterns support any node-related applications? “Mining Frequent Neighborhood Patterns in a Large Labeled Graph”, CIKM’13
Problem Formulation • A neighborhood pattern is a tuple of a graph and one of its nodes, where • the graph is connected, and • the chosen node is the pivot of the pattern. • Given a single-graph database, a node matches the pattern if it resides in a surrounding structure like the one around the pivot in the pattern. • Support of a pattern: the number of database nodes that match it. • A pattern is a frequent neighborhood pattern if its support exceeds the threshold τ. • The mining problem: given a database and τ, find all frequent neighborhood patterns. • Example neighborhood pattern (NP) over a single-graph database, with its pivot marked: authors who once cited their own papers.
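The support definition above can be made concrete with a small brute-force matcher. This is only a sketch under stated assumptions: matching is taken to be a label-preserving homomorphism (two pattern nodes may map to the same database node), edges are directed `(source, edge_label, target)` triples, and the toy academic network and all names (`matching_nodes`, `db_edges`, `pat_edges`, ...) are hypothetical. It is not the thesis's mining algorithm, just a way to check a single pattern.

```python
def matching_nodes(db_edges, db_labels, pat_edges, pat_labels, pivot):
    """Return the database nodes that the pattern's pivot can be mapped to."""
    edge_set = set(db_edges)
    # Pattern variables other than the pivot, in a fixed order.
    variables = sorted({x for u, _, v in pat_edges for x in (u, v)} - {pivot})

    def label_ok(var, node):
        # Unlabeled pattern nodes accept any database node.
        return var not in pat_labels or db_labels.get(node) == pat_labels[var]

    def consistent(assign):
        # Every pattern edge whose endpoints are both assigned must exist in the database.
        return all((assign[u], lab, assign[v]) in edge_set
                   for u, lab, v in pat_edges if u in assign and v in assign)

    def extend(assign, remaining):
        # Backtracking search over the remaining pattern variables.
        if not remaining:
            return True
        var, rest = remaining[0], remaining[1:]
        for node in db_labels:
            if not label_ok(var, node):
                continue
            assign[var] = node
            found = consistent(assign) and extend(assign, rest)
            del assign[var]
            if found:
                return True
        return False

    matched = set()
    for node in db_labels:
        assign = {pivot: node}
        if label_ok(pivot, node) and consistent(assign) and extend(assign, variables):
            matched.add(node)
    return matched


# Hypothetical toy database: a1 wrote p1 and p2, and p1 cites p2.
db_labels = {"a1": "Author", "a2": "Author", "p1": "Paper", "p2": "Paper", "p3": "Paper"}
db_edges = [("a1", "writes", "p1"), ("a1", "writes", "p2"), ("p1", "cites", "p2"),
            ("a2", "writes", "p3"), ("p3", "cites", "p1")]

# Pattern "authors who once cited their own papers", pivot = A.
pat_edges = [("A", "writes", "X"), ("A", "writes", "Y"), ("X", "cites", "Y")]
pat_labels = {"A": "Author"}

self_citers = matching_nodes(db_edges, db_labels, pat_edges, pat_labels, "A")
support = len(self_citers)
```

With a threshold τ, the pattern would count as frequent when `support > τ`, following the "exceeds" wording of the slide.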
Mining Algorithm: FNM (Frequent Neighborhood Mining) • Apriori property (anti-monotonicity) • A pattern's support does not exceed that of its sub-patterns. • Enables an Apriori mining framework [AIS93]: Join-Verify • The Apriori framework: initialize the first level of frequent patterns; while the current level is non-empty, generate the next level's candidates (Join) and keep those whose support passes the threshold (Verify); return all frequent patterns found. • Challenge: non-trivial building blocks (BBs) • BBs: patterns that CANNOT be obtained by joining smaller ones • Traditional frequent pattern mining: BBs = all size-1 patterns • However, in FNM: • BBs appear at level 2 and above. • What do BBs look like in general?
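The Join-Verify loop is easiest to see on the traditional case mentioned above, where the building blocks are simply all size-1 patterns. The sketch below therefore uses classical itemset mining as a stand-in, not neighborhood patterns: FNM itself needs a graph-pattern join and extra building blocks at higher levels, which this example does not attempt. The function name `apriori` and the toy transactions are hypothetical.

```python
from itertools import combinations


def apriori(transactions, tau):
    """Level-wise Join-Verify mining of itemsets whose support exceeds tau."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: the building blocks of itemset mining are the single items.
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) > tau}

    frequent = {}
    k = 1
    while level:
        for pattern in level:
            frequent[pattern] = support(pattern)
        # Join: combine two size-k patterns into a size-(k+1) candidate.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Verify: prune by anti-monotonicity, then count support.
        level = {c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k))
                 and support(c) > tau}
        k += 1
    return frequent


transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent = apriori(transactions, tau=2)
```

Here every single item and every pair is frequent, while `{a, b, c}` has support 2 and is rejected at the Verify step because 2 does not exceed τ = 2.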
Building Block Theorem of FNM • Call a neighborhood pattern a path pattern if it • is a path, with the pivot on one end, and • contains at most one vertex label, which (if present) appears on the other end. • Theorem: a neighborhood pattern is a BB iff it is a path pattern. • [Figure: search spaces of frequent subgraph mining and FNM, from the empty pattern φ at level 0 through level 1 and above, with BBs and non-BBs marked; path patterns and an "Extend" step are labeled.]
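The path-pattern condition in the theorem can be checked mechanically. The following is a minimal sketch, assuming a pattern is given as `(u, edge_label, v)` triples plus a dictionary holding only the labeled vertices, and that edge directions are ignored when deciding whether the shape is a path. The function name `is_path_pattern` is hypothetical, and edge-less patterns are deliberately left out of the sketch.

```python
from collections import defaultdict


def is_path_pattern(edges, vertex_labels, pivot):
    """Check: a path with the pivot on one end and at most one vertex label, on the other end."""
    if not edges:
        raise ValueError("edge-less patterns are outside this sketch")
    adj = defaultdict(list)
    for u, _, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    vertices = set(adj) | {pivot}
    # A path on n vertices has exactly n - 1 edges.
    if len(edges) != len(vertices) - 1:
        return False
    # The pivot must be an end point, and no vertex may branch.
    if len(adj[pivot]) != 1 or any(len(ns) > 2 for ns in adj.values()):
        return False
    # Walk from the pivot to the other end, counting visited vertices.
    prev, cur, seen = None, pivot, 1
    while True:
        nxt = [n for n in adj[cur] if n != prev]
        if not nxt:
            break
        prev, cur = cur, nxt[0]
        seen += 1
    if seen != len(vertices):
        return False
    # At most one labeled vertex, and it must be the far end of the path.
    labeled = set(vertex_labels)
    return not labeled or labeled == {cur}


path = [("A", "writes", "X"), ("X", "cites", "Y")]
triangle = [("A", "writes", "X"), ("A", "writes", "Y"), ("X", "cites", "Y")]
```

For example, `is_path_pattern(path, {"Y": "Paper"}, "A")` holds, whereas labeling the middle vertex `X`, choosing `X` as the pivot, or using the self-citation `triangle` all fail the test.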
Application 1: Knowledge Discovery in Single-Graphs • Frequent neighborhood patterns • have easy-to-interpret semantics, and • help discover hidden knowledge in single-graphs.
Application 2: Within-Network Classification • Task: molecule structure completion [DK09] • Input: a single-graph database and a set of unlabeled nodes in it • Output: labels of the unlabeled nodes • Neighborhood patterns as node features • Mine frequent neighborhood patterns on the database; • Vectorize every node, with one feature per mined pattern; • Train a model on the labeled nodes, and (iteratively) classify the unlabeled nodes with it. “Within-Network Classification Using Radius-Constrained Neighborhood Patterns”, CIKM’14
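The feature pipeline above might look as follows. This is a sketch under several assumptions: each feature is a 0/1 flag saying whether the node matches a mined pattern, the mining step is replaced by a hand-written dictionary `pattern_matches`, and a 1-nearest-neighbor rule under Hamming distance stands in for the unspecified model. All pattern names, node names, and labels are hypothetical, and the iterative re-classification step is omitted.

```python
def vectorize(nodes, pattern_matches):
    """Map each node to a 0/1 vector with one entry per pattern (patterns in sorted order)."""
    patterns = sorted(pattern_matches)
    return {n: [1 if n in pattern_matches[p] else 0 for p in patterns] for n in nodes}


def classify_1nn(vectors, known_labels):
    """Label every unlabeled node with the label of its nearest labeled node."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    predictions = {}
    for node, vec in vectors.items():
        if node in known_labels:
            continue
        # Sorting makes tie-breaking deterministic.
        nearest = min(sorted(known_labels), key=lambda m: hamming(vec, vectors[m]))
        predictions[node] = known_labels[nearest]
    return predictions


# Hypothetical output of the mining step: pattern name -> nodes matching it.
pattern_matches = {
    "on_6_cycle": {"c1", "x"},
    "double_bond_neighbor": {"o1"},
}
nodes = ["c1", "o1", "x"]
known_labels = {"c1": "C", "o1": "O"}  # "x" is the unlabeled atom

vectors = vectorize(nodes, pattern_matches)
predictions = classify_1nn(vectors, known_labels)
```

The unlabeled node `x` shares its feature vector with `c1`, so the stand-in model assigns it the label `C`.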
Preliminary Results and Problems • Outperforms the baseline by 11.7% in terms of F1 • Baseline: RL-RW-Deg [DK09] • Problem: are all features useful? • Definition: the radius of a neighborhood pattern measures how far the pattern extends from its pivot. • The larger the radius, the smaller the (conditional) contribution • [Figure settings: label ratio = 50%; a second parameter = 2]
Markov Assumption for WNC [MP07] • Distant structures (nodes/edges) have little impact on the classification of a node. • An efficiency-effectiveness tradeoff • A pattern with a large radius falls under the Markov assumption. • Can we run FNM without generating patterns whose radius is too large? • FNM cannot control the radius directly. • Late filtration by radius: wasted computations. • Early filtration by radius (e.g., from the path-pattern-generating stage): BBs missed again!
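Late filtration can be sketched in a few lines. The code assumes that the radius of a pattern is the largest hop distance from the pivot to any of its nodes, ignoring edge directions; that reading is an assumption, as are the names `radius` and `late_filter` and the `(edges, pivot)` pattern encoding. The point of the sketch is the inefficiency: every pattern has already been mined before any of them is discarded.

```python
from collections import defaultdict, deque


def radius(edges, pivot):
    """Largest breadth-first distance from the pivot to any pattern node (assumed definition)."""
    adj = defaultdict(set)
    for u, _, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    dist = {pivot: 0}
    queue = deque([pivot])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())


def late_filter(patterns, r):
    """Keep only the already-mined patterns whose radius is at most r."""
    return [(edges, pivot) for edges, pivot in patterns if radius(edges, pivot) <= r]


triangle = ([("A", "writes", "X"), ("A", "writes", "Y"), ("X", "cites", "Y")], "A")
two_hop = ([("A", "writes", "X"), ("X", "cites", "Y")], "A")
kept = late_filter([triangle, two_hop], r=1)
```

The two-hop pattern is mined and then thrown away, which is the wasted computation that motivates pushing the radius constraint into the mining process.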
BB Theorem of Radius-Constrained FNM (r-FNM) • After introducing the radius constraint, some non-path patterns become BBs. • Theorem: under radius constraints, a neighborhood pattern is a BB iff it is a path pattern or a zipper pattern. • FNM with radius constraints: r-FNM = FNM + zipper pattern handling • [Figure: zipper patterns, shown for a parameter value of 3]
Advantages of r-FNM • Saves feature extraction time when the Markov assumption is to be applied • The K problem • Provides more choices on the efficiency-effectiveness tradeoff
Application 3: Reverse Top-k Queries • Knowledge bases • A single-graph database • Access interface: structural query languages • Hard for ordinary users to formulate queries • Example question: Which chess player was born and died in the same place? • Query: SELECT ?uri WHERE { ?uri :type :ChessPlayer . ?uri :birthPlace ?place . ?uri :deathPlace ?place } • Complete answers: M. Botvinnik, P. Morphy, … • Can we find the query using representative partial answers? • “Representative” • Persons born in Europe • Representative partial answers: M. Botvinnik “Discovering Neighborhood Pattern Queries by Sample Answers in Knowledge Base”, ICDE’16
Reverse Top-k Neighborhood Pattern Queries • Natural language questions -> node queries -> neighborhood pattern queries • Example: SELECT ?uri WHERE { ?uri :type :ChessPlayer . ?uri :birthPlace ?place . ?uri :deathPlace ?place } corresponds to a neighborhood pattern query. • Problem statement: given a database, an order on its nodes, and a set of input nodes, find all neighborhood pattern queries such that • the input nodes are answers to the query, and • when ranking the query's answers by the order, the input nodes all appear in the top-k results. • Example input: { M. Botvinnik } • A filter-refine approach • Reduce the filter sub-problem to FNM
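The two conditions of the problem statement can be written as a direct check on one candidate query. This is a naive sketch, assuming the query's full answer set is already available and that the order is given by a numeric score where higher means better ranked. The scores and the third answer `player_x` are made up for illustration, and `satisfies_reverse_topk` is a hypothetical name.

```python
def satisfies_reverse_topk(answers, score, inputs, k):
    """True if every input node is an answer and ranks within the top k answers."""
    answers, inputs = set(answers), set(inputs)
    # Condition 1: the input nodes must be answers to the query.
    if not inputs <= answers:
        return False
    # Condition 2: rank answers by descending score (ties broken by name).
    top_k = sorted(answers, key=lambda n: (-score[n], n))[:k]
    return inputs <= set(top_k)


# Illustrative scores only; they are not taken from any real ranking.
score = {"M. Botvinnik": 0.9, "P. Morphy": 0.7, "player_x": 0.4}
answers = {"M. Botvinnik", "P. Morphy", "player_x"}

botvinnik_top1 = satisfies_reverse_topk(answers, score, {"M. Botvinnik"}, k=1)
morphy_top1 = satisfies_reverse_topk(answers, score, {"P. Morphy"}, k=1)
morphy_top2 = satisfies_reverse_topk(answers, score, {"P. Morphy"}, k=2)
```

Evaluating every candidate query in full like this is what the refine-stage optimizations aim to avoid.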
Refine Stage: Observations and Optimizations • To verify a candidate query, its answer set need not be completely evaluated -> indicator answers. • Only answers ranked no lower than the lowest-ranked input node affect the top-k condition. • The query meets the top-k condition iff there are at most k such indicator answers. • Indicator answers of different queries overlap with each other -> shared evaluation. • If one query is a sub-query of another, the answers of the larger query are contained in those of the sub-query. • Even the indicator answers need not be completely obtained to reject a query -> partial evaluation. • Only a lower bound on their number is needed. • The number of “match” checks can be reduced.
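Indicator answers and partial evaluation can be illustrated together. The sketch assumes the filter stage has already established that the input nodes are answers, that `ranked_nodes` lists the database nodes from best to worst under the given order, and that `matches` is a potentially expensive predicate testing whether a node answers the query. The names `meets_topk`, `ranked`, and `checked` are hypothetical, and shared evaluation across queries is not shown.

```python
def meets_topk(ranked_nodes, inputs, matches, k):
    """Top-k test that only inspects nodes ranked no lower than the worst input node."""
    inputs = set(inputs)
    # Nodes ranked below the lowest-ranked input cannot push it out of the top k.
    cutoff = max(ranked_nodes.index(n) for n in inputs)
    count = 0
    for node in ranked_nodes[:cutoff + 1]:
        # Input nodes are known answers, so no match check is spent on them.
        if node in inputs or matches(node):
            count += 1
            if count > k:
                # Partial evaluation: a lower bound above k is enough to reject.
                return False
    return True


ranked = ["n1", "n2", "n3", "n4", "n5"]   # best to worst
answers = {"n1", "n3", "n5"}              # full answer set, unknown to meets_topk
checked = []                              # records which nodes were match-checked


def matches(node):
    checked.append(node)
    return node in answers


ok_k2 = meets_topk(ranked, {"n3"}, matches, k=2)
ok_k1 = meets_topk(ranked, {"n3"}, matches, k=1)
```

In both calls the nodes `n4` and `n5` are never match-checked, since they rank below the only input node `n3`.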
Experiments • DBpedia 3.9 knowledge base; 52 questions from the QALD-4-Task-1 dataset, divided into 5 groups according to the shape of their ground-truth query. • Efficiency evaluation • Three optimizations, each giving a speedup of up to 1 to 2 orders of magnitude. • Effectiveness evaluation • Two examples are enough to narrow down the set of returned queries.
Related Work • Frequent subgraph mining • Graph-transaction setting [IWM00, KK04, YH02] • Single-graph setting [KK05, VGS02, FB07, BN08] • Within-network classification • Homophily-based [MP03] • Neighborhood-structure-based [DK09, NGK13] • Reverse queries • Reverse engineering SQL queries [TCP09, ZEPS13, SCC+14] • Reverse nearest neighbor queries [KM00], reverse top-k queries [VDKN10], reverse skyline queries [DS07]
Conclusions • We proposed a new class of node patterns in the single-graph setting: Frequent Neighborhood Patterns. • Algorithmic challenge: non-trivial building blocks • We discussed three applications of frequent neighborhood patterns. • Knowledge discovery, within-network classification, and reverse top-k queries • Future work: other node-centric applications in single-graph databases
