Frequent Neighborhood Patterns: Mining Algorithms and Applications

Jialong Han
Doctoral thesis work, supervised by Prof. Ji-Rong Wen

Outline
• Background
• Frequent Neighborhood Patterns: Definitions
• Mining Algorithm
• Applications
  • Knowledge Discovery in Graphs
  • Within-Network Classification
  • Reverse Top-k Queries
• Conclusions

Graphs
• Molecule Structure Databases [1]
• Social Networks [2]
• Web Graphs [3]
• Academic Networks [4]
• Knowledge Bases [5]

Graph Databases: Two Settings [KK05]
• Graph-transaction setting
  • Core concept: transactions
  • Molecule structure databases
  • Properties of a transaction depend on its structure.
  • Frequent subgraph mining
  • Applications
• Single-graph setting
  • Social networks, web graphs, academic networks, knowledge bases, …
  • Core concept: nodes
  • Persons, web pages, papers, general entities, …

Frequent Patterns for Nodes (in the Single-Graph Setting)?
• Properties of a node depend on its surrounding structure.
  • Academic networks: an author citing his own paper
  • Social networks: a person with a son and a daughter
  • Within a molecule structure: a carbon atom appearing on a cycle of length 6
• Problems to be answered in this thesis
  • Is there a class of frequent patterns characterizing the common surrounding structures of many nodes?
  • If yes, can these frequent patterns support any node-related applications?
“Mining Frequent Neighborhood Patterns in a Large Labeled Graph”, CIKM’13

Problem Formulation
• A neighborhood pattern (NP) is a tuple of two parts, where
  • the first is a connected graph, and
  • the second is the pivot, one designated node of that graph.
• Given a database, a node matches the pattern if it resides in a surrounding structure like that of the pivot in the pattern.
• Support of a pattern: the number of database nodes that match it
• A pattern is a frequent neighborhood pattern if its support exceeds τ.
• The mining problem: Given a database and τ, find all frequent neighborhood patterns.
[Figure: a single-graph database and an example NP with its pivot marked: authors once citing their own papers]
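The support definition above can be made concrete with a small backtracking matcher. This is a minimal sketch, not the thesis's algorithm: the edge-triple encoding, the names `matches` and `support`, and the toy academic graph are all illustrative assumptions. It also assumes SPARQL-like matching, in which two pattern nodes may map to the same database node; the slides do not say whether matching must be injective.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source node, edge label, target node)


def matches(db_edges: List[Edge], db_labels: Dict[str, str],
            pat_edges: List[Edge], pat_labels: Dict[str, str],
            pivot: str, node: str) -> bool:
    """Return True if database node `node` can play the role of `pivot`."""

    def label_ok(var: str, val: str) -> bool:
        want = pat_labels.get(var)  # unlabeled pattern nodes accept any node
        return want is None or db_labels.get(val) == want

    def extend(i: int, assign: Dict[str, str]) -> bool:
        if i == len(pat_edges):
            return True  # every pattern edge has been mapped
        u, lab, v = pat_edges[i]
        for s, l, t in db_edges:
            if l != lab:
                continue
            if assign.get(u, s) != s or assign.get(v, t) != t:
                continue  # conflicts with an earlier choice
            if u == v and s != t:
                continue  # a pattern self-loop needs a database self-loop
            if label_ok(u, s) and label_ok(v, t):
                if extend(i + 1, {**assign, u: s, v: t}):
                    return True
        return False

    return label_ok(pivot, node) and extend(0, {pivot: node})


def support(db_edges, db_labels, pat_edges, pat_labels, pivot) -> int:
    """Number of database nodes that match the pattern at its pivot.

    Assumes `db_labels` has one entry per database node.
    """
    return sum(matches(db_edges, db_labels, pat_edges, pat_labels, pivot, n)
               for n in db_labels)


# Toy database (hypothetical): alice cites her own paper, bob does not.
db_labels = {"alice": "Author", "bob": "Author",
             "p1": "Paper", "p2": "Paper", "p3": "Paper"}
db_edges = [("alice", "writes", "p1"), ("alice", "writes", "p2"),
            ("p2", "cites", "p1"),
            ("bob", "writes", "p3"), ("p3", "cites", "p1")]

# Pattern "authors once citing their own papers", with pivot A.
pat_edges = [("A", "writes", "X"), ("A", "writes", "Y"), ("X", "cites", "Y")]
pat_labels = {"A": "Author"}

self_citers = support(db_edges, db_labels, pat_edges, pat_labels, "A")
```

In this toy graph only `alice` matches, so the pattern would be frequent only for a threshold τ below 1.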

Mining Algorithm: FNM (Frequent Neighborhood Mining)
• Apriori property / anti-monotonicity
  • A pattern's support does not exceed that of its sub-patterns.
  • Enables an Apriori mining framework [AIS93]: Join-Verify
• The Apriori framework (pseudocode): Initialize; While candidate patterns remain Do Join; Verify; End While; Return the frequent patterns.
• Challenge: non-trivial Building Blocks (BBs)
  • BBs: patterns that CANNOT be obtained by joining smaller ones
  • Traditional frequent pattern mining: BBs = all size-1 patterns
  • However, in FNM:
    • BBs appear at level 2 and above.
    • What do BBs look like in general cases?
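For comparison, the Join-Verify loop is easiest to see in the traditional itemset setting, where the building blocks are simply the size-1 patterns. The sketch below is classic Apriori over itemsets in the spirit of [AIS93], not FNM itself. The function name, the toy transactions, and the use of `>=` for the support threshold are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frequent itemset: support} using a join-verify loop."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Initialize: building blocks are all frequent size-1 patterns.
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s)
        # Join: combine two size-k patterns into a size-(k+1) candidate.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == k + 1}
        # Anti-monotonicity: drop candidates with an infrequent sub-pattern.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k))}
        # Verify: count support against the database.
        level = {c for c in candidates if support(c) >= min_support}
        k += 1
    return frequent


toy_transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"},
                    {"b", "c"}, {"a", "b", "c"}]
toy_frequent = apriori(toy_transactions, min_support=3)
```

Here every single item and every pair is frequent, while `{a, b, c}` has support 2 and is rejected in the verify step. In FNM the same loop cannot start from size-1 patterns alone, which is the challenge the slide raises.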

Building Block Theorem of FNM
• Call a pattern a path pattern if it
  • is a path, with the pivot on one end, and
  • contains at most one vertex label, which (if present) appears on the other end.
• Theorem: a pattern is a BB iff. it is a path pattern.
[Figure: search space of frequent subgraph mining vs. FNM (level 0 with the empty pattern φ, level 1, BBs, non-BBs, path patterns)]
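The path-pattern definition can be checked mechanically. The following is a minimal sketch under stated assumptions: patterns are encoded as edge triples, edge direction is ignored when deciding whether the pattern is a path, and a single-node pattern counts as a path whose two ends coincide. None of these conventions is confirmed by the slides, and the function name is hypothetical.

```python
def is_path_pattern(nodes, edges, labels, pivot):
    """Check the path-pattern definition on a small pattern.

    nodes:  set of pattern node ids
    edges:  list of (u, edge_label, v) triples
    labels: dict mapping the labeled pattern nodes to their vertex label
    pivot:  the pivot node id
    """
    neighbors = {n: [] for n in nodes}
    for u, _edge_label, v in edges:
        if u == v:
            return False  # a self-loop cannot be part of a simple path
        neighbors[u].append(v)
        neighbors[v].append(u)
    # A path on n nodes has n - 1 edges, and the pivot must be an end.
    if len(edges) != len(nodes) - 1 or len(neighbors[pivot]) > 1:
        return False
    # Walk away from the pivot; any branching means it is not a path.
    prev, cur, seen = None, pivot, 1
    while True:
        onward = [n for n in neighbors[cur] if n != prev]
        if len(onward) > 1:
            return False
        if not onward:
            break
        prev, cur = cur, onward[0]
        seen += 1
    if seen != len(nodes):
        return False  # some node is not reachable along the path
    # At most one vertex label, and only on the end opposite the pivot.
    return all(n == cur for n in labels)


path_edges = [("A", "writes", "X"), ("X", "cites", "Y")]
triangle_edges = [("A", "writes", "X"), ("A", "writes", "Y"),
                  ("X", "cites", "Y")]
path_nodes = {"A", "X", "Y"}
```

With these toy inputs, the two-edge path with a label on `Y` passes the check, while the self-citation triangle does not, consistent with it being obtainable by joining smaller patterns rather than being a building block.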

Application 1: Knowledge Discovery in Single-Graphs
• Frequent neighborhood patterns
  • have easy-to-interpret semantics, and
  • help discover hidden knowledge in single-graphs.

Application 2: Within-Network Classification
• Task: molecule structure completion [DK09]
  • Input: a single-graph database in which some nodes are unlabeled
  • Output: labels of the unlabeled nodes
• Neighborhood patterns as node features
  • Mine frequent neighborhood patterns on the database;
  • Vectorize all nodes using the mined patterns as features;
  • Train a model using the labeled nodes, and (iteratively) classify the unlabeled nodes with it.
“Within-Network Classification Using Radius-Constrained Neighborhood Patterns”, CIKM’14
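The feature pipeline above can be sketched in a few lines. Everything concrete here is an assumption: the slides do not preserve the vectorization formula, so a binary "node matches pattern" indicator is used, the model is a plain nearest-neighbor rule on Hamming distance rather than whatever classifier the thesis trained, and the iterative re-classification step is left out. The `matches` argument is a hypothetical callback standing in for real pattern matching.

```python
def vectorize(node, patterns, matches):
    """Binary feature vector: entry i is 1 iff `node` matches patterns[i]."""
    return [1 if matches(node, p) else 0 for p in patterns]


def classify(unlabeled, labeled, patterns, matches):
    """Label each unlabeled node with the label of its nearest labeled node.

    labeled: dict mapping node -> known label
    Distance is the Hamming distance between feature vectors.
    """
    train = [(vectorize(n, patterns, matches), y) for n, y in labeled.items()]
    predictions = {}
    for node in unlabeled:
        x = vectorize(node, patterns, matches)
        _, label = min(
            train,
            key=lambda pair: sum(a != b for a, b in zip(pair[0], x)))
        predictions[node] = label
    return predictions


# Toy data (hypothetical): which node matches which mined pattern.
toy_table = {"n1": {"p1"}, "n2": {"p2"}, "n3": {"p1", "p3"}}


def toy_matches(node, pattern):
    return pattern in toy_table[node]


toy_patterns = ["p1", "p2", "p3"]
toy_predictions = classify(["n3"], {"n1": "C", "n2": "O"},
                           toy_patterns, toy_matches)
```

In the toy data `n3` shares pattern `p1` with `n1`, so it receives the label `C`.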

Preliminary Results and Problems
• Outperforms the baseline by 11.7% in terms of F1
  • Baseline: RL-RW-Deg [DK09]
• Problem: are all features useful?
  • Definition: the radius of a pattern measures how far its nodes lie from the pivot.
  • Larger radius, less (conditional) contribution
[Figure: experimental setting with label ratio = 50% and one further parameter set to 2]
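One natural reading of the radius is the largest hop distance from the pivot to any node of the pattern. The slide's own formula was not preserved, so that reading is an assumption, as are the function name and the edge-triple encoding. Under it, the radius is a breadth-first search from the pivot that ignores edge direction.

```python
from collections import deque


def pattern_radius(edges, pivot):
    """Largest hop distance from `pivot` to any node of a connected pattern.

    edges: list of (u, edge_label, v) triples; direction is ignored here.
    """
    neighbors = {pivot: set()}
    for u, _edge_label, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    dist = {pivot: 0}
    queue = deque([pivot])
    while queue:
        cur = queue.popleft()
        for nxt in neighbors[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return max(dist.values())


# The self-citation triangle keeps every node next to the pivot A.
triangle_radius = pattern_radius(
    [("A", "writes", "X"), ("A", "writes", "Y"), ("X", "cites", "Y")], "A")
# A two-edge path puts Y two hops away from the pivot A.
chain_radius = pattern_radius(
    [("A", "writes", "X"), ("X", "cites", "Y")], "A")
```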

Markov Assumption for WNC [MP07]
• Distant structures (nodes/edges) have small impact on the classification of a node.
  • An efficiency-effectiveness tradeoff
  • A pattern with a large radius falls under the Markov assumption.
• Can we do FNM without generating patterns whose radius exceeds a given bound?
  • FNM cannot control the radius directly.
  • Late filtration by radius: wasted computations.
  • Early filtration by radius (e.g., from the path-pattern-generating stage): BBs missed again!

BB Theorem of Radius-Constrained FNM (r-FNM)
• After introducing the radius constraint, some non-path patterns become BBs.
• Theorem: Under radius constraints, a pattern is a BB iff. it is a path pattern or a zipper pattern.
• FNM with radius constraints: r-FNM = FNM + zipper pattern handling
[Figure: zipper patterns under a radius constraint of 3]

Advantages of r-FNM
• Saves feature extraction time when the Markov assumption needs to be invoked
• The K problem
• Provides more choices on the efficiency-effectiveness tradeoff

Application 3: Reverse Top-k Queries
• Knowledge bases
  • A single-graph database
  • Access interface: structural query languages
  • Hard for ordinary users to formulate queries
• Can we find the query using representative partial answers?
  • “Representative”
  • Persons born in Europe
Example question [6]: Which chess player was born and died in the same place?
SELECT ?uri WHERE { ?uri :type :ChessPlayer . ?uri :birthPlace ?place . ?uri :deathPlace ?place }
Complete answers: M. Botvinnik, P. Morphy, …
Representative partial answers: M. Botvinnik
“Discovering Neighborhood Pattern Queries by Sample Answers in Knowledge Base”, ICDE’16

Reverse Top-k Neighborhood Pattern Queries
• Natural language questions -> node queries -> neighborhood pattern queries
  • Example: SELECT ?uri WHERE { ?uri :type :ChessPlayer . ?uri :birthPlace ?place . ?uri :deathPlace ?place }
• Problem statement: Given a database and an order on its nodes, for a set of input nodes, find all neighborhood pattern queries such that
  • every input node is an answer to the query, and
  • when ranking the answers of the query, the input nodes all appear in the Top-k results.
• A filter-refine approach
  • Reduce the filter sub-problem to FNM
• Example input nodes: { M. Botvinnik }
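The Top-k condition in the problem statement can be spelled out as a direct, unoptimized check for one candidate query. This is a sketch under assumptions: the order on nodes is represented by a hypothetical `score` dictionary with distinct values where higher means ranked earlier, and the query's full answer set is already available. The function name and toy data are illustrative only.

```python
def meets_top_k(answers, score, inputs, k):
    """True iff every input node is an answer and ranks within the top k.

    answers: all nodes returned by the candidate query
    score:   dict node -> ranking score (higher ranks first, no ties)
    inputs:  the user's sample answers
    """
    inputs = set(inputs)
    if not inputs <= set(answers):
        return False  # some input node is not an answer at all
    ranked = sorted(answers, key=lambda n: score[n], reverse=True)
    return inputs <= set(ranked[:k])


toy_score = {"a": 9, "b": 8, "c": 7, "d": 6, "e": 5}
toy_answers = ["a", "b", "c", "e"]
```

With these toy values, the input `{"c"}` satisfies the condition for k = 3 but not for k = 2, because two answers outrank it.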

Refine Stage: Observations and Optimizations
• To verify a candidate query, its answer set need not be completely evaluated -> Indicator answers.
  • Only answers ranked no lower than the input nodes affect the Top-k condition.
  • A query meets the Top-k condition iff. its indicator answers are few enough for the input nodes to stay within the Top-k.
• The indicator answers of different queries overlap with each other -> Shared evaluation.
  • For two queries where one is a sub-query of the other, every answer of the larger query is also an answer of the sub-query.
• Even the indicator answers need not be completely obtained to reject a query -> Partial evaluation.
  • Only a lower bound of their number is needed.
  • The number of “match” checks can be reduced.
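The indicator-answer and partial-evaluation ideas can be illustrated together. This sketch assumes that indicator answers are the answers ranked no lower than the lowest-ranked input node, that scores are distinct with higher meaning ranked earlier, and that the filter stage has already confirmed the input nodes as answers. The slides do not preserve the exact definitions, and `is_answer` is a hypothetical callback standing in for the expensive "match" check.

```python
def meets_top_k_partial(candidates, is_answer, score, inputs, k):
    """Decide the Top-k condition while skipping unnecessary match checks.

    candidates: database nodes that might be answers (must include inputs)
    is_answer:  callable node -> bool, the expensive match check
    """
    threshold = min(score[n] for n in inputs)  # lowest-ranked input node
    count = 0
    for node in candidates:
        if score[node] < threshold:
            continue  # ranked below every input node: cannot matter
        if is_answer(node):
            count += 1
            if count > k:
                return False  # the lower bound already exceeds k: reject
    return True


toy_score = {"a": 9, "b": 8, "c": 7, "d": 6, "e": 5}
toy_answer_set = {"a", "b", "c", "e"}
checked = []


def toy_is_answer(node):
    checked.append(node)  # record which nodes needed a match check
    return node in toy_answer_set


accepted = meets_top_k_partial("abcde", toy_is_answer, toy_score, ["c"], 3)
```

In the toy run only `a`, `b` and `c` are ever checked: `d` and `e` rank below the input node `c`, so they are skipped without a match check.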

Experiments
• DBpedia 3.9 knowledge base; 52 questions in the QALD-4-Task-1 dataset, allocated into 5 groups w.r.t. the shape of their ground truth query.
• Efficiency evaluation
  • Three optimizations: each gives a speedup of up to 1 to 2 orders of magnitude.
• Effectiveness evaluation
  • Two examples are enough to narrow down the sets of returned queries.

Related Work
• Frequent subgraph mining
  • Graph-transaction setting [IWM00, KK04, YH02]
  • Single-graph setting [KK05, VGS02, FB07, BN08]
• Within-network classification
  • Homophily-based [MP03]
  • Neighborhood-structure-based [DK09, NGK13]
• Reverse queries
  • Reverse engineering SQL queries [TCP09, ZEPS13, SCC+14]
  • Reverse nearest neighbor queries [KM00], reverse top-k queries [VDKN10], reverse skyline queries [DS07]

Conclusions
• We proposed a new class of node patterns in the single-graph setting: Frequent Neighborhood Patterns.
  • Algorithmic challenge: non-trivial building blocks
• We discussed three applications of frequent neighborhood patterns.
  • Knowledge discovery, within-network classification, and reverse top-k queries
• Future work: other node-centric applications in single-graph databases

Thank you! Q&A

References
[1] Picture is from http://icep.wikispaces.com/2D+chemical+database+searching+systems
[2] Picture is from http://7.mshcdn.com/wp-content/uploads/2012/09/social-graph-640.jpeg
[3] Picture is from http://www.analiticaweb.es/wp-content/uploads/2009/09/google.page.rank.explained.jpg
[4] Picture is from http://pages.cs.wisc.edu/~lixiujun/samples/social/dblp
[5] Picture is from http://resources.mpi-inf.mpg.de/yago-naga/yago/img/yago-graph.png
[6] Picture is from http://upload.chinaz.com/upimg/allimg/091020/1718320.gif
[AIS93] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In SIGMOD Conference, pages 207–216, 1993.
[BN08] Björn Bringmann and Siegfried Nijssen. What is frequent in a single graph? In PAKDD, pages 858–863, 2008.
[DK09] Christian Desrosiers and George Karypis. Within-network classification using local structure similarity. In ECML/PKDD (1), pages 260–275, 2009.
[DKK03] Mukund Deshpande, Michihiro Kuramochi, and George Karypis. Frequent sub-structure-based approaches for classifying chemical compounds. In ICDM, pages 35–42, 2003.

References (cont.)
[DS07] Evangelos Dellis and Bernhard Seeger. Efficient computation of reverse skyline queries. In Proceedings of the 33rd International Conference on Very Large Data Bases, pages 291–302. VLDB Endowment, 2007.
[FB07] Mathias Fiedler and Christian Borgelt. Subgraph support in a single large graph. In ICDM Workshops 2007, pages 399–404. IEEE, 2007.
[IWM00] Akihiro Inokuchi, Takashi Washio, and Hiroshi Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In PKDD, pages 13–23, 2000.
[KK04] Michihiro Kuramochi and George Karypis. An efficient algorithm for discovering frequent subgraphs. IEEE Transactions on Knowledge and Data Engineering, 16(9):1038–1051, 2004.
[KK05] Michihiro Kuramochi and George Karypis. Finding frequent patterns in a large sparse graph. Data Min. Knowl. Discov., 11(3):243–271, 2005.
[KM00] Flip Korn and S. Muthukrishnan. Influence sets based on reverse nearest neighbor queries. In ACM SIGMOD Record, volume 29, pages 201–212. ACM, 2000.
[MP03] Sofus A. Macskassy and Foster Provost. A simple relational classifier. In Proc. of the 2nd Workshop on Multi-Relational Data Mining (MRDM) at KDD, pages 64–76, 2003.
[MP07] Sofus A. Macskassy and Foster J. Provost. Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research, 8:935–983, 2007.

References (cont.)
[NGK13] Marion Neumann, Roman Garnett, and Kristian Kersting. Coinciding walk kernels: Parallel absorbing random walks for learning with graphs and few labels. In Asian Conference on Machine Learning, pages 357–372, 2013.
[SCC+14] Yanyan Shen, Kaushik Chakrabarti, Surajit Chaudhuri, Bolin Ding, and Lev Novik. Discovering queries based on example tuples. In SIGMOD, 2014.
[TCP09] Quoc Trung Tran, Chee-Yong Chan, and Srinivasan Parthasarathy. Query by output. In SIGMOD, 2009.
[VDKN10] Akrivi Vlachou, Christos Doulkeridis, Yannis Kotidis, and Kjetil Norvag. Reverse top-k queries. In ICDE, 2010.
[VGS02] Natalia Vanetik, Ehud Gudes, and Solomon Eyal Shimony. Computing frequent graph patterns from semistructured data. In ICDM, pages 458–465, 2002.
[YH02] Xifeng Yan and Jiawei Han. gSpan: Graph-based substructure pattern mining. In ICDM, pages 721–724, 2002.
[YYH04] Xifeng Yan, Philip S. Yu, and Jiawei Han. Graph indexing: A frequent structure-based approach. In SIGMOD Conference, pages 335–346, 2004.
[ZEPS13] Meihui Zhang, Hazem Elmeleegy, Cecilia M. Procopiuc, and Divesh Srivastava. Reverse engineering complex join queries. In SIGMOD, 2013.
