Blank Node Matching and RDF/S Comparison Functions
Yannis Tzitzikas, Christina Lantzaki and Dimitris Zeginis
Institute of Computer Science, FORTH-ICS, and Computer Science Department, University of Crete, GREECE
ISWC2012, Boston, Nov. 2012
In two slides (1/2)
Several RDF/S Knowledge Bases rely heavily on blank nodes (bnodes).
• Bnodes are convenient for representing complex attributes, or resources whose identity is unknown but whose attributes (either literals or associations with other resources) are known.
• Blank node prevalence [On blank nodes, ISWC 2011]: Opencalais.com 44.9%, hi5.com foaf 87.5%.
[Figure: example graphs G1 and G2 in which persons (Chris, Jim) point via hasAddress to blank nodes (_:ad1, _:ad2) carrying street "Arlington St", city "Boston" and no "77".]
• We show how to exploit blank node anonymity in order to reduce the delta size when comparing RDF/S Knowledge Bases.
• We approach the problem as an optimization problem: find the mapping that gives the delta of minimum size.
FORTH-ICS, ISWC 2012
In two slides (2/2)
Time complexity and deviation from optimal:
• Optimal mapping, all KBs (general case): NP-Hard
• Optimal mapping, KBs with no directly connected bnodes: O(n^3)
• Approximately optimal mapping (Hungarian-based): O(n^3), deviation from optimal in [0, 7.2]
• Approximately optimal mapping (Signature-based): O(n log n), deviation from optimal in [1, 7.2]
• Mapping of 150,000 blank nodes: ~11 sec
Outline
• Motivation
• RDF Knowledge Bases with Blank Nodes
• On finding the Optimal Bnode Mapping
• Delta and Bnode Name Tuning
• The Optimization Problem
• Polynomially-solved Cases
• Approximate Bnode Matching Algorithms
• Hungarian Bnode Matching Algorithm
• A Fast Signature-based Algorithm
• Experimental Evaluation
• Discussing Semantics and Inference Rules
• Related Work
• Concluding Remarks
Motivation
• The world evolves, and world models (e.g. KBs expressed in RDF/S) evolve as well.
• The result of the comparison of two KBs is called a Delta.
• Deltas can be useful for
  • aiding humans to understand the evolution of knowledge
  • reducing the amount of data that needs to be exchanged and managed over the network in order to build synchronization, versioning and replication services
• The inability to match bnodes increases the delta size and does not help in detecting the changes between subsequent versions of a KB. However, a large percentage of the nodes of existing RDF KBs are blank nodes.
  • Opencalais.com: 44.9% bnodes, hi5.com foaf: 87.5% bnodes
RDF Knowledge Bases with Blank Nodes
• Graph notation: N: nodes, B: blank nodes, L: literals, U: URIs
Def: Equivalence. Two RDF graphs G1 and G2 are equivalent if there is a bijection M between the sets of nodes of the two graphs (N1 and N2), such that:
  • M(uri) = uri for each uri ∈ U1 ∩ N1
  • M(lit) = lit for each lit ∈ L1
  • M maps bnodes to bnodes
  • the triple (s, p, o) is in G1 if and only if the triple (M(s), p, M(o)) is in G2
[Figure: the bijection M between N1 and N2 is the identity function on URIs and on literals; the mapping between blank nodes is the open question.]
RDF Knowledge Bases with Blank Nodes (Cont)
Def: Edit Distance over Nodes given a Bijection. Let o1 and o2 be two nodes of G1 and G2, and suppose a bijection between the nodes of these graphs, i.e. a function h : N1 → N2. We define the edit distance between o1 and o2 over h, denoted by dist_h(o1, o2), as the number of additions or deletions of triples which are required for making the "direct neighborhoods" of o1 and o2 the same. Formally,
$$\mathrm{dist}_h(o_1, o_2) = |\{(o_1, p, a) \in G_1 \mid (o_2, p, h(a)) \notin G_2\}| + |\{(a, p, o_1) \in G_1 \mid (h(a), p, o_2) \notin G_2\}| + |\{(o_2, p, a) \in G_2 \mid (o_1, p, h^{-1}(a)) \notin G_1\}| + |\{(a, p, o_2) \in G_2 \mid (h^{-1}(a), p, o_1) \notin G_1\}|$$
[Figure: graphs K1 (nodes o1, o2, o3, o4) and K2 (nodes o5, o6, o7, o8) with edges labelled p; for h = {(o1 → o7), (o2 → o6), (o3 → o5), (o4 → o8)}, dist_h(o2, o6) = 4.]
Theorem: RDF Graph Equivalence. $G_1 \equiv_h G_2 \iff \mathrm{dist}_h(o, h(o)) = 0$ for each $o \in N_1$.
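The edit distance definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: graphs are sets of (subject, predicate, object) tuples, h is a dict, any node missing from h (a URI or literal) is mapped to itself, and bnode labels of the two graphs are disjoint. The names dist_h, equivalent_over and the tiny graphs are hypothetical and do not come from the slides.

```python
def dist_h(o1, o2, g1, g2, h):
    """Edit distance between o1 (node of g1) and o2 (node of g2) over h.

    g1, g2: sets of (subject, predicate, object) tuples.
    h: dict from nodes of g1 to nodes of g2; nodes missing from h
       (URIs, literals) are treated as mapped to themselves (assumption).
    """
    h_inv = {v: k for k, v in h.items()}
    fwd = lambda a: h.get(a, a)       # h(a)
    bwd = lambda a: h_inv.get(a, a)   # h^{-1}(a)
    # The four terms of the definition, in the same order.
    out1 = sum(1 for (s, p, a) in g1 if s == o1 and (o2, p, fwd(a)) not in g2)
    in1 = sum(1 for (a, p, o) in g1 if o == o1 and (fwd(a), p, o2) not in g2)
    out2 = sum(1 for (s, p, a) in g2 if s == o2 and (o1, p, bwd(a)) not in g1)
    in2 = sum(1 for (a, p, o) in g2 if o == o2 and (bwd(a), p, o1) not in g1)
    return out1 + in1 + out2 + in2


def equivalent_over(g1, g2, h):
    """Check the theorem's condition: dist_h(o, h(o)) == 0 for every node of g1.

    Assumes h (extended by the identity) is a bijection between the node sets.
    """
    nodes1 = {x for (s, _, o) in g1 for x in (s, o)}
    return all(dist_h(o, h.get(o, o), g1, g2, h) == 0 for o in nodes1)


# Hypothetical example graphs (not the K1/K2 of the figure).
g1 = {("_:a", "name", "Joe"), ("_:a", "knows", "_:b")}
g2 = {("_:x", "name", "Joe"), ("_:x", "lives", "UK")}
g3 = {("_:x", "name", "Joe"), ("_:x", "knows", "_:y")}
h = {"_:a": "_:x", "_:b": "_:y"}

print(dist_h("_:a", "_:x", g1, g2, h))  # one deletion (knows) + one addition (lives)
print(equivalent_over(g1, g3, h))
```

In this sketch each of the four sums corresponds to one term of the formula, which makes it easy to check the code against the definition.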
Deltas and Bnode Mappings
• For the case where the Knowledge Bases are not necessarily equivalent, we would like to find the bnode mapping that reduces the delta size.
• Delta
  • We use the differential function Δe. The computed delta consists of triple additions and triple deletions:
    Δe(G1 → G2) = {Add(t) | t ∈ G2 − G1} ∪ {Del(t) | t ∈ G1 − G2}
• Consider the following example:
  G1 = {(_:1, name, Joe)}
  G2 = {(_:2, name, Joe), (_:2, lives, UK)}
Bnode Name Tuning
• Note: no rename operation is needed, and hence no particular execution order.
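The effect of bnode name tuning on the example above can be sketched as follows. This is a minimal illustration, not the authors' implementation: graphs are sets of tuples, and the helper names delta_e and tune_names are assumptions introduced here. The idea shown is that the bnodes of G1 are renamed according to the mapping before the plain set difference is taken.

```python
def delta_e(g1, g2):
    """Delta as triple additions and deletions: Add(t) for t in G2 - G1, Del(t) for t in G1 - G2."""
    return {("Add", t) for t in g2 - g1} | {("Del", t) for t in g1 - g2}


def tune_names(g, mapping):
    """Rename the bnodes of g according to mapping (hypothetical helper)."""
    return {(mapping.get(s, s), p, mapping.get(o, o)) for (s, p, o) in g}


G1 = {("_:1", "name", "Joe")}
G2 = {("_:2", "name", "Joe"), ("_:2", "lives", "UK")}

plain = delta_e(G1, G2)                               # bnodes compared by label
tuned = delta_e(tune_names(G1, {"_:1": "_:2"}), G2)   # _:1 matched with _:2

print(len(plain), len(tuned))
print(tuned)
```

Without matching, the delta holds one deletion and two additions; once _:1 is matched with _:2 only the addition of (_:2, lives, UK) remains.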
On Finding the Optimal Mapping
• Our objective is to find the bijection M (between bnodes) that minimizes the delta size
  • it concerns the mapping of the blank nodes of the subsets B1 and B2
  • the bijection M a priori contains the mappings of all the URIs (U1, U2) and literals (L1, L2) as identity functions
• The number of possible bijections M is exponential
  • $|J| = n_2 (n_2 - 1) \cdots (n_2 - n_1 + 1)$, if $|B_1| = n_1$, $|B_2| = n_2$ and $|B_1| < |B_2|$
• The cost of a bijection M (which is actually the part of the delta that concerns bnodes)
  • $\mathrm{Cost}(M) = \sum_{b_1 \in B_1} \mathrm{dist}_M(b_1, M(b_1))$
Problem Statement: Given two Knowledge Bases, find the bijection (or bijections) that minimizes the cost:
$$M_{sol} = \arg\min_{M \in J} \mathrm{Cost}(M)$$
Theorem: Hardness of Optimality. Finding the optimal bijection is NP-Hard.
Proof: by reduction from the subgraph isomorphism problem (NP-Complete).
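To make the size of the search space and the cost function concrete, the following sketch enumerates every candidate mapping for a tiny pair of graphs and keeps the cheapest ones. It is an exhaustive search for illustration only (feasible just for very small inputs); the function names and the example graphs are hypothetical, and nodes absent from the mapping are assumed to map to themselves.

```python
from itertools import permutations
from math import perm


def dist(o1, o2, g1, g2, m):
    """Edit distance of o1 and o2 over mapping m (identity on unmapped nodes)."""
    inv = {v: k for k, v in m.items()}
    f = lambda a: m.get(a, a)
    b = lambda a: inv.get(a, a)
    return (sum(1 for (s, p, a) in g1 if s == o1 and (o2, p, f(a)) not in g2)
            + sum(1 for (a, p, o) in g1 if o == o1 and (f(a), p, o2) not in g2)
            + sum(1 for (s, p, a) in g2 if s == o2 and (o1, p, b(a)) not in g1)
            + sum(1 for (a, p, o) in g2 if o == o2 and (b(a), p, o1) not in g1))


def cost(m, g1, g2):
    """Cost(M): sum of dist_M(b1, M(b1)) over the mapped bnodes of B1."""
    return sum(dist(b1, b2, g1, g2, m) for b1, b2 in m.items())


def best_mappings(b1, b2, g1, g2):
    """Try all injective mappings of b1 into b2 (requires len(b1) <= len(b2))."""
    best, best_cost = [], None
    for image in permutations(b2, len(b1)):
        m = dict(zip(b1, image))
        c = cost(m, g1, g2)
        if best_cost is None or c < best_cost:
            best, best_cost = [m], c
        elif c == best_cost:
            best.append(m)
    return best_cost, best


# |J| = n2 * (n2 - 1) * ... * (n2 - n1 + 1) is exactly math.perm(n2, n1).
print(perm(5, 3))

g1 = {("_:1", "name", "Joe"), ("_:2", "name", "Ann")}
g2 = {("_:a", "name", "Ann"), ("_:b", "name", "Joe"), ("_:b", "lives", "UK")}
print(best_mappings(["_:1", "_:2"], ["_:a", "_:b"], g1, g2))
```

The number of candidates grows as perm(n2, n1), which is why exhaustive search is not an option beyond toy sizes.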
Time Complexity
• Optimal mapping, all KBs (general case): NP-Hard
• Optimal mapping, KBs with no directly connected bnodes: O(n^3)
• Approximately optimal mapping: O(n^3)
• Approximately optimal mapping: O(n log n)
Polynomially-solved cases: Not directly connected bnodes
Key observation: if there are no directly connected bnodes, then the edit distance between a pair of bnodes is independent of the other pairs.
Consequence
• The optimization problem can be solved using the Hungarian Algorithm [J. Munkres, 1957]
  • The elements of B1 play the role of workers
  • The elements of B2 play the role of jobs
  • The edit distances of the pairs in B1 × B2 play the role of the costs
• All the possible combinations can be checked with only |B1| * |B2| edit distance computations (i.e. n^2, assuming n = |B1| = |B2|)
Theorem: Finding the optimal bijection is a polynomial task if there are no directly connected blank nodes.
• The Hungarian-based method has cubic time complexity, O(n^3), and quadratic main memory complexity.
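Once the |B1| * |B2| edit distances are stored in a cost matrix, the problem is a standard assignment problem. The sketch below is a compact pure-Python variant of the Hungarian method (the well-known potentials formulation), given here as an assumption about how such a solver could look rather than as the implementation used in the talk. The function name hungarian and the cost matrices are hypothetical; rows stand for bnodes of B1 and columns for bnodes of B2.

```python
def hungarian(cost):
    """Minimum-cost assignment of rows to columns in O(n^2 * m) time.

    cost: list of n rows, each with m >= n entries.
    Returns (total_cost, assignment) with assignment[i] = column of row i.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0] * (n + 1)        # row potentials (1-indexed)
    v = [0] * (m + 1)        # column potentials (1-indexed)
    p = [0] * (m + 1)        # p[j] = row currently matched to column j (0 = free)
    way = [0] * (m + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta, j1 = INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:   # reached a free column: augment
                break
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [None] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    return sum(cost[i][assignment[i]] for i in range(n)), assignment


# Hypothetical edit distances: rows = bnodes of B1, columns = bnodes of B2.
print(hungarian([[2, 1], [0, 3]]))
print(hungarian([[4, 1, 3], [2, 0, 5], [3, 2, 2]]))
```

The cost matrix itself is the quadratic-memory part mentioned above; the solver only adds a few arrays of linear size.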
The Hungarian-based Algorithm (1/2)
• It is a variation of the optimal Hungarian algorithm that provides an approximate solution, because an assumption is needed about how directly connected blank nodes are treated when computing dist_h.
• Two possible assumptions:
  • All connected bnodes are considered different
  • All connected bnodes are considered the same
• It again makes only |B1| * |B2| (n^2) edit distance computations, and its complexity remains at the same level, O(n^3).
The Hungarian-based Algorithm (2/2)
[Figure: in G1, Jim points via hasAgenda to bnodes _:1 and _:2, which link via brother and friend edges to bnodes _:3, _:4, _:5 carrying name/sname literals (Chris, Zeginis, Tom, John); in G2, Jim points via hasAgenda to _:6 and _:7, which link in the same way to _:8, _:9, _:10 carrying name literals (Chris, Zeginis, John, Tom).]
• dist_h(_:1, _:6) = ?
  • dependent on the mappings of bnodes _:3, _:4, _:8, _:9
Assume all the connected bnodes are considered:
• different
  • dist_h(_:1, _:6) = 4
  • does not take common predicates into account
• the same
  • dist_h(_:1, _:6) = 0
  • exploits the similarity of their predicates
  • this assumption is used for the experiments
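The two assumptions can be illustrated with a small sketch that compares direct neighborhoods as multisets. The triples below are a reconstruction of the neighborhoods of _:1 and _:6 as read from the figure (an assumption, since the figure itself is not available), and approx_dist is a hypothetical helper: under "the same" every neighboring bnode is replaced by one wildcard, under "different" neighboring bnodes never match.

```python
from collections import Counter


def is_bnode(x):
    return x.startswith("_:")


def neighbourhood(o, g, graph_id, connected_same):
    """Multiset of (direction, predicate, neighbour) entries around node o."""
    items = Counter()
    for (s, p, x) in g:
        for direction, node, other in (("out", s, x), ("in", x, s)):
            if node == o:
                if is_bnode(other):
                    # "same": one shared wildcard; "different": a token that
                    # can never occur in the other graph.
                    other = "_:any" if connected_same else (graph_id, other)
                items[(direction, p, other)] += 1
    return items


def approx_dist(o1, o2, g1, g2, connected_same):
    """Triples to add or delete so that both neighbourhoods coincide."""
    n1 = neighbourhood(o1, g1, 1, connected_same)
    n2 = neighbourhood(o2, g2, 2, connected_same)
    return sum((n1 - n2).values()) + sum((n2 - n1).values())


# Direct neighbourhoods of _:1 and _:6 (reconstructed, see lead-in).
G1 = {("Jim", "hasAgenda", "_:1"), ("_:1", "brother", "_:3"), ("_:1", "friend", "_:4")}
G2 = {("Jim", "hasAgenda", "_:6"), ("_:6", "brother", "_:8"), ("_:6", "friend", "_:9")}

print(approx_dist("_:1", "_:6", G1, G2, connected_same=False))
print(approx_dist("_:1", "_:6", G1, G2, connected_same=True))
```

With these triples the two calls reproduce the values 4 and 0 given on the slide, because only the brother and friend edges involve other bnodes.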
The Signature-based Algorithm (1/2)
It consists of two steps:
• Signature Construction Phase: for each bnode a signature (string) is constructed based on the direct neighborhood of the bnode
• Mapping Construction Phase: the two bags of signatures are matched. Each signature matching corresponds to a mapping of a pair of blank nodes
Example of Signature Construction:
[Figure: in G1, Christina and Yannis point via hasAddress to bnodes _:1 and _:2 of rdf:type Address, with (street, no, city) values (Oxford St, 14, London) and (Broadway, 445, New York); in G2, bnodes _:3 and _:4 of rdf:type Address carry (Oxford St, 14, London) and (Michigan A, 132, Chicago).]
The Signature-based Algorithm (2/2)
Mapping Construction
• After the Signature Construction Phase, the signature lists are sorted lexicographically
• The mapping is exported in two passes
• For both passes we start from the smaller list, say BS1, and for each bs1 in that list we perform a lookup in the second list, BS2, using binary search (logarithmic complexity)
• First pass (exact match): exports only the exact matches
• Second pass (closest match): is applied over the remaining part of BS1 and BS2, and matches each element of BS1 to the lexicographically closest element of BS2
Note: we perform the closest matches after finishing with the exact matches in order to avoid the situation where an approximate match prevents an exact match at a later step.
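The two phases can be sketched as below. The slides do not spell out how a signature string is built or how "lexicographically closest" is measured, so both are assumptions here: the signature is the sorted list of direct-neighborhood entries with neighboring bnodes replaced by a placeholder, and closeness is the longer common prefix with one of the two neighbors of the binary-search position. All function names are hypothetical.

```python
from bisect import bisect_left
from os.path import commonprefix


def signature(b, g):
    """Hypothetical signature: sorted, joined direct-neighbourhood entries."""
    parts = []
    for (s, p, o) in g:
        if s == b:
            parts.append(f">{p}={'_' if o.startswith('_:') else o}")
        if o == b:
            parts.append(f"<{p}={'_' if s.startswith('_:') else s}")
    return "|".join(sorted(parts))


def match(bnodes1, g1, bnodes2, g2):
    """Two-pass matching; assumes len(bnodes1) <= len(bnodes2)."""
    bs1 = sorted((signature(b, g1), b) for b in bnodes1)
    bs2 = sorted((signature(b, g2), b) for b in bnodes2)
    mapping, rest1 = {}, []
    # First pass: exact matches only (binary search on the sorted list).
    for sig, b in bs1:
        i = bisect_left(bs2, (sig, ""))
        if i < len(bs2) and bs2[i][0] == sig:
            mapping[b] = bs2.pop(i)[1]   # list.pop is linear; fine for a sketch
        else:
            rest1.append((sig, b))
    # Second pass: closest remaining signature around the insertion point.
    for sig, b in rest1:
        if not bs2:
            break
        i = bisect_left(bs2, (sig, ""))
        candidates = [j for j in (i - 1, i) if 0 <= j < len(bs2)]
        best = max(candidates, key=lambda j: len(commonprefix([sig, bs2[j][0]])))
        mapping[b] = bs2.pop(best)[1]
    return mapping


def address(person, bnode, street, no, city):
    return {(person, "hasAddress", bnode), (bnode, "rdf:type", "Address"),
            (bnode, "street", street), (bnode, "no", no), (bnode, "city", city)}


# Address example in the spirit of the signature construction figure.
G1 = address("Christina", "_:1", "Oxford St", "14", "London") | \
     address("Yannis", "_:2", "Broadway", "445", "New York")
G2 = address("Christina", "_:3", "Oxford St", "14", "London") | \
     address("Yannis", "_:4", "Michigan A", "132", "Chicago")

print(match(["_:1", "_:2"], G1, ["_:3", "_:4"], G2))
```

In this example _:1 and _:3 share an identical signature and are paired in the first pass, while _:2 and _:4 are paired in the second pass as the only remaining candidates.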
Experimental Evaluation
• Over real datasets
  • Available in the LOD cloud
  • Two versions from each dataset
• Over synthetic datasets
  • A synthetic generator was implemented
  • Built over the UBA generator [Y. Guo et al., ISWC ’04]
  • Extended to support control over the number of blank nodes and the blank node properties
• Evaluation Aspects
  • Delta reduction potential
  • Equivalence detection potential
  • Time efficiency
  • Deviation from optimal delta
Experiments were conducted using the Sesame RDF/S Repository (main memory model) on a PC with an Intel Core i3 at 2.2 GHz and 3.8 GB RAM, running Ubuntu 11.10.
Experimental Evaluation: Real Datasets
Datasets (downloaded from CKAN): Italian Museums, Swedish Open Cultural Heritage. None of the datasets contains directly connected blank nodes.
Delta Size
• The proposed algorithms give a much smaller (12.7 to 7,924 times reduced) delta than without blank node matching
• The Hungarian always finds the optimal solution
• The Signature gave a 34% bigger delta than the Hungarian
Mapping Time
• The Hungarian requires more time (from 15 to 624 times) than the Signature
• The Signature needs less than one second for mapping 6,390 blank nodes
Experimental Evaluation: Synthetic Datasets 1
Synthetic Generation 1
• A set of 9 datasets, from KB0 to KB8, was generated
  • all of them contain the same number of blank nodes (240)
  • they gradually contain more complex blank node structures
• Two rounds of experiments
  • 1. Delta reduction potential: compare each dataset with another version
  • 2. Equivalence detection potential: compare each dataset with itself
Experimental Evaluation: Synthetic Datasets 1
Delta Reduction Potential
• Without bnode matching the delta size ranges from 95% to 143%
• Delta size is given as: [formula lost in extraction]
• The algorithms provide a much smaller delta than without blank node matching
• The Hungarian achieves the optimal delta for most of the pairs
• The Hungarian yields from 0 to 3 times smaller deltas than the Signature
Experimental Evaluation: Synthetic Datasets 1
Equivalence detection potential
• Both of the proposed algorithms detected equivalence for the first five Knowledge Bases
Time Efficiency
• The Signature gives mapping times two orders of magnitude lower than the Hungarian
Experimental Evaluation: Synthetic Datasets 2
Synthetic Generation 2
• A set of 7 bigger datasets, from KB0 to KB6, was generated, containing from 2,400 to 153,600 blank nodes
• The mapping time for the Signature was only 10.5 seconds for the seventh pair of Knowledge Bases
• Note: the Hungarian Algorithm could not be applied even to the third pair of datasets, due to its high main memory requirements
Measuring the approximation
• Deviation from optimal delta
  • We investigate how the bnode structures (the percentage of bnodes in the direct neighborhood) affect the deviation from the optimal delta
• Hungarian deviation: 0% - 7.2%
• Signature deviation: 1% - 7.2%
Discussion: Semantics and Inference Rules
• Apart from the explicitly specified triples of a KB, other triples can be inferred based on the RDF/S semantics, or on other custom inference rules.
• To apply our method, the only difference is that the graphs should be completed according to the inferred triples.
• It follows that if the semantics is based on a set of inference rules yielding a finite closure, then the graph is finite and thus our method can be applied.
  • E.g. Minimal RDFS semantics, ter Horst’s pD* semantics and others
• Note: the optimal bnode mapping over the completed graphs may be different from the optimal mapping when considering only the explicit graphs.
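Completing a graph before matching can be sketched as a fixpoint computation. The example below applies only two RDFS rules (transitivity of rdfs:subClassOf and propagation of rdf:type along rdfs:subClassOf) as a stand-in for a full rule set; the function name rdfs_closure and the sample triples are assumptions for illustration. Because the rules only recombine existing nodes and predicates, the loop terminates with a finite closure.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def rdfs_closure(g):
    """Naive fixpoint over two RDFS rules (illustrative subset only)."""
    closed = set(g)
    while True:
        new = set()
        for (a, p, b) in closed:
            for (c, q, d) in closed:
                if b != c or q != SUBCLASS:
                    continue
                if p == SUBCLASS:
                    new.add((a, SUBCLASS, d))   # subClassOf is transitive
                elif p == TYPE:
                    new.add((a, TYPE, d))       # instances inherit superclasses
        if new <= closed:                       # nothing new: fixpoint reached
            return closed
        closed |= new


g = {("_:1", TYPE, "Student"),
     ("Student", SUBCLASS, "Person"),
     ("Person", SUBCLASS, "Agent")}

closure = rdfs_closure(g)
print(sorted(closure - g))
```

The bnode matching would then be run on the completed graphs, so inferred triples such as the extra rdf:type statements of _:1 also count towards its neighborhood.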
Related Work
• Past works focusing only on detecting isomorphism
  • Jena
• Past works focusing on finding a delta
  • RDF Sync: no effort is dedicated to finding a blank node mapping
  • PromptDiff: employs heuristic matchers, but does not treat blank nodes
  • OntoView: no blank node matching is offered
  • CWM: requires the blank nodes to have term labels
  • SemVersion: creates and assigns unique identifiers for the blank nodes
  • RDF Molecules (SSWS 2008): an O(n^2) blank node mapping is offered, but it requires the blank nodes to be part of a uniquely identified triple
  • They do not try to find a mapping that reduces the delta size
• Works on constructing RDF/S mappings are not directly related, since they map the named entities of the two KBs and thus take lexical similarities into account, something that is not possible with bnodes.
Concluding Remarks
• We have shown how to exploit blank node anonymity in order to reduce the delta size when comparing RDF/S Knowledge Bases
• We proved that finding the optimal mapping is NP-Hard in the general case (polynomial if there are no directly connected blank nodes)
• We presented polynomial approximate algorithms for the general case (a Hungarian-based and a Signature-based one)
• In real datasets with no directly connected blank nodes
  • Signature Alg.: two orders of magnitude faster than the Hungarian Alg. (1 second for datasets with 6,390 blank nodes), with 34% bigger deltas than the Hungarian Alg.
• In synthetic datasets with directly connected blank nodes
  • The Hungarian Alg. yielded from 0 to 3 times smaller deltas than the Signature Alg. The Signature Algorithm was 18 to 57 times faster
• The algorithms provide a delta 12.7 to 7,294 times smaller than without blank node matching
• The Signature Algorithm requires only 10.5 seconds to match 153,600 blank nodes!
Possible Future Research
Several issues are interesting for further research:
• Investigation of other special cases where the optimal blank node mapping can be found in polynomial time
  • Directly connected blank nodes that form graphs of bounded tree width
• Comparative evaluation of various (probabilistic) signature construction methods and greedy approximation algorithms
Web system available at: http://www.ics.forth.gr/isl/BNodeDelta
• Work done in the context of SCIDIP-ES, APARSEN and i-Marine
Thank you for your attention