Designing Indexing Structure for Discovering Relationships in RDF Graphs
Stanislav Bartoň
RDF & RDF Schema
• Triple (Subject, Property, Object)
• Subjects are identified by their URIs
• Object: a resource or an explicit (literal) value
• Special semantics are attached to certain resources (e.g. rdfs:Class, rdfs:subClassOf)
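As a small illustration of how a set of triples forms a directed graph, the sketch below uses the W3C naming of the three positions (subject, property, object). The `Triple` type, the `to_edges` helper and the `http://example.org/` resources are hypothetical and only serve the example; the RDF and RDFS URIs are the standard ones.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str  # URI of the resource being described
    prop: str     # URI of the property
    obj: str      # URI of another resource, or a literal value


# Standard RDF / RDFS vocabulary URIs.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_CLASS = "http://www.w3.org/2000/01/rdf-schema#Class"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

# Hypothetical example namespace and data.
EX = "http://example.org/"
triples = [
    Triple(EX + "Person", RDF_TYPE, RDFS_CLASS),
    Triple(EX + "Student", RDFS_SUBCLASS_OF, EX + "Person"),
    Triple(EX + "alice", RDF_TYPE, EX + "Student"),
    Triple(EX + "alice", EX + "name", "Alice"),  # object is a literal
]


def to_edges(triples):
    """View triples as labelled directed edges: (source, target, label)."""
    return [(t.subject, t.obj, t.prop) for t in triples]


edges = to_edges(triples)
```

Each triple becomes one edge from its subject to its object, labelled with the property; this is the graph view the rest of the slides work with.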
Known approaches to discovering associations in RDF graphs
• Using graph algorithms on the actual data, or
• Path and Schema indices
  • A 2D array of paths between Classes i and j within one Schema
  • An array of interconnections between Schemas
Tree Signatures
• Based on the Dietz numbering scheme
• The mutual position of any two nodes within a signature is known immediately
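A minimal sketch of the numbering idea, assuming the tree is given as a dictionary of child lists (the names `dietz_numbering` and `is_ancestor` are illustrative, not taken from the slides). Each node receives its preorder and postorder rank; `u` is an ancestor of `v` exactly when `u` comes earlier in preorder and later in postorder, so the test needs no traversal.

```python
def dietz_numbering(children, root):
    """Return {node: (preorder_rank, postorder_rank)} for the tree under root."""
    ranks = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        pre = counters["pre"]
        counters["pre"] += 1
        for child in children.get(node, []):
            visit(child)
        # The postorder rank is assigned after all descendants are numbered.
        ranks[node] = (pre, counters["post"])
        counters["post"] += 1

    visit(root)
    return ranks


def is_ancestor(ranks, u, v):
    """True if u is a proper ancestor of v, decided from the ranks alone."""
    return ranks[u][0] < ranks[v][0] and ranks[u][1] > ranks[v][1]


# Hypothetical tree: a has children b and c; b has child d.
tree = {"a": ["b", "c"], "b": ["d"]}
ranks = dietz_numbering(tree, "a")
```

The recursive visit is enough for a sketch; a very deep tree would need an explicit stack to stay within Python's recursion limit.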
Transforming the graph into a forest of trees
• The RDF graph is a generic directed graph, possibly containing cycles
• Two situations can violate the tree structure:
  • Cycles
  • Nodes with in-degree > 1
=> Transform the graph into a forest of trees to which the tree signatures can be applied
Transforming the graph into a forest of trees
• In-degree > 1 transformation (figure)
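One possible reading of the in-degree transformation, as a hedged sketch: the slides do not spell out which copy of a divided node keeps its outgoing edges, so the code below assumes that the first incoming edge is kept and every further incoming edge is redirected to a fresh leaf copy. The function name and the `node#k` naming of copies are hypothetical.

```python
from collections import defaultdict


def split_multi_parent_nodes(edges):
    """Rewrite (source, target) edges so that every node has in-degree <= 1.

    Assumption: the first incoming edge of a node is kept as is; each later
    incoming edge points to a new leaf copy of that node instead.
    Returns (new_edges, copies), where copies maps an original node to the
    names of the copies created for it.
    """
    seen_targets = set()
    copies = defaultdict(list)
    new_edges = []
    for src, dst in edges:
        if dst not in seen_targets:
            seen_targets.add(dst)
            new_edges.append((src, dst))
        else:
            copy = f"{dst}#{len(copies[dst]) + 1}"
            copies[dst].append(copy)
            new_edges.append((src, copy))
    return new_edges, dict(copies)


# Hypothetical graph: node c has two parents, a and b.
new_edges, copies = split_multi_parent_nodes([("a", "c"), ("b", "c"), ("c", "d")])
```

Under this assumption a cycle in which every node has exactly one incoming edge is left untouched, which matches the slides treating cycles as a separate case.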
Transforming the graph into a forest of trees
• Cycle transformation (figure)
Transforming the graph into a forest of trees
• The first transformation breaks the graph into several components.
• Individual components within the graph are identified via reachability.
• A cycle is detected within a component when its number of edges does not match that of a tree (a tree with n nodes has exactly n − 1 edges).
• A signature is then built for each component.
Each of these steps is linear in the size of the graph, so the total time complexity is O(4n), i.e. O(n).
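The component and cycle checks can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the input is a list of distinct (source, target) edges in which every node already has in-degree at most 1, components are found by an undirected reachability search, and a component is flagged as cyclic when its edge count differs from (number of nodes − 1). The function name is hypothetical.

```python
from collections import defaultdict


def components_with_cycle_flag(edges):
    """Return [(sorted_nodes, has_cycle), ...], one entry per component."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    component_of = {}
    components = []
    for start in list(neighbours):
        if start in component_of:
            continue
        index = len(components)
        members = []
        stack = [start]
        component_of[start] = index
        while stack:  # reachability search, ignoring edge direction
            node = stack.pop()
            members.append(node)
            for nxt in neighbours[node]:
                if nxt not in component_of:
                    component_of[nxt] = index
                    stack.append(nxt)
        components.append(members)

    edge_count = [0] * len(components)
    for src, _ in edges:
        edge_count[component_of[src]] += 1

    # A tree on n nodes has exactly n - 1 edges; anything else is not a tree.
    return [(sorted(members), edge_count[i] != len(members) - 1)
            for i, members in enumerate(components)]


# Hypothetical input: one chain a -> b -> c and one cycle x -> y -> z -> x.
result = components_with_cycle_flag(
    [("a", "b"), ("b", "c"), ("x", "y"), ("y", "z"), ("z", "x")]
)
```

Every node and every edge is touched a constant number of times, which is consistent with the linear bound stated on the slide.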
Tree Signature indices
• Two indices are built to keep track of
  • which nodes have been 'divided' and to which signatures they belong, and
  • which multiple nodes are contained in each signature
• The indices are built while the signatures are created
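A rough sketch of the two indices as plain dictionaries. It assumes that each signature is available as a list of the original node names it contains and that a node counts as "multiple" when it occurs more than once overall; `build_indices` and both dictionary names are hypothetical. For simplicity the sketch builds the indices in a separate pass rather than during signature creation.

```python
from collections import Counter, defaultdict


def build_indices(signatures):
    """signatures: {signature_id: [node, ...]} using original node names.

    Returns (node_to_signatures, signature_to_multiples), restricted to
    nodes that occur more than once across all signatures.
    """
    occurrences = Counter(
        node for nodes in signatures.values() for node in nodes
    )
    node_to_signatures = defaultdict(set)
    signature_to_multiples = defaultdict(set)
    for sig_id, nodes in signatures.items():
        for node in nodes:
            if occurrences[node] > 1:
                node_to_signatures[node].add(sig_id)
                signature_to_multiples[sig_id].add(node)
    return dict(node_to_signatures), dict(signature_to_multiples)


# Hypothetical forest: node c was divided between signatures 0 and 1.
node_index, signature_index = build_indices(
    {0: ["a", "b", "c"], 1: ["c", "d"], 2: ["e"]}
)
```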
Path algorithm
• Takes the start node and the end node as input.
• Initially, the current node is the start node and the current signature is the signature containing it.
• Finds all the multiple nodes above the current node in the current signature.
• Traverses all the new possibilities until it either finds the end node or has no possibilities left.
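The search can be sketched as a breadth-first traversal over node occurrences. Several things here are assumptions rather than the slides' method: an occurrence is a `(signature_id, node)` pair, "above" is read as following child-to-parent links, and the ancestors are found by walking parent pointers instead of by comparing signature ranks. All names are hypothetical.

```python
from collections import deque


def path_exists(parent, occurrences, start, end):
    """Search upward from start, hopping between signatures at multiple nodes.

    parent:      {(sig, node): (sig, parent_node)}; roots have no entry.
    occurrences: {node: [(sig, node), ...]} listing every occurrence of a node.
    """
    queue = deque(occurrences[start])
    visited = set(queue)
    while queue:
        current = queue.popleft()
        while current is not None:  # walk towards the root of this tree
            _, node = current
            if node == end:
                return True
            # A multiple node opens its occurrences in other signatures.
            for other in occurrences.get(node, []):
                if other not in visited:
                    visited.add(other)
                    queue.append(other)
            current = parent.get(current)
    return False


# Hypothetical forest: signature 0 is r -> m -> s, signature 1 is t -> m.
parent = {(0, "s"): (0, "m"), (0, "m"): (0, "r"), (1, "m"): (1, "t")}
occurrences = {
    "s": [(0, "s")],
    "m": [(0, "m"), (1, "m")],
    "r": [(0, "r")],
    "t": [(1, "t")],
}
found = path_exists(parent, occurrences, "s", "t")
not_found = path_exists(parent, occurrences, "t", "s")
```

In the example, `t` is only reachable from `s` by switching from signature 0 to signature 1 at the multiple node `m`, which is the kind of hop the indices are meant to support.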
Connection algorithm
• The problem of finding intersecting paths is reduced to finding a multiple node to which a path exists from both starting nodes.
• The algorithm keeps, for each starting node, the set of multiple nodes reachable from it.
• In each iteration, each starting node gets one turn to enlarge its set of usable multiple nodes.
• After each iteration, the sets of reachable multiple nodes are intersected.
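The iteration described above can be sketched as two alternating frontier expansions. The sketch assumes a precomputed mapping `next_multiples` from a node to the multiple nodes it reaches in one step; that mapping and the function name are hypothetical stand-ins for the signature lookups.

```python
def find_connection(next_multiples, start_a, start_b):
    """Return the multiple nodes reachable from both starts, or an empty set."""
    reached = [set(next_multiples.get(s, ())) for s in (start_a, start_b)]
    frontier = [set(r) for r in reached]
    while True:
        common = reached[0] & reached[1]  # intersect after each iteration
        if common:
            return common
        if not frontier[0] and not frontier[1]:
            return set()  # neither side can be enlarged any further
        for i in (0, 1):  # each starting node gets one turn
            new = set()
            for node in frontier[i]:
                new |= set(next_multiples.get(node, ())) - reached[i]
            reached[i] |= new
            frontier[i] = new


# Hypothetical data: a and b meet at the multiple node m3.
next_multiples = {"a": {"m1"}, "b": {"m2"}, "m1": {"m3"}, "m2": {"m3"}}
meeting = find_connection(next_multiples, "a", "b")
no_meeting = find_connection(next_multiples, "a", "z")
```

Stopping at the first non-empty intersection returns the meeting nodes found after the fewest rounds; continuing the loop would enumerate further connections.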
Conclusion and future work
• The algorithms are time-intensive on large-scale data => further optimization is needed.
• Neither algorithm can determine the mutual position of two nodes within the whole graph => a second-level indexing structure is needed.
• The proposed indexing structure is less memory-intensive than the Path and Schema indices.
• Further support for the Rho-iso operator.