Ralf Schenkel joint work with Anja Theobald, Gerhard Weikum

Efficient Creation and Incremental Maintenance of the HOPI Index for Complex XML Document Collections

Outline
• The Problem: Connections in XML Collections
• HOPI Basics [EDBT 2004]
• Efficiently Building HOPI
• Why Distances are Difficult
• Incremental Index Maintenance

XML Basics
XML document:
<article> <title>XML</title> <sec>…</sec> <references> <entry>…</entry> </references> </article>
[Figure: element-level graph of the document, with nodes article, title, sec, references, entry]

XML Basics
XML collection = docs + links
<article> <title>XML</title> <sec>…</sec> <references> <entry>…</entry> </references> </article>
<researcher> <name>Schenkel</name> <topics>…</topics> <pubs> <book>…</book> </pubs> </researcher>
<book> <title>UML</title> <author>…</author> <content> <chap>…</chap> </content> </book>
[Figure: the three documents connected by links]

XML Basics
[Figure: element-level graph of the collection, with nodes article (title, sec, references, entry), researcher (name, topics, pubs, book), and book (title, author, content, chap)]

XML Basics
[Figure: document-level graph of the collection]

Connections in XML
[Figure: element-level graph of the collection with the article, researcher, and book documents]
XPath(++)/NEXI(++) query: //article[about("XML")]//researcher[about("DBS")]
• Questions:
  • Is there a path from article to researcher?
  • How long is the shortest path from article to researcher?
• (Naive) Answers:
  • Use Transitive Closure!
  • Use any APSP algorithm! (+ store information)
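
The two naive answers can be sketched in a few lines. This is a minimal illustration, assuming the element-level graph is given as a dictionary of adjacency lists; the toy graph and its node names are hypothetical and are not the collection shown on the slide.

```python
from collections import deque

def bfs_distances(graph, source):
    """Return {node: hop count} for every node reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, ()):
            if succ not in dist:
                dist[succ] = dist[node] + 1
                queue.append(succ)
    return dist

def all_pairs(graph):
    """Naive APSP: one BFS per node. The key set is the transitive closure."""
    nodes = set(graph) | {s for succs in graph.values() for s in succs}
    return {(a, b): d
            for a in nodes
            for b, d in bfs_distances(graph, a).items()
            if a != b}

# Hypothetical toy collection: an article cites a book whose author is a researcher.
toy = {
    "article": ["title", "references"],
    "references": ["entry"],
    "entry": ["book"],          # link into another document
    "book": ["author"],
    "author": ["researcher"],   # link into another document
}
closure = all_pairs(toy)
print(("article", "researcher") in closure)  # is there a path?
print(closure[("article", "researcher")])    # length of the shortest path
```

Storing one entry per connected pair is exactly what becomes too expensive for large collections.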

Why naive is not enough
Small example from the real world: subset of DBLP
• 6,210 documents (publications)
• 168,991 elements
• 25,368 links (citations)
• 14 Megabytes (uncompressed XML)
Element-level graph has 168,991 nodes and 188,149 edges
Its transitive closure: 344,992,370 connections, 2,632.1 MB
Complete DBLP has about 600,000 documents
The Web has …?

Goal
Find a compact representation for the transitive closure
• whose size is comparable to the data's size
• that supports connection tests (almost) as fast as the transitive closure
• that can be built efficiently for large data sets

HOPI: Use Two-Hop Cover
[Figure: path from a to b through the center node c]
• For each node a, maintain two sets of labels (which are nodes): Lin(a) and Lout(a)
• For each connection (a,b),
  • choose a node c on the path from a to b (center node)
  • add c to Lout(a) and to Lin(b)
• Then (a,b) ∈ Transitive Closure T ⇔ Lout(a) ∩ Lin(b) ≠ ∅
  Two-hop Cover of T (Edith Cohen et al., SODA 2002)
• Minimize the sum of the label sizes (NP-complete ⇒ approximation required)
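
The connection test on a two-hop cover is a set intersection. The sketch below assumes the labels are stored as Python sets and that every node is implicitly a member of its own Lin and Lout, so these self-entries need not be stored; the function name `connected` and the toy chain are illustrative only.

```python
def connected(a, b, l_out, l_in):
    """Test (a, b) in T via Lout(a) ∩ Lin(b) != {} (self-entries are implicit)."""
    return bool((l_out.get(a, set()) | {a}) & (l_in.get(b, set()) | {b}))

# Hypothetical chain a -> c -> b, with c chosen as center node for all connections.
l_out = {"a": {"c"}}
l_in = {"b": {"c"}}

print(connected("a", "b", l_out, l_in))  # covered by center c
print(connected("a", "c", l_out, l_in))  # c is in Lout(a) and implicitly in Lin(c)
print(connected("b", "a", l_out, l_in))  # no common center
```

Here two stored label entries represent three connections; under this convention the test is reflexive, so a node is always reported as connected to itself.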

Approximation Algorithm
[Figure: example graph with nodes 1 to 6 and the center graph (sides I and O) of candidate node 2]
What are good center nodes? Nodes that can cover many uncovered connections.
Initial step: All connections are uncovered
⇒ Consider the center graph of candidates
Initial density: we can cover 8 connections with 6 cover entries
Density of the densest subgraph: here the same as the initial density

Approximation Algorithm
[Figure: example graph with nodes 1 to 6 and the center graph (sides I and O) of candidate node 4]
What are good center nodes? Nodes that can cover many uncovered connections.
Initial step: All connections are uncovered
⇒ Consider the center graph of candidates
Density of the densest subgraph = initial density (graph is complete)
Cover the connections in the subgraph with the greatest density with the corresponding center node

Approximation Algorithm
[Figure: example graph with nodes 1 to 6 and the center graph (sides I and O) of candidate node 2 after the first step]
What are good center nodes? Nodes that can cover many uncovered connections.
Next step: Some connections are already covered
⇒ Consider the center graph of candidates
Repeat this algorithm until all connections are covered
Theorem: The generated cover is optimal up to a logarithmic factor
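
The greedy loop can be illustrated with a deliberately simplified sketch. It is an assumption-laden variant, not the algorithm from the slides: instead of extracting the densest subgraph of each center graph, it scores every candidate by the density of its whole center graph of still-uncovered connections (edges divided by the nodes that have an uncovered edge) and covers all of those connections at once. The logarithmic approximation guarantee is therefore not claimed for this code. The example graph is hypothetical, and self-entries in the labels are treated as implicit.

```python
def transitive_closure(graph):
    """Return (nodes, reach) where reach[v] is the set of nodes reachable from v."""
    nodes = set(graph) | {s for succs in graph.values() for s in succs}
    reach = {}
    for start in nodes:
        seen, stack = set(), [start]
        while stack:
            for succ in graph.get(stack.pop(), ()):
                if succ not in seen:
                    seen.add(succ)
                    stack.append(succ)
        reach[start] = seen
    return nodes, reach

def greedy_two_hop_cover(graph):
    nodes, reach = transitive_closure(graph)
    uncovered = {(a, b) for a in nodes for b in reach[a] if a != b}
    l_out = {v: set() for v in nodes}
    l_in = {v: set() for v in nodes}

    def center_graph(c):
        # Uncovered connections (a, b) that have a path a ->* c ->* b.
        ancestors = {a for a in nodes if c in reach[a]} | {c}
        descendants = reach[c] | {c}
        return [(a, b) for a in ancestors for b in descendants
                if (a, b) in uncovered]

    while uncovered:
        best, best_edges, best_density = None, [], 0.0
        for c in sorted(nodes):
            edges = center_graph(c)
            if not edges:
                continue
            sides = len({a for a, _ in edges}) + len({b for _, b in edges})
            density = len(edges) / sides
            if density > best_density:
                best, best_edges, best_density = c, edges, density
        for a, b in best_edges:          # cover them with center node `best`
            if a != best:
                l_out[a].add(best)
            if b != best:
                l_in[b].add(best)
        uncovered -= set(best_edges)
    return l_out, l_in

def connected(a, b, l_out, l_in):
    return bool((l_out[a] | {a}) & (l_in[b] | {b}))

# Hypothetical graph with 8 connections in its transitive closure.
graph = {1: [2], 2: [4, 5, 6], 3: [4]}
l_out, l_in = greedy_two_hop_cover(graph)
cover_size = sum(len(s) for s in l_out.values()) + sum(len(s) for s in l_in.values())
print(cover_size)
```

On this toy graph node 2 is picked first because it lies on most connections, which matches the intuition on the slides.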

Optimizing Performance [EDBT04]
• Density of the densest subgraph of a node's center graph never increases when connections are covered
• Precompute estimates, recompute on demand (using a Priority Queue) ⇒ ~2 computations per node
• Initial Center Graphs are always their densest subgraphs
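
Because a node's density can only decrease, a stored estimate is always an upper bound, which allows lazy re-evaluation. The following sketch shows that idea with Python's `heapq`; the function `pick_best`, the callback `recompute`, and the numbers are hypothetical stand-ins for the actual density computation.

```python
import heapq

def pick_best(heap, recompute):
    """Pop the candidate with the highest current score, using stale upper bounds.

    heap holds (-estimate, node) pairs; recompute(node) returns the current
    score, which is assumed never to exceed the stored estimate.
    Returns (node, score, number_of_recomputations).
    """
    recomputations = 0
    while heap:
        _, node = heapq.heappop(heap)
        score = recompute(node)
        recomputations += 1
        # If the fresh score still beats every remaining upper bound, it is the best.
        if not heap or score >= -heap[0][0]:
            return node, score, recomputations
        heapq.heappush(heap, (-score, node))
    return None, 0.0, recomputations

# Hypothetical estimates; the true density of "a" has dropped from 4.0 to 2.0.
estimates = {"a": 4.0, "b": 3.0, "c": 1.0}
current = {"a": 2.0, "b": 3.0, "c": 1.0}
heap = [(-est, node) for node, est in estimates.items()]
heapq.heapify(heap)
result = pick_best(heap, current.__getitem__)
print(result)
```

Only the candidates at the top of the queue are recomputed; node "c" is never touched in this round.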

Is that enough?
For our example:
• Transitive Closure: 344,992,370 connections
• Two-Hop Cover: 1,289,930 entries
⇒ compression factor of ~267
⇒ queries are still fast (~7.6 entries/node)
But: Computation took 45 hours and 80 GB RAM!

HOPI: Divide and Conquer
Framework of an Algorithm:
1. Partition the graph such that the transitive closures of the partitions fit into memory and the weight of crossing edges is minimized
2. Compute the two-hop cover for each partition
3. Combine the two-hop covers of the partitions into the final cover

Step 3: Cover Joining
Naive Algorithm (from EDBT ’04), using the current Lin and Lout
[Figure: cross-partition link from s to t]
For each cross-partition link s→t:
• Choose t as center node for all connections over s→t
• Add t to Lin(d) of all descendants d of t and t itself
• Add t to Lout(a) of all ancestors a of s and s itself
Join has to be done sequentially for all links
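
A minimal sketch of this naive join, assuming the partition covers are given as dictionaries of sets and that ancestors and descendants are looked up with the current cover. The helper `connected` treats self-entries as implicit, and the two-partition example is hypothetical.

```python
def connected(a, b, l_out, l_in):
    return bool((l_out[a] | {a}) & (l_in[b] | {b}))

def naive_join(nodes, links, l_out, l_in):
    """Extend the partition covers in place, one cross-partition link at a time."""
    for s, t in links:  # sequential: each link sees the labels added by earlier links
        ancestors = [a for a in nodes if connected(a, s, l_out, l_in)]    # includes s
        descendants = [d for d in nodes if connected(t, d, l_out, l_in)]  # includes t
        for a in ancestors:
            l_out[a].add(t)   # t is the center node for every connection over s -> t
        for d in descendants:
            l_in[d].add(t)

# Hypothetical partitions {1 -> 2} and {3 -> 4}, joined by the link 2 -> 3.
nodes = [1, 2, 3, 4]
l_out = {1: {2}, 2: set(), 3: set(), 4: set()}   # cover of partition {1, 2}
l_in = {1: set(), 2: set(), 3: set(), 4: {3}}    # cover of partition {3, 4}
naive_join(nodes, [(2, 3)], l_out, l_in)
print(connected(1, 4, l_out, l_in))
```

The ancestor and descendant lookups depend on labels written for earlier links, which is why the links cannot be processed independently.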

Results with Naive Join
Best combination of algorithms:
• Transitive Closure: 344,992,370 connections
• Two-Hop Cover: 15,976,677 entries
⇒ compression factor of ~21.6
⇒ queries are still ok (~94.5 entries/node)
⇒ build time is feasible (~3 hours with 1 CPU and 1 GB RAM)
Can we do better?

Structurally Recursive Join Algorithm
Basic Idea:
• Compute a (small) graph from the partitioning
• Compute its two-hop cover Hin, Hout
• Combine this cover with the partition covers

Example
[Figure: partitioned graph with nodes 1 to 8]
Build the partition-level skeleton graph (PSG)

Example (ctd.)
[Figure: PSG with nodes 1, 2, 7, 8 and a table of its two-hop cover Hin, Hout]
Join Algorithm:
• For each link source s, add Hout(s) to Lout(a) for each ancestor a of s in s's partition
• For each link target t, add Hin(t) to Lin(d) for each descendant d of t in t's partition
Join can be done concurrently for all links
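
The join over the skeleton graph can be sketched as follows. Several parts are assumptions made for illustration: the PSG contains the link sources and targets, the links themselves, and an edge from a link target to every link source it reaches inside the same partition; and the cover H of the PSG is a trivial placeholder (Hout(u) holds u and everything reachable from u, Hin(u) holds only u) where the real algorithm would compute a compact two-hop cover. The three-partition chain is hypothetical.

```python
def connected(a, b, l_out, l_in):
    return bool((l_out[a] | {a}) & (l_in[b] | {b}))

def skeleton_join(part, links, l_out, l_in):
    """part maps node -> partition id; links are cross-partition edges (s, t)."""
    sources = {s for s, _ in links}
    targets = {t for _, t in links}
    # Ancestors and descendants inside a partition, read before labels change.
    anc = {s: [a for a in part
               if part[a] == part[s] and connected(a, s, l_out, l_in)]
           for s in sources}
    desc = {t: [d for d in part
                if part[d] == part[t] and connected(t, d, l_out, l_in)]
            for t in targets}

    # Partition-level skeleton graph (PSG).
    psg = {u: set() for u in sources | targets}
    for s, t in links:
        psg[s].add(t)
    for t in targets:
        for s in sources:
            if s != t and part[s] == part[t] and connected(t, s, l_out, l_in):
                psg[t].add(s)

    # Placeholder two-hop cover of the PSG with explicit self-entries.
    h_out, h_in = {}, {}
    for u in psg:
        seen, stack = {u}, [u]
        while stack:
            for succ in psg[stack.pop()]:
                if succ not in seen:
                    seen.add(succ)
                    stack.append(succ)
        h_out[u], h_in[u] = seen, {u}

    # The join itself: the updates for different links do not depend on each other.
    for s in sources:
        for a in anc[s]:
            l_out[a] |= h_out[s]
    for t in targets:
        for d in desc[t]:
            l_in[d] |= h_in[t]

# Hypothetical chain 1 -> 2 | 3 -> 4 | 5 -> 6 with links 2 -> 3 and 4 -> 5.
part = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
l_out = {1: {2}, 2: set(), 3: {4}, 4: set(), 5: {6}, 6: set()}
l_in = {v: set() for v in part}
skeleton_join(part, [(2, 3), (4, 5)], l_out, l_in)
print(connected(1, 6, l_out, l_in))
```

Unlike the naive join, every label update here reads only the original partition covers and H, so the loop over links could be split across threads.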

Example (ctd.)
[Figure: partitioned graph with nodes 1 to 8 and the extended labels Lout={…,2,7} and Lin={…,2}]
Lemma: It is enough to cover connections from link sources to link targets

Final Results for Index Creation
• Transitive Closure: 344,992,370 connections
• Two-Hop Cover: 9,999,052 entries
⇒ compression factor of ~34.5
⇒ queries are still ok (~59.2 entries/node)
⇒ build time is good (~23 minutes with 1 CPU and 1 GB RAM)
Cover size 8 times larger than best, but ~118 times faster with ~1% memory
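
The compression factors and entries per node quoted for the three builds follow directly from the reported counts. A small check, with the dictionary keys chosen here only as labels:

```python
CLOSURE = 344_992_370   # connections in the transitive closure
ELEMENTS = 168_991      # nodes of the element-level graph

def stats(entries):
    """Return (compression factor, cover entries per node), rounded to one decimal."""
    return round(CLOSURE / entries, 1), round(entries / ELEMENTS, 1)

covers = {
    "unpartitioned build": 1_289_930,
    "naive join": 15_976_677,
    "recursive join": 9_999_052,
}
for name, entries in covers.items():
    factor, per_node = stats(entries)
    print(f"{name}: compression {factor}, {per_node} entries/node")
```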

Why Distances are Difficult
[Figure: path from v through u to w with dist(v,u)=2 and dist(u,w)=4]
• Should be simple to add:
  Lout(v)={u, …}, Lin(w)={u, …} becomes Lout(v)={(u,2), …}, Lin(w)={(u,4), …}
  dist(v,w)=dist(v,u)+dist(u,w)=2+4=6
• But the devil is in the details…

Why Distances are Difficult
[Figure: path from v through u to w with distances 2 and 4, plus a direct connection from v to w]
dist(v,w)=1
Center node u does not reflect the correct distance of v and w

Solution: Distance-aware Center Graph
[Figure: example graph with nodes 1 to 6 and the distance-aware center graph (sides I and O) of candidate node 4]
• Add edges to the center graph only if the corresponding connection is a shortest path
• Correct, but two problems:
  • Expensive to build the center graph (2 additional lookups per connection)
  • Initial graphs are no longer complete ⇒ bound is no longer tight
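
The distance problem and the shortest-path test can be made concrete with the v, u, w example. The sketch assumes labels are stored as dictionaries mapping a center node to its distance, with an implicit self-entry of distance 0; the function names are illustrative.

```python
import math

def distance(v, w, l_out, l_in):
    """dist(v, w) = minimum over common centers c of dist(v, c) + dist(c, w)."""
    out = {v: 0, **l_out.get(v, {})}
    inn = {w: 0, **l_in.get(w, {})}
    common = out.keys() & inn.keys()
    return min((out[c] + inn[c] for c in common), default=math.inf)

def on_shortest_path(dist, a, c, b):
    """Center-graph edge test: needs the two extra lookups dist(a, c) and dist(c, b)."""
    return dist[a, c] + dist[c, b] == dist[a, b]

# True distances from the slide: v reaches u in 2, u reaches w in 4, v reaches w in 1.
dist = {("v", "u"): 2, ("u", "w"): 4, ("v", "w"): 1}

# Covering (v, w) with center u yields the wrong distance.
wrong = distance("v", "w", {"v": {"u": 2}}, {"w": {"u": 4}})
print(wrong, on_shortest_path(dist, "v", "u", "w"))

# A distance-aware cover uses a center on a shortest path, here v itself.
right = distance("v", "w", {"v": {"u": 2}}, {"w": {"u": 4, "v": 1}})
print(right)
```

The edge (v, w) is left out of the center graph of u, so the connection must be covered by another center that does lie on a shortest path.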

New Bound for Distance-Aware CGs
Estimation for Initial Density
Assume we know the CG (E = #edges). Then [formula lost in the source]
But: precomputation takes 4 h
⇒ Reduces time to build two-hop cover by 2 hours
Solution: random sampling of large center graphs

Incremental Maintenance
How to update the two-hop cover when documents (nodes, elements) are
• inserted in the collection (join)
• deleted from the collection
• updated (delete+insert)
Rebuilding the complete cover should be the last resort!

Deleting "good" documents
[Figure: document-level graph with documents 1 to 9]
"Good" documents separate the document-level graph: ancestors of d and descendants of d are connected only through d
Delete document 6 ⇒ Deletions in covers of elements in documents 3,4,8,9 (+ doc 6)

Deleting "bad" documents
[Figure: document-level graph with documents 1 to 9]
"Bad" documents don't separate the doc-level graph: ancestors of d and descendants of d are connected through d and by other docs
• Delete document 5
• Deletions in covers of elements in documents 1,2,3,7 (+ doc 5)
• Add 2-hop cover for connections starting in docs 1,2,3 (but not 4) and ending in 7
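
Whether a document is "good" or "bad" can be checked on the document-level graph by testing if any ancestor still reaches a descendant once the document is removed. The sketch assumes an acyclic document-level graph stored as adjacency lists; the graph below is hypothetical and is not the one drawn on the slides.

```python
def reachable(graph, start, removed=None):
    """Nodes reachable from start over at least one edge, ignoring `removed`."""
    seen, stack = set(), [start]
    while stack:
        for succ in graph.get(stack.pop(), ()):
            if succ != removed and succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen

def separates(graph, d):
    """True if ancestors and descendants of d are connected only through d."""
    nodes = set(graph) | {s for succs in graph.values() for s in succs}
    descendants = reachable(graph, d)
    ancestors = {a for a in nodes if a != d and d in reachable(graph, a)}
    return all(not (reachable(graph, a, removed=d) & descendants)
               for a in ancestors)

# Hypothetical document-level graph.
docs = {
    3: [6], 4: [6], 6: [8, 9],   # every path from 3 or 4 to 8 or 9 passes through 6
    1: [5], 2: [5, 7], 5: [7],   # 2 also reaches 7 directly, bypassing 5
}
print(separates(docs, 6))  # "good": only deletions in the cover are needed
print(separates(docs, 5))  # "bad": surviving connections need new cover entries
```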

Future Work
• Applications with non-XML data
• Length-Bound Connections: n-Hop-Cover
• Distance-Aware Solution for Large Graphs with many cycles (partitioning breaks cycles)
• Large-scale experiments with huge data
  • Complete DBLP (~600,000 docs)
  • IMDB (>1 million docs, cycles)
  with many concurrent threads/processes
  • 64 CPU Sun server
  • 16 or 32 cluster nodes

Conclusion
• HOPI as connection and distance index for linked XML documents
• Efficient Divide-and-Conquer Build Algorithm
• Efficient Insertion and (sometimes) Deletion of Documents, Elements, Edges
