Efficient Complex Query Support For Multi-version XML Documents

Shu-Yao Chien, Dept. of CS, UCLA, csy@cs.ucla.edu; Vassilis J. Tsotras, Dept. of CS&E, UC Riverside, tsotras@cs.ucr.edu; Carlo Zaniolo, Dept. of CS, UCLA, zaniolo@cs.ucla.edu; Donghui Zhang, Dept. of CS&E, UC Riverside, donghui@cs.ucr.edu

Content • Motivation • Problem statement • Framework • Problem Reduction • Solutions • Performance • Conclusions

Motivation • The web changes everything---XML unifies everything. • An assortment of new and old applications seek from XML a shared technology and toolset to support their assorted requirements. • Version management for XML documents is an important topic. • Main requirements and research challenges: • Efficient version retrieval. • Storage efficiency. • Complex query support.

Problem Definition • Given an XML document that evolves over time, how can we store its whole history and perform complex queries on any version efficiently?

Durable Node Numbering Scheme • An XML document has an ordered-tree structure, and each element has: • a Durable Node Number (DNN), and • a Range • Figure: each element is labeled with its DNN and its DNN + Range.

Node Numbering Scheme --- by Example • DNN preserves element order as in a pre-order traversal. • Range preserves the parent-child relationship such that: dnn(P) < dnn(C) < dnn(C)+range(C) < dnn(P)+range(P).
Figure (example tree; the number in parentheses is the one printed next to each node): Root (1): dnn=1, range=100; Ch A (2): dnn=5, range=25; Fig D (3): dnn=11, range=2; Sec E (4): dnn=21, range=5; Ch B (5): dnn=51, range=30; Sec F (6): dnn=55, range=10; Fig G (9): dnn=61, range=2; Fig H (8): dnn=71, range=2.
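The containment inequality can be checked mechanically. The following is a minimal sketch, assuming the dnn and range values from the example figure; the `Node` class and the `is_ancestor` helper are illustrative names, not part of the presented system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    dnn: int   # durable node number
    rng: int   # range reserved for the element and its descendants


def is_ancestor(p: Node, c: Node) -> bool:
    """True when c's interval lies strictly inside p's interval."""
    return p.dnn < c.dnn and c.dnn + c.rng < p.dnn + p.rng


# Values taken from the example tree.
root = Node("Root", 1, 100)
ch_a = Node("Ch A", 5, 25)
ch_b = Node("Ch B", 51, 30)
sec_f = Node("Sec F", 55, 10)
fig_g = Node("Fig G", 61, 2)
fig_h = Node("Fig H", 71, 2)

for parent in (root, ch_a, ch_b, sec_f):
    print(parent.name, "contains Fig G:", is_ancestor(parent, fig_g))
```

Because the test only compares two pairs of integers, ancestry can be decided without walking the tree, which is what lets the later slides reduce structural queries to range lookups.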

Version Model • Each element has: • Lifespan --- (Vstart, Vend) • SPaR range --- (DNN, Range) • Adding a new version N corresponds to a set of changes: • Delete(E) – Set E.Vend to N and free its SPaR range. • Insert(E) – Set the lifespan of E to (N, now) and assign it an unused SPaR range. • Update(E, new value) – Delete(E) + Insert(E) using the same SPaR range but the new value.
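The three change operations can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the lifespan is read as half-open (an element deleted in version N is no longer visible in N), `None` stands for "now", and the `Element` and `History` names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    name: str
    dnn: int
    rng: int
    value: str
    vstart: int
    vend: Optional[int] = None  # None stands for "now"

    def alive_in(self, version: int) -> bool:
        # Assumption: lifespan is half-open, [vstart, vend).
        return self.vstart <= version and (self.vend is None or version < self.vend)


class History:
    def __init__(self) -> None:
        self.records: List[Element] = []

    def insert(self, version: int, name: str, dnn: int, rng: int, value: str) -> Element:
        elem = Element(name, dnn, rng, value, vstart=version)
        self.records.append(elem)
        return elem

    def delete(self, version: int, elem: Element) -> None:
        elem.vend = version  # the old record stays in the history

    def update(self, version: int, elem: Element, new_value: str) -> Element:
        # Delete + Insert, reusing the same SPaR range for the new value.
        self.delete(version, elem)
        return self.insert(version, elem.name, elem.dnn, elem.rng, new_value)

    def snapshot(self, version: int) -> List[Element]:
        live = [e for e in self.records if e.alive_in(version)]
        return sorted(live, key=lambda e: e.dnn)


h = History()
ch_a = h.insert(1, "Ch A", 5, 25, "first draft")
h.insert(1, "Ch B", 51, 30, "unchanged")
h.update(3, ch_a, "revised")
print([e.value for e in h.snapshot(2)])
print([e.value for e in h.snapshot(3)])
```

Nothing is ever overwritten: an update closes one record and opens another, so every earlier version remains reconstructible from the lifespans alone.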

Framework for Storage Schemes • Two types of tags: individual tags (abstract, conclusion) and list tags (chapter, section, figure). • Users query list-tag elements by order (e.g. chapter 2) rather than by SPaR range (e.g. the chapter whose SPaR range is (128, 512)). The order must therefore be transformed into a SPaR range, which calls for separate indices.

Problem Reduction • Complex queries that can be reduced to partial version retrievals: • Structural projection: “project the part of the document between chapters 2 and 5 in version 20”; • Path-expression: “find the chapter that contains figure 7 in version 10”.

Problem Reduction • Structural projection: “project the part of the document between chapters 2 and 5 in version 20”: • Query the CH-index to find all chapters in version 20; • Compute the SPaR range between chapters 2 and 5; • Run a partial version retrieval on the full index.

Problem Reduction • Partial version retrieval: given version i and DNN range r, find all elements whose DNN ∈ r in version i.
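To make the reduction concrete, here is a minimal sketch that defines partial version retrieval as a filter and builds a structural projection on top of it. The linear scans stand in for the CH-index and the full index, the half-open lifespan and inclusive DNN range are assumptions, and the deletion of Fig G in version 2 is hypothetical data added for illustration.

```python
from typing import List, NamedTuple, Optional


class Rec(NamedTuple):
    tag: str
    label: str
    dnn: int
    rng: int
    vstart: int
    vend: Optional[int] = None  # None = still alive


def alive(r: Rec, version: int) -> bool:
    return r.vstart <= version and (r.vend is None or version < r.vend)


def partial_version_retrieval(records: List[Rec], version: int, lo: int, hi: int) -> List[Rec]:
    """All elements alive in `version` whose DNN falls in [lo, hi], in document order."""
    hits = [r for r in records if alive(r, version) and lo <= r.dnn <= hi]
    return sorted(hits, key=lambda r: r.dnn)


def project_chapters(records: List[Rec], version: int, first: int, last: int) -> List[Rec]:
    # Step 1: stand-in for the CH-index, chapters of this version in order.
    chapters = sorted((r for r in records if r.tag == "ch" and alive(r, version)),
                      key=lambda r: r.dnn)
    # Step 2: turn chapter positions into a DNN range.
    lo = chapters[first - 1].dnn
    hi = chapters[last - 1].dnn + chapters[last - 1].rng
    # Step 3: partial version retrieval on the full set of records.
    return partial_version_retrieval(records, version, lo, hi)


DOC = [
    Rec("root", "Root", 1, 100, 1),
    Rec("ch", "Ch A", 5, 25, 1),
    Rec("fig", "Fig D", 11, 2, 1),
    Rec("sec", "Sec E", 21, 5, 1),
    Rec("ch", "Ch B", 51, 30, 1),
    Rec("sec", "Sec F", 55, 10, 1),
    Rec("fig", "Fig G", 61, 2, 1, 2),  # hypothetical: deleted in version 2
    Rec("fig", "Fig H", 71, 2, 1),
]

print([r.label for r in project_chapters(DOC, 1, 2, 2)])
print([r.label for r in project_chapters(DOC, 2, 2, 2)])
```

A real index would answer the same query without scanning every record; the sketch only pins down what the query is supposed to return.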

Problem Reduction • Path-expression: “find/construct the chapter that contains figure 7 in version 10”; • Query the FIG-index to find the SPaR range of figure 7 in version 10; • Query the CH-index with that SPaR range to find the chapter; • To construct the chapter, run a partial version retrieval on the full index.
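The two index lookups can be sketched as below. Plain lists stand in for the FIG-index and CH-index, the half-open lifespan is an assumption, and the deletion of Fig D in version 3 is hypothetical data chosen to show that "figure 1" denotes different elements in different versions.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    label: str
    dnn: int
    rng: int
    vstart: int
    vend: Optional[int] = None  # None = still alive


def live(index: List[Entry], version: int) -> List[Entry]:
    """Entries alive in `version`, in document (DNN) order."""
    hits = [e for e in index
            if e.vstart <= version and (e.vend is None or version < e.vend)]
    return sorted(hits, key=lambda e: e.dnn)


def find_chapter_of_figure(ch_index: List[Entry], fig_index: List[Entry],
                           version: int, k: int) -> Optional[Entry]:
    # Step 1: order -> SPaR range, via the figure index.
    fig = live(fig_index, version)[k - 1]
    # Step 2: find the chapter whose interval contains the figure's interval.
    for ch in live(ch_index, version):
        if ch.dnn < fig.dnn and fig.dnn + fig.rng < ch.dnn + ch.rng:
            return ch
    return None


CH_INDEX = [Entry("Ch A", 5, 25, 1), Entry("Ch B", 51, 30, 1)]
FIG_INDEX = [
    Entry("Fig D", 11, 2, 1, 3),  # hypothetical: deleted in version 3
    Entry("Fig G", 61, 2, 1),
    Entry("Fig H", 71, 2, 1),
]

print(find_chapter_of_figure(CH_INDEX, FIG_INDEX, 1, 1).label)
print(find_chapter_of_figure(CH_INDEX, FIG_INDEX, 3, 1).label)
```

The position-to-range step depends on the version, which is why the per-tag indices have to be multi-version structures themselves.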

Indexing for List Tags • Indexing the list tags (CH-index, FIG-index) is straightforward because these indices are small. • Multi-version B+-tree (MVBT) [BGO+96]: asymptotically optimal in space, update cost, and partial version retrieval.

Storage and Query Scheme for Full Index • We examine two schemes: • MVBT Storage/Index • UBCC Storage + secondary index

Motivation for UBCC Storage • The MVBT is capable of storing and querying the multi-versioned XML document, and is asymptotically optimal. Why UBCC? • The MVBT is designed for handling one-by-one updates; it is not specialized for the batch updates that arise in document versioning.

Traditional Versioning Schemes • The naive approach stores each version in its entirety: it minimizes retrieval cost but uses storage very inefficiently. • RCS (Revision Control System): • stores the latest version in its entirety, and • represents old versions by deltas (reverse edit scripts) • minimizes storage cost • version retrieval cost grows linearly with version number • SCCS (Source Code Control System): • objects are time-stamped and stored in document order • version retrieval cost can be as high as reading the whole change history • These schemes are used by most current systems, but they need improvements in storage management, retrieval, query, and support for complex objects.

UBCC Storage Scheme • RCS and SCCS store major versions and incremental modifications. To answer a query, they find the nearest major version and apply the incremental changes of the intervening versions. They are also designed for full version retrieval. • UBCC [VLDB’01]: Usefulness-Based Copy Control, which uses the concept of page usefulness.

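Page Usefulness – by Example • We set a minimum usefulness requirement Umin, e.g. 70% (0 < Umin <= 1). • A page is useful when its usefulness is above Umin and useless when it is below Umin. • Figure: a page holding objects A, B, C, D across three versions: • V1: all four objects live, usefulness 100% (useful) • V2: one object deleted, usefulness 75% (useful) • V3: two more objects deleted, usefulness 25% (useless)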
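The percentages in the example can be reproduced with a few lines. This sketch assumes equal-sized objects (so usefulness is a simple count ratio), a half-open lifespan, and a page that counts as useful when its usefulness equals Umin exactly; which of A to D is deleted in each version is also an assumption, since the figure shows only the counts.

```python
from typing import List, Optional, Tuple

Record = Tuple[str, int, Optional[int]]  # (name, vstart, vend); vend None = alive


def usefulness(page: List[Record], version: int) -> float:
    """Fraction of the page's objects that are alive in `version`."""
    live = sum(1 for _, vstart, vend in page
               if vstart <= version and (vend is None or version < vend))
    return live / len(page)


def is_useful(page: List[Record], version: int, umin: float) -> bool:
    # Assumption: equality with Umin still counts as useful.
    return usefulness(page, version) >= umin


# Assumed deletions: A in version 2, B and C in version 3.
PAGE = [("A", 1, 2), ("B", 1, 3), ("C", 1, 3), ("D", 1, None)]

for v in (1, 2, 3):
    print(v, usefulness(PAGE, v), is_useful(PAGE, v, umin=0.7))
```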

Usefulness Based Copy Control (UBCC) • STEP 1: Determine page usefulness for copying. • STEP 2: Append new/copied objects into new pages by their logical order. • Figure, Version 1: P1 = (Root, Ch A, Fig D, Sec E); P2 = (Ch B, Sec F, Fig G, Fig H). • Figure, Version 2: three deletions (DEL) and six insertions: INS(Sec J), INS(Fig M), INS(Sec T), INS(Fig R), INS(Ch K), INS(Sec L). After the deletions, U(P1) = 75% and U(P2) = 50% < Umin = 70%, so the live objects of P2 are copied (COPY). New pages: P3 = (Sec J, Ch B, Sec F, Fig M), U(P3) = 100%; P4 = (Sec T, Fig R, Ch K, Sec L), U(P4) = 100%.
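The two steps can be sketched as a single check-in routine that reproduces the page layout of the example. This is an illustration only: the DNN values of the inserted objects, the choice of Sec E, Fig G and Fig H as the three deleted objects, the four-object page capacity, and the names `Obj`, `Page` and `check_in` are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Obj:
    name: str
    dnn: int
    vstart: int
    vend: Optional[int] = None  # None = alive


@dataclass
class Page:
    pid: str
    objs: List[Obj]
    useful: bool = True  # pages already marked useless are not copied again


def usefulness(page: Page) -> float:
    return sum(o.vend is None for o in page.objs) / len(page.objs)


def check_in(pages: List[Page], version: int, deleted: Set[str],
             inserted: List[Obj], umin: float, capacity: int) -> List[Page]:
    # Apply the deletions of the new version.
    for page in pages:
        for o in page.objs:
            if o.vend is None and o.name in deleted:
                o.vend = version
    # STEP 1: pages that fell below Umin give up their live objects for copying.
    pending = list(inserted)
    for page in pages:
        if page.useful and usefulness(page) < umin:
            page.useful = False
            pending += [Obj(o.name, o.dnn, o.vstart) for o in page.objs if o.vend is None]
    # STEP 2: write new and copied objects into new pages in logical (DNN) order.
    pending.sort(key=lambda o: o.dnn)
    new_pages: List[Page] = []
    for i in range(0, len(pending), capacity):
        pid = f"P{len(pages) + len(new_pages) + 1}"
        new_pages.append(Page(pid, pending[i:i + capacity]))
    pages.extend(new_pages)
    return new_pages


pages = [
    Page("P1", [Obj("Root", 1, 1), Obj("Ch A", 5, 1), Obj("Fig D", 11, 1), Obj("Sec E", 21, 1)]),
    Page("P2", [Obj("Ch B", 51, 1), Obj("Sec F", 55, 1), Obj("Fig G", 61, 1), Obj("Fig H", 71, 1)]),
]
inserted_v2 = [Obj("Sec J", 27, 2), Obj("Fig M", 61, 2), Obj("Sec T", 67, 2),
               Obj("Fig R", 71, 2), Obj("Ch K", 85, 2), Obj("Sec L", 88, 2)]
new_pages_v2 = check_in(pages, 2, {"Sec E", "Fig G", "Fig H"}, inserted_v2, umin=0.7, capacity=4)

for p in new_pages_v2:
    print(p.pid, [o.name for o in p.objs])
```

Copying Ch B and Sec F next to the newly inserted objects keeps the current version clustered in document order, at the price of storing those two objects twice.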

Complexity Analysis • Version retrieval I/O cost for Version N is bounded by SN/Umin, where SN is the size of Version N. • E.g. Umin = 50% ---> I/O <= 2*SN • Version file size is linear in the size of the change history (as with RCS), and is bounded by O(Schg/(1-Umin)), where • Schg is the size of the change history. • Umin is the usefulness requirement. • Both are optimal!
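The form of the retrieval bound can be motivated by a short counting argument. The sketch below is not taken from the slides: it assumes pages of capacity B, that every object alive in Version N is stored in exactly one page that is useful for Version N, and that each such page holds at least Umin * B of live data. B and P_N are symbols introduced here for illustration.

```latex
% Assumptions (not from the slides): B = page capacity,
% P_N = number of pages that are useful for version N,
% S_N measured in the same unit as B.
\begin{align*}
  U_{\min}\, B\, P_N &\le S_N
    && \text{each useful page holds at least } U_{\min} B \text{ of live data} \\
  P_N &\le \frac{S_N}{U_{\min}\, B}
    && \text{pages read to reconstruct version } N \\
  U_{\min} = 0.5 &\;\Rightarrow\; P_N \le \frac{2\, S_N}{B}
\end{align*}
```

With S_N expressed in pages (B = 1), the last line is the slide's example: at Umin = 50%, at most twice the size of the version is read.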

Indexing Choices using UBCC • Using UBCC to cluster the document elements. On top of the document file: • MVBT as a dense index; or • MVRT as a sparse index.

Sparse Page Index --- Multi-version R-Tree • Multi-version R-Tree (MVRT): each record corresponds to a UBCC page: • Life Span: (T1,T2) • Maximum DNN Range: (D1,D2) • UBCC Page-ID • When retrieving a segment of a version, the MVRT is searched to locate useful data pages with an overlapping DNN range. • Figure: pages P5, P8, P11, P15 and P22 plotted against a DNN range axis and a version axis, with the query “Retrieve Version 10, Segment (D1,D2)” marked at V10.

Sparse vs. Dense Indexing • Good for sparse MVRT: • small size; • each page is checked at most once. • Bad for sparse MVRT: • May read unnecessary pages, e.g.: • Page P (Umin = 50%): Max DNN Range = (200,500), Life Span = (1,4), holding E1 (DNN = 200, Life = (1,4)), E2 (DNN = 300, Life = (1,4)), E3 (DNN = 400, Life = (1,2)), E4 (DNN = 500, Life = (1,2)). • Request: Version 3, SPaR = (420,700) • Page P qualifies but contains no valid element.
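The false-positive case can be replayed in code. This sketch uses the values from the example (Page P and elements E1 to E4) and assumes half-open lifespans and inclusive DNN ranges; `page_qualifies` models what a sparse page-level index can decide, `valid_elements` what is actually found after reading the page. Both function names are illustrative.

```python
from typing import List, NamedTuple, Tuple


class Elem(NamedTuple):
    name: str
    dnn: int
    life: Tuple[int, int]  # (vstart, vend), read here as half-open


class PageEntry(NamedTuple):
    page_id: str
    dnn_range: Tuple[int, int]  # maximum DNN range of the page
    life: Tuple[int, int]       # life span of the page


def covers(life: Tuple[int, int], version: int) -> bool:
    return life[0] <= version < life[1]


def page_qualifies(entry: PageEntry, version: int, lo: int, hi: int) -> bool:
    """Sparse-index test: lifespan covers the version and DNN ranges overlap."""
    d1, d2 = entry.dnn_range
    return covers(entry.life, version) and d1 <= hi and lo <= d2


def valid_elements(elems: List[Elem], version: int, lo: int, hi: int) -> List[Elem]:
    """What the page really contributes once it has been read."""
    return [e for e in elems if covers(e.life, version) and lo <= e.dnn <= hi]


P = PageEntry("P", (200, 500), (1, 4))
ELEMS = [
    Elem("E1", 200, (1, 4)),
    Elem("E2", 300, (1, 4)),
    Elem("E3", 400, (1, 2)),
    Elem("E4", 500, (1, 2)),
]

print(page_qualifies(P, 3, 420, 700))   # the index says: read page P
print(valid_elements(ELEMS, 3, 420, 700))  # but nothing in it matches
```

A dense index holds one entry per element, so it would rule out E4 by its lifespan before any data page is read; the sparse index trades that precision for a much smaller structure.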

Experimental Setup • Sun Enterprise 250 Server, Solaris 2.8, 16KB page size, 100-page buffer, GNU C++. • Dataset: 1000 versions; initial version of 1000 objects; each object = 200 bytes; 10% change between consecutive versions. • Implemented schemes: • scheme 1: MVBT storage/index • scheme 2: UBCC storage, dense MVBT index • scheme 3: UBCC storage, sparse MVRT index

Performance Comparison --- Check-In Time and Index Size

Performance Comparison --- Partial Version Retrieval

Conclusions and Future Work • We proposed a framework for storing and querying multi-versioned XML documents. • We examined techniques that merge traditional versioning schemes and temporal databases for XML version management. • Best scheme: • UBCC storage • Sparse MVRT for full index • Dense MVBT for each tag index • Emerging issues: • Query language support for version queries. • User interface for browsing versions and presenting query results.

Thank you!
