Managing XML and Semistructured Data

Managing XML and Semistructured Data Part 4: Compressing XML Data

In this section • XML Compression • Motivation • The State-of-the-Art • Queriable compressors • Non-queriable compressors
Resources • XMill: An Efficient Compressor for XML Data, by Liefke and Suciu, in SIGMOD 2000 • Others: XGrind, XPRESS, XQueC, XMLzip, … • XCQ: From my publications • XQzip: From my publications • MQX: From my publications

Introduction • More and more XML data is created • Duplicate structures (tags, paths …) • Data inflation: data in XML is much larger than raw data • Compression: storage and data transfer • General-purpose compressor (e.g. gzip) • Characteristics of XML data not utilized • Unqueriable

Compression: The Problem • XML for exchange (space or time) • But XML is verbose and inflated due to duplicated tags and paths • Users prefer application-specific formats, e.g. Web server logs • Is XML doomed to fail? • Solution: XML-specific compressors • Non-queriable: XMill • Queriable: XQzip

XML-Specific Compressors • Unqueriable Compression (e.g. XMill): • Full-chunked: data commonalities eliminated • Very good compression ratio • Queriable Compression (e.g. XGrind, XPRESS): • Fine-grained: data commonalities ignored • Inadequate compression ratio and time • Supports simple path queries with atomic predicates

Issues in XML Compression • Compression ratios, Compression time, Query Coverage, Memory Usage…(see my survey paper in WWWJ) Comparison of existing technologies

An Example: Web Server Logs
ASCII file, 15.9 MB (gzipped: 1.6 MB):
202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)
XML-ized Apache web log inflates to 24.2 MB (gzipped: 2.1 MB):
<apache:entry>
  <apache:host>202.239.238.16</apache:host>
  <apache:requestLine>GET / HTTP/1.0</apache:requestLine>
  <apache:contentType>text/html</apache:contentType>
  <apache:statusCode>200</apache:statusCode>
  <apache:date>1997/10/01-00:00:02</apache:date>
  <apache:byteCount>4478</apache:byteCount>
  <apache:referer>http://www.net.jp/</apache:referer>
  <apache:userAgent>Mozilla/3.1[ja](I)</apache:userAgent>
</apache:entry>

XMill • First specialized compressor for XML data • SAX parser for parsing XML data • Still uses gzip as its underlying compressor • Clever grouping of data into containers for compression • Compresses XML via three basic techniques • Compress the structure separately from the data • Group the data values according to their types • Apply semantic (specialized) compressors • Downloadable: www.cs.washington.edu/homes/suciu/XMILL

XMill Architecture:

How XMill Works: Three Ideas • Compress the structure separately from the data • Structure: <apache:entry> <apache:host> </apache:host> . . . </apache:entry> • Data: 202.239.238.16 GET / HTTP/1.0 text/html 200 … • gzip(Structure) + gzip(Data) = 1.75 MB

How XMill Works: Three Ideas • Group the data values according to their types • Structure: <apache:entry> . . . </apache:entry> • Data1: 202.23.23.16 224.42.24.55 … • Data2: GET / HTTP/1.0 GET / HTTP/1.1 … • gzip(Structure) + gzip(Data1) + gzip(Data2) = 1.33 MB
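The first two ideas can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-field records are made up, `compress_grouped` and `decompress_grouped` are hypothetical helper names, and `zlib` (the DEFLATE library behind gzip) stands in for gzip. It is not XMill's actual container format.

```python
import zlib

# Hypothetical pipe-delimited log records: host|request|status.
records = [
    "202.239.238.16|GET / HTTP/1.0|200",
    "202.239.238.17|GET /index.html HTTP/1.0|200",
    "202.239.238.16|GET /logo.gif HTTP/1.0|304",
]

def compress_grouped(rows, field_count=3):
    """Put the i-th field of every record into container i, then compress each container."""
    containers = [[] for _ in range(field_count)]
    for row in rows:
        for i, value in enumerate(row.split("|")):
            containers[i].append(value)
    # One compressed blob per container, so similar values sit next to each other.
    return [zlib.compress("\n".join(c).encode("utf-8")) for c in containers]

def decompress_grouped(blobs):
    """Rebuild the original records from the per-container blobs."""
    columns = [zlib.decompress(b).decode("utf-8").split("\n") for b in blobs]
    return ["|".join(fields) for fields in zip(*columns)]

blobs = compress_grouped(records)
restored = decompress_grouped(blobs)
```

Grouping keeps all hosts together and all request lines together, which is what gives the underlying compressor longer repeated substrings to work with on realistic inputs.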

How XMill Works: Three Ideas • Apply semantic (specialized) compressors: gzip(Structure) + gzip(c1(Data1)) + gzip(c2(Data2)) + ... = 0.82 MB • Examples: • 8-, 16-, 32-bit integer encoding (signed/unsigned) • differential compression (e.g. 1999, 1995, 2001, 2000, 1995, ...) • compression of lists and records (e.g. 104.32.23.1 → 4 bytes) • Needs user input to select the semantic compressor
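Two of the semantic compressors listed above are easy to sketch. The function names `pack_ip`, `delta_encode` and `delta_decode` are hypothetical and only illustrate the idea; they are not XMill's own encoders.

```python
import ipaddress

def pack_ip(text):
    """Encode a dotted IPv4 address as 4 bytes instead of up to 15 characters."""
    return ipaddress.IPv4Address(text).packed

def delta_encode(values):
    """Differential encoding: store the first value, then each difference to its predecessor."""
    previous, deltas = 0, []
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas

def delta_decode(deltas):
    """Inverse of delta_encode: a running sum restores the original values."""
    total, values = 0, []
    for delta in deltas:
        total += delta
        values.append(total)
    return values

packed = pack_ip("104.32.23.1")
deltas = delta_encode([1999, 1995, 2001, 2000, 1995])
```

The small differences produced by `delta_encode` fit into fewer bits than the original years, and the resulting container is then handed to the general-purpose compressor.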

Path Processor – structure container
<Book><Title lang="English">Data Compression</Title> <Author>Gray</Author> <Author>Reiter</Author> </Book>
• Replace each data value with its container number (negative integer):
<Book><Title lang=-1>-2</Title> <Author>-3</Author> <Author>-3</Author> </Book>
• Replace each end tag with 0:
<Book><Title lang=-1 0>-2 0 <Author>-3 0 <Author>-3 0 0
• Replace tags/attributes with positive integers, using a dictionary that gains one entry for each new name: Book = 1, Title = 2, @lang = 3, Author = 4
1 2 3 -1 0 -2 0 4 -3 0 4 -3 0 0
• Less storage: 14 bytes!
• Repeated structure entries can be compressed effectively!
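The encoding above can be reproduced with a small SAX handler. This is a minimal sketch, assuming one container per element or attribute name; the class `StructureEncoder` and its fields are hypothetical, and XMill's real path processor lets users choose container expressions.

```python
import xml.sax

class StructureEncoder(xml.sax.ContentHandler):
    """Turns SAX events into the integer structure stream plus data containers."""

    def __init__(self):
        super().__init__()
        self.tags = {}           # tag or @attribute name -> positive integer
        self.container_ids = {}  # tag or @attribute name -> container number
        self.containers = {}     # container number -> list of data values
        self.structure = []      # the integer structure stream
        self.stack = []          # currently open elements
        self.text = []           # character data collected so far

    def _tag_id(self, name):
        # The dictionary gains one entry for each new name.
        return self.tags.setdefault(name, len(self.tags) + 1)

    def _store(self, name, value):
        cid = self.container_ids.setdefault(name, len(self.container_ids) + 1)
        self.containers.setdefault(cid, []).append(value)
        self.structure.append(-cid)  # data value -> negative container number

    def _flush(self):
        value = "".join(self.text).strip()
        self.text = []
        if value:
            self._store(self.stack[-1], value)

    def startElement(self, name, attrs):
        self._flush()
        self.structure.append(self._tag_id(name))
        self.stack.append(name)
        for attr in attrs.getNames():
            key = "@" + attr
            self.structure.append(self._tag_id(key))
            self._store(key, attrs.getValue(attr))
            self.structure.append(0)  # end of attribute

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        self._flush()
        self.stack.pop()
        self.structure.append(0)      # end tag -> 0

document = (b'<Book><Title lang="English">Data Compression</Title>'
            b'<Author>Gray</Author><Author>Reiter</Author></Book>')
encoder = StructureEncoder()
xml.sax.parseString(document, encoder)
print(encoder.structure)
```

For the Book example this prints the same 14 integers as the slide, and `encoder.containers` holds the data values that would be compressed separately.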

XML Compression XMill Evaluation using XML datasets

Queriable Compressors • XQzip: queriable XML compressor (our work [EDBT04]) • Existing XML compressors (survey in [WWWJ05]): • Unqueriable (e.g. XMill [SIGMOD00]): exploit data commonalities ⇒ better compression rate than gzip • Queriable (e.g. XGrind [ICDE02], XPRESS [SIGMOD03], XQueC, XQzip [EDBT04], XCQ [KAISJ05]): compress data items individually ⇒ inadequate compression rate and time • Features of XQzip: • Uses the SIT to aid query evaluation • Block compression: lets data commonalities be exploited and blocks be used as buffers to reduce decompression overhead

Structure Index Tree (SIT) • Effective elimination of duplicate structures in the XML data • Merging of nodes that have • the same incoming path • the same ordered set of paths of their descendants • SIT Construction • A linear scan of the XML document • Merging of the subtree that we are constructing into its equivalent subtree in the base tree
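The merging step can be sketched as follows. This is a simplified illustration, not the XQzip algorithm: it assumes two sibling subtrees are equivalent when their merged shapes are identical, which is stricter than the path-set condition above, and it builds each subtree fully before merging instead of merging during a single scan. The names `SitNode`, `shape`, `merge` and `build` are hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET

class SitNode:
    def __init__(self, tag):
        self.tag = tag
        self.ids = []       # document-order ids of all element nodes merged here
        self.children = []  # ordered list of SitNode

def shape(node):
    """Canonical description of a merged subtree (assumed equivalence test)."""
    return (node.tag, tuple(shape(child) for child in node.children))

def merge(base, other):
    """Fold `other` into `base`; both are known to have the same shape."""
    base.ids.extend(other.ids)
    for base_child, other_child in zip(base.children, other.children):
        merge(base_child, other_child)

def build(elem, counter):
    """Build the index for `elem`, merging equivalent sibling subtrees."""
    node = SitNode(elem.tag)
    node.ids.append(next(counter))
    for child_elem in elem:
        child = build(child_elem, counter)
        for existing in node.children:
            if shape(existing) == shape(child):
                merge(existing, child)   # duplicate structure eliminated
                break
        else:
            node.children.append(child)  # new structure, keep it
    return node

doc = ET.fromstring("<a><b><d/><e/></b><c/><c/><b><d/><e/></b></a>")
sit = build(doc, itertools.count(1))
b_node, c_node = sit.children
```

In this made-up document the two b subtrees and the two c leaves collapse into one node each, and the `ids` lists remember which original elements every index node stands for. Recomputing `shape` on every comparison keeps the sketch short but is slower than a real implementation would be.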

SIT Construction [Figure: an example document tree (root /, elements a, b, c, d, e, node ids 0-10) next to the SIT built from it, in which equivalent nodes are merged and labelled with id lists such as 2,7 and 5,6]

XQzip Architecture • Index Constructor: construct the SIT • Compressor • Group semantically related items in blocks • Compress each block by gzip • Query Processor: evaluate query • Parser • Executor: apply the SIT to evaluate query • Buffer Manager (By LRU)

SIT Construction Complexity • N: total number of elements in the input XML document • Time complexity: • Worst case: O(N·|SIT|) • Average case: O(N) • Space complexity: • Base tree and the subtree being merged: ≤ 2|SIT| • Space for storing ids of eliminated nodes: O(N)

Data Compression • A balance between full-chunked and fine-grained compression • A distinct data container for each distinct element • Each container compressed (using gzip) into many smaller blocks • Block size? • Too small: query time ↑, compression ratio ↓ • Too large: query time ↓, compression ratio ↑ • Can only be determined by an empirical study
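Block-wise compression of one container can be sketched like this. It is an illustration only: `compress_container`, `get_item`, the NUL separator and the default of 1000 items per block are assumptions made for the example, and `zlib` stands in for gzip.

```python
import zlib

SEPARATOR = "\x00"  # assumed not to occur inside data values

def compress_container(values, items_per_block=1000):
    """Split a data container into fixed-size blocks and compress each block on its own."""
    blocks = []
    for start in range(0, len(values), items_per_block):
        chunk = SEPARATOR.join(values[start:start + items_per_block])
        blocks.append(zlib.compress(chunk.encode("utf-8")))
    return blocks

def get_item(blocks, index, items_per_block=1000):
    """Fetch one data item by decompressing only the block that holds it."""
    block_no, offset = divmod(index, items_per_block)
    chunk = zlib.decompress(blocks[block_no]).decode("utf-8")
    return chunk.split(SEPARATOR)[offset]

values = ["host-%d" % i for i in range(2500)]
blocks = compress_container(values)
item = get_item(blocks, 2499)
```

A query that needs a single value touches one block instead of the whole container, while each block is still large enough for the compressor to find repeated content.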

Block Size • Representative datasets and queries: • Datasets: • Heavy text • Light text • A mix of heavy text and light text • Queries: • High selectivity • Medium selectivity • Low selectivity


Structure of Compressed Data • Block size? • Determined by an empirical study • Querying time • Near-optimal range: 600-1000 data items/block (average optimal: 950) • Compression ratio • Not improved much after 150 KB/block (usually containing more than 1000 items) • ≥ 1000 data items/block

Outline • Introduction • XQzip [EDBT 2004] • Indexing • Data Compression • Query Evaluation • Performance Evaluation • Conclusion

XQzip Query Coverage • All XPath axes except the sideways axes (e.g. preceding-sibling, following-sibling) • Multiple and nested predicates • and / or / not expressions • Aggregations: sum, count, average, max, min • Group queries: e.g. (L1 (L2 + L3 + L4)) • L1: //a[b = "Crete"] (prefix) • L2: c • L3: d[f/count() > 100] • L4: e[//g]

Query Evaluation • Depth-first traversal of the index tree • Buffer management (LRU) • Why buffering? Decompression time dominates • Decompression avoidance
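An LRU buffer pool for decompressed blocks can be sketched as below. The class `BlockBuffer` and its `decompressions` counter are hypothetical and only illustrate why a warm buffer avoids repeated decompression; this is not XQzip's buffer manager.

```python
import zlib
from collections import OrderedDict

class BlockBuffer:
    """Keeps the most recently used decompressed blocks in memory."""

    def __init__(self, blocks, capacity):
        self.blocks = blocks          # list of zlib-compressed blocks
        self.capacity = capacity      # maximum number of decompressed blocks kept
        self.cache = OrderedDict()    # block number -> decompressed bytes, oldest first
        self.decompressions = 0       # how often a block really had to be decompressed

    def get(self, block_no):
        if block_no in self.cache:
            self.cache.move_to_end(block_no)  # mark as most recently used
            return self.cache[block_no]
        data = zlib.decompress(self.blocks[block_no])
        self.decompressions += 1
        self.cache[block_no] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict the least recently used block
        return data

blocks = [zlib.compress(bytes([i]) * 10) for i in range(3)]
buffer = BlockBuffer(blocks, capacity=2)
for block_no in (0, 1, 0, 2, 0):
    buffer.get(block_no)
```

In this access pattern block 0 is decompressed once and then served from the buffer, while block 1 is evicted when block 2 arrives.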

Outline • Introduction • XQzip • Indexing • Data Compression • Query Evaluation • Performance Evaluation • Conclusion


Effectiveness of the SIT • Index size: less than 1% of original size • Load time: a fraction of a second • Node selection: twice as fast as the F&B-Index • Construction time: more than 3 times faster than the F&B-Index

Compression Ratio XQzip is comparable to XMill and gzip, 17% better than XGrind with index size included, 42% better than XGrind without index.

Compression/Decompression Time • XQzip (compression + index construction) is more than 5 times better than XGrind, 1.5 times worse than XMill • XQzip (index-loading + decompression) is more than 3 times better than XGrind, 1.4 times worse than XMill

Query Performance • Cold buffer-pool evaluation • 13 times better than XGrind • Warm buffer-pool evaluation • 80 times better than XGrind • Impressive buffer effect!

Lessons on XML Compression • Good compression ratio and time • Comparable to that of XMill • Much better than that of XGrind (and XPRESS) • Supports a very practical set of queries • A much wider range of queries than XGrind and XPRESS • Very competitive querying time with buffering • 13 times better than XGrind with a cold buffer • 80 times better than XGrind with a warm buffer • Limitations • Cost of building and maintaining complex indexes • No theoretical foundation for the block size

XCQ • XCQ Framework • Experimental Results • Compression Performance • Query Performance • Lessons and Development

XCQ • Objectives: • Achieve Good Compression ratio • Comparable to XMill • Better than XGrind • Achieve Good Query performance • More efficient than XGrind • Querying compressed documents with block-based partial decompression • But addressing issues different from XQzip • Adopt minimal indexing • Establish theory between selectivity and block size

XCQ Strategy • Based on four techniques • DTD Tree and SAX Event Stream Parsing (DSP) • Partition Path-Based Data Grouping (PPG) Format • Block-Statistic Signature (BSS) Indexing • Access Methods
[Figure: XCQ architecture. The XCQ Compression Engine (DSP, PPG format, BSS indexing) takes a DTD and an XML document and produces the compressed document; the XCQ Querying Engine (Access Methods) takes XPath queries and returns query results]

Technique 1 – DTD Tree and SAX Event Stream Parsing (DSP) [Figure: XCQ architecture diagram with the components DSP, PPG format, BSS indexing and Access Methods]

Technique 1 – DTD Tree and SAX Event Stream Parsing (DSP) • Purpose: • To utilize information in the DTD associated with the document • Benefits: • Encodes only the information that cannot be inferred from the DTD • Precise path-based grouping of data items • Runs in an automated manner

DSP – Input and Output • Input to the DSP module: a DTD tree and a stream of SAX events • Output: a structure stream and data streams

DSP Step 1 – Creating a DTD Tree
<!ELEMENT library (entry*)>
<!ELEMENT entry (author, title, year, publisher?, (paper|course_note|book), num_copy)>
<!ELEMENT author EMPTY>
<!ATTLIST author name CDATA>
<!ELEMENT title (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT paper EMPTY>
<!ELEMENT course_note EMPTY>
<!ELEMENT book EMPTY>
<!ELEMENT num_copy (#PCDATA)>
[Figure: the resulting DTD tree. library has the child entry*; entry has the children author (name), title, year, publisher?, the choice paper | course_note | book, and num_copy. The key marks PCDATA nodes]

DSP Step 2 – Processing in DSP Module • How does the DSP module process the following XML document? <library> <entry> <author name="Tom"/> <title>Introduction to "OS"</title> <year>2003</year> <course_note/> <num_copy>3</num_copy> </entry> </library>

SAX Event: Start element – "library" • Structure Stream: (empty) • Data Streams: (empty) [Figure: DTD tree with a key for the traversal path, PCDATA nodes and the DTD tree node being processed]

SAX Event: Start element – "entry" • Match with entry* in the DTD tree: T is written • Structure Stream: T • Data Streams: (empty)

SAX Events: Start element – "author", att0: name="Tom" • End element – "author" • Match with author (name): d0 is written • Structure Stream: T, d0 • Data Streams: d0: Tom

SAX Events: Start element – "title" • PCDATA – "Introduction to "OS"" • End element – "title" • Structure Stream: T, d0, d1 • Data Streams: d0: Tom • d1: Introduction to "OS"

SAX Events: Start element – "year" • PCDATA – "2003" • End element – "year" • Start element – "course_note" • No match with publisher? in the DTD tree: F is written • Structure Stream: T, d0, d1, d2, F • Data Streams: d0: Tom • d1: Introduction to "OS" • d2: 2003

SAX Events: Start element – "course_note" • End element – "course_note" • Choice paper (p0) | course_note (p1) | book (p2): no match with paper, match with course_note, so p1 is written • Structure Stream: T, d0, d1, d2, F, p1 • Data Streams: d0: Tom • d1: Introduction to "OS" • d2: 2003

SAX Events: Start element – "num_copy" • PCDATA – "3" • End element – "num_copy" • End element – "entry" • Structure Stream: T, d0, d1, d2, F, p1 • Data Streams: d0: Tom • d1: Introduction to "OS" • d2: 2003 • d4: 3
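The walk-through can be approximated in code. This is a rough sketch under several assumptions, not XCQ's DSP module: the DTD tree is written by hand as `Decl` objects (a hypothetical helper), the document is read with ElementTree instead of a SAX event stream, data streams are numbered d0, d1, ... in DTD-tree order, and every data item also leaves its stream id in the structure stream. The entries the sketch emits after p1 (d4 and a closing F for entry*) are its own assumption; the slides stop at p1.

```python
import xml.etree.ElementTree as ET

class Decl:
    """One element node of the DTD tree (hypothetical helper)."""
    def __init__(self, name, attrs=(), pcdata=False, children=()):
        self.name, self.attrs, self.pcdata = name, list(attrs), pcdata
        self.children = list(children)  # (kind, target) pairs: one, opt, star, choice
        self.stream_ids = {}            # "@attr" or "#text" -> data stream id

def number_streams(decl, counter=None):
    """Assign d0, d1, ... to every data-carrying node in DTD-tree order."""
    counter = counter if counter is not None else [0]
    keys = ["@" + a for a in decl.attrs] + (["#text"] if decl.pcdata else [])
    for key in keys:
        decl.stream_ids[key] = "d%d" % counter[0]
        counter[0] += 1
    for kind, target in decl.children:
        for sub in (target if kind == "choice" else [target]):
            number_streams(sub, counter)

def encode(decl, node, structure, streams):
    """Emit only what the DTD cannot predict: T/F, choice labels and data."""
    def emit(key, value):
        stream_id = decl.stream_ids[key]
        structure.append(stream_id)
        streams.setdefault(stream_id, []).append(value)
    for attr in decl.attrs:
        emit("@" + attr, node.get(attr))
    if decl.pcdata:
        emit("#text", node.text or "")
    kids, pos = list(node), 0
    for kind, target in decl.children:
        if kind == "choice":
            index = [t.name for t in target].index(kids[pos].tag)
            structure.append("p%d" % index)      # which alternative was taken
            encode(target[index], kids[pos], structure, streams)
            pos += 1
        elif kind == "one":
            encode(target, kids[pos], structure, streams)
            pos += 1
        else:                                    # "opt" or "star"
            while pos < len(kids) and kids[pos].tag == target.name:
                structure.append("T")            # match
                encode(target, kids[pos], structure, streams)
                pos += 1
                if kind == "opt":
                    break
            else:
                structure.append("F")            # no (further) match

library = Decl("library", children=[("star", Decl("entry", children=[
    ("one", Decl("author", attrs=["name"])),
    ("one", Decl("title", pcdata=True)),
    ("one", Decl("year", pcdata=True)),
    ("opt", Decl("publisher", pcdata=True)),
    ("choice", [Decl("paper"), Decl("course_note"), Decl("book")]),
    ("one", Decl("num_copy", pcdata=True)),
]))])
number_streams(library)

root = ET.fromstring("""<library><entry><author name="Tom"/>
<title>Introduction to "OS"</title><year>2003</year>
<course_note/><num_copy>3</num_copy></entry></library>""")
structure, streams = [], {}
encode(library, root, structure, streams)
```

For the sample document the structure stream begins with T, d0, d1, d2, F, p1, and the data streams hold Tom, the title, 2003 and 3 under d0, d1, d2 and d4. Element names never appear in the output because they can be inferred from the DTD tree.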
