
XML Compression Techniques

XML Compression Techniques. Gregory Leighton Web Data Management Lab Department of Computer Science University of Calgary June 24, 2008. Outline. XML primer XML-conscious compression schemes Non-queryable schemes Queryable schemes Future directions. The Extensible Markup Language (XML).



  1. XML Compression Techniques
     Gregory Leighton
     Web Data Management Lab, Department of Computer Science, University of Calgary
     June 24, 2008

  2. Outline
     • XML primer
     • XML-conscious compression schemes
       • Non-queryable schemes
       • Queryable schemes
     • Future directions

  3. The Extensible Markup Language (XML)
     • A W3C-endorsed standard
     • Originally intended as a web document authoring format; has since become a popular method for encoding semi-structured data
       • Data integration, data exchange applications
     • Support for native (tree-based) storage of XML in most commercial and open source DBMSs
       • MySQL, DB2, Oracle, SQL Server
     • A markup language: text content (PCDATA) can be surrounded by descriptive markup (elements and attributes)

  4. Example: An XML Document
     <course>
       <name>CS 501</name>
       <instructor>Ron Charles</instructor>
       <students>
         <student name="Alice">
           <a1>78</a1>
           <a2>86</a2>
           <midterm>91</midterm>
           <project>87</project>
         </student>
         <student name="Bob">
           <a1>69</a1>
           <a2>71</a2>
           <midterm>82</midterm>
           <finalexam>60</finalexam>
         </student>
       </students>
     </course>

  5. XML Data Model
     • XML documents can be modeled as ordered, labeled trees
       • Preserves ancestor-descendant relationships and sibling ordering
     • Each node is assigned a unique ordinal value corresponding to its position in a pre-order traversal
     • Internal nodes represent elements or attributes
     • Leaf nodes represent text content (PCDATA) or attribute values
     • The label of the incident edge stores the node's class (for internal nodes) or text content (for leaf nodes)
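
The pre-order numbering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard-library `xml.etree.ElementTree` parser, the helper name `preorder` is made up for this example, attributes are visited before text and child elements, and mixed content (text after a child element) is ignored.

```python
import xml.etree.ElementTree as ET

DOC = """<course>
  <name>CS 501</name>
  <instructor>Ron Charles</instructor>
  <students>
    <student name="Alice">
      <a1>78</a1><a2>86</a2><midterm>91</midterm><project>87</project>
    </student>
    <student name="Bob">
      <a1>69</a1><a2>71</a2><midterm>82</midterm><finalexam>60</finalexam>
    </student>
  </students>
</course>"""


def preorder(elem, out):
    """Append (kind, label) pairs for elem's subtree in pre-order."""
    out.append(("element", elem.tag))
    # Attributes are internal nodes; their values are leaves.
    for name, value in elem.attrib.items():
        out.append(("attribute", "@" + name))
        out.append(("text", value))
    # Text content directly inside the element is a leaf node.
    text = (elem.text or "").strip()
    if text:
        out.append(("text", text))
    for child in elem:
        preorder(child, out)
    return out


nodes = preorder(ET.fromstring(DOC), [])
for ordinal, (kind, label) in enumerate(nodes, start=1):
    print(f"o{ordinal}", kind, label)
```

Each printed line pairs an ordinal (`o1`, `o2`, ...) with the node it identifies, so the root `course` element receives `o1` and the last text leaf receives the highest ordinal.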

  6. Example: XML Document Tree
     Path expression: /course
       Textual representation: <course></course>
     Path expression: /course/name
       Textual representation: <course> <name></name> </course>
     Path expression: /course/name/text()
       Textual representation: <course> <name>CS 501</name> </course>
     Document tree (pre-order ordinals):
       o1 course
         o2 name
           o3 "CS 501"
         o4 instructor
           o5 "Ron Charles"
         o6 students
           o7 student
             o8 @name
               o9 "Alice"
             o10 a1
               o11 "78"
             o12 a2
               o13 "86"
             o14 midterm
               o15 "91"
             o16 project
               o17 "87"
           o18 student
             o19 @name
               o20 "Bob"
             o21 a1
               o22 "69"
             o23 a2
               o24 "71"
             o25 midterm
               o26 "82"
             o27 finalexam
               o28 "60"

  7. XML Technologies
     Query Languages
     • XPath allows tree nodes to be selected based on their position and/or value
       • /course/name/text()
       • //student[@name="Alice"]
     • XQuery: a higher-level, declarative query language that incorporates XPath
     Parsing APIs
     • Document Object Model (DOM): produces an in-memory representation of the document tree
       • Efficient traversal
       • Large memory consumption
     • Simple API for XML (SAX): event-based, depth-first parsing of the document
       • Much smaller memory consumption
       • Some navigation operations become expensive (allows serial access only)
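
As a rough illustration of the two parsing styles, the sketch below answers similar questions with an in-memory tree and with SAX events. It assumes Python's standard library: `xml.etree.ElementTree` stands in for a DOM-style tree (it supports only a subset of XPath), and the `MidtermHandler` class is a hypothetical name introduced for this example.

```python
import xml.etree.ElementTree as ET
import xml.sax

DOC = """<course><name>CS 501</name><students>
<student name="Alice"><midterm>91</midterm></student>
<student name="Bob"><midterm>82</midterm></student>
</students></course>"""

# Tree-based access: the whole document is held in memory, so
# navigation is cheap once parsing has finished.
root = ET.fromstring(DOC)
course_name = root.findtext("name")               # /course/name/text()
alice = root.findall(".//student[@name='Alice']")  # //student[@name="Alice"]
alice_midterm = alice[0].findtext("midterm")


class MidtermHandler(xml.sax.ContentHandler):
    """Event-based access: collect midterm marks in one serial pass."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.marks = []

    def startElement(self, name, attrs):
        if name == "midterm":
            self.inside = True
            self.marks.append("")

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.inside:
            self.marks[-1] += content

    def endElement(self, name):
        if name == "midterm":
            self.inside = False


handler = MidtermHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
print(course_name, alice_midterm, handler.marks)
```

The SAX handler never sees more than one event at a time, which keeps memory use low but means that any query needing to look backwards has to keep its own state.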

  8. XML-Conscious Compression Schemes

  9. Taxonomy
     compressors
     • generic text
       • bzip2
       • gzip
       • PPM variants
       • Etc.
     • XML-conscious
       • non-queryable
         • AXECHOP
         • DTDPPM
         • SCMPPM
         • XComp
         • XMill
         • XMLPPM
         • XWRT
       • queryable
         • sequential access
           • BPLEX
           • TREECHOP
           • XGRIND
           • XPRESS
         • random access
           • XQueC
           • XSeq

  10. Other Classification Criteria
      • Schema-aware or schema-oblivious?
        • Information in schema documents can allow document structure to be encoded more succinctly
        • Some schema languages (e.g., XML Schema) specify data types; this knowledge can be used to guide selection of compression schemes
        • Limited applicability: not all documents have an associated schema document
      • Online vs. offline operation
        • Can decompression be carried out incrementally?
      • Compression paradigm used
        • Homomorphic or permutation-based?

  11. Schema-aware Compression: Example
      <!ELEMENT course (name,instructor,students)>
      • Indicates that each <course> element must have <name>, <instructor>, and <students> elements as children (in that order): no unpredictability
      • If encoder and decoder both possess the DTD: 0 bits are needed to represent the structure of <course> elements!
      <!ELEMENT student (a1,a2,midterm,(finalexam|project))>
      • Similarly, only 1 bit is needed to indicate whether <finalexam> or <project> appears as the fourth child of <student>
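
The bit counts on this slide can be reproduced with a small calculation. The sketch below is an assumption-laden illustration rather than any compressor's actual algorithm: content models are written as hypothetical nested tuples, only fixed sequences and choices are handled (no `*`, `+`, or `?`), and a choice among k alternatives is charged ceil(log2 k) bits.

```python
from math import ceil, log2


def structure_bits(model):
    """Upper bound on the bits needed to say which children appear."""
    if isinstance(model, str):
        return 0  # a required element name carries no information
    kind, parts = model
    if kind == "seq":
        # A fixed sequence costs only what its parts cost.
        return sum(structure_bits(p) for p in parts)
    if kind == "choice":
        # Pick one of k alternatives, then pay for the costliest one.
        return ceil(log2(len(parts))) + max(structure_bits(p) for p in parts)
    raise ValueError(f"unknown content model kind: {kind}")


# <!ELEMENT course (name,instructor,students)>
COURSE = ("seq", ["name", "instructor", "students"])
# <!ELEMENT student (a1,a2,midterm,(finalexam|project))>
STUDENT = ("seq", ["a1", "a2", "midterm", ("choice", ["finalexam", "project"])])

print(structure_bits(COURSE), structure_bits(STUDENT))
```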

  12. Other Classification Criteria: Intended Application Domains
      • Archiving: volumes of data are compressed to preserve disk space, not accessed frequently
        • E.g., Web server logs
        • Priorities: compression ratio, compression speed
      • Data exchange: for data transferred over a network, the key goals are to improve throughput and reduce bandwidth consumption
        • E.g., web services, instant messaging
        • Priorities: compression/decompression speed, online operation
      • Database/IR applications: declarative queries are issued over XML documents; compression can improve query performance by reducing the number of I/O operations
        • E.g., scientific databases
        • Priorities: queryable compression with random access, decompression speed

  13. Permutation-based Approaches
      Document is rearranged to localize repetitions before passing to back-end compressor(s)
      • Data segments are grouped into different containers, typically based on the identity of the parent element
      • Tag structure ("skeleton") and data segments are compressed separately
      Pipeline: XML document → shredder → skeleton → structure compressor; shredder → data containers → data compressor
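
A minimal sketch of the shredding step, assuming Python's standard `xml.etree.ElementTree` and `zlib` modules as stand-ins for the parser and back-end compressor. The function name `shred`, the skeleton token format, and the use of the full parent path as container key are choices made for this illustration only.

```python
import xml.etree.ElementTree as ET
import zlib
from collections import defaultdict

DOC = """<course><students>
<student name="Alice"><a1>78</a1><midterm>91</midterm></student>
<student name="Bob"><a1>69</a1><midterm>82</midterm></student>
</students></course>"""


def shred(elem, path, skeleton, containers):
    """Split elem's subtree into skeleton tokens and per-path containers."""
    path = f"{path}/{elem.tag}"
    skeleton.append(f"<{elem.tag}")
    for name, value in elem.attrib.items():
        skeleton.append(f"@{name}")
        containers[f"{path}/@{name}"].append(value)
    text = (elem.text or "").strip()
    if text:
        skeleton.append("#")  # placeholder: value lives in a container
        containers[path].append(text)
    for child in elem:
        shred(child, path, skeleton, containers)
    skeleton.append(">")


skeleton, containers = [], defaultdict(list)
shred(ET.fromstring(DOC), "", skeleton, containers)

# Skeleton and containers are compressed separately, so similar
# values (all a1 marks, all names) sit next to each other.
compressed_skeleton = zlib.compress(" ".join(skeleton).encode())
compressed_containers = {
    path: zlib.compress("\n".join(values).encode())
    for path, values in containers.items()
}
print(sorted(containers))
```

Grouping by path places all `a1` marks in one container and all names in another, which is what gives the back-end compressor more locality than the original document order.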

  14. Homomorphic Approaches
      • Each XML token is compressed individually, "in-place"
      • Compression process maintains the structure of the original document
      • Poorer compression, but easier to query than permutation-based approaches (less fragmentation)

  15. Non-Queryable Compressors

  16. XMill (Liefke & Suciu, 2000)
      • Introduced the idea of separately compressing document structure and data, and container grouping of text segments
      • gzip is used as back-end compressor
      • Data-centric XML: often beats gzip's compression of medium- to large-sized XML documents (> 20 KB) by 35%-60%
      • Document-centric XML: little to no improvement over gzip's compression rate
      • Compression/decompression speed comparable to gzip's

  17. XMill
      Compression strategy is based on 3 principles:
      • Separation of structure from data
        • Start tags, attributes are assigned a binary codeword
      • Container-based storage of data segments, using path-based partitioning by default
        • Custom partitioning policies can be defined using a container expression language
      • Semantic compressors may optionally be applied to each container
        • E.g., differential encoder for numeric data; specialized compressors for handling specific formats like dates and URLs
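
To make the idea of a semantic compressor concrete, here is a sketch of a differential (delta) encoder for a numeric container. It is not XMill's implementation; the function names and the sample values are invented for this example, and the values are assumed to be integers.

```python
def delta_encode(values):
    """Replace each integer by its difference from the previous one."""
    previous = 0
    deltas = []
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by keeping a running sum."""
    total = 0
    values = []
    for delta in deltas:
        total += delta
        values.append(total)
    return values


# Hypothetical container of slowly increasing timestamps.
timestamps = [1214300000, 1214300007, 1214300012, 1214300030]
deltas = delta_encode(timestamps)
print(deltas)
```

After the first entry the deltas are small numbers, which a back-end compressor such as gzip can usually represent more compactly than the original ten-digit values.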

  18. XMill Example
      <a>
        <b> <c>text 1</c> </b>
        <b> <c>text 1</c> </b>
        <c>text 3</c>
        <d>text 4</d>
      </a>
      Figure: the document is split into parts, each part is passed to its own gzip instance (four in total), and the outputs form a single compressed file.

  19. XMill: Related Approaches
      • XComp (Li, 2003): groups data values into containers based on <label, level, node_type>, then applies gzip to containers and structural summary; little (<2%) to no improvement over XMill's compression rate or time
      • AXECHOP (Leighton, 2005a): applies a grammar-based scheme (MPM) to compress the structural summary, uses bzlib for container compression; outperforms XBMill on most documents

  20. XMLPPM (Cheney, 2001)
      • Compression process is centered around two concepts:
        • Encoded SAX parsing (ESAX): each SAX event is replaced with a more succinct encoding
        • Multiplexed hierarchical modeling (MHM): separate PPM models are maintained for elements, attributes, textual content, and characters
      • Additional symbols are injected into models to preserve the context formed by the original tag hierarchy
      • A single arithmetic coder is shared between all models
      • Often compresses 15-35% better than XMill; main drawback is slow operation

  21. XMLPPM: Related Approaches
      • SCMPPM (Adiego et al, 2004): maintains a separate model for every distinct element/attribute; often achieves better compression than XMLPPM
      • DTDPPM (Cheney, 2005): consults the DTD to increase accuracy of symbol prediction; only effective on small (no more than a few MB) and highly structured documents

  22. XWRT (Skibiński et al, 2007)
      • The "XML Word-Replacing Transform"
      • Pre-processes the document in hopes of boosting compression performance of the selected back-end scheme (LZ77 or LZMA)
      • XWRT + LZMA often outperforms XMLPPM and SCMPPM in compression ratio, while offering faster performance!
      Pipeline: XML document → XWRT → pre-processed document → LZ77/LZMA → compressed file

  23. XWRT: Pre-processing Techniques
      • Dynamic dictionary of frequently used words
      • Grouping of data values into containers, based on the name of the encapsulating element/attribute
      • Use of additional containers to encode some types of data with a predictable format
        • Numbers, dates, times
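
The word-replacing idea can be illustrated with a toy transform. This is a sketch under assumptions, not XWRT's actual encoding: the dictionary is built in a separate pass rather than dynamically, words are replaced by integer indices instead of byte codes, and the thresholds `min_count` and `min_length` are arbitrary values chosen for this example.

```python
import re
from collections import Counter

# Group 1 matches alphabetic words; everything else is kept verbatim.
TOKEN = re.compile(r"([A-Za-z]+)|[^A-Za-z]+")


def tokens(text):
    """Split text into (token, is_word) pairs that cover it exactly."""
    return [(m.group(0), m.group(1) is not None) for m in TOKEN.finditer(text)]


def build_dictionary(text, min_count=2, min_length=3):
    """Map each frequently occurring word to a small integer index."""
    counts = Counter(
        tok for tok, is_word in tokens(text) if is_word and len(tok) >= min_length
    )
    frequent = [word for word, n in counts.most_common() if n >= min_count]
    return {word: index for index, word in enumerate(frequent)}


def encode(text, dictionary):
    """Replace dictionary words by their index; keep other tokens as text."""
    return [
        dictionary.get(tok, tok) if is_word else tok
        for tok, is_word in tokens(text)
    ]


def decode(encoded, dictionary):
    """Invert encode by mapping indices back to words."""
    words = {index: word for word, index in dictionary.items()}
    return "".join(words[t] if isinstance(t, int) else t for t in encoded)


TEXT = ('<student name="Alice"><midterm>91</midterm></student>'
        '<student name="Bob"><midterm>82</midterm></student>')
dictionary = build_dictionary(TEXT)
encoded = encode(TEXT, dictionary)
print(dictionary)
```

Because XML repeats tag and attribute names many times, most of the markup collapses to short indices before the general-purpose back-end compressor ever sees it.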

  24. Non-Queryable Compressors: Summary

  25. Queryable Compressors

  26. XGRIND (Tolani & Haritsa, 2002)
      • Encodes elements and attributes using XMill's approach
      • DTD-conscious: enumerated attributes with k possible values are encoded using a ⌈log2 k⌉-bit scheme
      • Data values are encoded using non-adaptive Huffman coding
        • Requires two passes over the input document
        • Separate statistical model for each element/attribute
      • Homomorphic compression: compressed document retains original structure

  27. XGRIND
      Original fragment:
        <student name="Alice">
          <a1>78</a1>
          <a2>86</a2>
          <midterm>91</midterm>
          <project>87</project>
        </student>
      Compressed fragment:
        T0 A0 nahuff(Alice) T1 nahuff(78) / T2 nahuff(86) / T3 nahuff(91) / T4 nahuff(87) / /

  28. XGRIND
      • Many queries can be carried out entirely in the compressed domain
        • Exact-match, prefix-match
      • Some others require only decompression of relevant values
        • Range, substring
      • Queryability comes at the expense of achievable compression ratio: typically 65-75% of that achieved by XMill
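
The reason exact-match and prefix-match work without decompression is that a non-adaptive (static) Huffman code maps each character to a fixed, prefix-free bit string, so the query constant can be encoded once and compared against the stored bits. The sketch below illustrates this with a per-character code built from a first pass over the values; the function names and the bit-string representation are assumptions made for readability, not XGRIND's storage format.

```python
import heapq
from collections import Counter
from itertools import count


def build_code(values):
    """First pass: build a static Huffman code over all characters."""
    freq = Counter("".join(values))
    tie = count()  # unique tie-breaker so dicts are never compared
    heap = [(n, next(tie), {ch: ""}) for ch, n in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {ch: "0" for ch in heap[0][2]}
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]


def encode(value, code):
    """Second pass: concatenate the fixed codeword of each character."""
    return "".join(code[ch] for ch in value)


names = ["Alice", "Bob", "Alan"]
code = build_code(names)
compressed = [encode(name, code) for name in names]

# Exact match: compare compressed bits against the compressed constant.
target = encode("Alice", code)
exact_hits = [i for i, bits in enumerate(compressed) if bits == target]

# Prefix match: a compressed prefix is a prefix of the compressed value.
prefix = encode("Al", code)
prefix_hits = [i for i, bits in enumerate(compressed) if bits.startswith(prefix)]
print(exact_hits, prefix_hits)
```

Range and substring predicates do not survive this encoding, since Huffman codewords preserve neither ordering nor character alignment, which is why those queries need the relevant values decompressed.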

  29. XPRESS (Min et al, 2003)
      • Like XGRIND, features a homomorphic compression process requiring two passes over the document
      • Defines a process called reverse arithmetic encoding for representing each distinct root-to-node label path as an interval in [0.0, 1.0)
      • For paths P1, P2 with intervals I1, I2: if I1 contains I2, then path P1 is a suffix of path P2
      • This feature facilitates execution of ancestor-descendant queries in the compressed domain
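
A small sketch may help show why interval containment corresponds to the suffix relation. It assumes the usual formulation of reverse arithmetic encoding: each tag owns a sub-interval of [0, 1) proportional to its frequency, and the interval of a path is the last tag's interval narrowed in proportion to the interval of the path's prefix. The function names, the alphabetical tag ordering, and the use of exact fractions are choices made for this illustration.

```python
from fractions import Fraction


def tag_intervals(freq):
    """Partition [0, 1) among tags in proportion to their frequency."""
    total = sum(freq.values())
    low = Fraction(0)
    intervals = {}
    for tag in sorted(freq):
        high = low + Fraction(freq[tag], total)
        intervals[tag] = (low, high)
        low = high
    return intervals


def path_interval(path, tags):
    """Interval of a label path: the last tag's interval, narrowed by
    the interval of everything that precedes it."""
    low, high = tags[path[0]]
    for tag in path[1:]:
        t_low, t_high = tags[tag]
        width = t_high - t_low
        low, high = t_low + width * low, t_low + width * high
    return low, high


def contains(outer, inner):
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# Tag frequencies of the course document used earlier in the slides.
FREQ = {"course": 1, "name": 1, "instructor": 1, "students": 1, "student": 2,
        "a1": 2, "a2": 2, "midterm": 2, "project": 1, "finalexam": 1}
tags = tag_intervals(FREQ)

full = path_interval(["course", "students", "student"], tags)
suffix = path_interval(["students", "student"], tags)
other = path_interval(["course", "name"], tags)
print(contains(suffix, full), contains(suffix, other))
```

A query such as //students/student then reduces to one containment test per element, without decoding any tag names.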

  30. XPRESS
      • Attempts to automatically infer types of data values, and then applies a type-specific compression scheme
        • Numeric data: differential encoding scheme
        • Textual data: 1-byte dictionary encoding scheme (if < 128 distinct values), Huffman otherwise
      • Queries supported in the compressed domain:
        • Numeric data: exact match, range queries
        • Textual data: exact match
      • Tends to compress better than XGRIND, while also operating 2-3 times faster on typical XPath queries
      • XMill's compression rate tends to be ~20-25% better

  31. XSeq (Lin et al, 2005)
      • First phase of compression groups tags and data segments into structure and data value containers
      • The contents of each container are then compressed using a grammar-based scheme (SEQUITUR)
      • Several indices are constructed to enable efficient, random-access querying over compressed containers
      • Supports exact-match, range queries in the compressed domain
      • Compression rate: comparable to gzip, as much as 30% better than XGRIND's

  32. XSeq: Indices
      • File header: consists of
        • a list of pointers to the structure container index and to each data value container index
        • a table recording mappings between tags and the substitution codes used in the structure container
      • Structure container index and data value container indices: record information about the SEQUITUR grammar generated from the document's tag sequence (resp., from the contents of an individual data value container)
        • number of grammar rules
        • number of symbols in each rule's RHS
        • for each rule, occurrence counts of terminal symbols in the RHS

  33. XSeq: Query Processing
      Example: /course/students/student[@name="Alice"]
      • Determine token sequences relevant for the query by consulting the mapping table stored in the file header
      • Consult the structure index to determine which grammar rules contain relevant tokens
      • Expand out relevant rules and extract data values from appropriate data value containers

  34. TREECHOP (Leighton et al, 2005b)
      • A queryable compressor intended mainly for data exchange applications
        • Recipient receives a stream of XML data, wants to selectively process certain nodes
        • Avoids the necessity of decompressing the entire data stream in memory to extract values
      • Single-pass compression process: tree nodes are encoded adaptively during depth-first traversal of the document tree, and passed to a back-end gzip compressor
      • Achieves compression rates comparable to gzip, while allowing selection of random nodes; compression/decompression speed is slightly less than gzip's

  35. TREECHOP
      • Non-leaf nodes are encoded as a binary code with three constituent parts:
        • the codeword of the parent node p
        • a variable-length Golomb codeword recording the relative position of this node w.r.t. p
        • a fixed-length code indicating the node type (element, attribute, comment, CDATA, or processing instruction)
      • For first occurrence only, the node text (e.g., element name) is appended immediately after the codeword
      • Leaf nodes are written as raw text sequences
      • Queries supported in the compressed domain: exact match and range
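
For readers unfamiliar with Golomb codes, the sketch below encodes a non-negative integer (such as a child position) as a unary quotient followed by a truncated-binary remainder. It shows the general coding technique only; the parameter `m`, the convention of writing the quotient as ones terminated by a zero, and the function names are assumptions, and nothing here reflects how TREECHOP chooses its parameter or lays out its codewords.

```python
def to_bits(value, width):
    """Binary representation of value, left-padded to width bits."""
    return format(value, "b").zfill(width) if width else ""


def golomb_encode(n, m):
    """Golomb codeword of non-negative integer n with parameter m >= 1."""
    if n < 0 or m < 1:
        raise ValueError("need n >= 0 and m >= 1")
    quotient, remainder = divmod(n, m)
    unary = "1" * quotient + "0"
    # Truncated binary: the first `cutoff` remainders use one bit fewer.
    width = (m - 1).bit_length()  # ceil(log2(m))
    cutoff = (1 << width) - m
    if remainder < cutoff:
        return unary + to_bits(remainder, width - 1)
    return unary + to_bits(remainder + cutoff, width)


for position in range(5):
    print(position, golomb_encode(position, 3))
```

Small values get short codewords, which suits data such as sibling positions where low numbers are expected to dominate.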

  36. XQueC (Arion et al, 2004)
      • Designed to support a large subset of XQuery in the compressed domain
      • Prototype uses either ALM (order-preserving) or Huffman (unordered) to individually compress data values
        • ALM: supports equality/inequality matching, doesn't support prefix-matching
        • Huffman: supports equality and prefix-matching, doesn't support inequality matching
      • A permutation-based strategy is used, in which data values are first assigned to containers according to their parent element's path
      • Containers are later grouped together to share the same compression model if they exhibit high similarity and appear frequently together in query predicates

  37. XQueC
      • Given a workload of typical queries, an attempt is made to determine the optimal container grouping, and the best available compression algorithm to apply to each group, in order to minimize the following costs:
        • Decompression time
        • Storage costs for compressed data and source models
      • In addition to containers, the compression process creates two additional data structures:
        • Structure tree
        • Structure summary

  38. XQueC (Arion et al, 2007): Structure Tree & Structure Summary
      Structure tree (each node is labeled with a pair of integers):
        course [1, 8]
          students [2, 7]
            student [3, 3]
              @name [4, 1]
              a1 [5, 2]
            student [6, 6]
              @name [7, 4]
              a1 [8, 5]
      Structure summary:
        course
          students
            student
              @name
              a1
      Container: a header followed by the data values ("Alice", "Bob")

  39. BPLEX (Busatto et al, 2005)
      • Focuses on improving the compression of XML structure, by searching bottom-up for repeated patterns in the document's minimal DAG
      • A succinct pointer-based representation is used to represent subsequent occurrences of a repeated subgraph
      • DOM operations can be carried out on the compressed skeleton
      • Achieves the smallest queryable representation of XML structure: averages 68% of XMill's compressed size with gzip as back-end compressor! (Maneth et al, 2008)

  40. BPLEX
      • Buneman et al (2003) demonstrate that Core XPath queries can be efficiently evaluated directly on the minimal DAG of a skeleton
      • Figure: a skeleton rooted at a, with repeated b subtrees over c and d leaves, shown next to its minimal DAG; the DAG is 2/3 smaller than the original skeleton
      • In theory, a DAG can be exponentially smaller than the original skeleton; "real world" XML DAGs are often less than 10% of the original document size
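
Computing a minimal DAG amounts to sharing identical subtrees, which can be sketched with a bottom-up hashing pass. The tree encoding as `(label, children)` tuples, the function names, and the example skeleton (a root a over three identical b subtrees) are assumptions made for this illustration; it shows subtree sharing only, not BPLEX's grammar construction.

```python
def minimal_dag(tree, table):
    """Return the id of tree's root in the shared-node table.

    Identical subtrees map to the same (label, child ids) key and
    therefore to the same id, so each distinct subtree is stored once.
    """
    label, children = tree
    key = (label, tuple(minimal_dag(child, table) for child in children))
    return table.setdefault(key, len(table))


def count_nodes(tree):
    """Number of nodes in the uncompressed tree."""
    _, children = tree
    return 1 + sum(count_nodes(child) for child in children)


# Hypothetical skeleton: a(b(c, d), b(c, d), b(c, d)).
b = ("b", [("c", []), ("d", [])])
skeleton = ("a", [b, b, b])

table = {}
root_id = minimal_dag(skeleton, table)
print(count_nodes(skeleton), "tree nodes ->", len(table), "DAG nodes")
```

The table ends up with one entry each for c, d, b, and a, and the root entry refers to the shared b node three times.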

  41. BPLEX
      • DAGs are limited to subtree sharing, meaning they can miss out on repeated patterns occurring in the interior of the skeleton
      • Figure: an example skeleton (nodes a through f) whose repeated pattern lies in its interior; in this example, the DAG is equivalent to the original skeleton, so there is no compression
      • BPLEX generates a straight-line tree grammar (SLT grammar) that is capable of representing repeated, connected subgraphs (the figure marks the pattern's parameters as y1 and y2)
      • SLT grammars are frequently half the size of the equivalent minimal DAG

  42. Queryable Compressors: Summary

  43. Future Directions

  44. XML Updates
      • Impetus for incremental update of existing XML data sets is increasing
        • XML-based office document standards: ODF, OOXML
        • Increased volume of persistent XML data
        • W3C has recently proposed an extension to XQuery for expressing node-level updates
      • So far, most approaches to XML compression have assumed a "read-only" model
      • How amenable are the existing schemes to updates?

  45. Evaluation of XML-Conscious Techniques
      Currently, it is difficult to make definitive statements about the relative effectiveness of different techniques:
      • lack of available implementations
      • no consistent benchmark
        • each approach tends to use its own corpus
      • many queryable compressors aren't tested thoroughly on the full set of supported queries
      • most works provide no theoretical justification; instead they rely on empirical results against a limited corpus

  46. Thank you

  47. References
      • J. Adiego, P. De la Fuente, and G. Navarro. Combining structural and textual contexts for compressing semistructured databases. ENC, 2005.
      • A. Arion, A. Bonifati, I. Manolescu, and A. Pugliese. XQueC: A query-conscious compressed XML database. ACM TOIT 7(2), 2007.
      • P. Buneman, M. Grohe, and C. Koch. Path queries on compressed XML. VLDB, 2003.
      • G. Busatto, M. Lohrey, and S. Maneth. Efficient memory representation of XML documents. DBPL, 2005.
      • J. Cheney. Compressing XML with multiplexed hierarchical PPM models. DCC, 2001.
      • J. Cheney. An empirical evaluation of simple DTD-conscious compression techniques. WebDB, 2005.
      • J. Cheney. Tradeoffs in XML database compression. DCC, 2006.
      • J. Cheng and W. Ng. XQzip: querying compressed XML using structural indexing. EDBT, 2004.
      • G. Leighton, J. Diamond, and T. Müldner. AXECHOP: a grammar-based compressor for XML. DCC, 2005.

  48. References (cont.)
      • G. Leighton, T. Müldner, and J. Diamond. TREECHOP: a tree-based query-able compressor for XML. CWIT, 2005.
      • W. Li. XComp: An XML compression tool. Master's thesis, University of Waterloo, 2003.
      • H. Liefke and D. Suciu. XMill: an efficient compressor for XML data. SIGMOD, 2000.
      • S. Maneth, N. Mihaylov, and S. Sakr. XML tree structure compression. XANTEC, 2008.
      • J. Min, M. Park, and C. Chung. XPRESS: a queriable compression for XML data. SIGMOD, 2003.
      • P. Skibiński, Sz. Grabowski, and J. Swacha. Effective asymmetric XML compression. To appear in: Software: Practice and Experience, 2008.
      • P. Tolani and J. Haritsa. XGRIND: a query-friendly XML compressor. ICDE, 2002.