
XPRESS: A Queriable Compression for XML Data



Presentation Transcript


  1. XPRESS: A Queriable Compression for XML Data • Authors: Jun-Ki Min, Myung-Jae Park, Chin-Wan Chung • Presented by Erhan Durusüt and Burak Çetin

  2. Outline • Motivation • Background on Compression Algorithms • Existing Compressors • Features of XPRESS • Compression Techniques in XPRESS • Experimental Results • Conclusions and Future Work

  3. Motivation

  4. Motivation • XML data is irregular and verbose • To overcome this verbosity, compressors for XML data have been studied • Some XML compressors do not support querying compressed data • Others support querying compressed data, but they blindly encode tags and data values using predefined encoding methods • So direct and efficient evaluation of queries on compressed XML data is required

  5. Background on Compression Algorithms

  6. Compression Techniques • Purpose of compression • Required disk space can be reduced significantly • Network bandwidth is saved • Overall performance of database systems improves: a buffer can hold more information, and the number of disk I/Os is reduced

  7. Classification of Compression Techniques • Lossy compression cannot be used because the data is text • Static: fixed statistics or no statistics at all • Semi-adaptive: two scans, one for statistics and one for compression • Adaptive: statistics gathered dynamically and updated during compression

  8. Classification of Compression Techniques • Static Compression • Dictionary encoding: assigns an integer value to each new word • Example: “the classification of the data” • Encoded: 1 2 3 1 4 • Binary encoding: special types of data can be encoded in binary • Example: the string “8627” • Encoded: the number 8627 • Differential encoding: replaces a data item with a code value that defines its relationship to a specific data item • Example: 1500, 1520, 1600, 1550 • Encoded: 1500, 20, 100, 50
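
The dictionary and differential examples on this slide can be reproduced with a short Python sketch. The function names are illustrative and not taken from the paper. The differential encoder assumes a non-empty list and measures every value against the first one, which is what the slide's numbers (1500, 20, 100, 50) imply.

```python
def dictionary_encode(text):
    """Assign the next integer code to each previously unseen word."""
    codes = {}
    encoded = []
    for word in text.split():
        if word not in codes:
            codes[word] = len(codes) + 1  # codes start at 1, as on the slide
        encoded.append(codes[word])
    return encoded, codes


def differential_encode(values):
    """Keep the first value; store each later value as its offset from it."""
    base = values[0]
    return [base] + [v - base for v in values[1:]]


def differential_decode(encoded):
    """Invert differential_encode."""
    base = encoded[0]
    return [base] + [base + d for d in encoded[1:]]


print(dictionary_encode("the classification of the data")[0])
print(differential_encode([1500, 1520, 1600, 1550]))
```

Both printed lists match the "Encoded" rows of the slide.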

  9. Classification of Compression Techniques • Semi-adaptive Compression • Huffman encoding • Assigns shorter codes to more frequently appearing symbols • Assigns 0 to the left edge and 1 to the right edge of the code tree • Does not preserve order information
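
A minimal Huffman sketch in Python, using only the standard library, shows the two rules above: frequent symbols end up with shorter codes, and left edges are labelled 0 while right edges are labelled 1. The function name and the tie-breaking counter are choices made for this sketch, not details from the presentation.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(text):
    """Build a Huffman code table (symbol -> bit string) for the given text."""
    freq = Counter(text)
    tie = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(f, next(tie), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):       # internal node: (left, right)
            walk(node[0], prefix + "0")   # 0 on the left edge
            walk(node[1], prefix + "1")   # 1 on the right edge
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


codes = huffman_codes("aaaabbc")
print(codes)
```

Comparing the resulting bit strings does not tell you how the original symbols compare, which is why a Huffman-coded value has to be decoded before a range comparison.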

  10. Classification of Compression Techniques • Semi-adaptive Compression • Arithmetic encoding • Symbols are assigned disjoint intervals according to their frequencies • Each successive symbol of a message narrows the interval of the first symbol in accordance with the frequencies of the symbols • Example (figure): the symbols “a”, “b”, “c” partition [0.0, 1.0), and the message “ab” maps to a subinterval of the interval for “a”
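
The interval narrowing can be sketched as follows. The slide's figure did not survive extraction, so the symbol counts used here (a: 2, b: 1, c: 1) are hypothetical, and the function names are illustrative. Exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction


def symbol_intervals(counts):
    """Partition [0, 1) into disjoint subintervals proportional to the counts."""
    total = sum(counts.values())
    intervals, low = {}, Fraction(0)
    for sym, c in counts.items():
        high = low + Fraction(c, total)
        intervals[sym] = (low, high)
        low = high
    return intervals


def arithmetic_interval(message, intervals):
    """Narrow [0, 1) once per symbol, in message order."""
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        width = high - low
        s_low, s_high = intervals[sym]
        low, high = low + width * s_low, low + width * s_high
    return low, high


intervals = symbol_intervals({"a": 2, "b": 1, "c": 1})  # hypothetical counts
print(arithmetic_interval("ab", intervals))
```

With these counts, "a" owns [0, 1/2) and "ab" maps to [1/4, 3/8), which lies inside the interval of its first symbol.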

  11. Existing Compressors

  12. Existing Compressors • XMILL • Separates XML tags and attributes from their data values and groups semantically related data values into containers • XML tags and attributes are compressed by the dictionary encoding method • Choosing the compression algorithm for each container requires human interpretation • Finally, the containers are compressed again by a built-in library called “zlib”

  13. Existing Compressors • XGRIND • Supports querying compressed XML data • Data values are compressed by Huffman or dictionary encoding; tags are compressed by dictionary encoding • Uses the DTD to determine the encoder for data values • A path expression is evaluated by scanning the compressed file; whenever a new tag is found, the path of the current element is compared with the query path expression to decide whether it matches • Evaluating range queries always requires partial decompression of data values

  14. Features of XPRESS

  15. Features of XPRESS • Reverse arithmetic encoding • Existing XML compressors represent each tag by a unique identifier, which handles path expressions inefficiently • XPRESS represents a label path as a distinct interval in [0.0, 1.0) • Path expressions are handled through containment relationships among intervals

  16. Features of XPRESS • Automatic Type Inference • Some XML compressors use predefined encodings, e.g. Huffman or dictionary encoding • However, the efficiency of an encoding depends on the data type • Some compressors require manual interpretation • Hence a type inference engine is required

  17. Features of XPRESS • Application of diverse encoding methods to different types • Each inferred type gets a proper encoding method • numeric: binary encoding. Example: ‘120’, ‘150’, ‘100’, ‘130’ are encoded as ‘20’, ‘50’, ‘0’, ‘30’ • textual: Huffman encoder • enumeration: dictionary encoder • High compression ratio • Less frequent partial decompression
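
In the numeric example, each value appears to be stored as its offset from the minimum value, 100. A hedged Python sketch of that idea follows. The fixed-width big-endian byte layout and the function names are assumptions made for illustration, since the slide shows only the offsets; the sketch also assumes a non-empty list of integers.

```python
def encode_numeric(values):
    """Store each integer as a fixed-width binary offset from the minimum."""
    lo, hi = min(values), max(values)
    # Bytes needed for the largest offset (at least one byte).
    width = max(1, ((hi - lo).bit_length() + 7) // 8)
    payload = b"".join((v - lo).to_bytes(width, "big") for v in values)
    return lo, width, payload


def decode_numeric(lo, width, payload):
    """Invert encode_numeric."""
    return [lo + int.from_bytes(payload[i:i + width], "big")
            for i in range(0, len(payload), width)]


lo, width, payload = encode_numeric([120, 150, 100, 130])
print(lo, width, list(payload))
```

Because every offset is measured from the same minimum, the encoded values keep their numeric order, so a range condition can be translated once and compared against the encoded bytes.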

  18. Features of XPRESS • Semi-adaptive approach • Preliminary scan for statistics • Statistics are not changed during compression • Encoding rules are independent of location

  19. Features of XPRESS • Homomorphic Compression • Preserves the structure of XML data • Efficient extraction

  20. Reverse Arithmetic Encoding • Simple path: a sequence of one or more dot-separated tags t1.t2…tn. Example: the simple path of subsection is book.section.subsection • Label path: if a1.a2…an is the simple path of an element e, then ak.ak+1…an is a label path of e, where 1 <= k <= n. Example: section.subsection is a label path of subsection • Suffix: for two label paths P = pi…pn and Q = pj…pn of e, if i >= j, then P is a suffix of Q

  21. Reverse Arithmetic Encoding • First, partitions the entire interval [0.0, 1.0) into subintervals, one for each distinct element. The size of each subinterval is proportional to the element's frequency. Example: the frequencies of the elements {book, author, title, section, subsection, subtitle} are (0.1, 0.1, 0.1, 0.3, 0.3, 0.1)

  22. Reverse Arithmetic Encoding • Next, encodes the simple path P = p1…pn of an element e into a subinterval [min_e, max_e)

  23. Reverse Arithmetic Encoding • Property 1: Suppose that a simple path P is represented as the interval I; then the intervals of all suffixes of P contain I • Example: the simple path book.section.subsection has interval [0.69, 0.699); the label path section.subsection has interval [0.69, 0.78); the label path subsection has interval [0.6, 0.9) • Implication: the query processor selects the elements whose intervals lie within the interval of the query. For //section/subsection, it chooses intervals within [0.69, 0.78) • Finally, the start tag of an element is replaced by the value of the subinterval
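
The intervals in this example can be reproduced by starting from the interval of the last tag and narrowing it with each preceding tag, working backwards through the path. The Python sketch below is reconstructed from the slide's numbers and is not the paper's pseudocode. The function names are illustrative, and element intervals are assumed to be laid out in the order the elements are listed on slide 21.

```python
from fractions import Fraction

# Element frequencies from slide 21, in the order listed there.
FREQS = {
    "book": Fraction(1, 10), "author": Fraction(1, 10), "title": Fraction(1, 10),
    "section": Fraction(3, 10), "subsection": Fraction(3, 10),
    "subtitle": Fraction(1, 10),
}


def element_intervals(freqs):
    """Partition [0, 1) into one subinterval per element tag."""
    intervals, low = {}, Fraction(0)
    for tag, f in freqs.items():
        intervals[tag] = (low, low + f)
        low += f
    return intervals


def path_interval(path, intervals):
    """Encode a dot-separated path, processing the tags from last to first."""
    tags = path.split(".")
    low, high = intervals[tags[-1]]
    for tag in reversed(tags[:-1]):
        width = high - low
        t_low, t_high = intervals[tag]
        low, high = low + width * t_low, low + width * t_high
    return low, high


def matches(query_path, element_path, intervals):
    """True if the element's interval lies within the query's interval."""
    q_low, q_high = path_interval(query_path, intervals)
    e_low, e_high = path_interval(element_path, intervals)
    return q_low <= e_low and e_high <= q_high


iv = element_intervals(FREQS)
for p in ("subsection", "section.subsection", "book.section.subsection"):
    low, high = path_interval(p, iv)
    print(p, float(low), float(high))
```

Under these assumptions, the three paths yield [0.6, 0.9), [0.69, 0.78) and [0.69, 0.699), and `matches("section.subsection", "book.section.subsection", iv)` holds because the third interval is contained in the second.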

  24. Compression Techniques in XPRESS

  25. Architecture of XPRESS

  26. XML Analyzer • Parses each token in the XML file while keeping track of the current path • If the token is a tag: collects statistics • If the token is a data value: applies type inference

  27. XML Analyzer Algorithm

  28. Applying Arithmetic Encoder • Simply counting the occurrences of each distinct element is problematic • Higher-level tags appear rarely (e.g. the root) • Intervals for long paths shrink too quickly • This requires high-precision numbers • Instead, use a path tree (weighted frequency)

  29. Weighted Frequency • Weighted frequency of a node: the number of its subnodes plus the node itself • Can consume a lot of memory: O(E) • Not efficient to construct

  30. Adjusted Frequency • Adds 1 to the ancestors whenever a new node is met • Requires O(L) space, where L is the max. length of a query • An efficient heuristic

  31. Statistics Collector

  32. Type Inferencing • Determines whether the data is: • Integer • Floating point • Enumeration type • String

  33. Type Inferencing • The engine keeps track of: • inferred_type • min, max • symhash • chars_frequency • The inferred type can change in the process: • from integer to string • from dictionary to string
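
A rough Python sketch of such an engine is given below. Only the four field names come from the slide. The transition rules, the `max_symbols` cut-off, and the helper `parse_number` are assumptions made for illustration and should not be read as the engine described in the paper.

```python
from collections import Counter


def parse_number(text):
    """Return an int or float if the text parses as one, otherwise None."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return None


class TypeInferrer:
    """Tracks the statistics named on the slide for one element's data values."""

    def __init__(self, max_symbols=128):  # hypothetical enumeration cut-off
        self.inferred_type = "integer"    # start with the most specific type
        self.min = None
        self.max = None
        self.symhash = set()              # distinct values seen so far
        self.chars_frequency = Counter()  # character counts, e.g. for Huffman
        self.max_symbols = max_symbols

    def observe(self, value):
        self.symhash.add(value)
        self.chars_frequency.update(value)
        if self.inferred_type in ("integer", "float"):
            number = parse_number(value)
            if number is None:
                self.inferred_type = "enumeration"  # no longer numeric
            else:
                if isinstance(number, float):
                    self.inferred_type = "float"
                self.min = number if self.min is None else min(self.min, number)
                self.max = number if self.max is None else max(self.max, number)
        # Too many distinct values for a dictionary: fall back to string.
        if self.inferred_type == "enumeration" and len(self.symhash) > self.max_symbols:
            self.inferred_type = "string"


engine = TypeInferrer()
for v in ["120", "150", "100", "130"]:
    engine.observe(v)
print(engine.inferred_type, engine.min, engine.max)
```

In this sketch the inferred type only ever moves toward the more general type, which matches the direction of the changes listed on the slide.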

  34. XPRESS: A Queriable Compression for XML Data

  35. XML Encoder • The most significant bit (MSB) is 0 for data and 1 for structure

  36. ARAE • ARAE: Approximated Reverse Arithmetic Encoder • Ensures the MSB of the encoded value is 1 • Truncates the last byte from the float • Truncation does not change the containment relationship • May incur inefficiency if too much is truncated

  37. Encoder Algorithm

  38. Query Processing • If a query is too long, its interval becomes too small • Split the query into intervals with sizes greater than 2^-15 • Look for the sequence of split intervals • Generally the sequence length is 1

  39. Query Processing • Exact matching conditions are encoded • Range queries on numerical values are evaluated directly • Partial decompression is needed for range queries on strings • Huffman and dictionary encoding do not preserve order information

  40. Experimental Results

  41. Experiments • Extensive experiments on real-life data with different characteristics

  42. Compression Ratios

  43. Sample Queries • Different types of queries are run

  44. Query Evaluation Time

  45. Conclusion and Future Work • The novel approach, “Reverse Arithmetic Encoding”, is successful • Superior to XGrind • Future work: support for complex data types, e.g. Uniform Resource Identifier (URI)
