
Managing XML and Semistructured Data

Managing XML and Semistructured Data. Lecture 19: Compressing XML Data. Prof. Dan Suciu. Spring 2001. In this lecture: XML compression, motivation, the XMill approach and results. Resources: XMILL: An Efficient Compressor for XML Data by Liefke and Suciu, in SIGMOD'2001.



Presentation Transcript


  1. Managing XML and Semistructured Data Lecture 19: Compressing XML Data Prof. Dan Suciu Spring 2001

  2. In this lecture • XML Compression • Motivation • XMill approach and results • Resources: XMILL: An Efficient Compressor for XML Data by Liefke and Suciu, in SIGMOD'2001

  3. Compression: The Problem • XML for exchange (space or time) • but XML is verbose • users prefer application-specific formats: • Web Server Logs • EMBL • G2 • is XML doomed to fail?

  4. An Example: Web Server Logs
     ASCII file, 15.9 MB (gzipped 1.6 MB):
     202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)
     XML-ized, it inflates to 24.2 MB (gzipped 2.1 MB):
     <apache:entry>
       <apache:host>202.239.238.16</apache:host>
       <apache:requestLine>GET / HTTP/1.0</apache:requestLine>
       <apache:contentType>text/html</apache:contentType>
       <apache:statusCode>200</apache:statusCode>
       <apache:date>1997/10/01-00:00:02</apache:date>
       <apache:byteCount>4478</apache:byteCount>
       <apache:referer>http://www.net.jp/</apache:referer>
       <apache:userAgent>Mozilla/3.1[ja](I)</apache:userAgent>
     </apache:entry>
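The conversion on this slide can be sketched in a few lines of Python. This is only an illustration, not the tool used to produce the lecture's numbers: the function name `log_line_to_xml` and the `FIELDS` table are assumptions, and the three columns that hold "-" in the sample line are left unnamed because the slide does not show their element names.

```python
from xml.sax.saxutils import escape

# Element names copied from the slide's XML. None marks columns whose
# names the slide does not show (they hold "-" in the sample line).
FIELDS = ["host", "requestLine", "contentType", "statusCode", "date",
          None, "byteCount", None, None, "referer", "userAgent"]

SAMPLE = ("202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02"
          "|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)")


def log_line_to_xml(line):
    """Turn one '|'-separated log line into an <apache:entry> element."""
    parts = line.rstrip("\n").split("|")
    out = ["<apache:entry>"]
    for name, value in zip(FIELDS, parts):
        if name is None or value == "-":
            continue  # skip unnamed or empty columns
        out.append(f"<apache:{name}>{escape(value)}</apache:{name}>")
    out.append("</apache:entry>")
    return "".join(out)


if __name__ == "__main__":
    xml = log_line_to_xml(SAMPLE)
    print(len(SAMPLE), "bytes as a log line;", len(xml), "bytes as XML")
```

Every value is wrapped in an opening and a closing tag, which is where the size inflation on the slide comes from.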

  5. XMill • specialized compressor for XML data • makes XML look “small” • Download: • Now: www.research.att.com/sw/tools/xmill • Soon: www.cs.washington.edu/homes/suciu/XMILL

  6. How XMill Works: Three Ideas
     Compress the structure separately from the data:
     • Structure: <apache:entry> <apache:host> </apache:host> . . . </apache:entry>
     • Data: 202.239.238.16 GET / HTTP/1.0 text/html 200 …
     gzip Structure + gzip Data = 1.75MB
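A minimal Python sketch of this first idea, under stated assumptions: the regex tokenizer, the `#` placeholder, and the function names are illustrative and are not XMill's actual encoding. The sketch drops whitespace around values and does not handle attributes, comments, or CDATA. Python's `zlib` module is used because it implements DEFLATE, the same algorithm gzip uses.

```python
import re
import zlib

# Splits a document into tags and the text between them.
TOKEN = re.compile(r"(<[^>]+>)")


def split_structure_and_data(xml):
    """Return (structure, data): the tags with '#' where a value stood,
    and the values themselves, one per line, in document order."""
    structure, data = [], []
    for tok in TOKEN.split(xml):
        if not tok.strip():
            continue  # ignore whitespace between tags
        if tok.startswith("<"):
            structure.append(tok)
        else:
            structure.append("#")  # placeholder: a value goes here
            data.append(tok.strip())
    return "".join(structure), "\n".join(data)


def compressed_size(xml):
    """Total bytes after compressing the two streams independently."""
    structure, data = split_structure_and_data(xml)
    return (len(zlib.compress(structure.encode("utf-8")))
            + len(zlib.compress(data.encode("utf-8"))))
```

On a large log the structure stream is extremely repetitive, so it compresses to very little, and the compressor working on the data stream is no longer interrupted by tags.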

  7. How XMill Works: Three Ideas
     Group the data values according to their types:
     • Structure: <apache:entry> . . . </apache:entry>
     • Data1: 202.23.23.16 224.42.24.55 …
     • Data2: GET / HTTP/1.0 GET / HTTP/1.1 …
     gzip Structure + gzip Data1 + gzip Data2 + ... = 1.33MB
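The grouping step might look like the following Python sketch. It assumes that a value's "type" is simply the name of its enclosing element, which is a simplification chosen for illustration; the names `group_by_tag` and `compress_containers` are hypothetical, and the tokenizer ignores attributes and self-closing tags.

```python
import re
import zlib
from collections import defaultdict

TOKEN = re.compile(r"(<[^>]+>)")


def group_by_tag(xml):
    """Collect text values into one container per enclosing element name."""
    containers = defaultdict(list)
    stack = []  # names of the currently open elements
    for tok in TOKEN.split(xml):
        if not tok.strip():
            continue
        if tok.startswith("</"):
            stack.pop()
        elif tok.startswith("<"):
            stack.append(tok[1:-1])
        else:
            containers[stack[-1]].append(tok.strip())
    return dict(containers)


def compress_containers(containers):
    """Compress each container on its own, so similar values sit together."""
    return {tag: zlib.compress("\n".join(values).encode("utf-8"))
            for tag, values in containers.items()}
```

Putting all hosts in one container and all request lines in another gives the compressor long runs of similar strings, which is the effect behind the smaller total on the slide.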

  8. How XMill Works: Three Ideas
     Apply semantic (specialized) compressors:
     gzip Structure + gzip c1(Data1) + gzip c2(Data2) + ... = 0.82MB
     • Examples:
       • 8, 16, 32-bit integer encoding (signed/unsigned)
       • differential compression (e.g. 1999, 1995, 2001, 2000, 1995, ...)
       • compress lists, records (e.g. 104.32.23.1 → 4 bytes)
     • Need user input to select the semantic compressor
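Two of the examples above can be sketched in Python: packing a dotted IPv4 address into 4 bytes and differential (delta) encoding of integers. These are illustrative stand-ins, not XMill's built-in compressors; the function names are hypothetical, and `pack_deltas` assumes every difference after the first value fits in a signed byte.

```python
import ipaddress
import struct
from itertools import accumulate


def ip_to_bytes(text):
    """Encode a dotted IPv4 address as its 4-byte binary form."""
    return ipaddress.IPv4Address(text).packed


def delta_encode(values):
    """Keep the first value, then store each value minus its predecessor."""
    prev, out = 0, []
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas):
    """Inverse of delta_encode: a running sum restores the values."""
    return list(accumulate(deltas))


def pack_deltas(deltas):
    """First value as a 32-bit int, the rest as signed bytes.
    Assumption: differences lie in -128..127, else struct.error is raised."""
    if not deltas:
        return b""
    return struct.pack(f"<i{len(deltas) - 1}b", *deltas)


if __name__ == "__main__":
    years = [1999, 1995, 2001, 2000, 1995]
    print(delta_encode(years))
    print(len(pack_deltas(delta_encode(years))), "bytes")
```

The output of such an encoder would then be passed to gzip, as in the formula on the slide. A generic compressor cannot know that a column holds addresses or years, which is why the slide says user input is needed to pick the semantic compressor.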

  9. XML Compression

  10. Compression Tradeoff

  11. Summary of XML Data Management • XML = • old data type (trees) • with new interpretation (data) • We discussed traditional management techniques for XML: • Data model • Query language • Optimizations • ... • Many traditional problems still unsolved (storage, processing, optimization, ...)

  12. Summary of XML Data Management • More interesting question: • what are the novel applications enabled by XML? Some ideas: • Approximate queries over unfamiliar data instances • “Search the database for a pattern similar to this one” • Rank results based on their similarity to the pattern • What is an appropriate query language for that? • Linking independent databases • We have XLink; how do we use it?
