Explore how CZIP compression boosts Content-Based Naming (CBN) systems, improving data savings and server performance. This research showcases benefits, deployment scenarios, compressibility, and potential data savings. Discover the impact on web content, VM images, and server consolidation.
Supporting Content-Addressable Caching with CZIP Compression
KyoungSoo Park, Sunghwan Ihm, Mic Bowman* and Vivek Pai
Princeton University, *Intel Research
Content-Based Naming (CBN)
• Naming scheme based on the content itself
• Name = one-way hash (content)
• Hashing function: MD5, SHA-1, etc.
• Rabin’s fingerprint for chunk detection
• Redundancy elimination
• Network-traffic/storage systems
• Research/commercial systems
• Special-purpose systems
USENIX 2007
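The naming scheme above can be sketched in a few lines. This is a minimal illustration, not the CZIP implementation: it uses a simple polynomial rolling hash as a stand-in for Rabin's fingerprint, and the names `content_defined_chunks`, `chunk_name`, `window` and `mask` are hypothetical choices made for this example.

```python
import hashlib

BASE = 257
MOD = (1 << 31) - 1


def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0xFF):
    """Cut `data` wherever a rolling hash of the last `window` bytes
    matches a fixed bit pattern (stand-in for Rabin's fingerprint)."""
    chunks = []
    start = 0
    h = 0
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i >= window:
            # Drop the oldest byte before shifting in the new one.
            h = (h - data[i - window] * top) % MOD
        h = (h * BASE + b) % MOD
        # Require at least `window` bytes per chunk, then test the pattern.
        if i + 1 - start >= window and (h & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk
    return chunks


def chunk_name(chunk: bytes) -> str:
    """Name = one-way hash (content); SHA-1 is one of the options listed."""
    return hashlib.sha1(chunk).hexdigest()
```

Because a boundary depends on the bytes inside the window rather than on the absolute position, inserting data near the start of a file tends to disturb only the nearby chunks, so later chunks keep their names.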
Where Can CBN Be Applied?
• Similar file distribution
• Linux distribution mirror
• DVD ISO contains all CD ISOs
• Virtual machine image migration
• Base OS takes up majority of content
• httpd VM vs. httpd+mysqld VM
• Uncacheable Web content
• Some dynamic content doesn’t change
Contribution of This Work
• Generic CBN tool
• Easy to build new systems
• Easy to upgrade existing non-CBN systems
• CZIP compression + CZIP-aware apps
• Can be used on existing platforms
• Provides benefit to non-CZIP apps
• Demonstrate sample systems
• Reduces FC6 mirror memory footprint by half
• Compression speed comparable to GZIP’s
• 2x throughput for CZIP-aware Apache
• 4x origin server bandwidth reduction for CZIP-aware CDN
CZIP Compression
• Compression scheme like GZIP, BZIP2
• Export CBN information in the header
[Figure: CZIP turns a file of five chunks, some of them repeated (A, B, C), into a CZIP header (global fields plus chunk indexes 1 to 5) followed by the chunk data; UNCZIP reverses the process.]
CZIP Header
• Header = global attributes + chunk info
• Global attributes
• One-way hash function (SHA-1/MD5)
• Chunk data compression (GZIP/BZIP2)
• Convergent encryption (on/off)
• Header CRC, File Hash, etc.
• Chunk information
• Content hash, start offset, chunk size
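The header fields listed above map naturally onto a small record type. The sketch below only illustrates that layout; the class and field names are hypothetical, the on-disk encoding and header CRC are not modeled, and it assumes that start offsets refer to positions in the uncompressed file.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChunkInfo:
    content_hash: str   # one-way hash of the chunk content
    start_offset: int   # assumed: offset within the uncompressed file
    chunk_size: int     # chunk length in bytes


@dataclass
class CzipHeader:
    # Global attributes named on the slide.
    hash_function: str = "sha1"        # SHA-1 or MD5
    chunk_compression: str = "gzip"    # GZIP or BZIP2
    convergent_encryption: bool = False
    file_hash: str = ""
    # Per-chunk information, one entry per chunk in file order.
    chunks: List[ChunkInfo] = field(default_factory=list)


def build_header(chunks: List[bytes]) -> CzipHeader:
    """Fill in chunk info and the whole-file hash for a list of chunks."""
    header = CzipHeader()
    whole = hashlib.sha1()
    offset = 0
    for c in chunks:
        header.chunks.append(
            ChunkInfo(hashlib.sha1(c).hexdigest(), offset, len(c)))
        whole.update(c)
        offset += len(c)
    header.file_hash = whole.hexdigest()
    return header
```

Two index entries with the same `content_hash` describe identical content, which is what lets a reader of the header detect redundancy without touching the chunk data.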
Deployment Scenario
• CZIP-aware server
[Figure: Client A requests file1.cz and Client B requests file2.cz. The server reads each file's header and then the chunks it lists. file1.cz holds Chunk A (xyzlo5g) and Chunk B (asdfghk); file2.cz holds Chunk A, Chunk B and Chunk C (qoiertty). After serving file1.cz, Chunks A and B are in the server's CBN cache, so for file2.cz the server reads only the header and Chunk C.]
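The server-side scenario amounts to a cache keyed by content hash instead of by file name. A minimal sketch follows, using the chunk names from the slide; `CbnCache`, `serve` and the in-memory `disk` mapping are hypothetical names for this illustration and do not describe the actual server code.

```python
class CbnCache:
    """Cache keyed by content hash, so chunks shared across files are loaded once."""

    def __init__(self):
        self._store = {}
        self.misses = 0  # number of chunk reads that had to go to storage

    def get(self, content_hash, load):
        if content_hash not in self._store:
            self.misses += 1
            self._store[content_hash] = load()
        return self._store[content_hash]


def serve(cache, chunk_hashes, disk):
    """Assemble a file from the chunk hashes listed in its header."""
    return b"".join(cache.get(h, lambda h=h: disk[h]) for h in chunk_hashes)


# Chunk names taken from the slide; the byte contents are made up.
disk = {"xyzlo5g": b"chunk A", "asdfghk": b"chunk B", "qoiertty": b"chunk C"}
cache = CbnCache()
file1 = serve(cache, ["xyzlo5g", "asdfghk"], disk)              # reads A and B
file2 = serve(cache, ["xyzlo5g", "asdfghk", "qoiertty"], disk)  # reads only C
```

After both requests `cache.misses` is 3 rather than 5, which mirrors the "read chunk C" step in the figure.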
Deployment Scenario
• CZIP-aware client-side proxy
[Figure: Client A's request for file1.cz fills the proxy's CBN cache with Chunk A (xyzlo5g) and Chunk B (asdfghk). For Client B's request for file2.cz, the proxy reads the header and fetches only Chunk C (qoiertty) from the server:]
GET /file2.cz
Range: bytes=1000-1999
X-SHA-1: qoiertty
1. X-SHA-1 field helps CZIP-aware server
2. Browser cache can support CBN too!
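The request in this scenario can be derived from a chunk's header entry. The helper below is a hypothetical sketch that produces only the two header fields shown on the slide; it assumes the chunk's start offset and size are known from the CZIP header and that the range end is inclusive, as in standard HTTP byte ranges.

```python
def chunk_request_headers(start_offset: int, chunk_size: int, content_hash: str) -> dict:
    """Build the Range and X-SHA-1 headers for fetching one missing chunk."""
    end = start_offset + chunk_size - 1  # HTTP byte ranges are inclusive
    return {
        "Range": f"bytes={start_offset}-{end}",
        "X-SHA-1": content_hash,
    }


# Chunk C from the slide: 1000 bytes starting at offset 1000.
headers = chunk_request_headers(1000, 1000, "qoiertty")
```

A server that does not understand `X-SHA-1` can still answer from the `Range` header alone, so the extra field is only a hint for CZIP-aware servers.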
Compressibility
• Fedora Core 6 ISOs / All files / Wikipedia DB
[Chart: data compression ratio (0 to 1) for CZIP+plain, CZIP+gzip, CZIP+bzip2, GZIP and BZIP2 on FC6_i386_ISOs.tar (6.7 GB), FC6_All_files.tar (49.7 GB) and Wikipedia_DB.tar (7.9 GB). Bar labels as extracted: 7.9, 6.5, 6.5, 48.3, 48.5, 3.3, 3.2, 3.2, 20.3, 19.9, 19.6, 2.7, 2.5, 2.5, 1.9.]
Compression Speed
• On Pentium D 2.8 GHz with 4 GB memory
[Chart: compression times of 29,004 secs, 3,151 secs and 3,964 secs]
Virtual Machine Images
• Server consolidation/management
• Much redundancy among similar VMs
• Xen FC4 base image (X)
• X + httpd (Y) / Y + mysqld (Z)
• Investigating content overlap over
• Chunk size
• Chunking methods
• Rabin’s fingerprint vs. fixed-sized
• After extensive use
Chunk Size / Chunking Methods
Compare three VM images
Base = Xen FC4 image / Apache = Base + httpd / Both = Apache + mysqld
[Charts: Rabin’s fingerprint; fixed-sized chunking]
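Content overlap between two images can be estimated by chunking both and checking which chunk names they share. The sketch below does this for fixed-sized chunking only; `fixed_chunks` and `overlap_ratio` are hypothetical names, and the toy byte strings merely stand in for VM images.

```python
import hashlib


def fixed_chunks(data: bytes, size: int):
    """Split data into consecutive chunks of `size` bytes (last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def overlap_ratio(base: bytes, other: bytes, size: int) -> float:
    """Fraction of `other`'s bytes that sit in chunks also present in `base`."""
    base_names = {hashlib.sha1(c).digest() for c in fixed_chunks(base, size)}
    shared = sum(len(c) for c in fixed_chunks(other, size)
                 if hashlib.sha1(c).digest() in base_names)
    return shared / len(other) if other else 0.0


base = b"A" * 4096 + b"B" * 4096
extended = base + b"C" * 4096   # content appended: old chunks still line up
shifted = b"x" + base           # one byte inserted at the front

appended_overlap = overlap_ratio(base, extended, 4096)  # two of three chunks shared
shifted_overlap = overlap_ratio(base, shifted, 4096)    # every chunk boundary moved
```

The one-byte shift drops the fixed-size overlap to zero in this toy case, which is the weakness that content-defined boundaries such as Rabin's fingerprint are meant to avoid.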
Real VM Images
EC1 ~ EC5: VMs based on Xen FC-4 + standard tools
Used daily by five different engineers for three weeks
Dynamic Web Pages
• Observed the front page of these sites
• Google News
• CNN
• Slashdot
• Digg.com
• Fark.com
• New York Times
• All of them non-cacheable
• “no-cache”, “no-store” or “private”
Average Content Overlap
Downloaded pages every 10 minutes for 18 days
Potential Data Savings via CZIP
[Chart: savings of 37%, 39%, 61%, 24%, 57% and 90%]
Summary So Far
• CZIP is comparable to GZIP in speed and performance
• CZIP is far better on files with much redundancy
• Redundancy decreases as chunk size increases
• Rabin’s fingerprint exposes a good deal of redundancy regardless of chunk size
• Optimal chunk size varies with workload
• Bigger chunk size is better for network transfer
• Dynamic content also exposes redundancy
• CZIP can save 24-90% of bandwidth compared with GZIP
Server Performance
• CZIP Apache Module
• Test scenario (FC mirror simulation)
• 1.5 GB from FC6 DVD
• 1.5 GB is split into three 0.5 GB images
• Each file is requested in round-robin fashion
• 100-300 clients simulated by six machines in LAN
• Server is 2.8 GHz Pentium D w/ 2 GB physical memory and 2 Gbps-NICs
CZIP Apache Module
• Worst client in CZIP-aware Apache is faster than 91% of normal Apache clients
• 90%: 2.56 times
• Median: 2.07 times
CBN-Aware Content Distribution
• CoBlitz large-file CDN [NSDI’06]
• Serving 1-2 TB every day on PlanetLab
• http://coblitz.codeen.org/URL
• University channel – podcast/vodcast
• Fedora Core mirror, Citeseer etc.
• Chunk is basic caching unit
• Parallel chunk requests/responses
• Chunk request in HTTP byte-range query
Making CoBlitz CZIP-Aware
• CoBlitz’s chunk request
GET /coblitz.codeen.org/www.cs.princeton.edu/bigfile.cz,start=1000,end=1999 HTTP/1.0
Host: coblitz.codeen.org
• CZIP-aware CoBlitz (C-CoBlitz) request
GET /czip.codeen.org/Chunk_SHA-1_Hash HTTP/1.0
Host: czip.codeen.org
X-URL: www.cs.princeton.edu/bigfile.cz
X-Range: byte=1000-1999
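The difference between the two request styles is what the cache key is: the regular request is keyed by URL and byte range, while the C-CoBlitz request is keyed by the chunk's hash and carries the URL and range as extra headers. The two functions below are a hypothetical sketch that simply reproduces the request text from the slide; they are not CoBlitz code.

```python
def coblitz_request(url: str, start: int, end: int) -> str:
    """Regular CoBlitz chunk request: the request path names URL + byte range."""
    return (
        f"GET /coblitz.codeen.org/{url},start={start},end={end} HTTP/1.0\r\n"
        "Host: coblitz.codeen.org\r\n"
        "\r\n"
    )


def c_coblitz_request(chunk_hash: str, url: str, start: int, end: int) -> str:
    """CZIP-aware request: the request path names the chunk's content hash."""
    return (
        f"GET /czip.codeen.org/{chunk_hash} HTTP/1.0\r\n"
        "Host: czip.codeen.org\r\n"
        f"X-URL: {url}\r\n"
        f"X-Range: byte={start}-{end}\r\n"  # header spelling as shown on the slide
        "\r\n"
    )
```

With the hash in the path, two different files that contain the same chunk produce the same request path, so a cache keyed on the path can serve both from a single stored copy.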
CZIP-Aware CoBlitz Testing
• Two content-overlapping files
• Simultaneously fetch from 100 PlanetLab nodes
• Origin server is at Princeton
• Testing cases
• Regular: Download original files by regular CoBlitz
• File-CZIP: Download CZIP’ed files by regular CoBlitz
• CZIP-CDN: Download CZIP’ed files by C-CoBlitz
100 MB File Downloading
[Chart: Regular 388 MB; File-CZIP 273 MB, 29.6%; CZIP-CDN 191 MB, 29.7%]
50 MB File Downloading
[Chart: Regular 183 MB; File-CZIP 92 MB, 49.7%; CZIP-CDN 24 MB, 73.9%]
Conclusion
• CZIP is a generic compression tool providing CBN benefits
• CZIP is comparable to GZIP in compression performance
• CZIP helps greatly reduce memory footprint in serving similar files
• It is very easy to support CZIP and the benefit is transparent
Thank you!
More information can be found at http://codeen.cs.princeton.edu/czip/
CZIP code will be released soon!
200/300 Clients
• 200 clients: 90% 2.27 times; Median 1.95 times; 80%
• 300 clients: 90% 2.11 times; Median 1.84 times; 65%