
SAFE : Structure-Aware File and Email Deduplication for Cloud-based Storage Systems


Presentation Transcript


  1. SAFE : Structure-Aware File and Email Deduplication for Cloud-based Storage Systems Daehee Kim, Sejun Song, Baek-Young Choi University of Missouri-Kansas City

  2. Cloud Storage – Dropbox, Google Drive, … • Network: high network bandwidth consumption • Server: large storage consumption (e.g., remote backup, anywhere, anytime) • Client: high uploading overhead [diagram: employees, sales, marketing, and individual users uploading to cloud storage]

  3. Data deduplication • Deduplication granularity • File-level • Sub-file level • Fixed-size chunk • Variable-size chunk • Deduplication location • Server-based • Traditionally on high-capacity servers • Client-based • Limited by the client's capacity

  4. File-Level Deduplication [diagram: each incoming file's index is looked up in an index table; a unique index is added to the table and the file is stored, while a duplicate index causes the file to be discarded]
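The index-table lookup on this slide can be sketched as follows (a minimal illustration, not the authors' implementation; the `FileDedup` class name and SHA-256 as the index function are assumptions):

```python
import hashlib

class FileDedup:
    """Minimal file-level deduplication: one index (hash) per whole file."""
    def __init__(self):
        self.index_table = {}  # file hash -> stored content

    def store(self, data: bytes) -> bool:
        """Return True if the file was unique and stored, False if duplicate."""
        idx = hashlib.sha256(data).hexdigest()
        if idx in self.index_table:
            return False  # duplicate index: skip storage
        self.index_table[idx] = data  # unique index: store the file
        return True

dedup = FileDedup()
assert dedup.store(b"report v1") is True    # unique -> stored
assert dedup.store(b"report v1") is False   # exact duplicate -> skipped
assert dedup.store(b"report v2") is True    # one byte differs -> stored again
```

Note the weakness the later slides address: a file that differs by a single byte is treated as entirely new.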

  5. Sub-File Level: Fixed-Size Chunk (Fixed-Size Block Deduplication) e.g., granularity: 15-byte fixed-size boundaries • File1: "nice people, good papers, and good conference, …" → chunks "nice people, go" | "od papers, and " | "good conference" | … • File2: "welcome, nice people, good papers, and good conference, …" → chunks "welcome, nice p" | "eople, good pap" | "ers, and good c" | … • Offset-shifting problem: no redundancies found
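The offset-shifting problem on this slide can be demonstrated directly (a sketch using the slide's 15-byte granularity; chunk hashing via SHA-256 is an assumption):

```python
import hashlib

def fixed_chunks(data: bytes, size: int = 15):
    """Split data into fixed-size chunks and hash each one."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

file1 = b"nice people, good papers, and good conference"
file2 = b"welcome, " + file1  # same content, shifted by a 9-byte prefix

c1, c2 = set(fixed_chunks(file1)), set(fixed_chunks(file2))
# The prefix shifts every boundary, so no chunk of file2 matches file1.
assert len(c1 & c2) == 0
```

Because boundaries are fixed offsets, one small insertion invalidates every chunk after it.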

  6. Sub-File Level: Variable-Size Chunk (Variable-Size Block Deduplication) e.g., matching pattern: "go" • File1: "nice people, good papers, and good conference, …" → chunks "nice people, go" | "od papers, and go" | … • File2: "welcome, nice people, good papers, and good conference, …" → chunks "welcome, nice people, go" | "od papers, and go" | … • Boundaries are based on content, not fixed offsets, so shared chunks are still found
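The slide's toy "cut after the pattern 'go'" scheme can be sketched like this (real systems use a rolling hash such as Rabin fingerprinting to pick boundaries; the literal-pattern version below is only the slide's simplified illustration):

```python
import hashlib
import re

def pattern_chunks(data: bytes, pattern: bytes = b"go"):
    """Cut a chunk boundary right after each occurrence of the pattern."""
    chunks, start = [], 0
    for m in re.finditer(re.escape(pattern), data):
        chunks.append(data[start:m.end()])
        start = m.end()
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return [hashlib.sha256(c).hexdigest() for c in chunks]

file1 = b"nice people, good papers, and good conference"
file2 = b"welcome, " + file1  # same content, shifted by a prefix

c1, c2 = set(pattern_chunks(file1)), set(pattern_chunks(file2))
# Boundaries follow the content, so the shifted file still shares chunks.
assert len(c1 & c2) >= 1
```

Only the first chunk (the one containing the inserted prefix) differs; all later chunks realign on the content-defined boundaries.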

  7. Deduplication: Comparisons • Deduplication ratio (higher is better): File-level < Fixed size << Variable size • Processing time (higher is worse): File-level < Fixed size <<<< Variable size • Index overhead (higher is worse): File-level << Fixed size ≈ Variable size • File-level and fixed size are good for client-based deduplication; variable size is good for server-based • Current cloud storage systems (client-based) • JustCloud, Mozy: file-level deduplication • Dropbox: large fixed-size block deduplication (4 MB)

  8. Objective Develop an efficient client-side deduplication that achieves • High deduplication ratio • Low network traffic • Low processing time • Low index overhead

  9. Outline Motivation, Background, and Goal Observations and Approach Design Evaluation Conclusion

  10. Observations • A structured file can be decomposed into various objects • Fast decomposition without the offset-shifting problem • e.g., compressed files (zip, rar, …), document files (pdf, doc, ppt, docx, pptx), emails [Example: an email consists of meta, a body (text), and attachments (text, pdf, docx, images, …); a PDF page (<</Type/Page/…>>) contains image objects (<</Type/… Image/… Filter/… Length>> <stream>encoded image<endstream>) and text objects (<</Filter/… /Length>> <stream>encoded text<endstream>)]
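The decomposition idea is easy to see with Office formats, which are ZIP containers whose members (XML parts, embedded images) are natural deduplication objects. A minimal sketch (the helper names and the two-member toy documents are assumptions, not the SAFE parser):

```python
import hashlib
import io
import zipfile

def container_objects(data: bytes):
    """Decompose a ZIP-based document into (member name, content hash)."""
    objects = {}
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            objects[name] = hashlib.sha256(zf.read(name)).hexdigest()
    return objects

def make_zip(members):
    """Build a tiny ZIP 'document' from a dict of member contents."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in members.items():
            zf.writestr(name, content)
    return buf.getvalue()

# Two documents differing only in their text, sharing an embedded image.
doc1 = make_zip({"word/document.xml": "v1", "media/logo.png": "PNGDATA"})
doc2 = make_zip({"word/document.xml": "v2", "media/logo.png": "PNGDATA"})

o1, o2 = container_objects(doc1), container_objects(doc2)
# The shared image object matches immediately -- no chunking, no shifting.
assert o1["media/logo.png"] == o2["media/logo.png"]
assert o1["word/document.xml"] != o2["word/document.xml"]
```

Object boundaries come from the file format itself, so decomposition is a cheap parse rather than a per-byte rolling-hash scan.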

  11. Observations • A large number of structured files exist in cloud-based storage systems [chart: composition of the dataset]

  12. Our Approach (SAFE) • Apply object-based deduplication to structured files • Decompose a file into objects • Find redundancies based on the decomposed objects • Combine small pieces of metadata into one object (to reduce index sizes) • Apply file-level deduplication to redundant files • Speeds up deduplication and keeps index sizes small
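The two-stage idea on this slide (whole-file check first, object-level check only for non-redundant structured files) can be sketched as follows. This is an assumed simplification: `safe_upload`, the `split` stand-in parser, and the extension-based structured check are illustrative, not the paper's design.

```python
import hashlib

STRUCTURED = {".docx", ".pptx", ".pdf", ".zip"}  # assumed type list

def safe_upload(name, data, decompose, file_index, object_index):
    """Sketch of SAFE's two-stage check; returns objects to upload."""
    fhash = hashlib.sha256(data).hexdigest()
    if fhash in file_index:
        return []  # redundant file: file-level dedup, nothing to upload
    file_index.add(fhash)
    ext = "." + name.rsplit(".", 1)[-1]
    objects = decompose(data) if ext in STRUCTURED else [data]
    to_send = []
    for obj in objects:  # object-level dedup on decomposed objects
        ohash = hashlib.sha256(obj).hexdigest()
        if ohash not in object_index:
            object_index.add(ohash)
            to_send.append(obj)
    return to_send

file_index, object_index = set(), set()
split = lambda d: [d[:4], d[4:]]  # stand-in for a real structure parser
sent1 = safe_upload("a.pdf", b"headBODY", split, file_index, object_index)
sent2 = safe_upload("b.pdf", b"headTAIL", split, file_index, object_index)
assert sent1 == [b"head", b"BODY"]
assert sent2 == [b"TAIL"]   # shared "head" object is not re-sent
assert safe_upload("a.pdf", b"headBODY", split, file_index, object_index) == []
```

The cheap whole-file hash handles exact duplicates without any parsing, which is where the speed and small-index benefits come from.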

  13. Outline Motivation, Background, and Goal Observations and Approach Design Evaluation Conclusion

  14. SAFE Architecture [diagram: files and emails enter the file parser and email parser; a structured file (identified via the Structure Library) is decomposed into objects (meta, pdf, img) and passed to the object manager for object-level deduplication against all object indexes; an unstructured file goes through file-level deduplication, where a redundant file ends processing and a unique file continues; unique object indexes and objects are handed to the store manager]

  15. SAFE in Cloud Storage [diagram: the client runs SAFE, performing file-level and object-level deduplication, and exchanges object indexes with the server; the server identifies the unique objects by their indexes, and only unique objects are transferred]

  16. Outline Motivation, Background, and Goal Observations and Approach Design Evaluation Conclusion

  17. Setup • Compared deduplication schemes • File-level (like JustCloud, Mozy) • Fixed-size block (4 MB, like Dropbox) • Variable-size block (8 KB average chunk size) • Collected real datasets • Structured files (docx, pptx, and pdf) • From the file systems and emails of five graduate students in the same department • File systems: 4 GB; emails: 2.5 GB

  18. Evaluation Metrics • Overhead • Processing time: relative to file-level deduplication • Index size: relative to file-level deduplication • Performance • Deduplication ratio: space savings from removing redundancies, ((InputData – ConsumedStorage) / InputData) × 100 • Network traffic: bytes of data transferred to storage over the network
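The deduplication-ratio formula above works out as in this small sketch (the example figures are illustrative, not the paper's measurements):

```python
def dedup_ratio(input_bytes: float, consumed_bytes: float) -> float:
    """Space savings from removing redundancies, as a percentage:
    ((InputData - ConsumedStorage) / InputData) * 100."""
    return (input_bytes - consumed_bytes) / input_bytes * 100

# e.g., 4000 MB of input data stored in 2400 MB after deduplication:
assert dedup_ratio(4000, 2400) == 40.0  # 40% space savings
```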

  19. Deduplication Ratio • is about 30% to 60% in SAFE • is 2 times higher in SAFE than in file-level deduplication • is as good in SAFE as variable-size block deduplication (Block-V) for the email datasets • is even higher in SAFE than in Block-V for the file system datasets [charts with ×1.5 and ×2 annotations: file system datasets, email datasets]

  20. Network Traffic • is the lowest in SAFE for both datasets • is 15% and 30% lower in SAFE than file-level deduplication (File) and fixed-size block deduplication (Block-F), respectively, for both datasets [charts: file system datasets, email datasets]

  21. Processing Time • is hundreds of times faster in SAFE than in Block-V • is as fast in SAFE as in file-level deduplication [charts: file system datasets, email datasets]

  22. Index Size • is proportional to the number of unique blocks (40 B per index entry) • e.g., for 4000 emails, index sizes are 0.1 MB (file-level) and 1.3 MB (SAFE) • is 2 to 3 times smaller in SAFE (1.3 MB) than in Block-V (3.7 MB) • Block-V has an 8 KB average block size • is 2 times larger for the file system datasets than for the email datasets • SAFE keeps multiple decomposed objects per file • e.g., the file system dataset has more pdf files (a pdf can be decomposed into more objects than a docx) [charts: file system datasets, email datasets]

  23. Conclusions Developed an efficient structure-aware, client-based deduplication (SAFE) • High deduplication ratio: as good as Block-V • Low network traffic: as good as Block-V • Low processing time: hundreds of times faster than Block-V • Low index overhead: 2 to 3 times less than Block-V • Future work: extend to incorporate more structured file types

  24. Thank you! Questions? {daehee.kim, sjsong, choiby} @umkc.edu
