
Tradeoffs in Scalable Data Routing for Deduplication Clusters

Tradeoffs in Scalable Data Routing for Deduplication Clusters. FAST '11. Wei Dong from Princeton University; Fred Douglis, Kai Li, Hugo Patterson, Sazzala Reddy, Philip Shilane from EMC. Presented 2011. 04. 21 (Thu), Kwangwoon Univ. System Software Lab, HoSeok Seo.





Presentation Transcript


  1. Tradeoffs in Scalable Data Routing for Deduplication Clusters FAST '11 Wei Dong from Princeton University; Fred Douglis, Kai Li, Hugo Patterson, Sazzala Reddy, Philip Shilane from EMC 2011. 04. 21 (Thu) Kwangwoon Univ. System Software Lab HoSeok Seo

  2. Introduction • This paper proposes • a deduplication cluster storage system having a primary node with a hard disk • Basically, cluster storage systems are... • a well-known technique to increase capacity • but have two problems • less deduplication than a single-node system • performance does not scale linearly

  3. Introduction • Goals • Scalable throughput • use super-chunks for data transfer • maximize the parallelism of disk I/O by routing data to nodes in a balanced way • reduce the disk I/O bottleneck by exploiting cache locality • Scalable capacity • use a cluster storage system • route repeated data to the same node • maintain balanced utilization across nodes • High deduplication, close to a single-node system • use a super-chunk that consists of consecutive chunks

  4. Introduction • Chunk • Definition • a segment of a data stream • Tradeoff of chunk size • a small chunk size yields high deduplication • a large chunk size yields high throughput

  5. Introduction • Super-chunk • Definition • consists of consecutive chunks • Merits • maintains high cache locality • reduces system overhead • achieves a deduplication rate similar to chunk-level routing • Demerits • risk of creating duplicates • can cause imbalanced utilization between nodes • Issues of super-chunks • how they are formed • how they are assigned to nodes • how they are routed to nodes for balance

  6. Dataflow of Deduplication Cluster 1. Divide the data stream into chunks 2. Create fingerprints of the chunks 3. Create a super-chunk 4. Select a representative fingerprint for the super-chunk from its chunks 5. Route the super-chunk to one of the nodes
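The five-step dataflow above can be sketched as follows. This is a minimal illustration, not the paper's implementation: fixed-size chunking stands in for content-defined chunking, and the chunk size, super-chunk width, bin count, and `route_stream` name are all assumptions made for the sketch.

```python
import hashlib

def fingerprint(chunk):
    # SHA-1 fingerprint of a chunk (SHA-1 is the hash function named in the slides)
    return hashlib.sha1(chunk).hexdigest()

def route_stream(data, chunk_size=8 * 1024, chunks_per_super=4, num_bins=16):
    """Steps 1-5: chunk, fingerprint, group into super-chunks,
    pick a representative, route each super-chunk to a bin."""
    # 1. Divide the data stream into chunks (fixed-size here, for simplicity)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # 2. Create a fingerprint for every chunk
    fps = [fingerprint(c) for c in chunks]
    routes = []
    # 3. Group consecutive chunks into super-chunks
    for i in range(0, len(fps), chunks_per_super):
        super_fps = fps[i:i + chunks_per_super]
        # 4. Pick the minimum fingerprint as the representative
        rep = min(super_fps)
        # 5. Route the super-chunk to a bin by its representative
        bin_id = int(rep, 16) % num_bins
        routes.append((super_fps, bin_id))
    return routes
```

Because the representative depends only on the super-chunk's content, the same super-chunk always lands in the same bin, which is what lets duplicates meet on the same node.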

  7. Deduplication flow at a node (cont.)

  8. Deduplication flow at a node (colored boxes in the original diagram mark steps that require disk access) • For each chunk, the dedup logic asks: is it a duplicate? • Is the fingerprint in the cache? If yes, deduplication is done. • If not, is the fingerprint in the on-disk index? If yes, load the fingerprints that were written at the same time into the cache; deduplication is done. • If not, write the fingerprint & chunk to a container. • If the container is full, write the container to disk.
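The per-node flow in the flowchart can be sketched as a small class. This is an assumption-laden toy (the `DedupNode` name, the tiny container capacity, and in-memory "disk" structures are all illustrative), but it follows the slide's branches: cache hit, index hit with container prefetch, or new write.

```python
class DedupNode:
    """Sketch of the per-node dedup flow: cache -> on-disk index -> container."""
    CONTAINER_CAP = 4  # fingerprints per container (tiny, for illustration)

    def __init__(self):
        self.cache = set()        # in-memory fingerprint cache
        self.index = {}           # on-disk fingerprint index: fp -> container id
        self.containers = [[]]    # on-disk containers of (fp, chunk) entries
        self.disk_lookups = 0     # index lookups that had to touch disk

    def write(self, fp, chunk):
        if fp in self.cache:              # duplicate found in cache: done
            return "dup-cache"
        self.disk_lookups += 1            # cache miss: consult the on-disk index
        if fp in self.index:
            # load fingerprints written at the same time (same container) into cache
            cid = self.index[fp]
            self.cache.update(f for f, _ in self.containers[cid])
            return "dup-index"
        # new data: append fingerprint & chunk to the open container
        cur = self.containers[-1]
        cur.append((fp, chunk))
        self.index[fp] = len(self.containers) - 1
        self.cache.add(fp)
        if len(cur) >= self.CONTAINER_CAP:    # container full: write it to disk
            self.containers.append([])
        return "new"
```

The container prefetch on an index hit is the cache-locality trick: chunks written together tend to recur together, so one disk lookup warms the cache for their neighbors.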

  9. What is a Container? • Container • Definition • a fixed-size large region on disk • consists of two parts: fingerprints and chunk data • Usage • stores the fingerprint & chunk of non-duplicate data on disk

  10. Issue 1 : How are super-chunks formed? • Determine an average super-chunk size • Experimented with a variety of sizes from 8KB to 4MB • Generally 1MB is a good choice
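A minimal sketch of forming super-chunks around the ~1MB target named above. The simple size threshold and the `form_super_chunks` name are assumptions; the paper draws super-chunk boundaries in a content-defined way rather than purely by accumulated size.

```python
def form_super_chunks(chunk_sizes, target=1 << 20):
    """Group consecutive chunks into super-chunks of roughly `target` bytes
    (1MB default, the slide's 'good choice'). Returns lists of chunk indices."""
    supers, cur, cur_size = [], [], 0
    for i, size in enumerate(chunk_sizes):
        cur.append(i)
        cur_size += size
        if cur_size >= target:      # close the super-chunk once ~1MB is reached
            supers.append(cur)
            cur, cur_size = [], 0
    if cur:                         # flush the trailing partial super-chunk
        supers.append(cur)
    return supers
```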

  11. Issue 2 : How are they assigned to nodes? • Use a Bin Manager running on the master node • The Bin Manager rebalances load between nodes by bin migration (for stateless routing) • Routing steps (with M bins and N nodes, M > N): • 1. assign a bin number to the super-chunk • 2. look up the node for that bin via the bin manager • 3. route the super-chunk to that node
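The bin indirection above can be sketched as a small table. This is a hypothetical `BinManager` (the round-robin initial assignment and the parameter values are assumptions); the point it illustrates is that rebalancing changes only the bin-to-node table, not the content-based super-chunk-to-bin math.

```python
class BinManager:
    """M bins map onto N nodes (M > N); migration moves a bin, not the routing rule."""
    def __init__(self, num_bins=64, num_nodes=4):
        # initial round-robin assignment of bins to nodes (an assumption)
        self.bin_to_node = [b % num_nodes for b in range(num_bins)]

    def route(self, representative_fp):
        bin_id = representative_fp % len(self.bin_to_node)  # step 1: super-chunk -> bin
        return self.bin_to_node[bin_id]                     # step 2: bin -> node

    def migrate(self, bin_id, new_node):
        # step 3 (rebalance): reassign one bin to an underloaded node
        self.bin_to_node[bin_id] = new_node
```

Because super-chunks hash to bins, not nodes, migrating a bin moves a whole slice of the content space at once, keeping stateless routing stable across rebalances.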

  12. Issue 3 : How are super-chunks routed to nodes for balance? • Use two data routing techniques to overcome the demerits of super-chunks • stateless technique with bin migration • light-weight and well suited for most balanced workloads • stateful technique • improves deduplication while avoiding data skew

  13. Stateless Technique • Basics • 1. Create a fingerprint for each chunk • 2. Select a representative fingerprint from among them • 3. Assign a bin to the super-chunk (e.g., representative mod #bins) • How to create a fingerprint • hash all of the chunk (a.k.a. hash(*)) • hash the first N bytes of the chunk (a.k.a. hash(N)) • ※ SHA-1 is used as the hash function • How to select the representative fingerprint • first • maximum • minimum
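The hash(N) variants and representative-selection policies above can be sketched directly. The function names are illustrative; SHA-1 is the hash the slides name, and `prefix=None` models hash(*) while `prefix=N` models hash(N).

```python
import hashlib

def chunk_fingerprint(chunk, prefix=None):
    # hash(*) hashes the whole chunk; hash(N) hashes only its first N bytes
    data = chunk if prefix is None else chunk[:prefix]
    return hashlib.sha1(data).hexdigest()

def representative(fingerprints, policy="min"):
    # the three selection policies from the slide: first, maximum, minimum
    if policy == "first":
        return fingerprints[0]
    if policy == "max":
        return max(fingerprints)
    return min(fingerprints)

def stateless_bin(fingerprints, num_bins, policy="min"):
    # stateless routing: representative fingerprint mod number of bins
    return int(representative(fingerprints, policy), 16) % num_bins
```

hash(N) trades a little collision resistance for cheaper hashing; the paper's hash(64) variant mentioned in the conclusion corresponds to `prefix=64` here.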

  14. Stateful Technique (cont.) • Merits compared to stateless • higher deduplication, close to a single-node backup system • balanced load • bin migration is no longer needed • Demerits • increased computation • increased memory and communication cost

  15. Stateful Technique • Process • calculate a "weighted vote" for each node • select the node with the highest weighted vote • weighted vote = (number of matches) × (overload factor) • number of matches : number of the super-chunk's chunks already stored (duplicated) at that node • overload factor : at most 1.0; decreases when the node's storage utilization is high relative to the average
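The weighted vote can be sketched as below. The exact shape of the overload penalty is a reconstruction for illustration (here, an inverse-utilization discount for nodes above average); the paper defines its own discount schedule, and the function names are assumptions.

```python
def weighted_vote(match_count, node_util, avg_util):
    """Score a node: duplicate matches, discounted if the node is overloaded
    relative to the average utilization (penalty shape is illustrative)."""
    overload = node_util / avg_util if avg_util > 0 else 1.0
    weight = 1.0 if overload <= 1.0 else 1.0 / overload  # penalize overloaded nodes
    return match_count * weight

def choose_node(matches, utils):
    """Stateful routing: pick the node with the highest weighted vote."""
    avg = sum(utils) / len(utils)
    scores = [weighted_vote(m, u, avg) for m, u in zip(matches, utils)]
    return scores.index(max(scores))
```

With balanced utilization the choice reduces to "most matching chunks wins"; the discount only overrides that when a heavily-loaded node would otherwise keep attracting data.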

  16. Datasets

  17. Evaluation Metrics • Capacity • Total Deduplication (TD) • original dataset size / deduplicated size • Data Skew • max node utilization / avg node utilization • Effective Deduplication (ED) • TD / Data Skew • Normalized ED • shows how close the deduplication is to a single-node system • Throughput • measured by # of on-disk fingerprint index lookups
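The three capacity metrics above are plain ratios, shown here directly from their definitions (function names are mine):

```python
def total_deduplication(original_size, deduped_size):
    # TD: original dataset size divided by its deduplicated size
    return original_size / deduped_size

def data_skew(node_utils):
    # skew: maximum node utilization over average node utilization
    return max(node_utils) / (sum(node_utils) / len(node_utils))

def effective_deduplication(original_size, deduped_size, node_utils):
    # ED: TD discounted by skew; equals TD when utilization is perfectly balanced
    return total_deduplication(original_size, deduped_size) / data_skew(node_utils)
```

ED captures why skew matters: a cluster that deduplicates well but piles data on one node fills up when that node does, so its usable deduplication is TD divided by the imbalance.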

  18. Experimental Results :Overall Effectiveness • Using Trace-driven simulation

  19. Experimental Results : Overall Effectiveness with migration

  20. Experimental Results : Feature Selection • HYDRAstor • routes chunks to nodes according to content • good performance • worse deduplication rate due to 64KB chunks

  21. Experimental Results : Cache Locality and Throughput (32 nodes) • Logical skew : max(size before dedupe) / avg(size before dedupe) • Max lookup : maximum normalized total number of fingerprint index lookups • ED : Effective Deduplication

  22. Experimental Results :Effect of Bin Migration The ED drops between migration points due to increasing skew.

  23. Summary

  24. Conclusion • 1. Using super-chunks for data routing is superior to using individual chunks for achieving scalable throughput while maximizing deduplication • 2. The stateless routing method (hash(64)) with bin migration is a simple and efficient approach • 3. The effective deduplication of a stateless-routed cluster may drop quickly as the number of nodes increases; to solve this problem, a stateful data routing approach is proposed. Simulations show good performance with up to 64 nodes in a cluster
