290 likes | 730 Vues
Venti : A new approach to archival storage. Sean Quinlan & Sean Dorward Bell Labs, Lucent Technology FAST 2002 Presented by: Fabián E. Bustamante. Archival Storage. Abundant storage - storing data for long time, even forever Archival system – write-once policy – changes how we see storage
E N D
Venti : A new approach to archival storage Sean Quinlan & Sean Dorward Bell Labs, Lucent Technology FAST 2002 Presented by: Fabián E. Bustamante
Archival Storage • Abundant storage - storing data for long time, even forever • Archival system – write-once policy – changes how we see storage • Prevalent form of archival storage - tape backup • Central service for a number of clients • Restoring data tedious and error prone (requires sys admin or privileged software) • Infrequent, so problems may go undetected for long • Tradeoff bet/ performance of backup & restore (incremental backups are more efficient to generate, but not self-contained) • Snapshots (Plan 8, Andrew FS, …) • A consistent read-only view of FS at some point in past • Maintains file system permissions • Can be accessed with standard tools – ls, cat, cp, grep, diff • Avoids tradeoff between full vs. incremental backup • Looks like full backup • Implementation resembles incremental backup – share blocks • Only a small number of versions are kept
Venti • GOAL: “To provide a write-once archival repository that can be shared by multiple client machines and applications” • Block level network storage system • Actually a backend storage for client apps • Block level i/f places few restrictions on data structures/format • Blocks addressed by hash of their contents • Uses SHA-1 algorithm • SHA-1 output is 160 bit (20 byte) fingerprint of data block • Write once policy • Once written cannot be deleted • Multiple writes of same data coalesced • Data sharing – increases effective storage capacity • Makes write operation idempotent
Venti • Multiple clients can share a Venti server • Hash fn gives an universal namespace • Inherent integrity checking of data • Fingerprint computation on retrieval • Caching is simplified • Uses magnetic disks as storage technology • Access time comparable to non-archival data • Hashing function – sha1 is more than enough for now • Exabyte of storage (1018bytes) in 8 Kbytes blocks (~1014 blocks - n) with sha1 (b = 160 bits) – p < 10-20 Number of pairs of blocks Probability that a given pair will collide
Venti as a building block to construct storage apps Data divided into blocks App needs fingerprint for retrieval One approach - fingerprints packed into pointer blocks, also written to server Repeated recursively to get single fingerprint - root of tree Applications - general
Tree cannot be modified, but Unchanged sections of tree reused More complex data structures Mixing data & fingerprints in a block e.g. structure for storing file system 3 types of blocks Directory – file metadata + root fingerprint Pointer Data Applications - general
Example app: Vac • Similar to zip & tar – store collection of files & directories as single object • Tree of blocks formed for selected files • vac archive file - 45 bytes long • 20 bytes for root fingerprint • 25 bytes fixed header string • any amount of data compressed down to 45 bytes • Unvac – to restore file from archive • Duplicate copies of file coalesced on server • Multiple users vac same data – only 1 copy stored • vac on changed contents
Example app: Physical level backup • Copy the raw disk blocks to Venti • No need to walk file hierarchy • Higher throughput • Duplicate blocks are coalesced • Free block – null in pointer tree • User sees full backup of device • Storage space advs of incremental backup retained • Random access possible • Directly mounting a backup file system image (w/ OS support) • Enables lazy (on demand) restore
Example app - Plan 9 FS • Previously • Magnetic disk + write-once optical jukebox • Cache for faster access & accumulated changes bet/ snapshots • Cache smaller than active file system • Cache misses are painful • Plan 9 FS on top of Venti • Venti as storage device (instead of jukebox) • Equalizes access to both active & archival storage • Cache can also be smaller
Append-only log of data blocks RAID array Separate index maps fingerprints to log location Can be regenerated from data log Network Client Venti Server Client Client FS Index Cache Block Cache Data Index Implementation
Venti’s data log Append only section of data blocks (variable size) Log divided into arenas for maintenance For integrity checking & data recovery • Data compressed (LZ77) – • encoding – if, algo. • esize size after compression For fast index rebuild & error recovery Data log and directory grow in opposite directionsWhen arena is filled, mark as sealed, fingerprint compute
Venti’s index • No. of fingerprints >> no. of blocks on a server • Index as disk-resident hash table • Hashing function maps fingerprints to index buckets • Going through index is main performance penalty • Caching (block and index) • Index striped across multiple disks – fingerprint location in index is random & index is too big to fit in memory • Write buffering (to increase number of concurrent accesses) • Block cache - hit - index lookup & data log bypassed • Index cache - hit - index lookup bypassed
Performance: computing environments • Testbed • Dedicated dual 550Mhz Pentium 3, 2GB mem. • Access through a 100Mbps Ethernet • Data log stored on a 500GB MaxTronic IDE Raid 5 • Index stored on 8 Seagate Cheetah 18XL 9GB SCSI • Micro-benchmark results
Storage usage – file servers used • 2 plan 9 file servers, bootes and emelie • Spanning 1990 to 2001 • 522 user accounts, 50-100 active at a time • Numerous development projects hosted • Several large data sets • Astronomical data, satellite imagery, multimedia files • Next figures • Size of active file system • Space consumed on jukebox • Space when using Venti • Ratio – using Venti reduces the cost of retaining snapshots (so ratio is smaller)
Storage saving • When stored on Venti, size of jukebox data reduced by 3 factors • Elimination of duplicate blocks • Elimination of block fragmentation • Compression of block contents
Reliability & recovery • Tools for integrity checking & error recovery • Verifying structure of arena • Checking there is an index entry for every block in data log, vice versa • Rebuilding index from data log • Copying arena to removable media • Data log on RAID 5 disk array • Protection against single drive failures • Off-site mirrors for extra protection, if that’s not enough … • Stored sealed arenas into removable media
Future Work • Load balancing • Distribute Venti across multiple machines • Replicate server • Use of proxy server to hide it from client • Better access control • Currently just authentication to server • Single root fingerprint gives access to entire file tree • Use of variable sized blocks as in LBFS
Conclusion • Addressing block by SHA-1 hash of contents – good fit for archival storage • Magnetic disks as storage technology • Large capacity at low price – online storage • Random access • Performance comparable to non-archival data • Write once model • Reduces accidental or malicious data loss • Simplifies administration • Simplifies caching • Allows sharing of data