
Directory management for (N,K) Erasure Codes in Hadoop distributed file systems



  1. Directory management for (N,K) Erasure Codes in Hadoop distributed file systems Avinash Katika, Pradeep Chanumolu, Pradeep Nethagani, Raviteja Karumajji GROUP - G

  2. Paper-1 Low-Overhead Byzantine Fault-Tolerant Storage

  3. An erasure-coded Byzantine fault-tolerant block storage protocol • Features • Nearly as efficient as protocols that tolerate only crashes. • Employs novel mechanisms to optimize for the common case, when faults and concurrency are rare. • Performs a write operation in two rounds of communication and a read operation in one round in common cases. • Operates on a short checksum composed of cryptographic hashes and homomorphic fingerprints.

  4. Byzantine fault-tolerant • Protocols that can tolerate arbitrarily faulty behavior by components of the system are said to be Byzantine fault-tolerant. • Most Byzantine fault-tolerant protocols are used to implement replicated state machines. • In a replicated state machine, each request is sent to each server replica and each non-faulty replica sends a response. • In distributed storage, where large blocks of data are often transferred, this leads to considerable overhead. • To avoid this, a storage protocol can reduce the amount of data that must be sent to each server by using an m-of-n erasure code.

  5. m-of-n erasure code • Each block is encoded into n fragments such that any m fragments can be used to decode the block. • But existing protocols that use erasure codes struggle to tolerate Byzantine-faulty clients. • Existing protocols employ one of two techniques: • Provide each server with the entire block of data, which saves disk bandwidth and capacity but not network bandwidth. • Introduce expensive techniques to ensure that fragments decode into a unique block, which saves network bandwidth but requires additional servers and a relatively expensive verification procedure for read operations.

  6. Byzantine fault-tolerant m-of-n erasure-coded storage protocols • require at least m+2f servers to tolerate f faulty servers. • often incur additional computational overhead, such as cryptographically hashing the data. • Implementations of some recent protocols can be quite efficient due to several optimizations, e.g.: • Castro and Liskov eliminate public-key signatures from the common case by replacing signatures with message authentication codes (MACs) and lazily retrieving signatures in cases of failure.

  7. (Examples continued) • Abd-El-Malek et al. use aggressive optimistic techniques to scale as the number of faults tolerated is increased, but their protocol requires 5f+1 servers to tolerate f faulty servers. • Cowling et al. use a hybrid of these two protocols to achieve good performance with only 3f+1 servers. • When writing large blocks of data and tolerating multiple faults, a Byzantine fault-tolerant storage protocol should provide erasure-coded fragments to each server to minimize the bandwidth overhead of redundancy. • Writing erasure-coded fragments has been difficult to achieve because servers must ensure that a block is encoded correctly without seeing the entire block.

  8. PASIS • Byzantine fault-tolerant erasure-coded block storage protocol. • Servers do not verify that a block is correctly encoded during write operations. • Clients verify during read operations that a block is correctly encoded. • This technique avoids the problem of verifying erasure-coded fragments but introduces a few new problems: • Fragments must be kept in versioned storage to ensure correct read operations. • The read verification process is computationally expensive. • PASIS requires 4f+1 servers to tolerate f faults. • A separate garbage collection protocol must eventually be run to free old versions of a block.

  9. AVID • An asynchronous verifiable information dispersal protocol that requires only 3f+1 servers to tolerate f faults. • Here, a client sends each server an erasure-coded fragment. • Each server then sends its fragment to all other servers, so that every server can verify that the block is encoded correctly. • This all-to-all communication consumes slightly more bandwidth in the common case than a typical replication protocol.

  10. Low-Overhead Byzantine Fault-Tolerant Storage approach • Introduces optimizations that minimize the number of rounds of communication, the amount of computation, and the number of servers that must be available at any time. • Employs homomorphic fingerprinting to ensure that a block is encoded correctly.

  11. Design Improvements • No public-key cryptography • Relies entirely on MACs (message authentication codes) and nonce values. • Early write. • Partial encoding. • Distributed verification of erasure-coded data.

  12. No public-key cryptography • All servers share pairwise MAC keys. • Each server provides MACs to the client in the prepare phase, which the client sends in the commit phase to prove that enough servers successfully completed prepare requests at a given timestamp. • Each server also provides a pseudo-random nonce value in the prepare phase; the client aggregates these values and provides them to each server in the commit phase. • In a read operation, a server provides the client with these nonce values to prove that some client invoked a write at a specific timestamp.
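
A minimal sketch of this MAC-and-nonce mechanism, assuming a toy key setup and illustrative function names (the paper's wire format and API are not shown here):

```python
import hmac, hashlib, os

# Every pair of servers shares a symmetric key (illustrative setup).
SERVERS = ["s0", "s1", "s2", "s3"]
pair_key = {frozenset((a, b)): os.urandom(32)
            for a in SERVERS for b in SERVERS if a < b}

def prepare(server_id, timestamp):
    """Server side of prepare: a fresh nonce plus one MAC of the timestamp
    per peer, each keyed with the pairwise shared key."""
    nonce = os.urandom(16)
    macs = {peer: hmac.new(pair_key[frozenset((server_id, peer))],
                           str(timestamp).encode(), hashlib.sha256).digest()
            for peer in SERVERS if peer != server_id}
    return nonce, macs

def check_commit(verifier_id, producer_id, timestamp, mac):
    """Server side of commit: verify a MAC produced by another server,
    proving that server completed a prepare request at this timestamp."""
    expected = hmac.new(pair_key[frozenset((verifier_id, producer_id))],
                        str(timestamp).encode(), hashlib.sha256).digest()
    return hmac.compare_digest(expected, mac)

# The client forwards each server's MACs and nonce in the commit phase:
nonce, macs = prepare("s1", timestamp=7)
assert check_commit("s2", "s1", 7, macs["s2"])
```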

  13. Early write • Clients send erasure-coded fragments in the first round of the prepare phase, which saves a round of communication. • Partial encoding • PASIS encodes and hashes fragments for all n servers. Encoding this many fragments is wasteful because f servers may not be involved in a write operation. • Instead, our protocol encodes and hashes fragments for only the first n-f servers, which lowers the computational overhead. • Encoding 2f+1 fragments requires computing f+1 values, whereas encoding 4f+1 fragments (as in PASIS) requires computing 3f+1 values.

  14. Distributed verification of erasure-coded data • Each server knows only the cross-checksum and its fragment, and so it is difficult for a server to verify that its fragment, together with the corresponding fragments held by other servers, forms a valid erasure coding of a unique block. • This problem is solved by the fingerprinted cross-checksum. • The fingerprinted cross-checksum includes a cross-checksum, as used in PASIS, along with a set of homomorphic fingerprints of the first m fragments of the block.

  15. Distributed verification of erasure-coded data • The fingerprints are homomorphic in that the fingerprint of the erasure coding of a set of fragments is equal to the erasure coding of the fingerprints of those fragments. • The overhead of computing homomorphic fingerprints is small compared to the cryptographic hashing for the cross-checksum. • A server can determine if a fragment is consistent with a fingerprinted cross-checksum without access to any other fragments.
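
A minimal sketch of the homomorphism: because the fingerprint is a linear function of a fragment, it commutes with any linear erasure code. The field size, the fingerprint definition, and the toy "sum" parity below are illustrative assumptions; the paper's construction uses a full m-of-n code over a different field.

```python
P = 2**31 - 1          # prime modulus (assumed for the sketch)
R = 123456789          # evaluation point, chosen per fingerprint key

def fingerprint(fragment):
    """Evaluate the fragment, viewed as a polynomial, at the point R (mod P)."""
    acc = 0
    for symbol in reversed(fragment):
        acc = (acc * R + symbol) % P
    return acc

# Toy linear code: the parity fragment is the symbol-wise sum of data fragments.
f1 = [3, 1, 4, 1, 5]
f2 = [2, 7, 1, 8, 2]
parity = [(a + b) % P for a, b in zip(f1, f2)]

# Fingerprint of the coded fragment == the same coding applied to the fingerprints,
# so a server can check its fragment against the fingerprints alone.
assert fingerprint(parity) == (fingerprint(f1) + fingerprint(f2)) % P
```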

  16. Write • To write a block, a client encodes the block into m+f fragments, computes the fingerprinted cross-checksum, and sends each server its fragment and the fingerprinted cross-checksum. • Each server responds with a logical timestamp, a nonce, and a MAC of the timestamp for each server. • If the timestamps do not match, the client requests new MACs at the greatest timestamp found. • The client then commits the write by sending the timestamp, fingerprinted cross-checksum, nonces, and MACs to each server.

  17. Write • When a fault occurs (which is uncommon), the client can either try the commit at another server or send the entire block to another server in order to garner another prepare response. • The timestamps received in the first round of prepare will often match, which allows most write operations to complete in only two rounds.
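
A client-side sketch of this two-round write. The server handle, the erasure-code helper, and the checksum helper are passed in as parameters because their real interfaces are assumptions here, as are method names such as prepare_at.

```python
def write_block(block, servers, m, f, encode, make_fpcc):
    # Partial encoding: fragments for only the first n - f = m + f servers.
    fragments = encode(block, m + f)
    fpcc = make_fpcc(fragments)

    # Round 1: prepare. Each server gets its fragment plus the checksum and
    # answers with (timestamp, nonce, MACs-for-every-server).
    prepares = [srv.prepare(frag, fpcc) for srv, frag in zip(servers, fragments)]
    timestamps = {p.timestamp for p in prepares}
    if len(timestamps) > 1:
        # Timestamps disagree: request fresh MACs at the greatest timestamp found.
        ts = max(timestamps)
        prepares = [srv.prepare_at(ts, fpcc) for srv in servers]
    ts = prepares[0].timestamp

    # Round 2: commit. Forward the collected nonces and MACs as proof that
    # enough servers completed prepare at this timestamp.
    nonces = [p.nonce for p in prepares]
    macs = [p.macs for p in prepares]
    for srv in servers:
        srv.commit(ts, fpcc, nonces, macs)
    return ts
```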

  18. Read • To read a block, a client requests timestamps and fingerprinted cross-checksums from 2f+1 servers, and fragments from the first m servers. • If the m fragments are consistent with the most recent fingerprinted cross-checksum, and if the client can determine that some client invoked a write at this timestamp, the block is decoded and returned. • Most read operations return after one round of communication with correct servers.
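
A sketch of the common-case one-round read. Response field names and the consistency, decoding, and write-proof helpers are assumptions for illustration.

```python
def read_block(servers, m, f, consistent, decode, write_proven):
    # Ask 2f+1 servers for (timestamp, fingerprinted cross-checksum); the first
    # m of them also return their fragments.
    replies = [srv.read(include_fragment=(i < m))
               for i, srv in enumerate(servers[:2 * f + 1])]
    latest = max(replies, key=lambda r: r.timestamp)

    fragments = [r.fragment for r in replies[:m]]
    if all(consistent(frag, latest.fpcc) for frag in fragments) \
            and write_proven(replies, latest.timestamp):
        return decode(fragments, m)
    return None   # fall back to additional rounds / repair (not shown)
```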

  19. Implementation • The low-overhead fault-tolerant prototype (LOFT) consists of a client library, linked to directly by client applications, and a storage server application. • The client library interface consists of two functions, “read block” and “write block.” • Fingerprinting is fast, and only the first m fragments are fingerprinted.

  20. [Figures: write throughput and read throughput measurements]

  21. Conclusion • Erasure-coded protocols provide higher-throughput writes and can increase m for fixed f to realize even higher throughput. • Previous Byzantine fault-tolerant erasure-coded protocols, however, exhibit low client throughput for reads and high computational overheads for both reads and writes. • LOFT presents a Byzantine fault-tolerant erasure-coded protocol that performs well for both reads and large writes. • Measurements of a prototype implementation demonstrate that this protocol exhibits throughput within 10% of the ideal crash fault-tolerant erasure-coded protocol for reads and sufficiently large writes. • This protocol has little computational overhead other than a cryptographic hash and a homomorphic fingerprint of the data.

  22. Paper-2 “Lazy verification in fault-tolerant distributed storage systems”

  23. PASIS read/write protocol and delayed verification • The PASIS read/write protocol uses versioning to avoid the need to verify the completeness and integrity of a write operation during its execution. Instead, such verification is performed during read operations. • Lazy verification shifts this work into the background, removing it from the critical path of both read and write operations in common cases, as discussed later. • The protocol supports a hybrid failure model for up to t storage-nodes, b ≤ t of which may be Byzantine; the remainder are assumed to only crash. • At a high level, the PASIS read/write protocol proceeds as follows. Logical timestamps are used to totally order all writes and to identify erasure-coded data-fragments pertaining to the same write across the set of storage-nodes. • To perform a read, a client issues read requests to a subset of storage-nodes. Once a quorum of storage-nodes reply, the client identifies the candidate—the response with the greatest logical timestamp (see the sketch below). • Verification, in the context of this protocol, denotes two steps: checking for write completeness and performing timestamp validation.
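
A sketch of candidate selection during a PASIS read, assuming illustrative response field names: once a quorum of storage-nodes has replied, the candidate is the set of responses carrying the greatest logical timestamp.

```python
def find_candidate(responses, quorum_size):
    """Return the greatest logical timestamp seen and the responses that carry it."""
    if len(responses) < quorum_size:
        raise RuntimeError("not enough responses for a quorum")
    latest_ts = max(r.timestamp for r in responses)
    candidate = [r for r in responses if r.timestamp == latest_ts]
    return latest_ts, candidate
```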

  24. Disadvantages • One major downside to read-time verification is the potentially unbounded cost: a client may have to sift through many ill-formed or incomplete write values. A Byzantine client could degrade service by submitting large numbers of bad values. • A second practical issue is that verification of the erasure coding (timestamp validation) requires a computation cost equivalent to fragment generation on every read, on the critical path.

  25. Properties of the lazy verification process • It is a background task that never impacts foreground read and write operations. • It verifies the values of write operations before clients read them (so that clients need not perform verification). • It only verifies the values of write operations that clients actually read.

  26. Lazy verification process • First, a storage-node finds the latest complete write version. Second, the storage-node performs timestamp validation. • Once a block version is successfully verified, a storage-node sets a flag indicating that verification has been performed. This flag is returned to a reading client within the storage-node’s response. • A client that observes b+1 storage-node responses indicating that a specific block version has been verified need not itself perform timestamp validation, since at least one of the responses must be from a correct storage-node. • This is similar to Srikanth and Toueg’s algorithm for physical clock synchronization.
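
A sketch of how a reading client uses the verification flag, assuming illustrative field names: b+1 matching "verified" responses guarantee that at least one comes from a correct storage-node, so the client can skip its own timestamp validation.

```python
def can_skip_validation(responses, version_ts, b):
    """True if at least b+1 responses mark this block version as verified."""
    verified = sum(1 for r in responses
                   if r.timestamp == version_ts and r.verified)
    return verified >= b + 1
```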

  27. Cooperative lazy verification • Each storage-node can perform verification for itself, but the overall cost of verification can be reduced if storage-nodes cooperate. • Ideally, only b+1 storage-nodes would need to perform lazy verification. Cooperative lazy verification targets this idea!

  28. Contd.. • The benefit of cooperative lazy verification is a reduction in the number of verification-related messages. • Without cooperation, each storage-node must independently perform verification, which results in O(N^2) messages. • In contrast, the common case for cooperative lazy verification is for just b+1 storage-nodes to perform read requests (to N−b storage-nodes) and then send N−1 notify messages. The communication complexity is thus O(bN) messages.

  29. Bounding Byzantine-faulty clients • Byzantine-faulty clients may perform poisonous writes and may stutter. Each poisonous or incomplete write introduces latent potential work into the system that may require subsequent read operations to perform additional round trips. • Lazy verification provides some protection from these degradation-of-service attacks. • Storage-nodes setting the verification flag and discarding poisonous writes can decrease the cost of tolerating poisonous writes by reducing or eliminating the client verification cost. • If b+1 storage-nodes identify a client as having performed a poisonous write, then the faulty client’s authorization could be revoked.

  30. Garbage collection • If lazy verification passes for a block version, i.e., the block version is complete and its timestamp validates, then block versions with earlier timestamps are no longer needed. The storage-node can delete such obsolete versions. • Prioritizing blocks for garbage collection: careful selection of blocks for verification can improve efficiency, both for garbage collection and lazy verification. Here, the focus is on two complementary metrics on which to prioritize this selection: number of versions and presence in cache.
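
A sketch of the version garbage-collection rule stated above, assuming a hypothetical version-store interface: once some version of a block has been verified as complete with a validated timestamp, every older version of that block can be deleted.

```python
def garbage_collect(version_store, block_id, verified_ts):
    """Delete all versions of block_id older than the verified timestamp."""
    for ts in list(version_store.timestamps(block_id)):
        if ts < verified_ts:
            version_store.delete(block_id, ts)
```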

  31. Paper-3 Towards bounded wait-free PASIS

  32. Erasure coding: What and Why? • Storage bits may fail or become corrupted in the storage system. • The storage protection system needs the ability to recover data. • Techniques? • Replication • Erasure coding

  33. What and Why? • Replication • Data is replicated and stored. • Erasure coding • Started life in RAID 6. • Provides redundancy by breaking an object into smaller segments and storing the fragments in different places. • The key is that the data can be recovered from any combination of a smaller number of those fragments.

  34. Erasure Coding • Number of fragments the data is divided into: m • Number of fragments the data is encoded into: n (n > m) • The key property of erasure codes is that the object stored as n fragments can be reconstructed from any m fragments. • Encoding rate: r = m/n (< 1) • Storage required: 1/r
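
A minimal sketch of an (n, m) erasure code over a prime field, as an illustration of the "any m of n" property: the m data symbols are the coefficients of a polynomial, fragment i is its evaluation at x = i+1, and any m fragments recover the data by Lagrange interpolation. Production systems use Reed-Solomon over GF(2^w) on byte blocks; this toy version works only on small integers.

```python
P = 257  # prime field; every symbol is an integer in [0, P)

def encode(data, n):
    """data: list of m symbols -> list of n fragments (x, poly(x) mod P)."""
    def poly(x):
        return sum(c * pow(x, j, P) for j, c in enumerate(data)) % P
    return [(x, poly(x)) for x in range(1, n + 1)]

def decode(fragments, m):
    """Recover the m data symbols from any m fragments (Lagrange interpolation)."""
    pts = fragments[:m]
    coeffs = [0] * m
    for i, (xi, yi) in enumerate(pts):
        basis, denom = [1], 1                      # numerator polynomial, denominator
        for j, (xj, _) in enumerate(pts):
            if j == i:
                continue
            basis = [((basis[k - 1] if k > 0 else 0)
                      - xj * (basis[k] if k < len(basis) else 0)) % P
                     for k in range(len(basis) + 1)]    # multiply basis by (x - xj)
            denom = denom * (xi - xj) % P
        scale = yi * pow(denom, P - 2, P) % P            # yi / denom mod P
        for k, c in enumerate(basis):
            coeffs[k] = (coeffs[k] + scale * c) % P
    return coeffs

# Any m = 2 of the n = 3 fragments reconstruct the data (cf. the next slide).
frags = encode([2, 3], n=3)
assert decode(frags[1:], m=2) == [2, 3]
assert decode([frags[0], frags[2]], m=2) == [2, 3]
```

With m = 2 and n = 3, the encoding rate is r = 2/3 and the required storage is 1/r = 1.5x the original data.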

  35. Replication vs. Erasure coding: store two integers a=2 and b=3 • Replication: store each value twice (a=2, a=2, b=3, b=3), four stored values in total. • Erasure coding: store a=2, b=3, and the sum a+b=5; any two of the three stored values recover both integers.

  36. Erasure Coding in WAS • Windows Azure Storage • Unit of replication: the extent (an append-only distributed file system). • Each extent is replicated over several machines. • When an extent exceeds 3 GB it is sealed. • Until they are sealed, extents are triple-replicated. • Once sealed, the extent is erasure-coded and the full replicas are deleted.

  37. [Figure: a sealed 3 GB extent divided into six data fragments d0–d5 and three parity fragments p0–p2]

  38. WAS – 6+3 • Divide the extent into 6 data fragments. • Add 3 parity fragments for durability. • Distribute them over several racks. • For reconstruction we need 6 disk I/Os and 6 network transfers. • Overhead: (6+3)/6 = 1.5x

  39. Can we improve? • With 12 data fragments and 4 parity fragments: • Overhead: (12+4)/12 = 1.33x • But for reconstruction we need 12 disk I/Os and 12 network transfers. • Reconstruction is expensive. Can we achieve 1.33x overhead while requiring only 6 fragments for reconstruction?

  40. WAS (12+2+2) [Figure: a sealed 3 GB extent divided into two groups of six data fragments (x0–x5 and y0–y5), two local parities (px, py), and two global parities (q0, q1)]

  41. WAS LRC (12+2+2) • Divide 12 fragments into two groups • Local parity for each group (Local Reconstruction Code) • One failure in any of the groups – Use local parity to reconstruct the data. • Two failures in one group – Use global parities to reconstruct the data. • Storage Overhead: (12+2+2)/12 = 1.33x
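
A sketch of the reconstruction choice in this LRC (12+2+2) layout. Fragment and parity names follow the figure above; the actual WAS parity equations are more involved than this decision logic, which is only illustrative.

```python
GROUP_X = [f"x{i}" for i in range(6)]   # first group, protected by local parity px
GROUP_Y = [f"y{i}" for i in range(6)]   # second group, protected by local parity py
GLOBAL_PARITIES = ["q0", "q1"]

def reconstruction_plan(failed):
    """Which parities a reconstruction would use for a set of failed data fragments."""
    fx = [f for f in failed if f in GROUP_X]
    fy = [f for f in failed if f in GROUP_Y]
    if len(fx) <= 1 and len(fy) <= 1:
        # At most one failure per group: local parities suffice, so a repair
        # reads only the 6 surviving fragments of the affected group.
        return (["px"] if fx else []) + (["py"] if fy else [])
    # Two failures within one group: fall back to the global parities.
    return GLOBAL_PARITIES

print(reconstruction_plan(["x3"]))          # ['px']        -> local repair, 6 reads
print(reconstruction_plan(["x3", "x4"]))    # ['q0', 'q1']  -> global repair
print("overhead:", (12 + 2 + 2) / 12)       # 1.33x storage overhead
```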

  42. Introduction • Implements a Byzantine fault-tolerant erasure-coded atomic register. • An extension to the PASIS protocol and lazy verification. • Separates the extensions necessary to constrain Byzantine servers from those necessary to constrain Byzantine clients. • These extensions provide bounded wait-free reads and writes.

  43. PASIS Protocol Review • Deals with • Any number of clients may be Byzantine • No more than b servers are Byzantine • Assumptions • Authenticated channels exist between all servers and between each client and every server • Servers have unbounded storage space • Writes and reads are linearizable and wait-free

  44. PASIS Protocol Review – Write • Logical timestamps are used to totally order all writes and to identify requests that pertain to the same write. • Logical timestamp • Constructed by the client • Unique • Greater than the timestamp of the most recent completely written object.

  45. PASIS Protocol Review - Read • The client performs a READ-LATEST operation. • Each server replies with the latest timestamp-fragment pair that it hosts. • The responses that share the highest timestamp are referred to as the candidate. • The candidate is classified as • Complete • Repairable • Incomplete

  46. Byzantine Servers - Read • Responds to • READ-LATEST • READ-PREVIOUS • READ-TIMESTAMP • May fabricate timestamp-fragment pairs with arbitrarily high timestamps. • May lead to a correct client performing an unbounded number of READ-PREVIOUS operations to complete a read.

  47. Byzantine Servers - Incomplete candidates • Previous: • The client uses the highest timestamp returned to a READ-LATEST or READ-PREVIOUS operation. • Current: • The client ought to use the (b+1)-st timestamp in the ordered list of timestamps returned from a quorum of servers. • The (b+1)-st highest timestamp is guaranteed to be no greater than a timestamp from a correct server. • It is also guaranteed to be no less than the timestamp of the latest complete candidate.

  48. Byzantine Servers - Write • Previous: • The client constructs the timestamp for a write based on the highest timestamp returned to a READ-TIMESTAMP operation. • Current: • The client ought to use the (b+1)-st timestamp returned to a READ-TIMESTAMP operation (see the sketch below). • The (b+1)-st highest timestamp is guaranteed to be no greater than a timestamp from a correct server. • It is also guaranteed to be no less than the timestamp of the latest complete candidate.
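
A sketch of the rule used in both the read and write cases above: take the (b+1)-st highest timestamp from a quorum of responses, so the chosen timestamp cannot exceed one reported by a correct server, yet is no less than that of the latest complete candidate.

```python
def safe_timestamp(timestamps, b):
    """Return the (b+1)-st highest timestamp from a quorum of responses."""
    ordered = sorted(timestamps, reverse=True)
    return ordered[b]   # index b == the (b+1)-st highest value
```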

  49. Byzantine Clients - Read • Can perform incomplete or repairable writes to any server. • Can construct an arbitrarily sized timestamp for their writes. • Lazy verification bounds the number of poisonous writes a Byzantine client can perform via server verification. • Server verification must be extended to bound the amount of work Byzantine clients can induce correct clients to perform via incomplete or repairable writes and arbitrarily sized timestamps.
