
Massively Parallel/Distributed Data Storage Systems


Presentation Transcript


  1. Massively Parallel/Distributed Data Storage Systems S. Sudarshan IIT Bombay Derived from an earlier talk by S. Sudarshan, presented at the MSRI Summer School on Distributed Systems, May/June 2012

  2. Why Distributed Data Storage a.k.a. Cloud Storage • Explosion of social media sites (Facebook, Twitter) with large data needs • Explosion of storage needs in large web sites such as Google, Yahoo • 100’s of millions of users • Many applications with petabytes of data • Much of the data is not files • Very high demands on • Scalability • Availability IBM ICARE Winter School on Big Data, Oct. 2012

  3. Why Distributed Data Storage a.k.a. Cloud Storage • Step 0 (prehistory): Distributed database systems with tens of nodes • Step 1: Distributed file systems with 1000s of nodes • Millions of Large objects (100’s of megabytes) • Step 2: Distributed data storage systems with 1000s of nodes • 100s of billions of smaller (kilobyte to megabyte) objects • Step 3 (recent and future work): Distributed database systems with 1000s of nodes IBM ICARE Winter School on Big Data, Oct. 2012

  4. Examples of Types of Big Data • Large objects • video, large images, web logs • Typically write once, read many times, no updates • Distributed file systems • Transactional data from Web-based applications • E.g. social network (Facebook/Twitter) updates, friend lists, likes, … • email (at least metadata) • Billions to trillions of objects, distributed data storage systems • Indices • E.g. Web search indices with inverted list per word • In earlier days, no updates, rebuild periodically • Today: frequent updates (e.g. Google Percolator) IBM ICARE Winter School on Big Data, Oct. 2012

  5. Why not use Parallel Databases? • Parallel databases have been around since the 1980s • Most parallel databases were designed for decision support, not OLTP • Were designed for scales of 10s to 100’s of processors • Single machine failures are common and are handled • But usually require jobs to be rerun in case of failure during execution • Do not consider network partitions and data distribution • Demands on distributed data storage systems • Scalability to thousands to tens of thousands of nodes • Geographic distribution for • lower latency, and • higher availability IBM ICARE Winter School on Big Data, Oct. 2012

  6. Basics: Parallel/Distributed Data Storage • Replication • System maintains multiple copies of data, stored in different nodes/sites, for faster retrieval and fault tolerance. • Data partitioning • Relation is divided into several partitions stored in distinct nodes/sites • Replication and partitioning are combined • Relation is divided into multiple partitions; system maintains several identical replicas of each such partition. IBM ICARE Winter School on Big Data, Oct. 2012

  7. Basics: Data Replication • Advantages of Replication • Availability: failure of site containing relation r does not result in unavailability of r if replicas exist. • Parallelism: queries on r may be processed by several nodes in parallel. • Reduced data transfer: relation r is available locally at each site containing a replica of r. • Cost of Replication • Increased cost of updates: each replica of relation r must be updated. • Special concurrency control and atomic commit mechanisms to ensure replicas stay in sync IBM ICARE Winter School on Big Data, Oct. 2012

  8. Basics: Data Transparency • Data transparency: Degree to which system user may remain unaware of the details of how and where the data items are stored in a distributed system • Consider transparency issues in relation to: • Fragmentation transparency • Replication transparency • Location transparency IBM ICARE Winter School on Big Data, Oct. 2012

  9. Basics: Naming of Data Items • Naming of items: desiderata • Every data item must have a system-wide unique name. • It should be possible to find the location of data items efficiently. • It should be possible to change the location of data items transparently. • Data item creation/naming should not be centralized • Implementations: • Global directory • Used in file systems • Partition name space • Each partition under control of one node • Used for data storage systems IBM ICARE Winter School on Big Data, Oct. 2012

  10. Build-it-Yourself Parallel Data Storage: a.k.a. Sharding • “Sharding” • Divide data amongst many cheap databases (MySQL/PostgreSQL) • Manage parallel access in the application • Partition tables map keys to nodes • Application decides where to route storage or lookup requests • Scales well for both reads and writes • Limitations • Not transparent • application needs to be partition-aware • AND application needs to deal with replication • (Not a true parallel database, since parallel queries and transactions spanning nodes are not supported) IBM ICARE Winter School on Big Data, Oct. 2012
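
To make the sharding idea concrete, here is a minimal sketch (Python, purely for illustration) of an application-side partition table that routes each key to one of several cheap database nodes by hashing. The node URLs, the shard_for function, and the connect() call in the comment are hypothetical placeholders, not any particular system's API.

```python
import hashlib

# Hypothetical partition table: shard index -> database node.
# In a real deployment these would be separate MySQL/PostgreSQL instances.
DB_NODES = [
    "postgresql://node0.example.com/app",
    "postgresql://node1.example.com/app",
    "postgresql://node2.example.com/app",
]

def shard_for(key: str) -> str:
    """Route a key to a node using a stable hash of the key.

    The application must apply the same mapping on every read and write,
    because the storage layer itself knows nothing about the partitioning.
    """
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return DB_NODES[h % len(DB_NODES)]

# The application routes each request itself, e.g.:
#   conn = connect(shard_for("user:42"))          # hypothetical driver call
#   conn.execute("SELECT ... WHERE user_id = 42")
print(shard_for("user:42"))
```

Hash partitioning is only one possible choice; a range-based partition table kept in a shared catalog works the same way from the application's point of view. Replication, as the slide notes, would also have to be handled by the application.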

  11. Parallel/Distributed Key-Value Data Stores • Distributed key-value data storage systems allow key-value pairs to be stored (and retrieved on key) in a massively parallel system • E.g. Google BigTable, Yahoo! Sherpa/PNUTS, Amazon Dynamo, .. • Partitioning, replication, high availability etc mostly transparent to application • Are the responsibility of the data storage system • These are NOT full-fledged database systems • A.k.a. NO-SQL systems • Focus of this talk IBM ICARE Winter School on Big Data, Oct. 2012

  12. Typical Data Storage Access API • Basic API access: • get(key) -- Extract the value given a key • put(key, value) -- Create or update the value given its key • delete(key) -- Remove the key and its associated value • execute(key, operation, parameters) -- Invoke an operation on the value (given its key) which is a special data structure (e.g. List, Set, Map, etc.) • Extensions to add version numbering, etc IBM ICARE Winter School on Big Data, Oct. 2012
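
As an illustration of this API (not the interface of any specific product), the toy in-memory store below supports the four operations; a real distributed store would implement the same calls over partitioned, replicated storage.

```python
class KeyValueStore:
    """Toy, single-node illustration of the basic data storage API."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Extract the value given a key (None if absent).
        return self._data.get(key)

    def put(self, key, value):
        # Create or update the value for the key.
        self._data[key] = value

    def delete(self, key):
        # Remove the key and its associated value.
        self._data.pop(key, None)

    def execute(self, key, operation, *parameters):
        # Invoke an operation on a structured value (here, a list by default).
        value = self._data.setdefault(key, [])
        return getattr(value, operation)(*parameters)

store = KeyValueStore()
store.put("user:1", {"name": "Alice"})
store.execute("user:1:followers", "append", "user:2")  # treats the value as a list
print(store.get("user:1"), store.get("user:1:followers"))
```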

  13. What is NoSQL? • Stands for No-SQL or Not Only SQL?? • Class of non-relational data storage systems • E.g. BigTable, Dynamo, PNUTS/Sherpa, .. • Synonymous with distributed data storage systems • We don’t like the term NoSQL IBM ICARE Winter School on Big Data, Oct. 2012

  14. Data Storage Systems vs. Databases • Distributed data storage systems do not support many relational features • No join operations (except within partition) • No referential integrity constraints across partitions • No ACID transactions (across nodes) • No support for SQL or query optimization • But usually do provide flexible schema and other features • Structured objects e.g. using JSON • Multiple versions of data items IBM ICARE Winter School on Big Data, Oct. 2012

  15. Querying Static Big Data • Large data sets broken into multiple files • Static append-only data • E.g. new files added to dataset each day • No updates to existing data • Map-reduce framework for massively parallel querying • Not the focus of this talk. We focus on: • Transactional data which is subject to updates • Very large number of transactions • Each of which reads/writes small amounts of data • I.e. online transaction processing (OLTP) workloads IBM ICARE Winter School on Big Data, Oct. 2012

  16. Talk Outline • Background: Distributed Transactions • Concurrency Control/Replication Consistency Schemes • Distributed File Systems • Parallel/Distributed Data Storage Systems • Basics • Architecture • Bigtable (Google) • PNUTS/Sherpa (Yahoo) • Megastore (Google) • CAP Theorem: availability vs. consistency • Basics • Dynamo (Amazon) IBM ICARE Winter School on Big Data, Oct. 2012

  17. Background: Distributed Transactions Slides in this section are from Database System Concepts, 6th Edition, by Silberschatz, Korth and Sudarshan, McGraw Hill, 2010 IBM ICARE Winter School on Big Data, Oct. 2012

  18. Transaction System Architecture IBM ICARE Winter School on Big Data, Oct. 2012

  19. System Failure Modes • Failures unique to distributed systems: • Failure of a site. • Loss of messages • Handled by network transmission control protocols such as TCP/IP • Failure of a communication link • Handled by network protocols, by routing messages via alternative links • Network partition • A network is said to be partitioned when it has been split into two or more subsystems that lack any connection between them • Note: a subsystem may consist of a single node • Network partitioning and site failures are generally indistinguishable. IBM ICARE Winter School on Big Data, Oct. 2012

  20. Commit Protocols • Commit protocols are used to ensure atomicity across sites • a transaction which executes at multiple sites must either be committed at all the sites, or aborted at all the sites. • not acceptable to have a transaction committed at one site and aborted at another • The two-phase commit (2PC) protocol is widely used • The three-phase commit (3PC) protocol is more complicated and more expensive, but avoids some drawbacks of two-phase commit protocol. IBM ICARE Winter School on Big Data, Oct. 2012

  21. Two Phase Commit Protocol (2PC) • Execution of the protocol is initiated by the coordinator after the last step of the transaction has been reached. • The protocol involves all the local sites at which the transaction executed • Let T be a transaction initiated at site Si, and let the transaction coordinator at Si be Ci IBM ICARE Winter School on Big Data, Oct. 2012

  22. Phase 1: Obtaining a Decision • Coordinator asks all participants to prepare to commit transaction Ti. • Ci adds the records <prepare T> to the log and forces log to stable storage • sends prepare T messages to all sites at which T executed • Upon receiving message, transaction manager at site determines if it can commit the transaction • if not, add a record <no T> to the log and send abort T message to Ci • if the transaction can be committed, then: • add the record <ready T> to the log • force all records for T to stable storage • send ready T message to Ci IBM ICARE Winter School on Big Data, Oct. 2012

  23. Phase 2: Recording the Decision • T can be committed if Ci received a ready T message from all the participating sites; otherwise T must be aborted. • Coordinator adds a decision record, <commit T> or <abort T>, to the log and forces record onto stable storage. Once the record reaches stable storage it is irrevocable (even if failures occur) • Coordinator sends a message to each participant informing it of the decision (commit or abort) • Participants take appropriate action locally. IBM ICARE Winter School on Big Data, Oct. 2012
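
The two phases can be condensed into the coordinator-side logic sketched below. The Participant class and the in-memory log list are illustrative stand-ins for real message passing and forced writes to stable storage, assumed here only to show the control flow.

```python
class Participant:
    """Illustrative participant; a real one would force its log to disk
    before replying 'ready' and would survive crashes."""
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
    def prepare(self, txn):
        return "ready" if self.can_commit else "no"
    def commit(self, txn):
        print(f"{self.name}: commit {txn}")
    def abort(self, txn):
        print(f"{self.name}: abort {txn}")

def two_phase_commit(txn, participants, log):
    # Phase 1: ask every participant to prepare.
    log.append(("prepare", txn))                  # force <prepare T> to the log
    votes = [p.prepare(txn) for p in participants]

    # Phase 2: decide, force the decision record, then inform everyone.
    decision = "commit" if all(v == "ready" for v in votes) else "abort"
    log.append((decision, txn))                   # irrevocable once in stable storage
    for p in participants:
        (p.commit if decision == "commit" else p.abort)(txn)
    return decision

log = []
print(two_phase_commit("T1", [Participant("site A"), Participant("site B")], log))
```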

  24. Three Phase Commit (3PC) • Blocking problem in 2PC: if coordinator is disconnected from participant, participant which had sent a ready message may be in a blocked state • Cannot figure out whether to commit or abort • Partial solution: Three phase commit • Phase 1: Obtaining Preliminary Decision: Identical to 2PC Phase 1. • Every site is ready to commit if instructed to do so • Phase 2 of 2PC is split into 2 phases, Phase 2 and Phase 3 of 3PC • In phase 2 coordinator makes a decision as in 2PC (called the pre-commit decision) and records it in multiple (at least K) sites • In phase 3, coordinator sends commit/abort message to all participating sites • Under 3PC, knowledge of pre-commit decision can be used to commit despite coordinator failure • Avoids blocking problem as long as < K sites fail • Drawbacks: higher overhead, and assumptions may not be satisfied in practice IBM ICARE Winter School on Big Data, Oct. 2012

  25. Distributed Transactions via Persistent Messaging • Notion of a single transaction spanning multiple sites is inappropriate for many applications • E.g. transaction crossing an organizational boundary • Latency of waiting for commit from remote site • Alternative models carry out transactions by sending messages • Code to handle messages must be carefully designed to ensure atomicity and durability properties for updates • Isolation cannot be guaranteed, in that intermediate stages are visible, but code must ensure no inconsistent states result due to concurrency • Persistent messaging systems are systems that provide transactional properties to messages • Messages are guaranteed to be delivered exactly once • Will discuss implementation techniques later IBM ICARE Winter School on Big Data, Oct. 2012

  26. Persistent Messaging • Example: funds transfer between two banks • Two phase commit would have the potential to block updates on the accounts involved in funds transfer • Alternative solution: • Debit money from source account and send a message to other site • Site receives message and credits destination account • Messaging has long been used for distributed transactions (even before computers were invented!) • Atomicity issue • once transaction sending a message is committed, message must be guaranteed to be delivered • Guarantee as long as destination site is up and reachable, code to handle undeliverable messages must also be available • e.g. credit money back to source account. • If sending transaction aborts, message must not be sent IBM ICARE Winter School on Big Data, Oct. 2012
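
One common way to implement this pattern is to record the debit and the outgoing message in the same local transaction, with a separate process delivering pending messages until they are acknowledged. The sketch below uses SQLite only to illustrate the idea; the table layout and the send() callback are hypothetical, and the receiving site would still have to discard duplicate messages to get exactly-once behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account(id TEXT PRIMARY KEY, balance INTEGER);
    CREATE TABLE outbox(msg_id INTEGER PRIMARY KEY AUTOINCREMENT,
                        payload TEXT, delivered INTEGER DEFAULT 0);
    INSERT INTO account VALUES ('src', 100);
""")

def transfer(amount, dest_site, dest_account):
    # Debit and message enqueue commit atomically in ONE local transaction,
    # so the message exists if and only if the debit committed.
    with conn:
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = 'src'", (amount,))
        conn.execute("INSERT INTO outbox(payload) VALUES (?)",
                     (f"credit {amount} to {dest_account} at {dest_site}",))

def deliver(send):
    # Separate delivery process: retry until the remote site acknowledges,
    # then mark the message as delivered.
    pending = conn.execute(
        "SELECT msg_id, payload FROM outbox WHERE delivered = 0").fetchall()
    for msg_id, payload in pending:
        if send(payload):                      # hypothetical network send + ack
            with conn:
                conn.execute("UPDATE outbox SET delivered = 1 WHERE msg_id = ?", (msg_id,))

transfer(30, "bank B", "acct 17")
deliver(lambda payload: print("sent:", payload) or True)
```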

  27. Error Conditions with Persistent Messaging • Code to handle messages has to take care of a variety of failure situations (even assuming guaranteed message delivery) • E.g. if destination account does not exist, failure message must be sent back to source site • When failure message is received from destination site, or destination site itself does not exist, money must be deposited back in source account • Problem if source account has been closed • get humans to take care of problem • User code executing transaction processing using 2PC does not have to deal with such failures • There are many situations where extra effort of error handling is worth the benefit of absence of blocking • E.g. pretty much all transactions across organizations IBM ICARE Winter School on Big Data, Oct. 2012

  28. Managing Replicated Data • Issues: • All replicas should have the same value ⇒ updates performed at all replicas • But what if a replica is not available (disconnected, or failed)? • What if different transactions update different replicas concurrently? • Need some form of distributed concurrency control IBM ICARE Winter School on Big Data, Oct. 2012

  29. Primary Copy • Choose one replica of data item to be the primary copy. • Site containing the replica is called the primary site for that data item • Different data items can have different primary sites • For concurrency control: when a transaction needs to lock a data item Q, it requests a lock at the primary site of Q. • Implicitly gets lock on all replicas of the data item • Benefit • Concurrency control for replicated data handled similarly to unreplicated data - simple implementation. • Drawback • If the primary site of Q fails, Q is inaccessible even though other sites containing a replica may be accessible. IBM ICARE Winter School on Big Data, Oct. 2012

  30. Primary Copy • Primary copy scheme for performing updates: • Update at primary, updates subsequently replicated to other copies • Updates to a single item are serialized at the primary copy IBM ICARE Winter School on Big Data, Oct. 2012
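
A minimal sketch of the primary-copy scheme: every update to an item is sent to that item's primary replica, which applies it and then propagates the new value to the other copies (synchronously here for simplicity; many systems propagate lazily). All class and variable names are illustrative.

```python
class Replica:
    def __init__(self, site):
        self.site, self.store = site, {}
    def apply(self, key, value):
        self.store[key] = value

class PrimaryCopy:
    """Each data item has one primary replica; updates are serialized there."""
    def __init__(self, replicas, primary_of):
        self.replicas = replicas          # site -> Replica
        self.primary_of = primary_of      # key -> primary site

    def update(self, key, value):
        primary = self.replicas[self.primary_of[key]]
        primary.apply(key, value)         # serialization point for this item
        for site, replica in self.replicas.items():
            if replica is not primary:
                replica.apply(key, value) # propagate to the other copies

replicas = {s: Replica(s) for s in ("A", "B", "C")}
system = PrimaryCopy(replicas, primary_of={"x": "A"})
system.update("x", 42)
print({s: r.store for s, r in replicas.items()})
```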

  31. Majority Protocol for Locking • If Q is replicated at n sites, then a lock request message is sent to more than half of the n sites in which Q is stored. • The transaction does not operate on Q until it has obtained a lock on a majority of the replicas of Q. • When writing the data item, transaction performs writes on all replicas. • Benefit • Can be used even when some sites are unavailable • details on how to handle writes in the presence of site failure later • Drawback • Requires 2(n/2 + 1) messages for handling lock requests, and (n/2 + 1) messages for handling unlock requests. • Potential for deadlock even with single item - e.g., each of 3 transactions may have locks on 1/3rd of the replicas of a data item. IBM ICARE Winter School on Big Data, Oct. 2012

  32. Majority Protocol for Accessing Replicated Items • The majority protocol for updating replicas of an item • Each replica of each item has a version number which is updated when the replica is updated, as outlined below • A lock request is sent to more than half of the sites at which item replicas are stored, and the operation continues only when a lock is obtained on a majority of the sites • Read operations look at all replicas locked, and read the value from the replica with largest version number • May write this value and version number back to replicas with lower version numbers (no need to obtain locks on all replicas for this task) IBM ICARE Winter School on Big Data, Oct. 2012

  33. Majority Protocol for Accessing Replicated Items • Majority protocol (Cont.) • Write operations • find highest version number like reads, and set new version number to old highest version + 1 • Writes are then performed on all locked replicas and version number on these replicas is set to new version number • With 2 phase commit OR distributed consensus protocol such as Paxos • Failures (network and site) cause no problems as long as • Sites at commit contain a majority of replicas of any updated data items • During reads a majority of replicas are available to find version numbers • Note: reads are guaranteed to see latest version of data item • Reintegration is trivial: nothing needs to be done IBM ICARE Winter School on Big Data, Oct. 2012
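
The read/write rules of the majority protocol can be sketched as follows. Locking, 2PC/Paxos at commit, and failure handling are omitted; the majority() helper simply picks some majority of replicas, standing in for whichever replicas granted locks.

```python
import random

class Replica:
    def __init__(self):
        self.value, self.version = None, 0

def majority(replicas):
    # Any majority of replicas; in reality, those that granted locks.
    k = len(replicas) // 2 + 1
    return random.sample(replicas, k)

def quorum_read(replicas):
    locked = majority(replicas)
    newest = max(locked, key=lambda r: r.version)     # replica with highest version
    return newest.value, newest.version

def quorum_write(replicas, value):
    locked = majority(replicas)
    new_version = max(r.version for r in locked) + 1  # old highest version + 1
    for r in locked:                                  # write all locked replicas
        r.value, r.version = value, new_version

replicas = [Replica() for _ in range(5)]
quorum_write(replicas, "v1")
quorum_write(replicas, "v2")
print(quorum_read(replicas))   # always returns the latest value, "v2"
```

Because any two majorities intersect, a read always sees at least one replica carrying the most recent version number, which is why reads are guaranteed to return the latest value.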

  34. Read One Write All (Available) • Read one write all available (ignoring failed sites) is attractive, but incorrect • A failed link may come back up without a disconnected site ever being aware that it was disconnected • The site then has old values, and a read from that site would return an incorrect value • If the site was aware of the failure, reintegration could have been performed, but there is no way to guarantee this • With network partitioning, sites in each partition may update the same item concurrently • believing sites in other partitions have all failed IBM ICARE Winter School on Big Data, Oct. 2012

  35. Replication with Weak Consistency • Many systems support replication of data with weak degrees of consistency (i.e., without a guarantee of serializability) • i.e. read/write quorums with QR + QW <= S or 2*QW <= S, where S is the number of replicas • Usually only when not enough sites are available to ensure quorum • Tradeoff of consistency versus availability • Many systems support lazy propagation where updates are transmitted after transaction commits • Allows updates to occur even if some sites are disconnected from the network, but at the cost of consistency • What to do if there are inconsistent concurrent updates? • How to detect? • How to reconcile? IBM ICARE Winter School on Big Data, Oct. 2012
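
As a small worked example of the quorum condition above (using the common formulation where S is the number of replicas, QR the read quorum and QW the write quorum):

```python
def quorum_is_strong(S, QR, QW):
    """Read/write quorums guarantee that readers see the latest write only if
    every read quorum overlaps every write quorum (QR + QW > S) and any two
    write quorums overlap (2*QW > S)."""
    return QR + QW > S and 2 * QW > S

# S = 5 replicas:
print(quorum_is_strong(5, QR=3, QW=3))  # True  -> every read overlaps every write
print(quorum_is_strong(5, QR=2, QW=3))  # False -> QR + QW <= S: a read may miss the latest write
# S = 4 replicas:
print(quorum_is_strong(4, QR=3, QW=2))  # False -> 2*QW <= S: two concurrent writes may not overlap
```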

  36. Availability • High availability: time for which system is not fully usable should be extremely low (e.g. 99.99% availability) • Robustness: ability of system to function in spite of failures of components • Failures are more likely in large distributed systems • To be robust, a distributed system must either • Detect failures • Reconfigure the system so computation may continue • Recovery/reintegration when a site or link is repaired • OR follow protocols that guarantee consistency in spite of failures. • Failure detection: distinguishing link failure from site failure is hard • (partial) solution: have multiple links, multiple link failure is likely a site failure IBM ICARE Winter School on Big Data, Oct. 2012

  37. Reconfiguration • Reconfiguration: • Abort all transactions that were active at a failed site • Making them wait could interfere with other transactions since they may hold locks on other sites • However, in case only some replicas of a data item failed, it may be possible to continue transactions that had accessed data at a failed site (more on this later) • If replicated data items were at failed site, update system catalog to remove them from the list of replicas. • This should be reversed when failed site recovers, but additional care needs to be taken to bring values up to date • If a failed site was a central server for some subsystem, an election must be held to determine the new server • E.g. name server, concurrency coordinator, global deadlock detector IBM ICARE Winter School on Big Data, Oct. 2012

  38. Distributed Consensus From Lamport’s Paxos Made Simple: • Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. • If no value is proposed, then no value should be chosen. • If a value has been chosen, then processes should be able to learn the chosen value. • The safety requirements for consensus are: • Only a value that has been proposed may be chosen, • Only a single value is chosen, and • A process never learns that a value has been chosen unless it actually has been • Paxos: a family of protocols for distributed consensus IBM ICARE Winter School on Big Data, Oct. 2012

  39. Paxos: Overview • Three kinds of participants (site can play more than one role) • Proposer: proposes a value • Acceptor: accepts (or rejects) a proposal, following a protocol • Consensus is reached when a majority of acceptors have accepted a proposal • Learner: finds what value (if any) was accepted by a majority of acceptors • Acceptor generally informs each member of a set of learners IBM ICARE Winter School on Big Data, Oct. 2012
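
To make the acceptor role concrete, here is a minimal single-decree acceptor sketched in Python; proposal numbering, message transport, proposer retry logic, and learners are all omitted, so this is only an illustration of the promise/accept rules, not a complete Paxos implementation.

```python
class Acceptor:
    """Acceptor state for one consensus instance (single-decree Paxos)."""
    def __init__(self):
        self.promised = -1          # highest proposal number promised
        self.accepted_n = -1        # proposal number of the accepted value, if any
        self.accepted_v = None

    def on_prepare(self, n):
        # Phase 1b: promise not to accept proposals numbered below n,
        # and report any value already accepted so the proposer adopts it.
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject",)

    def on_accept(self, n, v):
        # Phase 2b: accept unless we promised a higher-numbered proposal.
        if n >= self.promised:
            self.promised = n
            self.accepted_n, self.accepted_v = n, v
            return ("accepted", n)
        return ("reject",)

# Consensus is reached once a majority of acceptors have accepted the proposal;
# learners find this out from the acceptors.
acceptors = [Acceptor() for _ in range(3)]
promises = [a.on_prepare(1) for a in acceptors]
if sum(p[0] == "promise" for p in promises) > len(acceptors) // 2:
    print([a.on_accept(1, "value-X") for a in acceptors])
```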

  40. Distributed File Systems IBM ICARE Winter School on Big Data, Oct. 2012

  41. Distributed File Systems • Google File System (GFS) • Hadoop File System (HDFS) • And older ones like CODA, • Basic architecture: • Master: responsible for metadata • Chunk servers: responsible for reading and writing large chunks of data • Chunks replicated on 3 machines, master responsible for managing replicas • Replication is within a single data center IBM ICARE Winter School on Big Data, Oct. 2012

  42. HDFS Architecture • [Figure] Client sends the filename to the NameNode (1), gets back block IDs and the DataNodes holding each block (2), then reads the data directly from those DataNodes (3); a Secondary NameNode assists the NameNode • NameNode: maps a file to a file-id and a list of DataNodes • DataNode: maps a block-id to a physical location on disk IBM ICARE Winter School on Big Data, Oct. 2012
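
The read path in the figure can be written out as a short sketch: the client asks the NameNode only for metadata (block IDs and locations) and then streams block data directly from DataNodes. The classes and method names below are invented stand-ins, not the real HDFS client or RPC interface.

```python
class FakeDataNode:
    """Stand-in for a DataNode: maps block-id -> bytes on 'disk'."""
    def __init__(self, blocks): self.blocks = blocks
    def read_block(self, block_id): return self.blocks[block_id]

class FakeNameNode:
    """Stand-in for the NameNode: maps a file to its blocks, each with
    the DataNodes holding a replica."""
    def __init__(self, files): self.files = files
    def get_block_locations(self, path): return self.files[path]

def read_file(namenode, path):
    # 1./2. Send the filename to the NameNode; get back block ids + DataNodes.
    data = bytearray()
    for block_id, datanodes in namenode.get_block_locations(path):
        # 3. Read each block directly from one of its DataNodes,
        #    trying another replica if one is unreachable.
        for dn in datanodes:
            try:
                data += dn.read_block(block_id)
                break
            except ConnectionError:
                continue
        else:
            raise IOError(f"no reachable replica for block {block_id}")
    return bytes(data)

dn = FakeDataNode({"blk_1": b"hello ", "blk_2": b"world"})
nn = FakeNameNode({"/logs/a.txt": [("blk_1", [dn]), ("blk_2", [dn])]})
print(read_file(nn, "/logs/a.txt"))   # b'hello world'
```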

  43. Google File System (OSDI 04) • Inspiration for HDFS • Master: responsible for all metadata • Chunk servers: responsible for reading and writing large chunks of data • Chunks replicated on 3 machines • Master responsible for ensuring replicas exist IBM ICARE Winter School on Big Data, Oct. 2012

  44. Limitations of GFS/HDFS • Central master becomes bottleneck • Keep directory/inode information in memory to avoid IO • Memory size limits number of files • File system directory overheads per file • Not appropriate for storing very large number of objects • File systems do not provide consistency guarantees • File systems cache blocks locally • Ideal for write-once and append-only data • Can be used as underlying storage for a data storage system • E.g. BigTable uses GFS underneath IBM ICARE Winter School on Big Data, Oct. 2012

  45. Parallel/Distributed Data Storage Systems IBM ICARE Winter School on Big Data, Oct. 2012

  46. Typical Data Storage Access API • Basic API access: • get(key) -- Extract the value given a key • put(key, value) -- Create or update the value given its key • delete(key) -- Remove the key and its associated value • execute(key, operation, parameters) -- Invoke an operation on the value (given its key) which is a special data structure (e.g. List, Set, Map, etc.) • Extensions to add version numbering, etc IBM ICARE Winter School on Big Data, Oct. 2012

  47. Data Types in Data Storage Systems • Uninterpreted key/value or ‘the big hash table’. • Amazon S3 (Dynamo) • Flexible schema • Ordered keys, semi-structured data • BigTable [Google], Cassandra [Facebook/Apache], Hbase [Apache, similar to BigTable] • Unordered keys, JSON • Sherpa/Pnuts [Yahoo], CouchDB [Apache] • Document/Textual data (with JSON variant) • MongoDB [10gen, open source] IBM ICARE Winter School on Big Data, Oct. 2012

  48. E.g. of Flexible Data Model • ColumnFamily: Inventory Details • Key A112341: name = Ipad (new) 16GB, Memory = 512MB, inventoryQty = 100, 3G = No • Key A235122: name = ASUS Laptop, Memory = 3GB, inventoryQty = 9, Screen = 15 inch • Key B234567: name = All Terrain Vehicle, Wheels = 3, inventoryQty = 5, Power = 10HP IBM ICARE Winter School on Big Data, Oct. 2012

  49. Bigtable: Data model • <Row, Column, Timestamp> triple for key • lookup (point and range) • E.g. prefix lookup by com.cnn.* • insert and delete API • Allows inserting and deleting versions for any specific cell • Arbitrary “columns” on a row-by-row basis • Partly column-oriented physical store: all columns in a “column family” are stored together IBM ICARE Winter School on Big Data, Oct. 2012
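
The <row, column, timestamp> model and the benefit of ordered row keys can be illustrated with a tiny sorted in-memory map; reversed domain names such as com.cnn.www keep pages of one site adjacent, so a prefix lookup becomes a contiguous range scan. This sketch shows the data model only, not Bigtable's storage layout.

```python
import bisect

class TinyTable:
    """Rows sorted by key; each cell is keyed by (column, timestamp)."""
    def __init__(self):
        self.row_keys = []     # sorted list of row keys
        self.rows = {}         # row key -> {(column_family:qualifier, ts): value}

    def put(self, row, column, timestamp, value):
        if row not in self.rows:
            bisect.insort(self.row_keys, row)
            self.rows[row] = {}
        self.rows[row][(column, timestamp)] = value   # versions kept per cell

    def get(self, row, column):
        # Latest version of one cell.
        versions = [(ts, v) for (c, ts), v in self.rows.get(row, {}).items() if c == column]
        return max(versions)[1] if versions else None

    def prefix_scan(self, prefix):
        # Because row keys are ordered, a prefix scan is a contiguous range.
        i = bisect.bisect_left(self.row_keys, prefix)
        while i < len(self.row_keys) and self.row_keys[i].startswith(prefix):
            yield self.row_keys[i], self.rows[self.row_keys[i]]
            i += 1

t = TinyTable()
t.put("com.cnn.www/index", "anchor:home", 1, "CNN")
t.put("com.cnn.www/sports", "contents:", 2, "<html>...")
t.put("org.apache.hbase", "contents:", 1, "<html>...")
print(t.get("com.cnn.www/index", "anchor:home"))
print([row for row, _ in t.prefix_scan("com.cnn.")])   # only the cnn rows
```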

  There are many more distributed data storage systems; we will not do a full survey, but just cover some representatives • Some of those we don’t cover: Cassandra, Hbase, CouchDB, MongoDB • Some we cover later: Dynamo • Architecture of distributed storage systems • Bigtable • PNUTS/Sherpa • Megastore IBM ICARE Winter School on Big Data, Oct. 2012
