Cassandra - A Decentralized Structured Storage System

Cassandra - A Decentralized Structured Storage System 報告者: 呂俐禎

Abstract • Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure.

Motivations • High Availability • High Write Throughput • High scalability.

Data Model • Table is a multi dimensional map indexed by a key. • Each Column has • Name • Value • Timestamp • Columns are grouped into Column Families. • 2 Types of Column Families • Simple • Super (nested Column Families)

C1 V1 T1 C2 V2 T2 C3 V3 T3 C4 V4 T4 Data Model Columns are added and modified dynamically ColumnFamily1 Name : MailListType : SimpleSort : Name KEY Name : tid1 Value : <Binary> TimeStamp : t1 Name : tid2 Value : <Binary> TimeStamp : t2 Name : tid3 Value : <Binary> TimeStamp : t3 Name : tid4 Value : <Binary> TimeStamp : t4 ColumnFamily2 Name : WordListType : SuperSort : Time Column Families are declared upfront Name : aloha Name : dude C2 V2 T2 C6 V6 T6 SuperColumns are added and modified dynamically Columns are added and modified dynamically

System Architecture Consistent Hash • Data partitioned to subset of nodes: Consistent Hashing • Data replicated to multiple nodes for redundancy, performance: Quorum using “preference list” of nodes • Node management: • Membershipalgorithm to know which nodes are up/down. “Accrual failure detection + Gossip” • Bootstrappingto add node. Manual operation + “Seed” nodes NodeA NodeC NodeD Gossip NodeB

Partitioning • Consistent Hashing • Nodes are logically structured in Ring Topology. • Hashed value of key associated with data partition is used to assign it to a node in the ring. • Lightly loaded nodes moves position to alleviate highly loaded nodes.

Replication • Uses replication to achieve high availability and Each data item is replicated at N (replication factor) nodes. • Coordinator node replicates the key to an additional N-1 nodes. • Different Replication Policies • Rack Unaware • Rack Aware • Datacenter Aware

Partitioning and Replication h(key1) 1 0 N=3 B h(key2) A C F E D 1/2 • * Figure taken from Avinash Lakshman and Prashant Malik (authors of the paper) slides.

Membership • Cluster membership in Cassandra is based on Scuttlebutt, a very efficient anti-entropy Gossip based mechanism. • Each node locally determines if any other node in the system is up/down. • Uses gossip for node membership and to transmit system control state. • Node Fail state is given by variable ‘phi‘Φwhich tells how likely a node might fail (suspicion level) instead of simple binary value (up/down). • This type of system is known as Accrual Failure Detector.

Gossip

Accrual Failure Detector • If a node is faulty, the suspicion level monotonically increases with time. Φ(t)  k as t  k Where k is a threshold variable (depends on system load) which tells a node is dead. • Φis computed using inter-arrival times of gossip messages from other nodes in the cluster.

BootStrapping & Scaling the Cluster • When a node starts for the first time, it chooses a random token for its position in the ring. • The mapping is persisted to disk locally and also in Zookeeper. • The token information is then gossiped around the cluster. • An administrator uses command line or browser to initiate the addition and removal of nodes from Cassandra instance

Local Persistence • Relies on local file system for data persistency. • Write operations happens in 2 steps • Write to commit log in local disk of the node • Update in-memory data structure. • Read operation • Looks up in-memory as first before looking up files on disk. • Uses Bloom Filter (summarization of keys in file store in memory) to avoid looking up files that do not contain the key.

READ • Use cache data • Bloom filter to reduce SSTable access • Check SSTables in time-order In-Memory Table • Dumped to SSTable when full Commit Log SS Table SS Table SS Table WRITE SSTable • No reads, seeks • Atomic within CF • Append-only • Dedicated disk • Synced to disk OR • buffered • Immutable • Compact job in background to merge files

Facebook Inbox Search • Cassandra developed to address this problem. • 50+TB of user messages data in 150 node cluster on which Cassandra is tested. • Search user index of all messages in 2 ways. • Term search : search by a key word • Interactions search : search by a user id

Comparison with MySQL • MySQL > 50 GB Data Writes Average : ~300 msReads Average : ~350 ms • Cassandra > 50 GB DataWrites Average : 0.12 msReads Average : 15 ms • Stats provided by Authors using facebook data.

Reference • Cassandra - A Decentralized Structured Storage System • AvinashLakshmanFacebook • PrashantMalikFacebook • http://en.wikipedia.org/wiki/Apache_Cassandra • http://wiki.apache.org/cassandra/FrontPage

Cassandra - A Decentralized Structured Storage System