
Cache Craftiness for Fast Multicore Key-Value Storage



Presentation Transcript


  1. Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)

  2. Let’s build a fast key-value store • KV store systems are important • Google Bigtable, Amazon Dynamo, Yahoo! PNUTS • Single-server KV performance matters • Reduce cost • Easier management • Goal: fast KV store for a single multi-core server • Assume all data fits in memory (as in Redis, VoltDB)

  3. Feature wish list • Clients send queries over network • Persist data across crashes • Range query • Perform well on various workloads • Including hard ones!

  4. Hard workloads • Skewed key popularity • Hard! (Load imbalance) • Small key-value pairs • Hard! • Many puts • Hard! • Arbitrary keys • String (e.g. www.wikipedia.org/...) or integer • Hard!

  5. First try: fast binary tree [Chart: throughput (req/sec, millions); 140M short KV, put-only, @16 cores] • Network/disk not bottlenecks • High-BW NIC • Multiple disks • 3.7 million queries/second! • Better? • What bottleneck remains? • DRAM!

  6. Cache craftiness goes 1.5X farther [Chart: throughput (req/sec, millions); 140M short KV, put-only, @16 cores] • Cache-craftiness: careful use of cache and memory

  7. Contributions • Masstree achieves millions of queries per second across various hard workloads • Skewed key popularity • Various read/write ratios • Variable relatively long keys • Data >> on-chip cache • New ideas • Trie of B+ trees, permuter, etc. • Full system • New ideas + best practices (network, disk, etc.)

  8. Experiment environment • A 16-core server • three active DRAM nodes • Single 10Gb Network Interface Card (NIC) • Four SSDs • 64 GB DRAM • A cluster of load generators

  9. Potential bottlenecks in Masstree [Diagram: a single multi-core server; potential bottlenecks are the network, DRAM, and the disks holding the logs]

  10. NIC bottleneck can be avoided • Single 10Gb NIC • Multiple queues, scales to many cores • Target: 100B KV pair => 10M req/sec • Use the network stack efficiently • Pipeline requests • Avoid copying costs

  11. Disk bottleneck can be avoided • 10M puts/sec => 1GB of logs/sec! • Too much for a single disk • Multiple disks: split the log • See paper for details
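The slide leaves the details to the paper; purely as an illustration of "split the log", updates could be appended to one log file per disk so that no single disk has to absorb the full 1 GB/sec. SplitLog below is a made-up sketch, not the paper's logging code:

    #include <cstdio>
    #include <string>
    #include <vector>

    // Hypothetical: spread log records across one append-only file per disk,
    // so aggregate log bandwidth scales with the number of disks.
    struct SplitLog {
        std::vector<std::FILE*> logs;                        // logs[i] opened on disk i beforehand
        void append(unsigned core, const std::string& record) {
            std::FILE* f = logs[core % logs.size()];         // fixed core-to-disk assignment
            std::fwrite(record.data(), 1, record.size(), f); // a real logger would batch and fsync
        }
    };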

  12. DRAM bottleneck – hard to avoid [Chart: throughput (req/sec, millions); 140M short KV, put-only, @16 cores] • Cache-craftiness goes 1.5X farther, including the cost of: • Network • Disk

  13. DRAM bottleneck – w/o network/disk [Chart: throughput (req/sec, millions); 140M short KV, put-only, @16 cores] • Cache-craftiness goes 1.7X farther!

  14. DRAM latency – binary tree [Chart: throughput (req/sec, millions); 140M short KV, put-only, @16 cores] • A lookup walks one node per level, so each level costs a serial DRAM latency • ~2.7 us/lookup => ~380K lookups/core/sec
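A rough sanity check on those figures (assuming a typical ~100 ns uncached DRAM access, a number not on the slide): lg(140M) ≈ 27 levels, so a lookup pays about 27 × 100 ns ≈ 2.7 µs, and 1 / 2.7 µs ≈ 370–380K lookups/core/sec.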

  15. DRAM latency – Lock-free 4-way tree • Concurrency: same as binary tree • One cache line per node => 3 KV / 4 children • Half the levels of a binary tree => half the serial DRAM latencies
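As a concrete picture of "one cache line per node": three 8-byte keys plus four child pointers occupy 56 bytes, which fits in a 64-byte line. A hypothetical layout (Node4 is an illustration, not code from the talk):

    #include <cstdint>

    // Hypothetical 4-way tree node: 3 keys + 4 children fit in one 64-byte cache line,
    // so a lookup pays one DRAM latency per level rather than one per key visited.
    struct alignas(64) Node4 {
        uint64_t keys[3];     // sorted 8-byte keys (24 bytes)
        Node4*   child[4];    // children for the 4 key ranges (32 bytes)
        uint8_t  nkeys;       // number of keys currently stored
    };
    static_assert(sizeof(Node4) == 64, "node must occupy exactly one cache line");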

  16. 4-tree beats binary tree by 40% [Chart: throughput (req/sec, millions); 140M short KV, put-only, @16 cores]

  17. 4-tree may perform terribly! • Unbalanced: serial DRAM latencies • e.g. sequential inserts produce a chain with O(N) levels • Want a balanced tree w/ wide fanout

  18. B+tree – Wide and balanced • Balanced! • Concurrent main memory B+tree [OLFIT] • Optimistic concurrency control: version technique • Lookup/scan is lock-free • Puts hold ≤ 3 per-node locks
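The version technique can be sketched roughly as follows (hypothetical C++ in the spirit of OLFIT-style optimistic concurrency control, not Masstree's actual code; TreeNode and read_key are made-up names):

    #include <atomic>
    #include <cstdint>

    // Hypothetical node with a version word: the low "dirty" bit is set while a
    // writer is modifying the node; writers bump the version when they finish.
    struct TreeNode {
        std::atomic<uint64_t> version{0};
        std::atomic<uint64_t> keys[15];
        // ... children / values omitted ...
    };

    // Lock-free optimistic read of one key slot: snapshot the version, read, then
    // re-check the version and retry if a writer was active or the node changed.
    uint64_t read_key(const TreeNode& n, int slot) {
        for (;;) {
            uint64_t v1 = n.version.load(std::memory_order_acquire);
            if (v1 & 1) continue;                                  // writer in progress
            uint64_t k = n.keys[slot].load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);   // keep the re-check after the read
            uint64_t v2 = n.version.load(std::memory_order_relaxed);
            if (v1 == v2) return k;                                // consistent snapshot
        }
    }

A writer sets the dirty bit, modifies the node under its per-node lock, then increments the version, so an overlapping reader sees a version mismatch and retries; readers never take locks, which is the lock-free lookup/scan the slide describes.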

  19. Wide-fanout B+tree is 11% slower! [Chart: throughput (req/sec, millions); 140M short KV, put-only] • Fanout=15: fewer levels than 4-tree, but • # cache lines fetched from DRAM >= 4-tree • 4-tree: each internal node is full • B+tree: nodes are ~75% full • Serial DRAM latencies >= 4-tree

  20. B+tree – Software prefetch • Same as [pB+-trees] • Masstree: B+tree w/ fanout 15 => 4 cache lines per node • Always prefetch the whole node when accessed • Result: one DRAM latency per node vs. 2, 3, or 4 (4 lines fetched for roughly the cost of 1)
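A minimal sketch of the prefetch trick, assuming 64-byte cache lines and GCC/Clang's __builtin_prefetch (the function name and node layout are hypothetical):

    // Issue prefetches for all 4 cache lines of a fanout-15 node before touching it,
    // so the lines are fetched in parallel and the node costs roughly one DRAM latency.
    inline void prefetch_node(const void* node) {
        const char* p = static_cast<const char*>(node);
        for (int line = 0; line < 4; ++line)
            __builtin_prefetch(p + 64 * line);
    }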

  21. B+tree with prefetch [Chart: throughput (req/sec, millions); 140M short KV, put-only, @16 cores] • Beats 4-tree by 9% • Balanced beats unbalanced!

  22. Concurrent B+tree problem • Lookups retry in case of a concurrent insert • Lock-free 4-tree: not a problem • keys do not move around • but unbalanced • [Diagram: insert(B) into a B+tree node holding A C D shifts keys, exposing an intermediate state to concurrent lookups]

  23. B+tree optimization - Permuter • Keys stored unsorted; a permuter defines their order within each tree node • A concurrent lookup does not need to retry • Lookup uses the permuter to search keys • Insert appears atomic to lookups • Permuter: a 64-bit integer • [Diagram: insert(B) writes B into a free slot after A C D, then publishes the new key order 0 3 1 2]
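A rough sketch of one plausible packing of that 64-bit integer, following the slide's description (the struct and accessors are illustrative, not Masstree's source):

    #include <cstdint>

    // Permuter: the low 4 bits hold the key count; the next 15 nibbles list, in
    // sorted key order, which physical key slot holds each logical position.
    struct Permuter {
        uint64_t x;
        int size() const            { return x & 15; }
        int slot(int logical) const { return (x >> (4 * (logical + 1))) & 15; }
    };

An insert writes the new key into an unused physical slot, builds the new permutation in a register, and publishes it with a single 8-byte store, which is why concurrent lookups never observe a half-inserted key.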

  24. B+tree with permuter [Chart: throughput (req/sec, millions); 140M short KV, put-only, @16 cores] • Improves throughput by 4%

  25. Performance drops dramatically as key length increases [Chart: throughput (req/sec, millions) vs. key length; short values, 50% updates, @16 cores, no logging; keys differ only in the last 8B] • Why? Key suffixes are stored indirectly, so each key comparison • compares the full key • incurs an extra DRAM fetch

  26. Masstree – Trie of B+trees • Trie: a tree where each level is indexed by a fixed-length key fragment • Masstree: a trie with fanout 2^64, where each trie node is a B+tree • Compress key prefixes! • [Diagram: a B+tree indexed by k[0:7], whose leaves point to B+trees indexed by k[8:15], then k[16:23], …]
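One way to picture the 2^64 fanout: each trie layer is keyed by an 8-byte slice of the key, encoded so that integer comparison matches byte-wise comparison. A hypothetical helper (key_slice is an assumed name, not the talk's code):

    #include <algorithm>
    #include <cstdint>
    #include <cstring>
    #include <string>

    // Return the i-th 8-byte slice of key as a big-endian integer, zero-padded at
    // the end, so comparing slices as uint64_t gives the same order as memcmp.
    uint64_t key_slice(const std::string& key, size_t layer) {
        unsigned char buf[8] = {0};
        size_t off = layer * 8;
        if (off < key.size())
            std::memcpy(buf, key.data() + off, std::min<size_t>(8, key.size() - off));
        uint64_t v = 0;
        for (int i = 0; i < 8; ++i) v = (v << 8) | buf[i];
        return v;
    }

A real implementation would also record each key's length so that a short key and a longer key that only differs in the zero padding stay distinct.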

  27. Case study: keys share a P-byte prefix – better than a single B+tree • The shared prefix collapses into P/8 trie levels, each containing only one node • Below them, a single B+tree indexed by 8B key slices does the remaining work (contrast with one big B+tree comparing full-length keys)

  28. Masstree performs better for long keys with shared prefixes [Chart: throughput (req/sec, millions) vs. key length; short values, 50% updates, @16 cores, no logging] • 8B key-slice comparisons vs. full-key comparisons

  29. Does the trie of B+trees hurt short-key performance? [Chart: throughput (req/sec, millions); 140M short KV, put-only, @16 cores] • No – 8% faster! • More efficient code: internal nodes handle 8B keys only

  30. Evaluation • How does Masstree compare to other systems? • How does Masstree compare to partitioned trees? • How much do we pay for handling skewed workloads? • How does Masstree compare with a hash table? • How much do we pay for supporting range queries? • Does Masstree scale on many cores?

  31. Masstree performs well even with persistence and range queries [Chart: throughput (req/sec, millions), with comparison bars at 0.04 and 0.22; 20M short KV, uniform dist., read-only, @16 cores, w/ network] • Memcached: not persistent and no range queries • Redis: no range queries • Unfair comparison: both have a richer data and query model

  32. Multi-core – Partition among cores? • Multiple instances, one unique set of keys per instance • Memcached, Redis, VoltDB • Masstree: a single shared tree • each core can access all keys • reduced imbalance • [Diagram: per-core partitioned trees vs. one shared tree]

  33. A single Masstree performs better for skewed workloads [Chart: throughput (req/sec, millions) vs. δ, where one partition receives δ times more queries than the others; 140M short KV, read-only, @16 cores, w/ network] • Partitioned trees: no remote DRAM access, no concurrency control • But under skew most partitioned cores go idle (chart annotations: partition: 80% idle time; 1 partition: 40%; 15 partitions: 4%)

  34. Cost of supporting range queries • Without range queries one can use a hash table • No resize cost: pre-allocate a large hash table • Lock-free: update with cmpxchg • Only supports 8B keys: efficient code • 30% full, each lookup = 1.1 hash probes • Measured in the Masstree framework: 2.5X the throughput of Masstree • i.e., range query support costs ~2.5X in performance
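A minimal sketch of such a table (hypothetical: fixed-size open addressing for 8-byte non-zero keys, where an insert claims an empty slot with a compare-and-swap; not the code used in the talk's measurement):

    #include <atomic>
    #include <cstdint>
    #include <vector>

    // Hypothetical pre-allocated, lock-free open-addressing table for 8-byte keys.
    // Key 0 is reserved to mark an empty slot; the table is never resized.
    struct FixedHash {
        struct Slot { std::atomic<uint64_t> key{0}; std::atomic<uint64_t> value{0}; };
        std::vector<Slot> slots;
        explicit FixedHash(size_t n) : slots(n) {}

        void put(uint64_t key, uint64_t value) {
            for (size_t i = key % slots.size(); ; i = (i + 1) % slots.size()) {
                uint64_t k = slots[i].key.load();
                if (k == 0) {                           // empty: try to claim it with a CAS
                    uint64_t expected = 0;
                    if (slots[i].key.compare_exchange_strong(expected, key)) k = key;
                    else k = expected;                  // another thread claimed it first
                }
                if (k == key) { slots[i].value.store(value); return; }
            }
        }

        bool get(uint64_t key, uint64_t* value) {
            for (size_t i = key % slots.size(); ; i = (i + 1) % slots.size()) {
                uint64_t k = slots[i].key.load();
                if (k == key) { *value = slots[i].value.load(); return true; }
                if (k == 0) return false;               // reached an empty slot: absent
            }
        }
    };

(A real table would hash the key before probing; the point here is only that puts need nothing stronger than a single cmpxchg, which is why dropping range queries buys so much throughput.)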

  35. Scales to 12X on 16 cores [Chart: per-core throughput (req/sec/core, millions) vs. number of cores, with a perfect-scalability reference line; short KV, w/o logging] • Scales to 12X • Put scales similarly • Limited by the shared memory system

  36. Related work • [OLFIT]: Optimistic Concurrency Control • [pB+-trees]: B+tree with software prefetch • [pkB-tree]: store a fixed # of diff. bits inline • [PALM]: lock-free B+tree, 2.3X as fast as [OLFIT] • Masstree: first system to combine them, w/ new optimizations • Trie of B+trees, permuter

  37. Summary • Masstree: a general-purpose, high-performance, persistent KV store • 5.8 million puts/sec, 8 million gets/sec • More comparisons with other systems in the paper • Using cache-craftiness improves performance by 1.5X

  38. Thank you!
