
Data management over flash memory


Presentation Transcript


  1. Data management over flash memory Ioannis Koltsidas and Stratis D. Viglas, SIGMOD 2011, Athens, Greece

  2. Flash: a disruptive technology • Orders of magnitude better performance than HDD • Low power consumption • Dropping prices • Idea: Throw away HDDs and replace everything with Flash SSDs • Not enough capacity • Not enough money to buy the not-enough-capacity • However, Flash fits very well between DRAM and HDD • DRAM/Flash/HDD price ratio: ~100:10:1 per GB • DRAM/Flash/HDD latency ratio: ~1:10:100 • Integrate Flash into the storage hierarchy • complement existing DRAM memory and HDDs Koltsidas and Viglas, SIGMOD 2011

  3. Outline • Flash-based device design • Flash memory • Solid state drives • Making SSDs database-friendly • System-level challenges • Hybrid systems • Storage, buffering and caching • Indexing on flash • Query and transaction processing Koltsidas and Viglas, SIGMOD 2011

  4. Outline • Flash-based device design • Flash memory • Solid state drives • Making SSDs database-friendly • System-level challenges • Hybrid systems • Storage, buffering and caching • Indexing on flash • Query and transaction processing Koltsidas and Viglas, SIGMOD 2011

  5. Flash memory cells • Flash cell: a floating-gate transistor (control gate, floating gate, oxide layer) • Electrons get trapped in the floating gate • Two states: floating gate charged or not (‘0’ or ‘1’) • The charge changes the threshold voltage (VT) of the cell • To read: apply a voltage between the possible VT values • the MOSFET channel conducts (‘1’) • or, it remains insulating (‘0’) • After a number of program/erase cycles, the oxide wears out • Single-Level Cell (SLC): one bit per cell • Multi-Level Cell (MLC): two or more bits per cell • The cell can sense the amount of current flow • Programming takes longer, puts more strain on the oxide [Figure: cross-section of a floating-gate transistor, showing the bit line, source line, control gate, floating gate, oxide layer and N-type source/drain on a P-type silicon substrate] Koltsidas and Viglas, SIGMOD 2011

  6. Flash memory arrays • Cells are connected to form arrays; depending on how they are connected, the memory is NOR or NAND flash • Flash page: the unit of read / program operations (typically 2 kB – 8 kB) • Flash block: the unit of erase operations (typically 32 – 128 pages) • Before a page can be re-programmed, the whole block has to be erased first • Reading a page is much faster than writing one • It takes some time before the cell charge reaches a stable state • Erasing takes two orders of magnitude more time than reading Koltsidas and Viglas, SIGMOD 2011
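
The erase-before-write rule above is the root cause of most FTL complexity. Below is a minimal sketch (my own illustration, not from the tutorial) of a single flash block that enforces it; the FlashBlock name and the page/block sizes are assumptions.

```python
# Illustrative model of one flash block: a page can be programmed only once
# between erasures, and only a whole-block erase resets its pages.
PAGES_PER_BLOCK = 64  # e.g. 64 x 2 kB pages = 128 kB block (assumed sizes)

class FlashBlock:
    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK        # page payloads
        self.programmed = [False] * PAGES_PER_BLOCK  # has the page been written?
        self.erase_count = 0

    def program(self, page_no, data):
        if self.programmed[page_no]:
            raise RuntimeError("page already programmed: erase the block first")
        self.pages[page_no] = data
        self.programmed[page_no] = True

    def read(self, page_no):
        return self.pages[page_no]

    def erase(self):                                 # unit of erase = whole block
        self.pages = [None] * PAGES_PER_BLOCK
        self.programmed = [False] * PAGES_PER_BLOCK
        self.erase_count += 1                        # wears the oxide a little more

blk = FlashBlock()
blk.program(0, b"v1")
blk.erase()           # required before page 0 can hold new data
blk.program(0, b"v2")
```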

  7. Outline • Flash-based device design • Flash memory • Solid state drives • Making SSDs database-friendly • System-level challenges • Hybrid systems • Storage, buffering and caching • Indexing on flash • Query and transaction processing Koltsidas and Viglas, SIGMOD 2011

  8. Flash-based Solid State Drives • Common I/O interface • Block-addressable interface • No mechanical latency • Access latency independent of the access pattern • 30 to 50 times more efficient in IOPS/$ per GB than HDDs • Read / Write asymmetry • Reads are faster than writes • Erase-before-write limitation • Limited endurance / wear leveling • 5 year warranty for enterprise SSDs (assuming 10 complete re-writes per day) • Energy efficiency • 100 – 200 times more efficient than HDDs in IOPS / Watt • Physical properties • Resistance to extreme shock, vibration, temperature, altitude • Near-instant start-up time Koltsidas and Viglas, SIGMOD 2011

  9. SSD challenges • Host interface • Flash memory: read_flash_page, program_flash_page, erase_flash_block • Typical block device interface: read_sector, write_sector • Writes in place would kill performance and lifetime • Solution: perform writes out-of-place • Amortize block erasures over many write operations • Writes go to spare, erased blocks; old pages are invalidated • Device LBA space ≠ PBA space • Flash Translation Layer (FTL) • Address translation (logical-to-physical mapping) • Garbage collection (block reclamation) • Wear-leveling [Figure: at the device level, the Flash Translation Layer maps logical pages in the LBA space to flash pages and flash blocks, including spare capacity, in the PBA space at the flash chip level] Koltsidas and Viglas, SIGMOD 2011
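
To illustrate the out-of-place write idea, here is a minimal page-level FTL sketch (my own, not from the tutorial): each logical write is redirected to a spare physical page and the old page is merely invalidated, leaving erasure to garbage collection. The TinyFTL name and the flat physical page array are assumptions for illustration.

```python
# Sketch of out-of-place writes behind a block-device interface (write_sector):
# every logical write lands on a fresh physical page; the previous physical
# page for that LBA is only invalidated, never erased in place.
class TinyFTL:
    def __init__(self, physical_pages=1024):
        self.flash = [None] * physical_pages   # physical page array (PBA space)
        self.valid = [False] * physical_pages
        self.l2p = {}                          # logical-to-physical mapping
        self.next_free = 0                     # next spare, erased page to program

    def write_sector(self, lba, data):
        old = self.l2p.get(lba)
        if old is not None:
            self.valid[old] = False            # invalidate the old copy
        pba = self.next_free                   # program a spare page instead
        self.next_free += 1
        self.flash[pba] = data
        self.valid[pba] = True
        self.l2p[lba] = pba

    def read_sector(self, lba):
        return self.flash[self.l2p[lba]]

ftl = TinyFTL()
ftl.write_sector(7, b"old")
ftl.write_sector(7, b"new")                    # second write lands on a new PBA
assert ftl.read_sector(7) == b"new"
```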

  10. SSD architecture • Various form factors • PCI-e • SAS (1.8” – 3.5”) • SATA (1.8” – 3.5”) • Number of channels • 4 to 16 or more • RAM buffers • 1 MB up to more than 256 MB • Over-provisioning • 10% up to 40% • Command parallelism • Intra-command • Inter-command • Reordering / merging (NCQ) [Figure: SSD architecture with a host interface, data buffer, micro-processor and RAM, and multiple channel controllers with ECC driving the flash chips; the firmware (Flash Translation Layer) comprises a request handler, LBA-to-PBA map and mapper, bad-block list, write page allocator, garbage collector, wear-leveling, metadata cache and free-block queue] Koltsidas and Viglas, SIGMOD 2011

  11. LBA-to-PBA mapping • Page-level mapping (fully associative map) • Each logical page may be mapped to any physical page • Large memory footprint: 2 MB / GB of flash* (more RAM, supercaps) • Block-level mapping (set-associative map) • Each logical block can be mapped to any physical block • Poor performance, esp. for small random writes • Footprint: 32 KB / GB of flash* [Figures: example page-level and block-level LBA-to-PBA mappings across four flash blocks] *Assuming 2 KB flash page, 128 KB flash block, 8 TB max. flash capacity Koltsidas and Viglas, SIGMOD 2011
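
The quoted footprints follow directly from the footnote's assumptions; here is a small back-of-the-envelope check (the 4-byte entry size is inferred from the 8 TB maximum capacity, so it is an assumption of this sketch).

```python
# Footprint of the LBA-to-PBA map per GB of flash, under the slide's
# assumptions: 2 KB pages, 128 KB blocks, 8 TB maximum capacity.
KB, MB, GB, TB = 2**10, 2**20, 2**30, 2**40
page, block = 2 * KB, 128 * KB

entry = 4  # bytes: 8 TB / 2 KB = 2**32 pages, so a 32-bit physical address suffices

page_level_map  = (GB // page)  * entry   # one entry per logical page
block_level_map = (GB // block) * entry   # one entry per logical block

print(page_level_map  / MB, "MB per GB of flash")   # 2.0
print(block_level_map / KB, "KB per GB of flash")   # 32.0
```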

  12. [Gal & Toledo, ACM Comput. Surv., 37(2), 2005], [Chung et al., EUC 2006], [Kang et al., 2006], [Lee et al., SIGMOD 2008] LBA-to-PBA mapping: Logging • Hybrid mapping: two types of blocks (logging) • Data blocks mapped at block granularity • Log blocks mapped at page granularity • Updates to data blocks performed as writes to log blocks • Log block sharing? No: log block thrashing; Yes: costly garbage collection [Figures: dedicated vs. shared log blocks serving updates to data blocks] Koltsidas and Viglas, SIGMOD 2011

  13. [Gupta et al., 2009] LBA-to-PBA mapping: Caching • Dynamically load mapping entries depending on the workload (caching) • Page-level mapping stored on flash • A portion of the mapping is cached in RAM • Exploits temporal locality of accesses • Uniform block utilization during page updates • Mapping table for translation blocks: ~4 KB / GB of flash [Figure: DFTL (Gupta et al., 2009): data blocks and translation blocks on flash, with the cached mapping entries and the mapping table for translation blocks held in RAM] Koltsidas and Viglas, SIGMOD 2011
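
A rough sketch of the caching idea (a simplification, not the actual DFTL code): the full page-level map notionally lives in translation pages on flash, and only recently used entries are kept in a small LRU-managed RAM cache. The CachedMap name and the plain dict standing in for the flash-resident map are assumptions.

```python
from collections import OrderedDict

# The complete page-level map lives on flash (a dict stands in for it here);
# an OrderedDict plays the role of the cached mapping entries in RAM.
flash_resident_map = {lba: lba for lba in range(1_000_000)}  # dummy identity map

class CachedMap:
    def __init__(self, capacity=4096):
        self.capacity = capacity
        self.cache = OrderedDict()              # cached mapping entries (RAM)

    def lookup(self, lba):
        if lba in self.cache:                   # hit: exploit temporal locality
            self.cache.move_to_end(lba)
            return self.cache[lba]
        pba = flash_resident_map[lba]           # miss: read a translation page
        self.cache[lba] = pba
        if len(self.cache) > self.capacity:     # evict the least recently used entry
            self.cache.popitem(last=False)      # (a dirty entry would be written back)
        return pba

m = CachedMap()
print(m.lookup(42))
```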

  14. [Hu et al., SYSTOR 2009] Garbage collection (1/3) • Goal: reclaim a block with invalid pages [Figure: block i under GC: (1) relocate its valid pages to a previously free block j, then (2) erase block i to free it] • Write amplification: the average number of flash writes per user write • Affects write latency and device lifetime • Differs between large sequential writes and small random updates • Write amplification heavily affected by: • Garbage collection algorithm • Degree of over-provisioning • Workload Koltsidas and Viglas, SIGMOD 2011
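
A worked illustration (mine, not from the tutorial) of how write amplification follows from garbage collection: reclaiming a block with u still-valid pages out of N frees N - u slots for user data but costs u relocation writes, so WA = (user + relocated) / user = N / (N - u).

```python
# Write amplification implied by the victim blocks garbage collection reclaims.
def write_amplification(pages_per_block, valid_pages_in_victim):
    freed = pages_per_block - valid_pages_in_victim
    return pages_per_block / freed

print(write_amplification(64, 0))    # 1.0: victims fully invalid (large sequential writes)
print(write_amplification(64, 48))   # 4.0: victims still 75% valid (small random updates)
```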

  15. [Gal & Toledo, ACM Comput. Surv., 37(2), 2005], [Hu et al., SYSTOR 2009] Garbage collection (2/3) • Goal: reclaim blocks, keeping write amplification as low as possible • Greedy GC: always reclaim the block with the smallest number of valid pages • Too expensive! • Windowed greedy GC: restrict the search to a time window • Only look at the s oldest blocks • Spare factor (over-provisioning) is critical • More spare blocks → more invalid pages per block → less write amplification Koltsidas and Viglas, SIGMOD 2011
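
A sketch of the two victim-selection policies as described on the slide; representing each block as an (age, valid_pages) pair is an assumption for illustration.

```python
# Greedy GC inspects every block; windowed greedy only inspects the s oldest ones.
def greedy_victim(blocks):
    return min(blocks, key=lambda b: b[1])            # fewest valid pages overall

def windowed_greedy_victim(blocks, s):
    window = sorted(blocks, key=lambda b: b[0], reverse=True)[:s]  # s oldest blocks
    return min(window, key=lambda b: b[1])            # fewest valid pages in the window

blocks = [(9, 12), (8, 3), (2, 1), (1, 40)]           # (age, valid_pages) per block
print(greedy_victim(blocks))              # (2, 1): global minimum, but costly to find
print(windowed_greedy_victim(blocks, 2))  # (8, 3): best within the 2 oldest blocks
```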

  16. [Gal & Toledo, ACM Comput. Surv., 37(2), 2005], [Hu et al., SYSTOR 2009] Garbage collection (3/3) • Writes are not uniform! • Agnostic controller: blocks with mixed pages • Separating controller: distinguishes between static / dynamic pages • Blocks with static data: almost no invalid pages • Blocks with dynamic data: mostly invalid pages [Figures: agnostic vs. separating page placement] Koltsidas and Viglas, SIGMOD 2011

  17. Wear leveling • Wear leveling and garbage collection have contradictory goals! • Wear leveling aims to reclaim static blocks, swapping them with dynamic blocks • Typical wear-leveling: each block has a cycle budget • Upon each erasure, one cycle is consumed • Wear-leveling kicks in when a block diverges from the device's average erase count • The block with the max count is swapped with the block with the min count (swap a dynamic block with a static one) • Additional write amplification with each swap! • Though much less than the amplification caused by GC • Caveat: excessive wear leveling increases the overall wear Koltsidas and Viglas, SIGMOD 2011
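
A sketch of a budget-style wear-leveling trigger along the lines described above; the threshold value and the per-block erase-count bookkeeping are illustrative assumptions.

```python
# Swap-based wear leveling: when a block's erase count drifts too far above the
# device average, swap the most-worn (dynamic) block with the least-worn
# (static) one, at the cost of some extra relocation writes.
THRESHOLD = 20   # allowed divergence from the average erase count (assumed value)

def maybe_wear_level(erase_counts):
    avg = sum(erase_counts.values()) / len(erase_counts)
    hottest = max(erase_counts, key=erase_counts.get)
    coldest = min(erase_counts, key=erase_counts.get)
    if erase_counts[hottest] - avg > THRESHOLD:
        return (hottest, coldest)   # swap contents: static data absorbs future wear
    return None                     # wear is even enough; avoid needless extra writes

print(maybe_wear_level({"b0": 120, "b1": 15, "b2": 18}))  # ('b0', 'b1')
```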

  18. Off-the-shelf SSDs [Chart: IOPS of off-the-shelf consumer and enterprise SSDs compared with HDDs; a 15k RPM SAS HDD delivers ~250-300 IOPS and a 7.2k RPM SATA HDD ~80 IOPS, with consumer SSDs roughly one order of magnitude higher and enterprise SSDs more than two orders of magnitude higher] Koltsidas and Viglas, SIGMOD 2011

  19. Read latency [Chart: latency of 4 KB random reads uniformly distributed over the whole medium, 50% random data] Koltsidas and Viglas, SIGMOD 2011

  20. Write latency [Chart: latency of 4 KB random writes uniformly distributed over the whole medium, 50% random data] Koltsidas and Viglas, SIGMOD 2011

  21. Mixed workload – Read latency [Chart: read latency under a mixed workload of 4 KB I/O operations uniformly distributed over the whole medium, 50% random data, queue depth = 32] Koltsidas and Viglas, SIGMOD 2011

  22. Mixed workload – Write latency [Chart: write latency under a mixed workload of 4 KB I/O operations uniformly distributed over the whole medium, 50% random data, queue depth = 32] Koltsidas and Viglas, SIGMOD 2011

  23. [Bouganim et al., CIDR 2009] uFLIP (www.uflip.org) • The SSD is a black box for the system • What kind of I/O patterns should a flash-based DBMS favor or avoid? • Collection of micro-benchmarks tailored for SSDs • Conclusions from the experimental study include: • Reads and sequential writes are very efficient • Flash-page-aligned I/O requests and request sizes are very beneficial • Random writes within a small LBA address window incur almost the same latency as sequential ones • Parallel sequential writes to different partitions should be limited • Pauses between requests do not improve overall performance • Benchmarking tool also provided Koltsidas and Viglas, SIGMOD 2011

  24. Outline • Flash-based device design • Flash memory • Solid state drives • Making SSDs database-friendly • System-level challenges • Hybrid systems • Storage, buffering and caching • Indexing on flash • Query and transaction processing Koltsidas and Viglas, SIGMOD 2011

  25. [Lee & Moon, SIGMOD 2007] In-page logging • Page updates are logged • Log sector for each DB page • Allocated when the page becomes dirty • Log region in each flash block • Page write-backs only involve log-sector writes • Until a merge is required • Upon read: • Fetch log records from flash • Apply them to the in-memory page • Same or greater number of writes • But, significant reduction in erasures • However: • The DBMS needs to control physical placement • Partial flash page writes are involved Koltsidas and Viglas, SIGMOD 2011
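
A sketch of the in-page-logging read path (a simplification of the scheme, not Lee & Moon's code): fetch the base page, then replay the log records stored in the block's log region over the in-memory copy. The dictionary representation of pages and log records is an assumption.

```python
# In-page logging, read path: a page's updates live as log records co-located
# in its flash block's log region; a read fetches the base page and replays them.
def read_page(flash_block, page_no):
    page = dict(flash_block["pages"][page_no])   # base copy of the DB page
    for rec in flash_block["log_region"]:        # replay log records in order
        if rec["page"] == page_no:
            page[rec["slot"]] = rec["value"]     # apply the logged update
    return page
    # when the log region fills up, a merge rewrites the block with the logs applied

block = {
    "pages": [{"a": 1, "b": 2}],
    "log_region": [{"page": 0, "slot": "b", "value": 99}],
}
print(read_page(block, 0))   # {'a': 1, 'b': 99}
```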

  26. [Bonnet et al., 2011] SSD designers’ vs DBMS designers’ goals • SSD designers assume a generic filesystem above the device Goals: • Hide the complexities of flash memory • Improve performance for generic workloads and I/O patterns • Protect their competitive advantage, by hiding algorithm and implementation details • DBMS designers have full control of the I/O issued to the device Goals: • Predictability for I/O operations, independence of hardware specifics • Clear characterization of I/O patterns • Exploit synergies between query processing and flash memory properties Koltsidas and Viglas, SIGMOD 2011

  27. The TRIM command • TRIM: a command to communicate logical page deletions to the device • Invalidation of flash pages before they are overwritten • Less write amplification • Improved write performance • Improved lifetime • Gaining adoption by SSD manufacturers and OS / filesystem designers • Linux 2.6.28, Windows 7, Mac OS X 10.7 • Actual implementation varies across device manufacturers Koltsidas and Viglas, SIGMOD 2011
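
What honoring TRIM amounts to inside the FTL, as a minimal sketch (assuming a page-level logical-to-physical map and a validity table, as in the earlier FTL example): trimmed LBAs lose their mapping and their physical pages are invalidated, so garbage collection never has to relocate data the filesystem has already deleted.

```python
# Trimming a range of LBAs: drop the logical-to-physical entries and mark the
# physical pages invalid, so GC can reclaim them without copying dead data.
def trim(l2p, valid, lba_start, lba_count):
    for lba in range(lba_start, lba_start + lba_count):
        pba = l2p.pop(lba, None)          # forget the logical-to-physical entry
        if pba is not None:
            valid[pba] = False            # the page no longer counts as live data

l2p = {0: 10, 1: 11, 2: 12}
valid = {10: True, 11: True, 12: True}
trim(l2p, valid, 1, 2)                    # e.g. the filesystem deleted LBAs 1-2
print(l2p, valid)                         # {0: 10} {10: True, 11: False, 12: False}
```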

  28. [Bonnet et al., 2011] A richer interface to SSDs • Stripped-down FTL, rich interface to the host • What kind of interface would allow maximum performance? • Expose the internal geometry of the SSD (alignment, mapping of pages to blocks, etc.) • Expose logical blocks in addition to logical pages • DBMS respects flash constraints (C1, C2, C3) • Wear-leveling still done by the device firmware • Bimodal FTL: • I/O respecting the constraints is “tunneled” • Unconstrained I/O • DBMS designer has full control over I/O performance • But, a lot of re-writing is required Koltsidas and Viglas, SIGMOD 2011

  29. Parallelism? • Still, some operations are more efficient in hardware • Mapping of the address space to flash planes, dies and channels • ECC, encryption, etc. • Wear-leveling still needs to be done by the device firmware • The internal device geometry is critical to achieve maximum parallelism • The DBMS needs to be aware of the geometry to some degree • Logical block size ≠ flash block size • Logical block size is relevant to the number of channels, pipelining capabilities on each channel, etc. [Figure: a logical block striped across flash chips behind multiple channel controllers with ECC] Koltsidas and Viglas, SIGMOD 2011

  30. SSDs – summary • Flash memory has the potential to remove the I/O bottleneck • Especially for read-dominated workloads • “SSD”: multiple classes of devices • Excellent random read latency is universal • Read and write bandwidth varies widely • Dramatic difference across random write latencies • Dramatic differences in terms of economics: $/GB cost, power consumption, expected lifetime, reliability • A lot of research to be done towards defining DBMS-specific interfaces Koltsidas and Viglas, SIGMOD 2011

  31. Outline • Flash-based device design • Flash memory • Solid state drives • Making SSDs database-friendly • System-level challenges • Hybrid systems • Storage, buffering and caching • Indexing on flash • Query and transaction processing Koltsidas and Viglas, SIGMOD 2011

  32. Storage, buffering and caching [Figure: overview of the storage hierarchy: the buffer pool with demand paging and eviction, an SSD cache, SSD persistent storage and HDD persistent storage] Koltsidas and Viglas, SIGMOD 2011

  33. Flash memory for persistent storage [Figure: the same hierarchy, with the SSD used as persistent storage alongside the HDD] Koltsidas and Viglas, SIGMOD 2011

  34. Hybrid storage layer [Figure: the same hierarchy, with the SSD and HDD together forming a hybrid persistent storage layer] Koltsidas and Viglas, SIGMOD 2011

  35. Flash memory as cache [Figure: the same hierarchy, with the SSD acting as a cache between the buffer pool and HDD persistent storage] Koltsidas and Viglas, SIGMOD 2011

  36. Outline • Flash-based device design • Flash memory • Solid state drives • Making SSDs database-friendly • System-level challenges • Hybrid systems • Storage, buffering and caching • Indexing on flash • Query and transaction processing Koltsidas and Viglas, SIGMOD 2011

  37. Hybrid systems • Problem setup • SSDs are becoming cost-effective, but still not ready to replace HDDs in the enterprise • Certainly conceivable to have both SSDs and HDDs at the same level of the storage hierarchy • Research questions • How can we take advantage of the SSD characteristics when designing a database? • How can we optimally place data across both types of medium? • Methodologies • Workload detection for data placement • Load balancing to minimize response time • Caching data between disks Koltsidas and Viglas, SIGMOD 2011

  38. [K & V, VLDB 2008] Workload-driven page placement • Flash memory and HDD at the same level of the storage hierarchy • Monitor page use by keeping track of reads and writes • Logical operations (i.e., references only) • Physical operations (actually touching the disk) • Hybrid model (logical operations manifested as physical ones) • Identify the workload of a page and place it appropriately • Read-intensive pages on flash • Write-intensive pages on HDD • Migrate pages once they have expended their cost, if erroneously placed [Figures: user-level page I/O path from logical I/O operations through the buffer manager and replacement policy to the storage manager and storage layer (SSD and HDD, with page migrations); the placement decision modeled as a 2-state task system with per-medium read/write costs] Koltsidas and Viglas, SIGMOD 2011
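
A sketch of the placement rule; the counters, the cost constants and the decision function are illustrative assumptions rather than the paper's exact cost model (the made-up latencies reflect the 2008-era premise of fast flash reads but expensive small random flash writes).

```python
# Per-page read/write counters drive placement: read-intensive pages belong on
# flash, write-intensive pages on the HDD. Costs below are invented latencies (ms).
COST = {"flash": {"r": 0.2, "w": 10.0}, "hdd": {"r": 8.0, "w": 8.0}}

def better_medium(reads, writes):
    flash_cost = reads * COST["flash"]["r"] + writes * COST["flash"]["w"]
    hdd_cost   = reads * COST["hdd"]["r"]   + writes * COST["hdd"]["w"]
    return "flash" if flash_cost <= hdd_cost else "hdd"

# A page would be migrated only after the penalty of staying put has paid for the move.
print(better_medium(reads=90, writes=10))   # flash: read-intensive page
print(better_medium(reads=5, writes=95))    # hdd: write-intensive page
```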

  39. [Canim, Mihaila, Bhattacharjee, Ross & Lang, VLDB 2009] Object placement • Hybrid disk setup • Offline tool • Optimal object allocation across the two types of disk • Two phases • Profiling: start with all objects on the HDD and monitor system use • Decision: based on profiling statistics, estimate the performance gained from moving each object from the HDD to the SSD • Reduce the decision to a knapsack problem and apply greedy heuristics • Implemented in DB2 [Figure: the object placement advisor takes device parameters, the workload and buffer pool monitor statistics from the database engine, plots performance gain against SSD budget and picks a cut-off point for objects in the storage system (HDD vs. SSD)] Koltsidas and Viglas, SIGMOD 2011
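
A sketch of a greedy heuristic for the knapsack formulation: rank objects by estimated gain per GB and move them to the SSD until the budgeted capacity runs out. The object names, sizes and gains are invented for illustration.

```python
# Decision phase as a greedy knapsack: objects (tables, indexes) are ranked by
# estimated time saved per GB if moved to the SSD and picked until the SSD
# budget is exhausted; everything else stays on the HDD.
def place_objects(objects, ssd_budget_gb):
    ranked = sorted(objects, key=lambda o: o["gain_s"] / o["size_gb"], reverse=True)
    on_ssd, used = [], 0.0
    for obj in ranked:
        if used + obj["size_gb"] <= ssd_budget_gb:
            on_ssd.append(obj["name"])
            used += obj["size_gb"]
    return on_ssd

profiled = [   # hypothetical profiling output: gain in seconds if moved to the SSD
    {"name": "orders_idx", "size_gb": 2.0,  "gain_s": 900.0},
    {"name": "lineitem",   "size_gb": 40.0, "gain_s": 3000.0},
    {"name": "customer",   "size_gb": 5.0,  "gain_s": 200.0},
]
print(place_objects(profiled, ssd_budget_gb=10.0))  # ['orders_idx', 'customer']
```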

  40. [Soundararajan, Prabhakaran, Balakrishnan & Wobber, FAST 2010] [Holloway PhD Thesis, UW-Madison, 2009] Write caching • SSD for primary storage, auxiliary HDD • Take advantage of better HDD write performance to extend SSD lifetime and improve write throughput • Writes are pushed to the HDD • Log structure ensures sequential writes • Fixed log size • Once the log is full, merge the writes back to the SSD [Figure: writes go to an HDD-resident log and are later merged back to the SSD; reads are served from the SSD] Koltsidas and Viglas, SIGMOD 2011

  41. [Wu & Reddy, MASCOTS 2010] Load balancing to maximize throughput • Setting consists of a transaction processing system with both types of disk • Objective is to balance the load across media • Achieved when the response times across media are equal, i.e., a Wardrop equilibrium • Algorithms to achieve this equilibrium • Page classification (hot or cold) • Page allocation and migration [Figure: storage management layer with a hot/cold data classifier, data (re)locator, operation redirector, device performance monitor, cache and policy configuration] Koltsidas and Viglas, SIGMOD 2011

  42. Outline • Flash-based device design • Flash memory • Solid state drives • Making SSDs database-friendly • System-level challenges • Hybrid systems • Storage, buffering and caching • Indexing on flash • Query and transaction processing Koltsidas and Viglas, SIGMOD 2011

  43. Buffering in main memory • Problem setup • Flash memory is used for persistent storage • Typical on-demand paging • Research questions • Which pages do we buffer? • Which pages do we evict and when? • Methodologies • Flash memory size alignment • Cost-based replacement • Write scheduling Koltsidas and Viglas, SIGMOD 2011

  44. [Kim & Ahn, FAST 2008] Block padding LRU (BPLRU) • Manages the RAM buffer on the device • Data blocks are organized at erase-unit granularity • The LRU queue is over data blocks • On reference, move the entire block to the head of the queue • On eviction, sequentially write the entire block [Figure: example LRU queue of blocks from MRU to LRU; referencing logical sector 11 moves its block to the head, and the victim block at the tail has its written logical sectors (4, 5) flushed] Koltsidas and Viglas, SIGMOD 2011
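
A sketch of the block-level LRU queue at the heart of BPLRU (without the padding and LRU-compensation optimizations of the next slide); the sectors-per-block constant and the flush callback are assumptions.

```python
from collections import OrderedDict

# BPLRU core idea: buffered sectors are grouped by their erase block, the LRU
# queue is over blocks, and eviction flushes a whole block's sectors at once.
SECTORS_PER_BLOCK = 8   # assumed mapping of logical sectors to erase blocks

class BPLRU:
    def __init__(self, max_blocks):
        self.max_blocks = max_blocks
        self.blocks = OrderedDict()            # block_no -> {sector: data}, LRU order

    def write(self, sector, data):
        blk = sector // SECTORS_PER_BLOCK
        self.blocks.setdefault(blk, {})[sector] = data
        self.blocks.move_to_end(blk)           # the whole block becomes MRU
        if len(self.blocks) > self.max_blocks:
            victim, sectors = self.blocks.popitem(last=False)   # LRU block
            self.flush(victim, sectors)

    def flush(self, block_no, sectors):        # one sequential write of the block
        print(f"flush block {block_no}: sectors {sorted(sectors)}")

buf = BPLRU(max_blocks=2)
for s in (4, 5, 11, 20):                       # touching sector 20 evicts block 0,
    buf.write(s, b"x")                         # flushing its written sectors 4 and 5
```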

  45. [Kim & Ahn, FAST 2008] BPLRU: Further optimizations • Use of padding • If a data block to be written has not been fully read, read what’s missing and write the block sequentially • The FTL reads the missing sectors and replaces the data block in one sequential write • LRU compensation • Sequentially written blocks are moved to the end of the LRU queue • Least likely to be written in the future [Figure: padding reads the missing sectors of the victim block and writes the whole block sequentially; the sequentially written block moves to the LRU end of the queue] Koltsidas and Viglas, SIGMOD 2011

  46. Cost-based replacement • Choice of victim depends on probability of reference (as usual) • But the eviction cost is not uniform • Clean pages bear no write cost, dirty pages result in a write • I/O asymmetry: writes more expensive than reads • It doesn’t hurt if we misestimate the heat of a page • So long as we save (expensive) writes • Key idea: combine LRU-based replacement with cost-based algorithms • Applicable both in SSD-only as well as hybrid systems Koltsidas and Viglas, SIGMOD 2011

  47. [Park, Jung, Kang, Kim & Lee, CASES 2006] Clean first LRU (CFLRU) • Buffer pool divided into two regions • Working region: business as usual • Clean-first region: candidates for eviction • Number of candidates is called the window size W • Always evict from the clean-first region • Evict clean pages before dirty ones to save write cost • Improvement: Clean-First Dirty-Clustered [Ou, Härder & Jin, DAMON 2009] • Cluster dirty pages of the clean-first region based on spatial proximity [Figure: buffer pool pages P1–P8 from LRU to MRU, split into a clean-first region of window size W at the LRU end and a working region at the MRU end, with clean and dirty pages marked] Koltsidas and Viglas, SIGMOD 2011
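
A sketch of CFLRU victim selection: within the clean-first window of size W at the LRU end, prefer a clean page, and only if all of them are dirty fall back to the LRU page. The page representation is an assumption.

```python
# CFLRU victim choice: pages are ordered from LRU to MRU; search the clean-first
# window (the W least recently used pages) for a clean page first, since evicting
# it costs no flash write; only if all of them are dirty, evict the LRU page.
def choose_victim(pages_lru_to_mru, window_w):
    window = pages_lru_to_mru[:window_w]          # clean-first region
    for page in window:
        if not page["dirty"]:
            return page["id"]                     # free eviction, no write-back
    return pages_lru_to_mru[0]["id"]              # all dirty: pay for the LRU one

pool = [  # LRU ............................................. MRU
    {"id": "P1", "dirty": True},  {"id": "P2", "dirty": True},
    {"id": "P3", "dirty": False}, {"id": "P4", "dirty": True},
    {"id": "P5", "dirty": False}, {"id": "P6", "dirty": True},
]
print(choose_victim(pool, window_w=4))   # P3: the LRU clean page inside the window
```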

  48. [K & V, VLDB 2008] Cost-based replacement in hybrid systems • Similar to the previous idea, but for hybrid setups • SSD and HDD for persistent storage • Divide the buffer pool into two regions • Time region: typical LRU • Cost region: four LRU queues, one per cost class • Clean flash • Clean magnetic • Dirty flash • Dirty magnetic • Order queues based on cost • Evict from time region to cost region • Final victim is always from the cost region [Figure: buffer pool split into a time region and a cost region whose queues are ordered by eviction cost] Koltsidas and Viglas, SIGMOD 2011

  49. [Stoica, Athanassoulis, Johnson & Ailamaki, DAMON 2009] Append and pack • Convert random writes to sequential ones • Shim layer between the storage manager and the SSD • On eviction, group dirty pages into blocks that are multiples of the erase unit • Do not overwrite old versions; instead, write the block sequentially and invalidate the old versions • Pay the price of a few extra reads but save the cost of random writes [Figure: random writes from the storage manager pass through the shim layer, which groups them, writes them sequentially to SSD persistent storage and invalidates the old versions] Koltsidas and Viglas, SIGMOD 2011
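
A sketch of the shim layer's write path: dirty pages evicted from the buffer pool are packed into erase-unit-sized groups, written with one sequential I/O, and the old on-SSD versions are only invalidated. The group size and the two callbacks are illustrative assumptions.

```python
# Append-and-pack shim: buffer dirty evictions, and once an erase-unit-sized
# group has accumulated, write it with one sequential I/O and invalidate the
# pages' old locations instead of updating them in place.
PAGES_PER_ERASE_UNIT = 64   # assumed group size (a multiple of the erase unit)

class AppendPackShim:
    def __init__(self, write_sequential, invalidate):
        self.write_sequential = write_sequential   # callback: sequential SSD write
        self.invalidate = invalidate               # callback: mark old version invalid
        self.pending = []

    def on_evict(self, page_id, data):
        self.pending.append((page_id, data))
        if len(self.pending) == PAGES_PER_ERASE_UNIT:
            self.write_sequential(self.pending)    # one large sequential write
            for pid, _ in self.pending:
                self.invalidate(pid)               # old copies become garbage
            self.pending = []

shim = AppendPackShim(
    write_sequential=lambda grp: print(f"sequential write of {len(grp)} pages"),
    invalidate=lambda pid: None,
)
for i in range(PAGES_PER_ERASE_UNIT):
    shim.on_evict(i, b"dirty page")
```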

  50. Caching in flash memory • Problem setup • SSD and HDD at different levels of the storage hierarchy • Flash memory used as a cache for HDD pages • Research questions • When and how to use the SSD as a cache? • Which pages to bring into the cache? • How to choose victim pages? • Methodologies • Optimal choice of hardware configuration • SSD as a read cache • Flash-resident extended buffer pools Koltsidas and Viglas, SIGMOD 2011
