Caching for Flash-Based Databases
Summer Semester 2013
Sougata Bhattacharjee



  1. Caching for Flash-Based Databases • Summer Semester 2013 • Sougata Bhattacharjee

  2. OUTLINE • MOTIVATION • FLASH MEMORY • FLASH CHARACTERISTICS • FLASH SSD ARCHITECTURE • FLASH TRANSLATION LAYER • PAGE REPLACEMENT ALGORITHM • ADAPTIVE REPLACEMENT POLICY • FLASH-AWARE ALGORITHMS • CLEAN-FIRST LRU ALGORITHM • CLEAN-FIRST DIRTY-CLUSTERED (CFDC) ALGORITHM • AD-LRU ALGORITHM • CASA ALGORITHM • CONCLUSION • REFERENCES

  3. MOTIVATION • Data storage technology: HDDs and DRAM. HDDs suffer from HIGH LATENCY; DRAM comes at a HIGHER PRICE. • Data explosion: the worldwide data volume is growing at an astonishing speed. In 2007 we had 281 EB of data; in 2011, 1800 EB. • Energy consumption: in 2005, the total power used by servers in the USA was 0.6% of the country's total annual electricity consumption. • We need to find a memory technology which may overcome these limitations. http://faculty.cse.tamu.edu/ajiang/Server.pdf

  4. FLASH MEMORY: BACKGROUND • In 1980, Dr. Fujio Masuoka invented flash memory. • In 1988, Intel Corporation introduced flash chips. • In 1995, M-Systems introduced flash-based solid-state drives. • What is flash? Flash memory is an electronic non-volatile semiconductor storage device that can be electrically erased and reprogrammed. • Three operations: Program (Write), Erase, and Read. • Two major forms: NAND flash and NOR flash; NAND is newer and much more popular.

  5. FLASH AND THE MEMORY HIERARCHY • Higher levels of the hierarchy offer higher speed at higher cost; lower levels offer larger size. • Flash is faster, has lower latency, and is more reliable than hard disks, but is more expensive. • NAND flash timings: READ ~50 μs, WRITE ~200 μs, ERASE very slow.

  6. WHY IS FLASH POPULAR? • Benefits over magnetic hard drives: • Semiconductor technology, no mechanical parts. • Lower access latencies. • High data transfer rate. • Higher reliability (no moving parts). • Lower power consumption. • Small in size and light in weight. • Longer life span. • Benefits over RAM: • Lower price. • Lower power consumption.

  7. USE OF FLASH • Flash SSD is widening its range of applications: • Embedded devices • Desktop PCs and laptops • Servers and supercomputers http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2011/20110811_S308_Cooke.pdf , Page 2

  8. FLASH OPERATIONS • A flash device is organized into blocks (Block 1 … Block n), each containing pages. • Three operations: Read, Write, Erase. • Reads and writes are done at the granularity of a page (2 KB or 4 KB). • A flash block is much larger than a disk block: it contains p (typically 32 - 128) fixed-size flash pages of 512 B - 2 KB. • Erasures are done at the granularity of a block; a block endures 10,000 - 100,000 erasures. • Block erase is the slowest operation, requiring about 2 ms. • In-place update of flash pages is not possible; overwriting a block requires erasing it first.
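The page/block asymmetry above can be made concrete with a minimal model (the sizes and names are illustrative, not tied to any real device): pages are programmed and read individually, but erasure, which also tracks wear, is possible only for a whole block, and a programmed page cannot be overwritten before an erase.

```python
# Minimal model of a NAND flash block. Hypothetical geometry: 64 pages.
# None marks an erased (free) page.

PAGES_PER_BLOCK = 64

class FlashBlock:
    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK   # all pages start erased
        self.erase_count = 0                    # wear indicator

    def program(self, page_no, data):
        # Writing is per page, but only into an erased page.
        if self.pages[page_no] is not None:
            raise ValueError("erase-before-write: page already programmed")
        self.pages[page_no] = data

    def read(self, page_no):
        # Reading is also per page.
        return self.pages[page_no]

    def erase(self):
        # Erasure works only at whole-block granularity.
        self.pages = [None] * PAGES_PER_BLOCK
        self.erase_count += 1
```

Trying to program an already-programmed page fails, which is exactly why FTLs resort to out-of-place updates.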

  9. FLASH OPERATIONS • In-place update of flash pages is not possible; overwriting a block requires erasing it first. • Modified DB pages are therefore written to free pages of a new block (out-of-place update), and the full old block is erased afterwards.

  10. FLASH CONSTRAINTS • Cons1: write/erase granularity asymmetry. • Cons2: erase-before-write rule. • Cons3: limited cell lifetime. • Remedies: invalidation with out-of-place updates plus garbage collection, logical-to-physical mapping, and wear leveling.

  11. FLASH MEMORY STRUCTURE • Various operations need to be carried out to ensure correct operation of a flash device: mapping, garbage collection, and wear leveling. • The Flash Translation Layer (FTL) sits between the file system and the flash device and controls flash management. • It hides the complexities of device management (mapping, garbage collection, wear leveling) from the application. • It enables mobility: flash becomes plug and play.

  12. MAPPING TECHNIQUES (1/2) • Three types of basic mappings: page-level mapping, block-level mapping, and hybrid mapping. • Page-level mapping (LPN → PPN): each page is mapped independently. • Highest performance potential. • Highest resource use: a large mapping table.

  13. MAPPING TECHNIQUES (2/2) • Block-level mapping (LBN → PBN): only block numbers are kept in the mapping table; page offsets remain unchanged. • Example with 4 pages per block: logical page 7 lies in logical block 7 / 4 = 1 at offset 7 mod 4 = 3. • Small mapping table. • Bad performance for write updates.
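The block-level translation above is pure arithmetic plus one small table lookup; a sketch using the slide's 4-pages-per-block example (the table contents are made up for illustration):

```python
# Block-level address translation: with 4 pages per block, logical page 7
# maps to logical block 7 // 4 = 1 at offset 7 % 4 = 3. Only the block
# number goes through the mapping table; the offset is carried over.

PAGES_PER_BLOCK = 4
block_map = {0: 5, 1: 9}          # hypothetical LBN -> PBN entries

def translate(lpn):
    lbn, offset = divmod(lpn, PAGES_PER_BLOCK)
    pbn = block_map[lbn]          # one table entry per block, not per page
    return pbn, offset            # physical page = (pbn, unchanged offset)
```

With the table above, `translate(7)` yields physical block 9 at offset 3, matching the slide's example.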

  14. FTL BLOCK-LEVEL MAPPING (BEST CASE) • k flash blocks B1 … Bk, g log blocks L1 … Lg, and free blocks F. • Switch: L1 becomes B1, then the old B1 is erased. • Cost: 1 erase operation.

  15. FTL BLOCK-LEVEL MAPPING (GENERAL CASE) • k flash blocks B1 … Bk, g log blocks L1 … Lg, and free blocks F. • Merge: B1 and L1 are merged into a free block Fi; then L1 and B1 are erased. • Cost: 2 erase operations. • In general, merging n flash blocks and one log block into Fi costs n+1 erasures.

  16. GARBAGE COLLECTION • Moves valid pages out of blocks containing invalid data and then erases those blocks. • Removes invalid pages and increases the number of free pages. • Wear leveling decides where to write the new data: it picks the most frequently erased blocks and the least worn-out blocks and swaps their content to equalize overall usage, enhancing the lifespan of the flash device.
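The garbage-collection step described above can be sketched in a few lines (all names are illustrative; page contents are plain values and None marks an invalid, i.e. superseded, page): valid pages are relocated to a free block, then the victim is erased in one operation.

```python
# Sketch of one garbage-collection pass over a victim block.

def garbage_collect(victim, free_block):
    moved = [p for p in victim if p is not None]   # keep valid pages only
    free_block[:len(moved)] = moved                # relocate them
    victim[:] = [None] * len(victim)               # whole-block erase
    return len(moved)                              # relocation cost in pages
```

The return value hints at why GC is expensive: every valid page in the victim block must be copied before the erase can reclaim the space.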

  17. BASICS OF PAGE REPLACEMENT • Find the location of the desired page on disk. • Find a free frame: • If a free frame exists, use it. • Otherwise, use a page replacement algorithm to select a victim page. • Load the desired page into the frame. • Update the page allocation table (the page mapping in the buffer). • Upon the next page replacement, repeat the whole process.
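The steps above can be sketched as one buffer-fix routine (the function and parameter names are illustrative; the replacement policy is passed in as a function so any of the later algorithms could plug in):

```python
# Generic page-fix logic: hit, free frame, or eviction via a policy.

def fix_page(page, frames, free, pick_victim):
    if page in frames:                 # hit: page already buffered
        return frames[page]
    if free:                           # a free frame exists: use it
        frame = free.pop()
    else:                              # otherwise select a victim page
        victim = pick_victim(frames)
        frame = frames.pop(victim)     # reuse the victim's frame
    frames[page] = frame               # update the page-mapping table
    return frame
```

For example, passing `lambda f: next(iter(f))` as `pick_victim` evicts in FIFO order; the flash-aware policies later in the talk differ only in this victim-selection step.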

  18. THE REPLACEMENT CACHE PROBLEM • Cache is FAST but EXPENSIVE; HDDs are SLOW but CHEAP. • How to manage the cache? Which page to replace? How to maximize the hit ratio?

  19. PAGE REPLACEMENT ALGORITHMS (1/2) • Least Recently Used (LRU): removes the least recently used items first. • Constant time and space complexity; simple to implement. • Expensive to maintain statistically significant usage statistics. • Does not exploit "frequency". • Not scan-resistant. • Example (reference string C A B D E F D G E, 4 frames): each page fault evicts the least recently used page, starting with C.
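LRU's constant-time behavior comes from keeping pages in access order; a compact sketch using Python's ordered dictionary (class and method names are mine, not from the talk):

```python
from collections import OrderedDict

# Constant-time LRU: a hit moves the page to the MRU end; a miss past
# capacity evicts the page at the LRU end.

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()          # insertion order = LRU -> MRU

    def access(self, page):
        """Return the evicted page on a capacity miss, else None."""
        if page in self.pages:
            self.pages.move_to_end(page)    # hit: becomes most recent
            return None
        self.pages[page] = True             # miss: load at MRU end
        if len(self.pages) > self.capacity:
            victim, _ = self.pages.popitem(last=False)  # evict LRU page
            return victim
        return None
```

With capacity 3 and accesses A, B, C, A, D, the hit on A saves it and the miss on D evicts B, the least recently used page.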

  20. PAGE REPLACEMENT ALGORITHMS (2/2) • Least Frequently Used (LFU): removes the least frequently used items first. • Scan-resistant. • Logarithmic time complexity (per request). • Stale pages can remain a long time in the buffer. • LRU + LFU = LRFU (Least Recently/Frequently Used): exploits both recency and frequency. • Better performance than LRU and LFU, but logarithmic time complexity and space and time overhead. • Adaptive Replacement Cache (ARC) is a solution.

  21. ARC (ADAPTIVE REPLACEMENT CACHE) CONCEPT • General double cache structure (total size 2C): the cache is partitioned into two LRU lists L1 and L2. • L1 contains recently seen pages: the recency list. • L2 contains pages seen at least twice recently: the frequency list. • If L1 contains exactly C pages, replace the LRU page of L1; otherwise, replace the LRU page of L2.

  22. ARC CONCEPT • ARC structure (cache size C): divide L1 into T1 (MRU end) and B1 (LRU end); divide L2 into T2 (MRU end) and B2 (LRU end). • T1 and T2 together hold the C cached pages; B1 and B2 are history ("ghost") lists. • T1 plus B1 hold at most C entries, and the same holds for T2 plus B2. • Upon a page request: if the page is found in T1 or T2, move it to the MRU position of T2. • On a cache miss, the new page is added at the MRU position of T1. • If T1 is full, the LRU page of T1 is moved to the MRU position of B1.

  23. ARC PAGE EVICTION RULE • ARC adapts a parameter P according to the observed workload; P determines the target size of T1. • If the requested page is found in B1, P is increased and the page is moved to the MRU position of T2. • If the requested page is found in B2, P is decreased and the page is moved to the MRU position of T2.
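The adaptation rule above can be sketched as follows. This is a deliberately simplified version that omits several boundary cases of the published algorithm (Megiddo and Modha, FAST 2003) - in particular the exact ghost-list trimming and the variable adaptation step size - and is meant only to show how p steers evictions between T1 and T2:

```python
from collections import OrderedDict

# Simplified ARC sketch: T1/T2 hold cached pages, B1/B2 are ghost lists,
# p is the adaptive target size of T1.

class SimpleARC:
    def __init__(self, c):
        self.c, self.p = c, 0
        self.t1, self.t2 = OrderedDict(), OrderedDict()
        self.b1, self.b2 = OrderedDict(), OrderedDict()

    def _replace(self):
        # Evict from T1 if it exceeds its target p, else from T2;
        # evicted pages become ghosts in B1/B2.
        if self.t1 and len(self.t1) > self.p:
            k, _ = self.t1.popitem(last=False)
            self.b1[k] = True
        elif self.t2:
            k, _ = self.t2.popitem(last=False)
            self.b2[k] = True

    def access(self, page):
        if page in self.t1 or page in self.t2:     # hit: promote to T2 MRU
            self.t1.pop(page, None)
            self.t2.pop(page, None)
            self.t2[page] = True
            return "hit"
        if page in self.b1:                        # ghost hit: grow T1 target
            self.p = min(self.c, self.p + 1)
            del self.b1[page]
            self._replace()
            self.t2[page] = True
            return "ghost-b1"
        if page in self.b2:                        # ghost hit: shrink T1 target
            self.p = max(0, self.p - 1)
            del self.b2[page]
            self._replace()
            self.t2[page] = True
            return "ghost-b2"
        if len(self.t1) + len(self.t2) >= self.c:  # plain miss: make room
            self._replace()
        self.t1[page] = True                       # new page enters T1
        while len(self.b1) > self.c:
            self.b1.popitem(last=False)            # bound the ghost lists
        while len(self.b2) > self.c:
            self.b2.popitem(last=False)
        return "miss"
```

A ghost hit in B1 means a recently evicted one-timer was requested again, so the recency side T1 deserves more room; a ghost hit in B2 argues for the frequency side, exactly the self-tuning described on the slide.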

  24. HOW ARC WORKS? (1/2) • Example (figure): T1 and B1 capture recency, T2 and B2 capture frequency; a reference string is processed step by step (time 0 - 9).

  25. HOW ARC WORKS? (2/2) • Continuing the example (time 9 - 17): a scan pushes pages through T1 only, so page B drops out of the list: ARC is scan-resistant. • A ghost hit increases T1 and decreases B1; later, another increases T2 and decreases B2: ARC is self-tuning.

  26. ARC ADVANTAGES • ARC is scan-resistant. • ARC is self-tuning and empirically universal. • Stale pages do not remain in memory: better than LFU. • ARC consumes about 10% - 15% more time than LRU, but its hit ratio is almost twice that of LRU. • Low space overhead for the 'B' (ghost) lists.

  27. FLASH-AWARE BUFFER TECHNIQUES • The cost of a page write is much higher than that of a page read. • The buffer manager decides HOW and WHEN to write. • Goal: minimize the number of physical write operations. • Page-based policies: • CFLRU (Clean-First LRU) • LRUWSR (LRU Write Sequence Reordering) • CCFLRU (Cold-Clean-First LRU) • AD-LRU (Adaptive Double LRU) • Policies that read/write entire flash blocks (addressing the FRW problem): • FAB (Flash Aware Buffer) • REF (Recently-Evicted-First)

  28. CLEAN-FIRST LRU ALGORITHM (1/3) • One of the earliest flash-aware buffer techniques. • CFLRU is based on the LRU replacement policy. • The LRU list is divided into two regions: • Working region: recently accessed pages. • Clean-first region (a window of size W, here W = 4): pages considered for eviction. • CFLRU always evicts clean pages from the clean-first region first, to save flash write costs. • If there is no clean page in this region, the dirty page at the end of the LRU list is evicted.

  29. CLEAN-FIRST LRU ALGORITHM (2/3) • CFLRU always evicts clean pages from the clean-first region first, to save flash write costs. • If there is no clean page in this region, the dirty page at the end of the LRU list is evicted. • Example: pages are evicted in the order P7, P5, P8, P6.
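CFLRU's victim selection reduces to a short scan over the window at the LRU end; a sketch (the list representation is my own: ordered MRU to LRU, entries are `(page, is_dirty)`):

```python
# CFLRU victim selection: prefer a clean page inside the clean-first
# window of size w; fall back to the LRU dirty page if the window holds
# no clean page.

def cflru_victim(lru_list, w):
    window = lru_list[-w:]                 # the clean-first region
    for page, dirty in reversed(window):   # scan from the LRU end inward
        if not dirty:
            return page                    # clean page: no write-back cost
    return lru_list[-1][0]                 # all dirty: evict the LRU page
```

Evicting a clean page costs nothing on flash, while a dirty victim forces a physical write, which is precisely the cost CFLRU tries to defer.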

  30. CLEAN-FIRST LRU ALGORITHM (3/3) • Disadvantages: • CFLRU has to search a long list in case of a buffer fault. • Keeping dirty pages in the clean-first region can waste buffer space. • The window size W of the clean-first region must be determined. • CFDC (Clean-First, Dirty-Clustered) addresses these issues.

  31. CFDC (CLEAN-FIRST, DIRTY-CLUSTERED) ALGORITHM • Implements a two-region scheme; the buffer is divided into: • 1. Working region: keeps hot pages. • 2. Priority region: assigns priorities to pages. • The clean-first region of CFLRU is split into two queues, a clean queue and a dirty queue: clean and dirty pages are separated. • Dirty pages are grouped into clusters according to spatial locality; clusters are ordered by priority. • Clean pages are always chosen first as victims. • Otherwise, a dirty page is evicted from the LRU end of the cluster with the lowest priority.

  32. CFDC ALGORITHM: PRIORITY FUNCTION • For a cluster c with n pages, its priority P(c) is computed (Formula 1) from the inter-page distances (IPD) of the page numbers P0, …, Pn-1, ordered by their time of entering the cluster, and from the cluster's age. • Example (global time 10, cluster timestamps 4, 2, 3, 6): the clusters obtain priorities 2/9, 1/8, 1/14, and 1/18; the cluster with the lowest priority (1/18) supplies the victim page.
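The formula itself did not survive the transcript, so the sketch below reconstructs it from the CFDC paper (Ou et al., DaMoN 2009) and should be treated as an assumption: P(c) = Σ|Pi − Pi−1| / (n² · (globaltime − timestamp(c))), with a single-page cluster getting 1 / (globaltime − timestamp). Large clusters of nearly sequential pages that have aged longest then get the lowest priority and are flushed first, which matches the slide's example numbers.

```python
# CFDC cluster priority (reconstructed formula, treat as an assumption).
# pages: page numbers in order of entering the cluster.

def cluster_priority(pages, timestamp, globaltime):
    n = len(pages)
    age = globaltime - timestamp
    if n == 1:
        return 1 / age                  # single-page cluster
    ipd = sum(abs(pages[i] - pages[i - 1]) for i in range(1, n))
    return ipd / (n * n * age)          # low IPD + old cluster -> low priority
```

For instance, a three-page sequential cluster (5, 6, 7) with timestamp 6 at global time 10 has IPD 2 and priority 2 / (9 · 4) = 1/18, the lowest-priority victim cluster in the slide's example.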

  33. CFDC ALGORITHM: EXPERIMENTS • CFDC improves over CFLRU by 41%; CFLRU improves over LRU by 6%. • Cost of page flushes: clustered writes are efficient. • Influence of increasing update ratios: CFDC is on par with LRU for update-intensive workloads. • Number of page flushes: CFDC's write count is close to CFLRU's.

  34. CFDC ALGORITHM: CONCLUSION • Reduces the number of physical writes. • Improves the efficiency of page flushing. • Keeps a high hit ratio. • The size of the priority window remains a concern for CFDC; CASA addresses this by adjusting list sizes dynamically.

  35. AD-LRU (ADAPTIVE DOUBLE LRU) ALGORITHM • AD-LRU integrates recency, frequency, and cleanness into the buffer replacement policy. • Two LRU queues: • Cold LRU: keeps pages referenced once. • Hot LRU: keeps pages referenced at least twice (frequency). • FC (first-clean) indicates the victim page; min_lc is the lower bound on the cold queue's size. • On a page miss, the size of the cold queue is increased. • If the buffer is full, cold clean pages are evicted from the cold LRU queue. • If no cold clean page is found, a cold dirty page is evicted using a second-chance algorithm.
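The two-step victim search in the cold queue can be sketched as follows (representation is my own: the queue is ordered MRU to LRU and entries are `(page, is_dirty, ref_bit)` tuples; the second-chance handling of reference bits follows the classic clock scheme, as an assumption about the paper's details):

```python
# AD-LRU victim search over the cold queue: first-clean, then second
# chance among cold dirty pages.

def adlru_victim(cold_queue):
    for page, dirty, _ in reversed(cold_queue):   # first-clean search
        if not dirty:
            return page                           # cheap victim: no write-back
    while True:                                   # all cold pages are dirty
        page, dirty, ref = cold_queue[-1]         # inspect the LRU entry
        if ref:                                   # referenced: one more chance
            cold_queue[-1] = (page, dirty, False) # clear the reference bit
            cold_queue.insert(0, cold_queue.pop())  # rotate to the MRU end
        else:
            return page                           # unreferenced dirty victim
```

The second-chance sweep gives recently referenced dirty pages one rotation through the queue before they are flushed, which keeps truly cold dirty pages as the preferred victims.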

  36. AD-LRU ALGORITHM: EVICTION POLICY • Example (figure) with a buffer of 9 pages, split into a hot queue and a cold queue of clean and dirty pages, each ordered MRU → LRU. • The AD-LRU victim is the clean cold page nearest the LRU end of the cold queue. • If no clean cold page is found, a dirty cold page is chosen as victim using a second-chance algorithm.

  37. AD-LRU ALGORITHM: EXPERIMENTS • Write count vs. buffer size was measured for various workload patterns: random, read-most, Zipf, and write-most. • AD-LRU has the lowest write count.

  38. AD-LRU ALGORITHM: CONCLUSION • AD-LRU considers reference frequency, an important property of reference patterns that is more or less ignored by CFLRU. • AD-LRU frees the buffer from cold pages as soon as appropriate. • AD-LRU is self-tuning. • AD-LRU is scan-resistant.

  39. CASA (COST-AWARE SELF-ADAPTIVE) ALGORITHM • CASA trades off physical reads against physical writes and adapts automatically to varying workloads. • The buffer pool is divided into two dynamic lists: a clean list Lc and a dirty list Ld, with b = |Lc| + |Ld|. • Both lists are ordered by reference recency. • CASA continuously adjusts a parameter τ, 0 ≤ τ ≤ b. • τ is the dynamic target size of Lc, so the target size of Ld is b − τ. • On a buffer fault, τ decides from which list the victim page is chosen.

  40. CASA ALGORITHM • CASA considers both read and write costs. • CASA also considers the status (read or write) of a requested page. • Case 1: a logical read request hits Lc: τ is increased. • Case 2: a logical write request hits Ld: τ is decreased.
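The two adaptation cases can be sketched as a tiny state machine (names are illustrative; the step size is simplified to 1 here, whereas the real policy weights the step by the device's read/write cost ratio):

```python
# CASA adaptation sketch: tau is the target size of the clean list Lc.

class CASASketch:
    def __init__(self, b):
        self.b = b            # total buffer size b = |Lc| + |Ld|
        self.tau = b // 2     # initial target size of Lc

    def on_read_hit_in_clean(self):
        # Case 1: logical read request in Lc -> grow the clean list target.
        self.tau = min(self.b, self.tau + 1)

    def on_write_hit_in_dirty(self):
        # Case 2: logical write request in Ld -> shrink the clean list target.
        self.tau = max(0, self.tau - 1)

    def victim_list(self, len_clean, len_dirty):
        # On a buffer fault, evict from whichever list exceeds its target.
        return "clean" if len_clean > self.tau else "dirty"
```

Reads hitting the clean list are evidence that clean pages are worth keeping, so τ grows; write hits on dirty pages argue for holding dirty pages longer, so τ shrinks, steering evictions toward the cheaper list.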

  41. CASA ALGORITHM: EXAMPLE (1/2) • Total buffer size b = 13, τ = 6: target size of Lc = 6, Ld = 7. • Incoming page 14 is read in Lc. • Case 1: a logical read request in Lc, so τ is increased to 7: target size of Lc = 7, Ld = 6.

  42. CASA ALGORITHM: EXAMPLE (2/2) • Total buffer size b = 13, τ = 7: target size of Lc = 7, Ld = 6. • Incoming page 15 is written in Ld. • Case 2: a logical write request in Ld, so τ is decreased to 6: target size of Lc = 6, Ld = 7.

  43. CASA ALGORITHM: CONCLUSION • CASA is designed for two-tier storage systems based on homogeneous storage devices with asymmetric read/write costs. • CASA detects the cost ratio dynamically. • CASA is self-tuning: it adapts itself to varying cost ratios and workloads.

  44. CONCLUSION • Flash memory is a widely used, reliable, and flexible non-volatile memory for storing software code and data. • However, the performance behavior of flash devices remains unpredictable due to the complexity of FTL implementations and their proprietary nature. • To evaluate performance more reliably, a flash device simulator is needed. • We addressed buffer management for two-tier storage systems (caching for a flash-based DB); ARC and CASA are two promising approaches. • Phase-change memory (PCM) is a promising next-generation memory technology that could also serve database storage systems.

  45. REFERENCES • Yi Ou: Caching for Flash-Based Databases and Flash-Based Caching for Databases. Ph.D. Thesis, University of Kaiserslautern, Verlag Dr. Hut, August 2012. • Nimrod Megiddo, Dharmendra S. Modha: ARC: A Self-Tuning, Low Overhead Replacement Cache. FAST 2003: 115-130. • Nimrod Megiddo, Dharmendra S. Modha: Outperforming LRU with an Adaptive Replacement Cache Algorithm. IEEE Computer 37(4): 58-65 (2004). • Yi Ou, Theo Härder: Clean First or Dirty First? A Cost-Aware Self-Adaptive Buffer Replacement Policy. IDEAS 2010: 7-14. • Seon-Yeong Park, Dawoon Jung, Jeong-Uk Kang, Jin-Soo Kim, Joonwon Lee: CFLRU: A Replacement Algorithm for Flash Memory. CASES 2006: 234-241. • Yi Ou, Theo Härder, Peiquan Jin: CFDC: A Flash-Aware Replacement Policy for Database Buffer Management. DaMoN 2009: 15-20. • Peiquan Jin, Yi Ou, Theo Härder, Zhi Li: AD-LRU: An Efficient Buffer Replacement Algorithm for Flash-Based Databases. Data Knowl. Eng. 72: 83-102 (2012). • Suman Nath, Aman Kansal: FlashDB: Dynamic Self-Tuning Database for NAND Flash. IPSN 2007: 410-419. • Kyoungmoon Sun, Seungjae Baek, Jongmoo Choi, Donghee Lee, Sam H. Noh, Sang Lyul Min: LTFTL: Lightweight Time-Shift Flash Translation Layer for Flash Memory Based Embedded Storage. EMSOFT 2008: 51-58. • Nimrod Megiddo, Dharmendra S. Modha: System and Method for Implementing an Adaptive Replacement Cache Policy. US 6996676 B2, 2006. • Wikipedia: Flash memory. • Wikipedia: Page replacement algorithm. • N. Megiddo, D. S. Modha: Adaptive Replacement Cache. IBM Almaden Research Center, April 2003. • Yang Hu, Hong Jiang, Dan Feng, Lei Tian, Shu Ping Zhang, Jingning Liu, Wei Tong, Yi Qin, Liuzheng Wang: Achieving Page-Mapping FTL Performance at Block-Mapping FTL Cost by Hiding Address Translation. MSST 2010: 1-12.

  46. THANK YOU
