
Gecko Storage System



Presentation Transcript


  1. Gecko Storage System
  Tudor Marian, Lakshmi Ganesh, and Hakim Weatherspoon
  Cornell University

  2. Gecko
  • Save power by spinning down/powering off disks
  • E.g. a RAID-1 mirror scheme with 5 primaries/mirrors
  • The file system (FS) access pattern on disk is arbitrary
  • It depends on FS internals, and gets worse as the FS ages
  • When to turn disks off? What if the prediction is wrong?
  [Diagram: write(fd,…) and read(fd,…) requests arriving at the block device]

  3. Predictable Writes
  • Access the same disks predictably for long periods
  • Amortize the cost of spinning disks down and back up
  • Idea: log-structured storage/file system
  • Writes go to the head of the log until the disk(s) are full
  [Diagram: write(fd,…) appended at the log head of the block device; log head and log tail marked]

  4. Unpredictable Reads
  • What about reads? They may access any part of the log!
  • Keep only the “primary” disks spinning
  • Trade off read throughput for power savings
  • Can afford to spin up disks on demand as load surges
  • The file/buffer cache absorbs read traffic anyway
  [Diagram: read(fd,…) and write(fd,…) requests against the block device; log head and log tail marked]

  5. Stable Throughput
  • Unlike LFS, reads do not interfere with writes
  • Keep data from the head (recently written) disks in the file cache
  • Log cleaning is not on the critical path
  • Can afford to incur the penalty of on-demand disk spin-up
  • Return reads from the primary, clean the log from the mirror
  [Diagram: read(fd,…) served from the primary while write(fd,…) appends at the log head; log tail marked]

  6. Design
  [Diagram of the Linux storage stack: Virtual File System (VFS), file/buffer cache, file mapping layer, disk filesystems, generic block layer, device mapper, I/O scheduling layer (anticipatory, CFQ, deadline, noop), block device drivers, block device]

  7. Design Overview
  • Log-structured storage at the block level
  • Akin to SSD wear-leveling; in fact, it supersedes the on-chip wear-leveling of SSDs
  • The design works with RAID-1, RAID-5, and RAID-6
  • RAID-5 ≈ RAID-4 due to the append nature of the log
  • The parity drive(s) are not a bottleneck since writes are appends
  • Prototyped as a Linux kernel dm (device-mapper) target
  • A real, high-performance, deployable implementation

  8. Challenges
  • dm-gecko
  • All I/O requests at this storage layer are asynchronous
  • SMP-safe: leverages all available CPU cores
  • Maintain large in-core (RAM) memory maps
  • Battery-backed NVRAM, and persistently stored on SSD
  • Map: virtual block <-> linear block <-> disk block (8 sectors)
  • To keep the maps manageable: block size = page size (4 KB)
  • The FS layered on top uses block size = page size
  • Log cleaning/garbage collection (gc) runs in the background
  • Efficient cleaning policy: clean when write I/O capacity is available
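A minimal user-space C sketch of the kind of one-level in-core map described here: virtual block <-> linear (log) block, with block size equal to page size. The names (struct gecko_map, v2l, l2v, GECKO_FREE) are illustrative assumptions, not the actual dm-gecko code, which runs in the kernel and persists its maps to battery-backed NVRAM and SSD.

    /* Illustrative sketch only: a user-space model of the in-core maps the
     * slide describes. All names are hypothetical, not dm-gecko's. */
    #include <stdint.h>
    #include <stdlib.h>

    #define GECKO_BLOCK_SIZE 4096u      /* block size == page size (4 KiB) */
    #define GECKO_FREE       UINT32_MAX /* sentinel: block is not mapped   */

    struct gecko_map {
        uint32_t *v2l;       /* virtual block -> linear (log) block     */
        uint32_t *l2v;       /* linear block  -> virtual block (for gc) */
        uint32_t  nr_blocks; /* blocks in the circular ring             */
        uint32_t  head;      /* next linear block to be written         */
        uint32_t  tail;      /* oldest live linear block                */
    };

    static struct gecko_map *gecko_map_alloc(uint32_t nr_blocks)
    {
        struct gecko_map *m = calloc(1, sizeof(*m));
        if (!m)
            return NULL;
        m->nr_blocks = nr_blocks;
        m->v2l = malloc(nr_blocks * sizeof(uint32_t));
        m->l2v = malloc(nr_blocks * sizeof(uint32_t));
        if (!m->v2l || !m->l2v) {
            free(m->v2l);
            free(m->l2v);
            free(m);
            return NULL;
        }
        for (uint32_t i = 0; i < nr_blocks; i++)
            m->v2l[i] = m->l2v[i] = GECKO_FREE; /* everything starts unmapped */
        return m;
    }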

  9. Commodity Architecture
  • Dell PowerEdge R710, dual-socket multi-core CPUs
  • Battery-backed RAM
  • OCZ RevoDrive PCIe x4 SSD
  • 2 TB Hitachi HDS72202 disks

  10. dm-gecko
  • In-memory map (one level of indirection)
  • Virtual block: the conventional block array exposed to the VFS
  • Linear block: the collection of blocks structured as a log
  • Circular ring structure
  • E.g. READs are simply indirected through the map
  [Diagram: a read of a virtual block indirected to its linear (log) block; log head, log tail, free and used blocks marked]
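A minimal sketch of that read indirection, assuming the hypothetical gecko_map layout from the sketch after slide 8 (v2l maps each virtual block to the linear block currently holding its data).

    #include <stdint.h>

    #define GECKO_FREE UINT32_MAX   /* sentinel: block was never written */

    struct gecko_map {
        uint32_t *v2l;       /* virtual -> linear block */
        uint32_t  nr_blocks;
    };

    /* Translate a virtual block number into the linear (log) block to read
     * from. GECKO_FREE means the block is unmapped (reads back as zeroes). */
    static uint32_t gecko_read_lookup(const struct gecko_map *m, uint32_t vblock)
    {
        if (vblock >= m->nr_blocks)
            return GECKO_FREE;
        return m->v2l[vblock];  /* READs are simply indirected through the map */
    }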

  11. dm-gecko
  • WRITE operations append to the log head:
  • Allocate/claim the next free block
  • Schedule log compacting/cleaning (gc) if necessary
  • Dispatch the write I/O to the new block
  • Update the maps and the log on I/O completion
  [Diagram: a write of a virtual block appended at the log head of the linear block device; free and used blocks marked]
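A sketch of that write path under the same hypothetical gecko_map layout. For brevity it updates the maps inline; as the slide notes, the real implementation updates them only once the write I/O completes.

    #include <stdint.h>

    #define GECKO_FREE UINT32_MAX

    struct gecko_map {
        uint32_t *v2l, *l2v;  /* forward and reverse block maps */
        uint32_t  nr_blocks;  /* blocks in the circular ring    */
        uint32_t  head;       /* log head: next block to claim  */
        uint32_t  nr_free;    /* free blocks left in the ring   */
    };

    /* Claim the next free block at the log head for a write to vblock.
     * Returns the linear block to dispatch the write I/O to, or GECKO_FREE
     * if the ring is full and gc must be scheduled first. */
    static uint32_t gecko_write_claim(struct gecko_map *m, uint32_t vblock)
    {
        if (m->nr_free == 0)
            return GECKO_FREE;                   /* caller schedules gc */

        uint32_t lblock = m->head;
        m->head = (m->head + 1) % m->nr_blocks;  /* append: advance the head */
        m->nr_free--;

        /* Overwrites leave the old copy stale; gc reclaims it later. */
        if (m->v2l[vblock] != GECKO_FREE)
            m->l2v[m->v2l[vblock]] = GECKO_FREE;

        m->v2l[vblock] = lblock;
        m->l2v[lblock] = vblock;
        return lblock;
    }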

  12. dm-gecko
  • TRIM operations free the block:
  • Schedule log compacting/cleaning (gc) if necessary
  • Fast-forward the log tail if the tail block was trimmed
  [Diagram: a trimmed virtual block releasing its linear (log) block; log head, log tail, free and used blocks marked]
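A sketch of the TRIM bookkeeping with the same hypothetical fields; after dropping the mapping, the tail is fast-forwarded over any stale blocks so they become free space.

    #include <stdint.h>

    #define GECKO_FREE UINT32_MAX

    struct gecko_map {
        uint32_t *v2l, *l2v;
        uint32_t  nr_blocks;
        uint32_t  head, tail;
        uint32_t  nr_free;
    };

    static void gecko_trim(struct gecko_map *m, uint32_t vblock)
    {
        uint32_t lblock = m->v2l[vblock];
        if (lblock == GECKO_FREE)
            return;                      /* nothing mapped, nothing to free */

        m->v2l[vblock] = GECKO_FREE;     /* drop the virtual mapping        */
        m->l2v[lblock] = GECKO_FREE;     /* the log block is now stale      */

        /* Fast-forward the tail past stale blocks, reclaiming them. */
        while (m->tail != m->head && m->l2v[m->tail] == GECKO_FREE) {
            m->tail = (m->tail + 1) % m->nr_blocks;
            m->nr_free++;
        }
    }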

  13. Log Cleaning
  • Garbage collection (gc) compacts the blocks
  • Relocate the used block that is closest to the tail
  • Repeat until compact enough (e.g. down to a watermark), or fully contiguous
  • Use spare I/O capacity; do not run when the I/O load is high
  • More than enough CPU cycles to spare (e.g. 2x quad-core)
  [Diagram: a used block near the tail relocated to the log head of the linear block device]
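One gc step as pure map bookkeeping on the hypothetical gecko_map; the data array stands in for the disk blocks. A real cleaner relocates blocks with asynchronous I/O, stops at a watermark, and handles the completely full ring, all of which is elided here.

    #include <stdint.h>
    #include <string.h>

    #define GECKO_FREE       UINT32_MAX
    #define GECKO_BLOCK_SIZE 4096u

    struct gecko_map {
        uint32_t *v2l, *l2v;
        uint32_t  nr_blocks;
        uint32_t  head, tail;
        uint32_t  nr_free;
        uint8_t  *data;          /* stand-in for the on-disk blocks */
    };

    static void gecko_gc_one_block(struct gecko_map *m)
    {
        /* Find the used (live) block closest to the tail. */
        uint32_t src = m->tail;
        while (src != m->head && m->l2v[src] == GECKO_FREE)
            src = (src + 1) % m->nr_blocks;
        if (src == m->head)
            return;                           /* log is already contiguous */

        /* Relocate it to the head and repoint both maps. */
        uint32_t vblock = m->l2v[src];
        uint32_t dst = m->head;
        m->head = (m->head + 1) % m->nr_blocks;
        m->nr_free--;
        memcpy(m->data + (size_t)dst * GECKO_BLOCK_SIZE,
               m->data + (size_t)src * GECKO_BLOCK_SIZE, GECKO_BLOCK_SIZE);
        m->v2l[vblock] = dst;
        m->l2v[dst]    = vblock;
        m->l2v[src]    = GECKO_FREE;

        /* The old copy and any stale run behind it can now be reclaimed. */
        while (m->tail != m->head && m->l2v[m->tail] == GECKO_FREE) {
            m->tail = (m->tail + 1) % m->nr_blocks;
            m->nr_free++;
        }
    }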

  14. Gecko I/O Requests
  • All I/O requests at the storage layer are asynchronous
  • The storage stack is allowed to reorder requests
  • The VFS, file-system mapping layer, and file/buffer cache play nice
  • Un-cooperating processes may trigger inconsistencies
  • Read/write and write/write conflicts are fair game
  • Log cleaning interferes with storage-stack requests
  • SMP-safe solution that leverages all available CPU cores
  • Request ordering is enforced as needed, at block granularity

  15. Request Ordering
  • Block b has no prior pending requests:
  • Allow a read or write request to run; mark the block with ‘pending IO’
  • Allow gc to run; mark the block as ‘being cleaned’
  • Block b has prior pending read/write requests:
  • Allow read or write requests; track the number of ‘pending IO’ requests
  • If gc needs to run on block b, defer it until all read/write requests have completed (zero ‘pending IOs’ on block b)
  • Block b is being relocated by the gc:
  • Discard gc requests on the same block b (this doesn’t actually occur)
  • Defer all read/write requests until gc has completed on block b
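The three cases above amount to a small per-block state record. A sketch with hypothetical names (block_state, may_start_io, may_start_gc); the real dm-gecko enforces the same rules inside the kernel's asynchronous I/O path, under appropriate locking.

    #include <stdbool.h>
    #include <stdint.h>

    struct block_state {
        uint32_t pending_io;     /* outstanding read/write requests on the block */
        bool     being_cleaned;  /* gc is currently relocating the block         */
    };

    /* Read/write request: runs unless the gc holds the block; otherwise it is
     * deferred until the gc completes. */
    static bool may_start_io(struct block_state *b)
    {
        if (b->being_cleaned)
            return false;        /* defer behind the cleaner     */
        b->pending_io++;         /* track 'pending IO' per block */
        return true;
    }

    static void complete_io(struct block_state *b)
    {
        if (b->pending_io > 0)
            b->pending_io--;
    }

    /* gc request: runs only when there is no pending read/write I/O; a second
     * gc on the same block is simply discarded. */
    static bool may_start_gc(struct block_state *b)
    {
        if (b->being_cleaned || b->pending_io > 0)
            return false;        /* discard or defer, respectively */
        b->being_cleaned = true;
        return true;
    }

    static void complete_gc(struct block_state *b)
    {
        b->being_cleaned = false; /* deferred reads/writes may now proceed */
    }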

  16. Limitations
  • In-core memory map (there are two maps)
  • A simple, direct map requires lots of memory
  • A multi-level map is complex
  • Akin to virtual memory paging, only simpler
  • Fetch large portions of the map on demand from the larger SSD
  • The current prototype uses two direct maps
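A back-of-the-envelope sketch of why a direct map is memory-hungry, assuming 4-byte map entries (an assumption; the 4 KiB block size is from slide 8): each direct map costs about 1 MiB of RAM per GiB of storage, so two direct maps over a single 2 TiB disk already need roughly 4 GiB.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const uint64_t block_size   = 4096;              /* 4 KiB blocks           */
        const uint64_t entry_size   = sizeof(uint32_t);  /* assumed 4-byte entries */
        const uint64_t storage_size = 2ull << 40;        /* 2 TiB, about one disk  */

        uint64_t nr_blocks = storage_size / block_size;
        uint64_t per_map   = nr_blocks * entry_size;     /* one direct map          */
        uint64_t both_maps = 2 * per_map;                /* the prototype keeps two */

        printf("blocks: %llu, per map: %llu MiB, both maps: %llu MiB\n",
               (unsigned long long)nr_blocks,
               (unsigned long long)(per_map >> 20),
               (unsigned long long)(both_maps >> 20));
        return 0;
    }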
