
Trends in Scalable Storage System Design and Implementation

This talk surveys the history of, and recent advances in, scalable storage system design, covering scalable file systems, the Google File System, the Hadoop file system, and related research, along with the challenges of making storage systems faster and more scalable and the approaches taken to address them.

Presentation Transcript


  1. March 15, 2012 Prof. Matthew O’Keefe Department of Electrical and Computer Engineering University of Minnesota Trends in Scalable Storage System Design and Implementation

  2. Organization of Talk • History: work in 1990’s and 2000’s • Scalable File Systems • Google File System • Hadoop File System • Sorrento, Ceph, Ward Swarms • Comment on FLASH • Questions…

  3. Prior Work Part I: Generating Parallel Code • Detecting parallelism in scientific codes, generating efficient parallel code • Historically, this had been done on a loop-by-loop basis • Distributed memory parallel computers required more aggressive optimization • Parallel programming is still a lot like assembly language programming • The scope of code to analyze and optimize increases as parallelism increases • What’s needed is a way to express the problem solution at a much higher level, from which efficient code can be generated • Leverage design patterns and translation technologies to reduce the semantic gap

  4. Exploiting Problem Structure

  5. Prior Work Part II: Making Storage Systems Go Faster and Scale More

  6. Storage System Scalability/Speed • Storage interface standards lacked the ability to scale in both speed and connectivity • Industry countered this with a new standard: Fibre Channel • Allowed shared disks, but system software like file systems and volume managers was not built to exploit this • Started the Global File System project at the U. of Minnesota in 1995 to address this • Started Sistina Software to commercialize GFS and LVM, sold to Red Hat in 2003 • Worked for Red Hat for 1.5 years, then (being a glutton for punishment) started another company (Alvarri) to do cloud backup in 2006 • Alvarri sold to Quest Software in late 2008; its dedupe engine is part of Quest’s backup product • So two commercial products developed and still shipping

  7. Making Storage Systems Faster and More Scalable • GFS pioneered several interesting techniques for cluster file systems: • no central metadata server • distributed journals for performance, fast recovery • first Distributed Lock Manager for Linux — now used in other cluster projects in Linux • Implemented POSIX IO • Assumption at the time was: POSIX is all there is, have to implement that • Kind of naïve, assumed that it had to be possible • UNIX/Windows view of files as a linear stream of bytes that can be read or written anywhere in the file by multiple processors • Large files, small files, millions of files, directory tree structure, synchronous write/read semantics, etc. all make POSIX difficult to implement

  8. Why POSIX File Systems Are Hard • They’re in the kernel and tightly integrated with complex kernel subsystems like virtual memory • Byte-granularity, coherency, randomness • Users expect them to be extremely fast, reliable, and resilient • Add parallel clients and large storage networks (e.g., Lustre or Panasas) and things get even harder • POSIX IO was the emphasis for parallel HPC IO (1999 through 2010) until recently • HPC community re-thinking this • Web/cloud has already moved on

  9. Meanwhile: Google File System and its Clone (Hadoop) • Google and others (Hadoop) went a different direction: change the interface from POSIX IO to something inherently more scalable • Users have to write (or re-write) applications to exploit the interface • All about scalability — using commodity server hardware — for a specific kind of workload • Hardware-software co-design: • append-only write semantics from parallel producers • mostly write-once, read many times by consumers • explicit contract on performance expectations: small reads and writes — Fuggedaboutit! • Obviously quite successful, and Hadoop is becoming something of an industry standard (at least the APIs) • Lesson: if solving the problem is really, really hard, look at it a different way, move interfaces around, change your assumptions (e.g., as in the parallel programming problem)
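
To make the interface shift concrete, the sketch below models the write-once, append-only, read-many contract as a tiny record log in Python. The names (AppendOnlyLog, append, scan) are hypothetical illustrations, not the GoogleFS or HDFS API.

```python
# Minimal sketch of a write-once, append-only, read-many record log,
# illustrating the interface contract GoogleFS/Hadoop favor over full POSIX.
# Names here (AppendOnlyLog, append, scan) are hypothetical, not the HDFS API.
import struct

class AppendOnlyLog:
    def __init__(self, path):
        self._path = path
        # Append mode: producers can only add records at the end,
        # never overwrite earlier bytes.
        self._out = open(path, "ab")

    def append(self, record: bytes) -> None:
        """Producer side: add one length-prefixed record at the end."""
        self._out.write(struct.pack(">I", len(record)) + record)
        self._out.flush()

    def scan(self):
        """Consumer side: stream every record from the beginning."""
        with open(self._path, "rb") as f:
            while header := f.read(4):
                (length,) = struct.unpack(">I", header)
                yield f.read(length)

# Producers append; consumers stream the whole file. There is no
# seek-and-overwrite path at all, which is what makes the interface easy to scale.
log = AppendOnlyLog("events.log")
log.append(b"page_view\tuser=42")
for rec in log.scan():
    print(rec)
```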

  10. Google/Hadoop File Systems • Google needed a storage system for its web index and various applications — enormous scale • The GFS paper (published in 2003) led to the development of Hadoop, an open-source GoogleFS clone • Co-designed file system with applications • Applications use the map-reduce paradigm • Streaming (complete) reads of very large files/datasets, process this data into reduced form (e.g., an index) • File access is write-once, append-only, read-many

  11. Map-Reduce • cat * | grep | sort | uniq -c | cat > file • input | map | shuffle | reduce | output • Simple model for parallel processing • Natural for: log processing, web search indexing, ad-hoc queries • Popular at Facebook, Google, Amazon, etc. to determine what ads/products to throw at you
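
As an illustration of the pipeline above, here is a minimal single-process word-count sketch in Python that mirrors map | shuffle | reduce. It is illustrative only and does not use the Hadoop API.

```python
# Minimal single-process sketch of map | shuffle | reduce, mirroring the
# Unix pipeline above. Illustrative only; this is not the Hadoop API.
from collections import defaultdict

def map_phase(lines):
    # map: emit (key, value) pairs; for word count, (word, 1)
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # shuffle: group all values by key (Hadoop does this across the network)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    # reduce: fold each key's values into one result; here, a count
    for key, values in grouped:
        yield key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
for word, count in sorted(reduce_phase(shuffle(map_phase(lines)))):
    print(word, count)
```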

  12. Scalable File System Goals • Build with commodity components that frequently fail (to keep things cheap) • So the design assumes failure is the common case • Nodes incrementally join and leave the cluster • Scale to tens to hundreds of petabytes, headed towards exabytes; thousands to tens of thousands of storage nodes and clients • Automated administration, simplified recovery (in theory, not practice)

  13. Hadoop Slides • Hadoop is open source • Until recently missing these GoogleFS features: • Snapshots • HA for the NameNode • Multiple writers (appenders) to a single file • GoogleFS • Within 12 to 18 months, GoogleFS was found to be inadequate for Google’s scaling needs; Google had to build federated GoogleFS clusters

  14. Other Research on Scalable Storage Clusters (none in production yet) • Ceph – Sage Weil, UCSC: POSIX lite • Multiple metadata servers, dynamic workload balancing • Mathematical hash to map file segments to nodes • Sorrento – UCSB: POSIX with low write-sharing • Distributed algorithm for capacity and load balancing, distributed metadata • Lazy consistency semantics • Ward Swarms – Lee Ward, Sandia • Similar to Sorrento, uses victim cache and storage tiering
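
To show what "a mathematical hash to map file segments to nodes" can look like, here is a small rendezvous-hash sketch in Python. Ceph's actual placement function (CRUSH) is far more elaborate, handling weights, replica counts, and failure domains; the node names and parameters below are made up for illustration.

```python
# Sketch: place file segments on storage nodes by hashing, so any client can
# compute a segment's location without consulting a central metadata table.
# This is a simple rendezvous (highest-random-weight) hash, not Ceph's CRUSH;
# node names and the replica count are made-up examples.
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]

def score(node: str, segment: str) -> int:
    digest = hashlib.sha1(f"{node}/{segment}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def place(segment: str, replicas: int = 2):
    """Return the nodes that hold this segment: the top-scoring candidates."""
    return sorted(NODES, key=lambda n: score(n, segment), reverse=True)[:replicas]

# Every client computes the same answer independently; when a node is added
# or removed, only the segments whose top-scoring nodes change have to move.
print(place("fileA:seg0"))
print(place("fileA:seg1"))
```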

  15. Reliability Motivates Action • DoE building exascale machines with 100,000+ separate multicore processors for simulations • Storage systems for current teraflop and petaflop systems do not provide sufficient reliability or speed at scale • Just adding more disks does not solve the problem, requires a whole new design philosophy

  16. Semantics Motivate Action • IEEE POSIX I/O is nice and familiar • Portable and guarantees coherent access and views • Stable API • But it cannot scale

  17. Architecture Motivates Action • Existing, accepted solutions all use the same architecture • Based on “Zebra”, it is a centrally managed metadata service and a collection of stores • Examples: Lustre, Minnesota GFS, Panasas File System, IBM GPFS, etc. • Existing at-scale file systems are challenged • Managing all the simultaneous transfers is daunting • Managing the number of components is daunting • Anticipate multiple orders of magnitude growth with the move to exascale

  18. Nebula: A Peer-To-Peer Inspired Architecture • A self-organizing store • Peers join the swarm to meet aggregate bandwidth needs • A self-reorganizing store • Highly redundant components supply symmetric services • Failure of a component or comms link only motivates the search for an equivalent peer

  19. Node-Local Capacity Management • Since we want to make more replicas than required for performance reasons • We must not tolerate a concept of “full” • We must be able to eject file segments • Can’t do this in a file system, they are persistent stores • Can do this in a cache! • But scary things can happen • If all are equal partners we can thrash • To the point where we deadlock, even • And at some point, somewhere we must guarantee persistence

  20. Cache Tiers • Need to order the storage nodes by capability • Some combination of storage latency, storage bandwidth, how full, communication performance characteristics • A tuple (which can be ordered) can approximate all of these • With an ordering, we can apply concepts like better and worse • When we can apply “better” and “worse” we can generate a set of rules • That tell us when and whether we can eject a segment • That tell us how we may eject a segment • Effectively, we will impose a tier-like structure
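
A small Python sketch of that tuple idea: represent each node's capability as an orderable tuple so that "better" and "worse" fall out of ordinary comparison. The fields and example numbers are made up; they stand in for whatever latency, bandwidth, fullness, and network measurements a real system would use.

```python
# Sketch: rank storage nodes by an orderable capability tuple, so that
# "better" and "worse" are plain comparisons. The fields and the sample
# numbers are invented for illustration.
def rank(node):
    """Smaller tuple means a "better" tier: lower latency, higher bandwidth
    (negated so larger values sort earlier), and more free capacity."""
    return (node["latency_ms"], -node["bandwidth_mbps"], node["fullness"])

nodes = [
    {"name": "ssd-front", "latency_ms": 0.2, "bandwidth_mbps": 2000, "fullness": 0.7},
    {"name": "disk-mid",  "latency_ms": 5.0, "bandwidth_mbps": 400,  "fullness": 0.3},
    {"name": "tape-back", "latency_ms": 900, "bandwidth_mbps": 100,  "fullness": 0.1},
]

tiers = sorted(nodes, key=rank)            # "better" tiers first, "worse" last
print([n["name"] for n in tiers])
print(tiers[0]["name"], "is 'better' than", tiers[-1]["name"])
```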

  21. Think of it as a Sliding Scale • At the “worse” end: • More reliable components • Larger capacity • Lower performance • Maybe even offline • Tends to less churn and older data • At the “better” end: • Less reliable, maybe even volatile • High performance • Tends to very high churn and only current data

  22. Victim-Caches • With a tier-like concept we can implement victim-caches • Just a cache that another cache ejects into • We introduce a couple of relatively simple rules to motivate all kinds of things • When ejecting, if a copy must be maintained, then eject into a victim cache that lies within a similar or “worse” tier • “Worse” translates as more persistence, potentially higher latency and lower bandwidth • When an application is recruiting for performance, the best candidates are those from “better” tiers • “Better” translates as more volatile, lower latency and higher bandwidth
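
A minimal Python sketch of that ejection rule, with hypothetical class and method names (this is not Nebula's actual interface): each cache knows its tier and, when full, pushes a segment into a victim cache at a similar or "worse" tier, which mirrors the write path on the next slide.

```python
# Sketch of the victim-cache ejection rule: when a cache fills, it ejects a
# segment into another cache at a similar or "worse" (slower, more persistent)
# tier. Class and method names are hypothetical, not Nebula's interfaces.
class TierCache:
    def __init__(self, name, tier, capacity, victim=None):
        self.name, self.tier = name, tier        # larger tier number = "worse"
        self.capacity, self.victim = capacity, victim
        self.segments = {}                       # segment id -> data

    def put(self, seg_id, data):
        if len(self.segments) >= self.capacity:
            self._eject()
        self.segments[seg_id] = data

    def _eject(self):
        seg_id, data = self.segments.popitem()   # choose any resident segment
        if self.victim is not None:
            # Rule: a copy that must survive moves only to a similar-or-worse tier.
            assert self.victim.tier >= self.tier
            self.victim.put(seg_id, data)
        # With no victim cache attached, the segment is simply dropped on the
        # assumption that a persistent copy already exists elsewhere.

# Cascade matching the write path on the next slide:
# volatile local swarm -> slower, more persistent nodes -> high-persistence source.
archive = TierCache("archive",     tier=2, capacity=1000)
mid     = TierCache("disk-swarm",  tier=1, capacity=2, victim=archive)
local   = TierCache("local-swarm", tier=0, capacity=2, victim=mid)
for i in range(5):
    local.put(f"seg{i}", b"...")
print(sorted(local.segments), sorted(mid.segments), sorted(archive.segments))
```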

  23. Data Movement: Writing • App writes into highly volatile, local swarm • Which is small and must eject • To less performant, more persistent nodes • Which must eject • To a low performance, high persistence source

  24. Reading is similar • First, find all the parts of the file • Start requesting segments • If not happy with achieved bandwidth • Find some nodes in high, “best” tier and ask them to stage parts of the file • Since they will use the same sources as the requestor, they will tend to ask near tiers to stage as well in order to provide them the bandwidth they require • Rinse, wash, repeat throughout • Start requesting segments from them as well, redundantly • As they ramp, source bias logic will shift focus onto the recruited nodes

  25. Current Status of Project • Primitive client and storage node prototype working now • Looking for students (undergrad and grad students) — happy to co-advise with other faculty • Ramping up… • Using memcached and libmemcached (name-value store) as a development and research tool: support HPC and web workloads • Nebula storage server semantics are richer • Question: how much does that help?
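
For readers unfamiliar with the name-value interface being used as the baseline here, the short Python sketch below shows the shape of it: flat keys, whole-object set/get, nothing richer. It assumes a memcached server on localhost:11211 and the python-memcached client; the project itself used libmemcached, the C client library.

```python
# The flat name-value interface memcached exposes: set and get whole objects
# by key, with no directories, byte ranges, or richer semantics. Assumes a
# memcached server on localhost:11211 and the python-memcached client.
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

# The whole namespace is one flat key space; values are opaque blobs.
mc.set("fileA:seg0", b"first 64KB of fileA")
mc.set("fileA:seg1", b"next 64KB of fileA")

data = mc.get("fileA:seg0")
print(len(data) if data is not None else "miss")
```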

  26. Software-Hardware Co-Design of First-Tier Storage Node • New hardware technologies (SSD, hybrid DRAM) are pushing the limits of the OS storage/networking stack • Question: Is it possible to co-design custom hardware and software for a first-tier storage node? • File system in VLSI: BlueArc NAS • Design an FPGA for a specific performance goal

  27. Nibbler: Co-Design Hardware and Software • Accelerates memcached and Nebula storage node performance via SSDs and hybrid DRAM • Hardware-software co-design • e.g., BlueArc/HDS puts the file system in VLSI • Large, somewhat volatile memory • Performance first, but also power and density • First-tier storage node will drive the ultimate performance achievable by this design

  28. Top 10 Lessons in Software Development • Be humble (seriously) • Be paranoid: for large projects with aggressive performance or reliability goals, you have to do just about everything right • Be open: to new ideas, new people, new approaches • Make good senior people mentor the next generation (e.g., pair programming): the temptation is to reduce/remove their distractions • Let the smartest, most capable people bubble up to the top of the organization • Perfect is the enemy of the good • Take upstream processes seriously, but beware the fuzzy front-end • Upstream (formal) inspections, not just reviews • For projects of any size, never abandon good SE practices, especially rules around coding styles, regularity, commenting and documentation • Never stop learning: software engineering practices and tools, new algorithms and libraries, etc. — and if in management, create an environment that lets people learn

  29. Questions?
