Isilon Clustered Storage OneFS Nick Kirsch
Introduction • Who is Isilon? • What Problems Are We Solving? (Market Opportunity) • Who Has These Problems? (Our Customers) • What Is Our Solution? (Our Product) • How Does It Work? (The Cool Stuff)
Who is Isilon Systems? • Founded in 2000 • Located in Seattle (Queen Anne) • IPO’d in 2006 (ISLN) • ~400 employees • Q3 2008 Revenue: $30 million, 40% Y/Y • Co-founded by Paul Mikesell, UW/CSE • I’ve been at the company for 6+ years
What Problems Are We Solving? Structured Data: • Small files • Modest-size data stores • I/O intensive • Transactional • Steady capacity growth Unstructured Data: • Larger files • Very large data stores • Throughput intensive • Sequential • Explosive capacity growth
Traditional Architectures • Data Organized in Layers of Abstraction • File System, Volume Manager, RAID • Server/Storage Architecture - “Head” and “Disk” • Scale Up (vs Scale Out) • Islands of Storage • Hard to Scale • Performance Bottlenecks • Not Highly Available • Overly Complex • Cost Prohibitive [Figure: Storage Device #1, Storage Device #2, Storage Device #3]
Who Has These Problems? • File Server Consolidation • Cloud Computing • Disk-based Archiving • Rich Media Content • HPC • By 2011, 75% of all storage capacity sold will be for file-based data • Isilon has over 850 customers today. [Chart: Worldwide File And Block Disk Storage Systems, 2005-2011 (PB). File Based: 79.3% CAGR. Block Based: 31% CAGR. Source: IDC, 2007]
What is Our Solution? Isilon IQ Clustered Storage • Enterprise-class hardware • OneFS™ intelligent software • Scales to 96 nodes • 2.3 PB (single file system) • 20 GB/s (aggregate) [Figure: A 3-node Isilon IQ Cluster]
Clustered Storage Consists Of “Nodes” • Largely Commodity Hardware • Quad-core 2.3 GHz CPU • 4 GB memory read cache • GbE and 10GbE for front-end network • 12 disks per node • InfiniBand for intra-cluster communication • High-speed NVRAM journal • Hot-swappable disks, power supplies, and fans • NFS, CIFS, HTTP, FTP • Integrates with Windows and UNIX • OneFS operating system
Isilon Network Architecture • Drop-in replacement for any NAS device • No client-side drivers required, unlike Andrew FS (Coda) or Lustre • No application changes required, unlike Google FS or Amazon S3 • No changes required to adopt. [Diagram: clients connect over Ethernet using CIFS, NFS, or either]
How Does It Work? • Built on FreeBSD 6.x (originally 5.x) • New kernel module for OneFS • Modifications to the kernel proper • User space applications • Leverage open-source where possible • Almost all of the heavy-lifting is in the kernel • Commodity Hardware • A few exceptions: • We have a high-speed NVRAM journal for data consistency • We have an InfiniBand low-latency cluster interconnect • We have a close-to-commodity SAS card (commodity chips) • A custom monitoring board (fans, temps, voltages, etc.) • SAS and SATA disks
OneFS architecture • Fully Distributed • Top Half • Initiator • Bottom Half • Participant • The OneFS architecture is basically an InfiniBand SAN • All data access across the back-end network is block-level • The participants act as very smart disk drives • Much of the back-end data traffic can be RDMA [Diagram labels: Network Operations (TCP, NFS, CIFS); FEC Calculations, Block Reconstruction; VFS layer, Locking, etc.; File-Indexed Cache; Journal and Disk Operations; Block-Indexed Cache]
OneFS architecture • OneFS started from UFS (aka FFS) • Generalized for a distributed system. • Little resemblance in code today, but concepts are there. • Almost all data structures are trees • OneFS Knows Everything – no volume manager, no RAID • Lack of abstraction allows us to do interesting things, but forces the file system to know a lot – everything. • Cache/Memory Architecture Split • “Level 1” – file cache (cached as part of the vnode) • “Level 2” – block cache (local or remote disk blocks) • Memory used for high-speed write coalescer • Much more resource intensive than a local FS
Atomicity/Consistency Guarantees • POSIX file system • Namespace operations are atomic • fsync/sync operations are guaranteed synchronous • FS data is either mirrored or FEC-protected • Meta-data is always mirrored, up to 8x • User-data can be mirrored (up to 8x) or FEC-protected (up to +4) • We use Reed-Solomon codes for FEC • Protection level can be chosen on a per-file or per-directory basis. • Some files can be at 1x (no protection) while others can be at +4 (survive four failures). • Meta-data must be protected at least as highly as anything it refers to. • All writes go to the NVRAM first as part of a distributed transaction – guaranteed to commit or abort.
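To make the FEC idea concrete, here is a minimal sketch of the simplest possible case: a single XOR parity block protecting a stripe of data blocks, so that any one lost block can be rebuilt. This is a deliberate simplification and not the Reed-Solomon coding the slide names, which tolerates more failures (up to +4). The function names and the sample stripe are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute one parity block for a stripe (a '+1' style protection)."""
    return xor_blocks(data_blocks)


def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Hypothetical stripe of three data blocks.
stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(stripe)

# Pretend the second block was lost with its disk, then rebuild it.
recovered = reconstruct([stripe[0], stripe[2]], parity)
```

With one parity block the stripe survives exactly one failure; surviving more failures needs more redundant blocks and a stronger code.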
Group Management • Transactional way to handle state changes • All nodes need to agree on their peers • Group changes: split, merge, add, remove • Group changes don’t “scale”, but are rare [Figure: nodes 1, 4 + 2, 3]
Distributed Lock Manager • Textbook-ish DLM • Anyone requesting a lock is an initiator. • Coordinator knows the definitive owner for the lock. • Controls access to locks. • Coordinator is chosen by a hash of the resource. • Split/Merge behavior • Locks are lost at merge time, not split time. • Since POSIX has no lock-revoke mechanism, advisory locks are silently dropped. • Coordinator renegotiates on split/merge. • Locking optimizations – “lazy locks” • Locks are cached. • Lock-lost callbacks. • Lock-contention callbacks.
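The "coordinator is chosen by a hash of the resource" rule can be sketched as follows. The slide does not say which hash or mapping is used, so SHA-256 and the modulo step are assumptions, as are the function name and the resource-ID format.

```python
import hashlib


def coordinator_for(resource_id: str, group: list[int]) -> int:
    """Map a lock resource to a coordinator node in the current group.

    Assumption: every node knows the same group membership, so every node
    computes the same answer without any extra communication.
    """
    members = sorted(group)  # make the result independent of list order
    digest = hashlib.sha256(resource_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(members)
    return members[index]


group = [1, 2, 3, 4]
owner = coordinator_for("inode:1234", group)

# After a split that removes node 4, the coordinator is recomputed
# against the new group, which is one way to read "renegotiates".
owner_after_split = coordinator_for("inode:1234", [1, 2, 3])
```

A plain modulo mapping moves many resources to new coordinators whenever the group changes; that is tolerable in this sketch because the deck describes group changes as rare.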
RPC Mechanism • Uses SDP on InfiniBand • Batch System • Allows you to put dependencies on the remote side. • e.g. Send 20 messages, checkpoint, send 20 messages. • Messages run in parallel, then synchronize, etc. • Coalesces errors. • Async messages (callback) • Sync messages • Update message (no response) • Used by DLM, RBM, etc. (everything)
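The batch behavior described above (send a group of messages, checkpoint, send more, with errors coalesced) can be illustrated with a small local model. This is only a sketch: it runs handlers in local threads rather than on remote nodes, and the `Batch` class, its methods, and `write_block` are hypothetical names, not the OneFS API.

```python
from concurrent.futures import ThreadPoolExecutor


class Batch:
    """Hypothetical batch: messages between checkpoints may run in parallel."""

    def __init__(self):
        self._stages = [[]]

    def send(self, handler, *args):
        self._stages[-1].append((handler, args))

    def checkpoint(self):
        # Everything queued before this point must finish before later work starts.
        self._stages.append([])

    def run(self):
        results, errors = [], []
        with ThreadPoolExecutor(max_workers=8) as pool:
            for stage in self._stages:
                futures = [pool.submit(handler, *args) for handler, args in stage]
                # Waiting on every future of the stage is the synchronization point.
                for future in futures:
                    try:
                        results.append(future.result())
                    except Exception as exc:
                        errors.append(exc)  # coalesce errors instead of failing fast
        return results, errors


def write_block(n):
    if n == 3:
        raise IOError(f"block {n} failed")
    return n


batch = Batch()
for n in range(3):
    batch.send(write_block, n)
batch.checkpoint()
for n in range(3, 6):
    batch.send(write_block, n)
results, errors = batch.run()
```

The caller gets one combined error list at the end instead of handling each failed message separately.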
Writing a file to OneFS • Writes occur via NFS, CIFS, etc. to a single node • That node coalesces data and initiates transactions • Optimizing for write performance is hard • Lots of variables • Each node might have different load • Unusual scenarios, e.g. degraded writes • Asynchronous Write Engine • Build a directed acyclic graph (DAG) • Do work as soon as dependencies are satisfied • Prioritize and pipeline work for efficiency
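A minimal sketch of "do work as soon as dependencies are satisfied", using Python's standard `graphlib.TopologicalSorter`. The task names and the dependency graph are hypothetical, and this version runs ready tasks one after another; a real write engine would dispatch them concurrently and prioritize among them.

```python
from graphlib import TopologicalSorter


def run_dag(dependencies, actions):
    """Run each task once all of the tasks it depends on have completed."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    order = []
    while sorter.is_active():
        for task in sorter.get_ready():  # every task whose dependencies are done
            actions[task]()
            order.append(task)
            sorter.done(task)  # may unblock further tasks
    return order


# task -> set of tasks it depends on (hypothetical names)
dependencies = {
    "allocate_blocks": {"compute_layout"},
    "compute_fec": {"compute_layout"},
    "write_data": {"allocate_blocks"},
    "write_fec": {"allocate_blocks", "compute_fec"},
    "commit": {"write_data", "write_fec"},
}

log = []
task_names = ["compute_layout", "allocate_blocks", "compute_fec",
              "write_data", "write_fec", "commit"]
actions = {name: (lambda name=name: log.append(name)) for name in task_names}

order = run_dag(dependencies, actions)
```

Tasks with no dependency on each other, such as `allocate_blocks` and `compute_fec` here, become ready at the same time, which is where pipelining and prioritization come in.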
Writing a file to OneFS [Diagram: servers connect to the cluster over NFS, CIFS, FTP, HTTP; optional 2nd switch]
Writing a file to OneFS • Break the write into regions • Regions are protection-group aligned • For each region: • Create a layout • Use layout to generate a plan • Execute the plan asynchronously [Diagram: plan steps: compute layout, allocate blocks, FEC compute, write block, write FEC]
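Breaking a write into protection-group-aligned regions can be sketched as a simple boundary split. This assumes a protection group covers a fixed-size byte range of the file, which the slides do not state; `split_into_regions` and `group_size` are hypothetical names.

```python
def split_into_regions(offset, length, group_size):
    """Split the byte range [offset, offset + length) at group boundaries.

    Returns a list of (region_offset, region_length) tuples, none of which
    crosses a multiple of group_size.
    """
    if length < 0 or group_size <= 0:
        raise ValueError("length must be >= 0 and group_size must be > 0")
    regions = []
    end = offset + length
    while offset < end:
        boundary = (offset // group_size + 1) * group_size  # next aligned boundary
        stop = min(boundary, end)
        regions.append((offset, stop - offset))
        offset = stop
    return regions


# A 300-byte write starting at offset 100, with 128-byte groups.
regions = split_into_regions(offset=100, length=300, group_size=128)
# regions == [(100, 28), (128, 128), (256, 128), (384, 16)]
```

Each resulting region could then get its own layout and plan, as the slide describes.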
Writing a file to OneFS • Plan executes and transaction commits • Data and parity blocks are now on disks [Diagram: data and parity blocks on three nodes; inode mirror 0 and inode mirror 1]
Reading a file from OneFS [Diagram: servers connect to the cluster over NFS, CIFS, FTP, HTTP; optional 2nd switch]
Handling Failures • What could go wrong during a single transaction? • A block-level I/O request fails • A drive goes down • A node runs out of space • A node disconnects or crashes • In a distributed system, things are expected to fail. • Most of our system calls automatically restart. • Have to be able to gracefully handle all of the above, plus much more!
Handling Failures • When a node goes “down”: • New files will use effective protection levels (if necessary) • Affected files will be reconstructed automatically per request. • That node’s IP addresses are migrated to another node. • Some data is orphaned and later garbage collected. • When a node “fails”: • New files will use effective protection levels (if necessary) • Affected files will be repaired automatically across the cluster. • AutoBalance will automatically rebalance data. • We can safely, proactively SmartFail nodes/drives: • Reconstruct data without removing the device. • If a multiple-component failure occurs, use the original device – minimizes WOR.
SmartConnect • Client must connect to a single IP address. • SmartConnect - DNS server which runs on the cluster • Customer delegates zone to the cluster DNS server • SmartConnect responds to DNS queries with only available nodes • SmartConnect can also be configured to respond with nodes based on load, connection, throughput, etc. [Diagram: clients connect over Ethernet using CIFS, NFS, or either]
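The selection logic behind "respond with only available nodes" and "respond based on load or connections" can be sketched without any DNS machinery. Everything here is hypothetical: the `Node` and `ZoneResolver` names, the policy strings, and the IP addresses are illustrative and are not SmartConnect's actual configuration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Node:
    ip: str
    available: bool
    connections: int = 0


class ZoneResolver:
    """Hypothetical resolver that answers each query with one node IP."""

    def __init__(self, nodes, policy="round_robin"):
        self.nodes = nodes
        self.policy = policy
        self._counter = count()

    def resolve(self):
        # Never hand out a node that is down.
        candidates = [node for node in self.nodes if node.available]
        if not candidates:
            return None  # no healthy node to hand out
        if self.policy == "connection_count":
            return min(candidates, key=lambda node: node.connections).ip
        # Default: rotate through the available nodes.
        return candidates[next(self._counter) % len(candidates)].ip


nodes = [
    Node("10.10.0.1", available=True, connections=5),
    Node("10.10.0.2", available=False),
    Node("10.10.0.3", available=True, connections=2),
]

round_robin = ZoneResolver(nodes)
answers = [round_robin.resolve() for _ in range(3)]

least_loaded = ZoneResolver(nodes, policy="connection_count").resolve()
```

Because the decision is made at name-resolution time, clients need nothing beyond ordinary DNS.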
We've got Lego Pieces • Accelerator Nodes • Top-Half Only • Adds CPU and Memory – no disks or journal • Only has Level 1 cache… high single-stream throughput • Storage Nodes • Both Top and Bottom Half • In Some Workloads, Bottom Half Only Makes Sense • Storage Expansion Nodes • Just a dumb extension of a Storage Node – add disks • Grow Capacity Without Performance
SmartConnect Zones
• hpc.tx.com • 10 GigE dedicated • Accelerator X nodes • NFS Failover required
• gg.tx.com • Storage nodes • NFS clients, no failover
• eng.tx.com • Shared subnet • Separate sub-domain • NFS Failover
• bizz.tx.com • Renamed sub-domain • CIFS clients (static IP)
• it.tx.com • Full access, maintenance interface • Corporate DNS, no SC • Static (well-known) IPs required
• fin.tx.com • VLAN (confidential traffic, isolated) • Same physical LAN
[Diagram labels: Processing, Finance, BizDev, Interpreters, IT, Eng; interfaces 10gige-1, ext-1, ext-2; subnets 10.10, 10.20, 10.30]
ISILON CONFIDENTIAL Initiator Software Block Diagram
Participant Software Block Diagram
System Software Block Diagram (Accelerator, Storage Node)
Too much to talk about… • Snapshots • Quotas • Replication • Bit Error Protection • Rebalancing Data • Handling Slow Drives • Statistics Gathering • I/O Scheduling • Network Failover • Native Windows Concepts (ACLs, SIDs, etc.) • Failed Drive Reconstruction • Distributed Deadlock Detection • On-the-fly Filesystem Upgrade • Dynamic Sector Repair • Globally Coherent Cache
Thank You! Questions?