Scalable Clusters

Presentation Transcript


  1. Scalable Clusters Jed Liu 11 April 2002

  2. Overview • Microsoft Cluster Service • Built on Windows NT • Provides high availability services • Presents itself to clients as a single system • Frangipani • A scalable distributed file system

  3. Microsoft Cluster Service • Design goals: • Cluster composed of COTS components • Scalability – able to add components without interrupting services • Transparency – clients see cluster as a single machine • Reliability – when a node fails, can restart services on a different node

  4. Cluster Abstractions • Nodes • Resources • e.g., logical disk volumes, NetBIOS names, SMB shares, mail service, SQL service • Quorum resource • Implements persistent storage for the cluster configuration database and change log • Resource dependencies • Tracks dependencies between resources

  5. Cluster Abstractions (cont’d) • Resource groups • The unit of migration: resources in the same group are hosted on the same node • Cluster database • Configuration data for starting the cluster is kept in a database, accessed through the Windows registry. • Database is replicated at each node in the cluster.

  6. Node Failure • Active members broadcast periodic heartbeat messages • Failure suspicion occurs when a node misses two successive heartbeat messages from some other node • Regroup algorithm gets initiated to determine new membership information • Resources that were online at a failed member are brought online at active nodes
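
A minimal sketch of the failure-suspicion rule above, in Python; the tick interval, the monitor class, and its method names are illustrative assumptions, not MSCS internals.

```python
import time

HEARTBEAT_INTERVAL = 1.0   # seconds; illustrative, not the real MSCS value
MISSED_LIMIT = 2           # suspicion after two missed heartbeats (per the slide)

class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each active member."""

    def __init__(self, members):
        now = time.monotonic()
        self.last_seen = {node: now for node in members}

    def record_heartbeat(self, node):
        self.last_seen[node] = time.monotonic()

    def suspects(self):
        """Return nodes that have missed two successive heartbeats."""
        now = time.monotonic()
        return [node for node, seen in self.last_seen.items()
                if now - seen > MISSED_LIMIT * HEARTBEAT_INTERVAL]

def on_tick(monitor, start_regroup):
    suspected = monitor.suspects()
    if suspected:
        # Failure suspicion triggers the regroup algorithm so the
        # surviving members can agree on the new membership.
        start_regroup(suspected)
```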

  7. Member Regroup Algorithm • Lockstep algorithm • Activate. Each node waits for a clock tick, then starts sending and collecting status messages • Closing. Determine whether partitions exist and whether the current node is in a partition that should survive • Pruning. Prune the surviving group so that all nodes are fully connected

  8. Regroup Algorithm (cont’d) • Cleanup. Surviving nodes update their local membership information as appropriate • Stabilized. Done
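
The phases on these two slides can be read as a small lockstep procedure. The Python sketch below shows only the ordering of the stages; the simple majority rule used for the Closing decision is an assumption for illustration, and the real MSCS survival rule is more elaborate.

```python
def regroup(my_id, status_reports, connectivity):
    """Lockstep regroup sketch.

    status_reports: node ids whose status messages were collected in
                    the Activate stage.
    connectivity:   dict mapping node id -> set of reachable node ids.
    Returns the new membership if this node survives, else None.
    """
    # Closing: decide whether this node's partition should survive.
    # A simple majority rule stands in for the real MSCS tie-breaking.
    partition = {n for n in status_reports if my_id in connectivity.get(n, set())}
    partition.add(my_id)
    if len(partition) * 2 <= len(connectivity):
        return None   # losing partition: this node halts

    # Pruning: drop members until everyone left can reach everyone else.
    survivors = set(partition)
    changed = True
    while changed:
        changed = False
        for n in list(survivors):
            if not (survivors - {n}) <= connectivity.get(n, set()):
                survivors.discard(n)
                changed = True

    # Cleanup: survivors install the new membership; Stabilized: done.
    return survivors
```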

  9. Joining a Cluster • Sponsor authenticates the joining node • Denies access if applicant isn’t authorized to join • Sponsor sends version info of config database • Also sends updates as needed, if changes were made while applicant was offline • Sponsor atomically broadcasts information about applicant to all other members • Active members update local membership information
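
A sketch of the sponsor's side of the join sequence, in Python. The data classes, field names, and the `broadcast` callable are assumptions used only to show the ordering: authenticate, ship the missed config-database updates, then atomically announce the new member.

```python
from dataclasses import dataclass

@dataclass
class Update:
    version: int
    payload: dict

@dataclass
class Node:
    node_id: str
    db_version: int = 0

    def apply_updates(self, updates):
        for u in updates:
            self.db_version = max(self.db_version, u.version)

def sponsor_join(applicant, authorized, db_log, broadcast):
    """Sponsor-side join sketch; `broadcast` stands in for the atomic
    broadcast provided by the global update manager."""
    # 1. Authenticate: refuse applicants that are not authorized to join.
    if applicant.node_id not in authorized:
        raise PermissionError(f"{applicant.node_id} may not join")

    # 2. Send config-database version info, plus any updates the
    #    applicant missed while it was offline.
    applicant.apply_updates(u for u in db_log if u.version > applicant.db_version)

    # 3. Atomically announce the applicant; every active member then
    #    updates its local membership information.
    broadcast(("member-joined", applicant.node_id))
```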

  10. Forming a Cluster • Use local registry to find address of quorum resource • Acquire ownership of quorum resource • Arbitration protocol ensures that at most one node owns quorum resource • Synchronize local cluster database with master copy
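
A sketch of the forming sequence under stated assumptions: the in-memory `QuorumResource` class is only a stand-in for the real quorum device and its arbitration protocol, and the registry lookup is reduced to a comment.

```python
class QuorumResource:
    """In-memory stand-in for the quorum device used by this sketch."""
    def __init__(self, master_db):
        self._owner = None
        self._master_db = master_db

    def try_acquire(self, owner):
        # MSCS arbitrates ownership through the quorum device itself;
        # a simple compare-and-set stands in for that protocol here.
        if self._owner in (None, owner):
            self._owner = owner
            return True
        return False

    def read_master_database(self):
        return dict(self._master_db)

def form_cluster(node_id, quorum, local_db):
    # The quorum resource's address would come from the local registry.
    # Acquire ownership; arbitration ensures at most one node wins, so
    # two partitioned nodes cannot both form the cluster.
    if not quorum.try_acquire(owner=node_id):
        raise RuntimeError("another node owns the quorum resource")
    # Synchronize the local cluster database with the master copy kept
    # on the quorum resource before coming online.
    master = quorum.read_master_database()
    if master.get("version", 0) > local_db.get("version", 0):
        local_db.update(master)
    return "online"
```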

  11. Leaving a Cluster • Member sends an exit message to all other cluster members and shuts down immediately • Active members gossip about exiting member and update their cluster databases

  12. Node States • Inactive nodes are offline • Active members are either online or paused • All active nodes participate in cluster database updates, vote in the quorum algorithm, maintain heartbeats • Only online nodes can take ownership of resource groups

  13. Resource Management • Achieved by invoking calls through a resource control library (implemented as a DLL) • Through this library, MSCS can monitor the state of the resource
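
The Python interface below is only a rough analogue of such a control library (the real entry points are C functions in a DLL); the method and class names are chosen for illustration.

```python
from abc import ABC, abstractmethod

class ResourceControl(ABC):
    """Rough analogue of a resource control library: the cluster
    service drives state changes and polls health through these calls."""

    @abstractmethod
    def online(self):
        """Bring the resource online on the current node."""

    @abstractmethod
    def offline(self):
        """Take the resource offline cleanly (e.g., before migration)."""

    @abstractmethod
    def is_alive(self):
        """Report whether the resource is still healthy."""

class FileShareResource(ResourceControl):
    """Hypothetical example resource type."""
    def __init__(self, path):
        self.path = path
        self._up = False

    def online(self):
        self._up = True      # a real DLL would export the SMB share here

    def offline(self):
        self._up = False

    def is_alive(self):
        return self._up
```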

  14. Resource Migration • Reasons for migration: • Node failure • Resource failure • Resource group prefers to execute at a different node • Operator-requested migration • In the first case, resource group is pulled to new node • In all other cases, resource group is pushed

  15. Pushing a Resource Group • All resources in the group are brought offline at the old host • Old host node chooses a new host • Local copy of MSCS at new host brings up the resource group

  16. Pulling a Resource Group • Active nodes capable of hosting the group determine amongst themselves the new host for the group • New host chosen based on attributes that are stored in the cluster database • Since database is replicated at all nodes, decision can be made without any communication! • New host brings online the resource group
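
Because the cluster database (including each group's host preferences) is replicated at every node, every active node can run the same deterministic choice and agree on the new host without exchanging messages. A small Python sketch; the preference-list-then-sorted-name ordering is an illustrative rule, not the exact MSCS policy.

```python
def choose_new_host(group, active_nodes, cluster_db):
    """Deterministic host selection for a pulled resource group.

    All inputs come from the replicated cluster database, so every
    active node computes the same result independently.
    """
    preferences = cluster_db["groups"][group]["preferred_owners"]
    candidates = [n for n in preferences if n in active_nodes]
    if not candidates:
        candidates = sorted(active_nodes)   # deterministic fallback
    return candidates[0]

def pull_group(group, my_node, active_nodes, cluster_db, bring_online):
    # Only the node that the shared rule selects brings the group online.
    if choose_new_host(group, active_nodes, cluster_db) == my_node:
        bring_online(group)
```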

  17. Client Access to Resources • Normally, clients access SMB resources using names of the form \\node\service • This presents a problem – as resources migrate between nodes, the resource name will change • With MSCS, whenever a resource migrates, the resource’s network name also migrates as part of the resource group • Clients see only services and their network names – the cluster becomes a single virtual node

  18. Membership Manager • Maintains consensus among active nodes about which nodes are defined as cluster members and which are currently active • A join mechanism admits new members into the cluster • A regroup mechanism determines current membership on start-up or suspected failure

  19. Global Update Manager • Used to implement atomic broadcast • A single node in the cluster is always designated as the locker • Locker node takes over atomic broadcast in case original sender fails in mid-broadcast
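
A sketch of the locker idea in Python: the update is recorded at the locker before delivery begins, so if the sender fails mid-broadcast the locker can finish delivering it. Message handling, retries, and ordering are omitted, and the class and field names are assumptions.

```python
class Member:
    """Minimal stand-in for a cluster node in this sketch."""
    def __init__(self, name):
        self.name = name
        self.state = {}
        self.pending = None

    def apply(self, update):
        key, value = update
        self.state[key] = value        # applying twice is harmless (idempotent)

def global_update(sender, locker, members, update):
    """Sketch of a global (atomic) update."""
    locker.pending = update            # locker now guarantees completion
    for node in members:
        if node not in (sender, locker):
            node.apply(update)
    sender.apply(update)
    locker.apply(update)
    locker.pending = None              # update fully delivered

def locker_recovery(locker, survivors):
    """Run at the locker when the original sender is declared failed
    mid-broadcast: finish delivering the pending update."""
    if locker.pending is not None:
        for node in survivors:
            node.apply(locker.pending)
        locker.pending = None
```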

  20. Frangipani • Design goals: • Provide users with coherent, shared access to files • Arbitrarily scalable to provide more storage, higher performance • Highly available in spite of component failures • Minimal human administration • Full and consistent backups can be made of the entire file system without bringing it down • Complexity of administration stays constant despite the addition of components

  21. Server Layering • Diagram: user programs run on top of Frangipani file servers, which rely on the distributed lock service and the Petal distributed virtual disk service, which in turn sits on the physical disks

  22. Assumptions • Frangipani servers trust: • One another • Petal servers • Lock service • Meant to run in a cluster of machines that are under a common administration and can communicate securely

  23. System Structure • Frangipani implemented as a file system option in the OS kernel • All file servers read and write the same file system data structures on the shared Petal disk • Each file server keeps a redo log in Petal so that when it fails, another server can access log and recover

  24. System Structure (diagram) • On each client machine: user programs → file system switch → Frangipani file server module → Petal device driver • Across the network: Petal servers (together forming the Petal virtual disk) and lock servers

  25. Security Considerations • Any Frangipani machine can access and modify any block of the Petal virtual disk • Must run only on machines with trusted OSes • Petal servers and lock servers should also run on trusted OSes • All three types of components should authenticate one another • Network security also important: eavesdropping should be prevented

  26. Disk Layout • 2^64 bytes of addressable disk space, partitioned into regions: • Shared configuration parameters • Logs – each server owns a part of this region to hold its private log • Allocation bitmaps – each server owns parts of this region for its exclusive use • Inodes, small data blocks, large data blocks
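
A toy sketch of carving the 2^64-byte Petal address space into the regions listed above; the region sizes, slice sizes, and server limit are invented for illustration and do not match Frangipani's actual layout constants.

```python
ADDRESS_SPACE = 2 ** 64            # bytes of Petal virtual disk

# Region sizes below are illustrative only.
CONFIG_REGION = 1 << 30            # shared configuration parameters
LOG_REGION    = 1 << 40            # carved into per-server private logs
BITMAP_REGION = 1 << 40            # carved into per-server allocation bitmaps
MAX_SERVERS   = 256

def log_slice(server_id):
    """Byte range of a server's private log within the log region."""
    assert 0 <= server_id < MAX_SERVERS
    size = LOG_REGION // MAX_SERVERS
    start = CONFIG_REGION + server_id * size
    return (start, start + size)

def bitmap_slice(server_id):
    """Byte range of a server's exclusive allocation-bitmap slice."""
    assert 0 <= server_id < MAX_SERVERS
    size = BITMAP_REGION // MAX_SERVERS
    start = CONFIG_REGION + LOG_REGION + server_id * size
    return (start, start + size)

# Inodes, small data blocks, and large data blocks occupy the rest.
METADATA_START = CONFIG_REGION + LOG_REGION + BITMAP_REGION
```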

  27. Logging and Recovery • Only log changes to metadata – user data is not logged • Use write-ahead redo logging • Log implemented as a circular buffer • When log fills, reclaim oldest ¼ of buffer • Need to be able to find end of log • Add monotonically increasing sequence numbers to each block of the log
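
A sketch of the circular redo log with per-block sequence numbers, in Python; the log size is arbitrary, and reclamation of the oldest quarter is reduced to a comment.

```python
LOG_BLOCKS = 1024          # illustrative circular-log capacity

class RedoLog:
    """Circular write-ahead redo log; only metadata changes are logged."""

    def __init__(self):
        self.blocks = [None] * LOG_BLOCKS    # each entry: (sequence_no, record)
        self.next_seq = 1
        self.head = 0

    def append(self, record):
        # When the log fills, the oldest quarter would be reclaimed once
        # the corresponding metadata writes have reached the disk; that
        # reclamation step is omitted from this sketch.
        self.blocks[self.head] = (self.next_seq, record)
        self.next_seq += 1
        self.head = (self.head + 1) % LOG_BLOCKS

    def find_end(self):
        """The end of the log is just past the block holding the largest
        sequence number, since sequence numbers only ever increase."""
        best_seq, end = 0, 0
        for i, block in enumerate(self.blocks):
            if block is not None and block[0] > best_seq:
                best_seq, end = block[0], (i + 1) % LOG_BLOCKS
        return end
```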

  28. Concurrency Considerations • Need to ensure logging and recovery work in the presence of multiple logs • Updates requested to same data by different servers are serialized • Recovery applies a change only if it was logged under an active lock at the time of failure • To ensure this, never replay an update that has already been completed • keep a version number on each metadata block

  29. Concurrency Considerations (cont’d) • Ensure that only one recovery daemon is replaying the log of a given server • Do this through an exclusive lock on the log
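
The "never replay a completed update" rule from these two slides, sketched in Python: each metadata block carries a version number, and a logged change is applied only if it is newer than what is already on disk. The record shape and the on-disk representation are assumptions.

```python
from dataclasses import dataclass

@dataclass
class LogRecord:
    block: int       # metadata block number
    version: int     # version the block carries after this update
    data: bytes

def recover(log_records, disk):
    """Replay a failed server's redo log (sketch).

    `disk` maps block number -> (version, data). The caller is assumed
    to hold an exclusive lock on the log, so only one recovery daemon
    replays it.
    """
    for rec in log_records:
        current_version, _ = disk.get(rec.block, (0, None))
        # Never replay an update that has already been completed: a
        # version on disk at least as new means the change (or a later
        # one made under the same lock) is already there.
        if rec.version > current_version:
            disk[rec.block] = (rec.version, rec.data)
```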

  30. Cache Coherence • When lock service detects conflicting lock requests, current lock holder is asked to release or downgrade lock • Lock service uses read locks and write locks • When a read lock is released, corresponding cache entry must be invalidated • When a write lock is downgraded, dirty data must be written to disk • Releasing a write lock = downgrade to read lock, then release
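
A sketch of the lock-holder's side of this protocol: on a conflicting request, a read lock is released by dropping the cache entry, while a write lock is first flushed and downgraded, then released if the requester needs write access. The cache-entry class and return values are invented for illustration.

```python
from dataclasses import dataclass

READ, WRITE = "read", "write"

@dataclass
class CacheEntry:
    mode: str                 # READ or WRITE
    dirty: bool = False
    data: bytes = b""

    def write_back(self):
        pass                  # a real server would write dirty data to Petal here

def on_revoke(cache, lockable, requested_mode):
    """Lock-holder side of a conflicting request from the lock service."""
    entry = cache.get(lockable)
    if entry is None:
        return "released"

    if entry.mode == WRITE:
        # Downgrading a write lock: dirty data must reach disk first.
        if entry.dirty:
            entry.write_back()
            entry.dirty = False
        entry.mode = READ
        if requested_mode == READ:
            return "downgraded"       # both sides can now hold read locks

    # Releasing a read lock: the cached copy may go stale, so drop it.
    del cache[lockable]
    return "released"
```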

  31. Synchronization • Division of on-disk data structures into lockable segments is designed to avoid lock contention • Each log is lockable • Bitmap space divided into lockable units • Unallocated inode or data block is protected by lock on corresponding piece of the bitmap space • A single lock protects the inode and any file data that it points to

  32. Locking Service • Locks are sticky – they’re retained until someone else needs them • Client failure dealt with by using leases • Network failures can prevent a Frangipani server from renewing its lease • Server discards all locks and all cached data • If there was dirty data in the cache, Frangipani throws errors until file system is unmounted

  33. Locking Service Hole • If a Frangipani server’s lease expires due to temporary network outage, it might still try to access Petal • Problem basically caused by lack of clock synchronization • Can be fixed without synchronized clocks by including a lease identifier with every Petal request
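
The fix mentioned above, sketched from the Petal side: every write carries the lease identifier the Frangipani server believes it holds, and writes made under an expired or superseded lease are rejected. The lease table, duration, and error type are assumptions for illustration.

```python
import time

LEASE_DURATION = 30.0      # seconds; illustrative value

class LeaseTable:
    """Current lease held by each Frangipani server, as known to Petal
    in this sketch."""
    def __init__(self):
        self._leases = {}                         # server -> (lease_id, expiry)

    def grant(self, server, lease_id):
        self._leases[server] = (lease_id, time.monotonic() + LEASE_DURATION)

    def is_current(self, server, lease_id):
        held = self._leases.get(server)
        return (held is not None and held[0] == lease_id
                and time.monotonic() < held[1])

def petal_write(leases, server, lease_id, block, data, disk):
    """Petal-side check: refuse writes made under a stale lease, so a
    server cut off by a network outage cannot corrupt the shared disk."""
    if not leases.is_current(server, lease_id):
        raise PermissionError("stale lease: write rejected")
    disk[block] = data
```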

  34. Adding and Removing Servers • Adding a server is easy! • Just point it to a Petal virtual disk and a lock service, and it automagically gets integrated • Removing a server is even easier! • Just take a sledgehammer to it • Alternatively, if you want to be nicer, you can flush dirty data before using the sledgehammer

  35. Backups • Just use the snapshot features that are built into Petal to do backups • Resulting snapshot is crash-consistent: reflects state reachable if all Frangipani servers were to crash • This is good enough – if you restore the backup, recovery mechanism can handle the rest

  36. Summary • Microsoft Cluster Service • Aims to provide reliable services running on a cluster • Presents itself as a virtual node to its clients • Frangipani • Aims to provide a reliable distributed file system • Uses metadata logging to recover from crashes • Clients see it as a regular shared disk • Adding and removing nodes is really easy
