
Veritas Cluster Server





Presentation Transcript


  1. Veritas Cluster Server Delivering Data Center Availability

  2. Agenda • Functional overview of Cluster Server • Architectural overview of Cluster Server • Future directions

  3. Agenda • Functional overview of Cluster Server • Architectural overview of Cluster Server • Future directions

  4. Types of clusters • “Cluster” is a broadly used term • High Availability (HA) clusters • Parallel processing clusters • Load balancing clusters • High Performance Computing (HPC) clusters • VCS is primarily an HA cluster • With support for some key parallel processing applications like Oracle RAC

  5. What is Availability? • Availability = availability of the application • Reduce planned and unplanned downtime of the application • Needed for server-type applications (databases, app/web servers, …) • Planned downtime: caused by • Hardware/OS/application maintenance • Unplanned downtime: caused by • Logical failures: software bugs, operator errors, viruses • Component failures: CPU, NIC, HBA, disk, software crashes • Site/infrastructure failures: power outage, natural and other disasters

  6. Before you cluster • Clustering is not a replacement for backup or storage virtualization • Avoid all single points of failure and build in redundancy • Multiple HBAs, switches, mirroring and paths to storage • Multiple NICs and network paths from clients to the clustered application • Multiple machines, power supplies, cooling systems, data centers • Redundancy alone is not sufficient. Clustering software is needed to • Monitor and orchestrate all components and specify policies to act deterministically in case of failures and other events • It is not realistic to have complete redundancy for all applications. Clustering software allows you to make different tradeoffs for different applications • Clustering software should be able to make any application highly available without making changes to the application

  7. VCS Highlights • Developed completely “in-house” • Shipping since 1998 – replaced previous-generation 2-node product • Current shipping version 4.x; 5.0 release in Q3 '06 • Released from a common code base and available on the following platforms: Solaris, Windows, AIX, Linux, HP-UX, VMware, Xen • Support for up to 32 nodes in a cluster • Single HA solution for local, campus-wide, metro area, wide area • Java GUI, Web GUI, CLI, API interfaces • Enablers • Heterogeneous storage & SAN support • Shipping standalone and as part of application-focused solutions

  8. Clust-omers & Competition • VCS is the #1 cross-platform HA product on the market (IDC) • With about 10% yearly growth • Customers include many Global 500 companies • Fidelity, Morgan Stanley, eBay, Verizon, … • Competition: Sun Cluster, Microsoft Cluster Server, IBM HACMP, HP Serviceguard, Oracle and other Linux clusters • Relative strengths compared to competition • Comprehensive, consistent solution with support for all major hardware, OSes, storage and applications • Integrated stack with support for failover and parallel applications; High Availability & Disaster Recovery • Feature richness

  9. Enablers: Storage Area Networks • Before – shared SCSI (direct attached) model: shared SCSI is very limited in connectivity and accessibility • After – SAN model: the SAN model brings IP-network-like connectivity and accessibility to storage • Data on shared disk – and even binaries – can be “imported” on a given host on the SAN • Now applications can move around freely within the cluster for Life, Liberty and the Pursuit of Availability!

  10. Enablers: Integrated HA Solutions • Foundation Stack • Veritas Volume Manager • Storage virtualization • Broad device & array support • Multi-pathing and storage path failure handling • Veritas File System • Extent-based, journaled file system • Online operations • Storage checkpoints • HA Products & Solutions • VCS • Storage Foundation HA (Storage Foundation + VCS) • Storage Foundation for Applications (Storage Foundation HA tuned for apps like Oracle, Oracle RAC, DB2, Sybase) • Tested and released together as a “train”

  11. Application Requirements for Clustering • Robustness • If the application keeps crashing, VCS will restart it within the cluster and increase availability, but not the reliability of the application • Recoverability • Application should be able to recover on its own upon restart • Application should be able to restart without rebooting the machine • Location Independence • Clients of the application should be able to reconnect after the application has failed over – using a virtual IP address that fails over with the app • Application should be able to restart on the same node or a different node – no “hostname” dependence – can be worked around, but not cleanly • Application should allow multiple instances on a machine (needed for server consolidation scenarios)

  12. Cluster & Connections [Diagram: clients of applications connect over the public network to up to 32 homogeneous machines (Machine1, Machine2, Machine3, Machine4, …), each running its applications and VCS; the machines are linked by private networks and attached to shared storage over a Storage Area Network.]

  13. Resources, Types & Agents • An application is made up of related resources • Storage resources: disks, volumes, file systems • Network resources: NIC, IP address • Application processes themselves • Resource Type is the definition of a resource – like a class definition • Resources are always of a given type – like object instances • Each Resource Type needs a corresponding agent, which is responsible for onlining, offlining and monitoring resources of that type on a given machine in the cluster

  14. Service Groups [Diagram: an Application resource requires a Database and an IP Address; the Database requires a File System on Volumes built from Physical Disks, and the IP Address requires a Network Card.] • Application is represented as a service group or a collection of related service groups • Service group is the basic unit of planned migration and unplanned failover of an application within the cluster • Service group consists of one or more resources with their dependencies • A failover service group should be online on only one node at a time • Once defined, service groups can be onlined, offlined, migrated and failed over with a declarative command like “Online Oracle1 service group on machine1” without worrying about procedural details – see the configuration sketch below
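To make the dependency picture concrete, here is a minimal main.cf-style sketch of such a service group, assuming a two-node cluster; the group, resource and attribute values (machine1, oradg, /oradata and so on) are illustrative, not taken from the presentation:

    group Oracle1 (
        SystemList = { machine1 = 0, machine2 = 1 }
        AutoStartList = { machine1 }
        )

        DiskGroup ora_dg (
            DiskGroup = oradg
            )
        Mount ora_mnt (
            MountPoint = "/oradata"
            BlockDevice = "/dev/vx/dsk/oradg/oravol"
            FSType = vxfs
            FsckOpt = "-y"
            )
        NIC ora_nic (
            Device = hme0
            )
        IP ora_ip (
            Device = hme0
            Address = "192.168.1.10"
            )
        Oracle ora_db (
            Sid = ORA1
            Owner = oracle
            Home = "/oracle/product"
            )

        ora_mnt requires ora_dg
        ora_ip requires ora_nic
        ora_db requires ora_mnt
        ora_db requires ora_ip

With such a definition in place, “Online Oracle1 service group on machine1” translates to a single command such as hagrp -online Oracle1 -sys machine1; VCS walks the dependency tree and brings the disk group, mount, NIC, IP and database online in the right order.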

  15. Service Group Failover [Diagram: several service groups – each an Application with its IP Address, Database, File System, Network Card, Volumes and Physical Disks – are spread across the clustered machines; when a machine fails, its service groups fail over to surviving machines and clients reconnect.]

  16. Flexible Clustering Architectures • Local Clustering (SAN, LAN): one cluster, VM/VCS, shared storage, SAN or direct attached • Campus Clustering – Metropolitan HA & Disaster Recovery (SAN, MAN/LAN): one cluster, VM/VCS, remote mirror, SAN attached; Fibre • Replicated Data Cluster: one cluster, Replication/VCS, replica, IP; DWDM; ESCON • Wide Area Disaster Recovery (WAN): two or more clusters, Replication/VCS, replica, IP; DWDM; ESCON

  17. Local HA Clustering (LAN) Environment • One cluster located at a single site • Redundant servers, networks, storage for applications and databases Advantages • Simpler to implement Disadvantages • Data center or site can be a single point of failure

  18. Metropolitan HA & Disaster Recovery (MAN) Clustering Environment • One cluster: servers and storage located in multiple sites are part of the same cluster Advantages • Protection against failures and localized disasters • No replication needed – quick recovery Disadvantages • SAN infrastructure limited by distance • Requires careful planning for allocation of storage across the two sites

  19. Replicated Data Cluster Environment • Cluster stretches over multiple buildings, data centers, or sites • Local storage is replicated between cluster nodes at each location • Limited solution – not used by many customers Advantages • Replication (over IP) rather than SAN -- Leverage existing IP connection • Protection against disasters local to a building, data center, or site Disadvantages • Synchronization after replication switch can be more complicated and error prone

  20. Wide Area Disaster Recovery Environment • Customers needing local failover for HA as well as remote site takeover for disaster recovery over long distance Advantages • Can support any distance using IP • Support for array, host and DB replication • Local failover before remote takeover • Single button site migration Disadvantages • Cost of a remote HOT site • Synchronization after replication switch can be more complicated and error prone VCS Fire Drill • Tools to help customers periodically test “disaster readiness”

  21. Clustering & Disaster Recovery • Having a replication solution alone is not sufficient for disaster recovery • Proper planning is needed, including people, processes and software, to help make the right decisions and do the right things in the right order • If a site is down, who/what detects it? (VCS) • Who needs to be notified? (VCS -> Administrators) • Who decides whether it is a real disaster? (People) • When you fail over all or part of a data center or site from one place to another, which applications should stay alive and which can stay down? (Admin -> VCS) • Do you migrate an application even if the corresponding replicated data is not up to date? (Admin -> VCS) • How do you make sure all applications come up in the proper order upon site migration? (VCS) • How do I periodically test disaster preparedness without taking my production application down? (VCS Fire Drill)

  22. Site Migration: Local & Global Failover [Diagram: service groups (App, DB, IP, File System, NIC, Volumes) fail over locally within the primary site and, on site migration, fail over globally to the remote site, whose storage is kept in sync by replication.]

  23. Available Agents • Bundled agents: NIC, IP, Volume, Mount, Process, Application, Service, Registry Replication, NBU, BE, … • Enterprise agents: Oracle, DB2, SQL Server, Exchange, SAP, BEA, Siebel, Sybase, … • Replication agents: EMC SRDF, Hitachi TrueCopy, NetApp replication, Oracle and DB2 replication, … • Consulting agents: 50+ agents written by the consulting group for various applications like ClearCase, TIBCO and so on • Custom agents: written by customers for custom apps

  24. Virtualization Support • Current support for server- and hardware-based virtualization • Virtual machine or partition as a machine • IBM micro-partitions • Hardware-based partitions: HP Superdome partitions, Sun partitions • Virtual machine or partition as a resource in a service group • VMware • Microsoft Virtual Server • Solaris Zones • Easier disaster recovery • Replicate the virtual machine image along with data – VCS provides failover & DR management

  25. Support for Parallel Applications • Cluster Volume Manager • Simultaneously import and read/write to volumes across cluster nodes • Application has to control mutual exclusion of reads/writes to the same blocks • Cluster File System • Simultaneously mount and read/write to file systems across cluster nodes • Unlike NFS, reads/writes go directly from/to disk (except for metadata) • Global lock management • Oracle RAC • We provide an integrated package with VCS+CVM+CFS • Support for Oracle Disk Manager: data file management tuned for Oracle • VCS supports both non-clustered and clustered versions of these apps • Support for parallel CVM, CFS and RAC only on UNIX platforms
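As one concrete illustration of CFS, the same VxFS file system can be mounted in cluster mode on every node; a sketch, assuming a Solaris-style mount command with illustrative device and mount point names:

    mount -F vxfs -o cluster /dev/vx/dsk/oradg/oravol /oradata

With the cluster option, every node that issues this mount reads and writes the file system directly over the SAN, while the Global Lock Manager keeps metadata consistent across nodes.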

  26. Cluster Management • All cluster management available using CLI, Java GUI, Web GUI & API • Consistent cluster management across all platforms • Java GUI • Fat-client based interactive UI • Will be deprecated in the future • Web GUI • HTML-based browser clients for accessibility from anywhere • Command Central Availability • Management of multiple clusters • Aggregation of information; availability and uptime reporting by business unit and location • Multi Cluster Management • Superset of Web GUI and CCA with interactivity using Macromedia Flash
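For the CLI path, a few representative commands give a feel for day-to-day management; the group and system names are illustrative:

    hastatus -sum                        # summary of systems, service groups and their states
    hagrp -state Oracle1                 # state of one service group on each system
    hagrp -switch Oracle1 -to machine2   # planned migration of a group to another node
    hares -state                         # state of individual resources

The Java GUI, Web GUI and API expose the same operations, since they all talk to the same VCS engine.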

  27. Agenda • Functional overview of Cluster Server • Architectural overview of Cluster Server • Future directions

  28. VCS Architecture [Diagram: up to 32 homogeneous machines, each running the same stack – Agents on top of the VCS engine, on top of GAB, on top of LLT – connected by private networks and shared storage; the Java GUI, Web GUI server, browser clients and CLI reach the cluster over the public network.] • Policy (VCS engine) is executed on all nodes in replicated state machine (RSM) mode

  29. LLT • Low Latency Transport: proprietary protocol on top of Ethernet – non-routable • Provides heartbeat and communication mechanism for clustered nodes over the private network • Guaranteed delivery of packets – like TCP • Supports up to 8 network links • Performs implicit link-level heartbeat across all clustered nodes • Transparent handling of link failures • Can multiplex across links and scale • Also used by CVM (Clustered Volume Manager) and CFS (Clustered File System) for communication and messaging
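For orientation, LLT is typically configured through two small files on each node. A minimal sketch for a two-node Linux cluster follows; node names, cluster ID and interface names are illustrative, and the exact syntax varies by platform and release:

    /etc/llthosts                (maps LLT node IDs to node names, identical on every node)
        0 machine1
        1 machine2

    /etc/llttab                  (per-node: node name, cluster ID, private links)
        set-node machine1
        set-cluster 100
        link link1 eth1 - ether - -
        link link2 eth2 - ether - -

Each "link" line names one private network interface carrying heartbeats and cluster traffic.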

  30. GAB • Group membership and Atomic Broadcast protocol • Consistent membership across all nodes • Guarantees ordering of broadcasts from the same node • Guarantees consistent ordering of broadcasts from multiple nodes • Depends on LLT for reliable delivery of messages • Clients register on designated ports • Reconfiguration happens when a node joins or leaves, or when a client registers with or leaves GAB on a given port • Also used by CVM and CFS for membership – provides consistent membership and cluster broadcasts for all clustered Veritas products
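GAB itself is usually started from a one-line file; a sketch, assuming a two-node cluster (the port letters reported by gabconfig correspond to the designated client ports mentioned above):

    /etc/gabtab
        /sbin/gabconfig -c -n2       # seed the cluster once 2 nodes are visible

    gabconfig -a                     # show current port memberships (e.g. port a = GAB, port h = VCS engine)

The -n value is normally the total number of nodes, so the cluster only seeds when all expected nodes are present.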

  31. Split Brain • What happens when one or more nodes lose all means of communication with the rest of the cluster? • Now the cluster does not operate as a single “brain” • Sub-clusters can decide to online the same application on different nodes at the same time • Result? Data corruption. BAD • Causes • Machine heavily loaded – a common cause • NIC or network switch failures on all links • Avoidance • Support for multiple heartbeat links • VCS components run at very high priority • Can't completely prevent split brain – failures do occur • Detection: not easy to detect, symptoms are the same as node(s) dying • Solution? I/O fencing

  32. I/O Fencing [Diagram: data disks (VxVM or CVM controlled) and a set of coordinator disks shared by the cluster.] 1) Upon split brain, the fencing components on each sub-cluster race for the coordinator disks 2) The winning sub-cluster fences out the losing sub-cluster 3) Any attempted writes from the losing sub-cluster are rejected; the losing sub-cluster may panic, depending upon configuration
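On UNIX, fencing is typically enabled through a couple of small files once a coordinator disk group exists; a hedged sketch, with the disk group name illustrative and details varying by release:

    /etc/vxfendg
        vxfencoorddg                 # name of the disk group holding the coordinator disks

    /etc/vxfenmode
        vxfen_mode=scsi3             # use SCSI-3 persistent reservations
        scsi3_disk_policy=dmp        # access coordinator disks through DMP

    vxfenadm -d                      # display the current fencing mode and membership

An odd number of coordinator disks (usually three) guarantees that exactly one sub-cluster can win the race.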

  33. Recovery Order • When a node leaves the cluster • Fencing first: nodes no longer in the cluster can't write to data disks • CVM next: perform volume metadata & other recovery • CFS next: GLM recovery; a new master is chosen for file systems mastered on the dead node • Oracle RAC next: now RAC can do its own recovery (locks, open transactions) • VCS engine in the end: the VCS engine can now decide how to react to the node failure

  34. Consistent Communication Stack Across Products [Diagram: two servers, each running RAC, CFS and CVM on top of the VCS engine, fencing, GAB and LLT, connected over redundant NICs forming the hardware “pipe”. The common stack provides consistent messaging & communication and cluster membership/state; on top of it RAC handles cache fusion and lock management, CVM handles volume metadata management, CFS handles data files, file-system metadata and GLM, and the VCS engine handles cluster state.]

  35. VCS Engine: Cluster Configuration & State Management • Local, on-disk repository of definitions of nodes in the cluster, resource types, resources, agents, service groups and dependencies • Grammar-based configuration file • First node to come up in the cluster builds the in-memory configuration • In-memory configuration also includes transient information like the state of each resource • Subsequent nodes get an in-memory snapshot from one of the existing nodes • Replicated state machine: any change in the cluster is broadcast to all nodes, then the action is taken • All nodes in the cluster have the working configuration and state all the time – any node dying does not affect the functioning of the cluster software itself • Any event that changes the configuration is written to the local configuration on each node atomically
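In practice, administrators change the in-memory configuration and then ask the engine to dump it back to the on-disk file on every node; a short sketch with an illustrative resource name:

    haconf -makerw                               # make the cluster configuration writable
    hares -modify ora_mnt MountPoint "/oradata2" # change an attribute of a resource
    haconf -dump -makero                         # write the configuration to disk on all nodes, return to read-only

Because the dump is coordinated by the engine, every node's local copy stays consistent.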

  36. A Very Short Glimpse into the Life of the VCS Engine [Diagram: systems S1 and S2 each run agents for types T1 and T2 and a VCS engine with cluster access management, policy management, config & state management and inter-node communication layers, joined by the cluster interconnect (LLT/GAB); each node also keeps a local on-disk configuration (group G1, resource R1 of type T1, systems S1 and S2), and a joining node receives an initial snapshot. Sequence: the CLI request “Online G1 on S1” arrives at cluster access management; policy management broadcasts “Online R1 (S1)” over the interconnect; the T1 agent on S1 onlines R1 and reports “R1 is online”; the engine broadcasts “R1 is online (S1)”, and every node updates its in-memory config & state repository to show R1 online on S1.]

  37. VCS Engine: Policy Management • Declarative specification and enforcement of • What applications can run where: • Active/passive, Active/Active • N+1, N->1, Server consolidation configurations • Failover & Parallel Service groups • Failover service group is online on only one machine at a given time • What to do in case of various failures: resource, system, cluster, site • Register with GAB for notification of node join/leave and inter-node communication • Work with agents for resource state management • Highly flexible: can add/modify/delete nodes, service groups, resources, types and attributes on the fly • Extend policy using scriptable triggers • Cluster Policy Simulator for modeling and running “what-if” scenarios
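A hedged sketch of how such a policy is declared for a failover group, using illustrative names (SystemList priorities pick the failover order, Parallel = 0 marks a failover rather than a parallel group):

    group websg (
        SystemList = { machine1 = 0, machine2 = 1, machine3 = 2 }
        AutoStartList = { machine1 }
        AutoFailOver = 1
        Parallel = 0
        )

Setting Parallel = 1 instead would let the group run on several nodes at once, which is how parallel applications such as CVM/CFS are modeled.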

  38. VCS Engine: Cluster Access Management • All cluster functionality available through Java GUI, Web GUI, CLI and API • Secure authentication & communication (VxSS based) • Registration API for state change or any notification in general • Role based access control at cluster or service group level

  39. Agents: Agent Framework • Agent framework provides common functionality needed by all agents • Communication and heartbeat with VCS engine • Handling of multiple resources of same resource type – monitor interval and other timers • Provides API for entry points • Agent entry points • Online, offline, monitor, clean – just do intended work periodically or upon engine request and leave the rest to the agent framework • “In process” entry point (C/C++) • Script or executable entry point

  40. How do I write my own agent? • Reuse the generic Application or Process agent • Use the script agent for scripting entry points • Write a type definition, including its specification for the configuration file • Write 4 entry points (online, offline, monitor, clean) with clearly defined actions • Should be able to identify one and only one instance of the application • Monitor should be very efficient

  41. Example Type Definition

type Mount (
    static str ArgList[] = { MountPoint, BlockDevice, FSType, MountOpt, FsckOpt, SnapUmount }
    NameRule = resource.MountPoint
    str MountPoint
    str BlockDevice
    str FSType
    str MountOpt
    str FsckOpt
    int SnapUmount
)
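To complement the type definition, here is a minimal sketch of a script-based monitor entry point for a hypothetical custom type; it assumes the framework passes the resource name first, followed by the ArgList values, and uses the usual VCS convention of exit code 110 for online and 100 for offline:

    #!/bin/sh
    # monitor entry point sketch for a hypothetical resource type with ArgList = { PidFile }
    ResName=$1          # resource name, passed by the agent framework
    PidFile=$2          # first ArgList attribute

    if [ -f "$PidFile" ] && kill -0 "$(cat "$PidFile")" 2>/dev/null; then
        exit 110        # process is running: report online
    else
        exit 100        # report offline
    fi

The online, offline and clean entry points follow the same pattern: do one well-defined action, exit, and let the agent framework handle scheduling, timeouts and communication with the engine.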

  42. Agenda • Functional overview of Cluster Server • Architectural overview of Cluster Server • Future directions

  43. VCS Futures • Support for Faster interconnects • More application focused editions • Interoperability with Application Automation products • More news to be announced in March

  44. Faster Interconnects • Potentially excellent performance improvements for latency-sensitive parallel applications like CFS and RAC • Types: • 10 Gigabit Ethernet: fat pipe but not much improvement in latency • TCP Offload Engines: not much performance improvement • InfiniBand (currently investigating) • Can bypass data-path copies using RDMA • Presents a separate queue pair (communication pipe) for each set of communicating processes across the cluster – reduces locking and sequencing overhead • APIs still evolving and Linux-only as of now

  45. Application Focused Editions • Today: Oracle, DB2 and Sybase • Investigating: • SAP edition • MySQL • Virtualization-focused solutions

  46. Thank You!
