
Microsoft Exchange Server 2010 High Availability Deep Dive


Presentation Transcript


  1. Required Slide SESSION CODE: UNC304 Microsoft Exchange Server 2010 High Availability Deep Dive Scott Schnoll scott.schnoll@microsoft.com Principal Technical Writer Microsoft Corporation

  2. Agenda • Exchange 2010 High Availability Basics • Deep Dive on Exchange 2010 High Availability Basics • Deeper Dive on Exchange 2010 High Availability Advanced Features • Monitoring Exchange 2010 High Availability • Improvements in Service Pack 1

  3. Exchange 2010 High Availability Basics Database Availability Groups, Mailbox Database Copies and Lagged Database Copies

  4. Database Availability Group (DAG) • A group of servers that host a set of replicated mailbox databases • Server can be a member of one DAG • Orgs can have multiple DAGs • Leverages Windows Failover Cluster • Manage DAG membership (DAG member = node) • Heartbeating of DAG members • Active Manager stores data in cluster database • Defines a boundary for • Mailbox database replication • Database and server *overs • Active Manager • (Slide diagram: a Database Availability Group of three members, each hosting copies of DB1, DB2, and DB3, with Active Manager on each member and the RPC Client Access service)

  5. Mailbox Database Copies • Create up to 16 copies of each mailbox database • Each mailbox database must have a unique name within Organization • Mailbox database objects are global configuration objects • All mailbox database copies use the same GUID • No longer connected to specific Mailbox servers • Each DAG member can host only one copy of a given mailbox database • Database path and log folder path for copy must be identical on all members • Optional replay lag and truncation lag settings • Using these features affects your storage design • Copies have an activation preference • RTM: Used as secondary sorting key during best copy selection • SP1: Used for distributing active databases across DAG
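
  For example, the activation preference mentioned above is set when a copy is created or adjusted afterward. A minimal Exchange Management Shell sketch; DB1 comes from the deck, EXMBX3 is a placeholder server name:
    # Add a copy of DB1 to EXMBX3 and make it the second activation choice
    Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer EXMBX3 -ActivationPreference 2
    # Adjust the activation preference of an existing copy later
    Set-MailboxDatabaseCopy -Identity DB1\EXMBX3 -ActivationPreference 3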

  6. Lagged Database Copies • Lagged copies provide point-in-time protection only; they are not a replacement for point-in-time backups • Logical corruption and/or mailbox deletion prevention scenarios • Provide a maximum of 14 days protection • When should you deploy a lagged copy? • Useful only to mitigate a risk • Not needed if deploying a backup solution (e.g. DPM 2010) • Lagged copies are not HA database copies • Lagged copies should never be automatically activated! • Steps for manual activation documented at http://technet.microsoft.com/en-us/library/dd979786.aspx • Lagged copies affect your storage design
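
  A minimal sketch of creating and protecting a lagged copy (these commands are not in the deck; EXMBX4 is a placeholder, and the lag value is illustrative, up to the 14-day maximum):
    # Create a lagged copy with a 3-day replay lag
    Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer EXMBX4 -ReplayLagTime 3.00:00:00
    # Block automatic activation so the lagged copy is only ever activated manually
    Suspend-MailboxDatabaseCopy -Identity DB1\EXMBX4 -ActivationOnly -SuspendComment "Lagged copy: manual activation only"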

  7. Deep Dive on Exchange 2010 High Availability Basics Quorum, Witness, DAG Lifecycle, and DAG Networks

  8. Quorum

  9. Quorum • Used to ensure that only one subset of members is functioning at one time • A majority of members must be active and have communications with each other • Represents a shared view of members (voters and some resources) • Dual Usage • Data shared between the voters representing configuration, etc. • Number of voters required for the solution to stay running (majority); quorum is a consensus of voters • When a majority of voters can communicate with each other, the cluster has quorum • When a majority of voters cannot communicate with each other, the cluster does not have quorum

  10. Quorum • Quorum is not only necessary for cluster functions, but it is also necessary for DAG functions • In order for a DAG member to mount and activate databases, it must participate in quorum • Exchange 2010 uses only two of the four available cluster quorum models • Node Majority (DAGs with an odd number of members) • Node and File Share Majority (DAGs with an even number of members) • Quorum = (N/2) + 1 (whole numbers only) • 6-member DAG: (6/2) + 1 = 4 votes for quorum (can lose 3 voters) • 9-member DAG: (9/2) + 1 = 5 votes for quorum (can lose 4 voters) • 13-member DAG: (13/2) + 1 = 7 votes for quorum (can lose 6 voters) • 15-member DAG: (15/2) + 1 = 8 votes for quorum (can lose 7 voters)
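
  The majority math above can be sketched in PowerShell (this helper is my own illustration, not part of Exchange; it counts the file share witness as an extra voter for even-membered DAGs):
    function Get-DagQuorumVotes ([int]$Members) {
        # Even-membered DAGs add the file share witness as a tie-breaking voter
        $voters = if ($Members % 2 -eq 0) { $Members + 1 } else { $Members }
        $needed = [math]::Floor($voters / 2) + 1
        "{0} members: {1} voters, {2} votes for quorum, can lose {3}" -f $Members, $voters, $needed, ($voters - $needed)
    }
    Get-DagQuorumVotes 6     # 6 members: 7 voters, 4 votes for quorum, can lose 3
    Get-DagQuorumVotes 15    # 15 members: 15 voters, 8 votes for quorum, can lose 7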

  11. Witness and Witness Server

  12. Witness • A witness is a voter that is external to the DAG that participates in quorum by adding a tie-breaking voter to DAGs that have an even number of members • Witness server does not maintain a full copy of quorum data • Represented by File Share Witness resource • File share witness cluster resource, directory, and share are automatically created by Exchange when needed and removed by Exchange when not needed • Uses IsAlive check for availability • If witness is not available, cluster core resources are failed and moved to another node • If other node does not bring witness resource online, the resource will remain in a Failed state, with restart attempts every 60 minutes • See http://support.microsoft.com/kb/978790 for details on this behavior

  13. Witness • If not online and needed for quorum, the cluster will try to bring the File Share Witness resource online once • If the witness cannot be brought online, it is considered failed and quorum is lost • If the witness can be brought online, it is considered successful and quorum is maintained • An SMB lock is placed on witness.log • Node PAXOS information is incremented and the updated PAXOS tag is written to witness.log • When the witness is no longer needed to maintain quorum, the lock on witness.log is released • Any member that locks the witness retains the extra vote (“locking node”) • Members in contact with the locking node are in the majority and maintain quorum • Members not in contact with the locking node are in the minority and lose quorum

  14. Witness Server • No pre-configuration typically necessary • Exchange Trusted Subsystem must be member of local Administrators group on Witness Server if Witness Server is not running Exchange 2010 • Cannot be a member of the DAG (present or future) • Must be in the same Active Directory forest as DAG • Can be Windows Server 2003 or later • File and Printer Sharing for Microsoft Networks must be enabled • Replicating witness directory/share with DFS not supported • Not necessary to cluster Witness Server • Single witness server can be used for multiple DAGs • Each DAG requires its own unique Witness Directory/Share
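
  A hedged example of pointing an existing DAG at a witness (the server and directory names reuse the creation example shown later in the deck and are placeholders for your environment):
    # Specify or move the witness server and directory for an existing DAG
    Set-DatabaseAvailabilityGroup -Identity DAG1 -WitnessServer EXHUB1 -WitnessDirectory C:\DAG1FSW
    # If the witness is not an Exchange 2010 server, first add the Exchange Trusted
    # Subsystem group to the witness server's local Administrators group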

  15. Database Availability Group Lifecycle

  16. Database Availability Group Lifecycle
  • Create a DAG
    New-DatabaseAvailabilityGroup -Name DAG1 -WitnessServer EXHUB1 -WitnessDirectory C:\DAG1FSW -DatabaseAvailabilityGroupIpAddresses 10.0.0.8
    New-DatabaseAvailabilityGroup -Name DAG2 -DatabaseAvailabilityGroupIpAddresses 10.0.0.8,192.168.0.8
  • Add first Mailbox server to the DAG
    Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EXMBX1
  • Add second and subsequent Mailbox servers
    Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EXMBX2
  • Add a Mailbox Database Copy
    Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer EXMBX2

  17. Database Availability Group Lifecycle • DAG is created initially as empty object in Active Directory • Continuous replication or 3rd party replication using Third Party Replication mode • Once changed to Third Party Replication mode, the DAG cannot be changed back • DAG is given a unique name and configured for IP addresses (or configured to use DHCP)

  18. Database Availability Group Lifecycle • When first Mailbox server is added to a DAG • A failover cluster is formed with name of DAG using Node Majority quorum • The server is added to the DAG object in Active Directory • A cluster name object (CNO) for the DAG is created in default Computers container • The Name and IP address of the DAG is registered in DNS • The cluster database for the DAG is updated with info about local databases • When second and subsequent Mailbox server is added to a DAG • The server is joined to cluster for the DAG • The quorum model is automatically adjusted • The server is added to the DAG object in Active Directory • The cluster database for the DAG is updated with info about local databases

  19. Database Availability Group Lifecycle • After servers have been added to a DAG • Configure the DAG • Network encryption • Network compression • Replication port • Configure DAG networks • Network subnets • Collapse DAG networks into a single network with multiple subnets • Enable/disable MAPI traffic/replication • Block network cross-talk (Server1\MAPI !<-> Server2\Repl) • Create mailbox database copies • Seeding is performed automatically, but you have options • Monitor health and status of database copies and perform switchovers as needed
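
  A sketch of the post-membership configuration steps listed above (values are illustrative, not prescriptive):
    # DAG-wide settings: compression, encryption, and the replication port
    Set-DatabaseAvailabilityGroup -Identity DAG1 -NetworkCompression InterSubnetOnly -NetworkEncryption InterSubnetOnly -ReplicationPort 64327
    # Keep replication off the MAPI network so it carries client and directory traffic only
    Set-DatabaseAvailabilityGroupNetwork -Identity DAG1\DAGNetwork01 -ReplicationEnabled:$false
    # Add a copy; seeding starts automatically
    Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer EXMBX2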

  20. Database Availability Group Lifecycle • Before you can remove a server from a DAG, you must first remove all replicated databases from the server • When a server is removed from a DAG: • The server is evicted from the cluster • The cluster quorum is adjusted • The server is removed from the DAG object in Active Directory • Before you can remove a DAG, you must first remove all servers from the DAG
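
  A possible removal sequence that mirrors the order described above (names reuse the deck's examples):
    # 1. Remove the replicated database copies hosted on the departing server
    Remove-MailboxDatabaseCopy -Identity DB1\EXMBX2
    # 2. Remove the server from the DAG (evicts it from the cluster and adjusts quorum)
    Remove-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EXMBX2
    # 3. Once every member has been removed, remove the empty DAG
    Remove-DatabaseAvailabilityGroup -Identity DAG1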

  21. DAG Networks

  22. DAG Networks • A DAG network is a collection of subnets • All DAGs must have: • Exactly one MAPI network • MAPI network connects DAG members to network resources (Active Directory, other Exchange servers, etc.) • Zero or more Replication networks • Separate network on separate subnet(s) • Used for/by continuous replication only • LRU determines which replication network to use when multiple replication networks are configured • DAG networks are initially created based on enumeration of cluster networks • Cluster enumeration is based on subnet • One cluster network is created for each subnet

  23. DAG Networks

  24. DAG Networks

  25. DAG Networks • To collapse subnets into two DAG networks and disable replication for the MAPI network:
    Set-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork01 -Subnets 192.168.0.0,192.168.1.0 -ReplicationEnabled:$false
    Set-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork02 -Subnets 10.0.0.0,10.0.1.0
    Remove-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork03
    Remove-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork04

  26. DAG Networks • To collapse subnets into two DAG networks and disable replication for the MAPI network:
    Set-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork01 -Subnets 192.168.0.0,192.168.1.0 -ReplicationEnabled:$false
    Set-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork02 -Subnets 10.0.0.0,10.0.1.0
    Remove-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork03
    Remove-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork04

  27. DAG Networks • Automatic network detection occurs only when members are added to the DAG • If networks are added after a member is added, you must perform discovery: Set-DatabaseAvailabilityGroup -DiscoverNetworks • DAG network configuration is persisted in the cluster registry • HKLM\Cluster\Exchange\DAG Network • DAG networks include built-in encryption and compression • Encryption: Kerberos SSP EncryptMessage/DecryptMessage APIs • Compression: Microsoft XPRESS, based on LZ77 algorithm • DAGs use a single TCP port for replication and seeding • Default is TCP port 64327 • If you change the port and you use Windows Firewall, you must manually change firewall rules
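
  For example, changing the replication port and then updating Windows Firewall on each member might look like this (the port value and rule name are made up for illustration):
    # Change the DAG replication port from the default of 64327
    Set-DatabaseAvailabilityGroup -Identity DAG1 -ReplicationPort 60001
    # Windows Firewall rules are not updated automatically; add one on every DAG member
    netsh advfirewall firewall add rule name="Exchange DAG replication (TCP 60001)" dir=in action=allow protocol=TCP localport=60001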

  28. Deeper Dive on Exchange 2010 High Availability Advanced Features Active Manager, Best Copy Selection, Datacenter Activation Coordination Mode

  29. Active Manager and Best Copy Selection

  30. Active Manager • Exchange component that manages *overs • Runs on every server in the DAG • Selects best available copy on failovers • Is the definitive source of information on where a database is active • Stores this information in cluster database • Provides this information to other Exchange components (e.g., RPC Client Access and Hub Transport) • Active Manager roles • Standalone Active Manager • Primary Active Manager (PAM) • Standby Active Manager (SAM) • Active Manager client runs on CAS and Hub

  31. Active Manager • Transition of role state logged into Microsoft-Exchange-HighAvailability/Operational event log (Crimson Channel)

  32. Active Manager • Primary Active Manager (PAM) • Runs on the node that owns the cluster group • Gets topology change notifications • Reacts to server failures • Selects the best database copy on *overs • Detects failures of local Information Store and local databases • Standby Active Manager (SAM) • Runs on every other node in the DAG • Detects failures of local Information Store and local databases • Reacts to failures by asking PAM to initiate a failover • Responds to queries from CAS/Hub about which server hosts the active copy • Both roles are necessary for automatic recovery • If the Replication service is stopped, automatic recovery will not happen
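
  To see which member currently holds the PAM role, one option is the DAG status output; I believe the relevant property is PrimaryActiveManager, but verify against your build since the deck does not show this command:
    # Show the current Primary Active Manager for the DAG
    Get-DatabaseAvailabilityGroup -Identity DAG1 -Status | Format-List Name,PrimaryActiveManager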

  33. Active Manager • Startup process depends on whether AM is Standalone or in a DAG • Standalone • Replication service starts and reads configuration from Active Directory • Sets Active Manager role to Standalone • Active Manager queries Active Directory every 30 seconds for changes • If it has been added to a DAG, the DAG Active Manager logic is started • DAG • Replication service starts and reads configuration from Active Directory • Replication service assumes SAM role and sets CurrentPAM to unknown • Replication service determines current PAM holder (who owns cluster group) • If local server is PAM, Replication service assumes PAM role • If remote server is PAM, Replication service maintains SAM role • Replication service sets CurrentPAM to PAM role holder

  34. Active Manager • Replication service thread monitors for cluster group changes and reacts as follows: • If DAG member owns cluster group and CurrentPAM is set to another member, it will • Verify with all other DAG members • Assume PAM role • Set CurrentPAM to itself • If DAG member does not own cluster group, but it is configured as CurrentPAM, • Indicates that the cluster group has been moved to another DAG member • All outstanding Active Manager operations are immediately finished • CurrentPAM is set to new owner of cluster group • DAG member assumes SAM role • If DAG member does not own cluster group, and is not configured as CurrentPAM, DAG member maintains SAM role

  35. Best Copy Selection • Active Manager selects the “best” copy to become the new active copy when the existing active copy fails • Sorts copies by currency (copy queue length) to minimize data loss • Breaks ties during the sort based on Activation Preference • Selects from the sorted list based on which set of criteria each copy meets • Attempt Copy Last Logs (ACLL) runs and attempts to copy missing log files from the previous active copy • Is the database mountable? Is the copy queue length < AutoDatabaseMountDial? • If yes, the database is marked as current active and a mount request is issued • If not, the copy that meets the next set of criteria is tried • During best copy selection, any servers that are unreachable or “activation blocked” are ignored
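
  You can approximate the first part of that sort yourself when planning a switchover (an illustrative query only; the real selection logic runs inside Active Manager):
    # View copy health for DB1, sorted by copy queue length as BCS would sort it
    Get-MailboxDatabaseCopyStatus -Identity DB1 | Sort-Object CopyQueueLength |
        Format-Table Name,Status,CopyQueueLength,ReplayQueueLength,ContentIndexState -AutoSize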

  36. Best Copy Selection

  37. Best Copy Selection • Four copies of DB1 • DB1 currently active on Server1 • (Slide diagram: DB1 copies on Server1, Server2, Server3, and Server4, with the active copy on Server1 marked as failed)

  38. Best Copy Selection • Sort the list of available copies by Copy Queue Length (using Activation Preference as a secondary sort key if necessary): • Server3\DB1 • Server2\DB1 • Server4\DB1

  39. Best Copy Selection • Only two copies meet the first set of criteria for activation (CQL < 10; RQL < 50; CI = Healthy): Server3\DB1 and Server2\DB1 • Server3\DB1 has the lowest copy queue length, so it is tried first

  40. Best Copy Selection • After Active Manager determines the best copy to activate • The Replication service on the target server attempts to copy missing log files from the source (ACLL) • If successful, then the database will mount with zero data loss • If unsuccessful (lossy failure), then the database will mount based on the AutoDatabaseMountDial setting • If data loss is outside of dial setting, next copy will be tried • The mounted database will generate new log files (using the same log generation sequence) • Transport Dumpster requests will be initiated for the mounted database to recover lost messages • When original server or database recovers, it will run through divergence detection and either perform an incremental resync or require a full reseed
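
  The dial is a per-server setting, and a manual move can override it; a sketch (server and database names reuse the deck's examples):
    # Allow automatic mount when a small number of log files are missing
    Set-MailboxServer -Identity EXMBX2 -AutoDatabaseMountDial GoodAvailability
    # Override the dial for a one-off switchover
    Move-ActiveMailboxDatabase -Identity DB1 -ActivateOnServer EXMBX2 -MountDialOverride BestAvailability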

  41. Datacenter Activation Coordination Mode

  42. Datacenter Activation Coordination Mode • DAC mode is a property of a DAG • Acts as an application-level form of quorum • Designed to prevent multiple copies of same database mounting on different members due to loss of network • RTM: DAC Mode is only for DAGs with three or more members that are extended to two Active Directory sites • Should not be enabled for two-member DAGs where each member is in a different Active Directory site or DAGs where all members are in the same Active Directory site • In RTM, DAC Mode also enables use of Site Resilience tasks • Stop-DatabaseAvailabilityGroup • Restore-DatabaseAvailabilityGroup • Start-DatabaseAvailabilityGroup • SP1: DAC Mode can be enabled for all DAGs
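
  Enabling DAC mode is a single property change on the DAG (DAG1 reuses the deck's example name):
    # Enable Datacenter Activation Coordination mode
    Set-DatabaseAvailabilityGroup -Identity DAG1 -DatacenterActivationMode DagOnly
    # The default value is Off
    Set-DatabaseAvailabilityGroup -Identity DAG1 -DatacenterActivationMode Off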

  43. Datacenter Activation Coordination Mode • Uses Datacenter Activation Coordination Protocol (DACP), which is a bit in memory set to either: • 0 = can’t mount • 1 = can mount • Active Manager startup sequence • DACP is set to 0 • DAG member communicates with other DAG members it can reach to determine the current value for their DACP bits • If the starting DAG member can communicate with all other members, DACP bit switches to 1 • If other DACP bits are set to 0, starting DAG member DACP bit remains at 0 • If another DACP bit is set to 1, starting DAG member DACP bit switches to 1

  44. Monitoring Exchange 2010 High Availability Built-in Tools

  45. Monitoring Best Practices • Ensuring that your servers are operating reliably and that your database copies are healthy are key objectives for daily messaging operations • Actively monitor hardware, the Windows operating system, Exchange 2010 services, and database and database copy health • Monitoring actively and daily enables you to: • Meet service level agreements (SLAs) • Ensure regular administrative tasks have completed (e.g., backups) • Detect and address issues that might affect service or data availability • Exchange 2010 includes several built-in tools for monitoring high availability • System Center Operations Manager automates and enhances these tools with the Exchange 2010 Management Pack

  46. Get-MailboxDatabaseCopyStatus • Used to view information about copies of a particular database, a specific copy of a database on a specific server, or about all database copies on a server • Examples
  • Get status for all copies of a database
    Get-MailboxDatabaseCopyStatus -Identity DB2 | FL
  • Get status for all copies on the local server
    Get-MailboxDatabaseCopyStatus -Local | FL
  • Get status for all copies on a remote server
    Get-MailboxDatabaseCopyStatus -Server MBX2 | FL
  • Get status, log shipping and seeding network information
    Get-MailboxDatabaseCopyStatus -Identity DB3\MBX1 -ConnectionStatus | FL

  47. Test-ReplicationHealth • Designed for proactive monitoring of continuous replication, the availability of Active Manager, and the health and status of the underlying cluster service, quorum, and network components • Can be run locally on or remotely against any Mailbox server in a DAG • Example • Test the health of a DAG member Test-ReplicationHealth -Identity MBX1

  48. Crimson Channel Event Logging • Windows Server 2008 includes two categories of event logs • Windows logs (includes the legacy Application, Security and System event logs, as well as the new Setup and Forwarded Events logs) • Applications and Services logs • New category of event logs used for storing events from a single application or component, rather than events that might have system-wide impact • This new category is referred to as an application’s ‘crimson channel’ • Includes four general subtypes (there can be custom ones, too) • Admin (useful for troubleshooting; contain guidance for problem resolution) • Operational (somewhat useful; require a bit more interpretation) • Analytic (hidden and disabled by default) • Debug (used by developers when debugging applications)

  49. Crimson Channel Event Logging • Active Manager monitors the crimson channel for events
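
  The same channel can be queried directly with the generic Get-WinEvent cmdlet (not shown in the deck), for example:
    # Read the most recent Active Manager events from the crimson channel
    Get-WinEvent -LogName "Microsoft-Exchange-HighAvailability/Operational" -MaxEvents 20 | Format-Table TimeCreated,Id,Message -AutoSize -Wrap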

  50. CollectReplicationMetrics.ps1 • Collects replication performance data metrics for a DAG in real-time • Script represents an active form of monitoring • Collects metrics in real-time while running • Hours as active copy, Hours as passive copy • Minutes unavailable, Minutes Resynchronizing, Minutes Failed, Minutes Suspended, Minutes FailedAndSuspended, Minutes Disconnected • Average log generation rate, Peak log generation rate • Average log copy rate, Peak log copy rate • Average log replay rate, Peak log replay rate • Percentage of time log copying used replication network N (for each N) • Percentage of the time log copying was using MAPI network
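
  A typical invocation from the Exchange Scripts folder might look like the following; the parameter names reflect my understanding of the SP1 script and should be verified against your build:
    # Gather one hour of replication metrics for DAG1, sampled every minute
    cd $exscripts
    .\CollectReplicationMetrics.ps1 -DagName DAG1 -Duration "01:00:00" -Frequency "00:01:00" -ReportPath C:\Reports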
