Exchange Server 2010 High Availability Concepts

Agenda • Exchange Server 2010 High Availability Concepts • Terminology • Quorum • Witness, Witness Server and Alternate Witness Server • Active Manager • AutoDatabaseMountDial • Activation Preference • Circular Logging and Continuous Replication • Replay Lag and Truncation Lag • Continuous Replication Modes (c) 2011 Microsoft. All rights reserved.

Exchange Server 2010 High Availability Concept: Terminology

Exchange Server High Availability Terminology • High Availability must meet three criteria • Service Availability • Data Availability • Automatic recovery from most failures • Site Resilience • Manual switchover process (datacenter switchover) used to activate service and data access in an alternate datacenter when the primary datacenter can no longer provide the required level of service

Exchange Server High Availability Terminology • *overs (pronounced “star overs”) • Short for Failovers and Switchovers • Failover is automatic, performed by the system • Switchover is manual, performed by an administrator • Database *overs • Server *overs • Datacenter switchover

Exchange Server 2010 High Availability Concept: Quorum

Quorum • A consensus of voters used to ensure that only one subset of members is functioning at one time • A majority of members must be active and have communications with each other • Represents a shared view of members • Voters and resources • Dual Usage • Data shared between the voters representing configuration, etc. • Number of voters required for the solution to stay running (majority)

Quorum • Quorum is necessary for cluster functions and for DAG functions • Exchange 2010 uses two cluster quorum models • Node Majority (DAGs with an odd number of members) • Node and File Share Majority (DAGs with an even number of members) • Quorum = (N/2) + 1 (whole numbers only) • 6 members: (6/2) + 1 = 4 votes for quorum (can lose 3 voters) • 9 members: (9/2) + 1 = 5 votes for quorum (can lose 4 voters) • 13 members: (13/2) + 1 = 7 votes for quorum (can lose 6 voters) • 15 members: (15/2) + 1 = 8 votes for quorum (can lose 7 voters)
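The quorum arithmetic above can be checked with a short Python sketch. It is illustrative only: the function names are hypothetical, and it assumes that an even-sized DAG gains exactly one additional vote through the witness (Node and File Share Majority) while an odd-sized DAG has none (Node Majority).

```python
def quorum_votes(members: int) -> int:
    """Votes needed for quorum: (N/2) + 1, using whole-number division."""
    if members < 1:
        raise ValueError("a DAG needs at least one member")
    return members // 2 + 1


def total_voters(members: int) -> int:
    """Assumption: an even-sized DAG gets one extra vote via the witness."""
    return members + 1 if members % 2 == 0 else members


def tolerable_losses(members: int) -> int:
    """How many voters can be lost while the remainder still holds quorum."""
    return total_voters(members) - quorum_votes(members)


if __name__ == "__main__":
    # Prints members, votes for quorum, and voters that can be lost
    # for the DAG sizes used on the slide.
    for n in (6, 9, 13, 15):
        print(n, quorum_votes(n), tolerable_losses(n))
```

For the four DAG sizes on the slide, the sketch yields the same vote counts and tolerable losses as the slide does.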

Exchange Server 2010 High Availability Concept: Witness, Witness Server and Alternate Witness Server

Witness • A witness is a share with a file on a server that is external to the DAG that participates in determining quorum by providing a weighted vote for the DAG member that has a lock on the witness.log file • Configured for all DAGs • Used only by DAGs that have an even number of members • Witness server does not maintain a copy of quorum data, does not vote, and is not a member of the DAG or cluster

Witness • Part of cluster core resource group (Cluster Group) • Represented by File Share Witness resource (UNC path) • File share witness cluster resource, directory, and share automatically created and removed as needed • Uses Cluster IsAlive check for monitoring and availability • If witness IsAlive fails, Cluster Group is failed and moved to another DAG member • If other DAG member cannot bring witness resource online, the resource will remain in a Failed state, with restart attempts every 60 minutes • See http://support.microsoft.com/kb/978790 for details

Witness • If Failed and needed for quorum, the cluster will try once to bring the File Share Witness resource online • If witness is Failed and cannot be brought Online, quorum is lost • If witness can be restarted or is already Online: • An SMB lock is placed on witness.log, typically by the node that owns the Cluster Group (“locking node”) • The locking node increments PAXOS and writes the updated PAXOS tag to the witness.log file

Witness • When locked, the Locking Node retains a weighted vote • Members in contact with locking node are in majority and maintain quorum • Members not in contact with locking node are in minority and lose quorum • When witness is no longer needed to maintain quorum, SMB lock on witness.log is released

Witness • Witness used only when needed for quorum (two diagrams, not reproduced)

Witness Server • No pre-configuration typically necessary • Exchange Trusted Subsystem must be member of local Administrators group on Witness Server if Witness Server is not running Exchange 2010 • Cannot be a member of the DAG (present or future) • Must be in the same Active Directory forest as DAG

Witness Server • Can be Windows Server 2003 or later • File and Printer Sharing for Microsoft Networks must be enabled • Replicating witness directory/share with DFS not supported • Not necessary to cluster Witness Server • If you do cluster witness server, you must use Windows 2008 • Single witness server can be used for multiple DAGs • Each DAG requires its own unique Witness Directory/Share

Witness Server • Using a non-Exchange 2010 witness server: • You will receive a spurious warning when running New-DatabaseAvailabilityGroup, Set-DatabaseAvailabilityGroup or Restore-DatabaseAvailabilityGroup: "The Exchange Trusted Subsystem is not a member of the local Administrators group on specified witness server <ServerName>." • Code bug described at http://aka.ms/xh5rah

Alternate Witness Server • Witness server used by a DAG after a datacenter switchover • DAG is configured to use alternate witness server when you run Restore-DatabaseAvailabilityGroup or ahead of time by using Set-DatabaseAvailabilityGroup • DAGs do not dynamically switch witness servers • Alternate witness server does not provide redundancy for witness server or FSW resource

Exchange Server 2010 High Availability Concept: Active Manager

Active Manager • Exchange component that manages high availability platform • Runs inside the Microsoft Exchange Replication service on every Mailbox server • Is the definitive source of information on where a database is active • Stores this information in cluster database • Provides this information to Active Manager client running on other server roles (Client Access and Hub Transport)

Active Manager Roles • Standalone Active Manager • Primary Active Manager (PAM) • Standby Active Manager (SAM) • Active Manager Client • Runs in RPC Client Access service on CAS and Transport service on Hub

Active Manager • Primary Active Manager (PAM) • Runs on the node that owns the cluster core resources (cluster group) • Gets topology change notifications • Reacts to server failures • Selects the best database copy on failovers and targetless switchovers • Detects failures of local Information Store and local databases

Active Manager • Standby Active Manager (SAM) • Runs on every other node in the DAG • Detects failures of local Information Store and local databases • Reacts to failures by asking PAM to initiate a failover • Responds to queries from CAS/Hub about which server hosts the active copy • Both roles are necessary for automatic recovery • If the Microsoft Exchange Replication service is stopped, automatic recovery will not happen

Exchange Server 2010 High Availability Concept: AutoDatabaseMountDial

AutoDatabaseMountDial • When a replicated mailbox database is affected by a failure (e.g., disk, network, service, server), what two types of recovery are possible? • Answer: • Manual – the Administrator performs recovery (switchover) • Automatic – Exchange 2010 performs recovery (failover) • Exchange 2010 uses AutoDatabaseMountDial setting as part of its automatic recovery logic • AutoDatabaseMountDial is configured using Set-MailboxServer

AutoDatabaseMountDial • Configured on a per-Mailbox server basis • Used to determine whether a passive copy being activated can be mounted, based on the number of missing log files • Can be overridden by an administrator performing a switchover

AutoDatabaseMountDial • Best Availability – 12 or fewer missing log files • Good Availability – 6 or fewer missing log files (Default) • Lossless – 0 missing log files • Best Effort – Boundless; available for switchovers only • Custom value – Configured using ADSIEdit • msExchDataLossForAutoDatabaseMount attribute of Mailbox server object • If custom value >12 is used, increase transport dumpster size accordingly

AutoDatabaseMountDial • Dial setting consulted as part of a failover or targetless switchover during Best Copy Selection (BCS) • During a failover, the passive copy being activated attempts to copy any missing log files from previous active • If successful, then the database will mount with zero data loss • If unsuccessful (lossy failure), then the database will mount based on the AutoDatabaseMountDial setting • If data loss is outside of AutoDatabaseMountDial setting, another copy (if available) will be tried • If another copy is not available, administrator must intervene
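The mount decision described above can be sketched in Python. This is a simplified illustration, not Exchange's actual logic: the function names and the candidate tuples are hypothetical, the thresholds are the ones listed in this deck, and the sketch assumes each candidate copy is judged against the dial setting of the server that hosts it. Best Effort is left out because the deck limits it to switchovers.

```python
from typing import List, Optional, Tuple

# Missing-log thresholds for the dial settings listed in this deck.
DIAL_THRESHOLDS = {
    "Lossless": 0,
    "GoodAvailability": 6,
    "BestAvailability": 12,
}


def can_auto_mount(missing_logs: int, dial: str) -> bool:
    """True if the copy may mount automatically with this many missing logs."""
    if missing_logs < 0:
        raise ValueError("missing_logs cannot be negative")
    return missing_logs <= DIAL_THRESHOLDS[dial]


def pick_copy_to_mount(candidates: List[Tuple[str, int, str]]) -> Optional[str]:
    """Try candidates in order; each is (server, missing_logs, dial).

    Returns the first server whose copy is within its dial setting, or
    None, meaning an administrator must intervene.
    """
    for server, missing_logs, dial in candidates:
        if can_auto_mount(missing_logs, dial):
            return server
    return None
```

A copy that recovered all of its missing logs from the previous active has `missing_logs == 0` and mounts under every setting, which corresponds to the zero-data-loss case.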

Exchange Server 2010 High Availability Concept: Activation Preference

Activation Preference • A whole number value assigned to each copy of a given database (e.g., 1, 2, 3, 4, etc.), where 1 is at the top of the preference order • Different use in Exchange 2010 RTM vs. SP1 and later • RTM uses it as tie-breaking mechanism when multiple activation targets have the same copy queue length • SP1 uses it for sorting initial list of potential activation targets when AutoDatabaseMountDial is set to Lossless on all servers that host a copy of the database
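The two uses of activation preference can be illustrated with a small Python sketch. It models only the two cases named on this slide; the `Copy` type and function names are hypothetical, and real Best Copy Selection weighs further criteria that are not modelled here.

```python
from typing import List, NamedTuple


class Copy(NamedTuple):
    server: str
    copy_queue_length: int
    activation_preference: int


def rtm_order(copies: List[Copy]) -> List[Copy]:
    """RTM: order by copy queue length; activation preference breaks ties."""
    return sorted(copies, key=lambda c: (c.copy_queue_length, c.activation_preference))


def sp1_lossless_order(copies: List[Copy]) -> List[Copy]:
    """SP1 with the dial set to Lossless on all hosting servers:
    the initial list is sorted by activation preference."""
    return sorted(copies, key=lambda c: c.activation_preference)
```

With copies on hypothetical servers MBX2 (queue 5, preference 2), MBX3 (queue 5, preference 3), and MBX4 (queue 0, preference 4), `rtm_order` places MBX4 first and uses the preference only to rank MBX2 ahead of MBX3, while `sp1_lossless_order` ranks them strictly as MBX2, MBX3, MBX4.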

Activation Preference • Simple example • Multiple copies of a database in multiple datacenters, numbered 1, 2, 3, 4 (diagram not reproduced)

Activation Preference • Activation preference is not a guarantee • Based on nature of the failure and health and status of passive copies • Can be reconfigured on the fly • Use Set-MailboxDatabaseCopy • Does not require any restarts • The number cannot be larger than the number of copies of the mailbox database • Also used as part of DAG re-balancing

Exchange Server 2010 High Availability Concept: Circular Logging and Continuous Replication

Circular Logging and Continuous Replication • Exchange 2010 includes two forms of circular logging • ESE (aka JET) circular logging (CL) • Continuous replication circular logging (CRCL) • Over time, each Exchange database generates a set of log files known as the log stream • When enabled, circular logging allows Exchange to overwrite transaction log files after the data contained in the log files is committed to the database

Circular Logging and Continuous Replication • Exchange 2007 continuous replication also included CRCL, which is different from CL • CL is a function of JET and is performed by the Information Store service (store.exe) • CRCL is a function of continuous replication and is performed by the Exchange Replication service (msexchangerepl.exe)

Circular Logging and Continuous Replication • For truncation to occur on highly available (non-lagged) mailbox database copies, the answer must be "Yes" to the following questions: • Has the log file been backed up, or is CRCL enabled? • Is the log file below the checkpoint? • Do the other non-lagged copies of the database agree with deletion? • Has the log file been inspected by all lagged copies of the database? • For truncation to occur on lagged database copies, the answer must be "Yes" to the following questions: • Is the log file below the checkpoint? • Is the log file older than ReplayLagTime + TruncationLagTime? • Is the log file deleted on the active copy of the database?
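The two truncation checklists can be written as boolean predicates. The Python sketch below only restates the questions on this slide; the `LogState` type, its field names, and the function names are hypothetical and do not correspond to any Exchange API.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class LogState:
    backed_up: bool                   # Has the log file been backed up?
    below_checkpoint: bool            # Is the log file below the checkpoint?
    other_copies_agree: bool          # Do the other non-lagged copies agree with deletion?
    inspected_by_lagged_copies: bool  # Has every lagged copy inspected the log?


def can_truncate_ha_copy(log: LogState, crcl_enabled: bool) -> bool:
    """Highly available (non-lagged) copy: every answer must be yes."""
    return (
        (log.backed_up or crcl_enabled)
        and log.below_checkpoint
        and log.other_copies_agree
        and log.inspected_by_lagged_copies
    )


def can_truncate_lagged_copy(
    below_checkpoint: bool,
    log_age: timedelta,
    replay_lag: timedelta,
    truncation_lag: timedelta,
    deleted_on_active: bool,
) -> bool:
    """Lagged copy: the log must also be older than both lags combined."""
    return (
        below_checkpoint
        and log_age > replay_lag + truncation_lag
        and deleted_on_active
    )
```

With CRCL enabled, the backup condition is satisfied without a backup, but the remaining three conditions still apply.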

Circular Logging and Continuous Replication • MSExchangeRepl manages CRCL so that log continuity is maintained and logs are not deleted if they are still needed for replication • MSExchangeRepl and Store communicate via RPCs regarding which log files can be deleted

Circular Logging and Continuous Replication • Both features are enabled and disabled the same way • By using the checkbox on the Maintenance tab of the database Properties dialog that says Enable circular logging • By using the Set-MailboxDatabase cmdlet with the -CircularLoggingEnabled parameter set to $true

Circular Logging and Continuous Replication • Once enabled, the feature you get depends on whether or not database is replicated • If the mailbox database is not replicated, it will use JET circular logging. In this case, enabling or disabling JET circular logging will require a dismount and mount of the database • If the mailbox database is replicated, it will use CRCL. In this case, enabling or disabling CRCL takes effect dynamically; there is no need to dismount and re-mount the database

Circular Logging and Continuous Replication • No way to transition a database from CL to CRCL or vice versa • If you have circular logging enabled for a non-replicated database and you add your first passive copy, Exchange will block that action to prevent switching from JET CL to CRCL • This is to prevent logs from being truncated while the new copy is being seeded (the newly added passive copy would not seed correctly if required logs were removed) • Removing the last passive copy would result in a transition from CRCL to CL, and is blocked because that transition would require dismounting and re-mounting the active copy of the database

Circular Logging and Continuous Replication • You must therefore disable CRCL/CL when • You add the second copy of a database (i.e., add the first passive copy) • You are trying to remove the last passive copy of a database • Error message: Database "{0}" has circular logging enabled. It is not possible to add or remove database copies while circular logging is enabled. Please disable circular logging before adding or removing mailbox database copies.
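The rules on the last few slides can be summarized in a brief Python sketch: which form of circular logging applies, and when a copy change is blocked. The function names are hypothetical, and the sketch follows the slides in blocking only the two transitions that would switch between JET CL and CRCL.

```python
def circular_logging_type(enabled: bool, copy_count: int) -> str:
    """Which form of circular logging applies to a database.

    copy_count includes the active copy, so 1 means non-replicated.
    """
    if copy_count < 1:
        raise ValueError("a database has at least one copy")
    if not enabled:
        return "none"
    return "CRCL" if copy_count > 1 else "JET CL"


def copy_change_blocked(enabled: bool, current_copies: int, new_copies: int) -> bool:
    """True when a copy add/remove would switch between JET CL and CRCL.

    That happens only when adding the first passive copy (1 -> 2) or
    removing the last passive copy (2 -> 1) while circular logging is on.
    """
    return (
        circular_logging_type(enabled, current_copies)
        != circular_logging_type(enabled, new_copies)
    )
```

With circular logging disabled, both calls to `circular_logging_type` return `"none"`, so no copy change is blocked, which is why disabling it first allows the operation to proceed.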

Exchange Server 2010 High Availability Concept: Replay Lag and Truncation Lag

Replay Lag Time • Enables you to take a database copy back to a specific point in time • Configure a Replay Lag Time >0 • Configure using Set-MailboxDatabaseCopy • Any database copy configured with a replay lag time >0 is considered a “lagged copy” • Lagged copies are only for point-in-time protection, but they are not a replacement for point-in-time backups • Logical corruption and/or mailbox deletion prevention scenarios • Provide a maximum of 14 days protection

Replay Lag Time • When should you deploy a lagged copy? • Useful only to mitigate a risk • Might not be needed if deploying a backup solution • Lagged copies are not HA database copies • Lagged copies should never be automatically activated! • Steps for manual activation documented at http://technet.microsoft.com/en-us/library/dd979786.aspx • ESE single page restore feature does not support lagged copies • If a lagged copy has database page corruption, it will have to be reseeded (which will lose the lagged aspect of the copy)

Truncation Lag Time • Enables you to use the logs on a passive database copy to recover from the loss of log files on the active database copy • Configure a Truncation Lag Time >0 • Configure using Set-MailboxDatabaseCopy

Using Replay Lag or Truncation Lag • Using Replay Lag or Truncation Lag affects your storage design • By design, both cause logs to build up, even when continuous replication circular logging (CRCL) is enabled • Plan your storage design accordingly • See http://technet.microsoft.com/en-us/library/dd335158.aspx for more info

Exchange Server 2010 High Availability Concept: Continuous Replication Modes
