Exchange 2010 Data Center Failover
- Last Updated: July 29, 2024
To better understand the concept of site resiliency, it's helpful to understand the basic operation of an Exchange 2010 data center failover.
- Typically a site-resilient deployment will contain a stretched Database Availability
Group (DAG), that is, a DAG that has members in both data centers. Within a
stretched DAG, the majority of the DAG members should be located in the primary
data center or, when each data center has the same number of members, the
primary data center hosts the witness server. This design guarantees that
service will be provided in the primary data center, as it will have ‘quorum’, even
if network connectivity between the two data centers fails.
However, it also means that when the primary data center fails, the members in the second data center lose quorum.
- From the point of view of site resiliency, the GEO LoadMaster provides automatic
site failover options for disaster recovery.
The GEO LoadMaster also offers DNS load balancing for all active data centers. The GEO LoadMaster can be deployed in a distributed (Active/Active) high availability configuration, with both GEO LoadMaster appliances securely synchronizing information.
Introducing GEO LoadMaster in your existing Authoritative Domain Name Services (DNS) requires minimal integration work and risk, allowing you to fully leverage your existing DNS investment.
- The LoadMaster server load balancers located within each data center provide
highly available, high-performance load balancing functionality within the
individual data centers. They can also provide a single point for consolidated
health checking and provide the GEO LoadMaster with real-time health check
information for the data center.
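The quorum behavior of the stretched DAG described earlier can be illustrated with a small sketch. It assumes a simplified voting model in which every DAG member has one vote and the witness server adds one vote only when the member count is even; the function name `has_quorum` and its parameters are hypothetical and are not part of Exchange or the LoadMaster.

```python
def has_quorum(reachable_members: int, total_members: int,
               witness_reachable: bool = False) -> bool:
    """Return True if a partition holds a strict majority of votes.

    Simplified model (an assumption, not the Exchange implementation):
    each DAG member has one vote, and the witness server contributes
    one extra vote only when the number of members is even.
    """
    uses_witness = total_members % 2 == 0
    total_votes = total_members + (1 if uses_witness else 0)
    votes = reachable_members + (1 if uses_witness and witness_reachable else 0)
    return votes > total_votes // 2


# Four-member DAG, two members per data center, witness in the primary site.
# Link between the sites fails: the primary still reaches the witness.
primary_after_link_failure = has_quorum(2, 4, witness_reachable=True)
secondary_after_link_failure = has_quorum(2, 4, witness_reachable=False)

# Primary site fails entirely: the secondary cannot reach the witness either.
secondary_after_primary_failure = has_quorum(2, 4, witness_reachable=False)
```

Under this model the primary data center keeps quorum when the inter-site link fails, while the secondary data center does not hold quorum either after a link failure or after a complete failure of the primary. This is why manual steps are needed to activate the Mailbox servers in the second data center.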
With the configuration as described in the above image, when a data center fails, a second data center can be rapidly activated to serve the failed data center's clients. However, a data center or site failure is managed differently from the types of failures that can cause a server or database failover. In a high availability configuration, automatic recovery is initiated by the system, and the failure typically leaves the messaging system in a fully functional state. By contrast, a data center failure is considered to be a disaster recovery event. For recovery to occur, a combination of automatic and manual steps must be performed and completed for the client service to be restored, and for the outage to end. The process followed is referred to as a data center failover.
When a data center or site failure occurs, recovery involves both automatic and manual steps. The GEO LoadMaster detects the site failure and automatically switches all traffic, with the exception of traffic for the Mailbox servers, from the servers in the failed data center to the servers in the second data center.
Because implementing a data center failover is not a trivial event, it can be useful to avoid a failover when the primary data center suffers only a transient failure. Upon detection of a site failure, the GEO LoadMaster can be configured to delay initiating the site failover for an administrator-specified period of time. If the site has recovered when the delay expires, the failover is not initiated. If the site has not recovered, the failover is initiated as normal. This option ensures that site failovers do not occur because of temporary issues within a site.
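The delay logic described above might be sketched as follows. This is an illustrative model only, not the GEO LoadMaster implementation: the class `DelayedFailover`, its fields, and the polling approach are hypothetical, and the sketch assumes the timer restarts whenever a health check sees the site as healthy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DelayedFailover:
    """Hypothetical model of a failover that waits out transient failures."""
    delay_seconds: float                  # administrator-specified delay
    failed_since: Optional[float] = None  # time the current failure was first seen

    def should_fail_over(self, site_healthy: bool, now: float) -> bool:
        """Process one health-check result taken at time `now` (in seconds)."""
        if site_healthy:
            # Site recovered: forget the failure, no failover is initiated.
            self.failed_since = None
            return False
        if self.failed_since is None:
            # First failed check: start the delay timer.
            self.failed_since = now
        # Fail over only once the failure has lasted for the whole delay.
        return now - self.failed_since >= self.delay_seconds


monitor = DelayedFailover(delay_seconds=300)
decisions = [
    monitor.should_fail_over(False, now=0),    # failure detected, timer starts
    monitor.should_fail_over(False, now=120),  # still within the delay
    monitor.should_fail_over(True, now=200),   # transient failure cleared
    monitor.should_fail_over(False, now=250),  # new failure, timer restarts
    monitor.should_fail_over(False, now=550),  # failure lasted the full delay
]
```

In this run only the final check returns a failover decision; the first outage clears before the delay expires and so never triggers a site failover.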
The Exchange deployment administrators must now perform a number of steps to complete the data center failover:
- Terminate services within the failed data center. All Mailbox and Unified Messaging services still running within the failed data center must be terminated.
- Validate the health of the second data center. The health of the second data center must be determined to ensure that it is capable of providing adequate service.
- Activate the Mailbox servers. This involves a process of marking the failed servers from the primary data center as unavailable followed by activation of the servers in the secondary data center.
The Failure Delay option can also be useful to ensure that the Exchange deployment administrators have sufficient time to perform the required manual steps described above. This enables the mailboxes to be correctly configured before clients begin to attempt access to the secondary data center.
As can be seen from the previous description, a data center failover is not a fully automated process and may take some time to complete. If the failed data center recovers, issues may arise if a failback (an attempt to restore services to the recovered data center) is initiated before the initial failover process is complete, or before the recovered data center is deemed healthy and its mailbox databases are ready for use. It's important that a failback not be performed until the infrastructure dependencies for Exchange have been reactivated, are functioning and stable, and have been validated. If these dependencies are not available or healthy, the failback process is likely to cause a longer outage than necessary, and it could fail altogether.
To ensure that this cannot occur, the GEO LoadMaster can be configured to administratively disable the failed data center upon the initiation of a failover. This ensures that, even if the failed data center recovers, administrator intervention is required before the data center is available for a failback to occur.
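The effect of administratively disabling a failed site can be sketched as a small state model. The names used here (`Site`, `record_failover`, `admin_reenable`) are hypothetical and chosen for illustration; they do not correspond to GEO LoadMaster settings or commands.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical view of a data center as seen by the global load balancer."""
    name: str
    healthy: bool = True
    admin_disabled: bool = False

    def eligible_for_traffic(self) -> bool:
        # A site receives traffic only if it is healthy AND not locked out.
        return self.healthy and not self.admin_disabled


def record_failover(failed_site: Site, disable_on_failover: bool = True) -> None:
    """On failover, optionally lock the failed site out until an admin acts."""
    if disable_on_failover:
        failed_site.admin_disabled = True


def admin_reenable(site: Site) -> None:
    """Explicit administrator action that permits a failback."""
    site.admin_disabled = False


primary = Site("primary")
primary.healthy = False                       # site failure detected
record_failover(primary)                      # failover initiated, site locked out
primary.healthy = True                        # site recovers on its own
eligible_before_admin = primary.eligible_for_traffic()
admin_reenable(primary)                       # admin has validated Exchange dependencies
eligible_after_admin = primary.eligible_for_traffic()
```

In this model, health alone is not enough to bring the recovered site back into service: traffic returns only after the explicit administrator step.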
For further information on how to configure the GEO LoadMaster to provide Exchange 2010 site resiliency, please refer to the GEO LoadMaster documentation.