Designing High Availability with Microsoft Exchange Server 2010
- 7/15/2010
Availability Planning for Mailbox Servers
In addition to normal IT best practices and redundant hardware, the DAG is the primary high-availability option for Exchange 2010 Mailbox servers. A DAG is a collection of servers that provides continuous replication and availability for mailbox databases, as shown in Figure 11-1.
FIGURE 11-1 A Database Availability Group
Continuous replication creates a passive database copy on another Mailbox server in the DAG, and then uses asynchronous log shipping to maintain the copies.
The continuous replication process follows these steps:
The active transaction log is written and then closed.
The Microsoft Exchange Replication service replicates the closed log to servers hosting the passive database copies.
Because each copy of the database must remain identical, the Log Inspector performs the following checks on each transaction log it receives:
Verifies the physical integrity of the transaction log
Verifies that the header generation is not higher than the highest generation for the current database copy
Verifies the log header matches the generation of the file name
Verifies the log file signature in the header matches the log file
The transaction log is then placed in the defined transaction log directory.
The Information Store then validates the transaction log and applies it to the database copy, which keeps the databases in sync.
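The four Log Inspector checks listed above can be sketched in a few lines of Python. This is an illustrative model only, not Exchange code: the `LogFile` fields, the `inspect_log` function, and the way the expected signature and highest valid generation are passed in are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LogFile:
    """Hypothetical stand-in for a shipped transaction log (not an Exchange type)."""
    file_generation: int    # generation implied by the log file name
    header_generation: int  # generation recorded in the log header
    header_signature: str   # log signature recorded in the header
    checksum_ok: bool       # outcome of a physical integrity check


def inspect_log(log: LogFile, highest_generation: int, expected_signature: str) -> list[str]:
    """Return the names of the checks that failed; an empty list means the log passes.

    highest_generation and expected_signature are supplied by the caller;
    how Exchange derives them is not described in the text.
    """
    failures = []
    if not log.checksum_ok:
        failures.append("physical integrity")
    if log.header_generation > highest_generation:
        failures.append("generation too high")
    if log.header_generation != log.file_generation:
        failures.append("header/file name mismatch")
    if log.header_signature != expected_signature:
        failures.append("signature mismatch")
    return failures
```

A log that returns an empty list would move on to the transaction log directory; any other result corresponds to a failed inspection.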
A DAG also has the following characteristics:
Requires the Windows failover clustering feature and therefore an Enterprise edition of Windows Server (Windows Server 2008 or Windows Server 2008 R2), although the installation and configuration tasks occur with the Exchange Server management tools. Exchange Server does not use Windows failover clustering to handle database failover. Instead, it uses Active Manager to manage the failover process.
Members must have the same operating system.
You can add up to 16 servers to a single DAG and create up to 16 copies of a database. Up to 100 databases can be mounted as either a passive or active copy of the database on each server in the DAG.
Uses an evolution of the continuous replication technology that is available in Exchange 2007.
A DAG can be created after you install the Mailbox server. If a Mailbox server is hosting active mailbox databases, it can be added to a DAG later, if it meets the requirements.
Allows you to move a single database between servers in the DAG without affecting other databases. Failover occurs per mailbox database, not for an entire server.
Allows up to 16 copies of a single database on separate servers. A server can only host one copy of each database.
Requires the database and transaction log copies for each database to be stored in the same path on all servers. For example, if you store Mailbox Database 1 in D:\DB\Mailbox Database 1\ on Dallas-MB01A, you must also store it in D:\DB\Mailbox Database 1\ on all other servers that host copies of Mailbox Database 1.
Defines the boundary for replication, failovers, and switchovers—only servers in the DAG can host database copies. You cannot replicate database copies to Mailbox servers that are not in the same DAG.
Does not require that all databases have the same number of copies. In a 16-node DAG, one database can have 16 copies, whereas other databases can have no redundant copies or a varying number of copies.
In Exchange 2010, transaction log shipping occurs over TCP sockets as opposed to the file share (Server Message Block) used in Exchange 2007. You can view the current TCP port used for replication by running Get-DatabaseAvailabilityGroup -Status | Format-List. The default TCP port used for replication is 64327. You can change it by using the Set-DatabaseAvailabilityGroup cmdlet with the -ReplicationPort parameter. For this change to take effect, you need to create the Windows Firewall exceptions for the new TCP port and then restart the Microsoft Exchange Replication service on each node in the DAG.
In the initial release of Exchange 2010, when you created a DAG using the EMC, the DAG was automatically configured to obtain an IP address from DHCP. To complete the configuration and assign a static IP address, you had to use the EMS. In SP1, the DAG can be configured with an IP address from within the EMC.
The target member notifies the member running the active copy of which transaction logs it expects to receive. The source member then responds by sending the required transaction log files. After the transaction logs are received from the source server, the files are placed in the target server’s Inspector directory for processing. The logs are then inspected and verified for integrity and the header is inspected. After passing inspection, a transaction log is placed in the log directory on the target Mailbox server. If the transaction log does not pass inspection the target server will request it from the source up to three times before setting the mailbox database copy to Failed. When a database copy status is Failed, it will periodically attempt to copy the missing log files in order to return the database to a state of Healthy. The target Exchange server then plays the logs against the local copy of the database.
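The request-and-inspect cycle described above can be modeled as a small retry loop. The following Python sketch is an assumption-laden illustration: `receive_log`, its two callbacks, and the choice to count the first request as one of the three attempts are hypothetical and are not taken from Exchange.

```python
from typing import Callable

# "Up to three times" comes from the text; treating the first request as
# attempt number one is an assumption made for this sketch.
MAX_REQUESTS = 3


def receive_log(request_log: Callable[[], bytes],
                passes_inspection: Callable[[bytes], bool]) -> str:
    """Request a log until it passes inspection or the attempts run out.

    Returns "Healthy" when an inspected log is accepted, otherwise "Failed".
    """
    for _ in range(MAX_REQUESTS):
        data = request_log()
        if passes_inspection(data):
            return "Healthy"
    return "Failed"
```

In this model a copy that ends up in the Failed state would simply call `receive_log` again later, which mirrors the periodic retry the text describes.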
Before this transaction log shipping process can start, the database copy must first be seeded. Seeding is the process of creating a consistent database copy on a DAG member to act as a baseline that will be updated through continuous replication of the transaction log files. This can be accomplished using the following methods:
Automatic seeding Automatic seeding occurs during the creation of a new database.
Manually copying the offline database This method involves dismounting the database and copying the database file to the target server. If you do this, service will be interrupted while the database is dismounted.
Using the Update-MailboxDatabaseCopy cmdlet You can use the Update-MailboxDatabaseCopy cmdlet in the EMS to seed a database copy.
Using the Update Database Copy Wizard You can use the Update Database Copy Wizard within the EMC to seed a database copy.
Database failover occurs when the active database fails and another copy of the database is activated on another server in the DAG. This can occur because of a number of failure types, including network, storage, and server hardware failures. If an entire DAG member fails, each of the active highly available databases will attempt to fail over to another configured DAG member. A switchover occurs when an administrator initiates moving an active database from one server to another.
Active Manager
Windows failover clustering is not used to replicate or manage the active database copies in a DAG; however, it is used to store several pieces of volatile information about the DAG, such as the state of active database copies. Exchange Server uses a Windows failover cluster, but there are no cluster groups for Exchange Server, and the cluster has no storage resources. In the Failover Cluster Management Console, you will see an empty cluster, as shown in Figure 11-2. Exchange 2010 does use the cluster API library functions for cluster network (heartbeating), node management, and cluster registry functions. Although Active Manager stores database information in the cluster database, it isn’t accessed directly by any other components.
FIGURE 11-2 Windows Failover Cluster Management objects for a DAG
To manage mailbox database replication and activation, Exchange 2010 includes a new component called Active Manager, which runs as a function of the Microsoft Exchange Replication service (MSExchangeRepl.exe). Active Manager replaces the resource model and failover management features integrated into Windows failover clustering that previous Exchange Server versions used. To simplify the architecture, Active Manager runs on all Mailbox servers, even if the server is not part of a DAG.
Active Manager runs on all of the DAG members and runs as either the primary active manager (PAM) or a standby active manager (SAM). The PAM is the Active Manager in a DAG that controls which copies will be active and which will be passive. It is responsible for processing topology change notifications and reacting to server failures. The DAG member acting as the PAM is always the member that currently owns the default cluster group, as shown in Figure 11-3. To identify the PAM, it is recommended that you use Get-DatabaseAvailabilityGroup <DAG Name> -Status | Format-List Name, PrimaryActiveManager rather than the Windows Failover Clustering tools. If the server that owns the default cluster group fails, the PAM function automatically moves to the server that takes ownership of the default cluster group.
FIGURE 11-3 Identifying the DAG member that has the PAM function
If you are going to perform maintenance on the server that hosts the default cluster group, you must first manually move the PAM function to another server in the DAG. Figure 11-4 shows this on a Windows Server 2008 R2 server. To do the same on Windows Server 2008, run cluster.exe group "Cluster Group" /MOVETO:Dallas-MB01B from a command prompt.
FIGURE 11-4 Moving the PAM function
Far from having a passive role, the SAM function provides information about which server hosts the active copy of a mailbox database. The SAM detects local database and Information Store failures and reacts to them by requesting the PAM to initiate a failover when a copy is available. A SAM does not determine a failover target, nor does it update a database’s location state for the PAM. Each SAM accesses the state of the active database copy in order to answer requests from other Exchange components, such as Hub Transport or Client Access servers, for the location of the active copy. The PAM also performs the functions of the SAM role on the local system.
SP1 includes StartDagServerMaintenance.ps1, a script that you use to take a computer out of service. The script moves active databases off of the server and blocks databases from activating on that server. It will also ensure that all critical DAG support functionality is moved to another server, and blocked from moving back. The StopDagServerMaintenance.ps1 script is then used to complete the operation and remove the blocks and allow databases to be activated on that node.
Adding Database Copies
Creating a database availability group is just the first step in making a database highly available. A database that exists on one of the DAG members must be set up with additional copies on other DAG members. Some databases may require more copies than others.
When creating a database copy, you can specify the following details:
The name of the database you are copying.
The name of the Mailbox server that will host the database copy.
The amount of time (in minutes) to delay log replay. This sets how long to wait before the transaction logs are committed to the database copy. Setting the value for replay lag time to 0 disables the log replay delay.
The amount of time (in minutes) for log truncation delay. This controls how long to wait before truncating committed transaction logs. Setting the value for truncation lag time to 0 disables the log truncation delay.
An activation preference number. This represents the activation preference order of a database copy when multiple database copies have the same copy queue length after a failure or outage of the active copy.
The seed copy server. This server will be used to copy the seed database and content indexing information to the new copy. Although this is specified when creating a new database copy, replication always occurs from the active database to each of the copies.
Database copies should be created according to a high-availability plan that identifies the level of redundancy required for your environment. If JBOD (Just a Bunch of Disks) will be used to store database files, additional copies of the database should exist on other servers to sustain a disk failure.
You can add database copies using the Add-MailboxDatabaseCopy cmdlet or you can use the Add Mailbox Database Copy Wizard in the EMC.
Lagged Database Copies
One of the options available when configuring mailbox database copies is to configure a lag time of up to 14 days. This lag time is the time that the transaction logs will be held before being committed to the database copy. By delaying committing the logs to a database copy, you have the capability to recover the copy to a point in time using the copy rather than having to pull data from tape-based backup media.
Lagged database copies are deployed to protect from logical corruption. Database logical corruption and store logical corruption are the two types of logical corruption that can occur in the Exchange database.
If you use multiple database copies and Single Item Recovery, only the extremely rare catastrophic store logical corruption case remains unaddressed. In the following scenarios lagged database copies can be used to recover data:
Recovering an item that was deleted within the past 14 days but is outside the deleted item retention period
Recovering to a point in time because of virus outbreak
You should deploy lagged copies only to mitigate a specific risk. Lagged copies are usually not needed if you are also deploying a third-party backup solution. Lagged copies should not be treated as another high-availability database copy and should not be activated for the following reasons:
You lose your point-in-time recoverability.
You lose your backup copy.
Page patching is not processed on lagged copies.
Lagged copies take a long time to bring online as transaction logs are applied.
Lagged copies have storage implications because enough space must be available to store the transaction logs for the lag period. However, rather than just meeting those requirements, it is best practice to have at least enough room for three additional days of transaction logs, to provide for potential truncation failures or periods of excessive log file generation. More information on planning for and recovering Exchange 2010 is covered in Chapter 12, “Backup, Restore, and Disaster Recovery.”
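The storage guidance above reduces to simple arithmetic. The sketch below is a rough estimate only: the function name and the daily log count are hypothetical inputs, and the calculation relies on the 1-MB transaction log file size used by Exchange 2010 and ignores any other overhead.

```python
LOG_FILE_MB = 1   # Exchange 2010 transaction log files are 1 MB each
SAFETY_DAYS = 3   # additional days of logs recommended in the text
MAX_LAG_DAYS = 14 # maximum lag time mentioned in the text


def lagged_copy_log_space_mb(logs_per_day: int, lag_days: int) -> int:
    """Estimate the log space (in MB) to reserve for a lagged database copy."""
    if not 0 <= lag_days <= MAX_LAG_DAYS:
        raise ValueError("lag time must be between 0 and 14 days")
    return logs_per_day * LOG_FILE_MB * (lag_days + SAFETY_DAYS)


# Example: a database that generates 20,000 logs per day with a 7-day lag
print(lagged_copy_log_space_mb(20_000, 7))  # 200000 MB, roughly 195 GB
```

The daily log count has to come from measurements of your own environment; the figure of 20,000 is only a placeholder.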
Continuous Replication–Block Mode
Introduced in Exchange 2010 Service Pack 1 (SP1), continuous replication–block mode reduces the exposure to data loss on failover by replicating all log writes to the passive database copies in parallel with writing them locally. In other words, block mode replicates the transactions to the database copies as they are being written to the active local transaction log files. The log copy process enables and disables block mode automatically for each database. Block mode becomes active automatically when continuous replication in file mode is up to date with the database copies. The replication transport is the same whether granular replication (block mode) is enabled or disabled.
The benefit of block mode is that it can dramatically reduce the latency between the active copy and the passive copy while also reducing the possibility of data loss during a failover and the time it takes to perform a switchover.
DAG Networks
A DAG network is a set of subnets that can be configured for replication or MAPI communication. Exchange supports the use of a single network adapter and path for DAG members. However, to provide network redundancy as well as the ability to separate replication and MAPI communication, multiple network adapters and networks (subnets) are recommended. After the network hardware is in place and configured and Windows failover clustering has detected the changes, these additional physical networks can be configured by setting up additional DAG networks within Exchange.
Consider the following criteria when designing the network for a DAG deployment:
Each DAG can have only one MAPI network. This network must provide connectivity to other Exchange servers, Active Directory, and DNS.
Each DAG member must have at least one network adapter that is able to communicate with all other DAG members.
Each DAG member’s MAPI network must be able to communicate with every other DAG member’s MAPI network interface.
Each DAG member must have the same number of networks.
Each DAG can have zero or more replication networks.
Regardless of location, the round-trip network latency between DAG members cannot be greater than 250 milliseconds (ms).
DAG networks support Internet Protocol Version 4 (IPv4) and IPv6. IPv6 is supported only when IPv4 is also used; a pure IPv6 environment isn’t supported.
APIPA addresses (including manually assigned addresses from the APIPA address range) aren’t supported for use by DAGs.
Each DAG member’s replication network must be able to communicate with every other DAG member’s replication network.
There should be no direct routing to allow heartbeat traffic from the replication network on one DAG member to the MAPI network on another DAG node, or vice versa.
Each DAG requires a minimum of one IP address on the MAPI network. Additional IP addresses are required when the MAPI network is extended across multiple subnets. The DAG requires an IP address on each subnet it will be active on.
When Internet SCSI (iSCSI) is used for storage, these networks should not be used for replication. This keeps replication communication from interfering with storage operations. It is a best practice to manually disable the iSCSI network from being used by the DAG and by the cluster. For more information see “Managing Database Availability Groups” under the DAG Networks and iSCSI Networks subheading at http://technet.microsoft.com/en-us/library/dd298065.aspx.
A DAG network can be configured in a couple of different ways. The previous list suggested having at least two networks defined: one network dedicated for MAPI communication and one network dedicated for replication, as shown in Figure 11-5. If all of the replication networks go offline or fail, the MAPI network will be used for replication.
FIGURE 11-5 DAG network configuration
Database Failover Process
When a highly available mailbox database fails, the PAM attempts to perform a failover of the database. Before the database is mounted, the attempt copy last logs (ACLL) process occurs. ACLL makes remote procedure calls (RPCs) to each DAG node that hosts a copy of the mailbox database that is being activated. These calls check whether the servers are available and healthy and determine the LogInspectorGeneration value for each database copy. The last active mailbox database copy is used to copy any missing log files to the copy selected by Active Manager for activation. If the ACLL process fails to retrieve all of the missing log files, the configured AutoDatabaseMountDial value is consulted. The AutoDatabaseMountDial setting has the following three potential values:
BestAvailability This value allows the database to be automatically mounted immediately after a failover if the copy queue length is less than or equal to 12. The copy queue length is the number of logs that the passive copy knows about but that have not yet been replicated to it. When the copy queue length is less than or equal to 12, Exchange Server attempts to replicate the remaining logs to the passive copy and mount the database.
GoodAvailability This value allows the database to be automatically mounted immediately after a failover if the copy queue length is less than or equal to six. When the copy queue length is less than or equal to six, Exchange Server attempts to replicate the remaining logs to the passive copy and mount the database. This is the default value.
Lossless This value does not allow a database to mount automatically until all logs generated on the active copy have been copied to the passive copy.
If the number of lost logs is within the configured AutoDatabaseMountDial value, Exchange Server mounts the database. If the number of lost logs falls outside the configured AutoDatabaseMountDial value, Exchange Server does not mount the database until either missing log files are recovered or an administrator manually mounts the database and accepts that the loss of data is larger than the AutoDatabaseMountDial setting. You use the Set-MailboxServer cmdlet to configure the AutoDatabaseMountDial setting for each DAG node.
It may seem counterintuitive to list the Best Availability as allowing for 12 missing transaction logs, and Good Availability as only allowing 6. In this case, availability is referring to the database being mounted and available, not to the possibility of lost data. In most enterprise environments, data loss is less acceptable than the loss of service. You must decide whether to keep the database available by allowing it to mount despite potential data loss or to leave it unavailable and wait for manual recovery of missing log files.
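The relationship between the three AutoDatabaseMountDial values and the number of lost logs can be expressed as a lookup. This Python sketch only restates the thresholds given above; the function and dictionary names are hypothetical, and the real mount decision in Exchange involves more than this single comparison.

```python
# Maximum number of missing logs each AutoDatabaseMountDial value tolerates,
# as described in the text (Lossless tolerates none).
MOUNT_DIAL_LIMITS = {
    "Lossless": 0,
    "GoodAvailability": 6,
    "BestAvailability": 12,
}


def can_auto_mount(dial: str, lost_logs: int) -> bool:
    """Return True if a copy missing lost_logs logs may mount automatically."""
    return lost_logs <= MOUNT_DIAL_LIMITS[dial]


for dial in MOUNT_DIAL_LIMITS:
    print(dial, can_auto_mount(dial, lost_logs=8))
```

With eight lost logs, only BestAvailability permits an automatic mount, which illustrates why the "best" setting favors availability of the service over completeness of the data.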
Mailbox Database Activation
When an active database failure occurs, Active Manager attempts to locate the database copy that allows the quickest failover with the least likelihood of data loss. To do so, it applies several sets of selection criteria in a complex sorting system.
During the process for selecting the best copy to activate, Active Manager will:
Enumerate all the available copies.
Remove any copies on unreachable servers.
Sort available copies by how up to date they are.
Use the activation preference if a tiebreaker is necessary.
For more information on the selection process, see “Understanding Active Manager” at http://technet.microsoft.com/en-us/library/dd776123.aspx.
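The four selection steps listed above can be sketched as a filter followed by a sort. This is a deliberately simplified model: the `Copy` class and `choose_copy` function are hypothetical, copy queue length is used as the only measure of how up to date a copy is, and the additional criteria sets that Active Manager evaluates are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Copy:
    """Hypothetical summary of one passive database copy."""
    server: str
    reachable: bool
    copy_queue_length: int      # lower means more up to date
    activation_preference: int  # lower means more preferred


def choose_copy(copies: list[Copy]) -> Optional[Copy]:
    """Pick the reachable copy that is most up to date, breaking ties by preference."""
    candidates = [c for c in copies if c.reachable]  # drop unreachable servers
    if not candidates:
        return None
    return min(candidates, key=lambda c: (c.copy_queue_length, c.activation_preference))
```

Sorting on the tuple `(copy_queue_length, activation_preference)` means the activation preference only matters when two copies are equally current, which matches its role as a tiebreaker.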
Exchange 2010 SP1 provides the RedistributeActiveDatabases.ps1 script, which offers three ways to balance active database copies. The first option, the switch parameter -BalanceDbsByActivationPreference, activates the copy that has the lowest ActivationPreference value without taking into account Active Directory site balance. The second option, the switch parameter -BalanceDbsIgnoringActivationPreference, attempts to balance active copies across the DAG, as shown in Figure 11-6. The third option, -BalanceDbsBySiteAndActivationPreference, attempts to keep active databases balanced between Active Directory sites. The version of the script included in SP1 won’t move databases to less preferred copies to achieve site balance, but it will log a warning. The script attempts to minimize active copy imbalance during the redistribution process, which helps prevent a single node from being overwhelmed with active copies.
FIGURE 11-6 Running RedistributeActiveDatabases.ps1
Controlling Database Activation
In large environments you may want to limit which servers can host an active database in the event of a failure, for example, to keep a database from being brought online in a secondary datacenter, on a server where you are performing maintenance, or from a lagged copy. A database activation policy can be set on the Mailbox server, or an individual database copy can be configured to not activate. When you set the policy on the Mailbox server by using Set-MailboxServer ServerName -DatabaseCopyAutoActivationPolicy, the following three policies are available:
Blocked No database can be automatically activated.
IntrasiteOnly Database copies on this server can be automatically activated only when the failed active copy is in the same Active Directory site, which prevents cross-site failovers.
Unrestricted This allows any server in the DAG to be used for database activation. This is the default configuration.
These policies only affect how Active Manager calculates where to activate database copies. An administrator can manually mount the database on a server that has the activation policy set to Blocked. The server auto activation policy is usually used during periods of maintenance when you do not want a database copy to be automatically activated on a specific server.
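The effect of the three policies on automatic activation can be summarized as a small decision function. The Python sketch below is a simplified model under stated assumptions: the function name and the site parameters are hypothetical, and it covers only automatic activation, not a manual mount by an administrator.

```python
def auto_activation_allowed(policy: str, failed_copy_site: str, target_site: str) -> bool:
    """Return True if a copy on the target server may be activated automatically.

    policy is the DatabaseCopyAutoActivationPolicy value of the target server.
    """
    if policy == "Blocked":
        return False
    if policy == "IntrasiteOnly":
        return failed_copy_site == target_site
    if policy == "Unrestricted":
        return True
    raise ValueError(f"unknown policy: {policy}")


print(auto_activation_allowed("IntrasiteOnly", "Dallas", "Dallas"))  # True
print(auto_activation_allowed("IntrasiteOnly", "Dallas", "Denver"))  # False
```

The site names in the example are placeholders; in this model the policy is always evaluated for the server that would host the newly activated copy.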
The second way to control database activation is to suspend database activation on a specific copy of the database. This can be done by running Suspend-MailboxDatabaseCopy <Database Name>\<Server Name> -ActivationOnly, as shown in Figure 11-7. Suspending activation for a specific database copy should be done on copies that you do not want to be activated automatically, such as lagged database copies.
FIGURE 11-7 Suspending activation on a database copy
Unlike a copy on a Mailbox server that has an activation policy set, a database copy with activation suspended cannot be activated directly by an administrator, as shown in Figure 11-8. However, this block can be reset in two ways: by reseeding the database copy or by suspending and then resuming replication.
FIGURE 11-8 Attempting to activate a database copy when activation is blocked
Transport Dumpster
If a failure occurs and some transaction logs have not been replicated to the passive copy, the transport dumpster is used to redeliver any recently delivered e-mail. When a database failover occurs, a request is made to the Hub Transport servers to redeliver any lost e-mail messages.
The transport dumpster only retains e-mail that has already been delivered. The local submission queue withholds any pending outgoing e-mail. After the transaction logs containing the e-mail message are replicated to and inspected by each DAG member with a copy of the database, the Hub Transport server purges the message from the dumpster.
The transport dumpster is enabled by default. You can view its settings by using the Get-TransportConfig cmdlet and change them by using the Set-TransportConfig cmdlet. The following two properties control the transport dumpster:
MaxDumpsterSizePerDatabase This setting defines the maximum size of the transport dumpster queue per database and is set globally for the entire Exchange organization. The recommended size is 1.5 times the maximum message size that can be sent. For example, if the maximum size for messages is 20 MB, this parameter should be set to 30 MB.
MaxDumpsterTime This is the time for which the transport dumpster retains a message if the message is not purged for exceeding the maximum dumpster size. The default is set to seven days.
Managing Database Copies
You can use a number of cmdlets to manage database copies. Understanding the function of each is essential to being able to manage database copies. The following cmdlets are available:
Add-MailboxDatabaseCopy This cmdlet is used to create a passive copy of an existing mailbox database on another DAG member.
Remove-MailboxDatabaseCopy This cmdlet is used to delete a passive copy of an existing mailbox database.
Update-MailboxDatabaseCopy This cmdlet updates or seeds a passive database copy. This is useful in situations in which seeding was not performed when the copy was created, or an error has caused the passive copy to diverge from the active copy.
Suspend-MailboxDatabaseCopy This cmdlet suspends continuous replication to the specified database copy.
Resume-MailboxDatabaseCopy This cmdlet resumes continuous replication to the specified database copy that was previously suspended.
Set-MailboxDatabaseCopy This cmdlet is used to configure the activation preference, replay lag time, and truncation lag time.
Get-MailboxDatabaseCopy This cmdlet is used to retrieve information about the mailbox database copy, such as the activation preference, replay lag time, and truncation lag time.
Get-MailboxDatabaseCopyStatus This cmdlet is used to retrieve information about the health of the mailbox database copy.
Obtaining detailed information about the status of the database copies is important. One way to do this is with the Get-MailboxDatabaseCopyStatus cmdlet. Figure 11-9 shows the output of Get-MailboxDatabase | Get-MailboxDatabaseCopyStatus | Format-List. The two properties that are of immediate interest are the Content Index State and the Status, which ideally are Healthy. Also, be sure to note the CopyQueueLength because this is the number of transaction log files that have not been successfully copied to the passive copies. When you add the -ConnectionStatus parameter, additional details about the replication networks are shown, such as the networks being used for log replication and seeding.
FIGURE 11-9 Running Get-MailboxDatabaseCopyStatus
Other potential states for database copies exist in addition to Healthy. Table 11-2 summarizes all of the possible copy status states that you may encounter.
TABLE 11-2 Database Copy Status
COPY STATUS | DESCRIPTION
ActivationSuspended | The database copy has been manually blocked from activation.
DisconnectedAndHealthy | The database copy has become disconnected from the active database copy. When it was disconnected it was in the Healthy state. This status may be reported during DAG network failures between the source copy and the target database copy.
DisconnectedAndResynchronizing | The database copy is disconnected from the active database copy. When it was disconnected it was in the Resynchronizing state. This status may be reported during DAG network failures between the source copy and the target database copy.
Dismounted | The active copy is offline and not accepting client connections.
Dismounting | The active copy is going offline and terminating client connections.
Failed | The database copy is in a Failed state and it isn’t able to copy or replay log files. In this state, the system will periodically check whether the problem that caused the copy status to change to Failed has been resolved and attempt to automatically resume.
FailedAndSuspended | The Failed and Suspended states have been set simultaneously by the system because a failure was detected, and resolution of the failure explicitly requires administrator intervention.
Healthy | The database copy is successfully copying and replaying log files.
Initializing | The system is verifying that the database and log stream are in a consistent state. This state occurs when a database copy is created; when the Microsoft Exchange Replication service is starting; and during transitions from Suspended, ServiceDown, Failed, Seeding, or SinglePageRestore to another state.
Mounted | The active copy is online and accepting client connections.
Mounting | The active copy is coming online and not yet accepting client connections.
Resynchronizing | The database copy and its log files are being compared with the active database copy to check for divergence.
Seeding | The database copy is being seeded, the content index for the mailbox database copy is being seeded, or both are being seeded. After seeding is successful, the copy status changes to Initializing.
SeedingSource | The database copy is being used as a source for a database copy seeding operation.
ServiceDown | The Microsoft Exchange Replication service is not running on the server that hosts the mailbox database copy.
SinglePageRestore | This state indicates that a single page restore operation is occurring on the database copy.
Suspended | The database copy is in a Suspended state as a result of an administrator manually suspending the database copy by running the Suspend-MailboxDatabaseCopy cmdlet.
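The copy status values in Table 11-2, together with the content index state and copy queue length, lend themselves to a simple triage check. The Python sketch below assumes that the output of Get-MailboxDatabaseCopyStatus has already been exported into dictionaries keyed by property name; that export step, the function name, and the queue length threshold are all assumptions made for illustration.

```python
# Statuses treated as normal in this sketch: Healthy for passive copies,
# Mounted for the active copy.
OK_STATUSES = {"Healthy", "Mounted"}


def copies_needing_attention(copies: list[dict], max_copy_queue_length: int) -> list[str]:
    """Return the names of copies whose status, index state, or queue looks unhealthy."""
    flagged = []
    for copy in copies:
        if (copy["Status"] not in OK_STATUSES
                or copy["ContentIndexState"] != "Healthy"
                or copy["CopyQueueLength"] > max_copy_queue_length):
            flagged.append(copy["Name"])
    return flagged
```

The threshold passed as `max_copy_queue_length` is a value you choose for your own environment; the text does not prescribe one.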
In some instances, such as during maintenance, you may need to suspend and resume continuous replication activity for a database copy. Transaction logs are not truncated on the active mailbox database copy while one or more passive copies are suspended. During an extended maintenance period this may result in a large number of transaction logs accumulating in your transaction log directory. In these cases, you may opt to remove the affected passive database copy instead of suspending it. When the maintenance is complete, you can re-add the passive database copy.
Designing and Configuring DAGs
When deploying a CCR environment in Exchange 2007, the sizing was straightforward: the databases were running on one node or the other. In Exchange 2010, which offers you the ability to have 16 members with up to 1,600 databases, sizing and designing the layout is far more complex. The obvious rule is that the more servers you have in a DAG, the more options you have for laying out your database copies efficiently and resiliently. Consider the implications of a three-copy, six-server DAG versus two DAGs with three servers and three copies of each database. More servers in a single DAG give you more flexibility in creating copies and balancing load. To illustrate, if a single server with three active databases fails in a three-member DAG, the two remaining servers need to service the load from the failed server, as shown in Figure 11-10.
Compared to two 3-member DAGs, a 6-member DAG can spread the impact of a failure across more servers and can sustain more member failures.
FIGURE 11-10 Three-node DAG failover
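The difference between a three-member and a six-member DAG can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a model of Active Manager: each server starts with three active databases, every database has three copies, the copy placement in `build_layout` is a hypothetical striping scheme chosen for illustration, and a failed database simply activates on the first surviving server in its preference list.

```python
from collections import Counter


def build_layout(members: int, active_per_member: int) -> dict:
    """Map each database to three servers in activation preference order.

    The first entry hosts the active copy. The second and third entries are
    staggered per database so that one server's databases fail over to
    different peers (a hypothetical striping scheme).
    """
    layout = {}
    for s in range(members):
        for k in range(active_per_member):
            second = (s + 1 + k % (members - 1)) % members
            third = (s + 1 + (k + 1) % (members - 1)) % members
            layout[f"DB-{s}-{k}"] = [s, second, third]
    return layout


def active_after_failure(layout: dict, failed: set) -> Counter:
    """Count active databases per surviving server after `failed` servers go down."""
    load = Counter()
    for servers in layout.values():
        for server in servers:
            if server not in failed:
                load[server] += 1
                break  # database is active on the first surviving copy
    return load


for members in (3, 6):
    load = active_after_failure(build_layout(members, 3), failed={0})
    print(f"{members}-member DAG: busiest survivor hosts {max(load.values())} active databases")
```

Under these assumptions the busiest survivor in the three-member DAG ends up with five active databases, while in the six-member DAG no survivor hosts more than four.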
In Figure 11-10 the DAG was designed to sustain a single-node failure; if more than one member were down, at least two databases would be offline. Simply adding a member to a DAG does not automatically enable it to sustain multiple failures, as Figure 11-11 shows. Here, servers are configured to mirror each other in a four-member DAG. If either A and B or C and D fail, a large number of databases will be unavailable. This configuration provides no better member redundancy than having two 2-member DAGs.
You should design the database copy layout for the worst-case failure that you must survive to meet your agreed-upon SLAs. The following two rules apply for redundancy:
One-member failure requires two or more high-availability copies, two or more servers, and a witness server.
Two-member failure requires three or more high-availability copies, four or more servers, and a witness server.
Rather than mirroring database copies on two servers, it is better to stripe copies across the members or to place copies randomly across the DAG, which reduces the likelihood that a small number of failures will cause database outages.
FIGURE 11-11 A four-node mirrored configuration
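The effect of mirroring versus striping can be checked by enumerating every two-member failure in a four-member DAG. The sketch below is illustrative only: the server names A through D follow Figure 11-11, while the twelve databases, their names, and the two-copy layouts are hypothetical.

```python
from itertools import combinations

SERVERS = ["A", "B", "C", "D"]


def offline_count(layout: dict, failed: tuple) -> int:
    """Count databases whose every copy resides on a failed server."""
    return sum(1 for copies in layout.values() if set(copies) <= set(failed))


def worst_case(layout: dict, failures: int = 2) -> int:
    """Largest number of databases taken offline by any `failures`-member outage."""
    return max(offline_count(layout, f) for f in combinations(SERVERS, failures))


# Mirrored: A and B hold the same six databases, C and D hold the other six.
mirrored = {
    f"DB{i:02d}": ("A", "B") if i <= 6 else ("C", "D")
    for i in range(1, 13)
}

# Striped: the twelve databases are spread evenly over all six server pairs.
pairs = list(combinations(SERVERS, 2))
striped = {f"DB{i:02d}": pairs[(i - 1) % len(pairs)] for i in range(1, 13)}

print("mirrored worst case:", worst_case(mirrored))
print("striped worst case:", worst_case(striped))
```

In this toy layout, losing a mirrored pair takes six databases offline, whereas striping limits the loss from any two-member failure to two databases. With only two copies per database some loss is unavoidable when two members fail, which is why the two-member failure rule calls for three or more copies.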
When determining the copy design plan for the worst case, ensure that the members can handle all of the hosted database copies becoming active. If you plan on oversubscribing the members, you can cap the number of simultaneously active databases on each member by using the Set-MailboxServer cmdlet with the -MaximumActiveDatabases parameter, so that no more copies come online than the server can handle. When the Mailbox server has reached the maximum, no additional database mounts will succeed. If Active Manager attempts to mount a database on that server, the mount fails and Active Manager attempts to mount the database copy on another member, if one is available. Also, because usage profiles change over time, periodically re-evaluate the appropriate level of oversubscription and whether the maximum number of active database copies should be adjusted to accommodate hardware and usage changes.
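The interaction between oversubscription and a per-server cap on active databases can be sketched with a toy failover model. This is a deliberately simplified illustration: Active Manager's real copy selection weighs more factors than a fixed preference list, and the server names, database names, and cap of three are all hypothetical.

```python
from collections import Counter


def fail_over(active: dict, copies: dict, failed_server: str, max_active: dict):
    """Re-home databases from `failed_server`, honoring each server's cap.

    active:     database -> server currently hosting the active copy
    copies:     database -> servers holding a copy, in preference order
    max_active: server -> maximum number of simultaneously active databases
    Returns the new active map and the databases that could not be mounted.
    """
    new_active = {db: s for db, s in active.items() if s != failed_server}
    load = Counter(new_active.values())
    unmounted = []
    for db, server in active.items():
        if server != failed_server:
            continue
        for candidate in copies[db]:
            if candidate == failed_server:
                continue
            if load[candidate] >= max_active[candidate]:
                continue  # mount refused: candidate is already at its cap
            new_active[db] = candidate
            load[candidate] += 1
            break
        else:
            unmounted.append(db)  # no surviving member had capacity
    return new_active, unmounted


active = {"DB1": "MBX1", "DB2": "MBX1", "DB3": "MBX1",
          "DB4": "MBX2", "DB5": "MBX2", "DB6": "MBX3", "DB7": "MBX3"}
copies = {db: ["MBX1", "MBX2", "MBX3"] for db in active}
caps = {"MBX1": 3, "MBX2": 3, "MBX3": 3}

new_active, unmounted = fail_over(active, copies, "MBX1", caps)
print("re-homed:", {db: new_active[db] for db in ("DB1", "DB2")})
print("could not mount:", unmounted)
```

Here the two survivors each have room for only one more active database, so one of the three databases from the failed server stays dismounted. That is the trade-off a cap makes: it protects the surviving servers from overload at the cost of availability when the DAG is oversubscribed.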
Over time, as maintenance is performed, mailbox databases may end up active on servers other than the ones intended. As part of routine maintenance, remember to redistribute the active database copies across the DAG. You can also use the RedistributeActiveDatabases.ps1 script, which is included in SP1, to automatically balance active database copies across DAG members.
Deciding the number and location of database copies also involves the storage infrastructure and the operational maturity of your IT department. Assuming the operational challenges can be overcome, you should consider a few best practices when choosing whether to use RAID (Redundant Array of Independent Disks) or JBOD (just a bunch of disks), as summarized in Table 11-3.
TABLE 11-3 Choosing Between RAID and JBOD in a Single-Site Deployment
NUMBER OF COPIES | STORAGE OPTIONS
Two high availability | RAID
Three or more high availability | RAID or JBOD
One active and one lagged copy | RAID
When a large number of databases are hosted on each server in a DAG, disk management can become complicated, especially when you are using JBOD storage. Only 23 drive letters are available to mount additional disk drives, because A and B are reserved and the operating system is most likely installed on C. When planning a DAG that will require a number of volumes, it is a best practice to use volume mount points rather than drive letters. Volume mount points allow volumes to be mounted as directories rather than drive letters. For example, you may want to mount a 1-TB volume in D:\Databases\Dallas-MB01 to store the Dallas-MB01 database files. You could then mount another 1-TB volume in C:\Databases\Dallas-MB02 for storing the Dallas-MB02 database files. This way you are no longer constrained by the number of drive letters available.
Using mount points introduces a problem: if the drive that contains the mount points fails, you lose connectivity to all of the volumes mounted on it. The best practice is to protect the volume that contains the mount points by using RAID, which reduces the likelihood of a single disk failure taking the entire server offline.