In today’s age of cloud computing, virtualized applications are very common. While deployment details may vary, the common characteristic among environments is that these applications all reside on virtualization platforms. This is true for private clouds, the NFINIT Community Cloud, and public clouds alike.
While cloud environments, when properly deployed, provide high availability for virtual machine instances, they do not protect against the failure of an application on an individual instance. Scheduled maintenance activities likewise impact the availability of applications running on single instances.
To achieve high levels of availability, the application must be deployed and configured analogously to physical environments, but with virtual instances. This is commonly referred to as “Guest Clustering”.
Deploying guest clusters on virtualization platforms presents a specific set of challenges. This white paper describes a potential solution that is platform agnostic.
The Business Problem for Guest Clustering
The challenge in guest clustering scenarios lies primarily in the storage configuration. The “traditional” high availability solution on the Microsoft Windows Server platform has been its Failover Clustering role. This model requires storage that is accessible by all nodes within the cluster. This ensures that “ownership” of the storage can be moved to another node and application data can be accessed in the event of a failover. This configuration is commonly referred to as shared storage. Shared storage requires some kind of storage fabric that is accessible to and from all nodes, and is usually implemented through Fibre Channel or iSCSI based infrastructures.
However, both Fibre Channel and iSCSI infrastructures require rather complex configuration of hardware and software components (zoning, multi-path configuration, etc.) that demands administrator attention and care. While this is possible in virtualized environments, it goes against the “self-service” paradigm of cloud deployments, which is why many public cloud providers (including AWS, Azure, and Google Cloud) state that shared storage is not supported.
Two of Microsoft’s business-critical applications have moved either completely or partially to a shared-nothing infrastructure, where data ownership by a particular node and data replication are handled at the application level: Exchange Database Availability Groups (DAGs) and SQL Server Always On Availability Groups. While this works well for these two applications, for SQL Server it means deploying the Enterprise Edition, which comes at a considerable licensing premium. Beyond traditional SQL clustering deployments, deploying high availability for other common applications, such as file servers or third-party applications that can be configured in a cluster, has been a challenge on all virtualization platforms, including AWS and Azure.
On its own virtualization platform (Hyper-V), Microsoft has provided virtualized shared storage capabilities since version 2012 through shared VHDXs (Microsoft’s virtual hard drive format). Version 2016 replaced these with VHD Sets, which address some of the limitations of shared VHDXs. However, these capabilities are not available on non-Microsoft virtualization platforms (VMware, Xen) or in any of the public clouds.
Additionally, Microsoft has included synchronous volume replication functionality (Storage Replica) in its operating system beginning with Windows Server 2016. While this technology has matured with subsequent versions up to the recently released version 2019, and there has been noticeable Microsoft buzz around deploying HA scenarios using volume replication, no actual solutions have surfaced yet.
Enter Storage Spaces Direct (S2D). S2D, Microsoft’s software-defined storage solution, is commonly associated with physical hyper-converged platforms. However, with some minor constraints, it becomes a viable solution for guest cluster deployments on any virtualization platform. The following example outlines the deployment of a highly available file server. While this PoC deployment is performed on a Hyper-V platform, the solution can be deployed on all virtualization platforms, including Azure, AWS, and others.
The final deployment is shown in figure 1. The file server cluster is deployed using two virtual machine nodes, in this case running Windows Server 2016 Datacenter, as S2D is a feature of the Datacenter edition only. In addition, the Core installation option was chosen to reduce maintenance as well as the attack surface of the platform. Each node is connected to two virtual (VLAN-tagged) networks: one for client communications and one for internal cluster heartbeat and storage communications. For the CLSTR network, the MTU is set to jumbo frames and DNS registration is disabled. Three data drives are added to each node, which will make up the usable storage pool. The drives are attached only to the virtualized SCSI controller and left raw.
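As a sketch, the per-node network tuning described above can be applied with the following PowerShell commands; the adapter alias “CLSTR” is an assumption and will differ per deployment:

```powershell
# Enable jumbo frames on the cluster/storage adapter
# (adapter name "CLSTR" is a placeholder for this deployment)
Set-NetAdapterAdvancedProperty -Name "CLSTR" -RegistryKeyword "*JumboPacket" -RegistryValue 9014

# Prevent the cluster network's address from being registered in DNS
Set-DnsClient -InterfaceAlias "CLSTR" -RegisterThisConnectionsAddress $false
```

Note that the exact registry keyword and supported jumbo frame value can vary by virtual NIC driver; verify with Get-NetAdapterAdvancedProperty on the target systems.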
The following setup and configuration tasks can be performed either via PowerShell on one of the nodes, or through GUI tools from another domain joined system with the appropriate remote administration tools.
Once the virtual machines are set up, patched, and domain joined, the necessary roles and features are installed: Failover-Clustering, RSAT-Clustering-PowerShell, Storage-Replica, FS-FileServer, and Data-Center-Bridging. The cluster is then created and configured with a file share witness quorum hosted on a domain controller. After re-validating the cluster with the file share witness in place, S2D is enabled via PowerShell on one of the nodes. The screenshot below shows the output of the commands. The first command validates that the capacity drives can be added to an S2D storage pool (CanPool and Usage columns). The second command enables the S2D storage pool.
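The steps above can be sketched as follows; node names, the cluster name, and the witness share path are placeholders for this deployment:

```powershell
# Install the required roles and features on each node
Install-WindowsFeature -Name Failover-Clustering, RSAT-Clustering-PowerShell, `
    Storage-Replica, FS-FileServer, Data-Center-Bridging

# Validate and create the cluster (node and cluster names are placeholders)
Test-Cluster -Node "NODE1","NODE2"
New-Cluster -Name "S2DCLSTR" -Node "NODE1","NODE2" -NoStorage

# Configure a file share witness hosted on a domain controller (path is a placeholder)
Set-ClusterQuorum -FileShareWitness "\\DC1\ClusterWitness"

# Confirm the raw data drives are eligible for pooling, then enable S2D
Get-PhysicalDisk | Select-Object FriendlyName, CanPool, Usage, Size
Enable-ClusterStorageSpacesDirect
```

Enable-ClusterStorageSpacesDirect claims all eligible (CanPool = True) drives on all nodes and creates the storage pool automatically.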
The warnings are expected in this environment. S2D attempts to distinguish between solid state drives and hard drives (spindles) and to leverage solid state drives as a read/write cache for the spinning disks. In a virtualized environment, all disks are identified as media type HDD, so no drives are available for a caching tier. However, performance can be improved by using system RAM as a read cache using the PowerShell commands below:
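A minimal sketch of this configuration, assuming the cluster-wide CSV in-memory read cache is used (the 2 GB size is an example value, set in MB per node):

```powershell
# Reserve 2 GB of RAM per node as a CSV in-memory read cache
# (value is in MB; 2048 is an example, tune to available node memory)
(Get-Cluster).BlockCacheSize = 2048
```

The cache is carved out of each node’s system memory, so the value should be weighed against the RAM requirements of the applications running on the nodes.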
Next, volumes are created for the services that will be installed. While this can be done with the GUI management tools, PowerShell is recommended for this task. The PowerShell cmdlet combines the multi-step process of creating the virtual disk, partitioning and formatting it, creating the volume with a matching name, and adding it to the cluster shared volumes into one single step:
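As a sketch, the New-Volume cmdlet performs all of these steps at once; the volume name and size below are example values for this deployment:

```powershell
# Create a mirrored, cluster-shared volume in a single step
# (friendly name "FSData" and 100 GB size are example values)
New-Volume -StoragePoolFriendlyName "S2D*" -FriendlyName "FSData" `
    -FileSystem CSVFS_ReFS -Size 100GB
```

Specifying CSVFS_ReFS both formats the volume with ReFS and adds it to the cluster shared volumes; on a two-node S2D cluster the resilience defaults to a two-way mirror.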
The cluster is now a fully functional virtualized two-node hyper-converged cluster. Using Failover Cluster Manager on a management station and connecting to the cluster, the storage pool with the six attached disks is shown under the Pools node, and the cluster shared volume under the Disks node. The cluster is now ready for the installation of cluster-aware applications, such as SQL Server Standard or a clustered file server.
After deploying a Scale-Out File Server (SoFS) role and a file share with continuous availability enabled on this cluster, a node failure is simulated by moving the role to the other node during a file transfer operation.
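The failover simulation can be triggered with a single command; the role and node names below are placeholders for this deployment:

```powershell
# Simulate a failover by moving the SoFS role to the other node
# (role name "SoFS" and node name "NODE2" are placeholders)
Move-ClusterGroup -Name "SoFS" -Node "NODE2"
```

Because the share is continuously available, SMB Transparent Failover allows clients to re-establish their handles on the new node rather than failing the copy operation.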
A 1.3 GB file was transferred from a client to the file server cluster. During the failover, a brief pause in the data transfer was observed; however, the copy operation itself was not interrupted.
Deploying Microsoft-based guest clusters in public clouds, private clouds, or the NFINIT Community Cloud has traditionally been a challenge. Some solutions are only possible on certain platforms; others carry considerably higher cost, requiring Enterprise editions or third-party software. Deploying an S2D guest cluster provides all the necessary functionality for an application on top of Windows Failover Clustering, without the need for shared storage.
While NFINIT has validated the solution that is described in this white paper as feasible, organizations should confirm with application vendors and infrastructure providers whether this methodology is supported. In addition, thorough testing, including stress testing, should be performed prior to moving deployments into production.
References and further reading:
Deploy Storage Spaces Direct
Deploying IaaS VM Guest Clusters in Microsoft Azure
DAGs are the only supported HA solution for Exchange
Basic Always On Availability Groups are available with SQL Server Standard since version 2016, but with significant limitations