It’s important to know that HA must be enabled AFTER VSAN, and if you change any VSAN networking settings you should reconfigure HA so that it is aware of those changes. When VSAN is enabled, HA traffic (heartbeats) will be sent via the VSAN VMkernel ports. More details can be found on Duncan’s blog.
How did I test if HA is working properly? I simply rebooted the host which was hosting my vCenter server VM. Guess what? Right, it worked. Once the lock on the VM expired, another host picked up the VM and booted it as expected. This also shows that there is no dependency on the vCenter server when it comes to restarting VMs on VSAN.
Host Isolation & VMware HA
This is basically the reason why it’s not possible to run VSAN with fewer than three hosts if redundancy is required (like n+1). Before VSAN creates a copy of a virtual machine, it makes sure a third host is available to hold the witness.
In case of a split-brain, the witness helps to decide which of the hosts holding a copy of the data (in this case esx1 or esx2) should take over control. The host that is still able to communicate with the node holding the witness will do so, whereas the isolated host should power off its VMs (this requires the proper setting for the Host Isolation Response). If the VMs continued to run on an isolated host, there would be no way to properly protect the data because it would no longer be mirrored to another node.
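The witness logic can be sketched as a simple vote count. This is a simplified model and not VSAN's actual implementation: it assumes one vote per component, a single VM object with two replicas and one witness, and that esx3 is the host holding the witness (the placement and the function name are illustrative assumptions).

```python
from typing import Dict, Iterable, List, Set

# Hypothetical component placement for one VM object (assumption):
# two data replicas plus one witness, each on its own host.
COMPONENTS: Dict[str, str] = {
    "esx1": "replica",
    "esx2": "replica",
    "esx3": "witness",
}


def partitions_with_quorum(
    partitions: Iterable[Set[str]],
    components: Dict[str, str] = COMPONENTS,
) -> List[List[str]]:
    """Return the network partitions that see more than half of the components."""
    total = len(components)
    winners = []
    for partition in partitions:
        # One vote for every component hosted inside this partition.
        votes = sum(1 for host in partition if host in components)
        if votes * 2 > total:
            winners.append(sorted(partition))
    return winners


# esx1 is isolated, esx2 can still reach the witness on esx3.
print(partitions_with_quorum([{"esx1"}, {"esx2", "esx3"}]))  # [['esx2', 'esx3']]

# Every host is isolated from the others: nobody holds a majority.
print(partitions_with_quorum([{"esx1"}, {"esx2"}, {"esx3"}]))  # []
```

In this model the isolated host never wins, which is why its VMs should be powered off rather than left running without a mirror.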
Host Isolation response: Power Off
11:41 PM Connection to the host has been lost
11:42 PM vSphere HA initiated a virtual machine failover action in cluster Cluster2 in datacenter Home
11:42 PM vSphere HA restarted a virtual machine vMotionME on host esx3.home.local in cluster Cluster2
The VM was back online quite fast. When I reconnected the host, everything looked fine because the isolated VM had already been powered off.
Host Isolation response: Leave powered On
11:57 PM Connection to the host has been lost
11:57 PM vSphere HA initiated a virtual machine failover action in cluster Cluster2 in datacenter Home
11:57 PM vSphere HA restarted a virtual machine vMotionME on host esx3.home.local in cluster Cluster2
So far so good. Via the IPMI adapter I was able to take a look at the running virtual machines:
This could cause some issues!
If the VM network is not affected by the isolation, the VM would still be accessible via external connections, which sounds good at first. But VMware HA will start a new copy of the VM very soon, so you would get duplicate IP addresses on your network, not to mention that the applications/clients connecting to those VMs could freak out.
I also ran into another problem: when I reconnected the isolated host with the VM still running, it started “to fight” with the host that had restarted the VM. The vSphere client showed the VM flapping between the two hosts. I had to manually kill the VM process to end that fight. So make sure to use a proper host isolation response!
What if I lose 2 of my 3 hosts?
With central storage (SAN/NAS) this would be no problem as long as the remaining host provides sufficient compute resources, or at least enough to power on the majority of your VMs.
With VSAN things look a bit different, depending on the number of hosts in a cluster. I moved all VMs to a single host and isolated the two other hosts from the cluster, so that I ended up with a single host running all virtual machines. The remaining host was not so happy. All VMs slowed down rapidly, applications stopped responding, and the host wasn’t able to open a VM console session (MKS missing). It felt a bit like an All Paths Down (APD) scenario. However, the RDP connection was persistent and didn’t disconnect. I would assume this is similar to an APD: it may work for a while, but not indefinitely.
When I re-enabled the switch port for just ONE of the isolated hosts, the remaining host instantly recovered and all VMs continued without any problem.
I admit it was not fair to force the three-node cluster into three network partitions. Let me reemphasize that VSAN works as designed! A three-node cluster tolerates a single host outage, and if you need to tolerate more you simply need to add additional hosts!
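How many hosts are needed for a given number of tolerated host failures can be sketched with the 2n+1 rule that applies to mirrored VSAN objects (n+1 replicas plus n witnesses, each on its own host). The function name below is hypothetical and the sketch ignores capacity and compute sizing.

```python
def min_hosts(failures_to_tolerate: int) -> int:
    """Minimum number of hosts for mirrored objects, using the 2n+1 rule."""
    if failures_to_tolerate < 1:
        raise ValueError("failures_to_tolerate must be at least 1")
    return 2 * failures_to_tolerate + 1


for ftt in (1, 2, 3):
    print(f"tolerate {ftt} host failure(s): at least {min_hosts(ftt)} hosts")
```

For a single tolerated failure this gives the three hosts used in this lab, and tolerating two failures would already require five.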
What if your ESXi installation is broken and you have to re-install ESXi? I tried exactly that by re-installing ESXi without any preparation. So the host I re-installed was still a VSAN cluster member when I rebooted it for the setup.
During that time the other two hosts were running fine, with no issues in the VSAN cluster at all.
Once the host was back online I had to perform some manual steps:
- Re-connect the host
- Remove and re-add it to the distributed switch*
- Re-create all VMkernel interfaces
- Disable VMware HA on the cluster. As soon as I disabled HA, the VSAN cluster went green and the VSAN datastore appeared!
- Re-enable VMware HA
* In case you run into the following error when trying to remove the ESXi host from the vDS:
“vDS DSwitch0 port 40 is still on host esx1.home.local connected to ESXi5.0 nic=4002 type=vmVnic”
Solution: Reboot the host; while it is offline you can easily remove it.