In Parts I and II I tested some scenarios that may impact your VSAN cluster, such as simulating the outage of two hosts by simply isolating them. This time I’m going to torture my VSAN lab even more. Read on to see how this turned out.
What if I put one host into maintenance mode and another node fails?
Will the remaining node be able to run the VMs and even restart those of the failed host?
Maintenance Mode with “Full data migration”
VSAN didn’t allow me to put a host into maintenance mode using full data migration. I have a couple of VMs running with a VM Storage Policy of FTT (Failures to Tolerate) = 1. With just three hosts, evacuating one of them would leave too few hosts to satisfy that policy.
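The host arithmetic behind this can be sketched in a few lines of Python. This is only a rough illustration, assuming the usual rule of thumb for mirrored objects that FTT = n needs n + 1 data copies plus n witnesses, so 2n + 1 hosts in total. The function names are made up for this example and are not part of any VSAN API, and the check ignores free capacity completely.

```python
def min_hosts_for_ftt(ftt: int) -> int:
    """Hosts needed for mirrored objects: ftt + 1 replicas plus ftt witnesses.

    Assumption: the 2n + 1 rule of thumb, not a value queried from VSAN.
    """
    if ftt < 0:
        raise ValueError("ftt must be non-negative")
    return 2 * ftt + 1


def full_migration_possible(total_hosts: int, ftt: int,
                            hosts_in_maintenance: int = 1) -> bool:
    """True if the remaining hosts can still hold every component.

    Hypothetical helper: it only counts hosts and ignores disk capacity.
    """
    remaining = total_hosts - hosts_in_maintenance
    return remaining >= min_hosts_for_ftt(ftt)


if __name__ == "__main__":
    # My lab: three hosts, FTT = 1, one host going into maintenance mode.
    print(full_migration_possible(total_hosts=3, ftt=1))
    # The same policy with a fourth host in the cluster.
    print(full_migration_possible(total_hosts=4, ftt=1))
```

With three hosts and FTT = 1 the check fails because only two hosts remain, which matches what I saw in the lab. A fourth host would leave enough room for a full evacuation, at least as far as the host count is concerned.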
Maintenance Mode and “Ensure accessibility”
This mode is possible since it ensures that at least one copy of the VM data and the witness (or the second copy of the VM data) is available on the remaining nodes. This mode didn’t move any data around since there were no eligible nodes available.
Then I simulated several outage scenarios to see what would happen:
- Host Reboot of ESX3
The remaining host ESX2 was fully functional and restarted the VM running on ESX3.
- Disk Failure on ESX3
The remaining host ESX2 was fully functional AND the VMs running on ESX3 kept working, since they were able to access their disks on ESX2 over the network. So it was also no problem to vMotion the VMs from ESX3 over to ESX2 and then reboot the host to fix the disk failure.
Btw. I simply re-plugged the SSD I had pulled out to simulate the failure. After re-importing the foreign configuration in the PERC controller, the volume was back online, VSAN recognized it and no data was lost.
- Network Partition – ESX3
The last test was to isolate ESX3 and as expected the remaining host ESX2 was fully functional and restarted the VM running on ESX3.
Honestly? This is way better than I expected, since we are still talking about a THREE(!) node cluster. OK, I admit there are also scenarios where things can go wrong. Take the scenario above: had there been no VM data on ESX2, the VMs on ESX3 would have crashed.
But again, a three node cluster is just the minimum deployment, so if you want to make sure you can withstand multiple host failures, you have to add more nodes. It’s as simple as that.
OK, now the disk is gone and I want to replace it. Usually, when using a RAID level other than RAID0, this would be no problem since the volume would still be online. In my case I was forced to create a RAID0 on every single device because the PERC 6/i doesn’t support pass-through mode. So even after I replaced the drive, the RAID0 was still offline, which means I had to reboot the host to manually force the RAID0 online again. Had I used a pass-through capable controller, this would have been no problem since it would just pass the new disk through. RAID0 also rules out using a hot spare disk, since from a logical standpoint it wouldn’t make any sense to replace a disk within a RAID0 with an empty disk.
Stay tuned for more VSAN experiences!