Working with VSAN – Part IV

To continue my “Working with VSAN” series, this time I want to test its scalability (at least as far as my lab allows). But see for yourself.

Performance scaling by adding disk groups

To see how VSAN scales when adding disks I did the following tests:

IOMeter @ 32 QD, 75/25 RND/SEQ, 70/30 R/W, 8KB block size, in combination with different disk group and VM storage policy settings. It’s not about the absolute values or settings, it’s about showing the scalability. Not to mention that the SSDs used are pretty old (1st gen OCZ Vertex) and differ in performance!

RUN1

Failures to Tolerate (FTT): 0

VMDK Stripe: 1

[Screenshots: FTT=0 / Stripe=1 policy and IOMeter results]

RUN2

Failures to Tolerate (FTT): 0 – so still on one host…

VMDK Stripe: 2 – … but on two disk groups!

[Screenshots: FTT=0 / Stripe=2 policy and IOMeter results]

To combine multiple stripes as shown above with FTT > 0, you will need multiple disk groups in each host to get the performance benefit. In my case I only had a single host with two disk groups, so I was not able to perform the same test with an FTT = 1 policy.
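If you want to double-check how many disk groups a host contributes without clicking through the Web Client, the esxcli VSAN namespace should do the trick. This is just a rough sketch; the exact output fields may vary between builds:

esxcli vsan cluster get

esxcli vsan storage list

The first command shows whether the host has joined the VSAN cluster, the second lists every claimed disk together with its VSAN Disk Group UUID, so two different group UUIDs on one host mean two disk groups.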

Changing VM Storage Policies

To wrap up this post I want to mention that during my tests I always used the exact same VMDK, so I had to change the policy multiple times. Of course it took some time until VSAN moved the data around so that it was compliant with the policy again. But it worked like a charm and I thought it was also worth mentioning!
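If you are curious whether VSAN is still moving data after such a policy change, the Ruby vSphere Console (RVC) that ships with vCenter has a resync dashboard. A rough sketch, assuming a cluster called “VSAN-Cluster” (adjust the login and inventory path to your environment):

rvc administrator@vsphere.local@vcenter_fqdn

cd /vcenter_fqdn/Datacenter/computers/VSAN-Cluster

vsan.resync_dashboard .

It should show the objects that are currently resyncing and how much data is left to move.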

But what about the network?

Multiple VMKernel Interfaces 

In case you are planning to run VSAN over a 1GbE network (which is absolutely supported), multiple VMKernel interfaces could be a good way to mitigate a potential network bottleneck. This would enable VSAN to leverage multiple 1GbE channels to transport data between the nodes.

To use multiple VMKernel ports you need to keep two things in mind: use a different subnet for each VMK, and always set a single vmnic as active with all others on standby.

[Screenshots: active/standby vmnic configuration and the two VSAN VMKernel interfaces]
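Just for reference, enabling VSAN traffic on additional VMKernel interfaces can also be done from the command line. A rough sketch, assuming vmk3 and vmk4 are the two VSAN interfaces (as in my setup):

esxcli vsan network ipv4 add -i vmk3

esxcli vsan network ipv4 add -i vmk4

esxcli vsan network list

The last command should list both interfaces once they are tagged for VSAN traffic.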

To see how this would scale I moved a virtual machine to a host that had none of that particular VM’s data stored locally, so that all the reads (and writes) had to traverse the network.

[Screenshot: two VMs with a local witness]

I had already set up the VSAN networking a couple of days before, so I started with the desired multi-VMK setup and was quite happy with the results.

[Screenshots: esxtop and IOMeter results with two VSAN VMKernel interfaces (FTT=1, Stripe=1)]

Then I disabled VSAN on the second VMKernel interface and also moved the vmnics down to standby only. The result was as expected: VSAN was just using a single vmnic.

[Screenshots: esxtop and IOMeter results with a single VSAN VMKernel interface (FTT=1, Stripe=1)]

To verify these results I wanted to switch back to the multi-VMKernel setup, but for some reason I wasn’t able to get it to work again. I moved the vmnic up to be active again (as depicted above) and re-enabled VSAN traffic on the second VMKernel interface (VMK4). But since then I have been unable to see VSAN traffic across both NICs. When I disable VSAN traffic on the first VMKernel (VMK3), it switches to the second interface (VMK4), which tells me that the interfaces are generally working. At this point I’m a bit clueless and asking you guys: have you already tried this setup? What are your results? Am I missing something or did I misunderstand something? Are there any specific scenarios where the multi-VMK setup kicks in? I would love to get some feedback!
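In case someone wants to reproduce this, the two quick checks I can suggest are to verify that both interfaces are still listed for VSAN traffic and that each one can reach the VSAN interfaces of the other nodes on its own subnet. A rough sketch (replace the addresses with the VSAN IPs of a neighbour node):

esxcli vsan network list

vmkping -I vmk3 xxx.xxx.xxx.xxx

vmkping -I vmk4 xxx.xxx.xxx.xxx

And to untag an interface again, esxcli vsan network ipv4 remove -i vmk4 should do the trick.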

Working with VSAN – Part III

In Parts I and II I already tested some scenarios which may impact your VSAN cluster, like simulating the outage of two hosts by simply isolating them. This time I’m going to torture my VSAN lab even more; read on to see how this turned out.

What if I put one host into maintenance mode and another node fails?

[Diagram: one host in maintenance mode, another host failed]

Will the remaining node be able to run the VMs and even restart those of the failed host?

Maintenance Mode with “Full data migration”

[Screenshot: maintenance mode error]

VSAN didn’t allow me to put a host into maintenance mode using full data migration. I had a couple of VMs running with a VM Storage Policy of FTT (Failures to Tolerate) = 1, so with just three hosts a full evacuation would violate that rule.

 

Maintenance Mode with “Ensure accessibility”

This mode is possible since it ensures that at least one copy of the VM data and the witness (or the second copy of the VM data) is available on the remaining nodes. This mode didn’t move any data around since there were no eligible nodes available.*
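As a side note, the VSAN evacuation mode can also be specified when entering maintenance mode from the command line; on 5.5 the esxcli command got a corresponding option. A rough sketch (I used the Web Client for my tests, so treat the exact syntax as an assumption):

esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility

esxcli system maintenanceMode set --enable true --vsanmode evacuateAllData

esxcli system maintenanceMode set --enable false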

Then I simulated several outage scenarios to see what would happen:

  • Host Reboot of ESX3

The remaining host ESX2 was fully functional and restarted the VM running on ESX3.

  • Disk Failure on ESX3

[Screenshot: unhealthy disk group]

The remaining host ESX2 was fully functional AND the VMs running on ESX3 were still functional, since they were able to access their disks over the network on ESX2. So it was also no problem to vMotion the VMs from ESX3 over to ESX2 and reboot the host to fix the disk failure.

Btw. I simply re-plugged the SSD I had pulled out to simulate the failure. By re-importing the foreign configuration in the PERC controller, the volume came back online; VSAN recognized it and no data was lost.

[Screenshot: foreign configuration import in the PERC controller]

  • Network Partition – ESX3

The last test was to isolate ESX3 and as expected the remaining host ESX2 was fully functional and restarted the VM running on ESX3.

Honestly? This is way better than I expected, since we are still talking about a THREE node cluster. OK, I admit there could also be scenarios where things go wrong. Take the scenario above: if there had been no VM data on ESX2, those VMs on ESX3 would have crashed.

But again, a three node cluster is just a minimum deployment, so if you want to make sure you can withstand multiple host failures, you have to add more nodes; it’s as simple as that.

* This is in contrast to the scenario where you put two hosts into maintenance mode; in that case VSAN will start to move data around!

RAID0 Impact

OK, now the disk is gone and I want to replace it. Usually, when using a RAID level other than RAID0, this would be no problem since the volume would still be online. In my case I was forced to use a RAID0 on every single device because the PERC 6/i doesn’t support pass-through mode. So even after I replaced the drive, the RAID0 was still offline, which meant I had to reboot the host to manually force the RAID0 online again. If I had used a pass-through capable controller, this would have been no problem since it would simply present the new disk. The RAID0 approach also rules out using a hot spare disk, since from a logical standpoint it wouldn’t make any sense to replace a disk within a RAID0 with an empty disk.

 

Stay tuned for more VSAN experience!

Upgrade from vSphere 5.0 to 5.5 – 512 bit certificate issue

This week I upgraded a vSphere 5.0 environment to 5.5 Update 1, which is usually not a big deal. I really can’t complain about the upgrade process itself; it’s more the result that I didn’t expect.

Once all components were up to date, I launched the vSphere Web Client which was working fine but at the top I saw the following error message:

Failed to verify the SSL certificate for one or more vCenter Server Systems: https://vCenter_FQDN:443/sdk

I was able to login via the C# client which showed the vCenter as usual, so it seemed to be a problem between the vCenter server and the Web Client.

After I spoke to VMware Support it turned out that vCenter Server 5.5 doesn’t support the old 512 bit certificates. This problem is mentioned in the release notes:

After you upgrade vCenter Server 4.x to 5.5 Update 1, vCenter Server is inaccessible through vSphere Web Client*
When you upgrade from vCenter Server 4.x to 5.5 Update 1, the vCenter Server is not visible through vSphere Web Client. The issue occurs as vCenter Server 4.x supports SSL certificates with 512 bits but vCenter Server 5.5 supports only SSL certificates with greater than or equal to 1024 bits.

Workaround: To resolve this issue, replace the vCenter Server certificates with greater than or equal to 1024 bits

I wasn’t aware of this issue, and even if I had been I wouldn’t have recognized it, since I upgraded a 5.0 environment. The actual problem was that the environment had been upgraded from 4.1 to 5.0 before, which is why the 512 bit certificate was still in use. To see if you are affected by this problem, simply open the rui.crt file (C:\ProgramData\VMware\VMware VirtualCenter\SSL) before upgrading vSphere:

[Screenshot: 512 bit certificate shown in rui.crt]
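If you have OpenSSL at hand (the vCenter installation ships with one, see the path further below), you can also check the key length directly instead of eyeballing the certificate. A quick sketch:

openssl x509 -in "C:\ProgramData\VMware\VMware VirtualCenter\SSL\rui.crt" -text -noout

Look for the public key line in the output; if it shows (512 bit), you are affected.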

The funny thing is that none of the installation wizards recognized that the certificate is unsupported, so the upgrade went through without any errors.

However there is a way to fix it which I outline below:
1. Backup your vCenter Server / database / old SSL certificates

! This process will cause some downtime to certain vCenter services !

2. Generate new Certificates (KB2037432)

a) Create a temporary directory like c:\certs

b) Create a file called vcenter.cfg with the following content:

[ req ]
default_bits = 2048
default_keyfile = rui.key
distinguished_name = req_distinguished_name
encrypt_key = no
prompt = no
string_mask = nombstr
req_extensions = v3_req

 

[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = digitalSignature, keyEncipherment, dataEncipherment
extendedKeyUsage = serverAuth, clientAuth
subjectAltName = DNS: HOSTNAME, IP: xxx.xxx.xxx.xxx, DNS: FQDN

 

[ req_distinguished_name ]
countryName = DE
stateOrProvinceName = YOURPROVINCE
localityName = YOURCITY
0.organizationName = YOURORGANIZATION
organizationalUnitName = vCenterServer
commonName = FQDN

 

3. Start to create new certificates

The openssl utility can be found in C:\Program Files\VMware\Infrastructure\Inventory Service\bin\openssl.exe

openssl req -new -nodes -out c:\certs\rui.csr -keyout c:\certs\rui-orig.key -config c:\certs\vCenter.cfg

openssl rsa -in c:\certs\rui-orig.key -out c:\certs\rui.key

openssl req -text -noout -in c:\certs\rui.csr

openssl x509 -req -days 3650 -in c:\certs\rui.csr -signkey c:\certs\rui.key -out c:\certs\rui.crt -extensions v3_req -extfile c:\certs\vCenter.cfg

openssl.exe pkcs12 -export -in c:\certs\rui.crt -inkey c:\certs\rui.key -name rui -passout pass:testpassword -out c:\certs\rui.pfx

openssl pkcs12 -in c:\certs\rui.pfx -info

openssl x509 -text -noout -in rui.crt

4. Create a file called chain.pem in C:\certs, then open the rui.crt file with an editor, copy its content into the chain.pem file, and save it.
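Since chain.pem is just a copy of the certificate in this case, a plain copy from a command prompt should do the same thing:

copy c:\certs\rui.crt c:\certs\chain.pem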

5. Use the SSL Certificate Automation Tool 5.5 (for vSphere 5.5 only!) to plan the actions and the order in which they should be performed:

[Screenshot: planned SSL update steps]

Take a screenshot of the list or write it down, you will need it in a second.

6. Now replace the certificates just for the vCenter server. The tool will ask you for the certificate chain (C:\certs\chain.pem), the private key (c:\certs\rui.key), and some credentials:

[Screenshot: updating the vCenter Server certificates]

7. Once this is done, you will need to re-establish the trust between the vCenter server and its components like the Inventory Service, SSO and so on. When performing these steps, follow the order depicted on the list you wrote down. The following screenshot shows the process generically, because it’s pretty similar for all other components:

[Screenshot: re-establishing trust (Update Manager)]

That’s it. After that I went to the vSphere Web Client and logged in as usual. No errors were left, the vCenter server connected to the Web Client correctly, and I was able to manage it. So overall this certificate replacement was easier than expected and it fixed the issue as required. I hope this helps!