Upgrade from vSphere 5.0 to 5.5 – 512 bit certificate issue

This week I upgraded a vSphere 5.0 environment to 5.5 Update 1, which is usually not a big deal. I really can’t complain about the upgrade process itself; it was the result that I didn’t expect.

Once all components were up to date, I launched the vSphere Web Client, which was working fine, but at the top I saw the following error message:

Failed to verify the SSL certificate for one or more vCenter Server Systems: https://vCenter_FQDN:443/sdk

I was able to log in via the C# client, which showed the vCenter as usual, so it seemed to be a problem between the vCenter Server and the Web Client.

After I spoke to VMware Support, it turned out that vCenter Server 5.5 doesn’t support the old 512 bit certificates. This problem is mentioned in the release notes:

After you upgrade vCenter Server 4.x to 5.5 Update 1, vCenter Server is inaccessible through vSphere Web Client*
When you upgrade from vCenter Server 4.x to 5.5 Update 1, the vCenter Server is not visible through vSphere Web Client. The issue occurs as vCenter Server 4.x supports SSL certificates with 512 bits but vCenter Server 5.5 supports only SSL certificates with greater than or equal to 1024 bits.

Workaround: To resolve this issue, replace the vCenter Server certificates with greater than or equal to 1024 bits

I wasn’t aware of this issue, and even if I had been I wouldn’t have recognized it, since I upgraded a 5.0 environment. The actual cause was that the environment had previously been upgraded from 4.1 to 5.0, which is why the 512 bit certificate was still in use. To see if you are affected by this problem, simply open the rui.crt file (C:\ProgramData\VMware\VMware VirtualCenter\SSL) before upgrading vSphere:

[Screenshot: 512BitCert]
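If you prefer a scripted check over opening the file by hand, here is a minimal Python sketch. It assumes you have already dumped the certificate as text with `openssl x509 -text -noout -in rui.crt` and pass that text in. The function names are my own, and the exact label in front of the key size differs between OpenSSL versions, so the pattern is deliberately loose.

```python
import re

# Matches labels such as "RSA Public Key: (512 bit)" or "Public-Key: (2048 bit)".
# The exact wording depends on the OpenSSL version, hence the loose pattern.
KEY_SIZE_PATTERN = re.compile(r"Public[- ]Key: \((\d+) bit\)")

# Minimum key size according to the release note quoted above.
MIN_SUPPORTED_BITS = 1024


def key_size_bits(cert_text):
    """Return the public key size found in the certificate text, or None."""
    match = KEY_SIZE_PATTERN.search(cert_text)
    return int(match.group(1)) if match else None


def is_affected(cert_text):
    """Return True if the key is smaller than vCenter Server 5.5 accepts."""
    bits = key_size_bits(cert_text)
    if bits is None:
        raise ValueError("no public key size found in the certificate text")
    return bits < MIN_SUPPORTED_BITS
```

Feed it the text output of the openssl command. If `is_affected` returns True, replace the certificate before (or right after) the upgrade.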

The funny thing is that none of the installation wizards recognizes that the certificate is unsupported, so the upgrade went through without any errors.

However, there is a way to fix it, which I outline below:
1. Back up your vCenter Server / database / old SSL certificates

! This process will cause some downtime to certain vCenter services !

2. Generate new certificates (KB2037432)

a) Create a temporary directory like c:\certs

b) Create a file called vcenter.cfg with the following content:

[ req ]
default_bits = 2048
default_keyfile = rui.key
distinguished_name = req_distinguished_name
encrypt_key = no
prompt = no
string_mask = nombstr
req_extensions = v3_req

 

[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = digitalSignature, keyEncipherment, dataEncipherment
extendedKeyUsage = serverAuth, clientAuth
subjectAltName = DNS: HOSTNAME, IP: xxx.xxx.xxx.xxx, DNS: FQDN

 

[ req_distinguished_name ]
countryName = DE
stateOrProvinceName = YOURPROVINCE
localityName = YOURCITY
0.organizationName = YOURORGANIZATION
organizationalUnitName = vCenterServer
commonName = FQDN
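
If you have to do this for more than one vCenter, filling in the placeholders by hand gets tedious. The following Python sketch renders the same configuration from a few parameters. The function name, parameter names and example values are my own assumptions; the template itself is just the file shown above.

```python
# Template copied from the vcenter.cfg above, with the placeholders as fields.
CONFIG_TEMPLATE = """\
[ req ]
default_bits = 2048
default_keyfile = rui.key
distinguished_name = req_distinguished_name
encrypt_key = no
prompt = no
string_mask = nombstr
req_extensions = v3_req

[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = digitalSignature, keyEncipherment, dataEncipherment
extendedKeyUsage = serverAuth, clientAuth
subjectAltName = DNS: {hostname}, IP: {ip_address}, DNS: {fqdn}

[ req_distinguished_name ]
countryName = {country}
stateOrProvinceName = {province}
localityName = {city}
0.organizationName = {organization}
organizationalUnitName = vCenterServer
commonName = {fqdn}
"""


def render_vcenter_cfg(hostname, ip_address, fqdn, province, city,
                       organization, country="DE"):
    """Return the content of vcenter.cfg with all placeholders filled in."""
    return CONFIG_TEMPLATE.format(
        hostname=hostname, ip_address=ip_address, fqdn=fqdn,
        country=country, province=province, city=city,
        organization=organization,
    )
```

Write the returned string to c:\certs\vcenter.cfg and continue with the next step.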

 

3. Create the new certificates

The openssl utility can be found at C:\Program Files\VMware\Infrastructure\Inventory Service\bin\openssl.exe

openssl req -new -nodes -out c:\certs\rui.csr -keyout c:\certs\rui-orig.key -config c:\certs\vCenter.cfg

openssl rsa -in c:\certs\rui-orig.key -out c:\certs\rui.key

openssl req -text -noout -in c:\certs\rui.csr

openssl x509 -req -days 3650 -in c:\certs\rui.csr -signkey c:\certs\rui.key -out c:\certs\rui.crt -extensions v3_req -extfile c:\certs\vCenter.cfg

openssl.exe pkcs12 -export -in c:\certs\rui.crt -inkey c:\certs\rui.key -name rui -passout pass:testpassword -out c:\certs\rui.pfx

openssl pkcs12 -in c:\certs\rui.pfx -info

openssl x509 -text -noout -in rui.crt

4. Create a file called chain.pem in C:\certs, open the rui.crt file with an editor, copy its content into chain.pem and save it.
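
The copy step can also be scripted. This is a small Python sketch with a helper name of my own choosing. It concatenates one or more PEM certificates into a chain file in the order given. For the self-signed certificate created above the list contains only rui.crt.

```python
from pathlib import Path

PEM_HEADER = "-----BEGIN CERTIFICATE-----"


def build_chain(cert_paths, chain_path):
    """Concatenate PEM certificates into one chain file, in the given order."""
    parts = []
    for cert_path in cert_paths:
        text = Path(cert_path).read_text().strip()
        # Fail early instead of writing something that is not a certificate.
        if PEM_HEADER not in text:
            raise ValueError(f"{cert_path} does not look like a PEM certificate")
        parts.append(text)
    Path(chain_path).write_text("\n".join(parts) + "\n")
    return chain_path
```

For example, `build_chain([r"c:\certs\rui.crt"], r"c:\certs\chain.pem")` produces the same file as the manual copy and paste.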

5. Use the SSL Certificate Automation Tool 5.5 (for vSphere 5.5 only!) to plan the actions and the order in which they should be performed:

[Screenshot: PlanSSLUpdateSetps]

Take a screenshot of the list or write it down; you will need it in a second.

6. Now replace the certificates just for the vCenter Server. The tool will ask you for the certificate chain, which is located in C:\certs\chain.pem, the private key c:\certs\rui.key and some credentials:

[Screenshot: UpdatevCenterCerts]

7. Once this is done, you will need to re-establish the trusts between the vCenter Server and its components like the Inventory Service, SSO and so on. When performing these steps, follow the order shown on the list you have written down. The following screenshot shows the process in a generic way, because it’s pretty similar for all other components:

[Screenshot: TrustUpdateManager]

That’s it. After that I went to the vSphere Web Client and logged in as usual. No errors left, the vCenter server connected to the Web Client correctly and I was able to manage it. So overall this certificate replacement was easier than expected and it fixed the issue as required. I hope this helps!

Working with VSAN – Part II

VMware HA

It’s important to know that HA must be enabled AFTER VSAN, and if you change any VSAN networking settings you should re-configure HA so that it is aware of those changes. When VSAN is enabled, HA traffic (heartbeat) will be sent via the VSAN VMkernel ports. More details can be found on Duncan’s Blog.

How did I test if HA is working properly? I simply rebooted the host which was hosting my vCenter Server VM. Guess what? Right, it worked. Once the lock on the VM expired, another host picked up the VM and booted it as expected. This also shows that there is no dependency on the vCenter Server when it comes to restarting VMs on VSAN.

 

Host Isolation & VMware HA

This is basically the reason why it’s not possible to run VSAN with fewer than three hosts if redundancy is required (like n+1). Before VSAN creates a copy of a virtual machine, it makes sure that a third host is available to hold the witness.

[Screenshot: Witness]

In case of a split-brain, the witness helps to decide which of the hosts holding a copy of the data (in this case esx1 or esx2) should take over control. The host that is able to communicate with the node holding the witness will be able to do so, whereas the isolated host should power off its VMs accordingly (this requires the proper setting for the Host Isolation Response). If the VMs continued to run on an isolated host, there would be no way to properly protect the data because it would no longer be mirrored to another node.
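
To make the witness idea a bit more tangible, here is a deliberately simplified Python model of the majority decision. It is not how VSAN is implemented internally. It only assumes that each object has three components (two data copies plus one witness), and that a network partition may keep using the object if it can reach more than half of them. The host names and the function name are just examples, and placing the witness on esx3 is my assumption.

```python
def partition_with_quorum(component_hosts, partitions):
    """Return the partition that reaches more than half of the components.

    component_hosts: hosts holding the object's components (copies + witness).
    partitions: sets of hosts that can still talk to each other.
    Returns None if no partition holds a majority.
    """
    total = len(component_hosts)
    for partition in partitions:
        reachable = sum(1 for host in component_hosts if host in partition)
        if reachable * 2 > total:
            return partition
    return None


# Data copies on esx1 and esx2, witness on esx3 (assumed placement).
components = ["esx1", "esx2", "esx3"]

# esx1 is isolated: esx2 plus the witness on esx3 keep the object available.
winner = partition_with_quorum(components, [{"esx1"}, {"esx2", "esx3"}])

# Every host on its own: nobody holds a majority.
nobody = partition_with_quorum(components, [{"esx1"}, {"esx2"}, {"esx3"}])
```

In this model `winner` is the partition containing esx2 and esx3, while `nobody` is None, because a single component out of three is never a majority.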

To see how it would behave, I simply isolated one of my hosts.

[Screenshot: HostIsolatedVMDiskPlacement]

1st RUN

Host Isolation response: Power Off

11:41 PM Connection to the host has been lost

11:42 PM vSphere HA initiated a virtual machine failover action in cluster Cluster2 in datacenter Home

11:42 PM vSphere HA restarted a virtual machine vMotionME on host esx3.home.local in cluster Cluster2

The VM was back online quite fast. When I reconnected the host, everything looked fine because the VM on the isolated host had already been powered off.

 

2nd RUN

Host Isolation response: Leave powered On

11:57 PM Connection to the host has been lost

11:57 PM vSphere HA initiated a virtual machine failover action in cluster Cluster2 in datacenter Home

11:57 PM vSphere HA restarted a virtual machine vMotionME on host esx3.home.local in cluster Cluster2

So far so good. Via the IPMI adapter I was able to take a look at the virtual machines running on the isolated host:

[Screenshot: VMproc1]

But wait! esx3, which has just restarted the VM, is also running the VM:

[Screenshot: VMproc2_on2ndHost]

This could cause some issues!

If the VM network is not affected by the isolation the VM would still be accessible via external connections, which sounds good at first. But VMware HA will restart a new copy of the VM very soon, so you would get duplicate IP addresses on your network, not to mention the applications/clients connecting to those VMs could freak out.

I also ran into another problem: when I reconnected the isolated host with the running VM, it started “to fight” with the host that had restarted the VM. The vSphere client showed the VM flapping between the two hosts. I had to manually kill the VM process to end that fight. So make sure to use a proper host isolation response!

 

What if I lose 2 of my 3 hosts?

With central storage (SAN/NAS) this would be no problem, as long as the remaining host provides sufficient computing resources, or at least enough to power on the majority of your VMs.

With VSAN things look a bit different, depending on the number of hosts in a cluster. I moved all VMs to a single host and isolated the two other hosts from the cluster, so that I ended up with a single host running all virtual machines. The remaining host was not so happy. All VMs slowed down rapidly, applications stopped responding, and the host wasn’t able to open a VM console session (MKS missing). It felt a bit like an All Paths Down (APD) scenario. However, the RDP connection was persistent and didn’t disconnect. I would assume this is similar to an APD: it may work for a while, but not indefinitely.

[Screenshot: VSAN_unkownVMs]

When I re-enabled the first switch port to just ONE of the isolated hosts the remaining host instantly recovered and all VMs continued without any problem.

I admit it was not fair to force the three-node cluster into three network partitions. Let me reemphasize that VSAN works as designed! A three-node cluster tolerates a single host outage, and if you need to tolerate more you simply need to add additional hosts!

 

Broken Host

What if your ESXi installation is broken and you have to re-install ESXi? I tried exactly that by re-installing ESXi without any preparation. So the host I re-installed was still a VSAN cluster member when I rebooted it for the setup.

During that time the other two hosts were running fine, with no issues with the VSAN cluster at all.

//Update: Somehow I missed the # right before the devices. So the installer is aware that those devices are already claimed by VSAN:

[Screenshot: ReInstESXiVSAN]

Once the host was back online I had to perform some manual steps:

  1. Re-connect the host
  2. Remove and re-add it to the distributed switch*
  3. Re-create all VMkernel interfaces
  4. Disable VMware HA on the cluster. As soon as I disabled HA, the VSAN cluster went green and the VSAN datastore appeared!
  5. Re-enable VMware HA

[Screenshot: VSANgreen]

* In case you run into the following error when trying to remove the ESXi host from the vDS:

“vDS DSwitch0 port 40 is still on host esx1.home.local connected to ESXi5.0 nic=4002 type=vmVnic”

Solution: Reboot the host; while it is offline you can easily remove it.

Working with VSAN – Part I

In my last post about VSAN I went through the setup of my VSAN lab without spending too much time on details. This time I want to take a closer look at some of them.

Computing resources

Let’s start easy. The first thing I noticed in a lab with limited physical resources is that the overhead caused by VSAN is substantial. The following picture shows how many resources are in use on an ESXi host with VSAN enabled when NO virtual machine is running:

[Screenshot: CPURAM]

Currently I’m running a single disk group with 1 SSD + 1 HDD. So for a real-world VSAN deployment with more disks, make sure to add some extra RAM. To be able to support the maximum number of disks and disk groups, the host requires at least 32GB RAM. VMware also states that VSAN won’t consume more than 10% of the computing resources of a single host.

More details can be found in the latest Design and Sizing Guide

 

Maintenance Mode

When you want to put a host into maintenance mode there are now some options you can choose from:

[Screenshot: MaintMode]

Quoted directly from the information pop-up:

Full data migration: Virtual SAN migrates all data that resides on this host. This option results in the largest amount of data transfer and consumes the most time and resources.

Ensure accessibility: Virtual SAN ensures that all virtual machines on this host will remain accessible if the host is shut down or removed from the cluster. Only partial data migration is needed. This is the default option.

No data migration: Virtual SAN will not migrate any data from this host. Some virtual machines might become inaccessible if the host is shut down or removed from the cluster.

The last two options may violate your configured storage policies. Assuming you are using an N+1 policy, you usually have two copies of all protected VMs. If you shut down a host which is storing one of the copies, you end up with just a single copy of the data of the protected VMs. This violates the N+1 policy and the VM will be listed as not compliant.

If the number of hosts within the cluster is sufficient, VSAN is able to automatically make the VM compliant again by creating a new copy of the VM data. Before VSAN starts to copy any data, by default it will wait 60 minutes, because the host may come back online. In my case, with just three hosts, I’m not able to test it. An N+1 policy requires at least three hosts: two which host the VM data and one for the witness. Accordingly, an N+2 policy requires five hosts: three for the copies of the data and two for witnesses.

A question which I was not able to answer: if I put a host into maintenance mode without full data migration, will VSAN treat it like a host failure? If yes, VSAN would be able to automatically bring the VMs back into compliance.

I would say the following update actually answers this question, because even when two hosts were in maintenance mode and none of the VMs were compliant, the VSAN cluster seemed to be healthy and showed three eligible hosts.

//UPDATE

A colleague just asked if it would be possible to put two of three hosts into maintenance mode. My first thought was “No”, but I decided to test it, and my first thought was wrong!

If you put the hosts into maintenance mode using the “Ensure accessibility” option, VSAN does exactly that. I ended up with two hosts in maintenance mode and some empty VSAN disks.

[Screenshot: EmptyVSANDisks]

As a result, all VMs were no longer compliant.

[Screenshot: StorageProfileCompliance]

But VSAN will also warn you if there are not enough nodes available to fulfill the request to put a host into maintenance mode.

[Screenshot: MaintModeError]

 

Storage DRS, SIOC, FT, DPM & large VMDK (62 TB) are NOT supported!

OK, I mean sDRS wouldn’t make any sense since you just have a single aggregated datastore for your VSAN cluster. SIOC was designed to ensure fairness between virtual machines when it comes to I/O queues, and I would assume VSAN has its own mechanisms to deal with that. DPM? Honestly, I have never seen a customer really using it, and it would cause unnecessary data movement to ensure data accessibility. With FT it’s similar; at least our customers rarely need FT. So the only disappointing point is the missing support for 62 TB VMDKs, which would be cool. Not to mention that some of those features require an Enterprise (Plus) license, whereas I also see VSAN in small environments which may be licensed with just Standard licenses.

 

VMware vMotion & DRS

There is not much to say about that, it simply works as usual.

 

That’s it for Part I. Find out more about VSAN & VMware HA in Part II.