VMware Cloud Foundation 5.1 Upgrade - Tips and Gotchas
VMware Cloud Foundation 5 includes major version shifts for both NSX and vSphere. The latest version, 5.1, was released on November 7, 2023, while VMware Explore in Barcelona was underway.
This post is a short-ish write-up of some of the challenges that may arise during an upgrade from 4.5.x while following the official documentation.
Upgrade Path
This post covers vCF 5.1 with the component versions from the initial release. Since then, several advisories for Aria Suite and vSphere have been released that must be addressed after upgrading.
Validate the health of your environment, read the release notes, the important knowledge base articles, and the known issues section, and complete all the VMware Cloud Foundation Upgrade Preparation steps in the lifecycle documentation.
Mermaid flowchart
Zoom in for this flowchart to make sense :)
flowchart LR
A[SDDC Manager] -.-> A1(Upgrades unavailable until Multi-Site Management bundles are removed)
A[SDDC Manager] --> A2(Metadata)
A[SDDC Manager] -.- A3(API in Developer Center shows pre-upgrade version)
A2[Metadata] --> A21(Compatibility Matrix Update)
A21[Compatibility Matrix Update] -.-> A211(KB 90074)
A2[Metadata] --> A22(vSAN HCL Update)
A[SDDC Manager] --> A4(Clear tasks that never finish)
A[SDDC Manager] --> B(NSX)
B[NSX] -.-> B1(ESXi host cannot enter maintenance mode)
B[NSX] -.-> B2(Install of offline bundle failed)
B[NSX] -.-> B3(Application on NSX node has crashed alarm)
B3[Application on NSX node has crashed alarm] --> B31(KB 92493)
B[NSX] -.-> B4(LDAPS may stop working after upgrading NSX)
B4[LDAPS may stop working after upgrading NSX] --> B41(KB 92869)
B[NSX] -.-> B5(NSX alarms indicating certificates have expired or are expiring)
B5[NSX alarms indicating certificates have expired or are expiring] --> B51(KB 93296, KB 94898)
B[NSX] --> B6[New authentication provider]
B[NSX] --> C(vCenter Server)
C[vCenter Server] -.-> C1(Broken ELM)
C[vCenter Server] --> D(ESXi)
D[ESXi] -.-> D1(vCenter LCM plugin not deployed)
D[ESXi] -.-> D2(vDS dvs port xyz cannot be found)
D[ESXi] --> E(Aria Suite)
E[Aria Suite] --> E1(SSH Hardening for VMware Aria Products)
E1[SSH Hardening for VMware Aria Products] --> E11(KB 95835)
E[Aria Suite] -.-> F1(Aria Lifecycle)
F1[Aria Lifecycle] -.-> F11(Insufficient disk space on the Aria Suite Lifecycle root partition)
F1[Aria Lifecycle] -.-> F12(Disk space usage on the Aria Lifecycle data partition)
E[Aria Suite] --> X((VMware Cloud Foundation 5.1))
SDDC Manager
Upgrades unavailable until Multi-Site Management bundles are removed
After upgrading SDDC Manager I expected to see NSX as the next upgrade, but the bundle list never finished loading.
If you are on an installation that has been upgraded multiple times, e.g. from 4.2 through 4.5.2, you might encounter the following error in the update view, referring to a bundle software type called MULTI_SITE_SERVICE:
{
"type": "java.lang.IllegalArgumentException",
"message": "No enum constant com.vmware.evo.sddc.lcm.model.bundle.BundleSoftwareType.MULTI_SITE_SERVICE"
}
This message did not make much sense to me initially. There were no KBs or blog posts across the Internet that aligned with the error either.
I had to do some digging to understand what “MULTI_SITE_SERVICE” was, including a pg_dump on the lcm database to look at the upgrade_history table.
Earlier vCF deployments contained “multi-site/instance management” bundles, a feature which was removed in version 4.4.
One example of such a bundle is this:
{
"id" : "f0c04887-dbf3-498a-b55a-12a28e668254",
"version" : "1.5.14-vcf4210RELEASE-533",
"description" : "VMware vCloud Foundation Multi-Site Management",
"name" : "MULTI_SITE_SERVICE"
}
I saw this as an opportunity to do some spring cleaning in SDDC Manager.
The bundle management tool in KB 94760, together with PowerVCF, provides an easy way of removing stale bundles.
After removing many older bundles from SDDC Manager, 400 GB in total, NSX appeared as the next available upgrade.
Lesson learned: It is probably good hygiene to delete old bundles before every update.
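To spot leftovers like this before an upgrade, a bundle listing can be filtered by name. The following is a minimal Python sketch, assuming the bundles have already been exported to a JSON list of objects shaped like the example above (id, version, description, name). The function name find_stale_bundles and the second sample entry are hypothetical, and the sketch does not talk to SDDC Manager or delete anything.

```python
import json

# Bundle software types that the newer LCM no longer recognises.
# Only MULTI_SITE_SERVICE comes from the error in this post; extend as needed.
REMOVED_TYPES = {"MULTI_SITE_SERVICE"}


def find_stale_bundles(bundles, removed_types=REMOVED_TYPES):
    """Return the bundles whose 'name' is one of the removed software types."""
    return [b for b in bundles if b.get("name") in removed_types]


if __name__ == "__main__":
    # Hypothetical export: the first entry is copied from this post,
    # the second is a made-up placeholder.
    exported = json.loads("""
    [
      {"id": "f0c04887-dbf3-498a-b55a-12a28e668254",
       "version": "1.5.14-vcf4210RELEASE-533",
       "description": "VMware vCloud Foundation Multi-Site Management",
       "name": "MULTI_SITE_SERVICE"},
      {"id": "00000000-0000-0000-0000-000000000000",
       "version": "0.0.0-example",
       "description": "Placeholder bundle for illustration",
       "name": "EXAMPLE_COMPONENT"}
    ]
    """)
    for bundle in find_stale_bundles(exported):
        print(bundle["id"], bundle["name"], bundle["version"])
```

The resulting IDs are only candidates for removal; the actual cleanup would still go through the tooling mentioned above.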
Compatibility data
VMware Cloud Foundation 4.5.x does not have the Compatibility Matrix Upload API call, so I postponed the upload until SDDC Manager was on version 5.1.
It is also possible to turn off the compatibility matrix check by following KB 90074, but this seems to leave you unable to upgrade NSX any further.
vSAN HCL
I was only able to upload the latest vSAN HCL update after upgrading SDDC Manager to 5.1 using the most recent lcm-bundle-tools.
On my Windows jump-host I got the following error, so I opted to do it from SDDC Manager:
Exception thrown when uploading vSAN HCL data: URI path begins with multiple slashes
API in Developer Center shows pre-upgrade version
After upgrading to SDDC Manager 5.1.0, the Developer Center API will still show the 4.5.2 version.
This seems to be a visual bug. The 5.1.0 API is operational, just not from the web interface.
Clear tasks that never finish
See the related KB 89911 to clean up tasks that never seem to finish after upgrading to SDDC Manager 5.x.
VMware Cloud Foundation SDDC manager tasks shows as running, but the task is stuck at 0% (89911)
NSX
ESXi host cannot enter maintenance mode
Generic error : Virtual machine ‘app01’ on host ‘esxi02.lab.graa’ would violate a virtual machine - host affinity rule. VM cannot migrate.
One virtual machine was affected by an affinity rule that could not be broken, so I simply powered that one off while the upgrade was taking place.
Install of offline bundle failed
Unable to get FS Attrs for /vmfs/volumes/ca3e7430-4611-41ac-8fa2-127349107360
One ESXi host showed a disconnected datastore mounted from SDDC Manager that had not been removed, apparently because it had been selected as a vSphere HA heartbeat datastore.
After removing the datastore I had to reboot the host for the volume to disappear from /vmfs/volumes on the host.
Retry the update after taking the host out of both NSX Maintenance Mode and ESXi Maintenance Mode.
New authentication provider
The new version of VMware Validated Solutions for vCF/NSX now calls for the use of Active Directory instead of Identity Manager for authentication.
Follow the guide below to migrate from Workspace ONE Access to LDAP(s) Integration:
NSX alarms indicating certificates have expired or are expiring
After upgrading, there were new alarms related to certificates. Fixes have been published:
- NSX alarms indicating certificates have expired or are expiring (94898)
- NSX CBM Certificates Expired After Upgrade From 3.2.x To 4.1.x (93296)
Application on NSX node has crashed alarm
If you see alarms related to application crashes, follow the instructions in KB 92493.
LDAPS may stop working after upgrading NSX
This could impact the functioning of Identity Firewall and authentication.
See related KB LDAPS may stop working after upgrading NSX to version 4.1.0 (92869).
vCenter Server
Broken ELM
If you have multiple vCenter Servers in the same SSO domain, Enhanced Linked Mode (ELM) will appear broken until all of them are on the same major version (8.x).
Upper-case DNS records
The upgrade of vCenter Server stopped early at the VCENTER UPGRADE INSTALL PRECHECK stage.
The source appliance FQDN vcsa01.lab.graa must be the same as the source appliance primary network identifier vcsa01.lab.graa
My DNS records were in uppercase, whereas the appliance VM name and the FQDN inside the OS were in lowercase.
I went ahead and changed all records from uppercase to lowercase.
See also the related blog: Changing your vCenter Server’s FQDN
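Before starting the upgrade, a quick comparison can flag records that differ from the appliance FQDN only in letter case. This is a minimal Python sketch, assuming the precheck compares the two names case-sensitively, which is what the behaviour described here suggests. The function name fqdn_case_mismatch and the uppercase sample record are hypothetical; the sketch does not query DNS itself.

```python
def fqdn_case_mismatch(dns_name, appliance_fqdn):
    """True when both names refer to the same host but differ in letter case."""
    # Ignore a trailing root dot, as returned by some DNS tools.
    a = dns_name.rstrip(".")
    b = appliance_fqdn.rstrip(".")
    return a != b and a.lower() == b.lower()


if __name__ == "__main__":
    # Hypothetical record name as stored in DNS vs. the FQDN set in the appliance.
    record = "VCSA01.LAB.GRAA"
    appliance = "vcsa01.lab.graa"
    if fqdn_case_mismatch(record, appliance):
        print(f"{record} differs from {appliance} only by case")
```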
ESXi
vCenter LCM plugin not deployed
The LCM plugin had not been properly deployed in vCenter, possibly due to a race condition where the URL was not available in time:
DOWNLOAD_FAILED: Error downloading plugin package com.vmware.vlcm.client:8.0.2.22617221 from https://vcsa01.lab.graa:9087/vci/downloads/vlcm-ui/plugin.zip. Reason: URL is unreachable. Make sure that the URL is reachable. com.vmware.vise.plugin.download.PluginDownloadException: org.apache.http.client.HttpResponseException: status code: 503, reason phrase: Service Unavailable
After a restart of vCenter Server the upgrade progressed further.
vDS dvs port xyz cannot be found
I got this error as a host was entering maintenance mode and was trying to evacuate the last VM.
Edit the network settings on the VM and select the same portgroup again and then retry.
Aria Suite
I initially upgraded via the 8.12 PSP 9, the Lifecycle Manager 8.14.0-22630472 update repository and finally the 8.14 PSP 6.
Because of CVEs, new updates to all Aria Suite products have been published since the vCF 5.1 BOM was first released.
Insufficient disk space on the Aria Lifecycle root partition
Follow the instructions in KB 95238.
Disk space usage on the Aria Lifecycle data partition
Delete older update repositories, e.g. for Aria Automation, or extend the storage.
root@vrlscm [ ~ ]# find /data -xdev -type f -name '*.iso' -size +100M -exec du -sh {} ';'
13G /data/aria/Prelude_VA-8.14.1.33478-22830382-updaterepo.iso
12G /data/temp/Prelude_VA-8.11.2.30052-21452955-updaterepo.iso
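The same search can be sketched in Python, for example to feed the result into a report. This is an illustrative equivalent of the find command above, not a tool shipped with Aria Suite Lifecycle; the function name find_large_files and its parameters are hypothetical. It reports apparent file size rather than the disk usage that du prints.

```python
import os


def find_large_files(root, suffix=".iso", min_bytes=100 * 1024 * 1024):
    """Yield (size_in_bytes, path) for regular files under root above the threshold."""
    root_dev = os.stat(root).st_dev
    for dirpath, dirnames, filenames in os.walk(root):
        # Mimic find's -xdev: do not descend into other filesystems.
        dirnames[:] = [d for d in dirnames
                       if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev]
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Mimic -type f: regular files only, no symlinks.
            if not name.endswith(suffix) or os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            if size > min_bytes:
                yield size, path


if __name__ == "__main__":
    data_root = "/data"  # the data partition named in this post
    if os.path.isdir(data_root):
        for size, path in sorted(find_large_files(data_root), reverse=True):
            print(f"{size / 1024**3:.1f}G\t{path}")
```

Sorting largest first makes it easier to see which repository to delete first.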
SDDC Manager not able to connect to Aria Suite Lifecycle
Could not connect to the SSH server @ vrlscm.lab.graa for configuration. com.jcraft.jsch.JSchException: reject HostKey: vrlscm.lab.graa
I removed the ssh-rsa key for Aria Lifecycle from /etc/vmware/vcf/commonsvcs/known_hosts and added the ECDSA key:
root@sddc-manager [ ~ ]# ssh-keyscan -t ecdsa vrlscm.lab.graa >> /etc/vmware/vcf/commonsvcs/known_hosts
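The removal step can also be scripted rather than done in an editor. This is a minimal Python sketch, assuming the file uses plain (unhashed) OpenSSH known_hosts entries of the form "host key-type key" and the default SSH port. The function name drop_host_keys is hypothetical, and the sketch only transforms text; writing the result back to the file, after taking a backup, is left out.

```python
def drop_host_keys(known_hosts_text, host, key_type="ssh-rsa"):
    """Return known_hosts content without entries for `host` of the given key type."""
    kept = []
    for line in known_hosts_text.splitlines():
        fields = line.split()
        # Plain entries look like: "<host[,host...]> <key-type> <base64-key>".
        # Comments and marker lines (starting with '#' or '@') are kept untouched.
        is_match = (
            len(fields) >= 3
            and not line.startswith(("#", "@"))
            and host in fields[0].split(",")
            and fields[1] == key_type
        )
        if not is_match:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


if __name__ == "__main__":
    # Shortened, made-up key material for illustration only.
    sample = (
        "vrlscm.lab.graa ssh-rsa AAAAB3example\n"
        "other.lab.graa ssh-rsa AAAAB3example\n"
    )
    print(drop_host_keys(sample, "vrlscm.lab.graa"), end="")
```

Hashed host names (entries starting with |1|) would not be matched by this comparison.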
SSH Hardening for VMware Aria Products
I had to review KB 95835 to handle SSH hardening for the Aria Suite as a whole.
Credits
Photo by VMware. Created with Microsoft Designer and Gimp.