VMware Cloud Foundation 5.1 Upgrade - Tips and Gotchas

VMware Cloud Foundation 5 includes major version shifts for both NSX and vSphere. The latest version, 5.1, was released on November 7th, 2023, while VMware Explore in Barcelona was underway.

This post is a short-ish write-up of some of the challenges that may arise during an upgrade from 4.5.x while following the official documentation.

Upgrade Path

This post covers vCF 5.1 with the component versions that shipped at release. Since then, several advisories for Aria Suite and vSphere have been published that must be addressed after upgrading.

Before you start, validate the health of your environment, read the release notes, the important knowledge base articles and the known issues section, and complete all the VMware Cloud Foundation Upgrade Preparation steps in the lifecycle documentation.

Mermaid flowchart

Zoom in for this flowchart to make sense :)

 flowchart LR
  A[SDDC Manager] -.-> A1(Upgrades unavailable until Multi-Site Management bundles are removed)
  A[SDDC Manager] --> A2(Metadata) 
  A[SDDC Manager] -.- A3(API in Developer Center shows pre-upgrade version)
  A2[Metadata] --> A21(Compatibility Matrix Update)
  A21[Compatibility Matrix Update] -.-> A211(KB 90074)
  A2[Metadata] --> A22(vSAN HCL Update)
  A[SDDC Manager] --> A4(Clear tasks that never finish)
  A[SDDC Manager] --> B(NSX)
  B[NSX] -.-> B1(ESXi host cannot enter maintenance mode)
  B[NSX] -.-> B2(Install of offline bundle failed)
  B[NSX] -.-> B3(Application on NSX node has crashed alarm)
  B3[Application on NSX node has crashed alarm] --> B31(KB 92493)
  B[NSX] -.-> B4(LDAPS may stop working after upgrading NSX)
  B4[LDAPS may stop working after upgrading NSX] --> B41(KB 92869)
  B[NSX] -.-> B5(NSX alarms indicating certificates have expired or are expiring)
  B5[NSX alarms indicating certificates have expired or are expiring] --> B51(KB 93296, KB 94898)
  B[NSX] --> B6[New authentication provider] 
  B[NSX] --> C(vCenter Server)
  C[vCenter Server] -.-> C1(Broken ELM)
  C[vCenter Server] --> D(ESXi)
  D[ESXi] -.-> D1(vCenter LCM plugin not deployed)
  D[ESXi] -.-> D2(vDS dvs port xyz cannot be found)
  D[ESXi] --> E(Aria Suite)
  E[Aria Suite] --> E1(SSH Hardening for VMware Aria Products)
  E1[SSH Hardening for VMware Aria Products] --> E11(KB 95835)
  E[Aria Suite] -.-> F1(Aria Lifecycle)
  F1[Aria Lifecycle] -.-> F11(Insufficient disk space on the Aria Suite Lifecycle root partition)
  F1[Aria Lifecycle] -.-> F12(Disk space usage on the Aria Lifecycle data partition)
  E[Aria Suite] --> X((VMware Cloud Foundation 5.1))

SDDC Manager

Upgrades unavailable until Multi-Site Management bundles are removed

After upgrading SDDC Manager, I expected to see NSX as the next upgrade, but the bundle list never finished loading.

If your installation has been upgraded multiple times, e.g. from 4.2 through 4.5.2, the update view might show the following error, which refers to a bundle software type called MULTI_SITE_SERVICE:

{
    "type": "java.lang.IllegalArgumentException",
    "message": "No enum constant com.vmware.evo.sddc.lcm.model.bundle.BundleSoftwareType.MULTI_SITE_SERVICE"
}

This message did not make much sense to me initially, and I could not find any KB articles or blog posts that matched the error.

I had to do some digging to understand what “MULTI_SITE_SERVICE” was, including a pg_dump on the lcm database to look at the upgrade_history table.
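As a rough illustration of that kind of digging, the sketch below counts how often a bundle software type shows up in a plain-text dump. `count_matches` is a hypothetical helper, and the `pg_dump` invocation in the comment assumes a database named `lcm`, the `postgres` user and a local connection; treat those as assumptions and adjust them to your SDDC Manager.

```shell
# Hypothetical helper: count the lines in a plain-text SQL dump that mention
# a given string, e.g. a bundle software type. Prints 0 when nothing matches.
count_matches() {
    dump_file=$1
    pattern=$2
    # grep -c prints the count but exits non-zero on zero matches,
    # so mask the exit status for callers running with `set -e`.
    grep -c -F -- "$pattern" "$dump_file" || true
}

# Example usage on SDDC Manager (database name, user and host are assumptions):
#   pg_dump -h localhost -U postgres -t upgrade_history lcm > /tmp/upgrade_history.sql
#   count_matches /tmp/upgrade_history.sql MULTI_SITE_SERVICE
```

A non-zero count only tells you the type is referenced somewhere in the table; you still need to read the matching rows to see which bundles are involved.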

Earlier vCF deployments contained “multi-site/instance management” bundles, a feature which was removed in version 4.4.

One example of such a bundle is this:

{
    "id" : "f0c04887-dbf3-498a-b55a-12a28e668254",
    "version" : "1.5.14-vcf4210RELEASE-533",
    "description" : "VMware vCloud Foundation Multi-Site Management",
    "name" : "MULTI_SITE_SERVICE"
}

I saw this as an opportunity to do some spring cleaning in SDDC Manager.

KB 94760 contains a bundle management tool that, along with PowerVCF, provides an easy way of removing stale bundles.

After removing many older bundles from SDDC Manager, 400 GB in total, NSX appeared as an available upgrade.

Lesson learned: It is probably good hygiene to delete old bundles before every update.
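To get a feel for how much space old bundles take up before cleaning, something like the following sketch can help. `largest_entries` is a hypothetical helper that only reports sizes; the bundle repository path in the usage comment is an assumption, and the actual removal should still go through the tooling mentioned above rather than a manual delete.

```shell
# Hypothetical helper: print the N largest entries (size in KiB, then path)
# directly under a directory, biggest first. Defaults to the top 10.
largest_entries() {
    dir=$1
    count=${2:-10}
    # du -sk reports sizes in KiB; sort numerically, largest first.
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Example usage; the path is an assumption, verify it on your SDDC Manager:
#   largest_entries /nfs/vmware/vcf/nfs-mount/bundle 5
```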

Compatibility data

VMware Cloud Foundation 4.5.x does not have the Compatibility Matrix Upload API call, so I postponed the upload until SDDC Manager was on version 5.1.

It is also possible to turn off the compatibility matrix check by following KB 90074, but this seems to leave you unable to upgrade NSX any further.

vSAN HCL

Only after upgrading SDDC Manager to 5.1 was I able to upload the latest vSAN HCL update, using the most recent lcm-bundle-tools.

On my Windows jump-host I got the following error, so I opted to do it from SDDC Manager:

Exception thrown when uploading vSAN HCL data: URI path begins with multiple slashes

API in Developer Center shows pre-upgrade version

After upgrading to SDDC Manager 5.1.0, the Developer Center API will still show the 4.5.2 version.

This seems to be a visual bug. The 5.1.0 API is operational, just not from the web interface.

Clear tasks that never finish

See the related KB 89911, "VMware Cloud Foundation SDDC manager tasks shows as running, but the task is stuck at 0%", to clean up tasks that never seem to finish after upgrading to SDDC Manager 5.x.

NSX

ESXi host cannot enter maintenance mode

Generic error : Virtual machine ‘app01’ on host ‘esxi02.lab.graa’ would violate a virtual machine - host affinity rule. VM cannot migrate.

One virtual machine was affected by an affinity rule that could not be broken, so I simply powered that one off while the upgrade was taking place.

Install of offline bundle failed

Unable to get FS Attrs for /vmfs/volumes/ca3e7430-4611-41ac-8fa2-127349107360

One ESXi host showed a disconnected datastore that had been mounted from SDDC Manager. It had not been removed, apparently because it had been selected as a vSphere HA heartbeat datastore.

After removing the datastore I had to reboot the host for the volume to disappear from /vmfs/volumes on the host.

Retry the update after taking the host out of both NSX Maintenance Mode and ESXi Maintenance Mode.
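Before retrying, it can be worth checking whether any stale entries are still left under the volumes directory. The sketch below uses a hypothetical helper, `list_unreadable_entries`, that prints every directory entry the shell can see but cannot stat. It assumes a POSIX-style shell on the host, and it has only been reasoned through for dangling links; a truly dead mount may behave differently or make the check hang.

```shell
# Hypothetical helper: list entries under a directory that show up in the
# listing but cannot be stat'ed, e.g. dangling symlinks.
list_unreadable_entries() {
    dir=$1
    for entry in "$dir"/*; do
        # -e follows symlinks and is false when the target cannot be reached.
        if [ -e "$entry" ]; then
            continue
        fi
        # Skip the literal pattern left behind when the directory is empty.
        if [ "$entry" = "$dir/*" ]; then
            continue
        fi
        printf '%s\n' "$entry"
    done
}

# Example usage on the ESXi host:
#   list_unreadable_entries /vmfs/volumes
```

An empty result suggests the volume is really gone and the update is worth retrying.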

New authentication provider

The new version of VMware Validated Solutions for vCF/NSX now calls for the use of Active Directory instead of Identity Manager for authentication.

Follow the guide below to migrate from Workspace ONE Access to LDAP(s) Integration:

NSX alarms indicating certificates have expired or are expiring

After upgrading, there were new alarms related to certificates. Fixes have been published in KB 93296 and KB 94898.

Application on NSX node has crashed alarm

If you see alarms related to application crashes, follow the instructions in KB 92493.

LDAPS may stop working after upgrading NSX

This could impact the functioning of Identity Firewall and authentication.

See the related KB 92869, "LDAPS may stop working after upgrading NSX to version 4.1.0".

vCenter Server

Broken ELM

If you have multiple vCenter Servers in the same SSO domain, ELM will appear broken until all of them are on the same major version (8.x).

Upper-case DNS records

The upgrade of vCenter Server stopped early at the VCENTER UPGRADE INSTALL PRECHECK stage.

The source appliance FQDN vcsa01.lab.graa must be the same as the source appliance primary network identifier vcsa01.lab.graa

My DNS records were in uppercase, whereas the appliance VM name and the FQDN inside the OS were in lowercase.

I went ahead and changed all records from uppercase to lowercase.
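A quick way to spot this kind of mismatch before the precheck does is to look for uppercase characters in what DNS returns. The two helpers below are hypothetical, and the address in the usage comment is a placeholder; this is a minimal sketch, not a replacement for the precheck itself.

```shell
# Hypothetical helpers for spotting case mismatches between DNS records and
# the FQDN configured inside the appliance.

# Succeeds (exit status 0) if the name contains at least one uppercase letter.
has_uppercase() {
    case $1 in
        *[[:upper:]]*) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints the name converted to lowercase.
to_lowercase() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]'
}

# Example usage (192.0.2.10 is a placeholder address):
#   ptr=$(dig +short -x 192.0.2.10)
#   has_uppercase "$ptr" && echo "PTR record $ptr contains uppercase characters"
```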

See also the related blog: Changing your vCenter Server’s FQDN

ESXi

vCenter LCM plugin not deployed

The LCM plugin had not been properly deployed in vCenter, possibly due to a race condition where the URL was not available in time:

DOWNLOAD_FAILED: Error downloading plugin package com.vmware.vlcm.client:8.0.2.22617221 from https://vcsa01.lab.graa:9087/vci/downloads/vlcm-ui/plugin.zip. Reason: URL is unreachable. Make sure that the URL is reachable. com.vmware.vise.plugin.download.PluginDownloadException: org.apache.http.client.HttpResponseException: status code: 503, reason phrase: Service Unavailable

After a restart of vCenter Server the upgrade progressed further.
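If the cause really is a timing issue, one way to confirm the plugin URL is back before retrying is to poll it. The sketch below is an assumption-laden example: `http_status` and `wait_for_http_200` are hypothetical helpers, `curl -k` skips certificate verification (only reasonable in a lab), and the attempt count and delay are arbitrary.

```shell
# Print the HTTP status code for a URL ("000" if the request itself fails).
# -k skips certificate verification; an assumption for lab setups only.
http_status() {
    curl -k -s -o /dev/null --max-time 10 -w '%{http_code}' "$1" || true
}

# Poll a URL until it answers 200, for at most a given number of attempts.
# Returns 0 on success and 1 if the URL never became available.
wait_for_http_200() {
    url=$1
    attempts=${2:-30}
    delay=${3:-10}
    i=1
    while [ "$i" -le "$attempts" ]; do
        if [ "$(http_status "$url")" = "200" ]; then
            return 0
        fi
        # Only wait when another attempt will follow.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example usage with the URL from the error message above:
#   wait_for_http_200 "https://vcsa01.lab.graa:9087/vci/downloads/vlcm-ui/plugin.zip" 30 10
```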

vDS dvs port xyz cannot be found

I got this error as a host was entering maintenance mode and was trying to evacuate the last VM.

Edit the network settings on the VM, select the same port group again, and then retry.

Aria Suite

I initially upgraded via the 8.12 PSP 9, the Lifecycle Manager 8.14.0-22630472 update repository and finally the 8.14 PSP 6.

Because of CVEs, new updates to all Aria Suite products have been released since the vCF 5.1 BOM was first published.

Insufficient disk space on the Aria Lifecycle root partition

Follow the instructions in KB 95238.

Disk space usage on the Aria Lifecycle data partition

Delete e.g. older Aria Automation update repositories, or extend storage.

root@vrlscm [ ~ ]# find /data -xdev -type f -name '*.iso' -size +100M -exec du -sh {} ';'
13G     /data/aria/Prelude_VA-8.14.1.33478-22830382-updaterepo.iso
12G     /data/temp/Prelude_VA-8.11.2.30052-21452955-updaterepo.iso

SDDC Manager not able to connect to Aria Suite Lifecycle

Could not connect to the SSH server @ vrlscm.lab.graa for configuration. com.jcraft.jsch.JSchException: reject HostKey: vrlscm.lab.graa

I removed the ssh-rsa key for Aria Lifecycle from /etc/vmware/vcf/commonsvcs/known_hosts and added the ECDSA key:

root@sddc-manager [ ~ ]# ssh-keyscan -t ecdsa vrlscm.lab.graa >> /etc/vmware/vcf/commonsvcs/known_hosts
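The removal of the old ssh-rsa entry can be scripted as well. The following is a minimal sketch with a hypothetical helper, `remove_host_key`; it assumes plain (unhashed) known_hosts entries where the first field is exactly the host name, and it is worth taking a backup of the file first.

```shell
# Hypothetical helper: drop known_hosts lines for one host and key type.
# Assumes plain (unhashed) entries whose first field is exactly the host name.
remove_host_key() {
    file=$1
    host=$2
    key_type=$3
    tmp=$(mktemp) || return 1
    # Keep every line except those matching both the host and the key type,
    # then rewrite the original file in place to preserve owner and mode.
    awk -v h="$host" -v t="$key_type" '!($1 == h && $2 == t)' "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Example usage on SDDC Manager:
#   cp -p /etc/vmware/vcf/commonsvcs/known_hosts /root/known_hosts.bak
#   remove_host_key /etc/vmware/vcf/commonsvcs/known_hosts vrlscm.lab.graa ssh-rsa
```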

SSH Hardening for VMware Aria Products

I had to review KB 95835 to handle SSH hardening for the Aria Suite as a whole.

Credits

Photo by VMware. Created with Microsoft Designer and Gimp.

This post is licensed under CC BY-NC-SA 4.0 by the author.
Written by human, not by AI